Indexing and searching full text data with Druid

Sample data file: gis.csv

By default, Druid can load and search text data in a dimension. However, when the text is long, search performance suffers, and Druid cannot express the complex text queries that a search engine supports. For example, how would you search for rows that have eight or fewer words between washington and ave, as Lucene's proximity search allows? The Metatron Druid distribution provides a Lucene extension that makes Lucene query syntax available in Druid.

Let’s take a look at how to use Lucene’s text search feature in Druid.

Before you begin, install the Metatron Druid distribution using the link below.

First, run Druid in single mode and create the following configuration file for indexing. Lines that need attention are annotated with // comments; standard JSON has no comment syntax, so remove them before submitting the file. The spec also includes a Lucene GIS index, but this post covers only full text search; the GIS example will be covered in a separate post, as will the other settings for Apache Druid.

index-lucene.json

{
  "context": {
    "druid.indexer.runner.javaOpts": "-server -Xmx4g -Xms4g -XX:MaxPermSize=1g -Xdebug -Xrunjdwp:transport=dt_socket,address=0.0.0.0:11200,server=y,suspend=n"
  },
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "property_inspect",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "tsv",
          "delimiter": ",",
          "dequote": true,
          "columns": [
            "property_id",
            "property_name",
            "address",
            "city",
            "cbsa_name",
            "cbsa_code",
            "county_name",
            "county_code",
            "state_name",
            "state_code",
            "zip",
            "latitude",
            "longitude",
            "inspection_score",
            "inspection_date"
          ],
          "timestampSpec": {
            "column": "inspection_date",
            "format": "MM/dd/yyyy"
          },
          "dimensionsSpec": {
            "dimensions": [
              "property_id",
              "city",
              "cbsa_code",
              "county_code",
              "state_code",
              "zip"
            ]
          }
        }
      },
      "evaluations": [
        {
          "outputName": "gis",
          "expressions": [
            "struct(latitude,longitude,address)"
          ]
        }
      ],
      "validations": [
        {
          "exclusions": [
            "latitude == 0 && longitude == 0"
          ]
        }
      ],
      "metricsSpec": [
        {
          "type": "relay",
          "name": "inspection_score",
          "typeName": "float"
        },
        {
          "type": "relay",
          "name": "cbsa_name",
          "typeName": "string"
        },
        {
          "type": "relay",
          "name": "county_name",
          "typeName": "string"
        },
        {
          "type": "relay", //Do not roll up
          "name": "state_name",
          "typeName": "string"
        },
        {
          "type": "relay",
          "name": "gis",
          "typeName": "struct(lat:double,lon:double,addr:string)"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "YEAR",
        "queryGranularity": "NONE",
        "intervals": [
          "2001-01-01/2009-12-31"
        ],
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/mount/data/", //The directory location where the gis.csv file is located
        "filter": "gis.csv"
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 5000000,
      "buildV9Directly": true,
      "maxRowsInMemory": 3000000,
      "maxOccupationInMemory": 1024000000,
      "maxShardLength": 256000000,
      "indexSpec": {
        "bitmap": {
          "type": "roaring"
        },
        "secondaryIndexing": {
          "gis": {
            "type": "lucene", 
            "strategies": [
              {
                "type": "latlon",
                "fieldName": "coord",
                "latitude": "lat",
                "longitude": "lon"
              },
              {
                "type": "text",  
                "fieldName": "addr" //full text field name to index
              }
            ]
          }
        }
      }
    }
  }
}

Submit the indexing task with curl:

curl -X 'POST' -H 'Content-Type:application/json' -d @index-lucene.json http://localhost:8090/druid/indexer/v1/task
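
If you keep the // annotations in index-lucene.json, the file is not strict JSON and may be rejected as written. The following is a minimal Python sketch, assuming the same endpoint as the curl command above, that strips those annotations and posts the cleaned spec. The helper names strip_line_comments and submit_task are hypothetical and are not part of Druid.

```python
import json
import urllib.request


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside JSON string literals."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                # Previous character was a backslash inside a string.
                escaped = False
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif ch == "/" and not in_string and line[i:i + 2] == "//":
                cut = i
                break
        cleaned.append(line[:cut].rstrip())
    return "\n".join(cleaned)


def submit_task(
    spec_path: str,
    url: str = "http://localhost:8090/druid/indexer/v1/task",
) -> str:
    """POST the cleaned spec to the indexing endpoint and return the response body."""
    with open(spec_path, encoding="utf-8") as f:
        spec = json.loads(strip_line_comments(f.read()))
    request = urllib.request.Request(
        url,
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling submit_task("index-lucene.json") would then send the task. The comment stripper tracks whether it is inside a quoted string, so a value that itself contains // is left alone.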

When indexing is complete, create the query as follows. The query below returns rows whose address contains washington or ave.

select-fulltext.json

{
  "context": {
    "select.parallel": false,
    "allColumnsForEmpty": false,
    "useCache": false,
    "populateCache": false,
    "postProcessing": {
      "type": "tabular"
    }
  },
  "queryType": "select",
  "dataSource": "property_inspect",
  "virtualColumns": [
    {
      "expression": "haversin_meter(33.917877, -80.345172, gis.lat, gis.lon)",
      "outputName": "distance"
    },
    {
      "expression": "abs(distance - x)",
      "outputName": "delta"
    }
  ],
  "granularity": "ALL",
  "intervals": [
    "2001-01-01/2020-01-01"
  ],
  "metrics": [
    "gis",
    "distance",
    "x",
    "delta",
    "inspection_score",
    "h3",
    "geohash"
  ],
  "pagingSpec": {
    "pagingIdentifiers": {},
    "threshold": 1000000
  },
  "filter": {
    "type": "lucene.query",
    "field": "gis.addr",
    "expression": "washington ave"
  }
}
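
The post does not show how the query is sent. As a rough sketch, the snippet below posts a query file to a broker using only the Python standard library. The broker address (Druid's default port 8082 and the /druid/v2/ path) is an assumption about your setup, and build_request and run_query_file are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the broker serves native queries on Druid's default port 8082.
BROKER_URL = "http://localhost:8082/druid/v2/"


def build_request(query: dict, url: str = BROKER_URL) -> urllib.request.Request:
    """Wrap a native query in a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query_file(path: str, url: str = BROKER_URL):
    """Load a query such as select-fulltext.json, send it, and return the parsed response."""
    with open(path, encoding="utf-8") as f:
        query = json.load(f)
    with urllib.request.urlopen(build_request(query, url)) as response:
        return json.load(response)
```

With a running cluster, run_query_file("select-fulltext.json") would return the decoded JSON result. Adjust BROKER_URL if your broker listens elsewhere.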

You can use any Lucene query syntax in the filter expression. Try substituting the expression with the following variations.

Search for “washington” and “st” within 4 words of each other.

"filter": { "type": "lucene.query", "field": "gis.addr", "expression": "\"washington st\"~4"}

Search for both “washington” and “ave” in the gis.addr field.

"filter": { "type": "lucene.query", "field": "gis.addr", "expression": "washington AND ave"}

Search for the word “washington” but not “ave” in the gis.addr field.

"filter": { "type": "lucene.query", "field": "gis.addr", "expression": "washington AND -ave"}

Search for words that start with “wash”.

"filter": { "type": "lucene.query", "field": "gis.addr", "expression": "wash*"}
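
If you generate these filters from code, a few small helpers keep the quoting straight. The sketch below only builds the same expression strings shown above; the function names (lucene_filter, proximity, all_of, excluding, prefix) are hypothetical and not part of any Druid or Lucene API.

```python
import json


def lucene_filter(expression: str, field: str = "gis.addr") -> dict:
    """Build a lucene.query filter object for the given field."""
    return {"type": "lucene.query", "field": field, "expression": expression}


def proximity(phrase: str, slop: int) -> str:
    """Quote the phrase and append ~slop, e.g. "washington st"~4."""
    return f'"{phrase}"~{slop}'


def all_of(*terms: str) -> str:
    """Require every term: washington AND ave."""
    return " AND ".join(terms)


def excluding(required: str, banned: str) -> str:
    """Require one term and exclude another: washington AND -ave."""
    return f"{required} AND -{banned}"


def prefix(stem: str) -> str:
    """Match words that start with the stem: wash*."""
    return f"{stem}*"


if __name__ == "__main__":
    # json.dumps takes care of escaping the inner quotes of the phrase query.
    print(json.dumps(lucene_filter(proximity("washington st", 4))))
```

Letting json.dumps serialize the filter avoids hand-escaping the inner quotes that the proximity expression needs.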
