Custom Yokozuna Extractors

In this post we create a custom Yokozuna extractor for Riak Search (version 2.x). The code and a shorter README can be found at https://github.com/drewkerrigan/riak_sandbox/tree/master/search. More information about Yokozuna extractors is available at https://github.com/basho/yokozuna/blob/develop/docs/CONCEPTS.md#extractors.

The Data

To keep this guide simple, we are going to create an extractor that indexes the interesting pieces of an HTTP header packet.

Here is an example representation of an HTTP header:

GET http://www.google.com HTTP/1.1

Yokozuna Text Extractor

Custom yokozuna extractors have a very simple interface that must be implemented (in Erlang). Here is what the pure text extractor looks like:

-module(yz_text_extractor).
-include("yokozuna.hrl").
-compile(export_all).

extract(Value) ->
    extract(Value, []).

extract(Value, Opts) ->
    FieldName = field_name(Opts),
    [{FieldName, Value}].

-spec field_name(proplist()) -> any().
field_name(Opts) ->
    proplists:get_value(field_name, Opts, text).

This extractor takes the contents of Value and returns a proplist containing a single field name and the value associated with that name. By default, the field name is text. If the following Erlang snippet were run in a Riak console session:

yz_text_extractor:extract("hello").

The output would look something like this:

[{text,"hello"}]

That proplist is handed off to Solr, and the value "hello" is indexed under the field name text.
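To see how the Opts argument changes the field name, here is a minimal standalone sketch that mirrors the logic of the text extractor. The module name text_extractor_demo is hypothetical and chosen only so the code can be compiled and tried outside of a Riak node; it is not part of Yokozuna.

```erlang
%% Hypothetical standalone module mirroring the yz_text_extractor logic,
%% so it can be compiled with erlc and tried without a Riak node.
-module(text_extractor_demo).
-export([extract/1, extract/2]).

%% With no options, fall through to the two-argument version.
extract(Value) ->
    extract(Value, []).

%% Use the field_name option when present, otherwise default to 'text'.
extract(Value, Opts) ->
    FieldName = proplists:get_value(field_name, Opts, text),
    [{FieldName, Value}].
```

Calling text_extractor_demo:extract(<<"hello">>, [{field_name, body}]) returns [{body,<<"hello">>}], while the one-argument call keeps the default field name text.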

Custom Binary Extractor

Back to our example of parsing a binary HTTP header packet. Luckily, Erlang comes with a standard packet decoder that handles HTTP packets:

erlang:decode_packet(http,<<"GET http://www.google.com HTTP/1.1\n">>,[]).

That snippet should return something like this:

{ok,{http_request,'GET',
                  {absoluteURI,http,"www.google.com",undefined,"/"},
                  {1,1}},
    <<>>}

The pieces most relevant to an application that needs to search these packets are probably the Method (GET), the Host (www.google.com), and the Uri (/).
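The decoder does not always return an absoluteURI tuple, so it helps to look at the other shapes it can produce before writing a pattern match against it. The following is a small sketch, not part of the original code; the module name decode_packet_demo and the function request_line/1 are hypothetical.

```erlang
%% Hypothetical helper for exploring what erlang:decode_packet/3 returns
%% for different HTTP request lines.
-module(decode_packet_demo).
-export([request_line/1]).

%% Decode one HTTP request line and report the method and the URI tuple.
request_line(Bin) when is_binary(Bin) ->
    case erlang:decode_packet(http, Bin, []) of
        {ok, {http_request, Method, Uri, _Version}, _Rest} ->
            {Method, Uri};
        {ok, Other, _Rest} ->
            %% A response line or an http_error tuple, for example.
            {unexpected, Other};
        {more, _Length} ->
            %% The line is not terminated yet, so more data is needed.
            incomplete;
        {error, Reason} ->
            {error, Reason}
    end.
```

With the example packet this returns {'GET',{absoluteURI,http,"www.google.com",undefined,"/"}}. A request line with a relative path, such as <<"GET /index.html HTTP/1.1\r\n">>, returns {'GET',{abs_path,"/index.html"}}, and a line without a trailing newline returns incomplete.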

Using the text extractor as an example, our custom extractor should look similar to this if we want to index those three fields:

yz_httpheader_extractor.erl


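-module(yz_httpheader_extractor).
-compile(export_all).

%% Entry point without options.
extract(Value) ->
    extract(Value, []).

%% Opts is not needed here, hence the underscore.
%% Note: this match only succeeds for an absolute http:// URI with no
%% explicit port; any other input raises a badmatch error.
extract(Value, _Opts) ->
    {ok,{http_request,Method,
            {absoluteURI,http,Host,undefined,Uri},
            _Version},
        _Rest} = erlang:decode_packet(http,Value,[]),

    %% Host and Uri are returned as strings, so convert them to binaries.
    [{method, Method}, {host, list_to_binary(Host)}, {uri, list_to_binary(Uri)}].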
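The extractor above crashes on anything other than an absolute http URI without a port. A more tolerant variant might look like the sketch below. The module name yz_httpheader_safe_extractor and the helper to_bin/1 are hypothetical, values are normalized to binaries as a design choice, and returning {error, Reason} for unsupported input is an assumption about what Yokozuna tolerates that should be checked against your version.

```erlang
%% Hypothetical, more defensive variant of the extractor in this post.
-module(yz_httpheader_safe_extractor).
-export([extract/1, extract/2]).

extract(Value) ->
    extract(Value, []).

extract(Value, _Opts) ->
    case erlang:decode_packet(http, Value, []) of
        %% Absolute URI: accept any scheme and any (or no) port.
        {ok, {http_request, Method,
              {absoluteURI, _Scheme, Host, _Port, Path}, _Version}, _Rest} ->
            [{method, to_bin(Method)}, {host, to_bin(Host)}, {uri, to_bin(Path)}];
        %% Relative path: there is no host in the request line to index.
        {ok, {http_request, Method, {abs_path, Path}, _Version}, _Rest} ->
            [{method, to_bin(Method)}, {uri, to_bin(Path)}];
        %% Assumption: the caller accepts an error tuple instead of fields.
        Other ->
            {error, {unsupported_http_packet, Other}}
    end.

%% Normalize atoms, strings and binaries to binaries.
to_bin(A) when is_atom(A) -> atom_to_binary(A, utf8);
to_bin(L) when is_list(L) -> list_to_binary(L);
to_bin(B) when is_binary(B) -> B.
```

Note that this variant returns the method as a binary (<<"GET">>) rather than the atom 'GET' used by the original extractor.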

Register the Custom Extractor

Writing the extractor was simple enough, but Riak Search can only use it after a few more steps:

Compile the Extractor

First, compile the extractor into a beam file and distribute it to a path that we can remember on each Riak node:

erlc yz_httpheader_extractor.erl

Move the resulting beam file to a directory like /opt/beams:

mv yz_httpheader_extractor.beam /opt/beams/

Configure Riak

Next, tell Riak where to find the new beam file. This cannot currently be done using riak.conf, but as a workaround you can create a file called advanced.config in the same directory as your riak.conf:

/etc/riak/advanced.config

[{vm_args, [{"-pa /opt/beams",""}]}].

This vm_args entry tells Riak to add /opt/beams to the Erlang code path when it starts up.
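Once the node has been restarted with this configuration, the code path can be checked from the Riak console. This is only a sketch of one way to verify the setup, using the standard code module; the expected results assume the directory and module name used in this post.

```erlang
%% Run these in the Riak console after restarting the node.

%% Is /opt/beams on the code path? Expected to be true if the -pa flag
%% took effect.
lists:member("/opt/beams", code:get_path()).

%% Where would the module be loaded from? Returns the path to the beam
%% file, or the atom non_existing if the module cannot be found.
code:which(yz_httpheader_extractor).
```

If the second call returns non_existing, check that the beam file was copied to the directory named in advanced.config on that node.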

Register the Extractor in Riak

Start Riak and attach to its console:

riak start
riak attach

This should attach a console to the running Riak node, allowing us to call the register function in yz_extractor:

(riak@127.0.0.1)1> yz_extractor:register("application/httpheader", yz_httpheader_extractor).

The register call should return the updated list of MIME type to extractor mappings. It should look something like this:

[{default,yz_noop_extractor},
 {"application/httpheader",yz_httpheader_extractor},
 {"application/json",yz_json_extractor},
 {"application/riak_counter",yz_dt_extractor},
 {"application/riak_map",yz_dt_extractor},
 {"application/riak_set",yz_dt_extractor},
 {"application/xml",yz_xml_extractor},
 {"text/plain",yz_text_extractor},
 {"text/xml",yz_xml_extractor}]

Now, any new documents submitted to yokozuna with the content type application/httpheader should be run through the new extractor.

The new extractor can be verified using the Yokozuna extract endpoint. First, create a file called testdata.bin:

testdata.bin

GET http://www.google.com HTTP/1.1

(Note the trailing newline at the end of the line; without it the decoder cannot recognize a complete request line.)

Now run a PUT to the extract endpoint:

curl -XPUT -H 'content-type: application/httpheader' 'http://localhost:8098/search/extract' --data-binary "@testdata.bin"

That curl call should return this JSON:

{"method":"GET","host":"www.google.com","uri":"/"}

The new extractor can also be verified in the Riak console:

(riak@127.0.0.1)1> yz_extractor:run(<<"GET http://www.google.com HTTP/1.1\n">>, yz_httpheader_extractor).

This should return:

[{method,'GET'},{host,<<"www.google.com">>},{uri,<<"/">>}]

Index and Search for the Data

Create Schema

Create a my_schema.xml based on the default Yokozuna Solr schema (which can be found here: https://raw.githubusercontent.com/basho/yokozuna/develop/priv/default_schema.xml) and add our own field definitions:

...
<field name="method" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="host" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="uri" type="string" indexed="true" stored="true" multiValued="false"/>
...

Store the schema

curl -XPUT "http://localhost:8098/search/schema/my_schema" \
  -H 'content-type:application/xml' \
  --data-binary @my_schema.xml

Create a search index using your schema, then create and activate a bucket type associated with that index

curl -XPUT "http://localhost:8098/search/index/my_index" \
     -H'content-type:application/json' \
     -d'{"schema":"my_schema"}'

riak-admin bucket-type create my_type '{"props":{"search_index":"my_index"}}'
riak-admin bucket-type activate my_type

Store Some Data

Use the testdata.bin file we created earlier to write data to Riak:

curl -XPUT \
  -H "Content-Type: application/httpheader" \
  --data-binary "@testdata.bin" \
  http://localhost:8098/types/my_type/buckets/headers/keys/google

Query the Data

curl 'http://localhost:8098/search/query/my_index?wt=json&q=method:GET'

And if everything is successful, we should see our record returned in the results!

{
    "response": {
        "docs": [
            {
                "_yz_id": "1*my_type*headers*google*15",
                "_yz_rb": "headers",
                "_yz_rk": "google",
                "_yz_rt": "my_type",
                "host": "www.google.com",
                "method": "GET",
                "uri": "/"
            }
        ],
        "maxScore": 0.71231794,
        "numFound": 1,
        "start": 0
    },
    "responseHeader": {
        "QTime": 8,
        "params": {
            "127.0.0.1:8093": "_yz_pn:64 OR (_yz_pn:61 AND (_yz_fpn:61)) OR _yz_pn:60 OR _yz_pn:57 OR _yz_pn:54 OR _yz_pn:51 OR _yz_pn:48 OR _yz_pn:45 OR _yz_pn:42 OR _yz_pn:39 OR _yz_pn:36 OR _yz_pn:33 OR _yz_pn:30 OR _yz_pn:27 OR _yz_pn:24 OR _yz_pn:21 OR _yz_pn:18 OR _yz_pn:15 OR _yz_pn:12 OR _yz_pn:9 OR _yz_pn:6 OR _yz_pn:3",
            "q": "method:GET",
            "shards": "127.0.0.1:8093/internal_solr/my_index",
            "wt": "json"
        },
        "status": 0
    }
}