Graph Phun in Solr

Recently Solr has acquired graph search capability similar to what you’d find in TitanDB or Neo4j. Essentially it acts as a collector that traverses link_from to link_to, while optionally applying a filter query each step of the way. You can also decide to return all connected hops or just the leaves.

SOLR-7543: Create GraphQuery that allows graph traversal as a query operator
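To make the moving parts concrete, here is a minimal sketch of how such a query string can be put together. The local parameter names (from, to, traversalFilter, maxDepth, returnRoot, returnOnlyLeaf) are the graph parser's options as I understand them; the build_graph_query helper and the link_from/link_to field names are purely illustrative.

```python
def build_graph_query(start, from_field, to_field, traversal_filter=None,
                      max_depth=None, return_root=True, return_only_leaf=False):
    """Assemble a {!graph ...} query string (hypothetical helper, not part of Solr)."""
    parts = ["from=%s" % from_field, "to=%s" % to_field]
    if traversal_filter is not None:
        # Quote the filter because it usually contains ':' or spaces.
        parts.append('traversalFilter="%s"' % traversal_filter)
    if max_depth is not None:
        parts.append("maxDepth=%d" % max_depth)
    if not return_root:
        # Leave the starting documents out of the result set.
        parts.append("returnRoot=false")
    if return_only_leaf:
        # Keep only the nodes with no further outgoing edges.
        parts.append("returnOnlyLeaf=true")
    return "{!graph %s}%s" % (" ".join(parts), start)


print(build_graph_query("id:A", "link_from", "link_to",
                        max_depth=2, return_only_leaf=True))
```

The string it prints is what goes into the q parameter of a normal select request.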

Now, general-purpose graph databases have a radically different storage model than Lucene’s inverted index, so performing complex or deep graph queries is not an option. The trade-off is that the traversal depth is restricted to 4. At that depth you’re still able to model something like Role-Based Access Control or query augmentation with hypernyms and hyponyms.

Hyponyms and Hypernyms

Role-based Access Control

Since this is so new you’ll need to pull down the latest version of Solr and build it. If you prefer git, you can also pull down the source from the GitHub mirror.

To get a feel for how this works I decided to use WordNet [1] to index some words along with their hypernyms with pysolr and nltk. The indexing code itself isn’t that interesting so I won’t go into its details here, but here’s the Gist.
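For a rough idea of what that kind of indexing involves, here is a minimal sketch. It is not the code from the Gist. The field names (id, hypernyms, hyponyms) match the responses shown later in this post, while the Solr URL, the gettingstarted core name and the batch size are assumptions.

```python
def synset_to_doc(synset):
    """Turn a WordNet synset into a flat Solr document."""
    return {
        "id": synset.name(),
        "hypernyms": [h.name() for h in synset.hypernyms()],
        "hyponyms": [h.name() for h in synset.hyponyms()],
    }


def index_wordnet(solr_url="http://localhost:8983/solr/gettingstarted",
                  batch_size=1000):
    """Index every noun synset in batches (assumed URL and batch size)."""
    # Imported here so synset_to_doc stays usable without these packages.
    import pysolr
    from nltk.corpus import wordnet as wn  # needs nltk.download("wordnet") once

    solr = pysolr.Solr(solr_url, timeout=30)
    batch = []
    for synset in wn.all_synsets("n"):
        batch.append(synset_to_doc(synset))
        if len(batch) >= batch_size:
            solr.add(batch)
            batch = []
    if batch:
        solr.add(batch)
    solr.commit()
```

Storing both directions (hypernyms and hyponyms) on every document costs some index space, but it lets a query walk up or down the hierarchy from the same set of documents.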

The other thing I wanted to experiment with is Solr’s new schemaless capability. So once I built Solr I ran it with bin/solr -e schemaless to get a blank collection.

After the indexing code finished I opened up the Admin Console and checked out the schema. Sure enough, there were the fields, all indexed as string types. I also noticed that each was copied to the general _text_ field.

Okay, now let’s do some graph stuph! (last joke like that, I swear).

What are the hypernyms of “dog”?

Well, that’s easy because we indexed them directly as fields:
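A plain field lookup is all it takes. The sketch below only builds the request URL. The host, port and gettingstarted core name are assumptions based on a default schemaless install, and the q and fl values mirror the params echoed in a later response in this post.

```python
from urllib.parse import urlencode

# Assumed endpoint for a default `bin/solr -e schemaless` install.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"

# Look up one document by id and return only its hypernyms field.
params = {"q": "id:dog.n.01", "fl": "hypernyms", "wt": "json", "indent": "true"}
url = SOLR_SELECT + "?" + urlencode(params)
print(url)
```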


{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "hypernyms":["canine.n.02",
          "domestic_animal.n.01"]}]
  }}

By the way, the reason it’s “dog.n.01” and not just “dog” is that WordNet is very specific about terms, and this happens to be the first noun sense of “dog” (as opposed to a pejorative term for how someone looks, or the verb for following someone closely).

What are the hypernyms of the hypernyms of “dog”?

So what we want to do here is start at the id field, look at the matching hypernyms field, then for each value there try to find a doc with that id:
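A request along these lines should do it. Treat this as a sketch under assumptions: as I understand the parser, to names the field read on the documents already matched and from names the field those values are matched against. The maxDepth=2 and returnRoot=false settings are my guess at what yields two hops without the starting document, and the endpoint is the assumed default core.

```python
from urllib.parse import urlencode

# Assumed endpoint for a default `bin/solr -e schemaless` install.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"

# Start at dog.n.01, read its hypernyms, then find docs whose id matches each
# value; repeat for one more hop. The root document itself is left out.
graph_q = "{!graph from=id to=hypernyms maxDepth=2 returnRoot=false}id:dog.n.01"

params = {"q": graph_q, "fl": "id", "wt": "json", "indent": "true"}
url = SOLR_SELECT + "?" + urlencode(params)
print(url)
```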


(From now on I’ll omit the other parameters and just focus on the graph query.)

{
  "responseHeader":{
    "status":0,
    "QTime":4},
  "response":{"numFound":4,"start":0,"docs":[
      {
        "id":"animal.n.01"},
      {
        "id":"domestic_animal.n.01"},
      {
        "id":"carnivore.n.01"},
      {
        "id":"canine.n.02"}]
  }}

Notice we’re returning the intermediate nodes “canine.n.02” and “domestic_animal.n.01”. In the last query these values were in the hypernyms field; now they’re in the id field.

What are all the hyponyms of “canine”?


{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"id:canine.n.02",
      "indent":"true",
      "fl":"hyponyms",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "hyponyms":["bitch.n.04",
          "dog.n.01",
          "fox.n.01",
          "hyena.n.01",
          "jackal.n.01",
          "wild_dog.n.01",
          "wolf.n.01"]}]
  }}

How far up the chain of hypernyms of “canine” can I go?


{
  "responseHeader":{
    "status":0,
    "QTime":6},
  "response":{"numFound":12,"start":0,"docs":[
      {
        "id":"chordate.n.01"},
      {
        "id":"vertebrate.n.01"},
      {
        "id":"mammal.n.01"},
      {
        "id":"placental.n.01"},
      {
        "id":"entity.n.01"},
      {
        "id":"physical_entity.n.01"},
      {
        "id":"object.n.01"},
      {
        "id":"whole.n.02"},
      {
        "id":"living_thing.n.01"},
      {
        "id":"organism.n.01"}]
  }}

Notice the response lists only 10 documents even though numFound is 12. Solr returns 10 rows by default unless you set the rows parameter.
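To collect every match you can either raise rows or page through the results with start. The sketch below takes the paging route. The fetch_all_ids helper is hypothetical, and search stands in for whatever HTTP client you use: any callable that takes (q, start, rows) and returns the parsed JSON response.

```python
def fetch_all_ids(search, q, page_size=10):
    """Page through a result set with Solr's start/rows parameters."""
    ids, start = [], 0
    while True:
        response = search(q, start, page_size)["response"]
        docs = response["docs"]
        ids.extend(doc["id"] for doc in docs)
        start += len(docs)
        # Stop on an empty page or once every reported match has been seen.
        if not docs or start >= response["numFound"]:
            return ids
```

With a page size of 10 and 12 matches, this makes two requests: one for the first 10 documents and one for the remaining 2.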

What are all the “domestic_animal”s in the index?

In this query I added rows=1000 just to see what would happen. I’ll omit the bulk of the response…
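Walking down the hierarchy means swapping the two fields. The following is a sketch under assumptions rather than a definitive query: the returnRoot=false setting and the endpoint are my assumptions, and from=id to=hyponyms would be an equivalent way to express the same walk, since both directions were indexed.

```python
from urllib.parse import urlencode

# Assumed endpoint for a default `bin/solr -e schemaless` install.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"

# Read each matched doc's id, then pull in every doc that lists that id
# among its hypernyms, and keep going until nothing new turns up.
params = {
    "q": "{!graph from=hypernyms to=id returnRoot=false}id:domestic_animal.n.01",
    "fl": "id",
    "rows": 1000,
    "wt": "json",
}
url = SOLR_SELECT + "?" + urlencode(params)
print(url)
```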


{
  "responseHeader":{
    "status":0,
    "QTime":14},
  "response":{"numFound":213,"start":0,"docs":[
      {
        "id":"feeder.n.01"},
      {
        "id":"stocker.n.01"},
      {
        "id":"head.n.02"},
      {
        "id":"puppy.n.01"},
      {
        "id":"dog.n.01"},
      {
        "id":"pooch.n.01"},
      {
        "id":"cur.n.01"},
      {
        "id":"feist.n.01"},
      {
        "id":"pariah_dog.n.01"},
     ...
      {
        "id":"burmese_cat.n.01"},
      {
        "id":"egyptian_cat.n.01"},
      {
        "id":"maltese.n.03"},
      {
        "id":"abyssinian.n.01"},
      {
        "id":"manx.n.02"}]

And that’s just the beginning

The hierarchical nature of these queries means we can change one of the intermediate nodes and have radically different results. Take, for instance, the RBAC use case I mentioned at the beginning. By simply adding an operation to a permission, immediately all of the roles that have that permission and all of the subjects that have those roles gain access to that operation. Likewise, catalogs can be reorganized by shifting one subcategory to another without reindexing all of the products within them.
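The RBAC effect is easy to see with a toy in-memory model. This sketch does not talk to Solr at all. It mimics the traversal with a plain breadth-first search, and the document ids, the grants field and the reachable helper are all made up for illustration.

```python
from collections import deque


def reachable(docs, start_id, edge_field):
    """Breadth-first walk: read edge_field on each visited doc and follow
    every value to the doc with that id. The start node is not returned."""
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        doc = docs.get(queue.popleft(), {})
        for target in doc.get(edge_field, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start_id)
    return seen


docs = {
    "alice": {"grants": ["editor"]},             # subject -> roles
    "editor": {"grants": ["can_publish"]},       # role -> permissions
    "can_publish": {"grants": ["post.update"]},  # permission -> operations
}

before = reachable(docs, "alice", "grants")
# Touch a single intermediate node: add one operation to the permission.
docs["can_publish"]["grants"].append("post.delete")
after = reachable(docs, "alice", "grants")

print(sorted(after - before))  # ['post.delete']
```

Only the permission document changed, yet the subject picks up the new operation on the next traversal. The subject and role documents were never touched, which is the same property that lets a catalog be reorganized without reindexing its products.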

Solr 6.0 is going to be a fun release!

[1]: Princeton University “About WordNet.” WordNet. Princeton University. 2010.