A noob’s guide to indexing data with Solr’s classic schema

Intro

I joined OSC a month ago as their first data scientist, so I’ve been drinking from a firehose trying to get up to speed on Solr. After breezing through the first two Solr tutorials, I felt pretty good about tackling an internal set of Solr challenge tasks. However, the transition from the friendliness of start -e to standing Solr up myself was a little rough. This post is my newbie guide to setting up the classic schema as of Solr v8.3.0.

Why classic?

Managed schema is the default in Solr v8.3.0 and is used in all of the tutorials. It makes setup easier because field definitions are handled through the Solr Admin UI, which is simpler and safer than editing complicated .xml files by hand. But the classic schema is still worth knowing: it is explicit and forces you to interact with the nuts and bolts, which helps if you are learning how the engine works.

Setting up classic

To transition back from the managed schema to the classic schema of yesteryear, the Solr docs say you need to do two things:

  1. Rename managed-schema to schema.xml
  2. Add <schemaFactory class="ClassicIndexSchemaFactory"/> to solrconfig.xml
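
For anyone who would rather script those two steps than click around an editor, here is a minimal sketch. It assumes the core's conf directory is passed in (for a core named new that would typically be server/solr/new/conf), that solrconfig.xml does not already declare a schemaFactory, and that the element can sit just before the closing </config> tag. The function name switch_to_classic is my own invention, not a Solr command.

```shell
# Sketch: switch one core's conf directory from managed to classic schema.
# Assumption: $1 is the core's conf directory, e.g. server/solr/new/conf.
switch_to_classic() {
  conf_dir="$1"

  # Step 1: rename managed-schema to schema.xml.
  mv "$conf_dir/managed-schema" "$conf_dir/schema.xml" || return 1

  # Step 2: add the classic schema factory just before the first </config>.
  # Written through a temp file so it behaves the same with GNU and BSD tools.
  awk '/<\/config>/ && !seen { print "  <schemaFactory class=\"ClassicIndexSchemaFactory\"/>"; seen = 1 } { print }' \
    "$conf_dir/solrconfig.xml" > "$conf_dir/solrconfig.xml.tmp" \
    && mv "$conf_dir/solrconfig.xml.tmp" "$conf_dir/solrconfig.xml"
}
```

Usage would look something like switch_to_classic server/solr/new/conf, followed by a core reload.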

Seems simple enough. First I start my local Solr server and create a new core using the _default configset.

bin/solr start
bin/solr create_core -c new

Then I navigate into the newly created core directory under solr-home in VS Code (not pictured) and make those two changes.

Reloading my core to pick up the new changes.

curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=new'

Trying to post some example data from films.json.

bin/post -c new example/films/films.json

Boo, error-message: “This IndexSchema is not mutable.”

Googling that error message led me to this post on StackOverflow and a third step:

  3. Set update.autoCreateFields:false in solrconfig.xml (line #1197 in my copy)
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"
   processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

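Those lines configure the managed schema’s auto-update functionality, which adds a field definition on the fly for any incoming field that is not explicitly named in the schema. A classic schema cannot be modified at runtime, hence the “not mutable” error.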
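
Since the exact line number will differ between Solr versions, here is a hedged sketch that finds the setting by name and flips it. It assumes solrconfig.xml still contains the stock ${update.autoCreateFields:true} text from the _default configset. The function name disable_auto_create_fields is hypothetical.

```shell
# Sketch: locate and flip the update.autoCreateFields default in solrconfig.xml.
# Assumption: the file still carries the stock ${update.autoCreateFields:true} text.
disable_auto_create_fields() {
  config="$1"

  # Print the matching line(s) with line numbers, so #1197 need not be memorized.
  grep -n 'update.autoCreateFields' "$config"

  # Rewrite through a temp file; in-place sed flags differ between GNU and BSD.
  sed 's/update\.autoCreateFields:true/update.autoCreateFields:false/' "$config" > "$config.tmp" \
    && mv "$config.tmp" "$config"
}
```

As with the other config changes, the core needs a reload afterwards.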

Trying my post operation again…

Boo, more errors: [doc=/en/45_2006] unknown field 'directed_by'.

OK, so I add these field definitions to schema.xml.

<field name="directed_by" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="initial_release_date" type="text_general" indexed="true" stored="false" multiValued="false"/>
<field name="genre" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="name" type="text_general" indexed="true" stored="false" multiValued="false"/>
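
Rather than discovering missing fields one error at a time, a rough check like the one below can list them up front. This is only a sketch: it assumes flat JSON documents with simple keys (as in films.json), it uses plain text matching rather than a real JSON or XML parser, and it ignores dynamicField patterns. The function name undeclared_fields is made up for illustration.

```shell
# Sketch: list JSON keys that have no matching <field name="..."> in schema.xml.
# Assumption: flat documents with simple keys; text matching only, no real parser.
undeclared_fields() {
  docs="$1"
  schema="$2"
  keys=$(mktemp)
  declared=$(mktemp)

  # Collect every "key": occurrence in the documents and strip the quoting.
  grep -o '"[A-Za-z0-9_]*"[[:space:]]*:' "$docs" \
    | sed -e 's/^"//' -e 's/".*$//' | sort -u > "$keys"

  # Collect the names of explicitly declared fields in the schema.
  grep -o '<field name="[^"]*"' "$schema" \
    | sed -e 's/^<field name="//' -e 's/"$//' | sort -u > "$declared"

  # Print keys that appear in the documents but not in the schema.
  comm -23 "$keys" "$declared"
  rm -f "$keys" "$declared"
}
```

Calling it as undeclared_fields example/films/films.json path/to/schema.xml should print one field name per line, each of which needs a field definition before posting.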

Reloading my core and retrying my post … woohoo! No more error messages!

Outro

This was not easy for me, navigating my Solr noob-ness through partial documentation, but I learned a lot about what is actually happening behind the scenes. I’m looking forward to getting back to my data-science comfort zone of making plots and exploring insights. Thanks for reading, and I hope your classic schema setup goes more smoothly than mine!

If, like me, you need help getting up to speed with Solr, perhaps we can help! We offer various training packages.