Don’t want to deploy a separate Tika server, but need Tika server-like capabilities and already have Solr? Then this is the solution for you! What I am going to show has been available out of the box in Solr for quite a while, but it has been very opaque to configure, so hopefully this demystifies the process.
First we figured out the magic incantation to configure Tika from inside of Solr, which is via the parseContext.config parameter and a specific XML-formatted file.
You might be tempted to think that this is the same file format as a tika-config.xml that you may have seen before, and you’d be wrong ;-). While visually very similar, this file is loaded by ParseContextConfig, which is part of the Solr extraction contrib module. Yes, there are many different ways to specify configuration settings for PDF extraction and Tesseract OCR!
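For illustration, a parseContext.xml along these lines follows the entries/entry/property layout that ParseContextConfig reads. Treat it as a minimal sketch rather than the exact file we used: the two PDFParserConfig properties shown (extractInlineImages and sortByPosition) are assumed examples, and a Tesseract OCR entry would follow the same pattern with its own class name and properties.

```xml
<!-- parseContext.xml: a sketch of the Solr-specific format, NOT a tika-config.xml -->
<entries>
  <!-- Each entry names a Tika config class; "impl" is the class to instantiate -->
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- Assumed example properties; each one is set on the config object by name -->
    <property name="extractInlineImages" value="true"/>
    <property name="sortByPosition" value="true"/>
  </entry>
</entries>
```

The file sits alongside solrconfig.xml in the core's conf directory, so the request handler can refer to it by its bare file name.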
We then tweaked the default /update/extract request handler to refer to the parseContext.xml file. We want any fields that we don’t already have defined in schema.xml to be prefixed with attr_, which triggers dynamic field generation. So if the field from Tika is Creator, it becomes a text field in Solr whose name starts with attr_.
Our config file sets parseContext.config to parseContext.xml, turns one option on (true), and uses attr_ as the prefix.
Because PDFs can be really big, we also needed to bump the upload size limit to 20480 KB, which limits a PDF to 20 MB.
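Put together, the relevant pieces of solrconfig.xml might look roughly like the sketch below. Some of it is assumed: the option set to true is taken to be lowernames, the prefix is taken to be the uprefix parameter, and the size limit is taken to be the multipartUploadLimitInKB attribute on requestParsers. A real solrconfig.xml contains many other elements that are omitted here.

```xml
<config>
  <!-- Assumption: the size limit is the multipart upload limit (20480 KB = 20 MB) -->
  <requestDispatcher>
    <requestParsers multipartUploadLimitInKB="20480"/>
  </requestDispatcher>

  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- Points Tika at the Solr-specific parse context file -->
      <str name="parseContext.config">parseContext.xml</str>
      <!-- Assumption: the "true" value belongs to lowernames -->
      <str name="lowernames">true</str>
      <!-- Fields not defined in schema.xml get this prefix -->
      <str name="uprefix">attr_</str>
    </lst>
  </requestHandler>
</config>
```

The attr_ prefix only helps if the schema has a matching dynamic field (for example one named attr_*), so that the prefixed names resolve to a text field type.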
You can now hit Solr via:
curl 'http://localhost:8983/solr/documents/update/extract?literal.id=doc2&commit=true&extractOnly=true' -F "[email protected]/alvarez20140715a.pdf"
and get back the Tika-processed content from Solr in a relatively easy-to-process structure that is very similar to what Tika server returns!
This post is part of the Tika Tuesday series.