Blog

Crawling with Nutch

Recently, I had a client using the LucidWorks search engine who needed to integrate with the Nutch crawler. This sounds simple, as both products have been around for a while and are officially integrated. Even better, there are some great “getting started in x minutes” tutorials already out there for Nutch, Solr, and LucidWorks. But there were a few gotchas that kept those tutorials from working for me out of the box. This blog post documents my process of getting Nutch up and running on an Ubuntu server.

0) Install Java

Included as step 0, as there is a good chance you already have the jdk installed. On Ubuntu, this is as simple as:

apt-get install default-jdk
apt-get install default-jre
export JAVA_HOME=/path/to/jdk/folder
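If you are not sure where the package manager actually put the JDK, the symlink behind the java binary will tell you. A quick sketch (the path shown is just an example and will vary by Ubuntu release and JDK version):

# resolve the java binary to its real install location
readlink -f "$(which java)"
# e.g. /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java, so:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64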

1) Install Solr

I'll be working off the LucidWorks build, which is available as a free download but requires a license beyond the trial use. Their install process is pretty well documented; I especially recommend their getting started guide if you are new to the search domain. If you are using a stand-alone Solr install, the Nutch portion of this tutorial should be about the same, but your URLs for communicating with Solr will be slightly different.

2) Install Nutch

Nutch is an open-source project, and as such the active community ebbs and flows. In addition, some builds are more stable than others. Some documentation on the versions here:

Nutch 1.x series: This uses Hadoop for the map/reduce phases. It will integrate with a pre-existing Hadoop install, but includes the necessary pieces if you don't have one.

I'll be using the 1.7 binary release, available here. To install (the command referenced in the official Nutch tutorial):

tar -zxvf apache-nutch-1.7-bin.tar.gz

Nutch 2.x series: This uses Gora to abstract out the persistence layer; out of the box it appears to prefer HBase over Cassandra. At the time of writing, it is only available as a source download, which isn't ideal for a production environment.
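If you do want to experiment with the 2.x line anyway, the source tarball builds with ant. A minimal sketch, using a placeholder 2.x version number and leaving the Gora/HBase configuration for later:

tar -zxvf apache-nutch-2.x-src.tar.gz
cd apache-nutch-2.x
ant runtime
# the runnable install ends up under runtime/local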

3) Set up your nutch-site.xml

Nutch is highly configurable, but the out-of-the-box nutch-site.xml is essentially empty. The default settings for the baked-in plugins are documented in nutch-default.xml. Here are the property entries I needed to add inside the <configuration> element of nutch-site.xml (and why):

<property>
  <name>http.agent.name</name>
  <value>MyBot</value>
  <description>MUST NOT be empty. The advertised version will have Nutch appended.</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyBot,*</value>
  <description>The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,*. If you don't, your logfile will be full of warnings.</description>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content. Helpful in the getting-started stage, as you can recover failed steps, but may cause performance problems on larger crawls.</description>
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>-1</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this value (in seconds) then the fetcher will skip this page, generating an error report. If set to -1 the fetcher will never skip such pages and will wait the amount of time retrieved from robots.txt Crawl-Delay, however long that might be.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
</property>

At the very least, I needed to add parse-html, urlfilter-regex, and indexer-solr to the default plugin.includes.

4) Set your regex-urlfilter.txt

The regex-urlfilter.txt defines a set of include and exclude rules, which determine which urls from the crawldb will be fetched and indexed. The format of the rules is:

[+|-][regex]

This uses lazy evaluation, so the first rule to match, top to bottom, will be applied. Make sure to put the most general rules last. Wildcards are generally expensive (especially on long urls) and unnecessary here, since evaluation is optimized to assume prefix paths. For example:

+^http://www.totally.fake/subdomain

will match both http://www.totally.fake/subdomain and http://www.totally.fake/subdomain/subsubdomain, but not http://www.totally.fake.

For your very first crawl, it may be helpful to accept everything:

+^.*

Even for a first run, this has its drawbacks: if Nutch pulls something that it can't parse, you'll get errors. Obviously, for a production crawl you will want to limit your crawl/index domains appropriately.
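As a sketch of what a more restrictive filter might look like once you move past that first run (the extensions and the nutch.apache.org domain below are placeholders to swap for your own site):

# skip urls that usually point at binary content
-\.(gif|jpg|png|ico|css|js|pdf|zip|gz)$
# stay on the seed site
+^http://nutch\.apache\.org/
# reject everything else
-.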

5) Reconcile your solr and nutch schema mappings.

Nutch actually includes a schema.xml with all the fields it requires out of the box, at $NUTCH_HOME/conf/schema.xml. You could copy this directly to your Solr core directory, but I recommend adding these fields to an existing collection. Using LWS, this would be at:

$LWS_HOME/conf/solr/cores/yourCollectionName/conf/schema.xml

Nutch also has a solrindex-mapping.xml with the default fields filled in. The defaults in 1.7 were good enough that I could pull data into Solr. However, users on a non-LWS Solr install may also need to add a version field. In addition, if you need to index additional tags (like metadata), or just want to rename the fields in Solr, you will need to edit this file accordingly. Metadata is indexed via two additional plugins, parse-metatags and index-metadata. Documentation for those plugins is available here.
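For reference, the mapping file is just a list of source-to-destination field pairs plus the unique key. A trimmed sketch; the page_keywords line is a hypothetical example of exposing a parsed metatag under a renamed Solr field, and assumes the metatag plugins mentioned above are configured:

<mapping>
  <fields>
    <!-- source = the field nutch produces, dest = the field name in your solr schema -->
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <!-- hypothetical: map a parsed metatag to a custom solr field -->
    <field dest="page_keywords" source="metatag.keywords"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>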

6) Plant your seeds

Nutch is a seed-based crawler, which means you need to tell it where to start. Seeds take the form of a plain-text list of urls, one url per line, in a file named seed.txt. I like the Apache Nutch site for a first go.

mkdir $NUTCH_HOME/urls
echo "http://nutch.apache.org" > $NUTCH_HOME/urls/seed.txt

Remember that whatever seeds you pick need to match the rules you set in your regex-urlfilter.txt, or Nutch will filter them out before crawling.

7) Crawl

At this point, everything should be set up for a test run. Most of the tutorials I've run into are based on the old compiled command bin/nutch crawl; this has been deprecated and is disabled entirely in 1.8 and beyond, replaced by the bin/crawl script.

To test, run the following from $NUTCH_HOME.

./bin/crawl urls/ testCrawl/ http://127.0.0.1:8888/solr/yourcore 1

There are more params you can add here, but you shouldn't need them to get started. Note the trailing 1 – this tells Nutch to crawl only a single round. Since we set the regex-urlfilter to accept anything, it is important to keep the number of rounds very low at this point.
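Assuming the script runs to completion, the crawl directory you passed in should now hold the three Nutch data structures, which is a quick sanity check that something actually happened:

ls testCrawl/
# expect: crawldb  linkdb  segments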

8) Validate

If that ran to completion, then you are ready to query Solr. From your browser, for a collection named test:

http://127.0.0.1:12888/solr/test/select?q=*%3A*&wt=json&indent=true

This should produce a single document – the Nutch home page. Subsequent runs against the same crawldb should bring in pages referenced from the Nutch home page, and on to the outside world.

9) Debug

There is a good chance that didn't work. Knowing how to debug your new tool is usually at least as important as knowing how to set it up. This isn't a comprehensive guide, but I'll include the techniques I needed to get Nutch off the ground.

9a) Run the crawl script step-by-step

The contents of the $NUTCH_HOME/bin directory are just bash scripts, and the crawl script is essentially a wrapper around the individual bin/nutch commands. It is educational to run through those steps once by hand to understand what is going on, and this is what the official Nutch tutorial actually walks through. It also buys you a few things: you know where a failure occurred, you can recover from a failure without starting over, and you can skip any steps that don't apply to your crawl. I ultimately turned off both the dedup and invert-links steps.
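As a rough outline, one round of the crawl broken into its individual commands looks something like this (paths match the test crawl above; argument details vary slightly between Nutch versions, so treat it as a sketch rather than something to paste verbatim):

# inject the seed list into the crawldb
bin/nutch inject testCrawl/crawldb urls

# generate a fetch list and pick up the new segment directory
bin/nutch generate testCrawl/crawldb testCrawl/segments
SEGMENT=$(ls -d testCrawl/segments/2* | tail -1)

# fetch and parse the segment, then fold the results back into the crawldb
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb testCrawl/crawldb $SEGMENT

# optionally build the linkdb, then push the segment to solr
bin/nutch invertlinks testCrawl/linkdb -dir testCrawl/segments
bin/nutch solrindex http://127.0.0.1:8888/solr/yourcore testCrawl/crawldb -linkdb testCrawl/linkdb $SEGMENT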

9b) Look in both logs

Nutch writes its errors to $NUTCH_HOME/logs/hadoop.log.

LWS writes its errors to $LWS_HOME/data/logs, although I found the log viewer in the admin UI to be a nicer display.
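When a step fails quietly, tailing or grepping the Nutch log is usually the fastest way to find out why; a couple of one-liners I leaned on:

# watch the log while a crawl is running
tail -f $NUTCH_HOME/logs/hadoop.log
# pull out anything that looks like a failure after the fact
grep -iE "error|exception" $NUTCH_HOME/logs/hadoop.log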

9c) Look in the crawldb

Nutch provides a tool called readdb, which will dump the crawldb and its contents to a human-readable format. From the command line:

$NUTCH_HOME/bin/nutch readdb testCrawl/crawldb -dump newPathToDump
less newPathToDump/part-00000

This is especially helpful for debugging fetch problems if your crawl completes without errors, but you still aren't seeing any data in Solr.
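readdb also takes a -stats flag, which prints a summary of how many urls are in each state (unfetched, fetched, gone, and so on) without dumping the whole db; a handy first check before digging through the dump:

$NUTCH_HOME/bin/nutch readdb testCrawl/crawldb -stats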

9d) A note about robots.

Nutch is aggressively polite. This means that if a site has a robots.txt, Nutch will obey the directives in that robots.txt. This will override your fetch rates, and can potentially cause your fetches to fail as if the site were not reachable. There is also one network that I work from that blocks outbound bots altogether, which results in an “abc.xyz.com/robots.txt was not reachable” error. In general, politeness is the best policy, but this can be frustrating if you are trying to get a new system off the ground.

10) Extended reading