This blog post contains the code from the talk I gave at the NYC Solr/Lucene Meetup entitled It’s a Balloon! A Blimp! No, it’s a Dirigible. Apache Zeppelin: Query Solr via Spark.
Apache Solr powers search and navigation for many of the world’s largest websites. Solr is widely admired for its rock-solid full-text search and its ability to scale up to massive workloads. But Solr has moved beyond its roots as just a full-text search engine. Today, people use Solr for aggregating data, powering dashboards, geo-location, even building knowledge graphs! In fact, Solr is so powerful that it is the standard engine for big data search on major data analytics platforms, including Hadoop and Cassandra. Critical data is being accessed through Solr’s rich query interface and, now, big data engineers are including Solr as one more data store in the analytics processing chain. But as we expand the data pipeline to include diverse data stores, we need consistent ways of working across different data access patterns and representations.
Enter Apache Spark. Apache Spark has seen a meteoric rise as the tool for big data processing. Spark makes distributed computing as simple as running a SQL query. Well, almost! Spark’s core abstraction, the Resilient Distributed Dataset (RDD), is capable of representing pretty much any data store, including Solr. So, let’s see how we can integrate Apache Solr into our data processing pipeline with Apache Spark, using the Solr Spark library.
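As a taste of what this integration looks like, here is a minimal sketch of reading a Solr collection into Spark through the spark-solr connector’s DataFrame data source. The ZooKeeper address, collection name, and field layout are placeholders, and the exact option names can vary by connector version, so treat this as illustrative rather than copy-paste ready:

```scala
// Sketch only: assumes the spark-solr connector is on the classpath and a
// SolrCloud cluster is reachable at the placeholder ZooKeeper address below.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("solr-demo").getOrCreate()

// Read a Solr collection as a distributed dataset; "localhost:9983" and
// "events" are hypothetical values for this example.
val events = spark.read.format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "events")
  .load()

events.printSchema()  // Solr fields surface as columns
```

Under the hood, the connector partitions the read across Solr shards, which is what lets Spark treat the index as just another distributed dataset.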
We’ll talk about the implications and opportunities of treating Solr as just another RDD. To top it off, we’ll demonstrate how to use your existing SQL skills with SparkSQL – declarative programming over distributed datasets in real time!
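To make the SparkSQL point concrete, here is a hedged sketch of running plain SQL over a Solr-backed dataset. The collection, ZooKeeper address, and the `city` field are all hypothetical stand-ins:

```scala
// Sketch: query a (hypothetical) Solr collection with ordinary SQL.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("solr-sql").getOrCreate()

val events = spark.read.format("solr")
  .option("zkhost", "localhost:9983")   // placeholder ZooKeeper address
  .option("collection", "events")       // placeholder collection name
  .load()

// Register the DataFrame as a temporary view, then it is just SQL from here.
events.createOrReplaceTempView("events")
spark.sql("""
  SELECT city, COUNT(*) AS hits
  FROM events
  GROUP BY city
  ORDER BY hits DESC
  LIMIT 10
""").show()
```

The appeal is exactly what the talk argues: your existing SQL skills carry over unchanged, while Spark handles distributing the work across the cluster.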
Finally, we’ll tie it all together with a new Apache project that marries the best of Jupyter Notebook (the favorite tool of data scientists) and the best of distributed computing (Apache Spark and SparkSQL). Apache Zeppelin is the interactive computational environment for data analytics. Just like Jupyter/IPython Notebook, Zeppelin supports collaboration, data exploration and discovery, and rich graphs and visualizations. But its deep integration with Spark means Apache Zeppelin is the “interactive analytics notebook” for Big Data.
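In Zeppelin, that Spark integration shows up as interpreter-prefixed paragraphs in a note: a `%spark` paragraph runs Scala, and a `%sql` paragraph runs SparkSQL against any registered temp views, with results rendered as tables or charts. A hedged sketch of two such paragraphs (the collection, ZooKeeper address, and `city` field are placeholders, and option names may vary by spark-solr version):

```
%spark
// Scala paragraph: load a hypothetical Solr collection and register it
val events = spark.read.format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "events")
  .load()
events.createOrReplaceTempView("events")

%sql
-- SQL paragraph: Zeppelin renders this result as a sortable table or chart
SELECT city, COUNT(*) AS hits FROM events GROUP BY city ORDER BY hits DESC
```

Because the `%sql` output is wired to Zeppelin’s built-in visualizations, the jump from distributed query to dashboard-style chart happens inside the notebook itself.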
Find some bugs? What would you like to see? Let me know!