
Streaming Index Progress Results to Browser

I recently needed to index several thousand static web pages from a local filesystem into Solr. I was already using Ruby on Rails for the admin interface, so I quickly threw together an action to index the documents using Hpricot and RSolr. To monitor progress, I just wrote to standard out using puts:

    def index_bulk_html
      solr = RSolr.connect :url => SOLR_URL
      count = 0
      files = Dir.glob("/Users/epugh/Documents/code/www.somesite.com/**/*.{html,htm}")
      files.each do |file|
        # Only index files that live under the mirrored site directory
        path_ends_at = file.index("www.somesite.com")
        unless path_ends_at.nil?
          puts("Processed #{count} of #{files.size}") if count % 100 == 0
          # Everything from the host name onward becomes the document URL
          url = "http://#{file[path_ends_at, file.size]}"
          # parse_html is a helper defined elsewhere in the controller
          title, page_content = parse_html(file, title, page_content)
          puts "Bad Content:#{!page_content.blank?} #{url} #{title}"
          begin
            solr.add :id => url, :url => url, :mimeType => "text/html", :title => title, :docText => page_content
            solr.commit
            count = count + 1
          rescue RSolr::RequestError
            puts "Could not index #{file}"
          end
        end
      end
      puts "Imported #{count} webpages successfully."
      solr.optimize
      redirect_to root_path
    end
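The URL for each document comes from slicing the file path at the point where the site's host name starts. Here is a minimal sketch of that one step in isolation. The helper name `url_for_file`, the `SITE_HOST` constant, and the example paths are made up for illustration and are not part of the original action:

```ruby
# Hypothetical helper (not in the original action): derive the public URL
# for a file by keeping everything from the host name onward.
SITE_HOST = "www.somesite.com"

def url_for_file(file, host = SITE_HOST)
  path_ends_at = file.index(host)
  return nil if path_ends_at.nil? # file lives outside the mirrored site

  # String#[start, length] stops at the end of the string, so passing
  # file.size as the length simply means "to the end".
  "http://#{file[path_ends_at, file.size]}"
end

# Made-up example paths
example = "/Users/epugh/Documents/code/www.somesite.com/about/index.html"
puts url_for_file(example)
# prints "http://www.somesite.com/about/index.html"
puts url_for_file("/tmp/other/index.html").inspect
# prints "nil"
```

Returning nil for paths outside the site mirrors the `unless path_ends_at.nil?` guard in the action, which silently skips such files.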

This worked great, but indexing over 10,000 documents takes a long time, and meanwhile the user is staring at a browser that is slowly loading, wondering whether things have frozen. So I wondered if I could stream some progress information back to the user. Fortunately, Rails has already solved that problem! ActionController can render a proc object as text and stream its output:

    # Renders "Hello from code!"
    render :text => proc { |response, output| output.write("Hello from code!") }
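To see the shape of that pattern outside Rails, here is a minimal sketch that calls such a proc by hand. The `StringIO` stands in for the response output stream and `nil` for the response object. Both are assumptions for illustration only, since Rails supplies the real objects when it renders, and the `items` list is a stand-in for the files to index:

```ruby
require "stringio"

# Stand-in for the list of files to index (illustrative only)
items = (1..250).to_a

# A body proc with the same (response, output) signature as the example above.
# It reports progress every 100 items and flushes after each report.
body = proc { |response, output|
  items.each_with_index do |_item, count|
    if count % 100 == 0
      output.write("Processed #{count} of #{items.size}\n")
      output.flush
    end
  end
  output.write("Done\n")
}

# Stand-ins, not Rails objects: nil for the response, StringIO for the stream
fake_output = StringIO.new
body.call(nil, fake_output)
puts fake_output.string
```

Unlike puts, write does not append a newline, so each message carries its own "\n" to keep the progress reports on separate lines.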

So I quickly wrapped my existing code in a large proc, changed each puts to output.write, and now constant progress reports stream out to the browser:

    def index_bulk_html
      solr = RSolr.connect :url => SOLR_URL
      count = 0
      files = Dir.glob("/Users/epugh/Documents/code/www.somesite.com/**/*.{html,htm}")
      render :text => proc { |response, output|
        files.each do |file|
          path_ends_at = file.index("www.somesite.com")
          unless path_ends_at.nil?
            # write does not append a newline the way puts does, so add one
            output.write("Processed #{count} of #{files.size}\n") if count % 100 == 0
            url = "http://#{file[path_ends_at, file.size]}"
            title, page_content = parse_html(file, title, page_content)
            output.write "Bad Content:#{!page_content.blank?} #{url} #{title}\n"
            output.flush
            begin
              solr.add :id => url, :url => url, :mimeType => "text/html", :title => title, :docText => page_content
              solr.commit
              count = count + 1
            rescue RSolr::RequestError
              output.write "Could not index #{file}\n"
              output.flush
            end
          end
        end
        output.write "Imported #{count} webpages successfully.\n"
        # Optimize once indexing is done; kept inside the proc so it runs after the loop
        solr.optimize
      }
    end

Thank you Rails, Hpricot, and RSolr for making life so simple!