Streaming Index Progress Results to Browser

Eric Pugh, December 11, 2009

I recently needed to index several thousand static webpages from a local filesystem into Solr. I was already using Ruby on Rails for the admin interface, so I quickly threw together an action to index the documents using Hpricot and RSolr. To monitor progress, I just wrote to standard out using puts:

def index_bulk_html
  solr = RSolr.connect :url => SOLR_URL
  count = 0
  files = Dir.glob("/Users/epugh/Documents/code/www.somesite.com/**/*.{html,htm}")
  files.each do |file|
    # Skip anything that does not live under the mirrored site's directory.
    path_ends_at = file.index("www.somesite.com")
    unless path_ends_at.nil?
      puts("<strong>Processed #{count} of #{files.size}</strong>") if count % 100 == 0

      # Rebuild the public URL from the hostname onward.
      url = "http://#{file[path_ends_at,file.size]}"
      # parse_html is a helper not shown here; it returns the title and body text.
      title, content = parse_html(file, title, content)

      puts "Bad Content:#{content.blank?} #{url} #{title}"

      begin
        solr.add :id => url, :url => url, :mimeType => "text/html", :title => title, :docText => content
        solr.commit
        count = count + 1
      rescue RSolr::RequestError
        puts "<strong>Could not index #{file}</strong>"
      end
    end
  end
  puts "Imported #{count} webpages successfully."
  solr.optimize
  redirect_to root_path
end
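
The URL for each document is derived from its location on disk: everything from the site's hostname onward becomes the path of an http:// URL. Here is a minimal standalone sketch of that mapping. The name `url_for_file` is hypothetical and does not appear in the action above; it is only meant to make the string slicing easier to try out in irb.

```ruby
# Hypothetical helper: maps a file on disk to the URL it was mirrored from.
# Returns nil when the hostname does not appear in the path, which mirrors
# the `unless path_ends_at.nil?` guard in the action.
def url_for_file(file, host = "www.somesite.com")
  start = file.index(host)
  return nil if start.nil?
  # Slice from the hostname to the end of the string.
  "http://#{file[start..-1]}"
end

puts url_for_file("/Users/epugh/Documents/code/www.somesite.com/docs/index.html")
# prints http://www.somesite.com/docs/index.html
```

Files outside the mirrored directory come back as nil, so the caller can skip them the same way the action does.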

This worked great, but indexing more than 10,000 documents takes a long time, and meanwhile the user is staring at a browser that is slowly loading, wondering whether things have frozen. So I wondered if I could stream some progress back to the user. Fortunately, Rails has already solved that problem! ActionController can render a proc object as text and stream its output:

# Renders "Hello from code!"
render :text => proc { |response, output| output.write("Hello from code!") }
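
To see what the proc is doing without a running Rails app, the same two-argument shape can be exercised in plain Ruby. This is only a sketch of the calling convention: `StringIO` stands in for the real output stream and `nil` stands in for the response, both of which are assumptions for the demo. Rails itself does considerably more than this.

```ruby
require "stringio"

# The same shape of proc that is handed to render :text.
body = proc { |response, output| output.write("Hello from code!") }

# Stand-in for the stream Rails would pass in; an assumption for this demo.
output = StringIO.new
body.call(nil, output)

puts output.string
# prints Hello from code!
```

Anything that responds to write can play the role of output here, which is why swapping puts for output.write is such a small change.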

So I quickly wrapped my existing code in a large proc, changed each puts to output.write, and now stream constant progress reports out to the browser:

def index_bulk_html
  solr = RSolr.connect :url => SOLR_URL
  count = 0
  files = Dir.glob("/Users/epugh/Documents/code/www.somesite.com/**/*.{html,htm}")
  # The proc runs while the response is being sent, after the action returns.
  render :text => proc { |response, output|
    files.each do |file|
      path_ends_at = file.index("www.somesite.com")
      unless path_ends_at.nil?
        output.write("<strong>Processed #{count} of #{files.size}</strong>") if count % 100 == 0

        url = "http://#{file[path_ends_at,file.size]}"
        # parse_html is a helper not shown here; it returns the title and body text.
        title, content = parse_html(file, title, content)

        output.write "Bad Content:#{content.blank?} #{url} #{title}"
        output.flush

        begin
          solr.add :id => url, :url => url, :mimeType => "text/html", :title => title, :docText => content
          solr.commit
          count = count + 1
        rescue RSolr::RequestError
          output.write "<strong>Could not index #{file}</strong>"
          output.flush
        end
      end
    end
    output.write "Imported #{count} webpages successfully."
    # Optimize inside the proc so it happens after indexing, not before it starts.
    solr.optimize
  }
end
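
Because puts and output.write both just need something that accepts text, the progress reporting can be pulled out into a small object that works with either. The following is a rough sketch under that assumption. `ProgressReporter` is a hypothetical name, not something from Rails or from the action above, and the trailing `<br/>` is an addition so each report lands on its own line in the browser.

```ruby
require "stringio"

# Hypothetical helper: writes a progress line every `every` items to any
# object that responds to #write (for example $stdout or the streaming output).
class ProgressReporter
  def initialize(output, total, every = 100)
    @output = output
    @total = total
    @every = every
  end

  # Call once per item with the number processed so far.
  def tick(count)
    return unless (count % @every).zero?
    @output.write("<strong>Processed #{count} of #{@total}</strong><br/>")
    # Push the line out right away when the stream supports it.
    @output.flush if @output.respond_to?(:flush)
  end
end

# Demo against an in-memory stream instead of a browser connection.
out = StringIO.new
reporter = ProgressReporter.new(out, 250)
(0...250).each { |count| reporter.tick(count) }
puts out.string
```

With 250 items and the default interval of 100, the demo writes three reports (at 0, 100, and 200). Passing `$stdout` instead of the `StringIO` gives the original console behavior.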

Thank you Rails, Hpricot, and RSolr for making life so simple!



