Hey Kibana, is anyone listening to my Podcast?

Matt Overstreet, March 30, 2016

“Dashboard all the things!” was my motto leaving Elastic{ON} this year. I played a game that week trying to think of all the previously complex software projects that might be pretty trivial in the face of the current Elastic Stack.

Could we hook up all the treadmills in all the gyms in Washington DC to a Beats emitter and dashboard that with Kibana? Sure. What about the number of times my faucet drips at night? Absolutely, as a weekend project. What about my coworker Chris Bradford’s old app that counted the number of times the dogs in his old office would bark? Yep, and he could have spent a lot more time working and less time writing apps about not being able to work.

So it’s no surprise that when I wondered if anyone was actually listening to Search Disco my first thought was “How do I get this into a dashboard?” This is how. So grab your access logs and come along with me!

First, I needed a way to track downloads. If I could get an IP, timestamp and filename for each download, that would be a pretty good start. We are self-hosting our little podcast, and both Amazon’s S3 storage and Google’s cloud storage make this pretty easy. We are hosting Search Disco on the Google cloud, so I dug up some instructions to enable access logs for our podcasts bucket and in no time I was collecting data.

In order to use that data, it needs to get into Elasticsearch. But how do we load it? Does Elastic have a magic CSV endpoint like Solr? Is there a helper or import script?

No, but I’m really happy with what it does have. The answer (and thanks to everyone who groaned at the last paragraph for waiting) is Logstash. It takes a couple hours to get up to speed on Logstash, but you won’t regret it. Let’s have a look at the config I wrote to process the access logs. I’ve added some comments to help explain what’s in the file.

# First: where is the data coming from?
input {
  file {
    path => "/Users/mattoverstreet/Documents/search-disco/logs/podcast-logs/search_disco_usage*"
    type => "podcast-log"
    start_position => "beginning"
  }
}

# Next: how can we understand the data; and, does it need
# to be cleaned up or changed?
filter {
  # our file is in csv format, so we’ll need to list the columns
  csv {
    columns => ["time_micros", "c_ip", "c_ip_type", "c_ip_region", "cs_method", "cs_uri", "sc_status", "cs_bytes", "sc_bytes", "time_taken_micros", "cs_host", "cs_referer", "cs_user_agent", "s_request_id", "cs_operation", "cs_bucket", "cs_object"]
    separator => ","
  }
  # Kibana needs a @timestamp, so we’ll try to convert the
  # time_micros field.  The date filter will handle the
  # conversion and then set the value of @timestamp.
  date {
    match => [ "time_micros", "UNIX_MS" ]
  }
}

# Last: where is the log data going to be stored?
output {
  elasticsearch {
    action => "index"
    hosts => "localhost"
    index => "logstash-sd-logs"
    workers => 1
  }
}

If you are following along with your own access logs, don’t run that one just yet. You will see some very odd data. Looking in the access log CSV, I noticed a problem with the download timestamp: it’s in microseconds! According to the Logstash docs, the date filter can handle a UNIX timestamp in seconds or milliseconds, but not microseconds. Read as milliseconds, a microsecond value is a thousand times too large, so every download lands far in the future. So, what now?

There are a couple of ways to solve this problem. I’ve taken advantage of the fact that Logstash is Ruby on the inside to process the timestamp. Let’s take our original timestamp, convert it to milliseconds and use it for the @timestamp attribute. To do that we’ll add another filter.

  # add this to the ‘filter’ section after the csv block
  ruby {
    # Chop three digits off the time_micros field
    # (integer division turns microseconds into milliseconds)
    code => "event['time_micros'] = (event['time_micros'].to_i / 1000)"
  }
  # now our date filter will work!
  date {
    match => [ "time_micros", "UNIX_MS" ]
  }
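
One caveat if you are on a newer install: the event['field'] syntax in the ruby filter matches the Logstash release that was current when this was written. Logstash 5.0 and later replaced it with event.get and event.set. A rough equivalent for those versions might look like this (a sketch I have not run against every release):

```
  # same idea as the ruby filter above, written for the
  # Logstash 5.0+ event API (event.get / event.set)
  ruby {
    code => "event.set('time_micros', event.get('time_micros').to_i / 1000)"
  }
  date {
    match => [ "time_micros", "UNIX_MS" ]
  }
```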

After I wrote this I realized this could actually be done with a mutate filter using a pretty simple gsub as well.
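
For the curious, here is a minimal sketch of what that mutate version might look like. The regular expression is my own guess at a "pretty simple gsub", and it assumes time_micros is still a plain string of digits, which is how the csv filter hands it over:

```
  # alternative to the ruby filter: add this to the
  # ‘filter’ section after the csv block
  mutate {
    # drop the last three digits of the string, turning
    # microseconds into milliseconds without any arithmetic
    gsub => [ "time_micros", "[0-9]{3}$", "" ]
  }
  date {
    match => [ "time_micros", "UNIX_MS" ]
  }
```

The result is the same as integer division by 1000, as long as every value has more than three digits.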

Now we are ready to suck in some data. Let’s save our Logstash config in the conf directory of the logstash folder as gcloud.json. From the logstash dir you can run something like:

bin/logstash -f conf/gcloud.json

Now, spin up Kibana, accept some defaults and your data is ready to dashboard!

But wait, wouldn’t it be nice to show where people were downloading from? I have an IP for each download, so I could cross-reference it with a GeoIP database to find the location and save that into Elasticsearch to visualize. How much work do I need to do to make that happen? Not much, as it turns out. Let’s update our Logstash job to handle that at import time.

  # Add this to the end of the ‘filter’ block in our config.
  geoip {
    source => "c_ip"
    target => "geoip"
    # collect longitude and latitude, in that order, into one array field
    add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
    add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
  }
  mutate {
    # the values above are added as strings, so convert them to floats
    convert => [ "[geoip][coordinates]", "float" ]
  }

Next time, let’s spin up Kibana and configure some visualizations!

Kibana podcast dashboard
