
A Solr performance debugging toolkit

Thanks to Matt Pearce for this guest post!

We’ve recently been working with a large client who were having issues with their Solr cluster. Very occasionally, nodes in the cluster were becoming unresponsive: could we help?

After a great deal of investigation, we identified that one shard of a secondary spellcheck collection had failed, which caused ever-increasing timeouts, and eventually caused other nodes to become unresponsive. (See https://issues.apache.org/jira/browse/SOLR-13793 for further information about a possible cause – the cluster was running on Solr 7.7.2, so our best “quick fix” suggestion was to update to 7.7.3, alongside a lot of longer-term advice.)

This process got us thinking about building a toolkit for this kind of Solr performance debugging: a quick grab-bag of bits and pieces for analysing logs when faced with a misbehaving cluster. We’ve included some starting points later in this blog.

Somewhat ironically, given that this was a Solr cluster, we ended up using Elasticsearch and Kibana to analyse the log data available, alongside specific tools for looking at Java garbage collection output. If you’d like to understand more about the Java Virtual Machine and the thorny issue of garbage collection, do read this excellent blog on Optimising Solr Resources by Sayan Das.
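
For GC analysis to be possible at all, Solr needs to be writing GC logs in the first place. As a rough sketch – assuming Solr 7.x on Java 8 and the stock bin/solr start script – the relevant switch lives in solr.in.sh (on Java 9+ the unified -Xlog:gc* flags are used instead; treat the exact flag set below as a starting point rather than gospel):

# solr.in.sh – make sure GC logging is on so there is output to analyse
# (Java 8 flags shown; adjust for your JVM version)
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution \
  -XX:+PrintGCApplicationStoppedTime"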

Docker for a base instance

We spun up Elasticsearch and Kibana using Docker, which gave us a base instance to work from. We set it up with an external volume to hold the data, since we didn’t want it disappearing if the containers needed rebuilding for some reason. Here’s our docker-compose.yml file:

version: '2.2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.1
    container_name: elasticsearch
    ports:
      - "9200:9200"
    expose:
      - "9200"
    environment:
      discovery.type: "single-node"
      NETWORK_HOST: "0.0.0.0"
    volumes:
      - es_data:/usr/share/elasticsearch/data
      - ./elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml

  kibana:
    image: docker.elastic.co/kibana/kibana:7.8.1
    container_name: kibana
    ports:
      - "5601:5601"
    expose:
      - "5601"
    environment:
      SERVER_HOST: "0.0.0.0"

volumes:
  es_data:
    driver: local
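
With the compose file in place (along with whatever settings you want in the mounted elasticsearch.yml), bringing the stack up and checking it is responding is just the usual Docker Compose and curl incantations – nothing specific to our setup:

# Start Elasticsearch and Kibana in the background
docker-compose up -d

# Give Elasticsearch a few seconds to start, then check it's up
curl "http://localhost:9200/_cluster/health?pretty"
# Kibana should then be available at http://localhost:5601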

Elasticsearch mappings

We put together index settings and mappings for handling the Solr logs – we chose the shard setting fairly arbitrarily, since there was a lot of data coming in, but we didn’t know quite how much. The mappings were mostly chosen to keep the index size down – it was very unlikely we would be doing much text searching, so most fields used the keyword type. Here are the settings and mappings as a chunk of JSON:

{
  "settings": {
        "number_of_shards": 3,
        "analysis": {
            "char_filter": {
                "param_char_filter": {
                    "type": "pattern_replace",
                    "pattern": "=",
                    "replacement": " "
                }
            },
            "analyzer": {
                "param_analyzer": {
                    "type": "custom",
                    "char_filter": [
                        "param_char_filter"
                    ],
                    "tokenizer": "standard"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "message": {
                "type": "text"
            },
            "reason": {
                "type": "text"
            },
            "solrhost": {
                "type": "keyword"
            },
            "loglevel": {
                "type": "keyword"
            },
            "shard": {
                "type": "keyword"
            },
            "collection": {
                "type": "keyword"
            },
            "replica": {
                "type": "keyword"
            },
            "core": {
                "type": "keyword"
            },
            "webapp": {
                "type": "keyword"
            },
            "requesthandler": {
                "type": "keyword"
            },
            "param": {
                "type": "text",
                "analyzer": "param_analyzer"
            },
            "action": {
                "type": "keyword"
            },
            "tags": {
                "type": "keyword"
            },
            "querytime": {
                "type": "integer"
            },
            "status": {
                "type": "integer"
            },
            "hits": {
                "type": "integer"
            },
            "timestamp": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss.SSS"
            }
        }
    }
}
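
To create the index before importing anything, we just need to PUT that JSON to Elasticsearch. A sketch, assuming the JSON above is saved as debugindex.json and that the index is named debugindex to match the Logstash output further down:

curl -X PUT "http://localhost:9200/debugindex" \
  -H "Content-Type: application/json" \
  -d @debugindex.json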

Logstash is your friend, mostly

We could have run Logstash in Docker as well, but since we weren’t expecting to need it running once the data had been imported, we opted to run it standalone. We built a configuration file inspired by one put together by Sebastian Muller, although it rapidly changed once it became apparent that Solr’s log format had changed a fair bit since 2015:

input {
  file {
    path => "/home/matt/tmp/*.solr.log.gz"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/home/matt/tmp/debugindex_logstash.complete"
    codec => multiline {
      pattern => "^%{TIMESTAMP_ISO8601}"
      negate => true
      what => "previous"
    }
  }
}

filter {
  # Extract host
  grok {
    match => {
      path => "%{GREEDYDATA}/%{GREEDYDATA:solrhost}\.solr\.log"
    }
    tag_on_failure => [ "solrhost_not_found"]
  }

  # Extract event severity and timestamp
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{WORD:loglevel}.+\((?<thread>[^)]+)\) \[.*?\] %{JAVACLASS:classname} %{GREEDYDATA:reason}"
    }
  }

  if [thread] =~ /^commit/ {
    # Commit message - format is different to search request logging
    mutate {
      add_field => {
        "action" => "commit"
      }
    }
  } else {
    # Extract the collection, shard, replica, core
    grok {
      match => {
        message => "\[c:%{WORD:collection} s:%{WORD:shard} r:%{WORD:replica} x:%{WORD:core}\]"
      }
      tag_on_failure => [ ]
    }

    if [classname] =~ /Request$/ or [classname] =~ /HttpSolrCall$/ {
      if [reason] =~ /\[admin\]/ {
        # Admin request - slightly different format, webapp always null
        mutate {
          add_field => {
            "action" => "admin"
          }
        }
        grok {
          match => {
            message => "webapp=null path=%{URIPATHPARAM:requesthandler} params=\{%{GREEDYDATA:params}\}.+status=%{NUMBER:status} QTime=%{NUMBER:querytime}"
          }
          tag_on_failure => [ ]
        }
      } else {
        # Extract the request details
        grok {
          match => {
            message => "webapp=%{URIPATHPARAM:webapp} path=%{URIPATHPARAM:requesthandler} params=\{%{GREEDYDATA:params}\}.+status=%{NUMBER:status} QTime=%{NUMBER:querytime}"
          }
          tag_on_failure => [ ]
        }
      }

      # Extract the hits, if available
      grok {
        match => {
          message => " hits=%{NUMBER:hits}"
        }
        tag_on_failure => [ ]
      }

      if [params] {
        urldecode {
          field => "params"
        }
        mutate {
          gsub => [ 
            "params", "\+", " "
          ]
        }
        mutate {
          split => {
            "params" => "&"
          }
        }
        # Extract the query into its own field
        grok {
          match => {
            params => "^q=%{GREEDYDATA:querystring}"
          }
          tag_on_failure => [ ]
        }
      }
      mutate {
        remove_field => [ "reason" ]
      }
    }
  }

  if "drop" in [tags] {
    drop {}
  }
  # Remove the intermediate "multiline" tag added by the multiline codec
  mutate {
    remove_tag => [ "multiline" ]
  }
  # Drop empty lines
  if [message] =~ /^\s*$/ {
    drop {}
  } 
}

output {
  # Send directly to local Elasticsearch
  elasticsearch {
    index => "debugindex"
  }
}
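
With the config saved to a file (call it solr-logs.conf – the name is arbitrary), the import is then a one-off Logstash run from the Logstash install directory:

bin/logstash -f solr-logs.conf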

(Logstash is great, but one quirk we quickly discovered was that, in read mode, the file input plugin deletes files by default once they’ve been read. After all, why on earth would you want to keep those files once they’ve been Logstashed, even if that is 100GB of data you’ve downloaded over a not-especially-quick network connection?)

We chose Elasticsearch & Kibana because it proved the quickest solution for pulling together graphs of status codes, log status levels, and so on, allowing us to identify the trouble hotspot very quickly, and to drill down within those hotspots to identify the specific nodes which were giving trouble. This was invaluable, since the node the client had originally identified as problematic was actually not the cause of the issue.

What else should go in the toolkit?

We’d be interested to hear what you think should go in the toolkit – what tools are you commonly using for Solr performance debugging? As clusters grow bigger, and Solr and Elasticsearch evolve to manage ever-growing quantities of data, in different cluster and replica formations, what would you want in your kitbag to get to the bottom of problems you see?

Updates

Let us know your ideas and we’ll update this post – or join us and over 1000 search and relevance professionals in Relevance Slack, sharing experiences & tips.