Wednesday, December 20, 2017

Apache Solr Cloud Sanity - 3 simple sanity tests

Having a clear picture of your services' status is a fundamental requirement.
Without it, you are blind to what the system is doing.

I am going to present a general concept of sanity checks and some more specific Solr database sanity checks. 

There are many tools and methods to monitor a running service, whether it is a microservice, a third-party database or any other piece of code.

We use Datadog and Grafana to keep the system fully covered and to dive into the relevant metrics when needed.

In addition to these two well-known tools, we implement a custom sanity test for each in-house service or third-party tool; when the test fails, we know we have a problem.
The sanity test results are collected by a centralised monitoring service and displayed together in a monitoring dashboard.
The dashboard has the features you would probably expect, such as drilling down into sanity test logs, viewing the history of sanity test results and so on.

Let's take Apache Solr as an example.
Alongside each Solr Docker instance, we deploy a sanity script for that instance.

We defined 3 sanity tests. They are all based on the Solr cluster status output:
  1. Replica status - If a replica is down or recovering, we mark the Solr instance hosting that replica as problematic.
  2. Balanced leaders within a collection - If a Solr server hosts more than 1 shard leader from the same collection, we also flag it as a problem (it might overload that host).
  3. Not too many leaders per host - If a Solr server hosts more than X leaders, we mark the host as problematic.
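
The three tests above can be sketched in Python as follows. This is only an illustration, not our production script: it assumes the cluster status has already been parsed from the JSON returned by Solr's Collections API (`action=CLUSTERSTATUS`), and the function names and the default threshold for X are hypothetical.

```python
from collections import Counter, defaultdict


def iter_replicas(cluster_status):
    """Yield (collection, shard, replica) for every replica in the status."""
    collections = cluster_status["cluster"]["collections"]
    for coll_name, coll in collections.items():
        for shard_name, shard in coll["shards"].items():
            for replica in shard["replicas"].values():
                yield coll_name, shard_name, replica


def is_leader(replica):
    # Solr reports the leader flag as the string "true" on leader replicas.
    return str(replica.get("leader", "false")).lower() == "true"


def run_sanity_checks(cluster_status, max_leaders_per_host=3):
    """Return {node_name: [problem descriptions]} for the three tests.

    max_leaders_per_host is the "X" from test 3; the default is an assumption.
    """
    problems = defaultdict(list)
    leaders_per_node = Counter()
    leaders_per_node_and_collection = Counter()

    for coll, shard, replica in iter_replicas(cluster_status):
        node = replica["node_name"]
        # Test 1: replica status.
        if replica["state"] in ("down", "recovering"):
            problems[node].append(f"replica of {coll}/{shard} is {replica['state']}")
        if is_leader(replica):
            leaders_per_node[node] += 1
            leaders_per_node_and_collection[(node, coll)] += 1

    # Test 2: more than one shard leader of the same collection on one host.
    for (node, coll), count in leaders_per_node_and_collection.items():
        if count > 1:
            problems[node].append(f"{count} shard leaders of collection {coll}")

    # Test 3: too many leaders on one host overall.
    for node, count in leaders_per_node.items():
        if count > max_leaders_per_host:
            problems[node].append(f"{count} leaders hosted (limit {max_leaders_per_host})")

    return dict(problems)


# Made-up cluster status with two hosts, one collection and two shards.
sample = {
    "cluster": {
        "collections": {
            "products": {
                "shards": {
                    "shard1": {"replicas": {
                        "core_node1": {"state": "active", "node_name": "solr1:8983_solr", "leader": "true"},
                        "core_node2": {"state": "down", "node_name": "solr2:8983_solr"},
                    }},
                    "shard2": {"replicas": {
                        "core_node3": {"state": "active", "node_name": "solr1:8983_solr", "leader": "true"},
                        "core_node4": {"state": "active", "node_name": "solr2:8983_solr"},
                    }},
                }
            }
        }
    }
}

problems = run_sanity_checks(sample)
for node, issues in problems.items():
    print(node, issues)
```

In this made-up sample, one host is flagged for a down replica and the other for holding both shard leaders of the same collection. Returning the problems keyed by host matches the way the tests mark a Solr instance, rather than a collection, as problematic.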


Each test runs every M minutes (15 for the Solr tests).
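
A periodic runner for such a script might look like the sketch below. It is a simplified assumption of how the pieces fit together: the `check` and `report` callables are hypothetical placeholders for the sanity logic and for whatever sends results to the centralised monitoring service, and the base URL is expected to look like `http://host:8983/solr`.

```python
import json
import time
import urllib.request

CHECK_INTERVAL_MINUTES = 15  # "M" in the text; 15 for the Solr tests


def cluster_status_url(base_url):
    """Build the Collections API URL that returns the cluster status as JSON."""
    return base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"


def fetch_cluster_status(base_url, timeout=10):
    """Fetch and parse the cluster status from one Solr node."""
    with urllib.request.urlopen(cluster_status_url(base_url), timeout=timeout) as resp:
        return json.load(resp)


def run_periodically(base_url, check, report, fetch=fetch_cluster_status,
                     sleep=time.sleep, max_runs=None):
    """Fetch the status, run the check and report the result every M minutes.

    check, report, fetch and sleep are injectable so the loop can be tested
    without a live Solr; max_runs=None means run until the process is stopped.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            report(check(fetch(base_url)))
        except Exception as exc:
            # A failure to reach Solr is itself a sanity failure worth reporting.
            report({"error": str(exc)})
        runs += 1
        sleep(CHECK_INTERVAL_MINUTES * 60)
```

Reporting fetch errors instead of letting the loop crash keeps the dashboard informed even when the Solr node cannot be reached at all.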

To conclude,
a custom sanity test is a simple and efficient way to stay aware of a service's status. Our monitoring procedure starts from the centralised sanity dashboard, and only then goes into Datadog or Grafana.

