
Troubleshooting Splunk Search Head Clusters

There is always a sense of dread when your search head cluster (SHC) goes down. It’s the interface to Splunk, and for the analysts, it’s their window to the data world.  Without a working search head (SH), you end up with a security operations center (SOC) or business unit sitting around anxiously twiddling their thumbs hoping to get access to their dashboards and alerts.

Fear not: there are a few simple troubleshooting tricks we can try to help get the SHC back in prime shape.

Common Hurdle (1) – A search head is refusing to join the cluster due to Automatic Detention

If the SH is in detention, the most likely culprit is that something was changed on that server alone, and the rest of the cluster, like an uptight 8th grade English teacher, put it in detention for being different. The most common causes are someone making changes in a default folder on the back end, or trying to install an app locally on only that SH.


The first step is to push a fresh bundle from the Deployer; if the problem was a rogue app installation or a change in an app's default folder, this should clear up the issue.
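A bundle push from the deployer looks something like the sketch below. The hostname and credentials are placeholders for illustration; `-target` points at the management port of any one cluster member.

```shell
# Run on the deployer; <any_member> is a placeholder for one SHC member's
# management URI, and the -auth credentials are illustrative only.
splunk apply shcluster-bundle -target https://<any_member>:8089 -auth admin:changeme
```

The targeted member distributes the new bundle to the rest of the cluster, so you only need to run this once.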

If the issue persists, some more surgical validation may be necessary. First, check server.conf in an app's default folder and in system/local to verify that the hostname is correct. It has been documented that SHCs managed by tools like Git can end up sharing their system/local settings, and another SH may have overwritten the server.conf on your rogue SH with its own. This can leave two SHs in the same cluster with the same hostname.

If the server.conf files are all good, your next step is a side-by-side comparison, or diff, of btool output. Run btool against server.conf [splunk btool server list --debug] and write the output to a file on both a working SH and the detained one, then (on Linux) diff the two files. This lets you see the settings that work next to the ones that don't; modify the rogue member accordingly.
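The comparison can be sketched as follows. The paths assume a default /opt/splunk install and are placeholders; adjust to your environment.

```shell
# Run on both the healthy SH and the detained SH (default install assumed):
/opt/splunk/bin/splunk btool server list --debug > /tmp/btool_server_$(hostname).txt

# Copy one output file to the other host, then compare:
diff /tmp/btool_server_healthy.txt /tmp/btool_server_detained.txt
```

Any line that appears on only one side is a candidate for the setting that landed the member in detention.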

Common Hurdle (2) – The SHC is over-tasked with too many scheduled searches, resulting in a high skip ratio

It is not uncommon when your team falls in love with Splunk to set up a ton of scheduled searches and accelerate everything.

Splunk can only handle so much at any given time; this is where your CPU cores come into play. A scheduled search, for the most part, occupies one core for the duration of its run. If more searches kick off at the same time than you have cores in your cluster, it's logically impossible to run them all simultaneously. Keep in mind the captain also has to delegate these searches, which reduces the capacity of its own cores to run searches.
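As a rough sketch of the math involved: Splunk's default ceiling for concurrent historical searches on a member is derived from two limits.conf settings, max_searches_per_cpu (default 1) and base_max_searches (default 6), multiplied against the core count. The core count below is an arbitrary example.

```shell
# Default concurrent-search ceiling per member, per limits.conf defaults:
#   max_searches_per_cpu (1) * number_of_cpus + base_max_searches (6)
CORES=16
echo $(( 1 * CORES + 6 ))   # a 16-core SH can run 22 historical searches at once
```

If the number of searches scheduled for the same moment exceeds this ceiling across the cluster, the overflow gets skipped or deferred.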

Remediation Options:

Make your SHC captain ad-hoc only

In a clustered environment, your cluster captain makes sure all searches run when they are supposed to by delegating the search jobs to the rest of the cluster. To make this process much smoother, especially in clusters with 5 or more members, we want to keep the captain from burning cycles running searches itself; ideally, it uses its resources to delegate everything to the other members. We can do this by configuring it as ad-hoc only, which is done by adding a line to the shclustering stanza in server.conf [captain_is_adhoc_searchhead = true]. This setting still lets you run searches on your captain manually (whichever SH that may be if you are using RAFT), but keeps it from being bogged down running scheduled searches.
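In server.conf, the change looks something like this minimal sketch (the stanza and setting names are Splunk's; the file path assumes you are editing system/local):

```ini
# $SPLUNK_HOME/etc/system/local/server.conf
[shclustering]
captain_is_adhoc_searchhead = true
```

Since captaincy can move between members under RAFT, the setting is typically applied on every member, and a restart is needed for it to take effect.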

Stagger your search times

This is relatively common practice, so we will go a little deeper than the usual "don't run all of your searches at exactly midnight." As you may know, if you are using Enterprise Security, accelerated data models and accelerated searches by default run every 5 minutes, starting at the top of the hour. If you are accelerating a lot of data models, it is wise to cede those 5-minute marks to the accelerations and schedule your own searches off them.

It is best practice to use cron schedules for all of your searches, as they give you fine-grained control over when searches kick off. Try having your hourly searches kick off at 7 minutes or 22 minutes past the hour. For daily searches, have them kick off at 1:03 am or 2:52 am; again, any time that does not line up with a 5-minute interval. An online cron expression tester is a great resource for getting your cron schedules fine-tuned.
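In savedsearches.conf, staggered schedules look something like the sketch below. The search names are hypothetical; cron_schedule is the real setting.

```ini
# savedsearches.conf -- hypothetical searches, staggered off the 5-minute marks
[hourly_failed_logins]
cron_schedule = 7 * * * *      # hourly, at 7 minutes past the hour

[daily_asset_report]
cron_schedule = 3 1 * * *      # daily, at 1:03 am
```

The five cron fields are minute, hour, day of month, month, and day of week.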

It may make sense to reduce the acceleration frequency of your data models and/or searches if you find your skip ratio is very high. A metric to consider: running a search every 5 minutes equates to 288 runs per data model/search over the course of a day. Simply reducing acceleration to every 10 minutes halves the number of daily scheduled runs. Where you can afford to wait 10, 20, or even 30 minutes instead of 5, you should consider it.
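The arithmetic above can be checked quickly; there are 1440 minutes in a day, so the daily run count is just 1440 divided by the interval.

```shell
# Daily scheduled runs per search at a given interval (1440 minutes per day):
for INTERVAL in 5 10 20 30; do
    echo "every ${INTERVAL}m -> $(( 1440 / INTERVAL )) runs/day"
done
```

At 5 minutes that is 288 runs per day; at 10 minutes, 144, matching the halving described above.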

You can move the acceleration schedule off 5-minute intervals by going into [ Edit -> Edit Acceleration -> Advanced Settings -> Summarization Period ] for data models and [ Edit -> Advanced Edit -> auto_summarize.cron_schedule ] for scheduled searches.

Please note: this technique is primarily used by Splunk PS or someone who knows exactly what they are doing.

Maximize CPU cores for more searches

Your base_max_searches setting, found in limits.conf, is set by default based on some Splunk math about your environment. Most Splunk engineers will never need to touch it, but if you notice a large number of skipped searches while your CPU utilization remains very low, adjusting this number may help. We mentioned earlier that one search takes up one core; that's true in a sense, but when running as a cluster, your captain can fork searches and allow a core to run more than one search in parallel via some Splunk scheduler magic. If your CPU utilization is low, you can ask the captain to work just a bit harder by editing your limits.conf settings. By default, base_max_searches will most often equal 6; you can increase it in increments of 10 (6, 16, 26, etc.) until your CPU utilization sits at roughly 60%. Do not push beyond this, or you risk overloading the host.

NOTE: limits.conf lives in $SPLUNK_HOME/etc/system/default. DO NOT edit the setting there; it is best practice to create a new limits.conf in system/local or within an app's local directory. It would look something like this:

[search]
base_max_searches = 16

Common Hurdle (3) – All search heads refuse to join, or only one site from a multi-site cluster joins the SHC

There are many reasons this can happen, from misconfigurations in the shclustering stanza to networking issues, and it can be a terrifying sight after an upgrade when none of your production search heads come back online.

Remediation Options:

Should you initiate a rolling restart or complete an upgrade and no SHs come back online, an easy way to validate that it is a clustering issue is to comment out the shclustering settings in system/local and see whether the SHs themselves are healthy. If your SHs come back online healthy as stand-alone SHs, then you know the problem must be cluster-related.
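The temporary change looks something like this sketch; the hostname and key values are placeholders, and you would restore the stanza once the test is done.

```ini
# $SPLUNK_HOME/etc/system/local/server.conf -- temporarily comment out the
# clustering stanza to test whether the member is healthy on its own
#[shclustering]
#mgmt_uri = https://sh1.example.com:8089
#pass4SymmKey = <redacted>
```

Restart the SH after editing; if it comes up healthy stand-alone, focus your investigation on the cluster configuration or the network between members.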

A common issue that causes this is RAFT clutter, which occurs when using dynamic captains. When only one site comes back online after a change or upgrade, it is almost a dead giveaway of a RAFT issue, since the election is designed to favor the site with more SHs (hence why it's best practice to always have an odd number of SHs, with more at one site than the other). If a site-specific outage occurs, the boxes at the other site did not get to vote, and thus may think the cluster is gone. You can easily fix this by cleaning the RAFT metadata: [splunk clean raft]. Run this on every SH and they will all hold a new election, giving the secondary site a chance to vote; after a successful election, all the boxes should be back in the cluster.
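On each member, the cleanup sequence looks something like the sketch below; the path assumes a default /opt/splunk install.

```shell
# Run on every SHC member (default install path assumed):
/opt/splunk/bin/splunk stop
/opt/splunk/bin/splunk clean raft
/opt/splunk/bin/splunk start
```

Once all members are back up, they hold a fresh captain election with every site participating.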

About SP6

SP6 is a Splunk consulting firm focused on Splunk professional services, including Splunk deployment, ongoing Splunk administration, and Splunk development. SP6 also has a separate division offering Splunk recruitment and the placement of Splunk professionals into direct-hire (FTE) roles, for companies that need help acquiring their own full-time staff in today's challenging market.