Splunk Guide: Monitoring Data Sources for Outages

Imagine this: it’s Friday evening, and you’re wrapping up your work for the week. You think to yourself, it’s been a good week, a quiet one even. You’re about to leave the office, but you need to pull a quick report of the network traffic throughput from your firewalls to send to your infrastructure team. You run your report as you always do, but this time, you notice the report is empty. You tweak the SPL and run it again, but the report is still empty. Expanding the search’s time range, you see that Splunk stopped receiving logs from your firewalls more than a week ago.

And then it hits you – you haven’t been monitoring your data sources for outages. Even worse, you realize that it’s been a quiet week because your alerts that depend on the firewall logs have been silent. You’ve been flying blind for several days.

Let’s set up monitoring for your data sources to prevent this from happening again.

Why Monitor Data Sources?

Splunk is a powerful tool that can ingest data from a variety of sources. Even if your Splunk environment is set up perfectly, changes in the network or in the systems sending data are sometimes out of your control, and they can cause data source outages that go unnoticed until it’s too late.

The Goal, Requirements, and Assumptions

The Goal

Build alerting in Splunk that will notify you when a data source stops sending logs.

These alerts may not be plug-and-play in your environment, but they will serve as a good starting point for building your own logic. Make sure to test them to confirm they are working as expected. Remember that minimizing false positives is important, but false negatives can be catastrophic.

The Requirements

  • An alert should trigger if a data source has not sent logs within a specified threshold.
  • An alert should trigger if there is a significant drop in the number of hosts sending logs for a data source.

The Assumptions

  • The _time field is being extracted correctly and is accurate. Any timestamp parsing issues, such as incorrect time zone, can lead to false positives or false negatives in your monitoring.
  • Most people associate data sources with index-sourcetype pairs. This is a good starting point, but you can add more granularity to your monitoring as needed (see the sketch after this list). Try to use metadata or indexed fields to keep your searches fast.
  • Proper testing will be conducted to ensure that the alerts are working as expected.
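As an example of that extra granularity, here is a minimal sketch that splits each data source by source as well. It mirrors the complete-outage search built in the next section, so treat it as an illustration rather than a drop-in alert:

| tstats latest(_time) as lastSeenEvent where earliest=-7d latest=now() index=* NOT (index IN ("dev*", "test*")) by index sourcetype source
| eval timeSinceLastEventSecs = now() - lastSeenEvent ``` seconds since each index-sourcetype-source combination last sent an event ```

Because source is an indexed field, tstats can still group by it without scanning raw events. Just be aware that high-cardinality fields (for example, source values that include dates or unique file paths) can multiply the number of rows you have to track and threshold.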

Monitoring Data Sources for Complete Outages

We will start by using tstats to pull the last event time for each index and sourcetype combination that we want to monitor. In this example, we’ve excluded some indexes that we don’t want to monitor (e.g., NOT (index IN ("dev*", "test*"))). You can also exclude sourcetypes that may not be feasible to track (e.g., sourcetype!=*-too_small). We can use this information to determine how long it has been since the last event was seen.

Note: The earliest is set to -7d so that we can include all data sources that have been seen in the last 7 days. Keep in mind that shrinking this time range can cause the alert to miss data sources that have not sent logs in that time frame. For example, if you set earliest=-12h and a data source stopped sending logs 18 hours ago, depending on the frequency of the scheduled search and thresholds being used, an alert may not trigger for that data source.

| tstats latest(_time) as lastSeenEvent where earliest=-7d latest=now() index=* NOT (index IN ("dev*", "test*")) by index sourcetype
| eval timeSinceLastEventSecs = now() - lastSeenEvent
| fieldformat timeSinceLastEventSecs = tostring(timeSinceLastEventSecs, "duration")
| convert ctime(lastSeenEvent)

From here, we can set some thresholds using a case() statement. This can get messy quickly, so if you have lots of different thresholds, you may want to consider using a lookup table (a sketch of that approach follows the case() example below). In this example, we are setting thresholds in seconds based on the index and sourcetype; note that eval comparisons don’t do wildcard matching, so we use like() with % as the wildcard character. If there is no specific threshold set, we default to 6 hours (21,600 seconds).

| eval threshold=case( 
    like(index, "prod%"), 300, 
    like(index, "qa%") AND sourcetype="network:firewall", 300, 
    like(index, "qa%"), 600, 
    like(index, "stage%"), 900, 
    index="astro", 1800, 
    true(), 21600 
    )
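If you would rather manage thresholds outside of SPL, here is a minimal sketch of the lookup approach. It assumes you have created a lookup definition named data_source_thresholds with index, sourcetype, and threshold (in seconds) columns; the name and columns here are just placeholders:

| lookup data_source_thresholds index sourcetype OUTPUT threshold ``` pull the per-source threshold, if one exists ```
| fillnull value=21600 threshold ``` default to 6 hours when no lookup row matches ```

These two lines would replace the case() statement above, and the same pattern works for the percentage thresholds later in this guide. Keep in mind that a basic lookup matches index and sourcetype exactly, so wildcard-style rules like the prod% example would need a WILDCARD match_type in the lookup definition.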

With the added threshold, we can now determine if a data source is down.

| eval status=if(timeSinceLastEventSecs > threshold, "🔴 Down", "🟢 Up") 
| search status="🔴 Down"

Here is what the full SPL looks like:

| tstats latest(_time) as lastSeenEvent where earliest=-7d latest=now() index=* NOT (index IN ("dev*", "test*")) by index sourcetype 
| eval timeSinceLastEventSecs = now() - lastSeenEvent 
| fieldformat timeSinceLastEventSecs = tostring(timeSinceLastEventSecs, "duration") 
| convert ctime(lastSeenEvent) 
| eval threshold=case( 
    like(index, "prod%"), 300, 
    like(index, "qa%") AND sourcetype="network:firewall", 300, 
    like(index, "qa%"), 600, 
    like(index, "stage%"), 900, 
    index="astro", 1800, 
    true(), 21600 
    ) 
| eval status=if(timeSinceLastEventSecs > threshold, "🔴 Down", "🟢 Up") 
| search status="🔴 Down"

You can add more complexity to this search as needed. As you add more complexity, make sure to test your alerts to ensure that they are working as expected. Added complexity can lead to false positives or false negatives. This SPL can be used to set up alerts to notify you when a data source is down. Keep in mind that with this setup, a data source (index-sourcetype pair) will only trigger an alert if it has completely stopped sending logs.

Monitoring Hosts Sending Logs for Partial Outages

In the previous section, we looked at monitoring data sources for complete outages. In this section, we will look at monitoring hosts sending logs for partial outages. This can be useful if you have a large number of hosts sending logs and you want to be alerted if a significant number of hosts stop sending logs.

We will start with a similar tstats search to pull the count of hosts sending logs for each index and sourcetype combination, bucketed by day. We also calculate the average number of hosts sending logs per day, excluding the current (incomplete) day so it doesn’t drag the average down.

| tstats dc(host) as hostCount 
    where earliest=-30d latest=now() index=* 
    by _time index sourcetype span=1d 
| eventstats avg(eval(if(_time < relative_time(now(), "@d"), hostCount, null()))) as per_day_avg_host_count 
    by index sourcetype

From here, we can use a static percentage threshold to determine if there is a significant drop in the number of hosts sending logs. In this example, we are using a 20% drop as the threshold. Again, you can use a lookup table to set different thresholds based on the index and sourcetype.

Note: You can use more advanced statistical functions, such as standard deviation, to determine the threshold if the data fits a normal distribution; a quick sketch of that approach follows.
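Here is a minimal sketch of what that standard-deviation approach could look like. It is only an illustration, not a drop-in replacement, and the two-standard-deviation cutoff is an arbitrary starting point that you would need to validate against your own data:

| eventstats avg(eval(if(_time < relative_time(now(), "@d"), hostCount, null()))) as avg_host_count 
    stdev(eval(if(_time < relative_time(now(), "@d"), hostCount, null()))) as stdev_host_count 
    by index sourcetype 
| eval status=if(hostCount < avg_host_count - (2 * stdev_host_count), "🔴 Down", "🟢 Up") ``` flag daily buckets more than two standard deviations below the average ```

If you go this route, it replaces the percentage-based threshold and status logic shown next; the rest of this guide sticks with the static percentage.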

| eval threshold=case( 
    like(index, "prod%"), 0.1, 
    like(index, "qa%") AND sourcetype="network:firewall", 0.1, 
    like(index, "qa%"), 0.2, 
    like(index, "stage%"), 0.2, 
    index="astro", 0.2, 
    true(), 0.2 
    )

With the added threshold, we can now determine if a significant number of hosts have stopped sending logs.

| eval status=if((per_day_avg_host_count - hostCount) / per_day_avg_host_count > threshold, "🔴 Down", "🟢 Up") 
| search status="🔴 Down"

Here is what the full SPL looks like:

| tstats dc(host) as hostCount    
    where earliest=-30d latest=now() index=* 
    by _time index sourcetype span=1d 
| eventstats avg(eval(if(_time < relative_time(now(), "@d"), hostCount, null()))) as per_day_avg_host_count 
    by index sourcetype 
| eval threshold=case( 
    like(index, "prod%"), 0.1, 
    like(index, "qa%") AND sourcetype="network:firewall", 0.1, 
    like(index, "qa%"), 0.2, 
    like(index, "stage%"), 0.2, 
    index="astro", 0.2, 
    true(), 0.2 
    ) 
| eval status=if((per_day_avg_host_count - hostCount) / per_day_avg_host_count > threshold, "🔴 Down", "🟢 Up") 
| search status="🔴 Down"

Based on the results of this search, you may need to tune the SPL or adjust the thresholds to fit your environment. Keep in mind that the goal is to minimize false positives, but not at the cost of false negatives.
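One tweak worth calling out: the daily bucket for the in-progress day is always partial, so with the search above it will often look like a drop even when nothing is wrong. A minimal fix, assuming you only want to evaluate the most recent complete day, is to add a filter like this just before the status eval:

| where _time >= relative_time(now(), "-1d@d") AND _time < relative_time(now(), "@d") ``` keep only yesterday's complete daily bucket ```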

Conclusion

As you can see, although the task seems simple, monitoring data sources for outages can be quite complex. The consequences of not monitoring your data sources can be catastrophic. It’s important to have some level of monitoring in place to ensure that you are alerted when a data source stops sending logs.

The logic we built in this guide is a good starting point, but you may need to adjust it to fit your environment. Proper testing is key to ensure that the alerts are working as expected. False positives can be annoying, but false negatives can be even worse.

If you are looking for more advanced, out-of-the-box solutions, there are several Splunk apps built specifically for data source and ingestion monitoring that are worth checking out. These apps may be more complex to set up, but they offer much more functionality for handling complex monitoring and alerting needs.

Beyond these solutions, if you want personalized assistance from a team of Splunk core-certified engineers, reach out to our team at SP6!