What is a sourcetype?
If you have any experience with Splunk, you’re probably familiar with the term sourcetype. It is one of the core indexed metadata fields Splunk associates with data that it ingests. The Splexicon definition of sourcetype is “a default field that identifies the data structure of an event. A source type determines how Splunk Enterprise formats the data during the indexing process.”
But what really makes a sourcetype a sourcetype? Most of the time, Splunk users don’t have to think about this because sourcetypes are already defined by Technology Add-ons and Apps. However, when you onboard a custom data source that has no such tooling, you will have to create your own sourcetypes, which requires a deeper understanding of what really makes a sourcetype a sourcetype.
Splunk’s definition provides good general guidelines, but I find it leaves too much room for interpretation. By the end of this article, you should be able to review a custom data source, assess the data, determine how many sourcetypes you will need to define, and create the configurations that make a sourcetype a sourcetype.
Configurations associated with sourcetypes
The most important configuration for a sourcetype, and one you should apply every single time you ingest data, is an explicit sourcetype value in the inputs.conf stanza for that data. (A sourcetype can also be set with props and transforms. It doesn’t matter which method is used so long as a sourcetype is explicitly set.) When data comes into Splunk without a sourcetype explicitly assigned, Splunk tries to create one for it. This can cause non-descriptive sourcetype names, improper line breaking, improper timestamp extraction, and unnecessary processing load on the indexers as they iterate through the data trying a number of approaches to determine these configurations.
Always assign a sourcetype to your data prior to onboarding it.
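As a minimal sketch of what that looks like, the stanzas below assign a sourcetype explicitly. The file path and the sourcetype name acme:app are hypothetical placeholders, not values from a real deployment.

```ini
# inputs.conf
# Hypothetical monitored file and sourcetype name, for illustration only.
[monitor:///var/log/acme/app.log]
sourcetype = acme:app

# props.conf (alternative: assign the sourcetype by source path instead)
# The path pattern below is likewise an assumption.
[source::/var/log/acme/app.log]
sourcetype = acme:app
```

Either approach works. What matters is that the data never reaches the indexers without a sourcetype already attached.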
In addition to specifying the sourcetype, you must also specify the configurations that define the structure of the data. The primary characteristics of the format of an event, and thereby a sourcetype, are timestamp extraction and line breaking of streams of events into individual events.
The backend props.conf configurations that Splunk uses to perform these actions are: TIME_PREFIX, TIME_FORMAT, MAX_TIMESTAMP_LOOKAHEAD, SHOULD_LINEMERGE, LINE_BREAKER, and TRUNCATE.
The first three attributes tell Splunk where to start looking within an event for a timestamp, what format the timestamp is in, and how many characters past that starting point to scan for it. The timestamp is one of the few fields determined at index time, and time-based searches and alerts depend on it, which makes getting it right incredibly important.
The last three props.conf attributes mentioned above determine how individual events are formed. LINE_BREAKER provides a regex pattern that Splunk uses to decide where to break the stream of data it receives into individual events. Without this setting configured, Splunk breaks the stream at every new line and has to merge the individual lines back together into events later. By setting LINE_BREAKER and setting SHOULD_LINEMERGE to false, you remove that merge step from the indexing process and make it much more efficient. The TRUNCATE attribute sets the maximum size, in bytes, of an event for this sourcetype. Splunk truncates anything beyond that limit rather than indexing it, which keeps a malformed or runaway event from consuming license.
Sourcetypes table
Now that you know what configurations make a sourcetype, you need to know how to determine what those configurations should be. Once you determine the configuration values, you can determine which data can share a sourcetype and which ones will need to be broken out into their own sourcetype.
I will display this information in a table to make it easier to reference:
Attribute | Description | How to determine | Example |
TIME_PREFIX | A regex that matches all characters preceding the timestamp of an event | Copy sample logs into regex101 (purge any sensitive info from the logs first) and write a regex | If the timestamp is the first thing in the event: TIME_PREFIX=^ |
TIME_FORMAT | A representation of the timestamp using time variables | Compare the timestamp of the event to Splunk’s documented time format variables | For 2019-04-13T14:00:15: TIME_FORMAT=%Y-%m-%dT%H:%M:%S |
MAX_TIMESTAMP_LOOKAHEAD | The number of characters past the end of TIME_PREFIX that Splunk scans for the timestamp | Count the number of characters in the timestamp | For 2019-04-13T14:00:15: MAX_TIMESTAMP_LOOKAHEAD=19 |
SHOULD_LINEMERGE | True or false to determine if line merging should be done | Always set to false when using LINE_BREAKER | SHOULD_LINEMERGE=false |
LINE_BREAKER | A regex whose first capture group is the separator text dropped between events; whatever it matches around that group identifies what precedes or follows the separator | Copy sample logs into regex101 (purge any sensitive info from the logs first) and write a regex | If each event starts on a new line with a timestamp like 2019-04-13T14:00:15: LINE_BREAKER=([\r\n]+)\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2} |
TRUNCATE | The maximum size, in bytes, of expected events | Review logs, find the largest event, and add a 10% buffer | If the largest event is 90000 bytes: TRUNCATE=100000 |
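Put together, the six attributes form a single props.conf stanza. The sketch below assumes events that begin with a timestamp like 2019-04-13T14:00:15 and never exceed roughly 90000 bytes. The stanza name acme:app is a hypothetical sourcetype, so adjust every value to your own data.

```ini
# props.conf
# [acme:app] is a hypothetical sourcetype name.
[acme:app]
# The timestamp is the first thing in each event.
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%dT%H:%M:%S
# 2019-04-13T14:00:15 is 19 characters long.
MAX_TIMESTAMP_LOOKAHEAD = 19
SHOULD_LINEMERGE = false
# Break on newlines that are followed by a timestamp; the capture group is discarded.
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}
# Largest observed event (assumed 90000 bytes) plus a buffer.
TRUNCATE = 100000
```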
Breaking out sourcetypes
If Splunk is left to its own devices, it may name sourcetypes after the name of the file it’s monitoring. For rolling logs that append a -# to the file name, this results in a large number of distinct sourcetypes for the same data. Similar results occur when one directory holds multiple differently named files that are each given a distinct sourcetype, or when the same data format is monitored on multiple devices and each one ends up with a unique sourcetype name.
In each of these cases (or any combination of them), all the files that share the props configurations you determined above should be configured using the same sourcetype. Simply define the sourcetype’s settings in props.conf once and apply the sourcetype to the appropriate data via any number of inputs stanzas that are required.
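For illustration, here is a sketch of several inputs pointing at one sourcetype whose settings live in a single props.conf stanza. The paths and the acme:app name are hypothetical, and the props values assume events that start with an ISO-style timestamp.

```ini
# inputs.conf (hypothetical paths)
# Rolled files such as app.log-1 and app.log-2 match the wildcard.
[monitor:///var/log/acme/app.log*]
sourcetype = acme:app

# A second directory holding differently named files in the same format.
[monitor:///opt/acme/logs/*.log]
sourcetype = acme:app

# props.conf: defined once, applied to everything tagged acme:app.
[acme:app]
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%dT%H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}
```

Because the sourcetype travels with the data, deploying the same inputs stanza to additional hosts needs no extra props.conf work.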
If the data is wildly diverse and high-volume, you may still want to break the data into several sourcetypes (think WinEventLog). However, for most custom applications this will not be necessary.
How to name a sourcetype
Sourcetypes are one of the few instances where Splunk provides clearly defined guidance for a naming convention. Splunk suggests naming your sourcetypes by the format vendor:product:technology:format, keeping the name as short as possible while still uniquely identifying the data (to read more on this, see: https://docs.splunk.com/Documentation/AddOns/released/Overview/Sourcetypes).
If the data you are onboarding only contains one sourcetype, you could just name it by the vendor or application name. If the data contains multiple sourcetypes that are part of a suite, you could name it suite:application. You can get as specific as you need to, but never be more specific than you need to. Making sourcetype names overly complicated makes typing them more time-consuming and error-prone.
Ah, so that’s a sourcetype!
Now that you know what a sourcetype is, what the main configurations are that need to be defined with a sourcetype, and how to name a sourcetype, you can go forth and onboard your custom applications like a Splunk Professional Services Consultant!
Caveats
Oftentimes, your custom applications will rely on common applications, such as Java or Apache, which will generate their own logs. I’ve seen many clients mistakenly create new sourcetypes named after their custom application and write their own configurations for these types of logs. Carefully review logs to determine if they are truly generated by the custom application itself or if they are the byproduct of a supporting technology that already has a Technology Add-on or otherwise defined sourcetype.
I’ve seen many clients break different log files into their own sourcetypes because the entire log isn’t exactly the same. The decision may have been made to make parsing easier, but there is no reason to do that. As long as the line breaking and timestamping are the same, you can write multiple EXTRACT attributes into props.conf to account for the different log bodies. Because Splunk applies most field extractions at search time rather than indexing them, I generally ingest the logs prior to developing my field extractions.
Once I have the logs in Splunk, I run a search over them and pipe that search into dedup punct. The punct field in Splunk is a handy field that shows the pattern of the first thirty punctuation characters in the first line of the event with which it is associated. I use this to find examples of the unique formats of the logs. Once I have a sample of each unique format, I use them in Regex101 to write my EXTRACT attributes. Just make sure your regex only matches the log format it is intended for and that it fails to match the others in an efficient manner.
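As a rough sketch, two EXTRACT attributes under one sourcetype might look like the following. The sourcetype name, the class names after EXTRACT-, the field names, and both log body formats are hypothetical and exist only to show the pattern.

```ini
# props.conf (search-time extractions; all names below are hypothetical)
[acme:app]
# Body format 1, e.g. "user=alice action=login status=success"
EXTRACT-acme_auth = user=(?<user>\S+)\s+action=(?<action>\S+)\s+status=(?<status>\S+)
# Body format 2, e.g. "GET /api/orders 200 35ms"
EXTRACT-acme_http = (?<method>GET|POST|PUT|DELETE)\s+(?<uri_path>/\S*)\s+(?<status_code>\d{3})\s+(?<duration_ms>\d+)ms
```

Each regex starts with a literal or a tight alternation, so it fails quickly against events in the other format instead of scanning the whole event before giving up.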