Graylog Parsing Rules and AI, Oh My!

In the log aggregation game, the biggest difficulty you face is often setting up parsing rules for your logs. To qualify that statement: simply getting log files into Graylog is easy. Graylog also parses a wide variety of common log sources out of the box, so if your logs fall into one of the many categories for which there is a dedicated Input or a dedicated Illuminate component, or which use a defined Syslog format, then yes, parsing logs is also easy.


The challenge arises when you have a log source that does not neatly fall into one of these out-of-the-box categories. A Graylog Raw/Plaintext input will accept just about any log format you can find, so getting the message into Graylog without parsing isn’t hard.

The difficulty is usually then turning your message from a single block of raw, unsearchable text into a useful array of fields that can be searched and aggregated.

It is difficult to provide a step-by-step process for parsing a log message. Log messages do not obligingly follow a widely agreed-upon format. Developers often make up their own log formats on the fly, and don’t necessarily give much thought to how easy those formats will be to parse later. It follows that the process of breaking log messages down into fields is usually bespoke. It is a running joke in the field that even as technology improves, parsing data that can arrive in a wide array of formats – timestamps in particular – remains very challenging.
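For example, even once you have isolated a timestamp string, you typically still have to tell Graylog its exact layout before it becomes a real date. Here is a minimal sketch using Graylog’s parse_date function (the raw_timestamp field name and the date pattern are illustrative assumptions):

rule "Parse timestamp"
when
    has_field("raw_timestamp")
then
    // Convert a string like "26/03/2023 08:03:32.207" into a proper date
    let ts = parse_date(
        value: to_string($message.raw_timestamp),
        pattern: "dd/MM/yyyy HH:mm:ss.SSS"
    );
    set_field("timestamp", ts);
end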


Since there is no one-size-fits-all approach, and we understand that you are too good-looking and busy for an exhaustive manual on every single approach to parsing, this guide will instead provide quick, useful examples of – and links to – the primary methods of parsing logs. All of the examples below assume that the text to be parsed is in the $message.message field – when lifting Pipeline rules from this guide, remember to replace this field in the code block with the field you are actually parsing.

1. Look for Delimiters

Fields that are consistently separated by a delimiter – a comma, a pipe, a space – are very easy to parse. For example, the message:

Graylog 100 awesome

Let’s say this message lists a piece of software, its review score, and a one-word review summary. The following pipeline rule will parse named fields out of the contents of $message.message (i.e. the message field), delimited by a " " (a space). Changing the character between those quotation marks lets you delimit by other characters. The fields are extracted (and named) in the order they appear.

Rule "Parse fields from message"
when   
true
then
    let pf = split(
           pattern: " ",
           value: to_string($message.message)
           );
set_field("fieldname_1",pf[0]);
set_field("fieldname_2",pf[1]);
set_field("fieldname_3",pf[2]);

end

For example, if the message field is currently “Graylog 100 awesome”, this rule would create three new fields with the current values:

fieldname_1: “Graylog”

fieldname_2: “100”

fieldname_3: “awesome”

Very easy!

We can also change the delimiter to be "," or ", " or "|" as needed by changing the value in the pattern field. Sometimes, though, a message is very nearly consistently delimited, but a few junk characters spoil the parsing. For those cases, here is the same pipeline rule again, except that it first removes any annoying square bracket characters from the message before parsing it into space-delimited fields.

rule "Parse fields from message"
when   
true
then

    let cleaned = to_string($message.message);
    let cleaned = regex_replace(

           pattern: "^\\[|\\]$",
           value: cleaned,
           replacement: ""
   );
    let pf = split(
           pattern: " ",
           value: to_string(cleaned)
           );
set_field("fieldname_1",pf[0]);
set_field("fieldname_2",pf[1]);
set_field("fieldname_3",pf[2]);

end

This technique of “cleaning” values from messages before parsing can of course be copy-pasted to act before any other parsing method.

2. Look for Key Value Pairs

Messages that consist of a list of key value pairs are also very easy to parse.
For example, the message:

fieldname_1=graylog fieldname_2=100 fieldname_3=awesome

Key Value Pairs is also the extraction method you would employ if the contents of $message.message (i.e. the message field) looked like this:

"fieldname_1"="graylog" "fieldname_2"="100" "fieldname_3"="awesome"

Or like this:

fieldname_1='graylog',fieldname_2='100',fieldname_3='awesome'

Or like this:

"fieldname_1","graylog" "fieldname_2","100" "fieldname_3","awesome"

Any consistent format that lists a field name followed by a value is a good target for this parsing approach.
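For the first, plain example above, the defaults of the key_value function are all you need. Here is a minimal sketch (Graylog’s key_value defaults to splitting pairs on spaces, and keys from values on "="):

rule "key_value_defaults"
when
    true
then
    // Defaults: pairs separated by spaces, key and value separated by "="
    set_fields(
        fields: key_value(
            value: to_string($message.message)
        )
    );
end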

There is a nice Graylog Blog post that talks about Key Value Pair extraction in great detail here, and documentation on using the key_value function here. For the reader who is too executive to have time to read a whole blog post right now, here is a pipeline rule that would parse that last example (observe that we are trimming the " character from both keys and values, and that " has to be escaped as \" inside the rule):

rule "key_value_parser"
when
    true
then
    // Pairs are separated by spaces; each key and value are separated by a comma
    set_fields(
        fields: key_value(
            value: to_string($message.message),
            trim_value_chars: "\"",
            trim_key_chars: "\"",
            delimiters: " ",
            kv_delimiters: ","
        )
    );
end

This rule would again create three new fields with the current values:

fieldname_1: “graylog”
fieldname_2: “100”
fieldname_3: “awesome”
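The other example formats just need different delimiter and trim settings. For instance, here is a sketch for the single-quoted, comma-separated variant above (assuming that exact layout):

rule "key_value_parser_single_quotes"
when
    true
then
    set_fields(
        fields: key_value(
            value: to_string($message.message),
            // Strip the single quotes wrapped around each value
            trim_value_chars: "'",
            // Pairs are separated by commas; key and value by "="
            delimiters: ",",
            kv_delimiters: "="
        )
    );
end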

3. Look for JSON Format

JSON-formatted messages are easy to recognize from their structured arrangement of braces, brackets, and commas. JSON logs work nicely with Graylog, since the format provides not only the values but also the field names. Graylog can parse JSON logs very simply using JSON flattening, which is detailed in the Graylog documentation here.

If we take the below JSON message as an example:

{
   "type": "dsdbChange",
   "dsdbChange": {
       "version": {
           "major": 1,
           "minor": 0
       },
       "statusCode": 0,
       "status": "Success",
       "operation": "Modify",
       "remoteAddress": null,
       "performedAsSystem": false,
       "userSid": "S-1-5-18",
       "dn": "DC=DomainDnsZones,DC=XXXXX,DC=XXXX,DC=com",
       "transactionId": "XXXX-XXXX-XXXX-XXXX",
       "sessionId": "XXXX-XXXX-XXXX-XXXX",
       "attributes": {
           "repsFrom": {
               "actions": [{
                   "action": "replace",
                   "values": [{
                       "base64": true,
                       "value": "SOMELONGBASE64ENCODEDVALUE"
                   }]
               }]
           }
       }
   }
}

We can parse this effortlessly with a generic JSON parsing Pipeline Rule, below:

rule "JSON FLATTEN"
when
   true
then
   let MyJson = flatten_json(value: to_string($message.message), array_handler: "flatten", stringify: false);
   set_fields(to_map(MyJson));
end

This will parse all the fields out of the JSON structure, fire and forget.
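If you only need a handful of fields from a large JSON document rather than all of them, Graylog’s parse_json and select_jsonpath functions are an alternative. Here is a minimal sketch against the example message above (the output field names status_code and operation are our own choice):

rule "JSON select fields"
when
    true
then
    // Parse the raw text into a JSON tree
    let json = parse_json(to_string($message.message));
    // Pull out just the fields we care about via JSONPath expressions
    let picked = select_jsonpath(
        json: json,
        paths: {
            status_code: "$.dsdbChange.statusCode",
            operation: "$.dsdbChange.operation"
        }
    );
    set_fields(picked);
end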

4. Look for a Consistent Format for Grok

OK, so your logs don’t follow a format that Graylog can parse out of the box, are not consistently delimited, are not set up in key value pairs, and are not in JSON format. But the format is at least consistent, even if the way the fields are broken up isn’t. There is structure here that we can parse using Grok. For example, the message:

2023-02-22T09:29:22.512-04:00   XXX.XXX.XXX.XXX  <179>50696: Feb 22 13:29:22.512: %LINK-3-UPDOWN: Interface GigabitEthernet1/0/11, changed state to down

This log format is all over the place with its field delimiters, but there is still a consistent pattern of fields: timestamp, ip_address, priority, process_id, event_timestamp, interface_name, and interface_state. In this situation, the easiest way to extract these fields is to use Grok. You can read more about using Grok within a Pipeline Rule in the Graylog documentation here.

Grok might look a bit intimidating, but it’s actually pretty easy once you get started. Online Grok debuggers, such as this one, are your best friend when writing a Grok rule. The key to writing Grok is to focus on capturing one word at a time before trying to capture the next, and to remember that whitespace – including trailing whitespace, which often catches people out – is part of the pattern.
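For example, against the message above you might first confirm that just the leading timestamp matches, lumping everything else into a temporary remainder field, and only then extend the pattern one token at a time (rest is a scratch field name of our choosing):

%{TIMESTAMP_ISO8601:timestamp}\s+%{GREEDYDATA:rest}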

Here is the completed Grok pattern to parse this message:

%{TIMESTAMP_ISO8601:timestamp}\s+%{IPORHOST:ip_address}\s+<%{NUMBER:priority}>%{NUMBER:process_id}: %{MONTH:month}\s+%{MONTHDAY:day}\s+%{TIME:time}: %{GREEDYDATA:interface_name}: %{GREEDYDATA:interface_state}

The pattern was written and tested in the Grok debugger at https://grokdebugger.com/.

Once you have a Grok pattern that works – and check it against multiple examples of the log message, not just one, to make sure it matches consistently – the next step is to convert your Grok pattern into a Graylog Pipeline Rule. Note that every backslash in your Grok string must itself be escaped with a backslash inside the rule, so \s becomes \\s.

Here is the pipeline rule for parsing the message field using this Grok rule:

rule "Parse Grok"
when
   true
then
let MyGrok = grok(
   Pattern: "%{TIMESTAMP_ISO8601:timestamp}\\s+%{IPORHOST:ip_address}\\s+<%{NUMBER:priority}>%{NUMBER:process_id}: %{MONTH:month}\\s+%{MONTHDAY:day}\\s+%{TIME:time}: %{GREEDYDATA:interface_name}: %{GREEDYDATA:interface_state}",
   value: to_string($message.message),
   only_named_captures: true
);
set_fields(
   fields: MyGrok
);
end

5. Nothing is consistent? Time for Regex

If the field you need to extract from your data is really inconsistently placed, and none of these techniques are useful, then it’s probably time to write some Regex.

Regex can be used in Pipeline Rules in much the same way as Grok, though it is better suited to scalpelling out a single tricky field than to parsing a whole message into fields. There is a Graylog documentation page on using Regex in Pipeline Rules here. Regex is especially useful when capturing errors or stack traces, which can sprawl across many lines of text and otherwise confuse your parsers.

For example, the message:

26/03/2023 08:03:32.207 ERROR:  Error in EndVerifySealInBatch()Rep.dingo.Library.Serialisation.dingoHelperException: The exception has occured in one of the dingo Helper classes: ISL_LINK                
Server stack trace:
   at Rep.dingo.Library.Serialisation.DataFrame.VerifySeal(dingoSecurity2 itsSecure, Boolean dyeISRN, Byte[]& native, shipmentType shipmentType)
   at Rep.dingo.Library.MessageProcessor.Incoming.Class1Handler.AsyncVerifySeal(Boolean decryptIsrn, DataFrame df, Byte[]& dfNative)
   at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)
   at System.Runtime.Remoting.Messaging.StackBuilderSink.AsyncProcessMessage(IMessage msg, IMessageSink replySink)
Exception rethrown at [0]:
   at System.Runtime.Remoting.Proxies.RealProxy.EndInvokeHelper(Message reqMsg, Boolean bProxyCase)
   at System.Runtime.Remoting.Proxies.RemotingProxy.Invoke(Object NotUsed, MessageData& msgData)
   at Rep.dingo.Library.MessageProcessor.Incoming.Class1Handler.AsyncVerifySealDelegate.EndInvoke(Byte[]& dfNative, IAsyncResult result)
   at Rep.dingo.Library.MessageProcessor.Incoming.Class1Handler.EndVerifySealInBatch()

If you want to capture the first 3 words after the first occurrence of “ERROR” in your log message, you could use a Regex rule.

We would highly recommend the free online Regex tool available at https://regex101.com/ for the purposes of composing your Regex.

In this example, the Regex rule would be: [E][R][R][O][R].\s+(\S+\s\S+\s\S+)

This would capture the value “Error in EndVerifySealInBatch()Rep.dingo.Library.Serialisation.dingoHelperException:”


Once your Regex rule is working in https://regex101.com/, it is time to put it into a Graylog Pipeline Rule. As with Grok, every backslash in your Regex string must itself be escaped with a backslash inside the rule. Here is the Pipeline Rule for capturing the first three words after the first occurrence of “ERROR” in the message field using this Regex rule:

rule "Regex field extract"
when
true
then
 let MyRegex = regex("[E][R][R][O][R].\\s+(\\S+\\s\\S+\\s\\S+)", to_string($message.message));
 set_field("MyFieldname_1", x["0"]);

end

This rule would create a new field with the current value:

MyFieldname_1: “Error in EndVerifySealInBatch()Rep.dingo.Library.Serialisation.dingoHelperException:”

Very cool!
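A variation: if you wanted everything after “ERROR” up to the end of that first line, rather than just three words, a sketch like this would do it (error_summary is a hypothetical field name of our choosing):

rule "Regex error first line"
when
    true
then
    // Capture everything after "ERROR:" up to the first line break
    let m = regex("ERROR:\\s+([^\\n]+)", to_string($message.message));
    set_field("error_summary", m["0"]);
end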

6. Stuck? Look for Extractors in the Graylog Marketplace

Extractors are a legacy feature of Graylog, providing an interface for extracting fields from messages hitting an input using methods such as Regex and Grok. We recommend against creating your parsing rules in the Extractors interface, as it is rather fiddly and outdated. You can read more about Extractors and how they work in the legacy Graylog documentation here.

Extractors have been around for many years, though, and therein lies the merit of this functionality: the Graylog Open community has created a lot of useful Extractor parsing rules over the years, and these are all available to download from the Graylog Marketplace.

If you require a parser for the complex logs of a common hardware device or software suite, it can be worth checking whether the Graylog Open community has already produced one. Work smarter, not harder: downloading someone else’s ready-made parser is often quicker than writing your own 😎

Be mindful, however, that this option is presented late in this guide because it is something of a last resort. Extractors are a vestigial mechanism, and being community-written and maintained, these parsers carry no guarantee of being correct, up to date, or even working. A bit of TLC is often required to get such content working and up to date.

7. Stuck? ChatGPT can write both Graylog Pipeline Rules and GROK/Regex Parsing… sometimes.

Technology is a beautiful thing! ChatGPT, the AI that needs no introduction, can write Graylog Pipeline rules. It can also write Grok or Regex parsers – just paste in your log sample and ask nicely. This is really useful in theory and can often point you in the right direction, but be warned that in practice the AI will make various mistakes. Rather than entering your requests into ChatGPT directly, we recommend checking out this useful Community tool, which leverages OpenAI’s GPT API and an extensive prompt designed to improve results:

https://pipe-dreams.vercel.app/


AI is far from perfect at these tasks at this stage, but it is still very useful – particularly for showing syntax and structure. Note the tabs at the top left of the tool, which switch between Pipeline and GROK parsing modes.


8. I am still stuck – Parsing logs is hard!

Yes, parsing logs can be hard. If you really get stuck, and you still can’t parse your logs, there are several avenues for assistance you might pursue.

  • If your log message is from a common network hardware device or a software suite with a security focus, maybe we can write it for you! Graylog has a standing offer to create parsing rules for Enterprise Customers in these circumstances, for free and within 30 days. Simply provide the device model, the firmware version, and a sample log file (sanitize it first of course) containing at least 20 lines of log text to Graylog Support, and we will seek to include parsing rules for your device in a subsequent release of Illuminate.
  • Ask for help on the Graylog Community Forums. People do this for fun!
  • For Enterprise Customers, ask Graylog Support for help with a specific rule that you can’t get working. Graylog Support cannot write your parsers for you, but they are more than happy to point out where you might be going wrong if you can provide them with the Pipeline Rule in question.
  • For Enterprise Customers, ask your Customer Success Manager about a Graylog Professional Services Engagement. Professional Services are not free, but it never hurts to have the option to call in the experts for a day to write your parsing rules, should you need it!

