
Log Wrangling: Make Your Logs Work For You

Senior Sales Engineer Chris Black walks users through 'log wrangling'. Drawing on his expertise, Chris compares logs to livestock and provides strategies to manage them as effectively as a wrangler handles a herd. Topics discussed include ways to understand and maximize the utility of logs, the complexities of log wrangling and how to simplify the process, and the significance of data normalization. He also touches on organizational policies, the importance of feedback mechanisms in resource management, and key considerations when choosing your log priorities. This video is recommended for anyone looking to bolster their understanding and effectiveness in log management.

Introduction to Log Wrangling

Chris Black here. I am a senior sales engineer at Graylog. Today we’ll be talking about log wrangling. 'Wrangling,' in the colloquial North American sense, usually refers to the handling of livestock of some sort. And, unfortunately, logs are a lot more like livestock than we’d like them to be. You have to get them corralled, branded, and organized, move them in the same direction, care for them, feed them, and sometimes it’s not as easy as it seems. We’re going to talk about how you can make the most out of your logs and make them do the most for you.

You’ve seen the table of contents; it’s in the description, so we’re not going to go over that, but we’re just going to try to talk about what it is you need to do, or you can do, to get the most out of your logs. To understand them better, to make them do more for you, to offer more information to people that have to read them.

Understanding the Complexity of Logs

Why are we going to do this? Because logs are hard. They just are. While I was working on this presentation, Randy Franklin Smith put out a marketing piece, and this quote was in there; I found it on his website. A lot of companies have tried to put out a lot of products over the years, and I’ve worked for several of them, but no matter how hard they try, none have ever been able to make SIEM, or even centralized log management for that matter, simple.


The Challenges of Log Data

It’s not simple. It just isn’t. Why is that? I can give you Randy’s reason for that. For him, he says, to begin with, it’s a tall order to detect breaches from raw log data, because raw log data is very raw. Log data is cryptic, it’s redundant, it’s poorly formatted, and it varies widely from source to source. Log data is also untrustworthy.

You need to know so much about the idiosyncrasies of each of these log sources in order to interpret it, that it’s very difficult. And then there’s the problem with scale. He says that logs are huge in three-dimensional terms. The dimensions he talks about are the volume of events that you need to process every second, the volume of storage needed to archive that data, and then just the sheer quantity of endpoints that you have to deal with.

All of those are challenges, but to that, I’ll add a few of my own. I think we often don’t even know what information is available inside those logs, or what those logs could tell us. There’s a lot in there, and if you’re not an expert in 25 different products, you may not know exactly what’s in there. A lack of context: it’s not always easy to understand how a message fits in with other things, or whether other types of information might make what’s in this message clearer to us. Logs are messy. He says they’re very raw, but they’re messy. Sometimes they’ve got a lot of information in them that’s usually not very relevant.

It’s not always even put in the same order. Some vendors don’t even write their own logs in a consistent format. A low signal-to-noise ratio is a big problem, because there’s a lot of data, but not all of it is useful. You’ve got to be able to pick those apart: to tell the important information from the noise.

The lack of effective leadership, or executive leadership, is another one. The policies that are in place can do a lot to make it easier to be effective with your SIEM or log management. And the last is misaligned incentives: we sometimes set things up in ways that work against us. We’ll go over what that means, but there are things you can do to help with that.

Discovering the Value in Your Logs

What can the logs tell us? I mentioned that we often don’t know what’s in there. There are categories of information, so you can get:

  • Authentication information
  • Network information
  • Identity access management
  • Endpoint
  • Application


Those are all categories. There are fields within that have specific information of that type.

But you don’t always know ahead of time what those are, and there are a couple of ways you can find out. You can collect those logs and pore through every single one yourself, see what kind of information is in there, make notes, come up with an idea of what’s in each of them, and derive your use cases from that.


Utilizing Online Resources for Log Analysis

Or you can go to some of these online resources, because the community, and sometimes the vendors themselves, offer them. The first one I’ve got on there is ultimatewindowssecurity.com. Randy Franklin Smith, whose quote I referenced a minute ago, has an outstanding site there.

If you haven’t seen it, I’m going to put a shot up here, but Google it: Randy Franklin Smith’s Ultimate Windows Security encyclopedia. Great resource. Cisco: there are a lot of people that have Cisco out there, and Cisco has a guide for their log formatting. It describes every single event ID, what’s in each one of them, and what kind of value is going to be there, and it can be very helpful if you’re not terribly familiar with those products.

Palo Alto has one that you’ll see. Plesk has got the Linux logs explained. Fortinet’s got their own. Check Point, you name it: most vendors have some kind of resource to help you understand what data is available and what format to expect it in, and that can help you a lot when you’re trying to decide what information is going to be important to you and your team.

So this is the screenshot for that Ultimate Windows Security page. You’ll notice here you’ve got all the different types of Windows logs that are available, and to the right you’ve got the different versions. So you can cross-reference Windows 2016 to Windows, gosh, Windows XP or whatever.

You can go through and cross-reference those. You can get the event IDs. You can get a description, a human-readable description of what that event does, or how it relates to other ones, or how it’s identical to other messages, or in the same category as other messages. If you don’t know Windows very well, this is a fantastic resource.

It’s been around for 20 years, and it just keeps getting better.


Enriching Log Data for Better Analysis

I talked about adding context. You can add context to that data. We call that enriching data. There’s a lot of different kinds of enrichment. We’re going to go through what some of those are, some of the details and what you can get out of it.

A lot of different categories. We’ll go through each of them specifically. Some of that stuff is coming from external sources. Some of it comes from internal sources, and you use it in different ways.

So what does WHOIS and geolocation information give us? It gives the analyst the ability to connect geography to an IP address, so you have some idea of where traffic is coming from. It’s not perfect; there are all sorts of ways to get around it, but it’s still useful to have some idea. If someone’s coming in from Serbia and you are a junior college in Kansas, that’s probably not traffic you’re likely to see very often.

Reverse DNS lookups: just being able to tell the analyst what domain an address resolves to can be very helpful. GeoIP we mentioned. If you look at this example, we’ve added GeoIP information to the log, and it gives us not only the country, it gives us the cities, and also autonomous system (AS) information, so you can see where it’s coming from and what company owns that AS.
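To make that concrete, here is a minimal Python sketch of enrichment: joining a log record against geo and ASN lookup tables. The table contents and field names below are invented for illustration; in Graylog this would normally be done with lookup tables or Illuminate rather than hand-written code.

```python
# Sketch of log enrichment: attach GeoIP/ASN context to a parsed log record.
# The lookup tables here are tiny hard-coded stand-ins; in practice this data
# would come from a MaxMind database or a Graylog lookup table.

GEO_DB = {"203.0.113.7": {"country": "RS", "city": "Belgrade"}}
ASN_DB = {"203.0.113.7": {"asn": 64496, "as_org": "Example Telecom"}}

def enrich(record: dict) -> dict:
    """Return a copy of the record with geo/ASN fields added when known."""
    out = dict(record)
    ip = record.get("src_ip")
    geo = GEO_DB.get(ip)
    if geo:
        out["src_ip_country"] = geo["country"]
        out["src_ip_city"] = geo["city"]
    asn = ASN_DB.get(ip)
    if asn:
        out["src_ip_as_number"] = asn["asn"]
        out["src_ip_as_org"] = asn["as_org"]
    return out

log = {"src_ip": "203.0.113.7", "action": "login-failed"}
print(enrich(log)["src_ip_country"])  # RS
```

The analyst now sees country, city, and AS owner alongside the original event, without having to pivot to a separate tool.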

And that can be very helpful to the analyst. Some other types: AD lookups. These days, Active Directory is still the center of the universe for a lot of places; it’s the single source of truth for authentication and authorization. You can look up data within Active Directory to provide, say, the user’s manager or the group that they belong to.

There are all sorts of other objects stored in Active Directory that can be helpful to an analyst, or even a help desk person; it doesn’t always have to be a security analyst. Internal lookups: you can add things like location information. What building are they in? What campus are they on?

What floor are they on? What department do they belong to? That can all be useful information when you’re trying to evaluate what you’re seeing in a log. Asset information can give you the owner of that resource. It can give you criticality for that resource. It can tell you the vulnerabilities of a resource.

Let’s say you’ve got a WAF and you’re seeing an attack against a server. If you know for a fact that you don’t have the vulnerability for the exploit they’re attempting to use against you, you can downgrade that alert. You don’t have to pay attention to it with the same level of attention that you would if you know for a fact you’re vulnerable to it.

If they’re hitting you with a database attack and there’s no database behind it, okay: it goes from an 8 down to a 3. You can make everybody’s job a whole lot easier by adding that information. The last one, and this is always the sexiest, is threat intelligence. There are two types of threat intelligence that people talk about.

Indicators of compromise: I’m going to have a list of IP addresses, or a list of domains, or a list of hashes, something that gives me values I can match against. I store them in lists, and everything that comes in, I check every IP address against that list, or check the domain against it, or malware hashes, so that I can see that this is, in fact, the malware that’s been going around.
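That indicator-matching idea can be sketched in a few lines of Python. The indicator values and field names here are made up for illustration; a real deployment would feed these sets from a threat intelligence feed and do the matching in the log pipeline.

```python
# Sketch of indicator-of-compromise matching: check each event's IP and file
# hash against sets of known-bad indicators. Sets give O(1) membership tests,
# which matters when every inbound message is checked.

IOC_IPS = {"198.51.100.23", "203.0.113.99"}          # example bad addresses
IOC_HASHES = {"44d88612fea8a8f36de82e1278abb02f"}    # example malware hash

def flag_iocs(event: dict) -> dict:
    """Mark an event if any of its indicators appear in the known-bad sets."""
    out = dict(event)
    out["threat_indicated"] = (
        event.get("src_ip") in IOC_IPS
        or event.get("file_hash") in IOC_HASHES
    )
    return out

print(flag_iocs({"src_ip": "198.51.100.23"})["threat_indicated"])  # True
print(flag_iocs({"src_ip": "192.0.2.1"})["threat_indicated"])      # False
```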

That’s one sort of threat intelligence. It’s useful; maintaining that list and maintaining the lookups comes with its own challenges, but it can be useful. And then there’s the other type of threat intelligence, which offers more generalized context: threat actor information, campaigns.

A lot of times you’ll see these in threat intelligence platforms. You may get from them the list of indicators of compromise, but when you find one, you click the link to investigate it, and you’ll see what that IP has been associated with as far as campaigns, and which threat actors have used it.

All of that information can help you when you’re evaluating an attack, to see if it looks like something that’s been seen before. If a threat actor usually goes after nation-states and they appear to be going after you, then maybe it’s not that threat actor. Or if they focus on utility companies and you are a sausage factory or something like that, you may be able to draw useful conclusions from that.

The problem with enriching data, after that list I gave you of six or eight things you could add: you add all of it to the logs, and then you pay to store them. That causes what I call log bloat. Those logs become much bigger than they were when they were collected, and you have to pay not just to collect them, but to store them over the long term.

So maybe that’s not the best way to do it. With Graylog, we offer something we call decorators. Decorators are a way to add that information at display time, letting the user see it without appending it to the raw logs. That preserves the integrity of the original log data.

It also gives users the context they need, and it’s more user-friendly because they have to go to fewer places to evaluate what they’re looking at. Decorators are something I don’t see in a lot of other products, and they’re extremely helpful if you don’t want to store that data.


The Importance of Parsing and Normalizing Log Data

In addition to enriching data, you do have to parse data (and I probably should have put this before enrichment, because if you enrich data before you parse it, it’s a lot less useful). This is usually a point of pain for a lot of people, because the vendors make these logs so messy. They often don’t put them in a consistent format, or they don’t follow the rules.

Yes, there’s an RFC for syslog. No, we don’t follow it. Parsing that data can be a challenge. Why would you want to? Why do you care? If it’s that much work, why do you care? What do I get out of it if I put the effort in? The answer is you get simpler queries. You get faster searches. And it sucks less to have to use it.

The reason I say that is this is an example of an unparsed log. Okay? It’s structured. That’s not too bad. It’s not like you’ve got to pull it apart with regex, but it’s not that easy to find what you’re looking for. You’d have to go through here and pay very close attention to go find the key-value pair that you’re looking for and pull out the information that’s relevant.

Writing a query against this would be difficult at best. So would you rather look at a log that looks like this, or one that looks like this? Because this is what it looks like when you parse it. You put it in nice, neat fields. You can scroll up and down. You can write queries looking for one specific thing. You can aggregate them, graph them, make pie charts and bar charts to your heart’s content. And it just makes it so much easier.
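As a rough illustration of what a parser does with a structured-but-unparsed line like that, here is a Python sketch. The sample log line and field names are invented; in Graylog the equivalent work happens in inputs, Illuminate content, or pipeline rules.

```python
# Sketch of parsing: split a space-separated "key=value" log line into named
# fields so each value can be queried, aggregated, and graphed on its own.

def parse_kv(line: str) -> dict:
    """Split 'key=value' pairs separated by spaces into a field dict."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

line = "srcip=10.1.2.3 dstip=203.0.113.7 action=deny proto=tcp dstport=3389"
fields = parse_kv(line)
print(fields["action"])   # deny
print(fields["dstport"])  # 3389
```

Once the line is broken into fields, a query like `action:deny AND dstport:3389` becomes trivial, instead of pattern-matching against the raw string.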

In Graylog, you’ve got a couple of different choices about how to do your parsing. The first is using inputs. We have a number of inputs that have the format built in. If you’re using a CEF input, it’s using the Common Event Format. If you’re using the GELF input, it’s using the Graylog Extended Log Format.

Some inputs are tied to specific products: for Palo Alto, we’ve got the parser attached to the input. Other products we don’t parse on the inputs; we use our Illuminate content. If you are an open-source user, you may not be familiar with Illuminate. If you’re a commercial customer…

Then you are familiar with Illuminate, because you get it with your product. Illuminate is a series of parsers; along with the parsers, you get dashboards and some alerts, and you get the Graylog Information Model, which I’ll talk about a little bit more in a minute. For products that aren’t covered by either of those two processes, we do parsing via pipelines.

So you may or may not be familiar with pipelines. I would encourage you, if you’re not familiar with them, to become familiar with them. There’s some new things coming along that are going to make pipelines a whole lot easier, but they are the way, and the best way, for you to parse logs for yourself, if it’s not something that we do for you out of the box.

Once the data gets parsed, the next step that will give you the most value is to normalize it. Normalization is the process of standardizing the data so that, no matter how it comes in, it gets stored in a consistent format: like data is stored in the same field, regardless of where it came from.

What do I mean by that? If you’ve ever had to do this yourself, if you’ve got more than 5, 6, 7 products, you’ve probably got some variation of this. Every vendor uses a different way to describe the same data. Source IP could have 5, 6, 7 different ways to record it. If I want to write a query, it all has to be in the same field.

Otherwise, I’m writing really ugly, long queries with lots of OR’s, and it becomes a lot harder to do. If I could put it all into one field called source IP, I can do a query that says, show me the source IP, and it doesn’t matter if it came out of a firewall, or a server, or an EDR, it doesn’t make any difference.
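Here is a minimal Python sketch of that normalization step. The list of vendor spellings is invented for illustration, and the canonical field name is just an example; Graylog’s Illuminate applies its own schema (the GIM) for you.

```python
# Sketch of normalization: map each vendor's spelling of "source IP" onto one
# canonical field so a single query covers every product.

SOURCE_IP_ALIASES = ["src", "srcip", "src_ip", "SourceIP", "source_address"]

def normalize(record: dict) -> dict:
    """Copy whichever vendor-specific source-IP field exists into source_ip."""
    out = {k: v for k, v in record.items() if k not in SOURCE_IP_ALIASES}
    for alias in SOURCE_IP_ALIASES:
        if alias in record:
            out["source_ip"] = record[alias]
            break
    return out

print(normalize({"srcip": "10.0.0.5", "action": "deny"}))
# {'action': 'deny', 'source_ip': '10.0.0.5'}
```

After this step, "show me the source IP" is one query against one field, regardless of which product emitted the event.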

It’s all source IP. Normalization is incredibly important to do, but it’s also pretty hard, and it’s something a lot of people don’t do. Why would you want to? What do you get when you normalize data? The first thing is that it just makes things easier: it streamlines your log management.

You end up having a much neater environment, and your signal-to-noise ratio goes way up. It enhances your security because now you can manage your logs efficiently. It reduces your storage cost, for one thing, because storing like data is more efficient; you’ll get better compression ratios.

It also reduces your costs because you don’t spend as much time getting the answers that you need. It simplifies the analysis because you now know where to go look for data. It’s not hidden in different spots for different products. You can do things like aggregating across large groups of products, and it works, unlike when you don’t have it normalized.

And it just overall improves your efficiency. How do you do it? Okay. You can do it yourself. If you’ve ever tried to normalize data yourself, it is a project. It is something that most people do as an ongoing project, and it’s something that they never quite finish, because just as soon as you get them all normalized, somebody wants you to add another one, and then you’ve got to normalize that one into it.

And there’s a lot of work that goes into it. Or, you can have your vendor do it. In the case of Graylog, we have a schema that you can see at schema.graylog.org, called the Graylog Information Model. You’ll hear me call it the GIM, but that’s just shorthand. The Graylog Information Model is our schema: the fields and the category information, which I’ll talk about in just a minute, that we use to normalize our data. When data comes in, if you’re using Illuminate, we apply this normalization scheme to every single message. So no matter what a product calls it, we’re going to map it to source_ip, or destination_ip, or whatever the case may be. And we’re going to do all of that for you, which is a huge advantage, and we do it on the way in the door.

We don’t wait until after it’s come in to try to do it; in the way that Graylog works, you’ve got to do it as it gets in there. So we’re using pipeline rules and things like that to normalize that data for you. If you go to schema.graylog.org, you’ll see this. This is just an example of some of the fields that we’ve created.

You’ll see a number of different fields, the field type, and some sort of example value. This can give you an idea, before you’ve ever seen these messages, of where the data is likely to be and what kind of data is likely to be in it. For those products whose event messages you don’t see very often (if an event doesn’t fire unless something bad happens), you can at least have some idea of where it’s going to be stored and what it’s going to consist of.

I wanted to point this part out. Remember I mentioned the Graylog Information Model. If you’ll forgive me, I’m going to get a little bit meta on you. By that, I mean that in addition to normalizing these fields, we are actually adding metadata for context. And this is all about providing context.

This allows two things. An analyst can write queries that cross all devices, regardless of silos. Doesn’t matter if it’s a firewall or a server or a switch. For example, you could ask for all failed logins from a user and get every attempt on every single system that user touched. Centralized, distributed, doesn’t matter.

If it’s authentication and it’s had this schema applied to it, it’s going to show up in an authentication request. And these categories, you’ll see, and I’m going to talk about it a little bit in a minute, you’ve got categories, you’ve got subcategories, and you’ve got event types. Categorizing data can give the analyst, at a glance, an idea of what any individual message is.

If you’re an analyst that’s got 30 products they’re responsible for, you can’t be an expert in every single message. You cannot see every single message, and you don’t know what the messages mean, necessarily, the first time you look at them. So without having to go look them up, that categorization information is within the message itself.

Okay, so it’s a human-readable description that’ll make it a whole lot easier for you to know: oh, that’s an authentication message, or that’s an identity and access management change, or a network message. This is just an example of what those categorizations look like. You’ll notice the subcategory is authentication default.

And then the event type tells you it’s an authentication message; a name resolution DNS request is a DNS query, etc. This just makes it a lot easier to understand what you’re seeing. You can also write queries like: show me all IAM object-create events, every time a user was created on any system.

That’s a difficult query to write if you don’t have something like that. This metadata gives you a better understanding of what you’re looking at. It gives you a 10,000-foot view of the logs that you’ve gotten so that you can write queries that cross these categorical barriers rather than sticking product by product the way that you may be used to doing.
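To illustrate how category metadata enables those cross-product queries, here is a toy Python sketch. The mapping, field names, and product labels are stand-ins for the Graylog Information Model’s real categorization, which Illuminate applies for you.

```python
# Sketch of category metadata: tag each message with category/subcategory/
# event-type fields so an analyst can ask for "all authentication events"
# without knowing every product's event IDs.

CATEGORY_MAP = {
    # (product, event id) -> (category, subcategory, event type)
    ("windows", "4625"): ("authentication", "authentication.default", "logon failure"),
    ("linux-sshd", "failed-password"): ("authentication", "authentication.default", "logon failure"),
    ("windows", "4720"): ("iam", "iam.object create", "user account created"),
}

def categorize(event: dict) -> dict:
    out = dict(event)
    key = (event.get("product"), event.get("event_id"))
    if key in CATEGORY_MAP:
        cat, subcat, etype = CATEGORY_MAP[key]
        out.update(event_category=cat, event_subcategory=subcat, event_type=etype)
    return out

events = [
    {"product": "windows", "event_id": "4625", "user": "alice"},
    {"product": "linux-sshd", "event_id": "failed-password", "user": "alice"},
]
# One query now crosses both products, regardless of silo:
failed = [e for e in map(categorize, events) if e.get("event_category") == "authentication"]
print(len(failed))  # 2
```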


Reducing Log Volume and Refining Data

There are reasons you might want to reduce your log volume. You might just want to refine what gets produced. If your network team puts Cisco devices on informational, you’re going to get a whole bunch of messages: builds and teardowns. That might or might not be useful to you.

If you don’t use them, you’re generating 50 percent more traffic than you need. You can reduce that, and it costs you nothing: you’re not going to have any less capability, because you weren’t using those messages to begin with. That leads into the second reason, which is to filter out noise. And to filter out noise, you can drop messages.

So if you can’t stop someone from sending them to you, you can just drop them when they arrive. You don’t have to collect them, and you don’t have to store them; it’s a very simple pipeline rule. You can also remove redundant data from the message itself but keep the whole message. In Graylog, you’ll notice that the message field often contains all of the information from the parsed fields, but in its original form.

If you don’t need that original form after you’ve parsed it, rewrite the message field. You can’t delete it in Graylog, but you can rewrite it, and that reduces the size of the message, so when it gets stored, you’re storing the parsed data. Even if you’ve got compliance requirements to meet, that parsed data still carries the original information.

You can show auditors how the data was processed, how it was changed, and how it still contains the same information it did in its original format, and that’ll get you past most audits. You can also remove fields you don’t need. The example I give a lot is stack traces. If you are not a dev but you have to monitor the dev systems, there may be three or four fields in that message that matter to you, while the bulk of the message is a blob of text that takes up one field. If you don’t need that data, delete it.

The way that Graylog does that, as I mentioned for parsing, is also through pipelines. Pipelines are the way for you to not just enrich your data, but also trim it, or drop it, should you need to. Here’s an example of one of those pipelines. I’m not going to go into real detail, but you’ll notice at the bottom that it puts the message into a stream, takes it out of a stream, and then removes a field, full_message, because in this case the full_message field contains another copy of the parsed data. By removing the full_message field, you can cut the size of that message in half. This is an example of a pipeline that would accomplish that for you.


Implementing Organizational Policies for Effective Log Management

Organizational policies might not be the first thing you think of in a technical talk, but they’re crucial for teams due to their function as air cover.

– When management dictates how other groups, which you don’t control, must operate, it aligns more easily with the company’s business objectives.

  • If it’s just the security team or the ops team making these demands, they’re often dismissed because it’s not the other groups’ idea.
  • Policies can simplify your life by setting conditions everyone must follow, so you don’t have to argue—it doesn’t come across as personal.
  • Your information security policy should require logging for the organization. Without this requirement, there’s room for pushback.
  • Data classification helps you determine what’s important and establish criticality.
  • The incident response plan should outline how teams will interact with the SIEM or log manager during an incident, detailing responsibilities and access.
  • Access control policies should specify the necessary information for monitoring access control events, like logging all failed logins, to avoid disputes.
  • Vendor security policies can dictate logging requirements for vendors’ products or services before purchase.
  • Network and cloud security policies set logging requirements, coming from the top down, providing you with support when dealing with other groups.


Establishing Feedback Mechanisms to Regulate Logging

Feedback mechanisms address the common issue where one group is responsible for logging, but everyone consumes the logs.

  • This can lead to a tragedy of the commons, where the shared resource of logging is overused, depleting it for everyone.
  • To prevent this, you need a way to make groups self-regulate their logging.
  • Chargebacks are a common solution, where groups pay for their share of the logging resources based on usage.
  • This encourages groups to consider the cost of their logging and to prioritize it.
  • If a group, like the dev team, is responsible for a large percentage of logs, they should bear the cost of the logging solution.
  • Visualization tools can help track which groups are contributing most to the logs.
  • When it’s time to expand the logging solution, costs can be divided based on usage, encouraging responsible logging practices.



Setting Organizational Policy for Developers

For companies with in-house developers, organizational policy should mandate that logs are output in a structured format.

  • This simplifies the process of managing logs and prevents unnecessary bloat.
  • You can adopt a default ‘no’ policy, where logging is only permitted if there’s a clear business case for it.
  • Alternatively, use a functional policy based on use cases, focusing on authentication, access control, and access to critical resources.

– Always decide what to log ahead of time, rather than turning everything on and sorting it out later.

– Security is about risk management; focus on the threats that could cause the most harm, not just what’s most likely to occur.

– Understand your resources, threats, and potential exploit methods to inform your logging priorities.


Thank you very much.