Deciding what to send into your SIEM is an important task. You need to decide what your priorities are and what types of security-related events you wish to monitor. Understanding the most important sources to monitor is a good place to start. Watch this video to get some good ideas about where you need to be.
Introduction
Hello, my name is Joel, and I’m part of the solution engineering team at Graylog. Our role involves assisting customers and prospective customers with their use of Graylog, which can range from architecting clusters to determining log sources and volumes, and ironing out the details of implementation. Every day, I collaborate with Graylog users to maximize the value Graylog provides to them.
Today, we’re going to discuss what to feed into your first SIEM. You might currently have no SIEM, a suboptimal one, or perhaps you’re using Graylog merely as a log management platform without reaping significant benefits. We’ll explore what should be included in a standard environment and how to evaluate data sources as you expand from a minimum viable SIEM to a mature, comprehensive solution.
Building Blocks for SIEM Data
Let’s begin with the building blocks necessary before feeding any data into your SIEM. Presumably, your tool of choice is Graylog since you’re attending this session. We’ll discuss the rules for deciding what should and shouldn’t be fed into Graylog and then delve into specifics about different data sources and their importance. We’ll briefly touch on data enrichment, directing you to Chris’s later talk for an in-depth look at the process within Graylog. Finally, we’ll conclude with a summary and a Q&A session, addressing questions throughout the presentation as appropriate.
Foundational Elements
The foundational elements we need include:
- A tool, likely Graylog.
- Data normalization, which involves creating a consistent data schema for the fields used in your SIEM. This allows for effective data correlation. The Graylog Information Model provides a schema for content produced under Graylog Illuminate, which you can adopt even if you’re not using Illuminate (a minimal example follows this list).
- Synchronized timestamps across your logs, ensuring accurate time representation and considering time zone information.
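To make that normalization point concrete, here's a minimal sketch of the kind of pipeline rule you might write in Graylog. The incoming field names (`src_ip`, `userName`) and the target names are illustrative only; check your actual inputs and the Graylog Information Model documentation for the exact schema field names.

```
rule "normalize auth fields to schema names"
when
  // Only touch messages that still carry the vendor's original field names
  has_field("src_ip") || has_field("userName")
then
  // Map vendor-specific names onto the consistent schema used across the SIEM
  rename_field("src_ip", "source_ip");
  rename_field("userName", "user_name");
  // Record which product the event came from, so correlations can filter on it
  set_field("event_source_product", "example_vendor");
end
```

Once every source writes into the same field names, a single search or correlation (say, on `user_name`) works across all of them.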
Rules for Data Inclusion
The rules for deciding what data to include are:
- Whether the data will trigger alarms or alerts directly or as part of a correlation.
- Regular usage for reports or investigations.
- Necessity following an incident, where failing to produce the data would be problematic.
- Data enrichment potential.
- Regulatory requirements.
Importance of Authentication Data
Now, let’s address authentication data, which is crucial and can originate from various sources. We need to answer questions regarding:
- User identity and least privilege access.
- Appropriate access locations and times.
- Account rights alterations, group changes, account creation and deactivation.
These considerations are generally applicable, especially in smaller environments, and are essential for maintaining security and oversight. Now, what are we going to use to answer them? Authentication logs. The big sources are pretty straightforward.
We’re probably going to get this information from our domain controllers if we have them, from Azure Active Directory (Entra ID), something like that. The good news for most of these sources is that the information should come in a fairly structured way. Unlike some of the things we’ll look at in the networking category later on, which tend to be really gnarly log messages with no form or structure at all, where we have to figure out what’s inside of them, the information we get from these sources is pretty easy for Graylog to digest. It will probably come in already parsed, already in fields.
Now, we may need to take this information and convert it into our data schema. The field names may not necessarily match up, but it should arrive in some kind of structured format. If we’re bringing it in from domain controllers using Winlogbeat or NXLog, it will come in structured. From Azure, it will come in structured. Most of these sources give us structured data, which makes it really easy to ingest and treat as a central part of the picture. You may need to turn on some additional auditing functions, though. On your domain controllers, for example, you may want to enable extra auditing to get the group changes and that kind of information coming in, but it is incredibly valuable, so it’s definitely worth doing.
So on the authentication piece, let’s make sure we’re bringing in every authentication event, obviously. That applies to all of the devices you bring in, and even as we look at networking, there will be authentication events happening there too. What we’re talking about at this stage is really the sources of truth: your domain controllers and those central authentication providers, which are the only places we can get information like group membership changes. That’s what we want to get from them.
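As a hedged example of the kind of auditing event you'd want from those sources of truth, a pipeline rule could tag Active Directory group membership changes for later alerting. The field name `winlog_event_id` is an assumption about how your Windows input names things, and the event IDs shown (4728/4732/4756, the "member added to a security-enabled group" family) should be checked against your own environment.

```
rule "tag AD group membership changes"
when
  // Assumed field name; adjust to whatever your Winlogbeat/NXLog input produces
  has_field("winlog_event_id") &&
  (
    to_string($message.winlog_event_id) == "4728" ||
    to_string($message.winlog_event_id) == "4732" ||
    to_string($message.winlog_event_id) == "4756"
  )
then
  // Normalize to one action field so alerts and dashboards key off a single name
  set_field("event_action", "group member added");
end
```

An alert can then key off that one normalized field regardless of which domain controller the event came from.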
Endpoint Monitoring
The next thing we want to talk about is our endpoints. Endpoints can be really tricky; they can become unmanageable if we don’t approach them the proper way, simply because of the scale involved, since we may have hundreds and hundreds of workstations. The question is really: do we need to be monitoring everything that’s happening on those workstations? In some cases yes, in some cases no. Let’s dive in and see what we need to think about when we make that decision. The things we want to be able to answer about our endpoints:
- Is the endpoint healthy? That’s a very easy thing to alert on.
- Is something going on in the endpoint that we think is nefarious? That’s something we definitely want to know about right away. The details of exactly what we’re looking for are a completely different story; that part isn’t simple. But certain things are absolutely red flags, and we’ll be able to look for those on day one.
- Has the endpoint been altered? If we see a certain kind of change, say a new application was installed, we want to know right away when that happened.
- Lastly, can we keep the full history for when we do need to go back and do an investigation?
Now, this is where it really comes down to how much you’re going to bite off at the beginning in terms of the scale of endpoints you monitor. You’re definitely going to want to start with the crown jewels: things like your domain controllers, where you want to monitor everything that’s happening; all of those crown-jewel servers; possibly the entire data center. All of the servers running inside your data centers are the things you want to start with, and then maybe you move out from there.
Now, one thing about this: there are some ways we can cheat at what we’re collecting, which may make this less complex. When we look at what we’re going to collect, the best-case scenario may be to collect the machine event logs directly, pulling from the Windows event logs, pulling straight from our Linux devices, those kinds of things. However, if we’re deploying an agent out to 600 workstations to collect all of those logs, how are we going to ship them back?
- Are they always online to ship them in real time?
- How are we handling all that?
It can get complex. If we don’t need that level of detail, then maybe there are some shortcuts we can take, and those shortcuts may come from Endpoint Protection. Maybe we’re running MDM on some of these devices. In a lot of cases now, Endpoint Protection and MDM are already doing some amount of log collection and centralization.
If you’re using a cloud Endpoint Protection product, there’s a very good chance that instead of having to get the logs from every single device, we can just connect to the cloud end of that equation and pull the fire hose of all the logs. Now, when I say fire hose, I will bring up one thing to keep in mind: you always have control in Graylog of which messages you choose to keep. We don’t always have that control at the source. In some cases, yes, we can say we only want to send a particular event ID; with something like Winlogbeat on a Windows machine, we can control very finely what gets sent and what doesn’t. But in other cases we may not. With endpoint protection in the cloud, our only choice may be “send the logs” or “don’t send the logs.” You always have that control in Graylog pipelines, though, to decide what you want to keep, and it’s okay to get rid of some of the stuff you know you’re not going to need; you can always start collecting it later. Obviously you can’t go back in time, but you don’t have to accept everything just because it shows up on the doorstep of your Graylog environment.
That’s endpoints and the things we generally see collected from them. We will cover some questions at the end, so if you have specifics on any of this, definitely drop them in the chat and we can cover them then.
Network Traffic and Security
Next, we’re going to talk about our network. What is our network going to tell us? It’s going to tell us who is talking to whom, which is incredibly important. As a start, you may not necessarily cover your internal east-west traffic, but you definitely need to know how the internet was talking to us and how we were talking to the internet, so we can answer questions like:
- Were there devices communicating with known bad actors?
- Was there unusual data being transferred outside the network?
- Was there unusual data being transferred into the network, or between zones of the network, if you have that kind of monitoring set up?
And then lastly,
- What was going on with the VPNs?
Obviously, VPNs are a huge part of our network now; we’re not all in one building where we can figure out what’s going on just by looking at the LAN. We have people coming in from all over the place, and we need to be able to correlate this information back. We’ll talk a little more about how VPN fits in when we get to enrichment, but at a really high level it comes back to this: if we have a log that only shows us an IP address, we need to be able to figure out who was using that IP at that time. So VPN plays a critically important part in that.
Now, for network traffic, the default source is going to be firewalls, and firewall logs are absolutely the noisiest thing I see every single day. I think this is the one we need to be the most careful with. I have seen a firewall that, properly tuned, produces five gigs of logs a day; improperly tuned, it produced a hundred and twenty gigs of logs per day. That’s a pretty big difference, and there was no value in that hundred-and-fifteen-gig difference. They were mostly denied firewall rules, and unless I’m specifically looking for something, traffic that didn’t get in is probably not anywhere near the top of my priority list. So: firewalls. There is incredibly important information there that we want to capture, but take your time with this one; understand what you’re capturing and why you’re capturing it. Depending on the vendor, these logs can be rather tricky to parse. If you’re using Graylog Illuminate, good news: we’ve done most of that heavy lifting for you, especially for the really nasty ones. But it really depends. For some vendors it’s quite nice: those logs come in as key-value pairs, we can throw a pipeline rule for key-value pairs at it, and that solves it right away. For others, yes, we’re going to get into regex and just do it. I will tell you that at the end of the day it is absolutely worth it, but your firewalls and your network devices will probably be your number one pain point for bringing things in. There’s just so much data there. This is the part where we’ll probably spend the most time as we go through the process of deciding what we want to bring in.
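For the friendlier firewalls, the key-value rule really can be that short. This sketch assumes the whole firewall message arrives as space-separated key=value pairs in the `message` field and that messages from the firewall are identifiable by their `source`; both assumptions will vary by vendor.

```
rule "parse key-value firewall logs"
when
  // Hypothetical guard: only run against messages from our firewall
  has_field("source") && to_string($message.source) == "firewall-01.example.com"
then
  // Split "action=deny src=10.0.0.5 dst=203.0.113.7 ..." into individual fields
  set_fields(
    key_value(
      value:         to_string($message.message),
      delimiters:    " ",
      kv_delimiters: "="
    )
  );
end
```

From there you could add a second rule that drops the plain denies you decided you don’t need, which is exactly how that 120-gig-a-day firewall gets back down to something reasonable.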
Externally Facing Services
The next thing we want to cover is externally facing services. Ideally you have none of these, but I highly doubt that’s true. What I’m talking about with externally facing services is anything where we’re providing a web app that is available on the internet, or some variation of that; things that are not just inside our network. This doesn’t necessarily mean it’s being provided to the public. It might be a public site, it might be an employee site, but it is being exposed to the general internet, which brings a whole bunch of additional problems, because although we have protections in front of it, there are so many other things we need to worry about.
So what do we need to worry about with external services? We need to worry about who is accessing those services, so we’re back to users, but also about being port scanned and all of those kinds of things. Again, were there unusual traffic patterns? And ideally, if there was a known vulnerability in the application or the service, could we go back in time and see whether it was exploited?
Sometimes this is easier than other times. For example, for a certain exploit, a certain URL would have been hit to trigger it, because that’s where the exploit lived. Do we have the information in the logs to go back and see whether that URL was hit? Who did it? When did it happen? From there we can start to chain together what actually happened; it gives us a starting point. If we don’t have that information, it’s much harder to do.
Where are we going to get this from?
It really depends on what you’re doing in this case. I’m not necessarily talking about hosted apps, like fully hosted SaaS applications; some of them may give you some logs, but ideally they are responsible for the security of those systems, and there is some risk you can put onto them. I’m mostly talking about things we’re running in our own data centers, under our own full control, that we need to worry about here. Where are we going to get the information from? Web servers: Apache, Nginx, IIS, all that kind of stuff. There is pretty decent logging that can come out of those. If we’re running anything in front of them, a reverse proxy, a WAF, something like that, we want to make sure that we’re bringing that in as well.
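If you’re pulling in raw web server access logs, a grok rule is one way to get fields out of them. This is a sketch, assuming the `COMBINEDAPACHELOG` grok pattern is available in your Graylog grok pattern set (import it if it isn’t) and that the raw access-log line arrives in the `message` field; the hostname check is just an illustrative guard.

```
rule "parse apache access logs"
when
  // Hypothetical guard: messages coming from our web servers
  has_field("source") && contains(to_string($message.source), "web", true)
then
  // Pull out client IP, request path, status code, bytes, referrer, user agent
  let parsed = grok(
    pattern: "%{COMBINEDAPACHELOG}",
    value: to_string($message.message),
    only_named_captures: true
  );
  set_fields(parsed);
end
```

With the request path and client IP as real fields, the "was that vulnerable URL ever hit, and by whom?" question becomes a simple search instead of a grep through raw text.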
Again, just something to think about: are you covering all of the bases on these? Especially because they are externally facing, they have a lot fewer layers of protection than some other areas of our network. This is my favorite slide of the day because I didn’t change it from saying “network,” and I had to submit my slides a week ago. I noticed immediately after I submitted it, and I’ve known for a week that the slide was coming and that it had a typo on it, and there was nothing I could do about it. So, I’m glad this slide is over. We can move on to the rest of our slides, and I can get on with my life now.
Cloud Applications and Compliance
Okay. Cloud applications.
What I’m talking about specifically here is your standard line-of-business cloud apps. Obviously, there’s a ton we could talk about, all the kinds of cloud apps that there are, but we’re talking about things like Office 365, Google Workspace, and all of those kinds of things we all use, because hopefully we’ve been able to replace the headache that is our Exchange servers with the headache that is Office 365 (a tiny upgrade). I never really got to do that, unfortunately; almost all of the time I was a sysadmin, I got to deal with Exchange 2003 and all of the fun that brought with it, and then moving on from that.
So, what do we need to be able to answer with these kinds of applications? Probably everything we’ve already talked about: because they are so broad and cover so many things, all of the same questions come up, just in our cloud environment. Now, the good news is that it is relatively easy to collect this information, with a couple of caveats. All of the vendors who are worth anything will provide you with somewhat decent logging. Obviously, there have been cases where we’ve learned Microsoft wasn’t necessarily giving everybody everything, and that is all still changing. It’s easy in the sense that it’s generally going to be one firehose of data. The data will generally be classified; it will come in with field names and everything attached to it, in JSON or something like that, which is really easy to parse. All we need to do is convert that into our information model, our schema, and we’re good to go.
I have seen logs coming in from Microsoft 365 where every log message has different fields. If we take all of that and dump it into one index, by default an index in Graylog can only hold a thousand distinct fields. I have absolutely seen cases where, if you push a lot of different information coming out of 365 into a single index, that index can reach that thousand-field maximum, and then we have problems. So we do need to plan: do we want to route this into different streams? Do we want to place it in different indexes? These are all things we need to keep in mind as we bring in this data, because it is so much different data coming all from one source. We may need to do some routing, but generally it’s fairly straightforward to go ahead and bring that data in.
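A rough sketch of what that routing can look like: parse the JSON payload into fields in one stage, then route a particular workload to its own stream (attached to its own index set) in a later stage. The `Workload` field and the "O365 Exchange" stream name are assumptions for illustration; adjust them to whatever your Office 365 input actually delivers.

```
// Stage 0: turn the JSON payload into individual message fields
rule "parse office 365 json"
when
  has_field("message") && starts_with(to_string($message.message), "{")
then
  set_fields(to_map(parse_json(to_string($message.message))));
end

// Stage 1 (a later pipeline stage, so the fields above already exist):
// send Exchange records to their own stream / index set to keep the
// per-index field count well under the default thousand-field limit
rule "route office 365 exchange events"
when
  has_field("Workload") && to_string($message.Workload) == "Exchange"
then
  route_to_stream(name: "O365 Exchange", remove_from_default: true);
end
```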
Okay, now for an easy one. Compliance.
Compliance is not easy, but it’s easy in the sense that a lot of the thinking has been taken from us and placed into documentation. I’m not going to claim that it’s easy to understand what any of this documentation says; however, it’s fairly straightforward in that we need to collect what they say to collect. We don’t really need to care why; we don’t have to have a specific use for it. We have to do it because we have to do it. There are no specific questions we need to be able to answer, except “Can you produce this?” Yes, here you go, it is produced for you. Definitely make sure you keep that in mind as you’re planning this.
Are there things which are compliance only? The reason I say “compliance only” is there’s a lot of good stuff actually that the compliance requirements will tell you to keep that you really should keep and you do have a use for. So, the ideal is that we have very little data which is not being actively used by us but is required by compliance. Now, when it comes to something like retention, that may be a different story. If we have to keep everything for seven years, are we really going back on a daily basis and looking at stuff from seven years ago? Probably not. But for a lot of this, I would hope you have a lot of overlap between compliance and what you’re actually using. It is a good measuring stick to measure yourself against, but something to keep in mind as we’re deciding what we need to bring in there.
Data Enrichment and Its Necessity
That brings us to an end of the specific things that we need to bring in, except for one last thing, which is around enrichment. As I said, enrichment is the information that we need to bring into the system that we’re going to use to tie into the other messages we’re getting to make those messages more useful. This may mean that there’s information in the messages for us to make a determination more quickly on what’s actually going on, or it may mean that we are able to actually get the full picture, and without it, we couldn’t.
So, what do I mean by that? As I said, we’re going to talk about DHCP. We’ll dive a little deeper into DHCP in a second and why we need to bring that in, but DHCP is something that I would say is required to be brought in. Now, DHCP is a little different from the other things in this list in that it actually is another stream of messages. You’re collecting the information on your domain controllers. If you’re in a small environment, there’s a good chance those domain controllers are also providing your DHCP. It’s an easy place to pick that up. Maybe it’s coming from somewhere else. We need to make sure we’re grabbing it from there as well, feeding that information in.
So, we need it. We’ll talk about why in a second, but I promise you, I feel very strongly that DHCP is something you absolutely should be feeding into your system. The other kinds of enrichment we can do, though, are not necessarily a stream of data; these are more lookups that we do inside pipeline rules. As the other messages flow through, if they contain external IP addresses, we can go ahead and add GeoIP information to them. That information can be helpful. I will tell you it is very helpful for making very nice maps that we all know the higher-ups love to look at, and for that alone it may be worth adding, but there is really usable data in there for getting that bigger picture: is this normal? Is this not normal? Are these IP addresses in our town? If so, that’s most likely our staff VPNing in. Or are they on the other side of the world, and what’s going on here? We’re able to figure that out really quickly by having that information.
The other kind of bucket would be: can we add threat intelligence information to this? What is the reputation of this IP address? Has it been flagged somewhere as nefarious? Even very light information helps: is this IP address classified as a residential user, or is it used in a data center, these kinds of things. Being able to add all of this information really helps us get the full picture of what’s going on as we’re looking at the messages.
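Here’s a hedged sketch of what that enrichment looks like in a pipeline rule, assuming you’ve already configured lookup tables named "geoip" and "threat-intel"; those names, and the values they return, are placeholders for whatever data adapters you actually set up.

```
rule "enrich external source ips"
when
  // Only bother once a normalized source_ip field exists
  has_field("source_ip")
then
  let ip = to_string($message.source_ip);

  // GeoIP lookup: what this returns depends on the adapter you configured
  set_field("source_geo", lookup_value("geoip", ip, "unknown"));

  // Threat intel lookup: hypothetical table keyed by IP, returning a verdict
  set_field("source_ip_reputation", lookup_value("threat-intel", ip, "not_listed"));
end
```

Now the message itself already answers "where is this coming from, and is it known bad?" before anyone has to pivot out to another tool.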
Now, as I promised, let’s dive into DHCP. Why is DHCP so important? There are a lot of messages in Graylog where an IP address is the only thing we can correlate back to. Obviously, with dynamic addresses, who has that address changes over time, and most likely we are not looking at a message where that machine still has the same IP address today that it did at the time this happened. So we need to be able to go back and correlate that. At a minimum, that means finding the message, grabbing the IP address and the time, going to look in that other stream of data for our DHCP events, and seeing what information is there.
Now, the ideal with this is that we wouldn’t have to do that. And right here, I’m going to give you a little sneak peek of what’s coming in Graylog 5.2, specifically in Graylog Security: an asset model, tying information to actual machines and tracking it over time. This is going to take this information and make it very easy: when I’m looking at a message inside Graylog, I’ll see the asset that had that IP at that time. We’re going to use the DHCP logs for this. So this is where you can see how, in a fully fleshed-out solution, those DHCP messages matter: they allow us to make that correlation. Now, all of a sudden, we can jump back and see exactly what this machine was, and immediately pivot and start investigating what was happening on that machine at that time. So that’s DHCP, why it’s so important, and why I feel so strongly that we have to have it. Without it, we can’t necessarily attribute things going on on our network to specific devices, and if we can’t do that, we can’t complete the remaining steps of the investigation.
Final Tips and Best Practices
Alright, we are almost to the end. I will answer questions when we get to the end. I’m going to have to because I can’t keep talking because I’ve been now talking for 40 minutes, and I’m not nearly as young as I used to be, completely running out of steam. I’m putting myself to sleep, so I’m hoping the rest of you are not getting put to sleep by this.
Our final tips. Most important but least fun tip: Documentation
I was a sysadmin for 15 years. In those 15 years, I knew every single day that I needed to document things. Did I do it? Not particularly well. Did I enjoy doing it? Never. Was it super important? Every day, it was super important, and it would always come back to bite me when I didn’t do it. What does documentation mean specifically in this case, in Graylog? It doesn’t necessarily have to be something outside Graylog; there are lots of places inside Graylog where we can add descriptions, and we can add inline comments inside our pipeline rules. The reason we want to do this is that by the time we actually need to do an investigation, or by the time something breaks, we’re probably not going to remember why we did this in the first place. So have documentation around why you created an alert and what kinds of messages it relies on; then, if the alert is no longer firing, we can go in and investigate whether we’re still receiving all the messages we originally set it up to receive. Or, as we grow the information we’re bringing into Graylog: say this alert was built to monitor for a failed authentication anywhere inside our network.
Now, when I bring something new in, I can say: okay, that’s an alert I need to check, to make sure this new message source works with it, because this is why we built it. Having that documentation in place so we can understand that is very important.
Same thing on dashboards.
What information actually needs to be there for these dashboards to function? I can look at a dashboard and it looks like it works, but am I able to check that I’m still receiving all of the messages from all of the sources I was expecting? And then definitely, in pipelines, inline comment out the yin-yang: why did I do it this way? Make sure you include an example message that you based the parsing on. Then, when it stops working, we can go in and say: oh, the reason it’s not working is that the format changed; this new message doesn’t look like how it used to look when we built this rule. That’s super easy to add in pipelines. Document the live iteration of the thing, because you will a hundred percent want that later on.
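Pulling those habits together, a commented rule might look like the sketch below. The example message, the stream, and the field names are made up; the point is that the "why," the dependency, and a sample input live right next to the parsing logic. (This also assumes the base `USERNAME` and `IP` grok patterns are loaded in your environment.)

```
// Why: feeds the "failed VPN authentication" alert and the Security dashboard.
// Depends on: syslog input from vpn-01 (stream "VPN logs").
// Example message this grok was built against (hypothetical):
//   2024-05-01T12:03:44Z vpn-01 auth failure for user=jsmith from 203.0.113.7
rule "parse vpn auth failures"
when
  has_field("message") && contains(to_string($message.message), "auth failure", true)
then
  let parsed = grok(
    pattern: "user=%{USERNAME:user_name} from %{IP:source_ip}",
    value: to_string($message.message),
    only_named_captures: true
  );
  set_fields(parsed);
end
```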
The health of your system
The last tip is creating some alerts around the health of your system itself: the logging. Let’s think of an alert. We could create an alert so that if the number of distinct sources we’re receiving information from drops, we’re alerted to that. Or if the volume we get from any particular device drops, we’re alerted to that. Or maybe if it goes up: I’ve seen cases where people were first notified of something nefarious going on just because the volume of logs increased from the normal, because if you’re under attack, I can guarantee you your firewalls, your API endpoints, all of these things are probably seeing increased traffic. If you’re being brute-forced, you’ll see increased traffic. So let’s set up some really generic alerts to watch for this kind of stuff, just so we can make sure that everything we’ve set up stays healthy. We don’t want to get to the point where an incident happens, we go to get the logs, and we realize the logs stopped working six months ago and we never noticed. That’s not the time you want to figure that out, and it’s a very good way to wreck an investigation.
Conclusion
So that is everything I had to cover today. Thank you very much for your time. We’ll get into questions in a second. If you have additional questions after the fact, definitely feel free to reach out to me via email, hit me up on LinkedIn, whatever your tool of choice is. I’m happy to chat about how these things go. I’d also love to hear how it works out long-term: six months from now, once you’ve gone through this process and actually done it, I would love to hear all of those things.
Q&A Session
Can we use the Graylog schema as the default schema for fields in different indices?
Yes. The schema in Graylog is not some hard thing that is applied to an index or to Graylog; it’s just using those field names, and we can use those field names across different indexes. Basically, it all comes down to practice: as we bring our logs in and run them through our pipelines, are we taking the names of those fields, converting them to the field names we want to use in our data schema, and then writing those? The caveat I will give you, and something to be careful with, is that if you’re using the same field name across multiple indexes, make sure that the values of those fields are in the same format (if you’re following the GIM, you should be doing this anyway). Sometimes that’s very simple: we just make sure we only put numbers in, that kind of thing. Where it can get a little tricky is that just because a value looks like an IP address over here (Graylog will do some guessing as to the type of data) does not mean it’s actually stored as an IP address in this other index. So you do need to make sure it’s being stored as the same data type across those indexes, not just that it looks the same.
Graylog will function without that, but you will not be able to run searches across mixed data types. I literally ran into this today when I was doing the CTF. Because I work with the GIM every single day, when I had to do the grok exercise in the CTF, I used GIM field names for my source IP address. Source IP already existed inside that environment, and I hadn’t done the work to properly parse the data into exactly the right format, so there was a conflict and I got an error on my search page. So you want to make sure that you’re using the GIM. It’s not something we’re applying; it’s not a heavy-handed thing. It’s really just the way we approach the naming of our data and the validation of our data inside Graylog, to make sure we can use it across the board.
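In pipeline terms, the fix is to convert the value explicitly before it gets written, so every index stores it as the same type. A minimal sketch, assuming the value is an IPv4 address in a `source_ip` field:

```
rule "store source_ip as a real IP type"
when
  // Only convert when the field is present and looks like an IPv4 address
  has_field("source_ip") &&
  regex("^\\d+\\.\\d+\\.\\d+\\.\\d+$", to_string($message.source_ip)).matches == true
then
  // Converting explicitly keeps the data type consistent across indexes,
  // so searches that span indexes don't hit a field type conflict
  set_field("source_ip", to_ip(to_string($message.source_ip)));
end
```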
There was another question around where can we find a list of devices that we support in Illuminate.
If you search Google for Graylog Illuminate, our documentation site goes into exact detail on which products Illuminate covers. For some of them, it will even tell you the exact event IDs that it works with, ones that it doesn’t work with, how they all need to be set up, all of that kind of thing. All of the details are on the doc site for everything that we support in a very good level of detail.
How can we suppress duplicate logs so they aren’t sent on from other systems?
Duplicate logs is a fun one. Log messages are treated as distinct objects inside Graylog. We can deal with them in processing pipelines, but we need a way of detecting that a message is a duplicate. We can use things like lookup tables to do that: we have to be able to record, when a message comes in, that we received it, and then look up whether we’ve received the same message in the past. It can get a little tricky. It is easiest to handle at the source, for sure, but there are tools in pipelines that allow us to deal with duplicate logging. The easiest way is to not have it happen in the first place. The pipeline approach can get pretty advanced, but there are absolutely ways it can be done. I’ve done it before for some services where, every time you send a request to their API, they just return all of the logs they ever received back to you, and we need to figure out which is which.
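As a very rough sketch of the pipeline approach, and only if your Graylog version supports mutable (MongoDB-backed) lookup tables with the `lookup_set_value` function: fingerprint each message, remember the fingerprint, and drop anything you’ve already seen. The table name "seen-messages" and the choice of fingerprint are placeholders, and in practice you’d want to think about table growth and expiry.

```
// Stage 0: fingerprint the message and check whether we've seen it before
rule "flag duplicate messages"
when
  has_field("message")
then
  let fingerprint = sha256(to_string($message.message));
  // "seen-messages" is a hypothetical mutable lookup table keyed by fingerprint
  set_field("is_duplicate", lookup_value("seen-messages", fingerprint, "no"));
  // Remember this fingerprint so the next copy gets flagged
  lookup_set_value("seen-messages", fingerprint, "yes");
end

// Stage 1 (a later stage): drop anything the previous stage flagged
rule "drop duplicate messages"
when
  has_field("is_duplicate") && to_string($message.is_duplicate) == "yes"
then
  drop_message();
end
```

As said above, doing this at the source is still the easiest option; this is the fallback for when you can’t.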