All In The Family: Architecting and Managing Shared Graylog Clusters

This is a transcript of a presentation by Joel Duffield, a member of the Solution Engineering team at Graylog. He explains the role of sales engineers, or solution engineers, like himself and how they help customers deploy and architect Graylog effectively. Joel then discusses in detail how to operate Graylog in a shared capacity, which he refers to as multi-tenancy or shared clusters. He covers the Graylog features involved and how to apply them in a shared environment, including permissions, architectural building blocks, traffic accounting, and extending data beyond Graylog via API calls. Along the way he points out challenges that might come up and how to overcome them. His goal is to help users efficiently manage large Graylog deployments in a shared environment.

Introduction to Graylog and Multi-Tenancy

My name is Joel. I am from the solution engineering team at Graylog. What we do in the solution engineering team is talk with our current customers and prospective customers of Graylog Enterprise about how they would deploy Graylog, how they would architect it, and the kinds of sources they would bring in. We’re SEs, which means, if you know what sales engineers are, we’re sales engineers, just with a different name, because every company in the world calls sales engineers a different thing for some reason. But that’s what I do at Graylog, and as part of that role, I get to work with some of our largest customers, both on the MSP and MSSP side of the house and with customers who have large Graylog deployments that are used widely internally.


Defining Multi-Tenancy in Graylog

Over the last year, I’ve been dealing a lot with these different customers as they roll out different flavors of multi-tenancy, shared environments, and different kinds of things. Hopefully, this talk today will be very helpful. We’re going to be talking about running Graylog in some kind of shared capacity, and we’ll talk about these different kinds of shared capacity because multi-tenancy shared clusters can mean all kinds of different things to different kinds of people.


And so we’re going to define what those are. Then we’ll talk about some of the specific features in Graylog and how we can actually go about doing that. You’ve been really successful with Graylog. Congratulations. Everyone now wants into your cluster, and that changes things a fair bit, right? It’s very simple to run Graylog when it’s only you or only your team in there. You don’t really have to worry about permissions or people stepping on other people’s toes or any of those kinds of things. As we start to grow Graylog out larger, there are some challenges that can come up with that. That’s what we’re going to chat about today.


Architectural Building Blocks for Graylog

Points we’re going to touch on:


  • We’re going to define what is multi-tenancy.
  • What are the different kinds of multi-tenancy?
  • The architectural building blocks that we’ll talk about through the rest of the talk: streams, indexes, all of those kinds of things. We’re going to make sure we have a solid understanding of what those are and how they play a role as we go through the rest of this.
  • We’ll talk about permissioning, as that becomes very important in some of this, and about how we actually go about architecting this: not the physical cluster itself, but the design of our streams, our indexes, those kinds of things for different kinds of use cases.
  • We will touch on traffic accounting.


Traffic Accounting and Chargeback

So chargeback, basically understanding who is bringing in what data, how much space is that taking up, those kinds of things.


Extending Data Beyond Graylog

Then we’ll wrap up with a discussion around how we can extend the data beyond Graylog. In several situations where you’re running in a shared environment, especially if you have customers outside your organization as the end consumers of this data, it may not make sense to give them direct access to Graylog, but we may want to give them access to the data that lives inside Graylog. And so we’ll talk about some of the ways that we can do that, as there are some much easier ways to get that data through the API in some of the newer versions of Graylog than there have been in the past.


Let’s dive right in. What do we mean when we say multi-tenancy? It can mean a lot of things. It’s gotten to the point where it’s become a little bit of an inside joke with myself when I hear that someone wants to talk about multi-tenancy because I’m like, that doesn’t really mean much because it can mean so many things.


What are some of the general things that we see? I look at it as a pyramid where, at the lowest level, we don’t care who sees what. This is probably going to be an internal use case. We have different people coming in, but a big challenge that we see very often with our customers is that the department that runs Graylog, and most often pays for Graylog, is not necessarily in charge of or in control of the data that comes in. So they have to run the infrastructure, but the data comes from some other team. We won’t name names, but very often this would be somebody like development, and then the DevOps team or an infrastructure team has to run Graylog. They don’t have control over what people are bringing in. What we’re going to talk about is how we can actually track the volumes of this data, whether you’re actually going to split the bill with these departments or you just want to be able to show management, “Hey, these people are using much more than everyone else.” We want to be able to see this data, and luckily, Graylog gives us a lot of ways to do that. That’s our very first base. We want to break apart the data just so we know what data is what and who the owner of that data is, so that we can maybe bill them or show what we would bill them, things like that.


The next tier on this pyramid, and these very often go together, is that there will be multiple requirements from these different data source providers on how long they need the data kept. At this point, it’s not enough to just tag the message and say who brought it in. We actually have to worry about where we’re going to put it so that we can retain certain information for a different amount of time than someone else’s data, and so we now have to bring in indexes and things like that, which we can control the retention on.


From that, we’ll generally see the problem of stepping on people’s toes being solved. This is where we may start to take our streams and apply permissions to them so that some of them are only seen by certain individuals inside the organization. This also becomes critically important because as the size of your Graylog deployment scales, the ease of use can drop as things like the number of dashboards in your environment grow. So we’ll also talk about how you can use permissions not only to control access to the data itself, but also so that people only have access to the dashboards, the alerts, those kinds of things that they actually need, so that it’s not overwhelming when you get inside of Graylog.


As we start to approach the pinnacle, we get into what I would consider customer isolation, where someone comes into the environment and doesn’t necessarily know that anyone else exists inside of that same cluster. They can see their data, they can see their dashboards with their data on them, but they have no idea that there are other people coming in who also have their own buckets inside of Graylog. There’s little to no crossover between them at all.


Then at the very top, we have the idea of full multi-tenancy. Full multi-tenancy as I would define it, and I’ll go into what that definition is, is not something that’s really available in Graylog today. You can get incredibly close, but it’s not something that Graylog has been built to do out of the box. What I mean by full multi-tenancy is the idea of having some kind of administrative console that creates these tenants inside of our environment, where we can go and remove an entire tenant altogether, and where there can be an administrator inside that tenant who has complete control over the tenant itself and can see nothing outside of it. Graylog today, again, has not been built for that level of control. We’re not going to be talking about that kind of thing, but we will be covering all of the things below that today. How can we actually go about doing this? What parts of Graylog can we bring to bear on it?


Understanding Streams and Indexes

The first and probably most critically important piece that we’ll use today will be streams. A stream is a metadata collection of messages. As data comes in, we can choose what stream to route it to. Data can be a part of multiple streams, but a stream is generally a collection of like data. We can use it for filtering searches, but in this case, we can also use it for permissioning. Streams also have an important role because a stream is what is actually tied to the index. When data flows into a stream, it will be stored in the index that is associated with that stream.


That brings us to indexes. An index is the actual backend collection of data as it’s stored inside of OpenSearch. An index is what we have control of the retention on, right? So as we look at these different ways of routing data, if we need to have control over retention, we’re going to need that data inside different indexes, because that is where we have that level of control. Also, depending on your requirements, you may have a requirement that customer data is stored completely independently. So we may actually want to store customers in separate indexes because then, from top to bottom, the data is classified as being stored independently.
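The relationship between streams and indexes can be sketched as a small data model. This is only an illustration, not Graylog code: the class names, the stream names, and the 90-day and 30-day retention values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexSet:
    """Backend storage with its own retention (hypothetical model)."""
    name: str
    retention_days: int


@dataclass(frozen=True)
class Stream:
    """Named collection of messages, tied to exactly one index."""
    name: str
    index_set: IndexSet


def retention_for(stream_names, streams):
    """Map each stream a message was routed to onto its retention in days."""
    return {name: streams[name].index_set.retention_days for name in stream_names}


# Illustrative setup: two customers, each with their own index.
customer_a = IndexSet("customer-a", retention_days=90)
customer_b = IndexSet("customer-b", retention_days=30)
streams = {
    "Customer A - Firewall": Stream("Customer A - Firewall", customer_a),
    "Customer A - Servers": Stream("Customer A - Servers", customer_a),
    "Customer B - Firewall": Stream("Customer B - Firewall", customer_b),
}

print(retention_for(["Customer A - Firewall", "Customer B - Firewall"], streams))
```

The point of the model is that retention is a property of the index, so two streams only get different retention if they are tied to different indexes.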


The Role of Permissioning and Teams

We have permissioning. Now, in a lot of Graylog, you’ll actually see permissions referred to in the buttons as sharing. The idea is that we’re taking this particular object, be it a stream, a dashboard, or whatever it is, and we’re sharing it out to individuals or groups. We can set that on both the data and the object itself, like a dashboard, and we’ll get into that in a second.


Another critically important piece on this is around the use of teams. Groups are called teams, and so you’ll probably hear me refer to them interchangeably. Fifteen years of being a Windows sysadmin, I can’t really get groups out of my speaking pattern. And so I’ll probably refer to groups and teams interchangeably. Teams is an enterprise feature. So most of the things we’re going to be doing, I will be approaching these as using teams to solve them. You absolutely can do this without teams, but you’re having to manage at the individual user level. So if you’re doing automations via API or something like that, you just keep that in mind that you’re going to have to be continually updating those as new users onboard and offboard. You’ll have to do that at the individual user level. Teams make this incredibly simple. We can set it and then let the user and teams management take over from there.


Utilizing Forwarders in Graylog

The last piece that we’ll talk about in this is forwarders. For those of you who don’t have the enterprise version, Graylog Enterprise includes functionality called forwarders. A forwarder is basically the inputs of Graylog, which we can deploy remotely into the environment. We can deploy a Graylog forwarder service to a server, and it can run a syslog input, a Beats input, whatever else; all the normal inputs can run on forwarders. That forwarder will take all of that data, collect it together, and ship it back to the main cluster. Obviously, if you are doing remote collection, this makes your life incredibly easy. If you’re running an MSP or MSSP and you need to collect on customer sites, being able to deploy a forwarder there is very helpful.


But the other thing that we can use forwarders for, and we’ll talk about this a bit, is making it very simple to understand who’s sending in what data by giving them a dedicated forwarder. We’re basically giving someone a dedicated IP address to send all of their data to. It doesn’t matter what kind of data they’re sending; everything they’re doing, they’re pushing through that one device. And we can use the information about what forwarder a message came through to make decisions about where we’re going to route that data, where we’re going to store it, whose data it was, all of those kinds of things.


Those are all the pieces that we’re going to bring together as we go through today. First of all, let’s dive into permissions, as it is the lowest and most core feature that we’re going to need for the rest of this.


I’ve said some of this before, but there is an idea in Graylog that we have these permissions, or sharing, and that can apply to the data itself. When I say data, I mean messages, and to apply permissions to the messages, we apply them to the streams. So we’re going to give someone read-only access to a particular stream that messages have been routed into.


We can also give them access to content: saved searches, dashboards, all of those kinds of things you’ve built inside Graylog. The permissioning on both of these works together. If I can’t see a dashboard, obviously, I can’t look at the data that I have access to on that dashboard. But in the inverse, if I have access to a dashboard that has been written to look at, say, 10 different streams, and I don’t have access to those 10 streams, I’m not seeing all of that data. I am only seeing the data on that dashboard from the streams that I have been given permission to see. What this allows you to do is create shared dashboards that users use to look at their individual data.
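The interplay between content permissions and stream permissions can be modeled in a few lines. This is a simplified sketch of the behavior described in the talk, not Graylog’s implementation; the user names, dashboard name, and grant dictionaries are hypothetical.

```python
def visible_messages(user, dashboard, messages, dashboard_grants, stream_grants):
    """Return the messages a user would see on a shared dashboard.

    The user needs access to the dashboard itself (content permission) and
    only sees messages from streams they can read (stream permission).
    """
    if dashboard["name"] not in dashboard_grants.get(user, set()):
        return []  # no access to the dashboard, so nothing is shown
    readable = dashboard["streams"] & stream_grants.get(user, set())
    return [m for m in messages if readable & m["streams"]]


# Hypothetical shared dashboard that looks at two customers' firewall streams.
dashboard = {
    "name": "Firewall Overview",
    "streams": {"Customer A - Firewall", "Customer B - Firewall"},
}
messages = [
    {"id": 1, "streams": {"Customer A - Firewall"}},
    {"id": 2, "streams": {"Customer B - Firewall"}},
]
dashboard_grants = {"alice": {"Firewall Overview"}, "admin": {"Firewall Overview"}}
stream_grants = {
    "alice": {"Customer A - Firewall"},
    "bob": {"Customer B - Firewall"},
    "admin": {"Customer A - Firewall", "Customer B - Firewall"},
}

for user in ("alice", "bob", "admin"):
    seen = visible_messages(user, dashboard, messages, dashboard_grants, stream_grants)
    print(user, [m["id"] for m in seen])
```

In this model alice and admin open the same dashboard, but alice only gets customer A’s message, while bob sees nothing because the dashboard was never shared with him.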


So I can create an enterprise-wide dashboard, but when I come in, I’m only looking at my enterprise, right? I’m only looking at the chunk of the pie that is actually under my control, that I’ve been given. We don’t have to create a custom dashboard for every customer, for every department. We can create these shared dashboards, and each user who goes in is looking at their individual data, but as the administrator, we can come in and use that exact same dashboard to look at the entirety of the messages.


Now, we also have roles. We haven’t really talked about this before, but there are also roles inside Graylog. Roles are going to be things like the ability to create dashboards, the ability to manage users, right?


A role is not related to access to particular things inside the system; it is more related to administrative functions. Roles are a permission in a way: they can be assigned to users and they can be assigned to teams, but they do not directly relate to the content itself. Again, they do somewhat play together. If I haven’t been given access to go in and look at events or alerts, it may not matter that I’ve been given access to a particular event, because I can’t get into those pages to see it. So there is some overlap, but generally roles are a different piece. The big important part of this is really the interplay between stream permissions and content permissions, and how we can use it to reduce the number of assets that we need to make inside Graylog while keeping all of that data separate.


Now, great, we want to do this, but how do we go about actually setting this up so that we can get the data in and route it?


And what kinds of things do we need to keep in mind as we do that? We’re going to look at how we route this data, which we’ll actually look at second, and how we go about storing this data. There are a couple of ways that we can do this. These are definitely not all the ways, but they are a couple of examples so you can think about some different ways of doing this.


We may want to break things up by putting customers, and when I say customers, this could be departments or really whatever you want, into their own indexes. We have our one shared cluster. Then we’ve got an index for customer A and an index for customer B. Inside of those, we could have multiple streams. Maybe it still makes sense that instead of having just a customer A stream, we have a stream for customer A’s servers and a stream for customer A’s firewalls, and the same for customer B.


The benefit of this way of breaking it up is that we can control different retention periods for different customers. Say customer A has a requirement to keep all of their data for 90 days, and customer B only needs to keep it for 30 days. We can set that independently because they each have their own index, and they can still use the streams to look at only the data they want.


The one thing to keep in mind with this is that there is a penalty when we’re running searches that take place across multiple indexes. Depending on your architecture, this penalty may not be super large, but it is something to keep in mind. If, in this environment, customer A is generally coming in and running the searches themselves, and they only have access to their own messages, then that’s completely fine because all of the searches they’re running live inside their index. We’re only hitting one index when we’re doing that. But if what we primarily do, as the administrative team, is run a search across all firewalls, then this may not be the best way of architecting it, because that search now hits every index, which may cause a lot more usage on our OpenSearch data backend than it would if we architected it a different way. Again, this all depends on scale. Some of this is theoretical and you only see it when you really get to a large scale.


So another way of looking at this may be that we actually want to create indexes based on the technology that we’re ingesting. Maybe we have an index for firewalls and an index for servers, and then we just have streams for the different customers. Again, the customers are only seeing their own data. A is only seeing A, B is only seeing B, but the data itself is stored in indexes per technology, and a customer now possibly has different retentions applied to their data because it sits in different indexes.


The other reason that this approach can be helpful is that as you start to scale out more, and really this will depend on what kinds of data you bring into Graylog, you may start to hit the limit on the number of independent fields that an index can hold. This is not the amount of data. This is the number of fields themselves. By default, that limit is a thousand fields in an index. Something that’s notoriously bad for this is Office 365, which has a whole bunch of fields in each message, and it seems like every message has different fields in it, right? The field names are not shared across messages. Now, normalization of our data, making our data play inside the data schema that we’ve created using the GIM... And if we had this set up by customer, and all of a sudden we have several dozen fields from their firewalls and maybe a couple hundred fields from their Office 365, you can see a point where we might start to approach that thousand-field limit.


So in those cases, maybe it actually does make more sense to do this kind of thing, where we now have customer data sitting side by side, but using the same fields. This can also help with compression, because OpenSearch will handle compression of the same field type. By storing the same kinds of data together, for example all of our customers who have FortiGate firewalls in the same index, we’ll get much better compression on that index than we would if we had some Windows stuff, some FortiGate stuff, and some 365 stuff all in that same index, because all of that data is different.


There are 5 million other ways of doing this as well, but hopefully that covers some of the bases of different things to think about as we decide how we want to do this. Of course, you could have multiple indexes per customer. You could break this out, as I say, 10 million different ways, but these are just some of the things to think about as we start to plan this.


Now, when it comes to actually getting the data into these streams, how do we go about doing that? Luckily, we have pipelines, and pipelines are incredibly flexible and give us a lot of tools which can really help with this. The key to remember, though, is that we need something, not necessarily in the when clause but in pipelines in general, that identifies whose message this is, plus any other information we need to route it. In the case of our last slide, we needed to know who the customer was and what kind of log it was, right? We need that somehow. It doesn’t necessarily have to be in the message itself, but it might be. In this case, our messages contain a field called customer number. That could have been added inside of Graylog by the input, or it could have been added upstream before the log was ever shipped to us. At some point, a customer number has been attached to every single message.


I would highly recommend that you always have something like this. Whatever you’re breaking these apart by, department, customer, whatever it may be, always have it in the message, because you can use it for so many things. Don’t rely just on the stream. Make sure the messages are in that stream for the permissioning, but also put that information in the message, because we may want to use it in alerts or things like that later. We always want to have it in there.


Or, as I said before, maybe we’ve deployed the Graylog forwarders and we can look at what forwarder the message came through, because that information is included in the message, right? What forwarder it came through, what input it came through, all of that information is attached to the message. Maybe we can use that in this kind of situation as well, where we’re doing a lookup against an external database. This may be a true external database. This could be your CRM tool, a true outside source that Graylog is making an API call to, or it could just be a CSV that you’ve uploaded into Graylog that contains the mapping of these customer numbers to the customer name, and that’s all it’s doing, right? There are lots of ways we can do this, but the lookups are very helpful. We can use the lookups with that forwarder as well: we store that this forwarder ID number is assigned to this customer. And we can have multiple mappings of these, right? Multiple forwarders assigned to a single customer, however we want to break this out.


So in this case, we’re doing this lookup, right? We’re saying, okay, we have the customer number. We want to get the customer name and store that as a variable. We also want to read out the product, because we need it for what we were doing before. And then we want to go ahead and route the message to the stream. Now, this could be a manual pipeline rule that is built for a specific customer, where we’ve hard-coded where the message is going to go. If we did hard-code it, we could route to an individual stream by its ID number. But what we’ve built here is a way to route dynamically. We are routing to a stream whose name is built out of variables that we’ve created in this pipeline rule. We take the customer name, add a dash, and then add the name of the product, and that’s what we end up with. So we are now routing this to a stream called customer A, space, dash, space, firewall, right? We also make sure we remove the message from the default stream so that we’re not creating a duplicate of it.


And that’s all we’re doing here, right? There’s a lot we can do here for sure. But just know that you don’t have to create a custom rule for every single one of these. We can write these rules so that we look at the data that’s coming in and dynamically decide what to do with it. That’s routing. That’s architecture.
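The dynamic routing decision described above can be modeled outside of Graylog. The sketch below is plain Python, not a pipeline rule; in a real deployment this logic would live in a pipeline rule using a lookup table. The field names customer_number and product, the sample customer numbers, and the name of the default stream are assumptions made for the illustration.

```python
DEFAULT_STREAM = "Default Stream"  # assumed name of the catch-all stream

# Hypothetical lookup table: customer number -> customer name.
# In Graylog this could be a CSV-backed lookup table or an external source.
CUSTOMER_NAMES = {"1001": "Customer A", "1002": "Customer B"}


def route(message, customer_names):
    """Return a copy of the message with its target stream decided dynamically.

    The stream name is built as "<customer name> - <product>". Messages that
    cannot be identified stay in the default stream.
    """
    customer = customer_names.get(message.get("customer_number"))
    product = message.get("product")
    routed = dict(message)
    if customer is None or product is None:
        routed["streams"] = [DEFAULT_STREAM]
        return routed
    # Keep the owner in the message itself, not only in the stream name.
    routed["customer_name"] = customer
    # Replacing the stream list models removing the message from the default
    # stream so that no duplicate is created.
    routed["streams"] = [f"{customer} - {product}"]
    return routed


print(route({"customer_number": "1001", "product": "Firewall"}, CUSTOMER_NAMES))
print(route({"customer_number": "9999", "product": "Firewall"}, CUSTOMER_NAMES))
```

One rule of this shape covers every customer in the lookup table, so onboarding a new customer means adding a table entry and a stream rather than writing another rule.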


Traffic Accounting in Practice

Next, we’re going to get into traffic accounting. Chargeback, showback, whatever you want to call it, whatever you need to do with it. How can we go about doing this in real life? Luckily, there are some built-in ways already that we can use to collect this data. There is a very handy field, attached to every single message, called gl2_accounted_message_size. Glad I got through all that without messing anything up, especially since it’s about lunchtime and I tend to lose my ability to speak in the second half of the day. That field is attached to every message and contains the size of the message in bytes.


Now, it won’t necessarily contain the exact true size. Obviously there will be compression, and there are some internal Graylog metadata fields that we don’t count, but it gives us a solid number for the size of that message. It’s very close and is at least directly comparable across all the messages that we have inside Graylog. Again, this is in bytes. We can now query these messages and do aggregations on this field to start to understand the size of messages that are coming in and how that all breaks out. We can group by that customer ID number that we applied to every single message that came in. We can group by stream; the stream is a field in the message that we can use. Very exciting.
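As a rough illustration of what such an aggregation computes, the sketch below sums gl2_accounted_message_size per group over a handful of in-memory messages. In Graylog the aggregation would run in a search or dashboard widget; the customer_number field and the sample values here are made up for the example.

```python
from collections import defaultdict


def usage_by(messages, group_field):
    """Count messages and sum accounted bytes per value of group_field."""
    totals = defaultdict(lambda: {"messages": 0, "bytes": 0})
    for message in messages:
        key = message.get(group_field, "unknown")
        totals[key]["messages"] += 1
        totals[key]["bytes"] += message.get("gl2_accounted_message_size", 0)
    return dict(totals)


# Made-up sample messages for the illustration.
sample = [
    {"customer_number": "1001", "gl2_accounted_message_size": 512},
    {"customer_number": "1001", "gl2_accounted_message_size": 1024},
    {"customer_number": "1002", "gl2_accounted_message_size": 2048},
]

print(usage_by(sample, "customer_number"))
```

Grouping by any other field that is on every message, such as the stream, works the same way by passing a different group_field.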


I don’t have a sneak picture of this, but you will see it if you try out the Graylog 5.2 beta or Graylog 5.2 when it comes out later on. A couple of things have made this way easier to do than it has been before 5.2, and the main one is that the stream name will now show in place of the stream ID when we use it inside an aggregation. Up until today, if I made an aggregation and said I wanted to group by the streams, I would have seen all the stream IDs and I would have had to go look all those up. In Graylog 5.2, I will see the name of the stream there, which makes for really nice dashboards. Even the default dashboard that ships with Graylog, the Sources dashboard, has had a complete overhaul and now includes the stream names in it. So you can see, out of the box, the number of messages and the size of messages by stream, and actually see the names of all those streams.


So that’s a really cool thing to check out in 5.2. The one caveat I will give is that at scale, these aggregations can fall apart. If you are ingesting hundreds of gigabytes a day and then you’re trying to do an aggregation over 30 days of those messages, that can become problematic. That’s a lot of data for OpenSearch to do an aggregation on. In that case, you may want to do some kind of pre-aggregation. You can do this with scripts. You can even do this with an alert that runs on a schedule and aggregates the number every day. So it’s doing a sum of the gl2_accounted_message_size field every day, and then storing that value maybe back into Graylog or into an external system, so that we’re not having to run the aggregation on the full data. Maybe we’re running the aggregation on data where there is just one message per day, holding the sum of that day’s messages, or whatever.


At a small scale, this doesn’t matter. If you’re running over short periods of time, this doesn’t matter. But if you do need to run this over large periods of time with large amounts of data, you may run into queries literally timing out. So you may want to roll that data up in smaller quantities first, before we actually go through and aggregate it. But that’s really traffic accounting. It is fairly simple because all those fields are there. There’s a ton of things you can do with it, but that’s really the basics of it.
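The roll-up idea can be sketched as two steps: reduce raw messages to one summary record per day and customer, then compute longer-period totals from those small records only. This is a minimal illustration with made-up data; the customer_number field, the timestamps, and the shape of the summary records are assumptions, and where the daily records are stored (back in Graylog or in an external system) is left open.

```python
from collections import defaultdict
from datetime import datetime


def daily_rollup(messages):
    """Reduce raw messages to one summary record per (day, customer)."""
    sums = defaultdict(int)
    for message in messages:
        day = datetime.fromisoformat(message["timestamp"]).date().isoformat()
        sums[(day, message["customer_number"])] += message["gl2_accounted_message_size"]
    return [
        {"day": day, "customer_number": customer, "bytes": total}
        for (day, customer), total in sorted(sums.items())
    ]


def period_total(rollups, customer_number):
    """Sum the pre-aggregated daily records instead of the raw messages."""
    return sum(r["bytes"] for r in rollups if r["customer_number"] == customer_number)


# Made-up raw messages spanning two days.
raw = [
    {"timestamp": "2023-10-01T08:00:00", "customer_number": "1001", "gl2_accounted_message_size": 500},
    {"timestamp": "2023-10-01T17:30:00", "customer_number": "1001", "gl2_accounted_message_size": 700},
    {"timestamp": "2023-10-02T09:15:00", "customer_number": "1001", "gl2_accounted_message_size": 300},
    {"timestamp": "2023-10-01T12:00:00", "customer_number": "1002", "gl2_accounted_message_size": 900},
]

rollups = daily_rollup(raw)
print(rollups)
print(period_total(rollups, "1001"))
```

A 30-day report then touches at most 30 records per customer, regardless of how many raw messages were ingested.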


The last piece we’re going to cover is getting this data out of Graylog into some kind of external system. Maybe you have a portal for your customers to get into. Maybe you have some kind of invoicing that you send them, and you want to include this information on it: “Hey, we’re charging you this much, but we generated this many alerts for you, or we received this many messages, or we stored this amount of data,” whatever this may be. We want to get at that easily. Obviously, Graylog is built with a very open API, and it’s been possible to run queries for a very long time. In Graylog 5.1, which hopefully most of you are already running by now, we added some new API endpoints specifically for making incredibly simple calls to gather data out of the system. If you’re looking in your API browser, they are called search/simple, and there are two of them. There’s one for messages, which returns messages as a result, and there’s another, which we’ll look at in a second, that returns aggregations. What is nice about these APIs versus the way the search APIs functioned before is that this is a single call. You can do this literally through a GET. I can hit this endpoint with my query, my time range, all of these kinds of things, and it will return the data in JSON with all of my rows and all of my fields.


You’ll get back more than this, but this is how you’re going to get it, right? You’re going to get an array. The array will contain all of the different fields that you asked for. So in this case, I’m searching for messages. I can put a query in there, and this query can be written basically in the same way that you write it in Graylog itself. We can put in a time range. We can request the specific fields that we want returned to us, so it’s not returning everything, just the fields that we wanted. We can do sorting: we can choose a sort field and we can choose the number of events that are returned. Because of this, you can use it to return only one event. You can just say, hey, I want to sort it this way and I want a size of one, and I’m basically going to get back the most recent message that matches this criteria. There are lots of fun things you can do with this. Because it is such a straightforward API to run, you could theoretically run this even from inside Excel to pull the data directly into a spreadsheet report or into invoices, stuff like that.


You can do it with messages. You can also do it with aggregations. Some of the same parameters apply: the streams (we could have filtered by stream in the messages one as well), the time range we want to run it on, what we want to group by, and the metrics that we want. These are super simple to run. Here we’re doing a metric on the count of the source field, and we’re doing a sum of gl2_accounted_message_size. This is the kind of thing I was talking about when I mentioned possibly pre-computing those accounting numbers on a daily basis. We’re looking at the last 24 hours, grouped by the source: what was the number of messages we saw, and what was the sum of the accounted size? And so we’re going to get back, okay, here was the source, here was the number of messages, and here was the message size in bytes. Again, something really simple that we can use to pull that information in for us.
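A client for the aggregation call could look roughly like the sketch below. It only builds the GET URL and reshapes a response, without making a network request. The path /api/search/aggregate, the parameter names (query, timerange, groups, metrics, streams), the metric syntax such as sum:gl2_accounted_message_size, and the response shape with schema and datarows are all assumptions that should be checked against the API browser of your own Graylog version; the host name is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint path; verify it in your cluster's API browser.
AGGREGATE_PATH = "/api/search/aggregate"


def aggregate_url(base_url, query, timerange, groups, metrics, streams=()):
    """Build a single GET URL for a simple aggregation call."""
    params = [("query", query), ("timerange", timerange)]
    params += [("groups", group) for group in groups]
    params += [("metrics", metric) for metric in metrics]
    params += [("streams", stream) for stream in streams]
    return f"{base_url.rstrip('/')}{AGGREGATE_PATH}?{urlencode(params)}"


def rows_as_dicts(response):
    """Turn an assumed {'schema': [...], 'datarows': [...]} reply into dicts."""
    names = [column["name"] for column in response["schema"]]
    return [dict(zip(names, row)) for row in response["datarows"]]


url = aggregate_url(
    "https://graylog.example.com/",  # placeholder host
    query="*",
    timerange="1d",
    groups=["source"],
    metrics=["count:source", "sum:gl2_accounted_message_size"],
)
print(url)

# Hand-written reply in the assumed shape, for illustration only.
sample_reply = {
    "schema": [{"name": "source"}, {"name": "count"}, {"name": "bytes"}],
    "datarows": [["fw-01", 120, 48000]],
}
print(rows_as_dicts(sample_reply))
```

The resulting URL can be fetched with any HTTP client that can authenticate against the Graylog API, and the flattened rows are easy to drop into a report or an invoice.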



That is everything I had to cover today. We flew through it. Obviously, there’s a ton more we could have gone into. If you have questions, definitely let me know; hit me up over email. You can find me, Joel Duffield, on LinkedIn. I have the same picture as here, which makes it incredibly easy to find me, although there is only one Joel Duffield who is a solution engineer at Graylog. Okay.


Q1: Should we use a shared cluster for both lower and higher environments, or is it better to separate Graylog clusters?

A1: Whether to use a single cluster or separate clusters for different environments depends on your specific needs. A single cluster offers the advantage of a unified view for one team over all activities, whereas separate clusters might be necessary when the operations in each are very distinct. If the data and activities are similar, a shared cluster is feasible. Graylog clusters can handle large volumes of data, making them suitable for various scenarios.

Q2: What is the best way to reduce or shrink the size of indexes in Graylog?

A2: The time-size optimizing rotation strategy in Graylog helps manage both the retention and the size of indexes. Ideally, each index should be around 40 gigabytes for optimal performance. This feature allows you to control the size of each index and its retention period. Additionally, grouping similar types of data in an index lets the data compress more effectively. Using the API, you can automate the creation of new indexes and manage them efficiently.

Q3: Can indexes be rotated based on a specific time zone instead of GMT?

A3: Index rotation in Graylog is based on the system’s time zone. While manual rotation is an option, using the time-size optimizing rotation strategy is recommended, as it separates the concepts of rotation and retention. This approach offers more flexibility and control. You can also use the API for more complex tasks involving index rotation. Moreover, Graylog’s pipelines provide extensive capabilities for data manipulation and can even interact with external systems for both reading and writing data.


Okay. Hopefully that was helpful. Definitely make sure again, hit me up if you have any further questions and thank you very much and happy Graylogging.