At Distil, we wanted to build a dashboard to monitor all our data centers as close to real time as possible, showing everything from total throughput and connections to system events. AJAX polling was not going to cut it for near-real-time updates, which led us to look into WebSockets. WebSockets let us send and receive live data between a browser and a web server over a single persistent connection, eliminating the need for long polling with AJAX.
Pushing live data to the browser is only half the battle; aggregating the data from all our servers in real time is the real challenge. Every server in all our data centers runs a monitoring agent that broadcasts performance metrics every second, and system events as they occur, to our monitoring server using ZeroMQ.
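A minimal sketch of what such an agent's broadcast might look like. The payload shape, metric names, server UUID, and endpoint address are all assumptions for illustration, not Distil's actual agent; the ZeroMQ publish itself (via the ffi-rzmq gem) is shown in comments so the sketch stays self-contained.

```ruby
require "json"

# Build the per-second metrics payload an agent might broadcast.
# The field names here are hypothetical examples.
def build_payload(server_uuid, metrics)
  {
    "server_uuid" => server_uuid,
    "timestamp"   => Time.now.to_i,
    "metrics"     => metrics
  }.to_json
end

payload = build_payload("srv-0001", "throughput_mbps" => 420.5, "connections" => 1024)

# Publishing over a ZeroMQ PUB socket (requires the ffi-rzmq gem and a
# SUB socket on the monitoring server; sketched here, not executed):
#
#   require "ffi-rzmq"
#   ctx  = ZMQ::Context.new
#   sock = ctx.socket(ZMQ::PUB)
#   sock.connect("tcp://monitor.example.com:5556")  # hypothetical address
#   sock.send_string(payload)
```

Serializing to a single JSON string keeps each broadcast a one-message send on the PUB socket.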
In our first iteration, the monitoring server dumped all the performance metrics into MySQL. The workers responsible for broadcasting these metrics to the browser via WebSockets would then aggregate, in real time, the metrics for every server in a data center. This was an ugly, slow query involving multiple joins and GROUP BYs, and the larger the table got, the slower it ran. Our "real-time" dashboard degraded to updating every few seconds or longer. As a quick band-aid until we could architect something better, we kept the table small by discarding stale metrics.
What we learned
We learned a lot from version 1 and identified a few bottlenecks to address in version 2. The biggest was MySQL; Redis looked like a more appropriate data store for what we wanted to achieve. Since Redis is a key-value store, we had to architect things a little differently. Instead of dumping all server metrics into a MySQL table, the monitoring server dumps them into Redis, using each server's UUID as the key and storing the metrics as a hash. This eliminates the endlessly growing MySQL table: if the key already exists, Redis updates the hash in place instead of adding a new key-value pair.
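The write path can be sketched in a few lines. This is an assumption of how such a writer might look with the redis-rb client, not the post's actual code: the `server:` key prefix is invented, and the client is passed in as an argument so the sketch needs no live Redis connection.

```ruby
# Store one server's latest metrics as a Redis hash keyed by its UUID.
# `redis` is any redis-rb-compatible client (e.g. Redis.new from the
# redis gem); injecting it keeps this sketch connection-free.
def store_server_metrics(redis, server_uuid, metrics)
  key = "server:#{server_uuid}"   # key naming scheme is an assumption
  redis.mapped_hmset(key, metrics) # overwrites fields in place, so the
  key                              # dataset never grows with time
end
```

Because the same key is rewritten every second, the working set stays proportional to the number of servers, not to elapsed time.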
Designing jobs to be small and specific
With MySQL replaced by Redis, we still needed to aggregate up to the data center level. There are two parts to this solution. We didn't want the workers that publish messages to the various WebSocket channels to have to connect to MySQL, determine which servers belong to which data centers, and aggregate the server metrics in real time. We wanted to keep their role as simple as possible: get a hash from Redis, publish that hash to a channel. So we extracted the aggregation into separate background processes that write the aggregated server metrics into a single hash per data center, keyed by each data center's UUID. Our jobs are now small and specific, and workers can be scaled out independently as we add more data centers and servers.
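The aggregation step above can be sketched as a pure roll-up over the per-server hashes. Summing every field is an assumption for illustration (some metrics, such as latency, would be averaged instead), and `to_f` reflects that Redis returns hash values as strings.

```ruby
# Roll per-server metric hashes up into one hash per data center.
# Summing every field is a simplifying assumption; a real aggregator
# would treat rate-style and gauge-style metrics differently.
def aggregate_datacenter(server_hashes)
  server_hashes.each_with_object(Hash.new(0)) do |metrics, totals|
    metrics.each { |field, value| totals[field] += value.to_f }
  end
end

# A background job would then write the result under the data center's
# UUID, mirroring the per-server scheme (hypothetical key prefix):
#
#   redis.mapped_hmset("datacenter:#{dc_uuid}", aggregate_datacenter(hashes))
```

With the roll-up precomputed, a publishing worker's whole job reduces to one hash read and one channel publish.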
Significant performance improvements
The end result: data center performance metrics were being published to the browser so quickly that we had to throttle them to once per second. We now have our real-time dashboard and can actively monitor the health of all our data centers around the world. If a server stops sending metrics for any reason, an alert is immediately displayed in the dashboard and an email is sent about the incident. We also use it to monitor individual processes on the servers: if a process stops, restarts, or crashes for any reason, an alert appears immediately in the browser along with an email notification.
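One simple way to implement the "server stopped sending metrics" alert is a staleness check against each server's last broadcast timestamp. The 5-second threshold below is an assumption, not the post's actual value; since agents broadcast every second, a few missed beats would trip it.

```ruby
# Flag a server as down when its last metrics broadcast is older than
# a threshold. STALE_AFTER is a hypothetical value for illustration.
STALE_AFTER = 5 # seconds

def stale?(last_seen_epoch, now = Time.now.to_i)
  now - last_seen_epoch > STALE_AFTER
end
```

A periodic job can run this over every server's stored timestamp and publish an alert (and send the email) for any UUID that comes back stale.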
In my next blog post I will cover in more detail how we set up our WebSocket server with Ruby on Rails and how we handled authentication over WebSockets.