How We Monitor Our 1000+ RPS Heroku App

Jon Sully

@jon-sully

As a leading provider of high-performance autoscaling, Judoscale handles quite a few requests per day — over a thousand every second, 24/7! This is probably on the lower end of “a high-traffic app” (depending on who you ask) but we nonetheless want to share our monitoring strategy and why we use the tools we do. Making sure that a hundred million requests per day get served quickly, accurately, and correctly requires some thought! Since our service ensures that other teams’ and companies’ apps remain available and responsive, it’s extra (and meta?) important that we keep our eyes peeled! Here’s how we do it... ice cream sundae style. Because ice cream is awesome.

For added fun, all of today’s diagrams will be provided courtesy of DALL-E in the post-impressionist artistic style. My mileage is going to vary.

It Starts With Sentry

Just like a proper sundae, we need a bowl. Something that holds all of our great things together in one place and provides a solid external boundary in case things get wild. That’s what a production error monitoring/tracking system is like! It can’t stop bad things from happening, but it at least provides some containment when they (inevitably) do!

illustrative painting of a bowl

So why Sentry, in particular? Why not Bugsnag or Rollbar or any of the other competitors out there? It’s not because those products aren’t great or that they’re missing features — we don’t even use half of the features that Sentry has added over the years. There are a few simple reasons we continue to stick with Sentry:

  • Its primary (original) feature still works very, very well (error tracking)
  • We’ve found it to be highly stable and available whenever we need it
  • It alerts us promptly but doesn’t suffer from being too noisy
  • New features the Sentry team has added over the years haven’t cluttered or busied the UI to the point where the original feature feels lost
  • Nobody has presented a compelling alternative that delivers a notably better solution to the original problem (error tracking)
  • It allows us to selectively ignore some errors on an “until it happens X times per hour” basis, which is great for some recurring issues

That’s a tough list! Props to Sentry for keeping the high quality of their main product and not letting it get lost amongst its other features. At its core, Sentry is a great error tracker. And that’s what we use it for. It’s our faithful, solid bowl.
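To make the “until it happens X times per hour” idea from the list above concrete, here is a minimal Python sketch of that kind of rule. It is not Sentry’s implementation or API. The `HourlyThreshold` class and its `record` method are hypothetical names invented for illustration, and the sliding one-hour window is an assumption about how such a rule could work.

```python
from collections import deque


class HourlyThreshold:
    """Stay quiet until an error has occurred `limit` times within the window."""

    def __init__(self, limit, window_seconds=3600.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now):
        """Record one occurrence at time `now` (in seconds); return True if we should alert."""
        self._timestamps.append(now)
        # Drop occurrences that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.limit
```

With `limit=3`, the first two occurrences in an hour stay silent and the third one alerts. Occurrences that are spread further apart than the window never add up to an alert.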

We’re Going to Need a Spoon

If a bowl contains the unknowns, I suppose it fits our metaphor to cast our log parser and explorer as a spoon! What’s a log system spoon for if not to help us navigate the rough edges and find the sweet stuff?! DALL-E, go!

illustrative painting of a bowl with a spoon

Brilliant.

We’ve tried a few avenues for our logs over the years, generally constrained by a few realities:

  • We generate a lot of logs. Even accounting for just Heroku’s router logs, it’s almost 200 GiB per day. Add in Rails logging and it’s... a lot more.
  • We don’t manually dig through logs all that often... needle-in-a-haystack kind of thing
  • Log systems are expensive!

So, while we were on Logentries’ drop-in Heroku Add-on for several years, last year we looked around and realized that BetterStack (which was LogTail at the time) offered more features at a lower price with a better UI! And it turns out that BetterStack’s log filtering and searching are way more powerful than what we’d experienced before. You will feel much better about your application when the phrase ‘digging through your logs’ doesn’t strike despair into your heart. BetterStack nails that for us. It also has no problems ingesting all of our data and can display it to us in real time. It’s a sharp and quick spoon!

Now that we’ve got the bowl and spoon (call them the ‘logistics’ tools), let’s look at the actual ice-creams in our monitoring sundae.

Ice Cream 1: AppOptics

Given that AppOptics is a visualization and aggregation layer on top of our application logs, it fits that we’ll cover it just after BetterStack! But we need an illustration first. Let’s call AppOptics our first scoop of ice cream in our monitoring sundae (though there’s no real particular order):

illustrative painting of a bowl, spoon, and a single scoop of ice cream

What does AppOptics do? In short, it reads our logs to generate visualizations and graphs based on aggregate data. That’s a mouthful. Logs go in, charts come out:

example of four data charts Judoscale uses in AppOptics

Those are a few of the charts we keep up for our background jobs’ health, but our main production dashboard has 28 different charts on it! If there’s some piece of data or some metric we want to keep an eye on, we build a chart for it in AppOptics. Let me show you the zoomed-out view:

zoomed out view of all 28 charts Judoscale watches in AppOptics
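If “logs go in, charts come out” feels abstract, here is a rough Python sketch of the idea: parse router log lines into fields, then roll them up into per-minute numbers that a chart could plot. This is not how AppOptics works internally. The function names `parse_router_line` and `summarize` are hypothetical, and the sketch assumes lines in the `key=value` style that Heroku’s router emits, prefixed with an ISO 8601 timestamp.

```python
import re
from collections import defaultdict

# Matches key=value pairs; values may be double-quoted (e.g. path="/jobs").
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_router_line(line):
    """Turn one router log line into a dict of its key=value fields."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def summarize(lines):
    """Group lines by minute: request count, 5xx count, and slowest service time."""
    buckets = defaultdict(lambda: {"requests": 0, "errors": 0, "max_service_ms": 0})
    for line in lines:
        fields = parse_router_line(line)
        if "service" not in fields or "status" not in fields:
            continue  # not a router request line
        bucket = buckets[line[:16]]  # "YYYY-MM-DDTHH:MM" prefix of the timestamp
        bucket["requests"] += 1
        if fields["status"].startswith("5"):
            bucket["errors"] += 1
        service_ms = int(fields["service"].removesuffix("ms"))
        bucket["max_service_ms"] = max(bucket["max_service_ms"], service_ms)
    return dict(buckets)
```

Each key in the result is one minute of traffic, which is roughly the shape of data a time-series chart wants. A hosted tool does the same kind of roll-up at far larger volume, with percentiles and alerting on top.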

But why AppOptics in particular? In truth, we don’t have any grand affinity for this tool over others. And there are lots of others. We chose AppOptics because it has a simple drop-in interface via their Heroku Add-on, it handles our log volume with ease, and, yes, because the charts look nice 🙂. AppOptics also offers much better pricing for smaller teams like ours. We’re not a giant enterprise business... there are just two of us!

AppOptics gives us a great holistic view of our production environment and, thanks to its easy Slack alerts, pings us if things aren’t going as planned. There are probably other tools out there that accomplish these goals, but we’ve had a great experience with AppOptics! It’s been smooth, like a great scoop in a sundae.

Side-note: you would not believe how many attempts and prompt-tweaks it took to get DALL-E to generate an image of just ONE scoop of ice cream in a bowl. More than twenty tries. It only knows how to generate images with lots of ice cream. And, honestly, that’s my preference in life too... but sheesh!

Ice Cream 2: Scout

While AppOptics lets us visualize terabytes of data with ease, sometimes we want to dig into the Rails stack itself and what’s going on from the perspective of Rails and its internal layers. There are a lot of Application Performance Monitoring (APM) tools out there: New Relic, Datadog, and countless others. We prefer Scout. It’s a crunchy and exciting treat that complements AppOptics’ charts with pizzazz. Much like this scoop of yum!

illustrative painting of a bowl, spoon, and two scoops of ice cream, one with sprinkles!

The sprinkles are obviously the bursts of joy that come from the Scout UI, am I right? Okay, I’ll tone it down. But we really do love Scout as an APM tool. Its dashboard design makes grokking traffic patterns easy and understanding exactly which layer of the stack is slow (DB vs. view rendering, etc.) much more approachable. It also comes as a Heroku Add-on, which means installation is extremely simple for us. It’s a great tool. Granted, Scout’s pricing structure may not be as favorable as other products depending on your team make-up and size, but we’re not switching any time soon! We love these views.

example screenshot of a chart in Scout’s UI

Oh, and if you too have a lot of traffic, make sure you familiarize yourself with wrapping the Scout SDK in sampling, so that only a fraction of your requests get reported. It’s a great technique to keep in your back pocket. We’d run through our Scout transaction limit very quickly if not for sampling.
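For a feel of what sampling means in practice, here is a small, generic Python sketch of the decision itself. It is not Scout’s API and not our production code. `should_trace` is a hypothetical helper, and hashing the request ID (instead of rolling a random number) is just one assumed way to make the choice repeatable.

```python
import hashlib


def should_trace(request_id, sample_rate):
    """Return True for roughly `sample_rate` (0.0 to 1.0) of request IDs, deterministically."""
    if sample_rate <= 0.0:
        return False
    if sample_rate >= 1.0:
        return True
    # Hash the ID into a number in [0, 2**32) and keep the lowest slice of that range.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")
    return bucket < sample_rate * 2**32
```

In a real app you would call something like this at the start of each request and, when it returns False, tell your APM agent to skip that request using whatever mechanism the agent documents. At 1000+ requests per second, even a low rate still yields plenty of traces to look at.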

While some tools we use without much strong preference, Scout is a tool that we deliberately choose over its competitors and would again tomorrow. It’s not that the competitors aren’t great, it’s that Scout has won our full attention! And just like a chunky scoop of the frozen stuff with sprinkles on top keeps us engaged, so too does Scout’s oversight of our application layers!

Ice Cream 3: Judoscale

You better believe that Judoscale uses Judoscale! Aside from being an excellent automatic scale-adjuster, the Judoscale dashboard and alerts also make for a great monitoring tool. Like a third scoop of ice cream in a sundae can add that pop of accent-flavor, Judoscale works in conjunction with AppOptics and Scout to reveal a few key insights — just like the top scoop in this growing bowl!

illustrative painting of a bowl, spoon, and three scoops of ice cream, one with sprinkles, and one sitting on top of the other two

The key is to understand queue time. With a single glance at our Judoscale dashboard for our web process, I can be confident that we’re at the right scale count for the number of requests we’re currently receiving and how many resources each request is taking.

screenshot of the Judoscale UI showing steady scale and low queue times

Similarly, if we’re getting errors or something is going wrong, I can know from a brief look at our Judoscale information whether that issue may be a scale problem (that Judoscale is likely already adjusting) or, more importantly, that it’s not a scale problem. If we’re seeing production issues but our queue time remains low, I can rule out our scale being the source of that issue. That’s powerful!
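If queue time is new to you: it is the time a request spends waiting between the router accepting it and your app starting to work on it. Here is a minimal Python sketch of the calculation. It assumes the `X-Request-Start` header holds a Unix timestamp in milliseconds, which is what Heroku’s router sends, and other proxies may use different units. `queue_time_ms` is a hypothetical helper for illustration, not necessarily how any Judoscale adapter computes it.

```python
import time


def queue_time_ms(x_request_start, now_ms=None):
    """Milliseconds between the router accepting a request and the app picking it up."""
    if now_ms is None:
        now_ms = time.time() * 1000
    started_ms = float(x_request_start)
    # Clock skew between router and dyno can make this slightly negative; clamp to zero.
    return max(0.0, now_ms - started_ms)
```

When that number stays low, requests are being picked up almost immediately, so there is enough capacity. When it climbs, requests are waiting in line, and that waiting is the signal that more capacity is needed.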

This article isn’t intended to be a sales pitch (and it’s not) but we do use Judoscale as a critical component of our monitoring workflow to simply know if our scale is right for the current time. We built it because we couldn’t find any other tools out there that would accomplish that. Especially for background jobs — it’s extremely helpful to open the dashboard and realize, “oh okay, nothing’s wrong, we just had a temporary backup and are currently scaling up to take care of it.” Moments like this:

screenshot of the Judoscale UI showing scale events as job queue times spiked in the worker system

Judoscale simply couldn’t be Judoscale without Judoscale! That’s why we love Judoscale. 😏

Hot Fudge All Day!

We’ve covered system monitoring, application monitoring, scale monitoring, and logs-based monitoring — let’s zoom out! How about monitoring your overall network and what-what-where is hitting your system? We use Cloudflare for that. Aside from managing our DNS and having a zillion other tools, Cloudflare’s network analytics are pretty great! Obviously continuing our now-painful metaphor, network monitoring is like the hot fudge on top of our other great tools:

illustrative painting of a bowl, spoon, and three scoops of ice cream, now with hot fudge on top!

Now, truth be told, we don’t hop in and check our network monitoring all that often. But Cloudflare’s UI and easy access to all of our network data (as our reverse-proxy DNS provider) make it painless to give it the responsibility of ‘network monitor’ too!

screenshot of the Cloudflare network monitoring UI showing requests broken down by their origin zone: US, Ireland, Germany, etc.

We’ve even been able to leverage the advanced searching and filtering tools to help debug customer issues (turns out they had some rogue dynos running in another region they didn’t know about)! Using Cloudflare for network analytics is just too easy of a ‘win’ to not use it. We recommend giving it a shot and seeing what your own Cloudflare data looks like. Experience some hot fudge network monitoring!

Bonus: The Cherry on Top

With our bowl, spoon, scoops, and hot fudge all situated, we just need a cherry on top to complete this monitoring sundae. Holding true with the imagery, we’ll top it off with a super small, super simple tool that sparks joy: uptime monitoring. A tiny concept with a big impact.

illustrative painting of the whole sundae: a bowl, a spoon, lots of ice cream, hot fudge, and a cherry on top

Like a cherry on top of a sundae, uptime monitoring is simple but powerful: a third-party service that pings our servers fairly often just to make sure they’re active, responding, and serving requests correctly. While we’ve used different products for this in the past, we’ve actually settled on BetterStack here. We’re already using BetterStack for log storage (see the Spoon breakdown above), and uptime monitoring comes out-of-the-box with BetterStack. Might as well use the tools we already have!
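The core of an uptime check really is that small. Here is a rough Python sketch of a single probe using only the standard library. It is not BetterStack’s implementation. `check_once` is a hypothetical name, the `opener` parameter exists only so the function can be exercised without a network, and treating any 2xx or 3xx status as “up” is an assumption.

```python
import time
import urllib.error
import urllib.request


def check_once(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe `url` once; return (is_up, status_code_or_None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with a 4xx/5xx
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False, None, time.monotonic() - start
    return 200 <= status < 400, status, time.monotonic() - start
```

A hosted service adds the parts that matter most: running the probe on a schedule from several locations outside your own infrastructure, and paging someone when it fails.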

I feel like it’s worth calling out some praise for their other tooling too, even if it’s not ‘monitoring’. In addition to replacing both our Log Storage and Uptime Monitoring, BetterStack also comes out-of-the-box with incident / on-call escalations, public, hosted status pages, and a few other goodies. BetterStack replaced several tools for us!


So that’s the sundae stack:

  • Sentry (error tracking)
  • BetterStack (logs and uptime monitoring)
  • AppOptics (infrastructure / general monitoring)
  • Scout (application performance monitoring)
  • Judoscale (scale monitoring)
  • Cloudflare (network monitoring)

A full suite of tools that keeps us aware of our overall application health with ease and beautiful graphs. That’s how we monitor our 1000+ RPS Heroku app!

Now, if you’ll excuse me, I’m craving some ice cream!