dogstats-threads: a tale of performance and bugs

TL;DR

I fixed a race condition that corrupted Datadog metrics by updating the dogstatsd-ruby gem, the client library, to the newer thread-safe version. However, the newest version used a sender thread to perform async IO, which was incompatible with an existing optimisation to reduce site latency. The optimisation queued metrics during an HTTP request delaying submission until response was sent to the user, causing a burst of messages that overflowed the sender thread queue and resulted in metric loss. The optimisation was partly obsolete with the new async IO, but serialisation still happened in the main thread and represented 20ms/req on average, so I proposed and implemented a new optimisation to dogstatsd-ruby moving it to the sender thread, which was more appropriate for the threaded model. The patch solved the bugs without compromising performance, reduced maintenance burden and benefited the wider community. CPU time could be saved with native extensions to metric serialisation, and thread complexity could be simplified with non-blocking socket IO.

Corruption: a race condition

Once upon a time, I dealt with a bug that caused Datadog metric corruption. Every now and then, we would see metrics in Datadog with either truncated or joined names.

A colleague identified it to be a race condition. Datadog's client library, the dogstatsd-ruby gem, was known not to be thread-safe. They tried to fix it by instantiating one client per thread, using thread-local variables, which did fix the corruption but created concerns due to how the instances were recycled and passed around by objects.

When I looked at dogstatsd-ruby's repository, I saw that the site was a couple major versions behind and that the newest version was actually thread-safe. Colleagues were reluctant to update it because the new version actually used threads to perform async IO, which they feared could cause other problems... spoiler alert: they were right, but lagging behind was creating problems too, so it was worth the risk.

I double checked the compatibility between the new dogstatsd-ruby threads and Unicorn (the web server library), such as whether it was safe to use it after fork() and if I could rely on at_exit() to flush metrics. All positive, so I deployed to production and observed no negative impact... until months later when we heard about a discrepancy between two strongly correlated metrics that started exactly after my update.

Discrepancy: threads vs optimisation

I suspected that the metrics discrepancy had to do with a previous optimisation that our site had to reduce request latency associated with dogstatsd-ruby. What it did was to wrap the client instance, intercept and queue metric submissions, and then execute them all at once after the HTTP response was sent. It looked like this:

class LazyDogstats
  def initialise(dogstats)
    @dogstats = dogstats
    @metrics = []
  end

  def increment(*args)
    @metrics << [:increment] + args
  end

  def distribution(*args)
    @metrics << [:distribution] + args
  end

  # ...one method for each metric type...

  def flush(*args)
    @metrics.each do |metric|
      @dogstats.send(*metric)
    end
    @dogstats.flush
  end
end
      

The clever trick was incompatible with the new async sender thread model. It was holding metrics and emitting them in a single burst instead of gradually over time, overflowing the sender thread queue and causing metrics loss. Also, it seemed to have been superseeded by the new model, which no longer blocked the main thread with socket IO.

My hypothesis was confirmed by briefly turning off the optimisation and observing the reported metric discrepancy disappear. However, we also observed an increase of 20ms in average request time, so I instead turned it back on and temporarily bumped the queue size to a large value while I worked on a proper fix.

After benchmarking and profiling, I found that the 20ms latency was due to dogstatsd-ruby still performing metric serialisation in the main thread. I proposed to the dogstatsd-ruby developers to move serialisation into the sender thread. The proposal got accepted and I implemented the delay_serialization feature, which solved the bugs without compromising performance, reduced maintenance burden and benefited the wider community.

Future directions

My optimisation and the previous existing one didn't actually improve CPU utilisation, but just moved the work to a more convenient time. It could be possible to reduce CPU time with native extensions to the dogstatsd-ruby gem, performing metric serialisation in place with a specialised routine instead of relying on generic string operations. Also, it could be possible to simplify the complexity of having an async sender thread by instead relying on non-blocking socket IO, which is essentially the same concept but already implemented in the Kernel.