I unblocked a team who needed to tag DataDog metrics with ownership information. Their naive implementation relied on acquiring stack traces to calculate the metric owner and was estimated to increase site latency by 30x. Metrics emission must be much faster than the code it measures, so I advised against tagging at the source and recommended doing it as a post-processing step instead, but they needed source tagging for integration with analysis tools. I helped them optimise their solution and we arrived at a seemingly satisfactory proof of concept, but I warned about the risks of such a change at scale. The experience helped the team understand those risks and articulate to leadership that it was best to compromise on tool integration rather than on site availability.
Towards the end of last year I assisted a team who needed to tag DataDog metrics with ownership information. What seemed like a simple task turned out to require serious performance consideration. The task was a priority and the team was under pressure to deliver.
They started with a straightforward but naive implementation, decorating the dogstatsd-ruby gem, DataDog's client library, to obtain a stack trace, calculate the code owner of the caller, and inject the tag on metric submission. Their code looked like this:
class AttributedDogstats < Datadog::Statsd
  def increment(metric, opts)
    stack = caller()
    owner = calculate_owner(stack)
    if opts[:tags].is_a?(Hash)
      opts[:tags]["owner"] = owner
    else
      opts[:tags] ||= []
      opts[:tags].reject! { |t| t.start_with?("owner:") }
      opts[:tags] << "owner:#{owner}"
    end
    super
  end
  # ...
end
I saw this patch fly by and immediately intervened.
Metrics emission must be fast. Every microsecond spent there is multiplied by thousands of invocations, adding up to milliseconds of latency for the user. The team wasn't aware of this impact: their naive implementation was estimated to increase average request latency by 30x, pushing most requests to timeout.
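To get a feel for the cost, here's a minimal standalone sketch (not the team's code) comparing a no-op against capturing a stack trace with caller at an assumed depth of 100 frames, using only the stdlib Benchmark module:

```ruby
#!/usr/bin/env ruby
require "benchmark"

# Build an artificial call stack of the given depth, then run the block,
# to approximate capturing `caller` from deep inside a web request.
def deep(n, &blk)
  n.zero? ? blk.call : deep(n - 1, &blk)
end

n = 10_000
noop   = Benchmark.realtime { n.times { deep(100) { nil } } }
traced = Benchmark.realtime { n.times { deep(100) { caller } } }
puts "capturing a 100-frame stack trace is %.0fx slower than a no-op" % (traced / noop)
```

The exact ratio varies by machine and stack depth, but it illustrates why per-call stack capture dwarfs the cost of metric emission itself.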
I explained that acquiring stack traces and calculating the code owner at runtime was a non-starter, as this kind of reflection is orders of magnitude slower than metric emission. The best alternative I could think of was a hash map of metrics to owners calculated once at boot time, but at that point there was little reason to do the tagging at runtime at all, so I advised them to simply post-process the results and avoid touching this sensitive part of the stack.
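Such a boot-time map could be built along these lines. This is a hypothetical sketch: the YAML mapping, the prefix-match strategy, and the names are all assumptions, and the real source of truth (CODEOWNERS, a service registry, ...) would vary:

```ruby
require "yaml"

# Hypothetical owner data; in practice this would be loaded at boot
# from whatever system records metric ownership.
OWNERS_YAML = <<~YAML
  checkout: payments-team
  search: discovery-team
YAML

prefixes = YAML.safe_load(OWNERS_YAML)

# Resolve each metric name once via longest-prefix match, memoising the
# result so subsequent lookups are a single hash access.
METRIC_OWNERS = Hash.new do |h, metric|
  prefix = prefixes.keys.select { |p| metric.start_with?(p) }.max_by(&:length)
  h[metric] = prefix ? prefixes[prefix] : "UNKNOWN"
end

p METRIC_OWNERS["checkout.cart.add"]  # => "payments-team"
p METRIC_OWNERS["unmapped.metric"]    # => "UNKNOWN"
```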
However, it was still desirable to tag metrics at the source for integration with existing analysis tools, so they tried the hash map suggestion. We paired and I showed them how to run benchmarks, profile, and read flame graphs. We identified expensive array operations and unnecessary allocations and made the code much faster, but in the end even adding a mere literal tag was still too expensive.
I created a local benchmark to compare the performance of dogstatsd-ruby emitting metrics with 1 and 2 tags. Then I tried to profile the benchmark, but the output didn't show a clear difference between the two. I'm not 100% sure why, but I suspect it's because Vernier is a sampling profiler and the methods are so fast, on the order of 10 microseconds, so I added another case with 20 tags to exacerbate the differences:
#!/usr/bin/env ruby
require "datadog/statsd"
require "benchmark/ips"
require "vernier"

$dogstats = Datadog::Statsd.new('127.0.0.1', 8125)

def tags1
  $dogstats.increment("metric", tags: {
    k00: "v00",
  })
end

def tags2
  $dogstats.increment("metric", tags: {
    k00: "v00", k01: "v01",
  })
end

def tags20
  $dogstats.increment("metric", tags: {
    k00: "v00", k01: "v01", k02: "v02", k03: "v03", k04: "v04", k05: "v05",
    k06: "v06", k07: "v07", k08: "v08", k09: "v09", k10: "v10", k11: "v11",
    k12: "v12", k13: "v13", k14: "v14", k15: "v15", k16: "v16", k17: "v17",
    k18: "v18", k19: "v19",
  })
end

Vernier.profile(out: "dogtags.json") do
  Benchmark.ips_quick(:tags1, :tags2, :tags20)
end
I set the CPU governor to performance to avoid frequency scaling during the run:
cpupower frequency-set -g performance
and then ran the script, which gave me the following output:
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +PRISM [x86_64-linux]
Warming up --------------------------------------
               tags1    10.294k i/100ms
               tags2     9.298k i/100ms
              tags20     3.450k i/100ms
Calculating -------------------------------------
               tags1    102.344k (± 1.4%) i/s    (9.77 μs/i) -    514.700k in   5.030109s
               tags2     90.594k (± 2.0%) i/s   (11.04 μs/i) -    455.602k in   5.031253s
              tags20     33.369k (± 2.6%) i/s   (29.97 μs/i) -    169.050k in   5.069859s

Comparison:
               tags1:   102344.3 i/s
               tags2:    90593.9 i/s - 1.13x slower
              tags20:    33369.2 i/s - 3.07x slower
and a flame graph which revealed that adding more tags increased time spent in serialisation more than anything else, especially in the TagSerializer.
There was indeed a lot going on in tag serialisation. Just escaping tags required string lookups and substitutions, which is generally expensive and potentially allocates new objects:
def escape_tag_content(tag)
  tag = tag.to_s
  return tag unless tag.include?('|') || tag.include?(',')
  tag.delete('|,')
end
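The allocation behaviour is easy to see in isolation. This standalone sketch reproduces the method above with made-up tag values:

```ruby
# Copy of dogstatsd-ruby's tag escaping, for demonstration.
def escape_tag_content(tag)
  tag = tag.to_s
  return tag unless tag.include?('|') || tag.include?(',')
  tag.delete('|,')
end

clean = "env:prod"
dirty = "env:pr|od,"

# Clean tags come back as-is: String#to_s returns self, so no allocation.
p escape_tag_content(clean).equal?(clean)  # => true
# Tags containing reserved characters allocate a new, stripped string.
p escape_tag_content(dirty)                # => "env:prod"
```

Each dirty tag therefore costs two substring scans plus a fresh string, and even clean tags pay for the two include? scans on every emission.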
Instead of wrapping dogstatsd-ruby and manipulating it from the outside, I moved to monkeypatching it from the inside. My solution was to pre-escape the map of metrics to owners at boot time and override the StatSerializer#format method to inject them, bypassing the usual serialisation. Here's a proof of concept:
#!/usr/bin/env ruby
require "datadog/statsd"
require "benchmark/ips"

# Mapping of metrics to owners; missing metrics fall back to "UNKNOWN"
$map ||= Hash.new("UNKNOWN")
$map["a.metric.for.test"] = "someonenotme"
$map["another.metric"] = "alsonotme"

# Pre-escape at boot as it's expensive at runtime
$map.transform_values! do |tag|
  tag.to_s.delete('|,')
end

# Injects metric attribution tag during serialisation so that it's fast and can
# be done async with the delay_serialization option
module FastAttributionTag
  def format(name, delta, type, tags: [], sample_rate: 1)
    name = formated_name(name)
    if sample_rate != 1
      if tags_list = tag_serializer.format(tags)
        "#{@prefix_str}#{name}:#{delta}|#{type}|@#{sample_rate}|#owner:#{$map[name]},#{tags_list}"
      else
        "#{@prefix_str}#{name}:#{delta}|#{type}|@#{sample_rate}|#owner:#{$map[name]}"
      end
    else
      if tags_list = tag_serializer.format(tags)
        "#{@prefix_str}#{name}:#{delta}|#{type}|#owner:#{$map[name]},#{tags_list}"
      else
        "#{@prefix_str}#{name}:#{delta}|#{type}|#owner:#{$map[name]}"
      end
    end
  end
end

$dogstats_standard = Datadog::Statsd.new("127.0.0.1", 8125)
$dogstats_attributed = Datadog::Statsd.new("127.0.0.1", 8125)

# Override method on instance by prepending module
$dogstats_attributed.instance_eval do
  @serializer.instance_eval do
    @stat_serializer.singleton_class.prepend(FastAttributionTag)
  end
end

def standard
  $dogstats_standard.increment("a.metric.for.test", tags: [])
end

def attributed
  $dogstats_attributed.increment("a.metric.for.test", tags: [])
end

Benchmark.ips_quick(:standard, :attributed)
Its output suggested it could be fast enough, showing negligible performance impact:
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +PRISM [x86_64-linux]
Warming up --------------------------------------
            standard    18.318k i/100ms
          attributed    17.492k i/100ms
Calculating -------------------------------------
            standard    174.778k (± 2.7%) i/s    (5.72 μs/i) -    879.264k in   5.034907s
          attributed    171.389k (± 2.4%) i/s    (5.83 μs/i) -    857.108k in   5.003999s

Comparison:
            standard:   174777.5 i/s
          attributed:   171389.1 i/s - same-ish: difference falls within error
But I didn't deliver it without warning...
My proof of concept seemed sufficient, but was it really a good idea? Our DataDog throughput was on the order of several GiB/s, meaning that any extra byte added to every metric could have dramatic effects on the network and receivers.
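A back-of-the-envelope calculation makes the concern concrete. Every number below is an illustrative assumption, not a measurement:

```ruby
AVG_METRIC_BYTES = 150.0                # assumed average serialised metric size
EXTRA_TAG        = "|#owner:some-team"  # hypothetical owner tag, 17 bytes
THROUGHPUT_GIB_S = 5.0                  # assumed aggregate metric traffic

# How many metrics per second that traffic represents, and how much extra
# bandwidth the owner tag would add across all of them.
metrics_per_sec = THROUGHPUT_GIB_S * 2**30 / AVG_METRIC_BYTES
extra_gib_s     = metrics_per_sec * EXTRA_TAG.bytesize / 2**30

puts "~%.2f GiB/s of extra traffic" % extra_gib_s  # ~0.57 GiB/s
```

Under these assumptions, a single short literal tag adds over half a GiB/s of traffic to the network and the aggregation tier: a roughly 11% increase for one tag.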
This journey produced convincing evidence that such a change was too risky. The team was then able to understand the trade-off and articulate to leadership that it was best to tag as a post-processing step, as previously suggested, compromising on integration with analysis tools rather than on site availability.
Thankfully, this code never saw production. :)