11 Sep 2022

Two Measures of Fast: Throughput and Latency

A service can double the requests it completes per second while every individual request gets slower. There is no contradiction, because the two numbers answer different questions: latency is the time one operation takes, from request to response, while throughput is the rate at which the system as a whole completes operations. Much of performance engineering consists of keeping the questions apart: knowing which number a change moves, and at what cost to the other.

The two are coupled, but not as reciprocals, and the tradeoff between them is not one phenomenon. Sometimes a system buys throughput deliberately, as batching does by waiting a little longer to amortize fixed cost. Sometimes the price is arithmetic, paid in queueing when a service runs near capacity. And sometimes there is no tradeoff, because the operation itself can still be made cheaper.

Who Feels Each Number

A person waiting on a page, or an upstream service holding its own caller open, experiences latency one request at a time; that is why a service-level objective (SLO) is usually written against it at a percentile. Where the clock runs matters too: a handler may record 8 ms for a request whose caller waited 120, because proxies, serialization, queueing in front of the handler, and the network in both directions may be invisible to an in-process handler metric. Latency is also never a single number: a busy service produces thousands of measurements per second, so any one figure is a summary of a distribution.

A single caller rarely feels throughput directly. It is mostly the operator’s number, visible in aggregates: queue depths, fleet sizes, bills. It also needs a stated time window, because a rate held for one second and a rate sustained all day are different claims. Capacity planning, autoscaling, and cost per request are all stated in terms of it, and its ceiling is the system’s capacity: the largest throughput it can sustain without falling behind. Confusing the caller’s number with the operator’s leads to missed latency objectives in one direction and over-provisioned fleets in the other.

Work in Flight

The intuition that throughput is just $1/\text{latency}$ comes from the one system where it actually holds: a single worker, handling one operation at a time with no idle gaps, completes 100 requests per second when each takes 10 ms. Concurrency breaks the equivalence. Ten such workers complete up to 1,000 requests per second while each request still takes 10 ms; a nightly pipeline moves millions of rows per second while any given row takes hours to make it through. The link between the two is a third quantity: how much work is in progress at once.

Little’s law, an identity about long-run averages proved in 1961, pins the relationship down: in any stable system keeping up with its load, average work in flight $L$ equals throughput $\lambda$ times average latency $W$, that is, $L = \lambda W$, without requiring Poisson arrivals, first-come-first-served scheduling, or any particular service-time distribution. That generality makes it everyday sizing arithmetic. A service sustaining 2,000 requests per second at 50 ms has $2000 \times 0.05 = 100$ requests in flight on average, so any resource that each request holds for its entire duration needs at least 100 slots. A request that holds a database connection for 20 ms caps a 10-connection pool at $10 / 0.02 = 500$ requests per second, however many threads queue behind it.
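The two sizing calculations above can be written as a pair of one-line helpers. This is a minimal sketch: the function names are illustrative, and the inputs are the figures from the paragraph.

```python
def in_flight(throughput_per_s: float, latency_s: float) -> float:
    """Little's law: average work in flight = throughput x average latency."""
    return throughput_per_s * latency_s


def max_throughput(slots: int, hold_time_s: float) -> float:
    """Little's law rearranged: throughput ceiling = slots / time each slot is held."""
    return slots / hold_time_s


# 2,000 requests per second at 50 ms each
print(f"{in_flight(2000, 0.050):.0f} requests in flight")
# a 10-connection pool, each connection held for 20 ms
print(f"{max_throughput(10, 0.020):.0f} requests per second at most")
```

The same two functions answer most pool-sizing questions: fix any two of the three quantities and the third follows.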

The identity also says which lever moves which number. Concurrency (more workers, I/O that overlaps waiting) can raise throughput without making any single operation faster. Its effect on latency depends on the regime: under load it can reduce queueing, under light load it changes little, and past the point where workers contend for shared resources (locks, CPU, the pool itself) it makes latency worse. Reducing the work itself is the clean case that improves both numbers at once. What remains is the situation where the operation cannot be made faster and one of the numbers still has to improve.

Paying for Throughput

Batching

The first source is amortization. Much of an operation’s cost is fixed (a network round trip has a fixed component whether it carries one row or a hundred; a disk flush such as fsync has a fixed cost whether it commits one transaction or twenty), so items that are sent together share it. At 2 ms of overhead plus 0.1 ms per row, inserts move about 480 rows per second one at a time and about 8,300 in batches of a hundred, on the same hardware. The cost is waiting: a batch is only worth sending once it is reasonably full, so an early row sits in a buffer until the rest arrive or a timer expires, and that hold becomes part of its latency.
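The amortization arithmetic is easy to reproduce. The sketch below uses the 2 ms and 0.1 ms figures from the paragraph. The batch sizes other than 1 and 100 are added for illustration, and the model assumes batches are sent back to back by a single sender.

```python
OVERHEAD_MS = 2.0   # fixed cost per round trip (figure from the text)
PER_ROW_MS = 0.1    # marginal cost per row (figure from the text)


def rows_per_second(batch_size: int) -> float:
    """Rows moved per second when every round trip carries batch_size rows."""
    batch_ms = OVERHEAD_MS + PER_ROW_MS * batch_size
    return batch_size / batch_ms * 1000.0


for n in (1, 10, 100, 1000):
    print(f"batch {n:4d}: {rows_per_second(n):8.0f} rows/s")
```

However large the batch, the rate cannot exceed $1/0.1\,\text{ms} = 10{,}000$ rows per second, so most of the gain arrives early and the remaining increments buy little for the extra hold time.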

This tradeoff is common enough that many systems expose it directly as configuration. A Kafka producer can wait up to linger.ms for more records before sending a batch of up to batch.size. TCP’s Nagle algorithm can coalesce small writes by holding them while earlier data remains unacknowledged, and latency-sensitive protocols often turn it off with TCP_NODELAY. PostgreSQL’s commit_delay can briefly delay a write-ahead-log flush when several transactions are committing, so they can share it. In each case the waiting is explicit and adjustable.
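As one concrete instance, Nagle's algorithm can be switched off per socket. The sketch below uses Python's standard socket module on an unconnected socket, so it runs without a network peer; whether disabling coalescing is the right choice depends on the protocol.

```python
import socket

# Create a TCP socket and ask the kernel to send small writes immediately
# instead of holding them while earlier data is unacknowledged.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", bool(nodelay))
sock.close()
```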

The same tradeoff appears in the serving of large language models (LLMs): batching concurrent requests keeps an accelerator busy at the cost of the delay before each caller’s first token of output, and Orca, a recent serving system, admits new requests into the running batch at every generation step to soften that cost.¹

What keeps this tradeoff benign is that its price is bounded and chosen: a 10 ms linger caps the hold at 10 ms, and the knob turns both ways.

Running Hot

The second source is utilization, and it is not chosen in the same deliberate way. Under load, a request’s time splits into service time $S$, the work itself, and waiting time, spent queued for busy resources. Optimization can shrink service time, but waiting time is set by load, because traffic is irregular: arrivals burst and some operations run long. A system that is underloaded on average can still be overloaded in moments, and the busier it runs, the longer each of those queues takes to drain.

For the simplest standard queue (one server, first-come-first-served, exponential arrival gaps and service times: M/M/1), the average time in system at utilization $\rho$ is

$$T = \frac{S}{1-\rho}$$

That is twice the service time at 50% utilization, ten times at 90%, twenty at 95%. The exact model matters less than that shape, nearly flat for a long stretch and then bending sharply upward at what is conventionally called the knee. A short simulation, using nothing beyond the standard library, checks the prediction and shows the tail:

import random
random.seed(7)

SERVICE_MS = 10.0     # mean time the work itself takes, with no waiting
JOBS = 500_000        # arrivals per load level; still a short run near saturation

for util in (0.50, 0.70, 0.80, 0.90, 0.95):
    clock = free_at = 0.0
    latencies = []
    for _ in range(JOBS):
        clock += random.expovariate(util / SERVICE_MS)  # arrivals at util × capacity
        start = max(clock, free_at)
        service = random.expovariate(1 / SERVICE_MS)
        free_at = start + service
        latencies.append(free_at - clock)               # waiting + service
    latencies.sort()
    mean = sum(latencies) / JOBS
    print(f"util {util:.0%}:  mean {mean:5.1f} ms   "
          f"p99 {latencies[int(JOBS * 0.99)]:6.1f} ms")

With the seed fixed, the sweep prints:

util 50%:  mean  20.0 ms   p99   92.5 ms
util 70%:  mean  33.4 ms   p99  156.5 ms
util 80%:  mean  50.9 ms   p99  230.7 ms
util 90%:  mean  96.2 ms   p99  421.7 ms
util 95%:  mean 187.6 ms   p99  792.1 ms

The mean tracks $S/(1-\rho)$ closely (the 95% row falls short of its predicted 200 ms because even half a million arrivals is a short run that close to saturation). The numbers make the tradeoff concrete: from the first row to the last, throughput rises by a factor of 1.9 while mean latency rises ninefold, and the 99th percentile (p99) of a job with a 10 ms mean service time approaches 0.8 seconds, with the growth dominated by queueing. The exponential assumptions are a modeling convenience, not the cause: burstier arrivals or a wider mix of cheap and expensive requests make the curve worse.

Headroom looks like waste on a utilization dashboard, but it is what keeps a system on the flat part of the curve. For latency-sensitive workloads, utilization targets often sit well below saturation because the knee turns the difference into user-visible delay, and burstier workloads need lower targets still. The exact number varies by workload; the point is that the target should come from the measured curve, not from the urge to keep machines busy.

Overload

Utilization has a hard edge. In an idealized system with an unbounded queue, once arrivals exceed capacity, the queue grows without bound: latency climbs while throughput holds at its maximum. Real systems usually hit other limits first: queues fill, callers time out, retries and timeout-driven rework eat capacity, and useful throughput (goodput) sags below the nominal maximum. At that point the tradeoff is over. A brief spike can push a system past saturation, and the retry traffic alone can then hold it there; the pattern has been studied as metastable failure.²

Overload also splits throughput from goodput: once callers have timed out, completing their requests produces answers no caller can use. The defenses share a principle: protect accepted work by bounding how much extra work can enter or remain in the system. A bounded queue rejects new work rather than buffering minutes of backlog. Propagated deadlines abandon doomed work early. Load shedding rejects the excess outright, and backpressure slows producers or propagates pressure upstream before it becomes local backlog.
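Two of these defenses fit in a few lines. The following is a minimal sketch, not a production admission controller: the names `admit` and `next_live`, the depth of 3, and the deadlines are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
import queue

MAX_DEPTH = 3  # hypothetical bound; a real one would come from the latency budget
backlog = queue.Queue(maxsize=MAX_DEPTH)


def admit(request: str, deadline: float) -> bool:
    """Accept the request, or shed it immediately if the queue is full."""
    try:
        backlog.put_nowait((deadline, request))
        return True
    except queue.Full:
        return False


def next_live(now: float):
    """Pop queued work, discarding anything whose deadline has already passed."""
    while True:
        try:
            deadline, request = backlog.get_nowait()
        except queue.Empty:
            return None
        if deadline > now:
            return request


# Five arrivals against a queue of depth 3: the last two are rejected.
admitted = [admit(f"req-{i}", deadline=1.0 + i) for i in range(5)]
print(admitted)            # [True, True, True, False, False]

# At t = 2.5 the first two deadlines (1.0 and 2.0) have passed.
print(next_live(now=2.5))  # req-2
```

Rejecting at the door is cheap for both sides, and skipping expired work at dequeue keeps capacity for requests that someone is still waiting on.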

Buying Latency Back

So far, the system has mostly spent latency to buy throughput: wait a little longer, keep resources busy, accept queueing. The trade also runs in the other direction. Throughput can be spent to improve latency.

The headroom of the previous section is the simplest case: idle capacity is throughput per machine given up to keep waits short. Hedged requests spend it more directly: a call still unanswered after a threshold, such as the 95th-percentile latency for that kind of request, is resent to another replica, the first answer wins, and the duplicate work buys a shorter tail.³ Hedging is useful when the operation is safe to duplicate, the extra load is capped, and the system has enough spare capacity that it does not become overload by another name.
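A hedged call can be sketched with asyncio. Everything named here is an assumption for illustration: `fake_call` stands in for a real idempotent RPC, the replica names and delays are invented, and the 50 ms threshold is arbitrary. The sketch also omits the cap on total hedge traffic that a real client would need.

```python
import asyncio

# Hypothetical per-replica response times in seconds; "replica-a" is the straggler.
DELAYS = {"replica-a": 0.50, "replica-b": 0.01}


async def fake_call(replica: str) -> str:
    """Stand-in for a real RPC that is safe to issue twice."""
    await asyncio.sleep(DELAYS[replica])
    return replica


async def hedged(call, primary: str, backup: str, hedge_after_s: float) -> str:
    """Call primary; if it is still unanswered after hedge_after_s, also call backup."""
    first = asyncio.create_task(call(primary))
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()          # answered in time: no duplicate work
    second = asyncio.create_task(call(backup))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()                  # stop paying for the loser
    return next(iter(done)).result()


winner = asyncio.run(hedged(fake_call, "replica-a", "replica-b", hedge_after_s=0.05))
print("answered by", winner)
```

With these delays the caller waits roughly 60 ms instead of 500 ms, at the price of one extra request to the backup.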

Precomputation spends throughput ahead of time. Warm caches, materialized views (query results stored ready-made), and search indexes do bulk work in advance, where latency does not matter, so that the interactive read does almost nothing; the read is fast because the work was already done elsewhere. Read-through caches are the less tidy version of the same idea: after the first miss, repeated work moves off the hot path, paid for with memory, invalidation, and stale-data risk. Each of these consumes capacity, memory, or freshness, and the cost is best measured when the trade is made.

Splitting the Paths

The levers so far tune a single path. A stronger option is to separate the paths, because most systems contain both kinds of work, and serving both from one place forces one configuration onto both. Once separated, each path can be tuned for the number that matters to it. The interactive path keeps as little as possible before the response (validating input, recording the intent durably, queueing the rest) and runs with small batches, short queues, and headroom. Deferrable work (fulfillment, notifications, indexing, analytics, log shipping) moves behind a queue onto a path where mostly throughput matters, so batching can be aggressive and utilization high, because no caller is synchronously waiting.

The queue in the middle absorbs bursts that would otherwise become synchronous waiting and lets the back end consume steadily near capacity while the front stays bursty. It also changes what latency means: deferred work is measured by backlog age, so “eventually consistent” still needs a number attached (consumer lag on a Kafka topic, replication delay on a read replica), and an unbounded backlog is just overload arriving slowly.
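The split and its freshness number can be sketched with an in-memory queue. This is an illustration only: a real system would use a durable queue, and the names `handle`, `drain_batch`, and `backlog_age` are hypothetical. Timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque

backlog = deque()  # entries are (enqueued_at_s, item)


def handle(item: str, now: float) -> str:
    """Interactive path: record the intent, defer the rest, answer immediately."""
    backlog.append((now, item))
    return "accepted"


def drain_batch(max_items: int) -> list:
    """Throughput path: take up to max_items of deferred work in one batch."""
    batch = []
    while backlog and len(batch) < max_items:
        batch.append(backlog.popleft()[1])
    return batch


def backlog_age(now: float) -> float:
    """Age of the oldest deferred item: the latency figure of the deferred path."""
    return now - backlog[0][0] if backlog else 0.0


for t in range(5):
    handle(f"order-{t}", now=float(t))

print(backlog_age(now=10.0))   # 10.0: the oldest item was enqueued at t = 0
print(drain_batch(3))          # ['order-0', 'order-1', 'order-2']
print(backlog_age(now=10.0))   # 7.0: the oldest remaining item is from t = 3
```

An alert on `backlog_age` rather than on queue length catches the case where the backlog is short but stuck.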

Reading the Curve

None of these decisions can be made from single numbers. Latency was a distribution back in the first section; here its shape starts to matter: the median describes the typical request, the p99 describes the unlucky one in a hundred. The two also degrade for different reasons: the median moves with overall load, while the tail reflects lock contention, garbage-collection pauses, cold caches, and slow dependencies. Fan-out multiplies the tail’s reach: if a page is assembled from 100 independent backend calls, each slow just 1% of the time, it includes a slow call on $1 - 0.99^{100} \approx 63\%$ of page loads. At scale, enough fan-out turns rare tail events into routine page-level behavior.
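The fan-out figure follows from treating the backend calls as independent. A short sketch, with fan-out values other than 100 added for comparison:

```python
def p_any_slow(fanout: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


for n in (1, 10, 100, 1000):
    print(f"fan-out {n:4d}: {p_any_slow(n):6.1%} of page loads include a slow call")
```

Independence is the optimistic assumption here. Calls that share a slow dependency are correlated, which changes the arithmetic but not the direction of the effect.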

A latency figure is equally incomplete without its load: 12 ms at 30% utilization and 80 ms at 85% can be the same system. The useful summary is the full curve: latency percentiles plotted against offered load (the rate at which work arrives). Capacity is then read off the curve as the highest throughput at which the SLO percentile still holds.
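Reading capacity off a curve is a small search. The sketch below reuses the p99 figures printed by the M/M/1 simulation earlier in the article, converted to offered load for a server whose 10 ms mean service time gives it a ceiling of 100 requests per second. The 250 ms objective is a hypothetical SLO chosen for illustration.

```python
# (offered load in requests/s, p99 latency in ms), from the simulation output
CURVE = [(50, 92.5), (70, 156.5), (80, 230.7), (90, 421.7), (95, 792.1)]


def capacity(curve, slo_ms: float):
    """Highest measured load at which the percentile still meets the SLO.

    Stops at the first violation, so a lucky measurement beyond the knee
    cannot inflate the answer. Returns None if no measured point qualifies.
    """
    best = None
    for load, p99_ms in sorted(curve):
        if p99_ms > slo_ms:
            break
        best = load
    return best


print(capacity(CURVE, slo_ms=250))  # 80
```

Against this curve, a 250 ms p99 objective means running the server at no more than 80% of its nominal ceiling.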

Producing that curve has a classic pitfall: the load generator has a model too. A closed-loop generator (a fixed set of virtual users, each waiting for one response before sending the next) slows down when the target does, throttling its load at the very moment an open-arrival workload would have queued. Many production workloads look closer to open arrivals than to a fixed population of perfectly patient users, though closed and backpressured systems, where real clients do slow down when the server does, need their own model.⁴ The same feedback produces coordinated omission: during a stall the generator does not send the requests it had scheduled, so the worst intervals are missing from the data. For open-arrival workloads, constant-rate tools such as wrk2, recording into HdrHistogram, measure against the intended schedule instead.
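A deterministic toy model shows how large the gap can be. All parameters are assumptions for illustration: a server that takes 1 ms per request, an intended schedule of one request every 10 ms, and a single one-second stall. The closed-loop generator is one virtual user that waits for each response, while the second measurement times each request from its intended send time.

```python
SERVICE_MS = 1
INTERVAL_MS = 10        # intended schedule: one request every 10 ms
STALL = (1000, 2000)    # hypothetical one-second server stall, in ms
DURATION_MS = 3000


def serve(start_ms: int) -> int:
    """Completion time of a request the server can begin at start_ms."""
    if STALL[0] <= start_ms < STALL[1]:
        start_ms = STALL[1]                     # nothing starts during the stall
    return start_ms + SERVICE_MS


def closed_loop() -> list:
    latencies, sent = [], 0
    while sent < DURATION_MS:
        done = serve(sent)
        latencies.append(done - sent)           # clock starts at the actual send
        sent = max(sent + INTERVAL_MS, done)    # next send waits for the response
    return latencies


def open_loop() -> list:
    latencies, free_at = [], 0
    for due in range(0, DURATION_MS, INTERVAL_MS):
        done = serve(max(due, free_at))         # queue behind earlier requests
        free_at = done
        latencies.append(done - due)            # clock starts at the intended send
    return latencies


def p99(samples: list) -> int:
    return sorted(samples)[len(samples) * 99 // 100]


for name, run in (("closed", closed_loop), ("open", open_loop)):
    data = run()
    print(f"{name:6s}: {len(data)} samples, p99 {p99(data)} ms, max {max(data)} ms")
```

Both runs see the same 1,001 ms worst case, but the closed loop records it once among 201 samples and reports a p99 of 1 ms, while the schedule-based measurement keeps all 300 samples and reports a p99 of 983 ms.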

A Choice, Arithmetic, or Neither

Much of the time there is no tradeoff at all, because the operation is not yet minimal. A cache in front of repeated computation, an index under a scanning query, a query-per-item pattern collapsed into one round trip: each cuts $S$, which lowers latency and raises maximum throughput ($1/S$ per worker) at once, shifting the whole curve down and to the right. A tradeoff accepted at that stage pays for throughput that optimization would have delivered for free.

Once the operation is as cheap as it will get, what remains is budgeting. The latency objective fixes how far up the curve a fleet can sit; the measured curve converts that ceiling into requests per second per machine; expected volume converts machines into cost.
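The budgeting chain is plain arithmetic. Every figure in this sketch is hypothetical: a peak of 12,000 requests per second, 80 requests per second per machine at the SLO, and a price of 0.20 per machine-hour.

```python
import math


def fleet_size(peak_rps: float, rps_per_machine_at_slo: float) -> int:
    """Machines needed so that each stays at or below its SLO-safe load."""
    return math.ceil(peak_rps / rps_per_machine_at_slo)


PEAK_RPS = 12_000            # hypothetical expected volume
RPS_PER_MACHINE = 80         # hypothetical capacity read off a measured curve
PRICE_PER_MACHINE_HOUR = 0.20  # hypothetical price

machines = fleet_size(PEAK_RPS, RPS_PER_MACHINE)
print(f"{machines} machines, {machines * PRICE_PER_MACHINE_HOUR:.2f} per hour")
```

A tighter latency objective lowers the per-machine figure, and the fleet and the bill grow in proportion.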

Most of the recurring decisions reduce to a few rules:

Work removed from an operation improves both numbers; profiling precedes knob-turning.
Batching is bought with a deadline: linger-style knobs get caps chosen against the latency budget.
Latency is bought with throughput: headroom, hedges, and precomputation are deliberate purchases, with utilization targets read off the SLO and the measured curve.
Queues absorb bursts, not backlogs: bounded depth, propagated deadlines, and early shedding keep overload from feeding itself.
Work that no caller needs before the response does not belong on a path where somebody is waiting; once deferred, it owes a freshness number, because backlog age is the latency of a throughput path.
A latency claim needs a load, a percentile, and a stated load model; the result worth keeping is the curve.

A system is not meaningfully fast until both of its speeds are stated: the time one request takes, and the rate at which the system completes them, each at the load where it was measured.

References

1. Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538.
2. Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu. 2021. Metastable failures in distributed systems. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS ’21). 221–227.
3. Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Communications of the ACM 56, 2 (2013), 74–80.
4. Bianca Schroeder, Adam Wierman, and Mor Harchol-Balter. 2006. Open versus closed: A cautionary tale. In Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI 06). 239–252.

Boyang Yue