Merge pull request #461 from aalexand/fast-7
Publish abseil.io/fast/7 on productivity metrics.
tituswinters authored Sep 29, 2023
2 parents af238cc + f343ff8 commit 2388bbf
Showing 2 changed files with 117 additions and 2 deletions.
4 changes: 2 additions & 2 deletions _posts/2023-03-02-fast-39.md
@@ -143,8 +143,8 @@ This prefetch appears to be extraordinarily costly: Microbenchmarks measuring
allocation performance show potential savings if it were removed and Google-Wide
Profiling shows 70%+ of cycles in `new`'s fastpath on the prefetch. Removing it
would "reduce" the [data center tax](https://research.google/pubs/pub44271.pdf),
-but we would actually hurt application productivity-per-CPU. Time we spend in
-malloc is
+but we would actually hurt [application productivity](/fast/7)-per-CPU. Time we
+spend in malloc is
[less important than application performance](https://research.google/pubs/pub50370.pdf).

Trace-driven simulations with hardware-validated architectural simulators showed
115 changes: 115 additions & 0 deletions _posts/2023-09-14-fast-7.md
@@ -0,0 +1,115 @@
---
title: "Performance Tip of the Week #7: Optimizing for application productivity"
layout: fast
sidenav: side-nav-fast.html
published: true
permalink: fast/7
type: markdown
order: "007"
---

Originally posted as Fast TotW #7 on June 6, 2019

*By [Chris Kennelly](mailto:[email protected])*

Updated 2023-09-14

Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7)


## Overview

Google manages a vast fleet of servers to handle search queries, process
records, and transcode cat videos. We don't buy those servers to allocate memory
in TCMalloc, to put protocol buffers into other protocol buffers, or to handle
branch mispredictions by our processors.

To make our fleet more efficient, we want to optimize for how productive our
servers are, that is, how much useful work they accomplish per CPU-second, byte
of RAM, disk operation, or unit of hardware-accelerator time. While measuring a
job's resource consumption is easy, it's harder to tell just how much useful
work the job is accomplishing without help.

A task's CPU usage going up could mean it suffered a performance regression, or
it could mean the task is simply busier. Consider a plot of a service's CPU
usage against time, breaking down the total CPU usage of two versions of the
binary. We cannot determine from casual inspection whether the increase in CPU
usage comes from an increase in workload (serving more videos per unit time) or
from a decrease in efficiency (some added, needless protocol conversion per
video).

To determine what is really happening, we need a productivity metric that
captures the amount of real work completed. If we know the number of cat videos
processed, we can easily determine whether we are getting more, or less, real
work done per CPU-second (or byte of RAM, disk operation, or hardware
accelerator time).
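
As a concrete illustration, here is a minimal sketch of such a metric for a
single process. The `videos_transcoded` counter and the use of POSIX
`getrusage` are assumptions made for the example, not a description of an
actual monitoring API:

```cpp
#include <atomic>
#include <cstdint>
#include <sys/resource.h>
#include <sys/time.h>

// Hypothetical counter incremented by the application for each unit of
// useful work (here, one transcoded cat video).
std::atomic<int64_t> videos_transcoded{0};

// CPU-seconds consumed by this process so far (user + system time).
double ProcessCpuSeconds() {
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
  auto seconds = [](const timeval& tv) {
    return tv.tv_sec + tv.tv_usec / 1e6;
  };
  return seconds(usage.ru_utime) + seconds(usage.ru_stime);
}

// The productivity metric: useful work completed per CPU-second.
double VideosPerCpuSecond() {
  return videos_transcoded.load() / ProcessCpuSeconds();
}
```

Tracking this ratio across releases distinguishes "the service is busier" from
"the service got slower" in a way that raw CPU usage cannot.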

If we do not have productivity metrics, we are faced with *entire classes of
optimizations* that are not well-represented by existing metrics:

* **Application speedups through core library changes**:

As seen in our [2021 OSDI paper](https://research.google/pubs/pub50370/),
"one classical approach is to increase the efficiency of an allocator to
minimize the cycles spent in the allocator code. However, memory allocation
decisions also impact overall application performance via data placement,
offering opportunities to improve fleetwide productivity by completing more
units of application work using fewer hardware resources."

Experiments with TCMalloc's hugepage-aware allocator, also known as
Temeraire, have shown considerable speedups that come from improving
application performance, not from reducing time spent in TCMalloc.

We spend more *relative* time in TCMalloc but greatly improve application
performance. Focusing just on relative time in TCMalloc would produce an
error in sign: We'd deprioritize (or even roll back) a strongly positive
optimization.

* Allocating more protocol buffer messages on
[Arenas](https://protobuf.dev/reference/cpp/arenas/) speeds up not just the
protocol buffer code itself (like message destructors), but also the business
logic that uses them. Enabling Arenas in major frameworks allowed them to
process 15-30% more work per CPU, yet protobuf destructor costs were only a
small fraction of that improvement: the gains in data locality can produce
outsized benefits for the entire application (a minimal usage sketch appears
after this list).

* **New instruction sets**: With successive hardware generations, vendors have
added new instructions to their ISAs.

In future hardware generations, we expect to replace calls to `memcpy` with
microcode-optimized `rep movsb` instructions that are faster than any
handwritten assembly sequence we can come up with. We expect `rep movsb` to
have low IPC: It's a single instruction that replaces an entire copy loop of
instructions!

The use of these new instructions can be triggered by optimizing the source
code or through compiler enhancements that improve vectorization.

Focusing on MIPS or IPC would cause us to prefer any implementation that
executes a large number of instructions, even if those instructions take
longer, in total, to copy `n` bytes.

In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with
`--march=haswell` shows a MIPS regression while simultaneously improving
*application productivity*. These instructions can do more work per
instruction; however, replacing several low-latency instructions with one may
mean that *average* instruction latency increases. If we had 10 million
instructions and 10 ms per query, we may now have 8 million instructions
taking only 9 ms per query: QPS goes up while MIPS goes down (from 1,000 MIPS
to roughly 890 MIPS).

Since Google's fleet runs on a wide variety of architectures, we cannot
easily compare instructions across platforms and instead need to compare the
useful work accomplished by an application.

* **Compiler optimizations**: Compiler optimizations can significantly affect
the number of dynamically executed instructions. Techniques such as inlining
reduce function preambles and enable further simplifying optimizations.
Thus, *fewer* instructions translate to *faster*, *more productive* code.

* **Kernel optimizations**: The kernel has many policies around hugepages,
thread scheduling, and other system parameters. While changing these policies
may make the kernel itself nominally more costly, for example, if it does more
work to compact memory, the application benefits can easily outweigh that
added cost.
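
Returning to the Arenas example above, here is a minimal sketch of what arena
allocation can look like in application code. The `myservice::CatVideo` message
and its `set_id` field are hypothetical stand-ins for a real proto generated
with arena support:

```cpp
#include <google/protobuf/arena.h>

#include "myservice/cat_video.pb.h"  // Hypothetical generated proto header.

void ProcessBatch() {
  google::protobuf::Arena arena;
  for (int i = 0; i < 1000; ++i) {
    // Allocate each message on the arena instead of the heap.
    auto* video =
        google::protobuf::Arena::Create<myservice::CatVideo>(&arena);
    video->set_id(i);
    // ... fill in the message and hand it to the business logic ...
  }
  // All messages are released together when `arena` goes out of scope: there
  // is no per-message destructor walk, and the contiguous layout improves
  // data locality for the code that reads the messages.
}
```

The wins in the frameworks mentioned above came less from skipping destructors
than from this improved locality, which is exactly the kind of benefit a
productivity metric captures and a "time spent in malloc" metric misses.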

The availability of these metrics helps infrastructure and efficiency teams
guide their work more effectively.
