diff --git a/_posts/2023-03-02-fast-39.md b/_posts/2023-03-02-fast-39.md index 47ae58a..1f1f010 100644 --- a/_posts/2023-03-02-fast-39.md +++ b/_posts/2023-03-02-fast-39.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #39 on January 22, 2021 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Alkis Evlogimenos](mailto:alkis@evlogimenos.com)* -Updated 2023-03-02 +Updated 2023-10-10 Quicklink: [abseil.io/fast/39](https://abseil.io/fast/39) diff --git a/_posts/2023-03-02-fast-9.md b/_posts/2023-03-02-fast-9.md index 49763ef..d6f0189 100644 --- a/_posts/2023-03-02-fast-9.md +++ b/_posts/2023-03-02-fast-9.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #9 on June 24, 2019 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2023-03-02 +Updated 2023-10-10 Quicklink: [abseil.io/fast/9](https://abseil.io/fast/9) @@ -145,9 +145,9 @@ in 2008. microbenchmark; and it makes it easier to revert to the reference code when (not if) the machine-dependent implementation outlives its usefulness. * Include a microbenchmark with your change. -* When designing or changing configuration knobs, ensure that the choices stay - optimal over time. Frequently, overriding the default can lead to suboptimal - behavior when the *default changes* by pinning things in a +* When [designing or changing configuration knobs](/fast/52), ensure that the + choices stay optimal over time. Frequently, overriding the default can lead + to suboptimal behavior when the *default changes* by pinning things in a worse-than-out-of-the-box state. Designing the knobs [in terms of the outcome](https://youtu.be/J6SNO5o9ADg?t=1521) rather than specific behavior aspects can make such overrides easier (or even possible) diff --git a/_posts/2023-09-14-fast-7.md b/_posts/2023-09-14-fast-7.md index 9782e55..1681575 100644 --- a/_posts/2023-09-14-fast-7.md +++ b/_posts/2023-09-14-fast-7.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #7 on June 6, 2019 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2023-09-14 +Updated 2023-10-31 Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7) diff --git a/_posts/2023-09-30-fast-52.md b/_posts/2023-09-30-fast-52.md new file mode 100644 index 0000000..c2fb210 --- /dev/null +++ b/_posts/2023-09-30-fast-52.md @@ -0,0 +1,184 @@ +--- +title: "Performance Tip of the Week #52: Configuration knobs considered harmful" +layout: fast +sidenav: side-nav-fast.html +published: true +permalink: fast/52 +type: markdown +order: "052" +--- + +Originally posted as Fast TotW #52 on September 30, 2021 + +*By [Chris Kennelly](mailto:ckennelly@google.com)* + +Updated 2023-09-30 + +Quicklink: [abseil.io/fast/52](https://abseil.io/fast/52) + + +Flags, options, and other mechanisms to override default behaviors are useful +during a migration or as a short-term mechanism to address an unusual need. In +the long term they go stale (not providing real benefit to users), are almost +always haunted (in the +[haunted graveyard](https://www.usenix.org/sites/default/files/conference/protected-files/srecon17americas_slides_reese.pdf) +sense), and prevent centralized consistency/optimization efforts. In this +episode, we discuss the tradeoffs in technical debt and optimization velocity +for adding configurability. + +## The ideal flag lifecycle + +When developing a new feature, it's straightforward and often recommended to +guard it behind a flag. 
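+For example, a minimal sketch of such a guard (the flag and feature names here
+are hypothetical, and we assume Abseil's flags library):
+
+```
+#include "absl/flags/flag.h"
+
+ABSL_FLAG(bool, use_new_widget_cache, false,
+          "If true, serve widgets from the experimental cache.");
+
+void ServeWidget() {
+  if (absl::GetFlag(FLAGS_use_new_widget_cache)) {
+    // New feature: can be enabled (and rolled back) without a new release.
+  } else {
+    // Established default behavior.
+  }
+}
+```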
+
+This approach of using
+[feature flags](https://abseil.io/resources/swe-book/html/ch24.html#continuous_delivery-id00035)
+makes it possible to decouple pushing the code changes to production from
+turning on a new feature, which might have undiscovered correctness bugs or
+different resource requirements.
+
+For a commonly-used library, flags also allow early opt-ins from users. When
+the default is changed, the flag also provides an escape hatch to revert to the
+old behavior.
+
+For example, this was employed successfully for the rollout of
+[TCMalloc](https://github.com/google/tcmalloc/blob/master/tcmalloc)'s
+[Huge Page Aware Allocator optimization](https://research.google/pubs/pub50370.pdf):
+many applications opted in early, but even with extensive testing, a few
+applications saw changes in their resource requirements. These applications
+could opt out while deeper investigation occurred, without rolling back the
+efficiency gains seen by most other users of TCMalloc.
+
+These experiences suggest flags are an unalloyed good in theory, but practice
+is wholly different. Whether a flag turns out to be good depends on what
+percentage of users will use the feature:
+
+*   If the number of users of a flag is always expected to be small, its
+    existence hampers future evolution.
+*   If that number is in the middle, the complexity can be well-justified, but
+    it can be challenging to set the flag optimally. Some teams have observed
+    that, often, only the authors of features have the necessary context to
+    set the flag appropriately--either to know when to set it at all or which
+    value it should actually be set to.
+*   If that number is near 100%--probably because we're transitioning to a new
+    default and the flag exists to provide an opt-out--this can be a good use
+    of a flag. Nonetheless, it is important to clean that flag up after the
+    rollout is complete so it doesn't linger indefinitely. Without the
+    cleanup, this becomes technical debt that hinders future changes or
+    becomes a "standard knob with a weird name."
+
+## Flags failing to convey intent
+
+The units for many flags are entirely opaque and often have second- or
+third-order effects that may not be immediately intuitive.
+
+In his [2021 CppCon talk](https://www.youtube.com/watch?v=J6SNO5o9ADg), Titus
+Winters makes a real-world note of this phenomenon: The "popcorn button" of
+microwaves should not be used for microwave popcorn, as the button does not
+align with the settings the popcorn requires.
+
+Moving to Google's C++ codebase: SwissMap, Abseil's family of high-performance
+hashtables, does not provide a working implementation of the
+`max_load_factor` knob. The low utility of `max_load_factor` was uncovered
+during the migration to SwissMap. Even worse, in many of the situations where
+`max_load_factor` was set, it was set incorrectly.
+
+Even when the role of `max_load_factor` was correctly understood, its value
+was often misconfigured to achieve a desired goal. While
+`max_load_factor(0.25)` might convey an intent to "trade RAM for speed," such
+a setting can make CPU performance worse while simultaneously using more RAM,
+defeating the intent of its user.
+
+In other situations, different implementations can be API-compatible, but
+their behaviors do not transfer effectively between implementations. Open
+addressing hashtables have typical load factors <1, while chained hashtables
+have load factors typically ≥1. Changing between these implementations would
+cause the `max_load_factor` to have a surprisingly different effect.
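+A minimal sketch of the same call against both implementations:
+
+```
+#include <unordered_map>
+
+#include "absl/container/flat_hash_map.h"
+
+void TuneLoadFactors() {
+  std::unordered_map<int, int> chained;
+  chained.max_load_factor(0.25f);  // Chained table: rehashes sooner.
+
+  absl::flat_hash_map<int, int> flat;
+  flat.max_load_factor(0.25f);  // Accepted for API compatibility; no effect.
+}
+```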
+
+This experience led the SwissMap authors to make `max_load_factor` a no-op,
+providing it only for API compatibility.
+
+## Stale configuration parameters
+
+Tuning a configuration is another optimization that
+[does not age well](/fast/9).
+
+For flags defined in commonly used libraries, the defaults themselves have
+probably evolved: a feature was launched or an optimization landed. The nature
+of Google's production configuration languages often means that once a service
+has hard-coded a flag's value, it takes precedence over the default. That
+precedence was the whole reason for choosing a non-default value in the first
+place; but with the codebase evolving at a high rate, it's easy to overlook
+that the underlying infrastructure has improved and that the overriding value
+is now *worse* than the default.
+
+The key action here is to override flags sparingly and to regularly reconsider
+existing overrides. When designing new options, prefer good defaults or make
+parameters self-tune if possible. Self-tuning may come in the form of adapting
+automatically to workloads, rather than requiring careful tuning through
+flags.
+
+## Reduced long-term velocity
+
+Titus Winters notes that "If 99% of your users understand an API's behavior
+through the lens of the default setting, the 1% of users that change that
+setting are at risk: APIs built at a higher level have a good chance of
+assuming the default behavior, leaving your 1% semi-supported."
+
+Configurability can be a great short-term boon, but over the long term it is a
+double-edged sword. Options increase the state space that has to be considered
+with every future change, making it more difficult to reason about, test, and
+successfully land new features in production. Beyond just optimizing *costs*,
+this complexity also hampers achieving better business objectives: Extra
+complexity that delays an improvement to product experiences is a non-obvious
+externality.
+
+For example, TCMalloc has a number of tuning options and customization points,
+but ultimately, several optimizations came from sanding away extra
+configuration complexity. The rarely used malloc hooks API required careful
+structuring of TCMalloc's fast path to allow users who didn't use hooks--most
+users--to not pay for their possible presence. In another case, removing the
+`sbrk` allocator allowed TCMalloc to structure its virtual address space
+carefully, enabling several enhancements.
+
+## Beyond knobs
+
+While this discussion has largely focused on knobs and tunables, APIs and
+libraries have the same challenges.
+
+An existing library, *X*, might be inadequate or insufficiently expressive,
+which can motivate building a "better" alternative, *Y*, along some
+dimensions. Realizing the benefit of using *Y* is dependent on users both
+discovering *Y* and picking between *X* and *Y* *correctly*--and in the case
+of a long-lived code base, keeping that choice optimal over time.
+
+For some uses, this strategy is infeasible. `my::super_fast_string` will
+probably never replace `std::string` because the latter is so entrenched and
+the impedance mismatch of living in an independent string ecosystem exceeds
+the benefits. Multiple
+[vocabulary types](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2125r0.pdf)
+suffer from impedance mismatch--costly interconversions can overwhelm the
+overall benefits. The costs of migrating the world also need to be considered
+upfront. Without active migration, we simply end up with two of everything.
+
+There are times when a new library or API is truly needed --
+[SwissMap](https://abseil.io/about/design/swisstables) needed to break
+stability guarantees provided by `std::unordered_map` on an
+instance-by-instance basis to avoid waiting for every problematic usage to be
+fixed. In that case, however, the performance benefits it provided were only
+realized by active migration. Being able to aim for a complete migration eases
+maintenance and educational burdens as well. A compelling performance case
+simplified to "just use SwissMap" avoids the need for painstaking benchmarking
+with every use where the optimal choice could get out of date.
+
+## Best practices
+
+When adding new customization points, consider how they'll evolve over the
+long term.
+
+*   When using flags to gate new features that will be enabled by default,
+    make a plan for removing any opt-outs so the flag itself can be removed,
+    rather than ending up as technical debt.
+*   Flags are a powerful tool for tuning and optimization, but the author of a
+    customization point has the most context for how to use it effectively.
+    Choosing good defaults or making features self-tune is often better for
+    the codebase as a whole. For everyone else, discoverability of a knob, let
+    alone optimal selection, is challenging.
diff --git a/_posts/2023-10-10-fast-64.md b/_posts/2023-10-10-fast-64.md
new file mode 100644
index 0000000..7014fbf
--- /dev/null
+++ b/_posts/2023-10-10-fast-64.md
@@ -0,0 +1,235 @@
+---
+title: "Performance Tip of the Week #64: More Moore with better API design"
+layout: fast
+sidenav: side-nav-fast.html
+published: true
+permalink: fast/64
+type: markdown
+order: "064"
+---
+
+Originally posted as Fast TotW #64 on October 21, 2022
+
+*By [Chris Kennelly](mailto:ckennelly@google.com)*
+
+Updated 2023-10-10
+
+Quicklink: [abseil.io/fast/64](https://abseil.io/fast/64)
+
+
+Optimizing library implementations only carries us so far in making software
+more efficient. In this episode, we discuss the importance of good APIs and
+the right abstractions for finding optimization opportunities. Since we can
+make hardware--especially with the end of Moore's Law--and software run only
+so fast, the right abstractions give us continued optimization opportunities.
+
+## Correctness is paramount
+
+We can simplify an implementation down to `return 42;` regardless of the input
+to see blazing fast results, but an API that doesn't work correctly isn't
+doing its job.
+
+"Subtle" and "clever" code has costs for both maintainers and users alike.
+Today's tricky edge cases can be tomorrow's headaches when we try to optimize
+an implementation. Threading the needle of preserving explicitly (or
+[implicitly](https://hyrumslaw.com)) promised quirks makes the optimization
+process slower and more fragile over time. Being able to
+[iterate](https://en.wikipedia.org/wiki/OODA_loop) [faster](/fast/39) helps
+with exploring more of the design space to find the best minimum.
+
+At times, we may need to break abstraction boundaries or have complex
+preconditions to unlock the best possible performance. We need to document and
+test these sharp edges. Future debugging has an opportunity cost: When we
+spend time tracking down and fixing bugs, we are not developing new
+optimizations. We can use assertions for preconditions, especially in
+debug/sanitizer builds, to double-check contracts and *enforce* them. Testing
+robots never sleep, while humans are fallible.
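+
+As a minimal sketch of enforcing such a precondition (the function and its
+contract are hypothetical):
+
+```
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Contract: index must be in range. The assert double-checks the contract in
+// debug/sanitizer builds; release builds keep the unchecked fast path.
+int UncheckedGet(const std::vector<int>& v, size_t index) {
+  assert(index < v.size() && "UncheckedGet requires index < v.size()");
+  return v[index];
+}
+```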
+
+Randomized implementation behaviors provide a useful bulwark against Hyrum's
+Law, keeping users from implicitly expanding the contract of an interface by
+depending on incidental behavior.
+
+## Express intents
+
+Small, composable operations give users flexibility to express their intents
+more clearly. We can find optimizations by combining high-level but related
+concepts.
+
+Consider `memcpy` and a hypothetical `memcpy_but_faster` API that we could
+build. They both express the same intent, but presumably with
+[different tradeoffs around performance](/fast/52).
+
+*   Users need to think about which one to call. This adds a cognitive cost to
+    every call site. They cannot quickly reach for precisely one to realize
+    their desired functionality. When in doubt, users tend to reach for
+    whichever takes fewer characters to type. Over time, choices made will be
+    incorrect, either because they were suboptimal from the start or
+    circumstances changed.
+*   Bifurcating the API gives us two implementations, each with less usage.
+    This lowers the leverage gained from optimizing either one, unless
+    maintainers can reliably cross-pollinate ideas from one to the other.
+    Actively maintaining *two* implementations requires a larger investment,
+    reducing the ROI from having two in the first place. Engineers may give
+    the more commonly used implementation more care and attention, leading it
+    to eventually outstrip the "faster" implementation.
+*   Data structures and types can be especially costly to duplicate, due to
+    the
+    "[impedance mismatch](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2125r0.pdf)"
+    of having a library that works solely with one type (say `std::string`)
+    and another that needs a different one (`absl::my_fast_string`). In order
+    for the two to interoperate, the interfaces will require expensive
+    copies--a single type would not require such conversions.
+
+While this hypothetical might seem far-fetched, this is precisely what
+happened with the
+[predecessor implementation to `absl::popcount`](/fast/9). We had two
+implementations, but the "better" one was ultimately outstripped by the
+"worse" one because engineers optimized the one with the wider usage instead.
+
+In terms of API design around intents, we can consider:
+
+```
+void* memcpy(void* dest, const void* src, size_t count);
+crc32c_t absl::ComputeCrc32c(absl::string_view buf);
+crc32c_t absl::MemcpyCrc32c(void* dest, const void* src, size_t count);
+```
+
+With the first two primitives, we can build a trivial, but non-optimal,
+implementation of the third. Combining the concepts makes sense when it is a
+common operation where finer-grained operations might leave performance on the
+table. Knowing we are going to both copy and checksum the bytes allows us to
+read data once, rather than twice. We can also decompose the implementation
+back into its components if that ever becomes more efficient.
+
+The extra output of the operation (the `crc32c_t`) distinguishes it from just
+being a `memcpy` with different performance characteristics. We would
+recommend using the combined operation when we need to both *copy* data and
+*checksum* it. `MemcpyCrc32c` isn't a suitable replacement for calls to
+`memcpy` without a need for a checksum, which removes the cognitive cost of
+considering it solely for performance reasons.
+
+The explicit function calls can also help with understanding the purpose of
+the code when we are looking at profiles later.
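+For example, we can compare protobufs for equality in two ways:
+
+*   By serializing and comparing the bytes, which is unfortunately both common
+    and
+    [unsound](https://protobuf.dev/programming-guides/encoding/#implications).
+*   [Field-by-field](https://github.com/protocolbuffers/protobuf/tree/main/src/google/protobuf/util/message_differencer.h)
+    directly, which is faster.
+
+A sketch of the second approach, using protobuf's `MessageDifferencer` (the
+wrapper function is ours):
+
+```
+#include "google/protobuf/util/message_differencer.h"
+
+// Field-by-field comparison: no serialization, no transient byte strings,
+// and the intent stays visible both in the code and in profiles.
+bool SameContents(const google::protobuf::Message& a,
+                  const google::protobuf::Message& b) {
+  return google::protobuf::util::MessageDifferencer::Equals(a, b);
+}
+```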
+
+While reading a profile, we might see the individual calls to serialize and
+`memcmp`, but it is harder to ascertain the intended semantics later. We may
+be tempted to optimize the discrete functions--the process of serializing and
+the subsequent process of comparing the resulting strings. Understanding the
+high-level intent and data flow gives us opportunities to optimize further up
+the stack to find the "Room at the Middle", optimizing the direct comparison.
+At a minimum, an optimized version could avoid holding the serialized versions
+in memory.
+
+## Avoid unnecessarily strong guarantees
+
+There are situations where the benefits of duplicate APIs outweigh the costs.
+
+The Abseil hash containers
+([SwissMap](https://abseil.io/about/design/swisstables)) added new hashtable
+implementations to the code base which, at first glance, appear redundant with
+the ones in the C++ standard library. This apparent duplication allowed us to
+have a more efficient set of containers which match the standard library API,
+but adhere to a weaker set of constraints.
+
+The Abseil hash containers provided weaker guarantees for iterator and pointer
+stability, allowing them to improve performance by reducing data indirections.
+It is difficult to implement `std::unordered_map`'s guarantees without
+resorting to a node-based implementation that requires data indirections and
+constrains performance. Given `std::unordered_map`'s widespread usage, it was
+not feasible to relax these guarantees all at once.
+
+The migration was a replacement path for the legacy containers, not an
+alternative. The superior performance characteristics meant that users could
+"just use SwissMap" without tedious benchmarking on a case-by-case basis.
+There's little need for a user to revisit their decision to migrate to
+SwissMap with the passage of time. This meant that usage could be actively
+driven towards SwissMap: Two types would be a temporary (albeit long) state,
+rather than one where every individual usage had to be carefully selected.
+
+Years after SwissMap's development, there are far fewer--but non-zero--uses of
+`std::unordered_map`. Blocking the improvement on a complete cleanup would
+have meant no benefit accrued. We were able to migrate instance-by-instance,
+realizing incremental benefits over time.
+
+It's important to avoid ascribing intent--even with expressive APIs--to uses
+of a previously predominant API. A use of `std::map` might require keys to be
+ordered, but the more likely explanation might be that it is older code in
+need of updating.
+
+## Avoid leaking implementation details
+
+Hyrum's Law reminds us that observable behaviors will be relied upon, but
+sometimes our API design choices constrain our implementation details. These
+often arise from returning references to data or giving fine-grained control
+in APIs. This can help performance in the short term, but care is required to
+make sure the design allows long-term evolution to continue to improve
+performance over time.
+
+Consider protocol buffers for a simple message.
+
+```
+message MyMessage {
+  optional string foo = 1;
+  repeated string bar = 2;
+}
+```
+
+As of October 2023, the accessor `.foo()` returns a `const std::string&`. This
+*requires* that we have an in-memory representation of a `std::string`
+instance that can be returned. This approach has two problems:
+
+*   `std::string` encodes a specific allocation strategy (`std::allocator`).
+    If we change the allocation strategy, for example wrapping `Arena`, we
+    change the type.
+*   Individual fields can have a wide range of sizes (or likelihoods of
+    presence) that we can determine from profiling, which could benefit from
+    variable small string object buffer sizes. Returning `const std::string&`
+    constrains the implementation to that particular size of buffer.
+
+In contrast, by returning `std::string_view` (or our
+[internal predecessor](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3442.html),
+`StringPiece`), we decouple callers from the internal representation. The API
+is the same, independent of whether the string is constant data (backed by the
+`.rodata` section), allocated on the heap by a `std::string` instance, or
+allocated by an `Arena`. We've abstracted away the implementation detail from
+our user, giving us more optimization freedom.
+
+Similarly, consider the allocation-aware APIs in protobuf,
+`add_allocated_...`, `release_...`, and `unsafe_arena_...`. Fine-grained
+control over when and where allocations occur can offer significant
+performance benefits, but these APIs also constrain future implementations by
+creating sharp performance edges.
+
+*   `release_...` allows us to remove a submessage and return ownership to the
+    caller. Historically, subobjects were heap-allocated and the operation was
+    fast--it's hard to beat swapping two pointers. When Protobuf Arenas became
+    available, `release_...` created a new copy of the underlying message on
+    the heap, so that it could release that copy. The API couldn't convey that
+    the returned pointer was owned by the Arena, not the caller, so making a
+    full copy was required to keep code working. As a result, code that calls
+    `release_...` may be O(1) or O(n) based on non-local information (whether
+    the source object was constructed on an arena)!
+*   With Arenas, `unsafe_arena_...` gives us the raw hooks we need to add or
+    remove fields from a message without making the copy mentioned above, with
+    "unsafe" in the name conveying the subtlety and gravitas of what we're
+    doing. These APIs are tricky to use correctly, though, as today's tested
+    combination of arena and heap ownership may change over time and
+    assumptions break. The APIs are also extremely fine-grained, but do not
+    convey the higher-level intent--transferring pointer ownership, "lending"
+    a submessage to another one, etc.
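+
+A sketch of the first sharp edge, assuming the generated API for `MyMessage`
+above:
+
+```
+#include <memory>
+#include <string>
+
+void TakeFoo(MyMessage& msg) {
+  // Off-arena: a cheap pointer handoff. On an arena: protobuf must copy the
+  // string to the heap so the caller can own it. Same call, O(1) or O(n)
+  // depending on where msg was constructed.
+  std::unique_ptr<std::string> foo(msg.release_foo());
+}
+```
+
+## Concluding remarks
+
+Good performance should be available by default, not as an optional feature.
+While
+[feature flags and knobs can be useful for testing and initial rollout](/fast/52),
+we should strive to make the right choices for users, rather than requiring
+users to adopt the improvement on a case-by-case basis.
+
+Developing an optimization for an existing implementation can provide a larger
+return-on-investment by targeting widespread, current usage upfront. Adding a
+new API or optimization knob can be expedient, but without widespread usage
+and adoption, the benefit is far more limited.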
+
+Optimization of existing code can hit stumbling blocks around unnecessarily
+strong guarantees or APIs that constrain the implementation--and thus the
+optimization search space--too much to find improvements.
diff --git a/_posts/2023-10-15-fast-60.md b/_posts/2023-10-15-fast-60.md
new file mode 100644
index 0000000..299ffec
--- /dev/null
+++ b/_posts/2023-10-15-fast-60.md
@@ -0,0 +1,181 @@
+---
+title: "Performance Tip of the Week #60: In-process profiling: lessons learned"
+layout: fast
+sidenav: side-nav-fast.html
+published: true
+permalink: fast/60
+type: markdown
+order: "060"
+---
+
+Originally posted as Fast TotW #60 on June 6, 2022
+
+*By [Chris Kennelly](mailto:ckennelly@google.com)*
+
+Updated 2023-10-15
+
+Quicklink: [abseil.io/fast/60](https://abseil.io/fast/60)
+
+
+[Google-Wide Profiling](https://research.google/pubs/pub36575/) collects data
+not just from our hardware performance counters, but also from in-process
+profilers.
+
+In-process profilers can give deeper insights into the state of the program
+that are hard to observe from the outside, such as lock contention, where
+memory was allocated, and the distribution of collisions on a hashtable. In
+this tip we discuss how to determine whether a new profiler is necessary, and
+the best practices for producing one.
+
+## Overview
+
+> "The purpose of computing is insight, not numbers." -- Richard Hamming
+
+Developing a new profiler, or augmenting an existing one, gives us more
+information with which to make optimization decisions and aid debugging. The
+goal isn't to have perfect information and to make perfect decisions, but to
+make better decisions faster, shortening our
+["OODA loop" (Observe Orient Decide Act)](https://en.wikipedia.org/wiki/OODA_loop).
+The value is in pulling in the area under the curve and landing in a better
+spot. An "imperfect" profiler that can help make a decision is better than a
+"perfect" profiler that is unwieldy to collect for performance or privacy
+reasons. Extra information or precision is only useful insofar as it helps us
+make a *better* decision or *changes* the outcome.
+
+For example, most new optimizations to
+[TCMalloc](https://github.com/google/tcmalloc/blob/master/tcmalloc) start from
+adding new data points to TCMalloc's statistics, which are collected by
+visiting malloc profile handlers across the fleet. This information
+[helps with understanding](https://github.com/google/tcmalloc/blob/master/docs/stats.md)
+the scope of a particular phenomenon. After landing an optimization, these
+metrics can help provide indicators that we changed what we set out to change,
+even if the actual CPU and RAM savings might be measured by other means. These
+steps didn't directly save any CPU usage or bytes of RAM, but they enabled
+better decisions. Capabilities are harder to directly quantify, but they are
+the motor of progress.
+
+## Leveraging existing profilers: the "No build" option
+
+Developing a new profiler takes considerable time, both in terms of
+implementation and wallclock time to ready the fleet for collection at scale.
+Before moving to implement one, it is valuable to consider whether we can
+derive the necessary information from existing profilers and tools we already
+have.
+
+For example, if the case for hashtable profiling were just reporting the
+capacity of hashtables, then we could also derive that information from
+TCMalloc's existing heap profiles of the fleet.
+Even where heap profiles might not be able to provide precise insights--the
+actual "size" of the hashtable, rather than its capacity--we can make an
+informed guess from the profile combined with knowledge about the typical load
+factors due to SwissMap's design.
+
+It is important to articulate the value of the new profiler over what is
+already provided. A key driver for hashtable-specific profiling is that the
+CPU profiles of a hashtable with a
+[bad hash function look similar to those](https://youtu.be/JZE3_0qvrMg?t=1864)
+with a good hash function. The added information collected for stuck bits
+helps us drive optimization decisions we wouldn't otherwise have been able to
+make.
+
+## Sampling strategies
+
+A key design aspect of a profiler is deciding when and how to collect
+information. Most profilers do some kind of sampling to provide an estimate of
+the total without the overhead of recording every event. Collecting some data,
+even if heavily sampled, can be useful for gauging behaviors of a library at
+scale. There are two aspects to a profiler that need to be decided up front:
+
+*   **Duration versus duration-less**: Several of our profilers track sampled
+    events over a period of time. Other profilers capture an instantaneous
+    snapshot of the program's state.
+
+    Duration-less handlers are effectively profiling during the entire program
+    lifetime, which imposes a higher bar on their stability and permissible
+    overhead. In contrast, a profiler that is only active during collection
+    can afford to be more expensive, as collection itself is rare.
+
+*   **Sampling strategy**: The overheads of capturing data about every
+    instance of a class can be prohibitive, so it's important to determine a
+    strategy for sampling--and figure out how that sampling strategy can be
+    scaled to provide information representative of the entire fleet.
+
+    Sampling operations can make or break the feasibility of a profiler: Our
+    compression profiler originally recorded statistics about *every*
+    compression operation during its collection window, but the high overhead
+    caused major services to turn off the profiler altogether. It was fixed by
+    moving to a sampling strategy to only record statistics on a subset of
+    compression operations during the profiling window. This allowed the
+    profiler to be reenabled.
+
+    While [knobs are often undesirable](/fast/52), allowing applications to
+    tune their sampling rate (or whether sampling occurs at all) can be
+    helpful for balancing the information gained against the overheads
+    imposed.
+
+    Unless there is a justified exception, we require that the profiler
+    applies the sampling factor back to the data to "unsample" it before
+    returning it. This allows consumers to easily use the data without having
+    to deal with the sampling arithmetic themselves. This is especially
+    important as the sampling rate can be variable--either automatically
+    adjusted or tunable via a configuration knob such as a flag. This step can
+    also help with validation via cross-checking with other profilers. For
+    example, SwissMap's total memory allocations seen by TCMalloc's heap
+    profiles are consistent with the total capacity seen by the hashtable
+    profiler.
+
+    Choosing the right sampling approach (and the unsampling counterpart)
+    needs to carefully balance accuracy vs. overhead. For example, with the
+    heap profiler in TCMalloc one might decide to simply pick every Nth
+    allocation.
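+    But that would not work well: in a typical application, memory profiles
+    are dominated by many small allocations, and sampling those with
+    reasonable overhead would require a high sampling factor. It is also
+    likely to miss rarer, large allocations. Interestingly, the next obvious
+    improvement of sampling an allocation every N bytes would "almost work,"
+    but is subject to statistical bias. This was fixed by introducing a
+    Poisson sampler, which is used to this day.
+
+A rough sketch of the byte-based Poisson approach (hypothetical code, not
+TCMalloc's actual implementation):
+
+```
+#include <cstddef>
+#include <random>
+
+// Samples on average once every mean_sample_bytes bytes, with exponentially
+// distributed gaps so no allocation-size pattern is systematically favored.
+class PoissonSampler {
+ public:
+  explicit PoissonSampler(double mean_sample_bytes)
+      : mean_sample_bytes_(mean_sample_bytes) { Reset(); }
+
+  bool ShouldSample(size_t allocation_bytes) {
+    bytes_until_sample_ -= static_cast<double>(allocation_bytes);
+    if (bytes_until_sample_ > 0) return false;
+    Reset();
+    return true;
+  }
+
+ private:
+  void Reset() {
+    std::exponential_distribution<double> gap(1.0 / mean_sample_bytes_);
+    bytes_until_sample_ = gap(rng_);
+  }
+
+  double mean_sample_bytes_;
+  double bytes_until_sample_ = 0;
+  std::mt19937_64 rng_{42};  // Fixed seed keeps the sketch deterministic.
+};
+```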
+
+For libraries with significant global process state, such as the threads
+running in a process or the state of malloc, we may use a more exhaustive
+strategy. For example, a profiler could snapshot and summarize the state of
+the library without further sampling.
+
+## What data to record
+
+In addition to choosing a sampling strategy, we need to decide what data to
+collect. We want to choose data that will influence an optimization decision.
+Just as different optimization projects have varying returns on investment, we
+want to strike a balance between the cost of implementing our profiler, of
+running it, and of implementing the optimizations it motivates.
+
+Mutation operations can be an excellent place to record additional statistics
+on sampled instances. These are frequently heavyweight for unrelated
+reasons--they trigger copies and reallocations--so checking whether to record
+statistics has minimal added performance penalty. This is the strategy we use
+for many of our existing profilers. In contrast, instrumenting non-mutating
+operations, such as hashtable lookups, can be prohibitively expensive, as we
+use these operations frequently and rely on them being fast.
+
+There is a cost-benefit tradeoff to having more information. Sampling more
+frequently or collecting more data with each sample can paint a richer
+picture, but this increases the runtime cost of profiling. TCMalloc's heap
+profiling has low but non-zero costs, and it more than pays for itself by
+allowing us to look at where much of our RAM usage goes. Increasing the
+sampling rate would give us extra precision, but it wouldn't materially affect
+the optimizations we can uncover and deploy, and the extra overheads would
+negatively impact performance.
+
+More practically, a minimal set of information can be a good starting point
+for getting a new profiler up and running and starting to debug it. While
+obvious in hindsight, stack trace collection and filtering have tripped up
+several new profilers. While collecting more data can give additional
+insights, implementations that compute too many statistics or add contended
+locks may simply be infeasible. A profiler that is too expensive to leave
+enabled may be worse than no profiler at all: We spend time implementing it
+and rolling it out, but we lose the visibility into the library usage that we
+were after in the first place.
+
+## Summary
+
+Profilers are a useful tool for probing the internal state of a program to
+*answer questions* during debugging and optimization. The types of questions
+posed can greatly influence the design and architecture of a profiler.
+
+While a particular design may not be able to answer all questions all at once,
+the goal is ultimately to make *better decisions faster*, shortening our
+["OODA loop" (Observe Orient Decide Act)](https://en.wikipedia.org/wiki/OODA_loop).
+Just as optimization projects are framed in terms of return-on-investment, we
+can frame how additional information influences or changes the course of a
+decision.