diff --git a/_posts/2023-03-02-fast-39.md b/_posts/2023-03-02-fast-39.md index 47ae58a..1f1f010 100644 --- a/_posts/2023-03-02-fast-39.md +++ b/_posts/2023-03-02-fast-39.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #39 on January 22, 2021 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Alkis Evlogimenos](mailto:alkis@evlogimenos.com)* -Updated 2023-03-02 +Updated 2023-10-10 Quicklink: [abseil.io/fast/39](https://abseil.io/fast/39) diff --git a/_posts/2023-03-02-fast-9.md b/_posts/2023-03-02-fast-9.md index 49763ef..d6f0189 100644 --- a/_posts/2023-03-02-fast-9.md +++ b/_posts/2023-03-02-fast-9.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #9 on June 24, 2019 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2023-03-02 +Updated 2023-10-10 Quicklink: [abseil.io/fast/9](https://abseil.io/fast/9) @@ -145,9 +145,9 @@ in 2008. microbenchmark; and it makes it easier to revert to the reference code when (not if) the machine-dependent implementation outlives its usefulness. * Include a microbenchmark with your change. -* When designing or changing configuration knobs, ensure that the choices stay - optimal over time. Frequently, overriding the default can lead to suboptimal - behavior when the *default changes* by pinning things in a +* When [designing or changing configuration knobs](/fast/52), ensure that the + choices stay optimal over time. Frequently, overriding the default can lead + to suboptimal behavior when the *default changes* by pinning things in a worse-than-out-of-the-box state. Designing the knobs [in terms of the outcome](https://youtu.be/J6SNO5o9ADg?t=1521) rather than specific behavior aspects can make such overrides easier (or even possible) diff --git a/_posts/2023-09-14-fast-7.md b/_posts/2023-09-14-fast-7.md index 9782e55..1681575 100644 --- a/_posts/2023-09-14-fast-7.md +++ b/_posts/2023-09-14-fast-7.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #7 on June 6, 2019 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2023-09-14 +Updated 2023-10-31 Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7) diff --git a/_posts/2023-09-30-fast-52.md b/_posts/2023-09-30-fast-52.md new file mode 100644 index 0000000..c2fb210 --- /dev/null +++ b/_posts/2023-09-30-fast-52.md @@ -0,0 +1,184 @@ +--- +title: "Performance Tip of the Week #52: Configuration knobs considered harmful" +layout: fast +sidenav: side-nav-fast.html +published: true +permalink: fast/52 +type: markdown +order: "052" +--- + +Originally posted as Fast TotW #52 on September 30, 2021 + +*By [Chris Kennelly](mailto:ckennelly@google.com)* + +Updated 2023-09-30 + +Quicklink: [abseil.io/fast/52](https://abseil.io/fast/52) + + +Flags, options, and other mechanisms to override default behaviors are useful +during a migration or as a short-term mechanism to address an unusual need. In +the long term they go stale (not providing real benefit to users), are almost +always haunted (in the +[haunted graveyard](https://www.usenix.org/sites/default/files/conference/protected-files/srecon17americas_slides_reese.pdf) +sense), and prevent centralized consistency/optimization efforts. In this +episode, we discuss the tradeoffs in technical debt and optimization velocity +for adding configurability. + +## The ideal flag lifecycle + +When developing a new feature, it's straightforward and often recommended to +guard it behind a flag. 
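+For example, a minimal sketch of such a guard (the flag and feature names here
+are hypothetical, and we assume Abseil's flags library):
+
+```
+#include "absl/flags/flag.h"
+
+ABSL_FLAG(bool, use_new_widget_cache, false,
+          "If true, serve widgets from the experimental cache.");
+
+void ServeWidget() {
+  if (absl::GetFlag(FLAGS_use_new_widget_cache)) {
+    // New feature: can be enabled (and rolled back) without a new release.
+  } else {
+    // Established default behavior.
+  }
+}
+```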
+
+This approach of using
+[feature flags](https://abseil.io/resources/swe-book/html/ch24.html#continuous_delivery-id00035)
+makes it possible to decouple pushing the code changes to production from
+turning on a new feature, which might have undiscovered correctness bugs or
+different resource requirements.
+
+For a commonly-used library, flags also allow early opt-ins from users. When
+the default is changed, the flag also provides an escape hatch to revert to the
+old behavior.
+
+For example, this was employed successfully for the rollout of
+[TCMalloc](https://github.com/google/tcmalloc/blob/master/tcmalloc)'s
+[Huge Page Aware Allocator optimization](https://research.google/pubs/pub50370.pdf):
+many applications opted in early, but even with extensive testing, a few
+applications saw changes in their resource requirements. These applications
+could opt out while deeper investigation occurred, without rolling back the
+efficiency gains seen by most other users of TCMalloc.
+
+These experiences suggest flags are an unalloyed good in theory, but practice
+is wholly different. Whether a flag turns out to be good depends on what
+percentage of users will use the feature:
+
+*   If the number of users of a flag is always expected to be small, its
+    existence hampers future evolution.
+*   If that number is in the middle, the complexity can be well-justified, but
+    it can be challenging to set the flag optimally. Some teams have observed
+    that, often, only the authors of features have the necessary context to
+    set the flag appropriately--either to know when to set it at all or which
+    value it should actually be set to.
+*   If that number is near 100%--probably because we're transitioning to a new
+    default and the flag exists to provide an opt-out--this can be a good use
+    of a flag. Nonetheless, it is important to clean that flag up after the
+    rollout is complete so it doesn't linger indefinitely. Without the
+    cleanup, this becomes technical debt that hinders future changes or
+    becomes a "standard knob with a weird name."
+
+## Flags failing to convey intent
+
+The units for many flags are entirely opaque and often have second- or
+third-order effects that may not be immediately intuitive.
+
+In his [2021 CppCon talk](https://www.youtube.com/watch?v=J6SNO5o9ADg), Titus
+Winters makes a real-world note of this phenomenon: The "popcorn button" of
+microwaves should not be used for microwave popcorn, as the button does not
+align with the settings the popcorn requires.
+
+Moving to Google's C++ codebase: SwissMap, Abseil's family of high-performance
+hashtables, does not provide a working implementation of the
+`max_load_factor` knob. The low utility of `max_load_factor` was uncovered
+during the migration to SwissMap. Even worse, in many of the situations where
+`max_load_factor` was set, it was set incorrectly.
+
+Even when the role of `max_load_factor` was correctly understood, its value
+was often misconfigured to achieve a desired goal. While
+`max_load_factor(0.25)` might convey an intent to "trade RAM for speed," such
+a setting can make CPU performance worse while simultaneously using more RAM,
+defeating the intent of its user.
+
+In other situations, different implementations can be API-compatible, but
+their behaviors do not transfer effectively between implementations. Open
+addressing hashtables have typical load factors <1, while chained hashtables
+have load factors typically ≥1. Changing between these implementations would
+cause the `max_load_factor` to have a surprisingly different effect.
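+A minimal sketch of the same call against both implementations:
+
+```
+#include <unordered_map>
+
+#include "absl/container/flat_hash_map.h"
+
+void TuneLoadFactors() {
+  std::unordered_map<int, int> chained;
+  chained.max_load_factor(0.25f);  // Chained table: rehashes sooner.
+
+  absl::flat_hash_map<int, int> flat;
+  flat.max_load_factor(0.25f);  // Accepted for API compatibility; no effect.
+}
+```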
+
+This experience led the SwissMap authors to make `max_load_factor` a no-op,
+providing it only for API compatibility.
+
+## Stale configuration parameters
+
+Tuning a configuration is another optimization that
+[does not age well](/fast/9).
+
+For flags defined in commonly used libraries, the defaults themselves have
+probably evolved: a feature was launched or an optimization landed. The nature
+of Google's production configuration languages often means that once a service
+has hard-coded a flag's value, it takes precedence over the default. That
+precedence was the whole reason for choosing a non-default value in the first
+place; but with the codebase evolving at a high rate, it's easy to overlook
+that the underlying infrastructure has improved and that the overriding value
+is now *worse* than the default.
+
+The key action here is to override flags sparingly and to regularly reconsider
+existing overrides. When designing new options, prefer good defaults or make
+parameters self-tune if possible. Self-tuning may come in the form of adapting
+automatically to workloads, rather than requiring careful tuning through
+flags.
+
+## Reduced long-term velocity
+
+Titus Winters notes that "If 99% of your users understand an API's behavior
+through the lens of the default setting, the 1% of users that change that
+setting are at risk: APIs built at a higher level have a good chance of
+assuming the default behavior, leaving your 1% semi-supported."
+
+Configurability can be a great short-term boon, but over the long term it is a
+double-edged sword. Options increase the state space that has to be considered
+with every future change, making it more difficult to reason about, test, and
+successfully land new features in production. Beyond just optimizing *costs*,
+this complexity also hampers achieving better business objectives: Extra
+complexity that delays an improvement to product experiences is a non-obvious
+externality.
+
+For example, TCMalloc has a number of tuning options and customization points,
+but ultimately, several optimizations came from sanding away extra
+configuration complexity. The rarely used malloc hooks API required careful
+structuring of TCMalloc's fast path to allow users who didn't use hooks--most
+users--to not pay for their possible presence. In another case, removing the
+`sbrk` allocator allowed TCMalloc to structure its virtual address space
+carefully, enabling several enhancements.
+
+## Beyond knobs
+
+While this discussion has largely focused on knobs and tunables, APIs and
+libraries have the same challenges.
+
+An existing library, *X*, might be inadequate or insufficiently expressive,
+which can motivate building a "better" alternative, *Y*, along some
+dimensions. Realizing the benefit of using *Y* is dependent on users both
+discovering *Y* and picking between *X* and *Y* *correctly*--and in the case
+of a long-lived code base, keeping that choice optimal over time.
+
+For some uses, this strategy is infeasible. `my::super_fast_string` will
+probably never replace `std::string` because the latter is so entrenched and
+the impedance mismatch of living in an independent string ecosystem exceeds
+the benefits. Multiple
+[vocabulary types](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2125r0.pdf)
+suffer from impedance mismatch--costly interconversions can overwhelm the
+overall benefits. The costs of migrating the world also need to be considered
+upfront. Without active migration, we simply end up with two of everything.
+
+There are times when a new library or API is truly needed --
+[SwissMap](https://abseil.io/about/design/swisstables) needed to break
+stability guarantees provided by `std::unordered_map` on an
+instance-by-instance basis to avoid waiting for every problematic usage to be
+fixed. In that case, however, the performance benefits it provided were only
+realized by active migration. Being able to aim for a complete migration eases
+maintenance and educational burdens as well. A compelling performance case
+simplified to "just use SwissMap" avoids the need for painstaking benchmarking
+with every use where the optimal choice could get out of date.
+
+## Best practices
+
+When adding new customization points, consider how they'll evolve over the
+long term.
+
+*   When using flags to gate new features that will be enabled by default,
+    make a plan for removing any opt-outs so the flag itself can be removed,
+    rather than ending up as technical debt.
+*   Flags are a powerful tool for tuning and optimization, but the author of a
+    customization point has the most context for how to use it effectively.
+    Choosing good defaults or making features self-tune is often better for
+    the codebase as a whole. For everyone else, discoverability of a knob, let
+    alone optimal selection, is challenging.
diff --git a/_posts/2023-10-10-fast-64.md b/_posts/2023-10-10-fast-64.md
new file mode 100644
index 0000000..7014fbf
--- /dev/null
+++ b/_posts/2023-10-10-fast-64.md
@@ -0,0 +1,235 @@
+---
+title: "Performance Tip of the Week #64: More Moore with better API design"
+layout: fast
+sidenav: side-nav-fast.html
+published: true
+permalink: fast/64
+type: markdown
+order: "064"
+---
+
+Originally posted as Fast TotW #64 on October 21, 2022
+
+*By [Chris Kennelly](mailto:ckennelly@google.com)*
+
+Updated 2023-10-10
+
+Quicklink: [abseil.io/fast/64](https://abseil.io/fast/64)
+
+
+Optimizing library implementations only carries us so far in making software
+more efficient. In this episode, we discuss the importance of good APIs and
+the right abstractions for finding optimization opportunities. Since we can
+make hardware--especially with the end of Moore's Law--and software run only
+so fast, the right abstractions give us continued optimization opportunities.
+
+## Correctness is paramount
+
+We can simplify an implementation down to `return 42;` regardless of the input
+to see blazing fast results, but an API that doesn't work correctly isn't
+doing its job.
+
+"Subtle" and "clever" code has costs for both maintainers and users alike.
+Today's tricky edge cases can be tomorrow's headaches when we try to optimize
+an implementation. Threading the needle of preserving explicitly (or
+[implicitly](https://hyrumslaw.com)) promised quirks makes the optimization
+process slower and more fragile over time. Being able to
+[iterate](https://en.wikipedia.org/wiki/OODA_loop) [faster](/fast/39) helps
+with exploring more of the design space to find the best minimum.
+
+At times, we may need to break abstraction boundaries or have complex
+preconditions to unlock the best possible performance. We need to document and
+test these sharp edges. Future debugging has an opportunity cost: When we
+spend time tracking down and fixing bugs, we are not developing new
+optimizations. We can use assertions for preconditions, especially in
+debug/sanitizer builds, to double-check contracts and *enforce* them. Testing
+robots never sleep, while humans are fallible.
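+
+As a minimal sketch of enforcing such a precondition (the function and its
+contract are hypothetical):
+
+```
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Contract: index must be in range. The assert double-checks the contract in
+// debug/sanitizer builds; release builds keep the unchecked fast path.
+int UncheckedGet(const std::vector<int>& v, size_t index) {
+  assert(index < v.size() && "UncheckedGet requires index < v.size()");
+  return v[index];
+}
+```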
+
+Randomized implementation behaviors provide a useful bulwark against Hyrum's
+Law, keeping users from implicitly expanding the contract of an interface by
+depending on incidental behavior.
+
+## Express intents
+
+Small, composable operations give users flexibility to express their intents
+more clearly. We can find optimizations by combining high-level but related
+concepts.
+
+Consider `memcpy` and a hypothetical `memcpy_but_faster` API that we could
+build. They both express the same intent, but presumably with
+[different tradeoffs around performance](/fast/52).
+
+*   Users need to think about which one to call. This adds a cognitive cost to
+    every call site. They cannot quickly reach for precisely one to realize
+    their desired functionality. When in doubt, users tend to reach for
+    whichever takes fewer characters to type. Over time, choices made will be
+    incorrect, either because they were suboptimal from the start or
+    circumstances changed.
+*   Bifurcating the API gives us two implementations, each with less usage.
+    This lowers the leverage gained from optimizing either one, unless
+    maintainers can reliably cross-pollinate ideas from one to the other.
+    Actively maintaining *two* implementations requires a larger investment,
+    reducing the ROI from having two in the first place. Engineers may give
+    the more commonly used implementation more care and attention, leading it
+    to eventually outstrip the "faster" implementation.
+*   Data structures and types can be especially costly to duplicate, due to
+    the
+    "[impedance mismatch](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2125r0.pdf)"
+    of having a library that works solely with one type (say `std::string`)
+    and another that needs a different one (`absl::my_fast_string`). In order
+    for the two to interoperate, the interfaces will require expensive
+    copies--a single type would not require such conversions.
+
+While this hypothetical might seem far-fetched, this is precisely what
+happened with the
+[predecessor implementation to `absl::popcount`](/fast/9). We had two
+implementations, but the "better" one was ultimately outstripped by the
+"worse" one because engineers optimized the one with the wider usage instead.
+
+In terms of API design around intents, we can consider:
+
+```
+void* memcpy(void* dest, const void* src, size_t count);
+crc32c_t absl::ComputeCrc32c(absl::string_view buf);
+crc32c_t absl::MemcpyCrc32c(void* dest, const void* src, size_t count);
+```
+
+With the first two primitives, we can build a trivial, but non-optimal,
+implementation of the third. Combining the concepts makes sense when it is a
+common operation where finer-grained operations might leave performance on the
+table. Knowing we are going to both copy and checksum the bytes allows us to
+read data once, rather than twice. We can also decompose the implementation
+back into its components if that ever becomes more efficient.
+
+The extra output of the operation (the `crc32c_t`) distinguishes it from just
+being a `memcpy` with different performance characteristics. We would
+recommend using the combined operation when we need to both *copy* data and
+*checksum* it. `MemcpyCrc32c` isn't a suitable replacement for calls to
+`memcpy` without a need for a checksum, which removes the cognitive cost of
+considering it solely for performance reasons.
+
+The explicit function calls can also help with understanding the purpose of
+the code when we are looking at profiles later.
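+For example, we can compare protobufs for equality in two ways:
+
+*   By serializing and comparing the bytes, which is unfortunately both common
+    and
+    [unsound](https://protobuf.dev/programming-guides/encoding/#implications).
+*   [Field-by-field](https://github.com/protocolbuffers/protobuf/tree/main/src/google/protobuf/util/message_differencer.h)
+    directly, which is faster.
+
+A sketch of the second approach, using protobuf's `MessageDifferencer` (the
+wrapper function is ours):
+
+```
+#include "google/protobuf/util/message_differencer.h"
+
+// Field-by-field comparison: no serialization, no transient byte strings,
+// and the intent stays visible both in the code and in profiles.
+bool SameContents(const google::protobuf::Message& a,
+                  const google::protobuf::Message& b) {
+  return google::protobuf::util::MessageDifferencer::Equals(a, b);
+}
+```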
+
+While reading a profile, we might see the individual calls to serialize and
+`memcmp`, but it is harder to ascertain the intended semantics later. We may
+be tempted to optimize the discrete functions--the process of serializing and
+the subsequent process of comparing the resulting strings. Understanding the
+high-level intent and data flow gives us opportunities to optimize further up
+the stack to find the "Room at the Middle", optimizing the direct comparison.
+At a minimum, an optimized version could avoid holding the serialized versions
+in memory.
+
+## Avoid unnecessarily strong guarantees
+
+There are situations where the benefits of duplicate APIs outweigh the costs.
+
+The Abseil hash containers
+([SwissMap](https://abseil.io/about/design/swisstables)) added new hashtable
+implementations to the code base which, at first glance, appear redundant with
+the ones in the C++ standard library. This apparent duplication allowed us to
+have a more efficient set of containers which match the standard library API,
+but adhere to a weaker set of constraints.
+
+The Abseil hash containers provided weaker guarantees for iterator and pointer
+stability, allowing them to improve performance by reducing data indirections.
+It is difficult to implement `std::unordered_map`'s guarantees without
+resorting to a node-based implementation that requires data indirections and
+constrains performance. Given `std::unordered_map`'s widespread usage, it was
+not feasible to relax these guarantees all at once.
+
+The migration was a replacement path for the legacy containers, not an
+alternative. The superior performance characteristics meant that users could
+"just use SwissMap" without tedious benchmarking on a case-by-case basis.
+There's little need for a user to revisit their decision to migrate to
+SwissMap with the passage of time. This meant that usage could be actively
+driven towards SwissMap: Two types would be a temporary (albeit long) state,
+rather than one where every individual usage had to be carefully selected.
+
+Years after SwissMap's development, there are far fewer--but non-zero--uses of
+`std::unordered_map`. Blocking the improvement on a complete cleanup would
+have meant no benefit accrued. We were able to migrate instance-by-instance,
+realizing incremental benefits over time.
+
+It's important to avoid ascribing intent--even with expressive APIs--to uses
+of a previously predominant API. A use of `std::map` might require keys to be
+ordered, but the more likely explanation might be that it is older code in
+need of updating.
+
+## Avoid leaking implementation details
+
+Hyrum's Law reminds us that observable behaviors will be relied upon, but
+sometimes our API design choices constrain our implementation details. These
+often arise from returning references to data or giving fine-grained control
+in APIs. This can help performance in the short term, but care is required to
+make sure the design allows long-term evolution to continue to improve
+performance over time.
+
+Consider protocol buffers for a simple message.
+
+```
+message MyMessage {
+  optional string foo = 1;
+  repeated string bar = 2;
+}
+```
+
+As of October 2023, the accessor `.foo()` returns a `const std::string&`. This
+*requires* that we have an in-memory representation of a `std::string`
+instance that can be returned. This approach has two problems:
+
+*   `std::string` encodes a specific allocation strategy (`std::allocator`).
+    If we change the allocation strategy, for example wrapping `Arena`, we
+    change the type.
+*   Individual fields can have a wide range of sizes (or likelihoods of
+    presence) that we can determine from profiling, which could benefit from
+    variable small string object buffer sizes. Returning `const std::string&`
+    constrains the implementation to that particular size of buffer.
+
+In contrast, by returning `std::string_view` (or our
+[internal predecessor](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3442.html),
+`StringPiece`), we decouple callers from the internal representation. The API
+is the same, independent of whether the string is constant data (backed by the
+`.rodata` section), allocated on the heap by a `std::string` instance, or
+allocated by an `Arena`. We've abstracted away the implementation detail from
+our user, giving us more optimization freedom.
+
+Similarly, consider the allocation-aware APIs in protobuf,
+`add_allocated_...`, `release_...`, and `unsafe_arena_...`. Fine-grained
+control over when and where allocations occur can offer significant
+performance benefits, but these APIs also constrain future implementations by
+creating sharp performance edges.
+
+*   `release_...` allows us to remove a submessage and return ownership to the
+    caller. Historically, subobjects were heap-allocated and the operation was
+    fast--it's hard to beat swapping two pointers. When Protobuf Arenas became
+    available, `release_...` created a new copy of the underlying message on
+    the heap, so that it could release that copy. The API couldn't convey that
+    the returned pointer was owned by the Arena, not the caller, so making a
+    full copy was required to keep code working. As a result, code that calls
+    `release_...` may be O(1) or O(n) based on non-local information (whether
+    the source object was constructed on an arena)!
+*   With Arenas, `unsafe_arena_...` gives us the raw hooks we need to add or
+    remove fields from a message without making the copy mentioned above, with
+    "unsafe" in the name conveying the subtlety and gravitas of what we're
+    doing. These APIs are tricky to use correctly, though, as today's tested
+    combination of arena and heap ownership may change over time and
+    assumptions break. The APIs are also extremely fine-grained, but do not
+    convey the higher-level intent--transferring pointer ownership, "lending"
+    a submessage to another one, etc.
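+
+A sketch of the first sharp edge, assuming the generated API for `MyMessage`
+above:
+
+```
+#include <memory>
+#include <string>
+
+void TakeFoo(MyMessage& msg) {
+  // Off-arena: a cheap pointer handoff. On an arena: protobuf must copy the
+  // string to the heap so the caller can own it. Same call, O(1) or O(n)
+  // depending on where msg was constructed.
+  std::unique_ptr<std::string> foo(msg.release_foo());
+}
+```
+
+## Concluding remarks
+
+Good performance should be available by default, not as an optional feature.
+While
+[feature flags and knobs can be useful for testing and initial rollout](/fast/52),
+we should strive to make the right choices for users, rather than requiring
+users to adopt the improvement on a case-by-case basis.
+
+Developing an optimization for an existing implementation can provide a larger
+return-on-investment by targeting widespread, current usage upfront. Adding a
+new API or optimization knob can be expedient, but without widespread usage
+and adoption, the benefit is far more limited.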
+
+Optimization of existing code can hit stumbling blocks around unnecessarily
+strong guarantees or APIs that constrain the implementation--and thus the
+optimization search space--too much to find improvements.
diff --git a/_posts/2023-10-15-fast-60.md b/_posts/2023-10-15-fast-60.md
new file mode 100644
index 0000000..299ffec
--- /dev/null
+++ b/_posts/2023-10-15-fast-60.md
@@ -0,0 +1,181 @@
+---
+title: "Performance Tip of the Week #60: In-process profiling: lessons learned"
+layout: fast
+sidenav: side-nav-fast.html
+published: true
+permalink: fast/60
+type: markdown
+order: "060"
+---
+
+Originally posted as Fast TotW #60 on June 6, 2022
+
+*By [Chris Kennelly](mailto:ckennelly@google.com)*
+
+Updated 2023-10-15
+
+Quicklink: [abseil.io/fast/60](https://abseil.io/fast/60)
+
+
+[Google-Wide Profiling](https://research.google/pubs/pub36575/) collects data
+not just from our hardware performance counters, but also from in-process
+profilers.
+
+In-process profilers can give deeper insights into the state of the program
+that are hard to observe from the outside, such as lock contention, where
+memory was allocated, and the distribution of collisions on a hashtable. In
+this tip we discuss how to determine whether a new profiler is necessary, and
+the best practices for producing one.
+
+## Overview
+
+> "The purpose of computing is insight, not numbers." -- Richard Hamming
+
+Developing a new profiler, or augmenting an existing one, gives us more
+information with which to make optimization decisions and aid debugging. The
+goal isn't to have perfect information and to make perfect decisions, but to
+make better decisions faster, shortening our
+["OODA loop" (Observe Orient Decide Act)](https://en.wikipedia.org/wiki/OODA_loop).
+The value is in pulling in the area under the curve and landing in a better
+spot. An "imperfect" profiler that can help make a decision is better than a
+"perfect" profiler that is unwieldy to collect for performance or privacy
+reasons. Extra information or precision is only useful insofar as it helps us
+make a *better* decision or *changes* the outcome.
+
+For example, most new optimizations to
+[TCMalloc](https://github.com/google/tcmalloc/blob/master/tcmalloc) start from
+adding new data points to TCMalloc's statistics, which are collected by
+visiting malloc profile handlers across the fleet. This information
+[helps with understanding](https://github.com/google/tcmalloc/blob/master/docs/stats.md)
+the scope of a particular phenomenon. After landing an optimization, these
+metrics can help provide indicators that we changed what we set out to change,
+even if the actual CPU and RAM savings might be measured by other means. These
+steps didn't directly save any CPU usage or bytes of RAM, but they enabled
+better decisions. Capabilities are harder to directly quantify, but they are
+the motor of progress.
+
+## Leveraging existing profilers: the "No build" option
+
+Developing a new profiler takes considerable time, both in terms of
+implementation and wallclock time to ready the fleet for collection at scale.
+Before moving to implement one, it is valuable to consider whether we can
+derive the necessary information from existing profilers and tools we already
+have.
+
+For example, if the case for hashtable profiling were just reporting the
+capacity of hashtables, then we could also derive that information from
+TCMalloc's existing heap profiles of the fleet.
+Even where heap profiles might not be able to provide precise insights--the
+actual "size" of the hashtable, rather than its capacity--we can make an
+informed guess from the profile combined with knowledge about the typical load
+factors due to SwissMap's design.
+
+It is important to articulate the value of the new profiler over what is
+already provided. A key driver for hashtable-specific profiling is that the
+CPU profiles of a hashtable with a
+[bad hash function look similar to those](https://youtu.be/JZE3_0qvrMg?t=1864)
+with a good hash function. The added information collected for stuck bits
+helps us drive optimization decisions we wouldn't otherwise have been able to
+make.
+
+## Sampling strategies
+
+A key design aspect of a profiler is deciding when and how to collect
+information. Most profilers do some kind of sampling to provide an estimate of
+the total without the overhead of recording every event. Collecting some data,
+even if heavily sampled, can be useful for gauging behaviors of a library at
+scale. There are two aspects to a profiler that need to be decided up front:
+
+*   **Duration versus duration-less**: Several of our profilers track sampled
+    events over a period of time. Other profilers capture an instantaneous
+    snapshot of the program's state.
+
+    Duration-less handlers are effectively profiling during the entire program
+    lifetime, which imposes a higher bar on their stability and permissible
+    overhead. In contrast, a profiler that is only active during collection
+    can afford to be more expensive, as collection itself is rare.
+
+*   **Sampling strategy**: The overheads of capturing data about every
+    instance of a class can be prohibitive, so it's important to determine a
+    strategy for sampling--and figure out how that sampling strategy can be
+    scaled to provide information representative of the entire fleet.
+
+    Sampling operations can make or break the feasibility of a profiler: Our
+    compression profiler originally recorded statistics about *every*
+    compression operation during its collection window, but the high overhead
+    caused major services to turn off the profiler altogether. It was fixed by
+    moving to a sampling strategy to only record statistics on a subset of
+    compression operations during the profiling window. This allowed the
+    profiler to be reenabled.
+
+    While [knobs are often undesirable](/fast/52), allowing applications to
+    tune their sampling rate (or whether sampling occurs at all) can be
+    helpful for balancing the information gained against the overheads
+    imposed.
+
+    Unless there is a justified exception, we require that the profiler
+    applies the sampling factor back to the data to "unsample" it before
+    returning it. This allows consumers to easily use the data without having
+    to deal with the sampling arithmetic themselves. This is especially
+    important as the sampling rate can be variable--either automatically
+    adjusted or tunable via a configuration knob such as a flag. This step can
+    also help with validation via cross-checking with other profilers. For
+    example, SwissMap's total memory allocations seen by TCMalloc's heap
+    profiles are consistent with the total capacity seen by the hashtable
+    profiler.
+
+    Choosing the right sampling approach (and the unsampling counterpart)
+    needs to carefully balance accuracy vs. overhead. For example, with the
+    heap profiler in TCMalloc one might decide to simply pick every Nth
+    allocation.
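+    But that would not work well: in a typical application, memory profiles
+    are dominated by many small allocations, and sampling those with
+    reasonable overhead would require a high sampling factor. It is also
+    likely to miss rarer, large allocations. Interestingly, the next obvious
+    improvement of sampling an allocation every N bytes would "almost work,"
+    but is subject to statistical bias. This was fixed by introducing a
+    Poisson sampler, which is used to this day.
+
+A rough sketch of the byte-based Poisson approach (hypothetical code, not
+TCMalloc's actual implementation):
+
+```
+#include <cstddef>
+#include <random>
+
+// Samples on average once every mean_sample_bytes bytes, with exponentially
+// distributed gaps so no allocation-size pattern is systematically favored.
+class PoissonSampler {
+ public:
+  explicit PoissonSampler(double mean_sample_bytes)
+      : mean_sample_bytes_(mean_sample_bytes) { Reset(); }
+
+  bool ShouldSample(size_t allocation_bytes) {
+    bytes_until_sample_ -= static_cast<double>(allocation_bytes);
+    if (bytes_until_sample_ > 0) return false;
+    Reset();
+    return true;
+  }
+
+ private:
+  void Reset() {
+    std::exponential_distribution<double> gap(1.0 / mean_sample_bytes_);
+    bytes_until_sample_ = gap(rng_);
+  }
+
+  double mean_sample_bytes_;
+  double bytes_until_sample_ = 0;
+  std::mt19937_64 rng_{42};  // Fixed seed keeps the sketch deterministic.
+};
+```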
+
+For libraries with significant global process state, such as the threads
+running in a process or the state of malloc, we may use a more exhaustive
+strategy. For example, a profiler could snapshot and summarize the state of
+the library without further sampling.
+
+## What data to record
+
+In addition to choosing a sampling strategy, we need to decide what data to
+collect. We want to choose data that will influence an optimization decision.
+Just as different optimization projects have varying returns on investment, we
+want to strike a balance between the cost of implementing our profiler, of
+running it, and of implementing the optimizations it motivates.
+
+Mutation operations can be an excellent place to record additional statistics
+on sampled instances. These are frequently heavyweight for unrelated
+reasons--they trigger copies and reallocations--so checking whether to record
+statistics has minimal added performance penalty. This is the strategy we use
+for many of our existing profilers. In contrast, instrumenting non-mutating
+operations, such as hashtable lookups, can be prohibitively expensive, as we
+use these operations frequently and rely on them being fast.
+
+There is a cost-benefit tradeoff to having more information. Sampling more
+frequently or collecting more data with each sample can paint a richer
+picture, but this increases the runtime cost of profiling. TCMalloc's heap
+profiling has low but non-zero costs, and it more than pays for itself by
+allowing us to look at where much of our RAM usage goes. Increasing the
+sampling rate would give us extra precision, but it wouldn't materially affect
+the optimizations we can uncover and deploy, and the extra overheads would
+negatively impact performance.
+
+More practically, a minimal set of information can be a good starting point
+for getting a new profiler up and running and starting to debug it. While
+obvious in hindsight, stack trace collection and filtering have tripped up
+several new profilers. While collecting more data can give additional
+insights, implementations that compute too many statistics or add contended
+locks may simply be infeasible. A profiler that is too expensive to leave
+enabled may be worse than no profiler at all: We spend time implementing it
+and rolling it out, but we lose the visibility into the library usage that we
+were after in the first place.
+
+## Summary
+
+Profilers are a useful tool for probing the internal state of a program to
+*answer questions* during debugging and optimization. The types of questions
+posed can greatly influence the design and architecture of a profiler.
+
+While a particular design may not be able to answer all questions all at once,
+the goal is ultimately to make *better decisions faster*, shortening our
+["OODA loop" (Observe Orient Decide Act)](https://en.wikipedia.org/wiki/OODA_loop).
+Just as optimization projects are framed in terms of return-on-investment, we
+can frame how additional information influences or changes the course of a
+decision.