<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Advances in Computing: Queue: In Practice]]></title><description><![CDATA[insights from ACM Queue, written by engineers, for engineers]]></description><link>https://theofficialacm.substack.com/s/queue-in-practice</link><image><url>https://substackcdn.com/image/fetch/$s_!pM9H!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6646612-d04c-4756-9715-1080ffd8e327_621x621.png</url><title>Advances in Computing: Queue: In Practice</title><link>https://theofficialacm.substack.com/s/queue-in-practice</link></image><generator>Substack</generator><lastBuildDate>Sun, 12 Apr 2026 11:02:24 GMT</lastBuildDate><atom:link href="https://theofficialacm.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Official ACM]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[theofficialacm@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[theofficialacm@substack.com]]></itunes:email><itunes:name><![CDATA[The Official ACM]]></itunes:name></itunes:owner><itunes:author><![CDATA[The Official ACM]]></itunes:author><googleplay:owner><![CDATA[theofficialacm@substack.com]]></googleplay:owner><googleplay:email><![CDATA[theofficialacm@substack.com]]></googleplay:email><googleplay:author><![CDATA[The Official ACM]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Data Analysis: Why Is It So Complicated?]]></title><description><![CDATA[Why your models are incomplete and rife with inaccuracies, assumptions, caveats, and 
limitations]]></description><link>https://theofficialacm.substack.com/p/data-analysis-why-is-it-so-complicated</link><guid isPermaLink="false">https://theofficialacm.substack.com/p/data-analysis-why-is-it-so-complicated</guid><dc:creator><![CDATA[The Official ACM]]></dc:creator><pubDate>Wed, 01 Apr 2026 11:06:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7Wrt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35188c3-155f-40c6-8607-da390b9e7241_1898x1156.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever asked a very simple question of data scientists, only for them to spend way too long doing analysis and then refuse to give you a straight answer, hedging with, &#8220;It depends&#8221; or &#8220;There is some evidence for&#8221;? Or maybe you were building a dashboard and you just wanted to aggregate analysis outputs, and instead of handing over their models, the data science people kept grumbling about &#8220;needing expertise to use the model outputs&#8221;? Have you then thought: This should have been simple&#8212;<em>why is it so complicated?</em></p><p>This article aims to give you a sense of the depth and breadth of why it&#8217;s so complicated to conduct and interpret data analysis. 
It begins with an overview of the purpose of data analysis, reviews different components of data and modeling and how each component introduces complexity to the process of analysis, discusses interpretation of analytic results, and concludes with a few recommendations for productively managing all of these challenges.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!7Wrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb35188c3-155f-40c6-8607-da390b9e7241_1898x1156.webp" width="1456" height="887" alt=""></figure></div><div><hr></div><h3>First: What <em>is</em> Data Analysis?</h3><p>Data analysis underlies most decisions, even if you&#8217;re not directly aware of it. If you know you have a family history of heart disease, should you eat oatmeal or bacon for breakfast? If you want to sell your house, how high should you set the price?</p><p>Fundamentally, these questions are about <em>prediction:</em> If I eat oatmeal, am I less likely to die of a heart attack? If I price my house at $x, will it sell, and how quickly?</p><p>Basic data analysis involves just <em>describing</em> data: Maybe oatmeal eaters are five percent less likely to die of heart attacks than bacon eaters. This is great to know, but it would be far more powerful to understand <em>why</em> these dietary choices are related to cardiac outcomes. Oatmeal reduces the odds of a heart attack <em>because</em> its soluble fiber lowers cholesterol. Understanding this mechanism means that you can predict which other foods might be heart-healthy because they&#8217;re high in soluble fiber.</p><p>Ultimately, if you understand the mechanisms that generate your data, you can make better predictions, and thus better decisions. Data science can of course be wielded for other purposes, but this article focuses on data analysis used to inform decision-making.</p><p>Data analysis rests on the basic assumption that your data <em>means</em> something.
Your data reflects something real in the world, and analysis is the process of trying to describe the world and its mechanisms based on the data you collected.</p><p>Sometimes you have a clear idea what your data means, either because you understand the process that generates the data, or because you have scoped the system that you&#8217;re observing very tightly. Even systems that seem to be well scoped and deterministic, however, invariably collide with the chaotic nature of the world and human behavior and end up with startlingly weird edge cases, exceptions, and other messiness. So, in most cases, figuring out what data means becomes the primary challenge of data analysis.</p><div><hr></div><h3>Why It&#8217;s So Complicated, in Short</h3><p>Data analysis is complicated for the following basic reasons:</p><p>&#8226; The <em>world</em> is complicated.</p><p>&#8226; Your view of the world is incomplete and inaccurate.</p><p>&#8226; Data and tools used to analyze data come with assumptions, caveats, and limitations.</p><p>&#8226; Interpreting a result is a lot harder than getting a result.</p><p>Data analysis can be simple if you are willing to ignore these problems and just slap your data into some equations, but the result will not help you much.</p><div><hr></div><h3>Data Science Fundamentals</h3><p>The following sections dive into the fundamental components of data science, exploring why they&#8217;re complicated and how to successfully navigate their impact on your work.</p><h4>Populations and samples</h4><p>In statistics, a <em>population</em> refers to all of whatever you&#8217;re interested in and can be defined very broadly (e.g., hospitals) or quite specifically (e.g., hospitals in California that treated patients for motorcycle accidents in March 2025).</p><p>A <em>sample</em> is some subset of the population that you can actually collect data about, like 10 hospitals in California that have signed a data-sharing agreement and set up reporting streams to 
your database. By analyzing data from your sample, you hope to learn something that you can then extrapolate to the entire population.</p><p>Accordingly, the more representative a sample is&#8212;the more accurately it reflects the properties of the entire population&#8212;the more accurately you can extrapolate whatever you learn from the sample. If the 10 California hospitals in your sample are all in rural areas, then the things that you learn from their data might not be true for urban hospitals. In pure research, large random samples are the gold standard, because they are likely to be representative of the population; but, in real-world applications, you might not have a choice about which subset of the population opts to provide data.</p><p>In closed, well-scoped systems, you might have data about your entire population (e.g., log files from all of your Amazon EC2 [Elastic Compute Cloud] instances), but in systems that collect data from out in the world, it is rare to have access to an entire population. If you somehow <em>do</em> have access to your entire population&#8217;s data, then you might be getting far more data than you can practically analyze, and you might need to downsample to make analysis tractable.</p><h4>Measurements and operationalization</h4><p>Data is produced by taking <em>measurements</em>, most of which are <em>proxies</em> rather than direct measurements. For example, a mercury thermometer doesn&#8217;t measure the temperature in a room directly; instead, it measures the height the column of mercury reaches as a result of expanding or contracting in response to the room&#8217;s temperature. For another example, a survey can&#8217;t measure customer experience directly; instead, it measures the numbers that a customer chooses in response to questions.</p><p>Measurements vary in how well defined they are and how well their definitions map to the world.
You know exactly what your thermometer measures: Each degree on the Fahrenheit thermometer is the same as every other degree on that thermometer&#8212;and on every other thermometer.</p><p>In contrast, you know very little about what your customer-experience survey measures. The measurement is not well defined: What does <em>satisfied</em> even mean? Does it mean the same thing to everyone? Worse yet, it might not even measure satisfaction with your product at all, because a bad survey response could just reflect a customer having a bad day, or feeling spiteful, or responding randomly.</p><p>In real-world systems, what you want to measure is often abstract or ambiguous, so it isn&#8217;t obvious what data you should collect or how. The process of deciding how to turn the abstract thing that you want to analyze into a definition of something you can measure is called <em>operationalization</em>.</p><p>Maybe you want to know how many people in a city are infected with influenza. You can&#8217;t measure infection directly&#8212;no tool can peer into a living body&#8217;s cells and see the viral machinery at work&#8212;but you can operationalize infection as measures that might reflect infection. You could ask people about symptoms, assuming that infection produces symptoms and that self-report of symptoms is accurate; you could aggregate clinic and hospital data, assuming that infected people seek medical care and that those facilities will generate records that indicate influenza; you could even look at sales of tissues and decongestant medicine, assuming that infected people will purchase more of these products.</p><p>All of these approaches are valid operationalizations of influenza infections, but each measures something different and requires you to make different <em>assumptions</em> about that measurement&#8217;s relationship to infection.
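<p>To make the divergence concrete, here is a minimal sketch, with entirely invented numbers, of those three operationalizations producing three different answers for the same hypothetical city:</p>

```python
# Three operationalizations of "how many people currently have the flu" in a
# hypothetical city of 1,000,000 people. Every number here is invented.

def estimate_from_survey(symptomatic, respondents, population):
    # Assumes infection produces symptoms and self-report is accurate.
    return population * symptomatic / respondents

def estimate_from_clinics(flu_coded_visits, care_seeking_rate):
    # Assumes a known fraction of infected people seek care and get coded.
    return flu_coded_visits / care_seeking_rate

def estimate_from_sales(decongestant_units, baseline_units, units_per_case):
    # Assumes excess sales over the usual baseline are driven by new cases.
    return max(0, decongestant_units - baseline_units) / units_per_case

estimates = {
    "survey":  estimate_from_survey(32, 1000, 1_000_000),                # 32,000
    "clinics": estimate_from_clinics(4_800, care_seeking_rate=0.25),     # 19,200
    "sales":   estimate_from_sales(90_000, 40_000, units_per_case=2.0),  # 25,000
}
print(estimates)
```

<p>Each estimate is defensible, yet they disagree by tens of thousands of cases, purely because of the assumptions baked into each operationalization.</p>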
This is a necessary evil&#8212;every operationalization requires you to make assumptions&#8212;but any assumption will be more or less true under specific circumstances. For example, a more severe strain of influenza could lead to a greater proportion of infected people going to medical facilities, which, in that situation, might make medical records a better measure than tissue sales.</p><p>One challenge in operationalizing abstract concepts is that most interesting things about the world don&#8217;t fall into cleanly defined categories, so ground truth may not even exist. Even something like &#8220;being infected&#8221; with the flu is very ambiguous: Should a person count as infected when a single one of their cells has been taken over by the influenza virus? A hundred of their cells? A million? How do you define which people your measure should count?</p><p>Ultimately, <em>how</em> you operationalize a variable must be driven by the bigger picture of what you care about. If you&#8217;re collecting data about the flu because you want to know about the burden on healthcare systems, then you probably don&#8217;t want to count asymptomatic people as infected, regardless of how many of their cells are infected.</p><h4>Error</h4><p>Measurement error is the difference between what you measure and reality. Because most things cannot be measured directly, and because the world is a complicated place, error is inescapable. The following three sections review several of the many sources of error in data analysis.</p><h5>Error in the measurement</h5><p>&#8226; The resolution of a measurement device (e.g., an analog thermometer marked every five degrees) limits the precision of measurements, while inaccuracies in a device (e.g., a mislabeled analog thermometer) limit the accuracy of measurements.</p><p>&#8226; Even very precise measurements might not measure what you want to be measuring, or might not measure it completely or correctly. 
For example, speeding is nearly universal in the United States, but drivers who speed minimally are unlikely ever to be stopped by law enforcement, and thus no data is generated about their speeding. You may correctly count how many tickets are given for speeding, but that count does not reflect the true rate of speeding, because so many speeding drivers are not ticketed.</p><p>&#8226; Your assumptions about the relationship between a measure and reality may be incorrect or incomplete. For example, your software for a wearable device might assume that a user&#8217;s spiking heart rate reflects psychological distress, but the spike might instead be caused by physical activity, excitement, or even a cup of strong coffee.</p><h5>Error in the situation around the measurement</h5><p>&#8226; Basic human mistakes (like typos or mixing up date formats) are more common than you might think. Even automated systems inevitably experience edge cases or exceptions that require humans to intervene manually and thus introduce mistakes.</p><p>&#8226; Contextual factors may interfere with a precise and accurate measure that does measure what you want. For example, a person who is fatigued while taking a math test might score lower than their true level of knowledge should allow.</p><p>&#8226; Datasets often represent a snapshot in time, and as time ticks relentlessly onward, the data will age and might diverge from current unmeasured reality. The data might have been accurate when collected, but not at later dates.</p><h5>Error in the bigger picture</h5><p>&#8226; Systematic biases are ways in which certain data is more likely to be recorded, which leads to datasets that may be correct but incomplete and can thus be misleading in sneaky ways. 
In the previous speeding-ticket example, drivers of flashier cars might be more likely to be ticketed, leading us to believe that drivers of flashy cars are more likely to speed.</p><p>&#8226; System-level events can also impact data, from a fire that consumes boxes of records in a warehouse to hospital reporting systems switching to a new coding system. The truly insidious thing about these errors is that you may not be aware of them; you can&#8217;t see missing data, and often you have no way of knowing that miscoded data is wrong.</p><p>&#8226; Events in the world usually don&#8217;t document themselves: The data you can acquire about an action or process is often generated secondary to the task itself, like paperwork filled out by a nurse to document administering a vaccination. <em>Doing</em> a task is typically higher priority than producing data <em>about</em> the task, so the data will be subject to basic errors (like typos), delays, and biases (e.g., busier clinics or those with lower funding might produce less accurate or timely data because they have fewer people and not as much time to complete paperwork).</p><h4>Oh man, that&#8217;s a lot of error! What can I do about it?</h4><p>Ultimately, you can&#8217;t guarantee that your data is (a) measuring what you want to measure, (b) measuring that thing correctly, (c) measuring that thing in an unbiased way, and (d) not missing or wildly wrong because of system-level issues.
There is no comprehensive checklist of ways to detect or address error; often the best you can do is be aware of these sources of error, be vigilant about their impact on your data, and think carefully about how to mitigate error you do detect.</p><p>&#8220;Think really hard about it&#8221; can be frustratingly unhelpful advice to receive, so here are a few strategies that can help address <em>some</em> of the sources of error in your data:</p><p>&#8226; Incorporating external sources of information (such as regular check-ins with human stewards of data feeds) can help detect missing or erroneous data.</p><p>&#8226; Deploying automated anomaly detection on as many facets of data as you can think of, from distribution of values to frequency of data dumps, is not a replacement for vigilance but can be a useful tool for monitoring.</p><p>&#8226; Incorporating multiple measures of the same or similar data can help you identify issues. For example, if wastewater signals indicate that influenza rates are rising, but hospital data shows no increase in flu cases, then there might be a problem with a data feed or analysis pipeline.</p><p>&#8226; Estimating the prevalence of errors, even roughly, can help you calibrate your confidence in your analysis or inform you whether to include a particular factor in modeling. For example, if a variable is missing in 50 percent of cases, you might not want to include it; if it&#8217;s missing in five percent of cases, maybe it&#8217;s okay to include with the caveat that it is a source of error in the model.</p><p>&#8226; As you gain familiarity with your data, you will discover errors that are likely specific to your situation. For example, you may ingest inventory records that don&#8217;t specify units or files that contain both MMDDYY and DDMMYY date formats but don&#8217;t indicate which record uses which format.
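<p>That last pitfall can at least be surfaced mechanically. Here is a minimal sketch (the record values are invented for illustration) that flags six-digit date strings that parse validly under both MMDDYY and DDMMYY:</p>

```python
from datetime import datetime

def parse_candidates(date_str):
    """Return every calendar date a six-digit string could mean under
    MMDDYY vs. DDMMYY. An ambiguous record yields two candidates."""
    candidates = set()
    for fmt in ("%m%d%y", "%d%m%y"):
        try:
            candidates.add(datetime.strptime(date_str, fmt).date())
        except ValueError:
            pass  # not a valid date under this reading
    return candidates

# "051225" could be May 12 or Dec 5; "251225" can only be Dec 25.
for record in ["051225", "251225", "999999"]:
    dates = parse_candidates(record)
    if len(dates) > 1:
        print(record, "is ambiguous:", sorted(map(str, dates)))
    elif not dates:
        print(record, "is not a valid date in either format")
```

<p>A check like this cannot tell you which reading is correct; it can only tell you which records need a human decision.</p>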
Documenting these issues and bundling that documentation with the data can help ensure that all parties handle the data correctly.</p><p>&#8226; If you don&#8217;t have a deep understanding of the source of your data, find someone who does and pick their brain about the data, the data source, how other fields or industries deal with the data, and potential sources of error. Experts can provide valuable context as well as lessons learned from the mistakes they&#8217;ve made with similar analyses.</p><p>Nothing on this list can address error from systematic bias, or contextual factors that influenced measurements, or how closely your operationalization gets to the reality of what you want to measure. If you could wave the magic data wand to get perfectly measured and operationalized variables collected from a perfectly representative sample free of human mistakes and systematic bias, the thing that you&#8217;re measuring is still a thing <em>in the world</em>, and the world ranges from probabilistic to downright chaotic. The world is a messy place, so data about it will be messy too, in a very fundamental way.</p><p>Do your best to reduce error in your data, but remember that the presence of error is not a solvable problem.</p><h4>Summary of the fundamentals</h4><p>Where do we find ourselves in trying to understand the world using data? First, we have to figure out what measurements to make and which are likely to be proxies of what we&#8217;re actually interested in. Then we get inaccuracies introduced by the proxiness of our measurement, the nature of the measurement itself, the system collecting the measurements, and random chance. And all of these measures are collected about only a sample or subset of the world that we&#8217;re interested in. 
Great.</p><p>To make things worse, once we have data, we have to do something with it, so next let&#8217;s talk about models.</p><div><hr></div><h3>What Is a Model?</h3><p>In data science, a model is an abstraction of something in the world. A model typically includes input variables, relationships among the variables, and output variables that are calculated or estimated based on the inputs and their relationships. The parameters that define the strength and direction of relationships among variables are typically fit or trained based on historical data but may be defined by the modeler.</p><p>For a simple example, a model of sales on a website might represent people and day of the week, and parameters estimating the relationship between purchases and day of the week could be calculated as each day&#8217;s average from the past year. A website selling office supplies and one selling hobbyist fishing equipment might be modeled using the same simple structure (people and day of the week), but training on past data would lead to different parameters: The office-supply model might estimate higher purchasing on weekdays, while the fishing-supply model might estimate higher purchasing on weekends.</p><p>For a more complex example: An inventory logistics model might include many inputs, such as how many of a product can fit on a truck, when and in what quantity the product arrives at a warehouse, and how far away the retail sites are that must receive the product. The model structure might assume that the product arrives in bulk shipment at the warehouse, where it is packed onto trucks and shipped out to the retail sites.
With historical data about how many trucks per day can be packed and shipped, you could build a model that estimates how quickly and how much product can be distributed to the retail sites.</p><p>As shown in this example, you need data from the past to build the model, but usually you also need to be able to collect the same kind of data in the future so you can use the model to support decision-making. For example, there&#8217;s not much point in including warehouse staffing levels (for which you have historical data) in your inventory logistics model if you know that you won&#8217;t have access to staffing data in the future.</p><p>The availability of data is thus a major consideration when deciding which factors to include in a model.</p><h4>What goes into a model?</h4><p>As the adage goes: All models are wrong, but some are useful. The goal of developing a model is to make one that is more useful than wrong.</p><p>Models that capture the subset of the world that you need to answer your question are useful, so a lot of the challenge (and artistry) in data science lies in identifying the factors that should be included in a model. Four major components contribute to this decision: knowledge about how the modeled thing works in the world, knowledge about how the available data is generated, knowledge about modeling, and a clear understanding of the question the model should answer or the decision it should inform.</p><p>Models of things in the world are vanishingly unlikely to include every factor that impacts the output, because the world is complicated, and collecting data is hard.</p><p>Maybe you want to investigate the impact of homework on kids&#8217; performance in school, so you ask a teacher for data about how many assignments each student in their class completed as well as their test scores.
You fit a simple linear regression model to your data and find that the line is flat: There is no relationship between homework (as measured by how many assignments are completed) and performance (as measured by test scores).</p><p>What other factors might be relevant that are not accounted for in the model? Maybe the subject is important, such that homework helps with math but not with social studies. Maybe factors about the students are important, like whether they ate breakfast that morning.</p><p>It&#8217;s only worth including a factor, however, if the additional value that it brings to the model outweighs the error introduced by its inclusion. Whether a student ate breakfast at all is likely much more important (and easier to measure accurately) than what percentage of their meal was protein or how fast they ate it. An infinite number of variables could theoretically be measured about a given situation (Did the meal include beans? What kind? Were they grown in high-nitrogen soil? Etc., etc.), but most won&#8217;t have much impact on the outcome.</p><p>This is actually great news, because collecting data can take a lot of time and resources. Deciding which factors to include in the model thus needs to account for not only which factors are <em>important,</em> but also which are <em>practical.</em></p><h4>The world is complicated. Should I make complicated models?</h4><p>More complex models are not necessarily better. The short reason: garbage in, garbage out. The longer reason: Every variable is associated with error, assumptions, estimations, and caveats. Slapping more variables into a model without a deep understanding of those variables, their sources of error, and an accurate view of how they relate to other variables or the world will increase the complexity and uncertainty around your model without necessarily adding value.</p><p>Models that are too complex can also lead you astray in interesting ways, such as overfitting.
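<p>Overfitting is easy to demonstrate with a toy simulation (all numbers here are invented): a &#8220;model&#8221; that memorizes one noisy sample perfectly loses to a two-parameter linear fit on the next sample drawn from the same process:</p>

```python
import random

random.seed(0)

def draw_sample(xs, noise=2.0):
    # The "world": y = 3 + 0.5*x plus measurement noise.
    return {x: 3 + 0.5 * x + random.gauss(0, noise) for x in xs}

def linear_fit(data):
    # Closed-form ordinary least squares with one predictor.
    xs, ys = list(data), list(data.values())
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in data.items())
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data.items()) / len(data)

xs = range(100)
train, test = draw_sample(xs), draw_sample(xs)  # two samples of the same world

memorizer = train.get     # "complex" model: reproduces the training sample exactly
line = linear_fit(train)  # simple model: just two parameters

print(mse(memorizer, train), mse(line, train))  # memorizer wins on training data
print(mse(memorizer, test), mse(line, test))    # memorizer loses on fresh data
```

<p>The memorizer captures the noise in the first sample along with the signal, so it carries that noise as error into every future sample.</p>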
As previously discussed, data is an incomplete and error-filled sample of the world; if you build a model that represents that one sample really well, it is unlikely to fit the next sample very well, because that next sample will be incomplete and error-filled in different ways.</p><p>More complex models don&#8217;t necessarily give you better answers, even if you have a good understanding of the <em>why,</em> good data to feed the model when it&#8217;s deployed, and continuing resources to maintain the model to account for changes in the world, the datasets, and your questions.</p><p>Critically for the software engineers who may be tasked with code management, complex models are more difficult to understand and reason about. This makes them much more difficult to debug, maintain, and update. As a result, they&#8217;re also more likely to produce unintended behavior, leading to exciting surprises when they are deployed to production.</p><p>This article has now addressed the major pieces involved in data analysis. Next, let&#8217;s tie it all together by looking at the process as a whole.</p><div><hr></div><h3>What Does the Data Analysis Process Actually Look Like?</h3><p>The complications discussed here are extensive and touch every step of data analysis from beginning to end. What does this process look like in practice?</p><p>As a data scientist approaching a new analysis, your first task is to understand the goal: What question must be answered or what decision must be informed by your analysis? Is there any context around risk and stakeholder motivations that can help you understand the decision to be made? This framing drives every downstream decision you make in your analysis, so be as crystal clear as possible.</p><p>Then you engage in a potentially lengthy process of finding data. You research the problem space to understand which factors are most important and which are not. 
You might design new measures and collect new data or seek out existing datasets; for each dataset you think deeply about what the data actually measures, understand potential error in the data, and do the legwork of acquiring files or feeds and cleaning them.</p><p>You figure out what kind of model is best suited to the problem and is feasible given constraints of data, time, processing power, and assumptions. You identify variables that are critical but not available, how their absence will impact modeling, and what relationships should be constructed between factors in your model. You implement, test, and debug the model.</p><p>Finally, success! At the end of this process, you have achieved a model that you&#8217;re confident in: You understand all of the variables in it and how they relate and what&#8217;s missing (and why), and you can brief decision-makers on the strengths and weaknesses of the model, caveats around applying it, and sources of error. But before you can use it for decision-making support, there is just one more tiny piece to consider, which is: <em>everything else</em>.</p><h4>Data analysis means nothing alone</h4><p>Specifically, you must consider <em>context</em>, or the additional information you need to evaluate or interpret the output of your model. The requirement for understanding the context around an analysis is a major gatekeeper in accurately using data science (and incidentally, it&#8217;s also why data scientists get so grouchy when asked to chuck their model output onto a dashboard, viewable by anyone with zero context).</p><p>Context here can take many forms, but generally it means information that tells you what is <em>normal</em> or <em>good</em> or <em>bad</em> so you can evaluate the numbers that come out of your analysis. 
That context can come from subject-matter experts, historical data, other ways of looking at your datasets, or other data sources entirely.</p><p>For a simple example, say you build a model, crunch the numbers, and end up with a figure showing exponential increase. Is this situation good or bad? If the data represents the number of spiders in your desk drawer, probably bad. If the data represents the number of dollars in your bank account, probably good.</p><p>For a more complex example, maybe you&#8217;re the U.S. federal government and you&#8217;re distributing a limited supply of COVID-19 medicine during the pandemic, and you want to know if any state is stockpiling the medicine instead of distributing it to patients. What does <em>stockpiling</em> mean? A state having more treatments stored up than it should have. Well, how much <em>should</em> it have?</p><p>You could make a huge model that predicts <em>need</em> for medicine by estimating how many COVID-19 cases will occur on what timeline in each state based on past case counts, vaccination rates, and assumptions about future holiday travel, heat waves, school vacations, and transmission rates of future variants. Then you could estimate how many of those cases would seek treatment and compare each state&#8217;s stored quantity of medicine to that number: If a state is storing more medicine than it needs, then it must be stockpiling.</p><p>But your model has a zillion assumptions built into it. What if states are calculating how much they need using different assumptions? They could be making very reasonable models but reaching very different conclusions than you do about how much medicine they should be storing.</p><p>The point of this huge model is to provide the context of &#8220;how much should states have&#8221; by estimating how much they need, but this approach introduces a heap of error and assumptions. 
Instead, you could just stick with the data you are confident in and use the states as context for <em>each other</em>. You could build a simple model that normalizes each state&#8217;s stored supply of treatments by the population of that state to estimate how much medicine each state has <em>per person</em>, then compare those numbers across states. If most states have 0.1 doses per person, but one state has two doses per person (20 times more!), maybe you can conclude that that state is stockpiling. You can certainly be more confident in the numeric result, which rests on only two numbers per state that you probably have high confidence in: population size and current inventory.</p><p>But wait. You still need more context before concluding that a state is being greedy. Maybe there&#8217;s a good reason for having so many doses: Maybe that state manages its supply differently, or has a much more vulnerable population, or is expecting a bad hurricane season that would impede future deliveries, or any of an infinite universe of sensible reasons. You need to understand <em>why</em> that state has so many doses per person in order to make any kind of decision about it.</p><p>In summary, the requirement for context in interpreting analytic results cannot be overstated: Your numbers mean <em>nothing</em> alone. Context is what allows you to actually use analytic results to support your decision-making by telling you whether your result is expected or surprising, good or bad, and by helping you understand <em>why</em> you found that result so you can make better decisions about it. The flip side of this principle is that analysts (and consumers of analysis) <em>must not make assumptions about context.</em> The real danger here is jumping to a conclusion about context or interpretation but not realizing that it&#8217;s an assumption. 
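</p><p>A minimal sketch of the earlier per-capita comparison (plain Python; every inventory and population figure here is invented for illustration):</p>

```python
# Use the states as context for each other: normalize inventory by
# population and flag outliers. All figures below are hypothetical.
from statistics import median

states = {
    "A": {"doses": 800_000, "population": 8_000_000},
    "B": {"doses": 450_000, "population": 4_500_000},
    "C": {"doses": 1_200_000, "population": 12_000_000},
    "D": {"doses": 6_000_000, "population": 3_000_000},
}

per_person = {name: s["doses"] / s["population"] for name, s in states.items()}
typical = median(per_person.values())  # 0.1 doses per person

# Flag any state holding an order of magnitude more than its peers.
flagged = [name for name, rate in per_person.items() if rate > 10 * typical]
print(flagged)  # ['D']: 2.0 doses per person, 20 times the others
```

<p>Even then, the flag is only a starting point for asking <em>why</em>, not a verdict of stockpiling.</p><p>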
Failing to identify assumptions in a thought process is a stunningly easy error to make and can lead to wild misapplication of otherwise very nice analytics.</p><div><hr></div><h3>Closing Remarks</h3><p>The messiness of the world is incurable, but two principles about the way you approach data science can make the whole endeavor much more valuable to you.</p><p>First, decision-makers need to understand the fundamentals of a model in order to effectively or accurately use its output. Specifically, what factors were accounted for, what factors <em>weren&#8217;t</em> accounted for, and how much will the exclusion of those factors impact the outcome? How does the model approximate interactions between factors, and how critical are those assumptions to the output? Bundling this kind of metadata about an analysis with the analytic results makes the results much more valuable (and usable) for decision-makers.</p><p>Second, as much as we all love data, it is not a magic bullet and must be considered as part of a bigger picture. If you could wave the magic analysis wand, fix all of the massive and fundamental issues discussed here, and generate a completely correct answer to a data-analysis problem, what would you do with it? If an analyst could tell you with total certainty that eating bacon instead of oatmeal for breakfast today raises your odds of a heart attack by exactly 0.0002 percent, what would you do with that extremely precise number? How would you turn that analytic result into a decision?</p><p>Data science feels very authoritative and objective&#8212;It&#8217;s got numbers! Graphs! Fancy equations!&#8212;so it&#8217;s easy to think of it as more precise and trustworthy than other sources of information. As discussed at length here, however, even the most rigorous of analyses rest on all kinds of squishy judgments and are beset by error at every step. 
Ultimately, data analysis is a deeply valuable and worthwhile enterprise that can dramatically increase the quality of decision-making&#8212;but it&#8217;s only one piece of the puzzle.</p><div><hr></div><p><strong>Alice Jackson</strong> is the founder of Colorpoint Data, specializing in helping people turn their data into information. Previously, she worked as a senior scientist and project manager at the Johns Hopkins Applied Physics Lab, applying her skills in experimental design, analysis, and scientific communication to projects ranging from human performance to public health. She holds a Ph.D. in cognitive neuroscience from the University of Maryland.</p>]]></content:encoded></item><item><title><![CDATA[Smart-and-fast thinkers are good, but slow-and-steady thinkers are great]]></title><description><![CDATA[The shiniest tool might cut the deepest.]]></description><link>https://theofficialacm.substack.com/p/smart-and-fast-thinkers-are-good</link><guid isPermaLink="false">https://theofficialacm.substack.com/p/smart-and-fast-thinkers-are-good</guid><dc:creator><![CDATA[The Official ACM]]></dc:creator><pubDate>Wed, 18 Mar 2026 11:00:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uaI5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f39b95-7ead-4755-9a04-72e4acae1b27_626x626.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uaI5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f39b95-7ead-4755-9a04-72e4acae1b27_626x626.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uaI5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f39b95-7ead-4755-9a04-72e4acae1b27_626x626.avif 424w, 
https://substackcdn.com/image/fetch/$s_!uaI5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f39b95-7ead-4755-9a04-72e4acae1b27_626x626.avif 848w, https://substackcdn.com/image/fetch/$s_!uaI5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f39b95-7ead-4755-9a04-72e4acae1b27_626x626.avif 1272w, https://substackcdn.com/image/fetch/$s_!uaI5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f39b95-7ead-4755-9a04-72e4acae1b27_626x626.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uaI5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f39b95-7ead-4755-9a04-72e4acae1b27_626x626.avif" width="626" height="626" class="sizing-normal" alt=""></picture></div></a></figure></div><p><strong>Dear KV</strong>,</p><p>I&#8217;ve been asked to mentor a new engineer on our team and it&#8217;s proving to be quite a challenge. While they are very bright and eager, they also are very excitable and seem to have a thing for reaching for the wrong tool at the wrong time. In particular, they believe that all the problems in our system can be solved by making modifications to low-level software, right down to the operating system and driver level, when many of the problems could be solved by simply making changes to our libraries and user-space programs. I suspect our manager has assigned me to be the new engineer&#8217;s mentor because this is the exact area in which I work and there are hopes I&#8217;ll be able to &#8220;train them up&#8221; to become another &#8220;one of those systems people.&#8221; I&#8217;m happy to do so, but how do you get someone like this to slow down and consider what they&#8217;re doing? I feel like I&#8217;m trying to prevent a toddler from knocking over the furniture.</p><p style="text-align: right;"><strong>Toddled</strong></p><p><strong>Dear Toddled</strong>,</p><p>Ah, the lure of low-level systems programming. No data structures, no debuggers&#8212;nothing but you and a compiler (or maybe just the assembler) in silent solitude, making the lights blink. Who could deny the allure of telling tall tales at the bar after a hard day of toil conquering single bit flips, corrupted storage blocks, and priority inversions? Huzzah!</p><p>Of course, we know this is not at all what programming at the systems level is about. Instead, it involves stumbling around with a dim, smoky torch in a dangerous, dark cave armed with only a stick and hoping not to meet Grendel or the Minotaur.</p><p>Smart-and-fast thinkers are good, but slow-and-steady thinkers are great, and the former can become the latter with a little bit of practice.</p><p>Since it seems you&#8217;re the systems guru in your group, or perhaps for your whole company, it really is in your interest to train up a partner-in-systems-crimes to help you with what are surely some difficult problems. But, first, you must teach them to reach for the right tool at the right time.</p><p>No systems programmer in their right mind&#8212;and there are still a few of us in our right minds&#8212;reaches first for a kernel modification. The tools available to study problems are, for the most part, far richer above the user/kernel boundary than below. Also, new ideas are easier to try out in a user-space library or program, where the price of failure is that you crash a single program, instead of waiting 10 minutes for a whole server to reboot. 
One might think a few experiences like these would be enough to teach your young prot&#233;g&#233; an important lesson. Not even KV would suggest throwing a chaos monkey at your prot&#233;g&#233;&#8217;s test machine to teach them that lesson, but it is tempting.</p><p>A more positive way to convey this lesson would be to show them the set of tools they have at their disposal and then ask questions that will lead them to use those tools. Debuggers, profilers, and tracing systems long ago made the tracking of problems in code a much more productive endeavor. When KV was a very young person, he had a boss who required him to run each new program in the debugger first, before ever running it bare. That was an important lesson and one that I encourage many people to start with.</p><p>This can now be expanded to include running the program under a profiler first, even if the program is brand new. Storing the output of these runs allows you to track the evolution of code over time and can even prove helpful years later, after a system has been in the field for a while as a baseline for comparison. Giving yourself a way to know when a program started to perform differently will surely be of benefit to you at some point.</p><p>If the richness of tooling isn&#8217;t sufficient to convince your prot&#233;g&#233;, you can try showing them the commit history and relevant bugs for a piece of operating-system code. Have them summarize a year or two of changes to a NIC (network interface card) driver (the Intel e1000 is hilarious) and then ask them what they have learned from the experience. If they don&#8217;t come back with, &#8220;Kernel programming is hard, and spurious changes are to be avoided,&#8221; make them review <em>five</em> years of commits instead.</p><p>When it comes to your prot&#233;g&#233;&#8217;s own code, make sure you force them to document what they&#8217;ve changed, no matter at what level, in lengthy detail in each commit message. 
&#8220;Fixed bug&#8221; (something KV wrote about decades ago) is not a commit message. It&#8217;s a slap in the face to every programmer who reads it, effectively telling them, &#8220;Go figure it out yourself.&#8221; Good commit messages show that the person who wrote them understood both the problem and the solution to the problem. In the fullness of time, it may turn out their take on the problem was incorrect, but at least there will be something to reference that shows what the thought process was.</p><p>People, alas, mostly learn from experience, when they learn at all. You will need to figure out the right set of experiences to convince your prot&#233;g&#233; to choose the right tool for the job, as opposed to whatever they might think would be the coolest.</p><p>We both know that any person who works on low-level systems will always say, &#8220;The last tool you reach for is the one that can cut you most deeply.&#8221; A driver or OS modification is more like grabbing the business end of a mace at the bottom of a trunk of swords. It&#8217;s really best to try the short sword on top of the pile first.</p><p style="text-align: right;"><strong>KV</strong></p><div><hr></div><p><strong>George V. Neville-Neil</strong> works on networking and operating system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large codebases. He is the author of <em>The Kollected Kode Vicious</em> and co-author with Marshall Kirk McKusick and Robert N. M. Watson of <em>The Design and Implementation of the FreeBSD Operating System.</em> For nearly 20 years he has been the columnist better known as Kode Vicious. Since 2014 he has been an industrial visitor at the University of Cambridge, where he is involved in several projects relating to computer security. 
He earned his bachelor&#8217;s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of the ACM, the Usenix Association, and IEEE. His software not only runs on Earth, but also has been deployed as part of VxWorks in NASA&#8217;s missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City.</p>]]></content:encoded></item><item><title><![CDATA[Can Modern C++ Save Us?]]></title><description><![CDATA[Hardening the C++ Standard Library at massive scale]]></description><link>https://theofficialacm.substack.com/p/practical-security-in-production</link><guid isPermaLink="false">https://theofficialacm.substack.com/p/practical-security-in-production</guid><dc:creator><![CDATA[The Official ACM]]></dc:creator><pubDate>Wed, 04 Mar 2026 13:07:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jx59!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a3cfd73-8ec5-42d3-a0ad-303444a08f3d_6720x4480.jpeg" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past few years there has been a lot of talk about memory-safety vulnerabilities, and rightly so&#8212;attackers continue to take advantage of them to achieve their objectives. Aside from security, memory unsafety can be the cause of reliability issues and is notoriously expensive to debug. Considering the billions of lines of C++ code in production today, we need to do what we can to make C++ measurably safer over the next few years with as low an adoption barrier as possible.</p><p>In 2019, Alex Gaynor, a security expert and one of the leading voices in memory safety, wrote a piece titled &#8220;<a href="https://alexgaynor.net/2019/apr/21/modern-c++-wont-save-us/">Modern C++ Won&#8217;t Save Us</a>,&#8221; where he gave examples of foundational types such as <code>std::optional</code> that were unsafe in the idiomatic use cases. What happens when these unsafe types are used beyond their contract? Well, you guessed it: undefined behavior. The <code>std::optional</code> type isn&#8217;t the only one to behave like this. If you look at how this compares with modern languages, you can see that C++ is the outlier.</p><p>So, what&#8217;s to be done? Possibly one of the best places to start today is by improving our standard libraries. They provide the baseline &#8220;vocabulary types&#8221; for developers&#8212;<em>and if they&#8217;re not safe, it will be tough to build safety around them</em>. The <code>std::optional</code> type is only one of many vocabulary types in the C++ Standard Library that aren&#8217;t safe by default today. 
Given the current state, it seems mostly clear that the first step should be hardening our standard library, and in our case, this was LLVM&#8217;s libc++.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Jx59!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a3cfd73-8ec5-42d3-a0ad-303444a08f3d_6720x4480.jpeg" width="1456" height="971" class="sizing-normal" alt=""></figure></div><p>(credit: unsplash)</p><div><hr></div><h3>The Limits of Debug-Only Modes</h3><p>The idea of a <em>debug mode</em> with extra checks is not new&#8212;every major implementation of the C++ Standard Library has had one. 
Historically, however, they suffered from several shortcomings, notably including ABI (application binary interface) compatibility issues, but perhaps the most important shortcoming was reflected directly in the name: <code>debug</code> mode was seen as a bug-finding tool to be used in a testing environment.</p><p>There seemed to be a common sentiment reflected in other implementations of the time as well: that as long as the code was well-tested, checks in release builds were unnecessary and would introduce an unacceptable performance hit. This led to feature bloat (since performance was not a priority) and made &#8220;infeasible in release mode&#8221; a self-fulfilling prophecy.</p><p>With experience, we can confidently say that this test-only approach is not sufficient to prevent bugs in the real world. Projects with extensive test suites that make use of a multitude of bug-finding tools still suffer from costly bugs and vulnerabilities when deployed to production. Even extensive fuzzing can&#8217;t provide a complete guarantee, as high-profile vulnerabilities like the <a href="https://blog.isosceles.com/the-webp-0day/">one in libwebp</a> have demonstrated that bugs can lurk even in heavily fuzzed code. Real-world use exposes corner cases that no test suite, no matter how exhaustive, can feasibly cover&#8212;not to mention attackers with a special aptitude for finding or triggering these exact sorts of cases. Though well-intentioned, a debug-only approach ends up being of limited use during development and of no use during deployment, and, as a result, sees little adoption.</p><h3>The Case for Production Hardening</h3><p>The alternative, therefore, is to enable hardening universally in production. While testing is vital, it cannot replicate the exact conditions, subtle timings, or adversarial pressures of a live environment. Many latent bugs manifest only under production traffic or adversarial inputs. 
To provide safety guarantees, <em>checks must be active where the code actually runs.</em></p><p>This stance is often met with immediate skepticism based on two reasonable and deeply ingrained fears: (1) destabilizing services with crashes, and (2) unacceptable performance overhead. For hardening to be viable, both must be addressed.</p><p>First, let&#8217;s address stability. A crash from a detected memory-safety bug is not a new failure. It is the early, safe, and high-fidelity detection of a failure that was already present and silently undermining the system. The alternative to a &#8220;loud crash&#8221; is not a healthy system; it is a silently corrupted one that will fail later in a more complex, damaging, and less understandable way.</p><p>Adopting this <em>fail-stop</em> policy&#8212;terminating the process immediately upon detecting an unrecoverable memory violation&#8212;has been shown to be superior: It is more secure, makes bugs far easier to detect and fix, and ultimately leads to more reliable systems.</p><p>The second fear, performance, is equally critical. This is precisely where historical debug modes failed. Because these modes were not intended for release builds, performance was not a design constraint. Performance must be treated as a core design requirement, not an afterthought. As shown in the rest of this article, the combination of careful design and recent compiler optimization improvements made hardening affordable enough to be enabled at scale.</p><h3>Designing libc++ Hardening for Production</h3><p>The affordability of production hardening was not an accident; it was the result of a long, deliberate evolution in design. The initial push at Apple began by exploring domain-specific classes that provided improved bounds safety, but it quickly became clear that requiring code modification to migrate to nonstandard utilities lacking a rich ecosystem made adoption an uphill battle. 
For example, introducing a span-like class that provides improved bounds safety is a daunting task if it does not interoperate seamlessly with the plethora of algorithms provided by the standard library.</p><p>We then noticed that many C++ Standard Library data structures already had enough information to ensure bounds safety (sometimes in limited ways) and that simply hardening those accesses was the <em>constriction point</em> that would allow improving bounds safety in a large amount of existing code.</p><p>This led to successive iterations of hardening in libc++, culminating in the current design in LLVM 18, which is built on a set of core principles that make it practical for production use.</p><h4>A safety spectrum, not a switch</h4><p>Early experiences with an on/off <a href="https://prereleases.llvm.org/15.0.0/rc2/projects/libcxx/docs/UsingLibcxx.html#enabling-the-safe-libc-mode">&#8220;safe mode&#8221;</a> (introduced in LLVM 15) were encouraging. The key difference from previous debug modes was that safe mode was intended to be used in production. This imposed significant constraints on the design, as features could no longer be enabled without serious consideration of their performance impact.</p><p>As we gained experience with deployment of safe mode, new requirements surfaced. While deployment experience showed this to be a particularly good fit for some projects with adoption in Safari and Chromium, it quickly became clear that there were environments for which safe mode was too expensive. A one-size-fits-all approach is too blunt; developers need to choose the right security-versus-performance tradeoff for their environment.</p><p>The current incarnation of hardening in libc++, first released in LLVM 18, reflects this by offering four hardening modes. 
The two most important are intended for production:</p><p>&#8226; <code>Fast</code> mode is deliberately minimalistic and enables only those hardening checks that are security-critical and can be performed with low overhead, usually in constant time; in practice, this almost exclusively means checking for out-of-bounds accesses (and thus improving spatial memory safety). It is a very lucky coincidence that some of the most valuable checks also happen to be some of the cheapest!</p><p>&#8226; <code>Extensive</code> mode enables all available library checks that can be performed with relatively low performance overhead, including checks for precondition violations that lead to undefined behavior but aren&#8217;t security-critical.</p><p>Enabling a hardening mode is a matter of passing <a href="https://libcxx.llvm.org/Hardening.html">the right compiler flags</a> and rebuilding the application. If the application doesn&#8217;t violate library preconditions, the code should not require any changes.</p><p>The idea is that almost all applications should be able to afford <code>fast mode</code>, while more security-conscious applications might opt into <code>extensive</code> mode. Additionally, there is a <code>none</code> mode (no hardening checks&#8212;that is, the status quo) and a new <code>debug</code> mode (unrelated to the legacy debug mode); <code>debug</code> mode contains more expensive checks, although it still aims to never affect the big-O complexity of algorithms. Each subsequent mode is a superset of the previous one, both in terms of the number of checks and the performance overhead (<code>none</code> &#8594; <code>fast</code> &#8594; <code>extensive</code> &#8594; <code>debug</code>).</p><h4>ABI compatibility, if needed</h4><p>A critical lesson from past experience is that hardening must be orthogonal to the ABI. While some attractive checks would require an ABI break (e.g., storing bounds information in iterators), tying safety to the ABI can make it unusable in many production environments. 
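</p><p>This orthogonality means that enabling hardening is purely a build-system change; the source being hardened is not edited and its data layout does not change. A minimal sketch of what that looks like (the macro and its values follow the libc++ Hardening documentation linked above; the function itself is an illustrative example, not from libc++):</p>

```cpp
// Build-time selection only; the source is identical under every mode:
//   clang++ -std=c++17 -stdlib=libc++ \
//       -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_FAST sum.cpp
#include <cstddef>
#include <vector>

// Sums the first n elements of v using ordinary subscripting. Under
// fast mode, libc++ adds a constant-time bounds check to operator[]
// that traps when i >= v.size() instead of invoking undefined behavior.
int sum_first(const std::vector<int>& v, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}
// sum_first({1, 2, 3}, 4) is undefined behavior when built under none
// mode; under fast mode the process terminates at the failing check.
```

<p>Because the same translation unit can simply be rebuilt under a different mode, fleetwide enablement is a rollout problem rather than a code-migration problem.</p><p>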
An application cannot unilaterally declare a new ABI that differs from the rest of the platform with which it must link. Platforms that can allow ABI breaks might use libc++ ABI flags that enable additional hardening checks (such as hardened iterators); when an application selects a hardening mode, it enables all checks possible in the current ABI configuration.</p><h4>Partial enablement options</h4><p>For real-world adoption, developers must be able to selectively opt out of hardening in the most performance-critical parts of their code. The practical choice is often between disabling checks for one percent of the code or not enabling hardening at all.</p><p>Thus, an important requirement for hardening in libc++ is that it can be turned on and off on a per-TU (translation unit) basis. This is achieved using Itanium ABI tags&#8212;the hardened mode that is in effect in a given TU is encoded in the tag that is attached to all library functions and affects their mangled names. Thus, a call to a vector&#8217;s subscript operator would resolve to two distinct functions if one TU calls it under <code>fast mode</code> and another under <code>none</code> mode (same for all other modes) so the ODR (One Definition Rule) is not violated.</p><h4>Efficient and customizable failures</h4><p>By default, when a check fails, libc++ terminates the program with a trap instruction, which is the most secure and lowest-overhead option. This behavior can be completely overridden by the vendor of a given platform when building the library by providing a custom header with the desired implementation of the termination handler. This is different from the weak definition approach used in safe mode. While more flexible, the previous approach resulted in binary &#8220;bloat&#8221; since the linker in the general case cannot inline the function call. 
In practice, most applications don&#8217;t need to override the termination handler and, in line with the general C++ principle, should not pay for what they don&#8217;t use.</p><h3>Deploying Hardening at Scale</h3><p>While a flexible design is essential, its true value is proven only by deploying it across a large and performance-critical codebase. At Google, this meant rolling out libc++ hardening across hundreds of millions of lines of C++ code, providing valuable practical insights that go beyond theoretical benefits. While hardening has also been adopted in various open-source codebases (e.g., Google Chrome, Apple&#8217;s WebKit) and in a variety of other security-critical projects at Apple, the best documented case study is that of Google&#8217;s adoption of the feature across its server-side production systems, to be discussed next.</p><h4>Phase 1: Enabling hardening in tests</h4><p>The journey to production began more than a year before the final rollout with a large-scale cleanup effort to enable hardening in pre-production builds. The first attempt to enable the checks in our unit and integration tests broke more than 1,000 tests.</p><p>Fixing this required a concerted effort, including crowdsourcing fixes from interested engineers using their 20 percent time, resulting in hundreds of patches across Google&#8217;s <a href="https://research.google/pubs/why-google-stores-billions-of-lines-of-code-in-a-single-repository/">monorepo</a>. This was essential to establish a &#8220;green&#8221; baseline, ensuring new code submitted with hardening violations would fail in CI (continuous integration), preventing backsliding.</p><p>Once the tests passed, the hardened runtime was enabled in pre-production environments (canary, staging). This allowed developers time to learn about the hardening, fix newly surfaced issues, and build confidence.</p><p>This process also drove improvements to libc++ hardening itself. 
For example, the original two percent binary size increase was a blocker in certain environments, so a non-verbose mode was added with a much smaller footprint, reducing the binary size overhead to less than 0.5 percent.</p><p>This phase demonstrated the sheer volume of latent issues and reinforced the necessity of the project, setting the stage for the move to production.</p><h4>Phase 2: Data-driven consensus building</h4><p>With a clean test baseline, the next hurdle was proving that production hardening was viable for a fleetwide deployment. This required performance measurement, production pilots, and consensus building.</p><p>The primary concern was performance. To address this, key services were benchmarked to understand libc++ hardening&#8217;s performance characteristics. This is where we identified that profile-guided optimization allowed us to keep hardening overhead low.</p><p>Armed with early performance data, we ran pilots with early adopters, including large, high-traffic services. Those pilots provided real-world evidence that systems remained stable and that the crashes were manageable and highly valuable for debugging. It also provided real-world performance data to estimate the cost of a fleetwide rollout, relying on <a href="https://ieeexplore.ieee.org/abstract/document/5551002/">Google&#8217;s continuous profiling infrastructure</a>.<sup>1</sup></p><p>These success stories and production data were key to building broad consensus for a default-on rollout. 
We made the case to Google&#8217;s engineering leadership based on four arguments:</p><p>&#8226; Demonstrated affordability of hardening</p><p>&#8226; Clear security benefits</p><p>&#8226; Immediate impact on debuggability</p><p>&#8226; Long-term improvements to reliability</p><p>Ultimately, securing buy-in across a large engineering organization was the most time-consuming phase of the project, a reflection not on the technology, but on the diligence required for a change at this scale.</p><h4>Phase 3: Production rollout</h4><p>By this point, more than 100 pilots were running in production, ranging from security-critical services to high-performance parts of the Search backend. The final phase was the full production rollout, activating hardening by default across the fleet. This began the most critical stage of the project: uncovering and fixing the latent bugs that manifest only under the unique pressures of a live production environment.</p><p>Hardened services were progressively rolled out to production, and, as expected, the safety checks began to fire. Our hypothesis was that these new deterministic crashes would not be creating new instability but, rather, mostly displacing more dangerous and opaque memory-corruption errors. Live monitoring during the rollout confirmed this theory: As the new assertion failures appeared, the baseline of segmentation fault crashes began to recede.</p><p>We staffed a centralized response to rapidly diagnose and fix underlying issues, sending hundreds of patches across our monorepo. This process purged a significant volume of latent bugs. In some rare cases, we temporarily opted out specific workloads from hardening while we worked on a fix.</p><h4>Performance</h4><p>The most significant concern&#8212;performance&#8212;proved largely unfounded in practice. 
Across Google&#8217;s server-side C++ codebase, the average production performance overhead of enabling libc++ hardening was measured at a remarkably low 0.3 percent.</p><p>This affordability wasn&#8217;t accidental. As Chandler Carruth, Distinguished Engineer and overall C++ language lead at Google, <a href="https://chandlerc.blog/posts/2024/11/story-time-bounds-checking/">detailed</a>, several factors likely converged:</p><p>&#8226; <em><strong>Efficient check implementation.</strong></em> The hardening checks in libc++ were carefully implemented to be lightweight.</p><p>&#8226; <em><strong>Compiler optimizations.</strong></em> Modern compilers such as Clang/LLVM became adept at optimizing checks, eliminating redundant ones within loops or proven code paths.</p><p>&#8226; <em><strong>Cross-pollination.</strong></em> LLVM&#8217;s optimization capabilities for these kinds of checks have significantly improved over the years, partly driven by the needs of memory-safe languages such as Swift and Rust, which rely heavily on runtime checks and use LLVM as a compiler backend. C++ benefited indirectly from this broader ecosystem investment.</p><p>&#8226; <em><strong>PGO</strong></em> <em><strong>(profile-guided optimization).</strong></em> This was critical. High-quality PGO data allows the compiler to identify hot paths and often move checks out of the hot paths, minimizing impact on latency-sensitive code.</p><p>We anticipated that some critical code paths would be too sensitive for any overhead. To address this, we provided two distinct escape hatches: a mechanism to opt an entire service out of hardening, and a fine-grained API to bypass checks for a specific line of code. The final tally after the rollout was remarkable. Across hundreds of millions of lines of C++ at Google, only five services opted out entirely because of reliability or performance concerns. 
Work is ongoing to eliminate the need for these few remaining exceptions, with the goal of reaching universal adoption.</p><p>Even more telling, the fine-grained API for unsafe access was used in just seven distinct places, all of which were surgical changes made by the security team to reclaim performance in code that was correct but difficult for the compiler to analyze. This widespread adoption stands as the strongest possible testament to the practicality of the hardening checks in real-world production environments.</p><p>While a production performance overhead of 0.3 percent is relatively small, at Google&#8217;s scale, it represents a substantial absolute cost in terms of computing resources and energy. This was a deliberate, strategic investment in improving security and reliability.</p><h4>The payoff: quantifiable impact</h4><p>&#8226; <em><strong>Bug detection.</strong></em> More than 1,000 bugs were found and fixed during the rollout, including several security vulnerabilities and bugs that had lurked in the codebase longer than a decade. Hardening is projected to prevent 1,000 to 2,000 new bugs annually at the current development velocity.</p><p>&#8226; <em><strong>Reliability.</strong></em> The baseline segmentation fault rate across the production fleet dropped by approximately 30 percent after hardening was enabled universally, indicating a significant improvement in overall stability. 
There was an initial uptick in crashes due to hardening-check failures, but this matched the hypothesis mentioned earlier: Those failures displaced many segmentation faults.</p><p>&#8226; <em><strong>Security.</strong></em> Hardening demonstrably disrupted active internal offensive exercises and would have prevented others, proving its real-world effectiveness against exploitation attempts.</p><p>&#8226; <em><strong>Debuggability.</strong></em> Many subtle memory corruptions that are notoriously hard to debug were transformed into immediate, easily identifiable crashes at the point of error, saving significant developer time.</p><h3>The Path Forward</h3><p>The ultimate goal is to make these safety guarantees portable and available to all C++ developers. There is a quickly growing recognition in the C++ community that the status quo is undesirable, creating momentum for real change.</p><p>One way to push memory safety forward is to put a notion of a hardened Standard Library into the C++ Standard itself, so that developers across the board can get portable security guarantees. The initial <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3471r4.html">proposal</a> from Apple,<sup>2</sup> based on the implementation of hardening in libc++, has been recently voted into the upcoming C++26 Standard; the successful deployment experience of the hardened libc++ at Google and Apple has been crucial in getting the paper adopted.</p><p>The paper essentially standardizes a subset of libc++&#8217;s <code>fast mode</code> under the name of a <em>hardened implementation</em> that a program may choose, covering spatial memory safety in some of the most widely used standard classes, such as contiguous containers (whether a hardened or non-hardened implementation is the default is ultimately the choice of the vendor; some security-oriented platforms might choose to default to the hardened implementation). 
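</p><p>Concretely, a hardened implementation turns a precondition that the Standard already states into an enforced fail-stop check. A schematic sketch (an illustration of the idea, not libc++&#8217;s actual source):</p>

```cpp
// Schematic illustration (not libc++'s real implementation): a hardened
// subscript enforces the precondition the Standard already states
// (i < size()) instead of leaving violations as undefined behavior.
#include <cstddef>

template <typename T, std::size_t N>
struct hardened_array {
    T elems[N];

    T& operator[](std::size_t i) {
        if (i >= N)            // precondition stated in the Standard
            __builtin_trap();  // fail-stop instead of undefined behavior
        return elems[i];
    }
    constexpr std::size_t size() const { return N; }
};
```

<p>The only difference from an unhardened implementation is that the precondition is checked rather than assumed; in-bounds accesses behave identically.</p><p>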
The paper is based on an observation that in fact all the hardening checks are already stated, almost always explicitly, in the Standard in the form of preconditions; it&#8217;s just that violating a precondition used to result in the dreaded undefined behavior. Changing these cases of undefined behavior into useful well-defined behavior is, from the textual point of view, quite straightforward, making the proposal a lot less disruptive than might be expected.</p><p>Follow-up papers are intended to cover any missing checks that satisfy <code>fast mode</code> criteria (security-critical and low performance overhead), and it is expected that new additions to the standard library will use hardened preconditions as appropriate to avoid OOB (out of bounds) in the hardened implementation. Modes beyond <code>fast mode</code> are not currently considered for standardization; at least for the time being, the Standard should contain only checks that almost any program can afford. There are also no plans to propose any checks that would change the ABI.</p><p>Notably, the Standard leverages another C++26 feature, Contracts, which provides an extensive framework for specifying program invariants and handling violations. That gives developers significant flexibility in how they handle a failing hardening check. Contracts <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3191r0.pdf">were designed</a> with consideration for Library Hardening as a use case, ensuring that libc++ assertion failures can be modeled directly by the Contracts evaluation semantics (specifically, the trapping mechanism used in libc++ hardening is precisely the quick-enforce evaluation semantic).</p><h3>Conclusion</h3><p>The challenge of improving the memory safety of the vast landscape of existing C++ code demands pragmatic solutions. 
Standard library hardening represents a powerful and practical approach, directly addressing common sources of spatial safety vulnerabilities within the foundational components used by nearly all C++ developers.</p><p>Our collective experience at Apple and Google demonstrates that significant safety gains are achievable with surprisingly minimal performance overhead in production environments. This is made possible by a combination of careful library design, modern compiler technology, and profile-guided optimization.</p><p>Rolling this out initially at a massive scale was a large undertaking. However, much of the foundational work, in both the toolchain and in uncovering issues, has now been completed. The path for other organizations to adopt hardening is now significantly clearer and less daunting.</p><p>While not a panacea and not without tradeoffs, hardening eliminates entire classes of bugs and provides a substantial return on investment for security and reliability. <em>As such, we highly recommend that any organization using C++ enable hardening in their standard library today. </em>Whether this means enabling hardening in LLVM&#8217;s libc++ or requesting a comparable safety feature from other standard library implementations, it is a critical and affordable step forward in building a more secure and reliable C++ ecosystem.</p><div><hr></div><h4>References</h4><p>1. Ren, G., Tune, E., Moseley, T., Shi, Y., Rus, S., Hundt, R. 2010. Google-wide profiling: a continuous profiling infrastructure for data centers. IEEE Micro 30(4), 65&#8211;79; <a href="https://ieeexplore.ieee.org/abstract/document/5551002">https://ieeexplore.ieee.org/abstract/document/5551002</a>.</p><p>2. Varlamov, K., Dionne, L. 2025. Standard library hardening; <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3471r4.html">https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3471r4.html</a>.</p><div><hr></div><p><strong>Louis Dionne</strong> works at Apple in Languages and Runtimes. 
He is the lead maintainer of libc++, has contributed to various security initiatives in recent years, and is an active member of the C++ Standards Committee.</p><p><strong>Alex Rebert</strong> is an engineer at Google, where he focuses on memory safety. Previously, he co-founded ForAllSecure and led the creation of Mayhem, the autonomous system that won the 2016 DARPA Cyber Grand Challenge. He was recognized by <em>MIT Technology Review</em> as one of the 35 Innovators Under 35 and by <em>Forbes&#8217;s</em> 30 Under 30.</p><p><strong>Max Shavrick</strong> is a security engineer at Google and one of the technical leads focusing on addressing memory unsafety. Before that, he worked on Windows and Azure security, finding and fixing remote code execution vulnerabilities. He was also previously president and captain of RPISEC, a student-run cybersecurity club and CTF team at Rensselaer Polytechnic Institute.</p><p><strong>Konstantin Varlamov</strong> works at Apple in Languages and Runtimes and is one of the maintainers of libc++ and a member of the C++ Standards Committee. For the past few years, his primary focus has been on the development and standardization of libc++ hardening.</p>]]></content:encoded></item><item><title><![CDATA[Is SRE Anti-Transactional?]]></title><description><![CDATA[An API for interfacing with automaters]]></description><link>https://theofficialacm.substack.com/p/is-sre-anti-transactional</link><guid isPermaLink="false">https://theofficialacm.substack.com/p/is-sre-anti-transactional</guid><dc:creator><![CDATA[The Official ACM]]></dc:creator><pubDate>Wed, 18 Feb 2026 13:07:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ubtR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A major concern in the SRE movement is automation&#8212;that is, moving IT from focusing on transactional work to building systems that do work for us.</p><p>There are many competing definitions of SRE. If you ask 10 SRE engineers to define SRE, you&#8217;ll get 11 definitions. Some will focus on the <em>R</em> (reliability) as the driving force in everything from feature priorities, to budgets, to technology choices. Others use SRE as an umbrella term encompassing everything from application upkeep, cloud engineering, service provisioning, and platform engineering to nearly all aspects of keeping a technology working.</p><p>What they all have in common is an imperative to move away from transactional work. 
It may not be the defining attribute of the SRE domain, but it is an important aspect. Let&#8217;s drill down into it.</p><p>SREs do not like manual work. Give a team of SREs a manual process, and over time they will evolve it from fully manual (people doing work), to automated (software tools that humans use to make progress), to autonomous (systems that identify when work needs to be done, do the work, and alert humans in exceptional situations). The aim is to progress toward eliminating manual work from the system altogether.</p><p>SREs consider transactional work to be toil. Toil is work that, while valuable in and of itself, doesn&#8217;t raise the bar and improve the process for any similar future requests. It is tactical, not strategic. It addresses immediate needs rather than contributing to long-term improvements or strategic goals.</p><p>For example, deploying a new software release to a website is a good thing. The new software release has new features that benefit the user, increase profit, or otherwise provide value. In the old days, this process was done manually and would frequently require days or weeks of planning and execution. Then, the next time a new software release arrived, the same amount of work would be required to deploy it. That is <em>toil</em>.</p><p>On the other hand, a project that replaces all that manual toil with a CI/CD (continuous integration/continuous delivery or deployment) system such that developers can safely deploy new software releases in a self-service manner is <em>strategic</em>. It not only automates a manual process, thus reducing toil, but also permits the process to be done more frequently (even continuously) in smaller batches without waiting for IT to become available.</p><p>SREs gladly do manual work, but only as a necessary step of learning the process so that it can be automated and eventually become autonomous. 
Toil is acceptable when a task is infrequent or a process is new and ill-defined or experimental in nature. Those are exceptional situations, however.</p><p>As the old joke goes, in the future there will be a factory that needs only two employees: one human and one dog. The human&#8217;s job will be to feed the dog. The dog&#8217;s job will be to make sure nobody touches the equipment.</p><p>Once SREs rule the world, the dog will be automated as well.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ubtR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ubtR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ubtR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ubtR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ubtR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ubtR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105885,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://theofficialacm.substack.com/i/186791470?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ubtR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ubtR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ubtR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ubtR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceec812c-a652-4d0f-ba9b-93bbed2e43a1_1200x630.jpeg 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h3>Transactional Work</h3><p>Transactional work usually takes the form of a request that&#8217;s worked on by an individual until it&#8217;s completed. 
Tickets opened with your IT organization are transactional work. Whether you are requesting a replacement mouse or need assistance with some technology that isn&#8217;t working correctly, the work is transactional. It begins with a request, work is performed, the requester verifies the work is complete, and, lastly, the ticket is closed.</p><p>Transactional work is typically manual and tactical (not strategic). It addresses immediate needs rather than contributing to long-term improvements or strategic goals. It doesn&#8217;t improve the system, product, or processes in a meaningful way.</p><p>Toil scales linearly. If the number of requests increases 10x, the number of people required to complete the work increases 10x. This is unacceptable in today&#8217;s business environment. Since Ford popularized the automotive assembly line, it has become an expectation that cost should scale sublinearly with growth. That is, if 10 units of work require 10 people, 100 units of work should require significantly fewer than 100 people.</p><p>Toil is reactive and interrupt-driven by nature, which makes planning difficult. A request arrives at an unpredictable time and requires humans to stop whatever they are doing to work on the task. Since people are not available 24x7, the requester must wait. Everyone has experienced the frustration of waiting days for someone to perform a task that takes minutes.</p><p>Automation is available 24x7, so there is less waiting.</p><p>Toil drains engineers&#8217; time and energy, preventing them from focusing on high-impact engineering work like automation, system improvement, or innovative projects. 
It reduces the overall efficiency and morale of the team, as engineers are bogged down in routine maintenance rather than more meaningful contributions.</p><p>Luckily, toil is usually repetitive, and repetitive tasks are often prime candidates for automation.</p><h3>Automation</h3><p>SREs focus on building automation that eliminates the need to do manual work. It&#8217;s not that they&#8217;re lazy; they just want to be able to scale.</p><p>Give a team of SREs a fully manual task, and over time it will go from fully manual, to automated, to autonomous. The difference between automated and autonomous is the degree of human involvement: Automated systems are tools that humans use to get a task done. Autonomous systems are triggered when new work arrives and complete the work on their own. With autonomous systems, the humans are there only for exceptions&#8212;edge cases that haven&#8217;t yet been automated, bugs that need to be fixed, or system optimizations.</p><p>When work is automated, it can scale superlinearly. This is particularly important for Internet-based companies. As Michael O&#8217;Dell, chief scientist of the first commercial ISP, frequently said, &#8220;The only problem is scaling. All the others inherit from that one.&#8221;</p><p>The analogy we like to make is that SREs aren&#8217;t the assembly-line workers at an auto manufacturing plant. They&#8217;re the people who make the robots that make the cars.</p><p>A process generally goes through these stages:</p><p>1. <em><strong>Manual.</strong></em> Progress is made by humans doing the work, often typing or clicking, observing what results, and then making decisions about how and when to proceed.</p><p>2. <em><strong>Automated (sometimes called manual automation).</strong></em> A tool performs each step. When a task needs to be done, a human uses a tool (software) to perform that task. A single process might require many steps. 
SREs tend to write tools that automate those steps. Often, individual steps have bespoke tools, and the process is completed by humans who use the appropriate tool for each step, judge the results, and then decide when to move forward to the next step.</p><p>3. <em><strong>Autonomous systems (sometimes called fully automated).</strong></em> Software initiates and executes work without human intervention. Humans are there to handle exceptions and improve the system. Once a system is stable, the remaining work mostly involves reducing the number of exceptions that require human intervention, addressing any new features and business concerns, and achieving greater efficiency.</p><p>An autonomous system is never finished. SREs live in a Sisyphean struggle: As they get closer to perfect automation, a changing world throws new toil at them. If every edge case were handled today, new business needs would appear tomorrow. If the system runs smoothly now, an underlying system will require an upgrade tomorrow.</p><p>A good system will always be a victim of its own success, becoming so popular that it exceeds its original design limits and thus requires reengineering to overcome the new bottlenecks this growth exposes. Popularity inspires new use cases, which require redesigns and refactoring. If the system were designed to handle only green widgets, surely its success would make it attractive to the yellow-widget division. Sadly, while extending it to a new use case sounds easy, it rarely proves to be.</p><p>It&#8217;s financially impractical to automate every conceivable use case. Some use cases are so infrequent or complex that automation can offer little or no return on investment. Automating those cases would be wasteful and a bad business decision.</p><p>Thus, an autonomous system needs a team to maintain it, optimize it, and handle the exceptions. 
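The automated-versus-autonomous distinction above can be sketched in a few lines (a hypothetical Python sketch; the phase names and the Order shape are invented for illustration, not taken from the article). In the automated stage a human would invoke each phase tool and judge the result before proceeding; in the autonomous stage the loop below does both, and humans see only the exceptions it escalates.

```python
# Hypothetical sketch of an autonomous pipeline: work arrives as data,
# software runs every phase on its own, humans handle only exceptions.
from dataclasses import dataclass, field

@dataclass
class Order:
    cluster_id: str
    log: list = field(default_factory=list)

# Each phase was once a manual step performed with a bespoke tool.
def validate(order):  order.log.append("validated")
def provision(order): order.log.append("provisioned")
def verify(order):    order.log.append("verified")

PHASES = [validate, provision, verify]

def run_autonomously(orders, escalate):
    """Run every phase of every order; no human decides when to proceed."""
    for order in orders:
        try:
            for phase in PHASES:
                phase(order)
        except Exception as exc:
            escalate(order, exc)   # edge cases go to the humans

human_queue = []                   # the team's exception backlog
order = Order("cluster-1")
run_autonomously([order], lambda o, e: human_queue.append((o, e)))
```

The design point is the `escalate` callback: the goal of the maintaining team is to drive its invocation count toward zero over time, never to remove it entirely.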
Autonomous systems are never &#8220;one and done.&#8221;</p><h3>Example: Cluster Builds</h3><p>Tom was once on a team that was deploying storage clusters. Each cluster was made up of many individual hardware and software components.</p><p>Initially, the process was manual, with hundreds of steps, each documented to varying degrees of completeness.</p><p>Soon, each phase of the process was automated; one or more tools did most of the work of each phase. It required human judgment, however, to determine whether a phase had been completed correctly or fixes were required. We relied on human judgment to determine when to start each phase. This often required checking calendars or getting a go/no-go decision from other teams.</p><p>While this situation was a great improvement over the manual process, it still required considerable human intervention to tell us when to start, what parameters were required in each instance, when to proceed to each new phase, as well as how to handle quality control. Still, it was better than a manual process in that it was more consistent, took less time, and required less cognitive load to complete each task.</p><p>The next phase was to help the system become autonomous. Orders for new clusters were no longer presented during status meetings or via tickets. Instead, orders came from a database (imagine an order received via SalesForce). A validation process not only verified the request&#8217;s parameters, but also sent an email with an approval link to the designated budget controller. Each phase then was kicked off via timetables, verification of completion of preceding steps, and other triggers.</p><p>Once the system was stable, we discovered that a typical cluster build still required human intervention at least twice during the multiday process. As we fixed edge cases, improved algorithms, and fine-tuned the system, this was reduced to nearly zero. 
The system had become data-driven and even had a dashboard that showed our &#8220;human intervention&#8221; count dropping over time.</p><p>This count would still spike upward occasionally. These spikes correlated with new releases of software received from the developers, since each release had a tendency to break our fussy automation. Improving this required us to break out of our silo and work more closely with the developers.</p><p>We realized that their release process needed to include verifying that the cluster-deployment system supported the new release. It was a &#8220;lightbulb moment&#8221; for the product manager when he realized the ability to deploy clusters was as much a part of a new release as the new features we had promised to ship. Maybe this is obvious in hindsight, but the authors have seen this cycle (silo, lightbulb, cooperation) repeat frequently throughout our careers.</p><p>It took more than a year to go from fully manual to an autonomous system. Most of the fine-tuning that was required after that was to scale the system both horizontally (efficiently creating larger clusters) and vertically (creating more clusters per month).</p><p>The old manual system didn&#8217;t make anyone happy. Sales recognition (which starts <em>after</em> cluster delivery) was delayed. Customers were unhappy with misconfigurations and other bugs.</p><p>The new system was fast and predictable (three days). It had an SLA that allowed the company to tell customers when to expect delivery of the service, and the sales department was alerted that a contract had to be signed three days before the end of the quarter for timely revenue recognition. Best of all, the system was autonomous, meaning orders could be fulfilled even if the SREs were asleep.</p><h3>Example: Mouse and Cable Distribution</h3><p>We used to think that one help-desk task that could never possibly be automated had to do with the procurement of small items such as replacement mice and USB cables. 
A customer would open a ticket requesting (for example) a mouse, whereupon the help-desk person would provide the mouse, assign the cost to the person&#8217;s department, and so on. How could that get any better?</p><p>I was impressed to discover that Google&#8217;s help desk had actually succeeded in automating that process with a vending machine for physical items. Employees could swipe their badges, take the items they desired, after which their department would be billed. This system eliminated a significant percentage of help-desk visits&#8212;meaning the toil was gone. All that was left was a management process that involved ordering new items and monitoring for abuse.</p><p>Some industry pundits claim there is a world of &#8220;old IT&#8221; that is all toil, unfixable, and doomed. We like to believe there is always hope whenever there is sufficient scale and a willingness on the part of management to embrace creativity and innovation.</p><h3>How to Interface with SREs</h3><p>Given the SRE distaste for transactional work, what&#8217;s the best way to interact with an SRE team?</p><p>Ideally, transactional requests that are manual in nature (verbal requests, tickets generated by humans, and so on) should be replaced by interactions through software. That is, your team should be able to provide an API to your SREs or request that they provide you with one. An API provides a way for the SRE software to interact with your systems and vice versa. This paves the path to automation.</p><p>Sometimes, we get lucky because an API exists. Sometimes, though, the API is obscured. For example, the database or SaaS-based product your team is using might already have an API, but access to it must be enabled, properly secured, with rules of engagement defined.</p><p>Usually, however, a new API needs to be created from scratch. This isn&#8217;t easy. There&#8217;s no magic button you can press to make an API appear. 
Creating an API requires the very human process of sitting down, identifying the need and requirements, proposing designs, and allocating resources to build, test, and implement that design. This usually requires software engineering skills, which not every team has available. For example, in our experience human resources departments rarely have software engineers on staff.</p><p>Efforts that mesh two teams with different levels of experience with process automation can prove to be frustrating for both teams. The non-SRE team, lacking software-development experience, can be put off by the requirement to work through a detailed process analysis given that the transactional process seems so obvious to them. The SRE team, meanwhile, can often become frustrated by the other team&#8217;s tolerance of toil and surprised at the lack of disdain for it.</p><p>If creating an API is not an option, sometimes a little creativity is all that&#8217;s required. In one case, Tom discovered that a &#8220;mostly autonomous&#8221; system was hamstrung by the fact that another team was responsible for allocating ID numbers required by the process. Basically, the process was fully autonomous apart from one &#8220;stop the world&#8221; moment that required waiting for a human on another team to allocate an ID number. This other team, meanwhile, didn&#8217;t understand the problems that resulted from this delay. Accordingly, the requests were often ignored for days. No other team made such requests and so they were viewed as &#8220;weird&#8221; and assumed to be of low priority. In reality, the delays were causing major headaches.</p><p>By sitting down and discussing the situation, the teams were able to propose two creative solutions. One was to allocate large batches of ID numbers, reducing the human interaction to just refill the ID pool&#8212;something that might happen only a few times each year. 
The other idea was simply to permit the SRE team to allocate ID numbers with a particular prefix, thus giving them the ability to manage their own allocation process.</p><p>Likewise, SRE teams should be willing to create interfaces for others. That could be a traditional network-based API, but often a web-based user interface or dashboard will suffice. For example, the other team&#8217;s needs may be satisfied by gaining access to a dashboard in an existing monitoring or business information system. That could be a static webpage that&#8217;s updated periodically. And while the need to visit a webpage to find answers might smell like unacceptable toil to an SRE, a non-SRE might just as easily view it as a welcome improvement.</p><p>In the end, when we can find a way to work together to eliminate the need for tickets or to reduce toil in some other fashion, we all win.</p><h3>Conclusion</h3><p>The value of a team is its output. If the team&#8217;s role is to repeat the same task over and over again, that team&#8217;s value is going to scale linearly with its effort. But if the team&#8217;s role is to optimize processes through automation and the reduction of toil, then that team&#8217;s value will prove to be like compound interest, with growth propelling growth. SREs are happiest when they are in the latter category: creating value that scales.</p><p>Systems built by SREs are not fully autonomous on day one. It&#8217;s iteration over time that leads to a fully autonomous, functional, reliable service. This iterative process requires SREs to evaluate how much time and money should be spent to achieve the objective. It is the heart of engineering to find the fastest, cheapest, and safest way to create and maintain a system.</p><p>We have seen conflicts between SRE goals and &#8220;traditional IT&#8221; mindsets. 
Generally, these stem from the fundamental differences between the SRE mindset of automating wherever possible to eliminate toil and the more immediate focus other teams have on completing certain specific tasks.</p><p>Understanding this tension is key to interacting with SRE teams.</p><div><hr></div><h4>Related article</h4><p>Operations and Life:<br><strong><a href="https://queue.acm.org/detail.cfm?id=2945077">The Small Batches Principle</a></strong><br>Reducing waste, encouraging experimentation, and making everyone happy<br>Thomas A. Limoncelli<br><a href="https://queue.acm.org/detail.cfm?id=2945077">https://queue.acm.org/detail.cfm?id=2945077</a></p><div><hr></div><p><strong>Thomas A. Limoncelli</strong> is a senior platform engineer at Stack Overflow Inc. He works from his home in New Jersey. His books include <em>The Practice of Cloud Administration</em> (https://the-cloud-book.com), <em>The Practice of System and Network Administration</em> (https://the-sysadmin-book.com), and <em>Time Management for System Administrators</em>. He is <a href="https://bsky.app/profile/yesthattom.bsky.social">@YesThatTom on BlueSky</a> and blogs at <a href="https://yesthatblog.com/">YesThatBlog.com</a>. He holds a B.A. in computer science from Drew University.</p><p><strong>Christian Pearce</strong> is a platform-engineering manager and a technology leader. He is a manager at Stack Overflow, leading the Infrastructure Platform Team. He focuses on building high-performing teams centered on cloud infrastructure, developer experience, and scalable platform solutions. 
He maintains a personal website at https://www.christianpearce.com and is active in the engineering community.</p>]]></content:encoded></item><item><title><![CDATA[From floats to characters and back again]]></title><description><![CDATA["Is this common? Or am I missing some type of driver that would do this better?"]]></description><link>https://theofficialacm.substack.com/p/from-floats-to-characters-and-back</link><guid isPermaLink="false">https://theofficialacm.substack.com/p/from-floats-to-characters-and-back</guid><dc:creator><![CDATA[The Official ACM]]></dc:creator><pubDate>Wed, 04 Feb 2026 12:25:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NEiz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Hello, you&#8217;re reading <strong>&#8220;Queue: In Practice,&#8221;</strong> a brand-new ACM newsletter series. 
Featuring insights from </em>ACM Queue<em>, this series is written by engineers, for engineers.</em></p><p>Queue <em>does not focus on either industry news or the latest "solutions"&#8212;it focuses on the technical challenges looming on the horizon. Each issue provides a critical look at emerging tech, highlighting the problems likely to arise and posing the tough questions that software engineers should be thinking about right now. Sharpen your thinking and stay ahead of the curve.</em></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NEiz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NEiz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png 424w, 
https://substackcdn.com/image/fetch/$s_!NEiz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png 848w, https://substackcdn.com/image/fetch/$s_!NEiz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png 1272w, https://substackcdn.com/image/fetch/$s_!NEiz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NEiz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6331812,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://theofficialacm.substack.com/i/185442004?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!NEiz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png 424w, https://substackcdn.com/image/fetch/$s_!NEiz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png 848w, https://substackcdn.com/image/fetch/$s_!NEiz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png 1272w, https://substackcdn.com/image/fetch/$s_!NEiz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F202ce362-e771-4e84-ad76-fe0539102fac_4000x4000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>(credit: Art Attack)</p><h2>Driven to Distraction</h2><p>by George Neville-Neil (Kode Vicious)</p><p><strong>Dear KV,</strong></p><p>I know you have mentioned working on embedded systems in the past, and now that I have a related project, I have a question for you. The device we&#8217;re building has a lot of sensors and I&#8217;ve been slowly adding support for these to our embedded Linux kernel. Each device samples data from the environment, and they&#8217;re each a bit different, but the only common drivers that seem to match what we do are character drivers that just generate a stream of bytes. It seems strange to take a lot of floating-point data, turn it into characters, and then turn it back into floating point. Is this common? 
Or am I missing some type of driver that would do this better?</p><p><strong>Floating</strong></p><div><hr></div><p><strong>Dear Floating,</strong></p><p>I remember when I was preparing for my first talk years ago, people would tell me, &#8220;Start with a joke.&#8221; So, here&#8217;s one for this occasion:</p><p><em><strong>Three hardware designers go to see an OS developer.</strong></em></p><p><em><strong>The first hardware designer</strong> says, &#8220;I just developed a new device and I need a driver.&#8221;</em></p><p><em><strong>OS developer:</strong> &#8220;Sure, what does it do?&#8221;</em></p><p><em><strong>First hardware designer: </strong>&#8220;It reads GPS data and correlates it with star charts from a visible telescope.&#8221;</em></p><p><em><strong>OS developer: </strong>&#8220;You should write a character driver.&#8221;</em></p><p><strong>The second hardware designer</strong><em> says, &#8220;My new device takes temperature readings to five significant digits from across a square kilometer of ocean at depths of 1 to 100 meters.&#8221;</em></p><p><em><strong>OS developer:</strong> &#8220;Oh, so you need a character driver too.&#8221;</em></p><p><em><strong>The third hardware designer</strong> says, &#8220;My device samples quantum states throughout the known universe. It can literally see the face of god, should such exist.&#8221;</em></p><p><em><strong>OS developer:</strong> &#8220;Character driver!&#8221;</em></p><p>Alas, some jokes require a very specialized audience, but in this case I think the audience is... listening.</p><p>The thing is, once upon a time, operating systems handled many different types of data, and they did so in sometimes wildly incompatible ways. 
This was a problem because it meant applications had to be ported whenever a new operating system version came out or whenever someone got a new model computer.</p><p>The &#8220;everything is a byte stream, we don&#8217;t care what the format of your data is&#8221; ethos of early Unix had some very real benefits in that it allowed programs to be ported to different machines and different versions of Unix to be used with few or no changes.</p><p>But, of course, every good fairy tale also has a dark side&#8212;at least if you read the original Grimm and not that drivel that comes out of Los Angeles. And here, the problem with a simplifying assumption is that it can make for inefficient or, in your case, patently illogical software. Converting a number (which is what a computer is great at processing) into a string (which a computer is not ideally suited for) and then back to a number wastes time, memory, and CPU cycles&#8212;the three things we&#8217;re all told never to waste in our software!</p><p>KV&#8217;s particular favorite form of this stupidity: one of the best-known and most commonly used protocols for sending stock-ticker information does exactly this&#8212;that is, it converts numbers to strings and then back to numbers. This ridiculous scenario led finance companies to invest heavily in FPGAs (field-programmable gate arrays) just to get an edge in processing stock prices for high-frequency trading. KV was both appalled when he learned this and chagrined that he hadn&#8217;t bought stock in certain FPGA companies soon enough.</p><p>All of which is to say that sometimes, simplifying assumptions are a real problem, and sometimes, they make everything look like a nail, which then makes you think all you need is a hammer. The big challenge with modern systems is that 50 years of doing things the Unix way has left us bereft of better APIs. 
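The round trip KV describes can be made concrete with a short sketch (Python here for brevity, though the same applies to C on an embedded target; the sensor value is invented): the text route formats and parses on every sample, while the binary route just copies the bytes the machine already had.

```python
# Illustrative sketch (invented value) of float -> string -> float,
# as when sensor data crosses a character-device-style byte stream.
import struct

reading = 23.456789                        # hypothetical sensor sample

# Text route: format the number as a string, ship the bytes, parse back.
text_bytes = str(reading).encode("ascii")
recovered_text = float(text_bytes.decode("ascii"))

# Binary route: copy the 8 bytes of the double; no formatting, no parsing.
raw = struct.pack("<d", reading)
(recovered_raw,) = struct.unpack("<d", raw)
```

Both routes recover the value here (Python prints floats with enough digits to round-trip), but the text route pays formatting and parsing cost on every sample, and in C a careless `%f` can silently lose precision; the binary route is a fixed-size copy.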
It&#8217;s not just the drivers but also the application APIs on top of the operating system that deal only in byte streams. It&#8217;s as if the operating system designers threw up their hands and said, &#8220;Not my job!&#8221; and left all the data interpretation to the application programmers and device developers. Since these two parties rarely, if ever, talk to each other, no real progress has been made in this area from that time until now.</p><p>So, how might we make progress? The current fad for kernel bypass libraries addresses the matter by just ignoring the operating system altogether&#8212;which, as an operating system developer myself, I can totally understand. Most days, I&#8217;d like to ignore not just the operating system but go so far as to imagine putting on elbow-length rubber gloves, grabbing the operating system, and dunking it in a 55-gallon drum of acid in order to listen to it scream and die as it dissolves (with apologies to <em>Who Framed Roger Rabbit?</em>).</p><p>But, alas, in this world outside of fantasy, we must either deal with the operating system through modification or extension, bypass it, or start over from scratch by writing a new one. This last option is the most attractive to an operating system developer, but KV doubts your employer will sign up for the associated multiyear project. But if they do, please get in touch with me!</p><p>Your most expedient but least preferable solution would be to bypass the operating system since you can write a few APIs that allow your applications to talk efficiently with the devices. Bypassing gives you the most control, the highest fidelity of data, the lowest latency of access&#8212;but all at the price of generality. That is, when you bypass the operating system, you make it so that each new program must, again, be compatible with your bypass libraries. 
And if you want generality back, you wind up doing what operating system developers have done forever, which is to produce generalized APIs that can be consumed by arbitrary programs. The current middle ground is to add new system calls to the operating system&#8212;a process that is fraught with peril, because modern operating systems are huge and fragile. While people do add new system calls to operating systems, this is a rare event. And, if you were to do this, you would wind up maintaining the new syscalls locally for a very long time unless you were able to convince an open-source project that your new calls were generally useful.</p><p>So, that&#8217;s where you are: Make a character device driver and pay the cost of converting real data into strings, bypass the operating system, or innovate in the operating system space and produce some new system calls. I can&#8217;t say any of these choices are great. In fact, they all kind of suck in one way or another. But innovation in operating systems went out in the 1980s, and the people who are trying to bring that back now have a harder job than that crazy Greek guy rolling a rock up the hill.</p><p><strong>KV</strong></p><div><hr></div><p><strong>George V. Neville-Neil</strong> works on networking and operating system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large codebases. He is the author of <em>The Kollected Kode Vicious</em> and co-author with Marshall Kirk McKusick and Robert N. M. Watson of <em>The Design and Implementation of the FreeBSD Operating System</em>. For nearly 20 years he has been the columnist better known as Kode Vicious. Since 2014 he has been an industrial visitor at the University of Cambridge, where he is involved in several projects relating to computer security. 
He earned his bachelor&#8217;s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of the ACM, the Usenix Association, and IEEE. His software not only runs on Earth, but also has been deployed as part of VxWorks in NASA&#8217;s missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City.</p>]]></content:encoded></item><item><title><![CDATA[If you're tired of hearing about memory safety, this article is for you]]></title><description><![CDATA[Memory Safety for Skeptics]]></description><link>https://theofficialacm.substack.com/p/if-youre-tired-of-hearing-about-memory</link><guid isPermaLink="false">https://theofficialacm.substack.com/p/if-youre-tired-of-hearing-about-memory</guid><dc:creator><![CDATA[The Official ACM]]></dc:creator><pubDate>Wed, 21 Jan 2026 12:45:31 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!pM9H!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6646612-d04c-4756-9715-1080ffd8e327_621x621.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Happy New Year!</strong> Introducing <strong>&#8220;Queue: In Practice,&#8221;</strong> a brand-new ACM newsletter series for 2026. Featuring insights from </em>ACM Queue<em>, this series is written by engineers, for engineers.</em></p><p>Queue <em>does not focus on either industry news or the latest "solutions"&#8212;it focuses on the technical challenges looming on the horizon. Each issue provides a critical look at emerging tech, highlighting the problems likely to arise and posing the tough questions that software engineers should be thinking about right now. Sharpen your thinking and stay ahead of the curve.</em></p><div><hr></div><p>Memory safety&#8212;the property that makes software devoid of weaknesses such as buffer overflows, double-frees, and similar issues&#8212;has been a popular topic in software communities over the past decade and has gained special prominence alongside the rise of the <a href="https://www.rust-lang.org/">Rust programming 
language</a>. Rust did not invent the idea of memory safety, nor was it the first memory-safe language, but it did break the dam on the last major holdout context where memory safety had not yet been achieved: what we often call &#8220;systems programming.&#8221;</p><p>Rust&#8217;s big step function was to offer memory safety at compile time through the use of static analysis borrowed and grown out of prior efforts such as <a href="https://cyclone.thelanguage.org/">Cyclone</a>, a research programming language formulated as a safe subset of C. Rust, by offering a memory-safe-by-default experience for the &#8220;systems&#8221; domain, where operating systems, databases, file systems, embedded software, and the like are made, suddenly presented a new possibility to public policymakers and to leaders of engineering organizations: the mass reduction of memory unsafety across any domain.</p><p>In the years since Rust hit the scene, tech companies have published experience reports on the adoption of Rust for production systems, whether through rewrites of existing code or by producing new modules in Rust that might have otherwise been written in C or C++. The numbers were broadly consistent: a roughly <a href="https://www.memorysafety.org/docs/memory-safety/#how-common-are-memory-safety-vulnerabilities">70 percent reduction</a> in memory-safety vulnerabilities. Rust, more than just promising memory safety, was demonstrably translating safety guarantees into practical improvements in software security. This evidence, turning the abstract benefits of a semantic improvement into bottom-line improvements in business costs (vulnerabilities are <em>expensive</em> for all involved), meant that organizations beyond just engineering groups began to take notice.</p><p>Of course, the choice of programming language is a contentious one. Languages do not exist in a vacuum, and the &#8220;right&#8221; language for a job is heavily path dependent. 
What languages do the developers already know? What&#8217;s the timeline and budget for the project? How serious are the correctness constraints? The performance constraints? Do you expect to hire more developers, and what resources can you allocate to train them? If you&#8217;re an open-source project, you might ask which languages might bring in more contributors. For any project, your answer might be determined by what other libraries or tools you will need to integrate.</p><p>What&#8217;s more, many projects made their programming-language decision years ago&#8212;possibly decades ago. What should they do? If the code you have today is memory unsafe, in C or C++, how can you pursue safety without throwing the whole thing out?</p><p>In some circles, the answer given might be to &#8220;rewrite it in Rust&#8221; to replace legacy software written in memory-unsafe languages with new Rust equivalents. The benefits, supporters argue, are clear: comparable performance, modern tooling, and memory safety.</p><p>Yet, experienced developers know <a href="https://blog.developer.adobe.com/we-decided-to-rewrite-our-software-you-wont-believe-what-happened-next-dd03574f6654">rewrites are expensive</a> and frequently misguided. Often, demands for large-scale rewrites are not a carefully reasoned argument about tradeoffs but an aesthetic criticism of code that looks &#8220;ugly&#8221; or &#8220;too old.&#8221; If anything, those calling for mass rewrites reveal their own inability to do the difficult work of understanding and working with an existing codebase that does a job and does it well.</p><p>There are paths between accepting memory unsafety as the cost of doing business and undertaking a mass rewrite of stable systems in a new language to achieve safety. 
Reflexive rejection of a move to memory safety is misguided, especially when the benefits of memory safety can be achieved in a cost- and schedule-efficient way.</p><p>If you&#8217;re not yet sold on the value of memory safety, this article is for you. The goal in writing it is to treat the question of pursuing memory safety in legacy systems with the seriousness and rigor that it deserves.</p><p>Pursuing memory safety is worthwhile, with or without Rust, and I&#8217;d like to convince you to try.</p><div><hr></div><h3>In Budget, On Schedule</h3><p>Software systems do not exist in isolation; software is built to <em>do things</em>, to serve the needs of businesses and individuals by making systems more efficient or automatic. The development of software is constrained not by the theoretical limits of software&#8217;s abilities, but by the cost and schedule limitations of the team building it.</p><p>In <a href="https://www.cisa.gov/resources-tools/resources/case-memory-safe-roadmaps">&#8220;The Case for Memory Safe Roadmaps,&#8221;</a> a collection of government agencies from the &#8220;Five Eyes Countries&#8221; (the United States, United Kingdom, Australia, New Zealand, and Canada) collectively recommended that organizations develop roadmaps for transitioning their software development efforts to memory-safe languages.</p><p>It&#8217;s worth being clear here: Their focus is on <em>roadmaps</em>, and they very explicitly accept and discuss the challenge of the cost and schedule impacts of any transition toward memory safety.</p><p>Budget and schedule constraints and the desire for efficiency are part of what motivates the creation of software in the first place. Once that software is in place, it&#8217;s frequently mission critical, having replaced the knowledge and labor of people who would have previously done the jobs the software now performs. 
Instead of accountants, a company may have accounting <em>software</em>, with a smaller number of accountants who know how to interact with the software and use it to perform their own jobs, built on the data and embedded business logic the software provides.</p><p>Rewrites to critical software systems are risky precisely because the software itself is so important. Rewriting a software system, whether as an in-place rewrite where components are swapped out piece by piece, or as a wholesale rewrite with a cutover date, <a href="https://www.theverge.com/2025/1/13/24342282/sonos-app-redesign-controversy-full-story">risks the proper functioning</a> of the business if the rewrite fails.</p><p>Complex, long-running software systems can face other severe constraints as well. They may be impossible to turn off&#8212;having become so business critical and time sensitive that any attempt to bring them offline for maintenance or replacement is unacceptable. They may also have become lost artifacts, where the expertise of the individuals who created them or previously worked on them is lost because it wasn&#8217;t transferred to newer engineers, resulting in a current team that does not understand the system or feel comfortable making changes to it.</p><p>There are also ongoing costs associated with the development of software that might have to be diverted to support even an incremental rewrite. Depending on the business, there might not be funding available to support an <em>increase</em> to the development team, so diversion of resources toward a rewrite would mean a reduction in the delivery of features for the project, which may be untenable.</p><p>All of this is to say that rewrites, even incremental ones, are business decisions that have to be made as tradeoffs with other strategic goals. 
While motivated developers can make the best case possible for the upside of a rewrite, they must also grapple with the businesses&#8217; needs to deliver features and address bugs impacting users of the system today.</p><p>At the same time, a transition to memory-safe languages can bring benefits beyond just the safety (and thereby security) claims that are often given priority in these discussions.</p><p>Memory-safety violations, such as null pointer dereferences or indexing outside of the bounds of a memory buffer, can result in denial of service (or, in the context of the classic security CIA triad, availability failures) of the relevant software. This might mean on-call pages to respond to a production incident, a failure to meet service-level agreement guarantees for customers, or reduced revenue from lost customers or interruption of business operations.</p><p>Memory-safety issues are also often a central building block in a kill chain for achieving remote code execution by attackers. Even as far back as &#8220;<a href="https://archives.phrack.org/issues/49/14.txt">Smashing the Stack for Fun and Profit&#8221;</a> in 1996, we could see cybersecurity professionals documenting how to turn a buffer-overflow weakness into remote code execution and full access to the host. With that foothold in place, attackers can begin to exfiltrate data, move laterally within a network, escalate privileges, lock down a system with ransomware, conscript a host into a botnet, and more.</p><p>Software problems are <a href="https://www.sonatype.com/resources/articles/what-is-shift-left">cheaper to fix the earlier they</a> occur in the software development lifecycle. 
In the long term, stopping a bug from being written is cheaper than responding to a bug bounty submission or triaging a production outage.</p><p>This is not to say that <em>all</em> moves toward memory safety are cost effective or that all roadmaps for memory safety should be as aggressive as possible, but it is intended to make clear that there are both costs and savings to be had with any transition to memory safety.</p><div><hr></div><h3>Setting the Goalposts</h3><p>What is memory safety? Amid the public push toward memory safety, there ought to be a table-stakes answer to that question, but there is no single, fully agreed-upon, rigorous definition. There is a new effort, announced in a recent article published in <a href="https://cacm.acm.org/opinion/it-is-time-to-standardize-principles-and-practices-for-software-memory-safety/">Communications of the ACM</a>, to develop a standard definition of memory safety, but it is just beginning.</p><p>There is, however, a rough consensus among practitioners about what kinds of program behaviors are memory <em>unsafe</em>. That&#8217;s a good place to start.</p><p>My favorite short definition <a href="http://www.pl-enthusiast.net/2014/07/21/memory-safety/">comes from Michael Hicks</a>, an academic who works on programming languages:</p><p>&#8220;[A] program execution is memory safe so long as a particular list of bad things, called <a href="http://dl.acm.org/citation.cfm?id=773473.178446">memory-access errors</a>, never occur:</p><p>&#8226; Buffer overflow</p><p>&#8226; Null pointer dereference</p><p>&#8226; Use after free</p><p>&#8226; Use of uninitialized memory</p><p>&#8226; Illegal free (of an already freed pointer, or a non-<code>malloc</code>-ed pointer)&#8221;</p><p>You will sometimes see these categories broken down further, into spatial and temporal memory safety. 
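As a concrete illustration of the first item on Hicks's list (a sketch of my own, not from the article): in a memory-safe-by-default language such as Rust, an out-of-bounds read is a checked, recoverable error rather than a silent read of adjacent memory.

```rust
// Bounds-checked access: safe Rust offers no way to read past the end of
// the buffer; you get an Option instead of undefined behavior.
fn read_at(buf: &[u8], i: usize) -> Option<u8> {
    buf.get(i).copied()
}

fn main() {
    let buf = [10u8, 20, 30];
    assert_eq!(read_at(&buf, 1), Some(20));
    // In C, buf[9] would compile and read whatever sits next to the array
    // (a spatial violation); here the access is rejected at runtime.
    assert_eq!(read_at(&buf, 9), None);
}
```

Indexing with `buf[9]` would panic instead, which is still memory safe: the program halts deterministically rather than reading memory it doesn't own.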
<em>Spatial</em> covers memory-safety issues arising from accessing locations in memory that a program should not have access to (like a buffer overflow); <em>temporal</em> covers operations on memory done in the wrong order: for example, reading memory before it is initialized, trying to free an already freed pointer, or using a pointer after it has been freed.</p><p>There&#8217;s also the CWE (Common Weakness Enumeration) category for memory-safety issues, which decomposes Hicks&#8217;s list into more granular options. CWE is a taxonomy of software weaknesses, or as CWE puts it: &#8220;condition[s] in... software... that, under certain circumstances, could contribute to the introduction of vulnerabilities.&#8221;</p><p>In <a href="https://cwe.mitre.org/data/definitions/1399.html">CWE&#8217;s memory-safety category</a>, &#8220;buffer overflow&#8221; is further broken down into six different, more-specific weaknesses, some of which are further decomposed into their own variants. This can be useful when maximum precision is warranted but is perhaps too much detail for the purposes of this article.</p><p>These definitions provide a reasonably clear picture of what constitutes memory <em>unsafety</em>. So, memory safety is when a program is guaranteed not to have those weaknesses. This can be achieved by compile-time constraints on the semantics of programs or by runtime management of memory by a garbage collector, so long as the guarantee is upheld.</p><p>This is often when perceptive onlookers will cry foul. <a href="https://doc.rust-lang.org/book/ch20-01-unsafe-rust.html">Rust permits unsafety</a>! There&#8217;s a whole unsafe keyword! How is that meaningfully different from the guarantees of C or C++?</p><p>Of course, they&#8217;re right. Rust does permit programmers to write unsafe code, but as anyone who works in safety or security will tell you, defaults matter. In fact, defaults matter a lot!</p><p>Let&#8217;s use seat belts as an example. 
<a href="https://magazine.northeast.aaa.com/daily/life/cars-trucks/auto-history/a-seat-belt-history-timeline/">Seat belts became generally mandatory</a> across the United States between the late 1980s and the early 1990s. In 1985, when mandatory seatbelt laws first saw passage among the states, <a href="https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/810962#%3A~%3Atext%3D14%20percent%20in%20the%20early%2C75%20percent%20in%20another%2010">seatbelt usage sat at 21 percent</a> of riders. In 1994, the average seatbelt usage rate in the U.S. was 58 percent. As of 2017, it was <a href="https://crashstats.nhtsa.dot.gov/Api/Public/Publication/812465">89.7 percent.</a> That change in defaults led to massive increases in seatbelt usage and thereby saved lives. The National Highway Traffic Safety Administration estimates that in 2017 alone, <a href="https://www.nhtsa.gov/vehicle-safety/seat-belts">seatbelts saved the lives of nearly 15,000 Americans</a>.</p><p>The same truth applies in software. Before version 4.0.0 (published in 2017), Redis, the extremely popular key-value store, offered no access controls in its default configuration. Frequently, new users of Redis would unintentionally expose their instance publicly, and this insecurity would result in data spills or become a vector for host exploitation. As of version 4.0.0, Redis enters a &#8220;protected mode&#8221; when run with its default configuration and without password protection. This limits access to loopback interfaces. As the <a href="https://redis.io/blog/fewer-unsecured-redis-servers/">Redis company itself has since touted</a>, the introduction of protected mode has caused the number of publicly accessible Redis instances tracked on Shodan.io, a popular internet host aggregator, to decline substantially. 
In 2017, it had identified roughly 17,000 exposed Redis instances; in 2020, that number had declined to 8,000 in an audit by security company Trend Micro.</p><p>Bringing it back to memory safety, we can and should think of memory-safety guarantees by languages as a continuum, and we can split languages between &#8220;memory safe by default&#8221; and &#8220;non-memory safe by default&#8221; groups. This framing, <a href="https://memorysafety.openssf.org/memory-safety-continuum/">recommended</a> by the OpenSSF&#8217;s (Open Source Security Foundation&#8217;s) Memory Safety SIG (Special Interest Group), makes the options clearer:</p><p>&#8226; Using memory-safe-by-default languages</p><p>&#8226; Using memory-safe-by-default languages to interface with non-memory-safe-by-default languages</p><p>&#8226; Using non-memory-safe-by-default languages</p><p>Here, memory-safe-by-default languages include not only Rust, but also common garbage-collected languages such as Java, C#, Go, Swift, Python, Ruby, and more. Non-memory-safe-by-default languages include C and C++ most notably, but also Zig, which may be surprising to those who have watched memory-safety discussions from the sidelines.</p><p>While Zig does provide more ergonomic options for programmers to write memory-safe programs themselves, Zig is not a memory-safe language, because it does not guarantee memory safety even in its most conservative configuration. <a href="https://www.scattered-thoughts.net/writing/how-safe-is-zig/">Jamie Brandon&#8217;s breakdown of Zig&#8217;s memory safety</a> is a good walkthrough of why Zig&#8217;s guarantees are insufficient.</p><p>With a shared understanding of memory safety and memory-safe languages, let&#8217;s now dig into the concrete strategies for pursuing memory safety in real-world programs.</p><div><hr></div><h3>Strategies for Safety</h3><p>Each of the following strategies is intended to maximize the benefit of memory safety while minimizing the cost of pursuing it. 
The specific choice of which approach is right is context dependent and should be made with consideration of the importance of the component, the current and new target language, the team involved, and the timetable.</p><h4>Make new code memory safe</h4><p>The first and most obvious option is to make new code memory safe&#8212;that is, to write new components in a memory-safe language. While this seems simple, you must address certain caveats to make this approach successful.</p><p>First, you are unlikely to reap the benefits of memory safety if you try introducing memory-safe code alongside new memory-unsafe code. Think of it this way: In a fixed codebase that you continue to assure (via testing, code review, bug bounties, and more), the density of vulnerabilities decreases exponentially over time. As vulnerabilities become less and less dense in the codebase, the rate of new vulnerability discoveries also decreases, and so the overall assurance level of the code <em>increases</em>. The riskiest thing you can do to a codebase is change it. In the case of memory-unsafe languages, that change can induce memory-safety vulnerabilities.</p><p>The Google Chrome and Android teams have published extensively about their experiences incentivizing a move to memory-safe languages in their codebases. They instituted a rule called the <a href="https://chromium.googlesource.com/chromium/src/%2B/master/docs/security/rule-of-2.md">&#8220;Rule of Two,&#8221;</a> where all new code must be either sandboxed or in a memory-safe language. In practice, because sandboxing is difficult, this naturally gave developers incentive to pursue memory safety in most cases.</p><p>Surprisingly to the team, <a href="https://security.googleblog.com/2024/09/eliminating-memory-safety-vulnerabilities-Android.html">they reaped the benefits of this new policy</a> across the entire codebase&#8212;even the parts that weren&#8217;t rewritten. 
Because they had certain assurances about the new code inherent in the safety mechanisms it came with, they could focus assurance efforts on old code, which was now static. Through this, they not only reduced the overall rate of memory-safety vulnerabilities in the codebase, but also decreased the prevalence of vulnerabilities overall.</p><h4>Target rewrites to critical components</h4><p>Sometimes, rewriting code in a memory-safe language <em>can</em> be the right choice, but this is often a path to pursue only once you fully understand the challenges faced by the current memory-unsafe code.</p><p>Early in its history, the development of Rust was funded by Mozilla, makers of the Firefox web browser, and the flagship Rust project besides Rust&#8217;s own compiler was the <a href="https://servo.org/">Servo web-rendering engine</a>. Despite this, the first actual Rust code that Mozilla integrated into Firefox was not Servo; it was <a href="https://medium.com/mozilla-tech/deploying-rust-in-a-large-codebase-7e50328074e8">an MP4 video file parser</a>. They replaced the existing C++ parser with one written in Rust, moving from a memory-unsafe language to a memory-safe language, because it had long been a source of vulnerabilities. Firefox needs to parse MP4 files from untrusted sources, and failures to correctly handle that parsing could be dire. For Mozilla, it was a small but security-critical surface area that made sense to target for a rewrite.</p><p>Another helpful tool for targeting is Kelly Shortridge&#8217;s <a href="https://kellyshortridge.com/blog/posts/the-sux-rule-for-safer-code/">SUX Rule</a>: target code that is Sandbox free, Unsafe, and eXogenous. This means you should prioritize rewriting code that processes untrusted (exogenous) input, runs without a sandbox, and is written in a memory-unsafe language. 
Reviewing your own codebase for these areas can be a fast way to identify critical paths with high risk of exploitation in the presence of memory-safety vulnerabilities.</p><h4>Wrap unsafe code with safe interfaces</h4><p>When fully rewriting existing memory-unsafe code to a memory-safe language is not feasible, it might instead make sense to wrap it in a memory-safe interface. This does still lay the burden of ensuring safety properties on the programmer, both for the original code in the memory-unsafe language and for the correctness of the interface, but it then permits building safe and trusted new code on top of the old code. If you continue to work to assure the old code with techniques such as fuzz testing, analysis by sanitizers, or formal modeling, you can gain increased confidence in the latent unsafe code being wrapped.</p><p>This is in fact how many of the Rust standard library&#8217;s <a href="https://doc.rust-lang.org/nomicon/vec/vec.html">common container types are written</a>. Under the hood, they contain unsafe code to manage buffers and pointers in a way that is as efficient as possible, but the interface provided to the user does not give access to any materials (buffers, pointers, lengths) that would permit the user to violate memory-safety guarantees.</p><p>This &#8220;wrapping&#8221; approach helps constrain the &#8220;blast radius&#8221; of memory-unsafe code that can&#8217;t feasibly be removed or replaced and constrains the auditing scope and assurance costs of that code as well.</p><h4>&#8220;Get good&#8221; is not a strategy</h4><p>There is a common reply in conversations about memory safety, coming from the most hardcore skeptics: Programmers should just write better code. 
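The "wrapping" strategy described above can be shown in miniature. The following Rust sketch (a hypothetical `TinyStack` type, my illustration rather than real library code) keeps the unsafe operations private behind a safe interface, the same pattern the Rust standard library uses for its container types:

```rust
use std::mem::MaybeUninit;

// A tiny fixed-capacity stack. The unsafe code is confined to this type;
// callers cannot overflow the buffer or read uninitialized slots.
pub struct TinyStack {
    storage: [MaybeUninit<u64>; 8],
    len: usize,
}

impl TinyStack {
    pub fn new() -> Self {
        TinyStack { storage: [MaybeUninit::uninit(); 8], len: 0 }
    }

    /// Refuses to overflow the buffer (spatial safety): returns false when full.
    pub fn push(&mut self, v: u64) -> bool {
        if self.len == self.storage.len() {
            return false;
        }
        self.storage[self.len].write(v);
        self.len += 1;
        true
    }

    /// Refuses to read uninitialized memory (temporal safety): returns None when empty.
    pub fn pop(&mut self) -> Option<u64> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        // SAFETY: every slot below the old len was initialized by push.
        Some(unsafe { self.storage[self.len].assume_init() })
    }
}

fn main() {
    let mut s = TinyStack::new();
    assert_eq!(s.pop(), None); // no read of uninitialized memory
    assert!(s.push(7));
    assert_eq!(s.pop(), Some(7));
}
```

Auditing effort then concentrates on the one `unsafe` block and its invariant rather than on every call site.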
They argue, explicitly or implicitly, that programmers who benefit from the guardrails of memory safety are <em>bad programmers</em>, and that <em>real programmers</em> are sufficiently skilled that they do not need a machine double-checking their work.</p><p>Let&#8217;s be clear: This is anti-intellectual nonsense&#8212;macho self-aggrandizement masquerading as a serious technical argument. You should not take it seriously, and you should regard anyone advancing it as fundamentally unserious.</p><p>There is no step function in quality of work in the history of human achievement that happened because people one day woke up and decided to be better at their jobs. Improvements in productivity or quality or reductions in error and harm happen because of the invention of new techniques, processes, and tools.</p><p>Reductions in traffic fatalities in the 1980s and 1990s didn&#8217;t happen because drivers suddenly got better at driving; they happened because states enacted mandatory seat-belt laws.</p><p>While individuals <em>can</em> become more skilled at their jobs, working faster or producing fewer errors, <em>large groups</em> of people generally don&#8217;t do so without some force that incentivizes or enables that change. Even when improvements are nontechnical, they come from enhancements to process or incentives for behavior. Over the past several decades, hospitals, for example, have reduced in-hospital mistakes through increased use of standard checklists and the provisioning of common materials needed for emergencies in crash carts.</p><p>Programmers who argue against memory safety by arguing for mass self-improvement are posing an impossible future as an alternative to a credible opportunity for improvements in software safety and security. 
While there are credible case-specific arguments against individual paths to memory safety, they do not include sudden mass improvement of skill and quality across the industry. It&#8217;s important to make this clear.</p><div><hr></div><h3>Regulations and Requirements</h3><h4>No, governments are not banning C or C++</h4><p>One specter looming over conversations about memory safety is whether the use of memory-unsafe languages will become generally unacceptable, either through formal government regulation or a rise in common requirements for software purchasing.</p><p>Today no agency, either in the U.S. or outside of it, regulates against the use of languages that are non-memory-safe by default. Nor are there purchasing requirements in place calling for the use of memory-safe-by-default languages or even the presence of memory-safety roadmaps, at least among governments.</p><p>The Five Eyes report mentioned previously, &#8220;The Case for Memory Safe Roadmaps,&#8221; <em>recommends</em> that organizations establish roadmaps for the pursuit of memory safety, but this is nonregulatory, and no amendments have been made to federal software acquisition policy to require such a roadmap in the U.S. or elsewhere. The U.S. is not outlawing C or C++. While these agencies have recommended moving away from these languages for future software development, they have not recommended indiscriminate mass rewrites of existing code.</p><p>Also note that the processes for establishing regulation or requirements would face challenges and, whether successful or not, would be slow to take effect and offer ample time for feedback and consideration.</p><p>First, regarding the prospect of regulation around memory safety in the U.S., such regulation would need to be pursued by an agency that can establish jurisdiction. 
Chevron deference, a requirement that judges defer to U.S. regulatory agencies&#8217; determinations in most cases, was abolished in 2024 by the U.S. Supreme Court in <em>Loper Bright Enterprises v. Raimondo</em>. With <a href="https://www.lawfaremedia.org/article/lawfare-daily--the-supreme-court-takes-the-bait--loper-bright-and-the-future-of-chevron-deference">the end of Chevron deference in U.S. law</a>, this pursuit of memory-safety regulation would also likely need to be explicitly backstopped by Congressional mandate to ensure it survived legal challenges in which judges may overrule agency rulemaking.</p><p>Second, regarding <em>acquisition requirements</em> (the federal government&#8217;s term for the rules around purchasing done by the government), the <a href="https://www.acquisition.gov/browse/index/far">FAR (Federal Acquisition Regulation)</a> would need to be updated to incorporate requirements for memory safety. For reference, in 2021 President Biden signed <a href="https://bidenwhitehouse.archives.gov/briefing-room/presidential-actions/2021/05/12/executive-order-on-improving-the-nations-cybersecurity/">Executive Order 14028</a>, which included a request that federal agencies pursue an amendment to the FAR to require inclusion of an SBOM (software bill of materials) for all software purchased by the federal government. To date, those changes have not been made, and no such requirement is in place within the FAR.</p><p>This is not to say that regulation or future federal purchasing requirements are impossible, but simply to point out that none are in place today, and any such changes would take time to be enacted and implemented.</p><p>The U.S. 
government&#8217;s role around memory safety has so far been to act as cheerleader and promoter of the idea, including with the Office of the National Cyber Director&#8217;s report, <a href="https://bidenwhitehouse.archives.gov/oncd/briefing-room/2024/02/26/press-release-technical-report/">&#8220;Future Software Should Be Memory Safe;&#8221;</a> CISA (Cybersecurity and Infrastructure Security Agency) et al.&#8217;s &#8220;Case for Memory Safe Roadmaps;&#8221; and CISA&#8217;s inclusion of memory-safety recommendations within its <a href="https://www.cisa.gov/securebydesign">Secure by Design</a> effort to collaborate with industry on improving software security.</p><h4>Governments do not need to ban C or C++</h4><p>Even without government mandate, many organizations in the tech industry have publicly stated their support for pursuing memory safety. In 2023, the <a href="https://bidenwhitehouse.archives.gov/oncd/briefing-room/2024/08/09/fact-sheet-biden-harris-administration-releases-end-of-year-report-on-open-source-software-security-initiative-2/">Office of the National</a> Cyber Director put out an RFI (request for information) seeking advice from the public on how best to support improving the security of open-source software. That RFI included an interest in the possibility of promoting adoption of memory-safe languages in open source.</p><p>Respondents to the RFI, who included a number of universities, think tanks, corporations, and individuals, overwhelmingly supported a move toward memory safety. 
Few espoused a hardline goal of rewriting all existing code from non-memory-safe languages, but many did recognize the value of pursuing memory safety in new code, and of rewriting critical components in security-sensitive contexts when possible.</p><p>Some companies, <a href="https://security.googleblog.com/2024/03/secure-by-design-googles-perspective-on.html">most notably Google</a>, have been especially vocal about their experiences with the value of memory safety. To them, the promise of memory safety is a reduction in security-related costs for long-lived products such as the Google Chrome web browser or the Android operating system. By reducing memory-safety vulnerabilities at the source, they shift vulnerability costs left in the software development life cycle; catching a bug during development and stopping it from ever being shipped is orders of magnitude cheaper than receiving a security report, perhaps paying out a bug bounty, and then coordinating, preparing, and releasing a patch.</p><p>In some corners there has been paranoia and fear that recommendations around memory safety from the U.S. government and others portend some forced end to C or C++. Bjarne Stroustrup, originator of C++ and a continuing major participant in the ISO Working Group that maintains the C++ specification, has <a href="https://www.theregister.com/2025/03/02/c_creator_calls_for_action/">recently begun to sound alarm bells</a> in papers and speeches about the existential threat that failing to address the demands for memory safety poses to C++, with clear reference to the possibility that software written in non-memory-safe-by-default languages may be disallowed, or become practically untenable to market and sell, in the future.</p><p>This fear is simultaneously overblown and correct. It is overblown to suggest the U.S. 
or any other government is close to outlawing C or C++, but it is correct to note that the benefits of memory safety are becoming clearer with each case study performed at scale, and that we should expect natural incentives to slowly steer use and developer interest toward memory-safe languages and away from non-memory-safe ones. C and C++ won&#8217;t die, but they will likely decline and become legacy languages like COBOL or Ada. They will still sustain some degree of interest and community, and a smaller number of developers will likely continue to make their careers in these languages, but these languages will present developers with fewer labor-market opportunities in the future and are unlikely to ascend in popularity and use again without substantial changes to address their safety deficiencies.</p><div><hr></div><h3>Safety Is Worth Pursuing</h3><p>Memory-safe languages present the clearest opportunity today to substantially improve software security. While memory safety does not eliminate all classes of software weaknesses, it does eliminate a particularly pernicious class that leads to disproportionately severe vulnerabilities. While there are other techniques for addressing these kinds of weaknesses (for example, hardware-based approaches such as CHERI), they are less mature and generally more difficult to adopt at scale.</p><p>The opportunity memory safety presents today is similar to the state of automobile safety just prior to the widespread adoption of mandatory seat-belt laws. As car manufacturers began to integrate seat belts as a standard feature across their model lines and states began to require that drivers wear seat belts while driving, the rate of traffic fatalities and the severity of traffic-related injuries dropped drastically. Seat belts did not solve automobile safety, but they credibly improved it, and at remarkably low cost.</p><p>The same can be done with memory safety. 
There is an opportunity to make substantial inroads in addressing a serious class of vulnerabilities while also, long-term, saving money on the development and operation of software systems. Memory safety is not a silver bullet, but it is a credible and cost-effective assurance technique that we as an industry should pursue aggressively. We do not need to wait for regulation to catch up; it is in our best interests to act today.</p><div><hr></div><p><strong>Andrew Lilley Brinker</strong> is a principal engineer at MITRE, where he works on software security. He contributes to the CVE Quality Working Group, serves on the OmniBOR Core Team, and leads development of Hipcheck. He lives in southern California with his wife and two dogs.</p>]]></content:encoded></item></channel></rss>