It has been said that Documentation/memory-barriers.txt can be used to frighten small children, and perhaps this is true. But even if it is true, it is woefully inefficient. After all, there is a huge number of children in this world, so a correspondingly huge amount of time and effort would be required in order to read it to them all.
This situation clearly calls for automated tooling, which is now available in prototype form: it is possible to frighten small children at scale. This tool takes short fragments of concurrent C code as input and exhaustively analyzes the possible results. In other words, instead of perusing memory-barriers.txt to find the answer to a memory-ordering question, you can get your answer by writing a small test case and feeding it to the tool, at least for test cases within the tool's current capabilities. This two-part series gives an introduction to the tool, describing how to use it and how it works.
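To give a feel for what such a test case looks like, here is a rough sketch of a classic "store buffering" litmus test in the C-like format the tool accepts; the test name and variable names are purely illustrative, and the exact syntax is covered in the second article:

    C SB+illustrative

    (*
     * Each process stores to one shared variable and then loads the
     * other; the "exists" clause asks whether both loads can miss the
     * other process's store.
     *)

    {
    }

    P0(int *x, int *y)
    {
            int r1;

            WRITE_ONCE(*x, 1);
            r1 = READ_ONCE(*y);
    }

    P1(int *x, int *y)
    {
            int r2;

            WRITE_ONCE(*y, 1);
            r2 = READ_ONCE(*x);
    }

    exists (0:r1=0 /\ 1:r2=0)

The tool enumerates the executions permitted by the model and reports whether any of them satisfies the "exists" clause; in this unordered test, both loads really can miss the other process's store.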
To the best of our knowledge, this tool is the first realistic automated representation of Linux-kernel memory ordering, and is also the first to incorporate read-copy-update (RCU) ordering properties.
This article is organized as follows, with the intended audience for each section in parentheses:
The second article in this series will look into the issues raised by cycles and why they are important for memory models; it will also demonstrate how the herd tool can be used to test memory models.
Those wishing to skip the preliminaries found in these two articles and dive directly into the strong model will find it here. (Yes, there is also a weak model, but it is described in terms of how it differs from the strong model. So you should start with the strong model.)
Kernel hacking, even before Linux, has tended to rely more on intuition than on formal methods. Formal methods can nevertheless be useful for providing definite answers to difficult questions. For example, how many different behaviors can a computer program exhibit? Particularly one that uses only values in memory, with no user input or output?
Computers being the deterministic automata they are, most people would say only one, and for uniprocessor systems they would be basically correct. But multiprocessor systems can give rise to a much wider range of behaviors, owing to subtle variations in the relative timing of the processors and the signals transmitted between them, their caches, and main memory. Memory models try to bring some order to the picture, first and foremost by characterizing exactly which outcomes are possible for a symmetric multiprocessor (SMP) system running a certain (small) program.
Even better, a formal memory model enables tools to automatically analyze small programs, as described here and here. However, those tools are specialized for specific CPU families. For analyzing the Linux kernel, what we need is a tool targeted at a higher level, one that will be applicable to every CPU architecture supported by the kernel.
Formal memory models require extreme precision, far beyond what the informal discussion in memory-barriers.txt can possibly provide. To bridge this gap in the best way possible, we have formulated the guiding principles listed in the following section.
Our memory model is highly constrained because it must match the kernel's behavior (or intended behavior). However, there are numerous choices to be made, so we formulated the following principles to guide those choices:
When all else is equal, a stronger memory model is clearly better, but this raises the question of what is meant by “stronger”. For our purposes, one memory model is considered to be stronger than another if it rules out a larger set of behaviors. Thus, the weakest possible memory model is one that would allow a program to behave in any way at all (as exemplified by the “undefined behavior” so common in programming-language standards), whereas the strongest possible memory model is one that says no program can ever do anything. Of course, neither of these extremes is appropriate for the Linux kernel, or for much of anything else.
The strongest memory model typically considered is sequential consistency (SC), and the weakest is release consistency with processor-consistent special operations (RCpc). SC prohibits any and all reordering, so that all processes agree on a single global order of all processes' accesses; this is theoretically appealing but expensive, so much so that no mainstream microprocessor provides SC by default. In contrast, RCpc is fairly close to the memory models we propose for the Linux kernel, courtesy of the Alpha, ARM, Itanium, MIPS, and PowerPC hardware that the Linux kernel supports.
On the other hand, we don't want to go overboard. Although strength is preferred over weakness as a general rule, small increases in strength are not worth order-of-magnitude increases in complexity.
Simpler is clearly better; however, simplicity will always be a subjective notion. A formal-methods expert might prefer a model with a more elegant definition, while a kernel hacker might prefer the model that best matched his or her intuition. Nevertheless, simplicity remains a useful decision criterion. For example, assuming all else is equal, a model with a simpler definition that better matched the typical kernel hacker's intuition would clearly be preferred over a complex counterintuitive model.
The memory model must support existing non-buggy code in the Linux kernel. However, our model (in its current form) is rather limited in scope. Because it is not intended to be a replacement for either hardware emulators or production compilers, it does not support:
The memory model must be compatible with the hardware that the Linux kernel runs on. Although the memory model can be (and is) looser than any given instance of hardware, it absolutely must not be more strict. In other words, the memory model must in some sense provide the least common denominator of the guarantees of all memory models of all CPU families that run the Linux kernel. This requirement is ameliorated, to some extent, by the ability of the compiler and the Linux kernel to mask hardware weaknesses. For example:
Nevertheless, the memory model must be sufficiently weak that it does not rule out behaviors exhibited by any of the CPU architectures the Linux kernel has been ported to. Different CPU families can have quite divergent properties, so that each of Alpha, ARM, Itanium, MIPS, and PowerPC required special attention at some point or another. In addition, hardware memory models are subject to change over time, as are the use cases within the Linux kernel. The Linux-kernel memory model must therefore evolve over time to accommodate these changes, which means that the version presented here should be considered an initial draft rather than being set in stone. It seems likely that this memory model will have the same rate of change as does Documentation/memory-barriers.txt.
Providing compatibility with all the SMP systems supporting Linux is one of the biggest memory-model challenges, especially given that some systems' memory models have not yet been fully defined and documented. In each case, we have had to take our best guess based on existing documentation, consultation with those hardware architects willing to consult, formal memory models (for those systems having them), and experiments on real hardware, for those systems we have access to. In at least one case, this might someday involve a computer museum.
Thankfully, this situation has been improving. For example, although formal memory models have been available for quite some time (such as here [400-page PDF]), tools that apply memory models to litmus tests have only appeared much more recently. We most certainly hope that this trend toward more accessible and better-defined memory models continues, but in the meantime we will continue to work with whatever is available.
The memory model should support future hardware, within reason. Linux-kernel ports to new hardware must supply their own code for the various memory barriers, and might one day also need to supply their own code for similar common-code primitives. But since common code is valuable, an architecture wishing to supply its own code for (say) READ_ONCE() will need a very good reason for doing so.
This proposal assumes that future hardware will not deviate too far from current practice. For example, if you are porting Linux to a quantum supercomputer, the memory model is likely to be the least of your worries.
Where possible, the model should be compatible with the existing C and C++ memory models. However, there are a number of areas where it is necessary to depart from these memory models:
On the positive side, the Linux kernel has recently been adding functionality that is closer to that of C and C++ atomics, with the ongoing move from ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE() being one example and the addition of smp_load_acquire() and smp_store_release() being another.
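For readers unfamiliar with these primitives, a minimal sketch (variable names invented here) of the translation and of the acquire/release pair:

    int r;

    /* Old style: ACCESS_ONCE() serves for both loads and stores. */
    r = ACCESS_ONCE(x);
    ACCESS_ONCE(x) = 1;

    /* New style: separate load and store primitives. */
    r = READ_ONCE(x);
    WRITE_ONCE(x, 1);

    /* Closer still to C11 atomics: a store with release semantics
     * paired with a load with acquire semantics. */
    smp_store_release(&x, 1);
    r = smp_load_acquire(&x);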
Defining a memory model inevitably uncovers interesting questions and areas of uncertainty. For example:
In a perfect world, we would resolve each and every area of uncertainty, then produce a single model reflecting full knowledge of all the hardware that the Linux kernel supports. However, astute readers might have noticed that the world is imperfect. Furthermore, rock-solid certainties can suddenly be cast into doubt, either with the addition of an important new architecture or with the uncovering of a misunderstanding or an error in the documentation of some existing architecture. It will therefore sometimes be necessary for the Linux-kernel memory model to say “maybe”.
Unfortunately, existing software tools are unable to say “maybe” in response to a litmus test. We therefore constructed not one but two formal models, one strong and the other less strong. These two models will disagree in “maybe” cases. Kernel hackers should feel comfortable relying on ordering only in cases where both models agree that ordering should be provided, and hardware architects should feel the need to provide strong ordering unless both models agree that strong ordering need not be provided. (Currently these models are still very much under development, so it is still unwise to trust either model all that much.)
Causality is an important property of memory models, in part because causality looms large in most peoples' intuitive understanding of concurrent code. However, causality is a generic term, lacking the precision required for a formal memory model. In this series we will therefore use the terms “causality” and “causal relationship” quite sparingly, instead defining precise terms that will be used directly within the memory model. But a brief discussion now will help illuminate the topic and will introduce some important relationships between causality, ordering, and memory models.
Causality is simply the principle that a cause happens before its effect, not after. It is therefore a statement about ordering of events in time. Let's start with the simplest and most direct example. If CPU A writes a value to a shared variable in memory, and CPU B reads that value back from the shared variable, then A's write must execute before B's read. This truly is an example of a cause-and-effect relation; the only way B can possibly know the value stored by A is to receive some sort of message sent directly or indirectly by A (for example, a cache-coherence protocol message).
Messages take time to propagate from one CPU or cache to another, and they cannot be received before they have been sent. (In theory, B could guess the value of A's write, act on that guess, check the guess once the write message arrived, and if the guess was wrong, cancel any actions that were inconsistent with the actual value written. Nevertheless, B could not be entirely certain that its guess is correct until the message arrives—and our memory models assume that CPUs do not engage in this sort of guessing, at least not unless they completely hide its effects from the software they are running.)
On the other hand, if B does not read the value stored by A but rather an earlier value, then there need not be any particular temporal relation between A's write and B's read. B's read could have executed either before or after A's write, as long as it executed before the write message reached B. In fact, on some architectures, the read could return the old value even if it executed a short time after the message's arrival. A fortiori, there would be no cause-and-effect relation.
Another example of ordering also involves the propagation of writes from one CPU to another. If CPU A writes to two shared variables, these writes need not propagate to CPU B in the same order as the writes were executed. In some architectures it is entirely possible for B to receive the messages conveying the new values in the opposite order. In fact, it is even possible for the writes to propagate to CPU B in one order and to CPU C in the other order. The only portable way for the programmer to enforce write propagation in the order given by the program is to use appropriate memory barriers or barrier-like constructs, such as smp_mb(), smp_store_release(), or C11 non-relaxed atomic operations.
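As a concrete (if hedged) sketch with invented variable names, the classic message-passing pattern needs ordering on both the writing and the reading side:

    /* CPU A: the barrier forces the store to data to propagate to
     * other CPUs before the store to ready. */
    WRITE_ONCE(data, 42);
    smp_mb();
    WRITE_ONCE(ready, 1);

    /* CPU B: the barrier forces the load from ready to execute
     * before the load from data. */
    r1 = READ_ONCE(ready);
    smp_mb();
    r2 = READ_ONCE(data);

With ordering on both sides, the outcome r1 == 1 && r2 == 0 is forbidden; remove either barrier and it becomes possible on weakly ordered hardware. The same guarantee can be had more cheaply by replacing CPU A's barrier-and-store with smp_store_release(&ready, 1) and CPU B's load-and-barrier with smp_load_acquire(&ready).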
A third example of ordering involves events occurring entirely within a single CPU. Modern CPUs can and do reorder instructions, executing them in an order different from the order they occur in the instruction stream. There are architectural limits to this sort of thing, of course. Perhaps the most pertinent for memory models is the general principle that a CPU cannot execute an instruction before it knows what that instruction is supposed to do.
For example, consider the statement “x = y;”. To carry out this statement, a CPU must first load the value of y from memory and then store that value to x. It cannot execute the store before the load; if it tried then it would not know what value to store. This is an example of a data dependency. There are also address dependencies (for example, “a[n] = 3;” where the value of n must be loaded before the CPU can know where to store the value 3). Finally, there are control dependencies (for example, “if (i == 0) y = 5;” where the value of i must be loaded before the CPU can know whether to store anything into y). In the general case where no dependency is present, however, the only portable way for the programmer to force instructions to be executed in the order given by the program is to use appropriate memory barriers or barrier-like constructs.
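Put in Linux-kernel terms, and with marked accesses so that the compiler is less able to break the dependency (variable names are again invented), the three kinds of dependency look something like this:

    /* Data dependency: the value loaded from y must be known before
     * the store to x can take place. */
    WRITE_ONCE(x, READ_ONCE(y));

    /* Address dependency: the value loaded from n determines the
     * address being stored to. */
    WRITE_ONCE(a[READ_ONCE(n)], 3);

    /* Control dependency: the value loaded from i determines whether
     * the store to y happens at all. */
    if (READ_ONCE(i) == 0)
            WRITE_ONCE(y, 5);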
Finally, at a higher level of abstraction, source code statements can be reordered or even eliminated entirely by an optimizing compiler. We won't discuss this very much here; memory-barriers.txt contains a number of examples demonstrating the sort of shenanigans a compiler can get up to when translating a program from source code to object code.
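One small example of such shenanigans, sketched here with invented variable names: a compiler may hoist identical stores out of both branches of an if statement, after which the CPU is free to reorder the store with the earlier load:

    /* Source as written: the store to b appears to be ordered after
     * the load from a by a control dependency. */
    q = READ_ONCE(a);
    if (q)
            WRITE_ONCE(b, 1);
    else
            WRITE_ONCE(b, 1);

    /* What the compiler may produce instead: the identical stores are
     * hoisted out of the branches, no conditional separates the load
     * from the store, and the CPU may reorder them. */
    q = READ_ONCE(a);
    WRITE_ONCE(b, 1);

memory-barriers.txt discusses this and related cases, along with their remedies, in its section on control dependencies.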
The second half of this series focuses on the specific problem of cycles.
We owe thanks to H. Peter Anvin, Will Deacon, Andy Glew, Derek Williams, Leonid Yegoshin, and Peter Zijlstra for their patient explanations of their respective systems' memory models. We are indebted to Peter Sewell, Susmit Sarkar, and their groups for their seminal work formalizing many of these same memory models. We all owe thanks to Dmitry Vyukov, Boqun Feng, and Peter Zijlstra for their help making this human-readable. We are also grateful to Michelle Rankin and Jim Wasko for their support of this effort.
Quick Quiz 1: But my code contains simple unadorned accesses to shared variables. So what possible use is this memory model to me?
Answer: You are of course free to use simple unadorned accesses to shared variables in your code, but you are then required to make sure that the compiler isn't going to trip you up—as has always been the case. Once you have made sure that the compiler won't trip you up, simply translate those accesses to use READ_ONCE() and WRITE_ONCE() when using the model. Of course, if your code gives the compiler the freedom to rearrange your memory accesses, you may need multiple translations, one for each possible rearrangement.
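For example (a minimal sketch with invented variable names), a fragment using plain accesses would be presented to the model in marked form:

    /* In the kernel source (assuming you have verified that the
     * compiler cannot interfere with these accesses): */
    x = 1;
    r = y;

    /* As fed to the memory model: */
    WRITE_ONCE(x, 1);
    r = READ_ONCE(y);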
Posted Apr 27, 2017 7:53 UTC (Thu) by jzbiciak (guest, #5246)
At a previous job, I know we used diy/litmus to perform various litmus tests against our interconnect. This was a few years ago, so obviously it was an older version than the one described there today. But, it seems like the work and the tool are still relevant.
At first blush, it appears this work has some overlap with the work that team's doing. Or, if nothing else, the two efforts seem complementary. Your work seems focused on modeling the Linux kernel requirements, while the diy/litmus folks seem more focused on analyzing hardware implementations.
Is there an opportunity for some cross-pollination here? For example:
If nothing else, diy/litmus seems like it could provide a ready-made experimental framework for testing hypotheses against current hardware.
Posted Apr 27, 2017 17:51 UTC (Thu) by PaulMcKenney (subscriber, #9624)
We have done some work generating and running Linux kernel modules from litmus tests, though this is limited by the fact that the automatically generated pseudo-C litmus tests lack type information. I am looking to add the type information, which would make it easier to generate a kernel module that GCC was happy with, but haven't gotten this task done as quickly as I would like. (Story of my life!)
We have also done some comparisons against hardware memory models.
In short, I completely agree that validation of memory models such as this one requires considerable time and attention!
Posted Jun 8, 2017 4:26 UTC (Thu) by johnbender (guest, #116592)
Some thoughts as I read through:
> Any number of compiler optimizations. For example, our model currently does not account for compiler optimizations that hoist identical stores from both branches of an if statement to precede that statement.
This has until very recently (POPL 2017) been a very serious problem for weak memory models. There is only one that I'm aware of for C++11 that avoids the "thin air" read problem that generally arises as a consequence of supporting these types of optimizations. See [1] for more.
> The memory model must be compatible with the hardware that the Linux kernel runs on. Although the memory model can be (and is) looser than any given instance of hardware, it absolutely must not be more strict. In other words, the memory model must in some sense provide the least common denominator of the guarantees of all memory models of all CPU families that run the Linux kernel.
This sort of "lower bound" on bad memory behavior is normally why people target C11 or C++11, but if the authors happen to see this comment I would recommend checking out the work of Karl Crary at CMU [2]. In his model he has omitted the traditional total order on writes present on all modern architectures with the intent of future-proofing his semantics.
As a further plug for his semantics, it is not "axiomatic" in the sense of the traditional memory models. Importantly, it introduces the notion of programmer-specified orders, which fits more closely with programmer intuition (by my reckoning) when writing the algorithms than, say, the C++11 approach of memory orderings for particular cross-thread operations.
In any event an enjoyable read that I will have to come back to later!
[1] https://people.mpi-sws.org/~orilahav/papers/main.pdf
[2] https://www.cs.cmu.edu/~crary/papers/2015/rmc.pdf