Thayumanavar's Blog: C++11 Memory Model and Atomics

Saturday, March 12, 2016

Introduction
This blog post shall describe the C++11 memory model and the atomics API. Before we delve into these topics, let's understand the compiler transformations and processor execution applied to the code you write. Modern compilers and multi-core processors want to squeeze out performance, and hence they transform and reorder the code. Thus the program being executed is not in the order specified by the programmer in his source code. The key thing is that they need to maintain the contract with the programmer: the transformation maintains the semantics and achieves the same end result.

The current trend in computing is a shared memory utilized by many processors. The general programming model adopted involves creating threads that communicate via a shared location or variable. One of the key requirements is to ensure mutual exclusion on the shared variable. This is usually provided by the lock, mutex, latch, and semaphore constructs that have been invented. These constructs are usually implemented by means of special instructions, called atomic instructions, that the instruction set of a particular architecture provides. The atomic instructions disallow instruction reordering around them and thus serve as a full fence. Depending on the nature of the code, we can disallow reordering in one direction and allow it in certain situations. This allows for significant performance enhancements in certain pieces of code and on certain weakly ordered platforms like POWER and ARM. One of the trends in current computing is for developers to avoid locks, and this calls for a language runtime API to support it.

Reasoning about the correctness of a program using the above programming model is not intuitive either. This becomes difficult in the face of the code transformations being done by the compiler and various processors. Processors and compilers optimize the execution of code by performing transformations as if the program executes in a serial, isolated way. Thus, in the case of multi-threaded code, we require 'special markings or directives' (these are nothing but volatile, atomics, and special fence instructions) to avoid these transformations and thus maintain the sanity of the program being executed. A variety of processor architectures exist on the market, like Intel x86, SPARC, POWER, ARM, and Alpha. They reorder instructions and perform speculative execution, ranging from very weak to strong ordering. Thus, to support multi-threaded code that runs on these platforms, it becomes necessary for language designers to define a memory model. The memory model provides the semantics for concurrent program execution and defines it in a platform-agnostic way, with platform-dependent implementations being provided by the language runtime. This post shall describe the C++ memory model and the atomics API it provides.

2 G Scam
(
1
)
AI Class
(
1
)
Instruction reordering and Memory Barriers
Algorithms
(
11
)
The CPU's execution speed is orders of magnitude faster than the time it requires to fetch the instruction or the d
Algorithms Data
Structures
(
10
) operates on. This arises because of the finite latency exhibited by the   interconnects that connects CPUs to ca
Aries
(
1
) memory and other CPU's. Also  access times of caches and memory differ by orders of magnitude.  The laws of n
limits from overcoming this time gap. However, to squeeze performance and improve scalability, CPU designers
C++
(
9
)
levels of hierarchy of cache with different access times and additional components like the store buffer , the inval
Calculus
(
1
)
queues and ensure that CPU execution pipeline is kept busy. The end result of all this is that the memory refere
cluster
(
1
)
doesn't occur in program order as perceived by the programmer.   In addition the compiler also do aggre
crypto
(
1
) optimization to improve program performance. The reorderings done by   the processor is consistent for a s
cryptography
(
1
) threaded program.  A multi-threaded prorgram that has a  shared variable with atleast one of the memory opera
cs 373
(
1
) being a write result in a  data race. Data race  cause a prorgram to behave in an undefined and unpredictable way.
To illustrate the reordering a processor performs, let's consider two threads T1 and T2, which execute the snippets of code shown below:

Initially data = 0, flag = false

Thread 1        Thread 2
data = 7        while (!flag)
flag = true         print data
The semantics a programmer would expect from the above snippet of code is that a value of 7 shall be output. But to our surprise, the code can possibly output a value of 0. This is feasible if the following happened:

- Either lines 1 or 2 got reordered in thread 1
- A similar reorder of lines 1 or 2 happened in thread 2
- The necessary interleaved execution of the code happened (so that the value 0 is output)

Before I discuss further how we go about solving the above problem, let's formalize the concepts by introducing some definitions:

Sequential Consistency: Sequential Consistency represents the most intuitive concept for reasoning about concurrent programs, and it satisfies the two conditions below:

- The operations of each thread are executed in some sequential order on each processor.
- The sequence of the operations executed by a thread is in the program order.

It is very clear from the above discussion that processors, with their out-of-order execution, and the compiler, with its code optimization, easily violate sequential consistency.
Memory Operations: The basic memory reference operations are:

- Load (read) from a memory location
- Store (write) to a memory location

In addition, there are synchronization memory operations like lock, unlock, atomic load, atomic store, and atomic RMW (read-modify-write).
Sequenced Before Relation: We can define a binary relation among the memory operations X and Y in a thread as follows:

- All evaluations (memory references) of X precede the evaluations of Y. (In other words, if neither X's memory references precede Y's nor vice versa, then X is not sequenced before Y.)
- If there is a third memory operation Z, and X is sequenced before Y and Y is sequenced before Z, then it follows that X is sequenced before Z.

The above defines a sequenced-before relation, and this induces a partial order in a program. Mathematically, the sequenced-before relation is a binary relation $\lt$ that induces a partial order satisfying the properties of asymmetry ($x \lt y$ implies $\lnot (y \lt x)$) and transitivity ($x \lt y$ and $y \lt z$ imply $x \lt z$).

Conflicting Operations: Two operations X and Y are said to conflict if they access the same memory location and at least one of them is a modifying operation (store, atomic store, atomic RMW).
Let us return now to the code snippet we introduced above. Obviously, the solution is to avoid the reordering of memory operations the CPU does. To avoid this reordering, the processor provides special instructions called fence or barrier instructions. There are also implicit instructions which, along with a memory operation, provide this barrier or fence effect. What a fence or barrier does is tell the processor to avoid reorderings.

Thread 1 essentially consists of store memory operations (writes to the memory locations that the variables alias). What is essentially needed there is to avoid the stores being reordered. Similarly, in thread 2, we want to avoid the loads being reordered. There are two fundamental memory operations, the load and the store, and hence, based on sequencing these loads and stores, there exist 4 different combinations of barriers:

- StoreStore Barrier: There is no reordering of stores that come before the barrier with the stores after the barrier or fence. Thus all preceding stores will complete at the barrier before subsequent stores take effect.
- StoreLoad Barrier: There is no reordering of stores that come before the barrier with the loads that come after it. Thus all preceding stores will complete at the barrier before subsequent loads take effect.
- LoadLoad Barrier: There is no reordering of loads that come before the barrier with the loads that happen after the barrier. Thus all preceding loads are deemed to complete at the barrier before subsequent loads take effect.
- LoadStore Barrier: There is no reordering of loads that come before the barrier with the stores that happen after the barrier. Thus all preceding loads are deemed to complete at the barrier before subsequent stores take effect.

In general, we can generalize the definition of an X-Y barrier as: memory operations X sequenced before the barrier complete before the memory operations Y that come after it. X and Y can each be either load or store. For the program snippet above, what we need in thread 1 is a StoreStore barrier between lines 1 and 2. In thread 2, what we need is a LoadLoad barrier. Below is the code snippet with the correct barriers introduced, as required for a sequentially consistent execution of the program.

Initially data = 0, flag = false

Thread 1              Thread 2
data = 7              while (!flag) { } // wait
StoreStore Barrier    LoadLoad Barrier
flag = true           print data

With this in place, we have laid the necessary foundation to delve into the C++11 memory model, which is the topic of the next section.

C++11 Memory Model and Atomic API

The memory model defines the semantics of concurrent operations. The language run-time hides the complexity of the different platform architectures, thereby providing the programmer with a unified view. This gives the programmer the opportunity to write lock-free code, to provide an optimized, tuned in-house implementation of the synchronization constructs, as well as to write portable, race-free code.

C++11 provides std::atomic<>, a template type that provides atomic operations and an option to specify a memory ordering attribute. The various operations (some operations are specialized for specific types) are load, store, exchange, compare_exchange_weak, compare_exchange_strong, fetch_add, fetch_sub, fetch_and, fetch_or, fetch_xor, ++, and --.

Specific to each operation is an attribute called memory ordering, which the programmer specifies. The possible values that this memory order attribute can take are:

memory_order_seq_cst

memory_order_acq_rel

memory_order_acquire

memory_order_release

memory_order_consume

memory_order_relaxed

Before we define what each of these memory orderings implies, we need to define two more relations:

Synchronizes With relationship: It specifies an ordering relationship between operations on different threads. The ordering that is feasible among two independent threads happens by means of a synchronization operation. Some examples of this synchronizes-with relationship are:

- Thread creation synchronizes with the start of the created thread.
- Thread completion synchronizes with the return of join.
- Unlocking (releasing) a mutex synchronizes with locking (acquiring) the mutex on another thread.
- A write(-release) operation synchronizes with a read-acquire on another thread that reads from that write.

This relation defines 'inter-thread happens before', and transitivity is obeyed. These relationships describe the ordering that is established at runtime.

Happens Before relationship: There exists a happens-before relation between two memory operations X and Y if any of the following is satisfied:

- X is sequenced before Y.
- X synchronizes with Y.
- For some memory operation Z, X happens before Z and Z happens before Y; then it follows that X happens before Y.

Based on the above, we can define a data race in terms of the happens-before relation: any two operations that refer to the same location on different threads, with no happens-before relation among them and at least one of them being a write, shall result in a data race. Let me illustrate the 'happens-before' relation in the C++11 code below:

#include <atomic>
#include <iostream>
#include <thread>
#include <chrono>

int data;
std::atomic<bool> flag(false);

void store_thread()
{
    data = 7;                                       // ... (1)
    flag.store(true, std::memory_order_release);    // ... (2)
}

void load_thread()
{
    while (!flag.load(std::memory_order_acquire))   // ... (3)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    std::cout << "Data Value= " << data << std::endl; // ... (4)
}

int main()
{
    std::thread t1(store_thread);
    std::thread t2(load_thread);
    t1.join();
    t2.join();
    return 0;
}
Various relations that exist in the above code are:

- (1) is sequenced before (2) (and the release memory order on flag keeps (1) from moving past (2)). Hence (1) happens before (2).
- (3) is sequenced before (4) (and the acquire memory order on flag keeps (4) from moving before (3)). Hence (3) happens before (4).
- When the value of flag is true, the write(-release) in (2) synchronizes with the read(-acquire) of the flag in (3). Hence (2) happens before (3).

The above three relations suffice to say that the output of the program is always 7.
Sequential Consistent Ordering

By default, the memory ordering attribute that all atomic operations in C++ use is memory_order_seq_cst. This guarantees sequential consistency and underpins the data-race-free model of the C++ memory model. The sequentially consistent memory ordering incurs a performance cost and is not required in all cases; the programmer can make use of the weaker orderings and ensure the program behaves as defined for the use-case at hand. Below is code that illustrates sequentially consistent ordering.
#include <atomic>
#include <thread>
#include <cassert>
#include <iostream>

std::atomic<bool> x, y;
int z = 0;

int main()
{
    // Note: no need for explicit specification of std::memory_order_seq_cst
    // as it is the default. For readability and understanding purposes,
    // it is specified here explicitly.
    auto set_x = [&]() {
        x.store(true, std::memory_order_seq_cst);
    };
    auto set_y = [&]() {
        y.store(true, std::memory_order_seq_cst);
    };
    auto check_x_then_y = [&]() {
        while (!x.load(std::memory_order_seq_cst)) { } // wait until x is set
        if (y.load(std::memory_order_seq_cst))
            ++z;
    };
    auto check_y_then_x = [&]() {
        while (!y.load(std::memory_order_seq_cst)) { } // wait until y is set
        if (x.load(std::memory_order_seq_cst))
            ++z;
    };

    std::thread t1(set_x);
    std::thread t2(set_y);
    std::thread t3(check_x_then_y);
    std::thread t4(check_y_then_x);
    t1.join(); t2.join(); t3.join(); t4.join();

    assert(z != 0); // Impossible to hit this assert: under seq_cst at least
                    // one of the checking threads must see the other store.
    std::cout << z << std::endl;
    return 0;
}
Acquire release Ordering

Acquire-release ordering is more relaxed than sequential consistency in that it doesn't introduce a full barrier. It defines a synchronizes-with relationship by means of a release operation on a write to a memory location, which is paired with an acquire operation on a read of that location from another thread. The code snippet introduced while explaining happens-before provides an example of this. The release operation disallows preceding accesses from moving beyond it, whereas the acquire operation disallows subsequent accesses from moving before the load.

Relaxed Ordering

The relaxed memory ordering is the weakest memory ordering. It allows memory operations to be freely reordered for optimization and performance, provided this doesn't violate any existing happens-before relation. It doesn't introduce a synchronizes-with relationship. Below is code that illustrates a scenario where we make use of relaxed ordering.


I have decided not to discuss the details relating to std::memory_order_consume. I plan to discuss it in a separate post if possible.

Mapping C++ memory ordering to the memory barriers

The acquire operation is essentially a LoadLoad and LoadStore barrier, whereas the LoadStore and StoreStore barriers are required for a release operation. An acquire-release operation uses the LoadLoad, LoadStore, and StoreStore barriers. Sequential consistency uses all four barriers.
Conclusion

In this post, I have introduced the C++ memory model and the atomics API the language provides. We have also laid out the formal foundations that underlie the concurrency mechanisms the C++ API provides. C++11 provides the programmer ample opportunity to employ these APIs to write lock-free, highly performant code without sacrificing portability and program correctness.

Posted by Thayumanavar at 10:16 PM

manavar.blogspot.com/2016/03/c11-memory-model-and-atomics.html