Prerequisites of the Go Sync Mutex Implementation

This blog is split out of the sync mutex implementation blog due to its length. For a better reading experience and clearer organization, it makes sense to turn this background knowledge into a separate article.

This blog introduces language-level instructions and functions such as atomics and synchronization primitives, along with runtime spinning, sleeping, and waking. It also briefly covers memory barriers and assembly code in Go.

Share Memory by Communicating

Go guides us to share memory by communicating, and not to communicate by sharing memory.

The latter part is easy: don't have goroutines communicate by reading and writing the same memory. The former, "share memory by communicating", says you should share memory (such as a pointer) through a communication mechanism, even though it doesn't state explicitly what that "communicating" mechanism is.

It's not a rule that states the same operation in positive and negative forms, but a guide on how to share memory, and a caution when sharing it.
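For instance, here is a minimal sketch of the idea (the names are my own): one goroutine builds a result, then hands the pointer over a channel, so only one goroutine touches the memory at a time.

package main

import "fmt"

type result struct {
    sum int
}

func main() {
    ch := make(chan *result) // the "communicating" mechanism

    go func() {
        r := &result{}
        for i := 1; i <= 10; i++ {
            r.sum += i
        }
        ch <- r // share the pointer; this goroutine stops touching it afterwards
    }()

    r := <-ch // ownership of *r transfers here, no lock needed
    fmt.Println(r.sum) // 55
}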

Atomic

As the name suggests, atomic means that an operation executes as an indivisible unit: no other operation can observe it half-done, and the compiler will not reorder other instructions into it. I won't discuss the CPU instructions, such as barriers, that ensure atomicity. Instead, we focus on the language-level API that the atomic package provides on top of the hardware. These APIs are the foundation of implementing a language-level mutex; a short usage example follows the list.

  • the swap operation, is the atomic equivalent of:
    old = *addr
    *addr = new
    return old
  • the compare-and-swap operation, is the atomic equivalent of:
    if *addr == old {
        *addr = new
        return true
    }
    return false
  • the add operation, is the atomic equivalent of:
    *addr += delta
    return *addr
  • the load operation, is the atomic equivalent of:
    return *addr
  • the store operation, is the atomic equivalent of:
    *addr = val
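To make the list concrete, here is a small demonstration of these APIs from the sync/atomic package (the values are only for illustration):

package main

import (
    "fmt"
    "sync/atomic"
)

func main() {
    var v int32 = 1

    old := atomic.SwapInt32(&v, 2)                  // swap: v becomes 2, returns old value 1
    swapped := atomic.CompareAndSwapInt32(&v, 2, 3) // CAS: v == 2, so v becomes 3, returns true
    sum := atomic.AddInt32(&v, 10)                  // add: v becomes 13, returns 13
    cur := atomic.LoadInt32(&v)                     // load: returns 13
    atomic.StoreInt32(&v, 0)                        // store: v becomes 0

    fmt.Println(old, swapped, sum, cur, atomic.LoadInt32(&v)) // 1 true 13 13 0
}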

Note that atomics are tightly bound to the memory model, as they are the primitive way to guarantee how memory is accessed and modified without interference from others. As the Go documentation states:

In the terminology of the Go memory model, if the effect of an atomic operation A is observed by atomic operation B, then A “synchronizes before” B.

Additionally, all the atomic operations executed in a program behave as though executed in some sequentially consistent order.

This definition provides the same semantics as C++'s sequentially consistent atomics and Java's volatile variables.
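As a tiny illustration of "synchronizes before" (a sketch of my own, not from the documentation): if the atomic load observes the atomic store, every write made before the store is also visible afterwards.

package main

import (
    "fmt"
    "sync/atomic"
)

var (
    data  int
    ready atomic.Bool
)

func main() {
    go func() {
        data = 42         // plain write
        ready.Store(true) // atomic operation A
    }()

    for !ready.Load() { // atomic operation B, spins until it observes A
    }
    // A "synchronizes before" B, so the plain write to data is visible here.
    fmt.Println(data) // always prints 42
}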

The memory model is well beyond the scope of the current discussion, so I won't say much about it here.

Assembly as Go Func Implementation

Before diving deeper, it's worth knowing how Go code calls assembly code. To be clear, this blog doesn't deal with Go assembly itself or the data layout (ABI) at all. It only records the convention by which a Go function's body is written in a few lines of assembly.

When we say assembly code, we mean the function body/implementation is written in assembly instead of high-level Go code. Note that Go assembly is not machine-specific assembly; it's a semi-abstract, portable form of assembly.

In Go, such a function consists of two parts, which live in separate files when assembly is used.

  • declaration: must be written in Go, as that is the identifier used to call it
  • implementation: resides in a .s file, following the convention

The convention connecting the declaration and the implementation is fragile, because you need to consider the data layout of each function carefully. For more details, refer to the Go Assembly Guide and the Go Internal ABI Specification. For now, those are far out of scope.

Function Declaration:

// main.go
func Add(a, b int) int

Then comes the body. It must use the same name as the declaration, prefixed with the character ·, which is the Unicode middle dot (U+00B7), not a regular ASCII dot. Below is a working amd64 version:

// file add_amd64.s (the _amd64 suffix restricts the file to amd64 builds)
#include "textflag.h"

// func Add(a, b int) int
// Frame size is 0; arguments plus result occupy 24 bytes
// (two 8-byte arguments, one 8-byte result).
TEXT ·Add(SB), NOSPLIT, $0-24
    // Load the first argument into register AX
    MOVQ a+0(FP), AX
    // Load the second argument into register BX
    MOVQ b+8(FP), BX
    // Add BX into AX
    ADDQ BX, AX
    // Store the result into the return slot
    MOVQ AX, ret+16(FP)
    // Return the result
    RET

// note that a trailing newline at the end of the file is required, otherwise the assembler complains with an unexpected EOF error
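With both files in the same package, the function is called like any other Go function; go build assembles the .s file automatically (this sketch assumes an amd64 build):

// main.go
package main

import "fmt"

func Add(a, b int) int // implementation lives in add_amd64.s

func main() {
    fmt.Println(Add(40, 2)) // prints 42
}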

This introduction is necessary because the CAS below is likewise implemented in assembly rather than via a system call.

CAS: Implement by Assembly

CAS (the CompareAndSwapInt32 func here) compares the contents of a memory location with a given value (old) and, only if they are the same, modifies the contents of that memory location to a given new value, all as one atomic operation.

The atomicity is achieved through architecture-specific assembly code. The signature is defined in src/sync/atomic/doc.go:

// CompareAndSwapInt32 executes the compare-and-swap operation for
// an int32 value. Consider using the more ergonomic and
// less error-prone [Int32.CompareAndSwap] instead.
func CompareAndSwapInt32(addr *int32, old, new int32) (swapped bool)
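As a usage sketch (my own example, not from the standard library), CAS is typically called in a retry loop; here it implements a counter that never exceeds a limit, no matter how many goroutines race on it:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// incrementUpTo adds 1 to *addr unless it has already reached limit.
// The loop retries whenever another goroutine modified *addr in between.
func incrementUpTo(addr *int32, limit int32) bool {
    for {
        old := atomic.LoadInt32(addr)
        if old >= limit {
            return false
        }
        if atomic.CompareAndSwapInt32(addr, old, old+1) {
            return true
        }
    }
}

func main() {
    var n int32
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            incrementUpTo(&n, 50)
        }()
    }
    wg.Wait()
    fmt.Println(n) // always 50, never more
}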

The func body, which is just a jump instruction, is in src/sync/atomic/asm.s:

TEXT ·CompareAndSwapInt32(SB),NOSPLIT,$0
JMP runtime∕internal∕atomic·Cas(SB)

The real implementation lives in src/runtime/internal/atomic/atomic_arm64.s. We only show the arm64 version here, but note that it is architecture-specific.

// bool Cas(uint32 *ptr, uint32 old, uint32 new)
// Atomically:
//  if(*val == old){
//      *val = new;
//      return 1;
//  } else
//      return 0;
TEXT ·Cas(SB), NOSPLIT, $0-17
    MOVD    ptr+0(FP), R0
    MOVW    old+8(FP), R1
    MOVW    new+12(FP), R2
    // If the CPU supports LSE atomics, use the single CASALW instruction...
    MOVBU   internal∕cpu·ARM64+const_offsetARM64HasATOMICS(SB), R4
    CBZ     R4, load_store_loop
    MOVD    R1, R3
    CASALW  R3, (R0), R2
    CMP     R1, R3
    CSET    EQ, R0
    MOVB    R0, ret+16(FP)
    RET
// ...otherwise fall back to a load-exclusive/store-exclusive retry loop.
load_store_loop:
    LDAXRW  (R0), R3
    CMPW    R1, R3
    BNE     ok
    STLXRW  R2, (R0), R3
    CBNZ    R3, load_store_loop
ok:
    CSET    EQ, R0
    MOVB    R0, ret+16(FP)
    RET

Finally, it's worth highlighting that the atomicity of CAS is guaranteed by assembly instructions: a single CASALW when the CPU supports LSE atomics, or an exclusive load/store loop otherwise. This is still a relatively high-level view, because it doesn't deal with the individual CPU cores and their caches.

runtime doSpin and canSpin

The mutex tries spinning while acquiring the lock, a feature provided by the Go runtime package. In this section we take a look at it to learn more about runtime spinning.

runtime_doSpin

The runtime_doSpin is declared in src/sync/runtime.go but the definition lives in src/runtime/proc.go due to the go:linkname directive.

// src/sync/runtime.go

// runtime_doSpin does active spinning.
func runtime_doSpin()

// src/runtime/proc.go

//go:linkname sync_runtime_doSpin sync.runtime_doSpin
//go:nosplit
func sync_runtime_doSpin() {
    procyield(active_spin_cnt)
}

// declared in the runtime, implemented in assembly
func procyield(cycles uint32)

proc.go provides many of the functions in charge of the goroutine scheduler, the famous GMP model in the Go runtime.

The definition of procyield is likewise written in assembly, in src/runtime/asm_arm64.s:

TEXT runtime·procyield(SB),NOSPLIT,$0-0
    MOVWU   cycles+0(FP), R0
again:
    YIELD
    SUBW    $1, R0
    CBNZ    R0, again
    RET

The parameter cycles is moved into R0, and the code keeps subtracting 1 from R0 until it reaches zero. The YIELD instruction hints to the processor that the current thread is willing to yield its execution resources, which is useful for cooperative multitasking and similar scenarios.

In short, it busy-waits for the specified number of cycles and then returns.

runtime_canSpin

Compared to runtime_doSpin, runtime_canSpin is much more complex, as it is tightly coupled with the P (processor) part of the scheduler. It's better explained in a future post, after covering the GMP model.

One interesting detail: if the program runs with GOMAXPROCS <= 1 or on a single-core machine, canSpin always returns false. In other words, it won't spin under those circumstances.

// Active spinning for sync.Mutex.
//
//go:linkname sync_runtime_canSpin sync.runtime_canSpin
//go:nosplit
func sync_runtime_canSpin(i int) bool {
    // sync.Mutex is cooperative, so we are conservative with spinning.
    // Spin only few times and only if running on a multicore machine and
    // GOMAXPROCS>1 and there is at least one other running P and local runq is empty.
    // As opposed to runtime mutex we don't do passive spinning here,
    // because there can be work on global runq or on other Ps.
    if i >= active_spin || ncpu <= 1 || gomaxprocs <= sched.npidle.Load()+sched.nmspinning.Load()+1 {
        return false
    }
    if p := getg().m.p.ptr(); !runqempty(p) {
        return false
    }
    return true
}
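To see how canSpin and doSpin cooperate, here is a hedged sketch of the pattern; canSpin and doSpin below are hypothetical stand-ins, since user code cannot call the unexported runtime hooks:

package main

import (
    "runtime"
    "sync/atomic"
)

var state int32 // 0 = unlocked, 1 = locked

// Stand-in for runtime_canSpin: give up after a few iterations or on a single core.
func canSpin(iter int) bool { return iter < 4 && runtime.NumCPU() > 1 }

// Stand-in for runtime_doSpin: burn a few cycles; the real procyield issues YIELDs.
func doSpin() {
    for i := 0; i < 30; i++ {
    }
}

func lock() {
    iter := 0
    for !atomic.CompareAndSwapInt32(&state, 0, 1) {
        if canSpin(iter) {
            doSpin() // stay on the CPU, hoping the holder releases soon
            iter++
            continue
        }
        runtime.Gosched() // the real mutex sleeps via runtime_SemacquireMutex instead
    }
}

func unlock() { atomic.StoreInt32(&state, 0) }

func main() {
    lock()
    unlock()
}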

runtime SemacquireMutex and Semrelease

Similarly, the function runtime_SemacquireMutex is linked to sync_runtime_SemacquireMutex inside src/runtime/sema.go through go:linkname. The underlying semacquire1 deals with how pending goroutines are put to sleep and woken up.

The sema, short for semaphore, is a mechanism for implementing sleep and wakeup at the Go runtime level. It handles acquiring the mutex in a blocking manner at the goroutine level.

// src/sync/runtime.go
func runtime_SemacquireMutex(s *uint32, lifo bool, skipframes int)

// src/runtime/sema.go
//go:linkname sync_runtime_SemacquireMutex sync.runtime_SemacquireMutex
func sync_runtime_SemacquireMutex(addr *uint32, lifo bool, skipframes int) {
    semacquire1(addr, lifo,
        semaBlockProfile|semaMutexProfile, skipframes, waitReasonSyncMutexLock)
}

To clarify, sleeping and waking up at the language runtime level can be split into two aspects:

  • 1: how the scheduler manages a goroutine's sleep and wakeup
  • 2: the conditions under which a goroutine sleeps and is woken

It's similar to a discussion about time.Sleep: we care about what happens after the call (how the Go scheduler manages the sleeping goroutine) and when it is woken (the sleep duration has passed).

go func() {
    time.Sleep(time.Second)
}()

To understand sync_runtime_SemacquireMutex, we look at semacquire1 and focus on question 2. The first question requires a deep understanding of the Go runtime scheduler, which isn't necessary for understanding the mutex.

The runtime_Semacquire family of APIs performs a single task: it waits until *s > 0 and then atomically decrements it. It is intended as a simple sleep primitive for use by the synchronization library and should not be used directly. As the documentation suggests, it serves the same goal as Linux's futex.

That is, don't think of these as semaphores.

Think of them as a way to implement sleep and wakeup such that every sleep is paired with a single wakeup, even if, due to races, the wakeup happens before the sleep.

Hence, once a goroutine calls the API, it sleeps and blocks there until it is woken at some uncertain point in the future. From this point of view, how the goroutines are scheduled is a less important problem, and we needn't consider it much. If multiple goroutines call this API, they are released in sequence: one by one, the blocked goroutines unblock and continue running the code after the call.

Next we look at runtime_SemacquireMutex, slightly more complex than runtime_Semacquire, as it exposes more control over goroutine scheduling (effectively the priority of unblocking) on wakeup.

// SemacquireMutex is like Semacquire, but for profiling contended
// Mutexes.
// If lifo is true, queue waiter at the head of wait queue.
// skipframes is the number of frames to omit during tracing, counting from
// runtime_SemacquireMutex's caller.
func runtime_SemacquireMutex(s *uint32, lifo bool, skipframes int)

The lifo flag introduces a priority for the goroutine, and skipframes is an argument for tracing. If lifo is true, the waiter is queued at the head of the wait queue, so we can assume it will unblock earlier than goroutines added with a false value. If we only want to understand the API's behavior briefly, we can even ignore the underlying queue and just think of it as providing a way to wait with higher priority.

Conversely, Semrelease atomically increments *s and notifies a waiting goroutine if one is blocked in Semacquire. The implementation is out of scope, so we won't pay much attention to it here; how it works deserves a blog post of its own.
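As a rough mental model (a toy analogy, not the real runtime implementation), the sleep/wakeup pairing behaves like a buffered channel: every Acquire that blocks is matched by exactly one Release, even if the Release happens first.

package main

import "fmt"

type toySema chan struct{}

func (s toySema) Acquire() { <-s }             // like Semacquire: block until a token arrives
func (s toySema) Release() { s <- struct{}{} } // like Semrelease: wake exactly one waiter

func main() {
    s := make(toySema, 1) // the buffer lets a Release arrive before its Acquire
    done := make(chan struct{})

    go func() {
        s.Acquire() // blocks until main releases
        fmt.Println("woken up")
        close(done)
    }()

    s.Release() // pairs with the Acquire above, even if it runs first
    <-done
}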

Cache Coherency Mechanism

Atomic operations are guaranteed by assembly instructions, which are architecture-specific.

The compiler respects atomic operations as well: it won't move other instructions across an atomic instruction.

CAS works fine on a single-core machine, since only one instruction can interact with memory at any point in time.

However, with multiple CPU cores, each core may execute instructions that access the same memory at the same time, and an atomic instruction on one core isn't enough by itself. The CPU therefore needs a mechanism to coordinate these accesses, called the "cache coherency mechanism".

The cache coherency mechanism provides memory barriers such as the read barrier, the write barrier, and so forth. What we need to know is that CAS relies on the cache coherency mechanism to ensure its correctness.

However, this mechanism is a bit off-topic here, as it isn't limited to the mutex itself.