TIL: Go’s CompareAndSwap is not always Compare-and-swap

Go's standard package sync/atomics provides programmers with functions to use the underlying CPU-level atomic operations such as compare-and-swap (CAS), through atomic.CompareAndSwapT (where T is an integer type).

Problem: Not all CPU architectures offers a CAS instruction to rely on to implement atomic.CompareAndSwapT. However, Go must compile that function code to something semantically equivalent—let's see what.

What is a compare-and-swap?

Conceptually, CAS is an operation defined as followed, except it's atomic.

func CAS(ptr *T, old, new T) bool {
    if (*ptr != old) {
        return false
    *ptr = new
    return true

x86_64 (AMD64)

Let's take a look at how atomic.CompareAndSwapInt64 works with int64. The function calls a lower-level function Cas64 which is defined in assembly language as follows. It's a mapping to x86_64's LOCK CMPXCHG ("Compare and Exchange") instruction that performs the CAS.

TEXT ·Cas64(SB), NOSPLIT, $0-25
  MOVQ  ptr+0(FP), BX
  MOVQ  old+8(FP), AX
  MOVQ  new+16(FP), CX
  SETEQ ret+24(FP)

The code is from Go 1.21.5 source.

Remarks: LOCK is not an instruction, it's a prefix to make CMPXCHG atomic in concurrent contexts. AX, BX, CX and FP are translated into real corresponding registers during compilation; The Go assembler does not use the actual architecture's regsiter names directly.

Not much more to say, the Go function translates to the equivalent CPU instruction, here.

Arm64 (AArch64)

The Arm64 case is interesting. The function Cas64 defined for Arm64 has two ways to perform a CAS operation.

The CASAL instructions is part of the Armv8.1-A Large System Extension (LSE) that provides CAS operations. If the CPU supports it, this path is taken. If the CPU does not support LSE, then the function fallbacks on another path, performing an "emulated" CAS with the help of LDAXR and STLXR.

TEXT ·Cas64(SB), NOSPLIT, $0-25
  MOVD  ptr+0(FP), R0
  MOVD  old+8(FP), R1
  MOVD  new+16(FP), R2

// Check support for LSE atomics
  MOVBU internal∕cpu·ARM64+const_offsetARM64HasATOMICS(SB), R4
  CBZ   R4, load_store_loop

// LSE supported, use CASAL
  MOVD  R1, R3
  CASALD  R3, (R0), R2
  CMP   R1, R3
  CSET  EQ, R0
  MOVB  R0, ret+24(FP)

// LSE not supported, use LDAXR/STLXR
  LDAXR (R0), R3
  CMP R1, R3
  BNE ok
  STLXR R2, (R0), R3
  CBNZ  R3, load_store_loop
  CSET  EQ, R0
  MOVB  R0, ret+24(FP)

(I added the comments.)

I was curious about the Arm64 case, because typical RISC architectures are not usually keen to provide instructions that operate directly on memory like x86_64. With RISC, one usually needs to load a value from memory, operates on it, and then stores the value into memory in distinct steps, unlike CAS does.

In theory LSE instructions such as CASAL should boost code performance. It appeared to be true on micro-benchmarks simulating cases of high contention, but according to some benchmarks conducted by the author of the MySQL on ARM blog, the results are more mitigated on real workloads.

RISC-V 64 (RV64A)

RV64A's LRD and SCD instructions are equivalent to Arm64's LDAXR and STLXR seen previously. The hand-coded logic is similar, too.
TEXT ·Cas64(SB), NOSPLIT, $0-25
  MOV ptr+0(FP), A0
  MOV old+8(FP), A1
  MOV new+16(FP), A2
  LRD  (A0), A3
  BNE  A3, A1, cas_fail
  SCD  A2, (A0), A4
  BNE  A4, ZERO, cas_again
  MOV  $1, A0
  MOVB A0, ret+24(FP)
  MOVB  ZERO, ret+24(FP)

Code from Go source.

It seems that RISC-V is purer than Arm64; it does not provide a ready-to-use CAS instruction. CAS must be done by hand with the LL/SC pattern. I found an interesting rationale in the RISC-V instruction manual:

Both compare-and-swap (CAS) and LR/SC can be used to build lock-free data structures. After extensive discussion, we opted for LR/SC for several reasons: 1) CAS suffers from the ABA problem, which LR/SC avoids because it monitors all accesses to the address rather than only checking for changes in the data value; 2) CAS would also require a new integer instruction format to support three source operands (address, compare value, swap value) as well as a different memory system message format, which would complicate microarchitecture

Additional notes

I looked at Go's implementation because that's what I use. Unsurprisingly, other implementations take the same approach: use LL/SC when CAS does not exist in silico, and for Arm64 use one or another pattern depending on the availability of LSE. The previously linked article says as follows:

GCC auto emits a code with dynamic check with both variants (lse and non-lse). Runtime a decision is taken and accordingly said variant is executed.