can readswrites between two threads be torn if the system cache is coherent?

Yes, you can have tearing.

A share-request for the line could come in between committing the two separate 32-bit stores. If they’re done by separate instructions, the writing thread could even have taken an interrupt between the first and 2nd store, defeating any store coalescing in a store buffer (into aligned 64-bit commits like some 32-bit RISC CPUs are documented to do) that might normally make it hard to observe tearing in practice between separate 32-bit stores.

Another way to get tearing is if the read side loses access to the cache line after reading the first half, before reading the 2nd half. (Because it received and RFO (read for ownership) from the writer core.) The first read could see the old value, the 2nd read could see the new value.

The only way for this to be safe is if both the store and the load are each done as a single atomic access to L1d cache of the respective core.

(And if the interconnect itself doesn’t introduce tearing; note the case of AMD K10 Opteron that tears on 8-byte boundaries between cores on separate sockets, but seems to have aligned-16-byte atomicity between cores in the same socket. x86 manuals only guarantee 8-byte atomicity, so the 16-byte atomicity is going beyond documented guarantees as a side effect of the implementation.)

Of course, some 32-bit ISAs have a load-pair or store-pair instruction, or (like x86) guaranteed atomicity for 64-bit aligned loads/stores done via the FPU / SIMD unit.

If tearing is normally possible, how would such a microarchitecture implement 64-bit atomic operations?

By delaying response to MESI requests to share or invalidate a line when it’s in the middle of doing a pair of loads or pair of stores done with a special instruction that gives atomicity when a normal load-pair or store-pair wouldn’t. The other core is stuck waiting for the response, so there has to be a tight limit on how long you can ever delay responding, otherwise starvation / low overall throughput progress is a problem.

A microarchitecture that normally does a 64-bit access to cache for load-pair / store-pair would get atomicity for free by splitting that one cache access into two register outputs.

But a low-end implementation might not have such wide cache-access hardware. Maybe only LL/SC special instructions have 2-register atomicity. (IIRC, some versions of ARM are like that.)

Further reading:

CLICK HERE to find out more related problems solutions.

Leave a Comment

Your email address will not be published.

Scroll to Top