Tuesday, February 7, 2012

Transactional Hardware on x86

While I'm posting anyway, I might as well mention the big concurrency news of the day: the new transactional memory specification from Intel. Transactional memory sounds a lot like what it is. Your code begins a transaction and continues to execute it until it ends, performing some memory writes along the way. If there are no conflicting transactions (i.e., transactions that write to the same memory that your code wrote) executing at the same time as yours, your transaction will end normally and commit your memory writes. If there are conflicting transactions, your transaction will abort and roll back your writes.

Transactional memory on x86 will come in two flavors:
  1. Hardware Lock Elision (HLE), which consists of XACQUIRE and XRELEASE instruction prefixes. These can optimistically turn lock regions into transactions. What's the advantage? Well, transactions can execute concurrently, as long as they aren't writing to the same memory. Locks are serialized: only one thread can be in a lock region at a time. So this mechanism can (hypothetically) execute more code concurrently. Also - and perhaps more importantly - it should be much cheaper to acquire a lock using XACQUIRE than it is to acquire it without, since with XACQUIRE you don't have to do anything other than tell the system to start looking out for conflicts. Between the two advantages, HLE should be able to provide a nice performance boost. These are backwards compatible, so that software written to use them can be run on earlier hardware.
  2. Restricted Transactional Memory (RTM), which consists of XBEGIN, XEND and XABORT instructions. These have the traditional transactional memory semantics: you begin a transaction, and if you abort before you end it, your writes are undone. These are very flexible, but are not backwards compatible (i.e., you can't use them on older hardware).
I like the idea of hardware-level transactions / atomicity. Compilers can be written to take advantage of them, and they can be a win for people implementing synchronization primitives and doing very low-level multithreading. They have the potential of making synchronization very cheap when there isn't much contention. AFAICT, Azul has claiming very large benefits from their HLE-style synchronization for years (edited: They claim <10%: see this presentation on hardware transactional memory from Cliff Click).

There is also likely to be a renaissance in lock-free data structures. Currently, lock-free data structures are pretty much all built on top of one non-blocking atomic instruction: the compare-and-swap (LOCK: CMPXCHG on x86, or AtomicReference.compareAndSet() to Java programmers). Trying to shoehorn everything you might want to do atomically into a single instruction that operates on one word has led to some very interesting data structures over the last 20 years or so, but is really the wrong way to solve the problem.

The flip side of this is that I don't think these will have much of an impact on most user-level synchronization. The HLE is really just an optimization for current lock implementations. The RTM aborts if you execute any of a large number of instructions (like PAUSE or CPUID, or many operations that affect control flow), or if there is a conflicting write to the cache line (which may not have any logical connection to the data the programmer cares about): the average programmer who deals with multithreaded code can't reasonably be expected to think about what is going on at the microarchitectural level.

Also, as with most transactional memory systems, RTM only deals with memory writes to a limited number of cache lines. If you have to deal with, for example, I/O, or if you touch too many cache lines, you have to write an abort handler that deals with rollback correctly. My feeling is that this will probably be far too difficult for most programmers.

Transactions' usefulness also depends enormously on the quality of their implementation. I'd love to see some hardware, but it isn't expected to be released until 2013.


11 comments:

Cris said...

Really interesting news and commentary, thanks for the nice clear post.

Jeremy Manson said...

Thanks for the kind words.

Roland said...

I don't see azul claiming big gains here: http://sss.cs.purdue.edu/projects/tm/tmw2010/talks/Click-2010_TMW.pdf

Jeremy Manson said...

I guess I was remembering the 2x, rather than the <10%.

Golovach Ivan said...

Hi, Jeremy.
I have another question: i'm very interested in deep understanding of New JMM but i found that i haven't access to page with Simulator (http://www.cs.umd.edu/users/jmanson/java.html). Can i have access to it or to another source of Simulator?

Anonymous said...

What control flow does abort?
I don't read that in the spec.

Jeremy Manson said...

@Ivan: It's not that new a JMM anymore! I don't recommend the simulator; there are known bugs I never got around to fixing. I still have a copy of it around somewhere, but I'm not sure you won't do better with some of the other people who have built formal models of it, and plugged them into their model checking frameworks.

@Anonymous: Far CALL, Far JMP, Far RET and IRET may (or may not) cause aborts. Also, kernel calls, if you count that. It's possible that calling those control flow is misleading.

crescent said...

Nice Sharing with this blog.

tgamblin said...

Seems like it's worth mentioning here that transactional memory and speculative execution are already shipping from IBM, albeit in a machine that most people won't be able to afford:

http://researcher.watson.ibm.com/researcher/files/us-pengwu/BGQPerfPaper-final-PACT12.pdf

BlueCube said...

Thanq for sharing a complete Information

srinu cNu said...

Nice Post About the JAVA Concurrency