HadronZoo: Bespoke Software Developers
Discussion: Multithreading

General Considerations

In a multithreaded program, the separate threads all have the same unfettered access to the memory, file descriptors and other program resources. There are two quite distinct reasons for writing such a program: asynchronous task separation, and sheer speed. The first allows you to do such things as write reports and gather updates whilst continuing to serve pages at something approaching real time - with all tasks operating on the same data in the program space. And if all you want to do is exploit this resource sharing property without having your program lock up, writing multithreaded programs is not particularly difficult. It does call for due diligence. Every resource that is accessed by more than one thread requires a lock to protect it, and steps have to be taken to ensure that all possible routes to the resource first obtain the lock - and then release the lock when done, regardless of the outcome. As a task, it is similar to ironing out memory leaks, except that you must get them all. Leaks can be tolerated to a certain extent, but lockups are, by definition, a show stopper.
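A simple way to guarantee that a lock is released on every route out of a function is a RAII-style guard, where the destructor does the unlocking. The sketch below is illustrative rather than HadronZoo code; the lock class is hypothetical and wraps a pthread mutex to keep the example self-contained:

    #include <pthread.h>

    // Hypothetical lock class wrapping a pthread mutex, for illustration.
    class SomeLock
    {
        pthread_mutex_t m_mtx;
    public:
        SomeLock()     { pthread_mutex_init(&m_mtx, 0); }
        ~SomeLock()    { pthread_mutex_destroy(&m_mtx); }
        void Lock()    { pthread_mutex_lock(&m_mtx); }
        void Unlock()  { pthread_mutex_unlock(&m_mtx); }
    };

    // RAII guard: the destructor releases the lock on every exit path,
    // including early returns and exceptions.
    class LockGuard
    {
        SomeLock& m_lock;
        LockGuard(const LockGuard&);              // non-copyable
        LockGuard& operator=(const LockGuard&);
    public:
        LockGuard(SomeLock& lk) : m_lock(lk)  { m_lock.Lock(); }
        ~LockGuard()                          { m_lock.Unlock(); }
    };

    void UpdateShared(SomeLock& lk, int& counter)
    {
        LockGuard guard(lk);    // lock acquired here
        counter++;              // protected operation
    }                           // lock released here, whatever the outcome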

But if you want sheer speed and are looking to use multithreading to boost performance by bringing more processors to bear on the task in hand, things can get a lot tougher. There is an overhead on locking and unlocking and it goes without saying that a thread waiting for a lock is stuck doing nothing. Moreover, if your threads are all doing much the same thing, they are likely to be getting in each other's way. And if a lot of the time the threads are waiting for a lock to be freed, it is quite possible your program will run slower than it would if single threaded. Sometimes a lot slower.

To write multithreaded programs effectively, to get the extra processing power and make sure you get it reliably, you have to get into some basics of real-time programming. There are two basic methods of locking. You can have all waiting threads spinning round polling the lock until it is free, or you can arrange for the system to wake up a waiting thread when the lock becomes free. The first thing you need to do is completely forget the latter. Spinning round constantly reading a value might be a waste of time, but context switching is a hell of a lot worse. Think spinlocks and nothing but spinlocks.
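A spinlock in this sense can be built on a single word and an atomic compare-and-swap. A minimal sketch, using the gcc __sync builtins discussed later (illustrative, not the HadronZoo implementation):

    // Minimal spinlock: 0 = free, otherwise the holder's thread id.
    // A waiting thread polls the word rather than being put to sleep.
    static volatile unsigned int g_lock = 0;

    void spin_lock(unsigned int tid)
    {
        // Attempt to swap 0 -> tid; repeat until the swap succeeds.
        while (__sync_val_compare_and_swap(&g_lock, 0u, tid) != 0u)
            ;   // spin, re-reading until the lock looks free
    }

    void spin_unlock(void)
    {
        __sync_lock_release(&g_lock);   // writes 0 with release semantics
    }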

Locking carries a small processing overhead of at least 10 nanoseconds (on a typical 3GHz CPU). Small as that is, it can often dwarf the processing that actually takes place inside the lock. Whilst this might seem irksome, it isn't what matters. What does matter is the overall percentage of operations performed on locked resources (including acquisition and release of the lock) - because that is the strongest factor in both the overall performance cost and the proneness to gridlock. The only real golden rule then, is that locks should be held for the least possible time. Here are some basic guidelines for achieving this:-

1) Remember that it is only data objects that need protection from concurrent access - and not all data objects at that! You don't need to lock data that either cannot or won't ever be subjected to concurrent access. Any resource allocated from the stack as an automatic variable in a function, such as a collection class instance in a sort function for example, won't need a lock since no other thread can access it. However, as useful as that tip is, most resources you will need to share between threads won't meet that description. You need to go as far as you can to design the program so that objects that need sharing are kept to a minimum, and part of this approach will be thread specialization. But in the end, you will come to a hard core of data objects that do need locking, so what then?

2) Lock, read/write data, unlock, run! Don't linger inside a lock. The lock's sole purpose is to avoid corruption through chaotic writes - so limit what you do inside the lock strictly to reading or writing the data. Don't process the data inside the lock unless two conditions are met:-

a) The result of the processing would logically be written back under the same lock, and

b) the processing takes a time comparable to, or less than, that of releasing and relocking the lock.

3) Use read/write locks whenever possible, particularly if you have a resource (e.g. a collection class) that will be read from more often than it will be written to. Under a read/write lock regime, multiple threads can enter the lock at the same time as long as their only purpose is to read. As soon as any thread enters the lock to write, the lock becomes exclusive.

4) Partition big resources. Let's say you are trying to process a large number of transactions per second and they all require a lookup in one particular collection of objects, such as social security numbers or IP addresses. And when a new value is encountered, it must be added. Social security numbers partition naturally, as do IP addresses - and so do most data types one might use as a key. If you do something simple like putting numbers beginning/ending in 0-9 in 10 separate collections, you massively reduce waiting on locks (see the first sketch after this list).

5) Work on copies. If the processing will be complex (such as inserting an object into a collection), it might be worth paying a price in extra RAM usage to keep a copy of the collection and only lock the pointer to it. If the pointer says the live object is at A, write-lock the copy at B and change B. Then lock the pointer, repoint it at B and unlock the pointer. Then, with the write lock still held on B, write-lock A and copy the changed parts from B to A. Finally, release the A and B write locks (see the second sketch after this list).
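To illustrate guideline 4, below is a sketch of a collection split into ten independently locked partitions, keyed here on the last character of the key. The class and its layout are illustrative, not HadronZoo API:

    #include <map>
    #include <string>
    #include <pthread.h>

    // Ten independently locked partitions instead of one big locked map.
    // Threads whose keys land in different partitions never contend.
    class PartitionedSet
    {
        static const int NPART = 10;
        std::map<std::string,int>  m_part[NPART];
        pthread_mutex_t            m_lock[NPART];

        // Partition on the final character of the key.
        int slot(const std::string& key) const
            { return key.empty() ? 0 : (unsigned char) key[key.size()-1] % NPART; }

    public:
        PartitionedSet()
            { for (int i = 0; i < NPART; i++) pthread_mutex_init(&m_lock[i], 0); }

        void Insert(const std::string& key, int val)
        {
            int s = slot(key);
            pthread_mutex_lock(&m_lock[s]);
            m_part[s][key] = val;               // only 1 of 10 locks held
            pthread_mutex_unlock(&m_lock[s]);
        }

        bool Lookup(const std::string& key, int& val)
        {
            int s = slot(key);
            pthread_mutex_lock(&m_lock[s]);
            std::map<std::string,int>::const_iterator it = m_part[s].find(key);
            bool found = (it != m_part[s].end());
            if (found)
                val = it->second;
            pthread_mutex_unlock(&m_lock[s]);
            return found;
        }
    };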
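And to illustrate guideline 5, a sketch of the pointer-swap scheme. The names and types are hypothetical, and where the text above implies read/write locks on the two copies, plain mutexes are used here for brevity:

    #include <pthread.h>

    struct BigTable { /* ... some expensive collection ... */ };

    BigTable         g_copyA, g_copyB;
    BigTable*        g_live    = &g_copyA;                    // the copy readers use
    pthread_mutex_t  g_ptrLock = PTHREAD_MUTEX_INITIALIZER;   // guards g_live only
    pthread_mutex_t  g_lockA   = PTHREAD_MUTEX_INITIALIZER;   // write lock on A
    pthread_mutex_t  g_lockB   = PTHREAD_MUTEX_INITIALIZER;   // write lock on B

    void Update(void)
    {
        // 1) Do the slow work on the offline copy, under its own write
        //    lock. No reader is blocked during this.
        pthread_mutex_lock(&g_lockB);
        /* ... apply the change to g_copyB ... */

        // 2) Briefly lock the pointer and switch readers over to B.
        pthread_mutex_lock(&g_ptrLock);
        g_live = &g_copyB;
        pthread_mutex_unlock(&g_ptrLock);

        // 3) With B's write lock still held, bring A back up to date,
        //    then release both write locks. (The roles of A and B would
        //    alternate on the next update.)
        pthread_mutex_lock(&g_lockA);
        /* ... copy the changed parts from g_copyB to g_copyA ... */
        pthread_mutex_unlock(&g_lockA);
        pthread_mutex_unlock(&g_lockB);
    }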

The HadronZoo C++ class library provides three locking classes: hzLockS, hzLockRW and hzLockRWD. hzLockS, or 'simple lock', consists only of a 32-bit value which will store the thread id of the holding thread if locked, be zero if free, or be set to 0xffffffff if defunct (the resource protected by the lock is deprecated). hzLockS is for fine-grain locking where the lock is commonly part of the object it protects. With this practice, when the object is deleted the lock is deleted along with it. Any thread waiting for the lock to be released needs to know this and abandon the intended action. A thread holding the lock and seeking to delete such a resource must call Kill() instead of Unlock() to set the value to 0xffffffff - and then call delete.
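The value transitions described can be sketched as follows. This is illustrative only, not the actual hzLockS source, and it omits the scratch-pad refinement described next:

    // Sketch of a 4-byte simple lock: 0 = free, holder's thread id =
    // locked, 0xffffffff = defunct (protected object deleted).
    class SimpleLock
    {
        volatile unsigned int m_val;
    public:
        SimpleLock() : m_val(0) {}

        // Returns false if the lock was killed while we waited, telling
        // the caller to abandon the intended action.
        bool Lock(unsigned int tid)
        {
            for (;;)
            {
                unsigned int old = __sync_val_compare_and_swap(&m_val, 0u, tid);
                if (old == 0u)
                    return true;            // acquired
                if (old == 0xffffffff)
                    return false;           // defunct: abandon
                // otherwise another thread holds it: spin and retry
            }
        }

        void Unlock()  { __sync_lock_release(&m_val); }

        // Called instead of Unlock() just before deleting the object.
        void Kill()    { m_val = 0xffffffff; }
    };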

It is worth noting that this technique fails if the freed memory is overwritten before a waiting thread that has been context switched out sees the 0xffffffff. To rectify this, all threads in HadronZoo based applications have a 'scratch-pad' area and will place the address of any hzLockS sought on the scratch-pad at a particular location (a thread can only wait on one lock at a time, so this is OK). Another location is reserved for kill notices. The Kill() function checks the scratch-pads of all threads and, if any thread is found waiting on the exact same deprecated lock, places the lock address in that second designated location. The Lock() function checks this second location before returning and will abandon the lock if the lock addresses match. This is inefficient in highly concurrent systems, as Kill() must check the scratch-pads of all threads. However it only applies when resources protected by a hzLockS are deleted, which should not be often, and it allows for locking of a large number of small objects which, at only 4 bytes per object, saves a lot of memory. hzLockS and the infrastructure backing it up is a good example of why it is important to design multithreaded programs from the start as multithreaded, not simply to add locks to a program that was originally single threaded.

hzLockRW is a read-write lock comprising the 32-bit thread holder and a counter. The counter keeps track of how many threads have been granted read access, while the thread holder controls write access. The counter must be incremented and decremented by a __sync operation to prevent the possibility of missing either an increment or a decrement. Write access is first a matter of setting the thread holder to the thread id and then waiting for the read count to fall to zero. No thread can gain read or write access while the thread holder is non-zero. hzLockRWD is an extension of hzLockRW in that each instance has a name and there are extra members, such as spin counts, to record activity on the lock. A report on all hzLockRWD activity can be invoked at any time - without locking. And that is the complete story. So far we have not encountered any need to add more classes of locks.
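The scheme described can be sketched as below. Again this is illustrative, not the actual hzLockRW source:

    // Read/write lock sketch: m_holder controls write access, m_readers
    // counts threads currently granted read access.
    class RWLock
    {
        volatile unsigned int m_holder;    // 0, or the writer's thread id
        volatile unsigned int m_readers;   // number of active readers
    public:
        RWLock() : m_holder(0), m_readers(0) {}

        void LockRead()
        {
            for (;;)
            {
                while (m_holder)
                    ;                                   // writer active: spin
                __sync_add_and_fetch(&m_readers, 1);
                if (!m_holder)
                    return;                             // in as a reader
                __sync_sub_and_fetch(&m_readers, 1);    // writer arrived: back off
            }
        }

        void UnlockRead()  { __sync_sub_and_fetch(&m_readers, 1); }

        void LockWrite(unsigned int tid)
        {
            // Claim the holder word first, then wait for readers to drain.
            while (__sync_val_compare_and_swap(&m_holder, 0u, tid) != 0u)
                ;
            while (m_readers)
                ;
        }

        void UnlockWrite() { __sync_lock_release(&m_holder); }
    };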

Note on Atomic Operations

These are a widely touted feature of C++11 but are actually implemented in hardware and are available without C++11 (we currently use gcc 4.4.7). The function __sync_val_compare_and_swap(TYPE* ptr, TYPE oldval, TYPE newval), where TYPE can be a signed or unsigned 32 or 64 bit integer, is one of a set of atomic builtins and is the basis of the locking mechanism we use.
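The builtin stores newval in the word only if the word still holds oldval, and returns whatever the word held beforehand, so success is detected by comparing the return value with oldval:

    // The swap happened if and only if the return value equals oldval.
    // Here: take a one-word lock by swapping 0 for the thread id.
    bool try_acquire(volatile unsigned int* w, unsigned int tid)
    {
        return __sync_val_compare_and_swap(w, 0u, tid) == 0u;
    }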

Atomic locks are smaller than mutexes but in our benchmark tests they did not confer a speed advantage and, to our dismay, were found to grant locks less evenly than pthread_mutex_trylock. That might just be how it is in gcc 4.4.7, and one might hope for better from another compiler - except that it always pays to consider the fundamentals. The obvious question being: how is pthread_mutex_trylock implemented without some underlying atomic instruction? We plumped for atomic in the end because of the size advantage. Without the size advantage, the fine-grain locking we do in the hzStr class for example could not be done.

Lock-Free Synchronized Data Structures

To be clear, there is no such thing as a data structure that accepts writes from multiple threads without the application of locks. Yes, we know what it says on the tin, but that is for those who prefer to think at a high level. With lock-free entities the programmer might not have to worry about locks, just as most people buy cars without worrying about the engine. But just because you don't have to worry about the engine doesn't mean there isn't one.

What there are, though, are data structures designed specifically with multithreading in mind. Most data structures are workhorses that have been around for many years and are not optimized for multithreading, including some of our own. In general, one has to design a data structure along entirely different lines when considering simultaneous writes. All you can do with the traditional set of nodes that underpin your queues, stacks, sets, maps and vectors is to put a lock round the whole entity. A data structure optimized for multiple threads will only lock out individual nodes as required. This means that the locks apply for a shorter period. And given that it is relatively unlikely that two threads are seeking to write to the same exact node, contention is inherently lower. Some of these data structures, most notably the 'lock free' queue, can also, at the expense of greater memory consumption, actually reduce the amount of locking needed so that a significant number of push/pop operations can be performed without any locking (hence the name). This is achieved by arranging things so that each thread can write to a favored area.
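A simple example of node-level thinking is the classic two-lock queue (after Michael and Scott, 1996), sketched below. A dummy head node keeps the two ends apart, so a thread pushing never contends with a thread popping. The class is illustrative, not a HadronZoo type:

    #include <pthread.h>

    struct Node { int val; Node* next; };

    class TwoLockQueue
    {
        Node*            m_head;     // dummy node; m_head->next is the front
        Node*            m_tail;
        pthread_mutex_t  m_headLk;   // taken only by Pop
        pthread_mutex_t  m_tailLk;   // taken only by Push
    public:
        TwoLockQueue()
        {
            m_head = m_tail = new Node();
            m_head->next = 0;
            pthread_mutex_init(&m_headLk, 0);
            pthread_mutex_init(&m_tailLk, 0);
        }

        void Push(int v)
        {
            Node* n = new Node();            // allocate outside the lock
            n->val = v;
            n->next = 0;
            pthread_mutex_lock(&m_tailLk);
            m_tail->next = n;                // link in at the tail
            m_tail = n;
            pthread_mutex_unlock(&m_tailLk);
        }

        bool Pop(int& v)
        {
            pthread_mutex_lock(&m_headLk);
            Node* old = m_head;
            Node* front = old->next;
            if (!front)                      // queue empty
            {
                pthread_mutex_unlock(&m_headLk);
                return false;
            }
            v = front->val;
            m_head = front;                  // front becomes the new dummy
            pthread_mutex_unlock(&m_headLk);
            delete old;                      // exclusively ours by now
            return true;
        }
    };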

Low-Contention Synchronized Data Structures

One technique that is simple to implement applies where it is necessary only to acquire any one of a set of locks. This works well for queues and is of particular use when managing such things as a pool of memory blocks. In essence, instead of just one freelist, there are multiple freelists in a round robin. Each freelist might comprise the count of blocks in the list, pointers to the first and the last blocks, and two locks - A for releasing blocks (adding to the freelist) and B for allocating blocks (removing from the freelist).

To release one or more blocks, the thread enters the round robin and checks if lock A for the freelist is zero. If not, it moves on to the next freelist until it does find a freelist with lock A equal to zero. It then attempts to set lock A to the thread id with a __sync_val_compare_and_swap. It then tests lock A. If this is not equal to the thread id, the lock has not been acquired, and so on to the next freelist. But if lock A is equal to the thread id then the lock has been acquired - and the blocks are then added to the freelist.

To allocate one or more blocks, the process is the same except that the operation is performed on the opposite pointer, lock B is used instead of A, and before a lock is attempted the count is checked to see if there are available blocks. The overall effect of such an arrangement is a spinlock that, instead of spinning round the same lock, spins round the round robin of locks - but with a much greater likelihood of finding one unlocked and accordingly less time wasted.
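Below is a sketch of the scheme. The structure and names are illustrative, and one simplifying assumption is made: a freelist is only allocated from while it holds at least two blocks, which keeps the head (allocation) end and the tail (release) end from ever meeting. A dummy node, as in the two-lock queue above, would serve the same purpose:

    struct Block { Block* next; };

    struct Freelist
    {
        volatile unsigned int lockA;   // release side (add blocks)
        volatile unsigned int lockB;   // allocate side (remove blocks)
        volatile int          count;   // blocks currently in the list
        Block*                first;
        Block*                last;
    };

    static const int NLISTS = 8;
    static Freelist  g_ring[NLISTS];   // globals are zero-initialized

    void release_block(Block* blk, unsigned int tid)
    {
        // Keep moving round the ring until a release lock is won. A real
        // version might start at a per-thread index to spread the load.
        for (int i = 0; ; i = (i + 1) % NLISTS)
        {
            Freelist& f = g_ring[i];
            if (f.lockA != 0)
                continue;   // looks busy: try the next freelist
            if (__sync_val_compare_and_swap(&f.lockA, 0u, tid) != 0u)
                continue;   // another thread beat us to it
            blk->next = 0;                      // append at the tail
            if (f.last)
                f.last->next = blk;
            else
                f.first = blk;
            f.last = blk;
            __sync_add_and_fetch(&f.count, 1);  // count last: links complete
            __sync_lock_release(&f.lockA);
            return;
        }
    }

    Block* alloc_block(unsigned int tid)
    {
        // Bounded passes: if every freelist stays too empty, give up.
        for (int tries = 0, i = 0; tries < NLISTS * 4; tries++, i = (i + 1) % NLISTS)
        {
            Freelist& f = g_ring[i];
            if (f.count < 2 || f.lockB != 0)
                continue;   // too few blocks, or busy: move on
            if (__sync_val_compare_and_swap(&f.lockB, 0u, tid) != 0u)
                continue;
            if (f.count < 2)                    // re-check under lock B
            {
                __sync_lock_release(&f.lockB);
                continue;
            }
            Block* blk = f.first;               // remove from the head;
            f.first = blk->next;                // the tail end is never touched
            __sync_sub_and_fetch(&f.count, 1);
            __sync_lock_release(&f.lockB);
            return blk;
        }
        return 0;   // nothing available
    }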