More Data in the Memory
Heap allocation is threadsafe and reliable, but not always efficient. There is an 8 byte per allocation overhead which is expensive if there are large numbers of small objects. The OS however, usually has very good reasons for doing what it does in the way that it does it. An obvious allocation overhead mitigation method is to allocate large space blocks from the heap, then with the aid of free list pointers, allocate small slots from the space blocks. An obvious problem with this method, is that it is prone to hogging memory at or very near the high water mark. Unless freed slots happen to result in empty space blocks, memory is not returned to the OS. Empty space blocks arise in some circumstances, such as where temporary collections are created in order to sort lists, but they are otherwise unlikely. Space blocks with low occupancy can be emptied by moving slots to other space blocks, but as this would invalidate slot addresses, a virtual addressing regime is required - So back to square one as this is pretty much what the OS does! It makes no sense to replicate OS functionality within programs. This is not to say the method is a complete dead loss. It is particularly well suited to programs that for performance reasons, hold large volumes of data memory resident throughout runtime.
This is not to say the method is a complete dead loss. In programs that for performance reasons, hold large volumes of data memory resident throughout runtime, the method is almost ideal. In such programs which are very much the prevail of HadronZoo, free lists tend to run short and hogging is more by design. The maxim is the more data one can fit in the RAM, the better. Cutting allocation overheads in this way, confers the obvious advantage of more data in the memory.
Class Specific Allocation Regimes
Several allocation regimes based on the above method are present in the library, including an experimental global new/delete override. This of necessity, assigns 64-bit pointers to allocations that are 8-byte aligned and are multiples of 8 bytes in size. However, at least in the current library, this regime is the exception. All the other regimes are specific to particular classes, and assign allocations 32-bit unsigned addresses instead of 64-bit pointers. Cutting allocation overheads at source is important, but it is not the only game in town. The 32-bit addresses, which effectively halves the pointer size, can bring comparable benefits.
The savings are most apparent in the common case where programs host a large number of strings. The allocation regime used by the hzString class, allocates slots that are multiples of 4 bytes and 4-byte aligned to 4 byte boundaries. The finer granularity saves memory, but the size of the hzString class itself, saves more. hzString uses a smart pointer, so the slot which actually contains the string value, is not part of the class. The class itself consists only of the slot address, so hzString is 4 bytes instead of 8. Many classes have hzString members, so these are all smaller. hzString is also among the most used classes in collections constructed using collection class templates.
The thinking behind 32-bit addresses is simple: Although regimes generally don't exploit the full 32-bit unsigned address range, the range is still a large number, around 1 billion in the case of hzString. Programs are likely to need more than 4Gb RAM but they are less likely to need a billion strings. They are even less likely to need hundreds of millions of instances of other classes covered by such regimes. Use of class specific allocation regimes necessitate address to pointer translation to access slot content, but this is usually a matter of using the address as an offset in an array, so the performance hit is minor. There is a considerable impact on the coding of member functions of the host class, but not on the coding of programs that use the class.
It is important to understand the exact sense in which class specific allocation regimes are specific to the class. The regimes are not actually part of the class. Instead there is a single instance of a regime that is tailored to the needs of the class, which all instances of the class will use. If that regime happens to fit the needs of another class, there will be a separate instance of the regime which all instances of the other class will use. Note also that regimes NEVER allocate instances of the class. They only allocate space on behalf of class instaces. Allocation of a slot from a regime, is the functional equivilent of malloc(). It is NOT the functional equivilent of the new operator.
Allocation Regimes: Basic Methods
Allocation regimes allocate slots from space blocks that are of a single fixed size, or are of one of a set approved sizes. Oversized allocations are either disallowed or allocated from the heap. Regimes use two different types of space block, dedicated and mixed. Dedicated space blocks source slots of one size only. Mixed space blocks source slots of several different sizes. Depending on the range of sizes chosen in the design, regimes may use dedicated blocks for some sizes and mixed blocks for others. Regimes are classified according to the types of space blocks they use as follows:-
Single-size: | Allocated slots are all the same size. Only uses space blocks dedicated to the size. |
Multi-size: | A MultiSize is a fusion of several SingleSize regimes. |
Single-range: | Slots of a range of sizes are sourced only from mixed space blocks which cover the entire size range. |
Multi-range: | A fusion of two or more single-range regimes. For example, one set of mixed blocks sources slots of the 8 to 32 byte sub-range while another set of mixed blocks sources slots of the 40 to 128 byte sub-range. |
Hybrid: | A fusion between a single-range or multi-range regime and either an single-size or multi-size regime. |
Regardless of which space blocks a regine uses, all space block pointers are held in a single array, known as the regime array. The X most significant bits of a 32-bit slot address gives the offset to the space block in the regime array, while the Y least significant bits provide the offset to the slot in the space block. The exact value of X and Y which must always add to 32, depends on the design. Y is usually chosen to be 8, 9, 10, 11 or 12, corresponding to slot/block ratios of 256, 512, 1K, 2K or 4K. This leavs X in the range 20 to 24, corresponding to limits of 1M, 2M, 4M, 8M or 16M blocks.
Note that the regime array is usually arranged as a two-tier array, i.e. a root block of pointers point to second-tier blocks of pointers, which point to the object blocks. In some cases a three-tier array is used, with one more pointer to follow. This is doing exactly what hzArray and hzVect do, but faster as there are larger blocks so fewer tiers. The fewer the tiers, the less pointers to follow, the faster the address to pointer translation.
Block Anatomy
Dedicated blocks can always utilize the full addressing capabilty. If we have Y as 8, so a slot/block ratio of 256, all 256 slots can be allocated and every value of the 8-bit slot address range (0-255), can point to a slot. This will only be true of a mixed block if all allocated slots are the smallest size in the allowed range. Should any of the slots be of a larger size, the block will become full before all possible addresses are used.
While dedicated blocks consist entirely of an array of equal sized slots, mixed blocks are two arrays that start from opposite ends of the block space. One for slots, the other for slot meta data. By convention the slots descend while the slot meta data ascends. The block is full when the two meet in the middle. Slot addresses (Y bits), are interpreted as the offset into the meta data array. The meta data gives slot offset and size. Allocation regimes usually have multiple, size related free lists (see below), The slot size is used in a DELETE to place slots in the correct free list. Having the meta data as an array makes it easy to determine if the slot being freed, is adjacent to another already free slot. Where this is the case, the free slots are merged and the combined slot, free listed according to its size.
Mixed block meta data is an overhead. The size of the meta data entries depends on the applicable range of allocation sizes and the total data capacity of the block, expressed as a multiple of the regime granularity. Good designs will keep this within 2 bytes. For example, 16 possible sizes (4 bits), 4,096 possible slot positions (12 bits). With a granularity of 8, the block's data segment will be 32K.
Free List Incidence and Operation
Free lists of slots are implemented as a simple pointer to (or address of) the first slot in the list. Each slot points to or addresses the next slot. Allocation from the free list is a matter of:-
- Recording the current free list pointer/address - Visit the slot it addresses and read the next pointer/address - Setting the current free list pointer/address to the next pointer/address - Return the recorded pointer/address as that of the newly allocated slot.
And freeing to the free list is a matter of:-
- Writing the current free list pointer/address to the first 8 or 4 bytes of the slot to be freed, by means of a cast. - Setting the current free list pointer/address of that of the block being freed.
In regimes using only dedicated space blocks, there is an equal number of approved sizes and free lists. For any given approved size, allocation will be from the free list for that size. If the free list for that size is empty, a new space block dedicated to that size is created, divided into slots of that size, and these slots are added to the free list for that size! The allocation can then proceed. There is no crossover, even if one approved size is a multiple of another.
In regimes using mixed blocks, there are more free lists and extra steps to the process. For each approved size handled by mixed blocks, there is a primary and secondary free list. The primary is for slots that are exactly the size, the secondary for is for slots that exceed the size, which arise from slot mergers on delete. In addition, there is a supersized slot free list for each sub-range, which is fed by the new space blocks when they are created. Allocation of an approved size slot is first from the primary free list for the size but if this is empty, from the secondary for the size and if this is empty, from the supersized slot free list for the sub-range. Finally if this is empty, a new mixed space block of the sub-range is created as a single giant slot and placed in the supersized slot free list.
Where slots are drawn from the secondary or the supersized free list, they are split. A slot of the required size is allocated, and the remainer slot space placed in whichever free list is appropriate. There is always a net migration away from the supersized free list. Slots are allocated from it but are only freed to the primary and secondary free lists for their size.
Oversized and Placeholder Slots
Where regimes allow oversized allocations from the heap, these are also assigned 32-bit addresses. The oversized allocations have 64-bit pointers, and the 32-bit addresses are used to locate the pointers. There are two approaches. Either the pointers are stored in a separate array with the address serving as offset, or the pointers are given placeholder slots within the regime's system of space blocks. The array approach is prefered if it is a straight case of storing the pointer. The placeholder approach is prefered if slots begin with control data. With both methods, there must be a way of distinquishing between oversized and normal allocations. This is usually acheived by using the top bit of the address as the indicator, which cuts the range of the 32-bit addresses from 4 to 2 billion.
Multithreaded Operation
Courtesy of the OS, heap allocation is threadsafe and any contention is outside the control of the developer. By contrast, allocation regimes which are part of the program, require explicit steps to make them threadsafe. Given how central memory is to absolutely everything, very heavy use is anticipated. There has to be a strategy for reducing contention, and above all, for avoiding locks. The regime resources that require protection are the regime array and the free lists. Of relevence to this discussion, note the HadronZoo rule on the number of threads, which should not exceed the number of processors.
The regime array is a busy resource as it is used to translate slot addresses to slot pointers. Translation occurs during allocation and deletion, but also when allocated slots are used in the program. It is a very high frequency action, but fortunately not one that requires protection. Translation is a read operation and the only possible data change, is the addition of a new space block, which an append rather than an insert. Nothing is moved in the process and if nothing moves, pointers are not invalidated.
Space block addition is a medium frequency action, occurring at most every N allocations, where N is the slot/block ratio (usually at least 256). As many allocations are from space freed by deletes, the actual frequency is lower. Space block addition is a matter of allocating the space block from the heap, preparing it as a linked list of slots or as a single large slot, and then placing the pointer to it in the regime array. The position the new slot space pointer will have in the regime array, is given by the regime array block count. To ensure multiple threads cannot place new space block pointers in the same position at the same time, each thread calls __sync_add_and_fetch() on the block count. Note that there is no actual locking.
Assuming for the purpose of this discussion, that the regime array is two-tier, with a current pointer block, as is recommended. If a thread is assigned a position by the sync call on the block count, that is exactly divisible by the pointer/block ratio (usually 1K to 4K), then it falls to that thread to create a new current block. As it is possible that this thread is switched out before the new current block is created, the other threads must check for this and spin until it is done. This is the only scenario in which the regime array could be said to be locked and as should be clear, the creation of a new current block is a low frequency event.
Protecting Free Lists
Both allocate and free operations alter free lists and occur with high frequency. Contention arises wherever two or more threads alter the same free list. With dedicated blocks the threads will clash when allocating or deleting slots of the same size, which is highly likely. With mixed blocks the threads will clash when allocating or deleting slots within the same size range, which is even more likely.
Dealing first with free lists of slots in dedicated blocks, two techniques can be brought to bear. Firstly, providing free lists do not become empty, two threads can simultaneously safely operate on opposite ends. Secondly, free lists can be multiplied. For each natural free list (which would exist in an eqivelent single threaded regime), there can be one for each thread or pair of threads, all serving the same slot size. On its own, free list multiplication is prone to accumulation in some lists while others are starved into allocating new blocks from the heap. There has to be a means of sharing slots, i.e. some form of synchronization.
The HadronZoo regimes multiply each natural free list by half the number of processors. The free lists are not permitted to be empty so on initialization a space block is allocated and the slots distributed among them. Free lists are also required to avail a midpoint if they have an excess of slots (more than the slot/block ratio). It is the responsibility of threads to ensure the free lists meet these obligations.
Threads are assigned a default free list and an end, which they will always use except in one circumstance. If the list has insufficient slots to allocated from, AND another thread has an excess (has availed a midpoint), then the thread will TRY to claim the midpoint by means of the __sync_compare_and_swap() function, and allocate from it. If the midpoint is already claimed the thread will not wait for it to become available. Instead it will default to what it would do if there was no free list with an excess, which is allocate a new space block from the heap. Threads always use their assigned free list and end when deleting slots and will avail a midpoint if this action gives the free list an excess.
For the mixed blocks, each approved size has a primary and a secondary free list and each sub-range has an oversized free list. The primary and secondary free lists employ the same techniques. The odd one out is the sub-range oversized free list. As this is only ever allocated from, it is not prone to accumulation. This is replicated once per thread.
In summary, at the relatively minor cost of running higher free slot inventories than the equivelent single threaded regime would do, we have a multithreaded regime with virtually no locking or contention.