hzString Use
String classes are a very convenient means of handling string values, with std::string being among the most established and the most versatile. There would need to be a good reason for developing an alternative. For HadronZoo that good reason was the anticipation of millions of mostly small strings in memory resident collections. hzString per-string overheads are lower, so the memory consumed by any given collection is lower and the data capacity of any given installation of RAM is higher.
hzString is less sophisticated than std::string. While native operations suffice for appending filenames to paths and suchlike, it is more efficient to build complex string values in a hzChain and cast to a hzString once complete. hzChain is an unlimited character buffer with a Printf function in addition to the usual append operations. Unlike hzString, which holds its value in contiguous memory, hzChain uses an array of fixed-size blocks and so avoids expensive reallocation and copying during string value assembly.
hzString Anatomy
hzString applies the standard smart pointer technique and uses a dedicated memory allocation and management regime, which issues 32-bit 'string space' addresses. hzString instances are thus 4 bytes in size, half the 8 bytes of a std::string implemented as a single 64-bit pointer (implementations with small string optimisation are larger still). String spaces are encoded as a concatenation of (a) the copy count, (b) the length indicator and (c) the string value as a Cstr. As the copy count and length indicator are part of the string space, they are a per-string overhead. The length indicator is 1 byte if the string value is 128 bytes or less.
The copy count is always a single byte, which is unusual. There is no limit on the number of times a hzString may be soft copied, but there is a limit to the extent that copying is recorded. Once the copy count hits 100 it stays at 100. Operations which normally alter the copy count cease to do so and the string effectively becomes fixed. This can result in a form of memory leak in which strings no longer in use continue to hog memory, but the risks are low. When strings are passed during normal processing, unless the process is flawed, string copy counts remain low, rarely reaching 10 let alone 100. In data collections of course, the copy count can easily go to 100. However once a string value appears in the data more than a few times, unless the data is extremely dynamic, it is unlikely to ever fall out of use. The extent to which redundant string values persist is presumed to be minor and so will waste less space than a larger copy count would consume.
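The saturating behaviour of the copy count can be illustrated with a minimal single-threaded sketch. The names kCopyCountCeiling, RaiseCopyCount and LowerCopyCount are hypothetical and are not taken from the HadronZoo source; only the ceiling of 100 and the single-byte width come from the description above.

```cpp
#include <cstdint>

// Hypothetical constant: the text says the count stays at 100 once reached.
constexpr std::uint8_t kCopyCountCeiling = 100;

// Raise the one-byte copy count unless it is already pinned at the ceiling.
// Returns the count after the operation.
inline std::uint8_t RaiseCopyCount(std::uint8_t& count)
{
    if (count < kCopyCountCeiling)
        ++count;
    return count;
}

// Lower the copy count unless it is pinned at the ceiling.
// Returns true when the count has reached zero and the string space may be freed.
inline bool LowerCopyCount(std::uint8_t& count)
{
    if (count >= kCopyCountCeiling)
        return false;               // pinned: the string space is never freed
    if (count > 0)
        --count;
    return count == 0;
}
```

Once a count reaches the ceiling, neither function changes it again, which is the 'fixed' state described above and the source of the possible leak.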
String Space Allocation Regime
The allocation regime is a multi-size and multi-range hybrid. The granularity is 4 bytes, although the minimum size of a string space is 8. In the current version, string spaces of 8, 12, 16 and 20 bytes are held in dedicated space blocks, while mixed space blocks cover sub-ranges of 24 to 60, 64 to 124 and 128 to 1K bytes. String spaces over 1K are allocated from the heap. The placeholders issued in respect of these larger string spaces are encoded as the copy count, the length indicator (maximum 3 bytes), followed by 8 bytes being the pointer to the actual string space in big-endian format. The placeholders are thus 12 bytes and are placed in the dedicated blocks for that size.
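As a rough illustration of how a requested size maps onto these ranges, the sketch below rounds a size up to the granularity and picks a range. The enum SpaceClass and the functions RoundUpSpace and ClassifySpace are hypothetical names, and the sketch assumes 1K means 1,024 bytes; only the boundaries themselves come from the text.

```cpp
#include <cstddef>

// Hypothetical labels for the ranges described in the text.
enum class SpaceClass { Dedicated, MixedSmall, MixedMedium, MixedLarge, Heap };

// Round a requested string space size up to a multiple of 4, with a floor of 8.
inline std::size_t RoundUpSpace(std::size_t needed)
{
    std::size_t rounded = (needed + 3) & ~static_cast<std::size_t>(3);
    return rounded < 8 ? 8 : rounded;
}

// Pick the range that would hold a string space of the requested size.
inline SpaceClass ClassifySpace(std::size_t needed)
{
    const std::size_t size = RoundUpSpace(needed);
    if (size <= 20)   return SpaceClass::Dedicated;    // 8, 12, 16 and 20 byte blocks
    if (size <= 60)   return SpaceClass::MixedSmall;   // 24 to 60
    if (size <= 124)  return SpaceClass::MixedMedium;  // 64 to 124
    if (size <= 1024) return SpaceClass::MixedLarge;   // 128 to 1K (assumed 1,024)
    return SpaceClass::Heap;                           // heap, via a 12 byte placeholder
}
```

Here the size passed in is the whole string space (copy count, length indicator, value and null terminator), not just the string value.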
The per-allocation overheads are calculated as follows: The length indicator is 1 byte for strings up to 128 bytes, 2 bytes for strings of 129 to 8,320 bytes and 3 bytes for strings larger than this. The copy count is always 1 byte, as is the null terminator. A granularity of 4 wastes an average of 2 bytes. For the smaller strings the average overhead is thus 5 bytes. This rises to 7 for strings using mixed blocks and to 8 bytes for strings in excess of 128 bytes. These overheads don't sound that good but they are a significant improvement on std::string. It is only the oversized strings that are 12 bytes worse than the heap, and these are generally rare.
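The arithmetic for an individual string can be sketched as follows. The function names are hypothetical, and the sketch only models strings held in space blocks: it applies the fixed costs and the rounding stated above and ignores the placeholder arrangement used for oversized strings.

```cpp
#include <cstddef>

// Size of the length indicator for a string value of the given length,
// using the thresholds quoted in the text.
inline std::size_t LengthIndicatorBytes(std::size_t valueLen)
{
    if (valueLen <= 128)  return 1;
    if (valueLen <= 8320) return 2;
    return 3;
}

// Bytes consumed beyond the string value itself: copy count, length indicator,
// null terminator and the padding needed to reach a multiple of 4 (minimum 8).
inline std::size_t SpaceOverhead(std::size_t valueLen)
{
    const std::size_t raw = 1 + LengthIndicatorBytes(valueLen) + valueLen + 1;
    std::size_t rounded = (raw + 3) & ~static_cast<std::size_t>(3);
    if (rounded < 8)
        rounded = 8;
    return rounded - valueLen;
}
```

For example, the 11 byte value "Hello World" needs 14 bytes before rounding, occupies a 16 byte string space and so carries an overhead of 5 bytes.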
It is difficult to determine exactly the total volumetric capacity or the maximum string population of the combined allocation regimes, as both parameters depend on string content. If all the strings were small, 2 billion could be accommodated. This would be ~500 million in the worst case scenario with all strings oversized. There is a general consensus that overall capacity is comfortably of the order of 1 billion strings, subject to available memory.
hzString Operation
The hzString class supports many operations that can or will alter the string value. Values can be truncated, added to, stripped of leading or trailing whitespace, converted to all upper or lower case, encoded or decoded, and so on. In theory, if the new string value would fit in the old string space, the string space would not need to be reallocated, provided the copy count was 1. This, however, is not the policy. Instead all string value altering operations reallocate the string space, even if the program is running single threaded. The steps are broadly as follows:-
1) Raise the copy count of the original string space by 1
2) Test the copy count is now > 1. IF not THEN
2.1) Return with status of FAILURE.
3) Test if the operation would actually alter the string value (or assume it will). IF a new value will result THEN
3.1) Allocate new string space with copy count of 1
3.2) Populate new string space applying a process to the original string value
3.3) The string space address in this hzString instance is set to that of the new string space.
4) Decrement the copy count of the original string space by 1. IF the copy count of the original string space is now zero THEN
4.1) Original string space is freed.
5) The function returns with status of SUCCESS.
Note that when multithreading, __sync_add_and_fetch() is called to alter the copy count. Nothing is actually locked. As the original string space content is not altered at any point during this process, there is no need to prevent other threads from accessing it.
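The steps above can be sketched in portable C++ as follows. This is an illustration under stated assumptions, not the library's code: Space and Handle are hypothetical stand-ins for a string space and a hzString, std::atomic replaces the GCC builtin, and ordinary heap allocation replaces the string space allocator. The sketch also releases the handle's own reference to the old space when it switches to the new one, which the step list leaves implicit.

```cpp
#include <atomic>
#include <cctype>
#include <string>
#include <utility>

struct Space {                          // hypothetical stand-in for a string space
    std::atomic<int> copies{1};
    const std::string value;            // never modified once constructed
    explicit Space(std::string v) : value(std::move(v)) {}
};

class Handle {                          // hypothetical stand-in for hzString
    Space* m_space = nullptr;
    // Drop one reference; free the space when the count reaches zero.
    static void Release(Space* s) { if (s && s->copies.fetch_sub(1) == 1) delete s; }
public:
    explicit Handle(const std::string& v) : m_space(new Space(v)) {}
    Handle(const Handle& o) : m_space(o.m_space) { m_space->copies.fetch_add(1); }  // soft copy
    Handle& operator=(const Handle&) = delete;
    ~Handle() { Release(m_space); }

    const std::string& Value() const { return m_space->value; }
    int Copies() const { return m_space->copies.load(); }

    bool ToUpper()
    {
        Space* orig = m_space;
        // Steps 1 and 2: pin the original space, fail if it was not live.
        if (orig->copies.fetch_add(1) + 1 <= 1)
            return false;
        // Step 3: only reallocate if the value would actually change.
        bool changes = false;
        for (unsigned char c : orig->value)
            if (std::islower(c)) { changes = true; break; }
        if (changes) {
            std::string upper = orig->value;
            for (char& c : upper)
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            m_space = new Space(std::move(upper));   // steps 3.1 to 3.3
            Release(orig);                           // assumption: drop this handle's own reference
        }
        Release(orig);                               // step 4: drop the pin, free if zero
        return true;                                 // step 5
    }
};
```

Because the original space is only read, another handle sharing it keeps seeing the old value throughout, which is why no lock is taken.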
String Repositories
The hzString smart pointer prevents hard copying where one instance is being set equal to another, but there is nothing in the hzString class or its allocation regime that prevents multiple string spaces from having the exact same value. If N separate hzString instances are declared and separately set to "Hello World", the result will be N separate string spaces all holding a string value of "Hello World".
This is not an issue when processing data as the strings are short lived and limited in number. In RAM Primacy repositories such repetition would be a prolific waste. The answer is to maintain a memory resident collection of unique string values, and make sure all strings in RAM Primacy repositories are copies of those in the collection. These collections are implemented as a hzSet collection class template, and are generally known as 'string repositories'. The HadronZoo library provides global string repositories as follows:-
hzSet<hzString> | _hzGlobal_setStrings
hzSet<hzDOmain> | _hzGlobal_setDomains
hzSet<hzEmaddr> | _hzGlobal_setEmaddrs
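The idea behind a string repository can be shown with standard containers. In this sketch std::set<std::string> stands in for hzSet<hzString>, and the class StringRepository with its Intern function is a hypothetical name, not part of the HadronZoo API. Where the real repository would hand back a soft copy of the stored hzString, the sketch hands back a reference to the single stored value.

```cpp
#include <cstddef>
#include <set>
#include <string>

class StringRepository {                // hypothetical; stands in for a hzSet<hzString>
    std::set<std::string> m_unique;     // one entry per distinct string value
public:
    // Return the single stored instance of the value, adding it on first sight.
    // References to std::set elements stay valid when further values are inserted.
    const std::string& Intern(const std::string& value)
    {
        return *m_unique.insert(value).first;
    }

    std::size_t Count() const { return m_unique.size(); }
};
```

Interning "Hello World" N times yields N references to one stored value rather than N separate copies.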
The hzChain Class
The hzChain class is an 'unlimited' character buffer which is commonly used for the assembly of lengthy string values, such as the HTML in a HTTP response. To this end, hzChain has a Printf function and can be appended with Cstr and hzString values and other hzChain instances. Unlike hzString, the content is held in an array of fixed-size blocks, rather than a contiguous block of memory. This structure avoids expensive reallocation and copying during string value assembly, but forgoes direct casting to Cstr. Instead hzChain instances are expected to be iterated from beginning to end. An iterator is provided as a subclass for this exact purpose.
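A minimal sketch of the block structure follows. BlockChain is a hypothetical class, not the hzChain implementation; it uses the 1,460 byte block size quoted for chain blocks and shows only appending and a final conversion to a contiguous string. Appending never moves characters already stored: when the block list grows, only block pointers are copied.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

constexpr std::size_t kChainBlockSize = 1460;   // block size quoted in the text

class BlockChain {                              // hypothetical stand-in for hzChain
    struct Block {
        std::array<char, kChainBlockSize> data;
        std::size_t used = 0;
    };
    std::vector<std::unique_ptr<Block>> m_blocks;
    std::size_t m_size = 0;
public:
    // Copy len bytes into the tail block, starting new blocks as each one fills.
    void Append(const char* src, std::size_t len)
    {
        while (len > 0) {
            if (m_blocks.empty() || m_blocks.back()->used == kChainBlockSize)
                m_blocks.push_back(std::unique_ptr<Block>(new Block));
            Block& b = *m_blocks.back();
            const std::size_t n = std::min(len, kChainBlockSize - b.used);
            std::memcpy(b.data.data() + b.used, src, n);
            b.used += n;
            src += n;
            len -= n;
            m_size += n;
        }
    }
    void Append(const std::string& s) { Append(s.data(), s.size()); }

    std::size_t Size() const { return m_size; }
    std::size_t BlockCount() const { return m_blocks.size(); }

    // Walk the blocks in order and produce one contiguous value, once, at the end.
    std::string ToString() const
    {
        std::string out;
        out.reserve(m_size);
        for (const auto& b : m_blocks)
            out.append(b->data.data(), b->used);
        return out;
    }
};
```

The single copy in ToString corresponds to casting a completed chain to a string; everything before that point is append-only.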
hzStream is an adapted form of hzChain intended for processing data streams. hzStream is effectively a queue: all operations that add content append to the queue. There is no general insert or delete operation, only a specific delete operation which discards the first block in the queue. Incoming requests from clients and the responses to those requests are both processed as data streams, i.e. they are instances of hzStream. Once the iterator processing incoming data passes the first block and goes on to the next, the first block is discarded. Likewise once blocks in the outgoing data have been written to the socket, they are discarded. Chain blocks are 1,460 bytes, the typical payload of a TCP segment on Ethernet (a 1,500 byte MTU less 40 bytes of IP and TCP headers).
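The queue behaviour can be sketched with a deque of blocks. BlockQueue and its Read function are hypothetical names chosen for illustration and do not reflect the hzStream interface; the sketch only shows content being appended at the tail and the first block being discarded once a reader has passed it.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>

constexpr std::size_t kStreamBlockSize = 1460;  // block size quoted in the text

class BlockQueue {                              // hypothetical stand-in for hzStream
    std::deque<std::string> m_blocks;           // each block holds at most kStreamBlockSize bytes
    std::size_t m_readPos = 0;                  // read offset within the first block
public:
    // All additions go to the tail; a new block is started when the tail is full.
    void Append(const std::string& data)
    {
        std::size_t off = 0;
        while (off < data.size()) {
            if (m_blocks.empty() || m_blocks.back().size() == kStreamBlockSize)
                m_blocks.emplace_back();
            std::string& tail = m_blocks.back();
            const std::size_t n = std::min(data.size() - off, kStreamBlockSize - tail.size());
            tail.append(data, off, n);
            off += n;
        }
    }

    // Read up to max bytes from the head. A full block is discarded as soon as
    // the reader has passed its end; a partly filled tail block is kept.
    std::string Read(std::size_t max)
    {
        std::string out;
        while (out.size() < max && !m_blocks.empty()) {
            const std::string& head = m_blocks.front();
            const std::size_t n = std::min(max - out.size(), head.size() - m_readPos);
            out.append(head, m_readPos, n);
            m_readPos += n;
            if (m_readPos == kStreamBlockSize) {
                m_blocks.pop_front();           // the only form of deletion
                m_readPos = 0;
            } else if (n == 0) {
                break;                          // nothing more to read yet
            }
        }
        return out;
    }

    std::size_t BlockCount() const { return m_blocks.size(); }
};
```

Memory held by the queue is therefore bounded by the amount of unread data plus at most one block, regardless of how much has passed through.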