Idsets are compact sets of ascending unsigned integers - for any purpose but particularly well suited for storing 32-bit datum and object ids. Compaction is acheived by storing the integers as a series of increments, which can be expressed in forms smaller than the integer size. Reverse translation to absolute values is by a process of iteration and addition. The hdbIdset class is based on this method.
Within hdbIdset, the idset is held as a series of EIUs (encoded incremental units). EIUs are either single-increment and represent a single id, or are multi-increment and represent a group of ids. Idsets are divided into nominal segments which have a range of 512 value. The type of EIU deployed depends on id density within the applicable segment. EIUs vary in size but are always a whole number of bytes.
The first byte of an EIU contains one or more control bits to state the EIU type. If the top bit is 0, the EIU is a single byte single-increment. The remaining 7 data bits give the increment but as there cannot be an increment of zero, 1 is added to give an increment range of 1 to 128.
If the top bit is 1 the next two bits are treated as controls so the available binary codes are 100, 101, 110 and 111. The first three of these are 2, 3 and 4 byte single-increment EIUs which respectively have 13, 21 and 29 data bits representing increments ranging from 129 to 8,320, 8,321 to 2,105,472 and 2,105,473 to 538,976,384. Code 111 turns the whole of the first byte into control bits allowing up to 32 further EIU types. Of these, code 11100001 uses the next 4 bytes to give an absolute value.
The complete set of encodings currently in use are as follows:-
0xxxxxxx | 1-byte single-increment. Range 1 to 128. |
100xxxxx | 2-byte single-increment. Range 129 to 8,320. |
101xxxxx | 3-byte single-increment. Range 8,321 to 2,105,472. |
110xxxxx | 4-byte single-increment. Range 2,105,473 to 538,976,384. |
11100001 | Single-increment, absoluete value. |
11100010 | Invert Segment. From now to end of segmenmt, numbers are the ids the segment does not have. |
11100011 | Invert Ongoing. From now until further notice, numbers are the ids the segment does not have. |
11100100 | Invert Stop. Stop inverting (can only negate an invert ongoing directive). |
11100101 | Bitmap Variable Size: The next N bytes are a straight bitmap. |
11100110 | Bitmap Full: Full 512 bit segment. This must specify the segment number (3 bytes). Total size of EIU is 68 bytes as there is the control byte and 64 data bytes. |
11100111 | Bit-tree: These have a root index byte, each bit of which indicates an index byte in the second level. Upto 8 index bytes in the second level indicate up to 64 data bytes in the final level. Each bit in each data byte indicates a number allowing the bitmap to store up to 512 numbers. These are only used for full segments and so like Bitmap-Full, must specify the segment number. |
With the single-increment EIUs, there is negative compression with the absoulte, no compression with the 4-byte and neglibile compression with 3-byte EIU. These however will be few in number. The 2-byte single-increment EIUs have a compression ratio of only 50% but as the minimum value of these is 129, the highest possible incidence of these is 1 in every 129 ids. The 1-byte single-increment EIU achieves 75%. For dense regions of idsets where average increments are very low, bitmaps offer better compression. To this end code 111001xx is used to deploy multi-increment EIUs based on bitmaps.
Idset Segmentation.
Idset segmentation is both by id range and by volume. Ranges become important in INSERT and DELETE operations, as idset population density rises. In a sparsely populated idset, the ids are generally represented by single values EIUs. If so, the INSERT or DELETE of an id will usually result in the addition or removal of a single-value EIU - and a recalculation of the following EIU increment. As more and more ids are added and ids become closer in value, multi-value EIUs become more favourable. Without an arbitrary constraint, determining this is an onerous calculation, hence the use of id ranges.
Volume segmentation serves two purposes. Firstly it constrains the inevitable byte shifting resulting from the volumetric expansion and contraction of the set of EIDs. Secondly, by limiting volume segments to the 4K blok size of the file system, is is easier for data classes that use idsets, to render the idset data persistent. Idsets are thus a series of one or more nodes, each containing an integer number of range segments, that will fit within 4K.
Idset Processes.
The main hdbIdset operations are FETCH, INSERT and DELETE, and boolean AND and OR. FETCH identifies the node, then iterates it. INSERT and DELETE do likewise but will usually build a new node. As FETCH, INSERT and DELETE only involve a single node, these operations are fast. <5 microseconds in the case of FETCH, <10 microseconds for INSERT and DELETE. Boolean operations can take a lot longer, depending on the idset populations. In practice, searches are useless if the result is too large.
Idsets have two very different main roles: In index updates and in searches. In updates, a single object id is added to or removed from the idset. In a search, idsets are ANDed Various HDB index classes store idsets and modify them as the repositories they index are updated (please see article 5.6.3 "Index Implementation"). In searches, idsets are fetched from indexes into hdbIdset instances which perform boolean operations to produce the search result idset. For larger idsets, efficient boolean operations depend on random access to segments.
Idset Forms.
Idset populations can vary from 1 to many millions. To cope with this, hdbIdset has a header which can morph into different forms depending on the number of EUIs. In outline, these are as follows:-
1: | A simple idset is one that fits within 8 bytes. Such idsets are rare but they do exist. |
2: | An idset with only a single node (total size <4K). |
3: | Idset exceds 4K and must have a hzVect of nodes. |