Note that the following discussion concerns memory resident ISAM and Indexed Chain only. For persistant media implementations (ISAM-Files), please see article 5.2 "Datacrons".

Simple ISAM

From the glossary of terms we have: "ISAM (Indexed Sequential Access Method) is a method of indexation characterized as a tree with two distinct node types: Index and data. The top level of the tree may only contain data nodes, which contain key/element pairs. The lower levels may only contain index nodes, whose entries point to the nodes in the level above". The glossary goes on to qualify the term 'node', stating that "logical or physical data store units are described as nodes if they must contain an integer number of whole entries, and blocks if they are not subject to this constraint".

HadronZoo prefers ISAM to binary trees. One reason is that lower pointer/element ratios can be achieved, reducing memory consumption. In simple binary trees the ratio is 2, as each element has a pointer to a higher value element, and another to a lower value element. The more complex 2,3,4 tree for all its complexity, is only marginally better. Nodes can have 2, 3 or 4 children (hence the wierd name), and the number of elements is always 1 less than the number of children. Thus a 2-node has 1 element and 2 pointers, so a pointer/element ratio of 2, a 3-node has a ratio of 1.5 and a 4-node has a ratio of 4/3. By contrast ISAM, or at least ISAM as implemented by HadronZoo, has fewer than 2 pointers per data node. As data nodes contain the target elements, they have no need for pointers. So depending on the ratio of elements per data node, ISAM can have a far lower pointer/element ratio.

The pointer/element ratio is only one factor. Another, arguably more important factor, is the node occupancy ratio. When an element is added to a full node, the node splits in two. With a fixed number of elements per node, the split leaves one node with (N/2) elements and the other (N/2)+1, creating a total gap of N-1 where N is the node size. Techniques such as migrating elements to an adjacent node, make inserts less likely to require a split. Also the more numerous the gaps, the more likely it is that an insert will fill a gap rather than allocate a new node. Overall, fixed sized node achieve a node occupancy of around 75%. This might be acceptable in the less populous index nodes, but the same rate in the data nodes would be costly, particularly is the elements were large.

To improve node occupancy, HadronZoo ISAM uses a mix of node sizes. Various regimes exist but all are based on dividing nodes into buckets. Even a simple arrangement in which nodes can have between 1 and 4 buckets of the same size, can be effective. If X is the bucket size, a split node would result in a node with 2 buckets and a node with 3. That is a gap of X-1, which when N=4X, is 3X less than N-1. Having two or more bucket sizes, e.g. 4 and 8, further improves the effective occupancy rate.

Note that with memory resident ISAM, index node entries do not have to be key/node-pointer pairs. Only the node pointer is needed because the key for comparison (to the lookup test key), can be obtained by following the node pointer upwards until a data node is reached.

Indexed Chain

Again from the glossary of terms: "Indexed chain is an ISAM variant, adapted for variable length elements. The index level nodes are similar to those of standard ISAM, but the data level is structured as an ordered concatenation of elements. This concatenation is held in a chain virtual blocks, which elements can span".

Standard ISAM can handle variable length elements such as strings without undue profligacy, but there are circumstances where indexed chain offers significant improvements in space efficiency. Consider as an example, the following list of domain names and email addresses:- <pre> @hadronzoo.com info john.doe ... more addresses @mydomain.com i me myself ... more addresses @more domains ... </pre>

Assuming the domains appear in the list in lexical order and the addresses listed under each domain are likewise, lists of this form can be directly loaded into memory and searched by binary chop. Domains are first found by chopping the whole list. Addresses are then found by choping the list of addresses for the given domain. This approach is space efficient with the compact list serving as its own index, but not process efficient. After each chop, the data pointer must be wound on to find an element to compare. This is acceptable when chopping a list of addresses, but not when chopping the whole list to find the domain. This suggests an index of domains, akin to the index level in standard ISAM. There also needs to be a way of limiting data shifting when adding and deleting domains and addresses, which is where the chain of virtual blocks comes in. One might ask if this takes us right back to square one, i.e. to standard ISAM. The answer is not quite and the rationale of the indexed chain rests on the extent of the not quite!

It is presumed that data volumes will be substantial so a chain of blocks becomes necessary to avoid the need for contiguous memory. Data state changes will introduce wasteful gaps and there is no way of avoiding this. However the extent of the gaps can be limited. The blocks are virtual and can vary in size. Regimes differ as they are tailored to the task in hand, but all work on the same principle. The blocks consist of multiple 'buckets' which come in several sizes, with 512, 256, 128 and 64 bytes being typical. Occupancy rates of 95 percent or better, can be achieved by this means.

The index is also smaller than it would be with a collection of domains based on standard ISAM. HadronZoo indexed chain implementations use block anchoring addressing regimes. With block anchoring, index entries do not point to the first or last element within chain blocks as they do with ISAM, but to whole chain blocks and only chain blocks in which at least one element begins. Chain blocks wholly occupied by a mid part of a large element, don't have entries in the index. The technique ensures chains longer than 4Gb can be addressed by 32-bit numbers, but requires each chain block to state if elements begin within it and state the offset to the first one. All elements are prepended with a length indicator so that elements beyond the first, can be rapidly located by iteration. In the case of domain names and email addresses, the length indicator costs nothing as it will rarely exceed 1 byte in size and replaces the null (or newline) terminator.