Persistent Stores: Hardware Considerations
An obvious point to make is that a persistent store that works with the media hardware will outperform one that works against it. With hard drives, data transfer rates are high but seek times are typically ~10ms, an age in computing terms. Unless the disk controller happens to have the requested data in its buffers, an unlikely scenario, a data read must wait for the disk head to be over the right track and for the target block to rotate into view. In particular, unless the data are contiguous, retrieval will require multiple costly seeks. Due to file buffering, write operations are fast, provided they are to the file end or are 'whole block' (both write size and target file offset are an exact multiple of the block size). Any other write will partly change blocks, and as blocks are only read or written whole, the affected blocks must be read first, thereby requiring a seek.
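To make the 'whole block' condition concrete, a write avoids the read-modify-write penalty only when both its target file offset and its length are exact multiples of the block size. A minimal sketch in C++, assuming a 4096-byte block size; the constant and function name are illustrative, not part of any library API:

    #include <cstdint>

    // Illustrative only: a write can be applied without first reading the
    // affected blocks when both the target offset and the length fall on
    // block boundaries.
    constexpr uint64_t BLOCK_SIZE = 4096 ;

    bool isWholeBlockWrite (uint64_t fileOffset, uint64_t writeLen)
    {
        return (fileOffset % BLOCK_SIZE) == 0 && (writeLen % BLOCK_SIZE) == 0 ;
    }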
With an SSD, seeks are not an issue since the device is random access. However, all writes must be to virgin or erased blocks, and erasure is only done in bulk, in large superblocks. As a result, freed blocks do not become available for writing again until the host superblock is erased, which can only be done when all blocks within it are free. Waiting for this to come about naturally is a non-starter, so the disk controller routinely moves blocks around to clear superblocks for erasure. It should be noted that in spite of this background defragmentation activity, SSDs are renowned for slowing over time.
Other than ensuring all writes are to the file end, there is little that can be done within programs to mitigate this. One SSD optimization technique is to preallocate blocks for files, in chunks that are a multiple of the superblock size. This however, only ensures superblocks are dedicated to the file in question if all writes are to (or beyond) the file end. Any write to a position before the end results in the original blocks being marked as deleted and replaced by new blocks - throwing the preallocation out of kilter.
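By way of illustration, the preallocation technique might be sketched as below. The superblock size is an assumed figure, and posix_fallocate is used only as one way of reserving space without writing it; neither is prescribed by the text above.

    #include <sys/types.h>
    #include <fcntl.h>      // posix_fallocate

    // Assumed figure for the example: many SSDs erase in multi-megabyte superblocks.
    constexpr off_t SUPERBLOCK_SIZE = 8 * 1024 * 1024 ;

    // Extend the file by whole superblocks so that, provided all subsequent
    // writes are appends, the reserved superblocks stay dedicated to this file.
    int preallocSuperblocks (int fd, off_t currentSize, off_t requiredExtra)
    {
        off_t chunks = (requiredExtra + SUPERBLOCK_SIZE - 1) / SUPERBLOCK_SIZE ;

        return posix_fallocate(fd, currentSize, chunks * SUPERBLOCK_SIZE) ;
    }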
Management of Persistent Media Space is a Bad Idea
When designing a persistent media data store, it is an easy mistake to think that space for new datum should come from space freed by deletes, in preference to growing the data files. This is just good housekeeping, surely? Isn't it always better to tidy up as you go along, rather than leave a huge mess to be cleaned up later?
Not in the case of persistent media. Unlike RAM, which should be returned to the OS as soon as possible after use, persistent media can wait for system administrators to clear it at their leisure, weeks, months or even years later. RAM is all about the machine's capacity to act. Persistent media is all about how long the machine can record the action. The more spare space there is, the longer the machine can handle the anticipated data acquisition rate before extra capacity is required. In this calculation there isn't much difference between unused space and space hogged by gigabytes of old log files awaiting deletion.
While ultimately programs should clear obsolete data and return space to the OS, doing so within data operations will significantly slow and complicate those operations. Persistent and failsafe free space management is onerous, not least because the data needed to do it must itself be persistent and failsafe! Among the cited reasons for designing a persistent data store in the first place is to improve on the granularity imposed by the file system. However, the smaller the granularity, the wider the range of possible sizes and the more complex the management process becomes. In summary, management of persistent media space is a dead loss and should be abandoned. Accept a level of gap persistence and have a background thread periodically iron out the gaps.
Always Append, Never Insert
Once incoming data has been validated, it should be committed to persistent media as soon and as fast as possible. Soon because of the prospect of a crash. Fast because performance is always important, and all the more so where the program is multithreaded and the commit takes place within a lock. There is a simple and effective way to commit data as soon and as fast as possible, with either hard drive or SSD: Ensure files are buffered and all writes are to the file end. In other words AANI: Always Append, Never Insert.
Logfiles work this way, with logs benchmarking as microsecond operations. It follows that any data file can work this way, providing it is acceptable for the datum to be in order of arrival (unordered in HadronZoo terms). From the glossary of terms, a data repository is an "unordered persistent data store which issues ids on datum receipt, and retrieves datum on receipt of an id". AANI therefore is a natural fit for a formal data repository.
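A minimal sketch of AANI in a data repository follows, assuming a buffered data file opened for append and an id that is simply the running datum count; the class and member names are illustrative rather than HadronZoo classes, and error handling is omitted.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // Illustrative append-only repository: INSERT always writes at the file end
    // and issues the next id; FETCH uses a RAM map of id to (offset,length)
    // built up as datum are inserted.
    class ApndRepos
    {
        FILE*   m_fp ;                                      // buffered data file, append mode
        std::vector<std::pair<long,uint32_t>>  m_adr ;      // id-1 -> (offset, length)

    public:
        ApndRepos  (const char* path)   { m_fp = fopen(path, "a+b") ; }
        ~ApndRepos ()                   { if (m_fp) fclose(m_fp) ; }

        // INSERT: append the datum and issue its id (ids start at 1, never recur)
        uint64_t    Insert  (const std::string& datum)
        {
            fseek(m_fp, 0, SEEK_END) ;
            long offset = ftell(m_fp) ;
            fwrite(datum.data(), 1, datum.size(), m_fp) ;
            m_adr.push_back({offset, (uint32_t) datum.size()}) ;
            return m_adr.size() ;
        }

        // FETCH: retrieve the datum for a previously issued id (one seek at most)
        bool    Fetch   (std::string& datum, uint64_t id)
        {
            if (!id || id > m_adr.size())
                return false ;
            long     offset = m_adr[id-1].first ;
            uint32_t len    = m_adr[id-1].second ;

            datum.resize(len) ;
            fseek(m_fp, offset, SEEK_SET) ;
            return fread(&datum[0], 1, len, m_fp) == len ;
        }
    } ;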
Datacron Classification
AANI compliant data stores are known as datacrons because in normal operation (periodic rationalization notwithstanding), the data file(s) form a chronological record of data state changes. Note that by convention, entities stored by datacrons are always referred to as datum (with a datum id), even where they are data objects or more sophisticated constructs. Datacrons are classified as follows:-
Strict: | A datacron is said to be strict if the original datum id is preserved on UPDATE. It is non-strict if it does not support an UPDATE operation or implements UPDATE as a combined DELETE of the old datum and an INSERT of its replacement - which would issue a new datum id. |
Augmented: | A datacron is augmented if it has one or more RAM Primacy components. Serial datacrons are, by definition, augmented. |
Composite: | A datacron is composite if it is constructed as, or can be considered in functional terms as, a combination of two or more datacrons. Composite datacrons, which will act as either repositories or indexes, are generally constructed as a persistent store (itself a form of datacron), aided by one or more serial datacrons. The basic idea is that serial datacrons are the solution to persistent store inflexibility, while the persistent store is the solution to serial datacron volume limitations. The latter feat requires periodic rationalization of the base datacron index and data files. |
Note that serial datacron data capacity is limited by the startup time, perhaps more than it is by the available RAM. Restoration of the last known data state on startup requires a complete delta file read. The time taken is not a simple function of data volume. It depends on the number of entities, and on the sort process (if applicable). In the common case where the RAM component maps string keys to datum, inserts take around 2-3 microseconds, limiting the key/datum population in most applications to the low single digit millions. At that rate, restoring 3 million keys adds roughly 6 to 9 seconds to startup.
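To illustrate why entity count rather than raw byte volume bounds practical capacity, a serial datacron restore amounts to a sequential replay of the delta file into the RAM component. A minimal sketch follows, assuming a simple record format of opcode, key length, value length, key, value; the format and names are assumptions for the example, not the HadronZoo on-disk layout.

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <string>

    // Replay a delta file of INSERT/DELETE records into an in-memory map. The
    // per-record map insert (a few microseconds) dominates, so restore time
    // grows with the number of entities rather than the bytes stored.
    bool    restoreDeltas   (std::map<std::string,std::string>& ram, const char* path)
    {
        FILE* fp = fopen(path, "rb") ;
        if (!fp)
            return false ;

        for (;;)
        {
            uint8_t  op ;
            uint32_t klen, vlen ;

            if (fread(&op, 1, 1, fp) != 1)
                break ;                                 // end of delta file
            fread(&klen, sizeof(klen), 1, fp) ;
            fread(&vlen, sizeof(vlen), 1, fp) ;

            std::string key(klen, '\0'), val(vlen, '\0') ;
            fread(&key[0], 1, klen, fp) ;
            fread(&val[0], 1, vlen, fp) ;

            if (op == 'I')
                ram[key] = val ;                        // insert or overwrite
            else
                ram.erase(key) ;                        // delete
        }

        fclose(fp) ;
        return true ;
    }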
Note also that the ids of previously deleted datum never recur. Datacrons have no natural tendency to reuse ids and while it would be easy to do, HadronZoo considers id reuse to be very poor practice.
In Outline: Composite Datacron Repositories
The (persistent) store of a composite datacron repository is a base datacron with optional adaptions (see below). As the base datacron has no DELETE, a serial datacron known as the OBSOLETE index implements DELETE as DEPRECATE. The OBSOLETE index RAM component is an idset of all datum deprecated since repository creation or last rationalization (whichever is the more recent). If FETCH finds the datum id in the OBSOLETE index, access is denied. Nothing is actually deleted until rationalization, after which the OBSOLETE index is truncated.
UPDATE is facilitated by another serial datacron known as the UPDATE index. The UPDATE index RAM component is a 1:1 map of all datum updated since creation or last rationalization, to their latest data file address and size. If FETCH finds the datum id in the UPDATE index, the UPDATE index entry takes precedence over the index file entry. The UPDATE index is also truncated after rationalization.
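The resulting FETCH order of precedence might be sketched as follows; the container types and names are illustrative rather than the HadronZoo classes, and the index file lookup is left as a placeholder.

    #include <cstdint>
    #include <map>
    #include <set>

    struct  Addr    { uint64_t offset ; uint32_t length ; } ;

    // RAM components of the two serial datacrons (illustrative types)
    std::set<uint64_t>      obsoleteIdx ;   // ids deprecated since last rationalization
    std::map<uint64_t,Addr> updateIdx ;     // id -> latest data file address/size

    // Placeholder for the index file lookup (requires a seek)
    bool    readIndexFileEntry  (Addr& adr, uint64_t datumId) ;

    // FETCH: deny deprecated ids, prefer the UPDATE index entry, otherwise fall
    // back to the index file entry for the original version of the datum.
    bool    fetchAddr   (Addr& adr, uint64_t datumId)
    {
        if (obsoleteIdx.count(datumId))
            return false ;                          // datum has been deprecated

        auto it = updateIdx.find(datumId) ;
        if (it != updateIdx.end())
            { adr = it->second ; return true ; }    // latest version after UPDATE

        return readIndexFileEntry(adr, datumId) ;   // original index file entry
    }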
Rationalization, in simple terms, creates a new data and index file from surviving datum (those neither deprecated nor superseded), then deletes the originals. However, in order to prevent the need for ever more free space and time as data volumes rise, both the data and index files are partitioned. Thus both are virtual constructs, consisting of a series of one or more files. The partitioning is arranged as follows: At any given time, there is a current partition and zero or more past partitions. As datum ids are non-recurring, the current partition is the only partition appended by INSERT. Once the current partition exceeds a preset volume threshold, it becomes a past partition and a newly created partition takes its place. Partitions have an id offset. That of the current partition is N+1, where N is the highest id reached by its predecessor. In a virgin datacron N is 0, so the first datum id is 1. The process is controlled by a dedicated background thread, which covers all datacrons in the applicable program space. Only past partitions are considered, and these are only rationalized if sufficient volume reduction would result (the threshold ratio is usually set around 20 percent). Typically the background thread wakes daily, with actual rationalizations often months apart. Note that the number of open files rises with the number of partitions, which will be a consideration when setting the volume threshold.
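Since every partition carries an id offset, locating the partition that holds a given datum id reduces to a search over those offsets. A minimal sketch, with the offsets held in a sorted map; the names and the map itself are illustrative, not part of the described implementation.

    #include <cstdint>
    #include <map>

    // Partition id offset (the first id the partition can hold) -> partition
    // number, e.g. {1:0, 4000001:1} where the first partition closed at id 4000000.
    std::map<uint64_t,uint32_t> partOffsets ;

    // Find the partition holding the given datum id, or fail if the id is zero
    // or exceeds the highest id issued so far.
    bool    findPartition   (uint32_t& partNo, uint64_t datumId, uint64_t highestId)
    {
        if (partOffsets.empty() || !datumId || datumId > highestId)
            return false ;

        // The owning partition is the one with the greatest id offset that
        // does not exceed the datum id.
        auto it = partOffsets.upper_bound(datumId) ;
        --it ;
        partNo = it->second ;
        return true ;
    }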
Various options exist for the store. If implemented as a pure base datacron, the index file of rationalized partitions must have blank entries in place of deleted datum, in order to preserve the base datacron id/offset relationship. This waste can be considered minor and ignored. Alternatively it can be eliminated, but only by partly departing from the base datacron model. The index file is itself given an index, yet another serial datacron known as the STORE index. This maps the highest (or lowest) id in each index block to the block address. The STORE index in a composite datacron repository has a low overhead, as there would only be one entry per 4K index block.
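Under the STORE index adaption, locating the index block for a given id becomes a lookup in the RAM map rather than a calculation from the id/offset relationship. A sketch follows, assuming the highest id in each block is used as the key; the names are illustrative.

    #include <cstdint>
    #include <map>

    // STORE index RAM component: highest datum id covered by each 4K index
    // block -> the block's address within the index file (one entry per block).
    std::map<uint64_t,uint64_t> storeIdx ;

    // Locate the index block that covers the given id. Blank entries for
    // deleted datum are no longer needed, as the block found is searched for
    // the id rather than the id being translated directly to an offset.
    bool    findIndexBlock  (uint64_t& blockAddr, uint64_t datumId)
    {
        auto it = storeIdx.lower_bound(datumId) ;   // first block whose highest id >= datumId
        if (it == storeIdx.end())
            return false ;
        blockAddr = it->second ;
        return true ;
    }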
The base datacron, on its own or with the STORE index adaption, needs two seeks to retrieve datum: one on the index file, then another on the data file. If the objective were only to check whether an id was still valid, an index file seek would still be needed. Eliminating the index file seek in this instance would require a DELETE index. This has the same form as the OBSOLETE index, but is not truncated on rationalization.
The index file seek can be eliminated altogether by means of a MAIN index. This maps ids to the latest address and size, exactly as the UPDATE index does. The difference is that it does this for all stored datum, not just datum updated since creation or last rationalization. The MAIN index is the RAM component of a RAM Primacy datum index, with the index file acting as backup. All seeks can be eliminated by full implementation of RAM Primacy, with the entire data body memory resident and both the index and data files acting as backup. As should be obvious however, full RAM Primacy, or the less radical MAIN index, is only worth doing if the data is stored on hard drive or extreme performance is required.
ISAM-Files and Composite Datacron Indexes
Indexes map keys (values) to data objects, data object ids, or sets of data object ids - placing data objects in value order, as opposed to the natural order of repositories, which is order of occurrence. That the keys must be in value order rules out the base datacron as a persistent store. Instead the store is a 'persistent ordered collection', otherwise known as an ISAM-File.
ISAM-Files, which can be deployed as 1:1 key-object maps, 1:Many key-object maps, or as key sets, are persistent renditions of an indexed chain (there being no persistent rendition of the simple ISAM). Memory resident indexed chains (see article 4.3 "ISAM and Indexed Chain") comprise a chain of blocks with elements (key-datum pairs) in value order, and a map to aid lookup (the chain index). ISAM-Files essentially mirror this structure, with the chain of blocks held in a data file. This isn't straightforward. The easy part is keeping blocks in value order. AANI dictates that blocks are written in order of occurrence, so they are placed in value order virtually, by the chain index. The chain index must itself persist, so it is implemented as a serial datacron.
The hard part is keeping elements within blocks in value order, without excessive writes. This requires another, much larger serial datacron, which serves as the staging buffer. All new and changed elements are placed in the staging buffer, and are migrated to the data file chain blocks during periodic rationalizations. The staging buffer can be likened to the current partition. As elements can be either in the buffer or in the data file chain, LOOKUP must search both (see the sketch below).
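The resulting LOOKUP might be sketched as follows for the key-id case: the staging buffer and chain index stand in for the RAM components of the two serial datacrons, and the chain block search stands in for the single data file seek. Names and types are illustrative.

    #include <cstdint>
    #include <map>
    #include <string>

    // RAM components (illustrative): the staging buffer of new and changed
    // elements, and the chain index mapping the lowest key in each chain block
    // to that block's address in the data file.
    std::map<std::string,uint64_t> stagingBuf ;     // key -> datum id
    std::map<std::string,uint64_t> chainIdx ;       // lowest key in block -> block address

    // Placeholder for the single data file read: search the chain block at the
    // given address for the key.
    bool    searchChainBlock    (uint64_t& datumId, uint64_t blockAddr, const std::string& key) ;

    // LOOKUP: the staging buffer is searched first (no seek); failing that, the
    // chain index names the one block that could hold the key (one seek at most).
    bool    Lookup  (uint64_t& datumId, const std::string& key)
    {
        auto it = stagingBuf.find(key) ;
        if (it != stagingBuf.end())
            { datumId = it->second ; return true ; }

        if (chainIdx.empty())
            return false ;

        auto cb = chainIdx.upper_bound(key) ;       // first block starting after the key
        if (cb == chainIdx.begin())
            return false ;                          // key precedes the whole chain
        --cb ;                                      // owning block: greatest lowest-key <= key

        return searchChainBlock(datumId, cb->second, key) ;
    }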
ISAM-File implementations vary, depending on what is being indexed and how. Elements can be key-id pairs, key-idset pairs or key-datum pairs - although the key-datum case is usually reduced to the key-id case by storing the datum in a binary datum repository. The keys are invariably small.
In the key-id case, the small element size allows a high rationalization trigger threshold (staging buffer population limit), with 1 million elements being typical. Total ISAM-File element populations can be in the billions. Data file lookups require a seek, and as the population rises, an ever greater proportion of it falls within the data file. There is thus a performance degradation, but it should be noted that this never goes beyond a single seek. With a hard drive, it might be worth raising the trigger threshold. With an SSD, such a measure is unlikely to be worthwhile. What will slow as the population rises, albeit without detracting from operational performance, are the periodic background rationalizations. Note that statistically, the value range spanned by the staging buffer will be similar to that spanned by the data file chain as a whole. In the first rationalization, migrating 1 million elements results in around 5000 blocks being written, a ratio of some 200 elements per block. In subsequent rationalizations this ratio drops, ultimately approaching 1, as it becomes progressively less likely that elements in the buffer will share a target block in the data file chain.
In the key-idset case, an INSERT will either increase the element population or expand the idset of an existing element. Likewise a DELETE will either reduce the element population or shrink the idset of an existing element. The staging buffer is usually arranged exactly as it is in the key-id case.
Note that due to AANI, the ISAM-File data file chain is a datacron. And as the ISAM-File chain index and staging buffer are both serial datacrons, ISAM-Files are composite datacron indexes. It is possible to devise composite datacron indexes with components beyond those needed for an ISAM-File, but rarely necessary. However, it is difficult if not impossible to devise a composite datacron index with fewer components than those needed for an ISAM-File. The reader is thus at liberty to decide for themselves whether ISAM-Files and composite datacron indexes are one and the same thing.