hdbClass and hdbMember: Data Classes and Members
With a standard SQL database, creating a table with columns simultaneously defines a data structure (the set of columns) and creates a container for data objects of that structure (the table). HDB data structures (known as data classes) cannot be inferred into existence in this way. The strict rule is that data classes, and all other data types, must be defined before use. Accordingly, an HDB data object container can only be initialized to a data class that has already been explicitly defined.
Data classes and their members are virtual entities, respectively represented within the program space by instances of hdbClass and hdbMember. Definition of a data class is a three-stage process as follows:-
1) | The hdbClass instance is created and the class name assigned. As data classes are data types in their own right, the class name must be unique among all data types. |
2) | The members are declared (hdbMember instances created and initialized), one by one. |
3) | The data class definition is declared complete to prevent further changes. |
All data class members must have a legal data type. At the start of the config read, the set of legal data types consists entirely of the standard HadronZoo data types. The set expands as data class and other data type definitions are encountered in the application configs. The 'define before use' rule allows the config read to be a single-pass process. Since data classes are themselves data types, a member of a (host) class can have another (sub) class as its data type. This facilitates data object hierarchy, with the 'values' of the member concerned being subclass objects.
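The three stages and the 'define before use' rule can be sketched as below. This is a minimal illustration, not the HadronZoo API: TypeRegistry, ClassDef, AddMember and Complete are hypothetical names, the built-in type list is an arbitrary subset, and the sketch assumes a class becomes usable as a member type only once its definition is complete.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the set of legal data types known so far.
class TypeRegistry {
public:
    TypeRegistry() : m_types{"STRING", "INT32", "BOOL"} {}  // illustrative built-ins only
    bool Known(const std::string& name) const { return m_types.count(name) != 0; }
    bool Register(const std::string& name) { return m_types.insert(name).second; }
private:
    std::set<std::string> m_types;
};

// Hypothetical model of a data class definition (not the real hdbClass).
class ClassDef {
public:
    // Stage 1: create the definition and assign the class name.
    explicit ClassDef(std::string name) : m_name(std::move(name)) {}

    // Stage 2: declare a member. Fails if the definition is already complete
    // or if the member's data type has not been defined yet.
    bool AddMember(const TypeRegistry& reg, const std::string& mbrName,
                   const std::string& typeName) {
        if (m_complete || !reg.Known(typeName)) return false;
        m_members.emplace_back(mbrName, typeName);
        return true;
    }

    // Stage 3: declare the definition complete. In this sketch the name is
    // checked for uniqueness here, and the class then becomes a legal type.
    bool Complete(TypeRegistry& reg) {
        if (m_complete || !reg.Register(m_name)) return false;
        m_complete = true;
        return true;
    }

    std::size_t MemberCount() const { return m_members.size(); }
private:
    std::string m_name;
    std::vector<std::pair<std::string, std::string>> m_members;
    bool m_complete = false;
};
```

In this sketch a host class can only declare a member of a subclass type after the subclass has been completed, which is what lets a config reader resolve every type in one pass.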
The data class, however, only sets out the general form that objects of the class can take. Repositories can apply further validation rules. Members can be made compulsory, in which case the member must have a value; otherwise the data object is invalid and cannot be placed in the repository. Repositories can also require data objects to be unique on one or more members. These rules are stated in the repository initialization. They are independent of the class and so are not stated in the class definition.
hdbObject: Single Object Container
hdbObject is a single data object container and the sole means of passing data objects from application to repository and vice versa. In an INSERT operation, the new data object is assembled inside an initially blank hdbObject (object id 0) by assigning values to its members using the hdbObject SET functions. The newly populated hdbObject is then passed to the repository INSERT function, which stores the object and issues an object id for it. To retrieve a data object from a repository, a blank hdbObject is passed, along with the object id, to the repository FETCH function. The hdbObject is populated with the requested data object, which can then be read at will using the hdbObject GET functions. The data object resident in the hdbObject can then be changed by calls to the hdbObject SET functions and written back to the repository by passing the hdbObject to the repository UPDATE function.
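The INSERT, FETCH and UPDATE round trip described above can be sketched with stand-in types. ObjectBox and Repo are hypothetical and much simpler than hdbObject and hdbObjRepos: member values are plain strings, no data class is enforced, and the functions return bool where the real interface returns hzEcode.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for hdbObject: member values keyed by member name.
struct ObjectBox {
    std::uint32_t objId = 0;  // 0 means blank, i.e. not yet stored
    std::map<std::string, std::string> values;

    void Set(const std::string& mbr, const std::string& val) { values[mbr] = val; }
    std::string Get(const std::string& mbr) const {
        auto it = values.find(mbr);
        return it == values.end() ? std::string() : it->second;
    }
};

// Hypothetical stand-in for a data object repository.
class Repo {
public:
    // Stores a copy of the object and issues the next object id.
    bool Insert(std::uint32_t& objId, const ObjectBox& obj) {
        objId = ++m_lastId;
        m_store[objId] = obj.values;
        return true;
    }
    // Populates the supplied container from the stored object.
    bool Fetch(ObjectBox& obj, std::uint32_t objId) const {
        auto it = m_store.find(objId);
        if (it == m_store.end()) return false;
        obj.objId = objId;
        obj.values = it->second;
        return true;
    }
    // Replaces the stored object; the object id is unchanged.
    bool Update(std::uint32_t objId, const ObjectBox& obj) {
        auto it = m_store.find(objId);
        if (it == m_store.end()) return false;
        it->second = obj.values;
        return true;
    }
private:
    std::uint32_t m_lastId = 0;
    std::map<std::uint32_t, std::map<std::string, std::string>> m_store;
};
```

The application only ever touches the container; the repository copies data in and out of it, so changes made through Set have no effect on the stored object until Update is called.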
hzAtom: Single Datum Container
hzAtom holds a single datum of any atomic HadronZoo data type, so serves as a universal means of passing atomic values. hzAtom has the following components:-
- An _atomval union to store the value.
- A hzBasetype enum to indicate the data type.
- A status flag to state if the hzAtom has been set or is in error. This disambiguates zero where the data type is numeric.
Note that the _atomval union (defined in hzBasedefs.h) has members of types void*, double, and signed and unsigned integers of 64, 32, 16 and 8 bits. While C allows unions to have members that are structs, C++ has historically not allowed union members of class types with non-trivial constructors or destructors (since C++11 they are permitted, but the enclosing code must construct and destroy them by hand), so the union holds only plain types. In order to assume values of types STRING, DOMAIN, EMADDR and URL, hzAtom casts the _atomval space to a pointer to hzString, hzDomain, hzEmaddr, or hzUrl. For BINARY or TXTDOC values, the cast is to a hzChain pointer. Ghastly perhaps, but effective.
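A reduced sketch of the same three components (value union, type enum, status flag) is given below. Atom, AtomVal and BaseType are hypothetical names, and the string handling is a simpler variant than the one described: the sketch keeps a heap-allocated std::string behind the union's void* member, which may well differ from how hzAtom actually casts the _atomval space.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class BaseType { Undefined, Int32, Double, String };  // hypothetical subset

// Plain-data members only, in the spirit of the _atomval union.
union AtomVal {
    void*        ptr;
    double       dbl;
    std::int64_t s64;
    std::int32_t s32;
};

class Atom {
public:
    Atom() { m_val.s64 = 0; }
    ~Atom() { Clear(); }
    Atom(const Atom&) = delete;             // copying omitted to keep the sketch short
    Atom& operator=(const Atom&) = delete;

    void SetInt32(std::int32_t v) {
        Clear(); m_val.s32 = v; m_type = BaseType::Int32; m_set = true;
    }
    void SetDouble(double v) {
        Clear(); m_val.dbl = v; m_type = BaseType::Double; m_set = true;
    }
    void SetString(const std::string& v) {
        Clear(); m_val.ptr = new std::string(v); m_type = BaseType::String; m_set = true;
    }

    bool IsSet() const { return m_set; }
    BaseType Type() const { return m_type; }
    std::int32_t Int32() const { return m_val.s32; }   // valid only when Type() is Int32
    const std::string& Str() const {                   // valid only when Type() is String
        return *static_cast<const std::string*>(m_val.ptr);
    }

    // Releases any owned string and returns the atom to the unset state.
    void Clear() {
        if (m_type == BaseType::String) delete static_cast<std::string*>(m_val.ptr);
        m_val.s64 = 0;
        m_type = BaseType::Undefined;
        m_set = false;
    }
private:
    AtomVal  m_val;
    BaseType m_type = BaseType::Undefined;
    bool     m_set = false;
};
```

The status flag is what separates an atom that was explicitly set to zero from one that was never set, since both hold an all-zero value.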
Note that hzAtom has the standard HadronZoo prefix of hz rather than the HDB prefix of hdb. This is because it was not created specifically for the HDB and is commonly used outside the HDB in applications. Originally hzAtom was defined in hzBasedefs.h, along with the _atomval union. hzAtom was moved to hzDatabase.h because it is predominantly used in the HDB.
Native Data Class and Subclasses
When a data object container, either a repository or an hdbObject instance, is initialized to a data class, that data class is said to be the native data class of the container. Data object containers can only accept and contain data objects of their native data class. This has important consequences. For repository data operations, the hdbObject used to INSERT or FETCH data objects must be of exactly the same data class as the repository's native data class. An hdbObject can, however, INSERT/FETCH subclass data objects to/from the data space of the applicable member of its native data class, but only by way of another hdbObject that is initialized to the applicable subclass.
Data Object Repositories
Data object repositories are instances of the hdbObjRepos class. Initialization involves a sequence of member function calls as follows:-
InitStart() | Sets the repository name, data directory and native data class. Note that data classes only describe |
InitMbrIndex() | Applies an index to a member. The exact index type is constrained by the member type. |
InitMbrStore() | This only applies to members of binary data types, and states which binary datum repository will be used for the member, if this differs from the default. |
InitDone() | Concludes the initialization process. Blocks off further initialization function calls and validates the initialization thus far supplied. |
Once initialized the repository data operations are as follows:-
hzEcode | Insert | (uint32_t& objId, const hdbObject& Obj) | Inserts the standalone object and sets the object id. |
hzEcode | Update | (uint32_t objId, const hdbObject& Obj) | Replaces the object that currently holds the object id, with the standalone object. |
hzEcode | Delete | (uint32_t objId) | Deletes the object that currently holds the object id, from the repository. |
hzEcode | Fetch | (hdbObject& Obj, uint32_t objId) | Populates the supplied hdbObject with the object found at the supplied object id. |
Note that in the original design there were two data object repository classes, hdbObjCache (RAM primacy) and hdbObjStore (persistent media). Both were derivatives of hdbObjRepos, which was the unifying base class that set the common interface. The current hdbObjRepos is a merger of these two derivatives. The interface remained unchanged. The reasons for this merger are partly technical and partly to avoid confusion, and are discussed in more detail in section 5.5 "HDB Repositories".
Binary Datum Repositories (hdbBinRepos)
hdbBinRepos is the HDB binary datum repository class. These store binary datum, both on behalf of data object repositories and applications (e.g. uploaded files). With binary datum there is no concept of data type, classes or members. No meaning is attached so no indexes apply. Without indexes to add, initialization is a single call to hdbBinRepos::Init(). In addition, since binary datum are simply values, there is no sense in which one binary datum can be said to be an earlier or later version of another. Accordingly, UPDATE operations are implemented as a DELETE of the original datum and an INSERT of its replacement, which does not preserve the original datum id.
hdbBinRepos has the following interface:-
hzEcode | Insert | (uint32_t& datumId, const hzChain& Obj) | Inserts the binary datum contained in the chain, and sets the datum id. |
hzEcode | Update | (uint32_t& datumId, const hzChain& Obj) | Replaces the resident datum identified by the datum id, with the chain content. |
hzEcode | Delete | (uint32_t datumId) | Deletes the resident datum identified by the datum id, from the repository. |
hzEcode | Fetch | (hzChain& Obj, uint32_t datumId) | Populates the supplied chain with the datum identified by the datum id. |
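The reason Update takes the datum id by reference can be shown with a small model. BinStore below is a hypothetical stand-in for hdbBinRepos: it keeps datum in memory as std::string rather than hzChain and returns bool rather than hzEcode, but it follows the stated rule that an UPDATE is a DELETE followed by an INSERT.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical in-memory model of a binary datum repository.
class BinStore {
public:
    // Stores the datum and issues the next datum id.
    bool Insert(std::uint32_t& datumId, const std::string& bytes) {
        datumId = ++m_lastId;
        m_data[datumId] = bytes;
        return true;
    }
    bool Delete(std::uint32_t datumId) { return m_data.erase(datumId) == 1; }

    // DELETE the old datum, INSERT the replacement. The caller's id variable
    // is overwritten with the newly issued datum id.
    bool Update(std::uint32_t& datumId, const std::string& bytes) {
        if (!Delete(datumId)) return false;
        return Insert(datumId, bytes);
    }
    bool Fetch(std::string& bytes, std::uint32_t datumId) const {
        auto it = m_data.find(datumId);
        if (it == m_data.end()) return false;
        bytes = it->second;
        return true;
    }
private:
    std::uint32_t m_lastId = 0;
    std::map<std::uint32_t, std::string> m_data;
};
```

Any data object member that records the old datum id would have to be rewritten with the new one after such an update.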
hdbIdset and the Index Classes
Retrieval of data objects from repositories requires object ids that are usually obtained by user searches. As data object repositories are unordered, indexes are needed to avoid a full repository scan. Also in the common case where a repository requires objects to be unique on a particular member, the INSERT operation must first establish that the repository does not already have the object in question. This simply would not be practical without an index.
In order to keep things simple, indexes in HDB data object repositories are tied to data class members. No index can apply to more than one member and members are either subject to a single index or no index. Where an index does apply, the index type is dictated by the member data type. HDB indexes are derivatives of the hdbIndex base class as follows:-
hdbIndexUkey: | 1:1 mapping between values and object ids. Only used where the repository requires objects to be unique on the applicable member. |
hdbIndexOder: | 1:1 mapping between values and sets of ids (idsets). This is used where the repository does NOT require objects to be unique on any member. |
hdbIndexEnum: | 1:1 mapping of enumerated values to idsets of objects where the applicable member has the value. ENUM is the only permitted data type. |
hdbIndexText: | 1:1 mapping of words to idsets of objects whose applicable member has these words. Acceptable data types are TEXT and TXTDOC. |
hdbIndexBool: | Single idset. Acts as both data store and index. BOOL member values are held in a single hdbIndexBool and TBOOL member values in a hdbIndexBool pair. |
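The difference between a unique-key index and an idset-based index can be sketched with standard containers. UniqueIndex and MultiIndex are hypothetical names and std::map and std::set are stand-ins; the real HDB index classes and their storage are not shown in this section.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Value -> single object id. Insert fails if the value is already present,
// which is how a uniqueness rule can be enforced without a repository scan.
class UniqueIndex {
public:
    bool Insert(const std::string& value, std::uint32_t objId) {
        return m_map.emplace(value, objId).second;
    }
    // Returns 0 (the NULL id) when the value is not indexed.
    std::uint32_t Lookup(const std::string& value) const {
        auto it = m_map.find(value);
        return it == m_map.end() ? 0 : it->second;
    }
private:
    std::map<std::string, std::uint32_t> m_map;
};

// Value -> set of object ids (an idset, here left unencoded).
class MultiIndex {
public:
    void Insert(const std::string& value, std::uint32_t objId) {
        m_map[value].insert(objId);
    }
    std::set<std::uint32_t> Lookup(const std::string& value) const {
        auto it = m_map.find(value);
        return it == m_map.end() ? std::set<std::uint32_t>() : it->second;
    }
private:
    std::map<std::string, std::set<std::uint32_t>> m_map;
};
```

A repository that requires uniqueness on an email member, for example, could reject an INSERT as soon as UniqueIndex::Insert returns false, while a surname member would use the idset form because many objects share a value.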
With the exception of hdbIndexUkey, all HDB indexes are based on idsets. Index idsets can have id populations ranging from 1 to the total repository population, which can be in the millions. HDB object ids are 32-bit unsigned integers and would consume too much space if stored as-is. So idsets are stored in encoded form. To encode and decode idsets we have the hdbIdset class. This is a memory resident single idset container with boolean operators that work directly with the encoded form, and is the principal means by which applications read ids from idsets.
The HDB facilitates searches where the expression is either a single 'search term' or has the general form:-
search_term boolean_operator remainder-of-expression.
And a search term has the general form:-
data_class_member [boolean_unary_operator] valueA [comparison_operator valueB].
Each search term amounts to a single index query, on a single data class member, in which either a single value or a range of values is specified. The result of a query will always be a set of ids, so an idset. If the expression comprises a single search term, the resulting single idset will be returned as the final result. Otherwise there will be a query and an idset in respect of every search term, with the boolean expression applied to produce a final result idset.
The result of an OR of two input idsets will be at least the size of the larger idset, whereas the result of an AND will be at most the size of the smaller idset. Unless the inputs are identical, the AND produces the smaller result and takes less time to execute. Idset populations are known and this can be used to optimize searches. For example, if A is small and B and C are large, a search on "A && (B || C)" processed as-is, would build a large interim idset from B and C using the slower OR operation. This expression can be rearranged as "(A && B) || (A && C)" since this is logically the same. It will run faster as OR is only applied to two small interim idsets.
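The rearrangement can be checked with a small sketch. It uses plain sorted vectors and the standard set algorithms rather than hdbIdset, so it says nothing about the encoded form; IdSet, And, Or and Range are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

using IdSet = std::vector<std::uint32_t>;  // sorted ascending, no duplicates

// Intersection: result is never larger than the smaller input.
IdSet And(const IdSet& a, const IdSet& b) {
    IdSet r;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(r));
    return r;
}

// Union: result is never smaller than the larger input.
IdSet Or(const IdSet& a, const IdSet& b) {
    IdSet r;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(r));
    return r;
}

// Builds the idset {first, first+step, ...} up to and including last.
IdSet Range(std::uint32_t first, std::uint32_t last, std::uint32_t step) {
    IdSet r;
    for (std::uint32_t v = first; v <= last; v += step) r.push_back(v);
    return r;
}

// "A && (B || C)" evaluated as written: one large interim idset.
IdSet SearchAsWritten(const IdSet& a, const IdSet& b, const IdSet& c) {
    return And(a, Or(b, c));
}

// "(A && B) || (A && C)": the OR only ever sees two small interim idsets.
IdSet SearchRearranged(const IdSet& a, const IdSet& b, const IdSet& c) {
    return Or(And(a, b), And(a, c));
}
```

With a small A and large B and C, both functions return the same ids, but the interim idsets in SearchRearranged are bounded by the size of A.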
Meaning of Data Object Identifiers
Data object ids (or simply object ids), are 32-bit unsigned integers, issued in ascending order by data object repositories, upon successful INSERT of a new data object. Object ids start at 1 as they follow the convention that 0 is NULL, and are unique and nonrecurring within their issuing repository. Thus, the fully qualified address of any given data object is the repository id plus the object id.
Applications have no control over object id issuance so beyond serving as data object addresses, object ids are generally taken to be meaningless. Note however, that object ids are preserved on UPDATE. This makes it practical to attach meaning to the ids and deploy them in applications. If for example, object ids issued by a repository of customers can double as customer ids, it is not necessary for the application to generate or manage customer ids.
Data objects within a repository are usually located by a search on that repository. Searches may be instigated by a background report generator or by user action. Either way, the search result will be a set of zero or more object ids. The usual practice with user action is to present a "no records found" message if the set is empty, the data object itself if it is singular, or a menu of data objects if the set contains two or more object ids. Unless meaning has been attached to the object ids, they are not usually made visible to the user.
Binary datum ids issued by binary datum repositories (hdbBinRepos), are likewise ascending 32-bit unsigned integers that start at 1. However hdbBinRepos does not preserve datum ids on UPDATE. In effect, an UPDATE is a DELETE of the existing datum and an INSERT of its replacement. This limits the usefulness of datum ids in applications but as should be clear, binary datum are not on a par with data objects. They are only values.
The purpose of the HDB is to store and retrieve data objects, which is fulfilled by the data object repositories. Although developers are free to use the hdbBinRepos class directly in programs, its only role in the HDB is to store values on behalf of data object members of a binary data type. The value (binary datum) is placed in the binary repository, while the applicable data object member stores only the address (the datum id).
Binary datum are indexable by free text index if they are text or text can be extracted from them (no extraction tools are currently provided by HadronZoo). If the binary datum are not text and text cannot be extracted from them, no indexation is possible. Binary datum such as uploaded films or photographs are usually hosted by a data object. Under this arrangement, one member of the host data object holds the binary datum while other members name and describe it, thereby facilitating searches. Providing the host data object can be located, the binary datum can also be located.
Note that it has become common practice to declare separate binary datum repositories for different binary datum sources. This is understandable but unnecessary unless there is any prospect of running out of datum ids! As binary datum do not have members, there is nothing to distinguish a binary datum from one source from that of another. It is perfectly safe to hold all binary datum for the application in a single binary datum repository. If more than one binary datum repository is declared in the application, it becomes necessary for all binary members to state which repository they will use.