What is the HDB?
As stated in the introduction, the HDB is a non-SQL, hierarchical, share-nothing database regime, conferred on programs by the HDB class group, which is part of the HadronZoo library. The HDB is internal to the programs that invoke it, as the data model entities are represented by HDB class instances that must exist in the program space. The data model entities are thus available for direct operation within the program.
Instead of storing rows of data in columnar tables, data objects (instances of a data class) are stored in data object repositories. Data classes are data structure definitions: essentially lists of one or more members, each with a predefined data type. In this respect, data classes are akin to C-struct definitions. Data classes are themselves data types, so a data class member can have a data type that is another data class (referred to as the subclass). It is by this mechanism that the HDB is hierarchical.
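To illustrate the analogy, a hypothetical pair of data classes is shown below as plain C++ structs. The names are invented for illustration; in the HDB, Address would be a data class serving as the data type of a Person member, and it is this nesting that makes the model hierarchical.

    // Illustration only: hypothetical data classes expressed as C++ structs.
    #include <cstdint>
    #include <string>

    struct Address              // a data class that is itself a data type
    {
        std::string street;
        std::string city;
        std::string postcode;
    };

    struct Person               // a data class with a subclass member
    {
        std::string name;
        uint32_t    yearOfBirth;
        Address     homeAddress;    // member whose type is another data class (the subclass)
    };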
The HDB class group includes a binary datum repository class hdbBinRepos, a data object repository class hdbObjRepos, and five index classes to support hdbObjRepos operation. There is also an entity register, hdbADP, to hold the data model. This, inter alia, enables a data object repository to use another data object repository to store subclass objects, should this be desired. However, this is the extent of any linkage between repositories. In the general case, repositories are completely independent (share nothing) and, by default, store subclass objects internally. The basic idea is that data objects are stored whole, retrieved whole, and treated as whole when operated on within the program.
HDB Origins
The HDB design was informed almost entirely by consideration of the Dissemino method (DM), and what was known about website data requirements. At the outset the DM concept was only an outline and there was no formal means of defining data object model entities, so no database. Experimental control panel webapps were built, partly to help devise the DM and partly to explore RAM Primacy. In these webapps, data classes were implemented as C++ classes, with instances (data objects) stored in memory resident collections. The exact arrangement depended on how data objects were intended to be found. If found by unique key lookup, the collection would be a 1:1 map. If found by non-unique key lookup, the collection would be a 1:many map. If found by multiple keys, multiple collections would be used, and the collections would be of pointers to data objects rather than actual instances, to avoid data repetition.
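A minimal sketch of this arrangement follows, with invented names. The Customer objects are found by a unique id (a 1:1 map that owns the instances) and by a non-unique surname (a 1:many map of pointers, to avoid data repetition).

    // Sketch only: memory resident collections as used in the experimental webapps.
    #include <cstdint>
    #include <map>
    #include <string>

    struct Customer
    {
        uint32_t    id;         // unique key
        std::string surname;    // non-unique key
        std::string email;
    };

    std::map<uint32_t, Customer>          byId;       // 1:1 map, holds the instances
    std::multimap<std::string, Customer*> bySurname;  // 1:many map of pointers

    void insert(const Customer& c)
    {
        // The primary collection stores the object (std::map nodes are stable,
        // so the pointer held by the secondary collection remains valid).
        Customer& stored = (byId[c.id] = c);
        bySurname.insert({stored.surname, &stored});
    }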
In order for data object collections to persist, the C++ data classes were given the member functions Import() and Export(). Import() was called once on startup to load data objects from a data file. Export() was called on INSERT, MODIFY or DELETE, to append to the data file either the latest data object content or a delete marker. Initially, data objects were always committed to persistent media whole. This was later refined: data object members were assigned ids and their values were written out individually, so that while INSERT continued to commit whole objects, MODIFY wrote only the deltas. As a result of this refinement, data object member values in the data file became known as deltas, and the data file itself became known as the delta file.
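The sketch below suggests how Export() might append deltas. Import() and Export() are named above, but the record layout used here (action, object id, member id, value) is an assumption made purely for illustration.

    // Sketch only: appending deltas to the delta file. The record layout is assumed.
    #include <cstdint>
    #include <fstream>
    #include <string>

    enum class Action : char { Insert = 'I', Modify = 'M', Delete = 'D' };

    // INSERT writes every member value, MODIFY writes only the changed
    // values (the deltas), DELETE writes a delete marker for the object.
    void exportDelta(std::ofstream& deltaFile, Action act, uint32_t objId,
                     uint16_t memberId, const std::string& value)
    {
        deltaFile << static_cast<char>(act) << ' ' << objId << ' '
                  << memberId << ' ' << value << '\n';
    }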
The experimental webapps also needed to store binary datum, to accommodate file uploads. For this purpose, two append-only files were used: a data file to store the datum, and an index file to store the address and size of the datum in the data file (plus other optional metadata). The datum id issued on INSERT was the count of index file entries so far, plus 1. On FETCH, the datum id minus 1 was used to calculate the index file offset. This arrangement had no DELETE or MODIFY function, but INSERT was fast due to file buffering. FETCH took two seeks, but this could be cut to one by holding the index memory resident.
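As a sketch, and assuming a fixed-size index entry of address plus size, the two-file arrangement amounts to the following:

    // Sketch only: the two-file binary datum repository.
    #include <cstdint>
    #include <cstdio>

    struct IndexEntry
    {
        uint64_t addr;      // offset of the datum in the data file
        uint64_t size;      // size of the datum in bytes
    };

    // INSERT: append the datum and its index entry; the id is the new entry count.
    uint64_t insertDatum(FILE* dataFile, FILE* indexFile, const void* datum, uint64_t size)
    {
        fseek(dataFile, 0, SEEK_END);
        fseek(indexFile, 0, SEEK_END);
        IndexEntry e = { (uint64_t) ftell(dataFile), size };
        fwrite(datum, 1, size, dataFile);
        fwrite(&e, sizeof(e), 1, indexFile);
        return (uint64_t) ftell(indexFile) / sizeof(IndexEntry);
    }

    // FETCH: seek 1 reads the index entry at (id - 1) * entry size, seek 2 reads the datum.
    bool fetchDatum(FILE* dataFile, FILE* indexFile, uint64_t id, void* buf, uint64_t bufSize)
    {
        IndexEntry e;
        fseek(indexFile, (long) ((id - 1) * sizeof(IndexEntry)), SEEK_SET);
        if (fread(&e, sizeof(e), 1, indexFile) != 1 || e.size > bufSize)
            return false;
        fseek(dataFile, (long) e.addr, SEEK_SET);
        return fread(buf, 1, e.size, dataFile) == e.size;
    }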
Although the control panel approach has been superseded by the DM, it is still used today to generate actual control panel pages, such as the in-built memory and stats reports that come as standard with all DM webapps. Delta file backed memory resident collections and the two-file binary datum repository have fared better. As both these devices write only to the file end, both are reliable and failsafe. Write errors due to a program crash are less likely and, as they can only be at the file end, they are easy to detect and easier to fix on startup. The two-file binary datum repository is now known as the base datacron, while delta file backed memory resident collections are known as serial datacrons. Both devices not only survive in the current HadronZoo software, but have a prominent position. Although neither device appears in raw form in the modern HDB, the HDB data object repositories and indexes are essentially composites of them.
HDB Configuration
The experimental webapps offered extreme performance. There was the obvious volume limitation of the serial datacron, but there were many data sets that would fit within the available RAM, and RAM complements were increasing. The uninspiring appearance was also not seen as an impediment in many potential applications. The killer was that the webapps could not be configured by external means (i.e. by config file). Collections can only be of instances of classes defined within the program code, so the data object model was hard coded. This dictated a generic solution: a single C++ class that can represent any data class, and another to store data objects of any given data class.
In the chosen solution, hdbMember describes a single data class member, while hdbClass is the generic data class representation. Data classes are defined by instantiating hdbClass and adding members. Data object repositories are created by instantiating hdbObjRepos, then initializing to a data class. hdbObject is a single data object container class that serves as the generic program/repository interface. This too is initialized to the applicable data class. To operate on a data object, it is placed in an instance of hdbObject, which has the necessary member functions to add, delete or modify data object member values.
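The sequence might look as follows. The class names are as described above, but the member function names, type constants and signatures are assumptions for illustration only, not the actual HadronZoo API.

    // Sketch only: defining a data class, a repository and a data object.
    // Function names and type constants are assumed, not actual API.
    hdbClass classProduct("product");                 // define a data class ...
    classProduct.AddMember("name",  HDB_TYPE_STRING); // ... and add members
    classProduct.AddMember("price", HDB_TYPE_UINT32);

    hdbObjRepos reposProduct("products");             // create a repository and
    reposProduct.Init(classProduct);                  // initialize it to the class

    hdbObject obj(classProduct);                      // generic program/repository interface
    obj.SetValue("name",  "widget");                  // set data object member values
    obj.SetValue("price", 499);
    reposProduct.Insert(obj);                         // commit the data object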
The config read is a one-pass process, as it was the firm intention that nothing can be used until it is first defined. For each data class member, hdbMember states the data type and whether the member is single value or an array. The set of data types used by hdbMember is the HadronZoo set of legal data types. On startup this contains only the data types that come as standard, but it expands as new data types (principally data classes and data enums) are defined in the configs. As stated in the introduction, all HadronZoo configs are in XML. The HDB tags correspond 1:1 to instances of the indicated data object model entity. This 1:1 relationship is a common sense approach, and it means that data object model entities can be created by tags in the configs, by direct C++ function calls, or a mixture of both.
Why No SQL?
Once a clear picture of the DM emerged, the next step was to ensure it handled the complete set of HTML5 tags and could cope with complex website features. Confidence in this could only come from extensive trials. There were also questions concerning the database. The current complement of repositories and indexes had already been identified, but would more be needed? Thus began the Dissemino Observations, a wide-ranging study of websites selected for their notable data functionality. The objective was not to establish how the websites actually worked, but how the same effect could be achieved with the DM and the HDB. The observations identified numerous features that were missing or poorly handled by the DM, but no examples of data functionality that would exceed the capabilities of the HDB.
The HDB is non-SQL because the Dissemino Observations found no data functionality that required SQL. Most observed queries were either single key lookups, or they involved multiple keys to narrow the search results. These do not need SQL. Other queries were more complex, and for these, SQL was not an obvious solution. A good example would be automated guidance in shopping carts, in which users are informed that "People who bought this item also bought ...". Can this be done in SQL? Yes. Can it be done efficiently in SQL? Probably. Is SQL the only method? Absolutely not! To a C++ programmer the task might present a challenge, but not one that would be simplified by SQL. So if you do not need SQL for the simple stuff and it does not help with the complicated stuff, why use it at all?
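By way of illustration, one non-SQL approach to the shopping cart example is sketched below: tally, across all orders containing the target item, how often each other item appears, then rank by count. All names and structures here are invented.

    // Sketch only: "people who bought this item also bought", without SQL.
    #include <algorithm>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using ItemId = uint32_t;

    std::vector<ItemId> alsoBought(const std::vector<std::vector<ItemId>>& orders,
                                   ItemId target, size_t topN)
    {
        // Count co-occurrences of other items in orders containing the target.
        std::unordered_map<ItemId, uint32_t> tally;
        for (const auto& order : orders)
        {
            if (std::find(order.begin(), order.end(), target) == order.end())
                continue;
            for (ItemId item : order)
                if (item != target)
                    ++tally[item];
        }

        // Rank by count and return the top N item ids.
        std::vector<std::pair<ItemId, uint32_t>> ranked(tally.begin(), tally.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
        std::vector<ItemId> result;
        for (size_t i = 0; i < ranked.size() && i < topN; ++i)
            result.push_back(ranked[i].first);
        return result;
    }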
Operational Advantages
Although the HDB is internal to the HadronZoo based programs that invoke the HDB classes, such programs can serve as external databases to other programs on the same or other machines. There is always some cost to outsourcing data operations to an external database, even if it is only the effort of query formulation and processing the response. The HDB cannot get round that one. But it can avoid other common pitfalls. A generic third party database lacks specific knowledge of the applications it serves, so its data operations are often less than optimal. External databases tend to be good at second-guessing the needs of applications, but this comes at the price of extra algorithms and the memory needed to support them. Tuning helps, but no matter how sophisticated it is, it can never match the precision the developer is afforded by an internal database.
In Summary
There are some downsides to the HDB compared to a standard SQL RDBMS. There will not be data models the HDB cannot accommodate, but there will be queries, standard in SQL, which in the HDB require specific implementation. In particular, joins are not implicit and so have to be performed explicitly. However, with hierarchical data classes rather than tables and columns, joins are generally not required. Crucially, there is no equivalent to the order-by or group-by commands in the HDB. This is because all queries return a set of one or more object ids, which are always in ascending order. Any other ordering or grouping requires a separate explicit step. In most webapps however, the permitted data operations tend to form a limited set. And if the HDB can cover that limited set, considerable performance gains are on offer.
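As a sketch of such an explicit step, the fragment below re-orders a set of object ids (returned in ascending order by a query) by a member value. FetchPrice is a hypothetical helper, not part of the HDB API.

    // Sketch only: an explicit order-by step applied to query results.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    extern uint32_t FetchPrice(uint32_t objId);   // hypothetical member value fetch

    void orderByPrice(std::vector<uint32_t>& objIds)
    {
        std::sort(objIds.begin(), objIds.end(),
                  [](uint32_t a, uint32_t b) { return FetchPrice(a) < FetchPrice(b); });
    }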
The RAM Primacy Equation
Repository memory consumption depends on what members the repository has and how they are configured. BINARY members are never memory resident and cannot be indexed. BOOL and TBOOL members are always memory resident but are their own index. The other members have the option of being memory resident and of being indexed. Index memory consumption depends partly on how the index is configured, but mostly on the index type, which is dictated by the data type of the member to which the index applies. However, all repositories and indexes rest on RAM Primacy to a greater or lesser extent, so is there a risk of swap mode?
The working assumption is that for any given Linux installation, there is a RAM size threshold above which the system will not enter swap mode when idle. Any 'safe volume' for user programs can only come from having RAM in excess of this threshold. As a margin of error, it is assumed the safe volume is at most 75 percent of this excess. Only part of the total user program space is available for RAM Primacy purposes, but as adding RAM does not in itself increase program size, additional RAM plays well in this equation. With enough RAM to support a safe volume, the bulk of any extra RAM would add to the safe volume, and the bulk of extra safe volume would be available for RAM Primacy purposes. Finding values to plug into this equation is not straightforward. However, estimates based on HadronZoo server monitor logs put the threshold at around 4.0Gb, suggesting a safe volume for a 32Gb server of around 21Gb and a total RAM Primacy capacity of 16Gb.
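The arithmetic is simple enough to state directly: safe volume = 0.75 x (installed RAM - threshold), so with the estimated 4.0Gb threshold, a 32Gb server gives 0.75 x (32 - 4) = 21Gb.

    // The safe volume estimate, as described above.
    double safeVolumeGb(double installedRamGb, double thresholdGb = 4.0)
    {
        return 0.75 * (installedRamGb - thresholdGb);   // e.g. 0.75 * (32 - 4) = 21
    }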