The HDB indexes are optional extensions to HDB data object repositories, where they are each applied to a single member of the repository native data class. As previously stated in the HDB Class Overview, there are five HDB index classes in total, all of which are derivatives of hdbIndex. The indexes were designed primarily for supporting the HDB repositories but it is worth noting that they are themselves a form of repository. They can be made persistent and can be deployed directly by applications, outside the jurisdiction of the HDB. The five HDB index classes are described below:-

hdbIndexUkey

hdbIndexUkey is applied to members that the repository requires to have unique values. The member must be of a suitable data type: either one whose values are inherently unique, such as EMADDR, URL or PHONE, or one whose values can be unique, such as APPDEF and STRING. Repositories cannot require data objects to be unique on members whose data types have a restricted range of values, as this would limit repository population. This rules out members of type BOOL, TBOOL and ENUM. Numbers, dates and times are also ruled out as these are unlikely to be unique. As previously stated, hdbIndexUkey is implemented as a 1:1 mapping between values and object ids.

Note that the application may require a form field input to be purely numeric, e.g. a customer id, but this would be happenstance: the actual data type of the member concerned would be STRING. Note also that as the role of this index class is to ensure data objects are unique and to do lookups on unique values, it is usually memory resident.
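The 1:1 mapping described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the HDB implementation: the class name UniqueKeyIndex and its methods are hypothetical, and a plain dictionary stands in for whatever structure hdbIndexUkey actually uses.

```python
class UniqueKeyIndex:
    """Hypothetical stand-in for a unique-key index: one value, one object id."""

    def __init__(self):
        self._ids = {}  # value -> object id

    def insert(self, value, obj_id):
        # Refuse the insert if another object already holds this value.
        if value in self._ids:
            return False
        self._ids[value] = obj_id
        return True

    def lookup(self, value):
        # Return the object id, or None if no object has this value.
        return self._ids.get(value)

    def delete(self, value):
        # Removing an absent value is a no-op.
        self._ids.pop(value, None)


index = UniqueKeyIndex()
index.insert("alice@example.com", 1)
index.insert("bob@example.com", 2)
duplicate_ok = index.insert("alice@example.com", 3)  # rejected: value in use
```

The insert doubles as the uniqueness check, which is why an index of this kind benefits from being memory resident.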

hdbIndexOder

hdbIndexOder is not used for members of types ENUM, TEXT, BOOL or TBOOL as these have other index classes (see below), but it is used for members of most other data types, provided the repository does NOT require the member to have a unique value. As many data objects in the repository can have the same member/value combination, hdbIndexOder is implemented as an indexed chain that maps values to idsets.
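The value-to-idset mapping can be sketched as follows, in Python for brevity. This is an illustration only: the class OrderedIndex is hypothetical, a sorted list plus a dictionary is substituted for the indexed chain, and the range lookup is an assumption about what an ordered index would offer rather than something the text above states.

```python
import bisect


class OrderedIndex:
    """Hypothetical stand-in: maps each value to the set of ids that have it."""

    def __init__(self):
        self._values = []   # distinct values, kept in sorted order
        self._idsets = {}   # value -> set of object ids

    def insert(self, value, obj_id):
        if value not in self._idsets:
            bisect.insort(self._values, value)
            self._idsets[value] = set()
        self._idsets[value].add(obj_id)

    def lookup(self, value):
        # Copy of the idset for the value (empty if the value is unknown).
        return set(self._idsets.get(value, ()))

    def range(self, low, high):
        # Union of the idsets of every value v with low <= v <= high.
        start = bisect.bisect_left(self._values, low)
        stop = bisect.bisect_right(self._values, high)
        result = set()
        for value in self._values[start:stop]:
            result |= self._idsets[value]
        return result


ages = OrderedIndex()
ages.insert(30, 1)
ages.insert(25, 2)
ages.insert(30, 3)
ages.insert(52, 4)
```

Here two objects share the value 30, so a lookup on 30 yields an idset of two ids rather than a single id.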

hdbIndexEnum

hdbIndexEnum is implemented as a 1:1 mapping of enumerated values to idsets of objects where the applicable member has the value. The data type must be a defined ENUM.

hdbIndexText

hdbIndexText is a simple free text index, implemented as a 1:1 mapping of words to idsets. Acceptable data types are TEXT and DOC. Free text systems vary but the basic idea is that for each word occurring in a document collection, there is a set of ids of documents that contain the word. A single word search returns the idset for the word. A search on multiple words takes the idset for each word and applies boolean operators to arrive at a resultant set. Boolean operators are either supplied as part of the search criteria or are implied. A search on "Quantum energy" for example, will find documents that contain both of the words 'quantum' and 'energy'.
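The word-to-idset scheme and the implied AND can be sketched as below, in Python for brevity. The class TextIndex and its methods are hypothetical, tokenization is reduced to lowercasing and splitting on whitespace, and plain sets stand in for encoded idsets.

```python
class TextIndex:
    """Hypothetical free text index: each word maps to a set of document ids."""

    def __init__(self):
        self._idsets = {}  # word -> set of document ids

    def add_document(self, doc_id, text):
        for word in text.lower().split():
            self._idsets.setdefault(word, set()).add(doc_id)

    def search_all(self, query):
        # Implied AND: intersect the idsets of every word in the query.
        words = query.lower().split()
        if not words:
            return set()
        result = set(self._idsets.get(words[0], ()))
        for word in words[1:]:
            result &= self._idsets.get(word, set())
        return result

    def search_any(self, query):
        # Explicit OR: union of the idsets of every word in the query.
        result = set()
        for word in query.lower().split():
            result |= self._idsets.get(word, set())
        return result


docs = TextIndex()
docs.add_document(1, "Quantum energy levels in atoms")
docs.add_document(2, "Energy prices rise again")
docs.add_document(3, "Quantum computing explained")

both = docs.search_all("Quantum energy")
either = docs.search_any("Quantum energy")
```

With these three documents, only document 1 contains both words, while all three contain at least one of them.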

The process is simple but with potentially millions of documents totaling billions of words, the volumes are not. Firstly, the number of unique words, and therefore of idsets, can grow very large very quickly. Documents can span multiple languages and contain, inter alia, numbers and nomenclature specific to particular disciplines. Then there is the size of idsets to consider. Most idsets will be small, a fact which greatly aids searching. Others however will not be small and a significant number of them will have populations that are a good fraction of the total number of documents in the repository.

Idsets do not conform to any particular pattern. Densely populated idsets corresponding to common words tend to have more even distributions of ids - but that is about all that can be said. In a repository of news articles for example, given that ids reflect the order of arrival, words that were often used in connection with a particular news story but were otherwise rare would have clumps of ids corresponding to the period when the story dominated the headlines. Apart from this, id distributions are random.

In the design process, the question of how to best order free text search results was given due consideration. It is common practice to present free text search results in order of 'relevance', but this term is poorly defined. One possible measure of relevance would be the number of times words appeared in the documents found. Three methods were identified as follows:-

(a) Storing a count or an approximate count along with every document id in the set for each word

(b) A post search word count (given the objective of a search is generally to narrow down the results)

(c) Distinguishing between words in different sections of the article (e.g. the title).

The view taken was that (a) would detract significantly from the ability to compact idsets and so was not viable. (b) would be hopelessly inefficient where the search produced more than a page or two of results. While hundreds of documents could be scanned within what would be a reasonable period for a user to wait, the presumption must be that there would be many other users. (c) is considered viable but implementation of it would be at the application level and so need not impact idset encoding.

hdbIndexBool

hdbIndexBool is a single persistent idset. Within data object repositories, hdbIndexBool serves as both index and data store for BOOL and TBOOL data class members. Instead of these members occupying data object space (EDO space), the repository maintains a single hdbIndexBool instance in respect of a BOOL member, and a pair of hdbIndexBool instances in respect of a TBOOL member.
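How one idset can act as both index and data store can be sketched as follows, in Python for brevity. The class names are hypothetical. How the hdbIndexBool pair encodes the three TBOOL states is not stated above, so the sketch assumes one idset for true, one for false, and absence from both for the undetermined state.

```python
class BoolIndex:
    """Hypothetical single idset: an id is present exactly when the member is true."""

    def __init__(self):
        self._ids = set()

    def assign(self, obj_id, flag):
        if flag:
            self._ids.add(obj_id)
        else:
            self._ids.discard(obj_id)

    def get(self, obj_id):
        # The stored value is simply membership of the idset.
        return obj_id in self._ids

    def ids(self):
        # The idset doubles as the answer to "which objects are true?".
        return set(self._ids)


class TriBoolPair:
    """Assumed encoding: one idset for true, one for false, neither for unknown."""

    def __init__(self):
        self._true = BoolIndex()
        self._false = BoolIndex()

    def assign(self, obj_id, state):
        # state is True, False or None (unknown).
        self._true.assign(obj_id, state is True)
        self._false.assign(obj_id, state is False)

    def get(self, obj_id):
        if self._true.get(obj_id):
            return True
        if self._false.get(obj_id):
            return False
        return None


active = BoolIndex()
active.assign(10, True)
active.assign(11, False)

consent = TriBoolPair()
consent.assign(10, True)
consent.assign(11, False)
consent.assign(12, None)
```

No per-object storage is needed for the member: reading the value is a membership test, and searching on it is a read of the idset.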

Index Persistence

As mentioned in the HDB Class Overview, indexes are optionally applied to members, and the type of index is dictated by the member data type. Indexes have been affected by the data object repository merger of hdbObjCache and hdbObjStore into hdbObjRepos. Hitherto, indexes for hdbObjCache were memory resident and had no need to persist, as the data and index states were restored on startup by re-running the entire history of INSERT, DELETE and UPDATE operations from the delta file. This restoration process used to be fast, but as RAM complements and data volumes grew, it became too slow. In the current HDB, hdbObjRepos uses improved versions of the indexes developed for hdbObjStore. These are all persistent but have configurable levels of RAM Primacy.

The available HDB indexes are derivatives of the hdbIndex base class as follows:-

hdbIndexUkey: 1:1 mapping between values and object ids. Only used where the repository requires objects to be unique on the applicable member.

hdbIndexOder: 1:1 mapping between values and sets of ids (idsets). This is used where the repository does NOT require objects to be unique on the applicable member.

hdbIndexEnum: 1:1 mapping of enumerated values to idsets of objects where the applicable member has the value. ENUM is the only permitted data type.

hdbIndexText: 1:1 mapping of words to idsets of objects whose applicable member has these words. Acceptable data types are TEXT and TXTDOC.

hdbIndexBool: Single idset. Acts as both data store and index. BOOL member values are held in a single hdbIndexBool and TBOOL member values in a hdbIndexBool pair.

Of the above, only hdbIndexUkey has no idset. hdbIndexUkey takes care of what is arguably the most important type of search in applications: the lookup of a single data object on a single unique key. However, the other index types have crucial roles, and rendering them persistent means rendering idsets persistent. It should be noted that idsets are compact. The encoding method can pack several thousand ids into a 4KB block. Each of these ids is a potential reason for the block to change, as are ids which are not in the block but could be. Even before the SSD it was a bad idea to write out a whole block every time a few bits changed. Now that the SSD is here it is an even worse idea.

The need for idset persistence is one reason why data object ids and binary datum ids are non-recurring. By running search results against a dead-index of deleted ids, repositories are able to delay id removal from indexes and pass the task to a background thread. This is not a cure-all, as indexes have changes other than deletions to contend with. To this end, indexes have their own delta files and, once these reach a preset size, the index is rationalized by the background thread.
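The dead-index idea can be sketched as below, in Python for brevity. The function names are hypothetical and plain sets stand in for persistent idsets. The sketch relies on ids being non-recurring: an id in the dead-index can never be reissued to a live object, so subtracting the dead ids from a result is always safe.

```python
def live_results(candidate_ids, dead_ids):
    # Search time: filter out deleted ids without touching the stored idsets.
    return candidate_ids - dead_ids


def rationalize(idsets, dead_ids):
    # Background task: physically remove dead ids from every idset,
    # drop values left with no ids, then empty the dead-index.
    for value in list(idsets):
        idsets[value] -= dead_ids
        if not idsets[value]:
            del idsets[value]
    dead_ids.clear()


idsets = {"red": {1, 2, 3}, "blue": {2}}
dead = {2}                                   # object 2 has been deleted

found = live_results(idsets["red"], dead)    # stored idset still contains 2
rationalize(idsets, dead)                    # deferred clean-up
```

The deletion itself costs one addition to the dead-index. The expensive rewrite of the affected idsets happens later and in bulk.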

The delta formats are as follows. For hdbIndexOrdr, hdbIndexEnum and hdbIndexText, "value+id" adds an id to the idset for the value and "value-id" removes the id from the idset. Where the value itself is new, it is added; if a delete leaves the value with an empty idset, the value is removed. hdbIndexBool deltas are even simpler: either +id or -id. Although the value part of a delta could be quite long, each delta is substantially smaller than a 4KB block. Typically there would be about 150 deltas for each block write to the delta file, representing significant relief to the SSD.
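Applying "value+id" and "value-id" deltas can be sketched as follows, in Python for brevity. The function names are hypothetical, and the text-line representation with a decimal id is an assumption: the actual on-disk encoding of the delta file is not described here. As values may themselves contain '+' or '-', the sketch treats the last sign followed only by digits as the operator.

```python
import re

# Greedy value, then the operator, then a decimal id running to the end.
DELTA = re.compile(r"(.*)([+-])(\d+)", re.DOTALL)


def apply_delta(idsets, line):
    match = DELTA.fullmatch(line)
    if match is None:
        raise ValueError(f"malformed delta: {line!r}")
    value, op, obj_id = match.group(1), match.group(2), int(match.group(3))
    if op == "+":
        # A new value is added along with its first id.
        idsets.setdefault(value, set()).add(obj_id)
    else:
        ids = idsets.get(value)
        if ids is not None:
            ids.discard(obj_id)
            if not ids:
                # A delete that empties the idset removes the value.
                del idsets[value]


def replay(lines):
    # Rebuild the value -> idset mapping from a sequence of deltas.
    idsets = {}
    for line in lines:
        apply_delta(idsets, line)
    return idsets


log = ["quantum+1", "quantum+7", "energy+7", "well-known+4",
       "quantum-1", "energy-7"]
state = replay(log)
```

After the replay, 'quantum' retains id 7, 'energy' has been removed because its idset became empty, and the hyphenated value 'well-known' is parsed intact.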