Error codes: The hzEcode class
In the HadronZoo library, many functions return an hzEcode instance by value. Unless otherwise stated, if a function allocates a resource it will return a pointer to that resource. If a function performs a simple test it will return true or false. If a class member function modifies a class instance it will return a reference to the changed instance. Where a size or a count is requested, or where it is otherwise appropriate to return a number, the function will duly do so. Some functions return void, either because they have nothing to return or because no purpose would be served by checking the return value. But if a function falls outside any of the above, it will almost certainly return an hzEcode.
hzEcode is really an enum but was redefined as a class in order to improve type checking by the compiler, particularly to avoid confusion with integers. There was a plan to extend this approach to other HadronZoo enums. This has been suspended, however, pending consideration of the benefits or otherwise of C++1x, which HadronZoo has yet to adopt.
HadronZoo loosely classifies error conditions, and thus error codes, by their nature and by their consequence. As a rule, libraries are servants of programs, so library functions should limit themselves to obeying orders and reporting the result. It is for the calling program to decide what action should be taken in the event of an error. That said, there is a clear benefit to standardizing error handling across all programs, which puts the onus back on the library.
The initial thinking on this was to draw a distinction between programs that process data until complete, and programs that operate as omnipresent servers. The former terminate if denied a resource, at which point it is over to the system administrators to resolve the matter. The latter must handle resource denial without terminating, in order to go on to process requests for which the resources can be obtained. It is only acceptable for server programs to exit during initialization.
It was quickly realized the distinction had little impact on error codes. In terms of access to files, either the paths and access rights were set up correctly or they were not, and if they were not, there was nothing a program could do about it. If the error code was anything other than E_OK, programs in the former group would terminate, while programs in the latter would trot out the stock answer of "Page not found". Regardless of the action taken, it was still incumbent upon both groups to properly report the exact nature of the error because unless they did, administrators would not be aided in the matter of rectification. In the case of files, and for that matter other resources such as shared memory segments, the error code regime would be the same in both groups. In the light of this, the distinction was abandoned.
Another important resource class is other services, most notably the DNS. Attempts to access these resources will have one of three outcomes: success, failure, or try again. While it can be difficult to properly think through how programs should respond to this, the error codes needed to report these outcomes are very clear.
The only remaining hazard to a server program was running out of memory or disk space. These conditions however are best detected and rectified by separate omnipresent system health check programs, not by programs providing an online service. In the end though, there are limits to what such health monitors can do. When disk space is running low, such monitors can free up pre-allocated disk space - until they run out. When memory is running low, a monitor could kill non-essential programs, except that there should not really be any! There is no solace in pre-allocation, as all this will do is bring forward the onset of swapping. Given the HadronZoo emphasis on low latency, swapping is a disaster. Be that as it may, you are not going to get an actual allocation failure until the swap space runs out, usually quite some time after server programs have begun to perform below standard. For this reason HadronZoo has retained the simple policy of exit on memory allocation failure in all cases. That is not particularly sophisticated, but there really is no point doing anything else.
In many respects hzEcode was ill conceived and arguably, wholesale adoption of errno would have been a more sensible approach. However, there are hzEcode values for which there are no direct errno equivalents. Furthermore, there is merit in enhancing the feedback. A simple error code is not sufficient on its own. While it can easily be translated into text by means of an array of strings, this would provide no operational context. hzEcode works in conjunction with a number of logging functions, and with a regime in which function calls are stacked in order to provide a properly textual stack trace.
Logging
The HadronZoo logging regime covers the following:-
1) Periodic log files.
2) Dynamic debug levels.
3) A naming regime to build function names into log output.
4) A protocol for logging via the HadronZoo logserver.
5) The hzLogger class for writing log output to.
The hzProcess class
HadronZoo based programs must declare a hzProcess instance per thread in all cases. The purpose of the hzProcess class is to hold information about each thread, thereby facilitating anything that might need to be done on a per-thread basis. To a degree this is more about future proofing than anything else. Currently, hzProcess does little more than maintain a call stack for diagnostic purposes in the event of a crash or terminal error. The macro _hzfunc(function_name) declares a static string to record the function name, but also creates an instance of _hz_func_reg. It is the constructor of _hz_func_reg that places the pointer to the function name string on the stack for the applicable thread.
Shared Memory Segments
Shared memory segments are used to store reference data which is expected to be common to multiple programs, such as country codes and the map of IP address blocks to locations. If the data is used by only one program, the cost of this approach is negligible; with two or more, you are ahead. Shared memory also plays an important role in diagnostics and program recovery, and is an important means of inter-process communication.
All shared memory segments are wiped when the server itself goes down, but segments survive the crash of the program that created them, and the crash of any program that uses them. This will always be a valuable feature. Programs can be brought to a state where they are very reliable, but 'very' is not 100 percent. When a program crashes and the server remains up, the segments can contain clues that would be difficult or costly to capture in logfiles, and starting points for recovery, such as the exact state of repositories.