Introduction

The delta server (DS) is middleware that mirrors deltas (data state changes) across an agreed cluster of machines in real time. This backs up data, enables redundant-server resilience strategies, and facilitates the rollout of new and upgraded services. The DS manifests as a singleton server process on each machine in the cluster, and is expected to be omnipresent. Delta mirroring is conceptually similar to cloud computing and has similar benefits. To be clear however, delta mirroring is NOT cloud computing. HadronZoo software does not run on cloud platforms, and the DS mirrors data resource deltas and nothing but data resource deltas. Absolutely nothing else goes across the network.

Programs that host data resources that export HDN deltas, such as the DWE and the single repository microservices (SRMs), connect to the DS as clients. From the DS perspective, the demarcation of significance is the app(lication), as defined by an ADP. An app in this sense is essentially a service, so the two terms may be used interchangeably. Note the DWE is unusual in that it can support multiple webapps, each of which is an independent service with its own ADP.

In DS configs, each app has a server arrangement (SA). The SA states which machine(s) have what role in respect of each data resource. Note that within each SA, only data resources internal to the host program are of interest. External data resources (invariably data object repositories availed by an SRM) are not taken into account. This is because SRMs host their respective repositories internally and can be assumed to have their own connection to the DS and their own ADP.
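
As a rough illustration of what an SA amounts to, the sketch below models one in C++. The type and member names (Role, ResourceRoles, ServerArrangement) are invented for this document; they are not the DS config syntax or any HadronZoo API.

    #include <map>
    #include <string>

    // Illustrative sketch only: the roles a machine can play for a data resource.
    enum class Role { LIVE, STANDBY, BACKUP };

    // For one internal data resource: which machine plays which role.
    struct ResourceRoles
    {
        std::map<std::string, Role> roleByMachine;      // machine name -> role
    };

    // A server arrangement (SA) for one app, as defined by its ADP. External
    // resources (repositories hosted by an SRM) do not appear here; the SRM
    // has its own connection to the DS and its own ADP.
    struct ServerArrangement
    {
        std::string appName;
        std::map<std::string, ResourceRoles> resources; // resource name -> roles
    };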

For any given data resource, only one instance of the app that hosts the resource can be authoritative and change the data state at any given time. As apps equate to services, i.e. are available at a given IP address and port, apps monopolize the host machine: each machine running the app will have only one instance of it. It therefore makes sense to replace 'instance of the app that hosts the resource' with 'machine'. Thus for any given data resource, only one machine is authoritative at any given time. The machine that is authoritative is said to be in LIVE mode.

LIVE mode requires a machine to be live-capable, i.e. to have a fixed IP address. Other live-capable machines may run the app at the same time, but will not be authoritative. This is known as STANDBY mode. The purpose of STANDBY is twofold: to load balance on FETCH; and to be ready for a switch to LIVE operation in the event of the current LIVE server crashing or suffering a network outage. Because the app is already up and running and up to date, delays due to program initialization are avoided.

There is also BACKUP mode, in which the machine does NOT run the app and the DS simply stores deltas on the app's behalf. The app can be manually started on the machine, and if the machine is live-capable, the machine can be manually placed in LIVE or STANDBY mode. However, BACKUP mode does not require a live-capable machine. In the minimum delta configuration, one live-capable machine runs the app LIVE, while another machine without a fixed IP address (e.g. a home computer) acts as BACKUP.

DWE instances, SRMs and other hzIpServer based programs can typically handle around 2,000 requests per second, and about half this where requests result in a data state change. In cases where this performance suffices, the app is usually run 'whole', meaning it is entirely hosted on a single machine. Typically one machine will run the app LIVE, while another runs the app in STANDBY mode or acts purely as BACKUP. The SA in this case will name the same machines and cast them in the same roles for every resource. Apps expecting higher traffic volumes will have different arrangements. They may have several front line servers that are essentially inert, with the data resources facing the heaviest demands hosted by dedicated servers.
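
For the 'whole' case just described, the SA is uniform. A minimal sketch follows; the machine names alpha and beta and the resource names are hypothetical, and roles are reduced to plain strings.

    #include <map>
    #include <string>

    // Hypothetical 'whole app' arrangement: every resource names the same two
    // machines in the same roles. alpha runs the app LIVE; beta does not run
    // the app at all and merely holds deltas as BACKUP.
    const std::map<std::string, std::map<std::string, std::string>> wholeAppSA =
    {
        { "accounts", { { "alpha", "LIVE" }, { "beta", "BACKUP" } } },
        { "orders",   { { "alpha", "LIVE" }, { "beta", "BACKUP" } } },
        { "sessions", { { "alpha", "LIVE" }, { "beta", "BACKUP" } } },
    };
    // A high-traffic app would differ: its heaviest resources would name
    // dedicated servers, so entries would vary from resource to resource.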

Why Middleware? The DS Rationale

It is firmly intended that the DS has a complete monopoly in the matter of facilitating and directing inter-server delta flows. Programs that need to mirror deltas are expected to outsource this task to the DS, and NOT attempt to mirror deltas to counterpart programs on other machines themselves. It is worth understanding the rationale behind this regime.

The DS has two core functions as follows:-

1) Receive and store origin deltas from local programs, and transmit them to the DS on the next machine (as named in the applicable data resource entry in the applicable SA).
2) Receive and store transmission deltas from other DS, and pass them as notification deltas to the applicable local program.
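
A minimal sketch of these two paths, with invented names throughout (storeDelta, forwardToNextDS and notifyLocalProgram stand in for the DS's actual storage and networking):

    #include <string>
    #include <vector>

    // Illustrative sketch of the two core DS paths. All names are invented;
    // storage and networking are reduced to stand-in functions.
    struct Delta
    {
        std::string       app;       // which ADP the delta belongs to
        std::string       resource;  // which data resource within the app
        long              seqNo;     // sequence number within the resource
        std::vector<char> payload;   // the data state change itself
    };

    void storeDelta(const Delta&)         { /* append to the local delta store */ }
    bool forwardToNextDS(const Delta&)    { /* send to the DS on the next machine in the SA */ return true; }
    bool notifyLocalProgram(const Delta&) { /* pass to the applicable local program */ return true; }

    // (1) Origin delta from a local program: store, then transmit onward.
    void onOriginDelta(const Delta& d)
    {
        storeDelta(d);
        forwardToNextDS(d);     // may fail if the network or target DS is out;
                                // the stored copy allows the send to be retried
    }

    // (2) Transmission delta from another DS: store, then notify locally.
    void onTransmissionDelta(const Delta& d)
    {
        storeDelta(d);
        notifyLocalProgram(d);  // may fail if the local program is out;
                                // again the stored copy covers the gap
    }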

Transmission fails in (1) if the network or the target DS is out, and in (2) if the local program is out. However, in both cases all deltas are stored, and storage only fails in the event of a crash or if the device is out of space. In (1), if the cluster has more than two machines, mirroring is still achieved by sending the deltas to the machine beyond the next. In all cases, failures are easy to rectify. The recovering DS only has to ask its counterparts for deltas later than the latest it has, in respect of each data resource in each SA.
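
Recovery thus has the following shape, again with invented names (latestSeqNo and requestDeltasAfter are stand-ins for whatever the DS actually does on the wire):

    #include <string>
    #include <vector>

    // Illustrative recovery sketch: for each data resource in each SA, ask a
    // counterpart DS for all deltas later than the latest held locally.
    struct ResourceRef { std::string app; std::string resource; };

    long latestSeqNo(const ResourceRef&) { return 0; }   // stand-in: highest sequence number held
    bool requestDeltasAfter(const std::string&, const ResourceRef&, long)
    {
        return true;                                     // stand-in: true if the peer supplied the deltas
    }

    void recover(const std::vector<std::string>& peers,
                 const std::vector<ResourceRef>& resources)
    {
        for (const ResourceRef& r : resources)
            for (const std::string& peer : peers)
                if (requestDeltasAfter(peer, r, latestSeqNo(r)))
                    break;                               // this resource is now up to date
    }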

DS operation is simple, but for it to work the client programs must connect to the DS, agree an ADP with the DS, send origin deltas to the DS and act upon notification deltas from the DS. If the clients are going to go to all this effort to talk to the DS, and if mirroring is simple, why not just mirror deltas without the DS? This is a very bad idea for the following reasons:-

1) No effort would be saved. Clients would have to go to at least the same effort to negotiate a connection to a counterpart as they would to the DS.
2) The DS is simpler than the would-be clients, so it is more reliable. As the DS does nothing except mirror deltas, maintenance downtime is rare. Memory consumption is low, so the DS is not prone to swap mode seizure. To all practical intents and purposes, the DS is robust and can be assumed to be omnipresent.
3) The DS has a superior vantage point. Server programs will not necessarily be aware of outages. Cessation of incoming requests may be due to a lack of traffic. That outgoing responses fail could be due to browsers disconnecting. Inter-DS connections are mostly to fixed IP addresses and are always intended to stay open. Outage detection is much easier that way. In the general case, publicly available services don't have that luxury.

Omnipresence and outage awareness are important in outage countermeasures. In the event of an outage on the live machine, the DS on that machine, which is also unable to reach the outside world and so cannot communicate with any other DS, will set all LIVE apps to STANDBY. On the machine where the apps are in STANDBY, the apps are made LIVE. Setting LIVE apps to STANDBY on the dark machine prevents the apps from resuming service with out-of-date data.
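
A sketch of the countermeasure, with the mode changes reduced to invented helper functions:

    #include <string>
    #include <vector>

    // Illustrative outage countermeasure. All names are invented.
    enum class Mode { LIVE, STANDBY, BACKUP };

    struct App { std::string name; Mode mode; };

    // On the dark machine (its DS cannot reach any other DS): demote LIVE apps
    // so they cannot later resume service with out-of-date data.
    void onLocalOutage(std::vector<App>& localApps)
    {
        for (App& a : localApps)
            if (a.mode == Mode::LIVE)
                a.mode = Mode::STANDBY;
    }

    // On a machine holding STANDBY apps whose LIVE counterpart has gone dark:
    // promote them so that service continues.
    void onLiveCounterpartLost(std::vector<App>& standbyApps)
    {
        for (App& a : standbyApps)
            if (a.mode == Mode::STANDBY)
                a.mode = Mode::LIVE;
    }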

Could programs do all this themselves? Yes. Would it make sense for them all to do this? No. Therein lies the rationale for the DS.

Not Compulsory

Given DS robustness and the critical importance of backing up live service data, it may seem sensible to force DS use upon programs and, by extension, to require mirror confirmation before data transactions are confirmed to the user. In reality, forced DS use is a very bad idea. Mirroring is not possible without a cluster of at least two machines, and it would be absurd if HadronZoo software was unable to run on a single machine. Requiring mirror confirmation is an even worse idea. It cannot make sense for services to go offline because deltas already committed to persistent media cannot be immediately mirrored. Accordingly, the DS is optional and mirror confirmations are not required.

Equal Servers

The DS configs begin with the list of machines in the cluster. While mirroring requires at least two machines, mirroring itself is not compulsory, so the list can be singular. Against each machine is a name, a password and an IP address, which must be either fixed or left blank. If an IP address is supplied, the machine will be live-capable. If not, the machine cannot be LIVE and can only be used as BACKUP. As should be clear, at least one machine must be live-capable.

All delta servers in the cluster are of equal authority. There is no hierarchy among the servers and nothing sits in the middle to direct proceedings. There is however, one rule of precedence: what is established takes precedence over what is new. Upon startup, a DS reads its configs (which include the list of machines). The starting DS then goes through the list until it finds an established DS, from which it obtains the current working configs. It then compares these to its own configs. The comparison is similar to the ADP comparison made by DeltaInit(). The two configs can differ but they must not contradict. Each config can contain entities the other omits, but where an entity is common to both, both configs must agree exactly on the detail of that entity. If there is not complete agreement on every common entity, the starting DS assumes its configs are in error and shuts down.
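
The 'must not contradict' test can be sketched as follows, with each config reduced to a map of entity name to detail (the real configs are of course richer than a pair of strings):

    #include <map>
    #include <string>

    // Illustrative sketch of the 'no contradiction' rule. Either config may
    // contain entities the other omits, but every common entity must agree
    // exactly. Returns false on any contradiction, in which case the starting
    // DS assumes its own configs are in error and shuts down.
    bool configsCompatible(const std::map<std::string, std::string>& starting,
                           const std::map<std::string, std::string>& established)
    {
        for (const auto& entity : starting)
        {
            auto it = established.find(entity.first);
            if (it != established.end() && it->second != entity.second)
                return false;   // common entity, detail differs: contradiction
        }
        return true;
    }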

DS Service Commencement

If no contradictions arise, the DS proceeds to the next initialization stage. This incorporates entities that are new to the DS, if any are found in the DS configs on other machines in the cluster. This step must also avoid config contradictions, but these will not arise if due process has been followed in the app upgrade.

The final initialization step is announcement, to the DS of the other machines, of the configs the starting DS intends to operate under. At this juncture, no aspect of the configs on the other machines will be in advance of these configs. If, as is more likely, the starting DS configs are in advance of the established configs, the other DS instances will sync their configs to those of the starting DS.

The starting DS is now initialized with the latest data model, and can go into service. This will not necessarily be true of the apps on the local machine. Apps intended to be LIVE on the local machine may have been LIVE on another machine whilst the DS was out. If so, the app must acquire the same data state before going LIVE. Once this has been achieved, the local DS notifies the other DS of the fact.

Syncing data and service commencement is done one app at a time. For every app the starting DS is responsible for, it is a case of sync, then make LIVE. Deltas have sequence numbers, so syncing is a matter of requesting, for each data resource in turn, all deltas above the sequence number the local delta server currently holds.

Delta servers responding to update requests send the deltas for the applicable data resource, including those arising in the meantime. They then send an 'update complete' message. Thereafter, the starting delta server is deemed operational for that data resource and is expected to receive, store and pass on deltas in respect of it, as per normal. When this is done for every applicable data resource, the delta server considers itself fully operational and gives the 'all clear' for applications running on the local machine to proceed. The applications can then accept client connections.
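
Putting the last two steps together, the per-app commencement can be sketched as below. The function names (requestDeltasAbove, awaitUpdateComplete, makeLive, announceLive) are invented stand-ins, not the DS API.

    #include <string>
    #include <vector>

    // Illustrative per-app commencement: sync each data resource, then make
    // the app LIVE and tell the other delta servers.
    struct Resource { std::string name; long localSeqNo; };

    void requestDeltasAbove(const std::string&, long) {}  // ask an up-to-date DS for newer deltas
    void awaitUpdateComplete(const std::string&)      {}  // responder also sends deltas arising meanwhile
    void makeLive(const std::string&)                 {}  // the app may now accept client connections
    void announceLive(const std::string&)             {}  // notify the DS on the other machines

    void commenceApp(const std::string& app, const std::vector<Resource>& resources)
    {
        for (const Resource& r : resources)
        {
            requestDeltasAbove(r.name, r.localSeqNo);
            awaitUpdateComplete(r.name);   // thereafter, deltas for this resource are
                                           // received, stored and passed on as per normal
        }
        makeLive(app);                     // the 'all clear' for the local app
        announceLive(app);
    }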

Note that the protocol whereby the established takes precedence over the new only ensures that all the delta servers in the cluster agree on the configuration. It does not ensure the configuration is correct. That is the job of the app designer!

Server Positioning

If the objective is simply to improve upon the uptime percentages offered by data centers, placing two or more servers in different data centers is a good starting point. Some data centers claim uptimes of 99.99 percent, which suggests outages will amount to no more than 53 minutes a year. Such figures are presumably based on the SLA (Service Level Agreement) between the data center and the network provider. The data center as a whole might see 99.99 percent uptime, but that does not necessarily mean your server will. HadronZoo considers uptimes of 99.9 percent to be more realistic and advises clients to assume only 99.8 percent to be on the safe side. Even on this cautious basis, the odds of having at least one of two servers up at any given time rise to an impressive 99.9996 percent.
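
The arithmetic behind these figures, assuming the two servers fail independently:

    99.99 percent uptime:  0.0001 x 365.25 x 24 x 60  =  approx 53 minutes of downtime a year

    Two servers at 99.8 percent each:
        P(both down)       = 0.002 x 0.002 = 0.000004
        P(at least one up) = 1 - 0.000004  = 0.999996   (99.9996 percent)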

The location of the data center is often chosen on the basis of the anticipated user base. Network transmission is fast (some two thirds the speed of light), but it makes sense for the server to be on the same continent as the bulk of the users, since average latency is reduced. Where the anticipated user base is more global, lower latency can be achieved by placing servers in two or more continents. However, network transmission time between the front line server and the browser is only one, relatively minor, consideration. The application may depend on inter-server communication in order to respond to requests. Where this can be done within the same data center it is not an issue, but if it is necessary for two or more front line servers to coordinate their efforts, the transmission time between them should be as short as possible. For this, the data centers would ideally be in the same city.