4 Failure Model

Distributed systems have the partial failure property, that is, part of the system can fail while the rest continues to work. Partial failures are not at all rare. Properly-designed applications must take them into account. This is both good and bad for application design. The bad part is that it makes applications more complex. The good part is that applications can take advantage of the redundancy offered by distributed systems to become more robust.

The Mozart failure model defines what failures are recognized by the system and how they are reflected in the language. The system recognizes permanent site failures that are instantaneous and both temporary and permanent communication failures. The permanent site failure mode is more generally known as fail-silent with failure detection, that is, a site stops working instantaneously, does not communicate with other sites from that point onwards, and the stop can be detected from the outside. The system provides mechanisms to program with language entities that are subject to failures.

The Mozart failure model is accessed through the module Fault. This chapter explains and justifies this functionality, and gives examples showing how to use it. We present the failure model in two steps: the basic model and the advanced model. To start writing fault-tolerant applications it is enough to understand the basic model. To build fault-tolerant abstractions it is often necessary to use the advanced model.

In its current state, the Mozart system provides only the primitive operations needed to detect failure and reflect it in the language. The design and implementation of fault-tolerant abstractions within the language by using these primitives is the subject of ongoing research. This chapter and the next one give the first results of this research. All comments and suggestions for improvements are welcome.

4.1 Fault states

All failure modes are defined with respect to both a language entity and a particular site. For example, one would like to send a message to a port from a given site. The site may or may not be able to send the message. A language entity can be in three fault states on a given site:

If the entity is currently not working, then it is guaranteed that the fault state will be either tempFail or permFail. The system cannot always determine whether a fault is temporary or permanent. In particular, a tempFail may hide a site crash. However, network failures can always be considered temporary since the system actively tries to reestablish another connection.

4.1.1 Temporary faults

The fault state tempFail exists to allow the application to react quickly to temporary network problems. It is raised by the system as soon as a network problem is recognized. It is therefore fundamentally different from a time-out. For example, TCP gives a time-out after some minutes. This duration has been chosen to be very long, approximating infinity from the viewpoint of the network connection. After the time-out, one can be sure that the connection is no longer working.

The purpose of tempFail is quite different from a time-out. It is to inform the application of network problems, not to mark the end of a connection. For example, an application might be connected to a given server. If there are problems with this server, the application would like to be informed quickly so that it can try connecting to another server. A tempFail fault state will therefore be relatively frequent, much more frequent than a time-out. In most cases, a tempFail fault state will eventually go away.

It is possible for a tempFail state to last forever. For example, if a user disconnects the network connection of a laptop machine, then only he or she knows whether the problem is permanent. The application cannot in general know this. The decision whether to continue waiting or to stop the wait can cut through all levels of abstraction to appear at the top level (i.e., the user). The application might then pop up a window to ask the user whether to continue waiting or not. The important thing is that the network layer does not make this decision; the application is completely free to decide or to let the user decide.

4.1.2 Remote problems

The local fault states ok, tempFail, and permFail say whether an entity operation can be performed locally. An entity can also contain information about the fault states on other sites. For example, say the current site is waiting for a variable binding, but the remote site that will do the binding has crashed. The current site can find this out. The following remote problems are identified:

All of these cases are identified by the fault state remoteProblem(I), where the argument I identifies the problem. A permanent remote problem never goes away. A temporary remote problem can go away, just like a tempFail.

Even if there exists a remote problem, it is not always possible to return a remoteProblem fault state. This happens if there are problems with a proxy that the owner site does not know about. This also happens if the owner site is inaccessible. In that case it might not be possible to learn anything about the remote sites.

The complete fault state of an entity consists of at most one element from the set {tempFail, permFail} together with at most two elements from the set {remoteProblem(permSome), remoteProblem(permAll), remoteProblem(tempSome), remoteProblem(tempAll)}. Permanent remote problems mask temporary ones, i.e., if remoteProblem(permSome) is detected then remoteProblem(tempSome) cannot be detected. If a (temporary or permanent) problem exists on all remote sites (e.g., remoteProblem(permAll)) then the problem also exists on some sites (e.g., remoteProblem(permSome)).

4.2 Basic model

We present the failure model in two steps: the basic model and the advanced model. The simplest way to start writing fault-tolerant applications is to use the basic model. The basic model allows to enable or disable synchronous exceptions on language entities. That is, attempting to perform operations on entities with faults will either block or raise an exception without doing the operation. The fault detection can be enabled separately on each of two levels: a per-site level or a per-thread level (see Section 4.2.4).

Exceptions can be enabled on logic variables, ports, objects, cells, and locks. All other entities, e.g., records, procedures, classes, and functors, will never raise an exception since they have no fault states (see Section 4.4.1). Attempting to enable an exception on such an entity is allowed but has no observable effect.

The advanced model allows to install or deinstall handlers and watchers on entities. These are procedures that are invoked when there is a failure. Handlers are invoked synchronously (when attempting to perform an operation) and watchers are invoked asynchronously (in their own thread as soon as the fault state is known). The advanced model is explained in Section 4.3.

4.2.1 Enabling exceptions on faults

By default, new entities are set up so that an exception will be raised on fault states tempFail or permFail. The following operations are provided to do other kinds of fault detection:

fun {Fault.defaultEnable FStates}

sets the site's default for detected fault states to FStates. Each site has a default that is set independently of that of other sites. Enabling site or thread-level detection for an entity overrides this default. Attempting to perform an operation on an entity with a fault state in the default FStates raises an exception. The FStates can be changed as often as desired. When the system starts up, the defaults are set up as if the call {Fault.defaultEnable [tempFail permFail]} had been done.

fun {Fault.defaultDisable}

disables the default fault detection. This function is included for symmetry. It is exactly the same as doing {Fault.defaultEnable nil}.

fun {Fault.enable Entity Level FStates}

is a more targeted way to do fault detection. It enables fault detection on a given entity at a given level. If a fault in FStates occurs while attempting an operation at the given level, then an exception is raised instead of doing the operation. The Entity is a reference to any language entity. Exceptions are enabled only if the entity is an object, cell, port, lock, or logic variable. The Level is site, 'thread'(this) (for the current thread), or 'thread'(T) (for an arbitrary thread identifier T).1 More information on levels is given in Section 4.2.4.

fun {Fault.disable Entity Level}

disables fault detection on the given entity at the given level. If a fault occurs, then the system does nothing at the given level, but checks whether any exceptions are enabled at the next higher level. This is not the same as {Fault.enable Entity Level nil}, which always causes the entity to block at the given level.

The function Fault.enable returns true if and only if the enable was successful, i.e., the entity was not already enabled for that level. The function Fault.disable returns true if and only if the disable was successful, i.e., the entity was already enabled for that level. The functions Fault.defaultEnable and Fault.defaultDisable always return true. At its creation, an entity is not enabled at any level. All four functions raise a type error exception if their arguments do not have the correct type.

4.2.2 Binding logic variables

A logic variable can be declared before it is bound. What happens to its enabled exceptions when it is bound? For example, let's say variable V is enabled with FS_v and port P is enabled with FS_p. What happens after the binding V=P? In this case, the binding gives P, which keeps the enabled exceptions FS_p. The enabled exceptions FS_v are discarded.

The following cases are possible. We assume that variable V is enabled with fault detection on fault states FS_v.

These cases follow from three basic principles:

4.2.3 Exception formats

The exceptions raised have the format

system(dp(entity:E conditions:FS op:OP) ...)

where the four arguments are defined as follows:

4.2.4 Levels of fault detection

There are three levels of fault detection, namely default site-based, site-based, and thread-based. A more specific level, if it exists, overrides a more general level. The most general is default site-based, which determines what exceptions are raised if the entity is not enabled at the site or thread level. Next is site-based, which detects a fault for a specific entity when an operation is tried on one particular site. Finally, the most fine-grained is thread-based, which detects a fault for a specific entity when an operation is tried in a particular thread.

The site-based and thread-based levels have to be enabled specifically for a given entity. The function {Fault.enable Entity Level FStates} is used, where Level is either site or 'thread'(T). The thread T is either the atom this (which means the current thread), or a thread identifier. Any thread's identifier can be obtained by executing {Thread.this T} in the thread.

The thread-based level is the most specific; if it is enabled it overrides the two others in its thread. The site-based level, if it is enabled, overrides the default. If neither a thread-based nor a site-based level are enabled, then the default is used. Even if the actual fault state does not give an exception, the mere fact that a level is enabled always overrides the next higher level.

For example, assume that the cell C is on a site with default detection for [tempFail permFail] and thread-based detection for [permFail] in thread T1. What happens if many threads try to do an exchange if C's actual fault state is tempFail? Then thread T1 will block, since it is set up to detect only permFail. All other threads will raise the exception tempFail, since the default covers it and there is no enable at the site or thread levels. Thread T1 will continue the operation when and if the tempFail state goes away.

4.2.5 Levels and sitedness

The Fault module has both sited and unsited operations. Both setting the default and enabling at the site level are sited. This protects the site from remote attempts to change its settings. Enabling at the thread level is unsited. This allows fault-tolerant abstractions to be network-transparent, i.e., when passed to another site they continue to work.

To be precise, the calls {Fault.enable E site ...} and {Fault.install E site ...}, will only work on the home site of the Fault module. A procedure containing these calls may be passed around the network at will, and executed anywhere. However, any attempt to execute either call on a site different from the Fault module's home site will raise an exception.3 The calls {Fault.enable E 'thread'(T) ...} and {Fault.install E 'thread'(T) ...} will work anywhere. A procedure containing these calls may be passed around the network at will, and will work correctly anywhere. Of course, since threads are sited, T has to identify a thread on the site where the procedure is executed.

4.3 Advanced model

The basic model lets you set up the system to raise an exception when an operation is attempted on a faulty entity. The advanced model extends this to call a user-defined procedure. Furthermore, the advanced model can call the procedure synchronously, i.e., when an operation is attempted, or asynchronously, i.e., as soon as the fault is known, even if no operation is attempted. In the synchronous case, the procedure is called a fault handler, or just handler. In the asynchronous case, the procedure is called watcher.

4.3.1 Lazy detection with handlers

When an operation is attempted on an entity with a problem, then a handler call replaces the operation. This call is done in the context of the thread that attempted the operation. If the entity works again later (which is possible with tempFail and remoteProblem) then the handler can just try the operation again.

In an exact analogy to the basic model, a fault handler can be installed on a given entity at a given level for a given set of fault states. The possible entities, levels, and fault states are exactly the same. What happens to handlers on logic variables when the variables are bound is exactly the same as what happens to enabled exceptions in Section 4.2.2. For example, when a variable with handler H_v1 is bound to another variable with handler H_v2, then the result has exactly one handler, say H_v2. The other handler H_v1 is discarded. When a variable with handler is bound to a port with handler, then the port's handler survives and the variable's handler is discarded.

Handlers are installed and deinstalled with the following two built-in operations:

fun {Fault.install Entity Level FStates HandlerProc}

installs handler HandlerProc on Entity at Level for fault states FStates. If an operation is attempted and there is a fault in FStates, then the operation is replaced by a call to HandlerProc. At most one handler can be installed on a given entity at a given level.

fun {Fault.deInstall Entity Level}

deinstalls a previously installed handler from Entity at Level.

The function Fault.install returns true if and only if the installation was successful, i.e., the entity did not already have an installation or an enable for that level. The function Fault.deInstall returns true if and only if the deinstall was successful, i.e., the entity had a handler installed for that level. Both functions raise a type error exception if their arguments do not have the correct type.

A handler HandlerProc is a three-argument procedure that is called as {HandlerProc E FS OP}. The arguments E, FS, and OP, are exactly the same as in a distribution exception.

A modification of the current release with respect to handler semantics is that handlers installed on variables always retry the operation after they return.

4.3.2 Eager detection with watchers

Fault handlers detect failure synchronously, i.e., when an operation is attempted. One often wants to be informed earlier. The advanced model allows the application to be informed asynchronously and eagerly, that is, as soon as the site finds out about the failure. Two operations are provided:

fun {Fault.installWatcher Entity FStates WatcherProc}

installs watcher WatcherProc on Entity for fault states FStates. If a fault in FStates is detected on the current site, then WatcherProc is invoked in its own new thread. A watcher is automatically deinstalled when it is invoked. Any number of watchers can be installed on an entity. The function always returns true, since it is always possible to install a watcher.

fun {Fault.deInstallWatcher Entity WatcherProc}

deinstalls (i.e., removes) one instance of the given watcher from the entity on the current site. If no instance of WatcherProc is there to deinstall, then the function returns false. Otherwise, it returns true.

A watcher WatcherProc is a two-argument procedure that is called as {WatcherProc E FS}. The arguments E and FS are exactly the same as in a distribution exception or in a handler call.

4.4 Fault states for language entities

This section explains the possible fault states of each language entity in terms of its distributed semantics. The fault state is a consequence of two things: the entity's distributed implementation and the system's failure mode. For example, let's consider a variable. There is one owner site and a set of proxy sites. If a variable proxy is on a crashed site and the owner site is still working, then to another variable proxy this will be a remoteProblem. If the owner site crashes, then all proxies will see a permFail.

4.4.1 Eager stateless data

Eager stateless data, namely records, procedures, functions, classes, and functors, are copied immediately in messages. There are no remote references to eager stateless data, which are always local to a site. So their only possible fault state is ok.

In future releases, procedures, functions, and functors will not send their code immediately in the message, but will send only their global name. Upon arrival, if the code is not present, then it will be immediately requested. This will guarantee that code is sent at most once to a given site. This will introduce fault states tempFail and permFail if the site containing the code becomes unreachable or crashes.

4.4.2 Sited entities

Sited entities can be referenced remotely but can only be used on their home site. Attempting to use one outside of its home site immediately raises an exception. Detecting this does not need any network operations. So their only possible fault state is ok.

4.4.3 Ports

A port has one owner site and a set of proxy sites. The following failure modes are possible:

A port has a single operation, Send, which can complete if the fault state is ok. The Send operation is asynchronous, that is, it completes immediately on the sender site and at some later point in time it will complete on the port's owner site. The fact that it completes on the sender site does not imply that it will complete on the owner site. This is because the owner site may fail.

4.4.4 Logic variables

A logic variable has one owner site and a set of proxy sites. The following failure modes are possible:

A logic variable has two operations, binding and waiting until bound. Bind operations are explicit in the program text. Most wait operations are implicit since threads block until their data is available. The bind operation will only complete if the fault state is ok or remoteProblem.

If the application binds a variable, then its wait operation is only guaranteed to complete if the fault state is ok. When it completes, this means that another proxy has bound the variable. If the fault state is remoteProblem, then the wait operation may not be able to complete if the problem exists at the proxy that was supposed to bind the variable. This is not a tempFail or permFail, since a third proxy can successfully bind the variable. But from the application's viewpoint, it may still be important to know about this problem. Therefore, the fault state remoteProblem is important for variables.

A common case for variables is the client-server. The client sends a request containing a variable to the server. The server binds the variable to the answer. The variable exists only on the client and server sites. In this case, if the client detects a remoteProblem then it knows that the variable binding will be delayed or never done.

4.4.5 Cells and locks

Cells and locks have almost the same failure behavior. A cell or lock has one owner site and a set of proxy sites. At any given time instant, the cell's state pointer or the lock's token is at one proxy or in the network. The following failure modes are possible:

A cell has one primitive operation, a state update, which is called Exchange. A lock has two implicit operations, acquiring the lock token and releasing it. Both are implemented by the same distributed protocol.

4.4.6 Objects

An object consists of an object-record that is a lazy chunk and that references the object's features, a cell, and a class. The object-record is lazy: it is copied to the site when the object is used for the first time. This means that the following failure modes are possible:

Compared to cells, objects have two new failure modes: the object-record can be temporarily or permanently absent. In both cases the object cannot be used, so we simply consider the new failure modes to be instances of tempFail and permFail.


1. Since thread is already used as a keyword in the language, it has to be quoted to make it an atom.
2. Since lock is already used as a keyword in the language, it has to be quoted to make it an atom.
3. Note that each site has its own Fault module.

Peter Van Roy, Seif Haridi and Per Brand
Version 1.3.2 (20060907)