Errors And Alarms

Creating, reporting and propagating 'Alert' messages: errors, alarms, warnings, information and trace messages

This article describes different 'alert' messages and when you might create them, and some things to consider when passing such messages about code and between services.

Errors, alarms, warnings, information messages and trace messages are fundamental to modern software. Even if your software is bug-free, you are likely to be using services and libraries that are not, in less than ideal environments.

Some Terms

There are various status messages that you might want to pass around a system. Sometimes these are given as semi-standard 'levels' of seriousness, i.e.:

|| Type | Description | Examples
~~#666666:DEBUG/TRACE~~ | Low level tracking messages indicating what the software code is doing; usually describing code processes and so targeted at programmers. | "entered connect() method", "i for loop, i=1000", "opening file 'banana.txt'"
~~#000000:STATUS/INFO~~ | Information for the user (human or machine) about what is going on. | "Processing row 14..", "Starting query", "Connecting..."
~~#FF9900:WARNING~~ | A problem, or potential problem, has been found but can be ignored for now. | Disk quota or memory not yet exceeded, but 'close to'.
~~#FF0000:ALARM~~ | A problem has been found that needs to be resolved, but the system is still operating as it should. | A server is down, a sensor indicates overheating.
~~#FF0000:ERROR~~ | A problem has been found that the system cannot cope with, or the system itself has failed to carry out an operation that it should be expected to complete. | Code failures. ||

There is a fair amount of overlap here, and the seriousness does not necessarily escalate cleanly. A monitoring service might raise an alarm if a service goes down. But a client that fails when trying to use it might raise an error. If it can use another service instead, it may just raise a warning, or an info, or even just a trace/debug note depending on the requirements of the system.

Monitoring systems typically define __''soft''__ and __''hard''__ limits. Soft limits indicate that a value is outside its 'ideal' but work can continue. Hard limits indicate that a value is outside its operating limits. For example, available disk space might raise warnings when more than 90% full (the soft limit), and alarms when completely full (the hard limit).
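A minimal sketch of the soft/hard limit idea, using the disk space example above. The names `Level` and `check_disk`, and the default thresholds, are illustrative assumptions rather than part of any particular monitoring system.

```python
from enum import Enum


class Level(Enum):
    OK = "ok"
    WARNING = "warning"  # soft limit exceeded: outside the ideal, work continues
    ALARM = "alarm"      # hard limit reached: outside operating limits


def check_disk(used_fraction, soft_limit=0.90, hard_limit=1.00):
    """Classify disk usage (0.0 to 1.0) against a soft and a hard limit."""
    if used_fraction >= hard_limit:
        return Level.ALARM
    if used_fraction > soft_limit:
        return Level.WARNING
    return Level.OK
```

For example, `check_disk(0.95)` returns `Level.WARNING`, while `check_disk(1.0)` returns `Level.ALARM`.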

Relevant Detail

We've all been frustrated by those "Cannot load file" messages that don't indicate which file cannot be loaded and the reason why (not there? too big? wrong format? If so, what format is it and what was expected?). On the other hand, users can be overloaded by irrelevant detail; end-users rarely care about the IP address that a service might be trying to connect to, for example.

So when the error is propagated up the call stack, it should accumulate relevant details in such a way that (1) high level code can test for and trap errors, and (2) the reporting system can display them suitably to the end-user, developer or maintainer. The former requires suitably typed hierarchical exceptions (or status codes); the latter requires some intelligent work by the programmer to split reporting so that details are logged (for the maintainer) and sensible messages displayed (for the user).

For example, low level file access might report "Permission Denied" when opening a file. The calling code should trap that and add what the filename is, the path, what permission is expected and so on. Higher level code traps that and adds that the file is a configuration file, and the application terminates. So you might get "Failed to read configuration file: Permission Denied for user 'mch' to 'c:\properties.txt'". Surrounding inserted strings with single quotes - such as the filename here - can catch problems with typos such as trailing spaces, extra periods, etc.
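The layering in this example can be sketched roughly as follows. All the function names and the `ConfigurationError` class are hypothetical, and `open_for_user` is a stand-in that always refuses access so that the error path is visible.

```python
class ConfigurationError(Exception):
    """Hypothetical application-level error: configuration could not be read."""


def open_for_user(path, user):
    """Stand-in for low level file access; always refuses, for illustration."""
    raise PermissionError("Permission Denied")


def read_file(path, user):
    """File-access layer: traps the bare error and adds who and which file."""
    try:
        return open_for_user(path, user)
    except PermissionError as err:
        # Quote inserted values so stray spaces or periods are visible.
        raise OSError(f"{err} for user '{user}' to '{path}'") from err


def load_configuration(path, user):
    """Configuration layer: adds what the file was wanted for."""
    try:
        return read_file(path, user)
    except OSError as err:
        raise ConfigurationError(
            f"Failed to read configuration file: {err}"
        ) from err
```

Calling `load_configuration(r"c:\properties.txt", "mch")` raises a `ConfigurationError` carrying the full message quoted above. If the path were mistyped with a trailing space, the space would show up inside the quotes.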

As the error message text changes, so does the error type. A low level -+PermissionDenied+- exception should be an -+IOException+- when it exits the file access code, and a -+ConfigurationException+- when it exits the configuration-reading code. This means we can test for and catch - and report - suitably and cleanly at each stage.

!!Wrap, Log and/or rethrow Exceptions?

It's tempting to blindly trap exceptions and wrap them in a new one with the extra context details. That way you get maximum detail for debugging. However you also get massive stacktrace logs that you'll need to wade through to find the particular relevant error. So yes, wrap if you genuinely ''need'' that extra detail; otherwise, if your language lets you, create a new one with the same stack trace as the original, and add the text "from exception Xxxx" to the technical part of the message.

Low level libraries should throw exceptions, not log errors; then it's up to the application to decide if it needs to report it or ignore it.

Most definitely, definitely, definitely, do not log and rethrow. Applications may not care that an error has occurred, in which case you're just going to clog up the logs. Or someone's going to have to mess around tuning the logging system to ignore your package. If you're throwing it, let the calling code handle it; that's what exceptions are for.

Generally speaking, throw a 'new' exception rather than rethrow a lower level library one, particularly if it's 3rd party. This is part of the 'hide implementation' ethos; a high level library interfaces to the mid-level one - it should not care what low level library is being used. More practically, it means the mid-level developer has to think about how and why the status is being reported to the calling program.

Don't blindly create one overall exception type for your project and subclass all others from it; this makes trapping exceptions by supertype much harder.
Make use of the existing hierarchies - for example it is better to extend IO errors from IOException, so that code can cleanly check for any IOException.

!!In code - Exceptions vs Status Codes

Exceptions were meant to make code cleaner. Instead of checking the return status of every call and returning a new status code at every level, only the call that needs to deal with the error need check for it. Intermediate calls just declare the exception as a 'possible event'.

However since I usually want to add context information at most intermediate levels, my code ends up checking for exceptions at each call anyway. And frequent try/catch/finally blocks do not make code cleaner.

Exceptions disrupt the 'normal flow' of code, and are relatively 'expensive' to create compared to primitive status codes. So they should be reserved for ''exceptional'' circumstances, not ones that you expect to occur. If you want more than primitive status codes - say, to give type and hold supplementary information - consider a callback or event listening mechanism.

!!Centralised vs Distributed Codes & Texts

In The Olde Dayes, error codes were often centralised into one monolithic code library. This meant that errors reported by an application (often in the form of "Error 402F") could be easily found in a lookup manual. The lookup manual could be written in different languages for the same set of faults. Similarly displays and reports could use a single look-up table to find the text string to use to display, and the look-up table could be replaced for different languages.

Unfortunately if you are writing 'decoupled' libraries, you can't have a centralised static error 'repository'. A centralised error code library demands a full recompile of all dependent libraries, and it's not always clear which those might be. With decoupled libraries each library defines and reports its own errors.
This devolves naturally down to each package defining and reporting its own errors, and perhaps even to each class. This makes looking up error codes difficult as you need to locate the class responsible for ''defining'' the error (not the one reporting it), then find the right bit of source to look up what that code indicates.

Instead we have to include text ''with'' the status code. We still need the status code (or exception type), because the calling program needs to be able to test for, trap and branch based on different statuses. Testing strings can be awkward and prone to failure - compare "File xxxxx not found" with "Could not find file xxxxx". Including detail in text strings makes translation between different languages very hard (see below, localisation).

It's tempting to share static status codes because some appear to be common. I would argue that this is rarely, if ever, the case in practice and can mislead. Consider a library that defines a code (or an exception) CannotConnect. Other library developers may see this code and think "aha, we can use that, Standards Are Good". But the original library developer decides to split it into different codes, and reuses that one as ServerNotAvailable. There is no way to tell what other libraries might be using that error code, if any; if s/he changes its meaning, the other libraries will report the same code but the wrong meaning will be associated with it.

''Aside: I wonder if it would be worth including some kind of dynamic error 'repository' in applications. Libraries statically register their status codes and language packages. Code namespaces would have to be agreed to prevent conflicts, eg package names''.

!!Crossing Service Boundaries

When ''reporting'' errors externally, we need to be slightly careful about what details are sent. Path details for example should not be reported from web services as these can indicate how the server is set up.
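One way to keep such details inside the service is to log the original error locally and hand the client a fresh one that carries only a safe message and a reference. The sketch below assumes a hypothetical `ServiceFault` type and `at_boundary` wrapper, and assumes (purely for illustration) that `ValueError` signals bad client input.

```python
import logging
import uuid

log = logging.getLogger("service.boundary")


class ServiceFault(Exception):
    """Hypothetical error type that is safe to send to external clients."""

    def __init__(self, message, client_error, reference):
        super().__init__(message)
        self.client_error = client_error  # True: problem with the inputs sent
        self.reference = reference        # lets the sysadmin find the log entry


def at_boundary(handler, request):
    """Run handler(request); translate any failure into a ServiceFault."""
    try:
        return handler(request)
    except ValueError:
        # Assumption: ValueError means the client sent something invalid.
        reference = uuid.uuid4().hex
        log.warning("ref=%s invalid request", reference, exc_info=True)
        raise ServiceFault(
            "The request was not valid; please check the inputs sent.",
            True, reference) from None
    except Exception:
        reference = uuid.uuid4().hex
        # Full detail (paths, stack trace) goes to the local log only.
        log.error("ref=%s internal failure", reference, exc_info=True)
        raise ServiceFault(
            "The service failed; please retry later or contact support.",
            False, reference) from None
```

Raising `from None` stops the original exception, and any path details in its message, from travelling with the new one.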
I would say that service boundaries should log errors and return new ones (despite hating log and rethrow :-). The original error should go into a log suitable for the local sysadmin and/or the system maintainer. The external error should include enough information to let the client know whether the problem is a service failure or some problem with the inputs sent by the client, and some idea about what to do about it.

''(...More on nested service calls...)''

!!Internationalisation/Localisation

Reporting errors in different languages [http://wiki.astrogrid.org/bin/view/Astrogrid/LocalisingMessages|looked so difficult] that, despite being an international project, we gave up trying to implement it for AstroGrid. These are the difficulties and some of the solutions that we considered. There were more, if the AstroGrid forum ever comes online again...

In summary, each alert must have some kind of unique, language-independent indicator which can then be associated with some suitable text string. Along with it must also be a set of parameters that give details about the alert - such as filename, or a progress indicator. The alerts can then be created with the appropriate language at the point of display or report.

Reporting errors should be extremely reliable. Adding breakable bits to error reporting can make debugging extremely frustrating, particularly for end users. So to resolve alert codes to human text at the point of display, the human text must be available for that code at that point. (Using a service to resolve it, for example, would add a very breakable step - in our working environment there were no overall authorities to guarantee any kind of service availability).

With monolithic and reasonably stable systems, agreeing and defining status codes and including them in all the relevant applications is fairly straightforward. In distributed systems from different suppliers, different services can end up defining the same codes for different alerts.
This can be solved by assigning suitable 'namespaces', such as using strings as codes and prefixing codes with web-like namespaces (such as 'org.astrogrid.512545').

The real problem is handling change. When a new service is introduced, all the clients that use it must know that system's error codes, otherwise they will end up reporting incomprehensible sets of numbers and parameters to the client. For single-step calls, users might be able to resolve problems by contacting the new system's admins when errors occur. But in a distributed system, where calls are nested through several services, it can be extremely hard to track where such an error started.

We resorted to just using English; to a certain extent we could get away with that as it's a fairly common language in academia. If anyone's solved this, I'd be very interested.
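For reference, the scheme summarised above (a namespaced, language-independent code plus parameters, resolved to text from a locally held catalogue at the point of display) might look something like the sketch below. It does not solve the change-handling problem. The `Alert` class, the `render` function, the `org.example` codes and the catalogue texts are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    code: str                                   # namespaced, language independent
    params: dict = field(default_factory=dict)  # details such as filename


# Catalogues are held locally, so resolving text needs no remote service.
CATALOGUES = {
    "en": {"org.example.files.not_found": "Could not find file '{filename}'"},
    "fr": {"org.example.files.not_found": "Fichier '{filename}' introuvable"},
}


def render(alert, language, catalogues=CATALOGUES, fallback="en"):
    """Resolve an alert to text at the point of display; never raises."""
    for lang in (language, fallback):
        template = catalogues.get(lang, {}).get(alert.code)
        if template is None:
            continue
        try:
            return template.format(**alert.params)
        except (KeyError, IndexError):
            continue  # template and parameters disagree; try the next one
    # Unknown code: all we can show is the raw code and parameters.
    return f"{alert.code} {alert.params}"
```

An alert with a code the client has never seen falls through to the last line, which is exactly the "incomprehensible sets of numbers and parameters" case described above.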
