Service Interchange Formats

Notes for writing specifications for the formats of messages used by services to communicate with each other (Work in Progress)

Consider a distributed system made up of many services provided by different organisations. Each service carries out some process on shared data, and may pass on its results to some other service to process further in a workflow defined in this case by an astronomer.

With large numbers of applications run by different organisations, each with its own data format (or 'flavour' of data format), you can very quickly get into trouble converting each of ''n'' formats into the ''n-1'' others. Any new format introduced with a new application has to be translatable to and from the existing formats, and adapting the complete existing distributed system to handle it can be impractical.

Terms

First I'm going to define some terms just for the purpose of this article. I'm not going to pretend these are suitably unambiguous terms for general use!

__Form:__ How the data is laid out. For example: "Table", "Tree", "Related Tables", "List".

__Format:__ The binary representation of the data. For example, a particular XML schema, CSV, FITS, Bespoke Binary, perhaps SQL RDBMS.

__Type:__ Fully expressed definition of what the data includes and its meaning. For example: "Person", "Company", "Telephone Number", "Invoice", "Book Catalogue", "Spectral Energy Distribution". Some types can be simple forms, eg "Table" without indicating what the values in the table mean.

__Syntax__ is a term borrowed from language where it describes (largely) grammar; that is, how words should be arranged in a sentence so that the sentence can be understood. In data formats, computer languages, etc, it describes how bits of data are ''arranged'', but not what they ''mean''. So we might say that the first two bytes are a signed real number, the next one is the length of the next data structure which consists of a number of structures, each a byte, a four byte signed big-endian integer, etc, etc.

__Semantics__ describes what data ''means'', either to man or machine. So we might say that the first number above is a 'version number'. The next chunk is an array of star characteristics, where the first number is its ID, the second is ''that star's'' distance in fathoms from the earth, etc. Note that not only are we assigning information to the values described in detail by the syntax, but also some of the relationships; the distance value is related to the previous ID value.
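To make the split concrete, here is a minimal sketch in Python. The byte layout is my own simplification of the one described above (a two byte signed integer, a one byte record count, then records of one byte plus a four byte signed big-endian integer), and the names `parse_syntax`, `apply_semantics` and `Star` are invented for illustration. The first function knows only how the bytes are arranged; the second is where 'version', 'ID' and 'distance in fathoms' come in.

```python
import struct
from dataclasses import dataclass

# Syntax only: how the bytes are arranged (layout assumed for this sketch).
HEADER = struct.Struct(">hB")  # 2-byte signed int, then a 1-byte record count
RECORD = struct.Struct(">Bi")  # 1-byte value, then a 4-byte signed big-endian int


@dataclass
class Star:
    star_id: int
    distance_fathoms: int


def parse_syntax(raw: bytes):
    """Split the bytes into numbers without saying what any of them mean."""
    first, count = HEADER.unpack_from(raw, 0)
    records = [
        RECORD.unpack_from(raw, HEADER.size + i * RECORD.size)
        for i in range(count)
    ]
    return first, records


def apply_semantics(first, records):
    """Assign meaning: a version number, and (ID, distance) pairs per star."""
    return {
        "version": first,
        "stars": [Star(star_id, distance) for star_id, distance in records],
    }


raw = HEADER.pack(1, 2) + RECORD.pack(7, 1000) + RECORD.pack(8, 2500)
first, records = parse_syntax(raw)
document = apply_semantics(first, records)
```

Note that the relationship between each ID and its distance lives only in `apply_semantics`; the parser just hands back pairs of numbers.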

__Flavour:__ A 'flavour' of a data format is where a group has used a format in a rather idiosyncratic way. Typically certain fields are missing as considered irrelevant or assumed, or are used for values the original form was not intended for. See Cory Doctorow's [http://www.well.com/~doctorow/metacrap.htm|Metacrap].

Common Solutions

Specifying Form not Type

It's been known to 'solve' the problem by specifying an interchange form rather than a type. So, the interchange format could be an XML document that defines a table (eg VOTable). Many data types will fit into a table, so ''voila'', a solution has been found. Libraries can be written and reused to read and write them. In VOTable's case displays for plotting its values can be written and applied to all instances.

This covers a lot of useful applications; sometimes all you want to do is plot two different columns of a table, and ''you'' know what they are and represent. In this case the services you use are only interested in the form, and not its type (not what the data in the table represents).
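A form-only service might look something like the following sketch. It uses CSV rather than VOTable purely to keep the example short, and the function names and column labels are made up for illustration. The code can pull out any two columns for plotting, but it has no idea whether it has been handed a stellar catalogue or a shopping list.

```python
import csv
import io


def read_table(text):
    """Read any CSV table into a list of rows keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def column_pair(rows, x_name, y_name):
    """Extract two numeric columns as (x, y) points, ready for plotting."""
    return [(float(row[x_name]), float(row[y_name])) for row in rows]


# Hypothetical data: only the caller knows what 'ra' and 'mag' represent.
text = "ra,dec,mag\n10.5,-3.2,12.1\n11.0,4.0,9.8\n"
points = column_pair(read_table(text), "ra", "mag")
```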

One Size - Too Big - Fits All

It can be tempting to introduce ''one'' interchange format to handle all the various combinations. Instead of having to convert each application's native format to and from all the others, we need only write converters to and from the one interchange format. At first sight this should save us a huge amount of work, and 'uncouples' (a Good Thing) the services from each other.
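The arithmetic behind that saving can be sketched as below, assuming every converter is one-directional, so that each pair of formats needs two. The function names are only illustrative.

```python
def pairwise_converters(n: int) -> int:
    """Converters needed when each of n formats converts directly to the n-1 others."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """Converters needed when each of n formats converts to and from one interchange format."""
    return 2 * n


for n in (2, 3, 5, 10):
    print(n, pairwise_converters(n), hub_converters(n))
```

With three formats both approaches come to six converters, so the interchange format only starts to pay for itself beyond that, and that is before counting the effort of designing and maintaining the interchange format itself.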

Common Consequences

In both cases above what you end up with is a format that has to handle all the possible data types. In summary, this means:

Service contracts cannot be defined at their interfaces
Validation must somehow be based on the data values, after they've been parsed.
Meaning is deferred to some other method (eg tables vs SEDs - UCDs)
Flavours multiply

Enforcing

It is impossible to ''enforce'' a particular flavour if the solution is too generic. For example, if you can't tell whether something is a spectral energy distribution or a stellar catalogue, can you even tell if it's a properly filled out stellar catalogue?

Inappropriate data sets can be sent as inputs to services.
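As a rough sketch of what this means in practice: a service that accepts a generic table can only find out what it has been given by inspecting the contents after parsing. The `Table` class, the column labels and the service below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """A generic form: column labels and rows, with no declared type."""
    columns: list
    rows: list


# Assumed labels for this sketch; a real system might use UCDs instead.
REQUIRED_CATALOGUE_COLUMNS = {"star_id", "ra", "dec"}


def looks_like_stellar_catalogue(table: Table) -> bool:
    """Best-effort guess from the contents; the interface cannot promise it."""
    return REQUIRED_CATALOGUE_COLUMNS.issubset(table.columns)


def cross_match_service(table: Table) -> int:
    """A service that wants a stellar catalogue but can only accept 'a table'."""
    if not looks_like_stellar_catalogue(table):
        raise ValueError("input does not look like a stellar catalogue")
    return len(table.rows)


catalogue = Table(["star_id", "ra", "dec"], [[1, 10.5, -3.2]])
sed = Table(["wavelength", "flux"], [[500.0, 1.2]])
```

Passing `sed` to `cross_match_service` is perfectly legal as far as the interface is concerned; the mistake is only caught at run time, and only because this particular service bothered to check.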

Completeness

Any single interchange form has to contain ''all'' the information that could possibly come from any of the 'native' data forms. Not only can this make for a very cumbersome single form, but the right choice for some of the common values may not be obvious. We can say, for example, that the common form represents angles in degrees, but the accuracy must be to the most accurate of the native forms. [TBD: another example, eg fluxes to mags].

Generally speaking we can describe what we have, rather than requiring a particular format for elements. In the above example we might say that the common form must say what 'units' angles are in, and to what accuracy, along with the value.
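A minimal sketch of the 'describe what we have' approach follows. The `Angle` class, its field names and the set of unit labels are assumptions made for this example, not part of any existing format.

```python
import math
from dataclasses import dataclass

# Conversion factors to degrees for the units this sketch knows about.
_TO_DEGREES = {
    "deg": 1.0,
    "rad": 180.0 / math.pi,
    "arcsec": 1.0 / 3600.0,
}


@dataclass(frozen=True)
class Angle:
    value: float
    unit: str        # what units the value is in
    accuracy: float  # accuracy of the value, in the same units

    def in_degrees(self) -> "Angle":
        """Convert on demand, carrying the accuracy along with the value."""
        factor = _TO_DEGREES[self.unit]
        return Angle(self.value * factor, "deg", self.accuracy * factor)


native = Angle(value=3600.0, unit="arcsec", accuracy=36.0)
converted = native.in_degrees()
```

The native value travels unchanged, and whoever needs degrees converts it, rather than the common form forcing one unit and one accuracy on everybody.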

Versions and Flavours

There will not be only one... No there won't. As time goes on, there will be new versions. Flavours. But versions only need to transform between each other, not between all others. Flavours are another problem; sometimes caused by shoehorning, sometimes caused by too much redundancy, sometimes just laziness.
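One reading of 'versions only need to transform between each other' is a chain of upgrades, each written once, from one version to the next. The sketch below assumes documents are plain dictionaries carrying a `version` field; the fields and the changes made at each step are invented for illustration.

```python
def v1_to_v2(doc):
    """Hypothetical change: version 2 states its units explicitly."""
    out = dict(doc)
    out["version"] = 2
    out.setdefault("units", "deg")
    return out


def v2_to_v3(doc):
    """Hypothetical change: version 3 groups the value with its units."""
    out = dict(doc)
    out["version"] = 3
    out["angle"] = {"value": out.pop("angle"), "units": out.pop("units")}
    return out


# One converter per step, keyed by the version it upgrades from.
UPGRADES = {1: v1_to_v2, 2: v2_to_v3}


def upgrade(doc, target_version):
    """Apply the single-step converters in turn until the target is reached."""
    while doc["version"] < target_version:
        doc = UPGRADES[doc["version"]](doc)
    return doc


latest = upgrade({"version": 1, "angle": 12.5}, 3)
```

Each new version then costs one new converter, rather than one per existing format. Nothing similar helps with flavours, since by their nature nobody has written down how they differ.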

Ghost Dependencies

Making all forms depend on one common form does not necessarily decouple them all from each other. When new forms are introduced that require extra information not originally in the common form, then the common form needs to be changed. This either becomes a new version (so you no longer have a single common form) or all those forms depending on it have to change.

Effort

We generally make common interchange formats to save us effort writing large numbers of converters. But it might be worth looking at the number of native or legacy formats that you have before assuming that you will need a common form. If you only have a few, and some existing converters, the extra effort writing and maintaining one might not be required.

Example

For example, if you define an XML schema for documents to describe a person, it might not be suitable to use for describing companies. So you can either make an XML schema for describing, say, 'entities', or have two schemas, one that describes people and the other companies. The choice depends on whether you want to make the distinction; if you use 'entities' then you cannot distinguish between people and companies at the interface or contractual level. You cannot specify on your web interface 'addStaffMember' method, say, that you expect only people documents.
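The same choice can be sketched in code, with Python classes standing in for the XML schemas. All the class, field and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str


@dataclass
class Company:
    name: str
    registration_number: str


@dataclass
class Entity:
    """The generic alternative: one type covering people and companies."""
    kind: str
    name: str


def add_staff_member(staff: list, member: Person) -> None:
    """With separate types, the interface itself says 'people only'."""
    if not isinstance(member, Person):
        raise TypeError("add_staff_member expects a Person")
    staff.append(member)


def add_staff_member_generic(staff: list, member: Entity) -> None:
    """With 'entities', the contract cannot exclude companies, so the
    service has to inspect the values after the document has arrived."""
    if member.kind != "person":
        raise ValueError("only people can be staff members")
    staff.append(member)


staff = []
add_staff_member(staff, Person("Ada"))
```

In the first version the restriction is visible in the signature and can be checked against the type before anything is processed. In the second it is buried in the body of the method, where callers cannot see it.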

Defining Types

So - we want some way of reducing the number of ''formats'', but we need to reduce them to a suitable number of ''types''. What's the difference? Let's have a look at some terms.

Which ones.

Hierarchies.

Appropriateness.

Change/Versioning

Extending types.