[caption id="attachment_340" align="alignleft" width="300" caption="Time Flies"]1214482_77411415[/caption]

In about one of two projects the customer comes up with the requirement of 'historization' of data. And more often then not this lead to an unholy back and forth of discussions, prototypes and complaining. The reason for this as far as I can tell is: This is not a well defined requirement. It can mean many things, and depending on what is meant a different implementation is due. So here is my list of possible implementation approaches along with the circumstances in which they might be useful. Note: Back in the times when this blog was mostly German I wrote an article on the options of versioning or 'historization'. This article you you just read is basically a translated rerun of that article.

Audit Trail

The audit trail is the simplest of all approaches. When ever an object (or row in the database) changes, a copy of that object (row) gets stored in a separate table, together with a time stamp. This is so easy, it can be implemented with database tools alone, namely using triggers. With this approach it is fairly easy to identify the state of a given row at a given time. But it gets complicated to identify the same for a whole object graph, since the objects that are part of that object graph change at independent times. So you can't just write a join. Also if you are not so much interested in the state of a row at a certain time, but at the changes that happen, the next approach might be more usefull to you.

Change Log

The Change Log is a variation of the audit trail. In both a record is produced on every change in a database row. But a change log, keeps track of the change, not so much of the state the row was in. So it would contain at least for the changed attributes the old and the new value. This way it gets easy to find out what kind of changes happen how often. A question often asked in data minining settings. Obviously you can and should fine tune your implementation between audit trail and change log to your exact needs.

Snapshot

In the Snapshot approach you take a snapshot of a complete object graph when an element of that graph changes. Note that now we are operation on objects and no longer on tables and rows. So there is no easy way to do this kind of thing in the database. Since you are creating copies of object graphs, you can navigate and query these graphs just as you can navigate and query the original graph. On the other hand the copies contain objects that didn't change at all, so it is easy to end up with huge amounts of data fast.  Since these tables contain large amounts of duplicated data, compression might be really effective. Since you are creating the complete object graph for your history objects, nothing forces you to use the same model as with the live data. So your history schema might look very different.

All approaches so far are concerned with the past state of objects. But sometimes this isn't enough. Sometimes what starts as 'historization' ends up being full blown ...

... Versioned Objects

This is the most powerful approach: You model the versions i.e. the changing state of your objects as part of your domain model. Like so:

public class Thing {
private Set versions;
public ThingVersion getVersion(Date timestamp){...}
// ...
}

public class ThingVersion {
private Date validFrom;
// ...
}


Depending on the Context you might use references to the 'Thing' or to a specific Version or to the version, which is valid at a specific point in time. Let me give you an example. When dealing with trains, a lot of rules have to get considered. Certain combination of good must not travel in wagons right next to each others. Wagons with certain goods have to have  certain features. And so on. These Rules change. And of course you need proof that the train you send on a track one year ago obeyed all the rules valid at that point in time. So far this sounds like a case for on of  the first three options. But when these rules change, you will get upfront notice of that change. But if you have a train that you plan for end of the month, which will obey the rules valid at that time, although it might violate rules which are in effect right now. In order to model this, you need Rules, that exist in Versions. And the train does reference that rule and uses it's time of departure to request the appropriate rule version, which is valid at that day.

This will definitely make your model much more complex. So don't use it if you don't need to. But if you are dealing with planned changes this has proven to be a useful approach for me.

Talks

Wan't to meet me in person to tell me how stupid I am? You can find me at the following events: