tsunami

SQL2005: historical data via EAV

Luke Breuer
2008-04-25 22:10 UTC

Problem definition
It must be possible to pull a record from a given table, as it was at any point in time.
SQL
--drop table History
--go
create table History (
    table_name  varchar(64),
    column_name varchar(64),
    record_id   int,
    new_value   varchar(100),
    update_date datetime
)
go

;with V as (
    select  c = 'first_name', v = 'Larry',          d = '1/1/06' union
    select  c = 'last_name',  v = 'Smith',          d = '1/1/06' union
    select  c = 'job_title',  v = 'Manager',        d = '1/1/06' union
    select  c = 'job_title',  v = 'Senior Manager', d = '1/1/07' union
    select  c = 'job_title',  v = 'CEO',            d = '1/1/08'
)
insert  History (table_name, record_id, column_name, new_value, update_date)
select  'Person', 1, c, v, d
from    V
union
select  'Person', 2, 'first_name', 'Victor', '1/1/06'

;with Ranked as (
    select  *,
            rank = rank() over (partition by record_id, column_name order by update_date desc)
    from    History
    where   table_name = 'Person'
        and update_date <= '1/1/07'
), Filtered as (
    select  column_name,
            record_id,
            new_value
    from    Ranked
    where   rank = 1
)
select  record_id,
        first_name,
        last_name,
        job_title
from    Filtered
pivot   (max(new_value) for column_name in (first_name, last_name, job_title)) pvt
Results
record_id first_name last_name job_title     
--------- ---------- --------- --------------
1         Larry      Smith     Senior Manager
2         Victor     NULL      NULL          
Caveats
The above is effectively Entity-Attribute-Value (EAV), which is an extremely flexible way to use relational databases, but also a very inefficient one. The query above will not scale well. The data returned are all the same data type as History.new_valuevarchar(100) in this case. This does not bode well for, example, calculating sums over nontrivial sets of data. Moreover, the above model assumes all tables have the same type of [synthetic] primary key. (Although, this could be addressed by having a separate history for each table.)
Caching
There are many variations on the above that are really just the above, plus caching. For example: consider the use of table_name in the example — it actually works without there being a Person table in the database. However, it makes sense to have the latest version of data available in tables, for quick access. The latest version is simply the select query above, with no date criterion. Consider such tables the "most recent snapshot".

If this "most recent snapshot" approach is taken, the History table above could be thought of as a sort of audit log — it could be populated by triggers. (Deletion would need special treatment, unless all entries in the audit log are also deleted.)

Another way to do versioning is to never update records in tables — just create new ones and mark the old ones somehow. This has the benefit of faster data access and avoidance of EAV issues. However, this is not very space-efficient. A possible compromise between this approach and the EAV approach would be to create occasional snapshots, with the gaps filled by the EAV approach.