Handling EAV-data - Info Support Blog

Home » Handling EAV-data

Handling EAV-data
- Door Mark Streutker
- 15 years ago
- 4 Reacties
- Business Intelligence
 

In my line of work I often have to develop reports on all kinds of environments, but most of the time the underlying databases are limited to relational or dimensional models.
A few months ago, however, I was faced with the tedious task challenge to build a couple of reports on a system that was entirely based on an EAV-architecture.
Even though I was familiar with the technique, I have never put it to use for reporting purposes.
After doing some research on the internet, looking for best practices and guidelines, I noticed that the amount of information about reporting on EAV-data was limited at best.
That’s why I decided to dedicate a few articles on my personal experiences (and pitfalls) with the wonderful world of EAV.

There are a lot of interesting challenges to overcome when working with EAV-data, but I will try to start out with a few simple examples.
In this first article I will focus on the selection and formatting of EAV-data, the more complex techniques (like handling timelines) will be addressed in future articles.

First of all, let’s dive into some of the basics concerning EAV-data (since this type of datamodelling isn’t very common in a reporting environment).
EAV is an abbreviation for Entity – Attribute – Value. Data stored in this type of model usually consists out of three ‘base’ columns:
- Entity : An Entity is a ‘thing’, and object. Examples could be ‘Person’, ‘Employee’, ‘Contract’.
- Attribute : An Attribute describes a part of the Entity. If we take the Entity ‘Person’ as an example, attributes could be ‘Name’, ‘Birthday’, ‘Nationality’, etc.
- Value : The Value of the Attribute, i.e. ‘John’, ’12-12-1970’, ‘English’.
In a relational datamodel, an Entity would most likely be a table, a Attribute would correspond with a column and a Value would match a field.
Following table is a representation of a very simple EAV dataset.

Table: Entity

EntityID Entity

1 Employee

33 Contract

Table: Attribute

AttributeID Attribute

100 Name

101 Birthday

102 Nationality

360 Type

361 Startdate

Table: EAV_Data

ID EntityID AttributeID Value

1234 1 100 John

1234 1 101 12-12-1970

1234 1 102 English

5678 33 360 Fulltime

5678 33 361 01-01-2009

In the above example, we notice a few things.
First of all, in a relational datamodel we probably would have had a ‘Employee’ table and a ‘Contract’ table.
In our EAV example, most the data is stored in 1 single table. Both the tables (Entities) and columns (Attributes) are stored in a row-wise manner, therefore there is no ‘fixed’ datastructure. Second, all the ‘Values’ are stored in the same column, which implicates that we have 1 datatype to store everything (varchars, datetimes, ints, etc.).
To keep this example simple, ill ignore this ‘datatype-problem’ for now.

EAV-modelling works exceptionally well for systems that need a great amount of flexibility.
Both ‘tables’ and ‘columns’ can be dynamically added without changing the underlying model.
Systems that need this amount of flexibility are i.e. clinical applications.
So instead of having several (fixed) smaller tables, you end up with one (flexible) huge table.

While this is all fun and games for systems that storing information, things can get a bit challenging when you need to build reports.
First challenge: How do you retrieve EAV data?
To demonstrate the differences between a relational model and a EAV model when building queries, I’ll try to make an easy selection on all employee’s with the name ‘John’ that have the ‘English’ nationality.

The relation employee table in this example has the following columns: ID, Name, Last Name, Birthday, Nationality and Marriagal status.

Example 1: Relational table
```
   1: SELECT * 
```
```
   2: FROM Employee 
```
```
   3: WHERE Name = 'John' 
```
```
   4: AND Nationality = 'English'
```
 

Result

ID Name Last name Birthday Nationality Marriagal Status

1234 John Doe 12-12-1970 English Married

365 John Cook 01-08-1983 English Single

All the information from an Employee is stored in 1 record, which makes it a very easy task to select all the information we need.

In contrary to the Employee table, the EAV table has every attribute (column) stored in a new record.

 

Example 2: EAV table
```
   1: SELECT * 
```
```
   2: FROM EAV_Data
```
```
   3: WHERE Attribute = 'Name'
```
```
   4: AND Value = 'John'
```
Result

ID EntityID AttributeID Value

1234 1 100 John

365 1 100 John

Hmm, not exactly what we wanted to see. Even though this is a correct result, the other Attributes (like Last Name, Birthday, etc) are not returned by our query.

If we were to make a report on the Employees, we have to figure a way out to present the Attributes on the horizontal axe, as if they were columns. In short, we want the rows to be converted to columns.

One way to achieve this, is to access the same table several times by means of selfjoins.

 

Example 3: EAV table with selfjoins
```
   1: SELECT  t1.ID as 'ID'
```
```
   2:       , t1.Value AS 'Name' 
```
```
   3:       , t2.Value AS 'Nationality'
```
```
   4:       , t3.Value AS Birthday
```
```
   5: FROM EAV_Data t1
```
```
   6: LEFT JOIN EAV_Data t2
```
```
   7: ON t1.ID = t2.ID
```
```
   8: LEFT JOIN EAV_Data t3
```
```
   9: ON t1.ID = t3.ID
```
```
  10: WHERE t1.Attribute = 'Name'
```
```
  11: AND t1.Value = 'John'
```
```
  12: AND t2.Attribute = 'Nationality'
```
```
  13: AND t2.Value = 'English'
```
```
  14: AND t3.Attribute = 'Birthday'
```
Result

ID Name Nationality Birthday

1234 John English 12-12-1970

365 John English 01-08-1983

While this works to some extent, you’ll notice that the query for the EAV-data is much more complex that the relational one. On top of that, in order to get the same amount of columns as the relational query, we need even more selfjoins! Needless to say, using lots and lots of selfjoins is stressfull for the database and has a negative impact on the performance.

A better (and more elegant) solution is to ‘rotate’ the table by making use of the PIVOT operator that was introduced with SQL Server 2005.

 

Example 4: EAV table with PIVOT
```
   1: SELECT * FROM
```
```
   2: (
```
```
   3:     SELECT ID
```
```
   4:            , [100] AS Name
```
```
   5:            , [101] AS Birthday
```
```
   6:            , [102] AS Nationality
```
```
   7:     FROM 
```
```
   8:     (
```
```
   9:            SELECT ID, EntityID, AttributeID, Value
```
```
  10:            FROM EAV_Data
```
```
  11:     ) p
```
```
  12:     PIVOT
```
```
  13:     (
```
```
  14:            MAX (Value)
```
```
  15:            FOR AttributeID IN ([100], [101], [102])
```
```
  16:     ) AS pvt
```
```
  17: ) AS result
```
```
  18: WHERE Name = 'John'
```
```
  19: AND Nationality = 'English'
```
Result

ID Name Birthday Nationality

1234 John 12-12-1970 English

365 John 01-08-1983 English

By making use of the PIVOT operator we are capable of ‘rotating’ the table, which makes it possible to query the data as if it was stored in a relational table.

Of course this is a very simple example, but combined with metadata it can be a powerful technique.

Systems that make extensive use of EAV-modelling are usually backed up with a great amount of metadata, which is used to describe properties like the datatypes of attributes, the attributes that belong to a certain entity, the relation between entities, etc.

In my current project we use the metadata to generate an extra layer in the database, based on certain combinations of attributes.

This way it’s like we have a couple of hundred different (relational) tables, while everything is effectively stored in 1 big table.

I hope this was an interesting read, in my next article I’ll elaborate on the subject of handling multiple timelines

Share this

Plaats een reactie

Click here to cancel reply.

4 comments

Why not create views for easy reporting? I used it previously to easily query XML-data, which is a similar problem.

With views, it’s the resposibility of the original designer/developer to create them and ofcourse he knows best how the underlying model works.

Robert te Kaat/ 14 years ago Reply
Thats a legitimate question, it’s a common practice to use views.
In the examples I’ve shown, I took one important dimension out of scope : time.
Mostly when you run a report, you want data between two specific points in time.
When you use a view, you usually realize this by adding a JOIN/WHERE clause on the BeginDate and EndDate columns.
This method is efficient way of selecting data, since selecting through a view supports the usage of indexes.

However, when you use the PIVOT construction I described above, adding a JOIN/WHERE clause _after_ the PIVOT won’t be as efficient. The PIVOT statement will pivot all records before applying the filters, which makes it a very expensive operation. To make a PIVOT efficient, you want to delimited as much records as possible _before_ you pivot. The only way to achieve this, is by making use of variables; i.e. a startdate and enddate.
Since views don’t support the usage of variables, we were bound to use functions instead.

The downside is, of course, that the resultset of a tablebased function is a table variable; so filtering the resultset of a tablebased function won’t use any indexes.

marks/ 14 years ago Reply
Very good information here.

Thanks for sharing.

Anyway i have some more questions.

1. Any performance issue using this technique. (if you can provide some matrix like data size, time, should be very good)

2. What if I don’t use EAV-modeled but using xml to keep data instead, how can you generate report from xml?

Thanks,

ensecoz/ 14 years ago Reply
Very good exposition of the EAV design.

I have a question though: is it really necessary to use the PIVOT operator as not all RDBMSes may support it. I have seen another example of ‘pivoting’ using standard SQL like so:

create view vw(o, name, age, gender)
as
select o,
max(case a when ‘name’ then v end) as name,
max(case a when ‘age’ then v end) as age,
max(case a when ‘gender’ then v end) as gender
from eav2
group by o
having age > 3 and name like ‘y%’ order by age;

(where o is the object id, a is the attribute and v is the value).

Does this do the same as the PIVOT operator? Does it have the same performance?

tsaixingwei/ 13 years ago Reply

EntityID	Entity
1	Employee
33	Contract

AttributeID	Attribute
100	Name
101	Birthday
102	Nationality
360	Type
361	Startdate

ID	EntityID	AttributeID	Value
1234	1	100	John
1234	1	101	12-12-1970
1234	1	102	English
5678	33	360	Fulltime
5678	33	361	01-01-2009

ID	Name	Last name	Birthday	Nationality	Marriagal Status
1234	John	Doe	12-12-1970	English	Married
365	John	Cook	01-08-1983	English	Single