Data Model Best Practices for Data Warehousing
In our last post we talked about documentation best practices for data warehousing. In this post we’re going to focus on data modeling and the key information you need to know.
The data model is where all of the action takes place. With your documentation and naming conventions in place and your metrics and dimensional attributes formally established, it’s time to begin the design process by creating a dimensional model (star schema) in which the individual fact and dimension records will be stored. The dimensional model is preferred because it is more intuitive and performs better than a fully normalized data warehouse.
We strongly recommend using a dimensional model (Ralph Kimball) over an entirely normalized data model (Bill Inmon) for three primary reasons:
- Understandability – the complexity of a fully normalized data warehouse often makes it difficult for non-technical, business people to comprehend.
- Performance – a normalized data warehouse forces the user to join more tables together to get the answer they are seeking. These joins can consume significant amounts of CPU and memory.
- Extensibility – although it is easier to load one or two source applications into a fully normalized database, it becomes harder to add additional source data to the warehouse because of the impact the change may have on the existing data model.
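To make the star schema concrete, here is a minimal sketch using Python’s standard-library sqlite3 module. The table and column names (`dim_customer`, `fact_sales`, and so on) are hypothetical; the point is that slicing a metric by a dimensional attribute requires only one join per dimension:

```python
import sqlite3

# Minimal star schema sketch: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_state TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT,
    year INTEGER
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    sale_amount REAL
);
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme Corp', 'NY')")
conn.execute("INSERT INTO dim_date VALUES (20240115, '2024-01-15', 2024)")
conn.execute("INSERT INTO fact_sales VALUES (1, 20240115, 250.0)")

# Slice the metric (sale_amount) by a dimensional attribute (state):
total_by_state = conn.execute("""
    SELECT c.customer_state, SUM(f.sale_amount)
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    GROUP BY c.customer_state
""").fetchall()
print(total_by_state)  # [('NY', 250.0)]
```

A fully normalized design would split `dim_customer` into several tables, and the same question would require correspondingly more joins.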
Now that we have decided on the modeling technique, we need to focus on the facts and dimensions that comprise a dimensional data model.
Dimensions are the containers for the attributes upon which you wish to slice and dice your calculated metric. For example, a CUSTOMER dimension may contain the following attributes:
- Customer ID
- Customer Name
- Customer Type
- Customer Age
- Customer Marital Status
- Customer Gender
- Customer City
- Customer State
- Customer Postal Code
- Customer E-Mail Domain
As you can see, the attribute list can come from many sources. In the example above, the data can be derived (denormalized) from a Customer table, an Address table and an Email table yet stored as a single data structure.
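A rough sketch of that denormalization step, using hypothetical source records from the Customer, Address, and Email tables, might look like this:

```python
# Hypothetical normalized source records (one row each from three tables).
customer = {"customer_id": 42, "name": "Jane Doe", "type": "Retail"}
address = {"customer_id": 42, "city": "Austin", "state": "TX", "postal_code": "78701"}
email = {"customer_id": 42, "email": "jane@example.com"}

def build_customer_dimension_row(customer, address, email):
    """Denormalize three source records into a single dimension row."""
    return {
        "customer_id": customer["customer_id"],
        "customer_name": customer["name"],
        "customer_type": customer["type"],
        "customer_city": address["city"],
        "customer_state": address["state"],
        "customer_postal_code": address["postal_code"],
        # Derived attribute: keep only the domain portion of the e-mail address.
        "customer_email_domain": email["email"].split("@", 1)[1],
    }

dim_row = build_customer_dimension_row(customer, address, email)
print(dim_row["customer_email_domain"])  # example.com
```

Note that the E-Mail Domain attribute is derived rather than copied, which is common: dimension attributes often transform source data into a more useful reporting form.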
We can also see that almost all of the attributes for the customer can change over time. Therefore, we need to define how we wish to handle these slowly changing dimensions inside the warehouse.
- Type 1 – overwrite the modified attribute. This technique is only to be used if the reporting and analytics application has no historical reporting requirement.
- Type 2 – keep the existing record and add a new record. To capture this historical data, the design must add Customer Key and Active Record Flag columns. The Customer Key is the surrogate primary key for the table, and the Active Record Flag denotes the currently active record in the dimension for each Customer ID, speeding up the compare process.
- Type 3 – add columns to capture the changing data values and add an effective date to obtain the date of the change. This will allow you to capture at least some historical information, but this is not reliable enough if historical data is critical.
From this we can see that the Type 2 dimension provides the greatest flexibility. We recommend designing all dimensions, except the DATE dimension, as Type 2 dimensions, even if you are certain that a Type 1 dimension will work, because there is no penalty for using a Type 2 dimension. This will give you flexibility should an unforeseen change need to be made in the future.
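The Type 2 mechanics above can be sketched as follows. This is a simplified in-memory illustration, not a production loader; the column names (`customer_key`, `is_active`, and the dates) are assumptions standing in for whatever your naming conventions dictate:

```python
from datetime import date

# Minimal Type 2 slowly changing dimension sketch.
# customer_key is the surrogate primary key; is_active flags the current
# row for each customer_id (the natural/business key).
dim_customer = []  # list of dicts standing in for the dimension table
next_key = 1

def apply_scd2_change(customer_id, attributes, change_date):
    """Expire the current row for customer_id (if any) and insert a new one."""
    global next_key
    for row in dim_customer:
        if row["customer_id"] == customer_id and row["is_active"]:
            row["is_active"] = False          # expire the old version
            row["end_date"] = change_date
    dim_customer.append({
        "customer_key": next_key,             # surrogate primary key
        "customer_id": customer_id,           # natural key from the source
        **attributes,
        "start_date": change_date,
        "end_date": None,
        "is_active": True,                    # speeds up the next compare
    })
    next_key += 1

apply_scd2_change(42, {"city": "Austin"}, date(2023, 1, 1))
apply_scd2_change(42, {"city": "Dallas"}, date(2024, 6, 1))  # customer moved

active = [r for r in dim_customer if r["is_active"]]
print(len(dim_customer), active[0]["city"])  # 2 Dallas
```

Both versions of the customer survive, so facts recorded before the move still join to the Austin row while new facts join to the Dallas row.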
Next you should always try to make your dimensions conformed dimensions. Conformed dimensions are dimensions that can be used with multiple facts and data marts. Here, again, we see the ease of understanding and enhanced performance in the reuse of a single table in various locations.
Finally, snowflaking, the normalization of dimension tables, should be avoided at all costs. Snowflaking introduces the same negative impacts on understandability, performance, and extensibility found in a fully normalized data warehouse. One common cause of snowflaking is the need to express hierarchies within a dimension. Research and use the techniques available to flatten these hierarchies and eliminate the snowflakes from your data warehouse design.
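Flattening a hierarchy is usually a matter of pre-joining the snowflaked tables during the load. Here is a sketch with a hypothetical product → category → department hierarchy collapsed into one flat dimension row per product:

```python
# Hypothetical snowflaked hierarchy, each level in its own normalized table.
departments = {10: "Electronics"}
categories = {100: {"name": "Phones", "department_id": 10}}
products = {1000: {"name": "Acme Phone X", "category_id": 100}}

def flatten_product_dimension(products, categories, departments):
    """Collapse the hierarchy into one flat dimension row per product."""
    rows = []
    for product_id, product in products.items():
        category = categories[product["category_id"]]
        rows.append({
            "product_id": product_id,
            "product_name": product["name"],
            "category_name": category["name"],
            "department_name": departments[category["department_id"]],
        })
    return rows

flat = flatten_product_dimension(products, categories, departments)
print(flat[0]["department_name"])  # Electronics
```

The hierarchy levels become ordinary attributes of a single dimension table, so queries roll up by category or department without any extra joins.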
Now that we have the dimensions defined, we need to design the facts. A fact table is the foundation of a data warehouse as it captures the values of the data you are measuring at a point in time. Fact tables can be very large, containing millions to billions of records depending on the grain of the table. The grain of the fact table is what you are measuring. Examples of a fact grain are:
- Line items of products on a bill
- Insurance policy claims
- Temperature measurements from multiple locations each hour of the day
There are three kinds of fact tables: transaction, periodic snapshot and accumulating snapshot.
- The transaction grain records a measure taken at a single instant in time, for instance, the entrance and exit of a building. Transaction fact tables are the largest fact tables, often holding millions to billions of records.
- The periodic snapshot reports the roll-up of the transaction fact details to a predefined period of time (month, quarter, year, etc.). This periodic snapshot fact table is often created to meet the needs of senior executives and enhance reporting performance. The periodic snapshot fact table should be derived from a corresponding transactional fact table so that drill-down functionality can be used if necessary.
- Accumulating snapshot facts define the grain at a high level and overwrite the measures as the values of the measures change over time. Accumulating snapshot facts are the smallest of the fact tables as they have no detailed history to provide the user.
We recommend using the combination of the transaction grain with the periodic snapshot grain as the default design method and using the accumulating snapshot where necessary.
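Deriving the periodic snapshot from the transaction grain is a straightforward rollup. A minimal sketch, assuming hypothetical transaction-grain rows with a date and an amount, and a monthly snapshot period:

```python
from collections import defaultdict

# Hypothetical transaction-grain fact rows: one row per individual sale.
transaction_fact = [
    {"date": "2024-01-05", "amount": 100.0},
    {"date": "2024-01-20", "amount": 50.0},
    {"date": "2024-02-03", "amount": 75.0},
]

def build_monthly_snapshot(transaction_fact):
    """Derive a periodic (monthly) snapshot by rolling up the transaction grain."""
    snapshot = defaultdict(float)
    for row in transaction_fact:
        month = row["date"][:7]  # YYYY-MM
        snapshot[month] += row["amount"]
    return dict(snapshot)

monthly = build_monthly_snapshot(transaction_fact)
print(monthly)  # {'2024-01': 150.0, '2024-02': 75.0}
```

Because the snapshot is derived rather than loaded independently, an executive looking at a suspicious monthly total can drill back down to the individual transactions behind it.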