Wednesday 17 September 2014

Slowly Changing Dimensions (SCD) in Dimensional Modeling

Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.  

In a transaction system, many a times the change is overwritten and track of change is lost.  However, a data warehouse needs to maintain all the history as the key benefit of a warehouse is to provide historical information to analyze the trend.

Below SCD Types are the most implemented methods to handle these changing dimensions in a warehouse. Let’s understand these with an example below

Example:

You have a dimensional table with customer ID ' C01' with marital status as 'single' mentioned below. Overtime, customer gets married and also moves to a new location.

Let’s see how this scenario is managed with different SCD types.

Initial Data Record:

Surrogate Key (Surrogate Key)
Customer ID (Natural Key)
Date Valid
Marital Status
Date of Birth
City
100
C01
Jan 23, 2008
Single
Jan8, 1982
Palo Alto

SCD Type1:

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is obviously done, when we are not analyzing the historical information.

Surrogate Key
Customer ID
Date Valid
Marital Status
Date of Birth
City
100
C01
July 7,2012
Married
Jan8, 1982
Francisco

The record is simple over-written and no history is maintained here.

SCD Type 2:

The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.

This is implemented using a version column or through effective date columns to know the active record.

Implemented through a Version Column

Surrogate Key
Customer ID
Date Valid
Marital Status
Date of Birth
City
Version
100
C01
Sept 23, 2004
Single
Jan8, 1982
Palo Alto
0
101
C01
Sept 23, 2004
Married
Jan8, 1982
Francisco
1

Implemented through Effective Start and End Date Columns

Surrogate Key
Customer ID
Date Valid
Marital Status
Date of Birth
City
Start-Date
End-Date
100
C01
Sept 23, 2004
Single
Jan8, 1982
Palo Alto
01-Sep-2000
23-Sep-2004
101
C01
Sept 23, 2004
Married
Jan8, 1982
Francisco
24-Sep-2004
31-12-9999

A new record is added every time there is a change in the source with the version or Effective Date columns updating accordingly.

SCD Type 3:

The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns designated for storing historical data and will  have only the recent historical change. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 adds additional columns to the tables

Implemented through additional Original Columns

Surrogate Key
Customer ID
Date Valid
Original Marital Status
Marital Status
Date of Birth
Original City
City
100
C01
Sept 23, 2004
Single
Married
Jan8, 1982
Palo Alto
Francisco

Original Columns have been added to capture the most recent historical change.

Additional SCDs which are occasionally used:

 SCD Type 0:

·    The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken.

·    Values remain as they were at the time of the dimension record was first entered.

SCD Type 4

·    The Type 4 method is usually referred to as using "history tables", where one table keeps the current data,

·    An additional table is used to keep a record of some or all changes.

SCD Type 6 / hybrid:

·     The Type 6 method combines the approaches of types 1, 2 and 3.

·     This method is also called as “Unpredictable Changes with Single-Version Overlay" in The Data Warehouse

Now that we have a good learning experience on slowly changing dimensions (SCD) and how to design tables as per the requirement to maintain history or not for the important business columns. We now move on to a new feature Introduced in SQL server 2005 and is being used extensively for large volume tables from SQL server 2005 to SQL server 2008/2012.