Dimension is a term in data management and data warehousing that refers to logical groupings
of data such as geographical location, customer
information, or product information. Slowly Changing Dimensions (SCDs) are
dimensions that have data that changes slowly, rather than changing on a
time-based, regular schedule.
In a transaction system, many a times the change is overwritten and track of change is lost. However, a data warehouse needs to maintain all the history as the key benefit of a warehouse is to provide historical information to analyze the trend.
Below SCD Types are the most implemented methods to handle these changing dimensions in a warehouse. Let’s understand these with an example below
Example:
Let’s see how this scenario is managed with different SCD types.
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is obviously done, when we are not analyzing the historical information.
SCD Type 2:
The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
This is implemented using a version column or through effective date columns to know the active record.
Implemented through a Version Column
Implemented through Effective Start and End Date Columns
A new record is added every time there is a change in the source with the version or Effective Date columns updating accordingly.
SCD Type 3:
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns designated for storing historical data and will have only the recent historical change. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 adds additional columns to the tables
Implemented through additional Original Columns
Original Columns have been added to capture the most recent historical change.
Additional SCDs which are occasionally used:
SCD Type 0:
In a transaction system, many a times the change is overwritten and track of change is lost. However, a data warehouse needs to maintain all the history as the key benefit of a warehouse is to provide historical information to analyze the trend.
Below SCD Types are the most implemented methods to handle these changing dimensions in a warehouse. Let’s understand these with an example below
Example:
You have a dimensional table with customer
ID ' C01' with marital status as 'single' mentioned below. Overtime, customer
gets married and also moves to a new location.
Let’s see how this scenario is managed with different SCD types.
Initial Data Record:
Surrogate Key (Surrogate Key)
|
Customer ID (Natural Key)
|
Date Valid
|
Marital Status
|
Date of Birth
|
City
|
100
|
C01
|
Jan 23, 2008
|
Single
|
Jan8, 1982
|
Palo Alto
|
SCD
Type1:
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is obviously done, when we are not analyzing the historical information.
Surrogate Key
|
Customer ID
|
Date Valid
|
Marital Status
|
Date of Birth
|
City
|
100
|
C01
|
July 7,2012
|
Married
|
Jan8, 1982
|
Francisco
|
The record is simple over-written and no history is maintained here.
SCD Type 2:
The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
This is implemented using a version column or through effective date columns to know the active record.
Implemented through a Version Column
Surrogate Key
|
Customer ID
|
Date Valid
|
Marital Status
|
Date of Birth
|
City
|
Version
|
100
|
C01
|
Sept 23, 2004
|
Single
|
Jan8, 1982
|
Palo Alto
|
0
|
101
|
C01
|
Sept 23, 2004
|
Married
|
Jan8, 1982
|
Francisco
|
1
|
Implemented through Effective Start and End Date Columns
Surrogate Key
|
Customer ID
|
Date Valid
|
Marital Status
|
Date of Birth
|
City
|
Start-Date
|
End-Date
|
100
|
C01
|
Sept 23, 2004
|
Single
|
Jan8, 1982
|
Palo Alto
|
01-Sep-2000
|
23-Sep-2004
|
101
|
C01
|
Sept 23, 2004
|
Married
|
Jan8, 1982
|
Francisco
|
24-Sep-2004
|
31-12-9999
|
A new record is added every time there is a change in the source with the version or Effective Date columns updating accordingly.
SCD Type 3:
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns designated for storing historical data and will have only the recent historical change. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 adds additional columns to the tables
Implemented through additional Original Columns
Surrogate Key
|
Customer ID
|
Date Valid
|
Original Marital Status
|
Marital Status
|
Date of Birth
|
Original City
|
City
|
100
|
C01
|
Sept 23, 2004
|
Single
|
Married
|
Jan8, 1982
|
Palo Alto
|
Francisco
|
Original Columns have been added to capture the most recent historical change.
Additional SCDs which are occasionally used:
SCD Type 0:
· The Type 0 method is a passive approach to
managing dimension value changes, in which no action is taken.
· Values remain as they were at the time of the
dimension record was first entered.
SCD Type 4
· The Type 4 method is usually referred to
as using "history tables", where one table keeps the current data,
· An additional table is used to keep a
record of some or all changes.
SCD Type 6 / hybrid:
· The Type 6 method combines the approaches
of types 1, 2 and 3.
· This method is also called as “Unpredictable
Changes with Single-Version Overlay" in The Data Warehouse
Now that we have a good learning experience on slowly
changing dimensions (SCD) and how to design tables as per the requirement to
maintain history or not for the important business columns. We now move on to a
new feature Introduced in SQL server 2005 and is being used extensively for
large volume tables from SQL server 2005 to SQL server 2008/2012.