Sunday, 31 August 2014

Types of Dimensions and Facts in a Data Warehouse

I found this documentation in http://bintelligencegroup.wordpress.com/ and it was so helpful in my initial days of learning data warehouse, million thanks to all the bloggers/writers across the world for educating and enlightening the world with their writings.. happy learning!!


Dimension:

A dimension table typically has two types of columns, primary keys to fact tables and textual\descriptive data.
Fact:

A fact table typically has two types of columns, foreign keys to dimension tables and measures\metrics that contain numeric facts. A fact table can contain fact’s data on detail or aggregated level.

Types of Dimensions:
Slowly Changing Dimensions:
Attributes of a dimension that would undergo changes over time. There are quite a few SCD techniques (Type 1, Type 2, Type 3.. ) on  how these changes should be handled\preserved in the data warehouse. This is called a Slowly Changing Attribute and a dimension containing such an attribute is called a Slowly Changing Dimension. 
Rapidly Changing Dimensions:
A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don’t need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.
Junk Dimensions:
A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions.
Inferred Dimensions:
While loading fact records, a dimension record may not yet be ready. One solution is to generate an surrogate key with Null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.
Conformed Dimensions:
A Dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.
Degenerate Dimensions:
A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.
Role Playing Dimensions:
A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both Ship Date and Delivery Date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of Role Playing dimension.
Shrunken Dimensions:
A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension.
Static Dimensions:
Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.
Types of Facts -
Additive:
Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example for additive fact.
Semi-Additive:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.
Non-Additive:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts which have percentages, ratios calculated.
Factless Fact Table:
 In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.
Eg: A fact table which has only product key and date key is a factless fact. There are no measures in this table. But still you can get the number products sold over a period of time.
Based on the above classifications, fact tables are categorized into two:
Cumulative:
This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.
Snapshot:
This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

What are facts and dimensions in a data warehouse

FACTS & DIMENSIONS

Data warehouses are built using dimensional data models which consist of Fact and Dimension tables”.

In simpler words, Facts and Dimensions are the two major categories into which we divide our entities (tables) in a DWH.

The architecture and efficiency of the DWH depends on how we identify/categorize/design the Facts and Dimensions.

Before getting into any further technical explanation of what facts and dimensions actually are, let us look at the following statement:

“It is 10122 runs!!!

Though we know that 10122 runs is something related to cricket, we have no clue what the above statement represents.


However, there is a whole new meaning when we say that “
10122 are the total runs scored by Sunil Gavaskar in Test cricket”.

Let’s relate this: “10122 runs” is a raw FACT, and it has no meaning until we mention the DIMENSIONS which are the Player-Name (Gavaskar) and the Format (Test cricket).

 Here is another example that we encounter in an enterprise:
 “Quickie-Mart registered Total Sales of 2.2 million on 12th-Apr-2011”.

Based on the earlier scenario, we can make out that, here - “Total-Sales” is a Fact whereas “Quickie-Mart” (Organization Dimension) and “12th-Apr-2011” (Date dimension) are the dimensions.

We can segregate the above information by adding further dimensions.

Example:

·  Quickie-Mart registered Total-Sales of $1.3 million on Product A on 12th-Apr-2011 --- (adding a product Dimension)
·  Quickie-Mart registered Total-Sales of $0.5 million of Product A in the Asia-Pacific region on 12th-Apr-2011 --- (adding a Sales-Geography Dimension)
 
After the above examples we can deduce the following statements:

Fact Tables consist of the measurements, metrics or facts of a business process.
Dimension Tables provide a structured information to quantify (give a meaning to) those facts.

Addition Info:

·   Fact tables form the Center of Data Warehouse surrounded by Dimensions. Each fact table would have a Foreign-key that would have a corresponding Primary key in the dimension tables.

·    Fact table keeps on growing in size whereas the growth of Dimensions is generally slower (There are number of Sales transactions in a day but cannot imagine product category to grow at that pace).

·    Since the fact tables keep on growing in size very rapidly, it is desirable that the data redundancy is reduced to the max, hence the fact tables are normalized to the maximum extent, and however it may not be the same with Dimension tables.

* In the coming posts we shall look into other aspects of DWH with more examples which shall serve in getting more insights into Facts and Dimensions.

What is a Data Warehouse, how is it different from a traditional database

What is a Data Warehouse ?

To keep it simple, let us take Bill Immon’s definition of the Data Warehouse (one of earliest authors on DWH):

A Data Warehouse is a central repository for the Entire Enterprise”.

As we know, the end product of Business Intelligence are the reports/dashboards that assist the stakeholders in analyzing the trends and help them take decisions. But it is important that these reports are projected on data that is accumulated over a period of time (historical data) and not just recent data or data from a single system. In other words, these reports would make sense only when they are created on top of integrated and historical data. Considering the fact that Organizations these days have their data scattered across multiple systems, platforms and formats, DWH’s role becomes all the more crucial as it acts as a one stop destination.
  
How is it different from a traditional database?

Following are the differences between Operational/Transactional systems and a DWH:
    
Operational/Transactional database
Data Warehouse

It deals with operational data i.e. data involved in the operation of a particular system and it is characterized by a large number of short on-line transactions

Features
§ Designed Primarily for fast INSERTs, UPDATEs and DELETEs.
§ Quick Singular Transactions.
§ Operational/transactional Data
§ Application specific DBs

It is a collection of Historical and integrated Data accumulated from various operational systems. 


Features
§ Designed Primarily for : SELECTs
§ Fast Reading of Large Data Sets
§ Historical Data or Archival Data
§ Integrated data set with a global relevance

Now that we got to know that a Data Warehouse is the most important Cog in the BI (Business Intelligence) wheel, it is time to get our hands dirty and scan deeper into the Data Warehouse (DWH).

In next posts, we shall begin with the core concept that drives the functionality and architecture of a Data Warehouse. 

Friday, 29 August 2014

SQL - Query Tuning Techniques

While dealing with performance problems, database professionals tend to focus on technical aspects of the system such as resource queues, resource utilization and so on.

SQL Server introduced a component called data collector that collects different sets of data related to performance and other from different sources and store it in a relational warehouse known as management data warehouse.

The data collector installs three system data collection sets that collect disk usage, server activity, and query statics information.

The data collection platform helps you automate the collection of performance and other information and also analyze it graphically with preconfigured  reports.


Monday, 25 August 2014

Metadata Management - Crawler Query - To get schema info from a list of servers

Please find further details on usage in my earlier posts and below is the code snippet to get schema information:
SET NOCOUNT ON
DECLARE @AllTables TABLE (
ServerName sysname,
DATABASE_NAME sysname,
TABLE_SCHEMA sysname )

DECLARE @SQLTemplate nvarchar(1500)
SET @SQLTemplate=
N' SELECT @@SERVERNAME AS Servername'
+ N', ''?'' AS DATABASE_NAME'
+ N', B.Name AS TABLE_SCHEMA FROM [?].sys.schemas B WHERE B.schema_id=1 OR B.schema_id BETWEEN 5 AND 16383;'        
  INSERT INTO @AllTables (ServerName ,
DATABASE_NAME,
TABLE_SCHEMA)

EXEC sp_msforeachdb @SQLTemplate

SET NOCOUNT OFF

SELECT * FROM @AllTables
WHERE DATABASE_NAME NOT IN ('master','model','msdb','tempdb')
ORDER BY DATABASE_NAME, TABLE_SCHEMA

Thursday, 21 August 2014

Metadata Management - Crawler Query - To get complete tables and views info from a list of servers

To fetch list of tables in a server usually we use sys.tables, but if your IT infrastructure is scaled to multiple servers we need to collect this information from all the serves in given IT landscape. 

Usually in metadata management you capture this information in to a table to do your data analysis and reporting, you either create "Central Management Servers" to run your query at a single go in all these servers or create an SSIS data flow task using dynamic source connection manager to connect all these serves and pull information into your metadata tables.

In both cases a plain select * from sys.tables might not help, you get the results in different windows for each server.

Use below piece of code to get the results in a single plain window or place this in your SSIS OLEDB Source - SQL Command option to pull and push this information straight into your underlying tables:

SET NOCOUNT ON

DECLARE @AllTables TABLE (

ServerName sysname,
DataContainerName sysname,
SchemaName sysname,
EntityName sysname,
EntityType sysname,
EntityCreatedDate sysname,
EntityUpdatedDate sysname,
EntitySizeInKB sysname,
EntityRowCount sysname)

DECLARE @SQLTemplate nvarchar(1500)

SET @SQLTemplate=

N' SELECT @@SERVERNAME AS Servername'
+ N', ''?'' AS DATABASE_NAME'
+ N', B.Name AS TABLE_SCHEMA'
+ N', A.Name AS TABLE_NAME'
+ N', A.Type_Desc AS TABLE_TYPE'
+ N', A.Create_Date AS CREATED_DATE'
+ N', A.Modify_Date AS MODIFIED_DATE'
+ N', ISNULL(SUM(E.used_pages)*8096/1024,0) AS Table_Size'
+ N', sum(case when (D.index_id < 2) and (E.type = 1) then D.rows else 0 end) AS Table_RowCount FROM [?].sys.objects A'
+ N' INNER JOIN [?].sys.schemas B'
+ N' ON A.schema_id=B.schema_id AND B.Name NOT IN (''temp'',''tmp'')'
+ N' LEFT JOIN [?].sys.tables C'
+ N' ON A.object_id = C.object_id'
+ N' LEFT JOIN [?].sys.partitions D'
+ N' ON D.object_id = C.object_id'
+ N' LEFT JOIN [?].sys.allocation_units E'
+ N' ON D.partition_id = E.container_id WHERE A.type IN (''U'',''V'')'
+ N' GROUP BY C.Name, B.Name, A.name,A.Type_Desc, A.Create_Date, A.Modify_Date'
+ N' ORDER BY A.name;'

INSERT INTO @AllTables (ServerName ,

DataContainerName ,
SchemaName ,
EntityName ,
EntityType ,
EntityCreatedDate ,
EntityUpdatedDate,
EntitySizeInKB,
EntityRowCount)

EXEC sp_msforeachdb @SQLTemplate

SET NOCOUNT OFF

SELECT * FROM @AllTables
WHERE DataContainerName NOT IN ('master','model','msdb','tempdb')
ORDER BY DataContainerName, SchemaName , EntityName

I am filtering out the results (skipping master tables and few attributes) based on my need you can play around with them accordingly.

Happy Learning.. :)