“A
staging area is an intermediate storage area between the sources of information and the
Data Warehouse (or Datamart). — WIKI
Scenario B:
Diagrammatic representation to ease the understanding:
On the first look it always seems – Why do we need the staging
area? Why can’t we simply load the data from the source system to DWH
directly???
Though there are several reasons behind this, let us look into some
major ones:
Scenario A:
Consider a
workflow, where a DWH integrates data from various transaction systems.
We know that the
data has to undergo several transformation/processing while loading into a Data
Warehouse;
In the absence of
a staging area, the data load will have to go from the OLTP system to the OLAP
system directly.
This can severely
hamper the performance of the OLTP system as the complex transformative queries
will use the Source system resources for a long time which in turn can cascade
in worrying the End users of that Transaction system.
Pulling directly from an active
Transaction Source with the advent of Staging Area
Scenario B:
In an enterprise
scenario, we know that warehousing the data involves integrating multiple
source systems.
The challenge
becomes much more steeper when we have a business requirement of joining the
data across systems.
Though this can be
achieved using the ETLs to some extent, it becomes a walk-in-the-park if we
know that all the data is consolidated and stored in a common database.
The flexibility is
especially high when we have Flat-file Sources in the scheme of things.
Scenario C:
Timing
reasons
Due to varying business cycles, data processing cycles,
hardware and network resource limitations and geographical factors, it is not
feasible to extract all the data from all Operational databases at exactly the
same time.
For example:
a. It
might be reasonable to extract sales data on a daily basis, however, daily
extracts might not be suitable for financial data that requires a month-end
reconciliation process.
b. Or
say, we need to extract data from a database in Tokyo and integrate the
contents coming from a DB in Toronto. The extraction time will be different for
both the databases to make sure we extract the latest/correct data. Staging
proves to be the ideal foil as a Temporary Storage area for the Tokyo data.
Summary and Additional info:
·
To
gather data from different sources that will be ready to process at different
times.
·
To
quickly load information from the operational database, freeing it up as soon
as possible. All the needed transformations can then occur without interfering
with the operation.
·
Because
staging is generally a direct load from Source, in some scenarios, staging data
comes in handy to analyze Source data issues (in cases where we fear that
source may have refreshed).
·
Since staging
area might not be persistent i.e., once data is used for the further process it
can be deleted.