DataFlow

From DWH



General

For data to become available to users, it has to pass through the flow from the data providers to the reports, following the defined procedure.


Normally the dataflow is started by the register manager, who decides which checks, if any, are needed, and who validates the new records in the Data Validation Tool before they enter the flow.


During the process the data need to be:

  1. Loaded
  2. Transformed
  3. Aggregated


We use different layers in the dataflow to achieve this:

  • STG – STaGing
  • CDW - Central DataWarehouse
  • DM - Dimensional Model
  • EXP - EXPloitation
  • EXT – EXTernal


Additionally, we distinguish between two types of layers:

  1. The layers where data is stored largely as it arrives, used for the industrialized reports.
  2. The layers where data is transformed and aggregated for analytical and ad-hoc reports.


The independence between these layers reduces the rework when changes in the data flow are needed.


A second distinction we have to make when discussing the dataflow of the data warehouse is the one between:

  • Historical Data: data that was already present before HealthData existed
  • Periodical Data: data loaded by using the HD4DP application


Finally, the data can be pushed into the data flow or can be pulled.


Types of Layers

The programs that load the data into industrialized layers are generic and therefore the same for every

data collection. They are driven by ‘data collection specific’ parameters (metadata).

Dataflow Technical vs Logical Layers 2.png


These industrialized reports are accessible via http://www.healthstat.be. The layers involved are:

  • STG – STaGing
  • CDW - Central DataWarehouse


Up until the CDW layer, the data are stored as they arrive, without manipulations or calculations.


The layers where data is transformed and aggregated for analytical and ad-hoc reports, accessible via http://remote.healthstat.be, are:

  • DM - Dimensional Model
  • EXP - EXPloitation
  • EXT – EXTernal


In the analytical and ad-hoc reporting layers, which are constructed for analysis and reporting purposes, information is added, values are removed, etc., based on requirements from the users of the data collection. Intelligence is added to the data to obtain information.


The programs that load these layers are specific per register.


Database Layers

The data is loaded into different layers. Every layer has a specific purpose.

Dataflow-Technical vs Logical Layers.png

STG – STaGing

  • data is timestamped
  • data is added to the table(s) of the data collection
  • records are validated
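The STG steps above can be sketched as follows. This is a minimal, illustrative Python sketch, not the actual generic load programs; the table layout, field names, and validation check are hypothetical.

```python
from datetime import datetime, timezone

def stage_records(staging_table, new_records, is_valid):
    """Timestamp, append, and validate incoming records (hypothetical schema)."""
    load_ts = datetime.now(timezone.utc).isoformat()
    for record in new_records:
        staged = dict(record)
        staged["load_timestamp"] = load_ts      # data is timestamped
        staged["validated"] = is_valid(record)  # records are validated
        staging_table.append(staged)            # data is added to the table
    return staging_table

# Usage: only records with a patient identifier pass the (hypothetical) check.
stg = []
stage_records(stg, [{"patient_id": "P1"}, {"patient_id": None}],
              is_valid=lambda r: r["patient_id"] is not None)
print([r["validated"] for r in stg])  # [True, False]
```

Note that validation here only flags records; nothing is discarded, which matches the principle that data up to CDW is stored as it arrives.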


CDW - Central DataWarehouse

  • the information receives a validity period. This period indicates that a situation was valid between
a starting and an ending date
  • a change in information results in a new record for this information with a new time slot
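The CDW versioning described above can be sketched as follows: a change in information closes the current record's validity period and opens a new record with a new time slot. The field names and the open-ended sentinel date are illustrative assumptions, not the actual CDW model.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # hypothetical sentinel for "still valid"

def apply_change(history, key, new_value, change_date):
    """Close the open record for `key` (if its value changed) and open a new one."""
    for rec in history:
        if rec["key"] == key and rec["valid_to"] == OPEN_END:
            if rec["value"] == new_value:
                return history              # nothing changed: keep current slot
            rec["valid_to"] = change_date   # close the old time slot
    history.append({"key": key, "value": new_value,
                    "valid_from": change_date, "valid_to": OPEN_END})
    return history

hist = []
apply_change(hist, "hospital-1", "address A", date(2014, 1, 1))
apply_change(hist, "hospital-1", "address B", date(2015, 6, 1))
print(len(hist), hist[0]["valid_to"])  # 2 2015-06-01
```

This keeps the full history: the old record is never deleted, only bounded by an ending date.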


DM - Dimensional Model

  • adding information to the data
  • combining different data sources for one dimension


EXP - EXPloitation

  • this is the basis for the aggregations
  • different data sources are combined for different dimensions
  • contains only valid records
  • used for analysis on detailed data


EXT - EXTernal

  • the data are aggregated
  • we have 2 versions of this layer:
    • internally accessible by Enterprise Guide
    • externally accessible. This is a published version of the first.
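The EXT aggregation above can be sketched like this: detailed EXP records (only the valid ones) are rolled up, here as counts per year. The field names and the grouping are hypothetical examples, not the actual EXT tables.

```python
from collections import Counter

def aggregate_by_year(exp_records):
    """Count valid detailed records per year (illustrative aggregation)."""
    return Counter(r["year"] for r in exp_records if r.get("valid", True))

exp = [{"year": 2014, "valid": True}, {"year": 2014, "valid": True},
       {"year": 2015, "valid": False}]
print(dict(aggregate_by_year(exp)))  # {2014: 2}
```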


The publication of a report is a deliberate act of the person responsible for the data during the Data Validation process.


Historical Data and Periodical Data

There is a distinction between:

  • Historical Data: data that was already present before HealthData existed
  • Periodical Data: data loaded by using the HD4DP application


Historical Data

Data that was already present before HealthData existed is called 'Historical Data'. We need this data to get the historical overview, but unlike the periodical data, which is collected using HD4DP, it only has to be loaded once.


Depending on the register and the requirements, some parts of the layers will not be loaded until the first periodical data arrive.

For example, some registers do not need any reporting yet, but need data to be sent to the data providers as ‘stable’ data, so the data providers do not have to enter name, first name, date of birth, etc. manually in the HealthData data collection.


Historical data can be stored in SAS tables, CSV files, SQL Server tables, etc. Based on the file specifications of the historical data, generic programs load the data into the STG and CDW layers.

DataFlow Historical Data.png


To be able to do this using the DWH, we need to load the data into the CDW layer. Other registers already wanted some reports on the website, so for these we loaded the historical data all the way through to the EXT layer. This action is normally done only once, right before the new data collection starts.


Periodical Data

Periodical data is loaded recurrently by using the HD4DP application, after validation.


Every time a field, a data type or anything else changes in the data collection, a new version has to be created.


The biggest change we will see immediately is of course the change to HealthData. But afterwards, changes can lead to a similar process. The major element to retain from all of this is that we aim to keep complete independence between these two kinds of layers.


DataFlow - Periodical Data.png


The first step in the process is to load the version into the CDW data model. The first time, this will be the historical data.


Because, besides table names and field names, lists of values and field contents can also differ, we have to physically load the data after performing a mapping of field names and field contents.
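The mapping step above can be sketched as follows: before loading a new version, field names and field contents (lists of values) are translated via version-specific mapping tables. All names and mappings below are hypothetical, purely to illustrate the idea.

```python
# Hypothetical version-specific mapping tables.
FIELD_MAP = {"pat_birthdate": "date_of_birth", "sex": "gender"}
VALUE_MAP = {"gender": {"M": "male", "F": "female"}}

def map_record(record):
    """Rename fields and translate coded values before loading into the CDW."""
    mapped = {}
    for old_field, value in record.items():
        new_field = FIELD_MAP.get(old_field, old_field)  # map field name
        value_map = VALUE_MAP.get(new_field, {})
        mapped[new_field] = value_map.get(value, value)  # map field content
    return mapped

print(map_record({"pat_birthdate": "1970-01-01", "sex": "F"}))
# {'date_of_birth': '1970-01-01', 'gender': 'female'}
```

Fields and values without an entry in the mapping tables pass through unchanged.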


Once the data is loaded correctly in the CDW layer, we will need to adapt the programs that load the data into the DM and EXP layers. The step from EXP to EXT should be transparent, because the business questions have not changed.


For all changes in the data model we will have to keep the old fields and tables operational, for auditing purposes:

  • EXT layer: for the logging of the queries used


Push or Pull?

The normal data flow after the staging layer is to pull the data via the Data Validation Tool. But we also foresee the possibility to push the data, or a combination of both. The decision lies with the person responsible for the management of the register.


A push consists of a program that is scheduled periodically (once per night, every hour, …). This program will look for ‘new’ data that has to be loaded into the STG layer. Once loaded, the data waits for the register manager to pull it through the other layers.
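The scheduled push described above can be sketched as follows: each run picks up only the source records created since the previous run and loads them into staging, where they wait to be pulled further. The record layout and the timestamp bookkeeping are hypothetical.

```python
def push_new_records(source, staging, last_load_ts):
    """Load source records newer than the previous run into the STG layer."""
    new = [r for r in source if r["created"] > last_load_ts]
    staging.extend(new)  # loaded records now wait to be pulled onward
    # Remember the newest timestamp seen, so the next run skips these records.
    return max((r["created"] for r in new), default=last_load_ts)

source = [{"id": 1, "created": 10}, {"id": 2, "created": 20}]
stg = []
ts = push_new_records(source, stg, last_load_ts=0)   # first scheduled run
ts = push_new_records(source, stg, last_load_ts=ts)  # next run: nothing new
print(len(stg), ts)  # 2 20
```

The second run loads nothing because no record is newer than the stored timestamp, which is what makes the periodic schedule safe to repeat.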


In the case of the registers concerning implants (Qermids), for instance, the data are generated from a database that contains only validated data. In this case an extra manual validation is not needed, so the data is pushed automatically through the flow.