Data Validation

Concept

Data needs to be validated by the register manager or the researchers responsible for the register. Data Validation happens in the Data Quality Control step of the Data Warehouse part of the Healthdata cycle. In this step the data need to be approved, corrected or annotated where needed, before they become visible in the data warehouse for storage, analysis or reports.


[Image: Project+Cycle.png]


There are 2 ways to do these validations: automatically via the Rule Engine, or with the Data Validation Tool when custom checks are needed.


When no custom checks are needed, the semantic checks and, if there are any, the logical checks defined in the Rule Engine can be automated.


It is important to know that the use of the Rule Engine represents the first step in the Data Validation Tool. So when this tool is used, both the predefined and the custom checks are performed when executing all the steps foreseen in the Data Validation project for this register.
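
As a rough illustration only (the Rule Engine's actual configuration format is not shown here), such automated, predefined checks can be thought of as a small list of declarative rules applied to each incoming record; the rule names and the record fields below are assumptions, not the real rules of any register:

  # Hypothetical sketch of predefined checks in the style of a rule engine;
  # the rule names and the record structure are illustrative assumptions.
  RULES = [
      ("name_has_no_digits", lambda r: not any(c.isdigit() for c in r["name"])),  # semantic
      ("age_in_valid_range", lambda r: 0 <= r["age"] <= 130),                     # logical
  ]

  def run_predefined_checks(record):
      """Return the names of the rules this record violates (empty list = record passes)."""
      return [name for name, rule in RULES if not rule(record)]

  print(run_predefined_checks({"name": "Alice", "age": 34}))  # []
  print(run_predefined_checks({"name": "B0b", "age": 150}))   # ['name_has_no_digits', 'age_in_valid_range']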


Timing of the Validation

For data to become available to users, they have to pass through the flow from the data providers to the reports. The dataflow is started by the register manager, who decides which validations (if any) are needed and who has to validate the new records in the Data Validation Tool before they enter the flow.


Before these manual checks are executed, automated checks can be defined.


The data need to be:

  1. Loaded
  2. Transformed
  3. Aggregated


We use different layers in the dataflow to do so:

  • STG – STaGing
  • CDW – Central DataWarehouse
  • DM – Dimensional Model
  • EXP – EXPloitation
  • EXT – EXTernal


The validation happens when the data is moved from:

  • the source to the STG layer for the automatic validations or predefined checks
  • the STG to the CDW layer for the custom analysis and validations
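
A minimal sketch of these two validation points in Python; the function names, the record fields and the acceptance rule are assumptions made for illustration, not the actual Data Validation Tool:

  # Hypothetical sketch: automated checks run when loading the source into STG,
  # the researcher's custom validation gates the STG -> CDW load.
  def load_source_to_stg(source_records, predefined_checks):
      """Apply the automated, predefined checks and flag records that fail any of them."""
      return [
          {**record, "flagged": not all(check(record) for check in predefined_checks)}
          for record in source_records
      ]

  def load_stg_to_cdw(stg_records, is_accepted):
      """Only records accepted during the custom, manual validation are loaded further."""
      return [record for record in stg_records if is_accepted(record)]

  # Example with one simple logical check and a trivial acceptance rule.
  checks = [lambda r: 0 <= r["age"] <= 130]
  staged = load_source_to_stg([{"age": 42}, {"age": 150}], checks)
  print(load_stg_to_cdw(staged, is_accepted=lambda r: not r["flagged"]))  # [{'age': 42, 'flagged': False}]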


Types of checks

Records that do not meet the requirements of the defined checks are flagged.


Additionally, researchers can compare new data with existing records within the register, so that the data can be examined and flagged exhaustively.

Researchers accept or reject these flagged records. Rejected records will not be loaded further in the flow.
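
A hedged sketch of that accept/reject step; the flags, the record structure and the decisions mapping are invented stand-ins for the choices the researcher makes in the Data Validation Tool:

  # Hypothetical sketch of the accept/reject step on flagged records; rejected
  # records are simply not passed on for further loading.
  flagged_records = [
      {"id": 1, "flag": "age out of range"},
      {"id": 2, "flag": "name contains digits"},
  ]
  decisions = {1: "accept", 2: "reject"}  # keyed by record id, set by the researcher

  to_load_further = [r for r in flagged_records if decisions.get(r["id"]) == "accept"]
  print(to_load_further)  # [{'id': 1, 'flag': 'age out of range'}]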


[Image: Data Quality Control2.png]


We distinguish 4 types of checks, each with a different purpose (each is illustrated in the sketch following this list):

  • Semantic checks: automatic checks on whether a field has the right format
E.g. a subject's name must not contain numbers
  • Logical checks: checks on combinations of values and/or the acceptable range of a parameter. These checks can be done manually using the
validation tool or can be automated by adding them to the Rule Engine
E.g. age should be between 0 and 130 years
  • Cross-sectional checks: these checks allow the validator to uncover statistical outliers by comparing a value with the other records in the register
E.g. a subject's length that lies far outside the lengths recorded for the other subjects
  • Longitudinal checks: these allow validators to compare new results with historical data
E.g. previously the length was 1m57, now the length is 1m75
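
A minimal sketch of what each type of check could look like in code; the record fields, the reference values and the thresholds are illustrative assumptions, not part of the actual Rule Engine or Data Validation Tool:

  from statistics import mean, stdev

  record = {"name": "Alice", "age": 34, "length_m": 1.75}

  # Semantic check: the field has the right format (the name contains no digits).
  semantic_ok = not any(ch.isdigit() for ch in record["name"])

  # Logical check: the value lies in an acceptable range (age between 0 and 130).
  logical_ok = 0 <= record["age"] <= 130

  # Cross-sectional check: compare the value with the other records in the
  # register to spot statistical outliers (here: within 3 standard deviations).
  other_lengths = [1.62, 1.70, 1.68, 1.81, 1.58]
  cross_sectional_ok = abs(record["length_m"] - mean(other_lengths)) <= 3 * stdev(other_lengths)

  # Longitudinal check: compare the new result with this subject's historical
  # data (previously 1m57, now 1m75 -- a jump the validator may want to review).
  previous_length_m = 1.57
  longitudinal_ok = abs(record["length_m"] - previous_length_m) <= 0.10  # arbitrary threshold

  print(semantic_ok, logical_ok, cross_sectional_ok, longitudinal_ok)  # True True True False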


The Data Validation Process

To ensure the data validation is executed correctly, the validation tool first executes the load from the data source to the STG layer, including the automated predefined checks. The researcher then needs to complete the steps for the custom, and therefore manual, validation before the data are loaded from the STG to the CDW layer.


In short, we need to be sure that the researcher:

  1. Validates semantic and logical errors detected during predefined checks
  2. Performs cross-sectional and longitudinal analyses and validation on non-error data
  3. Consolidates all data validations
  4. Starts the load process for further loading
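
A sketch of that sequence as a simple checklist that must be completed before the further load starts; the step names mirror the list above, while the gating logic itself is an assumption made for illustration:

  # Hypothetical sketch: the further load only starts once every validation step
  # of the Data Validation project has been completed by the researcher.
  steps = {
      "validate_predefined_errors": True,        # 1. semantic and logical errors handled
      "cross_sectional_and_longitudinal": True,  # 2. analyses done on the non-error data
      "consolidate_validations": True,           # 3. all data validations consolidated
  }

  def start_further_load(completed_steps):
      """4. Start loading the validated data further through the layers."""
      if not all(completed_steps.values()):
          missing = [name for name, done in completed_steps.items() if not done]
          raise RuntimeError(f"cannot start the load, open steps: {missing}")
      print("load from STG to CDW started")

  start_further_load(steps)  # prints: load from STG to CDW started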


When no manual checks are needed, the semantic checks and, if there are any, the logical checks defined in the Rule Engine can be automated.


These steps result in the upload of validated data through the layers into the production environment. As noted above, the use of the Rule Engine represents the first step in the Data Validation Tool: when this tool is used, both the predefined and the custom checks are performed when executing all the steps foreseen in the Data Validation project for this register.