Dimensional modeling – architecture and terminology

by Joakim Dalby (danish Dimensionel modellering, Datamodellering, Datavarehus)

I find knowing the theory really helps the practice.

 

1. Introduction

»The most valuable commodity I know of is information« says Gordon Gekko in the movie Wall Street from 1987. »Knowing is good, but knowing everything is better« says the CEO in the movie The Circle from 2017. »Information at your fingertips« was a concept by Bill Gates in 1990.

 

Purpose: Let's enable business users to make better decisions faster.

 

Information has become a major asset for any orga­nization. The book Das Kapital by Karl Marx could today get the title Die Information.

 

A data warehouse is the way to provide information to the business users, add value to the business and reuse data in new ways to increase revenue.

 

A data warehouse contains data from several operational systems, and most important is that data from the source systems must be integrated, because integration of data is the keyword to represent an enterprise data platform with data assets.

 

When a source system database forgets, a data warehouse remembers everything.

 

A data warehouse is a separate system, so business user queries and analysis will not slow down or add workload to the operational systems.

 

»A lot of you are going to have to make decisions above your level. Make the best decision that you can with the information that’s available to you at the time, and, above all, do the right thing.« says Atul Gawande.

 

Star schema for business users to do their analytics

 

Data is stored in a data warehouse through an extract, transform and load process, where data is extracted from operational systems, transformed into high-quality data, checked for compliance with business rules and enriched with business user data before being loaded into a data warehouse system.

 

Data quality isn't everything. It’s the only thing.

 

Gartner formulates that the platform for data and analytics is the heart of the digital business. Here you will find data management and the company's governance for data and analysis. Data is the raw material in our digital age. Data is collected, cleaned, controlled, optimized, shared and used, and our ability to work flexibly and intelligently at this »assembly line« is growing fast. Data management is crucial to tomorrow's business success. Companies must shape the technology and the data resource both short-term and long-term for the sake of both the existing business and the future's business. Now, the actual data platform is, with good reason, a focus area for the companies. In a data-centric world, tools, processes and competencies to handle data are absolutely crucial. All of today's greatest technological trends, ranging from the Internet of Things IoT and robotics to artificial intelligence and virtual reality, are ultimately dependent on data managers, architects and company data strategies.

 

»That's all data is. A gift from yesterday that you receive today to make tomorrow better.« Jon Acuff: Finish: Give Yourself the Gift of Done, page 126.

 

Performance Management Capabilities required:

  • Planning – What do I want to happen?
  • Forecasting – What will happen?
  • Reporting – What happened?
  • Dashboard – What is happening?
  • Analytics – Why did it happen?

 

A data warehouse tells you what has happened (lag measures) - not why it has happened or what is going to happen (lead measures). »Description is not analysis.« Dr. Donald J. Wheeler: Making sense of data.

 

The four levels of analytics to get better data skills and data literacy are:

  • Descriptive analytics uses dashboards, scorecards and reports, i.e. reporting what happened in the past or is currently happening (facts). Helps a business make more money.
  • Diagnostic analytics is finding why things are happening or why it happened that way (insights). Helps a business stop an inefficient process and save money.
  • Predictive analytics predicts what is likely to happen in the future, a future result with data, by using machine learning, statistics and algorithms, not a crystal ball. Helps a business capitalize on future trends and make tons more money.
  • Prescriptive analytics prescribes what should be done with the data and analytics and how to optimize a process. It means that organizations will get answers that will help them know what business decisions should be made. Helps a business reach its full potential through data-driven decision-making (in danish Datadrevet beslutningstagning).

 

There are many gaps that need to be filled in by a different kind of insight than the one that can be delivered in a data flow task from multiple source systems to an enterprise data warehouse, but let's start with what is a data ware­house.

 

Data warehouse is

 

  • Subject-oriented because data from the business is merged into areas of subjects and not organized around the functional applications of the business.
  • Integrated because multiple operational source systems use common and consistent business names, definitions, formats, data types and units.
  • Time-variant because historical data is kept together with current data at any given time. Time variance records change over time, where the items of data are timestamped to inform when the data was registered. In contrast, a source system will often only contain current data that is correct at the moment.
  • Non-volatile because data does not change and historical data will never be altered.
  • Unite, interconnect, compile, conform and consolidate data into a common format.
  • Enrichment and processing of data for providing new knowledge and wisdom.
  • Collection of data in support of management's decision-making process and analyzing the business with business performance measurement qualifiers and KPI Key Performance Indicator used as a measure of how a company is doing.
  • Utilizing dimensional modeling, a data model for supporting analyses where business users and analysts can easily understand and navigate the data structure and fully exploit the data for self-service BI.
  • Goal to support slice and dice of data for business analysis and insight.
  • Data-driven to become enjoyable and pleasurable to work with for the business users, and it is important for business and data analysts to have the right data, in the right form, at the right time, so they can turn data into insight and achieve business intelligence.
  • Compliance requirement to provide proof that the reported numbers are accurate and complete data quality by having an audit trail for reconciliation check etc.
  • Driven by business requirements specification and user stories from stakeholders and business users. A data warehouse must have focus on the needs of the business.

 

Data from multiple operational source systems and source legacy systems is stored in tables in a relational database called a data warehouse. Data will be merged and integrated inside a data warehouse to get a consistent format and use the same naming.

Transversal data warehouse, it cannot be said more tellingly (in danish Tværgående datavarehus, mere sigende kan det ikke siges).

 

Uniformity means that within the data warehouse the common data ele­ments of multiple source systems are referred to in a consistent manner.

 

Conforming means that within the data warehouse a conformed column is created with a single and consistently labelled name that has the same common meaning and identical contents all over the data warehouse, e.g. same spelling and unit of measurement (in danish ensartet).

 

For example, IBM or I.B.M. or International Business Machines Corporation.

 

For example, pro­duct data can be found in multiple source systems with diffe­rent names of co­lumns, different spelling of product names and different segmentations of pro­ducts, there­fore a data ware­house will unite, interconnect, compile, conform and consolidate the data into one product table and one product name column to make it easy for the business users to make a list of all products. For example, the length of a pipe is registered as 55 cm. in one source system and as 0.55 m. in another source system, but in a data ware­house the values become conformed to 550 mm.
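
A minimal sketch in sql of such unit conforming, assuming a hypothetical staging table StagingInput.Pipe with columns LengthValue and LengthUnit of my own naming:

SELECT
  PipeNumber,
  -- conform every length to millimeters regardless of the unit used by the source system
  LengthInMm = CASE LengthUnit
                 WHEN 'mm' THEN LengthValue
                 WHEN 'cm' THEN LengthValue * 10
                 WHEN 'm'  THEN LengthValue * 1000
               END
FROM StagingInput.Pipe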

 

For example, status can be found in order statuses, shipping statuses and payment status from multiple source systems where a data warehouse dimension will unite and conform all statuses in a hierarchy.

 

A conformed dimension supports integration of multiple source systems into a data warehouse and is essential for enterprise data warehousing.

 

It is important to design a data warehouse to support reporting and data analyses through a single and common data model that is easy to navigate, so a business user does not have to think of data across source systems. Data integration becomes seamless. Many source systems update their data to reflect the most current and recent state, while a data warehouse also maintains history.

 

For example, an ambassador who has lived in Washington, London and New Delhi for the last 25 years has always bought her perfume from a Lancôme store in Paris. With an IT system that only stores the current shipping address, all the sales over the last 25 years will be placed in India, while a data warehouse will remember all cities and give a correct picture of the sales shipped around the world.

 

Data mart is a specific subject area that contains the data and information that is relevant for a business function, a department or business users. A data mart has one purpose: to customize and summarize data for tailored support of a specific analytical requirement from a business unit. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. A mart displays data used to support decision-making, reporting and dashboards (in danish udstiller).

 

For example, Sales mart, Cu­sto­mer mart, CRM mart, Churn prediction mart, Market mart, Pro­duc­tion mart, Inventory mart, Shipment mart, HR mart, Tax mart, Credit risk mart, Fraud de­tec­tion mart, Financial mart, Budget mart.

 

A data warehouse can have multiple data marts bound together with conformed and shared dimensions. A conformed dimension is shared by multiple facts that have conformed measures with the same calculation methods and a common unit of measure, e.g. sales revenue and support revenue in two facts are both pre-VAT (value added tax) in United States Dollar USD, and adding them will result in a total revenue.

 

A data mart contains the data to calculate Key Performance In­di­cators (KPI) e.g. MRR Monthly Recurring Revenue and ESG Environmental, Social, and Governance. (KPI in danish Nøgletal).

 

Dimensional modeling and the process of dimensionalization by thinking dimensionally is a technique that seeks to divide and present data in a standard, intuitive framework of:

 

  • Dimension contains context descriptive data, hierarchy and grouping for a fact.
  • Fact contains measurements, measures and metrics for a business process, an event, a transaction, a circumstance, an instance, an incident etc.

 

A fact tells how much or how often, and a dimension adds text to it. Both fact and dimension are tables in a database. A fact has several dimensions connected via dimension key columns and is often displayed as a star schema diagram. (Fact in danish faktum, faktuelt, faktuelle, kendsgerning, forretningsproces, en begivenhed, en transaktion, en omstændighed, en instans, en hændelse).

 

Dimension and fact focus on ease of accessibility for business users (end users) and provide a high level of performance when accessing the data warehouse/data mart.

 

Ralph Kimball re­com­mends in the design of a data warehouse to decide what busi­ness process(es) to model by combining an understanding of the business require­ments with an under­standing of the available data.

 

Kimball page 37: »The BI team needs to understand the needs of the business as well as the realities of the underlying source data.« A process-centric fact table supports integration of multiple source systems via a set of conformed dimensions, and the entire data warehouse is built process-by-process.

 

Data warehousing (DWH) is made up of three components: the architecture, the data model and the methodology. It is a method to help a database designer build a comprehensive and reliable data warehouse system, e.g. based on Dimensional modeling principles. An OLAP (OnLine Analytical Processing) cube or a Tabular model can be at the top of the Dimensional modeling to present data in a data access tool like Excel, Power BI, Tableau, QlikView, Alteryx or Targit.

 

A Business Intelligence (BI) system provides the information that management needs to make good business decisions. BI is going from data and information to knowledge and wisdom for the business users. BI was previously called Decision Support System (DSS) (in danish Beslutningsstøttesystem, Ledelsesinformationssystem LIS). BI could also stand for Business Insight.

 

Data does not turn into insight without the effort of building a good data warehouse with an architecture of data layers or data areas, a data model for each data layer and a methodology for processing data in each data layer or data area, with the famous abbreviation ETL that stands for three operations on data:

 

  • Extract (extracting, extraction)
  • Transform (transforming, transformation)
  • Load (loading)

 

Extract is the process of fetching data and extracting a desired subset of data.

Transform is the process of transforming the extracted data into a required shape.

Load is the process of loading the transformed data into a suitable target table.

 

The ETL process brings multiple source systems together and combines their data into an integrated set of data in a data warehouse and in data marts. Read more in section 1.3.6.

 

In this article I will focus on the basic concepts, terminology and architecture in Dimensional modeling. My homepage has other articles about how to design and implement a data warehouse system with levels of data in the ETL process and orchestration to orchestrate a set of tasks.

 

I see Dimen­sional modeling as a subset of the Entity Re­la­tion­ship (ER) data mode­ling design method of a rela­tional database system OLTP (On­Line Transaction Pro­­ces­­sing).

 

Peter Chen, the father of ER modeling said in his seminal paper in 1976: »The entity-relationship model adopts the more natural view that the real world consists of entities and re­la­tion­ships. It incorporates some of the important se­man­tic information about the real world.«

 

Where an ER data model can have a little redundant data or no data redundancy, a dimensional model typically has a large amount of redundant data. Where an ER data model has to assemble data from numerous tables to produce anything of great use, a dimensional model stores the bulk of its data in a single fact table or a small number of them and divides data into multiple databases called data marts. Read more.

 

Back in the 1970's Bill Inmon began to define the term data warehouse as one part of the overall business intelligence system. An enterprise has one data warehouse with entities in a relational model as a centralized repository, and multiple data marts which source their information from the data warehouse. In the data warehouse, data is stored in third normal form (3NF). A data mart is a simple form of a data warehouse that is focused on a single subject. Denormalization and data redundancy are the norm for data modeling techniques in a data mart.

 

Ralph Kimball said: »A data warehouse is a union of all its data marts.«

 

Kimball’s data warehousing architecture is also known as the Enterprise Data Warehouse Bus Architecture matrix (BUS matrix), a collection of conformed dimensions that have the same meaning to every fact.

 

Database concepts

A data model represents correctness, completeness and consistency.

 

A schema is a logical grouping of and a container for database objects e.g. tables, views, table-valued functions and stored procedures, and it can handle user access rights.

 

A table contains data in rows also known as records, tuples, instances or data rows.

A table contains columns also known as attribu­tes, fields, elements or properties.

 

A column contains an atomic value like a First name or a compound value like a Full name that can be further divided into three columns for FirstName, MiddleName and LastName or Surname or Familyname. Sometimes FirstName is called Given­name.

 

Null represents a blank value, an empty value, an indefinite value, a nothing or a none in a column whose value is missing at the present time because the value is not yet registered, not yet assigned or not defined for the row, or because it is okay that the column does not have a value. A column can contain a null value, or a column can have a constraint enforcing it to always contain a value, e.g. by a default value.

In a Customer table a column LastPurchaseDate has a row with null that indicates the lack of a known value, or it is known that the customer hasn’t made a purchase yet, therefore it is known to be absent, omitted or missing.

When a person column Date of Birth is null I will say it is missing because the person prefers not to say or because the information was lacking at the time the person was registered – a not-knowing situation.

When a person column MiddleName is null, I will say it is nothing because it is okay that there is no value, since a value is not required and therefore is not missing – a knowing-that-there-is-nothing situation. Sometimes people type in a value »(None)« for MiddleName to tell that a person has no middle name, or »(Missing)« to tell that a person has a middle name but we don’t know it, and let null indicate that we don’t know if a person has a middle name or not, or type in »(Unknown)«.

A column Number of items may be assigned a value that is too big or too small to be represented by the data type, maybe because of an error, and therefore we replace the value with a null and add an audit description »Out-of-bounds value« to an Audit column for later examination and investigation. On the other hand I would prefer the column Number of items to not allow null, and a negative value could be interpreted as a return of an item.

A nu­me­ric co­lumn with null in several rows can cause a problem, because queries using aggre­gate functions can provide misleading results. A SQL statement like this: Sum(<column>) can return one row that contains null, or use of Outer Join can result in a column with null.

In a division it is nice to use Null functions to avoid division by zero error. In sql, e.g.: SELECT ISNULL(Amount / NULLIF(NumberOfItems, 0), 0).

Null reminds me of: »No answer is also an answer« when I ask a person a question and he or she does not reply, his or her very silence is an answer in itself.

When I receive a Christmas gift in a box with something inside, I do not know the contents of the box, therefore the gift is unknown to me, and known to the giver. I will not say that the gift is null for me, okay maybe null if the box is empty but I only know that when I open the box and I could say the gift is missing.

Joe Celko, in SQL for Smarties: Advanced SQL Programming, said: »Null is not a value; it is a marker that holds a place where a value might go.« Null is a placeholder for a future value, e.g. an Address when it is known from the customer. In sql, e.g.: WHERE CustomerAddress IS NULL. Opposite terms of Null can be Assigned, Extant, HasValue, IsSet, Usable; Something might imply not nothing or not null. In sql, e.g.: WHERE CustomerAddress IS NOT NULL.

The question is how you will interpret null when you are reading it.

 

Empty string or zero-length string has zero characters with a value as "" or '' or String.Empty, and a null string has no value at all. An empty string could be used as a default value for a column that does not allow a null value. Trimming a string value takes away the spaces at the beginning and at the end of the value, and when the string only contains spaces the trim makes it an empty string.

[Alt+0160 at numeric keyboard gives a Non-breaking space (No-Break Space) that can be replaced to an empty string e.g. Replace(<StringColumn>,CHAR(160),'').]

To treat null, blank, empty, nothing and empty string as equal to null in sql, do like this: WHERE (NULLIF(TRIM(CustomerAddress),'') IS NULL)

 

Void I have seen on a paper flight ticket with stopovers to mark the empty lines that there is nothing here, to prevent me from being able to add an extra stopover.

 

Unknown represents a value, therefore it is not null, but the value is not known in a system because it does not exist in the system, or it is a mistyping, a typo or a misspelling and therefore is incorrect. Unknown is something, null is anything and thus nothing. (In danish: Ukendt er noget, null er hvad som helst og dermed intet.)

For example, I receive an address with a zip code I don’t know therefore the zip code is an unknown value for me and I need to validate it and make the zip code value known for the system by insert the zip code value in a table of zip codes and cities and countries.

When a person column MiddleName is null, I will not call it unknown because many persons do not have a middle name and therefore it does not exist for them. A default value could be used like »unknown« and later changed to »none« or »nothing« when we know that the person is without a middle name, and later changed to »missing« when we know that the person has a middle name but we don’t know it. I would prefer to have another column MiddleNameStatus, but I think in most cases we accept a null value in MiddleName for none or nothing until we know the name, if it exists, and type it into a system.

A person column Date of Birth does exist somewhere and can be missing and therefore is unknown at the moment. A default value could be used like 0001-01-01.

The cause of an event, usually a bad event, is the thing that makes it happen. Examples of causes are: accident, crash, delay, disaster, flood, sickness. If something is unknown to you, you have no knowledge of it. The cause of death is unknown.

John Doe (for men) and Jane Doe (for women) are used when the true name of a person is unknown or is being intentionally concealed or anonymized.

 

Data type for a column defines the kind of data that can be stored, such as integer, numeric, decimal, string or text, date, time, datetime, amount, boolean or bit for true (1, yes) and false (0, no). Important tip when naming columns: do not use possessive nouns such as Recipient's date of birth, better to use CamelCase like RecipientBirthDate in a column name, and don’t use values in a column name like EmployeeDay/NightCode, better with EmployeeShiftTypeCode.
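
A minimal sketch in sql of a table definition that follows these naming tips; the table name dbo.Recipient and the data types are illustrative assumptions of my own:

CREATE TABLE dbo.Recipient
(
  RecipientId           int IDENTITY(1,1) NOT NULL PRIMARY KEY,
  RecipientBirthDate    date NULL,             -- instead of "Recipient's date of birth"
  EmployeeShiftTypeCode varchar(10) NOT NULL,  -- instead of EmployeeDay/NightCode
  IsActive              bit NOT NULL DEFAULT 1 -- boolean with 1 for true/yes and 0 for false/no
)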

 

With conforming or conformance we avoid synonym columns with the same content but different names, and we avoid homonym columns with the same name but different content. For example, Type is a general term, better to give a column a name as ProductType, CustomerType, TransactionType etc.

 

Names of tables and columns must be business-friendly and must make sense for both the business users and the IT developers. It is possible to use a view to make better names. For example, an IT developer named a date column in a table as ReservationExpiryDate and in a view to a business user the column is renamed to Date_of_Reservation_Expiry. Another example where BankAccountBalanceAmount becomes Saldo and IncludeTipsBoolean becomes Tips with Yes or No values instead of database values as true/false or 1/0.
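
A minimal sketch in sql of such a renaming view, assuming a hypothetical table dbo.Reservation that holds the columns mentioned above:

CREATE VIEW dbo.ReservationForBusinessUsers
AS
SELECT
  Date_of_Reservation_Expiry = ReservationExpiryDate,
  Saldo                      = BankAccountBalanceAmount,
  -- translate the database value true/false (1/0) to a business-friendly Yes/No
  Tips                       = CASE WHEN IncludeTipsBoolean = 1 THEN 'Yes' ELSE 'No' END
FROM dbo.Reservation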

 

A derived column represents a value that is derivable from the value of a related column or set of columns, not necessarily in the same table. For example, the age of a person is derivable from the date of birth column and today's date, or a calculation of an amount, or a glue column to bind rows together for fast search.
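
A minimal sketch in sql of a derived age column, assuming a hypothetical dbo.Person table with a BirthDate column:

SELECT
  PersonId,
  BirthDate,
  -- derive the age in whole years from the date of birth and today's date
  Age = DATEDIFF(YEAR, BirthDate, GETDATE())
        - CASE WHEN DATEADD(YEAR, DATEDIFF(YEAR, BirthDate, GETDATE()), BirthDate) > GETDATE()
               THEN 1 ELSE 0 END
FROM dbo.Person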

 

A single-valued column holds a single value for an entity occurrence.

A multi-valued column holds multiple values, MovieCategory »Children, Comedy«.

 

A candidate key is a unique identifier for all rows in a table, so each row contains a unique value, and it can have a null value in one row. It ensures that each row is unique. For example, column EmployeeId, Employee Social Security Number (SSN), Employee Login name, Employee Passport Number or Employee Car number plate.

 

A primary key is selected among the candidate keys and is a unique identifier for each row in a table and its value is unique for all rows in a table, so each row contains a uni­que value and it can’t be a null value (is not null, not allow null, non null). For example, column EmployeeId is the primary key in an Employee table because Id is an artificial auto-generated unique sequence number or an identity column as a surrogate key.

 

Alternate key or Secondary key is the key that has not been selected to be the pri­mary key, but it is still a candidate key and it can be a null value in one row.

 

A compound primary key is concatenated data e.g. a US phone number contains area code + exchange + local number in an EmployeePhonebook table.

 

A composite primary key is composed of multiple columns which combination is used to uniquely identify each row in a table e.g. column EmployeeId and column CourseNumber in a Participant table. Useful when a single column isn’t enough to uniquely identify a row.

 

An entity is a set of objects with the same properties and is implemented as a table with rows where a primary key is unique and for use in identification of rows.

 

A foreign key (one column or a collection of columns) in a table refers to the primary key in another table to establish relationships or links between the two tables. An orphan child is a row in a Child table with a column MotherId as Null. Another orphan child also has MotherId as Null. It is good that Null = Null will evaluate to Null, because I don’t know if the two children are siblings or not, therefore no answer.

 

A relationship defines how the entities are related to each other with a meaningful association among entities in three ways called one-to-one, one-to-many or many-to-many, displayed in an Entity Relationship Diagram ERD. A relationship is implemented by including the primary key in the related entity and calling it the foreign key to make sure of the consistency in the database. When a database is consistent we can navigate with a join among tables, and relationships become navigation paths.

 

An example of a many-to-many relationship is between the two entities Employee and Course, where one employee can take many courses and where one course can be taken by many employees; it may be modeled through the use of a third entity that provides a mapping between them with foreign keys back to Employee and Course plus extra data e.g. date of the course, evaluation of the course, student grade.
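
A minimal sketch in sql of such a mapping table with a composite primary key and two foreign keys, assuming the tables dbo.Employee and dbo.Course already exist with EmployeeId and CourseNumber as their primary keys:

CREATE TABLE dbo.Participant
(
  EmployeeId   int  NOT NULL,
  CourseNumber int  NOT NULL,
  CourseDate   date NOT NULL,       -- extra data carried by the relationship
  StudentGrade varchar(5) NULL,
  CONSTRAINT PK_Participant PRIMARY KEY (EmployeeId, CourseNumber),
  CONSTRAINT FK_Participant_Employee FOREIGN KEY (EmployeeId) REFERENCES dbo.Employee (EmployeeId),
  CONSTRAINT FK_Participant_Course FOREIGN KEY (CourseNumber) REFERENCES dbo.Course (CourseNumber)
)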

 

Normalization is the process of structuring a database in several tables to remove redundancy or reduce data redundancy and improve data integrity, through steps called normal form: 1NF, 2NF, 3NF, BCNF, 4NF and 5NF. Guide to normal forms.

But a data warehouse loves data redundancy, because it is not a human that updates data, instead it is an ETL process that runs as a computer software program developed by an ETL developer or a data engineer.

 

SQL stands for Structured Query Language e.g. Select From Join Where Union.

 

A view is a stored sql query statement used to fetch data from a database, and the data set has columns like a table and is called a result set, rowset or recordset. Also called a virtual table representing the result of a database query.

 

A materialized view is a database object that contains the result of a sql query, e.g. data is being persisted into a table, a file or placed in a server memory cache. A view can be materialized through an index. Materialization is for performance reasons as a form of optimization of the ETL process and for fetching data to business users, e.g. five tables that are joined and aggregations that are precomputed and precalculated. For a client-server database application, the result can be placed as a local copy on a user's PC.

 

A table-valued function is an intelligent view with parameters, like a mathematical function, and it returns a result set wrapped in a table to be used in a join etc.
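
A minimal sketch in sql of an inline table-valued function, assuming a hypothetical dbo.Sale table:

CREATE FUNCTION dbo.SalesByCustomer (@CustomerId int)
RETURNS TABLE
AS
RETURN
(
  SELECT SaleDate, Amount
  FROM dbo.Sale
  WHERE CustomerId = @CustomerId
)

-- usage in a join, returning the sales rows for each customer
SELECT c.CustomerName, s.SaleDate, s.Amount
FROM dbo.Customer AS c
CROSS APPLY dbo.SalesByCustomer(c.CustomerId) AS s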

 

A stored procedure (sometimes parameterized) performs actions on data and may return a result set, a value or an output parameter. Procedures are important for the ETL process.
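
A minimal sketch in sql of a stored procedure for an ETL step; the schema names etl, dim and staging and the column names are assumptions of my own:

CREATE PROCEDURE etl.LoadDimCustomer
  @BatchId int
AS
BEGIN
  SET NOCOUNT ON
  -- insert customers from the staging table that are not yet in the dimension
  INSERT INTO dim.Customer (CustomerNumber, CustomerName)
  SELECT s.CustomerNumber, s.CustomerName
  FROM staging.Customer AS s
  WHERE s.BatchId = @BatchId
    AND NOT EXISTS (SELECT 1 FROM dim.Customer AS d
                    WHERE d.CustomerNumber = s.CustomerNumber)
END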

 

A join is often between two tables on a primary key column and a foreign key column. An equi join (equality join condition) includes matched rows in both tables by matching column values and is indicated with an equal operator =. A natural join is a type of equi join which occurs implicitly by comparing all identically named columns in both tables. The join result has only one column for each pair of equally named columns.

 

A theta join or non-equi join matches the column values using any operator other than the equal operator like <, >, <=, >=, !=, <>. An outer join includes both matched rows and unmatched rows of a table, normally the left side table, else the right table or full from both tables.
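
A minimal sketch in sql of an equi join and a left outer join, assuming hypothetical dbo.Customer and dbo.Invoice tables related on CustomerId:

-- equi join with the = operator between primary key and foreign key, only matched rows
SELECT c.CustomerName, i.InvoiceNumber
FROM dbo.Customer AS c
JOIN dbo.Invoice AS i ON i.CustomerId = c.CustomerId

-- left outer join keeps customers without invoices; their invoice columns become null
SELECT c.CustomerName, i.InvoiceNumber
FROM dbo.Customer AS c
LEFT OUTER JOIN dbo.Invoice AS i ON i.CustomerId = c.CustomerId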

 

SQL example:

SELECT COUNT(*) AS NumberOfBoats

FROM dbo.CopenhagenCanals

WHERE SailingPast = 'My window'

 

I prefer alias using = instead of AS:

SELECT NumberOfBoats = COUNT(*)

FROM dbo.CopenhagenCanals

WHERE SailingPast = 'My window'

 

Because assignment of a variable looks very much like my sql above:

DECLARE @NumberOfBoats int

SELECT @NumberOfBoats = COUNT(*)

FROM dbo.CopenhagenCanals

WHERE SailingPast = 'My window'

SELECT @NumberOfBoats

 

Data is a plural of datum, e.g. 5.5 or 38, and after labelling them shoe size in US and in EU they become information. When I know many of my friends' shoe sizes together with other information, I have knowledge that I can use for better birthday gifts or sell for a marketing campaign.

38 is data; like a person's age of 38 years or a person's weight of 38 kilograms it becomes information because I put 38 in a context and use it together with a unit. An example of metadata for the person weight of 38 kilograms could be the date and time it was measured.

 

Data deduplication refers to a technique for eliminating redundant data in a data set. In the process of deduplication, extra copies of the same data are deleted, leaving only one copy to be stored. Also called Removing duplicates, because the unwanted data is discarded.
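
A minimal sketch in sql of removing duplicates with ROW_NUMBER, assuming duplicates are rows with the same CustomerNumber, name and address in a hypothetical dbo.Customer table:

;WITH Ranked AS
(
  SELECT *,
         RowNo = ROW_NUMBER() OVER (PARTITION BY CustomerNumber, CustomerName, CustomerAddress
                                    ORDER BY CustomerId)
  FROM dbo.Customer
)
DELETE FROM Ranked
WHERE RowNo > 1   -- keep one copy of each duplicate, discard the rest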

 

Metadata

Metadata is in short »data about data« because metadata provides data or information about »other content data« (in danish indholdsdata), e.g. author and title of a book, table of contents in a book, index in a book or an error description from an IT system like »violation of a foreign key constraint«.

 

Example of an XML structure with an element as SalePriceAmount with a data value content of a real sale price amount together with a metadata attribute as a currency or a unit:

 

<SalePriceAmount Currency="USD">295.73</SalePriceAmount>

<SalePriceAmount Currency="DKK">2236.14</SalePriceAmount>

<SalePriceAmount Currency="EUR">300.65</SalePriceAmount>

 

In a data warehouse metadata adds an extra value to the raw data from the source systems together with metadata for the data flow between data areas. A data warehouse uses metadata e.g. to define and specify, describe, classify, trace, debug and audit data, for logs and metrics, to restrict access and to monitor data.

 

Metadata is partly business data (e.g. business descriptions, trace/lineage, access restriction) and partly non-business data (e.g. monitor data, logs, metrics, technical definition). I am using metadata columns for housekeeping columns, sup­port columns or administrative columns in tables, for data lineage and timeline columns with names as InsertTime, UpdateTime, ValidFrom and ValidTo.

 

I have chosen to use two specific dates as metadata values (yyyy-mm-dd):

  • Beginning of time as the first occurrence of data specified in column ValidFrom begins with date 1900-01-01, can in­clu­de a time 00:00:00.0000000 (00:00:00.000 or 00:00:00).
  • End of time as the last occurrence of data specified in column ValidTo ends with date 9999-12-31, can include a time 23:59:59.9999999 (23:59:59.997 or 23:59:59).

 

ValidFrom and ValidTo represent the span of time when a row in a table was the »current truth«.
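
A minimal sketch in sql of a point-in-time lookup with these two metadata columns, assuming a hypothetical history table arc.Customer:

DECLARE @AsOfDate datetime2(7) = '2020-06-30'

-- the row that was the current truth at @AsOfDate
SELECT CustomerNumber, CustomerName, ValidFrom, ValidTo
FROM arc.Customer
WHERE ValidFrom <= @AsOfDate
  AND ValidTo >= @AsOfDate   -- the current row ends at 9999-12-31 and is always included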

 

If a source system has a date column with a null value where the null doesn’t mean »missing« and a data warehouse would like to have only a non-null value, then the data warehouse can use a forever perpetual date like 9999-12-31 to identify a future undetermined date, an infinite date (in danish forevig uendelig dato for en uafklaret ikke aftalt dato). The benefit of a non-null date column is that a normal sql query join can be used to a calendar table in the data warehouse. For an amount a data warehouse can use a fixed value to represent a non-existent value.

 

Other examples of metadata flag columns are IsCurrent, IsDeleted and IsInferred, more about them later.

 

Metadata columns will be mentioned in many places in the article and about its naming in section 6.3. I will end with some categories of metadata:

Structural metadata is about how compound objects are put together, e.g. the rule of a sql left-join and an enrichment of a derived column in a sql case-when statement.

Descriptive metadata is about a resource e.g. a link to a User Story Id in Jira and to a description and explanation in Confluence.

Reference metadata is about the contents, when and how it was created, record source, data lineage, movement of data between systems and data layers or data areas. When we deliver information to the business users, we must be able to tie that back to a source data set.

Administrative metadata is about how to read the contents e.g. column PerDate. Also permissions to the contents.

Process metadata is about how an ETL process is running, e.g. a package start/stop time, its duration, execution result and error message.

Audit metadata is about reconciliation check e.g. rowcount, summarized measures to monitor and control movement of data between systems and data layers or data areas. Also about monitor and control of data quality and audit trail.

Statistical metadata is based on process metadata and audit metadata to describe the status of a data warehouse to be shown in a dashboard for Data management and DataOps.

 

Conclusion: It takes metadata to manage data and to use data.

More reading in Metadata.pdf.

 

Conforming - what to call it?

Humans all over the world are using many terms for the same thing. For example, a transport system: Subway in New York City, Tube or Underground in London, Metro in Paris or Copenhagen, U-Bahn in Berlin, Vancouver and Bangkok has SkyTrain as an elevated train system above the streets.

When a Filipino wants to buy toothpaste in a store, he will ask for a colgate. A sign on the road says xerox instead of photocopying. In Denmark I will ask for a kleenex instead of a facial tissue. Some names of brands become words in our language.

It can be a tough job to find and determine a good and common conformed term to become a column name, but later on the communication will be much easier between business people and IT-people.

How to present workload per day as 7½ hours, hh:mm 7:30 or decimal 7.50?

How to present a date, 2001-09-11 or 11-09-2001 or 9/11/2001?

 

Transaction

If Shakespeare had been a data modeler, I wonder if he had made the question:

»A transaction has relationships or a transaction is a relationship?«

  • If a transaction can be referenced to and it has a TransactionID as a primary key and it is a thing (its own entity) then it has relationships.
  • If a transaction only contains foreign keys and it has no need for a unique identi­fier and nothing else will reference to it, then it is a relationship, e.g. many-to-many data-bearing relationship.

 

In Peter Chen's Entity Relationship Diagram ERD there are many kinds of transactions, e.g. an order, an invoice or a deposit of money into a bank account that is a many-to-many data-bearing relationship in the conceptual level design with a caption like »deposit transaction«. Moving on to a logical level design is when a many-to-many relationship becomes an entity and primary keys become foreign keys. Moving on to a physical level is where an entity becomes a table in the database through a Relational Database Management System RDBMS. The most important thing for me is that the names of tables and columns reflect as much as possible the user's conceptual world (in danish afspejler brugerens begrebsverden).

 

A counterpart to an invoice is called a credit memo, credit memorandum or a credit nota to reduce the amount of the invoice for several reasons, for example, the item is returned, the billed unit quantity is larger than the one delivered, the invoiced unit price is higher than agreed or there are errors in the delivery that justify price reductions. If a transaction in a source system is changed, corrected or can­cel­led then a data warehouse can calculate a counterpart transaction with a nega­tive amount. The calculation of the final amount will be a summary of transactions whereby the final total amount will be less than the original amount.

 

Top 10 Analytics And Business Intelligence Trends for 2019.

 

1.1. Data areas (data layers) in a dimensional modeling architecture

A description of the content of data areas or data layers of the throughput from multiple source systems to a business user's PC screen for improved analytic decision making. I am using the term area to point out that an area may contain several databases to split data from multiple source systems, to make the entire solution scalable and able to be distributed over multiple servers. Each area is labelled with a data warehouse acronym as an abbreviation formed from the initial letters.

 

Data flow

 

Source system → Landing zone area → Input data area → Archive area → Data staging area → Data mart area.

 

Source file/table → Input table → Archive table → Staging table → Target table (destination).

 

Source system area – SSA (Operational system and Legacy system)

A company or an organization is using several operational systems to handle daily data of many different kinds. Most systems are based on a database with primary keys and foreign keys, and there are three types of candidate keys to identify an entity object; we can call them source keys:

 

Natural key exists in the real world e.g.:

  • Phone number
  • Hotel room number
  • A fingerprint
  • A medical diagnosis
  • A three-letter currency code
  • A two-letter ISO country code and other codes that a user understand
  • A JobCode e.g. MNG is manager, PRG is programmer and CLN is cleaner
  • A website address/url
  • An address with country + state/province/region/district + postal code + street name + house number + floor number (stairwell, door number, PO box)
  • Country name
  • Names of my friends combined with their Date of Birth.

 

A composite natural key of multiple columns is still a natural key.

 

A natural key value is mutable (changeable, not immutable), and is meaningful for a human being. Back in 1990s we could remem­ber many phone numbers, but the cell mobile phone contact list took over.

 

Business key exists in a system and is built as an artificial number e.g.:

  • ISBN International Standard Book Number
  • Vehicle identification number (VIN)
  • Vehicle registration number plate
  • Custo­mer number
  • Purchase order number
  • Invoice number with Product numbers
  • Insurance policy number
  • Flight number
  • General ledger account number
  • Login name
  • Social Security Number (SSN), a civil registration number; for example each danish person by birth is given a CPR number for Central Person Register (Person Civil Registration Identifier), but it is GDPR sensitive because it contains a natural part with a Date of Birth and the last digit tells gender as an even number for a woman and an odd number for a man (anonymization/pseudonymization).

 

Some of the above numbers do exist in a natural way in the real world because they are printed on papers and things, and can therefore be called an application key.

 

A composite business key of multiple columns is still a business key.

 

A business key value is either mutable (changeable) or immutable (unchangeable), and is meaningful for a human being, where a business user prefers to use a business key value to identify an entity as a search criteria lookup value given in a phone call to a company, a bank, a hospital or the government.

 

Surrogate key exists in a system as an artificial auto-generated unique sequence number or an identity column (id, uid unique identification) that is used as an unchangeable column instead of a natural key, a business key or a composite primary key. A surrogate key is represented as an integer, a guid, a uniqueidentifier or a hashbyte of a natural key value or a business key value.

 

A surrogate key value is immutable (unchangeable), and is meaningless for a human being and will nor­mal­ly not be exposed to outside business users, but is useful for join operations inside an IT system.

 

A natural key or a business key is what the business uses to uniquely describe data. If a natural key or a business key is immutable and unchangeable in a source system then call it a durable natural key or a durable business key. (Durable in danish holdbar eller slidstærkt).

 

A table called Salestype has a business key as an one-letter code and has a natural key as a textual value, e.g. R = Retail sales, D = Direct sales, O = Online sales.

 

A table called Status has a surrogate key StatusId as an artificial auto-generated unique sequence number or an identity column, a co­lumn StatusNumber as a durable business key and a column StatusCode as a durable natural key with a textual value which can be understood by a human being, otherwise look at column StatusName which does not need to be unique.
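
A minimal sketch in sql of such a Status table; the data types and unique constraints are assumptions of my own:

CREATE TABLE dbo.Status
(
  StatusId     int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
  StatusNumber int NOT NULL UNIQUE,                    -- durable business key
  StatusCode   varchar(10) NOT NULL UNIQUE,            -- durable natural key, readable by a human
  StatusName   varchar(100) NOT NULL                   -- does not need to be unique
)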

 

A table called Customer has a surrogate key CustomerId, a co­lumn CustomerNum­ber as a business key and a column CustomerName as a natural key.

It is common to have several customers with the same name, therefore I will not call CustomerName a durable natural key. It is common in a customer system that the same person exists several times because a person appears in duplicate rows with the same name and address but with different CustomerNumber, therefore I will not call CustomerNumber a durable business key.

 

From a data warehouse point of view, a natural key and a surrogate key is also a business key. For example, a table called LeadState has a surrogate key Id and a column State with values as »New, Contacted, Qualified, Lost« then a data ware­house lets Id become the business key because a text value can be changed over time e.g. »Lost« will one day be changed to »Not interested«. A beloved child has many names.

 

A key map table, surrogate key mapping table, cross reference table or translate table as a shadow table is made for each business key or composite business key of multiple columns (or natural key) to transform business key values either to an immutable surrogate business key _sbkey, e.g. a code which can be understood by a human, or to an artificial auto-generated unique sequence number or an identity column as a durable key _dkey for later use in a dimension to glue multiple rows together in a timeline for a type 2 or a type 7 dimension. A durable key is a surrogate key for a business key (or a natural key) and has the function of a stable key.

 

Since a dimension can have multiple source data, a key map table contains a busi­ness key for each of its source systems for data lineage to provide only one immu­table surrogate business key (code) and/or one durable key across multiple source systems.

 

Furthermore, a key map table must be able to handle a business key value which can be mutable (changeable) to keep the same code or the same durable key value. For example, a business key Employee number may be changed if an employee resigns and some years later is rehired; the belonging durable key value must remain.
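
A minimal sketch in sql of a key map table for the Employee example; the schema name map and the column names are assumptions of my own:

CREATE TABLE map.Employee_KeyMap
(
  Employee_dkey  int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- durable key used later in the dimension
  EmployeeNumber varchar(20) NOT NULL,                   -- business key from the source system
  RecordSource   varchar(100) NOT NULL,                  -- which source system delivered the business key
  InsertTime     datetime2(7) NOT NULL DEFAULT SYSDATETIME()
)

-- when a rehired employee gets a new EmployeeNumber, the mapping row is updated
-- so the Employee_dkey value remains the same
UPDATE map.Employee_KeyMap
SET EmployeeNumber = '2024-1187'
WHERE Employee_dkey = 42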

 

Kimball calls it a durable supernatural key and it can be part of an integration hub in a data warehouse system.

 

Kimball page 101: »The durable supernatural key is handled as a dimension attri­bute; it’s not a replacement for the dimension table’s surrogate primary key« which is called a dimension key regardless of dimension type (0-7, see later).

 

Dimension key is the primary key in a dimension table in a data mart area.

A dimension key is an artificial auto-generated unique sequence number or an iden­tity column as a surrogate key for the primary key of a dimension.

 

I prefer a suffix _key for a dimension key co­lumn in a dimension as a primary key for the dimension. An alternative suffix _dimkey or _dim_key.

 

For example, Customer_key column in a Customer dimension which is related to a Sales fact to a Customer_key column as a foreign key.

 

Please notice, that a business key translation in a key map table to a durable key, same as a surrogate key, does not make the key value a dimension key in a dimension, because there has to be independence from the business key, and in a type 2 dimension the dimension key has unique values for each row even though the same durable key value or business key value is standing in several rows.

 

Data profiling of a source system is very important and the choice of co­lumn to become a data warehouse business key requires a good data analysis and criteria for the choice and data type and data range. Read more.

 

Be aware that a value of a natural key or a business key can be changed in a system because the original value was entered incorrectly and needs to be corrected. A composite natural key in a Member table with Name+Address will be updated when a person changes place of residence. I saw in an airport that the flight number was selected as the business key for the data warehouse and used in the ETL process, but no one had taken into account that when a plane was delayed, the letter D was added at the end of the flight number, therefore the data warehouse thought it was a new flight.

 

A surrogate key as an integer or guid won’t ever need to be changed, but if a customer is deleted by mistake and the customer later is inserted again with the same CustomerNumber as business key, then it will get a new sequence number in CustomerId. When a company has several operational systems handling customer data, each system will use its own surrogate key for CustomerId, therefore the business key CustomerNumber is the glue between the systems.

 

It is important for a data warehouse to receive the value of the business key (and/or the natural key) from a source system, and in case a value has been changed, that the data warehouse also receives the before-value. A database has many tables, for example, a Customer table has columns CustomerId as surrogate key, CustomerNumber as business key and CustomerName, and an Invoice table has columns InvoiceId, InvoiceNumber, InvoiceDate and CustomerId as foreign key; the invoice data must be extended with the business key CustomerNumber either by a source system or inside a data warehouse to know the customer of an invoice. I prefer that a data warehouse also stores the value of the surrogate key to be helpful when a value of a business key is changed and to obtain traceability for data lineage to tell where the data comes from, more in section 1.6. Merging data and integrating data can be hard to set up in an ETL process and to test afterwards.

 

Landing zone area – LZA

From source systems to landing zone area as a file storage of source data, where a landing zone area can be a file server. A landing zone area is a data reception area with data from multiple source systems. A data set can arrive in many formats, therefore a good data profiling document is nice to have. For example:

 

  • csv file, txt file, xml file, json file, parquet file.
  • Excel file or pdf file.
  • gz file or zip file with compressed file(s).
  • Database file e.g. Access accdb file or mdb file.
  • Database backup file from a SQL Server, DB2 server, Oracle server etc. and a backup file can be restored at a landing zone area server.
  • Database data is copied from a source database server to a landing zone area server persistent database placed in a dmz or a similar zone:

·       by database replication (batched)

·       by data strea­ming (streamed) where data flows out from a source system through a delivery or a redelivery

·       by receiving data from a real-time data replication that fetch new, changed and deleted source data. An example of a Change Data Capture CDC tool is Qlik Replicate Change Data Capture, read more in section 1.2.

 

A landing zone area can be a file server to receive files.

A landing zone area can be a database server to receive a database backup file that will be restored to a database at the server, or there is a persistent database that receive data from a source system.

 

This area is also called Operational Data Store (ODS), a place to store source data.

Microsoft is using the term Landing Zone as the place where source data is landing and often is never deleted or deleted after a longer period.

 

A file can be transferred to a file handling area, or a file can be fetched from a SFTP server through WinSCP to a folder which represents a landing zone area, and from there copy the file to an archive folder to keep the original source files, and move the file to an input folder for later unzip to another folder and loading into a data warehouse. A file can have metadata in the filename like a date of contents and delivery. Sometimes a delivery consists of several files which can advantageously be placed in a delivery folder with a name including a date value or a batch number. Contents of the files must be checked for data quality before they are loaded into the input data area. In case you like to open a 20 GB csv file and delete the last line that is corrupt, use Sublime Text, it's free and requires no registration, only a long wait for a large file. A nightmare is like 25 csv files that occasionally change filenames and change column names and data type and content.

 

It is an advantage to receive a delivery note together with files from a source system with information about the files like filenames, maybe a missing file and the number of lines in a file or rows in a xml file. For a database backup file it is a must to identify them as full backup or incremental backup before restoring.

 

A data set should have accompanying metadata describing the origins and processing steps that produced the data for tracking data lineage.

 

Input data area - IDA

From Landing zone area to Input data area for temporary storage of source data which has been received in files as csv, xml, json, parquet, excel, access, database files, database backup files or by database replication, data strea­ming or CDC.

 

The files in the landing zone area can be imported or extracted into an input data area to become database tables with loose data types e.g. string, text, varchar or nvarchar with a length of 100+ characters.

 

When a source system has millions of transaction data it can take a long time to do a full load every night, therefore to save on loading time from a source system to a data warehouse it is good to detect delta data as new data, changed data or removed data in a source system. To fetch delta data from a source system database by using delta load or incremental load, read more in section 1.2.
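
A minimal sketch in sql of a delta load based on a changed date/time column in the source table; all table and column names here are assumptions of my own:

DECLARE @LastLoadTime datetime2(7) =
  (SELECT MAX(SourceChangedTime) FROM ida.SalesOrder)

-- fetch only rows that are new or changed since the last load
INSERT INTO ida.SalesOrder (SalesOrderNumber, OrderDate, Amount, SourceChangedTime, IdaInsertTime)
SELECT s.SalesOrderNumber, s.OrderDate, s.Amount, s.Changed, SYSDATETIME()
FROM SourceDb.dbo.SalesOrder AS s
WHERE s.Changed > @LastLoadTime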

 

This area is also called Raw data, Data Sourcing layer, Source, Le­gacy, Extract, Capture Operational Data layer, Data Acqui­sition or Data Lakehouse.

 

Empty input data area in the beginning of the ETL process.

 

Data types will be adjusted to fit the receiving database system, and it is a good idea to use a wide data type, e.g. when a source system provides a string of length 6 or 50 characters, I will make it 100 characters to be prepared for a later extension of characters in a source system. In case a date format can’t be agreed for a csv file, e.g. 05/08/2014 where the order of day and month can be format mm/dd/yyyy or format dd/mm/yyyy; an amount format with a decimal separator and a thousand separator as a period or a comma e.g. 1,234.56 or 1.234,56; and the use of field quote, field terminator, text qualifier and row delimiter, I recommend to load each column into a data type of string e.g. varchar(100) to have data in a table and then do a try_cast data conversion to a new column with the appropriate data type e.g. Date, Decimal(19, 4), Currency, Money, Integer etc. I prefer a Tab delimited text UTF-8 (65001) file where the list separator is <Tabular>, Chr(9), \t. Study Schema.ini and XML Schema Definition XSD.
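
A minimal sketch in sql of the try_cast conversion from the loose string columns, assuming a hypothetical table ida.OrderFileRaw and a file that uses dd/mm/yyyy dates and a comma as decimal separator:

SELECT
  RawOrderDate,
  RawAmount,
  OrderDate = TRY_CONVERT(date, RawOrderDate, 103),   -- style 103 = dd/mm/yyyy
  Amount    = TRY_CAST(REPLACE(REPLACE(RawAmount, '.', ''), ',', '.') AS decimal(19, 4)),
  -- null in OrderDate or Amount means the text could not be converted and must be examined
  IsConvertible = CASE WHEN TRY_CONVERT(date, RawOrderDate, 103) IS NULL
                         OR TRY_CAST(REPLACE(REPLACE(RawAmount, '.', ''), ',', '.') AS decimal(19, 4)) IS NULL
                       THEN 0 ELSE 1 END
FROM ida.OrderFileRaw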

 

Do reconciling between the data warehouse and source systems with reconciliation of row count and sum of values, mark data as reconciled and do an auditing report (to reconcile in danish at afstemme, stemmer overens). Maybe a receipt system where IDA tells a source system that data has been received. Data might be wrong in the input data area or in the archive and later the source system will resend a new revision of data, therefore it is important to create output views that filter wrong data away and make correct data available as an extraction to the next area of the data warehouse. Read more about data reconciliation in section 1.4.
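
A minimal sketch in sql of a reconciliation of row count and sum of values between a source system and IDA; the table names are assumptions of my own:

SELECT
  SourceRowCount = (SELECT COUNT(*)    FROM SourceDb.dbo.Invoice),
  SourceAmount   = (SELECT SUM(Amount) FROM SourceDb.dbo.Invoice),
  IdaRowCount    = (SELECT COUNT(*)    FROM ida.Invoice),
  IdaAmount      = (SELECT SUM(Amount) FROM ida.Invoice)
-- mark the delivery as reconciled when the counts and the sums are equal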

 

For the data lineage most tables in the Input data area will include metadata columns: RecordSource as a reference back to a source system e.g. value "Dynamics365.Sales.Product". RecordSourceId is the origin primary key column from the source system. IdaBatchDataCaptureId is in all rows in all tables and will use the same number of identification together with an IdaBatchDataCaptureTime as a received date and time to represent a snapshot of the total amount of data as it looks like in a source system, and can be used later to keep history; ETL jobs/packages/transformations/sql will reuse the same id and time for all data rows. IdaInsertTime (datetime2(7) default sysdatetime(), not unique per row) when the fetching is done. Sometimes a source system has a Changed column with date/time of first insert or date/time of latest update of the data in the row/record that can be used for later history. Read more about data lineage in section 1.6.

 

Exchanging data from a source system to IDA can be in different formats, e.g. the JSON data-interchange format; since I prefer storing data for IDA in a relational database with tables with rows, columns and data types, I must parse the json format.

 

Kimball recommends that a source system express data at the lowest detail possible for maximum flexibility and extensibility to provide the data warehouse for simplicity and ac­ces­sibility and it will be the data warehouse to make summary data, and not the source system. IDA is a mirror of source data.

 

I recommend for each source system to create its own Input data area e.g. IDA_Dyna­mics­365, IDA_Salesforce, IDA_ServiceNow, IDA_Proteus to keep the source data separately. Each IDA has metadata for scheduled expected delivery with dates, filenames or tablenames, estimated number of rows, receiving time, row count and flags like IsReceived, IsCompleted, IsReady together with different control tables, reconciliation tables and staging tables to make data ready for the archive area.

 

I recommend for each source system to create its own ETL process (job with packages) to extract, maybe a little transform and load data into an Input data area and possibly furthermore to an Archive area. When a source system is bad or down, data must continue to be loaded from other source systems. When a daily source data delivery is bad, the next day delivery must still be able to be loaded and it is nice when a redelivery of a bad data can be provided and handled in the ETL process.

 

Archive area - ARA

From Input data area to Archive area for keeping source data indefinitely through forever storage of source data from source systems as a data basis for a data warehouse.

 

This area is also called Persistent Staging Area (PSA), Persistent Historized Data Store, Persistence Layer, Detail Data Store, Data Repository or History or Archive data area. Archive is a versioning of the source.

 

Never empty the archive area, because it is an archive of time-variant source data and it retains historical value changes from the source systems.

 

Simple data adjustments can be done to gain the same date and amount format and the same data representation of a social security number etc., but they will be in new columns so the original values are unchanged.

 

For the data lineage most tables in the Archive area will include metadata columns: RecordSource, RecordSourceId, IdaBatchDataCaptureId, IdaInsertTime and ArcBatchDataCaptureId, ArcBatchDataCaptureTime, ArcStorage, ArcInsertTime (datetime2(7) default sysdatetime(), not unique per row), ArcTs (timestamp, rowversion, unique per row) and ArcRecordId, which is a unique sequence number per row per table to obtain data lineage and traceability back to the archive; the column can also be used to fetch and pull delta data out of the archive that has not yet been loaded into the data warehouse. ArcGlobalId is a unique sequence number per row across all tables in the archive.
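
A minimal sketch of pulling delta data out of the archive with ArcRecordId, assuming a hypothetical archive table ara.Product and a hypothetical control table etl.LoadControl that remembers the last loaded value:

DECLARE @LastLoadedArcRecordId bigint =
  (SELECT LastLoadedArcRecordId
   FROM etl.LoadControl
   WHERE TableName = 'ara.Product');

SELECT *
FROM ara.Product
WHERE ArcRecordId > @LastLoadedArcRecordId;   -- only archive rows not yet loaded into the next area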

 

A different kind of archive that does not do »forever storage of source data« can be called Current Data Area (CDA), with the purpose of having a storage of current source data, because the data change frequency and volume of changes are too big to retain historical value changes from a source system; therefore the archive has a »current only« version of source data to improve performance and use less storage. CDA means that you have to detect data in order to insert new data, update changed data and delete removed data from a source system. CDA can be based on either full load, delta load or incremental load. CDA does not mean that you can't have a dimension with type 2 history in a data mart area.
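
A minimal sketch of maintaining a current-only CDA table from a full load, assuming hypothetical tables cda.Customer (target) and ida.Customer (latest full snapshot); a NULL-safe comparison would need ISNULL or EXCEPT:

MERGE cda.Customer AS tgt
USING ida.Customer AS src
   ON tgt.CustomerId = src.CustomerId
WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName OR tgt.City <> src.City) THEN
  UPDATE SET tgt.CustomerName = src.CustomerName,
             tgt.City         = src.City
WHEN NOT MATCHED BY TARGET THEN
  INSERT (CustomerId, CustomerName, City)
  VALUES (src.CustomerId, src.CustomerName, src.City)
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;   -- the row has been removed in the source system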

 

Data staging area - DSA

From Archive area to Data staging area for extracting archive data by full load, delta load or incremental load, and for transforming and enriching data according to business rules to identify dimension data and fact data in staging tables that shape and suit tailored target tables, ensuring data is trustworthy, in order to prepare data for loading into a data mart area for the purposes of querying and analysing in different ways by the business users for their analysis, reporting and dashboards.

 

Data processing can be for example:

 

  • cleansing, cleaning, scrubbing (read more about data cleansing)
  • detect inaccurate or incompatible data, inconsistent formats and outliers
  • filtering, checked for redundancy, removing duplicates, purging
  • ensuring of referential integrity and business key or application key
  • handle missing values, unknown values, add default values, correct errors
  • validating, validation, auditing, authenticating for syntax and business rules
  • conducting audits to ensure data quality and compliance with business rules
  • translations, mappings, conversions, converting, formatting
  • combining, integration, merge, merging, determine granularity and grain
  • establish consistency, consolidating, transposing
  • conforming data and renaming columns to common names and terms
  • enriching data by adding derived value in new column in pipeline expression
  • calculation computing measures, aggregation, summarization, ordering, KPI
  • common units of measurement for measures and metrics, standardizing
  • encrypting protecting data governed by industry or governmental regulators
  • slowly changing dimensions SCD, rapidly changing dimensions RCD type 0-7
  • making data ready for the right type of fact

 

DSA can be divided into tables for StagingInput and for StagingOutput by using da­tabase schemas like StagingDim, StagingFact, StagingAnalysis (one big table) to shape and suit target tables in schemas Dim, Fact, Analysis in a data mart area.

 

This area is also called Work, Batch, Calculation, Enrichment, Prepare, Temporary, Transformation or just Staging, Stage or Stg.

 

Empty the data staging area at the beginning of the ETL process because data is only transient and temporary in this area.

 

Examples of concrete data processing are: trim string, unicode string, max length of string, null gets a value as 0 or empty string '', replace illegal characters, replace value, verification of data type and data conversion to a suitable data type, e.g. a csv file contains the value 1.0E7 in scientific notation and it will be converted to the value 10000000 and saved in a column in a table. Data syntax and date formats like mm/dd/yy or dd.mm.yyyy will be converted to yyyy-mm-dd; correcting mistyping, typos or misspelling, fixing impossible values, punctuation and spelling differences to achieve a common format, notation and representation; validating data, e.g. you expect the two values »true« and »false« in a boolean column in a file but suddenly there is a value like »none«, »void« or a sign ?; correcting data and doing data deduplication by selecting one of the duplicate rows and removing the duplicate rows to keep them in a wrong table. The data cleansing and data integration process with multiple source systems makes data cleaned and conformed, e.g. a gender code from one source has »Man, Woman«, another has »M, F«, and they will be conformed to »Male, Female« through a mapping of source data. Calculations and computing metrics, e.g. total cost and revenue, elapsed time and overdue time.
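
A minimal sketch of such cleansing expressions in a staging select, assuming a hypothetical staging table dsa.Stg_Customer_1 where the raw columns arrived as strings:

SELECT
  LTRIM(RTRIM(CustomerName))                            AS CustomerName,  -- trim
  ISNULL(Email, '')                                     AS Email,         -- null becomes empty string
  CAST(TRY_CAST(AmountText AS float) AS decimal(19, 4)) AS Amount,        -- '1.0E7' becomes 10000000.0000
  CASE GenderCode
       WHEN 'M'     THEN 'Male'
       WHEN 'Man'   THEN 'Male'
       WHEN 'F'     THEN 'Female'
       WHEN 'Woman' THEN 'Female'
       ELSE 'Unknown'
  END                                                   AS Gender         -- conformed gender values
FROM dsa.Stg_Customer_1;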

 

When an archive is stored on another server it is common to load data from the archive into a data staging area on the data warehouse server, e.g. data for a dimension of type 1, type 2 or type 7, and from a dimension staging table perform a loading as a merge or an insert and update to a dimension table in a data mart area.

 

I know that many like to send a query to a source system to translate names of columns and data types, and to merge or divide columns, like a Name becoming the two columns FirstName and LastName (Surname), but then a data warehouse will not receive the original source data.

 

I like to use computed columns in a staging table for calculations and string manipulation. Sometimes I use a hashbyte value as a comparison column for a composite business key that is composed of multiple columns (ComparisonBkey_meta) and for the other data value columns (ComparisonData_meta) to merge and load data from a staging dimension table into a mart dimension table. Other times I use a generic stored procedure with dynamic sql statements (string gymnastics <template> merge construction) that will perform the load from data staging area to data mart area for many dimensions.
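
A minimal sketch of the hashbyte comparison columns, assuming a hypothetical staging table dsa.Stg_Dim_Customer_0 with a composite business key of CustomerNumber and CompanyCode:

ALTER TABLE dsa.Stg_Dim_Customer_0 ADD
  ComparisonBkey_meta AS HASHBYTES('SHA2_256',
    CONCAT(CustomerNumber, '|', CompanyCode)),                       -- hash of the composite business key
  ComparisonData_meta AS HASHBYTES('SHA2_256',
    CONCAT(CustomerName, '|', Address, '|', City, '|', Segment));    -- hash of the data value columns

Comparing the two hash columns against the corresponding columns in the mart dimension table is cheaper than comparing every column one by one in the merge.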

 

The DSA database will contain different auxiliary tables with names like stg, stage, tmp, temp, temporary, and with multiple staging tables for a target table, where the staging tables can have an additional number e.g.:

Stg_Customer_Order_1, Stg_Customer_Order_2, Stg_Customer_Order_3

and the last staging table, which shapes and suits the target table, uses number 0, as in

Stg_Customer_Order_0.

 

DSA will perform data quality checks and filter wrong data and invalid data into a quality assurance and quality control (QA/QC) database. DSA continues reconciling data from the archive for referential integrity, e.g. a foreign key value exists as a primary key value, and relationship cardinality rules for dependencies as mandatory, optional or contingent, to make sure whether there is a foreign key or not, whether it can be null or whether it needs a value. Many other value checks, summations and so on are done as a validation check of data. When all tables in this area are loaded with correct data, the DSA is successfully completed. A source system's new date or changed date, or the IdaInsertTime or ArcInsertTime, represents a batch of insertion and marks all data with that batch time no matter which source the data is coming from. In case of ValidFrom and ValidTo, all ValidFrom will use the same batch time, which is useful for joins between tables and between source systems to fetch the right data snapshot at a particular time back in history. Read more about data quality in section 1.5.
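
A minimal sketch of one such referential integrity check, assuming hypothetical staging tables; order rows whose customer number has no match are copied to a wrong table with a reason:

INSERT INTO wrong.Stg_Order (OrderNumber, CustomerNumber, WrongReason, WrongInsertTime)
SELECT o.OrderNumber, o.CustomerNumber,
       'CustomerNumber has no match in Stg_Customer', sysdatetime()
FROM dsa.Stg_Order AS o
WHERE NOT EXISTS (SELECT 1
                  FROM dsa.Stg_Customer AS c
                  WHERE c.CustomerNumber = o.CustomerNumber);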

 

Furthermore, derived columns and calculations are made ready for loading to the next area of the data warehouse. In case of an error in the ETL process or in important data, a halt condition is raised to stop the ETL process. Other times data will pass by with a flag in an audit dimension. Kimball has called it data wrangling to lasso the data and get it under control (data munging).

 

Remember to do an audit report after a validation check.

 

For the data lineage most tables in the Data staging area will include metadata columns from the Archive area as RecordSource, ArcStorage and ArcRecordId.

 

(In danish data staging area stands for et data forberedelses­område eller data ran­ge­rings­område til data vask, data ud­søg­ning, data berigelse, data be­hand­ling, data beregning, data bearbejdning, data sammen­lægning, sammen­stilling, sam­kø­ring, tilrettelæggelse og klargøring af data hvor data er blevet vasket og strøget).

 

Kimball calls this area an ETL system or the first system that is off limits to all final data warehouse clients, and he uses an analogy: »The staging area is exactly like the kitchen in a restaurant. The kitchen is a busy, even dangerous, place filled with sharp knives and hot liquids. The cooks are busy, focused on the task of preparing the food. It just isn't appropriate to allow diners into a professional kitchen or allow the cooks to be distracted with the very separate issues of the fine dining experience.«

 

Kimball's ETL is using Data Staging Area (DSA) as a temporary space to create a data warehouse as a collection of data marts.

 

Data mart area - DMA

From Data staging area to Data mart area with dimensional modeling, conformed and shared dimension tables, a star schema around each fact table, and an assigned surrogate key (artificial key, identity column, a unique sequence number) for each dimension, used in the fact as a foreign key. When multiple fact tables share dimensions it is called a constellation schema or multi-star schema.

 

Keep historical data in dimensions and facts with the timeline columns ValidFrom and ValidTo. Make it easy for a business user to search for current data through a view, and to give a historical date to a table-valued function to fetch dimension data and fact data at any point of time. A data mart can have multiple fact tables with different granularities. One or multiple fact tables can create an extra derived fact table with special calculations and search filter criteria to enrich data to match the business requirements specification. Make a conformed name for a column with a conformed data type; for example, when you merge customer addresses from multiple source systems, make the string extra long to fit all address characters, and avoid making a value mandatory or set a default as the empty string to ensure a robust ETL process.
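
A minimal sketch of a current view and a point-in-time table-valued function, assuming a hypothetical type 2 dimension dma.Dim_Customer with ValidFrom and ValidTo:

CREATE VIEW dma.Dim_Customer_Current AS
SELECT *
FROM dma.Dim_Customer
WHERE ValidFrom <= sysdatetime() AND sysdatetime() < ValidTo;
GO
CREATE FUNCTION dma.Dim_Customer_AsOf (@PointInTime datetime2(7))
RETURNS TABLE
AS
RETURN
  SELECT *
  FROM dma.Dim_Customer
  WHERE ValidFrom <= @PointInTime AND @PointInTime < ValidTo;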

 

Never empty the data mart area, and always take a backup before the ETL process.

 

A data mart is a front-room for publishing the organization's data assets to effectively support improved business decision making. A data mart contains one data area for one purpose and is subject-oriented, therefore a data warehouse will consist of many data marts, and they will have some common dimensions. To avoid maintaining the same dimension in multiple data marts, it may be considered to have a dimension data mart that shares its tables through views inside the other data marts.

 

For the data lineage most tables in the Data mart area will include metadata columns from the Archive area as RecordSource, ArcStorage and ArcRecordId.

 

(In danish data mart area stands for et behandlet lag, tilrettelagt lag, klargjort lag med data som opfylder forretningens data behov).

 

Kimball calls this area a dimensional presentation area, and he uses the term first-level for a data mart that contains data from only one source system and second-level or consolidated for a data mart with multiple sources that crosses business processes. Read more about data mart modeling in section 1.3.7.

 

Presentation interface area - PIA

From Data mart area to Presentation interface area, where data access tools like Reporting Services (SSRS) and Microsoft Access or data visualization tools like Tableau, QlikView, Qlik Sense and Power BI can import data from a dimensional schema and handle in-memory calculated KPIs as quick measures using filters.

 

The purpose is to help and support a business user to solve a business case.

 

An OLAP cube or a Tabular model loads data from a data mart, and the processing calculates KPIs and displays a pivot in a data access tool like Excel, Power BI, Tableau, QlikView, Alteryx or Targit.

 

A report with criteria parameters searches, selects and calculates data from a data mart based on an ad hoc query of a business user, or the report is sent out as a pdf file or Excel file to the business users every day, week, month and so on.

 

A dashboard is a collection of unrelated data intended to summarize the state of the business during a given time frame.

 

A simple dashboard sample.

 

 

                  This week   Amount      This month   Amount       YTD    Amount
New orders        12          $3,054      41           $12,469      478    $118,608
New customers     3                       11                        97
New employees     0                       2                         6

 

 

A dashboard provides a quick and clear overview of the key figures, indicators (KPI) and performance data that are most important to him or her, shown as numbers and data visualization linked to business user groups. »If your dashboard does not fit on one page, you have a report, not a dashboard.« cf. Avinash Kaushik.

 

Start writing a data story while designing a dashboard, where the design should follow the narrative with a destination of answering the business questions. Dashboards for business need to answer key business questions, where every chart, graph, metric and KPI gives actionable insights and helps a stakeholder or a business user to make better data-driven decisions.

 

Data mining and machine lear­ning will also use data from a data mart.

 

Percentage (fraction) and ratio based on a dimension slice will be calculated »on the fly« in a BI tool when a fact table contains a numerator and a denominator, e.g. a DAX (Data Analysis Expressions) query language for a tabular model in Power BI.

Power BI guidance to star-schema and best to use a star schema in Power BI.

DAX for SQL user and DAX tutorial.

 

Two measures, Average price and Active customers, with a local variable for an intermediate calculation in a code block ending with RETURN, in DAX from a Sales fact:

 

Average price := DIVIDE(SUMX('Fact_Sales', 'Fact_Sales'[UnitPrice] * 'Fact_Sales'[Quantity]), SUM('Fact_Sales'[Quantity]), 0)

Active customers :=
  VAR measurevalue =
    CALCULATE(
      DISTINCTCOUNT('Fact_Sales'[Customer_key]),
      FILTER('Fact_Sales',
             'Fact_Sales'[EndDate] = BLANK() &&
             'Fact_Sales'[Amount] > 0))
  RETURN IF(ISBLANK(measurevalue), 0, measurevalue)

 

Kimball calls this area a Business In­tel­ligen­ce Application.

A date range lookup in a Kimball type 2 dimension when fact has a business key (_bkey) and a LaunchDate column in DAX:

 

CampaignName :=
  CALCULATE(VALUES('Dim_Campaign'[CampaignName]),
    FILTER('Dim_Campaign',
           'Dim_Campaign'[Campaign_bkey] = 'Fact_Product'[Campaign_bkey] &&
           'Fact_Product'[LaunchDate] >= 'Dim_Campaign'[ValidFrom] &&
           'Fact_Product'[LaunchDate] <  'Dim_Campaign'[ValidTo]))

 

10 simple steps for effective KPI.   Data driven Infographic

 

Supporting databases and/or schemas in a database

Supporting data in tables for a data warehouse can be placed in either several databases and/or in several schemas in a database, to be used across the data areas. Here are some examples.

 

Audit with auditing data for an audit trail for reconciliation checks against source systems, with number of rows in tables and other summarized measures to monitor and control, to avoid discrepancy between source systems and the data warehouse and make sure that all data has been fetched and saved in the areas of the data warehouse by doing a validation check. Log the RowCount of target rows before and after a process to measure the change in number of rows the process made (TabelRowCountBeginAt and TabelRowCountEndAt), the RowCount of source data rows and extracted rows based on joins and criteria, and the RowCount of target deleted rows, updated rows and successfully inserted/loaded rows or rejected rows with missing values, out-of-bounds amounts or other kinds of invalid data, depending on the validation check of data quality etc. Do alert the data owner and data users by email. Auditing is a kind of inspection (in danish eftersyn).

 

System with data for the ETL process, e.g. connection strings, and capture of execution log metadata (at runtime), e.g. ExecutionLogId, execution time, status of execution, error message, job log, package, transform and exception handling. It also stores a »last value« for delta load or incremental load with delta data detection, and information for IdaBatchDataCaptureId, IdaBatchDataCaptureTime, ArcBatchDataCaptureId, ArcBatchDataCaptureTime etc.
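
A minimal sketch of the »last value« pattern for delta load, assuming a hypothetical control table etl.DeltaLoadControl and a source table with an InsertUpdateTimestamp column:

DECLARE @LastValue datetime2(7) =
  (SELECT LastReceivedValue
   FROM etl.DeltaLoadControl
   WHERE SourceName = 'Sales.Order');

SELECT o.*
INTO #Delta                                   -- only rows changed since the last load
FROM Sales.[Order] AS o
WHERE o.InsertUpdateTimestamp > @LastValue;

UPDATE etl.DeltaLoadControl                   -- remember the new maximum for the next run
SET LastReceivedValue = (SELECT MAX(InsertUpdateTimestamp) FROM #Delta)
WHERE SourceName = 'Sales.Order'
  AND EXISTS (SELECT 1 FROM #Delta);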

 

Test with test cases while pro­gram­ming the ETL process and for making KPI values to validate the program.

 

Usage with custom data for mapping and consolidation of data from multiple source systems, and data for dimension descriptive context, hierarchy, grouping, sort order and rules for new dimension values. Usage tells how source data is transformed and changed into useful information by the ETL process to meet the requirements of the business and displayed in a data mart. Usage data is maintained and updated by the business users through an application. For example, product data can be found in multiple source systems with different spellings, where a usage application will help the business to unite, interconnect, compile, conform and consolidate the products into one spelling for a product, together with a mapping of the multiple source systems' product numbers into one product. The usage application can also provide hierarchy levels, grouping and sort order for the products. Some data in the Usage supporting database will be loaded from the archive, because the data comes from a source system, to be enriched by the business users. The Input data area will fetch data from the Usage supporting database and pass it on to the archive to be used in transformation and in updating dimensions with enriched data that doesn't exist in the source system. Usage data will also be used for making fact data, e.g. a calculation with a constant value, a rule as if-then-else and a lookup value. Usage is data-driven to avoid data values inside the ETL package, the sql statements and the program. A Master Data Management (MDM) is an extension of a Usage database (in danish brugerdefinerede stamdata, ordet custom kan oversættes til specialbygget). Examples of Usage data in section 6.1.

 

Wrong with data that can't be loaded into a target because of an error description e.g.: »The data value violates integrity constraints«, »Violation of primary key constraint«, »Cannot insert the value null«, »Cannot insert duplicate«, »String data would be truncated«, »Arithmetic overflow error«. Wrong data is rejected data, like when the ETL process removes duplicate rows and keeps them in a wrong table to be monitored by a data steward who can inform a source system. Wrong data can also be data that does not follow the quality assurance and quality control. Quality assurance, consistency and integrity are important parts of the ETL process and the audit trail. More examples in section 1.5 and an Audit dimension in section 4.4 to tag a wrong data row in a target table.

 

When custom data in the Usage database is going to be changed it can be done in a UAT user acceptance testing environment. After the ETL process has been executed we can test the data warehouse. Later we can take a backup of the Usage database and restore it in the production environment.

 

Data integration mapping by abbreviations example

A Usage database has two tables for sales device from two operational systems, and each table has its own business key column labelled _bkey. I have invented an immutable surrogate business key column labelled _sbkey which contains abbreviations as a simple letter code that can be used in a dimension in a data mart as an easy search criterion for a business user query and a dropdown box in a dashboard or report. The _sbkey is called a retained key in SAS and a persistent durable supernatural key in Kimball. A _sbkey column is mapped to a business key column, which makes it stand for an integration and a merge to a conformed dimension.

 

Let's have a look at an example that is using a key map table to make a solid mapping for a dimension called SalesDevice in a data mart area (DMA).

 

A table SalesDevice_Map has a surrogate business key _sbkey as the primary key, merging all devices from retailer shops and from the online shop to a current name, a sort order and a category or classification or grouping called DeviceModel. The table SalesDevice_Map will be updated by a data steward in the Usage database, and the ETL process will flow the data to a Dim_SalesDevice type 1 dimension in a data mart area.

 

SalesDevice_sbkey | SalesDevice                | Sortorder | DeviceModel
CR                | Cash register              | 10        | Register
ECR               | Electronic cash register   | 20        | Register
TSCR              | Touch screen cash register | 30        | Register
CM                | Cashier machine            | 40        | Machine
PC                | Personal computer          | 50        | Computer
MAC               | Mac                        | 60        | Computer
IPD               | iPad                       | 70        | Computer
IPN               | iPhone                     | 80        | Phone
SM                | Smart phone                | 90        | Phone
TC                | Tablet computer            | 100       | Computer
MR                | Manually register          | 110       | Register

 

A retailer shop has different sales devices over time.

A key map table SalesDevice_Keymap_Retailer has a business key _bkey from diffe­rent source systems map­ped to the rele­vant sur­rogate business key _sbkey and timeline columns ValidFrom and ValidTo.

 

SalesDevice_bkey | SalesDevice_sbkey | ValidFrom  | ValidTo
9001             | CM                | 1900-01-01 | 9999-12-31
9000             | CR                | 1900-01-01 | 9999-12-31
1                | ECR               | 1900-01-01 | 9999-12-31
20               | MAC               | 1900-01-01 | 9999-12-31
9002             | MR                | 1900-01-01 | 9999-12-31
10               | PC                | 1900-01-01 | 9999-12-31
2                | TSCR              | 1900-01-01 | 9999-12-31

 

An online shop has different sales devices over time.

A key map table SalesDevice_Keymap_Online table has a business key _bkey from dif­ferent source systems map­ped to the relevant sur­rogate business key _sbkey and timeline columns ValidFrom and ValidTo.

 

SalesDevice_bkey | SalesDevice_sbkey | ValidFrom  | ValidTo
1                | IPD               | 1900-01-01 | 9999-12-31
2                | IPN               | 1900-01-01 | 9999-12-31
MAC              | MAC               | 1900-01-01 | 2010-12-31
100              | MAC               | 2011-01-01 | 9999-12-31
PC               | PC                | 1900-01-01 | 2010-12-31
101              | PC                | 2011-01-01 | 9999-12-31
20               | SM                | 1900-01-01 | 9999-12-31
10               | TC                | 1900-01-01 | 2015-12-31
10               | PC                | 2016-01-01 | 9999-12-31
15               | TC                | 2016-01-01 | 9999-12-31

 

We see that the sales devices PC and Mac have different business key values in the source systems compared to the retailer mapping table, and over time sales devices have changed business keys in a source system from a letter code to a number value. Above, business key 10 represents two different sales devices: first a Tablet computer and later a Personal computer.

 

For example, above business key value 1 has two mappings, ECR for a retailer Elec­tro­nic cash re­gi­ster and IPD for an online user iPad. The ETL process will for trans­action sales data do a lookup in the relevant key map table to fetch the _sbkey value and another lookup in the SalesDevice dimension to fetch the dimension key value to be saved within the fact row.

 

The table Dim_SalesDevice is a type 1 dimension with a primary key labelled _key that stands for dimension key and with inferred members for Missing and Unknown.

 

SalesDevice_key | SalesDevice_sbkey | DeviceModel | SalesDevice                | SortOrder
-2              | -2                | None        | Unknown                    | -2
-1              | -1                | None        | Missing                    | -1
1               | IPD               | Computer    | iPad                       | 70
2               | MAC               | Computer    | Mac                        | 60
3               | PC                | Computer    | Personal computer          | 50
4               | TC                | Computer    | Tablet computer            | 100
5               | CM                | Machine     | Cashier machine            | 40
6               | IPN               | Phone       | iPhone                     | 80
7               | SM                | Phone       | Smart phone                | 90
8               | CR                | Register    | Cash register              | 10
9               | ECR               | Register    | Electronic cash register   | 20
10              | MR                | Register    | Manually register          | 110
11              | TSCR              | Register    | Touch screen cash register | 30

 

The SalesDevice dimension has values or members as rows in a table in database.

 

An example of a view that binds the key map table and the dimension table together for an ETL process, where retailer transaction sales data can do a date range lookup on the SalesDevice_bkey value and fetch the dimension key through the mapping to the surrogate business key:

 

CREATE VIEW mapping.SalesDevice_Retailer_Lkp AS
SELECT
  map.SalesDevice_bkey,
  map.ValidFrom,
  map.ValidTo,
  dim.SalesDevice_key
FROM usage.SalesDevice_Keymap_Retailer map
     INNER JOIN dma.Dim_SalesDevice dim
       ON dim.SalesDevice_sbkey = map.SalesDevice_sbkey

 

Retailer transaction sales data joins to the lookup view to let its business key fetch the dimension key and insert data into a fact table. For the ETL process let us make Extracting as a delta load or incremental load with delta data detection in a where clause, Transforming of the business key SalesDeviceNumber to the dimension key SalesDevice_key, and Loading new data rows into a Sales fact table with a SalesDevice_key as a reference to the SalesDevice dimension, in a data mart area (dma schema):

 

INSERT INTO dma.Fact_Sales (PurchaseDate_key, SalesDevice_key, Amount, TransactionId)
SELECT
  trn.PurchaseDate, --date format same as the Date dimension role-playing Purchase.
  SalesDevice_key = ISNULL(sdrl.SalesDevice_key, -2), --in case of no match in join.
  trn.Amount,
  trn.TransactionId
FROM source.RetailerTransactionSales trn
     LEFT OUTER JOIN mapping.SalesDevice_Retailer_Lkp sdrl
       ON sdrl.SalesDevice_bkey = ISNULL(trn.SalesDeviceNumber, -1) AND
          trn.PurchaseDate >= sdrl.ValidFrom AND trn.PurchaseDate < sdrl.ValidTo
WHERE trn.TransactionId > (SELECT MAX(TransactionId) FROM dma.Fact_Sales)

 

Horoscope star signs as a static dimension example

A Usage database has a table with names of horoscopes as a natural key, an ab­breviation as a business key labelled _bkey, and I am using the planet/house cusp number as a business surrogate key labelled _sbkey. Element becomes a grouping of the twelve signs and the first level of the hierarchy ElementHoroscope. (icons)

 

Element | Horoscope   | BeginAt      | EndAt        | _bkey | _sbkey
Air     | Aquarius    | January 20   | February 18  | AQ    | 11
Air     | Gemini      | May 21       | June 20      | GM    | 3
Air     | Libra       | September 23 | October 22   | LI    | 7
Earth   | Capricorn   | December 22  | January 19   | CP    | 10
Earth   | Taurus      | April 20     | May 20       | TA    | 2
Earth   | Virgo       | August 23    | September 22 | VI    | 6
Fire    | Aries       | March 21     | April 19     | AR    | 1
Fire    | Leo         | July 23      | August 22    | LE    | 5
Fire    | Sagittarius | November 22  | December 21  | SG    | 9
Water   | Cancer      | June 21      | July 22      | CC    | 4
Water   | Pisces      | February 19  | March 20     | PS    | 12
Water   | Scorpio     | October 23   | November 21  | SC    | 8

 

 

1.2. Data capture or data ingestion

Data capture or data ingestion or ingesting (in danish datafangst, hjemtagelse, indtage, modtagelse) of data from a source system to a Landing zone area and/or an Input data area is based on two data flow directions:

 

Push where Source finds the data and delivers it to an area like a database or a file on ftp in a Landing zone area; later the Input data area will fetch the data and indicate that the data is downloaded, received and stored, so Source knows that it can make a new delivery, by using a semaphore to control access to a common resource by multiple processes. A snapshot on a source system can create a read-only and static view that is transactionally consistent.

 

Pull where the Landing zone area or Input data area fetches data from Source through a sql Select From Where statement, maybe with joins of tables, or using a view or calling a stored procedure with a parameter as a criteria value to limit source data and for doing delta data detection, delta load or incremental load. It can also be done from an ETL tool or a web service where the data warehouse requests source data.

 

Granularity of the data capture integration strategy is a consideration when the amount of data is big in a source system. A data warehouse prefers to receive the lowest granularity level of detail for specific analysis usage or data validation, but sometimes it is necessary to aggregate and summarize source data to a higher granularity, like per day and per customer segment, for transaction data or transactional data from an OLTP system, e.g. orders, invoices, billings, payments, site hits.

 

Non-transactional operational system data, e.g. Customer, Location, Contact, Supplier, Part, Product, can be stored in a Master Data Management (MDM) database as master tables that are shared with multiple operational systems in an organization to avoid data redundancy, in the sense that the same data must otherwise be updated in several systems by a business user. Instead, master data is updated only in one place by a business user, after which data is automatically delivered to a number of operational systems to achieve one truth about data. The operational systems submit data to MDM, and the operational systems subscribe to and continuously receive current data as needed, possibly including versioned historically collected data.

MDM is an extension of the Usage supporting database.

 

Data capture from a source system is based on multiple data delivery forms:

 

  • Data rowset, recordset, multiple result sets from a relational database.
  • csv file with a separator like comma or semicolon, maybe with field header.
  • tsv file e.g. from Excel save as unicode text with tabular separator, txt file.
  • XML Extensible Markup Language with XSD XML Schema Definition.
  • JSON JavaScript Object Notation with JSON Schema.
  • JSONL JSON Lines.
  • TIP Nasdaq specification that looks like a mix of xml and csv formats l1, l2.
  • Parquet from Apache.

 

Data capture from a source system is based on two data delivery methods:

 

Full load or full dump as flush and fill where Landing zone area or Input data area gets all data from a source system.

 

Delta load or Incremental load with delta data detection where the Landing zone area or the Input data area only gets new data or changed data from a source system, called Change Data Capture (CDC). It needs an initial load the first time it runs. It can be the source system that keeps track of the delivered date, or it can be the Input data area that keeps track of a last received Sequence Id (integer), Sequence Time (datetime), or an InsertUpdateTimestamp value from a source system for each data row, so the Input data area can detect new data or changed data by asking for data since that last received value or the maximum value found in the last delivery, e.g.: InsertUpdateTimestamp > the last received value.

It could also be a posting data e.g. entry date (or created date) and next delta load or incremental load is for: EntryDate > »the last received EntryDate«.

(in danish Posteringsdato eller Bogføringsdato).

Delta load or Incremental load is relevant for a real-time data warehouse with continuous updates throughout the day to be displayed in a dashboard, and when the amount of data and the number of rows are big and a full load takes longer and longer time as time goes by.

 

It is important to log the number of rows from a source system to the Landing zone area and to the Input data area for a later reconciliation check and count auditing (in danish kontroloptælling, afstemning), in case a source system does not deliver the expected number of rows, and to monitor the number of rows over time to see if it increases as expected and to use it in a graph over the amount of data in the data warehouse. Other data values, e.g. amount, quantity and volume, can also be logged and monitored. I recommend that a source system tells the number of rows per table, which will be compared to the saved number of rows in the Input data area in an audit trail. It can be implemented as a view in a source system like this:

 

CREATE VIEW BI_Audit_NumberOfRows AS
SELECT 'Northwind' AS Sourcename, 'Customers' AS Tablename, COUNT(*) AS NumberOfRows
FROM dbo.Customers
UNION ALL
SELECT 'Northwind' AS Sourcename, 'Orders' AS Tablename, COUNT(*) AS NumberOfRows
FROM dbo.Orders
UNION ALL
SELECT 'Northwind' AS Sourcename, 'Order Details' AS Tablename, COUNT(*) AS NumberOfRows
FROM dbo.[Order Details]

 

The data warehouse can do its own delta data detection between the Input data area and the Archive area to identify new data, changed data and deleted data, and maintaining historical data with ValidFrom and ValidTo datetime stamps is handled by the following actions (see the sketch after the list):

 

  • Which rows are new: give them an InsertedDateTime or a CreatedDateTime and a flag I for inserted or N for new data.
  • Which rows have been updated, modified or changed: give them an UpdatedDateTime, a ModifiedDateTime or a LastChangedDateTime and a flag U for updated, M for modified or C for changed.
  • Which rows have been deleted: give them a DeletedDateTime and a flag D for deleted.
  • Which rows have been cancelled, e.g. a wrong customer: give them a CancelledDateTime and a flag C for cancelled when C is not already used for changed.
  • Which rows have been terminated, e.g. a product no longer for sale: give them a TerminatedDateTime and a flag T for terminated.
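
A minimal sketch of such delta data detection with flags, assuming hypothetical tables ida.Customer (newest snapshot) and ara.Customer_Current (latest archived version of each row):

SELECT
  COALESCE(i.CustomerId, a.CustomerId) AS CustomerId,
  CASE
    WHEN a.CustomerId IS NULL                                 THEN 'I'  -- new row in the source
    WHEN i.CustomerId IS NULL                                 THEN 'D'  -- deleted in the source
    WHEN i.CustomerName <> a.CustomerName OR i.City <> a.City THEN 'U'  -- changed row
  END AS DeltaFlag,
  sysdatetime() AS DeltaDetectedTime
FROM ida.Customer AS i
     FULL OUTER JOIN ara.Customer_Current AS a
       ON a.CustomerId = i.CustomerId
WHERE a.CustomerId IS NULL
   OR i.CustomerId IS NULL
   OR i.CustomerName <> a.CustomerName OR i.City <> a.City;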

 

Sometimes a source system provides flag information to pass a status to a data warehouse. I have seen combinations of consistent statuses like these:

 

  • Inserted – Updated – Deleted
  • New – Changed – Removed
  • Created – Modified – Deleted or Destroyed
  • Create – Read – Update – Delete (CRUD)
  • Added – Changed – Removed

 

Microsoft Windows calls it Add/Remove Programs in Control Panel (ARP).

 

Change Data Capture (CDC) is a method to log data changes in a table by transferring a copy of the data row from the table (or the table log) to another log table. Changed data represents inserted data, updated data and deleted data. Fetch the data values that have changed since the last fetching (extract). (In danish Data ændringsdetektion d.v.s at opsnappe eller at spore at data har ændret sig i et andet system).

 

Sometimes a source system updates a data row multiple times, but only the most recent version goes to the data warehouse. If a source system inserts and updates a row before a new load to the data warehouse, only the updated version goes to the data warehouse. If a source system inserts a new row and deletes the same row before a new load, the data will never go to the data warehouse. It is seldom that a source system has a log and logs all changes with a revision number or a ValidFrom and ValidTo datetime stamp; therefore a data warehouse does not contain 100% full historical updates of data in an organization. Sometimes a source system has, for each (or selected) table in the database, an extra historical log table that contains all inserted, updated and deleted data rows from the original table, together with an Action column (values for Insert, Update, Delete), an EntryDate datetime stamp column that marks when the action occurred, and a SequenceId that is used for delta data detection in delta load or incremental load to pull data from a source log to an Input data area, maybe through a Landing zone area.

 

An example of a source historical log table with a row every time a customer changes data. A customer has CustomerId 421 in a source system and her maiden name is Marie Beaulieur. She is inserted into the log table, which keeps CustomerId as a business key and adds a unique sequence id as primary key for the log, to be independent of the business key. At date 2007-11-16 she got married and took her husband's surname, therefore her name was changed to Marie Lesauvage and she was inserted into the log table with a new row. Marie Lesauvage remarried at 2014-07-15 and took her new husband's surname, therefore her name was changed to Marie Sainte and she was inserted into the log table with a new row. At date 2017-02-07 she stopped her membership and was deleted as a customer in the source system, and she was inserted into the log table with a new row.

 

SequenceId | LogEntryDateTime            | LogEntryAction | CustomerId | CustomerName
626        | 2000-05-08 10:12:34.4906240 | Inserted       | 421        | Marie Beaulieur
7503       | 2007-11-16 14:37:07.2353231 | Updated        | 421        | Marie Lesauvage
19473      | 2014-07-15 11:09:33.7629562 | Updated        | 421        | Marie Sainte
25003      | 2017-02-07 23:57:21.9876543 | Deleted        | 421        | Marie Sainte

 

Implemented as after-insert/update/delete triggers in a source system.
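
A minimal sketch of one such trigger, assuming hypothetical tables dbo.Customer and dbo.CustomerLog where SequenceId is an identity column; similar triggers handle insert (via the inserted pseudo-table) and delete (via the deleted pseudo-table):

CREATE TRIGGER dbo.trg_Customer_LogUpdate
ON dbo.Customer
AFTER UPDATE
AS
BEGIN
  INSERT INTO dbo.CustomerLog (LogEntryDateTime, LogEntryAction, CustomerId, CustomerName)
  SELECT sysdatetime(), 'Updated', i.CustomerId, i.CustomerName
  FROM inserted AS i;   -- the new version of each updated row
END;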

 

Another example: Jane Macquarie was sales manager for the eastern region until December 31, 2018, and then took responsibility for the western region from January 1, 2019. How was it registered in a source system? I have seen systems without a date to register the shift, because an operator would update the responsibility at the beginning of January 2019, which means the data warehouse can only give Jane a ValidFrom date of 2019-01-05 for the western region; but that is a wrong picture, and sales fact data for a few days in 2019 will belong to the eastern region. Please, always ask a source system for multiple date columns and use them for ValidFrom and ValidTo to get a real picture of the business in the archive area and data mart area.

 

Data latency describes how quickly source data has to be ready in the data warehouse for the business users to do their reporting. The deadline is normally in the morning, based on a periodic batch of source data from yesterday. For a real-time load to a data warehouse with constantly new presentation in a dashboard, the ETL process must be streaming oriented, where source data continuously flows into the data warehouse by delta load or incremental load to do transformation and make conformed data.

 

1.3. Enterprise Data Warehouse modeling architecture - EDW

A data warehouse can be modeled in different ways. I will present three data models. In a data warehouse solution the source data from multiple source systems can go into a single common database called an Enterprise Data Warehouse (EDW), where data is integrated into a single and coherent structure. Notice that the modeling of an EDW is based on business requirements and not on source data structure.

 

An EDW contains and represents a »single version of truth« for the enterprise.

 

1.3.1. Bill Inmon modeling - Corporate Information Factory model

Bill Inmon is using an Operational Data Store (ODS), a place to store source data as a central database that provides a snapshot of the latest source data from multiple source systems (operational systems or transactional systems) with validated referential integrity, for operational reporting and as a source of data to an Enterprise Data Warehouse (EDW), which feeds the Corporate Information Factory (CIF), which provides a logical framework for delivering business intelligence.

 

When an EDW is modeled after Inmon's modeling it is based on Peter Chen's Entity Relationship data modeling with super-sub entities, associative entities and with 80% normalized data, often a third normal form (3NF) relational database.

 

The EDW offers integrated, granular, historical and stable data that has not yet been modified for a concrete usage and can therefore be seen as neutral. It can keep historical data meaning all the changes to the data or only the days-end status e.g. End Of The Day for each data revision from ODS (or Archive area or Input data area).

 

An entity Person can have a one-to-many relationship to the addresses of the per­son’s living places in another PersonAddress entity with ValidFrom and ValidTo co­­lumns, and these entities or tables can be merged into a type 2 dimension table in a Kimball data mart.

 

The EDW modeling is subject-oriented, meaning all business processes for each subject, e.g. customer, need to be modelled. An EDW common data model is not technical; it must be based on (logical) business processes, since all source data is integrated and collected into one common database.

 

Inmon's modeling can divide or partition data into super-sub tables to place related columns in their own sub table, separated from other columns. There is often a one-to-one relationship between a super table and its sub tables. We can say that this kind of division of data is subject-oriented. A super table can contain a business key (natural key, surrogate key) from a source system together with other basic data columns, or instead we place the business key in a mapping table connected to each super table, where a translation of a business key becomes a data warehouse surrogate key Id as an integer, guid or a hashbyte value, and we let the Id become the primary key in the super table and a foreign key in the sub tables. When a source system adds new data, the data warehouse can place the data in a new sub table and connect it to the existing super table; similarly when a new source system is introduced to the data warehouse. Inmon's modeling using super-sub tables allows for agile modeling, read more in section 1.3.4. Also called super-sub-entity or supertype/subtype, it has two rules:

 

  • Overlapping or inclusive e.g. a person may be an employee or a customer or both.
  • Non-overlapping, mutually exclusive or disjoint e.g. a patient can either be out­patient or resident patient, but not both at the same time.

 

Let the design of the data model for an EDW start with a »helicopter view« for identification of entities, e.g. a salesperson or an engineer will be stored in an Employee entity with a one-to-many relationship to a Jobfunction entity, or maybe do more abstraction in a bigger picture and make a Person entity where employees and customers can be stored together. Later in the design process move on to a »weeds view« for all the details of columns, conformed names, data types etc.

 

In the EDW, keep the source data with time span columns like Start date and Stop date, and sometimes also convert the source data to row-wise entities as a kind of transactional data, i.e. a sequence of information exchange of financial, logistical or work-related data, involving everything from a purchase order that becomes an invoice with multiple invoice line items, to shipping status, employee hours worked, plan and activity records, subscription periods, and insurance costs and claims.

 

An EDW can be a data source to one or multiple data marts using Dimensional modeling with denormalized data controlled by the ETL process. A data mart can use a different modeling that fits the use of data better. An EDW should be accountable and auditable, which by default means pushing business rules that change/alter data downstream to »between the data warehouse and the data marts«.

 

EDW tables and views can be the data source for a tabular model with DAX formulas to filter and calculate in-memory KPIs to be visualized in Power BI, where a data mart area is skipped or replaced by EDW views; therefore it is called an ELT process because the transformations are executed on the way to a business user.

W H Inmon history look

 

1.3.2. Anchor modeling

When an EDW is modeled after Anchor modeling it is based on four elements that hold historical and time-variant raw data in entities/tables with labels:

Anchor is a surrogate key gene­rator for a source system business key from a data warehouse point of view.

Attribute has descriptive values from a source system connected to the business key; therefore an attribute has a foreign key reference back to the anchor. The attribute table is based on type 2 history, therefore it also has a surrogate key. A view will mix anchor and attribute and it becomes a type 7 dimension, because the surrogate key of the anchor becomes a durable key, and the surrogate key of the attribute becomes a type 2 surrogate key. Both keys will be added to a fact, and the fact can join to a most recent dimension to fetch the current values, or join to a historical dimension to fetch the registered values when the fact data occurred with its transaction date. The model in Power BI can choose the content for its presentation interface because all data is on a silver platter.

Knot is a lookup table with basic data.

Tie is a relationship ta­ble between anchor tables to handle one-to-one, one-to-many and many-to-many rela­tion­ships, therefore no foreign key in anchor or attribute table except to a knot table (»tie the knot« means getting married).

Anchor modeling has only extension with new tables and no modification of existing tables. This ensures that existing data warehouse applications will remain unaffected by the evolution of the data warehouse. When a business key exists in multiple source systems, there will be one common anchor table and several attribute tables, because each source system has its own attribute table.

 

Anchor modeling allows building a real-time ETL process to a data warehouse where some attribute tables need a more frequent update than others to provide fresh data to the business users in a report or an online dashboard. Other attribute tables only need updating per hour or at end-of-day.

 

Anchor modeling diagram example [Olle Regardt and Lars Rönnbäck] Read more.

 

1.3.3. Data vault modeling

When an EDW is modeled after Data vault modeling it is based on component parts of a Core Business Concept (CBC) as an ensemble consisting of three components that hold historical and time-variant raw data in entities/tables with labels: Hub is used to store a business key, Link is for a relationship between hubs, and Satellite contains the data and descriptive information. They have a LoadDate column (Load_dts, LoadDateTime, ValidFrom) to show when the data row was entered.

 

Each component can be drawn with its own color. Let me elaborate with some examples:

 

Hub (blue) separates the business keys from the rest of the model and will trans­late business key to a unique hashbyte key value. A composite business key of multiple columns will also become a hashbyte key column. E.g. a HubProduct has Product­HashKey together with busi­ness key ProductNumber. A hub is an in­te­gra­tion point of a business key or a unique identifier and will never change. A hub exists together with at least one satellite. A data vault table has a RecordSource column as a refe­ren­ce back to a source system for data lineage e.g. value "Dy­na­mics365.Sa­les.Product" and can be a multi-valued column with multiple sources. (Hub in danish er et sam­lings­punkt for én forretnings­nøgle som har data fordelt over flere satellitter der kredser om en hub eller et link).

 

Satellite (yellow) [sat] stores the context, descriptive data and measure values in columns of either a hub or a link. A satellite is connected to one hub or one link, and a hub or a link can have multiple satellites. E.g. a SatProduct has the foreign key ProductHashKey from HubProduct and data values for Name, Category, Target group etc. Satellites hold all the relevant data for the data warehouse. A satellite tracks history to handle historical data values for changes (updated/deleted) at any time, where the primary key is a composite of HashKey+LoadDate. A hash difference HashDiff column is a checksum of all data value columns, making an easy comparison for Kimball type 2 history, read more in section 4.3.
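
A minimal sketch of a hub and one of its satellites, assuming hypothetical columns; the hash key would be calculated from the business key with HASHBYTES('SHA2_256', ...) and HashDiff from the descriptive columns:

CREATE TABLE dv.HubProduct
(
  ProductHashKey binary(32)   NOT NULL PRIMARY KEY,  -- HASHBYTES('SHA2_256', ProductNumber)
  ProductNumber  varchar(50)  NOT NULL,              -- business key
  LoadDate       datetime2(7) NOT NULL,
  RecordSource   varchar(100) NOT NULL
);

CREATE TABLE dv.SatProduct
(
  ProductHashKey binary(32)   NOT NULL REFERENCES dv.HubProduct (ProductHashKey),
  LoadDate       datetime2(7) NOT NULL,
  RecordSource   varchar(100) NOT NULL,
  ProductName    varchar(100) NULL,
  Category       varchar(100) NULL,
  TargetGroup    varchar(100) NULL,
  DeletedDate    date         NULL,                  -- filled when the source reports the product as expired/deleted
  HashDiff       binary(32)   NOT NULL,              -- checksum of all data value columns
  CONSTRAINT PK_SatProduct PRIMARY KEY (ProductHashKey, LoadDate)
);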

 

To be 100% insert compliant there are no columns for EndDateTime, ValidTo or an IsCurrent flag, and therefore no updates of rows in a satellite. If a source system can tell that a product has expired or been deleted, then a new row is inserted into SatProduct with the current date in the column DeletedDate.

 

One hub or one link will usually have several satellites associated, because we regroup data into multiple satellites by classification, type of data and information, and rate of change, so each of them can have its own granularity and timeline. Split logical groups of data into multiple satellites. With multiple source systems the RecordSource column is helpful when the data sets are similar; otherwise let each source system have its own satellite, including a RecordSource column, and share the HashKey from the hub with all its satellites.

 

Employee satellites example

We start with an employee hub with a business key column EmployeeNumber that is translated to a column EmployeeHashKey, which will be in all satellite (sat) tables:

 

HubEmployee(EmployeeHashKey, EmployeeNumber)

SatEmployeeBasic(SSN, BirthDate, MaidenName) constants/correction overwrites.

SatEmployeeName(FirstName, MiddleName, LastName, Gender).

SatEmployeeContact(CellPhoneNumber, Email, Skype,Twitter,Facetime,Whatsapp).

SatEmployeeAddress with columns for address or a reference to an Address table.

SatEmployeeJob(FromDate, ToDate, Department, Title).

SatEmployeeHoliday(Date, Occasion).

SatEmployeeSalary(DateOfMonth, Amount).

SatEmployeeChristmasGift(Year, NameOfGift, AmountOfGift, LevelOfSatisfaction).

 

The above data comes from multiple source systems and data is added in­de­pen­dently of each other to the individual satellites (divide and conquer). There will be outer join from a hub to its satellites and inner join from a satellite to its hub.

 

More satellites

Conformed and cal­cu­la­ted data is placed in its own Computed Satellite. Other satel­lites are: Overloaded satellite, Multi-active satellite, Status tracking satellite, Effectivity sa­tel­lite and Re­cord tracking satellite.

 

Link (green) integrate and capture relationship between hubs and links. E.g. a product is placed in a store, therefore we have a HubProduct and a HubStore with data values in a SatProduct and a SatStore, and a Link­ProductStore represents the placement by combining ProductHashKey and StoreHashKey as foreign keys from the hubs to capture in which stores a product is placed and for sale. A link creates its own hashkey as a unique combination of the involved hub business keys e.g. ProductStoreHashKey from ProductNumber and StoreCode. Data values connected to a link is placed in its own satellite, e.g. Sat­Pro­ductStore with primary key/foreign key Product­Store­HashKey and data values for PlacementInStore and Quan­tity­In­Store. When a product no longer exists in a store, the ETL process will insert a new row in Sat­Pro­ductStore with the expiration date in the co­lumn DeletedDate.

 

A link handles one-to-one, one-to-many and many-to-many relationships because data vault models every relationship as an optional many-to-many. A link is a dataless and timeless connection among hub business keys and is not a data-bearing relationship as in ERD.

A satellite on a link represents the history of the connection or the relationship.

 

Product Supplier Category link example

A Product table with foreign keys SupplierID and CategoryID will often be put together into a LinkProductSupplierCategory with ProductHashKey, SupplierHashKey and CategoryHashKey from HubProduct, HubSupplier and HubCategory and a primary key ProductSupplierCategoryHashKey. Data values connected to the link are placed in SatProductSupplierCategory. Over time a product can change supplier and category, which is handled in the SatProductSupplierCategory by inserting a new row with a filled DeletedDate, while a new row is inserted into the link and its sat for the new data combination with a new LoadDate. If the business one day allows a product to have multiple suppliers, the link is already prepared for it. In the future, the business would like to place products in stocks all over the world, registered in a new inventory management system; that creates a HubStock and a SatStock, and a LinkProductStock and a SatProductStock to handle all the new data and relationships.

 

Marriage example

A unary/recursive relationship (self join) e.g. LinkMarriage represents a connection between two persons with Person1HashKey and Person2HashKey from HubPerson with business key SSN and it gives primary key MarriageHashKey. All data is placed in SatMarriage(MarriageHashKey, MarriedDate, DivorcedDate, Widowed­Da­te) that is labelled Effectivity satellite because of the columns of dates. A vali­da­tion rule in SatMarriage ensures that one person can only be married to one person at a time else it will be marked as an error for an audit trail to catch. When a per­son enters a marriage and changes last name, it becomes a new row in Sat­Per­son. When a mar­riage ends either by divorce💔 or by death then a new row is inserted into SatMar­ria­ge with filled DivorcedDate or Widowed­Date. Couples who are divor­ced and later remarry each other will reuse the same MarriageHashKey value from LinkMarriage for the new row inserted into SatMarriage. The LoadDate will be in order of the events. The Dele­ted­­Date will be used in connection with an error detection in the source system or an annulment of the marriage.

 

Another modeling of marriage: SatMarriage(MarriageHashKey, Date, MaritalStatus) with values of marital status: Married, Divorced and Widowed, makes it easy to ex­tend with extra values e.g. Separated, Registered partnership and Abolition of re­gi­stered part­ner­ship. When persons as partners are living together we will mo­de­ling it in a LinkCohabitation and a SatCohabitation. When persons only dates we make a LinkRelationship and a SatRelationship. Therefore each link and satellite is a flexible and agile way to divide data (agile in danish som smidig for forandringer, let og hurtig at udvide). In a generic perspective »helicopter view« we could have started with a general link to represent a connection between two persons with Person1­HashKey and Person2HashKey from HubPerson, and label the link a general name like LinkCouple with a Couple­HashKey, and let the types of data create multiple sa­tel­li­tes for the link as SatRelationship, SatCohabitation and SatMar­ria­­ge. It is nor­mal for a couple to start dating with data in the SatRelationship, living together with data in the SatCohabitation. When the couple gets married, the co­ha­bi­tation timeline stops in the SatCohabitation, and the data re­cor­ding continues in the Sat-Marriage. It is agile to think and data modeling in a general way and divide data.

 

Invoice example

An invoice becomes a HubInvoice where the business key InvoiceNumber becomes InvoiceHashKey, used in a SatInvoice with data values IssueDate, DueDate, DeliveryDate, PaymentMethod and PaidDate. When a payment happens, e.g. a week later, it could become a new row in the SatInvoice, but I prefer to add the payment to a new satellite called SatInvoicePaid with InvoiceHashKey, PaidDate, PaidMethod and PaidAmount, to divide data into separate satellites because of the time difference in data capture; it also allows for payment in installments in the future. If one day the sales department replaces the billing system, then create a new SatInvoice. An invoice involves a customer in a store, therefore a LinkInvoiceCustomer includes InvoiceHashKey, CustomerHashKey and StoreHashKey and combines the business keys to InvoiceCustomerHashKey, used in a SatInvoiceCustomer with data values like Reference no, Requisition number and Voucher number. An invoice line item involves a product, therefore a LinkInvoiceProduct includes InvoiceHashKey and ProductHashKey and combines the business keys to InvoiceProductHashKey, used in a SatInvoiceProduct with data values LineItemNumber, Quantity, CouponCode, Discount (%), UnitPrice, VAT and Amount. If the quantity in an existing invoice is corrected, then a new row is inserted into SatInvoiceProduct. If a product is replaced, then for the old product a new row is inserted into SatInvoiceProduct with a filled DeletedDate, and the new product is inserted into LinkInvoiceProduct and into SatInvoiceProduct.

 

More links

A link does not have its own unique identifier from a source system. Other links are: Nondescriptive link, Computed aggregate link, Exploration link and Link-to-link or Link-on-link.

 

Address example

Address will not be a Hub since the address itself is not a CBC but a description of a CBC e.g. Customer, Employee or Store. A CBC could be a Customer­Delivery­Loca­tion as a hub with a satellite and a link to Customer hub because a customer could have more than one active location where goods can be delivered. The business key for the hub is using a concatenation of zip code, street name, house number to get a unique identifier for the Customer­Delivery­Loca­tionHashKey.

 

More component parts

There are several other types of component parts (entity/table), for example:

Transactional-link (light green) integrates and captures relationships between hubs and links to capture multiple transactions that involve the same business keys, e.g. many sales to the same customer of the same product from the same store, meaning multiple rows with the same set of keys. A transactional-link has its own unique identifier for a transaction (TransactionId, EventId) from a source system. Notice that a transactional-link keeps its business key; it is not placed in an associated hub. If a revision of a transaction is received from a source system or a counterpart, then insert the data values into the associated satellite. Since DV raw data is a structured archive, it would be wrong to calculate a counterpart row, but when data is extracted, transformed and loaded to a Transactional fact it is fine to calculate a counterpart for the fact.

Nonhistorized-link is for immutable data that has no edit history, in other words, constant data that will never be changed or deleted in a source system. Each transaction (data row) has a unique identifier (TransactionId) and there is no revision of a transaction in the source system. When there is a counterpart to a transaction it will have its own unique identifier and a reference identifier back to the original transaction.

Reference (gray) is referenced from a satellite and is the same as a lookup table with basic data, e.g. RefDate or RefCalendar with many columns like Week, Month, Quarter and Year, RefMaritalStatus and RefMarriageType with Code and Text, and RefZipCode with Zipcode, City, State and State abbreviation for US. A date table is called a nonhistory reference table. A satellite is normalized to 3NF and contains only non-foreign key columns except simple reference values. If a reference (cross-reference or lookup table) contains history, then move the old data values to a History-based reference satellite with primary key Code+LoadDate where Text contains the previous texts. I prefer to have a RefAddress with extra columns like State, Region, Municipality and Latitude and Longitude (gps coordinate), Kimball page 235.

Same-as-link (turquoise) [sal] indicates that two business keys in a hub are the same thing as a logical representation of a link, or maps different business keys from multiple source systems where each source has its own satellite. In a sal we can »merge together« differently named business keys to a single master key that is really the same-as the other business keys. A same-as-link is connected to a hub and is used to identify when the same business objects are identified by multiple business key values.

Hierarchical-link (silverblue) [hal] captures hierarchical relationships between the business concept records in the same hub, e.g. a parent-child hierarchy.

Point-in-time (purple) [pit] is a helper table to a hub or to a link and is calculated by the ETL process based on all the satellites of the same hub or of the same link, because these satellites share the same HashKey together with different values in LoadDate. A pit table has its own LoadDate column (SnapshotDate), a column for the shared HashKey and columns for each involved satellite's LoadDate, therefore all the values of LoadDate from the satellites are represented in a pit table. It is used to derive all valid versions for ease of joins of satellites, to improve the performance, to find the newest or most recent rows and to calculate EndDate or ValidTo. (A different implementation of a pit table is to have a row for each date in the calendar or for each working date in column LoadDate, but of course with 1000 rows in one satellite over a period of ten years it will give millions of rows, yet it is very easy to look up on a given reporting date as a point in time or a transaction date). By adding a surrogate key sequence number to a pit table it can, through a view, create a virtual Kimball type 2 dimension. pit create and pit use. A sketch of calculating a pit table follows below.
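A minimal sketch of how a pit table could be calculated for a hub, assuming two satellites SatCustomerName and SatCustomerAddress and a target table PitCustomer (all names are illustrative):

;WITH SnapshotDates AS
(
 SELECT CustomerHashKey, LoadDate FROM SatCustomerName
 UNION                                   -- UNION also removes duplicate LoadDate values
 SELECT CustomerHashKey, LoadDate FROM SatCustomerAddress
)
INSERT INTO PitCustomer (CustomerHashKey, SnapshotDate, SatCustomerNameLoadDate, SatCustomerAddressLoadDate)
SELECT d.CustomerHashKey,
       d.LoadDate,                       -- the snapshot date
       (SELECT MAX(n.LoadDate) FROM SatCustomerName n
         WHERE n.CustomerHashKey = d.CustomerHashKey AND n.LoadDate <= d.LoadDate),
       (SELECT MAX(a.LoadDate) FROM SatCustomerAddress a
         WHERE a.CustomerHashKey = d.CustomerHashKey AND a.LoadDate <= d.LoadDate)
FROM SnapshotDates d;

Each row holds, for a given snapshot date, the satellite LoadDate that was valid at that point in time, which makes the later joins of satellites simple equality joins.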

Bridge (taupe) combines (or bridges) hashkeys and optional business keys from multiple links and their related hubs. A bridge can represent an aggregated level and can include derived and calculated columns.

 

HashKey

Data vault 2.0 recommends for a Hub to hash a business key value instead of using an identity column as a unique sequence number because:

  • HASHBYTES('SHA2_256',CONCAT(SSN,';',UPPER(TRIM(Area)))) gives data type binary(32)/char(64). SHA1 uses binary(20) and MD5 has been deprecated. A sketch of calculating a hash key follows after this list.
  • Before hashing a business key we do data operations such as upper-casing, left and right alignment or trimming, and save the result in a column called »comparison business key« together with the original value in a Hub table.
  • For a Hash diff to make an easy comparison of data columns, we do the same data operations when it is agreed with the business and the data profiling.
  • Hashing two different values could create the same hashkey value, which is called a collision; it is very unlikely but cannot be left to chance or avoidance. If a data batch contains two different business key values which give the same HashKey, then a bulk insert will result in a primary key violation in a Hub. If a new business key value gives the same HashKey as an already existing one in a Hub, then the loading can assume the new business key value is already there. Therefore first check in the Hub if the new business key value exists; if only the HashKey exists, then increase the HashKey until there is no collision or use a collision sequence number, but then there is no guarantee of the same hash key values when data is reloaded. Maybe there is a need for a hash key collision detection job and for detecting an undefined hash key. MD5 collision example. What to do when a business key value is null or is an empty string?
  • It eases copying data from one environment to another, like dev to test to prod, without worrying about sequences being out of sync.
  • When two source systems share the same business key but in one source it is an integer data type and in the other source it is a string, by hashing they become the same data type in the hub table and sat table.
  • When a source system is replaced and the new source system comes with different data types, e.g. from an integer business key to a string business key, it will still be hashed to the same data type in the data vault, therefore no maintenance.
  • A composite business key of multiple columns becomes one hashkey column which is easy to use for join and merge.
  • It is easy to create a hashkey of a business key value when needed by an ETL tool or in a sql select statement instead of doing a lookup to find the sequence number stored in a table. But of course we still need to make sure that the same business key is not inserted twice in the same hub table and avoid collision.
  • Do not use a HashKey column for a clustered index for the primary key in a hub for faster query join performance, because hashbytes has a non-sequential nature and therefore inserting new rows will take longer time. A non-clustered index should be considered as a default for HashKey and a unique clustered index on the business key in a hub. I have seen a hub with a unique clustered index on LoadDate+HashKey because the first part of the value is ever-increasing. More.
  • Loading to tables can be done in parallel, e.g. at the same time to a hub table and to a satellite table from the same source system table (parallel loading enables parallelization). It sounds awesome! But what if a ProductNumber is used in a Store relationship and will be inserted into the LinkProductStore with ProductHashKey and StoreHashKey, and the ProductNumber is an orphan and does not exist in the Product table from a source system and therefore is not in the HubProduct? Then it gives a referential integrity violation in the data vault, read about inferred members in section 4.5.
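A minimal sketch of the hash key calculation in a staging select; the table StagingCustomer and the column CustomerNumber are assumptions for illustration:

SELECT CustomerNumber,
       UPPER(TRIM(CustomerNumber)) AS ComparisonBusinessKey,        -- the »comparison business key«
       CONVERT(char(64),
               HASHBYTES('SHA2_256', UPPER(TRIM(CustomerNumber))),
               2) AS CustomerHashKey                                -- style 2 gives a hex string without 0x prefix
FROM StagingCustomer;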

 

Archive of data from source systems

I see the Data vault model as a structured archive where data from the source systems has been divided, and where the business rules and transformations are first enforced when data is going to the data marts.

 

Linstedt said: »The Data Vault is for back-end data warehousing, not for production release to business users directly.« DV becomes a persistent staging area or an archive area (ARA).

 

Getting the data out

Querying a data vault requires many more joins than Inmon's model. Be aware that it can be a performance issue to have columns of data type binary(32) or char(64) for storing millions of rows and for joining many tables together, compared to using an integer data type. Joining large pit, sat, link and hub tables together is expensive for performance, and it is often debated by Linstedt. As far as performance goes, normalised structures are nowhere near as performant as a dimensional model. Nothing beats dimensional modeling for performance of queries.

 

Data vault divides data into a Raw area and a Business area called Business Vault, which contains business calculations and shows data at a different level of granularity by making new links with more or fewer hubs (and business keys) involved. Pit and Bridge are part of the Business Vault and provide tables that give a better query performance. Business Vaults can take many forms. It's more about provisioning data for business users.

 

For an ETL process from a Data vault to a Dimensional model, think that a Hub and its Satellites become a Dimension, and a Link and its Satellites become a Fact. Pit and Bridge are a big help, and sometimes the ETL is done by views only, as sketched below.
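A minimal sketch of a view that turns a hub and its satellite into a type 1 dimension; HubCustomer, SatCustomer, their columns and the dim schema are assumptions for illustration:

CREATE VIEW dim.Customer AS
SELECT h.CustomerHashKey,
       h.CustomerNumber,               -- the business key from the hub
       s.CustomerName,
       s.City
FROM HubCustomer h
JOIN SatCustomer s ON s.CustomerHashKey = h.CustomerHashKey
WHERE s.LoadDate = (SELECT MAX(s2.LoadDate)                -- keep only the most recent satellite row
                    FROM SatCustomer s2
                    WHERE s2.CustomerHashKey = s.CustomerHashKey);

With a pit table the subquery can be replaced by an equality join on the pit table's satellite LoadDate column.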

When to use Data Vault? When a data warehouse has many source systems!

 

See an Order line example in 3NF Model, Dimensional Model and Data Vault Model.

 

Data vault modeling diagram example [Dan Linstedt] Read more, extra more, more about loading, and there are many sites where individual persons give their recommendations. An example in a six-minute video and a longer video.

Dan Linstedt commented on my post at LinkedIn that he is using colors: green for hub, blue for link and white for satellite.

 

1.3.4. Agile modeling

Inmon's modeling is based on a normalized data model, and the new data models for EDW want to break things out into »parts« for agility, extensibility, flexibility, generality, scalability and productivity to facilitate the capture of things that are either interpreted in different ways or changing independently of each other.

 

Agile modeling has only extension with new tables and no modification of existing tables. Changes in an agile data warehouse environment only require extensions, not modifications, and have no impact on the data warehouse, so it becomes quick-moving: the data warehouse can easily be customized when the business changes. An ensemble modeling pattern gives the ability to build incrementally, and future changes should not impact the existing design. Agile development is performed in an iterative, incremental and collaborative manner. I think we will always meet an issue that forces us to do a model refactoring or re-engineering. Read more.

 

Inmon's modeling is a free data model where keys can be split to map tables and data into super-sub tables, and where one-to-one, one-to-many and many-to-many relationships are handled by connection tables. With Entity Relationship data modeling you can make your own labelling of tables in a data model, compared to the fixed labels in Anchor modeling and Data vault modeling.

 

For Anchor modeling and Data vault modeling the basic idea is to split (groups of) keys into their own entities (called anchors and hubs) and split of all other data into additional entities (non key entities called attributes, satellites), so an Anchor or a Hub table has multiple Attribute or Satellite tables attached because:

 

  • Multiple source systems divide data in a natural way.
  • A large number of columns are separated into smaller su­bject-orien­ted attributes and satellites (based on super-sub entities).
  • For additional columns due to new requirements in order not to extend exis­ting ETL process.
  • Different change frequencies of columns and to be 100% insert compliant.

 

This has advantages of independent load jobs, data model enhancements etc. but makes it more difficult to extract and fetch data. To reduce the complexity in queries, a proven approach is to create (or generate) a view layer on top of the model with Join Elimination etc.

 

Normalization to sixth normal form 6NF is intended to decompose table columns into irreducible components, a hypernormalized database where essentially every column has its own table with a key-value pair. I see 6NF as a physical level, while I see 1NF to 5NF (you can include DKNF) as a logical level useful for design and modeling of a database.

 

A generic data model does not need any changes when there is a change in the source systems. There is a Party modeling, a 2G modeling and a Focal Point modeling and more than dozens of data warehouse data modeling patterns that have been introduced over the past decade.

 

EDM = Enterprise Data Model is a term from IBM back in 1998 for a data modeling tech­nique for data warehousing, where ETL was called Capture, Transform, Apply or Extract, Clean, Conform, Deliver processes.

 

A simple guidance for EDW: Do not lose data (auditing/compliance/validation). Model it, so you can understand it. Make it physical, so you can query it.

 

Data warehouse is using a surrogate key instead of a business key to remove de­pen­dence from a source system, read more in section 4.2.

 

1.3.5. Dimensional modeling

Ralph Kimball does not like to store data in an EDW; he only stores data in data marts that use Dimensional modeling, therefore the EDW becomes a union of all data marts.

 

Dimensional modeling ends up in a star schema or constellation schema (a group of stars, multi-star schema) with fact tables (analysis variables, measures, events) surrounded by dimension tables (context), where dimensions explain the facts. Dimension and fact conformance is a must in a successful data warehouse implementation to meet the requirements of legislation, accepted practices, prescribed rules and regulations, specified standards and terms of a contract.

 

Star schema

 

 

Entity Relationship Diagram ERD with five one-to-many relationships.

 

 

Date dimension is a type 0 with original values and a dimension key in column Date_key of data type date for the pri­ma­ry key of the dimension.

 

Date_key     Year   Quarter   Month    Week     Season
2014-04-23   2014   2014Q2    April    201417   Spring
2014-04-24   2014   2014Q2    April    201417   Spring
2014-04-25   2014   2014Q2    April    201417   Spring
2014-05-08   2014   2014Q2    May      201419   Spring
2014-08-16   2014   2014Q3    August   201433   Summer

 

Employee dimension is a type 1 with most recent, current values and a dimension key in column Employee_key as an artificial auto-generated unique sequence num­ber or an identity column as a surrogate key for the primary key of the dimension.

 

Employee_key   Login   Name              Title
1              NADA    Nancy Davolio     Sales rep
2              JALE    Janet Leverling   Sales rep
3              MISU    Michael Suyama    Sales Manager

 

Store dimension is a type 1 with most recent, current values and a dimension key in column Store_key as a unique sequence number as a surrogate key.

 

Store_key   Code     StoreName   Address
13          UK-LON   Shophouse   14 Oxford Street, London
16          FR-PAR   Boutique    Rue Cler 52, Paris

 

Product dimension is a type 1 with most recent, current values and a dimension key in column Product_key as a unique sequence number as a surrogate key. Other columns create a hierarchy: Category→Brand→Label.

 

Product_key   Number      Label          Brand    Category   Price
504           571145292   Blou jeans     Zara     Clothing   60
538           571145311   Comfie pants   Lee      Clothing   25
893           245600390   Walk air       Adidas   Shoes      89
1408          746153022   Passion no 9   Chanel   Beauty     37

 

Customer dimension is a type 2 to keep history in multiple rows or records when a customer changes name or city, and a dimension key in column Customer_key as an artificial auto-generated unique sequence number or an identity column as a surrogate key for the pri­ma­ry key of the di­men­sion. Customer_key values are unique for each row even though the same customer is standing in several rows e.g. CustomerId 421 because the customer has changed her name and city several times. Column CustomerId is a business key and is the »glue« that holds the multiple rows or records together for a customer even though data change over time. Columns ValidFrom and ValidTo represents the span of time when a row was the »current truth« as a time window for a customer in a given state.

 

Customer_key   CustomerId   CustomerName      City     ValidFrom    ValidTo
1              176          Hans Andersen     London   1900-01-01   9999-12-31
2              359          Joe Watson        Boston   1900-01-01   9999-12-31
3              421          Marie Beaulieur   Lyon     1900-01-01   2007-11-16
4              421          Marie Lesauvage   Paris    2007-11-16   2014-07-15
5              421          Marie Sainte      Nice     2014-07-15   9999-12-31

 

The Customer dimension type 2 does not have a column IsCurrent with two values (0 for historical and 1 for current) to mark each data row; instead the column ValidTo with value 9999-12-31 represents the most recent, current values.

 

Sales fact with five dimension key columns for five dimensions where Date_key repre­sents the date of purchase, and with columns for two measures or metrics.

(Other names Fact_Sales or FactSales or fact.Sales).

 

Date_key     Customer_key   Product_key   Employee_key   Store_key   Quantity   Amount
2014-04-23   2              893           2              13          1          89
2014-04-23   2              1408          1              13          1          37
2014-04-24   4              1408          3              16          2          74
2014-04-25   4              504           3              16          1          60
2014-05-08   1              893           1              13          3          267
2014-08-16   5              1408          3              16          1          37

 

Since I keep history of customer names and cities in the Customer dimension as a type 2, the Sales fact tells us the name of a customer and where the customer lived at the date of purchase, when I join a fact row's Customer_key value to the Customer dimension's Customer_key value for further information about the customer.

 

For example, the Sales fact row with Date_key value 2014-04-25 has Customer_key value 4, which I look up in the Customer dimension to see that it is the customer Marie, who at that time had the last name Lesauvage and was living in Paris when she purchased the product Blou jeans from the employee Michael Suyama in the store Boutique at Rue Cler 52, Paris.

 

I call it registered values at the date of purchase, and the data is the fact back then, because the Customer dimension is a type 2 that retains data as they were at the transaction date of a sales event back in time. (In danish »registreret oprindelige værdi«).

 

For example, the Sales fact row with Date_key value 2014-08-16 has Customer_key value 5, which I look up in the Customer dimension to see that it is the customer Marie, who at that time had the last name Sainte and was living in Nice when she purchased the product Passion no 9 from the employee Michael Suyama in the store Boutique at Rue Cler 52, Paris.

 

When I want to write a letter to customer Marie, I will use her current values last name Sainte and that she lives in Nice. Customer dimension rows with ValidTo value 9999-12-31 represents the most recent, current values at present time for a type 2 dimen­sion. (In danish aktuelle, nuværende, gældende, seneste værdi).

 

I recommend using SQL views upon a type 2 dimension for a current view and a registered view, e.g. on top of the Customer dimension type 2 table in join with the Sales fact table. Dimension key columns in a fact table, e.g. Customer_key and Product_key, do not tell whether the dimension is a type 1 or a type 2, but when I join from a fact table dimension key, e.g. Customer_key, to a dimension view called either DimCustomer_Current or DimCustomer_Registered, I choose which dimension values I want to fetch and see.

 

·  When I want to write a letter to a customer, I will use the current values, e.g. Address from the current view DimCustomer_Current.

 

·  When I want to make a statistic of where the customers lived at date of purchase, I will use the registered values, e.g. City as a derived column of Address from the registered view DimCustomer_Registered.

 

In the above Entity Relationship Diagram the Customer dimension is type 2 and the other dimensions are type 1, and I recommend having a SQL view for each dimension, e.g. DimDate, DimEmployee_Current, DimProduct_Current, DimStore_Current, as sketched below.
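A minimal sketch of the registered and current views upon the Customer dimension type 2 table; the table name dbo.DimCustomer is an assumption for illustration:

CREATE VIEW dbo.DimCustomer_Registered AS
-- the fact row's Customer_key already points at the version that was valid at the date of purchase
SELECT Customer_key, CustomerId, CustomerName, City, ValidFrom, ValidTo
FROM dbo.DimCustomer;
GO
CREATE VIEW dbo.DimCustomer_Current AS
-- map every historical Customer_key to the current values of the same CustomerId
SELECT d.Customer_key,
       c.CustomerId, c.CustomerName, c.City
FROM dbo.DimCustomer d
JOIN dbo.DimCustomer c ON c.CustomerId = d.CustomerId
                      AND c.ValidTo = '9999-12-31';

Joining FactSales.Customer_key to DimCustomer_Current then always returns Marie Sainte in Nice, while DimCustomer_Registered returns the name and city she had at the date of purchase.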

 

I recommend a fact view FactSales, therefore a business user will never access a table in a database, only access data through views in a database.

 

Read more examples of type 1, type 2 and type 7 with most recent current values in section 6.2.

 

Improvement expansion

The calculation of Amount in the Sales fact uses the Price from the Product dimension, and in case of an offer price it could be a good idea to include Price, Discount and OfferPrice in the Sales fact, because the Product dimension as a type 1 will overwrite changes of prices and only show the current price of a product.

 

Shipments Star Schema from Microsoft Polaris data mart database

 

 

Constellation schema or Multi-star schema has multiple facts and shared dimensions

 

 

From a real implementation by Joakim Dalby in 2021 based on 65000 devices/end-points/clients and 34000 software titles (ARP string for Add Remove Programs) at hospitals in the Copenhagen area of Denmark, based on four source systems: AD, SCCM/MECM, ServiceNow and Software Central. Dimensions are type 1 with current values where old data is overwritten, and deleted data in a source system will be marked with the value »true« in the dimension column IsDeleted_meta as metadata. Facts are based on »capture a relationship in the fact«, e.g. fact_Client, and based on derived facts and snapshot facts. Read more in section 3.1.

 

Snowflake schema   (I'm not a fan of the term)

A dimension's hierarchy levels are split into several normalized dimension tables, so that not all dimension tables are connected to the fact table. When a hierarchy is divided into several dimension tables, they can be individually linked to different fact tables, same as a Constellation schema or Multi-star schema with multiple facts sharing dimensions.

 

Naming rules

There are many naming rules, e.g. dimension tables are singular and fact tables are plural to differentiate between dimension and fact.

Adding a role to the name of an object by a prefix, e.g. DimCustomer or Dim_Customer and Fact_Sales or FactSales, or by a database schema, e.g. dim.Customer and fact.Sales.

Sale is the selling of goods or services, or a discount on the price.

Sales is a term used to describe the activities that lead to the selling of goods or services. Goods are tangible products as items you buy, such as food, clothing, toys, furniture, and toothpaste. Services are intangible as actions such as haircuts, medical check-ups, mail delivery, car repair, and teaching. A sales organization can be broken up into different sales teams based on regions or type of product.

 

1.3.6. Data warehouse architecture and ETL modeling

Three business intelligence enterprise data warehouse modeling architectures.

·  Kimball said: »A data warehouse is a union of all its data marts.« illustrated by the dashed line as a Data mart area (DMA).

·  Inmon has an Enterprise Data Warehouse (EDW) illustrated by Data warehouse.

·  Linstedt has a Raw data vault area and Business data vault area.

 

 

Sources represent either Landing zone area or Input data area or Archive area.

It is common for all models to have a Data staging area (DSA).

 

Enterprise Data Warehouse (EDW) is sometimes a term for the whole solution and the different models each use their own terms e.g. Data warehouse or Business data vault. EDW is a common database layer or area before the Data mart area (DMA) that is illustrated as several databases each with their own specific purpose and tailored dimensions and facts that are business oriented.

 

I like to have another data staging area between EDW and DMA (not illustrated) for the ETL process, and it can be placed inside a DMA database as a local data mart staging area (DMSA). Sometimes a data mart is implemented as SQL views which refer to an EDW database as a virtual data mart. Illustration of a global DMSA

 

A data mart area (DMA) has a Presentation interface (PIA) in front; in the illustration above it is called BI-Apps.

 

Sometimes business users want sql query access to dimensions and facts but they don’t like to join them together, therefore I implement an Analysis data area (ADA) (or Advanced data analysis or Automated data analysis) with either sql views that refer to DMA, where a view has already joined a fact and its associated dimensions together including the relevant columns, or with each view materialized into a table as a One Big Table (OBT) that an ETL process keeps up to date, as sketched below. Of course I have to make decisions between using the latest (current) dimension values, the registered dimension values from when the fact data happened, occurred, was registered or took place with its transaction date or was entered into the fact, or point in time values. [Analytical Base Table (ABT)].
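A minimal sketch of a One Big Table view in an Analysis data area, assuming the star schema shown earlier; the ada schema and the table and column names are assumptions for illustration:

CREATE VIEW ada.SalesOneBigTable AS
SELECT f.Date_key,
       d.Year, d.Quarter, d.Month,
       c.CustomerName, c.City,
       p.Label, p.Brand, p.Category,
       e.Name AS EmployeeName,
       s.StoreName,
       f.Quantity, f.Amount
FROM fact.Sales   AS f
JOIN dim.Date     AS d ON d.Date_key     = f.Date_key
JOIN dim.Customer AS c ON c.Customer_key = f.Customer_key
JOIN dim.Product  AS p ON p.Product_key  = f.Product_key
JOIN dim.Employee AS e ON e.Employee_key = f.Employee_key
JOIN dim.Store    AS s ON s.Store_key    = f.Store_key;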

 

I have used the general word data mart for a database with tables of dimensions and facts, but there is a more narrow word Multidimensional database (MDB) that is the source database for a OnLine Analytical Processing (OLAP) appli­cati­on like a cube also known as a Multidimensional online analytical processing (MOLAP) application. For me, it is a principle for a cube that all dimensions and all measures can be combined freely else divide data into multiple cubes. Same principle for a Tabular model.

SQL (Structured Query Language) is a query language for database model.

MDX (Multidimensional Expressions) is a query language for OLAP model.

DAX (Data Analysis Expressions) is a query language for Tabular model.

 

Read more about differences of opinion.

 

ETL process stands for Extract, Transform, Load, see a model, and ETL exists between the data areas (data layers) of a data warehouse solution, e.g.:

 

 

From each data area (data layer) there will be extract criteria to limit data, and there will be transform to make data conform and calculate new data like KPI, and data will be load into the next area. It is a common way of handling data in a multitier archi­tecture or layered architecture (in danish lagdelt arkitektur; extract for tilvejebringe data ved at udtrække, uddrage eller udlæse data; transform for at trans­formere, bearbejde, behandle, beregne eller berige data; load for at levere data eller indlæse data til det næste område).

 

1.3.7. Data mart modeling

A Data mart area (DMA) will consist of several data mart databases. Data is fetched from a Data staging area (DSA) or an Enterprise Data Warehouse (EDW) through an ETL process which may contain a local data mart staging area (DMSA). Sometimes a data mart is implemented as SQL views which refer to an EDW database as a virtual data mart.

 

Each data mart database has a specific purpose area data for tailored support and can be characterized as one of the categories below:

 

Common mart

With loaded tables of data from DSA or EDW.

With common dimensions to be reused in the other data marts. Examples of dimensions are Date, Time, Employee, Organization, Product, Retailer and Trans­action­Type. Can also include role-playing dimension e.g. Manager based upon Employee and different date dimensions with unique names of co­lumns.

With common facts to be reused in the other data marts. Examples of facts are Sales, Inventory, Marketing, Finance, Employment and Trans­action­.

Contains star schema/constellation schema and other modeling.

 

Subject mart

With loaded tables of data from DSA or EDW.

For example, Sales mart, Customer mart, CRM mart, Churn prediction mart, Market mart, Production mart, Inventory mart, Shipment mart, HR mart, Tax mart, Credit risk mart, Fraud detection mart, ESG Environmental, Social, and Governance mart.

Data can be a subset of common mart and data can be further enriched.

Contains star schema/constellation schema and other modeling.

 

Analytical mart

With loaded tables of data from DSA or EDW.

With views upon common mart, subject mart or EDW to fetch relevant data.

Making conformed columns for star schema/constellation schema to provide data to OLAP cubes or Tabular cubes or to dashboard visua­liza­tion tools like Ta­bleau, QlikView/Qlik Sense and Power BI. The data mart can be used for analytics by business users and data scientist to do ad hoc sql query.

 

Reporting mart

With views upon common mart, subject mart or EDW to fetch relevant data.

Making conformed columns for star schema/constellation schema to provide data to reports on paper, in pdf files and other display formats.

Stored procedures with criteria parameters as inputs, for example FromDate, ToDate, Customer, Branche, Product, SalesDevice and TransactionType, will fetch the relevant data from the views and present data to a reporting tool used by business users like Reporting Services (SSRS) or Microsoft Access with a pass-through query to call a parameterized stored procedure, as sketched after this list.

With tables for a Regulatory requirement return report (in danish indberetning) for financial reporting, because it is desirable to be able to recreate an old report both as it actually looked at the time of creation and as it should have looked given corrections made to the data after its creation. All tables have a VersionId to be used for sql joins of tables to ensure all data rows belong to the same accounting.
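A minimal sketch of such a parameterized stored procedure; the report schema, the dimensional views and the parameter set are assumptions for illustration:

CREATE PROCEDURE report.SalesReport
    @FromDate date,
    @ToDate   date,
    @Customer int = NULL          -- NULL means all customers
AS
BEGIN
    SET NOCOUNT ON;
    SELECT c.CustomerName, p.Label, f.Date_key, f.Quantity, f.Amount
    FROM fact.Sales   AS f
    JOIN dim.Customer AS c ON c.Customer_key = f.Customer_key
    JOIN dim.Product  AS p ON p.Product_key  = f.Product_key
    WHERE f.Date_key BETWEEN @FromDate AND @ToDate
      AND (@Customer IS NULL OR c.CustomerId = @Customer);
END;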

 

Delivery mart

With views upon common mart, subject mart or EDW to fetch relevant data to provide data and to create data in a file format that is customized to the individual data recipient as another IT system internally or externally.

 

Discovery mart or Exploration mart or Self service mart

With tables of raw data from source systems, or with views upon an archive area, upon EDW or upon the data mart area, to give some business users the ability to do data discovery by sql statements. It is a sandbox where users also have write access and can create their own tables for staging data, mapping data, search criteria data etc., and for the final result data set to be exported to an Excel sheet or to be connected to a Power BI report and so on. The purpose is to prepare a good business requirements specification and a user story for extensions to the data warehouse solution.

 

Data mart area (DMA) has several databases for each mart and sometimes there is no need to store data in a mart database instead the mart database contains virtual star schema non-materialized views that look like dimensions and facts in a virtual data mart.

 

Remember to set a process for user management, permissions and authentication etc. to manage and monitor and reduce the risk of cross contamination of data to ensure that a business user only has access to view relevant data for analyzes and re­ports.

 

1.4. Data reconciliation

One part of a data warehousing project is to provide compliance, accountability, and auditability. After data capture, please remember to implement a Reconciliation Summary Report with the results from your recon e.g. Uncategorized assets, Invalid country codes, Derivatives transactions with missing currency code.

 

Audit trail (in danish kontrolspor) becomes important for the credibility of a data warehouse. An example: a source system has a value 0.003589 and exports it to a txt file where the value becomes 3.589E-3 in scientific notation, and by a mistake in the ETL process the data warehouse saved and showed the value as 3.589. A contract number 700002848572 becomes 7.00003E+11 and the last part of the value got lost. When reconciliation is built into the data model and the ETL process, this mistake would be reported and a programmer can fix the import and update his data profiling documentation.

 

A classic reconciliation is to weigh the truck before it leaves and weigh it again when it arrives at the destination to make sure that no load has been lost on the ride. Do reconciliation between the data warehouse and the source systems with reconciliation of row counts and sums of values, mark data as reconciled and produce an auditing report (to reconcile in danish at afstemme, stemmer overens). Log the RowCount of target rows before and after a process to measure the change of rows the process does (TabelRowCountBeginAt and TabelRowCountEndAt), the RowCount of source data rows and extracted rows based on join and criteria, and the RowCount of target deleted rows, updated rows and inserted/loaded rows successfully or rejected rows with a missing value, an out of bounds amount or other kinds of invalid data depending on the validation check of data quality etc. For a sum of values you can reverse the sign in every other row by multiplying each value with -1 to avoid the sum becoming larger than the data type allows, e.g. in sql:

 

;WITH InvoiceItemSummarized AS
(
 SELECT InvoiceItemAmount,
        Sign = IIF(ROW_NUMBER() OVER(ORDER BY InvoiceItemId) % 2 = 0, 1, -1)
 FROM InvoiceItem
)
SELECT SUM(InvoiceItemAmount * Sign) AS InvoiceItemAmountSummarized
FROM InvoiceItemSummarized

 

How else can we say we are compliant and ensure compliance without data governance, data management, data reconciliation, data quality and data lineage mixed with documentation, in a bank that has to meet Basel I, Basel II, Basel III, Basel IV, EDA, HIPAA, Sarbanes-Oxley and BCBS239 (The Basel Committee on Banking Supervision 239) together with a regulatory requirements return report?

BCBS239 paper and Explanation and elaboration of Data Governance for BCBS239

to be compliant for e.g. the Danish Financial Supervisory Authority (Danish FSA) (in danish Finanstilsynet).

 

1.5. Data quality

The purpose of data quality is to ensure that business users trust and have confidence in data, to achieve reliability and relevance. It means defining and implementing processes to measure, monitor and report on data quality, with measurement results of data quality performed on a data set and thus for judging a data set, and hopefully to improve the quality of data. The specific characteristics or dimensions of data that are analyzed in a data quality program differ from one business to another based on the needs and priorities of that business. We must be able to rely on the accuracy of the data on which we base decisions (in danish »Vi skal kunne stole på korrektheden af vores data, som vi baserer beslutninger på.«) Doing data quality increases believability, credibility and reliability (in danish troværdighed, pålidelighed).

 

Data quality skills and data cleansing support a better data discipline and help avoid violations of data quality rules. The following dimensions are commonly used as data controls with the goal of ensuring data quality, and one means is conformance or conformity to make data from multiple source systems conformed through data conformation as a part of the ETL process. The term data quality dimension refers to measurement: for physical objects we call weight, height, length and width dimensions, i.e. what should be measured and reported on for a data set. A data quality dimension defines a set of attributes and characteristics that represent an aspect of the data quality.

 

The following twelve data quality dimensions are significant in relation to assessing the quality of data and to declaring a data set with dimensions for data quality. They can be used to cover the need to be able to document that the content of data in a data warehouse is correct:

 

Timeliness and Availability (aktualitet og tilgængelighed)

Where data is available and accessible when needed by the business. Timeliness is a measure of the time between when data is expected and when it is made available. E.g. a business must report its quarterly results by a certain date, the delay between a change of a real world state and the resulting modification of the data warehouse must be small, and a customer contact information is verified at least once a year and it is indicated with a verification date that data is up-to-date. Another data quality rule can be that max 80% of the active companies must have annual accounts that are not older than 15 months. Timeliness depends on the update frequency of the data set in relation to how often the data changes. The data set is updated real-time, daily, weekly, monthly, quarterly, semi-annually or yearly. Currency measures how quickly data reflects the real-world concept that it represents.

Timeliness is also about how frequently data is likely to change and for what reasons. Mapping and code data remains current for a long period. Master data needs to be up-to-date in real-time. Volatile data remains current for a short period, e.g. the latest transaction or the number of items in stock can be shown with an as-of time to tell business users that there is a risk that data has changed since it was saved/recorded. Typically during a day data will be changed several times, and in the evening and night data will remain unchanged. Latency tells the time between when data was created or changed and when it was made available for use by the business users. Kimball talks about how low latency data delivery can be very valuable for a full spectrum of data quality checks (Kimball page 261, in danish reaktionstid; too fast fact data delivery can cause inferred members, read more in section 4.5).

 

Completeness (komplethed, fuldstændighed)

Where a data row has the values that are required, to avoid missing values, e.g. an order requires a customer number. Therefore data quality must check if some columns are empty (null, empty string, 0 not expected). The ETL process must be robust and must not cause a system crash and system error at an unexpected null value; instead the data row must be sent to a Wrong table which the operation monitors and reports on. When adding new data rows or updating or deleting rows, it can be checked whether the number of rows corresponds to the expected number of rows, e.g. between 100 and 130 new orders per day. The number of new daily transactions does not decrease by 10% compared to the average number of transactions in the last 30 days (or does not exceed it by 10%). Keep an eye on the amount of data, so the quantity or volume of available data is appropriate.

RowCount of source data rows and extracted rows based on join and criteria. RowCount of target deleted rows, updated rows and inserted rows/loaded rows successfully or rejected rows with missing value or out of bounds amount or other kind of invalid data depends of the validation check of data quality etc. Read more in section 1.4 Data reconciliation.

Completeness is also about a data row where all the necessary data is present, e.g. an address must include house number, street, zip code and city, and an order must include product, price and quantity. Incomplete data, e.g. missing one hundred thousand data rows compared to the last payload, an address that does not include a zip code or an email address that is missing the domain (gmail.com), makes the data not usable. A data quality goal could be 98% usable addresses, and the remaining rows must be marked to be taken care of afterwards. Over-completeness means that the set of rules is too restrictive and value combinations that are actually valid are unjustly excluded.

 

Uniqueness (entydighed)

Where data identifies one and only one entry. Unique data means that there is only one instance of a specific value appearing in a data set, so it is free from data duplication, e.g. a Social Security number (SSN) to ensure that each person has a unique Id, therefore duplicate SSN values are not allowed within the data set. An unambiguous value has one meaning, and we avoid ambiguous values at the same level in a dimension or the need to combine two values into one unique value. Duplication where a customer has addresses spelled in different ways is not credible. What about Bob and Bbo, a mistyping, a typo or a misspelling? What about Daniel and Dan that may well be the same person? Investigate suspected duplicates, e.g. as sketched below.
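A minimal sketch of a uniqueness check; the Person table and the SSN column are assumptions for illustration:

SELECT SSN, COUNT(*) AS NumberOfRows
FROM Person
GROUP BY SSN
HAVING COUNT(*) > 1;    -- any returned row is a duplicated business key to investigate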

 

Validity (validitet, gyldighed)

Where data is valid within a range of values, e.g. the age of a person cannot contain a negative value and cannot be higher than 125 years. When an age is under 18, then the marital status must be not married. A validation rule in a time registration system could be that hours worked must be between 0 and 168 per month as a plausibility range. The average price of this month must not differ from last month’s price by more than 50%. A valid data value can also include a regular expression pattern when a column of text has to be validated, e.g. a phone number pattern: (999) 999–9999 where we want the hyphen to be stored in the column. A validation rule can be characterized as: in-column, cross-column, in-row, cross-rows or cross-data set. A cross-column validation is about certain conditions that span across multiple columns and must hold, e.g. a patient’s Date of Discharge from the hospital cannot be earlier than the Date of Admission, or the Delivery date of an order cannot be less than its Shipment date, as sketched below. Validation ensures that data is logical. Types of validation rules: equality, inequality, logical rules mixed with range checks, bounds fixed or bounds depending on entries in other columns. A Due date must be within an acceptable range and not ten years wrong, else we have invalid data that should be marked as rejected to be corrected. It is also important that a data row indicates the period in which it is valid.
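A minimal sketch of a cross-column validation; the PatientStay table and its columns are assumptions for illustration:

SELECT PatientId, DateOfAdmission, DateOfDischarge
FROM PatientStay
WHERE DateOfDischarge < DateOfAdmission;   -- rows violating the rule: discharge cannot be earlier than admission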

 

Conformity (overensstemmelse)

Where data is stored in a column with a business-wide-understood, user-friendly name and where data is in the required format and follows a standard data definition. E.g. an easy name of a column for an invoice amount that excludes VAT (pre-VAT) could be InvoiceAmount_ExclVAT and for the amount including VAT InvoiceAmount_InclVAT. E.g. formatting telephone numbers with or without country code, where a control has to make sure that all phone numbers have the proper number of digits and a format to ensure conformance. Format of a date as yyyy/mm/dd or dd-mm-yyyy: please choose to have only one format in the data warehouse. How to present a workload per day of 7½ hours, do you like 7:30 or 7.50?

 

Uniformity (ensartethed)

Where data is stored in a column with a data type and size and where a value has the right unit. For a data warehouse with data across domains from multiple source systems it is important to define and use the same data types and sizes. E.g. a decimal number in Europe uses a comma before the decimals (3,14) and the US uses a period (3.14). How many digits in an amount? E.g. data type decimal(19, 4) is a popular choice for e-commerce but Bitcoin uses decimal(24, 8). How large is the sum of an amount column over many rows going to be?

A data value is often represented with a specific unit, but seldom can the unit be seen in the name of a column or in an extra column with the unit as text. Unit examples: a weight value in pounds or in kilos, a length value in centimeters (cm) or in feet (ft), a distance value in kilometers (km), miles (mi) or light years (ly), an amount value has a currency like USD, EUR, DKK, and a temperature value was measured in Celsius ºC or Fahrenheit ºF.

 

Consistency (konsistent)

Where the same data from two or more source systems should not conflict with each other, e.g. when a product is discontinued, there should not be any sales of the product. A zip code is right and has a city name, or a data mapping like DK for Denmark and »Cph, Kbh, København« as Copenhagen is consistent for the capital of Denmark. Code values must follow different standards, e.g. ISO Currency Code USD, EUR, GBP, DKK, or LEI Legal Entity Identifier issuer code, or NACE codes and CIC codes. Improve values that are case-sensitive words through translate, mapping or bridging rules to create one and only one truth. With three data sets referring to the same time period, e.g. one with data for females, one for males and one for the total, the consistency between the results of the three data sets can be checked. Inconsistent data stands for conflicting data or when the same data is not the same across source systems. An address is the same with a look-alike, e.g. »Fifth Avenue« and »5th Ave«, or the name of a product has two kinds of spellings, e.g. Ketchup and Catsup. Sometimes a user types in extra data, e.g. in an address column there is added a direction »on the corner of Fifth and Main«. Hopefully a data cleansing procedure can catch it. Inconsistent data gives problems in any database and data warehouse.

 

Integrity (integritet, pålidelighed)

Where constraints are a not null column, a primary key and a foreign key referential integrity stating that all references are valid. Relationships between data in different systems are maintained with accuracy and consistency, so data relationships are valid and connected. Watch out for a wrong data type for a data value, like a too big amount or a too long string text. Allowed values like dates, weekdays and week numbers are constant data.

 

Accuracy (faktuelt, kendsgerning, nøjagtighed, korrekthed)

Where data represents what it is intending to represent in the real world, e.g. a Date of Birth (DOB) and age of a person is trusted, or an address (Mailing Address or Residence Address, because they are not the same for some people) is the correct one, by performing a sample measurement, or by checking and comparing an address with another data set known to be accurate, e.g. a National Address Database or openaddresses.io, to measure the degree to which data correctly describes the real world object or event. A new product name may not have the same name or a similar sounding name to a competitor's product. A new product claims to weigh a specific amount, and the weight is verified in an independent test. Inaccurate or incorrect data is poison for a data warehouse, therefore it is good to set up some rules to find inaccurate values and mark them to be taken care of. Semantic accuracy must ensure consistency between the content of the data recorded and actual conditions (content accuracy). Syntactic correctness must ensure that data complies with syntactic rules, e.g. spelling and formats such as a date or an xml structure (form accuracy, syntax and schema validation). A nurse entered the Date of Birth of all the patients in the format of dd/mm/yyyy instead of the required format of mm/dd/yyyy. The data passed the system validation check as the values are all within the legal range. However they are not accurate.

 

Reasonability or Outliers (rimelighed eller afvigelser)

Where data is compared to benchmark data or past instances of a similar data set, e.g. sales from the previous quarter, to prove that data is reasonable or that data has outliers (deviations), i.e. data points or data values that are completely off. It is a part of the reconciliation of data and of a data set.

 

Confidentiality and Security (fortrolighed og sikkerhed)

Where data will be maintained according to national and international standards for data, e.g. GDPR. Access to data is controlled and restricted appropriately to maintain its security, including how long data is retained.

 

Clarity and Usefulness (klarhed, anvendelighed, forståelighed, genbrugelighed)

Where to assess the extent to which data is understandable and can be read and used by others without any misunderstanding of the data and without difficulty. Presentation quality and usability are about data being readable, understandable, relevant, accessible, maintainable and credible. Availability of documentation and columns of metadata in a data set.

Reports must have accessibility and communicate information in a clear and concise man­ner. Re­ports should be easy to understand yet comprehensive enough to fa­ci­li­tate informed decision-making. Reports should include an appropriate balance be­tween data, analysis and interpretation, and qualitative explanations. Re­ports sho­uld include meaningful information tailored to the needs of the re­ci­pients.

 

Monitoring (tilsyn) the twelve data quality dimensions

Monitoring the quality of the data and reporting it to the data owner is part of being compliant. Being aware of the quality of your data can be termed data quality, but sometimes it is misunderstood that this will improve the actual quality of data; the thing it does is to improve the awareness. Data quality measurements are established on the basis of the importance and the use of data, respectively. Data should be monitored to ensure that it continues to meet requirements. A data quality indicator is called a metric, and several metrics can lie behind one data quality dimension. It must be possible to measure each data quality dimension with a limit value range for the data quality rule to indicate whether the measurement is:

  • Okay (green)
  • Attention-demanding (yellow)
  • Action-demanding (red)

A data quality measurement should preferably be performed automatically, and data quality measurements should be performed after each ETL process. The results of the metrics of the data quality measurements must be stored and documented so that they can be exhibited to the data owner and data users, especially which measurements exceed the limit values. The measurements must be able to be included in an overall reporting and follow-up on data quality. A sketch of one metric with a traffic-light judgement follows below.

Do alert the data owner and data users by email. Auditing is a kind of inspection (in danish eftersyn).
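A minimal sketch of one metric behind the Completeness dimension with a traffic-light judgement; the dim.Customer table, the ZipCode column and the limit values 1% and 5% are assumptions for illustration:

;WITH Metric AS
(
 SELECT MissingZipCodePct =
        100.0 * SUM(CASE WHEN ZipCode IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0)
 FROM dim.Customer
)
SELECT MissingZipCodePct,
       Judgement = CASE WHEN MissingZipCodePct <= 1 THEN 'Okay (green)'
                        WHEN MissingZipCodePct <= 5 THEN 'Attention-demanding (yellow)'
                        ELSE 'Action-demanding (red)'
                   END
FROM Metric;

The result of the metric is stored with a measurement date so the data owner can follow the development over time.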

 

Remarks

  • Correction like changing a wrong year of a date or casting to the right data type.
  • For fact tables one truth depends on the use, because a fact for Finance and a fact for Marketing may look the same but their business requirements specifications have small differences.
  • Cross-contamination in the ETL processes by which bad data is uninten­tio­nally transferred from a source system to the data warehouse with harmful effect on data quality.
  • Data profiling is a classic discipline to ensure a data understanding and to identify the current data quality and maybe to remove data that isn’t accurate and/or isn’t relevant, e.g. duplicates based on a rule of unique data.
  • For a data warehouse, tasks such as cleansing and scrubbing are ways to improve the data quality by removing incorrect, incomplete and duplicated data; the unwanted data is discarded.
  • Other data quality dimensions could be about precision, availability and credibility. Saying that you live on the earth is accurate, but not precise. Where on the earth? Saying that you live at a particular address is more precise. Read about twenty data quality dimensions from 1991
  • Quality assurance is a nice side effect to over time to improve other systems.
  • Data is fact, not truth, please keep your intuition and common sense and not only obey what the data tells you.
  • In Danish: Digitaliseringsstyrelsens »Fælles sprog for datakvalitet« (common language for data quality) is a standard for data quality in joint public-sector projects.

 

There are two types of strategies for improving data quality:

  • Data-driven by modifying the data value directly.
  • Process-driven by redesigning the process that produces or modifies data in order to improve its quality.

Process-driven performs better than Data-driven in the long run, because it removes the root causes of the quality problems completely.

 

Vilfredo Pareto from Italy formulated the 80:20 rule or principle in 1906, built on observations of his such as that 80% of the wealth in Italy belonged to about 20% of the population. The same ratio (in danish forholdstal) is often seen in data quality: 20% of a data set generates 80% of the errors in data. Correspondingly, 20% of the goods generate 80% of the earnings, which means that the other 80% of the goods generate only 20% of the earnings.

 

1.6. Data lineage and Data provenance

Data lineage is the journey data takes from its creation through its transformations over time. Data lineage can tell us the source of data, where did the data come from and what happened to the data and what transformations took place, and where data moves to over time. It describes a certain data set's origin, movement, characteristics and quality, and what happens to it and where it moves over time. Data lineage (in danish afstamning, hvor stammer data fra eller hvor kommer data fra samt hvor anvendes data) gives trace­abi­lity (sporbarhed) and visibility while greatly simpli­fying the ability to trace errors back to the root cause in a data analytics process. Lineage is about achieving traceability of data to make sure that at any given data point one can easily find a way back to its origin.

 

Data lineage is an end-to-end mapping of upstream and downstream dependencies of the data from capture or ingestion to analytics, including which reports and dashboards rely on which data sources, and what specific transformations and modeling take place at every stage. Data lineage also includes how data is obtained by a transformation. Data lineage must ensure that there is traceability from where data is created to where data is subsequently used. Data lineage is both at table level and column level.

 

When a data warehouse is using data from other source systems, inhouse or external, data lineage can be implemented by tagging data rows with metadata in a column like RecordSource from Data vault to indicate where the data originated, as sketched below.
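A minimal sketch of tagging rows with a RecordSource during a load; the tables SatCustomer and stage.Customer and the source name are assumptions for illustration:

INSERT INTO SatCustomer (CustomerHashKey, LoadDate, RecordSource, CustomerName, City)
SELECT CustomerHashKey,
       SYSDATETIME(),
       'CRM.dbo.Customer',     -- data lineage: the source table the row originated from
       CustomerName,
       City
FROM stage.Customer;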

 

When data is staged and present in a target place in a data warehouse, data lineage is also important to inform a business user where data is coming from. If an organization does not know where data comes from and how it has changed as it has moved between systems, then the organization cannot prove that the data represents what they claim it represents.

 

In addition to documentation, data lineage is also used in connection with change management so that a data steward can determine which IT systems and solutions will be affected and how they will be affected.

 

Data provenance focuses on the origin of the data, aka the source system; it could be a screen in a host system where users type in data values after a phone conversation with a customer, or a web page that the customer fills in. It is the original data capture. Data provenance is responsible for providing a list of origins, including inputs, entities, systems, and processes related to specific data. It is always good to know the latest date of update and who made the update.

 

If you want to drive real value from your data, you must first understand where data is coming from, where data has been, how data is processed, how data is being used, and who is using data. That's what data lineage and data provenance are all about: creating transparency.

 

1.7. Documentation

My experience is that documentation of a data warehouse system takes at least π (pi) times as long as the development time. Tasks like writing explanations, reasons and arguments, setting up tables, drawings, figures and diagrams, showing data flow and data lineage and making a data dictionary, and last but not least proofreading. This is followed by review by colleagues and subsequent adjustments and additional writing. Documentation is an iterative process, and in a good agile way a programmer should program in the morning and document it in the afternoon.

There are many tools e.g. Erwin or Collibra or Sqldbm and Lineage tool to show a change in one system has effects in another system.

 

1.8. GDPR

For the EU GDPR General Data Protection Regulation (Persondataforordning) a business key like a Social Security Number needs to be anonymized after some years, e.g. by adding a calculated cross sum of the SSN and hashing it. It will be a nice scramble or encryption by a scrambling algorithm, as sketched below. To be sure of total anonymity only save Date of Birth and Gender of a person; no name, no address, no zip code and no city. A Master Data Management (MDM) database can contain the social security number, name and address together with the CustomerId that is used in emails to customers. Therefore we have a surrogate key CustomerInternalId that is used in all other operational systems and the data warehouse together with BirthDate and Gender, which is not personally identifiable or enforceable. A web service will provide all systems with the personal data from the MDM database. When a customer is going to be deleted, because it is his/her wish or because of the data limitation period, we only need to do anonymization (the way of deleting data) in the MDM database, and we can keep CustomerInternalId, BirthDate and Gender in the other systems to make sure statistics remain unchanged back in time. When a system is calling the web service, it will give back unknown for SSN, name and address when a customer no longer exists in the MDM database. If we don’t have an MDM, we must do anonymization in all data layers and data areas and in relevant source systems as well, and flag data with an IsAnonymized column. Personal Information (PI), Personally Identifying Information (PII), Sensitive Personal Information (SPI) is data relating to identifying a person, read more.
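A minimal sketch of an anonymization update with a salted hash; the database and table MDM.dbo.CustomerMaster, its columns and the customer id are hypothetical assumptions for illustration:

DECLARE @CustomerInternalId int = 12345;                        -- hypothetical customer to anonymize
DECLARE @Salt char(36) = CONVERT(char(36), NEWID());            -- random salt so the hash cannot be looked up afterwards

UPDATE MDM.dbo.CustomerMaster
SET SSN          = CONVERT(char(64), HASHBYTES('SHA2_256', CONCAT(SSN, @Salt)), 2),
    CustomerName = 'Anonymized',
    Address      = 'Anonymized',
    IsAnonymized = 1
WHERE CustomerInternalId = @CustomerInternalId
  AND IsAnonymized = 0;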

 

1.9. DataOps and Data governance and Data management

DataOps is how to organize your data and make it more trusted and secure with tools like:

  • Data governance - Catalog, protect and govern all data types, trace data lineage, and manage data lakes.
  • Master data management - Get a single, trusted, 360-degree view of data and enable business users to know their data.
  • Data preparation - Transform large amounts of raw data into quality, consumable information.
  • Data replication - Provide real-time change data capture and synchronization to make data available fast.
  • Data integration - Integrate, replicate and virtualize data in real-time to meet data access and delivery needs fast across multiple clouds.
  • Data quality - Cleanse and manage data while making it available across your entire organization.

 

Read more about DataOps, DataOps is not DevOps for data and DataOps principles.

The Future of Data Management.

 

The purpose of data governance is to provide tangible answers to how a company can determine and prioritize the financial benefits of data while mitigating the business risks of poor data. Data governance requires determining what data can be used in what scenarios – which requires determining exactly what acceptable data is: what is data, where is it collected and used, how accurate must it be, which rules must it follow, who is involved in various parts of data?

 

According to the DAMA-DMBOK: Data Management Body of Knowledge, 2nd edition: »Data governance is the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.« This definition focuses on authority and control over data assets. The important distinction between data governance and data management is that DG ensures that data is managed (oversight) and DM ensures the actual managing of data to achieve goals (execution); in short, DG ensures that data is managed and DM ensures that data is compliant.

 

According to Seiner: »Data governance is the formalization of behavior around the definition, production, and usage of data to manage risk and improve quality and usability of selected data.« This definition focuses on formalizing behavior and holding people accountable. [Robert S. Seiner. Non-Invasive Data Governance: The Path of Least Resistance and Greatest Success.]

 

If data management is the logistics of data, data governance is the strategy of data.

 

The traditional, centralized, data-first approach to data governance represents a top-down, defensive strategy focused on enterprise risk mitigation at the expense of the true needs of staff who work with data. The Data Governance Office develops policies in a silo and promulgates them to the organization, adding to the burden of obligations on data users throughout the enterprise who have little idea of how to meet the new and unexpected responsibilities.

 

The modern, agile, people-first approach to data governance focuses on providing support to people who work with data. People are empowered but are nonetheless expected to contribute to the repository of knowledge about the data and follow guidelines, rather than rigid, prescriptive procedures.

 

Governance can be divided into three areas of maintenance:

  • Business Governance maintaining the requirement, content, quality, and understanding of core business assets and their supporting data.
  • Compliance Governance maintaining the policies, procedures, classifications, assessments, and audit standards required for regulatory reporting.
  • Technical Governance maintaining the content and quality of extended data attributes and data sources in the enterprise data landscape.

 

Policies describe what to do and what not to do, and guide the way people make decisions. A policy established and carried out by the government goes through several stages from inception to conclusion.

Procedures describe how to accomplish and complete a task or process. Procedures are the operations that express policies.

Community of practice (CoP) is a group of people who share a concern or a pas­sion for something they do and learn how to do it better as they interact regularly.

 

Example of a model of organization where data governance fits as a decision ma­king layer between the executive area and the more operational data management.

 

A slightly wider term is Data Enablement, which enables business users to safely use data, where data enablement is fully aligned with the organization's strategic and organizational goals of data curation:

 

 

All You Need to Know About Data Curation   Big Data Curation

 

Data Mesh is an architectural approach that decentralises data ownership and management. The approach helps you scale, improve data quality, and increase agility in your data team.   Illustration of data mesh   Video about data mesh 

 

Data as a product ensures a better data-driven decision.

Data governance ensures data is managed.

Data management ensures data is compliant.

 

1.10. Big data

Some properties of big data:

  • Volume refers to the amount of data, e.g. from GB over TB to PB, as data at scale and transactions.
  • Velocity refers to the speed of change and data processing for the timeliness of data: data in motion from batch in periodic, frequent, nightly or daily loads, intra-day e.g. three times a day, nearly real-time with a little delay, or real-time instantaneously with streaming data and no delay, interactive while an event or a transaction is happening, occurred, registered or took place.
  • Variety refers to data in many forms and the number of types of data from relational database, image, photo, audio, video, multimedia, sensor, web logs from social network data, smartphone and unstructured text.
  • Veracity refers to data uncertainty due to data inconsistency and incompleteness by managing the reliability, trustworthiness, authenticity, origin, reputation, availability, accountability, quality and predictability of inherently imprecise data types.
  • Variability refers to an increase in the range of values typical of a large data set and non-standard data.
  • Value refers to data of many values, which addresses the need for valuation of enterprise data, e.g. statistical, events, correlations, hypothetical, leverageable.
  • Visibility refers to data in the open and being able to handle the European General Data Protection Regulation (GDPR) and issues of privacy, security and provenance.
  • Visualization refers to the flexible viewing of data.
  • Viscosity, Validity, Venue, Vocabulary and Vagueness are other V-words.

 

A data lake can be used for massive quantities of unstructured data and big data with tools that can easily interface with them for analysis for business insights. A datum in a lake has tags (or labels) to give it a characteristic and to catalog the data in the data lake. By a tag we can fetch data from the lake without knowing the physical location like a server url with a folder path. A data lake can contain files on multiple servers on premise in different folders and in the cloud (many nodes), and we only use a tag for finding, fetching, searching, exploring or discovering data.

 

For example, if I like to find photos of smiling employees in all albums, I can search for a tag FacialExpression = smiling. A data lake is using ELT (extract, load, and then transform). A tool for a data lake can be like Apache Hadoop or Microsoft Azure. A Data Discovery Area is a sandbox for business users. You can use U-SQL to dive into the data lake, fetch the wanted data and do the wanted transformations.

 

Read about a data lakehouse. SQL on-demand is a query service over the data in your data lake. Azure Synapse Analytics formerly known as SQL DW.

 

A relational database is characterized by the acronym ACID:

  • Atomicity where all parts and steps of a transaction must succeed, or if any step fails then the whole transaction is rolled back. Therefore a transaction cannot be left in a half-done state. For example, in an application a transfer of funds from one account to another, the atomicity property ensures that, if a debit is made successfully from one account, the corresponding credit is made to the other account.
  • Consistency where all committed data must be consistent with all data rules like constraints, triggers and validations. A database will move from one consistent state to another consistent state when a transaction commits successfully, or remain in the original consistent state when a transaction fails. For example, in an application a transfer of funds from one account to another, the consistency property ensures that the total value of funds in both the accounts is the same at the start and end of each transaction. It means a user should never see data changes mid-transaction.
  • Isolation where every transaction should operate as if it is the only transaction in the database, where no transaction can disrupt or interfere with another concurrent transaction. Transactions that run concurrently appear to be serialized. For example, in an application a transfer of funds from one account to another, the isolation property ensures that another transaction sees the transferred funds in one account or the other, but not in both, nor in neither.
  • Durability where once a transaction is committed and data persisted, the updated data rows are available for all other transactions, and data will survive system failures and can be recovered. For example, in an application a transfer of funds from one account to another, the durability property ensures that the changes made to each account will be recorded permanently and not be reversed.

 

This means that a transaction is either carried out completely successfully or not at all (Atomic), that only valid data is added to the database (Consistent), that trans­actions never affect each other (Isolated), and that transactions are never lost (Durable). Read more.
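
A minimal sketch of such an atomic transfer, assuming a hypothetical Account table with columns AccountId and Balance (SQL Server syntax):

   -- Transfer 100 from account 1 to account 2 as one atomic unit of work.
   BEGIN TRY
       BEGIN TRANSACTION;
           UPDATE dbo.Account SET Balance = Balance - 100 WHERE AccountId = 1;  -- debit
           UPDATE dbo.Account SET Balance = Balance + 100 WHERE AccountId = 2;  -- credit
       COMMIT TRANSACTION;  -- both updates are kept together
   END TRY
   BEGIN CATCH
       IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;  -- on any error neither update is kept
   END CATCH;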

 

Lock granularity of database, table, extent, page or row is the simplest way to meet the ACID requirement. A page is an eight-kilobyte storage area and an extent is a group of eight pages. A deadlock is when one resource is waiting on the action of a second resource while the second resource is waiting on the first resource, so there is no way to finish, and the database system will kill one transaction so that the other can complete.

 

A NoSQL = Not Only SQL database means it can use a SQL-type query language, but usually does not do so. A NoSQL database is often designed to run on clusters, made as open source, and the database does not operate with a fixed schema structure but allows the addition of data without a pre-defined structure. A NoSQL database is characterized by BASE (Basic Availability, Soft state, Eventually consistent). An ACID system guarantees data consistency after each transaction; a BASE system guarantees data consistency within a reasonable period of time after each transaction. In other words, there is data consistency in the system, just not immediately. This leads on to the Soft State principle. If the data is not consistent at all times, the system must take a temporary data state into account. The sum of both these principles means that data accessibility is given very high priority, even if coincident errors occur in the database system, operating system or hardware. If parts of the database do not work, other parts of the database take over, so that data can always be accessed.

 

2. Grain (granularity) of data warehouse

The grain is the level of detail of data in the table, both in dimension tables by the hierarchy and in fact tables by the definition of the measurement event of interest, and the grain can later be expressed in terms of the dimensions. The lowest grain keeps all relevant data from source systems and is therefore the most flexible approach, but it also takes the most storage space and can easily cost query performance. To improve query performance the grain can be lifted up to a higher level where data is aggregated and summarized and therefore takes less storage space. Data can also be divided such that current year data is in daily grain (dately), the previous year in weekly grain and older data in monthly grain, because very detailed information is normally not relevant for analysis years back in time.

 

Granularity or grain of a fact table can be divided into four types:

 

  • Transaction grain as the finest level of detail with one row per transaction together with a transaction datetime like an entry date, post date, sales date or simply a created date, inserted time, registered date or happening date of the row. It is a base-level fact.
  • Periodic grain with one row per time period or time span like a day, week or a month for a monthly grain including summarized measurements.
  • Accumulating/Aggregated grain as a more summarized level, like one row for the entire lifetime of an event with a clear beginning and end.
  • Temporal snapshots by applying temporal table or system-versioned table theory where each row of data will have its own metadata timeline columns like ValidFrom and ValidTo to tell when the values in the row were valid. A temporal snapshot fact table allows to have the equivalent of daily snapshots without really having daily snapshots. It introduces the concept of time intervals into a fact table, allowing to save a lot of space and optimize performance while allowing a business user to have the logical equivalent of the »picture of the moment« they are interested in, source reference. Bitemporal modeling is good for financial reporting because it is often desirable to be able to recreate an old report both as it actually looked at the time of creation and as it should have looked given corrections made to the data after its creation. (In danish temporal er tidsmæssig data lagring). Description with examples of a temporal table   Picture. A minimal sketch of a system-versioned table follows this list.
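
As mentioned in the last item, here is a minimal sketch of a system-versioned (temporal) table in SQL Server, assuming a hypothetical account balance fact; the ValidFrom/ValidTo columns are maintained automatically by the database:

   CREATE TABLE dbo.FactAccountBalance
   (
       Account_key    int            NOT NULL CONSTRAINT PK_FactAccountBalance PRIMARY KEY,
       BalanceAmount  decimal(18, 2) NOT NULL,
       ValidFrom      datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
       ValidTo        datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
       PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
   )
   WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.FactAccountBalance_History));

   -- The »picture of the moment« for a given point in time:
   SELECT * FROM dbo.FactAccountBalance
   FOR SYSTEM_TIME AS OF '2023-06-30T23:59:59';

The table and column names are assumptions for the illustration.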

 

Going from yearly grain to monthly grain to weekly grain to daily grain, we say that each level, e.g. daily, increases the granularity and the number of rows in the fact table.

 

Aggregation is the process of calculating summary data from detail base-level table rows (records) and is a powerful tool for increasing query processing speed in data marts. For example, a sale is a fact with analysis variables and measures like quantity sold and amount, and dimensions like product, customer and date of purchase bring a sale into a context. The grain of the sale is limited to a date (like December 23) and a time (like 4:30 pm) and therefore the fact is on Transaction grain. In case we drop the time, the measures would be called TotalQuantitySold and TotalAmount because they are the result of a summation of sales times to a sales date and therefore the fact is on Periodic grain. The same applies if we decide to summarize the sales date to a weekly or monthly level. In case we decide to grain the customers by aggregating them into segments and don’t keep names and addresses, then the fact becomes Accumulating/Aggregated grain. A Product dimension has a three-level hierarchy of category name, brand name and product name, and therefore we can say that product has the highest level of detail. When a source system only gives a sales transaction with a brand name and no product name, a Product dimension must fit that grain of fact and take out the product name level so brand name becomes the lowest level that matches the fact grain.
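
A small sketch of lifting the grain, assuming a hypothetical transaction-grain table FactSales; a periodic (daily) aggregate could be built like this:

   -- Summarize transaction-grain sales to a daily periodic grain.
   SELECT
       Date_key,                              -- the time-of-day is dropped
       Product_key,
       Customer_key,
       SUM(QuantitySold) AS TotalQuantitySold,
       SUM(Amount)       AS TotalAmount
   INTO dbo.FactSalesDaily
   FROM dbo.FactSales
   GROUP BY Date_key, Product_key, Customer_key;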

 

Instead of summation of sales times or sales dates and losing some important information about the time of purchase, we can summarize data into two other fact tables: a weekly level for data from last year to this year, and a monthly level for older data, because when we go back in time, we don’t need to analyze on a daily or weekly level, and by aggregation we save hard disk space and improve the performance of the query because fewer rows in the month level need to be summarized to fetch the year level data.

 

A business key has the same granularity and the same semantic meaning across a business organization company.

 

Granularity is a description of a level of detail of data, e.g. the combination of customer and product is a fine level to tell me about sales, and when I add store to the combination the granularity is driven one level lower, and when I add salesperson to the combination the granularity is driven one level lower again and I know much more about sales, and when I add a date or better a datetime it will bring the granularity to its lowest level, or the highest level of detail, of a sale.

 

The lowest level of aggregation or the highest level of de­tail is referred as the grain of the fact table.

 

The grain of a fact is determined by either the combination of dimensions or the actual transaction level. For aggregate fact or summarized fact the grain is the intersection of its dimensions.

 

3. Fact

A fact table can contain fact data at a detailed or an aggregated level depending on the grain approach. A fact table can have five types of columns:

 

Dimension key that is a foreign key to a dimension table's dimension key (primary key) and provides the context of a measure with name, text and description. Use conformed dimensions.

 

Measure or analysis variable that contains a quantitative numeric fact as a number value from a business event, e.g. amount of sale, produced number of units or called minutes is fact data. All measures in a fact table must have the same grain like timely, dately, weekly or monthly level. Use conformed measures.

 

Primary key to uniquely identify fact rows so they are updatable if data is changed or deleted in a source system and to avoid duplicate rows. But it conflicts with the non-volatile rule! A true transactional fact accepts a counterpart row to handle changed, corrected or cancelled data in a source system, therefore a transaction identification can’t be a primary key on its own but can be in combination with a datetime or a sequence number as a composite key. Consider this for each fact and don’t worry when your fact table does not have a »natural« primary key. Sometimes dimensions can be involved in a composite primary key and the grain of the fact is important here. The Achilles heel of a composite primary key of dimensions is when several rows have missing or unknown members or values in their dimensions and the date or datetime dimension is the same in the rows; then we will have a violation. A primary key is important in case of linking to another fact table (drill across). The primary key constraint can be non-clustered to get better insertion performance where data is loaded in batch.

 

Tag contains a text, a date or a number that cannot be summarized, e.g. a voucher number, a receipt number or a sequence number. A tag column in a fact table is a candidate for a degenerate dimension or later a real dimension with extra descriptive data.

 

Technical for a surrogate identity column (as an artificial primary key), a unique sequence number instead of a composite key made of a subset of the columns in a fact table. The name of the column can be like FactSales_key. A surrogate unique clustered identity column is useable for a parent-child relationship link between fact rows. For example, an order row has a reference to a cancel row of the order and the cancel row has a parent reference to the previous order row, a link column pointing to the surrogate identity column. EntryDateTime or TransactionDateTime (in danish Transaktionstidspunkt) can be divided into two dimensions for Date and Time, and the Date column is used for partitioning of the table to a partitioned fact table. Dates for ValidFrom and ValidTo represent a timeline/period where the measures were valid, like a BalanceAmount. Maybe include a Replacement reference from the new fact row back to the soft deleted row. Indicator flags like IsDeleted and DeletedTime (ModifiedTime). For the data lineage a fact table will include metadata columns from the Archive area as RecordSource, ArcRecordId and ArcGlobalId for traceability back to the archive.
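
A minimal sketch of a fact table showing the five column types together; the table and column names are assumptions for the illustration:

   CREATE TABLE dbo.FactSales
   (
       FactSales_key    bigint IDENTITY(1,1) NOT NULL,   -- technical surrogate identity column
       Date_key         int            NOT NULL,         -- dimension key
       Customer_key     int            NOT NULL,         -- dimension key
       Product_key      int            NOT NULL,         -- dimension key
       QuantitySold     int            NOT NULL,         -- measure
       Amount           decimal(18, 2) NOT NULL,         -- measure
       ReceiptNumber    varchar(20)    NOT NULL,         -- tag / degenerate dimension
       RecordSource     varchar(50)    NOT NULL,         -- data lineage metadata
       CONSTRAINT PK_FactSales PRIMARY KEY NONCLUSTERED (FactSales_key)
   );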

 

The ETL process has to secure the key, so fact rows are distinct. Fact tables in a dimensional model express the many-to-many relationships between dimensions and are implemented as one-to-many relationships between dimension tables and fact tables. I never have foreign key constraints on a fact table because they decrease inserting performance, and I trust the ETL process and the date range lookup, and no human being will be doing an update or a delete of a dimension table or a fact table.

 

3.1. Types of facts

Let us characterize the various facts into different types of facts or more exactly different ty­pes of the columns in a fact table. Conforming facts means making agree­ments on common business metrics such as key performance indicators (KPI) across separa­ted source systems so that these numbers can be compared mathe­matically for calculating differences and ratios (Kimball page 446).

 

A dimension has a value or member as a row, or values or members as rows.

 

Fully-Additive measure - summable across any dimension

A fact table has numerical measures that can be summed up for all of the dimensions in the fact table, so the measure columns' data type is a number. A Sales fact is a good example of an additive fact with measures like Quantity sold and Amount. In case a transaction data set for a fact table refers to a measure column whose value is empty, null or nullable, use the default value 0 because this won’t bother aggregation like summation. Each measure must have its metrics.

 

When it is a monetary measure, it may have a currency column, and if it is a unit measure it may have a column to explain the kind of units used like centimeters, litres, cubic metres etc. A fact can have a calculated measure or a derived measure based on existing measures and constants e.g.:

   Profit or Surplus = Revenue – Costs.

 

Semi-Additive measure - summable across some dimensions

A fact table has measures that can be summed up for some of the dimensions in the fact table and not for other dimensions. For example, a daily balance measure can be summed up through the customers dimension but not through the date di­men­sion. Inventory levels cannot be summed across time periods.

 

Non-Additive measure - not summable for any dimension

A fact table has measures that cannot be summed up for any of the dimensions in the fact table. For example, a room temperature fact is non-additive and summing the temperature across different times of the day produces a totally non-meaningful number. However, if we do an average of several temperatures during the day, we can produce the average temperature for the day, which is a meaningful number. A sum of a measure called DaysForOrderToCompletion in a FactOrder is meaningless, but finding minimum, maximum and average values is meaningful for planning of production.

 

Other examples are a percentage and a ratio, which are non-additive measures. A fact table may only contain columns with transaction numbers such as an order number, an invoice number or a voucher number that can’t be summed up. An order fact with a measure like Unit price for a single product makes no sense to summarize, but the derived column Amount = Unit price x Quantity is to be summarized and becomes an additive column, called a calculated measure. Trend, Stock and Ranking can’t be added, and in general neither can any calculation on one specific intersection of the dimensions. A Year-to-Date ytd measure can’t be summed up. Count of rows is normally used. Unit price can be placed in a Product dimension as a current standard list unit price and still keep Unit price in the fact together with a Discount (%) when the purchase was happening, occurred, registered or took place. Back in 2016 I had a customer that used two kinds of discounts when a product had a bundle option and it was shown in an invoice line with the description »Discounted Bundle«:

   Unit price = List price - Unit discount amount.

   Line amount = Unit price x Quantity – (Unit price x Quantity x Line discount percentage / 100).

Some business users used the term List price, other users used Catalog price.

When there was a promotion going on, the unit price had a different calculation:

   Unit price = List price - Proration amount (also called deduction amount).

There was also a Line discount amount and an Order discount amount. I added all the columns into the Sales fact.
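
As a small sketch of how such derived measures could be computed during load, assuming hypothetical staging columns for list price, discounts and quantity:

   -- Derive Unit price and Line amount from the discount columns (hypothetical names).
   SELECT
       ListPrice - UnitDiscountAmount                       AS UnitPrice,
       (ListPrice - UnitDiscountAmount) * Quantity
         - (ListPrice - UnitDiscountAmount) * Quantity
           * LineDiscountPercentage / 100.0                 AS LineAmount
   FROM stage.SalesLine;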

 

Conformed fact

A conformed measure in multiple facts must use the same common business rule and definition so multiple facts can be united in a report or a cube. Several data marts can use fact data with the same names of fact tables or measure columns when they have compatible calculation methods and units of measure and support additivity across business processes. If a measure e.g. Revenue is used in multiple fact tables with different calculations and meanings, it is best to use different column names because these facts are not conformed. Figures in DKK '000 means in DKK thousand, where a revenue of 395,732 is 395,732,000 DKK.

 

Factless fact

A factless fact table contains no measures, no metrics and no numeric additive values as a normal fact table does; we have a measureless fact that records an event. For example, a fact table which has only columns for employee, date, time and event of work like »workstart«, »workstop« and »workstopofsickness« and no columns for a measure. You can get the number of employees working over a period by a »select count(distinct Employee_key)« or by a distinct row-count calculated measure in your OLAP cube. For a factless fact you will normally count the number of rows, row count or counting rows, and call it »Number of <a name>«.

 

Sometimes a factless fact has a value column called Count with only one value as 1, used in a data access tool to sum over and get the number of rows. In case the fact grain is weekly and a week is missing, it can be inserted to have all weeks complete and here Count gets the value 0.

 

A factless fact is for registration of an event or assignment, e.g. attendance taking place in a school class with dimensions for Student, Class, Room and Professor. If we add a measure column for Attendance with 1 or 0 per date per student, it is not a factless fact anymore.

 

Factless fact can be used to represent a many-to-many relationship among dimensions. Factless fact can have a ValidFrom and a ValidTo column to tell the timeline/period when the fact was valid.

 

Capture a relationship in the fact

To be a column in a dimension or to be its own dimension used in a fact is a good question. Kimball’s financial services example starts with an Account dimension including data of products and branches, but he chooses to remove these descriptive data columns to form independent dimensions of Product and Branch and use them in a fact together with the Account dimension. Therefore the fact captures a relationship among accounts, products and branches.

The account-product-branch relationship is a factual relationship, or a delegation factless (or coverage) fact table.

 

Another example is that an account can belong to two customers and a customer can have several accounts. This many-to-many relationship can be expressed in a factless fact or in a bridge table, see later. A bank transaction is done by one customer from an account and it is natural to have a Customer dimension in the fact.

 

Kimball said: »Demoting the correlations between dimensions into a fact table«, and I like to add: »By capturing a relationship in the fact table we also keep the registered relationship at the time the event or transaction occurred and was entered into the fact table«.

 

An example: a retailer SuperChemi belongs to a chain Ethane at the date of sale, e.g. 2013-05-17, and the fact table has two dimension columns for Retailer and Chain. In 2015 the retailer SuperChemi changes to another chain Propane, but we still keep the registered relationship back in 2013 and 2014 in the fact. When a chain is not a part of the fact table and we in year 2016 like to find sales of retailers for a specific chain e.g. Propane, we will use the current relationship between dimension Retailer and dimension Chain as a snowflake dimension, meaning that SuperChemi belongs to Propane, and when we summarize sales per year since 2010, chain Propane will include the sales of SuperChemi in 2013 even though the retailer belonged to Ethane at that time.
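
A small sketch of the two ways of querying, assuming hypothetical tables FactSales (with Retailer_key and Chain_key), DimRetailer and DimChain where DimRetailer carries the current Chain_key as a snowflake:

   -- Registered relationship: the chain as it was at the time of sale.
   SELECT c.ChainName, SUM(f.Amount) AS SalesAmount
   FROM dbo.FactSales f
   JOIN dbo.DimChain c ON c.Chain_key = f.Chain_key
   GROUP BY c.ChainName;

   -- Current relationship: the chain the retailer belongs to today (snowflake).
   SELECT c.ChainName, SUM(f.Amount) AS SalesAmount
   FROM dbo.FactSales f
   JOIN dbo.DimRetailer r ON r.Retailer_key = f.Retailer_key
   JOIN dbo.DimChain    c ON c.Chain_key    = r.Chain_key
   GROUP BY c.ChainName;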

 

Transactional fact or Transaction fact

A fact table that describes an event or operation that occurred at a point in time in a source system, e.g. an invoice line item. The row has a date, e.g. a transaction date, an entry date or a post date, or a datetime time stamp to represent the point in time with other lowest-level data. Key values for dimensions are found at the transaction datetime.

 

If data needs to be changed, corrected or cancelled in a source system, then a data warehouse needs to make a counterpart row in the fact with a new time stamp or a time span. Source data examples: booking a hotel room, ordering a car in a special color, buying a lottery ticket with my lucky numbers, receiving an invoice, putting and picking items in stock (in danish lægge varer på lager og plukke varer fra lager).

 

In a data warehouse modeling process I try to interpret some data as transactional data or transactions. Read more in section 6.4. Some data modelers do not like separate facts for each transaction type but build a single blended fact with a transaction type dimension, and a mix of other dimensions can create N/A dimension values.

 

I have seen a blended fact called FactEvent which I think is a poor and non-descript name of a fact table, and the date dimen­sion gets a ge­ne­ra­lized name. I prefer multiple facts where the date dimension is role-playing for e.g. order date, purchase date, shipping date and receipt date.

 

Snapshot fact versus Transactional fact

Snapshot fact represents a state, e.g. my bank balance right now is $500, and tomorrow it is $600 because I will deposit $100, therefore the latest date of balance contains the current balance. It is easy to see my balance ten days ago by looking it up at the SnapshotDate. It is like a picture from the past.

 

Transactional fact contains all my deposits and all my withdrawals with a date and an amount to record the »happening of an action«. At any point in time (normally today) I can calculate my balance, or the fact contains a running balance measure per transaction/per fact row.

 

Snapshot fact calculates the balance of my account at the end of each day because there can have been many deposits and withdrawals within a day. This snapshot fact can then be easily used to calculate the average daily balance for interest or fees. The balance in the snapshot fact is not additive, i.e. not able to be added/summed up into a meaningful metric; instead the balance is semi-additive, able to use aggregate functions like Avg, Min and Max.
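
A small sketch of using the semi-additive balance, assuming a hypothetical daily snapshot table FactAccountBalanceDaily with SnapshotDate, Account_key and BalanceAmount:

   -- Average, lowest and highest daily balance per account for one month.
   SELECT
       Account_key,
       AVG(BalanceAmount) AS AvgDailyBalance,
       MIN(BalanceAmount) AS MinDailyBalance,
       MAX(BalanceAmount) AS MaxDailyBalance
   FROM dbo.FactAccountBalanceDaily
   WHERE SnapshotDate BETWEEN '2024-06-01' AND '2024-06-30'
   GROUP BY Account_key;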

 

If a fact table contains sales order lines and we update a line with changed data, add a new line or delete an existing line, the fact is a snapshot.

A periodic snapshot is one that represents states over time where order lines are never changed.

If we store a new complete image of the order lines when it is changed, it is an accumulating snapshot or what I call a discrepancy snapshot fact, see later.

If we store the difference between the original order line and the new order line (i.e. the net change), it is a transactional fact.

 

Periodic snapshot fact

A fact table that describes the state of things at a particular instance of time or one point in time, and usually includes more semi-additive and non-additive measures. It is a table with data frozen in time, meaning a row will never be changed/modified/deleted (it is unchanged), because the row can have been used in a report like a monthly or annual report and later it is a must to be able to create the same report with the exact same data. The periodic snapshot fact table will never be emptied or updated, therefore it is a true delta load or incremental load, and the table has a snapshot column like a Day (yyyymmdd) for a daily basis, Week (yyyyww) for a weekly basis, Month (yyyymm) for a monthly basis or a Year for an annual basis to fetch the wanted data as a day slice, week slice, month slice or a year slice. When a day, a week, a month or a year is over and data is ready, data will be loaded.

 

Sometimes data for the current month is also loaded every day for a current-month-to-date, therefore the current month will be updated until it is over, finished and can be closed. Measures can be a balance in an account or the inventory level of products in stock and so on. Key values for dimensions are found at the end of the period.

 

A fact table can have daily, weekly, monthly or annual partitioning for making a faster query performance when searching for a specific day, week, month or year. In case there is no data for a specific month it will be nice to insert an artificial row with the »missing« value for the dimensions and 0 in the measures. (Periodic snapshot in danish »en skive data«, fastfrysning).

 

It is common that fact data needs a correction from a source system, which can cause a redelivery of a snapshot, e.g. a specific month that needs a reload in a periodic snapshot fact. When a fact table is implemented as a partitioned table, it is easy to truncate the related partition and load it again (Kimball page 517). An ETL process can use a staging table and perform a switch in to the fact table's related partition. A redelivery can contain a date of redelivery or a version number in case of several versions of the same delivery, sometimes called a generation. It is called incremental refresh when an ETL process does a dynamic replace or reload of a partition of a table by emptying the partition and loading data with a higher date of redelivery or higher version number into the partition, because it is refreshing an existing delivery in a snapshot.
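
A minimal sketch of such a partition reload in SQL Server, assuming a hypothetical fact table partitioned on the Month column where the redelivered month maps to partition number 42 and the staging table uses the same partition scheme (names and partition number are assumptions):

   -- Empty only the partition for the redelivered month (SQL Server 2016+ syntax).
   TRUNCATE TABLE dbo.FactLedgerMonthly WITH (PARTITIONS (42));

   -- Load the staged redelivery and switch it into the empty partition.
   ALTER TABLE stage.FactLedgerMonthly
       SWITCH PARTITION 42 TO dbo.FactLedgerMonthly PARTITION 42;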

 

Accumulating snapshot fact

A fact table that describes a process with milestones of multiple date and value columns with different names, which will be filled out gradually to reflect the completion of events in a lifecycle process, representing the entire history. Therefore over time the same fact table row will be revisited and updated multiple times, where a default date key value is -1 for »hasn’t happened yet« or »will be available later«.

 

Each stage of the lifecycle has its own columns e.g. milestones of a hos­pi­ta­lization or steps of the manufacturing of a product. In a retail store a product has three movements as ordered, received and sold that would be three date dimension columns in an accumulating snapshot fact.

 

A fact table where a row is a summary of measurement events occurring at predictable steps between the beginning and the end of a process. For example, a source system for current payments from customers where some pay several times over a month: the first payment becomes a new row in the fact table with the date of payment in columns BeginDate and EndDate and the amount in a Paid column. The next ETL process will do a summation of payments per customer from the BeginDate to the current date or end-of-month date, and then update the fact row with the same BeginDate with the new summarized payment and a new EndDate, so a fact row will be revisited and updated multiple times over the »life« of the process (hence the name accumulating snapshot).
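
A small sketch of that revisit-and-update step, assuming hypothetical tables FactCustomerPayment (the accumulating snapshot) and stage.Payment holding the period's payments:

   -- Re-summarize the period and update the existing accumulating row per customer.
   UPDATE f
   SET f.Paid    = s.PaidToDate,
       f.EndDate = s.LastPaymentDate
   FROM dbo.FactCustomerPayment AS f
   JOIN (
       SELECT Customer_key,
              MIN(PaymentDate) AS FirstPaymentDate,
              MAX(PaymentDate) AS LastPaymentDate,
              SUM(Amount)      AS PaidToDate
       FROM stage.Payment
       GROUP BY Customer_key
   ) AS s
     ON  s.Customer_key     = f.Customer_key
     AND s.FirstPaymentDate = f.BeginDate;  -- the accumulating row opened at the first payment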

 

Timespan accumulating snapshot fact or State oriented fact

A fact table where a fact is not a singular event in time but consists of multiple states or events that occur over time so each row represents the state of an object during a period of time in which this state didn’t change. It is similar to a type 2 dimension with snapshot start date, end date and current indicator/active flag where the row formerly-known-as-current will be revisited and updated. (Kimball Design Tip #145), see more at Time span fact.

 

Depending on the frequency and volume of changes, you may consider a »current only« version of the fact to improve performance of what will probably be a majority of the queries. Then you can use the fact to drive the loading of selected events in an accumulating snapshot fact.

 

Discrepancy snapshot fact

A snapshot describes the state of things at a particular time and when the state does change there is a discrepancy. It is one way to capture versions of the fact over time with no need to keep all the daily snapshots from periodic snapshot fact.

A fact table stores metrics and things that provide context for the metrics goes into dimension tables, therefore it is also possible to have a discrepancy snapshot di­men­sion with discrete textual values and non-additive numeric values that can’t be summarized, reminiscent of type 5 with a foreign key back to the main di­men­sion.

Discrepancy snapshot fact example.

 

Data entry action snapshot fact – a way of Change Data Capture CDC

A snapshot that keeps track of the state of things at a particular time as a logbook with three actions: Added for a new entry, Changed for an existing entry and Removed for a non-existing entry that has been deleted or marked deleted in a source system and is no longer a part of the full payload to the data warehouse. This snapshot approach has a dimension for data entry action with the three mentioned action values (could also have been: Influx, Changed, Departure).

The data entry action snapshot fact can be used to show the added/gain values from the latest ETL process or the removed/lost values and in combination with changed values from one state to another state.

Formula: current = old current + added/influx – removed/departures.

(in danish: igangværende = forrige igangværende + tilgang – afgang, for bestand,  nytilkomne, indgået og udgået, ikraft og udtrådt).

Data entry action snapshot example.

 

Derived fact or Additional fact

Kimball recommends fact data at the lowest detail grain possible to ensure maximum flexibility and extensibility; he calls it a base-level fact. A derived fact table is created for performing an advanced mathematical calculation and complex transformations on a fact table, like for a specific KPI (Key Performance Indicator), or an aggregate fact with a summation of measures to a higher grain, like from date level to month level and from product level to brand level, using shrunken dimensions for Month and Category as dimensions lifted to a higher grain from the base-level dimensions Date and Product. A derived fact can be based on multiple fact tables for making faster ad hoc query performance, simplifying queries for analysts and providing a data set to the Presentation interface area. Remember data lineage. Be aware that drill across and the fan trap, chasm trap and loop trap can occur when two fact tables are joined together, because there can be a fact table with no relation to another fact table except that each one contains a foreign key to a shared dimension table.

 

Aggregate fact or Summarized fact

A derived fact table that is created as a pre-calculated fact with computed summarized measures at a higher grain level of one or more dimensions to reduce storage space and query time greatly and eliminate incorrect queries. An aggregate fact is derived from a base-level fact and measures in an aggregate fact are a computed summary of measures in the base-level fact.

 

Dimensions in an aggregate fact can be derived from base-level dimensions and are called shrunken dimensions because the values are rolled up to create fewer fact rows, e.g. a Date dimension becomes a Month dimension, a Product dimension becomes a Category dimension and an Address dimension becomes a Region dimension, and therefore the measures can be summarized to fewer fact rows for better query performance.

 

Year-to-Date ytd fact where month February is a summing up or roll up of January and February and so forth. Last-year-this-year fact with calculation of an index compared to last year as a new column, easy to show in a report.

 

An aggregate fact table is a simple numeric roll up of atomic fact table data built solely to accelerate query performance. It is called incremental aggregation when an ETL process does a dynamic update of a table by applying only new or changed data without the need to empty the table and rebuild aggregates.

 

Consolidated fact

A fact table used to combine fact data from multiple source systems and multiple business processes together into a single consolidated fact table when they are expressed at the same grain. E.g. sales actuals can be consolidated with sales forecasts in a single fact table to make the task of analyzing actuals versus forecasts simple and fast, to compare how the year is going. A consolidated fact is a derived fact table that combines data from other facts.

 

Smashed fact

A fact table contains several measures but only one or a few of them have a value in each fact row. Fact rows with the same dimension member repeated in multiple contiguous rows with identical values will be smashed or collapsed into one fact row using an operation such as sum, min or max to limit the number of rows in the fact.

 

Time span fact or history fact or historical fact

A fact table used for a source system that is regularly updatable, meaning that the source system changes and overwrites its values. To capture a continuous time span when the fact row was effective, the fact table will act as an SCD type 2 dimension with BeginAt and EndAt columns to keep historical data and to represent the span of time when the fact row was the »current truth«, and with a query it is easy to fetch it at a given point in time. It is called Slowly Changing Facts. Read more in section 6.4.

 

Counterpart fact (negating fact) and Transactional fact

A fact table used for Slowly Changing Facts because a source system is changing fact values without keeping the old values as historical transactions. Read more in section 6.4.

 

Column wise fact and Row wise fact

A column wise pivoted fact table is useful to be columns in a report e.g.

   Revenue Jan, Revenue Feb, Cost Jan, Cost Feb, Sales Jan, Sales Feb.

For a cube a row wise is much better because it gives good dimensions e.g.

   Period, Entry, Amount.

Therefore a data warehouse needs to convert from columns to rows or vice versa.
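
A small sketch of converting a column wise layout to a row wise layout with UNPIVOT, assuming a hypothetical table with monthly columns (the table and column names are assumptions):

   -- Turn columns RevenueJan, RevenueFeb into rows of (Period, Amount).
   SELECT Entry = 'Revenue', Period, Amount
   FROM dbo.ReportColumnWise
   UNPIVOT (Amount FOR Period IN (RevenueJan, RevenueFeb)) AS u;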

 

Exploded fact

A fact table containing a huge number of rows where a period, e.g. from a »date of employment« to a »termination date« as columns in one row, is going to be turned around with one row per day of the period, or per week or per month depending on the wanted grain of the period. It can give a lot of fact rows; when an employee has a 10 year anniversary it will make more than 3650 rows at day grain. An Employee dimension type 2 keeps history of different names, departments and job functions etc.
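
A minimal sketch of exploding one period row into one row per day, assuming a hypothetical Employment table with EmploymentDate and TerminationDate and a Date dimension covering the calendar:

   -- One row per employee per day between employment and termination.
   SELECT e.Employee_key,
          d.Date_key
   FROM dbo.Employment AS e
   JOIN dbo.DimDate    AS d
     ON d.CalendarDate BETWEEN e.EmploymentDate
                           AND ISNULL(e.TerminationDate, CAST(GETDATE() AS date));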

 

For wind turbine power production the grain would be a 15 minute grain, which gives 96 rows per day per wind turbine.

 

When the exploded fact is a source for an OLAP cube it can sometimes be implemented as a view in the database, and when it is for ad hoc reporting and will be used several times per day, then it must be a materialized view stored in a table or sometimes an indexed view.

 

3.2. Other fact classifications

Transaction has one row per transaction when they occur together with a datetime.

Periodic has one row for a group of transactions made over a period of time through summation, like from daily grain to monthly grain, so the Date dimension is represented by the month level with the first day of each month. Notice that certain dimensions are not defined when compared to a transaction fact, such as dimension TransactionType.

Accumulating/Aggregated has one row for the entire lifetime of an event and is therefore constantly updated over time. For example, an application for a bank loan until it is accepted or rejected, or a customer or working relationship. These fact tables are typically used for short-lived processes and not constant event-based processes, such as bank transactions.

 

Mostly a fact table describes what has happened over a period of time and therefore contains additive or cumulative facts. An example of a status column is a fact table that receives data from a school system where a student follows a course and later finishes it, but sometimes a student skips the course and is deleted in the system. Before reloading the fact it can be a good idea to have a CourseStatus column with values like: Active, Completed or Dropped.

 

4. Dimension

A dimension table contains business element contexts and its columns contain element descriptions, and a dimension is referenced by multiple fact tables so the containing measurements make sense.

 

A dimension has a value or member as a row, or values or members as rows.

 

4.1. Purpose of a dimension

Some purposes as I see them:

  • Entry point for a business user to query and analysis data in a data mart.
  • Slice operation to fetch a specific data value from a dimension.
  • Dice operation to fetch specific values of multiple dimensions.
  • Slicing and dicing where the dimension values (members) are selected and become a filter, a zoom or a search criteria to do a filtering of the fact measure data and analysis variable after a wish from a business user (search filter criteria in danish filtrering, søgekriterie; slice skære i skiver; dice klemme ud eller se data som en terning eller som en del-kube).
  • Dimensions are placed in the row and column headers and form a cross tabulation or pivot for presentation and reporting of an analysis variable or measure value divided by dimension values.
  • Drill-down for presentation of dimension values in gradually greater detailing through a hierarchy or a group that is grouping values in intervals, banding.
  • Drill-up or roll up views dimension value at higher aggregation level and is summing up the fact measure.
  • Drill-through operation to fetch the detail level data from a summary data and to show the underlying data from a source system e.g. document number (in danish bilagsnummer).
  • Drill-across ties data marts and facts together through conformed dimension and different fact measures can be compiled.
  • Ranking or Sorting presents dimension values according to an analysis variable, e.g. customer segments in descending order by sales.
  • Top or bottom for presentation, e.g. top 5 products or 10 poorest customers on sale.
  • Contains relevant data columns for reporting, analysis and self-service BI.

 

Hierarchical with hierarchy levels

The dimension values can be placed in a hierarchy like a Location with three levels Country → Region → City. A dimension can have several separate and independent hierarchies with different numbers of levels.

 

Grouping with group or Band

The dimension values can be placed in a group that is grouping values in band intervals, like person ages in custom buckets in an Age group column with intervals of Child (0-9), Tween (10-12), Teenage (13-19), Young adult (20-29), Adult (30-66) and Senior citizen (67-130).
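
A small sketch of deriving such a band in the ETL, assuming a hypothetical Age column in a Person staging table:

   -- Band a person's age into the Age group intervals.
   SELECT PersonId,
          Age,
          CASE
              WHEN Age BETWEEN  0 AND   9 THEN 'Child (0-9)'
              WHEN Age BETWEEN 10 AND  12 THEN 'Tween (10-12)'
              WHEN Age BETWEEN 13 AND  19 THEN 'Teenage (13-19)'
              WHEN Age BETWEEN 20 AND  29 THEN 'Young adult (20-29)'
              WHEN Age BETWEEN 30 AND  66 THEN 'Adult (30-66)'
              WHEN Age BETWEEN 67 AND 130 THEN 'Senior citizen (67-130)'
              ELSE 'Unknown'
          END AS AgeGroup
   FROM stage.Person;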

 

A dimension normally contains one or multiple hierarchies and/or groups/bands to fulfill requirements from the business users.

For example, a Product dimension has two hierarchies (here called group) and one band to divide the prices:

  • Product → Product group
  • Product → Brand → Brand group
  • Product → Price level

 

A dimension can of course be non-hierarchical and non-grouping/non-band.

 

Different dimensionality covers the issue that not all combinations of multiple dimension values are allowed in a fact, and the data warehouse needs to make sure of the data quality.

 

For the data lineage a dimension table will include metadata columns from the Archive area as RecordSource, ArcRecordId and ArcGlobalId for tra­ce­abi­lity back to the archive.

 

4.2. Dimension column types

A dimension table has a minimum of three types of columns:

 

Dimension key is the primary key in a dimension table in a data mart area.

A dimension key is an artificial auto-generated unique sequence number or an iden­tity column as a surrogate key for the primary key of a dimension.

 

It is the dimension's own responsibility to generate the next value in the dimension key column. The value of a dimension key does not come from a key map table, to remove dependence on a source system.

 

Dimension key is a foreign key in a fact table which has several columns as dimen­sion keys to several dimension tables.

 

For a type 1 dimension a dimension key can be named Entity key EK or EKey because it re­presents an entity in source data, or Surrogate key with prefix like SK_ or a suffix like _SK, _SKey, SK, SID, ID, UID unique identification to join to a fact table to fetch and see the current values.

 

For a type 2 dimension a dimension key can be named History key HK or HKey or HSID be­cause it represents a history reference over time to join to a fact table to fetch and see the registered values.

 

For me »surrogate« is a characteristic or a property, and not a name for a column.

 

I prefer a suffix _key for a dimension key column in a dimension as a primary key for the dimension. An alternative suffix _dimkey or _dim_key.

 

For example, a Customer_key column in a Customer dimension which is related to a Customer_key column in a Sales fact as a foreign key.

 

A dimension key must not be data-bearing (in danish databærende), it must be meaningless, but for a date dimension and a time dimension I like to use a smart valued dimension key, e.g. a date 2013-12-31 as data type date or a value 20131231 as data type integer, and a time 08:30 am as an integer value 830 and 08:30 pm as an integer value 2030. With these dimension keys I can avoid a join from a fact table to a dimension table to fetch and see a date and a time.
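
A small sketch of computing such smart keys from a datetime, as an illustration:

   -- Derive an integer date key (yyyymmdd) and time key (hmm/hhmm) from a datetime.
   DECLARE @ts datetime2 = '2013-12-31 20:30:00';
   SELECT CAST(CONVERT(char(8), @ts, 112) AS int)           AS Date_key,  -- 20131231
          DATEPART(hour, @ts) * 100 + DATEPART(minute, @ts) AS Time_key;  -- 2030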

 

I have seen a dimension key store which generates a global number or RefId for each dimension key value across all the dimensions to achieve uniqueness.

 

I like to think a dimension table as an object that has its own methods or functions on data where each dimension object has its own responsibility to generate the next value for the dimension key _key and for the durable key _dkey, see example in section 6.3 for columns System_key and System_dkey.

 

Business key is from a source system where it can be a primary key or a seconda­ry key with unique values.

 

For example, Social Security number (SSN), StatusCode and CustomerNumber.

 

A business key value is either mutable (changeable) or immutable (unchangeable), and is meaningful for a human being where a business user prefers to use a business key value to identify an entity, as a search criteria lookup value given in a phone call to a company, a bank, a hospital or the government.

 

A business key has an embedded meaning and represents a unique object in the business. An alias is an Enterprise wide business key because the same value is used in multiple source systems. If a business key is immutable (unchangeable) in a source system then it is called a durable business key.

 

From a data warehouse point of view, a business key also represents a surrogate key from a source system where it often is a primary key as an artificial auto-gene­rated unique sequence number or an identity column (id, uid unique identification) and is immutable and meaningless for a human being.

 

A key map table can transform business key values either to an immutable surro­gate business key _sbkey e.g. a code which can be understood by a human, or to an artificial auto-generated unique sequence number or an identity column as a durable key _dkey for later use in a dimension.

 

A durable key is an integer representation of a business key and can be generated in a key map table and fetched by a business key.

 

Multiple source systems have a business key for a Social security number per person with a primary key as a surrogate sequence number, therefore we can’t use the primary keys to interconnect (join, drill-across) the systems' data; instead we use the business keys. The value of a business key can change over time, e.g. a person obtains witness protection and gets a new Social security number, therefore it is up to the ETL process to make a solid mapping. When a business key is an employee number and an employee resigns and some years later is rehired, there could be two different employee numbers for the same employee (EmployeeNumber or EmployeeId), and a data warehouse must find a way to map (»glue«) them together, so the employee only occurs once in a dimension, with help from a key map table that maintains the durable key. The other way around can also happen in a source system where the same value of an EmployeeNumber or an EmployeeId is used for two different employees.

 

Descriptive data represents a dimension value context description as textual, numeric, date, time etc. and is saved in columns in a dimension table and will be shown to a business user as descriptive data columns with values to explain fact data in a row from a fact table.

 

4.3. Changing dimensions

Source data is volatile data because it will change over time, e.g. a customer changes his name or address and a product changes place in a hierarchical structure as a result of a reorganization. How can we support evolving dimension data when dimension values and instances normally will change over time because of the volatility in source systems? The rate of changes can be divided into two kinds of classifications, and afterwards we will be looking into techniques to handle and track changes and to capture their history and preserve the life cycle of source data, also called Change Data Capture CDC.

 

Slowly Changing Dimensions SCD

Columns of a dimension that would undergo changes over time. It depends on the business requirement whether particular column history of changes should be pre­ser­ved in the data warehouse or data mart.

(In danish stille og rolig ændrede (skiftende, foranderlige) dimensioner med lang­som opbyggende historik).

 

Rapidly Changing Dimensions RCD

A dimension column that changes frequently. If you do need to track the changes, using a standard Slowly Changing Dimensions technique can result in a huge inflation of the size of the dimension. One solution is to move the column to its own dimension with a separate foreign key in the fact table. Rapidly changing data usually indicates the presence of a business process that should be tracked as a separate dimension in a separate historical data table.

(In danish hurtig ændrede (skiftende, foranderlige) dimensioner med omfangsrig opbyggende historik).

 

Techniques or methods to handle dimension values that are changing over time from the source systems are called Ralph Kimball’s eight types or approaches of SCD:

 

Type 0

Keep original value, where a dimension value will never change, and when a value is changed in a source system, the original value is retained in a dimension. It is called no versioning.

 

The type 0 approach is only to store the original value from a source system.

 

A single fixed value does not change over time but can be corrected in case of an error in a source system, e.g. a correction of a mistyping, a typo or a misspelling of a name or a wrong Date of Birth, Date of Issue, Date of Expiration, Date of Launching something or Date of First purchase.

 

The value of a business key will be in one row only, therefore the business key co­lumn can be a unique index. There is a one-to-one relationship between business key and dimension key.

 

Dimension key is an artificial auto-generated unique sequence number or an identi­ty column as a surrogate key for the primary key of the dimension to join to a fact dimension key.

 

A fact table refers to a dimension key value to fetch the original values, as-began.

 

Inserting a new row into a fact table will lookup a business key value in a dimen­sion table to fetch a dimension key value for the new row.

 

A view upon the dimension will provide the original values for the dimension key to join to a fact dimension key to fetch the original values, as-began.

 

This is the technique for Slowly Changing Dimension: »Retain original«.

 

Type 1

Keep current value, where a dimension value will change, and when a value is changed in a source system, the old current value will be overwritten in a dimen­sion by the new current value in the same row in the dimension. It is called no versioning.

 

The type 1 approach is only to store the active, actual, present, newest or latest value, i.e. the most recent or current value from a source system. The old value will be forgotten. The history of the old value is lost forever. Historical descriptive data values are not preserved. This approach is suitable when historical changes are not significant or when tracking historical changes is not required.
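
A minimal sketch of the type 1 overwrite, assuming the hypothetical DimProduct above and a hypothetical staging table stage.Product that holds the latest source values:

-- Overwrite the old current value in the same row (no versioning)
UPDATE d
SET    d.ProductName = s.ProductName
FROM   dbo.DimProduct AS d
JOIN   stage.Product  AS s ON s.ProductCode_bkey = d.ProductCode_bkey
WHERE  d.ProductName <> s.ProductName;

-- Insert business keys that are new to the dimension
INSERT INTO dbo.DimProduct (ProductCode_bkey, ProductName)
SELECT s.ProductCode_bkey, s.ProductName
FROM   stage.Product AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimProduct AS d
                   WHERE  d.ProductCode_bkey = s.ProductCode_bkey);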

 

The value of a business key will be in one row only, therefore the business key co­lumn can be a unique index. There is a one-to-one relationship between business key and dimension key.

 

Dimension key is an artificial auto-generated unique sequence number or an identi­ty column as a surrogate key for the primary key of the dimension to join to a fact dimension key.

 

A fact table refers to a dimension key value to fetch the most recent, current values, as-is.

 

Inserting a new row into a fact table will lookup a business key value in a dimen­sion table to fetch a dimension key value for the new row.

 

A view upon the dimension will provide the current values for the dimension key to join to a fact dimension key to fetch the most recent, current values, as-is.

The current view will show a unique list of most recent values which is handy for a dropdown box for search criteria in a dashboard.

(In danish aktuelle, nuværende, gældende, seneste værdi).

 

This is the technique for Slowly Changing Dimension: »Current by overwrite«.

 

Type 2

Keep historical value and current value, all values as history, where a dimension value will never change, and when a value is changed in a source system, the old current value will remain in the dimension and the new current value will be inserted into a new row in the dimension. It preserves historical descriptive data values in a dimension table. It is called row versioning.

 

We achieve an unlimited history of values over time, marked by a pair of timeline columns of data type date or datetime that represent the span of time when a row with its values was the »current truth«, with metadata columns:

 

  • Effective date and Expiration date (Active date, Expired date, Expiry date)
  • BeginDate and EndDate
  • Start­Date and StopDate
  • ValidFrom and ValidTo    (I prefer this pair of columns with a datetime)

 

The type 2 approach is to store the full history of changed values from a source system and still be able to fetch the most recent or current value from the source system. Historical descriptive data values are preserved. This approach is suitable when historical changes are significant and when full historical tracking with no data loss is required.

 

The timeline metadata columns ValidFrom and ValidTo represent the period of time when the value was current and active in a source system.

 

The value of a business key will be repeated in a new row every time a data value changes. There is a one-to-many relationship between business key and dimension key.

 

An ETL process uses a business key value to update the old current row by setting the metadata column ValidTo to a datetime stamp, and to insert a new current row with the same datetime stamp in the metadata column ValidFrom and with the metadata column ValidTo set to 9999-12-31 to represent the new current row for the business key. Kimball page 508 calls it »no gap exists« between rows of the same business key value, meaning that the old current row ValidTo and the new current row ValidFrom use the same datetime stamp value.
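
A minimal sketch of that update and insert, assuming a hypothetical DimCustomer with the columns Customer_bkey, CustomerName, ValidFrom and ValidTo and a hypothetical staging table stage.Customer; the same @Now datetime stamp closes the old current row and opens the new current row so that no gap exists:

DECLARE @Now datetime2(3) = SYSDATETIME();

-- Expire the old current row for business keys whose value has changed
UPDATE d
SET    d.ValidTo = @Now
FROM   dbo.DimCustomer AS d
JOIN   stage.Customer  AS s ON s.Customer_bkey = d.Customer_bkey
WHERE  d.ValidTo = '9999-12-31'
  AND  d.CustomerName <> s.CustomerName;

-- Insert a new current row with the same datetime stamp in ValidFrom
INSERT INTO dbo.DimCustomer (Customer_bkey, CustomerName, ValidFrom, ValidTo)
SELECT s.Customer_bkey, s.CustomerName, @Now, '9999-12-31'
FROM   stage.Customer AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                   WHERE  d.Customer_bkey = s.Customer_bkey
                     AND  d.ValidTo = '9999-12-31');  -- covers both changed and brand-new business keys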

 

A key map table can transform business key values either to an immutable surro­gate business key _sbkey e.g. a code which can be understood by a human, or to an artificial auto-generated unique sequence number or an identity column as a durable key _dkey which can be included in a type 2 dimension.

 

Dimension key is an artificial auto-generated unique sequence number or an identi­ty column as a surrogate key for the primary key of the dimension to join to a fact dimension key.

 

A fact table refers to a dimension key value to fetch the registered values of the dimension at the time when the fact data happened, occurred, was registered or took place, often given by a date column in the fact table based on source data, or when the fact data was entered into the fact table by a current load insert date, as-was.

 

Inserting a new row into a fact table will date range lookup a business key value in a dimension table between ValidFrom and ValidTo to fetch a dimension key value for the new row.
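
A minimal sketch of such a date range lookup during the fact load, assuming a hypothetical staging table stage.Sales and the DimCustomer sketched above; the half-open interval ValidFrom <= SalesDate < ValidTo matches exactly one dimension row per business key:

INSERT INTO dbo.FactSales (Customer_key, SalesDate, Amount)
SELECT d.Customer_key,                       -- dimension key valid on the fact date
       f.SalesDate,
       f.Amount
FROM   stage.Sales     AS f
JOIN   dbo.DimCustomer AS d
       ON  d.Customer_bkey = f.Customer_bkey
       AND f.SalesDate >= d.ValidFrom
       AND f.SalesDate <  d.ValidTo;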

 

A view upon the dimension will provide the registered values for the dimension key to join to a fact dimension key to fetch the registered values, as-was.

(In danish oprindelige værdi).

 

A view upon the dimension will provide the current values for the dimension key to join to a fact dimension key to fetch the most recent, current values, as-is.

The current view will show a non-unique list of most recent values which is not handy for a dropdown box for search criteria in a dashboard, because a value can be repeated for different dimension key values used in a fact table.

(In danish aktuelle, nuværende, gældende, seneste værdi).

 

I like to add a metadata column called IsCurrent with two values: 0 for historical and 1 for cur­rent to mark each row in a type 2 dimension.
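
A minimal sketch of the current view, assuming the hypothetical DimCustomer has the IsCurrent column; every dimension key, also the historical ones, is kept in the view but carries the current values of its business key, which is why the list of names is non-unique:

CREATE VIEW dbo.DimCustomer_Current
AS
SELECT h.Customer_key,            -- all dimension key values, also the historical ones
       h.Customer_bkey,
       c.CustomerName             -- the most recent, current value of the business key
FROM   dbo.DimCustomer AS h
JOIN   dbo.DimCustomer AS c
       ON  c.Customer_bkey = h.Customer_bkey
       AND c.IsCurrent = 1;       -- 1 = current row, 0 = historical row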

 

This is the technique for Slowly Changing Dimension: »Keep history in rows as full history«.

 

Type 3

Keep current value and previous value, where a dimension value will change, and when a value is changed in a source system, the old current value is retained by storing it in a Previous column (or Historical column) in the dimension, and thereafter the old current value is overwritten by the new current value in the same row in the dimension. It is called column versioning as an extension to type 1.

 

The type 3 approach is to store the latest or current value from a source system along with a limited amount of historical data in a previous value or a few prior values.

 

The value of a business key will be in one row only, therefore the business key co­lumn can be a unique index. There is a one-to-one relationship between business key and dimension key.

 

Dimension key is an artificial auto-generated unique sequence number or an identi­ty column as a surrogate key for the primary key of the dimension to join to a fact dimension key.

 

A fact table refers to a dimension key value to fetch the most recent, current values, as-is, and some previous dimension values, as-was.

 

Inserting a new row into a fact table will lookup a business key value in a dimen­sion table to fetch a dimension key value for the new row.

 

A view upon the dimension will provide the current values for the dimension key to join to a fact dimension key to fetch the most recent, current values, as-is, and to fetch some previous dimension values, as-was.

 

This is the technique for Slowly Changing Dimension: »Keep history in columns as partial history«.

 

Type 4

When a group of columns in a dimension is going to change often as a Rapidly Changing Dimension, we can split the dimension into a Base-dimension that has slowly changing columns and one or more separate Mini-dimensions that have often changing columns (volatile data values) to keep the fast changing values in their own tables. Type 4 stores all historical changes in separate historical data tables.

 

The type 4 approach is to store the current value and the historical values separately to maintain data integrity.

 

The Base-dimension can be either a type 1 or type 2 (most often type 1) and the Mini-dimension be­comes a type 2. We get a »capture a relationship in the fact«.

 

A fact table refers to a Base-dimension key value to fetch the Base-dimension va­lues.

 

A fact table refers to a Mini-dimension key value to fetch the Mini-dimension va­lues.

 

This is the technique for Rapidly Chan­ging Dimension: »Keep history in tables«.

 

Type 5

Builds on type 4, where a Mini-dimension gets a view to fetch the current rows of the Mini-dimension, and the Base-dimension is extended with a key value that points to the current view; the key becomes a type 1 column outrigger in the Base-dimension.

 

Base-dimension and Mini-dimension can be joined without including a fact table, which saves query performance. The join can be implemented in a view, as shown in the sketch below. We get a »capture a relationship in the dimensions«.
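
A minimal sketch of such a view, assuming a hypothetical Base-dimension DimCustomer with a type 1 outrigger column CurrentProfile_key and a hypothetical current view DimCustomerProfile_Current of the Mini-dimension:

CREATE VIEW dbo.DimCustomerWithCurrentProfile
AS
SELECT b.Customer_key,
       b.CustomerName,
       p.Income,                                   -- current Mini-dimension values
       p.AccountStatus
FROM   dbo.DimCustomer                AS b         -- Base-dimension
JOIN   dbo.DimCustomerProfile_Current AS p         -- current view of the Mini-dimension
       ON p.CustomerProfile_key = b.CurrentProfile_key;  -- type 1 outrigger key maintained by the ETL process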

 

The ETL process must overwrite the type 1 column in the Base-dimension whenever the current Mini-dimension row changes over time. Therefore 4 + 1 = type 5.

 

This is the technique for Rapidly Chan­ging Dimension: »Outrigger current dimen­sion«.

 

Type 6

A mixture of type 1 and type 2 columns, therefore it is a good idea to suffix the columns with _t1 and _t2 to know which columns can be overwritten in the current row. It can also have columns of type 3, therefore 3 + 2 + 1 = type 6.

 

Type 6 acts as type 2 when tracking changes by adding a new row for each new version, but type 6 also overwrites the _t1 columns on the previous row versions to reflect the current state of data, by using the business key to join the new row with the previous rows.
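
A minimal sketch of that overwrite, assuming hypothetical columns CustomerName_t1 (type 1) and City_t2 (type 2) in DimCustomer; the _t2 columns follow the type 2 update and insert shown earlier, while the _t1 columns are overwritten on every row of the business key:

-- Overwrite the type 1 column on current and historical rows of the same business key
UPDATE d
SET    d.CustomerName_t1 = s.CustomerName
FROM   dbo.DimCustomer AS d
JOIN   stage.Customer  AS s ON s.Customer_bkey = d.Customer_bkey
WHERE  d.CustomerName_t1 <> s.CustomerName;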

 

This is the technique for Slowly Changing Dimension: »Hy­brid«.

 

Type 7

All rows follow type 2 to keep track of historical values with a dimension key column, and an extra key column called a durable key follows type 1 for the current value.

 

A key map table can transform business key values either to an immutable surro­gate business key _sbkey e.g. a code which can be understood by a human, or to an artificial auto-generated unique sequence number or an identity column as a durable key _dkey which can be included in a type 7 dimension.

 

The value of a business key and the value of a durable key will be repeated in a new row every time a data value changes. There is a one-to-many relationship between business key and dimension key. There is a one-to-many relationship between durable key and dimension key.

 

Dimension key is an artificial auto-generated unique sequence number or an identi­ty column as a surrogate key for the primary key of the dimension to join to a fact dimension key.

 

A fact table refers to a dimension key value to fetch the registered values of the dimension at the time when the fact data happened, occurred, was registered or took place, often given by a date column in the fact table based on source data, or when the fact data was entered into the fact table by a current load insert date, as-was.

 

A fact table refers to a durable key value to fetch the most recent, current values, as-is.

 

Inserting a new row into a fact table will date range lookup a business key value in a dimension table between ValidFrom and ValidTo to fetch a dimension key value and a durable key value for the new row.

 

A view upon the dimension will provide the registered values for the dimension key to join to a fact dimension key to fetch the registered values, as-was.

(In danish oprindelige værdi).

 

A view upon the dimension will provide the current values for the durable key to join to a fact durable key to fetch the most recent, current values, as-is.

The current view will show a unique list of most recent values which is handy for a dropdown box for search criteria in a dashboard, because the values will be distinct by the use of the durable key column in a dimension table of type 7.

(In danish aktuelle, nuværende, gældende, seneste værdi).

 

I like to add a metadata column called IsCurrent with two values: 0 for historical and 1 for cur­rent to mark each row in a type 7 dimension.
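
A minimal sketch of the dual joins, assuming a hypothetical FactSales that carries both Customer_key and Customer_dkey and that DimCustomer contains the durable key column Customer_dkey and the IsCurrent column:

-- As-was: the values registered at the time of the sale, through the dimension key
SELECT f.Amount, w.CustomerName AS CustomerNameAsWas
FROM   dbo.FactSales   AS f
JOIN   dbo.DimCustomer AS w ON w.Customer_key = f.Customer_key;

-- As-is: the most recent, current values, through the durable key
SELECT f.Amount, c.CustomerName AS CustomerNameAsIs
FROM   dbo.FactSales   AS f
JOIN   dbo.DimCustomer AS c ON  c.Customer_dkey = f.Customer_dkey
                            AND c.IsCurrent = 1;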

 

This is the technique for Slowly Changing Dimension: »Dual Type 1 and Type 2 Dimensions and Dual foreign keys in fact for a given dimension«.

 

Correction of data together with the types

For type 0 I mentioned »correction of a mistyping, a typo or a misspelling«, e.g. the first name of a customer, Kelli, is corrected to Kelly after a few days in a source system. For type 2 or type 7 that will automatically mean an extra row in the dimension, but I think it needs some consideration, especially if a source system can inform about the correction; maybe the dimension can do an overwrite/update of the row as type 1 and mark the row with a metadata column IsCorrected, or tag changed data with a reason (Kimball page 451).

 

Examples of type 0, type 1, type 2, type 3, type 4 and type 7

Type 0 for a Date dimension and a Time dimension and a Gender dimension be­cause the values will never change.

 

Type 1 for an Invoice dimension because old values are not important. A dimension for MaritalStatus with the values Single, Married, Separated, Divorced, Widowed, Registered partnership and Abolition of registered partnership, because in case a value changes, we don’t want to keep the old value.

 

Type 1 for a Customer dimension example is in section 6.2.1.

 

Type 2 for a Customer dimension because customer values from a date of purchase are important to keep for accounting, purchase statistics and analyses for marketing. An example with date range lookup is in section 6.2.2 and includes a non-unique list of customer current names.

 

Type 7 for a Customer dimension: an example with date range lookup is in section 6.2.3 and includes a unique list of customer current names.

 

Here is a type 3 example where a customer contains three columns for postal codes. The column names are Current Postal Code, Previous Postal Code and Oldest Postal Code. When a customer address is changed to a new postal code, the ETL process will move the value from the Previous column to the Oldest column, move the value from the Current column to the Previous column and add the new value to the Current column.
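
A minimal sketch of that shifting, assuming hypothetical columns CurrentPostalCode, PreviousPostalCode and OldestPostalCode in DimCustomer and a new postal code arriving in stage.Customer; in SQL all assignments in one UPDATE read the old row values, so the shift can be done in a single statement:

UPDATE d
SET    d.OldestPostalCode   = d.PreviousPostalCode,   -- Previous becomes Oldest
       d.PreviousPostalCode = d.CurrentPostalCode,    -- Current becomes Previous
       d.CurrentPostalCode  = s.PostalCode            -- the new value becomes Current
FROM   dbo.DimCustomer AS d
JOIN   stage.Customer  AS s ON s.Customer_bkey = d.Customer_bkey
WHERE  d.CurrentPostalCode <> s.PostalCode;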

 

Here is a type 4 example where the customer values are split into a Base-dimension with slowly changing columns e.g. CustomerName, ContactName, Address, City, Postal Code, Discount Band, Marketing category, and a Mini-dimension with fast/rapidly changing columns e.g. CustomerAge, Income, Rating, AccountStatus, Days of payment.
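
A minimal sketch of that split, with hypothetical table and column names; the Mini-dimension gets its own surrogate key and the fact table captures the relationship by carrying a key to both dimensions:

-- Mini-dimension with the rapidly changing customer columns
CREATE TABLE dbo.DimCustomerProfile
(
    CustomerProfile_key int IDENTITY(1,1) PRIMARY KEY,
    CustomerAge         int           NOT NULL,
    Income              decimal(12,2) NOT NULL,
    AccountStatus       nvarchar(20)  NOT NULL
);

-- The fact table captures the relationship between Base- and Mini-dimension
CREATE TABLE dbo.FactOrder
(
    Customer_key        int           NOT NULL,   -- Base-dimension key
    CustomerProfile_key int           NOT NULL,   -- Mini-dimension key
    OrderDate           date          NOT NULL,
    Amount              decimal(12,2) NOT NULL
);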

 

Here is a type 6 example: In a dimension table the columns can be a mix of type 1 and type 2, e.g. a Customer dimension where the columns for customer name, street name, house or apartment number are type 1 because we only need the recent value for shipment, and the columns for postal code (zip code) and city are type 2 because we like to track these changes and keep data for the city where a customer was living when purchasing a product. A »sales in cities« report for the last ten years will use the right city of the customer at the time the purchase happened, occurred, was registered or took place.

 

Temporal table system-versioned table

A temporal table looks like a type 4 dimension because it keeps history in a separate table while the original dimension table keeps the current dimension member, but the dimension key is not an artificial auto-generated unique sequence number for each update of values in the dimension, and therefore it cannot be a type 2 to type 7. A temporal table is only usable for a type 1, when it includes the deleted rows, because older fact rows can refer to them. We can say that a temporal table or a system-versioned table is a type 1 dimension with an addition of history-tracking to handle historical data values for changes (updated or deleted) at any time.
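
A minimal sketch of a system-versioned temporal table in SQL Server syntax, assuming a hypothetical DimStore; SQL Server maintains the history table and the period columns automatically on every update and delete:

CREATE TABLE dbo.DimStore
(
    Store_key    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Store_bkey   nvarchar(20)  NOT NULL,
    StoreName    nvarchar(100) NOT NULL,
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.DimStore_History));

-- Query the values as they were at a given point in time
SELECT Store_key, StoreName
FROM   dbo.DimStore FOR SYSTEM_TIME AS OF '2024-01-01';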

Example from Kimball design tip #100

 

4.4. Types of dimensions

Let us characterize the various dimensions into different types of dimensions.

 

Conformed dimension or Shared dimension or Common dimension

A conformed dimension has the same meaning to every fact in multiple data marts, so measures will be categorized and described in the same way, ensuring consistent reporting across the data warehouse, which is essential for enterprise data warehousing.

 

A conformed dimension is a consistent interface to make sure that data can be combined in a data warehouse and be used all over the business, because the values of the dimension mean the same thing in each fact. With a conformed dimension we can combine and drill across from fact to fact in one data mart or over several data marts, and analyze common columns and values. Separate fact tables can be used together with shared, common and conformed dimensions.

 

Conforming several sources of data is part of the integration to achieve a conformed dimension, where data with different meanings and different columns must be compared against each other, rules must be set, and data must be cleansed to create a single version of the entity. Conformed dimensions unite and integrate data values among multiple source systems, so it is easy to search across different types of data and sync them in a common report. A shared dimension is utilized in multiple fact tables in a data mart or across multiple data marts.

 

Dimension values come either from a source system or are built by business rules in a Usage supporting database. A non-conformed dimension can only be used within one fact. It is part of the ETL process to do the conforming by merging, uniting and consolidating multiple source data across the enterprise to make a conformed dimension, e.g. a Customer dimension from different business areas such as order, sale, invoice, delivery and support service (for B2B and B2C) with both different customers and the same customers, with different business key values and with different addresses like a shipping address and a billing address. Sometimes data from a conformed dimension is sent back to the source systems as master data to be reusable in the organization.

 

Hierarchical with hierarchy levels

The dimension values can be placed in a hierarchy like a Location with three levels Country → Region → City. A dimension can have several separate and independent hierarchies with different numbers of levels.

 

Grouping with group or Band

The dimension values can be placed in a group that is grouping values in band intervals, like person ages in custom buckets, e.g. an Age group column with the intervals Child (0-9), Tween (10-12), Teenage (13-19), Young adult (20-29), Adult (30-66) and Senior citizen (67-130).
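
A minimal sketch of such a banding, assuming a hypothetical Age column in a customer dimension:

SELECT Age,
       CASE
         WHEN Age BETWEEN  0 AND   9 THEN 'Child (0-9)'
         WHEN Age BETWEEN 10 AND  12 THEN 'Tween (10-12)'
         WHEN Age BETWEEN 13 AND  19 THEN 'Teenage (13-19)'
         WHEN Age BETWEEN 20 AND  29 THEN 'Young adult (20-29)'
         WHEN Age BETWEEN 30 AND  66 THEN 'Adult (30-66)'
         WHEN Age BETWEEN 67 AND 130 THEN 'Senior citizen (67-130)'
       END AS AgeGroup
FROM   dbo.DimCustomer;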

 

Data classification is the process of organizing data into categories. A group or category is part of a categorization or grouping of data that makes a dimension more user-friendly, so data of a dimension can be seen on an aggregated and summed level.

 

A dimension normally contains one or multiple hierarchies and/or groups to fulfill requirements from the business users.

A dimension can of course be non-hierarchical and non-grouping.

 

Date dimension or Calendar dimension

A very common dimension with the granularity of a single day with hierarchies as:

Year → Half year → Quarter → Month → Date (five levels) or

Year → Week → Date (three levels) because a week does not always belong to one month.

 

When a fact date is connected to the Date dimension it is easy to make a report at month, quarter or yearly level, or to use a search filter criterion such as Q3 2013.

Dates or days can be grouped to a Season or TimeOfYear: Winter (December Janu­ary February), Spring (March April May), Summer (June July August) and Fall (September October November) (in danish Årstider).

 

A Date dimension normally has many supporting columns like Weekday name and Month name. An IsHoliday column has a data type of boolean/bit with two values, True (1) and False (0), as a flag indicator, but for use in a presentation tool it is better to have a textual column DayOnOff with the values »Holiday« and »Working day«. Of course the rule for when a specific date is a holiday depends on country, collective agreement, company policy etc. and is programmed into the stored procedure that builds the date dimension.
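
A minimal sketch of deriving such a textual column, assuming a hypothetical DimDate table with an IsHoliday bit column:

SELECT [Date],
       CASE WHEN IsHoliday = 1 THEN 'Holiday' ELSE 'Working day' END AS DayOnOff
FROM   dbo.DimDate;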