Dimensional modeling – architecture and terminology
by Joakim Dalby (danish Dimensionel modellering, Datamodellering, Datavarehus)
»The most valuable commodity I know of is information,« says Gordon Gekko in the movie Wall Street from 1987. »Knowing is good, but knowing everything is better,« says the CEO in the movie The Circle from 2017. »Information at your fingertips« was a concept by Bill Gates in 1990.
Information has become a major asset for any organization. The book Das Kapital by Karl Marx could today get the title Die Information.
A data warehouse is a way to provide information to business users, add value to the business, and reuse data in new ways to increase revenue.
A data warehouse is a separate system, so user queries and analysis do not slow down or add workload to the operational systems.
Data is stored in a data warehouse through an extract, transform and load process, where data is extracted from operational systems, transformed into high-quality data and checked for compliance with business rules before being loaded into the data warehouse system.
Gartner formulates that the platform for data and analytics is the heart of the digital business. Here you will find data management and the company's governance for data and analysis. Data is the raw material of our digital age. Data is collected, cleaned, controlled, optimized, shared and used, and our ability to work flexibly and intelligently at this »assembly line« is growing fast. Data management is crucial to tomorrow's business success. Companies must shape the technology and the data resource both short-term and long-term, for the sake of both the existing business and the business of the future. The data platform itself is, with good reason, a focus area for companies. In a data-centric world, the tools, processes and competencies to handle data are absolutely crucial. All of today's greatest technological trends, ranging from the Internet of Things (IoT) and robotics to artificial intelligence and virtual reality, are ultimately dependent on data managers, architects and company data strategies.
»That's all data is. A gift from yesterday that you receive today to make tomorrow better.« Jon Acuff: Finish: Give Yourself the Gift of Done, page 126.
Performance Management Capabilities required:
A data warehouse tells you what has happened (lag measures) - not why it has happened or what is going to happen (lead measures). There are many gaps that need to be filled in by a different kind of insight than the one that can be delivered in a data flow task from multiple source systems to an enterprise data warehouse, but let's start with what is a data warehouse.
Data warehouse is
· Subject-oriented because data from the business is merged into areas of subjects and not organized around the functional applications of the business.
· Integrated because multiple operational source systems are brought to use common and consistent business names, definitions, formats, data types and units.
· Time-variant because historical data is kept alongside current data at any given time. Time variance records change over time: items of data are timestamped to tell when the data was registered. In contrast, a source system will often only contain current data that is correct at the moment.
· Non-volatile because data does not change once loaded; historical data will never be altered.
· Unite, interconnect, compile, conform and consolidate data into common format.
· Enrichment and processing of data to provide new knowledge and wisdom.
· Collection of data in support of management's decision-making process and for analyzing the business with business performance measurement qualifiers and KPIs (Key Performance Indicators) used as measures of how a company is doing.
· Utilizing dimensional modeling, end users and analysts can easily understand and navigate the data structure and fully exploit the data for self-service BI.
· Goal to support slice and dice of data for business analysis and insight.
· Data-driven to become enjoyable and pleasurable for the users to work with. It is important for a business and for a data analyst to have the right data, in the right form, at the right time, so they can turn data into insight and achieve business intelligence.
· Compliance requirement to provide proof that the reported numbers are accurate and complete, by having an audit trail for reconciliation checks etc.
· Driven by business requirements specifications and user stories from stakeholders and users. A data warehouse must focus on the needs of the business.
Data from multiple operational source systems and legacy source systems is stored in tables in a relational database called a data warehouse. Data is merged and integrated inside the data warehouse to get a consistent format and use the same naming.
A transversal data warehouse: it cannot be said more tellingly (in danish Tværgående datavarehus, mere sigende kan det ikke siges).
Uniformity means that within the data warehouse the common data elements of multiple source systems are referred to in a consistent manner.
Conforming means creating, within the data warehouse, a conformed column with a single and consistently labelled name that has the same common meaning and identical contents all over the data warehouse, e.g. the same spelling and unit of measurement (in danish ensartet).
For example, product data can be found in multiple source systems with different names of columns, different spellings of product names and different segmentations of products; therefore a data warehouse will unite, interconnect, compile, conform and consolidate the data into one product table and one product name column to make it easy for the business users to make a list of all products. For instance, the length of a pipe is registered as 55 cm in one source system and as 0.55 m in another source system, but in the data warehouse the values become conformed to 550 mm.
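The unit conforming described above can be sketched as a small routine; the function name and unit codes are my own illustration, not taken from any specific source system:

```python
# Sketch of conforming length values from multiple source systems to a common
# unit of millimeters. Function name and unit codes are illustrative.
def conform_length_to_mm(value, unit):
    factors = {"mm": 1, "cm": 10, "m": 1000}
    return value * factors[unit]

length_a = conform_length_to_mm(55, "cm")   # source system A registered 55 cm
length_b = conform_length_to_mm(0.55, "m")  # source system B registered 0.55 m
# Both conform to 550 mm in the data warehouse.
```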
A conformed dimension supports integration of multiple source systems into a data warehouse.
It is important to design a data warehouse to support reporting and data analysis through a single and common data model that is easy to navigate, so a business user does not have to think of data across source systems. Data integration becomes seamless. Many source systems update their data to reflect the most current and recent state, while a data warehouse also maintains history.
For example, consider an ambassador who has lived in Washington, London and New Delhi over the last 25 years and has always bought her perfume from a Lancôme store in Paris. In an IT system that only stores the current shipping address, all the sales over the last 25 years will be placed in India, while a data warehouse will remember all the cities and give a correct picture of the sales shipped around the world.
A data mart covers a specific subject area and contains the data and information that is relevant for a business user. A data mart has one purpose: to customize and summarize data for tailored support of a specific analytical requirement from a business unit. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility.
For example, Sales mart, Customer mart, CRM mart, Churn prediction mart, Market mart, Production mart, Inventory mart, Shipment mart, HR mart, Tax mart, Credit risk mart, Fraud detection mart etc.
A data warehouse can have multiple data marts bound together with conformed and shared dimensions. A conformed dimension is shared by multiple facts that have conformed measures with the same calculation methods and a common unit of measure, e.g. sales revenue and support revenue in two facts are both pre-VAT (value added tax) in United States Dollars (USD), so adding them will result in a total revenue.
A data mart contains the data to calculate Key Performance Indicators (KPI) e.g. MRR Monthly Recurring Revenue. (KPI in danish Nøgletal).
Dimensional modeling is a design technique that seeks to present data in a standard, intuitive framework of:
A fact tells how much or how often, and a dimension adds texts to it. Both facts and dimensions are tables in a database. A fact has several dimensions connected via key columns and is often displayed as a star schema diagram.
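A star schema with one fact and one dimension can be sketched as a minimal example using SQLite from Python; the table, column and product names are assumptions made for the illustration:

```python
import sqlite3

# Minimal star-schema sketch: a fact table with a key column pointing to a
# dimension table. Table, column and product names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE FactSales  (ProductKey INTEGER REFERENCES DimProduct, Quantity INTEGER);
INSERT INTO DimProduct VALUES (1, 'Pipe'), (2, 'Valve');
INSERT INTO FactSales  VALUES (1, 10), (1, 5), (2, 3);
""")
# The dimension adds text (ProductName) to the fact's "how much" (Quantity).
rows = con.execute("""
SELECT d.ProductName, SUM(f.Quantity) AS TotalQuantity
FROM FactSales f
JOIN DimProduct d ON d.ProductKey = f.ProductKey
GROUP BY d.ProductName
ORDER BY d.ProductName
""").fetchall()
```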
It focuses on ease of end-user accessibility and provides high-performance access to the data warehouse.
Ralph Kimball recommends in the design of a data warehouse to decide what business process(es) to model by combining an understanding of the business requirements with an understanding of the available data.
Kimball page 37: »The BI team needs to understand the needs of the business as well as the realities of the underlying source data.« A process-centric fact table supports integration of multiple source systems via a set of conformed dimensions, and the entire data warehouse is built process by process.
Data warehousing (DWH) is made up of three components: the architecture, the data model and the methodology. It is a method to help a database designer build a comprehensive and reliable data warehouse system, e.g. based on dimensional modeling principles. An OLAP (OnLine Analytical Processing) cube or a Tabular model can sit on top of the dimensional model to present data in tools like Excel, Power BI, QlikView, Alteryx, Tableau or Targit.
A Business Intelligence (BI) system provides the information that management needs to make good business decisions. BI takes users from data and information to knowledge and wisdom. BI was previously called a Decision Support System (in danish Beslutningsstøttesystem). BI could also stand for Business Insight.
Data does not turn into insight without the effort of building a good data warehouse with an architecture of data layers (areas), a data model for each data layer and a methodology for processing data in each data layer (area), with the famous abbreviation ETL that stands for three operations on data: Extract, Transform and Load.
In this article I will focus on the basic concepts, terminology and architecture of dimensional modeling. My homepage has other articles on how to design and implement a data warehouse system with levels of data in the ETL process.
I see Dimensional modeling as a subset of the Entity Relationship (ER) data modeling design method of a relational database system OLTP (OnLine Transaction Processing).
Peter Chen, the father of ER modeling said in his seminal paper in 1976: »The entity-relationship model adopts the more natural view that the real world consists of entities and relationships. It incorporates some of the important semantic information about the real world.«
Where an ER data model has little or no redundant data, a dimensional model typically has a large amount of redundant data. Where an ER data model has to assemble data from numerous tables to produce anything of great use, a dimensional model stores the bulk of its data in a single fact table, or a small number of them, and divides data into multiple databases called data marts.
Back in the 1970s Bill Inmon began to define the term data warehouse as one part of the overall business intelligence system. An enterprise has one data warehouse with entities in a relational model as a centralized repository, and multiple data marts which source their information from the data warehouse. In the data warehouse, data is stored in 3rd normal form. The data warehouse is at the center of the Corporate Information Factory (CIF), which provides a logical framework for delivering business intelligence. A data mart is a simple form of data warehouse that is focused on a single subject. Denormalization and redundancy are the norm for data modeling techniques in a data mart.
Ralph Kimball says: »A data warehouse is a union of all its data marts.«
Kimball's data warehousing architecture is also known as the Enterprise Data Warehouse Bus Architecture matrix (BUS matrix): a collection of conformed dimensions that have the same meaning to every fact.
A data model represents correctness, completeness and consistency.
A schema is a logical grouping of database objects e.g. tables, views, table-valued functions and stored procedures and it can handle user access rights.
A table contains data in rows also known as records, tuples, instances or data rows.
A table contains columns also known as attributes, fields, elements or properties.
A column contains an atomic value like a First name, or a compound value like a Full name that can be further divided into three columns for Firstname, Middlename and Surname. A null represents a blank value, an empty value, an indefinite value, a nothing or a none in a column whose value is missing at the present time, because the value is not registered (yet) or the value is not defined for the row. A column can contain a null value, or a column can have a constraint enforcing it to always contain a value, e.g. by a default value. When a column LastPurchaseDate in a Customer table has a row with null, it is not unknown; it indicates the lack of a known value, or it is known that the customer hasn't made a purchase yet, therefore it is known to be absent, omitted or missing. A numeric column with null in several rows can cause a problem, because queries using aggregate functions can provide misleading results. A SQL statement like Sum(<column>) can return one row that contains null, and the use of an Outer Join can result in a column with null. Joe Celko's SQL for Smarties: Advanced SQL Programming says: »Null is not a value; it is a marker that holds a place where a value might go.« The term Void I have seen on a flight ticket with stopovers, marked in the empty lines to show there is nothing there and to prevent me from adding an extra stopover. John Doe (for men) and Jane Doe (for women) are used when the true name of a person is unknown or is being intentionally concealed.
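The misleading behavior of aggregate functions over null can be demonstrated with a small sketch, here using an illustrative Customer table in an in-memory SQLite database:

```python
import sqlite3

# Sketch of how null behaves in aggregate functions, using an illustrative
# Customer table in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer (CustomerId INTEGER, LastPurchaseAmount NUMERIC)")
con.executemany("INSERT INTO Customer VALUES (?, ?)",
                [(1, 100), (2, None), (3, 50)])

# SUM skips null rows, which can mislead if you expected every row counted.
total = con.execute("SELECT SUM(LastPurchaseAmount) FROM Customer").fetchone()[0]
# COUNT(column) also skips nulls, while COUNT(*) counts all rows.
cnt_col = con.execute("SELECT COUNT(LastPurchaseAmount) FROM Customer").fetchone()[0]
cnt_all = con.execute("SELECT COUNT(*) FROM Customer").fetchone()[0]
# SUM over no matching rows returns null (None in Python), not 0.
total_none = con.execute(
    "SELECT SUM(LastPurchaseAmount) FROM Customer WHERE CustomerId = 99").fetchone()[0]
```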
A column has a data type, e.g. integer, numeric, decimal, string or text, date, time, datetime, amount, boolean or bit for true (1, yes) and false (0, no). An important tip when naming columns: do not use possessive nouns such as Recipient's date of birth; better to use CamelCase like RecipientBirthDate in a column name. And don't use values in a column name like EmployeeDay/NightCode; better EmployeeShiftTypeCode. With conforming or conformance we avoid synonym columns with the same content but different names, and we avoid homonym columns with the same name but different content, e.g. Type is a general term, so better to give a column a name like ProductType, CustomerType, TransactionType etc. Names of tables and columns must be business-friendly and must make sense for both the business users and the IT developers. It is possible to use a view to make better names, e.g. an IT developer likes a date column to be called ReservationExpiryDate while in a view the business user likes to see Date_of_Reservation_Expiry, BankAccountBalanceAmount becomes Saldo, and IncludeTipsBoolean becomes Tips with Yes or No values and not null, with default No.
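The renaming through a view can be sketched as a minimal SQLite example; the table, view and column names mirror the illustrations in the text:

```python
import sqlite3

# Sketch: a view exposes business-friendly column names over technical ones,
# mirroring the renaming examples in the text (names are illustrative).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Account (BankAccountBalanceAmount NUMERIC,
                      IncludeTipsBoolean INTEGER NOT NULL DEFAULT 0);
INSERT INTO Account VALUES (1234.50, 1), (99.00, 0);
CREATE VIEW AccountForBusiness AS
SELECT BankAccountBalanceAmount AS Saldo,
       CASE IncludeTipsBoolean WHEN 1 THEN 'Yes' ELSE 'No' END AS Tips
FROM Account;
""")
rows = con.execute(
    "SELECT Saldo, Tips FROM AccountForBusiness ORDER BY Saldo").fetchall()
```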
A derived column represents a value that is derivable from the value of a related column or set of columns, not necessarily in the same table. For example, the age of a person is derivable from the date of birth column and today's date; other examples are a calculation of an amount, or a glue column that binds rows together for fast search.
A single-valued column holds a single value for an entity occurrence.
A multi-valued column holds multiple values, e.g. MovieCategory »Children, Comedy«.
A primary key is a unique identifier for all rows in a table, so each row contains a unique value and it can't be a null value (is not null, not allow null, non null), e.g. column EmployeeId is the primary key in an Employee table, where Id is an artificial auto-generated unique sequence number or an identity column.
A compound primary key is concatenated data e.g. a US phone number contains area code + exchange + local number in an EmployeePhonebook table.
A composite primary key is composed of multiple columns which combination is used to uniquely identify each row in a table, e.g. EmployeeId and CourseNumber in a Participant table.
An entity is a set of objects with the same properties and is implemented as a table with rows, where a primary key is unique and used for identification of rows.
A relationship defines how the entities are related to each other with a meaningful association among entities, in three ways called one-to-one, one-to-many or many-to-many, displayed in an Entity Relationship Diagram (ERD). A relationship is implemented by including the primary key in the related entity and calling it the foreign key, to ensure the consistency of the database. When a database is consistent we can navigate with a join among tables, and relationships become navigation paths. An example of a many-to-many relationship is between the two entities Employee and Course, where one employee can take many courses and one course can be taken by many employees; it may be modeled through a third entity that provides a mapping between them with foreign keys back to Employee and Course plus extra data, e.g. the date of the course, an evaluation of the course, a student grade.
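The Employee and Course example, including the composite primary key in the mapping entity, can be sketched as a small SQLite schema; the names and sample rows are illustrative:

```python
import sqlite3

# Sketch of the Employee-Course many-to-many relationship resolved by a
# Participant table whose composite primary key is (EmployeeId, CourseNumber).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Course (CourseNumber TEXT PRIMARY KEY, Title TEXT);
CREATE TABLE Participant (
    EmployeeId   INTEGER REFERENCES Employee,
    CourseNumber TEXT    REFERENCES Course,
    CourseDate   TEXT,
    Grade        TEXT,
    PRIMARY KEY (EmployeeId, CourseNumber)   -- composite primary key
);
INSERT INTO Employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO Course VALUES ('SQL101', 'Basic SQL');
INSERT INTO Participant VALUES (1, 'SQL101', '2024-03-01', 'A'),
                               (2, 'SQL101', '2024-03-01', 'B');
""")
# Relationships become navigation paths: join across the mapping table.
rows = con.execute("""
SELECT e.Name, c.Title, p.Grade
FROM Participant p
JOIN Employee e ON e.EmployeeId = p.EmployeeId
JOIN Course c ON c.CourseNumber = p.CourseNumber
ORDER BY e.Name
""").fetchall()
```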
Normalization is the process of structuring a database in several tables to reduce data redundancy and improve data integrity, through steps called 1NF, 2NF, 3NF, BCNF, 4NF and 5NF. But a data warehouse loves data redundancy, because it is not a human that updates data; instead it is an ETL process that runs as a computer software program developed by an ETL developer.
SQL stands for Structured Query Language e.g. Select From Join Where Union.
A view is a stored SQL query statement used to fetch data from a database; the data set has columns like a table and is called a result set or a recordset. A view is also called a virtual table representing the result of a database query.
A materialized view is a database object that contains the result of a SQL query, e.g. data persisted into a table or a file, or placed in a server memory cache. A view can be materialized through an index. Materialization is done for performance reasons, as a form of optimization of the ETL process and of fetching data for business users, e.g. five tables are joined and aggregations are precomputed and precalculated. For a client-server database application, the result can be placed as a local copy on a user's PC.
A table-valued function is an intelligent view with parameters, like a mathematical function, returning a result set wrapped in a table to be used in a join etc.
A stored procedure (sometimes parameterized) performs actions on data and may return a result set, a value or an output parameter. Procedures are important for the ETL process.
SELECT COUNT(*) AS NumberOfBoats
FROM Boat -- Boat is an assumed table name
WHERE SailingPast = 'My window'
I prefer assigning an alias using = instead of AS:
SELECT NumberOfBoats = COUNT(*)
FROM Boat
WHERE SailingPast = 'My window'
Because the assignment of a variable looks very much like my sql above:
DECLARE @NumberOfBoats int
SELECT @NumberOfBoats = COUNT(*)
FROM Boat
WHERE SailingPast = 'My window'
Data is the plural of datum, e.g. 5.5 or 38, and after labelling them shoe size in US and in EU they become information. When I know many of my friends' shoe sizes together with other information, I have knowledge that I can use for better birthday gifts or sell to a marketing campaign.
38 is data; a person's age of 38 years or a person's weight of 38 kilograms becomes information because I put 38 in a context and use 38 together with a unit. An example of metadata for the person's weight of 38 kilograms could be the date and time it was measured.
Data deduplication refers to a technique for eliminating redundant data in a data set. In the process of deduplication, extra copies of the same data are deleted, leaving only one copy to be stored. Also called Removing duplicates.
Metadata is, in short, »data about data«, because metadata provides data or information about »other content data« (in danish indholdsdata), e.g. the author and title of a book, the table of contents in a book, the index in a book, or an error description from an IT system like »violation of a foreign key constraint«.
Example of an XML structure with an element InvoiceAmount whose data value content is a real invoice amount, together with a metadata attribute for the currency or unit:
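A minimal sketch of such a structure, parsed in Python: the element value is the content data and the attribute is the metadata (the amount and currency are illustrative):

```python
import xml.etree.ElementTree as ET

# Sketch of an XML structure where the element value is the content data and
# the Currency attribute is the metadata. Amount and currency are illustrative.
xml = '<InvoiceAmount Currency="DKK">5000.00</InvoiceAmount>'
element = ET.fromstring(xml)
amount = float(element.text)        # the content data
currency = element.get("Currency")  # the metadata attribute
```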
In a data warehouse, metadata adds extra value to the raw data from the source systems, together with metadata for the data flow between data areas. A data warehouse uses metadata e.g. to define and specify, describe, classify, trace, debug and audit data, to keep logs and metrics, and to restrict access to and monitor data.
Metadata is partly business data (e.g. business descriptions, trace/lineage, access restriction) and partly non-business data (e.g. monitor data, logs, metrics, technical definitions). I am using metadata columns as housekeeping columns, support columns or administrative columns in tables, e.g. timeline columns with names like ValidFrom and ValidTo.
I have chosen to use two specific dates as metadata values (yyyy-mm-dd):
· Beginning of time as the first occurrence of data specified in column ValidFrom begins with date 1900-01-01, can include a time 00:00:00.0000000 (00:00:00.000 or 00:00:00).
· End of time as the last occurrence of data specified in column ValidTo ends with date 9999-12-31, can include a time 23:59:59.9999999 (23:59:59.997 or 23:59:59).
If a source system has a date column with a null value where the null doesn't mean »missing«, and the data warehouse only likes to have non-null values, then the data warehouse can use a perpetual date such as 9999-12-31 to identify a future undetermined date, an infinite date (in danish forevig uendelig dato for en uafklaret ikke aftalt dato). The benefit of a non-null date column is that a normal SQL join can be used to a calendar table in the data warehouse. For an amount, a data warehouse can use a fixed value to represent a non-existent value.
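The timeline columns ValidFrom and ValidTo together with the two special dates can be sketched in a small point-in-time query (SQLite via Python; the customer number, cities and dates are illustrative):

```python
import sqlite3

# Sketch: timeline metadata columns ValidFrom/ValidTo using the beginning-of-time
# and end-of-time dates, queried for a point in time. Names/values illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimCustomer (CustomerNumber INTEGER, City TEXT,
                          ValidFrom TEXT, ValidTo TEXT);
INSERT INTO DimCustomer VALUES
 (42, 'Washington', '1900-01-01', '2004-12-31'),
 (42, 'London',     '2005-01-01', '2013-06-30'),
 (42, 'New Delhi',  '2013-07-01', '9999-12-31');
""")
# Which city was valid on 2010-05-17? ISO date strings compare correctly.
city = con.execute("""
SELECT City FROM DimCustomer
WHERE CustomerNumber = 42 AND '2010-05-17' BETWEEN ValidFrom AND ValidTo
""").fetchone()[0]
```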
Metadata columns will be mentioned in many places in the article, and their naming is covered in section 6.3. I will end with some categories of metadata:
Structural metadata is about containers of data and how compound objects are put together, e.g. the rule of a SQL left join and the enrichment of a derived column in a SQL case-when statement.
Descriptive metadata is about a resource e.g. a link to a User Story Id in Jira and to a description and explanation in Confluence.
Reference metadata is about the contents: when and how it was created, record source, data lineage, and the movement of data between systems, data layers and data areas. When we deliver information to the end users, we must be able to tie it back to the source data sets.
Administrative metadata is about how to read the contents, e.g. column PerDate. Also permissions to the contents.
Process metadata is about how an ETL process is running, e.g. a package start/stop time, its duration, execution result and error message.
Audit metadata is about reconciliation checks, e.g. row counts and summarized measures to monitor and control the movement of data between systems, data layers and data areas. Also about monitoring and control of data quality and the audit trail.
Statistical metadata is based on process metadata and audit metadata to describe the status of a data warehouse, shown in a dashboard for Data management and DataOps.
Conclusion: It takes metadata to manage data and to use data.
Conforming - what to call it?
Humans all over the world use many terms for the same thing, for example a transport system: Subway in New York City, Tube or Underground in London, Metro in Paris or Copenhagen, U-Bahn in Berlin, while Vancouver and Bangkok have the SkyTrain as an elevated train system above the streets.
When a Filipino wants to buy toothpaste in a store, he will ask for a colgate. A sign on the road says xerox instead of photocopying. In Denmark I will ask for a kleenex instead of a facial tissue. Some names of brands become words in our language.
It can be a tough job to find and determine a good and common conformed term to become a column name, but later on the communication will be much easier among business people and IT people.
How to present a workload per day of 7½ hours: do you like 7:30 or 7.50?
How to present a date: do you like 2001-09-11 or 11-09-2001 or 9/11/2001?
If Shakespeare had been a data modeler, I wonder if he would have asked the question:
»A transaction has relationships or a transaction is a relationship?«
· If a transaction can be referenced to and it has a TransactionID as a primary key and it is a thing (its own entity) then it has relationships.
· If a transaction only contains foreign keys, has no need for a unique identifier and nothing else will reference it, then it is a relationship, e.g. a many-to-many data-bearing relationship.
In Peter Chen's Entity Relationship Diagram (ERD) there are many kinds of transactions, e.g. an order, an invoice, or a deposit of money into a bank account, which is a many-to-many data-bearing relationship at the conceptual level design with a caption like "deposit transaction". Moving on to the logical level design, a many-to-many relationship becomes an entity and primary keys become foreign keys. Moving on to the physical level, an entity becomes a table in the database through a Relational Database Management System (RDBMS). The most important thing for me is that the names of tables and columns reflect as much as possible the user's conceptual world (in danish afspejler brugerens begrebsverden).
A counterpart to an invoice is called a credit memo, credit memorandum or credit nota, and it reduces the amount of the invoice for several reasons: for example, the item is returned, the billed unit quantity is larger than the one delivered, the invoiced unit price is higher than agreed, or there are errors in the delivery that justify price reductions. If a transaction in a source system is changed, corrected or cancelled, then a data warehouse can calculate a counterpart transaction with a negative amount. The calculation of the final amount will be a summary of the transactions, whereby the final total amount will be less than the original amount.
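The counterpart calculation can be sketched in a few lines; the invoice number and the figures are illustrative:

```python
# Sketch: a change to an invoice is handled as a counterpart transaction with
# a negative amount; the final amount is the sum of all transactions.
# The invoice number and the figures are illustrative.
transactions = [
    ("INV-1001", 500.00),    # original invoice
    ("INV-1001", -120.00),   # credit memo: one item was returned
]
final_amount = sum(amount for _, amount in transactions)  # less than the original 500
```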
1.1. Data layers in a dimensional modeling architecture
A description of the content of the data layers of the throughput from multiple source systems to the business user's PC screen for improved analytic decision making. I am using the term area to point out that an area may contain several databases to split data from multiple source systems, making the entire solution scalable and able to be distributed over multiple servers. Each area is labelled with a data warehouse acronym as an abbreviation formed from the initial letters.
Source system→Input data area→Archive area→Data staging area→Data mart area.
Source file/table→Input table→Archive table→Stage table→Target table (destination table).
Source system area – SSA (Operational system and Legacy system)
A company or an organization uses several operational systems to handle daily data of many different kinds. Most systems are based on a database with primary keys and foreign keys, and there are three types of candidate keys to identify an entity object:
· A natural key exists in the real world, e.g. a fingerprint, a hotel room number, a medical diagnosis, a phone number, a three-letter currency code, a two-letter ISO country code and other kinds of codes that a user understands, e.g. JobCode MNG is manager, PRG is programmer and CLN is cleaner. A website address/url. An address with country + state/province/region/district + postal code + street name + house number + floor number (stairwell, door number, PO box). A country name. Names of my friends combined with their dates of birth. The key value is mutable (changeable, not immutable) and is meaningful for a human being. Back in the 1990s we could remember many phone numbers, but the mobile phone contact list took over.
· A business key exists in a system and is built as an artificial number, e.g. ISBN International Standard Book Number, a vehicle identification number (VIN), a vehicle registration number plate, Customer number, Purchase order number, Invoice number with Product numbers, Insurance policy number, Flight number, a general ledger account number, Login name and Social Security Number (SSN), a civil registration number like the CPR number each danish person is given by birth for the Central Person Register (PersonCivilRegistrationIdentifier); it is GDPR-sensitive because it contains a natural part with the Date of Birth, and the last digit tells the gender, an even number for a woman and an odd number for a man (anonymization/pseudonymization). Some of the numbers exist in a natural way in the real world because they are printed on papers and things. The key value is mutable and is meaningful for a human being; a business user prefers to use it to identify a thing and as a search lookup value given in a phone call to a company, a bank, a hospital or the government.
· A surrogate key exists in a system as an artificial auto-generated unique sequence number or identity column (id, uid unique identification) that is used as an unchangeable column instead of the natural/business key or a composite primary key. It is represented as an integer, a guid or a hashbyte of the natural/business key. The key value is immutable (unchangeable) and meaningless for a human being; it will normally not be exposed to outside users, but is useful for join operations inside the system.
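The two common implementations of a surrogate key, a sequence number and a hash of the business key, can be sketched like this; the helper names are my own illustration:

```python
import hashlib
from itertools import count

# Sketch of two surrogate key variants. Helper names are illustrative.
_sequence = count(start=1)

def next_surrogate_key():
    # Sequence-number variant: a meaningless, immutable integer.
    return next(_sequence)

def hash_key(business_key):
    # Hashbyte variant: a deterministic hash of the natural/business key.
    return hashlib.sha256(business_key.encode("utf-8")).hexdigest()

key1 = next_surrogate_key()      # 1
key2 = next_surrogate_key()      # 2
h = hash_key("1001")
same = h == hash_key("1001")     # the same business key gives the same hash
```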
A natural key or a business key is what the business uses to uniquely describe data. If a natural key or a business key is immutable and unchangeable in the source system then call it a durable natural key or a durable business key.
A table called Salestype has a business key as a one-letter code and a natural key as a textual value, e.g. R = Retail sales, D = Direct sales, O = Online sales.
A table called Status has a column StatusId that is a surrogate key auto-generated unique sequence number, a column StatusNumber that is a durable business key and a column StatusCode that is a durable natural key with a textual value.
A table called Customer has a column CustomerId that is a surrogate key auto-generated unique sequence number, a column CustomerNumber that is a business key and a column CustomerName that is almost a natural key. It is common in a customer system that the same person exists several times, because the person occurs in duplicate rows with the same name and address but different CustomerNumbers; therefore I will not call it a durable business key.
From a data warehouse point of view, a natural key is also a business key, and a surrogate key is sometimes called a business key e.g. a table called LeadState has a surrogate key Id and a column Text with values as »New, Contacted, Qualified, Lost«, then a data warehouse lets Id become the business key because a text value can be changed over time, like »Lost« is changed to »Not interested«. A beloved child has many names.
Data profiling of a source system is very important and the choice of column to become a data warehouse business key requires a good data analysis and criteria for the choice.
Be aware of a value of a natural key or a business key can be changed in a system because the original value was entered incorrectly and need to be corrected. A composite natural key in a Member table with Name+Address will be updated when a person change place of residence. I saw in an airport, that the flight number was selected as the business key for the data warehouse and used in the ETL process, but no one had taken into account, that when a plane was delayed, the letter D was added at the end of the flight number, therefore the data warehouse thought it was a new flight.
A surrogate key as an integer or guid won’t ever need to be changed, but if a customer is deleted by a mistake and the customer later will be inserted again with the same CustomerNumber as business key, then it will get a new sequence number in CustomerId. When a company has several operational systems handling customer data, each system will use its own surrogate key for CustomerId, therefore business key CustomerNumber is the glue between the systems.
It is important for a data warehouse to receive the value of the business key (and/or the natural key) from a source system, and in case a value has been changed, to also receive the before-value. For example, a Customer table has columns CustomerId as surrogate key, CustomerNumber as business key and CustomerName, and an Invoice table has columns InvoiceId, InvoiceNumber, InvoiceDate and CustomerId as a foreign key; the invoice data must be extended with the business key CustomerNumber, either by the source system or inside the data warehouse, to know the customer of an invoice. I prefer that a data warehouse also stores the value of the surrogate key, which is helpful when a value of a business key is changed and provides traceability for data lineage to tell where the data comes from, more in section 1.6. Merging and integrating data can be hard to set up in an ETL process and to test afterwards.
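The role of the business key as glue can be illustrated with a small sketch; the table and column names follow the Customer/Invoice example above, but the code is a hypothetical illustration using SQLite, not a statement about any particular source system:

```python
import sqlite3

# Hypothetical tables: CustomerId is a surrogate key local to one system,
# CustomerNumber is the business key shared across systems.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, CustomerNumber TEXT, CustomerName TEXT);
CREATE TABLE Invoice  (InvoiceId INTEGER PRIMARY KEY, InvoiceNumber TEXT, InvoiceDate TEXT,
                       CustomerId INTEGER REFERENCES Customer(CustomerId));
INSERT INTO Customer VALUES (1, 'C-1001', 'Ann');
INSERT INTO Invoice  VALUES (501, 'INV-9', '2024-01-15', 1);
""")
# Extend the invoice data with the business key, so the data warehouse can
# match the customer even when each system has its own surrogate CustomerId.
row = con.execute("""
SELECT i.InvoiceNumber, c.CustomerNumber
FROM Invoice i JOIN Customer c ON c.CustomerId = i.CustomerId
""").fetchone()
print(row)  # ('INV-9', 'C-1001')
```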
Input data area - IDA
From source systems to the Input data area for temporary storage of source data, which has been received as csv, xml, json, excel, access or database files, or fetched from a database using incremental load, more in section 1.2. This area is also called Raw data, Source, Legacy, Extract, Capture Operational Data layer, Data Acquisition or Landing Zone.
Empty this area at the beginning of the ETL process. Data types will be adjusted to fit the receiving database system, and it is a good idea to use a wide data type, e.g. when a source provides a string of length 6 or 50 characters, I will make it 100 characters to be prepared for a later extension in the source system. When a date format can’t be agreed for a csv file, e.g. 05/08/2014 (what is the order of day and month, mm/dd/yyyy or dd/mm/yyyy?), or an amount format uses a period or a comma as decimal and thousand separator, e.g. 1,234.56 or 1.234,56, and the use of field quote, field terminator, text qualifier and row delimiter is unclear, I recommend loading each column into a string data type, e.g. varchar(100), so the data is in a table, and then doing a try_cast data conversion to a new column with the appropriate data type, e.g. Date, Decimal(19, 4), Currency, Money or Integer. I prefer a Tab-delimited text UTF-8 (65001) file where the list separator is <Tabular>, Chr(9), \t. Study Schema.ini and XML Schema Definition XSD.
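The load-as-string-then-convert approach can be sketched in Python. The helper names try_cast_decimal and try_cast_date are my own invention, mimicking the behaviour of SQL Server's TRY_CAST, which returns null instead of failing:

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def try_cast_decimal(text, thousand_sep=",", decimal_sep="."):
    """Mimic SQL TRY_CAST: return a Decimal, or None when conversion fails."""
    try:
        normalized = text.replace(thousand_sep, "").replace(decimal_sep, ".")
        return Decimal(normalized)
    except (InvalidOperation, AttributeError):
        return None

def try_cast_date(text, fmt="%d/%m/%Y"):
    """Return an ISO date string, or None when the text does not match the agreed format."""
    try:
        return datetime.strptime(text, fmt).date().isoformat()
    except (ValueError, TypeError):
        return None

print(try_cast_decimal("1.234,56", thousand_sep=".", decimal_sep=","))  # 1234.56
print(try_cast_decimal("abc"))       # None
print(try_cast_date("05/08/2014"))   # 2014-08-05 when the agreed order is dd/mm
```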
Do reconciling between the data warehouse and the source systems with reconciliation of row counts and sums of values, mark data as reconciled, and produce an auditing report (to reconcile in danish at afstemme, stemmer overens). Maybe a receipt system where IDA tells the source systems that data has been received. Data might be wrong in the input data area or in the archive and the source system will later resend a new revision of data, therefore it is important to create output views that filter wrong data away and extract correct data to the next layer of the data warehouse. Read more about data reconciliation in section 1.4.
For the data lineage most tables in the Input data area will include metadata columns: RecordSource as a reference back to the source system, e.g. "Dynamics365.Sales.Product"; RecordSourceId as the data origin primary key column from the source system; IdaBatchDataCaptureId, where all rows in all tables use the same identification number, together with IdaBatchDataCaptureTime as a received date and time, to represent a snapshot of the total amount of data as it looked in the source system; the ETL jobs/packages/transformations/sql reuse the same id and time for all data rows, and they can be used later for keeping history; and IdaInsertTime (datetime2(7) default sysdatetime(), not unique per row) set when the fetching is done. Sometimes a source system has a Changed column with the date/time of first insert or latest update of the row/record that can be used for later history. Read more about data lineage in section 1.6.
Exchanging data from a source system to IDA can be in different formats, e.g. the JSON data-interchange format; since I prefer storing data for IDA in a relational database with tables, rows, columns and data types, I must parse the json format.
Kimball recommends that source systems express data at the lowest detail possible for maximum flexibility and extensibility, providing the data warehouse with simplicity and accessibility; it is the data warehouse that makes summary data, not the source system. IDA is a mirror of the source.
I recommend for each source system to create its own Input data area, e.g. IDA_Dynamics365, IDA_Salesforce, IDA_ServiceNow, IDA_Proteus, to keep the source data separate. Each IDA has metadata for scheduled expected delivery with dates, filenames or tablenames, estimated number of rows, receiving time, row count and flags like IsReceived, IsCompleted, IsReady, together with different control tables, reconciliation tables and staging tables to make data ready for the archive area. A nightmare is 25 csv files that occasionally change filenames, columnnames, data types and content. Create a File handling area with a landing folder for source gz files, an unzip folder for source csv files, and an archive folder to keep the original source files; the folders can be divided into delivery folders with a date or a batch number. The contents of the files must be checked for data quality before they are loaded to the archive area.
In case you would like to open a 20 GB csv file and delete the last line that is corrupt, use Sublime Text; it is free with no registration, only a long wait for a huge file.
I recommend for each source system to create its own ETL process (job with packages) to extract, maybe do a little transformation, and load data into an Input data area and possibly further on to an Archive area. When a source system is bad or down, data must still be loaded from the other source systems. When a daily source data delivery is bad, the next day’s delivery must still be able to be loaded.
Archive area - ARA
From Input data area to Archive area for permanent storage of source data as the data basis for a data warehouse. This area is also called Operational Data Store (ODS), Persistent Staging Area (PSA), Persistence Layer, Detail Data Store, Data Repository or History.
Never empty this area, because it archives time-variant source data and retains historical value changes from the source system. Simple data adjustments can be done to gain the same date and amount format and the same data representation of e.g. a social security number, but they will be in new columns so the original values stay unchanged.
For the data lineage most tables in the Archive area will include metadata columns: RecordSource, RecordSourceId, IdaBatchDataCaptureId and IdaInsertTime, plus ArcBatchDataCaptureId, ArcBatchDataCaptureTime, ArcStorage, ArcInsertTime (datetime2(7) default sysdatetime(), not unique per row), ArcTs (timestamp, rowversion, unique per row) and ArcRecordId, a unique sequence number per row per table to obtain data lineage and traceability back to the archive; that column can also be used to fetch and pull delta data out of the archive that has not yet been loaded into the data warehouse. ArcGlobalId is a unique sequence number per row across all tables in the archive.
Archive area can also be called ADA for Archive data area or Active data archive, not to be confused with ADA for Advanced data analysis or Automated data analysis. Microsoft uses the term Landing Zone for the place where source data lands and is never deleted. Archive is a versioning of the source.
Data staging area - DSA
From Archive area to Data staging area by extraction to prepare data cleansing, data cleaning and data scrubbing: trim strings, handle unicode strings and maximum string lengths, give null a value such as 0 or the empty string '', replace illegal characters, replace values, verify data types and do data conversion to a suitable data type, e.g. a csv file contains the value 1.0E7 in scientific notation and it will be converted to the value 10000000 and saved in a column in a table. Data syntax and date formats like mm/dd/yy or dd.mm.yyyy will be converted to yyyy-mm-dd; correct misspellings, fix impossible values, and resolve punctuation and spelling differences to achieve a common format, notation and representation; validate data, correct data, and do data deduplication by selecting one of the duplicate rows and moving the removed duplicates to a wrong-data table.
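A minimal sketch of a few of these cleansing rules in Python; the function names are hypothetical, and the date formats tried are the ones mentioned above plus ISO:

```python
import re
from datetime import datetime

def cleanse(value, max_len=100):
    """Basic scrubbing: null becomes the empty string, strings are trimmed,
    illegal control characters are removed and a maximum length is enforced."""
    if value is None:
        return ""
    text = str(value).strip()
    text = re.sub(r"[\x00-\x1f]", "", text)  # remove illegal control characters
    return text[:max_len]

def to_iso_date(text):
    """Convert dd.mm.yyyy or mm/dd/yy style dates to yyyy-mm-dd (assumed formats)."""
    for fmt in ("%d.%m.%Y", "%m/%d/%y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    return None  # left for the wrong-data handling

print(cleanse("  Ann\x07  "))     # 'Ann'
print(to_iso_date("24.12.2023"))  # '2023-12-24'
```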
Remember to do auditing reporting after a validation check.
The data cleansing and data integration process with multiple source systems makes data cleaned and conformed, e.g. a gender code from one source with »Man, Women« and another with »M, F« will be conformed to »Male, Female« through a mapping of source data. Data enrichment follows business rules; identify dimension data and fact data with derived values, e.g. total cost and revenue, elapsed time and overdue time. In DSA the data can be scrubbed, checked for redundancy and checked for compliance with business rules before entering the data warehouse. This is the transformation of data.
Empty this area at the beginning of the ETL process, because data is only transient in this layer. When an archive is stored on another server it is common to load data from the archive into a data staging area, e.g. data for a dimension of type 2 or 7, and from the stage table do the merge or insert/update to the dimension table in the data mart area. I like to design a stage table to suit the target table structure in a data mart rather than the source, e.g. translate names of columns and data types, and merge or divide columns, like a Name becoming the two columns Firstname and Surname.
I like to use computed columns in a stage table for calculations, string manipulation and a hashbyte value as a comparison column to compare data with a dimension table in the data mart area. A DSA database will contain different auxiliary tables with names like stg, stage, tmp, temp or temporary, and with multiple stage tables leading to the target table they can have an additional number, e.g. Stg_Customer_Order_1, Stg_Customer_Order_2; I have seen the last stage table before the target table using 0, as in Stg_Customer_Order_0.
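The hashbyte comparison column idea can be sketched like this; instead of T-SQL HASHBYTES the sketch uses Python's hashlib, and the column separator is an assumption chosen to avoid accidental collisions between concatenated values:

```python
import hashlib

def row_hash(*columns):
    """Mimic a HASHBYTES comparison column: concatenate the column values with
    a separator unlikely to occur in data, then hash, so two rows can be
    compared on one value instead of column by column."""
    payload = "|".join("" if c is None else str(c) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

stage_row = ("C-1001", "Ann", "Copenhagen")
dim_row   = ("C-1001", "Ann", "Aarhus")
print(row_hash(*stage_row) == row_hash(*dim_row))  # False, so the dimension row must be updated
```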
DSA will perform data quality checks and filter wrong and invalid data into a quality assurance and quality control (QA/QC) database. DSA continues reconciling data from the archive for referential integrity, i.e. a foreign key value exists as a primary key value, and for relationship cardinality rules for dependencies as mandatory, optional or contingent, to make sure whether there is a foreign key or not, whether it can be null or needs a value. Many other value checks, summations and so on serve as a validation check of data. When all tables in this area are loaded with correct data, the DSA is successfully completed. A source system’s new date or changed date, or the IdaInsertTime that represents a batch of insertion, marks all data with that batch time no matter which source system the data is coming from. In case of ValidFrom and ValidTo, all ValidFrom values will use the same batch time, which is useful for joins between tables and between source systems to fetch the right data snapshot at a particular time back in history. Read more about data quality in section 1.5.
Furthermore, derived columns and calculations are made ready for loading to the next layer of the data warehouse. In case of an error in the ETL process or in important data, a halt condition is raised to stop the ETL process. Other times data will pass by with a flag in an audit dimension. Kimball has called it data wrangling to lasso the data and get it under control (data munging). DSA can also be called Work area, Prepare area or just Staging, Stage or Stg.
For the data lineage most tables in the Data staging area will include metadata columns from the Archive area as RecordSource, ArcStorage and ArcRecordId.
Kimball calls this layer the ETL-system or the first system that is off limits to all final data warehouse clients, and he uses an analogy: »The staging area is exactly like the kitchen in a restaurant. The kitchen is a busy, even dangerous, place filled with sharp knives and hot liquids. The cooks are busy, focused on the task of preparing the food. It just isn’t appropriate to allow diners into a professional kitchen or allow the cooks to be distracted with the very separate issues of the fine dining experience.« (In danish this layer is called data forberedelsesområde eller data rangeringsområde til data vask, data udsøgning, data berigelse, data behandling, data beregning, data bearbejdning, data sammenlægning, sammenstilling, samkøring, data er blevet vasket og strøget.)
Kimball's ETL uses DSA as a temporary space to create the data warehouse as a collection of data marts. Inmon's ETL uses a relational 3NF ODS Operational Data Store with validated referential integrity for operational reporting and as a source of data for the Enterprise Data Warehouse (EDW), which feeds the Corporate Information Factory.
Data mart area - DMA
From Data staging area to Data mart area with dimensional modeling, conformed and shared dimension tables, and a star schema around each fact table; assign a surrogate key (artificial key, identity column, a unique sequence number) for each dimension and use it in the fact as a foreign key. When multiple fact tables share dimensions it is called a constellation schema or multi-star schema.
Keep historical data in dimensions and facts with the timeline columns ValidFrom and ValidTo. Make it easy for users to search for current data through a view, and to give a historical date to a table-valued function to fetch dimension data and fact data at any point of time. A data mart can have multiple fact tables with different granularities. One or multiple fact tables can create an extra derived fact table with special calculations and search filter criteria to enrich data to match the business requirements specification. Make a conformed name for a column with a conformed data type; for example, when you merge customer addresses from multiple source systems, make the string extra long to fit all address characters, and avoid making a value mandatory or set the default to the empty string to ensure a robust ETL process.
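The point-in-time lookup with ValidFrom and ValidTo can be sketched as a small function playing the role of the table-valued function mentioned above; the rows, column layout and the open upper bound convention are assumptions for illustration:

```python
from datetime import date

# Hypothetical type 2 dimension rows: (Customer_key, CustomerNumber, City, ValidFrom, ValidTo).
# A row is valid when ValidFrom <= day < ValidTo (open upper bound).
dim_customer = [
    (1, "C-1001", "Copenhagen", date(2015, 1, 1), date(2019, 1, 1)),
    (2, "C-1001", "Aarhus",     date(2019, 1, 1), date(9999, 12, 31)),
]

def rows_at(table, day):
    """Like a table-valued function taking a historical date: return the rows
    that were valid at that point in time."""
    return [r for r in table if r[3] <= day < r[4]]

print(rows_at(dim_customer, date(2018, 6, 1)))  # the Copenhagen row
print(rows_at(dim_customer, date(2020, 6, 1)))  # the Aarhus row
```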
Never empty this area and always back up before the ETL process. The data mart is a front-room for publishing the organization’s data assets to effectively support improved business decision making. A data mart contains one data area for one purpose and is subject-oriented, therefore a data warehouse will consist of many data marts and they will have some common dimensions. To avoid maintaining the same dimension in multiple data marts, consider having a dimension data mart that shares its tables through views inside the other data marts. (In danish this layer is called behandlet lag, tilrettelagt lag, eller klargjort lag der står for klargøring af data).
For the data lineage most tables in the Data mart area will include metadata columns from the Archive area as RecordSource, ArcStorage and ArcRecordId.
Kimball calls this layer the dimensional presentation area, and he uses the term first-level for a data mart containing data from only one source system and second-level or consolidated for a data mart with multiple sources crossing business processes. Read more about data mart modeling in section 1.3.7.
Presentation interface area - PIA
From Data mart area to Presentation interface area through data access tools like Reporting Services (SSRS) and Microsoft Access, or data visualization tools like Tableau, QlikView, Qlik Sense and Power BI that can import data from a dimensional schema and handle in-memory calculated KPIs as quick measures using filters.
An OLAP cube or a Tabular model loads data from a data mart, and the processing calculates KPIs and displays a pivot in Excel, Targit or Power BI.
A report with criteria parameters searches, selects and calculates data from a data mart based on an ad hoc query of a user, or the report is sent out as a pdf or excel file to business users every day, week, month and so on.
A dashboard is a collection of unrelated data intended to summarize the state of the business during a given time frame.
A simple dashboard sample:
Data mining and machine learning will also use data from a data mart. Percentages (fractions) and ratios based on a dimension slice will be calculated »on the fly« in a BI tool when a fact table contains a numerator and a denominator, e.g. with the DAX (Data Analysis Expressions) query language for a tabular model in Power BI. It is best to use a star schema. DAX for SQL users and DAX tutorial.
Average price = DIVIDE(SUMX('Fact_Sales', 'Fact_Sales'[UnitPrice] * 'Fact_Sales'[Quantity]), SUM('Fact_Sales'[Quantity]), 0)
Active customers =
VAR measurevalue =
    COUNTROWS(FILTER('Fact_Sales'; 'Fact_Sales'[EndDate] = BLANK() &&
        'Fact_Sales'[Amount] > 0))
RETURN IF(ISBLANK(measurevalue); 0; measurevalue)
Kimball calls this layer Business Intelligence Application.
A date range lookup in a Kimball type 2 dimension when fact only has a business key (_bkey):
CampaignName = CALCULATE(VALUES('Dim_Campaign'[CampaignName]), FILTER(
'Dim_Campaign', 'Dim_Campaign'[Campaign_bkey] = 'Fact_Product'[Campaign_bkey] &&
'Dim_Campaign'[ValidFrom] <= 'Fact_Product'[PurchaseDate] &&  // PurchaseDate is an assumed fact date column
'Fact_Product'[PurchaseDate] < 'Dim_Campaign'[ValidTo]))
Supporting databases (not part of the data layers, introduced by J. Dalby)
When custom data in the Usage database is going to be changed, it can be done in a UAT (user acceptance testing) environment. After the ETL process has been executed we can test the data warehouse. Later we can take a backup of the Usage database and restore it in the production environment.
Data integration mapping by abbreviations example
A Usage database has two tables for sales devices from two operational systems, and each table has its own business key columns labelled _bkey. I have invented an immutable surrogate business key column labelled _sbkey, which contains abbreviations as a simple letter code that can be used in a dimension in a data mart as an easy search criterion for a user query and a drop-down box in a dashboard or report. The _sbkey is called a retained key in SAS and a persistent durable supernatural key by Kimball (page 101). The _sbkey is mapped to the business keys, which makes it stand for an integration and merge to a conformed dimension.
Table SalesDevice_Map with a surrogate business key _sbkey as the primary key merging all devices from retailer shops and from online shop to a current name and a sort order and a category or classification or grouping called DeviceModel. The table SalesDevice_Map will be updated by a data steward in the Usage database and the ETL process will flow the data to a SalesDevice dimension in a data mart:
A retailer shop has different sales devices over time. SalesDevice_Map_Retailer table has the business key _bkey from different source systems mapped to the relevant surrogate business key _sbkey and timeline columns ValidFrom and ValidTo:
An online shop has different sales devices over time. The SalesDevice_Map_Online table has the business key _bkey from different source systems mapped to the relevant surrogate business key _sbkey and the timeline columns ValidFrom and ValidTo.
We see that the sales devices PC and Mac have different business key values in the source systems compared to the retailer mapping table, and over time sales devices have changed business keys in a source system from a letter code to a number value. Above, business key 10 represents two different sales devices, first a Tablet computer and later a Personal computer.
For example, above, business key value 1 has two mappings: ECR for a retailer Electronic cash register and IPD for an online user iPad. For transaction sales data the ETL process will do a lookup in the relevant map table to fetch the _sbkey value and another lookup in the SalesDevice dimension to fetch the key value to be saved within the fact row.
Table SalesDevice_Dim type 1 dimension with its own primary key labelled _key:
An example of a view that binds the map table and dim table together for an ETL process, so retailer transaction sales data can do a date range lookup to fetch the right SalesDevice_key value through the mapping to the surrogate business key:
CREATE VIEW mapping.SalesDevice_Retailer_Lkp AS
SELECT map.SalesDevice_bkey, dim.SalesDevice_key, map.ValidFrom, map.ValidTo
FROM usage.SalesDevice_Map_Retailer map
INNER JOIN dma.SalesDevice_Dim dim
  ON dim.SalesDevice_sbkey = map.SalesDevice_sbkey
Retailer transaction sales data joins to the lookup view to let its business key fetch the dimension key and insert data into a fact table. For the ETL process: Extract as incremental load with delta data detection in a where clause, Transform the business key SalesDeviceNumber to the dimension key SalesDevice_key, and Load new data rows into the fact table Sales_Fact with SalesDevice_key as a reference to the SalesDevice dimension:
INSERT INTO dma.Sales_Fact (PurchaseDate_key, SalesDevice_key, Amount, TransactionId)
SELECT
  trn.PurchaseDate, --date format same as the Date dimension role-playing Purchase.
  SalesDevice_key = ISNULL(sdrl.SalesDevice_key,-2), --in case of no match in join.
  trn.Amount,
  trn.TransactionId
FROM source.RetailerTransactionSales trn
LEFT OUTER JOIN mapping.SalesDevice_Retailer_Lkp sdrl
  ON sdrl.SalesDevice_bkey = ISNULL(trn.SalesDeviceNumber,-1) AND
     sdrl.ValidFrom <= trn.PurchaseDate AND sdrl.ValidTo > trn.PurchaseDate
WHERE trn.TransactionId > (SELECT MAX(TransactionId) FROM dma.Sales_Fact)
Horoscope star signs as a static dimension example
A Usage database has a table with names of horoscopes as a natural key, an abbreviation as a business key labelled _bkey, and I am using the planet/house cusp number as a surrogate key labelled _sbkey. Element becomes a grouping of the twelve signs and the first level of the hierarchy Element→Horoscope:
1.2. Data capture or data ingestion to input data area
Data capture or data ingestion (in danish datafangst, hjemtagelse, indtage, modtagelse) of data from a source system to an input data area is based on two data flow directions:
The granularity of the data capture integration strategy must be considered when the amount of data in a source system is huge. A data warehouse prefers to receive data at the lowest granularity level of detail for specific analysis usage or data validation, but sometimes it is necessary to aggregate and summarize source data to a higher granularity, like per day and per customer segment, for transaction data from an OLTP system, e.g. orders, invoices, billings, payments, site hits. Non-transactional data, e.g. Customer, Location, Contact, Supplier, Part, Product, can be stored in a Master Data Management (MDM) database as master tables shared with multiple operational systems in the organization. MDM is an extension of the Usage supporting database.
Data capture from a source system is based on multiple data delivery forms:
Data capture from a source system is based on two data delivery methods:
EntryDate > »the last received EntryDate«.
(in danish Posteringsdato eller Bogføringsdato).
It is important to log the number of rows from a source system to the Input data area for a later reconciliation count audit (in danish kontroloptælling, afstemning), in case a source system does not deliver the expected number of rows, and to monitor the number of rows over time to see if it increases as expected and to use it in a graph over the amount of data in the data warehouse. Other data values, e.g. amount, quantity and volume, can also be logged and monitored. I recommend that the source system tells the number of rows per table, which will be compared to the saved number of rows in the Input data area in an audit trail. It can be implemented as a view in a source system like this:
CREATE VIEW BI_Audit_NumberOfRows AS
SELECT 'Northwind' AS Sourcename, 'Customers' AS Tablename, COUNT(*) AS NumberOfRows
FROM dbo.Customers
UNION ALL
SELECT 'Northwind' AS Sourcename, 'Orders' AS Tablename, COUNT(*) AS NumberOfRows
FROM dbo.Orders
UNION ALL
SELECT 'Northwind' AS Sourcename, 'Order Details' AS Tablename, COUNT(*) AS NumberOfRows
FROM dbo.[Order Details]
The data warehouse can do its own delta data detection between the Input data area and the Archive area to identify new data, changed data and deleted data; maintaining historical data with ValidFrom and ValidTo datetimestamps is handled by the following actions:
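A sketch of such delta data detection, assuming a full snapshot in the Input data area keyed by business key; the classification feeds the ValidFrom/ValidTo maintenance, where new rows are inserted, changed rows close the old version and insert a new one, and deleted rows are closed:

```python
def delta_detect(source, archive):
    """Compare a full source snapshot with the current archive rows, both given
    as dicts of business key -> attribute tuple, and classify the differences."""
    new     = [k for k in source if k not in archive]               # insert with ValidFrom = batch time
    changed = [k for k in source
               if k in archive and source[k] != archive[k]]         # close old row, insert new version
    deleted = [k for k in archive if k not in source]               # close row with ValidTo = batch time
    return new, changed, deleted

archive = {"C-1001": ("Ann", "Copenhagen"), "C-1002": ("Bo", "Odense")}
source  = {"C-1001": ("Ann", "Aarhus"),     "C-1003": ("Eva", "Esbjerg")}
print(delta_detect(source, archive))  # (['C-1003'], ['C-1001'], ['C-1002'])
```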
Sometimes the source system provides flag information to pass a status to a data warehouse. I have seen combination of consistent statuses like these:
Microsoft Windows calls it Add/Remove Programs in Control Panel (ARP).
Change Data Capture (CDC) is a method to log data in a table when data is changed, by transferring a copy of the data row from the table (or the table log) to another log table. Changed data represents inserted data, updated data and deleted data. (In danish Data ændringsdetektion d.v.s at opsnappe eller at spore at data har ændret sig i et andet system.)
Sometimes a source system updates a data row multiple times but only the most recent version goes to the data warehouse. If the source system inserts and updates a row before a new load to the data warehouse, only the updated version goes to the data warehouse. If the source system inserts a new row and deletes the same row before a new load, the data will never reach the data warehouse. It is seldom that a source system has a log of all changes with a revision number or ValidFrom and ValidTo datetimestamps, therefore a data warehouse does not contain a 100% full history of data updates in an organization. Sometimes a source system has, for each (or selected) table in the database, an extra historical log table (shadow table) that contains all inserted, updated and deleted data rows from the original table, together with an Action column (values Insert, Update, Delete), an EntryDate datetimestamp column that marks when the action occurred, and a SequenceId that is used for delta data detection to pull data from the source log to the Input data area.
An example of a source historical log table with rows every time a customer change data:
Implemented as after-insert/update/delete triggers in source system.
For example, Jane Macquarie was sales manager for the eastern region until December 31, 2018, and then took responsibility for the western region from January 1, 2019. How is that registered in the source system? I have seen systems without a date to register the shift, because an operator will update the responsibility at the beginning of January, which means the data warehouse can give Jane a ValidFrom date of 2019-01-05 for the western region, but that is a wrong picture, and sales fact data for a few days in 2019 will belong to the eastern region. Please always ask a source system for multiple date columns and use them for ValidFrom and ValidTo to get a real picture of the business in the archive area and the data mart area.
Data latency describes how quickly source data has to be ready in the data warehouse for the business users to do their reporting. The deadline is normally in the morning, based on a periodic batch of source data from yesterday. For a real-time load to a data warehouse with constantly new presentation in a dashboard, the ETL process must be streaming oriented, where source data continuously flows into the data warehouse to do transformation and make conformed data.
1.3. Enterprise Data Warehouse modeling architecture - EDW
A data warehouse can be modeled in different ways; I will present three data models. In a data warehouse solution the data from the source systems can go into a database called the Enterprise Data Warehouse (EDW), a common data area before the Data mart area (DMA). Notice: model the EDW based on the business requirements and not on a source system’s data structure.
1.3.1. Bill Inmon modeling - Corporate Information Factory model
When an EDW is modeled after Bill Inmon it is based on Peter Chen’s Entity Relationship data modeling with super-sub entities, associative entities and roughly 80% normalized data. The EDW offers integrated, granular, historical and stable data that has not yet been modified for a concrete usage and can therefore be seen as neutral. It can keep historical data, meaning all the changes to the data or only the day’s-end status, e.g. End Of The Day, for each data revision from the Archive area or Input data area. An entity Person can have a one-to-many relationship to the addresses of the person’s living places in a PersonAddress entity with ValidFrom and ValidTo columns, and these entities or tables will be merged into one type 2 dimension table according to Kimball in a data mart. The modeling is subject-oriented, meaning all business processes for each subject, e.g. customer, need to be modelled. An EDW common data model is not technical; it must be based on (logical) business processes. Since all source system data is integrated and collected into one common database, the EDW contains and represents a »single version of truth« for the enterprise.
Inmon modeling can divide or partition data into super-sub tables to place related columns in their own sub table, separate from other columns. There is often a one-to-one relationship between a super table and its sub tables. We can say that this kind of division of data is subject-oriented. A super table can contain a business key (natural key, surrogate key) from a source system together with other basic data columns, or instead we place the business key in a mapping table connected to each super table, where a translation of a business key becomes a data warehouse surrogate key Id as an integer, guid or hashbyte value, and we let the Id become the primary key in the super table and a foreign key in the sub tables. When a source system adds new data, the data warehouse can place the data in a new sub table and connect it to the existing super table; similarly when a new data source system is introduced to the data warehouse. Inmon modeling using super-sub tables allows for agile modeling, see 1.3.4. Also called super-sub-entity or supertype/subtype, it has two rules:
· Overlapping or inclusive e.g. a person may be an employee or a customer or both.
· Non-overlapping, mutually exclusive or disjoint e.g. a patient can either be outpatient or resident patient, but not both at the same time.
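The super-sub division with the overlapping rule can be sketched like this; the Person/Employee/Customer tables and their columns are hypothetical, using SQLite for illustration:

```python
import sqlite3

# Super table Person holds the common columns and the Id; each sub table holds
# subject-specific columns in a one-to-one relationship with the super table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Person   (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Employee (Id INTEGER PRIMARY KEY REFERENCES Person(Id), JobFunction TEXT);
CREATE TABLE Customer (Id INTEGER PRIMARY KEY REFERENCES Person(Id), CustomerNumber TEXT);
-- Overlapping rule: the same person may appear in both sub tables.
INSERT INTO Person   VALUES (1, 'Ann');
INSERT INTO Employee VALUES (1, 'Engineer');
INSERT INTO Customer VALUES (1, 'C-1001');
""")
row = con.execute("""
SELECT p.Name, e.JobFunction, c.CustomerNumber
FROM Person p LEFT JOIN Employee e ON e.Id = p.Id
              LEFT JOIN Customer c ON c.Id = p.Id
""").fetchone()
print(row)  # ('Ann', 'Engineer', 'C-1001')
```

For the non-overlapping rule, the ETL process or a constraint would instead have to ensure an Id occurs in at most one sub table.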
Let the design of the data model for an EDW start with a »helicopter view« for identification of entities, e.g. salesperson or engineer will be stored in an Employee entity with a one-to-many relationship to a Jobfunction entity, or maybe do more abstraction in a bigger picture, making a Person entity where employees and customers can be stored together. Later in the design process move on to a »weeds view« for all the details of columns, conformed names, data types etc.
In the EDW keep the source data with time span columns like Start date and Stop date, and sometimes also convert the source data to row-wise entities as a kind of transactional data, a sequence of information exchange like financial, logistical or work-related data, involving everything from a purchase order that becomes an invoice with multiple invoice line items, to shipping status, employee hours worked, plan and activity records, subscription periods, and insurance costs and claims.
The EDW will be the source for one or multiple data marts using Dimensional modeling with denormalized data controlled by the ETL process. Or maybe a data mart uses a different modeling that fits the use of data better. An enterprise data warehouse should be accountable and auditable, which by default means pushing business rules of changing/altering data downstream to »between the data warehouse and the data marts«. EDW tables and views can be the data source for tabular modeling with DAX formulas to filter and calculate in-memory KPIs to be visualized in Power BI; the data mart area is then skipped or replaced by EDW views, therefore it is called an ELT process because the transformations are executed on the way to the user.
1.3.2. Anchor modeling
When an EDW is modeled after Anchor modeling it is based on four elements that hold historical and time-variant raw data in entities/tables with the labels:
Anchor is a surrogate key generator for the source system business key from a data warehouse point of view.
Attribute has descriptive values from a source system connected to the business key, therefore an attribute has a foreign key reference back to the anchor. The attribute table is type 2 historic, therefore it also has a surrogate key. A view will mix anchor and attribute and it becomes a type 7 dimension because the surrogate key of the anchor becomes a durable key and the surrogate key of the attribute becomes a type 2 surrogate key. Both keys will be added to a fact, and the fact can join to a most recent dimension to show the current value or join to a historical dimension to show the registered value when the fact data occurred with its transaction date. The model in Power BI can choose the content for its presentation interface because all data is served on a silver platter.
Knot is a lookup table with basic data.
Tie is a relationship table between anchor tables to handle one-to-one, one-to-many and many-to-many relationships; therefore there is no foreign key in an anchor or attribute table except to a knot table (»tie the knot« means getting married).
Anchor modeling has only extension with new tables and no modification of existing tables. This ensures that existing data warehouse applications will remain unaffected by the evolution of the data warehouse. When a business key exists in multiple source systems, there will be one common anchor table and several attribute tables, because each source system has its own attribute table.
Anchor modeling allows building a real-time ETL process to a data warehouse where some attribute tables need a more frequent update than others to provide fresh data to the users in a report or an online dashboard. Other attribute tables only need updating per hour or at end-of-day.
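As a minimal sketch of the type 7 idea described above, the anchor's surrogate key can act as a durable key and each attribute version as a type 2 row. Table and column names here are illustrative, not from a specific implementation:

```python
# Mix an anchor (durable key) with its type 2 attribute rows into a
# type 7 dimension: every attribute version becomes one dimension row.
def type7_dimension(anchor_rows, attribute_rows):
    dim = []
    for anchor in anchor_rows:
        versions = [a for a in attribute_rows
                    if a["AnchorKey"] == anchor["AnchorKey"]]
        for attr in versions:
            dim.append({"DurableKey": anchor["AnchorKey"],     # join for the current value
                        "SurrogateKey": attr["AttributeKey"],  # join for the registered value
                        "BusinessKey": anchor["BusinessKey"],
                        "Value": attr["Value"],
                        "FromDate": attr["FromDate"]})
    return dim

anchor_rows = [{"AnchorKey": 1, "BusinessKey": "P-1001"}]
attribute_rows = [{"AttributeKey": 10, "AnchorKey": 1, "Value": "Bike",   "FromDate": "2023-01-01"},
                  {"AttributeKey": 11, "AnchorKey": 1, "Value": "E-bike", "FromDate": "2024-01-01"}]
dim = type7_dimension(anchor_rows, attribute_rows)
print(len(dim))  # 2 dimension rows sharing DurableKey 1
```

A fact joining on DurableKey sees the most recent value; a fact joining on SurrogateKey sees the value registered when the fact occurred.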
1.3.3. Data vault modeling
When an EDW is modeled after Data vault modeling it is based on component parts of a Core Business Concept (CBC) as an ensemble consisting of three components that hold historical and time-variant raw data in entities/tables with these labels: Hub is used to store a business key, Link is for a relationship between hubs and Satellite contains the data and descriptive information. They have a LoadDate column (Load_dts, LoadDateTime, ValidFrom) to show when the data row was entered.
Each component can be drawn with its own color. Let me elaborate with some examples:
Hub (blue) separates the business keys from the rest of the model and will translate a business key to a unique hashbyte key value. A composite business key of multiple columns will also become one hashbyte key column. E.g. a HubProduct has ProductHashKey together with the business key ProductNumber. A hub is an integration point for a business key or a unique identifier and will never change. A hub exists together with at least one satellite. A data vault table has a RecordSource column as a reference back to the source system for data lineage, e.g. "Dynamics365.Sales.Product", and it can be a multi-valued column with multiple sources.
Avoid making a composite business key of the business key and RecordSource; use instead a Same-as-link table (mapping). (In Danish: a hub is a gathering point for one business key whose data is distributed over multiple satellites that orbit a hub or a link.)
Satellite (yellow) [sat] stores the context, descriptive data and measure values in columns of either a hub or a link. A satellite is connected to one hub or one link, and a hub or a link can have multiple satellites. E.g. a SatProduct with foreign key ProductHashKey from HubProduct and data values for Name, Category, Target group etc. Satellites have all the relevant data for the data warehouse. A satellite is history-tracking, and to handle historical data in a satellite the primary key is a composite of HashKey+LoadDate. A hash difference HashDiff column is a checksum of all data value columns, making an easy comparison for Kimball type 2 history, see more in section 4.3.
To be 100% insert compliant there are no columns for EndDateTime, ValidTo or an IsCurrent flag, therefore no updates of rows in a satellite. If a source system can tell that a product has expired or been deleted, then a new row is inserted into SatProduct with the current date in the column DeletedDate.
One hub or one link will usually have several satellites associated, because we regroup data into multiple satellites by classifications and types of data and information and by rate of change, so each of them can have its own granularity and timeline. Split logical groups of data into multiple satellites. With multiple source systems the RecordSource column is helpful when the data sets are similar; else let each source system have its own satellite including a RecordSource column and share the HashKey from the hub with all its satellites.
Employee satellites example
We start with an employee hub with a business key column EmployeeNumber that is translated to a column EmployeeHashKey that will be in all satellite (sat) tables:
SatEmployeeBasic(SSN, BirthDate, MaidenName) constants/correction overwrites.
SatEmployeeName(FirstName, MiddleName, LastName, Gender).
SatEmployeeContact(CellPhoneNumber, Email, Skype, Twitter, Facetime, Whatsapp).
SatEmployeeAddress with columns for address or a reference to an Address table.
SatEmployeeJob(FromDate, ToDate, Department, Title).
SatEmployeeChristmasGift(Year, NameOfGift, AmountOfGift, LevelOfSatisfaction).
The above data comes from multiple source systems and data is added independently of each other to the individual satellites (divide and conquer). There will be outer join from a hub to its satellites and inner join from a satellite to its hub.
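The insert-only satellite load with a HashDiff comparison can be sketched like this (hypothetical column names; the normalization rules before hashing would follow what is agreed with the business):

```python
import hashlib

def hash_diff(row, columns):
    """Checksum of all data value columns for easy type 2 comparison."""
    values = ";".join(str(row.get(c, "")).strip() for c in columns)
    return hashlib.sha256(values.encode("utf-8")).hexdigest()

def load_satellite(satellite, incoming, key, columns, load_date):
    """Insert a new row only when the HashDiff differs from the newest
    row for the same hash key -- no updates, 100% insert compliant."""
    new_diff = hash_diff(incoming, columns)
    current = [r for r in satellite if r[key] == incoming[key]]
    newest = max(current, key=lambda r: r["LoadDate"], default=None)
    if newest is None or newest["HashDiff"] != new_diff:
        satellite.append({key: incoming[key], "LoadDate": load_date,
                          "HashDiff": new_diff,
                          **{c: incoming.get(c) for c in columns}})

sat_employee_name = []
cols = ["FirstName", "LastName"]
row = {"EmployeeHashKey": "abc", "FirstName": "Ann", "LastName": "Smith"}
load_satellite(sat_employee_name, row, "EmployeeHashKey", cols, "2024-01-01")
load_satellite(sat_employee_name, row, "EmployeeHashKey", cols, "2024-01-02")  # unchanged: no insert
row["LastName"] = "Jones"  # name change, e.g. after marriage
load_satellite(sat_employee_name, row, "EmployeeHashKey", cols, "2024-01-03")
print(len(sat_employee_name))  # 2 rows: the original and the changed name
```

The unchanged delivery on 2024-01-02 produces no new row, which is exactly why a HashDiff makes the type 2 comparison cheap.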
Conformed and calculated data is placed in its own Computed Satellite. Other satellites are: Overloaded satellite, Multi-active satellite, Status tracking satellite, Effectivity satellite and Record tracking satellite.
Link (green) integrates and captures relationships between hubs and links. E.g. a product is placed in a store, therefore we have a HubProduct and a HubStore with data values in a SatProduct and a SatStore, and a LinkProductStore represents the placement by combining ProductHashKey and StoreHashKey as foreign keys from the hubs to capture in which stores a product is placed and for sale. A link creates its own hashkey as a unique combination of the involved hub business keys, e.g. ProductStoreHashKey from ProductNumber and StoreCode. Data values connected to a link are placed in its own satellite, e.g. SatProductStore with primary key/foreign key ProductStoreHashKey and data values for PlacementInStore and QuantityInStore. When a product no longer exists in a store, the ETL process will insert a new row in SatProductStore with the expiration date in the column DeletedDate.
A link handles one-to-one, one-to-many and many-to-many relationships because a data vault treats every relationship as an optional many-to-many. A link is a dataless and timeless connection among hub business keys and is not a data-bearing relationship as in an ERD.
A satellite on a link represents the history of the connection or the relationship.
Product Supplier Category link example
A Product table with foreign keys SupplierID and CategoryID will often be put together into a LinkProductSupplierCategory with ProductHashKey, SupplierHashKey and CategoryHashKey from HubProduct, HubSupplier and HubCategory and a primary key ProductSupplierCategoryHashKey. Data values connected to the link are placed in SatProductSupplierCategory. Over time a product can change supplier and category, which is handled in the SatProductSupplierCategory with a new row inserted with a filled DeletedDate, and a new row is inserted into the link and sat for the new data combination with a new LoadDate. If the business one day allows a product to have multiple suppliers, then the link is already prepared for it. In the future, the business would like to place products in stocks all over the world, and it will be registered in a new inventory management system that creates a HubStock and a SatStock, and creates a LinkProductStock and a SatProductStock to handle all the new data and relationships.
A unary/recursive relationship (self join), e.g. LinkMarriage, represents a connection between two persons with Person1HashKey and Person2HashKey from HubPerson with business key SSN, and it gives the primary key MarriageHashKey. All data is placed in SatMarriage(MarriageHashKey, MarriedDate, DivorcedDate, WidowedDate), which is labelled an Effectivity satellite because of the date columns. A validation rule in SatMarriage ensures that one person can only be married to one person at a time; else it will be marked as an error for an audit trail to catch. When a person enters a marriage and changes last name, it becomes a new row in SatPerson. When a marriage ends either by divorce💔 or by death†, a new row is inserted into SatMarriage with a filled DivorcedDate or WidowedDate. Couples who divorce and later remarry each other will reuse the same MarriageHashKey value from LinkMarriage for the new row inserted into SatMarriage. The LoadDate will be in order of the events. The DeletedDate will be used in connection with an error detection in the source system or an annulment of the marriage.
Another modeling of marriage: SatMarriage(MarriageHashKey, Date, MaritalStatus) with values of marital status: Married, Divorced and Widowed, makes it easy to extend with extra values e.g. Separated, Registered partnership and Abolition of registered partnership. When persons as partners are living together we model it in a LinkCohabitation and a SatCohabitation. When persons only date we make a LinkRelationship and a SatRelationship. Therefore each link and satellite is a flexible and agile way to divide data (agile in the sense of adaptable to change, easy and quick to extend). In a generic perspective »helicopter view« we could have started with a general link to represent a connection between two persons with Person1HashKey and Person2HashKey from HubPerson, and label the link with a general name like LinkCouple with a CoupleHashKey, and let the types of data create multiple satellites for the link as SatRelationship, SatCohabitation and SatMarriage. It is normal for a couple to start dating with data in the SatRelationship, then live together with data in the SatCohabitation. When the couple gets married, the cohabitation timeline stops in the SatCohabitation, and the data recording continues in the SatMarriage. It is agile to think and model data in a general way and divide data.
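The validation rule that one person can only be married to one person at a time could be checked like this. This is an illustrative sketch assuming one current row per marriage, where an open marriage has no DivorcedDate or WidowedDate:

```python
def overlapping_marriages(sat_marriage, person_hash_key, links):
    """Return pairs of marriage hash keys where the person has two open
    or overlapping marriage periods -- rows an audit trail should catch."""
    # Collect (start, end, key) periods for this person; open-ended
    # marriages get a far-future end date for the comparison.
    periods = []
    for row in sat_marriage:
        link = links[row["MarriageHashKey"]]
        if person_hash_key in (link["Person1HashKey"], link["Person2HashKey"]):
            end = row.get("DivorcedDate") or row.get("WidowedDate") or "9999-12-31"
            periods.append((row["MarriedDate"], end, row["MarriageHashKey"]))
    periods.sort()
    errors = []
    for (s1, e1, k1), (s2, e2, k2) in zip(periods, periods[1:]):
        if s2 < e1:  # next marriage starts before the previous one ended
            errors.append((k1, k2))
    return errors

links = {"m1": {"Person1HashKey": "p1", "Person2HashKey": "p2"},
         "m2": {"Person1HashKey": "p1", "Person2HashKey": "p3"}}
sat = [{"MarriageHashKey": "m1", "MarriedDate": "2010-05-01"},   # still open
       {"MarriageHashKey": "m2", "MarriedDate": "2015-06-01"}]   # overlaps for p1
print(overlapping_marriages(sat, "p1", links))  # [('m1', 'm2')]
```

In a data vault the check does not block the insert; the error rows are reported so the audit trail can catch them.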
An invoice becomes a HubInvoice where the business key InvoiceNumber becomes InvoiceHashKey used in a SatInvoice with data values IssueDate, DueDate, DeliveryDate, PaymentMethod and PaidDate. When a payment happens e.g. a week later, it would become a new row in the SatInvoice, but I prefer to add the payment to a new satellite called SatInvoicePaid with InvoiceHashKey, PaidDate, PaidMethod and PaidAmount to divide data into separate satellites because of the time difference in data capture, and it allows for payment in installments in the future. If one day the sales department replaces the billing system, then create a new SatInvoice. An invoice involves a customer in a store, therefore a LinkInvoiceCustomer includes InvoiceHashKey, CustomerHashKey and StoreHashKey and combines the business keys into InvoiceCustomerHashKey used in a SatInvoiceCustomer with data values like Reference number, Requisition number and Voucher number. An invoice line item involves a product, therefore a LinkInvoiceProduct includes InvoiceHashKey and ProductHashKey and combines the business keys into InvoiceProductHashKey used in a SatInvoiceProduct with data values LineItemNumber, Quantity, CouponCode, Discount (%), UnitPrice, VAT and Amount. If the quantity in an existing invoice is corrected, then a new row is inserted into SatInvoiceProduct. If a product is replaced, then for the old product a new row is inserted into SatInvoiceProduct with a filled DeletedDate, and the new product is inserted into LinkInvoiceProduct and into SatInvoiceProduct.
A link does not have its own unique identifier from a source system. Other links are: Nondescriptive link, Computed aggregate link, Exploration link and Link-to-link or Link-on-link.
Dan Linstedt commented on my post on LinkedIn that he is using the colors green for hub, blue for link and white for satellite.
Address will not be a hub since the address itself is not a CBC but a description of a CBC, e.g. Customer, Employee or Store. A CBC could be a CustomerDeliveryLocation as a hub with a satellite and a link to the Customer hub, because a customer could have more than one active location where goods can be delivered. The business key for the hub is a concatenation of zip code, street name and house number to get a unique identifier for the CustomerDeliveryLocationHashKey.
There are several other types of component parts (entities/tables), for example:
Transactional-link (light green) integrates and captures relationships between hubs and links to capture multiple transactions that involve the same business keys, e.g. many sales to the same customer of the same product from the same store, meaning multiple rows with the same set of keys. A transactional-link has its own unique identifier for a transaction (TransactionId, EventId) from a source system. Notice that a transactional-link keeps its business key; it is not placed in an associated hub. If a revision of a transaction is received from a source system or a counterpart, then insert the data values into the associated satellite. Since DV raw data is a structured archive, it would be wrong to calculate a counterpart row, but when data is extracted, transformed and loaded to a transactional fact it is fine to calculate a counterpart for the fact.
Nonhistorized-link is for immutable data that has no edit history, in other words, constant data that will never be changed or deleted in a source system. Each transaction (data row) has a unique identifier (TransactionId) and there is no revision of a transaction in the source system. When there is a counterpart to a transaction it will have its own unique identifier and a reference identifier back to the original transaction.
Reference (gray) is referenced from a satellite and is the same as a lookup table with basic data, e.g. RefDate or RefCalendar with many columns like Week, Month, Quarter and Year, RefMaritalStatus and RefMarriageType with Code and Text, and RefZipCode with Zipcode, City, State and State abbreviation for the US. The Date table is called a nonhistory reference table. A satellite is normalized to 3NF and contains only non-foreign key columns except simple reference values. If a reference (cross-reference or lookup table) contains history, then move the old data values to a History-based reference satellite with primary key Code+LoadDate where Text contains the previous texts. I prefer to have a RefAddress with extra columns like State, Region, Municipality and Latitude and Longitude (gps coordinate), Kimball page 235.
Same-as-link (turquoise) [sal] maps different business keys from multiple source systems where each source has its own satellite. In a sal we can »merge together« differently named business keys into a single master key that is really the same as the other business keys. A same-as-link is connected to a hub and is used to identify when the same business object is identified by multiple business key values; I showed an example for a SalesDevice in section 1.1.
Hierarchical-link (silverblue) [hal] captures a hierarchical relationship between the business concept records in the same hub, e.g. a parent-child hierarchy.
Point-in-time (purple) [pit] is a helper table for a hub or a link, calculated by the ETL process based on all the satellites of the same hub or link, because these satellites share the same HashKey together with different values in LoadDate. A pit table has its own LoadDate column (SnapshotDate), a column for the shared HashKey and a column for each involved satellite's LoadDate. All the LoadDate values from the satellites are thereby represented in the pit table and used to derive all valid versions, to ease joins of satellites, to improve performance, to find the newest or most recent rows and to calculate EndDate or ValidTo. By adding a surrogate key sequence number to a pit table, it can through a view create a virtual Kimball type 2 dimension. Pit create and pit use.
Bridge (taupe) combines (or bridges) hashkeys and optional business keys from multiple links and their related hubs. A bridge can represent an aggregated level and can include derived and calculated columns.
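How a pit table derives the newest LoadDate per satellite for each snapshot date can be sketched as follows (assumed structures; each satellite is represented by its LoadDate values for one hash key):

```python
def build_pit(hash_key, satellites, snapshot_dates):
    """For each snapshot date, record per satellite the newest LoadDate
    on or before the snapshot, so joining the satellites becomes one
    equality join per satellite instead of a between-dates lookup."""
    pit = []
    for snap in snapshot_dates:
        row = {"HashKey": hash_key, "SnapshotDate": snap}
        for name, load_dates in satellites.items():
            valid = [d for d in load_dates if d <= snap]
            row[name + "_LoadDate"] = max(valid) if valid else None
        pit.append(row)
    return pit

# Two satellites of the same hub sharing HashKey "abc" but loaded at
# different rates (ISO date strings compare correctly as text).
satellites = {"SatEmployeeName": ["2024-01-01", "2024-03-01"],
              "SatEmployeeJob":  ["2024-01-01"]}
pit = build_pit("abc", satellites, ["2024-01-01", "2024-02-01", "2024-03-01"])
for r in pit:
    print(r["SnapshotDate"], r["SatEmployeeName_LoadDate"], r["SatEmployeeJob_LoadDate"])
```

The 2024-02-01 snapshot still points at the 2024-01-01 name row, while the 2024-03-01 snapshot points at the new name row; a view can join each satellite on HashKey plus its pit LoadDate column.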
Data vault 2.0 recommends for a Hub to hash a business key value instead of using an identity column as a unique sequence number because:
· HASHBYTES('SHA2_256',CONCAT(SSN,';',UPPER(TRIM(Area)))) gives data type binary(32)/char(64). SHA1 uses binary(20) and MD5 has been deprecated.
· Before hashing a business key we do data operations such as upper case, left and right alignment or trimming, and save it in a column called »comparison business key« together with the original value in a hub table.
· For the Hash diff, to make an easy comparison of data columns, we do the same data operations when agreed with the business and the data profiling.
· Hashing two values could create the same hashkey value, which is called a collision; it is very unlikely but cannot be left to chance or avoidance. If a data batch contains two different business key values which give the same HashKey, then bulk insert will result in a primary key violation in a hub. If a new business key value gives the same HashKey as an already existing one in a hub, then the loading can wrongly assume the new business key value is already there. Therefore first check in the hub whether the new business key value exists; if only the HashKey exists, then increase the HashKey until there is no collision, or use a collision sequence number, but then there is no guarantee of the same hash key values when data is reloaded. Maybe there is a need for a hash key collision detection job and for detecting an undefined hash key. MD5 collision example. What to do when the business key value is null or an empty string?
· It is easy to copy data from one environment to another, like dev to test to prod, without worrying about sequences being out of sync.
· When two source systems share the same business key but in one source it is an integer data type and in the other source it is a string, by hashing they become the same data type in the hub table and sat table.
· When a source system is replaced, the new source may come with different data types, e.g. from an integer business key to a string business key; it will still be hashed to the same data type in the data vault, therefore no maintenance.
· A composite business key of multiple columns becomes one hashkey column which is easy to use for join and merge.
· It is easy to create a hashkey of a business key value when needed by an ETL tool or in a sql select statement, instead of doing a lookup to find the sequence number stored in a table. But of course we still need to make sure that the same business key is not inserted twice in the same hub table, and avoid collisions.
· Do not use a HashKey column for a clustered index for the primary key in a hub for faster query join performance, because hashbytes have a non-sequential nature and therefore inserting new rows will take longer. A non-clustered index should be considered the default for HashKey, with a unique clustered index on the business key in a hub. I have seen a hub with a unique clustered index on LoadDate+HashKey because the first part of the value is ever-increasing. More.
· Loading of tables can be done in parallel, e.g. at the same time to a hub table and to a satellite table from the same source system table (parallel loading, enables parallelization). It sounds awesome! But what if a ProductNumber is used in a Store relationship and will be inserted into the LinkProductStore with ProductHashKey and StoreHashKey, while the ProductNumber is an orphan that does not exist in the Product table from the source system and therefore is not in the HubProduct? Then it gives a referential integrity violation in the data vault; read about inferred members in section 4.5.
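The hashing rules in the list above can be sketched in Python, mirroring the HASHBYTES('SHA2_256', ...) example; the normalization steps and the ';' delimiter are the assumptions here:

```python
import hashlib

def business_hash_key(*key_parts, delimiter=";"):
    """Normalize each business key part (trim, upper case) and hash the
    delimited concatenation with SHA-256, mirroring
    HASHBYTES('SHA2_256', CONCAT(SSN, ';', UPPER(TRIM(Area)))) in T-SQL."""
    normalized = [str(part).strip().upper() for part in key_parts]
    return hashlib.sha256(delimiter.join(normalized).encode("utf-8")).hexdigest()

# The same business key in different spellings gives the same hash key,
# so an integer key from one source and a string from another align.
assert business_hash_key(" 123-45-6789 ", "north") == business_hash_key("123-45-6789", "NORTH")

# A composite business key becomes one hash key column, easy to join on.
product_store_hash_key = business_hash_key("P-1001", "Store-7")
print(len(product_store_hash_key))  # 64 hex characters = char(64)
```

The delimiter matters: without it, the parts ("AB", "C") and ("A", "BC") would concatenate to the same string and collide by construction.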
I see the Data vault model as an archive of structure where data from source systems has been divided, and where the business rules and transformations are first enforced when data is going to the data marts.
Linstedt says: »The Data Vault is for back-end data warehousing, not for production release to business users directly.« DV becomes a persistent staging area or an archive area (ARA).
Querying a data vault requires many more joins than Inmon's model. Be aware it can be a performance issue to have columns of data type binary(32) or char(64) for storing millions of rows and for joining many tables together, compared to using an integer data type. Joining large links and hubs together is expensive for performance and it is often debated by Linstedt. As far as performance goes, normalized structures are nowhere near as performant as a dimensional model. Nothing beats dimensional modeling for query performance.
Data vault divides data into a Raw area and a Business area, where the latter contains business calculations and displays data at a different level of granularity by making a new link with more or fewer hubs (and business keys) involved. Pit and Bridge are part of the Business area to provide tables that give a better query performance. For an ETL process from a Data vault to Dimensional modeling, think that a Hub and its Satellites become a Dimension, and a Link and its Satellites become a Fact. Pit and Bridge are a big help, and sometimes the ETL is done by views only.
When to use Data Vault? When a data warehouse has many source systems!
1.3.4. Agile modeling
Bill Inmon's modeling is based on a normalized data model, and the new data models for EDW want to break things out into »parts« for agility, extensibility, flexibility, generality, scalability and productivity, to facilitate the capture of things that are either interpreted in different ways or changing independently of each other.
Agile modeling has only extension with new tables and no modification of existing tables. Changes in an agile data warehouse environment only require extensions, not modifications, and have no impact on the data warehouse; it becomes quick and easy to customize the data warehouse when the business changes. An ensemble modeling pattern gives the ability to build incrementally, and future changes should not impact the existing design. Agile development is performed in an iterative, incremental and collaborative manner. I think we will always meet an issue that forces us to do a model refactoring or re-engineering. Read more.
Inmon modeling is a free data model that splits keys into map tables and data into super-sub tables, and handles one-to-one, one-to-many and many-to-many relationships by connection tables. With Entity Relationship data modeling you can make your own labelling of tables in a data model, compared to the fixed labels in Anchor modeling and Data vault modeling.
For Anchor modeling and Data vault modeling the basic idea is to split (groups of) keys into their own entities (called anchors and hubs) and split all other data into additional entities (non-key entities called attributes, satellites), so an Anchor or a Hub table has multiple Attribute or Satellite tables attached.
This has advantages of independent load jobs, data model enhancements etc., but makes it more difficult to extract and fetch data. To reduce the complexity in queries, a proven approach is to create (or generate) a view layer on top of the model with join elimination etc.
Normalization to sixth normal form 6NF is intended to decompose table columns into irreducible components, a hypernormalized database where essentially every column is its own table. I see 6NF as a physical level, while I see 1NF to 5NF as a logical level useful for design and modeling of a database.
A generic data model does not need any changes when there is a change in the source systems. There is a Party modeling, a 2G modeling and a Focal Point modeling and more than dozens of data warehouse data modeling patterns that have been introduced over the past decade.
EDM = Enterprise Data Model is a term from IBM back in 1998 for a data modeling technique for data warehousing, where ETL was called Capture, Transform, Apply.
A simple guidance for EDW: Do not lose data (auditing/compliance/validation). Model it, so you can understand it. Make it physical, so you can query it.
A data warehouse is using a surrogate key instead of a business key to remove dependence on the source system, see more in section 4.2.
1.3.5. Dimensional modeling
Ralph Kimball does not like to store data in an EDW; he only stores data in data marts that use Dimensional modeling, therefore the EDW becomes a union of all data marts.
Dimensional modeling ends up in a star schema or constellation schema (multi-star schema) with fact tables (analysis variables, measures, events) surrounded by dimension tables (context), where dimensions explain the facts. Dimension and fact conformance is a must in a successful data warehouse implementation to meet the requirements of legislation, accepted practices, prescribed rules and regulations, specified standards and terms of a contract.
There are many naming rules, e.g. dimension tables are singular and fact tables are plural to differentiate between dimension and fact.
Adding a role to the name of an object by a prefix, e.g. DimCustomer and FactSales, or by a database schema, e.g. dim.Customer and fact.Sales.
Sale is the selling of goods or services, or a discount on the price.
Sales is a term used to describe the activities that lead to the selling of goods or services. Goods are tangible products, items you buy such as food, clothing, toys, furniture and toothpaste. Services are intangible actions such as haircuts, medical check-ups, mail delivery, car repair and teaching. A sales organization can be broken up into different sales teams based on regions or type of product.
Constellation schema or Multi-star schema has multiple facts + sharing dimensions
From a real implementation based on 65000 devices/end-points/clients and 34000 software items (arp string for add remove program) from four source systems: AD, SCCM/MECM, ServiceNow and Software Central. Dimensions are of Kimball type 1, which contain current data where old data is overwritten, and deleted data in a source system will be marked with a »true« value in the dimension column IsDeleted_meta as metadata. Facts are based on »capture a relationship in the fact« instead of snowflake dimensions, e.g. fact_Client and derived and snapshot facts, see more later in this article.
Snowflake schema (I'm not a fan of the term)
A dimension is divided into several normalized dimension tables, one per level of the hierarchy, so that not all dimension tables are connected to the fact table. When a hierarchy is divided into several dimension tables, they can be individually linked to different fact tables, same as a Constellation schema or Multi-star schema with multiple facts sharing dimensions.
1.3.6. Data warehouse architecture and ETL modeling
Three business intelligence enterprise data warehouse modeling architectures:
Sources represent either Input data area (IDA) or Archive area (ARA).
It is common for all models to have a Data staging area (DSA).
Enterprise Data Warehouse (EDW) is sometimes a term for the whole solution and the different models each use their own terms e.g. Data warehouse or Raw data.
EDW is a common database layer before the Data mart area (DMA).
A data mart has a Presentation interface (PIA) in front, above it is called BI-Apps.
I have used the general word data mart for a database with tables of dimensions and facts, but there is a more narrow term Multidimensional database (MDB), which is the source database for an OnLine Analytical Processing (OLAP) application like a cube, also known as a Multidimensional online analytical processing (MOLAP) application. For me, it is a principle for a cube that all dimensions and all measures can be combined freely; else divide data into multiple cubes. Same principle for a Tabular model.
SQL (Structured Query Language) is a query language for the relational database model.
MDX (Multidimensional Expressions) is a query language for OLAP model.
DAX (Data Analysis Expressions) is a query language for Tabular model.
The ETL process stands for Extract, Transform, Load, see a model, and ETL exists between the data areas (data layers) of a data warehouse solution, e.g.:
From each data area (data layer) there will be extract criteria to limit data, there will be transforms to make data conform and calculate new data like KPIs, and data will be loaded into the next area. It is a common way of handling data in a multitier or layered architecture (in Danish: lagdelt arkitektur; extract provides data by pulling or reading data out; transform transforms, calculates or enriches data; load delivers data to the next area).
1.3.7. Data mart modeling
A Data mart area (DMA) will consist of several data mart databases. Data is fetched from a Data staging area (DSA) or an Enterprise Data Warehouse (EDW) through an ETL process which may contain a local data mart staging area (DMSA). When a data mart is implemented as a view layer on the EDW, I call it a virtual data mart. Each data mart database serves a specific purpose area with tailored support and can be characterized as one of the categories below:
With loaded tables of data from DSA or EDW.
With common dimensions to be reused in the other data marts. Examples of dimensions are Date, Time, Employee, Organization, Product, Retailer and TransactionType. Can also include role-playing dimension e.g. Manager based upon Employee and different date dimensions with unique names of columns.
With common facts to be reused in the other data marts. Examples of facts are Sales, Inventory, Marketing, Finance, Employment and Transaction.
Contains star schema/constellation schema and other modeling.
With loaded tables of data from DSA or EDW.
For example, Sales mart, Customer mart, CRM mart, Churn prediction mart, Market mart, Production mart, Inventory mart, Shipment mart, HR mart, Tax mart, Credit risk mart, Fraud detection mart.
Data can be a subset of common mart and data can be further enriched.
Contains star schema/constellation schema and other modeling.
With loaded tables of data from DSA or EDW.
With views upon common mart, subject mart or EDW to fetch relevant data.
Making conformed columns for a star schema/constellation schema to provide data to OLAP cubes or Tabular cubes, or to dashboard visualization tools like Tableau, QlikView/Qlik Sense and Power BI. The data mart can be used for analytics by business users and data scientists doing ad hoc sql queries.
With views upon common mart, subject mart or EDW to fetch relevant data.
Making conformed columns for a star schema/constellation schema to provide data to reports on paper, in pdf files and other display formats.
Stored procedures with criteria parameter inputs, for example FromDate, ToDate, Customer, Branche, Product, SalesDevice, TransactionType, will fetch the relevant data from the views and present data to a reporting tool used by business users, like Reporting Services (SSRS) or Microsoft Access with a pass-through query to call a parameterized stored procedure.
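The idea of optional criteria parameters can be sketched like this (illustrative table and parameter names; a stored procedure would express the same with default NULL parameters):

```python
def fetch_sales(fact_sales, from_date=None, to_date=None, customer=None):
    """Sketch of a parameterized report query: each criteria parameter
    is optional, as in a stored procedure with default NULL inputs,
    so an omitted parameter does not filter the result."""
    rows = fact_sales
    if from_date is not None:
        rows = [r for r in rows if r["Date"] >= from_date]
    if to_date is not None:
        rows = [r for r in rows if r["Date"] <= to_date]
    if customer is not None:
        rows = [r for r in rows if r["Customer"] == customer]
    return rows

fact_sales = [{"Date": "2024-01-05", "Customer": "C1", "Amount": 100},
              {"Date": "2024-02-10", "Customer": "C2", "Amount": 250}]
print(fetch_sales(fact_sales, from_date="2024-02-01"))  # only the February row
```

A reporting tool passes only the criteria the user filled in, and the same query serves all combinations of parameters.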
With tables for a Regulatory requirement return report (in danish indberetning) for financial reporting, because it is desirable to be able to recreate an old report both as it actually looked at the time of creation and as it should have looked given corrections made to the data after its creation. All tables have a VersionId to be used for sql joins of tables to ensure all data rows belong to the same accounting. Read about Temporal snapshots in chapter 2.
With views upon common mart, subject mart or EDW to fetch relevant data to provide data and to create data in a file format that is customized to the individual data recipient as another IT system internally or externally.
With tables of raw data from source systems, or with views upon an archive area, EDW or a data mart area, for some business users to do data discovery by sql statements. It is a sandbox where users also have write access and can create their own tables for staging data, mapping data, search criteria data etc., and for the final result dataset to be exported to an Excel sheet or connected to a Power BI report and so on. The purpose is to prepare a good business requirements specification and a user story for extensions to the data warehouse solution.
Data mart area (DMA) has several databases, one for each mart, and sometimes there is no need to store data in a mart database; instead the mart database contains virtual star schema non-materialized views that look like dimensions and facts in a virtual data mart.
Remember to set up a process for user management, permissions, authentication etc. to manage and monitor access and reduce the risk of cross contamination of data, ensuring that a user only has access to view the data relevant for his analyses and reports.
1.4. Data reconciliation
One part of a data warehousing project is to provide compliance, accountability and auditability. After data capture, remember to implement a Reconciliation Summary Report with the results from your reconciliation, e.g. Uncategorized assets, Invalid country codes, Derivatives transactions with missing currency code.
An audit trail (in danish kontrolspor) is important for the credibility of a data warehouse. An example: a source system has the value 0.003589 and exports it to a txt file where it becomes 3.589E-3, and by a mistake in the ETL process the data warehouse saves and displays the value as 3.589. A contract number 700002848572 becomes 7.00003E+11 and the last part of the value is lost. When reconciliation is built into the data model and the ETL process, such mistakes are reported so a programmer can fix the import and update the data profiling documentation.
A classic reconciliation is to weigh a truck before it leaves and again when it arrives at the destination to make sure that no load has been lost on the ride (to reconcile in danish at afstemme, stemmer overens). Reconcile the data warehouse against the source systems with row counts and sums of values, mark data as reconciled and produce an auditing report. Log the row count of the target table before and after a process to measure the change the process makes (TableRowCountBeginAt and TableRowCountEndAt), the row count of source rows and of extracted rows based on joins and criteria, and the row count of deleted, updated and inserted/loaded rows, whether loaded successfully or rejected because of missing values, out-of-bounds amounts or other kinds of invalid data, depending on the data quality validation checks. When summing values, you can reverse the sign in every other row by multiplying with -1 to avoid the sum becoming larger than the data type allows, for example in sql:
;WITH InvoiceItemSummarized AS
(
    SELECT InvoiceItemAmount,
           Sign = IIF(ROW_NUMBER() OVER(ORDER BY InvoiceItemId) % 2 = 0, 1, -1)
    FROM InvoiceItem
)
SELECT SUM(InvoiceItemAmount * Sign) AS InvoiceItemAmountSummarized
FROM InvoiceItemSummarized
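The alternating-sign summation can be sketched and verified with an in-memory SQLite database; the InvoiceItem table and its sample rows are illustrative, and CASE WHEN replaces the T-SQL IIF:

```python
import sqlite3

# Illustrative in-memory table; InvoiceItem and its columns are example names.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE InvoiceItem (InvoiceItemId INTEGER PRIMARY KEY, InvoiceItemAmount REAL)")
con.executemany("INSERT INTO InvoiceItem VALUES (?, ?)",
                [(1, 100.0), (2, 250.0), (3, 40.0), (4, 310.0)])

# Alternating-sign checksum: multiply every other row by -1 so the running
# total stays small and does not overflow the data type.
row = con.execute("""
    WITH InvoiceItemSummarized AS (
        SELECT InvoiceItemAmount,
               CASE WHEN ROW_NUMBER() OVER (ORDER BY InvoiceItemId) % 2 = 0
                    THEN 1 ELSE -1 END AS Sign
        FROM InvoiceItem
    )
    SELECT SUM(InvoiceItemAmount * Sign) FROM InvoiceItemSummarized
""").fetchone()
print(row[0])  # -100 + 250 - 40 + 310 = 420.0
```

The signed sum can be computed in both source and target; if the two checksums differ, a row was lost or changed on the ride.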
How else can we say we are compliant and ensure compliance without data governance, data management, data reconciliation, data quality and data lineage, mixed with documentation, in a bank that must meet Basel I, Basel II, Basel III, Basel IV, EDA, HIPAA, Sarbanes-Oxley and BCBS 239 (The Basel Committee on Banking Supervision standard 239), together with a regulatory requirements return report, to be compliant for e.g. the Danish Financial Supervisory Authority (Danish FSA) (in danish Finanstilsynet).
1.5. Data quality
The purpose of data quality is to ensure that business users trust and have confidence in data, to achieve reliability and relevance. It means defining and implementing processes to measure, monitor and report on data quality, measuring the quality of a data set and thereby judging it, and hopefully improving the quality of data. The specific characteristics or dimensions of data that are analyzed in a data quality program differ from one business to another, based on the needs and priorities of that business. We must be able to rely on the accuracy of the data on which we base decisions (in danish »Vi skal kunne stole på korrektheden af vores data, som vi baserer beslutninger på.«). Doing data quality increases believability, credibility and reliability (troværdighed, pålidelighed). Data quality skills and data cleansing give a better data discipline and help avoid violations of data quality rules. The dimensions below are commonly used as data controls with the goal of ensuring data quality, and one means is conformance or conformity: making data from multiple source systems conformed through data conformation as part of the ETL process. The term data quality dimension borrows from the measurement of physical objects, where e.g. weight, height, length and width are called dimensions: what should be measured and reported on for a data set. A data quality dimension defines a set of attributes and characteristics that represent an aspect of the data quality.
The following twelve data quality dimensions are significant when assessing the quality of data and declaring a data set with dimensions for data quality. They can be used to document that the content of data in a data warehouse is correct:
Timeliness and Availability (aktualitet og tilgængelighed)
Where data is available and accessible when needed by the business. Timeliness is a measure of the time between when data is expected and when it is made available, e.g. a business must report its quarterly results by a certain date; the delay between a change of a real world state and the resulting modification of the data warehouse; or a customer contact information is verified at least once a year, where a verification date indicates that data is up-to-date. Another data quality rule can be that max 80% of the active companies have annual accounts that are not older than 15 months. Timeliness depends on the update frequency of the data set in relation to how often the data changes; a data set can be updated real-time, daily, weekly, monthly, quarterly, semi-annually or yearly. Currency measures how quickly data reflects the real-world concept that it represents.
Timeliness is also about how frequently data is likely to change and for what reasons. Mapping and code data remains current for a long period. Master data needs to be up-to-date in real time. Volatile data remains current for a short period, e.g. the latest transaction or the number of items in stock can be shown with an as-of-time to tell users that there is a risk that data has changed since it was saved/recorded. Typically data changes several times during the day and remains unchanged in the evening and night. Latency tells the time between when data was created or changed and when it was made available for use by the users. Kimball talks about how low latency data delivery can be very valuable for a full spectrum of data quality checks (page 261, in danish reaktionstid; too fast fact data delivery can cause inferred members, see section 4.5).
Completeness (komplethed, fuldstændighed)
Where a data row has the values that are required, to avoid missing values, e.g. an order requires a customer number. Data quality must therefore check whether some columns are empty (null, empty string, or an unexpected 0). The ETL process must be robust and must not crash on an unexpected null value; instead the data row must be sent to a Wrong table which operations monitors and reports on. When adding, updating or deleting rows, it can be checked whether the number of rows corresponds to the expected number, e.g. between 100 and 130 new orders per day, or that the number of new daily transactions does not decrease (or increase) by more than 10% compared to the average number of transactions in the last 30 days. Keep an eye on the amount of data, so the quantity or volume of available data is appropriate.
Completeness is also about a data row where all the necessary data is present, e.g. an address must include house number, street, zip code and city, and an order must include product, price and quantity. Incomplete data, e.g. missing one hundred thousand data rows compared to the last payload, an address without a zip code or an email address missing the domain (gmail.com), makes the data not usable. A data quality goal could be 98% usable addresses, and the failing data rows must be marked to be taken care of afterwards. Over-completeness means that the set of rules is too restrictive, so value combinations that are actually valid are unjustly excluded.
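The volume checks mentioned above (100 to 130 new orders per day, and no more than 10% deviation from the 30-day average) can be sketched as a small completeness control; the function name and thresholds are illustrative:

```python
# Volume checks sketched from the text: expected 100-130 new orders per day,
# and the daily count must not deviate more than 10% from the 30-day average.
def volume_ok(todays_rows, history, low=100, high=130, max_deviation=0.10):
    """history is a list of daily row counts for the last 30 days."""
    if not (low <= todays_rows <= high):
        return False
    avg = sum(history) / len(history)
    return abs(todays_rows - avg) / avg <= max_deviation

history = [115] * 30
print(volume_ok(118, history))  # True: inside 100-130 and within 10% of avg 115
print(volume_ok(95, history))   # False: below the expected lower bound
```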
Uniqueness (unikhed, entydighed)
Where data identifies one and only one entry. Unique data means that there is only one instance of a specific value in a data set, so it is free from duplication, e.g. a Social Security number (SSN) ensures that each person has a unique Id, therefore duplicate SSN values are not allowed within the data set. An unambiguous value has one meaning; avoid ambiguous values at the same level in a dimension or the need to combine two values into one unique value. Duplication where a customer has addresses spelled in different ways is not credible. What about Bob and Bbo, a misspelling? What about Daniel and Dan, which may well be the same person? Investigate that apparent duplicates really are unique.
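A minimal uniqueness control over a set of SSN values can be sketched as follows; the sample values are made up:

```python
from collections import Counter

# Duplicate check sketched from the text: an SSN must identify one and only
# one person, so any value occurring more than once is flagged for review.
ssns = ["120390-1234", "050687-5678", "120390-1234", "231279-9999"]
duplicates = [ssn for ssn, n in Counter(ssns).items() if n > 1]
print(duplicates)  # ['120390-1234']
```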
Validity (validitet, gyldighed)
Where data is valid within a range of values, e.g. the age of a person cannot contain a negative value and cannot be higher than 125 years, and when an age is under 18 then marital status must be not married. A validation rule in a time registration system could be that hours worked must be between 0 and 168 per month as a plausibility range, or that the average price of this month must not differ from last month's price by more than 50%. A valid data value can also be described by a regular expression pattern when a text column has to be validated, e.g. a phone number pattern (999) 999-9999 where we want the hyphen to be stored in the column. A validation rule can be characterized as: in-column, cross-column, in-row, cross-rows or cross-data set. A cross-column validation is about conditions that span across multiple columns, e.g. a patient's Date of Discharge from the hospital cannot be earlier than the Date of Admission, or the Delivery date of an order cannot be earlier than its Shipment date. Validation ensures that data is logical. Types of validation rules: equality, inequality, and logical rules mixed with range checks, with bounds fixed or depending on entries in other columns. A Due date must be within an acceptable range and not ten years wrong, else we have invalid data that should be marked as rejected and corrected. It is also important that a data row indicates the period in which it is valid.
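The kinds of validation rules mentioned above (in-column range, in-column pattern, cross-column) can be sketched like this; the function names, bounds and the hyphen-based phone pattern are illustrative:

```python
import re
from datetime import date

# Validation rules sketched from the text; names and bounds are illustrative.
PHONE_PATTERN = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")  # (999) 999-9999

def valid_age(age):                      # in-column range check
    return 0 <= age <= 125

def valid_phone(phone):                  # in-column pattern check
    return bool(PHONE_PATTERN.match(phone))

def valid_stay(admission, discharge):    # cross-column check
    return discharge >= admission        # discharge cannot be earlier

print(valid_age(130))                                    # False
print(valid_phone("(555) 123-4567"))                     # True
print(valid_stay(date(2024, 3, 1), date(2024, 2, 27)))   # False
```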
Conformity (konformitet)
Where data is stored in a column with a business-wide understood, user-friendly name and where data is in the required format following a standard data definition. E.g. an easy name for a column with an invoice amount excluding VAT (pre-VAT) could be InvoiceAmount_ExclVAT and for the amount including VAT InvoiceAmount_InclVAT. E.g. format telephone numbers with or without country code, and a control has to make sure that all phone numbers have the proper number of digits and a format that ensures conformance. A date can be formatted as yyyy/mm/dd or dd-mm-yyyy; please choose only one format in the data warehouse. How do you present a workload of 7½ hours per day, as 7:30 or 7.50?
Where data is stored in a column with a data type and size and where a value has the right unit. For a data warehouse with data across domains from multiple source systems it is important to define and use the same data types and sizes. E.g. a decimal number in Europe uses a comma before the decimals (3,14) while the US uses a period (3.14). How many digits in an amount? E.g. data type decimal(19, 4) is a popular choice for e-commerce, but Bitcoin uses decimal(24, 8). How large can the sum of an amount column over many rows become?
A data value is often represented in a specific unit, but seldom can the unit be seen in the name of the column or in an extra column with the unit as text. Unit examples: a weight value in pounds or kilos, a length value in centimeters (cm) or feet (ft), a distance value in kilometers (km), miles (mi) or light years (ly), an amount value with a currency like USD, EUR, DKK, and a temperature value measured in Celsius ºC or Fahrenheit ºF.
Consistency (konsistens)
Where the same data from two or more source systems should not conflict with each other, e.g. when a product is discontinued, there should not be any sales of the product. A zip code is right and has a city name, or a data mapping like DK for Denmark and »Cph, Kbh, København« for Copenhagen makes the capital of Denmark consistent. Code values must follow standards, e.g. ISO currency codes USD, EUR, GBP, DKK, LEI Legal Entity Identifier issuer codes, NACE codes and CIC codes. Improve values that require case-sensitive handling through translate, mapping or bridging rules to create one and only one truth. Given three data sets referring to the same time period, e.g. one with data for females, one for males and one for the total, the consistency between the results of the three data sets can be checked. Inconsistent data stands for conflicting data, or the same data not being the same across systems. An address can be the same with a look-alike spelling, e.g. »Fifth Avenue« or »5th Ave«, or the name of a product can have two spellings, e.g. Ketchup and Catsup. Sometimes a user types in extra data, e.g. a direction added in an address column: »on the corner of Fifth and Main«. Hopefully a data cleansing procedure can catch it. Inconsistent data gives problems in any database and data warehouse.
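A mapping or bridging rule that conforms the different spellings of the capital of Denmark can be sketched as follows; the mapping table is illustrative:

```python
# Mapping rule sketched from the text: conform different spellings of the
# capital of Denmark to one value, creating one and only one truth.
CITY_MAP = {"cph": "Copenhagen", "kbh": "Copenhagen",
            "københavn": "Copenhagen", "copenhagen": "Copenhagen"}

def conform_city(raw):
    # Lower-case lookup handles case-sensitive variations; unmapped
    # values pass through unchanged.
    return CITY_MAP.get(raw.strip().lower(), raw)

print(conform_city("Kbh"))        # Copenhagen
print(conform_city("København"))  # Copenhagen
print(conform_city("Aarhus"))     # Aarhus (unmapped, passes through)
```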
Integrity (integritet, pålidelighed)
Where constraints are a not null column, a primary key, and foreign key referential integrity stating that all references in data are valid. Relationships between data in different systems are maintained for accuracy and consistency, so data relationships are valid and connected. Watch for a wrong data type for a data value, like a too big amount or a too long text string. Allowed values like dates, weekdays and week numbers are constant data.
Accuracy (nøjagtighed, korrekthed)
Where data represents what it intends to represent in the real world, e.g. the age of a person is trusted or an address is the correct one, verified by performing a sample measurement or by checking and comparing an address with another data set known to be accurate, e.g. a National Address Database or openaddresses.io, to obtain a degree to which data correctly describes the real world object or event. A new product name may not have the same name as, or sound similar to, a competitor's product. A new product claims to weigh a specific amount, and the weight is verified in an independent test. Inaccurate or incorrect data is poison for a data warehouse, therefore it is good to set up rules to find inaccurate values and mark them to be taken care of. Semantic accuracy must ensure consistency between the content of the data recorded and actual conditions (content accuracy). Syntactic correctness must ensure that data complies with syntactic rules, e.g. spelling and formats such as a date or an xml structure (form accuracy, syntax and schema validation). A nurse entered the Date of Birth of all the patients in the format dd/mm/yyyy instead of the required format mm/dd/yyyy; the data passed the system validation check as the values were all within the legal range, however they are not accurate.
Reasonableness (rimelighed)
Where data is compared to benchmark data or past instances of a similar data set, e.g. sales from the previous quarter, to prove whether data is reasonable or not.
Confidentiality and Security (fortrolighed og sikkerhed)
Where data is maintained according to national and international standards for data, e.g. GDPR, where access to data is controlled and restricted appropriately to maintain its security, including how long data is retained.
Clarity and Usefulness (klarhed, anvendelighed, forståelighed, genbrugelighed)
Where to assess the extent to which data is understandable and can be read and used by others without misunderstanding and without difficulty. Presentation quality and usability cover readability, understandability, relevance, accessibility, maintainability and credibility, together with availability of documentation and metadata columns in a data set.
Reports must have accessibility and communicate information in a clear and concise manner. Reports should be easy to understand yet comprehensive enough to facilitate informed decision-making. Reports should include an appropriate balance between data, analysis and interpretation, and qualitative explanations. Reports should include meaningful information tailored to the needs of the recipients.
Monitoring (tilsyn) the twelve data quality dimensions
Monitoring the quality of the data and reporting it to the data owner is part of being compliant. Being aware of the quality of your data can be termed data quality, but sometimes it is misunderstood that this by itself will improve the actual quality of data; what it does is improve the awareness. Data quality measurements are established on the basis of the importance and use of data. Data should be monitored to ensure that it continues to meet requirements. A data quality indicator is called a metric, and several metrics can lie behind one data quality dimension. It must be possible to measure each data quality dimension with a limit value range for the data quality rule to indicate whether the measurement is:
A data quality measurement should preferably be performed automatically, and data quality measurements should be performed after each ETL process. The results of the data quality metrics must be stored and documented so that they can be exhibited to the data owner and data users, especially those measurements that exceed limit values. The measurements must be able to be included in an overall reporting and follow-up on data quality.
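Classifying a measurement against a limit value range can be sketched like this; the thresholds and labels are illustrative:

```python
# A metric with a limit value range, sketched from the text: each measurement
# is classified so it can be included in the overall data quality reporting.
def classify(measured_pct, ok_limit=98.0, warn_limit=95.0):
    """E.g. percentage of usable addresses after an ETL run."""
    if measured_pct >= ok_limit:
        return "approved"
    if measured_pct >= warn_limit:
        return "warning"
    return "exceeded"  # limit value exceeded - report to the data owner

print(classify(99.1))  # approved
print(classify(96.0))  # warning
print(classify(90.0))  # exceeded
```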
There are two types of strategies for improving data quality:
Process-driven performs better than Data-driven in the long run, because it removes the root causes of the quality problems completely.
Vilfredo Pareto from Italy formulated in 1906 the 80:20 rule or principle, built on observations such as that 80% of the wealth in Italy belonged to about 20% of the population. The same ratio (in danish forholdstal) often holds in data quality: 20% of a data set generates 80% of the errors in data. Correspondingly, 20% of the goods generate 80% of the earnings, which means that the other 80% of the goods generate only 20% of the earnings.
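The Pareto observation can be checked directly on an error log; the error counts per data set are made up:

```python
# Pareto check sketched from the text: how many data sets generate
# 80% of the errors? Error counts per data set are illustrative.
errors = {"orders": 400, "customers": 30, "products": 20,
          "invoices": 25, "shipments": 25}

total = sum(errors.values())
ranked = sorted(errors.values(), reverse=True)
cumulative, n = 0, 0
for count in ranked:
    cumulative += count
    n += 1
    if cumulative / total >= 0.80:  # stop once 80% of errors are covered
        break
print(f"{n} of {len(errors)} data sets produce 80% of the errors")
```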
1.6. Data lineage and Data provenance
Data lineage is the journey data takes from its creation through its transformations over time. Data lineage can tell us the source of data, what transformations take place, and where data moves to over time; it describes a data set's origin, movement, characteristics and quality. Data lineage (in danish afstamning, hvor stammer data fra eller hvor kommer data fra samt hvor anvendes data) gives traceability (sporbarhed) and visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process. Lineage is about achieving traceability of data, making sure that from any given data point one can easily find the way back to its origin.
Data lineage is an end-to-end mapping of upstream and downstream dependencies of the data, from capture or ingestion to analytics, including which reports and dashboards rely on which data sources and what specific transformations and modeling take place at every stage. Data lineage also includes how data is obtained by a transformation. Data lineage must ensure that there is traceability from where data is created to where data is subsequently used. Data lineage exists both at table level and column level.
When a data warehouse is using data from other source systems, inhouse or external, a data lineage can be implemented by tagging data rows with metadata in a column like RecordSource from Data vault to indicating where the data originated.
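Tagging rows with a RecordSource column can be sketched with an in-memory SQLite staging table; the table, column and source system names are illustrative:

```python
import sqlite3

# Tagging rows with a RecordSource column, sketched from the Data vault idea
# in the text; table and source system names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Customer_Stage
               (CustomerNumber TEXT, Name TEXT, RecordSource TEXT)""")

def load(rows, record_source):
    # Every loaded row is stamped with the system it originated from.
    con.executemany(
        "INSERT INTO Customer_Stage VALUES (?, ?, ?)",
        [(num, name, record_source) for num, name in rows])

load([("C1", "Alice")], record_source="CRM.Customer")
load([("C2", "Bob")], record_source="Webshop.Account")

rows = con.execute("""SELECT CustomerNumber, RecordSource
                      FROM Customer_Stage ORDER BY CustomerNumber""").fetchall()
for row in rows:
    print(row)
```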
When data is staged and present in a target place in a data warehouse, data lineage is also important to inform a business user where data is coming from. If an organization does not know where data comes from and how it has changed as it has moved between systems, then the organization cannot prove that the data represents what they claim it represents.
In addition to documentation, data lineage is also used in connection with change management so that a data steward can determine which IT systems and solutions will be affected and how they will be affected.
Data provenance focuses on the origin of the data, aka the source system; it could be a screen in a host system where users type in data values after a phone conversation with a customer, or a web page filled in by the customer: the original data capture. Data provenance provides a list of origins, including inputs, entities, systems, and processes related to specific data. It is always good to know the latest date of update and who made the update.
If you want to drive real value from your data, you must first understand where data is coming from, where data has been, how data is processed, how data is being used, and who is using data. That's what data lineage and data provenance are all about: creating transparency.
My experience is that documentation of a data warehouse system takes at least π (pi) times as long as the development time. Tasks like writing explanations, reasons and arguments, setting up tables, drawings, figures and diagrams, showing data flow and data lineage, making a data dictionary, and last but not least proofreading. Followed by review by colleagues and subsequent adjustments and additional writing. Documentation is an iterative process, and in a good agile way a programmer should program in the morning and document it in the afternoon.
For the EU GDPR General Data Protection Regulation (Persondataforordning) a business key like a social security number needs to be made anonymous after some years, e.g. by adding a calculated cross sum of the ssn and hashing it, which gives a scramble or encryption by a scrambling algorithm. To be sure of total anonymity, only save Date of Birth and Gender of a person; no name, no address, no zip code and no city. A Master Data Management (MDM) database can contain social security number, name and address together with a CustomerId that is used in emails to customers. Therefore we have a surrogate key CustomerInternalId that is used in all other operational systems and the data warehouse together with BirthDate and Gender, which are not personally identifiable or enforceable. A web service provides all systems the personal data from the MDM database. When a customer is to be deleted, because it is his/her wish or because of the data limitation period, we only need to anonymize (the way of deleting data) in the MDM database, and we can keep CustomerInternalId, BirthDate and Gender in the other systems to make sure statistics remain unchanged back in time. When a system calls the web service it will get back unknown for ssn, name and address when a customer no longer exists in the MDM database. If we don't have an MDM, we must anonymize in all data layers and in relevant source systems as well, and flag data with an IsAnonymized column. Personal Information (PI), Personally Identifying Information (PII) and Sensitive Personal Information (SPI) are data relating to identifying a person.
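A one-way scrambling of a social security number by hashing can be sketched as follows; the salt value and ssn format are illustrative, and a real solution needs a documented key and retention policy:

```python
import hashlib

# One-way scrambling sketched from the text's idea of hashing the ssn;
# the salt is an illustrative per-system secret, not a recommendation.
def scramble_ssn(ssn, salt):
    return hashlib.sha256((salt + ssn).encode("utf-8")).hexdigest()

scrambled = scramble_ssn("120390-1234", salt="per-system-secret")
print(len(scrambled))  # 64 hex characters, not reversible to the ssn
```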
1.9. DataOps and Data governance and Data management
DataOps is how to organize your data and make it more trusted and secure with tools like:
The purpose of data governance is to provide tangible answers to how a company can determine and prioritize the financial benefits of data while mitigating the business risks of poor data. Data governance requires determining what data can be used in what scenarios – which requires determining exactly what acceptable data is: what is data, where is it collected and used, how accurate must it be, which rules must it follow, who is involved in various parts of data?
According to the DAMA-DMBOK: Data Management Body of Knowledge, 2nd edition: »Data governance is the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.« This definition focuses on authority and control over data assets. The important distinction between data governance and data management is that the DG ensures that data is managed (oversight) and DM ensures the actual managing of data to achieve goals (execution), in short that DM ensures that data is compliant and ensures compliance.
According to Seiner: »Data governance is the formalization of behavior around the definition, production, and usage of data to manage risk and improve quality and usability of selected data.« This definition focuses on formalizing behavior and holding people accountable. [Robert S. Seiner. Non-Invasive Data Governance: The Path of Least Resistance and Greatest Success.]
If data management is the logistics of data, data governance is the strategy of data.
The traditional, centralized, data-first approach to data governance represents a top-down, defensive strategy focused on enterprise risk mitigation at the expense of the true needs of staff who work with data. The Data Governance Office develops policies in a silo and promulgates them to the organization, adding to the burden of obligations on data users throughout the enterprise who have little idea of how to meet the new and unexpected responsibilities.
The modern, agile, people-first approach to data governance focuses on providing support to people who work with data. People are empowered but nonetheless, expected to contribute to the repository of knowledge about the data and follow guidelines, rather than rigid, prescriptive procedures.
Governance can be divided into three maintainings:
Policies describe what to do and what not to do; they guide the way people make decisions. A policy established and carried out by the government goes through several stages from inception to conclusion.
Procedures describe how to accomplish and complete a task or process. Procedures are the operations that express policies.
Community of practice (CoP) is a group of people who share a concern or a passion for something they do and learn how to do it better as they interact regularly.
1.10. Big data
Some properties of big data:
A data lake can be used for massive quantities of unstructured data and big data, with tools that can easily interface with them for analysis for business insights. A datum in a lake has tags giving it characteristics, and by the tags we can fetch data from the lake without knowing the physical location like a server url with a folder path. A data lake can contain files on multiple servers on premise in different folders and in the cloud (many nodes), and we only use a tag to find and fetch data.
For example, if I would like to find photos of smiling employees in all albums, I can search for a tag FacialExpression = smiling. A data lake uses ELT (extract, load, and then transform). A tool for a data lake can be Apache Hadoop or Microsoft Azure. A Data Discovery Area is an end-user sandbox. One can use U-SQL to dive into the data lake, fetch the wanted data and do the wanted transformations.
A relational database is characterized by ACID (Atomicity, Consistency, Isolation, Durability). This means that a transaction is either carried out completely or not at all (Atomic), that only valid data is added to the database (Consistent), that transactions never affect each other (Isolated), and that transactions are never lost (Durable).
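Atomicity can be demonstrated with SQLite, where a failing statement rolls back the whole transaction; the Account table is illustrative:

```python
import sqlite3

# Atomicity sketch: the failing transaction is rolled back completely, so the
# table keeps only the rows from transactions that were carried out in full.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Account (Id INTEGER PRIMARY KEY, Balance REAL NOT NULL)")
con.execute("INSERT INTO Account VALUES (1, 100.0)")
con.commit()

try:
    with con:  # one transaction: all or nothing
        con.execute("UPDATE Account SET Balance = Balance - 30 WHERE Id = 1")
        con.execute("INSERT INTO Account VALUES (2, NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the with-block rolled back both statements

balance = con.execute("SELECT Balance FROM Account WHERE Id = 1").fetchone()[0]
print(balance)  # 100.0 - the update was rolled back together with the insert
```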
A NoSQL = Not Only SQL database means it can use an SQL type query language, but usually does not. NoSQL databases are often designed to run on clusters, made as open source, and the database does not operate with a fixed schema structure but allows the addition of data without a pre-defined structure. A NoSQL database is characterized by BASE (Basic Availability, Soft state, Eventually consistent). An ACID system guarantees data consistency after each transaction; a BASE system guarantees data consistency within a reasonable period of time after each transaction. In other words, there is data consistency in the system, just not immediately. This leads on to the Soft State principle: if the data is not consistent at all times, the system must take a temporary data state into account. The sum of both these principles means that data accessibility is given very high priority, even if coincident errors occur in the database system, operating system or hardware. If parts of the database do not work, other parts of the database take over, so that data can always be accessed.
2. Grain (granularity) of data warehouse
The grain is the level of detail of data in a table, both in dimension tables by the hierarchy and in fact tables by the definition of the measurement event of interest, and the grain can later be expressed in terms of the dimensions. The lowest grain keeps all relevant data from the source systems and is therefore the most flexible approach, but it also takes the most storage space and can easily cost query performance. To improve query performance the grain can be lifted to a higher level where data is aggregated and summarized and therefore takes less storage space. Data can also be divided so that current year data is in daily grain, the previous year in weekly and older data in monthly grain, because very detailed information is normally not relevant for analysis years back in time.
Granularity of fact table can be divided into four types:
From a yearly to a monthly to a weekly to a daily grain, each level, e.g. daily, increases the granularity and the number of rows in the fact table.
Aggregation is the process of calculating summary data from detail base level table rows (records) and is a powerful tool for increasing query processing speed in data marts. For example, a sale is a fact with analysis variables and measures like quantity sold and amount, and dimensions like product, customer and date of purchase bring the sale into a context. When the grain of the sale is limited to a date (like December 23) and a time (like 4:30 pm), the fact is on Transaction grain. In case we drop the time, the measures would be called TotalQuantitySold and TotalAmount because they are the result of a summation of sales times to a sales date, and the fact is on Periodic grain; the same applies if we summarize the sales date to a weekly or monthly level. In case we grain the customers by aggregating them into segments and don't keep names and addresses, the fact becomes Accumulating/Aggregated grain. A Product dimension has a three level hierarchy of category name, brand name and product name, and therefore we can say that product has the highest level of detail. When a source system only gives a sales transaction with a brand name and no product name, the Product dimension must fit that grain of fact and take out the product name level, so brand name becomes the lowest level that matches the fact grain.
Instead of summing away sales times or sales dates and losing important information about the time of purchase, we can summarize data into two other fact tables: a weekly level for data from last year to this year, and a monthly level for older data, because when we go back in time we don't need to analyze on a daily or weekly level, and by aggregation we save hard disk space and improve query performance because fewer rows at month level need to be summarized to fetch the year level data.
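Lifting a daily sales fact to a monthly grain can be sketched with an in-memory SQLite database; the FactSales table and its rows are illustrative:

```python
import sqlite3

# Aggregating a daily sales fact to a monthly grain, sketched from the text;
# table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FactSales (SalesDate TEXT, Quantity INTEGER, Amount REAL)")
con.executemany("INSERT INTO FactSales VALUES (?, ?, ?)", [
    ("2023-12-23", 2, 200.0), ("2023-12-24", 1, 150.0), ("2024-01-05", 3, 330.0)])

# Monthly grain: fewer rows to scan when querying years back in time.
rows = con.execute("""
    SELECT strftime('%Y-%m', SalesDate) AS SalesMonth,
           SUM(Quantity) AS TotalQuantitySold,
           SUM(Amount)   AS TotalAmount
    FROM FactSales
    GROUP BY strftime('%Y-%m', SalesDate)
    ORDER BY SalesMonth
""").fetchall()
print(rows)  # [('2023-12', 3, 350.0), ('2024-01', 3, 330.0)]
```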
A business key has the same granularity and the same semantic meaning across a business organization.
Granularity is a description of the level of detail of data, e.g. the combination of customer and product is a fine level to tell me about sales; when I add store to the combination the granularity is driven one level lower, when I add salesperson it is driven another level lower and I know much more about sales, and when I add a date, or better a datetime, it brings the granularity to its lowest level, or the highest level of detail, of a sale.
The lowest level of aggregation or the highest level of detail is referred to as the grain of the fact table.
A fact table can contain fact data at a detail or aggregated level depending on the grain approach. A fact table can have five types of columns:
The ETL process has to secure the key, so fact rows are distinct. Fact tables in a dimensional model express the many-to-many relationships between dimensions and are implemented as one-to-many relationships between dimension tables and fact tables. I never have foreign key constraints on a fact table because they decrease insert performance, and I trust the ETL process and date range lookup, and no human being will be doing an update or delete of a dimension table or a fact table.
3.1. Types of facts
Let us characterize the various facts into different types of facts or more exactly different types of the columns in a fact table. Conforming facts means making agreements on common business metrics such as key performance indicators (KPI) across separated source systems so that these numbers can be compared mathematically for calculating differences and ratios.
Fully-Additive measure - summable across any dimension
A fact table has numerical measures that can be summed up across all of the dimensions in the fact table, so the measure columns' data type is a number. A sales fact is a good example of an additive fact with measures like Quantity sold and Amount. In case a transaction dataset loaded into a fact table has a measure column whose value is empty or null, use the default value 0 because this won't bother aggregation like summation. Each measure must have its metrics.
When it is a monetary measure, it may have a currency column, and if it is a unit measure it may have a column to explain the kind of units used like centimeters, litres, cubic metres etc. A fact can have a calculated measure or a derived measure based on existing measures and constants, e.g.:
Profit or Surplus = Revenue – Costs.
Semi-Additive measure - summable across some dimensions
A fact table has measures that can be summed up across some of the dimensions in the fact table but not across others. For example, a daily balance measure can be summed up through the customer dimension but not through the date dimension. Inventory levels cannot be summed across time periods.
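A small Python sketch (hypothetical balances) shows why the balance is semi-additive: summing across customers on one date is meaningful, while across dates an average is used instead:

```python
# Hypothetical daily balance snapshot rows: (snapshot_date, customer, balance)
balances = [
    ("2024-01-01", "Alice", 500.0),
    ("2024-01-01", "Bob", 300.0),
    ("2024-01-02", "Alice", 600.0),
    ("2024-01-02", "Bob", 300.0),
]

# Summable across the customer dimension: total balance on one date
total_jan1 = sum(b for d, c, b in balances if d == "2024-01-01")  # 800.0

# Not summable across the date dimension: use an average instead
alice = [b for d, c, b in balances if c == "Alice"]
alice_avg = sum(alice) / len(alice)  # 550.0
```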
Non-Additive measure - not summable for any dimension
A fact table has measures that cannot be summed up for any of the dimensions in the fact table. For example, a room temperature fact is non-additive, and summing the temperature across different times of the day produces a totally non-meaningful number. However, if we take an average of several temperatures during the day, we can produce the average temperature for the day, which is a meaningful number. A sum of a measure called DaysForOrderToCompletion in a FactOrder is meaningless, but finding the minimum, maximum and average values is meaningful for production planning.
Other examples: a percentage and a ratio are non-additive measures, as is a fact table that only contains columns with transaction numbers such as an order number, invoice number or a voucher number that can't be summed up. An order fact with a measure like Unit price for a single product makes no sense to summarize, but the derived column Amount = Unit price x Quantity can be summarized and becomes an additive column, called a calculated measure. Trend, Stock and Ranking can't be added, and in general neither can any calculation on one specific intersection of the dimensions. A Year-to-Date ytd measure can't be summed up; a count of rows is normally used. Unit price can be placed in a Product dimension as a current standard list unit price while still keeping Unit price in the fact together with a Discount (%) from when the purchase happened, occurred or took place. Back in 2016 I had a customer that used two kinds of discounts when a product had a bundle option, shown in an invoice line with the description »Discounted Bundle«:
Unit price = List price - Unit discount amount.
Line amount = Unit price x Quantity – (Unit price x Quantity x Line discount percentage / 100).
Some business users used the term List price other users used Catalog price.
When there was a promotion going on, the unit price had a different calculation:
Unit price = List price - Proration amount (also called deduction amount).
There was also a Line discount amount and an Order discount amount. I added all the columns into the Sales fact.
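The two discount formulas above can be checked with a small calculation; the numbers are made up for illustration:

```python
# Hypothetical invoice-line values illustrating the discount formulas above
list_price = 100.0
unit_discount_amount = 10.0
quantity = 5
line_discount_percentage = 20.0

# Unit price = List price - Unit discount amount
unit_price = list_price - unit_discount_amount                 # 90.0

# Line amount = Unit price x Quantity - (Unit price x Quantity x Line discount percentage / 100)
gross = unit_price * quantity                                  # 450.0
line_amount = gross - gross * line_discount_percentage / 100   # 360.0
```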
A conformed measure in multiple facts must use the same common business rule and definition so multiple facts can be united in a report or a cube. Several data marts can use fact data with the same names of fact tables or measure columns when they have compatible calculation methods and units of measure and support additivity across business processes. If a measure e.g. Revenue is used in multiple fact tables with different calculations and meanings, it is best to use different column names because these facts are not conformed. Figures in DKK '000 means in DKK thousand, where a revenue of 395,732 is 395,732,000 DKK.
A factless table contains no measures, no metrics and no numeric additive values as a normal fact table does; we have a measureless fact that records an event. For example, a fact table which has only columns for employee, date, time and event of work like »workstart«, »workstop« and »workstopofsickness« and no columns for a measure. You can get the number of employees working over a period by a »select count(distinct Employee_key)« or by a distinct row-count calculated measure in your OLAP cube. For a factless fact you will normally count the number of rows, row count or counting rows, and call it »Number of <a name>«.
Sometimes a factless fact has a value column called Count with only one value, 1, used in a data access tool to sum over and get the number of rows. In case the fact grain is weekly and a week is missing, it can be inserted to have all weeks complete, and here Count gets the value 0.
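A minimal sketch of counting in a factless fact, with hypothetical rows:

```python
# Hypothetical factless fact rows: (employee_key, date, event)
factless = [
    (1, "2024-01-02", "workstart"),
    (1, "2024-01-02", "workstop"),
    (2, "2024-01-02", "workstart"),
    (1, "2024-01-03", "workstart"),
]

# The equivalent of »select count(distinct Employee_key)« over the period
employees_working = len({employee_key for employee_key, _, _ in factless})  # 2

# Plain row count, the usual »Number of <a name>« measure
number_of_events = len(factless)  # 4
```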
A factless fact is for registration of an event or assignment, e.g. attendance taking place in a school class with dimensions for Student, Class, Room and Professor. If we add a measure column for Attendance with 1 or 0 per date per student it is not a factless fact anymore.
A factless fact can be used to represent a many-to-many relationship among dimensions. A factless fact can have a ValidFrom and a ValidTo column to tell the timeline/period when the fact was valid.
Capture a relationship in the fact
Whether to be a column in a dimension or to be its own dimension used in a fact is a good question. Kimball's financial services example starts with an Account dimension including data of products and branches, but he chose to remove these descriptive columns to form independent dimensions of Product and Branch and use them in a fact together with the Account. Therefore the fact captures a relationship among accounts, products and branches.
Another example is that an account can belong to two customers and a customer can have several accounts. This many-to-many relationship can be expressed in a factless fact or in a bridge table, see later. A bank transaction is done by one customer from an account and it is natural to have a Customer dimension in the fact.
Kimball says: »Demoting the correlations between dimensions into a fact table«, and I like to add: »With capture a relationship in the fact table we also keep the registered relationship at the time the event or transaction occurred and was entered into the fact table«.
An example: a retailer SuperChemi belongs to a chain Ethane at the date of sale, e.g. 2013-05-17, and the fact table has two dimension columns for Retailer and Chain. In 2015 the retailer SuperChemi changes to another chain, Propane, but we still keep the registered relationship back in 2013 and 2014 in the fact. When a chain is not part of the fact table and we in year 2016 would like to find sales of retailers for a specific chain, e.g. Propane, we will use the current relationship between dimension Retailer and dimension Chain as a snowflake dimension, meaning that SuperChemi belongs to Propane, and when we summarize sales per year since 2010, chain Propane will include the sales of SuperChemi in 2013 even though the retailer belonged to Ethane at that time.
Kimball also says: »Combine correlated dimensions into a single dimension«, e.g. Retailer and Chain into one dimension with a hierarchy Chain→Retailer, including history to handle that a retailer changes chain, changes name or changes data in other columns.
It is always a consideration whether to have one dimension or several separate dimensions, e.g. a many-to-many relationship between addresses for ship-to and bill-to or between name of sales rep and customer; it is best handled as separate dimensions, keeping both of them in a fact table row, read more in Kimball page 175-177.
Transactional fact or Transaction fact
A fact table that describes an event or operation that occurred at a point in time in a source system, e.g. an invoice line item. The row has a date, e.g. a transaction date, an entry date or a post date, or a datetime time stamp to represent the point in time together with other lowest-level data. Key values for dimensions are found at the transaction datetime.
If data needs to be changed, corrected or cancelled in the source system, then a data warehouse needs to make a counterpart row in the fact with a new time stamp or a time span. Source data examples: booking a hotel room, ordering a car in a special color, buying a lottery ticket with my lucky numbers, receiving an invoice, putting and picking items in stock (in danish lægge varer på lager og plukke varer fra lager).
In a data warehouse modeling process I try to interpret some data as transactional data or transactions. See more in section 6.4. Some data modelers do not like separate facts for each transaction type but build a single blended fact with a transaction type dimension, where a mix of other dimensions can make N/A dimensions.
I have seen a blended fact called FactEvent, which I think is a poor and non-descriptive name for a fact table, and the date dimension gets a generalized name. I prefer multiple facts where the date dimension is role-playing for e.g. order date, purchase date, shipping date and receipt date.
Snapshot fact versus Transactional fact
A snapshot fact represents a state, e.g. my bank balance right now is $500, and tomorrow it is $600 because I will deposit $100; therefore the latest date of balance contains the current balance. It is easy to see my balance ten days ago by looking it up at the SnapshotDate.
A transactional fact contains all my deposits and all my withdrawals with a date and an amount to record the »happening of an action«. At any point in time (normally today) I can calculate my balance, or the fact contains a running balance measure per transaction/per fact row.
A snapshot fact calculates the balance of my account at the end of each day because there can have been many deposits and withdrawals within a day. This snapshot fact can then easily be used to calculate the average daily balance for interest or fees. The balance in the snapshot fact is not additive, i.e. not able to be added/summed up into a meaningful metric; instead the balance is semi-additive, able to use aggregate functions like Avg, Min and Max.
If a fact table contains sales order lines and we update a line with changed data, add a new line or delete an existing line, the fact is a snapshot.
A periodic snapshot is one that represents states over time where order lines are never changed.
If we store a new complete image of the order lines when it is changed, it is an accumulating snapshot or what I call a discrepancy snapshot fact, see later.
If we store the difference between the original order line and the new order line (i.e. net change), it is a transactional fact.
Periodic snapshot fact
A fact table that describes the state of things at a particular instance of time or one point in time, and usually includes more semi-additive and non-additive measures. It is a table with data frozen in time, meaning a row will never be changed/modified/deleted (it is unchanged) because the row can have been used in a report like a monthly or annual report, and later it is a must to be able to create the same report with the exact same data. The periodic snapshot fact table will never be emptied or updated, therefore it is a true incremental load, and the table has a column like a Month (yyyymm) for a monthly basis or a Year to fetch the wanted data as a month slice or a year slice. When a month or a year is over and data is ready, data will be loaded.
Sometimes data for the current month is also loaded every day for a current-month-to-date; therefore the current month will be updated until it is over, finished and can be closed. Measures can be a balance of an account or inventory level of products in stock and so on. Key values for dimensions are found at the end of the period.
The fact table can be partitioned monthly for faster query performance when searching for a month. In case there is no data for a specific month it will be inserted as an artificial row with a »missing«/»nothing« value for the dimensions and 0 in the measures. (Periodic snapshot in danish »en skive data«, fastfrysning.)
Accumulating snapshot fact
A fact table that describes a process with milestones of multiple date and value columns with different names which will be filled out gradually to reflect the completion of events in a lifecycle process, representing the entire history. Therefore over time the same fact table row will be revisited and updated multiple times, where a default date key value is -1 for »hasn't happened yet« or »will be available later«.
Each stage of the lifecycle has its own columns e.g. milestones of a hospitalization or steps of the manufacturing of a product. In a retail store a product has three movements as ordered, received and sold that would be three date dimension columns in an accumulating snapshot fact.
A fact table where a row is a summary of measurement events occurring at predictable steps between the beginning and the end of a process. For example, a source system for current payments from customers where some pay several times over a month: the first payment becomes a new row in the fact table with the date of payment in columns BeginDate and EndDate and the amount in the Paid column. The next ETL process will summarize payments per customer from the BeginDate to the current date or end-of-month date, and then update the fact row with the same BeginDate with the new summarized payment and new EndDate, so a fact row will be revisited and updated multiple times over the »life« of the process (hence the name accumulating snapshot).
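The revisit-and-update behaviour can be sketched like this; the structure and column names are hypothetical:

```python
# Hypothetical accumulating snapshot: one row per customer payment process,
# revisited and updated as new payments arrive
fact = {}  # customer -> {"BeginDate": ..., "EndDate": ..., "Paid": ...}

def record_payment(customer, pay_date, amount):
    row = fact.get(customer)
    if row is None:
        # the first payment opens the row
        fact[customer] = {"BeginDate": pay_date, "EndDate": pay_date, "Paid": amount}
    else:
        # later payments revisit the same row: accumulate Paid and extend EndDate
        row["Paid"] += amount
        row["EndDate"] = pay_date

record_payment("C1", "2024-03-01", 100.0)
record_payment("C1", "2024-03-15", 50.0)
# fact["C1"] == {"BeginDate": "2024-03-01", "EndDate": "2024-03-15", "Paid": 150.0}
```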
Timespan accumulating snapshot fact or State oriented fact
A fact table where a fact is not a singular event in time but consists of multiple states or events that occur over time, so each row represents the state of an object during a period of time in which this state didn't change. It is similar to a type 2 dimension with snapshot start date, end date and current indicator/active flag where the row formerly-known-as-current will be revisited and updated. (Kimball Design Tip #145.) See more below for Time span fact.
Depending on the frequency and volume of changes, you may consider a »current only« version of the fact to improve performance of what will probably be the majority of the queries. Then you can use the fact to drive the loading of selected events in an accumulating snapshot fact.
Discrepancy snapshot fact
A snapshot describes the state of things at a particular time and when the state does change there is a discrepancy. It is one way to capture versions of the fact over time with no need to keep all the daily snapshots from periodic snapshot fact.
A fact table stores metrics and things that provide context for the metrics goes into dimension tables, therefore it is also possible to have a discrepancy snapshot dimension with discrete textual values and non-additive numeric values that can't be summarized, reminiscent of type 5 with a foreign key back to the main dimension.
Data entry action snapshot fact – a way of Change Data Capture CDC
A snapshot that keeps track of the state of things at a particular time as a logbook with three actions: Added for a new entry, Changed for an existing entry and Removed for a non-existing entry that has been deleted or marked deleted in the source system and is no longer part of the full payload to the data warehouse. This snapshot approach has a dimension for data entry action with the three mentioned action values (could also have been: Influx, Changed, Departure).
The data entry action snapshot fact can be used to show the added/gain values from the latest ETL process or the removed/lost values and in combination with changed values from one state to another state.
Formula: current = old current + added/influx – removed/departures.
(in danish: igangværende = forrige igangværende + tilgang – afgang, for bestand, nytilkomne, indgået og udgået, ikraft og udtrådt).
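Deriving the three data entry actions by comparing the previous full payload with the current one can be sketched as follows (hypothetical keys and values):

```python
# A minimal sketch of classifying Added/Changed/Removed by comparing
# the previous full payload with the current one (hypothetical data)
previous = {"K1": "red", "K2": "blue", "K3": "green"}
current = {"K1": "red", "K2": "navy", "K4": "black"}

added = [k for k in current if k not in previous]                          # influx
removed = [k for k in previous if k not in current]                        # departures
changed = [k for k in current if k in previous and current[k] != previous[k]]

# added == ["K4"], removed == ["K3"], changed == ["K2"]
```

The formula above then holds: the current count equals the previous count plus the added keys minus the removed keys.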
Derived fact or Additional fact
Kimball recommends fact data at the lowest detail grain possible to ensure maximum flexibility and extensibility; he calls it a base-level fact. A derived fact table is created for performing advanced mathematical calculations and complex transformations on a fact table, like for a specific KPI (Key Performance Indicator), or an aggregate fact with a summation of measures to a higher grain, like from date level to month level and from product level to brand level, using shrunken dimensions for Month and Category as dimensions lifted to a higher grain from the base-level dimensions Date and Product. A derived fact can be based on multiple fact tables for making faster ad hoc query performance, simplifying queries for analysts and providing a dataset to the Presentation interface area.
Aggregate fact or Summarized fact
A derived fact table that is created as a pre-calculated fact with computed summarized measures at a higher grain level of one or more dimensions to reduce storage space and query time greatly and eliminate incorrect queries. An aggregate fact is derived from a base-level fact, and measures in an aggregate fact are a computed summary of measures in the base-level fact.
Dimensions in an aggregate fact can be derived from base-level dimensions and are called shrunken dimensions because the values are rolled up to create fewer fact rows, e.g. a Date dimension becomes a Month dimension, a Product dimension becomes a Category dimension and an Address dimension becomes a Region dimension, and therefore the measures can be summarized to fewer fact rows for better query performance.
A Year-to-Date ytd fact where month February is a summing up or roll up of January and February and so forth. A last-year-this-year fact with a calculation of index compared to last year as a new column is easy to display in a report.
An aggregate fact table is a simple numeric roll up of atomic fact table data built solely to accelerate query performance. It is called incremental aggregation when an ETL process dynamically updates a table by applying only new or changed data without the need to empty the table and rebuild the aggregates.
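Incremental aggregation, applying only new rows to an existing aggregate without rebuilding it, can be sketched like this (hypothetical data and structures):

```python
# Existing aggregate at (month, product) grain: key -> total quantity
aggregate = {("2024-01", "Soap"): 10}

# New fact rows delivered since the last load: (day, product, quantity)
new_rows = [("2024-01-05", "Soap", 2), ("2024-02-01", "Brush", 1)]

# Apply only the new rows; existing totals are updated, new keys inserted
for day, product, qty in new_rows:
    key = (day[:7], product)  # shrink the date to month grain
    aggregate[key] = aggregate.get(key, 0) + qty

# aggregate == {("2024-01", "Soap"): 12, ("2024-02", "Brush"): 1}
```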
Consolidated fact
A fact table used to combine fact data from multiple source systems and multiple business processes together into a single consolidated fact table when they are expressed at the same grain. E.g. sales actuals can be consolidated with sales forecasts in a single fact table to make the task of analyzing actuals versus forecasts simple and fast, to compare how the year is going. A consolidated fact is a derived fact table that combines data from other facts.
A fact table contains several measures but only one or a few of them have a value in each fact row. Fact rows with the same dimension member repeated in multiple contiguous rows with identical values will be smashed or collapsed into one fact row using an operation such as sum, min or max to limit the number of rows in the fact.
Time span fact or history fact or historical fact
A fact table used for a source system that is regularly updated, meaning that the source changes and overwrites its values. To capture a continuous time span when the fact row was effective, the fact table will act as an SCD type 2 dimension with BeginAt and EndAt columns to keep historical data and to represent the span of time when the fact row was the »current truth«, and with a query it is easy to fetch it at a point in time. This is called Slowly Changing Facts. See more in section 6.4.
Counterpart fact (negating fact) and Transactional fact
A fact table used for Slowly Changing Facts because the source system changes fact values without keeping the old values as historical transactions. See more in section 6.4.
Column wise fact and Row wise fact
A column wise pivoted fact table is useful when the values are to be columns in a report, e.g.
Revenue Jan, Revenue Feb, Cost Jan, Cost Feb, Sales Jan, Sales Feb.
For a cube a row wise fact is much better because it gives good dimensions, e.g.
Period, Entry, Amount.
Therefore a data warehouse needs to convert from columns to rows or vice versa.
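The conversion from column wise to row wise can be sketched as an unpivot; the report columns below are the hypothetical ones from the example above:

```python
# One column wise fact row: "measure period" column names with amounts
column_wise = {"Revenue Jan": 100, "Revenue Feb": 120, "Cost Jan": 60, "Cost Feb": 70}

# Unpivot to row wise (Period, Entry, Amount) rows
row_wise = []
for column, amount in column_wise.items():
    entry, period = column.rsplit(" ", 1)  # "Revenue Jan" -> ("Revenue", "Jan")
    row_wise.append((period, entry, amount))

# [("Jan", "Revenue", 100), ("Feb", "Revenue", 120), ("Jan", "Cost", 60), ("Feb", "Cost", 70)]
```

The reverse direction (pivot) groups the rows back by period and entry name.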
A fact table contains a huge number of rows when a period, e.g. from a »date of employment« to a »termination date« as columns in one row, is turned around to many rows with one row per day of the period, or per week or per month depending on the wanted grain of the period.
When an employee has a 10 year anniversary it will make more than 3650 rows at day grain, and if the employee dimension keeps history of different names, departments and job functions etc. for each staff member, then the fact rows will have different key references to the dimension.
For wind turbine power production the grain would be 15 minutes, which gives 96 rows per day per wind turbine.
When the exploded fact is a source for an OLAP cube it can sometimes be implemented as a view in the database, and when it is for ad hoc reporting and will be used several times per day it must be a materialized view stored in a table or sometimes an indexed view.
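Exploding a period into one row per day, as described above, can be sketched like this (hypothetical employee data):

```python
from datetime import date, timedelta

# Turn one period row (employee, start, end) into one fact row per day
def explode_period(employee, start, end):
    day = start
    while day <= end:
        yield (employee, day)
        day += timedelta(days=1)

rows = list(explode_period("E1", date(2024, 1, 1), date(2024, 1, 3)))
# [("E1", date(2024, 1, 1)), ("E1", date(2024, 1, 2)), ("E1", date(2024, 1, 3))]
```

Stepping by a week or a month instead of a day gives the coarser grains mentioned above.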
3.2. Other fact classifications
Transaction has one row per transaction, recorded when it occurs, together with a datetime.
Periodic has one row for a group of transactions made over a period of time through summation, like from daily grain to monthly grain, so the Date dimension is represented by the month level with the first day of each month. Notice that certain dimensions are not defined when compared to a transaction fact, such as dimension TransactionType.
Accumulating/Aggregated has one row for the entire lifetime of an event and therefore constantly updated over time. For example, an application for a bank loan until it is accepted or rejected or a customer or working relationship. These fact tables are typically used for short-lived processes and not constant event-based processes, such as bank transactions.
Mostly a fact table describes what has happened over a period of time and therefore contains additive or cumulative facts. An example is a status column in a fact table that receives data from a school system where a student follows a course and later finishes it, but sometimes a student skips the course and is deleted in the system. Before reloading the fact it can be a good idea to have a CourseStatus column with values like: Active, Completed or Dropped.
A dimension table contains business element context, its columns contain element descriptions, and a dimension is referenced by multiple fact tables so the measurements they contain make sense.
4.1. Purpose of a dimension
Some purposes as I see them:
The dimension values can be placed in a hierarchy like a Location with three levels Country→Region→City. A dimension can have several separate and independent hierarchies with different numbers of levels.
Grouping or Band
The dimension values can be placed in a group, grouping values in band intervals like person ages in custom buckets, like an Age group column with intervals of Child (0-9), Tween (10-12), Teenage (13-19), Young adult (20-29), Adult (30-66) and Senior citizen (67-130).
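Assigning an age to a band can be sketched with a simple lookup over the bucket boundaries listed above:

```python
# The age bands from the text as (low, high, name) boundaries
BANDS = [(0, 9, "Child"), (10, 12, "Tween"), (13, 19, "Teenage"),
         (20, 29, "Young adult"), (30, 66, "Adult"), (67, 130, "Senior citizen")]

def age_group(age):
    for low, high, name in BANDS:
        if low <= age <= high:
            return name
    return "Unknown"

age_group(35)  # "Adult"
```

In practice the band assignment is done once in the ETL process so the Age group column is stored in the dimension, not computed at query time.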
A dimension normally contains one or multiple hierarchies and/or groups/bands to fulfill requirements from the users. E.g. a Product dimension has two hierarchies (here called group) and one band to divide the prices:
A dimension can of course be non-hierarchical and non-grouping/non-band.
Different dimensionality covers the issue that not all combinations of multiple dimension values are allowed in a fact, and the data warehouse needs to make sure of the data quality.
For the data lineage a dimension table will include metadata columns from the Archive area as RecordSource, ArcRecordId and ArcGlobalId for traceability back to the archive.
4.2. Dimension keys
A dimension table has a minimum of three types of columns:
Primary key is a surrogate key identity column as a unique sequence number to remove dependence from the source system and for use as a fact table foreign key. It can be called Entity key EK because it represents an entity in source data, or Surrogate key SK, but for me »surrogate« is a characteristic or a property, not a name of a column. The primary key must not be data-bearing (in danish databærende), it must be meaningless, but for a date dimension and a time dimension I like to use a smart valued primary key, e.g. a date 2013-12-31 as an integer value 20131231 and a time 08:30 am as an integer value 830 or 08:30 pm as 2030.
Business key is from a source system and can be a primary key or another column that has unique values. In section 1.1 I used the terms natural key and business key, and from a data warehouse point of view they are merged into one term, business key. For example, Social Security number (SSN), StatusCode and CustomerNumber. The key value is mutable (changeable) and meaningful for a human being, which a business user prefers to use to identify a thing and as a search lookup value given in a phone call to a company, a bank, a hospital or to the government. A business key has an embedded meaning and represents a unique object in the business. An alias is an Enterprise wide business key because the same value is used in multiple source systems. If a business key is immutable (unchangeable) in the source system then call it a durable business key.
Surrogate key is from a source system and is most often a primary key as an auto-generated unique sequence number or identity column (id, uid unique identification), that is immutable and meaningless for a human being.
Textual data represents the dimension value context description and is saved in columns of the dimension table and will be shown to the users as descriptive columns to explain fact data.
Multiple source systems have a business key for a Social security number per person with a primary key as a surrogate sequence number; therefore we can't use the primary keys to interconnect (join, drill-across) the systems' data, instead we use the business keys. The value of a business key can change over time, e.g. a person obtains witness protection and gets a new Social security number; therefore it is up to the ETL process to make a solid mapping. When a business key is an employee number, and the employee resigns and some years later is rehired, there could be two employee numbers, and a data warehouse must find a way to glue or map them together, so the employee only occurs once in a dimension.
4.3. Changing dimensions
Source data is volatile because it will change over time, e.g. a customer changes his name or address, or a product changes place in a hierarchical structure as a result of a reorganization. Therefore dimension values will normally be changed over time because of the volatility in source systems. The rate of changes can be divided into two kinds of classifications, and afterwards we will look into techniques to handle and track changes, to capture their history and preserve the life cycle of source data, also called Change Data Capture CDC.
Slowly Changing Dimensions SCD
Columns of a dimension that undergo changes over time. It depends on the business requirement whether a particular column's history of changes should be preserved in the data warehouse or data mart. (In danish Stille og rolig ændrede (skiftende, foranderlige) dimensioner eller Dimensioner med langsomt opbyggende historik.)
Rapidly Changing Dimensions RCD
A dimension column that changes frequently. If you do need to track the changes, using a standard Slowly Changing Dimensions technique can result in a huge inflation of the size of the dimension. One solution is to move the column to its own dimension with a separate foreign key in the fact table. Rapidly changing data usually indicate the presence of a business process that should be tracked as a separate dimension in a separate historical data table.
Techniques or methods to handle dimension values that change over time in the source systems are called Ralph Kimball's eight types or approaches of SCD:
Type 0: Original value, where the dimension value never changes (the method is passive), meaning we keep the original value from the source system, and when the value is changed in the source system, we don't change the value in the dimension. A single fixed value does not change over time but can be corrected in case of an error in a source system, e.g. a correction of a misspelling of a name or a wrong Date of Birth, Date of Issue, Date of Expiration, Date of Launching something or Date of First purchase. »Retain original«.
Type 1: Current value, where the old dimension value is changed and forgotten when the value changes in the source system. The history of data values is lost forever. This is the active, actual, present, newest or latest value, the most recent indicator. A fact table refers to the most recent dimension value, as-is. (In danish aktuelle, nuværende, gældende, seneste værdi). »Current by overwrite«. There is a one-to-one relationship between the business key and the surrogate key identity column as primary key of the dimension.
Type 2: Keep all values, where a new dimension value is inserted into a new row to have an unlimited history of dimension values over time, marked by timeline columns Effective date and Expiration date (Active date, Expired date, Expiry date) (or StartDate and StopDate or BeginDate and EndDate or ValidFrom and ValidTo), a pair of data type »date« or »datetime« that represents the span of time when a value was the »current truth«. A fact table refers to the dimension value in effect when the fact data occurred, as-was, often by a date column in the fact table based on source data or by the current load insert date when the fact data was entered into the fact table or was created, born or occurred. (In danish oprindelige værdi). The value of a business key will be repeated every time the textual data changes, therefore the primary key is a surrogate key identity column as a unique sequence number. A view upon the dimension will provide the current values. A view upon the fact will provide the current keys to join to the dimension view. A column called IsCurrent has two values: 0 for historical and 1 for current to mark each data row of a type 2 dimension. This is the technique for Slowly Changing Dimension. »Keep history in rows as full history«. There is a one-to-many relationship between the business key and the surrogate key identity column as primary key of the dimension.
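A minimal sketch of a type 2 load for one business key; column names like ValidFrom, ValidTo and IsCurrent follow the naming options mentioned above, and the data is hypothetical:

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # conventional open-ended ValidTo
dim = []  # the dimension table as a list of rows

def scd2_merge(business_key, name, load_date):
    current = next((r for r in dim
                    if r["BusinessKey"] == business_key and r["IsCurrent"] == 1), None)
    if current is not None and current["Name"] == name:
        return  # no change, nothing to do
    if current is not None:
        # expire the row formerly known as current
        current["ValidTo"] = load_date
        current["IsCurrent"] = 0
    # insert the new current row with a new surrogate key
    dim.append({"Key": len(dim) + 1, "BusinessKey": business_key, "Name": name,
                "ValidFrom": load_date, "ValidTo": OPEN_END, "IsCurrent": 1})

scd2_merge("C1", "Ann", date(2024, 1, 1))
scd2_merge("C1", "Ann Smith", date(2024, 6, 1))  # name change -> expire + new row
# dim now holds two rows: the expired original and the current one
```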
Type 3: Keep the last value, where the previous dimension value is stored in a Previous column (or Historical column) and the current dimension value is stored in a Current column. An extended example of a type 3 dimension for a customer contains three columns for postal codes. The column names are Current Postal Code, Previous Postal Code and Oldest Postal Code. When a customer address has changed to a new postal code, the ETL process will move the value from the Previous column into the Oldest column, move the value from the Current column into the Previous column, and add the new value into the Current column. This is the technique for Slowly Changing Dimension. »Keep history in columns as partial history«.
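The type 3 shift can be done in one UPDATE, because all SET expressions read the values as they were before the statement ran; the table names (DimCustomer, StageCustomer) are assumed:

```sql
-- Shift the postal code history one column back, then write the new value.
UPDATE d
SET    d.OldestPostalCode   = d.PreviousPostalCode,
       d.PreviousPostalCode = d.CurrentPostalCode,
       d.CurrentPostalCode  = s.PostalCode
FROM   DimCustomer d
JOIN   StageCustomer s ON s.CustomerId = d.CustomerId
WHERE  d.CurrentPostalCode <> s.PostalCode;
```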
Type 4: When a group of columns in a dimension changes often, we can split the dimension into a base dimension that holds the slowly changing columns and one or more separate Mini Dimensions that hold the often changing columns (volatile data values) to keep the fast changing values in their own table. This is the technique for Rapidly Changing Dimension to store all historical changes in separate historical data tables. The base dimension can be either a type 1 or type 2 and the mini dimension becomes a type 2. A fact table contains a foreign key to the base dimension and an extra foreign key to a mini dimension. We get a »capture a relationship in the fact«. »Keep history in tables«.
Type 5: Builds on type 4, where a mini dimension gets a view to show the current rows of the mini dimension and the base dimension is extended with a key value that points to the current view, and the key becomes a type 1 column outrigger in the base. We get a »capture a relationship in the dimensions« because we only need to join the base dimension and the mini dimension without including the fact, to save query performance. The join can be implemented in a view that is used in PIA. The ETL process must update/overwrite the type 1 column in the base whenever the current mini dimension changes over time. Therefore type 4 + 1 = 5. »Outrigger current dimension«.
Type 6: A mixture of type 1 and type 2 columns, therefore it is a good idea to suffix columns with _t1 and _t2 to know which columns can be overwritten in the current row. It can also have columns of type 3, therefore type 3 + 2 + 1 = 6. Type 6 acts as type 2 by tracking changes with a new row for each new version, but type 6 also updates the _t1 columns on the previous row versions to reflect the current state of data, using the business key to join the new row with the previous rows. »Hybrid«.
Type 7: All rows follow type 2 to keep track of historical values with a key column, and an extra key column called a durable key follows type 1 for the current value. The durable key is an integer representation of the business key. The fact table contains dual foreign keys for a given dimension (its key and its durable key) to show the historical value, or a better term the registered value, of the dimension at the time when the fact data was entered into the fact table or was created, born or occurred (in danish oprindelige værdi), and to show the current value of the dimension (in danish aktuelle, nuværende, gældende, seneste værdi). A view upon the dimension will provide the current values with the durable key to join to the fact durable key. »Dual Type 1 and Type 2 Dimensions«.
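A sketch of the type 7 current view, with assumed names; the fact joins its durable key column to this view for as-is reporting, and its ordinary key column to the full dimension for as-was reporting:

```sql
-- One row per customer: the current version of a type 2 dimension,
-- exposed through the durable key.
CREATE VIEW DimCustomerCurrent AS
SELECT Customer_dkey,            -- durable key, stable over all versions
       Name, PostalCode, City    -- current values only
FROM   DimCustomer
WHERE  IsCurrent = 1;
```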
Correction of data together with the types
For type 0 I mentioned »correction of a misspelling«, e.g. a first name of a customer, Kelli, is corrected to Kelly after a few days in the source system. For type 2 or type 7 this will automatically mean an extra row in the dimension, but I think it needs some consideration, especially if a source system can inform about the correction; maybe the dimension can instead update the row as type 1 and mark the row with a metadata column IsCorrected.
Section 6.2 will show an example of type 1 and type 2. Section 6.3 will show an example of type 7 that has historical rows with a current mirror. Here is a type 6 example: In a dimension table the columns can be a mix of type 1 and type 2, e.g. a Customer dimension where the columns for customer name, street name and house or apartment number are type 1 because we only need the recent value for shipment, and the columns for postal code (zip code) and city are type 2 because we want to track these changes and keep data for the city where a customer was living when purchasing a product. A »sales in cities« report for the last ten years will use the right city of the customer at the time the purchase happened, occurred or took place.
4.4. Types of dimensions
Let us characterize the various dimensions into different types of dimensions.
Conformed dimension or Shared dimension or Common dimension
A conformed dimension has the same meaning to every fact in multiple data marts, and measures will be categorized and described in the same way, ensuring consistent reporting across the data warehouse. A conformed dimension is a consistent interface to make sure that data can be combined in a data warehouse and be used all over the business, because the values of a dimension mean the same thing in each fact. With a conformed dimension we can combine and drill across from fact to fact in one data mart or over several data marts, and analyze common columns and values. Separate fact tables can be used together with shared, common and conformed dimensions.
Conforming data from several source systems is part of the integration to achieve a conformed dimension: data with different meanings and different columns must be compared against each other, rules must be set, and data must be cleansed to create a single version of the entity. Conformed dimensions unite and integrate data values among multiple source systems so it is easy to search across different types of data and sync them in a common report. A shared dimension is utilized in multiple fact tables in a data mart or across multiple data marts.
Dimension values come from either the source systems or are built by business rules in a Usage supporting database. A non-conformed dimension can only be used within one fact. It is part of the ETL process to do conforming by merging, uniting and consolidating multiple source system data across the enterprise for making a conformed dimension, e.g. a Customer dimension from different business areas such as order, sale, invoice, delivery and support service (for B2B and B2C) with both different customers and the same customers, with different business key values, and with different addresses like a shipping address and a billing address. Sometimes data from a conformed dimension is sent back to the source systems as master data to be reusable in the organization.
The dimension values can be placed in a hierarchy like a Location with three levels Country→Region→City. A dimension can have several separate and independent hierarchies with different numbers of levels.
The dimension values can be placed in a group that collects values in band intervals, like person ages in custom buckets in an Age group column with the intervals Child (0-9), Tween (10-12), Teenage (13-19), Young adult (20-29), Adult (30-66) and Senior citizen (67-130).
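Such band intervals are typically derived with a CASE expression in the ETL process; a sketch with an assumed staging table and Age column:

```sql
SELECT Age,
       CASE WHEN Age BETWEEN  0 AND   9 THEN 'Child (0-9)'
            WHEN Age BETWEEN 10 AND  12 THEN 'Tween (10-12)'
            WHEN Age BETWEEN 13 AND  19 THEN 'Teenage (13-19)'
            WHEN Age BETWEEN 20 AND  29 THEN 'Young adult (20-29)'
            WHEN Age BETWEEN 30 AND  66 THEN 'Adult (30-66)'
            WHEN Age BETWEEN 67 AND 130 THEN 'Senior citizen (67-130)'
       END AS [Age group]
FROM   StagePerson;
```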
Data classification is the process of organizing data into categories. A group or category is part of a categorization or grouping of data that makes a dimension more user-friendly by showing the data of a dimension at an aggregated and summarized level.
A dimension normally contains one or multiple hierarchies and/or groups to fulfill requirements from the users. A dimension can of course be non-hierarchical and non-grouping.
Date dimension or Calendar dimension
A very common dimension with the granularity of a single day with hierarchies as:
Year→Half year→Quarter→Month→Date (five levels) or
Year→Week→Date (three levels) because a week does not always belong to one month.
When a fact date is connected to the Date dimension it is easy to make a report based on month, quarter or yearly level or a search filter criteria as Q3 2013.
Dates or days can be grouped to a Season or TimeOfYear: Winter (December January February), Spring (March April May), Summer (June July August) and Fall (September October November) (in danish Årstider).
The Date dimension normally has many supporting columns like Weekday name and Month name, and an IsHoliday column with a data type of boolean/bit with two values True (1) and False (0) as a flag indicator; but to be used in a presentation tool it is better to have a textual column DayOnOff with the values »Holiday« and »Working day«. Of course the rule of when a specific date is a holiday depends on country, collective agreement, company policy etc. and is programmed into the stored procedure that builds the date dimension.
A fact table often has at least one column that represents a date, e.g. an entry date, order date, invoice date, shipping date. When an event happens it normally has a date or multiple dates, like an injury with an InjuryDate, a ReceivedDate at the insurance company, a RulingDate, a TreatmentDate, a BillingDate and a PaymentDate. The Date dimension is a role-playing dimension and therefore the data mart contains multiple views upon the date dimension, where each view can join to each date column in the fact table. Each view has a good name like the column in the fact table, and the columns of the Date dimension are renamed to unique names for each view.
Using an integer value as surrogate key identity column in format yyyymmdd e.g. 20131224 for Christmas day, where the fact table will contain useful information that can be used in a query like:
»between 20131201 and 20131231« for month of December 2013.
Using integer key values to handle these five artifacts:
»Missing« (-1) when a source system provides a date value that is null, meaning it is not present in some data, not registered, not reported, »hasn’t happened yet« or will »be available later« because the date to be determined is expected to be available later and the fact table will be updated accordingly. Therefore the fact table column for the date dimension gets the value -1 for »Missing date«. (In danish Mangler eller Ubestemt).
»Not available« (-2) when a source system provides a date value that is not known and not found in the data warehouse, because the date value does not exist in the data warehouse, e.g. date value 1234-05-06. Therefore the fact table column for the date dimension gets the value -2 for »Unknown date«. (In danish data værdien er ikke tilgængelig, værdien eksisterer ikke og er ikke til rådighed i dimensionen, ukendt).
»Not applicable« (-3) when the dimension is irrelevant for the fact row (N/A, N.A., NOAP or »Irrelevant«). Therefore the fact table column for the date dimension gets the value -3 for »Irrelevant date«. (In danish ikke anvendelig, ikke relevant, er ikke krævet).
»Wrong«, »Bad«, »Corrupt«, »Dirty« (-4) when a source system provides a wrong or bad date, e.g. a year of birth as 2099 or 2017-02-29 because year 2017 is not a leap year. Therefore the fact table column for the date dimension gets the value -4 for »Wrong date«. Please read more about it in section 4.5.
»Nothing« (-5) when we know that a date is not missing and therefore a date is not required; the fact table column for the date dimension gets the value -5 for »No date«. When the date dimension is role-playing for a BirthDate, value -5 might stand for »Prefer not to say« or »No answer« or in a general way »No data«. Sometimes it is fine that there is no value (null/empty) because the value is not required, therefore I am using -5 for »Nothing« (in danish Ingen, Ingen tildelt, Ingenting, Intet, Ikke oplyst eller Uden data).
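In the fact load the artifact keys can be assigned with a CASE expression around the dimension lookup; a T-SQL sketch with assumed names (StageSales, DimDate), where -3 and -5 would come from business rules not shown here:

```sql
SELECT Date_key =
       CASE WHEN s.OrderDate IS NULL                    THEN -1  -- Missing
            WHEN TRY_CONVERT(date, s.OrderDate) IS NULL THEN -4  -- Wrong
            WHEN d.Date_key IS NULL                     THEN -2  -- Not available
            ELSE d.Date_key
       END
FROM   StageSales s
LEFT JOIN DimDate d ON d.[Date] = TRY_CONVERT(date, s.OrderDate);
```

Note the order of the CASE branches: a value that fails conversion must be caught as »Wrong« before the failed lookup would otherwise report it as »Not available«.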
When a fact table gets many millions of rows it can be a good idea to use partitioning, a behind-the-scenes technique to segment a fact table into smaller tables, and in most cases it is based on the column of the date dimension. Therefore the date key uses data type date instead of an integer key to make the partitioning of a fact table work best, but that means we have to use specific dates for:
»Missing« as 0001-01-01.
»Not available« as 0002-01-01.
»Not applicable« as 0003-01-01.
»Wrong« as 0004-01-01.
»Nothing« as 0005-01-01.
A future undetermined date is identified as 9999-12-31.
An artifact value in a dimension table ensures no null values in the foreign keys of a fact table by maintaining referential integrity from the fact table to the dimension table. (In danish artifact stands for et kunstprodukt, en genstand eller et fænomen der er skabt af mennesker ofte med et unaturligt eller kunstigt præg [kunstpause]).
Time dimension or Time-of-day dimension
A very common dimension with hierarchies such as:
Hour→Minute→Second (three levels) or
Hour→Minute (two levels)
The time dimension has the granularity of a single second with 86400 rows or of a single minute with 1440 rows.
Time dimension values can be placed in a group that rolls up time periods into more summarized business-specific time groupings or band intervals, e.g. Time division or Time interval (»døgn inddeling«):
Morning (6 am to 8 am)
Rush hour (8 am to 11.30 am and 4 pm to 6 pm)
Lunch hour (11.30 am to 1 pm)
Afternoon hour (1 pm to 4 pm)
Dinner hour (6 pm to 8 pm)
Evening hour (8 pm to 11 pm)
Night hour (11 pm to 6 am)
I like to use an integer value (a kind of military time) as surrogate key identity column in format hhmmss e.g. 0 for 00:00:00, 63000 for 6:30:00 am and 202530 for 8:25:30 pm because 8 pm is also called 20 o’clock in the evening.
When the Time dimension has granularity of a minute, an integer value 10 stands for 00:10 and an integer value 2359 stands for 23:59.
A duration, e.g. 5:15 for minutes/seconds, is sometimes written as 5.15 or as 5'15".
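The hhmmss integer time key can be computed from a datetime column as in this T-SQL sketch (the column TransactionDatetime and table StageSales are assumed), with null handled as -1 for »Missing time«:

```sql
SELECT Time_key =
       CASE WHEN TransactionDatetime IS NULL THEN -1
            ELSE DATEPART(hour,   TransactionDatetime) * 10000
               + DATEPART(minute, TransactionDatetime) * 100
               + DATEPART(second, TransactionDatetime)
       END   -- 8:25:30 pm gives 202530
FROM   StageSales;
```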
Using integer key values to handle these five cases:
· Missing is represented with the value -1 »Missing time«.
· Unknown is -2 »Unknown time« if the source provides a time as 24:12:06.
· Not applicable is -3 »Irrelevant time«.
· Wrong is -4 »Wrong time« if the source provides a time as "late afternoon" or "when the sun goes down" or "the twilight hour" (in danish sen eftermiddag, når solen går på hæld, skumringstimen).
· Nothing is -5 »No time« when we know that a time is not missing and therefore a time is not required.
In an ETL process for making fact data rows, developed in an SSIS package, the column Time_key with data type smallint will be created as a derived column in the pipeline, where a null value becomes -1 and time 10:34:45 becomes 1034, expression:
ISNULL(TransactionDatetime) ? (DT_I2)-1 : (DT_I2)DATEPART("hh",TransactionDatetime) * 100 + DATEPART("mi",TransactionDatetime).
Regular dimension or Balanced dimension
All dimension members or values (or branches in the hierarchy) have the same number of levels, which makes the dimension symmetrical or balanced, such as the Date dimension. A regular dimension has a flattened denormalized structure.
Fixed-depth hierarchy: Country→Region→City where the three levels are in separate columns. Several columns in the dimension table are not at third normal form (3NF), therefore the dimension table contains redundant or duplicate data by nature.
A Product dimension table can have columns like ProductId, Code, Name and other descriptive columns, a Category because products are divided into categories and each category has a Target group. When a category belongs to multiple products, the target group will be repeated, meaning it does not satisfy third normal form (3NF), but it is okay for a regular dimension.
Example of a Product dimension with hierarchy Target group→Category→Product:
There are three levels in the dimension hierarchy and it is balanced, meaning that each Product has a Category and each Category has a Target group, so the dimension hierarchy’s maximum dimensionality is 3 (level of dimensionality).
For the Product dimension the product is the lowest level of granularity, and there are two many-to-one relationships because many products roll up to a single category and many categories roll up to a single target group.
Ragged dimension
A ragged dimension has leaf members (the last level of a hierarchy) that appear at different levels of the hierarchy and therefore contains branches with varying depths and numbers of levels.
Variable-depth hierarchy: Europe→Denmark→Copenhagen with three levels and
North America→United States→California→Sacramento with four levels representing continent→country→state→capital. A ragged dimension is implemented as a Parent-child structure or with a bridge.
To keep a hierarchy structure as a fixed level balanced regular dimension with a fixed-depth hierarchy I can make a dummy level for a »not applicable« state of Denmark and get this: Europe→Denmark→N/A→Copenhagen.
Denmark and France are divided into regions, the Philippines is divided into provinces and Germany has states/lands; therefore the name of the level »State« could be changed to a broader word like »Division« which covers the different countries' areas.
Parent-child dimension
A parent-child dimension is used to model a flexible hierarchical structure where some dimension values have different levels of hierarchies, called an unbalanced or variable-depth hierarchy. Every value in the dimension has a related parent (mother) value, except the top value. The hierarchy is asymmetrical or unbalanced because values are placed in different levels within the same hierarchy.
For example, an Employee dimension where the parent is the manager and the children are the employees under the manager, and some employees are both a manager and have another manager above, and the chain of command can be different from department to department.
Another example is an Organization dimension where some departments have sub-departments and some teams have sub-teams, but there are also teams that don’t have sub-teams. This is the strongest side of Parent-child dimension to modeling.
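A parent-child dimension is typically unfolded with a recursive common table expression; a sketch with assumed names, where Manager_key is null for the top value:

```sql
WITH Chain AS
(
    -- Anchor: the top of the hierarchy.
    SELECT Employee_key, Name, Manager_key, 1 AS Lvl
    FROM   DimEmployee
    WHERE  Manager_key IS NULL
    UNION ALL
    -- Recursion: employees reporting to someone already in the chain.
    SELECT e.Employee_key, e.Name, e.Manager_key, c.Lvl + 1
    FROM   DimEmployee e
    JOIN   Chain c ON c.Employee_key = e.Manager_key
)
SELECT Employee_key, Name, Lvl
FROM   Chain;
```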
Snowflake dimension
Dimensions in a dimensional model comply with second normal form (2NF), and snowflaking is a normalization of a dimension.
a) Splitting columns of a dimension table into smaller dimension tables with one-to-many relationships so data values fulfill and comply with 3NF and BCNF to avoid redundant data and denormalized structure. Snowflaking is normalization to 3NF.
b) Splitting a dimension hierarchy into two or more dimension tables is called »snowflake a hierarchy«. For example, a Customer dimension with the hierarchy:
Country→Region→City will be split into three dimension tables so only the column City remains in the Customer dimension table that is connected to the Sales fact table, and two Snowflake dimension tables for Region and for Country, with one-to-many from Country to Region and one-to-many from Region to Customer. Region and Country can be reused in other dimensions for Supplier and Store, or in a FactSales to quickly divide in regions and countries. Of course depending on the business requirements specification.
c) Splitting a dimension with columns from multiple source systems or when some columns will be user type-in from an application to enrich data and to avoid redundant data. See an example in section 6.1.
d) A Product dimension with a Category hierarchy is used in a Sales fact and an Inventory fact. For forecasting (budgeting) the data is generated at Category level. Kimball will not snowflake the Product dimension; instead he rolls up to a Category dimension as a strict subset of Product, which the ETL process must take care of.
e) Splitting dimensions and moving common columns to a general dimension is called an Outrigger dimension. For example, an Address dimension with street name, house or apartment number, postal code, city, county, area, region, country, GPS coordinates and so forth to be used in other dimensions, e.g. a Customer dimension with a shipping address and a billing address, a Building dimension with a location address and an Employee dimension with a home address. The Address dimension contains one row per unique address. A Sales fact can have a ShippingAddress_key and a BillingAddress_key as role-playing dimensions. Or a factless fact table with Customer_key, ShippingAddress_key, BillingAddress_key and effective and expiration dates for tracking the addresses of a customer. Kimball page 235. See more later.
f) A Shrunken dimension is not a snowflake dimension, see more later.
g) Splitting a dimension with rapidly changing columns is not snowflaking, see Mini.
h) Kimball’s financial services example starts with an Account dimension including data of products and branches, but he chooses to remove these descriptive columns to form independent dimensions of Product and Branch, and not snowflake the Account but instead add them to the fact, because that’s the way business users think of them too. The fact will capture the relationship among data entities.
Some data modelers do not like the Snowflake dimension because of the query performance with more tables in the join. I don’t use the terms snowflake schema and starflake schema because for me snowflake belongs to a dimension, and sometimes it is great to have a denormalized structure and other times it is good to think of the above points. Most times a data mart will be a constellation schema (multi-star schema) where some dimensions are shared among different facts.
Now and then a data warehouse architecture has an extra layer after the data mart, called the presentation area, where the ETL process or a view has joined and merged snowflaked dimensions together into only one dimension with all the columns and a lot of data redundancy. That is fine, since this layer will be refilled in every ETL process and it is easier for the users to do their queries and for the tool to load data into its data model. It can use a materialized view. The layer is also called the data consumption layer or data delivery layer for reporting, analysis and self-service BI.
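The joined-together dimension in the presentation area can be a view like this sketch, assuming snowflaked Customer, Region and Country tables with the key columns shown:

```sql
-- Denormalize the snowflaked hierarchy back to one wide dimension.
CREATE VIEW DimCustomerFlat AS
SELECT cu.Customer_key, cu.Name, cu.City,
       r.RegionName, co.CountryName
FROM   DimCustomer cu
JOIN   DimRegion  r  ON r.Region_key   = cu.Region_key
JOIN   DimCountry co ON co.Country_key = r.Country_key;
```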
Outrigger dimension or Reference dimension
When many columns belong logically together in a cluster or group it is fine to do snowflaking to avoid repeating a large set of data and thereby make a dimension smaller and stable. Sometimes a canoe or a sailboat uses a rig to achieve balance because they are very narrow, and likewise a cluster of columns is placed in an Outrigger dimension or a reference dimension, because its primary key will be a foreign key in the main dimension, but there is no reference found in any fact table.
For example, a Product dimension has a column called LaunchDate that represents the date when the product will be applicable for the customers, and from that date the product can appear in the Sales fact table. The LaunchDate column is a foreign key to a Date dimension that becomes an Outrigger dimension because the Date dimension has many columns about dates, weeks, months, years and maybe fiscal columns too.
Another example is demographic data for each country, which is provided with 50 different columns. When we use outrigger dimensions we let each dimension have its own core columns.
Another example is from a bank with two kinds of customers, Person and Company, with common data such as name and address, and with specific data where a Person has a social security number, gender and marital status and a Company has a VAT identification number, industry code and turnover amount. A Customer dimension handles the connections to the fact tables and becomes a hub, an anchor or a party. Every time the bank gets a new customer, it will be set up in the Customer dimension and the surrogate key value will be reused in either a Person outrigger dimension or a Company outrigger dimension, where both of them have a one-to-one relationship to the Customer dimension.
Shrunken dimension or Shrunken Rollup dimension
A Shrunken dimension is a subset of another dimension's columns that applies to a higher level of summary of an aggregated fact, and the shrunken dimension key will appear in the fact.
a) The Month dimension is a shrunken dimension of the Date dimension. The Month dimension would be connected to a forecast fact table whose grain is at the monthly level, while Date dimension is connected to the realized fact table.
b) A base-level Sales fact has a grain per date and product and is connected to a Date dimension with columns for date, month, year and a Product dimension with columns for names of product, brand and category. The Sales fact is derived to an aggregate fact with a grain per month and category, connected to a shrunken Month dimension with columns for month and year and to a shrunken Category dimension with a name column. The aggregate fact has columns for Month_key and Category_key and a summary amount of the Sales fact, where the ETL process uses the Product dimension to roll up a product to a category and match the category to the Category dimension. Therefore both dimensions Product and Category have a category name column that becomes redundant data in the dimensional modeling. Therefore shrunken is not snowflaking, because a snowflake dimension is on 3NF.
c) Shrunken rollup dimensions are required when constructing an aggregate fact table. When a Sales fact has a daily grain, the number of rows can become very large over time; therefore an aggregate fact summarized to monthly level will have fewer rows. Since the aggregate fact doesn’t need detailed customer information, the Customer dimension can produce new shrunken dimensions for Region (out of address), for Gender and for Age groups, and the summarized fact data becomes even fewer rows.
d) Sometimes users want few dimension values, e.g. Red, Yellow and Green, and they want Pink to become Red and Orange to become Yellow and so forth, and the rest of the colours get a residual value called Others. I will make a ColourGroup shrunken dimension with the values: Red, Yellow, Green and Others. I will make a mapping table that translates the colours, e.g. Red to Red, Pink to Red, Yellow to Yellow and Orange to Yellow and the rest of the colours to Others. In the loading to the fact table I will let the colours pass through the mapping table and fetch the ColourGroup dimension key to the fact table to obtain good performance for various statistics for the users.
e) Store data of an aggregated fact in a column of a dimension for easy searching e.g. a year-to-date SalesSegment in a customer dimension with descriptive values as Excellent, Super, Good, Average, Less, Nothing. The ETL process must ensure the column is accurate and up-to-date with the sales fact rows.
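The colour mapping in point d can be sketched as a lookup during the fact load; table and column names are assumed:

```sql
-- MapColour translates each source colour to a group name;
-- unmapped colours fall back to the residual value Others.
SELECT ColourGroup_key = g.ColourGroup_key
FROM   StageSales s
LEFT JOIN MapColour m ON m.SourceColour = s.Colour
JOIN   DimColourGroup g
       ON g.ColourGroupName = COALESCE(m.ColourGroupName, 'Others');
```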
Derived dimension
Dimensions that represent data at different levels of granularity to give higher performance. Can also refer to hierarchies inside a dimension with a higher grain.
Like a Month dimension that is derived from a Calendar dimension, or we can say that Calendar has been reduced to Month, Year and Week with the start date of the week together with a week number and the year the week belongs to. A derived dimension can also be created by aggregating two existing dimensions.
In a hospital we can derive a Person dimension from a Patient dimension and an Employee dimension. A person can over time be both an employee and a patient, or at the same time when the employee becomes sick and is hospitalized.
Fact data can also derive dimension data, which is called a degenerate dimension.
Previously, I showed a Date dimension and a Time dimension and with a combination of them I can create a new dimension to handle date and hour to be used in Power BI Impact Bubble Chart e.g. from 2016-08-22 10:00 to 2016-08-26 22:00.
A fact table or a view can have a derived column like DateHourInterval:
DateHourInterval = FORMAT(TransactionDatetime,'dd-MM-yyyy HH:00','en-US')
A view can make the data rows for the derived dimension:
CREATE VIEW DimDateHourInterval AS
SELECT DateHourInterval = FORMAT(CAST([Date] AS datetime) +
CAST([Time] AS datetime),'dd-MM-yyyy HH:00','en-US')
FROM DimDate CROSS JOIN DimTime
WHERE Minute = 0 AND [Date] BETWEEN '2010-01-01' AND '2029-12-31'
Mini dimension or Historical dimension
Rapidly Changing Dimensions manage high frequency and low cardinality changes in a dimension: the fast changing volatile columns are placed in a mini dimension or historical dimension with its own surrogate key identity column, which will be included in the fact table. This approach is called type 4. A dimension table is split into two tables, one with type 1 columns and the other with type 2 columns.
An example is a customer where the columns for Name, Gender, DateOfBirth, Address and Country→Region→City are placed in a Customer dimension table (can be type 1 or type 2), and the fast changing columns BodyWeightAtPurchaseTime and MonthlyIncome interval, e.g. $ 0-10000, 10000-25000, 25000-50000, 50000-99999, are placed in a mini dimension table called CustomerBodyWeightIncome with its own surrogate key identity column and a foreign key back to the main dimension as a type 5. The sales fact table will have two columns to provide data for a customer, one key for the Customer dimension and another key for the CustomerBodyWeightIncome dimension. Sometimes it is necessary to have two or more mini dimensions if the columns change rapidly at different times. Normally there is no hierarchy in a mini dimension.
Bridge table
A bridge combines data that have a many-to-many relationship between a fact and a dimension, or between a dimension and a dimension, to take care of a multivalued column at the conceptual level and break it down at the logical level to a bridge dimension table with two one-to-many relationships and a composite primary key from the involved data. Instead of the word Bridge table, a term like Combine or Helper could be used. Peter Chen calls it an associative entity. A bridge can also be used for ragged hierarchies. (A song can have a Verse 1, a Chorus, a Verse 2, a Chorus, a Bridge and end with the Chorus again.)
A bridge table connects a fact table to a dimension table in order to bring the grain of the fact down to the grain of the dimension.
Multivalued dimension, Many-valued dimension or Multi-valued dimension
a) To handle when one fact row has two or more dimension values from the same dimension. For example, a Sales fact can have up to four different sales staff employees and a sales staff employee has many sales. Therefore we say there is a many-to-many relationship between the Sales fact and the Employee dimension, and the sales staff employee becomes multivalued in the Sales fact. It can be implemented by an extra table called SalesEmployeeBridge that contains the _key (unique Id) of a fact row and the Employee_key from the Employee dimension; or the fact table gets a dimension key column for a SalesEmployeeGroup dimension, meaning a group of sales staff is connected to one fact row, and since there is a many-to-many relationship between a SalesEmployeeGroup and the Employee dimension, a SalesEmployeeGroupBridge table will express that by combining the keys from SalesEmployeeGroup and Employee.
b) When buying a paint mixture the different colors are mixed with a ratio or weight (summing up to 100%) of the amount of paint. One sales fact row contains many colors and one color is included in many paint mixtures. Color becomes multivalued and the bridge gets a weight value to tell how much of each color is used.
c) To handle when one dimension row has two or more dimension values from another dimension. For example, an employee has a group of skills and one skill can belong to several employees. Therefore we say there is a many-to-many relationship between the Employee dimension and the Skill dimension, and the skills of an employee become multivalued in the Employee dimension. It can be implemented by an extra table called EmployeeSkillGroupBridge that has a composite primary key of SkillGroup_key and Skill_key, where SkillGroup_key is a foreign key in the Employee dimension and Skill_key is a foreign key in the Skill dimension, so the many-to-many relationship becomes two one-to-many relationships. The data experience level would be placed in the EmployeeSkillGroupBridge. An alternative approach is to capture the relationship in an EmployeeSkillFact. In a School mart I would place the students' courses in a fact table because a course is the result of studying and passing an exam.
Another example: a t-shirt can have up to three sizes, »small«, »medium« and »large«, which can become three columns in the T-shirt dimension; or we can make a Size dimension that has a many-to-many relationship to the T-shirt dimension, and the relationship becomes a T-shirtSizeBridge with QuantityInStock.
Another example: a bank account can have two bank customers, such as wife and husband, and of course each bank customer can have several accounts, such as budget, savings and pension. Therefore we say there is a many-to-many relationship between the Customer dimension and the Account dimension. We create a BankAccount fact table that refers to the Account dimension, and the Account dimension refers to an AccountCustomerBridge table that in turn refers to the Customer dimension, so the BankAccount fact table does not refer directly to the Customer dimension. The AccountCustomerBridge table contains two columns, Account_key and Customer_key, as a composite primary key so an account can have several customers.
Other examples are in Kimball pages 287 and 382 with a weighting factor.
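The weighted bridge from example b) can be sketched in Python; the table contents, column names and a simple proportional allocation are assumptions for illustration, not the author's implementation:

```python
# Minimal sketch of a weighted bridge table: one sales fact row (a paint
# mixture) is linked to several colors, each with a weight; the weights
# for one fact row sum to 1.0 (100%).
sales_fact = {"Sales_key": 501, "Amount": 200.0}

# Bridge rows: (Sales_key, Color_key, Weight)
color_bridge = [
    (501, 1, 0.50),   # 50% white
    (501, 2, 0.30),   # 30% blue
    (501, 3, 0.20),   # 20% red
]

def allocate_amount(fact, bridge):
    """Allocate the fact amount to each color by its bridge weight."""
    return {color_key: fact["Amount"] * weight
            for sales_key, color_key, weight in bridge
            if sales_key == fact["Sales_key"]}

allocation = allocate_amount(sales_fact, color_bridge)
print(allocation)
```

Because the weights sum to 1.0, the allocated amounts add back up to the original fact amount, so the measure stays additive across the bridge.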
Role-playing dimension or Repeated dimension
A role-playing dimension is repeated two or more times in the same fact, e.g. a Date dimension whose key column is repeated in three foreign key columns in a fact table for three roles labelled SaleDate, ShipmentDate and DeliveryDate. For each role I create a view with distinguishable, unambiguous and unique column names by renaming the columns from the Date dimension; in this example it becomes three views upon the Date dimension called SaleDate, ShipmentDate and DeliveryDate with columns like Date of Sale, Year of Sale, Date of Shipment, Year of Shipment, Date of Delivery and Year of Delivery.
A City dimension can be repeated in multiple roles in a fact table of persons with columns like BirthplaceCity, ChildhoodCity, ResidenceCity, WorkplaceCity, SeniorCity and DeathCity. It will become six views upon the City dimension.
A Manager dimension can be repeated as Sales clerk and Store manager.
Another example is a boolean dimension with key values 1/True and 0/False where, in different views for several role-playing statuses, the two values are translated to good texts such as »Yes« and »No«, »In stock« and »Delivery made«, or »Available« and »Utilized« for a Utilization status.
For a question like »Want children« the answer is normally »Yes« or »No«, but we must give extra options such as »Not sure«, »Maybe«, »I don't know«, »I can't decide« or »No answer« when the answer is missing. Therefore we end up with a dimension with several values.
When a dimension has an outrigger dimension, e.g. a Customer dimension with a column FirstPurchaseDate, I create a view upon the Date dimension.
Kimball says: »Create the illusion of independent date dimensions by using views or aliases and uniquely label the columns.« It will be easy for a user in Power BI or Tableau to drag several dimensions into a data model without thinking about the fact that one dimension is playing multiple roles. In an OLAP cube data model the fact can be joined multiple times to the same dimension, and under Dimension Usage each role can be labelled, but since we are using the same dimension the column names will be reused.
A Sales fact can have a ShippingAddress_key and a BillingAddress_key as role-playing dimensions. In a fact, Date_key is a general term; it is better to give the date a label and let it play a role, e.g. the label »purchase« becomes a PurchaseDate_key as a role-playing date dimension.
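The role-playing idea of one physical Date dimension exposed through uniquely labelled views can be sketched in Python; the dimension contents and the renaming rule are assumptions for illustration:

```python
# One physical Date dimension; each role is a "view" that renames the columns
# so SaleDate, ShipmentDate and DeliveryDate can coexist in one data model.
date_dim = [
    {"Date_key": 20230101, "Date": "2023-01-01", "Year": 2023},
    {"Date_key": 20230215, "Date": "2023-02-15", "Year": 2023},
]

def role_view(dimension, role):
    """Return the dimension with columns renamed for one role,
    e.g. Date -> 'Date of Sale' and Year -> 'Year of Sale'."""
    return [{f"{role}Date_key" if col == "Date_key" else f"{col} of {role}": val
             for col, val in row.items()}
            for row in dimension]

sale_date = role_view(date_dim, "Sale")
shipment_date = role_view(date_dim, "Shipment")
```

Each role view carries distinguishable column names, so a reporting tool can show Date of Sale and Date of Shipment side by side without ambiguity.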
Junk dimension, Garbage dimension, Abstract or Hybrid dimension
A single table with a combination of different and unrelated columns to avoid having a large number of foreign keys in the fact table, thereby decreasing the number of dimensions (the dimensionality) of the fact table. Kimball recommends up to 25 dimensions in a fact. The content of the junk dimension table is the combination of all possible values of the individual columns, called the cartesian product.
For example, four different values that can be cross-joined into a Junk dimension:
Payment method: Cash or Credit card.
Coupon used: Yes or No or Not applicable.
Bag type: Fabric or Paper or Plastic or Unspecified.
Customer feedback: Good, Bad or None.
It will give 2 x 3 x 4 x 3 = 72 values or rows in a Junk dimension table, but it can contain only the combinations of values that actually occur in the source data.
The fact table only needs one key to the Junk dimension for getting the descriptive values of fact data for reporting. The pitfall of a Junk dimension is filtering, because a value (e.g. Credit card) exists as a duplicate in multiple rows and therefore gives multiple key values to be joined to the fact table. To display the unique content of a column from a Junk dimension in a dropdown or listbox I need to create a view for that column, e.g. create view [Dim Bag type] as select distinct [Bag type] from [Dim Junk]. The view will handle a one-to-many relationship to the Junk dimension in the same way we handle a snowflake dimension. In Power BI I can create a calculated table upon the Junk dimension with a DAX expression like:
Dim Bag type = distinct(Dim Junk[Bag type]) and then build a relationship from the calculated table Dim Bag type back to the Junk dimension in the model, where the Junk dimension already has a one-to-many relationship to the fact. I hide Dim Junk because a user does not need it once we have calculated tables for each of the columns in the Junk dimension; the Junk dimension becomes a bridge or a helper. I hope this shows how to use a Junk dimension in practice.
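Building the 2 x 3 x 4 x 3 = 72-row junk dimension from the example above is a cartesian product; a minimal Python sketch using the attribute values from the example (the surrogate key assignment is an assumption):

```python
from itertools import product

payment_methods = ["Cash", "Credit card"]
coupon_used = ["Yes", "No", "Not applicable"]
bag_types = ["Fabric", "Paper", "Plastic", "Unspecified"]
feedback = ["Good", "Bad", "None"]

# Junk dimension: one surrogate key per combination of the four attributes.
junk_dim = [
    {"Junk_key": key, "Payment method": p, "Coupon used": c,
     "Bag type": b, "Customer feedback": f}
    for key, (p, c, b, f) in enumerate(
        product(payment_methods, coupon_used, bag_types, feedback), start=1)
]

print(len(junk_dim))  # 2 * 3 * 4 * 3 = 72 rows
```

In practice the list could be filtered down to only the combinations that actually occur in the source data before assigning surrogate keys.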
Degenerate dimension
Degenerate dimension values exist in the fact table, but they are not foreign keys and they do not join to a real dimension table. The dimension value is stored as part of the fact table, not in a separate dimension table, and it is typically used for lowest-grain or high-cardinality dimensions such as voucher number, transaction number, order number, invoice number or ticket number. These are essentially dimension keys for which there are no other columns, so a degenerate dimension is a dimension without columns or hierarchy.
For example, the OrderNumber column can be in the Order fact table with several rows using the same order number, because one order can contain several products. The OrderNumber column is therefore important for grouping together all the products in one order. Later, when searching for an order number in an OLAP cube, an Order number dimension is very useful, but it is not a dimension table; it is generated from the Order fact table and there is no additional data like a name or text.
A degenerate dimension is also used as an implemented Bridge dimension table when making a Multivalued dimension. In a data mart these values are often used as the result of a drill-through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.
Normally a degenerate dimension is not a table in a database, it is a view with distinct values based on the fact table.
A Dimensionalised dimension is a replacement for a Degenerate dimension where the view becomes a materialized view, meaning it becomes a table in the data mart, and where the text column is transformed to an integer key, like this:
SELECT QuantityPerUnit_key = CONVERT(BIGINT, HASHBYTES('SHA2_256', t.QuantityPerUnit)),
       t.QuantityPerUnit
FROM (SELECT DISTINCT QuantityPerUnit
      FROM Products WHERE QuantityPerUnit IS NOT NULL) t
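The same hash-to-key idea can be sketched in Python with hashlib; taking the first 8 bytes of the SHA-256 digest as a signed 64-bit integer is an assumption that mirrors the CONVERT(BIGINT, ...) idea, not an exact replica of the T-SQL:

```python
import hashlib

def text_to_key(text):
    """Derive a deterministic 64-bit integer key from a text value
    by hashing it with SHA-256 and taking the first 8 bytes."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Distinct non-null QuantityPerUnit values become dimension rows with a key.
quantities = ["10 boxes x 20 bags", "24 - 12 oz bottles", "10 boxes x 20 bags"]
dim = {q: text_to_key(q) for q in set(quantities) if q is not None}
```

Because the key is derived deterministically from the text itself, the same value always gets the same key across reloads, with no need for an identity sequence.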
Static dimension or Constant dimension
Static dimensions are not extracted from the source system but are created within the context of the data warehouse or data mart. A static dimension can be loaded manually with status codes, or it can be generated by a procedure, such as a Date dimension and a Time dimension. The opposite would be called a Dynamic dimension.
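A procedurally generated Date dimension can be sketched in Python; the column names and the smart yyyymmdd key are assumptions for illustration:

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Generate one dimension row per calendar day with an integer
    Date_key on the form yyyymmdd and a few descriptive columns."""
    rows, current = [], start
    while current <= end:
        rows.append({
            "Date_key": current.year * 10000 + current.month * 100 + current.day,
            "Date": current.isoformat(),
            "Year": current.year,
            "Month": current.month,
            "DayOfWeek": current.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
        current += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2023, 1, 1), date(2023, 12, 31))
print(len(date_dim))  # 365 rows for 2023
```

The same pattern works for a Time dimension with one row per minute or second of the day.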
Heterogeneous products dimension
Several different kinds of entry with different columns for each fact (like subclasses). For example, heterogeneous products have separate unique columns, and it is therefore not possible to make a single product table to handle these heterogeneous products.
Measurement type dimension or Entry type dimension or Fact dimension
Used to identify different facts that are populated in the same measure column in a fact table because the fact rows represent different entry types. A measurement type dimension describes what the fact row represents and how measures must be understood and used. The alternative is to have multiple measure columns, one per entry type, in the fact table where only one column has a value in each row.
Time machine dimension
A combination of two dimensions, one for Entry date (or event date) and one for Year, where fact data rows are based on either »counterparts« as in financial accounting or »transactions« as in financial transactions. Section 6.4 will show an example.
Monster dimension or Huge dimension
A very large dimension that has a huge number of rows or many columns. For a real estate agent, I implemented a 132-column dimension through a merge of five source systems. The column names were made together with the users. The created dimension table got the column names in alphabetical order so it is easy to find a specific column.
Dimension columns are allowed to become complex objects rather than simple text, e.g. unstructured text, GPS tracking, graphic images, time series and NoSQL databases like Hadoop and MongoDB, and so become much more malleable and extensible from one analysis to another. Extensible design in software engineering is to accept that not everything can be designed in advance; extensibility is a software design principle defined as a system's ability to have new functionality added.
Audit dimension or Data quality dimension
A description of each fact table row could be »Normal value«, »Out-of-bounds value«, »Unlikely value«, »Verified value«, »Unverified value« or »Uncertain value«.
Kimball recommends loading all rows into the fact table and using an Audit dimension to tag data with an error condition, thereby telling the state of each row in an audit report, so a user can look at the data, make a data change or a counterpart in the source system, and do a new load to the data warehouse to fix the data in the fact table. Data quality is monitored during the ETL process and it can produce audit statistics.
In an SSIS package a derived column can have an expression to validate data and give a value to an Audit dimension, e.g. an amount from a source system that is less than 0 or more than 5000 gets audit key value 12 for an out-of-bounds amount, else value 0 for okay: Amount < 0 || Amount > 5000 ? (DT_I4)12 : (DT_I4)0
All rows are checked for compliance with the constraints.
An audit dimension can also have data for name and version of a source system, name of data table in a source system, time of extract from a source system, time of insert into the fact table etc.
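The SSIS derived-column expression above can be mimicked in Python; a minimal sketch of tagging each fact row with an audit key, where the bounds and key values follow the example (12 = out-of-bounds, 0 = okay):

```python
def audit_key(amount, lower=0, upper=5000):
    """Return audit key 12 for an out-of-bounds amount, else 0 for okay,
    mirroring: Amount < 0 || Amount > 5000 ? (DT_I4)12 : (DT_I4)0"""
    return 12 if amount < lower or amount > upper else 0

# Tag each incoming fact row; all rows are loaded, none are rejected.
fact_rows = [{"Amount": 250}, {"Amount": -10}, {"Amount": 7200}]
for row in fact_rows:
    row["Audit_key"] = audit_key(row["Amount"])

print([row["Audit_key"] for row in fact_rows])  # [0, 12, 12]
```

An audit report can then simply filter on Audit_key <> 0 to show the rows a user should review in the source system.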
Textual data as a numerical fact measurement
It can be a matter of confusion whether a numerical value belongs to a dimension or a fact. For example, the price of a product could be a column of the Product dimension, but since the price of a product often varies over time, and maybe also over location, it should be a non-additive measure column in a fact together with a Date dimension.
Another example: a national lottery has a fixed pool size amount for each weekly drawing, and since the amount is fixed and thus does not change over time, the pool size amount belongs as a column of the Draw dimension together with a DrawDate column that becomes an outrigger or reference dimension which provides a week number and a year. The pool size amount can be used for filtering (e.g. to see pool sizes above $1 billion) and for grouping into value bands. When a person buys a lottery coupon, the deposit and the date of purchase are recorded in a fact, and all the deposit amounts for a week can be summarized to a weekly deposit KPI, e.g. in a view or in a tabular cube. A Power BI chart can combine the fact weekly deposit with the dimension weekly pool size, and it can be seen that the deposit is significantly larger in the weeks that have very big pools.
4.5. Inferred members or Inferred dimensions
A dimension has two members (or values) called »Missing« and »Unknown« to handle source data that is going to be loaded into a fact table:
Approach: handling a null/empty business key as a missing member, because a row in a fact table has a business key column that is not registered or recorded and therefore means »no data«. It is known that it is missing, and sometimes it is fine that there is no value because the value is not required; in that case I use nothing instead of missing.
For a yes/no question the answer is »Yes« or »No« (true or false, 1 or 0), where a null value represents a »Don't know« response, as if the question was unanswered. A »Don't know« response is not the same as a neutral response, therefore I treat the answer as a missing value, and I wouldn't keep it as a Likert scale item.
To treat null, blank, empty, nothing and empty string as equal to null in SQL:
WHERE (NULLIF(TRIM(CustomerAddress),'') IS NULL)
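The same normalization, treating null, blank and whitespace-only values alike, can be sketched in Python:

```python
def is_missing(value):
    """Treat None, empty string and whitespace-only strings as missing,
    like WHERE NULLIF(TRIM(CustomerAddress), '') IS NULL in SQL."""
    return value is None or (isinstance(value, str) and value.strip() == "")

addresses = [None, "", "   ", "Main Street 1"]
print([is_missing(a) for a in addresses])  # [True, True, True, False]
```

Applying one such predicate consistently in the ETL process avoids the situation where some loads treat a blank string as a value and others treat it as null.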
Approach: »Early arriving fact« for handling orphaned data where a fact row has an unknown member, meaning a fact value is an orphan child because there is no parent value in the corresponding dimension table; the fact value does not exist in the dimension because the value is unavailable. Compare with »404 Document Not Found«, which means that a deep link refers to a page that does not exist or is unavailable because the page has been deleted or renamed, or there is a misspelling in the link.
It is known it is unknown.
In a relational database this is called a referential integrity constraint violation: a foreign key contains a value that does not exist as a primary key in another (or the same) table. There is no referred member, so there is a need for an inferred member. A forthcoming fact row has a member that will infer a new dimension member, therefore it is called an inferred member of the dimension (in danish udledt), and it improves data quality and the audit trail/control track for reconciliation in the data warehouse.
For example, an Order table has a foreign key column called PartNumber referring to a Part table. In case the PartNumber column contains null, it means you have an order without a part or the part is missing, which for an order makes no sense. This is a mandatory relationship; you cannot order a part that does not exist, therefore an order's PartNumber value must exist in the Part table. If you want to delete a part from the Part table, its PartNumber must not be in use in the Order table.
Handling a null/empty business key as a missing member
When a dataset is transformed to a fact table, there can be a business key column whose value is empty or null, meaning it does not yet exist for the fact data. To keep all the fact data rows in the fact table, the related dimension table is already born with a »Missing« value (an inferred member) with surrogate key identity value -1, which can be used as a default value in the fact table's foreign key column to the dimension table's primary key column. Later, a fact table row with a -1 value can be updated with a real business key value that either exists in the dimension table or is inserted first and gets a new surrogate key identity that the fact data row can refer to.
Sometimes I have seen a dimension with an inferred member value of -1 for »Unknown«, but I prefer using -1 for »Missing« (some call it »Blank«), and I use -2 for »Unknown« to handle the situation of an early arriving fact in a general way. Sometimes it is fine that there is no value (null/empty) because the value is not required; for that I use -5 for »Nothing«. There can be other columns to let us know when a null/empty value represents missing or nothing.
Early arriving fact as an unknown member
When a dataset is transformed to a fact table, there can be a business key column whose value has not yet been received by the related dimension table and therefore does not yet exist there. To keep all the fact data rows in the fact table, the related dimension table first has to insert the new business key value with an »Unknown« value (called an inferred member), which later will be updated in the dimension with the correct text value when it becomes known. The »Unknown« value gets the next surrogate key identity as a unique sequence number and is used in the fact table like any other dimension member value.
A dimension table can have several »Unknown« member values at the same time, each with its own business key and surrogate key identity, and the text value can include the value of the business key, like »Unknown 886«, »Unknown 887«, »Unknown 888« and so on, to distinguish them for the users of the dimension. When a regular dimension has a hierarchy, these levels can have the text value »Unknown« as a special branch in the hierarchy.
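Handling an early arriving fact by inserting an inferred »Unknown« member can be sketched in Python; the in-memory dimension and column names are assumptions for illustration:

```python
dimension = {}          # business key -> dimension row
next_key = [1]          # next surrogate key identity (mutable counter)

def lookup_or_infer(business_key):
    """Return the surrogate key for a business key; if it is not yet in the
    dimension, insert an inferred member like 'Unknown 886' and return its key."""
    if business_key not in dimension:
        dimension[business_key] = {
            "Surrogate_key": next_key[0],
            "BusinessKey": business_key,
            "Name": f"Unknown {business_key}",
            "IsInferred": True,
        }
        next_key[0] += 1
    return dimension[business_key]["Surrogate_key"]

fact_key = lookup_or_infer(886)   # early arriving fact: member is inferred
same_key = lookup_or_infer(886)   # a second fact row reuses the same key
```

When the real dimension data arrives later, the inferred row is updated in place (name, attributes, IsInferred = False) and the fact rows stay untouched, since they already hold the right surrogate key.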
Late arriving dimension
When a dimension data has been delayed:
a) If the business key does not exist, then insert a new row.
b) If the business key exists as an inferred member »Unknown«, then for type 1 update the row by overwriting its values, and for type 2 insert a new row to keep history.
c) If the business key exists and the delayed data comes with a business valid-from date, then type 2 becomes more complex, because validfrom and validto have to be adjusted and fact rows have to be revisited to update the dimension key column to point at the right dimension value at that business date. You have to consider whether it is allowed to change old data and old reporting results.
Late arriving fact
When fact data has been delayed, maybe it includes a date that can be used to search and fetch the dimension member that was current at that time, if the dimension keeps history. You have to consider whether it is allowed to add a late arriving fact row, because it will change an old report. For example, a boss already got the report of sales for the first quarter, and on June 5 a late sales fact for March 31 arrives; when it is added to the Sales fact, the report for the first quarter will change so it no longer matches the old reporting result.
When building the ETL process for a fact that is using type 2 or 7 dimensions, sometimes we can assume that new fact data rows are at the current date, and therefore we only need to do a lookup for the dimension key value with this criterion:
dim.Businesskey = fact.Businesskey AND dim.IsCurrent = 1
But if a fact data row can be late arriving with an old date stamp in the column TransactionDate, we need to do a date range lookup for the dimension key value with a criterion that finds the fact business key in the dimension at the time when the fact date was valid in the dimension:
dim.Businesskey = fact.Businesskey AND
dim.EffectiveDate <= fact.TransactionDate AND dim.ExpirationDate > fact.TransactionDate
A range lookup is an in-memory lookup that assigns a surrogate key to each incoming fact row in a streaming ETL process, translating and replacing a business key value from the source system with a dimension surrogate key value to be saved in the fact data row in the fact table.
Other terms for TransactionDate could be EventDate, HappeningDate or OccurredDate in the business sense, e.g. OrderDate, PurchaseDate, SalesDate, InsuranceCoverageStartDate, PlanEffectiveDate or ValueDate, and of course a time (o'clock) can also be added.
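The date range lookup criteria above can be sketched in Python over an in-memory type 2 dimension; the sample rows are assumptions, and dates are kept as ISO strings so plain comparison matches date order:

```python
# Type 2 dimension: each business key has rows with validity date ranges.
dimension = [
    {"Surrogate_key": 1, "BusinessKey": "C42",
     "EffectiveDate": "2023-01-01", "ExpirationDate": "2023-06-01"},
    {"Surrogate_key": 2, "BusinessKey": "C42",
     "EffectiveDate": "2023-06-01", "ExpirationDate": "9999-12-31"},
]

def range_lookup(business_key, transaction_date):
    """Find the surrogate key valid at the fact's transaction date:
    EffectiveDate <= TransactionDate AND ExpirationDate > TransactionDate."""
    for row in dimension:
        if (row["BusinessKey"] == business_key
                and row["EffectiveDate"] <= transaction_date
                and row["ExpirationDate"] > transaction_date):
            return row["Surrogate_key"]
    return -2  # unknown member for an orphaned fact row

late_fact_key = range_lookup("C42", "2023-03-15")     # late arriving fact
current_fact_key = range_lookup("C42", "2023-08-01")  # current fact row
```

The half-open interval (inclusive EffectiveDate, exclusive ExpirationDate) makes sure a transaction date on a change-over day matches exactly one dimension row.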
Date columns in a fact table that make it possible to match an old reporting result:
· TransactionDate when something occurred in the business sense.
· RegisteredDate when it was known in the system.
· InsertTime when it was known in the data warehouse fact table.
When the data warehouse receives a correction to a fact row, or a fact row arrives late, these dates can be helpful. Say the boss got the report of sales for the first quarter on April 2. We can mark the report with an ExecuteDatetime 2010-04-02 4pm, and when the boss on August 16 wants to recreate the old report as it actually looked at the time of creation, we can search the fact data:
InsertTime <= ExecuteDatetime.
And we can make a new report as it should have looked given corrections made to the data after its creation: InsertTime <= Now.
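Recreating a report as of its original execution time can be sketched in Python by filtering on InsertTime; the rows and column names follow the example in the text, the amounts are assumptions:

```python
sales_fact = [
    {"TransactionDate": "2010-03-31", "Amount": 100,
     "InsertTime": "2010-04-01 09:00"},
    # Late arriving fact for March 31, inserted June 5:
    {"TransactionDate": "2010-03-31", "Amount": 40,
     "InsertTime": "2010-06-05 11:00"},
]

def report_total(fact_rows, as_of):
    """Sum the amounts as the report looked at a given execution time,
    i.e. only rows with InsertTime <= ExecuteDatetime are included."""
    return sum(row["Amount"] for row in fact_rows
               if row["InsertTime"] <= as_of)

original = report_total(sales_fact, "2010-04-02 16:00")   # as the boss saw it
corrected = report_total(sales_fact, "9999-12-31 23:59")  # late facts included
print(original, corrected)  # 100 140
```

The same filter recreates any earlier version of the report, which is useful when a number in an old report must be explained months later.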
See more in chapter 2 temporal snapshots and in section 6.4.
Early arriving dimension
When a dimension has members that have not yet been referred to from a fact row. There will never be a method of payment called »Blood«, but I still keep it as a member of the dimension.
Other inferred members
A variety of constant inferred members with fixed key values for a dimension could be:
»Missing« (-1) when a source system provides a business key value that is null, meaning it is not present in some data, not registered, not reported, or it »hasn't happened yet« or will »be available later« because the business key to be determined is expected to become available later, and the fact table will then be updated. Therefore the fact table column for the dimension gets the value -1 for missing. (In danish Mangler eller Ubestemt).
»Not available« (-2) when a source system provides a business key value that is not known and not found in the data warehouse because the value does not exist there, e.g. CustomerNumber 424-15-90. Therefore the fact table column for the dimension gets the value -2 for unknown, as a general term for an unknown member, because not all values should be made into inferred members of a dimension. A CustomerNumber in a source system could be typed in wrongly and therefore not be available for a dimension in the data warehouse, and I find it better to mark the dimension in the fact with -2 instead of inserting a wrong value into the Customer dimension. Some wrong CustomerNumbers will get the value -2 in the fact table and therefore become homeless, but an audit trail can fetch and report them so the source system can be corrected. In a real-time load it can happen that a sales transaction refers to a CustomerNumber that is not yet in the Customer dimension, and an inferred member is the only way out if it is not possible to use a snapshot of the source system to create a read-only and static view that is transactionally consistent with the database as of the moment of the snapshot's creation. (In danish data værdien er ikke tilgængelig, værdien eksisterer ikke og er ikke til rådighed i dimensionen, ukendt).
»Not applicable« (-3) when the dimension is irrelevant for the fact row (N/A, N.A., NOAP or »Irrelevant«). It is best to avoid an N/A dimension by making separate fact tables. (In danish ikke anvendelig, ikke relevant, er ikke krævet).
»Wrong«, »Bad«, »Corrupt«, »Dirty« (-4) when a source system provides a wrong or bad business key value, or not enough data to determine the appropriate dimension key identity. This may be due to corrupted data in the source system or incomplete knowledge of the business rules for the source data for the dimension (wrong data).
»Nothing« (-5) when we know that it is fine that there is no value (null/empty) because the value is not required in the source system; it is therefore not missing and not required in the fact table column. There can be other columns to let us know when a null/empty value represents missing or nothing. I use -5 for »Nothing« (in danish Ingen, Ingen tildelt, Ingenting, Intet, Ikke oplyst eller Uden data).
When you know that a product is missing in your invoice item line, it is fine to call it -1 »Missing product« (in danish Mangler produkt). When you know that a customer doesn't need to have a role, it is fine to call it -5 »No role« (in danish Ingen rolle). E.g. a FactClassInstructor has two dimensions for a primary instructor and a secondary instructor and a role-playing DimInstructor. Sometimes a class has no secondary instructor, therefore I use dimension key value -5 in the fact row with the term »No instructor« in DimInstructor.
When a type 2 dimension is expanded with new columns, the old historical rows do not have values for those columns, so we can use the value »Before column« (-9). The same applies to a fact table expanded with a new dimension column: for old fact rows back in time we have no value for that dimension, therefore we use dimension key value -9 for »Before column«.
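The fixed inferred-member key values can be collected in one place in the ETL code; a Python sketch where the values mirror the list above and the simple classification rule is an assumption for illustration:

```python
# Constant inferred members with fixed negative surrogate key values.
INFERRED_MEMBERS = {
    -1: "Missing",
    -2: "Not available",
    -3: "Not applicable",
    -4: "Wrong",
    -5: "Nothing",
    -9: "Before column",
}

def resolve_key(business_key, dimension_keys, required=True):
    """Classify a fact's business key: -1 when a required key is null,
    -5 when a not-required key is null, -2 when it is not in the dimension,
    otherwise the real surrogate key."""
    if business_key is None:
        return -1 if required else -5
    return dimension_keys.get(business_key, -2)

dim = {"C42": 134}
print(resolve_key(None, dim))                  # -1 Missing
print(resolve_key(None, dim, required=False))  # -5 Nothing
print(resolve_key("C99", dim))                 # -2 Not available
print(resolve_key("C42", dim))                 # 134
```

Keeping the constants in one dictionary makes the audit report trivial: translate every negative key through INFERRED_MEMBERS to show why a fact row is homeless.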
Example of inferred members
The key values -1 to -7 are not carved in stone; please make the values that fit the dimension. For example, a FactQuestionnaire has a DimRespondent with the following key values in no particular order:
A dimension table can have a metadata column like IsInferred as a boolean (bit) with two values (1 = true and 0 = false):
An example of a Customer dimension:
Will an inferred dimension member row be updated when the source system provides a Region and a Name?
Kimball pages 479 and 516 say it is okay to do a type 1 overwrite in a type 2 dimension and overwrite IsInferred with 0. I say it is up to you to think about the overwrite: will it be okay that, after a week or a month, a report back in time will have a different output in present time?
The fact table is untouched and unchanged because the foreign key Customer_key values 134 - 136 are in the right place already; it is only the Unknown values of the dimension that are changed. Kimball says there is no need to revisit the fact table when making an inferred member into a normal member of the dimension.
When other levels in a hierarchy have the text value »Unknown«, like Region, and users of the dimension in a cube have already made reports with fact data connected to »Unknown«, it is important to keep the history in the dimension so a user some months later can still make the same report with the same summary value for »Unknown«. If a user instead wants to use the current values of the previous inferred members, the cube should provide such a dimension too.
The design of a dimensional model follows a four-step process described by Kimball:
1. Select the business process to model by gathering and understanding business needs and available data in source systems.
2. Declare the grain of fact as Transaction, Periodic or Accumulating/Aggregated.
3. Identify the dimensions with columns, hierarchy and group.
4. Identify the facts with measures as additive or non-additive.
Dimensional modeling ends up in relational database tables for dimensions and facts with one-to-many relationships between primary keys and foreign keys, meaning it becomes an Entity Relationship data model where entities are labelled with Kimball naming.
An entity or table can have a normal name, e.g. Product and Sales, and tables can be placed into multiple database schemas based on Kimball naming to label each table in the dimensional model, e.g. these database schema names (labelling in danish at lave en mærkning og sætte en etikette på):
Another naming is to add the role or type of data at the end, for example:
Sometimes a data modeler adds a label Dim and Fact to the name of the table, e.g. DimProduct and FactSales, or Product_dim and Sales_fact.
Extra schemas for Usage supporting tables in the ETL process:
The most important of all is a good table name e.g. in Kimball’s Healthcare example he has tables like:
An invoice has multiple line items. Why make invoice a dimension, when the parent invoice header fact table has 4 dimensions and 5 measures, and the child invoice line item fact table has 6 dimensions and 3 measures? Kimball Design Tip #25: We can't roll up our business by product! If we constrain by a specific product, we don't know what to do with invoice-level discounts, freight charges and tax. We have to take the invoice-level data and allocate it down to the line item level. Therefore we get one Invoice fact table with dimensions and measures from the invoice header, e.g. invoice no, order no, invoice date, customer, ship-to and sales person, and from the invoice line item, e.g. product, quantity, unit price and amount. Invoice number becomes a degenerate dimension. We get a »capture a relationship in the fact«.
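Allocating an invoice-level amount (discount, freight or tax) down to the line items, as Design Tip #25 requires, can be sketched in Python; the amounts and the proportional allocation rule are assumptions for illustration:

```python
def allocate_header_amount(line_items, header_amount):
    """Spread a header-level amount (e.g. freight) over the line items
    in proportion to each line's extended amount."""
    total = sum(item["Amount"] for item in line_items)
    return [dict(item, AllocatedFreight=header_amount * item["Amount"] / total)
            for item in line_items]

# One invoice with two line items and a $20 freight charge on the header.
lines = [
    {"Product": "A", "Amount": 300.0},
    {"Product": "B", "Amount": 100.0},
]
allocated = allocate_header_amount(lines, header_amount=20.0)
print([line["AllocatedFreight"] for line in allocated])  # [15.0, 5.0]
```

After the allocation, every measure lives at the line item grain, so the single Invoice fact table can be rolled up by product, customer or any other dimension without losing the header-level amounts.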
With dimensional modeling, end users easily understand and navigate the data structure and fully exploit the data.
5. More readings
https://www.kimballgroup.com/2002/09/two-powerful-ideas/ (data staging area)
A danish teacher's presentations in English, texts and videos: https://astridhanghoej.dk/dataanalyticsinfrastructure/
Steps to Modeling the EDW (video)
Ralph Kimball and Margy Ross: The Data Warehouse Toolkit, The Definitive Guide to Dimensional Modeling, Third Edition, 2013. First edition was published in 1996 and was introducing the star schema as the basic solution for modeling multidimensional data and had subtitle: Practical Techniques for Building Dimensional Data Warehouses.
Ralph Kimball and Joe Caserta: The Data Warehouse ETL Toolkit, Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, 2004.
Page 215 shows a pipeline for an ETL process to look up key values in dimensions. Read how to implement a pipeline with date range lookup for SSIS in section 11.4.
Ralph Kimball and Margy Ross: The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence Remastered Collection, 2016 including all Design Tip #1 to #180. Some of the tips
William H. Inmon: Building the Data Warehouse, Fourth Edition, 2005; the first edition, published in 1990, is considered the birth of the DW as »a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making process.«
Dan Linstedt and Michael Olschimke: Building a Scalable Data Warehouse with Data Vault 2.0, 2015. Introduction and good debate
6. Examples with sample values
6.1. Customer snowflake dimension
Customer table in a source system with CustomerId as the primary key (selected as the business key for the data warehouse because the value is immutable and unchangeable) and CustomerNumber built of the nationality (country) and the gender, which is a mutable and changeable value because a customer can change gender.
Customer table from a source system with business key column CustomerId: