Dimensional modeling – architecture and terminology by Joakim Dalby (danish Dimensionel modellering, Datamodellering, Datavarehus). I find knowing the theory really helps the practice.

1. Introduction

»The most valuable commodity I know of is information« says Gordon Gekko in the movie Wall Street from 1987. »Knowing is good, but knowing everything is better« says the CEO in the movie The Circle from 2017. »Information at your fingertips« was a concept by Bill Gates in 1990. Purpose: Let's enable business
users to make better decisions faster. Information has become a major asset
for any organization. The book Das Kapital by Karl Marx could today get the
title Die Information. A data warehouse is the way to
provide information to the business users and add value to the business and
reuse data in a new way to increase the revenue. A data warehouse contains data from several operational systems, and most importantly, data from the source systems must be integrated, because integration of data is the keyword to represent an enterprise data platform with data assets. When a source system database
forgets, a data warehouse remembers everything. A data warehouse is a separate system, so business user queries and analysis will not be slowed down and will not add workload to the operational systems. »A lot of you are going to have to make
decisions above your level. Make the best decision that you can with the
information that’s available to you at the time, and, above all, do the right
thing.« says Atul Gawande.

Star schema for business users to do their analytics

Data is stored in a data warehouse
through an extract, transform and load process, where data is extracted from
operational systems, transformed into high-quality data and checked for
compliance with business rules and enriched with business users' data before being loaded into a data warehouse system. Data quality isn't everything. It’s
the only thing. Gartner formulates that the platform
for data and analytics is the heart of the digital business. Here you will
find data management and the company's governance for data and analysis. Data
is the raw material in our digital age. Data is collected, cleaned,
controlled, optimized, shared and used, and our ability to work flexibly and
intelligently at this »assembly line« is growing fast. Data management is crucial to tomorrow's business success. Companies must shape the technology and the data resource both short-term and long-term for the sake of both the existing business and the future business. Now, the actual data platform is, with good reason, a focus area for companies. In a data-centric world, the tools, processes and competencies to handle data are critically important. All of today's greatest technological trends, ranging from the Internet
of Things IoT, robotics, artificial intelligence and virtual reality are
ultimately dependent on data managers, architects and company data
strategies. »That's all data is. A gift from
yesterday that you receive today to make tomorrow better.« Jon Acuff: Finish:
Give Yourself the Gift of Done, page 126.

Performance Management Capabilities required:
A data warehouse tells you what has
happened (lag measures) - not why it has happened or what is going to happen (lead
measures). »Description is not analysis.« Dr. Donald J. Wheeler:
Making sense of data. The
four levels of analytics to get better data skills and data literacy are:
There are many gaps that need to be
filled in by a different kind of insight than the one that can be delivered
in a data flow task from multiple source systems to an enterprise data
warehouse, but let's start with what a data warehouse is.

Data warehouse is

Data from multiple operational source systems and source legacy systems are stored in tables in a relational database called a data warehouse. Data will be merged and integrated inside a data warehouse to get a consistent format and use the same naming. Transversal data warehouse, it
cannot be said more tellingly (in danish Tværgående datavarehus, mere sigende
kan det ikke siges). Uniformity means that within the
data warehouse the common data elements of multiple source systems are
referred to in a consistent manner. Conforming means creating within the data warehouse a conformed column with a single, consistently labelled name that has the same common meaning and identical contents all over the data warehouse, e.g. the same spelling and unit of measurement (in danish ensartet). For example, IBM or I.B.M. or International Business Machines Corporation. For example, product data can be
found in multiple source systems with different names of columns, different
spelling of product names and different segmentations of products, therefore
a data warehouse will unite, interconnect, compile, conform and consolidate
the data into one product table and one product name column to make it easy
for the business users to make a list of all products. For example, the length of a pipe is registered as 55 cm in one source system and as 0.55 m in another source system, but in a data warehouse the values become conformed to 550 mm.
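A minimal sql sketch of this kind of unit conforming, assuming a hypothetical staging table where the length value and its unit arrive in separate columns:
SELECT
    PipeLengthValue,
    PipeLengthUnit,
    CASE PipeLengthUnit
        WHEN 'mm' THEN PipeLengthValue
        WHEN 'cm' THEN PipeLengthValue * 10
        WHEN 'm'  THEN PipeLengthValue * 1000
    END AS PipeLengthMm   -- conformed to one unit of measurement
FROM Stg_Pipe_1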
For example, status can be found in order statuses, shipping statuses and payment statuses from multiple source systems, where a data warehouse dimension will unite and conform all statuses in a hierarchy. A conformed dimension supports
integration of multiple source systems into a data warehouse and is essential
for enterprise data warehousing. It is important to design a data
warehouse to support reporting and data analyses by a single and common
data model that is easy to navigate in, so a business user does not have to think about data across source systems. Data integration becomes seamless. Many source systems update their data to reflect only the most current state, while a data warehouse also maintains history. For example, an ambassador who has
lived in Washington, London and New Delhi for the last 25 years and always
bought her perfume from a Lancôme store in Paris whose IT system only stores the current shipping address; all the sales over the last 25 years will be placed
in India, while a data warehouse will remember all cities and give a correct
picture of the sales shipped around the world. Data mart is a specific subject
area that contains the data and information that is relevant for a business
function, a department or a group of business users. A data mart has one purpose: to customize and summarize data for tailored support of a specific analytical requirement from a business unit. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. A mart displays data used to support decision-making, reporting and dashboards (in danish udstiller). For example, Sales mart, Customer
mart, CRM mart, Churn prediction mart, Market mart, Production mart,
Inventory mart, Shipment mart, HR mart, Tax mart, Credit risk mart, Fraud detection
mart, Financial mart, Budget mart. A data warehouse can have multiple
data marts bound together with conformed and shared dimensions. A conformed
dimension is shared by multiple facts that have conformed measures with the same calculation methods and a common unit of measure, e.g. sales revenue and support revenue in two facts are both pre-VAT (value added tax) in United States Dollar USD, so adding them will result in a total revenue. A data mart contains the data to calculate Key Performance Indicators (KPI)
e.g. MRR Monthly Recurring Revenue and ESG Environmental, Social, and
Governance. (KPI in danish Nøgletal).

Dimensional modeling and the process of dimensionalization by thinking dimensionally is a design technique that seeks to divide and present data in a standard, intuitive framework of:

· Dimension contains context descriptive data, hierarchy and grouping for a fact.

· Fact contains measurement, measures, metrics for a business process, an event, a transaction, a circumstance,
an instance, an incident etc. A fact tells how much or how often, and a dimension adds text to it. Both fact and dimension are tables
in a database. A fact has several dimensions connected via dimension key
columns and is often displayed as a star schema diagram. (Fact in danish
faktum, faktuelt, faktuelle, kendsgerning, forretningsproces, en begivenhed,
en transaktion, en omstændighed, en instans, en hændelse). Dimension and fact will focus on
ease of accessibility for business users (end users) and provide a high level of performance when accessing the data warehouse/data mart. Ralph Kimball recommends in the
design of a data warehouse to decide what business process(es) to model by
combining an understanding of the business requirements with an understanding
of the available data. Kimball page
37: »The BI team needs to understand the needs of the business as well as the
realities of the underlying source data.« A process-centric fact table supports
integration of multiple source systems via a set of conformed dimensions,
and the entire data warehouse is built process-by-process. Data
warehousing
(DWH) is made up of three components: The architecture, The data model and The
methodology. It is a method to help the database designer build a comprehensive and reliable data warehouse system, e.g. based on Dimensional modeling principles. An OLAP (OnLine Analytical Processing) cube or a Tabular model can be at the
top of the Dimensional modeling to present data in a data access tool like
Excel, Power BI, Tableau, QlikView, Alteryx or Targit. Business
Intelligence
(BI) system provides the information that management needs to make good
business decisions. BI is going from data and information to knowledge and
wisdom for the business users. BI was previously called Decision Support
System (DSS) (in danish Beslutningsstøttesystem, Ledelsesinformationssystem
LIS). BI could also stand for Business Insight. Data does not turn into insight without the effort of building a good data warehouse with an architecture of data layers or data areas, a data model for each data layer and a methodology for processing data in each data layer or data area, with the famous abbreviation ETL that stands for three operations on data:
Extract is the process of
fetching data and extracting a desired subset of data. Transform is the process
of transforming the extracted data into a required shape. Load is the process of
loading the transformed data into a suitable target table. The ETL process brings multiple source systems together and combines their data into an integrated set of data in a data warehouse and in data marts. Read more in section 1.3.6. In this article I will focus on the
basic concepts, terminology and architecture in Dimensional modeling. My
homepage has other articles about how to design and implement a data warehouse system with levels of data in the ETL process and orchestration of a set of tasks. I see Dimensional modeling
as a subset of the Entity Relationship (ER) data modeling design
method of a relational database system OLTP (OnLine Transaction Processing).
Peter Chen, the father of ER
modeling said in his seminal paper in 1976: »The entity-relationship model
adopts the more natural view that the real world consists of entities and relationships.
It incorporates some of the important semantic information about the real world.«
Where an ER data model can have a little redundant data or no data redundancy, a dimensional model typically has a large amount of redundant data. Where an ER data model has to assemble data from numerous tables to produce anything of great use, a dimensional model stores the bulk of its data in a single fact table or a small number of them and divides data into multiple databases called data marts. Read more. Back in the 1970s Bill Inmon began
to define the term data warehouse as one part of the overall business
intelligence system. An enterprise has one data warehouse with entities in a
relational model as a centralized repository, and multiple data marts which
source their information from the data warehouse. In the data warehouse, data
is stored in third normal form (3NF). A data mart is a simple form of a data
warehouse that is focused on a single subject. Denormalization and data redundancy are the norm for data modeling techniques in a data mart. Ralph Kimball said: »A data
warehouse is a union of all its data marts.« Kimball’s data warehousing
architecture is also known as Enterprise
Data Warehouse Bus Architecture matrix (BUS matrix), a collection of conformed dimensions that have the same meaning to every fact.

Database concepts

A data model represents correctness,
completeness and consistency. A schema is a logical grouping of
and a container for database objects e.g. tables, views, table-valued
functions and stored procedures, and it can handle user access rights. A table contains data in rows also
known as records, tuples, instances or data rows. A table contains columns also known
as attributes, fields, elements or properties. A column
contains an atomic value like a First name or a compound value like a Full
name that can be further divided into three columns for FirstName, MiddleName
and LastName or Surname or Familyname. Sometimes FirstName is called Givenname. Null
represents a blank value, an empty value, an indefinite value, a nothing or a
none in a column whose value is missing at the present time because the value is not yet registered, not yet assigned or not defined for the row, or it is okay that the column does not have a value. A column can contain a
null value or a column has a constraint of enforcing it to always contain a
value e.g. by a default value. In a Customer
table a column LastPurchaseDate has a row with null that indicates the lack
of a known value, or it is known that the customer hasn’t made a purchase yet, and therefore the value is known to be absent, omitted or missing. When a
person column Date of Birth is null I will say it is missing because
the person prefers not to say or because of a lack of the information at the time the
person was registered – a not knowing situation. When a person
column MiddleName is null, I will say it is nothing because it is okay
that there is no value because a value is not required and therefore is not
missing – a knowing that there is nothing situation. Sometimes people for MiddleName type in a value »(None)« to tell that a person has no middle name, or
»(Missing)« to tell that a person has a middle name but we don’t know it, and
let null indicate that we don’t know if a person has a middle name or not, or type in »(Unknown)«. A column Number
of items may be assigned a value that is too big or too small to be represented by the data type, maybe because of an error, and therefore we
are replacing the value with a null and adding an audit description »Out-of-bounds
value« to an Audit column for later examination and investigation. On the
other hand I would prefer the column Number of items to not allow null, and a use of a negative value could be interpreted as a return of an item. A numeric column with null in several rows can cause a problem, because queries using aggregate functions can give misleading results: a SQL statement like Sum(<column>) can return a row that contains null, and use of an Outer Join can result in a column with null. In a division it is nice to use null functions to avoid a division by zero error, in sql e.g.: SELECT ISNULL(Amount / NULLIF(NumberOfItems, 0), 0).
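A minimal sql sketch of both situations, assuming a hypothetical order line table where Amount and NumberOfItems allow null:
SELECT
    SUM(Amount) AS TotalAmount,                                              -- null values are ignored by the aggregate
    AVG(Amount) AS AverageOfNonNullRows,                                     -- average over the non-null rows only
    SUM(ISNULL(Amount, 0)) / NULLIF(COUNT(*), 0) AS AverageOverAllRows,      -- null counted as 0
    ISNULL(SUM(Amount) / NULLIF(SUM(NumberOfItems), 0), 0) AS AmountPerItem  -- avoid division by zero
FROM dbo.OrderLine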
Null reminds me of: »No answer is also an answer«; when I ask a person a question and he or she does not reply, the very silence is an answer in itself. When I
receive a Christmas gift in a box with something inside, I do not know the
contents of the box, therefore the gift is unknown to me, and known to the
giver. I will not say that the gift is null for me, okay maybe null if the
box is empty but I only know that when I open the box and I could say the
gift is missing. Joe Celko, in SQL for Smarties: Advanced SQL Programming, said: »Null is not a value; it
is a marker that holds a place where a value might go.« Null is a placeholder
for a future value, e.g. an Address once it is known from the customer. In
sql, e.g.: WHERE
CustomerAddress IS NULL. An opposite term of Null can be Assigned, Extant, HasValue, IsSet, Usable or Something, which implies not nothing or not null. In
sql, e.g.: WHERE
CustomerAddress IS NOT NULL. The question
is how you will interpret null when you are reading it. Empty string
or zero-length string has zero characters with a value as ""
or '' or String.Empty, and null string has no value at all. An empty string could
be used as a default value for a column that does not allow a null value.
Trimming a string value takes away the spaces at the beginning and the end of the value, and when the string only contains spaces the trim turns it into an empty string. [Alt+0160 at the
numeric keyboard gives a Non-breaking space (No-Break
Space) that can be replaced with an empty string e.g. Replace(<StringColumn>,CHAR(160),'').] To treat null, blank, empty, nothing and empty string as equal to null in sql, do like this: WHERE (NULLIF(TRIM(CustomerAddress),'') IS NULL)
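A minimal sketch that combines the replace of the non-breaking space, the trim and the null handling into one cleaned column, assuming a hypothetical Customer table:
SELECT
    NULLIF(TRIM(REPLACE(CustomerAddress, CHAR(160), ' ')), '') AS CustomerAddressCleaned
    -- non-breaking spaces become ordinary spaces, the value is trimmed,
    -- and a resulting empty string becomes null
FROM dbo.Customer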
Void is what I have seen on a paper flight ticket with stopovers, marking the empty lines to show that there is nothing there, to prevent me from adding an extra stopover. Unknown
represents a value and therefore it is not null, but the value is not known in a system because it does not exist in the system, or it is a mistyping, a typo or a misspelling and therefore is incorrect. Unknown is something, null
is anything and thus nothing. (In danish: Ukendt er noget, null er hvad som helst og
dermed intet.) For example, I
receive an address with a zip code I don’t know, therefore the zip code is an unknown value for me and I need to validate it and make the zip code value known to the system by inserting the zip code value in a table of zip codes, cities and countries.
When a
person column MiddleName is null, I will not call it unknown because many
persons do not have a middle name and therefore it does not exist for them. A
default value could be used like »unknown« and later changed to »none« or »nothing«
when we know that the person is without a middle name, and later changed to
»missing« when we know that the person has a middle name but we don’t know
the middle name. I would prefer to have another column MiddleNameStatus, but
I think we in most cases accept a null value in MiddleName for none or nothing
until we know the name if it exists and type it into a system. A person
column Date of Birth does exist somewhere and can be missing and therefore is
unknown at the moment. A default value could be used like 0001-01-01. The cause of
an event, usually a bad event, is the thing that makes it happen. Examples of
causes are: accident, crash, delay, disaster, flood, sickness. If something
is unknown to you, you have no knowledge of it. The cause of death is
unknown. John Doe
(for men) and Jane Doe (for women) are used when the true name of a person is
unknown or is being intentionally concealed or anonymized. Data type
for a column defines the kind of data that can be stored such as integer, numeric,
decimal, string or text, date, time, datetime, amount, boolean or bit for
true (1, yes) and false (0, no). An important tip when naming columns: do not use possessive nouns such as Recipient's date of birth, better to use CamelCase like RecipientBirthDate for the column name, and don’t use values in a column name like EmployeeDay/NightCode, better EmployeeShiftTypeCode. With
conforming or conformance we avoid synonym columns with the same content but different names, and we avoid homonym columns with the same name but different
content. For
example, Type is a general term; better to give a column a name such as ProductType, CustomerType, TransactionType etc. Names of
tables and columns must be business-friendly and must make sense for both
the business users and the IT developers. It is possible to use a view to
make better names. For
example, an IT developer named a date column in a table as ReservationExpiryDate and
in a view to a business user the column is renamed to Date_of_Reservation_Expiry.
Another example where BankAccountBalanceAmount becomes Saldo and IncludeTipsBoolean
becomes Tips with Yes or No values instead of database values as true/false
or 1/0. A derived column represents a value
that is derivable from the value of a related column or set of columns and
not necessarily in the same table. For example, the age of a person is
derivable from the date of birth column and date of today, or a calculation
of an amount or a glue column to bind rows together for fast search. A
single-valued column holds a single value for an entity occurrence. A
multi-valued column holds multiple values, MovieCategory »Children, Comedy«. A candidate
key is a unique identifier for all rows in a table, so each row contains a unique value, and it can be a null value in one row. It ensures that each row is unique.
For example,
column EmployeeId, Employee Social Security Number (SSN), Employee Login
name, Employee Passport Number or Employee Car number plate. A primary
key is selected among the candidate keys and is a unique identifier for each
row in a table and its value is unique for all rows in a table, so each row
contains a unique value and it can’t be a null value (is not null, not allow
null, non null). For
example, column EmployeeId is the primary key in an Employee
table because Id is an artificial auto-generated unique sequence number or an
identity column as a surrogate key. Alternate
key or Secondary key is the key that has not been selected to be the primary
key, but it is still a candidate key and it can be a null value in one row. A compound
primary key is concatenated data e.g. a US phone number contains area code +
exchange + local number in an EmployeePhonebook table. A composite
primary key is composed of multiple columns which combination is used to
uniquely identify each row in a table e.g. column EmployeeId and column CourseNumber
in a Participant table. Useful when a single column isn’t enough to uniquely
identify a row. An entity is
a set of objects with the same properties and is implemented as a table with
rows where a primary key is unique and for use in identification of rows. A foreign
key (one column or a collection of columns) in a table refers to the primary
key in another table to establish relationships or links between the two tables.
An orphan child as a row in a Child table with a column MotherId as Null.
Another orphan child has also MotherId as Null. It is good that Null = Null
will evaluate to Null, because I don’t know if the two children are siblings
or not, therefore no answer. A
relationship defines how the entities are related to each other with a
meaningful association among entities in three ways called one-to-one,
one-to-many or many-to-many displayed in an Entity Relationship Diagram ERD. A
relationship is implemented by including the primary key in the related
entity and calling it the foreign key to make sure of the consistency in the database. When a database is consistent we can navigate with a join among
tables and relationships become navigation paths. An example of a many-to-many relationship is between the two entities Employee and Course, where one employee can take many courses and one course can be taken by many employees; it may be modeled through the use of a third entity that provides a mapping between them with foreign keys back to Employee and Course plus extra data e.g. date of the course, evaluation of the course, student grade.
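A minimal sketch of such a mapping table with a composite primary key, assuming existing Employee and Course tables with EmployeeId and CourseNumber as their primary keys (all names are examples):
CREATE TABLE dbo.Participant
(
    EmployeeId       int           NOT NULL,
    CourseNumber     nvarchar(20)  NOT NULL,
    CourseDate       date          NOT NULL,
    CourseEvaluation nvarchar(200) NULL,
    StudentGrade     nvarchar(5)   NULL,
    CONSTRAINT PK_Participant PRIMARY KEY (EmployeeId, CourseNumber),   -- composite primary key
    CONSTRAINT FK_Participant_Employee FOREIGN KEY (EmployeeId) REFERENCES dbo.Employee (EmployeeId),
    CONSTRAINT FK_Participant_Course FOREIGN KEY (CourseNumber) REFERENCES dbo.Course (CourseNumber)
)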
Normalization is the process of structuring a database in several tables to remove or reduce data redundancy and improve data integrity, through steps called normal forms: 1NF, 2NF, 3NF, BCNF, 4NF and 5NF. Guide to normal forms. But a data
warehouse loves data redundancy, because it is not a human that updates data, instead it is an ETL process that runs as a computer software program developed by an ETL developer or a data engineer. SQL stands
for Structured Query Language e.g. Select From Join Where Union. A view is a
stored sql query criteria statement
used to fetch data from a database and the data set has columns as a table
and is called a result set, rowset or a recordset. Also called a virtual
table representing the result of a database query. A
materialized view is a database object that contains the result of a sql
query e.g. data is being persisted into a table, a file or placed in a server
memory cache. A view can be materialized through an index. Materialization is
for performance reasons as a form of optimization of the ETL process and for
fetching data to business users e.g. five tables that are joined and
aggregations are precomputed and precalculated. For a client-server database
application, the result can be placed as a local copy at a user's PC. A
table-valued function is an intelligent view with parameters, like a mathematical function, and it returns a result set wrapped in a table to be used in a
join etc. A stored
procedure (sometimes parameterized) performs actions on data and may return a
result set, a value or an output parameter. Procedures are important for the
ETL process. A join is often between two tables' columns of a primary key and a foreign key. An equi join (equality join condition) includes matched rows in both tables by matching column values and is indicated with the equal operator =. A natural join is a type of equi join which occurs implicitly by comparing all the equally named columns in both tables. The join result has only one column for each pair of equally named columns. A theta join or non-equi join matches the column values using any operator other than the equal operator, like <, >, <=, >=, !=, <>. An outer join includes both matched rows and unmatched rows of a table, normally the left side table, else the right or full from both tables.
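A minimal sketch of an equi join condition inside a left outer join, assuming hypothetical Customer and Invoice tables where CustomerId is the primary key and the foreign key:
SELECT c.CustomerNumber, c.CustomerName, i.InvoiceNumber, i.InvoiceDate
FROM dbo.Customer AS c
LEFT OUTER JOIN dbo.Invoice AS i
    ON i.CustomerId = c.CustomerId   -- equality join condition on primary key and foreign key;
                                     -- customers without invoices are kept with null invoice columns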
SQL example:
SELECT COUNT(*) AS NumberOfBoats FROM dbo.CopenhagenCanals WHERE SailingPast = 'My window'
I prefer alias using = instead of AS:
SELECT NumberOfBoats = COUNT(*) FROM dbo.CopenhagenCanals WHERE SailingPast = 'My window'
Because assignment of a variable looks very much like my sql before:
DECLARE @NumberOfBoats int
SELECT @NumberOfBoats = COUNT(*) FROM dbo.CopenhagenCanals WHERE SailingPast = 'My window'
SELECT @NumberOfBoats
Data is a plural of datum
e.g. 5.5 or 38 and after labelling them shoe size in US and in EU they become
information. When I know many of my friends' shoe sizes together with other
information I have knowledge that I can use for better birthday gifts or can sell to a marketing campaign. 38 is data; as a person's age of 38 years or a person's weight of 38 kilograms it becomes information because I put 38 in a context and use 38 together with a unit. An example of metadata to
the person weight 38 kilograms could be the date and time it was measured. Data deduplication refers to a
technique for eliminating redundant data in a data set. In the process of
deduplication, extra copies of the same data are deleted, leaving only one
copy to be stored. Also called Removing
duplicates, because the unwanted data is discarded.

Metadata

Metadata is in short »data about
data« because metadata provides data or information about »other content data«
(in danish indholdsdata) e.g. author and title of a book, table of content in
a book, index in a book or an error description from an IT system like
»violation of a foreign key constraint«. Example of an XML structure with an element SalePriceAmount whose data value content is a real sale price amount together with a metadata attribute as a currency or a unit:
<SalePriceAmount Currency="USD">295.73</SalePriceAmount>
<SalePriceAmount Currency="DKK">2236.14</SalePriceAmount>
<SalePriceAmount Currency="EUR">300.65</SalePriceAmount>
In a data warehouse metadata adds an
extra value to the raw data from the source systems together with metadata for the data flow between data areas. A data warehouse uses metadata e.g. to
define and specify, describe, classify, trace, debug, audit, logs and metrics
data, access restrict and monitor data. Metadata is partly business data
(e.g. business descriptions, trace/lineage, access restriction) and partly
non-business data (e.g. monitor data, logs, metrics, technical definition). I
am using metadata columns for
housekeeping columns, support columns or administrative columns in tables, for
data lineage and timeline columns with names as InsertTime, UpdateTime, ValidFrom
and ValidTo. I have chosen to use two specific
dates as metadata values (yyyy-mm-dd):
ValidFrom and ValidTo represent the span of time when a row in a table was the »current truth«. If a source system has a date column with a null value and the null doesn’t mean »missing«, and a data warehouse would like to have only non-null values, then the data warehouse can use a forever perpetual date such as 9999-12-31 to identify a future undetermined date, an infinite date (in danish forevig uendelig dato for en uafklaret ikke aftalt dato). The benefit of a non-null date column is that a normal sql query join can be used to a calendar table in the data warehouse. For an amount a data warehouse can use a fixed value to represent a non-existent value.
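A minimal sketch of that benefit, assuming a hypothetical dimension with a non-null ValidTo of 9999-12-31 and a calendar table that includes a row for that perpetual date:
SELECT d.CustomerNumber, d.ValidFrom, d.ValidTo, cal.CalendarYear
FROM Dim.Customer AS d
INNER JOIN Dim.Calendar AS cal
    ON cal.CalendarDate = d.ValidTo   -- a null ValidTo would drop the row from this inner join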
Other examples of metadata flag columns are IsCurrent, IsDeleted and IsInferred; more about them later. Metadata columns will be mentioned in many places in the article, and their naming is covered in section 6.3. I will end
with some categories of metadata: Structural metadata describes how compound objects and containers of data are put together, e.g. the rule of a sql left-join and the enrichment of a derived column in a sql case-when statement.
resource e.g. a link to a User Story Id in Jira and to a description and
explanation in Confluence. Reference metadata is about the
contents, when and how it was created, record source, data lineage, movement
of data between systems and data layers or data areas. When we deliver
information to the business users, we must be able to tie that back to a
source data set. Administrative metadata is about how to
read the contents e.g. column PerDate. Also permissions to the contents. Process metadata is about how an ETL process
is running e.g. a package start/stop time, its duration, execution result and
error message. Audit metadata is about reconciliation
check e.g. rowcount, summarized measures to monitor and control movement of
data between systems and data layers or data areas. Also about monitor and
control of data quality and audit trail. Statistical metadata is based on
process metadata and audit metadata to describe the status of a data
warehouse to be shown in a dashboard for Data management and DataOps. Conclusion: It takes metadata to manage data and to use data.

Conforming - what to call it?

Humans all over the world are using
many terms for the same thing. For example, a transport system: Subway in New
York City, Tube or Underground in London, Metro in Paris or Copenhagen,
U-Bahn in Berlin; Vancouver and Bangkok have SkyTrain as an elevated train system above the streets. When a Filipino wants to buy
toothpaste in a store, he will ask for a colgate. A sign on the road says
xerox instead of photocopying. In Denmark I will ask for a kleenex instead of
a facial tissue. Some names of brands becoming words in our language. It can be a little tuff job to find
and to determine a good and common conformed term to become a column name,
but later on the communication will be much easier among business people and
IT-people. How to present workload per day as 7½ hours, hh:mm 7:30 or decimal
7.50? How to present a date, 2001-09-11 or 11-09-2001 or 9/11/2001?

Transaction

If Shakespeare had been a data modeler, I wonder if he would have asked the question: »A transaction has relationships or
a transaction is a relationship?«
In Peter Chen's Entity Relationship
Diagram ERD there are many kinds of transactions, e.g. an order, an invoice or a deposit of money into a bank account, which is a many-to-many data-bearing relationship in the conceptual level design with a caption like »deposit transaction«. Moving on to the logical level design, a many-to-many relationship becomes an entity and primary keys become foreign keys. Moving on to a
physical level where an entity becomes a table in the database through a
Relational Database Management System RDBMS. The most important thing for me
is, that the names of tables and columns reflect as much as possible the
user's conceptual world (in danish afspejler brugerens begrebsverden). A counterpart to an invoice is
called a credit memo, credit memorandum or a credit nota to reduce the amount
of the invoice for several reasons, for example, the item is returned, the
billed unit quantity is larger than the one delivered, the invoiced unit
price is higher than agreed or there are errors in the delivery that justify
price reductions. If a transaction
in a source system is changed, corrected or cancelled then a data warehouse can calculate a counterpart
transaction with a negative amount. The calculation of the final amount will
be a summary of transactions whereby the final total amount will be less than
the original amount. Top 10 Analytics And Business Intelligence Trends for 2019.

1.1. Data areas (data layers) in a dimensional modeling architecture

A description of the content of data
areas or data layers of the throughput from multiple source systems to
business users' PC screens for improved analytic decision making. I am using
the term area to point out that an
area may contain several databases to split data from multiple source systems
to make the entire solution scalable and can be distributed over multiple
servers. Each area is labelled with data warehouse acronyms as an
abbreviation formed from the initial letters.

Data flow

Source system→Landing zone area→Input data area→Archive area→Data staging area→Data mart area.

Source file/table→Input table→Archive table→Staging table→Target table (destination).

Source system area – SSA (Operational system and Legacy system)

A company or an organization is
using several operational systems to handle daily data of many different
kinds. Most systems are based on a database with primary keys and foreign keys, and there are three types of candidate keys to identify an entity object; we can call them source keys: Natural key exists in the real world
e.g.:
A composite natural key of
multiple columns is still a natural key. A natural key value is mutable
(changeable, not immutable), and is meaningful for a human being. Back in the 1990s we could remember many phone numbers, but the cell mobile phone
contact list took over. Business
key exists in a system and is built as an artificial number e.g.:
Some of the above numbers do exist in a natural way in the real world because they are printed on papers
and things, therefore can be called an application key. A composite business key of
multiple columns is still a business key. A business key value is either
mutable (changeable) or immutable (unchangeable), and is meaningful for a
human being where a business user prefers to use a business key value to
identify an
entity for a search criteria lookup value giving in a phone call to a company,
a bank, a hospital or the government. Surrogate key exists in a system as an artificial
auto-generated unique sequence number or an identity column (id, uid unique
identification) that is used as an unchangeable column instead of a natural
key or a business key or a composite primary key. Surrogate key is represented
as an integer, a guid, a uniqueidentifier or a hashbyte of a natural key value
or a business key value. A surrogate key value is
immutable (unchangeable), and is meaningless for a human being and will normally
not be exposed to outside business users, but is useful for join operations
inside an IT system. A natural key or a business key is
what the business uses to uniquely describe data. If a natural key or a
business key is immutable and unchangeable in a source system then call it a durable natural key or a durable business key. (Durable in danish holdbar eller
slidstærkt). A table called Salestype has a
business key as a one-letter code and has a natural key as a textual value,
e.g. R = Retail sales, D = Direct sales, O = Online sales. A table called Status has a surrogate
key StatusId as an artificial auto-generated unique sequence number or an identity
column, a column
StatusNumber as a durable business key and a column
StatusCode as a durable
natural key with a textual value which can be understood by a human being, otherwise look at column
StatusName which does not need to be unique. A table called Customer has a surrogate
key CustomerId, a column CustomerNumber as a business key and a column CustomerName
as a natural key. It is common to have several
customers with the same name, therefore I will not call CustomerName a durable natural key. It is common in a customer system that the same person exists several times, because a person appears in duplicate rows with the same name and address but with different CustomerNumber,
therefore I will not call CustomerNumber a durable business key. From a data warehouse point of view,
a natural key and a surrogate key are also business keys. For example, a table
called LeadState has a surrogate key Id and a column State with values as »New,
Contacted, Qualified, Lost«; then a data warehouse lets Id become the
business key because a text value can be changed over time e.g. »Lost« will one
day be changed to »Not interested«. A beloved child has many names. Key map table, surrogate key mapping table, cross reference table or translate table as a shadow table exists for each business key or composite business key of multiple columns (or natural key) to transform business key values either to an immutable surrogate business key _sbkey, e.g. a code which can be understood by a human, or to an artificial auto-generated unique sequence number or an identity column as a durable key _dkey, for later use in a dimension to glue multiple rows together in a timeline for a type 2 or a type 7 dimension. A durable key is a surrogate key for a business key (or a natural key) and has the function of a stable key.
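A minimal sketch of a key map table for a Customer business key across multiple source systems (the table and column names are examples):
CREATE TABLE dbo.Customer_KeyMap
(
    Customer_dkey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- durable key, immutable and meaningless
    RecordSource   nvarchar(100) NOT NULL,                   -- e.g. 'Dynamics365.Sales.Customer'
    CustomerNumber nvarchar(20)  NOT NULL,                   -- business key value from the source system
    InsertTime     datetime2(7)  NOT NULL DEFAULT SYSDATETIME(),
    CONSTRAINT UQ_Customer_KeyMap UNIQUE (RecordSource, CustomerNumber)
)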
Since a dimension can have multiple source data sets, a key map table contains a business key for each of its source systems for data lineage, to provide only one immutable surrogate business key (code) and/or one durable key across multiple source systems. Furthermore, a key map table must be
able to handle a business key value which can be mutable (changeable) to keep
the same code or same durable key value. For example, a business key Employee
number may be changed if an employee resigns and some years later is rehired,
but the belonging durable key value must be retained. Kimball calls it a durable
supernatural key and it can be part of an integration hub in a data
warehouse system. Kimball page 101: »The durable
supernatural key is handled as a dimension attribute; it’s not a replacement
for the dimension table’s surrogate primary key« which is called a dimension
key regardless of dimension type (0-7, see later). Dimension key is the primary key in a dimension table in a data mart area. A dimension key is an artificial auto-generated unique sequence number or an identity column as a surrogate key for the primary key of a dimension. I prefer a suffix _key for a dimension key column in a dimension as the primary key for the dimension; an alternative suffix is _dimkey or _dim_key. For example, a Customer_key column in a Customer dimension which is related to a Customer_key column in a Sales fact as a foreign key.
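A minimal sketch of a type 2 Customer dimension where the dimension key is the surrogate primary key, the durable key is an ordinary attribute and timeline columns keep history (names and data types are examples):
CREATE TABLE Dim.Customer
(
    Customer_key   int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- dimension key
    Customer_dkey  int           NOT NULL,                   -- durable key from the key map table
    CustomerNumber nvarchar(20)  NOT NULL,                   -- business key
    CustomerName   nvarchar(200) NOT NULL,
    ValidFrom      date          NOT NULL,
    ValidTo        date          NOT NULL DEFAULT '9999-12-31',
    IsCurrent      bit           NOT NULL DEFAULT 1
)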
Please notice that a business key translation in a key map table to a durable key (a surrogate key) does not make the key value a dimension key in a dimension, because there has to be independence from the business key, and in a type 2 dimension the dimension key has a unique value for each row even though the same durable key value or business key value is standing in several rows. Data
profiling
of a source system is very important and the choice of column to become a
data warehouse business key requires a good data analysis and criteria for
the choice, the data type and the data range. Read more. Be aware that a value of a natural key
or a business key can be changed in a system because the original value was
entered incorrectly and needs to be corrected. A composite natural key in a Member table with Name+Address will be updated when a person changes place of residence. I saw in an
airport, that the flight number was selected as the business key for the data
warehouse and used in the ETL process, but no one had taken into account,
that when a plane was delayed, the letter D was added at the end of the
flight number, therefore the data warehouse thought it was a new flight. A surrogate key as an integer or
guid won’t ever need to be changed, but if a customer is deleted by mistake and the customer is later inserted again with the same CustomerNumber
as business key, then it will get a new sequence number in CustomerId. When
a company has several operational systems handling customer data, each
system will use its own surrogate key for CustomerId, therefore business key
CustomerNumber is the glue between the systems. It is important for a data warehouse
to receive the value of the business key (and/or the natural key) from a
source system, and in case a value has been changed, that the data warehouse also receives the before-value. A database has many tables; for example, a Customer
table has columns of CustomerId as surrogate key, CustomerNumber as
business key and a CustomerName and an Invoice table has columns of InvoiceId,
InvoiceNumber, InvoiceDate and CustomerId as foreign key, the invoice data must
be extended with the business key CustomerNumber either by a source system
or inside a data warehouse to know the customer of an invoice. I prefer that
a data warehouse also store the value of surrogate key to be helpful when a value
of a business key is changed and to obtain traceability for data lineage
to tell where does the data come from, more in section 1.6. Merging data and
integrating data can be hard to set up in an ETL process and to test
afterwards.

Landing zone area – LZA

From source systems to landing zone area as a file storage of source data, where a landing zone area can be a file server. A landing zone area is a data reception area with data from multiple source
systems. A data set can arrive in many formats, therefore a good data
profiling document is nice to have. For example:
· by database replication (batched)
· by data streaming (streamed) where data flows out from a source system through a delivery or a redelivery
· by receiving data from a real-time data replication that fetches new, changed and deleted source data. An example of a Change Data Capture CDC tool is Qlik Replicate Change Data Capture, read more in section 1.2.
A landing zone area can be a file
server to receive files. A landing zone area can be a
database server to receive a database backup file that will be restored to a
database at the server, or there is a persistent database that receives data
from a source system. This area is also called Operational
Data Store (ODS), a place to store source data. Microsoft is using the term Landing
Zone as the place where source data is landing and often is never deleted or deleted
after a longer period. A file can be transferred to a file handling area, or a file can be fetched from an SFTP server through WinSCP
to a folder which represents a landing zone area, and from there copy the
file to an archive folder to keep the original source files, and move the
file to an input folder for later unzip to another folder and loading into a
data warehouse. A file can have metadata in the filename like a date of contents
and delivery. Sometimes a delivery consists of several files which can be
advantageously placed in a delivery folder with a name including a date value
or a batch number. Contents of the files must be checked for data quality
before they are loaded into the input data area. In case you like to open a
20 GB csv file and delete the last line that is corrupt, use Sublime Text,
it's free and requires no registration, only a long wait for a large file. A nightmare
is like 25 csv files that occasionally change filenames and change column names
and data type and content. It is an advantage to receive a delivery
note together with files from a source system with information about the
files like filenames, maybe a missing file and the number of lines in a file
or rows in an xml file. For a database backup file it is a must to identify
them as full backup or incremental backup before restoring. A data set should have accompanying
metadata describing the origins and processing steps that produced the data
for tracking data lineage.

Input data area - IDA

From Landing zone area to Input data
area for temporary storage of source data which has been received in files as
csv, xml, json, parquet, excel, access, database files, database backup files
or by database replication, data streaming or CDC. The files in landing zone area can
be imported or extracted into an input data area to become database tables
with loose data types e.g. string, text, varchar or nvarchar with a length of
100+ characters. When a source system has millions of
transaction data it can take a long time to do a full load every night,
therefore to save on loading time from a source system to a data warehouse it
is good to detect delta data as new data, changed data or removed data in a source
system. To fetch delta data from a source system database by using delta load or incremental
load, read more in section 1.2. This area is also called Raw data, Data
Sourcing layer, Source, Legacy, Extract, Capture Operational Data layer, Data
Acquisition or Data Lakehouse. Empty the input data area at the beginning of the ETL process. Data types will be adjusted to fit the receiving database system, and it is a good idea to use a wide data type, e.g. when a source system provides a string of length 6 or 50 characters, I will make it 100 characters to be prepared for a later extension of characters in a source system. In case a date format can’t be agreed for a csv file, e.g. 05/08/2014 where the order of day and month may be format mm/dd/yyyy or format dd/mm/yyyy, or an amount format with a decimal separator and a thousand separator as a period or a comma, e.g. 1,234.56 or 1.234,56, and the use of field quote, field terminator, text qualifier and row delimiter, I recommend loading each column into a data type of string e.g. varchar(100) to have data in a table and then do a try_cast data conversion to a new column with the appropriate data type e.g. Date, Decimal(19, 4), Currency, Money, Integer etc. I prefer a Tab delimited text UTF-8 (65001) file where the list separator is <Tabular>, Chr(9), \t. Study Schema.ini and XML Schema Definition XSD.
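A minimal sketch of that conversion, assuming a hypothetical input data area table where the source columns were loaded as varchar(100); try_cast returns null instead of failing when a value cannot be converted:
SELECT
    TRY_CAST(OrderDateText AS date)           AS OrderDate,
    TRY_CAST(AmountText    AS decimal(19, 4)) AS Amount,
    TRY_CAST(QuantityText  AS int)            AS Quantity,
    TRY_CAST('1.0E7'       AS float)          AS ScientificNotation   -- becomes 10000000
FROM dbo.Ida_SalesOrder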
Do reconciling between the data warehouse and source systems with reconciliation of row count and sum of values, mark data as reconciled and do an auditing report (to reconcile in danish at afstemme, stemmer overens). Maybe a receipt system where IDA tells a source
system that data has been received. Data might be wrong in input data area or
in archive, and later the source system will resend a new revision of data, therefore it is important to create output views that filter wrong data away and make correct data available as an extraction to the next area of the data warehouse. Read
more about data reconciliation in section 1.4. For the data lineage most
tables in the Input data area will include metadata columns: RecordSource as
a reference back to a source system e.g. value "Dynamics365.Sales.Product".
RecordSourceId is the origin primary key column from the source system. IdaBatchDataCaptureId is in all rows in all tables and will use the same identification number together with an IdaBatchDataCaptureTime as a received date and time to represent a snapshot of the total amount of data as it looks in a source system; it can be used later for keeping history, and ETL jobs/packages/transformations/sql will reuse the same id and time for all data rows. IdaInsertTime (datetime2(7)
default sysdatetime(), not unique per row) when the fetching is done. Sometimes
a source system has a Changed column with the date/time of first insert or the date/time of the latest update of the row/record that can be used for later history.
Read more about data lineage in section 1.6. Exchanging data from a source system
to IDA can be in different formats, e.g. the JSON data-interchange format; since I prefer storing data for IDA in a relational database with tables with rows, columns and data types, I must parse the json format. Kimball recommends that a source
system express data at the lowest detail possible for maximum flexibility and
extensibility, providing the data warehouse with simplicity and accessibility, and it will be the data warehouse that makes summary data, not the source system. IDA is a mirror of source data. I recommend for each source system
to create its own Input data area e.g. IDA_Dynamics365, IDA_Salesforce,
IDA_ServiceNow, IDA_Proteus to keep the source data separately. Each IDA has metadata
for scheduled expected delivery with dates, filenames or tablenames,
estimated number of rows, receiving time, row count and flags like
IsReceived, IsCompleted, IsReady together with different control tables,
reconciliation tables and staging tables to make data ready for the archive
area. I recommend for each source system
to create its own ETL process (job with packages) to extract, maybe do a little transformation, and load data into an Input data area and possibly
furthermore to an Archive area. When a source system is bad or down, data
must continue to be loaded from other source systems. When a daily source
data delivery is bad, the next day delivery must still be able to be loaded
and it is nice when a redelivery of a bad data can be provided and
handled in the ETL process.

Archive area - ARA

From Input data area to Archive area
for keeping source data indefinitely through forever storage of source data from
source systems as a data basis for a data warehouse. This area is also called Persistent
Staging Area (PSA), Persistent Historized Data Store, Persistence Layer, Detail
Data Store, Data Repository or History or Archive data area. Archive is a
versioning of the source. Never empty the archive area, because it is an archive of time variant source data and it will retain historical value
changes in the source system. Simple data adjustment can be done
to gain the same date and amount format and the same data representation of a social
security number etc. but it will be in new columns so the original values are
unchanged. For the data lineage most
tables in the Archive area will include metadata columns: RecordSource,
RecordSourceId, IdaBatchDataCaptureId, IdaInsertTime and ArcBatchDataCaptureId,
ArcBatchDataCaptureTime, ArcStorage, ArcInsertTime (datetime2(7) default sysdatetime(),
not unique per row), ArcTs (timestamp, rowversion, unique per row), ArcRecordId
is a unique sequence number per row per table to obtain data lineage and
traceability back to the archive and the column can also be used to fetch and
pull delta data out of the archive that have not yet been loaded into the
data warehouse. ArcGlobalId is a unique sequence number per row across all
tables in archive. A different kind of an archive that
does not do »forever storage of source data« can be called Current Data
Area – CDA with the purpose to have a storage of current source data because
the data change frequency and volume of changes is too big to retain
historical value changes in a source system, therefore the archive has »current
only« version of source data to improve performance and use less storage. CDA
means that you have to detect data to insert new data, to update changed data
and to delete removed data from a source system. CDA can be based on either
full load or delta load or incremental load. CDA does not mean that you can’t
have a dimension of type 2 history in a data mart area.

Data staging area - DSA

From Archive area to Data staging area for extracting archive data by full load, delta load or incremental load, and for transforming and enriching data according to business rules, to identify dimension data and fact data in staging tables that shape and suit tailored target tables, ensuring data is trustworthy in order to prepare it for loading into a data mart area for querying and analysis in different ways by the business users for their analysis, reporting and dashboards. Data processing can be for example:
DSA can be divided into tables for StagingInput
and for StagingOutput by using database schemas like StagingDim,
StagingFact, StagingAnalysis (one big table) to shape and suit target tables
in schemas Dim, Fact, Analysis in a data mart area. This area is also called Work,
Batch, Calculation, Enrichment, Prepare, Temporary, Transformation or just
Staging, Stage or Stg. Empty the data staging area at the beginning of the ETL process because data is only transient and temporary in this area.
Examples of concrete data processing
are trim string, unicode string, max length of string, null gets a value as 0
or empty string '', replace illegal characters, replace value, verification
of data type and do data conversion to a suitable data type, e.g. a csv file
contains a value 1.0E7 in the scientific notation and it will be converted to
the value 10000000 and saved in a column in a table. Data syntax, date format
like mm/dd/yy or dd.mm.yyyy will be converted to yyyy-mm-dd; correcting mistyping, typos or misspelling; fixing impossible values, punctuation and spelling differences to achieve a common format, notation and representation; validating data, e.g. you expect the two values »true« and »false« in a boolean column in a file but suddenly there is a value like »none«, »void« or a sign ?; correcting data and doing data deduplication by selecting one of the duplicate rows and moving the removed duplicate rows to a wrong table. The data cleansing and data integration process with
multiple source systems is to make data cleaned and conformed, e.g. a gender
code from one source has »Man, Women«, another has »M, F« will be conformed
to »Male, Female« through a mapping of source data. Calculations and computed metrics include e.g. total cost and revenue, elapsed time and overdue time.
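A minimal sql sketch of such a conforming mapping for the gender code, assuming hypothetical staging table and column names:
SELECT
    RecordSource,
    CASE
        WHEN GenderSourceCode IN ('Man', 'M')   THEN 'Male'
        WHEN GenderSourceCode IN ('Women', 'F') THEN 'Female'
        ELSE 'Unknown'
    END AS GenderConformed
FROM StagingDim.Stg_Customer_1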
When an archive is stored on another server, it is common to load data from the archive into a data staging area at the data warehouse server, e.g. data to a dimension of type 1, type 2 or type 7, and from a dimension staging table perform a loading as a merge or an insert and update to a dimension table in a data mart area. I know that many like to send a
query to a source system to translate names of columns and data types, merge
or divide columns like a Name to become two columns of FirstName and LastName
(Surname), but then a data warehouse will not receive the original source
data. I like to use computed columns in a
staging table for calculation and string manipulation. Sometimes I use a hashbyte value for a comparison column for a composite business key that is composed of multiple columns (ComparisonBkey_meta) and for the other data value columns (ComparisonData_meta) to merge and load data from a staging dimension table into a mart dimension table. Other times I use a generic stored procedure with dynamic sql statements (string gymnastics <template> merge construction) that will perform the load from data staging area to data mart area for many dimensions.
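A minimal sketch of such comparison columns computed with HASHBYTES over a hypothetical composite business key and the remaining data value columns:
SELECT
    CompanyNumber,
    CustomerNumber,
    CustomerName,
    CustomerAddress,
    HASHBYTES('SHA2_256', CONCAT(CompanyNumber, '|', CustomerNumber))  AS ComparisonBkey_meta,
    HASHBYTES('SHA2_256', CONCAT(CustomerName, '|', CustomerAddress))  AS ComparisonData_meta
FROM StagingDim.Stg_Customer_0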
The DSA database will contain different auxiliary tables with names such as stg, stage, tmp, temp, temporary, and with multiple staging tables per target table, where the staging tables can have an additional number e.g. Stg_Customer_Order_1, Stg_Customer_Order_2, Stg_Customer_Order_3, and the last staging table to shape and suit the target table uses number 0 as in Stg_Customer_Order_0. DSA will perform a data quality check and
filter wrong data and invalid data into a quality assurance and quality
control (QA/QC) database. DSA continues reconciling data from the archive
for referential integrity e.g. foreign key value exists as primary key
value, relationship cardinality rules for dependencies as mandatory,
optional or contingent to make sure there is a foreign key or not, it can be
null or it needs a value. Many other value checks and summation and so on for
a validation check of data. When all tables in this area are loaded with correct data, the DSA is successfully completed. A source system's new date or changed date, or the IdaInsertTime or ArcInsertTime, represents a batch of insertion and marks all data with that batch time no matter which source the data is coming from. In case of ValidFrom and ValidTo, all ValidFrom will be using the same batch time, which is useful for joins between tables and between source systems to fetch the right data snapshot at a particular time back in history. Read more about data quality in section 1.5. Furthermore, derived columns and calculations are ready for loading to the next area of the data warehouse. In case of an error in the ETL process or in important data, a halt condition will be raised to stop the ETL process. Other times data will pass by with
a flag in an audit dimension. Kimball has called it data wrangling to lasso
the data and get it under control (data munging). Remember to do an auditing
reporting after a validation check. For the data lineage most
tables in the Data staging area will include metadata columns from the
Archive area as RecordSource, ArcStorage and ArcRecordId. (In danish data staging area stands for et data
forberedelsesområde eller data rangeringsområde til data vask, data udsøgning,
data berigelse, data behandling, data beregning, data bearbejdning, data
sammenlægning, sammenstilling, samkøring, tilrettelæggelse og klargøring
af data hvor data er blevet vasket og strøget). Kimball calls this area an ETL system
or the first system that is off limits to all final data warehouse clients,
and he uses an analogy: »The staging area is exactly like the kitchen in a
restaurant. The kitchen is a busy, even dangerous, place filled with sharp
knives and hot liquids. The cooks are busy, focused on the task of preparing
the food. It just isn’t appropriate to allow diners into a professional
kitchen or allow the cooks to be distracted with the very separate issues of
the fine dining experience.« Kimball's ETL is using Data Staging Area
(DSA) as a temporary space to create a data warehouse as a collection of data
marts.
Data mart area - DMA
From Data staging area to Data mart area
with dimensional modeling, conformed and shared dimension tables, star
schema around each fact table, assigned surrogate key (artificial key, identity
column, a unique sequence number) for each dimension and use it in fact as
foreign key. When multiple fact tables share dimensions it is called constellation
schema or multi-star schema. Keep historical data in dimension
and fact with timeline columns ValidFrom and ValidTo. Make it easy for a business
user to search for current data through a view and to give a historical date
to a table-valued function to fetch dimension data and fact data at any point
of time.
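A minimal sketch of such a table-valued function for a historical date, assuming a hypothetical type 2 dimension table dma.Dim_Customer with ValidFrom and ValidTo columns:

CREATE FUNCTION dma.DimCustomer_AtDate (@HistoricalDate date)
RETURNS TABLE
AS
RETURN
   SELECT Customer_key, CustomerId, Name, City
   FROM   dma.Dim_Customer
   WHERE  @HistoricalDate >= ValidFrom AND @HistoricalDate < ValidTo

A business user can then call e.g. SELECT * FROM dma.DimCustomer_AtDate('2014-04-25') to see the dimension as it looked on that date.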
A data mart can have multiple fact tables with different granularities. One or multiple fact tables can create an extra derived fact
table with special calculations and search filter criteria to enrich data to
match the business requirements specification. Make a conform name for a column
with a conform data type, for example when you merge customer addresses from
multiple source systems make a string extra long to fit all address
characters and avoid a value to be mandatory or set a default as the empty
string to ensure a robust ETL process. Never empty the data mart area and
always take a backup before the ETL process. A data mart is a front-room for publishing
the organization’s data assets to effectively support improved business decision
making. A data mart contains one data area for one purpose and is subject-oriented,
therefore a data warehouse will consist of many data marts and they will have
some common dimensions. To avoid maintaining the same dimension in multiple
data marts, it may be considered to have a dimension data mart that shares its
tables through views inside the other data marts. For the data lineage, most
tables in the Data mart area will include metadata columns from the Archive
area as RecordSource, ArcStorage and ArcRecordId. (In danish data mart area stands for et behandlet
lag, tilrettelagt lag, klargjort lag med data som opfylder forretningens data
behov). Kimball calls this area a dimensional presentation area, and he uses the
term first-level for a data mart that contains data from only one source system and
second-level or consolidated for a data mart with multiple sources that crosses
business processes. Read
more about data mart modeling in section 1.3.7.
Presentation interface area - PIA
From the Data mart area to the Presentation interface area: data access tools like
Reporting Services (SSRS) and Microsoft Access or data visualization tools like
Tableau, QlikView, Qlik Sense and Power BI can import data from a dimensional schema
and handle in-memory calculated KPIs as quick measures using filters. The purpose is
to help and support a business user to solve a business case. An OLAP cube or a
Tabular model loads data from a data mart, and the processing does the calculation of
KPIs and displays a pivot in a data access tool like Excel, Power BI, Tableau,
QlikView, Alteryx or Targit. A report with criteria parameters searches, selects and
calculates data from a data mart based on an ad hoc query of a business user, or the
report is sent out as a pdf file or an Excel file to the business users every
day, week, month and so on. A dashboard is a collection of unrelated data intended to summarize
the state of the business during a given time frame. A simple dashboard sample.
A dashboard provides a quick and clear overview of the key figures, indicators (KPI)
and performance data that are most important to him or her, shown as numbers and data
visualization linked to business user groups. »If your dashboard does not fit
on one page, you have a report, not a dashboard.« cf. Avinash Kaushik. Start writing a data story while
designing a dashboard, where the design should follow the narrative with a
destination of answering the business questions. Dashboards for business need
to answer key business questions where every chart, graph, metric and
KPI gives actionable insights, helping a stakeholder or a business user
to make better data-driven decisions. Data mining and machine learning will also use data from a data mart. Percentage (fraction) and ratio
based on a dimension slice will be calculated »on the fly« in a BI tool when
a fact table contains a numerator and a denominator, e.g. a DAX (Data
Analysis Expressions) query language for a tabular model in Power BI. Power BI guidance to star-schema and best to use a star schema in Power BI. DAX for SQL user and DAX tutorial. Two measures Average price and
Active customers with a local variable for an intermediate calculation in a
code block ending with RETURN in DAX from a Sales fact:

Average price :=
    DIVIDE(SUMX('Fact_Sales', 'Fact_Sales'[UnitPrice] * 'Fact_Sales'[Quantity]),
           SUM('Fact_Sales'[Quantity]), 0)

Active customers :=
    VAR measurevalue =
        CALCULATE(
            DISTINCTCOUNT('Fact_Sales'[Customer_key]),
            FILTER('Fact_Sales',
                   'Fact_Sales'[EndDate] = BLANK() && 'Fact_Sales'[Amount] > 0))
    RETURN IF(ISBLANK(measurevalue), 0, measurevalue)

Kimball calls this area a Business
Intelligence Application. A date range lookup in a Kimball
type 2 dimension when fact has a business key (_bkey) and a LaunchDate column
in DAX:

CampaignName :=
    CALCULATE(
        VALUES('Dim_Campaign'[CampaignName]),
        FILTER('Dim_Campaign',
               'Fact_Product'[Campaign_bkey] = 'Dim_Campaign'[Campaign_bkey]
               && 'Fact_Product'[LaunchDate] >= 'Dim_Campaign'[ValidFrom]
               && 'Fact_Product'[LaunchDate] < 'Dim_Campaign'[ValidTo]))
10 simple steps for effective KPI. Data driven Infographic
Supporting databases and/or schemas in a database
Supporting data in tables for a data warehouse can be placed in either several
databases and/or in several schemas in a database to be across the data areas.
Here are some examples. Audit with auditing data to an
audit trail for reconciliation check from source systems with number of rows
in tables and other summarized measures to monitor and control to avoid
discrepancy between source systems and the data warehouse and make sure that
all data has been fetched and saved in the areas of the data warehouse by
doing validation check. Log of RowCount of target rows before a
process and after a process to measure the changes in number of rows the
process did (TabelRowCountBeginAt and TabelRowCountEndAt). RowCount of
source data rows and extracted rows based on join and criteria. RowCount of
target deleted rows, updated rows and inserted rows/loaded rows successfully
or rejected rows with missing value or out of bounds amount or other kind of
invalid data depending on the validation check of data quality etc. Do alert
the data owner and data users by email. Auditing is a
kind of inspection (in danish eftersyn).
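A minimal sketch of an audit trail log table and a before-load entry, with the table name and the extra columns as assumptions on top of the columns mentioned above:

CREATE TABLE dbo.Audit_RowCountLog
(
  AuditLogId           int IDENTITY(1,1) PRIMARY KEY,
  Sourcename           nvarchar(50)  NOT NULL,
  Tablename            nvarchar(128) NOT NULL,
  TabelRowCountBeginAt int NOT NULL,           -- rows in the target before the process
  TabelRowCountEndAt   int NULL,               -- rows in the target after the process
  InsertedRows         int NULL,
  UpdatedRows          int NULL,
  RejectedRows         int NULL,
  LogTime              datetime2(3) NOT NULL DEFAULT SYSDATETIME()
)

INSERT INTO dbo.Audit_RowCountLog (Sourcename, Tablename, TabelRowCountBeginAt)
SELECT 'Northwind', 'Dim_Customer', COUNT(*) FROM dma.Dim_Customer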
System with data for the ETL process, e.g. connection string, capture execution log metadata (at runtime)
e.g. ExecutionLogId, execution time, status of execution, error message, job
log, package, transform and exception handling. To store a value for »last
value« for delta load or incremental load with delta data detection. To
store information for IdaBatchDataCaptureId, IdaBatchDataCaptureTime,
ArcBatchDataCaptureId, ArcBatchDataCaptureTime etc. Test with test cases while
programming the ETL process and for making KPI values to validate the
program. Usage with custom data for mapping
and consolidation of data from multiple source systems and data for
dimension descriptive context, hierarchy, grouping, sort order and rules for new
dimension values. Usage tells how source data is transformed and changed into
useful information by the ETL process to meet the requirements of the
business and displayed in a data mart. Usage data is maintained and updated
by the business users through an application. For example, product data
can be found in multiple source systems with different spellings where a
usage application will help the business to unite, interconnect,
compile, conform and consolidate the products into one spelling for a
product together with a mapping of the multiple source systems product
numbers into one product.
The usage application can also provide hierarchy levels, grouping, sort
order for the products. Some data in the Usage supporting database will be
loaded from the archive, because the data comes from a source system, to be enriched
by the
business users.
The Input data area will fetch data from the Usage supporting database and
pass it further to the archive to be used in transformations and in updating
a dimension with enriched data that doesn’t exist in the source
system. Usage data will also be
used for making fact data e.g. a calculation with a constant value, a
rule as if-then-else and a lookup value. Usage is data-driven to avoid data
values inside the ETL package, the sql statements and the program. A Master
Data Management (MDM) is an extension of a Usage database (in danish
brugerdefinerede stamdata, ordet custom kan oversættes til specialbygget).
Examples of Usage data in section 6.1. Wrong with data that can’t be
loaded into a target because of error description e.g.: »The data value
violates integrity constraints«, »Violation of primary key constraint«,
»Cannot insert the value null«, »Cannot insert duplicate«, »String data
would be truncated«, »Arithmetic overflow error«. Wrong data is rejected data
like when the ETL process is removing duplicate rows to keep them in a wrong
table to be monitored by a data steward that can inform a source system. Wrong
data can also be data that do not follow the quality assurance and quality
control. Quality assurance, consistency and integrity are important parts of the
ETL process and the audit trail. More examples in section 1.5 and an Audit
dimension in section 4.4 to tag a wrong data row in a target table. When custom data in the Usage
database is going to be changed it can be done in a UAT user acceptance
testing environment. After the ETL process has been executed we can test the
data warehouse. Later we can take a backup of the Usage database and restore
it in the production environment.
Data integration mapping by abbreviations example
A Usage database has two tables for
sales device from two operational systems and each table has its own business
key columns labelled _bkey. I have invented an immutable surrogate
business key column labelled _sbkey which contains abbreviations as a
simple letter code that can be
used in a dimension in a data mart as an easy search criteria for a business
user query and a dropdown box in a dashboard or report. The _sbkey is called
a retained key in SAS and a persistent durable supernatural key in Kimball.
A _sbkey column is mapped to a business key column, which makes it stand for
an integration and a merge to a conformed dimension. Let's have a look at an example that
is using a key map table to make a
solid mapping for a dimension called SalesDevice in a data mart area (DMA). A table SalesDevice_Map has a
surrogate business key _sbkey as the primary key merging of all devices from
retailer shops and from online shop to a current name and a sort order and a category
or classification or grouping called DeviceModel. The table SalesDevice_Map
will be updated by a data steward in the Usage database and the ETL process will
flow the data to a Dim_SalesDevice type 1 dimension in a data mart area.
A retailer shop has different sales
devices over time. A key map table SalesDevice_Keymap_Retailer
has a business key _bkey from different source systems mapped to the relevant
surrogate business key _sbkey and timeline columns ValidFrom and ValidTo.
An online shop has different sales
devices over time. A key map table SalesDevice_Keymap_Online
table has a business key _bkey from different source systems mapped to the
relevant surrogate business key _sbkey and timeline columns ValidFrom and
ValidTo.
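A minimal sketch of the map table and one of the key map tables, with data types and constraints as assumptions:

CREATE TABLE usage.SalesDevice_Map
(
  SalesDevice_sbkey nvarchar(10) NOT NULL PRIMARY KEY,  -- abbreviation e.g. 'ECR', 'IPD', 'PC'
  SalesDeviceName   nvarchar(50) NOT NULL,              -- current name
  SortOrder         int          NOT NULL,
  DeviceModel       nvarchar(50) NOT NULL               -- category/classification/grouping
)

CREATE TABLE usage.SalesDevice_Keymap_Retailer
(
  SalesDevice_bkey  nvarchar(20) NOT NULL,              -- business key from a source system
  SalesDevice_sbkey nvarchar(10) NOT NULL
                    REFERENCES usage.SalesDevice_Map (SalesDevice_sbkey),
  ValidFrom         date         NOT NULL,
  ValidTo           date         NOT NULL,
  PRIMARY KEY (SalesDevice_bkey, ValidFrom)
)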
We see that the sales devices PC and Mac have different business key values in
source systems compared to the retailer mapping table, and over time sales devices
have changed their business keys in a source system from a letter code to a number
value. Above, business key 10 represents two different sales devices, first a Tablet
computer and later a Personal computer. For example, above, business key value
1 has two mappings, ECR for a retailer Electronic cash register and IPD for an online user iPad. The ETL
process will for transaction sales data do a lookup in the relevant key map table to fetch the _sbkey value and another
lookup in the SalesDevice dimension to fetch the dimension key value to be
saved within the fact row. The table Dim_SalesDevice is a type
1 dimension with a primary key labelled _key that stands for dimension key
and with inferred members for Missing and Unknown.
The SalesDevice dimension has values or members as rows in
a table in a database. An example of a view that binds the key map table and the dimension table
together for an ETL process, where retailer transaction sales data can do a
date range lookup to fetch the right SalesDevice_key value through the mapping
to the surrogate business key:

CREATE VIEW mapping.SalesDevice_Retailer_Lkp
AS
SELECT map.SalesDevice_bkey,
       map.ValidFrom,
       map.ValidTo,
       dim.SalesDevice_key
FROM   usage.SalesDevice_Keymap_Retailer map
       INNER JOIN dma.Dim_SalesDevice dim
       ON dim.SalesDevice_sbkey = map.SalesDevice_sbkey

Retailer transaction sales data joins to the lookup view to let its
business key fetch the dimension key and insert data into a fact table. For
an ETL process, let's make Extracting as a delta load or incremental load with
delta data detection in a where clause, Transforming of the business key
SalesDeviceNumber to the dimension key SalesDevice_key, and Loading new data rows
into a Sales fact table with SalesDevice_key as a reference to the SalesDevice
dimension, into a data mart area (dma schema):

INSERT INTO dma.Fact_Sales (PurchaseDate_key, SalesDevice_key, Amount, TransactionId)
SELECT trn.PurchaseDate,  --date format same as the Date dimension role-playing Purchase.
       SalesDevice_key = ISNULL(sdrl.SalesDevice_key, -2),  --in case of no match in join.
       trn.Amount,
       trn.TransactionId
FROM   source.RetailerTransactionSales trn
       LEFT OUTER JOIN mapping.SalesDevice_Retailer_Lkp sdrl
       ON  sdrl.SalesDevice_bkey = ISNULL(trn.SalesDeviceNumber, -1)
       AND trn.PurchaseDate >= sdrl.ValidFrom AND trn.PurchaseDate < sdrl.ValidTo
WHERE  trn.TransactionId > (SELECT MAX(TransactionId) FROM dma.Fact_Sales)

Horoscope star signs as a static dimension example
A Usage database has a table with names of horoscopes as a natural key, an abbreviation as a business key labelled _bkey, and I am using the planet/house cusp number as a business surrogate
key labelled
_sbkey. Element becomes a grouping of the
twelve signs and the first level of the hierarchy Element→Horoscope.
1.2. Data capture or data ingestion
Data capture or data ingestion or ingesting
(in danish datafangst, hjemtagelse, indtage, modtagelse) of data from a
source system to a Landing zone area and/or an Input data area is based on
two data flow directions: Push where the Source finds the
data and delivers it to an area like a database or a file at ftp to a Landing
zone area and later the Input data area will fetch the data and indicate that
the data is downloaded, received and stored, so Source knows that it can make
a new delivery by using a semaphore to control access to a common resource
by multiple processes. A snapshot on a source system creates a read-only
and static view that is transactionally consistent. Pull where the Landing zone area
or Input data area will fetch data from Source through a sql Select From
Where statement maybe with joins of tables or using a view or calling a
stored procedure with a parameter as a criteria value to limit source data
and for doing delta data detection for delta load or incremental load, or from an ETL
tool or a web service where the data warehouse will request source data. Granularity of data capture
integration strategy is to consider when the amount of data is big in a
source system. A data warehouse prefers to receive data at the lowest granularity level
of detail in case of specific analysis usage or data validation, but sometimes
it is necessary to aggregate and summarize source data to a higher
granularity like per day and per customer segment for transaction data or transactional
data from an OLTP system e.g. orders, invoices, billings, payments, site hits.
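A minimal sketch of such an aggregation to a higher granularity, assuming a hypothetical source table source.OrderLine with an OrderDate, a CustomerSegment and an Amount column:

SELECT CAST(ol.OrderDate AS date) AS OrderDate,
       ol.CustomerSegment,
       SUM(ol.Amount) AS Amount,
       COUNT(*)       AS NumberOfOrderLines
FROM   source.OrderLine ol
GROUP BY CAST(ol.OrderDate AS date), ol.CustomerSegment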
Non-transactional operational system
data e.g. Customer, Location, Contact, Supplier, Part, Product can be stored
in a Master Data Management (MDM) database
as master tables that are shared with multiple operational systems in an organization
to avoid data redundancy in the sense that the same data must be updated in several
systems by a business user. Instead, master data is updated only in one place
by a business user, after which data is automatically delivered to a number
of operational systems to achieve one truth about data. The operational
systems submit data to MDM and the operational systems subscribe to and
continuously receive current data as needed, possibly including versioned
historically collected data. MDM is an extension of the Usage
supporting database. Data capture from a source system is
based on multiple data delivery forms:
Data capture from a source system is
based on two data delivery methods: Full load or full dump as flush and
fill where Landing zone area or Input data area gets all data from a
source system. Delta load
or Incremental load with delta data detection where Landing zone area or Input data area only gets new data
or changed data from a source system, called Change Data Capture (CDC). It needs
an initial load the first time it runs. It can be a source system that keeps
track of the delivered date, or it can be the Input data area that keeps track of
a last received Sequence Id (integer), Sequence Time (datetime), or an InsertUpdateTimestamp
value from a source system for each data row, so the Input data area can
detect new data or changed data like asking for data since that last received
value or a maximum value found in the last delivery, e.g.: InsertUpdateTimestamp
> the last received value. It could also be a posting data e.g.
entry date (or created date) and next delta load or incremental load is for: EntryDate > »the last received
EntryDate«. (in danish Posteringsdato eller Bogføringsdato).
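A minimal sketch of such a delta load, assuming a hypothetical table etl.LastReceivedValue in the System supporting database that stores the »last value« per source table:

DECLARE @LastReceivedValue datetime2(3) =
        (SELECT LastValue FROM etl.LastReceivedValue
         WHERE Sourcename = 'Northwind' AND Tablename = 'Orders')

SELECT ord.*                                   -- only new or changed rows since the last load
INTO   #Orders_Delta
FROM   source.Orders ord
WHERE  ord.InsertUpdateTimestamp > @LastReceivedValue

UPDATE etl.LastReceivedValue                   -- remember the maximum value found in this delivery
SET    LastValue = ISNULL((SELECT MAX(InsertUpdateTimestamp) FROM #Orders_Delta), @LastReceivedValue)
WHERE  Sourcename = 'Northwind' AND Tablename = 'Orders'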
Delta load or incremental load is relevant for a real-time data warehouse
with continuous updates throughout the day to be displayed in a dashboard
and when the amount of data and number of rows are big and a full load takes
longer and longer time, as time goes by. It is important to do a log of number of rows from a source system
to the Landing zone area and to Input data area for later reconciliation check count auditing (in
danish kontroloptælling, afstemning) in case a source system does not
deliver the expected number of rows and to monitor over time the number of
rows to see if it increases as expected and be used in a graph over the
amount of data in the data warehouse. Other data values e.g. amount,
quantity and volume can also be logged and monitored. I recommend that a source
system tells the number of rows per table that will be compared to the saved number
of rows in Input data area in an audit trail. It can be implemented as a view
in a source system like this:

CREATE VIEW BI_Audit_NumberOfRows
AS
SELECT 'Northwind' AS Sourcename, 'Customers' AS Tablename, COUNT(*) AS NumberOfRows FROM dbo.Customers
UNION ALL
SELECT 'Northwind' AS Sourcename, 'Orders' AS Tablename, COUNT(*) AS NumberOfRows FROM dbo.Orders
UNION ALL
SELECT 'Northwind' AS Sourcename, 'Order Details' AS Tablename, COUNT(*) AS NumberOfRows FROM dbo.[Order Details]

The data warehouse can do its own
delta data detection between Input data area and Archive area to identify new
data, changed data and deleted data. Maintaining historical data with
ValidFrom and ValidTo datetime stamps is handled by the following actions:
Sometimes a source system provides
flag information to pass a status to a data warehouse. I have seen combinations
of consistent statuses like these:
Microsoft Windows calls it
Add/Remove Programs in Control Panel (ARP). Change Data
Capture CDC
is a method to log data in a table when data is changed by transferring a copy of
the data row from the table (or table log) to another log table. Changed data
represents: inserted data, updated data and deleted data. Fetch the data
values that have changed since the last fetch (extract). (In danish Data
ændringsdetektion d.v.s at opsnappe eller at spore at data har ændret sig i
et andet system). Sometimes a source system updates a data
row multiple times but only the most recent version goes to the data
warehouse. If a source system inserts and updates a row before a new load to
the data warehouse, only the updated version goes to the data warehouse. If a source
system inserts a new row and deletes the same row before a new load to the data warehouse,
the data will never go to the data warehouse. It is seldom that a source
system has a log and is logging all changes with a revision number or a ValidFrom
and ValidTo datetime stamp, therefore a data warehouse does not contain
100% full historical updates of data in an organization. Sometimes a source
system has for each (or selected) tables in the database an extra historical log
table that contains all inserted, updated and deleted data rows from the
original table together with an Action column (values for Insert, Update,
Delete) and an EntryDate datetime stamp column that marks when the action
occurred and a SequenceId that is used for delta data detection for delta
load or incremental load to pull data from a Source log to an Input data area,
maybe through a Landing zone area. An example of a source historical
log table with rows every time a customer changes data. A customer has a
CustomerId 421 in a source system and her maiden name is Marie Beaulieur. She
is inserted into the log table that keeps CustomerId as a business key, and that
adds a unique sequence id as primary key for the log for having independence
of the business key. At date 2007-11-16 she got married, and she took her
husband’s surname, therefore her name was changed to Marie Lesauvage, and she
is inserted into log table with a new row. Marie Lesauvage remarried at
2014-07-15 and took her new husband’s surname therefore her name is changed
to Marie Sainte, and she is inserted into log table with a new row. At date
2017-02-07 she stopped her membership and was deleted as a customer in the source
system, and she is inserted into log table with a new row.
This can be implemented as after-insert/update/delete triggers in a source system.
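A minimal sketch of such a historical log table and an after-update trigger, with table and column names as assumptions beyond those used in the example above:

CREATE TABLE dbo.Customer_Log
(
  SequenceId int IDENTITY(1,1) PRIMARY KEY,    -- unique sequence id, independent of the business key
  CustomerId int           NOT NULL,           -- business key
  Name       nvarchar(100) NULL,
  City       nvarchar(50)  NULL,
  [Action]   nvarchar(10)  NOT NULL,           -- Insert, Update or Delete
  EntryDate  datetime2(3)  NOT NULL DEFAULT SYSDATETIME()
)
GO
CREATE TRIGGER dbo.Customer_AfterUpdate ON dbo.Customer
AFTER UPDATE
AS
INSERT INTO dbo.Customer_Log (CustomerId, Name, City, [Action])
SELECT i.CustomerId, i.Name, i.City, 'Update'
FROM   inserted i
GO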
Another example: Jane Macquarie was sales manager for the eastern region until December 31, 2018,
and then took responsibility for the western region from January 1, 2019. How
was it registered in a source system? I have seen systems without a date to
register a shift, because an operator will update the responsibility at
beginning of January 2019, which means the data warehouse can let Jane get a
ValidFrom date as 2019-01-05 for western region, but it is a wrong picture
and sales fact data for a few days in 2019 will belong to the eastern region.
Please, always ask a source system for multiple date columns and use them for
ValidFrom and ValidTo to get a real picture of the business in archive area
and data mart area.
Data latency
Data latency describes how quickly source data has to be ready in the data warehouse for
the business users to do their reporting. Deadline is normally in the morning
based on a periodic batch of source data from yesterday. For a real-time load
to a data warehouse with constantly new presentation in a dashboard, the ETL
process must be streaming oriented where source data continuously flows into the
data warehouse by delta load or incremental load to do transformation and
making of conformed data.
1.3. Enterprise Data Warehouse modeling architecture - EDW
A data warehouse can be modeled in
different ways. I will present three data models. In a data warehouse solution
the source data from multiple source systems can go into a single common database
called an Enterprise Data Warehouse (EDW) where data is integrated into a
single and coherent structure. Notice that modeling an EDW is based on
business requirements and not on the source data structure. An EDW contains and represents a »single version of truth« for the
enterprise.
1.3.1. Bill Inmon modeling - Corporate Information Factory model
Bill Inmon is using an Operational
Data Store (ODS), a place to store source data as a central database that
provides a snapshot of the latest source data from multiple source systems (operational
systems or transactional systems) with validate referential integrity for
operational reporting and as a source of data to an Enterprise Data Warehouse
(EDW) which feed the CIF Corporate Information Factory, which provides a
logical framework for delivering business intelligence. When an EDW is modeled after Inmon's
modeling it is based on Peter Chen’s Entity Relationship data modeling with
super-sub entity, associative entity and with 80% normalized data, often a third
normal form (3NF) relational database. The EDW offers integrated, granular,
historical and stable data that has not yet been modified for a concrete
usage and can therefore be seen as neutral. It can keep historical data
meaning all the changes to the data or only the days-end status e.g. End Of
The Day for each data revision from ODS (or Archive area or Input data area).
An entity Person can have a
one-to-many relationship to the addresses of the person’s living places in
another PersonAddress entity with ValidFrom and ValidTo columns, and these
entities or tables can be merged into a type 2 dimension table in a Kimball
data mart.
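A minimal sketch of how the two normalized EDW entities could be read together for a type 2 dimension, assuming hypothetical tables edw.Person and edw.PersonAddress:

SELECT per.PersonId,
       per.Name,
       adr.Address,
       adr.City,
       adr.ValidFrom,                          -- the address timeline carries the type 2 history
       adr.ValidTo
FROM   edw.Person per
       INNER JOIN edw.PersonAddress adr
       ON adr.PersonId = per.PersonId

The result set is then merged into a Kimball data mart dimension, e.g. dma.Dim_Person, where a surrogate dimension key is assigned per row.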
The EDW modeling is subject-oriented, meaning all business processes for each
subject, e.g. customer, need to be modelled. An EDW common data model is not
technical, it must be based on (logical) business processes, since all source
data is integrated and collected into one common database. Inmon's modeling can divide or
partition data into super-sub tables to place related columns in its own sub
table to separate from other columns. There is often a one-to-one relationship
between a super table and its sub tables. We can say that this kind of division
of data is subject-oriented based. A super table can contain a business key
(natural key, surrogate key) from a source system together with other basic
data columns, or instead we place the business key in a mapping table
connected to each super table where a translation of a business key becomes a
data warehouse surrogate key Id as an integer, guid or a hashbyte value, and
we let the Id become primary key in the super table and foreign key in the sub
tables. When a source system adds new data, the data warehouse can place data
in a new sub table and connect data to the existing super table. Similarly,
when a new source system is introduced to the data warehouse. Inmon's
modeling using super-sub tables allows for agile modeling, read more in section
1.3.4. Super-sub tables are also called super-sub-entity or supertype/subtype and have two rules:
Let the design of the data model for
an EDW start with a »helicopter view« for identification of entities, e.g. a salesperson
or engineer will be stored in an Employee entity with a one-to-many to a
Jobfunction entity, or maybe do more abstraction in a bigger picture making
a Person entity where employees and customers can be stored together. Later
in the design process move on to a »weeds view« for all the details for columns,
conformed names and data types etc. In the EDW keep the source data with
time span columns like Start date and Stop date, and sometimes also convert
the source data to row wise entity as a kind of transactional data that is a
sequence of information exchange like financial, logistical or
work-related data, involving everything from a purchase order that becomes an
invoice with multiple invoice line items, to shipping status, to employee
hours worked, plan and activity records, for subscription period and to
insurance costs and claims. An EDW can be a data source to one or
multiple data marts using Dimensional modeling with denormalized data controlled
by the ETL process. A data mart can use a different modeling that fits
better for the use of data. An EDW should be accountable and auditable, which by
default means pushing business rules of changing/altering data downstream to
»between the data warehouse and the data marts«. EDW tables and views can be a data
source for a tabular model with a DAX formula to do filtering and calculate
in-memory KPIs to be visualized in Power BI, where a data mart area is skipped or
replaced by EDW views, therefore it is called an ELT process because the
transformations are executed on the way to a business user.
1.3.2. Anchor modeling
When an EDW is modeled after Anchor
modeling it is based on four elements that hold historical and time-variant
raw data in entities/tables with labels: Anchor is a surrogate key generator for a
source system business key from a data warehouse point of view. Attribute has descriptive values from
a source system connected to the business key therefore attribute has a
foreign key reference back to the anchor. The attribute table is based on
type 2 history, therefore it has also a surrogate key. A view will mix anchor
and attribute and it becomes a type 7 dimension because surrogate key of
anchor becomes a durable key, and surrogate key of attribute becomes a type 2
surrogate key. Both keys will be added to a fact, and the fact can join to a
most recent dimension to fetch the current values, or join to a historical
dimension to fetch the registered values when the fact data occurred with its
transaction date. The model in Power BI can choose the content for its
presentation interface because all data is on a silver platter. Knot is a lookup table with basic data. Tie is a relationship table between
anchor tables to handle one-to-one, one-to-many and many-to-many relationships,
therefore no foreign key in anchor or attribute table except to a knot table (»tie the knot« means getting
married). Anchor modeling has only extension
of new tables and none modification of existing tables. This ensures that
existing data warehouse applications will remain unaffected by the
evolution of the data warehouse. When a business key exists in multiple source
systems, there will be one common anchor table and several attribute tables,
because each source system has its own attribute table. Anchor modeling allows building a
real-time ETL process to a data warehouse where some attribute tables need a
more frequent update than others to provide fresh data to the business users in
a report or an online dashboard. Other attribute tables only need
updating per hour or at end-of-day. Anchor modeling diagram example [Olle Regardt and Lars
Rönnbäck] Read more.
1.3.3. Data vault modeling
When an EDW is modeled after Data
vault modeling it is based on component parts of a Core Business Concept
(CBC) as an ensemble consisting of three components that hold historical
and time-variant raw data in entities/tables with labels: Hub is used to
store a business key, Link is for a relationship between hubs and Satellite
contains the data and descriptive information. They have a LoadDate column (Load_dts, LoadDateTime,
ValidFrom) to show when the data row was entered. Each component can be drawn with its
own color. Let me elaborate with some examples: Hub (blue) separates the business
keys from the rest of the model and will translate business key to a unique hashbyte
key value. A composite
business key of
multiple columns will also become a hashbyte key column. E.g. a HubProduct
has ProductHashKey together with business key ProductNumber. A hub is an integration
point of a business key or a unique
identifier and will never change. A hub exists together with at
least one satellite. A data vault table has a RecordSource column as a reference back to a source system for
data lineage e.g. value "Dynamics365.Sales.Product" and can be a
multi-valued column with multiple sources. (Hub in danish er et samlingspunkt for én
forretningsnøgle som har data fordelt over flere satellitter der kredser om
en hub eller et link).
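A minimal sketch of loading new business keys into a hub with a hashbyte key, with schema and staging table names as assumptions:

INSERT INTO dv.HubProduct (ProductHashKey, ProductNumber, LoadDate, RecordSource)
SELECT HASHBYTES('SHA2_256', UPPER(LTRIM(RTRIM(stg.ProductNumber)))),  -- business key becomes a hash key
       stg.ProductNumber,
       SYSDATETIME(),
       'Dynamics365.Sales.Product'
FROM   stg.Product stg
WHERE  NOT EXISTS (SELECT 1 FROM dv.HubProduct hub                     -- a hub only receives new keys
                   WHERE hub.ProductNumber = stg.ProductNumber)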
Satellite (yellow) [sat] stores the context, descriptive data and measure values in columns of either a hub or a
link. A satellite is connected to one hub or one link, and a hub or a link
can have multiple satellites. E.g. a SatProduct with foreign key ProductHashKey
from HubProduct and data values for Name, Category, Target group etc.
Satellites has all the relevant data for the data warehouse. Satellite is a
history-tracking
to handle historical data values for changes (updated/deleted) at any time in a satellite, where the
primary key is composite of HashKey+LoadDate. A hash difference HashDiff
column is a checksum of all data value columns for making an easy comparison
for Kimball type 2 history, read more in section 4.3. To be 100% insert
compliant there is no columns for EndDateTime, ValidTo and IsCurrent flag,
therefore no updates of rows in a satellite. If a source system can tell that
a product has been expired or deleted then a new row is inserted into SatProduct
with a current date value in column DeletedDate. One hub or one link will
usually have several satellites associated because we will regroup data into
multiple satellites by classifications and types of data and information and
by rate of change, so each of them can have its own granularity and timeline.
Split logical groups of data into multiple satellites. With multiple source
systems the RecordSource column is helpful when data sets are similar, else
let each source system have its own satellite including a RecordSource column and share the HashKey from the hub to all its
satellites.
Employee satellites example
We start with an employee
hub with a business key column EmployeeNumber that is translated to a column EmployeeHashKey
that will be in all satellite (sat) tables:
HubEmployee(EmployeeHashKey, EmployeeNumber)
SatEmployeeBasic(SSN, BirthDate, MaidenName) constants/correction overwrites.
SatEmployeeName(FirstName, MiddleName, LastName, Gender).
SatEmployeeContact(CellPhoneNumber, Email, Skype, Twitter, Facetime, Whatsapp).
SatEmployeeAddress with columns for address or a reference to an Address table.
SatEmployeeJob(FromDate, ToDate, Department, Title).
SatEmployeeHoliday(Date, Occasion).
SatEmployeeSalary(DateOfMonth, Amount).
SatEmployeeChristmasGift(Year, NameOfGift, AmountOfGift, LevelOfSatisfaction).
The above data comes from
multiple source systems and data is added independently of each other to
the individual satellites (divide and conquer). There will be outer join from
a hub to its satellites and inner join from a satellite to its hub. More satellites Conformed and calculated
data is placed in its own Computed Satellite. Other satellites are:
Overloaded satellite, Multi-active satellite, Status tracking satellite,
Effectivity satellite and Record tracking satellite. Link (green) integrate and capture
relationship between hubs and links. E.g. a product is placed in a store, therefore
we have a HubProduct and a HubStore with data values in a SatProduct and a
SatStore, and a LinkProductStore represents the placement by combining
ProductHashKey and StoreHashKey as foreign keys from the hubs to capture in
which stores a product is placed and for sale. A link creates its own hashkey
as a unique combination of the involved hub business keys e.g. ProductStoreHashKey
from ProductNumber and StoreCode. Data values connected to a link is placed
in its own satellite, e.g. SatProductStore with primary key/foreign key
ProductStoreHashKey and data values for PlacementInStore and QuantityInStore.
When a product no longer exists in a store, the ETL process will insert a new
row in SatProductStore with the expiration date in the column DeletedDate. A link handles one-to-one, one-to-many
and many-to-many relationships because data vault has only optional many-to-many.
A link is a dataless and timeless connection among hub business keys and is
not a data-bearing relationship from ERD. A satellite on a link
represents the history of the connection or the relationship. Product Supplier
Category link example A Product table with foreign
keys SupplierID and CategoryID will often be put together into a LinkProductSupplierCategory
with ProductHashKey, SupplierHashKey and CategoryHashKey from HubProduct,
HubSupplier and HubCategory and a primary key ProductSupplierCategoryHashKey.
Data values connected to the link is placed in SatProductSupplierCategory.
Over time a product can change supplier and category which is being handled
in the SatProductSupplierCategory with a new row inserted with filled
DeletedDate and a new row is inserted into link and sat for the new data combination
with a new LoadDate. If the business one day allows a product to have
multiple suppliers then the link is already prepared for it. In the future,
the business will like to place products in stocks all over the world, and it
will be registered in a new inventory management system, that creates a HubStock
and a SatStock, and creates a LinkProductStock and a SatProductStock to
handle all the new data and relationships. Marriage example A unary/recursive
relationship (self join) e.g. LinkMarriage
represents a connection between two persons with Person1HashKey and
Person2HashKey from HubPerson with business key SSN and it gives primary key MarriageHashKey. All data is placed
in SatMarriage(MarriageHashKey, MarriedDate, DivorcedDate, WidowedDate)
that is labelled Effectivity satellite because of the columns of
dates. A validation rule in SatMarriage ensures that one person can only be married to one
person at a time else it will be marked as an error for an audit trail to
catch. When a person enters a marriage and changes last
name, it becomes a new row in SatPerson. When a marriage ends either by
divorce💔 or by death†
then a new row is
inserted into SatMarriage with filled DivorcedDate or WidowedDate. Couples
who are divorced and later remarry each other will reuse the same MarriageHashKey value from LinkMarriage for
the new row inserted into SatMarriage.
The LoadDate will be in order of the events. The DeletedDate will
be used in connection with an error detection in the source system or an
annulment of the marriage. Another modeling of marriage: SatMarriage(MarriageHashKey, Date,
MaritalStatus) with values of marital status: Married, Divorced and Widowed,
makes it easy to extend with extra values e.g. Separated, Registered
partnership and Abolition of registered partnership. When persons as partners
are living together we will model it in a LinkCohabitation and a
SatCohabitation. When persons only date we make a LinkRelationship and a SatRelationship.
Therefore each link and satellite is a flexible and agile way to divide data (agile in danish
som smidig for forandringer, let og hurtig at udvide). In a generic perspective »helicopter view« we could have started with a
general link to represent a connection between two persons with Person1HashKey and
Person2HashKey from HubPerson, and label the link a general name like LinkCouple
with a CoupleHashKey, and let the types of data create multiple satellites
for the link as SatRelationship, SatCohabitation and SatMarriage.
It is normal for a couple to start dating with data in the SatRelationship, living
together with data in the SatCohabitation. When the couple gets married, the cohabitation
timeline stops in the SatCohabitation, and the data recording continues in the
SatMarriage. It is agile to think and do data modeling in a general way and divide data.
Invoice example
An invoice becomes a
HubInvoice where business key InvoiceNumber becomes InvoiceHashKey used in
a SatInvoice with data values IssueDate, DueDate, DeliveryDate, PaymentMethod
and PaidDate. When a payment happens e.g. a week later, it will become a new
row in the SatInvoice, but I prefer to add the payment to a new satellite
called SatInvoicePaid with InvoiceHashKey, PaidDate, PaidMethod and PaidAmount
to divide data into separate satellites because of the time difference in data
capture, and it allows for payment in installments in the future. If one day
the sales department replaces the billing system then create a new
SatInvoice. An invoice involves a customer in a store, therefore a LinkInvoiceCustomer
includes InvoiceHashKey, CustomerHashKey and StoreHashKey and combines
business keys to InvoiceCustomerHashKey used in a SatInvoiceCustomer
with data values like Reference no, Requisition number and Voucher number. An
invoice line item involves a product, therefore a LinkInvoiceProduct
includes InvoiceHashKey and ProductHashKey and combines business keys to
InvoiceProductHashKey used in a SatInvoiceProduct with data values LineItemNumber,
Quantity, CouponCode, Discount (%), UnitPrice, VAT and Amount. If the quantity
in an existing invoice is corrected then a new row is inserted into SatInvoiceProduct.
If a product is replaced then for the old product a new row is inserted into
SatInvoiceProduct with filled DeletedDate and the new product is inserted
into LinkInvoiceProduct and into SatInvoiceProduct. More links A link does not have its own
unique identifier
from a source system. Other links are: Nondescriptive link, Computed aggregate link,
Exploration link and Link-to-link or Link-on-link. Address example Address will not be a Hub since the
address itself is not a CBC but a description of a CBC e.g. Customer,
Employee or Store. A CBC could be a CustomerDeliveryLocation as a hub with
a satellite and a link to Customer hub because a customer could have more
than one active location where goods can be delivered. The business key for
the hub is using a concatenation of zip code, street name, house number to
get a unique identifier for the CustomerDeliveryLocationHashKey.
More component parts
There are several other types of component parts (entity/table), for example:
Transactional-link (light green) integrates and captures relationships between
hubs and links to capture multiple transactions
that involve the same business keys e.g. many sales to same customer of same
product from same store, meaning multiple rows with the same set of keys. A transactional-link
has its own unique
identifier for a transaction (TransactionId, EventId) from a source system. Notice that a transactional-link
keeps its business key, it is not placed in an associated hub. If a
revision of a transaction is received from a source system or a counterpart
then insert the data values into the associated satellite. Since DV raw data
is a structured archive, it will be wrong to calculate a counterpart row, but
when data is extract-transform-load to a Transactional fact it is fine to
calculate a counterpart for the fact. Nonhistorized-link is for immutable data that
has no edit history, in other words, constant data that will never be changed
or deleted in a source system. Each transaction (data row) has a unique identifier (TransactionId) and there is
no revision of a transaction in the source system. When there is a
counterpart to a transaction it will have its own unique identifier and a reference
identifier back to the original transaction. Reference (gray) is referenced from a satellite
and is the same as a lookup table with basic data e.g. RefDate or RefCalendar
with many columns like Week, Month, Quarter and Year, RefMaritalStatus and
RefMarriageType with Code and Text, and RefZipCode with Zipcode, City, State,
State abbreviation for US. A Date table is called a nonhistory reference table. A satellite is normalized
to 3NF and contains only non-foreign key columns except simple reference
values. If a reference (cross-reference or lookup table) contains history then
move the old data values to a History-based reference satellite with primary
key Code+LoadDate and Text contains the previous texts. I prefer to have a
RefAddress with extra columns like State, Region, Municipality and Latitude and Longitude
(gps coordinate), Kimball page 235. Same-as-link (turquoise) [sal] indicates
that two business keys in a hub are the same thing as a logical representation
of a link, or to map different business keys from multiple source system
where each source has its own satellite. In a sal we can »merge together« differently
named business keys to a single master key that is really the same-as the other
business keys. Same-as-link is connected to a hub and is used to identify
when the same business objects are identified by multiple business key values. Hierarchical-link (silverblue) [hal] to
capture hierarchical relationship between the business concept records in
same hub e.g. parent-child hierarchy. Point-in-time (purple) [pit] is a helper
table to a hub or to a link and is calculated by the ETL process based on all
the satellites of the same hub or of the same link, because these satellites
share the same HashKey together with different values in LoadDate. A pit
table has its own LoadDate column (SnapshotDate), a column for the shared
HashKey and columns for each involved satellite’s LoadDate, therefore all the
values of LoadDate from the satellites are represented in a pit table to be
used to derive all valid versions for ease of joins of satellites and to
improve the performance and includes to find the newest or most recent
rows and to calculate EndDate or ValidTo. (A different implementation of a
pit table is to have a row for each date in the calendar or for each working
date in column LoadDate, but of course with 1000 rows in one satellite over a
period of ten years it will give millions of rows but very easy to lookup on
a given reporting date point in time as a transaction date). By adding a surrogate key sequence
number to a pit table it can through a view create a virtual Kimball type 2
dimension. pit create and pit use.
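A minimal sketch of using a pit table to join a hub and two of its satellites for a given reporting date, with the data vault table and column names as assumptions based on the employee example above:

SELECT hub.EmployeeNumber,
       nam.FirstName, nam.LastName,
       job.Department, job.Title
FROM   dv.PitEmployee pit                      -- one row per EmployeeHashKey per SnapshotDate
       INNER JOIN dv.HubEmployee hub
       ON hub.EmployeeHashKey = pit.EmployeeHashKey
       LEFT OUTER JOIN dv.SatEmployeeName nam
       ON  nam.EmployeeHashKey = pit.EmployeeHashKey
       AND nam.LoadDate = pit.SatEmployeeNameLoadDate
       LEFT OUTER JOIN dv.SatEmployeeJob job
       ON  job.EmployeeHashKey = pit.EmployeeHashKey
       AND job.LoadDate = pit.SatEmployeeJobLoadDate
WHERE  pit.SnapshotDate = '2019-01-05'         -- the reporting date point in time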
Bridge (taupe) combines (or bridges) hashkeys and optional business keys from multiple links and their
related hubs. A bridge can represent an aggregated level and can include
derived and calculated columns. HashKey Data vault 2.0 recommends for
a Hub to hash a business key value instead of using an identity column as a
unique sequence number because:
Archive of data from source systems
I find the Data vault model as an archive of structure where data from the source
systems has been divided, and where the business rules and transformations are
first enforced when data is going to the data marts. Linstedt said: »The Data Vault is for
back-end data warehousing, not for production release to business users
directly.« DV becomes a persistent staging area or an archive area (ARA).
Getting the data out
Querying a data vault requires many more joins than Inmon's model. Be aware it can be a
performance issue to have columns of data type binary(32) or char(64) for
storing millions of rows and for join many tables together compared by using
an integer data type. Joining large pit, sat, link and hub together is expensive
for performance and it is often debated by Linstedt. As far as performance goes,
normalised structures are nowhere near as performant as a dimensional
model. Nothing beats dimensional modeling for performance of queries. Data vault divides data into a
Raw area and a Business area called Business Vault that contains business
calculations and shows data at a different level of granularity by making
a new link with more or fewer hubs (and business keys) involved. Pit and Bridge
are part of the Business Vault to provide tables that give a better query performance.
Business Vaults can take many forms. It's more about provisioning data for
business users. For an ETL process from a
Data vault to a Dimensional model, think that a Hub and its Satellites
become a Dimension, and a Link and its Satellites become a Fact. Pit and
Bridge are a big help and sometimes the ETL is done by views only. When to use Data Vault? When
a data warehouse has many source systems! See
an Order line example in 3NF Model, Dimensional Model and Data Vault Model. Data vault modeling
diagram example [Dan Linstedt] Read more, extra more, more about loading, and there are many sites where
individual persons give their recommendations. An example
in a six-minute video and a longer
video. Dan Linstedt comments on my
post at LinkedIn that he is using color: green for hub, blue for link and
white for satellite.
1.3.4. Agile modeling
Inmon's modeling is based on a normalized
data model and the new data models for EDW want to break things out into
»parts« for agility, extensibility, flexibility, generally, scalability and
productivity to facilitate the capture of things that are either interpreted
in different ways or changing independently of each other. Agile modeling has only extension of
new tables and none modification of existing tables. Changes in an agile data
warehouse environment only require extensions, not modifications and no
impact to the data warehouse and it becomes quick-moving to act quickly to
easily customize the data warehouse when business changes. An ensemble modeling
pattern gives the ability to build incrementally and future changes should
not impact the existing design. Agile development is performed in an
iterative, incremental and collaborative manner. I think we will always
meet an issue that forces us to do a model refactoring or re-engineering. Read more. Inmon's modeling is a free data
model to do split of keys to map tables and data into super-sub tables, to
handle one-to-one, one-to-many and many-to-many relationships by connection
tables. With Entity Relationship data modeling you can make your own labelling
of tables in a data model, compared to fixed labels in Anchor modeling and
Data vault modeling. For Anchor modeling and Data vault
modeling the basic idea is to split (groups of) keys into their own entities
(called anchors and hubs) and split of all other data into additional
entities (non key entities called attributes, satellites), so an Anchor or a
Hub table has multiple Attribute or Satellite tables attached because:
This has advantages of independent
load jobs, data model enhancements etc. but makes it more difficult to extract
and fetch data. To reduce the complexity in queries, a proven approach is to
create (or generate) a view layer on top of the model with Join Elimination
etc. Normalization to sixth normal form 6NF
is intended to decompose table columns to irreducible components, a
hypernormalized database where essentially every column has its own table
with a key-value pair. I see 6NF as a physical level, while I see 1NF to 5NF
(you can include DKNF) as a logical level useful for design and modeling of a
database. A generic data model does not need
any changes when there is a change in the source systems. There is a Party
modeling, a 2G modeling and a Focal Point modeling and more than dozens of
data warehouse data modeling patterns that have been introduced over the past
decade. EDM = Enterprise Data Model
is a term from IBM back in 1998 for a data modeling technique for data
warehousing, where ETL was called Capture, Transform, Apply or Extract,
Clean, Conform, Deliver processes. A simple guidance for EDW:
Do not lose data (auditing/compliance/validation). Model it, so you can
understand it. Make it physical, so you can query it. Data warehouse is using a
surrogate key instead of a business key to remove dependence from a source
system, read more in section 4.2.
1.3.5. Dimensional modeling
Ralph Kimball does not like to store
data in an EDW, he only stores data in data marts that are using Dimensional
modeling, therefore the EDW becomes a
union of all data marts. Dimensional modeling ends up in a star schema or constellation schema (a group of stars, multi-star schema) with
fact tables (analysis variable, measures, events) surrounded by dimension
tables (context), where dimensions explain the facts. Dimension and fact
conformance is a must in a successful data warehouse implementation to
meet the requirements of legislation, accepted practices, prescribed rules
and regulations, specified standards and terms of a contract.
Star schema
Entity Relationship Diagram ERD with five one-to-many relationships. Date dimension is a type 0 with original
values and a dimension key in column Date_key of data type date for the primary
key of the dimension.
Employee dimension is a type 1 with most
recent, current values and a dimension key in column Employee_key as an
artificial auto-generated unique sequence number or an identity column as a
surrogate key for the primary key of the dimension.
Store dimension is a type 1 with most
recent, current values and a dimension key in column Store_key as a unique
sequence number as a surrogate key.
Product dimension is a type 1 with most
recent, current values and a dimension key in column Product_key as a unique
sequence number as a surrogate key. Other columns create a hierarchy: Category→Brand→Label.
Customer dimension is a type 2 to
keep history in multiple rows or records when a customer changes name or
city, and a dimension key in column Customer_key as an artificial
auto-generated unique sequence number or an identity column as a surrogate
key for the primary key of the dimension. Customer_key values are unique
for each row even though the same customer is standing in several rows e.g.
CustomerId 421 because the customer has changed her name and city several
times. Column CustomerId is a business key and is the »glue« that holds the
multiple rows or records together for a customer even though data change over
time. Columns
ValidFrom and ValidTo represents the span of time when a row was the »current
truth« as a time window for a customer in a given state.
The Customer dimension type 2 does not have here a column IsCurrent with two
values: 0 for historical and 1 for current to mark each data row; instead the
column ValidTo with value 9999-12-31
represents the most recent, current values. Sales fact with five dimension key
columns for five dimensions where Date_key represents the
date of purchase,
and with columns for two measures or metrics. (Other names Fact_Sales or FactSales
or fact.Sales).
Since I keep history of customers' names and cities in
the Customer dimension as a type 2, the Sales fact tells us the name of a
customer and where a customer lived at
the date of purchase, when I join a fact row column Customer_key value to
the Customer dimension key column Customer_key value for further information
about the customer. For example, at Sales fact column Date_key row value
2014-04-25 has Customer_key value 4, which I lookup in Customer dimension to
see it is the customer Marie that at that time had last name Lesauvage and
was living in Paris when she purchased the product Blou jeans from the employee
Michael Suyama in the store Boutique at Rue Cler 52, Paris. I call them registered values at the date of purchase, and the data is factual as it was back then, because the Customer dimension is a type 2 that retains data as they were at the transaction date of a sales event back in time. (In danish »registreret oprindelige værdi«). For example, the Sales fact row with Date_key value 2014-08-16 has Customer_key value 5, which I look up in the Customer dimension to see it is the customer Marie that at that time had last name Sainte and was
living in Nice when she purchased the product Passion no 9 from the employee Michael
Suyama in the store Boutique at Rue Cler 52, Paris. When I want to write a letter to customer Marie, I
will use her current values last name
Sainte and that she lives in Nice. Customer dimension rows with ValidTo value 9999-12-31 represent the most recent, current values at present time for a type 2 dimension. (In danish aktuelle, nuværende, gældende, seneste værdi). I recommend using SQL views upon a type 2 dimension for
a current view and a registered view e.g. on top of the Customer
dimension type 2 table in join with the Sales fact table. Dimension key
columns in a fact table e.g. Customer_key and Product_key do not tell whether
the dimension is a type 1 or a type 2, but when I join from a fact table
dimension key e.g. Customer_key to a dimension view called either DimCustomer_Current
or DimCustomer_Registered, I choose which dimension values I
want to fetch and see. ·
When I want to write a letter to a customer, I will
use the current values, e.g. Address from the current view DimCustomer_Current. ·
When I want to make a statistic of where the
customers lived at date of purchase, I will use the registered values, e.g.
City as a derived column of Address from the registered view
DimCustomer_Registered. In
the above Entity Relationship Diagram Customer
dimension is type 2 and the other dimensions are type 1, and I recommend having a SQL view for each dimension, e.g. DimDate, DimEmployee_Current,
DimProduct_Current, DimStore_Current. I recommend a fact view FactSales, therefore a
business user will never access a table in a database, only access data through views. Read more examples of type 1, type 2 and type 7 with most recent, current values in section 6.2.
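As an illustration of the recommendation above, here is a minimal sketch of the two views; the physical table dim.Customer and its columns are assumptions based on the example, while the view names come from the text:

CREATE VIEW dim.DimCustomer_Registered AS
-- every historical row; the fact row's Customer_key points to the row that was current at the date of purchase
SELECT Customer_key, CustomerId, FirstName, LastName, Address, City, ValidFrom, ValidTo
FROM dim.Customer;
GO
CREATE VIEW dim.DimCustomer_Current AS
-- same keys as above, but the descriptive values are taken from the current row of the same customer
SELECT h.Customer_key,
       c.CustomerId, c.FirstName, c.LastName, c.Address, c.City
FROM dim.Customer AS h
JOIN dim.Customer AS c
  ON  c.CustomerId = h.CustomerId
  AND c.ValidTo = '9999-12-31';

A query can then join FactSales.Customer_key to either view: DimCustomer_Registered gives the name and city at the date of purchase, and DimCustomer_Current gives the name and city of today.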
Improvement expansion

Calculation of Amount in the Sales fact uses a Price from the Product dimension, and in case of an offer price it could be a good idea to include Price, Discount and OfferPrice in the Sales fact, because the Product dimension as a type 1 will overwrite changes of prices and only show the current price of a product. Shipments Star Schema from Microsoft
Polaris data
mart database Constellation schema or Multi-star
schema has multiple facts and shared dimensions From a real implementation by Joakim
Dalby in 2021 based on 65,000 devices/end-points/clients and 34,000 software titles (ARP string for Add/Remove Programs) at hospitals in the Copenhagen area of Denmark, based on four source systems: AD, SCCM/MECM, ServiceNow and Software
Central. Dimensions are type 1 with current values where old data is
overwritten and deleted data in a source system will be marked with »true«
value in the dimension column IsDeleted_meta as metadata. Facts are based on
»capture a relationship in the fact« e.g. fact_Client, and based on derived
facts and snapshot facts. Read more in section 3.1. Snowflake
schema (I'm not a fan of the term) The levels in the hierarchy split a dimension into several normalized dimension tables, so that not all dimension tables are connected to the fact data table. When a hierarchy is divided into several dimension tables, they can be individually linked to different fact data tables, in the same way as a Constellation schema or Multi-star schema with multiple facts sharing dimensions. Naming
rules There are many naming rules e.g.
dimension tables are singular and fact tables are plural to differentiate between dimension and fact. Add a role to the name of an object by a prefix e.g. DimCustomer or Dim_Customer and Fact_Sales or FactSales or
by a database schema e.g. dim.Customer and fact.Sales. Sale is the selling of goods or
services, or a discount on the price. Sales is a term used to describe the
activities that lead to the selling of goods or services. Goods are tangible
products as items you buy, such as food, clothing, toys, furniture, and
toothpaste. Services are intangible as actions such as haircuts, medical
check-ups, mail delivery, car repair, and teaching. A sales organization can
be broken up into different sales teams based on regions or type of product. 1.3.6.
Data warehouse architecture and ETL modeling Three business intelligence enterprise
data warehouse modeling architectures. · Kimball said: »A data
warehouse is a union of all its data marts.« illustrated by the dashed line
as a Data mart area (DMA). · Inmon has an Enterprise
Data Warehouse (EDW) illustrated by Data warehouse. · Linstedt has a Raw data
vault area and Business data vault area. Sources represent either Landing
zone area or Input data area or Archive area. It is common for all models to have
a Data staging area (DSA). Enterprise Data Warehouse (EDW) is
sometimes a term for the whole solution and the different models each use
their own terms e.g. Data warehouse or Business data vault. EDW is a common
database layer or area before the Data mart area (DMA) that is illustrated as
several databases each with their own specific purpose and tailored
dimensions and facts that are business oriented. I like to have another data staging
area between EDW and DMA (not illustrated) for the ETL process and it can be
placed inside a DMA database as a local data
mart staging area (DMSA). Sometimes a data mart is implemented as sql
views which refer to an EDW database as a virtual data mart. Illustration of a global DMSA. A data mart area (DMA) has a Presentation interface (PIA) in front, above it is called BI-Apps. Sometimes business users want sql
query access to dimensions and facts but they don’t like to join them
together, therefore I implement an Analysis data area (ADA) (or Advanced data analysis or Automated data analysis) with either sql views that refer to DMA, where a view has already joined a fact and its associated dimensions together including relevant columns, or materialize each view into a table as a One Big Table (OBT) that an ETL process keeps up to date. Of course I have to make decisions between using the latest (current)
dimension values, or the registered dimension values when the fact data was happening,
occurred, registered or took place with its transaction date or was entered
into the fact, or point in time values. [Analytical Base Table (ABT)]. I have used the general word data
mart for a database with tables of dimensions and facts, but there is a more
narrow word Multidimensional database
(MDB) that is the source database for an OnLine
Analytical Processing (OLAP) application like a cube also known as a Multidimensional online analytical
processing (MOLAP) application. For me, it is a principle for a cube that all
dimensions and all measures can be combined freely, otherwise divide data into
multiple cubes. Same principle for a Tabular
model. SQL (Structured Query Language) is a
query language for database model. MDX (Multidimensional Expressions)
is a query language for OLAP model. DAX (Data Analysis Expressions) is a
query language for Tabular model. Read more about differences of opinion. ETL process stands for Extract, Transform, Load, see a
model, and ETL exists between
the data areas (data layers) of a data warehouse solution, e.g.: From each data area (data layer)
there will be extract criteria to limit data, and there will be transform
to make data conform and calculate new data like KPIs, and data will be loaded into the next area. It is a common way of handling data in a multitier
architecture or layered architecture (in danish lagdelt arkitektur; extract
for tilvejebringe data ved at udtrække, uddrage eller udlæse data; transform
for at transformere, bearbejde, behandle, beregne eller berige data; load
for at levere data eller indlæse data til det næste område). 1.3.7.
Data mart modeling A Data mart area (DMA) will consist of several data mart databases. Data is fetched from a Data staging area (DSA) or an Enterprise Data Warehouse (EDW) through an ETL process which may contain a local data mart staging area (DMSA). Sometimes a data mart is
implemented as sql views which refer to an EDW database as a virtual data
mart. Each data mart database has a
specific purpose area data for tailored support and can be characterized as
one of the categories below: Common mart With loaded tables of data from DSA
or EDW. With common dimensions to be reused
in the other data marts. Examples of dimensions are Date, Time, Employee, Organization,
Product, Retailer and TransactionType. Can also include role-playing
dimension e.g. Manager based upon Employee and different date dimensions with
unique names of columns. With common facts to be reused in the
other data marts. Examples of facts are Sales, Inventory, Marketing, Finance,
Employment and Transaction. Contains star schema/constellation
schema and other modeling. Subject
mart With loaded tables of data from DSA
or EDW. For example, Sales mart, Customer
mart, CRM mart, Churn prediction mart, Market mart, Production mart,
Inventory mart, Shipment mart, HR mart, Tax mart, Credit risk mart, Fraud
detection mart, ESG Environmental, Social, and Governance mart. Data can be a subset of common mart and
data can be further enriched. Contains star schema/constellation
schema and other modeling. Analytical mart With loaded tables of data from DSA
or EDW. With views upon common mart, subject
mart or EDW to fetch relevant data. Making conformed columns for star
schema/constellation schema to provide data to OLAP cubes or Tabular cubes or
to dashboard visualization tools like Tableau, QlikView/Qlik
Sense and Power BI. The data mart can be used for analytics by business users and data scientists to do ad hoc sql queries. Reporting mart With views upon common mart, subject
mart or EDW to fetch relevant data. Making conformed columns for star
schema/constellation schema to provide data to reports on paper, in pdf files
and other display formats. Stored procedures with criteria
parameters inputs for example: FromDate, ToDate, Customer, Branche, Product, SalesDevice,
TransactionType will fetch the relevant data from the views and present data
to a reporting tool used by business users like Reporting Services (SSRS) or
Microsoft Access with pass-through query to call a parameterized stored procedure.
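A minimal sketch of such a parameterized stored procedure; the view report.SalesReport and its columns are only examples:

CREATE PROCEDURE report.GetSales
  @FromDate date,
  @ToDate   date,
  @Customer nvarchar(100) = NULL  -- NULL means all customers
AS
BEGIN
  SET NOCOUNT ON;
  SELECT SalesDate, CustomerName, ProductName, Quantity, Amount
  FROM report.SalesReport
  WHERE SalesDate BETWEEN @FromDate AND @ToDate
    AND (@Customer IS NULL OR CustomerName = @Customer);
END;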
With tables for Regulatory requirement return report (in danish indberetning) for financial reporting,
because it is desirable to be able to recreate an old report both as it
actually looked at the time of creation and as it should have looked given
corrections made to the data after its creation. All tables have a VersionId to be used for sql joins of tables to ensure all data rows belong to the same accounting. Delivery mart With views upon common mart, subject
mart or EDW to fetch relevant data to provide data and to create data in a file
format that is customized to the individual data recipient as another IT
system internally or externally. Discovery
mart or Exploration mart or Self service mart With tables of raw data from source
systems, or with views upon an archive area or views upon EDW or views upon
data mart area to some business users to do data discovery by sql statements.
It is a sandbox where users also have write access and can create their own tables for staging data, mapping data, search criteria data etc., and for the final result data set to be exported to an Excel sheet or to be connected to a
Power BI report and so on. The purpose is to prepare a good business
requirements specification and a user story for extensions to the data
warehouse solution. Data mart area (DMA) has several databases
for each mart and sometimes there is no need to store data in a mart database; instead the mart database contains virtual star schema non-materialized views
that look like dimensions and facts in a virtual data mart. Remember to set a process for user
management, permissions and authentication etc. to manage and monitor and
reduce the risk of cross contamination of data to ensure that a business user
only has access to view relevant data for analyses and reports. 1.4.
Data reconciliation One part of a data warehousing
project is to provide compliance, accountability, and auditability. After
data capture, please remember to implement a Reconciliation Summary Report with the results from your recon
e.g. Uncategorized assets, Invalid country codes, Derivatives transactions
with missing currency code. Audit trail (in danish kontrolspor)
becomes important for the credibility of a data warehouse. An example from a
source system that has a value 0.003589 and export it to a txt file where the
value becomes 3.589E-3 in the scientific notation and by a mistake in the ETL
process the data warehouse saved and showed the value as 3.589. A contract
number 700002848572 becomes 7.00003E+11 and the last part of the value got
lost. When reconciliation is built-in the data model and the ETL process,
this mistake would be reported and a programmer can fix the import and update
his data profiling documentation. A classic reconciliation is to weigh
the truck before it leaves and weigh the truck when it arrives at the
destination to make sure that no load has been lost on the ride. Do reconciliation between data warehouse and source systems with reconciliation of row count and sum of values, mark data as reconciled and do an auditing report. (to reconcile in danish at afstemme, stemmer overens). Log RowCount of target rows before a process and after a process to measure the change of rows a process does (TabelRowCountBeginAt and TabelRowCountEndAt), RowCount of source data rows
and extracted rows based on join and criteria, RowCount of target deleted
rows, updated rows and inserted rows/loaded rows successfully or rejected
rows with missing value or out of bounds amount or other kind of invalid data, depending on the validation check of data quality etc. For a sum of values you can reverse the sign in every other row by multiplying each value by -1 to avoid the sum becoming larger than the data type allows, e.g. in sql:

;WITH InvoiceItemSummarized AS
(
  SELECT InvoiceItemAmount,
         Sign = IIF(ROW_NUMBER() OVER(ORDER BY InvoiceItemId) % 2 = 0, 1, -1)
  FROM InvoiceItem
)
SELECT SUM(InvoiceItemAmount * Sign) AS InvoiceItemAmountSummarized
FROM InvoiceItemSummarized
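As a supplement, a minimal reconciliation sketch that compares row count and sum of values between a source extract and the loaded target; the table names stage.InvoiceItem and fact.InvoiceItem are only examples of a landed source copy and the target table:

SELECT s.SourceRowCount, t.TargetRowCount,
       s.SourceAmount, t.TargetAmount,
       CASE WHEN s.SourceRowCount = t.TargetRowCount AND s.SourceAmount = t.TargetAmount
            THEN 'Reconciled' ELSE 'Deviation' END AS ReconciliationStatus
FROM (SELECT COUNT(*) AS SourceRowCount, SUM(InvoiceItemAmount) AS SourceAmount
      FROM stage.InvoiceItem) AS s
CROSS JOIN
     (SELECT COUNT(*) AS TargetRowCount, SUM(InvoiceItemAmount) AS TargetAmount
      FROM fact.InvoiceItem) AS t;

The result row can be saved in a reconciliation log table and shown in the Reconciliation Summary Report mentioned above.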
How else can we say we are compliant and ensure compliance without data governance, data management and
data reconciliation, data quality and data lineage mixed with documentation
in a bank to meet Basel I, Basel II, Basel III, Basel IV, EDA, HIPAA,
Sarbanes-Oxley and BCBS239 (The Basel Committee on Banking Supervision 239)
together with a regulatory
requirements return report. BCBS239 paper
and Explanation and elaboration of Data Governance for BCBS239 to be compliant for e.g. the Danish
Financial Supervisory Authority (Danish FSA) (in danish Finanstilsynet). 1.5. Data quality The purpose of data quality is to ensure that business users trust and have confidence in data to achieve reliability and relevance. Defining and
implementing processes to measure, monitor and report on data quality, measurement
results of data quality performed on a data set and thus for judging a data
set, and hopefully to improve the quality of data. The specific
characteristics or dimensions of data that are analyzed in a data quality
program differ from one business to another based on the needs and
priorities of that business. We must be able to rely on the accuracy of our
data on which we base decisions (in danish »Vi skal kunne stole på
korrektheden af vores data, som vi baserer beslutninger på.«) Doing
data quality increases believability, credibility and reliability (in danish
troværdighed, pålidelighed). Data quality skills and data cleansing support a better data discipline and help to avoid violations of data quality rules. The following dimensions are commonly used as data controls with the goal of ensuring data quality, and where one means is conformance or conformity to make data from multiple source systems conformed through data conformation as a part of the ETL process. Just like the term dimension refers to the measurement of physical objects, e.g. weight, height, length and width, a data quality dimension tells what should be measured and reported on for a data set. A data quality dimension defines a set of attributes and
characteristics that represent an aspect of the data quality. The following twelve data
quality dimensions are significant in relation to assessing the
quality of data and to declare a data set with dimensions for
data quality. They can be used to cover the need to be able to document that
the content of data in a data warehouse is correct: Timeliness and Availability (aktualitet og
tilgængelighed) Where data is available and accessible when needed by the
business. Timeliness is a measure of time between when data is expected
versus made available. E.g. a business must report its quarterly results by
a certain date, the delay between a change of a real world state and the
resulting modification of the data warehouse, and a customer contact information
is verified at least once a year and it is indicated with a verification
date that data is up-to-date. Another data quality rule can be that max 80%
of the active companies must have annual accounts that are not older than 15
months old. Timeliness depends on the update frequency of the data set in
relation to how often the data changes. The data set is updated real-time,
daily, weekly, monthly, quarterly, semi-annually, yearly. Currency measures
how quickly data reflects the real-world concept that it represents. Timeliness is also about how frequently data is likely to change
and for what reasons. Mapping and code data remains current for a long period. Master data needs to be up-to-date in real-time. Volatile data remains
current for a short period e.g. latest transaction or number of items in stock
can be shown with an as-of time to tell business users that there is a risk that data has changed since it was saved/recorded. Typically during a day data
will be changed several times and in evening and night data will remain
unchanged. Latency tells the time between when data was created or changed
and when it was made available for use by the business users. Kimball talks
about low latency data delivery can be very valuable for a full spectrum of
data quality checks (Kimball page 261, in danish reaktionstid, too fast fact
data delivery can cause inferred members, read more in section 4.5). Completeness (komplethed, fuldstændighed) Where a data row has the values that are required to avoid missing values e.g. an order requires a customer number. Therefore data
quality must check if some columns are empty (null, empty string, 0 not
expected). The ETL process must be robust and must not cause a system crash or a system error at an unexpected null value; instead the data row must be sent to a Wrong table which the operations team monitors and reports on. When adding
new data rows, updating or deleting rows, it can be checked whether the number of
rows corresponds to the expected number of rows e.g. between 100 and 130 new
orders per day. The number of new daily transactions does not decrease by 10%
compared to the average number of transactions in the last 30 days (or does
not exceed 10%). Keep an eye on the amount of data, the quantity or volume of
available data is appropriate. RowCount of source data rows and
extracted rows based on join and criteria. RowCount of target deleted rows,
updated rows and inserted rows/loaded rows successfully or rejected rows with
missing value or out of bounds amount or other kind of invalid data depends
of the validation check of data quality etc. Read more in section 1.4 Data
reconciliation. Completeness is also about a data row where all the necessary
data is present e.g. an address must include house number, street, zip code
and city, and an order must include product, price and quantity. Incomplete
data e.g. missing one hundred thousand data rows compared to last
payload, an address does not include a zip code or an email address is
missing the domain (gmail.com) makes the address not usable. A data quality goal could be 98% usable addresses, and the failing data rows must be marked to be taken care of afterwards. Over-completeness means that the set of rules is too restrictive and value combinations that are actually valid are unjustly excluded.
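A minimal sketch of two completeness checks based on the order example above; the table dbo.[Order] and its columns are only examples:

-- rows with a missing required value
SELECT COUNT(*) AS MissingCustomerNumber
FROM dbo.[Order]
WHERE CustomerNumber IS NULL OR CustomerNumber = '';

-- expected number of new rows per day, e.g. between 100 and 130 new orders
SELECT CASE WHEN COUNT(*) BETWEEN 100 AND 130 THEN 'OK' ELSE 'Check the load' END AS RowCountCheck
FROM dbo.[Order]
WHERE OrderDate = CAST(GETDATE() AS date);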
Uniqueness (entydighed) Where data identifies one and only one entry. Unique data means that there is only one instance of a specific value appearing in a data set,
so it is free from data duplication, e.g. a Social Security number
(SSN) to ensure that each person has a unique Id, therefore duplicate SSN values are not allowed within the data
set. Unambiguous value for having one meaning and avoid ambiguous values at
same level in a dimension or need to combine two values to one unique data.
Duplication where customer has addresses spelled in different ways is not
credible. What about Bob and Bbo, a mistyping, a typo or a misspelling? What about Daniel and Dan that may well be the same person? Investigate whether look-alike values are duplicates or genuinely unique.
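A minimal sketch of a uniqueness check based on the SSN example above; the table dbo.Person is only an example:

SELECT SSN, COUNT(*) AS NumberOfRows
FROM dbo.Person
GROUP BY SSN
HAVING COUNT(*) > 1;  -- duplicate SSN values that must be investigated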
Validity (validitet, gyldighed) Where data is valid within a range of values, e.g. an age of a person cannot contain a negative value and cannot be higher than 125 years.
When an age is under 18 then marital status must be not married. A validation
rule in a time registration system could be that hours worked must be between
0 and 168 per month as a plausibility range. The average price of this month
must not differ from last month’s price by more than 50%. A valid data value can also be checked with a regular expression pattern when a column of text has to be validated e.g. phone number pattern: (999) 999–9999 where we want the hyphen to be stored in the column. A validation rule can be characterized as: in-column,
cross-column, in-row, cross-rows, cross-data set, like a cross-column
validation is about certain conditions that span across multiple columns must
hold e.g. a patient’s Date of Discharge from the hospital cannot be earlier
than the Date of Admission, or a Delivery date of an order cannot be less
than its Shipment date. Validation ensures that data is logical. Type of
validation rule: equality, inequality, logical rule mixed with range checks,
bounds fixed or bounds depending on entries in other columns. A Due date
within an acceptable range and not ten years wrong, else we have invalid
data that should be marked as rejected to be corrected. It is also
important that a data row indicates the period in which it is valid.
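A minimal sketch of a cross-column validation based on the hospital example above; the table dbo.HospitalStay and its columns are only examples:

SELECT *
FROM dbo.HospitalStay
WHERE DateOfDischarge < DateOfAdmission;  -- invalid rows to be marked as rejected and corrected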
Conformity (overensstemmelse) Where data is stored in a column with a business-wide-understood, user-friendly name and where data is in the required format and following a standard data definition. E.g. an easy name of a column for an invoice amount that excludes VAT (pre-VAT) could be InvoiceAmount_ExclVAT and for the amount including VAT could be InvoiceAmount_InclVAT. E.g.
formatting telephone numbers with or without country code, where a control has to make sure that all phone numbers have the proper number of digits and a format to ensure conformance. Format of a date as yyyy/mm/dd or dd-mm-yyyy, please
choose to have only one format in the data warehouse. How to present a workload per day of 7½ hours, do you like 7:30 or 7.50? Uniformity (ensartethed) Where data is stored in a column with a data type and size and where a value has the right unit.
For a data warehouse with data across domains from multiple source systems
it is important to define and use the same data types and sizes. E.g. a
decimal number in Europe is using comma before decimal (3,14) and US is using
period (3.14). How many digits in an amount e.g. data type decimal(19, 4) is
a popular choice for e-commerce but Bitcoin is using decimal(24, 8). How
large is the sum of an amount column over many rows going to be? A data value is often represented with a specific unit, but seldom can the unit be seen in the name of a column or in an extra column with the unit as text. Unit examples: a weight value in pounds or in kilos, a
length value in centimeters (cm) or in feet (ft) or a distance value in
kilometer (km) or miles (mi) or light year (ly), an amount value has a
currency like USD, EUR, DKK and a temperature value was measured in Celsius ºC
or Fahrenheit ºF. Consistency (konsistent) Where the same data from two or more source systems should not
conflict with each other e.g. when a product is discontinued, there should
not be any sales of the product. A zip code is right and has a cityname or a
data mapping like DK for Denmark and »Cph, Kbh, København« as
Copenhagen to be consistent for the capital of Denmark. Code values must
follow different standards e.g. ISO Currency Code USD, EUR, GBP, DKK, or LEI
Legal Entity Identifier issuer code, or NACE codes and CIC codes. Improve
values that take care of case-sensitive words through translate, mapping or
bridging rules to create one and only one truth. Three data sets referring to
the same time period, e.g. one includes data for females, one for males and one for total, the consistency between the results of the three data sets can be
checked. Inconsistent data stands for conflicting data or when
the same data is not the same across source systems. An address is the same
with look-alike e.g. »Fifth Avenue« and »5th Ave«, or the name of a product
with two kinds of spellings e.g. Ketchup and Catsup. Sometimes a user types in extra data, e.g. in an address column there is added a direction »on the
corner of Fifth and Main«. Hopefully a data cleansing procedure can catch it.
Inconsistent data gives problems in any database and data warehouse. Integrity (integritet, pålidelighed) Where constraints are a not null column, a primary key, and a foreign key referential integrity where data states that all its references are valid. Relationships between data in different systems are maintained with accuracy and consistency so data relationships are valid and connected. Wrong data type for a data value like a too big amount or a too long text string. Allowed values like dates, weekdays and week numbers are constant data. Accuracy (faktuelt, kendsgerning, nøjagtighed, korrekthed) Where data represents what it is intending to represent in the
real world e.g. a Date of Birth (DOB) and age of a person is trusted or an
address (Mailing Address or Residence Address because they are not the same
for some people) is the correct one by performing a sample measurement, or
check and compare an address with another data set known to be accurate e.g.
a National Address Database or the openaddresses.io to make a degree of
which data correctly describes the real world object or event. A new product
name may not have the same name or similar sounding to a competitor's
product. A new product claims to weigh a specific amount, and the weight is
verified in an independent test. Inaccurate or incorrect data is
poison for a data warehouse, therefore it is good to set up some rules to find inaccurate values and mark them to be taken care of. Semantic
accuracy must ensure consistency between the content of the data
recorded and actual conditions (content accuracy). Syntactic correctness must
ensure that data complies with syntactic rules e.g. spelling and formats e.g.
a date or an xml structure (form accuracy, syntax and schema validation). A
nurse entered the Date of Birth of all the patients in the format of dd/mm/yyyy
instead of the required format of mm/dd/yyyy. The data passed the system
validation check as the values are all within the legal range. However they
are not accurate. Reasonability or Outliers (rimelighed eller afvigelser) Where data is based on comparison to benchmark data or past
instances of a similar data set e.g. sales from the previous quarter to prove that data is reasonable or data has outliers (deviations) with data
point or data value that are completely off. It is a part of the reconciliation of data
and of a data set. Confidentiality and Security (fortrolighed og
sikkerhed) Where data will be maintained according to national and
international standards for data e.g. GDPR, where access to data is controlled
and restricted appropriately to maintain its security and how long it is
retained. Clarity
and Usefulness (klarhed, anvendelighed, forståelighed, genbrugelighed) Where to assess the extent to which data is understandable and can be read and used by others without any misunderstanding of the data and without difficulty. Presentation quality and usability cover readability, understandability, relevance, accessibility, maintainability and credibility. Availability of documentation and columns of metadata in a data set. Reports must be accessible and communicate information in a
clear and concise manner. Reports should be easy to understand yet
comprehensive enough to facilitate informed decision-making. Reports
should include an appropriate balance between data, analysis and
interpretation, and qualitative explanations. Reports should include
meaningful information tailored to the needs of the recipients. Monitoring (tilsyn) the twelve data quality dimensions Monitoring the quality of the data and reporting it to the data owner is part of being compliant. Being aware of the quality of your data can be
termed as data quality but sometimes it is misunderstood that this will
improve the actual quality of data. The thing it does is to improve the
awareness. Data quality measurements are established on the basis of the
importance and use of data, respectively. Data should be monitored to ensure
that it continues to meet requirements. A data quality indicator is called a metric, and several metrics can lie behind one data quality dimension. It must
be possible to measure each data quality dimension with a limit value range
for the data quality rule to indicate whether the measurement is:
A data quality measurement should preferably be performed
automatically and data quality measurements should be performed after each
ETL process. The result of metrics of data quality measurements must be
stored and documented so that they can be exhibited to the data owner and
data users, especially with which measurements exceed limit values. The
measurements must be able to be included in an overall reporting and
follow-up on data quality. Do alert the data owner and data users by email. Auditing is a kind of
inspection (in danish eftersyn). Remarks
There are two types of strategies for improving data quality: Process-driven and Data-driven. Process-driven performs better than Data-driven in the long run, because it removes the root causes of the quality problems completely. Vilfredo Pareto from Italy made in 1906 the 80:20 rule or
principle built on observations of his such as that 80% of the wealth in
Italy belonged to about 20% of the population. The same ratio (in danish forholdstal) often appears in data quality: 20% of a data set generates 80% of
the errors in data. Corresponding to that 20% of the goods generating 80% of
the earnings, which means that the other 80% of the goods generate only 20%
of the earnings. 1.6.
Data lineage and Data provenance Data lineage is the journey data
takes from its creation through its transformations over time. Data lineage
can tell us the source of data, where did the data come from and what
happened to the data and what transformations took place, and where data
moves to over time. It describes a certain data set's origin, movement,
characteristics and quality, and what happens to it and where it moves over
time. Data lineage (in danish afstamning, hvor stammer data fra eller hvor
kommer data fra samt hvor anvendes data) gives traceability (sporbarhed)
and visibility while greatly simplifying the ability to trace errors back to
the root cause in a data analytics process. Lineage is about achieving
traceability of data to make sure that at any given data point one can easily
find a way back to its origin. Data lineage is an end-to-end mapping
of upstream and downstream dependencies of the data from capture or ingestion
to analytics including which reports and dashboards rely on which data
sources, and what specific transformations and modeling take place at every
stage. Data lineage also includes how data is obtained by a transformation.
Data lineage must ensure that there is traceability from where data is
created and where data is subsequently used. Data lineage is both at table
level and column level. When a data warehouse is using data
from other source systems, inhouse or external, a data lineage can be
implemented by tagging data rows with metadata in a column like RecordSource from Data vault to
indicating where the data originated. When data is staged and present in a
target place in a data warehouse, data lineage is also important to inform a business user where data is coming from. If an organization does not know where data comes from and how it has changed as it has moved between systems, then the organization cannot prove that the data represents what they claim it represents. In addition to documentation, data
lineage is also used in connection with change management so that a data
steward can determine which IT systems and solutions will be affected and how
they will be affected. Data provenance focuses on the
origin of the data aka the source system, which could be a screen from a host system where users type in data values after a phone conversation with a customer, or a customer filling in a web page. It is the original data capture. Data
provenance is responsible for providing a list of origin, including inputs,
entities, systems, and processes related to specific data. It is always good to know the latest date of update and who made the update. If you want to drive real value from
your data, you must first understand where data is coming from, where data
has been, how data is processed, how data is being used, and who is using
data. That’s what data lineage and data provenance are all about: creating transparency. 1.7. Documentation My experience is that documentation
of a data warehouse system takes at least π (Pi) times as long as the
development time. Tasks like writing explanations, reasons and arguments, set
up tables, drawings, figures and diagramming, show data flow and data lineage
and make a data dictionary, and last but not least proofreading. Followed by
review by colleagues and subsequent adjustments and additional writings.
Documentation is an iterative process and in a good agile way a programmer
should program in the morning and document it in the afternoon. There are many tools e.g. Erwin or Collibra or Sqldbm and lineage tools to show that a change in one system has effects in another system. 1.8.
GDPR For EU GDPR General
Data Protection Regulation (Persondataforordning) a business key like a Social
Security Number needs to be anonymized after some years by adding a calculated cross sum of the SSN and hashing it. It will be a nice scramble or encryption by a scrambling algorithm. To be sure of total anonymity only save the Date of Birth and Gender of a person; no name, no address, no zip code and no city.
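A minimal sketch of such scrambling; the table dbo.Customer, the column SSN_Scrambled, the retention criteria and the fixed salt (standing in here for the calculated cross sum) are only examples:

UPDATE dbo.Customer
SET SSN_Scrambled = CONVERT(char(64), HASHBYTES('SHA2_256', CONCAT(SSN, 'SomeSecretSalt')), 2),
    SSN = NULL  -- remove the original business key
WHERE CreatedDate < DATEADD(year, -5, GETDATE());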
A Master Data Management (MDM) database can contain social security number, name, address together with
CustomerId that is used in emails to customers. Therefore we have a surrogate key
CustomerInternalId that is used in all other operational systems and data
warehouse together with BirthDate and Gender which is not personally identifiable
or enforceable. A web service will provide all systems with the personal data from the MDM database. When a customer is going to be deleted because it is his/her wish or because of the data limitation period, we only need to do anonymization (the way of deleting data) in the MDM database and we can keep
CustomerInternalId, BirthDate and Gender in the other systems to make sure
statistics remain unchanged back in time. When a system is calling the web
service it will give back unknown for SSN, name and address when a customer
no longer exists in the MDM database. If we don’t have an MDM, we must do anonymization in all data
layers and data areas and in relevant source systems as well, and flag data
with an IsAnonymized column. Personal Information (PI), Personally
Identifying Information (PII), Sensitive Personal Information (SPI) is data
relating to identifying a person, read more. 1.9.
DataOps and Data governance and Data management DataOps is how to organize your data
and make it more trusted and secure with tools like:
Read more about DataOps,
DataOps is not DevOps for data and DataOps principles. The Future of Data Management. The purpose of data governance is to
provide tangible answers to how a company can determine and prioritize the
financial benefits of data while mitigating the business risks of poor data.
Data governance requires determining what data can be used in what scenarios
– which requires determining exactly what acceptable data is: what is data, where
is it collected and used, how accurate must it be, which rules must it
follow, who is involved in various parts of data? According to the DAMA-DMBOK: Data
Management Body of Knowledge, 2nd edition: »Data governance is the exercise
of authority and control (planning, monitoring, and enforcement) over the
management of data assets.« This definition focuses on authority and control
over data assets. The important distinction between data governance and data
management is that the DG ensures that data is managed (oversight) and DM
ensures the actual managing of data to achieve goals (execution), in short
that DM ensures that data is compliant and ensures compliance. According to Seiner: »Data
governance is the formalization of behavior around the definition,
production, and usage of data to manage risk and improve quality and
usability of selected data.« This definition focuses on formalizing behavior
and holding people accountable. [Robert S. Seiner. Non-Invasive Data
Governance: The Path of Least Resistance and Greatest Success.] If data
management is the logistics of data, data governance is the strategy of data. The traditional, centralized,
data-first approach to data governance represents a top-down, defensive
strategy focused on enterprise risk mitigation at the expense of the true
needs of staff who work with data. The Data Governance Office develops
policies in a silo and promulgates them to the organization, adding to the
burden of obligations on data users throughout the enterprise who have little
idea of how to meet the new and unexpected responsibilities. The modern, agile, people-first
approach to data governance focuses on providing support to people who work
with data. People are empowered but nonetheless, expected to contribute to
the repository of knowledge about the data and follow guidelines, rather than
rigid, prescriptive procedures. Governance can be divided into three parts: Policies describe what to do and what not to do, and guide the way people make decisions. A policy established
and carried out by the government goes through several stages from inception
to conclusion. Procedures describe how to accomplish and complete a task or process. Procedures are the operations to
express policies. Community
of practice (CoP)
is a group of people who share a concern or a passion for something they do
and learn how to do it better as they interact regularly. A little wider term is Data Enablement to enable business
users to safely use data where data enablement will be fully aligned with the
organization's strategic and organizational goals of data curation: All You
Need to Know About Data Curation Big Data Curation Data Mesh is an architectural
approach that decentralises data ownership and management. The approach helps
you scale, improve data quality, and increase agility in your data team. Illustration of
data mesh Video
about data mesh Data as a product ensures a better
data-driven decision. Data governance ensures data is
managed. Data management ensures data is
compliant. 1.10.
Big data Some properties of big data:
A data lake can be used for massive quantities of unstructured data
and big data with tools that can easily interface with them for analysis for
business insights. A datum in a lake has tags (or labels) to give it a
characteristic and to catalog the data in the data lake. By a tag we can fetch
data from the lake without knowing the physical location like a server url
with a folder path. A data lake can contain files on multiple servers on
premise in different folders and in the cloud (many nodes), and we only use a tag for finding, fetching, searching, exploring or discovering data. For example, when I like to find photos
of smiling employees in all albums, I can search for a tag FacialExpression =
smiling. A data lake is using ELT (extract, load, and then transform). A tool
for a data lake can be like Apache Hadoop or Microsoft Azure. Data Discovery
Area is a sandbox for business users. One can use U-SQL to dive into the data lake and fetch the wanted data and do the wanted transformations. Read about a data lakehouse. SQL on-demand is a query service over the data in your data lake. Azure Synapse Analytics was formerly known as SQL DW. A relational database is characterized by the acronym ACID:
This means that a transaction is
either carried out completely successfully or not at all (Atomic), that only
valid data is added to the database (Consistent), that transactions never
affect each other (Isolated), and that transactions are never lost (Durable).
Read more. Lock granularity of database, table,
extent, page or a row is the simplest way to meet the ACID requirement. A
page is an eight-kilobyte storage area and an extent is a group of eight
pages. Deadlock is when one resource is waiting on the action of a second
resource while the second resource is waiting of the first resource, so there
is no way to finish and the database system will kill one transaction so the other can complete. A NoSQL = Not Only SQL database means it can use a SQL type query
language, but usually does not do so. A NoSQL database is often designed to run on clusters, is often open source and does not operate with a fixed schema structure but allows the addition of data without a pre-defined structure. A NoSQL database is characterized by BASE (Basic Availability, Soft state, Eventually consistent). An ACID system guarantees data consistency
after each transaction; a BASE system guarantees data consistency within a
reasonable period of time after each transaction. In other words, there is
data consistency in the system, just not immediately. This leads on to the
Soft State principle. If the data is not consistent at all times, the system
must take a temporary data state into account. The sum of both these
principles means that data accessibility is given very high priority, even if
coincident errors occur in the database system, operating system or hardware.
If parts of the database do not work, other parts of the database take
over, so that data can always be accessed. 2.
Grain (granularity) of data warehouse The grain is the level of detail of
data in the table, both in dimension tables by the hierarchy and in fact
tables by the definition of the measurement event and of interest, and the
grain can later be expressed in terms of the dimensions. The lowest grain keeps all relevant data from source systems and is therefore the most flexible approach, but it also takes the most storage space and can easily cost query performance. To improve query performance the grain can be lifted up to a higher level where
data will be aggregated and summarized and therefore take less storage space.
Data can also be divided such as current year data is in grain daily
(dately), previous year in grain weekly and older data in grain monthly,
because very detailed information is normally not relevant for analysis
years back in time. Granularity or grain of fact table
can be divided into four types:
From grain yearly to grain monthly to grain weekly to grain daily, we say that each level, e.g. daily, increases the granularity and the number of rows in the fact table. Aggregation is the process of
calculating summary data from detail base-level table rows (records) and is a
powerful tool for increasing query processing speed in data marts. For
example, a sale is a fact with analysis variables and measures like quantity sold and amount, and dimensions like product, customer and date of purchase bring the sale into a context. The grain of the sale is limited to a date (like December 23) and a time (like 4:30 pm) and therefore the fact is on Transaction grain.
In case we drop the time, the measures would be called TotalQuantitySold and
TotalAmount because they are the result of a summation of sales times to a sales
date and therefore the fact is on Periodic grain. Also if we decide to summarize
the sales date to a weekly or monthly level. In case we decide to grain the customers by aggregating them into segments and don’t keep names and addresses, then the fact becomes Accumulating/Aggregated grain. A Product dimension
has a three level hierarchy of category name, brand name and product name,
and therefore we can say that product has the highest level of detail. When
a source system only gives a sales transaction with a brand name and no
product name, a Product dimension must fit that grain of fact and take out
the product name level so brand name becomes the lowest level that match the
fact grain. Instead of summation of sales times
or sales date and loose some important information of time of purchase, we
can summarize data into two other fact tables for a weekly level for data from
the last year to this year, and a monthly level for older data, because when
we go back in time, we don’t need to analyze on a daily or weekly level,
and by aggregation we save harddisk space and improve the performance of the
query because fewer rows at month level need to be summarized to fetch the year level data.
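A minimal sketch of lifting the grain from daily to monthly; the tables fact.Sales and fact.SalesMonthly and the measure columns are only examples:

INSERT INTO fact.SalesMonthly (Month_key, Product_key, TotalQuantitySold, TotalAmount)
SELECT DATEFROMPARTS(YEAR(Date_key), MONTH(Date_key), 1) AS Month_key,  -- first day of the month as the monthly grain
       Product_key,
       SUM(QuantitySold) AS TotalQuantitySold,
       SUM(Amount)       AS TotalAmount
FROM fact.Sales
WHERE Date_key < DATEFROMPARTS(YEAR(GETDATE()) - 1, 1, 1)  -- only older data is aggregated (example criteria)
GROUP BY DATEFROMPARTS(YEAR(Date_key), MONTH(Date_key), 1), Product_key;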
A business key has the same granularity and the same semantic meaning across a business organization or company. Granularity is a description of a
level of detail of data, e.g. the combination of customer and product
is a fine level to tell me about sales, and when I add store to the
combination the granularity is driven one level lower, and when I add salesperson
to the combination the granularity is driven one level lower and I know much more
about sales, and when I add a date or better a datetime it will bring
the granularity to its lowest level or to the highest level of detail of sales
and a sale. The lowest level of aggregation or
the highest level of detail is referred as the grain of the fact table. The grain of a fact is determined by
either the combination of dimensions or the actual transaction level. For
aggregate fact or summarized fact the grain is the intersection of its dimensions. 3.
Fact A fact table can contain fact data on a detailed or an aggregated level depending on the grain approach. A fact table can
have five types of columns: Dimension as a dimension key
that is a foreign key to a dimension table dimension key (primary key) that provides
the context of a measure with name, text and description. Use conformed
dimensions. Measure or analysis variable
that contains quantitative numeric fact as a number value from a business
event e.g. amount of sale, produced number of units or called minutes is fact
data. All measures in a fact table must have the same grain like timely, dately,
weekly or monthly level. Use conformed measures. Primary key to uniquely identify fact
rows to be updatable if data is changed or deleted in a source system and to
avoid duplicate rows. But it conflicts with non-volatile rule! A true transactional
fact accepts a counterpart row to handle changed, corrected or cancelled data
in a source system, therefore a transaction identification can’t be a
primary key but can be in combination of a datetime or a sequence number for
a composite key. Consider this for each fact and don’t worry when your
fact table does not have a »natural« primary key. Sometimes dimensions can be
involved in a composite primary key and the grain of the fact is important
here. The Achilles heel of a composite primary key of dimensions is, when
several rows have missing or unknown members or values in their dimensions
and the date or datetime dimension is the same in the rows, we will have a
violation. A primary key is important in case of linking to other fact table
(drill across). The primary key constraint can be non-clustered to get better
insertion performance where data is loaded in batch. Tag contains a text, a date
or a number that can not be summarized, e.g. a voucher number, a receipt
number or a sequence number. A tag column in a fact table is a candidate for
a degenerate dimension or later a real dimension with extra descriptive data. Technical for a surrogate identity
column (as an artificial primary key) a unique sequence number instead of a
composite key by a subset of the columns in a fact table. The name of the
column can be like FactSales_key. A surrogate unique clustered identity
column is useable for parent-child relationship link between fact rows. For
example, an order row has a reference to a cancel row of the order and the
cancel row has a parent reference to the previous order row, a link column
pointing to the surrogate identity column. EntryDateTime or
TransactionDateTime (in danish Transaktionstidspunkt) can be divided into
two dimensions for Date and Time, and the Date column is used for
partitioning of the table to a partitioned fact table. Dates for ValidFrom
and ValidTo to represent a timeline/period where the measures was valid like
a BalanceAmount. Maybe include a Replacement reference from the new fact row
back to the soft deleted row. Indicator flag like IsDeleted and DeletedTime
(ModifiedTime). For the data lineage a fact table will include metadata
columns from the Archive area as RecordSource, ArcRecordId and
ArcGlobalId for traceability back to the archive. The ETL process has to secure the key,
so fact rows are distinct. Fact tables in a dimensional model express
the many-to-many relationships between dimensions and are implemented as one-to-many relationships between dimension tables and fact tables. I never have foreign key constraints on a fact table because they decrease insert performance, and I trust the ETL process and date range lookup, and no human being will be doing an update or delete of a dimension table or a fact table. 3.1.
Types of facts Let us characterize the various
facts into different types of facts or more exactly different types of the
columns in a fact table. Conforming facts means making agreements on common
business metrics such as key performance indicators (KPI) across separated source
systems so that these numbers can be compared mathematically for calculating
differences and ratios (Kimball page 446). A dimension has value or member as a row, or values or
members as rows. Fully-Additive
measure - summable across any dimension A fact table has numerical measures
that can be summed up for all of the dimensions in the fact table, so the
measure columns data type is a number. A Sales fact is a good example for
additive fact with measures like Quantity sold and Amount. In case a transaction data set for a fact table refers to a measure column whose value is empty, null or nullable, use the default value 0 because this won’t bother an aggregation like summation. Each measure must have its metrics. When it is a monetary measure, it
may have a currency column and if it is a unit measure it may have a column
to explain the kind of units used like centimeters, litres, cubic metres etc.
Fact can have a calculated measure or a derived measure based on existing
measures and constants e.g.:
Profit or Surplus = Revenue – Costs. Semi-Additive
measure - summable across some dimensions A fact table has measures that can
be summed up for some of the dimensions in the fact table and not for other
dimensions. For example, a daily balance measure can be summed up through the
customers dimension but not through the date dimension. Inventory levels cannot be summed across time periods.
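A minimal sketch with a daily balance; the snapshot fact fact.AccountBalance and its columns are only examples:

-- summable across the Customer dimension for one date
SELECT Date_key, SUM(BalanceAmount) AS TotalBalance
FROM fact.AccountBalance
WHERE Date_key = '2014-04-25'
GROUP BY Date_key;

-- not summable across dates: use e.g. the average balance per customer for a month
SELECT Customer_key, AVG(BalanceAmount) AS AverageBalance
FROM fact.AccountBalance
WHERE Date_key BETWEEN '2014-04-01' AND '2014-04-30'
GROUP BY Customer_key;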
Non-Additive measure - not summable for any dimension A fact table has measures that
cannot be summed up for any of the dimensions in the fact table. For example,
a room temperature fact is non-additive and summing the temperature across different
times of the day produces a totally non-meaningful number. However, if we
do an average of several temperatures during the day, we can produce the
average temperature for the day, which is a meaningful number. Sum of a measure
called DaysForOrderToCompletion in a FactOrder is meaningless but finding a
minimum, maximum and average values is meaningful for planning of production. Other examples are a percentage and a ratio, which are non-additive measures, and a fact table that only contains columns with transaction numbers such as an order number, an invoice number or a voucher number that can’t be summed up. An order fact with measures like Unit price for a single product makes no sense to summarize, but the derived column Amount = Unit price x Quantity is to be summarized and becomes an additive column, called a
calculated measure. Trend, Stock and Ranking can’t be added and in general
all calculations on one specific intersection of the dimension. Year-to-Date
ytd measure can’t be summed up. Count of rows is normally used. Unit price
can be placed in a Product dimension as a current standard list unit price
and still keep Unit price in the fact together with a Discount (%) when the
purchase was happening, occurred, registered or took place. Back in 2016 I
had a customer that used two kinds of discounts when a product had a bundle option and was shown in an invoice line with the description »Discounted Bundle«: Unit price = List price - Unit
discount amount. Line amount = Unit price x Quantity – (Unit price x Quantity x Line discount percentage
/ 100). Some business users used the term List price while other users used Catalog price. When there was a promotion going on,
the unit price had a different calculation: Unit price = List price - Proration
amount (also called deduction amount). There was also a Line discount
amount and an Order discount amount. I added all the columns into the Sales
fact.
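A minimal sketch of the calculation as derived measures during the load; the staging table stage.InvoiceLine and its columns are only examples:

SELECT ListPrice,
       UnitDiscountAmount,
       ListPrice - UnitDiscountAmount AS UnitPrice,
       Quantity,
       LineDiscountPercentage,
       (ListPrice - UnitDiscountAmount) * Quantity
         - ((ListPrice - UnitDiscountAmount) * Quantity * LineDiscountPercentage / 100) AS LineAmount
FROM stage.InvoiceLine;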
Conformed fact A conformed measure in multiple facts must use the same common business rule and definition so multiple facts
can be united in a report or a cube. When several data marts are using fact
data with same name of fact tables or name of columns for measures and they
have compatible calculation methods and units of measure and support additivity
across business processes. If a measure e.g. Revenue is used in multiple fact
tables with different calculations and meanings, it is best to use different
column names because these facts are not conformed. Figures in DKK '000
means in DKK thousand, where a revenue 395,732 is 395,732,000 DKK. Factless
fact A factless fact table contains no measures, no metrics and no numeric additive values as a normal fact table does; we have a measureless fact that records an event. For example, a fact
table which has only columns for employee, date, time and event of work like
»workstart«, »workstop« and »workstopofsickness« and no columns for a measure. You
can get the number of employees working over a period by a »select
count(distinct Employee_key)« or by
a distinct row-count calculated measure in your OLAP cube. For a factless
fact you will normally count the number of rows, row count or counting rows
and call it »Number of <a name>«. Sometimes a factless fact has a
value column called Count with only one value as 1 used in a data access tool
to sum over and get the number of rows. In case the fact grain is weekly and
a week is missing, it can be inserted to have all weeks complete and here Count gets the value 0. Factless fact is for registration of
event or assignment e.g. attendance taking place in a school class with dimensions for Student, Class, Room and Professor. If we add a measure column for Attendance with 1 or 0 per date per student it is not a factless fact
anymore. Factless fact can be used to
represent a many-to-many relationship among dimensions. Factless fact can
have a ValidFrom and a ValidTo column to tell the timeline/period when the fact was valid.
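A minimal sketch of counting in a factless fact based on the work event example above; the table fact.WorkEvent and its columns are only examples:

SELECT COUNT(DISTINCT Employee_key) AS NumberOfEmployeesWorking
FROM fact.WorkEvent
WHERE Event = 'workstart'
  AND Date_key BETWEEN '2014-04-01' AND '2014-04-30';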
Capture a relationship in the fact To be a column in a dimension or to be its own dimension and used in a fact is a good question. Kimball’s financial services example
starts with an Account dimension including data of products and branches
but he chooses to remove these descriptive data columns to form independent
dimensions of Product and Branch and use them in a fact together with the
Account dimension. Therefore the fact capture a relationship among accounts,
products and branches. The
account-product-branch relationship is a factual relationship, or a
delegation factless (or coverage) fact table. Another
example is that an account can belong to two customers and a customer can
have several accounts. This many-to-many relationship can be expressed in a factless
fact or in a bridge table, see later. A bank transaction is done
by one customer from an account and it is natural to have a Customer
dimension in the fact. Kimball said: »Demoting the
correlations between dimensions into a fact table«, and I like to add: »By capturing a relationship in the fact table we also keep the registered
relationship at the time the event or transaction occurred and was entered
into the fact table«. An example: the retailer SuperChemi belongs to the chain Ethane at the date of sale, e.g. 2013-05-17, and the fact table has two columns for dimensions to Retailer and Chain. In 2015 the retailer SuperChemi changed to another chain, Propane, but we still keep the registered relationship back in 2013 and 2014 in the fact. When a chain is not a part of the fact table and in 2016 we want to find sales of retailers for a specific chain e.g. Propane, we will use the current relationship between dimension Retailer and
dimension Chain as a snowflake dimension, meaning that SuperChemi belongs to
Propane, and when we summarize sales per year since 2010, chain Propane will
include the sales of SuperChemi in 2013 even though the retailer belonged to
Ethane at that time.
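Two query sketches illustrate the difference between the registered and the current relationship; FactSales, DimRetailer and DimChain are assumed names:

-- Registered relationship: the chain as it was at the time of sale (Chain_key stored in the fact).
SELECT c.ChainName, SUM(f.SalesAmount) AS SalesAmount
FROM   FactSales f
JOIN   DimChain  c ON c.Chain_key = f.Chain_key
GROUP BY c.ChainName;

-- Current relationship: the chain taken from the retailer's current chain (snowflake).
SELECT c.ChainName, SUM(f.SalesAmount) AS SalesAmount
FROM   FactSales   f
JOIN   DimRetailer r ON r.Retailer_key = f.Retailer_key
JOIN   DimChain    c ON c.Chain_key    = r.Chain_key
GROUP BY c.ChainName;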
Transactional fact or Transaction fact
A fact table that describes an event or operation that occurred at a point in time in a source system, e.g. an
invoice line item. The row has a date e.g. a transaction date, an entry date
or a post date or a datetime timestamp to represent the point in time together with other lowest-level data. Key values for dimensions are found at the transaction datetime. If data needs to be changed,
corrected or cancelled in a source system, then a data warehouse needs to
make a counterpart row in the fact with a new timestamp or a time span.
Source data examples: booking a hotel room, ordering a car in a special color, buying a lottery ticket with my lucky numbers, receiving an invoice, putting and picking items in stock (in danish lægge varer på lager og plukke varer fra lager). In a data warehouse modeling process
I try to interpret some data as transactional data or transactions. Read more
in section 6.4. Some data modelers do not like separate facts for each transaction type but build a single blended fact with a transaction type dimension, where a mix of other dimensions can lead to N/A dimensions. I have seen a blended fact called FactEvent, which I think is a poor and nondescript
name of a fact table, and the date dimension gets a generalized name. I
prefer multiple facts where the date dimension is role-playing for e.g. order
date, purchase date, shipping date and receipt date.
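A role-playing date dimension can be implemented as views over one shared Date dimension; a sketch with assumed column names:

-- Each view renames the shared Date dimension for one role.
CREATE VIEW DimOrderDate AS
SELECT Date_key AS OrderDate_key, [Date] AS OrderDate, [Year] AS OrderYear, [Month] AS OrderMonth
FROM   DimDate;
GO
CREATE VIEW DimShippingDate AS
SELECT Date_key AS ShippingDate_key, [Date] AS ShippingDate, [Year] AS ShippingYear, [Month] AS ShippingMonth
FROM   DimDate;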
Snapshot fact versus Transactional fact
Snapshot fact represents a state,
e.g. my bank balance right now is $500, and tomorrow it is $600 because I
will deposit $100, therefore the latest date of balance contains the current
balance. It is easy to see my balance ten days ago by looking it up at the SnapshotDate. It is like a picture from the past. Transactional fact contains all my deposits and all my withdrawals with a date and an amount to record the »happening of an action«. At any point in time (normally today) I can
calculate my balance, or the fact contains a running balance measure per
transaction/per fact row. Snapshot fact calculates the balance
of my account at the end of each day because there can have been many
deposits and withdrawals within a day. This snapshot fact can then be easily
used to calculate the average daily balance for interest or fees. The balance
in the snapshot fact is not additive, i.e. it cannot be summed up into a meaningful metric; instead the balance is semi-additive and can be used with aggregate functions like Avg, Min and Max. If a fact table contains sales order lines and we update a line with changed data, add a new line or delete an existing line, the fact is a snapshot. A periodic snapshot is one that represents states over time where order lines are never changed. If we store a new complete image of the order lines when it is changed, it is an accumulating snapshot or what I call a discrepancy snapshot fact, see later. If we store the difference between the original order line and the new order line (i.e. the net change), it is a transactional fact.
Periodic snapshot fact
A fact table that describes the
state of things in a particular instance of time or one point in time, and usually includes more semi-additive and non-additive measures. It is a table with frozen-in-time data, meaning a row will never be changed/modified/deleted (it is unchanged) because the row can have been used in a report like a monthly or annual report and later it is a must to be able to create the same report with the exact same data. The periodic snapshot fact table will never be emptied or updated, therefore it is a true delta load or incremental
load and the table has a snapshot column like a Day (yyyymmdd) for
daily basis, Week (yyyyww) for weekly basis, Month (yyyymm) for a monthly
basis or a Year for annual basis to fetch the wanted data as a day slice,
week slice, month slice or a year slice. When a day, a week, a month or a
year is over and data is ready, data will be loaded. Sometimes data for the current month
is also loaded every day for a current-month-to-date, therefore the current
month will be updated until it is over, finished and can be closed. Measures can be a balance in an account or the inventory level of products in stock and so on. Key values for dimensions are found at the end of the period. A fact table can be partitioned daily, weekly, monthly or annually to make query performance faster when searching for a specific day, week, month or year. In case there is no data for a specific
month it will be nice to insert an artificial row with »missing« value of the
dimensions and 0 in the measures. (Periodic snapshot in danish »en skive
data«, fastfrysning). It is common that fact data needs a
correction from a source system which can cause a redelivery of a snapshot
e.g. a specific month that needs a reload in a periodic snapshot fact. When a
fact table is implemented as partitioned table, it is easy to truncate the
related partition and load it again (Kimball page 517). An ETL process can use a staging table and perform a switching in to a fact table related partition. A redelivery can contain a date of redelivery or a version number in case of several versions of the same delivery, sometimes called a generation. It is called incremental refresh when an ETL process does a dynamic replace or reload of a partition of a table by emptying the partition and loading data with a higher date of redelivery or higher version number into the partition, because it is refreshing an existing delivery in a
snapshot.
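A simplified sketch of a redelivery reload of one month slice, here as a plain delete-and-insert instead of partition switching; table and column names are assumptions:

-- Reload the redelivered month 201305 from a staging table.
BEGIN TRANSACTION;
DELETE FROM FactInventorySnapshot WHERE SnapshotMonth = 201305;
INSERT INTO FactInventorySnapshot (SnapshotMonth, Product_key, Warehouse_key, StockQuantity)
SELECT SnapshotMonth, Product_key, Warehouse_key, StockQuantity
FROM   StagingInventorySnapshot
WHERE  SnapshotMonth = 201305;
COMMIT TRANSACTION;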
Accumulating snapshot fact
A fact table that describes a process with milestones of multiple dates and value columns with different names which are filled in gradually as events in a lifecycle process are completed, representing the entire history. Therefore
over time the same fact table row will be revisited and updated multiple
times where a default date key value is -1 for »hasn’t happened yet« or »be
available later«. Each stage of the lifecycle has its
own columns e.g. milestones of a hospitalization or steps of the
manufacturing of a product. In a retail store a product has three movements
as ordered, received and sold that would be three date dimension columns in
an accumulating snapshot fact. A fact table where a row is a summary of measurement events occurring at predictable steps between the
beginning and the end of a process. For example, a source system for current
payments from customers where some pay several times over a month, and the
first payment becomes a new row in the fact table with the date of payment in columns BeginDate and EndDate and the amount in a Paid column. The next ETL process will summarize payments per customer from the BeginDate to the current date or
end-of-month date, and then update the fact row with same BeginDate with the
new summarized payment and new EndDate, so a fact row will be revisited and
updated multiple times over the »life« of the process (hence the name
accumulating snapshot).
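A sketch of the revisit-and-update step for the payment example; FactCustomerPayment and StagingPayment are assumed names, and a real ETL process would also limit the staging rows to the period from BeginDate to the current date:

-- Revisit the open fact row per customer and accumulate the payments.
UPDATE f
SET    f.Paid    = s.PaidToDate,
       f.EndDate = s.LastPaymentDate
FROM   FactCustomerPayment f
JOIN  (SELECT Customer_key,
              SUM(PaymentAmount) AS PaidToDate,
              MAX(PaymentDate)   AS LastPaymentDate
       FROM   StagingPayment
       GROUP BY Customer_key) s ON s.Customer_key = f.Customer_key;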
Timespan accumulating snapshot fact or State oriented fact
A fact table where a fact is not a
singular event in time but consists of multiple states or events that occur
over time so each row represents the state of an object during a period of
time in which this state didn’t change. It is similar to a type 2 dimension
with snapshot start date, end date and current indicator/active flag where
the row formerly-known-as-current will be revisited and updated. (Kimball
Design Tip #145), see more at Time span fact. Depending on the frequency and volume
of changes, you may consider a »current only« version of the fact to improve
performance of what will probably be a majority of the queries. Then you can
use the fact to drive the loading of selected events in an accumulating snapshot fact.
Discrepancy snapshot fact
A snapshot describes the state of
things at a particular time and when the state does change there is a
discrepancy. It is one way to capture versions of the fact over time with no
need to keep all the daily snapshots from a periodic snapshot fact. A fact table stores metrics, and things that provide context for the metrics go into dimension tables,
therefore it is also possible to have a discrepancy snapshot dimension
with discrete textual values and non-additive numeric values that can’t be
summarized, reminiscent of type 5 with a foreign key back to the main dimension. Discrepancy snapshot fact example.
Data entry action snapshot fact – a way of Change Data Capture CDC
A snapshot that keeps track of the
state of things at a particular time as a logbook with three actions as:
Added for a new entry, Changed for an existing entry and Removed for a non-existing
entry that has been deleted or marked deleted in a source system and is no longer a part of the full payload to the data warehouse. This snapshot approach has
a dimension for data entry action with the three mentioned action values
(could also have been: Influx, Changed, Departure). The data entry action snapshot fact
can be used to show the added/gain values from the latest ETL process or the
removed/lost values and in combination with changed values from one state to
another state. Formula: current = old current +
added/influx – removed/departures. (in danish: igangværende = forrige igangværende +
tilgang – afgang, for bestand,
nytilkomne, indgået og udgået, ikraft og udtrådt). Data entry
action snapshot example.
Derived fact or Additional fact
Kimball recommends fact data at the lowest detail grain possible to ensure maximum flexibility and
extensibility, he calls it a base-level fact. A derived fact table is created
for performing an advanced mathematical calculation and complex transformations
on a fact table like for a specific KPI (Key Performance Indicator), or an
aggregate fact with a summation of measures to a higher grain like from
date level to month level and from product level to brand level and using shrunken dimensions for
Month and Category as a dimension lifted to a higher grain from
the base-level dimensions as Date and Product. A derived fact can be based on
multiple
fact tables for making faster ad hoc query performance and simplify
queries for analysts and for providing a data set to the Presentation
interface area. Remember data lineage. Be aware that fan trap, chasm trap and loop trap can occur when drilling across two fact tables joined together, because there can be a fact table with no relation to another fact table except that each one contains a foreign key to a shared dimension table.
Aggregate fact or Summarized fact
A derived fact table that is referred to as a pre-calculated fact with computed summarized measures
at a higher grain level of one or more dimensions to reduce storage space and
query time greatly and eliminate incorrect queries. An aggregate fact is
derived from a base-level fact and measures in an aggregate fact is a
computed summary of measures in the base-level fact. Dimensions in an aggregate fact can be derived from base-level dimensions and are called shrunken dimensions because the values are rolled up to create fewer fact rows e.g. Date dimension
becomes a Month dimension, Product dimension becomes a Category dimension
and an Address dimension becomes a Region dimension, and therefore the measures
can be summarized to fewer fact rows for better query performance. Year-to-Date ytd fact where month
February is a summing up or roll up of January and February and so forth.
Last-year-this-year fact with calculation of index compared to last year as
a new column and easy to show in a report. Aggregate fact table is simple
numeric roll up of atomic fact table data built solely to accelerate query
performance. It is called incremental
aggregation when an ETL process does a dynamic update of a table by applying only new or changed data without the need to empty the table and
rebuild aggregates.
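A sketch of building an aggregate fact from a base-level fact with shrunken Month and Category dimensions; all names are assumptions:

-- Roll up the base-level sales fact from date/product grain to month/category grain.
INSERT INTO FactSalesMonthCategory (Month_key, Category_key, SalesAmount, Quantity)
SELECT d.Month_key,
       p.Category_key,
       SUM(f.SalesAmount),
       SUM(f.Quantity)
FROM   FactSales  f
JOIN   DimDate    d ON d.Date_key    = f.Date_key
JOIN   DimProduct p ON p.Product_key = f.Product_key
GROUP BY d.Month_key, p.Category_key;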
Consolidated fact
A fact table used to combine fact
data from multiple source systems and multiple business processes together
into a single consolidated fact table when they are expressed at the same grain.
E.g. sales actuals can be consolidated with sales forecasts in a single fact
table to make the task of analyzing actuals versus forecasts simple and fast
to compare how the year is going. A consolidated fact is a derived fact table that combines data from other facts.
Smashed fact
A fact table that contains several measures but only one or a few of them have a value in each fact row. For the
fact rows with same dimension member repeated in multiple contiguous rows
with identical values, they will be smashed or collapsed into one fact row
using an operation such as sum, min or max to limit the number of rows in the fact.
Time span fact or history fact or historical fact
A fact table used for a source
system that is regularly updatable meaning that a source system changes and
overwrites its values. To capture a continuous time span when the fact row was
effective, the
fact table will act as SCD type 2 dimension with BeginAt and EndAt columns
to keep historical data and to represent the span of time when the fact row
was the »current
truth«, and with a query it is easy to fetch it at a given point
in time. It is called
Slowly Changing Facts. Read more in section 6.4.
Counterpart fact (negating fact) and Transactional fact
A fact table used for Slowly
Changing Facts because a source system is changing a fact value without keeping the old value as historical transactions. Read more in section 6.4.
Column wise fact and Row wise fact
A column wise pivoted fact table is
useful to be columns in a report e.g.
Revenue Jan, Revenue Feb, Cost Jan, Cost Feb, Sales Jan, Sales Feb. For a cube a row wise fact is much better
because it gives good dimensions e.g.
Period, Entry, Amount. Therefore a data warehouse needs to
convert from columns to rows or vice versa.
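A sketch of converting a column wise fact to a row wise fact with UNPIVOT; FactColumnWise and its month columns are assumed names:

-- Turn the month columns into rows, then split the column name into Entry and Period.
SELECT CASE WHEN ColumnName LIKE 'Revenue%' THEN 'Revenue' ELSE 'Cost' END AS Entry,
       RIGHT(ColumnName, 3) AS Period,
       Amount
FROM  (SELECT RevenueJan, RevenueFeb, CostJan, CostFeb
       FROM   FactColumnWise) AS cw
UNPIVOT (Amount FOR ColumnName IN (RevenueJan, RevenueFeb, CostJan, CostFeb)) AS up;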
Exploded fact
A fact table that contains a huge number of rows because a period, e.g. from a »date of employment« to a »termination date« stored as columns in one row, is turned around into one row per day of the period, or per week or per month depending on the wanted grain of the period. It can give a lot of fact rows; when an employee has a 10 year anniversary it will make more than 3650 rows at day grain. The Employee dimension is type 2 to keep history of different names, departments and job functions etc. For a wind turbine power production the grain would be 15 minutes, which gives 96 rows per day per wind turbine. When the exploded fact is a source
for an OLAP cube it can sometimes be implemented as a view in the database, and when it is for ad hoc reporting and will be used several times per day, then it must be a materialized view stored in a table or sometimes as an indexed view.
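A sketch of an exploded fact implemented as a view that turns an employment period into one row per day by joining to the Date dimension; FactEmployment and its columns are assumptions:

CREATE VIEW FactEmploymentExploded AS
-- One row per employee per day between date of employment and termination date (or today).
SELECT f.Employee_key,
       d.Date_key
FROM   FactEmployment f
JOIN   DimDate        d
  ON   d.[Date] BETWEEN f.EmploymentDate
                    AND COALESCE(f.TerminationDate, CAST(GETDATE() AS date));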
3.2. Other fact classifications
Transaction has one row per
transaction when they occur together with a datetime. Periodic has one row for a group
of transactions made over a period of time through summation like from daily
grain to monthly grain so the Date dimension is represented by the month
level with the first day of each month. Notice that certain dimensions
are not defined when compared to a transaction fact such as dimension
TransactionType. Accumulating/Aggregated has one row for the entire
lifetime of an event and is therefore constantly updated over time. For
example, an application for a bank loan until it is accepted or rejected or a
customer or working relationship. These fact tables are typically used for
short-lived processes and not constant event-based processes, such as bank
transactions. Mostly a fact table describes what
has happened over a period of time and therefore contains additive or cumulative facts. An example is a status column in a fact table that receives data from a school system where a student follows a course and later finishes it, but sometimes a student skips the course and is deleted in the system. Before reloading the fact it can be a good idea to have a CourseStatus column with values like: Active, Completed or Dropped.
4. Dimension
A dimension table contains business element contexts, its columns contain element descriptions, and a dimension is referenced by multiple fact tables so the contained measurements make sense. A dimension has a value or member as a row, or values or
members as rows.
4.1. Purpose of a dimension
Some purposes as I see it:
Hierarchical
with hierarchy levels The dimension values can be placed
in a hierarchy like a Location
with three levels Country→Region→City. A dimension can have several
separate and independent hierarchies with different numbers of levels. Grouping with
group or Band The dimension values can be placed
in a group that is grouping values
in band intervals like person ages
in custom buckets like an Age group column with intervals of Child (0-9), Tween (10-12), Teenage (13-19), Young adult (20-29), Adult (30-66) and Senior citizen (67-130).
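A sketch of deriving the Age group band with a CASE expression; DimPerson and PersonAge are assumed names:

-- Band a person's age into the custom buckets above.
SELECT PersonAge,
       CASE WHEN PersonAge BETWEEN  0 AND   9 THEN 'Child (0-9)'
            WHEN PersonAge BETWEEN 10 AND  12 THEN 'Tween (10-12)'
            WHEN PersonAge BETWEEN 13 AND  19 THEN 'Teenage (13-19)'
            WHEN PersonAge BETWEEN 20 AND  29 THEN 'Young adult (20-29)'
            WHEN PersonAge BETWEEN 30 AND  66 THEN 'Adult (30-66)'
            WHEN PersonAge BETWEEN 67 AND 130 THEN 'Senior citizen (67-130)'
       END AS AgeGroup
FROM   DimPerson;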
A dimension normally contains one or multiple hierarchies and/or groups/bands to fulfill requirements from the business
users. For example, a Product dimension has
two hierarchies (here called group) and one band to divide the prices:
A dimension can of course be non-hierarchical
and non-grouping/non-band. Different dimensionality covers the
issue that not all combinations of multiple dimension values are allowed in a
fact and the data warehouse needs to make sure of the data quality. For the data lineage a
dimension table will include metadata columns from the Archive area as
RecordSource, ArcRecordId and ArcGlobalId for traceability back to the
archive.
4.2. Dimension column types
A dimension table has a minimum of three types of columns:
Dimension key is the primary key in a dimension table in a data mart area. A dimension key is an artificial
auto-generated unique sequence number or an identity column as a surrogate
key for the primary key of a dimension. It is the dimension's own
responsibility to generate the next value in column dimension key. The
value of a dimension key does not come from a key map table to remove
dependence from a source system. Dimension key is a foreign key in a fact
table which has several columns as dimension keys to several dimension
tables. For a type 1 dimension a dimension key
can be named Entity key EK or EKey because it represents an entity in source
data, or Surrogate key with prefix like SK_ or a suffix like _SK, _SKey, SK, SID,
ID, UID unique identification to join to a fact table to fetch and see the
current values. For a type 2 dimension a dimension
key can be named History key HK or HKey or HSID because it represents a history
reference over time to join to a fact table to fetch and see the registered
values. For me »surrogate« is a characteristic
or a property, and not a name for a column. I prefer a suffix _key for a dimension key column in a
dimension as a primary key for the dimension. An alternative suffix _dimkey
or _dim_key. For example, Customer_key column in
a Customer dimension which is related to a Sales fact to a Customer_key
column as a foreign key. A dimension key must not be
data-bearing (in danish data bærende), it must be meaningless, but for a
date dimension and a time dimension I like to use a smart valued dimension
key, e.g. a date 2013-12-31 as a data type date or a value 20131231 as an data
type integer, and a time 08:30 am as an integer value 830 and 08:30 pm as an
integer value 2030. With these dimension keys I can avoid a join from a fact
table to a dimension table to fetch and see a date and a time.
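A sketch of calculating the smart valued date key and time key mentioned above:

-- 2013-12-31 20:30 becomes Date_key 20131231 and Time_key 2030.
DECLARE @dt datetime2 = '2013-12-31 20:30';
SELECT CAST(CONVERT(char(8), @dt, 112) AS int)           AS Date_key,
       DATEPART(hour, @dt) * 100 + DATEPART(minute, @dt) AS Time_key;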
I have seen a dimension key store which generates a global number or RefId for each
dimension key value across all the dimensions to achieve uniqueness. I like to
think a dimension table as an object that has its own methods or functions on
data where each dimension object has its own responsibility to generate the
next value for the dimension key _key and for the durable key _dkey, see
example in section 6.3 for columns System_key and System_dkey.
Business key is from a source system where it can be a primary key or a secondary key with unique
values. For example, Social Security number
(SSN), StatusCode and CustomerNumber. A business key value is either
mutable (changeable) or immutable (unchangeable), and is meaningful for a
human being where a business user prefers to use a business key value to
identify an
entity as a search criteria lookup value given in a phone call to a company,
a bank, a hospital or the government. A business key has an embedded meaning
and represent a unique object in the business. An alias is an Enterprise wide
business key because the same value is used in multiple source systems. If a
business key is immutable (unchangeable) in a source system then it is called a durable
business key. From a data warehouse point of view,
a business key also represents a surrogate key from a source system where
it often is a primary key as an artificial auto-generated unique sequence
number or an identity column (id, uid unique identification) and is immutable
and meaningless for a human being. A key map table can transform business
key values either
to an immutable surrogate business key _sbkey e.g. a code which can be
understood by a human, or to an artificial auto-generated unique sequence
number or an identity column as a durable key _dkey for later use in a dimension. A durable key is an integer representation
of a business key and can be generated in a key map
table and fetched by a business key. Multiple source systems have a
business key for a Social security number per person with a primary key as a
surrogate sequence number, therefore we can’t use the primary keys to
interconnect (join, drill-across) the systems' data; instead we use the business keys. The value of a business key can change over time e.g. a person
obtains witness protection and gets a new Social security number, therefore
it is up to the ETL process to make a solid
mapping. When a business key is an employee number and an employee resigns
and some years later is rehired, there could be two different employee
numbers for the same employee (EmployeeNumber or EmployeeId), and a data
warehouse must find a way to map (»glue«) them together, so the employee only
occurs once in a dimension with help from a key map table that retains the durable key. The other way around can also happen in a source system where the same value of an EmployeeNumber or an EmployeeId is used for two different employees.
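A sketch of a key map table that assigns a durable key per business key; the manual »glue« of two employee numbers to the same durable key is not shown, and the names are assumptions:

-- The identity column provides the durable key; unknown business keys are added.
CREATE TABLE KeyMapEmployee
(
    Employee_dkey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    EmployeeNumber nvarchar(20)      NOT NULL UNIQUE
);

INSERT INTO KeyMapEmployee (EmployeeNumber)
SELECT s.EmployeeNumber
FROM   StagingEmployee s
WHERE  NOT EXISTS (SELECT 1 FROM KeyMapEmployee k
                   WHERE  k.EmployeeNumber = s.EmployeeNumber);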
Descriptive data represents a dimension value context
description as textual, numeric, date, time etc. and is saved in columns in a
dimension table and will be shown to a business user as descriptive data columns
with values to explain fact data in a row from a fact table.
4.3. Changing dimensions
Source data is volatile data because it will change over time, e.g. a customer changes his name or address and a product changes place in a hierarchical structure as a result of a
reorganization. How can we support evolving
dimension data when dimension values and instances normally will change
over time because of the volatility in source systems? The rate of changes can
be divided into two kinds of classifications and afterwards we will be looking
into techniques to handle and track changes and to capture their history and preserve the life cycle of source data, also called Change Data Capture CDC.
Slowly Changing Dimensions SCD
Columns of a dimension that would
undergo changes over time. It depends on the business requirement whether
particular column history of changes should be preserved in the data warehouse
or data mart. (In danish stille og rolig ændrede (skiftende,
foranderlige) dimensioner med langsom opbyggende historik).
Rapidly Changing Dimensions RCD
A dimension column that changes
frequently. If you do need to track the changes, using a standard Slowly
Changing Dimensions technique can result in a huge inflation of the size of
the dimension. One solution is to move the column to its own dimension with
a separate foreign key in the fact table. Rapidly changing data usually
indicate the presence of a business process that should be tracked as a separate
dimension in a separate historical data table. (In danish hurtig ændrede (skiftende, foranderlige)
dimensioner med omfangsrig opbyggende historik). Techniques or methods to handle
dimension values that are changing over time from the source systems are called
Ralph Kimball’s eight types or
approaches of SCD: Type 0 Keep original value, where a dimension
value will never change, and when a value is changed in a source system, the
original value is retained in a dimension. It is called no versioning. The type 0 approach is only to store
the original value from a source system. A single fixed value does
not change over time but can be corrected in case of an error in a source system, e.g. a correction of a mistyping,
a typo or a misspelling of a name or a wrong Date of Birth, Date of Issue, Date of
Expiration, Date of Launching something or Date of First purchase. The value of a business key will be in
one row only, therefore the business key column can be a unique index. There
is a one-to-one relationship between business key and dimension key. Dimension key is an artificial
auto-generated unique sequence number or an identity column as a surrogate
key for the primary key of the dimension to join to a fact dimension key. A fact table refers to a dimension key
value to fetch the original values, as-began. Inserting a new row into a fact
table will lookup a business key value in a dimension table to fetch a
dimension key value for the new row. A view upon the dimension will
provide the original values for the dimension key to join to a fact dimension
key to fetch the original values, as-began. This is the technique for Slowly
Changing Dimension: »Retain original«. Type 1 Keep current value, where a
dimension value will change, and when a value is changed in a source system, the
old current value will be overwritten in a dimension by the new current
value in the same row in the dimension. It is called no versioning. The type 1 approach is only to store
the active, actual, present, newest or latest of the value, the most recent
value or current value from a source system. The old value will be forgotten.
The history of the old value is lost forever. Historical descriptive data values
are not preserved. This approach is suitable when historical changes are not
significant or when tracking historical changes is not required. The value of a business key will be in
one row only, therefore the business key column can be a unique index. There
is a one-to-one relationship between business key and dimension key. Dimension key is an artificial
auto-generated unique sequence number or an identity column as a surrogate
key for the primary key of the dimension to join to a fact dimension key. A fact table refers to a dimension key
value to fetch the most recent, current values, as-is. Inserting a new row into a fact
table will lookup a business key value in a dimension table to fetch a
dimension key value for the new row. A view upon the dimension will
provide the current values for the dimension key to join to a fact dimension
key to fetch the most recent, current values, as-is. The current view will show a unique
list of most recent values which is handy for a dropdown box for search
criteria in a dashboard. (In danish aktuelle, nuværende, gældende, seneste
værdi). This is the technique for Slowly
Changing Dimension: »Current by overwrite«. Type 2 Keep historical value and current
value, all values as history, where a dimension value will never change, and
when a value is changed in a source system, the old current value will be remained
in a dimension and the new current value will be inserted into a new row in
the dimension. It is preserving historical descriptive data values in a
dimension table. It is called row versioning. We achieve an unlimited
history of values over time marked by timeline columns as a pair of data
type date or datetime that represents the span of time when a row with its values
was the »current truth« with metadata columns:
The type 2 approach is to store the full
history of changed value from a source system and to fetch the most recent
value or current value from a source system. Historical descriptive data values
are preserved. This approach is suitable when historical changes are
significant and when tracking historical changes is required to have a full
historical tracking, no data loss. Timeline metadata columns ValidFrom
and ValidTo represents a period for the time when the value was current and
active in a source system. The value of a business key will be
repeated in a new row every time a data value is changing. There is a
one-to-many relationship between business key and dimension key. An ETL process uses a business
key value to update the old current row by setting metadata column ValidTo
to a datetime stamp, and to insert a new current row with same datetime stamp
to metadata column
ValidFrom and set metadata column ValidTo to 9999-12-31 to represent
the new current row for the business key. Kimball page 508 calls it no gap exists
between rows of the same business
key value, meaning that the old current row ValidTo and the new current row
ValidFrom are using the same datetime stamp value.
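A simplified set-based sketch of the update-and-insert step for a type 2 Customer dimension, including the IsCurrent column mentioned below; the names are assumptions and a real ETL process would compare more columns or use a hash:

DECLARE @now datetime2 = SYSDATETIME();

-- Close the old current row for every changed business key.
UPDATE d
SET    d.ValidTo = @now, d.IsCurrent = 0
FROM   DimCustomer d
JOIN   StagingCustomer s ON s.CustomerNumber = d.CustomerNumber
WHERE  d.IsCurrent = 1
  AND  s.CustomerName <> d.CustomerName;

-- Insert a new current row with the same datetime stamp (no gap).
INSERT INTO DimCustomer (CustomerNumber, CustomerName, ValidFrom, ValidTo, IsCurrent)
SELECT s.CustomerNumber, s.CustomerName, @now, '9999-12-31', 1
FROM   StagingCustomer s
WHERE  NOT EXISTS (SELECT 1 FROM DimCustomer d
                   WHERE  d.CustomerNumber = s.CustomerNumber
                     AND  d.IsCurrent = 1);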
A key map table can transform business key values either to an immutable surrogate business key
_sbkey e.g. a code which can be understood by a human, or to an artificial
auto-generated unique sequence number or an identity column as a durable key
_dkey which can be included in a type 2 dimension. Dimension key is an artificial
auto-generated unique sequence number or an identity column as a surrogate
key for the primary key of the dimension to join to a fact dimension key. A fact table refers to a dimension key
value to fetch the registered values of the dimension at the time when fact
data was happening, occurred, registered or took place often by a date column
in the fact table based on source data, or when fact data was entered into
the fact table by a current load insert date, as-was. Inserting a new row into a fact
table will date range lookup a business key value in a dimension table
between ValidFrom and ValidTo to fetch a dimension key value for the new row.
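A sketch of the date range lookup when loading fact rows; StagingSales and the column names are assumptions:

-- Find the dimension key that was valid when the sale took place.
SELECT s.InvoiceNumber,
       s.SalesAmount,
       d.Customer_key
FROM   StagingSales s
JOIN   DimCustomer  d ON d.CustomerNumber = s.CustomerNumber
                     AND s.SalesDate >= d.ValidFrom
                     AND s.SalesDate <  d.ValidTo;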
A view upon the dimension will provide the registered values for the dimension key to join to a fact dimension
key to fetch the registered values, as-was. (In danish oprindelige værdi). A view upon the dimension will
provide the current values for the dimension key to join to a fact dimension
key to fetch the most recent, current values, as-is. The current view will show a
non-unique list of most recent values which is not handy for a dropdown box
for search criteria in a dashboard, because a value can be repeated for
different dimension key values used in a fact table. (In danish aktuelle, nuværende, gældende, seneste
værdi). I like to add a metadata column
called IsCurrent with two values: 0 for historical and 1 for current to mark
each row in a type 2 dimension. This is the technique for Slowly
Changing Dimension: »Keep history in rows as full history«. Type 3 Keep current value and previous
value, where a dimension value will change, and when a value is changed in a source
system, the old current value will be remained and will be stored in a Previous
column (or Historical column) in a dimension, and hereafter the old current
value will be overwritten in a dimension by the new current value in the same
row in the dimension. It is called column versioning as an extension to type
1. The type 3 approach is to store the
latest value or current value from a source system along with a sufficiently
limited amount of historical data in a previous value or few prior values. The value of a business key will be in
one row only, therefore the business key column can be a unique index. There
is a one-to-one relationship between business key and dimension key. Dimension key is an artificial
auto-generated unique sequence number or an identity column as a surrogate
key for the primary key of the dimension to join to a fact dimension key. A fact table refers to a dimension key
value to fetch the most recent, current values, as-is, and some previous
dimension values, as-was. Inserting a new row into a fact
table will lookup a business key value in a dimension table to fetch a
dimension key value for the new row. A view upon the dimension will
provide the current values for the dimension key to join to a fact dimension
key to fetch the most recent, current values, as-is, and to fetch some previous
dimension values, as-was. This is the technique for Slowly
Changing Dimension: »Keep history in columns as partial history«. Type 4 When a group of columns in a
dimension is going to change often as a Rapidly Changing Dimension, we can
split the dimension into a Base-dimension that has slowly changed columns and
into one or more separate Mini-dimensions that have often changed columns (volatile data values) to keep the fast changing values in their own table. Type 4 stores all historical changes in separate historical data tables. The type 4 approach is to store the current value and the historical value separately to maintain data integrity. The Base-dimension can be either a
type 1 or type 2 (most often type 1) and the Mini-dimension becomes a type
2. We get a »capture a relationship in the fact«. A fact table refers to a Base-dimension
key value to fetch the Base-dimension values. A fact table refers to a Mini-dimension
key value to fetch the Mini-dimension values. This is the technique for Rapidly
Changing Dimension: »Keep history in tables«. Type 5 Builds on the type 4 where a Mini-dimension
gets a view to fetch the current rows of the Mini-dimension and the Base-dimension
is extended with a key value that points to the current view and the key becomes a type 1 outrigger column in the Base-dimension. Base-dimension and Mini-dimension can join without including a fact, to save query performance. The join can be implemented in a view. We get a »capture a relationship in the dimensions«. The ETL process must overwrite the
type 1 column in Base-dimension whenever the current Mini-dimension changes
over time. Therefore 4 + 1 = 5 type. This is the technique for Rapidly
Changing Dimension: »Outrigger current dimension«. Type 6 Mixture of type 1 and type 2 columns, therefore it is a good idea to suffix columns as _t1 and _t2 to know which columns can be overwritten in the current row. Can also have columns of type 3, therefore 3 + 2 + 1 = type 6. Type 6 acts as type 2 tracking of changes by adding a new row for each new version, but type 6 also overwrites
_t1 columns on the previous row versions to reflect the current state of data
by using the business key to join the new row with the previous rows. This is the technique for Slowly
Changing Dimension: »Hybrid«. Type 7 All rows follow type 2 to keep track
of history values with a dimension key column and an extra key column called
a durable key follows type 1 for the current value. A key map
table can transform business key values either to an immutable surrogate business key
_sbkey e.g. a code which can be understood by a human, or to an artificial
auto-generated unique sequence number or an identity column as a durable key
_dkey which can be included in a type 7 dimension. The value of a business key and the value
of a durable key will be repeated in a new row every time a data value is
changing. There is a one-to-many relationship between business key and dimension
key. There is a one-to-many relationship between durable key and dimension
key. Dimension key is an artificial
auto-generated unique sequence number or an identity column as a surrogate
key for the primary key of the dimension to join to a fact dimension key. A fact table refers to a dimension key
value to fetch the registered values of the dimension at the time when fact
data was happening, occurred, registered or took place often by a date column
in the fact table based on source data, or when fact data was entered into
the fact table by a current load insert date, as-was. A fact table refers to a durable key
value to fetch the most recent, current values, as-is. Inserting a new row into a fact
table will date range lookup a business key value in a dimension table
between ValidFrom and ValidTo to fetch a dimension key value and a durable key
value for the new row. A view upon the dimension will
provide the registered values for the dimension key to join to a fact dimension
key to fetch the registered values, as-was. (In danish oprindelige værdi). A view upon the dimension will
provide the current values for the durable key to join to a fact durable key
to fetch the most recent, current values, as-is. The current view will show a unique
list of most recent values which is handy for a dropdown box for search
criteria in a dashboard, because the values will be distinct by the use of
the durable key column in a dimension table of type 7. (In danish aktuelle, nuværende, gældende, seneste værdi). I like to add a metadata column
called IsCurrent with two values: 0 for historical and 1 for current to mark
each row in a type 7 dimension. This is the technique for Slowly
Changing Dimension: »Dual Type 1 and Type 2 Dimensions and Dual foreign keys
in fact for a given dimension«. Correction of
data together with the types For type 0 I mentioned »correction of a mistyping,
a typo or a misspelling«, e.g. a first name of a customer Kelli is corrected to Kelly after a few days in a source system; for type 2 or type 7 it will automatically mean an extra row
in the dimension, but I think it needs some consideration especially if a source
system can inform about the correction, maybe the dimension can do an overwrite/update
of the row as type 1 and mark the row with a metadata column IsCorrected, or
tag changed data with a reason (Kimball page 451). Examples of
type 0, type 1, type 2, type 3, type 4 and type 7 Type 0 for a Date dimension and a
Time dimension and a Gender dimension because the values will never change. Type 1 for an Invoice dimension because
old values are not important. A dimension for MaritalStatus with values:
Single, Married, Separated, Divorced, Widowed, Registered
partnership and Abolition of registered partnership because in case a value will
change, we don’t want to keep the old value. Type 1 for a Customer dimension example
is in section 6.2.1. Type 2 for a Customer dimension because
customer values from a date of purchase are important to keep for
accounting, purchase statistics and analyses for marketing. Example with date
range lookup is in section 6.2.2 and includes a non-unique list of customer
current names. Type 7 for a Customer dimension example
and date range lookup is in section 6.2.3 and includes a unique list of customer
current names. Here is a type 3 example for a
customer that contains three columns for postal codes. The column names are
Current Postal Code, Previous Postal Code and Oldest Postal Code. When a
customer address is changed to a new postal code, the ETL process will move a
value from the Previous column to the Oldest column and move a value from
Current column to the Previous column and add the new value to Current column. Here is a type 4 example where the
customer values are split into a Base-dimension with slowly changes in
columns e.g. CustomerName, ContactName, Address, City, Postal Code, Discount Band, Marketing
category, and a Mini-dimension with fast/rapidly changes in columns e.g.
CustomerAge, Income, Rating, AccountStatus, Days of payment. Here is a type 6 example: In a
dimension table the columns can be a mix of type 1 and type 2 e.g. a Customer
dimension where columns for customer name, street name, house or apartment
number are type 1 because we only need the recent value for shipment, and columns for postal code (zip code) and city are type 2 because we like to track
these changes and keep data for the city where a customer was living in when
purchasing a product. A »sales in cities« report for the last ten years will
use the right city of the customer at the time the purchase was happening,
occurred, registered or took place.
Temporal table or system-versioned table
Temporal table looks like a type 4 dimension because it keeps history in a separate table while the original dimension table keeps the current dimension member, but the dimension key is not an artificial auto-generated unique sequence number for each update of values in the dimension and therefore it cannot be a type 2 to type 7. Temporal table is only usable for a type 1, when it includes the deleted rows because older fact rows can be referring to them. We can say that a temporal table or a system-versioned
table is a type 1 dimension with an addition of a history-tracking to handle
historical data values for changes (updated or deleted) at any time. Example from Kimball design tip #100.
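A minimal T-SQL sketch of a system-versioned temporal table; the table and column names are illustrative:

CREATE TABLE dbo.DimProduct
(
    Product_key   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ProductNumber nvarchar(20)      NOT NULL,
    ProductName   nvarchar(100)     NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.DimProduct_History));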
4.4. Types of dimensions
Let us characterize the various
dimensions into different types of dimensions. Conformed dimension
or Shared dimension or Common dimension A conformed dimension has the same
meaning to every fact in multiple data marts and measures will be categorized
and described in the same way, ensuring consistent reporting across the data warehouse, and is essential for enterprise data warehousing. A conformed dimension is a consistent
interface to make sure that data can be combined in a data warehouse and be
used all over the business because values of a dimension means the same thing
in each fact. With a conformed dimension we can combine and drill across from
fact to fact in one data mart or over several data marts, and analyze common
columns and values. Separate fact tables can be used together with shared,
common and conformed dimensions. Conforming of several source data is
part of the integration to achieve a conformed dimension where data is
integrated of different meanings and different columns must be compared
against each other, rules must be set, and data must be cleansed to create a
single version of the entity. Conformed dimensions will unite and integrate
data values among multiple source systems so it is easy to search across
different types of data and sync them in a common report. Shared dimension is
utilized in multiple fact tables in a data mart or across multiple data
marts. Dimension values comes from either
a source system or is built by business rules in Usage supporting database. Non-conformed
dimension can only be used within one fact. It is part of the ETL process to
do conforming by merging, uniting and consolidating multiple source data across the
enterprise for making a conformed dimension e.g. a Customer dimension
from different business areas as
order, sale, invoice, delivery and support service (for B2B and B2C) with both
different customers and same customers and with different business key values
and with different addresses like a shipping address and a billing address.
Sometimes data from a conformed dimension is sent back to the source systems
as master data to be reusable in the organization. Hierarchical with hierarchy levels The dimension values can be placed
in a hierarchy like a Location
with three levels Country→Region→City. A dimension can have several
separate and independent hierarchies with different numbers of levels. Grouping with group or Band The dimension values can be placed
in a group that is grouping values
in band intervals like person ages
in custom buckets like an Age group column with intervals of Child (0-9), Tween (10-12), Teenage (13-19), Young adult (20-29), Adult (30-66) and Senior
citizen (67-130). Data classification is the process
of organizing data into categories; a group or category is part of a categorization or grouping of data to make a dimension more user-friendly to see data of a dimension on an aggregated and summed level. A dimension normally contains one or
multiple hierarchies and/or groups to fulfill requirements from the business
users. A dimension can of course be
non-hierarchical and non-grouping. Date
dimension or Calendar dimension A very common dimension with the
granularity of a single day with hierarchies as: Year→Half year→Quarter→Month→Date (five levels) or Year→Week→Date (three levels)
because a week does not always belong to one month. When a fact date is connected to the
Date dimension it is easy to make a report based on month, quarter or yearly
level or a search filter criteria as Q3 2013. Dates or days can be grouped to a
Season or TimeOfYear: Winter
(December January February), Spring
(March April May), Summer (June
July August) and Fall (September
October November) (in danish Årstider). The Date dimension normally has many supporting columns like Weekday name and Month name, and an IsHoliday column with a data type of boolean/bit with two values True (1) and False (0) as a
flag indicator, but to be used in a presentation tool it is better to have a
textual column DayOnOff with values »Holiday« and »Working day«. Of course the rule of when a specific date is a holiday, depending on country, collective agreement, company policy etc., is programmed into the stored procedure that builds the date dimension.
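A sketch of deriving the textual DayOnOff column from the IsHoliday flag inside such a stored procedure; column names are assumptions:

-- Present the boolean flag as business-friendly text.
SELECT Date_key,
       [Date],
       IsHoliday,
       CASE WHEN IsHoliday = 1 THEN 'Holiday' ELSE 'Working day' END AS DayOnOff
FROM   DimDate;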