Dimensional modeling – architecture and terminology

by Joakim Dalby (in Danish: Dimensionel modellering, Datamodellering, Data varehus)

 

1. Introduction

Information has become a major asset for any organization. The book Das Kapital by Karl Marx could today be titled Die Information. A data warehouse is the way to provide information to the business users, add value to the business and reuse data in new ways to increase revenue. A data warehouse is a separate system, so user queries and analyses will not slow down or add workload to the operational systems.

 

A data warehouse is

 

·  Subject-oriented because it collects business data from a variety of source systems.

·  Integrated because it uses common and consistent business names and units.

·  Time-variant because historical data is kept alongside current data, so data can be viewed as it was at any given time.

·  Non-volatile because data does not change once loaded and historical data is never altered.

·  Enrichment and processing of data for providing new knowledge.

·  Collection of data in support of management's decision-making process and for analyzing the business with business performance measurement qualifiers and KPIs (Key Performance Indicators) used as measures of how a company is doing.

·  Utilizing dimensional modeling, end users and analysts can easily understand and navigate the data structure and fully exploit the data for self-service BI.

·  Goal to support slice and dice of data for business analysis.

·  Compliance requirements are met by providing proof that the reported numbers are accurate and complete, by having an audit trail for reconciliation checks etc.

·  Driven by a business requirements specification and user stories from stakeholders and users. A data warehouse must focus on the needs of the business.

 

Data from different operational source legacy systems is stored in tables in a relational database called a data warehouse. Data is merged and integrated inside the data warehouse to get a consistent format and the same naming. It is called conforming when you create a conformed column with a single, consistently labeled name that has the same common meaning and content all over the data warehouse, e.g. the same spelling and unit of measurement. For example, product data can be found in multiple source legacy systems with different column names and product segmentations, therefore a data warehouse will unite, interconnect, compile, conform and consolidate the data into one table to make it easy for the business users to make a list of all products. Therefore it is important to design a data warehouse to support reporting and data analysis through a single and common data model that is easy to navigate, so a business user does not have to think of data across source legacy systems. Data integration becomes seamless. Many operational source legacy systems update their data to reflect the most current state, while a data warehouse also maintains history.

 

A data mart covers a specific subject area and contains the data and information that is relevant for a business user. A data warehouse can have multiple data marts bound together with conformed and shared dimensions. A data mart has one purpose and contains customized and/or summarized data tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. For example, Sales mart, Customer mart, CRM mart, Churn prediction mart, Market mart, Production mart, Inventory mart, Shipment mart, HR mart, Tax mart, Fraud detection mart etc.

 

Dimensional modeling is a design technique that seeks to present data in a standard, intuitive framework of dimensions (descriptive context and hierarchy) and facts (measurements), focuses on ease of end user accessibility and provides a high level of performance access to the data warehouse. Ralph Kimball recommends that the design of a data warehouse decide what business process(es) to model by combining an understanding of the business requirements with an understanding of the available data.

 

A Business Intelligence (BI) system provides the information that management needs to make good business decisions, and data warehousing is a method that helps the database designer build a comprehensive and reliable data warehouse system, e.g. based on Dimensional modeling principles. An OLAP (OnLine Analytical Processing) cube can sit on top of the Dimensional modeling to present data in tools like Excel, Power BI, QlikView, Tableau or Targit.

BI takes users from data and information to knowledge and wisdom.

Previously this was called a Decision Support System. BI could also stand for Business Insight.

 

In this article I will focus on the basic concepts, terminology and architecture in Dimensional modeling. My homepage has other articles about how to design and implement a data warehouse system with levels of data in the ETL process (Extracting, Transformation, Loading). I see Dimensional modeling as a subset of the Entity Relationship (ER) data modeling design method of a relational database system OLTP (OnLine Transaction Processing). Peter Chen, the father of ER modeling, said in his seminal paper in 1976: »The entity-relationship model adopts the more natural view that the real world consists of entities and relationships. It incorporates some of the important semantic information about the real world.« Where ER data models have little or no redundant data, dimensional models typically have a large amount of redundant data. Where ER data models have to assemble data from numerous tables to produce anything of great use, dimensional models store the bulk of their data in a single fact table or a small number of them and divide data into different databases called data marts.

 

Back in the 1970's William H. Inmon began to define the term data warehouse as one part of the overall business intelligence system. An enterprise has one data warehouse with entities in a relational model as a centralized repository, and multiple data marts which source their information from the data warehouse. In the data warehouse, data is stored in 3rd normal form. The data warehouse is at the center of the Corporate Information Factory (CIF), which provides a logical framework for delivering business intelligence. A data mart is a simple form of a data warehouse that is focused on a single subject. Denormalization and redundancy are the norm for data modeling techniques in a data mart. Ralph Kimball: »A data warehouse is a union of all its data marts.« Kimball's data warehousing architecture is also known as the Enterprise Data Warehouse Bus Architecture matrix (BUS matrix), a collection of conformed dimensions that have the same meaning to every fact.

 

A table contains columns, also known as attributes or fields. An important tip when naming columns: do not use possessive nouns such as Recipient's birth date, better use CamelCase like RecipientBirthDate as a column name, and don't put values into a column name like EmployeeDay/NightCode, better EmployeeShiftTypeCode.

A column contains an atomic value like a First name or a composite value like a Full name that can be further divided into three columns for Firstname, Middlename and Surname. A null represents a value for a column that is unknown at the present time or is not defined for the row.

A derived column represents a value that is derivable from the value of a related column or set of columns and not necessarily in the same table. For example, the age of a person is derivable from the date of birth column and date of today.

A single-valued column holds a single value for an entity occurrence.

A multi-valued column holds multiple values, MovieCategory »Children, Comedy«.

A composite primary key is composed of multiple columns.

An entity is a set of objects with the same properties and is implemented as a table with rows, where a primary key is unique and is used for identification of rows.

A relationship is a meaningful association among entities between a primary key and a foreign key in a one-to-one, one-to-many or many-to-many relationship to ensure consistency in the database. When a database is consistent we can navigate with a join among tables, and relationships become navigation paths.

 

SQL stands for Structured Query Language e.g. Select From Where Inner Join.

 

Data is the plural of datum, e.g. 5.5 or 38, and after labelling them as shoe sizes in US and EU sizing they become information. When I know many of my friends' shoe sizes together with other information, I have knowledge that I can use for better birthday gifts or sell to a marketing campaign.

 

1.1. Data layer in a dimensional modeling architecture

A description of the content of the data layers in the throughput from source legacy systems to the business users' PC screens for improved analytic decision making.

 

Input data area - IDA

From source legacy systems to Input data area for storage of raw data. Also called Raw data, Source, Legacy, Capture Operational Data layer, Data Acquisition or Extract. Empty this area in the beginning of the ETL process. Data types will be adjusted to fit the receiving database system. Do reconciling between the data warehouse and the source legacy systems with reconciliation of row counts and sums of values, mark data as reconciled and do an auditing report. (To reconcile in Danish: at afstemme, stemmer overens.) Maybe a receipt system where IDA tells the source legacy systems that data has been received. Data might be wrong in the input data area or in the archive and later the source legacy system will resend a new revision of data, therefore it is important to create output views that filter wrong data away and pass correct data as an extraction to the next layer of the data warehouse. Each table has a column IdaInsertTime so all rows in all tables use the same capture time as a received date that can be used later for keeping history. IdaInsertTime represents a snapshot of the total amount of data as it looks in the source legacy systems. IdaInsertTime can be set in a System database when the fetching is starting, and ETL jobs/packages/transformations/sql will reuse the same time for all data rows. Sometimes a source legacy system has a Changed column with the date/time of first insert or the date/time of the latest update of the row/record that can be used for later history.
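
A minimal sketch of an IDA table and its load, assuming a SQL Server database and a hypothetical source customer table; in practice the shared capture time is fetched from the System supporting database instead of being declared locally:

CREATE TABLE Ida.Customers
(
  CustomerId     int           NOT NULL,   -- value exactly as delivered by the source legacy system
  CustomerName   nvarchar(100) NULL,
  IdaInsertTime  datetime2(7)  NOT NULL    -- batch capture time, same value for all rows in the load
);

DECLARE @IdaInsertTime datetime2(7) = SYSDATETIME();  -- normally read from the System database

TRUNCATE TABLE Ida.Customers;  -- the area is emptied in the beginning of the ETL process
INSERT INTO Ida.Customers (CustomerId, CustomerName, IdaInsertTime)
SELECT CustomerId, CustomerName, @IdaInsertTime
FROM SourceDb.dbo.Customers;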

Exchanging data from a source legacy system to IDA can be done in different formats, e.g. the JSON data-interchange format. Since I prefer storing data for IDA in a relational database with tables, rows, columns and data types, I must parse the json format.

Kimball recommends that source legacy systems express data at the lowest detail possible for maximum flexibility and extensibility, and that it is the data warehouse that makes summary data for simplicity and accessibility, not the operational source data system. IDA is a mirror of the source.

 

Archive - ARC

From Input data area to Archive. Also called Operational Data Store (ODS), Persistent Staging Area (PSA), Persistence Layer, Data Repository or History. Never empty this area because it is an archive of time-variant source data and it retains historical value changes in the source legacy systems. Simple data adjustments can be done to gain the same date and amount format and the same data representation of a social security number etc., but they will be in new columns so the original values are unchanged. Each table has columns e.g. IdaInsertTime, ArcInsertTime (datetime2(7) default sysdatetime(), not unique per row), ArcTs (timestamp, rowversion, unique per row), ArcRecordId (a sequence number per row per table for uniqueness) to fetch delta data out of the archive that has not yet been loaded, and maybe an ArcGlobalId (unique sequence number per row across all tables). Each table can also have a source legacy system indicator. Reasons for having an archive can come from issues of data quality and data governance and external regulations e.g. Sarbanes-Oxley, Basel I, Basel II, HIPAA. Data Vault is one approach to an implementation of an archive as a back-room. The archive is a versioning of the source. Read more about how to load data from an archive in chapter 8.
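
A sketch of an archive table with the technical columns mentioned above; names and data types follow the suggestions here but are not prescribed:

CREATE TABLE Arc.Customers
(
  ArcRecordId    bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,    -- sequence number per row per table
  ArcInsertTime  datetime2(7) NOT NULL DEFAULT SYSDATETIME(),  -- load time, not unique per row
  ArcTs          rowversion,                                   -- unique per row
  IdaInsertTime  datetime2(7) NOT NULL,                        -- capture time carried over from IDA
  SourceSystem   varchar(20)  NOT NULL,                        -- source legacy system indicator
  CustomerId     int           NOT NULL,                       -- business key from the source
  CustomerName   nvarchar(100) NULL
);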

 

Data staging area - DSA

From Archive to Data staging area by extraction to prepare data cleansing, data cleaning and data scrubbing (e.g. trim string, unicode string, max length of string, null gets a value as 0 or empty string '', replace illegal characters, replace value, verification of data type, data syntax, date format (e.g. mm/dd/yy converted to UTC yyyy-mm-dd), correcting misspelling, fixing impossible values, punctuation and spelling differences), to achieve a common format, notation and representation, validate data, correct data and do data deduplication by selecting one of the duplicate rows and keeping the others in a variety table. The data cleansing and data integration process with multiple source legacy systems makes data cleaned and conformed, e.g. a gender code from one source as »Man, Women« and from another as »M, F« will be conformed to »Male, Female« through a mapping of source data. Data enrichment follows business rules, and dimension data and fact data are identified with derived values e.g. total cost and revenue, elapsed time and overdue time. This is the transformation of data. Empty this area in the beginning of the ETL process because data is only transient in this layer. When an Archive is stored on another server it is common to load data from the archive into a data staging area, e.g. data for a dimension of type 2 or 7, and from the stage table do the merge or insert/update to the dimension table in the data mart area. I like to design a stage table to suit the destination or target table structure in a data mart rather than the source, e.g. translate names of columns and data types, merge or divide columns like a Name becoming two columns Firstname and Surname. I like to use computed columns in a stage table for calculation, string manipulation and a hashbyte value in a comparison column to compare data with a dimension table in the data mart area.
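
As a sketch, a stage table shaped like the target dimension with a computed HASHBYTES comparison column; the concatenated columns, the separator and SHA2_256 are my own assumptions:

CREATE TABLE Dsa.Stage_Customer
(
  Customer_bkey  int           NOT NULL,   -- business key from the source
  Firstname      nvarchar(50)  NULL,
  Surname        nvarchar(50)  NULL,
  Gender         nvarchar(10)  NULL,       -- conformed to »Male, Female«
  -- computed comparison column, recomputed on the dimension side to detect changed rows
  CompareHash AS CAST(HASHBYTES('SHA2_256',
                 CONCAT(Firstname, N';', Surname, N';', Gender)) AS binary(32))
);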

DSA performs data quality checks and filters wrong and invalid data into a quality assurance and quality control (QA/QC) database. DSA continues reconciling data from the archive for referential integrity, i.e. a foreign key value exists as a primary key value, and relationship cardinality rules for dependencies as mandatory, optional or contingent to make sure whether a foreign key is there or not, can be null or needs a value. Many other value checks and summations and so on make a true validation of data. When all tables in this area are loaded with correct data, the DSA is successfully completed with ValidFrom and ValidTo based on the date of today, or the source legacy system's new date or changed date, or the IdaInsertTime that represents a batch of insertion, and all data is marked with that batch time no matter which source legacy system the data is coming from. All ValidFrom values will use the same batch time, which is useful for joins between tables and between source legacy systems to fetch the right data snapshot at a particular time back in history.
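
A sketch of one such referential integrity check, moving rows with an unknown foreign key value into a hypothetical QA/QC table; all table and column names are assumptions:

INSERT INTO Qaqc.Stage_OrderLine_Rejected (OrderId, ProductId, RejectReason)
SELECT s.OrderId, s.ProductId, 'ProductId not found in Stage_Product'
FROM Dsa.Stage_OrderLine AS s
LEFT JOIN Dsa.Stage_Product AS p ON p.ProductId = s.ProductId
WHERE p.ProductId IS NULL;   -- foreign key value does not exist as a primary key value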

Furthermore, derived columns and calculations are made ready for loading to the next layer of the data warehouse. In case of an error in the ETL process or in important data, a halt condition is raised to stop the ETL process. Other times data will pass by with a flag in an audit dimension. Kimball has called it data wrangling to lasso the data and get it under control (data munging). DSA can also be called Work area or just Staging or STG. (In Danish this layer is called data forberedelsesområde eller data rangeringsområde eller tilrettelagt lag eller behandlet lag, data vask, data udsøgning, data berigelse, data behandling, data beregning, data sammenlægning, sammenstilling, samkøring, data er blevet vasket og strøget.) Kimball calls this layer the ETL system.

 

Data mart - DMA

From Data staging area to Data mart with dimensional modeling, conformed and shared dimension tables, a star schema around each fact table, and an assigned surrogate key (artificial key, identity column, a unique sequence number) for each dimension, used in the fact as a foreign key. When multiple fact tables share dimensions it is called a constellation schema or multi-star schema. Keep historical data in dimensions and facts with ValidFrom and ValidTo columns. Make it easy for users to search for current data through a view and to give a historical date to a table-valued function to fetch dimension data and fact data at any point of time. A data mart can have multiple fact tables with different granularities. One or multiple fact tables can create an extra derived fact table with special calculations and search filter criteria to enrich data to match the business requirements specification. Never empty this area and always backup before the ETL process. The data mart is a front-room for publishing the organization's data assets to effectively support improved business decision making. A data mart contains one data area for one purpose and is subject-oriented, therefore a data warehouse will consist of many data marts and they will have some common dimensions. To avoid maintaining the same dimension in multiple data marts it may be considered to have a dimension data mart that shares its tables through views inside the other data marts. Kimball calls this layer the dimensional presentation area and he is using the term first-level for a data mart that contains data from only one source legacy system and second-level or consolidated for a data mart with multiple sources to cross business processes.
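
A sketch of the two access paths mentioned above, assuming a type 2 Dim_Customer with half-open ValidFrom/ValidTo periods and an open end date of 9999-12-31:

CREATE VIEW Dma.Dim_Customer_Current AS
SELECT Customer_key, Customer_bkey, CustomerName
FROM Dma.Dim_Customer
WHERE ValidTo = '9999-12-31';   -- only the current version of each member
GO
CREATE FUNCTION Dma.Dim_Customer_AsOf (@HistoricalDate datetime2(7))
RETURNS TABLE
AS RETURN
(
  SELECT Customer_key, Customer_bkey, CustomerName
  FROM Dma.Dim_Customer
  WHERE @HistoricalDate >= ValidFrom AND @HistoricalDate < ValidTo  -- version valid at the given point of time
);
GO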

 

Presentation interface - PIF

From Data mart to Presentation interface through a data access tool or a data visualization tool. Tableau, QlikView/Qlik Sense and Power BI can import data from a dimensional schema and handle an in-memory calculated KPI as a quick measure using filters. An OLAP cube loads data from a data mart, the processing does the calculation of the KPI and a pivot is displayed in Excel, Targit or Power BI. A fixed report with criteria parameters searches, selects and calculates data from a data mart based on an ad hoc query of a user, or the report is sent out as a pdf file or excel file to business users every day, week, month and so on. Data mining and machine learning will also use data from a data mart. Percentages (fractions) and ratios based on a dimension slice are calculated in the tool, and a fact table contains a numerator and a denominator, e.g.:

Average price = Sum(Unit price x Quantity) / Sum(Quantity).

Active customers =

  VAR measurevalue =

  CALCULATE(

    DISTINCTCOUNT('Fact_Sales'[Customer_key]);

      FILTER ('Fact_Sales';'Fact_Sales'[EndDate] = BLANK() &&

                  'Fact_Sales'[Amount] > 0))

  RETURN IF (ISBLANK(measurevalue); 0; measurevalue)

Kimball calls this layer Business Intelligence Application.

 

Supporting databases (not part of data layer and is introduced by Joakim Dalby)

  • Audit with an audit trail for reconciliation checks against the source legacy systems with the number of rows in tables and other summarized measures to monitor and control, to avoid discrepancy between the source legacy systems and the data warehouse and to make sure that all data has been fetched and saved in the data warehouse. Validation checks of data quality and a log of the number of rows loaded successfully or rejected with a missing value, an out of bounds amount or other kinds of invalid data, depending on the validation of data in the data lineage. Do alert the users.
  • System with data for the ETL process e.g. connection strings, and capture and record of runtime metadata like process execution time, status of execution and error log of job, package, transform and exception handling. Also storage of datetime values e.g. for IdaInsertTime, and values e.g. the »latest value« for incremental load with delta data detection.
  • Test with test cases and other test data for reconciliation while programming the ETL process and for making KPI values to validate the program.
  • Usage with custom data for mapping and consolidation of data from multiple source legacy systems and rule data for dimension hierarchy, grouping, sort order and new dimension values. Usage tells how source data is transformed and changed into useful information by the ETL process to meet the requirements of the business and displayed in a data mart. Usage is maintained and updated by the business users through an application. Some data in the Usage supporting database will be loaded from the Archive, because the data comes from a source legacy system, to be enriched by the business users. The Input data area will fetch data from the Usage supporting database and pass it on to the Archive to be used in transformations and for updating dimensions with enriched data that doesn't exist in the source legacy systems. Usage data will also be used for making fact data, i.e. a calculation with a constant value and a rule as if-then-else, lookup, get age from a social security number etc. Usage is data-driven to avoid data values inside the ETL package, the sql statements and the program. Master Data Management (MDM) is an extension of a Usage database (in Danish brugerdefinerede stamdata).
  • Wrong with data that has some errors and does not follow the quality assurance and quality control. Quality assurance, consistency and integrity are important parts of the ETL process and the audit trail.

 

When custom data in the Usage database is going to be changed it can be done in a UAT user acceptance testing environment. After the ETL process has been executed we can test the data warehouse. Later we can take a backup of the Usage database and restore it in the production environment.

 

1.2. Data capture or data ingestion

Data capture (in Danish: modtagelse, hjemtagelse) from source legacy systems is based on two data flow directions:

 

  • Push where the Source finds the data and delivers it to an area like a database or a file at ftp, and later the Input data area will fetch the data from that area and indicate that the data is downloaded, received and stored, so the Source knows that it can make a new delivery, using a semaphore to control access to a common resource by multiple processes.
  • Pull where the Input data area will fetch data from the Source through a sql Select From Where statement, maybe with joins of tables or using a view or calling a stored procedure with a parameter as a criteria value to limit source data and for doing delta data detection or incremental load. From an ETL tool or a web service.

 

Granularity of the data capture integration strategy is something to consider when the amount of data is huge in the source legacy systems. A data warehouse prefers to receive data at the lowest granularity level of detail in case of specific analysis usage or data validation, but sometimes it is necessary to aggregate and summarize source data to a higher granularity like per day and per customer segment for transaction data or transactional data from an OLTP system e.g. orders, invoices, billings, payments, site hits. Non-transactional data e.g. Customer, Location, Contact, Supplier, Part, Product can be stored in a Master Data Management (MDM) database as master tables that are shared with multiple operational applications in the organization or in the company. MDM is an extension of the Usage supporting database.

 

Data capture from source legacy systems is based on multiple data delivery forms:

 

  • Data rowset, recordset, multiple result sets from a relational database.
  • csv file with a separator like comma or semicolon, maybe with field header.
  • tsv file e.g. from Excel save as unicode text with tabular separator, txt file.
  • XML, JSON or another data structure in a file.

 

Data capture from source legacy systems is based on two data delivery methods:

 

  • Full dump or full load where Input data area gets all data from Source.
  • Incremental load or delta data detection where the Input data area only gets new data or changed data from the Source, called Change data capture. It needs an initial load the first time it runs. It can be the Source that keeps track of the delivered date, or it can be the Input data area that keeps track of the last received Sequence Id (integer), Sequence Time (datetime) or InsertUpdateTimestamp value from the Source for each data row, so Input can detect new data or changed data by asking for data since that value (InsertUpdateTimestamp > latest value), as sketched below. It could also be posting data e.g. an entry date (or created date) and the next load is for EntryDate > »the last received EntryDate«. (In Danish Posteringsdato.)
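
A minimal pull sketch of such a delta load, assuming an InsertUpdateTimestamp column in the source table and the latest received value stored in a hypothetical table in the System supporting database:

DECLARE @LatestValue datetime2(7);
SELECT @LatestValue = LatestInsertUpdateTimestamp       -- value saved after the previous load
FROM SystemDb.dbo.IncrementalLoad
WHERE Sourcename = 'Northwind' AND Tablename = 'Orders';

SELECT OrderId, CustomerId, OrderDate, InsertUpdateTimestamp
FROM Northwind.dbo.Orders
WHERE InsertUpdateTimestamp > @LatestValue;             -- only new or changed rows since last load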

 

It is important to log the number of rows/records from the source legacy systems to the Input data area for a later reconciliation check and count auditing (in Danish kontroloptælling, afstemning) in case a source legacy system does not deliver the expected number of rows, and to monitor the number of rows over time to see if it increases as expected and to use it in a graph over the amount of data in the data warehouse. Other data values e.g. amount, quantity and volume can also be logged and monitored. I recommend that the source legacy system tells the number of rows per table, which is compared to the saved number of rows in the Input data area in an audit trail. It can be implemented as a view in the source legacy system like this:

CREATE VIEW BI_Audit_NumberOfRows AS

SELECT 'Northwind' AS Sourcename, 'Customers' AS Tablename, COUNT(*) AS NumberOfRows

FROM dbo.[Customers]

UNION ALL

SELECT 'Northwind' AS Sourcename, 'Orders' AS Tablename, COUNT(*) AS NumberOfRows

FROM dbo.[Orders]

UNION ALL

SELECT 'Northwind' AS Sourcename, 'Order Details' AS Tablename, COUNT(*) AS NumberOfRows

FROM dbo.[Order Details]

 

The data warehouse can also do its own delta data detection between the Input data area and the Archive to identify new data, changed data and deleted data. Maintaining historical data with ValidFrom and ValidTo datetime stamps is handled by the following actions (a sketch in sql follows the list):

 

  • which rows are new: give them a CreatedDateTime or a flag N for new.
  • which rows have been changed: give them a ModifiedDateTime or a LastUpdatedDate or a flag C for changed or U for updated.
  • which rows have been deleted: give them a DeletedDateTime or a flag D for deleted.
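
A sketch of this delta data detection, assuming CustomerId is the business key, CustomerName is the only compared column, and Arc.Customers_Current is a hypothetical view showing the newest archive row per business key:

-- new rows: business key in IDA but not in the archive
SELECT i.CustomerId, i.CustomerName, 'N' AS DeltaFlag
FROM Ida.Customers AS i
LEFT JOIN Arc.Customers_Current AS a ON a.CustomerId = i.CustomerId
WHERE a.CustomerId IS NULL
UNION ALL
-- changed rows: business key exists but a column value differs
SELECT i.CustomerId, i.CustomerName, 'C'
FROM Ida.Customers AS i
JOIN Arc.Customers_Current AS a ON a.CustomerId = i.CustomerId
WHERE ISNULL(i.CustomerName, N'') <> ISNULL(a.CustomerName, N'')
UNION ALL
-- deleted rows: business key in the archive but missing in the full load to IDA
SELECT a.CustomerId, a.CustomerName, 'D'
FROM Arc.Customers_Current AS a
LEFT JOIN Ida.Customers AS i ON i.CustomerId = a.CustomerId
WHERE i.CustomerId IS NULL;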

 

Sometimes the source legacy system provides a flag information like N for new data, C for changed data and D for deleted data.

I have seen the pairs: New – Changed, Created – Modified or Insert – Update.

 

Sometimes the source legacy system updates a data row multiple times but only the most recent version goes to the data warehouse. If the source legacy system inserts and updates a row before a new load to the data warehouse, only the updated version goes to the data warehouse. If the source legacy system inserts a new row and deletes the same row before a new load to the data warehouse, the data will never go to the data warehouse. It is seldom that a source legacy system has a log and is logging all changes with a revision number or ValidFrom and ValidTo datetime stamps, therefore a data warehouse does not contain 100% full historical updates of data in an organization. Sometimes a source legacy system has for each (or selected) table in the database an extra historical log audit table (shadow table) that contains all inserted, updated and deleted data rows from the original table together with an Action column (values for Insert, Update, Delete), an EntryDate datetime stamp column that marks when the action occurred and a SequenceId that is used for delta data detection to pull data from the Source log to the Input data area. An example of a source historical log audit table with a row every time a customer changes:

 

SequenceId_log  EntryDate_log                Action_log  CustomerId  CustomerName
626             2000-05-08 10:12:34.4906240  Inserted    421         Marie Beaulieur
7503            2007-11-16 14:37:07.2353231  Updated     421         Marie Lesauvage
19473           2014-07-15 11:09:33.7629562  Updated     421         Marie Sainte
25003           2017-02-07 23:57:21.9876543  Deleted     421         Marie Sainte

 

Implemented as after-insert/update/delete triggers in the source legacy system.
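
A sketch of such an after-update trigger on a hypothetical source table dbo.Customers writing to the shadow table dbo.Customers_log shown above; the insert and delete triggers follow the same pattern:

CREATE TRIGGER dbo.trg_Customers_Update ON dbo.Customers
AFTER UPDATE AS
BEGIN
  SET NOCOUNT ON;
  -- SequenceId_log is assumed to be an identity column in the log table
  INSERT INTO dbo.Customers_log (EntryDate_log, Action_log, CustomerId, CustomerName)
  SELECT SYSDATETIME(), 'Updated', i.CustomerId, i.CustomerName
  FROM inserted AS i;
END;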

 

Data latency describes how quickly source data has to be ready in the data warehouse for the business users to do their reporting. The deadline is normally in the morning based on a periodic batch of source data from yesterday. For a real-time load to a data warehouse with a constantly refreshed presentation in a dashboard, the ETL process must be streaming oriented, where source data continuously flows into the data warehouse to do transformation and make conformed data.

 

1.3. Enterprise Data Warehouse modeling architecture

In a data warehouse architecture and environment the data from the source legacy systems can go into a database called an Enterprise Data Warehouse (EDW). When an EDW is modeled after Bill Inmon it is based on Peter Chen's Entity Relationship data modeling with super-sub entities, associative entities and roughly 80% normalized data. The EDW offers integrated, granular, historical and stable data that has not yet been modified for a concrete usage and can therefore be seen as neutral. It can keep historical data meaning all the changes to the data or only the day's-end status e.g. End Of The Day for each data revision from the Archive or Input data area. An entity Person can have a one-to-many relationship to the addresses of the person's living places in another PersonAddress entity with ValidFrom and ValidTo columns, and these entities or tables will be merged into one type 2 dimension table according to Kimball in a data mart. Let the design of the data model for the EDW take a »helicopter perspective« when identifying entities, e.g. a salesperson or an engineer will be stored in an Employee entity with a one-to-many relationship to a Jobfunction entity, or with more abstraction in a Person entity where customers and users can be stored too. In the EDW keep the source data with timespan columns like Start date and Stop date, and sometimes also convert the source data to row wise entities as a kind of transactional data that is a sequence of information exchange like financial, logistical or work-related data, involving everything from a purchase order that becomes an invoice with multiple invoice line items, to shipping status, to employee hours worked, plan and activity records, subscription periods and insurance costs and claims. The EDW will be the source to one or multiple data marts using Dimensional modeling with denormalized data controlled by the ETL process. Or maybe the data mart is using a different modeling that fits the use of data better.

 

An EDW can use Anchor modeling where an anchor is a surrogate key generator for the primary key (or operational key) of the source legacy system, called a business key, and the surrogate key can become a Kimball durable key; an attribute has the descriptive values per source legacy system, meaning in case of five sources there will be five attribute tables placed in five schemas because each source has a schema. An attribute will be a Kimball type 2 with a foreign key reference back to the anchor. Anchor modeling has only extension with new tables and no modification of existing tables. A knot is a lookup table and a tie is a relationship table. (»Tie the knot« meaning get married.) There is also a Party modeling. Next I will give a brief summary of another model for an EDW.

 

When an EDW is modeled after Dan Linstedt it is based on Data vault modeling that holds historical and time-variant raw data in entities/tables with the labels:

Hub (blue) separates the business keys from the rest of the model and can translate a business key and/or natural key to a hashbyte e.g. ProductHashKey with business key ProductNumber. An integration point of business keys.

Link (green) integrates and stores relationships between business keys and/or hubs, and a link does not carry any data. A link has foreign keys to 2 or more hubs e.g. LinkInvoice has CustomerHashKey and ProductHashKey and its own primary key InvoiceHashKey. A link handles one-to-one, one-to-many and many-to-many relationships because for data vault there is only many-to-many.

Satellites (yellow) store the context, descriptive data and measure values in columns of either the hubs or the links e.g. a SatInvoice with primary key/foreign key InvoiceHashKey and data in InvoiceDate, or SatProduct with ProductHashKey and data in ProductName. Satellites have all the relevant data for the data warehouse.

A data warehouse is using a surrogate key instead of a business key (or natural key, see more in section 4.2) to remove dependence on the source legacy system. Data vault 2.0 recommends hashing a business key value instead of using an identity column as a unique sequence number because:

·  HASHBYTES('MD5',CONCAT(SSN,';',UPPER(TRIM(Area)))) to binary(16)/char(32)

·  Hashing two different values could create the same hashkey value, which is called a collision; it is very unlikely, otherwise use SHA2_256.

·  It eases copying data from one environment to another like dev to test to prod without worrying about sequences being out of sync.

·  When two source legacy systems share the same business key (natural key) but in one source it is an integer data type and in the other source it is a string, by hashing they become the same data type in the hub table and sat table.

·  When a source legacy system is replaced, the new source comes with different data types e.g. from an integer business key to a string business key; it will still be hashed to the same data type in the data vault, therefore no maintenance.

·  A composite business key of multiple columns become one hashkey column which is easy to use for join and merge.

·  It is easy to create a hashkey of a business key value when needed by an ETL tool or in a sql select statement instead of doing a lookup to find the sequence number stored in a table. But of course you still need to make sure that the same business key is not inserted twice in the same hub table, as the sketch after this list shows.

·  Loading to tables can be done in parallel e.g. at the same time to a hub table and to a satellite table from the same source legacy system table.

·  For the EU GDPR General Data Protection Regulation (Persondataforordning) a business key like a social security number needs to be anonymous after some years by adding a calculated cross sum of the ssn and hashing it. It will be a nice scramble or encryption by a scrambling algorithm. To be sure of total anonymity only save Birthday and Gender of a person; no name, no address, no zipcode and no city. A Master Data Management (MDM) database can contain the social security number, name and address together with the CustomerId that is used in emails to customers. Therefore we have a surrogate key CustomerInternalId that is used in all other operational systems and the data warehouse together with Birthday and Gender, which is not personally identifiable or enforceable. A web service will provide all systems with the personal data from the MDM database. When a customer is going to be deleted because it is his/her wish or because of the data limitation period, we only need to delete data in the MDM database and we can keep CustomerInternalId, Birthday and Gender in the other systems to make sure statistics remain unchanged back in time. When a system calls the web service it will give back unknown for ssn, name and address when a customer no longer exists in the MDM database.
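
A sketch of a hub load that hashes the business key and guards against inserting the same business key twice; the hub and stage table names are assumptions, and the business key is assumed to be a string as in the HASHBYTES example above:

INSERT INTO Dv.HubProduct (ProductHashKey, ProductNumber, LoadDate, RecordSource)
SELECT DISTINCT
       HASHBYTES('MD5', UPPER(TRIM(s.ProductNumber))),   -- business key hashed to binary(16)
       s.ProductNumber, SYSDATETIME(), 'Northwind'
FROM Dsa.Stage_Product AS s
WHERE NOT EXISTS (SELECT 1 FROM Dv.HubProduct AS h
                  WHERE h.ProductNumber = s.ProductNumber);   -- do not insert the business key twice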

 

Be aware it can be a performance issue to have columns of data type binary(16) or char(32) for storing millions of rows and for joining many tables together, compared to using an integer data type. For business rules the real business key values are used in the definitions. This is purely for readability and maintainability.

Querying a data vault requires many more joins than Inmon's model. A data vault is an archive and therefore the business rules and transformations are first enforced when data is going to the data marts.

Data Vault 2.0 also has a Same-as link (sal, which maps different business keys from multiple source systems where each source has its own satellite), Point-in-time (pit, which is calculated by ETL from hub/link to improve the performance of queries) and several other types, e.g. a Date table is called a nonhistory reference table.

RDV is a raw data vault like an archive and BDV is a business data vault.

 

EDM = Enterprise Data Model is a term from IBM back in 1998 for a data modeling technique for data warehousing, where ETL was called Capture, Transform, Apply.

 

Ralph Kimball does not like to store data in an EDW, he only stores data in data marts that use Dimensional modeling and Star schema, therefore the EDW becomes a union of all data marts.

 

 


 

1.4. Schema

Dimensional modeling ends up in a star schema or constellation schema (multi-star schema) with fact tables (analysis variables, measures, events) surrounded by dimension tables (context), where dimensions explain the facts. Dimension and fact conformance is a must in a successful data warehouse implementation to meet the requirements of legislation, accepted practices, prescribed rules and regulations, specified standards and terms of a contract.

 

Earlier I used the general word data mart for a database with tables of dimensions and facts, but there is a narrower term Multidimensional database (MDB) that is the source database for an OnLine Analytical Processing (OLAP) application like a cube, also known as a Multidimensional online analytical processing (MOLAP) application. To me, it is a principle for a cube that all dimensions and all measures can be combined freely, else divide into multiple cubes.

 

1.5. Reconciliation

After data capture, please remember to implement a Reconciliation Summary Report with the results from your different reconciliations e.g. Uncategorized assets, Invalid country codes, Derivatives transactions with missing currency code.

An audit trail (in Danish kontrolspor) becomes important for the credibility of a data warehouse. An example: a source legacy system has the value 0.003589 and exports it to a txt file where the value becomes 3.589E-3, and by a mistake in the ETL process the data warehouse saved and displayed the value as 3.589. A contract number 700002848572 becomes 7.00003E+11 and the last part of the value gets lost. When reconciliation is built into the data model and the ETL process, this mistake would be reported and the programmer can fix the import and update his data profiling documentation.

 

1.6. Data quality

Some data quality skills and data cleansing examples:

  • Allowed values like dates, weekdays and week numbers are constant data.
  • Accuracy like the age of a person is trusted and not a value like 134.
  • Validity like a date is within an acceptable range and not ten years wrong, else data becomes invalid and should be rejected or corrected.
  • Correction like changing a wrong year of a date or casting to the right data type. Incorrect data is poison for a data warehouse.
  • Duplication like the same customer having addresses spelled in different ways.
  • Unambiguous values having one meaning, to avoid ambiguous values at the same level in a dimension or the need to combine two values into one unique value.
  • Consistency like a zipcode is right and has a cityname, or a data mapping like DK for Denmark, or Cph, Kbh, København as Copenhagen to be consistent for the capital of Denmark. Inconsistent values give problems in any database and data warehouse.
  • Completeness to avoid missing data values like a phonenumber or a date. An incomplete value (null) must be marked as missing and be taken care of.
  • Improve values that take care of case-sensitive words through translate, mapping or bridging rules to create one and only one truth.
  • Conformity to make data from different source legacy systems conformed through data conformation that is a part of the ETL process.
  • For fact tables one truth depends on the use, because a fact for Finance and a fact for Marketing may look like the same but their business requirements specifications have small differences.

 

1.7. Big data

Some properties of big data:

  • Volume refers to the amount of data e.g. from GB over TB to PB as data at scale and transactions.
  • Velocity refers to the speed of change and data processing for the timeliness of data, data in motion from periodic batch loads, frequent, nightly or daily loads, intra-day e.g. three times a day, nearly real time with little delay, or real time instantaneously with streaming data with no delay as interactive.
  • Variety refers to data in many forms and the number of types of data from relational databases, image, photo, audio, video, multimedia, sensor, web logs from social network data, smartphone and unstructured text.
  • Veracity refers to data uncertainty due to data inconsistency and incompleteness by managing the reliability, trustworthiness, authenticity, origin, reputation, availability, accountability, quality and predictability of inherently imprecise data types.
  • Variability refers to an increase in the range of values typical of a large data set, and non-standard data.
  • Value refers to data of many values which addresses the need for valuation of enterprise data e.g. statistical, events, correlations, hypothetical, leverageable.
  • Visibility refers to data in the open and being able to handle the European General Data Protection Regulation (GDPR) and issues of privacy, security and provenance.
  • Visualization refers to the flexible viewing of data.
  • Viscosity, Validity, Venue, Vocabulary and Vagueness are other V-words.

 

A data lake can be used for massive quantities of unstructured data and big data with tools that can easily interface with them for analysis for business insights. A datum in a lake has tags to give it a characteristic, and by the tags we can fetch data from the lake without knowing the physical location like a server url with a folder path. A data lake can contain files on multiple servers on premise in different folders and in the cloud (many nodes), and we only use a tag to find and fetch data. For example, if I like to find photos of smiling employees in all albums, I can search for a tag FacialExpression = smiling. A data lake is using ELT (extract, load, and then transform). A tool for a data lake can be like Apache Hadoop or Microsoft Azure. A Data Discovery Area is an end-user sandbox. U-SQL can be used to dive into the data lake and fetch the wanted data.

A relational database is characterized by ACID (Atomicity, Consistency, Isolation, Durability). This means that a transaction is either carried out completely or not at all (Atomic), that only valid data is added to the database (Consistent), that transactions never affect each other (Isolated), and that transactions are never lost (Durable).

 

A NoSQL = Not Only SQL database means it can use an SQL type query language, but usually does not do so. NoSQL databases are often designed to run on clusters, are made as open source and do not operate with a fixed schema structure but allow the addition of data without a pre-defined structure. A NoSQL database is characterized by BASE (Basic Availability, Soft state, Eventually consistent). An ACID system guarantees data consistency after each transaction; a BASE system guarantees data consistency within a reasonable period of time after each transaction. In other words, there is data consistency in the system, just not immediately. This leads on to the Soft state principle. If the data is not consistent at all times, the system must take a temporary data state into account. The sum of both these principles means that data accessibility is given very high priority, even if coincident errors occur in the database system, operating system or hardware. If parts of the database do not work, other parts of the database take over, so that data can always be accessed.

 

2. Grain (granularity) of data warehouse

The grain is the level of detail of data in a table, both in dimension tables through the hierarchy and in fact tables through the definition of the measurement event of interest, and the grain can later be expressed in terms of the dimensions. The lowest grain keeps all relevant data from the source legacy systems and is therefore the most flexible approach, but it also takes the most storage space and can easily cost query performance. To improve query performance the grain can be lifted to a higher level where data is aggregated and summarized and therefore takes less storage space. Data can also be divided such that current year data is in daily grain, the previous year in weekly and older data in monthly, because very detailed information is normally not relevant for analysis years back in time. The granularity of a fact table can be divided into three types:

 

  • Transaction grain as the finest level of detail with one row per transaction together with a transaction datetime like an entry date, post date, sales date or simply a created date, inserted time or registered date of the row. It is a base-level fact.
  • Periodic grain with one row per time period or time span like a week or a month for a monthly grain.
  • Accumulating/Aggregated grain as a more summarized level like one row for the entire lifetime of an event or with a clear beginning and end.

 

Going from a yearly grain to monthly to weekly to daily, we say that each level, e.g. daily, increases the granularity and the number of rows in the fact table.

Aggregation is the process of calculating summary data from detail base level table rows (records) and is a powerful tool for increasing query processing speed in data marts. For example, a sale is a fact with analysis variables and measures like quantity sold and amount, and dimensions like product, customer and date of purchase that bring a sale into a context. The grain of the sale is limited to a date (like December 23) and a time (like 4:30 pm) and therefore the fact is on Transaction grain. In case we drop the time, the measures would be called TotalQuantitySold and TotalAmount because they are the result of a summation of the sales times up to the sale date and therefore the fact is on Periodic grain; the same if we decide to summarize the sales date to a weekly or monthly level. In case we decide to grain the customers by aggregating them into segments and not keeping names and addresses, then the fact becomes Accumulating/Aggregated grain. The product has a three level hierarchy of category name, brand name and product name, and therefore we can say that product has the highest level of detail.

Instead of summing over the sales time or sale date and losing the important information about the date of purchase, we can summarize data into two other fact tables, one at a weekly level for data from the last year and one at a monthly level for older data, because when we go back in time we do not need analyses on a daily or weekly level, and by aggregation we save hard disk space and improve the performance of the query because fewer rows at month level need to be summarized to fetch the year level data, as sketched below.
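
A sketch of such an aggregation to a monthly grain, assuming a base-level Fact_Sales and a Date_key of the form yyyymmdd; table and column names are assumptions:

INSERT INTO Dma.Fact_Sales_Monthly (Month_key, Product_key, Customer_key, TotalQuantitySold, TotalAmount)
SELECT Date_key / 100 AS Month_key,      -- yyyymmdd reduced to yyyymm
       Product_key,
       Customer_key,
       SUM(Quantity) AS TotalQuantitySold,
       SUM(Amount)   AS TotalAmount
FROM Dma.Fact_Sales
GROUP BY Date_key / 100, Product_key, Customer_key;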

The lowest level of aggregation or the highest level of detail is referred to as the grain of the fact table.

 

3. Fact

A fact table can contain fact data at a detailed or aggregated level depending on the grain approach. A fact table has four types of columns:

 

  • Dimension that is a foreign key to a dimension table primary key.
  • Measure or analysis variable that contains numeric facts as number values as fact data. All measures in a fact table should have the same grain like daily or monthly level.
  • Tag contains a text or a number that cannot be summarized, e.g. a voucher number, a receipt number or a transaction identification string value that makes each fact row unique. A tag column in a fact table is a candidate for a degenerate dimension or later a real dimension with extra descriptive data.
  • Technical for a surrogate primary key identity column, a unique sequence number, instead of a composite key of a subset of the columns in the fact table. An EntryDateTime or a TransactionDateTime (in Danish Transaktionstidspunkt) kept because it has been divided into two dimensions for Date and Time, and the DateTime column is used for partitioning of the table to a partitioned fact table. Dates for ValidFrom and ValidTo to represent a period of time where the measures were valid like a BalanceAmount. Flags as IsDeleted and DeletedTime (ModifiedTime). ArcRecordId for traceability back to the archive.

 

The ETL process has to secure the natural key, so fact rows are distinct. Fact tables in a dimensional model express the many-to-many relationships between dimensions and are implemented as one-to-many relationships between dimension tables and fact tables.

 

3.1. Types of facts

Let us characterize the various facts into different types of facts, or more exactly different types of the columns in a fact table. Conforming facts means making agreements on common business metrics such as key performance indicators (KPI) across separate source legacy systems so that these numbers can be compared mathematically for calculating differences and ratios.

 

Additive measure - summable across any dimension

A fact table has numerical measures that can be summed up across all of the dimensions in the fact table, so the measure column's data type is a number. A sales fact is a good example of an additive fact with measures like Quantity sold and Amount. In case a transaction dataset for a fact table has a measure column whose value is empty, null or nullable, use the default value 0 because this won't bother aggregation like summation. Each measure must have its metric. If it is a monetary measure, it may have a currency column, and if it is a unit measure it may have a column to explain the kind of units used like centimeters, litres, cubic metres etc. A fact can have a calculated measure or a derived measure based on existing measures and constants e.g. Profit or Surplus = Revenue – Costs. Sometimes called a Fully-Additive fact.

 

Semi-Additive measure - summable across some dimensions

A fact table has measures that can be summed up for some of the dimensions in the fact table and not for other dimensions. For example, a daily balance measure can be summed up through the customers dimension but not through the time dimension. Inventory levels cannot be summed across time periods.
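
A sketch of how a semi-additive daily balance is typically handled: sum across customers but take only the period-end snapshot in the time dimension, here the last day of each month; the table and column names, including the IsLastDayOfMonth flag, are assumptions:

SELECT d.CalendarYear, d.CalendarMonth, SUM(f.BalanceAmount) AS MonthEndBalance
FROM Dma.Fact_AccountBalance AS f
JOIN Dma.Dim_Date AS d ON d.Date_key = f.Date_key
WHERE d.IsLastDayOfMonth = 1      -- pick the period-end snapshot instead of summing over time
GROUP BY d.CalendarYear, d.CalendarMonth;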

 

Non-Additive measure - not summable for any dimension

A fact table has measures that cannot be summed up for any of the dimensions in the fact table. For example, a room temperature fact is non-additive and summing the temperature across different times of the day produces a totally non-meaningful number. However, if we do an average of several temperatures during the day, we can produce the average temperature for the day, which is a meaningful number. Other examples are percentage and ratio measures. A fact table may only contain columns with transaction numbers such as an order number, invoice number or a voucher number that can't be summed up. An order fact with measures like Unit price and Discount percentage, where a Unit price for a single product makes no sense to summarize, but the derived column Amount = Unit price x Quantity is to be summarized and becomes an additive column, called a calculated measure. Trend, Stock and Ranking can't be added, and in general all calculations on one specific intersection of the dimensions. A Year-to-Date ytd measure can't be summed up. Count of rows is normally used.

 

Conformed fact

A conformed measure in multiple facts must use the same common business rule and definition so multiple facts can be united in a report or a cube. This is when several data marts use fact data with the same names of fact tables or names of columns for measures and they have compatible calculation methods and units of measure and support additivity across business processes. If a measure e.g. Revenue is used in multiple fact tables with different calculations and meanings, it is best to use different column names because these facts are not conformed.

 

Factless fact

A fact table that contains no measures is called a factless or measureless fact. For example, a fact table which has only columns for employee, date, time and type, where type is like »workstart«, »workstop« and »workstopofsickness« and there are no columns for measures. You can get the number of employees working over a period by a »select count(distinct EmployeeId)« or by a distinct row-count calculated measure in your OLAP cube. For a factless fact you will normally count the number of rows, row count or counting rows, and call it »Number of <a name>«. Sometimes a factless fact has a value column called Count with only one value, 1, used in a data access tool to sum over and get the number of rows. In case the fact grain is weekly and a week is missing, it can be inserted to have all weeks complete, and here Count gets the value 0. A factless fact is for registration of an event or assignment, i.e. attendance takes place in a school class with dimensions for student, class, room and professor. If we add a measure column for Attendance with 1 or 0 per date per student it is not a factless fact anymore. A factless fact can be used to represent a many-to-many relationship among dimensions.

 

Capture a relationship fact

To be a column in a dimension or to be its own dimension and used in a fact is a good question. Kimball's financial services example starts with an Account dimension including data on products and branches, but he chose to remove these descriptive columns to form independent dimensions of Product and Branch and use them in a fact together with the Account. Therefore the fact captures a relationship among accounts, products and branches. Another example is that an account can belong to two customers and a customer can have several accounts. This many-to-many relationship can be expressed in a factless fact or in a bridge table, see later. A bank transaction is done by one customer from an account and it is natural to have a Customer dimension in the fact.

Kimball says: »Demoting the correlations between dimensions into a fact table.«

 

Transaction fact

A fact table that describes an event or operation that occurred at a point in time in a source legacy system, e.g. an invoice line item, where the data row will never be changed in the source. The row has a date e.g. a transaction date, an entry date or a post date and sometimes a time to represent the point in time with other lowest-level data. Key values for dimensions are found at the transaction datetime.

Some data modelers do not like separate facts for each transaction type but build a single blended fact with a transaction type dimension, where a mix of other dimensions can result in N/A dimension values. I have seen a blended fact called FactEvent which I think is a poor and non-telling name of a fact table, and a date dimension gets a generalized name instead of multiple facts having names as order date, purchase date and receipt date.

 

Periodic snapshot fact

A fact table that describes the state of things at a particular instance of time, and usually includes more semi-additive and non-additive measures. It is a table with frozen data, meaning a row will never be changed/modified/deleted (it is unchanged) because the row can have been used in a report like a monthly or annual report and later it is a must to be able to create the same report with the exact same data. The periodic snapshot fact table will never be emptied or updated, therefore it is a true incremental load and the table has a column like a Month (yyyymm) for a monthly basis or a Year to fetch the wanted data as a month slice or a year slice. When a month or a year is over and data is ready, data will be loaded. Sometimes data for the current month is also loaded every day for a current-month-to-date, therefore the current month will be updated until it is over, finished and can be closed. Measures can be a balance in an account or inventory levels of products in stock and so on. Key values for dimensions are found at the end of the period. The fact table can be partitioned monthly for faster query performance when searching for a month.

 

Accumulating snapshot fact

A fact table that describes a process with milestones of multiple dates and values columns with different names which will be filled out gradually. Therefore over time the same fact table row will be revisited and updated multiple times where a default date key value is -1 for »hasn’t happened yet« or »be available later«. Each stage of the lifecycle has its own columns e.g. milestones of a hospitalization or steps of the manufacturing of a product. In a retail store a product has three movements as ordered, received and sold that would be three date dimension columns in an accumulating snapshot fact.

A fact table where a row is a summary of measurement events occurring at predictable steps between the beginning and the end of a process. For example, a source legacy system for current payments from customers where some pay several times over a month; the first payment becomes a new row in the fact table with the date of payment in columns BeginDate and EndDate and the amount in the Paid column. The next ETL process will summarize payments per customer from the BeginDate to the current date or end-of-month date, and then update the fact row with the same BeginDate with the new summarized payment and a new EndDate, so a fact row will be revisited and updated multiple times.
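
A minimal T-SQL sketch of such a revisit, under assumed, hypothetical table and column names (a FactPayment table and a staging table StagePayment holding the month's payments so far):

   -- Overwrite Paid with the new summarized amount and move EndDate forward
   UPDATE f
   SET    f.Paid    = s.PaidTotal,
          f.EndDate = s.LastPaymentDate
   FROM   FactPayment AS f
   JOIN  (SELECT CustomerId,
                 SUM(Amount)      AS PaidTotal,
                 MIN(PaymentDate) AS FirstPaymentDate,
                 MAX(PaymentDate) AS LastPaymentDate
          FROM   StagePayment
          GROUP BY CustomerId) AS s
     ON   s.CustomerId = f.CustomerId AND s.FirstPaymentDate = f.BeginDate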

 

Derived fact

Kimball recommends fact data at the lowest detail grain as possible to ensure maximum flexibility and extensibility, he calls it a base-level fact. A derived fact table is created for performing an advanced mathematical calculation and complex transformations on a fact table like for a specific KPI (Key Performance Indicator), or an aggregate fact with a summation of measures to a higher grain like from date level to month level and from product level to brand level, using shrunken dimensions for Month and Category as dimensions lifted to a higher grain from the base-level dimensions Date and Product. A derived fact can be based on multiple fact tables for making faster query performance and to provide a dataset to the Presentation interface.

 

Aggregate fact or Summarized fact

A derived fact table that is often referred to as a pre-calculated fact with computed summarized measures at a higher grain level of one or more dimensions to reduce storage space and query time greatly and eliminate incorrect queries. An aggregate fact is derived from a base-level fact and measures in an aggregate fact are a computed summary of measures in the base-level fact. Dimensions in an aggregate fact can be derived from base-level dimensions and are called shrunken dimensions because the values are rolled up to create fewer fact rows, e.g. the Date dimension becomes a Month dimension, the Product dimension becomes a Category dimension and an Address dimension becomes a Region dimension, and therefore the measures can be summarized to fewer fact rows for better query performance.

Year-to-Date ytd fact where month February is a summing up or roll up of January and February and so forth. Last-year-this-year fact with a calculation of an index compared to last year as a new column, easy to display in a report.

An aggregate fact table is a simple numeric roll up of atomic fact table data built solely to accelerate query performance. It is called incremental aggregation when an ETL process does a dynamic update of a table by applying only new or changed data without the need to empty the table and rebuild aggregates.
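
A minimal sketch of building such an aggregate fact in T-SQL, under assumed, hypothetical names where the base-level FactSales has a Date_key in format yyyymmdd and DimProduct carries a Category_key:

   INSERT INTO FactSalesMonthCategory (Month_key, Category_key, SalesAmount, SalesQuantity)
   SELECT   f.Date_key / 100     AS Month_key,    -- yyyymmdd rolled up to yyyymm
            p.Category_key,
            SUM(f.SalesAmount)   AS SalesAmount,
            SUM(f.SalesQuantity) AS SalesQuantity
   FROM     FactSales  AS f
   JOIN     DimProduct AS p ON p.Product_key = f.Product_key
   GROUP BY f.Date_key / 100, p.Category_key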

 

Smashed fact

A fact table contains several measures but only one or a few of them have a value in each fact row. Fact rows with the same dimension member repeated in multiple contiguous rows with identical values will be smashed or collapsed into one fact row using an operation as sum, min or max to limit the number of rows in the fact.

 

Consolidated fact

A fact table used to combine facts from multiple source legacy systems and business processes together into a single consolidated fact table if they can be expressed at the same grain. For example, sales actuals can be consolidated with sales forecasts in a single fact table to make the task of analyzing actuals versus forecasts simple and fast to compare how the year is going.

 

Timespan fact

A fact table used for a source legacy system that is regularly updatable, meaning that the source changes and overwrites its values. To capture the timespan when the fact row was effective, the fact table will act as a SCD type 2 dimension with BeginAt and EndAt columns to keep historical data and to represent the span of time when the fact row was the »current truth«. It is called Slowly Changing Facts. See more in section 6.4.

 

Counterpart fact (negating fact) and Transactional fact

A fact table used for Slowly Changing Facts because the source legacy system is changing fact values without keeping the old values as historical transactions. See more in section 6.4.

 

Column wise fact and Row wise fact

A column wise fact table is useful to be columns in a paper report e.g.

   Revenue Jan, Revenue Feb, Cost Jan, Cost Feb, Sales Jan, Sales Feb.

For a cube a row wise is much better because it gives good dimensions e.g.

   Period, Entry, Amount.

Therefore a data warehouse needs to convert from columns to rows or vice versa.
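
A minimal sketch of going from columns to rows with UNPIVOT in T-SQL, under an assumed, hypothetical table ReportColumnWise with the columns mentioned above:

   SELECT x.Entry, x.Period, u.Amount
   FROM   ReportColumnWise
   UNPIVOT (Amount FOR EntryPeriod IN
            ([Revenue Jan], [Revenue Feb], [Cost Jan], [Cost Feb],
             [Sales Jan], [Sales Feb])) AS u
   CROSS APPLY (SELECT LEFT(u.EntryPeriod, CHARINDEX(' ', u.EntryPeriod) - 1)         AS Entry,
                       SUBSTRING(u.EntryPeriod, CHARINDEX(' ', u.EntryPeriod) + 1, 3) AS Period) AS x

The opposite direction from rows to columns can be done with PIVOT.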

 

Exploded fact

A fact table contains a huge number of rows where a period, e.g. from a »date of employment« to a »termination date« as columns in one row, will be turned around to many rows with one row per day of the period, or per week or per month depending on the wanted grain of the period. When an employee has a 10 year anniversary it will make more than 3650 rows at day grain, and if the employee dimension keeps history of different names, departments and job functions etc. for each staff, then the fact rows will have different key references to the dimension. Another grain would be a 15 minute grain which gives 96 rows per day for a windturbine power production. When the exploded fact is a source for an olap cube it can sometimes be implemented as a view in the database, and when it is for ad hoc reporting and will be used several times per day it must be a materialized view stored in a table or sometimes as an indexed view.
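
A minimal sketch of such an explosion in T-SQL, under assumed, hypothetical names with an Employment table and a Date dimension that has one row per day:

   INSERT INTO FactEmploymentDay (Employee_key, Date_key)
   SELECT e.Employee_key, d.Date_key
   FROM   Employment AS e
   JOIN   DimDate    AS d
     ON   d.FullDate >= e.EmploymentDate
    AND   d.FullDate <= ISNULL(e.TerminationDate, CAST(GETDATE() AS date))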

 

3.2. Other fact classifications

Transaction has one row per transaction when they occur together with a datetime.

Periodic has one row for a group of transactions made over a period of time through summation like from daily grain to monthly grain, so the Date dimension is represented by the month level with the first day of each month. Notice that certain dimensions are not defined when compared to a transaction fact, such as the dimension TransactionType.

Accumulating/Aggregated has one row for the entire lifetime of an event and is therefore constantly updated over time. For example, an application for a bank loan until it is accepted or rejected, or a customer or working relationship. These fact tables are typically used for short-lived processes and not constant event-based processes, such as bank transactions.

 

Mostly a fact table describes what has happened over a period of time and therefore holds additive or cumulative facts. An example is a status column in a fact table that receives data from a school system where a student follows a course and later finishes it, but sometimes a student skips the course and is deleted in the system. Before reloading the fact it can be a good idea to have a CourseStatus column with values like: Active, Completed or Dropped.

 

4. Dimension

A dimension table contains business element contexts and its columns contain element descriptions, and a dimension is referenced by multiple fact tables so the contained measurements make sense.

 

4.1. Purpose of a dimension

Some purposes as I see them:

  • Entry point for a business user to query and analyze data in a data mart.
  • Slice and dice or slicing and dicing where the dimension value (member) is selected and becomes a filter, a zoom or a search criterion to do a filtering of the fact measure data and analysis variable after the wish of the user.
  • Dimensions are placed as rows and as columns and form a cross tabulation or pivot for presentation and reporting of an analysis variable or measure value divided by dimension values.
  • Drill-down for presentation of dimension values in gradually greater detailing through a hierarchy or a group that is grouping values in intervals.
  • Drill-up or roll up views dimension value at higher aggregation level and is summing up the fact measure.
  • Drill-through shows the underlying data from the source legacy system.
  • Drill-across ties data marts and facts together through conformed dimension and different fact measures can be compiled.
  • Ranking or Sorting presents dimension values according to an analysis variable, e.g. customer segments in descending order by sales.
  • Top or bottom for presentation, e.g. top 5 products or 10 poorest customers on sale.
  • Contains relevant data columns for reporting, analysis and self-service BI.

 

The dimension values can be placed in a hierarchy like a Location with levels Country→Region→City, and in a group like an Age with intervals of Child (0-12), Teenage (13-19), Adult working (20-66) and Senior citizen (67-130). A dimension normally contains one or multiple hierarchies and/or groups to fulfill requirements from the users. A dimension can of course be non-hierarchical and non-grouping.

 

Different dimensionality covers the issue that not all combinations of multiple dimension values are allowed in a fact, and the data warehouse needs to make sure of the data quality.

 

4.2. Dimension keys

A dimension table has a minimum of three types of columns:

Primary key is a surrogate key identity column, a unique sequence number, to remove dependence from the source legacy system and for use as the fact table foreign key. It can be called Entity key EK because it represents an entity from source, or Surrogate key SK, but for me »surrogate« is a property not a name of a column. A primary key must not be data-bearing (in danish data bærende), it must be meaningless.

Business key is the primary key (or operational key) of the source legacy system, e.g. Id as a surrogate sequence number, a No (number) or a Code that is unique. Often the business key is a mea­ning­less value e.g. integer or guid based.

Natural key is another candidate key of the source legacy system based on data in the real world, e.g. Code, Serial number or a unique Name. A natural key has an embedded meaning and represents a unique object in the business.

Textual data represents the dimension value context description and is saved in columns of the dimension table and will be shown to the users as descriptive columns to explain fact data.

 

A vehicle has a unique and long serial number as natural key while the system is using a unique sequence number as primary key that becomes the business key. In a Status table a column StatusId becomes the business key, a column StatusCode becomes the natural key and a column StatusName becomes textual data. In a Customer dimension table we will have columns from the source legacy system like CustomerId, CustomerNumber and CustomerName together with ShippingAddress and Zipcode, Regioncode, Country, Age, Gender and what about the BillingAddress and so on. CustomerId is the business key and the Customer dimension primary key would be named e.g. Customer_key, CustomerId_dwh, CustomerEntityKey or EK or SK for surrogate key. »A beloved child has many names.«

 

Different source legacy systems have a natural key for a social security number per person while the business keys are individual surrogate sequence numbers, therefore we can’t use the business key to interconnect (join, drill-across) the systems’ data; instead we use the natural keys. The value of a natural key can change over time, e.g. a person obtains witness protection and gets a new social security number while the business keys remain, and then it is up to the ETL process to make a solid mapping. When a natural key is an employee number that is using the social security number (SSN) and the employee resigns and some years later is rehired, there will be two sequence numbers as business keys and the data warehouse must have a way to glue or map them together so the employee only occurs once in a dimension. Another example is from a Customer source table with an identity sequence number called CustomerId as primary key and a CustomerNumber as a unique secondary index (candidate key). Sometimes a customer row is deleted by mistake and later it will be inserted again with the same CustomerNumber but for sure a new sequence number in CustomerId. That will create a problem in a data warehouse because it will over time receive the same CustomerNumber with two different values in CustomerId. When the data warehouse is only using the CustomerId as business key and not using CustomerNumber as natural key, the data warehouse will get two rows with the same CustomerNumber which can give problems in reporting. Merging and integrating data can be hard to set up for the ETL process.

 

4.3. Changing dimensions

Source data is volatile data because it will change over time, e.g. a customer changes his name or address and a product changes place in a hierarchical structure as a result of a reorganization. Therefore dimension values will normally be changed over time because of the volatility in source legacy systems. The rate of changes can be divided into two kinds of classifications, and afterwards we will be looking into techniques to handle and track changes and to capture the history and preserve the life cycle of source data, also called Change Data Capture CDC.

 

Slowly Changing Dimensions SCD

Columns of a dimension that would undergo changes over time. It depends on the business requirement whether particular column history of changes should be pre­ser­ved in the data warehouse or data mart.

 

Rapidly Changing Dimensions RCD

A dimension column that changes frequently. If you do need to track the changes, using a standard Slowly Changing Dimensions technique can result in a huge inflation of the size of the dimension. One solution is to move the column to its own di­men­sion with a separate foreign key in the fact table. Rapidly changing data usu­al­ly indicate the presence of a business process that should be tracked as a se­pa­ra­te di­men­sion in a separate historical data table.

 

Techniques or methods to handle dimension values that change over time in the source legacy systems, called Ralph Kimball’s eight types or approaches:

 

Type 0: Original value, where the dimension value never change (method is pas­sive) meaning keeping the original value from the source legacy system, and when the value is changed in source legacy system, we don’t change the value in the dimension. »Retain original«.

Type 1: Current value, where the old dimension value will be changed and forgotten when the value is changed in the source legacy system. The history of data values is lost forever. The actual, active, newest or latest of the value or most recent indicator. A fact table refers to a dimension value most recent, as-is. »Current by overwrite«. Section 6.2 will show an example of type 1. (In danish aktuelle, nuværende, gældende.) There is a one-to-one relationship between the business key and the surrogate key identity column as primary key of the dimension.

Type 2: Keep all values, where a new dimension value is inserted into a new row to have an unlimited history of dimension values over time marked by Effective date and Expiration date (Active date, Expired date, Expiry date) (or StartDate and StopDate or BeginDate and EndDate or ValidFrom and ValidTo), a pair of data type »date« or »datetime« that represents the span of time when a value was the »current truth«. A fact table refers to a dimension value in effect when fact data occurred, as-was, often by a date column in the fact table based on source data or by a current load insert date when the fact data was entered into the fact table. »Keep history in rows«. The value of a business key will be repeated every time the textual data is changing, therefore the primary key is a surrogate key identity column, a unique sequence number. A view upon the dimension will provide the current values. A view upon the fact will provide the current keys to join to the dimension view. A column called IsCurrent has two values: 0 for historical and 1 for current to mark each data row of a type 2 dimension. This is the technique for Slowly Changing Dimension. Section 6.2 will show an example of type 2. There is a one-to-many relationship between the business key and the surrogate key identity column as primary key of the dimension.

Type 3: Keep the last value, where the previous dimension value is stored in a Previous column (or Historical column) and current dimension value stored in a Current column. »Keep history in column«. This is the technique for Slowly Chan­ging Dimension.

Type 4: Fast changing value in dimension columns will be split into one or more se­pa­rate Mini Dimensions. »Keep history in tables«. This is the technique for Ra­pid­ly Changing Dimension to store all historical changes in separate historical da­ta tables. The fact table contains foreign key for a given mini dimension.

Type 5: Builds on the type 4 Mini Dimension by embedding a current profile mini dimension key in the base dimension that’s overwritten as a type 1 Current value column. Therefore 4 + 1 = 5 type. »Sub class dimension«.

Type 6: Mixture of type 1 and type 2 columns therefore a good idea to suffix co­lumns as _t1 and _t2 to know which columns can be overwritten in the current row. »Hybrid«. Can also have column of type 3, therefore 3 + 2 + 1 = 6 type.

Type 7: All rows follow type 2 to keep track of historical values with a key column and an extra key column called a durable key follow type 1 for the current value. The durable key is an integer re­pre­sen­tation of the business key. The fact table contains dual foreign keys for a given dimension (its key and its durable key) to show the historical value of the dimension at the time the fact data was created, born or occurred, and to show the current value of the dimension. A view upon the dimension will provide the current values with the durable key to join to the fact durable key. »Dual Type 1 and Type 2 Dimensions«. Section 6.2 and 6.3 will show exam­ples of type 7 dimension that has historical rows with a current mirror.

 

In a dimension table the columns can be a mix of type 1 and type 2, e.g. a Customer dimension where the columns for customer name, street name, house or apartment number are type 1 because we only need the recent value for shipment, and the columns for postal code (zip code) and city are type 2 because we like to track these changes and keep data for the city where a customer was living when purchasing a product. A »sales in cities« report for the last ten years will use the right city of the customer at the time the purchase occurred.
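
A minimal sketch of applying a type 2 change in T-SQL, under assumed, hypothetical names with a DimCustomer (Customer_key identity, CustomerId business key, City, ValidFrom, ValidTo, IsCurrent) and a staging table StageCustomer with today's extract:

   -- Expire the current row when the type 2 column City has changed
   UPDATE d
   SET    d.ValidTo   = CAST(GETDATE() AS date),
          d.IsCurrent = 0
   FROM   DimCustomer   AS d
   JOIN   StageCustomer AS s ON s.CustomerId = d.CustomerId
   WHERE  d.IsCurrent = 1 AND d.City <> s.City

   -- Insert a new current row for changed customers and for brand new customers
   INSERT INTO DimCustomer (CustomerId, City, ValidFrom, ValidTo, IsCurrent)
   SELECT s.CustomerId, s.City, CAST(GETDATE() AS date), '9999-12-31', 1
   FROM   StageCustomer AS s
   WHERE  NOT EXISTS (SELECT 1 FROM DimCustomer AS d
                      WHERE d.CustomerId = s.CustomerId AND d.IsCurrent = 1)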

 

4.4. Types of dimensions

Let us characterize the various dimensions into different types of dimensions.

 

Conformed dimension or Shared dimension or Common dimension

A conformed dimension has the same meaning to every fact in multiple data marts and measures will be categorized and described in the same way, ensuring consistent reporting across the data warehouse. A conformed dimension is a consistent interface to make sure that data can be combined in a data warehouse and be used all over the business because values of a dimension mean the same thing in each fact. With a conformed dimension we can combine and drill across from fact to fact in one data mart or over several data marts, and analyze common columns and values. Separate fact tables can be used together with shared, common and conformed dimensions. Conforming of several source legacy system data is part of the integration to achieve a conformed dimension where data with different meanings is integrated and different columns must be compared against each other, rules must be set, and data must be cleansed to create a single version of the entity. Conformed dimensions will unite and integrate data values among different source legacy systems so it is easy to search across different types of data and sync them in a common report. A shared dimension is utilized in multiple fact tables in a data mart or across multiple data marts. Dimension values come from either the source legacy systems or are built by business rules in a Usage supporting database. A non-conformed dimension can only be used within one fact. It is part of the ETL process to do conforming by merging, uniting and consolidating different source legacy system data across the enterprise for making a conformed dimension, e.g. a Customer dimension from different business areas as order, sale, invoice, delivery and support service (for B2B and B2C) with both different customers and same customers and with different business key values and with different addresses like a shipping address and a billing address. Sometimes data from a conformed dimension is sent back to the source legacy systems as master data to be reusable in the organization.

The dimension values can be placed in a hierarchy like a Location with three levels Country→Region→City. A dimension can have several separate and independent hierarchies with different numbers of levels.

The dimension values can be placed in a group that is grouping values in intervals like person ages in custom buckets like an Age group column with intervals of Child (0-9), Tween (10-12), Teenage (13-19), Young adult (20-29), Adult (30-66) and Senior citizen (67-130).

Data classification is the process of organizing data into categories; a group or category is part of a categorization or grouping of data to make a dimension more user-friendly to see data of a dimension on an aggregated and summed level.

A dimension normally contains one or multiple hierarchies and/or groups to fulfill requirements from the users. A dimension can of course be non-hierarchical and non-grouping.

 

Date dimension or Calendar dimension

A very common dimension with the granularity of a single day with hierarchies as:

Year→Half year→Quarter→Month→Date (five levels) or

Year→Week→Date (three levels) because a week does not always belong to one month. I like to use an integer value as surrogate key identity column in format yyyymmdd e.g. 20131224 for Christmas day, therefore the fact table will contain useful information that can be used in a query like:

»between 20131201 and 20131231« for month of December 2013.

When a fact date is connected to the Date dimension it is easy to make a report based on month, quarter or yearly level or a search filter criterion such as Q3 2013.

Dates or days can be grouped to a Season or TimeOfYear: Winter (December Janu­ary February), Spring (March April May), Summer (June July August) and Fall (September October November) (in danish Årstider).

 

When a surrogate integer value is used instead of a date stamp it allows us to have values to handle these cases:

»Missing« (-1) the date is not present, not reported or »hasn’t happened yet« or »be available later« because the date to be de­ter­mi­ned is expected to be available later and fact table will be updated thereby.

»Not available« (-2) the source legacy system did not provide a date value and will not do it later.

»Not applicable« (-3) the dimension is not relevant for the fact row (N/A, N.A.).

»Bad«, »Corrupt« or »Dirty« (-4) the source legacy system provide bad date e.g. year of birth as 2099 or February 29, 2017.

»Unknown« is used for an inferred member in a fact table without a member value in the dimension. Hopefully all dates in a fact table will already exists as values in the date dimension. Please read more about it in section 4.5.

 

A fact table often has at least one column that represents a date, e.g. an entry date, order date, invoice date, shipping date. When an event is happening it normally has a date or multiple dates like an injury with an InjuryDate, a ReceivedDate at the insurance company, a RulingDate, a TreatmentDate, a BillingDate and a PaymentDate. The Date dimension is a role-playing dimension and therefore the data mart contains multiple views upon the date dimension so each view can join to each date column in the fact table and each view has a good name like the columns in the fact table.

 

Time dimension or Time-of-day dimension

A very common dimension with hierarchies such as:

Hour→Minute→Second (three levels) or

Hour→Minute (two levels)

The time dimension has the granularity of a single second with 86400 rows or of a sin­gle minute with 1440 rows.

The Time dimension has an aggregation hierarchy that rolls up time periods into more summarized business-specific time groupings, e.g. Time division or Time interval (»døgn inddeling«):

Morning (6 am to 8 am)

Rush hour (8 am to 11.30 am and 4 pm to 6 pm)

Lunch hour (11.30 am to 1 pm)

Afternoon hour (1 pm to 4 pm)

Dinner hour (6 pm to 8 pm)

Evening hour (8 pm to 11 pm)

Night hour (11 pm to 6 am)

I like to use an integer value as surrogate key identity column in format hhmmss, e.g. 0 for 00:00:00, 63000 for 6:30:00 am and 202530 for 8:25:30 pm because 8 pm is also called 20 o’clock in the evening. When the Time dimension has a granularity of minute an integer value 10 stands for 00:10 and 2359 stands for 23:59. A missing time is represented with the value -1, not available is -2, not applicable is -3 and bad is -4 if the source time is e.g. 24:12:06.

In an ETL process for making fact data rows, developed in a SSIS package, the column Time_key will be created as a derived column added to the pipeline, expression:

(DT_I2)DATEPART("hh",TransactionDatetime) * 100 + DATEPART("mi",TransactionDatetime)

 

Regular dimension

All dimension members or values (or branches in the hierarchy) have the same number of levels which makes the dimension symmetrical or balanced, such as the Date dimension or the fixed-depth hierarchy Country→Region→City where the three levels are in separate columns. Several columns in the dimension table are not at third normal form (3NF), therefore the dimension table contains redundant data or duplicate data by nature. A regular dimension has a flattened denormalized structure. A Product dimension table can have columns like ProductId, Code, Name and other descriptive columns, a Category because products are divided into categories and each category has a Target group. When a category belongs to multiple products, the target group will be repeated, meaning it does not satisfy third normal form (3NF), but it is okay for a regular dimension.

 

Example of a Product dimension with hierarchy Target group→Category→Product

 

Product_key  Code   Name              Category        Target group
5831         E-973  EVGA GTX Titan    Graphic card    Gamer
5832         G-108  GeForce GTX 1080  Graphic card    Gamer
5833         R-929  Radeon R9 290     Graphic card    Gamer
7092         B-009  Besteker Q9       Karaoke mic     Singer
7093         C-008  CHUWI K8          Karaoke mic     Singer
9250         U-670  Ultra boost       Running shoe    Sport practitioner
9463         T-001  Trek Madone 9.9   Racing bicycle  Sport practitioner

 

There are three levels in the dimension hierarchy and it is balanced, meaning that each Product has a Category and each Category has a Target group, so the dimension hierarchy maximum dimensionality is 3 (level of dimensionality).

 

For the Product dimension the product is the lowest level of granularity, and there are two many-to-one relationships because many products roll up to a single category and many categories roll up to a single target group.

 

Ragged dimension

A ragged dimension has leaf members (the last level of a hierarchy) that appear at different levels of the hierarchy and therefore contains branches with varying depths and numbers of levels. Europe→Denmark→Copenhagen with three levels,

North America→United States→California→Sacramento with four levels represent

continent→country→state→capital. To keep that hierarchy structure for a fixed balanced regular dimension I have made a dummy level not applicable state for Dk: Europe→Denmark→N/A→Copenhagen.

Denmark and France are divided into regions, the Philippines is divided into provinces and Germany has states/lands, therefore the name of the level »state« could be changed to a broader word like »division« which covers the different countries' areas.

A ragged dimension is implemented as a Parent-child structure or with a bridge.

 

Parent-child dimension

Used to model a flexible hierarchical structure where some dimension values have different levels of hierarchies, called an unbalanced or variable-depth hierarchy. Every value in the dimension has a related parent (mother) value, except the top value. The hierarchy is asymmetrical or unbalanced because values are placed in different levels within the same hierarchy. For example, an Employee dimension where the parent is the manager and the children are the employees under the manager, and some employees are both a manager and have another manager above, and the chain of command can be different from department to department. Another example is an Organization dimension where some departments have sub-departments and some teams have sub-teams, but there are also teams that don’t have sub-teams. This is the strongest side of the Parent-child dimension for modeling.
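
A minimal sketch of traversing a parent-child Employee dimension with a recursive common table expression, under assumed, hypothetical column names Employee_key, ManagerEmployee_key and Name:

   WITH Chain AS
   (   SELECT Employee_key, ManagerEmployee_key, Name, 1 AS Level
       FROM   DimEmployee
       WHERE  ManagerEmployee_key IS NULL          -- the top of the hierarchy
       UNION ALL
       SELECT e.Employee_key, e.ManagerEmployee_key, e.Name, c.Level + 1
       FROM   DimEmployee AS e
       JOIN   Chain       AS c ON c.Employee_key = e.ManagerEmployee_key
   )
   SELECT Employee_key, Name, Level FROM Chain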

 

Snowflake dimension

Dimensions in a dimensional model comply with second normal form (2NF).

a) Splitting columns of a dimension table into smaller dimension tables with one-to-many relationships so data values fulfill and comply with 3NF and BCNF to avoid re­dun­­dant data and denormalized structure. Snowflaking is normalization to 3NF.

b) Splitting a dimension hierarchy into two or more dimension tables is called »snowflake a hierarchy«. For example, a Customer dimension with the hierarchy:

Country→Region→City will be split into three dimension tables so only the column City remains in the Customer dimension table that is connected to the Sales fact table, and two Snowflake dimension tables for Region and for Country with one-to-many from Country to Region and one-to-many from Region to Customer. Region and Country can be reused in other dimensions for Supplier and Store or in a FactSales to quickly divide in regions and countries. Of course depending on the business requirements specification.

c) Splitting a dimension with columns from different source legacy systems or when some columns will be user type-in from an application to enrich data and to avoid re­dun­­dant data. See an example in section 6.1.

d) A Product dimension with a Category hierarchy is used in a Sales fact and an Inventory fact. For forecasting (budgeting) the data is generated at Category level. Kimball will not do snowflaking of the Product dimension, instead he rolls up to a Category dimension as a strict subset of Product that the ETL process must take care of.

e) Splitting dimensions and move common columns to a general dimension is called an Outrigger dimension. For e­xam­ple, an Address dimension with street name, house or apartment number, postal code, city, county, area, region, country, gps coor­di­na­tes and so forth to be used in other dimensions e.g. a Customer dimension with a shipping address and a billing address, a Building dimension with location address and an Employee di­men­sion with home address. The Address dimension contains one row per unique address. A Sales fact can have a ShippingAddress_key and a BillingAddress_key as role-play­ing dimensions. Or a factless fact table with Customer_key, Ship­ping­Address_key, BillingAddress_key and effective and ex­pi­ra­tion dates for repor­ting track addresses of a customer. See more later.

f) A Shrunken dimension is not a snowflake dimension, see more later.

g) Splitting a dimension with rapidly changing columns is not snowflaking, see Mini.

h) Kimball’s financial services example starts with an Account dimension including data of products and branches, but he chooses to remove these descriptive columns to form independent dimensions of Product and Branch and not snowflake the Account, instead add them to the fact because that’s the way business users think of them too. The fact will capture the relationship among data entities.

Some data modellers do not like the Snowflake dimension because of the query performance with more tables in the join. I don’t use the terms snowflake schema and starflake schema because for me snowflake belongs to a dimension, and sometimes it is great to have a denormalized structure and other times it is good to think of the above points. Most times a data mart will be a constellation schema (multi-star schema) where some dimensions are shared among different facts.

Now and then a data warehouse architecture has an extra layer after the data mart that is called the presentation area where the ETL process or a view has joined and merged snowflaked dimensions together to only one dimension with all the columns and a lot of data redundancy, and that is fine since this layer will be refilled in every ETL process and it is easier for the users to do their query and for the tool to load data into its data model. Can use a materialized view. The layer is also called data consumption layer or data delivery layer for reporting, analysis and self-service BI.
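
A minimal sketch of such a presentation view, under assumed, hypothetical snowflaked tables DimCustomer, DimRegion and DimCountry from point b) above:

   CREATE VIEW PresentationCustomer AS
   SELECT cu.Customer_key, cu.CustomerName, cu.City,
          r.RegionName, co.CountryName
   FROM   DimCustomer AS cu
   JOIN   DimRegion   AS r  ON r.Region_key   = cu.Region_key
   JOIN   DimCountry  AS co ON co.Country_key = r.Country_key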

 

Outrigger dimension or Reference dimension

When many columns belong logically together in a cluster or group it is fine to do a snowflaking to avoid repeating a large set of data and therefore making a dimension smaller and stable. Sometimes a canoe or a sailboat is using a rig to achieve balance because they are very narrow, and a cluster of columns is placed in an Outrigger dimension or a reference dimension because its primary key will be a foreign key in the main dimension, but there is no reference found in any fact table. For example, a Product dimension has a column called OfferDate that represents the date when the product will be available for the customers, and from that date the product can appear in the Sales fact table. The OfferDate column is a foreign key to a Date dimension that becomes an Outrigger dimension because the Date dimension has many columns about dates, weeks, months, years and maybe fiscal columns too. Another example is demographic data for each country which is provided with 50 different columns. When we use outrigger dimensions we let each dimension have its own core columns.

Another example is from a bank with two kinds of customers, Person and Company, with common data as name and address and with specific data where a Person has social security number, gender and marital status and a Company has VAT identification number, industry code and turnover amount. A Customer dimension handles the connections to the fact tables and becomes a hub, an anchor or a party. Every time the bank gets a new customer, it will be set up in the Customer dimension and the surrogate key value will be reused in either the Person outrigger dimension or in the Company outrigger dimension where both of them have a one-to-one relationship to the Customer dimension.

 

Shrunken dimension

A Shrunken dimension is a subset of another dimension's columns that applies to a higher level of summary of an aggregated fact, and the shrunken dimension key will appear in the fact.

a) The Month dimension is a shrunken dimension of the Date di­men­sion. The Month dimension would be connected to a fore­cast fact table whose grain is at the monthly level, while Date dimension is connected to the realized fact table.

b) A base-level Sales fact has a grain per date and product and is connected to a Date dimension with columns of date, month, year and a Product dimension with columns of names of product, brand and category. The Sales fact is derived to an aggregate fact with a grain per month and category and is connected to a shrunken Month dimension with columns of month, year and a shrunken Category dimension with a name column. The aggregate fact has columns for Month_key and Category_key and a summary amount of the Sales fact. The Product dimension still keeps the name of the category and is rolled up to a Category dimension which also has the name of the category, so the name column becomes redundant data in the dimensional modeling. Therefore shrunken is not snowflaking because a snowflake dimension is on 3NF.

c) Shrunken roll up dimensions are required when constructing an aggregate fact table. When a Sales fact has a daily grain the number of rows can become very large over time, therefore an aggregate fact summarized to monthly level will have fewer rows. Since the aggregate fact doesn’t need detailed customer information, the Customer dimension can make new shrunken dimensions for Region out of address, for Gender and for Age groups, and the summarized fact data becomes even fewer rows.

d) Sometimes users want few dimension values, e.g. Red, Yellow and Green, and they want Pink to become Red and Orange to become Yellow and so forth, and the rest of the colours gets a residual value called Others. I will make a Colour dimension for the colours from the source data and I will make an extra ColourGroup dimension for the four colours Red, Yellow, Green and Others. I will make a map table that will translate the colours like Red to Red, Pink to Red, Yellow to Yellow and Orange to Yellow and the rest of the colours to Others. In the loading to the fact table I will let the colours go through the map and fetch the ColourGroup dimension key to the fact table to obtain good performance for various statistics for the users.

 

Aggregate dimension

Dimensions that represent data at different levels of granularity to give higher performance. Can also refer to hierarchies inside a dimension with a higher grain.

 

Derived dimension

Like a Month dimension that is derived from a Calendar dimension, or we can say that Calendar has been reduced to Month, Year and Week with the start date of the week together with a week number and which year the week belongs to. A derived dimension can also be created by aggregating two existing dimensions.

In a hospital we can from a Patient dimension and an Employee dimension derive a Person dimension. A person can over time be both an employee and a patient, or at the same time when the employee becomes sick and is hospitalized.

Fact data can derive dimension data and it is called a degenerate dimension.

 

Mini dimension or Historical dimension

For Rapidly Changing Dimensions, for managing high frequency and low cardinality changes in a dimension, the fast changing volatile columns are placed in a mini dimension or historical dimension with its own surrogate key identity column which will be included in the fact table. A dimension table will be split into two tables, one with type 0 or 1 columns and the other with type 2 or 4 columns. For example, customer data with columns as Name, Address and Country→Region→City is placed in a Customer dimension table, and the fast changing columns BodyWeightAtPurchaseTime and MonthlyIncome interval e.g. $ 0-10000, 10000-25000, 25000-50000, 50000-99999 are placed in a mini dimension table called CustomerBodyWeightIncome with its own surrogate key identity column. The sales fact table will have two columns to provide data for a customer, one key for the Customer dimension and another key for the CustomerBodyWeightIncome dimension. Sometimes it is necessary to have two or more mini dimensions if the columns are changing rapidly at different times. Normally there is no hierarchy in a mini dimension.

 

Bridge dimension

The bridge is to connect or map data that have a many-to-many relationship between a fact and a dimension or between a dimension and a dimension, to take care of a multivalued column at the conceptual level and break it down at the logical level to a bridge dimension table with two one-to-many relationships and a composite primary key from the implicated data. Instead of the word Bridge table the term Helper table could be used. Peter Chen calls it an associative entity. A bridge table can also be used for ragged hierarchies. (A song can have a Verse 1, a Chorus, a Verse 2, a Chorus, a Bridge and ends with the Chorus again.)

 

Multivalued dimension or Many-valued dimension

a) To handle when one fact row has two or more dimension values from the same dimension. For example, a Sales fact can have up to four different sales staff employees and a sales staff has many sales. Therefore we say there is a many-to-many relationship between the Sales fact and the Employee dimension, and the sales staff employee becomes multivalued in the Sales fact. It can be implemented by an extra table called SalesEmployeeBridge that contains the Id of a fact row and the Employee_key from the Employee dimension, or the fact table will get a dimension key column for a SalesEmployeeGroup dimension meaning a group of sales staff is connected to one fact row; and since there is a many-to-many relationship between a SalesEmployeeGroup and an Employee dimension, a SalesEmployeeGroupBridge table will express that by combining the keys from SalesEmployeeGroup and Employee.

b) When buying a paint mixture the different colors are mixed with a ratio or weight (summing up to 100%) of the amount of paint. One sales fact row contains many colors and one color is included in many paint mixtures. Color becomes multivalued and the bridge gets a weight value to tell how much of that color is used.

c) To handle when one dimension row has two or more dimension values from another dimension. For example, an employee has a group of skills and one skill can belong to several employees. Therefore we say there is a many-to-many relationship between the Employee dimension and the Skill dimension, and the skills of an employee become multivalued in the Employee dimension. It can be implemented by an extra table called EmployeeSkillGroupBridge that has a composite primary key of SkillGroup_key and Skill_key, and SkillGroup_key is a foreign key in the Employee dimension and Skill_key is a foreign key in the Skill dimension, so the many-to-many relationship becomes two one-to-many relationships. The experience level data would be placed in the EmployeeSkillGroupBridge. Sometimes we create a fact table e.g. EmployeeSkillFact, but for me a skill is a description and a characteristic of an employee in a company and facts would be a Sales fact or a Production fact. In a School mart I would place the students' courses in a fact table because it is a result of studying and passing an exam. Another example is a t-shirt that can have up to three sizes »small«, »medium« and »large« and they can become three columns in the T-shirt dimension, or we make a Size dimension that has a many-to-many relationship to the T-shirt dimension and the relationship becomes a T-shirtSizeBridge with QuantityInStock. Another example is a bank account that can have two bank customers like wife and husband, and of course each bank customer can have several accounts like budget, saving and pension. Therefore we say there is a many-to-many relationship between the Customer dimension and the Account dimension. We create a BankAccount fact table that will refer to the Account dimension, and the Account dimension refers to an AccountCustomerBridge table that again refers to the Customer dimension, so the BankAccount fact table will not refer directly to the Customer dimension. The AccountCustomerBridge table contains two columns Account_key and Customer_key as a composite primary key so an account can have several customers.
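
A minimal sketch of the bank example in T-SQL, under assumed, hypothetical table names, where the bridge holds the composite primary key and a query goes from the fact over the bridge to the customer:

   CREATE TABLE AccountCustomerBridge
   (   Account_key  int NOT NULL,
       Customer_key int NOT NULL,
       CONSTRAINT PK_AccountCustomerBridge PRIMARY KEY (Account_key, Customer_key)
   )

   -- Sum of transactions per customer through the bridge (a customer sharing
   -- an account will see the full amount unless a weight factor is applied)
   SELECT c.Name, SUM(f.Amount) AS Amount
   FROM   FactBankAccount       AS f
   JOIN   AccountCustomerBridge AS b ON b.Account_key  = f.Account_key
   JOIN   DimCustomer           AS c ON c.Customer_key = b.Customer_key
   GROUP BY c.Name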

 

Role playing dimension or Repeated dimension

A role-playing dimension is repeated two or several times in the same fact, e.g. a Date dimension whose key column is repeated in three foreign key columns in a fact table for three roles labeled SaleDate, ShipmentDate and DeliveryDate. For each role I create a view that has distinguishable and unambiguous column names, and in this example it becomes three views upon the Date dimension called SaleDate, ShipmentDate and DeliveryDate with columns like Date of sale, Year of sale, Date of shipment, Year of shipment, Date of delivery and Year of delivery.

A City dimension can be repeated in multiple roles in a fact table of persons like these columns: BirthplaceCity, ResidensCity, WorkingplaceCity, SeniorCity and DeathCity. It will become five views upon the City dimension.

A Manager dimension can be repeated as Sales clerk and Store manager.

Another example is a boolean dimension with key values 1/True and 0/False where, in different views for several role-playing statuses, the two values are translated to good texts as »Yes« and »No«, »In stock« and »Delivery made«, or »Available« and »Utilized« for a Utilization status.

When a dimension has an outrigger dimension, e.g. a Customer dimension has a column for FirstPurchaseDate, I create a view upon the Date dimension.

Kimball says: »Create the illusion of independent date dimensions by using views or aliases and uniquely label the columns.« It will be easy for a user in Power BI or Tableau to drag into a data model several views for fact and dimensions without thinking of a dimension playing multiple roles. In an olap cube data model the fact can be joined multiple times to the same dimension and at Dimension Usage each role can be labeled, but since we are using the same dimension the column names will be reused.
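
A minimal sketch of two such views upon the Date dimension, with assumed, hypothetical column names that make each role distinguishable:

   CREATE VIEW SaleDate AS
   SELECT Date_key AS SaleDate_key, FullDate AS [Date of sale], [Year] AS [Year of sale]
   FROM   DimDate
   GO
   CREATE VIEW ShipmentDate AS
   SELECT Date_key AS ShipmentDate_key, FullDate AS [Date of shipment], [Year] AS [Year of shipment]
   FROM   DimDate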

 

Junk dimension, Garbage dimension, Abstract or Hybrid dimension

A single table with a combination of different and unrelated columns to avoid having a large number of foreign keys in the fact table and therefore decrease the number of dimensions (dimensionality) in the fact table. Kimball recommends up to 25 dimensions in a fact. The content in the junk dimension table is the combination of all possible values of the individual columns called the cartesian product. For example, four different columns whose values can be cross-joined into a Junk dimension:

Payment type: Cash or Credit card.

Coupon used: Yes or No or Not applicable.

Bag type: Fabric or Paper or Plastic or Unspecified.

Customer feedback: Good, Bad or None.

with 2 x 3 x 4 x 3 = 72 values or rows in a Junk dimension table, but it can contain only the combinations of values that actually occur in the source data.
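
A minimal sketch of generating the cartesian product for the Junk dimension with CROSS JOIN in T-SQL, using the example values above and assumed, hypothetical column names:

   SELECT Junk_key = ROW_NUMBER() OVER (ORDER BY p.PaymentType, c.CouponUsed, b.BagType, f.Feedback),
          p.PaymentType, c.CouponUsed, b.BagType, f.Feedback
   INTO   DimJunk
   FROM        (VALUES ('Cash'), ('Credit card'))                           AS p (PaymentType)
   CROSS JOIN  (VALUES ('Yes'), ('No'), ('Not applicable'))                 AS c (CouponUsed)
   CROSS JOIN  (VALUES ('Fabric'), ('Paper'), ('Plastic'), ('Unspecified')) AS b (BagType)
   CROSS JOIN  (VALUES ('Good'), ('Bad'), ('None'))                         AS f (Feedback)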

The fact table only needs one key to the Junk dimension for getting the descriptive values of fact data for reporting. The pitfall of a Junk dimension is the filtering because a value (e.g. Credit card) exists as a duplicate in multiple rows and therefore gives multiple key values to be joined to the fact table. To display unique content of a column from a Junk dimension in a dropdown or listbox I need to create a view for that column e.g. create view [Dim Bag type] as select distinct [Bag type] from [Dim Junk]. The view will handle a one-to-many relationship to the Junk dimension in the same way we handle a snowflake dimension. I can in Power BI create a calculated table upon the Junk dimension with a DAX expression like:

Dim Bag type = distinct(Dim Junk[Bag type]) and I build a relationship from the calculated table Dim Bag type back to the Junk dimension in the model where the Junk dimension already has a one-to-many relationship back to the fact. I hide the Dim Junk because a user does not need it after we have calculated tables for each of the columns in the Junk dimension, and therefore the Junk dimension becomes a bridge or a helper. I hope it shows how to use a Junk dimension in practice.

 

Degenerate dimension

Degenerate dimension values exist in the fact table, but they are not foreign keys, and they do not join to a real dimension table. The dimension value is stored as part of the fact table, and is not in a separate dimension table; it is typically used for lowest grain or high cardinality dimensions such as voucher number, transaction number, order number, invoice number or ticket number. These are essentially dimension keys for which there are no other columns, so a degenerate dimension is a dimension without columns or hierarchy. For example, the OrderNumber column can be in the Order fact table with several rows using the same order number, because one order can contain several products. Therefore the OrderNumber column is important to group together all the products in one order. Later, for searching for an order number in an OLAP cube, an Order number dimension is very useful, but it is not a dimension table, it is generated by the Order fact table.

Degenerate dimension is also used as an im­ple­men­ted Bridge dimension table in making a Multivalued dimension. In a data mart these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Normally a degenerate dimension is not a table in a da­ta­base, it is a view with distinct values based on the fact table.

 

Dimensionalised dimension

Dimensionalised dimension is a replacement of the Degenerate dimension where the view becomes a materialized view, meaning it becomes a table in the data mart and where the text column will be transformed to an integer key, like this:

 

SELECT QuantityPerUnit_key = CONVERT(BIGINT, HASHBYTES('MD5', t.QuantityPerUnit)),
       t.QuantityPerUnit
INTO   DimQuantityPerUnit
FROM  (SELECT DISTINCT QuantityPerUnit
       FROM Products WHERE QuantityPerUnit IS NOT NULL) t

 

Static dimension or Constant dimension

Static dimensions are not extracted from the source legacy system, but are created within the context of the data warehouse or data mart. A static dimension can be loaded manually with Status codes or it can be generated by a pro­ce­dure such as a Date dimension and Time dimension. The opposite would be called Dynamic di­men­sion.
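
A minimal sketch of generating a Time dimension at minute grain (1440 rows) in T-SQL, where the key follows the hhmm format mentioned earlier (assumed, hypothetical table and column names):

   ;WITH Minutes AS
   (   SELECT 0 AS MinuteOfDay
       UNION ALL
       SELECT MinuteOfDay + 1 FROM Minutes WHERE MinuteOfDay < 1439
   )
   SELECT Time_key = (MinuteOfDay / 60) * 100 + (MinuteOfDay % 60),
          [Hour]   = MinuteOfDay / 60,
          [Minute] = MinuteOfDay % 60
   INTO   DimTime
   FROM   Minutes
   OPTION (MAXRECURSION 1440)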

 

Heterogeneous dimension

Several different kinds of entries with different columns for each fact (like sub classes). For example, heterogeneous products have separate unique columns and it is therefore not possible to make a single product table to handle these heterogeneous products.

 

Fact dimension

Used to identify different facts that are populated in the same measure column in a fact table because the fact rows represent different entry types. An entry type fact dimension describes what the fact row represents and how measures must be understood and used. The alternative is to have multiple measure columns for each entry type in the fact table where only one column has a value for each row.

 

Time machine dimension

A combination of two dimensions for Entry date (or event date) and for Year and where fact data rows is based on either »counterpart« as in a financial accounting or »transactions« as in a financial transactions. Section 6.4 will show an example.

 

Monster dimension

A very large dimension that has a huge number of rows or many columns. For a real estate agent, I implemented a 132-column dimension through a merge of five source legacy systems. The column names were made together with the users. The created dimension table got the column names in alphabetical order so it is easy to find a specific column.

 

Supernova dimension

Dimensional columns are allowed to become complex objects rather than simple text, like unstructured text, gps tracking, graphic images and time series, and in NoSQL databases like Hadoop and MongoDB, and become much more malleable and extensible from one analysis to another. Extensible design in software engineering is to accept that not everything can be designed in advance, and extensibility is a software design principle defined as a system’s ability to have new functionality extended.

 

Audit dimension or Data quality dimension

A description of each fact table row would be »Normal value«, »Out-of-bounds value«, »Unlikely value«, »Verified value«, »Unverified value« and »Uncertain value«. Kimball recommends to load all rows to the fact table and use an Audit dimension to do a tagging of data because of an error condition and thereby to tell the state of each row in an audit report, so a user can look at the data and make a data change or a counterpart in the source legacy system and do a new load to the data warehouse to fix the data in the fact table. Data quality is monitored during the ETL process and it can produce audit statistics.

In an SSIS package a derived column can have an expression to validate data and give a value to an Audit dimension, where an amount from the source that is less than 0 or more than 5000 gets audit key value 12 for out of bounds amount, else value 0 for okay: Amount < 0 || Amount > 5000 ? (DT_I4)12 : (DT_I4)0

All rows are checked for compliance with the constraints.

An audit dimension can also have data for name and version of source legacy sy­stem, name of data table in source, time of extract from source, time of insert into the fact table etc.

 

4.5. Inferred members or Inferred dimensions

A dimension has a member value of »Unknown«, »Missing«, »N/A«, »-« etc. to handle source data that is going to be loaded into a fact table:

 

  • The fact business key has no value, it is null or it is an empty string.

Approach: Handle a null/empty business key as a missing member because the row in the fact table has no registered value in the business key column.

 

  • The fact business key value not found as a member of the dimension.

Approach: »Early arriving fact« for handling of orphaned data where the fact data has an unknown member, meaning the fact value is an orphan child because there is no parent value in the corresponding dimension table; the fact value does not exist in the dimension.

 

In a relational database it is called a referential integrity constraint violation when a foreign key in a table contains a value that does not exist as a primary key in a different (or the same) table. There is no referred member and there is a need to have an inferred member. A forthcoming fact row has a member that will infer a new dimension member, therefore it is called an inferred member of the dimension and it is to improve data quality and audit trail/control track for reconciliation in the data warehouse.

 

Handling a null/empty business key as a missing member

When a dataset is transformed to a fact table, there can be a business key column whose value is empty or null, meaning it does not yet exist for the fact data. To keep all the fact data rows in the fact table, the related dimension table is already born with a »Missing« value (an inferred member) with surrogate key identity value -1, which can be used as a default value in the fact table foreign key column to the dimension table primary key column. Later the fact table row with a -1 value can be updated with a real business key value that either exists in the dimension table or will be inserted first and get a new surrogate key identity that the fact data row can refer to. Sometimes I have seen a dimension with an inferred member value of -1 for »Unknown«, but I prefer using -1 for »Missing« and using »Unknown« to handle the situation I will describe below, because I divide inferred members into the two situations shown in the points above.

 

Early arriving fact as an unknown member

When a dataset is transformed to a fact table, there can be a business key column whose value has not yet been received in the related dimension table and therefore does not yet exist. To keep all the fact data rows in the fact table, the related dimension table will first have to insert the new business key value with an »Unknown« value (an inferred member) which later will be updated in the dimension with the correct text value when it is known. The »Unknown« value gets the next surrogate key identity as a unique sequence number and will be used in the fact table like any other dimension member value. A dimension table can at the same time have several »Unknown« member values with their own business key and surrogate key identity, and the text value can include the value of the business key like »Unknown 886«, »Unknown 887«, »Unknown 888« and so on to distinguish them for the users of the dimension. When a regular dimension has a hierarchy these levels can have the text value »Unknown« as a special branch in the hierarchy.
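
A minimal sketch of inserting such inferred members in T-SQL, under assumed, hypothetical names with a staging table StageSales and a DimCustomer with the columns used in the example table later in this section:

   INSERT INTO DimCustomer (Business_key, Region, Name, IsInferred)
   SELECT DISTINCT s.CustomerId,
          'Unknown',
          'Unknown ' + CAST(s.CustomerId AS varchar(20)),
          1
   FROM   StageSales AS s
   WHERE  s.CustomerId IS NOT NULL
     AND  NOT EXISTS (SELECT 1 FROM DimCustomer AS d
                      WHERE d.Business_key = s.CustomerId)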

 

Late arriving dimension

When a dimension data has been delayed:

a) If business key not exists then insert a new row.

b) If the business key exists as an inferred member »Unknown«, then for type 1 update the row and for type 2 insert a new row to keep history.

c) If the business key exists and the delayed data comes with a date of validity, then type 2 becomes more complex because validfrom and validto have to be adjusted and fact rows have to be revisited to update the dimension key column to point at the right dimension value at that date. You have to consider if it is allowed to change old data and old reporting results.

 

Late arriving fact

When fact data has been delayed, maybe it includes a date that can be used to search and fetch the current dimension member at that time if the dimension keeps history. You have to consider if it is allowed to add a late arriving fact row because it will change an old report. For example, a boss already got the report of sales for the first quarter and at June 5 a late sale fact for March 31 arrives, and when it is added to the Sales fact the report for the first quarter will change so it does not match the old reporting result.

When building the ETL process for a fact that is using type 2 or 7 dimensions, sometimes we can assume that new fact data rows are at the current date and therefore we only need to do a lookup for the dimension key value with this criterion:

 

   dim.Businesskey = fact.Businesskey AND dim.IsCurrent = 1

 

But if a fact data row can be late arriving with an old date stamp in column RegisteredDate, we need to do a range lookup for the dimension key value with a criterion that finds the fact business key in the dimension at the time the fact date was valid in the dimension:

 

   dim.Businesskey = fact.Businesskey AND

   dim.EffectiveDate <= fact.RegisteredDate AND dim.ExpirationDate > fact.RegisteredDate

 

Sometimes an ETL process will solve it in two tracks or pipelines that both translate and replace a fact business key value with a dimension surrogate primary key value to be saved in the fact data row in the fact table.
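
A minimal sketch of the two tracks as two updates of a staged fact, assuming illustrative table names stage.Fact and dim.Dimension and the column names used in the criteria above:

-- Track 1: facts registered today use the current dimension row
UPDATE f
SET f.Dim_key = d.Dim_key
FROM stage.Fact f
     INNER JOIN dim.Dimension d ON d.Businesskey = f.Businesskey AND d.IsCurrent = 1
WHERE f.RegisteredDate >= CAST(GETDATE() AS date);

-- Track 2: late arriving facts use the range lookup on the registered date
UPDATE f
SET f.Dim_key = d.Dim_key
FROM stage.Fact f
     INNER JOIN dim.Dimension d ON d.Businesskey = f.Businesskey
      AND d.EffectiveDate <= f.RegisteredDate AND d.ExpirationDate > f.RegisteredDate
WHERE f.RegisteredDate < CAST(GETDATE() AS date);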

 

Early arriving dimension

When a dimension has members that have not yet been referred to from a fact row, or maybe never will be, e.g. a type of payment called »Blood«, it is still kept as a member of the dimension.

 

Other inferred members

A variety of constant inferred members would be like the following; a seeding sketch follows after the list:

»Missing« (-1) the business key value is not present, not reported or »hasn’t happened yet« but is expected to be available later, and at that time the value will get its own identity and the fact table will be updated accordingly.

»Not available« (-2) the source legacy system did not provide a business key value (null) and will not do it later. (»None« is another value).

»Not applicable« (-3) the dimension is not relevant for the fact row (N/A, N.A.). It is best to avoid an N/A dimension by making separate fact tables.

»Bad«, »Corrupt« or »Dirty« (-4) the source legacy system provides bad data (like year of birth as 2099) or not enough data to determine the appropriate dimension key identity. This may be due to corrupted data in the source legacy system or incomplete knowledge of the business rules for the source data for the dimension.
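
A minimal seeding sketch for the constant inferred members, assuming an illustrative Dim.Customer table with an identity primary key and no unique constraint on the placeholder business key 0:

SET IDENTITY_INSERT Dim.Customer ON;
INSERT INTO Dim.Customer (Customer_key, Business_key, Region, Name, IsInferred)
VALUES (-1, 0, 'Missing', 'Missing', 1),
       (-2, 0, 'Not available', 'Not available', 1),
       (-3, 0, 'Not applicable', 'Not applicable', 1),
       (-4, 0, 'Bad', 'Bad', 1);
SET IDENTITY_INSERT Dim.Customer OFF;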

 

IsInferred flag

A dimension table can have a metadata column like IsInferred as a boolean (bit) with two values (1 = true and 0 = false):

 

  • IsInferred = 1 means that the row represents an unknown value from source.
  • IsInferred = 0 means that the row represents a known value from source.

 

An example of a Customer dimension

 

Customer_key | Business_key | Region | Name | IsInferred
-1 | 0 | Missing | Missing | 1
1 | 176 | Europe | Hans Andersen | 0
2 | 359 | North America | Joe Watson | 0
3 | 421 | Europe | Marie Beaulieur | 0
134 | 886 | Unknown | Unknown 886 | 1
135 | 887 | Unknown | Unknown 887 | 1
136 | 888 | Unknown | Unknown 888 | 1
137 | 889 | East Asia | Jian Lewai | 0
138 | 890 | North America | Emily Gates | 0
139 | 891 | North America | Michael Brown | 0

 

Will an inferred dimension member data row be updated when the source legacy system provides a Region and a Name?

 

  • Yes, for type 1 the unknown value is updated and set IsInferred = 0.

 

  • No, for type 2 (incl. type 4, 5, 6, 7) all values are kept because it is im­por­tant to keep history in rows; therefore the updated Region and Name will be inserted as a new data row with IsInferred = 0 while the original row of the dimension is kept unchanged. But Kimball page 479 says it is okay to do a type 1 overwrite in a type 2 dimension, so I will say it is up to you.

 

The fact table is untouched or unchanged because the foreign key Customer_key values 134-136 are in the right place already; it is only the Unknown values of the dimension that are changed. Kimball says there is no need to revisit the fact table when making an inferred member into a normal member of the dimension.
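
A minimal sketch of such a type 1 overwrite of inferred members, assuming an illustrative staging table Stage.Customer with the now-known attributes and the Dim.Customer table above:

UPDATE d
SET d.Region = s.Region,
    d.Name = s.Name,
    d.IsInferred = 0
FROM Dim.Customer d
     INNER JOIN Stage.Customer s ON s.CustomerId = d.Business_key
WHERE d.IsInferred = 1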

 

When other levels in a hierarchy have the text value »Unknown«, like Region, and users of the dimension in a cube have already made reports with fact data connected to »Unknown«, it is important to keep the history in the dimension so a user some months later can still make the same report with the same summary value »Unknown«. If a user instead wants to use the current values of the previous inferred members, the cube should provide such a dimension too.

 

4.6. Conclusion

The design of a dimensional model has a four-step process described by Kimball:

1.  Select the business process to model by gathering and understanding business needs and available data in source legacy systems.

2.  Declare the grain of fact as Transaction, Periodic or Accumulating/Aggregated.

3.  Identify the dimensions with columns, hierarchy and group.

4.  Identify the facts with measures as additive or non-additive.

 

Dimensional modeling ends up in relational database tables for dimensions and facts with one-to-many relationships between primary keys and foreign keys, meaning it becomes an Entity Relationship data model where entities are labeled with Kimball naming.

 

An entity or table can have a normal name, e.g. Product and Sales, and the tables can be placed into different database schemas based on Kimball naming to label each table in the dimensional model, e.g. some database schema names:

 

  • Dim
  • Snowdim
  • Junkdim
  • Pcdim
  • Minidim
  • Bridge
  • Fact
  • Lessfact
  • Sumfact
  • Tnsfact
  • Ctpfact
  • Dvdfact

 

Extra schemas for Usage supporting tables in the ETL process:

 

  • Map    for mapping tables between different source legacy systems
  • Rule    for rule tables for enrichment of dimensions and search criteria

 

Other data modelers add a label Dim and Fact to the name of the table, e.g. DimProduct and FactSales or Product_dim and Sales_fact.

 

Most important of all is a good table name, e.g. in Kimball’s Healthcare example he has tables like:

 

  • Health Care Billing Line Item Fact
  • Surgical Events Transaction Fact
  • Medical Record Entries Fact
  • Hospital Dimension
  • Patient Dimension
  • Diagnosis Group Dimension
  • Diagnosis Group Bridge
  • Diagnosis Dimension
  • Treatment Dimension

 

Invoice

An invoice has multiple line items. Why make the invoice a dimension, or why keep a parent invoice header fact table with 4 dimensions and 5 measures and a child invoice line item fact table with 6 dimensions and 3 measures? Kimball tip 25: We can't roll up our business by product! If we constrain by a specific product, we don't know what to do with invoice level discounts, freight charges and tax. We have to take the invoice level data and allocate it down to the line item level, as sketched below. Therefore we get one Invoice fact table with dimensions and measures from the invoice header, e.g. invoice no, order no, invoice date, customer, ship-to and sales person, and from the in­voi­ce line item, e.g. product, quantity, unit price and amount.
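
A minimal allocation sketch, assuming illustrative staging tables Stage.InvoiceHeader (InvoiceNo, Discount, Freight, Tax) and Stage.InvoiceLine (InvoiceNo, LineNo, Amount); the header charges are spread over the lines in proportion to the line amount, which is only one possible allocation rule:

SELECT l.InvoiceNo, l.LineNo, l.Amount,
       AllocatedDiscount = h.Discount * l.Amount / SUM(l.Amount) OVER (PARTITION BY l.InvoiceNo),
       AllocatedFreight  = h.Freight  * l.Amount / SUM(l.Amount) OVER (PARTITION BY l.InvoiceNo),
       AllocatedTax      = h.Tax      * l.Amount / SUM(l.Amount) OVER (PARTITION BY l.InvoiceNo)
FROM Stage.InvoiceLine l
     INNER JOIN Stage.InvoiceHeader h ON h.InvoiceNo = l.InvoiceNo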

Utilizing dimensional modeling, end users can easily understand and navigate the data structure and fully exploit the data.

 

5. Links for more readings

http://www.wikipedia.org/wiki/Data_warehouse

http://www.kimballgroup.com

https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/

http://www.kimballgroup.com/2009/05/the-10-essential-rules-of-dimensional-modeling/

http://www.kimballgroup.com/1997/08/a-dimensional-modeling-manifesto/

https://www.kimballgroup.com/2004/03/differences-of-opinion/

http://www.wikipedia.org/wiki/Slowly_changing_dimension

http://www.kimballgroup.com/2012/07/design-tip-147-durable-super-natural-keys/

http://www.kimballgroup.com/2013/02/05/design-tip-152-slowly-changing-dimension-types-0-4-5-6-7

http://www.kimballgroup.com/2015/12/design-tip-180-the-future-is-bright/

http://www.kimballgroup.com/wp-content/uploads/2007/10/An-Architecture-for-Data-Quality1.pdf

http://www.1keydata.com/datawarehousing/datawarehouse.html

https://www.youtube.com/watch?v=cwpL-3rkRYQ&list=PLrbIyvYCdg0iAUQoxG5vI_yKqzZ2AcgGe

http://www.redbooks.ibm.com/

Peter Chen: The Entity-Relationship Model-Toward a Unified View of Data, March 1976

 

Books:

Ralph Kimball and Margy Ross: The Data Warehouse Toolkit, The Definitive Guide to Dimensional Modeling, Third Edition, 2013.

Ralph Kimball and Joe Caserta: The Data Warehouse ETL Toolkit, Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, 2004.

William H. Inmon: Building the Data Warehouse, Fourth Edition, 2005.

 

6. Examples with sample values

 

6.1. Customer snowflake dimension

Customer table in a source legacy system with CustomerId as a primary key (called Business key in a data warehouse or data mart) and CustomerNum­ber as a se­con­dary key, which includes data about the nationality (country) and the gender of the customer (M or F):

 

CustomerId | CustomerNumber | CustomerName
176 | DK-0912-M | Hans Andersen
359 | US-3942-M | Joe Watson
421 | FR-1429-F | Marie Beaulieur
422 | FR-1430-M | Mathis Leon

 

Customer dimension table as a snowflake dimension:

 

Customer_key | CustomerId | CustomerNumber | CustomerName | Country_key | Gender_key | Payment_key
-1 | 0 | Missing | Missing | -1 | -1 | -1
1 | 176 | DK-0912-M | Hans Andersen | 1 | 1 | 2
2 | 359 | US-3942-M | Joe Watson | 5 | 1 | 3
3 | 421 | FR-1429-F | Marie Beaulieur | 3 | 2 | 3
4 | 422 | FR-1430-M | Mathis Leon | 3 | 1 | 3

 

Customer_key is a unique sequence number used as the primary key of the dimension to have independence from the business key CustomerId in the source legacy system. From CustomerNumber I derived two extra columns for Country and Gender, and they become snowflake dimensions that can be used together with other dimension tables and fact tables. Pay­ment is taken from another table in the CRM system, becomes a dimension itself and is merged into the Customer di­men­sion as an enrichment of the customers. The Country dimension has been enriched inside the data warehouse by the business users with ISOcode and Cur­ren­cy values to fulfill repor­ting requirements. Each country is placed in a continent hie­rar­chy (Africa, Antarctica, Asia, Australia, Europe, North America and South America) in case a report wants a continent overview of the customers and sales.

 

Country dimension table becomes an Outrigger dimension:

 

Country_key | Country | ISOcode | Currency | Continent
-1 | Missing | -- | Missing | Missing
1 | Denmark | DK | Danish Krone | Europe
2 | England | UK | British Pound | Europe
3 | France | FR | Euro | Europe
4 | Norway | NO | Norwegian Krone | Europe
5 | USA | US | United States Dollar | North America
6 | China | CN | Renminbi | Asia

 

Gender dimension table:

 

Gender_key | Gender
-1 | Missing
1 | Male
2 | Female
3 | Transgender

 

Payment dimension table with an extra data warehouse column to specify the sort order of payments for reporting purposes and a type hierarchy:

 

Payment_key | Payment | PaymentSortorder | PaymentType
-1 | Missing | 99 | Missing
1 | Bank account | 3 | Transfer payment
2 | Cash | 1 | Cash payment
3 | Credit card | 2 | Online payment
4 | PayPal | 5 | Online payment
5 | Bitcoin | 4 | Online payment, virtual currency

 

For an OLAP multidimensional cube, loading data to a Customer dimension will be implemented with a view that joins the relevant tables, with several columns and hierarchies to make it easy to search and zoom into specific data in data access tools like Excel, Power BI, QlikView, Tableau or Targit:

 

Customer

·         Number

·         Name

·         Gender

·         Nationality (Country renamed to play a role)

·         Payment

 

In terms of granularity I want to see daily sales by customer and product for each individual employee, which gives the grain declaration of date, customer, product, employee. The time-of-day from the operational system is left out, so quantity has been summarized to date level in case a customer purchases the same product from the same employee in one day, e.g. at 10.15am and 3.45pm.
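
A minimal sketch of summarizing to the daily grain, assuming an illustrative staging table Stage.Sales with a datetime column SaleDateTime and the dimension keys already looked up:

INSERT INTO Fact.Sales (PurchaseDate, Customer_key, Product_key, Employee_key, Quantity, Amount)
SELECT CAST(s.SaleDateTime AS date), s.Customer_key, s.Product_key, s.Employee_key,
       SUM(s.Quantity), SUM(s.Amount)
FROM Stage.Sales s
GROUP BY CAST(s.SaleDateTime AS date), s.Customer_key, s.Product_key, s.Employee_key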

An example of a sales statistic would be: »What is the amount of sales to female customers from France paid with credit card in April 2014?«

We need a Sales fact table like this:

 

PurchaseDate | Customer_key | Product_key | Employee_key | Quantity | Amount
2014-04-23 | 2 | 256 | 12 | 1 | 125
2014-04-23 | 2 | 512 | 12 | 3 | 75
2014-04-24 | 3 | 128 | 7 | 7 | 10
2014-04-25 | 1 | 512 | 12 | 1 | 25
2014-04-29 | 3 | 64 | 10 | 2 | 35
2014-04-29 | 4 | 6 | 10 | 3 | 99
2014-04-30 | 0 | 19 | 0 | 1 | 0

 

The last row shows an unwanted sale because it is a stolen item therefore the cu­sto­mer is missing together with the amount.

 

A report wants to show the total amount per gender including Trans­gen­der even though there is no such customer yet. Since snowflaking has created a Gender di­men­sion with all genders independent of customers, a query will use an outer join to include all genders:

 

SELECT g.Gender, TotalAmount = ISNULL(SUM(s.Amount),0)

FROM Fact.Sales s

         INNER JOIN Dim.Customer c ON c.Customer_Key = s.Customer_key

         RIGHT OUTER JOIN Dim.Gender g ON g.Gender_Key = c.Gender_key

GROUP BY g.Gender

 

Gender | TotalAmount
Missing | 0
Male | 324
Female | 45
Transgender | 0

 

When snowflaking is not wanted for the Customer dimension, it will contain all the relevant descriptive columns:

 

Customer_key | CustomerId | CustomerNumber | CustomerName | Country | ISOcode | Currency | Continent | Gender | Payment | PaymentSortorder | PaymentType

 

The Customer dimension will contain re­dun­dant data, i.e. a PaymentSortorder value is repeated for all the customers that are using the same payment. One day the business asks for a new sort order of payments because it will be more suitable for the reporting. I recommend having a Usage supporting database in the data warehouse with tables where business users can update data through an application. With a Pay­ment table with columns Payment, PaymentSortorder and PaymentType, a bu­si­ness user can change the values in column PaymentSortorder. The Usage supporting data­base will be a data source to the data warehouse, where the ETL process will fetch the Payment table to the Input data area and then update the Customer dimension table in the Data mart. Since the business does not care about the old sort order of payments, the column PaymentSortorder in the Customer dimension will be treated as a type 1 column, meaning a simple update of all the rows in the di­men­sion table, as sketched below.
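
A minimal sketch of that type 1 update, assuming an illustrative Usage.Payment table maintained by the business users and the denormalized Dim.Customer with Payment, PaymentSortorder and PaymentType columns:

UPDATE c
SET c.PaymentSortorder = p.PaymentSortorder,
    c.PaymentType = p.PaymentType
FROM Dim.Customer c
     INNER JOIN Usage.Payment p ON p.Payment = c.Payment
WHERE c.PaymentSortorder <> p.PaymentSortorder OR c.PaymentType <> p.PaymentType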

In the Usage supporting database a business user can add extra data to the coun­tries and hereby enrich the reporting with data that doesn't e­xist in the sour­ce legacy systems.

Since there is no customer of Trans­gen­der, the value will not exist in the Customer dimension and therefore Trans­gen­der can’t be shown in a report, except when it is added through a Union sql statement, but data belongs in a row in a table.

 

When all dimension values are wanted in the fact, the fact table needs to include extra dimensions for Gender, Country and Payment, and they are dropped from Customer.

 

6.2. Type 1, type 2 and type 7 with most recent current value

A customer has Id 421 in a source legacy system and her maiden name is Marie Beaulieur, and she is inserted into a Customer dimension. At 2007-11-16 she got married and took her husband’s surname, therefore her name was changed to Marie Lesauvage.

 

When the dimension is a type 1, Marie’s maiden name will be overwritten and will be forgotten. Customer_key is a surrogate key identity column an unique sequence number.

 

Before:

 

Customer_key | Business_key | CustomerName
1 | 176 | Hans Andersen
2 | 359 | Joe Watson
3 | 421 | Marie Beaulieur

 

After:

 

Customer_key | Business_key | CustomerName
1 | 176 | Hans Andersen
2 | 359 | Joe Watson
3 | 421 | Marie Lesauvage

 

The type 1 dimension will always display the customer's most recent name (cur­rent or latest name). A Customer_key column will be in a fact table, and in a data access tool like Tableau and QlikView the fact column Customer_key will be joined to the di­men­sion column Customer_key inside the tool’s data model.

 

When the dimension is a type 2, Marie’s maiden name will remain and be re­mem­bered, and her changed name will continue in an extra row. To keep track of when a value is changed, there are two datetime columns called ValidFrom and ValidTo. Marie’s maiden name was valid from she was born, or from when she became a custo­mer at 2000-05-08 if that date is known in the source legacy system. The first entry will be given a ValidFrom of e.g. 1900-01-01 as the beginning of time and a ValidTo of e.g. 9999-12-31 as the end of time. Marie’s marriage at 2007-11-16 will create a new row in the dimension table with ValidFrom 2007-11-16 because she took her husband’s surname, and the previous row will get ValidTo 2007-11-16, meaning that until that date she kept her maiden name. Customer_key is a surrogate key identity column, a unique sequence number.

 

Before:

 

Customer_key | Business_key | CustomerName | ValidFrom | ValidTo
1 | 176 | Hans Andersen | 1900-01-01 | 9999-12-31
2 | 359 | Joe Watson | 1900-01-01 | 9999-12-31
3 | 421 | Marie Beaulieur | 1900-01-01 | 9999-12-31

 

After:

 

Customer_key | Business_key | CustomerName | ValidFrom | ValidTo
1 | 176 | Hans Andersen | 1900-01-01 | 9999-12-31
2 | 359 | Joe Watson | 1900-01-01 | 9999-12-31
3 | 421 | Marie Beaulieur | 1900-01-01 | 2007-11-16
4 | 421 | Marie Lesauvage | 2007-11-16 | 9999-12-31

 

Note: Kimball page 508: If ValidFrom and ValidTo are datetime stamps, then the ValidTo must be set to exactly the ValidFrom of the next row so that no gap and no overlap exist between rows, giving continuous time. Other kinds of time are event time (a point in time) and interval time with gaps or overlaps.

 

ValidFrom represents »from and including« (in Danish »fra og med«) and ValidTo repre­sents »to but not including« (in Danish »til og ikke medtaget«).

 

What happens to the date when Marie became a customer on 2000-05-08? We can add a column called CustomerBeginDate, or we can add two rows the first time a customer is entered, so ValidFrom in the second row represents the date the person became a customer; it could look like this:

 

Customer_key | Business_key | CustomerName | ValidFrom | ValidTo
1 | 176 | Hans Andersen | 1900-01-01 | 1997-04-05
2 | 176 | Hans Andersen | 1997-04-05 | 9999-12-31
3 | 359 | Joe Watson | 1900-01-01 | 2000-01-12
4 | 359 | Joe Watson | 2000-01-12 | 9999-12-31
5 | 421 | Marie Beaulieur | 1900-01-01 | 2000-05-08
6 | 421 | Marie Beaulieur | 2000-05-08 | 2007-11-16
7 | 421 | Marie Lesauvage | 2007-11-16 | 9999-12-31

 

The above table is not used in the following example.

 

Marie Lesauvage remarried on 2014-07-15 and took her new husband’s surname, therefore her name is changed to Marie Sainte, which becomes a new row in the table.

 

Customer_key | Business_key | CustomerName | ValidFrom | ValidTo
1 | 176 | Hans Andersen | 1900-01-01 | 9999-12-31
2 | 359 | Joe Watson | 1900-01-01 | 9999-12-31
3 | 421 | Marie Beaulieur | 1900-01-01 | 2007-11-16
4 | 421 | Marie Lesauvage | 2007-11-16 | 2014-07-15
5 | 421 | Marie Sainte | 2014-07-15 | 9999-12-31

 

The Business_key column is the »glue« that holds the multiple records together.

 

Columns ValidFrom and ValidTo are metadata information about the data row itself. A dimension can have extra columns for metadata for a data row in a table e.g.:

IsCurrent, IsDeleted, IsInferred, a State column or a Status column with values e.g. Current, Deleted, Expired. More about metadata columns in section 6.3.

 

Marie has made six purchases which are entered into a fact table, and the date of purchase­ determines the value in column Customer_key according to this rule:

 

   Business_key = CustomerId AND  

   ValidFrom <= PurchaseDate AND ValidTo > PurchaseDate

 

It is called a Range Lookup when we are going to find the right Customer_key in the Customer dimension for a specific date of purchase­; for example, 2014-05-08 for Marie becomes:

 

   Business_key = 421 AND

   ValidFrom <= 2014-05-08 AND ValidTo > 2014-05-08

 

The dimension row with ValidFrom 2007-11-16 and ValidTo 2014-07-15 meets the rule:

 

   Business_key = 421 AND

   2007-11-16 <= 2014-05-08 AND 2014-07-15  > 2014-05-08

 

The dimension data row has Customer_Key value 4, which we use in the in­ser­ted fact row that represents her purchase at 2014-05-08, where her surname was Lesauvage at that time. The Range Lookup will be performed for each row of purchase from the source legacy system, where each PurchaseDate must fit the rule. An example of a fact table where Marie has made many purchases since year 2005 and each purchase has its own Customer_Key value as a navigation reference back to the Customer dimension to fetch the full name when the purchase was registered:

 

PurchaseDate | Customer_key | Product_key | Quantity | Amount
2005-11-10 | 1 | 21 | 1 | 45
2005-11-10 | 2 | 29 | 2 | 140
2005-12-01 | 3 | 32 | 2 | 200
2010-06-30 | 4 | 17 | 1 | 65
2014-05-08 | 4 | 32 | 3 | 300
2014-08-01 | 5 | 21 | 2 | 90
2016-03-15 | 5 | 29 | 1 | 70
2016-03-15 | 5 | 32 | 1 | 100

 

For each row in the fact table the value in column Customer_key is pointing to a row in the dimension table and gives the name of the customer at the time the purchase occurred or was registered. The Customer_key column in the fact table represents as-was when it is joined to the dimension table.

 

When a report of purchase­s wants to show the customer's most recent name (cur­rent or latest name) independent of the date of purchase, it is solved with a data­base view that finds the most recent CustomerName for each value of Custo­mer_key with this result:

 

Customer_key | CustomerName
1 | Hans Andersen
2 | Joe Watson
3 | Marie Sainte
4 | Marie Sainte
5 | Marie Sainte

 

A type 2 current view dimension can be implemented like this:

 

CREATE VIEW DimCustomer_Current AS

SELECT d.Customer_key, c.CustomerName

FROM DimCustomer d

         INNER JOIN

         (SELECT Business_key, CustomerName

          FROM DimCustomer

          WHERE ValidTo = '9999-12-31'  -- datetime2(7) '9999-12-31 23:59:59.9999999'

         ) c ON c.Business_key = d.Business_key

 

The view will always select the current value for each key value in the dimension and show the result set or recordset as shown in the table above.

 

If your sql has windowing functions, the view can avoid the self join, which will en­han­ce performance, and there is no use of the ValidTo column:

 

CREATE VIEW DimCustomer_Current AS

SELECT Customer_key, CustomerName = LAST_VALUE(CustomerName)

            OVER(PARTITION BY Business_key ORDER BY ValidFrom

            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

FROM DimCustomer

 

Use the view in a join with the fact table to fetch the current value for each fact row:

 

SELECT c.CustomerName, f.Quantity, f.Amount

FROM FactSales f

         INNER JOIN DimCustomer_Current c ON c.Customer_key = f.Customer_key

 

The Customer_key column from the view represents as-is for the most recent data when it is joined to the fact table.

 

The disadvantage of the first view for the current value of the dimension is the join on the business key, because it can be a string representation, or a composite business key composed of multiple columns that all need to be part of the inner join, which can cost performance when the dimension has many rows. With the type 7 dimension Kimball introduces an integer representation of the business key which he calls a durable key, an artificial surrogate key value per business key, see later.

The disadvantage of the second view is that the windowing function must be repeated for each column that is to be displayed for the dimension.

 

Temporal tables in SQL Server 2016 are closest to type 4 because they keep history in a separate table while the original dimension table keeps the current dimension member, more reading.

 

In case you want to display the original value of the dimension as type 0:

 

CREATE VIEW DimCustomer_Original AS

SELECT Customer_key, CustomerName = FIRST_VALUE(CustomerName)

            OVER(PARTITION BY Business_key ORDER BY ValidFrom

            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

FROM DimCustomer

 

Use the dimension table in a join with the fact table to fetch registered value for each fact row:

 

SELECT c.CustomerName, f.PurchaseDate, f.Quantity, f.Amount

FROM FactSales f

         INNER JOIN DimCustomer c ON c.Customer_key = f.Customer_key

 

The Customer_key column from the dimension table represents as-was when the fact row was entered into the fact table, at the time the data occurred.

 

There is another type 2 approach that does not need the ValidTo column in the dimen­­sion table, and therefore the ETL process does not have to update the previous row, because a ValidTo value can be calculated when it is needed to determine the dimension key for the coming fact row:

 

CREATE VIEW DimCustomer_ValidTo AS

SELECT Customer_key, CustomerName, ValidFrom,

            ValidTo = LEAD(ValidFrom, 1, '9999-12-31')

            OVER(PARTITION BY Business_key ORDER BY ValidFrom)

FROM DimCustomer

 

In a data access tool like Tableau and QlikView the type 2 for showing most recent values needs two views that will be joined together inside the tool’s data model. A user can in a dropdown box select Marie Sainte (me­a­ning Customer_key value 5) and all her six purchase facts will be shown, or a sum of Amount, even though she has changed her name over time. The dimension for current customers will show this unique list of most recent customers in a dropdown box based on a view:

 

Customer_Current_key | CustomerName
1 | Hans Andersen
2 | Joe Watson
5 | Marie Sainte

 

CREATE VIEW dma.DimCustomer_Current AS

SELECT Customer_key AS Customer_Current_key, CustomerName

FROM DimCustomer

WHERE ValidTo = '9999-12-31'  -- datetime2(7) '9999-12-31 23:59:59.9999999'

 

The fact will show this through a view that include a Customer_Current_key column to be joined to the view DimCustomer_Current in the tool’s data model:

 

PurchaseDate | Customer_key | Customer_Current_key | Product_key | Quantity | Amount
2005-11-10 | 1 | 1 | 21 | 1 | 45
2005-11-10 | 2 | 2 | 29 | 2 | 140
2005-12-01 | 3 | 5 | 32 | 2 | 200
2010-06-30 | 4 | 5 | 17 | 1 | 65
2014-05-08 | 4 | 5 | 32 | 3 | 300
2014-08-01 | 5 | 5 | 21 | 2 | 90
2016-03-15 | 5 | 5 | 29 | 1 | 70
2016-03-15 | 5 | 5 | 32 | 1 | 100

 

Column Customer_Current_key represents a master value for the different slave va­lues in column Customer_key, e.g. master value 5 has slave values 3, 4, 5 in a mas­ter slave relationship.

 

CREATE VIEW dma.FactSales AS

WITH dimCustomer_Current (Customer_key, Customer_Current_key) AS

(

SELECT Customer_key,

            MAX(Customer_key) OVER(PARTITION BY Business_key) AS Customer_Current_key

FROM DimCustomer

)

SELECT f.PurchaseDate, f.Customer_key, c.Customer_Current_key, f.Product_key,

            f.Quantity, f.Amount

FROM FactSales f

         INNER JOIN dimCustomer_Current c ON c.Customer_key = f.Customer_key

 

When current customer Marie Sainte is selected in a dropdown box, the tool will do a query based on the tool’s data model to show all her six purchases facts even though she has changed her name over time:

 

SELECT c.CustomerName, f.PurchaseDate, f.Quantity, f.Amount

FROM dma.FactSales f

         INNER JOIN dma.DimCustomer_Current c

           ON c.Customer_Current_key = f.Customer_Current_key

WHERE c.CustomerName = 'Marie Sainte'

 

CustomerName | PurchaseDate | Quantity | Amount
Marie Sainte | 2005-12-01 | 2 | 200
Marie Sainte | 2010-06-30 | 1 | 65
Marie Sainte | 2014-05-08 | 3 | 300
Marie Sainte | 2014-08-01 | 2 | 90
Marie Sainte | 2016-03-15 | 1 | 70
Marie Sainte | 2016-03-15 | 1 | 100

 

The fact view is still keeping the Customer_key column to be joined to the Cu­sto­mer dimension for showing the name of a customer when the purchase was hap­pe­ning.

 

The view dma.FactSales is using a With part (Common Table Expression) to make a tem­porary result set. It could be changed to a materialized view, which stores data in a table with a clustered primary key through the ETL process, and the FactSales table will join to the materialized-view table to get better performance.
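
A minimal sketch of persisting the current-key mapping as a table in the ETL process instead of the CTE, assuming the illustrative name dma.DimCustomer_CurrentMap:

IF OBJECT_ID('dma.DimCustomer_CurrentMap') IS NOT NULL
    DROP TABLE dma.DimCustomer_CurrentMap;

SELECT Customer_key,
       MAX(Customer_key) OVER(PARTITION BY Business_key) AS Customer_Current_key
INTO dma.DimCustomer_CurrentMap
FROM DimCustomer;

ALTER TABLE dma.DimCustomer_CurrentMap
    ADD CONSTRAINT PK_DimCustomer_CurrentMap PRIMARY KEY CLUSTERED (Customer_key);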

 

When a dimension is a type 7 it keeps history as type 2, and it does the same as type 1 for the current value with an extra column called a durable key, which is an integer re­presentation of the business key, because a business key could be a mix of numbers and text and could be a composite key.

 

Customer_key | Customer_Durable_key | Business_key | CustomerName | ValidFrom | ValidTo
1 | 1 | 176 | Hans Andersen | 1900-01-01 | 9999-12-31
2 | 2 | 359 | Joe Watson | 1900-01-01 | 9999-12-31
3 | 3 | 421 | Marie Beaulieur | 1900-01-01 | 2007-11-16
4 | 3 | 421 | Marie Lesauvage | 2007-11-16 | 2014-07-15
5 | 3 | 421 | Marie Sainte | 2014-07-15 | 9999-12-31

 

Customer_key is holding the type 2 complete history of the dimension columns changes, and Customer_Durable_key is holding the type 1 current version of the dimension columns.

 

Both the dimension primary key (here Customer_key) and the durable key (here Customer_Durable_key) will appear in the fact table as a dual foreign key for a gi­ven dimen­sion.

 

PurchaseDate | Customer_key | Customer_Durable_key | Product_key | Quantity | Amount
2005-11-10 | 1 | 1 | 21 | 1 | 45
2005-11-10 | 2 | 2 | 29 | 2 | 140
2005-12-01 | 3 | 3 | 32 | 2 | 200
2010-06-30 | 4 | 3 | 17 | 1 | 65
2014-05-08 | 4 | 3 | 32 | 3 | 300
2014-08-01 | 5 | 3 | 21 | 2 | 90
2016-03-15 | 5 | 3 | 29 | 1 | 70
2016-03-15 | 5 | 3 | 32 | 1 | 100

 

The Durable key will also appear in the current view that will show each customer in an unique list of most recent customers.

 

Customer_Durable_key | CustomerName
1 | Hans Andersen
2 | Joe Watson
3 | Marie Sainte

 

A type 7 current view dimension can be implemented like this:

 

CREATE VIEW DimCustomer_Current AS

SELECT Customer_Durable_key, CustomerName

FROM DimCustomer

WHERE ValidTo = '9999-12-31'  -- datetime2(7) '9999-12-31 23:59:59.9999999'

 

The Durable key will also appear in a join with the fact table to fetch the current value for each fact row:

 

SELECT c.CustomerName, f.PurchaseDate, f.Quantity, f.Amount

FROM FactSales f

         INNER JOIN DimCustomer_Current c

           ON c.Customer_Durable_key = f.Customer_Durable_key

 

In a data access tool like Tableau and QlikView the type 7 current view becomes a dimension of most recent values, and inside the tool’s data model the view is joined to the fact table’s durable key column. A user can in a dropdown box select Marie Sainte (me­a­ning Customer_Durable_key value 3) and all her six purchase facts will be shown, or a sum of Amount, even though she has changed her name over time.

 

The advantage with type 7 is that the business key from a source legacy system is not part of a query to fetch the most recent values from a dimension. See more about the SCD type 7 dimension in the next section.

 

6.3. Kimball type 7 dimension

A type 7 dimension will keep all values in a dimension independent of the source legacy system, with an option to easily roll up historical facts based on current dimension characteristics. It was formerly known as »Compliance-Enabled Data Warehouses« in Kim­ball Design Tip #74 that achieves:

 

·      Eliminating Type 1 and Type 3 updates.

·      Accessing all prior versions of a database at points in time.

·      Protecting the custody of your data.

·      Showing why and when changes to data occurred.

 

Durable key was presented in Design Tip #147 Durable “Super-Natural” Keys, and type 7 in Design Tip #152 Slowly Changing Dimension Types 0, 4, 5, 6 and 7.

 

Durable means there is no business rule that can change the key value because it belongs to the data warehouse. A business key CustomerNumber with value 421 gets a durable key identifier with value 3, and it will be used in the fact table to refer to the customer. When the CRM system is replaced with a new system in the cloud, it is a task for the data warehouse to map the old CustomerNumber (421) to the new source legacy system CustomerId (836-Q) and thereby keep the durable key value 3, so the old fact rows remain unchanged. That makes the data warehouse robust.
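
A minimal sketch of such a mapping, assuming an illustrative mapping table Map.CustomerSystemChange (OldCustomerNumber, NewCustomerId) and a sibling business key column CRMCloud_bkey already added to DimCustomer for the new cloud CRM system:

UPDATE d
SET d.CRMCloud_bkey = m.NewCustomerId          -- record the new system's business key
FROM DimCustomer d
     INNER JOIN Map.CustomerSystemChange m ON m.OldCustomerNumber = d.CustomerNumber
-- Customer_Durable_key and all existing fact rows are untouched; future fact loads from the
-- new system look up CRMCloud_bkey and reuse the same durable key value.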

 

Kimball says: »Slowly changing dimension type 7 is the final hybrid technique used to support both as-was and as-is reporting. A fact table can be accessed through a dimension modeled both as a type 1 dimension showing only the most current va­lues, or as a type 2 dimension showing correct contemporary historical profiles. The same dimension table enables both perspectives. Both the durable key and primary surrogate key of the dimension are placed in the fact table. For the type 1 perspective, the current flag in the dimension is constrained to be current, and the fact table is joined via the durable key. For the type 2 perspective, the current flag is not constrained, and the fact table is joined via the surrogate primary key. These two perspectives would be deployed as separate views to the BI applications.« link.

 

In this section, I would like to give my own description of the type 7 dimension together with an example of how to join data from fact to dimensions.

 

I will start with the life cycle terms for a value (member) in a dimension:

 

·      Original is the value of birth when it occurred and entered first time.

·      Historical is tracking the value for changes (modified/updated) at any time.

·      Current is active, newest, latest of the value or most recent indicator.

·      Deleted is the value no longer available in the source legacy system.

 

From a fact data point of view it is important to know the value of a dimension at the time of registration, therefore I will add to the life cycle:

 

·      Registered is the dimension value when the fact data occurred or entered.

 

There will probably be dimension values that will not be used in fact data.

 

When I have one fact data row in my hand, I can provide four versions of the va­lue in a dimension, can also be called four perspectives:

 

·      Registered is the dimension value when the fact data row was registered or entered into the data warehouse with a transaction date or an insert time.

·      Current is the latest and active dimension value and thus independent of the time of registration of the fact data row.

·      Original is the dimension value entered the first time, maybe even before the fact data row exists in the source legacy system.

·      Historical is the dimension value at a specific reporting point in time (a given date or datetime) independent of the fact transaction date.

 

Registered stands for as-was when it occurred, Current stands for as-is for the most recent data, Historical stands for as-of a specific reporting date or any arbitrary date in the past, and Original stands for as-began when the data value was opened, appeared or was born.

 

An additional piece of information for the fact data row would be that the dimension value is no longer available in the source legacy system because it has been marked expired or purged, or it has been deleted.

 

My design pattern for type 7 dimension will have extra key audit columns or house­kee­ping columns or support columns or administrative columns:

 

·         _key stands for a primary key column which is a new integer surrogate num­ber e­ve­ry time a va­lue in a source legacy system is changed and sent to the data warehouse. The key is a unique iden­tity auto increment se­quen­tial number normally starting from 1. It can be called _skey for sur­ro­gate key (SK) or even better _rkey for the registered key reference from a fact table row to the dimension table row with the textual data descriptive columns at the time the fact data row was registered or entered into the data warehouse.

 

·         _dkey stands for a durable key which is not a unique integer number per row; instead it will hold the same number as the _key column from when the dimension value entered the di­men­sion table for the first time, and that number will be repeated every time the value changes. It can be called _ekey for entity key (EK) becau­se the unique number of _dkey is also a unique in­de­pen­dent data ware­house surrogate number for the business key, which is unique in the so­ur­ce legacy system like a primary key. It is sometimes called a stable key. It can also be called _hkey for the history reference over time used for lookup in the dimen­sion for the current value, for the original value when it was entered the first time, or for the historical value at a specific repor­ting point in time.

 

·         _pkey stands for a previous key (or parent-child key) which is a re­fe­ren­ce to a _key value for the previous or latest version of a dimension value when a value changes over time. It is a referential integrity where _pkey is a foreign key to the primary _key, with one ex­cep­tion: value 0 in _pkey does not have a reference to a _key value, because value 0 represents the first time the dimension value enters the dimension table and therefore has no previous value, which is the same as the original value.

 

·         _bkey stands for a business key which is unique in the so­ur­ce legacy sy­stem like a primary key. A composite business key is composed of multiple columns like Name_bkey + Street_bkey + Zipcode_bkey and they will be multiple columns in the dimension table, sometimes concatenated into one string with a separator character to handle them in one _bkey column to compare with source data. Identical data can exist in multiple source legacy sy­stems, like an employee in CRM, ERP and HR with their own business key in the Employee dimension, through sibling columns like CRM_bkey, ERP_bkey and HR_bkey with different representations like a number, a code and an e­mail address. Depen­ding on the complexity there is a need for a mapping table in the data warehouse to take care of the merge of business keys, and some­times it is a manual task to merge data together when there is no sui­table natural key. The important thing is to have an employee as one entity with one durable sur­rogate number for the combination of business keys, making a one-to-one relationship between _bkey and _dkey. Examples of a con­cate­na­ted string column that can be used together with a checksum column cal­led _Comparison that represents the other columns outside the source pri­mary key:

concat(Name_bkey,';',Street_bkey,';',Zip­code_bkey)

concat(Id,';',Date,';',Step)

cast(hashbytes('sha2_256', concat(FirstName,';',LastName)) as binary(32))

They are used to find out when data has been changed in the source legacy system, to make a new row in the dimension with new ValidFrom/ValidTo in the ETL pro­cess; see the sketch after this list.

 

·         _nkey stands for a natural key which is unique in the so­ur­ce legacy sy­stem like a candidate key or a secondary key and can be needed to con­form a dimension across multiple so­ur­ce legacy systems which have their own business key that can’t be used in the merge because it is a sur­ro­gate key i­den­tity column, a unique sequence number.
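
A minimal change-detection sketch using the _Comparison checksum idea above, assuming an illustrative staging table Stage.Customer and a DimCustomer with a Comparison binary(32) column maintained by the ETL process:

SELECT s.CustomerId, s.FirstName, s.LastName
FROM Stage.Customer s
     INNER JOIN DimCustomer d ON d.Customer_bkey = s.CustomerId AND d.IsCurrent = 1
WHERE d.Comparison <> cast(hashbytes('sha2_256', concat(s.FirstName,';',s.LastName)) as binary(32))
-- these rows have changed in the source and will get a new type 2 row with new ValidFrom/ValidTo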

 

An example

Let's have an example where a source legacy system has changed its value three times from »Warning« to »Notification« and to »Caution«. Each of the three values becomes a new row in a dimension table together with two metadata co­lumns representing the period for which the value was found in the sour­ce legacy system, here called:

 

·         ValidFrom  also called EffectiveDate or StartDateTime or BeginDateTime.

·         ValidTo      also called ExpirationDate or StopDateTime or EndDateTime.

 

A System dimension type 7 with audit, metadata and descriptive columns:

 

System_key | System_dkey | System_pkey | System_bkey | System | ValidFrom | ValidTo
19 | 19 | 0 | 76CB | Warning | 1900-01-01 00:00:00 | 2015-05-08 14:12:21
35 | 19 | 19 | 76CB | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48
87 | 19 | 35 | 76CB | Caution | 2015-07-31 08:45:48 | 9999-12-31 00:00:00

 

_key   stands for the primary key, a surrogate incrementing se­quen­tial number.

_dkey stands for the durable key, an integer representation of a business key.

_pkey stands for the previous key, a re­fe­ren­ce to the _key value of the previous row.

_bkey stands for the business key, the primary key in the source data.

 

Column _key is a unique primary key of the dimension table, and the com­bi­nation of _bkey + ValidFrom or _bkey + ValidTo is unique too.

 

[I prefer to think of a dimension table as an object that has its own functions on data, therefore I like _key and _dkey to start from 1 for each dimen­sion instead of having a dimension key store map with a global number for all keys across all dimensions, where each business key gets its own surrogate key and a new dimension table would start with a _key like 4983562.]

 

»Warning« and »Notification« are historical va­lues of the System dimension and »Caution« is the most recent indicator of the System dimension, with these terms:

 

·         »Warning« is the original value.

·         »Notification« is the historical value.

·         »Caution« is the current va­lue.

·         »?« is the registered value depending on the transaction date in the fact row.

 

The challenge is that you have to choose which version of a dimension value you like to present in your report.

 

Registered value

From a source legacy system the data warehouse has received two outage fact rows with an OutageDate (a kind of transaction date) and a SystemId business key that matches System_bkey in the System dimension. Example of two source data rows:

 

OutageDate | SystemId | ProductionKWh
2015-05-25 10:22:43 | 76CB | 173
2015-08-02 15:47:10 | 76CB | 221

 

For each source data row the value of OutageDate is used to look up in the System dimension the row that was valid at the time the outage happened, to determine and fetch the correct dimension value. The fact data row is registered with criteria like:

 

System_bkey = SystemId AND

ValidFrom <= OutageDate AND ValidTo > OutageDate

 

It will select one row from the dimension table, and the values of columns Sy­stem_key and Sy­­stem_dkey will be copied to the new fact row; the two outage incidents at May 25 and August 2 in 2015 become two fact data rows:

 

OutageDate | System_key | System_dkey | ProductionKWh
2015-05-25 10:22:43 | 35 | 19 | 173
2015-08-02 15:47:10 | 87 | 19 | 221

 

System_key is a reference to the System dimension value at the registered fact time from column OutageDate, and System_dkey is a reference to the entity with business key 76CB to provide current, original or historical value.

 

When we would like to see the value from the dimension at the time an outage happened, fact table column System_key will join to the dimension column Sy­stem_key and get the value from the unique row. We call this value the registered value, because it was the value of the dimension when the outage incident occurred, and the value will never change because the dimension is handling all changes of va­lues.

Fact table join to the dimension table to fetch registered value:

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System d ON d.System_key = f.System_key

 

System | OutageDate | ProductionKWh
Notification | 2015-05-25 10:22:43 | 173
Caution | 2015-08-02 15:47:10 | 221

 

Here we see the purpose of the OutageDate, and in case the source legacy sy­stem does not provide a »transaction date« to determine and fetch the dimension value, the data warehouse will use a RegisteredTime (receipt datetime, createddate, inserttime, current_time­stamp, getdate, sysdatetime) for when the fact data is regi­stered or entered into the fact table row, with criteria like:

 

dim.System_bkey = fact.SystemId AND

dim.ValidFrom <= fact.RegisteredTime AND dim.ValidTo > fact.RegisteredTime

 

Current va­lue

It can sometimes confuse users because the value of System has changed over time, and what is the summation of ProductionKWh of the System when using the registered value? Therefore we can provide the current value of the System dimension.

Fact table join to the dimension table to fetch current value:

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System d ON d.System_dkey = f.System_dkey AND

                              d.ValidTo = '9999-12-31 00:00:00'

 

System | OutageDate | ProductionKWh
Caution | 2015-05-25 10:22:43 | 173
Caution | 2015-08-02 15:47:10 | 221

 

And a summation of ProductionKWh is 394 KWh for Caution.

 

Original value

We can provide an original value of the System dimension.

Fact table join to the di­men­sion table to fetch original value:

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System d ON d.System_dkey = f.System_dkey AND

                              d.System_pkey = 0   -- alternatively: d.System_key = d.System_dkey

 

System | OutageDate | ProductionKWh
Warning | 2015-05-25 10:22:43 | 173
Warning | 2015-08-02 15:47:10 | 221

 

And a summation of ProductionKWh is 394 KWh for Warning, which is the original value of the system; maybe some users remember it best.

 

Historical value

We can provide a historical value of the System dimension at a specific as-of reporting point in time (ReportTime, Reque­sted­Date) that will be part of the criteria.

Fact table join to the di­men­sion table to fetch historical value:

 

DECLARE @ReportTime datetime = '2015-06-01 10:24:47'

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System d ON d.System_dkey = f.System_dkey AND

                              d.ValidFrom <= @ReportTime AND d.ValidTo > @ReportTime

 

System | OutageDate | ProductionKWh
Notification | 2015-05-25 10:22:43 | 173
Notification | 2015-08-02 15:47:10 | 221

 

The »between filter« constrains to a single version of the system value. A re­por­ting datetime of »June 1, 2015, 10.24« will fetch sy­stem Notification, and that will never change even though the system name changes value over time. It means I can always repeat and reproduce my report as long as I remember the reporting datetime. If I choose another reporting datetime like »September 1, 2015, 09.46« it will fetch system Caution. The reality with this approach is that you have a many-to-many relationship between the fact table and the dimension, so you must force your delivery reporting tools to choose an as-of date, allowing point in time analysis at a given date, to avoid bringing back all combinations and overstated measurements.

 

Helper views

It is hard to remember the criteria for _key, _dkey, ValidTo and ValidFrom, therefore I will make four sql helper views to make the join between dimension and fact easier for users and analysis tools:

 

Current view

 

CREATE VIEW dim.System_Current AS

SELECT System_dkey, System

FROM   dim.System

WHERE ValidTo = '9999-12-31 00:00:00'

 

The view will always select the current value for each value in the dimension.

 

Used in join with fact table to fetch current value for a fact row:

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System_Current d ON d.System_dkey = f.System_dkey

 

Let us imagine we are back in time at June 5, 2015 and we would like to see in Tableau the current sy­stem of the outage at May 25; the view System_Current will provide system Noti­fi­ca­tion because that was the current value at that time. When we do the same in Tableau at August 22, the view System_Current will provide sy­stem Caution. Fact table co­lumn Sy­stem_dkey will join to the view column System_dkey, and it will provide dif­fe­rent values from the dimension depending on the date and time we do the reporting in Tableau, and the values will always be the current and most recent.

 

A dimension is used to search for data rows from a fact table, and with the view Sy­stem_Cur­rent it is easy inside Tableau and QlikView to make a dimension for recent values and join it to the fact through the column System_dkey in the tool’s data model. Tableau and QlikView show the most recent values of System, e.g. Caution, and a user selects the value in a dropdown as a criterion, and Tableau and QlikView do a join to the fact table and find all the data rows that are using this system even though the name of the system has been changed over time:

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System_Current d ON d.System_dkey = f.System_dkey

WHERE d.System = 'Caution'

 

In Tableau and QlikView the dimension would be named Current_System.

 

Registered view

 

CREATE VIEW dim.System_Registered AS

SELECT System_key, System

FROM   dim.System

 

Used in join with fact table to fetch registered value for a fact row:

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System_Registered d ON d.System_key = f.System_key

 

In Tableau and QlikView the dimension would be named Registered_System and it is used to report a name of a system when the Outage was happening.

 

Original view

 

CREATE VIEW dim.System_Original AS

SELECT System_dkey, System

FROM   dim.System

WHERE System_pkey = 0   -- alternatively: System_key = System_dkey

 

Used in join with fact table to fetch original value for a fact row:

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System_Original d ON d.System_dkey = f.System_dkey

 

It is seldom that we have an Original_System dimension in Tableau, but we can if we want.

 

Historical function with a datetime parameter

 

CREATE FUNCTION dim.System_Historical(@ReportTime datetime)

RETURNS TABLE AS RETURN (

SELECT System_dkey, System

FROM   dim.System

WHERE ValidFrom <= @ReportTime AND ValidTo > @ReportTime)

 

Used in join with fact table to fetch historical value for a fact row with a datetime:

 

SELECT d.System, f.OutageDate, f.ProductionKWh

FROM    fact.Outage f

            INNER JOIN dim.System_Historical('2015-06-01 10:24') d

               ON d.System_dkey = f.System_dkey

 

Tableau and QlikView do not yet offer a user type-in datetime value for a di­men­sion before dis­play­ing the values of the dimension, but a report normally has some criteria boxes, and one of them could be an input datetime for a historical di­men­sion, so the report shows the values from a specific point in time, maybe to re­pro­duce exactly the sa­me report as it was made before at that time.

 

I prefer to make four database schemas: dimcurrent, dimhistorical, dimoriginal and dimregistered, and reuse the name of the view System in all schemas, like having multiple namespaces with their own objects. When I write a sql statement I can in the From/Inner Join part use IntelliSense, e.g. »dimcurrent.«, and I get the list of di­men­sions that present current values, which I use together with a fact table that I have placed in a schema called fact.

 

Conclusion

Dimension type 7 provides up to four values for registered, current, original and hi­storical, and it is very flexible for a fact data row to request and fetch a di­men­sion value. Of course a lot of fact data rows have only one value in a di­men­sion, like a fact for a sale to a new customer that is entered into the dimension at the same time. When the customer doesn’t change any data for a long time, the di­men­sion value is regi­stered = current = original = historical. The cu­stomer purchases many times and has many rows in the sales fact, but one day the customer moves from his home country to three different countries over time and still continues to purchase. This means that the original value becomes the home country, the historical values are the second and third country, the current value is the fourth and most recent country, and the registered value country depends on the date of the sale. For statistics of the sales we can have summaries for the customers’ home countries even though they have moved to other countries, because we simply use the original value of the dimension. We can have summaries for the countries where the customers did their purchases by using the registered value. In case we would like to know the sales from customers who have once lived in a specific country, we first look up that country in the customer dimension, get their _dkey’s and join them to the fact sales table. The marketing people like to know the current address, phone number and email of all the customers to contact them with new exciting offers. Type 7 gives you all the options, and the four helper sql views are handy for ad hoc queries. The right information, at the right time, in the right place, in the right way, in the right format, to the right person. In a data access tool like Excel, Power BI, QlikView, Tableau or Targit it is easy to make dimensions based on the helper views and join the dimensions to the fact through the dkey or key columns in the tool’s data model.

 

Dimension table and Fact table

A dimension table in the physical design will have extra key audit columns or house­kee­ping columns or support columns or administrative columns:

 

·         _key    used for the registered fact data value of the dimension.

·         _dkey  used for the current, original and historical value of the dimension.

·         _pkey  used for the previous key that is a reference to the _key column.

·         _bkey  used for the business key from the so­ur­ce legacy system.

·         _nkey  used for the natural key from the so­ur­ce legacy system.

 

A fact table in the physical design has for each dimension two audit columns:

 

·         _key    used for the registered fact data value of the dimension.

·         _dkey  used for the current, original and historical value of the dimension.

 

A dimension table will keep track of datetime stamp of inserting and updating a row with two audit columns:

 

·         InsertTime    database server system time when the row was inserted.

·         UpdateTime   database server system time when the row was updated.

 

I don’t like an audit column to be Null or N/A, therefore UpdateTime gets the same datetime stamp as InsertTime, so both columns will have the same value until data in the row is either changed or deleted in the source legacy system. The two audit columns can be assigned either by triggers for insert and update on the table or by the ETL process. I prefer the ETL process to handle all audit columns.

 

A dimension table will keep track of job, package or data flow execution identifi­ca­tion stamp of inserting and updating a row with two audit columns:

 

·         InsertExec    execution identifi­ca­tion when the row was inserted.

·         UpdateExec   execution identifi­ca­tion when the row was updated.

 

Execution identifi­ca­tion is useful for an audit trail to log the number of rows that have been inserted or updated in a table in the data warehouse.

 

A dimension table will keep track of a current value in the source legacy system with a metadata column which is a boolean (bit) with two values (1 = true and 0 = false):

 

IsCurrent

IsCurrent = 1 means that the row represent the current value from source system.

IsCurrent = 0 means that the row represent a historical value from source system.

 

Column IsCurrent can also be called NewestFlag, ActiveFlag or RecentFlag.

 

When IsCurrent = 1 the ValidFrom/ValidTo represent the period where the row va­lue is current in the source legacy system.

 

A dimension table will keep track of an expired or deleted value in the source le­gacy system with a metadata column which is a boolean (bit) with two values (1 = true and 0 = false):

 

IsDeleted

IsDeleted = 1 means that the row represent the deleted value from source system.

IsDeleted = 0 means that the row represent an existing value from source system.

 

Column IsDeleted can also be called DeleteFlag, DeletedAtSource, ExpiredFlag or InactiveFlag, when data are deleted in a system and therefore are gone forever.

 

When IsDeleted = 1 the ValidFrom/ValidTo represent the period where the row va­lue does not exists in the source legacy system.

 

A dimension table can keep track of the reason why the row was changed in a co­lumn called RowChangeReason so it is easy for doing search and analysis. The column could have value »New« when the row represent a new customer, and later when customer move to another country the value could change to »Moved country« and later it would be »Changed credit status« and so on.

 

A fact table will keep track of datetime stamp of inserting a row with audit column:

 

·         InsertTime  database server system time when the row was registered.

 

When there is no transaction date for fact data, the InsertTime will be used to look up a unique dimension table row. Sometimes the term RegisteredTime is used.

 

A fact table will keep track of datetime stamp of updating a row with audit and metadata co­lumns:

 

·         UpdateTime  database server system time when the row was updated.

·         IsCurrent      keeps track of the current row when a fact row can change.

·         IsDeleted      marks a fact row that has been deleted from the source legacy system.

·         ValidFrom     when a fact row is treated as SCD type 2, Slowly Changing Facts.

·         ValidTo         when a fact row is treated as SCD type 2, Slowly Changing Facts.
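
A minimal table sketch with these columns, assuming a fact table named fact.Outage for the outage example used later in this section; data types and defaults are illustrative, and normally the ETL process sets the audit columns:

CREATE TABLE fact.Outage
(
   OutageDate     datetime       NOT NULL,
   System_key     int            NOT NULL,  -- registered dimension value (_key)
   System_dkey    int            NOT NULL,  -- current, original or historical lookup (_dkey)
   ProductionKWh  decimal(18,3)  NOT NULL,
   InsertTime     datetime       NOT NULL DEFAULT GETDATE(),
   UpdateTime     datetime       NOT NULL DEFAULT GETDATE(),
   IsCurrent      bit            NOT NULL DEFAULT 1,
   IsDeleted      bit            NOT NULL DEFAULT 0,
   ValidFrom      datetime       NOT NULL DEFAULT '1900-01-01',
   ValidTo        datetime       NOT NULL DEFAULT '9999-12-31'
)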

 

This can be called Slowly Changing Facts (SCF) or Frequently Changing Facts (FCF), but hopefully not Rapidly Changing Facts (RCF), because that would give a lot of fact rows over time.

 

When transactions are changed retroactively, the corresponding fact rows should be updated. If the history of fact rows needs to be tracked, it becomes Slowly Changing Facts; see more in section 6.4.

 

A developer may prefix or suffix all key, audit and metadata columns to make sure that the source legacy system does not have a column with the same name, e.g. a suffix like _dw, _dwh, _audit or _meta.

 

An example

Let's look at an example where a value in a source legacy system changes over time from »Warning« to »Notification« and then to »Caution«, and in the end the source legacy system deletes the value so it no longer exists, but the dimension will keep it forever. Each of the following steps shows how the dimension table looks after the process has run.

 

1. The first time the value arrives at the data warehouse, which happens at May 1, 2015, 5 pm (17:00), a new row is added to the dimension table to contain the new value in the column System together with ten audit and metadata columns:

 

_key  _dkey  _pkey  _bkey  System        ValidFrom            ValidTo              IsCurrent  IsDeleted  InsertTime           UpdateTime
19    19     0      76CB   Warning       1900-01-01 00:00:00  9999-12-31 00:00:00  1          0          2015-05-01 17:00:00  2015-05-01 17:00:00

 

2. The value changes at 2015-05-08, therefore the first row is no longer current and a new row is added:

 

_key  _dkey  _pkey  _bkey  System        ValidFrom            ValidTo              IsCurrent  IsDeleted  InsertTime           UpdateTime
19    19     0      76CB   Warning       1900-01-01 00:00:00  2015-05-08 14:12:21  0          0          2015-05-01 17:00:00  2015-05-08 17:01:09
35    19     19     76CB   Notification  2015-05-08 14:12:21  9999-12-31 00:00:00  1          0          2015-05-08 17:01:10  2015-05-08 17:01:10

 

3. The value changes at 2015-07-31, therefore the second row is no longer current and a new row is added:

 

_key  _dkey  _pkey  _bkey  System        ValidFrom            ValidTo              IsCurrent  IsDeleted  InsertTime           UpdateTime
19    19     0      76CB   Warning       1900-01-01 00:00:00  2015-05-08 14:12:21  0          0          2015-05-01 17:00:00  2015-05-08 17:01:09
35    19     19     76CB   Notification  2015-05-08 14:12:21  2015-07-31 08:45:48  0          0          2015-05-08 17:01:10  2015-07-31 14:03:22
87    19     35     76CB   Caution       2015-07-31 08:45:48  9999-12-31 00:00:00  1          0          2015-07-31 14:03:23  2015-07-31 14:03:23

 

4. The value is deleted at 2015-08-06, where the current row is copied to a new row marked as deleted:

 

_key  _dkey  _pkey  _bkey  System        ValidFrom            ValidTo              IsCurrent  IsDeleted  InsertTime           UpdateTime
19    19     0      76CB   Warning       1900-01-01 00:00:00  2015-05-08 14:12:21  0          0          2015-05-01 17:00:00  2015-05-08 17:01:09
35    19     19     76CB   Notification  2015-05-08 14:12:21  2015-07-31 08:45:48  0          0          2015-05-08 17:01:10  2015-07-31 14:03:22
87    19     35     76CB   Caution       2015-07-31 08:45:48  2015-08-06 10:10:05  0          0          2015-07-31 14:03:23  2015-08-06 12:00:06
99    19     87     76CB   Caution       2015-08-06 10:10:05  9999-12-31 00:00:00  1          1          2015-08-06 12:00:08  2015-08-06 12:00:08

 

With this approach an additional row is inserted for a deleted (or closed) item in the source legacy system, so the last row of the dimension, which is marked as deleted, tells the period where the value is deleted: from 2015-08-06 10:10:05 to forever. IsDeleted = 1 represents a row (record, data) that is deleted in the source legacy system, therefore the business key is no longer available in the source legacy system in the period of ValidFrom/ValidTo. Fact data with a _key column will never point to or refer to a deleted value in a dimension, therefore no fact row can use a _key value where IsDeleted = 1. IsCurrent = 0 represents an old historical version of the value, and ValidFrom/ValidTo tells us the period when the old value was current or active back in time. Since older fact data can refer through column _dkey to a deleted dimension value, the deleted row still needs IsCurrent = 1, and with IsDeleted = 1 we can see that the value is no longer available in the source legacy system. Therefore fact table data is consistent and will never be missing a dimension value in a dimension table. In case a deleted value is reopened (reappears) with the same business key in the source legacy system, the current dimension row is updated with IsCurrent = 0 and a new ValidTo datetime, a new row is inserted and becomes current, and we keep IsDeleted = 1 on the closed row to show that the value was deleted for a period.
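
This consistency rule can be verified with a reconciliation check; a sketch assuming the fact table name fact.Outage from the example below, where the query should return no rows:

-- Reconciliation check: no fact _key may refer to a dimension row with IsDeleted = 1
SELECT f.System_key, COUNT(*) AS NumberOfFactRows
FROM   fact.Outage f
JOIN   dim.System d ON d.System_key = f.System_key
WHERE  d.IsDeleted = 1
GROUP BY f.System_key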

 

When source data with an OutageDate arrives at the data warehouse, it will find the historical system name at the time when the outage happened, with criteria that do not include the deleted dimension value, because fact data with a _key column cannot refer to a deleted value:

 

IsDeleted = 0 AND System_bkey = SystemId AND

ValidFrom <= OutageDate AND ValidTo > OutageDate
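
A sketch of a fact load that applies these criteria; the staging table name stage.Outage is an assumption:

-- Look up the dimension row that was valid at the outage time and the durable _dkey
INSERT INTO fact.Outage (OutageDate, System_key, System_dkey, ProductionKWh)
SELECT s.OutageDate,
       d.System_key,    -- dimension row valid at the time of the outage
       d.System_dkey,   -- durable reference to the business entity
       s.ProductionKWh
FROM   stage.Outage s
JOIN   dim.System   d ON  d.System_bkey = s.SystemId
                      AND d.IsDeleted   = 0
                      AND d.ValidFrom  <= s.OutageDate
                      AND d.ValidTo     > s.OutageDate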

 

Example of two source data rows:

 

OutageDate           SystemId  ProductionKWh
2015-05-25 10:22:43  76CB      173
2015-08-02 15:47:10  76CB      221

 

They become two fact data rows, where System_key is a reference to the System dimension value at the registered fact time from column OutageDate, and System_dkey is a reference to the entity with business key 76CB to provide the current, original or historical value.

 

OutageDate           System_key  System_dkey  ProductionKWh
2015-05-25 10:22:43  35          19           173
2015-08-02 15:47:10  87          19           221

 

The current view for the dimension, showing all current values, will search with the criteria IsCurrent = 1, so the view gives the current value every time it is accessed, and the fact column _dkey can join to the view to show the current values of the dimension. Dimension values with IsDeleted = 1 are included in the view, because there is old fact data back in time with a _dkey that refers to dimension data deleted in the source legacy system.

 

The view provides registered dimension values and will be joined to fact data through _key

 

CREATE VIEW dim.System_Registered AS

SELECT System_key, System

FROM   dim.System

WHERE IsDeleted = 0

 

The view provides current dimension values and will be joined to fact data through _dkey

 

CREATE VIEW dim.System_Current AS

SELECT System_dkey, System

FROM   dim.System

WHERE IsCurrent = 1

 

The view provides original dimension values and will be joined to fact data through _dkey

 

CREATE VIEW dim.System_Original AS

SELECT System_dkey, System

FROM   dim.System

WHERE System_pkey = 0   -- the original row, where System_key = System_dkey

 

The table-valued function provides historical dimension values and will be joined to fact data through _dkey

 

CREATE FUNCTION dim.System_Historical(@ReportTime datetime)

RETURNS TABLE AS RETURN (

SELECT System_dkey, System

FROM   dim.System

WHERE IsDeleted = 0 AND ValidFrom <= @ReportTime AND ValidTo > @ReportTime)

 

When dimension data has IsDeleted = 1, the value of _key is never used in the fact data _key column, therefore the registered view and the historical function do not need to include the deleted dimension value; anyway the same dimension value already exists in the previous row of the dimension table, with a ValidFrom/ValidTo period for the time when the value was current and active in the source legacy system.
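
Usage sketches joining the fact table to the views and the table-valued function; the report time 2015-06-30 is only an illustrative value:

-- Registered (as-was) values through _key
SELECT r.System, SUM(f.ProductionKWh) AS ProductionKWh
FROM   fact.Outage f
JOIN   dim.System_Registered r ON r.System_key = f.System_key
GROUP BY r.System

-- Current (as-is) values through _dkey
SELECT c.System, SUM(f.ProductionKWh) AS ProductionKWh
FROM   fact.Outage f
JOIN   dim.System_Current c ON c.System_dkey = f.System_dkey
GROUP BY c.System

-- Historical (as-of) values for a given report time through _dkey
SELECT h.System, SUM(f.ProductionKWh) AS ProductionKWh
FROM   fact.Outage f
JOIN   dim.System_Historical('2015-06-30') h ON h.System_dkey = f.System_dkey
GROUP BY h.System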

 

An implementation in SSIS (SQL Server Integration Services) of a type 7 dimension

 

 

Read how to implement the type 7 dimension ETL process in SSIS and how to load fact data with a transaction date

 

A dimension table will keep track of unknown values in the source legacy system with a metadata column which is a boolean (bit) with two values (1 = true and 0 = false):

 

IsInferred

IsInferred = 1 means that the row represents an unknown value from the source system.

IsInferred = 0 means that the row represents a known value from the source system.
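
A sketch of inserting an inferred member when a business key arrives in fact data before it exists in the dimension; the staging table stage.Outage and the text »Unknown« are assumptions, and the handling of the _key, _dkey and _pkey columns is omitted:

-- Insert a placeholder (inferred) row for unknown business keys from the fact load
INSERT INTO dim.System (System_bkey, System, IsInferred, IsCurrent, IsDeleted,
                        ValidFrom, ValidTo, InsertTime, UpdateTime)
SELECT DISTINCT s.SystemId, N'Unknown', 1, 1, 0,
       '1900-01-01', '9999-12-31', GETDATE(), GETDATE()
FROM   stage.Outage s
WHERE  s.SystemId IS NOT NULL   -- null business keys are handled separately
  AND  NOT EXISTS (SELECT 1 FROM dim.System d WHERE d.System_bkey = s.SystemId)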

 

An example

Let's look at an example where the source legacy system has a business key with a null value and unknown values that do not exist in the dimension table:

 

System_key

System_dkey

System_pkey

System_bkey