Dimensional modeling - basic concepts and terminology

by Joakim Dalby

 

1. Introduction

A Data Warehouse is

 

·  Subject-oriented because it collects business data from different source systems.

·  Integrated because it uses common and consistent business names and units.

·  Time-variant because historical data is kept alongside current data, so data can be viewed as of any given time.

·  Non-volatile because data does not change once loaded and historical data will never be altered.

·  Enriching and processing data to provide new knowledge.

·  A collection of data in support of management's decision-making process and for analyzing the business with business performance measurements and KPIs (Key Performance Indicators) used as measures of how a company is doing.

·  Utilizing dimensional modeling, end users and analysts can easily understand and navigate the data structure and fully exploit the data.

·  A goal of supporting slice and dice of data for business analysis.

·  Driven by a business requirements specification from stakeholders and users.

 

Data from different volatile operational legacy source systems is stored in tables in a relational database called a data warehouse, which is designed to support reporting and data analysis through a single, common data model that is easy to navigate. A table contains columns, also known as attributes or fields. An important tip for naming columns: do not use possessive nouns such as Recipient's birth date; better use CamelCase like RecipientBirthDate, and don't put values into a column name like EmployeeDay/NightCode; better is EmployeeShiftTypeCode.

 

A Business Intelligence (BI) system provides the information that management needs to make good business decisions, and data warehousing is a method that helps the database designer build the system, e.g. based on dimensional modeling principles. An OLAP (OnLine Analytical Processing) cube can sit on top of the dimensional model to present data in tools like Excel, Power BI, Tableau or Targit.

 

In this article I will focus only on the basic concepts and terminology of dimensional modeling. My homepage has other articles about how to design and implement a data warehouse system with levels of data in the ETL process (Extract, Transform, Load). I see dimensional modeling as a kind of superset of the Entity Relationship (ER) data modeling design method for a relational database system for OLTP (OnLine Transaction Processing). Peter Chen, the father of ER modeling, said in his seminal paper in 1976: »The entity-relationship model adopts the more natural view that the real world consists of entities and relationships. It incorporates some of the important semantic information about the real world.« Where ER data models have little or no redundant data, dimensional models typically have a large amount of redundant data. Where ER data models have to assemble data from numerous tables to produce anything of great use, dimensional models store the bulk of their data in a single Fact table or a small number of them, and divide data into different databases called Data Marts (DM). A data mart contains one data area for one purpose and is customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. For example: Sales mart, Customer mart, Market mart, Production mart, Inventory mart, Shipment mart.

 

Dimensional modeling is a technique that seeks to present data in a standard, intuitive framework that allows for high-performance access.

 

Back in the 1970s William H. Inmon began to define the term data warehouse as one part of the overall business intelligence system. An enterprise has one data warehouse with entities in a relational model, and multiple data marts which source their information from the data warehouse. In the data warehouse, data is stored in 3rd normal form. A data mart is a simple form of a data warehouse that is focused on a single subject. Denormalization and redundancy are the norm for data modeling techniques in a data mart. Ralph Kimball: »A data warehouse is a union of all its data marts.« Kimball prefers not to store data in a data warehouse database; his data flow goes through fewer data layers:

 

  1. From source legacy data systems to Input data area (IDA) (sometimes called Raw data, Source, Legacy, Capture Operational Data layer, or Extract layer, but for me extract is an operation and process on data). Empty this area before the ETL process. Data types will be adjusted to fit the receiving database system. Do reconciling between the data warehouse and the source legacy data systems with reconciliation of row counts, mark data as reconciled and report it (to reconcile is in Danish »at afstemme«). Data might be wrong in the input data area or in the archive and later the source legacy data system will resend a new revision of data, therefore it is important to create output views that filter wrong data away and pass correct data as an extraction to the next layer of the data warehouse.
  2. From Input data area to Archive (ARC) or Operational Data Store (ODS). Never empty this area because it is archiving time-variant source data and it will retain historical value changes in the source system. Also called Persistent Staging Area (PSA), Persistence Layer, Data repository or History. Simple data adjustments can be done to gain the same date and amount format and the same data representation of a social security number etc., but they will be in new columns so the original values are unchanged. Each table has columns e.g. ARC_INSERT_TIME (datetime2(7) default SYSDATETIME(), not unique per row), ARC_TS (timestamp, rowversion, unique per row), ARC_RECORD_ID (a sequence number per row per table for uniqueness) to fetch delta data out of the archive that has not yet been loaded, and maybe an ARC_GLOBAL_ID (unique sequence number per row across all tables); see the sketch after this list. Reasons for having an archive can come from data governance and external regulations e.g. Sarbanes-Oxley, Basel I, Basel II, HIPAA. Data Vault is one approach to an implementation of an archive as a back-room.
  3. From Archive to a Data Staging Area (DSA) by extraction to prepare data by cleansing and scrubbing (e.g. trim string, unicode string, max length of string, null gets a value as 0 or empty string '', replace illegal characters, replace value, verification of data type, data syntax and date format), validating data, and deduplication of duplicate or near-duplicate rows by selecting one of the duplicate rows and keeping the others in a variety table. Data enrichment according to business rules, identifying dimension and fact data with derived values e.g. total cost and revenue, elapsed time and overdue time. This is the transformation of data. Empty this area before the ETL process. The DSA will perform data quality checks and filter wrong data into a quality assurance and quality control (QA/QC) database. The DSA continues reconciling data from the archive for referential integrity, i.e. a foreign key value exists as a primary key value, and relationship cardinality rules for dependencies as mandatory, optional or contingent to make sure there is a foreign key or not, whether it can be null or needs a value. Many other value checks, summations and so on for a true validation of data. When all tables in this area are loaded with correct data, the DSA is successfully completed with ValidFrom, ValidTo, derived columns and calculations and is ready for loading to the next layer of the data warehouse.
  4. From Data staging area to Data mart (DMA) with dimensional modeling, conformed or shared dimensions, a star schema around each fact, a surrogate key (artificial key, identity column, a unique sequence number) assigned in each dimension and used in the fact. Keep historical data in dimensions and facts with ValidFrom and ValidTo date/datetime columns. Never empty this area and always backup before the ETL process. The data mart is a front-room for publishing the organization's data assets to effectively support decision making.
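
A minimal sketch of an archive table as described in step 2, assuming SQL Server syntax and a hypothetical Customer source table; the ARC_* columns follow the ones mentioned above:

CREATE TABLE ARC_Customer
(
    ARC_RECORD_ID   BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- sequence number per row per table
    ARC_INSERT_TIME DATETIME2(7) NOT NULL DEFAULT SYSDATETIME(), -- load time, not unique per row
    ARC_TS          ROWVERSION,                                  -- unique per row
    CustomerId      INT NOT NULL,                                -- business key from the source
    CustomerName    NVARCHAR(100) NULL                           -- source value kept unchanged
)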

 

Data capture from source legacy systems is based on two data flow directions:

 

  • Push where Source finds the data and delivers it to an area like a database or a file on ftp, and later the Input data area will fetch the data from that area and indicate that the data is downloaded, received and stored, so Source knows that it can make a new delivery, using a semaphore to control access to a common resource by multiple processes.
  • Pull where the Input data area will fetch data from Source through a SQL Select-From-Where statement, maybe with joins of tables, using a view, or calling a stored procedure with a parameter as a criteria value to limit source data and for doing delta data detection or incremental load. From an ETL tool or a web service.

 

Data capture from source legacy systems is based on multiple data delivery forms:

 

  • Data rowset, recordset, multiple result sets from a relational database.
  • csv file with a separator like comma or semicolon, maybe with a field header.
  • txt file e.g. from Excel saved as Unicode text with tab separator.
  • XML, JSON or another data structure.

 

Data capture from source legacy systems is based on two data delivery methods:

 

  • Full dump where the Input data area gets all data from Source.
  • Incremental load or delta data detection where the Input data area only gets new or changed data from Source, called Change data capture. It needs an initial load the first time it runs. It can be Source that keeps track of the delivered date, or it can be the Input data area that keeps track of a last received SequenceId (integer), SequenceTime (datetime) or InsertUpdateTimestamp value from the Source for each data row, so Input can detect new or changed data by asking for data since that value (InsertUpdateTimestamp > latest value), as sketched below. It could also be posting data e.g. an entry date (or created date), and the next load is for EntryDate > »the last received EntryDate«.
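
A minimal sketch of a pull with delta data detection, assuming a hypothetical source table Orders with an InsertUpdateTimestamp column and a watermark taken from the previously loaded rows:

DECLARE @LastReceived DATETIME2(7) =
    (SELECT ISNULL(MAX(InsertUpdateTimestamp), '1900-01-01')
     FROM IDA_Orders)  -- the initial load starts from the beginning of time

SELECT OrderId, CustomerId, Amount, InsertUpdateTimestamp
FROM SourceDb.dbo.Orders
WHERE InsertUpdateTimestamp > @LastReceived  -- only new or changed rows since the watermark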

 

The data warehouse can also do its own delta data detection between the Input data area and the Archive to identify new, changed and deleted data. Maintaining historical data with ValidFrom and ValidTo datetime stamps is handled by the following actions (sketched in SQL after the list):

 

  • which rows are new: give them a CreatedDateTime or an InsertedDateTime or a flag N for new.
  • which rows have been changed: give them a ModifiedDateTime or a LastUpdatedDate or a flag C for changed or U for updated.
  • which rows have been deleted: give them a DeletedDateTime or a flag D for deleted.
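
A sketch of the delta data detection in SQL, assuming for simplicity that the hypothetical ARC_Customer table holds one current row per business key:

-- New or changed rows: present in the input data area but not with identical values in the archive
SELECT i.CustomerId, i.CustomerName
FROM IDA_Customer i
EXCEPT
SELECT a.CustomerId, a.CustomerName
FROM ARC_Customer a

-- Deleted rows: the business key exists in the archive but is gone from a full dump in the input data area
SELECT a.CustomerId
FROM ARC_Customer a
WHERE NOT EXISTS (SELECT 1 FROM IDA_Customer i WHERE i.CustomerId = a.CustomerId)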

 

Sometimes the source legacy system provides a flag information like N for new data, C for changed data and D for deleted data.

I have seen pairs such as: New – Changed, Created – Modified or Insert – Update.

 

Sometimes the Source updates a data row multiple times but only the most recent version goes to the data warehouse. If Source inserts and updates a row before a new load to the data warehouse, only the updated version goes to the data warehouse. If Source inserts a new row and deletes the same row before a new load to the data warehouse, that row will never go to the data warehouse. It is seldom that source legacy systems have a log and are logging all changes with a revision number or ValidFrom and ValidTo datetime stamps, therefore a data warehouse does not contain 100% full historical updates of data in an organization. Sometimes a Source has for each (or selected) table in the database an extra historical log audit table that contains all inserted, updated and deleted data rows from the original table, together with an Action column (values for Insert, Update, Delete), an EntryDate datetime stamp column that marks when the action occurred, and a SequenceId that is used for delta data detection to pull data from the Source log to the Input data area. An example of a source historical log audit table with a row every time a customer changes:

 

SequenceId_log  EntryDate_log                Action_log  CustomerId  CustomerName
626             2000-05-08 10:12:34.4906240  Inserted    421         Marie Beaulieur
7503            2007-11-16 14:37:07.2353231  Updated     421         Marie Lesauvage
19473           2014-07-15 11:09:33.7629562  Updated     421         Marie Sainte
25003           2017-02-07 23:57:21.9876543  Deleted     421         Marie Sainte

 

Implemented as after-insert/update/delete triggers in the source legacy data system.
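
A sketch of such a trigger for the update case, assuming the log table above with SequenceId_log as an identity column; the insert and delete triggers are analogous, with the delete trigger reading from the deleted pseudo table:

CREATE TRIGGER trg_Customer_Update ON Customer AFTER UPDATE AS
BEGIN
    INSERT INTO Customer_log (EntryDate_log, Action_log, CustomerId, CustomerName)
    SELECT SYSDATETIME(), 'Updated', i.CustomerId, i.CustomerName
    FROM inserted i  -- pseudo table with the rows after the update
END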

 

In a data warehouse architecture and environment I will place data from the source legacy data systems into a database called the Enterprise Data Warehouse (EDW), which is based on Entity Relationship data modeling with super-sub entities and with 80% normalized data. The EDW offers integrated, granular, historic, stable data that has not yet been modified for a concrete usage and can therefore be seen as neutral. It can keep historical data, meaning all the changes to the data or only the day's-end status, e.g. End Of The Day for each data revision from the Archive or Input data area, meaning from the source legacy data systems. Let the design of the data model take a »helicopter perspective« when identifying entities, e.g. a salesperson or engineer will be stored in an Employee entity with a one-to-many to a Jobfunction entity, or with more abstraction in a Person entity where customers and users can be stored too. Convert source data with a timespan start date and end date to a row-wise transaction entity as transactional data that is a sequence of information exchange and can be financial, logistical or work-related, involving everything from a purchase order that becomes an invoice with multiple invoice line items, to shipping status, to employee hours worked, plan and activity records, to subscription periods and to insurance costs and claims. The EDW will be the source for one or multiple Data Marts using dimensional modeling with denormalized data controlled by the ETL process, or using other modeling depending on the use of data in the mart.

 

After data capture, please remember to implement a Reconciliation Summary Report with the results from your different reconciliations, e.g.: Uncategorized assets, Invalid country codes, Derivatives transactions with missing currency code.

 

Dimensional modeling ends up in a star schema or constellation schema (multi-star schema) with fact tables (measures) surrounded by dimension tables (context), where dimensions explain the facts. Dimension and fact conformance is a must in a successful data warehouse implementation to meet the requirements of legislation, accepted practices, prescribed rules and regulations, specified standards, or terms of a contract.

 

Earlier I used the general word data mart for a database with tables of dimensions and facts, but there is a narrower term, Multidimensional database (MDB), for the source database of an OnLine Analytical Processing (OLAP) application like a cube, also known as a Multidimensional online analytical processing (MOLAP) application.

 

I like to end this chapter with some data quality skills and data cleansing examples:

  • Accuracy, like whether the age of a person is fine and not something like 125.
  • Validity, like a date being within an acceptable range and not ten years wrong, else data becomes invalid and should be rejected or corrected.
  • Correction, like changing a wrong year of a date or casting to the right data type.
  • Duplication, like the same customer having addresses spelled in different ways.
  • Consistency, like a zipcode being right and having a city name, or a data mapping like DK for Denmark. Inconsistent values give problems in any database and data warehouse.
  • Completeness, to avoid data values being missing, like phone numbers or a date.
  • Conformity, to make data from different source systems into conformed and approved values that take care of case-sensitive words through translation, mapping or bridging rules to create one and only one truth.
  • For fact tables one truth depends on the use, because a fact for Finance and a fact for Marketing may look the same but their business requirements specifications have small differences.

 

Properties of big data:

  • Volume refers to the amount of data e.g. from GB over TB to PB as data at scale and transactions.
  • Velocity refers to the speed of change and data processing for the timeliness of data, data in motion from batch in the night, periodic loads e.g. three times a day, near real time with some delay and real time with almost no delay, or streaming data with no delay as interactive.
  • Variety refers to data in many forms and the number of types of data from relational databases, image, photo, audio, video, multimedia, sensor, web logs from social network data, smartphones and unstructured text.
  • Variability refers to an increase in the range of values typical of a large data set and non-standard data.
  • Value refers to data of many values, which addresses the need for valuation of enterprise data e.g. statistical, events, correlations, hypothetical, leverageable.
  • Veracity refers to data uncertainty due to data inconsistency and incompleteness, managed through the reliability, trustworthiness, authenticity, origin, reputation, availability, accountability, quality and predictability of inherently imprecise data types.
  • Visibility refers to data in the open and being able to handle the European General Data Protection Regulation (GDPR) and issues of privacy, security and provenance.
  • Visualization refers to the flexible viewing of data.
  • Viscosity, Validity, Venue, Vocabulary and Vagueness are other V-words.

 

2. Grain (granularity) of data warehouse

The grain is the level of detail of data in the table, both in dimension tables by the hierarchy and in fact tables by the definition of the measurement event of interest, and the grain can later be expressed in terms of the dimensions. The lowest grain keeps all relevant data from legacy systems and is therefore the most flexible approach, but it also takes the most storage space and can easily cost query performance. To improve query performance the grain can be lifted to a higher level where data is aggregated and summarized and therefore takes less storage space. Data can also be divided such that current year data is in daily grain, the previous year in weekly and older data in monthly, because very detailed information is normally not relevant for analysis years back in time. Granularity of a fact table can be divided into three types:

 

  • Transaction grain as the finest level of detail with one row per transaction together with a transaction datetime like an entry date, sales date or simply a created date, inserted time or registered date of the row.
  • Periodic grain with one row per time period or time span like a week or month.
  • Aggregated/Accumulating grain as a more summarized level like one row for the entire lifetime of an event with a clear beginning and end.

 

Aggregation is the process of calculating summary data from detail base level table rows (records) and is a powerful tool for increasing query processing speed in data marts. For example, a sale is a fact with measures like quantity sold and amount, and dimensions like product, customer and date of purchase that bring the sale into a context. The grain of the sale is limited to a date (like December 23) and a time (like 4:30 pm) and therefore the fact is on Transaction grain. In case we drop the time, the measures could be called TotalQuantitySold and TotalAmount because they are the result of a summation of the sales times within each sale date, and therefore the fact is on Periodic grain; the same applies if we decide to summarize the sales date to a weekly or monthly level. In case we decide to grain the customers by aggregating them into segments that do not contain names and addresses, then the fact becomes Aggregated/Accumulating grain. The product has a three level hierarchy of category name, brand name and product name, and therefore we can say that product has the highest level of detail.

Instead of summing away the sales time or sale date and losing some important information about the date of purchase, we can summarize data into two other fact tables at a weekly level for data from the last year and a monthly level for older data, because when we go back in time we do not need analyses at a daily or weekly level, and by aggregation we save hard disk space and improve query performance because fewer rows at month level need to be summarized to fetch the year level data.

The lowest level of aggregation or the highest level of detail is referred to as the grain of the fact table.

 

3. Fact table

A fact table contains columns of dimensions and columns of measures. A fact table can contain fact data on a detail or aggregated level depending on the grain approach. All measure columns in a fact table should have the same grain, like a daily or monthly level. A fact table typically has two types of columns: foreign keys to dimension tables, and measures that contain numeric facts. Every fact table has a primary key defined by a subset of the columns in the table, but normally a surrogate key identity column (a unique sequence number) is used instead. The ETL process has to secure the natural key, so fact rows are distinct. Fact tables in a dimensional model express the many-to-many relationships between dimensions and are implemented as one-to-many relationships between dimension tables and fact tables.
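
A minimal sketch of a sales fact table with a surrogate key, foreign keys to dimensions and additive measures, assuming hypothetical dimension tables:

CREATE TABLE FactSales
(
    FactSalesId   BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    DimDateId     INT NOT NULL FOREIGN KEY REFERENCES DimDate(DimDateId),
    DimCustomerId INT NOT NULL FOREIGN KEY REFERENCES DimCustomer(DimCustomerId),
    DimProductId  INT NOT NULL FOREIGN KEY REFERENCES DimProduct(DimProductId),
    Quantity      INT NOT NULL,            -- additive measure
    Amount        DECIMAL(19, 2) NOT NULL  -- additive measure
)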

 

3.1. Types of Fact tables

Let us name some different types of fact tables, or more exactly different types of columns in a fact table. Conforming facts means making agreements on common business metrics such as key performance indicators (KPI) across separate source legacy systems so that these numbers can be compared mathematically for calculating differences and ratios.

 

Additive fact

A fact table has numerical measures that can be summed up for all of the dimensions in the fact table, so the measure columns' data type is a number. A sales fact is a good example of an additive fact with measures like Quantity sold and Amount. In case a transaction dataset for a fact table refers to a measure column whose value is empty or null, use the default value 0 because this won't bother aggregation like summation. Each measure must have its metrics: if it is a monetary measure, it may have a currency field, and if it is a unit measure it may have a field to explain the kind of units used like centimeters, litres, cubic metres etc. A fact can have a calculated measure e.g.: Surplus or Profit = Revenue – Costs.

 

Semi-Additive fact

A fact table has measures that can be summed up for some of the dimensions in the fact table and not for other dimensions. For example, a daily balance measure can be summed up through the customer dimension but not through the time dimension.

 

Non-Additive fact

A fact table has measures that cannot be summed up for any of the dimensions in the fact table. For example, a room temperature fact is non-additive, and summing the temperature across different times of the day produces a totally non-meaningful number. However, if we take an average of several temperatures during the day, we can produce the average temperature for the day, which is a meaningful number. Other examples are ratio calculated measures, or a table that only contains a column with a transaction number such as an order number, invoice number or a voucher number that can't be summed up. An order fact can have measures like Unit price and Discount percentage, where a Unit price for a single product makes no sense to summarize, but the derived column Amount = Unit price x Quantity is to be summarized and becomes an additive column called a calculated measure. Trend, Stock and Ranking can't be added, and in general neither can any calculation on one specific intersection of the dimensions.

 

Factless fact

A fact table that contains no measures is called a factless or measureless fact. For example, a fact table which has only columns for employee, date, time and type, where type is like »workstart«, »workstop« and »workstopofsickness«, and there are no columns for measures. You can get the number of employees working over a period by a »select count(distinct EmployeeId)« or by a distinct row-count calculated measure in your OLAP cube. For a factless fact you will normally count the number of rows, a row count, as a normal measure and call it »Number of <a name>«. A factless fact is for an event or assignment, i.e. attendance taking place in a school class, and by adding a column Attendance with 1 or 0 per date, student, class, room and professor the fact table gets a measure.

 

Derived fact, Aggregate fact, Summarized fact and Consolidated fact

A fact table that is created by performing a mathematical calculation on a fact table or on multiple fact tables, and therefore referred to as a pre-calculated fact with summarised aggregate measures at a higher level of one or more dimensions to reduce query time greatly and eliminate incorrect queries. For example, a Month derived fact table that is a summation of rows of dates from another fact table. A Year-to-Date (ytd) derived fact where month February is a summing up or roll up of January and February and so forth. A last-year-this-year derived fact with calculation of an index compared to last year as a new column, easy to display in a report. An aggregate fact table is a simple numeric roll up of atomic fact table data built solely to accelerate query performance. A consolidated fact table is often convenient for combining facts from multiple processes into a single fact table if they can be expressed at the same grain. E.g. sales actuals can be consolidated with sales forecasts in a single fact table to make the task of analyzing actuals versus forecasts simple and fast.

 

Timespan fact

A fact table used for a data source legacy system that is regularly updated, meaning that the source changes and overwrites its values. To capture the timespan when the fact row was effective, the fact table will act as an SCD type 2 dimension with ValidFrom and ValidTo columns to keep historical data and to represent the span of time when the fact row was the »current truth«. This is called Slowly Changing Facts. See more in section 6.3.

 

Counterpart fact and Transactional fact

A fact table used for Slowly Changing Facts because the data source legacy system is changing fact values without keeping the old values as historical transactions. See more in section 6.3.

 

Exploded fact

A fact table containing a huge number of rows, where a period like from a »date of employment« to a »termination date« as columns in one row will be turned around to many rows with one row per day of the period, or per week or month depending on the wanted grain of the period. When an employee has a 10 year anniversary it will make more than 3650 rows at day grain, and if the employee dimension keeps history of different names, departments and job functions etc. for each staff member, then the fact rows will have different key references to the dimension. Another grain could be a 15 minute grain, which gives 96 rows per day for a wind turbine power production. When the exploded fact is a source for an OLAP cube it can sometimes be implemented as a view in the database, but when it is for ad hoc reporting and will be used several times per day it must be a materialized view stored in a table.

 

Column wise fact and Row wise fact

A column wise fact table is useful for columns in a paper report e.g.

   Revenue Jan, Revenue Feb, Cost Jan, Cost Feb, Sales Jan, Sales Feb.

For a cube a row wise fact is much better because it gives good dimensions e.g.

   Period, Entry, Amount.

Therefore a data warehouse needs to convert from columns to rows or vice versa.
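
A sketch of the column wise to row wise conversion with T-SQL UNPIVOT, assuming a hypothetical ColumnWiseFact table where the six month columns share one data type:

SELECT Entry  = LEFT(ColumnName, LEN(ColumnName) - 3),  -- Revenue, Cost, Sales
       Period = RIGHT(ColumnName, 3),                   -- Jan, Feb
       Amount
FROM ColumnWiseFact
UNPIVOT (Amount FOR ColumnName IN
         (RevenueJan, RevenueFeb, CostJan, CostFeb, SalesJan, SalesFeb)) u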

 

Accumulating snapshot fact

A fact table where a row summarizes the measurement events occurring at predictable steps between the beginning and the end of a process. For example, a source legacy system for current payments from customers where some pay several times over a month: the first payment becomes a new row in the fact table with the date of payment in columns BeginDate and EndDate and the amount in a Paid column. The next ETL process will do a summation of payments per customer from the BeginDate to the current date or end-of-month date, and then update the fact row with the same BeginDate with the new summarized payment and new EndDate, so a fact row will be revisited and updated multiple times.
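
A sketch of the revisit-and-update step, assuming hypothetical table and column names:

UPDATE f
SET f.Paid = s.TotalPaid,           -- new summarized payment for the period
    f.EndDate = s.LastPaymentDate   -- the span is extended to the latest payment
FROM FactPayment f
         INNER JOIN
         (SELECT DimCustomerId, TotalPaid = SUM(Amount), LastPaymentDate = MAX(PaymentDate)
          FROM StagingPayments
          GROUP BY DimCustomerId
         ) s ON s.DimCustomerId = f.DimCustomerId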

 

Periodic snapshot fact

A fact table that describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive measures. It is a table with frozen data, meaning a row will never be changed/modified, because the row can have been used in a report like a monthly or annual report and later it is a must to be able to create the same report with the exact same data. The periodic snapshot fact table will never be truncated/deleted or updated, therefore it is a true incremental insert load, and the table has a column like Month (yyyymm) or Year to fetch the wanted data as a month slice or year slice. Measures can be the balance of an account or the inventory level of products in stock and so on.

 

3.2. Other fact table classifications

Transaction has one row per transaction as it occurs, together with a datetime.

Periodic has one row for a group of transactions made over a period of time through summation, like from daily grain to monthly grain, so the Date dimension is represented by the month level with the first day of each month. Notice that certain dimensions are not defined when compared to a transaction fact, such as dimension TransactionType.

Aggregated/Accumulating has one row for the entire lifetime of an event and is therefore constantly updated over time, for example an application for a bank loan until it is accepted or rejected, or a customer or working relationship. These fact tables are typically used for short-lived processes and not constant event-based processes, such as bank transactions.

 

Mostly a fact table describes what has happened over a period of time and therefore contains additive or cumulative facts. An example is a status column in a fact table that receives data from a school system where a student follows a course and later finishes it, but sometimes a student skips the course and is deleted in the system. Before reloading the fact it can be a good idea to have a CourseStatus column with values like: Active, Completed or Dropped.

 

4. Dimension

A dimension table contains business element context; its columns contain element descriptions, and the dimension is referenced by multiple fact tables so the contained measurements make sense. A dimension table has as a minimum these types of columns:

Primary key is a surrogate key identity column, a unique sequence number, to remove dependence on the source legacy system and for use as a foreign key in the fact table. It can be called Entity key (EK) because it represents an entity from the source, or Surrogate key (SK), but for me »surrogate« is a property, not the name of a column.

Business key is the primary key (operational key) of the legacy system, like an Id or Number that is unique and normally the primary key in the database of the source.

Natural key is another candidate key of the source legacy system based on data in the real world, like a Code or a unique Name.

Textual data represents the dimension value's context description, is saved in columns of the dimension table and will be shown to the users as descriptive attributes.

 

In a Customer dimension table we will have columns from the source legacy system like CustomerId, CustomerNumber and CustomerName together with ShippingAddress and Zipcode, Regioncode, Country, Age, Gender, and what about the BillingAddress and so on. CustomerId is the business key, and the Customer dimension primary key could be named DimCustomerId, CustomerEntityKey or EK, SK for surrogate key, Customer_key etc. »A beloved child has many names.«
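
A minimal sketch of such a Customer dimension table, assuming SQL Server syntax:

CREATE TABLE DimCustomer
(
    DimCustomerId   INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerId      INT NOT NULL,           -- business key from the source legacy system
    CustomerNumber  NVARCHAR(20) NOT NULL,  -- natural key
    CustomerName    NVARCHAR(100) NOT NULL, -- textual data
    ShippingAddress NVARCHAR(200) NULL,     -- textual data
    Zipcode         NVARCHAR(10) NULL       -- textual data
)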

 

The dimension values can be placed in a hierarchy like a Location with levels Country → Region → City, and in a group like an Age group with members Child (0-12), Teenage (13-19), Adult working (20-66), Senior citizen (67-130). A dimension normally contains one or multiple hierarchies and/or groups to fulfill requirements from the users. A dimension can of course be non-hierarchical and non-grouping.

 

4.1. Changing Dimensions

Dimension values will normally change over time, such as a customer or product changing name, or a hierarchical structure being reorganized. The changes can be placed into two kinds of classifications, and afterwards we will look into some techniques to handle and track changes and to capture history and preserve the life cycle of source legacy data, also called Change Data Capture (CDC).

 

Slowly Changing Dimensions SCD

Columns of a dimension that undergo changes over time. It depends on the business requirement whether a particular column's history of changes should be preserved in the data warehouse or data mart.

 

Rapidly Changing Dimensions RCD

A dimension column that changes frequently. If you do need to track the changes, using a standard Slowly Changing Dimensions technique can result in a huge inflation of the size of the dimension. One solution is to move the column to its own dimension with a separate foreign key in the fact table. Rapidly changing data usually indicates the presence of a business process that should be tracked as a separate dimension or as a fact table, or better into separate historical data tables.

 

Techniques or methods to handle dimension values that change over time in the source legacy systems, also called Ralph Kimball's eight types or approaches:

 

Type 0: Original value, so dimension values never change (the method is passive), meaning keeping the original value from the legacy system; when the value is changed in the legacy system, we don't change the value in this dimension. »Retain original«.

Type 1: Current value, so the old dimension value will be changed and forgotten when the value is changed in the legacy system. The history of data values is lost forever. The latest or actual value of the dimension. The fact table refers to the most recent dimension value, the as-is key. »Current by overwrite«.

Type 2: Keep all values, an unlimited history of dimension values over time marked by Effective date and Expiration date (Active date, Expired date, Expiry date) (or StartDate and StopDate, BeginDate and EndDate, or ValidFrom and ValidTo), a pair of data type »date« or »datetime« that represents the span of time when the value was the »current truth«. The fact table refers to the dimension value in effect when the fact data occurred, the as-was key, often by a date column in the fact table based on source data or the current load date when the fact data was entered into the fact table. »Keep history in rows«. This is the technique for Slowly Changing Dimensions.

Type 3: Keep the last value, where the previous dimension value is stored in a Previous_ column (or Historical_ column) and the current dimension value in a Current_ column. »Keep history in columns«. This is a technique for Slowly Changing Dimensions.

Type 4: Fast changing values in many dimension value columns are split into one or more separate Mini Dimensions. »Keep history in tables«. This is the technique for Rapidly Changing Dimensions to store all historical changes in separate historical data tables.

Type 5: Builds on the type 4 Mini Dimension by embedding a current profile mini dimension key in the base dimension that is overwritten as a type 1 Current value column. Therefore 4 + 1 = type 5. »Sub class dimension«.

Type 6: Some columns follow type 2 with a key column and maybe other columns follow type 1. »Hybrid«. Can also have columns for type 3, therefore 3 + 2 + 1 = type 6.

Type 7: All rows follow type 2 to keep track of historical values with a key column, and an extra key column called a durable key follows type 1 for the current value together with EffectiveDate and ExpirationDate. The durable key is an integer representation of the business key. The fact table contains dual foreign keys for a given dimension, to show the historical value of the dimension at the time the fact data was created, born or occurred, and to show the current value of the dimension. A view upon the dimension will provide the current values. »Dual Type 1 and Type 2 Dimensions«. Section 6.2 will show an example of type 7 with audit columns, also called housekeeping or support columns.

 

[Kimball types remind me of normal forms in database design, where the last types, like fourth and fifth normal form, are not as easy to understand.]

 

An example of type 1 and type 2

A customer has Id 421 in a source legacy system and her maiden name is Marie Beaulieur, and she is inserted into a Customer dimension. One day she gets married and takes her husband's surname, therefore her name is changed to Marie Lesauvage.

 

When the dimension is a type 1, her maiden name will be overwritten and forgotten. DimCustomerId is a surrogate key identity column, a unique sequence number.

 

Before:

 

DimCustomerId  BusinessKey  CustomerName
1              176          Hans Andersen
2              359          Joe Watson
3              421          Marie Beaulieur

 

After:

 

DimCustomerId  BusinessKey  CustomerName
1              176          Hans Andersen
2              359          Joe Watson
3              421          Marie Lesauvage

 

When the dimension is a type 2, her maiden name will remain and be remembered, and her changed name will continue in an extra row. To keep track of when a value changed, there are two datetime columns called ValidFrom and ValidTo. Her maiden name was valid from the beginning of time, e.g. 1900-01-01, and valid to 2007-11-16, meaning that date is not included, because she got married on 2007-11-16, and her changed name is valid from that date until the end of time, e.g. 9999-12-31.

 

DimCustomerId  BusinessKey  CustomerName     ValidFrom   ValidTo
1              176          Hans Andersen    1900-01-01  9999-12-31
2              359          Joe Watson       1900-01-01  9999-12-31
3              421          Marie Beaulieur  1900-01-01  2007-11-16
4              421          Marie Lesauvage  2007-11-16  9999-12-31

 

Note: Kimball page 508: If ValidFrom and ValidTo are datetime stamps, then the ValidTo must be set to exactly the ValidFrom of the next row so that no gap and no overlap exists between rows, giving continuous time. Other kinds of time are event time (a point in time) and interval time with gaps or overlaps.

 

[In Danish we say that ValidFrom is »fra og med« (from and including) and ValidTo is »til og ikke medtaget« (to and not including), because the second row with the same datetime in ValidFrom as the previous row's ValidTo represents a new period from that datetime.]

 

Marie Lesauvage remarried on 2014-07-15 and took her new husband's surname, therefore her name is changed to Marie Sainte and becomes a new row in the table.

 

DimCustomerId  BusinessKey  CustomerName     ValidFrom   ValidTo
1              176          Hans Andersen    1900-01-01  9999-12-31
2              359          Joe Watson       1900-01-01  9999-12-31
3              421          Marie Beaulieur  1900-01-01  2007-11-16
4              421          Marie Lesauvage  2007-11-16  2014-07-15
5              421          Marie Sainte     2014-07-15  9999-12-31

 

The Business key is the »glue« that holds the multiple records together.

 

A dimension can have extra columns for audit, housekeeping or support, such as IsCurrent, IsDeleted, IsInferred or a Status column with values e.g. Current, Deleted, Expired.

 

Marie has made six purchases which are entered into a fact table where her date of purchase determines the value in column DimCustomerId according to this rule:

 

   ValidFrom <= PurchaseDate AND ValidTo > PurchaseDate

 

PurchaseDate  DimCustomerId  DimProductId  Quantity  Amount
2005-12-01    3              32            2         140
2010-06-30    4              17            1         65
2014-05-08    4              32            3         300
2014-08-01    5              21            2         90
2016-03-15    5              29            1         70
2016-03-15    5              32            1         110

 

For each row in the fact table the value in column DimCustomerId points to a row in the dimension table and gives the name of the customer at the time the purchase occurred or was registered. The DimCustomerId column in the fact table represents as-was when it is joined to the dimension table.
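
A sketch of the ETL lookup that applies this rule when loading the fact, assuming a hypothetical staging table holding the purchase rows with the business key from the source:

SELECT f.PurchaseDate, d.DimCustomerId, f.Quantity, f.Amount
FROM StagingSales f
         INNER JOIN DimCustomer d
            ON d.BusinessKey = f.CustomerId
           AND d.ValidFrom <= f.PurchaseDate
           AND d.ValidTo > f.PurchaseDate  -- picks the as-was row of the dimension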

 

A report of purchases may want to show the customer's recent name (current or latest name) independent of the purchase date; e.g. 2005-12-01 has DimCustomerId value 3 but must show the latest name of Marie, as for value 5. It is normally solved with a database view that finds the recent CustomerName for each value of DimCustomerId, with this result:

 

DimCustomerId  CustomerName
1              Hans Andersen
2              Joe Watson
3              Marie Sainte
4              Marie Sainte
5              Marie Sainte

 

A type 2 current view dimension can be implemented like this:

 

CREATE VIEW DimCustomer_Current AS

SELECT d.DimCustomerId, c.CustomerName

FROM DimCustomer d

         INNER JOIN

         (SELECT BusinessKey, CustomerName

          FROM DimCustomer

          WHERE ValidTo = '9999-12-31'  -- datetime2(7) '9999-12-31 23:59:59.9999999'

         ) c ON c.BusinessKey = d.BusinessKey

 

The view will always select the current value for each Id value in the dimension and show the result or recordset as shown in the table above.

 

If your SQL has windowing functions, the view can avoid the self join, which will enhance performance, and there is no use of the ValidTo column:

 

CREATE VIEW DimCustomer_Current AS

SELECT DimCustomerId, CustomerName = LAST_VALUE(CustomerName)

            OVER(PARTITION BY BusinessKey ORDER BY ValidFrom

            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

FROM DimCustomer

 

Used in a join with the fact table to fetch the current value for each fact row:

 

SELECT d.CustomerName, f.Quantity, f.Amount

FROM FactSales f

         INNER JOIN DimCustomer_Current d

            ON d.DimCustomerId = f.DimCustomerId

 

The DimCustomerId column from the view represents as-is when it is joined to the fact table.

 

The disadvantage of the first view for the current value of the dimension is the join on the business key, because it can be a string representation or a composite business key composed of multiple columns that all need to be part of the inner join, which can cost performance when the dimension has many rows. With the type 7 dimension Kimball introduces an integer representation of the business key, which he calls a durable key, an artificial surrogate key value per business key, see more in section 6.2.

The disadvantage of the second view is that the windowing function must be repeated for each column that is to be displayed for the dimension.

 

Temporal tables in SQL Server 2016 are closest to type 4 because they keep history in a separate table while the original dimension table keeps the latest (current) dimension member versions.
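
A sketch of a system-versioned temporal variant of the DimCustomer table, assuming SQL Server 2016 syntax; the period columns are maintained by the database engine:

CREATE TABLE DimCustomer
(
    DimCustomerId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    BusinessKey   INT NOT NULL,
    CustomerName  NVARCHAR(100) NOT NULL,
    ValidFrom     DATETIME2(7) GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo       DATETIME2(7) GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.DimCustomer_History))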

 

In case you want to display the original value of the dimension as type 0:

 

CREATE VIEW DimCustomer_Original AS

SELECT DimCustomerId, CustomerName = FIRST_VALUE(CustomerName)

            OVER(PARTITION BY BusinessKey ORDER BY ValidFrom

            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

FROM DimCustomer

 

There is another type 2 approach that does not need the ValidTo column in the dimension table, and therefore the ETL process shall not update the previous row, because a ValidTo column can be calculated when it is needed to determine the dimension Id for the coming fact row:

 

CREATE VIEW DimCustomer_ValidTo AS

SELECT DimCustomerId, CustomerName, ValidFrom,

            ValidTo = LEAD(ValidFrom, 1, '9999-12-31')

            OVER(PARTITION BY BusinessKey ORDER BY ValidFrom)

FROM DimCustomer

 

 

4.2. Types of Dimensions

Let us characterize the various dimensions as different types of dimensions.

 

Conformed dimension or Shared dimension

A dimension that is used in multiple areas is called a Conformed or Shared dimension and may be used with multiple fact tables in the data mart or across multiple data marts. Dimension values come either from the legacy systems or are built in the Enterprise Data Warehouse by business rules. The dimension values can be placed in hierarchies and/or groups as described in section 4 to fulfill requirements from the users.

 

Date dimension or Calendar dimension

Very common dimension with hierarchies such as

Year → Half year → Quarter → Month → Date (five levels) or

Year → Week → Date (three levels) because a week does not always belong to one month. The calendar date dimension has a granularity of a single day. I like to use an integer value as surrogate key identity column in format yyyymmdd e.g. 20131224 for Christmas day, therefore the fact table will contain useful information that can be used in a query like

»between 20131201 and 20131231« for month of December 2013.
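
A sketch of deriving the yyyymmdd integer key from a date value, assuming a hypothetical staging table with a SaleDate column:

SELECT DimDateId = YEAR(SaleDate) * 10000 + MONTH(SaleDate) * 100 + DAY(SaleDate)  -- e.g. 20131224
FROM StagingSales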

 

Time dimension

A normal hierarchy Hour → Minute → Second, or an aggregation hierarchy that rolls up time periods into more summarized groupings like this:

Morning (6 am to 8 am)

Rush hour (8 am to 11.30 am and 4 pm to 6 pm)

Lunch hour (11.30 am to 1 pm)

Afternoon hour (1 pm to 4 pm)

Dinner hour (6 pm to 8 pm)

Evening hour (8 pm to 11 pm)

Night hour (11 pm to 6 am)

I like to use an integer value as surrogate key identity column in format hhmmss e.g. 63000 for 6:30:00 am and 202530 for 8:25:30 pm, because 8 pm is also called 20 o'clock in the evening.

 

Regular dimension

All dimension values (or branches in the hierarchy) have the same number of levels, which makes the dimension symmetrical or balanced, such as the Date dimension or the hierarchy Country → Region → City. But with several columns the table is not in third normal form, so the table contains redundant data; that is okay in a data mart. Dimensions contain a large amount of duplicate data by nature; we can say the dimension is denormalized.

 

Snowflake dimension

The implementation of a dimension hierarchy that is built of two or more tables to avoid redundant or denormalized data, so the structure is in third normal form (3NF). For example, a hierarchy Country → Region → City will be split into three tables so only the column City remains in the Customer dimension that is connected to the Sales fact table, and then two Snowflake dimension tables with Region and Country, which could also be used in other dimensions like Supplier and Store. Snowflake dimensions are normally rare because of the query performance with more tables in the join, and because users typically use different levels of the hierarchy in making reports.

 

Outrigger dimension or Reference dimension

It is a term for when one dimension table is referenced in another dimension table without that reference being found in any fact table. For example, a Product dimension has a column called OfferDate that represents the date when the product will be available for the customers, and from that date the product can appear in the Sales fact table. The OfferDate column is a foreign key to the Date dimension, therefore it is a reference column and the Product dimension is called an Outrigger dimension. Outrigger dimensions are permissible, but should be used sparingly. In most cases, the correlations between dimensions should be demoted to a fact table, where both dimensions are represented as separate foreign keys.

 

Parent-child dimension

Used to model flexible hierarchical structures where some dimension values have different levels of hierarchy, called unbalanced. Every value in the dimension has a related parent (mother) value, except the top value. The hierarchy is asymmetrical or unbalanced because values are placed on different levels within the same hierarchy. An example is an Employee dimension where the parent is the manager and the children are the employees under the manager, and some employees are both a manager and have another manager above them, and the chain of command can be different from department to department. Another example is an Organization dimension where some departments have sub-departments and some teams have sub-teams, but there are also teams that don't have sub-teams. This flexibility is the strongest side of the Parent-child dimension.
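
A sketch of querying a parent-child dimension with a recursive common table expression, assuming a hypothetical DimEmployee table with EmployeeId, ManagerId (null for the top value) and EmployeeName:

WITH Hierarchy AS
(
    SELECT EmployeeId, ManagerId, EmployeeName, Lvl = 0  -- the top value has no parent
    FROM DimEmployee
    WHERE ManagerId IS NULL
    UNION ALL
    SELECT e.EmployeeId, e.ManagerId, e.EmployeeName, h.Lvl + 1  -- children one level below their parent
    FROM DimEmployee e
             INNER JOIN Hierarchy h ON h.EmployeeId = e.ManagerId
)
SELECT EmployeeId, ManagerId, EmployeeName, Lvl
FROM Hierarchy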

 

Ragged dimension

A ragged dimension contains at least one member whose parent belongs to a hierarchy that is more than one level above the child, and is normally implemented as Parent-child or by making dummy levels in a Regular dimension, e.g. Europe → Denmark → Copenhagen or North America → United States → California → Sacramento with four levels continent → country → state → capital, so we make a null state for Denmark.

 

Static dimension or Constant dimension

Static dimensions are not extracted from the legacy system, but are created within the context of the data warehouse or data mart. A static dimension can be loaded manually, for example with Status codes, or it can be generated by a procedure, such as a Date dimension. The opposite could be called a Dynamic dimension.

 

Aggregate dimension

Dimensions that represent data at different levels of granularity to give higher performance. Can also refer to hierarchies inside a dimension with a higher grain.

 

Shrunken dimension

A subset of another dimension's columns that applies to a higher level of summary; for example, a Month dimension would be a shrunken dimension of the Date dimension. The Month dimension could be connected to a forecast fact table whose grain is at the monthly level, while the Date dimension is connected to the realized fact table. Another example is an Order fact table that may include a foreign key for Product, while the target fact table only wishes to include a foreign key for ProductCategory, which is in the Product table but less granular. The Product dimension can be snowflaked with a separate table for ProductCategory, or there can be a map table between the two tables Product and ProductCategory so they are not directly related. Sometimes users want few dimension values like Red, Yellow and Green, and they want Pink to become Red and Orange to become Yellow and so forth, and the rest of the colours get a residual value called Others. I will make a Colour dimension for the colours from the source data and an extra ColourGroup dimension for the four values Red, Yellow, Green and Others. I will make a map or bridge table that translates the colours like Red to Red, Pink to Red, Yellow to Yellow and Orange to Yellow and the rest of the colours to Others. In the loading of the fact table I will let the colours go through the map and fetch the ColourGroup dimension key into the fact table to obtain good performance for various statistics for the users, as sketched below. Shrunken rollup dimensions are required when constructing aggregate fact tables.
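
A sketch of the colour mapping during fact loading, assuming hypothetical tables MapColour (Colour, ColourGroup) and DimColourGroup:

SELECT s.SalesId, g.DimColourGroupId, s.Quantity
FROM StagingSales s
         INNER JOIN MapColour m ON m.Colour = s.Colour  -- e.g. Pink maps to Red, Orange to Yellow
         INNER JOIN DimColourGroup g ON g.ColourGroupName = m.ColourGroup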

 

Mini dimension

For Rapidly Changing Dimensions and used for managing high frequency and low cardinality changes in a dimension, where the fast changing columns are typically placed in the mini dimension with its own surrogate key identity column which is included in the fact table. A dimension table will be split into two tables, one with type 0 or 1 columns and the other with type 2 or 4 columns. For example, a Customer dimension with columns Name, Address and Country → Region → City in one table, and the fast changing columns MonthlyIncome and BodyWeightAtPurchaseTime in another mini dimension table. When many customers are to be placed in groups based on sex, marital status and income interval e.g. $0-10,000, $10,000-25,000, $25,000-50,000, I will place these groups as rows in a mini dimension table and give the Customer dimension a foreign key to the mini dimension table. Normally there is no hierarchy in a mini dimension. For an OLAP cube I would prefer to merge the two tables into one single dimension. Sometimes it is necessary to have two or more Mini dimensions if the columns change rapidly at different times.

 

Monster dimension

A very large dimension.

 

Heterogeneous dimension

Several different kinds of entries with different columns for each fact (like sub classes). For example, heterogeneous products have separate unique columns and it is therefore not possible to make a single product table to handle these heterogeneous products.

 

Multivalued dimension

a) To handle when one fact data row has two or more dimension values from the same dimension. For example, a Sales fact can have up to four different sales staff employees, and of course one sales staff employee has many sales. Therefore we say that there is a many-to-many relationship between the Sales fact and the Employee dimension. It is implemented by an extra table called a Bridge dimension table (SaleEmployeeBridge) that contains the Id key of a fact data row and has rows for all the dimension values connected to that fact data row by the Id key in the Employee dimension.

b) To handle when one dimension data row has two or more dimension values from another dimension. For example, one bank account has two bank customers like wife and husband, and of course each bank customer can have several accounts like budget, saving and pension. Therefore we say that there is a many-to-many relationship between the Customer dimension and the Account dimension. The BankAccount fact table will refer to the Account dimension, and the Account dimension refers to an AccountCustomerBridge table that again refers to the Customer dimension, so the BankAccount fact table will not refer directly to the Customer dimension. The AccountCustomerBridge table contains two columns DimAccountId and DimCustomerId as a composite primary key.
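
A minimal sketch of the bridge table from example b), assuming the two dimension tables exist with integer surrogate keys:

CREATE TABLE AccountCustomerBridge
(
    DimAccountId  INT NOT NULL FOREIGN KEY REFERENCES DimAccount(DimAccountId),
    DimCustomerId INT NOT NULL FOREIGN KEY REFERENCES DimCustomer(DimCustomerId),
    PRIMARY KEY (DimAccountId, DimCustomerId)  -- composite primary key
)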

 

Degenerate dimension

Degenerate dimension values exist in the fact table, but they are not foreign keys and they do not join to a real dimension table. The dimension value is stored as part of the fact table and not in a separate dimension table; it is typically used for lowest grain or high cardinality dimensions such as voucher number, transaction number, order number, invoice number or ticket number. These are essentially dimension keys for which there are no other columns, so a degenerate dimension is a dimension without columns or hierarchy. For example, the OrderNumber column can be in the Order fact table with several rows using the same order number, because one order can contain several products. Therefore the OrderNumber column is important to group together all the products in one order. Later, when searching for an order number in an OLAP cube, an Order number dimension is very useful, but it is not a dimension table; it is generated from the Order fact table.

A degenerate dimension is also used as an implemented Bridge dimension table in making a Multivalued dimension. In a data mart these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Normally a degenerate dimension is not a table in a database; it is a view with distinct values based on the fact table.

 

Dimensionalised dimension

Dimensionalised dimension is a replacement for a Degenerate dimension where the view becomes a materialized view, meaning it becomes a table in the data mart and the text column is transformed to an integer key, like this:

 

SELECT QuantityPerUnitKey = CONVERT(BIGINT, HASHBYTES('MD5', t.QuantityPerUnit)),  -- the last 8 bytes of the MD5 hash become the integer key
            t.QuantityPerUnit
    INTO Dim_QuantityPerUnit  -- materialized as a table in the data mart
FROM (SELECT DISTINCT QuantityPerUnit
           FROM Products WHERE QuantityPerUnit IS NOT NULL) t

 

Bridge dimension

The bridge is to connect or map data that has a many-to-many relationship between a fact and a dimension or between a dimension and a dimension.

Instead of the word Bridge table, the term Helper table could be used.

 

Role playing dimension or Repeated dimension

A role-playing dimension is repeated two or several times in the same fact, e.g. a Date dimension whose key column is repeated in three foreign key columns in a fact table for three roles labeled SaleDate, ShipDate and DeliveryDate. Here the Date dimension takes multiple roles to map the dates, and the user will get three date dimensions with their own names. Another example is a boolean dimension with key values 1/True and 0/False and texts »Yes« and »No« that is repeated in many different kinds of status dimensions like 'In stock' or 'Delivery made'. A City dimension can be repeated in multiple roles in a fact table of persons with columns like: BirthplaceCity, ResidenceCity, WorkingplaceCity and DeathCity.

 

Junk dimension or Garbage dimension

A single table with a combination of different and unrelated columns to avoid having a large number of foreign keys in the fact table, thereby decreasing the number of dimensions in the fact table. The content of the junk dimension table is the combination of all possible values of the individual columns, called the cartesian product. For example, three different attributes can be joined into one dimension with 3 x 2 x 2 = 12 values or 12 rows in the Junk dimension table, such as:

Customer feedback: Good, Bad or None.

Payment type: Cash or Credit card.

Bag type: Plastic or Paper.

The junk dimension does not need to be the cartesian product of all the attributes' possible values; it can instead contain only the combinations of values that actually occur in the source data. A sketch of building the full cartesian product follows.
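
A minimal T-SQL sketch of building the full cartesian product for the three example attributes, following the same SELECT ... INTO style as the dimensionalised dimension above (the table name Dim_Junk is illustrative):

SELECT JunkKey = ROW_NUMBER() OVER (ORDER BY f.Feedback, p.PaymentType, b.BagType),
       f.Feedback, p.PaymentType, b.BagType   -- 3 x 2 x 2 = 12 rows
INTO   Dim_Junk
FROM       (VALUES ('Good'), ('Bad'), ('None'))    f (Feedback)
CROSS JOIN (VALUES ('Cash'), ('Credit card'))      p (PaymentType)
CROSS JOIN (VALUES ('Plastic'), ('Paper'))         b (BagType)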

 

Time machine dimension

A combination of two dimensions, one for Entry date (or event date) and one for Year, where the fact data rows are based on either »counterpart« rows as in financial accounting or »transactions« as in financial transactions. Section 6.3 will show an example.

 

Supernova dimension

Dimensional attributes are allowed to become complex objects rather than simple text, such as unstructured text, GPS tracking, graphic images and time series, stored in NoSQL bases like Hadoop and MongoDB, and the dimension becomes much more malleable and extensible from one analysis to another. Extensible design in software engineering is to accept that not everything can be designed in advance, and extensibility is a software design principle defined as a system's ability to have new functionality added.

 

Data quality dimension

A description of each fact table row, with values such as »Normal value«, »Out-of-bounds value«, »Unlikely value«, »Verified value«, »Unverified value« and »Uncertain value«.

 

4.3. Inferred members or Inferred dimensions

A dimension normally has an »Unknown«, »Missing«, »N/A« or »-« member value to handle source legacy data in these two situations:

 

  • Business key is empty, null or an empty string.

Approach: Handle the null business key as a missing member.

 

  • Business key value does not exist in the dimension.

Approach: »Late arriving dimension« or »Early arriving fact« for handling orphaned data where the fact data has an unknown member, meaning the value is an orphan child because there is no parent value in the corresponding dimension table.

 

In a relational database it is called a referential integrity constraint violation when a foreign key contains a value that does not exist as a primary key in a different (or the same) table. There is no referred member, and there is a need to have an inferred member. A forthcoming fact row has a member that will infer a new dimension member, therefore it is called an inferred member, and it improves data quality in the data warehouse.

 

Handling null business key as a missing member

When a dataset is transformed to a fact table, there can be a business key column whose value is empty or null. To keep all the fact data rows in the fact table, the related dimension table is already born with a »Missing« value (an inferred member) with a surrogate key identity, e.g. value 0 or -1, which can be used as a default value in the fact table foreign key column to the dimension table primary key column. Later the fact table default value can be updated with a real business key value that either exists in the dimension table or will be inserted there first. A lookup sketch follows below.
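
A minimal lookup sketch, assuming a staging table stage.Sales and a Customer dimension where surrogate key 0 is reserved for the »Missing« member (names are illustrative):

SELECT DimCustomerId = CASE WHEN s.CustomerId IS NULL
                            THEN 0               -- the »Missing« inferred member
                            ELSE d.DimCustomerId END,
       s.PurchaseDate, s.Amount
FROM   stage.Sales s
       LEFT JOIN dim.Customer d ON d.CustomerId = s.CustomerId
-- a non-null key that is absent from the dimension is instead handled as a
-- late arriving dimension, see the next section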

 

Late arriving dimension or Early arriving fact as an unknown member

When a dataset is transformed to a fact table, there can be a business key column whose value has not yet been received by the related dimension table and therefore does not yet exist there. To keep all the fact data rows in the fact table, the related dimension table will first have a new business key value inserted with an »Unknown« text value (an inferred member), which will later be updated with the correct text value when it is known. The »Unknown« value gets the next surrogate key identity, a unique sequence number, and will be used in the fact table like any other dimension value. There can be several »Unknown« values in a dimension at the same time, each with its own key identity, and the text value can include the value of the business key like »Unknown 886«, »Unknown 887«, »Unknown 888« and so on to distinguish them for the users of the dimension. When a regular dimension has a hierarchy, these levels can have the text value »Unknown« as a special branch in the hierarchy. An insert sketch follows below.
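
A minimal sketch of inserting the inferred members before the fact load, assuming the same illustrative stage.Sales and dim.Customer tables and an identity column on the dimension primary key:

INSERT INTO dim.Customer (CustomerId, Name, IsInferred)
SELECT DISTINCT s.CustomerId,
       'Unknown ' + CAST(s.CustomerId AS varchar(20)),   -- e.g. »Unknown 886«
       1
FROM   stage.Sales s
WHERE  s.CustomerId IS NOT NULL
  AND  NOT EXISTS (SELECT 1 FROM dim.Customer d
                   WHERE d.CustomerId = s.CustomerId)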

 

Late arriving fact

When fact data has been delayed, it may include a date that can be used to search and fetch the dimension member that was current at that time, if the dimension keeps history.

 

Early arriving dimension

When a dimension has members that have not yet been referred to from a fact row, or maybe never will be, e.g. a type of payment called »Blood«, we still keep them as members of the dimension.

 

Other inferred members

More varieties of constant inferred members could look like this:

»Missing« (0): the business key value is not present but is expected to be available later, and at that time the value will get its own identity and the fact table will be updated accordingly.

»Not available« (-1): the legacy system did not provide a business key value (null) and will not do so later, so it is truly a missing value.

»Bad« or »Dirty« (-2): the legacy system provided bad data (like year of birth as 2099) or not enough data to determine the appropriate dimension key identity. This may be due to corrupted data in the legacy system or incomplete knowledge of the business rules for the source data of the dimension.

»Not applicable« (-3): the dimension is not applicable to this fact row.

 

IsInferred flag

A dimension table can have different audit columns (see more in section 6.2), like an IsInferred flag as a boolean (bit) with two values (1 = true and 0 = false):

 

  • IsInferred = 1 means that the row represents an unknown value from legacy.
  • IsInferred = 0 means that the row represents a known value from legacy.

 

An example of a Customer dimension:

 

DimCustomerId | Business key | Region        | Name           | IsInferred
0             | 0            | Missing       | Missing        | 1
1             | 176          | Europe        | Hans Andersen  | 0
2             | 359          | North America | Joe Watson     | 0
3             | 421          | Europe        | Marie Beaulieur| 0
134           | 886          | Unknown       | Unknown 886    | 1
135           | 887          | Unknown       | Unknown 887    | 1
136           | 888          | Unknown       | Unknown 888    | 1
137           | 889          | East Asia     | Jian Lewai     | 0
138           | 890          | North America | Emily Gates    | 0
139           | 891          | North America | Michael Brown  | 0

 

Will an inferred dimension row be updated when the source legacy system provides a Region and a Name?

 

  • Yes, for type 1 current value, and then set IsInferred = 0.

 

  • No, for type 2 keep all values, and likewise for types 4, 5, 6 and 7, because it is important to keep history in rows; therefore the updated Region and Name will be inserted as a new row with IsInferred = 0, keeping the original row unchanged.

 

The fact table is untouched and unchanged because the foreign key DimCustomerId values 134 - 136 are in the right place already; it is only the Unknown values of the dimension that are updated. Kimball says there is no need to revisit the fact table when making an inferred member into a normal member of the dimension.

 

When other levels in a hierarchy have the text value »Unknown«, like Region above, users of the dimension in a cube have already made reports with fact data connected to »Unknown«. Therefore it is important to keep the history in the dimension, so a user some months later can still make the same report with the same summary value at the »Unknown« level. When the user instead wants to use the current values of the previous inferred members, the cube should provide such a dimension too.

 

5. Links

http://www.wikipedia.org/wiki/Data_warehouse

http://www.kimballgroup.com

http://www.kimballgroup.com/2009/05/the-10-essential-rules-of-dimensional-modeling/

http://www.wikipedia.org/wiki/Slowly_changing_dimension

http://www.kimballgroup.com/2013/02/05/design-tip-152-slowly-changing-dimension-types-0-4-5-6-7

http://www.kimballgroup.com/2012/07/design-tip-147-durable-super-natural-keys/

http://www.kimballgroup.com/2015/12/design-tip-180-the-future-is-bright/

http://www.1keydata.com/datawarehousing/datawarehouse.html

https://www.youtube.com/watch?v=cwpL-3rkRYQ&list=PLrbIyvYCdg0iAUQoxG5vI_yKqzZ2AcgGe

 

6. Examples

 

6.1. Customer dimension

Customer table in a source legacy system with CustomerId as a primary key (called a Business key in a data warehouse or data mart) and CustomerNumber as a secondary key that includes data about the nationality and the gender of the customer:

 

CustomerId | CustomerNumber | CustomerName    | Country
176        | DK-0912-M      | Hans Andersen   | Denmark
359        | US-3942-M      | Joe Watson      | USA
421        | FR-1429-F      | Marie Beaulieur | France

 

Customer dimension table in a data warehouse where the table has been put in 3NF based on the approach of Snowflake dimension and Outrigger dimension or Reference dimension:

 

DimCustomerId | CustomerId | CustomerNumber | CustomerName    | DimCountryId | DimGenderId | DimPaymentId
1             | 176        | DK-0912-M      | Hans Andersen   | 1            | 0           | 2
2             | 359        | US-3942-M      | Joe Watson      | 5            | 0           | 3
3             | 421        | FR-1429-F      | Marie Beaulieur | 3            | 1           | 3

 

DimCustomerId is a unique sequence number used as primary key for the dimension, giving independence from the business key CustomerId in the source legacy system. Country and Gender become dimensions and can be used together with other dimension tables and fact tables. Payment is taken from another table in the CRM system, becomes a dimension itself and is merged into the Customer dimension as an enrichment of the customers:

 

DimCountryId | CountryName
1            | Denmark
2            | England
3            | France
4            | Norway
5            | USA

 

DimGenderId | GenderName
0           | Male
1           | Female

 

DimPaymentId | PaymentName  | Sortorder
1            | Bank account | 3
2            | Cash         | 1
3            | Credit card  | 2
4            | PayPal       | 5
5            | Bitcoin      | 4

 

For an OLAP multidimensional cube, loading data to a Customer dimension will be implemented with a view that joins the relevant tables and exposes several attributes and hierarchies to make it easy to search and zoom into specific data in tools like Excel, Power BI, Tableau or Targit (a sketch of such a view follows the attribute list below):

 

Customer

·         Number

·         Name

·         Gender

·         Nationality (Country renamed to play a role)

·         Payment
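
A sketch of the view, assuming the tables above are named DimCustomer, DimCountry, DimGender and DimPayment (the mart schema is illustrative):

CREATE VIEW mart.Customer AS
SELECT c.DimCustomerId,
       Number      = c.CustomerNumber,
       Name        = c.CustomerName,
       Gender      = g.GenderName,
       Nationality = co.CountryName,   -- Country renamed to play a role
       Payment     = p.PaymentName
FROM   DimCustomer c
       INNER JOIN DimCountry co ON co.DimCountryId = c.DimCountryId
       INNER JOIN DimGender  g  ON g.DimGenderId   = c.DimGenderId
       INNER JOIN DimPayment p  ON p.DimPaymentId  = c.DimPaymentId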

 

An example of a sales statistic could be: »What is the amount of sales to female customers from France that were paid with credit card in April 2014?«

Of course we need a Sales fact table like this:

 

PurchaseDate | DimCustomerId | DimEmployeeId | DimProductId | Quantity | Amount
2014-04-23   | 2             | 12            | 256          | 1        | 125
2014-04-23   | 2             | 12            | 512          | 3        | 75
2014-04-24   | 3             | 7             | 128          | 7        | 10
2014-04-25   | 1             | 12            | 512          | 1        | 85
2014-04-29   | 3             | 10            | 64           | 2        | 35
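
With the tables above, a minimal sketch of the query could be (assuming the tables are named FactSales, DimCustomer, DimCountry, DimGender and DimPayment):

SELECT Amount = SUM(f.Amount)
FROM   FactSales f
       INNER JOIN DimCustomer c  ON c.DimCustomerId = f.DimCustomerId
       INNER JOIN DimCountry  co ON co.DimCountryId = c.DimCountryId
       INNER JOIN DimGender   g  ON g.DimGenderId   = c.DimGenderId
       INNER JOIN DimPayment  p  ON p.DimPaymentId  = c.DimPaymentId
WHERE  g.GenderName   = 'Female'
  AND  co.CountryName = 'France'
  AND  p.PaymentName  = 'Credit card'
  AND  f.PurchaseDate BETWEEN '2014-04-01' AND '2014-04-30'

With the sample rows above, only Marie Beaulieur qualifies, and the result is 10 + 35 = 45.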

 

 


6.2. Kimball type 7 dimension

A type 7 dimension will keep all values in a dimension independent of the source legacy system, with the option to easily roll up historical facts based on current dimension characteristics. It was formerly known as »Compliance-Enabled Data Warehouses« in Kimball Design Tip #74, which achieves:

 

·      Eliminating Type 1 and Type 3 updates.

·      Accessing all prior versions of a database at points in time.

·      Protecting the custody of your data.

·      Showing why and when changes to data occurred.

 

In this section I would like to give my description of the dimension type 7.

 

I will start with the life cycle terms for a value (member) in a dimension:

 

·      Original is the value at birth, when it occurred and was entered the first time.

·      Historical is the tracking of the value through changes (modified/updated) at any time.

·      Current is the active, newest, latest version of the value, the most recent indicator.

·      Deleted is a value that is no longer available in the source legacy system.

 

From a fact data point of view it is important to know the value of a dimension at the time of registration; therefore I will add to the life cycle:

 

·      Registered is the dimension value when the fact data occurred or was entered.

 

There will probably be dimension values that will not be used in fact data.

 

When I have one fact data row in my hand, I can provide four versions of the value in a dimension, which can also be called four perspectives:

 

·      Registered is the dimension value at the time the fact data row was registered or entered into the data warehouse, based on a transaction date or the current time.

·      Current is the latest and active dimension value, and thus independent of the time of registration of the fact data row.

·      Original is the dimension value entered the first time, maybe even before the fact data row existed in the source legacy system.

·      Historical is the dimension value at a specific given reporting point-in-time datetime, independent of the fact transaction date.

 

Registered stands for as-was when it occurred, Current stands for as-is with the most recent data, Historical stands for as-of a specific reporting date, and Original stands for as-began when the data value was born.

 

An additional piece of information for the fact data row could be that the dimension value is no longer available in the source legacy system because it has been marked expired or purged, or it has been deleted.

 

My design pattern for a type 7 dimension has four key audit columns, also called housekeeping columns or support columns:

 

·         _key stands for a unique primary key column which gets a new integer number every time a value in a source legacy system is changed and sent to the data warehouse. A surrogate key is an identity auto-increment sequential number, normally starting from 1. It can be called _skey for surrogate key, or _rkey for the registered value reference from a fact table row to the dimension table row with the textual descriptive attributes at the time the fact data row was registered or entered.

 

·         _dkey stands for a durable key which is not a unique integer number per row; instead it holds the same number as the _key column from when the dimension value entered the dimension table for the first time, and that number is repeated every time the value changes. It can be called _ekey for entity key, because the unique number of _dkey is also a unique independent data warehouse surrogate number for the business key (natural key) from the source legacy system. It is sometimes called a time-independent surrogate key. It can also be called _hkey for history reference over time, used for looking up the current value in the dimension, the historical value at a specific reporting datetime, or the original value when it was entered the first time.

 

·         _pkey stands for a previous key (or parent-child key) which is a reference to the _key value of the previous or latest version of a dimension value when a value changes over time. It is a referential integrity where _pkey is a foreign key to the primary _key, with one exception: the value 0 in _pkey does not reference a _key value, because the value 0 represents the first time the dimension value entered the dimension table and therefore has no previous value.

 

·         _bkey stands for a business key (or natural operational application primary key) which is unique in the source legacy system, like a primary key. A composite business key is composed of multiple columns like Name_bkey + Street_bkey + Zipcode_bkey, and they will be multiple columns in the dimension table. Sometimes a string concatenation with a separator character is used to handle them in one _bkey column to compare with data from the source system. Identical data can exist in multiple source legacy systems, like an employee in CRM, ERP and HR, each with its own business key in the Employee dimension through sibling columns like CRM_bkey, ERP_bkey and HR_bkey with different representations like a number, a code and an email address. Depending on complexity there is a need for a mapping table in the data warehouse to take care of the merging of business keys, and sometimes it is a manual task to merge data together. The important thing is to have an employee as one entity with one durable surrogate number for the combination of business keys, making a one-to-one relationship between _bkey and _dkey. [concat(Name_bkey,'¤',Street_bkey,'¤',Zipcode_bkey)]

 

An example

Let's take an example where a source legacy system has changed its value three times, from »Warning« to »Notification« and now to »Caution«. Each of the three values becomes a new row in a dimension table together with two audit columns that represent the period in which the value was found in the source legacy system, here called:

 

·         ValidFrom  also called EffectiveDate or StartDateTime or BeginDateTime.

·         ValidTo      also called ExpirationDate or StopDateTime or EndDateTime.

 

In a dimension table of type 7 it will look like this, with audit columns and one business key and one text value column in a System dimension:

 

System_key | System_dkey | System_pkey | System_bkey | System       | ValidFrom           | ValidTo
19         | 19          | 0           | 76CB        | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21
35         | 19          | 19          | 76CB        | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48
87         | 19          | 35          | 76CB        | Caution      | 2015-07-31 08:45:48 | 9999-12-31 00:00:00

 

Column _key is a unique primary key of the dimension table, and the combination of _bkey + ValidFrom or _bkey + ValidTo is unique too.

 

[I prefer to think of a dimension table as an object that has its own functions on data, therefore I like it when the dimension _key and _dkey start from 1 for each dimension, instead of having a dimension key store map with a global number for all the keys across all the dimensions, where each business key gets its own surrogate key and a new dimension table starts with a _key like 4983562.]

 

»Warning« and »Notification« are historical values of the System dimension and »Caution« is the most recent indicator of the System dimension, with these terms:

 

·         »Warning« is the original value.

·         »Notification« is the historical value.

·         »Caution« is the current value.

·         »?« is the registered value, depending on the transaction date in the fact row.

 

The challenge is that you have to choose which version of the dimension value you would like to present in your report.

 

Registered value

From a source legacy system the data warehouse has received two outage fact rows with an OutageDate (a kind of transaction date) and a SystemId business key that matches System_bkey in the System dimension. Example of two source data rows:

 

OutageDate          | SystemId | ProductionKWh
2015-05-25 10:22:43 | 76CB     | 173
2015-08-02 15:47:10 | 76CB     | 221

 

For each source data row, the value of OutageDate is used to look up in the System dimension the value that was valid at the time the outage happened, to determine and fetch the correct dimension row, so the fact data row is registered with criteria like:

 

System_bkey = SystemId AND
ValidFrom <= OutageDate AND ValidTo > OutageDate

 

It will select one row from the dimension table, and the values of columns System_key and System_dkey will be copied to the new fact row. For these two outage incidents on May 25 and August 2 in 2015, it becomes fact data like:

 

OutageDate          | System_key | System_dkey | ProductionKWh
2015-05-25 10:22:43 | 35         | 19          | 173
2015-08-02 15:47:10 | 87         | 19          | 221

 

System_key is a reference to the System dimension value at the registered fact time from column OutageDate, and System_dkey is a reference to the entity with business key 76CB, used to provide the current, original or historical value.

 

When we want to see the value from the dimension at the time an outage happened, the fact table column System_key will join to the dimension column System_key and get the value from the unique row. We call this value the registered value, because it was the value of the dimension when the outage incident occurred, and the value will never change because the dimension handles all changes of values.

The fact table joins to the dimension table to fetch the registered value:

 

SELECT d.System, f.OutageDate, f.ProductionKWh
FROM   fact.Outage f
       INNER JOIN dim.System d ON d.System_key = f.System_key

 

System       | OutageDate          | ProductionKWh
Notification | 2015-05-25 10:22:43 | 173
Caution      | 2015-08-02 15:47:10 | 221

 

Here we see the purpose of the OutageDate. In case the source legacy system does not provide a »transaction date« to determine and fetch the dimension value, the data warehouse will use a Fact_RegisteredTime (receipt datetime, createddate, inserttime, current_timestamp, getdate, sysdatetime) for when the fact data is registered or entered into the fact table row, with criteria like:

 

System_bkey = SystemId AND
ValidFrom <= Fact_RegisteredTime AND ValidTo > Fact_RegisteredTime

 

Current va­lue

It can sometimes confuse users that the value of System has changed over time, and what the summation of ProductionKWh of the System means when using the registered value. Therefore we can provide the current value of the System dimension.

The fact table joins to the dimension table to fetch the current value:

 

SELECT d.System, f.OutageDate, f.ProductionKWh
FROM   fact.Outage f
       INNER JOIN dim.System d ON d.System_dkey = f.System_dkey AND
                                  d.ValidTo = '9999-12-31 00:00:00'

 

System  | OutageDate          | ProductionKWh
Caution | 2015-05-25 10:22:43 | 173
Caution | 2015-08-02 15:47:10 | 221

 

And the summation of ProductionKWh is 394 KWh for Caution.

 

Original value

We can provide the original value of System.

The fact table joins to the dimension table to fetch the original value:

 

SELECT d.System, f.OutageDate, f.ProductionKWh
FROM   fact.Outage f
       INNER JOIN dim.System d ON d.System_dkey = f.System_dkey AND
                                  d.ValidFrom = '1900-01-01 00:00:00'

 

System  | OutageDate          | ProductionKWh
Warning | 2015-05-25 10:22:43 | 173
Warning | 2015-08-02 15:47:10 | 221

 

And the summation of ProductionKWh is 394 KWh for Warning, which is the original value of the system; maybe some users remember it best.

 

Historical value

We can provide a historical value of System at a specific given as-of reporting point-in-time datetime (ReportTime, RequestedDate) that will be part of the criteria.

The fact table joins to the dimension table to fetch the historical value:

 

DECLARE @ReportTime datetime = '2015-06-01 10:24:47'

SELECT d.System, f.OutageDate, f.ProductionKWh
FROM   fact.Outage f
       INNER JOIN dim.System d ON d.System_dkey = f.System_dkey AND
                                  d.ValidFrom <= @ReportTime AND d.ValidTo > @ReportTime

 

System       | OutageDate          | ProductionKWh
Notification | 2015-05-25 10:22:43 | 173
Notification | 2015-08-02 15:47:10 | 221

 

The »between-filter« does the constraining to a single version of the System value. A reporting datetime of »June 1, 2015, 10.24« will fetch system Notification, and that will never change even though the system name changes value over time. It means I can always repeat and reproduce my report as long as I remember the reporting datetime. If I choose another reporting datetime like »September 1, 2015, 09.46« it will fetch system Caution. The reality with this approach is that you have a many-to-many relationship between the fact table and the dimension, so you must force your reporting tools to choose an as-of date, allowing point-in-time analysis at a given date, to avoid bringing back all combinations and thereby overstated measurements.

 

Helper views

It is hard to remember the criteria for _key, _dkey, ValidTo and ValidFrom, therefore I will make four SQL helper views to make the joins easy for users and analysis tools:

 

Registered view

 

CREATE VIEW dim.System_Registered AS
SELECT System_key, System
FROM   dim.System

 

Used in a join with the fact table to fetch the registered value for a fact row:

 

SELECT d.System, f.OutageDate, f.ProductionKWh
FROM   fact.Outage f
       INNER JOIN dim.System_Registered d ON d.System_key = f.System_key

 

Current view

 

CREATE VIEW dim.System_Current AS
SELECT System_dkey, System
FROM   dim.System
WHERE  ValidTo = '9999-12-31 00:00:00'

 

The view will always select the current value for each value in the dimension.

 

Used in a join with the fact table to fetch the current value for a fact row:

 

SELECT d.System, f.OutageDate, f.ProductionKWh
FROM   fact.Outage f
       INNER JOIN dim.System_Current d ON d.System_dkey = f.System_dkey

 

Let us imagine we are back in time at June 5, 2015 and we want to see in Tableau the current system of the outage from May 25; the view System_Current will provide system Notification, because that was the current value at that time. When we do the same in Tableau at August 22, the view System_Current will provide system Caution. The fact table column System_dkey joins to the view column System_dkey, and it will provide different values from the dimension depending on the date and time we do the reporting in Tableau, but the values will always be the current and most recent.

 

Original view

 

CREATE VIEW dim.System_Original AS
SELECT System_dkey, System
FROM   dim.System
WHERE  ValidFrom = '1900-01-01 00:00:00'

 

Used in a join with the fact table to fetch the original value for a fact row:

 

SELECT d.System, f.OutageDate, f.ProductionKWh
FROM   fact.Outage f
       INNER JOIN dim.System_Original d ON d.System_dkey = f.System_dkey

 

Historical function with a datetime parameter

 

CREATE FUNCTION dim.System_Historical(@ReportTime datetime)
RETURNS TABLE AS RETURN (
SELECT System_dkey, System
FROM   dim.System
WHERE  ValidFrom <= @ReportTime AND ValidTo > @ReportTime)

 

Used in a join with the fact table to fetch the historical value for a fact row at a given datetime:

 

SELECT d.System, f.OutageDate, f.ProductionKWh
FROM   fact.Outage f
       INNER JOIN dim.System_Historical('2015-06-01 10:24') d
          ON d.System_dkey = f.System_dkey

 

I could instead have made four database schemas like dimregistered, dimcurrent, dimoriginal and dimhistorical and reused the view name System.

 

Conclusion

Dimension type 7 provides up to four values, for registered, current, original and historical, and is very flexible for a fact data row to request and fetch a dimension value. Of course a lot of fact data rows have only one value in a dimension; for example, a fact for a sale to a new customer is entered at the same time as the customer enters the dimension, and if the customer does not change any data for a long time, the dimension value is registered = current = original = historical. The customer purchases many times and has many rows in fact sales, but one day the customer moves from his home country and later to three different countries over time and still continues to purchase. This means that the original value remains the home country, the historical values are the second and third countries, the current value is the fourth and most recent country, and the registered value country depends on the date of the sale. For sales statistics we can have summaries for the customers' home countries, even though they have moved to other countries, simply by using the original value of the dimension. We can have summaries for the countries where the customers made their purchases by using the registered value. In case we would like to know the sales from customers who have once lived in a specific country, we first look up that country in the customer dimension, get their _dkeys and join them to the fact sales table. The marketing people would like to know the current address, phone number and email of all the customers to contact them with new exciting offers. Type 7 gives you all the options, and the four helper SQL views are handy for ad hoc queries. The right information, at the right time, in the right place, in the right way, in the right format, to the right person.

 

Dimension table and Fact table

A dimension table in the physical design will have four key audit columns, also called housekeeping columns or support columns:

 

·         _key    used for the registered fact data value of the dimension.

·         _dkey  used for the current, original and historical value of the dimension.

·         _pkey  used for the previous key that is a reference to the _key column.

·         _bkey  used for the business key from the so­ur­ce legacy system.

 

A fact table in the physical design has for each dimension two audit columns:

 

·         _key    used for the registered fact data value of the dimension.

·         _dkey  used for the current, original and historical value of the dimension.

 

A dimension table will keep track of the datetime stamps of inserting and updating a row with two audit columns:

 

·         InsertTime    database server system time when the row was inserted.

·         UpdateTime   database server system time when the row was updated.

 

I don't like an audit column to be Null or N/A, therefore UpdateTime gets the same datetime stamp as InsertTime, so both columns have the same value until the data in the row is either changed or deleted in the source legacy system. The two audit columns can be assigned either by triggers for insert and update at the table or by the ETL process. I prefer the ETL process to handle all audit columns.

 

A dimension table will keep track of the current value in the source legacy system with an audit column which is a boolean (bit) with two values (1 = true and 0 = false):

 

IsCurrent

IsCurrent = 1 means that the row represents the current value from the source system.

IsCurrent = 0 means that the row represents a historical value from the source system.

 

Column IsCurrent can also be called NewestFlag, ActiveFlag or RecentFlag.

 

When IsCurrent = 1, ValidFrom/ValidTo represent the period in which the row value is current in the source system.

 

A dimension table will keep track of an expired or deleted value in the source legacy system with an audit column which is a boolean (bit) with two values (1 = true and 0 = false):

 

IsDeleted

IsDeleted = 1 means that the row represents the deleted value from the source system.

IsDeleted = 0 means that the row represents an existing value from the source system.

 

Column IsDeleted can also be called DeleteFlag, DeletedAtSource, ExpiredFlag or InactiveFlag, used when data are deleted in a system and therefore are gone forever.

 

When IsDeleted = 1, ValidFrom/ValidTo represent the period in which the row value does not exist in the source system.

 

A dimension table can keep track of the reason why a row was changed in a column called RowChangeReason, making it easy to do search and analysis. The column could have the value »New« when the row represents a new customer, and later when the customer moves the value could be »Moved country«, and later »Changed credit status« and so on.

 

A fact table will keep track of the datetime stamp of inserting a row with an audit column:

 

·         InsertTime  database server system time when the row was registered.

 

When there is no transaction date for fact data, the InsertTime will be used to look up a unique dimension table row. Sometimes the term RegisteredTime is used.

 

A fact table will sometimes keep track of the datetime stamp of updating a row with audit columns:

 

·         UpdateTime  database server system time when the row was updated.

·         IsCurrent      if a fact row can change, then keep track of the current row.

·         IsDeleted      if a fact row can be deleted from the source.

·         ValidFrom     if fact rows are treated as SCD type 2, Slowly Changing Facts.

·         ValidTo         if fact rows are treated as SCD type 2, Slowly Changing Facts.

 

Slowly Changing Facts SCF, or Frequently Changing Facts FCF, but hopefully not Rapidly Changing Facts RCF, because that could give a lot of fact rows over time.

 

When transactions are changed retroactively, you should update the corresponding fact rows. If you need to track the history of fact rows, they become Slowly Changing Facts, see more in section 6.3.

 

Sometimes a developer likes to prefix or postfix all key and audit columns to make sure that the source system does not have a column with the same name. A postfix could be _dw or _dwh.

 

An example

Let's take an example where a source legacy system has changed its value three times, from »Warning« to »Notification« and now to »Caution«, and in the end the source legacy system deletes the value so it does not exist anymore, but the dimension will keep it forever. Each of the following steps shows how the dimension table looks after each process has been done.

 

1. The first time a value arrives to the data warehouse, which happens at May 1, 2015, 5 pm or 17:00, a new row is added to the dimension table to contain the new value in column System together with ten audit columns:

 

_key | _dkey | _pkey | _bkey | System  | ValidFrom           | ValidTo             | IsCurrent | IsDeleted | InsertTime          | UpdateTime
19   | 19    | 0     | 76CB  | Warning | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 2015-05-01 17:00:00 | 2015-05-01 17:00:00

 

2. The value changed at 2015-05-08, therefore the first row is no longer current and a new row is added:

 

_key | _dkey | _pkey | _bkey | System       | ValidFrom           | ValidTo             | IsCurrent | IsDeleted | InsertTime          | UpdateTime
19   | 19    | 0     | 76CB  | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21 | 0         | 0         | 2015-05-01 17:00:00 | 2015-05-08 17:01:09
35   | 19    | 19    | 76CB  | Notification | 2015-05-08 14:12:21 | 9999-12-31 00:00:00 | 1         | 0         | 2015-05-08 17:01:10 | 2015-05-08 17:01:10

 

3. The value changed at 2015-07-31, therefore the second row is no longer current and a new row is added:

 

_key | _dkey | _pkey | _bkey | System       | ValidFrom           | ValidTo             | IsCurrent | IsDeleted | InsertTime          | UpdateTime
19   | 19    | 0     | 76CB  | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21 | 0         | 0         | 2015-05-01 17:00:00 | 2015-05-08 17:01:09
35   | 19    | 19    | 76CB  | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48 | 0         | 0         | 2015-05-08 17:01:10 | 2015-07-31 14:03:22
87   | 19    | 35    | 76CB  | Caution      | 2015-07-31 08:45:48 | 9999-12-31 00:00:00 | 1         | 0         | 2015-07-31 14:03:23 | 2015-07-31 14:03:23

 

4. The value was deleted at 2015-08-06, where the current row is copied to a new row marked deleted:

 

_key | _dkey | _pkey | _bkey | System       | ValidFrom           | ValidTo             | IsCurrent | IsDeleted | InsertTime          | UpdateTime
19   | 19    | 0     | 76CB  | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21 | 0         | 0         | 2015-05-01 17:00:00 | 2015-05-08 17:01:09
35   | 19    | 19    | 76CB  | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48 | 0         | 0         | 2015-05-08 17:01:10 | 2015-07-31 14:03:22
87   | 19    | 35    | 76CB  | Caution      | 2015-07-31 08:45:48 | 2015-08-06 10:10:05 | 0         | 0         | 2015-07-31 14:03:23 | 2015-08-06 12:00:06
99   | 19    | 87    | 76CB  | Caution      | 2015-08-06 10:10:05 | 9999-12-31 00:00:00 | 1         | 1         | 2015-08-06 12:00:08 | 2015-08-06 12:00:08

 

With this approach an additional row is inserted for a deleted (or closed) item in the source legacy system, so the last row of the dimension, marked as deleted, tells the period in which the value is deleted, here from 2015-08-06 10:10:05 to forever. IsDeleted = 1 represents a deleted row, record or data in the source legacy system, therefore the business key is no longer available in the source legacy system in the period ValidFrom/ValidTo. We assume the value will not be reopened (or reborn) with the same business key in the source legacy system. No fact data _key column will ever point to or refer to a deleted value in a dimension, therefore no fact row can use a _key value with IsDeleted = 1. IsCurrent = 0 represents an old historical version of the value, and ValidFrom/ValidTo tells us the period when the old value was current or active back in time.

 

When source data with an OutageDate arrives to the data warehouse, it will find the historical system name at the time the outage happened with this criteria, which will not include the deleted dimension value because no fact data _key column can refer to a deleted value:

 

IsDeleted = 0 AND System_bkey = SystemId AND
ValidFrom <= OutageDate AND ValidTo > OutageDate

 

Example of two source data rows:

 

OutageDate          | SystemId | ProductionKWh
2015-05-25 10:22:43 | 76CB     | 173
2015-08-02 15:47:10 | 76CB     | 221

 

They become two fact data rows, where System_key is a reference to the System dimension value at the registered fact time from column OutageDate, and System_dkey is a reference to the entity with business key 76CB to provide the current, original or historical value.

 

OutageDate          | System_key | System_dkey | ProductionKWh
2015-05-25 10:22:43 | 35         | 19          | 173
2015-08-02 15:47:10 | 87         | 19          | 221

 

The current view for the dimension, showing all current values from the dimension, will use the criteria IsCurrent = 1, so the view gives the current value every time I access the view, and the fact column _dkey can join to the view to show the current values of the dimension. Dimension values with IsDeleted = 1 must be included in the view, because there are old fact data with a _dkey which refer to data deleted in the source legacy system.

 

View providing registered dimension values, joined to fact data through _key:

 

CREATE VIEW dim.System_Registered AS
SELECT System_key, System
FROM   dim.System
WHERE  IsDeleted = 0

 

View providing current dimension values, joined to fact data through _dkey:

 

CREATE VIEW dim.System_Current AS
SELECT System_dkey, System
FROM   dim.System
WHERE  IsCurrent = 1

 

View providing original dimension values, joined to fact data through _dkey:

 

CREATE VIEW dim.System_Original AS
SELECT System_dkey, System
FROM   dim.System
WHERE  System_pkey = 0   -- or: System_key = System_dkey

 

Function providing historical dimension values, joined to fact data through _dkey:

 

CREATE FUNCTION dim.System_Historical(@ReportTime datetime)
RETURNS TABLE AS RETURN (
SELECT System_dkey, System
FROM   dim.System
WHERE  IsDeleted = 0 AND ValidFrom <= @ReportTime AND ValidTo > @ReportTime)

 

Since older fact data can refer through column _dkey to a deleted dimension value, the current view still needs to use IsCurrent = 1, and with IsDeleted = 1 we can see that the value is no longer available in the source legacy system. Therefore fact table data are consistent (not inconsistent) and will never be missing a dimension value in a dimension table. When a dimension row has IsDeleted = 1, its _key value is never used in a fact data _key column, therefore the registered view and historical view do not need to include the deleted dimension value; in any case the same dimension value already exists in the previous row of the dimension table with a ValidFrom/ValidTo period for the time when the value was current and active in the source legacy system.

 

An implementation of the type 7 dimension in SSIS SQL Server Integration Services

 

 

Read how to implement the type 7 dimension ETL process in SSIS and how to load fact data with a transaction date

 

A dimension table will keep track of unknown values in the source legacy system with an audit column which is a boolean (bit) with two values (1 = true and 0 = false):

 

  • IsInferred

IsInferred = 1 means that the row represents an unknown value from the source system.

IsInferred = 0 means that the row represents a known value from the source system.

 

An example

Let's take an example where the source legacy system has a business key with a null value and unknown values that do not exist in the dimension table:

 

System_key | System_dkey | System_pkey | System_bkey | System       | ValidFrom           | ValidTo             | IsCurrent | IsDeleted | IsInferred
0          | 0           | 0           | -           | Missing      | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 1
19         | 19          | 0           | 76CB        | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21 | 0         | 0         | 0
35         | 19          | 19          | 76CB        | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48 | 0         | 0         | 0
87         | 19          | 35          | 76CB        | Caution      | 2015-07-31 08:45:48 | 9999-12-31 00:00:00 | 1         | 0         | 0
88         | 88          | 0           | 88FF        | Sign         | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 0

 

Example of the table Source_Production with the business key SystemId:

 

OutageDate          | SystemId | ProductionKWh
2015-05-25 10:22:43 | 76CB     | 173
2015-08-02 15:47:10 | 76CB     | 221
2015-08-12 22:21:42 | 88FF     | 100
2015-10-02 00:00:00 | 90AA     | 200
2015-10-12 00:00:00 | 90AA     | 300
2015-10-15 00:00:00 | 91KL     | 400
2015-02-07 00:00:00 | NULL     | 500

 

All data rows in table Source_Production will be loaded to table Fact_System, where the business key SystemId is transformed to the dimension key System_key in table Dim_System. There are three values of SystemId (90AA, 91KL, NULL) in four rows that do not exist in Dim_System, therefore they are inferred members and are inferred into Dim_System: NULL is transformed to System_key 0 for »Missing«, and 90AA and 91KL are inserted as new data rows in Dim_System as »Unknown« with new numbers in System_key that are then used when loading data rows into the Fact_System table. A T-SQL sketch of the inference follows below.
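
A minimal T-SQL sketch of the inference, equivalent to what the SSIS package does (it assumes System_key is an identity column and that System_dkey is left null by the insert until the update sets it):

INSERT INTO Dim_System (System_pkey, System_bkey, System,
                        ValidFrom, ValidTo, IsCurrent, IsDeleted, IsInferred)
SELECT DISTINCT 0, s.SystemId, 'Unknown ' + s.SystemId,
       '1900-01-01', '9999-12-31', 1, 0, 1
FROM   Source_Production s
WHERE  s.SystemId IS NOT NULL
  AND  NOT EXISTS (SELECT 1 FROM Dim_System d
                   WHERE d.System_bkey = s.SystemId)

-- a new member is its own entity, so the durable key mirrors the new surrogate key
UPDATE Dim_System
SET    System_dkey = System_key
WHERE  IsInferred = 1 AND System_dkey IS NULL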

 

After running the SSIS package, the table Dim_System has new inferred rows:

 

System_key | System_dkey | System_pkey | System_bkey | System       | ValidFrom           | ValidTo             | IsCurrent | IsDeleted | IsInferred
0          | 0           | 0           | -           | Missing      | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 1
19         | 19          | 0           | 76CB        | Warning      | 1900-01-01 00:00:00 | 2015-05-08 14:12:21 | 0         | 0         | 0
35         | 19          | 19          | 76CB        | Notification | 2015-05-08 14:12:21 | 2015-07-31 08:45:48 | 0         | 0         | 0
87         | 19          | 35          | 76CB        | Caution      | 2015-07-31 08:45:48 | 9999-12-31 00:00:00 | 1         | 0         | 0
88         | 88          | 0           | 88FF        | Sign         | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 0
89         | 89          | 0           | 90AA        | Unknown 90AA | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 1
90         | 90          | 0           | 91KL        | Unknown 91KL | 1900-01-01 00:00:00 | 9999-12-31 00:00:00 | 1         | 0         | 1

 

The table Fact_System is loaded with all the data rows from Source_Production, using the new inferred numbers of System_key from the table Dim_System:

 

OutageDate          | System_key | System_dkey | ProductionKWh
2015-05-25 10:22:43 | 35         | 19          | 173
2015-08-02 15:47:10 | 87         | 19          | 221
2015-08-12 22:21:42 | 88         | 88          | 100
2015-10-02 00:00:00 | 89         | 89          | 200
2015-10-12 00:00:00 | 89         | 89          | 300
2015-10-15 00:00:00 | 90         | 90          | 400
2015-02-07 00:00:00 | 0          | 0           | 500

 

An implementation of Inferred members in SSIS SQL Server Integration Services

 

 

Read how to implement Inferred members in an SSIS Lookup

 

For a type 1 dimension, IsInferred will change from 1 to 0 when the source system provides the real dimension value.

 

6.3. Slowly Changing Facts

Most of the time a fact table row will never be updated, but if it happens I will keep history: when a measure value in a fact table row is changed, I insert a new row, so the fact table changes are treated like an SCD type 2 dimension. The fact data is provided with a business key (composed of several columns) from the source legacy system, so the data warehouse knows when the data has been changed. There will be a »between-filter« in the reporting tool on the fact table. I will be capturing historical facts with an entry date in a fact table (for transactions), and from there I have maximum flexibility for aggregation.

 

A marketing source legacy system contains the expected sale for each salesperson and each year; for example, for year 2016 Mr. Sellers expects to have a sale of 10 in quantity with a budget amount of 100. But as the year goes by, the expectation is changed to adapt to the market, and the figures are overwritten in the system together with a status that starts as open for changes and later becomes closed for changes, or canceled if the salesperson failed in his duty. Overwritten means it is a task for the data warehouse fact table to remember all figures and statuses, as if it was a type 2 SCD dimension that uses row versioning (compliance-enabled); that means the fact table has ValidFrom and ValidTo timestamps.

 

A timespan fact table could look like this:

 

Year | Person      | Quantity | Budget | Status   | ValidFrom  | ValidTo    | IsCurrent
2016 | 005 Sellers | 10       | 100    | Open     | 2015-12-12 | 2016-03-26 | 0
2016 | 005 Sellers | 11       | 125    | Open     | 2016-03-27 | 2016-04-12 | 0
2016 | 005 Sellers | 9        | 117    | Open     | 2016-04-13 | 2016-06-20 | 0
2016 | 005 Sellers | 14       | 130    | Open     | 2016-06-21 | 2016-07-31 | 0
2016 | 005 Sellers | 15       | 130    | Closed   | 2016-08-01 | 9999-12-31 | 1
2016 | 007 Bond    | 10       | 100    | Open     | 2015-12-16 | 2016-03-31 | 0
2016 | 007 Bond    | 20       | 180    | Open     | 2016-04-01 | 2016-05-20 | 0
2016 | 007 Bond    | 0        | 0      | Canceled | 2016-05-21 | 9999-12-31 | 1

 

Notice that the next ValidFrom is one day after the previous ValidTo.

Loading of data could be based on incremental load or delta data detection.

 

Dimensions will be Year, Person, Status and Requested date, where a given date has to be between ValidFrom and ValidTo, as sketched below. When the given time period is greater than a day, like a week, month, quarter, the whole year or year-to-date ytd, the fact table is not suitable.
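
A minimal point-in-time sketch against the timespan fact table, assuming the fact table is simply called Fact as in the query further below:

DECLARE @RequestedDate date = '2016-05-01'

SELECT Person, Quantity, Budget, Status
FROM   Fact
WHERE  Year = 2016
  AND  ValidFrom <= @RequestedDate AND ValidTo >= @RequestedDate   -- inclusive, day grain

With the rows above, May 1, 2016 returns 9/117 Open for 005 Sellers and 20/180 Open for 007 Bond.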

 

A counterpart fact table, as in financial accounting, looks like this:

 

Year | Person      | Quantity | Budget | Status   | EntryDate  | IsCP | IsCurrent | FactDataId
2016 | 005 Sellers | 10       | 100    | Open     | 2015-12-12 | 0    | 0         | 123
2016 | 005 Sellers | -10      | -100   | Open     | 2016-03-27 | 1    | 0         | 123
2016 | 005 Sellers | 11       | 125    | Open     | 2016-03-27 | 0    | 0         | 123
2016 | 005 Sellers | -11      | -125   | Open     | 2016-04-13 | 1    | 0         | 123
2016 | 005 Sellers | 9        | 117    | Open     | 2016-04-13 | 0    | 0         | 123
2016 | 005 Sellers | -9       | -117   | Open     | 2016-06-21 | 1    | 0         | 123
2016 | 005 Sellers | 14       | 130    | Open     | 2016-06-21 | 0    | 0         | 123
2016 | 005 Sellers | -14      | -130   | Open     | 2016-08-01 | 1    | 0         | 123
2016 | 005 Sellers | 15       | 130    | Closed   | 2016-08-01 | 0    | 1         | 123
2016 | 007 Bond    | 10       | 100    | Open     | 2015-12-16 | 0    | 0         | 124
2016 | 007 Bond    | -10      | -100   | Open     | 2016-04-01 | 1    | 0         | 124
2016 | 007 Bond    | 20       | 180    | Open     | 2016-04-01 | 0    | 0         | 124
2016 | 007 Bond    | -20      | -180   | Open     | 2016-05-21 | 1    | 0         | 124
2016 | 007 Bond    | 0        | 0      | Canceled | 2016-05-21 | 0    | 1         | 124

 

I will use EntryDate to calculate a summary up to a given month, like April 2016, and it does not care about the expectations that were changed after April, because EntryDate is a time machine dimension used to go back in time and make a roll-up aggregation:

 

SELECT Person, Quantity = SUM(Quantity), Budget = SUM(Budget)
FROM   Fact
WHERE  Year = 2016 AND EntryDate <= '2016-04-30'
GROUP BY Person

 

With data for some years loaded into an OLAP multidimensional cube, I can in an analysis tool select a given date in the Entry date dimension, like 2016-04-30, and use the Year dimension as columns in a matrix to compare the month of April between the years:

 

Year        | 2016           | 2015           | 2014           | 2013
Person      | Budget | Index | Budget | Index | Budget | Index | Budget | Index
005 Sellers | 117    | 104   | 112    | 113   | 99     | 132   | 75     | 112
007 Bond    | 180    | 200   | 90     | 290   | 31     | 148   | 21     | 97

 

[An extended technique, when there can be multiple changes over time, is to extract and negate all versions of the fact rows into a holding table (old positives become negative and old negatives become positive), with the addition of a positive version of the new row. Then summarize the holding table across all dimensions (except activity/posting date; use MAX() if it is in the holding table) and filter out zero rows; the result is the total net change, resulting in one or two rows per fact. You wind up with two rows if there is a change in one or more dimensions, one row if the dimensions did not change, and no rows if there was no change (all measures are zero).]

 

A transactional fact table, as in financial transactions with net changes, looks like this:

 

Year | Person      | Quantity | Budget | Status   | EntryDate  | IsOriginal | FactDataId
2016 | 005 Sellers | 10       | 100    | Open     | 2015-12-12 | 1          | 123
2016 | 005 Sellers | 1        | 25     | Open     | 2016-03-27 | 0          | 123
2016 | 005 Sellers | -2       | -8     | Open     | 2016-04-13 | 0          | 123
2016 | 005 Sellers | 5        | 13     | Open     | 2016-06-21 | 0          | 123
2016 | 005 Sellers | 1        | 0      | Closed   | 2016-08-01 | 0          | 123
2016 | 007 Bond    | 10       | 100    | Open     | 2015-12-16 | 1          | 124
2016 | 007 Bond    | 10       | 80     | Open     | 2016-04-01 | 0          | 124
2016 | 007 Bond    | -20      | -180   | Canceled | 2016-05-21 | 0          | 124

 

A transactional fact table is insert only and stores changes as the difference between the current state and the new state.

 

I can use the same SQL statement as above, and this fact table approach keeps the same number of rows as the original fact table because there are no counterpart rows, but the loading needs to do some calculations, sketched below, so there are always pros and cons with an approach and a methodology.
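
A sketch of the calculation during loading, assuming the latest source figures sit in a table called Source with the same columns; the new transaction row is the source value minus the sum of the rows already stored per FactDataId (in practice, rows with no change at all would also be filtered out):

INSERT INTO Fact (Year, Person, Quantity, Budget, Status, EntryDate, IsOriginal, FactDataId)
SELECT s.Year, s.Person,
       s.Quantity - ISNULL(f.Quantity, 0),               -- net change since last load
       s.Budget   - ISNULL(f.Budget, 0),
       s.Status, s.EntryDate,
       CASE WHEN f.FactDataId IS NULL THEN 1 ELSE 0 END, -- first row is the original
       s.FactDataId
FROM   Source s
       LEFT JOIN (SELECT FactDataId,
                         Quantity = SUM(Quantity), Budget = SUM(Budget)
                  FROM   Fact
                  GROUP BY FactDataId) f ON f.FactDataId = s.FactDataId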

 

It is awesome to be able to deal with data change history over time, therefore I recommend the Kimball dimension type 7 and the transactional fact.

 

7. IsDeleted or DeleteFlag in a data warehouse

I have seen many data warehouses that have an audit or support column like:

 

IsDeleted = 1 means that the row represents the deleted value from the source system.

IsDeleted = 0 means that the row represents an existing value from the source system.

 

Column IsDeleted can also be called DeleteFlag, DeletedAtSource, ExpiredFlag or InactiveFlag, used when data are deleted in a system and therefore are gone forever.

 

To have an IsDeleted column in an archive table and later in a dimension table and a fact table, the data flow from Source to Input data area needs some rules to avoid problems, because I have seen data warehouses where many data rows have been marked IsDeleted = 1 although the data are still in the source legacy system.

 

If data capture from source legacy systems is based on incremental load or delta data detection, the source will never tell the data warehouse when a data row is deleted. For example, every month the source pushes the latest month of data to the data warehouse, and therefore older data that has not been changed is not included in the monthly delivery, e.g. a csv file. In case the data warehouse has implemented a »not in« or »not exists« to mark the delete flag, the older data rows will be marked with IsDeleted = 1 even though the data still exist in the source.

 

Data capture from source legacy systems must instead be based on a full dump, meaning all data rows go to the data warehouse, so it can carry out the appropriate comparison between the Input data area and the Archive or dimension/fact and mark IsDeleted = 1 only when the data really do not exist in the source system; a sketch follows below. The problem with this approach is the amount of data, because transferring many millions of data rows and doing the comparison can cost a lot of the total ETL process time. A vulnerability is that if the source by mistake only provides half of the data rows, the data warehouse will think that the other half of the data rows has been deleted in the source system and will therefore mark many data rows in the archive as deleted.
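
A sketch of the comparison, assuming illustrative tables archive.Customer and input.Customer, where input.Customer holds a complete full dump:

UPDATE a
SET    a.IsDeleted = 1
FROM   archive.Customer a
WHERE  a.IsCurrent = 1
  AND  a.IsDeleted = 0
  AND  NOT EXISTS (SELECT 1
                   FROM   input.Customer i
                   WHERE  i.CustomerId = a.CustomerId)   -- the business key is gone from the full dump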

 

I recommend having an IsDeleted flag in the data warehouse only when the source system provides data rows that are marked as Deleted (or D), so the source system explicitly tells the data warehouse which data rows or information do not exist anymore and therefore are gone forever. It can be provided in a special file or table, but of course that sets a requirement for the source system. When the source system has a historical log audit table as shown in the first chapter, it can be part of the data capture incremental load or delta data detection.