Tips for SSIS SQL Server Integration Services

by Joakim Dalby

 

1. Introduction

Best practices for building SSIS Integration Services packages, in the form of tips shown by examples. In my article about SQL Server 2014 I describe deployment methods for an SSIS project and its packages on different servers, using a parameter for the connection string and an Environment variable inside the SSISDB catalog:

SQL Server 2014, see chapters 3 and 8.

 

A solution in SQL Server Data Tools for Visual Studio 2013 can contain several projects for the different data layers in a data warehouse: SSIS packages, SSRS reports, SSAS olap cubes, SSAS tabular cubes, exports to Excel files and so on. It is much like a .NET solution with projects that become different dll files, where the dll files belong together and are called from the same exe file. If you like to show the solution node in Solution Explorer: Tools, Options, Projects and Solutions and checkmark 'Always show solution'. I recommend one SSIS project per data layer in a data warehouse, e.g. for extraction from source legacy systems, for archive or ods, for dsa, for edw and for each datamart.

 

An SSIS project's properties are shown from the Solution Explorer pane by right-clicking the project name and selecting {Properties}. I always set ProtectionLevel to DontSaveSensitive so other developers can access the SSIS packages without depending on the creator's user account or a password. When a package is saved, sensitive password values are removed, so passwords must be supplied to the package through a configuration file or by the user. All new IS packages inherit the protection level.

 

Please remember to consider the MaxConcurrentExecutables property of the SSIS package; more at this link: Data Flow Performance Features.

When an SSIS package runs on a dedicated server and many operations run in parallel, it can be beneficial to increase this setting if some of the operations (threads) spend a lot of time waiting for external systems to reply. A Data Flow Task has properties for EngineThreads, DefaultMaxBufferRows and DefaultMaxBufferSize.

 

An SSIS project has a Connection Managers node where all connection strings to files and databases should be placed, so they can be reused in all the SSIS packages in the project.

Start by right-clicking {New Connection Manager}, and for a connection to a SQL Server database select OLEDB, click [Add] and [New], type in the SQL Server name, normally use Windows Authentication and select the database in the dropdown. The data provider becomes 'Native OLE DB\SQL Server Native Client 11.0'. This connection is used in the Control Flow by Execute SQL Task and in the Data Flow Task by OLE DB Source, OLE DB Destination and Lookup. A connection manager used inside an SSIS package has an Expression property, so for a flat file whose name includes a date yyyy-mm-dd an expression can assign the right filename at runtime for a dynamic connection like this:

"C:\\File\\Customer"+LEFT((DT_WSTR,30)GETDATE(),10)+".csv"

 

Show SSIS Toolbox

In the menu bar select View, Other Windows, SSIS Toolbox, and it appears at the left side of your screen showing the tools or components depending on whether you are in a control flow or a data flow canvas.

 

To delete a row in a component

In a component like 'Derived Column' or 'Union All', first left-click another row, then hover over the row you want to delete, right-click and select {Delete} in the menu.

 

Execute and format a SSIS package

Normally I run an SSIS package from design mode by clicking the Start button, but that sometimes evaluates and syntax checks all files in the project and in the solution. To avoid that, it is better to close all tabs (or documents, as SQL Server Data Tools for Visual Studio 2013 calls them), right-click the package and choose {Execute Package} like this:

 

 

Layout of a SSIS package

In case you like to organize the layout of an SSIS package, the Format menu can help:

 

 

Some of the components

 

 

 

2. Control Flow and Data Flow components

An SSIS package has a 'Control Flow' canvas or design surface which uses 'Task' components to display the steps of a process; it is the task workflow engine that coordinates the business process flow logic for a package. The 'Data Flow' is the data processing engine that handles data movement, transformation logic, data organization, and the extraction and commitment of data to and from sources and destinations.

 

Parameter / Variable   Used in an expression                         Used in Dts.Variables[…]
Project parameter      @[$Project::parameter]                        "$Project::parameter"
Package parameter      @[$Package::parameter]                        "$Package::parameter"
Package variable       @[User::variable]                             "User::variable"
Sequence Container     @[User::variable_in_scope]                    "User::variable"
                       (a Data Flow can't refer to it)

 

Some of the components and their usage

 

  • Execute T-SQL Statement Task for a fixed sql statement like TRUNCATE TABLE, UPDATE or DELETE. If you like to use a T-SQL MERGE statement it is also placed here (a sketch is shown in chapter 7). Sadly this task needs its own connection manager, because it uses the ADO.NET connection manager type with '.Net Providers\SqlClient Data Provider'.
  • Execute SQL Task for a fixed or dynamic sql statement whose criteria depend on another task that has assigned an SSIS package variable; the variable can be used in a Where clause through a question mark ? parameter, or as a parameter to execute a stored procedure. This task can also assign an SSIS package variable a value from a Select statement. The connection manager is the same as for the 'Data Flow Task', 'Native OLE DB\SQL Server Native Client 11.0', which provides a ? parameter for each variable. When using an ADO.NET connection instead, it provides named @parameters like in T-SQL, which is more self-explanatory, but it also uses an extra connection to the database besides the connection used inside the 'Data Flow Task'.
  • Sequence Container for grouping tasks that can run in parallel. Tasks are normally performed in a particular sequential order with the precedence constraint green arrows between them. When a package has many tasks on different datasets they can be grouped together in several Sequence Containers, and each container can be performed in parallel. Please be careful about having many Sequence Containers doing tasks on many different datasets, because that makes your SSIS package complex at runtime, with many open connections to the database, and harder to maintain later. I recommend having many SSIS packages, like one for each dataset, and then a batch/main/master/autoexec SSIS package that calls the packages and can do so in some parallel order when there is no constraint green arrow between them.
  • Data Flow Task for pulling or fetching data out of a database into an SSIS pipeline and for pushing or inserting modified data back to the database to store the changes. The pipeline engine manages the flow of data from the data source adapters to the data destination adapters and does the necessary transformations.
  • OLE DB Source pulls or fetches data out of a database into a pipeline using a Select statement with only the required columns, or, when using a table (or view), go to the Columns page and checkmark the required columns. Sometimes, on the Columns page, type in a prefix s_ for the output columns; that makes it easier to merge, union or lookup against other tables with the same column names. On the Error Output page bad data can be redirected to a red pipeline to be taken care of instead of being ignored or failing the task. 'Source' can also pass data to a 'Recordset Destination' that uses a package variable with data type Object, and the variable can be used in the 'Control Flow' in a 'Foreach Loop Container' to loop and iterate over each record/row of the variable, where a column is assigned to another package variable of type Int32 (int), Int64 (bigint), String (nvarchar), Boolean (bit), Decimal, Double or Char (nchar(1)) matching the data type in the database table.
  • OLE DB Destination pushes or inserts modified data from a pipeline back to the database to store the changes using 'Table or view – fast load', with a checkmark in 'Table lock' and no checkmark in 'Check constraints', because it costs performance to check data that we assume the ETL process has already checked. Seldom checkmark 'Keep identity' and 'Keep nulls'. Set 'Rows per batch' and 'Maximum insert commit size' to avoid the transaction log growing huge; e.g. when going to insert 1 million rows, set both to 100000 rows so rows are saved per batch by breaking the incoming rows into multiple chunks. Remember to set the database property Recovery model to Simple.
  • Multicast copies or clones a pipeline into two or more similar pipelines to avoid reading the same data from a database or file through different 'Source' components.
  • Data Conversion converts a column in a pipeline, like data from a flat file, to a better data type. For a source column with data type varchar (string DT_STR) going to a destination column of nvarchar, the data conversion data type will be set to »Unicode string (DT_WSTR)« with a length typed in. The output alias column can get a prefix c_ or keep the same name as the source column, in which case the pipeline prefixes the column with »DataConversion«, and I use this column in the OLE DB Destination mapping.
  • Derived Column adds an extra column to a pipeline, or replaces the value of an existing column with a new value or an expression, like changing a null value with REPLACENULL(<columnname>, -1), assigning a value to an existing column such as ISNULL(<columnname>) ? -1 : 1 (if the column is null it becomes -1, else 1), or doing a calculation. An SSIS package variable can be added as an extra column too. A new derived column name can get a prefix d_.
  • Lookup adds an extra column to a pipeline from another table, like an inner join. Normally use 'Partial cache'. The output alias column can get a prefix l_. Remember to set 'Specify how to handle rows with no matching entries' to at least 'Ignore failure', otherwise Lookup fails when one row from the pipeline does not exist in the lookup table or the cache. Or use 'Redirect rows to no match output', where Lookup divides a pipeline into two pipelines called 'Lookup Match Output' and 'Lookup No Match Output'. A SQL statement can insert an inferred member and Lookup returns the new Identity number value.
  • Cache Transform places data from a pipeline into an in-memory variable that can later be used by a Lookup for faster performance.
  • Merge Join combines two sorted pipelines into one pipeline: adding an extra column to a pipeline from another pipeline with a Left outer join, adding extra rows to a pipeline from another pipeline with a Full outer join, or reducing rows in a pipeline based on another pipeline with an Inner join. Pipelines have to be sorted before they are used in a merge join.
  • Conditional Split divides a pipeline into two or more pipelines by criteria, like data being ok or having errors (right or wrong), or into three pipelines by doing delta data detection of New rows, Changed rows and Deleted rows from a Full outer join merge to get the data that does not exist on the other side.
  • Aggregate summarizes values in a pipeline into a new, summarized pipeline. The output alias column can get a prefix a_.
  • Union All combines multiple pipelines into one pipeline.
  • Script Component does C# programming to change data in a pipeline or divide it into multiple pipelines.
  • OLE DB Command runs a sql Update or Delete statement for each row in a pipeline. That is not good for performance with a large number of rows in a pipeline. Seldom do a sql Insert statement here, because OLE DB Destination handles that well through bulk insert. Example: UPDATE T SET F = ?, D = GETDATE() WHERE Id = ?. A question mark ? input parameter refers to a column in the pipeline. The best practice is to have a staging table in the database, truncate it with an 'Execute T-SQL Statement Task', do the Derived Column and Lookup transformations on the data inside the pipeline and push it into the staging table through 'OLE DB Destination', and then do a real bulk sql Update by joining the staging table with the real table in another 'Execute T-SQL Statement Task'; see the staging-table sketch right after this list.
  • Package variable is useful in the 'Control Flow' and the 'Data Flow', like a criterion in a sql Select-Where, an input or output parameter to a stored procedure, or a recordset/resultset. A variable can get its default value from an expression, like a DateTime variable with the expression DATEADD("m",-1,GETDATE()).
  • Package parameter is useful to receive data from a calling package, used in an expression like !ISNULL(@[$Package::Period]) && @[$Package::Period] != "" in the 'Control Flow' or as a criterion in a 'Data Flow Task'.
  • Project parameter is useful for multiple packages using the same value, like a path to a directory, here concatenated with a package variable in an expression with double backslash:

"N'"+@[$Project::BackupDirectory]+"\\Backup_"+@[User::Name] +".bak'"

  • Recordset Destination places data from a pipeline into an in-memory variable with data type Object that can later be used in a Foreach loop.
  • Expression Task assigns a package variable, like this expression:

@[User::Counter] = @[User::Counter] + 1

Only one package variable can be assigned a value in an 'Expression Task', so with multiple variables I need multiple 'Expression Task's, or instead a 'Script Task', which can handle multiple variables in a C# program.

  • Script Task does C# programming with package variables, like showing a value in debug runtime mode; remember to set the property ReadOnlyVariables: MessageBox.Show(Dts.Variables["User::Number"].Value.ToString()); or MessageBox.Show(Convert.ToString(Dts.Variables["User::Num"].Value));

if the variable can be null. Examples with the three types of variables:

MessageBox.Show(Dts.Variables["User::PackageVariable"].Value.ToString());

MessageBox.Show(Dts.Variables["$Project::ProjectParameter"].Value.ToString());

MessageBox.Show(Dts.Variables["$Package::PackageParameter"].Value.ToString());

A script can do calculations in code or build a dynamic sql statement into a variable and use it in an 'Execute SQL Task' to store a value in a table with a sql Insert or Update. In the 'Execute SQL Task' set SQLSourceType to Variable and at SourceVariable pick a package variable like User::TSQLSaveCalculation when it is a dynamically built sql statement.

  • Sort sorts data in a pipeline and can remove rows with duplicate sort values. Sort and Aggregate block the stream of data in the pipeline because they need all the data to do their operation, so they are not recommended for many rows. Therefore let the 'Source' sql statement have the ORDER BY that is needed for 'Merge Join' or 'Lookup', or that fits the primary clustered key in the 'Destination'. Assign the sort order to the 'Source' by right-clicking {Show Advanced Editor}, tab 'Input and Output Properties', click 'OLE DB Source Output' and set the property IsSorted to True:

 

 

Then open 'OLE DB Source Output', open 'Output Columns', select the first column from the ORDER BY and set the property SortKeyPosition to 1:

 

 

Continue with the next column from the ORDER BY, which is set to 2, and so on. When a column is DESC, SortKeyPosition gets a minus in front as a negative value.

Read more. When the data source can't do ORDER BY, like a flat file, but the data is delivered sorted, use SortKeyPosition on the columns that actually are sorted. 'Sort' removes duplicate values of the sort key from a flat file.

  • OLE DB Destination extra options

OLE DB Destination has a property FastLoadOptions that normally shows TABLOCK, CHECK_CONSTRAINTS. Four extra options can be added there:

ROWS_PER_BATCH = <number>

<number> indicates that the destination is configured to use batches of rows.

KILOBYTES_PER_BATCH = <number>

<number> specifies the size in kilobytes to insert per batch.

FIRE_TRIGGERS

Specifies that triggers fire on the target table during the fast load / bulk insert.

ORDER(<columnname> ASC, <columnname> DESC)

Specifies how the input data is sorted. If the sort order is omitted, the insert operation assumes the data is unsorted. Performance can be improved if you use the ORDER option to sort the input data according to the clustered index on the table. The ORDER hint needs to exactly match the clustering keys of the target table. ORDER is especially good when the OLE DB Source is a flat file. Remember to set the database property Recovery model to Simple. Remember also to look at indexes to get faster load performance, like this:

* Disable or drop non-clustered indexes before the bulk load.

* Rebuild or create the non-clustered indexes after the bulk load (see the sketch just below).
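
A minimal sketch of that pattern, assuming a hypothetical fact table dbo.FactSales with a non-clustered index IX_FactSales_Date:

ALTER INDEX IX_FactSales_Date ON dbo.FactSales DISABLE;
-- ... run the bulk load / Data Flow Task here ...
ALTER INDEX IX_FactSales_Date ON dbo.FactSales REBUILD;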

  • Restart Packages by Using Checkpoints, read more.
  • Audit Transformation for adding package system variables to the pipeline, read more.

 

 

3. Data type and Expression

Data Conversion does not use the SQL Server database table data types, and an SSIS package variable and an SSIS project parameter have their own data types; the mapping is:

 

SQL Server data type        SSIS data type (expression)   Variable/parameter   OLE DB for input/output ? param.
bigint                      DT_I8                         Int64                LARGE_INTEGER
binary                      DT_BYTES                      Object               n/a
bit                         DT_BOOL                       Boolean              VARIANT_BOOL
char                        DT_STR                        String               VARCHAR
date                        DT_DBDATE                     DateTime             DBDATE
datetime                    DT_DATE                       DateTime             DATE for input, DBTIMESTAMP for output
datetime2                   DT_DBTIMESTAMP2               DateTime             DATE / DBTIMESTAMP
datetimeoffset              DT_DBTIMESTAMPOFFSET          Object               DBTIMESTAMPOFFSET
smallint                    DT_I2                         Int16                SHORT
decimal                     DT_DECIMAL                    Decimal              DECIMAL
float                       DT_R8                         Double               FLOAT
int                         DT_I4                         Int32                LONG
image                       DT_IMAGE                      Object               n/a
money                       DT_CY                         Decimal              CURRENCY
smallmoney                  DT_CY                         Decimal              CURRENCY
nchar                       DT_WSTR                       String               NVARCHAR
nvarchar                    DT_WSTR                       String               NVARCHAR
ntext                       DT_NTEXT                      String               n/a
nvarchar(max)               DT_NTEXT                      Object               n/a
numeric                     DT_NUMERIC                    Single               NUMERIC
real                        DT_R4                         Single               FLOAT, DOUBLE
text                        DT_TEXT                       String               n/a
time                        DT_DBTIME2                    DateTime             DBTIME2
timestamp                   DT_BYTES                      DateTime             n/a
tinyint                     DT_UI1                        Byte                 BYTE
uniqueidentifier            DT_GUID                       Object               GUID
varbinary                   DT_BYTES                      Object               n/a
varchar(max)                DT_TEXT                       Object               n/a
varchar                     DT_STR                        String               VARCHAR
xml                         DT_WSTR                       String               n/a
single character char(1)    DT_UI2                        Char                 VARCHAR
declarative NULL value                                    DBNull
recordset / resultset                                     Object

 

 

DT_WSTR(<length>) unicode string, length of characters, for nvarchar or nchar.

DT_STR(<length, codepage>) ansi string with codepage 1252 for varchar or char.

 

When you have a nvarchar column in a source table and you want only the first three letters and save it in a varchar column destination table, do this in a Derived Column to cast the value to codepage 1252 = ANSI = varchar data type:

(DT_STR,3,1252)SUBSTRING(<columnname>,1,3)

 

DT_NUMERIC(<precision, scale>) precision is the maximum number of digits in the value, both left and right of the decimal point; scale is the number of digits right of the decimal point. E.g. DT_NUMERIC(5,2) has a maximum of 999.99.

Read more about Integration Services Data Types

 

Variable name is case-sensitive like in C#.

Using a variable in script C# programming: Dts.Variables["User::Number"].Value;

Using a variable in an Expression: @[User::Name]

            MONTH(@[User::Birthdate]) == 5 ?  "May" : "Another month"

Using a parameter in an Expression: @[$Project::BackupDirectory]

Using a system variable in an Expression: @[System::PackageName]

 

Always cast a value

 

Data type in SQL Server   Derived column cast
int                       (DT_I4)0   (DT_I4)1   NULL(DT_I4)
bigint                    (DT_I8)0   (DT_I8)123456789123456789   NULL(DT_I8)
numeric(5,2)              (DT_NUMERIC,5,2)999.99   NULL(DT_NUMERIC,5,2)
decimal(18,2)             (DT_DECIMAL,2)999.99   NULL(DT_DECIMAL,2)
float                     (DT_R8)999.99   NULL(DT_R8)
bit                       (DT_BOOL)1   (DT_BOOL)0   NULL(DT_BOOL)
datetime                  (DT_DATE)"9999-12-31"   (DT_DATE)"2016-12-31 23:50:45"   NULL(DT_DATE)
nvarchar(3)               (DT_WSTR,3)"Big"   NULL(DT_WSTR,3)
varchar(5)                (DT_STR,5,1252)"Small"   NULL(DT_STR,5,1252)
money                     (DT_CY)123.45   NULL(DT_CY)

 

Expressions that cast or convert a value, e.g. an integer to be concatenated into a string:

 

OrderDate is a datetime; cast it to a dbdate and back to a date, which has time 00:00:00:

 

(DT_DATE)(DT_DBDATE)OrderDate

 

Get OrderDate as the first day of the month:

 

(DT_DATE)((DT_WSTR,4)(YEAR(OrderDate)) + "-" +  (DT_WSTR,2)(MONTH(OrderDate)) + "-01")

 

(DT_WSTR,4) is C# way to cast a value at the right side to a nvarchar(4).

 

(DT_WSTR, 10) (DT_DATE) DATEADD("dd", - @[$Package::DaysAgo], GETDATE())

 

Make a filename with number of current month:

 

"C:\\File\\Data_"+RIGHT("0"+ (DT_WSTR,2) MONTH(GETDATA()), 2)+".txt"

 

(DT_WSTR,2) is C# way to cast a value at the right side to a nvarchar(2).

(DT_DECIMAL,2) cast string to decimal value with 2 digits after decimal point.

 

Get Filename from column PathFileName:

RIGHT([PathFileName], FINDSTRING(REVERSE([PathFileName]), "\\", 1) - 1)

 

Get Folderpath from column PathFileName:

SUBSTRING([PathFileName], 1, LEN([PathFileName]) - FINDSTRING(REVERSE([PathFileName]), "\\", 1) + 1)

 

Get Foldername from column PathFileName:

TOKEN([PathFileName], "\\", TOKENCOUNT([PathFileName], "\\") - 1)

 

Guid variable with data type Object

Dts.Variables["User::Guid"].Value = "{" + Guid.NewGuid().ToString() + "}";

 

Expression is useful in Derived Column but can also be used in Send Mail Task for property Subject like "ETL done at: " + (DT_WSTR,30) @[System::StartTime] +"."

Or in an Execute SQL Task at the property SqlStatementSource to make a dynamic sql statement, but be careful because it can be hard to read, understand and maintain. I think it is better to use a variable for dynamic sql, so I can show it in a message box to check the syntax and correctness before it is sent to the database server to pull or fetch data into a resultset inside an Execute SQL Task or into a 'Source'; SSIS 2014 now prefers a variable. I can make an expression on the Data Flow Task to disable it, in case a previous step has put a pipeline Row Count into a variable: @[User::NOBills] == 0,

or at the precedence constraint green arrow to continue when @[User::NOBills] > 0, where a fx logo is shown.

 

Expression operators taken from C#:

 

Operator                   Meaning           Example
==                         equivalent        same == same
!=                         inequality        Old != new
>   >=                     greater than      10 > 5
<   <=                     less than         10 < 15
&&                         logical and       x >= 100 && x <= 200 (interval)
||                         logical or        x == 100 || x == 200
!                          logical not       !ISNULL(x) (x has a value)
<boolean>?<true>:<false>   iif, inline if    ISNULL(x) ? -1 : 1
%                          modulus           25 % 2 is 1 for an odd number

 

SUBSTRING, UPPER, LOWER, TRIM, LEN, DATEDIFF, CEILING for rounding up, FLOOR for rounding down, and REPLACENULL(<columnname>, 0) for replacing null with zero in SSIS.

 

Unnatural NULL Behaviour

  • NULL is not equal to zero or blank.
  • NULL is not equal to NULL.
  • NULL is not »not equal« to NULL.
  • TRUE OR NULL => TRUE.
  • TRUE AND NULL => NULL.
  • SELECT F FROM T WHERE EXISTS (SELECT NULL) makes EXISTS TRUE.
  • NULL-valued integer column is different from a NULL-valued string column.
  • IS NOT NULL in a query to search a column that has a value or an empty string "" with a length 0.
  • A null string is different than an empty string "" with length 0, because a null string doesn't point to anything in memory.
  • Also called DBNull, vbNullString or Nothing like: If Not rst Is Nothing Then.
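
A couple of these rules can be checked directly in T-SQL:

SELECT 'never shown'  WHERE NULL = NULL                  -- no row: NULL is not equal to NULL
SELECT 'never shown'  WHERE NULL <> NULL                 -- no row: NULL is not »not equal« to NULL either
SELECT 'always shown' WHERE EXISTS (SELECT NULL)         -- one row: EXISTS(SELECT NULL) is TRUE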

 

ISNULL(SalesPersonName) ? "Unknown" : SalesPersonName

better with: REPLACENULL(SalesPersonName, "Unknown")  (or use "N/A").

REPLACENULL(SalesAmount, (DT_CY)0)

REPLACENULL(SalesDate, (DT_DATE)"1900-01-01")

 

In case of a null value or a zero-length empty string value:

(ISNULL(SalesPersonName) || TRIM(SalesPersonName) == "") ?

       "Unknown" : TRIM(SalesPersonName)

 

Return null value cast to the right data types

NULL(DT_STR,4,1252)     to a varchar(4) column

NULL(DT_WSTR,3)           to an nvarchar(3) column

NULL(DT_DATE)               to datetime with 00:00:00.000

NULL(DT_DBTIMESTAMP)

NULL(DT_BOOL)               to bit also called boolean

NULL(DT_I4)                    to int column

NULL(DT_I2)                    to smallint column

NULL(DT_DECIMAL,2)       to decimal

NULL(DT_NUMERIC,3,2)

NULL(DT_CY)                    to money

 

Compare two columns that might have a Null in a Conditional Split

I always replace Null with the character ¤, because it is very rare that a text column has only this value. Then I can compare columns that have a checkmark in 'Allow Nulls'.

 

For nvarchar I can replace Null with ¤ directly, because the columns are already strings:

REPLACENULL(s_Name,"¤") != REPLACENULL(d_Name,"¤")

 

For int, bigint, tinyint, smallint, decimal, float, real, numeric, money, date, datetime and time, I first cast the column to a string with 30 characters:

 

REPLACENULL((DT_WSTR,30)s_Number,"¤") != REPLACENULL((DT_WSTR,30)d_Number,"¤")

 

A CustomerNo is a bigint, and when it is null I want -1 instead of null:

 

REPLACENULL(CustomerNo, (DT_I8)-1)  or  (DT_I8)REPLACENULL(CustomerNo,-1)

 

If an OrderDate column in pipeline is null then change it to January 1, 1900:

 

REPLACENULL(OrderDate, (DT_DATE)"1900-01-01")

 

If an OrderDate datetime column with time 00:00:00.000 is compared to the date 2016-12-31, I cast the date with (DT_DATE) to make it 2016-12-31 00:00:00.000 so the two dates can be compared, then return true else false:

 

OrderDate == (DT_DATE)"2016-12-31" ? (DT_BOOL)1 : (DT_BOOL)0

 

If a ValidTo date column with no time I cast the date with (DT_DBDATE):

 

(DT_DBDATE)ValidTo == (DT_DBDATE)"9999-12-31" ? (DT_BOOL)1 : (DT_BOOL)0

 

If divide by zero then let the result be 0 else calculate the ratio:

 

REPLACENULL(Quantity,0) == 0 ? 0.0 : Price / Quantity

 

Other string functions

Replace in a string or if integer, cast first:

REPLACE((DT_WSTR,8)NumberOfDays,"6","5")

 

Replace zero (0) with NULL:

REPLACE((DT_WSTR,8)<IntegerColumn>,"0",NULL(DT_I4))

 

Convert a blank string into NULL in a 'Derived Column' as varchar:

TRIM(<TextColumn>) == "" ? (DT_STR, 4, 1252)NULL(DT_STR, 4, 1252) : <TextColumn>

 

Replace an integer column with NULL if the value is negative:

(<IntegerColumn> < 0) ? NULL(DT_I4) : <IntegerColumn>

 

Replace a double column with NULL if the value is negative:

(<DoubleColumn> < 0) ? NULL(DT_NUMERIC,3,2) : <DoubleColumn>

 

Replace a date column with NULL if the value is »in the future«:

(<DateColumn> > GETDATE()) ? NULL(DT_DATE) : <DateColumn>

 

LEFT(<string>, <number-of-chars>) A shorthand way of writing SUBSTRING(<string>, 1, <number-of-chars>)

TOKEN(<string>, <delimiter>, N) Returns the Nth token in <string> when it is split by <delimiter>

TOKENCOUNT(<string>, <delimiter>) Returns the total number of tokens in <string> as determined by <delimiter>

TOKEN(,,) will be particularly useful when manually parsing columns from a flat file.

When used in conjunction with TOKENCOUNT it can be used to return the last token in a string like so:

TOKEN(<TextColumn>, ",", TOKENCOUNT(<TextColumn>, ",") )

 

To compare with a string, use " around the text, like @[User::Name] == "Paris" in a 'Conditional Split'. Use [ ] around a column name that contains a space or starts with a digit.

 

Inside the Data Flow Task the Derived Column does the manipulation of data

When data from an OLE DB Source goes into a pipeline, the columns can be derived into new columns with new column names that exist in the table of the OLE DB Destination, or the column values can be replaced by cleansing rule expressions together with the ? condition (if-then-else) and casts to get the right data type and length of text. Here are some examples:

 

 

Expression 'Discount != 0 ? (DT_BOOL)1 : (DT_BOOL)0' means that when the pipeline column Discount has a value different from 0, a new column in the pipeline called DiscountStatus with data type bit (DT_BOOL in SSIS) is assigned the value 1 = true, else the value 0 = false, and saved into a column in the table of the 'OLE DB Destination'.

 

Expression 'Quantity > 0 && Quantity < 50 ? Quantity : NULL(DT_I2)' means that when the pipeline column Quantity has a value between 1 and 49, the value is passed through, but if the column value is less than or equal to 0 or greater than or equal to 50 it is replaced with a null value, because these values are wrong in the legacy source system and could give wrong sums and averages in data warehouse reporting.

 

The new columns NumberOfAbroad and NumberOfDomestic will get the value 1 or 0 depending on the business logic rule for the column Place, and since a sql statement, a column in a report or a measure in a cube will sum and total the NumberOf columns, it is important that the derived column has the right data type to cover the range of the total value; therefore I cast the values 1 and 0 to DT_I4, which is the int data type in the table of the OLE DB Destination.

 

The SSIS package has an Int32 variable called Batchnumber, and the variable has been assigned a value in the 'Control Flow'. The variable value can be added to the pipeline inside the 'Derived Column' transformation with the syntax @[User::Batchnumber] and saved into a column in the table of the 'OLE DB Destination'.

 

By editing the 'OLE DB Destination' I can click the New button and type in a good table name, which is created automatically when I click OK, so I change the [OLE DB Destination] name to a table name like [Stage_OrderDetails], and the pipeline will map the columns from the source table and the derived columns to the new destination table columns:

 

 

The new table will allow null in all columns, and that can be fine for a source input table, so normally there is no need for a checkmark in 'Check constraints' above.

 

Cast or convert Danish letters from code page 865 DOS ASCII to code page 1252 Windows ANSI in a varchar column

The error »cannot be processed because more than one code page« means that the data in the OLE DB Destination uses a different code page than 1252. Here is an example of casting or converting between code pages; I am using (DT_STR,50,1252) as a cast in front of the column name, and it is important to make a new column in the pipeline with the derived column, like this where the column name is »by«, meaning city:

 

 

When the column is unicode with nvarchar(50), use (DT_WSTR,50)by instead.

 

 

4. Execute SQL Task to assign a variable to use in Control Flow

A SQL statement queries a database table and returns only one row, and the two columns will be assigned to two SSIS package variables:

 

SELECT StatusId, StatusDateTime  -- int and datetime

FROM dbo.STATUS

WHERE Status = 0

 

In an SSIS package, in the Control Flow, I right-click the canvas and select {Variables}, and in the window I click the small button [Add Variable] to insert a new row and type in the variable name and data type, where sql int corresponds to the package type Int32. Since my sql statement has two columns I make two variables with default values, but the default values will be overridden by the values from the sql statement:

 

 

In the Control Flow I drag in an 'Execute SQL Task', edit it, choose a Connection and type in the sql statement in the SQLStatement property. I set the property ResultSet to 'Single row', and at the page 'Result Set' I click the [Add] button and override 'NewResultName' with the first column name, or just 0 for the first column and 1 for the second column. I also choose the right package variable:

 

 

To test whether the assignment of the two variables has been done, I drag in a 'Script Task':

 

 

I edit the 'Script Task' and at the property ReadOnlyVariables I click […] and checkmark the two package user variables I want to use inside the script. I click the [Edit Script] button and type in a MessageBox line above the Success line like this:

 

MessageBox.Show(Dts.Variables["User::SID"].Value.ToString() + " - " +

Dts.Variables["User::SDATE"].Value.ToString());

 

Dts.TaskResult = (int)ScriptResults.Success;

 

I start the package and see the messagebox showing the values from the row in the status table in the database.

 

When the sql statement does not return a row, an error occurs while assigning a value to a variable: "Single Row result set is specified, but no rows were returned." Therefore I need to extend the sql statement so it will always return a row, and I will make a constraint so the Control Flow can branch to another script.

 

In the 'Execute SQL Task' I change the sql statement to check whether there is a row, and if the EXISTS becomes false I create a row with default values for the two columns:

 

IF EXISTS(SELECT 1

          FROM dbo.STATUS

          WHERE Status = 0)

   SELECT StatusId, StatusDateTime

   FROM dbo.STATUS

   WHERE Status = 0

ELSE

   SELECT 0 AS StatusId, '1900-01-01 00:00:00' AS StatusDateTime

 

I edit the green 'Precedence constraint' arrow between 'Execute SQL Task' and 'Script Task', because I can extend the constraint with a logical boolean expression, so I change 'Evaluation operation' from "Constraint" to "Expression and Constraint". In the expression I'm using the SID variable with the C# notation for comparison:

@[User::SID] != 0

 

 

A fx logo is shown at the green 'Precedence constraint' arrow, telling that there is an expression included in the criteria for success and continuation.

 

 

When the sql statement does not return a row, the 'Precedence constraint' will not continue, and the SSIS package does not fail.

 

But if I want the 'Control Flow' to continue, I drag in another 'Script Task', rename it to 'No Row' and type in a MessageBox "No row". I edit the green 'Precedence constraint' arrow between 'Execute SQL Task' and 'No row' using: @[User::SID] == 0

 

 

The 'Control Flow' will then always continue the tasks of the SSIS package, and the two variables have values that can be used later in the package, like in a 'Data Flow Task'.

 

Assign a datetime variable for fixed value

 

 

Drag in an 'Execute SQL Task', edit it and type in a sql statement to get the date and time from the SQL Server. You can also use GETUTCDATE() for UTC time without daylight saving time; this is for normal local time with daylight saving time:

 

SELECT GETDATE() AS CurrentDateTime

 

Change ResultSet to 'Single row', select a Connection, and at the Result Set page type in the column name from the sql statement and select the variable:

 

 

Then you can use the package variable User::CurrentDateTime in a 'Script Task'

 

MessageBox.Show(Dts.Variables["User::CurrentDateTime"].Value.ToString());

 

Then you can use the package variable User::CurrentDateTime in an 'Execute SQL Task' as an input parameter to make a dynamic sql statement, in the where part or in the set part of an update statement like this:
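
A hypothetical statement of that kind, reusing the STATUS table from above, where the ? is mapped to User::CurrentDateTime as parameter 0 on the 'Parameter Mapping' page:

UPDATE dbo.STATUS
SET StatusDateTime = ?
WHERE Status = 0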

 

 

Please note: use DATE as the Data Type for the Input direction, because the OLE DB DATE type also covers a time, like a datetime. For the Output direction use DBTIMESTAMP.

 

If you typein an Expression for the variable in the variables window like this:

 

 

the variable CurrentDateTime will change value each time it is used, because the expression is evaluated at each use.

 

Execute SQL Task does not fail with no row

In case you do not like the sql statement starting with IF EXISTS(SELECT 1, the 'Execute SQL Task' can be changed to use 'Parameter Mapping', which means that the sql statement will have two output parameters, like this where each question mark is an output parameter:

 

SELECT ? = StatusId, ? = StatusDateTime

FROM dbo.STATUS

WHERE Status = 0

 

Other examples: SELECT TOP 1 ? = CustomerId - SELECT ? = MAX(System_key)

'Parameter Mapping' does not use the property ResultSet, therefore I set it to 'None', and the 'Execute SQL Task' will not fail when there is no row.

 

At the page 'Parameter Mapping' I connect each variable with each output parameter, where Parameter Name 0 is the first ? (StatusId) and the second ? is StatusDateTime. The Data Type for an output or input parameter is the OLE DB type as shown in chapter 3.

 

 

I still keep the two 'Precedence constraint' green arrows so the 'Control Flow' will do different tasks depending on whether there is a row or no row, because when there is no row the 'Parameter Mapping' will assign the SID variable the value 0.

 

Control flow depends on whether a value exists in a table

In a Result table I want to get a StatusId that has status true, and I want to check whether that StatusId exists in a Status table. In an SSIS package I use two 'Execute SQL Task's to get values from the database tables into SSIS package variables and let one variable, ExistsStatus, control the flow through two fx expressions:

  @[User::ExistsStatus] == True         // for Exists

  @[User::ExistsStatus] == False        // for Not exists

 

 

Get StatusId from Result

Variable @[User::StatusId] becomes the output parameter for the question mark at the left side of the equal sign:

 

SELECT TOP 1 ? = StatusId

FROM dbo.RESULT

WHERE Status = 1

 

In case there is no row, @[User::StatusId] gets the value 0.

 

Exists StatusId ?

Variable @[User::ExistsStatus] becomes the output parameter for the first question mark at the left side of the equal sign, and variable @[User::StatusId] becomes the input parameter for the second question mark at the right side of the equal sign:

 

SELECT ? = IIF(EXISTS(

               SELECT 1

               FROM dbo.STATUS

               WHERE StatusId = ?), 1, 0) -- 1 = True, 0 = False

 

Fx expressions

The left fx is using the logical expression: @[User::ExistsStatus] == True

and when the condition is fulfilled the left green 'Precedence constraint' will continue the control flow.

 

The right fx is using the logical expression: @[User::ExistsStatus] == False

and when the condition is fulfilled the right green 'Precedence constraint' will continue the control flow.

 

Insert into a table

'Execute SQL Task' can do an insert into sql statement with input parameter:

 

INSERT INTO dbo.RESULT(StatusId, Status)

VALUES(?,0)

 

Update of a table

'Execute SQL Task' can do an update sql statement with two input parameters:

 

UPDATE dbo.RESULT       UPDATE r

SET Status = ?          SET Status = ?

WHERE StatusId = ?      FROM dbo.RESULT r

                        WHERE r.StatusId = ?

 

Delete in a table

'Execute SQL Task' can do a delete sql statement with input parameter:

 

DELETE

FROM dbo.RESULT

WHERE StatusId = ?

 

Call stored procedure to do update and get the result back

I like to update a column NumberOf for a specific OrderId and get the result

 

OrderId   NumberOf
1         10
2         20
3         30

 

I put the business logic in a stored procedure with two input parameters and one output parameter holding the result of the update, which is assigned together with the UPDATE sql statement:

 

ALTER PROCEDURE [dbo].[Order_NumberOf]

      (@OrderId int, @Extra int, @NumberOf int OUTPUT)

AS 

BEGIN 

    BEGIN TRANSACTION 

    UPDATE dbo.OrderNumber

    SET @NumberOf = NumberOf = NumberOf + @Extra

    WHERE OrderId = @OrderId

    COMMIT TRANSACTION 

END

 

 

Dts.Variables["User::OrderId"].Value = 2;

Dts.Variables["User::Extra"].Value = 3;

Dts.TaskResult = (int)ScriptResults.Success;

 

MessageBox.Show(Dts.Variables["User::NumberOf"].Value.ToString());

Dts.TaskResult = (int)ScriptResults.Success;

 

After running the SSIS package, the message box will show 23 and the table looks like this:

 

OrderId   NumberOf
1         10
2         23
3         30

 

Parameter Names must be 0, 1, 2, matching the ? parameters in the same order. Named @parameters require an ADO.NET connection.
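
With the OLE DB connection, the SQLStatement that calls the procedure could be sketched like this, where the third ? is mapped with Direction Output to the package variable NumberOf:

EXEC dbo.Order_NumberOf ?, ?, ? OUTPUT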

 

In case I do not want to do the two package variable assignments in a 'Script Task', I can replace it with two 'Expression Task's that each do one package variable assignment:

 

 

 

Read variable values from a table into SSIS package variables

The table SSIS_Variable has a row for each variable with a data type and a value:

 

Variable     Datatype   Value
EndDate      DateTime   31-12-2015
Summertime   Boolean    True
Type         String     Production
Weight       Decimal    0,15
Year         Int32      2015
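
A minimal sketch of such a table, assuming nvarchar columns that are wide enough for the values:

CREATE TABLE dbo.SSIS_Variable
([Variable] nvarchar(50)  NOT NULL PRIMARY KEY,
 [Datatype] nvarchar(20)  NOT NULL,
 [Value]    nvarchar(100) NULL)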

 

An SSIS package has the same variables, which will be assigned the values from the table by a Script Task:

 

 

The 'Assign variable' Script Task has a C# program that connects to the database and the table, loops over each row and assigns each variable with the right data type:

 

#region Namespaces

using System;

using System.Data;

using Microsoft.SqlServer.Dts.Runtime;

using System.Windows.Forms;

//For connection to SQL Server

using System.Data.SqlClient;


using System.Text.RegularExpressions;

#endregion

 

namespace ST_cad614c72d4f4a1ab407b6cb5a7ec482

{

 [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]

 public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTAScriptObjectModelBase

 {

  public void Main()

  {

   try

   {

    //String connectionstring = "Server=<server>;Database=<db>;

    //                           Trusted_Connection=Yes;Connect Timeout=30";

    //Better using the Connection Manager from the package to get connection:

    String connectionstring =

      Dts.Connections["DSAconnectionstring"].ConnectionString.ToString();

    //Cleansing the connectionstring so it can be used for SqlConnection below:

    Regex regProvider = new Regex("Provider=([^;]*);");

    Regex regTranslate = new Regex("Auto Translate=([^;]*);");

    connectionstring = regProvider.Replace(connectionstring, "");

    connectionstring = regTranslate.Replace(connectionstring, "");

    SqlConnection connection = new SqlConnection(connectionstring);

    SqlCommand command = new SqlCommand();

    SqlDataReader reader;

    command.Connection = connection;

    connection.Open();

    command.CommandType = CommandType.Text;

    command.CommandText = "SELECT Variable,Datatype,Value FROM dbo.SSIS_Variable";

    reader = command.ExecuteReader();

    if (reader.HasRows)

    {

     while (reader.Read())

     {

      //For debugging:

      //if (reader["Variable"].ToString() == "Year")

      //{

      //MessageBox.Show(reader["Variable"].ToString()+reader["Value"].ToString());

      //}

      //For convert or parse to the right data type for the package variables:

      switch (reader["Datatype"].ToString())

      {

       case "Boolean":

          Dts.Variables[reader["Variable"]].Value = Boolean.Parse(reader["Value"].ToString());

       break;

       case "DateTime":

          Dts.Variables[reader["Variable"]].Value = DateTime.Parse(reader["Value"].ToString());

       break;

       case "Decimal":

          Dts.Variables[reader["Variable"]].Value = Decimal.Parse(reader["Value"].ToString());

       break;

       case "Int32":

          Dts.Variables[reader["Variable"]].Value = Int32.Parse(reader["Value"].ToString());

       break;

       default: //For String

         Dts.Variables[reader["Variable"]].Value = reader["Value"].ToString();

       break;

      }                                       

     }

    }

    command.Dispose();

    command = null;

    connection.Close();

    connection.Dispose();

    connection = null;

   }

   catch(SqlException ex)

   {

    MessageBox.Show("No connection");

   }

   Dts.TaskResult = (int)ScriptResults.Success;

  }

 }

}

 

Show variable has a C# program:

 

 public void Main()

 {

  MessageBox.Show(Dts.Variables["User::Year"].Value.ToString());

  MessageBox.Show(Dts.Variables["User::Weight"].Value.ToString());

  MessageBox.Show(Dts.Variables["User::EndDate"].Value.ToString());

  MessageBox.Show(Dts.Variables["User::Type"].Value.ToString());

  if ((bool)Dts.Variables["User::Summertime"].Value == true)

     MessageBox.Show("Summertime");

  else

     MessageBox.Show("Wintertime");

  Dts.TaskResult = (int)ScriptResults.Success;

 }

 

Package parameter in child package to be assigned from a parent package

An SSIS package can be parameterized like a stored procedure by using parameters, which are placed on the tab next to Control Flow and Data Flow; parameters can also have a default value, which is useful when executing the package during development.

 

The SSIS package parameter has Required = True, meaning that the calling parent SSIS package must provide a parameter value to its child. In case the parent does not always have a parameter value, Required should be set to False, and then Value will be the default parameter value.

 

Inside a package I can use a parameter as if it were a variable, in an Expression Task or a Script Task, written like this: $Package::FromYearMonth.

 

But a package parameter is a read-only variable, therefore it cannot be assigned in an Expression Task like this:  @[$Package::FromYearMonth] = 201512

But a parameter can be used at the right side of the equal sign like this:

 

 

An SSIS package acts as a parent package when it executes or calls another SSIS package as a child package and passes a value to the parameter through an 'Execute Package Task' like this:

 

 

 

 

5. Control Flow, workflow in design mode and runtime mode

When I open an SSIS package in design mode for development, I like to see the control flow as it will be in runtime mode; therefore I do not like a package to dynamically disable a 'Data Flow Task' through an expression fx mark on the 'Data Flow Task' or through script task programming that disables the 'Data Flow Task' (DFT).

 

I like to use precedence constraint green arrows with an expression fx to show the control flow or workflow of the package in a self-explanatory graphical user interface perspective, where the control flow determines the order in which executables (tasks and containers) run and under what conditions they are executed. In other words, certain executables will run only when a set of defined conditions is met in an expression fx at the precedence constraint green arrows.

 

Setting precedence constraints on tasks and containers gives a linear or branching control flow like this, where I fetch two boolean values from a table into two package variables and let their true/false values drive the runtime workflow:

  • Always do Fetch Customers as first task.
  • After Fetch Customers do Fetch Products task if variable is true.
  • Either after Fetch Customers Or after Fetch Products do Fetch Suppliers task.
  • Only do Fetch Orders and details if variable is true.

 

 

·       A precedence constraint green arrow has an expression, shown by a fx logo.

·       A solid arrow means AND: all precedence constraints that point to the constrained executable must evaluate to true in order for that executable to run.

·       A dotted arrow means OR: only one precedence constraint that points to the constrained executable must evaluate to true in order for that executable to run.

 

In runtime mode

 

 

Fetch data variable

I have made a table FetchData with columns of bit data type to control the flow; they are fetched into two package variables through an 'Execute SQL Task':
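
A sketch of what that 'Execute SQL Task' could run, assuming the bit columns are named after the two package variables FetchProducts and FetchOrders:

SELECT ? = FetchProducts, ? = FetchOrders   -- output parameters 0 and 1
FROM dbo.FetchData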

 

 

 

From FetchCustomers to FetchProducts

When the package variable FetchProducts has value True from the table, the workflow will do FetchProducts.

 

I connect the two tasks with a precedence constraint green arrow, right-click the arrow and select {Edit}; in the first dropdown I select 'Expression and Constraint' so that both the expression »@[User::FetchProducts] == True« and the constraint of »FetchCustomers« succeeding must be true to continue with the FetchProducts task. I use the default Logical AND, which gives a solid arrow:

 

 

From FetchProducts to Fetch Suppliers

I do the same as above, still using Logical AND, and get a solid arrow.

 

From FetchCustomers to Fetch Suppliers

When the package variable FetchProducts has the value False from the table, the workflow will only do FetchSuppliers; the path above is skipped, so FetchProducts is not executed.

 

I connect the two tasks with a precedence constraint green arrow, right-click the arrow and select {Edit}; in the first dropdown I select 'Expression and Constraint' so that both the expression »@[User::FetchProducts] == False« and the constraint of »FetchCustomers« succeeding must be true to continue with the Fetch Suppliers task. This time I use Logical OR, which gives a dotted arrow:

 

 

The solid arrow from FetchProducts to Fetch Suppliers automatically becomes dotted, showing in the self-explanatory graphical user interface that Fetch Suppliers only depends on either FetchProducts or FetchCustomers having been done.

 

To Fetch Orders and details / Do nothing, continue when FetchOrders is True

The task Fetch Orders and details will only be done if the package variable FetchOrders has the value true. To show this in a self-explanatory graphical user interface perspective I use an empty 'Script Task' that has no variables and no programming, and from the 'Script Task' to Fetch Orders and details I add the expression »@[User::FetchOrders] == True«; there is no need for a constraint because the 'Script Task' will never fail:

 

 

 

Do nothing, continue when package parameter is True

An SSIS package can receive a value from another package through a parameter, and the package can then control the flow depending on the value of the parameter.

 

An empty 'Script Task' »Do nothing, continue when Fetch is True« has no variables and no programming; it is just placed there as a dummy object, so I can make a precedence constraint green arrow that has an expression shown by a fx logo

fx: @[$Package::Fetch] == True

for programming a self-explanatory graphical user interface of the control flow of the package like this:

 

 

When the fx expression condition is right/true,

the job will enter the Sequence Container.

 

 

When the fx expression condition is wrong/false,

the job will not enter the Sequence Container.

 

 

Both 'Execute SQL Task's for »Begin package« and »End package« log registration will be done, and »Fetch Stock« will also be done because it is independent of the Fetch package parameter.

 

 

When a package variable or a package parameter has data type Int32 and a default value of 0, the expression for continuing with the task can be like this:

 

@[$Package::parameter] != 0 && !ISNULL(@[$Package::parameter])

 

When a package variable or a package parameter has data type String and a default value of 0000 (from an nchar(4) column in a table), the expression for continuing with the task can be like this:

 

@[$Package::param] != "0000" && !ISNULL(@[$Package::param]) && @[$Package::param] != ""

 

 

 

6. Package Job Log for raise and catch error failure

When I have parent and child packages, I like to wrap them all into one Job package with a simple joblog table that takes care of ensuring the job cannot run twice at the same time and has a few status states like 'Running', 'Finish' and 'Error'. To catch an error in the Job package or in the parent and child packages below it, I use Event Handlers, which is a tab next to Control Flow, Data Flow and Parameters. The main Job package can look like this, where I use fx expressions to control the process flow:

 

 

A JobLog table has some columns:

 

JobLogId   Status   BeginTime                 EndTime                   ErrorMessage
6          Finish   2015-01-01 04:00:00.000   2015-01-01 04:56:02.123
7          Finish   2015-01-02 04:00:00.000   2015-01-02 04:43:50.497
8          Error    2015-01-03 04:00:00.000   2015-01-03 04:12:31.617   Error while executing package "Suppliers.dtsx".

 

Execute SQL: JobLog Status is not Running

To check that the job is not already running, i.e. has no row with status 'Running':

SELECT ? = IIF(EXISTS(SELECT 1 FROM dbo.JobLog WHERE Status = 'Running'), 0, 1)

and set package variable Status_Not_Running to output value True or False.

fx: @[User::Status_Not_Running] == True to continue the job control flow.

fx: @[User::Status_Not_Running] == False to send a failure email because the job is already running.

 

Execute SQL: Job Status set to Running and set BeginTime and get next JobLogId

To insert a new row in JobLog, fetch the next JobLogId from the Identity number, set Status to 'Running' and set the BeginTime datetime stamp:

BEGIN TRANSACTION

DECLARE @InsertedRow TABLE (JobLogId int)

INSERT INTO dbo.JobLog (Status, BeginTime)

OUTPUT inserted.JobLogId INTO @InsertedRow

VALUES ('Running', Getdate())

SELECT ? = JobLogId

FROM @InsertedRow

COMMIT TRANSACTION

and set package variable JobLogId to output value.

fx: @[User::JobLogId] >= 1 to continue the job control flow.

fx: @[User::JobLogId] == 0 to send a failure email because there is no new JobLogId.

 

Sequence Container: Execute packages in parallel

The Sequence Container executes multiple packages in parallel at the same time.

 

Execute SQL: JobLog Status set to Finish and set EndTime

To update the new row in JobLog and set Status to 'Finish' and set EndTime:

BEGIN TRANSACTION

UPDATE dbo.JobLog

SET Status = 'Finish', EndTime = Getdate()

WHERE JobLogId = ?

COMMIT TRANSACTION

with package variable JobLogId as input value.

 

Event Handlers

I click the tab Event Handlers, and in the left dropdown I select the name of the package, which represents package-level scope for the event. In the right dropdown I select 'OnError' to catch an error in the process or in the child packages.

To get a canvas or design surface where SSIS components can be dragged in, I click the line link:

 

              Click here to create an 'OnError' event handler

 

The canvas will turn dark gray and I can drag in an 'Execute SQL Task' or copy one from the 'Control Flow' tab:

 

Execute SQL: JobLog Status set to Error and set EndTime

To update the new row in JobLog, set Status to 'Error', set EndTime and get the system error message from the package:

BEGIN TRANSACTION

UPDATE dbo.JobLog

SET Status = 'Error', EndTime = Getdate(), ErrorMessage = ?

WHERE JobLogId = ?

COMMIT TRANSACTION

with the package variable JobLogId as an input value and the first ? parameter with index 0 mapped to the system variable System::ErrorDescription as input, as shown in the picture before.

 

Event Handlers has many more events like OnPreExecute and OnPostExecute that can do things before and after the package process.

 

Of course the Event Handlers can also be placed in each parent and child package to fetch a better System::ErrorDescription message to be registered in the JobLog.

 

A SQL Server Agent Job will, in case of a failure, be red with a message 'The step failed' when the Job package raises an error.

 

 

A Sequence Container can be used for many different tasks, and in case you like to catch an error failure inside the Sequence Container it can be done like this:

 

 

 

 

7. Execute T-SQL Statement Task

In the Control Flow drag in an 'Execute T-SQL Statement Task', right-click {Edit} and click [New] the first time it is used in a project. Call the connection TSQL<databasename> and type in the SQL Server name (or use . or localhost). Type in a sql statement like UPDATE, DELETE, TRUNCATE, EXEC stored procedure or a dummy like »print 'hello'«. A new connection is shown at the bottom of the package in Connection Managers, and it is a new connection because this task uses the ADO.NET connection manager type with '.Net Providers\SqlClient Data Provider'. Right-click the TSQL connection and choose {Convert to Project Connection}. In Solution Explorer open the connection and select a database for the sql statement, otherwise the database name has to be part of the statement: FROM <database>.dbo.<table>. Back in the package, right-click the TSQL connection and choose {Parameterize}; the connection string then becomes a project parameter, and the value in the dialog box only needs to be:

 

Data Source=<server>;Initial Catalog=<database>;Integrated Security=True;Connect Timeout=30;

 

Remember to click [Save All]; the parameter can be mapped to an Environment variable in the SSISDB catalog. I can now go back and type in the right sql statement.

 

A project will thus have two connections to the same database: one for the Data Flow Task and one for the Execute T-SQL Statement Task.
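
As mentioned in chapter 2, a T-SQL MERGE statement also belongs in this task. A minimal sketch, assuming a hypothetical staging table dbo.Stage_Customer and a target table dbo.Customer keyed on CustomerId:

MERGE dbo.Customer AS t
USING dbo.Stage_Customer AS s ON t.CustomerId = s.CustomerId
WHEN MATCHED AND t.Name <> s.Name THEN
     UPDATE SET t.Name = s.Name
WHEN NOT MATCHED BY TARGET THEN
     INSERT (CustomerId, Name) VALUES (s.CustomerId, s.Name);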

 

 

8. Kimball SCD type 7 dimension

I will show a way to implement a type 7 dimension in an SSIS package, fetching data from a source table in a source legacy system and storing it in a dimension table with ten audit columns that support type 7, which is an extension of types 1, 2 and 3 plus handling of deleted values. There is an example in my article Dimensional modeling that is good to read before programming the package that will perform the ETL process from source to dimension.

 

In case you don't like the previous-reference _pkey column in the dimension table from the article, you can skip the _pkey column in the table and in the SSIS package implementation, because at the end of this tip I show an alternative sql update without _pkey. Myself, I have good use of the _pkey column when fetching from a dimension.

 

The two table structures for this example:

 

CREATE TABLE [dbo].[Source_System]

([Id] [int] NOT NULL,                              -- Business key in source system therefore primary key here.

 [Name] [nvarchar](50) NULL,

 [Version] [decimal](5, 2) NULL,

 CONSTRAINT [PK_Source_System] PRIMARY KEY CLUSTERED

([Id] ASC) ) ON [PRIMARY]

GO

CREATE TABLE [dbo].[Dim_System]

([System_key] [int] IDENTITY(1,1) NOT NULL,

 [System_dkey] [int] NOT NULL,

 [System_pkey] [int] NOT NULL,

 [System_bkey] [int] NOT NULL,

 [System] [nvarchar](50) NULL,

 [Version] [decimal](5, 2) NULL,

 [ValidFrom] [datetime] NOT NULL,

 [ValidTo] [datetime] NOT NULL,

 [IsCurrent] [bit] NOT NULL,

 [IsDeleted] [bit] NOT NULL,

 [InsertTime] [datetime] NOT NULL,

 [UpdateTime] [datetime] NOT NULL,

 CONSTRAINT [PK_Dim_System] PRIMARY KEY CLUSTERED

([System_key] ASC) ) ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_Dim_System_bkey] ON [dbo].[Dim_System]

([System_bkey] ASC) ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_Dim_System_dkey] ON [dbo].[Dim_System]

([System_dkey] ASC) ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_Dim_System_pkey] ON [dbo].[Dim_System]

([System_pkey] ASC) ON [PRIMARY]

GO

 

Remark: datetime2(n) has a date range of 0001-01-01 through 9999-12-31 while datetime starts from 1753-01-01. datetime takes 8 bytes of storage. datetime2 takes 6 bytes for precisions less than 3, 7 bytes for precisions 3 and 4, else 8 bytes.

datetime2(3) has the same number of decimals as datetime. smalldatetime has a date range of 1900-01-01 through 2079-06-06 with 00 seconds in the time part, takes 4 bytes of storage, and with it I could use »the forever date« like 2070-12-31.

I choose datetime and let ValidFrom default to 1900-01-01 and ValidTo default to 9999-12-31.
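The storage sizes can be verified with a small T-SQL sketch using DATALENGTH:

DECLARE @dt  datetime     = GETDATE()
DECLARE @dt2 datetime2(3) = SYSDATETIME()
DECLARE @dt7 datetime2(7) = SYSDATETIME()
SELECT DATALENGTH(@dt)  AS DatetimeBytes,     -- 8 bytes
       DATALENGTH(@dt2) AS Datetime2_3Bytes,  -- 7 bytes for precision 3
       DATALENGTH(@dt7) AS Datetime2_7Bytes   -- 8 bytes for precision 7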

 

The source legacy system does not provide any datetime column for when data is valid from, or any status flag for New, Created, Changed, Updated or Deleted data, therefore the SSIS package has to do the data detection of New, Changed or Deleted data in the source legacy system and reflect it in the dimension table.

 

In the SSIS package I’m not using 'OLE DB Command'; instead the sql update of audit columns in previous rows will be done by sql statements that don’t need variables, therefore I’m placing them in 'Execute T-SQL Statement Task' to get the best performance.

 

I use 'Data Flow Task', 'Execute T-SQL Statement Task', 'OLE DB Source', 'Merge Join', 'Conditional Split', 'Derived Column' and 'OLE DB Destination' in the flow of data from the source legacy system table Source_System to the type 7 dimension table Dim_System in a datamart:

 

 

I will make a SSIS package where data from the Source_System table is handled as a Kimball SCD type 7 in the Dim_System table. From the Source_System table the package will do data detection of:

 

  • New data in Source_System that does not exist in Dim_System.
  • Changed data in Source_System compared to data in Dim_System.
  • Deleted data in Source_System that still exists in Dim_System.
  • Unchanged data in Source_System that has not changed.

 

The package will insert rows from the Source_System table into the Dim_System table together with audit columns. The package will end with updating rows in table Dim_System to set the _dkey column and other audit columns.

 

Since the audit columns will use default values, I can either type in the default values for each audit column in the table design of Dim_System, or I can make them as parameters and variables for the SSIS project and the SSIS package like this:

 

Parameter name        Datatype   Value
defaultdkey           Int32      0
defaultpkey           Int32      0
defaultIsCurrent      Boolean    True    (same as 1 in sql)
defaultIsDeleted      Boolean    True    (same as 1 in sql)
defaultIsDeleted_Not  Boolean    False   (same as 0 in sql)
defaultValidFrom      DateTime   01-01-1900 00:00:00
defaultValidTo        DateTime   31-12-9999 00:00:00
DMAconnstr            String
TSQL_DMAconnstr       String

 

 

Variable name     Datatype   Expression / Value
currentDateTime   DateTime   GETDATE()
RowCount          Int32      0

 

In 'Control Flow' I drag in a 'Data Flow Task' and inside it I drag in two 'Source' components for the tables Source_System and Dim_System with sql statements that make sure the data has the right sort order for a later 'Merge Join' that will join over the business keys. Therefore I order by the business key in both sql statements and only include the columns the package needs, like this:

 

'Source' to get data from source table:

SELECT Id, Name, Version
FROM Source_System
ORDER BY Id  -- business key

'Source' to get data from dimension:

SELECT System_bkey, System, Version,
       System_key, System_dkey
FROM Dim_System
WHERE IsCurrent = 1 AND IsDeleted = 0
ORDER BY System_bkey  -- business key

 

When the sql statement takes data from an archive there will be a WHERE Active = 1 as well.

 

Since both 'Source' tables use some of the same column names, I will rename the columns of the Source_System table by editing the 'Source', going to the Columns page and typing in a prefix s_ on the output columns like this:

 

 

At both 'Source' rightclick {Show Advanced Editor}, tab 'Input and Output Properties', click at 'OLE DB Source Output' and set property IsSorted to True and open 'OLE DB Source Output' and open 'Output Columns' and select the first column from ORDER BY and set property SortKeyPosition to 1, see previous tip about this.

 

Drag in 'Merge Join' and connect the two 'Source' outputs, one of them as 'Merge Join Left Input', but in reality I want a 'Full outer join' because it will give me all rows from both 'Source' and a business key will be null when there is no match. I include all columns from both tables because:

 

  • For new data I will use columns from Source_System.
  • For changed data I will compare columns from both tables.
  • For deleted data I will use columns from Dim_System.
  • Unchanged data is not relevant because the data is »same same«.

 

I rightclick {Edit} and select join type 'Full outer join' and select all columns:

 

 

'Full outer join' will provide data for four situations called:

 

  • New when the source table has a row that does not exist in the dimension table, therefore all columns from Dim_System are Null.
  • Changed when the source has values that are different compared to the dimension.
  • Deleted when the source does not have the value anymore but the dimension will keep it, therefore all columns from Source_System are Null.
  • Unchanged when the source data is still the same and already exists in the dimension.

 

I will divide or filter the data into its own pipeline through a 'Conditional Split' with three conditions using the business keys and the value columns from the source table (prefix s_) and the dimension table.

 

Pipeline: New
Rule: ISNULL(System_bkey)
Remark: The business key in the dimension does not match a source business key, therefore the source must come with new data.

Pipeline: Deleted
Rule: ISNULL(s_Id)
Remark: The business key in the source does not match a dimension business key, therefore the source must have deleted the data.

Pipeline: Changed
Rule: REPLACENULL(s_Name,"¤") != REPLACENULL(System,"¤") || REPLACENULL((DT_WSTR,30) s_Version,"¤") != REPLACENULL((DT_WSTR,30) Version,"¤")
Remark: The source has changed some of the values, therefore the dimension needs a new row with new ValidFrom/ValidTo. In case of many values a checksum can combine them into one value to save many || = OR comparisons. != means not equal to.

Pipeline: Unchanged
Rule: The default output name, the complement of a set, meaning everything not included in the above rules.
Remark: The source has not changed the data, therefore there is no need to do more; I send the pipeline to a Row Count, but it can be dropped before deployment.

 

The order (rank) of the rules is important, else Changed must start with a condition that makes sure both business keys have a value and therefore exist in both tables: !ISNULL(System_bkey) && !ISNULL(s_Id) && (<rule as above>).
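Outside SSIS the four situations can be tested with a T-SQL query before building the package. This is a minimal sketch only meant for checking the split rules against the two tables, not part of the package itself:

SELECT CASE
         WHEN d.System_bkey IS NULL THEN 'New'
         WHEN s.Id IS NULL          THEN 'Deleted'
         WHEN ISNULL(s.[Name], N'¤') <> ISNULL(d.[System], N'¤') OR
              ISNULL(CAST(s.Version AS nvarchar(30)), N'¤') <>
              ISNULL(CAST(d.Version AS nvarchar(30)), N'¤') THEN 'Changed'
         ELSE 'Unchanged'
       END AS Pipeline,
       s.Id, s.[Name], s.Version, d.System_key, d.System_dkey
FROM dbo.Source_System s
     FULL OUTER JOIN (SELECT System_bkey, [System], Version, System_key, System_dkey
                      FROM dbo.Dim_System
                      WHERE IsCurrent = 1 AND IsDeleted = 0) d ON d.System_bkey = s.Id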

 

 

The New pipeline will insert new rows into the Dim_System table, therefore the package has to set default values for audit columns using the project parameters and the package variable. I do it in a 'Derived Column' that will add extra columns to the New pipeline with a default value typed in as an Expression referring to the parameters and the variable. ValidFrom is set to its general »begindate« because the data is new in the Dim_System table, and ValidTo is set to its general enddate called »forever enddate«. Sometimes the source legacy system provides ValidFrom for when the data was new/created/born inside the system:

 

Derived Column name   Derived Column          Expression
System_dkey           Replace 'System_dkey'   @[$Project::defaultdkey]
System_pkey                                   @[$Project::defaultpkey]
ValidFrom                                     @[$Project::defaultValidFrom]
ValidTo                                       @[$Project::defaultValidTo]
IsCurrent                                     @[$Project::defaultIsCurrent]
IsDeleted                                     @[$Project::defaultIsDeleted_Not]
InsertTime                                    @[User::currentDateTime]
UpdateTime                                    @[User::currentDateTime]

 

I don’t like an audit column to be Null or n/a, therefore UpdateTime gets the same datetime stamp as InsertTime, so both columns have the same value until the data in the row is either changed or deleted in the source legacy system.

 

 

The New pipeline with audit column default values can now be inserted into the Dim_System table. I drag in a 'Destination', connect the New pipeline to it and select the Dim_System table as fast load. There is a default mapping that matches columns with the same name because the pipeline contains columns from both tables Source_System and Dim_System after the full outer join:

 

 

But the source columns (left side) have the prefix s_ and they need to be mapped to the right columns in the destination table Dim_System (right side), so I rearrange the mapping:

 

Pipeline column   Dimension column   Remark
s_Id              System_bkey        The primary key from the source becomes the business key in the destination.
s_Name            System             The destination does not use the generic word Name; the column is named System.
s_Version         Version            From source to destination.
System_dkey       System_dkey        Taken from default value.
System_pkey       System_pkey        Taken from default value.
ValidFrom         ValidFrom          Taken from default value.
ValidTo           ValidTo            Taken from default value.
IsCurrent         IsCurrent          Taken from default value.
IsDeleted         IsDeleted          Taken from default value.
InsertTime        InsertTime         Taken from default value.
UpdateTime        UpdateTime         Taken from default value.

 

Audit column System_dkey for the new rows will be set to the right value in a sql update statement at the end of the ETL process, while System_pkey keeps its default value 0 because a new row has no previous row.

 

After I have deleted the map lines and dragged new map lines I have rearranged the mapping, where I can’t use the pipeline columns that come from the dimension side of the join because the data is new in the source and therefore is Null in the New pipeline, so all the audit columns come from the default values to be inserted into the Dim_System table as new rows:

 

 

Audit column System_dkey will be set to the value of System_key in a sql update statement at the end of the ETL process, because System_key is an Identity sequential number that first gets its value after the row has been inserted. Then System_dkey will inherit the number, so in the first row both columns have the same number. When the data later is changed or deleted, it becomes a new row in the Dim_System table with a new number in System_key, and audit column System_pkey will hold a previous reference back to the System_key number of the previous row before the data was changed or deleted.

 

The Changed pipeline will insert new rows into the Dim_System table, therefore the package has to set default values for audit columns through 'Derived Column', where ValidFrom is set to currentDateTime because that is the time when Dim_System receives the changed data from Source_System. Sometimes the source legacy system provides ValidFrom for when the data was changed/updated inside the system:

 

Derived Column name   Derived Column   Expression
ValidFrom                              @[User::currentDateTime]
ValidTo                                @[$Project::defaultValidTo]
IsCurrent                              @[$Project::defaultIsCurrent]
IsDeleted                              @[$Project::defaultIsDeleted_Not]
InsertTime                             @[User::currentDateTime]
UpdateTime                             @[User::currentDateTime]

 

Audit columns System_dkey and System_pkey will inherit the values from the existing rows in the Dim_System table.

 

 

The Changed pipeline with audit column default values can now be inserted into the Dim_System table. I drag in a 'Destination', connect the Changed pipeline to it and select the Dim_System table as fast load. I rearrange the mapping:

 

Pipeline column   Dimension column   Remark
s_Id              System_bkey        The primary key from the source becomes the business key in the destination.
s_Name            System             The destination does not use the generic word Name; the column is named System.
s_Version         Version            The changed/updated Version to the destination.
System_key        System_pkey        The previous reference is inherited from the old row to the new row for changed data.
System_dkey       System_dkey        The durable key is inherited from the old row to the new row for changed data.
ValidFrom         ValidFrom          Taken from default value.
ValidTo           ValidTo            Taken from default value.
IsCurrent         IsCurrent          Taken from default value.
IsDeleted         IsDeleted          Taken from default value.
InsertTime        InsertTime         Taken from default value.
UpdateTime        UpdateTime         Taken from default value.

 

 

Audit columns ValidTo, IsCurrent and UpdateTime for the old previous rows will be set to the right value in a sql update statement at the end of the ETL process.

 

The Deleted pipeline will insert new rows into the Dim_System table, therefore the package has to set default values for audit columns through 'Derived Column', where ValidFrom is set to currentDateTime because that is the time when Dim_System found out that the data has been deleted in Source_System. Sometimes the source legacy system provides ValidFrom for when the data was deleted inside the system. The new row will be marked with IsDeleted = 1 = true because the row represents the deleted data in the source legacy system, therefore I use another parameter default value here:

 

Derived Column name   Derived Column   Expression
ValidFrom                              @[User::currentDateTime]
ValidTo                                @[$Project::defaultValidTo]
IsCurrent                              @[$Project::defaultIsCurrent]
IsDeleted                              @[$Project::defaultIsDeleted]
InsertTime                             @[User::currentDateTime]
UpdateTime                             @[User::currentDateTime]

 

Audit columns System_dkey and System_pkey will inherit the values from the existing rows in the Dim_System table.

 

 

The Deleted pipeline with audit column default values can now be inserted into the Dim_System table. I drag in a 'Destination', connect the Deleted pipeline to it and select the Dim_System table as fast load. I rearrange the mapping, where I can’t use the prefix s_ columns because the data has been deleted in the source and therefore is Null in the Deleted pipeline:

 

Pipeline column   Dimension column   Remark
System_bkey       System_bkey        The business key is inherited from the old row to the new row for deleted data.
System            System             The System value is inherited from the old row to the new row for deleted data.
Version           Version            The Version value is inherited from the old row to the new row for deleted data.
System_key        System_pkey        The previous reference is inherited from the old row to the new row for deleted data.
System_dkey       System_dkey        The durable key is inherited from the old row to the new row for deleted data.
ValidFrom         ValidFrom          Taken from default value.
ValidTo           ValidTo            Taken from default value.
IsCurrent         IsCurrent          Taken from default value.
IsDeleted         IsDeleted          Taken from default value.
InsertTime        InsertTime         Taken from default value.
UpdateTime        UpdateTime         Taken from default value.

 

 

Audit columns ValidTo, IsCurrent and UpdateTime for the old previous rows will be set to the right value in a sql update statement at the end of the ETL process.

 

Audit columns

The SSIS package control flow ends with an 'Execute T-SQL Statement Task' that has two sql update statements.

 

  • The first update statement is only for the new data rows from the source legacy system in the dimension table, to set audit column System_dkey to the same number value as the primary key column System_key, which is an Identity sequential number that first gets its value after the new rows have been inserted. The default value of System_dkey is 0, therefore it is easy to find the new rows and set System_dkey to the same value as System_key. Sometimes a dimension has an Unknown value with _key, _dkey and _pkey as 0 and _bkey as '¤' to handle source data when a business key does not exist in the dimension, therefore I add a criterion that System_key is greater than 0:

 

UPDATE d

SET System_dkey = d.System_key

FROM dbo.Dim_System d

WHERE d.System_key > 0 AND d.System_dkey = 0

 

  • The second update statement is for the old previous rows in the dimension table to set audit columns. I’m using column System_pkey from the new rows to connect to the old previous rows through their primary key column System_key to perform the update of three audit columns in the old previous rows:

 

 * Set ValidTo to the same datetime stamp that the new row got in ValidFrom.

 * Set IsCurrent = 0 = false because the new row is now the current row.

 * Set UpdateTime to the current datetime when the previous row was updated.

 

The sql statement is using a self-join between the new row with alias n and the previous row in the dimension with alias d, joining from the new row column System_pkey to the previous row column System_key where both have IsCurrent = 1, to fetch the new row and the last previous row and not older historical rows. The update then sets IsCurrent = 0 for the previous row, therefore there will be only one row with IsCurrent = 1 per business key:

 

UPDATE d

SET ValidTo = n.ValidFrom, IsCurrent = 0, UpdateTime = Getdate()

FROM dbo.Dim_System d

     INNER JOIN dbo.Dim_System n ON n.System_pkey = d.System_key

WHERE d.System_key > 0 AND n.IsCurrent = 1 AND d.IsCurrent = 1

 

In case you don’t like to use the Previous reference _pkey audit column in a dimension table, the second sql update statement can use System_dkey to perform the self-join, together with an extra condition that the new row System_key must be greater than the previous row System_key to avoid a row joining to itself:

 

UPDATE d

SET ValidTo = n.ValidFrom, IsCurrent = 0, UpdateTime = Getdate()

FROM dbo.Dim_System d

     INNER JOIN dbo.Dim_System n ON n.System_dkey = d.System_dkey AND

                                    n.System_key > d.System_key

WHERE d.System_key > 0 AND n.IsCurrent = 1 AND d.IsCurrent = 1

 

 

 

The SSIS package is ready to be tested.

 

Approaches for writing to the dimension table

The approach above starts with inserting the rows into the dimension table and thereafter updating the audit columns. Another approach would be to use a staging table which contains the changed and deleted data and a ValidFrom column that is assigned inside the staging table. Thereafter the ValidFrom is used to update the previous rows' ValidTo column in the dimension table together with the other audit columns, and finally the rows from the staging table are inserted into the dimension table.
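A minimal sketch of that order, assuming a hypothetical staging table dbo.Stage_System that already holds the changed and deleted rows with _dkey, _pkey and ValidFrom assigned:

BEGIN TRANSACTION

-- 1. Close the previous current rows using the ValidFrom assigned in the staging table.
UPDATE d
SET ValidTo = s.ValidFrom, IsCurrent = 0, UpdateTime = Getdate()
FROM dbo.Dim_System d
     INNER JOIN dbo.Stage_System s ON s.System_bkey = d.System_bkey
WHERE d.IsCurrent = 1

-- 2. Then insert the staged rows as the new current rows.
INSERT INTO dbo.Dim_System
  (System_bkey, [System], [Version], System_dkey, System_pkey,
   ValidFrom, ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)
SELECT System_bkey, [System], [Version], System_dkey, System_pkey,
       ValidFrom, '9999-12-31', 1, IsDeleted, Getdate(), Getdate()
FROM dbo.Stage_System

COMMIT TRANSACTION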

 

Merge in T-SQL

The process can also be solved by T-SQL programming with INSERT INTO and UPDATE or with a MERGE, but it needs two updates afterwards like below, where Target is table Dim_System and Source is table Source_System and Merge compares the business key between source and target.

Remark that ISNULL(Source.Name,'¤') <> ISNULL(Target.System,'¤') does not work for Merge when a value changes to null.

 

 

DECLARE @defaultdkey int = 0

DECLARE @defaultpkey int = 0

DECLARE @defaultIsCurrent bit = 1

DECLARE @defaultIsDeleted bit = 1

DECLARE @defaultIsDeleted_Not bit = 0

DECLARE @defaultValidFrom datetime = '1900-01-01 00:00:00'

DECLARE @defaultValidTo datetime   = '9999-12-31 00:00:00'

DECLARE @currentDateTime datetime  = GETDATE()

 

INSERT INTO dbo.Dim_System -- Insert rows for updated and deleted data into Target table

(System_bkey, [System], [Version], System_dkey, System_pkey, ValidFrom, ValidTo, IsCurrent,

 IsDeleted, InsertTime, UpdateTime)

SELECT System_bkey, [Name], [Version], System_dkey, System_key, @currentDateTime,

  @defaultValidTo, @defaultIsCurrent, @defaultIsDeleted_Not, @currentDateTime, @currentDateTime

FROM -- the columns in the select come from the OUTPUT action columns, System_key is used for System_pkey

(

MERGE dbo.Dim_System AS [Target]

USING (SELECT [Id], [Name], [Version] FROM dbo.Source_System) AS [Source]

ON [Target].System_bkey = [Source].Id AND [Target].IsCurrent = 1 AND [Target].IsDeleted = 0

WHEN NOT MATCHED BY TARGET -- Insert rows for new data into Target table

  THEN INSERT (System_bkey, [System], [Version], System_dkey, System_pkey, ValidFrom, ValidTo,

               IsCurrent, IsDeleted, InsertTime, UpdateTime)      

  VALUES ([Source].[Id], [Source].[Name], [Source].[Version], @defaultdkey, @defaultpkey,

          @defaultValidFrom, @defaultValidTo, @defaultIsCurrent, @defaultIsDeleted_Not,

          @currentDateTime, @currentDateTime)

WHEN MATCHED AND ((([Target].[System] <> [Source].[Name]) OR ([Target].[System] IS NOT NULL

    AND [Source].[Name] IS NULL) OR ([Target].[System] IS NULL AND [Source].[Name] IS NOT NULL))

  OR (([Target].[Version] <> [Source].[Version]) OR ([Target].[Version] IS NOT NULL

    AND [Source].[Version] IS NULL) OR ([Target].[Version] IS NULL

    AND [Source].[Version] IS NOT NULL)))

  -- Update rows in Target table because rows is not current anymore because Source change value

  THEN UPDATE SET ValidTo = @currentDateTime, IsCurrent = 0, UpdateTime = @currentDateTime

WHEN NOT MATCHED BY SOURCE AND [Target].IsCurrent = 1 AND [Target].IsDeleted = 0

  -- Update rows in Target table because rows has been deleted in Source table.

  THEN UPDATE SET ValidTo = @currentDateTime, IsCurrent = 0, UpdateTime = @currentDateTime 

  -- Updated rows are in OUTPUT to be inserted into Target table as new rows in the above INSERT INTO

OUTPUT $Action AS ActionOutput, inserted.System_key, inserted.System_bkey, inserted.System_dkey,

                                [Source].[Name], [Source].[Version]

) AS MergeOutput WHERE MergeOutput.ActionOutput = 'UPDATE'

 

-- For new rows to set System_dkey as same value as the given Identity for System_key

UPDATE d

SET System_dkey = d.System_key

FROM dbo.Dim_System d

WHERE d.System_key > 0 AND d.System_dkey = 0

 

-- For deleted rows to set IsDeleted in the new row and fetch the other values from previous row

UPDATE d 

SET IsDeleted = @defaultIsDeleted, [System] = p.[System], [Version] = p.[Version]

FROM dbo.Dim_System d

     INNER JOIN dbo.Dim_System p ON p.System_key = d.System_pkey

WHERE d.System_key > 0 AND d.IsCurrent = 1 AND d.IsDeleted = 0 AND d.[System] IS NULL AND

      d.[Version] IS NULL

 

MERGE is mostly for inserting new data and updating existing data and becomes a little more complex for Kimball type 2 or 7 dimensions because of the OUTPUT syntax and the extra INSERT INTO statement around the MERGE statement.

 

Insert and Update instead of Merge in T-SQL stored procedure

 

CREATE PROCEDURE [dbo].[DimensionT7_InsertUpdate]

AS

BEGIN

SET NOCOUNT ON

 

DECLARE @defaultValidFrom datetime = '1900-01-01 00:00:00'

DECLARE @defaultValidTo   datetime = '9999-12-31 00:00:00'

DECLARE @currentDateTime  datetime = GETDATE() -- or IDA_INSERT_TIME from a System database.

 

DECLARE @New_ROW_COUNT     int = 0

DECLARE @Deleted_ROW_COUNT int = 0

DECLARE @Changed_ROW_COUNT int = 0

 

-- I am using a @table_variable instead of a #temporary_table or a real_stage_table inside db or

-- a memory optimized table. I have added primary key index on the columns I will join to later.

DECLARE @NewRows TABLE (System_key int NOT NULL PRIMARY KEY CLUSTERED)

DECLARE @DeletedRows TABLE (System_key int, System_dkey int, System_bkey int,

        System nvarchar(50), Version decimal(5, 2))

DECLARE @ChangedRows TABLE (System_key int, System_dkey int,

        System_bkey int NOT NULL PRIMARY KEY CLUSTERED)

 

-- bkey is not in dimension, therefore row is new in source, that will be marked as current row.

INSERT INTO dbo.Dim_System

  (System_bkey, System, Version, System_dkey, System_pkey,

   ValidFrom, ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)

OUTPUT inserted.System_key INTO @NewRows

SELECT DISTINCT s.Id, s.Name, s.Version, 0, 0,

   @defaultValidFrom, @defaultValidTo, 1, 0, @currentDateTime, @currentDateTime

FROM dbo.Source_System s

WHERE NOT EXISTS(SELECT 1 FROM dbo.Dim_System d WHERE d.System_bkey = s.Id)

    

SET @New_ROW_COUNT = @New_ROW_COUNT + @@ROWCOUNT

 

/* For new rows to set System_dkey as same value as the given Identity for System_key. */

UPDATE d

SET System_dkey = d.System_key

FROM dbo.Dim_System d

     INNER JOIN @NewRows n ON n.System_key = d.System_key

 

/* bkey is not in source, therefore the row is deleted in source, that will be marked in

   dimension in current row and an extra row is inserted. */

UPDATE d

SET IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime

OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey, inserted.System,

       inserted.Version INTO @DeletedRows

FROM dbo.Dim_System d

WHERE d.System_key > 0 AND d.IsCurrent = 1 AND d.IsDeleted = 0 AND

      NOT EXISTS(SELECT 1 FROM dbo.Source_System s WHERE s.Id = d.System_bkey)

 

INSERT INTO dbo.Dim_System

  (System_bkey, System, Version, System_dkey, System_pkey,

   ValidFrom, ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)

SELECT DISTINCT d.System_bkey, d.System, d.Version, d.System_dkey, d.System_key,

   @currentDateTime, @defaultValidTo, 1, 1, @currentDateTime, @currentDateTime

FROM @DeletedRows d

--FROM dbo.Dim_System d

--WHERE d.IsCurrent = 0 AND d.IsDeleted = 0 AND d.ValidTo = @currentDateTime AND

--      NOT EXISTS(SELECT 1 FROM dbo.Source_System s WHERE s.Id = d.System_bkey)

 

SET @Deleted_ROW_COUNT = @Deleted_ROW_COUNT + @@ROWCOUNT

 

/* bkey is marked deleted in dimension but exists now again in source, that will be marked in

   dimension in current/deleted row and an extra row is inserted.

   Keep IsDeleted = 1 to show it was deleted in that period. */

UPDATE d

SET IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime

OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey INTO @ChangedRows

FROM dbo.Dim_System d

     INNER JOIN dbo.Source_System s ON s.Id = d.System_bkey

WHERE d.IsCurrent = 1 AND d.IsDeleted = 1

 

/* bkey's values is changed in source, that will be marked in dimension in current row and an

   extra row is inserted. */

UPDATE d

SET IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime

OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey INTO @ChangedRows

FROM dbo.Dim_System d

     INNER JOIN dbo.Source_System s ON s.Id = d.System_bkey

WHERE d.IsCurrent = 1 AND d.IsDeleted = 0 AND

   (ISNULL(s.Name,'¤') <> ISNULL(d.System, '¤') OR ISNULL(s.Version,0) <> ISNULL(d.Version, 0))

   -- d.dwh_Comparison != s.dwh_Comparison

 

INSERT INTO dbo.Dim_System

  (System_bkey, System, Version, System_dkey, System_pkey,

   ValidFrom, ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)

SELECT DISTINCT s.Id, s.Name, s.Version, c.System_dkey, c.System_key,

   @currentDateTime, @defaultValidTo, 1, 0, @currentDateTime, @currentDateTime

FROM dbo.Source_System s

     INNER JOIN @ChangedRows c ON c.System_bkey = s.Id

--WHERE NOT EXISTS(SELECT 1 FROM dbo.Dim_System d WHERE d.System_bkey = s.Id AND d.IsCurrent = 1)

--Because source table contains rows that has not been changed and must be skipped in the ETL.

               

SET @Changed_ROW_COUNT = @Changed_ROW_COUNT + @@ROWCOUNT

 

END

 

 

9. Lookup, Cache Connection Manager Transform, Data Viewer

In case the ETL process needs to do two or more lookups to the same table or sql statement by adding the same Lookup in several Data Flow Tasks, data will be reloaded, which can cost performance. Instead the data can be loaded one time, added to a cache, and multiple Lookups can use the same cache.

 

Lookup is a join between two tables in a one-to-many relationship, from a many-table to a one-table, for adding extra columns to the pipeline that is built on the many-table; a T-SQL sketch of that join follows the steps below.

 

  • In a 'Control Flow' drag in a 'Data Flow Task' for doing the Cache for a later Lookup.
  • Make an 'OLE DB Source' for data from the one-table used for the Lookup.
  • Drag in 'Cache Transform' and connect 'OLE DB Source' to it.
  • Edit 'Cache Transform', click [New] and in tab Columns set Index Position to 1 for the primary key that the Lookup will join to. For later maintenance click [Edit] to open the Cache Transformation Editor and go to tab Columns.
  • Notice that a 'Cache Connection Manager' is shown in 'Connection Managers' in the bottom of the package.
  • Back in 'Control Flow' drag in another 'Data Flow Task' and insert an 'OLE DB Source' for data from the many-table.
  • Drag in 'Lookup' and connect the pipeline to it.
  • Edit 'Lookup', select 'Cache connection manager' and at the Connection page choose the Cache from before.
  • Set 'Specify how to handle rows with no matching entries' to 'Ignore failure', else the Lookup gives a failure when a row from the pipeline does not exist in the cache.
  • At the Columns page join the foreign key from the pipeline with index 1 from the Cache (shown as a magnifying glass) and checkmark the other column to extend the pipeline with the extra column; maybe replace it or change Output Alias if you like another column name in the pipeline.
  • Normally we have an 'OLE DB Destination' to pump data into from the pipeline, but instead I make a variable called RowCount, drag in 'Row Count', connect the pipeline to it as 'Lookup Match Output' and add the variable.
  • I like to see the data in the pipeline in a 'Data Viewer' window while I run the package, so I rightclick the blue arrow from 'Lookup' to 'Row Count' and {Enable Data Viewer}. Of course this is only for development, so remember to rightclick and disable it again.
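As mentioned, the cached Lookup has the relational meaning of a left join from the many-table to the one-table. A minimal sketch with placeholder table and column names, not the package itself:

SELECT m.*,            -- the pipeline built on the many-table
       o.ExtraColumn   -- the column the Lookup adds to the pipeline
FROM dbo.ManyTable m
     LEFT OUTER JOIN dbo.OneTable o ON o.OneTableId = m.OneTableId
-- 'Ignore failure' corresponds to keeping rows with no match, where ExtraColumn becomes NULL.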

 

 

 

10. Kimball SCD type 7 dimension through DSA layer

It is nice to have a Data Staging Area (DSA) layer in the ETL process between the source data and a dimension like type 7 from Kimball. A very strict ETL flow first has all data ready at the DSA layer before starting to update and insert data in dimensions and facts. Instead of comparing columns to see if values have changed I make a checksum hashbytes column called dwh_Comparison that covers the compared columns, which are the normal columns and not the primary key (business key).

 

10.1 Using Lookup

For a new employee I let ValidFrom start from 1900-01-01 and for the current row I let ValidTo stop at 9999-12-31.

 

Dim_Employee
key  dkey  pkey  bkey  LastName   FirstName  dwh_Comparison  ValidFrom   ValidTo     IsCurrent
0    0     0     0     Missing    Missing    0x000000000000  1900-01-01  9999-12-31  1
1    1     0     123   Davolio    Nancy      0x8B186E741C4   1900-01-01  9999-12-31  1
2    2     0     124   Fuller     Andrew     0x43E8CE52E17   1900-01-01  9999-12-31  1
3    3     0     125   Leverling  Janet      0xC72A5851BBC   1900-01-01  9999-12-31  1
4    4     0     126   Peacock    Margaret   0x2EA9AFCDDD    1900-01-01  9999-12-31  1

 

From the source legacy system the ETL fetches employees as a full dump (for delta data see 10.3) with an Updated_Time column that I will use for ValidFrom and ValidTo, because the HR people are used to doing ad hoc queries in the source system. Therefore the source will give 5 employees:

 

Source_Employees
EmployeeID  LastName  FirstName  Updated_Time
123         Davolio   Nancy      2014-01-01
124         Fuller    Andrew     2014-01-01
125         Jones     Janet      2014-10-09
126         Fuller    Margaret   2014-10-09
127         Buchanan  Steven     2014-10-09

 

2 employees have changed last name (ID = 125 Janet and 126 Margaret) and 1 new employee (ID = 127 Steven) has started.

 

The ETL process to a Stage_Employee table in a self-explanatory graphical user interface, where I use a Lookup to divide into existing employees (match output) and new employees (no match output), and a Conditional Split to only continue with changed employees by comparing the values in dwh_Comparison. The changed rows will be added two times to the stage and later to the dimension because of the ValidFrom and ValidTo, therefore I use a Multicast to copy the pipeline and save the rows with different audit column values. The data flow from source to stage table:

 

Multicast makes a duplicate of the pipeline of changed rows, one for Update and one for Insert.

 

The first »2 rows« are from Janet and Margaret that have got new key numbers, the second »2 rows« are key 3 and 4 from the existing rows in the dimension, and the »1 rows« is from Steven that will be a new member in the dimension.

 

Stage_Employee (order by bkey + ValidFrom)
key  dkey  pkey  bkey  LastName  FirstName  dwh_Comparison  ValidFrom   ValidTo     IsCurrent
3    3     0     125   NULL      NULL       NULL            NULL        2014-10-09  0
6    6     3     125   Jones     Janet      0xC099E645E83   2014-10-09  9999-12-31  1
4    4     0     126   NULL      NULL       NULL            NULL        2014-10-09  0
7    7     4     126   Fuller    Margaret   0x3C09C8DEA7D   2014-10-09  9999-12-31  1
5    5     0     127   Buchanan  Steven     0x14226BE0760   1900-01-01  9999-12-31  1

 

The key 3 and 4 rows are for updating the same rows in the dimension where ValidTo and IsCurrent will be changed, therefore there is no need for names in these rows. The same business keys are repeated in the key 6 and 7 rows for insert into the dimension with the new names. Key 5 is the new employee ready to be inserted into the dimension.

 

So the SSIS package has made the staging data ready to be launched into the dimension, and that should be done in another SSIS package that can do it in parallel for all dimensions and facts from DSA to DMA, a data mart. 2 rows are updated with a new ValidTo from the source with 2014-10-09 and 3 new rows are added because there was a new employee too:

 

 

Dim_Employee (order by bkey + ValidFrom)
key  dkey  pkey  bkey  LastName   FirstName  dwh_Comparison  ValidFrom   ValidTo     IsCurrent
0    0     0     0     Missing    Missing    0x000000000000  1900-01-01  9999-12-31  1
1    1     0     123   Davolio    Nancy      0x8B186E741C4   1900-01-01  9999-12-31  1
2    2     0     124   Fuller     Andrew     0x43E8CE52E17   1900-01-01  9999-12-31  1
3    3     0     125   Leverling  Janet      0xC72A5851BBC   1900-01-01  2014-10-09  0
6    3     3     125   Jones      Janet      0xC099E645E83   2014-10-09  9999-12-31  1
4    4     0     126   Peacock    Margaret   0x2EA9AFCDDD    1900-01-01  2014-10-09  0
7    4     4     126   Fuller     Margaret   0x3C09C8DEA7D   2014-10-09  9999-12-31  1
5    5     0     127   Buchanan   Steven     0x14226BE0760   1900-01-01  9999-12-31  1

 

Table Stage_Employee where some columns allow null values because these values are not needed for updating ValidTo in the dimension table, and _key is an Identity to create the new key numbers for the dimension:

 

CREATE TABLE [dbo].[Stage_Employee](

  [Employee_key] [int] IDENTITY(1,1) NOT NULL,

  [Employee_dkey] [int] NOT NULL,

  [Employee_pkey] [int] NOT NULL,

  [Employee_bkey] [int] NOT NULL,

  [LastName] [nvarchar](20) NULL,

  [FirstName] [nvarchar](10) NULL,

  [dwh_Comparison] [binary](32) NULL,

  [ValidFrom] [datetime] NULL,

  [ValidTo] [datetime] NOT NULL,

  [IsCurrent] [bit] NOT NULL,

  [InsertTime] [datetime] NOT NULL CONSTRAINT [DF_Stage_Employee_InsertTime]

                                   DEFAULT (getdate()),

  CONSTRAINT [PK_Stage_Employee] PRIMARY KEY CLUSTERED ([Employee_key] ASC)

) ON [PRIMARY]

GO

 

Employee_key is an Identity because it will calculate the new numbers for the new rows, and later the new rows will be inserted into dimension table Dim_Employee.

 

Insert the Missing row into the dimension table:

TRUNCATE TABLE dbo.Dim_Employee

-- SET IDENTITY_INSERT dbo.Dim_Employee ON

INSERT INTO dbo.Dim_Employee

(Employee_key, Employee_dkey, Employee_pkey, Employee_bkey, LastName, FirstName,

 dwh_Comparison, ValidFrom, ValidTo, IsCurrent)

VALUES

(0, 0, 0, 0, 'Missing', 'Missing',

 0x0000000000000000000000000000000000000000000000000000000000000000,

 '1900-01-01', '9999-12-31', 1)

-- SET IDENTITY_INSERT dbo.Dim_Employee OFF

GO

 

When I have a DSA layer and a staging table, there is no need to have an Identity in dimension table Dim_Employee. The hash column dwh_Comparison uses HASHBYTES('SHA2_256') that becomes data type binary(32) (or as text nvarchar(66)), and I just give the Missing row a default value that never will come from the source system.
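A quick check of the SHA2_256 length, a small sketch:

SELECT DATALENGTH(HASHBYTES('SHA2_256', N'NancyDavolio'))               AS HashLength,  -- 32 bytes
       CONVERT(nvarchar(66), HASHBYTES('SHA2_256', N'NancyDavolio'), 1) AS HashAsText   -- '0x' + 64 hex characters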

 

ETL for staging table into DSA layer

 

The Truncate Stage_Employee task empties the staging table and copies the max key number value from the dimension table, so new rows inserted into the staging table will get new identity numbers that later can be bulk inserted into the dimension:

 

TRUNCATE TABLE dbo.Stage_Employee

DECLARE @maxKey int

SELECT  @maxKey = ISNULL(MAX(Employee_key),0) + 1

FROM    dbo.Dim_Employee

DBCC CHECKIDENT (Stage_Employee, RESEED, @maxKey)

 

The Source_Employees source makes a comparison column for the columns to compare with the dimension, here FirstName and LastName are concatenated:

 

SELECT

      EmployeeID,

      LastName,

      FirstName,

      Updated_Time,

      dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(FirstName, LastName)) AS binary(32))

FROM  dbo.Source_Employees

 

The cast makes sure the default data type varbinary(8000) from HASHBYTES is not used, because that would take a lot of memory in the pipeline.

CONCAT takes care of a null value in a column, otherwise use ISNULL(FirstName, '').
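A small sketch showing the difference; a separator between the columns, e.g. CONCAT(FirstName, N'|', LastName), can also avoid two values running together into the same hash:

SELECT CONCAT(CAST(NULL AS nvarchar(20)), N'Fuller')              AS WithConcat,    -- N'Fuller'
       CAST(NULL AS nvarchar(20)) + N'Fuller'                     AS WithPlus,      -- NULL
       ISNULL(CAST(NULL AS nvarchar(20)), N'') + N'Fuller'        AS WithIsnull,    -- N'Fuller'
       HASHBYTES('SHA2_256', CONCAT(N'Nancy', N'|', N'Davolio'))  AS WithSeparator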

 

The Dim_Employee makes a lookup to the dimension to fetch the key columns and the comparison column for the current dimension rows:

 

SELECT

      Employee_key,

      Employee_dkey,

      Employee_pkey,

      Employee_bkey,

      dwh_Comparison

FROM  dbo.Dim_Employee

WHERE IsCurrent = 1

 

 

The Lookup No Match Output means that the business key from source EmployeeID does not exist in the dimension. Derived Column extends the pipeline with extra column values that will be inserted into the staging table Stage_Employee as new rows with new identity number values in the _key column:

 

 

The Lookup Match Output means that the business key from source EmployeeID does exist in the dimension, but I only continue with the rows where the source has changed FirstName or LastName, by comparing the comparison columns from source and dimension (the latter included in the Lookup) in the Find changed rows Conditional Split with the criterion that they are not equal:

 

 

A changed row, meaning the source has updated FirstName or LastName, will result in two rows in the staging table, one row for updating ValidTo/IsCurrent=0 and one row for inserting a new row with ValidFrom/IsCurrent=1, therefore a Multicast doubles the pipeline. Derived Column extends the pipeline with extra column values that will be inserted into the staging table Stage_Employee for the existing rows but keeps the identity number value of the _key column for doing a sql update later by joining the _key columns, and ValidTo gets its value from Updated_Time:

 

 

Stage_Employee changed rows get new identity number values in _key, the _dkey gets its value from the dimension through the Lookup above, and the _pkey gets the value from the dimension _key because _pkey is a reference back to the previous row in the dimension; it is made as a column copy in Derived Column, and ValidFrom gets its value from Updated_Time:

 

 

The Stage_Employee set durable key task updates the _dkey for the new rows:

 

UPDATE dbo.Stage_Employee WITH(TABLOCK)

SET    Employee_dkey = Employee_key

WHERE  IsCurrent = 1 AND Employee_dkey = 0

 

ETL from staging table in DSA layer into the dimension

 

The Update Dim_Employee to set ValidTo/IsCurrent for the old current row:

 

UPDATE d

SET    ValidTo    = s.ValidTo,

       IsCurrent  = s.IsCurrent,

       UpdateTime = Getdate()

FROM dbo.Dim_Employee d WITH(TABLOCK)

     INNER JOIN dbo.Stage_Employee s WITH(TABLOCK) ON s.Employee_key = d.Employee_key

WHERE s.IsCurrent = 0

 

The Insert Dim_Employee inserts the new rows with ValidFrom/IsCurrent=1 from Stage_Employee:

 

SELECT

  Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,

  LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent

FROM dbo.Stage_Employee

WHERE IsCurrent = 1

 

Into table Dim_Employee:

 

 

A SSIS package for updating and inserting data from staging tables to dimensions and fact tables can use BEGIN TRANSACTION and COMMIT TRANSACTION, or ROLLBACK TRANSACTION on failure, to secure that either all will be loaded or nothing will happen.
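A minimal T-SQL sketch of such an all-or-nothing load, for example inside a stored procedure called by the package (with separate Execute SQL Tasks the OLE DB connection would need RetainSameConnection = True to keep the transaction across the tasks):

BEGIN TRY
    BEGIN TRANSACTION
    -- the updates of old current rows and the inserts of new rows from the staging tables go here
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW  -- rethrow so the package fails and the error gets logged
END CATCH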

 

10.2 Not using Lookup

I think it is important to make a SSIS package in a self-explanatory graphical user interface way where the Dim_Employee lookup and the two pipelines tell a lot. But with many rows in a dimension a Lookup is slow and uses a lot of memory, therefore I can drop the Lookup, and inside the 'OLE DB Source' Source_Employees I make a sql that does a kind of lookup through a Left Outer Join, and it will perform inside the database at SQL Server:

 

;WITH source AS

(

 SELECT EmployeeID, LastName, FirstName, Updated_Time,

         dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(FirstName, LastName)) AS binary(32))

 FROM dbo.Source_Employees WITH(TABLOCK)

)

SELECT s.EmployeeID, s.LastName, s.FirstName, s.Updated_Time, s.dwh_Comparison,

       d.Employee_key, d.Employee_dkey, d.Employee_pkey

FROM source s

   LEFT OUTER JOIN dbo.Dim_Employee d WITH(TABLOCK) ON d.Employee_bkey = s.EmployeeID

WHERE d.dwh_Comparison IS NULL OR

      (d.IsCurrent = 1 AND d.dwh_Comparison <> s.dwh_Comparison)

 

  • Left outer join + d.dwh_Comparison IS NULL fetches new rows from the source with a business key that does not exist in the dimension, so all d. columns have null values.

  • d.IsCurrent = 1 AND d.dwh_Comparison <> s.dwh_Comparison fetches changed rows from the source that do not match the current rows in the dimension.

 

A Conditional Split has an output called New for ISNULL(Employee_key) and the default output is called Changed. The New pipeline corresponds to the 'Lookup No Match Output' from before and the Changed pipeline goes to the 'Multicast'. The data flow from source to stage table:

 

Multicast makes a duplicate of the pipeline of changed rows, one for Update and one for Insert.

 

Here only 3 rows come out of Source_Employees because 1 is new and 2 are changed, while the first data flow showed 5 rows from the source. There are always pros and cons between a self-explanatory graphical user interface and performance.

 

10.3 Source data into an Archive with repeating members

The data warehouse fetches source data and adds it all to the archive, also when the source gives revisions of rows or rows that seem not to have been changed. I call it repeating members, e.g. Fullor in three rows:

 

Archive_Employee
EmployeeID  LastName  FirstName  Updated_Time         ARC_INSERT_TIME      ARC_RECORD_ID
123         Davolio   Nancy      2016-01-01 10:00:00  2016-01-02 02:38:03  34521
124         Fuller    Andrew     2016-01-01 11:00:00  2016-01-02 02:38:03  34522
124         Fullar    Andrew     2016-01-01 12:00:00  2016-01-02 02:38:03  34523
124         Fullor    Andrew     2016-01-01 13:00:00  2016-01-02 02:38:03  34524
124         Fullor    Andrew     2016-01-01 14:00:00  2016-01-02 02:38:03  34525
124         Fullor    Andrew     2016-01-01 15:00:00  2016-01-02 02:38:03  34526
124         Fullar    Andrew     2016-01-01 16:00:00  2016-01-02 02:38:03  34527
124         Fuller    Andrew     2016-01-01 17:00:00  2016-01-02 02:38:03  34528
127         Buchanan  Steven     2016-01-01 17:00:00  2016-01-02 02:38:03  34529
127         Buchanan  Stephen    2016-01-01 20:00:00  2016-01-02 02:38:03  34530

 

In the dimension Andrew’s name is Fuller, but above there is a revision row with an unchanged name, later several rows with changed names, and the last change is back to the original name Fuller. Buchanan is a new employee and the first row has a misspelling of his first name.

 

»End Of The Day« for each data revision in the Archive is easy by selecting the max time of Updated_Time from the source, but in this approach I will respect all the changes in the source with these rules:

 

  • When a member is unchanged it will be skipped, e.g. ID 123 has the same name as the current member in the dimension, and also the first row with Fuller.
  • When a member is repeated in multiple contiguous rows with identical data except the time, they will be smashed or collapsed into one row, e.g. Fullor, with a time span of Updated_Time.

 

The result for a stage table in DSA where ValidFrom and ValidTo are smashed and a null key will be set by Identity:

key   dkey  pkey  bkey  LastName  FirstName  dwh_Comparison  ValidFrom            ValidTo              IsCurrent
2     2     0     124   NULL      NULL       NULL            NULL                 2016-01-01 12:00:00  0
NULL  2     0     124   Fullar    Andrew     0x130A89EB575   2016-01-01 12:00:00  2016-01-01 13:00:00  0
NULL  2     0     124   Fullor    Andrew     0xB2829725711   2016-01-01 13:00:00  2016-01-01 16:00:00  0
NULL  2     0     124   Fullar    Andrew     0x130A89EB575   2016-01-01 16:00:00  2016-01-01 17:00:00  0
NULL  2     0     124   Fuller    Andrew     0x43E8CE52E17   2016-01-01 17:00:00  9999-12-31 00:00:00  1
NULL  0     0     127   Buchanan  Steven     0x03A97F4DBE3   2016-01-01 17:00:00  2016-01-01 20:00:00  0
NULL  0     0     127   Buchanan  Stephen    0x903346CB324   2016-01-01 20:00:00  9999-12-31 00:00:00  1

 

  • When ValidFrom is null, it means the row will update an existing current row in the dimension by setting ValidTo and IsCurrent with the values from the stage table.
  • When ValidFrom is not null and dkey <> 0, it means the row is a changed member to be inserted into the dimension.
  • When ValidFrom is not null and dkey = 0, it means the row is a new member to be inserted into the dimension (a sketch of the three cases follows this list).
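A minimal sketch of how the three rules could be applied from the stage table to the dimension, following the same pattern as in 10.1 and only meant for orientation:

-- Rows with ValidFrom null close the existing current row in the dimension.
UPDATE d
SET    ValidTo = s.ValidTo, IsCurrent = s.IsCurrent, UpdateTime = Getdate()
FROM dbo.Dim_Employee d
     INNER JOIN dbo.Stage_Employee s ON s.Employee_key = d.Employee_key
WHERE s.ValidFrom IS NULL

-- Rows with ValidFrom not null are inserted as changed members (dkey <> 0) or new members (dkey = 0);
-- for the new members the durable key is afterwards set to the key as in 10.1.
INSERT INTO dbo.Dim_Employee
  (Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,
   LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent)
SELECT Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,
       LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent
FROM dbo.Stage_Employee
WHERE ValidFrom IS NOT NULL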

 

The SQL that does the magic, where dwh_Sortorder takes care of the last row of Fuller, dwh_Groupnumber takes care of the double change of Fullar, and the smashing of the Fullor rows is done by using Min and Max:

 

;WITH source AS

(

 SELECT

  EmployeeID, LastName, FirstName, Updated_Time,

  dwh_Sortorder = ROW_NUMBER() OVER(PARTITION BY EmployeeID ORDER BY Updated_Time),

  dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(FirstName, LastName)) AS binary(32))

 FROM ARC.dbo.Archive_Employee WITH(TABLOCK)

),

datarow

AS

(

 SELECT

  s.LastName, s.FirstName, s.dwh_Comparison,

  d.Employee_key, d.Employee_dkey, d.Employee_pkey, Employee_bkey = s.EmployeeID,

  ValidFrom = s.Updated_Time,

  ValidTo = LEAD(s.Updated_Time, 1, '9999-12-31') OVER(PARTITION BY s.EmployeeID

                                                       ORDER BY s.Updated_Time),

  dwh_Groupnumber = ( ROW_NUMBER() OVER(ORDER BY s.EmployeeID, s.Updated_Time) ) -

                    ( ROW_NUMBER() OVER(PARTITION BY s.EmployeeID, s.dwh_Comparison

                                        ORDER BY s.Updated_Time) )

 FROM source s

      LEFT OUTER JOIN dbo.Dim_Employee d WITH(TABLOCK) ON d.Employee_bkey = s.EmployeeID

 WHERE d.dwh_Comparison IS NULL OR (d.IsCurrent = 1 AND (d.dwh_Comparison <> s.dwh_Comparison OR