Tips for SSIS SQL Server Integration Services by Joakim Dalby

1. Introduction

Best practices for making SSIS Integration Services packages, in the form of some tips shown by examples. In my article about SQL Server 2014 I describe deployment methods for SSIS projects and packages on different servers using a parameter for the connection string and an Environment variable inside the SSISDB catalog; see chapters 3 and 8 of SQL Server 2014. A solution in SQL Server Data Tools for Visual Studio 2013 can contain several projects for the different data layers in a data warehouse: SSIS packages, SSRS reports, SSAS OLAP cubes, SSAS tabular cubes, export to Excel files and so on. It is almost like a .NET solution with projects that become different dll files, where the dll files normally belong together and are called from the same exe file. If you like to show the solution node in Solution Explorer: Tools, Options, Projects and Solutions and checkmark 'Always show solution'. I recommend an SSIS project per data layer in a data warehouse, e.g. one for extraction from the source legacy systems, one for the archive or ODS, one for the DSA, one for the EDW and one for each data mart. The SSIS project properties are shown from the Solution Explorer pane by right-clicking the project name and selecting {Properties}. I always set ProtectionLevel to DontSaveSensitive so other developers can access the SSIS package; it is then not dependent on the creator's user account or a password. When a package
is saved, sensitive password values will be removed. This will result in
passwords needing to be supplied to the package through a configuration file
or by the user. All new SSIS packages inherit the protection level. Please remember to consider the MaxConcurrentExecutables property of the SSIS package; read more at the links Data Flow Performance Features and Performance Tuning Techniques. When an SSIS package runs on a
dedicated server and there are a lot of operations that run in parallel, it
can be beneficial to increase this setting if some of the operations
(threads) do a lot of waiting for external systems to reply. A Data Flow Task
has properties EngineThreads, DefaultMaxBufferRows and DefaultMaxBufferSize. When an OLE DB Source fetches data from a view in a database, use the SQL command access mode with Select * From <view> rather than selecting the view in 'Table or view' mode; that will give much better performance. An SSIS project has Connection Managers where all connection strings to files and databases should be placed, so they can be reused in all the SSIS packages in the project. Start by right-clicking {New Connection Manager}; for a connection to a SQL Server database select OLEDB, click [Add] and [New], type in the SQL Server name, normally use Windows Authentication, and select the database in the dropdown. The OLEDB data provider is Native OLE DB\SQL Server Native Client 11.0. This
connection is used in Control Flow at Execute SQL Task and in Data Flow Task
at OLE DB Source, OLE DB Destination and Lookup. The ADO.NET connection manager uses the type .Net Providers\SqlClient Data Provider. This connection is used in Execute T-SQL Statement Task, Execute SQL Task and when a Script Component wants to read directly from a SQL Server via SqlCommand. A connection manager used inside an SSIS package has a property Expression; for a flat file whose name includes a date yyyy-mm-dd, it can hold an expression that assigns the right filename at runtime for a dynamic connection like this: "C:\\File\\Customer"+LEFT((DT_WSTR,30)GETDATE(),10)+".csv"

Show SSIS Toolbox

In the
menu bar select View, Other Windows, SSIS Toolbox and you have it in the left side of your screen showing the tools or components, depending on whether you are in a Control Flow or a Data Flow canvas. To delete a row in a component like 'Derived Column' or 'Union All', first left-click another row, then place the mouse over the row you want to delete, right-click and select {Delete} in the menu.

ID of package

Make
sure the package ID guid number is unique if you are using a template or have copied a package: go to the property ID, click the dropdown and select Generate New ID. Get the ID with the system variable System::PackageID; there is also a System::PackageName.

Execute and format an SSIS package

Normally
I run an SSIS package from design mode by clicking the Start button, but sometimes that will evaluate and syntax check all files in the project and in the solution. To avoid that it is better to close all tabs (or documents, as SQL Server Data Tools for Visual Studio 2013 calls them), right-click the package and choose {Execute Package} like this:

Layout of an SSIS package

In case you like to organize the layout of an SSIS package, Format can help:

2. Control Flow and Data Flow components

An
SSIS package has a 'Control Flow' canvas
or design surface which is using 'Task' components to display the steps
of a process and is the task workflow engine that coordinates the business process
flow logic for a package. The 'Data Flow' is the data processing engine that
handles data movement, transformation logic, data organization, and the
extraction and commitment of the data to and from sources and destinations.
Some of the components and their usage:
"N'"+@[$Project::BackupDirectory]+"\\Backup_"+@[User::Name]
+".bak'"
@[User::Counter]
= @[User::Counter] + 1

Only one package variable can be assigned a value in an 'Expression Task', so with multiple variables I need to make multiple 'Expression Task's or instead make a 'Script Task', which can assign multiple variables in a C# program. Remember to check whether the variable can be null. Examples with the three types of variable:

MessageBox.Show(Dts.Variables["User::PackageVariable"].Value.ToString());
MessageBox.Show(Dts.Variables["$Project::ProjectParameter"].Value.ToString());
MessageBox.Show(Dts.Variables["$Package::PackageParameter"].Value.ToString());

A Script Task
can do calculations with programming or build a dynamic sql statement into a variable and use it in an 'Execute SQL Task' to store a value in a table with a sql Insert or Update. In 'Execute SQL Task' set SQLSourceType to Variable and at SourceVariable pick a package variable like User::TSQLSaveCalculation when it holds a dynamically built sql statement. When an OLE DB Source uses a sql statement with ORDER BY, right-click the source, select {Show Advanced Editor}, go to the tab 'Input and Output Properties', click 'OLE DB Source Output' and set the property IsSorted to True. Then open 'OLE DB Source Output', open 'Output Columns', select the first column from the ORDER BY and set the property SortKeyPosition to 1. Continue with the next column from the ORDER BY, which is set to 2, and so on. When a column is sorted DESC, SortKeyPosition gets a minus in front as a negative value. Read more. When the data source can't do ORDER BY, like a flat file, but data is delivered sorted, use SortKeyPosition on the columns that are actually sorted. The 'Sort' transformation can also remove rows with duplicate sort key values from a flat file. From
an OLE DB Source make a Script Component of type Transformation to filter duplicates in C#, and remember at the page Inputs and Outputs at Output 0 to set ExclusionGroup from 0 to 1. Filter on CustomerName:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Collections.Generic;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
  public HashSet<string> Customers = new HashSet<string>(); // or HashSet<int>.

  public override void Input0_ProcessInputRow(Input0Buffer Row)
  {
    if (Row.CustomerName_IsNull) return;
    if (!Customers.Contains(Row.CustomerName))
    {
      Customers.Add(Row.CustomerName); // only unique values are added.
      Row.DirectRowToOutput0();
    }
  }
}
OLE
DB Destination has a property FastLoadOptions that normally shows TABLOCK,
CHECK_CONSTRAINTS. Four extra options can be added there:

ROWS_PER_BATCH = <number> : <number> indicates that the destination is configured to use batches of rows.

KILOBYTES_PER_BATCH = <number> : <number> specifies the size in kilobytes to insert.

FIRE_TRIGGERS : specifies whether triggers fire on the insert table because of the fast load/bulk insert.

ORDER(<columnname> ASC, <columnname> DESC) : specifies how the output data is sorted. If the sort order is omitted, the insert operation assumes the data is unsorted. Performance can be improved if you use the ORDER option to sort the output data according to the clustered index on the table. For clarity, the ORDER hint needs to exactly match the clustering keys of the target table. ORDER is especially good when the OLE DB Source is a flat file. E.g.: TABLOCK, ORDER(Id) where Id is a clustered index.

Remember to set the database property Recovery model to Simple. Remember also to look at indexes to get faster load performance like this:

* Disable/drop non-clustered indexes before the bulk load.
* Rebuild/create non-clustered indexes after the bulk load.
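A hedged T-SQL sketch of these two tips, assuming a data warehouse database named MyDW and a non-clustered index IX_Stage_OrderDetails_OrderId on a load table dbo.Stage_OrderDetails (both names are hypothetical):

ALTER DATABASE [MyDW] SET RECOVERY SIMPLE;

ALTER INDEX [IX_Stage_OrderDetails_OrderId] ON dbo.Stage_OrderDetails DISABLE;
-- run the bulk load / Data Flow Task with fast load here
ALTER INDEX [IX_Stage_OrderDetails_OrderId] ON dbo.Stage_OrderDetails REBUILD;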
3. Data type and Expression

Data Conversion does not use the SQL Server database table data types; SSIS package variables and SSIS project parameters have their own data types. The mapping is:
DT_WSTR(<length>) unicode
string, length of characters, for nvarchar or nchar. DT_STR(<length, codepage>) ansi
string with codepage 1252 for varchar or char. When you have a nvarchar column in a
source table and you want only the first three letters and save it in a
varchar column destination table, do this in a Derived Column to cast the
value to codepage 1252 = ANSI = varchar data type: (DT_STR,3,1252)SUBSTRING(<columnname>,1,3)

DT_NUMERIC(<precision, scale>) : precision tells the maximum number of digits in the value both left and right of the decimal point, scale tells the number of digits right of the decimal point. DT_NUMERIC(5,2) is at most 999.99. Read more about Integration Services Data Types. (DT_DBTIME2, «scale») like (DT_DBTIME2,2)TransactionDateTime can have between 0 and 7 digits specified for fractional seconds. Variable names are case-sensitive like
in C#. Using a variable in C# script programming: Dts.Variables["User::Number"].Value;

Using a variable in an Expression: @[User::Name]

MONTH(@[User::Birthdate]) == 5 ? "May" : "Another month"

Using a parameter in an Expression: @[$Project::BackupDirectory]

Using a system variable in an Expression: @[System::PackageName]

Always cast a value

An Expression that casts or converts an integer value so it can be concatenated to a string: OrderDate is a datetime; cast it to
a dbdate and to a date that has time 00:00:00: (DT_DATE)(DT_DBDATE)OrderDate

Get OrderDate as the first day of the month:

(DT_DATE)((DT_WSTR,4)YEAR(OrderDate) + "-" + (DT_WSTR,2)MONTH(OrderDate) + "-01")

(DT_WSTR,4) is the C#-style cast syntax in an SSIS expression for casting a value to nvarchar(4).

Get the date from a number of days ago as a 10-character string:

(DT_WSTR,10)(DT_DATE)DATEADD("dd", -@[$Package::DaysAgo], GETDATE())

Make a filename with the number of the current month:

"C:\\File\\Data_" + RIGHT("0" + (DT_WSTR,2)MONTH(GETDATE()), 2) + ".txt"

(DT_WSTR,2) casts a value to nvarchar(2). (DT_DECIMAL,2) casts a string to a decimal value with 2 digits after the decimal point.

Get Filename from column PathFileName:

RIGHT([PathFileName], FINDSTRING(REVERSE([PathFileName]), "\\", 1) - 1)

Get Folderpath from column PathFileName:

SUBSTRING([PathFileName], 1, LEN([PathFileName]) - FINDSTRING(REVERSE([PathFileName]), "\\", 1) + 1)

Get Foldername from column PathFileName:

TOKEN([PathFileName], "\\", TOKENCOUNT([PathFileName], "\\") - 1)

Guid variable with data type Object:

Dts.Variables["User::Guid"].Value = "{" + Guid.NewGuid().ToString() + "}";

Expressions are useful in Derived
Column but can also be used in a Send Mail Task for the property Subject like "ETL done at: " + (DT_WSTR,30)@[System::StartTime] + ".", or in an Execute SQL Task at the property SqlStatementSource to make a dynamic sql statement; but be careful, because it can be hard to read, understand and maintain. I think it is better to use a variable for dynamic sql so I can show it in a messagebox and check syntax and correctness before it is sent to the database server to pull or fetch data into a resultset inside an Execute SQL Task or into a 'Source'; SSIS 2014 also prefers a variable. I can make an expression at the Data Flow Task to Disable it, in case a previous step has put a pipeline Row Count into a variable: @[User::NOBills] == 0, or at the Precedence constraint green arrow to continue when @[User::NOBills] > 0, and an fx logo is shown. Some expression operators and functions:
SUBSTRING,
UPPER, LOWER, TRIM, LEN, DATEDIFF, CEILING for round up, FLOOR for round
down, REPLACENULL(<columnname>, 0) to replace null with zero. A VB example in a Script Component that builds up a large string with errors:

Imports System.Text

Public Class ScriptMain
  Inherits UserComponent

  Dim errorMessages As StringBuilder = New StringBuilder()

  Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    errorMessages.AppendLine(String.Format("Key value {0} not found in the reference table.", Row.KEY))
  End Sub
End Class

Unnatural NULL behaviour
ISNULL(SalesPersonName) ? "Unknown" : SalesPersonName is better written as: REPLACENULL(SalesPersonName, "Unknown") (or use "N/A").

REPLACENULL(SalesAmount, (DT_CY)0)

REPLACENULL(SalesDate, (DT_DATE)"1900-01-01")

In case of a null value or a zero-length empty string value:

(ISNULL(SalesPersonName) || TRIM(SalesPersonName) == "") ? "Unknown" : TRIM(SalesPersonName)

Return a null value cast to the right data type:

NULL(DT_STR,4,1252) to a varchar(4) column
NULL(DT_WSTR,3) to a nvarchar(3) column
NULL(DT_DATE) to datetime with 00:00:00.000
NULL(DT_DBTIMESTAMP)
NULL(DT_BOOL) to bit, also called boolean
NULL(DT_I4) to int column
NULL(DT_I2) to smallint column
NULL(DT_DECIMAL,2) to decimal
NULL(DT_NUMERIC,3,2)
NULL(DT_CY) to money

Compare two columns that might have a Null in a Conditional Split

I always replace Null with the character ¤ because it is very rare that a text column has only this value. Then I can compare columns that have a checkmark in 'Allow Nulls'. For nvarchar I can replace Null with ¤
because columns are already a string: REPLACENULL(s_Name,"¤") !=
REPLACENULL(d_Name,"¤") For int, bigint, tinyint, smallint,
decimal, float, real, numeric, money, and date, datetime and time, I first
cast column to a string with 30 characters: REPLACENULL((DT_WSTR,30)s_Number,"¤")
!= REPLACENULL((DT_WSTR,30)d_Number,"¤") A CustomerNo is a bigint and when it is
null I want -1 instead of null: REPLACENULL(CustomerNo, (DT_I8)-1) or (DT_I8)REPLACENULL(CustomerNo,-1). If an OrderDate column in the pipeline is
null then change it to January 1, 1900: REPLACENULL(OrderDate, (DT_DATE)"1900-01-01") If an OrderDate datetime column with time
00:00:00.000 is this date 2016-12-31, I cast the date with (DT_DATE) to make
it 2016-12-31 00:00:00.000 so the two dates can be compared, then give true
else false: OrderDate == (DT_DATE)"2016-12-31"
? (DT_BOOL)1 : (DT_BOOL)0 If a ValidTo date column with no time I
cast the date with (DT_DBDATE): (DT_DBDATE)ValidTo
== (DT_DBDATE)"9999-12-31" ? (DT_BOOL)1 : (DT_BOOL)0 If divide by zero then let the result be
0 else calculate the ratio: REPLACENULL(Quantity,0) == 0 ? 0.0 :
Price / Quantity Other string functions Replace in a string or if integer, cast
first: REPLACE((DT_WSTR,8)NumberOfDays,"6","5") Replace zero (0) with NULL: REPLACE((DT_WSTR,8)<IntegerColumn>,"0",NULL(DT_I4)) Convert a blank string into NULL in a 'Derived Column' as
varchar: TRIM(<TextColumn>) == ""
? (DT_STR, 4, 1252)NULL(DT_STR, 4, 1252) : <TextColumn> Replace an integer column with NULL if
the value is negative: (<IntegerColumn> < 0) ?
NULL(DT_I4) : <IntegerColumn> Replace a double column with NULL if the
value is negative: (<DoubleColumn> < 0) ?
NULL(DT_NUMERIC,3,2) : <DoubleColumn> Replace a date column with NULL if the
value is »in the future«: (<DateColumn> > GETDATE()) ?
NULL(DT_DATE) : <DateColumn> LEFT(<string>,
<number-of-chars>) A shorthand way of writing SUBSTRING(<string>,
1, <number-of-chars>) TOKEN(<string>, <delimiter>,
N) Returns the Nth token in <string> when it is split by <delimiter> TOKENCOUNT(<string>,
<delimiter>) Returns the total number of tokens in <string> as
determined by <delimiter> TOKEN(,,) will be particularly useful
when manually parsing columns from a flat file. When used in conjunction with TOKENCOUNT
it can be used to return the last token in a string like so: TOKEN(<TextColumn>, ",", TOKENCOUNT(<TextColumn>,
",") ) Compare
with a string, use " around the text, like @[User::Name] == "Paris" in a 'Conditional Split'. Use [ ] around a column name that contains a space or starts with a digit. In an ETL process for making fact data rows, developed in an SSIS package, the column
Time_key will be created as a derived column to the pipeline, expression: (DT_I2)DATEPART("hh",TransactionDatetime) * 100 +
DATEPART("mi",TransactionDatetime)

Inside the Data Flow Task the Derived Column does the manipulation of data

When data from an OLE DB Source goes to a pipeline, the columns can be derived to new columns with new column names that exist in the table of the OLE DB Destination, or the column values can be replaced by a cleansing rule expression together with the ? condition (if-then-else) and a cast to get the right data type and length of text. Here are some examples:

Expression:
'Discount != 0 ? (DT_BOOL)1 : (DT_BOOL)0' means when the pipeline column Discount
has a value different from 0, a new column in the pipeline called
DiscountStatus with data type bit (DT_BOOL in SSIS) is assigned value 1 = true
else value 0 = false and saved into a column in the table of 'OLE DB
Destination'. Expression:
'Quantity > 0 && Quantity < 50 ? Quantity : NULL(DT_I2)' means when
the pipeline column Quantity has a value between 1 and 49, the value will be
passed through, but if the column value is less than or equal to 0 or greater
than or equal to 50 it will be replaced with a null value, because these values are wrong in the legacy source system and can give a wrong summary and average in data warehouse reporting. The
new columns NumberOfAbroad and NumberOfDomestic will get value 1 or 0 depending on the business logic rule of the column Place, and since a sql statement, a column in a report or a measure in a cube will do a summary and total of NumberOf, it is important that the derived column has the right data type to cover the range of the total value; therefore I cast the values 1 and 0 to DT_I4, which is the int data type in the table of the OLE DB Destination. The
SSIS package has an Int32 variable called Batchnumber and the variable has
been assigned a value in the 'Control Flow'. The variable value can be added
to the pipeline inside the 'Derived Column Transformation' with this syntax
notation: @[User::Batchnumber] and saved into a column in the table of 'OLE
DB Destination'. By editing the 'OLE DB Destination' I can click the New button and type in a good table name to be created automatically when I click OK, so I change the [OLE DB Destination] to a table name like [Stage_OrderDetails], and the pipeline will map the columns from the source table and from the derived columns to the new destination table columns: The new table will allow null in all columns, and that can be good for a source input table, therefore there is normally no need for a checkmark at 'Check constraints' above.

Cast
or convert Danish letters from code page 865 DOS ASCII to code page 1252 Windows ANSI to a varchar column

The error "cannot be processed because more than one code page" means that data in the OLE DB Destination is using a different code page than 1252. Here is an example to cast or convert between code pages: I use (DT_STR,50,1252) as a cast in front of the column name in a derived column to make a new column in the pipeline, like this, where the column name is »by«, meaning city: (DT_STR,50,1252)by. When the column is unicode with nvarchar(50), use (DT_WSTR,50)by instead.

4. Execute SQL Task to assign a variable to use in Control Flow

A SQL statement will query a
database table and will only return one row, and the two columns will be assigned to two SSIS package variables:

SELECT StatusId, StatusDateTime -- int and datetime
FROM dbo.STATUS
WHERE Status = 0

In an SSIS package in Control Flow I right-click the canvas and select {Variables}, and in the window I click the small left side button [Add Variable] to insert a new row and type in the variable name and data type, where sql int is a package Int32. Since my sql statement has two columns I also make two variables with a default value, but the default value will be overridden by the value from the sql statement: In Control Flow I drag in an 'Execute
SQL Task' and I edit it and choose a Connection and type in the sql statement
at the property SQLStatement. I set property ResultSet to 'Single row' and
at the page 'Result Set', I click the [Add] button and override 'NewResultName'
with the first column name or just 0 for first column and 1 for second
column. I also choose the right package variable: To test if the assignment of the two
variables has been done, I drag in a 'Script Task'. I edit the 'Script Task' and at
property ReadOnlyVariables I click the […] and checkmark the two package
user variables I like to use later inside the script. I click at [Edit Script]
button and typein a MessageBox line above the Success line like this: MessageBox.Show(Dts.Variables["User::SID"].Value.ToString()
+ "
- " + Dts.Variables["User::SDATE"].Value.ToString()); Dts.TaskResult
= (int)ScriptResults.Success; I start the package and see the
messagebox showing the values from the row in the status table in the
database. When the sql statement does not
return a row, an error occurred while assigning a value to variable: "Single
Row result set is specified, but no rows were returned." Therefore I
need to extend the sql statement so it will always return a row and I will
make a stop constraint so the Control Flow goes to another script. In the 'Execute SQL Task' I change the sql statement to check if there is a row, and if it does not exist I return a row with default values for the two columns:

IF EXISTS(SELECT 1 FROM dbo.STATUS WHERE Status = 0)
  SELECT StatusId, StatusDateTime FROM dbo.STATUS WHERE Status = 0
ELSE
  SELECT 0 AS StatusId, '1900-01-01 00:00:00' AS StatusDateTime

I edit the green 'Precedence
constraint'
arrow
between 'Execute
SQL Task' and 'Script Task',
because I can extend the constraint with a logical boolean expression, so I change 'Evaluation operation' from "Constraint" to "Expression and Constraint". In the expression I am using the SID variable with C# notation for an equality comparison: @[User::SID] != 0. An fx logo is shown at the green 'Precedence constraint' arrow, telling that there is an expression included in the criteria for success and continuation. When the sql statement does not
return a row, the 'Precedence constraint' will not continue and the SSIS package does not fail. But if I want the 'Control Flow' to continue, I drag in another 'Script Task', rename it to 'No Row' and type in a MessageBox "No row". I edit the green 'Precedence constraint' arrow between 'Execute SQL Task' and 'No row' using: @[User::SID] == 0. The 'Control Flow' will now always continue the tasks of the SSIS package, and the two variables have a value that can be used later in the package, like in a 'Data Flow Task'.

Assign
a datetime variable with a fixed value

Drag in an 'Execute SQL Task' and at edit type in a sql statement to get the date and time from the SQL Server; you can also use GETUTCDATE() for UTC time without summertime, like this for normal local time with summertime:

SELECT GETDATE() AS CurrentDateTime

Change ResultSet to 'Single row', select a Connection and at the ResultSet page type in the column name from the sql statement and select the variable: Then you can use the package variable User::CurrentDateTime in a 'Script Task':

MessageBox.Show(Dts.Variables["User::CurrentDateTime"].Value.ToString());

Then you can use the package variable User::CurrentDateTime in an 'Execute SQL Task' as an input parameter to make a dynamic sql statement in a where part or in a set part of an update statement like this: Please note to use DATE as Data Type for Input Direction, because OLE DB DATE also covers a time as a datetime. For Output Direction use a DBTIMESTAMP. If you type in an Expression for the variable in the variables window, such as GETDATE(), the variable CurrentDateTime will change value each time it is used, as if the expression is executed every time.

Execute SQL Task does not fail with no row

In case you do not like the sql statement starting with IF EXISTS(SELECT 1, the 'Execute SQL Task' can be changed to use 'Parameter
Mapping', which means that the sql statement will have two output parameters like this, where a question mark is an output parameter:

SELECT ? = StatusId, ? = StatusDateTime FROM dbo.STATUS WHERE Status = 0

Other examples: SELECT TOP 1 ? = CustomerId - SELECT ? = MAX(System_key)

'Parameter Mapping' does not use the property ResultSet, therefore I set it to 'None' and the 'Execute SQL Task' will not fail when there is no row. At the page 'Parameter Mapping' I connect each variable with each output parameter, where Parameter Name 0 is the first ? as StatusId and the second ? is StatusDateTime. The Data Type for the output or input parameter is OLE DB as shown in chapter 3. I still keep the two 'Precedence constraint' green arrows so the 'Control Flow' will do different tasks depending on a row or no row, because when there is no row the 'Parameter Mapping' will assign the SID variable the value 0.

Control
flow depends on whether a value exists in a table

In a Result table I like to get a StatusId that has a status of true, and I like to check if that StatusId exists in a Status table. In an SSIS package I will use two 'Execute SQL Task' to get values from the database tables into SSIS package variables and let one variable, ExistsStatus, control the flow by two fx expressions:

@[User::ExistsStatus] == True // for Exists
@[User::ExistsStatus] == False // for Not exists

Get StatusId from Result

Variable @[User::StatusId] becomes the output parameter for the question mark at the left side of the equal sign:

SELECT TOP 1 ? = StatusId FROM dbo.RESULT WHERE Status = 1

In case there is no row, @[User::StatusId] gets the value 0.

Exists StatusId ?

Variable @[User::ExistsStatus] becomes the output parameter for the first question mark at the left side of the equal sign, and variable @[User::StatusId] becomes the input parameter for the second question mark at the right side of the equal sign:

SELECT ? = IIF(EXISTS(SELECT 1 FROM dbo.STATUS WHERE StatusId = ?), 1, 0) -- 1 = True, 0 = False

Fx expressions

The left fx is using the logical
expression: @[User::ExistsStatus] == True and when the condition is fulfilled
the left green 'Precedence constraint' will continue the
control flow. The right fx is using the logical
expression: @[User::ExistsStatus] == False and when the condition is fulfilled
the right green 'Precedence constraint' will continue the
control flow. Insert
into a table 'Execute
SQL Task' can do an insert into sql statement with input parameter: INSERT INTO
dbo.RESULT(StatusId, Status) VALUES(?,0) Update
of a table 'Execute
SQL Task' can do an update sql statement with two input parameters, written either way:

UPDATE dbo.RESULT SET Status = ? WHERE StatusId = ?

UPDATE r SET Status = ? FROM dbo.RESULT r WHERE r.StatusId = ?

Delete
in a table 'Execute
SQL Task' can do a delete sql statement with input parameter: DELETE FROM dbo.RESULT WHERE StatusId = ? Call stored
procedure to update and get the result back

I like to update a column NumberOf for a specific OrderId and get the result back. I put the business logic in a stored procedure with two input parameters and one output parameter that is assigned the result of the update together with the UPDATE sql statement:

ALTER PROCEDURE [dbo].[Order_NumberOf]
(@OrderId int, @Extra int, @NumberOf int OUTPUT)
AS
BEGIN
  BEGIN TRANSACTION
  UPDATE dbo.OrderNumber
  SET @NumberOf = NumberOf = NumberOf + @Extra
  WHERE OrderId = @OrderId
  COMMIT TRANSACTION
END

A 'Script Task' assigns the two input variables before the 'Execute SQL Task' calls the stored procedure:

Dts.Variables["User::OrderId"].Value = 2;
Dts.Variables["User::Extra"].Value = 3;
Dts.TaskResult = (int)ScriptResults.Success;

Another 'Script Task' shows the output variable afterwards:

MessageBox.Show(Dts.Variables["User::NumberOf"].Value.ToString());
Dts.TaskResult = (int)ScriptResults.Success;

After
running the SSIS package, it will show 23 in the messagebox and the table looks like: Parameter Names must be 0, 1, 2 and appear as ? parameters in the same order. Named @parameters require an ADO.NET Connection. In case I do not like to do the two package variable assignments in a 'Script Task', I can replace it with two 'Expression Task's that each do one package variable assignment:

Read variable values from a table into SSIS package variables

Table SSIS_Variable has a row for
each variable with a datatype and a value:
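A minimal sketch of such a table, with names taken from the C# program below; the data types and lengths are my assumptions:

CREATE TABLE dbo.SSIS_Variable (
  Variable nvarchar(100) NOT NULL,
  Datatype nvarchar(20) NOT NULL, -- Boolean, DateTime, Decimal, Int32 or String
  Value nvarchar(200) NULL
);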
An SSIS package has the same variables, which will be assigned the values from the table by a Script Task: 'Assign variable' has a C# program that connects to the database and the table, loops over each row and assigns each variable with the right data type:

#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
//For connection to SQL Server
using System.Data.SqlClient;
using System.Text.RegularExpressions;
#endregion

namespace ST_cad614c72d4f4a1ab407b6cb5a7ec482
{
  [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
  public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTAScriptObjectModelBase
  {
    public void Main()
    {
      try
      {
        //String connectionstring = "Server=<server>;Database=<db>;Trusted_Connection=Yes;Connect Timeout=30";
        //Better to use the Connection Manager from the package to get the connection:
        String connectionstring = Dts.Connections["DSAconnectionstring"].ConnectionString.ToString();
        //Cleansing the connectionstring so it can be used for SqlConnection below:
        Regex regProvider = new Regex("Provider=([^;]*);");
        Regex regTranslate = new Regex("Auto Translate=([^;]*);");
        connectionstring = regProvider.Replace(connectionstring, "");
        connectionstring = regTranslate.Replace(connectionstring, "");
        SqlConnection connection = new SqlConnection(connectionstring);
        SqlCommand command = new SqlCommand();
        SqlDataReader reader;
        command.Connection = connection;
        connection.Open();
        command.CommandType = CommandType.Text;
        command.CommandText = "SELECT Variable,Datatype,Value FROM dbo.SSIS_Variable";
        reader = command.ExecuteReader();
        if (reader.HasRows)
        {
          while (reader.Read())
          {
            //For debugging:
            //if (reader["Variable"].ToString() == "Year")
            //{
            //  MessageBox.Show(reader["Variable"].ToString() + reader["Value"].ToString());
            //}
            //Convert or parse to the right data type for the package variables:
            switch (reader["Datatype"].ToString())
            {
              case "Boolean":
                Dts.Variables[reader["Variable"]].Value = Boolean.Parse(reader["Value"].ToString());
                break;
              case "DateTime":
                Dts.Variables[reader["Variable"]].Value = DateTime.Parse(reader["Value"].ToString());
                break;
              case "Decimal":
                Dts.Variables[reader["Variable"]].Value = Decimal.Parse(reader["Value"].ToString());
                break;
              case "Int32":
                Dts.Variables[reader["Variable"]].Value = Int32.Parse(reader["Value"].ToString());
                break;
              default: //For String
                Dts.Variables[reader["Variable"]].Value = reader["Value"].ToString();
                break;
            }
          }
        }
        command.Dispose();
        command = null;
        connection.Close();
        connection.Dispose();
        connection = null;
      }
      catch (SqlException ex)
      {
        MessageBox.Show("No connection");
      }
      Dts.TaskResult = (int)ScriptResults.Success;
    }
  }
}

'Show variable' has a C# program:

public void Main()
{
  MessageBox.Show(Dts.Variables["User::Year"].Value.ToString());
  MessageBox.Show(Dts.Variables["User::Weight"].Value.ToString());
  MessageBox.Show(Dts.Variables["User::EndDate"].Value.ToString());
  MessageBox.Show(Dts.Variables["User::Type"].Value.ToString());
  if ((bool)Dts.Variables["User::Summertime"].Value == true)
    MessageBox.Show("Summertime");
  else
    MessageBox.Show("Wintertime");
  Dts.TaskResult = (int)ScriptResults.Success;
}

Package parameter
in a child package to be assigned from a parent package

An SSIS package can be parameterized like a stored procedure by using parameters that are placed at the tab next to Control Flow and Data Flow, and parameters can also have a default value which is useful when executing the package during development. The SSIS package parameter has Required = True, meaning that the calling parent SSIS package needs to provide a parameter value to its child. In case the parent does not always have a parameter value, Required should be set to False and then Value will be the default parameter value. Inside a package I can use a parameter as if it was a variable in an Expression Task or a Script Task, where I write it like this: $Package::FromYearMonth. But a package parameter is a readonly variable, therefore it can not be assigned in an Expression Task like this: @[$Package::FromYearMonth] = 201512. But a parameter can be used at the right side of the equal sign like this: An SSIS package acts as a parent package when it executes or calls another SSIS package as a child package and passes a value to the parameter through an 'Execute Package Task' like this:
5. Control Flow, workflow in design mode and runtime mode

When I open an SSIS package in design
mode for development, I like to see the control flow as it will be in
runtime mode; therefore I do not like a package to dynamically disable a 'Data Flow Task' (DFT) by an expression fx mark at the 'Data Flow Task' or by Script Task programming. I like to use Precedence constraints
green arrow with expression fx to show the control flow or workflow of the
package in a self-explanatory graphical user interface perspective where the control
flow determine the order in which executables (tasks and containers) will run
and the control flow determines under what conditions they are executed. In
other words, certain executables will run only when a set of defined
conditions are met in an expression fx at precedence constraints green arrows. Setting Precedence Constraints on
Tasks and Containers for a linear or branching control flow like this where I
fetch two boolean values from a table into two packages variable and let them
with true/false do the runtime workflow:
·
Precedence
constraint green arrow has an expression,
shown by an fx logo. ·
Solid arrow meaning AND, that all
precedence constraints that point to the constrained executable must evaluate
to true in order for that executable to run. ·
Dotted arrow meaning OR, that only
one precedence constraint that points to the constrained executable must evaluate
to true in order for that executable to run.

In runtime mode

Fetch data variable

I have made a table FetchData with columns of bit data type to control the flow; they will be fetched into two package variables through an 'Execute SQL Task':
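A minimal sketch of such a control table and the 'Execute SQL Task' statement with ResultSet 'Single row'; the table layout and column names are my assumptions based on the package variables used below:

CREATE TABLE dbo.FetchData (
  FetchProducts bit NOT NULL,
  FetchOrders bit NOT NULL
);

SELECT TOP 1 FetchProducts, FetchOrders FROM dbo.FetchData;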
From FetchCustomers to FetchProducts When the package variable FetchProducts
has value True from the table, the workflow will do FetchProducts. I connect the two tasks with a precedence
constraint green arrow and I right click at the arrow and select {Edit}, in
first dropdown I select 'Expression and Constraint' so both the Expression »@[User::FetchProducts]
== True« and Constraint of »FetchCustomers« must be true and success to
continue with the FetchProducts task. I use the default Logical AND, which gives a solid arrow:

From FetchProducts to Fetch
Suppliers

I do the same as above and I still use Logical AND and get a solid arrow.

From FetchCustomers to Fetch Suppliers

When the package variable FetchProducts has the value False from the table, the workflow will only do FetchSuppliers; the above path is skipped, so no go for FetchProducts. I connect the two tasks with a precedence
constraint green arrow and I right click at the arrow and select {Edit}, in
first dropdown I select 'Expression and Constraint' so both the Expression »@[User::FetchProducts]
== False« and Constraint of »FetchCustomers« must be true and success to
continue with the FetchSuppliers task. I now use Logical OR, which gives a dotted arrow: The solid arrow from FetchProducts to Fetch Suppliers becomes dotted automatically to show in the self-explanatory graphical user interface that Fetch Suppliers only depends on either FetchProducts or FetchCustomers having been done.

To Fetch Orders and details / Do
nothing, continue when FetchOrders is True Task Fetch Orders and details will
only be done if the package variable FetchOrders has value true. To show this
in a self-explanatory graphical user interface perspective I am using an
empty 'Script Task' that has no variable and has no programming, and from 'Script
Task' to Fetch Orders and details I add this Expression: »@[User::FetchOrders]
== True« and no need for Constraint because 'Script Task' will never failure: Do nothing,
continue when package parameter is True A SSIS package can receive a value
from another package through a parameter and then the package can control the
flow depends of value of parameter. An empty 'Script Task' »Do nothing,
continue when Fetch is True« has no variable and has no programming, it is
just placed there as a dummy object, so I can make a Precedence constraint green arrow that has an expression showned by a fx logo fx: @[$Package::Fetch] == True for programming a self-explanatory
graphical user interface of the control flow of the package like this: When
the fx expression condition is right/true, the
job will enter the Sequence Container. When
the fx expression condition is wrong/false, the
job will not enter the Sequence Container. Both 'Execute SQL Task' for »Begin
package« and »End package« for log registration will be done and »Fetch
Stock« will also be done because it is independent of the Fetch package
parameter. When a package variable or a package
parameter has data type Int32 and a default value as 0, the Expression for continuation
with the task can be like this: @[$Package::parameter] != 0 && !ISNULL(@[$Package::parameter]). When a package variable or a package
parameter has data type String and a default value as 0000 (from nchar(4)
column in a table), the Expression for continuation with the task can be like
this: @[$Package::param]
!= "0000" && !ISNULL(@[$Package::param]) &&
@[$Package::param] != ""
6. Package Job Log to raise and catch error failures

When I have parent and child packages, I like to wrap them all into one Job package with a
simple JobLog table that takes care of ensuring that the job can not be running two times at the same time, and that has a few status states like 'Running', 'Finish' and 'Error'. To catch an error in the Job package or the parent and child packages below it, I will use Event Handlers, which is a tab next to Control Flow, Data Flow and Parameters. The main Job package can look like this, where I use fx expressions to control the process flow: A JobLog table has some columns:
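A minimal sketch of the JobLog table, derived from the sql statements below; the data types and lengths are my assumptions:

CREATE TABLE dbo.JobLog (
  JobLogId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
  Status nvarchar(20) NOT NULL, -- 'Running', 'Finish' or 'Error'
  BeginTime datetime NOT NULL,
  EndTime datetime NULL,
  ErrorMessage nvarchar(4000) NULL
);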
Execute SQL: JobLog
Status is not Running To control that the
job is not running or has status 'Running': SELECT ? = IIF(EXISTS(SELECT 1 FROM dbo.JobLog WHERE Status = 'Running'), 0, 1) and set package
variable Status_Not_Running to output value True or False. fx: @[User::Status_Not_Running]
== True to continue the job control flow. fx:
@[User::Status_Not_Running] == False to send a failure email because the job is already running.

Execute SQL: Job
Status set to Running and set BeginTime and get next JobLogId To insert a new row in
JobLog and fetch the next JobLogId from Identity number and set Status to 'Running'
and set BeginTime datetimestamp: BEGIN TRANSACTION DECLARE @InsertedRow TABLE (JobLogId int) INSERT INTO dbo.JobLog (Status, BeginTime) OUTPUT inserted.JobLogId INTO
@InsertedRow VALUES ('Running', Getdate()) SELECT ? = JobLogId FROM @InsertedRow COMMIT TRANSACTION and set package
variable JobLogId to output value. fx: @[User::JobLogId]
>= 1 to continue the job control flow. fx: @[User::JobLogId]
== 0 to send a failure email because there is no new JobLogId.

Sequence Container: Execute
packages in parallel

The Sequence Container executes multiple packages in parallel at the same time.

Execute SQL: JobLog
Status set to Finish and set EndTime To update the new row
in JobLog and set Status to 'Finish' and set EndTime: BEGIN TRANSACTION UPDATE dbo.JobLog SET Status = 'Finish', EndTime = Getdate() WHERE JobLogId = ? COMMIT TRANSACTION with package variable JobLogId
as input value. Event
Handlers

I click at the tab Event Handlers, and in the left dropdown I select the name of the package, which represents package-level scope for the event. In the right dropdown I select 'OnError' to catch an error in the process or the child packages. To get a canvas or design
surface to drag in SSIS components, I click at line link: Click here to create an
'OnError' event handler The canvas will become
dark gray and I can drag in an 'Execute SQL Task' or copy one from 'Control
Flow' tab: Execute SQL: JobLog
Status set to Error and set EndTime To update the new row
in JobLog and set Status to 'Error' and set EndTime and get system error
message from the package: BEGIN TRANSACTION UPDATE dbo.JobLog SET Status = 'Error', EndTime = Getdate(), ErrorMessage
= ? WHERE JobLogId = ? COMMIT TRANSACTION with package variable JobLogId
as input value and the first ? parameter with index 0 to system variable as
input for System::ErrorDescription, shown in the picture before. Event Handlers have many events like OnPreExecute and OnPostExecute that can do things before and after the package process. Of course the Event Handlers can also be placed in each parent and child package to fetch a better System::ErrorDescription message to be registered in JobLog. A SQL Server Agent Job
will in case of a failure be red with the message 'The step failed' when the Job package raises an error. A Sequence Container can be used for many different tasks, and in case you like to catch an error failure inside the Sequence Container it can be done like this:
7. Execute T-SQL Statement Task

At Control Flow drag in an 'Execute T-SQL Statement Task', right-click {Edit} and click [New] the first time it is used in a project. Call the connection TSQL<databasename> and type in the SQL Server name (or use . or localhost). Type in a sql statement like an UPDATE, DELETE, TRUNCATE, EXEC stored procedure or a dummy like »print 'hello'«. A new connection is shown at the bottom of the package under Connection Managers, and it is a new connection because this task is using the ADO.NET connection manager type with '.Net Providers\SqlClient Data Provider'. Right-click the TSQL connection and choose {Convert to Project Connection}. In Solution Explorer open the connection and select a database for the sql statement, else the database name has to be part of the statement: FROM <database>.dbo.<table>. Back in the package right-click the TSQL connection, {Parameterise}; then the connection string becomes a project parameter and the value in the dialogbox only needs to be: Data Source=<server>;Initial Catalog=<database>;Integrated Security=True;Connect Timeout=30; Remember to click [Save All], and the parameter can be mapped to an Environment variable in the SSISDB catalog. I can now go back and type in the right sql statement. A project will have two connections to the same database, one for the Data Flow Task and one for the Execute T-SQL Statement Task.

8. Kimball SCD type 7 dimension

I will show a way to implement a type
7 dimension in a SSIS package fetching data from a source table from a source
legacy system to be stored in a dimension table with ten audit columns that
support type 7, that is an extension of type 1 and 2 and 3 plus handling
deleted values. There is an example in my article Dimensional
modeling that is good to read before programming the package that will
perform the ETL process from source to dimension. In case you don’t like the Previous reference
_pkey column in the dimension table from the article, you can skip the _pkey
column in the table and SSIS package implementation, because I will in
the end of this tip show an alternative sql update without _pkey. For myself, the _pkey column is quite useful when fetching from a dimension. The two table structures for this example:

CREATE TABLE [dbo].[Source_System] ([Id] [int] NOT NULL, -- Business
key in source system therefore primary key here. [Name] [nvarchar](50)
NULL, [Version] [decimal](5,
2) NULL, CONSTRAINT [PK_Source_System]
PRIMARY KEY CLUSTERED ([Id] ASC) ) ON [PRIMARY] GO CREATE TABLE [dbo].[Dim_System] ([System_key] [int] IDENTITY(1,1) NOT NULL, [System_dkey] [int] NOT
NULL, [System_pkey] [int] NOT
NULL, [System_bkey] [int] NOT
NULL, [System] [nvarchar](50)
NULL, [Version] [decimal](5,
2) NULL, [ValidFrom] [datetime]
NOT NULL, [ValidTo] [datetime]
NOT NULL, [IsCurrent] [bit] NOT
NULL, [IsDeleted] [bit] NOT
NULL, [InsertTime] [datetime]
NOT NULL, [UpdateTime] [datetime]
NOT NULL, CONSTRAINT [PK_Dim_System]
PRIMARY KEY CLUSTERED ([System_key] ASC) ) ON [PRIMARY] GO CREATE NONCLUSTERED INDEX [IX_Dim_System_bkey] ON
[dbo].[Dim_System] ([System_bkey] ASC) ON [PRIMARY] GO CREATE NONCLUSTERED INDEX [IX_Dim_System_dkey] ON
[dbo].[Dim_System] ([System_dkey] ASC) ON [PRIMARY] GO CREATE NONCLUSTERED INDEX [IX_Dim_System_pkey] ON
[dbo].[Dim_System] ([System_pkey] ASC) ON [PRIMARY] GO Remark: datetime2(n) has a date
range of 0001-01-01 through 9999-12-31 while datetime starts from 1753-01-01.
datetime takes 8 bytes in storage size. datetime2 takes 6 bytes for
precisions less than 3; 7 bytes for precisions 3 and 4 else 8 bytes. datetime2(3) has same precision as
datetime. smalldatetime has a date range of 1900-01-01 through 2079-06-06 and
00 second in the timepart and takes 4 bytes in storage size and I could use
»the forever date« like 2070-12-31. I choose datetime and let ValidFrom
default at 1900-01-01 and ValidTo default at 9999-12-31. The source legacy system does not
provide any datetime column when data is valid from or any status flag for
New, Created, Changed, Updated or Deleted data, therefore the SSIS package
has to do the data detection of New, Changed or Deleted data in the source
legacy system and reflect it in the dimension table. In the SSIS package I am not using 'OLE DB Command'; instead the sql update of audit columns in previous rows will be done by sql statements that don't need variables, therefore I place them in an 'Execute T-SQL Statement Task' to get the best performance. I use 'Data Flow Task', 'Execute T-SQL Statement Task', 'OLE DB Source', 'Merge Join', 'Conditional Split', 'Derived Column' and 'OLE DB Destination' in the flow of data from the source legacy system table Source_System to the type 7 dimension table Dim_System in a datamart: I will make an SSIS package where data in the Source_System table will do a Kimball SCD type 7 into the Dim_System table. From the Source_System table the package will do data detection of:
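The package does this detection with the Merge Join and Conditional Split described below; expressed as T-SQL only for illustration, the three cases are roughly like this (a sketch, where the filter on current, not-deleted dimension rows is my assumption):

-- New: business key exists in the source but not among current dimension rows
SELECT s.Id FROM dbo.Source_System s
WHERE NOT EXISTS (SELECT 1 FROM dbo.Dim_System d
                  WHERE d.System_bkey = s.Id AND d.IsCurrent = 1 AND d.IsDeleted = 0);

-- Deleted: current dimension row whose business key is gone from the source
SELECT d.System_bkey FROM dbo.Dim_System d
WHERE d.IsCurrent = 1 AND d.IsDeleted = 0
  AND NOT EXISTS (SELECT 1 FROM dbo.Source_System s WHERE s.Id = d.System_bkey);

-- Changed: business key exists in both, but a value differs
SELECT s.Id FROM dbo.Source_System s
INNER JOIN dbo.Dim_System d ON d.System_bkey = s.Id AND d.IsCurrent = 1 AND d.IsDeleted = 0
WHERE ISNULL(s.[Name], N'¤') <> ISNULL(d.[System], N'¤')
   OR ISNULL(s.[Version], -1) <> ISNULL(d.[Version], -1);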
The package will insert rows from the Source_System table into the Dim_System table together with audit columns. The package will end up with updating rows in table Dim_System to set the _dkey column and other audit columns. Since the audit columns will use default values, I can either type in the default values for each audit column in the table design of Dim_System, or I can make them parameters and variables for the SSIS project and the SSIS package like this:
In 'Control Flow' I drag in a 'Data Flow Task' and inside I drag in two 'Source' for the tables Source_System and Dim_System with sql statements to make sure that data has the right sort order for a later 'Merge Join' that will join over the business keys; therefore I order by the business keys in both sql statements, where I only include the columns the package needs, like this:
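A sketch of the two source statements; the exact column lists and the filter on current rows in the dimension query are my assumptions:

SELECT Id, [Name], [Version]
FROM dbo.Source_System
ORDER BY Id;

SELECT System_key, System_dkey, System_bkey, [System], [Version]
FROM dbo.Dim_System
WHERE IsCurrent = 1 AND IsDeleted = 0
ORDER BY System_bkey;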
When the sql statement takes data from an archive, there will be a WHERE Active = 1. Since both 'Source' tables are using the same column names, I rename the columns of the Source_System table by editing the 'Source', going to the Columns page and typing in a prefix s_ on the output columns, like this: At both 'Source' right-click {Show
Advanced Editor}, tab 'Input and Output Properties', click at 'OLE DB Source
Output' and set property IsSorted to True and open
'OLE DB Source Output' and open 'Output Columns' and select the first column
from ORDER BY and set property SortKeyPosition to 1, see previous tip about
this. Drag in a 'Merge Join' and connect the two 'Source' tables as 'Merge Join Left Input', but in reality I want a 'Full outer join' because it will give me all rows from both the 'Source' and the Dimension table, and either business key will be null if there is no match. I include all columns from both tables because: I right-click {Edit}, select join type 'Full outer join' and select all columns: A full outer join merges two datasets and returns all rows from both datasets into a new result dataset. 'Full outer join' will provide data for four situations called: I will divide or filter data into separate pipelines through a 'Conditional Split' with three conditions using the business keys and the columns of values from the source table (prefix s_) and the dimension table. The order (rank) of the rules is important, or else Changed must start with a condition to make sure that both business keys have a value and therefore exist in both tables: !ISNULL(System_bkey) && !ISNULL(s_Id) && (<rule as above>).

New
pipeline will insert new rows into the Dim_System table, therefore the package has to set default values for the audit columns using the project parameters and package variable. I do it in a 'Derived Column' that will add extra columns to the New pipeline with a default value typed in as an Expression referring to parameters and variable. ValidFrom is set to its general »begindate« because the data is new in the Dim_System table, and ValidTo is set to its general end date called »forever enddate«. Sometimes the source legacy system provides ValidFrom for when the data was created inside the system: I don't like an audit column to be Null or n/a, therefore UpdateTime gets the same datetime stamp as InsertTime, so both columns have the same value until data in the row is either changed or deleted in the source legacy system. The New pipeline with audit column
default values can now be inserted into the Dim_System table; I drag in a 'Destination', connect the New pipeline to it and select the Dim_System table with fast load. There is a default mapping that matches columns with the same name, because the pipeline contains columns from both tables Source_System and Dim_System after the full outer join: But the source columns (left side) have the prefix s_ and they need to be mapped to the right columns in the destination table Dim_System (right side), so I rearrange the mapping:
Audit columns System_dkey and System_pkey
for the new rows will be set to the right value in a sql update statement in
the end of the ETL process. After I have deleted the map lines and dragged new map lines I rearrange the mapping, where I can't use the input destination columns because the data is new in the source and therefore is Null in the new pipeline, so all the audit columns come from the default values to be inserted into the Dim_System table as new rows:

Audit column
System_dkey will
be set to the value of System_key in a sql update statement in the end of the
ETL process, because System_key is an Identity sequential number that first gets its value after the row has been inserted. Then System_dkey will inherit the number, so in the first row both columns have the same number. When the data later is changed or deleted, it becomes a new row in the Dim_System table with a new number in System_key, and the audit column System_pkey will have
a value as previous reference back to the number of System_key for the
previous row before data was changed or deleted. Changed pipeline will insert new rows
into Dim_System table, therefore the package has to set default values for
audit columns through 'Derived Column' where ValidFrom is set to
currentdatetime because that is the time where Dim_System receives the changed data from Source_System. Sometimes the source legacy system provides ValidFrom for when the data was changed/updated inside the system:

Audit columns System_dkey and System_pkey will inherit the values from the existing rows in the Dim_System table. The Changed pipeline with audit column default values can now be inserted into the Dim_System table; I drag in a 'Destination', connect the Changed pipeline to it and select the Dim_System table with fast load. I rearrange the mapping:
Audit columns ValidTo, IsCurrent and
UpdateTime for the old previous rows will be set to the right value in a sql
update statement in the end of the ETL process. Deleted pipeline will insert new rows
into Dim_System table, therefore the package has to set default values for
audit columns through 'Derived Column' where ValidFrom is set to currentdatetime
because that is the time where Dim_System found out that the data has been deleted in Source_System. Sometimes the source legacy system provides ValidFrom for when the data was deleted inside the system. The new row will be marked with IsDeleted = 1 = true because the row represents the deleted data in the source legacy system, therefore I use another parameter default value here:

Audit columns System_dkey and System_pkey will inherit the values from the existing rows in the Dim_System table. The Deleted pipeline with audit column default values can now be inserted into the Dim_System table; I drag in a 'Destination', connect the Deleted pipeline to it and select the Dim_System table with fast load. I rearrange the mapping, where I can't use the prefix s_ columns because the data has been deleted in the source and therefore is Null in the Deleted pipeline:
Audit columns ValidTo, IsCurrent and
UpdateTime for the old previous rows will be set to the right value in a sql
update statement in the end of the ETL process.

Audit columns

The SSIS package control flow ends with a TSQL statement that has two sql update statements.

UPDATE d SET System_dkey = d.System_key
FROM dbo.Dim_System d
WHERE d.System_key > 0 AND d.System_dkey = 0

* Set ValidTo to the same datetimestamp that the new row got in ValidFrom.
* Set IsCurrent = 0 = false because it is the new row that is the current row.
* Set UpdateTime to the current datetime when the previous row was updated.

The
sql statement is using a self-join
between the new row with alias n and the previous row in the dimension with
alias d with joining from new row column System_pkey to previous row column
System_key where both have IsCurrent = 1 to fetch the new row and the last
previous row and not older historical rows, but then the updating will set IsCurrent
= 0 for the previous row, therefore there will be only one row with IsCurrent
= 1 per data value: UPDATE d SET ValidTo = n.ValidFrom,
IsCurrent = 0, UpdateTime = Getdate() FROM dbo.Dim_System d
INNER JOIN dbo.Dim_System n ON n.System_pkey = d.System_key WHERE d.System_key > 0
AND n.IsCurrent = 1 AND d.IsCurrent = 1 In case you don’t like to
use the Previous reference _pkey audit column in a dimension table, the
second sql update statement will use System_dkey to perform the self-join
together with two extra conditions that new row System_key must be greater
than previous row System_key to avoid a row joining to itself: UPDATE d SET ValidTo = n.ValidFrom,
IsCurrent = 0, UpdateTime = Getdate() FROM dbo.Dim_System d INNER JOIN dbo.Dim_System n ON
n.System_dkey = d.System_dkey AND n.System_key >
d.System_key WHERE d.System_key > 0
AND n.IsCurrent = 1 AND d.IsCurrent = 1 The SSIS package is ready to be
tested.

Approaches for writing to the dimension table

The approach above starts with
inserting the rows into the dimension table and hereafter updating the audit
columns. Another approach would be using a staging table which contains changed
and deleted data and a ValidFrom column that is assigned inside the staging table; hereafter the ValidFrom will be used to update the previous rows' ValidTo column in the dimension table together with the other audit columns, and hereafter the rows are inserted from the staging table into the dimension table.

Merge in T-SQL

The process can also be solved by
T-SQL programming with INSERT INTO and UPDATE or with a MERGE but it needs
two updates afterwards like this where Target is table Dim_System and Source
is table Source_System where Merge compare business key between source and
target. Remarks that (ISNULL(Source.Name,'¤')
<> ISNULL(Target.System,'¤') is not working for Merge when a value
change to null. |
DECLARE @defaultdkey int = 0
DECLARE @defaultpkey int = 0
DECLARE @defaultIsCurrent bit = 1
DECLARE @defaultIsDeleted bit = 1
DECLARE @defaultIsDeleted_Not bit = 0
DECLARE @defaultValidFrom datetime = '1900-01-01 00:00:00'
DECLARE @defaultValidTo datetime = '9999-12-31 00:00:00'
DECLARE @currentDateTime datetime = GETDATE() -- or IdaInsertTime.

INSERT INTO dbo.Dim_System -- Insert rows for updated and deleted data into Target table
      (System_bkey, [System], [Version], System_dkey, System_pkey, ValidFrom,
       ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)
SELECT System_bkey, [Name], [Version], System_dkey, System_key, @currentDateTime,
       @defaultValidTo, @defaultIsCurrent, @defaultIsDeleted_Not, @currentDateTime, @currentDateTime
FROM -- the columns in the select come from the OUTPUT action columns, System_key is for System_pkey
(
  MERGE dbo.Dim_System AS [Target]
  USING (SELECT [Id], [Name], [Version] FROM dbo.Source_System) AS [Source]
     ON [Target].System_bkey = [Source].Id
    AND [Target].IsCurrent = 1
    AND [Target].IsDeleted = 0
  WHEN NOT MATCHED BY TARGET -- Insert rows for new data into Target table
    THEN INSERT (System_bkey, [System], [Version], System_dkey, System_pkey, ValidFrom,
                 ValidTo, IsCurrent, IsDeleted, InsertTime, UpdateTime)
         VALUES ([Source].[Id], [Source].[Name], [Source].[Version], @defaultdkey, @defaultpkey,
                 @defaultValidFrom, @defaultValidTo, @defaultIsCurrent, @defaultIsDeleted_Not,
                 @currentDateTime, @currentDateTime)
  WHEN MATCHED AND ((([Target].[System] <> [Source].[Name])
                  OR ([Target].[System] IS NOT NULL AND [Source].[Name] IS NULL)
                  OR ([Target].[System] IS NULL AND [Source].[Name] IS NOT NULL))
                 OR (([Target].[Version] <> [Source].[Version])
                  OR ([Target].[Version] IS NOT NULL AND [Source].[Version] IS NULL)
                  OR ([Target].[Version] IS NULL AND [Source].[Version] IS NOT NULL)))
    -- Update rows in Target table because the rows are not current anymore, Source changed a value
    THEN UPDATE SET ValidTo = @currentDateTime, IsCurrent = 0, UpdateTime = @currentDateTime
  WHEN NOT MATCHED BY SOURCE AND [Target].IsCurrent = 1 AND [Target].IsDeleted = 0
    -- Update rows in Target table because the rows have been deleted in the Source table
    THEN UPDATE SET ValidTo = @currentDateTime, IsCurrent = 0, UpdateTime = @currentDateTime
  -- Updated rows in the OUTPUT will be inserted into the Target table as new rows by the INSERT INTO above
  OUTPUT $Action AS ActionOutput, inserted.System_key, inserted.System_bkey, inserted.System_dkey,
         [Source].[Name], [Source].[Version]
) AS MergeOutput
WHERE MergeOutput.ActionOutput = 'UPDATE'

-- For new rows to set System_dkey to the same value as the given Identity for System_key
UPDATE d
SET    System_dkey = d.System_key
FROM   dbo.Dim_System d
WHERE  d.System_key > 0 AND d.System_dkey = 0

-- For deleted rows to set IsDeleted in the new row and fetch the other values from the previous row
UPDATE d
SET    IsDeleted = @defaultIsDeleted, [System] = p.[System], [Version] = p.[Version]
FROM   dbo.Dim_System d
       INNER JOIN dbo.Dim_System p ON p.System_key = d.System_pkey
WHERE  d.System_key > 0 AND d.IsCurrent = 1 AND d.IsDeleted = 0
  AND  d.[System] IS NULL AND d.[Version] IS NULL

MERGE is mostly for inserting new data and updating existing data, and it becomes a little more complex for Kimball type 2 or 7 dimensions because of the syntax of OUTPUT and the extra INSERT INTO statement wrapped around the MERGE statement.

Insert and Update instead of Merge in T-SQL stored procedure
CREATE PROCEDURE [dbo].[DimensionT7_InsertUpdate]
AS
BEGIN
  SET NOCOUNT ON

  DECLARE @defaultValidFrom datetime = '1900-01-01 00:00:00'
  DECLARE @defaultValidTo   datetime = '9999-12-31 00:00:00'
  DECLARE @currentDateTime  datetime = GETDATE() -- or IdaInsertTime.
  DECLARE @New_ROW_COUNT     int = 0
  DECLARE @Deleted_ROW_COUNT int = 0
  DECLARE @Changed_ROW_COUNT int = 0

  -- I am using a @table_variable instead of a #temporary_table or a real stage table inside the
  -- database or a memory optimized table. I have added a primary key index on the columns I will
  -- join to later.
  DECLARE @NewRows     TABLE (System_key int NOT NULL PRIMARY KEY CLUSTERED)
  DECLARE @DeletedRows TABLE (System_key int, System_dkey int, System_bkey int,
                              System nvarchar(50), Version decimal(5, 2))
  DECLARE @ChangedRows TABLE (System_key int, System_dkey int,
                              System_bkey int NOT NULL PRIMARY KEY CLUSTERED)

  /* bkey is not in dimension, therefore row is new in source, that will be marked as current row. */
  INSERT INTO dbo.Dim_System
        (System_bkey, System, Version, System_dkey, System_pkey, ValidFrom, ValidTo,
         IsCurrent, IsDeleted, InsertTime, UpdateTime)
  OUTPUT inserted.System_key INTO @NewRows
  SELECT DISTINCT s.Id, s.Name, s.Version, 0, 0, @defaultValidFrom, @defaultValidTo,
         1, 0, @currentDateTime, @currentDateTime
  FROM   dbo.Source_System s
  WHERE  NOT EXISTS(SELECT 1 FROM dbo.Dim_System d WHERE d.System_bkey = s.Id)

  SET @New_ROW_COUNT = @New_ROW_COUNT + @@ROWCOUNT

  /* For new rows to set System_dkey as same value as the given Identity for System_key. */
  UPDATE d
  SET    System_dkey = d.System_key
  FROM   dbo.Dim_System d
         INNER JOIN @NewRows n ON n.System_key = d.System_key

  /* bkey is not in source, therefore the row is deleted in source, that will be marked in
     dimension in current row and an extra row is inserted. */
  UPDATE d
  SET    IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime
  OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey,
         inserted.System, inserted.Version INTO @DeletedRows
  FROM   dbo.Dim_System d
  WHERE  d.System_key > 0 AND d.IsCurrent = 1 AND d.IsDeleted = 0
    AND  NOT EXISTS(SELECT 1 FROM dbo.Source_System s WHERE s.Id = d.System_bkey)

  INSERT INTO dbo.Dim_System
        (System_bkey, System, Version, System_dkey, System_pkey, ValidFrom, ValidTo,
         IsCurrent, IsDeleted, InsertTime, UpdateTime)
  SELECT DISTINCT d.System_bkey, d.System, d.Version, d.System_dkey, d.System_key,
         @currentDateTime, @defaultValidTo, 1, 1, @currentDateTime, @currentDateTime
  FROM   @DeletedRows d
  --FROM dbo.Dim_System d
  --WHERE d.IsCurrent = 0 AND d.IsDeleted = 0 AND d.ValidTo = @currentDateTime AND
  --      NOT EXISTS(SELECT 1 FROM dbo.Source_System s WHERE s.Id = d.System_bkey)

  SET @Deleted_ROW_COUNT = @Deleted_ROW_COUNT + @@ROWCOUNT

  /* bkey is marked deleted in dimension but exists now again in source, reappearing, that will be
     marked in dimension in current/deleted row and an extra row is inserted.
     Keep IsDeleted = 1 to show it was deleted in a period. */
  UPDATE d
  SET    IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime
  OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey INTO @ChangedRows
  FROM   dbo.Dim_System d
         INNER JOIN dbo.Source_System s ON s.Id = d.System_bkey
  WHERE  d.IsCurrent = 1 AND d.IsDeleted = 1

  /* bkey's values are changed in source, that will be marked in dimension in current row
     and an extra row is inserted. */
  UPDATE d
  SET    IsCurrent = 0, ValidTo = @currentDateTime, UpdateTime = @currentDateTime
  OUTPUT inserted.System_key, inserted.System_dkey, inserted.System_bkey INTO @ChangedRows
  FROM   dbo.Dim_System d
         INNER JOIN dbo.Source_System s ON s.Id = d.System_bkey
  WHERE  d.IsCurrent = 1 AND d.IsDeleted = 0
    AND  (ISNULL(s.Name,'¤') <> ISNULL(d.System, '¤') OR ISNULL(s.Version,0) <> ISNULL(d.Version, 0))
         -- d.dwh_Comparison != s.dwh_Comparison

  INSERT INTO dbo.Dim_System
        (System_bkey, System, Version, System_dkey, System_pkey, ValidFrom, ValidTo,
         IsCurrent, IsDeleted, InsertTime, UpdateTime)
  SELECT DISTINCT s.Id, s.Name, s.Version, c.System_dkey, c.System_key,
         @currentDateTime, @defaultValidTo, 1, 0, @currentDateTime, @currentDateTime
  FROM   dbo.Source_System s
         INNER JOIN @ChangedRows c ON c.System_bkey = s.Id
  --WHERE NOT EXISTS(SELECT 1 FROM dbo.Dim_System d WHERE d.System_bkey = s.Id AND d.IsCurrent = 1)
  --Because source table contains rows that have not been changed and must be skipped in the ETL.

  SET @Changed_ROW_COUNT = @Changed_ROW_COUNT + @@ROWCOUNT
END
9. Lookup, Cache Connection Manager Transform, Data Viewer
In case the ETL process needs to do two or more lookups to the same table or sql statement, by adding the same Lookup in several Data Flow Tasks, the data will be reloaded for each Lookup, which can cost performance. Instead the data can be loaded one time and added to a cache, and multiple Lookups can use the same cache. A Lookup is a join between two tables in a one-to-many relationship, from the many-table to the one-table, for adding extra columns to the pipeline that is built on the many-table, as the sql sketch below illustrates.
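As an illustration only, a Lookup from a source pipeline to a dimension corresponds roughly to this kind of join in T-SQL (sketched here with the Source_Employees and Dim_Employee tables used in chapter 10), where each source row on the many-side gets extra key columns from at most one dimension row on the one-side:

SELECT s.EmployeeID, s.LastName, s.FirstName,
       d.Employee_key, d.Employee_dkey, d.Employee_pkey   -- extra columns added to the pipeline
FROM   dbo.Source_Employees s                             -- the many-table the pipeline is built on
       LEFT OUTER JOIN dbo.Dim_Employee d                 -- the one-table the Lookup reads or caches
                    ON d.Employee_bkey = s.EmployeeID AND d.IsCurrent = 1
-- Rows where d.Employee_key IS NULL correspond to the Lookup 'No Match Output'.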
10. Kimball SCD type 7 dimension through DSA layer
It is nice to have a Data Staging Area (DSA) layer in the ETL process between the source data and a dimension like type 7 from Kimball. A very strict ETL flow is first to have all data ready at the DSA layer before starting to update and insert data in dimensions and facts. Instead of comparing columns to see if values have changed, I make a checksum hashbytes column called dwh_Comparison that contains the compared columns, which are the normal columns and not the primary key (business key).
10.1 Using Lookup
For a new employee I let ValidFrom start from 1900-01-01, and for the current row I let ValidTo stop at 9999-12-31.
Dim_Employee
From the source legacy system the ETL fetches the employees as a full dump (for delta data see 10.3) with an Updated_Time column that I will use for ValidFrom and ValidTo, because the HR people are used to doing ad hoc queries in the source system. Therefore the source will give 5 employees:
Source_Employees
2 employees have changed last name (ID = 125 Janet and 126 Margaret) and 1 new employee (ID = 127 Steven) has started.
The ETL process to a Stage_Employee table is made in a self-explanatory graphical user interface where I use a Lookup to divide into existing employees (match output) and new employees (no match output), and I use a Conditional Split to only continue with changed employees, where I compare the values in dwh_Comparison. The changed rows will be added two times in the stage and later in the dimension because of the ValidFrom and ValidTo, therefore I use a Multicast to copy the pipeline and save the rows with different audit column values. The data flow from source to stage table:
Multicast does a duplicate of the pipeline of changed rows for Update and Insert. The »2 rows« is from Janet and Margaret that have been given new key numbers, the other »2 rows« is key 3 and 4 from the existing rows in the dimension, and the »1 rows« is from Steven that will be a new member in the dimension.
Stage_Employee (order by bkey + ValidFrom)
The key 3 and 4 rows are for updating the same rows in the dimension where ValidTo and IsCurrent will be changed, therefore there is no need for names in these rows. The same business keys are repeated in the key 6 and 7 rows for insert into the dimension with the new names. Key 5 is the new employee ready to be inserted into the dimension. So the SSIS package has made the staging data ready to be launched into the dimension, and that should be done in another SSIS package that can do it in parallel for all dimensions and facts from DSA to DMA, a data mart. 2 rows are updated with a new ValidTo from the source with 2014-10-09 and 3 new rows are added because there was a new employee too:
Dim_Employee (order by bkey + ValidFrom)
Table Stage_Employee, where some columns allow null values because these values are not needed to update ValidTo in the dimension table, and _key is an Identity to create the new key numbers for the dimension:

CREATE TABLE [dbo].[Stage_Employee](
  [Employee_key]    [int] IDENTITY(1,1) NOT NULL,
  [Employee_dkey]   [int] NOT NULL,
  [Employee_pkey]   [int] NOT NULL,
  [Employee_bkey]   [int] NOT NULL,
  [LastName]        [nvarchar](20) NULL,
  [FirstName]       [nvarchar](10) NULL,
  [dwh_Comparison]  [binary](32) NULL,
  [ValidFrom]       [datetime] NULL,
  [ValidTo]         [datetime] NOT NULL,
  [IsCurrent]       [bit] NOT NULL,
  [InsertTime]      [datetime] NOT NULL CONSTRAINT [DF_Stage_Employee_InsertTime] DEFAULT (getdate()),
  CONSTRAINT [PK_Stage_Employee] PRIMARY KEY CLUSTERED ([Employee_key] ASC)
) ON [PRIMARY]
GO

Employee_key is an Identity because it will calculate the new numbers for the new rows, and later the new rows will be inserted into the dimension table Dim_Employee.
Insert the Missing row into the dimension table:

TRUNCATE TABLE dbo.Dim_Employee
-- SET IDENTITY_INSERT dbo.Dim_Employee ON
INSERT INTO dbo.Dim_Employee (Employee_key, Employee_dkey, Employee_pkey, Employee_bkey, LastName,
       FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent)
VALUES (-1, -1, 0, 0, 'Missing', 'Missing',
        0x0000000000000000000000000000000000000000000000000000000000000000,
        '1900-01-01', '9999-12-31', 1)
-- SET IDENTITY_INSERT dbo.Dim_Employee OFF
GO

When I have a DSA layer and a staging table, there is no need to have an Identity in the dimension table Dim_Employee. The hash column dwh_Comparison uses HASHBYTES('SHA2_256'), which becomes data type binary(32) or char(64), and I just give the Missing row a default value that never will come from the source system.
ETL for staging table into DSA layer
The Truncate Stage_Employee task makes the staging table empty and copies the max key number value from the dimension table, so new rows inserted into the staging table will get new identity numbers that later can be bulk inserted into the dimension:

TRUNCATE TABLE dbo.Stage_Employee
DECLARE @maxKey int
SELECT @maxKey = ISNULL(MAX(Employee_key),0) + 1 FROM dbo.Dim_Employee
DBCC CHECKIDENT (Stage_Employee, RESEED, @maxKey)
The Source_Employees source makes a comparison column for the columns to compare with in the dimension; here FirstName and LastName are concatenated:

SELECT EmployeeID, LastName, FirstName, Updated_Time,
       dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(UPPER(RTRIM(LTRIM(FirstName))), ';',
                             UPPER(RTRIM(LTRIM(LastName))))) AS binary(32))
FROM   dbo.Source_Employees

The cast makes sure the default data type binary(8000) is not used, because that would take a lot of memory for the pipeline. Concat takes care of a null value in a column, else use isnull(firstname, '').
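A small illustration of that point (the values are only examples): CONCAT treats a null input as an empty string, where the plain + operator would make the whole result null:

SELECT CONCAT(NULL, ';', 'Fuller')                              -- gives ';Fuller'
SELECT CAST(NULL AS nvarchar(10)) + ';' + 'Fuller'              -- gives NULL
SELECT ISNULL(CAST(NULL AS nvarchar(10)), '') + ';' + 'Fuller'  -- gives ';Fuller'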
The Dim_Employee lookup fetches the key columns and the comparison column for the current dimension rows:

SELECT Employee_key, Employee_dkey, Employee_pkey, Employee_bkey, dwh_Comparison
FROM   dbo.Dim_Employee
WHERE  IsCurrent = 1

The Lookup No Match Output means that the business key from the source, EmployeeID, does not exist in the dimension. Derived Column extends the pipeline with extra column values that will be inserted into the staging table as Stage_Employee new rows with new identity number values of the _key column:
The Lookup Match Output means that the business key from the source, EmployeeID, does exist in the dimension, but then only continue with the rows where the source has changed FirstName or LastName, by comparing the comparison column from source and dimension that was included in the Lookup, in the 'Find changed rows' Conditional Split with the criteria not to be equal:
A changed row, meaning the source has updated FirstName or LastName, will result in two rows in the staging table, one row for updating ValidTo/IsCurrent=0 and one row for inserting a new row with ValidFrom/IsCurrent=1, therefore a Multicast doubles the pipeline. Derived Column extends the pipeline with extra column values that will be inserted into the staging table as Stage_Employee existing rows but keeps the identity number value of the _key column for doing a sql update later by joining the _key columns, and ValidTo gets its value from Updated_Time:
Stage_Employee changed rows
with new identity number values of the _key, and the _dkey gets its value from the dimension from the Lookup above, and the _pkey gets its value from the dimension _key because _pkey is a reference back to the previous row in the dimension, made as a column copy in Derived Column, and ValidFrom gets its value from Updated_Time:
The Stage_Employee set durable key task updates the _dkey for the new rows:

UPDATE dbo.Stage_Employee WITH(TABLOCK)
SET    Employee_dkey = Employee_key
WHERE  IsCurrent = 1 AND Employee_dkey = 0
ETL from staging table in DSA layer into the dimension
The Update Dim_Employee task sets ValidTo/IsCurrent for the old current row:

UPDATE d
SET    ValidTo = s.ValidTo, IsCurrent = s.IsCurrent, UpdateTime = GETDATE()
FROM   dbo.Dim_Employee d WITH(TABLOCK)
       INNER JOIN dbo.Stage_Employee s WITH(TABLOCK) ON s.Employee_key = d.Employee_key
WHERE  s.IsCurrent = 0

The Insert Dim_Employee task inserts the new rows with ValidFrom/IsCurrent=1 from Stage_Employee:

SELECT Employee_key, Employee_dkey, Employee_pkey, Employee_bkey, LastName,
       FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent
FROM   dbo.Stage_Employee
WHERE  IsCurrent = 1

Into table Dim_Employee:
A SSIS package for updating and inserting data from staging tables to dimensions and fact tables can use BEGIN TRANSACTION and COMMIT TRANSACTION, or ROLLBACK TRANSACTION on failure, to secure that either all data will be loaded or nothing will happen, as the sketch below shows.
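A minimal sketch of that pattern, assuming only the two Dim_Employee statements above are involved (the TRY/CATCH error handling shown here is just one possible way to do it, not the only one):

BEGIN TRY
    BEGIN TRANSACTION

    -- Update the old current rows in the dimension from the staging table.
    UPDATE d
    SET    ValidTo = s.ValidTo, IsCurrent = s.IsCurrent, UpdateTime = GETDATE()
    FROM   dbo.Dim_Employee d
           INNER JOIN dbo.Stage_Employee s ON s.Employee_key = d.Employee_key
    WHERE  s.IsCurrent = 0

    -- Insert the new current rows from the staging table.
    INSERT INTO dbo.Dim_Employee (Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,
           LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent)
    SELECT Employee_key, Employee_dkey, Employee_pkey, Employee_bkey,
           LastName, FirstName, dwh_Comparison, ValidFrom, ValidTo, IsCurrent
    FROM   dbo.Stage_Employee
    WHERE  IsCurrent = 1

    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION   -- nothing is loaded when a statement fails
    ;THROW
END CATCH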
10.2 Not using Lookup
I think it is important to make a SSIS package in a self-explanatory graphical user interface way where the Dim_Employee lookup and the two pipelines tell a lot. But with many rows in a dimension a Lookup is slow and uses a lot of memory, therefore I can drop the Lookup, and inside the 'OLE DB Source' Source_Employees I make a sql that does a kind of lookup through a Left Outer Join, and it will be performed inside the database at SQL Server:

;WITH source AS
(
  SELECT EmployeeID, LastName, FirstName, Updated_Time,
         dwh_Comparison = CAST(HASHBYTES('SHA2_256', CONCAT(UPPER(RTRIM(LTRIM(FirstName))), ';',
                               UPPER(RTRIM(LTRIM(LastName))))) AS binary(32))
  FROM   dbo.Source_Employees WITH(TABLOCK)
)
SELECT s.EmployeeID, s.LastName, s.FirstName, s.Updated_Time, s.dwh_Comparison,
       d.Employee_key, d.Employee_dkey, d.Employee_pkey
FROM   source s
       LEFT OUTER JOIN dbo.Dim_Employee d WITH(TABLOCK) ON d.Employee_bkey = s.EmployeeID
WHERE  d.dwh_Comparison IS NULL
   OR (d.IsCurrent = 1 AND d.dwh_Comparison <> s.dwh_Comparison)
· Left outer join + d.dwh_Comparison IS NULL fetches new rows from the source with a business key that does not exist in the dimension, so all d. columns have null values.
· d.IsCurrent = 1 AND d.dwh_Comparison <> s.dwh_Comparison fetches changed rows from the source that do not compare equal to the current rows in the dimension.
A Conditional Split has an output called New for ISNULL(Employee_key) and the default output is called Changed. The New pipeline is the 'Lookup No Match Output' and the Changed pipeline goes to 'Multicast'. The data flow from source to stage table:
Multicast does a duplicate of the pipeline of changed rows for Update and Insert. Here only 3 rows come out of Source_Employees because 1 is new and 2 are changed, but as the first data flow showed there were 5 rows from the source. There are always pros and cons to a self-explanatory graphical user interface and performance.
10.3 Source data into an Archive with repeating members
The data warehouse fetches the source data and adds them all to the archive, also when the source gives revisions of rows or rows that seem not to have been changed; I call them repeating members, e.g. Fuller in three rows:
Archive_Employee
In the dimension Andrew's name is Fuller, but above there is a revision row with an unchanged name, later many rows with a changed name, and the last change is back to the original name Fuller. Buchanan is a new employee and the first row is a wrong typing of his first name. »End Of The Day« for each data revision in the Archive is easy by selecting the max time of Updated_Time from the source (see the sketch at the end of this chapter), but in this approach I will respect all the changes in the source with these rules:
The result for a stage table in DSA where ValidFrom and ValidTo are smashed together and the null key will be set by an Identity.
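A minimal sketch of the simpler »End Of The Day« selection mentioned above, assuming the Archive_Employee table keeps the source columns EmployeeID and Updated_Time (the query only illustrates taking the last revision per employee per day, it is not part of the package described in this chapter):

SELECT a.*
FROM   dbo.Archive_Employee a
       INNER JOIN (SELECT EmployeeID, CAST(Updated_Time AS date) AS Updated_Date,
                          MAX(Updated_Time) AS Max_Updated_Time
                   FROM   dbo.Archive_Employee
                   GROUP BY EmployeeID, CAST(Updated_Time AS date)) m
               ON  m.EmployeeID = a.EmployeeID
               AND m.Max_Updated_Time = a.Updated_Time  -- keep only the last revision of the day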