Tuesday, July 5, 2011

custom logging in ssis

Problem
We have quite a few SSIS packages that we run on a regular basis to extract data from our OLTP systems and transform/load into our data warehouse. We would like to capture some basic information about these SSIS packages such as what packages were run, when, how long did they take to run, and how many rows were extracted and loaded. We have looked in to the built-in logging that comes with SSIS but we just don't see a straight forward way to get row counts. Are we missing something?
Solution
SSIS provides a flexible logging mechanism that is tightly integrated with the various events raised during package execution. However, the issue that you raise is accurate and the best approach is to use the built-in SSIS logging where appropriate and augment it with your own custom logging. A simple, straight-forward custom logging schema can provide the information you need. Let's take a look at a sample schema that can be used to capture the basic logging information that you have requested:

Sample SSIS Logging Data Model



Main points about the above schema:
•All of the tables are in the etl schema; it's a good idea to isolate these kinds of tables from other tables in your database.
•A row is inserted into the PackageLog table by every SSIS package that executes.
•A row is inserted into the ExtractLog when an SSIS package extracts rows from a table in the OLTP database.
•A row is inserted into the LoadLog table when an SSIS package inserts, updates or deletes rows in a dimension or fact table in the data warehouse.
•The EndTime and Success columns are updated if the package runs to completion.
•The above schema does not provide for error logging which is probably best handled using the built-in SSIS logging capabilities.

The pattern for populating the logging tables is to implement an "Init" and an "End" stored procedure for each table. The stored procedures are executed in the SSIS package using the Execute SQL task. The PackageLog stored procedures are as follows:
•stp_InitPackageLog is called at the beginning of the SSIS package to insert a row into the PackageLog table with the Start Time and Package Name; it returns the PackageLogID (identity value). The PackageLogID is saved in a package variable to update the row when the package is done.
•stp_EndPackageLog is called at the end of the SSIS package to update the row inserted in the PackageLog with the EndTime and set the Success column to 1.

Packages that extract data from a table in an OLTP system will update the ExtractLog table as follows:
•stp_InitExtractLog is called to insert a row into the ExtractLog table with the Start Time and OLTP Table Name that is being processed and return the ExtractLogID (identity value). The ExtractLogID is saved in a package variable to update the row when the extract is done. In addition the stored procedure returns the maximum LastExtractDateTime for the table; this value is stored in a package variable and used in the Data Flow to only get the rows that have changed since the last extract.
•stp_EndExtractLog is called when the extract is complete to update the ExtractCount (number of rows extracted), EndTime, LastExtractDateTime, and Success columns. The LastExtractDateTime column is a DATETIME type that is set to the maximum last update date/time extracted from the OLTP table. The OLTP table needs to provide a reliable last update date/time column in order to make this work. For example, tables in the AdventureWorks database have a ModifiedDate column that reflects the last date and time the row was updated.

Packages that update dimension or fact tables in the data warehouse will update the LoadLog table using the same pattern; i.e. stp_InitLoadLog and stp_EndLoadLog. The LoadLog table records the number of rows that were inserted, updated or deleted for a given dimension or fact table in the warehouse.

Now let's take a look at a simple SSIS package that implements the custom logging. We'll use the AdventureWorks sample database as our OLTP source, extract rows from the Sales.Customer table, and save the rows in a staging table. The following is the Control Flow for the package:

SSIS Control Flow


Next let's drill in to the Data Flow:

SSIS Data Flow



The Get Updated Customers task selects the rows from the Sales.Customer table where the ModifiedDate is greater than the maximum ModifiedDate the last time the package was run. The maximum ModifiedDate is stored in the LastExtractDateTime column in the ExtractLog table and returned in the call to the stp_InitExtractLog stored procedure (Init Extract Log task in the Control Flow). The Get Extract Row Count task simply assigns the number of rows passing through to a package variable; the row count is passed as a parameter to the stp_EndExtractLog stored procedure (End Extract Log task in Control Flow) and stored in the ExtractLog table.

Next Steps
•The sample described in this tip is just a starting point; in your particular circumstances you may find additional information that should be logged.
•The custom logging that you implement will likely provide future benefits as well; for instance in a major upgrade to an existing ETL process the information in the custom logging tables can be used to compare results between the currently deployed ETL process and your development environment.
•Custom logging is just one of a number of things that should be defined early in your development effort and made part of an SSIS package "template" that you use as the starting point for each new package.