SQL Server 2005 SSIS: Tuning the Dataflow Task
The Dataflow task is a high performance data transformation and integration engine. It is the data processing heart of Integration Services, capable of consuming data from multiple diverse sources, performing multiple transformations in parallel and outputting to multiple diverse destinations. It supports a pluggable architecture for stock components that ship with Integration Services as well as custom components written by third parties.
For data flows with five or six components with small data loads of a gigabyte or less, the Dataflow Task will almost certainly perform adequately out-of-the-box. However, large and complex data flow loads bring their own challenges and there are often more tradeoffs and decisions that need to be made that the Dataflow task cannot make for the user or over which it has no control.
In this article, I assess the ways in which you can tune the dataflow task by tweaking some basic settings. I call these turnkey settings because there isn't a lot of measurement or process involved when deciding to use them and because the benefits from the settings can usually be seen immediately. I then move on to define a general approach you can use for identifying and resolving performance issues, and squeezing every last ounce of performance from the Dataflow Task.
Turnkey Settings for Speedup
The Dataflow Task provides a number of settings and options you can use to eliminate unnecessary work or to diminish the amount of data that must flow through it. Some of these methods involve simple property settings, while others are a bit more complex. Also covered are the things you can do in the environment outside the Dataflow Task to speed up data flow. Although these methods are presented as a series of "do's and don'ts", understand that not every tip applies to every situation and that the items presented here are only guidelines, not hard rules that must always be followed. Knowing when to apply which rules is a matter of experience grown from testing and measuring the results.
Eliminating Unnecessary Work
Often you may find that the best data flow optimization you can achieve is to simply eliminate work the Dataflow Task must do. Eliminating work is almost always preferable to making the Dataflow Task faster because you get a 100% speedup for every bit of work you eliminate. The following tips describe ways to eliminate work from the Dataflow Task:
Optimize the Sources
Most of the optimizations for eliminating unnecessary work have something to do with eliminating the data coming from the sources. The following are a few ways to eliminate unnecessary data from entering into the data flow:
Remove Unneeded Columns
Unneeded columns are columns that never get referenced in the data flow. The Execution Engine emits warnings for unused columns, so they are easy to identify. This makes the buffer rows narrower. The narrower the row, the more rows that can fit into one buffer and the more efficiently the rows can be processed.
NOTE:
Buffers are objects with associated chunks of memory that contain the data to be transformed. As data flows through the Dataflow Task, it lives in a buffer from the time that the source adapter reads it until the time that the destination adapter writes it.
Binary Large Objects (BLOBs) are particularly burdensome to the Dataflow Task and should be eliminated if at all possible. Use the queries in the source adapters to eliminate unnecessary columns.
Use a SQL Select Statement to Retrieve Data From a View
Avoid using the Table or view access mode in the OLE DB Source Adapter. It is not as performant as using aSELECT statement because the adapter opens a rowset-based on the table or view. Then it calls OpenRowset in the validation phase to retrieve column metadata, and later in the execution phase to read out the data.
Testing has shown that using a SELECT statement can be at least an order of magnitude faster, because the adapter issues the specified command directly through the provider and fetches the data using sp_prepare without executing the command, avoiding the extra roundtrip and a possibly inappropriate cached query plan.
Optimize Source Queries
Using traditional query optimization techniques optimize the source adapter SQL query. SSIS doesn't optimize the query on your behalf, but passes it on verbatim.
Flat File and Other File Sources
Retrieving data from file sources presents its own set of performance challenges because the data is typically in some format that requires conversion. For example, the Jet Provider only supports a limited set of data types when reading Excel files and flat file data are always of type string until converted. Here are a few hints on how to eliminate unnecessary data flow work:
- Combine Adjacent Unneeded Flat File Columns – to eliminate unnecessary parsing
- Leave Unneeded Flat File Columns as Strings – don't convert them to dates etc. unless absolutely necessary
- Eliminate Hidden Operations – mostly the Dataflow Task is explicit about what it is doing. However, there are some components that perform hidden or automatic conversions. For example, the Flat File Source Adapter will attempt to convert external column types to their associated output column types. Use the Advanced Editor to explore each column type so that you know where such conversions occur.
- Only Parse or Convert Columns When Necessary – Reorganize the data flow to eliminate the Type Conversion Transform if possible. Even better, if possible, modify the source column data type to match the type needed in the data flow.
- Use the FastParse Option in Flat File Source – Fastparse is a set of optimized parsing routines that replace some of the SSIS locale-specific parsing functions.
- Eliminate Unneeded Logging – logging is useful for debugging and troubleshooting but, when deploying completed packages to production, be mindful and careful about the log entries you leave enabled and the log provider you use. Notably, OnPipelineRowsSent is somewhat verbose.
10 ways to optimize the Data Flow
Aside from the obvious need to make sure you have the right indexes in place (for example, in certain cases, the Lookup Transform can benefit from having an index on the reference table), here are 10 ways in which you might optimize your data flow:
NOTE:
This list is not comprehensive – for further tips please refer to my book.
Set based operations
When possible, perform set based operations in SQL Server. For example, SQL Server can generally sort faster than the sort transform, especially if the table being sorted is indexed. Set based operations such as joins, unions, and selects with ORDER BY and GROUP BY tend to be faster on the server.
Be Mindful of Transforms with Internal File IO
Some of the stock dataflow transforms perform internal file Input/Output. For example, the Raw Source and Destination, Import/Export Column Transforms, Flat File Source and Destination and Excel File Source and Destination Adapters are all directly impacted by the performance of the file system. File IO isn't always a bottleneck, but when combined with low memory conditions, causing spooling or with other disk intense operations, it can significantly impact performance. Components that read and write to disk should be scrutinized carefully, and if possible, configured to read and write to dedicated hard drives. Look for ways to optimize the performance of the hard drives using RAID, defragmentation and/or correct partitioning.
Monitor Memory Intensive Transforms
If your package is memory bound, look for ways to eliminate the memory intensive transforms or shift them to another package. Some transforms such as the Aggregate, Lookup and Sort use a lot of memory. The Sort, for example, holds all buffers until the last buffer and then releases the sorted rows. If memory runs low, these transforms may spool to disk, causing expensive hard page faults.
Monitor Other Memory Intensive Applications
When running on the same machine as other memory intensive applications, the data flow can become memory starved, even if there is plenty of memory on the machine. This is typically true when running packages on the same machine with SQL Server. SQL Server is aggressive about using memory. You can use the sp_configure system stored procedure instruct SQL Server to limit its memory usage.
Pare Down the Lookup Reference Data
The default lookup query for the Lookup Transform is
SELECT * FROM …
Select the option to use the results of a query for the reference data. Generally, the reference data should only contain the key and the desired lookup column. So, for a dimension table lookup, that would be the natural key and the surrogate key.
Use Lookup Partial or Full Cache Mode
Depending on the requirements and the data, you should choose one of these two modes to speedup the Lookup. Partial cache mode is useful when the incoming data is repetitive and only references a small percentage of the total reference table. Full cache mode is useful when the reference table is relatively small and the incoming data references the full spectrum of reference table rows.
Lookup, Aggregate and Sort Transforms
While performant for what they do, and important when absolutely necessary, these transforms invariably degrade data flow performance. If possible, eliminate them from your data flows. Sort and aggregate data at the source, and use the MergeJoin Transform instead of the Lookup, if possible.
Increase DefaultBufferMaxSize and DefaultBufferMaxRows
Increasing the values for these two properties can boost performance by decreasing the number of buffers moving through the data flow. However, you should avoid increasing the values too much to the point where the Execution Engine starts swapping out buffers to disk. That would defeat the purpose.
Use match indexes for repeat data cleansing sessions
When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleaning. If you choose to use an existing index, the index is created the first time that the package runs. If multiple Fuzzy Lookup transformations use the same reference table, they can all use the same index. To reuse the index, the lookup operations must be identical; the lookup must use the same columns. You can name the index and select the connection to the SQL Server database that saves the index.
Implement Parallel Execution
Both the Execution Engine for the Dataflow Task and the Execution Engine for the Control Flow are multithreaded.
NOTE:
The Execution Engine is at the heart of the Dataflow execution time behavior. It creates threads, plans executions, calls the various methods on components at the appropriate time and on the appropriate threads, logs output and handles errors. Please refer to Kirk's book for more details on this.
In particular:
- Use EngineThreads Property – controls the number of worker threads the Execution Engine will use. The default for this property is 5. However, as you now know, by simply adding a few components, data flow thread requirements will quickly exceed the default. Be aware of how many threads the data flow naturally needs and try to keep the EngineThreads value reasonably close to it.
- Set MaxConcurrentExecutables – if there are multiple Dataflow Tasks in the Control Flow, say 10 and MaxConcurrentExecutables is set to 4, only four of the Dataflow Tasks will execute simultaneously. Set, test and measure various value combinations of this property and the EngineThreads property to determine the optimal setting for your packages.
Four ways to Increase Insert Performance
Often the greatest data flow performance inhibitor is the sustained write speed of the destination hard drives. But there are other elements that impact destinations as well. Depending on where you are inserting data, there are some things you can do to speed up the inserts as follows (further options are listed in the book).
Use the SQL Server Destination Adapter
If running on the same machine as SQL Server, use the SQL Server Destination component instead of OLEDB Destination. Tests show a marked performance gain with the SQL Server Destination over the OLEDB Destination. This is almost always a slam dunk performance improvement.
Set the Commit Size
The Commit Size option allows you to set a larger buffer commit size for loading into SQL Server. This setting is only available in the OLEDB Destination when using the SQL Server OLEDB driver. A setting of zero indicates that the adapter should attempt to commit all rows in a single batch.
Turn on Table Lock
This option is available in the OLEDB Destination Editor. Selecting "Table Lock" also enables fast load, which tells the adapter to use the IRowsetFastload bulk insert interface for loading.
NOTE:
Fastload delivers much better performance, however it does not provide as much information if there is an error. Generally, for development you should turn it off and then turn it on when deploying to production.
Disable Constraints
This option is also available in the OLEDB Destination Editor by unchecking the "Check constraints" option.
An Approach to Data Flow Optimization
The preceding section presents a pretty long list of methods to improve the performance of your data flows and will address a large percentage of the data flow performance problems you'll run into. However, if after applying all those suggestions, you still need to squeeze some performance from your data flow, you may need to step back and take a more holistic approach to analyzing your packages. This section presents a disciplined and somewhat theoretical approach you can use to evaluate and incrementally optimize your data flows.
The Theory of Constraints
One of the more impressive books I read in college was "The Goal" by Eliyahu M. Goldratt. For me, it was an epiphany. In "The Goal", Dr. Goldratt articulates the kinds of problems businesses face when continually improving their processes and trying to get the most from their resources.
Dr. Goldratt codified his process of improvement into five steps and these steps can be applied to the optimization of data flow. I don't have room in this article to fully describe each step – you can find that in my book – but the following should give you a feel for the process.
Identify and prioritize the constraints
Identify elements of a system that limit the overall system's performance. In this step, you should focus on educating yourself about your computing environment. What are your resources and in what ways are those resources constrained?
The most common high priority constraints are, in no particular order:
- Amount of Memory Installed - Perhaps the most important factor for processing complex data.
- CPU speed - The raw processing power.
- Number of CPUs - More CPUs widen the processing pipe.
- Disk IO speed - Using slow IDE or optimized RAID drives? Important for big data throughput.
- Network Speed - Impacts cross network loads, extracts and queries.
The priorities will naturally fall out of the identification process. The most drastic decreases in performance will be caused by the highest priority constraints. To witness the effect of constraints, you need a baseline against which you can measure decreases in performance.
To understand where a given package spends its processing time, it's necessary to decompose the package into its processing time constituent parts. To do that, you must decompose the package and get a baseline measurement of its core processing time, then measure each incremental processing time increase as more components and other package parts are added to the core package. It is very important to isolate what you are measuring so that when you measure deltas, you know what causes them.
Start by decomposing the package and eliminating unmeasured constraints. For example, shutdown other applications that may consume memory or impact other resources, disconnect from networks and, if the package connects to SQL Server, replace the destination or source with a rowcount, trash, or data generation transform (if possible).
Next, take your baseline measurements. For example:
· Source->Multicast
· Source->Destination
· Source->Transforms->Multicast
Note the differences in performance between the different baselines. Once you've done that, you'll already have a good idea where some of your bottlenecks are. For example, when you replace the Multicast with a real destination, the execution time may increase, even drastically. But, when you add in transforms, you may see very little performance degradation. This is what is meant by witnessing or measuring the impact of changes.
You can then start incrementally adding previously removed elements to measure the impact each has on the overall system.
Measuring with Performance Counters
Performance counters are useful for measuring internal or otherwise hidden processes. Integration Services provides the following useful performance counters.
- Buffers in use
- Flat buffers in use
- Private buffers in use
- Rows read
- Rows written
- Buffers spooled
The most useful of these for performance measurements is Buffers Spooled. Values greater than 0 indicate that the Execution Engine has swapped buffers to disk. You should always try to avoid swapping out buffers to disk, which is usually an indication that you don't have enough memory or that you need to rearrange your data flow.
Rows Read and Rows Written help gauge the overall progress of the data flow and can be useful as sanity checks. For example, as you increase the number of rows a data flow processes, you would expect a commensurate increase in processing time. These measurements help you quickly determine if the increased processing times are warranted.
Decide how to exploit the constraint
This step is about making sure the bottleneck is never overutilized. Applying this to data flow, always try to match the data load at a transform's input with its throughput capacity. The Dataflow Task will already attempt to do this using back pressure, but that only temporarily masks the problem. It doesn't improve throughput, it just slows down the data flow. This is a tough one to get right because throughput and processing loads at an input aren't constant, so you have to rely on the Dataflow Task for the most part.
Subordinate and synchronize everything else to the above decisions.
Once you've identified the bottleneck, and you've fully exploited it as described in the last step, you should apply all the other non-constrained resources to breaking the bottleneck. Try to get any under-utilized resources to take on more of the work that the bottleneck is performing. In other words, offload work from your overutilized bottleneck resources and load it onto the laggard under-utilized resources. You can do this through techniques such as the following:
- Filtering the data before it flows into the bottleneck transform, for example, pre-building a Lookup Transform dimension reference table from the new facts so that only those dimensions that are required end up in the reference table.
- Ignoring errors and using a conditional split to filter them, for example, when doing upserts, testing shows that it is generally faster to ignore errors from the Lookup Transform and conditionally split the not found rows into an insert, than to try and handle the not found rows with an error output.
- Preprocessing the data the constrained transform will process, for example, placing a Lookup Transform before the Slowly Changing Dimension transform to filter out non-existent rows.
- Postprocessing the data in lieu of a transformation, for example, the OLEDB Command Transform performs row by row expensive operations which you can optimize by eliminating the transform and flowing the data to a table. Later, use SQL Server to perform the set based operations instead.
- Clone and parallelize the dataflow around the bottleneck. For example, if the bottleneck is a transform, segment the rows with a conditional split into clones of the bottleneck, then union all the rows back into the next transform.
Elevate the Bottlenecks
As you continue to increase the performance of the bottleneck by offloading work from it etc., and placing more of the load on other less utilized resources, the system will become more balanced and the bottleneck will eventually be broken.
If in any of the above steps the constraint has shifted, go back to Step 1.
When you take these steps, what you'll find is that you're in an iterative improvement loop. Once you eliminate one bottleneck, you'll find others. So, you start the process again until you reach your performance goal.
Summary
The Dataflow Task is pretty fast by default, but as the data flow grows more complex and the size of data loads increase, the opportunities for tuning the data flow also increase. The guidelines and methods for tuning the data flow described here address a large percentage of the performance issues you'll run into. However, if you still want to get the last drop of performance out of the Dataflow Task, use the Theory of Constraints as your guide to find and eliminate bottlenecks.
No comments:
Post a Comment