Monday, January 25, 2010

Common Table Expression (CTE) and capturing Identities on set-based inserts

Last week there was a nice article on www.sqlservercentral.com, with a decent short example of some newer SQL Server 2005 coding features.

I like the author's SQL example because it shows a few different things in a short script that you can run on any SQL Server 2005+ instance.

He uses a CTE (Common Table Expression) named "mycte", showing a nice way to build a small test table. The main point of his article, though, is the OUTPUT clause in the INSERT statement, which allows you to capture data as it is inserted, similar to how the INSERTED and DELETED tables work inside triggers. So you can grab sets of row values, including any IDENTITY values. The article follows; good stuff...
___________________________________________________________________

http://www.sqlservercentral.com/articles/IDENTITY/69109/

Retrieving Identities on set-based inserts


By Steve McRoberts, 2010/01/20


A common situation where one is tempted to use cursors is this: You have a table with an identity column. You need to insert multiple rows into this table, then use the identities for further processing, such as inserting rows with those values into a child table.


The cursor method involves inserting the rows one at a time so you can access the @@identity value for use in inserting to the child table. This works as long as there are no triggers on the table that insert rows into other tables which themselves contain identity columns. In that case the @@identity value will be useless. I know of a system that does exactly that, as it logs inserts to a log table that has an identity column on it. Attempting to work around this issue by selecting the max(identity-column) is only a reasonable idea if you are the sole process inserting to the table (and there is no likelihood of that fact ever changing).
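For context, here is a minimal sketch of that row-by-row pattern (the dbo.SourceRows, dbo.Parent, and dbo.Child tables are hypothetical stand-ins, not from the article). Note that SCOPE_IDENTITY() sidesteps the trigger problem described above, since it ignores identity values generated inside triggers:

DECLARE @myCol INT, @newParentID INT

DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT myCol FROM dbo.SourceRows    -- hypothetical source table
OPEN c
FETCH NEXT FROM c INTO @myCol
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO dbo.Parent (myCol) VALUES (@myCol)
    -- SCOPE_IDENTITY() ignores identities generated by triggers, unlike @@identity
    SET @newParentID = SCOPE_IDENTITY()
    INSERT INTO dbo.Child (parentID, detail)
    VALUES (@newParentID, 'child of ' + CAST(@myCol AS VARCHAR(10)))
    FETCH NEXT FROM c INTO @myCol
END
CLOSE c
DEALLOCATE c

One insert per row means one round of logging and trigger firing per row, which is exactly the overhead the set-based approach below avoids.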


An attempt at a set-based approach is to do the insert with the TABLOCKX hint and then subtract @@rowcount-1 from @@identity to get the range of identities that were inserted. However, I'm leery of this approach, since hints are not commands: there's no guarantee that some other process didn't insert a row during our insert, skewing our range. It is also subject to the same potential trigger issue noted above. And, of course, locking the entire table may not be desirable.
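In code, that fragile pattern looks roughly like this (shown only to illustrate the arithmetic; the caveats above still apply, and dbo.Parent/dbo.SourceRows remain hypothetical):

DECLARE @firstID INT, @lastID INT

INSERT INTO dbo.Parent WITH (TABLOCKX) (myCol)
SELECT myCol FROM dbo.SourceRows
-- @@rowcount still refers to the INSERT at this point, so the range can be inferred
SELECT @lastID = @@IDENTITY, @firstID = @@IDENTITY - @@ROWCOUNT + 1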


In SQL Server 2005 we have the perfect solution. It will work regardless of triggers, and it will eliminate the need to do the inserts one at a time. It is the new OUTPUT clause, available on the INSERT, UPDATE and DELETE statements.


The OUTPUT clause lets you output data from the INSERTED or DELETED virtual tables, which were formerly accessible only from within a trigger. In our case, we simply need to preserve the identity column (in a real-life example you would preserve any other column data you needed).

DECLARE @parent TABLE (myID INT IDENTITY, myCol INT)

DECLARE @newID TABLE (new_ID INT)

-- Generate 100 rows of test data (values 100 through 199)
;WITH mycte AS
(
    SELECT 100 AS seedValue
    UNION ALL
    SELECT seedValue + 1 FROM mycte WHERE seedValue + 1 < 200
)
INSERT INTO @parent (myCol)
OUTPUT INSERTED.myID INTO @newID
SELECT * FROM mycte

SELECT * FROM @parent
SELECT * FROM @newID
 
The important part of this very simple example is:

OUTPUT INSERTED.myID INTO @newID

In the end, the @newID table variable will contain 100 new IDs, guaranteed to be from the rows we just inserted into the @parent table variable.
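To close the loop on the parent/child scenario from the opening paragraph, the captured identities can then drive a set-based insert into a child table. The @child table below is my own illustration, not part of the original article:

DECLARE @child TABLE (childID INT IDENTITY, parentID INT, note VARCHAR(50))

-- One set-based insert keys the child rows off the captured identities
INSERT INTO @child (parentID, note)
SELECT new_ID, 'detail row for parent ' + CAST(new_ID AS VARCHAR(10))
FROM @newID

SELECT * FROM @child

No cursor, no per-row round trips, and no dependence on @@identity.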



Friday, July 31, 2009

DTS Data Pump Task - Methods and Performance

by Bruce de Freitas – July 21, 2009

As I was maintaining some old DTS packages (that run on SQL Server 2005), I noticed a pattern in them. Most of them have numerous Data Pump tasks between the exact same SOURCE and TARGET databases, but the connection objects were repeated, as in Example #1 below, with no workflow dependencies. I'm used to seeing this done more like Example #2, where you define unique connection objects just ONE time and link multiple Data Pump tasks between them when the order of task execution does not matter. But I wasn't sure about the difference between Example #1 and Example #2...

I set up this small test moving 3 large tables via 3 Data Pump tasks in 5 DTS packages to see the performance differences. All 5 packages had the "max number of parallel tasks" property set to 5, so that setting had no bearing on the results, as there were only 3 tasks per package anyway.

Bottom line is that only Example #3 actually multi-tasked (in 6m:35s is wrong; it ran in 6m:53s); in all the other cases the 3 Data Pump tasks single-threaded through (in roughly 9m:35s). Examples #1 and #2 execute virtually the same, performance-wise. If the real DTS packages (not these examples) were written with the intent that tasks would run concurrently, they aren't; they're single-threading. I didn't do tests under SSIS, but I'd think it works the same with connection objects and pooling. Moral of the story: if you want to run concurrent Data Pump tasks against the same SOURCE/TARGET connections, think about creating separate connection objects as in Example #3 below (for DTS). Some other time I'll try this under SSIS, unless someone else has tried it with SSIS and wants to share what they have found.
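If you want to verify the concurrency from the SQL Server side rather than by watching the DTS designer, one option (my addition, not part of the original test) is to poll the active sessions against the target database while the package runs; three concurrent Data Pump connections should show up as three separate spids:

-- 'TargetDB' is a placeholder for your target database name
SELECT spid, status, program_name, cmd, last_batch
FROM master.dbo.sysprocesses
WHERE dbid = DB_ID('TargetDB')
ORDER BY last_batch DESC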


Example #1 (575 secs) - This is the style I am seeing in the existing DTS packages: a single SOURCE connection object and a single TARGET connection object, re-used multiple times in the package. This method single-threads through each transform (one at a time).

Example #2 (575 secs) - This method uses one unique SOURCE and one unique TARGET connection object, with 3 data transforms between them. It executes exactly as Example #1 does, single-threading through each transform.

Example #3 (413 secs) - This example runs the fastest because I created separate connection objects on both the SOURCE and TARGET databases. Only when run this way could I visually see all 3 tasks ticking along concurrently; in all the other DTS examples the tasks would pause and run one at a time, which makes sense, as there is only one connection to pipe through in those cases. See the SQL Profiler trace and the DTS log summary at the bottom of this article. Note in the trace that, for Example #3, the "SELECT *" statements generated by the Data Pump are not waiting for each task to complete; they are all running concurrently!

These last 2 examples were to see how it worked with EITHER the SOURCE or the TARGET connection objects unique, but not both, as in Example #3.

Example #4 (568 secs) - Used unique SOURCE connection objects but shared the same TARGET connection object.

Example #5 (571 secs) - Used unique TARGET connection objects but shared the same SOURCE connection object.

DTS Log Output from all 5 Example Packages

[Log screenshots not reproduced; total durations were Example #1 = 575 secs, #2 = 575 secs, #3 = 413 secs, #4 = 568 secs, #5 = 571 secs.]

SQL Profiler Output from all 5 Example Packages

[Trace screenshot not reproduced; it showed the Data Pump "SELECT *" statements running concurrently for Example #3 and serially for the other packages.]