Test: confronting various methods of bulk loading data from a table to another, what's the fastest?

Table of Contents

Most of the support request I get involved with can be summarized with the following keywords: “slow” + “datawarehouse” + “ETL” + “Save us” What about thinking about ETL performance before it goes bad? Before the system is in production for some time , the data size reaches a decent size which wasn’t tested in DEV and you find out that the system is not able to scale at all?

Here are the steps to success:

  1. Don’t think about the data that you have right now, think about how many data you’ll have way after the Go Live
  2. TEST with the biggest dataset possibile.

On the subject, I’m saving you some job by testing how the performance changes when bulk importing data into a table in various conditions, let’s get to the data.

I’m testing this on my local machine, nothing fancy, SQL Server 2017 CU3, only 16GB of ram, 4 cores with HT and a very fast NVME drive for both Data files and Logs, I wanted to test the following conditions:

  • Table: Heap vs Clustered Index vs Clustered Columnstore Index
  • Recovery Mode: FULL vs BULK_LOGGED (which for bulk operations is basically the same as SIMPLE)
  • Hints: No Hints vs TABLOCK (My favorite hint)

In each run I monitored, other than the time to complete the operation, the log growth from a fixed dimension and the wait stats occurred during the insert; Each run was repeated twice, once with the log file starting from the standard 10GB size, and another one repeating the test with the size the Log had grown into to verify how much log growing affected performances. As you may have noticed, an interactive PowerBI report to analyze the data I’ve gathered (don’t forget to drill down, neat!) and at the end of the post you’ll find a summary table with all the tests.

As source data, I used the lineitem table in the TPC-H database created by HammerDB that I already had on my Instance; You can install it for free and it’s a nice tool to literally hammer your instance with a simulated workload of either TPC-C or TPC-H scenarios. The table has 60 million rows , weights about 2.3GB and it’s a Clustered Columnstore, so reading performance are not a bottleneck. This is the definition of the table I’m going to insert into:

CREATE TABLE [dbo].[lineitem](
  [identity] [bigint] IDENTITY(1,1) NOT NULL,
  [l_shipdate] [date] NULL,
  [l_orderkey] [bigint] NOT NULL,
  [l_discount] [money] NOT NULL,
  [l_extendedprice] [money] NOT NULL,
  [l_suppkey] [int] NOT NULL,
  [l_quantity] [bigint] NOT NULL,
  [l_returnflag] [char](1) NULL,
  [l_partkey] [bigint] NOT NULL,
  [l_linestatus] [char](1) NULL,
  [l_tax] [money] NOT NULL,
  [l_commitdate] [date] NULL,
  [l_receiptdate] [date] NULL,
  [l_shipmode] [char](10) NULL,
  [l_linenumber] [bigint] NOT NULL,
  [l_shipinstruct] [char](25) NULL,
  [l_comment] [varchar](44) NULL,
 -- CONSTRAINT [PK_lineitem] PRIMARY KEY CLUSTERED 
 --(
 -- [identity] ASC
 --) ON [PRIMARY]
) ON [PRIMARY]

This is the query that I’m using to test the inserts:

DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);
GO

DECLARE @t1 datetime2, @t2 datetime2;

SET @t1= SYSDATETIME();

TRUNCATE TABLE dbo.lineitem;

INSERT INTO dbo.lineitem --WITH (TABLOCK) 
       ([l_shipdate],[l_orderkey],[l_discount],[l_extendedprice],
       [l_suppkey],[l_quantity],[l_returnflag],[l_partkey],
       [l_linestatus],[l_tax],[l_commitdate],[l_receiptdate],
       [l_shipmode],[l_linenumber],[l_shipinstruct],[l_comment])
SELECT
       [l_shipdate],[l_orderkey],[l_discount],[l_extendedprice],
       [l_suppkey],[l_quantity],[l_returnflag],[l_partkey],
       [l_linestatus],[l_tax],[l_commitdate],[l_receiptdate],
       [l_shipmode],[l_linenumber],[l_shipinstruct],[l_comment]
FROM   [tpch].[dbo].[lineitem];

--Insert Index Creation Statement here

SET @t2 = SYSDATETIME()

SELECT DATEDIFF(Millisecond,@t1,@t2) as Duration

I’m clearing the Waitstats right before starting the insert query to be able to get a snapshot of what the query was waiting on during the execution.

Run #1: Clustered Index, FULL Recovery model, No Hint, 10GB Log

Time: 239085ms Log Size: 10GB -> 23GB

This should be the slowest run, that we’re taking as a baseline. The log has grown by 13GB over the preallocated size and it shows in the waitstats too  as the waits related to the Log write and Log File file growth

Run #2: Clustered Index, FULL Recovery model, No Hint, 23GB Log

Time: 219969ms Log Size: 23GB

By having the log already grown to size, the PREEMPTIVE_OS_WRITEFILEGATHER wait is gone, since Windows didn’t have to grow the Log file on disk, it was  10 seconds faster, not much.

Run #3: Clustered Index, FULL Recovery model, TABLOCK, 10GB Log

Time: 239574ms Log Size 23GB

Regardless of the TABLOCK index, nothing really changed here.

Run #4: Clustered Index, FULL Recovery model, TABLOCK, 23GB Log

Same as #2

Run #5: Clustered Index, BULK_LOGGED Recovery model, No Hint, 10GB Log

Time: 170744ms Log: 10GB -> 32GB

By switching to the BULK_LOGGED recovery model we finally see some noticeable improvement in time, the WRITELOG wait stat has gone way down in time but the PREEMPTIVE_OS_WRITEFILEGATHER is still present, as the log file had to grow and for some reason it has grown even more than in the FULL recovery model, while one would expect to grow the same or less; One possible explanation that I have is that, while the operation is running, the Log in the BULK_LOGGED  contains both log records and the data pages that were changed by bulk operations (bol )

Run #6: Clustered Index, BULK_LOGGED Recovery model, No Hint, 32GB Log

 

Time: 140099ms Log: 32GB

With the Log file already grown, we only have a few relevant PAGELATCH waits and the WRITELOG has gone way down, indicating that we’re not waiting on the log anymore, we saved almost 30 second!

Run #7: Clustered Index, BULK_LOGGED Recovery model, TABLOCK Hint, 10GB Log

Time: 152851ms Log: 10GB -> 32GB

As expected, we’re still waiting on the Log file to grow, but the TABLOCK hint made this query a little bit faster, so we’re still better off than run #5

Run #8: Clustered Index, BULK_LOGGED Recovery model, TABLOCK Hint, 32GB Log

Time: 136668ms Log: 32GB

The fastest run yet, as expected not having to wait on the Log File to grow and with a TABLOCK hint we’re not waiting on anything.

Run #9: Heap, FULL Recovery model, no Hint, 10GB Log

Time: 304027ms Log: 10GB -> 28GB

What? Wasn’t inserting into a Heap always faster? Not in Full recovery model, a small Log and no Hints, it seems. Slowest run ever, the WRITELOG wait is killing the run.

Run #10: Heap, FULL Recovery model, no Hint, 28GB Log

Time: 270207ms Log: 28GB

The Log file didn’t have to grow, but we aren’t really in a better position.

Run #11: Heap, FULL Recovery model, TABLOCK Hint, 10GB Log

Time: 97065ms Log: 10GB

Finally we’re seeing good results! Even with the Full recovery model the insert with the TABLOCK hint more than halved the insert time, the Log file didn’t grow at all

Run #12: Heap, FULL Recovery model, TABLOCK Hint, 10GB Log

Same as #11, just because, just to verify.

Run #13: Heap, BULK_LOGGED Recovey mode, no Hint, 10GB Log

Time: 291038ms Log: 10GB -> 29GB

Losing the hint makes us lose all the performance benefits, the situation is basically the same as using the FULL recovery model.

Run #14: Heap, BULK_LOGGED Recovey mode, no Hint, 29GB Log

Time: 276963ms Log: 29GB

As expected, a little bit faster since we didn’t have to grow the log file, but nothing else.

Run #15: Heap, BULK_LOGGED Recovey mode, TABLOCK Hint, 10GB Log

Time: 77588ms Log: 10GB

Here is where the real money is! Lowest time ever, no additional Log Growth, as fast as it can get. But what if I then need to query this table, so I need at least a Clustered Index? Let’s see how much time it takes to create it:

Run #16: Run #15 + Create the Clustered Index

Time: 135376 Log: 10GB

In this case, by first inserting the data into the Heap with a TABLOCK, we create the index right after inserting all the records. The total time of the operation is basically the same one of inserting the records directly in the Clustered index with TABLOCK and BULK_LOGGED recovery model, the only difference is that the log didn’t grow to 32GB but stayed at the original size.

Run #17: Clustered Columnstore, Full Recovery Model, No Hint, 10GB Log

Time: 245172ms Log: 10GB

Throwing Columnstore into the mix, we’re seeing that for this kind of table structure/Data doesn’t really performs any better than a classic Clustered Index, but in this case the log didn’t need to grow more than the prefixed 10GB size. I expected better performance from this test, probably I’ll get some additional benefits in a machine with more ram, even if I’m dealing only with 2.3GB of data in 16GB of RAM the columnstore may benefit from a little bit more memory.

Run #18 to #21: All the various combinations of HINT/Recovery Model

Since Columnstore doesn’t need the TABLOCK hint to run minimally logged, these tests offer no real difference than the last one and were made/included just for completeness. Log size never exceeds 10GB and the time is always around the same with some small variability related to the recovery model.

Run #22: Run #15 + Columnstore Creation

Time: 133218ms Log 10GB

Even if in these test Columnstore wasn’t ideal to straight insert into (again, probably a limitation of my machine), when creating the index in a second time we basically get the same results as #16 However, it has the most interesting wait stats yet, as it shows that the index creation was done in parallel and that indeed I need way more than 16GB of ram to demonstrate the Columstore glory.

Round Up:

So, what’s best? As always, it depends. If you have to get it done and just need the data in the table in the fastest way, then go ahead with a Heap and TABLOCK (I guess you’ll create your indexes Online right after) If the data needs to be ready in an indexed table, then you can either create the index before or after the insert, the time will be the same but the transaction log will grow to a bigger size while the operation is running in the first case, maybe keep that in mind if you’re in a log shipping scenario.

The definitive answer is: TEST, TEST, TEST.

Finally, here’s table with all the results neatly organized:

[table id=1 /]

comments powered by Disqus

Related Posts

How to get and decode callstacks in in extended events for recent versions of SQL Server

The other day I was reading about spinlocks and troubleshooting and it mentioned to get the callstack for a certain xevent and to decode it using the windows debugging symbols, however, that material was related to SQL Server 2012 and the steps weren’t basically replicabile for a modern version like SQL 2016/2017 (or even 2014 I think, but haven’tn checked personally), I searched around and didn’t really find something straight forward for just the process as it is now: So, here we are, small guide on how to get and decode callstacks in recent versions of SQL Server.

Read More

SQL Server 2017 CU5 enhances troubleshooting for Intra-Query Parallelism Deadlock

Do you remember this post from not long ago ? Apparently, other than still being an issue, with the latest SQL Server 2017 CU there are additional tools to see what’s going on and troubleshoot.

Read More

How to read data from Oracle Database via PowerShell without using ODBC or installing Oracle Client (and import it to SQL Server too)

If you read my previous article on how to configure a Linked server to Oracle , you know that I feel like someone is plotting to keep the topic of how to get our precious data outside of the Oracle ecosystem as obscure as possible out of the oracle circle Fear not!

Read More