Thursday, October 20, 2011

A Fast Way to Update Large Amount of Data in SQL Server With Select Into Statement



I frequently need to update large amount of data, like tens of millions of rows.
Usually the update request comes as that there is a table (let’s call it new_table) with a record id column (record_id) and a few other columns, say date_field1, data_field2 and data_field3 that need to be updated into another table (old_table). The old_table also contains the record_id column and the data field columns. It also contains a lot of other fields. When you look at such request a straight forward solution is to issue an update statement to update the data fields into the old_table like this.
update o set o.data_field1=n.data_field1, o.data_field2=n.data_field2, o.data_field3=n.data_field from old_table o join new_table n on o.record_id=n.record_id
If you do this you will probably find your update query run for 24 hours and still can not finish. Even when you have indexes correctly built on both the new and old tables. I think a much faster approach is to create another table based on the old table and incorporate the updates into the new table while the table is built. To do that you need the help of a select … into query like this
select o.recordid,
case when n.data_field1 is not null then n.data_field1 else o.data_field1 end as data_field1,
case when n.data_field2 is not null then n.data_field2 else o.data_field2 end as data_field2,
case when n.data_field3 is not null then n.data_field3 else o.data_field3 end as data_field3,
o.other_fileds.....
into table_with_updates
from old_table o left join new_table n on o.record_id=n.record_id
A few notes about this query. First the old table is left join to new_table so that all records in the old table are in the result of the select statement. Second the three case statements are to use the data field values in the new table if it has one and use the value in the old table only when new table doesn’t have it. In effect this is an update of the data fields. Third “into table_with_updates” instructs SQL server to create a table called table_with_updates and dump the result set of the select statement into the table.
After the table with updates is built, you can drop the old table and rename the table with updates to the old table. If both the new table and old table have clustered indexes built on the record_id columns, the select…into query usually can finish within a couple hours. That is more than 10 times faster!