Creating a truly unique value from several others in T-SQL
Update 2015 August 28: I've replaced the function master.sys.fn_varbintohexstr with CONVERT, with the extra parameter 2, which translates a binary field into a hexadecimal string with no leading 0x. In addition to being ugly to use, fn_varbintohexstr is very slow.
Sometimes you need to create a unique identifier for a bunch of values so that you use it as an ID in the database. The immediately obvious choice is the CHECKSUM and BINARYCHECKSUM functions. But beware, the purpose of these functions is to detect changes in a string, not to uniquely identify it. It might seem strange, but the two concepts are very different. The change modification functionality is only meant to generate very different values on small changes. The uniqueness is trying to create a value as distinctive as possible for any string. That is why when you use a checksum you will get a lot of similar values for (very) different strings.
Enter HASHBYTES, another function that has the purpose of creating a cryptographic hash for a string. It is mainly used for password hashing, but it will fit nicely for our purpose. There are some caveats, though. First, CHECKSUM gets a variable number of parameters, HASHBYTES only accepts one, so we must take care of the cumbersome concatenation of multiple values. Unfortunately SQL functions do not have the option of variable parameters, which is truly a shame, so we can't hack it. Also, the value that HASHBYTES returns is a varbinary. We could cast it to NVARCHAR, but it turns into a weird Chinese characters string. In order to turn it into a proper string, we need to usethe same function used by SQL Server to display varbinary when selecting it: master.sys.fn_varbintohexstr the CONVERT function with a parameter of 2 (hex string without the leading 0x).
So let's compare the two usages. Suppose we have this nice table that contains company data: company name, contact first name, contact last name, phone, email, yearly value. We need to create a unique ID based on these values.
First CHECKSUM:
Second HASHBYTES:
However, it works.
WARNING: using CAST to NVARCHAR from a FLOAT loses precision. You should use STR instead!
A middle solution is to use XCHECKSUM. What is that, you ask? A placeholder that can be replaced with some regular expression search and replace, of course :)
Update: I've created a query that creates the script to update the value of a column called 'ValuesHash', for tables that have it, with the hash of all columns that are not in a list of names, they are not primary keys and they are not foreign keys, plus they are not computed, rowguidcol or filestream.
Imagine the scenario where you have something like this:
WARNING: each column data is always truncated to 4000 characters, then the corresponding string is also truncated to 4000 bytes before running HASHBYTES (which only accepts a maximum of 8000 bytes). This hash will help in determining unique records, but it is not 100%.
Change it to suit your needs. It is by no means perfect, but it's a start for whatever you need.
Update:
A new FORMAT function was introduced in SQL Server 2012, working somewhat similar to the .NET ToString method. Using that function is slightly more precise:
Sometimes you need to create a unique identifier for a bunch of values so that you use it as an ID in the database. The immediately obvious choice is the CHECKSUM and BINARYCHECKSUM functions. But beware, the purpose of these functions is to detect changes in a string, not to uniquely identify it. It might seem strange, but the two concepts are very different. The change modification functionality is only meant to generate very different values on small changes. The uniqueness is trying to create a value as distinctive as possible for any string. That is why when you use a checksum you will get a lot of similar values for (very) different strings.
Enter HASHBYTES, another function that has the purpose of creating a cryptographic hash for a string. It is mainly used for password hashing, but it will fit nicely for our purpose. There are some caveats, though. First, CHECKSUM gets a variable number of parameters, HASHBYTES only accepts one, so we must take care of the cumbersome concatenation of multiple values. Unfortunately SQL functions do not have the option of variable parameters, which is truly a shame, so we can't hack it. Also, the value that HASHBYTES returns is a varbinary. We could cast it to NVARCHAR, but it turns into a weird Chinese characters string. In order to turn it into a proper string, we need to use
So let's compare the two usages. Suppose we have this nice table that contains company data: company name, contact first name, contact last name, phone, email, yearly value. We need to create a unique ID based on these values.
First CHECKSUM:
SELECT CHECKSUM(companyName, firstName, lastName, phone, email, yearlyValue) FROM OurTableSo easy! Just add the columns, no matter how many or what type they have, and get a value as a result. You can even use * to select all columns in a row. You also have the advantage of getting the same checksum for differently capitalized strings. If you don't want this behaviour, use BINARYCHECSUM, which works even better.
Second HASHBYTES:
SELECT CONVERT(VARCHAR(Max),HASHBYTES('SHA1',companyName+'|'+firstName+'|'+lastName+'|'+phone+'|'+email+'|'+CAST(yearlyValue as NVARCHAR(100))),2) as id,*Ugly! You need to create a string from different types, using ugly casts. Also, this works more like BINARYCHECKSUM. If you want to get the same functionality as CHECKSUM you need to use LOWER(LTRIM(RTRIM(value))). Horrid!
FROM OurTable
However, it works.
WARNING: using CAST to NVARCHAR from a FLOAT loses precision. You should use STR instead!
A middle solution is to use XCHECKSUM. What is that, you ask? A placeholder that can be replaced with some regular expression search and replace, of course :)
Update: I've created a query that creates the script to update the value of a column called 'ValuesHash', for tables that have it, with the hash of all columns that are not in a list of names, they are not primary keys and they are not foreign keys, plus they are not computed, rowguidcol or filestream.
Imagine the scenario where you have something like this:
- Table A:
- Id: primary identity key
- Data1: some data
- Data2: some data
- CreateTime: the creation time
- ValuesHash: a VARBINARY(50) column - only 20 are required normally, but let's make sure :)
- Table B:
- Id: primary identity key
- AId: foreign key to A
- Data1: some data
- Data2: some data
- ModifyTime: the modification time
- ValuesHash: a VARBINARY(50) column - only 20 are required normally, but let's make sure :)
- Table C:
- Id: primary identity key
- AId: foreign key to A
- Data1: some data
- Data2: some data
WARNING: each column data is always truncated to 4000 characters, then the corresponding string is also truncated to 4000 bytes before running HASHBYTES (which only accepts a maximum of 8000 bytes). This hash will help in determining unique records, but it is not 100%.
SELECT *
FROM (
SELECT t.name,
'UPDATE [' + t.name+ '] SET ValuesHash = HASHBYTES(''SHA1'',SUBSTRING('
+ Stuff(
(SELECT '+ ''|''+ ISNULL('+CASE
WHEN tp.name IN ('float', 'real') THEN 'STR('+c.name+',30,30)'
WHEN tp.name IN ('binary', 'varbinary') THEN 'CONVERT(NVARCHAR(4000),'+c.name+',2)'
ELSE 'CONVERT(NVARCHAR(4000),'+c.name+')' END+','''')'
FROM sys.all_columns c
INNER JOIN sys.types tp
ON c.system_type_id=tp.system_type_id
AND c.user_type_id=tp.user_type_id
LEFT JOIN sys.index_columns ic
ON ic.object_id=t.object_id
AND ic.column_id=c.column_id
LEFT JOIN sys.indexes i
ON ic.object_id=i.object_id
AND ic.index_id=i.index_id
LEFT JOIN sys.foreign_key_columns fc
ON fc.parent_object_id=t.object_id
AND c.column_id=fc.parent_column_id
WHERE t.object_id=c.object_id
AND ISNULL(c.is_identity, 0)=0
AND ISNULL(c.is_computed, 0)=0
AND ISNULL(c.is_filestream, 0)=0
AND ISNULL(c.is_rowguidcol, 0)=0
AND ISNULL(i.is_primary_key, 0)=0
AND fc.parent_column_id IS NULL
AND c.name NOT IN ('Id', 'CreateTime' , 'AcquireTime' , 'IntermediateCreateTime', 'IntermediateModifyTime', 'IntermediateDeleteTime', 'ValuesHash')
ORDER BY Sign(c.max_length) DESC, c.max_length, Lower(c.name)
FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
, 1, 7, '')
+ ',0,4000)) WHERE ValuesHash IS NULL' AS computed
FROM sys.tables t
INNER JOIN sys.all_columns c
ON t.object_id = c.object_id
WHERE c.name = 'ValuesHash') x
WHERE computed IS NOT NULL
ORDER BY name
Change it to suit your needs. It is by no means perfect, but it's a start for whatever you need.
Update:
A new FORMAT function was introduced in SQL Server 2012, working somewhat similar to the .NET ToString method. Using that function is slightly more precise:
SELECT *
FROM (
SELECT t.name,
'UPDATE [' + t.name+ '] SET ValuesHash = HASHBYTES(''SHA1'',SUBSTRING('
+ Stuff(
(SELECT '+ ''|''+ ISNULL('+CASE
WHEN tp.name IN ('float', 'real') THEN 'FORMAT('+c.name+',''R'')'
WHEN tp.name IN ('decimal') THEN 'FORMAT('+c.name+',''G'')'
WHEN tp.name IN ('datetime','datetime2') THEN 'FORMAT('+c.name+',''O'')'
WHEN tp.name IN ('binary', 'varbinary') THEN 'CONVERT(NVARCHAR(4000),'+c.name+',2)'
ELSE 'CONVERT(NVARCHAR(4000),'+c.name+')' END+','''')'
FROM sys.all_columns c
INNER JOIN sys.types tp
ON c.system_type_id=tp.system_type_id
AND c.user_type_id=tp.user_type_id
LEFT JOIN sys.index_columns ic
ON ic.object_id=t.object_id
AND ic.column_id=c.column_id
LEFT JOIN sys.indexes i
ON ic.object_id=i.object_id
AND ic.index_id=i.index_id
LEFT JOIN sys.foreign_key_columns fc
ON fc.parent_object_id=t.object_id
AND c.column_id=fc.parent_column_id
WHERE t.object_id=c.object_id
AND ISNULL(c.is_identity, 0)=0
AND ISNULL(c.is_computed, 0)=0
AND ISNULL(c.is_filestream, 0)=0
AND ISNULL(c.is_rowguidcol, 0)=0
AND ISNULL(i.is_primary_key, 0)=0
AND fc.parent_column_id IS NULL
AND c.name NOT IN ('Id', 'CreateTime' , 'AcquireTime' , 'IntermediateCreateTime', 'IntermediateModifyTime', 'IntermediateDeleteTime', 'ValuesHash')
ORDER BY Sign(c.max_length) DESC, c.max_length, Lower(c.name)
FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
, 1, 7, '')
+ ',0,4000)) WHERE ValuesHash IS NULL' AS computed
FROM sys.tables t
INNER JOIN sys.all_columns c
ON t.object_id = c.object_id
WHERE c.name = 'ValuesHash'
) x
WHERE computed IS NOT NULL
ORDER BY name
0 comments:
Post a Comment