SSIS Violation of PRIMARY Key Constraint Error with Column Generated by HASHBYTES SHA2-256 Function


I have come across a problem that is beyond my Googling abilities, and I would appreciate any help at all.

I've created a very simple SSIS package to copy data from a source table in an Azure SQL database to a target in an on-premises SQL Server database using an OLE DB connection. The target table was created from scripts generated from the source database (in SSMS, I right-clicked the database name -> Tasks -> Generate Scripts), so the two tables should be identical. The SSIS package does no transformations; it's a simple truncation of the target table, then a data flow task with an OLE DB Source connected to an OLE DB Destination.

However, I'm getting the following error on package execution:

Hresult: 0x80004005  Description: "Violation of PRIMARY KEY constraint 'PK_HashKey'. Cannot insert
duplicate key in object 'TargetTable'. The duplicate key value is (                          ).".

The primary key column in question has a data type of char(32) and is generated with the HASHBYTES function using the 'SHA2_256' algorithm. The error message shows the offending primary key value as what looks like a string of whitespace. When I look in the source table, there do appear to be multiple rows where the primary key is just whitespace. But Azure SQL must be able to distinguish between them somehow, because there are no primary key issues on the source table. It's only when I copy the data to the on-prem SQL Server database that I get a primary key violation.
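(A quick diagnostic sketch for anyone reproducing this; the table and column names are guesses based on the error message, not the real schema:)

-- Casting the char(32) key to varbinary exposes the raw bytes, so two keys
-- that both *render* as blanks can be seen to differ byte-for-byte.
SELECT TOP (10)
       HashKey,
       CAST(HashKey AS varbinary(32)) AS HashKeyBytes
FROM   SourceTable    -- hypothetical name; substitute the real source table
ORDER  BY HashKey;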

Things I've tried:

  1. Checked that the collation settings are the same on both the Source and Target table columns, database, and server.

  2. Checked the advanced editor for the SSIS dataflow Source and Destination to make sure the codepage is the same.

  3. Removed the primary key constraint on the target table, moved the data using SSIS, and then ran a query comparing the hashes transferred by SSIS against freshly generated hashes. For about 8% of the table, SQL Server did not consider the transferred hash equal to the new hash, even though the two generally looked alike. Here is the query:

SELECT CONVERT(CHAR(32), HASHBYTES('SHA2_256', BusinessKey)), Hashkey, BusinessKey 
FROM TargetTable 
WHERE CONVERT(CHAR(32), HASHBYTES('SHA2_256', BusinessKey)) <> HashKey

Based on #3, my guess is that one of the following is happening:

  1. SSIS is transforming the hash somehow when it copies data from Source to Target.

  2. The Target column/table/database has some setting that makes it store char(32) data differently than the Source column/table/database.

  3. Some sort of bug.

Anyone have any experience that could help shed light on this issue?

sql-server
ssis
azure-sql-database
sql-server-2016
hashbytes
asked on Stack Overflow May 22, 2020 by deew

1 Answer


This

CONVERT(CHAR(32), HASHBYTES('SHA2_256', BusinessKey))

is going to cause problems. It takes the 32-byte hash and stores it as 32 code points in some varchar collation. Many of those code points are non-printing characters, so simply inspecting the values will be a constant problem.

And I wouldn't trust SSIS or any other external program to round-trip or even display such values. Many programming environments use null-terminated strings, where the byte 0x00 marks the end of the string, and your hashes will contain the 0x00 code point about 12% of the time: P = 1 - (255/256)^32 ≈ 0.118.
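A quick sketch illustrating the point (the input value here is made up):

-- The CHAR(32) form of a SHA2_256 hash is 32 arbitrary bytes, most of them
-- non-printing and, roughly one time in eight, including a 0x00.
DECLARE @h binary(32) = HASHBYTES('SHA2_256', N'example');
SELECT @h                    AS RawHash,   -- unambiguous 32 bytes
       CONVERT(char(32), @h) AS CharForm;  -- the same bytes shown as "characters"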

Store the hashes as BINARY(32) or as a Base64-encoded string and your problems should go away.
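A minimal sketch of that fix; the table and column names are illustrative rather than the asker's actual schema, and the hex option is an extra printable-encoding alternative beyond the two above:

-- Option 1: store the raw hash as BINARY(32). Byte-for-byte fidelity, and no
-- collation or code page is ever involved.
CREATE TABLE dbo.TargetTableFixed
(
    BusinessKey nvarchar(100) NOT NULL,
    HashKey     binary(32)    NOT NULL
        CONSTRAINT PK_TargetTableFixed PRIMARY KEY
);

INSERT INTO dbo.TargetTableFixed (BusinessKey, HashKey)
SELECT BusinessKey, HASHBYTES('SHA2_256', BusinessKey)
FROM   dbo.SourceTable;

-- Option 2: if the key must be a string, use a printable encoding.
-- Hex: CONVERT style 2 yields a 64-character hex string with no '0x' prefix.
SELECT CONVERT(char(64), HASHBYTES('SHA2_256', BusinessKey), 2) AS HexHashKey
FROM   dbo.SourceTable;

-- Base64: 44 characters for 32 bytes, via XQuery.
DECLARE @h binary(32) = HASHBYTES('SHA2_256', N'example');
SELECT CAST(N'' AS xml).value('xs:base64Binary(sql:variable("@h"))', 'char(44)') AS Base64HashKey;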

Here's a little utility extracted from a sample I'm working on that might work for you. I tested it with a hash on a 200k row table stored in a char(32) with a unique constraint, and it worked fine.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Text;

namespace Microsoft.Samples.SqlServerDataMover
{
    public class Program
    {

        public static void Main(string[] args)
        {
            var srcConnectionString = args[0];
            var destConnectionString = args[1];

            using (var src = new SqlConnection(srcConnectionString))
            using (var dest = new SqlConnection(destConnectionString))
            {
                src.Open();
                dest.Open();

                var cmdTables = src.CreateCommand();
                cmdTables.CommandText = "select schema_name(schema_id) schema_name, name table_name from sys.tables where type_desc = 'USER_TABLE'";

                var tables = new List<(string, string)>();
                using (var rdr = cmdTables.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        var schema = rdr.GetString(0);
                        var table = rdr.GetString(1);
                        tables.Add((schema, table));
                    }
                }

                foreach (var t in tables)
                {
                    CopyTable(src, dest, t.Item2, t.Item1);
                }

            }
        }
        static string QuoteName(string identifier)
        {
            var sb = new StringBuilder(identifier.Length + 3, 1024);
            sb.Append('[');
            foreach (var c in identifier)
            {
                if (c == ']')
                    sb.Append(']');
                sb.Append(c);
            }
            sb.Append(']');
            return sb.ToString();

        }

        public static void CopyTable(SqlConnection src, SqlConnection dest, string tableName, string schemaName, int batchSize = 10000)
        {
            var opts = SqlBulkCopyOptions.KeepIdentity | SqlBulkCopyOptions.KeepNulls | SqlBulkCopyOptions.TableLock;

            var sql = $"select * from {QuoteName(schemaName)}.{QuoteName(tableName)}";
            var cmd = src.CreateCommand();
            cmd.CommandTimeout = 0;
            cmd.CommandText = sql;

            using (var rdr = cmd.ExecuteReader())
            using (var bc = new SqlBulkCopy(dest, opts, null))
            {
                var schemaDt = rdr.GetSchemaTable();
                //schemaDt.WriteXml(Console.Out);

                var schema = new Dictionary<string, DataRow>();
                foreach (DataRow r in schemaDt.Rows)
                {
                    schema.Add(r[0].ToString(), r);
                }


                bc.BatchSize = batchSize;
                for (int i = 0; i < rdr.FieldCount; i++)
                {
                    var cn = rdr.GetName(i);
                    bool isreadonly = schema[cn].Field<bool>("IsReadOnly");

                    if (!isreadonly)
                    {
                        bc.ColumnMappings.Add(i, cn);
                    }

                }

                bc.NotifyAfter = 10000;

                bc.SqlRowsCopied += (s, a) =>
                {
                    Console.WriteLine($"[{schemaName}].[{tableName}] {a.RowsCopied} Rows Copied.");
                };

                // Quote the destination so identifiers containing ']' survive.
                bc.DestinationTableName = $"{QuoteName(schemaName)}.{QuoteName(tableName)}";
                bc.WriteToServer(rdr);
                Console.WriteLine($"[{schemaName}].[{tableName}] Complete.");
            }

        }
    }
}

Just pass in two connection strings to run it, source first, then destination. For example:

PS C:\temp\datamover> .\bin\debug\DataMover.exe "Server=xxxxxx.database.windows.net;database=adventureworks;User ID=xxxxxx;Password=xxxxxx" "Server=localhost;database=awcopy;integrated security=true"
answered on Stack Overflow May 22, 2020 by David Browne - Microsoft • edited May 22, 2020 by David Browne - Microsoft
