Python checksum or hash that gives the same output on every execution

I am trying to create a unique key for each row of a DataFrame based on some of its columns. I tried hashlib and zlib, but both produce different values for the same record on each new execution of the Python program.

I am looking for a way to create a checksum that is always the same for a given data record in the DataFrame. There are many columns, so I don't want to use the concatenated columns themselves as the key. Any insights would be much appreciated. The sample code I tested with hashlib and zlib is below.

Hashlib

    stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
        str.encode('utf-8').apply(lambda x: (hashlib.sha512(x).hexdigest().upper()))

zlib.adler32

    stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
        str.encode('utf-8').apply(lambda x: (zlib.adler32(x) & 0xffffffff))

Edit (10/21): I changed the code and am now hitting a new problem. Please review. Sorry for any confusion.

The code snippets above have a problem: for some rows, the hash of another row's column values ended up in the 'Unique travelid' column, because pd.DataFrame() altered the original DataFrame's row order. The modified code below picks up the correct column values for each row, but it hits the new issue explained further down.
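One likely mechanism for the mismatch, assuming stg_matchdf had a non-default index (for example after a filter or sort): wrapping a raw array in pd.DataFrame() gives it a fresh 0, 1, 2, ... index, and column assignment then aligns by index label rather than by position. A minimal sketch with a hypothetical three-row frame (the column names a and b are made up for illustration):

```python
import pandas as pd

# Hypothetical frame whose index is not 0, 1, 2 in order,
# as can happen after filtering or sorting.
df = pd.DataFrame({"a": ["x", "y", "z"], "b": ["1", "2", "3"]}, index=[2, 0, 1])

# Row-wise concatenation; this Series still carries the index [2, 0, 1].
combined = df["a"] + df["b"]

# Re-wrapping the raw values creates a new RangeIndex 0, 1, 2.
rewrapped = pd.DataFrame(combined.to_numpy())[0]

# Assigning a Series aligns on index labels, so rows get shuffled.
df["misaligned"] = rewrapped

# Assigning a plain array is positional, so each row keeps its own value.
df["aligned"] = combined.to_numpy()

print(df)
```

In this sketch the row with a == "x" receives "z3" in the misaligned column, while the aligned column keeps "x1".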

Modified code

    stg_matchdf["Unique travelid_Sum"] = stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1)
    stg_matchdf["Unique travelid_Key"] = stg_matchdf["Unique travelid_Sum"].apply(lambda x: (zlib.adler32(str(x).encode('utf-8')) & 0xffffffff))

The expression stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1) does not concatenate the columns in the same order across runs. See the samples below from two runs: the total length is the same, but the order of concatenation differs. As a result, hashlib and zlib return different values each time. Is there a way to fix the order of the columns in the code above?

Run1:
AHKGCANADACANADANORTH AMERICA266430RDirect WDAYYZINTERNATIONALMANULIFE - CANADA TRANSIENTFeb-2020HONG KONGASIA/PACIFICPARTIAL REFUND2020-02-15Canada266430.02020-02-02Hong Kong2020-03-01QVKGS6

Run2:
YYZCANADAPARTIAL REFUND2664302020-02-02AMANULIFE - CANADA TRANSIENTHONG KONGNORTH AMERICA2020-03-01Hong KongQVKGS6INTERNATIONALDirect WDRHKGACanadaFeb-2020266430.02020-02-15CANADAASIA/PACIFIC
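Both hashlib and zlib are deterministic for identical bytes, so the variation has to come from the input string. One plausible cause, and this is an assumption since the question does not show how uniquecols_list is built, is that the list comes from a set: the iteration order of a set of strings can change between interpreter runs because of hash randomization. A minimal sketch that pins the column order by sorting the names and joins the values with a separator before hashing (row_key, SEP, and the field names origin, dest, pnr are hypothetical):

```python
import hashlib

# Unit separator between fields; assumed not to occur inside the data values.
# It keeps ("ab", "c") and ("a", "bc") from producing the same payload.
SEP = "\x1f"


def row_key(record, columns):
    """Return an uppercase SHA-256 hex digest of the selected fields.

    The columns are sorted by name first, so the digest does not depend
    on the order in which the column names were supplied.
    """
    ordered = sorted(columns)
    payload = SEP.join(str(record[col]) for col in ordered)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest().upper()


# Hypothetical record; a dict stands in for one DataFrame row.
record = {"origin": "YYZ", "dest": "HKG", "pnr": "QVKGS6"}

# The same key comes back whether the columns arrive as a set or a list.
key_from_set = row_key(record, {"origin", "dest", "pnr"})
key_from_list = row_key(record, ["pnr", "origin", "dest"])
print(key_from_set == key_from_list)
```

With a DataFrame, the same function could be applied per row, for example stg_matchdf.apply(lambda r: row_key(r, uniquecols_list), axis=1), which also keeps each key aligned with its own row. If uniquecols_list is already an ordinary list with a fixed order, sorting is harmless and the cause lies elsewhere.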
python
zlib
checksum
hashlib
asked on Stack Overflow Oct 21, 2020 by Mohan Rayapuvari • edited Oct 22, 2020 by Mohan Rayapuvari

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0