How to split a dataset into train and test sets using the hash-code method


I am following the code in Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd edition). In the section on creating the train and test sets, the author uses the following code to build the two sets:

from zlib import crc32

import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

According to the author:

You can compute a hash of each instance's identifier and put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
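As a rough illustration of that claim, here is a minimal sketch, not code from the book. The helper name `in_test` and the toy id ranges are made up for this example, and `int.to_bytes` with the native byte order stands in for `np.int64` so that the snippet runs with the standard library alone.

```python
import sys
from zlib import crc32

def in_test(identifier, test_ratio=0.2):
    # Hypothetical helper: hash the 8-byte native representation of the id
    # (an assumed stand-in for np.int64) and compare it with the cutoff.
    raw = identifier.to_bytes(8, sys.byteorder, signed=True)
    return crc32(raw) & 0xffffffff < test_ratio * 2**32

old_ids = list(range(1000))          # ids present before the refresh
new_ids = list(range(1000, 1500))    # ids appended by the refresh

test_before = {i for i in old_ids if in_test(i)}
test_after = {i for i in old_ids + new_ids if in_test(i)}

# Old ids whose assignment changed after the refresh; expected to be empty,
# because the decision depends only on the id itself.
moved = {i for i in old_ids if (i in test_before) != (i in test_after)}
print(len(test_before), len(test_after), len(moved))
```

The assignment depends only on the id, so this holds only while existing ids never change. With a row index as the id, that means new rows have to be appended at the end and old rows never deleted or reordered.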

Therefore, I would like to understand what this line of code does: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

Any help is much appreciated!

asked on Stack Overflow Nov 12, 2019 by I. A

1 Answer


This may be a little late, but if you are still looking for an answer, here it is, from the documentation of the crc32 function:

Changed in version 3.0: Always returns an unsigned value. To generate the same numeric value across all Python versions and platforms, use crc32(data) & 0xffffffff.

So, essentially, the mask is just there to ensure that the result is the same whether the function is run on Python 2 or Python 3.

answered on Stack Overflow Feb 25, 2020 by DataInTheStone

User contributions licensed under CC BY-SA 3.0