How to split a dataset into train and test sets using the hash code method


I am following the code from Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd edition). In the section on creating a test set, the author uses the following procedure to build the training and test sets:

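from zlib import crc32

import numpy as np

def test_set_check(identifier, test_ratio):
    # True when the identifier's 32-bit CRC falls in the lowest test_ratio share of the hash range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32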

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

According to the author:

You can compute a hash of each instance's identifier and put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
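The consistency claim in the quoted passage can be checked with a small sketch. This is not the book's code: the helper name `in_test_set` and the id ranges are made up for illustration, and it hashes the identifier's 8-byte little-endian form, which is assumed to match the bytes of `np.int64` on a little-endian machine so that NumPy is not needed.

```python
from zlib import crc32


def in_test_set(identifier: int, test_ratio: float) -> bool:
    # Hypothetical stand-in for test_set_check: hash the 8-byte
    # little-endian encoding of the id (assumed equivalent to the
    # bytes of np.int64 on a little-endian platform).
    raw = identifier.to_bytes(8, "little", signed=True)
    return (crc32(raw) & 0xFFFFFFFF) < test_ratio * 2**32


old_ids = range(1000)   # ids present in the first version of the dataset
new_ids = range(1500)   # same ids plus 500 newly appended rows

old_test = {i for i in old_ids if in_test_set(i, 0.2)}
new_test = {i for i in new_ids if in_test_set(i, 0.2)}

# Membership depends only on the id, so every old row keeps its side of
# the split after the refresh.
moved = {i for i in old_ids if (i in old_test) != (i in new_test)}
print("old rows that changed sides:", len(moved))
print("share of old rows in the test set:", len(old_test) / len(old_ids))
```

Because the decision is a pure function of the identifier, appending rows cannot move an existing row from the training set to the test set or back. This only holds if identifiers are stable, which is why the book's example relies on new data being appended rather than rows being deleted or reordered.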

Therefore, I would like to understand what this line of code does: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
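One way to read that expression is to split it into named steps. The sketch below is only an illustration: the identifier value 42 and all variable names are hypothetical, and `int.to_bytes` is used in place of `np.int64` on the assumption that both yield the same 8 bytes on a little-endian machine.

```python
from zlib import crc32

identifier = 42     # hypothetical row id
test_ratio = 0.2

# Step 1: turn the id into a fixed-width 8-byte buffer. crc32 works on
# bytes, which is why the original wraps the id in np.int64.
raw = identifier.to_bytes(8, "little", signed=True)

# Step 2: compute the CRC-32 checksum of those bytes.
checksum = crc32(raw)

# Step 3: keep only the low 32 bits. On Python 3 crc32 is already an
# unsigned 32-bit value, so the mask changes nothing; it matters on
# Python 2, where the result could be negative.
masked = checksum & 0xFFFFFFFF

# Step 4: compare against test_ratio of the full 32-bit range. In Python
# `&` binds more tightly than `<`, so the original line parses as
# (crc32(...) & 0xffffffff) < (test_ratio * 2**32).
threshold = test_ratio * 2**32
goes_to_test = masked < threshold

print(masked, threshold, goes_to_test)
```

If the checksums are spread roughly evenly over the range 0 to 2**32 - 1, about `test_ratio` of the identifiers end up below the threshold and therefore in the test set.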

Any help is much appreciated!

asked on Stack Overflow Nov 12, 2019 by I. A

0 Answers

Nobody has answered this question yet.

User contributions licensed under CC BY-SA 3.0