I'm not understanding this function for stable train/test split even after updating the dataset


I am reading Aurélien Géron's book 'Hands-On Machine Learning', and I have been trying to understand the following function:

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

According to the book:

To have a stable train/test split even after updating the dataset, a common solution is to use each instance's identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier). For example, you could compute a hash of each instance's identifier and put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
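The stability property in that quote can be checked with a small sketch. This is an illustration under assumptions, not the book's code: `in_test_set` is a hypothetical name, and `int(...).to_bytes(8, "little", signed=True)` stands in for `np.int64(identifier)` so that only the standard library is needed (on a little-endian machine this should produce the same bytes).

```python
from zlib import crc32

def in_test_set(identifier, test_ratio):
    # Assumption: hash the 8-byte little-endian form of the id,
    # as a stand-in for crc32(np.int64(identifier)).
    data = int(identifier).to_bytes(8, "little", signed=True)
    return (crc32(data) & 0xffffffff) < test_ratio * 2**32

old_ids = range(1000)   # the original dataset
new_ids = range(1500)   # the same 1000 ids plus 500 new ones

old_test = {i for i in old_ids if in_test_set(i, 0.2)}
new_test = {i for i in new_ids if in_test_set(i, 0.2)}

# Old ids that are in the new test set but were not in the old one.
moved_into_test = {i for i in new_test if i < 1000} - old_test

print(old_test <= new_test)   # old test instances stay in the test set
print(moved_into_test)        # no old training instance moved over
```

Each id is assigned by its own hash alone, so adding rows cannot change where an existing row ends up.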

Even with the book's explanation, the function is very confusing to me and I understood almost none of it, so I tried to work out what each part does.

I started by looking up what crc32 is and found this excellent video. I learned that a CRC is a function used to detect errors in a message between sending and receiving. Alright, but I still don't understand what the crc32 function outputs: is it the divisor of the message?
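As far as I can tell, the divisor (the generator polynomial) is fixed by the CRC-32 standard, and what `zlib.crc32` returns is the checksum: a 32-bit integer derived from the remainder of that division. A minimal sketch to look at the output, with byte strings chosen only for illustration:

```python
from zlib import crc32

a = crc32(b"hello")
b = crc32(b"hello")   # same input again
c = crc32(b"hellp")   # one character changed

print(a, b, c)
print(a == b)          # same bytes give the same checksum
print(a != c)          # a small change gives a different checksum
print(0 <= a < 2**32)  # the result fits in 32 bits (Python 3)

# Standard CRC-32 check value for the ASCII string "123456789".
print(hex(crc32(b"123456789")))  # 0xcbf43926
```

So for this function, crc32 is just being used as a deterministic way to turn an identifier into a number between 0 and 2**32 - 1.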

In this great answer I found that & 0xffffffff is just there to make the result the same on all Python versions, so no problem there.
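My understanding is that older Python versions could return the checksum as a signed integer, while Python 3 always returns an unsigned one, and the mask maps both to the same unsigned value. A small sketch of what the mask does to a signed 32-bit number (`to_unsigned` is a hypothetical helper name used only here):

```python
def to_unsigned(value):
    # Keep only the low 32 bits, which reinterprets a signed
    # 32-bit integer as its unsigned equivalent.
    return value & 0xffffffff

print(to_unsigned(-1))        # 4294967295, i.e. 2**32 - 1
print(to_unsigned(-2**31))    # 2147483648, i.e. 2**31
print(to_unsigned(123))       # 123, non-negative values are unchanged
```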

In that answer I also found that 2**32 is the number of distinct values a 32-bit unsigned integer can hold (one more than the largest such value). Alright, I got that, but why am I comparing against it?
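One way to see the role of the comparison: if the hashes are spread roughly evenly over the range 0 to 2**32 - 1, then about `test_ratio` of them should fall below `test_ratio * 2**32`. The sketch below measures that fraction. It rests on assumptions: `hash32` and `below_threshold` are hypothetical names, and `to_bytes` stands in for `np.int64` so that no third-party import is needed.

```python
from zlib import crc32

def hash32(identifier):
    # Assumption: 8-byte little-endian form of the id,
    # as a stand-in for np.int64(identifier).
    data = int(identifier).to_bytes(8, "little", signed=True)
    return crc32(data) & 0xffffffff

def below_threshold(identifier, test_ratio):
    # Same comparison as in test_set_check.
    return hash32(identifier) < test_ratio * 2**32

ids = range(100_000)
fraction = sum(below_threshold(i, 0.2) for i in ids) / len(ids)

print(f"threshold: {0.2 * 2**32:.0f} out of {2**32}")
print(f"fraction of ids below the threshold: {fraction:.3f}")
```

With a ratio of 0 no id can be selected, and with a ratio of 1 every id is, since every hash is smaller than 2**32.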

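from zlib import crc32

import numpy as np

def test_set_check(identifier, test_ratio):
    # Hash the id to a 32-bit value; True if it falls below test_ratio of the 32-bit range.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    # Rows flagged False form the training set, rows flagged True the test set.
    return data.loc[~in_test_set], data.loc[in_test_set]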

housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

>>> test_set_check(2, 0.2)
True

housing_with_id.head()


python
machine-learning
hash
crc32
asked on Stack Overflow May 15, 2021 by Joseph Arcila • edited May 17, 2021 by Joseph Arcila

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0