I am following the code from Hands-On Machine Learning with Scikit-Learn and TensorFlow, 2nd edition. In the section on creating the train and test datasets, the author splits the data as follows:
```python
from zlib import crc32

import numpy as np


def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32


def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]


housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
```
According to the author:
You can compute a hash of each instance's identifier and put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
Therefore, I would like to understand what this line of code does:
`crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32`
Any help is much appreciated!
This may be a little late, but if you are still looking for an answer, here it is, from the documentation of the `crc32` function:
Changed in version 3.0: Always returns an unsigned value. To generate the same numeric value across all Python versions and platforms, use crc32(data) & 0xffffffff.
So, essentially, the `& 0xffffffff` mask is just there to ensure that the function returns the same numeric value for everyone, regardless of whether they're running Python 2 or 3.