What happens when you seek past the end of a file opened for writing?


I have a binary dataset of known size that arrives in fixed-size chunks. The chunks arrive out of order, but each chunk's position in the final result is known when I receive it. Here is a simple example:

from random import sample, seed
import numpy as np

chunk_size = 10
chunk_count = 10

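def generate_data():
    # Fixed seed so that the shuffled chunk order is reproducible
    seed(0xDEADBEEF)
    # Yield (chunk index, chunk data) pairs in random order
    for i in sample(range(chunk_count), chunk_count):
        yield i, np.arange(i * chunk_size, (i + 1) * chunk_size, dtype=np.uint8)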

My goal is to write this data to a file as it arrives:

with open('output.dat', 'wb') as output:
    for i, d in generate_data():
        output.seek(i * chunk_size)
        d.tofile(output)

This seems to work well with my Anaconda Python 3.7 install on Windows: it creates a 100-byte file containing the bytes 0-99:

>>> with open('output.dat', 'rb') as f:
...     print(f.read())
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abc'

I expect this to be independent of the Python version, at least as far back as 2.7. I am less sure that it is platform-independent, but I would expect it to be.

The example above does not show any artifacts in the file because the data is contiguous once the loop terminates. If I introduce a missing block, I see zeros in the file:

def generate_data():
    seed(0xDEADBEEF)
    for i in sample(range(chunk_count + 2), chunk_count):
        yield i, np.arange(i * chunk_size, (i + 1) * chunk_size, dtype=np.uint8)

with open('output.dat', 'wb') as output:
    for i, d in generate_data():
        output.seek(i * chunk_size)
        d.tofile(output)

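Two of the twelve candidate chunks are never generated. With this seed, one of them falls in the middle of the file and the other at the very end, so the file is only 10 bytes larger (110 bytes) and contains a single hole. All the elements are placed correctly, including the zero-filled hole: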

>>> with open('output.dat', 'rb') as f:
...     print(f.read())
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0023456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklm'
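
The same effect can be reproduced without NumPy. The following is a minimal sketch using only the standard library; the helper `write_with_gap` and the file name `sparse_test.dat` are hypothetical names chosen for illustration. It only reports what the current platform does and does not show that the behavior is guaranteed:

```python
import os
import tempfile

def write_with_gap(path, offset, payload):
    """Seek to `offset` in a newly created file, write `payload`, and return the file's bytes."""
    with open(path, 'wb') as f:
        f.seek(offset)   # the file is empty, so this position is past the end
        f.write(payload)
    with open(path, 'rb') as f:
        return f.read()

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'sparse_test.dat')
data = write_with_gap(path, 10, b'abc')

print(len(data))                   # total size, including the skipped region
print(data[:10] == b'\x00' * 10)   # whether the skipped region reads back as zeros here
```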

Is zero-fill consistent, documented behavior that I can rely on? Or is the content of the holes OS-specific (as this question implies)? I have not been able to find anything Python-specific about a write that follows a seek past the current end of file.

python
seek
asked on Stack Overflow Jul 23, 2019 by Mad Physicist • edited Jul 30, 2019 by Mad Physicist

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0