Reduce copies when reading files in pyio, match behavior of _io #129005

Open
cmaloney opened this issue Jan 18, 2025 · 14 comments
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

cmaloney (Contributor) commented Jan 18, 2025

Feature or enhancement

Proposal:

Currently _pyio uses ~2x as much memory to read all data from a file compared to _io. This is because it makes more than one copy of the data.

Details from a test_largefile run

$ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
== CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
== Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
== Python build: debug
== cwd: <$HOME>/python/build/build/test_python_worker_32392æ
== CPU count: 32
== encodings: locale=UTF-8 FS=utf-8
== resources: all

Using random seed: 1740056613
0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
0:00:00 load avg: 0.53 [1/1] test_largefile
test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
ok
test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
 ... process data size: 4.3G
 ... process data size: 4.7G
ok

----------------------------------------------------------------------
Ran 2 tests in 3.711s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 3.7 sec
Total tests: run=2 (filtered)
Total test files: run=1/1 (filtered)
Result: SUCCESS

Plan:

  1. Switch to os.readv() (later revised to os.readinto()) so reads go directly into a caller-provided buffer, like the C _Py_read used by _io does; os.read() can't take a buffer to fill. This aligns the behavior of _io.FileIO.readall and _pyio.FileIO.readall. os.readv works well today and takes a caller-allocated buffer rather than needing a new os API, and readv(2) mirrors the behavior and errors of read(2), so this should keep the same end behavior. (A sketch of the readinto approach follows this list.)
  2. Update _pyio.BufferedIO to not force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer and then appends the result of _pyio.FileIO.readall to it.
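
A minimal sketch of what step 1 could look like in _pyio.FileIO (illustrative only; it assumes an os.readinto(fd, buffer) that fills a writable buffer and returns the number of bytes read, and it folds the closed/readable checks into readinto()):

```python
import os

class FileIO:
    # ... the rest of _pyio.FileIO is elided; this is only the readinto sketch ...

    def readinto(self, buffer):
        """Read bytes directly into a pre-allocated, writable bytes-like object."""
        self._checkClosed()
        self._checkReadable()
        try:
            # Fill the caller's buffer in place; no intermediate bytes object,
            # so no extra copy.
            return os.readinto(self._fd, buffer)
        except BlockingIOError:
            return None
```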

For iterating, I'm using a small tracemalloc script to find where copies are:

from _pyio import open

import tracemalloc

with open("README.rst", 'rb') as file:
    tracemalloc.start()
    data = file.read()
    snap = tracemalloc.take_snapshot()


stats = snap.statistics('lineno')
for stat in stats:
    print(stat)

Loose Ends

  • os.readv seems to be well supported but is currently guarded by a configure check. I'd like to just make pyio require readv, but can do conditional code if needed. If making readv non-optional generally is feasible, happy to work on that.
    • os.readv is not supported on WASI, so need to add conditional code.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

cmaloney added the type-feature (A feature request or enhancement) label Jan 18, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 18, 2025
`os.read` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readv`. `self.read()` was doing the
closed and readable checks, so move those into `readinto`.
cmaloney (Contributor, Author) commented Jan 22, 2025

I took a tangent and compared the code complexity of adding os.readinto vs. the conditionals needed because os.readv doesn't exist on some platforms, plus needing to pass a sequence of buffers to os.readv.

  1. _pyio Using os.readv with conditional: https://github.com/python/cpython/pull/129006/files
  2. Adding os.readinto, _pyio using it: cmaloney@52c6013

Happy to work on either course; slight preference for os.readinto, since adding that one function removes a lot of new conditional code otherwise needed to make these cases more efficient. Code-wise, with the buffer protocol, the code to me gets quite a bit easier to follow. Not sure it's worth another public os API to maintain, though.
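
For illustration, roughly what the readv-based variant looks like (a sketch, not code from either PR): os.readv takes a sequence of buffers, so a single caller-allocated buffer has to be wrapped, and the call needs a guard because readv is behind a configure check:

```python
import os

def read_into(fd, buffer):
    """Fill `buffer` (a writable bytes-like object) directly when possible."""
    if hasattr(os, "readv"):
        # readv() takes a *sequence* of buffers; wrap the single buffer.
        return os.readv(fd, [buffer])
    # Fallback: read into a temporary bytes object, then copy -- the extra
    # copy this issue is trying to avoid.
    data = os.read(fd, len(buffer))
    buffer[:len(data)] = data
    return len(data)
```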

I know moving to a full PR for os.readinto would need to add tests and news; here I'm just focusing on "adding os.readinto" vs. "using os.readv".

cc: @tomasr8 , @vstinner , @gpshead (reviewers where os.read vs. _Py_read / readinto has come up).

vstinner (Member) commented Jan 22, 2025

Can you please open a separate issue for os.readinto()? This function looks useful and simple.

picnixz added the performance (Performance or resource usage) and stdlib (Python modules in the Lib dir) labels Jan 23, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 26, 2025
`os.read` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readinto`.

`os.readinto` uses `PyObject_GetBuffer` to make sure the passed-in
buffer is writable and bytes-like, so drop the manual check.
vstinner pushed a commit that referenced this issue Jan 28, 2025
`os.read()` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readinto()`.

`os.readinto()` uses `PyObject_GetBuffer()` to make sure the passed-in
buffer is writable and bytes-like, so drop the manual check.
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
Both now use a pre-allocated buffer of length `bufsize`, fill it using
a readinto, and have matching "expand buffer" logic.

On my machine this takes:

`./python -m test -M8g -uall test_largefile -m test_large_read -v`
from ~3.7 seconds to ~3.3 seconds
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
Slicing buf and appending chunk would always result in a copy. Commonly
in a readall there is no already read data in buf, and the amount of
data read may be large, so the copy is expensive.
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
This aligns the memory usage between _pyio and _io. Both use the same amount
of memory.
cmaloney (Contributor, Author) commented:

The full set of changes, on my Linux machine (debug build), reduces ./python -m test -M8g -uall test_largefile -m test_large_read -v from ~3.8 seconds to ~2.4 seconds, with a peak memory usage of 2.3 GB (identical memory usage for the C and Python implementations).

vstinner pushed a commit that referenced this issue Jan 30, 2025
Both now use a pre-allocated buffer of length `bufsize`, fill it using
a readinto(), and have matching "expand buffer" logic.

On my machine this takes:

`./python -m test -M8g -uall test_largefile -m test_large_read -v`

from ~3.7 seconds to ~3.4 seconds.
vstinner pushed a commit that referenced this issue Jan 30, 2025
Slicing buf and appending chunk would always result in a copy. Commonly
in a readall() there is no already read data in buf, and the amount of
data read may be large, so the copy is expensive.
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 31, 2025
This aligns the memory usage between _pyio and _io. Both now use the
same amount of memory when reading a file.
cmaloney (Contributor, Author) commented Feb 1, 2025

I think I found the issue in #129458. I tried using result[bytes_read:bufsize] = b'\0' to append bufsize - bytes_read null bytes to the bytearray, but slice assignment doesn't do that; it replaces the slice with the right-hand side, whatever its length:

>>> a = bytearray()
>>> a[0:5] = b'\0'
>>> a
bytearray(b'\x00')
>>> a[5:16] = b'\01'
>>> a
bytearray(b'\x00\x01')
>>> len(a)
2

(Pipes in the test are a case where there is no stat.st_size estimate and we can't seek to figure out the size, so the read always goes through the resizing code.) Not sure why that commit worked at first but later stopped working... guessing the system needs to be somewhat loaded to hit it.

Validated by adding an assert locally, which fails every time:

                result[bytes_read:bufsize] = b'\0'
                assert len(result) == bufsize, f"Should have expanded in size. {len(result)=}, {bufsize=}"
$ make && ./python -m test -uall test_io -vvv -m 'test_nonblock*'
<...>
AssertionError: Should have expanded in size. len(result)=8193, bufsize=16640

I don't see a clear way to resize the bytearray without making a whole array of null bytes which get copied into it... The tests just use the C API to resize: https://github.com/python/cpython/blob/main/Lib/test/test_capi/test_bytearray.py#L134-L160

Would adding a .resize() method be reasonable here? It's probably possible to do this efficiently with slicing, but not in a way I know how...
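
One possibility might be slice-assigning an iterator that has a length hint (sketch; assuming itertools.repeat's __length_hint__ lets the bytearray reserve the space up front rather than growing element by element):

```python
import itertools

result = bytearray(b"data read so far")
bufsize = 64

# Grow `result` to `bufsize` bytes in place. itertools.repeat(0, n) exposes
# __length_hint__, so no separate null-byte buffer has to be materialized.
result[len(result):] = itertools.repeat(0, bufsize - len(result))
assert len(result) == bufsize
```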

cc: @vstinner

cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 1, 2025
Move to a linear slice append with an iterator which has a length hint.
This is more expensive than PyByteArray_Resize, but I think as efficient
as can get without a new bytearray Python API to resize.

The previous code didn't append as I had intended:

```python
>>> a = bytearray()
>>> a[0:5] = b'\0'
>>> a
bytearray(b'\x00')
>>> a[5:16] = b'\01'
>>> a
bytearray(b'\x00\x01')
>>> len(a)
2
```
cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 2, 2025
vstinner pushed a commit that referenced this issue Feb 2, 2025
cmaloney (Contributor, Author) commented Feb 2, 2025

Suggesting two new features to help implement this more effectively and make it easier to write zero-copy, zero-extra-allocation I/O loops:

  1. Add bytearray.resize (Add bytearray.resize() method #129559)
  2. Add zero-copy conversion of bytearray -> bytes: https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164. This would mean the return types in _pyio don't have to be changed to get efficient behavior, and it should make return bytes(buffer) more efficient generally. (A sketch of the resulting read loop follows this list.)
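
A sketch of the read loop those two features would enable (a standalone illustration, assuming bytearray.resize(n) from #129559 and os.readinto() behave as proposed):

```python
import os

DEFAULT_BUFFER_SIZE = 8192

def readall(fd):
    result = bytearray(DEFAULT_BUFFER_SIZE)
    bytes_read = 0
    while True:
        if bytes_read >= len(result):
            # Expand the buffer in place; no temporary null-byte buffer.
            result.resize(len(result) + max(len(result), DEFAULT_BUFFER_SIZE))
        n = os.readinto(fd, memoryview(result)[bytes_read:])
        if n == 0:  # reached end of file
            break
        bytes_read += n
    result.resize(bytes_read)
    # The one remaining copy; zero-copy bytearray -> bytes (feature 2) would remove it.
    return bytes(result)
```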

gpshead (Member) commented Feb 2, 2025

Internally, the latter (2) is infeasible. PyBytesObject ends with the immutable array of bytes, while PyByteArrayObject uses indirection via pointers. A CPython bytes object can't adopt and own external memory for zero-copy construction.

gpshead (Member) commented Feb 2, 2025

Is there any reason to optimize for _pyio? The _io stack exists for a reason; _pyio is the reference implementation that, if used at all, is maybe only used by PyPy (and that's up to them). CPython users don't use it.

cmaloney (Contributor, Author) commented Feb 2, 2025

@gpshead the theory for 2 (zero-copy bytearray -> bytes) is that the buffer is both a modifiable array and a valid PyBytes* / PyObject* in place simultaneously: the PyByteArray's two current pointers aim at the PyBytes* buffer rather than at a raw PyMem_Malloc allocation, and there is a PyBytes struct/object header before what the user of the bytearray sees in Python, meaning the PyByteArray's buffer could be finalized to a valid immutable PyBytes without a copy. PyByteArray already supports an "offset" between the start of its allocation and what it presents in Python; this just makes that offset a few bytes larger and switches allocations to the PyBytes C APIs. The C PyBytes API already allows the buffer to be mutable if no initial string is passed.

re: _pyio, this work significantly reduces memory usage (now identical between Python and C) and runtime (~3.8 s -> ~2.4 s last I tested) of ./python -m test -M8g -uall test_largefile -m test_large_read -v on my machine (64-bit Arch Linux, ~1 year old; I suspect a bigger delta on older machines). bytes(bytearray) comes up in a number of places in other modules. On macOS with an M2 chip the performance is much more level.

This is underlying work toward being able to:

  1. Rewrite BufferedIO
  2. Use asynchronous I/O underneath FileIO. The exact shape of this down the line almost certainly requires a PEP, but today I have viable ways to add an os.deferred() which adds a ContextVar giving access to io_uring on Linux (and ideally the Windows async bits as well).

Currently, patches I've written touching BufferedIO run into the issue that the C and Python implementations don't quite match but people want them to. Bringing them into line helps me understand the code better, brings them in line with reviewer expectations, and demonstrates that I/O in Python can be as efficient as C (in both memory and CPU). The ideal/dream would be that, with interpreter improvements (e.g. the JIT and free threading), the performance delta gets small enough that there can be one Python io library implementation.

gpshead (Member) commented Feb 3, 2025

Ah, yes, if bytearray internally allocates a phantom bytes object for its backing memory then I could see how that could be made real and returned in refcount=1 bytes(bytearray) scenarios. A little extra memory overhead, but bytearray overhead shouldn't matter much since the best uses are large anyway... nice.

morotti (Contributor) commented Feb 5, 2025

Hello, I am coming across this PR as I also noticed the horrible performance of readall() while adjusting buffer size in another PR.
One caveat though: I don't think anything is using the _pyio API (the C API should be used), so it's not really worth optimizing if nothing is using it.

Basically, the readall() function is stupid. It's copying data 3 times by appending to a buffer.
IMO what's important is to optimize the main use case, where the filesize is known in advance to avoid copying.

EDIT: forgot to say, this is the function def readall(self): in _pyio.py

# dumb code in master
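        # (context: earlier in readall(), bufsize is estimated from
        #  os.fstat(self._fd).st_size when the file size is known, so the
        #  common case should be a single read into a right-sized buffer)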
        result = bytearray()  # idiotic -> the bytearray will have to be resized on every loop
        while True:
            if len(result) >= bufsize:
                bufsize = len(result)
                bufsize += max(bufsize, DEFAULT_BUFFER_SIZE)
            n = bufsize - len(result)
            try:
                chunk = os.read(self._fd, n)  # the returned chunk is bytes and should be reusable?
            except BlockingIOError:
                if result:
                    break
                return None
            if not chunk: # reached the end of the file
                break
            result += chunk

        return bytes(result)  # idiotic conversion bytearray to bytes. if there was a single read, that is a total of 3 conversions.
# better strategy. quick pseudo code (fixed up so it would run inside readall()).
        result = []  # List[bytes]; more optimal -> take the bytes as they come, no copy needed
        bytes_read = 0  # track total bytes so the bufsize logic still works with a list of chunks
        while True:
            if bytes_read >= bufsize:
                bufsize = bytes_read
                bufsize += max(bufsize, DEFAULT_BUFFER_SIZE)
            n = bufsize - bytes_read
            try:
                chunk = os.read(self._fd, n)
            except BlockingIOError:
                if result:
                    break
                return None
            if not chunk: # reached the end of the file
                break
            result.append(chunk)
            bytes_read += len(chunk)

        if len(result) == 1:
            # performance: the buffer is set to the filesize in advance when the filesize is known,
            # so it should result in a single read most of the time.
            return result[0]

        return b''.join(result)  # join computes the total length first, so the output buffer is allocated once

cmaloney (Contributor, Author) commented Feb 5, 2025

The result list + os.read also force copies at the moment. I added os.readinto() (gh-129205) a little while ago, and bytearray.resize() (gh-129559) is currently in progress; together those should fix those copies and be cheaper than a list join. That gets it down to a single copy going from bytearray -> bytes, which I'm hoping something like https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164/20 can fix.

Using C _io.BytesIO there's an optimization that would remove that copy, but _pyio.BytesIO can't do that trick currently.

morotti (Contributor) commented Feb 5, 2025

Quick test with the code above: the test Lib.test.test_largefile.PyLargeFileTest.test_large_read, which reads 2.5 GB, goes from ~6.5 seconds to ~5.1 seconds.

Oddly enough, I added a print to debug how many bytes are read on the call chunk = os.read(self._fd, n), and I saw some occurrences where it did not read the full file in one go. 🤔

read chunk  2147479552
read chunk  352520449
read chunk  0

EDIT: raising to 10 GB for debugging, the partial reads are relatively consistent around the 2 GB mark.
I guess there is a limit imposed by the OS or filesystem. (Ubuntu 22 LTS machine)

read chunk  2147479552
read chunk  2147479552
read chunk  2147479552
read chunk  910081793
read chunk  0

EDIT EDIT: that 10 GB run is going from ~25 seconds on main to ~18 seconds with the code above.

cmaloney (Contributor, Author) commented Feb 5, 2025

os.read caps all reads to a maximum size, which I suspect is being hit here (2,147,479,552 bytes is 0x7ffff000, the most a single read(2) will transfer on Linux). In any case, this is definitely going to get faster; I have a path there, but I'm working on some underlying bits before updating the core code :) (How things like buffer resizing line up with the C implementation / using the right non-linear read expansion also makes a big difference.)

vstinner added a commit that referenced this issue Feb 7, 2025
Utilize `bytearray.resize()` and `os.readinto()` to reduce copies
and match behavior of `_io.FileIO.readall()`.

There is still an extra copy, which means twice the memory required
compared to `_io.FileIO`, because there isn't a zero-copy path from
`bytearray` -> `bytes` currently.

On my system reading a 2 GB file:
`./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read -v`

Goes from ~2.7 seconds -> ~2.2 seconds

Co-authored-by: Victor Stinner <[email protected]>
cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 8, 2025