Reduce copies when reading files in pyio, match behavior of _io #129005

Open
cmaloney opened this issue Jan 18, 2025 · 14 comments
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

cmaloney (Contributor) commented Jan 18, 2025

Feature or enhancement

Proposal:

Currently _pyio uses ~2x as much memory to read all data from a file compared to _io. This is because it makes more than one copy of the data.

Details from a test_largefile run

$ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
== CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
== Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
== Python build: debug
== cwd: <$HOME>/python/build/build/test_python_worker_32392æ
== CPU count: 32
== encodings: locale=UTF-8 FS=utf-8
== resources: all

Using random seed: 1740056613
0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
0:00:00 load avg: 0.53 [1/1] test_largefile
test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
ok
test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
 ... process data size: 4.3G
 ... process data size: 4.7G
ok

----------------------------------------------------------------------
Ran 2 tests in 3.711s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 3.7 sec
Total tests: run=2 (filtered)
Total test files: run=1/1 (filtered)
Result: SUCCESS

Plan:

  1. Switch to os.readv() (later revised to os.readinto()) so reads go directly into a caller-provided buffer, like the C _Py_read used by _io does; os.read() can't take a buffer to fill. This aligns the behavior of _io.FileIO.readall and _pyio.FileIO.readall. os.readv works well today and takes a caller-allocated buffer rather than needing a new os API, and readv(2) mirrors the behavior and errors of read(2), so this should keep the same end behavior. (A sketch of the readinto approach follows this list.)
  2. Update _pyio.BufferedIO to not force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer and then appends the result of _pyio.FileIO.readall to it.
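
A minimal sketch of what step 1 could look like in _pyio.FileIO (illustrative only; it assumes an os.readinto(fd, buffer) that fills a writable buffer and returns the number of bytes read, and it folds the closed/readable checks into readinto()):

```python
import os

class FileIO:
    # ... the rest of _pyio.FileIO is elided; this is only the readinto sketch ...

    def readinto(self, buffer):
        """Read bytes directly into a pre-allocated, writable bytes-like object."""
        self._checkClosed()
        self._checkReadable()
        try:
            # Fill the caller's buffer in place; no intermediate bytes object,
            # so no extra copy.
            return os.readinto(self._fd, buffer)
        except BlockingIOError:
            return None
```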

For iterating, I'm using a small tracemalloc script to find where copies are:

from _pyio import open

import tracemalloc

with open("README.rst", 'rb') as file:
    tracemalloc.start()
    data = file.read()
    snap = tracemalloc.take_snapshot()


stats = snap.statistics('lineno')
for stat in stats:
    print(stat)

Loose Ends

  • os.readv seems to be well supported but is currently guarded by a configure check. I'd like to just make pyio require readv, but can do conditional code if needed. If making readv non-optional generally is feasible, happy to work on that.
    • os.readv is not supported on WASI, so need to add conditional code.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

cmaloney added the type-feature (A feature request or enhancement) label Jan 18, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 18, 2025
`os.read` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readv`. `self.read()` was doing the
closed and readable checks, so move those into `readinto`.
cmaloney (Contributor, Author) commented Jan 22, 2025

I took a tangent and compared the code complexity of adding os.readinto vs. the conditionals needed because os.readv doesn't exist on some platforms, plus needing to pass a sequence of buffers to os.readv.

  1. _pyio Using os.readv with conditional: https://github.com/python/cpython/pull/129006/files
  2. Adding os.readinto, _pyio using it: cmaloney@52c6013

Happy to work on either course; slight preference for os.readinto, since adding that one function removes a lot of new conditional code otherwise needed to make these cases more efficient. Code-wise, with the buffer protocol, the code to me gets quite a bit easier to follow. Not sure it's worth another public os API to maintain, though.
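
For illustration, roughly what the readv-based variant looks like (a sketch, not code from either PR): os.readv takes a sequence of buffers, so a single caller-allocated buffer has to be wrapped, and the call needs a guard because readv is behind a configure check:

```python
import os

def read_into(fd, buffer):
    """Fill `buffer` (a writable bytes-like object) directly when possible."""
    if hasattr(os, "readv"):
        # readv() takes a *sequence* of buffers; wrap the single buffer.
        return os.readv(fd, [buffer])
    # Fallback: read into a temporary bytes object, then copy -- the extra
    # copy this issue is trying to avoid.
    data = os.read(fd, len(buffer))
    buffer[:len(data)] = data
    return len(data)
```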

I know moving to a full PR for os.readinto would need to add tests and news; here I'm just focusing on "adding os.readinto" vs. "using os.readv".

cc: @tomasr8 , @vstinner , @gpshead (reviewers where os.read vs. _Py_read / readinto has come up).

vstinner (Member) commented Jan 22, 2025

Can you please open a separate issue for os.readinto()? This function looks useful and simple.

picnixz added the performance (Performance or resource usage) and stdlib (Python modules in the Lib dir) labels Jan 23, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 26, 2025
`os.read` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readinto`.

`os.readinto` uses `PyObject_GetBuffer` to make sure the passed-in
buffer is writable and bytes-like, so drop the manual check.
vstinner pushed a commit that referenced this issue Jan 28, 2025
`os.read()` allocated and filled a buffer by calling `read(2)`, then that
data was copied into the user-provided buffer. Read directly into the
caller's buffer instead by using `os.readinto()`.

`os.readinto()` uses `PyObject_GetBuffer()` to make sure the passed-in
buffer is writable and bytes-like, so drop the manual check.
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
Both now use a pre-allocated buffer of length `bufsize`, fill it using
a readinto, and have matching "expand buffer" logic.

On my machine this takes:

`./python -m test -M8g -uall test_largefile -m test_large_read -v`
from ~3.7 seconds to ~3.3 seconds
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
Slicing buf and appending chunk would always result in a copy. Commonly
in a readall there is no already read data in buf, and the amount of
data read may be large, so the copy is expensive.
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 29, 2025
This aligns the memory usage between _pyio and _io. Both use the same amount
of memory.
cmaloney (Contributor, Author) commented:

The full set of changes, on my Linux machine (debug build), reduces ./python -m test -M8g -uall test_largefile -m test_large_read -v from ~3.8 seconds to ~2.4 seconds, with a peak memory usage of 2.3 GB (identical memory usage for the C and Python implementations).

vstinner pushed a commit that referenced this issue Jan 30, 2025
Both now use a pre-allocated buffer of length `bufsize`, fill it using
a readinto(), and have matching "expand buffer" logic.

On my machine this takes:

`./python -m test -M8g -uall test_largefile -m test_large_read -v`

from ~3.7 seconds to ~3.4 seconds.
vstinner pushed a commit that referenced this issue Jan 30, 2025
Slicing buf and appending chunk would always result in a copy. Commonly
in a readall() there is no already read data in buf, and the amount of
data read may be large, so the copy is expensive.
cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 31, 2025
This aligns the memory usage between _pyio and _io. Both now use the
same amount of memory when reading a file.
cmaloney (Contributor, Author) commented Feb 1, 2025

I think I found the issue in #129458. I tried using result[bytes_read:bufsize] = b'\0' to append bufsize - bytes_read null bytes to the bytearray, but slice assignment doesn't do that; it replaces the slice with the right-hand side, whatever its length:

>>> a = bytearray()
>>> a[0:5] = b'\0'
>>> a
bytearray(b'\x00')
>>> a[5:16] = b'\01'
>>> a
bytearray(b'\x00\x01')
>>> len(a)
2

(Pipes in the test are a case where there is no stat.st_size estimate and we can't seek to figure out the size, so the read always goes through the resizing code.) Not sure why that commit worked at first but later stopped working... guessing the system needs to be somewhat loaded to hit it.

Validated by adding an assert locally, which fails every time:

                result[bytes_read:bufsize] = b'\0'
                assert len(result) == bufsize, f"Should have expanded in size. {len(result)=}, {bufsize=}"
$ make && ./python -m test -uall test_io -vvv -m 'test_nonblock*'
<...>
AssertionError: Should have expanded in size. len(result)=8193, bufsize=16640

I don't see a clear way to resize the bytearray without making a whole array of null bytes which get copied into it... The tests just use the C API to resize: https://github.com/python/cpython/blob/main/Lib/test/test_capi/test_bytearray.py#L134-L160

Would adding a .resize() method be reasonable here? It's probably possible to do this efficiently with slicing, but not in a way I know how...
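
One possibility might be slice-assigning an iterator that has a length hint (sketch; assuming itertools.repeat's __length_hint__ lets the bytearray reserve the space up front rather than growing element by element):

```python
import itertools

result = bytearray(b"data read so far")
bufsize = 64

# Grow `result` to `bufsize` bytes in place. itertools.repeat(0, n) exposes
# __length_hint__, so no separate null-byte buffer has to be materialized.
result[len(result):] = itertools.repeat(0, bufsize - len(result))
assert len(result) == bufsize
```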

cc: @vstinner

cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 1, 2025
Move to a linear slice append with an iterator which has a length hint.
This is more expensive than PyByteArray_Resize, but I think as efficient
as can get without a new bytearray Python API to resize.

The previous code didn't append as I had intended:

```python
>>> a = bytearray()
>>> a[0:5] = b'\0'
>>> a
bytearray(b'\x00')
>>> a[5:16] = b'\01'
>>> a
bytearray(b'\x00\x01')
>>> len(a)
2
```
cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 2, 2025
vstinner pushed a commit that referenced this issue Feb 2, 2025
cmaloney (Contributor, Author) commented Feb 2, 2025

Suggesting two new features to help implement this more effectively and make it easier to write zero-copy, zero-extra-allocation I/O loops:

  1. Add bytearray.resize (Add bytearray.resize() method #129559)
  2. Add zero-copy conversion of bytearray -> bytes: https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164. This would mean the return types in _pyio don't have to be changed to get efficient behavior, and it should make return bytes(buffer) more efficient generally. (A sketch of the resulting read loop follows this list.)
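
A sketch of the read loop those two features would enable (a standalone illustration, assuming bytearray.resize(n) from #129559 and os.readinto() behave as proposed):

```python
import os

DEFAULT_BUFFER_SIZE = 8192

def readall(fd):
    result = bytearray(DEFAULT_BUFFER_SIZE)
    bytes_read = 0
    while True:
        if bytes_read >= len(result):
            # Expand the buffer in place; no temporary null-byte buffer.
            result.resize(len(result) + max(len(result), DEFAULT_BUFFER_SIZE))
        n = os.readinto(fd, memoryview(result)[bytes_read:])
        if n == 0:  # reached end of file
            break
        bytes_read += n
    result.resize(bytes_read)
    # The one remaining copy; zero-copy bytearray -> bytes (feature 2) would remove it.
    return bytes(result)
```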

gpshead (Member) commented Feb 2, 2025

Internally, the latter (2) is infeasible. PyBytesObject ends with the immutable array of bytes, while PyByteArrayObject uses indirection via pointers. A CPython bytes object can't adopt and own external memory for zero-copy construction.

gpshead (Member) commented Feb 2, 2025

Is there any reason to optimize for _pyio? The _io stack exists for a reason; _pyio is the reference implementation that, if used at all, is maybe only used by PyPy (and that's up to them). CPython users don't use it.

cmaloney (Contributor, Author) commented Feb 2, 2025

@gpshead the theory for 2 (zero-copy bytearray -> bytes) is that the buffer is both a modifiable array and a valid PyBytes* / PyObject* in place simultaneously: the PyByteArray's two current pointers aim at the PyBytes* buffer rather than at a raw PyMem_Malloc allocation, and there is a PyBytes struct/object header before what the user of the bytearray sees in Python, meaning the PyByteArray's buffer could be finalized to a valid immutable PyBytes without a copy. PyByteArray already supports an "offset" between the start of its allocation and what it presents in Python; this just makes that offset a few bytes larger and switches allocations to the PyBytes C APIs. The C PyBytes API already allows the buffer to be mutable if no initial string is passed.

re: _pyio, this work significantly reduces memory usage (now identical between Python and C) and runtime (~3.8 s -> ~2.4 s last I tested) of ./python -m test -M8g -uall test_largefile -m test_large_read -v on my machine (64-bit Arch Linux, ~1 year old; I suspect a bigger delta on older machines). bytes(bytearray) comes up in a number of places in other modules. On macOS with an M2 chip the performance is much more level.

This is underlying work toward being able to:

  1. Rewrite BufferedIO
  2. Use asynchronous I/O underneath FileIO. The exact shape of this down the line almost certainly requires a PEP, but today I have viable ways to add an os.deferred() which adds a ContextVar giving access to io_uring on Linux (and ideally the Windows async bits as well).

Currently, patches I've written touching BufferedIO run into the issue that the C and Python implementations don't quite match but people want them to. Bringing them into line helps me understand the code better, brings them in line with reviewer expectations, and demonstrates that I/O in Python can be as efficient as C (in both memory and CPU). The ideal/dream would be that, with interpreter improvements (e.g. the JIT and free threading), the performance delta gets small enough that there can be one Python io library implementation.

gpshead (Member) commented Feb 3, 2025

Ah, yes, if bytearray internally allocates a phantom bytes object for its backing memory then I could see how that could be made real and returned in refcount=1 bytes(bytearray) scenarios. A little extra memory overhead, but bytearray overhead shouldn't matter much since the best uses are large anyway... nice.

morotti (Contributor) commented Feb 5, 2025

Hello, I am coming across this PR as I also noticed the horrible performance of readall() while adjusting buffer size in another PR.
One caveat though: I don't think anything is using the _pyio API (the C API should be used), so it's not really worth optimizing if nothing is using it.

Basically, the readall() function is stupid. It's copying data 3 times by appending to a buffer.
IMO what's important is to optimize the main use case, where the filesize is known in advance to avoid copying.

EDIT: forgot to say, this is the function def readall(self): in _pyio.py

# dumb code in master
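        # (context: earlier in readall(), bufsize is estimated from
        #  os.fstat(self._fd).st_size when the file size is known, so the
        #  common case should be a single read into a right-sized buffer)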
        result = bytearray()  # idiotic -> the bytearray will have to be resized on every loop
        while True:
            if len(result) >= bufsize:
                bufsize = len(result)
                bufsize += max(bufsize, DEFAULT_BUFFER_SIZE)
            n = bufsize - len(result)
            try:
                chunk = os.read(self._fd, n)  # the returned chunk is bytes and should be reusable?
            except BlockingIOError:
                if result:
                    break
                return None
            if not chunk: # reached the end of the file
                break
            result += chunk

        return bytes(result)  # idiotic conversion bytearray to bytes. if there was a single read, that is a total of 3 conversions.
# better strategy. quick pseudo code (fixed up so it would run inside readall()).
        result = []  # List[bytes]; more optimal -> take the bytes as they come, no copy needed
        bytes_read = 0  # track total bytes so the bufsize logic still works with a list of chunks
        while True:
            if bytes_read >= bufsize:
                bufsize = bytes_read
                bufsize += max(bufsize, DEFAULT_BUFFER_SIZE)
            n = bufsize - bytes_read
            try:
                chunk = os.read(self._fd, n)
            except BlockingIOError:
                if result:
                    break
                return None
            if not chunk: # reached the end of the file
                break
            result.append(chunk)
            bytes_read += len(chunk)

        if len(result) == 1:
            # performance: the buffer is set to the filesize in advance when the filesize is known,
            # so it should result in a single read most of the time.
            return result[0]

        return b''.join(result)  # join computes the total length first, so the output buffer is allocated once

cmaloney (Contributor, Author) commented Feb 5, 2025

The result list + os.read also force copies at the moment. I added os.readinto() (gh-129205) a little while ago, and bytearray.resize() (gh-129559) is currently in progress; together those should fix those copies and be cheaper than a list join. That gets it down to a single copy going from bytearray -> bytes, which I'm hoping something like https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164/20 can fix.

Using C _io.BytesIO there's an optimization that would remove that copy, but _pyio.BytesIO can't do that trick currently.

morotti (Contributor) commented Feb 5, 2025

Quick test with the code above: the test Lib.test.test_largefile.PyLargeFileTest.test_large_read, which reads 2.5 GB, goes from ~6.5 seconds to ~5.1 seconds.

Oddly enough, I added a print to debug how many bytes are read on the call chunk = os.read(self._fd, n), and I saw some occurrences where it did not read the full file in one go. 🤔

read chunk  2147479552
read chunk  352520449
read chunk  0

EDIT: raising to 10 GB for debugging, the partial reads are relatively consistent around the 2 GB mark.
I guess there is a limit imposed by the OS or filesystem. (Ubuntu 22 LTS machine)

read chunk  2147479552
read chunk  2147479552
read chunk  2147479552
read chunk  910081793
read chunk  0

EDIT EDIT: that 10 GB run is going from ~25 seconds on main to ~18 seconds with the code above.

cmaloney (Contributor, Author) commented Feb 5, 2025

os.read caps all reads to a maximum size, which I suspect is being hit here (2,147,479,552 bytes is 0x7ffff000, the most a single read(2) will transfer on Linux). In any case, this is definitely going to get faster; I have a path there, but I'm working on some underlying bits before updating the core code :) (How things like buffer resizing line up with the C implementation / using the right non-linear read expansion also makes a big difference.)

vstinner added a commit that referenced this issue Feb 7, 2025
Utilize `bytearray.resize()` and `os.readinto()` to reduce copies
and match behavior of `_io.FileIO.readall()`.

There is still an extra copy, which means twice the memory required
compared to `_io.FileIO`, because there isn't a zero-copy path from
`bytearray` -> `bytes` currently.

On my system reading a 2 GB file:
`./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read -v`

Goes from ~2.7 seconds -> ~2.2 seconds

Co-authored-by: Victor Stinner <[email protected]>
cmaloney added a commit to cmaloney/cpython that referenced this issue Feb 8, 2025