Reduce copies when reading files in pyio, match behavior of _io #129005
Comments
`os.read` allocated and filled a buffer by calling `read(2)`, then that data was copied into the user-provided buffer. Read directly into the caller's buffer instead by using `os.readv`. `self.read()` was doing the closed and readable checks, so move those into `readinto`.
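A minimal sketch of that `readinto` shape, assuming a `FileIO`-like object with a `_fd` attribute and the usual `_checkClosed()` / `_checkReadable()` helpers (names here are illustrative, not the exact `_pyio` code):

```python
import os

def readinto(self, buffer):
    # Do the closed/readable checks here instead of relying on self.read().
    self._checkClosed()
    self._checkReadable()
    try:
        # os.readv() fills the caller-supplied writable buffer in place and
        # returns the number of bytes read -- no intermediate bytes object.
        return os.readv(self._fd, [buffer])
    except BlockingIOError:
        # Non-blocking fd with no data available right now.
        return None
```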
I took a tangent and looked at the code complexity of adding `os.readinto`.
Happy to work on either course, with a slight preference. I know moving to a full PR for `os.readinto` would need to add tests and news; here I'm just focusing on "adding os.readinto" vs. "using os.readv". cc: @tomasr8, @vstinner, @gpshead (reviewers where os.read vs. _Py_read / readinto has come up).
Can you please open a separate issue for `os.readinto`?
`os.read()` allocated and filled a buffer by calling `read(2)`, then that data was copied into the user-provided buffer. Read directly into the caller's buffer instead by using `os.readinto()`. `os.readinto()` uses `PyObject_GetBuffer()` to make sure the passed-in buffer is writable and bytes-like, so drop the manual check.
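Sketched with the same illustrative helper names as above, the `os.readinto()` variant needs no manual buffer check of its own:

```python
import os

def readinto(self, buffer):
    self._checkClosed()
    self._checkReadable()
    try:
        # os.readinto() validates `buffer` via the buffer protocol itself
        # (PyObject_GetBuffer), so no explicit writable/bytes-like check here.
        return os.readinto(self._fd, buffer)
    except BlockingIOError:
        return None
```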
Both now use a pre-allocated buffer of length `bufsize`, fill it using a `readinto()`, and have matching "expand buffer" logic. On my machine this takes `./python -m test -M8g -uall test_largefile -m test_large_read -v` from ~3.7 seconds to ~3.3 seconds.
Slicing `buf` and appending `chunk` would always result in a copy. Commonly in a `readall()` there is no already-read data in `buf`, and the amount of data read may be large, so the copy is expensive.
This aligns the memory usage between `_pyio` and `_io`: both now use the same amount of memory when reading a file.
The full set of changes on my Linux machine (debug build) reduces …
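A rough sketch of the pre-allocated-buffer-plus-"expand buffer" pattern described above (illustrative only: `_fd`, the growth policy, and the use of `os.readv`, which is POSIX-only, are assumptions, and error handling is omitted):

```python
import os

DEFAULT_BUFFER_SIZE = 8 * 1024  # illustrative starting size

def readall(self):
    # Estimate the remaining file size so most files need only one read.
    bufsize = DEFAULT_BUFFER_SIZE
    try:
        pos = os.lseek(self._fd, 0, os.SEEK_CUR)
        end = os.fstat(self._fd).st_size
        if end >= pos:
            bufsize = end - pos + 1  # +1 so EOF shows up as a final 0-byte read
    except OSError:
        pass

    result = bytearray(bufsize)  # pre-allocated, zero-filled
    bytes_read = 0
    while True:
        if bytes_read >= bufsize:
            # "Expand buffer" step: grow before the next read.
            bufsize *= 2
            result.extend(bytes(bufsize - len(result)))
        view = memoryview(result)[bytes_read:]
        try:
            n = os.readv(self._fd, [view])  # read straight into the spare space
        finally:
            view.release()  # drop the buffer export so the bytearray can be resized
        if n == 0:  # EOF
            break
        bytes_read += n

    del result[bytes_read:]  # trim the unused tail
    return bytes(result)     # the one remaining bytearray -> bytes copy
```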
I think I found the issue in #129458. I tried using … (Pipes in the test are a case where we don't have a …) Validated by adding an assert locally, which fails every time:

```python
result[bytes_read:bufsize] = b'\0'
assert len(result) == bufsize, f"Should have expanded in size. {len(result)=}, {bufsize=}"
```
I don't see a clear way to resize the bytearray without making a whole array of null bytes which then get copied into it... The tests just use the C API to resize: https://github.com/python/cpython/blob/main/Lib/test/test_capi/test_bytearray.py#L134-L160. Would adding a Python-level resize API for `bytearray` be reasonable? cc: @vstinner
Move to a linear slice append with an iterator which has a length hint. This is more expensive than `PyByteArray_Resize`, but I think as efficient as we can get without a new bytearray Python API to resize. The previous code didn't append as I had intended:

```python
>>> a = bytearray()
>>> a[0:5] = b'\0'
>>> a
bytearray(b'\x00')
>>> a[5:16] = b'\01'
>>> a
bytearray(b'\x00\x01')
>>> len(a)
2
```
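For contrast, a sketch of the slice-append-with-a-length-hint idea described above, using `itertools.repeat` (which exposes `__length_hint__`); whether the actual patch uses exactly this form is an assumption:

```python
>>> import itertools
>>> a = bytearray(b'\x00')
>>> bufsize = 16
>>> # One linear append of (bufsize - len(a)) zero bytes; repeat() reports its length hint.
>>> a[len(a):bufsize] = itertools.repeat(0, bufsize - len(a))
>>> len(a)
16
```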
Suggesting two new features to help implement this more effectively and make it easier to write zero-copy, zero-extra-allocation I/O loops:
1. A way to resize a `bytearray` in place from Python, without writing and copying filler bytes.
2. A zero-copy path from `bytearray` to `bytes`.
Internally, the latter (2) is infeasible: PyBytesObject ends with the immutable array of bytes, while PyByteArrayObject uses indirection via pointers. A CPython …
Is there any reason to optimize for `_pyio`?
@gpshead the theory for 2 (zero-copy `bytearray` → `bytes`): … This is underlying work as I try to get to be able to: …

Currently, patches I've written touching BufferedIO run into issues because the C and Python implementations don't quite match, yet people want them to. Bringing them into line helps me understand the code better, brings them in line with reviewer expectations, and demonstrates that I/O in Python can be as efficient as C (in both memory and CPU). The ideal/dream would be that, with interpreter improvements (e.g. JIT and free threading), the performance delta gets small enough that we can have one Python …
Ah, yes, if bytearray internally allocates a phantom bytes object for its backing memory, then I could see how that could be made real and returned in refcount=1 `bytes(bytearray)` scenarios. A little extra memory overhead, but bytearray overhead shouldn't matter much given that the best uses are large anyway... nice.
Hello, I am coming across this PR as I also noticed the horrible performance of … Basically, the readall() function is stupid: it's copying data 3 times by appending to a buffer. EDIT: forgot to say, this is the function …
Quick test with the code above: the test … Oddly enough, … I've added a print to debug how many bytes are read on the call …
EDIT: raising to 10 GB for debugging, the partial reads are relatively consistent around the 2 GB mark.
EDIT EDIT: that 10 GB run goes from ~25 seconds on main to ~18 seconds with the code above.
…9702) Co-authored-by: Victor Stinner <[email protected]>
…hon#129454)" (python#129500) This reverts commit e1c4ba9.
…9458)" (python#129572) This reverts commit f927204.
python#129702) Co-authored-by: Victor Stinner <[email protected]>
Utilize `bytearray.resize()` and `os.readinto()` to reduce copies and match behavior of `_io.FileIO.readall()`. There is still an extra copy which means twice the memory required compared to FileIO because there isn't a zero-copy path from `bytearray` -> `bytes` currently. On my system reading a 2 GB file: `./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read -v` Goes from ~2.7 seconds -> ~2.2 seconds Co-authored-by: Victor Stinner <[email protected]>
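A condensed sketch of what that combination looks like (illustrative names again; `bytearray.resize()` and `os.readinto()` are the APIs named in the commit above, and the final `bytes(result)` is the extra copy it mentions):

```python
import os

def readall(self):
    result = bytearray(64 * 1024)  # illustrative starting size
    bytes_read = 0
    while True:
        if bytes_read >= len(result):
            result.resize(len(result) * 2)  # grow in place, no filler-byte copies
        view = memoryview(result)[bytes_read:]
        try:
            n = os.readinto(self._fd, view)  # fill the spare space directly
        finally:
            view.release()  # release the export so resize() is allowed again
        if n == 0:  # EOF
            break
        bytes_read += n
    result.resize(bytes_read)  # shrink to the bytes actually read
    return bytes(result)       # the remaining bytearray -> bytes copy
```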
Feature or enhancement
Proposal:
Currently `_pyio` uses ~2x as much memory to read all data from a file compared to `_io`. This is because it makes more than one copy of the data.

Details from test_fileio run

Plan:

- Use `os.readv()` or `os.readinto()` to do a readinto like the C `_Py_read` used by `_io` does; `os.read()` can't take a buffer to use. This aligns behavior between `_io.FileIO.readall` and `_pyio.FileIO.readall`. `os.readv` works well today and takes a caller-allocated buffer rather than needing to add a new `os` API. `readv(2)` mirrors the behavior and errors of `read(2)`, so this should keep the same end behavior.
- Change `_pyio.BufferedIO` to not force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer then adds the result of `_pyio.FileIO.readall` to it.

For iterating, I'm using a small tracemalloc script to find where copies are:
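The script itself isn't shown above; a minimal sketch of that kind of tracemalloc comparison (the file name and snapshot keying are hypothetical) could look like:

```python
import tracemalloc
import _pyio

FILENAME = "big.bin"  # hypothetical large test file

tracemalloc.start(25)  # keep deep tracebacks for each allocation
with _pyio.FileIO(FILENAME, "r") as f:
    before = tracemalloc.take_snapshot()
    data = f.readall()
    after = tracemalloc.take_snapshot()

# The biggest new allocations point at where copies of the file data were made.
for stat in after.compare_to(before, "traceback")[:5]:
    print(stat)
    for line in stat.traceback.format():
        print("   ", line)
```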
Loose Ends:

- `os.readv` seems to be well supported but is currently guarded by a configure check. I'd like to just make pyio require `readv`, but can do conditional code if needed. If making `readv` non-optional generally is feasible, happy to work on that.
- `os.readv` is not supported on WASI, so need to add conditional code.

Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs
… `_pyio.FileIO.readall()` #129496