[BUG] DatetimeArray is not writable in cudf.pandas #17743

Open
galipremsagar opened this issue Jan 15, 2025 · 6 comments

@galipremsagar
Contributor

Describe the bug
When a DatetimeArray is created under cudf.pandas, its internal ndarray appears not to be writable.

Steps/Code to reproduce bug

import pandas as pd
i = pd.date_range("2016-01-01 01:01:00", periods=5)
print(i)
print(i._data)
print(i._data._ndarray.flags)

output:

(cudfdev) pgali@dgx19:/datasets/pgali/cudf$ python y1.py 
DatetimeIndex(['2016-01-01 01:01:00', '2016-01-02 01:01:00',
               '2016-01-03 01:01:00', '2016-01-04 01:01:00',
               '2016-01-05 01:01:00'],
              dtype='datetime64[ns]', freq='D')
<DatetimeArray>
['2016-01-01 01:01:00', '2016-01-02 01:01:00', '2016-01-03 01:01:00',
 '2016-01-04 01:01:00', '2016-01-05 01:01:00']
Length: 5, dtype: datetime64[ns]
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

(cudfdev) pgali@dgx19:/datasets/pgali/cudf$ python -m cudf.pandas y1.py 
DatetimeIndex(['2016-01-01 01:01:00', '2016-01-02 01:01:00',
               '2016-01-03 01:01:00', '2016-01-04 01:01:00',
               '2016-01-05 01:01:00'],
              dtype='datetime64[ns]', freq='D')
<DatetimeArray>
['2016-01-01 01:01:00', '2016-01-02 01:01:00', '2016-01-03 01:01:00',
 '2016-01-04 01:01:00', '2016-01-05 01:01:00']
Length: 5, dtype: datetime64[ns]
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : False # Expected True
  ALIGNED : True
  WRITEBACKIFCOPY : False

galipremsagar added the bug (Something isn't working) and Python (Affects Python cuDF API) labels Jan 15, 2025
galipremsagar self-assigned this Jan 15, 2025

@mroeschke
Contributor

It appears pyarrow is making this not writeable, e.g.:

In [4]: import pyarrow as pa

In [5]: import datetime

In [7]: pa.array([datetime.datetime(2020, 1, 1)]).to_pandas().array._ndarray.flags
Out[7]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False

@mroeschke
Contributor

Opened apache/arrow#45341 to track this.

@galipremsagar
Contributor Author

Though the problem is in pyarrow, the issue only shows up when using cudf.pandas. Can we not avert it?

@mroeschke
Contributor

mroeschke commented Jan 23, 2025

to_pandas is being called when i._data is accessed, which triggers the behavior shown in #17743 (comment).

But I think we can call pd.arrays.ArrowExtensionArray(pa_array).to_numpy(copy=) in to_pandas to sidestep Arrow always avoiding a copy, and so make the result writable.

@mroeschke
Contributor

OK, I realized we can use pyarrow.Array.to_numpy(zero_copy_only=False, writable=True) in mode.pandas_compatible to make the numpy array writable, but then we'll always be making a CPU copy of the data during to_pandas. Is that acceptable?

@vyasr
Contributor

vyasr commented Jan 24, 2025

So our options are:

  1. No copy but the result is not writeable
  2. With a copy and the result is writeable

If we take a step back for a second and forget about cudf.pandas, how do we expect this to behave if we did this explicitly in cudf? If we created a cudf.DatetimeIndex and called to_pandas on it, should the resulting object have a writeable _data array? Should that behavior perhaps be dependent on mode.pandas_compatible? It seems like this behavior is something we'd want to be decided in the to_pandas implementation regardless of whether we're using cudf.pandas or not.
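
The two options can be illustrated with plain NumPy. This is only an analogy, assuming an immutable bytes object as a stand-in for Arrow's immutable buffer. It does not involve cudf or pyarrow, and the variable names are illustrative.

```python
import numpy as np

# An immutable bytes object stands in for an immutable Arrow buffer (analogy only).
buf = np.arange(5, dtype="int64").tobytes()

# Option 1: no copy. NumPy marks the view read-only because the source is immutable.
view = np.frombuffer(buf, dtype="int64")
try:
    view[0] = 42
    write_failed = False
except ValueError:
    # NumPy refuses to write into a read-only array.
    write_failed = True

# Option 2: an explicit copy owns its memory and is writable.
copied = view.copy()
copied[0] = 42

print(view.flags.writeable, write_failed, copied.flags.writeable)
```

Option 2 leaves the original buffer untouched, at the price of one extra allocation per conversion.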
