Applying a row-wise UDF operation to cuDF #17756

Open
kev-pebble opened this issue Jan 16, 2025 · 1 comment
Labels
cudf.pandas Issues specific to cudf.pandas feature request New feature or request Needs Triage Need team to review and classify

Comments

@kev-pebble

Missing Pandas Feature Request
A clear and concise summary of the pandas function(s) you'd like to be able to run with cuDF.

I have a UDF that takes a pandas row as an argument, makes a network API call, and writes the result back to the DataFrame. According to the documentation, cuDF requires the function to be Numba-compatible, so I tried adding the @cuda.jit decorator; however, I didn't get the expected results.

Example code below:

import pandas as pd
import requests

# Example DataFrame
data = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "city": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)

# Define a function to make an API call
def fetch_city_info(row):
    """
    Fetch city information from a placeholder API.
    """
    city = row['city']
    try:
        # Replace the URL with a real API endpoint if needed
        response = requests.get(f"https://api.example.com/city_info?city={city}")
        if response.status_code == 200:
            return response.json().get("info", "No info available")
        else:
            return f"Error: {response.status_code}"
    except Exception as e:
        return f"Error: {str(e)}"

# Apply the function row-wise (no lambda wrapper needed)
df['city_info'] = df.apply(fetch_city_info, axis=1)

# Display the DataFrame with the fetched information
print(df)

It appears there are GPU execution constraints on UDFs that make network API calls, and performance also depends on API latency. Can API calls be run directly from the GPU, or do they need to be offloaded to the CPU? Please share suggestions on how I can make this operation work on a GPU.

@kev-pebble changed the title from "Applying a UDF row-wise operation to cuDF" to "Applying a row-wise UDF operation to cuDF" on Jan 16, 2025
@mroeschke
Contributor

Thanks for the report.

Yeah, I wouldn't expect a UDF that makes a network call to get any performance benefit from cuda.jit; it's designed to accelerate numerically oriented UDFs.

Even in vanilla pandas, your UDF with apply will be performance-limited, since it essentially makes these API calls serially. I would recommend doing these network calls outside pandas/cuDF and employing techniques like:

  1. Use a POST API if available
  2. Persist your network connection with requests.Session
  3. Use an async HTTP library to make these requests

and then reassign the results back into your DataFrame.
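
As a rough illustration of the third suggestion, here is a minimal sketch that fetches the values outside the DataFrame with asyncio and then maps them back. It is not code from this thread: `fetch_city_info` is a hypothetical stand-in that only simulates latency (a real version would issue the request with an async HTTP client such as aiohttp or httpx), and `fetch_all`, `max_concurrency`, and `info_by_city` are names made up for the example.

```python
import asyncio


async def fetch_city_info(city: str) -> str:
    """Hypothetical stand-in for an async HTTP request; no network is used."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"info for {city}"


async def fetch_all(cities, max_concurrency=10):
    """Fetch info for each unique city concurrently; returns {city: info}."""
    semaphore = asyncio.Semaphore(max_concurrency)  # caps in-flight requests

    async def bounded_fetch(city):
        async with semaphore:
            try:
                return await fetch_city_info(city)
            except Exception as exc:  # mirror the original UDF's error strings
                return f"Error: {exc}"

    # De-duplicate while keeping order so each city is requested only once.
    unique_cities = list(dict.fromkeys(cities))
    results = await asyncio.gather(*(bounded_fetch(c) for c in unique_cities))
    return dict(zip(unique_cities, results))


# In practice this list would come from the DataFrame column,
# e.g. df["city"].tolist() (assumed usage).
cities = ["New York", "Los Angeles", "Chicago", "Chicago"]
info_by_city = asyncio.run(fetch_all(cities))
city_info = [info_by_city[c] for c in cities]
print(city_info)
```

With a DataFrame, the mapping could then be written back in one step, for example `df["city_info"] = df["city"].map(info_by_city)`. The requests overlap instead of running one after another, which is where the time goes in the row-wise apply version.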
