Feature/support for loading datasets from dlt within databricks #1148

@chakravarthik27 chakravarthik27 commented Nov 28, 2024

This pull request changes langtest/datahandler/datasource.py and langtest/tasks/task.py to add support for Spark datasets and to improve file-extension handling. The main changes are a new SparkDataset class and updates to the __init__ and load methods so that they accept the new dataset type.

Support for Spark datasets:

  • Added a new SparkDataset class to handle Spark datasets, including methods for loading raw data and preprocessed data, and initializing a Spark session. (langtest/datahandler/datasource.py)
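A minimal sketch of what a class with these three responsibilities could look like. This is not the code merged in this PR: the class name SparkDatasetSketch, the dictionary keys data_source, feature_column and target_column, and the app name are assumptions made for illustration. The Spark session is created lazily, so pyspark is only imported when no session is passed in.

```python
from typing import Any, Dict, List, Optional, Tuple


class SparkDatasetSketch:
    """Illustrative stand-in for a Spark-backed dataset loader (hypothetical)."""

    def __init__(self, file_path: Dict[str, Any], spark: Optional[Any] = None) -> None:
        # Assumed keys: "data_source" holds the table name, the column keys
        # name the text and label columns.
        self.table_name = file_path["data_source"]
        self.feature_column = file_path.get("feature_column", "text")
        self.target_column = file_path.get("target_column", "label")
        self._spark = spark

    @property
    def spark(self) -> Any:
        """Return the injected session, or create one on first use."""
        if self._spark is None:
            from pyspark.sql import SparkSession  # lazy import

            self._spark = SparkSession.builder.appName("langtest-sketch").getOrCreate()
        return self._spark

    def load_raw_data(self) -> List[Dict[str, Any]]:
        """Read the whole table and return one plain dict per row."""
        df = self.spark.read.table(self.table_name)
        return [row.asDict() for row in df.collect()]

    def load_data(self) -> List[Tuple[Any, str]]:
        """Return (feature, label) pairs, with labels cast to strings."""
        return [
            (row[self.feature_column], str(row[self.target_column]))
            for row in self.load_raw_data()
        ]
```

Inside a Databricks notebook the existing `spark` session could be passed as the second argument; collecting the full table to the driver is only reasonable for small evaluation sets.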

Improvements to file extension handling:

  • Updated the __init__ method to set file_ext from the file_path dictionary if provided, and adjusted the logic to handle cases where the source key is present. (langtest/datahandler/datasource.py)
  • Modified the load method to include "spark" as a valid file extension for initializing the data source class. (langtest/datahandler/datasource.py)
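The extension handling above can be sketched roughly as follows. The function name resolve_file_ext, the file_ext and data_source keys, the precedence order, and the set of supported extensions are all assumptions for illustration; only the source key and the "spark" value come from this description.

```python
import os
from typing import Any, Dict

# Illustrative set only; the real list of supported extensions may differ.
SUPPORTED_EXTS = {"csv", "json", "jsonl", "spark"}


def resolve_file_ext(file_path: Dict[str, Any]) -> str:
    """Pick the loader key for a dataset config (hypothetical helper).

    Assumed precedence: explicit "file_ext", then "source", then the
    extension of the "data_source" path.
    """
    ext = file_path.get("file_ext")
    if not ext and "source" in file_path:
        ext = file_path["source"]
    if not ext:
        _, dotted = os.path.splitext(str(file_path.get("data_source", "")))
        ext = dotted.lstrip(".")
    ext = str(ext).lower()
    if ext not in SUPPORTED_EXTS:
        raise ValueError(f"Unsupported data source type: {ext!r}")
    return ext
```

A table name such as `catalog.schema.table` has no meaningful file suffix (splitting it would yield `table`), which is presumably why the source key has to be consulted before the path is inspected.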

Enhancements to task handling:

  • Updated the create_sample method to ensure labels are converted to strings before being added to the list. (langtest/tasks/task.py)
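As a rough illustration of the label conversion, assuming labels may arrive as integers or other non-string types (for example from a typed table column). The helper name normalize_labels is hypothetical and is not part of create_sample.

```python
from collections.abc import Iterable
from typing import Any, List


def normalize_labels(labels: Any) -> List[str]:
    """Return labels as a list of strings (hypothetical helper).

    A single value, including a single string, is wrapped in a list first.
    """
    if isinstance(labels, str) or not isinstance(labels, Iterable):
        labels = [labels]
    return [str(label) for label in labels]
```

With this, `normalize_labels([0, 1])` gives `["0", "1"]`, so integer class ids compare equal to string predictions downstream.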

@chakravarthik27 chakravarthik27 merged commit 0578ce6 into release/2.5.0 Dec 2, 2024
3 checks passed
@chakravarthik27 chakravarthik27 deleted the feature/support-for-loading-datasets-from-dlt-within-databricks branch December 27, 2024 09:33

Successfully merging this pull request may close these issues.

Support for Reading and Writing Datasets in Delta Live Tables (DLT) within Databricks