Feature/support for loading datasets from dlt within databricks #1148
This pull request includes several changes to the `langtest/datahandler/datasource.py` and `langtest/tasks/task.py` files to add support for Spark datasets and to improve the handling of file extensions. The most important changes are the addition of a new `SparkDataset` class and modifications to the `__init__` and `load` methods to accommodate the new dataset type.

**Support for Spark datasets:**

* Added a `SparkDataset` class to handle Spark datasets, including methods for loading raw and preprocessed data and for initializing a Spark session (`langtest/datahandler/datasource.py`). A sketch of what such a class might look like is given below.
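The implementation is not reproduced in this summary, so the following is a minimal, hypothetical sketch of a `SparkDataset` handler. Only the class name and its responsibilities (loading raw and preprocessed data, initializing a Spark session) come from the description above; the constructor signature, method names, and the use of `SparkSession.builder` / `spark.read.table` are assumptions.

```python
from typing import Any, Dict, List


class SparkDataset:
    """Hypothetical handler for Spark/Delta Live Tables sources in Databricks."""

    def __init__(self, file_path: Dict[str, Any], task=None, spark_session=None):
        # e.g. file_path = {"source": "spark", "data_source": "catalog.schema.table"}
        self._file_path = file_path
        self._task = task
        self._spark = self._init_spark_session(spark_session)

    @staticmethod
    def _init_spark_session(spark_session=None):
        """Reuse an existing SparkSession if one is passed in, otherwise create one."""
        if spark_session is not None:
            return spark_session
        from pyspark.sql import SparkSession
        return SparkSession.builder.appName("langtest").getOrCreate()

    def load_raw_data(self) -> List[Dict[str, Any]]:
        """Read the table referenced by the file_path dict and return plain dict rows."""
        df = self._spark.read.table(self._file_path["data_source"])
        return [row.asDict() for row in df.collect()]

    def load_data(self) -> List[Any]:
        """Turn raw rows into task-specific samples (the 'preprocessed' path)."""
        rows = self.load_raw_data()
        if self._task is None:
            return rows
        return [self._task.create_sample(row) for row in rows]
```

On Databricks, `SparkSession.builder.getOrCreate()` returns the cluster's active session, which is why reusing an existing session before creating a new one is a reasonable default.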
**Improvements to file extension handling:**

* Updated the `__init__` method to set `file_ext` from the `file_path` dictionary if provided, and adjusted the logic to handle cases where the `source` key is present (`langtest/datahandler/datasource.py`). [1] [2]
* Updated the `load` method to accept `"spark"` as a valid file extension when initializing the data source class (`langtest/datahandler/datasource.py`). A sketch of the combined dispatch logic follows this list.
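As a rough illustration of the two bullets above (building on the `SparkDataset` sketch earlier), the factory's `__init__` might resolve `file_ext` from either the dict or a plain path string, and `load` then maps `"spark"` to the new handler. The `DataFactory` name and the overall structure here are assumptions; only the `file_ext` / `source` keys and the `"spark"` entry come from the PR description.

```python
class DataFactory:
    """Illustrative factory; names other than file_ext, source, and "spark" are assumed."""

    def __init__(self, file_path, task=None):
        if isinstance(file_path, dict):
            if "source" in file_path:
                # e.g. {"source": "spark", "data_source": "catalog.schema.table"}
                self.file_ext = file_path["source"]
            else:
                # Fall back to an explicit file_ext key or the path's suffix.
                self.file_ext = file_path.get(
                    "file_ext", str(file_path["data_source"]).rsplit(".", 1)[-1]
                )
        else:
            # Plain path string: derive the extension from the suffix.
            self.file_ext = str(file_path).rsplit(".", 1)[-1]
        self._file_path = file_path
        self._task = task

    def load(self):
        handlers = {
            # ...existing file-based handlers (CSV, JSONL, etc.) would be registered here...
            "spark": SparkDataset,  # new: route "spark" sources to SparkDataset
        }
        if self.file_ext not in handlers:
            raise ValueError(f"Unsupported data source type: {self.file_ext!r}")
        return handlers[self.file_ext](self._file_path, task=self._task).load_data()
```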
**Enhancements to task handling:**

* Updated the `create_sample` method to ensure labels are converted to strings before being added to the list (`langtest/tasks/task.py`). A small sketch of this coercion is shown below.
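The diff itself is not shown in this summary, so the snippet below only illustrates the kind of coercion described: collecting labels from a row and wrapping each in `str()` before appending. It uses a hypothetical helper, `create_sample_labels`, rather than the actual `create_sample` classmethod, whose full signature and column names are not given above.

```python
from typing import Any, Dict, List


def create_sample_labels(row_data: Dict[str, Any], target_column: str = "label") -> List[str]:
    """Collect labels from a row, coercing each one to str (illustrative only).

    Spark rows can surface labels as ints, floats, or other non-string types,
    so each label is passed through str() before it is added to the list.
    """
    raw = row_data.get(target_column, [])
    if not isinstance(raw, list):
        raw = [raw]
    return [str(label) for label in raw]


# Example: a numeric label coming from a Spark row becomes the string "1".
assert create_sample_labels({"label": 1}) == ["1"]
```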