Feature/support for loading datasets from dlt within databricks #1148
This pull request includes several changes to the `langtest/datahandler/datasource.py` and `langtest/tasks/task.py` files to add support for Spark datasets and to improve the handling of file extensions. The most important changes are the addition of a new `SparkDataset` class and modifications to the `__init__` and `load` methods to accommodate the new dataset type.

**Support for Spark datasets:**

* Added a `SparkDataset` class to handle Spark datasets, including methods for loading raw and preprocessed data and for initializing a Spark session (`langtest/datahandler/datasource.py`). A sketch of what such a class might look like is given below.
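The implementation is not reproduced in this summary, so the following is a minimal, hypothetical sketch of a `SparkDataset` handler. Only the class name and its responsibilities (loading raw and preprocessed data, initializing a Spark session) come from the description above; the constructor signature, method names, and the use of `SparkSession.builder` / `spark.read.table` are assumptions.

```python
from typing import Any, Dict, List


class SparkDataset:
    """Hypothetical handler for Spark/Delta Live Tables sources in Databricks."""

    def __init__(self, file_path: Dict[str, Any], task=None, spark_session=None):
        # e.g. file_path = {"source": "spark", "data_source": "catalog.schema.table"}
        self._file_path = file_path
        self._task = task
        self._spark = self._init_spark_session(spark_session)

    @staticmethod
    def _init_spark_session(spark_session=None):
        """Reuse an existing SparkSession if one is passed in, otherwise create one."""
        if spark_session is not None:
            return spark_session
        from pyspark.sql import SparkSession
        return SparkSession.builder.appName("langtest").getOrCreate()

    def load_raw_data(self) -> List[Dict[str, Any]]:
        """Read the table referenced by the file_path dict and return plain dict rows."""
        df = self._spark.read.table(self._file_path["data_source"])
        return [row.asDict() for row in df.collect()]

    def load_data(self) -> List[Any]:
        """Turn raw rows into task-specific samples (the 'preprocessed' path)."""
        rows = self.load_raw_data()
        if self._task is None:
            return rows
        return [self._task.create_sample(row) for row in rows]
```

On Databricks, `SparkSession.builder.getOrCreate()` returns the cluster's active session, which is why reusing an existing session before creating a new one is a reasonable default.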
**Improvements to file extension handling:**

* Updated the `__init__` method to set `file_ext` from the `file_path` dictionary if provided, and adjusted the logic to handle cases where the `source` key is present (`langtest/datahandler/datasource.py`). [1] [2]
* Updated the `load` method to accept `"spark"` as a valid file extension when initializing the data source class (`langtest/datahandler/datasource.py`). A sketch of the combined dispatch logic follows this list.
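As a rough illustration of the two bullets above (building on the `SparkDataset` sketch earlier), the factory's `__init__` might resolve `file_ext` from either the dict or a plain path string, and `load` then maps `"spark"` to the new handler. The `DataFactory` name and the overall structure here are assumptions; only the `file_ext` / `source` keys and the `"spark"` entry come from the PR description.

```python
class DataFactory:
    """Illustrative factory; names other than file_ext, source, and "spark" are assumed."""

    def __init__(self, file_path, task=None):
        if isinstance(file_path, dict):
            if "source" in file_path:
                # e.g. {"source": "spark", "data_source": "catalog.schema.table"}
                self.file_ext = file_path["source"]
            else:
                # Fall back to an explicit file_ext key or the path's suffix.
                self.file_ext = file_path.get(
                    "file_ext", str(file_path["data_source"]).rsplit(".", 1)[-1]
                )
        else:
            # Plain path string: derive the extension from the suffix.
            self.file_ext = str(file_path).rsplit(".", 1)[-1]
        self._file_path = file_path
        self._task = task

    def load(self):
        handlers = {
            # ...existing file-based handlers (CSV, JSONL, etc.) would be registered here...
            "spark": SparkDataset,  # new: route "spark" sources to SparkDataset
        }
        if self.file_ext not in handlers:
            raise ValueError(f"Unsupported data source type: {self.file_ext!r}")
        return handlers[self.file_ext](self._file_path, task=self._task).load_data()
```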
**Enhancements to task handling:**

* Updated the `create_sample` method to ensure labels are converted to strings before being added to the list (`langtest/tasks/task.py`). A small sketch of this coercion is shown below.
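The diff itself is not shown in this summary, so the snippet below only illustrates the kind of coercion described: collecting labels from a row and wrapping each in `str()` before appending. It uses a hypothetical helper, `create_sample_labels`, rather than the actual `create_sample` classmethod, whose full signature and column names are not given above.

```python
from typing import Any, Dict, List


def create_sample_labels(row_data: Dict[str, Any], target_column: str = "label") -> List[str]:
    """Collect labels from a row, coercing each one to str (illustrative only).

    Spark rows can surface labels as ints, floats, or other non-string types,
    so each label is passed through str() before it is added to the list.
    """
    raw = row_data.get(target_column, [])
    if not isinstance(raw, list):
        raw = [raw]
    return [str(label) for label in raw]


# Example: a numeric label coming from a Spark row becomes the string "1".
assert create_sample_labels({"label": 1}) == ["1"]
```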