- Detection of problematic data slices
- Basic explanation of found issues via feature importances
- Limited embedding computation for images, audio, text
- Extended embedding support, e.g., more embedding models and support for precomputed embeddings
- Speed up embedding computation using the datasets library
- Improved issue detection algorithm, avoiding duplicate detections of similar problems and preventing outliers from influencing segment detection
- Support application on datasets without labels (outlier-based)
- Adaptive drop reference for datasets that contain a wide variety of data
- Large data support for detection and reporting, e.g., 500k audio samples with transcriptions
- Alternative interfaces to min_drop and min_support. Maybe n_slices and sorting by a criterion?
- Support application without model (by training simple baseline model)
- Improve normalization for mixed-type runs, e.g., an embedding plus one categorical or numeric variable.
- Walkthroughs for unstructured, structured and mixed data. Also, an in-depth tutorial explaining all the parameters.
- Soft dependencies for embedding computation and AutoML, as the torch and xgboost dependencies are large
- Per use case helpers such as find_issues_object_detection, find_issues_ts_forecasting, ...
- Allow for model comparisons via intersection, difference, ...
- Allow application of sliceguard on time series
- Add Sliceguard deep-dive notebook to show more advanced usage
- Build Sphinx docs
- Stronger automated testing
- Make the outlier detection algorithm more robust, probably through better parameter choices.
- Interpretable features for images, audio, text. E.g., dark image, quiet audio, long audio, contains common word x, ...
- Generation of a summary report doing predefined checks
- "Supervised" clustering that incorporates classes, probabilities, metrics, not only features
- Data connectors for faster application on common data formats
- Support embedding generation for remote resources, e.g., audio/images hosted on web servers
- Improved explanations for found issues, e.g., via SHAP
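
The interface idea from the list above (selecting a fixed number of slices sorted by a criterion instead of thresholding with min_drop and min_support) could look roughly like the sketch below. This is only an illustration and not sliceguard's API: `CandidateSlice`, `select_by_threshold`, `select_top_n` and the example values are hypothetical, and the meaning given here to min_drop (minimum metric drop) and min_support (minimum number of samples) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CandidateSlice:
    """Hypothetical container for a detected slice (not a sliceguard type)."""
    name: str
    support: int  # assumed: number of samples in the slice
    drop: float   # assumed: metric drop relative to a reference value


def select_by_threshold(
    slices: List[CandidateSlice], min_drop: float, min_support: int
) -> List[CandidateSlice]:
    """Threshold-style selection: keep slices that are both bad enough and large enough."""
    return [s for s in slices if s.drop >= min_drop and s.support >= min_support]


def select_top_n(
    slices: List[CandidateSlice],
    n_slices: int,
    criterion: Callable[[CandidateSlice], float] = lambda s: s.drop,
) -> List[CandidateSlice]:
    """Ranking-style selection: sort by a criterion (highest first) and keep n_slices."""
    return sorted(slices, key=criterion, reverse=True)[:n_slices]


# Made-up example values, purely for illustration.
candidates = [
    CandidateSlice("dark images", support=120, drop=0.15),
    CandidateSlice("quiet audio", support=40, drop=0.30),
    CandidateSlice("long audio", support=300, drop=0.05),
    CandidateSlice("rare class", support=8, drop=0.40),
]

thresholded = select_by_threshold(candidates, min_drop=0.1, min_support=20)
top_two = select_top_n(candidates, n_slices=2)
# A criterion that weighs the drop by the slice size favors larger slices.
weighted = select_top_n(candidates, n_slices=2, criterion=lambda s: s.drop * s.support)

print([s.name for s in thresholded])
print([s.name for s in top_two])
print([s.name for s in weighted])
```

A ranking interface always returns a result of known size, which may be easier to work with than tuning two thresholds, while the choice of criterion decides whether small but severe slices or large but mild ones come first.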