tests: Tests with different file formats #8

Merged: 2 commits merged into yobix-ai:main from 2-tests-for-different-formats on Oct 2, 2024

Conversation

KapiWow (Collaborator) commented Sep 29, 2024

  • test documents and expected results are in test_files dir
  • integration tests for rust interface
  • integration tests for python interface
  • tested formats: pdf, docx, pptx, doc, odt, csv, png, xlsx, epub

Issue-ID: 2

The test files were taken from the unstructured repository, and the expected result files were also generated with the unstructured library. Hopefully their library works well with their own test files.

I used cosine_similarity instead of Levenshtein distance because Levenshtein takes about 20 seconds to compare the extracted PDF text with the expected text.
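
To illustrate the trade-off described above, here is a minimal standard-library sketch. It is not the code from this PR: the dependency change suggests the Python tests rely on scikit-learn, whereas the word-count tokenization and the function names below are assumptions made for illustration. Cosine similarity over word counts runs in time roughly linear in the text length, while the classic Levenshtein dynamic program needs time proportional to the product of the two lengths, which is the likely reason it is slow on long extracted PDF text.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of the word-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b)) time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]
```

Note that a word-count cosine similarity ignores word order, so it is a looser check than an edit distance. That is a reasonable fit for comparing extracted text against reference output produced by a different library.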

nmammeri (Contributor) left a comment:

Thanks Anton, this looks really good and the tests are running fine. Please address the comments below and I'll approve it.

#[test_case("simple.pptx", 0.9; "Test another PPTX file")]
#[test_case("table-multi-row-column-cells.png", -1.0; "Test PNG file")]
#[test_case("winter-sports.epub", 0.9; "Test EPUB file")]
fn test_extractor(file_name: &str, target_dist: f64) {

I would name this test_extract_file_to_string, as it is testing the extract-to-string functionality.
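
As a side note on the thresholds in the quoted test cases, the sketch below shows how they might be read. The excerpt does not show the assertion itself, so the pass rule (measured similarity must exceed the target) is an assumption, and the names CASES, meets_target, and always_passes are hypothetical. Under that assumption, and given that the cosine similarity of word-count vectors lies between 0 and 1, a target of -1.0 means the PNG case can never fail on similarity.

```python
# Hypothetical threshold table mirroring the three test_case lines quoted above.
CASES = [
    ("simple.pptx", 0.9),
    ("table-multi-row-column-cells.png", -1.0),
    ("winter-sports.epub", 0.9),
]


def meets_target(similarity: float, target: float) -> bool:
    """Assumed pass rule: the measured similarity must exceed the target."""
    return similarity > target


def always_passes(target: float) -> bool:
    """True when no similarity in [0, 1] can fail the assumed pass rule."""
    return target < 0.0


# List the cases whose threshold effectively disables the similarity check.
unchecked = [name for name, target in CASES if always_passes(target)]
```

If the real test uses a different comparison, only meets_target would need to change.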

.extract_file_to_string(&format!("../test_files/documents/{}", file_name))
.unwrap();
// read expected string
let mut expected =

expected doesn't need to be mutable.

fn test_extractor(file_name: &str, target_dist: f64) {
let extractor = Extractor::new().set_extract_string_max_length(1000000);
// extract file with extractor
let mut extracted = extractor

extracted doesn't need to be mutable.

@@ -23,7 +23,7 @@ requires-python = ">=3.8,<3.13"
docs = ["pdoc"]
# To run tests using pytest we need to run:
# pytest -s
-test = ["pytest"]
+test = ["pytest","sklearn"]

pip reports that the sklearn package is deprecated and should be replaced by scikit-learn.

Maybe rename the file to test_extract_file_to_string.py. pytest shows the filename on the console, so it would be easier to tell which test is running from the console output.

@KapiWow KapiWow force-pushed the 2-tests-for-different-formats branch from baca44b to 69a24ed Compare October 2, 2024 10:43
KapiWow (Collaborator, Author) commented Oct 2, 2024

All comments are addressed.

@nmammeri nmammeri merged commit ad7a6ac into yobix-ai:main Oct 2, 2024
5 of 6 checks passed