Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: load datasets with repeated items #92

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

chrislawson-au
Copy link
Contributor

Came across this interesting scenario which I think can be fixed.
If the data we are trying to load has a repeated line item (e.g., two SALE items) the code is checking to see if all the data is the same but before this fix, it throws an error.
We could also add some tests for this.
What do you think?

@nickderobertis
Copy link
Owner

Hmm, could you describe a little better what you're trying to do? You're saying the input has two line items that are labelled the same but have different values? What are you suggesting that finstmt do in this case? I'm not sure anything other than throwing an error makes sense, because how would we know which if the two identically-named line items to pick?

If you think this is a common enough problem, we could consider configuration to throw a warning and take the first row rather than erroring out.

@@ -74,7 +74,7 @@ def from_series(
if item_config.key in data_dict:
# Multiple matches for data item.
# First see if data is the same, then just skip
if for_lookup[name] == data_dict[item_config.key]:
if (for_lookup[name] == data_dict[item_config.key]).all():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickderobertis
the situation I had was a data set that had the same data item twice.

for example, the query I ran from a DB had two line items for Inventory (arguably a mistake), the numbers were the same. It was just a duplicate line item, but I think this problem exists even if the duplicate line item has different data.

Here on line 77 I got an error. I think we need to add .all() to check if all the items in the data series are the same.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, yeah the code was definitely intended to handle duplicate items by skipping so if it is not doing that without this patch then we definitely need to get it in.

If you look at the test results it seems like this line change is causing an issue though. I'm also confused about why we would need this here, because for_lookup is a series and so for_lookup[name] should be a single value rather than a series. That's why the tests are erroring saying bool has no attribute all.

What error did you encounter with loading data with duplicate items? Perhaps a best next step on this is to revert this change and add two unit tests for loading a statement with duplicate items: one with the same numbers and one with different numbers. Then we will be able to see the error on CI and once all tests pass we will know this works properly for every case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants