CLDR-13589 Add unit test that locale time cycle matches region prefs #4383

Draft
wants to merge 5 commits into
base: main
from

Conversation

btangmu
Member

@btangmu btangmu commented Feb 18, 2025

-Similar to icu4j DateTimeGeneratorTest.testJjMapping

-Use current CLDR data, not ICU data

CLDR-13589

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

-Copy icu4j DateTimeGeneratorTest.testJjMapping almost verbatim
ULocale[] locales = DateFormat.getAvailableULocales();
for (ULocale locale : locales) {
    String localeID = locale.getName();
    DateTimePatternGenerator dtpg = DateTimePatternGenerator.getInstance(locale);
Contributor

@pedberg-icu pedberg-icu Feb 19, 2025

Here we want to populate the DateTimePatternGenerator with current CLDR data, not ICU data. This typically uses ICUServiceBuilder. See for example: tools/cldr-code/src/main/java/org/unicode/cldr/test/FlexibleDateFromCLDR.java

for (ULocale locale : locales) {
    String localeID = locale.getName();
    DateTimePatternGenerator dtpg = DateTimePatternGenerator.getInstance(locale);
    DateFormat dfmt = DateFormat.getTimeInstance(DateFormat.SHORT, locale);
Contributor

@pedberg-icu pedberg-icu Feb 19, 2025

And again here we want to get the current CLDR data, using a CLDRFile method, instead of the ICU data.

Contributor

@pedberg-icu pedberg-icu left a comment

Need to revise to use CLDR data instead of ICU data

-Similar to icu4j DateTimeGeneratorTest.testJjMapping

-Use current CLDR data, not ICU data
@btangmu

This comment was marked as outdated.

// Compare ICU data version:
// DateTimePatternGenerator dtpg = DateTimePatternGenerator.getInstance(uloc);

DateTimePatternGenerator dtpg = DateTimePatternGenerator.getEmptyInstance();
Member Author

getEmptyInstance? Is it OK that dtpg itself isn't associated with the locale? Instead it's icuServiceBuilder that's locale-specific.

Contributor

You will need to load up the DateTimePatternGenerator yourself with the CLDR data for the locale, it does not automatically do that. Take a look at tools/cldr-code/src/main/java/org/unicode/cldr/test/FlexibleDateFromCLDR.java (in fact you may be able to use that class in this test)

// DateFormat dfmt = DateFormat.getTimeInstance(DateFormat.SHORT, uloc);

SimpleDateFormat dfmt =
icuServiceBuilder.getDateFormat(LDMLConstants.GREGORIAN, jPattern);
Member Author

@btangmu btangmu Feb 19, 2025

DateFormat.getTimeInstance specifies "time" while icuServiceBuilder.getDateFormat doesn't. I didn't see any method like icuServiceBuilder.getTime...

I'm just guessing about jPattern here, versus jPatSkeleton or shortPatSkeleton or something completely different.

Member

Speaking generally here.

You have to be careful not to use "plain" ICU calls for formatting in tests or examples, since that will use ICU's (old) copy of CLDR data, not what is in the repository. So it won't test what needs to be tested.

If ICUServiceBuilder has API for what you want, that's great. Internally, it uses API code, but makes calls that load up CLDR data and then calls mechanisms that build the ICU classes using CLDR data.

There are some cases where ICUServiceBuilder doesn't have an API for what you want. In that case, do a search for the class you want to use on the ICU side within the CLDR code. For example, for DateTimePatternGenerator, there are a few places in the code that build it up, like

CheckDates:
DateTimePatternGenerator dateTimePatternGenerator = DateTimePatternGenerator.getEmptyInstance();

(This ought to be more centralized and documented.) I have to run to a meeting, but will check later.

@btangmu btangmu requested a review from pedberg-icu February 19, 2025 18:31
@btangmu btangmu marked this pull request as draft February 20, 2025 03:33
-This is only a test; draft PR
@btangmu
Member Author

btangmu commented Feb 20, 2025

I've added a 3rd commit, with comments and debugging, and changed this to a DRAFT PR.

Calling FlexibleDateFromCLDR does result in some localized values, but usually not the ones matching the original ICU test, maybe involving the lack of specifying anything like DateFormat.SHORT...

@@ -30,7 +30,7 @@
  *
  * @author markdavis
  */
-class FlexibleDateFromCLDR {
+public class FlexibleDateFromCLDR {
Member Author

There's an old (?) comment above, "Temporary class while refactoring", which casts suspicion on making this public

@pedberg-icu
Contributor

pedberg-icu commented Feb 20, 2025

Hmm. It may be that using DateTimePatternGenerator to handle the 'j' character is not the right way to approach this. Basically what we are trying to do is make sure that the gregorian and/or preferred-calendar (if different) time formats in each locale - which may be inherited - use the preferred hour cycle for the region or region-language combination as specified by the supplemental <timeData>. I think we should probably do this without using ICU for locale data at all. I think we can do the following:

  1. Get the mapping from region (or language_region) to preferred hour cycle as follows:
    CLDRConfig testInfo = CLDRConfig.getInstance();
    SupplementalDataInfo sdi = testInfo.getSupplementalDataInfo();
    Map<String, PreferredAndAllowedHour> timeData = sdi.getTimeData();
  2. Get an empty DateTimePatternGenerator; we will only use it to parse time patterns, so it needs no locale data:
    DateTimePatternGenerator dtpg = DateTimePatternGenerator.getEmptyInstance();
  3. Then get the list of locales and loop:
    // CLDRConfig testInfo = CLDRConfig.getInstance(); // already done above
    Factory cldrFactory = testInfo.getCldrFactory();
    for (String localeID : cldrFactory.getAvailable()) {
  4. For each locale, get the preferred hour:
        // first try the locale as is:
        PreferredAndAllowedHour prefAndAllowedHr = timeData.get(localeID);
        // if that does not work (returns null?), try the locale's region: if it has a region part, use that,
        // otherwise get the region from likelySubtags; you can use sdi.getLikelySubtags() to get a map
        String region = ...
        prefAndAllowedHr = timeData.get(region);
        // if that does not work, try "001" world
        prefAndAllowedHr = timeData.get("001");
        // then get the hour cycle from prefAndAllowedHr, have not yet figured out how to do that
  5. Then for each locale, use the CLDR file to get the short time format for the Gregorian calendar:
        CLDRFile cldrFile = testInfo.getCLDRFile(localeID, true);
        String gregoShortTimePath = "//ldml/dates/calendars/calendar[@type=\"gregorian\"]/timeFormats/timeFormatLength[@type=\"short\"]/timeFormat/pattern";
        String shortTimeString = cldrFile.getWinningValue(gregoShortTimePath);
        // convert to skeleton to make it easier to check for hour character (eliminates literals etc.):
        String shortTimeSkeleton = dtpg.getBaseSkeleton(shortTimeString);
        // then see whether the shortTimeSkeleton uses the preferred hour cycle from step 4, error if it does not
  6. As a bonus step, you can find the preferred calendar for the locale (based on data from SupplementalDataInfo.getCalendars(region)), and if it is not Gregorian then also check the shortTimeFormat for that...
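
The lookup order in step 4 (locale ID, then region, then "001") can be sketched in plain Java. This is only an illustration: `PreferredHourLookup` and `preferredHour` are hypothetical names, the `Map<String, String>` stands in for the `Map<String, PreferredAndAllowedHour>` returned by `sdi.getTimeData()`, and the sample values are illustrative rather than copied from the supplemental data.

```java
import java.util.Map;

/** Hypothetical helper; the map is a stand-in for the CLDR timeData map. */
class PreferredHourLookup {

    /**
     * Returns the preferred hour character, trying the full locale ID first,
     * then the region, then "001" (world). Returns null only if "001" is missing too.
     */
    static String preferredHour(Map<String, String> timeData, String localeID, String region) {
        String pref = timeData.get(localeID);
        if (pref == null && region != null) {
            pref = timeData.get(region);
        }
        if (pref == null) {
            pref = timeData.get("001");
        }
        return pref;
    }

    public static void main(String[] args) {
        // Illustrative sample values only.
        Map<String, String> timeData = Map.of("001", "H", "US", "h", "fr_CA", "H");
        System.out.println(preferredHour(timeData, "fr_CA", "CA")); // found by locale ID
        System.out.println(preferredHour(timeData, "en_US", "US")); // found by region
        System.out.println(preferredHour(timeData, "xx", null));    // falls back to "001"
    }
}
```

Keeping the three lookups in one method makes it easy to log which level supplied the answer, which would help when investigating unexpected fallbacks to "001".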

@btangmu
Member Author

btangmu commented Feb 20, 2025

@pedberg-icu I'm starting to have some success with that, thanks! The path you gave doesn't seem quite right. I'm using this instead, OK?

//ldml/dates/calendars/calendar[@type="gregorian"]/timeFormats/timeFormatLength[@type="short"]/timeFormat[@type="standard"]/pattern[@type="standard"]

-This is only a test; draft PR
@btangmu
Member Author

btangmu commented Feb 20, 2025

The 4th commit uses PreferredAndAllowedHour and //ldml/dates/calendars/calendar[@type="gregorian"]/timeFormats/timeFormatLength[@type="short"]/timeFormat[@type="standard"]/pattern[@type="standard"]

It fails for locale "aa", since timeData.get("aa") returns null, and new LocaleIDParser().set("aa").getRegion() returns null.

LocaleIDParser.setRegion has only one caller, FlexibleDateTime.DeprecatedCodeFixer.fixLocale, which seemingly is never called, at least not for "aa".

Is there some other way to get a region/territory from a locale ID?

@macchiati
Member

You should be calling LikelySubtags to fill in the region if there isn't one.

@pedberg-icu
Contributor

> @pedberg-icu I'm starting to have some success with that, thanks! The path you gave doesn't seem quite right. I'm using this instead, OK?
>
> //ldml/dates/calendars/calendar[@type="gregorian"]/timeFormats/timeFormatLength[@type="short"]/timeFormat[@type="standard"]/pattern[@type="standard"]

Yes you are right, I had forgotten about the (implied) type="standard"

@pedberg-icu
Contributor

pedberg-icu commented Feb 20, 2025

> Tom: It fails for locale "aa", since timeData.get("aa") returns null, and new LocaleIDParser().set("aa").getRegion() returns null. ... Is there some other way to get a region/territory from a locale ID?
>
> Mark: You should be calling LikelySubtags to fill in the region if there isn't one.

Yes. Tom, see my step 4, I had mentioned "you can use sdi.getLikelySubtags() to get a map". This returns a Map<String, String> that should mirror the content of common/supplemental/likelySubtags.xml, which will map e.g. "aa" to "aa_Latn_ET" and then you can use LocaleIDParser to get the region from that.
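
A minimal sketch of that region resolution, under stated assumptions: `RegionResolver`, `regionOf`, and `resolveRegion` are hypothetical names, the plain `Map<String, String>` stands in for the map from `sdi.getLikelySubtags()`, and the subtag parsing is a simplified stand-in for `LocaleIDParser`, not its actual logic.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of the CLDR tooling. */
class RegionResolver {

    /** Returns the region subtag of an underscore-separated locale ID, or null if it has none. */
    static String regionOf(String localeID) {
        String[] parts = localeID.split("_");
        for (int i = 1; i < parts.length; i++) {
            // A region is two letters or three digits; scripts have four letters,
            // and variants are longer or start with a digit followed by letters.
            if (parts[i].matches("[A-Za-z]{2}|[0-9]{3}")) {
                return parts[i].toUpperCase(Locale.ROOT);
            }
        }
        return null;
    }

    /** Uses the locale's own region if present, otherwise the region of its likely-subtags entry. */
    static String resolveRegion(String localeID, Map<String, String> likelySubtags) {
        String region = regionOf(localeID);
        if (region == null) {
            String maximized = likelySubtags.get(localeID);
            if (maximized != null) {
                region = regionOf(maximized);
            }
        }
        return region;
    }

    public static void main(String[] args) {
        // The "aa" mapping is the one quoted above; the map is otherwise a stand-in.
        Map<String, String> likelySubtags = Map.of("aa", "aa_Latn_ET");
        System.out.println(resolveRegion("aa", likelySubtags));
        System.out.println(resolveRegion("en_US", likelySubtags));
    }
}
```

In this sketch a locale ID with no region and no direct likely-subtags entry yields null, so the caller still needs the "001" fallback.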

-This is only a test; draft PR
@btangmu
Member Author

btangmu commented Feb 20, 2025

The 5th commit uses LikelySubtags. These 53 failures now occur, which is better than before:

1 locale apc, calendar gregorian, expected h to occur in both patterns h and HH:mm
2 locale apc_SY, calendar gregorian, expected h to occur in both patterns h and HH:mm
3 locale arn, calendar gregorian, expected h to occur in both patterns h and HH:mm
4 locale arn_CL, calendar gregorian, expected h to occur in both patterns h and HH:mm
5 locale az_Arab_IQ, calendar gregorian, expected h to occur in both patterns h and HH:mm
6 locale bal_Arab, calendar gregorian, expected H to occur in both patterns H and hh:mm a
7 locale bal_Latn, calendar gregorian, expected H to occur in both patterns H and hh:mm a
8 locale bgn, calendar gregorian, expected h to occur in both patterns h and HH:mm
9 locale bgn_AE, calendar gregorian, expected h to occur in both patterns h and HH:mm
10 locale bgn_OM, calendar gregorian, expected h to occur in both patterns h and HH:mm
11 locale bgn_PK, calendar gregorian, expected h to occur in both patterns h and HH:mm
12 locale cho, calendar gregorian, expected h to occur in both patterns h and HH:mm
13 locale cho_US, calendar gregorian, expected h to occur in both patterns h and HH:mm
14 locale cop, calendar gregorian, expected h to occur in both patterns h and HH:mm
15 locale cop_EG, calendar gregorian, expected h to occur in both patterns h and HH:mm
16 locale el_POLYTON, calendar gregorian, expected H to occur in both patterns H and h:mm a
17 locale en_Dsrt_US, calendar gregorian, expected h to occur in both patterns h and HH:mm
18 locale gn, calendar gregorian, expected h to occur in both patterns h and HH:mm
19 locale gn_PY, calendar gregorian, expected h to occur in both patterns h and HH:mm
20 locale ha_Arab_SD, calendar gregorian, expected h to occur in both patterns h and HH:mm
21 locale hi_Latn, calendar gregorian, expected H to occur in both patterns H and h:mm a
22 locale hnj, calendar gregorian, expected h to occur in both patterns h and HH:mm
23 locale hnj_Hmnp_US, calendar gregorian, expected h to occur in both patterns h and HH:mm
24 locale iu_Latn_CA, calendar gregorian, expected h to occur in both patterns h and HH:mm
25 locale kok_Deva, calendar gregorian, expected H to occur in both patterns H and h:mm a
26 locale kok_Latn, calendar gregorian, expected H to occur in both patterns H and a h:mm
27 locale ks_Arab, calendar gregorian, expected H to occur in both patterns H and h:mm a
28 locale ks_Deva, calendar gregorian, expected H to occur in both patterns H and a h:mm
29 locale kxv_Deva, calendar gregorian, expected H to occur in both patterns H and h:mm a
30 locale kxv_Latn, calendar gregorian, expected H to occur in both patterns H and h:mm a
31 locale kxv_Orya, calendar gregorian, expected H to occur in both patterns H and h:mm a
32 locale kxv_Telu, calendar gregorian, expected H to occur in both patterns H and h:mm a
33 locale mni_Beng, calendar gregorian, expected H to occur in both patterns H and h:mm a
34 locale mni_Mtei, calendar gregorian, expected H to occur in both patterns H and h.mm. a
35 locale ms_Arab, calendar gregorian, expected H to occur in both patterns H and h:mm a
36 locale nv, calendar gregorian, expected h to occur in both patterns h and HH:mm
37 locale nv_US, calendar gregorian, expected h to occur in both patterns h and HH:mm
38 locale pa_Guru, calendar gregorian, expected H to occur in both patterns H and h:mm a
39 locale pis, calendar gregorian, expected h to occur in both patterns h and HH:mm
40 locale pis_SB, calendar gregorian, expected h to occur in both patterns h and HH:mm
41 locale quc, calendar gregorian, expected h to occur in both patterns h and HH:mm
42 locale quc_GT, calendar gregorian, expected h to occur in both patterns h and HH:mm
43 locale sat_Deva, calendar gregorian, expected H to occur in both patterns H and h:mm a
44 locale sat_Olck, calendar gregorian, expected H to occur in both patterns H and h:mm a
45 locale sd_Arab, calendar gregorian, expected H to occur in both patterns H and h:mm a
46 locale sdh_IQ, calendar gregorian, expected h to occur in both patterns h and HH:mm
47 locale skr, calendar gregorian, expected h to occur in both patterns h and HH:mm
48 locale skr_PK, calendar gregorian, expected h to occur in both patterns h and HH:mm
49 locale vai_Latn, calendar gregorian, expected H to occur in both patterns H and h:mm a
50 locale vai_Vaii, calendar gregorian, expected H to occur in both patterns H and h:mm a
51 locale wbp, calendar gregorian, expected h to occur in both patterns h and HH:mm
52 locale wbp_AU, calendar gregorian, expected h to occur in both patterns h and HH:mm
53 locale yue_Hant, calendar gregorian, expected H to occur in both patterns H and ah:mm
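
To make the failure messages concrete, here is a minimal sketch of the comparison for a single locale. It deliberately swaps in a hand-written scan of the pattern instead of `dtpg.getBaseSkeleton`, and `HourCycleCheck`, `hourLetter`, and `usesPreferredHour` are hypothetical names; the sample patterns are taken from the list above.

```java
/** Hypothetical helper; a stand-in for the skeleton-based check in the PR. */
class HourCycleCheck {

    /** Returns the first hour letter (h, H, k, K) outside quoted literals, or '\0' if none. */
    static char hourLetter(String pattern) {
        boolean inQuote = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\'') {
                // A doubled quote toggles twice, so the quoting state is unchanged.
                inQuote = !inQuote;
            } else if (!inQuote && "hHkK".indexOf(c) >= 0) {
                return c;
            }
        }
        return '\0';
    }

    /** True if the pattern's hour letter equals the preferred hour letter. */
    static boolean usesPreferredHour(char preferred, String pattern) {
        return hourLetter(pattern) == preferred;
    }

    public static void main(String[] args) {
        // Like failure 1: preferred 'h', but the short pattern is HH:mm.
        System.out.println(usesPreferredHour('h', "HH:mm"));
        // Like failure 16: preferred 'H', but the short pattern is h:mm a.
        System.out.println(usesPreferredHour('H', "h:mm a"));
    }
}
```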

for (char timeCycleChar : timeCycleChars) {
    boolean has1 = jPatSkeleton.indexOf(timeCycleChar) >= 0;
    boolean has2 = shortPatSkeleton.indexOf(timeCycleChar) >= 0;
    if (has1 && !has2) {
Member Author

This check (like the original ICU code) is asymmetric: it never tests (has2 && !has1). However, I tried (has1 != has2) and got exactly twice the number of failures, for example:

105 locale yue_Hant, calendar gregorian, expected H to occur in both patterns H and ah:mm
106 locale yue_Hant, calendar gregorian, expected h to occur in both patterns H and ah:mm

I guess (has2 && !has1) could be more significant if jPatSkeleton had none of h, H, k, K, but at least currently that's not the case
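
The doubling can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the PR's code: `SymmetryDemo` and `countFailures` are hypothetical names, and it assumes the time cycle characters are h, H, k, K and that each pattern contains exactly one of them.

```java
/** Hypothetical demo of the asymmetric versus symmetric comparison. */
class SymmetryDemo {
    static final char[] TIME_CYCLE_CHARS = {'h', 'H', 'k', 'K'};

    /** Counts the characters that would be reported as failures under either rule. */
    static int countFailures(String jPatSkeleton, String shortPatSkeleton, boolean symmetric) {
        int failures = 0;
        for (char c : TIME_CYCLE_CHARS) {
            boolean has1 = jPatSkeleton.indexOf(c) >= 0;
            boolean has2 = shortPatSkeleton.indexOf(c) >= 0;
            boolean mismatch = symmetric ? (has1 != has2) : (has1 && !has2);
            if (mismatch) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // yue_Hant case: preferred H, short pattern ah:mm.
        System.out.println(countFailures("H", "ah:mm", false)); // reports H only
        System.out.println(countFailures("H", "ah:mm", true));  // reports H and h
    }
}
```

When each side contains exactly one hour letter and the two letters differ, the symmetric rule flags both letters, which is why the count doubles without finding any new locales.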

Member

Just taking one example, apc.xml

It has no timeFormatLength values, so it will just inherit from root.

Lines like the following also just inherit from root.

↑↑↑

root has (correctly) HH.

But its region is:

which matches

and results in 'h'.

So we should filter out locales that are top-level locales with no explicit value for timeFormatLength.

(Might also have to refine this further).

+ " get (region "
+ region
+ ") null, falling back to 001");
prefAndAllowedHr = timeData.get("001" /* world */);
Member Author

fallback to 001 happens for az_Cyrl, az_Latn, bal_Arab, bal_Latn, be_TARASK, ..., altogether 53 locales. Strangely not the same as the 53 failing locales, though there is some overlap, such as for el_POLYTON

@macchiati
Member

macchiati commented Feb 20, 2025 via email

@pedberg-icu
Contributor

pedberg-icu commented Feb 21, 2025

Mark wrote: "So we should filter out locales that are top-level locales with no explicit value for timeFormatLength."

I disagree; finding those is part of the point of this test (otherwise they will cause errors in ICU). If a top-level locale has a default region that prefers 'h' and has no standard time formats, we need to add them. That is part of the criteria for Basic, I believe.

But maybe I misunderstood...
