db-seed.po files need cleanup to remove duplicate IDs from generated localized seed data
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Evergreen |
Fix Released
|
Medium
|
Unassigned | ||
2.12 |
Fix Released
|
Medium
|
Unassigned |
Bug Description
This bug continues #1680312 - running through the translation process on master / rel_2_12, we find the localized 950.data.
cd build/i18n
# generate the new base POT
make newpot
# update the translation for fr-CA
make LOCALE=fr-CA updatepo
# create an updated SQL file to load fr-CA
make LOCALE=fr-CA project
# load the generated fr-CA SQL
psql -f project/
At this last step, I ran into some errors about duplicates in config.i18n_core. However, the load was successful, and the errors didn't seem to be harmful. To determine why we were getting duplicates, I examined the results of the "updatepo" step (which adds and merges the new strings with an existing translation) vs. a "newpo" step (which generates a fresh, untranslated PO file).
With the "updatepo" step, you get entries like:
# id::cmf.label__26 id::cza.label__1 id::cza.label__10
# id::cza.label__1 id::cza.label__10 id::cza.label__19
#: 950.data.
#: 950.data.
msgid "Title Control Number"
msgstr "Numéro de contrôle pour les titres"
With the "newpo" step, you get entries like:
# id::cmf.label__26 id::cza.label__1 id::cza.label__10
#: 950.data.
#: 950.data.
msgid "Title Control Number"
msgstr ""
So it looks like in the merge process, "pot2po" is holding onto prior ID rows, resulting in duplicates getting fed into the final SQL file. The problem could be cleaned up by running the output SQL through "sort | uniq", but that's fixing the symptom rather than the root of the problem. More investigation into what pot2po is doing may be required to fix the root of the problem.
Changed in evergreen: | |
status: | New → Triaged |
milestone: | none → 2.12.1 |
Changed in evergreen: | |
milestone: | 2.12.1 → 2.12.2 |
Changed in evergreen: | |
milestone: | 2.12.2 → 2.12.3 |
Changed in evergreen: | |
milestone: | 2.12.3 → 2.12.4 |
Changed in evergreen: | |
milestone: | 2.12.4 → 2.12.5 |
Changed in evergreen: | |
status: | Fix Committed → Fix Released |
Looks like we may want to insert a step for "pocommentclean" which is part of translate toolkit and can be used to remove all existing comments in a given file. While the manpage says that you can only use pocommentclean against po directories and not individual files, my local testing with translate toolkit 1.13.0 on Ubuntu 16.04 yielded successfully cleaned files on a case by case basis.
So running "pocommentclean build/i18n/ po/db.seed/ fr-CA.po" resulted in a file fr-CA.po with no comments in it. Followed by running the individual command to update one PO file, like "pot2po --progress verbose -i po/db.seed/ db.seed. pot -o po/db.seed/fr-CA.po -t po/db.seed/ fr-CA.po" which resulted in an fr-CA.po file with the corrected IDs only (and all the updated IDs that were missing from the file too).
We'd have to figure out where best to implement the task to clean out the existing IDs and replace them all. And then with git version control tracking all our changes, the final update that gets committed would appear only to replace any mistakes in the file. Any existing IDs that are unchanged would remain unchanged with the git history.
Further testing and coding to come...