Validate/clean language codes in crawled & API submitted records
Closed, Resolved, Public

Description

Right now the backend accepts and stores basically any string given as a language code. We should validate these codes to keep junk out of the stored data.

Event Timeline

The first step is to get an "authoritative" list of language codes on the backend. We have this on the frontend via https://github.com/wikimedia/language-data. One possibility would be to add a local Python script that can read the JSON data file provided by language-data, and to make sure that JSON file is present in a place that can be found at runtime. A similar, but slightly different, option would be to figure out how to add Python as a supported platform for wikimedia/language-data and publish it to PyPI.
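As a rough illustration of the first option, a minimal sketch of a helper that loads a language-data style JSON document and answers "is this a known code?" might look like the following. The class name `KnownLanguages`, the inline sample data, and the assumption that codes are the keys of a top-level "languages" object stored in lowercase are all hypothetical; this is not the helper class that was eventually merged.

```python
import json

# Hypothetical, heavily abbreviated stand-in for the language-data JSON file.
# Assumption: language codes are the keys of a top-level "languages" object.
SAMPLE_JSON = """
{
  "languages": {
    "en": ["Latn", ["EU", "AM"], "English"],
    "fr": ["Latn", ["EU"], "français"]
  }
}
"""


class KnownLanguages:
    """Hypothetical helper: holds the set of known language codes."""

    def __init__(self, raw_json):
        data = json.loads(raw_json)
        # Only the keys matter for validation; the values are ignored here.
        self._codes = frozenset(data.get("languages", {}))

    def is_known(self, code):
        """Return True if code is a string naming a known language."""
        if not isinstance(code, str):
            return False
        # Assumption: codes are stored in lowercase, so normalize first.
        return code.strip().lower() in self._codes


languages = KnownLanguages(SAMPLE_JSON)
```

In a real deployment the JSON would be read from a file shipped alongside the backend code rather than from an inline string.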

Change 677045 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] toolinfo: Add LanguageData helper class

https://gerrit.wikimedia.org/r/677045

Change 678116 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] toolinfo: validate and clean language codes

https://gerrit.wikimedia.org/r/678116

Change 677045 merged by jenkins-bot:

[wikimedia/toolhub@main] toolinfo: Add LanguageData helper class

https://gerrit.wikimedia.org/r/677045

Change 678116 merged by jenkins-bot:

[wikimedia/toolhub@main] toolinfo: validate and clean language codes

https://gerrit.wikimedia.org/r/678116

The missing model level validation is for the codes embedded in url_multilingual collections. These are being cleaned and validated for crawled toolinfo.json data, but not strongly enforced for API submissions.

https://gerrit.wikimedia.org/r/c/wikimedia/toolhub/+/678683 added the same cleaning to API submissions as we were already using for crawled records. This should keep garbage out of the db, but right now there is no feedback to the user that this data cleaning has been done.
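To make the cleaning step concrete, here is a minimal sketch of filtering a url_multilingual collection against a set of known codes. It assumes each entry is a dict with "language" and "url" keys; the function name `clean_url_multilingual` and the `KNOWN_CODES` set are hypothetical and do not reflect the code in the patch above. Returning the dropped entries alongside the kept ones is one possible way to make feedback to the submitter feasible later.

```python
# Hypothetical stand-in for the authoritative list of language codes.
KNOWN_CODES = frozenset({"en", "fr", "de"})


def clean_url_multilingual(entries, known_codes=KNOWN_CODES):
    """Split entries into (kept, dropped) based on their language code.

    Assumption: each entry is a dict shaped like
    {"language": "<code>", "url": "<url>"}.
    """
    kept, dropped = [], []
    for entry in entries:
        # Normalize before lookup so "EN" and " en " both match "en".
        code = str(entry.get("language", "")).strip().lower()
        if code in known_codes:
            kept.append({**entry, "language": code})
        else:
            dropped.append(entry)
    return kept, dropped


kept, dropped = clean_url_multilingual([
    {"language": "EN", "url": "https://example.org/en"},
    {"language": "not a code", "url": "https://example.org/x"},
])
```

Here `kept` holds the first entry with its code normalized to "en", and `dropped` holds the second entry unchanged.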

Change 679495 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] toolinfo: validate url_multilingual collections

https://gerrit.wikimedia.org/r/679495

Change 679495 merged by jenkins-bot:

[wikimedia/toolhub@main] toolinfo: validate url_multilingual collections

https://gerrit.wikimedia.org/r/679495

bd808 renamed this task from Validate and clean language codes in crawled records to Validate/clean language codes in crawled & API submitted records. Apr 16 2021, 4:26 PM
bd808 removed a project: Patch-For-Review.