
ListImporter Gadget: import UNILEX lists when available.
Open, High, Public, Feature

Description

Given an ISO 639-3 code:

  • get the content from the relevant raw GitHub file (warning: GitHub could block the JS query?)
  • slice into chunks of 5,000 items (via JS, or beforehand on GitHub?, see )
  • create lists up to 20,000 items (if they exist):
    • create List:{iso3}/words-by-frequency-00001-to-05000 : append relevant items
    • create List:{iso3}/words-by-frequency-05001-to-10000 : append relevant items
    • create List:{iso3}/words-by-frequency-10001-to-15000 : append relevant items
    • create List:{iso3}/words-by-frequency-15001-to-20000 : append relevant items
  • create the matching List_talk pages up to 20,000 items (if they exist):
    • create List_talk:{iso3}/words-by-frequency-00001-to-05000 : append {UNILEX License}
    • create List_talk:{iso3}/words-by-frequency-05001-to-10000 : append {UNILEX License}
    • create List_talk:{iso3}/words-by-frequency-10001-to-15000 : append {UNILEX License}
    • create List_talk:{iso3}/words-by-frequency-15001-to-20000 : append {UNILEX License}
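
The steps above can be sketched as a pure planning function. This is a minimal sketch, not the gadget's actual code: `planLists`, its parameters, and the returned object shape are assumptions made for illustration. It only computes titles and contents, and leaves fetching the raw file and creating the pages out.

```javascript
// Hypothetical helper: plan the List: and List_talk: pages for one language.
// `words` is assumed to be an array already sorted by descending frequency.
function planLists(iso3, words, chunkSize = 5000, maxItems = 20000) {
  const pad = (n) => String(n).padStart(5, "0");
  const limit = Math.min(words.length, maxItems);
  const pages = [];
  for (let start = 0; start < limit; start += chunkSize) {
    // Titles name the full nominal range, even when the last slice is short.
    const range = `${pad(start + 1)}-to-${pad(start + chunkSize)}`;
    const name = `${iso3}/words-by-frequency-${range}`;
    pages.push({
      title: `List:${name}`,
      talkTitle: `List_talk:${name}`,
      items: words.slice(start, Math.min(start + chunkSize, limit)),
      talkText: "{UNILEX License}", // placeholder copied from the description
    });
  }
  return pages;
}
```

A language with 12,000 words would yield three pages under this sketch, the last one holding 2,000 items, and a language with no words would yield none, which matches the "if they exist" condition.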

Server side split

split -d -l 5000  --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
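
With GNU `split -d`, chunks get two-digit numeric suffixes starting at `00`, so the command above would produce files such as `./clean/${iso}-words-by-frequency-00.txt` rather than names carrying the `00001-to-05000` ranges. A gadget or bot would then have to map each suffix to the range used in the list titles. A minimal sketch of that mapping, assuming this file naming; `chunkFileToTitle` is a hypothetical helper, not part of any existing tool:

```javascript
// Hypothetical helper: map a chunk file produced by `split -d -l 5000`
// (e.g. "fra-words-by-frequency-01.txt") to the matching List: title.
function chunkFileToTitle(fileName, chunkSize = 5000) {
  const match = /^(.+)-words-by-frequency-(\d+)\.txt$/.exec(fileName);
  if (!match) return null; // not a chunk file
  const [, iso, suffix] = match;
  const index = parseInt(suffix, 10); // "00" -> 0, "01" -> 1, ...
  const pad = (n) => String(n).padStart(5, "0");
  const first = index * chunkSize + 1;
  const last = (index + 1) * chunkSize;
  return `List:${iso}/words-by-frequency-${pad(first)}-to-${pad(last)}`;
}
```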

Iso names

  • The largest languages use two-letter codes (iso2) instead of iso3. They may need renaming on GitHub.
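
If renaming on GitHub is not done, the lookup could instead be normalised on the gadget side. A minimal sketch, assuming "iso2" means ISO 639-1 two-letter codes; the `toIso3` helper and its small table are hypothetical and cover only a few languages for illustration:

```javascript
// Illustrative subset only; a real table would cover every two-letter code in use.
const ISO2_TO_ISO3 = new Map([
  ["en", "eng"],
  ["fr", "fra"],
  ["de", "deu"],
  ["es", "spa"],
  ["zh", "zho"],
]);

// Hypothetical helper: return a three-letter code, or null if unknown.
function toIso3(code) {
  const c = code.toLowerCase();
  if (c.length === 3) return c; // assumed to already be an ISO 639-3 code
  return ISO2_TO_ISO3.has(c) ? ISO2_TO_ISO3.get(c) : null;
}
```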

Other commands

Event Timeline

Yug renamed this task from "LanguagesImporter Gadget: import UNILEX lists when available." to "LanguaImporter Gadget: import UNILEX lists when available." (Feb 25 2021, 12:27 PM)
Yug changed the subtype of this task from "Task" to "Feature Request".

@Yug, could you elaborate a bit more? From the title, I understand that you would like the LanguaImporter gadget to be able to import such lists. If so, I disagree: I think this should be done by another gadget, or the lists should be imported by hand and/or by bot. We should not add features to LanguaImporter beyond creating an item for a language. So could you retitle the task?

My idea was to both create the language Qid and add the reference lists in the same pass.
It could be a separate gadget, true, to separate concerns.
The two should closely co-occur, though.

Yug renamed this task from "LanguaImporter Gadget: import UNILEX lists when available." to "ListImporter Gadget: import UNILEX lists when available." (Feb 25 2021, 9:39 PM)

Note: I developed a bot to import all UNILEX lists into this format. The bot ran on the first 500 languages as a test. It still needs to run on the next 500 languages.

Yug triaged this task as Unbreak Now! priority.
tstarling lowered the priority of this task from Unbreak Now! to High. (Jul 15 2022, 6:14 AM)
tstarling subscribed.

I don't think this is "unbreak now". See https://www.mediawiki.org/wiki/Phabricator/Project_management#Priority_levels for information about setting task priority.