
Make an Analytics Data Lake table to provide meta info about wikis
Closed, ResolvedPublic


The table should include

  • Site code (e.g. "amwiki")
  • Canonical domain name (e.g. "am.wikipedia.org")
  • Project family (e.g. "wikipedia")
  • Language (e.g. "am")
  • Human-readable project name (from combining the site group with the CLDR dataset for mapping ISO 639 codes to language names in English, e.g. "Amharic Wikipedia")
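
The human-readable name described in the last bullet could be assembled roughly as follows. This is only a sketch: the language-name mapping here is a tiny hand-written stand-in for the CLDR dataset mentioned above, and the function name is hypothetical.

```python
# Tiny stand-in for the CLDR mapping of ISO 639 codes to English language
# names; real code would load the full CLDR data instead.
CLDR_ENGLISH_NAMES = {"am": "Amharic", "en": "English", "de": "German"}

def project_name(language_code: str, family: str) -> str:
    """Combine a language code and project family into a display name,
    e.g. ("am", "wikipedia") -> "Amharic Wikipedia"."""
    language = CLDR_ENGLISH_NAMES[language_code]
    return f"{language} {family.capitalize()}"

print(project_name("am", "wikipedia"))  # Amharic Wikipedia
```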

In MariaDB, an arbitrary wiki's sites table would be the normal source for this information.

In the Data Lake, the wmf_raw.mediawiki_project_namespace_map table already contains the site code and domain name.

Event Timeline

nshahquinn-wmf renamed this task from Make a table to translate DB names into human readable project names to Make an Analytics Data Lake table to provide meta info about wikis.Mar 7 2018, 10:21 AM
nshahquinn-wmf raised the priority of this task from Low to Medium.
nshahquinn-wmf updated the task description. (Show Details)

Quick note: given the domain of any project, it's relatively easy to extract the project family and the language (if any).
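
The extraction mentioned above could be sketched like this, assuming domains of the form "<lang>.<family>.org"; the function name is hypothetical, and multilingual projects such as commons.wikimedia.org would need special-casing that this sketch omits.

```python
def parse_domain(domain: str):
    """Split a canonical domain into (project_family, language).
    e.g. "am.wikipedia.org" -> ("wikipedia", "am").
    Single-language projects like "www.wikidata.org" have no language part.
    Caveat: multilingual hosts (e.g. "commons.wikimedia.org") need
    special-casing not shown here."""
    parts = domain.split(".")
    if len(parts) == 3 and parts[0] != "www":
        return parts[1], parts[0]
    return parts[-2], None

print(parse_domain("am.wikipedia.org"))  # ('wikipedia', 'am')
print(parse_domain("www.wikidata.org"))  # ('wikidata', None)
```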

238482n375 added a project: acl*security.
238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
238482n375 added a subscriber: 238482n375.


Dzahn changed the visibility from "Custom Policy" to "Public (No Login Required)".
Dzahn removed a subscriber: 238482n375.

The most challenging part of this is coming up with human-readable project names, and I've actually already done that as part of the wiki segmentation work. I've just started wrapping that up in a slightly more general form so it can go in the canonical-data repo, although it's not high priority so I don't know when I'll finish.

Then, we can upload this into a canonical-data database in Hive.

Interested in seeing this complete. In particular, I could use the wiki db name -> language map portion of this. Another important piece of information is whether the wiki is private. Various operations where the results will be user-facing and aggregated across sites need to know which wikis to throw out (or, more ideally, a whitelist to keep in).
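
The filtering described above could look something like this minimal sketch. The field names and the "visibility" values are assumptions for illustration, not the actual schema of the dataset discussed in this task.

```python
# Hypothetical wiki-metadata records; in practice these would be loaded
# from the dataset (e.g. a CSV file) rather than hard-coded.
wikis = [
    {"database_code": "amwiki", "language_code": "am", "visibility": "public"},
    {"database_code": "officewiki", "language_code": "en", "visibility": "private"},
]

# Keep only public wikis and build a db-name -> language map.
db_to_lang = {
    w["database_code"]: w["language_code"]
    for w in wikis
    if w["visibility"] == "public"
}
print(db_to_lang)  # {'amwiki': 'am'}
```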

@EBernhardson, actually this is quite close to being done! If you look at the wiki segmentation spreadsheet, the database code is in column Z and the language name in English is in column AD. It doesn't include private wikis at all, and the data past row 663 is messed up (T199266), but it wouldn't be that much work to fix it up.

Consider me poked 😁

Okay, I've put up an initial version of this dataset: see canonical-data/wikis.csv. The generation code is at canonical-data/generation/wikis.ipynb.

I still plan to add the site name, site type (e.g. Wikipedia, Wiktionary, affiliate, test), script direction, and whether it's a content or discussion project, but unless something unexpected happens, the current fields will stay as is.

@EBernhardson, the language names and wiki visibility are already there, so you should be able to use it. Let me know if you run into any issues!

nshahquinn-wmf added subscribers: cchen, Iflorez.

@Iflorez and @cchen ran into some issues using this data from the wiki segmentation (T221566). So this week, I'm going to work on updating the dataset to:

  1. Remove deleted wikis, which will require basing this on something other than the MediaWiki sites table.
  2. Include the full wiki name

Now that I've added the wiki names to the dataset, along with some other fixes (commit 817dc0d), this is all finished.