
Make an Analytics Data Lake table to provide meta info about wikis
Closed, Resolved · Public

Description

The table should include

  • Site code (e.g. "amwiki")
  • Canonical domain name (e.g. "am.wikipedia.org")
  • Project family (e.g. "wikipedia")
  • Language (e.g. "am")
  • Human-readable project name (from combining the site group with the CLDR dataset for mapping ISO 639 codes to language names in English, e.g. "Amharic Wikipedia")
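
As a rough illustration of the last field, here is a minimal sketch of combining a project family with an English language name. The `LANGUAGE_NAMES` dictionary and the `project_name` helper are hypothetical stand-ins; a real implementation would load the full ISO 639 to English name mapping from the CLDR dataset.

```python
# Hypothetical excerpt of an ISO 639 code -> English language name mapping.
# A real table would be loaded from the CLDR dataset instead.
LANGUAGE_NAMES = {"am": "Amharic", "fr": "French"}


def project_name(family: str, language_code: str) -> str:
    """Combine a project family and a language code into a readable name."""
    family_name = family.capitalize()
    language = LANGUAGE_NAMES.get(language_code)
    if language is None:
        # Unknown or missing language: fall back to the family name alone.
        return family_name
    return f"{language} {family_name}"


print(project_name("wikipedia", "am"))  # Amharic Wikipedia
```

Projects that are not tied to a language would need their names supplied some other way; the fallback here is only a placeholder.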

In MariaDB, an arbitrary wiki's [sites table](https://www.mediawiki.org/wiki/Manual:Sites_table) would be the normal source for this information.

In the Data Lake, the [wmf_raw.mediawiki_project_namespace_map table](https://github.com/wikimedia/analytics-refinery/blob/master/hive/mediawiki/history/create_mediawiki_project_namespace_map.hql) already contains the site code and domain name.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 10 2018, 12:42 AM
nshahquinn-wmf triaged this task as Low priority. · Jan 10 2018, 12:42 AM
nshahquinn-wmf renamed this task from "Make a table to translate DB names into human readable project names" to "Make an Analytics Data Lake table to provide meta info about wikis". · Mar 7 2018, 10:21 AM
nshahquinn-wmf raised the priority of this task from Low to Medium.
nshahquinn-wmf updated the task description.
nshahquinn-wmf added a project: Analytics.
nshahquinn-wmf updated the task description.

Quick note: Knowing the domain of any project, it's relatively easy to extract the project-family and the language (if any).
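
A minimal sketch of that extraction, under stated assumptions: the `parse_domain` helper and the `LANGUAGE_FAMILIES` set are illustrative names, not part of any existing codebase, and the set of per-language families is my own guess at a reasonable list. Domains with extra labels (for example mobile subdomains) and subdomains that are not real language codes are not handled.

```python
from typing import NamedTuple, Optional

# Assumption: families whose first domain label is a language code.
LANGUAGE_FAMILIES = {
    "wikipedia", "wiktionary", "wikibooks", "wikinews",
    "wikiquote", "wikisource", "wikiversity", "wikivoyage",
}


class WikiInfo(NamedTuple):
    family: str
    language: Optional[str]


def parse_domain(domain: str) -> WikiInfo:
    """Split a canonical domain into (project family, language or None)."""
    labels = domain.lower().split(".")
    if len(labels) < 2:
        raise ValueError(f"not a full domain name: {domain!r}")
    family = labels[-2]  # the label just before the top-level domain
    language = None
    if len(labels) == 3 and family in LANGUAGE_FAMILIES:
        language = labels[0]
    return WikiInfo(family, language)


print(parse_domain("am.wikipedia.org"))
print(parse_domain("www.wikidata.org"))
```

Under these assumptions, a domain such as "commons.wikimedia.org" comes back with family "wikimedia" and no language, so a finer-grained family would need a lookup table.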

fdans moved this task from Incoming to Backlog (Later) on the Analytics board. · Apr 12 2018, 5:05 PM
238482n375 set Security to Software security bug. · Jun 15 2018, 8:07 AM
238482n375 added a project: acl*security.
238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
238482n375 added a subscriber: 238482n375.

Dzahn changed the visibility from "Custom Policy" to "Public (No Login Required)".
Dzahn removed a subscriber: 238482n375.
Restricted Application added a project: acl*security. · Jun 15 2018, 10:40 AM
nshahquinn-wmf added a comment. · Edited · Nov 10 2018, 11:00 PM

The most challenging part of this is coming up with human-readable project names, and I've actually already done that as part of the wiki segmentation work. I just started wrapping that up in a slightly more general form so it can go in the canonical-data repo, although it's not high priority, so I don't know when I'll finish.

Then, we can upload this into a canonical-data database in Hive.

EBernhardson added a subscriber: EBernhardson. · Edited · Feb 20 2019, 11:08 PM

Interested in seeing this complete. In particular, I could use the wiki db name -> language map portion of this. Another important piece of information is whether the wiki is private or not. Various operations whose results will be user-facing and aggregated across sites need to know which wikis to throw out (or, more ideally, have a whitelist of wikis to keep).
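
To make the request concrete, here is a small sketch of the kind of filtering I have in mind. The record layout (`db_name`, `language`, `private`) and the sample rows are hypothetical, not the schema of any existing table.

```python
# Hypothetical wiki metadata rows; the field names are assumptions.
WIKIS = [
    {"db_name": "amwiki", "language": "am", "private": False},
    {"db_name": "exampleprivatewiki", "language": "en", "private": True},
]


def public_language_map(wikis):
    """Map db name -> language, keeping only wikis marked as not private."""
    return {w["db_name"]: w["language"] for w in wikis if not w["private"]}


print(public_language_map(WIKIS))  # {'amwiki': 'am'}
```

Building the map from an explicit keep-list means a wiki with missing metadata is dropped by default rather than leaked into user-facing results.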

@EBernhardson, actually this is quite close to being done! If you look at the wiki segmentation spreadsheet, the database code is in column Z and the language name in English is in column AD. It doesn't include private wikis at all, and the data past row 663 is messed up (T199266), but it wouldn't be that much work to fix it up.

Consider me poked 😁

nshahquinn-wmf moved this task from Backlog to Next Up on the Product-Analytics board.

Okay, I've put up an initial version of this dataset: see canonical-data/wikis.csv. The generation code is at canonical-data/generation/wikis.ipynb.

I still plan to add the site name, site type (e.g. Wikipedia, Wiktionary, affiliate, test), script direction, and whether it's a content or discussion project, but unless something unexpected happens, the current fields will stay as is.

@EBernhardson, the language names and wiki visibility are already there, so you should be able to use it. Let me know if you run into any issues!

kzimmerman removed nshahquinn-wmf as the assignee of this task. · Aug 27 2019, 6:41 PM
kzimmerman removed a project: Contributors-Analysis.
nshahquinn-wmf added subscribers: cchen, Iflorez.

@Iflorez and @cchen ran into some issues using this data from the wiki segmentation (T221566). So this week, I'm going to work on updating the dataset to:

  1. Remove deleted wikis, which will require basing this on something other than the MediaWiki sites table.
  2. Include the full wiki name
nshahquinn-wmf closed this task as Resolved. · Nov 21 2019, 1:23 AM

Now that I've added the wiki names to the dataset, along with some other fixes (commit 817dc0d), this is all finished.