
Extract cross-wiki WikiProject tags
Closed, Resolved (Public)

Description

Pipeline for bringing together the following information in a single file for every article in English Wikipedia with at least one WikiProject template (5,926,244 articles as of 1 December 2019):

  • Article metadata: talk page ID, talk page revision ID, article title, article page ID, article revision ID, Wikidata ID (QID), and sitelinks (the article's title in each other language edition)
  • WikiProject templates: list of WikiProject-related templates from an article's talk page

Example output JSON:

{
  "title": "Atlas Shrugged",
  "talk_pid": 128,
  "talk_revid": 911346471,
  "article_revid": 918538850,
  "article_pid": 18951386,
  "qid": "Q374098",
  "sitelinks": {
      "ro": "Revolta lui Atlas",
      "ja": "\u80a9\u3092\u3059\u304f\u3081\u308b\u30a2\u30c8\u30e9\u30b9",
      "is": "Undirsta\u00f0an",
     ...
      "cs": "Atlasova vzpoura",
      "en": "Atlas Shrugged",
      "uk": "\u0410\u0442\u043b\u0430\u043d\u0442 \u0440\u043e\u0437\u043f\u0440\u0430\u0432\u0438\u0432 \u043f\u043b\u0435\u0447\u0456"
  },
  "wp_templates": [
      "WikiProject Objectivism",
      "WikiProject Novels",
      "WikiProject Philosophy",
      "WikiProject Libertarianism",
      "WikiProject Politics",
      "WikiProject Trains"
  ]
}

Figshare item that includes the information above: https://doi.org/10.6084/m9.figshare.10248344.v1
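
A minimal sketch of reading records of this shape, assuming the dump stores one JSON object per line in a bz2-compressed file (the task does not state the layout, so check the Figshare description first). The helper names `iter_records` and `count_templates` are hypothetical.

```python
import bz2
import json
from collections import Counter


def iter_records(path):
    """Yield one dict per non-empty line of a bz2-compressed JSON-lines file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_templates(records):
    """Count how many articles carry each WikiProject template."""
    counts = Counter()
    for record in records:
        # set() guards against a template listed twice on one talk page
        counts.update(set(record.get("wp_templates", [])))
    return counts
```

For example, `count_templates(iter_records("full_wptemplates.json.bz2")).most_common(10)` would list the ten most widely used templates, if the one-record-per-line assumption holds.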

Event Timeline

See T236713#5612526:

stat1007.eqiad.wmnet:/home/isaacj/drafttopic/full_wptemplates.json.bz2

@Halfak thanks for breaking this out as that other task was rapidly growing larger :)

To resolve this task, I'll put together that figshare item with the October results and the code for reproducing the dataset. Let me know if you want anything further.

Solid! I think that sounds great.

If you could include some documentation in that task for the process you followed, that'd be awesome.

I talked to Isaac about including all of the relevant WikiProject templates. I've generated a list of all redirect pages that go to a WikiProject template. See stat1007:/home/halfak/projects/drafttopic/datasets/wikiproject_to_templates.yaml

WikiProject Moon: ["WikiProject Moon", "WPMoon"]
WikiProject Dadra and Nagar Haveli: ["WikiProject Dadra and Nagar Haveli"]
WikiProject China: ["WPCN", "WikiProject China", "Wikiproject China", "Wp prc", "Wikiproject china", "WPChina", "WikiProject Republic of China", "WPCHINA", "WP China"]
WikiProject Hudson Valley: ["WikiProject Hudson Valley", "Hudson Valley"]
WikiProject Polynesia: ["WP Polynesia", "WPPOLYNESIA", "WikiProject Polynesia", "WPPolynesia"]
WikiProject Marketing & Advertising: ["WP Marketing", "WikiProject Marketing", "WikiProject Marketing & Advertising", "WP Advertising", "WikiProject Advertising"]
WikiProject Albania: ["WPSQ", "WikiProject Albania", "Wikiproject Albania", "WPALBANIA", "WP Albania", "WPALB"]
...

The file is in yaml format, so you can read it with:

$ python
>>> import yaml
>>> with open("/home/halfak/projects/drafttopic/datasets/wikiproject_to_templates.yaml") as f:
...     templates_map = yaml.safe_load(f)
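
Since the map goes from WikiProject to template names, looking up a template found on a talk page needs the reverse direction. A minimal sketch of that inversion follows; `invert_templates_map` is a hypothetical helper, not part of the wikitax code, and the sample dict stands in for the loaded YAML.

```python
def invert_templates_map(templates_map):
    """Map each template name (or redirect) to its canonical WikiProject."""
    alias_to_project = {}
    for project, aliases in templates_map.items():
        for alias in aliases:
            # setdefault keeps the first project seen if an alias repeats
            alias_to_project.setdefault(alias, project)
    return alias_to_project


# Two entries copied from the excerpt above; the real map comes from yaml.safe_load
sample = {
    "WikiProject Moon": ["WikiProject Moon", "WPMoon"],
    "WikiProject Hudson Valley": ["WikiProject Hudson Valley", "Hudson Valley"],
}
alias_to_project = invert_templates_map(sample)
print(alias_to_project["WPMoon"])  # WikiProject Moon
```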

If you just want a raw list of the templates, check out wp_templates.unique.txt in the same folder:

$ head wp_templates.unique.txt 
AARTalk
AFRO
AircraftProject
Albm
ALBM
Album
ALBUM
Albums
Alm
Alternate History WikiProject

As discussed on IRC: the wikiproject_to_templates YAML is currently missing a number of WikiProjects. Based on the WikiProject templates that I detected in my previous pass over English Wikipedia (case-insensitive string-matching against "wp" and "wikiproject"), here are the top 100 templates that are missing from the YAML and how many articles each was found in. There is also a long tail of unique template names (1877 total, though some of them are false positives). Full list at stat1007:/home/isaacj/drafttopic/templates_missing_from_yaml.tsv

Until we fix this, it's probably best to keep using both string-matching and the YAML (the YAML catches templates that don't have "wp" or "wikiproject" in their name). I think most of the gaps come from WikiProjects that are missing from the directory, or from odd section/template structure in the WikiProject directory that trips up the parser.
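
A rough sketch of combining the two detection rules, assuming the string match is a plain substring test (the exact matching rule used in the pipeline is not spelled out here). `looks_like_wikiproject` and `known_templates` are hypothetical names.

```python
def looks_like_wikiproject(template_name, known_templates):
    """Return True if a template name is probably a WikiProject banner.

    known_templates: lowercased template names taken from the YAML map.
    """
    name = template_name.strip().lower()
    # Rule 1: case-insensitive substring match on "wp" / "wikiproject"
    if "wp" in name or "wikiproject" in name:
        return True
    # Rule 2: YAML lookup, for names such as "Hudson Valley" with no marker
    return name in known_templates


known = {"hudson valley", "wikiproject hudson valley"}
print(looks_like_wikiproject("Hudson Valley", known))  # True, via the YAML
print(looks_like_wikiproject("Infobox book", known))   # False
```

The substring rule is deliberately loose, which is where false positives come from: any template with "wp" anywhere in its name passes.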

wikiproject disambiguation	104354
wikiproject lists	96341
wikiproject africa	67231
wikiproject articles for creation	64178
wikiproject college football	61430
wpbs	58798
wikiproject trains	49365
wikiproject ships	37389
wpafc	33720
trainswikiproject	31656
wikiproject journalism	30961
wikiproject catholicism	30443
wikiproject national football league	29016
wikiproject college basketball	27129
wpships	22450
wikiproject mcb	19962
wikiproject years	17942
wikiproject pakistan	17760
wikiproject turkey	15453
wikiproject israel	14783
wikiproject antarctica	13548
wikiproject judaism	10802
wikiproject archaeology	10259
wikiproject death	9526
wikiproject soap operas	8989
wikiproject magazines	8986
wikiproject islam	8685
wikiproject multi-sport events	8619
wptr	8577
wikiproject molecular and cellular biology	8259
wikiproject science fiction	8077
wikiproject anglicanism	7787
wikiproject fashion	7575
wikiproject crime	6808
wikiproject sociology	6619
wikiproject highways	6517
wp pakistan	6276
wikiproject former countries	6201
wikiproject palestine	6194
wikiproject sculpture	5821
wp lists	5540
wikiproject bbc	5532
wikiproject syria	5435
wikiproject chess	5398
wikiproject british empire	5360
wikiproject feminism	5229
wikiproject beauty pageants	5115
wikiproject europe	5022
wikiproject national basketball association	5003
wikiproject genetics	4825
wikiproject british tv shows	4825
wikiproject scouting	4749
wikiproject iraq	4610
wikiproject united states courts and judges	4570
wikiproject lebanon	4451
wikiproject organized crime	4392
wikiproject armenia	4217
wikiproject jewish history	3963
wikiproject european microstates	3881
wikiproject western asia	3841
wp disambiguation	3807
wikiproject referees	3742
wikiproject philately	3725
wikiproject normandy	3722
wikiproject sports facilities task force	3610
wikiproject bavaria	3584
wikiproject record labels	3506
wikiproject criminal biography	3504
wikiproject indigenous peoples of the americas	3497
wikiproject amusement parks	3491
twp	3437
wikiproject public art	3423
wikiproject shipwrecks	3379
wikiproject websites	3378
wikiproject buses	3310
wikiproject shopping centers	3255
wikiproject asia	3210
wikiproject skyscrapers	3198
wikiproject saudi arabia	3185
wikiproject guild of copy editors	3081
wikiproject arab world	3032
wikiproject overseas france	3019
wikiproject ottoman empire	3015
wikiproject cyprus	2863
wikiproject georgia (country)	2837
wikiproject lutheranism	2789
wpbeatles	2783
wikiproject doctor who	2755
wikiproject turtles	2744
wikiproject yugoslavia	2734
wikiproject u.s. supreme court cases	2721
wikiproject nickelodeon	2668
wikiproject united states territories	2625
wpdab	2605
wikiproject gambling	2528
wikiproject yemen	2523
wikiproject toys	2516
wpus50k	2467
wikiproject united arab emirates	2445
wikiproject green bay packers	2427

OK. I adjusted this in https://github.com/halfak/wikitax/pull/4

I excluded the following:

  • wikiproject disambiguation 104354 (Not topical)
    • wpdab 2605
    • wp disambiguation 3807
  • wikiproject lists 96341 (Not topical)
    • wp lists 5540
  • wikiproject articles for creation 64178 (Not topical)
    • wpafc 33720
  • wpbs 58798 (This is the banner shell template)
  • wikiproject death 9526 (Scope too wide)
  • wikiproject guild of copy editors 3081 (Not topical)
  • wpus50k 2467 (Not a wiki project)
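
For anyone applying the same exclusions to the string-matched output, a minimal sketch is below. It is not the code from the pull request; `EXCLUDED` and `drop_excluded` are hypothetical names, and matching on lowercased names is an assumption based on how the counts are listed.

```python
# Names copied from the exclusion list above, lowercased as in the counts
EXCLUDED = {
    "wikiproject disambiguation", "wpdab", "wp disambiguation",
    "wikiproject lists", "wp lists",
    "wikiproject articles for creation", "wpafc",
    "wpbs",
    "wikiproject death",
    "wikiproject guild of copy editors",
    "wpus50k",
}


def drop_excluded(templates):
    """Keep only templates whose lowercased name is not in EXCLUDED."""
    return [t for t in templates if t.strip().lower() not in EXCLUDED]


print(drop_excluded(["WikiProject Novels", "WPBS", "WikiProject Lists"]))
# ['WikiProject Novels']
```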

@Halfak : dataset is now uploaded to Figshare: https://doi.org/10.6084/m9.figshare.10248344.v1

I updated the WikiProject names based on the YAML file at stat1007:/home/halfak/projects/drafttopic/datasets/wikiproject_to_templates.20191212.yaml, but the string matching was retained, so there will still be some false positives in the dataset.

Take a look and if everything looks good, we can close this out! I'll update the task description too to better reflect the actual format of the output.

It seems like we should get the wikiproject_to_templates and the wikiproject_taxonomy files in figshare somewhere. I wonder how you feel about extending this record to include those too.

Yeah, that works for me. I looked, but it doesn't seem that I can give you edit permissions on a figshare item I created, so just point me towards the files you want uploaded and any additional description I should add. There were also two aaron halfakers (!!) on figshare when I went to add your name to the item, and both were labeled as inactive, so let me know if there's an account you want linked to the item as well.

Indeed it looks like my account has been disabled! I emailed support to get it re-enabled. Weird.

I'll get the files together and ping back.

see https://github.com/halfak/wikitax/tree/master/datasets

Complete -- both uploaded with brief descriptions

I emailed support to get it re-enabled. Weird.

Sounds good, just let me know when it's worked out and I'll add you as an author.

Great, added! If you see anything that you'd like to change, just ping and I'll update.