Add filter for recommended links

Description

This filter was requested in earlier feedback rounds, described e.g. in
T253279 and T261408.

It removes unwanted entries from the set of candidate links contained in
the anchor dictionary, in two steps:

  • generate_wdproperties_spark.py retrieves all Wikidata items that are
    listed under the instance-of property (P31) of the Wikidata item that
    corresponds to an article of the wiki. For example, the enwiki article
    "Statistics (disambiguation)" (Q2333935) contains the statement
    "instance-of = Wikimedia disambiguation page (Q4167410)".

  • filter_dict_anchor.py removes from the anchor dictionary all
    candidate links whose target is an instance of one of the following:
    Disambiguation page (Q4167410), List page (Q13406463), Year (Q577),
    Calendar year (Q3186692).
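
The two steps above can be sketched roughly as follows. This is only an illustration, not the code in filter_dict_anchor.py: the function name filter_anchors, the layout of both dictionaries, and the sample data are assumptions made for the example. Only the four QIDs are taken from this commit message.

```python
# QIDs listed in the commit message; everything else is hypothetical.
FILTERED_INSTANCES = {
    "Q4167410",   # disambiguation page
    "Q13406463",  # list page
    "Q577",       # year
    "Q3186692",   # calendar year
}


def filter_anchors(anchors, instance_of):
    """Drop candidate links whose target is an instance of a filtered class.

    anchors:     {anchor_text: {target_title: count}}  (assumed layout)
    instance_of: {target_title: set of P31 QIDs}       (assumed layout)
    """
    filtered = {}
    for anchor_text, candidates in anchors.items():
        kept = {
            title: count
            for title, count in candidates.items()
            # Targets without known P31 values are kept.
            if not (instance_of.get(title, set()) & FILTERED_INSTANCES)
        }
        if kept:  # drop anchors that are left without any candidate
            filtered[anchor_text] = kept
    return filtered


# Made-up sample data for illustration only.
anchors = {
    "statistics": {"Statistics": 120, "Statistics (disambiguation)": 3},
    "1999": {"1999": 40},
}
instance_of = {
    "Statistics (disambiguation)": {"Q4167410"},
    "1999": {"Q3186692"},
}
result = filter_anchors(anchors, instance_of)
print(result)
```

In this sketch an anchor whose candidates are all filtered out is removed entirely; whether the real script does the same is not stated here.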

Both steps were added to run-pipeline.sh; the pipeline ran
successfully for four wikis: arwiki, bnwiki, cswiki, and viwiki.

Minor changes:

  • reduce the number of workers for the wikipedia2vec script from 10 to
    8 so that the total number of workers on stat1008 is an integer
    multiple of the number of workers used here; see
    https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients

  • *_spark.py: change the path for intermediate files on Hive to
    /tmp/$USER/mwaddlink in order to avoid permission errors

  • remove commented-out code
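
As a rough illustration of the per-user path change, the location /tmp/$USER/mwaddlink could be built as shown below. The helper name intermediate_dir is hypothetical and not taken from the *_spark.py scripts, which may construct the path differently (for example directly from the shell's $USER variable).

```python
import getpass
import posixpath


def intermediate_dir(user=None):
    """Return /tmp/<user>/mwaddlink for the given or the current user.

    Hypothetical helper: each user writes below their own directory,
    so jobs do not collide on files owned by somebody else.
    """
    user = user or getpass.getuser()
    return posixpath.join("/tmp", user, "mwaddlink")


print(intermediate_dir("alice"))
```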

Bug: T253279
Bug: T261408
Bug: T278679
Change-Id: Ic5d15634b84e5f380acaeb42418c8752884a41ab

Details

Provenance
MGerlach authored on Mar 26 2021, 5:05 PM
kostajh committed on Mar 29 2021, 1:36 PM
Parents
rRMWA1815d9247d68: docs: Fix link to wikitext docs
Branches
Unknown
Tags
Unknown
References
refs/changes/72/675172/8
ChangeId
Ic5d15634b84e5f380acaeb42418c8752884a41ab