
Code and data onboarding for link recommendation project
Closed, Resolved (Public)

Description

Things to understand:

  • Code: where it lives, version control, and which scripts are relevant to which part of the pipeline (data gathering + preprocessing, model training, prediction)
  • Model: architecture (and alternatives considered / rejected), features being used (or explicitly not used)
  • Documentation: where it lives
  • Future work: what improvements are prioritized right now
  • ... (please add if I missed anything)

Event Timeline

Update week 2020-07-13:

  • met for two longer discussions with Djellel
    • got an overview of the aims of the project (including the plan for the year to work with Product)
    • general approach of the existing algorithm
    • plans for improvement during this quarter, specifically the backtesting protocol for automatic offline evaluation and possible tunings of the algorithm
  • tuning the algorithm:
    • prepared navigation embeddings as additional features for link prediction
  • backtesting protocol:
    • several possibilities came up for creating a dataset of sentences, across several languages, which contain the links to be predicted.
    • the most crucial requirements we agreed on were:
      • link completeness (how to find sentences for which we are sure that they contain all or most of the links that should be there); options include: using only sentences from the abstract (to make sure a link is not omitted because it appeared earlier in the article), using sentences from featured or good articles (collections of hundreds/thousands exist in several wikis [1], [2]), or setting a threshold on the link density (links per sentence length); see the sketch after the references below
      • sampling from a diverse set of articles, both to capture different edge cases and to ensure the task is not too easy, so that the evaluation remains informative

[1] https://meta.wikimedia.org/wiki/Wikipedia_featured_articles
[2] https://meta.wikimedia.org/wiki/Wikipedia_good_articles
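
As a rough illustration of the link-density option, here is a minimal sketch; the regex, threshold value, and function names are hypothetical, not taken from the project code:

```
import re

# Matches [[Target]] and [[Target|display text]] wikilinks.
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")

def link_density(sentence: str) -> float:
    """Number of wikilinks per token in a wikitext sentence."""
    n_links = len(WIKILINK_RE.findall(sentence))
    # replace link markup by its display text before counting tokens
    plain = WIKILINK_RE.sub(lambda m: m.group(2) or m.group(1), sentence)
    n_tokens = len(plain.split())
    return n_links / n_tokens if n_tokens else 0.0

def keep_for_backtesting(sentence: str, threshold: float = 0.1) -> bool:
    """Keep a sentence for the backtesting set only if it is densely linked enough."""
    return link_density(sentence) >= threshold
```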

Update week 2020-07-27:

Update week 2020-08-03:

  • extensive discussion with Djellel on the model
    • clarified the pipeline of the current model for generating link recommendations
    • discussed bottlenecks: specifically, mwparserfromhell is crucial for parsing wikitext to get links, but it is slow when parsing the full dump (see the sketch below)
  • started to discuss the current approach to implementing the backtesting protocol and to identify the main challenges:
    • for most languages we don't have reliable parsers (tokenizers), so we use regexes and other heuristics; this is problematic since they i) might break down for some languages, and ii) extracting sentences by splitting on "." can produce spurious sentences
    • avoid use of articles from some categories (e.g. articles missing citations)
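
To illustrate the parsing step that dominates the run time, a minimal sketch of link extraction with mwparserfromhell (the function name and example are mine, not the repo's code):

```
import mwparserfromhell

def extract_links(wikitext):
    """Return (target title, anchor text) pairs for all wikilinks on a page."""
    wikicode = mwparserfromhell.parse(wikitext)
    pairs = []
    for link in wikicode.filter_wikilinks():
        title = str(link.title).strip()
        # a link without display text uses the title itself as its anchor
        anchor = str(link.text).strip() if link.text else title
        pairs.append((title, anchor))
    return pairs

print(extract_links("Berlin is the capital of [[Germany]] and its [[List of cities in Germany|largest city]]."))
# [('Germany', 'Germany'), ('List of cities in Germany', 'largest city')]
```

Parsing is cheap per page, but running it over every page of a full dump adds up.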

Next:

  • make shared repo and write documentation

Update week 2020-08-10:

Update week 2020-08-17:

  • started to work hands-on with the code in the repo
  • trying to re-run individual scripts on the stat machines to replicate the final model
    • generating the anchor dictionary; to parse the dump more efficiently, we can process the smaller-sized chunks in parallel instead of the full file, which substantially reduces the time needed for parsing (even English takes only ~3 hours); see the sketch after this list
    • generating the sentences for the backtesting protocol
    • generating features for articles; here we generate embeddings from text and from navigation
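
A hypothetical sketch of the chunked approach, assuming the numbered bz2 chunk files that Wikimedia publishes alongside the full pages-articles dump; the mwxml/mwparserfromhell usage and the merge step are illustrative, not the repo's actual script:

```
import bz2
import glob
from collections import Counter
from multiprocessing import Pool

import mwparserfromhell
import mwxml

def count_anchors_in_chunk(path):
    """Count (anchor text, target title) pairs in one dump chunk."""
    counts = Counter()
    dump = mwxml.Dump.from_file(bz2.open(path, "rt"))
    for page in dump:
        for revision in page:  # pages-articles dumps carry one revision per page
            wikicode = mwparserfromhell.parse(revision.text or "")
            for link in wikicode.filter_wikilinks():
                title = str(link.title).strip()
                anchor = str(link.text).strip() if link.text else title
                counts[(anchor, title)] += 1
    return counts

if __name__ == "__main__":
    # one worker per chunk file instead of a single pass over the full dump
    chunks = sorted(glob.glob("enwiki-*-pages-articles*.xml-p*.bz2"))
    with Pool(processes=8) as pool:
        partial = pool.map(count_anchors_in_chunk, chunks)
    anchor_counts = sum(partial, Counter())
```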

next:

  • generating the training data from sentences + features, training the model, and applying it in the API
  • currently, the scripts in the repo cover only English, so we have to check how well these approaches work for other languages.

Update week 2020-08-24:

  • attended (parts of) the Growth team's virtual offsite around Add-a-link
  • reproduced the processing pipeline for English
  • started to work on some bug fixes
    • normalizing titles and resolving redirects
  • started to work on some improvements to the pipeline:
    • build an executable script to get navigation-based features for each language
    • memory-map features to allow for memory-efficient API hosting (see the sketch below)
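
A minimal sketch of the memory-mapping idea with numpy, assuming the per-article feature matrix is written to a flat binary file offline; file name, shape, and dtype are illustrative:

```
import numpy as np

N_ROWS, DIM = 1000, 50

# offline step: write the feature matrix to a flat binary file
features = np.random.rand(N_ROWS, DIM).astype(np.float32)  # stand-in data
out = np.memmap("features.dat", dtype=np.float32, mode="w+", shape=(N_ROWS, DIM))
out[:] = features
out.flush()

# in the API process: open read-only; rows are paged in from disk on access,
# so the full matrix never has to be loaded into RAM
feats = np.memmap("features.dat", dtype=np.float32, mode="r", shape=(N_ROWS, DIM))
query_vector = feats[42]  # reads only the pages backing this row
```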

Update week 2020-09-01:

  • created executable scripts to get feature vectors from navigation
  • added improvements to generating link candidates (see the sketch below):
    • normalizing article titles
    • resolving redirects
    • only considering candidate articles that are in the main namespace and are not redirects (avoiding links to categories, files, etc.)
  • this leads to a substantial reduction in the number of candidate links and should thus yield higher-quality predictions
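
A minimal sketch of these clean-up steps, assuming lookup tables built from the dump: redirects maps redirect titles to their targets and main_ns is the set of non-redirect main-namespace titles; all names here are hypothetical:

```
def normalize_title(title: str) -> str:
    """MediaWiki-style normalization: underscores to spaces,
    first letter uppercased (the default on most wikis)."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]

def resolve_candidate(title, redirects, main_ns):
    """Return the canonical target title, or None if it should not be recommended."""
    title = normalize_title(title)
    title = redirects.get(title, title)  # follow one redirect hop
    return title if title in main_ns else None

# usage: resolve_candidate("federal_republic of Germany", redirects, main_ns)
# -> "Germany" if the redirect table maps it there, None if out of scope
```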

Playing with a trained model from the pipeline, it is becoming clear that there are still link recommendations for strings such as 'of' or 'it' (with proposed links to the corresponding or nonsensical articles). The solution is to calculate the link-probability, i.e. the ratio between the number of times the string occurs as a link and the total number of times it occurs, whether as a link or as plain text; anchors with a low link-probability can then be removed. This requires some work to efficiently count the number of times each anchor string appears as plain text across all articles.
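
A minimal sketch of this filter; the threshold value and names are hypothetical:

```
def link_probability(n_linked: int, n_plaintext: int) -> float:
    """Fraction of all occurrences of an anchor string that appear as a link."""
    total = n_linked + n_plaintext
    return n_linked / total if total else 0.0

MIN_LINK_PROBABILITY = 0.05  # hypothetical cut-off

def keep_anchor(n_linked: int, n_plaintext: int) -> bool:
    return link_probability(n_linked, n_plaintext) >= MIN_LINK_PROBABILITY

# 'of' is linked a handful of times but occurs as plain text millions of
# times, so its link-probability is ~0 and it is dropped as an anchor
assert not keep_anchor(n_linked=50, n_plaintext=2_000_000)
```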

Update week 2020-09-07:

  • added a first version of a backtesting protocol to evaluate precision and recall of link recommendations on a test set containing sentences with links
  • built a first prototype of an API which returns link recommendations for an article
  • started to move the generation of the anchor dictionary (articles and their mentions as text) to Spark to be able to calculate the link-probability (see the sketch below)
    • for example, for a recommendation like [[More (song)|more|pr=0.99998724]], the link-probability of 'more' will be very low, and thus this candidate will be removed
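
A hypothetical PySpark sketch of this computation; the input tables, column names, and threshold are illustrative, not the actual schema:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# assumed inputs: per-anchor counts of linked and plain-text occurrences
linked = spark.read.parquet("anchors_linked")    # columns: anchor, n_linked
plain = spark.read.parquet("anchors_plaintext")  # columns: anchor, n_plain

link_prob = (
    linked.join(plain, on="anchor", how="left")
    .fillna(0, subset=["n_plain"])
    .withColumn("p_link", F.col("n_linked") / (F.col("n_linked") + F.col("n_plain")))
)

# keep only anchors that are linked often enough relative to plain-text use;
# 'more' would fall far below this cut-off and be dropped
filtered = link_prob.filter(F.col("p_link") >= 0.05)
```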

Update week 2020-09-14:

  • moved the generation of the anchor dictionary to Spark to filter anchors by link-probability; judging from some examples, this markedly improves the recommendations
  • changed the pipeline so that all feature vectors are memory-mapped, to decrease memory requirements when loading the model and querying recommendations for articles
  • started applying the pipeline available for English to other languages
  • met with Kosta (Growth) and Guiseppe (SRE) to discuss strategies to move the model towards production T258978#6462139
    • splitting model training and model querying; Guiseppe recommends running the latter in a Docker container (I have already moved towards splitting training and querying into separate repos, see https://github.com/dedcode/mwaddlink and https://github.com/martingerlach/mwaddlink-api)
    • Guiseppe stressed the importance of assessing and trying to reduce memory usage when querying the model (I have already moved towards memory-mapping the model)

Update week 2020-09-21:

  • built a full pipeline to generate a trained model for an arbitrary language
  • ran the pipeline for one other language (German) with qualitative inspection of sample articles
  • started to implement and include the automatic backtesting to get a quantitative evaluation (see the sketch below; will investigate in more detail next quarter to check performance in previously unseen languages)
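
A minimal sketch of the quantitative evaluation, assuming the backtesting set pairs each sentence with its gold links and the model exposes a recommend_links function returning a set of (anchor, target) pairs; the interface is hypothetical:

```
def evaluate(test_set, model):
    """Micro-averaged precision and recall over all sentences in the test set."""
    tp = n_pred = n_gold = 0
    for sentence, gold_links in test_set:
        predicted = model.recommend_links(sentence)  # hypothetical interface
        tp += len(predicted & gold_links)            # correctly recovered links
        n_pred += len(predicted)
        n_gold += len(gold_links)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall
```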

Resolving this task as onboarding is complete (other work will be captured in separate tasks).