
Code and data onboarding for link recommendation project
Closed, Resolved (Public)

Description

Things to understand:

  • Code: where it lives, version control, and which scripts are relevant to which part of the pipeline (data gathering + preprocessing, model training, prediction)
  • Model: architecture (and alternatives considered / rejected), features being used (or explicitly not used)
  • Documentation: where it lives
  • Future work: what improvements are prioritized right now
  • ... (please add if I missed anything)

Event Timeline

Update week 2020-07-13:

  • met for two longer discussions with Djellel
    • got an overview of the aims of the project (including the plan for the year to work with Product)
    • general approach of the existing algorithm
    • plans for improvement during this quarter, specifically the backtesting protocol for automatic offline evaluation and possible tunings of the algorithm
  • tuning the algorithm:
    • prepared navigation embeddings as additional features for link prediction
  • backtesting protocol:
    • several possibilities came up for creating a dataset of sentences, across several languages, which contain the links to be predicted.
    • the most crucial requirements we agreed on were:
      • link completeness (how to find sentences for which we are sure that they contain all or most of the links that should be there); options include: using only sentences from the abstract (to make sure a link is not omitted because it appeared earlier in the article), using sentences from featured or good articles (collections of hundreds/thousands exist in several wikis [1], [2]), or setting a threshold on the link density (links per sentence length); see the sketch after the references below
      • sampling from a diverse set of articles, both to capture different edge cases and to ensure the task is not too easy, so that the evaluation remains informative

[1] https://meta.wikimedia.org/wiki/Wikipedia_featured_articles
[2] https://meta.wikimedia.org/wiki/Wikipedia_good_articles
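
As a rough illustration of the link-density option, here is a minimal sketch; the regex, threshold value, and function names are hypothetical, not taken from the project code:

```
import re

# Matches [[Target]] and [[Target|display text]] wikilinks.
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")

def link_density(sentence: str) -> float:
    """Number of wikilinks per token in a wikitext sentence."""
    n_links = len(WIKILINK_RE.findall(sentence))
    # replace link markup by its display text before counting tokens
    plain = WIKILINK_RE.sub(lambda m: m.group(2) or m.group(1), sentence)
    n_tokens = len(plain.split())
    return n_links / n_tokens if n_tokens else 0.0

def keep_for_backtesting(sentence: str, threshold: float = 0.1) -> bool:
    """Keep a sentence for the backtesting set only if it is densely linked enough."""
    return link_density(sentence) >= threshold
```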

Update week 2020-07-27:

Update week 2020-08-03:

  • extensive discussion with Djellel on the model
    • clarified the pipeline of the current model for generating link recommendations
    • discussed bottlenecks: specifically, mwparserfromhell is crucial for parsing wikitext to get links, but it is slow when parsing the full dump (see the sketch below)
  • started to discuss the current approach to implementing the backtesting protocol and to identify the main challenges:
    • for most languages we don't have reliable parsers (tokenizers), so we use regexes and other heuristics; this is problematic since they i) might break down for some languages, and ii) extracting sentences by splitting on "." can produce spurious sentences
    • avoid use of articles from some categories (e.g. articles missing citations)
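
To illustrate the parsing step that dominates the run time, a minimal sketch of link extraction with mwparserfromhell (the function name and example are mine, not the repo's code):

```
import mwparserfromhell

def extract_links(wikitext):
    """Return (target title, anchor text) pairs for all wikilinks on a page."""
    wikicode = mwparserfromhell.parse(wikitext)
    pairs = []
    for link in wikicode.filter_wikilinks():
        title = str(link.title).strip()
        # a link without display text uses the title itself as its anchor
        anchor = str(link.text).strip() if link.text else title
        pairs.append((title, anchor))
    return pairs

print(extract_links("Berlin is the capital of [[Germany]] and its [[List of cities in Germany|largest city]]."))
# [('Germany', 'Germany'), ('List of cities in Germany', 'largest city')]
```

Parsing is cheap per page, but running it over every page of a full dump adds up.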

Next:

  • make shared repo and write documentation

Update week 2020-08-10:

Update week 2020-08-17:

  • started to work hands-on with the code in the repo
  • trying to re-run individual scripts on the stat machines to replicate the final model
    • generating the anchor dictionary; to parse the dump more efficiently, we can process the smaller-sized chunks in parallel instead of the full file, which substantially reduces the time needed for parsing (even English takes only ~3 hours); see the sketch after this list
    • generating the sentences for the backtesting protocol
    • generating features for articles; here we generate embeddings from text and from navigation
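
A hypothetical sketch of the chunked approach, assuming the numbered bz2 chunk files that Wikimedia publishes alongside the full pages-articles dump; the mwxml/mwparserfromhell usage and the merge step are illustrative, not the repo's actual script:

```
import bz2
import glob
from collections import Counter
from multiprocessing import Pool

import mwparserfromhell
import mwxml

def count_anchors_in_chunk(path):
    """Count (anchor text, target title) pairs in one dump chunk."""
    counts = Counter()
    dump = mwxml.Dump.from_file(bz2.open(path, "rt"))
    for page in dump:
        for revision in page:  # pages-articles dumps carry one revision per page
            wikicode = mwparserfromhell.parse(revision.text or "")
            for link in wikicode.filter_wikilinks():
                title = str(link.title).strip()
                anchor = str(link.text).strip() if link.text else title
                counts[(anchor, title)] += 1
    return counts

if __name__ == "__main__":
    # one worker per chunk file instead of a single pass over the full dump
    chunks = sorted(glob.glob("enwiki-*-pages-articles*.xml-p*.bz2"))
    with Pool(processes=8) as pool:
        partial = pool.map(count_anchors_in_chunk, chunks)
    anchor_counts = sum(partial, Counter())
```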

next:

  • generating the training data from sentences + features, training the model, and applying it in the API
  • currently, the scripts in the repo cover only English, so we have to check how well these approaches work for other languages.

Update week 2020-08-24:

  • attended (parts of) the Growth team's virtual offsite around Add-a-link
  • reproduced the processing pipeline for English
  • started to work on some bug fixes
    • normalizing titles and resolving redirects
  • started to work on some improvements to the pipeline:
    • build an executable script to get navigation-based features for each language
    • memory-map features to allow for memory-efficient API hosting (see the sketch below)
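
A minimal sketch of the memory-mapping idea with numpy, assuming the per-article feature matrix is written to a flat binary file offline; file name, shape, and dtype are illustrative:

```
import numpy as np

N_ROWS, DIM = 1000, 50

# offline step: write the feature matrix to a flat binary file
features = np.random.rand(N_ROWS, DIM).astype(np.float32)  # stand-in data
out = np.memmap("features.dat", dtype=np.float32, mode="w+", shape=(N_ROWS, DIM))
out[:] = features
out.flush()

# in the API process: open read-only; rows are paged in from disk on access,
# so the full matrix never has to be loaded into RAM
feats = np.memmap("features.dat", dtype=np.float32, mode="r", shape=(N_ROWS, DIM))
query_vector = feats[42]  # reads only the pages backing this row
```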

Update week 2020-09-01:

  • created executable scripts to get feature vectors from navigation
  • added improvements to generating link candidates (see the sketch below):
    • normalizing article titles
    • resolving redirects
    • only considering candidate articles that are in the main namespace and are not redirects (avoiding links to categories, files, etc.)
  • this leads to a substantial reduction in the number of candidate links and should thus yield higher-quality predictions
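
A minimal sketch of these clean-up steps, assuming lookup tables built from the dump: redirects maps redirect titles to their targets and main_ns is the set of non-redirect main-namespace titles; all names here are hypothetical:

```
def normalize_title(title: str) -> str:
    """MediaWiki-style normalization: underscores to spaces,
    first letter uppercased (the default on most wikis)."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]

def resolve_candidate(title, redirects, main_ns):
    """Return the canonical target title, or None if it should not be recommended."""
    title = normalize_title(title)
    title = redirects.get(title, title)  # follow one redirect hop
    return title if title in main_ns else None

# usage: resolve_candidate("federal_republic of Germany", redirects, main_ns)
# -> "Germany" if the redirect table maps it there, None if out of scope
```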

Playing with a trained model from the pipeline, it is becoming clear that there are still link recommendations for strings such as 'of' or 'it' (with proposed links to the corresponding or nonsensical articles). The solution is to calculate the link-probability, i.e. the ratio between the number of times the string occurs as a link and the total number of times it occurs, whether as a link or as plain text; anchors with a low link-probability can then be removed. This requires some work to efficiently count the number of times each anchor string appears as plain text across all articles.
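
A minimal sketch of this filter; the threshold value and names are hypothetical:

```
def link_probability(n_linked: int, n_plaintext: int) -> float:
    """Fraction of all occurrences of an anchor string that appear as a link."""
    total = n_linked + n_plaintext
    return n_linked / total if total else 0.0

MIN_LINK_PROBABILITY = 0.05  # hypothetical cut-off

def keep_anchor(n_linked: int, n_plaintext: int) -> bool:
    return link_probability(n_linked, n_plaintext) >= MIN_LINK_PROBABILITY

# 'of' is linked a handful of times but occurs as plain text millions of
# times, so its link-probability is ~0 and it is dropped as an anchor
assert not keep_anchor(n_linked=50, n_plaintext=2_000_000)
```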

Update week 2020-09-07:

  • added a first version of a backtesting protocol to evaluate precision and recall of link recommendations on a test set containing sentences with links
  • built a first prototype of an API which returns link recommendations for an article
  • started to move the generation of the anchor dictionary (articles and their mentions as text) to Spark to be able to calculate the link-probability (see the sketch below)
    • for example, for a recommendation like [[More (song)|more|pr=0.99998724]], the link-probability of 'more' will be very low, and thus this candidate will be removed
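
A hypothetical PySpark sketch of this computation; the input tables, column names, and threshold are illustrative, not the actual schema:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# assumed inputs: per-anchor counts of linked and plain-text occurrences
linked = spark.read.parquet("anchors_linked")    # columns: anchor, n_linked
plain = spark.read.parquet("anchors_plaintext")  # columns: anchor, n_plain

link_prob = (
    linked.join(plain, on="anchor", how="left")
    .fillna(0, subset=["n_plain"])
    .withColumn("p_link", F.col("n_linked") / (F.col("n_linked") + F.col("n_plain")))
)

# keep only anchors that are linked often enough relative to plain-text use;
# 'more' would fall far below this cut-off and be dropped
filtered = link_prob.filter(F.col("p_link") >= 0.05)
```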

Update week 2020-09-14:

  • moved the generation of the anchor dictionary to Spark to filter anchors by link-probability; judging from some examples, this markedly improves the recommendations
  • changed the pipeline so that all feature vectors are memory-mapped, to decrease memory requirements when loading the model and querying recommendations for articles
  • started applying the pipeline available for English to other languages
  • met with Kosta (Growth) and Guiseppe (SRE) to discuss strategies to move the model towards production T258978#6462139
    • splitting model training and model querying; Guiseppe recommends running the latter in a Docker container (I have already moved towards splitting training and querying into separate repos, see https://github.com/dedcode/mwaddlink and https://github.com/martingerlach/mwaddlink-api)
    • Guiseppe stressed the importance of assessing and trying to reduce memory usage when querying the model (I have already moved towards memory-mapping the model)

Update week 2020-09-21:

  • built a full pipeline to generate a trained model for an arbitrary language
  • ran the pipeline for one other language (German) with qualitative inspection of sample articles
  • started to implement and include the automatic backtesting to get a quantitative evaluation (see the sketch below; will investigate in more detail next quarter to check performance in previously unseen languages)
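
A minimal sketch of the quantitative evaluation, assuming the backtesting set pairs each sentence with its gold links and the model exposes a recommend_links function returning a set of (anchor, target) pairs; the interface is hypothetical:

```
def evaluate(test_set, model):
    """Micro-averaged precision and recall over all sentences in the test set."""
    tp = n_pred = n_gold = 0
    for sentence, gold_links in test_set:
        predicted = model.recommend_links(sentence)  # hypothetical interface
        tp += len(predicted & gold_links)            # correctly recovered links
        n_pred += len(predicted)
        n_gold += len(gold_links)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall
```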

Resolving this task as onboarding is complete (other work will be captured in separate tasks).