
Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation in 4 or more languages
Closed, Resolved · Public

Description

Develop a backtesting protocol to automatically evaluate the performance of the link recommendation algorithm in different languages.

  • get backtesting dataset (test set)
  • build a pipeline to evaluate precision and recall for the trained model
  • calculate performance metrics for 4 more languages (besides English)
  • inspect false positives to identify possible improvements to the algorithm

Event Timeline

Update week 2020-10-12:

  • started to build the backtesting dataset (https://github.com/dedcode/mwaddlink/blob/master/scripts/generate_backtesting_data.py)
    • for a given article, we only take the first sentence;
      • the assumption is that the first sentence is well-linked
      • we avoid the potential issue that a link is missing because it appears earlier in the article
    • we collect 100k sentences (with existing links) for each wiki
  • we will run the trained link recommendation model to recommend links for each sentence
  • evaluate micro-precision and micro-recall, aggregating counts over all sentences (see the sketch below)
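
For illustration, a minimal sketch of what the micro-averaged evaluation computes; the function and the data layout are illustrative, not the actual pipeline code:

```lang=python
def micro_precision_recall(sentences):
    """Pool true/false positives and false negatives over all test
    sentences (micro-averaging) before computing precision and recall.

    Each item in `sentences` is a pair (predicted, existing): two sets
    of (anchor, target) link pairs for one sentence.
    """
    tp = fp = fn = 0
    for predicted, existing in sentences:
        tp += len(predicted & existing)   # recommended and actually linked
        fp += len(predicted - existing)   # recommended but not linked
        fn += len(existing - predicted)   # linked but not recommended
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```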

Update week 2020-10-19:

  • ran backtesting on 7 wikis (simple, de, pt, ar, cs, ko, vi); results on Meta
  • wrote a high-level summary of the model and put it on Meta
  • planned work
    • backtesting allows us to investigate false positives to identify issues with the model and to understand differences across languages (even though, at the moment, the model yields satisfactory results in all languages)
    • discussed the model/results in the Tuesday and Thursday meetings, where there was useful feedback on how to potentially improve the model and on potential corner cases, for example:
      • to avoid recommending existing links we parse the wikitext, but this misses links that come from templates, which can be a substantial fraction of links (see the sketch after this list)
      • use only high-quality articles to construct the gold-standard data for training and backtesting
      • how to deal with red links
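
To illustrate the template issue mentioned above, a hedged sketch (assuming a parser like mwparserfromhell, which may differ from what the pipeline actually uses): parsing raw wikitext only surfaces explicit [[...]] wikilinks, so links produced by template expansion never show up and cannot be excluded from the recommendations.

```lang=python
import mwparserfromhell

def existing_links(wikitext):
    """Return link targets written explicitly as [[...]] wikilinks.

    Links generated by template expansion (infoboxes, navboxes, ...)
    do not appear as wikilinks in the raw source and are missed here.
    """
    wikicode = mwparserfromhell.parse(wikitext)
    return {str(link.title).strip() for link in wikicode.filter_wikilinks()}

# "[[Berlin]]" is found; any link emitted by the template is not.
print(existing_links("[[Berlin]] is the capital. {{Infobox settlement}}"))
```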

Some work also happened around productionizing:

  • in terms of productionizing, the code was moved to Gerrit so that the product team can start working on the deployment pipeline (T261403)

https://github.com/wikimedia/research-mwaddlink

  • there are some open issues about how to query the model (T265610)

Update week 2020-10-26:

  • familiarizing myself with code review in Gerrit, where the repo is now hosted
  • ongoing discussions and coordination to support the Growth team in productionizing the model:
    • adapting the output of the model to ensure conversion between wikitext (model) and VisualEditor (front-end)
    • converting the data generated during training and needed for querying the model from sqlite to MySQL tables (a requirement for productionizing; T265610); a rough sketch of this kind of conversion is below
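
A rough sketch of the kind of sqlite-to-MySQL conversion involved; the table layout and names are made up for illustration (the real tables are discussed in T265610), and pymysql stands in for whatever driver the production setup uses:

```lang=python
import sqlite3
import pymysql  # assumed driver, for illustration only

def copy_table(sqlite_path, mysql_conn, table="anchors"):
    """Copy a key-value lookup table from sqlite into MySQL in batches."""
    src = sqlite3.connect(sqlite_path)
    rows = src.execute(f"SELECT lookup, value FROM {table}")
    with mysql_conn.cursor() as cur:
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "  lookup VARBINARY(768) PRIMARY KEY,"
            "  value MEDIUMBLOB NOT NULL"
            ")"
        )
        while batch := rows.fetchmany(10000):
            cur.executemany(
                f"INSERT INTO {table} (lookup, value) VALUES (%s, %s)",
                batch,
            )
    mysql_conn.commit()
```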

Update week 2020-11-03:

  • this week a lot of work went into working with the product team to move the model further towards production. I started to submit patches to Gerrit to introduce some changes to the codebase there; at the same time, I spent a substantial amount of time reviewing code, most of it related to switching from sqlite to MySQL, i.e. making sure that the model doesn't break in the transition.
  • particular points:
    • we figured out a working solution to convert the data tables needed for querying the model to MySQL
    • also moved and fixed the training pipeline, which was broken after the move of the codebase to Gerrit
    • added a patch to provide a context window for each link recommendation; this was crucial because links are generated using wikitext but have to be inserted via VisualEditor on the front-end; the context window was requested in order to avoid potential ambiguity when placing the link in VisualEditor (see the sketch after this list)
    • added an option for the maximum number of link recommendations to query, in order to reduce the number of calls to the MySQL tables
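
A minimal sketch of the context-window idea (names and the window size are illustrative): alongside each recommended anchor we return a few characters of surrounding text, so that the front-end can locate the exact occurrence even if the anchor string appears multiple times in the article.

```lang=python
def with_context(text, start, end, window=20):
    """Return the anchor plus `window` characters of context on each side.

    `start`/`end` delimit the anchor occurrence in the text; the context
    lets VisualEditor disambiguate repeated anchor strings.
    """
    return {
        "anchor": text[start:end],
        "context_before": text[max(0, start - window):start],
        "context_after": text[end:end + window],
        "offset": start,
    }
```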

Update week 2020-11-16:

  • this week I met with Djellel to discuss possible further developments of the algorithm to improve performance:
    • generating backtesting data only from high-quality articles to improve the reliability of the ground truth (challenge: a quality score is not easily available in all languages)
    • adding additional features to the link prediction model
    • how to feed user feedback on accepted/rejected links back into model training
    • we are exploring these options (with lower priority from my side) independently of the model that is currently being moved into production; the latter has at least acceptable accuracy (see results) and we want to avoid breaking the current pipeline
  • still working on moving the output of the model to MySQL databases; this is not so much a research problem as making sure the model can be moved to production within its computational-resource footprint without breaking or rewriting the whole training pipeline. For example, in order to limit the number of queries to the MySQL databases, I added the possibility to set a maximum number of recommendations per article (see the sketch after this list)
  • attended the Growth team's deep-dive meeting to discuss implementation issues of the model in detail
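
A sketch of how such a cap might work (the names and the early-stopping strategy are assumptions, not the actual implementation): once enough candidates have been accepted, we stop scoring, which also bounds the number of feature lookups against the MySQL tables.

```lang=python
def top_recommendations(candidates, score_fn, max_recs=15, threshold=0.5):
    """Score candidate anchors and stop once `max_recs` are accepted.

    Early stopping bounds the per-article work, including the feature
    lookups that `score_fn` performs against the MySQL tables.
    """
    recs = []
    for candidate in candidates:
        prob = score_fn(candidate)  # model probability that this is a link
        if prob >= threshold:
            recs.append((candidate, prob))
            if len(recs) >= max_recs:
                break
    return recs
```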

Update week 2020-11-23:

  • gave a presentation in the Tuesday team meeting: https://docs.google.com/presentation/d/1AGlNI6slw1ShCasT9OEbBik_sIsMwErFVyhJO1FE4K4/edit#slide=id.g6237f1b673_0_590
    • discussed different possibilities to improve algorithm
    • discussed experiences around productionizing research-models
  • managed to move the data to MySQL databases on the stats machines
    • this required some testing and profiling, e.g. of the number of queries (potentially too many when generating recommendations for large articles, so I started exploring possible strategies to decrease them) and of encoding issues with MySQL (non-trivial, but seemingly rare for the 7 languages tested; see the note after this list)
    • integrated writing the output of the trained model (data + actual model) into the training pipeline
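
One concrete instance of the encoding issues worth noting (an assumption about the root cause, but a common one with MediaWiki content): MySQL's legacy `utf8` charset only stores code points up to 3 bytes, so titles containing 4-byte UTF-8 characters require `utf8mb4` end to end, e.g. when connecting:

```lang=python
import pymysql  # assumed driver, for illustration only

# utf8mb4 avoids truncation/errors on 4-byte UTF-8 characters that
# MySQL's legacy 3-byte `utf8` charset cannot store.
conn = pymysql.connect(
    host="localhost",
    user="research",
    password="...",
    database="mwaddlink",
    charset="utf8mb4",
)
```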

Update week 2020-12-04:

Spent some time on smaller improvements to the algorithm that were on the list but not high priority:

  • fixed an artifact from an earlier version of building the train/test set for training and automatic evaluation: extracted linked sentences were not split randomly between the two sets, leading to an imbalance (articles earlier in the dump, with smaller IDs, tended to end up preferentially in the training set); ensuring random shuffling increased precision from 0.7 to 0.79 with virtually the same recall (see the sketch after this list)
  • started a more systematic investigation of false positives in the backtesting evaluation to identify clear cases where the recommendation fails (here). This revealed an error in generating candidate anchors for n-grams involving non-alphabetic characters (common in, e.g., linked city names such as "Latrobe, Pennsylvania"). While not linking that item is not problematic in itself, it leads to artifacts where sub-n-grams are then incorrectly recommended (in the case above, a link to "Pennsylvania"); this still needs to be solved
  • discussed how the feature tables should be moved to production. One idea was to publish the data and the model, since that makes it easier to copy and would allow users to generate their own recommendations. One unsolved issue is the feature table derived from reading sessions, equivalent to the navigation vectors. A pragmatic solution that emerged was to not use that data in the prediction (and hence not publish it); a first test indicates that excluding this feature has only a small detrimental effect on precision and recall. This motivates a more systematic evaluation of the efficacy of each feature for the recommendation: so far I have concentrated on making the model work with reasonable performance across a set of languages, but it seems useful to check which features actually contribute substantially and which could be dropped. This would go hand in hand with some ideas discussed with Djellel on which other features could be added to improve performance
  • planning to add a more fine-grained evaluation analysis at the level of topics (instead of over all articles), since i) one might hypothesize that candidate links are easier for some articles than for others, and ii) recommendations for newcomers are generated for articles in a specific topic, so it would be good to ensure that there are no outliers
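
A minimal sketch of the shuffled split mentioned in the first item above (names are illustrative; the actual fix lives in the repo): shuffling before the cut ensures that dump order, and hence article ID, cannot leak into the train/test assignment.

```lang=python
import random

def split_sentences(sentences, test_fraction=0.2, seed=42):
    """Shuffle before splitting so that dump order (article ID) cannot
    bias which sentences end up in the training vs. the test set."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * (1 - test_fraction))
    return sentences[:cut], sentences[cut:]
```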

Update week 2020-12-07:

  • did a more thorough analysis of the model in terms of the importance of individual features
    • the navigation-based feature (distance, in an embedding obtained from reading sessions, between the articles to be linked) has a very low feature importance; a model without this feature has virtually the same performance in terms of precision and recall on the backtesting data (results in spreadsheet) across all languages (simple, de, pt, ar, bn, cs, vi)
    • based on these insights, we exclude the navigation-based feature from the model (patch in Gerrit); this will make it easier to share the model and data publicly (see the ablation sketch after this list)
  • investigating the performance of the models in different languages: bn-wiki seems to be an outlier compared to the other languages, in the sense that in order to get a precision of at least 70-80%, recall would only be around 10%. One possible implication is that we might be able to make good recommendations for only a few articles.
    • I will do some exploratory analysis next week to see if we can identify possible issues with the model in this particular language. Discussing with Marshall, this serves as a good test case for whether and how we are able to identify the underlying problem once we detect that the model is not performing well.
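
A sketch of the ablation logic behind dropping the navigation-based feature (assuming a scikit-learn-style classifier; the actual model code lives in research-mwaddlink): retrain without one feature column and compare precision and recall against the full model.

```lang=python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def ablate_feature(model_cls, X_train, y_train, X_test, y_test, feature_idx):
    """Retrain without one feature column and report precision/recall,
    to be compared against the model trained on all features."""
    X_tr = np.delete(X_train, feature_idx, axis=1)
    X_te = np.delete(X_test, feature_idx, axis=1)
    model = model_cls().fit(X_tr, y_train)
    pred = model.predict(X_te)
    return precision_score(y_test, pred), recall_score(y_test, pred)
```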

Task completed. Marking as resolved.