In {T252822}, we (#Growth-Team) are working on a project to guide new users in how to add links to Wikipedia articles.
There is a tool, [mwaddlink](https://github.com/dedcode/mwaddlink), that #research (specifically @DED and @MGerlach) are working on; it's a Python application with some machine learning libraries. It takes wikitext from articles and outputs new wikitext with additional links added based on a machine learning algorithm.
We probably do not want to call this tool on demand (it can take several hundred milliseconds to several seconds to generate link recommendations for an article). We will likely store the results of calling the service in a per-wiki MySQL table managed by the #growthexperiments extension.
We will likely store the index of which articles have link recommendations in ElasticSearch, using a new search field like `hasrecommendations:links`.
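For illustration, once such a search keyword exists, a client could find articles that have stored recommendations through the standard MediaWiki search API. This is only a sketch, since `hasrecommendations:links` is still a proposal:

```python
# Minimal sketch: querying for articles that have link recommendations,
# assuming the proposed `hasrecommendations:links` search keyword is implemented.
# Uses the standard MediaWiki search API (action=query, list=search).
import requests

def articles_with_link_recommendations(api_url, limit=50):
    """Return page titles matching the proposed search keyword."""
    response = requests.get(api_url, params={
        'action': 'query',
        'list': 'search',
        'srsearch': 'hasrecommendations:links',  # proposed keyword, not live yet
        'srlimit': limit,
        'format': 'json',
    })
    response.raise_for_status()
    return [hit['title'] for hit in response.json()['query']['search']]

# Example (hypothetical wiki endpoint):
# titles = articles_with_link_recommendations('https://cs.wikipedia.org/w/api.php')
```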
There is a [draft proposal for the project architecture](https://docs.google.com/document/d/1Y0Jt2N20e7-H83MMAqVYcSB-UIGba1YoSQE39z2dlds/edit?usp=sharing), and there is also a [longer document with notes exploring various options and questions](https://docs.google.com/document/d/187LPs2c5j13O8dlemwsWMEkn4__LgaN7TcXwEkakxYY/edit?usp=sharing).
Very high level summary (https://wikitech.wikimedia.org/wiki/Add_Link is the canonical source):
- Research has a codebase which trains the AI model on production Stats machines
- Research has a simple API that runs in a container via the Deployment Pipeline; it accepts a page title and wiki language and responds with wikitext containing link recommendations
- The GrowthExperiments extension will call the API via a maintenance script on cron and cache the output in a MySQL table
- GrowthExperiments will generate an event which the Search team will consume; they will update the ElasticSearch index for a document to indicate whether the article has link recommendations
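To make the per-article call in that flow concrete, here is a rough sketch from the calling script's point of view. The endpoint URL, parameter names, and response shape are assumptions, not the actual mwaddlink-api interface, and the real maintenance script would live inside the GrowthExperiments extension; Python is used here only for illustration:

```python
# Rough sketch of the per-article flow described above.
# SERVICE_URL, the request parameters, and the response shape are assumptions;
# the actual mwaddlink-api interface may differ.
import requests

SERVICE_URL = 'http://mwaddlink-api.example.internal/recommendations'  # hypothetical

def fetch_link_recommendations(page_title, wiki_lang):
    """Ask the service for wikitext with suggested links added."""
    response = requests.get(SERVICE_URL, params={
        'title': page_title,  # assumed parameter name
        'lang': wiki_lang,    # assumed parameter name
    })
    response.raise_for_status()
    return response.text  # wikitext containing the recommended links

# The maintenance script would then cache this output in the per-wiki MySQL
# table and emit an event so the Search team can update the ElasticSearch index.
```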
## Potential options
### 1 Execute via shell to a pre-compiled binary
Per @Joe, this approach is discouraged, so it has been removed from the list.
### 2A Standalone web service, like ORES
* the tool (currently a CLI tool) would need a layer on top of it to provide simple web service capabilities (a GET request with an article ID); a rough sketch of such a layer follows this list
* it would need storage (it is not yet clear what that would look like), plus a way to update ElasticSearch with data about which articles have link recommendations
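As an illustration of that "layer on top", a thin HTTP wrapper around the CLI could look something like the following. Flask is just one possible choice, and the command line shown is a placeholder rather than mwaddlink's actual invocation:

```python
# Illustrative only: a thin GET wrapper around the existing CLI tool.
# The command line shown is a placeholder; mwaddlink's real invocation,
# arguments, and output format would need to be substituted.
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/recommendations/<int:article_id>')
def recommendations(article_id):
    lang = request.args.get('lang', 'en')
    result = subprocess.run(
        ['python', 'mwaddlink_cli.py', '--page-id', str(article_id), '--lang', lang],  # placeholder command
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode != 0:
        return jsonify({'error': result.stderr}), 500
    return jsonify({'wikitext': result.stdout})

if __name__ == '__main__':
    app.run(port=8000)
```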
Questions:
- [ ] Who would maintain / deploy this?
- [ ] Which team could do the setup work of writing the web service for the tool? (Or is that on #growth-team?) We could probably write something PHP-based if that's acceptable, but as our other services are Node.js/Python, would this be an issue?
- [ ] Is this something that would be deployed via Kubernetes?
### 2B Standalone web service, on Toolforge
* the tool (currently a CLI tool) would need a layer on top of it to provide simple web service capabilities (a GET request with an article ID)
* our front-end code would call MediaWiki, which would proxy requests to Toolforge to work around privacy/security issues, so it will add some latency
Questions:
- [ ] Is this approach allowed from a #serviceops perspective?
### 3 Microservice deployed via Kubernetes
* A simple web service accepts a POST request with some input and responds with some HTML output
* the results are stored in a MySQL table managed by the GrowthExperiments extension (a toy sketch of this cache follows the list)
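To illustrate the shape of that cache, here is a toy sketch of storing the service's output keyed by page ID. SQLite stands in for the per-wiki MySQL table, and the table and column names are invented; the real table would be defined and managed by the extension:

```python
# Toy sketch of the result cache; SQLite stands in for the per-wiki MySQL
# table, and the table/column names are invented for illustration.
import sqlite3
import time

conn = sqlite3.connect('link_recommendations.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS growthexperiments_link_recommendations (
        page_id     INTEGER PRIMARY KEY,
        wikitext    TEXT NOT NULL,
        updated_at  INTEGER NOT NULL
    )
""")

def cache_recommendation(page_id, wikitext):
    """Insert or refresh the stored recommendation for one article."""
    conn.execute(
        "REPLACE INTO growthexperiments_link_recommendations VALUES (?, ?, ?)",
        (page_id, wikitext, int(time.time())),
    )
    conn.commit()
```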
Questions:
- [ ] Who needs to write the web service that interacts with the CLI? If that is Growth, is PHP acceptable?
- [ ] Who would maintain the service?
- [ ] How would Research update their code / models once it is running?
## Open questions
- How do we transfer the trained model data (files, though we could also imagine using a Cassandra key-value store) from the Stats machines to somewhere that the container can make use of them?
- How much RAM is used per request in the mwaddlink-api application?
## Miscellaneous
- For our initial release we want to have a pool of several thousand articles that have link recommendations. That will mean processing perhaps tens of thousands of articles per wiki, as not every article will yield (good) link recommendations. Then we will also want to regularly update link recommendations, either for every Nth article edit or via a cron job that targets a broad swath of articles with null edits to trigger the link recommendation generation process. (More details are in the [project architecture document](https://docs.google.com/document/d/1Y0Jt2N20e7-H83MMAqVYcSB-UIGba1YoSQE39z2dlds/edit#).)
### Further reading
- https://wikitech.wikimedia.org/wiki/Add_Link
- [mwaddlink](https://github.com/dedcode/mwaddlink), the Python codebase with the machine learning libraries used to train the model; see also https://github.com/martingerlach/mwaddlink-api for the application which handles requests and returns a response
- the [draft proposal for the project architecture](https://docs.google.com/document/d/1Y0Jt2N20e7-H83MMAqVYcSB-UIGba1YoSQE39z2dlds/edit?usp=sharing) and the [longer document with notes exploring various options and questions](https://docs.google.com/document/d/187LPs2c5j13O8dlemwsWMEkn4__LgaN7TcXwEkakxYY/edit?usp=sharing) linked above