In {T252822}, we are working on a project to guide new users in adding links to Wikipedia articles.
There is a tool, [mwaddlink](https://github.com/dedcode/mwaddlink), that #research (specifically @DED and @MGerlach) are working on; it is a Python application that uses some machine learning libraries. It takes wikitext from articles and outputs new wikitext with additional links added based on a machine learning algorithm.
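To make that contract concrete, here is a rough sketch of the transformation the tool performs. The function name and the recommendation format are made up for illustration; mwaddlink's real interface and model output look different:

```python
# Illustration only: mwaddlink's real interface differs. The point is the
# shape of the operation: wikitext in, wikitext with extra [[links]] out.
def add_links(wikitext: str, recommendations: dict[str, str]) -> str:
    """Naive sketch: wrap each recommended phrase in a wikilink.

    `recommendations` maps a phrase in the article to a target title,
    standing in for the machine learning model's output."""
    for phrase, target in recommendations.items():
        if phrase in wikitext and f"[[{phrase}" not in wikitext:
            wikitext = wikitext.replace(phrase, f"[[{target}|{phrase}]]", 1)
    return wikitext
```

For example, `add_links("The cat sat on the mat.", {"cat": "Cat"})` returns `"The [[Cat|cat]] sat on the mat."`.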
We probably do not want to call this tool on demand (it can take several hundred milliseconds to several seconds to generate link recommendations for an article). We will likely store the results of calling the service in a per-wiki MySQL table managed by the #growthexperiments extension.
We will likely store the index of which articles have link recommendations in ElasticSearch, using a new search field like `hasrecommendations:links`.
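If that field ships, finding candidate articles would be an ordinary search API query. A sketch (the `hasrecommendations:links` keyword is only the proposal here; everything else is the standard MediaWiki action API `list=search` module):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_search_url(api_url: str, limit: int = 10) -> str:
    """Build a list=search query for the proposed keyword."""
    params = urlencode({
        "action": "query",
        "list": "search",
        "srsearch": "hasrecommendations:links",  # proposed keyword, not live
        "srlimit": limit,
        "format": "json",
    })
    return f"{api_url}?{params}"

def articles_with_link_recs(api_url: str, limit: int = 10) -> list[str]:
    """Return titles of articles flagged as having link recommendations."""
    with urlopen(build_search_url(api_url, limit)) as resp:
        data = json.load(resp)
    return [hit["title"] for hit in data["query"]["search"]]
```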
There is a [draft proposal for the project architecture](https://docs.google.com/document/d/1Y0Jt2N20e7-H83MMAqVYcSB-UIGba1YoSQE39z2dlds/edit?usp=sharing), and there is also a [longer document with some notes exploring various options and questions](https://docs.google.com/document/d/187LPs2c5j13O8dlemwsWMEkn4__LgaN7TcXwEkakxYY/edit?usp=sharing).
I'll try to break them out into various approaches with questions for us to discuss.
## Potential options
### 1A Generate-on-demand, execute via shell to pre-compiled binary
* generate a binary based on the python code
* front-end code requests link recommendations for an article from an API module
* API module checks the cache; if recommendations are not present, it shell-execs the binary
* cache the results
* invalidate the results when an edit is made to the article
* I'm assuming the pre-compiled binary (generated from the python codebase) would live in #growthexperiments extension
Questions:
- [ ] is `exec`'ing to an audited binary like this possible in our stack?
- [ ] where would the binary live? In #growthexperiments extension?
- [ ] what would the process be like for compiling the binary from a security standpoint?
- [ ] Would this qualify as a "service" in the way it's defined in the [Wikimedia services policy](https://www.mediawiki.org/wiki/Wikimedia_services_policy)?
### 1B Generate lazily, execute via shell to pre-compiled binary
* similar to above, except we would use a cron job or a deferred update on every Nth edit to generate the recommendations
* for storage, MySQL would probably make the most sense
Questions:
- same as above, plus
- [ ] Any concerns about using MySQL as a storage space in this scenario?

Per @Joe, this approach is discouraged, so removing it from the list.
### 2A Standalone web service, like ORES
* the tool (currently a CLI tool) would need a layer on top of it to provide simple web service capabilities (a GET request with an article ID)
* it would need storage, not sure what that would look like, plus a way to update ElasticSearch with data about which articles have link recommendations
Questions:
- [ ] who would maintain / deploy this?
- [ ] which team could do the setup work of writing the web service for the tool, or is that on #growth-team? We could probably write something PHP-based if that's acceptable, but since our other services are Node.js/Python, would this be an issue?
- [ ] is this something that would be deployed via Kubernetes?
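A sketch of how thin the web service layer in 2A could be (the route, parameter name, and response shape are all invented for illustration, and `recommend_links` is a stand-in for calling into the mwaddlink code):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def recommend_links(page_id: int) -> dict:
    """Placeholder for invoking the actual mwaddlink pipeline."""
    return {"pageid": page_id, "links": []}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /recommendations?pageid=12345
        query = parse_qs(urlparse(self.path).query)
        try:
            page_id = int(query["pageid"][0])
        except (KeyError, ValueError):
            self.send_error(400, "missing or invalid pageid")
            return
        body = json.dumps(recommend_links(page_id)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8000), Handler).serve_forever()
```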
### 2B Standalone web service, on Toolforge
* the tool (currently a CLI tool) would need a layer on top of it to provide simple web service capabilities (a GET request with an article ID)
* our front-end code would call MediaWiki, which would proxy requests to Toolforge to work around privacy/security issues, so it would add some latency
Questions:
- [ ] Is this approach allowed from a #serviceops perspective?
### 3 Microservice deployed via Kubernetes
* A simple web service accepts a POST request with some input and responds with some HTML output
* the results are stored in a MySQL table managed by the GrowthExperiments extension
Questions:
- [ ] Who needs to write the web service that interacts with the CLI? If that is Growth, is PHP acceptable?
- [ ] Who would maintain the service?
- [ ] How would research update their code / models once it is running?
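A sketch of what the per-wiki storage table for the results could look like (the table and column names, including the `gelr_` prefix, are guesses; sqlite3 is used here only to keep the sketch runnable, while production would be a MySQL table managed by GrowthExperiments):

```python
import json
import sqlite3

# Hypothetical schema; names are placeholders, not a proposal.
SCHEMA = """
CREATE TABLE growthexperiments_link_recommendations (
    gelr_page_id INTEGER PRIMARY KEY,  -- article the recommendations are for
    gelr_revision INTEGER NOT NULL,    -- revision they were computed against
    gelr_data TEXT NOT NULL            -- serialized model output (JSON)
)
"""

def store(conn, page_id, revision, recs):
    """Insert or refresh the recommendations for a page."""
    conn.execute(
        "INSERT OR REPLACE INTO growthexperiments_link_recommendations "
        "VALUES (?, ?, ?)", (page_id, revision, json.dumps(recs)))

def fetch(conn, page_id):
    """Return (revision, recommendations) or None if nothing is stored."""
    row = conn.execute(
        "SELECT gelr_revision, gelr_data "
        "FROM growthexperiments_link_recommendations "
        "WHERE gelr_page_id = ?", (page_id,)).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))
```

Keeping the revision ID alongside the data lets the lookup side detect stale recommendations cheaply.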
## Miscellaneous
- For our initial release we want to have a pool of several thousand articles that have link recommendations. That will mean processing perhaps tens of thousands of articles per wiki, as not every article will yield (good) link recommendations. Then we will also want to regularly update link recommendations, either on every Nth article edit or via a cron job that targets a broad swath of articles with null edits to trigger the link recommendation generation process. (More details are in the [project architecture document](https://docs.google.com/document/d/1Y0Jt2N20e7-H83MMAqVYcSB-UIGba1YoSQE39z2dlds/edit#).)
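The "every Nth edit" trigger could be as simple as probabilistic sampling in an edit hook; a sketch (the rate is a placeholder, and real code might prefer a deterministic counter):

```python
import random

REFRESH_EVERY_N_EDITS = 50  # placeholder sampling rate

def should_regenerate(page_id: int, n: int = REFRESH_EVERY_N_EDITS) -> bool:
    """On average, one in every N edits schedules a link recommendation
    refresh for the edited page, with no per-page edit counter needed."""
    return random.randrange(n) == 0
```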