In {T252822}, we are working on a project to guide new users in adding links to Wikipedia articles.
There is a tool that #research (specifically @DED and @MGerlach) are working on; it's a Python application that uses some machine learning libraries. It takes wikitext from an article and outputs new wikitext with additional links added, based on a machine learning algorithm.
We don't yet know whether we want to call this tool on demand (it can take several hundred milliseconds to several seconds to generate link recommendations for an article), whether we want to pre-generate the recommendations, store them somewhere, and have a lookup service to retrieve them, or whether we want a hybrid model (use cached data, but regenerate on demand if it's stale).
We do feel pretty confident that we will store the index of which articles have link recommendations in ElasticSearch.
We have some notes about [various technical questions here](https://docs.google.com/document/d/187LPs2c5j13O8dlemwsWMEkn4__LgaN7TcXwEkakxYY/edit?usp=sharing).
I'll try to break them out into various approaches with questions for us to discuss.
## Potential options
### 1A Generate-on-demand, execute via shell to pre-compiled binary
* generate a binary based on the Python code
* front-end code requests link recommendations for an article from an API module
* API module checks the cache; if recommendations are not cached, shell exec to the binary (rough sketch after this list)
* cache the results
* invalidate the results when an edit is made to the article
* I'm assuming the pre-compiled binary (generated from the Python codebase) would live in the #growthexperiments extension
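
To make the flow concrete, here's a rough sketch in Python (only because the underlying tool is Python; in GrowthExperiments this would be PHP). The binary path, its stdin/stdout interface, and the cache keying are assumptions, not the actual tool's API:

```python
import subprocess

# Purely illustrative in-process cache; in MediaWiki this would presumably be
# WANObjectCache. Keying on (page ID, latest revision ID) is one way to get the
# "invalidate on edit" behaviour, since an edit changes the revision ID.
_cache: dict[tuple[int, int], str] = {}


def get_link_recommendations(page_id: int, rev_id: int, wikitext: str) -> str:
    """Return wikitext with suggested links, generating on demand on a cache miss."""
    key = (page_id, rev_id)
    if key not in _cache:
        # Hypothetical binary name and interface: wikitext on stdin,
        # annotated wikitext on stdout.
        result = subprocess.run(
            ["/usr/local/bin/link-recommender"],
            input=wikitext,
            capture_output=True,
            text=True,
            check=True,
            timeout=10,  # generation can take several seconds per article
        )
        _cache[key] = result.stdout
    return _cache[key]
```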
Questions:
- [ ] is `exec`'ing to an audited binary like this possible in our stack?
- [ ] where would the binary live? In #growthexperiments extension?
- [ ] what would the process be like for compiling the binary from a security standpoint?
- [ ] Would this qualify as a "service" in the way it's defined in [Wikimedia services policy](https://www.mediawiki.org/wiki/Wikimedia_services_policy)?
### 1B Generate lazily, execute via shell to pre-compiled binary
* similar to the above, except we would use a cron job or a deferred update on every Nth edit to generate the recommendations (sketch after this list)
* for storage, MySQL would probably make the most sense
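
A minimal sketch of the lazy/batch variant, again with a hypothetical binary, table name, and schema; it's only meant to make the MySQL storage shape concrete, not to prescribe the real implementation:

```python
import subprocess

import pymysql

# Hypothetical schema:
#   CREATE TABLE link_recommendations (
#     lr_page_id INT UNSIGNED PRIMARY KEY,
#     lr_rev_id  INT UNSIGNED NOT NULL,
#     lr_data    MEDIUMTEXT NOT NULL   -- serialized recommendations
#   );


def refresh_recommendations(conn: pymysql.connections.Connection, pages) -> None:
    """Cron-style batch job: (re)generate and store recommendations for the
    given (page_id, rev_id, wikitext) tuples."""
    with conn.cursor() as cur:
        for page_id, rev_id, wikitext in pages:
            result = subprocess.run(
                ["/usr/local/bin/link-recommender"],  # same hypothetical binary as in 1A
                input=wikitext, capture_output=True, text=True, check=True,
            )
            cur.execute(
                "REPLACE INTO link_recommendations (lr_page_id, lr_rev_id, lr_data) "
                "VALUES (%s, %s, %s)",
                (page_id, rev_id, result.stdout),
            )
    conn.commit()
```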
Questions:
- same as above, plus
- [ ] Any concerns about using MySQL as a storage space in this scenario?
### 2A Standalone web service, like ORES
* the tool (currently a CLI tool) would need a layer on top of it to provide simple web service capabilities (a GET request with an article ID); a minimal sketch follows this list
* it would need storage (not sure yet what that would look like), plus a way to update ElasticSearch with data about which articles have link recommendations
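
A minimal sketch of the web service layer, assuming Flask and a made-up route; `recommend_links` stands in for however the research tool is actually invoked, since its real entry point isn't defined here. It only illustrates the GET-by-article-ID shape, not a real deployment:

```python
import requests
from flask import Flask, abort, jsonify

app = Flask(__name__)

MW_API = "https://en.wikipedia.org/w/api.php"  # one wiki, purely for illustration


def recommend_links(wikitext: str) -> dict:
    """Stand-in for the research tool's entry point; real name/output unknown."""
    raise NotImplementedError


def fetch_wikitext(page_id: int) -> str | None:
    """Fetch the current wikitext of a page via the MediaWiki API."""
    r = requests.get(MW_API, params={
        "action": "query", "prop": "revisions", "rvprop": "content",
        "rvslots": "main", "pageids": page_id,
        "format": "json", "formatversion": 2,
    })
    page = r.json()["query"]["pages"][0]
    if page.get("missing"):
        return None
    return page["revisions"][0]["slots"]["main"]["content"]


@app.route("/v1/linkrecommendations/<int:page_id>")
def link_recommendations(page_id: int):
    wikitext = fetch_wikitext(page_id)
    if wikitext is None:
        abort(404)
    return jsonify(recommend_links(wikitext))
```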
Questions:
- [ ] who would maintain / deploy this?
- [ ] Which team would do the setup work of writing the web service for the tool? (Or is that on #growth-team?) We could probably write something PHP-based if that's acceptable, but since our other services are Node.js/Python, would this be an issue?
- [ ] is this something that would be deployed via Kubernetes?
### 2B Standalone web service, on Toolforge
* the tool (currently a CLI tool) would need a layer on top of it to provide simple web service capabilities (a GET request with an article ID)
* our front-end code would call MediaWiki, which would proxy requests to Toolforge to work around privacy/security issues, so there will be some added latency
Questions:
- [ ] Is this approach allowed from a #serviceops perspective?
### 3 Microservice deployed via Kubernetes
* The details are hazy to me, so there is no list here yet :)
Questions:
- [ ] Who needs to write the web service that interacts with the CLI? If that is Growth, is PHP acceptable?
- [ ] Who would maintain the service?
- [ ] How would #research update their code / models once it is running?
## Miscellaneous
- We currently have about [275 users per week who click on a task](https://docs.google.com/spreadsheets/d/1Ft0KdSL2kVm37KVRhp4sNQPNe_i94v7r5cECrFTN6JQ/edit?usp=sharing) from Special:Homepage, which is provided by GrowthExperiments. We have plans to deploy GrowthExperiments to 100 wikis ({T247507}), but even once we are there, traffic to the API that provides link suggestions will still be fairly limited: Special:Homepage is part of a controlled experiment (80% of new registrants get it), and of those, not everyone will use the "Add a link" frontend interface.