Service operations setup for Add a Link project
Closed, Resolved (Public)

Description

In T252822: [EPIC] Growth: "add a link" structured task 1.0, we (Growth-Team) are working on a project to guide new users in how to add links in Wikipedia articles.

Very high level summary (https://wikitech.wikimedia.org/wiki/Add_Link is the canonical source):

  • Research has a codebase which trains the AI model on production Stats machines
  • Research has a simple API, run in a container via the Deployment Pipeline, that accepts a page title and wiki language and responds with wikitext containing link recommendations
  • The GrowthExperiments extension will call the API via a maintenance script on cron, and cache the output in a MySQL table
  • GrowthExperiments will generate an event which the Search team will consume; they will update the ElasticSearch index for a document to indicate whether the article has link recommendations

Miscellaneous

  • For our initial release we want to have a pool of several thousand articles that have link recommendations. That will mean processing perhaps tens of thousands of articles per wiki, as not every article will yield (good) link recommendations. (More details are in the project architecture document)

Event Timeline

kostajh added subscribers: Catrope, Tgr.

Adding some tags for visibility. I'm going to be away from a computer for the next two weeks but @Tgr and @Catrope will be around. We're still at the point of gathering data about what our options are so there is not a hurry on this at the moment.

(I've tagged #product-infrastructure-team-backlog so you all are aware of this project but feel free to untag yourselves if you prefer.)

My first note here is that we are actively discouraging shelling out from MediaWiki in production for a series of reasons, ranging from security implications to issues with running in a containerized environment (see T252745).

I think the ideal model for this is having a very simple service that basically accepts post requests with the same data it would take from the command-line, and returns a calculated response.

This service should /not/ do any caching, which should instead be managed by MediaWiki itself - making this a true lambda service.
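A minimal sketch of the stateless request/response shape described above. The JSON field names `page_title` and `wiki_id`, the `handle_post` function, and the `recommend_links` placeholder are all hypothetical; the real tool's inputs and model call are not specified in this thread.

```python
import json


def recommend_links(page_title, wiki_id):
    """Hypothetical placeholder for the model query; returns no suggestions."""
    return []


def handle_post(body):
    """Handle one POST body and return (status_code, response_body).

    Nothing is cached or stored between calls: the response depends only on
    the request, which is what makes the service behave like a lambda.
    """
    try:
        payload = json.loads(body)
        page_title = payload["page_title"]
        wiki_id = payload["wiki_id"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a non-object payload, or a missing field.
        return 400, json.dumps({"error": "expected JSON with page_title and wiki_id"})

    links = recommend_links(page_title, wiki_id)
    return 200, json.dumps(
        {"page_title": page_title, "wiki_id": wiki_id, "links": links}
    )
```

Keeping the handler a pure function of its input leaves caching and invalidation entirely to the caller.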

Also: what's the relationship between this service and recommendation-api we're already running?

How would this service load/update its ML models?

I think the ideal model for this is having a very simple service that basically accepts post requests with the same data it would take from the command-line, and returns a calculated response.
This service should /not/ do any caching, which should instead be managed by MediaWiki itself - making this a true lambda service.

That sounds fine. I think we'd want to do the caching in MediaWiki anyway, because we will know to invalidate based on page edits.
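To illustrate why the caller is the natural place for the cache: if each cached entry records the revision it was computed for, any edit makes the entry stale automatically. This is only a sketch in Python for consistency with the service code; the actual cache would live in MediaWiki (a MySQL table, per the description), and the class and key names here are hypothetical.

```python
class RecommendationCache:
    """In-memory stand-in for a cache keyed by page and validated by revision."""

    def __init__(self):
        # (wiki_id, page_id) -> (revision_id, recommendations)
        self._entries = {}

    def put(self, wiki_id, page_id, revision_id, recommendations):
        self._entries[(wiki_id, page_id)] = (revision_id, recommendations)

    def get(self, wiki_id, page_id, current_revision_id):
        """Return cached recommendations, or None if missing or stale."""
        key = (wiki_id, page_id)
        entry = self._entries.get(key)
        if entry is None:
            return None
        revision_id, recommendations = entry
        if revision_id != current_revision_id:
            # The page was edited since the entry was computed: drop it.
            del self._entries[key]
            return None
        return recommendations
```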

Also: what's the relationship between this service and recommendation-api we're already running?
How would this service load/update its ML models?

@DED / @MGerlach could you please respond to these two questions?

Also, @DED and @MGerlach could you please provide some information about CPU and memory usage for the script?

herron triaged this task as Medium priority. Jul 28 2020, 5:03 PM

This service should /not/ do any caching, which should instead be managed by MediaWiki itself - making this a true lambda service.

@Joe is there more information from SRE (a template / checklist maybe?) about what is required to get a microservice running in our kubernetes environment?

How would this service load/update its ML models?

I'm not sure. Conceivably we would need some place to store and retrieve the data, as I don't think we would be storing the data used by the https://github.com/dedcode/mwaddlink tool in whatever Git repository we use for deployment, and I'm not sure that building the models makes sense to do in the service that provides the link recommendations. Do we have any prior art to reference, or do you have any recommendations?

I have a few questions for you, before giving a refined recommendation:

  • Do you think you'll need to develop this software often, or will it only be modified sporadically?
  • What are the Python packages the script depends on? I didn't see a setup.py in the current repository, but I see the notebook depends on quite a few external packages like numpy.
  • What is the size of the trained model, and how often would that change?

depending on the answers, I think the suggestions on the solution to adopt will vary a bit.

@MGerlach / @DED could you please comment on this when you have time? Thanks!

@kostajh @Joe some current estimates (@DED please correct/add):

  • Once we have a fully running version, I do not think that we necessarily need to change the software. I believe it would be necessary to re-train the model in order to get more up-to-date predictions (this can be done with fixed code).
  • I added a requirements-file which contains (most of) the packages we use to train the model https://github.com/dedcode/mwaddlink/blob/master/requirements.txt
    • note that some of the packages are necessary only for training the model or generating the feature-datasets (also: some of the underlying code uses spark on the analytics-cluster and is not captured as part of the pip-packages)
    • if we only consider the packages that are used to query the model to make predictions: wikitextparser, mwparserfromhell, nltk (all for parsing wikitext), numpy, scipy, python-Levenshtein (all for calculating features from text), xgboost (making a prediction from the trained model)
  • The size varies across languages. At the moment we are using enwiki, with ~6M articles, to estimate the worst-case scenario, since all other languages are smaller. We are talking about ~10GB of disk space to save the relevant features of candidate articles. Some of this will need to be in memory, though we are trying to do as much as possible in memory-mapped mode.
    • For each language we will have a separate model, whose size will scale approximately with the number of articles (a wiki with ~500k articles would then require ~1GB of disk space).
    • The model changes whenever we re-run training to update predictions. What a good retraining interval would be is not clear to me: probably not every day, but not less frequently than yearly. I am not sure how quickly the predictions will become outdated once we have trained a model. It will also depend on if and how we want to include user feedback.
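As a rough check of the sizing figures above, a linear extrapolation from the enwiki estimate (~10GB for ~6M articles) gives a little under 1GB for a 500k-article wiki, consistent with the ~1GB quoted. Strict proportionality is an assumption made here for illustration; the estimate above only says the size scales approximately with article count.

```python
# Figures from the enwiki worst-case estimate quoted in this thread.
ENWIKI_ARTICLES = 6_000_000
ENWIKI_DISK_GB = 10.0


def estimate_disk_gb(article_count):
    """Estimate dataset size in GB, assuming size is proportional to article count."""
    return ENWIKI_DISK_GB * article_count / ENWIKI_ARTICLES


if __name__ == "__main__":
    print(round(estimate_disk_gb(500_000), 2))  # about 0.83 GB
```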

Meeting 14/09/2020

Attendees:

  • Kosta (Growth)
  • Giuseppe (SRE)
  • Martin (Research)

Summary:

  • We will want to train the model for add link in a production environment. The Stats machines are considered a production environment, in this context.
  • The existing project should be split into two Gerrit repositories (already moving in this direction):
    • One repository for building the model per wiki. Ideally this is as scripted as possible so that it's easy to rebuild the model. This will run on the Stats machines. For now assume that it is going to output a large file for each wiki (10 GB let's say).
    • Another repository contains the code for an HTTP API which accepts a page title and wiki language, and will return link suggestions, using the pretrained model. (There is already work in progress on this here https://github.com/martingerlach/mwaddlink-api)
  • For the API code, it's important to think about how much RAM is used in processing each request. Martin believes the RAM usage can be fairly low if we use the file system for lookups rather than attempting to load e.g. a 10 GB vector file into RAM. That makes the process slower, but since we are not using this tool on demand, that is OK.
  • For the API code, we should assume that there will be one instance of this running in production. Therefore this codebase needs to be able to handle routing requests and loading configuration per wiki language (e.g. there will not be one instance for enwiki, one instance for cswiki, etc.)
  • The API code will be run in a docker container that mounts a directory containing the pretrained models and datasets. The repository for the API code will not have the data files directly in it, for ease of deployment.
  • We will use the Deployment Pipeline for the API code, so we will talk to RelEng about getting some help with that.
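The combination of "one instance for all wikis" and "file system lookups instead of loading everything into RAM" could look roughly like the sketch below. It assumes one SQLite file per wiki in the mounted data directory, named `<wiki_id>.sqlite`, with a hypothetical `anchors(anchor, target)` table; none of these names come from the mwaddlink code.

```python
import re
import sqlite3
from pathlib import Path

# Hypothetical constraint on wiki ids; it also blocks path traversal via the id.
WIKI_ID_PATTERN = re.compile(r"^[a-z_]+$")


class WikiDatasets:
    """Serve lookups for many wikis from one process, reading from disk on demand."""

    def __init__(self, data_dir):
        self._data_dir = Path(data_dir)
        self._connections = {}  # wiki_id -> open sqlite3 connection

    def _connection(self, wiki_id):
        if not WIKI_ID_PATTERN.match(wiki_id):
            raise KeyError(f"invalid wiki id: {wiki_id!r}")
        if wiki_id not in self._connections:
            path = self._data_dir / f"{wiki_id}.sqlite"
            if not path.is_file():
                raise KeyError(f"no dataset for wiki: {wiki_id!r}")
            # Opening the file does not load it into RAM; rows are read per query.
            self._connections[wiki_id] = sqlite3.connect(str(path))
        return self._connections[wiki_id]

    def lookup(self, wiki_id, anchor_text):
        """Return the link target stored for anchor_text in this wiki, or None."""
        row = self._connection(wiki_id).execute(
            "SELECT target FROM anchors WHERE anchor = ?", (anchor_text,)
        ).fetchone()
        return row[0] if row else None
```

Connections are opened lazily, so a wiki that is never requested costs nothing beyond its file on disk.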

Adding some notes after yesterday's meeting:

  • the current script is using sqlitedict right now, and the idea is good. In production, though, we'll need to connect to a remote MySQL database. I suggest trying to convert your script to use sqldict instead. That can then be configured to use sqlite (in dev/ possibly on stat1007) or mysql (in production)
  • Open question: how to get the data from being computed on a stat machine into a production MySQL database. We might need guidance from the analytics folks on that.
  • Logging: log in json format to stdout
  • The application should answer the following urls: /healthz to report its health status, and / should probably be a banner page using the OpenAPI spec. We also use our own extension to the spec to allow developers to define functional tests that should work in production using the x-amples stanzas. See for instance https://github.com/wikimedia/mobileapps/blob/ec89750b4df3713d471eaaaf6be589fdc2f4de8f/spec/base.yaml#L45
  • Metrics, if any are needed besides latency and request rate (see below), should be exposed in Prometheus format on the /metrics endpoint. Latency and request rate can be extracted from the envoy sidecar telemetry and are not needed in the application itself.
  • SLI/SLOs for this service can probably be relatively lax, given we're only planning on calling it asynchronously
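Two of the notes above, JSON logging to stdout and a /healthz endpoint, can be sketched with the Python standard library alone. The log field names, the logger name, and the use of a bare WSGI callable are assumptions for illustration; the thread does not say which web framework the service uses, and the OpenAPI banner page on / is left out here.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    def format(self, record):
        return json.dumps(
            {
                "time": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def build_logger(name="linkrecommendation", stream=None):
    """Return a logger that writes JSON lines to stdout (or the given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def app(environ, start_response):
    """WSGI callable: /healthz reports health; every other path is a 404 in this sketch."""
    if environ.get("PATH_INFO") == "/healthz":
        status = "200 OK"
        content_type = "application/json"
        body = json.dumps({"status": "ok"}).encode("utf-8")
    else:
        status = "404 Not Found"
        content_type = "text/plain"
        body = b"not found"
    start_response(
        status,
        [("Content-Type", content_type), ("Content-Length", str(len(body)))],
    )
    return [body]
```

The `app` callable can be served locally with `wsgiref.simple_server.make_server` for a quick manual check.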

A problem that we will need to find a solution for is running the model on a stat* server, then updating a production database from there. I don't know of predefined ways to do it, but we can ask the analytics team for an opinion on this.

Somewhat unrelatedly - this seems the perfect fit of an execution model like kubeless (https://kubeless.io), where you have kubernetes execute a container as a reaction to an event being emitted to a kafka topic. It would save us from the need of most of the stuff like health endpoints, setting up a load balancer, adding specialized monitoring, etc. I would've strongly advocated to try to build it first and deploy this script on that system, but I think that timeline-wise that doesn't work out, which is a pity.

A problem that we will need to find a solution for is running the model on a stat* server, then updating a production database from there. I don't know of predefined ways to do it, but we can ask the analytics team for an opinion on this.

@Joe do you know who I should ping about this, or would you or someone else from SRE want to initiate the discussion about this?

  • Logging: log in json format to stdout

Added json-logging to the script that queries the model (code in github: https://github.com/dedcode/mwaddlink/blob/master/addlink-query_links.py)

Change 643335 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Add query logging

https://gerrit.wikimedia.org/r/643335

Change 643335 merged by jenkins-bot:
[research/mwaddlink@main] Add query logging

https://gerrit.wikimedia.org/r/643335

@akosiaris picking up the thread on this from before the holiday break; IIRC there was some network setup that had to be done for the production instance (staging was already done). Is there anything else that needs to happen from SRE's side before we can begin calling this service in production with our maintenance script?

The networking setup does indeed need to be done. We'll use this task to track that work, but this isn't blocking you from using a maintenance script, as from what I gathered it won't be calling the service, but rather populating the database directly.

Change 656430 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Introduce linkrecommendation{,-external}

https://gerrit.wikimedia.org/r/656430

The maintenance script will indeed populate the addlink database (item 2 from T266826) but the main thing it is doing is calling the link recommendation service over the network.

Change 656430 merged by Alexandros Kosiaris:
[operations/dns@master] Introduce linkrecommendation{,-external}

https://gerrit.wikimedia.org/r/656430

Change 658303 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Remove linkrecommendation-external

https://gerrit.wikimedia.org/r/658303

Change 658303 merged by Alexandros Kosiaris:
[operations/dns@master] Remove linkrecommendation-external

https://gerrit.wikimedia.org/r/658303

Change 658635 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] blubber: Add statsd implementation

https://gerrit.wikimedia.org/r/658635

Change 658636 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] linkrecommendation: Enable monitoring

https://gerrit.wikimedia.org/r/658636

Change 658636 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Enable monitoring

https://gerrit.wikimedia.org/r/658636

Change 658635 merged by jenkins-bot:
[research/mwaddlink@main] blubber: Add statsd implementation

https://gerrit.wikimedia.org/r/658635

kostajh moved this task from In Progress to QA on the Growth-Team (Sprint 0 (Growth Team)) board.

I think this is done; we can open new tasks as needed. Thank you for your help SRE! ❤