Service operations setup for Add a Link project
Closed, Resolved · Public

Assigned To

Authored By

	kostajh
	Jul 27 2020, 9:30 PM

Description

In T252822: [EPIC] Growth: "add a link" structured task 1.0, we (Growth-Team) are working on a project to guide new users in how to add links in Wikipedia articles.

Very high level summary (https://wikitech.wikimedia.org/wiki/Add_Link is the canonical source):

Research has a codebase which trains the AI model on production Stats machines
Research has a simple API that runs in a container via the Deployment Pipeline that accepts page title and wiki language and responds with wikitext containing link recommendations
GrowthExperiments extension will call the API via a maintenance script on cron, and cache output in a MySQL table
GrowthExperiments will generate an event that the Search team will consume to update the ElasticSearch index for a document, indicating whether the article has link recommendations

Miscellaneous

For our initial release we want to have a pool of several thousand articles that have link recommendations. That will mean processing perhaps tens of thousands of articles per wiki, as not every article will yield (good) link recommendations. (More details are in the project architecture document)
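The relationship between pool size and articles processed can be sketched as a loop that keeps requesting recommendations until enough articles have yielded some. This is only an illustration: `fill_pool`, the `Fetcher` type, the `links` key and the one-in-five hit rate of `fake_fetch` are all assumptions made for the example, not part of the actual maintenance script or API.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type: a fetcher takes (wiki, page_title) and returns the
# recommendation payload, or None when nothing usable was found.
Fetcher = Callable[[str, str], Optional[dict]]


def fill_pool(wiki: str, candidates: Iterable[str], fetch: Fetcher,
              target_size: int) -> dict:
    """Process candidate titles until target_size of them have recommendations."""
    pool = {}
    processed = 0
    for title in candidates:
        if len(pool) >= target_size:
            break
        processed += 1
        result = fetch(wiki, title)
        # "links" is an assumed key name for the recommendation list.
        if result and result.get("links"):
            pool[title] = result
    return {"pool": pool, "processed": processed}


def fake_fetch(wiki: str, title: str) -> Optional[dict]:
    # Stand-in for the real HTTP call: only every fifth article yields links.
    n = int(title.split("_")[1])
    return {"links": ["Example"]} if n % 5 == 0 else None


outcome = fill_pool(
    "cswiki",
    (f"Article_{i}" for i in range(1, 1000)),
    fake_fetch,
    target_size=10,
)
print(len(outcome["pool"]), outcome["processed"])
```

With the assumed hit rate, filling a pool of 10 requires processing 50 candidates; the real ratio depends on the model and the wiki.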

Details

Subject	Repo	Branch	Lines +/-
blubber: Add statsd implementation	research/mwaddlink	main	+1 -1
linkrecommendation: Enable monitoring	operations/deployment-charts	master	+1 -1
Remove linkrecommendation-external	operations/dns	master	+2 -4
Introduce linkrecommendation{,-external}	operations/dns	master	+9 -3
Add query logging	research/mwaddlink	main	+49 -11


Related Objects

Status	Assigned	Task
Resolved	MMiller_WMF	T252822 [EPIC] Growth: "add a link" structured task 1.0
Resolved	kostajh	T266437 Add a link engineering: backend product specifications
Resolved	kostajh	T261396 Add a link: engineering tasks for initial release
Resolved	kostajh	T258978 Service operations setup for Add a Link project
Resolved	kostajh	T265345 Calculate estimated requests per second to mwaddlink-query

Event Timeline

kostajh created this task.Jul 27 2020, 9:30 PM

Restricted Application added a subscriber: Aklapper. Jul 27 2020, 9:30 PM

Adding some tags for visibility. I'm going to be away from a computer for the next two weeks but @Tgr and @Catrope will be around. We're still at the point of gathering data about what our options are so there is not a hurry on this at the moment.

(I've tagged #product-infrastructure-team-backlog so you all are aware of this project but feel free to untag yourselves if you prefer.)

kostajh updated the task description. Jul 28 2020, 10:33 AM

My first note here is that we are actively discouraging shelling out from MediaWiki in production for a series of reasons, ranging from security implications to issues with running in a containerized environment (see T252745).

I think the ideal model for this is having a very simple service that basically accepts POST requests with the same data it would take from the command-line, and returns a calculated response.

This service should /not/ do any caching, which should instead be managed by MediaWiki itself - making this a true lambda service.

Also: what's the relationship between this service and recommendation-api we're already running?

How would this service load/update its ML models?

I think the ideal model for this is having a very simple service that basically accepts POST requests with the same data it would take from the command-line, and returns a calculated response.
This service should /not/ do any caching, which should instead be managed by MediaWiki itself - making this a true lambda service.

That sounds fine. I think we'd want to do the caching in MediaWiki anyway, because we will know to invalidate based on page edits.
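A minimal sketch of edit-based invalidation, assuming the cache row stores the revision ID the recommendations were computed for. The table name `link_cache`, its columns, and the use of an in-memory SQLite database in place of the MySQL table are all assumptions for illustration; this is not the GrowthExperiments schema.

```python
import json
import sqlite3

# In-memory SQLite stands in for the MySQL cache table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE link_cache ("
    " wiki TEXT, page_title TEXT, rev_id INTEGER, payload TEXT,"
    " PRIMARY KEY (wiki, page_title))"
)


def store(wiki, title, rev_id, recommendations):
    """Save recommendations together with the revision they were computed for."""
    # INSERT OR REPLACE is SQLite syntax; MySQL would use REPLACE INTO or
    # INSERT ... ON DUPLICATE KEY UPDATE.
    conn.execute(
        "INSERT OR REPLACE INTO link_cache VALUES (?, ?, ?, ?)",
        (wiki, title, rev_id, json.dumps(recommendations)),
    )


def lookup(wiki, title, current_rev_id):
    """Return cached recommendations, or None if absent or stale."""
    row = conn.execute(
        "SELECT rev_id, payload FROM link_cache"
        " WHERE wiki = ? AND page_title = ?",
        (wiki, title),
    ).fetchone()
    # A differing revision ID means the page was edited after caching.
    if row is None or row[0] != current_rev_id:
        return None
    return json.loads(row[1])


store("enwiki", "Prague", 100, ["Vltava"])
print(lookup("enwiki", "Prague", 100))  # cached entry is still valid
print(lookup("enwiki", "Prague", 101))  # page was edited, entry is stale
```

Keeping the revision check on the MediaWiki side is what lets the service itself stay stateless.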

Also: what's the relationship between this service and recommendation-api we're already running?
How would this service load/update its ML models?

@DED / @MGerlach could you please respond to these two questions?

Also, @DED and @MGerlach could you please provide some information about CPU and memory usage for the script?

kostajh updated the task description. Jul 28 2020, 2:00 PM

herron triaged this task as Medium priority.Jul 28 2020, 5:03 PM

LGoto moved this task from Needs triage to Tracking on the Product-Infrastructure-Team-Backlog-Deprecated board.Jul 29 2020, 3:39 PM

kostajh added a parent task: T252822: [EPIC] Growth: "add a link" structured task 1.0.Aug 13 2020, 3:49 PM

In T258978#6340838, @Joe wrote:

This service should /not/ do any caching, which should instead be managed by MediaWiki itself - making this a true lambda service.

@Joe is there more information from SRE (a template / checklist maybe?) about what is required to get a microservice running in our kubernetes environment?

How would this service load/update its ML models?

I'm not sure. Conceivably we would need some place to store and retrieve the data, as I don't think we would be storing the data used by the https://github.com/dedcode/mwaddlink tool in whatever Git repository we use for deployment, and I'm not sure that building the models makes sense to do in the service that provides the link recommendations. Do we have any prior art to reference, or do you have any recommendations?

kostajh updated the task description. Aug 17 2020, 12:32 PM

ArielGlenn subscribed.Aug 17 2020, 8:16 PM

jijiki moved this task from Incoming 🐫 to 🔦Unused2 on the serviceops board.Aug 17 2020, 11:45 PM

jijiki moved this task from 🔦Unused2 to Incoming 🐫 on the serviceops board.Aug 18 2020, 10:32 AM

kostajh mentioned this in T260330: RFC: PHP microservice for containerized shell execution.Aug 19 2020, 12:34 PM

I have a few questions for you, before giving a refined recommendation:

do you think you'll need to develop this software often, or will it only be modified sporadically?
what are the Python packages the script depends on? I didn't see a setup.py in the current repository. I see the notebook depends on quite a few external packages like numpy
What is the size of the trained model, and how often would that change?

depending on the answers, I think the suggestions on the solution to adopt will vary a bit.

kostajh mentioned this in T261401: Add a link: Set up link recommendation testing environment with Cloud VPS and existing notebook code.Aug 27 2020, 12:22 PM

kostajh mentioned this in T261403: Move dedcode/mwaddlink from github to gerrit.Aug 27 2020, 12:25 PM

In T258978#6408429, @Joe wrote:

I have a few questions for you, before giving a refined recommendation:

do you think you'll need to develop this software often, or will it only be modified sporadically?

what are the Python packages the script depends on? I didn't see a setup.py in the current repository. I see the notebook depends on quite a few external packages like numpy

What is the size of the trained model, and how often would that change?

depending on the answers, I think the suggestions on the solution to adopt will vary a bit.

@MGerlach / @DED could you please comment on this when you have time? Thanks!

kostajh mentioned this in T261407: Add a link engineering: Create event for event gate to update search index after obtaining link recommendations.Aug 27 2020, 7:22 PM

@kostajh @Joe some current estimates (@DED please correct/add):

once we have a fully running version, I do not think that we will necessarily need to change the software. I believe it would be necessary to re-train the model in order to get more up-to-date predictions (this can be done with fixed code).
I added a requirements-file which contains (most of) the packages we use to train the model https://github.com/dedcode/mwaddlink/blob/master/requirements.txt
- note that some of the packages are necessary only for training the model or generating the feature-datasets (also: some of the underlying code uses spark on the analytics-cluster and is not captured as part of the pip-packages)
- if we only consider the packages that are used to query the model to make predictions: wikitextparser, mwparserfromhell, nltk (all for parsing wikitext), numpy, scipy, python-Levenshtein (all for calculating features from text), xgboost (making a prediction from the trained model)
the size varies across languages, at the moment we are using enwiki with ~6M articles to estimate the worst-case scenario since all other languages are smaller. we are talking of ~10GB in disk-space to save relevant features of candidate articles. some of this will need to be in memory, though we are trying to do as much as possible in memory-mapped mode.
- for each language we will have a separate model, where its size will scale approximately with the number of articles (a wiki with ~500k articles would then require ~1GB of disk space).
- The model changes once we re-run to update predictions. what is a good choice here is not clear to me. Probably not every day, but not less frequent than yearly. I am not sure how fast the predictions will be out-dated once we trained a model. It will also depend on if and how we want to include user feedback.
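The size estimates above follow from assuming disk usage grows roughly linearly with article count. A small sketch of that arithmetic, using the enwiki figures quoted above as the reference point (the function name and the strictly linear model are assumptions; real sizes will deviate):

```python
# Reference point taken from the estimate above: enwiki, ~6M articles, ~10 GB.
REFERENCE_ARTICLES = 6_000_000
REFERENCE_SIZE_GB = 10.0


def estimated_model_size_gb(article_count: int) -> float:
    """Scale the enwiki estimate linearly by article count (an approximation)."""
    return REFERENCE_SIZE_GB * article_count / REFERENCE_ARTICLES


print(round(estimated_model_size_gb(500_000), 2))  # 0.83, i.e. roughly 1 GB
```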

Meeting 14/09/2020

Attendees:

Kosta (Growth)
Giuseppe (SRE)
Martin (Research)

Summary:

We will want to train the model for add link in a production environment. The Stats machines are considered a production environment, in this context.
The existing project should be split into two Gerrit repositories (already moving in this direction):
- One repository for building the model per wiki. Ideally this is as scripted as possible so that it's easy to rebuild the model. This will run on the Stats machines. For now assume that it is going to output a large file for each wiki (10 GB let's say).
- Another repository contains the code for an HTTP API which accepts a page title and wiki language, and will return link suggestions, using the pretrained model. (There is already work in progress on this here https://github.com/martingerlach/mwaddlink-api)
For the API code, it's important to think about how much RAM is used in processing each request. Martin believes the RAM usage can be fairly low if we use the file system for look ups rather than attempting to load e.g. a 10 GB vector file into RAM. It means the process is slower but since we are not using this tool on demand, that is OK.
For the API code, we should assume that there will be one instance of this running in production. And therefore this codebase needs to be able to handle routing requests and loading configuration based on language wikis (e.g. there will not be one instance for enwiki, one instance for cswiki, etc)
The API code will be run in a docker container that mounts a directory containing the pre trained models and datasets. The repository for the API code will not have the data files directly in it, for ease of deployment.
We will use the Deployment Pipeline for the API code, so we will talk to RelEng about getting some help with that.
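To illustrate the single-instance, file-system-lookup idea from the summary, here is a hedged sketch in which one process serves several wikis by lazily opening a per-wiki on-disk database from a mounted directory. The class name `ModelStore`, the `<wiki>.sqlite` file layout and the `anchors` table are hypothetical; the actual datasets and their format may look quite different.

```python
import sqlite3
import tempfile
from pathlib import Path


class ModelStore:
    """Lazily opens one on-disk lookup database per wiki from a mounted directory."""

    def __init__(self, data_dir):
        self.data_dir = Path(data_dir)
        self._connections = {}

    def _connection(self, wiki):
        if wiki not in self._connections:
            path = self.data_dir / f"{wiki}.sqlite"  # hypothetical file layout
            if not path.exists():
                raise KeyError(f"no dataset for {wiki}")
            self._connections[wiki] = sqlite3.connect(str(path))
        return self._connections[wiki]

    def lookup(self, wiki, anchor_text):
        """Query the disk-backed table instead of holding the dataset in RAM."""
        row = self._connection(wiki).execute(
            "SELECT target FROM anchors WHERE anchor = ?", (anchor_text,)
        ).fetchone()
        return row[0] if row else None


# Demo: create a tiny dataset for one wiki in a temporary "mounted" directory.
data_dir = tempfile.mkdtemp()
seed = sqlite3.connect(str(Path(data_dir) / "cswiki.sqlite"))
seed.execute("CREATE TABLE anchors (anchor TEXT PRIMARY KEY, target TEXT)")
seed.execute("INSERT INTO anchors VALUES ('praha', 'Praha')")
seed.commit()
seed.close()

models = ModelStore(data_dir)
print(models.lookup("cswiki", "praha"))
```

Each lookup costs a disk read, which is slower than an in-memory structure but keeps per-request RAM low, matching the trade-off described above.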

kostajh updated the task description. Sep 15 2020, 11:02 AM

MGerlach mentioned this in T258274: Code and data onboarding for link recommendation project.Sep 18 2020, 7:30 AM

Adding some notes after yesterday's meeting:

the current script is using sqlitedict right now, and the idea is good. In production, though, we'll need to connect to a remote MySQL database. I suggest trying to convert your script to use sqldict instead. That can then be configured to use sqlite (in dev/ possibly on stat1007) or mysql (in production)
Open question: how to get the data from being computed on a stat machine to a production MySQL. We might need guidance from the analytics folks on that.
Logging: log in json format to stdout
The application should answer the following urls: /healthz to report its health status, and / should probably be a banner page using the OpenAPI spec. We also use our own extension to the spec to allow developers to define functional tests that should work in production using the x-amples stanzas. See for instance https://github.com/wikimedia/mobileapps/blob/ec89750b4df3713d471eaaaf6be589fdc2f4de8f/spec/base.yaml#L45
metrics, if any besides latency and requests rate are needed (see below), should be exposed in prometheus format on the /metrics endpoint. Latency and requests rate can be extracted from the envoy sidecar telemetry and are not needed in the application itself.
SLI/SLOs for this service can probably be relatively lax, given we're only planning on calling it asynchronously
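The endpoint requirements in these notes could look roughly like the following WSGI sketch, using only the Python standard library. The metric name `linkrec_recommendations_served_total`, the health payload and the `call` helper are assumptions for illustration; the real service may use a web framework and a Prometheus client library instead, and the OpenAPI banner page at / is omitted here.

```python
import json

# Hypothetical application-level counter; latency and request rate are left
# to the envoy sidecar, as the notes suggest.
recommendations_served = 0


def app(environ, start_response):
    """Tiny WSGI app answering /healthz and /metrics."""
    path = environ.get("PATH_INFO", "/")
    if path == "/healthz":
        body = json.dumps({"status": "ok"}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
    elif path == "/metrics":
        # Prometheus text exposition format: HELP and TYPE lines, then samples.
        body = (
            "# HELP linkrec_recommendations_served_total Recommendations returned.\n"
            "# TYPE linkrec_recommendations_served_total counter\n"
            f"linkrec_recommendations_served_total {recommendations_served}\n"
        ).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain; version=0.0.4")])
    else:
        body = b"not found\n"
        start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [body]


def call(path):
    """Invoke the WSGI app in-process, returning (status, body text)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response)
    return captured["status"], b"".join(chunks).decode("utf-8")


print(call("/healthz"))
```

For local experiments the same `app` callable can be served with `wsgiref.simple_server.make_server` from the standard library.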

A problem that we will need to find a solution for is running the model on a stat* server, then updating a production database from there. I don't know of predefined ways to do it, but we can ask the analytics team for their opinion on this.

Somewhat unrelatedly - this seems the perfect fit of an execution model like kubeless (https://kubeless.io), where you have kubernetes execute a container as a reaction to an event being emitted to a kafka topic. It would save us from the need of most of the stuff like health endpoints, setting up a load balancer, adding specialized monitoring, etc. I would've strongly advocated to try to build it first and deploy this script on that system, but I think that timeline-wise that doesn't work out, which is a pity.

kostajh added a project: Growth-Team (Sprint 0 (Growth Team)).Oct 13 2020, 11:24 AM

A problem that we will need to find a solution for is running the model on a stat* server, then updating a production database from there. I don't know of predefined ways to do it, but we can ask the analytics team for their opinion on this.

@Joe do you know who I should ping about this, or would you or someone else from SRE want to initiate the discussion about this?

kostajh mentioned this in T265605: Add Link engineering: Consolidate dedcode/addlink and mgerlach/mwaddlink-query into single repository.Oct 15 2020, 1:39 PM

kostajh added a project: Add-Link.Oct 15 2020, 1:47 PM

kostajh mentioned this in T265610: Add Link engineering: Convert mwaddlink to read/write to MySQL instead of SQLite.Oct 15 2020, 1:51 PM

In T258978#6532612, @Joe wrote:

Logging: log in json format to stdout

Added json-logging to the script that queries the model (code in github: https://github.com/dedcode/mwaddlink/blob/master/addlink-query_links.py)
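For readers unfamiliar with the requirement, JSON logging to stdout can be done with the standard `logging` module alone. The sketch below is not the code from the linked script; the field names (`time`, `level`, `logger`, `message`) and the logger name are assumptions chosen for the example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name="linkrecommendation"):
    """Return a logger that writes JSON lines to stdout (name is hypothetical)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


log = build_logger()
log.info("query processed for %s/%s", "cswiki", "Praha")
```

Writing one JSON object per line to stdout lets the container runtime collect and parse the logs without the application managing log files.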

kostajh moved this task from Incoming to In Progress on the Growth-Team (Sprint 0 (Growth Team)) board.Nov 4 2020, 1:33 PM

kostajh mentioned this in T267214: Add a link engineering: Database for link recommendation service.Nov 4 2020, 1:45 PM

MMiller_WMF assigned this task to kostajh.Nov 23 2020, 6:09 PM

Change 643335 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Add query logging

https://gerrit.wikimedia.org/r/643335

gerritbot added a project: Patch-For-Review.Nov 25 2020, 10:46 AM

Change 643335 merged by jenkins-bot:
[research/mwaddlink@main] Add query logging

https://gerrit.wikimedia.org/r/643335

Maintenance_bot removed a project: Patch-For-Review.Nov 25 2020, 8:10 PM

jijiki added a subscriber: MoritzMuehlenhoff.Dec 2 2020, 10:16 PM

@akosiaris picking up the thread on this from before the holiday break; IIRC there was some network setup that had to be done for the production instance (staging was already done). Is there anything else that needs to happen from SRE's side before we can begin calling this service in production with our maintenance script?

In T258978#6729580, @kostajh wrote:

@akosiaris picking up the thread on this from before the holiday break; IIRC there was some network setup that had to be done for the production instance (staging was already done). Is there anything else that needs to happen from SRE's side before we can begin calling this service in production with our maintenance script?

The networking setup indeed needs to be done. We'll use this task to track that work, but this isn't blocking you from using a maintenance script, as from what I gathered it won't be calling the service, but rather populating the database directly.

Change 656430 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Introduce linkrecommendation{,-external}

https://gerrit.wikimedia.org/r/656430

gerritbot added a project: Patch-For-Review.Jan 15 2021, 2:46 PM

kostajh closed subtask T265345: Calculate estimated requests per second to mwaddlink-query as Resolved.Jan 18 2021, 4:01 PM

In T258978#6751367, @akosiaris wrote:

In T258978#6729580, @kostajh wrote:

@akosiaris picking up the thread on this from before the holiday break; IIRC there was some network setup that had to be done for the production instance (staging was already done). Is there anything else that needs to happen from SRE's side before we can begin calling this service in production with our maintenance script?

The networking setup indeed needs to be done. We'll use this task to track that work, but this isn't blocking you from using a maintenance script, as from what I gathered it won't be calling the service, but rather populating the database directly.

The maintenance script will indeed populate the addlink database (item 2 from T266826) but the main thing it is doing is calling the link recommendation service over the network.

kostajh updated the task description. Jan 18 2021, 4:27 PM

Change 656430 merged by Alexandros Kosiaris:
[operations/dns@master] Introduce linkrecommendation{,-external}

https://gerrit.wikimedia.org/r/656430

Maintenance_bot removed a project: Patch-For-Review.Jan 20 2021, 5:11 PM

Change 658303 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Remove linkrecommendation-external

https://gerrit.wikimedia.org/r/658303

gerritbot added a project: Patch-For-Review.Jan 25 2021, 1:34 PM

Change 658303 merged by Alexandros Kosiaris:
[operations/dns@master] Remove linkrecommendation-external

https://gerrit.wikimedia.org/r/658303

Maintenance_bot removed a project: Patch-For-Review.Jan 25 2021, 3:10 PM

Change 658635 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] blubber: Add statsd implementation

https://gerrit.wikimedia.org/r/658635

gerritbot added a project: Patch-For-Review.Jan 26 2021, 4:03 PM

Change 658636 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] linkrecommendation: Enable monitoring

https://gerrit.wikimedia.org/r/658636

Change 658636 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Enable monitoring

https://gerrit.wikimedia.org/r/658636

Change 658635 merged by jenkins-bot:
[research/mwaddlink@main] blubber: Add statsd implementation

https://gerrit.wikimedia.org/r/658635