
Put API on Cloud VPS
Closed, ResolvedPublic

Description

Background
The team is preparing to work on machine-generated article descriptions. To determine whether the feature will be a success, we will create an MVP and check the quality of the algorithm. The algorithm needs to be converted into an API. The API will be hosted on Cloud VPS, which is the Research team's server.

Task
Turn the EPFL algorithm into an API and put it on Cloud VPS

Summary of outcome

Event Timeline

JTannerWMF created this task.
JTannerWMF added a subscriber: Isaac.

Hi @Isaac do you have any updates on the status of this task? Thank you!

@LGoto thanks for the ping. We're currently coordinating via email, so there has been progress, just not publicly visible. I'll do my best to post updates regularly as we make progress, but don't hesitate to ping me if details are needed. Current state:

  • For those with access, the Cloud VPS instance is at android-machine-generated-desc.recommendation-api.eqiad1.wikimedia.cloud
  • Marija's (EPFL) initial code is on GitHub, but we will be modifying it to expose a Flask API and run on Cloud VPS
  • Marija, Dmitry, and I will be meeting this upcoming Tuesday to go through the specific next steps. After that meeting, we will hopefully have a clearer timeline on when an MVP API will be up for testing / iteration / etc.

Summary of notes from Tuesday meeting:

  • We're aiming for end of October for a functioning API and interface for easily testing the model.
  • The draft interface is hosted on Toolforge and can be found here (though it obviously won't show any model results yet): https://ml-article-descriptions.toolforge.org/
  • The draft API code, with Flask as the API wrapper and the necessary MediaWiki API calls to gather input features for an article, is here (a rough sketch of the wrapper's shape follows this list): https://github.com/wikimedia/research-api-endpoint-template/blob/mach-gen-art-descriptions/model/wsgi.py
    • The next steps are incorporating the code for pre-processing those input features, pushing them through the model, and extracting predictions. This shouldn't be too much work, but we'll have to make sure that everything runs as expected in the Cloud VPS environment.
    • Marija will be doing this part and also working to package up the API via Docker for easy deployment. I will then update the UI as necessary to make it easier to interact with the API.
    • Once everything is working, we can better evaluate the latency of the model and whether further refinements are necessary.
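
For illustration only, here is a rough sketch of the wrapper shape described above -- a Flask endpoint that gathers input features via the MediaWiki API. This is not the actual wsgi.py; the endpoint path and helper names are placeholders:

```python
# Rough sketch of the Flask wrapper shape -- not the project's actual wsgi.py.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

MW_API = "https://{lang}.wikipedia.org/w/api.php"

def get_first_paragraph(lang, title):
    """Fetch the plain-text intro of an article via the MediaWiki API."""
    params = {
        "action": "query", "prop": "extracts", "titles": title,
        "exintro": 1, "explaintext": 1, "format": "json",
    }
    resp = requests.get(MW_API.format(lang=lang), params=params, timeout=10)
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

@app.route("/api/v1/description")  # placeholder route
def description():
    lang = request.args.get("lang", "en")
    title = request.args.get("title")
    if not title:
        return jsonify({"error": "missing title parameter"}), 400
    features = {"first_paragraph": get_first_paragraph(lang, title)}
    # The real service would preprocess these features, push them through
    # the model, and return predicted descriptions; this placeholder just
    # echoes the gathered features back.
    return jsonify({"lang": lang, "title": title, "features": features})
```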

Updates:

  • Creating a Slack channel to help coordinate the minor issues that arise in setting up the API
  • Updated the template to gather article paragraphs from up to 25 languages as model input -- I didn't realize the model used the wikitext from so many languages, and this substantially increases the API latency, so we'll have to find ways to address it as we get closer to deployment. It'll be fine for initial testing though. Some ideas below.

Things under consideration for decreasing API latency:

  • Move to asynchronous API requests for gathering input features: most notably getting the wikitext for each language version of the article, which can take several seconds for articles with many sitelinks (see the sketch after this list)
  • Consider dropping some features: we already did this with the knowledge graph embeddings, which were quite large and going to be difficult to maintain but did not add much lift to the model accuracy. It's possible that instead of, e.g., gathering the wikitext for all 25 languages for an article, we could cap it at 5, and that would likely be sufficient data for the model to do well. It would take some experimentation, though, to determine the right balance and the smartest way to sample the languages.
  • Pre-compute and store the results for most articles. Only make fresh predictions for new articles. Perhaps have a job to refresh the predicted descriptions on a regular cadence.
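
As a concrete illustration of the asynchronous idea above (not the project's actual code), the per-language wikitext requests could be fired concurrently with aiohttp instead of one at a time. For simplicity this sketch reuses one title across languages, whereas the real code would resolve per-language titles via sitelinks:

```python
# Sketch of concurrent feature-gathering with aiohttp -- an illustration of
# the asynchronous-requests idea, not the project's actual implementation.
import asyncio
import aiohttp

async def fetch_wikitext(session, lang, title):
    """Fetch the raw wikitext of one language version of the article."""
    url = f"https://{lang}.wikipedia.org/w/api.php"
    params = {
        "action": "parse", "page": title, "prop": "wikitext",
        "format": "json", "formatversion": 2,
    }
    async with session.get(url, params=params) as resp:
        data = await resp.json()
        return lang, data.get("parse", {}).get("wikitext", "")

async def gather_features(title, langs):
    """Fire all per-language requests at once instead of sequentially."""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_wikitext(session, lang, title) for lang in langs]
        return dict(await asyncio.gather(*tasks))

# e.g.: features = asyncio.run(gather_features("Douglas Adams", ["en", "de", "fr"]))
```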

Quick update:

  • We don't have the API up quite yet, as things have been busy on our collaborator's end. Hopefully this just pushes things back a week; we're still aiming to have it up by our meeting on 15 November.

Now that the model is accessible and the Cloud VPS instance seems to be stable, I think this task can be resolved. A new task should probably be opened for the bias / guardrails work described below.

Good iteration this week on improving the UI and API! Summary / next steps:

  • Got the Cloud VPS instance working. I was having issues with multiple uWSGI workers that I don't fully understand, but a single worker seems to be effective, so perhaps that problem can be ignored for now.
  • I generated some basic stats about model latencies, which clearly identified that the model processing is by far the largest bottleneck in response time: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Article%20Descriptions/API_testing.ipynb
    • @Dbrant updated the API to make input data requests concurrent, which has stabilized the time required to gather features.
    • There may be small additional latency improvements to be made in the API, but the general consensus is that we can hopefully handle slow latency through setting user expectations, pre-requesting results, and good tool design for the future Product interface.
  • EPFL will be looking into the relationship between model latency and accuracy as a function of the number of beams the model uses (essentially how many tries it makes to generate good article descriptions; see the sketch after this list). This won't affect the API itself but will be used to determine the best input parameters to provide it.
  • Start of discussions around bias / guardrails for the model.
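
To make the beams point concrete, this is roughly the knob being tuned, shown with the generic Hugging Face transformers generate() API; the checkpoint name is a placeholder, not the actual EPFL model:

```python
# Illustration of trading latency for accuracy via beam count -- the
# checkpoint below is a stand-in, not the EPFL model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")  # placeholder
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-cc25")

inputs = tokenizer("First paragraph of the article ...", return_tensors="pt")

# More beams = more candidate descriptions explored per input, which tends
# to improve quality but increases latency roughly linearly.
outputs = model.generate(
    **inputs,
    num_beams=3,             # the parameter under study for latency vs. accuracy
    num_return_sequences=3,  # surface all beam candidates for evaluation
    max_length=20,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```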

Maintenance / docs:

Isaac updated the task description.


Human
3 beams:

  • Ethnic group
  • Ethnic group of humanes
  • Ethnic group of humans

4 beams:

  • Ethnic group involving humans
  • Ethnic group
  • Ethnic group of humanes
  • Ethnic group of humans

Thanks for passing this along @Jack_who_built_the_house! I checked a number of other very high-level topics and didn't find the issue in Civilization or Primates, but did get "Class of plants" for Plants. This sort of error seems most likely with articles about very high-level concepts (which thankfully often already have article descriptions), but it would still be nice to fix. We might be able to address this sort of tautological output by adding a simple string-matching check to ensure that the output doesn't contain the title itself (sketch below). Before we implement anything, though, I'd want to think about what sort of issues this might cause with, e.g., very simple titles where text matching might introduce a bunch of false positives (and therefore not return results).
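
To make the idea concrete, a minimal version of such a check might look like the sketch below; the normalization details (case, punctuation, partial matches) are exactly the open questions above:

```python
# Minimal sketch of the post-hoc string-matching guard described above.
def filter_tautologies(title: str, predictions: list[str]) -> list[str]:
    """Drop predicted descriptions that merely restate the article title."""
    title_lower = title.lower()
    return [p for p in predictions if title_lower not in p.lower()]

# e.g.: filter_tautologies("Human", ["Ethnic group of humans", "Ethnic group"])
# -> ["Ethnic group"]  (note the false-positive risk for very simple titles)
```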

(Shouldn't this be a factor for machine learning? I mean, if matching the title produced a wrong description as a general rule, wouldn't the machine learning algorithm infer it from the training set?)

Interesting question. My thoughts: the model might learn to avoid using the title if the training included the article title as an explicit feature, but that's not the case here. The features are just the first paragraphs of the articles and existing article descriptions in other languages (code; description), which means the model has no idea what the title actually is for the article it's producing descriptions for. That sounds odd, but in practice it works quite well, though not perfectly, as this feedback indicates.

We could potentially re-train the model to include the title as another feature so that it might learn not to repeat it, but I'll be honest: I don't think the model would necessarily pick up on that with that change alone. Language models tend to be good at learning what works but do not necessarily "remember" what doesn't work. We could try to force this behavior by explicitly penalizing it during training. For that, we'd need a large dataset of descriptions that were reverted, perhaps; I'm assuming it would contain at least some where the title was re-used. But in the end, I'm still not sure whether it would be a clear enough pattern for the model to learn well.

Right now, a string-matching filter after the predictions are made may seem like a hacky patch, but my intuition is that it actually is the best and simplest way to avoid this behavior. As another example, we already have a patch in place to block any predictions that use a date that didn't show up in the input data -- i.e., trying to prevent the model from hallucinating birthdates for people (a sketch of the general shape below). This is something we could try to address more explicitly in the training process as well, but models just don't handle numbers well, so again this post-hoc patch is probably the best approach. There's an amazingly in-depth intro to tokenization and why this is the case (YouTube video) in case you're curious.
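
For illustration (this is not the deployed patch), the general shape of such a date guard, naively treating any four-digit year as a "date":

```python
# Sketch of the date-hallucination guard mentioned above -- not the actual
# deployed patch. "Date" is crudely approximated as any four-digit year.
import re

YEAR = re.compile(r"\b\d{4}\b")

def has_hallucinated_date(input_text: str, prediction: str) -> bool:
    """Flag a prediction containing a year that never appeared in the input."""
    input_years = set(YEAR.findall(input_text))
    return any(y not in input_years for y in YEAR.findall(prediction))

# e.g.: has_hallucinated_date("... born in 1952 ...", "Writer (1952-2001)")
# -> True, because 2001 never appeared in the input features.
```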

Interesting insight, thank you. But, from my dilettante's perspective: if we consider hallucinations a "hard problem" of AI that is unsolvable at this stage, this particular problem seems at least tackleable solely with ML-internal means and without hardcoding? Good luck anyway.

Thanks! This is unlikely to happen soon, but when we reach a stage where we are re-training the model, I'll see if we can experiment with nudging it away from these sorts of responses (agreed that it's ideal to solve this via model architecture / training rather than post-hoc filters, if possible). And please continue to share if you see other patterns in incorrect recommendations.