
Support pre-transformed inputs for Outlink topic model
Closed, Resolved · Public

Description

We'd like to extend the current Outlink topic model to accept pre-transformed data, skipping the MediaWiki API calls and performing prediction only. See the discussion in T287056#8148413

Event Timeline

Regarding the input parameters, there are two options:

  1. Add an optional parameter features_str for pre-transformed data: if a user provides features_str, a space-separated string of QIDs, the API skips the step of calling the MW API.

An example input:

{
    "lang": "en",
    "page_title": "Victoria_Jubilee_Government_High_School",
    "features_str": "Q15987189 Q6604679 Q57137630 Q902 Q875538 Q13059501 Q4813767 Q4937781 Q19894466 Q25586928 Q13057539 Q2269756 Q7045494 Q6402931 Q2061316 Q25586927 Q22664 Q9610 Q16345577 Q9439 Q4860797 Q14980579 Q6947799 Q5151846 Q7121813",
}

In theory we don't need the page_title and lang parameters, because they are only used to fetch outlinks. But in the output format (see below), we show users the page URL, which is composed from page_title and lang.
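To illustrate how the URL in the output is composed from these two parameters, here is a hypothetical helper (a sketch, not the service's actual code):

```python
def article_url(lang: str, page_title: str) -> str:
    # Compose the article URL shown in the prediction output
    # from the lang and page_title input parameters.
    return f"https://{lang}.wikipedia.org/wiki/{page_title}"

article_url("en", "Victoria_Jubilee_Government_High_School")
# → "https://en.wikipedia.org/wiki/Victoria_Jubilee_Government_High_School"
```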

Output:

{"prediction": {"article": "https://en.wikipedia.org/wiki/Victoria_Jubilee_Government_High_School", "results": [{"topic": "History_and_Society.Education", "score": 0.9928885698318481}, {"topic": "Geography.Regions.Asia.South_Asia", "score": 0.9679093360900879}, {"topic": "Geography.Regions.Asia.Asia*", "score": 0.9433575868606567}]}}
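For reference, a client could build this request body from a list of outlink QIDs. The helper below is a sketch on the client side, not part of the API itself:

```python
import json

def build_payload(lang, page_title, qids):
    # features_str is a single space-separated string of outlink QIDs.
    return {
        "lang": lang,
        "page_title": page_title,
        "features_str": " ".join(qids),
    }

body = json.dumps(build_payload(
    "en", "Victoria_Jubilee_Government_High_School",
    ["Q15987189", "Q6604679", "Q57137630"]))
```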
  2. Accept a single parameter features_str: a user provides only features_str (along with other optional parameters such as threshold). In this case, the output displays only the results, without the page URL.

Input:

{
    "features_str": "Q15987189 Q6604679 Q57137630 Q902 Q875538 Q13059501 Q4813767 Q4937781 Q19894466 Q25586928 Q13057539 Q2269756 Q7045494 Q6402931 Q2061316 Q25586927 Q22664 Q9610 Q16345577 Q9439 Q4860797 Q14980579 Q6947799 Q5151846 Q7121813"
}

The output would look like:

{"prediction": { "results": [{"topic": "History_and_Society.Education", "score": 0.9928885698318481}, {"topic": "Geography.Regions.Asia.South_Asia", "score": 0.9679093360900879}, {"topic": "Geography.Regions.Asia.Asia*", "score": 0.9433575868606567}]}}
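Since the second option mentions an optional threshold parameter, the assumed semantics (keep only topics whose score is at or above the threshold) can be sketched client-side like this; the helper and default value are illustrative, not the API's actual implementation:

```python
results = [
    {"topic": "History_and_Society.Education", "score": 0.9928885698318481},
    {"topic": "Geography.Regions.Asia.South_Asia", "score": 0.9679093360900879},
    {"topic": "Geography.Regions.Asia.Asia*", "score": 0.9433575868606567},
]

def filter_by_threshold(results, threshold=0.5):
    # Keep only topics whose score meets the threshold.
    return [r for r in results if r["score"] >= threshold]

filter_by_threshold(results, threshold=0.95)  # keeps the first two topics
```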

@Isaac For the first option, only a slight change needs to be done for the current API. The second option would require adding a bit more logic. Based on possible use cases, which one would you recommend?

Any comments or suggestions are welcome :)

> @Isaac For the first option, only a slight change needs to be done for the current API. The second option would require adding a bit more logic. Based on possible use cases, which one would you recommend?

Let's keep it simple then and go with the first option (expect lang/page_title). That data would already be associated with the feature strings on the cluster, so it doesn't cause any issues for me. So long as LiftWing isn't taking some sort of action based on that -- e.g., automatically posting the score to a stream -- we can always pass "incorrect" parameters, I assume -- e.g., "en" and "test article" -- without throwing an error or causing issues elsewhere? And having some sort of metadata associated with a prediction, even if it's not a real article, feels important for logging. Thanks!

Also, I wanted to make the potential alignment between this task and T304425 explicit -- with these changes in place, I'd be happy to support that testing.

Change 828475 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] outlink: add features_str param for pre-transformed data

https://gerrit.wikimedia.org/r/828475

Change 828475 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] outlink: add features_str param for pre-transformed data

https://gerrit.wikimedia.org/r/828475

Change 829015 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update outlink image to support pre-transformed data

https://gerrit.wikimedia.org/r/829015

Change 829015 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update outlink image to support pre-transformed data

https://gerrit.wikimedia.org/r/829015

@Isaac The change (the first option) has been deployed to production. :)

> So long as LiftWing isn't taking some sort of action based on that -- e.g., automatically posting the score to a stream -- we can always pass "incorrect" parameters, I assume -- e.g., "en" and "test article" -- without throwing an error or causing issues elsewhere?

Yes, we can do that! Lift Wing takes action to post the score to a stream only when the input is in a specific event format.

> Also, I wanted to make the potential alignment between this task and T304425 explicit -- with these changes in place, I'd be happy to support that testing.

From the load testing results in our ml-sandbox, it looks like skipping the MW API lowers the latency quite substantially! The transform step took 100-2000ms when we had to call the MW API; with features_str provided, it took only 10-40ms, and the service can reach 352.64 requests/sec under 10 connections. With these changes, testing the model prediction/batch prediction from Hadoop seems to be feasible. It would be great if you want to test it. Please let the ML team know when you test it so we can monitor the dashboard and logs and react if something goes wrong.

> With these changes, testing the model prediction/batch prediction from Hadoop seems to be feasible. It would be great if you want to test it. Please let the ML team know when you test it so we can monitor the dashboard and logs and react if something goes wrong.

Yay! I'll coordinate before launching any large jobs. Just to be clear about how I should do this: I'm largely just taking the code you provided for a single example from the stat boxes and implementing it via a UDF on the PySpark workers:

  • PySpark notebook that generates the features_str data for each article
  • PySpark UDF that does the following:
import json
import requests

def get_prediction(lang, page_title, features_str):
    inference_url = 'https://inference.discovery.wmnet:30443/v1/models/outlink-topic-model:predict'
    headers = {
        # Route the request to the outlink-topic-model service.
        'Host': 'outlink-topic-model.articletopic-outlink.wikimedia.org',
        # The body is JSON, so send the matching content type.
        'Content-Type': 'application/json',
    }
    data = {"lang": lang, "page_title": page_title, "features_str": features_str}
    response = requests.post(inference_url, headers=headers, data=json.dumps(data))
    return response.text

spark.udf.register('get_prediction', get_prediction, 'String')

Maybe we pick this up in a week or two when I'm back from research offsite and coordinate directly (meeting where I kick off slowly increasing loads of requests and you help monitor the load and tell me if I need to stop)? I discussed a bit with @fkaelin and he pointed out that using LiftWing in this way probably isn't a great long-term solution for, e.g., getting inferences for all Wikipedia articles, but this sort of testing will still be valuable for understanding how well LiftWing responds to large requests and when this sort of bulk inference is a good fit for our needs. Long term, it sounds like the hope is to have LiftWing more integrated with our cluster so we can essentially broadcast the LiftWing models to Hadoop workers and take advantage of them without having to send lots of data over the network!

> Maybe we pick this up in a week or two when I'm back from research offsite and coordinate directly (meeting where I kick off slowly increasing loads of requests and you help monitor the load and tell me if I need to stop)?

That sounds great! I'm also interested in learning more about the idea of having LiftWing more integrated with our cluster. I'll sync up with @fkaelin.

The code changes for this task are complete. Going to mark this as RESOLVED. I'll open another task when we're ready to test large requests/bulk inference from Hadoop.

achou removed a project: Patch-For-Review.
achou moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.