Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review
Open, Needs Triage · Public

Description

Hypothesis

If we make article topic inference data available via a service that meets agreed-upon scalability and availability requirements, plus any necessary data backfills, then we will have established the technical foundation necessary to support upcoming personalized reader experiences that depend on this data.

Scoping details

  • Problem: We want to access the topic scores for articles at scale, for each page view on the mobile apps, so that we can track this information. However, there currently isn't a scalable way to access this information from LiftWing.
  • [Optional] Possible solutions:
    • Add caching to LiftWing to reduce server load.
    • Establish a way forward to leverage Cirrusdoc or another search-based alternative as a key-value store more generally.
    • Make the data lake queryable via external API calls and retrieve output from there.
  • Enabled projects: Which specific user-facing features or experiments would be unblocked or meaningfully enabled (in terms of development ease, velocity, etc.) by solving this problem? Which teams are launching these features or experiments?
    • Year-in-review on iOS and Android
  • Urgency and importance: When are these features or experiments expected to launch? How essential is this infrastructure for unblocking development?
    • Year-in-review would go out to users in Q2; however, the later the data becomes available, the more data will need to be backfilled.
  • [Optional] Notes: Is there anything else you'd like to share?
    • Being able to access model outputs at scale would likely unlock additional use cases for LiftWing for the mobile apps and other teams in the long term.

Reporting format

Summary of progress:

Next steps:

Event Timeline

Just adding some quick thoughts of nice-to-haves:

  • Regarding "Being able to access model outputs at scale would likely unlock additional use cases for LiftWing for the mobile apps and other teams in the long term":
    • This would be a more sustainable fix for the article-country model, where we need country predictions for all of an article's links. We are currently using a static database hosted on LiftWing, but that means the data is constantly out of date, the image is bulkier, and the pipelines for "re-training" the model are more complicated: T385970.
    • Over the years, I've built little prototype user-scripts (details) that can query model outputs for all of the links in a given article in order to visualize them as you browse. For example, highlighting links based on whether they were biographies of men, women, or non-binary folks and displaying statistics about the distribution. This allows for easily visualizing gender bias in links. I also had one for showing article quality predictions for links so editors could see which articles to prioritize for improvement that are relevant to a given topic. I was just running offline bulk predictions and caching them in a database hosted on Cloud VPS but this would be a cool use-case to support officially.
  • I personally would love a solution based on Search because we already use their index for hosting predictions for many recommendation use-cases because it's highly accessible and they already handle the messiness of updating indexes to keep up with edits. In theory if they make the cirrusdoc (or similar) endpoint efficient for this sort of use-case, you also get some nice behavior for free such as the ability to use generator queries -- e.g., a single query to get topic predictions for links from the River Nile page: https://en.wikipedia.org/w/api.php?action=query&generator=links&titles=River_Nile&prop=cirrusdoc&format=json&formatversion=2&cdincludes=weighted_tags&gplnamespace=0&gpllimit=100&redirects. This is currently an expensive query for them and not meant to be used in production use-cases but the simplicity of it is quite beautiful. If the solution is a LiftWing cache, that's okay too but we might consider if there are ways to make it accessible in a similar way.
  • There's a new topic model under discussion (report) with meetings to hopefully kick off soon about next steps for bringing to production. I personally think it'd be great to have this new model for this use-case too as it would e.g., give reliable data on the gender of biographies that folks are reading. We also are in early discussions about what it would mean to incorporate a "time" element into the model so that e.g., you could see if folks were reading predominantly about more current or historical topics. The main blocker to this time topic is also a question of how to efficiently serve the data from the Search index so it's worth considering alongside this ask. There's no phab ticket yet but I've talked with the Search Platform about it.
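The cirrusdoc query in the comment above returns topic predictions encoded as `weighted_tags` strings. As a rough illustration, here is a hypothetical parser for such tags; the tag prefix and the `value|weight` format (with integer weights on a 0–1000 scale) are assumptions based on common CirrusSearch conventions, not something specified in this task.

```python
# Hypothetical sketch: turn CirrusSearch weighted_tags entries like
# "classification.prediction.articletopic/Culture.Media|950" into
# {topic: normalized_score}. The prefix and weight scale are assumptions.

def parse_weighted_tags(tags, prefix="classification.prediction.articletopic/"):
    """Return {topic: score in [0, 1]} for tags under the given prefix."""
    scores = {}
    for tag in tags:
        if not tag.startswith(prefix):
            continue  # ignore unrelated tags (links, recommendations, etc.)
        value = tag[len(prefix):]
        if "|" in value:
            topic, weight = value.rsplit("|", 1)
            scores[topic] = int(weight) / 1000.0  # assumed 0-1000 weight scale
        else:
            scores[value] = 1.0  # no explicit weight
    return scores
```

A parser like this would let a client feed the raw `cdincludes=weighted_tags` response from the generator query above straight into per-topic scores.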

@Seddon
I have a few clarifying questions that’ll help us (ML) understand whether the solutions we’re ideating will actually address the problem effectively.

  • Could you share more details about how the data will be handled on the Apps side? Is the request made from the application, and where are the responses saved: on the device, or somewhere else and then fetched to the device?
  • What exactly does “at scale” mean in this context? Do you have any rough estimates for the number of requests we might expect, as well as whether there are specific latency requirements?
  • Will there be duplicate requests (e.g. multiple users that have been viewing the same articles)? And is there a way to ensure uniqueness in the requests sent to the API?

We met on Friday to discuss Ilias' questions above and talk through some potential solutions. See meeting notes and meeting recording for more.

We've identified that there are two subproblems to address:

  1. Backfilling all of the topic data that is needed for January 2025 through (date of implementation).
  2. Enabling a steady-state (of potentially hundreds of millions of daily requests) to collect additional topic data needed through the end of 2025.

We've also outlined two potential options that should be investigated and scoped as a first step in addressing this ticket.

  • LiftWing caching, which we believe can solve at least subproblem (2) by improving our system's ability to handle these steady-state requests at scale and allowing the Apps team to access topic data from Cassandra.
  • Using Data Gateway in an ad-hoc, one-time way to backfill and dump data into Cassandra, which might be needed to handle subproblem (1).

As a first step in addressing this request, we should investigate both of these options and decide:

  • Can a LiftWing cache adequately address this request? Are there any aspects of this request that cannot be addressed by a cache?
  • If the backfill (subproblem 1 above) requires its own solution, is the Data Gateway dump to Cassandra adequate to solve this?
  • What is the estimated level of effort around implementing each of these solutions?

If we find that a LiftWing cache can indeed support this request adequately, we should constrain our initial requirements around this use case and generalize the cache across other use cases later.

Sucheta-Salgaonkar-WMF renamed this task from AI/ML Infrastructure Request: **Accessing topics endpoints at scale** to Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review. Aug 1 2025, 6:31 PM
Sucheta-Salgaonkar-WMF updated the task description.

Summary of progress:

  1. ML Team and Data Persistence team discussed Cassandra Cache design proposed in https://phabricator.wikimedia.org/T401778. We've agreed on changes that need to be made to the initial design. The updated design will be posted under the design review ticket: https://phabricator.wikimedia.org/T402984. Once the updated design is approved, the Data Persistence Team will move to deploying the Cassandra instance.
  2. ML Team decided to slightly modify the current architecture of the Articletopic model deployed on LiftWing. This will enable a slight performance improvement, allow scaling to a higher number of replicas, and make it easier to integrate the model code with the Cache. This work is being tracked under this ticket: https://phabricator.wikimedia.org/T404294.
  3. We've decided to pursue a backfilling strategy based on the existing snapshots of article topic predictions stored in the Hive table. This will involve creating an Airflow ETL pipeline, which will load the data from the Hive table and populate the Cache in a batch-processing fashion. So far this has involved only discussion and planning; no code has been written yet. Ticket tracking this work: https://phabricator.wikimedia.org/T403254.
  4. The initial code for integrating Cache into article topic model has been published on Gerrit, but we have decided to first pursue the architecture modification discussed in point 2 above.
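The backfill in point 3 above can be pictured as streaming precomputed predictions from the Hive snapshot into the cache in fixed-size batches. The sketch below is purely illustrative: `InMemoryCacheWriter` stands in for a Cassandra writer, and the row shape (`wiki_id`, `page_id`, `topics`) is an assumption; the real pipeline would be an Airflow ETL job.

```python
# Hypothetical batch-backfill sketch. Stream rows exported from the Hive
# table into a cache writer in fixed-size batches, as described above.
from itertools import islice

def batched(rows, size):
    """Yield lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

class InMemoryCacheWriter:
    """Stand-in for a Cassandra batch writer, keyed by (wiki_id, page_id)."""
    def __init__(self):
        self.store = {}

    def write_batch(self, batch):
        for row in batch:
            self.store[(row["wiki_id"], row["page_id"])] = row["topics"]

def backfill(rows, writer, batch_size=500):
    """Write all rows to the cache in batches; return the number written."""
    written = 0
    for batch in batched(rows, batch_size):
        writer.write_batch(batch)
        written += len(batch)
    return written
```

Batching keeps memory bounded when the snapshot covers hundreds of millions of pages, and gives the writer natural checkpoints for retries.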

Next steps:

  • Post the updated Cache design document for the Data Persistence team to review described in point 1 above.
  • Finish the modifications described in point 2 above.
  • Start working on ETL code to backfill the Cache described in point 3 above.
  • Figure out and document the plan and requirements for connection between Cassandra Cache, Airflow Jobs and LiftWing.

Hello @Dbrant!

We have one technical question about the way the Apps side will query our LiftWing model to retrieve article topics. Currently, our LiftWing model expects users to pass page_title and lang parameters in POST requests. The ML team is also considering adding a page_id parameter that could be used instead of page_title.

To make sure we optimize our solution for Year in Review processing, we need confirmation of whether the Apps side would use the page_title parameter (the current state) or the page_id parameter, which we would introduce. Do you have a preference for one over the other?
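For concreteness, a client request shaped like the parameters described above might be built as follows. Only the payload shape (page_title plus lang, with page_id as the proposed alternative) comes from this discussion; the endpoint URL template and model name are assumptions for illustration.

```python
# Sketch of forming a LiftWing topic-prediction request. The URL template
# and "outlink-topic-model" name are assumptions; the payload fields
# (lang + page_title, or lang + the proposed page_id) come from this thread.
import json

LIFTWING_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/{model}:predict"

def build_topic_request(lang, page_title=None, page_id=None):
    """Return (url, json_body) for a topic prediction POST request."""
    if (page_title is None) == (page_id is None):
        raise ValueError("pass exactly one of page_title or page_id")
    payload = {"lang": lang}
    if page_title is not None:
        payload["page_title"] = page_title
    else:
        payload["page_id"] = page_id  # proposed new parameter
    url = LIFTWING_URL.format(model="outlink-topic-model")  # assumed name
    return url, json.dumps(payload)
```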

@BWojtowicz-WMF we should probably sync up about this kind of requirement (and also data modeling when you work on T402984). We asked a similar question to YiR folks for T403660: WE3.3.7 Year in Review and Activity Tab Services - Global Editor Metrics, and IIRC, page_id was fine (and is obviously better for e.g. page move reasons).


Hm, it'd be nice to have a few sync points on this for both of these projects:

  • Both are 'derived data'
  • Both are supporting YiR (and other) active use cases.
  • Both have the same product owners.

It'd be nice if we could be consistent about API/data modeling stuff, ya? :)

To make sure we optimize our solution for Year in Review processing, we would need confirmation whether Apps side would use the page_title parameter (the current state) or the page_id parameter, which would be introduced by us. Do you have a preference of one over another?

Hello @BWojtowicz-WMF! The short answer is, we can do either/both.
Technically, the apps natively deal mostly with page titles, not ids. If a certain API necessitates using ids, we would need to make an additional API call to resolve the titles into ids.
However, since it sounds like the editor-metrics API (mentioned by @Ottomata above) will be returning ids, and since those ids are precisely the pages for which we'll need topics, it would be great if we could feed those ids into the topics API.

considering adding a page_id parameter that could be used instead of page_title.

When you say you'll "add" a page_id parameter, does this mean you'll keep the page_title parameter? If so, that would be the best of both worlds, since I could envision scenarios where either variation would be useful.

@Dbrant

When you say you'll "add" a page_id parameter, does this mean you'll keep the page_title parameter? If so, that would be the best of both worlds, since I could envision scenarios where either variation would be useful.

Yes, we would keep both parameters available.

However, we're also deciding on our side which of those parameters (page_id or page_title) we will use as our cache index. If we use page_id as the cache index but a user passes the page_title parameter, that means an additional API call to resolve the title into an id; if the user passes page_id, no additional call is needed.
Since YiR will require the best performance/throughput possible, we want to make sure we won't need this additional API call.

So both parameters would still be available, but to make sure we can meet the YiR scale, we want to know which of them will be used for this project in particular, since we will use it as our cache index.
If the case is that you can use either/both, could we agree on using the page_id parameter for the requests done in relation to Year in Review?
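The trade-off above can be sketched as a lookup flow, assuming page_id is chosen as the cache index: requests arriving with only a page_title need one extra resolution hop, while requests carrying a page_id hit the cache directly. `resolve_title_to_id` stands in for the extra MediaWiki API call; all names here are hypothetical.

```python
# Hypothetical lookup flow with page_id as the cache index. Requests that
# pass only a title pay for one extra resolution call, which is exactly
# the hop we want to avoid for YiR-scale traffic.

def get_topics(cache, wiki_id, resolve_title_to_id, page_id=None, page_title=None):
    """Return cached topic scores for a page, resolving a title if needed."""
    if page_id is None:
        page_id = resolve_title_to_id(wiki_id, page_title)  # extra API hop
    return cache.get((wiki_id, page_id))
```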

could we agree on using the page_id parameter for the requests done in relation to Year in Review?

Understood, and yes, that sounds reasonable!

Perfect, thank you!

Weekly Report

Summary of progress:

  1. The cache design has been posted to review for the Data Persistence Team in https://phabricator.wikimedia.org/T402984
  2. Part of the architectural change merging the transformer and predictor parts has been merged (ticket: https://phabricator.wikimedia.org/T404294). Remaining work includes testing the new architecture, removing the legacy code, and publishing the new model to production.
  3. The Machine Learning Team decided to add a new feature to the article topics model allowing users to pass page_id instead of page_title as a parameter. The Machine Learning Team and the Apps team agreed to use the page_id parameter for queries originating from the Year in Review project. The work is being tracked in https://phabricator.wikimedia.org/T371021.

Next steps:

  • Iterate on the posted Cache design.
  • Thoroughly test the new combined architecture of article topics model. Once tested, remove the legacy code and publish the updated model to production.
  • Introduce page_id parameter to the article topics model.
  • (carry-over from last update) Start working on ETL code to backfill the Cache described in point 3 above.
  • (carry-over from last update) Figure out and document the plan and requirements for connection between Cassandra Cache, Airflow Jobs and LiftWing.

Weekly Report

Summary of progress:

  1. The architectural change described in https://phabricator.wikimedia.org/T404294 got merged, deployed and tested both on staging and production clusters. Patches cleaning up legacy code have been submitted for review.
  2. The patch adding the page_id parameter to the articletopic model has been performance-tested and submitted for review. Phabricator ticket: https://phabricator.wikimedia.org/T371021

Next steps:

  1. Deploy and load test the new model deployment accepting page_id parameter.
  2. Create a mapping from Wikipedia language code to wiki ID as discussed in the design review ticket: https://phabricator.wikimedia.org/T402984
  3. Submit the initial patch adding cache logic to the articletopic model.
  4. Start working on ETL code to backfill the Cache described in point 3 above.
  5. Figure out and document the plan and requirements for connection between Cassandra Cache, Airflow Jobs and LiftWing.
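The lang-to-wiki-ID mapping in step 2 above mostly follows a mechanical pattern. As a hedged sketch: for most Wikipedias the wiki ID is the language code with hyphens replaced by underscores plus a "wiki" suffix (e.g. "en" becomes "enwiki"), but the real mapping has exceptions, so an explicit override table is assumed here; the real exception list would come from site configuration.

```python
# Hypothetical lang -> wiki ID mapping sketch. The general rule is an
# assumption that covers most wikis; OVERRIDES is a placeholder for the
# real exception list, which would come from site configuration.

OVERRIDES = {
    # purely illustrative; populate from actual site config
}

def lang_to_wiki_id(lang):
    """Map a Wikipedia language code to its wiki ID (database name)."""
    if lang in OVERRIDES:
        return OVERRIDES[lang]
    return lang.replace("-", "_") + "wiki"
```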

Weekly Report
Sharing a day early as I'm OOO on the 3rd of October.

Summary of progress:

  1. Work adding the page_id parameter to the ArticleTopic model was deployed and load-tested on staging. The production deployment will be done next week as a safety measure, to avoid potential issues over the weekend.
  2. Discussions on the Cache design have been finalized: https://phabricator.wikimedia.org/T402984
  3. Started updating the code adding Cache logic to inference code to match the new design. Created mappings from lang to wiki_id for this purpose as well.

Next Steps:

  1. Deploy the page_id change to production.
  2. Submit the patch adding Cache logic to the inference code.
  3. Start creating the necessary files for the Cassandra <-> LiftWing connection
  4. Start working on ETL code to backfill the Cache

Weekly Report

Summary of progress:

  1. The article topic model now supports the page_id parameter. The change was deployed and tested in production.
  2. Patch adding cache logic to inference code has been updated to match the new design and tested on local setup: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1176448.

Next Steps:

  1. Iterate through the patch introducing cache logic.
  2. Start creating the necessary files for the Cassandra <-> LiftWing connection
  3. Start working on ETL code to backfill the Cache

Update / Task on pause

The task has been put on pause for the past month due to moving resources to a higher-priority task (Revise Tone Structured Task) and the fact that the Year in Review project no longer depends on article topics this year. However, we plan to pick this task up again in December.

At the same time, Revise Tone also depends on setting up the Cassandra connection, which is a large part of this problem. This means that once we get back to it, it will be much easier to solve :)