Replace the current recommendation-api service with a newer version
Closed, Declined (Public)

Description

Hi everybody!

In T333893 the Research team, in collaboration with Content Translation, asked us (ML) to host a new service called "Recommendation API" (the idea is that ml-serve clusters can now host any service that wikikube can, so anything vaguely related to ML may fall on our shoulders so we keep things tidy on both ends).

The name recommendation-api rang a bell, and indeed we have one service on Wikikube with that name. From some digging in Phabricator I found some traces of that service (see T333893#8901488); afaict it was created by Research and it was ported to nodejs at some point in the past. From the Grafana dashboard the traffic seems mostly health-check-related, and nobody in Research remembers or knows why we have the current recommendation-api deployed (maybe @leila does).

The "new" recommendation API is written in Python and currently running in WMF cloud (see https://recommend.wmflabs.org/types/translation/), and the Content Translation team would like to have it in production to move their service calls to a more stable endpoint.

What should we do? There are several options, for example giving the new service a different name, but it could be a good occasion to deprecate the current recommendation-api if we verify that it is not really used by anybody (removing some tech debt).

Any thoughts? Or maybe context from the past to proceed in one direction or the other?

Event Timeline

Adding some more information:

The service was maintained by @bmansurov. It was deployed on the scb cluster. I am the one that moved it to Wikikube as a push to get rid of scb. It uses a very old database, judging by the name, last populated in 20181130 and I doubt it has been "retrained" since. It is exposed in RESTBase under /api/rest_v1/data/recommendation/article and does see some traffic per https://w.wiki/6oh7. Some ~7k requests per week.
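For a sense of scale, the weekly figure above can be turned into an average rate. This is only back-of-the-envelope arithmetic on the rounded ~7k number, assuming traffic is spread evenly over the week (which it probably is not):

```python
# Rough average request rate from the ~7k requests/week figure quoted above.
# Assumes uniform traffic over the week, which is a simplification.
REQUESTS_PER_WEEK = 7_000  # approximate value read from the turnilo query

HOURS_PER_WEEK = 7 * 24
SECONDS_PER_WEEK = HOURS_PER_WEEK * 3600

requests_per_hour = REQUESTS_PER_WEEK / HOURS_PER_WEEK
requests_per_second = REQUESTS_PER_WEEK / SECONDS_PER_WEEK

print(f"{requests_per_hour:.1f} req/hour, {requests_per_second:.4f} req/s")
```

That works out to roughly 42 requests per hour on average, i.e. well under one request per second.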

Oh I forgot to add that we have https://meta.wikimedia.org/wiki/Recommendation_API for explaining what it is. Finally the referers in turnilo imply some functionality on mobile sites?

My personal take is btw that it is unowned. I'd say Code Stewardship request and maybe it's enough of a lost cause that we undeploy it?

Thanks for the info!

Oh I forgot to add that we have https://meta.wikimedia.org/wiki/Recommendation_API for explaining what it is. Finally the referers in turnilo imply some functionality on mobile sites?

I think that the link is about the "new" service, since it refers to Content translation and .py files.. not 100% positive but it doesn't look related to the nodejs app.

I also tried to check the timings of calls in turnilo, and they are all grouped together: https://w.wiki/6ohy. Maybe it was me trying the endpoint in these days?

My personal take is btw that it is unowned. I'd say Code Stewardship request and maybe it's enough of a lost cause that we undeploy it?

+1, probably if Research give us the green light we can start the deprecation process.

Change 929695 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] recommendation-api: restbase isn't used for anything

https://gerrit.wikimedia.org/r/929695

Thanks for the info!

Oh I forgot to add that we have https://meta.wikimedia.org/wiki/Recommendation_API for explaining what it is. Finally the referers in turnilo imply some functionality on mobile sites?

I think that the link is about the "new" service, since it refers to Content translation and .py files.. not 100% positive but it doesn't look related to the nodejs app.

The section "Future service" talks about ServiceTemplateNode, so it looks related to the nodejs app. It's also linked from here: https://en.wikipedia.org/api/rest_v1/#/Recommendation/get_data_recommendation_article_creation_translation__from_lang_ and it hasn't seen much change past 2017 (2 substantial edits in 2022, 2 minor ones in 2018 and 2023)

I also tried to check the timings of calls in turnilo, and they are all grouped together: https://w.wiki/6ohy. Maybe it was me trying the endpoint in these days?

Quite possibly? Did you remember trying with multiple referers though?

My personal take is btw that it is unowned. I'd say Code Stewardship request and maybe it's enough of a lost cause that we undeploy it?

+1, probably if Research give us the green light we can start the deprecation process.

Yeah, let's do that. I already uploaded a change to remove the services proxy support that allowed recommendation-api to reach out to restbase. The other data point that might help Research is to remove the health checks and then look at the metrics.

Thanks for the info!

Oh I forgot to add that we have https://meta.wikimedia.org/wiki/Recommendation_API for explaining what it is. Finally the referers in turnilo imply some functionality on mobile sites?

I think that the link is about the "new" service, since it refers to Content translation and .py files.. not 100% positive but it doesn't look related to the nodejs app.

The section "Future service" talks about ServiceTemplateNode, so it looks related to the nodejs app. It's also linked from here: https://en.wikipedia.org/api/rest_v1/#/Recommendation/get_data_recommendation_article_creation_translation__from_lang_ and it hasn't seen much change past 2017 (2 substantial edits in 2022, 2 minor ones in 2018 and 2023)

Very confusing, the page mentions the wmflabs endpoint that Content Translation is using right now (the Python app that should end up on Lift Wing), so at this point I don't know.. Nschaaf is listed among the editors of the page, and IIUC they created the Python API, so maybe the code has been parked in this state for a while? It would match what I see in https://gerrit.wikimedia.org/r/plugins/gitiles/research/recommendation-api/+log/refs/heads/master/recommendation/api/api.py

I also tried to check the timings of calls in turnilo, and they are all grouped together: https://w.wiki/6ohy. Maybe it was me trying the endpoint in these days?

Quite possibly? Did you remember trying with multiple referers though?

Good point, didn't try multiple ones, and I rechecked today, there is a bit of traffic hitting the restbase endpoint..

My personal take is btw that it is unowned. I'd say Code Stewardship request and maybe it's enough of a lost cause that we undeploy it?

+1, probably if Research give us the green light we can start the deprecation process.

Yeah, let's do that. I already uploaded a change to remove the services proxy support that allowed recommendation-api to reach out to restbase. The other data point that might help Research is to remove the health checks and then look at the metrics.

I asked Research to possibly come up with a new name for the new recommendation-api, if possible it would allow us to deprecate the old one in an easier way. If not we can probably announce the deprecation to Wikitech-l and then remove the restbase URI config?

Change 929695 merged by jenkins-bot:

[operations/deployment-charts@master] recommendation-api: restbase isn't used for anything

https://gerrit.wikimedia.org/r/929695

The other thing that I just noticed is that this service consumes 0.4% of the resources it is allocated

I am gonna aggressively remove ~84% of the capacity it has been allocated, leaving the bare minimum for high availability (2 replicas).

Change 930178 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] recommendation-api: Remove 84% of assigned capacity

https://gerrit.wikimedia.org/r/930178

Change 930178 merged by jenkins-bot:

[operations/deployment-charts@master] recommendation-api: Remove 84% of assigned capacity

https://gerrit.wikimedia.org/r/930178

Change 931060 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/restbase@master] Deprecate the recommendation-api endpoint

https://gerrit.wikimedia.org/r/931060

I filed some code changes for Restbase, but to the wrong repo, since we have gerrit mirroring github. The CI settings are broken, so I am not sure how to proceed. I asked the api-platform team on slack, let's see how it goes.

After a chat with @daniel on slack we realized that the endpoint is indeed used: https://w.wiki/6rKU

@akosiaris I don't think that we can deprecate yet. The best course of action that I can think of is to add the new recommendation-api with a new name on Lift Wing, and possibly talk with the Product team using the "old" recommendation-api about migrating to the new one if feasible (and then eventually deprecate the old one). How does it sound?

Sigh. Interestingly, the API is apparently using data from late 2018 to recommend whatever it is recommending. I have no idea if this is ok or not; it might make sense to bring this to the Android app team's attention, in case they aren't aware already.

However, the above is unrelated to your point. We can't just yank it from production, that's pretty clear. The path you sketch out is our best path forward apparently. Good luck coming up with a good new name!

Would it be possible to implement a compatibility layer, so the app can use the new service without any changes needed? Updating apps is problematic, old versions of the app will remain in use for months and years...
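To make the compatibility-layer idea a bit more concrete, here is a minimal sketch of the adapter pattern it implies. Everything in it is hypothetical: the field names (`page_title`, `rank_score`, `title`, `score`), the response shapes and the `fake_new_backend` stand-in are invented for illustration and are not taken from either the nodejs or the Python service.

```python
from typing import Callable, Dict, List

# Hypothetical signature of a call into the new service:
# (source language, target language, count) -> list of result dicts.
NewBackend = Callable[[str, str, int], List[Dict[str, object]]]


def make_legacy_handler(new_backend: NewBackend) -> Callable[[str, str, int], Dict[str, object]]:
    """Wrap a new-service call so it answers in an (assumed) legacy shape."""

    def handler(source: str, target: str, count: int) -> Dict[str, object]:
        results = new_backend(source, target, count)
        # Rename fields from the assumed new schema to the assumed old one.
        items = [{"title": r["page_title"], "score": r["rank_score"]} for r in results]
        return {"count": len(items), "items": items}

    return handler


def fake_new_backend(source: str, target: str, count: int) -> List[Dict[str, object]]:
    """Stand-in for the new service; returns canned data for illustration only."""
    return [{"page_title": f"Example_{i}", "rank_score": 1.0 / (i + 1)} for i in range(count)]


legacy = make_legacy_handler(fake_new_backend)
response = legacy("en", "de", 2)
print(response)
```

The point of the sketch is only that old clients keep seeing the response shape they already parse, while the data comes from a different backend; whether the two real APIs overlap enough for this to work is exactly what would need to be checked.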

@diego we probably need to figure out a path forward for this, namely:

  1. review how the old recommendation-api works, what training data it used, etc..
  2. compare it with the new one, and find overlapping/missing-bits.
  3. add the missing bits that are really needed so we can move people off the old api and deprecate it

This needs some help from Research since we (as ML) don't have a lot of context about the previous api..

Let's ask @leila , she might have that information.

@diego in https://phabricator.wikimedia.org/T308165#7983559 @Isaac also mentioned this:

For example, similar service with endpoints described/testable here. @bmansurov would know more about the history of that service and why contenttranslation is still using the python instance on cloud vps instead of the nodejs version on mediawiki.

After a chat with @Miriam we started to wonder if the python and nodejs services are basically the same thing, done a long time ago so people don't have context anymore? My point is that the current (already deployed) nodejs app on k8s may be what content translation needs, see https://en.wikipedia.org/api/rest_v1/#/ (this is the nodejs app exposed to the internet by us via Restbase).

Change 931060 abandoned by Elukey:

[mediawiki/services/restbase@master] Deprecate the recommendation-api endpoint

Reason:

still used

https://gerrit.wikimedia.org/r/931060

@elukey looks like we are sticking with the old recommendation-api for a while. Should we resolve this?

Would it be possible to implement a compatibility layer, so the app can use the new service without any changes needed? Updating apps is problematic, old versions of the app will remain in use for months and years...

To keep archives happy - I am following up with the team owning the Android app to see if they can become the point of contact for this service, since its ownership is not clear at this point and the only major user of the API is the Android app.

@elukey Do we have another idea on the table aside from asking a team of Android devs (3) to maintain a recommendations service?

While I have my reservations about the proposal itself, I worry about the precedent it sets around moving ownership of backend services to frontend teams that use them.

@elukey Do we have another idea on the table aside from asking a team of Android devs (3) to maintain a recommendations service?

Hi! The alternative is to decommission the service, since no other team owns it, but this would break our users so I am trying to find a good compromise. The service is currently hosted by ServiceOps on k8s, so "maintenance" in this case would not be anything related to infrastructure, but to further developments to the API, or basic routine upgrade tasks that may be required once in a while (update docker images and test before rolling out, etc..).
Another alternative could be to migrate to another service, for example T340854, but neither SRE nor ML has any bandwidth to assess whether our Android app can/should do it (and it is outside the scope of our teams).

While I have my reservations about the proposal itself, I worry about the precedent it sets around moving ownership of backend services to frontend teams that use them.

The service is written in nodejs, and a frontend team would probably have more expertise than an SRE one in evolving the API. Another alternative is to follow up with teams using other services and migrate to them.

I am already following up with various people on slack about this, and I am waiting for some answers to figure out what's best. Please keep in mind that the service is currently unowned and kept alive by the SRE team as it is, and Restbase is currently being deprecated (so the way the service is exposed to the rest of the world is going away soon). I am trying to find its best path forward, and even if I have a lot of reservations about the service itself, I am doing my best to avoid ending up in a situation where we break our users.

I worry about the precedent it sets around moving ownership of backend services to frontend teams that use them.

In theory, https://www.mediawiki.org/wiki/Code_stewardship_reviews was meant to sort out codebase ownership issues.
In practice this process hasn't been in place for a while due to lack of authority and/or decision making in WMF at some point in time (personal opinion).

Just to add my 2 cents as a generic observation.

If we can't find any kind of owner for this, it will eventually have to be undeployed and whatever functionality relied on it will have to be implemented in some other way. What that other way will be, backend, frontend, middleware, library, external 3rd party service or whatever else, is an entire discussion on its own. And with a single team being the user of this functionality, this same team is de facto the only one that has any kind of vested interest in participating in that discussion. Almost everyone else is at most the chicken from the The_Chicken_and_the_Pig business fable. In other words, anybody else is at most either interested or involved, but most definitely not committed (and we see this in the inability to find an owner).

Thanks for all the inputs.

I'd like to verify my expectations around service teams and infrastructure commitments first before I comment further. I'm talking to @ttaylor and @mark about this and will report back soon.

Adding a data point that just crossed my mind, just to rule it out.

A mysqldump of the recommendation API database right now sits at 810MB. A bzip2ed version of it sits at 136MB. Even if one uses a more efficient compression algorithm and trims the data, it probably remains something that can not be shipped to users bundled in the Android app without substantially increasing its size.
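As a quick sanity check on those two sizes, the compression ratio and space saving can be worked out directly. This uses only the 810MB and 136MB figures quoted above; nothing here is measured independently:

```python
# Compression figures for the recommendation-api database dump,
# using the two sizes quoted in the comment above.
MYSQLDUMP_MB = 810  # uncompressed mysqldump size
BZIP2_MB = 136      # size after bzip2

compression_ratio = MYSQLDUMP_MB / BZIP2_MB
space_saved_pct = (1 - BZIP2_MB / MYSQLDUMP_MB) * 100

print(f"ratio {compression_ratio:.2f}x, {space_saved_pct:.1f}% smaller")
```

So bzip2 already gives close to a 6x reduction, and the compressed dump is still well over 100MB.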

By the way, lack of solution to the ownership problem will eventually (yet inevitably) lead to an undeployment of the service, possibly in a similar fashion as the graph functionality, which was disabled in an emergency, without prior communication or fallback plans.

@SCherukuwada Hi! Shall we restart the conversation about recommendation-api?

Hello!

Seddon and I are meeting on Friday. We'll have a concrete action plan (or the beginnings of one) to share on Monday.

@Seddon Could you please post an update here and link to relevant tickets?

Change #1057874 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] service: Remove probes from recommendation-api

https://gerrit.wikimedia.org/r/1057874

Change #1057874 merged by Alexandros Kosiaris:

[operations/puppet@production] service: Remove probes from recommendation-api

https://gerrit.wikimedia.org/r/1057874

To keep the archives happy: unless I am mistaken, per T373611 ([spike] Determine if recommendation-api service calls can be migrated to local calls), the Android applications have moved from the old recommendation-api to a different implementation.

This means we can sunset the service once the traffic from older versions has died out enough.