Turn off Trending Service
Closed, Resolved (Public)

Description

We need to turn off the Trending Service until product needs become more clear in order to reduce the maintenance burden.

After performing user research with the current Trending Service, we found that it did not fit the current trending use cases for the mobile apps or for top read. (Report)

Several potential improvements have been identified, but due to other priorities, it will be some time before these are explored more thoroughly.

In the interim, the Trending Service appears to be causing some additional maintenance burden (a few bug fixes had to be made last week).

For both of these reasons, it seems prudent to turn the service off for now until we can revisit the improvements that need to be made for this to become a viable service.

If any more testing needs to be performed we can do this in Cloud VPS.

Steps:

  • Announce the removal
  • Set the end point's stability to Deprecated
  • Remove the endpoint from restbase - this needs a restbase deployment
  • Remove the endpoint from lvs - this needs a puppet patch
  • Remove the data from conftool (puppet)
  • Remove the role from the scb cluster (puppet)
  • Clean up the scb cluster (either via a newly-minted puppet class to remove a service, or via a manual script).

Event Timeline

Turning the service off in itself is trivial (a patch + deploy), but there is a larger question here. The results of the trending edits service are publicly available via the REST API. The end point is marked as experimental, so we are covered on that front. However, the task's description alludes to the service being put back in production at some point. Is that so? It is a bit confusing from the user perspective to have the end point available, then for it to disappear and resurface later.

It would be good to talk about timelines and the future of this project before we effectively take it out.

@mobrovac that is not any kind of assurance that it will happen… sorry if that is confusing. We should not plan on this returning to production.

Really the concept needs more testing for product viability. Unfortunately, we were unable to test in a non-production environment due to Kafka not being available outside of production. This is the main driver for it moving to production more quickly than we would have liked.

So I think honestly we should figure out a way to get Kafka in Cloud VPS and we can get any further testing done there without ever needing it to be in production until we are sure that we have a clear product need and the service will fill that need.

I understand, but I must say, this bums me out as I just migrated a bunch of side projects to this service and now I have no alternative but to reimplement the service...
😞

Given I advertised it a fair bit, it would be wise and responsible to take a look at the traffic to the endpoint to get a sense of whether any clients adopted it (other than me), and to give them sufficient heads up that it's going away if needed. I know a labs instance editor tool was interested in it, but I have no idea whether they got round to using it.

Really the concept needs more testing for product viability. Unfortunately, we were unable to test in a non-production environment due to Kafka not being available outside of production.

Is it actually impossible to use that in Labs or is it just that whoever put it into production didn't properly mirror it in beta?

This whole event brings forward a larger question about microservices and their cost.

This service did cost various teams time and effort, specifically both Ops and Services spent time:

  • preparing the service for production
  • deploying it and preparing monitoring, alerting, etc.
  • fixing issues it caused itself and to other services, even over a weekend

We have to be conscious that bringing software to production has significant costs, and it should not be done before we're confident in its utility and fitness.

This consideration aside, I'm not sure how to proceed here - I guess it will take time before we can undeploy this. At the very least, we need to:

  • Analyze who's using the service right now
  • Send a removal notice to wikitech-l and probably to the communities at large
  • Notify all "customers" of that API of the end date
  • Remove the endpoints from restbase
  • Decommission the software in puppet, scap, manually clean up the servers maybe

Some of these activities will inevitably involve either Ops or Services; I think you'll need to talk with the managers of both teams to find out when and how they'll be able to allocate resources to this goal.

Really the concept needs more testing for product viability. Unfortunately, we were unable to test in a non-production environment due to Kafka not being available outside of production.

Is it actually impossible to use that in Labs or is it just that whoever put it into production didn't properly mirror it in beta?

It is actually impossible because AIUI the edits stream via Kafka is not available in deployment-prep, and this is of course an issue that left no room for testing there. It's a very unfortunate place to be in, and in fact this is probably the whole reason why this was deployed in production in the first place.

Do we really need all this for an endpoint marked as "experimental"?

Rolling out a more experimental service is a valid use case, and having to do this work to deploy and undeploy isn't really letting us move quickly. The deployment part we're hopefully going to tackle with k8s, so there isn't much point in discussing this further now, I think.

Undeployment and its social aspects (notifying users etc.) aren't something we have talked much about yet. I would say having a marker (the "experimental" one, or a similar one) and setting expectations to be "it may disappear or its API may change at any point in time without notice" would be the way to go here.

I would say having a marker (the "experimental" one, or a similar one) and setting expectations to be "it may disappear or its API may change at any point in time without notice" would be the way to go here.

We already mark it as experimental, and the marker links to the following explanation:

Experimental end points can change in incompatible ways at any time, without incrementing the API version. You are welcome to use them at your own risk.

Although it doesn't explicitly talk about endpoints disappearing, "change in incompatible ways" covers that case, I think.

I went to hive to check the external traffic to the endpoint from web request logs. For a random day there were just 300 requests PER DAY to the endpoint. Most of the external requests are done with a node-fetch user-agent and only about 50 req/day with a browser. So there is some real traffic on the endpoint, but the numbers are really, really low.
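A breakdown like the one above amounts to bucketing each request's User-Agent string into a coarse client class and counting. The following is only a minimal sketch of that idea, not the Hive query that was actually run: the function names, the bucketing rules, and the sample strings are all hypothetical and are not taken from the web request logs.

```python
from collections import Counter


def classify_user_agent(user_agent: str) -> str:
    """Bucket a raw User-Agent string into a coarse client class.

    The rules are illustrative assumptions: anything mentioning
    "node-fetch" is treated as a script, anything mentioning "mozilla"
    as a browser, and everything else as "other".
    """
    ua = user_agent.lower()
    if "node-fetch" in ua:
        return "node-fetch"
    if "mozilla" in ua:
        return "browser"
    return "other"


def tally_by_client(user_agents):
    """Count requests per client class for an iterable of User-Agent strings."""
    return Counter(classify_user_agent(ua) for ua in user_agents)


# Hypothetical sample input, NOT real log data.
sample = [
    "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/57.0",
    "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)",
    "curl/7.55.1",
]
counts = tally_by_client(sample)
print(counts)
```

Substring matching is crude (many bots also send "Mozilla" in their User-Agent), so a tally like this gives an upper bound on browser traffic rather than an exact figure.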

In T180384#3758434, @Pchelolo wrote:
I went to hive to check the external traffic to the endpoint from web request logs. For a random day there were just 300 requests PER DAY to the endpoint. Most of the external requests are done with a node-fetch user-agent and only about 50 req/day with a browser. So there is some real traffic on the endpoint, but the numbers are really, really low.

That seems low enough not to need any work beyond removing the service from the scb cluster and lvs.

I wouldn't even go as far as removing it from the deployment servers, as it might get deployed again. Is that correct?

An undeployment procedure would be:

  • Remove the endpoint from restbase - this needs a restbase deployment
  • Remove the endpoint from lvs - this needs a puppet patch
  • Remove the data from conftool (puppet)
  • Remove the role from the scb cluster (puppet)
  • Clean up the scb cluster (either via a newly-minted puppet class to remove a service, or via a manual script).

Really the concept needs more testing for product viability. Unfortunately, we were unable to test in a non-production environment due to Kafka not being available outside of production.

Is it actually impossible to use that in Labs or is it just that whoever put it into production didn't properly mirror it in beta?

It is actually impossible because AIUI the edits stream via Kafka is not available in deployment-prep, and this is of course an issue that left no room for testing there. It's a very unfortunate place to be in, and in fact this is probably the whole reason why this was deployed in production in the first place.

Both Kafka/EventBus and the Trending Edits service are available in deployment-prep but the stream of events is tied to the BetaCluster MW instance, which does not provide events meaningful enough to conduct functional/behavioural testing of the service (we could only establish that the service can read the events from Kafka).

I went to hive to check the external traffic to the endpoint from web request logs. For a random day there were just 300 requests PER DAY to the endpoint. Most of the external requests are done with a node-fetch user-agent and only about 50 req/day with a browser. So there is some real traffic on the endpoint, but the numbers are really, really low.

In spite of it being marked as experimental, I would prefer we first announced the deprecation and eventual removal of the end point because: (i) we cannot provide any alternative way for clients to obtain the same data; and (ii) even though the rate of requests seems low, I suspect some visualisation (or other kind of) tools use it and periodically poll data from it. I would suggest putting a deprecation notice in the end point's documentation ASAP and giving a month or two of notice prior to removal.

Thinking about reason (i), now that EventStreams exposes edit data, a similar service consuming that stream could be built outside of the production environment and might be a suitable replacement.
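To make the idea of a replacement built on EventStreams a little more concrete, here is a minimal sketch of the core counting step. It is not the Trending Edits service's algorithm. It assumes the public recentchange stream (at the time of writing served as Server-Sent Events at https://stream.wikimedia.org/v2/stream/recentchange) delivers one JSON object per `data:` line with `type`, `wiki`, `title` and `bot` fields; the function name and the sample events are hypothetical.

```python
import json
from collections import Counter


def count_edits(sse_lines, wiki="enwiki"):
    """Count non-bot edits per page title from Server-Sent Events lines.

    `sse_lines` is any iterable of text lines. In a real consumer it would
    be the decoded lines of the HTTP response body of the stream; here it
    can be a plain list. Field names are assumptions about the event schema.
    """
    counts = Counter()
    for line in sse_lines:
        # Only "data:" lines carry the event payload; skip ids, comments, etc.
        if not line.startswith("data:"):
            continue
        try:
            event = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # ignore partial or malformed payloads
        title = event.get("title")
        if (
            event.get("type") == "edit"
            and event.get("wiki") == wiki
            and not event.get("bot")
            and title
        ):
            counts[title] += 1
    return counts


def _data(event):
    """Format a dict as an SSE data line (helper for the sample below)."""
    return "data: " + json.dumps(event)


# Hypothetical sample events, NOT captured from the live stream.
sample_lines = [
    "event: message",
    _data({"type": "edit", "wiki": "enwiki", "title": "Foo", "bot": False}),
    _data({"type": "edit", "wiki": "enwiki", "title": "Bar", "bot": False}),
    _data({"type": "edit", "wiki": "enwiki", "title": "Bar", "bot": True}),
    _data({"type": "edit", "wiki": "dewiki", "title": "Foo", "bot": False}),
    _data({"type": "new", "wiki": "enwiki", "title": "Baz", "bot": False}),
    "data: {not valid json",
    _data({"type": "edit", "wiki": "enwiki", "title": "Foo", "bot": False}),
]
top = count_edits(sample_lines).most_common(1)
print(top)
```

A real trending calculation would also need a time window and some weighting (for example by number of distinct editors), which this sketch deliberately leaves out.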

Others have also asked for a limited Kafka Mirror available in Cloud VPS somewhere. I'm not opposed, we'd just need a hole between the networks so Cloud can talk to Production Kafka somehow.

@Pchelolo asked me a few questions

are you up for being a maintainer of it?

I am, although one of the biggest pain points for maintaining this so far has been the inability to get at live data during testing and the fact Vagrant has to be used and requires some hacking (https://gerrit.wikimedia.org/r/#/c/335555/). It's not been 100% clear who was responsible for updates, so I'd appreciate more clarity around that and what level of support I could get from services.

do you want it in prod? are you ok if we manage to move it into labs?

I don't think it needs to be in production if we're not using it anywhere, but yes if we could deploy an instance to labs with access to live data from the production wikis I think this would be a useful and suitable alternative. It also makes it easier for me to keep up to date and put back on production later if we need it.

@Pchelolo asked me a few questions

are you up for being a maintainer of it?

I am, although one of the biggest pain points for maintaining this so far has been the inability to get at live data during testing and the fact Vagrant has to be used and requires some hacking (https://gerrit.wikimedia.org/r/#/c/335555/). It's not been 100% clear who was responsible for updates, so I'd appreciate more clarity around that and what level of support I could get from services.

The owner of the service is its maintainer too. That includes, amongst other things, updating the service's dependencies as well as working on its functionality. Additionally, it is expected of the owner to make themselves available in case of problems in production and/or outages.

do you want it in prod? are you ok if we manage to move it into labs?

I don't think it needs to be in production if we're not using it anywhere, but yes if we could deploy an instance to labs with access to live data from the production wikis I think this would be a useful and suitable alternative. It also makes it easier for me to keep up to date and put back on production later if we need it.

The needs of the Trending Edits service can be fulfilled with the EventStreams service's data, so you should not need access to the production Kafka instance, and thus, the service can be deployed in an arbitrary environment (Labs, VPS, etc). However, if the service is moved out of production its public API end point will not be available any more.

Hey all, following up here after the holiday.

Looks like we have next steps from @Joe. I'll add this to the description.

An undeployment procedure would be:

  • Remove the endpoint from restbase - this needs a restbase deployment
  • Remove the endpoint from lvs - this needs a puppet patch
  • Remove the data from conftool (puppet)
  • Remove the role from the scb cluster (puppet)
  • Clean up the scb cluster (either via a newly-minted puppet class to remove a service, or via a manual script).

I wouldn't even go as far as removing it from the deployment servers, as it might get deployed again. Is that correct?

Unlikely in any knowable timeframe. There is no plan in any product roadmap to integrate this service or even experiment with it again through the end of the FY. So if there is an extra step to remove it from the deployment servers, I would go with it.

Also thanks to everyone for helping out here. We are trying to get better at being responsible maintainers and sunsetting components. Although this isn't the ideal outcome, I am at least happy that we are exercising our sunsetting muscles.

Stashbot subscribed.

Mentioned in SAL (#wikimedia-operations) [2017-12-06T10:39:41Z] <mobrovac@tin> Started deploy [restbase/deploy@b1d7c82]: Use Cass3 for revisions, deprecate trending-edits, fix CX end point - T179421 T180384 T173801

Mentioned in SAL (#wikimedia-operations) [2017-12-06T10:45:43Z] <mobrovac@tin> Finished deploy [restbase/deploy@b1d7c82]: Use Cass3 for revisions, deprecate trending-edits, fix CX end point - T179421 T180384 T173801 (duration: 06m 02s)

Change 397567 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Config: Remove the Trending Edits options and URI

https://gerrit.wikimedia.org/r/397567

Change 397571 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] Remove the Trending Edits service from production

https://gerrit.wikimedia.org/r/397571

Change 397745 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/dns@master] Remove trendingedits discovery endpoint

https://gerrit.wikimedia.org/r/397745

Change 397746 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/dns@master] Remove all references to trendingedits

https://gerrit.wikimedia.org/r/397746

Change 397567 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Config: Remove the Trending Edits options and URI

https://gerrit.wikimedia.org/r/397567

Mentioned in SAL (#wikimedia-operations) [2017-12-14T13:26:38Z] <mobrovac@tin> Started deploy [restbase/deploy@187d8ba]: Remove Trending Edits end point and stop storing feed results in Cassandra - T180384 T179412

Mentioned in SAL (#wikimedia-operations) [2017-12-14T13:32:15Z] <mobrovac@tin> Finished deploy [restbase/deploy@187d8ba]: Remove Trending Edits end point and stop storing feed results in Cassandra - T180384 T179412 (duration: 05m 37s)

mobrovac edited projects, added Services (doing); removed Services (blocked).

The public end point has been removed from RESTBase, so the service is no longer reachable. The actual decommission of the service and the clean-up needed are going to be performed on Monday, 2017-12-18.

Change 398286 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] Trending Edits: Stop and mask the service

https://gerrit.wikimedia.org/r/398286

Change 398286 abandoned by Mobrovac:
Trending Edits: Stop and mask the service

Reason:
It's 17h on Friday, so this is not happening

https://gerrit.wikimedia.org/r/398286

Change 397745 merged by Giuseppe Lavagetto:
[operations/dns@master] Remove trendingedits discovery endpoint

https://gerrit.wikimedia.org/r/397745

Mentioned in SAL (#wikimedia-operations) [2017-12-18T11:04:30Z] <_joe_> disabled notifications for trendingedits.svc T180384

Change 397571 merged by Giuseppe Lavagetto:
[operations/puppet@production] Remove the Trending Edits service from production

https://gerrit.wikimedia.org/r/397571

Mentioned in SAL (#wikimedia-operations) [2017-12-18T11:20:19Z] <mobrovac> stopping the trending edits service - T180384

Change 397746 merged by Giuseppe Lavagetto:
[operations/dns@master] Remove all references to trendingedits

https://gerrit.wikimedia.org/r/397746

mobrovac claimed this task.
mobrovac updated the task description.
mobrovac added a subscriber: bearND.

The service has been completely removed from production. Thanks to @Joe for assisting.