Rollback haproxy feed automated ingestion
Closed, Resolved | Public | 1 Estimated Story Points

Description

Traffic SRE has disabled the HAProxy to Kafka shipper for feeds that back webrequest_frontend and the derived dataset.
To avoid confusion and ops burden, we should remove automated ingestion and ETL of the Kafka topics.

  • Remove the Gobblin MapReduce job that loads Kafka topics.
  • Remove webrequest_frontend_rc0 metrics from the Prometheus pushgateway.
  • Remove the webrequest_frontend Airflow DAGs.
  • Clean up the staging database. Before doing this, we should check whether any of the logs need to be archived.
  • Clean up ESC and refinery configs.

Event Timeline

Change #1062671 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/puppet@production] gobblin: remove webrequest_frontend ingestion job.

https://gerrit.wikimedia.org/r/1062671

Change #1062677 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] gobblin: remove webrequest_frontend_rc0.pull

https://gerrit.wikimedia.org/r/1062677

Change #1062679 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] EventStreamConfig: remove webrequest_frontend.

https://gerrit.wikimedia.org/r/1062679

Change #1062707 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] gobblin: remove webrequest_frontend_rc0

https://gerrit.wikimedia.org/r/1062707

Ah wha! It does. I thought I looked for it. Sorry! Abandoning.

Change #1062707 abandoned by Ottomata:

[operations/puppet@production] gobblin: remove webrequest_frontend_rc0

Reason:

Duplicate of Ibe8e9011cefc6bdcf0a663aa007b4eff130ed026

https://gerrit.wikimedia.org/r/1062707

Change #1062671 merged by Btullis:

[operations/puppet@production] gobblin: remove webrequest_frontend ingestion job.

https://gerrit.wikimedia.org/r/1062671

Change #1063819 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Absent the webrequest_frontend_rc0 gobblin job

https://gerrit.wikimedia.org/r/1063819

Change #1063820 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the webrequest_frontend_rc0 gobblin job

https://gerrit.wikimedia.org/r/1063820

Change #1063819 merged by Btullis:

[operations/puppet@production] Absent the webrequest_frontend_rc0 gobblin job

https://gerrit.wikimedia.org/r/1063819

Change #1063838 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the absenting of this gobblin test resource

https://gerrit.wikimedia.org/r/1063838

Change #1063838 merged by Btullis:

[operations/puppet@production] Fix the absenting of this gobblin test resource

https://gerrit.wikimedia.org/r/1063838

Remove the Gobblin MapReduce job that loads Kafka topics.

Removing this Gobblin job resulted in this alert being fired:
https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DGobblinLastSuccessfulRunTooLongAgo

I silenced the alert while the job decommission is in progress.

I had a look at Alertmanager, but we don't have a specific rule per job, just a catch-all that triggers for all Gobblin jobs that have ever executed.
As a workaround we could ACK! in the alert silence comment, but I'm afraid that to fully drop the alert we'll need to drain Prometheus' metrics, as suggested by @BTullis.

Change #1062679 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig: remove webrequest_frontend.

https://gerrit.wikimedia.org/r/1062679

f/up from a convo with @fgiunchedi in IRC.

We should be able to delete webrequest_frontend_rc0 metrics from prometheus pushgateway, and stop alerting on the gobblin job.

However, we were not able to delete metrics with:

curl -XDELETE http://prometheus-pushgateway.discovery.wmnet/metrics/job/webrequest_frontend_rc0
*   Trying 2620:0:861:101:10:64:0:82:80...
* Connected to prometheus-pushgateway.discovery.wmnet (2620:0:861:101:10:64:0:82) port 80 (#0)
> DELETE /metrics/job/webrequest_frontend_rc0 HTTP/1.1
> Host: prometheus-pushgateway.discovery.wmnet
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 202 Accepted
< Date: Tue, 27 Aug 2024 09:45:06 GMT
< Server: Apache
< Content-Length: 0
< 
* Connection #0 to host prometheus-pushgateway.discovery.wmnet left intact

The request is accepted (202), but metrics are not removed.

I added metrics deletion as a success criterion for closing this ticket. KYP.

I don't know why matching on the job label alone did not work, but a more extensive match (all labels except instance, which is empty) did the trick:

$ curl  -v -XDELETE http://prometheus-pushgateway.discovery.wmnet/metrics/job/webrequest_frontend_rc0/kafka_partition/0/kafka_topic/webrequest_upload_test/reporter_type/EVENT
$ curl  -v -XDELETE http://prometheus-pushgateway.discovery.wmnet/metrics/job/webrequest_frontend_rc0/kafka_partition/0/kafka_topic/webrequest_text_test/reporter_type/EVENT
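
A possible explanation, based on the Pushgateway documentation rather than anything confirmed in this task: DELETE addresses exactly one metric group, identified by its full grouping key, so /metrics/job/webrequest_frontend_rc0 only targets a group whose grouping key is the job label alone. That request still returns 202 when no such group exists. The sketch below builds the per-group URL from label/value pairs and only prints the curl commands. The helper name group_url is hypothetical, and it assumes label values contain no "/" (Pushgateway needs a base64-encoded form for those).

```shell
#!/bin/sh
# Hypothetical helper: build the Pushgateway URL for ONE metric group.
# Usage: group_url BASE_URL JOB [LABEL VALUE]...
# Assumes label values contain no "/" characters.
group_url() {
    url="$1/metrics/job/$2"
    shift 2
    # Append each remaining LABEL VALUE pair as /LABEL/VALUE.
    while [ "$#" -ge 2 ]; do
        url="$url/$1/$2"
        shift 2
    done
    printf '%s\n' "$url"
}

# Print (do not execute) one DELETE per group seen in this task.
base=http://prometheus-pushgateway.discovery.wmnet
for topic in webrequest_upload_test webrequest_text_test; do
    target="$(group_url "$base" webrequest_frontend_rc0 \
        kafka_partition 0 kafka_topic "$topic" reporter_type EVENT)"
    echo curl -v -XDELETE "$target"
done
```

Printing the commands first makes it possible to review the exact grouping keys before anything is deleted.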

All webrequest_frontend_rc0 metrics have been deleted:

$ curl  http://prometheus-pushgateway.discovery.wmnet/metrics | grep webrequest_frontend_rc0  | wc -l
0