
Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes
Closed, Resolved · Public

Description

On https://superset.wikimedia.org/superset/dashboard/webrequest-live/, in the filters under "Cache Cluster" I expected to see options for both text and upload, but instead I only see upload. Only upload data is shown in the charts.

(Screenshot: image.png)

The only URI hosts shown are upload.wikimedia.org, maps.wikimedia.org, and similar -- not en.wikipedia.org and friends -- so the text data is missing, not mislabeled as upload.

The data at https://superset.wikimedia.org/superset/dashboard/webrequest-128 is complete -- it's only the live data that's affected.

(I took a rough stab at tagging and subscribers, but not sure where to file this properly -- any redirects are appreciated. <3)

Event Timeline

On March 9th at ~16:00 UTC there was a severe drop in the data ingested by Benthos:

https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?orgId=1&from=1678115242979&to=1678606706972&viewPanel=2

And I see several errors on centrallog1001 (same on centrallog2002):

Mar 12 07:41:41 centrallog1001 benthos@webrequest_live[813]: level=error msg="Kafka message recv error: kafka: error while consuming webrequest_upload/10: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)" @service=benthos label=webrequest_sampled_in path=root.input
Mar 12 07:41:42 centrallog1001 benthos@webrequest_live[813]: level=error msg="Failed to commit offsets: kafka: broker not connected" @service=benthos label=webrequest_sampled_in path=root.input

Around that time https://gerrit.wikimedia.org/r/c/operations/puppet/+/895898/ was merged, which removed centrallog1001 in favor of centrallog1002 (previously added) from the firewall rules for Kafka Jumbo.

I get why 1001's Kafka consumer is not working, but I'm not sure about 2002.

Mentioned in SAL (#wikimedia-operations) [2023-03-12T07:49:10Z] <elukey> stop and mask benthos-webrequest-live on centrallog1001 - T331801

Mentioned in SAL (#wikimedia-operations) [2023-03-12T07:49:57Z] <elukey> restart benthos-webrequest-live on centrallog2002 - T331801

Mentioned in SAL (#wikimedia-operations) [2023-03-12T07:50:56Z] <elukey> restart benthos-webrequest-live on centrallog1002 - T331801

I see some text data in https://w.wiki/6Rzi; I'll recheck in a bit to see if everything is stable.

Something is still off: the traffic volume reported by Turnilo for live vs. batch webrequest data still differs (live is a lot lower). Something clearly happened when centrallog1001 was firewalled off from the Kafka brokers; I suspect it didn't have time to hand off its partition assignments to the consumer group and something got into a weird state on the Kafka side.

Re-added 1001 to Kafka Jumbo's firewall allow list and restarted Benthos on it. The traffic volume increased a lot, but then we went back to the only-upload-data state.

I stopped 1001 again and masked it, so in theory this should have triggered a "regular" consumer group rebalance.

I would try a consumer group offset reset:

kafka consumer-groups --group benthos-webrequest-sampled-live --reset-offsets --to-latest --topic webrequest_text --execute

kafka consumer-groups --group benthos-webrequest-sampled-live --reset-offsets --to-latest --topic webrequest_upload --execute

If it doesn't work we can change the consumer group on the Benthos config to start fresh.
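
For reference, starting fresh with a new consumer group would be a one-line change in the Benthos input config. A minimal sketch of the relevant part, assuming the Sarama-based `kafka` input that webrequest_live currently uses; the broker addresses, topics, and new group name below are illustrative, the real values are templated by puppet:

input:
  kafka:
    addresses:
      - kafka-jumbo1001.eqiad.wmnet:9092
    topics:
      - webrequest_text
      - webrequest_upload
    # Hypothetical new group name: pointing the input at a group with no
    # committed offsets makes the clients start fresh instead of inheriting
    # the possibly-broken assignment.
    consumer_group: benthos-webrequest-sampled-live-new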

Change 897063 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::benthos: change kafka consumer group name for webrequest

https://gerrit.wikimedia.org/r/897063

elukey@kafka-jumbo1001:~$ kafka consumer-groups --describe --group benthos-webrequest-sampled-live
kafka-consumer-groups --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092,kafka-jumbo1007.eqiad.wmnet:9092,kafka-jumbo1008.eqiad.wmnet:9092,kafka-jumbo1009.eqiad.wmnet:9092 --describe --group benthos-webrequest-sampled-live
Note: This will not show information about old Zookeeper-based consumers.
Consumer group 'benthos-webrequest-sampled-live' has no active members.

TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
webrequest_upload 11         246344525678    246344526726    1048            -               -               -
webrequest_upload 10         246334673161    246334673521    360             -               -               -
webrequest_upload 2          246368222778    246368224965    2187            -               -               -
webrequest_text   5          527407973516    527407974561    1045            -               -               -

This view makes some sense: the consumer group is pulling only 4 out of 12 partitions, AFAICS.

Mentioned in SAL (#wikimedia-operations) [2023-03-12T10:47:16Z] <elukey> reset offsets on kafka jumbo for benthos webrequest live (as indicated in https://phabricator.wikimedia.org/T331801#8685569)

Seems better now, from the consumer group's consistency point of view:

elukey@kafka-jumbo1001:~$ kafka consumer-groups --describe --group benthos-webrequest-sampled-live
kafka-consumer-groups --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092,kafka-jumbo1007.eqiad.wmnet:9092,kafka-jumbo1008.eqiad.wmnet:9092,kafka-jumbo1009.eqiad.wmnet:9092 --describe --group benthos-webrequest-sampled-live
Note: This will not show information about old Zookeeper-based consumers.
Consumer group 'benthos-webrequest-sampled-live' has no active members.

TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
webrequest_text   12         527386394506    527387558922    1164416         -               -               -
webrequest_text   23         527382447489    527383611302    1163813         -               -               -
webrequest_upload 16         246340499209    246341026087    526878          -               -               -
webrequest_text   15         527440232605    527441557441    1324836         -               -               -
webrequest_text   2          527460262241    527461548715    1286474         -               -               -
webrequest_upload 7          246335827347    246336372749    545402          -               -               -
webrequest_upload 2          246372268456    246372439239    170783          -               -               -
webrequest_text   20         527405277563    527406603469    1325906         -               -               -
webrequest_text   6          527452667291    527453780778    1113487         -               -               -
webrequest_upload 23         246336885517    246337407815    522298          -               -               -
webrequest_text   5          527417225623    527417649794    424171          -               -               -
webrequest_upload 1          246347329700    246347827279    497579          -               -               -
webrequest_upload 17         246347980427    246348538758    558331          -               -               -
webrequest_upload 8          246350554620    246350556840    2220            -               -               -
webrequest_upload 12         246365590407    246366124735    534328          -               -               -
webrequest_text   22         527388199552    527389611081    1411529         -               -               -
webrequest_text   14         527440846956    527442173271    1326315         -               -               -
webrequest_upload 18         246331270139    246331773084    502945          -               -               -
webrequest_upload 14         246357858943    246358428323    569380          -               -               -
webrequest_upload 11         246349032599    246349033303    704             -               -               -
webrequest_text   19         527401226361    527402325413    1099052         -               -               -
webrequest_upload 13         246377341705    246377942255    600550          -               -               -
webrequest_text   11         527404003520    527404363690    360170          -               -               -
webrequest_upload 9          246338214805    246338215264    459             -               -               -
webrequest_text   7          527403452922    527404732541    1279619         -               -               -
webrequest_upload 20         246335746466    246336245050    498584          -               -               -
webrequest_text   1          527415320549    527416579391    1258842         -               -               -
webrequest_upload 10         246338986308    246338989599    3291            -               -               -
webrequest_text   4          527394338048    527395392246    1054198         -               -               -
webrequest_upload 6          246390977825    246391609726    631901          -               -               -
webrequest_upload 22         246334805935    246335346620    540685          -               -               -
webrequest_text   16         527355499226    527356677556    1178330         -               -               -
webrequest_text   8          527414405490    527414678498    273008          -               -               -
webrequest_upload 3          246336161293    246336654790    493497          -               -               -
webrequest_upload 19         246345256821    246345812100    555279          -               -               -
webrequest_text   17         527427333933    527428444027    1110094         -               -               -
webrequest_text   9          527406276074    527407426775    1150701         -               -               -
webrequest_text   10         527369434852    527369514038    79186           -               -               -
webrequest_text   13         527448034405    527449203573    1169168         -               -               -
webrequest_upload 5          246342427743    246342976732    548989          -               -               -
webrequest_text   0          527415127372    527416519023    1391651         -               -               -
webrequest_upload 21         246333588922    246334106011    517089          -               -               -
webrequest_text   18         527366329858    527367438231    1108373         -               -               -
webrequest_text   21         527401435978    527402695090    1259112         -               -               -
webrequest_upload 15         246359919713    246360455021    535308          -               -               -
webrequest_text   3          527419928435    527420957486    1029051         -               -               -
webrequest_upload 4          246347155966    246347728298    572332          -               -               -
webrequest_upload 0          246358366955    246358923523    556568          -               -               -

The traffic handled by Benthos is now around 1/3 of the original (improved, but not really OK). I don't see clear indications that Benthos itself is struggling, since it now runs on better hardware and its config didn't really change.

I'd be inclined to change the consumer group now and see if anything changes with a fresh one.

Tried stopping both consumers (the Benthos systemd units) on centrallog1002 and 2002, resetting the offsets again, and starting the consumers back up.

The weird thing is that I keep seeing zero active consumers:

elukey@kafka-jumbo1001:~$ kafka consumer-groups --describe --group benthos-webrequest-sampled-live --state
kafka-consumer-groups --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092,kafka-jumbo1007.eqiad.wmnet:9092,kafka-jumbo1008.eqiad.wmnet:9092,kafka-jumbo1009.eqiad.wmnet:9092 --describe --group benthos-webrequest-sampled-live --state
Note: This will not show information about old Zookeeper-based consumers.
Consumer group 'benthos-webrequest-sampled-live' has no active members.

COORDINATOR (ID)                        ASSIGNMENT-STRATEGY       STATE                #MEMBERS
kafka-jumbo1004.eqiad.wmnet:9092 (1004)                           Empty                0

I tried to delete the consumer group, but it requires some ACLs in Kafka; I'll see if I manage to do it later on.

Tried stopping all the consumers on the centrallog nodes, deleting the consumer group, and restarting everything. Traffic changed but dropped back to the previous values: still only a third of the events processed.

Change 897063 merged by Elukey:

[operations/puppet@production] profile::benthos: change kafka consumer group name for webrequest

https://gerrit.wikimedia.org/r/897063

To keep the archives happy: in order to be able to delete the consumer group I had to add the following ACL:

kafka acls --add --allow-principal User:* --operation delete --group benthos-webrequest-sampled-live

I removed it later on.

Change 897916 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] centrallog: restore benthos consumer group and read all partitions

https://gerrit.wikimedia.org/r/897916

Filippo and I tried a ton of workarounds and solutions today, but none of them really worked. In the end we removed the restriction to the first 12 partitions of each webrequest topic (a limit we introduced a while ago to reduce bandwidth usage) and started seeing different behavior from Benthos:

COORDINATOR (ID)                        ASSIGNMENT-STRATEGY       STATE                #MEMBERS
kafka-jumbo1008.eqiad.wmnet:9092 (1008) range                     Stable               2

It makes zero sense: up to the switch of centrallog nodes everything worked fine, whereas now we are seeing issues. We'd like to proceed as follows:

  1. Let Benthos pull from all webrequest partitions (downside: increased bandwidth usage on the centrallog nodes; see the config sketch after this list).
  2. Increase the sampling to match the new number of partitions.
  3. Open a GitHub issue upstream.
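
For context, a restriction like the one mentioned in point 1 is presumably expressed with the explicit-partition syntax of the Benthos `kafka` input (topic:partition-range), and pulling from all partitions just means going back to plain topic names. A rough sketch under that assumption; broker addresses and group name are illustrative, the real config is puppet-managed:

input:
  kafka:
    addresses:
      - kafka-jumbo1001.eqiad.wmnet:9092
    # Before (assumed): only the first 12 partitions per topic, to limit
    # bandwidth usage on the centrallog nodes.
    # topics:
    #   - webrequest_text:0-11
    #   - webrequest_upload:0-11
    # After: plain topic names, so the consumer group is free to balance all
    # partitions of both topics across the centrallog instances.
    topics:
      - webrequest_text
      - webrequest_upload
    consumer_group: benthos-webrequest-sampled-live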

Change 897916 merged by Filippo Giunchedi:

[operations/puppet@production] centrallog: restore benthos consumer group and read all partitions

https://gerrit.wikimedia.org/r/897916

All right, the upstream issue has been resolved!

Next steps:

  1. Upgrade our Benthos Debian package to the new 4.15.0 upstream version (see the kafka_franz changes in https://github.com/benthosdev/benthos/releases/tag/v4.15.0).
  2. Switch our config to the kafka_franz input (a quick test is needed to verify that everything works; a rough sketch follows this list).
  3. Reduce the number of partitions that we pull from, adjusting the sampling ratio accordingly.
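
To give a rough idea of step 2, here is a minimal sketch of a kafka_franz input roughly equivalent to the current Sarama-based one, going by the upstream input documentation; broker addresses and group name are illustrative:

input:
  kafka_franz:
    # kafka_franz is backed by the franz-go client; field names differ
    # slightly from the Sarama-based `kafka` input.
    seed_brokers:
      - kafka-jumbo1001.eqiad.wmnet:9092
    topics:
      - webrequest_text
      - webrequest_upload
    consumer_group: benthos-webrequest-sampled-live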

Mentioned in SAL (#wikimedia-operations) [2023-05-11T13:21:59Z] <elukey> upload benthos 4.15.0-1 to {buster,bullseye}-wikimedia - T331801

Change 919064 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] benthos: use kafka_franz for the webrequest_live instance

https://gerrit.wikimedia.org/r/919064

Mentioned in SAL (#wikimedia-operations) [2023-05-11T13:57:08Z] <elukey> upgrade benthos (4.9.1 -> 4.15.0) on centrallog nodes - T331801

Change 919064 merged by Elukey:

[operations/puppet@production] benthos: use kafka_franz for the webrequest_live instance

https://gerrit.wikimedia.org/r/919064

Change 919077 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] benthos::instance: add --skip-env-var-check to lint

https://gerrit.wikimedia.org/r/919077

Change 919077 merged by Elukey:

[operations/puppet@production] benthos::instance: add --skip-env-var-check to lint

https://gerrit.wikimedia.org/r/919077

Change 919158 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] benthos: change kafka consumer group name for webrequest_live

https://gerrit.wikimedia.org/r/919158

Change 919158 merged by Elukey:

[operations/puppet@production] benthos: change kafka consumer group name for webrequest_live

https://gerrit.wikimedia.org/r/919158

Mentioned in SAL (#wikimedia-operations) [2023-05-11T16:16:39Z] <elukey> benthos webrequest live instances migrated to kafka-franz (new consumer client, data may have some holes) - T331801

Had to change the consumer group name, since Sarama and franz-go (the Go Kafka clients behind the old and new inputs) don't play well together in the same consumer group.

Data may have some holes for today, but it should fix itself once the 24-hour window rolls over.

Next step:

  • reduce the number of partitions that we pull data from and adjust the sampling

Thanks a lot! We got a small hole for text and almost nothing for upload AFAICT:

(Screenshot: Screenshot 2023-05-11 at 18.20.50.png)

Change 919268 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::syslog::centralserver: tune benthos config

https://gerrit.wikimedia.org/r/919268

Change 919268 abandoned by Elukey:

[operations/puppet@production] role::syslog::centralserver: tune benthos config

Reason:

https://gerrit.wikimedia.org/r/919268

elukey claimed this task.

From https://github.com/benthosdev/benthos/issues/1806 it seems that we cannot really use a consumer group and select specific partitions with the Benthos Kafka inputs (and it's not clear whether that is possible in general). Consumer groups are great for us since multiple clients split the consumption of the topic partitions, and if one goes down the other can take over (our clients being the Benthos instances on the centrallog nodes). In our case we consume a ton of input data and discard most of it, so it is a waste of bandwidth, but I think it is good enough to keep using a consumer group and rely on its nice failover capabilities. If we want to reduce the input bandwidth that Benthos consumes we can definitely work on it, but IMHO it would need a solid use case, since the work is not trivial.