
New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard
Closed, Resolved, Public

Description

Hi,

Can you please create (or point me to how to create) the mediawiki.httpd.accesslog topic discussed in the parent task on both kafka-logging clusters, as well as the related ingestion and dashboard?

Expected volume is 10-15k messages per second.

Thanks :)

Event Timeline

Clement_Goubert created this task.
Clement_Goubert edited projects, added serviceops-radar; removed serviceops.

Thank you for reaching out, @Clement_Goubert! Re: topic creation, IIRC it is open (i.e. the topic will be auto-created on first push). As for adding ingestion, the relevant file is modules/profile/manifests/logstash/production.pp, which has examples of how to configure new topics to be picked up by logstash (cc @colewhite for confirmation). Hope that helps!

As noted in the parent task, an important piece of information I forgot when creating this task is the expected volume (10-15k messages per second).
From irc discussion:

I think keeping each partition at a maximum of 2-3k msg/s is probably better, but we'd also need to consider the number of brokers and the downstream consumers. We should have the same number of partitions on each broker (so traffic in/out is spread evenly), and the consumers should be able to leverage the high number of partitions (like having one thread/process for each partition). Logstash should be good without a lot of fine tuning, but Cole will likely have more insights on that pipeline.

I imagine that the partition number can't be set on first push with the open topic creation. It's not a problem since it can be changed dynamically, but we should start with a reasonable default to avoid a costly topic rebalancing in the future.

The recommended volume is apparently ~2-3k messages per second per partition, so we may want 5 partitions (15k / 3k), before considering balance across brokers and consumer spread.
The topic can be created with the number of partitions we want through this command on a kafka node:
kafka topics --create --topic mediawiki.http.accesslog --partitions 5 --replication-factor 3
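For reference, the sizing above can be checked with a quick calculation. This is only a sketch using the figures quoted in this task (10-15k msg/s total, 2-3k msg/s per partition); the helper name `partitions_needed` is made up for illustration and is not part of any Kafka tooling.

```python
def partitions_needed(total_msgs_per_sec: int, per_partition_msgs_per_sec: int) -> int:
    """Smallest partition count that keeps every partition at or below the target rate."""
    # Integer ceiling division, avoids float rounding.
    return -(-total_msgs_per_sec // per_partition_msgs_per_sec)


# Figures from this task: 10-15k msg/s expected, 2-3k msg/s per partition suggested.
for total in (10_000, 15_000):
    for per_partition in (2_000, 3_000):
        count = partitions_needed(total, per_partition)
        print(f"{total} msg/s at {per_partition} msg/s per partition: {count} partitions")
```

Depending on which end of each range is used, the result varies between 4 and 8 partitions; 5 corresponds to the peak volume at the upper per-partition rate.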

@Joe It may be better to create it that way before we start sending messages to it, and rebalance later if needed? In any case, we can wait for @colewhite's opinion.

Clement_Goubert moved this task from Incoming 🐫 to 🌻Mediawiki on the serviceops board.

I'm not a kafka expert, but this seems like a reasonable place to start. Pre-creating the topics is definitely the way to go.

At the beginning, we should configure logstash to consume from the topic and store only 1% of the logs. Once we have data on actual volume and figure out data retention requirements, we can then raise the threshold to meet the need.
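To illustrate what storing only 1% of the logs means in practice, here is a minimal sketch of random sampling. It is not the Logstash configuration used for this topic; the function name `keep_event` and the fixed seed are assumptions made purely for the example.

```python
import random


def keep_event(sample_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sample_rate` of all calls (0.01 keeps about 1%)."""
    return rng.random() < sample_rate


# Simulate 100,000 incoming log events with a fixed seed for repeatability.
rng = random.Random(42)
kept = sum(keep_event(0.01, rng) for _ in range(100_000))
print(f"kept {kept} of 100000 events")

# At the expected peak of 15k msg/s, a 1% sample stores about this many msg/s.
peak_msgs_per_sec = 15_000
sample_percent = 1
stored_msgs_per_sec = peak_msgs_per_sec * sample_percent // 100
print(f"stored rate at peak: {stored_msgs_per_sec} msg/s")
```

Sampling is applied per event, so the stored share converges on 1% regardless of the actual input volume, which is what makes it a safe starting point before the real volume is known.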

I've dug into it a bit, and we have 3 brokers per datacenter for kafka-logging, so for balance's sake I'll create the topic with 6 partitions.
kafka topics --create --topic mediawiki.http.accesslog --partitions 6 --replication-factor 3
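The reasoning for 6 partitions can be sketched as rounding the throughput-based minimum (5) up to the next multiple of the broker count (3). The helper names below are hypothetical, and the even spread of leaders assumes Kafka assigns them evenly across brokers.

```python
def balanced_partitions(min_partitions: int, brokers: int) -> int:
    """Round min_partitions up to the next multiple of the broker count."""
    return -(-min_partitions // brokers) * brokers


def replicas_per_broker(partitions: int, replication_factor: int, brokers: int) -> float:
    """Average number of partition replicas each broker hosts."""
    return partitions * replication_factor / brokers


brokers = 3             # kafka-logging brokers per datacenter, per this task
replication_factor = 3  # as passed to the create command

partitions = balanced_partitions(5, brokers)
print(f"partitions: {partitions}")
print(f"leaders per broker (if evenly assigned): {partitions // brokers}")
print(f"replicas per broker: {replicas_per_broker(partitions, replication_factor, brokers)}")
```

With a replication factor equal to the broker count, every broker holds a replica of every partition, so the balance that matters here is the number of partition leaders per broker.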

eqiad:

cgoubert@kafka-logging1001:~$ kafka topics --create --topic mediawiki.http.accesslog --partitions 6 --replication-factor 3
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/logging-eqiad --create --topic mediawiki.http.accesslog --partitions 6 --replication-factor 3
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "mediawiki.http.accesslog".

codfw:

cgoubert@kafka-logging2001:~$ kafka topics --create --topic mediawiki.http.accesslog --partitions 6 --replication-factor 3
kafka-topics --zookeeper conf2004.codfw.wmnet,conf2005.codfw.wmnet,conf2006.codfw.wmnet/kafka/logging-codfw --create --topic mediawiki.http.accesslog --partitions 6 --replication-factor 3
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "mediawiki.http.accesslog".

Change 867136 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] P:logstash::production: mediawiki-http-accesslog

https://gerrit.wikimedia.org/r/867136

At the beginning, we should configure logstash to consume from the topic and store only 1% of the logs. Once we have data on actual volume and figure out data retention requirements, we can then raise the threshold to meet the need.

I've done my best to add a sensible logstash::input::kafka definition, but couldn't find how to configure logstash to store only 1% of the logs. Can you point me to where that should be set?

Change 867630 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: heavily restrict mediawiki http accesslog during initial onboarding

https://gerrit.wikimedia.org/r/867630

Change 867630 merged by Cwhite:

[operations/puppet@production] logstash: heavily restrict mediawiki http accesslog during initial onboarding

https://gerrit.wikimedia.org/r/867630

Change 867136 merged by Clément Goubert:

[operations/puppet@production] P:logstash::production: mediawiki-http-accesslog

https://gerrit.wikimedia.org/r/867136

Changed kafka topic retention time to 2 days instead of the default 7.

cgoubert@kafka-logging1001:~$ kafka topics --alter --config retention.ms=172800000 --topic mediawiki.http.accesslog
cgoubert@kafka-logging2001:~$ kafka topics --alter --config retention.ms=172800000 --topic mediawiki.http.accesslog
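As a sanity check on the retention.ms value, the conversion from days to milliseconds can be reproduced as follows. The rough backlog estimate assumes the 15k msg/s peak from this task is sustained for the whole window, which is likely an overestimate; the helper names are illustrative only.

```python
from datetime import timedelta


def retention_ms(days: int) -> int:
    """Convert a retention period in days to the milliseconds expected by retention.ms."""
    return timedelta(days=days) // timedelta(milliseconds=1)


def retained_messages(msgs_per_sec: int, days: int) -> int:
    """Upper bound on messages held in the topic if the rate is sustained for `days`."""
    return msgs_per_sec * int(timedelta(days=days).total_seconds())


print(f"2 days: {retention_ms(2)} ms")
print(f"7 days: {retention_ms(7)} ms")
print(f"messages retained over 2 days at 15k msg/s: {retained_messages(15_000, 2)}")
```

At the same rate, the 7-day default would hold 3.5 times as many messages as the 2-day setting.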

Change 880895 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] logstash: Fix typo in mediawiki.httpd.accesslog

https://gerrit.wikimedia.org/r/880895

A typo was made when creating the topics (mediawiki.http.accesslog instead of mediawiki.httpd.accesslog).
The above CR fixes the logstash ingestion side.
As for the kafka topics themselves, the correctly named topic already existed in eqiad, so I set its retention to two days (same as above), but it doesn't exist in codfw. Creating it with the required replication factor is blocked by kafka-logging2002 being behind b2 https://phabricator.wikimedia.org/T327001

Change 880895 merged by Clément Goubert:

[operations/puppet@production] logstash: Fix typo in mediawiki.httpd.accesslog

https://gerrit.wikimedia.org/r/880895

Retention updated for mediawiki.httpd.accesslog in codfw