
Logging options for apache httpd in k8s
Closed, Resolved · Public

Description

Let's start with the requirements:

  • We need to be able to tail/grep/search these logs with ease, for debugging purposes. Logstash is ok only if it can be fast and reliable (any lag would kill our ability to debug things while they happen)
  • We need to be able to run mtail on those logs
  • We produce, daily, around 330 GB of API access logs and 190 GB of website access logs, over all of our traffic

Solution 1

We create a directory on the k8s node that is mounted as a hostPath volume in all apache containers, and we make apache write its logs there, with a filename depending on the pod name.
Mtail can then run as a DaemonSet parsing those logs.
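
As a minimal sketch of the mount side of this solution (all names and paths here are illustrative, not from the task):

```yaml
# Pod-side hostPath mount: apache writes into a node-local directory
# shared by all apache pods on that node.
apiVersion: v1
kind: Pod
metadata:
  name: mediawiki-example
spec:
  containers:
    - name: apache
      image: httpd:2.4
      volumeMounts:
        - name: apache-logs
          mountPath: /var/log/apache2   # apache's log dir inside the container
  volumes:
    - name: apache-logs
      hostPath:
        path: /srv/apache-logs          # shared log directory on the k8s node
        type: DirectoryOrCreate
```

The mtail DaemonSet would mount the same hostPath read-only and glob the per-pod filenames.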

Open issues:

  • Creates i/o on the k8s hosts
  • Log rotation would be somewhat challenging

Solution 2

We make apache send its logs to logstash and central log server(s) by logging to a piped command.

In this case, logs would not be persisted on the individual server, but sent out to a central syslog server where they'd be stored for N days (as noted above, we produce ~500 GB of uncompressed logs per day, which compress to ~100 GB/day). Data could also be sent to logstash (at least sampled) for further, easier analysis. Mtail would have to run on the central log server.
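
Piped logging in apache looks like the following sketch (the forwarder command and target host are placeholders, not something decided in this task):

```apache
# Instead of a filename, CustomLog can take a "|command": apache spawns the
# command at startup and writes each log line to its stdin.
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "|/usr/local/bin/log-forwarder --target central-syslog.example:514" common
```

The piped process is what makes this solution performance-sensitive: apache blocks if the pipe's buffer fills, which is why the comment below about needing a good logger applies.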

Open issues:

  • might lose logs if the central server is not HA
  • needs piped logging, which is not always great for performance, and would probably require us to implement a better logger than apache's own.
  • log retention

Solution 3

A node-level daemon that just waits for log messages on a unix socket, with that socket being bind-mounted into the pods and apache writing to it. That component can sample if wanted and route where needed. Interestingly, we already have a component that can do routing and throttling with message dropping: fluent-bit.

The advantage would be that we'd get the best of both worlds, at the cost of a slightly more complex setup.
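
A rough sketch of what such a fluent-bit node daemon could look like (socket path, rates, and the upstream host are all illustrative assumptions):

```ini
# fluent-bit listening on a node-local unix socket that is bind-mounted
# into the apache pods; it can throttle/sample and route onwards.
[INPUT]
    Name    syslog
    Mode    unix_udp
    Path    /run/apache-logs/fluent-bit.sock

[FILTER]
    Name    throttle
    Match   *
    Rate    1000       # max messages per Window; excess is dropped
    Window  5

[OUTPUT]
    Name    forward
    Match   *
    Host    central-log.example
    Port    24224
```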

Event Timeline

Joe triaged this task as High priority.Oct 19 2020, 8:50 AM
Joe created this task.
Joe added a project: observability.

An additional data point that was requested: we would be sending ~10-15k messages per second to the central log server, depending on traffic.

A couple of points:

We create a directory on the k8s node that is mounted as a hostPath volume in all apache containers, and we make apache write its logs there, with a filename depending on the pod name.

Another con is that unless we can rely on an env var, we might have to change apache's on-disk configuration per pod, which means every config will become a snowflake as far as hash-based comparison (md5 and the like) goes. It will also require that we effectively use mutable images (and we might want to avoid that).

Mtail can then run as a DaemonSet parsing those logs.

Or just via puppet, at least in the beginning, which should make the migration easier.

Log rotation would be somewhat challenging

Yes, quite a bit, given that the node will have to inform all the apaches in all the pods that their logs need to be rotated.

  • Let me add that we'll probably need to repartition the k8s nodes for Solution 1, as the bulk of the disk is given to containers and not to the host fs or any dedicated log fs.

Creates i/o on the k8s hosts

Indeed, but it might or might not be an issue. We'll need some numbers on that. Intuitively I don't think it would become a problem, but it makes sense to keep an eye on it.

The interesting question is how much of that will be a problem in Solution 2, as centralization is only going to exacerbate it.

There is also a solution 3 that is a hybrid of solutions 1 and 2. A node-level daemon that just waits for log messages on a unix socket, with that socket being bind-mounted into the pods and apache writing to it. That component can sample if wanted and route where needed. Interestingly, we already have a component that can do routing and throttling with message dropping: fluent-bit.

Just dropping a quick update here, we should schedule some time to review options. Had a brief exchange with @akosiaris and we'll get the team together for a discussion on proposed paths and collaboration.

@lmata we really need to set up a meeting to tackle the questions here and in T271822 pretty soon; we're at the point where not figuring out this stuff will harm our schedule on the mediawiki on kubernetes project. If observability has already discussed the options here, we're glad to review them beforehand.

noted @Joe! I'll reach out to you to coordinate a time to talk with the team.

At the meeting we decided it's ok to let apache log to kafka as the main method of collection. We will therefore, at least in a first iteration:

  • Log to /dev/stdout from apache, in json format
  • The container runtime will save such logs on disk
  • rsyslog will pick them up and send them to kafka
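
The apache side of the first step could look like the following sketch (the field list here is illustrative; the actual production format is ECS):

```apache
# JSON-formatted access log written to the container's stdout, where the
# container runtime persists it for rsyslog to pick up and ship to kafka.
LogFormat "{\"timestamp\":\"%{%Y-%m-%dT%H:%M:%S}t\",\"client\":\"%a\",\"method\":\"%m\",\"url\":\"%U\",\"status\":%>s,\"bytes\":%B}" json
CustomLog /dev/stdout json
```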

Once we start having some traffic, we might want to switch to the following:

  • Have the CustomLog directive pipe to a process that will produce the messages to kafka
  • Potentially pick separate topics for the various clusters, and the canary deployments too
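
The first of those two points could be as simple as the sketch below (kcat is shown as an example producer, and the broker name is a placeholder; the topic name is the one proposed later in this task):

```apache
# Pipe access logs straight to a kafka producer instead of going through
# stdout + rsyslog; kcat -P produces one message per input line.
CustomLog "|/usr/bin/kcat -P -b kafka-logging1001:9092 -t mediawiki.httpd.accesslog" json
```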

We might send only a sample of the messages to the ELK stack. We will need to find ways to process these logs via mtail to keep producing the metrics we want. That might happen on the central log server for the time being, and will probably require finding a way to feed the logs from kafka to mtail. I'll open a subtask for that.

JMeybohm lowered the priority of this task from High to Medium.Mar 3 2021, 8:07 AM

Lowering priority to medium as per discussion with @Joe

Joe removed Joe as the assignee of this task.Jun 28 2021, 9:43 AM
Joe moved this task from Blocked to Backlog on the MW-on-K8s board.

Change 864547 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/docker-images/production-images@master] httpd-fcgi: allow logging ECS to a local rsyslog

https://gerrit.wikimedia.org/r/864547

Change 864548 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mediawiki: allow rsyslog to process the apache logs

https://gerrit.wikimedia.org/r/864548

The two attached patches implement Solution 3.

Now we just need to create the appropriate topic, named mediawiki.httpd.accesslog, on both kafka-logging clusters. I'd keep the number of partitions relatively high given the traffic we expect once at steady state.
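
For reference, a sketch of the topic creation (broker address, partition and replication counts are placeholders to be tuned for the expected 10-15k msg/s):

```shell
# Create the access-log topic on a kafka-logging cluster.
kafka-topics.sh --create \
  --bootstrap-server kafka-logging1001:9092 \
  --topic mediawiki.httpd.accesslog \
  --partitions 12 \
  --replication-factor 3
```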

Things left to do:

  • Create the kafka topic
  • Test everything works in production
  • Use benthos to replace mtail
  • Set up a sampled logstash ingestion and dashboard.

Change 864547 merged by Giuseppe Lavagetto:

[operations/docker-images/production-images@master] httpd-fcgi: allow logging ECS to a local rsyslog

https://gerrit.wikimedia.org/r/864547

Kafka and logstash ingestion points configured.

Change 864548 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: allow rsyslog to process the apache logs

https://gerrit.wikimedia.org/r/864548

We now have the logs in kafka; they should thus also be ingested in logstash, and we should create a dashboard.

Once that's done, we should also reduce the retention time of the kafka topic to 1 day at most.

If we did T291645: Integrate Event Platform and ECS logs and T276972: Set up cross DC topic mirroring for Kafka logging clusters, these logs could be mirrored to Kafka jumbo and available in Hive and Turnilo too.

While that is nice in general, I don't think there's great use for these logs in Hive for analytics purposes right now. It's great to know we'll have the option in the future.

We now have the logs in kafka; they should thus also be ingested in logstash, and we should create a dashboard.

Once that's done, we should also reduce the retention time of the kafka topic to 1 day at most.

As per https://phabricator.wikimedia.org/T324439#8513139 retention is now set to 2 days. The logs are ingested by logstash with a drop rate of 99%.
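
For reference, the 2-day retention would be set with something like the following sketch (broker address is a placeholder):

```shell
# Set topic-level retention to 2 days (172800000 ms).
kafka-configs.sh --alter \
  --bootstrap-server kafka-logging1001:9092 \
  --entity-type topics --entity-name mediawiki.httpd.accesslog \
  --add-config retention.ms=172800000
```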