Correctly collect logs from php-fpm pools
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Joe
	Dec 5 2018, 8:59 AM

Description

We decided to go the same way as we went with HHVM, thus relaying the logs to syslog to then deliver wherever desired.

To this aim, we added error_log = syslog and syslog.ident = php7.2-fpm to the configuration, and the corresponding rsyslog rules.

We found out, however, that whatever gets logged from the fpm pools ends up having an identifier of ool $pool_name.

This is an unfortunate bug that has been reported multiple times. As a consequence, the broken programname containing a space is invalid as far as the syslog format is concerned, and rsyslog correctly refuses to parse it. So the programname is always "ool" as far as rsyslog is concerned

There are several ways to fix this, but none is exactly attactive:

Just add rules to catch "ool" as a programname. The log line will still be badly broken, and will be badly parsed once imported in logstash[1]. Also, we will have to repeat all the processing rules twice.
Add an openlog() function in the auto_prepend_file. That would set the correct ident for all requests. We might need to closelog() too, though.
Add catch_workers_output = yes to the pool configuration, with some clever setting of error_log to send errors to stderr might do what we want (let journald collect the log lines from stderr, mark them with the correct programname (the unit name), and relay them to syslog. Sadly, catch_workers_output is known to cause a severe performance degradation at high concurrency, so it's not an option.
Patch php-fpm so that when it sets the programname in syslog it does something sensible like setting it to php-pool-$pool_name
Configure php-fpm to send the logs to a specialized syslog facility and filter for it. This will reduce duplication of rules but still possibly have the broken parsing issue.

As it stands, I would prefer to use the openlog() approach, which seems the cleanest way to handle this.

[1] the message registered in logstash becomes ool www[12345]: JOE TEST instead of JOE TEST

Details

	Subject	Repo	Branch	Lines +/-
	Hotfix for logging in php-fpm	operations/mediawiki-config	master	+13 -0

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		None	T176370 Migrate to PHP 7 in WMF production
		Resolved		Joe	T211184 Correctly collect logs from php-fpm pools

Event Timeline

Joe triaged this task as Medium priority.Dec 5 2018, 8:59 AM

Joe created this task.

Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptDec 5 2018, 8:59 AM

Mentioned in SAL (#wikimedia-operations) [2018-12-05T09:00:10Z] <_joe_> disabed puppet on mw1261, used for logging tests for T211184

Joe claimed this task.Dec 5 2018, 9:02 AM

Joe mentioned this in T210567: Create a way to intentionally trigger fatal errors in MediaWiki.Dec 5 2018, 10:29 AM

I took a quick look at this as well and indeed openlog() seems the simplest way. Also because altering programname in rsyslog isn't allowed, thus to fix this on the rsyslog side we'd have to use a different template for example, not really worth it IMO. Plus the bug is supposedly fixed in php 7.3 anyways.

FTR I tried strace on rsyslog upon receiving a test message from error_log:

[pid 32306] <... poll resumed> )        = 1 ([{fd=3, revents=POLLIN}])
[pid 32306] recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="<29>Dec  5 14:00:25 ool www[27112]: JOE TEST PLEASE DISREGARD", iov_len=8096}], msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */}, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=27112, uid=33, gid=33}}], msg_controllen=64, msg_flags=0}, MSG_DONTWAIT) = 61
[pid 32306] futex(0x55993f3d9f1c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x55993f3d9f18, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
[pid 32309] <... futex resumed> )       = 0
[pid 32306] poll([{fd=3, events=POLLIN}], 1, -1 <unfinished ...>
[pid 32309] futex(0x55993f3bd8d0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 32309] write(15, "Dec  5 14:00:25 mw1261 ool www[27112]: JOE TEST PLEASE DISREGARD\n", 65) = 65
[pid 32309] recvfrom(8, 0x7f90a97b27a7, 1, MSG_PEEK|MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 32309] sendmsg(8, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\27\3\3\0n\0\0\0\0\0\0\0.\301\"\3\206\20'\26O\31)f\201_\221\221\333r\304Uq\6\277\347L\r\34\307\260b\324Oe-\250\205#\333\305~G\336\355)P`N\307L\252 Y\246\2244\334\2^\244\355:\263\236\367\36\213\333{\372\207B\205\214\220\6\313\205\225p\22L\261\331\344\22`\336\251*bC<\33\36\346\270Y\217\26\234\376\31\231", iov_len=115}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 115
[pid 32309] recvfrom(9, 0x7f90a97b27a7, 1, MSG_PEEK|MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 32309] sendmsg(9, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\27\3\3\0n\0\0\0\0\0\0\0.\311\256)\34K\364\33\223T\302\220jhY4\\t\304\37413\34?gI\371F\211\230x\201\3351\203<\340\3{\204$\313\3174P\1\326lb\0016\371B~%r,T\362Z\241\232\316u-\257\316\305ga\264\216\372n\233\210\n\232\3304\373\335\365qN\tu\341a\36\1\310\366\230\331\307\202\204\360\302\304\317x", iov_len=115}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 115
[pid 32309] sendto(13, "ool Dec  5 14:00:25 mw1261:  www[27112]: JOE TEST PLEASE DISREGARD\n", 67, 0, {sa_family=AF_INET, sin_port=htons(8420), sin_addr=inet_addr("10.64.32.175")}, 16) = 67
[pid 32309] sendto(14, "<29>Dec  5 14:00:25 mw1261 ool www[27112]: JOE TEST PLEASE DISREGARD", 68, 0, {sa_family=AF_INET, sin_port=htons(10514), sin_addr=inet_addr("10.2.2.36")}, 16) = 68
[pid 32309] futex(0x55993f3d9f1c, FUTEX_WAIT_PRIVATE, 329, NULL

journald sees the messages too of course but is able to associate them with their originating unit:

# journalctl -u php7.2-fpm.service -n 3
-- Logs begin at Sat 2018-12-01 16:45:57 UTC, end at Wed 2018-12-05 14:05:54 UTC. --
Dec 05 13:59:45 mw1261 php-fpm7.2[27110]: ool www[27110]: JOE TEST PLEASE DISREGARD
Dec 05 14:00:16 mw1261 php-fpm7.2[27107]: ool www[27107]: JOE TEST PLEASE DISREGARD
Dec 05 14:00:25 mw1261 php-fpm7.2[27112]: ool www[27112]: JOE TEST PLEASE DISREGARD

Joe removed a project: TechCom-RFC (TechCom-RFC-Closed).Dec 6 2018, 6:33 AM

Joe added a project: User-Joe.Dec 6 2018, 12:31 PM

Change 478021 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Hotfix for logging in php-fpm

https://gerrit.wikimedia.org/r/478021

gerritbot added a project: Patch-For-Review.Dec 6 2018, 3:07 PM

Krinkle added a project: Performance-Team (Radar).Dec 8 2018, 10:32 PM

Change 478021 merged by jenkins-bot:
[operations/mediawiki-config@master] Hotfix for logging in php-fpm

https://gerrit.wikimedia.org/r/478021

Joe closed this task as Resolved.Jan 2 2019, 11:24 AM