Page MenuHomePhabricator

deployment-fluorine02 does not have logs
Closed, ResolvedPublic

Description

Per https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Overview#Logs , logs for the Beta Cluster are now at deployment-fluorine02 .

But while it has the archive directory, there are no current logs (and the last files in the archive directory are from 2016-09-01).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 26 2016, 11:52 PM
Mattflaschen-WMF triaged this task as High priority.Sep 26 2016, 11:55 PM

Mentioned in SAL (#wikimedia-releng) [2016-09-26T23:58:46Z] <bd808> Started udp2log-mw on deployment-fluorine02 for T146723

bd808 added a subscriber: bd808.Sep 27 2016, 12:04 AM
bd808@deployment-fluorine02:~$ sudo service udp2log-mw status
● udp2log-mw.service
   Loaded: loaded (/etc/init.d/udp2log-mw; static)
   Active: active (exited) since Wed 2016-08-31 19:37:40 UTC; 3 weeks 5 days ago
 Main PID: 26541 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/udp2log-mw.service

Sep 26 19:14:40 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 19:44:38 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 20:14:47 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 20:44:29 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 21:15:19 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 21:45:20 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 22:15:12 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 22:44:57 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 23:14:53 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 23:45:16 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.

I tried sudo service udp2log-mw start but that didn't seem to fix anything. This did however:

bd808@deployment-fluorine02:/srv/mw-log$ sudo service udp2log-mw stop
bd808@deployment-fluorine02:/srv/mw-log$ sudo service udp2log-mw status
● udp2log-mw.service
   Loaded: loaded (/etc/init.d/udp2log-mw; static)
   Active: inactive (dead)

Sep 26 18:45:02 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 19:14:40 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 19:44:38 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 20:14:47 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 20:44:29 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 21:15:19 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 21:45:20 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 22:15:12 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 22:44:57 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 23:14:53 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 23:45:16 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 26 23:57:58 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Sep 27 00:00:45 deployment-fluorine02 systemd[1]: Stopping udp2log-mw.service...
Sep 27 00:00:45 deployment-fluorine02 systemd[1]: Stopped udp2log-mw.service.
Hint: Some lines were ellipsized, use -l to show in full.
bd808@deployment-fluorine02:/srv/mw-log$ sudo service udp2log-mw start
bd808@deployment-fluorine02:/srv/mw-log$ sudo service udp2log-mw status
● udp2log-mw.service
   Loaded: loaded (/etc/init.d/udp2log-mw; static)
   Active: active (exited) since Tue 2016-09-27 00:00:52 UTC; 2s ago
  Process: 16485 ExecStart=/etc/init.d/udp2log-mw start (code=exited, status=0/SUCCESS)
 Main PID: 16485 (code=exited, status=0/SUCCESS)

I don't like the look of the Active: active (exited) systemd status. Maybe the systemd unit needs some work?

This is still having problems. Apparently, udp2log keeps exiting:

mattflaschen@deployment-fluorine02:~$ sudo service udp2log-mw status
● udp2log-mw.service
   Loaded: loaded (/etc/init.d/udp2log-mw; static)
   Active: active (exited) since Wed 2016-11-09 20:41:06 UTC; 5 days ago
  Process: 21810 ExecStart=/etc/init.d/udp2log-mw start (code=exited, status=0/SUCCESS)
 Main PID: 21810 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/udp2log-mw.service

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Even after restarting it still shows "active (exited)", but after the restart the logs do appear.

mattflaschen@deployment-fluorine02:~$ sudo service udp2log-mw status
● udp2log-mw.service
   Loaded: loaded (/etc/init.d/udp2log-mw; static)
   Active: active (exited) since Mon 2016-11-14 21:22:02 UTC; 4s ago
  Process: 24586 ExecStart=/etc/init.d/udp2log-mw start (code=exited, status=0/SUCCESS)
 Main PID: 24586 (code=exited, status=0/SUCCESS)

Nov 14 21:22:02 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
mattflaschen@deployment-fluorine02:~$ ls /srv/mw-log/ -lrth
total 2.1M
drwxr-xr-x 2 udp2log udp2log 132K Nov 13 02:00 archive
-rw-r--r-- 1 udp2log udp2log  568 Nov 14 21:22 CirrusSearch.log
-rw-r--r-- 1 udp2log udp2log 9.2K Nov 14 21:22 runJobs.log
-rw-r--r-- 1 udp2log udp2log  283 Nov 14 21:22 xff.log
-rw-r--r-- 1 udp2log udp2log  796 Nov 14 21:22 dnsblacklist.log
-rw-r--r-- 1 udp2log udp2log  326 Nov 14 21:22 CentralAuth.log
-rw-r--r-- 1 udp2log udp2log 201K Nov 14 21:22 FileOperation.log
-rw-r--r-- 1 udp2log udp2log  11K Nov 14 21:22 CentralAuthVerbose.log
-rw-r--r-- 1 udp2log udp2log 3.1K Nov 14 21:22 MessageCache.log
-rw-r--r-- 1 udp2log udp2log 1.8M Nov 14 21:22 wfDebug.log

@AlexMonk-WMF @Ottomata I see you worked on setting this up as part of T123728: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) . Could you take a look?

From the log file:

/var/log/udp2log/udp2log.log
[2016-11-10 08:51:06 UTC+0000] Packet rate: 0.037 k/s
[2016-11-10 08:56:06 UTC+0000] Packet rate: 0.471 k/s
Opened pipe with factor 1: python /usr/local/bin/demux.py --basedir /srv/mw-log
Opened pipe with factor 1: egrep '^(scap|scholarships|iegreview) ' | /usr/bin/log2udp -h deployment-logstash2.deployment-pr
ep.eqiad.wmflabs -p 8324
[2016-11-14 21:27:02 UTC+0000] Packet rate: 0.330 k/s
[2016-11-14 21:32:02 UTC+0000] Packet rate: 0.119 k/s

Yar, udp2log was notoriously flaky, and I'm not sure if I ever worked with it in Jessie/systemd (maybe I did...?). I know this isn't what you want to hear, but I have to ask. Is there any planned effort to get mw-log off of udp2log? Is it still needed now that logstash/kibana are around? The udp2log-mw instance is the only remaining use of udp2log at the foundation.

bd808 added a comment.Nov 15 2016, 5:10 PM

Yar, udp2log was notoriously flaky, and I'm not sure if I ever worked with it in Jessie/systemd (maybe I did...?).

The systemd port of the startup script is new and apparently still having issues. It seems like systemd doesn't like the way that udp2log forks and loses track of the main process. I haven't really looked into the mechanics at all though. I've been hoping that someone with more systemd experience would take a look at the upstart script and compare that to the systemd unit.

I know this isn't what you want to hear, but I have to ask. Is there any planned effort to get mw-log off of udp2log? Is it still needed now that logstash/kibana are around? The udp2log-mw instance is the only remaining use of udp2log at the foundation.

Our logstash pipeline has not been designed to be a full replacement for udp2log. It's more of a secondary analytics system for log data. Generally udp2log has been solid for the use case of writing debug logs for MediaWiki. In the last 3.5 years I can only think of about 3 times that it did something truly weird and those were short duration issues where it got confused about packet offsets and wrote log data to weird file names for a few minutes. I think it was a good thing that the analytics use cases were moved off to kafka, but for debug logs udp2log should be fine for the foreseeable future assuming we can get someone to figure out the systemd issues.

Yar, udp2log was notoriously flaky, and I'm not sure if I ever worked with it in Jessie/systemd (maybe I did...?). I know this isn't what you want to hear, but I have to ask. Is there any planned effort to get mw-log off of udp2log? Is it still needed now that logstash/kibana are around? The udp2log-mw instance is the only remaining use of udp2log at the foundation.

No, we need it until there is some other mechanism to have the logs written to plain log files on disk.

Reedy added a subscriber: Reedy.Nov 20 2016, 10:01 PM

Puppet is rather busted on this server too... T151169

Mentioned in SAL (#wikimedia-operations) [2016-11-21T12:51:51Z] <hashar> Restarted udp2log-mw service on deployment-fluorine02 . Was not available (ping T146723 T151169)

hashar closed this task as Resolved.Nov 21 2016, 12:53 PM
hashar claimed this task.

udp2log-mw was not spawning for whatever reason and puppet was broken (T151169). I got puppet fixed, then stopped and restarted the service entirely. It is now processing logs just fine and writing them to the disk.

Puppet log showing the service could not start:

Debug: Service[udp2log-mw](provider=debian): Executing 'ps -ef'
Debug: Executing '/bin/systemctl show -pSourcePath udp2log-mw'
Debug: Executing '/usr/sbin/service udp2log-mw start'
Debug: Executing '/bin/systemctl show -pSourcePath udp2log-mw'
Notice: /Stage[main]/Role::Logging::Mediawiki::Udp2log/Udp2log::Instance[mw]/Service[udp2log-mw]/ensure: ensure changed 'stopped' to 'running'
Debug: /Stage[main]/Role::Logging::Mediawiki::Udp2log/Udp2log::Instance[mw]/Service[udp2log-mw]: The container Udp2log::Instance[mw] will propagate my refresh event
Info: /Stage[main]/Role::Logging::Mediawiki::Udp2log/Udp2log::Instance[mw]/Service[udp2log-mw]: Unscheduling refresh on Service[udp2log-mw]
Debug: Udp2log::Instance[mw]: The container Class[Role::Logging::Mediawiki::Udp2log] will propagate my refresh event

Isn't it just going to fail again soon? We can't just keep having puppet restart it, it needs to be stable

I have no idea what went wrong but puppet is fixed (was T151169) and service seems stable:

Puppet find the service running:

root@deployment-fluorine02:~# puppet agent -tv
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-fluorine02.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '1479730525'
Notice: Finished catalog run in 12.14 seconds
root@deployment-fluorine02:~#

Service is all fine:

$ systemctl status udp2log-mw
● udp2log-mw.service
   Loaded: loaded (/etc/init.d/udp2log-mw; static)
   Active: active (exited) since Mon 2016-11-21 12:50:25 UTC; 15min ago
  Process: 16426 ExecStart=/etc/init.d/udp2log-mw start (code=exited, status=0/SUCCESS)
 Main PID: 16426 (code=exited, status=0/SUCCESS)

And service is no more restarted over and over

$ sudo journalctl -u udp2log-mw
-- Logs begin at Fri 2016-11-18 02:21:43 UTC, end at Mon 2016-11-21 13:06:22 UTC. --
Nov 21 12:44:46 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Nov 21 12:46:08 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Nov 21 12:46:42 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Nov 21 12:47:33 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Nov 21 12:48:29 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
Nov 21 12:50:19 deployment-fluorine02 systemd[1]: Stopping udp2log-mw.service...
Nov 21 12:50:19 deployment-fluorine02 systemd[1]: Stopped udp2log-mw.service.
Nov 21 12:50:25 deployment-fluorine02 systemd[1]: Starting udp2log-mw.service...
Nov 21 12:50:25 deployment-fluorine02 systemd[1]: Started udp2log-mw.service.
$ date
Mon Nov 21 13:06:24 UTC 2016

No, we need it until there is some other mechanism to have the logs written to plain log files on disk.

Ops and Fundraising both use kafkatee, a drop in udp2log replacement.

No, we need it until there is some other mechanism to have the logs written to plain log files on disk.

Ops and Fundraising both use kafkatee, a drop in udp2log replacement.

But that hasn't replaced production fluorine, has it?

I've cherry-picked https://gerrit.wikimedia.org/r/#/c/313604/7 which I'm hoping will solve this