replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie)
Closed, Resolved · Public

Description

Still on precise; upgrade to trusty or jessie (not sure how closely related it is to the trusty-based app servers).

Next steps:

  • send udp2log traffic from mwlog1001 to fluorine
  • backfill mwlog1001 with fluorine's logs
  • authorize mwlog[12]001 where fluorine is also authorized: statistics and dump
  • move scholarships udp2log traffic from fluorine to mwlog1001
  • move mw udp2log traffic from fluorine to mwlog1001
  • move scap udp2log traffic from fluorine to mwlog1001
  • move xenon redis traffic from fluorine to mwlog1001
  • point performance.w.o/xenon http traffic to mwlog1001
  • switch udplog CNAME to mwlog1001 and roll-restart rsyslog
  • announce new hostname for fluorine

Change 313604 merged by Filippo Giunchedi:
udp2log: move to service_unit and systemd

https://gerrit.wikimedia.org/r/313604

fgiunchedi added a comment. (Edited) Jan 19 2017, 3:00 PM

mwlog[12]001 have been provisioned with jessie and are up and running. udp2log-mw now runs as a systemd unit, and so does xenon-log.

Change 333235 had a related patch set uploaded (by Filippo Giunchedi):
scholarships: move udp2log to mwlog1001

https://gerrit.wikimedia.org/r/333235

Change 333235 merged by Filippo Giunchedi:
scholarships: move udp2log to mwlog1001

https://gerrit.wikimedia.org/r/333235

hashar removed a subscriber: hashar. Jan 30 2017, 11:23 AM
fgiunchedi updated the task description. Jan 30 2017, 5:25 PM

For redundancy purposes it would be nice if MediaWiki could send udp2log traffic to udp2log receivers in both datacenters. I don't know whether MediaWiki is already able to do that via its logging configuration; @bd808, you might know how/if we can do that? Thanks!

Change 335623 had a related patch set uploaded (by Filippo Giunchedi):
Allow mwlog[12]001 on datasets/dumps/eventlog/logstash

https://gerrit.wikimedia.org/r/335623

Change 335624 had a related patch set uploaded (by Filippo Giunchedi):
scap: move udp2log from fluorine to mwlog1001

https://gerrit.wikimedia.org/r/335624

Change 335625 had a related patch set uploaded (by Filippo Giunchedi):
udp2log: mirror traffic to mwlog1001

https://gerrit.wikimedia.org/r/335625

bd808 added a comment. Feb 2 2017, 4:33 PM

For redundancy purposes it would be nice if MediaWiki could send udp2log traffic to udp2log receivers in both datacenters. I don't know whether MediaWiki is already able to do that via its logging configuration; @bd808, you might know how/if we can do that? Thanks!

The config changes needed would be done in wmf-config/logging.php. Every MediaWiki\Logger\Monolog\LegacyHandler object in that file is a path to sending log event data to fluorine. Monolog has a Monolog\Handler\GroupHandler class that could be used to replace each one with a GroupHandler containing two MediaWiki\Logger\Monolog\LegacyHandler objects, one that sends to fluorine and another that sends to mwlog1001.

The $wmgMonologHandlers['wgDebugLogFile'] handler is a special case that would either need to be treated explicitly or ignored. Generally it points to /dev/null, but on testwiki and test2wiki or when special request settings are present it gets pointed to distinct local or UDP log sinks. It would probably be easiest to just ignore all of this complexity and let it go to wherever $wmfUdp2logDest is pointing. That could be either of the two log destinations.

The other way that all of this could be handled would be to set up udp2log on mwlog1001 to relay everything it sees back to fluorine and then just switch $wmfUdp2logDest to point to mwlog1001. That would make the MediaWiki config stable and let techops control things on fluorine via udp2log configuration. This is basically the opposite of the configuration being applied in https://gerrit.wikimedia.org/r/335625. The benefit I see of this is that auditing can be done on mwlog1001 to know when all the things have been switched. If a log is on fluorine that is not on mwlog1001, then you have found something else that needs its configuration to be changed.
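The fan-out strategy can be sketched in miniature. Python is used here purely for illustration (the real change would be PHP in wmf-config/logging.php), and the class shapes and destination strings are illustrative assumptions, not the actual Monolog code:

```python
# Conceptual sketch of the GroupHandler fan-out: one "group" handler
# forwards every log record to each wrapped handler, so the same event
# can be shipped to both fluorine and mwlog1001.

class GroupHandler:
    """Dispatches each record to every wrapped handler, like Monolog's GroupHandler."""
    def __init__(self, handlers):
        self.handlers = list(handlers)

    def handle(self, record):
        for handler in self.handlers:
            handler.handle(record)

class ListHandler:
    """Stand-in for a LegacyHandler pointed at one udp2log destination."""
    def __init__(self, destination):
        self.destination = destination
        self.records = []

    def handle(self, record):
        self.records.append(record)

# One handler per destination; hostnames/ports are illustrative only.
fluorine = ListHandler("fluorine:8420")
mwlog = ListHandler("mwlog1001:8420")
group = GroupHandler([fluorine, mwlog])
group.handle("wmf.slow-parse: example event")
```

The point of the sketch: each existing LegacyHandler in logging.php would be replaced by a group wrapping two of them, so MediaWiki itself duplicates the traffic rather than any relay.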

Change 335623 merged by Filippo Giunchedi:
Allow mwlog[12]001 on datasets/dumps/eventlog/logstash

https://gerrit.wikimedia.org/r/335623

fgiunchedi updated the task description. Feb 13 2017, 2:58 PM

For redundancy purposes it would be nice if MediaWiki could send udp2log traffic to udp2log receivers in both datacenters. I don't know whether MediaWiki is already able to do that via its logging configuration; @bd808, you might know how/if we can do that? Thanks!

The config changes needed would be done in wmf-config/logging.php. Every MediaWiki\Logger\Monolog\LegacyHandler object in that file is a path to sending log event data to fluorine. Monolog has a Monolog\Handler\GroupHandler class that could be used to replace each one with a GroupHandler containing two MediaWiki\Logger\Monolog\LegacyHandler objects, one that sends to fluorine and another that sends to mwlog1001.

The $wmgMonologHandlers['wgDebugLogFile'] handler is a special case that would either need to be treated explicitly or ignored. Generally it points to /dev/null, but on testwiki and test2wiki or when special request settings are present it gets pointed to distinct local or UDP log sinks. It would probably be easiest to just ignore all of this complexity and let it go to wherever $wmfUdp2logDest is pointing. That could be either of the two log destinations.

The other way that all of this could be handled would be to set up udp2log on mwlog1001 to relay everything it sees back to fluorine and then just switch $wmfUdp2logDest to point to mwlog1001. That would make the MediaWiki config stable and let techops control things on fluorine via udp2log configuration. This is basically the opposite of the configuration being applied in https://gerrit.wikimedia.org/r/335625. The benefit I see of this is that auditing can be done on mwlog1001 to know when all the things have been switched. If a log is on fluorine that is not on mwlog1001, then you have found something else that needs its configuration to be changed.

Thanks a lot @bd808 for the explanation! I'll take a stab at the GroupHandler strategy first since that's the configuration I'd like to use for sending logs to both datacenters when mwlog1001 / mwlog2001 are fully in service. Though also mirroring things back to fluorine sounds like a viable option!

Thanks a lot @bd808 for the explanation! I'll take a stab at the GroupHandler strategy first since that's the configuration I'd like to use for sending logs to both datacenters when mwlog1001 / mwlog2001 are fully in service. Though also mirroring things back to fluorine sounds like a viable option!

Or not... I've adjusted https://gerrit.wikimedia.org/r/#/c/335625/2 to actually mirror mwlog1001 -> fluorine, as it is simpler for now. I will adjust mw-config to use mwlog1001.
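The mirroring approach amounts to a small UDP relay: whatever arrives at mwlog1001's udp2log port also gets re-sent, byte for byte, to fluorine. A minimal sketch of that idea (assumed behavior, not code taken from the actual patch or from udpmirror.py):

```python
import socket

def mirror_once(recv_sock, send_sock, mirror_addrs):
    """Receive one datagram and re-send an identical copy to each mirror address."""
    data, _ = recv_sock.recvfrom(65535)
    for addr in mirror_addrs:
        send_sock.sendto(data, addr)
    return data

# Demo on loopback: the listener stands in for mwlog1001's udp2log port,
# the mirror target stands in for fluorine.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))
listener.settimeout(5)
mirror_target = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
mirror_target.bind(("127.0.0.1", 0))
mirror_target.settimeout(5)
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

sender.sendto(b"slow-parse log line", listener.getsockname())
forwarded = mirror_once(listener, sender, [mirror_target.getsockname()])
copy, _ = mirror_target.recvfrom(65535)
```

Because the copy is made at the receiver, MediaWiki's config only ever needs to name one destination, which is what makes this simpler than the GroupHandler route.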

Change 337560 had a related patch set uploaded (by Filippo Giunchedi):
Switch udp2log destination to mwlog1001

https://gerrit.wikimedia.org/r/337560

Change 337568 had a related patch set uploaded (by Filippo Giunchedi):
Switch xenon redis to mwlog1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/337568

Change 337569 had a related patch set uploaded (by Filippo Giunchedi):
performance: switch xenon apache backend to mwlog1001

https://gerrit.wikimedia.org/r/337569

Change 335625 merged by Filippo Giunchedi:
udp2log: mirror traffic from mwlog1001 to fluorine

https://gerrit.wikimedia.org/r/335625

Change 335624 merged by Filippo Giunchedi:
scap: move udp2log from fluorine to mwlog1001

https://gerrit.wikimedia.org/r/335624

fgiunchedi updated the task description. Feb 14 2017, 3:34 PM

Change 337798 had a related patch set uploaded (by Filippo Giunchedi):
udp2log: fix mirroring of received packets

https://gerrit.wikimedia.org/r/337798

Change 337798 merged by Filippo Giunchedi:
udp2log: fix mirroring of received packets

https://gerrit.wikimedia.org/r/337798

Change 337855 had a related patch set uploaded (by Filippo Giunchedi):
xenon: add apache 2.4 conditional for access control

https://gerrit.wikimedia.org/r/337855

Change 337855 merged by Filippo Giunchedi:
xenon: add apache 2.4 conditional for access control

https://gerrit.wikimedia.org/r/337855

Change 337568 merged by Filippo Giunchedi:
Switch xenon redis to mwlog1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/337568

Change 337569 merged by Filippo Giunchedi:
performance: switch xenon apache backend to mwlog1001

https://gerrit.wikimedia.org/r/337569

Mentioned in SAL (#wikimedia-operations) [2017-02-15T16:39:02Z] <godog> flip xenon redis and apache from fluorine to mwlog1001 - T123728

Mentioned in SAL (#wikimedia-operations) [2017-02-16T08:39:43Z] <godog> roll-restart jobrunner in codfw/eqiad to pick up fluorine -> mwlog1001 redis change - T123728

Mentioned in SAL (#wikimedia-operations) [2017-02-16T10:18:39Z] <godog> roll-restart hhvm in eqiad to pick up fluorine -> mwlog1001 changes - T123728

Change 338119 had a related patch set uploaded (by Filippo Giunchedi):
udp2log: mirror traffic via udpmirror.py

https://gerrit.wikimedia.org/r/338119

Gilles added a subscriber: Gilles. Feb 16 2017, 6:28 PM

Change 338119 merged by Filippo Giunchedi:
udp2log: mirror traffic via udpmirror.py

https://gerrit.wikimedia.org/r/338119

fgiunchedi updated the task description. Feb 20 2017, 2:59 PM

Just FYI, there is a Kafka-based Monolog implementation in MediaWiki, currently used by the Discovery team for shipping some logs to Hadoop. I betcha we could pretty easily use it to send MW logs to Kafka, and then write them to disk on mwlog1001 using kafkatee, just like we do for some ops webrequest logs on oxygen.

Change 337560 merged by jenkins-bot:
Switch udp2log destination to mwlog1001

https://gerrit.wikimedia.org/r/337560

Mentioned in SAL (#wikimedia-operations) [2017-02-22T10:05:52Z] <filippo@tin> Synchronized wmf-config/ProductionServices.php: Move udp2log from fluorine to mwlog1001 - T123728 (duration: 00m 41s)

Just FYI, there is a Kafka-based Monolog implementation in MediaWiki, currently used by the Discovery team for shipping some logs to Hadoop. I betcha we could pretty easily use it to send MW logs to Kafka, and then write them to disk on mwlog1001 using kafkatee, just like we do for some ops webrequest logs on oxygen.

That's a great idea @Ottomata! MediaWiki is for sure the last big udp2log traffic producer, so it'd be nice to move it to Kafka too. AFAIK there has been a push to deprecate udp2log for quite some time, but we never got around to doing it. Not sure who would own it, but it would easily fall into the "tech debt" category at this point.

Change 339146 had a related patch set uploaded (by Filippo Giunchedi):
wmnet: switch udplog CNAME to mwlog1001

https://gerrit.wikimedia.org/r/339146

fgiunchedi updated the task description. Feb 22 2017, 10:45 AM

Change 339146 merged by Filippo Giunchedi:
wmnet: switch udplog CNAME to mwlog1001

https://gerrit.wikimedia.org/r/339146

Comparing the logs available on mwlog1001 with fluorine's, I see that the following are missing:

  • analysis (empty directory created in 2015)
  • apache2.log
  • hhvm.log (used by fatalmonitor)
  • memcached-keys.log

Sorry if this is a known issue or if the mwlog1001 migration is still a WIP.

Thanks @dcausse for the audit!

memcached-keys.log should have been there after I switched the udplog CNAME; I've roll-restarted rsyslog on the mc hosts.

I'll investigate apache2/hhvm logs

bd808 added a comment. Feb 22 2017, 5:23 PM

I'll investigate apache2/hhvm logs

These are both handled by the mediawiki::rsyslog Puppet class. That class sets up local rsyslog rules to forward apache2 and hhvm log events to the $::mediawiki::log_aggregator and $::mediawiki::forward_syslog endpoints (see modules/mediawiki/templates/rsyslog.conf.erb). These are currently pointed to:

  • modules/mediawiki/manifests/init.pp: $log_aggregator = 'udplog:8420',
  • hieradata/common/mediawiki.yaml: mediawiki::forward_syslog: "logstash1001.eqiad.wmnet:10514"
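The forwarding bd808 describes can be pictured with a short rsyslog fragment. This is a hypothetical sketch, not copied from rsyslog.conf.erb; the filter conditions are assumptions, though the endpoints match the ones listed above (a single `@` means UDP forwarding in rsyslog, `@@` means TCP):

```
# Forward apache2/hhvm events to the udp2log aggregator over UDP...
if $programname == 'apache2' or $programname == 'hhvm' then @udplog:8420
# ...and to logstash over TCP syslog.
if $programname == 'apache2' or $programname == 'hhvm' then @@logstash1001.eqiad.wmnet:10514
```

With rules of this shape, re-pointing the aggregator only takes the udplog DNS change plus an rsyslog restart on the app servers.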

Mentioned in SAL (#wikimedia-operations) [2017-02-22T19:46:58Z] <godog> roll-HUP rsyslog on mw1* to pick up DNS udplog change - T123728

Krinkle added a subscriber: Krinkle. (Edited) Feb 22 2017, 8:47 PM

There seem to be a few issues that affect https://performance.wikimedia.org.

  • Coal: Requests fail with HTTP 500 (e.g. https://performance.wikimedia.org/coal/v1/metrics?period=day).
  • XHGui: Loading of profiling history is not working. Regular profiling details work, and the overview page of all profiles also works. But the history of one URL seems to fail. Might be a pre-existing issue, though. (e.g. when loading https://performance.wikimedia.org/xhgui/url/view?url=%2F%2FWK33-ApAIHwAAC7E6gQAAAAQ%2Fw%2Fload.php it claims to be caused by either MongoDB credentials being wrong, MongoDB not running, or the cache directory not being writable.)

@Krinkle coal is fixed (the web part wasn't working but data collection was) and it broke as part of moving graphite1001 to jessie. No idea about xhgui (on tungsten) and its mongo though, afaik those haven't been touched.

Change 339367 had a related patch set uploaded (by Filippo Giunchedi):
wmnet: add udplog.codfw CNAME

https://gerrit.wikimedia.org/r/339367

fgiunchedi updated the task description. Feb 23 2017, 9:26 AM

Change 339367 merged by Filippo Giunchedi:
wmnet: add udplog.codfw CNAME

https://gerrit.wikimedia.org/r/339367

@Krinkle coal is fixed (the web part wasn't working but data collection was) and it broke as part of moving graphite1001 to jessie. No idea about xhgui (on tungsten) and its mongo though, afaik those haven't been touched.

Okay, I'll assume that never worked. It's not an important aspect of XHGui anyway. I initially thought that error happened on all of XHGui, which would've been a regression, but other queries work fine. Probably the db is getting too big for generic queries like the one on that page.

From the latest audit of fluorine logs missing from mwlog1001, I can't spot any files present on the former but not the latter.

fgiunchedi updated the task description. Mar 1 2017, 11:08 AM

Change 341570 had a related patch set uploaded (by filippo):
[operations/puppet] hieradata: make mwlog1001 primary log host

https://gerrit.wikimedia.org/r/341570

Note: mwlog[12]001 need to be whitelisted to be reachable from the analytics VLAN (rsync-http-https term).

Mentioned in SAL (#wikimedia-operations) [2017-03-08T12:35:23Z] <godog> add mwlog[12]001 to analytics-in4 term rsync-http-https - T123728

Change 341570 merged by Filippo Giunchedi:
[operations/puppet] hieradata: make mwlog1001 primary log host

https://gerrit.wikimedia.org/r/341570

Change 341789 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] site: use spare::system on fluorine

https://gerrit.wikimedia.org/r/341789

Krinkle removed a subscriber: Krinkle. Mar 8 2017, 8:43 PM

Change 341847 had a related patch set uploaded (by Krinkle):
[operations/puppet] Remove mentions of fluorine in old comments and descriptions

https://gerrit.wikimedia.org/r/341847

Change 341847 merged by Dzahn:
[operations/puppet] Remove mentions of fluorine in old comments and descriptions

https://gerrit.wikimedia.org/r/341847

Dzahn renamed this task from Upgrade fluorine to trusty/jessie to replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie). Mar 9 2017, 12:15 AM

Change 341940 had a related patch set uploaded (by Dzahn):
[operations/puppet] mediawiki::logging: remove fluorine from firewall rules

https://gerrit.wikimedia.org/r/341940

Change 341789 merged by Filippo Giunchedi:
[operations/puppet] site: use spare::system on fluorine

https://gerrit.wikimedia.org/r/341789

Change 342017 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] hieradata: remove access to fluorine

https://gerrit.wikimedia.org/r/342017

Change 342017 merged by Filippo Giunchedi:
[operations/puppet] hieradata: remove access to fluorine

https://gerrit.wikimedia.org/r/342017

fgiunchedi updated the task description. Mar 9 2017, 5:09 PM
fgiunchedi closed this task as Resolved.
fgiunchedi claimed this task.

This is complete; I've left out the part about sending logs to both datacenters as out of scope. The real solution, as suggested by ottomata and tracked in T126989: MediaWiki logging & encryption, is to use Kafka for log shipping, which would also buy us encryption.

Dzahn awarded a token. Mar 9 2017, 7:19 PM

Change 341940 merged by Dzahn:
[operations/puppet] mediawiki::logging: remove fluorine from firewall rules

https://gerrit.wikimedia.org/r/341940

Krinkle reopened this task as Open. (Edited) Mar 25 2017, 3:03 AM

@Krinkle coal is fixed (the web part wasn't working but data collection was) and it broke as part of moving graphite1001 to jessie.

T157022: Suspected faulty SSD on graphite1001

@fgiunchedi All coal performance data from before Feb 16 seems to be lost. See:

Can this please be restored ASAP?

I'm confused as to why it got lost. It's merely a subdirectory in the carbon/whisper directory, a sibling to all other Graphite metrics (graphite1001:/var/lib/carbon/whisper/). The only difference is that files within the ./coal/ namespace are owned by the coal user.

@Krinkle I thought I'd copied the coal data over to graphite2001 (and restored it back on graphite1001), but obviously that's not the case; I apologize :( The reason is that coal is a symlink (/var/lib/carbon/whisper/coal -> /var/lib/coal) and /var/lib/coal isn't exported via rsync either.

I'm afraid we've lost the historical data for coal as both graphite2001 and graphite1001 contain only recent data afaics. In terms of restoring the data, is there anywhere else the timing data is written as well?

fgiunchedi moved this task from Backlog to Blocked on the User-fgiunchedi board.

I'm afraid we've lost the historical data for coal as both graphite2001 and graphite1001 contain only recent data afaics. In terms of restoring the data, is there anywhere else the timing data is written as well?

It was pointed out today at the ops meeting that we haven't wiped the graphite[12]001 SSDs yet, so we might be able to recover from those!
I'll follow up with @Papaul in T161538; I've also added backing up /var/lib/coal as a step for T159354: Move coal from graphite machine(s).

In terms of restoring the data, is there anywhere else the timing data is written as well?

It is derived from EventLogging schemas, so the events are naturally stored in MySQL and Kafka for a short time, but it's nearly impossible to reproduce in a meaningful way because Statsd needs to be in between this and Graphite. Both Statsv and Graphite have no concept of time for incoming data; everything is "now".
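The "everything is now" point is visible in the wire formats themselves. A StatsD-style line carries no timestamp, while Graphite's plaintext protocol does; the metric names and values below are illustrative only:

```
# StatsD/statsv line protocol – no timestamp field, so a datapoint is
# always recorded at arrival time ("now"):
coal.responseStart:123|ms

# Graphite plaintext protocol – the timestamp is explicit:
coal.responseStart 123 1487203200
```

That's why replaying old EventLogging data through the normal Statsv path can't reconstruct history: the timestamps would all collapse to the moment of replay.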

It was pointed out today at the ops meeting that we haven't wiped the graphite[12]001 SSDs yet, so we might be able to recover from those!
I'll follow up with @Papaul in T161538; I've also added backing up /var/lib/coal as a step for T159354: Move coal from graphite machine(s).

Thanks. We'll also need to devise a way to merge the data carefully, but let's first move the data to a different location on the new drive. (Because the time series databases are stored by metric, not by time slice, we don't want to overwrite the recent data either.)

fgiunchedi closed this task as Resolved.Mar 28 2017, 9:13 AM

Thanks. We'll also need to devise a way to merge the data carefully, but let's first move the data to a different location on the new drive. (Because the time series databases are stored by metric, not by time slice, we don't want to overwrite the recent data either.)

Agreed, we've used https://github.com/graphite-project/carbonate in the past with success to merge whisper files, and it seems to do what it says on the tin.
I'm resolving this in favor of T161538; let's follow up there.
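The careful-merge rule discussed above boils down to: when backfilling recovered history into a whisper series, fill only the slots that are empty in the current file and never overwrite datapoints collected since the loss. A toy sketch of that rule (illustrative only, not carbonate's actual implementation; series are modeled as lists where None means an empty slot):

```python
# Merge a recovered series into the current one, slot by slot,
# preferring whatever already exists in the current (recent) series.

def backfill(current, recovered):
    """Return a merged series; existing recent datapoints always win."""
    return [cur if cur is not None else old
            for cur, old in zip(current, recovered)]

# The current series has data only for the newest slots; the recovered
# SSD copy holds the older history.
current = [None, None, None, 41.0, 42.5]
recovered = [37.1, 38.4, 39.0, 40.2, None]
merged = backfill(current, recovered)
```

This is the same behavior one would want from a whisper-aware tool: staging the recovered files elsewhere first (as suggested above) keeps the fill direction unambiguous.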