
EventLogging needs to be ready for codfw failover
Closed, Resolved · Public · 8 Story Points

Description

During the week of March 21, Wikimedia engineering is planning to fail over to codfw for 48 hours. EventLogging needs to remain available, and Wikitech needs instructions documenting any manual steps required to shift EventLogging to codfw.

Event Timeline

ori created this task. · Feb 17 2016, 5:31 PM
ori raised the priority of this task to Needs Triage.
ori updated the task description.
ori added subscribers: ori, Nuria.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Feb 17 2016, 5:31 PM
ori triaged this task as High priority. · Feb 17 2016, 5:34 PM
ori updated the task description.
ori set Security to None.
Nuria edited projects, added Analytics-Kanban; removed Analytics. · Feb 17 2016, 7:00 PM
Nuria added a subscriber: Ottomata.
Nuria added a comment. · Feb 23 2016, 6:08 PM

We need to verify that UDP traffic can get from Dallas to eqiad, OR migrate the eventlogging server to use Kafka (as eqiad will be up while this exercise is taking place).

Nuria added a comment. · Feb 23 2016, 6:10 PM

Can the server side produce via HTTP to varnishkafka?

faidon added a subscriber: faidon. · Feb 24 2016, 1:59 PM

UDP traffic can get from codfw to eqiad in general — the two DCs are interconnected (although keep in mind that the fibers may be wiretapped and thus no expectations of privacy should exist).

ACLs are something we should check; let me know which specific flows you're referring to so I can check them. Also, is this multicast? If it is, it /should/ work, but I'd like to test it as it's a little bit more complicated (PIM is involved).

As for "eqiad being up while this exercise takes place"… we have tentative plans of bringing it down in parts (e.g. row by row) to do network maintenance such as switch/router upgrades, so ideally we shouldn't count on that.

Traffic is not multicast, it is direct from app servers -> eventlog1001.
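A quick way to confirm that unicast UDP datagrams actually arrive is to send a probe and listen for it on the receiving side. The sketch below only illustrates the idea: the function names, the probe payload, and the loopback self-check are assumptions made for this example, and no real EventLogging host or port is used.

```python
import socket
from typing import Optional

PROBE = b"eventlogging-probe"  # hypothetical payload, chosen for this example only


def send_probe(host: str, port: int, payload: bytes = PROBE) -> int:
    """Send one unicast UDP datagram and return the number of bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


def receive_probe(sock: socket.socket, timeout: float = 2.0) -> Optional[bytes]:
    """Wait up to `timeout` seconds for one datagram; return None if none arrives."""
    sock.settimeout(timeout)
    try:
        data, _addr = sock.recvfrom(4096)
        return data
    except socket.timeout:
        return None


def loopback_check() -> bool:
    """Exercise both helpers on 127.0.0.1, using a port picked by the OS."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as listener:
        listener.bind(("127.0.0.1", 0))
        port = listener.getsockname()[1]
        send_probe("127.0.0.1", port)
        return receive_probe(listener) == PROBE
```

Between two real hosts, `receive_probe` would run on the destination with a socket bound to the port under test, and `send_probe` on the source. A `None` result means only that nothing arrived within the timeout; it does not say whether an ACL or something else dropped the datagram.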

Hitting the beacon/event.gif endpoint from app servers is not a bad idea, and would simplify configuration and processes on the eventlog server side. It doesn't help with the codfw failover, but it does eliminate one more special case.

All varnishkafkas produce to the analytics-eqiad Kafka cluster. If that is not reachable, then we will lose all eventlogging and webrequest messages.
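To make the beacon idea concrete, here is a minimal sketch of a server-side producer that submits an event over HTTP instead of UDP. It assumes the endpoint accepts the event as URL-encoded JSON in the query string; this thread does not confirm that, and the base URI, event fields, and function names are all hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_beacon_url(base_uri: str, event: dict) -> str:
    """Serialize the event as compact JSON and append it, URL-encoded, as the query string."""
    payload = json.dumps(event, separators=(",", ":"), sort_keys=True)
    return base_uri + "?" + urllib.parse.quote(payload, safe="")


def post_event(base_uri: str, event: dict, timeout: float = 1.0) -> int:
    """POST the event to the (assumed) beacon endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        build_beacon_url(base_uri, event),
        data=b"",          # empty body; the event travels in the query string
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical event and endpoint, for illustration only.
    example = {"schema": "ExampleSchema", "wiki": "testwiki", "event": {"action": "save"}}
    print(build_beacon_url("https://example.org/beacon/event", example))
```

Because the events would then go through the same HTTP path as client-side ones, the eventlog server would no longer need a separate UDP listener for server-side producers.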

Change 273006 had a related patch set uploaded (by Ori.livneh):
Fully-qualify EventLoggingBaseUri

https://gerrit.wikimedia.org/r/273006

Change 273008 had a related patch set uploaded (by Ori.livneh):
Submit server-side events via HTTP POST to the beacon endpoint

https://gerrit.wikimedia.org/r/273008

Change 273006 merged by jenkins-bot:
Fully-qualify EventLoggingBaseUri

https://gerrit.wikimedia.org/r/273006

Change 273513 had a related patch set uploaded (by Ottomata):
Add running eventlogging-devserver to role::eventlogging

https://gerrit.wikimedia.org/r/273513

Change 273527 had a related patch set uploaded (by Ottomata):
Update eventlogging-devserver log parsing to behave the same as eventlogging-processor

https://gerrit.wikimedia.org/r/273527

Change 273527 merged by Ottomata:
Update eventlogging-devserver log parsing to behave the same as eventlogging-processor

https://gerrit.wikimedia.org/r/273527

Milimetric assigned this task to Nuria. · Mar 3 2016, 5:27 PM
Milimetric set the point value for this task to 8.

Change 273008 merged by jenkins-bot:
Submit server-side events via HTTP POST to the beacon endpoint

https://gerrit.wikimedia.org/r/273008

Hm, @ori, I was just testing this on MW vagrant with https://gerrit.wikimedia.org/r/#/c/273513/ again, and with role wikimediaevents and role eventlogging (from that patch) enabled, I get:

[error] [76308532] /w/index.php?title=Main_Page&action=submit   ErrorException from line 63 of /vagrant/mediawiki/extensions/EventLogging/includes/EventLogging.php: PHP Notice: Undefined variable: wgDBname

In mediawiki-wiki-debug.log

Ah, I see it was removed in https://gerrit.wikimedia.org/r/#/c/273008/1/includes/EventLogging.php and then re-added in response to a comment in a later patch, but without the global declaration.

Fixing...

Change 274986 had a related patch set uploaded (by Ottomata):
Fix missing global var declaration for $wgDBname

https://gerrit.wikimedia.org/r/274986

Change 274986 merged by jenkins-bot:
Fix missing global var declaration for $wgDBname

https://gerrit.wikimedia.org/r/274986

Change 273513 merged by Ottomata:
Add eventlogging-devserver service to role::eventlogging

https://gerrit.wikimedia.org/r/273513

Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board. · Mar 16 2016, 3:34 PM
Nuria closed this task as Resolved. · Mar 22 2016, 7:13 PM

Change 452877 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] Remove deprecated $wgEventLoggingFile config

https://gerrit.wikimedia.org/r/452877

Change 452877 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove deprecated $wgEventLoggingFile config

https://gerrit.wikimedia.org/r/452877