Page MenuHomePhabricator

revision-create events are sometimes emitted in a secondary DC
Closed, DeclinedPublic

Description

About once an hour there's a mediawiki.revision-create event emitted in a secondary DC (codfw at the moment of writing this) which should not be happening. It's not clear where these events are coming from, but they all seem to be random legitimate edits. We need to somehow find out what's creating these events and why are they getting routed into codfw.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Could it have to do with automatic checks somehow?

Nope. The events are all legitimate edits, just random portion of them. Here's an example:

{"comment": "/* wbeditentity-update:0| */ Updating citation graph", "database": "wikidatawiki", "meta": {"domain": "www.wikidata.org", "dt": "2018-10-25T14:28:08+00:00", "id": "31481568-d862-11e8-a6c3-1866da994aad", "request_id": "6cf96684-6fa5-4968-bad1-eb1c9cf583fd", "schema_uri": "mediawiki/revision/create/3", "topic": "mediawiki.revision-create", "uri": "https://www.wikidata.org/wiki/Q30590474"}, "page_id": 32201215, "page_is_redirect": false, "page_namespace": 0, "page_title": "Q30590474", "parsedcomment": "\u200e<span dir=\"auto\"><span class=\"autocomment\">wbeditentity-update:0|: </span> Updating citation graph</span>", "performer": {"user_edit_count": 3921442, "user_groups": ["bot", "*", "user", "autoconfirmed"], "user_id": 2991187, "user_is_bot": true, "user_registration_dt": "2018-07-15T00:21:35Z", "user_text": "Citationgraph bot 2"}, "rev_content_changed": true, "rev_content_format": "application/json", "rev_content_model": "wikibase-item", "rev_id": 772918151, "rev_len": 107337, "rev_minor_edit": false, "rev_parent_id": 758988603, "rev_sha1": "bu10dhanqjyoonn6e2tdlctbnmqq1lw", "rev_timestamp": "2018-10-25T14:28:08Z"}
{"comment": "/* wbcreateclaim-create:1| */ [[Property:P945]]: [[Q15180]], #quickstatements; #temporary_batch_1540471925444", "database": "wikidatawiki", "meta": {"domain": "www.wikidata.org", "dt": "2018-10-25T15:29:23+00:00", "id": "bfc0952a-d86a-11e8-a9a8-1866da9950b2", "request_id": "38a99423-c1df-4e29-841c-90b9fce7b749", "schema_uri": "mediawiki/revision/create/3", "topic": "mediawiki.revision-create", "uri": "https://www.wikidata.org/wiki/Q30345539"}, "page_id": 31957263, "page_is_redirect": false, "page_namespace": 0, "page_title": "Q30345539", "parsedcomment": "\u200e<span dir=\"auto\"><span class=\"autocomment\">wbcreateclaim-create:1|: </span> <a href=\"/wiki/Property:P945\" title=\"Property:P945\">Property:P945</a>: <a href=\"/wiki/Q15180\" title=\"Q15180\">Q15180</a>, #quickstatements; #temporary_batch_1540471925444</span>", "performer": {"user_edit_count": 4635245, "user_groups": ["*", "user", "autoconfirmed"], "user_id": 4115, "user_is_bot": false, "user_registration_dt": "2012-10-30T10:17:33Z", "user_text": "ShinePhantom"}, "rev_content_changed": true, "rev_content_format": "application/json", "rev_content_model": "wikibase-item", "rev_id": 772963109, "rev_len": 7494, "rev_minor_edit": false, "rev_parent_id": 745060726, "rev_sha1": "4utwvagsrwwrfbpe701zjq6asb72wdw", "rev_timestamp": "2018-10-25T15:29:23Z"}
{"comment": "Removing from [[Category:2016 in Hokkaido]]", "database": "commonswiki", "meta": {"domain": "commons.wikimedia.org", "dt": "2018-10-25T18:30:24+00:00", "id": "0969a9b0-d884-11e8-a9a8-1866da9950b2", "request_id": "32bf941d-e7b8-4b20-a317-448274c9a5d3", "schema_uri": "mediawiki/revision/create/3", "topic": "mediawiki.revision-create", "uri": "https://commons.wikimedia.org/wiki/File:%E5%A2%83%E5%86%85%E3%81%AE%E3%81%95%E3%81%8F%E3%82%89_-_panoramio.jpg"}, "page_id": 60716174, "page_is_redirect": false, "page_namespace": 6, "page_title": "File:\u5883\u5185\u306e\u3055\u304f\u3089_-_panoramio.jpg", "parsedcomment": "Removing from <a href=\"/wiki/Category:2016_in_Hokkaido\" title=\"Category:2016 in Hokkaido\">Category:2016 in Hokkaido</a>", "performer": {"user_edit_count": 37671, "user_groups": ["autopatrolled", "filemover", "*", "user", "autoconfirmed"], "user_id": 346343, "user_is_bot": false, "user_registration_dt": "2008-08-22T13:36:53Z", "user_text": "\u30c8\u30c8\u30c8"}, "rev_content_changed": true, "rev_content_format": "text/x-wiki", "rev_content_model": "wikitext", "rev_id": 325404167, "rev_len": 705, "rev_minor_edit": true, "rev_parent_id": 287193822, "rev_sha1": "pcvept9ysyu0pkpo0thusiszi8x3hp5", "rev_timestamp": "2018-10-25T18:30:23Z"}
{"comment": "/* wbeditentity-update:0| */ nl-description, [[User:Edoderoobot/Set-nl-description|python code]], logfile on https://goo .gl/BezTim", "database": "wikidatawiki", "meta": {"domain": "www.wikidata.org", "dt": "2018-10-25T19:23:13+00:00", "id": "6a48ad20-d88b-11e8-b4fc-1866da994cb1", "request_id": "38cd2d8f-aff4-4c23-bd45-2bf5ac74a4fd", "schema_uri": "mediawiki/revision/create/3", "topic": "mediawiki.revision-create", "uri": "https://www.wikidata.org/wiki/Q56550957"}, "page_id": 56490478, "page_is_redirect": false, "page_namespace": 0, "page_title": "Q56550957", "parsedcomment": "\u200e<span dir=\"auto\"><span class=\"autocomment\">wbeditentity-update:0|: </span> nl-description, <a href=\"/wiki/User:Edoderoobot/Set-nl-description\" title=\"User:Edoderoobot/Set-nl-description\">python code</a>, logfile on https://goo .gl/BezTim</span>", "performer": {"user_edit_count": 29738284, "user_groups": ["bot", "*", "user", "autoconfirmed"], "user_id": 1554155, "user_is_bot": true, "user_registration_dt": "2014-12-24T08:58:28Z", "user_text": "Edoderoobot"}, "rev_content_changed": true, "rev_content_format": "application/json", "rev_content_model": "wikibase-item", "rev_id": 773135961, "rev_len": 7197, "rev_minor_edit": false, "rev_parent_id": 752535550, "rev_sha1": "sch0qvuyihfu3t2z070stzsx7nqg8vy", "rev_timestamp": "2018-10-25T19:23:12Z"}
{"comment": "I've linked the word \"Portugal\" in the text to the Portugal page in Wikipedia.", "database": "enwiki", "meta": {"domain": "en.wikipedia.org", "dt": "2018-10-25T20:03:56+00:00", "id": "1a6ce525-d891-11e8-a447-b083fecf12d7", "request_id": "79f5e632-ce78-4c4d-801c-ed15e5a2d11c", "schema_uri": "mediawiki/revision/create/3", "topic": "mediawiki.revision-create", "uri": "https://en.wikipedia.org/wiki/Portugal._The_Man"}, "page_id": 5497650, "page_is_redirect": false, "page_namespace": 0, "page_title": "Portugal._The_Man", "parsedcomment": "I&#039;ve linked the word &quot;Portugal&quot; in the text to the Portugal page in Wikipedia.", "performer": {"user_groups": ["*"], "user_is_bot": false, "user_text": "2804:14D:5C35:AEEB:AC0C:70B6:810B:7563"}, "rev_content_changed": true, "rev_content_format": "text/x-wiki", "rev_content_model": "wikitext", "rev_id": 865733708, "rev_len": 55641, "rev_minor_edit": false, "rev_parent_id": 863644396, "rev_sha1": "0xkbysk90c9ivqpghag6cmgkjb0vjci", "rev_timestamp": "2018-10-25T20:03:53Z"}

Hm interesting. These are all original events, so either the EventBus proxy service messes up the DNS (less likely) or somehow legitimate requests end up in codfw. But how?

Things I've checked so far:

  • There are no logs anywhere associated with these events.
  • All the events are legitimate edits, they exist in DB, they show up in the page history.
  • I did not find anything common between the events, it's just a completely random sample.
  • Events related to these edits are not present in eqiad.

EventBus proxy service messes up the DNS

EventBus extension, the proxy talks directly to Kafka and it's the one prepending the DC name, so if these are coming through the proxy - MediaWiki is the one messing the DNS.

Legitimate requests end up in codfw.

revision-create is emitted after a successful write to MariaDB - can we write in codfw? I though DBs are readonly in codfw and Mediawiki in codfw talks to DBs in codfw? Is that not the case?

Things I've checked so far:

  • There are no logs anywhere associated with these events.
  • All the events are legitimate edits, they exist in DB, they show up in the page history.
  • I did not find anything common between the events, it's just a completely random sample.
  • Events related to these edits are not present in eqiad.

EventBus proxy service messes up the DNS

EventBus extension, the proxy talks directly to Kafka and it's the one prepending the DC name, so if these are coming through the proxy - MediaWiki is the one messing the DNS.

Legitimate requests end up in codfw.

revision-create is emitted after a successful write to MariaDB - can we write in codfw? I though DBs are readonly in codfw and Mediawiki in codfw talks to DBs in codfw? Is that not the case?

Yes, and that's the confusing part. codfw MW is set to read-only, so successful write cannot happen there. Unless that instance of MW somehow talks to MariaDB in eqiad?

We can keep theorizing about this, but we need more information before we could make any real theory.

I think the first and foremost we need to find out where do these requests to eventbus service home from. Trace logging will not help us, because RESTBase actually sends its events into codfw..

I guess we could run tcpdump for some extended period of time writing the source IPs into a file, or excluding the known sources - after some time we will catch it..

I had a chat with Petr and the plan is the following: tail (with kafkacat or similar) mediawiki.revision-create in codfw and use httpry or tcpdump to capture HTTP traffic in the meantime. When a events pops up (should be in maximum one hour) we stop capturing and check who sent the data. Will start this work tomorrow morning EU time if nobody beats me on it :)

So one scenario in which this can happen is the following:

  1. mediawiki calls eventbus via the discovery dns url at eventbus.discovery.wmnet
  2. Our resolv.conf lists the DC-local resolver address first, then falls back to the remote-DC resolver if that doesn't work
  3. For clients like dig that support Edns-Client-Subnet, the eventbus that is nearest to the client is correctly located via GeoIP. If a client doesn't send the information, the resolver will return the instance nearest to the resolver itself.
  4. Thus we can have sporadic cases where an event is sent to eventbus in codfw if curl doesn't support making dns requests including edns-client-subnet:
# After commenting out the Eqiad resolver, dig still returns the correct eventbus instance
mw1:~$ dig eventbus.discovery.wmnet
...
eventbus.discovery.wmnet. 300	IN	A	10.2.2.23
# I adapted the script found in https://stackoverflow.com/questions/1369760/php-curl-retrieving-server-ip-address
mw1:~$ php test_curl_resolve.php 
10.2.1.23

I always assumed that would turn out ok as then events should then go through change-propagation which will again resolve the clients to which it needs to send the events, and getting the correct discovery URLs for all services (in the case of MediaWiki, it's active-passive so it will always route to the current active datacenter).

I saw stated in the ticket that "this shouldn't happen"; I'd like to understand why.

So while I think my theory is pretty sound, we can just set eventbus to be active/passive and see if the events still get created in codfw or not. If that's the case, we have an explanation. @elukey what do you think?

Not sure if I am doing anything wrong but I can't see any new events into codfw.mediawiki.revision-create after the one with dt: ~04:38. From the traffic logged by httpry though I saw this:

elukey@kafka2003:~$ grep 10.64.0.67 capture.log
2018-11-16 10:22:21	10.64.0.67	10.2.1.23	>	POST	eventbus.discovery.wmnet:8085	/v1/events	HTTP/1.1	-	-
2018-11-16 10:22:21	10.2.1.23	10.64.0.67	<	-	-	-	HTTP/1.1	201	All 1 events were accepted.

elukey@kafka2003:~$ dig -x 10.64.0.67 +short
mw1272.eqiad.wmnet.

So I think that Joe is right :)

elukey@kafka2002:~$ grep 10.64.32.36 capture.log
2018-11-16 10:51:48	10.64.32.36	10.2.1.23	>	POST	eventbus.discovery.wmnet:8085	/v1/events	HTTP/1.1	-	-
2018-11-16 10:51:48	10.2.1.23	10.64.32.36	<	-	-	-	HTTP/1.1	100	(Continue)
2018-11-16 10:51:48	10.2.1.23	10.64.32.36	<	-	-	-	HTTP/1.1	201	All 1 events were accepted.
2018-11-16 10:51:48	10.64.32.36	10.2.1.23	>	POST	eventbus.discovery.wmnet:8085	/v1/events	HTTP/1.1	-	-
2018-11-16 10:51:48	10.2.1.23	10.64.32.36	<	-	-	-	HTTP/1.1	201	All 1 events were accepted.

elukey@kafka2002:~$ dig -x 10.64.32.36 +short
mw1334.eqiad.wmnet.

Very interesting! Thank you for the thorough investigation @elukey and @Joe . In theory, in this case, there shouldn't be any negative consequences of this behaviour, because the jobrunner's discovery end point is set to active/passive, which means that once CP receives the event, it sends it to the correct DC. To corroborate that, I couldn't find any errors or complaints that these events resulted in a DB is read-only error.

However, that poses the question (which is probably out of scope for this ticket): does this happen in other instances where we have services set to active/active?

Restricted Application edited projects, added Analytics; removed Analytics-Radar. · View Herald TranscriptJun 10 2020, 6:33 AM
Restricted Application edited projects, added Analytics; removed Analytics-Radar. · View Herald TranscriptJun 10 2020, 6:36 AM

Declining due to lack of activity or impact. Please reopen if needed.