
Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw
Closed, Resolved · Public

Description

See https://logstash.wikimedia.org/goto/06e62b8ad0146ef9bec01baa1af89985. There were no 404s before the switch.
The query I used in Kibana is: "Batch request failure for" && "404"
The rates are not alarming, but this merits investigation.

Details

Related Gerrit Patches:
operations/puppet (production): tlsproxy::localssl: add ability to have an access.log
mediawiki/services/parsoid (master): ApiRequest: Clone the request options before modifying them
mediawiki/services/parsoid/deploy (master): Bump src to e7e3a4d (deploy-20161221 src branch)

Event Timeline

ssastry created this task. · Dec 20 2016, 9:54 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Dec 20 2016, 9:54 PM

The 404 rate returned from Parsoid has not changed much over the last months:

I am talking about 404s from Parsoid -> Mediawiki API ... not RESTBase -> Parsoid.

@ssastry, my understanding is that Parsoid does not retry 404s from batch requests, so the two should be equal. Could you clarify why there would be a difference?

One retry, actually. Separately, if a batch API request 404s both times, it leaves holes in the parsed document; it doesn't return a 404 to RESTBase.

@ssastry, I haven't seen any code in Parsoid that would trigger retries on a 404 response. Could you point it out to me?

Okay, we clarified this on IRC. I was misreading the code. There are no batch API retries on 404.

I really doubt the 404s come from replication lag. There were no 404s like this before, and since we now make two cross-DC trips, the lag effect should've decreased, not increased.

Yeah, on IRC @ssastry clarified that in contrast to the primary revision source request, these batch requests would not cause a 404 to be returned to RESTBase. This means that they would not show up in the RESTBase 404 response metrics.

Another interesting type of 404s from parsoid started to appear after move to codfw: https://logstash.wikimedia.org/goto/b9bef516e5e1c3e75d7895185e2a7f2f

It says 'unconfigured domain'
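For context, nginx typically answers a request whose Host header matches no configured server_name from a default catch-all server block. A hypothetical sketch of such a block (not the actual WMF tlsproxy configuration) that would produce exactly this kind of 404:

```nginx
# Hypothetical catch-all server; not the actual WMF config.
server {
    listen 443 ssl default_server;
    server_name _;                       # matches when no other server_name does
    return 404 'unconfigured domain\n';  # requests without a known Host end here
}
```

If Parsoid's outgoing requests sometimes lack a usable Host header, they would fall into a block like this and get a 404 regardless of the URL path.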

I'm trying to reproduce the problem on mw1189, but I couldn't find the exact request that Parsoid is making to api.svc.eqiad.wmnet in the logs.

After chatting with @Pchelolo, I diffed the conftool pools for api.svc.eqiad.wmnet; the HTTPS pool has servers disabled:

--- a/api
+++ b/api-https
@@ -3,7 +3,7 @@
# for creating an external node pool reference file
#
# This file is generated from the etcd directories:
# [-["/conftool/v1/pools/eqiad/api_appserver/apache2/"]-]{+["/conftool/v1/pools/eqiad/api_appserver/nginx/"]+}


{ 'host': 'mw1189.eqiad.wmnet', 'weight':20, 'enabled': True }
@@ -41,18 +41,18 @@
{ 'host': 'mw1233.eqiad.wmnet', 'weight':20, 'enabled': True }
{ 'host': 'mw1234.eqiad.wmnet', 'weight':20, 'enabled': True }
{ 'host': 'mw1235.eqiad.wmnet', 'weight':20, 'enabled': True }
{ 'host': 'mw1276.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1277.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1278.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1279.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1280.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1281.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1282.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1283.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1284.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1285.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1286.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1287.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1288.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1289.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
{ 'host': 'mw1290.eqiad.wmnet', [-'weight':25,-]{+'weight':10,+} 'enabled': [-True-]{+False+} }
So it must be the same set of servers; assuming api.svc.eqiad.wmnet hasn't changed, the only difference is that Parsoid now talks to nginx and not Apache.

The Parsoid dashboard shows non-200 codes; I was looking for total 200s but couldn't find them yet: https://grafana.wikimedia.org/dashboard/db/parsoid-http-status-codes?from=now-6h&to=now

You can find this info at https://grafana.wikimedia.org/dashboard/db/restbase?panelId=13&fullscreen

The rate of these 404s is really low compared to the 200 rate, but it's still worth figuring out.

Joe added a comment. · Dec 21 2016, 7:46 AM

Another interesting type of 404s from parsoid started to appear after move to codfw: https://logstash.wikimedia.org/goto/b9bef516e5e1c3e75d7895185e2a7f2f
It says 'unconfigured domain'

OK, this could be due to something missing in the nginx configuration; it needs investigation.

Joe added a comment. · Dec 21 2016, 8:37 AM

So, about the unconfigured domain ones:

It seems that Parsoid sometimes sends out a request without the Host: header set. In fact, I see the following:

salt --output=raw -C 'G@cluster:api_appserver and G@site:eqiad' cmd.run 'fgrep api.svc.eqiad.wmnet /var/log/apache2/other_vhosts_access.log | wc -l'
{'mw1222.eqiad.wmnet': '15'}
{'mw1229.eqiad.wmnet': '8'}
{'mw1234.eqiad.wmnet': '12'}
{'mw1224.eqiad.wmnet': '19'}
{'mw1208.eqiad.wmnet': '9'}
{'mw1199.eqiad.wmnet': '22'}
{'mw1226.eqiad.wmnet': '15'}
{'mw1195.eqiad.wmnet': '14'}
{'mw1207.eqiad.wmnet': '17'}
{'mw1228.eqiad.wmnet': '17'}
{'mw1223.eqiad.wmnet': '14'}
{'mw1203.eqiad.wmnet': '14'}
{'mw1193.eqiad.wmnet': '19'}
{'mw1221.eqiad.wmnet': '19'}
{'mw1230.eqiad.wmnet': '11'}
{'mw1235.eqiad.wmnet': '24'}
{'mw1231.eqiad.wmnet': '17'}
{'mw1232.eqiad.wmnet': '15'}
{'mw1200.eqiad.wmnet': '11'}
{'mw1190.eqiad.wmnet': '9'}
{'mw1192.eqiad.wmnet': '13'}
{'mw1197.eqiad.wmnet': '11'}
{'mw1196.eqiad.wmnet': '8'}
{'mw1233.eqiad.wmnet': '11'}
{'mw1227.eqiad.wmnet': '12'}
{'mw1225.eqiad.wmnet': '17'}
{'mw1204.eqiad.wmnet': '16'}
{'mw1205.eqiad.wmnet': '18'}
{'mw1189.eqiad.wmnet': '17'}
{'mw1202.eqiad.wmnet': '14'}
{'mw1206.eqiad.wmnet': '18'}
{'mw1194.eqiad.wmnet': '12'}
{'mw1198.eqiad.wmnet': '17'}
{'mw1191.eqiad.wmnet': '17'}
{'mw1201.eqiad.wmnet': '11'}
{'mw1287.eqiad.wmnet': '0'}
{'mw1282.eqiad.wmnet': '0'}
{'mw1280.eqiad.wmnet': '0'}
{'mw1290.eqiad.wmnet': '0'}
{'mw1276.eqiad.wmnet': '0'}
{'mw1278.eqiad.wmnet': '0'}
{'mw1285.eqiad.wmnet': '0'}
{'mw1286.eqiad.wmnet': '0'}
{'mw1284.eqiad.wmnet': '0'}
{'mw1279.eqiad.wmnet': '0'}
{'mw1283.eqiad.wmnet': '0'}
{'mw1277.eqiad.wmnet': '0'}
{'mw1281.eqiad.wmnet': '0'}
{'mw1289.eqiad.wmnet': '0'}
{'mw1288.eqiad.wmnet': '0'}

The fact that this is not happening on the machines that don't get called by Parsoid in codfw seems to indicate it comes specifically from Parsoid.

The rate, given the logs are 2 hours old, seems to be 7/hour/server, which translates to roughly 250 errors/hour. That probably can't account for everything, but it's a starting point.
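A quick back-of-the-envelope check of that estimate, using only the figures quoted in this comment:

```javascript
// Back-of-the-envelope check of the rate quoted above; the inputs are the
// figures from this comment, not fresh measurements.
const perServerPerHour = 7;  // ~14 hits per server over a ~2-hour log window
const activeServers = 35;    // eqiad API servers with non-zero counts above
console.log(perServerPerHour * activeServers); // 245, i.e. roughly 250/hour
```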

Joe added a comment. · Dec 21 2016, 9:11 AM

After further inspection of the logs, it seems that those are the only 404 errors we get, apart from a few when trying to parse wikitech, which I guess is expected.

So the issue would seem to be in these lines of code I introduced:

https://github.com/wikimedia/parsoid/blob/master/lib/mw/ApiRequest.js#L300-L308

So, the only explanation I could come up with is that somehow

url.parse(options.uri);

returns an object that has no hostname. For example:

console.log(url.parse('/w/api.php').hostname);
null

Now I am looking at how we could recover when this happens, but I'm also asking myself how this could possibly happen.

Change 328495 had a related patch set uploaded (by Giuseppe Lavagetto):
tlsproxy::localssl: add ability to have an access.log

https://gerrit.wikimedia.org/r/328495

mobrovac triaged this task as High priority. · Dec 21 2016, 11:53 AM
mobrovac edited projects, added User-mobrovac, Services (doing); removed Services.

Change 328497 had a related patch set uploaded (by Mobrovac):
ApiRequest: Clone the request options before modifying them

https://gerrit.wikimedia.org/r/328497

Change 328497 merged by Mobrovac:
ApiRequest: Clone the request options before modifying them

https://gerrit.wikimedia.org/r/328497

Change 328506 had a related patch set uploaded (by Mobrovac):
Bump src to rGPARe7e3a4dce51f (deploy-20161221 src branch)

https://gerrit.wikimedia.org/r/328506

Change 328506 merged by Mobrovac:
Bump src to rGPARe7e3a4dce51f (deploy-20161221 src branch)

https://gerrit.wikimedia.org/r/328506

Mentioned in SAL (#wikimedia-operations) [2016-12-21T12:57:28Z] <mobrovac@tin> Starting deploy [parsoid/deploy@dab1f27]: Bug fix for mwApiServer T153797

Mentioned in SAL (#wikimedia-operations) [2016-12-21T13:03:01Z] <mobrovac@tin> Finished deploy [parsoid/deploy@dab1f27]: Bug fix for mwApiServer T153797 (duration: 05m 32s)

Joe added a comment. · Dec 21 2016, 1:22 PM

Errors completely stopped after @mobrovac's patch was deployed.

mobrovac closed this task as Resolved. · Dec 21 2016, 1:40 PM
mobrovac edited projects, added Services (done); removed Services (doing), Patch-For-Review.

No batch request failure errors for the last 40 minutes, so I'm declaring victory.

Change 328495 merged by Giuseppe Lavagetto:
[operations/puppet] tlsproxy::localssl: add ability to have an access.log

https://gerrit.wikimedia.org/r/328495