See https://logstash.wikimedia.org/goto/06e62b8ad0146ef9bec01baa1af89985; there were no 404s before the switch.
The query I used in Kibana is: "Batch request failure for" && "404"
The rates are not alarming, but this merits investigation.
Could this be 404s from replication lag, as handled in https://github.com/wikimedia/change-propagation/blob/9a877128015c9258524dbea359d4c5509ebe57c6/config.example.wikimedia.yaml#L180 ?
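For context, the idea behind that change-propagation rule is roughly the following: a request that 404s may simply have hit a replica that has not seen the new revision yet, so the request is retried after a short delay before the 404 is treated as real. A minimal sketch of that pattern (illustrative only, not the actual change-propagation configuration; fetchRevision and the delay value are invented):

'use strict';

// Sketch only: retry once after a short delay when a read 404s, on the
// assumption that the 404 may be caused by replication lag.
function fetchWithLagRetry(fetchRevision, revId, retryDelayMs) {
    return fetchRevision(revId).then((res) => {
        if (res.status !== 404) {
            return res;
        }
        // Possibly replication lag: wait and try once more.
        return new Promise((resolve) => setTimeout(resolve, retryDelayMs))
            .then(() => fetchRevision(revId));
    });
}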
@ssastry, my understanding is that Parsoid does not retry 404s from batch requests, so the two should be equal. Could you clarify why there would be a difference?
One retry, actually. Separately, if a batch API request 404s both times, it leaves holes in the parsed document; it doesn't return a 404 to RESTBase.
@ssastry, I haven't seen any code in Parsoid that would trigger retries on a 404 response. Could you point it out to me?
Okay, we clarified on IRC. I was misreading the code; there are no batch API retries on 404.
I really doubt the 404s come from replication lag. There were no 404s like this before, and since we now make two cross-DC trips, the lag effect should have decreased, not increased.
Yeah, on IRC @ssastry clarified that in contrast to the primary revision source request, these batch requests would not cause a 404 to be returned to RESTBase. This means that they would not show up in the RESTBase 404 response metrics.
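To make the behaviour described above concrete, here is a minimal sketch (invented function and field names; this is not Parsoid's actual batching code): a batch sub-request that 404s is logged and replaced by an empty fragment, so the document comes back with holes, but the response handed to RESTBase is not a 404 and therefore never shows up in RESTBase's 404 metrics.

'use strict';

// Sketch only: assemble a parse from batch sub-request results, tolerating
// 404s on individual sub-requests instead of failing the whole request.
function assembleBatchResults(subResults, logger) {
    return subResults.map((r) => {
        if (r.status === 404) {
            // Log the failure, but do not propagate a 404 to the caller
            // (RESTBase); the corresponding piece of the document stays empty.
            logger.log('error', 'Batch request failure for ' + r.uri);
            return { uri: r.uri, html: '' };
        }
        return { uri: r.uri, html: r.body };
    });
}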
Another interesting type of 404s from parsoid started to appear after move to codfw: https://logstash.wikimedia.org/goto/b9bef516e5e1c3e75d7895185e2a7f2f
It says 'unconfigured domain'
I'm trying to reproduce the problem on mw1189, but I couldn't find the exact request that Parsoid is making to api.svc.eqiad.wmnet in the logs.
After chatting with @Pchelolo, I diffed the conftool pools for api.svc.eqiad.wmnet; the https pool has some servers disabled, but otherwise it must be the same set of servers. Assuming api.svc.eqiad.wmnet hasn't changed, the only difference is that Parsoid now talks to nginx and not Apache.
The Parsoid dashboard shows non-200 codes; I was looking for the total number of 200s but couldn't find it yet: https://grafana.wikimedia.org/dashboard/db/parsoid-http-status-codes?from=now-6h&to=now
You can find this info at https://grafana.wikimedia.org/dashboard/db/restbase?panelId=13&fullscreen
The rate of these 404s is really low compared to the 200 rate, but it's still worth figuring out.
Ok, this could be due to something missing in the nginx configuration; it needs investigation.
So, about the unconfigured domain ones:
It seems that Parsoid sometimes sends out a request without the Host: header set. In fact, I see the following:
salt --output=raw -C 'G@cluster:api_appserver and G@site:eqiad' cmd.run 'fgrep api.svc.eqiad.wmnet /var/log/apache2/other_vhosts_access.log | wc -l'
{'mw1222.eqiad.wmnet': '15'}
{'mw1229.eqiad.wmnet': '8'}
{'mw1234.eqiad.wmnet': '12'}
{'mw1224.eqiad.wmnet': '19'}
{'mw1208.eqiad.wmnet': '9'}
{'mw1199.eqiad.wmnet': '22'}
{'mw1226.eqiad.wmnet': '15'}
{'mw1195.eqiad.wmnet': '14'}
{'mw1207.eqiad.wmnet': '17'}
{'mw1228.eqiad.wmnet': '17'}
{'mw1223.eqiad.wmnet': '14'}
{'mw1203.eqiad.wmnet': '14'}
{'mw1193.eqiad.wmnet': '19'}
{'mw1221.eqiad.wmnet': '19'}
{'mw1230.eqiad.wmnet': '11'}
{'mw1235.eqiad.wmnet': '24'}
{'mw1231.eqiad.wmnet': '17'}
{'mw1232.eqiad.wmnet': '15'}
{'mw1200.eqiad.wmnet': '11'}
{'mw1190.eqiad.wmnet': '9'}
{'mw1192.eqiad.wmnet': '13'}
{'mw1197.eqiad.wmnet': '11'}
{'mw1196.eqiad.wmnet': '8'}
{'mw1233.eqiad.wmnet': '11'}
{'mw1227.eqiad.wmnet': '12'}
{'mw1225.eqiad.wmnet': '17'}
{'mw1204.eqiad.wmnet': '16'}
{'mw1205.eqiad.wmnet': '18'}
{'mw1189.eqiad.wmnet': '17'}
{'mw1202.eqiad.wmnet': '14'}
{'mw1206.eqiad.wmnet': '18'}
{'mw1194.eqiad.wmnet': '12'}
{'mw1198.eqiad.wmnet': '17'}
{'mw1191.eqiad.wmnet': '17'}
{'mw1201.eqiad.wmnet': '11'}
{'mw1287.eqiad.wmnet': '0'}
{'mw1282.eqiad.wmnet': '0'}
{'mw1280.eqiad.wmnet': '0'}
{'mw1290.eqiad.wmnet': '0'}
{'mw1276.eqiad.wmnet': '0'}
{'mw1278.eqiad.wmnet': '0'}
{'mw1285.eqiad.wmnet': '0'}
{'mw1286.eqiad.wmnet': '0'}
{'mw1284.eqiad.wmnet': '0'}
{'mw1279.eqiad.wmnet': '0'}
{'mw1283.eqiad.wmnet': '0'}
{'mw1277.eqiad.wmnet': '0'}
{'mw1281.eqiad.wmnet': '0'}
{'mw1289.eqiad.wmnet': '0'}
{'mw1288.eqiad.wmnet': '0'}
The fact that this is not happening on the machines that don't get called by Parsoid from codfw seems to indicate that these requests come from Parsoid specifically.
The rate, given that the logs are two hours old, seems to be about 7/hour/server; across the roughly 35 servers that show these hits, that translates to roughly 250 errors/hour. That probably can't account for everything, but it's a starting point.
After further inspection of the logs, it seems that those are the only 404 errors we get, apart from a few when trying to parse wikitech, which I guess is expected.
So the issue would seem to be in these lines of code I introduced:
https://github.com/wikimedia/parsoid/blob/master/lib/mw/ApiRequest.js#L300-L308
So, the only explanation I could come up with is that somehow
url.parse(options.uri);
returns an object that has no hostname. For example:
console.log(url.parse('/w/api.php').hostname);  // prints: null
Now I am looking at how we could recover when this happens, but I'm also asking myself how this could possibly happen.
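To make the suspected failure mode concrete, here is a standalone sketch using Node's legacy url module (the hostHeaderFor helper is invented and only approximates what the ApiRequest.js lines linked above do): a path-only uri parses to hostname === null, so any Host header derived from it ends up unset, and nginx then serves its default "unconfigured domain" 404.

'use strict';
const url = require('url');

// Absolute URI: the hostname is available.
console.log(url.parse('http://api.svc.eqiad.wmnet/w/api.php').hostname);
// -> 'api.svc.eqiad.wmnet'

// Path-only URI: there is no hostname to extract.
console.log(url.parse('/w/api.php').hostname);
// -> null

// Invented helper, roughly approximating the suspect code path: derive the
// Host header from the parsed uri. With a path-only uri this yields nothing,
// and the request goes out without a usable Host header.
function hostHeaderFor(uri) {
    return url.parse(uri).hostname || undefined;
}

console.log(hostHeaderFor('/w/api.php')); // -> undefined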
Change 328495 had a related patch set uploaded (by Giuseppe Lavagetto):
tlsproxy::localssl: add ability to have an access.log
Change 328497 had a related patch set uploaded (by Mobrovac):
ApiRequest: Clone the request options before modifying them
Change 328497 merged by Mobrovac:
ApiRequest: Clone the request options before modifying them
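For readers following along, the gist of change 328497, going only by its title (the snippet below is a hedged reconstruction with invented names, not the actual patch): if ApiRequest mutates a request-options object that is shared between requests, a change made for one request can leak into the next, for example leaving it with a path-only uri and hence no Host header. Cloning the options before modifying them keeps each request's changes local.

'use strict';

// Hypothetical shared defaults, reused across requests.
const defaultOptions = {
    uri: 'http://api.svc.eqiad.wmnet/w/api.php',
    headers: { 'User-Agent': 'Parsoid' },
};

// Buggy pattern: mutating the shared object leaks the change into every
// later request built from the same defaults.
function buildOptionsBuggy(options, path) {
    options.uri = path;
    return options;
}

// Fixed pattern: clone first, then modify only the copy.
function buildOptionsFixed(options, path) {
    const copy = Object.assign({}, options, {
        headers: Object.assign({}, options.headers),
    });
    copy.uri = path;
    return copy;
}

buildOptionsFixed(defaultOptions, '/w/api.php');
console.log(defaultOptions.uri); // still 'http://api.svc.eqiad.wmnet/w/api.php'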
Change 328506 had a related patch set uploaded (by Mobrovac):
Bump src to rGPARe7e3a4dce51f (deploy-20161221 src branch)
Change 328506 merged by Mobrovac:
Bump src to rGPARe7e3a4dce51f (deploy-20161221 src branch)
Mentioned in SAL (#wikimedia-operations) [2016-12-21T12:57:28Z] <mobrovac@tin> Starting deploy [parsoid/deploy@dab1f27]: Bug fix for mwApiServer T153797
Mentioned in SAL (#wikimedia-operations) [2016-12-21T13:03:01Z] <mobrovac@tin> Finished deploy [parsoid/deploy@dab1f27]: Bug fix for mwApiServer T153797 (duration: 05m 32s)
Change 328495 merged by Giuseppe Lavagetto:
[operations/puppet] tlsproxy::localssl: add ability to have an access.log