See the graphs:
20:30 is when we deployed several new things (Celery 4, etc.), but it's only affecting codfw and not eqiad...
Mentioned in SAL (#wikimedia-operations) [2018-11-05T12:10:20Z] <ladsgroup@deploy1001> Started deploy [ores/deploy@096ffb3]: T208577 T181632 T208608
Mentioned in SAL (#wikimedia-operations) [2018-11-05T12:33:32Z] <ladsgroup@deploy1001> Finished deploy [ores/deploy@096ffb3]: T208577 T181632 T208608 (duration: 22m 58s)
I looked at the data and especially the logs. The problem is caused by the very small number of requests in codfw: non-precaching requests on codfw are only about 30 req/min, or 4 req/min per node. So when even one request gets throttled, it causes a jump in the 75th percentile poolcounter lock time of that node, which is what shows up in the graphs. If you look at the uwsgi logs and response times on codfw, they're fine.
@Bpeynetti: How many parallel connections do you make to ORES? Do you know that at most 4 parallel connections per IP are allowed?
2 separate servers, each making 4 parallel requests at a time. The additional server was added recently (~7 days ago), but stopping all requests from the new server for >12 hours did not have any effect on response time.
@Ladsgroup I am still seeing elevated response times; not much has changed in latency over the past 48 hours. Do you have any insight or suggestions?
@Bpeynetti: If you're requesting scores for edits that happened recently, that's definitely an issue on our side with the caching. If you're requesting scores for old revisions, the response for any batch is only as fast as its slowest score, which means that if you parallelize your requests into smaller batches (instead of 4 parallel connections with batches of 50, send 8 parallel requests from two IPs with batches of 25), that would definitely reduce your average response time.
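To make the smaller-batch idea concrete, here is a rough Python sketch. It assumes the ORES v3 scores endpoint and the `damaging` model on enwiki; the batch size, worker count, and function names are illustrative only, not part of this discussion.

```python
# Rough sketch of the smaller-batch approach described above.
# Assumes the ORES v3 scores endpoint and the "damaging" model on enwiki;
# batch size and worker count are illustrative, not ORES recommendations.
from concurrent.futures import ThreadPoolExecutor

import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"
BATCH_SIZE = 25    # smaller batches: each response is only as slow as its slowest score
MAX_PARALLEL = 4   # stay within the 4-connections-per-IP limit mentioned above


def score_batch(rev_ids):
    """Fetch scores for one batch of revision IDs."""
    params = {"models": "damaging", "revids": "|".join(str(r) for r in rev_ids)}
    resp = requests.get(ORES_URL, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()


def score_all(rev_ids):
    """Split rev_ids into small batches and score them in parallel."""
    batches = [rev_ids[i:i + BATCH_SIZE] for i in range(0, len(rev_ids), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        return list(pool.map(score_batch, batches))
```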
The underlying reason for this elevation is that we changed the way we distribute the scoring of each revision from -Ofast to -Ofair in Celery. That utilizes our resources in a better way and is fair to all of our clients. Since your requests are among the heavy ones (a batch of 50 revisions is a heavy request), they see an increase in response time, while on the other hand other types of requests are seeing a huge decrease in response time: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=15&fullscreen&orgId=1&from=now-7d&to=now-1m (more like capitalism vs. socialism)
HTH
@Ladsgroup Thank you for the quick response and for the explanation!
All of the requests are for edits that happened recently (within the last few hours, 99% of the time for edits that happened within <10 minutes). This last weekend, because of the increase in latency, the requests backed up and now the services are asking for older (~2 day old) edits. If I understand this correctly, this could have a compounding effect since 1) the client is sending batches of 50 which will see an increase in latency and 2) older edits might not be cached as much as recent edits, and if even just one is not cached, the response time for that one batch will be much slower.
Will take into consideration distributing the requests across more IPs and sending smaller batches in each.
If it's recent edits (by recent I mean within the last month, basically), then unless they were made by a bot, ORES should respond pretty quickly (<1 second for a batch of 50). I will double-check the logs.
I checked the logs; lots of them are edits made by bots, so the cache is behaving as expected (bot edits are not precached). I would recommend omitting bot edits. You can also get the scores through other means, like the RC stream and the ORES extension on MediaWiki.
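As a rough sketch of the RC-stream suggestion (not an official client), the snippet below assumes the Wikimedia EventStreams `recentchange` feed and the third-party `sseclient` package; the revision IDs it yields could then be scored in small batches as in the earlier sketch.

```python
# Hedged sketch of the RC-stream approach: read Wikimedia's EventStreams
# recentchange feed, skip bot edits, and collect revision IDs to score.
import json

from sseclient import SSEClient  # pip install sseclient

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"


def non_bot_revisions(wiki="enwiki"):
    """Yield revision IDs of non-bot edits on the given wiki."""
    for event in SSEClient(STREAM_URL):
        if event.event != "message" or not event.data:
            continue  # skip keepalive events
        change = json.loads(event.data)
        if change.get("wiki") != wiki or change.get("type") != "edit":
            continue
        if change.get("bot"):
            continue  # per the comments above, bot edits miss the precache, so skip them
        yield change["revision"]["new"]
```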
Apparently it's recovering.