
High (2-3x) write and connection load on enwiki databases
Closed, Resolved · Public · PRODUCTION ERROR

Description

The extra writes come from the new job queue, specifically the refreshLinks job. The deploy and the load-increase timestamps match almost perfectly:

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1052&var-port=9104

This was later proven by lowering the concurrency, which immediately affected the databases.

The extra writes by themselves are not a huge problem (even at 2-3 times the normal rate, they are relatively small for a database server), but they are causing disk slowdowns, and there is extra load in connections/s, which makes reads queue up and in some cases fail to check replication lag. With 100-200 connections per server, instead of the usual <100, queries get queued for longer, affecting the overall performance of all queries.

Event Timeline

jcrespo created this task. Mar 8 2018, 12:38 PM
Restricted Application added a project: Analytics. Mar 8 2018, 12:38 PM
Restricted Application added a subscriber: Aklapper.
jcrespo triaged this task as High priority. Mar 8 2018, 12:39 PM
mobrovac edited projects, added Services (doing); removed Services. Mar 8 2018, 3:30 PM
mobrovac added a subscriber: mobrovac.

After lowering the concurrency the number of new connections dropped slightly, but it's still high. Also, with the lower concurrency the mean execution time of refreshLinks jobs almost halved.

jcrespo moved this task from Triage to In progress on the DBA board. Mar 8 2018, 4:54 PM
fdans moved this task from Incoming to Radar on the Analytics board. Mar 8 2018, 6:02 PM
jcrespo updated the task description. Mar 9 2018, 9:01 AM

The lower concurrency is better, but the problem is still ongoing: the load is too "bursty", with moments where many connections hit the master at the same time. That is to be avoided; background writes are supposed to arrive with a much smoother profile, because the first "enemy" of storage systems is not high load but unpredictable load. That is impossible to achieve for interactive requests, but it must be done for async requests. The old queue system achieved it: connections to the master varied between 39 and 59 https://grafana.wikimedia.org/dashboard/db/mysql?var-server=db1052&var-port=9104&var-dc=eqiad%20prometheus%2Fops&panelId=9&fullscreen&orgId=1&from=1520327033235&to=1520413433235 , but now they vary between 35 and 235. Again, the largest overhead is not the extra writes (although I wonder why those grew), but the extra new connections per second. If that continues, maybe I can put a hard limit on concurrency for jobrunners so it doesn't affect the web requests.
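(For illustration only: a minimal TypeScript pacing sketch of the "smooth, predictable background load" idea. The rate and helper names are assumptions, not the jobrunners' actual dispatch mechanism.)

```
// Minimal pacing sketch: submit async jobs at a steady rate instead of
// in bursts. The rate and helper names are illustrative assumptions.

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function dispatchSmoothly(
  jobs: Array<() => Promise<void>>,
  jobsPerSecond: number
): Promise<void> {
  const intervalMs = 1000 / jobsPerSecond;
  for (const job of jobs) {
    // Fire the job without awaiting its completion...
    void job().catch(err => console.error('job failed', err));
    // ...but space out submissions evenly so the master sees a flat,
    // predictable connection rate rather than spikes.
    await sleep(intervalMs);
  }
}

// Example: 1000 no-op jobs paced at 50 jobs/second.
const demoJobs = Array.from({ length: 1000 }, () => async () => {});
void dispatchSmoothly(demoJobs, 50);
```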

On a different note, is it normal to still get errors from 127.0.0.1, which I think points to the older queue? I thought everything had been migrated to the new one? Is there a possibility we are executing the same jobs twice?


The jobs are being migrated one by one, so many high-traffic use cases have been migrated, but there are still quite a few use cases on the old queue.

As for the connection number, the problem is that the old system worked in a different fashion: the jobrunner was instructed to read and execute jobs from the queue for a certain period of time, and all of the jobs in one run reused the DB connection. The new system works differently: the jobs are submitted to jobrunners one-by-one and that means each job establishes a new connection.
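(Roughly, the difference between the two models looks like the sketch below. `Job`, `DbConnection` and `connectTo` are illustrative stand-ins, not the real MediaWiki or change-propagation APIs.)

```
// Illustrative stand-ins; not the real MediaWiki/change-propagation APIs.
interface DbConnection { close(): Promise<void>; }
interface Job { run(conn: DbConnection): Promise<void>; }

async function connectTo(master: string): Promise<DbConnection> {
  // Stub; in reality opening the connection is the expensive part.
  return { close: async () => {} };
}

// Old jobrunner model: one run holds a single connection and drains
// jobs from the queue for a fixed time slice; all jobs reuse it.
async function runOldStyle(jobs: Job[], master: string): Promise<void> {
  const conn = await connectTo(master);   // one connection per run
  try {
    for (const job of jobs) {
      await job.run(conn);
    }
  } finally {
    await conn.close();
  }
}

// New model: jobs are submitted one by one, so every execution opens
// and tears down its own connection to the master.
async function runNewStyle(job: Job, master: string): Promise<void> {
  const conn = await connectTo(master);   // one connection per job
  try {
    await job.run(conn);
  } finally {
    await conn.close();
  }
}
```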

I'm not quite sure if it's possible to do some kind of connection pooling on the runners' side, but what's definitely possible is to batch the submitted jobs: say, accumulate N jobs, submit them all together, reuse a single connection while executing all of them, and return an array of individual job results.
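(A hypothetical sketch of such a batch executor; the types, names and error handling are assumptions rather than a proposed design.)

```
// Hypothetical batch executor: run N jobs over one connection and
// report a result per job so the queue can retry failures individually.
interface DbConnection { close(): Promise<void>; }
interface Job { run(conn: DbConnection): Promise<void>; }

async function connectTo(master: string): Promise<DbConnection> {
  return { close: async () => {} };   // stub; a real one opens a MySQL connection
}

type JobResult = { ok: true } | { ok: false; error: string };

async function executeBatch(jobs: Job[], master: string): Promise<JobResult[]> {
  const conn = await connectTo(master);   // single connection for the whole batch
  const results: JobResult[] = [];
  try {
    for (const job of jobs) {
      try {
        await job.run(conn);
        results.push({ ok: true });
      } catch (e) {
        // A failing job does not abort the batch; it gets its own
        // failure entry so only that job needs to be retried.
        results.push({ ok: false, error: String(e) });
      }
    }
  } finally {
    await conn.close();
  }
  return results;
}
```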

The super-high exec rate you're seeing right now should, in theory, go away soon (no idea when exactly though) because it's processing some backlog that was created during the refreshLinks job switchover.

Thanks, that last comment was indeed *very very* useful.

one-by-one and that means each job establishes a new connection

Let's wait for the backlog to clear, but that definitely needs to change: either in code, by grouping those executions more intelligently, or by putting something like ProxySQL in front to pool connections. Connection overhead was already an issue on the old jobrunner, reaching 2 or 3 seconds to connect under some race conditions, and this definitely made it worse. No more than 20-50 threads should be connecting to the master at the same time. As I said, even if the backlog clears, future issues may require more concurrency, and we should be able to provide that without incurring such a high connection overhead. A proxy/connection pool is something we are going to use for cross-DC connections anyway, so it was already in the backlog.
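(The "no more than 20-50 threads connecting at the same time" idea amounts to a gate like the sketch below; the limit of 40 is illustrative, and in practice this role would be played by something like ProxySQL or a pool inside the runners rather than hand-rolled code.)

```
// Rough sketch: a semaphore in front of master connections so at most
// N jobs hold (or are opening) a connection at once.
class Semaphore {
  private waiting: Array<() => void> = [];
  constructor(private slots: number) {}

  async acquire(): Promise<void> {
    if (this.slots > 0) {
      this.slots--;
      return;
    }
    await new Promise<void>(resolve => this.waiting.push(resolve));
  }

  release(): void {
    const next = this.waiting.shift();
    if (next) {
      next();            // hand the freed slot directly to a waiter
    } else {
      this.slots++;
    }
  }
}

const masterGate = new Semaphore(40);   // illustrative cap within the 20-50 range

async function withMasterSlot<T>(work: () => Promise<T>): Promise<T> {
  await masterGate.acquire();
  try {
    return await work();   // open the DB connection and run the job here
  } finally {
    masterGate.release();
  }
}
```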

A proxy/connection pool is something we are going to use for cross-DC connections anyway, so it was already in the backlog.

Great! I think it would be much better to do it on your side of the "fence", because intelligent batching with reliable retries is not an easy problem to solve in code.

I think it would be much better to do it on your side of the "fence"

I can own this, no problem, but if I do, I will ask to revert the change first because it is causing issues.

Change 418928 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Lower refreshLinks concurrency to 120.

https://gerrit.wikimedia.org/r/418928

Change 418928 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Lower refreshLinks concurrency to 120.

https://gerrit.wikimedia.org/r/418928

The change that partitioned the refreshLinks topic in line with MySQL sharding has been deployed. Now we just need to wait for the data to accumulate to see if the situation with spikiness has improved.
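(For context, a hypothetical sketch of how partitioning "in line with MySQL sharding" bounds per-master concurrency; the wiki-to-section map and the one-partition-per-section layout are illustrative, not the deployed configuration.)

```
// Hypothetical mapping of wiki databases to MySQL sections; the real
// s1-s8 assignment lives in production configuration, not here.
const SECTION_OF_WIKI: Record<string, string> = {
  enwiki: 's1',
  commonswiki: 's4',
  wikidatawiki: 's8',
  // ...remaining wikis omitted
};

const SECTIONS = ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8'];

// One topic partition per section: every job that writes to a given
// master lands in the same partition, so a per-partition concurrency
// limit becomes a per-master connection limit.
function partitionFor(wikiDb: string): number {
  const section = SECTION_OF_WIKI[wikiDb] ?? 's3';  // illustrative default section
  return SECTIONS.indexOf(section);
}

console.log(partitionFor('enwiki'));   // 0 -> handled with s1's own concurrency limit
```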

Relevant dashboards to monitor (for posterity):

From my point of view, the problem has gone:
https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?panelId=9&fullscreen&orgId=1&from=1519199874092&to=1521791874092&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-shard=s8&var-role=master

This doesn't mean the issues are 100% fixed, or that the model is perfect and needs no refinement, but the most immediate issue (stalls due to high-concurrency connections to the masters) is gone. There was a high spike on s4 on the 21st at 22h, but it is probably unrelated.

As a separate task, for the long term, something should (probably) be done about the efficiency of the jobs: I suspect that with the new model they spend more time connecting to and disconnecting from databases than actually doing work, but that was not in the scope of this ticket.

jcrespo moved this task from In progress to Done on the DBA board. Mar 23 2018, 11:55 AM
jcrespo moved this task from Untriaged to Debug-only on the Wikimedia-production-error board.
mobrovac closed this task as Resolved. Mar 23 2018, 8:28 PM
mobrovac assigned this task to Pchelolo.
mobrovac edited projects, added Services (done); removed Patch-For-Review, Services (doing).

I agree that all of the issues have been fixed, but to my understanding the scope of this task is to be able to control the number of connections to each shard, which is now possible. Resolving.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:09 PM
Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM