
dbstore1002 (analytics store) enwiki lag due to blocking query
Closed, Resolved · Public

Description

I believe this query is causing lag on dbstore1002:

/usr/bin/php /srv/analytics-wmde/graphite/src/scripts/src/betafeatures/counts.php
     Id: 47844624
    User: research
    Host: 10.64.53.30:34428
      db: NULL
 Command: Query
    Time: 67
   State: Sending data
    Info: INSERT IGNORE INTO staging.wmde_analytics_betafeature_users_today ( user_name, feature ) SELECT user_name, up_property FROM user_properties JOIN enwiki.user ON up_user = user_id WHERE up_property = 'uls-compact-links' AND up_value = '1'

I have converted all Aria tables to InnoDB, and that should fix the issue, but this is a heads up to never do INSERT...SELECT on Aria, MyISAM, MEMORY or any non-MVCC tables, as reads will get writes (from replication).
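For reference, a hedged sketch of the kind of conversion described above (not the exact commands that were run), assuming a pre-existing PDO connection $dbh with rights on the staging database:

<?php
// Find remaining Aria tables in the staging database and rewrite them as
// InnoDB, so later INSERT...SELECT statements can rely on multi-versioning.
// $dbh is an assumed PDO connection; adjust to whatever client the real
// scripts use.
$tables = $dbh->query(
    "SELECT TABLE_NAME FROM information_schema.TABLES
      WHERE TABLE_SCHEMA = 'staging' AND ENGINE = 'Aria'"
)->fetchAll( PDO::FETCH_COLUMN );

foreach ( $tables as $table ) {
    $dbh->exec( "ALTER TABLE staging.`$table` ENGINE=InnoDB" );
}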

Event Timeline

Actually, I am not sure if it is that script; there is another one happening at 3am, too, that seems to block the queries:

47744154 | research        | 10.64.53.30:46794 | log                | Sleep   |   17419

> as reads will get writes (from replication)

I don't quite follow this, what exactly does this mean?

> Actually, I am not sure if it is that script; there is another one happening at 3am, too, that seems to block the queries:

This script runs a couple of queries, so could it be another query from the same script?
Is it possible to see what query / script is behind the connection mentioned in your comment?

Missing word: "reads will get [blocked] by writes (from replication) [on non-transactional engines]."

> Is it possible to see what query / script is behind the connection mentioned in your comment?

Not until tomorrow at 3am UTC. It was an idle connection probably with an open transaction.

So, looking at the script, the same query gets run for 9 different beta features on a cron at roughly 3am, so it is highly possible the query in your comment is the same query again.

There is one more query in the same script that does INSERT...SELECT, I can look at getting rid of these.

Actually, I converted the tables to InnoDB already, so nothing needs to be done unless there is still some non-obvious interaction that makes the lag not go away (metadata locking, inter-query locking, or something else). In theory nothing is to be done until we check that the lag doesn't happen again. This was just a heads up to not create Aria tables, ever! :-D

Awesome! thanks for the heads up!

I believe this, or something similar in spirit, could be happening now for s5. I need to look more into it to identify it.

jcrespo changed the task status from Open to Stalled.Oct 20 2017, 9:27 AM

I think this could still be happening once a week, probably due to some cron job.

> I think this could still be happening once a week, probably due to some cron job.

Hmm, only once a week?

This is happening at around 2-3 GMT on Sundays; potential candidates:

INSERT IGNORE INTO staging.wmde_analytics_betafeature_users_today ( user_name, feature ) SELECT user_name, up_property FROM enwiki.user_properties JOIN enwiki.user ON up_user = user_id WHERE up_property = 'uls-compact-links' AND up_value = '1'

CREATE TEMPORARY TABLE staging.active_user_changes AS ( SELECT rc_user_text AS user, COUNT( * ) as changes FROM wikidatawiki.recentchanges WHERE rc_user != 0 AND rc_bot = 0 AND ( rc_log_type != 'newusers' OR rc_log_type IS NULL) AND UNIX_TIMESTAMP(rc_timestamp) >= UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 30 day)) GROUP BY rc_user_text )

So, both of those queries are mine / WMDE analytics scripts.
Both of those queries run every day; could it be that the DB has extra load on Sundays 2-3 GMT? I can alter them to run at different times.

How much lag are we talking?

jcrespo triaged this task as Unbreak Now! priority.Feb 26 2018, 2:22 PM

This query is right now blocking enwiki replication:

      Id: 31132883
    User: research
    Host: 10.64.53.30:46264
      db: NULL
 Command: Query
    Time: 502
   State: Queried about 50000 rows
    Info: INSERT IGNORE INTO staging.wmde_analytics_betafeature_users_today ( user_name, feature ) SELECT user_name, up_property FROM enwiki.user_properties JOIN enwiki.user ON up_user = user_id WHERE up_property = 'popups' AND up_value = '1'
Progress: 0.000

This is the same script as in the previous comments (just running at a different time due to a failed cron run last night).

Over 1700 seconds and going.

Seconds_Behind_Master: 1746

This is because the origin table is not InnoDB, so it cannot be done using multi-versioning. While it is planned to be migrated (T159423), for the time being this should be changed into two queries: one select to memory and one insert from memory, so replication is unaffected.
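A minimal sketch of that split, assuming the script talks to the database through PDO ($dbh below is a placeholder connection) and that a batch size of 1000 is acceptable; variable names are illustrative, not taken from counts.php:

<?php
// 1) Plain SELECT: the result set is fetched into PHP memory, so no
//    cross-engine INSERT...SELECT statement runs on the replica.
$select = $dbh->prepare(
    "SELECT user_name, up_property
       FROM enwiki.user_properties
       JOIN enwiki.user ON up_user = user_id
      WHERE up_property = ? AND up_value = '1'"
);
$select->execute( [ 'uls-compact-links' ] );
$rows = $select->fetchAll( PDO::FETCH_NUM );

// 2) INSERT from memory, in small batches, so no single long-running
//    write statement holds up the replication thread.
$insert = $dbh->prepare(
    "INSERT IGNORE INTO staging.wmde_analytics_betafeature_users_today
         ( user_name, feature ) VALUES ( ?, ? )"
);
foreach ( array_chunk( $rows, 1000 ) as $batch ) {
    $dbh->beginTransaction();
    foreach ( $batch as $row ) {
        $insert->execute( $row );
    }
    $dbh->commit();
}

With this shape, each write statement stays short, so the replication SQL thread only ever waits behind small inserts instead of one long INSERT...SELECT.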

> one select to memory and one insert from memory, so replication is unaffected.

Can do

You say here T172410#4001008 that the change to InnoDB/multi-instance will not affect you, but I am not 100% sure of that. We can keep the staging databases, but they can only be either multiple or just one on a separate host; changing the statistics gathering method to the one I suggest will make sure you are not affected.

> You say here T172410#4001008 that the change to InnoDB/multi-instance will not affect you, but I am not 100% sure of that. We can keep the staging databases, but they can only be either multiple or just one on a separate host; changing the statistics gathering method to the one I suggest will make sure you are not affected.

True, this select & insert does actually span multiple DBs. It didn't pop into my head as it is not a JOIN.
Reworking it to select to memory and insert after should resolve this issue also.
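The same pattern for the recentchanges aggregation quoted earlier could look roughly like this; again $dbh is an assumed PDO handle, and the temporary table's column types are assumptions for this sketch rather than values taken from the real script:

<?php
// Aggregate on the replica and fetch the (small) result into PHP memory.
$rows = $dbh->query(
    "SELECT rc_user_text AS user, COUNT(*) AS changes
       FROM wikidatawiki.recentchanges
      WHERE rc_user != 0 AND rc_bot = 0
        AND ( rc_log_type != 'newusers' OR rc_log_type IS NULL )
        AND UNIX_TIMESTAMP(rc_timestamp) >= UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 30 DAY))
      GROUP BY rc_user_text"
)->fetchAll( PDO::FETCH_NUM );

// Write the aggregate into staging from memory instead of a cross-database
// CREATE TEMPORARY TABLE ... AS SELECT.
$dbh->exec(
    "CREATE TEMPORARY TABLE staging.active_user_changes
         ( user VARBINARY(255) NOT NULL, changes INT UNSIGNED NOT NULL )"
);
$insert = $dbh->prepare(
    "INSERT INTO staging.active_user_changes ( user, changes ) VALUES ( ?, ? )"
);
foreach ( $rows as $row ) {
    $insert->execute( $row );
}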

jcrespo changed the task status from Stalled to Open.Feb 26 2018, 2:59 PM
jcrespo assigned this task to Addshore.
jcrespo lowered the priority of this task from Unbreak Now! to Medium.

It recovered:
[15:06] <icinga-wm> PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 792.21 seconds
[15:43] <icinga-wm> RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 287.79 seconds

but I am going to guess the underlying problem still exists. We were in the red for 40 minutes, and I bet this was the query waking me up :-)

I have flagged up this ticket to be tackled by one of the WMDE teams whose sprints will start next week.

@jcrespo I'm removing Research tag from this task. If the issue persists and you suspect it's Research team related, please add us back and ping.

Not happening at the moment.