
dbstore1002 (analytics store) enwiki lag due to blocking query
Closed, Resolved · Public

Description

I believe this query is causing lag on dbstore1002:

/usr/bin/php /srv/analytics-wmde/graphite/src/scripts/src/betafeatures/counts.php
     Id: 47844624
    User: research
    Host: 10.64.53.30:34428
      db: NULL
 Command: Query
    Time: 67
   State: Sending data
    Info: INSERT IGNORE INTO staging.wmde_analytics_betafeature_users_today ( user_name, feature ) SELECT user_name, up_property FROM user_properties JOIN enwiki.user ON up_user = user_id WHERE up_property = 'uls-compact-links' AND up_value = '1'

I have converted all Aria tables to InnoDB, and that should fix the issue, but this is a heads up to never do INSERT...SELECT on Aria, MyISAM, MEMORY or any non-MVCC tables, as reads will get writes (from replication).
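For reference, a hedged sketch of the kind of conversion described above (not the exact commands that were run), assuming a pre-existing PDO connection $dbh with rights on the staging database:

<?php
// Find remaining Aria tables in the staging database and rewrite them as
// InnoDB, so later INSERT...SELECT statements can rely on multi-versioning.
// $dbh is an assumed PDO connection; adjust to whatever client the real
// scripts use.
$tables = $dbh->query(
    "SELECT TABLE_NAME FROM information_schema.TABLES
      WHERE TABLE_SCHEMA = 'staging' AND ENGINE = 'Aria'"
)->fetchAll( PDO::FETCH_COLUMN );

foreach ( $tables as $table ) {
    $dbh->exec( "ALTER TABLE staging.`$table` ENGINE=InnoDB" );
}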

Event Timeline

Actually, I am not sure if it is that script; there is another one happening at 3am, too, that seems to block the queries:

47744154 | research        | 10.64.53.30:46794 | log                | Sleep   |   17419

> as reads will get writes (from replication)

I don't quite follow this, what exactly does this mean?

> Actually, I am not sure if it is that script; there is another one happening at 3am, too, that seems to block the queries:

This script runs a couple of queries, so could it be another query from the same script?
Is it possible to see what query / script is behind the connection mentioned in your comment?

Missing word: "reads will get [blocked] by writes (from replication) [on non-transactional engines]."

> Is it possible to see what query / script is behind the connection mentioned in your comment?

Not until tomorrow at 3am UTC. It was an idle connection probably with an open transaction.

So, looking at the script, the same query gets run for 9 different beta features on a cron at roughly 3am, so it is highly possible the query in your comment is the same query again.

There is one more query in the same script that does INSERT...SELECT, I can look at getting rid of these.

Actually, I converted the tables to InnoDB already, so nothing needs to be done unless there is still some non-obvious interaction that makes the lag not go away (metadata locking, inter-query locking, or something else). In theory nothing is to be done until we check that the lag doesn't happen again. This was just a heads up to not create Aria tables, ever! :-D

Awesome! thanks for the heads up!

I believe this, or something similar in spirit, could be happening now for s5. I need to look more into it to identify it.

jcrespo changed the task status from Open to Stalled.Oct 20 2017, 9:27 AM

I think this could still be happening once a week, probably due to some cron job.

> I think this could still be happening once a week, probably due to some cron job.

Hmm, only once a week?

This is happening at around 2-3 GMT on Sundays; potential candidates:

INSERT IGNORE INTO staging.wmde_analytics_betafeature_users_today ( user_name, feature ) SELECT user_name, up_property FROM enwiki.user_properties JOIN enwiki.user ON up_user = user_id WHERE up_property = 'uls-compact-links' AND up_value = '1'

CREATE TEMPORARY TABLE staging.active_user_changes AS ( SELECT rc_user_text AS user, COUNT( * ) as changes FROM wikidatawiki.recentchanges WHERE rc_user != 0 AND rc_bot = 0 AND ( rc_log_type != 'newusers' OR rc_log_type IS NULL) AND UNIX_TIMESTAMP(rc_timestamp) >= UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 30 day)) GROUP BY rc_user_text )

So, both of those queries are mine / WMDE analytics scripts.
Both of those queries run every day; could it be that the DB has extra load on Sundays 2-3 GMT? I can alter them to run at different times.

How much lag are we talking?

jcrespo triaged this task as Unbreak Now! priority.Feb 26 2018, 2:22 PM

This query is right now blocking enwiki replication:

      Id: 31132883
    User: research
    Host: 10.64.53.30:46264
      db: NULL
 Command: Query
    Time: 502
   State: Queried about 50000 rows
    Info: INSERT IGNORE INTO staging.wmde_analytics_betafeature_users_today ( user_name, feature ) SELECT user_name, up_property FROM enwiki.user_properties JOIN enwiki.user ON up_user = user_id WHERE up_property = 'popups' AND up_value = '1'
Progress: 0.000

This is the same script as in the previous comments (just running at a different time due to a failed cron run last night).

Over 1700 seconds and going.

Seconds_Behind_Master: 1746

This is because the origin table is not InnoDB, so it cannot be done using multi-versioning. While it is planned to be migrated (T159423), for the time being this should be changed into two queries: one select to memory and one insert from memory, so replication is unaffected.
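A minimal sketch of that split, assuming the script talks to the database through PDO ($dbh below is a placeholder connection) and that a batch size of 1000 is acceptable; variable names are illustrative, not taken from counts.php:

<?php
// 1) Plain SELECT: the result set is fetched into PHP memory, so no
//    cross-engine INSERT...SELECT statement runs on the replica.
$select = $dbh->prepare(
    "SELECT user_name, up_property
       FROM enwiki.user_properties
       JOIN enwiki.user ON up_user = user_id
      WHERE up_property = ? AND up_value = '1'"
);
$select->execute( [ 'uls-compact-links' ] );
$rows = $select->fetchAll( PDO::FETCH_NUM );

// 2) INSERT from memory, in small batches, so no single long-running
//    write statement holds up the replication thread.
$insert = $dbh->prepare(
    "INSERT IGNORE INTO staging.wmde_analytics_betafeature_users_today
         ( user_name, feature ) VALUES ( ?, ? )"
);
foreach ( array_chunk( $rows, 1000 ) as $batch ) {
    $dbh->beginTransaction();
    foreach ( $batch as $row ) {
        $insert->execute( $row );
    }
    $dbh->commit();
}

With this shape, each write statement stays short, so the replication SQL thread only ever waits behind small inserts instead of one long INSERT...SELECT.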

> one select to memory and one insert from memory, so replication is unaffected.

Can do

You say here T172410#4001008 that the change to InnoDB/multi-instance will not affect you, but I am not 100% sure of that. We can keep the staging databases, but they can only be either multiple or just one on a separate host; changing the statistics gathering method to the one I suggest will make sure you are not affected.

> You say here T172410#4001008 that the change to InnoDB/multi-instance will not affect you, but I am not 100% sure of that. We can keep the staging databases, but they can only be either multiple or just one on a separate host; changing the statistics gathering method to the one I suggest will make sure you are not affected.

True, this select & insert does actually span multiple DBs. It didn't pop into my head as it is not a JOIN.
Reworking it to select to memory and insert after should resolve this issue also.
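The same pattern for the recentchanges aggregation quoted earlier could look roughly like this; again $dbh is an assumed PDO handle, and the temporary table's column types are assumptions for this sketch rather than values taken from the real script:

<?php
// Aggregate on the replica and fetch the (small) result into PHP memory.
$rows = $dbh->query(
    "SELECT rc_user_text AS user, COUNT(*) AS changes
       FROM wikidatawiki.recentchanges
      WHERE rc_user != 0 AND rc_bot = 0
        AND ( rc_log_type != 'newusers' OR rc_log_type IS NULL )
        AND UNIX_TIMESTAMP(rc_timestamp) >= UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 30 DAY))
      GROUP BY rc_user_text"
)->fetchAll( PDO::FETCH_NUM );

// Write the aggregate into staging from memory instead of a cross-database
// CREATE TEMPORARY TABLE ... AS SELECT.
$dbh->exec(
    "CREATE TEMPORARY TABLE staging.active_user_changes
         ( user VARBINARY(255) NOT NULL, changes INT UNSIGNED NOT NULL )"
);
$insert = $dbh->prepare(
    "INSERT INTO staging.active_user_changes ( user, changes ) VALUES ( ?, ? )"
);
foreach ( $rows as $row ) {
    $insert->execute( $row );
}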

jcrespo changed the task status from Stalled to Open.Feb 26 2018, 2:59 PM
jcrespo assigned this task to Addshore.
jcrespo lowered the priority of this task from Unbreak Now! to Medium.

It recovered:
[15:06] <icinga-wm> PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 792.21 seconds
[15:43] <icinga-wm> RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 287.79 seconds

but I am going to guess the underlying problem still exists. We were in the red for 40 minutes, and I bet this was the query waking me up :-)

I have flagged up this ticket to be tackled by one of the WMDE teams whose sprints will start next week.

@jcrespo I'm removing Research tag from this task. If the issue persists and you suspect it's Research team related, please add us back and ping.

Not happening at the moment.