
Long query running on dbstore1005:3318
Closed, Resolved · Public

Description

This query has been running for 42 days already (as of today) on dbstore1005:3318 (wikidatawiki) coming from stat1007:

|  298648 | research        | 10.64.21.118:38696 | wikidatawiki       | Query   | 3680610 | Sending data                     | SELECT page_title, page_touched, page_latest
FROM page
WHERE page_title IN
(
SELECT DISTINCT CONCAT('P', wbpt_property_id) as page_title
FROM wbt_property_terms
LEFT JOIN wbt_term_in_lang ON wbpt_term_in_lang_id = wbtl_id
LEFT JOIN wbt_type ON wbtl_type_id = wby_id
LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
LEFT JOIN wbt_text ON wbxl_text_id = wbx_id
WHERE wby_name = 'label' AND wbx_text IS NULL ORDER BY wbpt_property_id
)

It sometimes causes lag on dbstore1005:3318 but nothing too worrying - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fops&var-server=dbstore1005&var-port=13318&from=1580173362981&to=1580259762982
I am not sure if the query will ever finish, but just reporting this in case you want to report it to the user.
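For reference, a query stuck like this can be spotted from the console with something along these lines (just a sketch; the connection details are placeholders for the 3318 instance, only the day threshold is illustrative):

# Sketch: list queries that have been running for more than a day.
# Host/port are placeholders; adjust for dbstore1005:3318.
mysql -h dbstore1005.eqiad.wmnet -P 3318 -e "
  SELECT id, user, db, time, state, LEFT(info, 80) AS query_start
  FROM information_schema.processlist
  WHERE command = 'Query' AND time > 86400
  ORDER BY time DESC;"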

I am going to disable notifications for this specific check on Icinga (the slave lag check only), to avoid the noise on #wikimedia-operations.
Again, not a big deal for us (DBAs), but reporting it just in case you want to follow up.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-01-29T01:05:37Z] <marostegui> Disable notifications for dbstore1005:3318 slave lag - T243871

This is probably

analyti+ 145527 98.8  4.3 3283004 2843800 ?     R    03:00 204:04 /usr/bin/php /srv/analytics-wmde/graphite/src/scripts/src/betafeatures/counts.php

@Ladsgroup @Addshore can you check whenever you have time?

/srv/analytics-wmde/graphite/src/scripts/src/betafeatures/counts.php

should not be doing a query like this one in the description.

It sounds more like one I have run at some point that for some reason didn't end / something went wrong, etc.
I'll take a look.

It definitely doesn't come from the script mentioned, and I am definitely not running anything.

It looks like we do have some issues with long running scripts that we should tackle:

addshore@stat1007:~$ ps aux | grep wmde
analyti+ 118659  0.0  0.0   4276   768 ?        Ss   12:00   0:00 /bin/sh -c time /srv/analytics-wmde/graphite/src/scripts/cron/daily.12.sh /srv/analytics-wmde/graphite/src/scripts >> /srv/analytics-wmde/graphite/log/daily.12.log 2>&1
analyti+ 118663  0.0  0.0   4036   592 ?        S    12:00   0:00 time /srv/analytics-wmde/graphite/src/scripts/cron/daily.12.sh /srv/analytics-wmde/graphite/src/scripts
analyti+ 118681  0.0  0.0  11172  3032 ?        S    12:00   0:00 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/daily.12.sh /srv/analytics-wmde/graphite/src/scripts
analyti+ 118689 99.4  0.0 234224 25356 ?        R    12:00   7:01 /usr/bin/php /srv/analytics-wmde/graphite/src/scripts/src/wikidata/apiLogScanner.php
addshore 136955  0.0  0.0  12780   980 pts/0    S+   12:07   0:00 grep wmde
analyti+ 145427  0.0  0.0   4276   708 ?        Ss   03:00   0:00 /bin/sh -c time /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts >> /srv/analytics-wmde/graphite/log/daily.03.log 2>&1
analyti+ 145433  0.0  0.0   4036   648 ?        S    03:00   0:00 time /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts
analyti+ 145438  0.0  0.0  19376  6960 ?        S    03:00   0:00 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts
analyti+ 145522  0.0  0.0  11180  2000 ?        S    03:00   0:00 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts
analyti+ 145527 99.3  4.3 3337996 2885632 ?     R    03:00 543:46 /usr/bin/php /srv/analytics-wmde/graphite/src/scripts/src/betafeatures/counts.php
analyti+ 188017  0.0  0.0   4276   712 ?        Ss   Jan28   0:00 /bin/sh -c time /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts >> /srv/analytics-wmde/graphite/log/daily.03.log 2>&1
analyti+ 188022  0.0  0.0   4036   632 ?        S    Jan28   0:00 time /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts
analyti+ 188054  0.0  0.0  19376  7148 ?        S    Jan28   0:00 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts
analyti+ 188128  0.0  0.0  11180  2124 ?        S    Jan28   0:00 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/daily.03.sh /srv/analytics-wmde/graphite/src/scripts
analyti+ 188134 99.6  4.5 3435404 3014316 ?     R    Jan28 1979:20 /usr/bin/php /srv/analytics-wmde/graphite/src/scripts/src/betafeatures/counts.php
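A quicker way to spot these next time (just a sketch, not what I actually ran) is to sort the processes by elapsed time:

# Sketch: longest-running analytics-wmde processes first.
ps -o pid,etimes,args -u analytics-wmde --sort=-etimes | head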

Did:

addshore@stat1007:~$ sudo -u analytics-wmde kill 188134
addshore@stat1007:~$ sudo -u analytics-wmde kill 145527
addshore@stat1007:~$ sudo -u analytics-wmde kill 118689

No long-running processes left.
But this still doesn't cover the query in the description of this ticket.

Created T243894 to handle these long-running PHP scripts.

Thanks - do you want me to kill the query? It is still running there.

elukey claimed this task.

After a chat with @Marostegui, I decided to kill the query, since it seems that some optimization is needed (it has been running for 40+ days). Also, it is a SELECT and can be re-run if needed :)
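For the record, the statement itself (thread id 298648 from the processlist above) can be killed with something like this; the connection details are placeholders, only the id comes from this ticket:

# Sketch: terminate the statement but keep the client connection open.
# A plain "KILL 298648" would drop the connection as well.
mysql -h dbstore1005.eqiad.wmnet -P 3318 -e "KILL QUERY 298648;"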

Thanks a lot @Addshore for the follow up!

Mentioned in SAL (#wikimedia-operations) [2020-01-31T16:59:14Z] <marostegui> Re-enable notifications on the dbstore1005:3318 check T243871

After the query was killed, no more replication lag has occurred.

dbstore.png (screenshot attachment, 94 KB)