Page MenuHomePhabricator

Weird race condition makes query stuck in queued forever
Closed, ResolvedPublic

Description

Tracked down https://quarry.wmflabs.org/query/14096:

MariaDB [quarry]> select * from query where id = 14096;
+-------+---------+-----------------------------------+---------------+---------------------+-----------+-----------+-----------------------------------+
| id    | user_id | title                             | latest_rev_id | last_touched        | parent_id | published | description                       |
+-------+---------+-----------------------------------+---------------+---------------------+-----------+-----------+-----------------------------------+
| 14096 |     286 | Last edit date of pages on enwiki |        198040 | 2017-07-31 16:24:42 |      NULL |         1 | Last edit date of pages on enwiki |
+-------+---------+-----------------------------------+---------------+---------------------+-----------+-----------+-----------------------------------+
1 row in set (0.00 sec)

MariaDB [quarry]> select * from query_run where query_rev_id = 198040;
+--------+--------------+--------+---------------------+--------------------------------------+------------+
| id     | query_rev_id | status | timestamp           | task_id                              | extra_info |
+--------+--------------+--------+---------------------+--------------------------------------+------------+
| 193988 |       198040 |      0 | 2017-07-31 16:24:42 | 3ac485c5-3a2a-4a80-b1e5-8d3dfa7d9781 | NULL       |
+--------+--------------+--------+---------------------+--------------------------------------+------------+
1 row in set (0.22 sec)
zhuyifei1999@quarry-runner-02:/srv/quarry$ grep 3ac485c5-3a2a-4a80-b1e5-8d3dfa7d9781 /var/log/syslog -B 1 -A 11
Jul 31 16:24:01 quarry-runner-02 CRON[29029]: (quarry) CMD (/srv/quarry/quarry/web/killer.py)
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: [2017-07-31 16:24:42,115: ERROR/MainProcess] Task worker.run_query[3ac485c5-3a2a-4a80-b1e5-8d3dfa7d9781] raised unexpected: NoResultFound('No row was found for one()',)
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: Traceback (most recent call last):
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: R = retval = fun(*args, **kwargs)
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 437, in __protected_call__
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: return self.run(*args, **kwargs)
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: File "/srv/quarry/quarry/web/worker.py", line 50, in run_query
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: qrun = conn.session.query(QueryRun).filter(QueryRun.id == query_run_id).one()
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2375, in one
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: raise orm_exc.NoResultFound("No row was found for one()")
Jul 31 16:24:42 quarry-runner-02 celery-quarry-worker[590]: NoResultFound: No row was found for one()
Jul 31 16:25:01 quarry-runner-02 CRON[29649]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

Event Timeline

The possibility of reproducing this issue can be greatly increased by running sudo mysql quarry on quarry-main-01 (i.e. connecting to the database)

(Edit: Likely co-incidence)

Change 401486 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[analytics/quarry/web@master] worker.py: session close take a higher prio, ignore error 2013 in cur close

https://gerrit.wikimedia.org/r/401486

Change 401486 merged by jenkins-bot:
[analytics/quarry/web@master] worker.py: session close take a higher prio, ignore error 2013 in cur close

https://gerrit.wikimedia.org/r/401486

Mentioned in SAL (#wikimedia-cloud) [2018-01-02T10:52:32Z] <zhuyifei1999_> deploying d9cc1c8 to quarry-runner-01 & 02 T172143

zhuyifei1999 claimed this task.

I haven't seen a case of this since.