Execution of maintain-view can create pileups due to metadata locking (was: Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out)
Closed, ResolvedPublic

Description

Reported with T180563

On *.web:

MariaDB [wikidatawiki_p]> SELECT 1 FROM wb_items_per_site LIMIT 1;

....times out.

On *.analytics:

MariaDB [wikidatawiki_p]> SELECT 1 FROM wb_items_per_site LIMIT 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

Other tables in wikidatawiki_p seem to be acting normally.

Event Timeline

Can you reconnect? Servers now go up and down all the time (T179244#3763271). If a connection gets stuck, it is your (the client's) responsibility to try to reconnect at least once.

Still timing out on my end. I discovered this issue because someone reported that the ArticleInfo tool in XTools was timing out; see T180563. I ran the queries above manually on Toolforge.

And again, it seems it's only the wikidatawiki_p.wb_items_per_site table that's affected. Very odd!

It should work now. This was due to the maintain-views script getting blocked by long-running selects. While the script waited for its metadata lock, its pending request in turn blocked every new select on that table.

| 19385953 | <maintain views user>   | localhost         | NULL               | Query   |   64170 | Waiting for table metadata lock          
                    DEFINER=viewmaster
                    VIEW `wikidatawiki_p`.` |    0.000 |
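
A pileup like this can be spotted by filtering the process list for sessions in the state shown above. The following is only a minimal sketch: it assumes the rows have already been fetched (for example from SHOW PROCESSLIST) as dictionaries keyed by column name, and the helper name `metadata_lock_waiters` is hypothetical.

```python
# State string exactly as it appears in the processlist row above.
MDL_STATE = "Waiting for table metadata lock"


def metadata_lock_waiters(rows, min_seconds=0):
    """Return processlist rows waiting on a metadata lock, longest wait first.

    `rows` is assumed to be a list of dicts with at least the "State" and
    "Time" columns of SHOW PROCESSLIST; "Time" may be None.
    """
    def wait_time(row):
        return row.get("Time") or 0

    waiting = [
        row for row in rows
        if row.get("State") == MDL_STATE and wait_time(row) >= min_seconds
    ]
    return sorted(waiting, key=wait_time, reverse=True)
```

Called with `min_seconds=60`, for instance, it would return only the sessions that have been stuck for a minute or more, which makes a long-blocked view definition stand out from the selects queued behind it.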

This is a bug in the maintain-views script: it should lower all timeouts when creating views and retry or abort early.
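
As a rough illustration of "lower the timeouts and retry or abort early", here is a minimal sketch. It is not the actual maintain-views code: the function name, the default values and the `LockTimeout` exception are assumptions made for the example, and `cursor` stands for any DB-API style cursor. The only server feature relied on is the `lock_wait_timeout` session variable, which bounds how long a statement waits for a metadata lock.

```python
import time


class LockTimeout(Exception):
    """Hypothetical stand-in for the driver's lock-wait-timeout error."""


def run_ddl_with_timeout(cursor, ddl, lock_timeout_s=10, retries=3,
                         backoff_s=5, sleep=time.sleep):
    """Run one DDL statement without waiting indefinitely for a metadata lock.

    Returns the attempt number that succeeded; re-raises LockTimeout once
    `retries` attempts have failed, so the caller can abort early.
    """
    # Bound the metadata lock wait for this session only.
    cursor.execute("SET SESSION lock_wait_timeout = %d" % int(lock_timeout_s))
    for attempt in range(1, retries + 1):
        try:
            cursor.execute(ddl)
            return attempt
        except LockTimeout:
            if attempt == retries:
                raise  # give up instead of piling up behind long selects
            sleep(backoff_s)  # let the blocking queries finish, then retry
```

With a short timeout the view definition fails fast instead of sitting in the queue, so selects on the table are blocked for at most a few seconds per attempt rather than for hours.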

jcrespo triaged this task as High priority.Nov 15 2017, 4:00 PM
jcrespo added subscribers: bd808, chasemp.

This is almost an "Unbreak Now!", as running the script will likely cause an OOM due to large pileups.

Confirmed to be working! Thanks!

We will leave the ticket open so the root cause can be evaluated by the cloud team. For reference, these are the parameters we set when we do schema changes:

https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/dbtools/osc_host.sh;7cd7038576da6932275e0f9fee4e8a01edc69ec1$132

jcrespo renamed this task from Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out to Execution of maintain-view can crate pileups due to metadata locking (was: Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out).Nov 15 2017, 4:05 PM
jcrespo renamed this task from Execution of maintain-view can crate pileups due to metadata locking (was: Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out) to Execution of maintain-view can create pileups due to metadata locking (was: Queries to wikidatawiki_p.wb_items_per_site on *.web.db.svc.eqiad.wmflabs are timing out).

My plan is to update maintain-views to apply the same idea, using the schema-change parameters linked above as a reference.

Change 391586 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] maintain-views: implement connection timeouts for views creation

https://gerrit.wikimedia.org/r/391586

Change 391586 merged by Rush:
[operations/puppet@production] maintain-views: implement connection timeouts for views creation

https://gerrit.wikimedia.org/r/391586

I would close this as resolved unless we notice more ongoing issues. Refinements are still needed (error handling, etc.), but those should be handled in other tasks.