Page MenuHomePhabricator

Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them
Closed, ResolvedPublicBUG REPORT

Description

flaggedrevs table is broken in Toolforge database replica

List of steps to reproduce (step by step, including full links if applicable):

  • localhost $ ssh zache-tool@login.toolforge.org
  • tools-sgebastion-07:~$ sql fiwiki_p;
  • MariaDB [fiwiki_p]> select * from flaggedrevs limit 1;

What happens?:

ERROR 1356 (HY000): View 'fiwiki_p.flaggedrevs' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

What should have happened instead?:
return one row from fiwiki_p.flaggedrevs table

When problem started?:
Last known working time has been 13:25, 18 February 2022

Event Timeline

Just a guess, but the referenced missing column(s) might be fr_img_*? T300774: Drop fr_img_* columns

Yes, the views need to be recreated for the wikis that get done on the wikireplicas

Marostegui triaged this task as High priority.
Marostegui added a subscriber: Kormat.

Assigning to @Kormat (as she's doing the schema change) to either recreate them out coordinate with WMCS

-cloud-services, +data-engineering, as apparently the responsibility has moved teams.

nskaggs added a subscriber: nskaggs.

Sorry for any confusion! Thanks for trying to route this to the correct place.

According to the responsibilities matrix, WMCS still has responsibility for this. It is true we are working with Data Persistence to transition our responsibilities, starting with views. See line 20 https://docs.google.com/spreadsheets/d/1xb8aAqTxOJsc2wEL9sXqINpJz1ksqGkhl2MTCExDni0/edit#gid=0. For now, please keep WMCS in the loop on these tickets until that matrix is updated to remove WMCS. I hope this clarifies things.

I went ahead and ran the maintain_views for the single table you specified on clouddb1014 on clouddb1018. It didn't need depooling since it was already broken :)

sudo maintain-views --table flaggedrevs --databases fiwiki

Now you can query the flaggedrevs on fiwiki_p such as:

image.png (1×1 px, 105 KB)

Mentioned in SAL (#wikimedia-analytics) [2022-02-23T23:00:20Z] <razzi> sudo maintain-views --table flaggedrevs --databases fiwiki on clouddb1014.eqiad.wmnet and clouddb1018.eqiad.wmnet for T302233

I don't know what's the status of this anymore as I have been on holidays but this needs to also be run on clouddb1021 (essentially everywhere where the schema change has happened already)

Currently works for me in toolforge db. Thanks to everybody involved for fixing this.

To be clear, this was only fixed on two hosts and on some wikis, but none of the rest, ie:

root@clouddb1021.eqiad.wmnet[frwiki_p]> select * from flaggedrevs limit 1;
ERROR 1356 (HY000): View 'frwiki_p.flaggedrevs' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

List of wikis that have this table: https://noc.wikimedia.org/conf/dblists/flaggedrevs.dblist

They are basically in every section except s4 and s8

What's the status of this? This is causing all queries related to flaggedrevs view to our clouddb* hosts to fail.

I have been working on a new cookbook to update the views here: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/760880

If it is reviewed and all is well, I can run it tomorrow on any of the sections that are ready (If I understand from @Ladsgroup's comment above, sections s1, s2, s3, s5, s6, and s7 should all have their views updated).

I have been working on a new cookbook to update the views here: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/760880

That's great.

If it is reviewed and all is well, I can run it tomorrow on any of the sections that are ready (If I understand from @Ladsgroup's comment above, sections s1, s2, s3, s5, s6, and s7 should all have their views updated).

I think you got it wrong. No views have been updated other than the ones I did on s6 (frwiki, jawiki, ruwiki, labswiki) on only two hosts: T302233#7755918
Essentially, by looking at the original ticket T300774, you need to run the cookbook everywhere.

Yes. It should be run on s1, s2, s3, s5, and s7. You don't need to run it on s8 or s4 because there is no table in databases of those wikis.

Ok cool. Cookbook is merged, but I didn't realize it was a holiday tomorrow so I'll hold off on running things until Monday.

I'm semi-arbitrarily picking s2 to start unless there are objections, where I can see this query currently errors as expected:

razzi@clouddb1018:~$ sudo mysql -S /var/run/mysqld/mysqld.s2.sock -e 'select * from eowiki_p.flaggedrevs limit 1'
ERROR 1356 (HY000) at line 1: View 'eowiki_p.flaggedrevs' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them

Cookbook cookbooks.sre.wikireplicas.update-views run by razzi: Started updating wikireplica views

Cookbook cookbooks.sre.wikireplicas.update-views for section s2 started by razzi executed with errors:

  • dbproxy1018.eqiad.wmnet (PASS)
    • Confirmed clouddb1018.eqiad.wmnet is depooled from dbproxy1018.eqiad.wmnet

Change 772443 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] dbproxy: depool clouddb1018

https://gerrit.wikimedia.org/r/772443

I kicked off the cookbook but infortunately ran into 2 errors right off the bat:

  • the depool command, as I had it written, had a permission denied error but still exited with status 0:
root@dbproxy1018:/home/razzi# echo 'set server clouddb1018.eqiad.wmnet state drain' | sudo socat /run/haproxy/haproxy.sock stdio
Permission denied

root@dbproxy1018:/home/razzi# echo ${PIPESTATUS[0]} ${PIPESTATUS[1]}
0 0

(in the puppet run, it looked like this)

----- OUTPUT of 'echo "set server...proxy.sock stdio' -----                                                   
Permission denied
  • the grep succeeded because of a second issue, a logical error about grep -v, so we saw the Confirmed clouddb1018.eqiad.wmnet is depooled from dbproxy1018.eqiad.wmnet posted above.

I manually confirmed the host was indeed pooled, and tried to figure out why the depool command failed, but chatting with other SREs it came to light that we should use confctl for this kind of depooling.

For now I'll do the updates the manual way, starting with depooling clouddb1018.eqiad.wmnet here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/772443/

Change 772443 merged by Razzi:

[operations/puppet@production] dbproxy: depool clouddb1018

https://gerrit.wikimedia.org/r/772443

That table might not be that popular so if I were you I would try first without depooling, it might succeed. For clouddb1021 (reminder: that one has all the sections itself) you should be fine not depooling it as it is not that used.

Change 772447 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] dbproxy: repool clouddb1018

https://gerrit.wikimedia.org/r/772447

Change 772447 merged by Razzi:

[operations/puppet@production] dbproxy: repool clouddb1018

https://gerrit.wikimedia.org/r/772447

@Marostegui It indeed works without depooling so far on clouddb1018 and clouddb1013. Good tip, this is much faster. Current host status:

s2, s7

  • clouddb1014
  • clouddb1018

s1, s3

  • clouddb1013
  • clouddb1017

s5

  • clouddb1016
  • clouddb1020

All sections

  • clouddb1021

Mentioned in SAL (#wikimedia-operations) [2022-03-21T17:49:31Z] <razzi> sudo maintain-views --all-databases --replace-all --table flaggedrevs on clouddb1018 for T302233

Mentioned in SAL (#wikimedia-operations) [2022-03-21T17:49:56Z] <razzi> sudo maintain-views --all-databases --replace-all --table flaggedrevs on clouddb1013 for T302233

Mentioned in SAL (#wikimedia-operations) [2022-03-21T17:51:03Z] <razzi> sudo maintain-views --all-databases --replace-all --table flaggedrevs on clouddb1017 for T302233

Mentioned in SAL (#wikimedia-operations) [2022-03-21T17:57:07Z] <razzi> sudo maintain-views --all-databases --replace-all --table flaggedrevs on clouddb1016 for T302233

Mentioned in SAL (#wikimedia-operations) [2022-03-21T17:59:25Z] <razzi> sudo maintain-views --all-databases --replace-all --table flaggedrevs on clouddb1020 for T302233

Mentioned in SAL (#wikimedia-operations) [2022-03-21T17:59:56Z] <razzi> sudo maintain-views --all-databases --replace-all --table flaggedrevs on clouddb1021 for T302233

Everything looks good, the only strange thing I saw was that clouddb1020 didn't have an error when I ran this query:

sudo mysql -S /var/run/mysqld/mysqld.s5.sock -e 'select * from dewiki_p.flaggedrevs limit 1'

but I ran maintain-views there anyways. It is still working.

Maybe it was done manually before - I just checked the table itself for dewiki and it doesn't have the fr_img* columns as expected.

Aklapper added a subscriber: razzi.

Resetting inactive task assignee

I just close it, if it's happening again, reopen.