Page MenuHomePhabricator

phabricator->phorge migration - database handling
Closed, ResolvedPublic

Description

Hi @Marostegui / DBA in general,

in T333885 we are planning to migrate the current Phabricator prod server to Phorge, a fork of Phabricator.

It's not a big difference so far and as far as we know and can see when glancing at it, the database scheme has not changed a lot.

But there is still an upgrade script that might change it and we don't want to risk this on the current Phabricator DB, or at least not without being sure we can easily revert if we have to.

So we have been thinking along these lines:

  • Maybe we should ask DBA to make a copy of the existing prod DB (actually DBs.. you know how Phabricator uses a ton of them.. for better or worse), call the copies "phorge_*" instead of "phabricator_*", create new GRANTS/user for them
    • Then we could schedule a maintenance window, point the server to the new databases, try things out without it ever touching the prod database, be sure it cant even by accident touch the prod database as long as we make sure the database user is changed
    • If things work fine we just stay on these new databases and some time later you can delete the old ones
    • If things don't work out we end the maintenance window, switch back to current databases and see where to go next
  • OR.. maybe we should rather ask DBA/backup people how to revert a database to a previous state, in general, and what the actual process for that is, if it needs them be around or not at all
  • OR.. maybe DBA has much better ideas how we should solve this as long as it is:
    • reasonably safe we don't mess up the prod db or can restore it
    • not too complex for everyone involved.. also given that there are so many databases and all the grants..

Finally, we are not even fully decided if we should do the upgrade "in place" on the current prod hardware, where IP would stay the same, or whether we should use a different host to test. Depending if the current GRANTs are limiting to IP we might or might not have to touch them.. but at the same time I think we would like to use a different mysql user to be sure.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a subscriber: jcrespo.

Thanks for the task.
I would feel more comfortable if we can test this before hand rather than going all at once. Would that be possible from your side?
To test that I would propose:

  • @Marostegui to find a totally different TESTING host to place our current phabricator databases (without any renames or anything).
  • collaboration-services to test the migration path
  • Test if everything works as expected

If all works fine, we can arrange a day/time to do the migration itself in production.
Before doing anything, once the day/time is agreed upon, I would make sure we have the most recent backup we could have (maybe we can run an ad-hoc one right before cc @jcrespo).

The reason I don't want to go for the renaming+testing on the current production host:

  1. It is production
  2. Phabricator has 67 databases. MariaDB (and MySQL) do not have a "rename database" feature, so we'd need to do some manual stuff to get all the databases dumped, then create the new ones with a different name and then use the renamed ones, which is VERY error prone.
  3. We have 163 grants in phabricator database, so we'd need to adapt those, again, very error prone.

Question: Does phorge use different database names or does it go with the current ones (phabricator_XX)?

During the migration day we can have a few safety nets to revert quickly:

  • @Marostegui to prepare a different host as a clone from the current one and get it to replicate
  • @Marostegui to stop replication on THAT host and on db1217 (our current backup source)
  • @Marostegui (and/or @jcrespo) to make sure we have the most recent backup possible from m3.

If the migration goes well, we can resume replication on those two extra hosts.
If the migration breaks, we can simply promote the temporary host to master (it is a matter of reloading the proxies only)

How does that sound @Dzahn

If all works fine, we can arrange a day/time to do the migration itself in production.
Before doing anything, once the day/time is agreed upon, I would make sure we have the most recent backup we could have (maybe we can run an ad-hoc one right before cc @jcrespo).

Let me know when this is scheduled and we can do a logical backup the day before and a custom snapshot before (it should take around 1h to complete).
In any case, the suggested stop replication/separate host method should be fastest way to both perform a development test and revert changes if something goes very wrong during a production deploy.

@Dzahn do you want me to prepare (next week) the "new" database attempt the testing or how do you want to proceed?
Just asking to organize myself for next week, as Monday and Tuesday are a public holiday here :)

@Marostegui Thank you! I like the overall approach and suggestion a lot.

Yes, no rush, but if you want to prepare the new database then yes, let's go ahead and do that. Should we help in trying to find a machine for that? Or should we even create one as a ganeti VM?

We don't have an actual migration date set yet.

As to your question about the database names, I asked upstream and the answer is:

"it keeps the phabricator_ one by default" .. because .."we couldn't come up with a way to have new installs use phorge_ without braking old installs.".

@Marostegui Thank you! I like the overall approach and suggestion a lot.

Yes, no rush, but if you want to prepare the new database then yes, let's go ahead and do that. Should we help in trying to find a machine for that? Or should we even create one as a ganeti VM?

Don't worry, I have a DB ready for this. It was going to be decommissioned, but we can use it for this for now. No problem. I will get it ready.

We don't have an actual migration date set yet.

Excellent, I will probably have it ready by the 7th May week.

As to your question about the database names, I asked upstream and the answer is:

"it keeps the phabricator_ one by default" .. because .."we couldn't come up with a way to have new installs use phorge_ without braking old installs.".

Good, that simplifies a lot all the steps indeed.

@Marostegui Thank you! I like the overall approach and suggestion a lot.
"it keeps the phabricator_ one by default" .. because .."we couldn't come up with a way to have new installs use phorge_ without braking old installs.".

Yep but to be 100% sure about that, just be sure to have this:

conf/local/local.json
{
  "storage.default-namespace": "phabricator",
  ...
}

As mentioned in

https://we.phorge.it/w/installation_and_setup/update_from_phabricator/

@Dzahn the testing database is ready (and on read-only), once you are closer to a testing stage from your side, just ping me so we can coordinate how/when we do it!

Thank you so much! :) will get back to it soon!

Change 927798 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add app.dev.learn.wiki pointing to AWS

https://gerrit.wikimedia.org/r/927798

oops, patch does not belong here. accidental

Mentioned in SAL (#wikimedia-operations) [2023-06-21T23:09:07Z] <mutante> created temporary test VM phab-test1001.eqiad.wmnet which we need for a one-time test for T335080 - it will soon be destroyed again

Change 932028 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add phab-test1001.eqiad.wmnet, temporarily

https://gerrit.wikimedia.org/r/932028

Change 932028 merged by Dzahn:

[operations/puppet@production] site: add phab-test1001.eqiad.wmnet, temporarily

https://gerrit.wikimedia.org/r/932028

Dzahn changed the task status from Open to In Progress.Jun 21 2023, 11:32 PM
Dzahn raised the priority of this task from Medium to High.

@brennen I am still working on creating the VM.. unfortunately I am running into issues such as:

Wed Jun 21 23:53:34 2023  - INFO: Resolved given name 'phab-test1002.eqiad.wmnet' to 'phab-test1001.eqiad.wmnet'                                                                                                  
Failure: prerequisites not met for this operation:                                                                                                                                                                
error type: wrong_input, error details:                                                                                                                                                                           
Resolved hostname 'phab-test1001.eqiad.wmnet' does not look the same as given hostname 'phab-test1002.eqiad.wmnet'

will try again tomorrow after cache is hopefully expred.

Change 932035 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] install_server: add netboot line for phab-test

https://gerrit.wikimedia.org/r/932035

Change 932035 merged by Dzahn:

[operations/puppet@production] install_server: add netboot line for phab-test

https://gerrit.wikimedia.org/r/932035

We are now able to use phab-test1001 VM to connect to db1118, thanks to Manuel.

Also, we already have dumps of all the tables of the various phabricator databases and the command line to create them again.

I have copied them to /home/dzahn on cumin1001.

19:15 < mutante> .. we now have: confirmed working access to db1118 from phab-test1001.. with all the phab databases.. AND we already have a dump of them all without the data :)
19:15 < mutante> the "phuser" works with the prod credentials
19:15 < mutante> just like the user/pass that is in prod phab config file
19:16 < mutante> except the host is db1118 vs m3-slave and the port is default 3306 instead of 3323
19:16 < mutante> basically you can already run the phorge upgrade .. then we simply repeat the dump command
..
19:18 < mutante> [phab-test1001:~] $ mysql -u phuser -h db1118.eqiad.wmnet -p phabricator_project
19:19 < mutante> [phab1004:~] $ sudo grep phuser -A1 /etc/phabricator/config.yaml
  • copied .sql files from the "before dump" from cumin100 over to phab-test1001 so that Brennen has access
  • verified mysql credentials work to access test db from test machine

the command used was:

for i in `db-mysql db1118 -e "show databases" | grep phab`; do echo $i ; mysqldump --host db1118.eqiad.wmnet --no-data --ssl-verify-server-cert $i > $i.sql; done

I can access the db and do a dry run for the Phorge migrations:

brennen@phab-test1001:/srv/phab/phabricator$ bin/storage upgrade --dryrun
DRYRUN: Would apply patch "phabricator:20210802.legalpad_document_signature.01.phid.sql" to host "db1118.eqiad.wmnet:3306".
DRYRUN: Would apply patch "phabricator:20210802.legalpad_document_signature.02.phid-populate.php" to host "db1118.eqiad.wmnet:3306".
Synchronizing static tables...
Verifying database schemata on "db1118.eqiad.wmnet:3306"...


Database             Table                      Name     Issues
phabricator_legalpad legalpad_documentsignature key_phid Missing Key
DRYRUN: Would apply adjustments.

That seemed like less than I remembered, but I double checked:

brennen@inertia:~/code/wmf/phabricator/deployment/phabricator/resources/sql (work/phorge-migration-2023-07-11 $%=) ✰ git log --oneline --stat wmf/stable.. .
d00e13baae Merge remote-tracking branch 'phorge/stable' into work/phorge-migration-2023-07-11
c3f0c0b0f7 Merge Phacility/master into phorge
a4948ec800 Add conduit endpoints for querying legalpad
 resources/sql/autopatches/20210802.legalpad_document_signature.01.phid.sql          |  2 ++
 resources/sql/autopatches/20210802.legalpad_document_signature.02.phid-populate.php | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

So we're good there, and it's a minimal set of changes.

All that said, for a quick check before running the update, I logged in to db1118.eqiad.wmnet and ran:

select id, title from phabricator_maniphest.maniphest_task order by id desc limit 1\G

...which gives me very recent data. Is this currently getting replicated from the production database to db1118?

mysql:phuser@db1118.eqiad.wmnet [phabricator_project]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db1159.eqiad.wmnet

Could we get that disabled? Thanks!

@Marostegui or @Ladsgroup Could the replication to this host be disabled? ^thanks!

Done.
And for what is worth, these are the coordinates where I reseted it:

root@db1118.eqiad.wmnet[(none)]> stop slave; show slave status\G
Query OK, 0 rows affected (0.002 sec)

*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: db1159.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: db1159-bin.000984
           Read_Master_Log_Pos: 772800472
                Relay_Log_File: db1118-relay-bin.000166
                 Relay_Log_Pos: 772800772
         Relay_Master_Log_File: db1159-bin.000984
              Slave_IO_Running: No
             Slave_SQL_Running: No
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 772800472
               Relay_Log_Space: 772801130
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 171966512
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-171966512-350789847
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State:
              Slave_DDL_Groups: 1615
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 48211652

@Marostegui Thanks! One more request here:

Applying patch "phabricator:20210802.legalpad_document_signature.01.phid.sql" to host "db1118.eqiad.wmnet:3306"...
[2023-07-18 15:45:38] EXCEPTION: (AphrontQueryException) #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement at [<phorge>/src/infrastructure/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:396]

Could we get this writable?

Thanks and sorry, one more:

Applying patch "phabricator:20210802.legalpad_document_signature.01.phid.sql" to host "db1118.eqiad.wmnet:3306"...
[2023-07-18 20:18:18] EXCEPTION: (PhutilProxyException) Unable to access a required database or table. This almost always means that the user you are connecting with ("phuser") does not have sufficient permissions granted in MySQL. You can use `bin/storage databases` to get a list of all databases permission is required on. {>} (AphrontAccessDeniedQueryException) #1142: ALTER command denied to user 'phuser'@'10.64.48.35' for table `phabricator_legalpad`.`legalpad_documentsignature`

This error usually indicates that you need to "GRANT" the MySQL user additional permissions. See "GRANT" in the MySQL manual for help. at [<phorge>/src/infrastructure/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:360]

I think phuser needs to be able to alter everything, though in this case phabricator_legalpad.legalpad_documentsignature is probably sufficient.

That user had:

GRANT SELECT, INSERT, UPDATE, DELETE, EXECUTE, SHOW VIEW ON `phabricator%`.* TO `phuser`@`10.64.48.35`

I just added the ALTER grant.

Aaaaand one more:

[2023-07-19 14:59:48] EXCEPTION: (PhutilProxyException) Failed to "CREATE TEMPORARY TABLE". You may need to "GRANT" the current MySQL user this permission. {>} (AphrontAccessDeniedQueryException) #1044: Access denied for user 'phuser'@'10.64.48.35' to database 'phabricator_legalpad'

Looking at the migration I'm trying to run, aside from the temp table creation there's also a TRUNCATE. I'm rusty on MySQL perms, to say the least - does that require a DROP grant?

I have added DROP (which gives you TRUNCATE), CREATE and CREATE TEMPORARY TABLES.

@brennen How's this going, do you need anything else from us or Data Persistence to proceed?

@brennen How's this going, do you need anything else from us or Data Persistence to proceed?

We're good! Apologies for not updating here. Migrations all ran ok, we're just in a user testing phase at the moment. I'll update here when this db can go away.

@brennen How's this going, do you need anything else from us or Data Persistence to proceed?

We're good! Apologies for not updating here. Migrations all ran ok, we're just in a user testing phase at the moment. I'll update here when this db can go away.

Cool, remember we'd need a few days to get ready for the final migration too so we can setup a host for a quick roll back to the previous database schema in case things do not go as planned.

Cool, remember we'd need a few days to get ready for the final migration too so we can setup a host for a quick roll back to the previous database schema in case things do not go as planned.

Right on. If possible, I'd like to do this this coming Monday, 2023-08-14 @ 16:00 UTC. For what it's worth, the only schema change is to phabricator_legalpad.

I'll try to get a host ready.
Please be advise that the 15th is a public holiday in many EU countries.
16:00 UTC is late for most of those countries so there won't be much support that evening and/or the day after in case things don't go as expected.
I'll leave the host ready, but it will be hard to be there in case you need to roll back

Change 947197 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1119 to m3

https://gerrit.wikimedia.org/r/947197

Change 947197 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1119 to m3

https://gerrit.wikimedia.org/r/947197

I'll try to get a host ready.
Please be advise that the 15th is a public holiday in many EU countries.
16:00 UTC is late for most of those countries so there won't be much support that evening and/or the day after in case things don't go as expected.
I'll leave the host ready, but it will be hard to be there in case you need to roll back

@jcrespo for awareness - we should try to get a last minute backup from m3 the day of the migration

Sadly I won't be around because of vacations nor on Monday nor Tuesday, so no backups will be available then.

Automated backups run @ m3 Tuesdays from 4:30 to 6:00 am UTC.

@brennen apart from the date of the migration, are you done with db1118? Can I just remove it from here and use it somewhere else?

@brennen Considering Manuel's and Jaime's availability could this happen Wednesday or Thursday next week and at an earlier time?

Considering Manuel's and Jaime's availability could this happen Wednesday or Thursday next week and at an earlier time?

We could aim for Wednesday. There's an open slot on the calendar at 15:00 UTC, which coincides with the weekly RelEng / Collaboration Services sync. It would make sense to use that meeting slot for a deploy, since most of the relevant people already attend...

Considering Manuel's and Jaime's availability could this happen Wednesday or Thursday next week and at an earlier time?

We could aim for Wednesday. There's an open slot on the calendar at 15:00 UTC, which coincides with the weekly RelEng / Collaboration Services sync. It would make sense to use that meeting slot for a deploy, since most of the relevant people already attend...

That would work for me.

On another note, I've got the host ready, it is db1119 and it will be there in case we need to roll back.
The idea is to stop replication right after we put phabricator on read-only mode for the migration.
If something explodes, we can simply promote that host to master on our proxies, as it will contain the latest data before the migration and the old schema - does that sound good to you @brennen?

@brennen apart from the date of the migration, are you done with db1118? Can I just remove it from here and use it somewhere else?

Also, @brennen could you let me know about ^?

Thanks

If something explodes, we can simply promote that host to master on our proxies, as it will contain the latest data before the migration and the old schema - does that sound good to you @brennen?

Yeah, makes sense.

@brennen apart from the date of the migration, are you done with db1118? Can I just remove it from here and use it somewhere else?

It would be helpful we could hang on to it until the deploy is done, just in case. I don't expect trouble here but if something does come up immediately, then having the option to roll back and investigate on the test instance could make life a lot easier.

Wednesday would be ideal for me, too, not only because I will be around but because we will also have a fresh backup from Tuesday, and we can create manually an extra one too, by that time.

@brennen apart from the date of the migration, are you done with db1118? Can I just remove it from here and use it somewhere else?

It would be helpful we could hang on to it until the deploy is done, just in case. I don't expect trouble here but if something does come up immediately, then having the option to roll back and investigate on the test instance could make life a lot easier.

Absolutely, no problem

Wednesday would be ideal for me, too, not only because I will be around but because we will also have a fresh backup from Tuesday, and we can create manually an extra one too, by that time.

Cool. Scheduled for Wednesday, 2023-08-16 @ 15:00 UTC (08:00 PDT).

We'll be on #wikimedia-operations but also feel free to join https://meet.google.com/for-tnif-yku - several of us will be on that call.

Copying from T333885#9115703:

One update re: rollbacks: As best I could tell from a quick revert of code on phab.wmflabs.org to a pre-Phorge state, legalpad functionality seems unaffected by the changed schema.

After a quiet first day after the deploy, I think we're probably good to resume replication here and repurpose the test db. Could wait until Friday if we wanted to be extra cautious.

I am going to start it everywhere but the "special" host. Will do that on Friday

Should the VM phab-test1001 be deleted at this point or does it still have value? (T351115)