
Update to Phorge upstream 2024.35 release
Closed, Resolved · Public

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
update submodules for upstream 2024.35 merge | repos/phabricator/deployment!69 | brennen | work/merge-phorge-2024.35 | wmf/stable
Merge phorge/2024.35 into wmf/stable | repos/phabricator/arcanist!4 | brennen | work/merge-phorge-2024.35 | wmf/stable
Merge upstream 2024.35 into wmf/stable | repos/phabricator/phabricator!91 | brennen | work/merge-phorge-2024.35 | wmf/stable

Related Objects

Status | Subtype | Assigned
Resolved | Feature | Aklapper
Resolved | Feature | Aklapper
Resolved | Feature | Aklapper
Open | | None
Resolved | | valerio.bozzolan
Resolved | BUG REPORT | valerio.bozzolan
Resolved | | Aklapper
Resolved | | Aklapper
Resolved | | Aklapper
Resolved | | Aklapper
Resolved | Feature | Aklapper
Resolved | | Aklapper
Resolved | Feature | Aklapper
Resolved | BUG REPORT | Aklapper
Resolved | | brennen
Resolved | | Marostegui
Resolved | | ABran-WMF
Resolved | | Pppery

Event Timeline

Aklapper changed the task status from Stalled to Open.Sep 2 2024, 11:26 AM
Aklapper raised the priority of this task from Low to Medium.
Aklapper updated the task description. (Show Details)

DB upgrade may take a while:

MariaDB [phabricator_file]> SELECT COUNT(*) FROM file;
+----------+
| COUNT(*) |
+----------+
|   547777 |
+----------+

MariaDB [phabricator_file]> SELECT COUNT(*) FROM file_attachment;
+----------+
| COUNT(*) |
+----------+
| 29897115 |
+----------+

Oh good.

Also: why do ours seem backward from upstream?

Just as an indication: the storage upgrade, in a Phorge with a file count of 1.3M rows and file_attachment consisting of 9K rows, may delete 170K rows in less than 1 second on average hardware.

– (change that introduced db upgrade)

And

NOTE: If you have 1M+ phabricator_file.file and 10K file_attachment - it may delete 200K rows in 2s

– (2024.35 changelog)

Meanwhile: we have ½M phabricator_file.file and 29M phabricator_file.file_attachment.

Why is our usage backwards from upstream?


For timing "may delete 200K rows in 2s"

If that's the pace of cleanup, then the timing doesn't seem too bad, really. If it deleted every one of our rows, that would take ((29,500,000 / 200,000) * 2) = 295 seconds (about 5 minutes).

Why is our usage backwards from upstream?

This is just speculation, but it makes sense for Phabricator installations with a long history to have more references, maybe even more than files, especially public installations that attract spiders and generate lots of extra temporary files and extra orphan references.

Just for extra handy reference, this is the (only) involved patch:

USE phabricator_file;

DELETE FROM file_attachment
 WHERE NOT EXISTS
  (SELECT *
   FROM file
   WHERE phid=file_attachment.filePHID)

Eh, I'm kinda reluctant to run USE phabricator_file; SELECT COUNT(*) FROM file_attachment WHERE NOT EXISTS (SELECT * FROM file WHERE phid=file_attachment.filePHID); in our production instance because I'm worried it's gonna take a looong time. :-/
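
A cheaper sanity check, just as a sketch (my own suggestion, not part of the upstream patch): count orphans in an arbitrary slice of file_attachment instead of the whole table, so the query stays bounded no matter how big the table is. The 100000-row LIMIT is an arbitrary illustration value.

SELECT COUNT(*) AS orphans_in_slice
FROM (SELECT filePHID FROM file_attachment LIMIT 100000) AS fa
LEFT JOIN file f ON f.phid = fa.filePHID
WHERE f.phid IS NULL;

Extrapolating from a slice is rough, but it at least gives an order-of-magnitude feel without an unbounded full-table anti-join.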

Would there be interest in getting a DB snapshot (replica on a real-size host) to test this on? We took this approach when upgrading VRTS: T355541: Install a temporary DB host in m2 to support VRTS migration.

Would this be the case in the end? So I can prepare for this.

Are there any significant schema changes on this upgrade?

phabricator_file.file_attachment and phabricator_file.file, per https://we.phorge.it/w/changelog/2024.35/, are the unknowns for us.

They have quite a decent size here:

root@db2160:/srv/sqldata.m3/phabricator_file# ls -lh file_attachment.ibd
-rw-rw---- 1 mysql mysql 8.8G Jan  8 14:26 file_attachment.ibd
root@db2160:/srv/sqldata.m3/phabricator_file# ls -lh file.ibd
-rw-rw---- 1 mysql mysql 1.3G Jan  8 14:26 file.ibd

P.S. To save some storage, we may additionally want to try an OPTIMIZE TABLE file_attachment after running the migration script, since many rows are expected to be removed there; MariaDB usually needs a kick, I think, to reclaim that space.
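
A minimal sketch of that step (assumption: it would run inside the maintenance window, since OPTIMIZE TABLE rebuilds the table; on InnoDB it is effectively a table recreate):

USE phabricator_file;

-- reclaim space left behind by the deleted rows (rebuilds the table)
OPTIMIZE TABLE file_attachment;

-- compare on-disk size before/after
SELECT table_name,
       ROUND(data_length / 1024 / 1024)  AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb,
       ROUND(data_free / 1024 / 1024)    AS free_mb
FROM information_schema.tables
WHERE table_schema = 'phabricator_file'
  AND table_name IN ('file', 'file_attachment');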

Picking this thread back up after some discussion in last week's Collaboration Services office hours.

Would there be interest in getting a DB snapshot (replica on a real-size host) to test this on? We took this approach when upgrading VRTS

Would this be the case in the end? So I can prepare for this.

I think this seems like the way forward? It might be overkill, but we could know exactly what kind of downtime to expect.

This needs some preparation; what timeline are we looking at here?

Flexible; whatever fits for you. Not urgent; it only blocks us from pulling/deploying future upstream updates of Phab. (FYI I'm likely way less available Apr 18–May 01.)

Thanks!

Unfortunately, I don't think we can leave a replica dedicated just to this, so it is a one-off thing (if there's a general need for this in the long term, budget can be requested for next FY). If you all are okay with a one-off test (as we've done in the past), I can try to get a host and prepare it.

If you all are okay with a one-off test (as we've done in the past), I can try to get a host and prepare it.

Yes, that would be great. Thank you!

if there's a general need for this in the long term, budget can be requested for next FY

Might be worth talking about, but yeah, for the moment a one off will work great.

Thanks!

Eventually I just ended up on https://phabricator.wikimedia.org/config/dbissue/.
Phab complains about phabricator_file and phabricator_search.
I have no idea what that __dsns table in phabricator_file might be.
The phabricator_search issue might come from https://gerrit.wikimedia.org/r/c/operations/puppet/+/314286 ?

Just pointing out in case it might (or not) complicate things. :-/

Assigning to @brennen now that we have our test db.

Plan here:

  • test out migration using phab1005, pointed to test db
  • time it (so we know how long downtime will be for the real upgrade; see the timing sketch after this list)
  • make sure we didn't break anything
  • declare victory
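
For the timing step, a rough sketch of what the test run could look like, assuming the standard Phorge CLI layout (the exact install path, service user, and any wmf deploy wrappers on phab1005 are assumptions here):

# on the test host, pointed at the test db
cd /srv/phab/phabricator
time sudo -u www-data ./bin/storage upgrade --force

The wall-clock time of that command against the snapshot should be a reasonable proxy for the downtime needed on the real upgrade.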

Adding @jcrespo, who offered support from the backup perspective.

(Do we still redirect to https://www.mediawiki.org/wiki/Phabricator/Maintenance for that time? I am clueless how "standardized" the steps are.)

Please let me know in advance before start, so I can stop replication on the standby dbs.

(Do we still redirect to https://www.mediawiki.org/wiki/Phabricator/Maintenance for that time? I am clueless how "standardized" the steps are.)

(If you do, please redirect to https://www.mediawiki.org/wiki/Special:MyLanguage/Phabricator/Maintenance if you can, so that people can see the notice in their own language where a translation exists :D)

Please let me know in advance before start, so I can stop replication on the standby dbs.

Planning on 15:00 UTC.

Change #1169654 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[operations/puppet@production] phabricator deployment: skip storage upgrade during deploy

https://gerrit.wikimedia.org/r/1169654

Icinga downtime and Alertmanager silence (ID=f6a06582-ed9b-4e72-aba4-96871ffe134c) set by jynus@cumin1003 for 4:00:00 on 4 host(s) and their services with reason: Phorge upgrade

db[2160,2234].codfw.wmnet,db[1217,1250].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-07-15T15:02:05Z] <jynus> stop replica @ db1217:m3, db2160:m3 T370266
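
For reference, the replication pause on the m3 standbys boils down to roughly the following on each host (a sketch; the actual procedure on db1217/db2160 may go through tooling rather than a manual client session):

-- before the upgrade, on each standby
STOP SLAVE;
SHOW SLAVE STATUS\G

-- after the upgrade, resume replication and let it catch up
START SLAVE;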

Change #1169654 merged by Dzahn:

[operations/puppet@production] phabricator deployment: skip storage upgrade during deploy

https://gerrit.wikimedia.org/r/1169654

Mentioned in SAL (#wikimedia-operations) [2025-07-15T15:09:01Z] <brennen@deploy1003> Started deploy [phabricator/deployment@ed8270c]: test deploy phab2002 for T370266

Mentioned in SAL (#wikimedia-operations) [2025-07-15T15:09:40Z] <brennen@deploy1003> Finished deploy [phabricator/deployment@ed8270c]: test deploy phab2002 for T370266 (duration: 00m 38s)

Mentioned in SAL (#wikimedia-operations) [2025-07-15T15:12:14Z] <brennen@deploy1003> Started deploy [phabricator/deployment@ed8270c]: deploy phab1004 for T370266

Change #1169692 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[operations/puppet@production] Revert "phabricator deployment: skip storage upgrade during deploy"

https://gerrit.wikimedia.org/r/1169692

Change #1169692 merged by Dzahn:

[operations/puppet@production] Revert "phabricator deployment: skip storage upgrade during deploy"

https://gerrit.wikimedia.org/r/1169692

Mentioned in SAL (#wikimedia-operations) [2025-07-15T15:46:58Z] <jynus> start replica @ db1217:m3, db2160:m3 T370266

I removed the db downtimes after replica catchup. Things seem done from my side; we still have the backup for the long term. Regards.

Congratulations! (Just out of curiosity: how long did it take to upgrade the database? Minutes? Hours? Thanks)

About 15-20 minutes on production, about an hour on the test instance with a very similar dataset. Machine configurations...

Contents of etherpad deploy checklist for posterity:


Deploy plan for https://phabricator.wikimedia.org/T370266 - upgrade to phorge 2024.35 release

Overview: Disable some things that the phab_* deployment scripts normally do, deploy, run storage upgrade, turn things back on.
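
In outline, and only as a sketch (the literal checklist steps, wrapper scripts, and service user are assumptions here, not the wmf procedure itself):

# stop the daemons before touching storage
cd /srv/phab/phabricator
sudo -u www-data ./bin/phd stop

# deploy the new code with the automatic storage upgrade skipped
# (per the "skip storage upgrade during deploy" puppet change), then run it by hand
sudo -u www-data ./bin/storage upgrade --force

# bring the daemons back and revert the puppet change so normal deploys resume
sudo -u www-data ./bin/phd start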