Page MenuHomePhabricator

Closed tickets in Bugzilla migrated without a closing date
Closed, ResolvedPublic16 Estimated Story PointsBUG REPORT

Description

We found in the Grimoire stats for Maniphest that the number of closed tickets per month up to 2014 was 0, which obviously is wrong, since Bugzilla tickets were migrated to Maniphest.

The problem seems to be due to the migration process. Apparently, tickets closed in Bugzilla were migrated to Maniphest as closed tasks, but without a related transaction for the status change. So the table phabricator_maniphest.maniphest_task has the column closedEpoch as NULL, and the trouble is that for calculating closed tickets per month, and the backlogs in the past, in Grimoire we use "closing events", which must have a date.

See for example these two, closed but without closing event:

Compared to these two (look for "closed this task as Resolved"):

I wonder if the missing closing events could be migrated to Maniphest. If that is not possible, we will look for some workaround in Grimoire, but the stats for before 2014 will probably render void.

Bugzilla had about 7500 duplicates, with this distribution over time:

bugzilla-all-duplicates-chart.png (800×600 px, 4 KB)

✅ 1. Data scraping

In the next days (2023-07) we will scrape Bugzilla history pages to generate something like this in CSV:

CSV 2.7M https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/raw/5dd57479ca6bd565f1110f9ed974041a5657fb21/data/bugzilla.csv?inline=false

Here the full raw data-source in JSON:

JSON 126MB https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/raw/5dd57479ca6bd565f1110f9ed974041a5657fb21/data/bugzilla.json?inline=false

The goal of the above dataset is to capture all the "meaningful status changes" and have a CSV with the "most meaningful status change".

✅ 2. Data matching

Please add here some instructions about how to do the match BugZillaID → PhabricatorID ✨

✅ 3. Create Update queries

The goal here is to execute a lot of UPDATE queries against the phabricator_maniphest.maniphest_task database table having lot of expected Phabricator numerical IDs and close dates. Example:

UPDATE phabricator_maniphest.maniphest_task SET closedEpoch = :epoch: WHERE closedEpoch IS NULL AND id = :expected_phab_id:

4. Epic Database fiddling (help needed)

Somebody with database access to Phabricator production could execute these, to fix the problem:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/raw/bf835b1469b5e889ac6256cacaef917ab381da2b/data/T107254-migration-commands.sql?inline=false

WARNING: The above download file is 15MB

Preview of first rows:

-- This script is intended to be executed using this database: phabricator_maniphest
-- skipping BugZilla ID 1
UPDATE maniphest_task USE INDEX (PRIMARY) SET closedEpoch = 1103779357 WHERE closedEpoch IS NULL AND status NOT IN( 'open', 'stalled', 'duplicate', 'progress' ) AND id = 2002; -- in date 2004-12-23 05:22:37 UTC BugZilla ID 2 closed by user colin.pitts with status FIXED
UPDATE maniphest_task USE INDEX (PRIMARY) SET closedEpoch = 1092440778 WHERE closedEpoch IS NULL AND status NOT IN( 'open', 'stalled', 'duplicate', 'progress' ) AND id = 2003; -- in date 2004-08-13 23:46:18 UTC BugZilla ID 3 closed by user brion with status FIXED
-- skipping BugZilla ID 5
-- skipping BugZilla ID 6
UPDATE maniphest_task USE INDEX (PRIMARY) SET closedEpoch = 1368229779 WHERE closedEpoch IS NULL AND status NOT IN( 'open', 'stalled', 'duplicate', 'progress' ) AND id = 2007; -- in date 2013-05-10 23:49:39 UTC BugZilla ID 7 closed by user gwicke with status WONTFIX
UPDATE maniphest_task USE INDEX (PRIMARY) SET closedEpoch = 1259607436 WHERE closedEpoch IS NULL AND status NOT IN( 'open', 'stalled', 'duplicate', 'progress' ) AND id = 2008; -- in date 2009-11-30 18:57:16 UTC BugZilla ID 8 closed by user ayg with status FIXED
UPDATE maniphest_task USE INDEX (PRIMARY) SET closedEpoch = 1132381430 WHERE closedEpoch IS NULL AND status NOT IN( 'open', 'stalled', 'duplicate', 'progress' ) AND id = 2009; -- in date 2005-11-19 06:23:50 UTC BugZilla ID 9 closed by user brion with status FIXED
UPDATE maniphest_task USE INDEX (PRIMARY) SET closedEpoch = 1182970947 WHERE closedEpoch IS NULL AND status NOT IN( 'open', 'stalled', 'duplicate', 'progress' ) AND id = 2010; -- in date 2007-06-27 19:02:27 UTC BugZilla ID 10 closed by user wmbugs with status WONTFIX
...

The above is just a preview. The file has a total of 56 000+ updates, all of them share the "same structure", they "just" touch closedEpoch if NULL and if not open 🌈

Details

Other Assignee
valerio.bozzolan

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

It looks like you have no special handling for bugs that were closed and then reopened, so e.g. bug 1 (T2001) is listed as having been resolved in 2006. I don't know if these events should be migrated or not?

Fortunately the Glorious Bugzilla Archivers, also archived these pages:

https://static-bugzilla.wikimedia.org/show_activity.cgi?id=73

Woah, I kind of remember doing this and thinking something like "better save the activity as well". It's cool to read years later unexpectedly in another context that it did matter.

It looks like you have no special handling for bugs that were closed and then reopened, so e.g. bug 1 (T2001) is listed as having been resolved in 2006

Indeed, the JSON file says

"73": {
    "lastStatus": "RESOLVED",
    "lastDate": "2005-03-23 21:58:11 UTC",
    "lastActor": "erik"
},

and that's the first event, https://static-bugzilla.wikimedia.org/show_activity.cgi?id=73 says that it got ultimately resolved on 2005-05-22 00:46:39 UTC.

Uh, thanks for the bug report

Another funny thing:

Sometime the column "What" can contain "Assignee", "Blocks", "CC". And that is nice.

But... sometime it contains this :D

..
                <a href="attachment.cgi?id=58">
                Attachment #58</a>
                Attachment is obsolete

Why is that in the logical Heading than in the value? Uh? UH? Only Bugzilla knows.

Interesting indicative stats so far

[FIXED] => 2948
[DUPLICATE] => 1100
[INVALID] => 813
[WONTFIX] => 638
[RESOLVED] => 484
[] => 218
[CLOSED] => 177
[REOPENED] => 131

Edited: where the [] means "no status" and so "open"

Not sure how to read this...
"Resolved", "Reopened", "Verified", "Closed" were statuses; "Invalid", "Worksforme", "Duplicate", "Wontfixed", "Fixed" were resolutions of the status "Resolved":
https://upload.wikimedia.org/wikipedia/commons/archive/5/57/20141207044352%21Bug_Life_Cycle_Diagram.png .
Also see the "Status (and related Resolution) options" table on https://www.mediawiki.org/wiki/Phabricator/versus_Bugzilla#Bugzilla_data_migrated

OK so from a Phabricator perspective these are the most phab-nonsense "status changes":

  • ---, this is just NULL, nonsense
  • ASSIGNED: changed the assignee, not the status
  • LATER, changed when, not the status
  • NEW, this just means I'm new, and the status never changed
  • PATCH_TO_REVIEW, this is a corner-case for "in progress" but better to ignore
  • REMIND, this does not indicate a status change
  • UNCONFIRMED, this indicates reproducibility, not status change
  • VERIFIED, this indicates reproducibility, not status change
  • WORKSFORME, this indicates reproducibility, not status change

The script skipped all of these.

It takes just the very last (→ the most meaningful) status. Here available for download:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/raw/5dd57479ca6bd565f1110f9ed974041a5657fb21/data/bugzilla.csv?inline=false (permalink)

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/raw/master/data/bugzilla.csv (last)

WARNING: the CSV is 2 MB
valerio.bozzolan changed the task status from In Progress to Stalled.Jul 19 2023, 7:21 AM

Chapter I: completed ✅

Chapter II: understanding what should be updated, from what 🔴 We are blocked again \o/

A data source like this one would be useful:

Phabricator IDPhabricator Closed EpochPhabricator Closed PHIDBugzilla ID
numericnumeric / nullstring / nullnumeric

To obtain that, it seems we can look at how Bugzilla redirects to Phabricator. It seems the "redirector" executes this query to find the Phabricator ID:

select t.id from maniphest_customfieldstorage f, maniphest_task t where t.PHID=f.objectPHID and f.fieldIndex=? and f.fieldValue=?

Source:

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/phabricator/templates/redirect_config.json.erb

Somebody who knows puppet is needed to understand what fieldIndex is supposed to be.

The script skipped all of these. It takes just the very last (→ the most meaningful) status.

Statuses were only {UNCONFIRMED, NEW, ASSIGNED, PATCH_TO_REVIEW, REOPENED, RESOLVED, VERIFIED, CLOSED} (not sure we had or used the last one).
{FIXED, LATER, REMIND, WORKSFORME, WONTFIX, INVALID, DUPLICATE} were resolutions of one of the non-open statuses {RESOLVED, VERIFIED, CLOSED}, except for the {---} resolution in case of an open status {UNCONFIRMED, NEW, ASSIGNED, PATCH_TO_REVIEW, REOPENED}.

IMO the underlying problem is that in 2014 we did not import into the Phab DB the calendar date when a ticket's status was changed to RESOLVED for the last time.

Chapter II: understanding what should be updated, from what

Simplified, imported closed tickets with closedEpoch === null and no status changes since the import. [1] That's currently 57035 tickets.
closedEpoch should be updated to say when the ticket's status was changed to RESOLVED (Bugzilla status) for the last time.

Phabricator IDPhabricator Closed EpochPhabricator Closed PHIDBugzilla ID
numericnumeric / nullstring / nullnumeric

To obtain that, it seems we can look at how Bugzilla redirects to Phabricator.

Phabricator task ID = Bugzilla ticket ID + 2000. I do not know what "Phabricator Closed PHID" means here, sorry. :(


[1] SELECT maniphest_task.id, maniphest_task.phid, maniphest_task.status, closedEpoch FROM maniphest_task WHERE (maniphest_task.status != "open" AND maniphest_task.status != "stalled" AND maniphest_task.status != "duplicate" AND maniphest_task.status != "progress") AND maniphest_task.id > 2000 AND maniphest_task.id < 75682 AND maniphest_task.phid NOT IN (SELECT maniphest_task.phid FROM maniphest_task JOIN maniphest_transaction WHERE maniphest_task.phid = maniphest_transaction.objectPHID AND maniphest_transaction.dateModified > 1416697200 AND maniphest_transaction.transactionType = "status" AND (maniphest_transaction.oldValue = "\"open\"" OR maniphest_transaction.oldValue = "\"stalled\"" OR maniphest_transaction.oldValue = "\"progress\"") AND (maniphest_transaction.newValue != "\"open\"" AND maniphest_transaction.newValue != "\"stalled\"" AND maniphest_transaction.newValue != "\"progress\"")) AND closedEpoch IS NULL;

Phabricator task ID = Bugzilla ticket ID + 2000

Whoa! thanks.

SELECT ...

Whoa! Can I somehow surf that lovely data? 🤩 maybe in a pastebin? Also please with the closerPHID column

(With "Phabricator Closed PHID" I just mean the "string identified of the person who closed that Task" - it's the column closerPHID in maniphest_task table)

Whoa! Can I somehow surf that lovely data? 🤩 maybe in a pastebin?

P49618

Also please with the closerPHID column

There are no tasks WHERE closedEpoch IS NULL AND closerPHID IS NOT NULL. Both columns are NULL in the Paste above.

Yeah having closerPHID as NULL has sense since you already filtered by maniphest_task.status != "open" AND maniphest_task.status != "stalled" AND maniphest_task.status != "duplicate" AND maniphest_task.status != "progress" and so there is no closer.

Thanks for the paste 👍

Let's join this lovely pastebin with our scraped stuff

I sincerely do not remember what I was doing here :D will retry to do something during the next WMHack. I will continue from this point:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser

LSobanski claimed this task.
LSobanski subscribed.

Closing based on the last comment.

Aklapper removed LSobanski as the assignee of this task.

This is not resolved; it would require altering our DB

Got it. In this case I'll just untag Collab.

To summarize: Would need to perform 57008 times the following steps:

I obviously do not care who set the closed status (we cannot match non-existing accounts!), and I obviously do not care about creating fake transactions in the DB, but would only set that one closedEpoch column value, if at all.

Note: we have dump of Bugzilla data at https://dumps.wikimedia.org/other/bugzilla/ , as static html file and a database dump (without emails)

@Aklapper

git clone https://gitlab.wikimedia.org/repos/sre/miscweb/bugzilla.git

cd html-compressed

tar xzf static-bugzilla.gz

cd static-bugzilla

grep bz_status_RESOLVED activity2.html

bz_status_RESOLVED  bz_closed"

Thanks, that might turn out helpful. Note to myself: Requires installing git-lfs beforehand.

grep bz_status_RESOLVED activity2.html

bz_status_RESOLVED  bz_closed"

That's about other tickets linked via "Depends on" etc. It is unrelated to the ticket status itself.

Had another quick look and managed to get down to have only <table>...</table> left in the local bug activity page. But then interesting problem is dealing with the rowspan: No date cell in the Status change row when several actions were performed at once and 'Status change' is not listed as first row)

Small note

This already takes in consideration the rowspan thing:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser

So it could parse the website hosted locally correctly I think

@valerio.bozzolan I admit I didn't try to fully understand those 400 lines which seem to cover way way more stuff than needed here. :D Like why there is an array with statuses when we only care about last Status=Resolved etc. Or why care about ACTION_WITH_LINKS etc. (Also, stuff like WORKSFORME and REMIND and LATER and --- are no statuses but resolutions.) And if it would work on a static HTML dump like ours, instead of a real BZ instance.

P.S.

Have anybody already checked the file mentioned in the description, that is, maybe an useful starting point?

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/master/data/bugzilla.csv

stuff like WORKSFORME and REMIND and LATER and --- are no statuses but resolutions

Yep. That is exactly why WORKSFORME etc. are stored in a constant called NONSENSE_STATUSES - so we can ignore them in a strict and stable way 👍

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/5dd57479ca6bd565f1110f9ed974041a5657fb21/bugzilla-scraper.php#L43

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/5dd57479ca6bd565f1110f9ed974041a5657fb21/bugzilla-scraper.php#L89

BTW the real software starts at line 78 and ends at line 157 - that are 79 cute lines of code (comments included) without any import and without any alien library. Hoping to have shared something useful.

Anyway a required step is to take P49618 and make a nice phabricator.csv.

Where to upload that? pull request welcome in the data/ directory here:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/

Have anybody already checked the file mentioned in the description, that is, maybe an useful starting point?
https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/master/data/bugzilla.csv

@valerio.bozzolan: I am sorry, I should have first checked that more thoroughly, indeed. :-( That could be quite useful!
Most data in the CSV file seems correct (yay). Code seems not to ignore VERIFIED and CLOSED statuses though, so e.g. bug 814 and bug 1924 have wrong data.

And if it would work on a static HTML dump like ours, instead of a real BZ instance.

Yeah. Since PHP is a lovely+scaring piece of glue that considers HTTP URLs as files that can be opened... (yes - file_get_contents() supports URLs lol) now our local dump is supported ✨

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/commit/d98b2aaff20e85db66d6172b82c9e117fcdec8e6


Note: we have dump of Bugzilla data at https://dumps.wikimedia.org/other/bugzilla/ , as static html file and a database dump (without emails)

OH YEAH - thanks again @hashar for reporting the existence

And... the script now takes seconds instead of weeks. Now we have enough extra Watts to shutdown a small nuclear plant somewhere 🌈


Most data in the CSV file seems correct (yay) [...] Code seems not to ignore VERIFIED and CLOSED statuses though, so e.g. bug 814 and bug 1924 have wrong data.

If you say that CLOSED should be ignored, let's ignore that 👌


Run again. Completed in 10 seconds instead of 10 weeks 🎇

Is this example change correct to you, @Aklapper ? If yes, I will commit the dataset again.

diff --git a/data/bugzilla.json b/data/bugzilla.json
index 195b7ba..629fee1 100644
--- a/data/bugzilla.json
+++ b/data/bugzilla.json
@@ -2500,9 +2500,9 @@
             ]
         },
         "3": {
-            "lastStatus": "CLOSED",
-            "lastDate": "2005-05-22 16:11:13 UTC",
-            "lastActor": "river",
+            "lastStatus": "FIXED",
+            "lastDate": "2004-08-13 23:46:18 UTC",
+            "lastActor": "brion",
             "history": [
                 [
                     "timwi",
@@ -4394,9 +4394,9 @@
             ]
         },
         "12": {
-            "lastStatus": "CLOSED",
-            "lastDate": "2004-08-31 00:54:34 UTC",
-            "lastActor": "hashar",
+            "lastStatus": "FIXED",
+            "lastDate": "2004-08-31 00:34:05 UTC",
+            "lastActor": "brion",
             "history": [
                 [
                     "timwi",
@@ -5059,9 +5059,9 @@
             ]
         },
         "18": {
-            "lastStatus": "CLOSED",
-            "lastDate": "2005-05-22 16:56:27 UTC",
-            "lastActor": "river",
+            "lastStatus": "FIXED",
+            "lastDate": "2004-08-14 08:34:41 UTC",
+            "lastActor": "brion",



....



         "814": {
-            "lastStatus": "CLOSED",
-            "lastDate": "2007-03-04 06:57:13 UTC",
+            "lastStatus": "FIXED",
+            "lastDate": "2007-03-04 06:55:33 UTC",
             "lastActor": "ryan.lane",


...


         "1924": {
-            "lastStatus": "CLOSED",
-            "lastDate": "2007-01-20 08:14:09 UTC",
+            "lastStatus": "WONTFIX",
+            "lastDate": "2007-01-20 08:13:33 UTC",
             "lastActor": "rotemliss",
             "history": [

....

Note: we have dump of Bugzilla data at https://dumps.wikimedia.org/other/bugzilla/ , as static html file and a database dump (without emails)

And... the script now takes seconds instead of weeks. Now we have enough extra Watts to shutdown a small nuclear plant somewhere 🌈

I had a good laugh on that one!! I am quite happy to see my finding served you well and dramatically sped up the script.

As for CLOSED, I guess we did that to prevent further comments and/or to indicate the task resolution got verified. Then.. I cant quite remember how we used Bugzilla 15/20 years ago :D I have looked at the activities of each of the bug listed in the diff (example: https://static-bugzilla.wikimedia.org/show_activity.cgi?id=1924 ) and the new lastActor looks fine to me.

challenge-accepted

Here our generated SQL patch candidate ready for review:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/5ef4c7d2f640f856f581a53fa35fde5c354f7f0b/data/T107254-migration-commands.sql

The above SQL consists in exactly 58903 UPDATE statements, with comments.

That file is suitable to be imported as-is using mysql lol < T107254-migration-commands.sql or similar.

That SQL was generated from the last dataset in CSV available here:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/fa042f682a196a7037b3a9e0d378048108e97b2a/data/phabricator-tasks.csv

The SQL script can be re-generated using this script:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/fa042f682a196a7037b3a9e0d378048108e97b2a/T107254-print-migration-commands.php


Maybe nice to give some additional eyes, and also do a nice backup before this thing lol. Good luck. It has been a pleasure to converse on this Wikimedia Phabricator before its accidental destruction.

@valerio.bozzolan The SQL statements and comments appear in contradiction. For every comment that says "Skipped bug X - state was REOPENED" there is in fact a real update statement that does exactly what the comment says it doesn't. Is this intentional?

@valerio.bozzolan The SQL statements and comments appear in contradiction. For every comment that says "Skipped bug X - state was REOPENED" there is in fact a real update statement that does exactly what the comment says it doesn't. Is this intentional?

Whooopsie, thanks! fixed :)

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/eb6b9048c874c73d2eea1b550de7c92e72ebef2b/data/T107254-migration-commands.sql

I cant quite remember how we used Bugzilla 15/20 years ago

https://upload.wikimedia.org/wikipedia/commons/archive/5/57/20141207044352%21Bug_Life_Cycle_Diagram.png (with another "CLOSED" status reachable form "VERIFIED" which likely got switch off many many years ago so it wasn't switched on anymore when I made that diagram).

As for CLOSED, I guess we did that to prevent further comments and/or to indicate the task resolution got verified. Then.. I cant quite remember how we used Bugzilla 15/20 years ago :D

Of course we had a task about removing the "CLOSED" status from Bugzilla and that was T30121: Remove "CLOSED" status. Although the removal happened outside of that task, there is an indication the status got removed between March 2011 (when the task got filed) and June 2011 (Peachey88 comment indicates that got removed).

There are a few entries that say:

UPDATE phabricator_maniphest.maniphest_task SET closedEpoch = 1247766128 WHERE closedEpoch IS NULL AND id = 3189; -- in date 2009-07-16 17:42:08 UTC BugZilla ID 1189 closed by user mike.lifeguard+bugs with status REOPENED

This doesn't make sense - a ticket can't be closed with status REOPENED.

This doesn't make sense - a ticket can't be closed with status REOPENED.

Nice catch. Double-checking that soon during #wmhack

UPDATE phabricator_maniphest.maniphest_task SET closedEpoch = 1247766128 WHERE closedEpoch IS NULL AND id = 3189; -- in date 2009-07-16 17:42:08 UTC BugZilla ID 1189 closed by user mike.lifeguard+bugs with status REOPENED

This doesn't make sense - a ticket can't be closed with status REOPENED.

The thing is, Bugzilla really seems to have a bug 1189 that was REOPENED. Exactly in 2009-07-16 17:42:08 UTC and that was then marked as not fixed as resolution.

Is this reading correct also from your perspective, reading this page?

https://static-bugzilla.wikimedia.org/show_activity.cgi?id=1189

If yes, let's understand what we should do. Maybe the scraper should additionally read a "last closed date" field, and just consider that. That, in this example, would be 2009-07-16 17:32:33 UTC . Does it make any sense?

Or, the scaper can surely ignore edits done by mike.lifeguard+bugs :D ihih

Bug 1189/T3189 was open at the time Bugzilla was migrated to Phabricator. It was closed years later. There's nothing to do here.

Actually it looks like this bug was already reported and fixed above and I was following aklapper's quote of an old version of the SQL. But even the most recent version of the SQL (https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/master/data/T107254-migration-commands.sql) appears to be truncated.

And there's no reason to ignore mike.lifeguard - he was a valid bugzilla contributor, albeit now inactive.

appears to be truncated.

It's truncated because it seems you've found a bug in the GitLab frontend. Maybe it takes too much resources to render our SQL file and so it truncates it (interestingly, without any visible warning).

I can confirm the file is in good health, if we clone it in the lovely raw way. Like:

git clone https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser.git

To see 58903 lines shiny and happy without any evil censorship.

Can somebody please run this SQL command in production?

SELECT closedEpoch FROM maniphest_task WHERE id = 3189;
+-------------+
| closedEpoch |
+-------------+
|  1714527215 |
+-------------+

Ignore my last comment. We can obtain that information using Arcanist and Conduit APIs 🌈 🦄

echo '{ "constraints": { "ids": [ 3189 ] } }' \
  | arc call-conduit \
    --conduit-uri https://phabricator.wikimedia.org/ \
    -- \
    maniphest.search

So, everything is under control. That Task has "dateClosed": 1714527215

Oh, thanks Dzahn :3

So everything seems under control since these situations are covered by the WHERE closedEpoch IS NULL condition. So the SQL file seems correct to me.

Ah, no need to use git clone, actually this raw version works (thanks @Nemo_bis)

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/raw/master/data/T107254-migration-commands.sql

To any RelEng hero:

To make the above thing more production-friendly,

Would you appreciate the above SQL file already batched in small transactions consisting in, uhm, 100 statements each? Just to optimize a bit the huge number of MariaDB implicit autocommits.

Make sure you handle the situation where a ticket is closed on Bugzilla, and then later reopened on Phabricator (and thus is open now). The result is that its closedDate would be null and should not be set.

Make sure you handle the situation where a ticket is closed on Bugzilla, and then later reopened on Phabricator (and thus is open now). The result is that its closedDate would be null and should not be set.

Good point; in my understanding this is handled already by both excluding open statuses and checking the transaction log:

SELECT maniphest_task.id, maniphest_task.phid, maniphest_task.status, closedEpoch FROM maniphest_task WHERE (maniphest_task.status != "open" AND maniphest_task.status != "stalled" AND maniphest_task.status != "duplicate" AND maniphest_task.status != "progress") AND maniphest_task.id > 2000 AND maniphest_task.id < 75682 AND maniphest_task.phid NOT IN (SELECT maniphest_task.phid FROM maniphest_task JOIN maniphest_transaction WHERE maniphest_task.phid = maniphest_transaction.objectPHID AND maniphest_transaction.dateModified > 1416697200 AND maniphest_transaction.transactionType = "status" AND (maniphest_transaction.oldValue = "\"open\"" OR maniphest_transaction.oldValue = "\"stalled\"" OR maniphest_transaction.oldValue = "\"progress\"") AND (maniphest_transaction.newValue != "\"open\"" AND maniphest_transaction.newValue != "\"stalled\"" AND maniphest_transaction.newValue != "\"progress\"")) AND closedEpoch IS NULL;

@Pppery at the moment the export followed T107254#9028797 so I'm quite sure we are already covered.

That export is available also here:

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/blob/master/data/phabricator-tasks.csv?ref_type=heads

Anyway... you are right: one of these could be modified recently (like, 1 minute ago).

As extra safety, I can modify the queries to include AND status NOT IN( 'open', 'stalled', 'duplicate', 'progress' ).

This would become a bit longer:

UPDATE phabricator_maniphest.maniphest_task SET closedEpoch = 1093912445 WHERE closedEpoch IS NULL AND status NOT IN( 'open', 'stalled', 'duplicate', 'progress' ) AND id = 2012; -- in date 2004-08-31 00:34:05 UTC BugZilla ID 12 closed by user brion with status FIXED

The performance impact would be the same since the query plan uses the primary key. It would be just more verbose.

I'm inclined to do the proposed change in the SQL script, if you agree. Just for extra safety.

The performance impact would be the same since the query plan uses the primary key. It would be just more verbose.

About this comment ↑ it was surprisingly wrong:

$ EXPLAIN UPDATE maniphest_task ... AND id = 2012;
+------+-------------+----------------+-------+---------------------------+------------+---------+------+------+---------------------------+
| id   | select_type | table          | type  | possible_keys             | key        | key_len | ref  | rows | Extra                     |
+------+-------------+----------------+-------+---------------------------+------------+---------+------+------+---------------------------+
|    1 | SIMPLE      | maniphest_task | range | PRIMARY,status,key_closed | key_closed | 9       | NULL | 1    | Using where; Using buffer |
+------+-------------+----------------+-------+---------------------------+------------+---------+------+------+---------------------------+
1 row in set (0,001 sec)

$ EXPLAIN UPDATE maniphest_task USE INDEX (PRIMARY) ... AND id = 2012;
+------+-------------+----------------+-------+---------------+---------+---------+------+------+-------------+
| id   | select_type | table          | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+------+-------------+----------------+-------+---------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | maniphest_task | range | PRIMARY       | PRIMARY | 4       | NULL | 1    | Using where |
+------+-------------+----------------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0,001 sec)

It's very probably just my setup that has cardinality stupidities.

Anyway. I've introduced a little USE INDEX (PRIMARY).


Here the result:

WARNING: The file weight is 15MB

https://gitlab.wikimedia.org/valeriobozzolan/yet-another-bugzilla-parser/-/raw/bf835b1469b5e889ac6256cacaef917ab381da2b/data/T107254-migration-commands.sql?inline=false

P.S. if you would appreciate some batching, say so :)

valerio.bozzolan updated the task description. (Show Details)
valerio.bozzolan updated Other Assignee, added: valerio.bozzolan.

Database fiddler hero welcome here with production access. Unassigning since I'm not.

Aklapper raised the priority of this task from Lowest to Low.

FYI, due to importing RT tickets into Phabricator on 2014-12-17 and due to some tickets originally created either in Bugzilla or in RT being access restricted, there is still a number of non-open tickets filed before 2014-12-17 without a closedEpoch value in the DB.
However the vast majority got fixed today (which might not necessarily fix any statistics if they do not rely on querying the closedEpoch value but e.g. on transaction log values or such).

Well well, followup:

I pulled all tickets imported from RT via SELECT t.id, t.closedEpoch, t.status FROM maniphest_task t INNER JOIN maniphest_customfieldstringindex cfsi ON cfsi.objectPHID = t.phid WHERE t.closedEpoch IS NULL AND t.status != "open" AND t.status != "stalled" AND t.status != "progress" AND t.status != "duplicate" AND cfsi.indexValue LIKE "rt%" ORDER BY t.id;.
We cannot easily get their "real" closing date.
So I just arbitrarily set the closedEpoch for all imported RT tickets to the "latest" possible date 1418860800 (20141218 00:00UTC).
I checked beforehand that SELECT t.id, t.closedEpoch FROM maniphest_task t WHERE t.closedEpoch = 1418860800; had zero results as expected, if anyone ever wanted to identify this set of tickets again to set "real" closing dates.

Afterwards I was left with a small number of access-restricted tasks imported from Bugzilla without a closedEpoch. They are obviously 404 errors on https://static-bugzilla.wikimedia.org/ because they were access-restricted in Bugzilla and are still access-restricted in Phab.
Query: SELECT id FROM maniphest_task WHERE closedEpoch IS NULL AND status != "open" AND status != "stalled" AND status != "progress" AND status != "duplicate"AND id > 2000 AND id < 75683;
Again, we cannot easily get their "real" closing date.
So I just arbitrarily set the closedEpoch for those imported access-restricted Bugzilla tickets to the "latest" possible date 1416614400 (20141122 00:00UTC).
I checked beforehand that SELECT t.id, t.closedEpoch FROM maniphest_task t WHERE t.closedEpoch = 1416614400; had zero results as expected, if anyone ever wanted to identify this set of tickets again to set "real" closing dates.

That means that there are now zero results for the Phab DB query SELECT id FROM maniphest_task WHERE closedEpoch IS NULL AND status != "open" AND status != "stalled" AND status != "progress" AND status != "duplicate";.

Resolving for real.