
Determine the fate of comment metadata from legacy systems in phabricator.wikimedia.org
Closed, Resolved (Public)

Description

There is a lot of under-the-covers metadata that has to be dealt with in migration. Ticket metadata is fairly well defined, in that we get authorship, assignee, CCs, and modified / creation times for searching and sorting.

Comment metadata is another matter. When we first started down this road the decision was made to not worry about comment metadata and so we end up with something like this:

https://phabricator.wikimedia.org/T378#4196

A static identifier for the username from the system of origin, a timestamp, and the migrated comment.

There is a long road of compromises that has led to this place, but the current state is:

  • Users create a new account from existing external sources (ldap and sul)
  • Metadata for history across tickets of all systems is then associated with this user when they verify the email that they had in the historical system
  • Comments are not updated(1) (i.e. remain static for imported content)

Erik raised the point to investigate this further, and so I have looked into it and Mukunda and I have talked it over. Part of the ongoing confusion and untenable situation we find ourselves in is some people find one group of compromises acceptable, and another group may not. The end result being we keep coming back around and renegotiating the issue -- which is not the fault of either group actually. If we knew comment metadata was a full stop for the project I think we may have gone another route for accounts, although I'm honestly not sure as all methods have their big downfalls. It turns out merging history from accounts in completely unrelated and distinct systems isn't super easy.


So here is the proposal from the phabricator team :)

  • We can set ourselves up to fix comment metadata in the nicest of nice ways for users
  • We don't know if we can do this first run during the actual initial migration window; the time estimates didn't include this thinking. But certainly shortly thereafter, and maybe say once a week, once a month, or every few months for a while, we can batch process things to an amendable state.
  • Part of the 'batch run' thinking is twofold. We are not sure how this will impact things if we try to do it during normal hours (in theory fine), and it involves having to invalidate some remarkup caching so that updated things are shown correctly. Anytime you start invalidating cache on a broad scale while users are doing their thing, chaos can ensue, so at this point the thinking is scheduled off-hours runs.
  • After said batch runs user foo will now in theory be seamlessly integrated into Phab with history from source only historically identifiable by their external reference id (fl, bz, rt, etc).

I'm thinking 7 or 8 days of upfront work / testing to get this to a solid state. Part of that is lots of testing of partial cache invalidation; Phab only has tooling to support global invalidation, but that would be pretty heavy-handed.

updated the proposal with current thinking

Event Timeline

chasemp raised the priority of this task from to Unbreak Now!.
chasemp updated the task description.
chasemp changed Security from none to None.
chasemp added subscribers: Aklapper, Qgil, mmodell.
chasemp added a subscriber: Eloquence.
chasemp added a subscriber: faidon.
chasemp subscribed.

I think in theory we could (maybe during normal user update I guess) take a historical issue, find its new counterpart, track down transactions (phabricator ones), parse out comment transactions, match them up by order or creation date, and assign the comment to the appropriate user's new identity in the new system. But that still leaves behind artifacts such as "foo commented on xx:xx:xx" since all comments need context in case the user never registers. My take on this is it makes the post processing of user history a bigger problem, and results in messes upon messes (upon messes?).
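To make the matching step concrete, here is a minimal sketch of pairing legacy comments with migrated comment transactions by creation order. The `LegacyComment` and `CommentTransaction` classes and the `email_to_user` mapping are hypothetical stand-ins for illustration, not Phabricator's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class LegacyComment:
    author_email: str  # identity in the system of origin (hypothetical field)
    created: int       # original creation time, epoch seconds


@dataclass
class CommentTransaction:
    author: str        # current owner in the new system, e.g. an import bot
    created: int       # time the transaction was written during import


def reassign_comments(legacy, transactions, email_to_user):
    """Pair comments by creation order and hand each migrated transaction
    to the claimed account, if there is one. Returns how many changed."""
    legacy_sorted = sorted(legacy, key=lambda c: c.created)
    txns_sorted = sorted(transactions, key=lambda t: t.created)
    if len(legacy_sorted) != len(txns_sorted):
        raise ValueError("comment counts differ; refusing to guess a pairing")

    changed = 0
    for old, new in zip(legacy_sorted, txns_sorted):
        user = email_to_user.get(old.author_email)
        if user is None:
            continue  # unclaimed: the static comment stays as it is
        new.author = user
        new.created = old.created
        changed += 1
    return changed
```

Comments whose original author never registers are skipped, which is why the static "foo commented on ..." context still has to stay in the comment body.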

All ticket and issue sorting is done by the issue metadata, so we aren't really losing search / sort features per se.

Is the loss primarily formatting or consistency or sanity?

For people that don't have Phab accounts on migration day, can we instead create faux user ids for these people? E.g. "BZUser1234" for the 1234th user migrated, or even better, if the user was "robla@robla.net", "BZUser-robla-at-robla-net" (appending numbers to handle the one-in-a-million name clash). We then allow people to claim their BZ accounts. Having a unique token to search/replace would hopefully make it easier to clean things up after the fact.
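As an illustration only of the naming scheme suggested above, a placeholder username could be derived roughly like this. The `faux_username` helper and its `taken` set are hypothetical, and nothing here reflects how Phabricator itself validates usernames.

```python
import re


def faux_username(email, taken, prefix="BZUser"):
    """Build a placeholder username from a legacy email address, appending
    a number on a name clash. `taken` is the set of names already used."""
    slug = email.lower().replace("@", "-at-")
    # Collapse every run of other characters (dots, underscores, ...) to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", slug).strip("-")
    base = f"{prefix}-{slug}"

    candidate, n = base, 1
    while candidate in taken:
        n += 1
        candidate = f"{base}-{n}"
    taken.add(candidate)
    return candidate
```

With an empty `taken` set, `faux_username("robla@robla.net", taken)` gives `BZUser-robla-at-robla-net`, the form used in the example above.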

One other suggestion: would it be possible to hack up a "WMF Bugzilla identity provider" that basically allows people to log in with their old bugzilla credentials? It wouldn't need to actually hit Bugzilla, but rather, could be just an implementation of the BZ password hash hitting a mostly static database table. If Phabricator doesn't support account merging now, we could potentially implement it down the road, and ask people to limp along with two accounts at first.

Since the migration is (hopefully) happening very soon, I think it's a tough pill to swallow to say that the very small percentage of people who sign up for a Phabricator account now will have their metadata show up correctly, and everyone else will have something much kludgier. At a minimum, I think we'll probably need to commit to a one-time correction within a reasonably specific timeframe (e.g. 6 months)

Let's say we could sync comment metadata every 90 days (or 30 or whatever) if we sort things so that is possible, would it satisfy?

A slightly different approach occurred to me that I want to run by Mukunda before elaborating, but I think it may be doable. So I suppose some of my skepticism has softened.

On the suggestions, the idea of multiple accounts, and/or dummy accounts to be processed or merged at an unknown later date has been beaten to death by us before. It's a giant mess that mostly only succeeds in deferring more problems I think in most senses. The idea of Bugzilla as an auth provider I don't think has ever been raised, or I don't remember it, Mukunda would better know how tenable that is. I would say it's partially the problem that never ends due to maintaining it as a local patch forever, and the arc liberate and generated libraries headaches that can come from that. We would also be looking at doing the same for RT, and I am not sure where that leaves any other external source data (or really their unique users as it were). If it were desirable we would still be looking at stubbing out accounts somehow. I will bring these up in our planning meeting in the morning.

FWIW, I don't think the current approach has been a problem at all with the current imported dataset, which though small in comparison is very actively used. Maybe @Qgil or @Aklapper could speak to the weaknesses or strength of what we are doing now.

What about this:

  • Send now an email to all Bugzilla users in the lines of "ACTION REQUIRED: JOIN PHABRICATOR", explaining that this is how you will have your Bugzilla history migrated to Phabricator on the launch date.
  • Send a second and last email to all Bugzilla users as we have a date confirmed, telling them the details about dates when the service will be down and a reminder about JOIN PHABRICATOR NOW, IT TAKES ONE MINUTE.
  • We migrate from Bugzilla to Phabricator, including the comment metadata of the existing Phabricator users, and not bothering about the ones that haven't joined yet.
  • We sync after 3 months, if needed.
  • We sync after 3 months again, if needed.

And that's it?

Sending an email to all the users of a service is usually not a good idea -- except when you are discontinuing the service. I think everybody will understand.

IMHO, I don't think asking thousands of users to migrate their account before the migration is a sane process, when it's really us that should be making this step for them.

I obviously haven't followed this from the beginning so there's lots of background that is lost to me. Can someone (very briefly) explain here why it's impossible to:

  • migrate accounts off Bugzilla/RT as-is (email as the username for Bugzilla)
  • prompt users to just recover/rename their account if they wish so
  • migrate comments as they are, owned by their correct owner and with the correct timestamp

In T572#9349, @chasemp wrote:

On the suggestions, the idea of multiple accounts, and/or dummy accounts to be processed or merged at an unknown later date has been beaten to death by us before. It's a giant mess that mostly only succeeds in deferring more problems I think in most senses. The idea of Bugzilla as an auth provider I don't think has ever been raised, or I don't remember it, Mukunda would better know how tenable that is. I would say it's partially the problem that never ends due to maintaining it as a local patch forever, and the arc liberate and generated libraries headaches that can come from that. We would also be looking at doing the same for RT, and I am not sure where that leaves any other external source data (or really their unique users as it were). If it were desirable we would still be looking at stubbing out accounts somehow. I will bring these up in our planning meeting in the morning.

As for the RT part of this: the RT user database is relatively small, especially taking into account only users with access to the web interface (as opposed to auto-created on e-mail). I'm fairly sure we could easily match like 90% of the RT active user base to Bugzilla accounts in a semi-manual process -before- the migration if that helps. The remainder (which would include emails from external people who never even were aware they were mailing into a ticket system) could be done in the current static way.

Meh, posted this mistakenly in T559 already while Quim actually asked me to paste it here:

select count(userid) from profiles where date_sub(curdate(),INTERVAL 1 MONTH) <= last_seen_date;
871
select count(userid) from profiles where date_sub(curdate(),INTERVAL 1 YEAR) <= last_seen_date;
2606
<mutante> but this confuses me, even for "10 YEAR" i will get 2606
<andre__> according to https://bugzilla.mozilla.org/show_bug.cgi?id=240437 last_seen_date was only introduced in Bugzilla 4.4, and we upgraded to 4.4 in 02/2014

Hence number of users in last 12 months should be higher than 2606.

Not meant exactly as a direct response to @faidon, he just hit the three big-ish options on the nose, though there are variations of them.

  • migrate accounts off Bugzilla/RT as-is (email as the username for Bugzilla)
  • prompt users to just recover/rename their account if they wish so
  • migrate comments as they are, owned by their correct owner and with the correct timestamp

The history surrounding these options is spread out over hundreds of tickets so I'm going to try to hit some highlights to kind of bring this comment narrative together I hope. We all know bugzilla uses emails as usernames, but bugzilla also does some responsible stuff. If you go to a bugzilla ticket without authentication you see:

Screen_Shot_2014-10-09_at_8.31.32_PM.png (135×686 px, 23 KB)

They are smart enough not to leak emails to anonymous crawling, etc. But phabricator isn't set up to handle the emails-as-username use case. We know that we want maniphest to be search engine friendly, and indeed I think a few of us have found our own issues already while searching for phabricator related things. Andre explained to us that one of the biggest points of contention with the community is the problem of bugzilla being cavalier with user emails. Users will/have/do actively refuse to sign up and report bugs because they don't want their emails broadcast (which I'm not saying is reasonable necessarily -- just true), and the ones that have signed up are not happy with it either. The idea of using emails as usernames is just an unsellable, undefendable, unworkable solution from this perspective. Or at least after lots of discussion related to this, we all thought it would tank the project from the beginning for community support.

So what about some derivative or some generated username or something like their anonymous user id from bugzilla? There are a lot of social problems with that, but chief among them is that a static comment that shows the old username (as it was shown in the anonymous bugzilla sense) and text is more desirable than a "real" username that no one knows and a comment that has lost the context of the original user. So the idea of noise usernames from old systems seems good at first, and then when you do it there is a lot missing. Looking through a few hundred of the first tickets in that mode leaves you thinking, who is this? is this a wmf person? We could do "1234-WMF" for old WMF accounts and "123" for old non-WMF accounts, but it gets super messy and not at all better than a static comment that is at least straightforward. There is no one yet to whom we have had to explain what a static comment is, or how it works. It's an obvious ploy for metadata migration. There is no guarantee that many of the people from bugzilla, or rt, or any other system we import will ever show up again and so it may never be fixed from this perspective. We could do something in the middle, use the "generic" name bugzilla uses which is just the "foo" part of "foo@bar.com", but bugzilla doesn't worry much about overlap or identity. The john@cake.com and john@waffles.com issue.

Then there is the problem of usernames and alias management in phabricator. Renaming users, or trying to merge accounts, especially ones that may be claimed a year or two years from now is problematic. If you attempt to change your username in Phabricator the first part of the prompt warning is this:

The old username will no longer be tied to the user, so anything which uses it (like old commit messages) will no longer associate correctly. (And, if you give a user a username which some other user used to have, username lookups will begin returning the wrong user.)

The long story short there is, we import goo@goo.com as goo, and let's say there is no name conflict with the name from rt or trello or mingle or whatever other system we are going to migrate from next. It's doubtful that goo wants goo to be their username in phab, it seems few actually do want their username to be the first part of whatever email they were using. But goo doesn't show back up for another year, however their comments and their work has been referenced a bunch of times, when they come back and claim their account (however that is done) they immediately want to change their identity as they didn't actually choose the one we assigned. We lose necessary history in phabricator for their account from the point of migration to the point of claim. When we looked at doing this across the many, many, many accounts from bugzilla it was not a username shuffle game that seemed sane to get into. It gets weirder as this huge swath of account names we have taken up are chosen by actual users later and the history of who is really who becomes problematic. The other outcome that is preferable in phabricator now is that you know who is a real person. Of those bugzilla accounts there will not be a 1:1 translation to phabricator accounts, but if we mock up a bunch of dummy accounts and link their emails it is hard to tell who has actually registered in this system. That is not even taking into account the questionable nature of moving user's emails unsolicited to a system they never signed up for. That stuff could be handled in a tactful way most likely, but the mess of 'who is really here?' is actually more substantial than it seems at first.

Then there is the problem of how users can claim accounts. We could make a fake auth provider, we could set up some special page that handles it on its own and creates an associated account in phab after we validate an email, we could write a real auth provider for bugzilla and/or RT. If this was desirable (because we had created our army of dummy accounts) the logic for it would be a merge headache for as long as we want to allow this claiming to happen. Which ideally is forever, even if someday we don't worry about their comment history coming along or whatever. Any way we figured it there was ongoing cost we prefer not to incur. One of the small advantages @mmodell and I have is we have maintained phabricator before, and we know that they move pretty fast. They encourage weekly merges, and the longest spread I know of is 6 months (facebook), and it seems like kind of a support disaster for upstream. That kind of rolling release requires an ever escalating amount of time as local patches stack up, or get complicated, and coming out of the gate that way wasn't taken lightly.

A big part of my thinking on this has been, what is going to put us in the best place in 6 months or a year? The first year is going to be a pain no matter what, but after that bumpy ride which approach leaves us in the best position for easy tracking of upstream and low ops intervention for future users who are late, or very late to the party? So our conclusion on that front was to find a way to bend the native internals of phabricator to provide the needed logic, and then to shim a cron and a bit of metadata swapping. The idea of static comments, and users claiming their history, may seem unintuitive, but out of the box on the first day of migration anyone can pick up any ticket and it makes sense. Also though, if one or thirty thousand people sign up we may trail a bit on associating their old history, but we can do it seamlessly from several historical sources, and in the end, after the first/second/third waves everything will be "right" -- or right enough.

One of the pieces of metadata that has been critical is dependency relationships between tickets, but these don't follow a straight line through history. Issue 1 can depend on issue 200, and issue 300 can depend on issue 3. When importing a series of many thousands of tickets it isn't possible to do this linking correctly on the first pass. If you create an issue and things it is dependent on are further down the line a second wave of patch-up massaging is likely necessary. This reality was really the impetus for the current approach. It was necessary and turned out to be not that difficult -- at least comparatively.
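A rough sketch of why a second pass is needed, using a hypothetical in-memory ticket format (the `id`, `title` and `depends_on` keys are assumptions for illustration, not the real import schema): every ticket is created first so that each legacy id has a counterpart, and only then are the dependencies linked.

```python
def import_tickets(legacy_tickets):
    """legacy_tickets: list of dicts with 'id', 'title' and 'depends_on'
    (a list of legacy ids). Returns (new_tickets, unresolved)."""
    id_map = {}       # legacy id -> new id
    new_tickets = {}  # new id -> record

    # Pass 1: create every ticket with no dependencies, recording the id map.
    for n, ticket in enumerate(legacy_tickets, start=1):
        new_id = f"T{n}"
        id_map[ticket["id"]] = new_id
        new_tickets[new_id] = {"title": ticket["title"], "depends_on": []}

    # Pass 2: every target now exists, so forward references resolve too.
    unresolved = []
    for ticket in legacy_tickets:
        for dep in ticket["depends_on"]:
            if dep in id_map:
                new_tickets[id_map[ticket["id"]]]["depends_on"].append(id_map[dep])
            else:
                unresolved.append((ticket["id"], dep))  # target never imported
    return new_tickets, unresolved
```

Trying to link during the first pass would fail for the "issue 1 depends on issue 200" case, because issue 200 has no new id yet at that point.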

So whether any of that seems justifiable is probably subjective, but suffice to say we came up with a thousand ways not to migrate accounts from multiple historical sources. One of the things discussed early on for this approach of 'pulling the table cloth off the table with the dishes on it' metadata massaging was "can we live with static comments?". @Aklapper and I have talked about it a lot, or more to the point I have bugged him about it a lot, and the answer for the team was 'yes'. Static comments are not a usability issue, and with the number and size of the other problems we have it is a reasonable compromise. So the issue has never been "why can't we find a way to make comments nicer?"; it has always been one of survivable, if not reasonable, compromises. I don't think anyone on the phabricator team felt like it was a deal breaker, and honestly, we have been doing a lot more than 4 people worth of work. If you try to search for issues there is no sort or search functionality that is lost with static comments. There is no way to search for things you have commented on, but the act of commenting subscribes you to an issue, and you can search for issues you are subscribed to. We are migrating the "CC" or "Subscribers" list from bugzilla so we didn't take issue with the lack.

Why is comment metadata complex? Well it's more like it's not simple. If you load a ticket you are fetching an object, and along with it comes a bunch of transactions (in the phabricator sense). Those transactions can be adding CC's, dependencies, or whatever, and comments too. Transactions can also spawn other transactions. If you create a comment and you @mention a user that association is a transaction in and of itself. Transactions have unique global id's, authors, time metadata, and if their type is comment they associate with a comment history that has the same. So to do really savvy metadata fixup you have to change the right fields on all of those objects, and the objects they spawn, and the interactions while not opaque can still be a pain.
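To show the shape of the problem, here is a toy model of it. The `Transaction` and `Comment` classes below are simplified assumptions made up for this sketch and do not mirror Phabricator's real tables or field names. The point is only that a fixup has to touch the transaction, its comment record, and anything the transaction spawned.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    author: str
    created: int


@dataclass
class Transaction:
    txn_id: str
    kind: str                            # "comment", "cc", "mention", ...
    author: str
    created: int
    comment: Optional[Comment] = None    # only set when kind == "comment"
    spawned: List["Transaction"] = field(default_factory=list)


def fix_up(txn, new_author, new_created):
    """Rewrite author and time on a transaction, on its comment record, and
    on everything it spawned. Returns the number of objects touched."""
    touched = 1
    txn.author, txn.created = new_author, new_created
    if txn.kind == "comment" and txn.comment is not None:
        txn.comment.author, txn.comment.created = new_author, new_created
        touched += 1
    for child in txn.spawned:
        touched += fix_up(child, new_author, new_created)
    return touched
```

Even in this toy version a single comment with one @mention means three objects to update, and missing any one of them leaves the history inconsistent.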

So that's some off-the-cuff back-story, and I'm not saying they are good thoughts or bad thoughts, but they were the thoughts. Not really meant to form the basis of an argument for or against any one thing, only to convey the myriad number of issues. It's late so if I've said something preposterous I apologize.

All that being said, I spent a portion of this week figuring out how we could do it if comments are a priority. When @mmodell and I talked earlier we thought the above was the best compromise, and I still honestly don't find comment metadata to be all that compelling. However, after more exploring I think it can be done in a semi-reasonable fashion.


So here is the proposal from the phabricator team :)

  • We can set ourselves up to fix comment metadata in the nicest of nice ways for users
  • We don't know if we can do this first run during the actual initial migration window; the time estimates didn't include this thinking. But certainly shortly thereafter, and maybe say once a week, once a month, or every few months for a while, we can batch process things to an amendable state.
  • Part of the 'batch run' thinking is twofold. We are not sure how this will impact things if we try to do it during normal hours (in theory fine), and it involves having to invalidate some remarkup caching so that updated things are shown correctly. Anytime you start invalidating cache on a broad scale while users are doing their thing, chaos can ensue, so at this point the thinking is scheduled off-hours runs.
  • After said batch runs user foo will now in theory be seamlessly integrated into Phab with history from source only historically identifiable by their external reference id (fl, bz, rt, etc).
  • I'm thinking 7 or 8 days of upfront work / testing to get this to a solid state. Part of that is lots of testing of partial cache invalidation; Phab only has tooling to support global invalidation, but that would be pretty heavy-handed.
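The batch-run idea in the list above could look something like the following sketch. Everything in it is an assumption for illustration: `pending`, `claimed` and the dict-based `cache` are made-up stand-ins, and this is not how Phabricator's remarkup cache is actually stored or invalidated.

```python
def batch_fixup(pending, claimed, cache):
    """One scheduled off-hours run.

    pending: list of (ticket_id, legacy_ref) pairs that still have static comments
    claimed: dict mapping legacy_ref (e.g. "bz:10") to a registered username
    cache:   dict mapping ticket_id to rendered markup
    Returns (remaining, touched)."""
    touched = set()
    remaining = []
    for ticket_id, legacy_ref in pending:
        if legacy_ref in claimed:
            # The real metadata rewrite would happen here.
            touched.add(ticket_id)
        else:
            remaining.append((ticket_id, legacy_ref))  # retry on a later run

    # Partial invalidation: only drop cache entries for tickets that changed.
    for ticket_id in touched:
        cache.pop(ticket_id, None)
    return remaining, touched
```

Dropping only the touched entries is the "partial" invalidation mentioned above; a global flush would be the equivalent of `cache.clear()` and would force every ticket to re-render.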

Thank you @chasemp for the technical explanation. Let me add the social explanation.

Comment metadata is important, and user history is even more important. However, not all comment metadata and not all user history are equally important. A few users have hundreds of comments, while hundreds of users have a few comments. Some Bugzilla users already have an account in Phabricator, while some Bugzilla users can't even access their old accounts because the email addresses they used are long gone.

Thanks to mutante and aklapper, we have now these interesting numbers:

  • 525 active Bugzilla users in the past 7 days
  • 871 active Bugzilla users in the past 30 days
  • 2606 active Bugzilla users since February (when Bugzilla was updated and the last_seen_date feature was available)
  • 19847 Bugzilla accounts

I propose to target those 2606 users (13% of the total) with the two emails proposed above. If only a third of those active users do respond and create a Phabricator account before the migration, we will have covered the amount of regular monthly users.
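For anyone who wants to check the arithmetic behind those figures, a quick calculation using only the numbers listed above:

```python
total_accounts = 19847    # all Bugzilla accounts
active_since_feb = 2606   # active since last_seen_date became available
active_30_days = 871      # active in the past 30 days

# Share of all accounts that have been active since February.
share_active = active_since_feb / total_accounts

# A third of the active users, compared with the regular monthly users.
third_of_active = active_since_feb / 3

print(f"{share_active:.1%} of accounts active since February")
print(f"a third of active users: {third_of_active:.0f} vs {active_30_days} monthly")
```

So roughly 13% of accounts are in the target group, and a third of that group is about the size of the monthly active user base.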

This is a simple technical solution that would solve the most complex part of the social problem. In any case, I believe 99% of those active users will be happy to be officially notified about the migration, and to have a first experience with Phabricator before they find Bugzilla in read-only mode.

This together with the periodical updates that Chase mentioned is a fair compromise that basically anybody can understand and agree with. The bottom line is: if you were around, you had a chance to fix this problem before Day 1; if you have been away several months, or it took several weeks for you to respond, then you can also wait a few weeks to have your comment metadata connected to your new Phabricator user account.

We have discussed this topic extensively in our last Phabricator team meeting, and the four of us agree with it. It is a fair solution, and there are still many open ends that we need to fix. Let's agree on this approach.

In T572#9273, @chasemp wrote:

But that still leaves behind artifacts such as "foo commented on xx:xx:xx" since all comments need context in case the user never registers.

Sure it is slightly confusing to have that comment, but I think that noise is still 'acceptable'.

In T572#9556, @faidon wrote:

I obviously haven't followed this from the beginning so there's lots of background that is lost to me. Can someone (very briefly) explain here why it's impossible to:

  • migrate accounts off Bugzilla/RT as-is (email as the username for Bugzilla)

One of the most common complaints by Wikimedia users about Bugzilla (besides missing SUL) has been that it reveals your email address: https://bugzilla.wikimedia.org/show_bug.cgi?id=148
However you could argue that this data is already there anyway.

Converting the date/time metadata of comments into proper Phab metadata would be lovely to be able to e.g. search for rotten tickets that have not seen any updates for ages ("Created Before", "Updated Before"). Also having http://korma.wmflabs.org/browser/bugzilla_response_time.html in mind here.

Converting the commenter name metadata into proper Phab metadata feels less important to me simply because Phabricator (currently?) does not allow to search for "Tickets that I have commented on" on https://phabricator.wikimedia.org/maniphest/query/advanced/

Just a note: the sort by modified / created doesn't actually use the comments for anything; it's another field set directly on the issue itself.

In T572#10081, @chasemp wrote:

So here is the proposal from the phabricator team :)

  • We can set ourselves up to fix comment metadata in the nicest of nice ways for users
  • We don't know if we can do this first run during the actual initial migration window; the time estimates didn't include this thinking. But certainly shortly thereafter, and maybe say once a week, once a month, or every few months for a while, we can batch process things to an amendable state.
  • Part of the 'batch run' thinking is twofold. We are not sure how this will impact things if we try to do it during normal hours (in theory fine), and it involves having to invalidate some remarkup caching so that updated things are shown correctly. Anytime you start invalidating cache on a broad scale while users are doing their thing, chaos can ensue, so at this point the thinking is scheduled off-hours runs.
  • After said batch runs user foo will now in theory be seamlessly integrated into Phab with history from source only historically identifiable by their external reference id (fl, bz, rt, etc).
  • I'm thinking 7 or 8 days of upfront work / testing to get this to a solid state. Part of that is lots of testing of partial cache invalidation; Phab only has tooling to support global invalidation, but that would be pretty heavy-handed.

This seems like an ok compromise to me. It feels a little hacky, but as @chasemp pointed out on IRC, there's no way to do this that isn't a little hacky. It'd be nice to have a wider conversation about this, because I think it's going to be a surprising outcome to many people. A concise and appropriately prominent explanation of this tradeoff as part of a wider rollout will be helpful here.

It'd be nice to have a wider conversation about this, because I think it's going to be a surprising outcome to many people. A concise and appropriately prominent explanation of this tradeoff as part of a wider rollout will be helpful here.

T552: https://bugzillapreview.wmflabs.org/ migration preview instance is the best framework for this conversation, because there users will see exactly what we are talking about.

Qgil lowered the priority of this task from Unbreak Now! to Medium. Oct 19 2014, 1:08 AM

is this ticket a blocker for the RT migration as well? we are going to keep all the comments from RT tickets if possible at all, right?

I think this task is resolved at least for the Bugzilla part. Register at https://bugzillapreview.wmflabs.org/ with your Bugzilla email address, wait until bzimport has assigned your Bugzilla activity to your new Phabricator account, and you will see comments assigned to users with the time that they were originally posted.

Example: https://bugzillapreview.wmflabs.org/T1

In T572#14118, @Dzahn wrote:

is this ticket a blocker for the RT migration as well? we are going to keep all the comments from RT tickets if possible at all, right?

Of course. In Bugzilla we are not migrating the comments that were marked as private, and I guess the same will apply to RT. In any case, this is an implementation detail to be discussed in the context of RT-Migration, if needed.

I think we can close this task. Thanks to this discussion and the extra time we have invested improving the migration script accordingly, claimed migrated comments look and behave just like the native ones. If you think that something in this area still needs to be improved, please create a new task with the specific details.