- Diffusion is not the canonical source of anything
- → this is a feature. Diffusion is used as mirror to improve integration with Phabricator features, like:
- mentioning commits only with their hash name
- having commits shown under your Phabricator user page
- being able to close a task by "Closes T123" in the commit message, without the need of any bot
- → this is a feature. Diffusion is used as mirror to improve integration with Phabricator features, like:
- Having it causes confusion
- → what kind of confusion?
- Having it caused disruption
- → what kind of disruption?
- Having it caused wastes time
- → Probably true for Phabricator administrators. But there are still benefits.
- → Unclear what kind of waste of time for other people.
- There are a number of bugs
- Which bug in particular?
- These bugs are getting fixed by anyone.
- → Untrue. Upstream is actively working on bug reports and the closed tasks increases over time - https://we.phorge.it/project/reports/104/ - https://we.phorge.it/tag/diffusion/
- There's now at least one bug which interferes with actually getting things done T358940: GerritBot comments for 7-digit Gerrit changes conflict with Diffusion commit hashes
- → The problem was fixed by the kind matmarex during Mar 21 2024, 19:18
Description
Event Timeline
Uh? 🤔
Just to clarify, here the proposal is to close this kind of repositories (?)
https://phabricator.wikimedia.org/source/mediawiki/
Please at least evaluate all the features we are going to shutdown, with the above idea:
- We cannot mention git commits like 0f2d0d7bbb9a anymore generating an handy link
- We cannot anymore visit the https://phabricator.wikimedia.org/source/mediawiki/browse/master/ with the "pattern search" (not available to Gerrit, not available to GitLab) that use the native git grep stuff and it's super-fast.
- We cannot anymore quickly blame things and surf the permalink back in time clicking on the "<<" arrows like you can do in https://phabricator.wikimedia.org/source/mediawiki/browse/master/opensearch_desc.php
- I cannot anymore click on Open File to open that file in my editor (since I know how to configure that in my Phabricator settings) from the above URL
- I cannot anymore see people contributions in Phabricator itself, from profile pages
- ...
- ...
I have disagreed each time this general idea has been promoted in the past and I will continue to disagree until there is a material difference in the arguments made by either side.
I prefer the ability to explicitly link commits using the {hash} and {diffusion repo:hash} methods over autoliniking which for diffusion commits and other things can be annoying. Auto linked references are not however the only use-case for Diffusion today.
Diffusion is providing a helper service for mirroring repositories to other codeforges. The open-core GitLab "Community Edition" product does not support pull mirroring, but Diffusion does. This means Diffusion can be used to pull from Gerrit or another origin and then push into GitLab. GitLab CE's push mirroring is also deficient in that it requires each origin repo to individually configure credentials for the downstream. These credentials cannot be hidden from the GitLab repo owner, nor can they be shared across repos. This makes GitLab CE's push mirroring support insufficient to implement bulk mirroring from our GitLab to other codeforges such as GutHub.
In my personal opinion, Diffusion provides a much nicer code and commit history browsing experience over Gitiles (the code browser currently used with the Wikimedia Gerrit deployment). You can compare and decide for yourself if you concur, but if we rip out Diffusion then there will not be a choice.
Thanks for reporting. I have carefully examined every single point highlighted, and I think the situation is under control.
Hoping to do a good thing, I put my details directly in the original description.
Please feel free to increase the details and reopen it 🌈 And thank you for any upstream bug reports.
To me this sounds like a valid request not to mirror repositories between numerous systems and not to spend time trying to keep things updated and in sync. If at all, this task may get declined at some point but it does not sound invalid.
While I agree with @Aklapper that this task is not “invalid,” I also think that dropping Diffusion mirrors would be a big loss, and in the end users would spend more working it around than what administrators would win:
- The handy links in tasks wouldn’t work anymore (including already-inserted ones!), let that be a commit link (bb1d0aca814d), or a link to a certain file (rMW includes/recentchanges/ChangesFeed.php) or a certain line of a file (rMW includes/recentchanges/ChangesFeed.php:37 (at bb1d0aca814d)). This means having to resort to less useful links in future comments/descriptions, for which readers need more time to figure out what they mean; and even more time trying to figure out what the broken links to Diffusion (which probably won’t even be rendered as links) mean.
- The powerful repository browser. While basic code viewing functionality is present in both Gitiles and on GitHub, Gitiles cannot search for files or code at all, and GitHub doesn’t allow searching for code without logging in, which means having to have a GitHub account and potentially also means going through 2FA (searching for files is allowed on GitHub without logging in). Code navigation by clicking on an identifier is a useful feature of GitHub, but again only for logged-in users – for logged-out users, it works only within a file, which is hardly better than Ctrl+F, especially in languages as dynamic as PHP and JavaScript. (And remember: I’m not arguing for removing the GitHub mirror – the discoverability of GitHub for newbies is unbeatable –, only against removing the Diffusion one.)
- The diff views that are integrated with the rest of Phabricator:
- Neither Gitiles nor GitHub render Phabricator task numbers as links (the Gerrit code review interface does, but Gitiles doesn’t; the workaround is clicking on the Change-Id – which is rendered as a link in Gitiles, though not on GitHub –, and clicking once again to get to Phabricator; or copying and pasting the task ID), while Diffusion of course renders them.
- Similarly, the authorship information is more integrated into Wikimedia: if the author/comitter email address is connected to a Phabricator account, it links to that account, from which one can not only find Phabricator comments of that person, but also their Wikimedia (CentralAuth/mediawiki.org) account, so it’s possible to write to the talk page of the author instead of opening a task on Phabricator if that’s more appropriate; in contrast, Gitiles shows whatever one entered in user.name (which may be a fully different name, e.g. a made-up user name used on Wikimedia but full name used in Git) and generates no links, while GitHub links to the GitHub profile if there’s any, and nowhere if there isn’t (Diffusion also links nowhere if there’s no Phabricator account, but that’s less likely).
A reason against diffusion is the poor performance. Everything above a few dozen RPS overloads the Phabricator host. Unfortunately all source code endpoints got quite popular for AI scraping and gathering training data. So we are seeing more traffic than in the previous years. So long term we either need performance improvements for diffusion, dedicated hardware for diffusion or have to disable diffusion and all of the source code hosting on Phabricator.
If Diffusion is not worth exposing to the Internet because damn AI bots are causing operations burnout, a quick and dirty mitigation is to change the visibility policy of the Diffusion application from visible to Public to visible to All users.
I think a task about the CPU spikes on the Phabricator host related to AI-bots would be better instead of talking here. In any case, a possible actionable thing is working on this promising patch, to greatly reduce the amount of links for scrapers:
I like the idea of making diffusion only visible to logged-in users. That would also surface which tools specifically are still using it.
Given that this is already a task about completely disabling it it does seem to be a reasonable step to take in between.
But of course that would require some more discussion with stakeholders.
I’ve just used Diffusion yesterday because I didn’t want to log in, and that’s the only view that allows searching code without login (Gitiles doesn’t allow search at all, and GitHub requires login). So it would be a loss if logged-out users couldn’t access it. ☹ But if we have to choose between requiring login and turning Diffusion off completely, requiring login is still less bad.
If we end up requiring login, is there a way to direct logged-out users to Gitiles and/or GitHub (instead of just showing a generic error message), so that they still have some way to see the code?
Also see https://codesearch.wmcloud.org/search/ and there is a also potentially a replacement for that coming in the future.
I haven’t realized Codesearch can search in specific repos (Diffusion is so convenient for searching in a specific repo that I haven’t needed this); I’ve always used it in “search everything” mode. Is it maybe possible to link it from Gitiles, with the repo selected, both for convenience (one doesn’t have to select a repo from an endless list of repos) and for visibility?
For those that know about diffusion searching, would probably be logged into phab most of the time so rhe impact would be smaller if guessing. (except those that don't keep sessions open)
The error message from memory is worded around "you need to be logged into to use this phabricator appliaction"
And except those who use multiple devices. For example, I often search code on my phone, but I’m logged in only on my computer (and neither do I want to log in on my phone, for security reasons). However, Codesearch is a viable alternative, provided that one is aware of it.
The error message from memory is worded around "you need to be logged into to use this phabricator appliaction"
And could that be improved?
Also, does disabling the application mean that revision hashes and the like will no longer be linked when viewed logged out (similarly to what happens when a security task is mentioned)? If yes, can this unlinking be disabled? Because that’s also a discoverability problem.
If crawling Diffusion is a problem then I am very fine making Diffusion available only to logged-in users. Admins can do so by going to https://phabricator.wikimedia.org/applications/edit/PhabricatorDiffusionApplication/ and setting Can Use Application to All Users.
This may create some repo mirroring / replication disruptions though (cf T349921: integration-agent-docker machines excessively pull some Wikibase related Git repos in Diffusion, T347577: Understand which repositories we mirror, observe, host in Diffusion (and fix some findings)).
The error message from memory is worded around "you need to be logged into to use this phabricator appliaction"
Talking about UX when accessing Diffusion in a web browser:
- If an admin changes Can Use Application from Public to All Users on /applications/view/PhabricatorDiffusionApplication/, Phab only renders a Login dialog to anonymous / not logged-in users who access /diffusion/FOO/ or /source/baz/.
- If an admin only changes the Default View Policy from Public to All Users, Phab still renders repository content to anonymous / not logged-in users who access /diffusion/FOO/ or /source/baz/.
(Note to myself: Hacking public function shouldAllowPublic() { return false; } into DiffusionController` does not work.)
And could that be improved?
Given the huge variety of URI routes that Diffusion has I'd very much prefer not to try adding some custom code to add some custom message to the general auth dialog based on the URI which people tried to access before...
Yes, in my understanding.
If yes, can this unlinking be disabled?
No, as any and all objects in Phab which you cannot see/access should also not expose data to you. It's a general concept which makes sense.
I thought about a generic “try Gitiles or GitHub” thing, not linking to the specific thing the user is looking for. If we aim for that, it’s not that huge variety, there are only three possibilities (if I understand the code correctly):
- ^/source/
- ^/diffusion/
- ^/(r[A-Z]+|R[1-9]\d*:)[a-f0-9]+$
Well, it makes sense generally, when access is disabled for security reasons (like security tasks). It doesn’t make much sense when access is disabled for performance reasons.
It seems to me this ticket might have become a bit unspecific over time.
It started out as "Drop our mirroring of code to Diffusion and empty the repos" but morphed towards "let's only show diffusion to logged in users" (and some related and general discussion about different ways to search code).
Are there any vetoes / hard concerns about making it visible only to logged in users?
I wouldn't necessarily call it a veto per se (if only because I probably don't have the authority to veto something like this!), but I do share @Tacsipacsi's concern (raised in T359549#10899369) about Git commit-hashes no longer being autolinked for logged-out users if that does go ahead.
If I was presented with a binary choice between 'disable Diffusion entirely' or 'disable Diffusion for logged-out users', though, I also concur with what @Tacsipacsi said in T359549#10896619 that disabling the app for logged-out users only would be the least bad option.
I'd appreciate comments as comments. While summarizing a situation by updating the task description makes sense, discussing different points of views in the task description while not knowing who wrote what makes less sense and is just confusing to me.
I'd prefer to first try https://we.phorge.it/rP966e80e89efd137839f53a47aa3002fb393941ef and https://we.phorge.it/D26141 once we are on a very recent upstream in our instance (that chain is T370266 -> T386558 -> T393840). If that does not help (enough), we can still make Diffusion logged-in only.
But again, that's all about unwanted Traffic. This task requests not to offer Diffusion at all which might turn into a wontfix per previous comments.