Disable IO for diffusion repositories
Open, Needs Triage, Public

Description

We believe that T374926: [EPIC][Infra] Move Wikibase and WikibaseLexeme Git submodules to suitable Git host is the last use-case of git on Phab/Diffusion.

We should be able to disable IO for all Diffusion repositories except Wikibase and WikibaseLexeme Git submodules.

This will:

  1. Reduce git load on Phabricator
  2. Test our theory that there is nothing else truly depending on Phab's git.

Details

Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| Draft: Set IO_NONE by default on repos; always expose observed URI | repos/phabricator/phabricator!103 | aklapper | aklapper-wmf/T405596part2 | wmf/stable |

Event Timeline

Strong support, in light of recent events.

Does this refer to all and any IO? For example, we mirror 26 repositories from Diffusion into Github (see also queries in T347577 and T349921).

What does “IO” mean in this context at all? Is it some kind of Phabricator jargon? In a general sense, I/O would include everything that happens with Diffusion: SSH reads, Git HTTP reads, browser HTTP reads, the disk I/O Phabricator does to store the files, etc. I guess it’s not what you mean, otherwise you would’ve simply said “Disable Diffusion repositories”.

The I/O term is in the Diffusion (Phabricator) settings. If you go to Diffusion and hit "Manage Repository" on one of the repos, there are a couple of settings.

It's not just simply on or off.

For example a repo can be active/inactive, publishing/not publishing and there are configurable URIs for each one.

Under the URI section there can be different ways to git clone (http, https or ssh) and each one of them has a column "I/O" which can have values like "Default, Observe, None, ReadOnly, Mirror or ReadWrite ... "

--> https://we.phorge.it/book/phorge/article/diffusion_uris/#reference-i-o-types

Thanks for the pointers! At https://phabricator.wikimedia.org/source/mediawiki/manage/uris/, I see

I guess the 3d2png entry is out of scope because it’s already disabled.

I hope the Gerrit entry won’t be disabled, as it’d render the mirror practically useless (it’d result in it being frozen in whatever state it currently is in).

Do you want to disable all other entries, i.e. all Read Only ones and the GitHub one? (What will update our GitHub mirror going forward?)

(What will update our GitHub mirror going forward?)

I might be wrong, but I believe that most of our GitHub mirrors (I believe including wikimedia/mediawiki) are updated directly from Gerrit, rather than via Diffusion.

For some of the repos that do mirror directly from Diffusion to GitHub (e.g. https://phabricator.wikimedia.org/source/tool-admin-web/manage/uris/ IIUC), it does feel like disabling some I/O settings there might be a problem.
(xref the list at T347577#9689792, though I guess that list may be slightly outdated if any Diffusion->GitHub mirroring has been added/removed since April 2024.)


[Side note: interestingly, looking at the GitHub activity for a random repo that's located in Gerrit but which still appears to have separate Diffusion->GitHub mirroring configured, it seems like Gerrit and Diffusion are sometimes git push-warring with each other to get commits mirrored first :P]

Sorry for the ambiguous "IO" reference. As @Dzahn points out, that's a Phabricator/Phorge term. Every repo in Diffusion has URIs.

URIs are ssh or https URLs that point to:

  • builtin places where users can download from diffusion; e.g., phabricator.com/source/mediawiki.git
  • admin-added places we want to mirror a diffusion repository; e.g., github.com
  • admin-added places phabricator should observe for changes and fetch into diffusion; e.g., github.com

Each of these URIs can be enabled or disabled.


A better wording for my proposal would be:
Remove the ability to clone repositories from diffusion for all repositories except the Wikibase repos mentioned in this task

This would:

  • Save resources – since all diffusion repositories should be mirrors, there's no reason folks need to clone from Phabricator.
  • Enforce assumptions – we know Wikibase/WikibaseLexeme submodules are cloned from diffusion, and there's a tracking task to stop that. Hopefully that's the last legit use-case. Disabling clones of other repos will help us be sure.
  • Lessen user confusion – since Phabricator can be code browser/code forge, and repository browsing works, it might make sense to a user to download a repository from Phabricator. Later that user would discover you can't push anything to Phabricator.

But it seems we cannot disable builtin repository URIs. We can only set them to "hidden", which hides them in the UI from end-users, but still allows cloning from strange URLs like diffusion/19/mediawiki.git.

So, in practice, my proposal is less straightforward than I'd assumed. That is, it would require changes to the Diffusion application in Phorge (I think).


(What will update our GitHub mirror going forward?)

I might be wrong, but I believe that most of our GitHub mirrors (I believe including wikimedia/mediawiki) are updated directly from Gerrit, rather than via Diffusion.

That is correct, repos in Gerrit are mirrored by Gerrit to GitHub.

[Side note: interestingly, looking at the GitHub activity for a random repo that's located in Gerrit but which still appears to have separate Diffusion->GitHub mirroring configured, it seems like Gerrit and Diffusion are sometimes git push-warring with each other to get commits mirrored first :P]

Neat. Proposal addendum:

If a diffusion repo has a gerrit.wikimedia.org or gerrit-replica.wikimedia.org URI that is set to Observe, then it should NOT have a github.com URI set to Mirror.
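
The addendum above can be checked mechanically. Here is a minimal Python sketch, assuming the URI records have been exported into simple objects. The `RepoUri` class, its field names, and the function name are hypothetical, and host matching is a plain substring test in the spirit of the SQL `LIKE` filters used elsewhere in this task.

```python
from dataclasses import dataclass

# Hosts named in the proposal addendum.
GERRIT_HOSTS = ("gerrit.wikimedia.org", "gerrit-replica.wikimedia.org")


@dataclass
class RepoUri:
    """Hypothetical export of one repository URI row."""
    repo: str
    uri: str
    io_type: str          # e.g. "observe", "mirror", "none"
    is_disabled: bool = False


def repos_violating_addendum(uris):
    """Return repos that observe Gerrit AND mirror to github.com."""
    observing_gerrit = set()
    mirroring_github = set()
    for u in uris:
        if u.is_disabled:
            continue  # disabled URIs do no I/O, so they cannot violate the rule
        if u.io_type == "observe" and any(h in u.uri for h in GERRIT_HOSTS):
            observing_gerrit.add(u.repo)
        elif u.io_type == "mirror" and "github.com" in u.uri:
            mirroring_github.add(u.repo)
    return sorted(observing_gerrit & mirroring_github)
```

Any repository returned by `repos_violating_addendum` would be a candidate for having its github.com Mirror URI disabled, since Gerrit already mirrors it.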

I manually disabled mirroring from diffusion for the repo you flagged @A_smart_kitten , thanks for flagging that.

As a test, I changed the I/O Type for all Read Only URIs listed on https://phabricator.wikimedia.org/source/tool-jouncebot/manage/uris/ (I picked a random project) from "Read Only" to "No I/O". This seemed to have worked:

[acko@machina ~]$ git clone https://phabricator.wikimedia.org/source/tool-jouncebot.git
Cloning into 'tool-jouncebot'...
fatal: unable to access 'https://phabricator.wikimedia.org/source/tool-jouncebot.git/': The requested URL returned error: 403
[acko@machina ~]$ git clone http://phabricator.wikimedia.org/diffusion/3307/tool-jouncebot.git
Cloning into 'tool-jouncebot'...
fatal: unable to access 'http://phabricator.wikimedia.org/diffusion/3307/tool-jouncebot.git/': The requested URL returned error: 403
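
The manual check above could be scripted for more than one repository. This is a rough Python sketch, not a tested tool: it assumes `git` is on the PATH, uses `git ls-remote` instead of a full clone, and treats the 403 message shown in the transcript as the signal. The function names are made up for illustration.

```python
import os
import subprocess

# Message fragment taken from the transcript above.
FORBIDDEN_MARKER = "The requested URL returned error: 403"


def is_forbidden(git_stderr: str) -> bool:
    """True if git's error output contains the HTTP 403 message."""
    return FORBIDDEN_MARKER in git_stderr


def clone_is_blocked(url: str, timeout: float = 30.0) -> bool:
    """Run `git ls-remote` against url and report whether it was refused with a 403."""
    # GIT_TERMINAL_PROMPT=0 stops git from asking for credentials interactively.
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0", LC_ALL="C")
    result = subprocess.run(
        ["git", "ls-remote", url],
        capture_output=True,
        text=True,
        env=env,
        timeout=timeout,
    )
    return result.returncode != 0 and is_forbidden(result.stderr)
```

Calling `clone_is_blocked` on both the `/source/...` and the `/diffusion/<id>/...` form of a URI would cover the two variants tested by hand above.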

If a diffusion repo has a gerrit.wikimedia.org or gerrit-replica.wikimedia.org URI that is set to Observe, then it should NOT have a github.com URI set to Mirror.

While I never fully understood the meaning of "tracking-enabled", nor the confusing ioType called "default", I believe that we have (at least) 25 repositories that observe gitlab or gerrit-replica and mirror to github, according to
SELECT CONCAT("https://phabricator.wikimedia.org/diffusion/", r.id, "/manage/uris/") AS repoURI, r.name, uO.uri AS observedFromUri, uM.uri AS MirroredToUri FROM phabricator_repository.repository r INNER JOIN phabricator_repository.repository_uri uM ON r.phid = uM.repositoryPHID INNER JOIN phabricator_repository.repository_uri uO ON r.phid = uO.repositoryPHID WHERE r.details LIKE "%\"tracking-enabled\":\"active\"%" AND uM.ioType = "mirror" AND uM.isDisabled = 0 AND uM.uri LIKE "%github%" AND uO.ioType = "observe" AND uO.isDisabled = 0;

A better wording for my proposal would be:
Remove the ability to clone repositories from diffusion for all repositories except the Wikibase repos mentioned in this task

Thanks, that’s much clearer.

As a test, I changed the I/O Type for all Read Only URIs listed on https://phabricator.wikimedia.org/source/tool-jouncebot/manage/uris/ (I picked a random project) from "Read Only" to "No I/O". This seemed to have worked:

[acko@machina ~]$ git clone https://phabricator.wikimedia.org/source/tool-jouncebot.git
Cloning into 'tool-jouncebot'...
fatal: unable to access 'https://phabricator.wikimedia.org/source/tool-jouncebot.git/': The requested URL returned error: 403
[acko@machina ~]$ git clone http://phabricator.wikimedia.org/diffusion/3307/tool-jouncebot.git
Cloning into 'tool-jouncebot'...
fatal: unable to access 'http://phabricator.wikimedia.org/diffusion/3307/tool-jouncebot.git/': The requested URL returned error: 403

Great! (Next time, feel free to experiment with https://phabricator.wikimedia.org/source/tool-atiro/ instead of random other people’s projects – I don’t use the Phabricator mirror, and probably nobody else does either. As it turns out, at least its display has been broken ever since I renamed the default branch to main…)

This ticket is getting harder to follow and by now covers numerous things. We may want to move some items into subtasks.

In my understanding, we want to:

[1] SELECT CONCAT("https://phabricator.wikimedia.org/diffusion/", r.id, "/manage/uris/") AS repoURI, r.name, uO.uri AS observedFromUri, uM.uri AS MirroredToUri FROM phabricator_repository.repository r INNER JOIN phabricator_repository.repository_uri uM ON r.phid = uM.repositoryPHID INNER JOIN phabricator_repository.repository_uri uO ON r.phid = uO.repositoryPHID WHERE r.details LIKE "%\"tracking-enabled\":\"active\"%" AND uM.ioType = "mirror" AND uM.isDisabled = 0 AND uM.uri LIKE "%github%" AND uO.ioType = "observe" AND uO.isDisabled = 0;
[2] SELECT u.id, CONCAT("https://phabricator.wikimedia.org/diffusion/", r.id, "/manage/uris/") AS repoURI, r.name, u.uri, builtinProtocol, ioType, displayType FROM phabricator_repository.repository r INNER JOIN phabricator_repository.repository_uri u ON r.phid = u.repositoryPHID WHERE u.builtinProtocol = "ssh" AND u.ioType != "observe" AND u.ioType != "mirror" AND u.displayType != "never";
[3] SELECT u.id FROM phabricator_repository.repository r INNER JOIN phabricator_repository.repository_uri u ON r.phid = u.repositoryPHID WHERE r.details LIKE "%\"tracking-enabled\":\"active\"%" AND u.builtinProtocol IS NULL AND u.ioType = "observe" AND u.displayType != "always";
[4] SELECT u.id, CONCAT("https://phabricator.wikimedia.org/diffusion/", r.id, "/manage/uris/") AS repoURI, r.name, u.uri, builtinProtocol, ioType, displayType FROM phabricator_repository.repository r INNER JOIN phabricator_repository.repository_uri u ON r.phid = u.repositoryPHID WHERE (u.builtinProtocol = "http" OR u.builtinProtocol = "https") AND u.ioType != "observe" AND u.ioType != "mirror";

As a next step I'm tempted to mass-set all http (no s at the end) URIs which neither observe nor mirror to io=none && display=never, so afterwards only non-mirror&&non-observe URIs for https (note the s at the end) would be left.
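
To make the selection rule in that step explicit, here is a small Python sketch that works on rows shaped like the output of query [4]. It only computes the planned changes and does not touch Phabricator. The function name and the dict-based row format are assumptions for illustration.

```python
def plan_http_lockdown(uri_rows):
    """Return the changes needed to close plain-http URIs that neither observe nor mirror.

    Each row is a dict with the keys id, builtinProtocol, ioType and displayType,
    mirroring the column names used in the SQL queries in this task.
    """
    changes = []
    for row in uri_rows:
        if row["builtinProtocol"] != "http":
            continue  # https and ssh URIs are left alone in this step
        if row["ioType"] in ("observe", "mirror"):
            continue  # observed and mirrored URIs must keep working
        if row["ioType"] == "none" and row["displayType"] == "never":
            continue  # already locked down
        changes.append({"id": row["id"], "ioType": "none", "displayType": "never"})
    return changes
```

Reviewing the returned list before applying anything gives a chance to exclude specific repositories by hand.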

I guess I'm looking at @thcipriani for a {yes | no wait} input whether to exclude certain repos from getting "blocked" when disabling http access, or get a green light and just do it (and afterwards find out the hard way who was not already using https).

  • For Diffusion repos which observe a gerrit or gitlab URI and also mirror to github [1], change that setup not to have Diffusion "in the middle", per T405596#11242761.

I can't immediately cite a source for this but I feel like GitLab repos don't automatically mirror to GitHub (in the way that Gerrit repos do). Wouldn't disabling GitHub mirroring for these GitLab Diffusion-repo-mirrors stop the GitHub-repo-mirrors from being updated (or am I misunderstanding something)?

As a next step I'm tempted to mass-set all http (no s at the end) URIs which neither observe nor mirror to io=none && display=never

+1. Already did this for a bunch of other repos in the past.

  • For Diffusion repos which observe a gerrit or gitlab URI and also mirror to github [1], change that setup not to have Diffusion "in the middle", per T405596#11242761.

I can't immediately cite a source for this but I feel like GitLab repos don't automatically mirror to GitHub (in the way that Gerrit repos do). Wouldn't disabling GitHub mirroring for these GitLab Diffusion-repo-mirrors stop the GitHub-repo-mirrors from being updated (or am I misunderstanding something)?

The open source version of gitlab does support push mirroring, but that support requires configuring the push token (i.e. github access credentials) separately for each and every origin repo. This makes it a relatively untenable solution for us at scale, especially for mirroring any repo where we would rather not expose the downstream access credentials to the upstream repo owners. We can build a different system to sit between gitlab and github to enable push mirroring if diffusion is no longer suitable, but we would need to build something.

<tl;dr>: I disabled I/O for any non-observed, non-mirrored http clone URIs that I had permissions to edit. According to Phab's pull logs nobody should be affected.

Longer version: Nothing beats ~~outages over the weekend~~ data-driven decisions, so I checked pull traffic via
SELECT r.name, rp.* FROM phabricator_repository.repository_pullevent rp INNER JOIN phabricator_repository.repository r ON r.phid = rp.repositoryPHID WHERE remoteProtocol != "https";
and it was empty. And also, DiffusionServeController::handleRequest() does differentiate between PROTOCOL_HTTP and PROTOCOL_HTTPS.
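
For anyone wanting to try the shape of that pull-log check without database access, here is a self-contained Python sketch using an in-memory SQLite database. The two-table schema is a stripped-down assumption containing only the columns the query above touches, and the sample rows are invented.

```python
import sqlite3


def non_https_pulls(conn):
    """Return (repo name, protocol) for every logged pull that did not use https."""
    return conn.execute(
        "SELECT r.name, rp.remoteProtocol "
        "FROM repository_pullevent rp "
        "INNER JOIN repository r ON r.phid = rp.repositoryPHID "
        "WHERE rp.remoteProtocol != 'https'"
    ).fetchall()


def demo_connection():
    """Build a tiny in-memory stand-in for the two tables used by the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE repository (phid TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE repository_pullevent (repositoryPHID TEXT, remoteProtocol TEXT);
        INSERT INTO repository VALUES ('PHID-REPO-demo', 'demo-repo');
        INSERT INTO repository_pullevent VALUES ('PHID-REPO-demo', 'https');
        """
    )
    return conn
```

An empty result from `non_https_pulls` corresponds to the situation described above: no logged pulls over anything other than https.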

I handled those repositories which I could not edit before, by following https://wikitech.wikimedia.org/wiki/Phabricator#Unlocking_edit_permissions_on_random_objects. That means
SELECT u.phid, u.id, CONCAT("https://phabricator.wikimedia.org/diffusion/", r.id, "/manage/uris/") AS repoURI, r.phid, r.name, u.uri, builtinProtocol, ioType, displayType FROM phabricator_repository.repository r INNER JOIN phabricator_repository.repository_uri u ON r.phid = u.repositoryPHID WHERE (u.builtinProtocol = "http" OR u.builtinProtocol = "ssh") AND u.ioType != "observe" AND u.ioType != "mirror" AND u.displayType != "never";
and
SELECT u.phid, u.id, CONCAT("https://phabricator.wikimedia.org/diffusion/", r.id, "/manage/uris/") AS repoURI, r.phid, r.name, u.uri, builtinProtocol,u.id FROM phabricator_repository.repository r INNER JOIN phabricator_repository.repository_uri u ON r.phid = u.repositoryPHID WHERE r.details LIKE "%\"tracking-enabled\":\"active\"%" AND u.builtinProtocol IS NULL AND u.ioType = "observe" AND u.displayType != "always";
show zero results for now, as they ideally should, until Striker creates the next new repos (see the subtasks here).