
Automatically index extensions in Codesearch
Closed, Declined; Public

Description

Problem:
The set of repositories covered by Codesearch is currently expansive and generally automated, but it is unclear how repositories that are not hosted on Gerrit can be added to the list. With the Stable Interface Policy mentioning Codesearch as a preferred way to assess usage of deprecated code in extensions, it is becoming increasingly important to have a lightweight and transparent way for extension and skin authors to ensure their extension is included in the index.

Proposal:

WRITEME
NOTE: this is not presently intended to be an RFC, but it may become one if that appears necessary or useful.

Event Timeline

Codesearch searches for extensions and rebuilds its index on a daily basis: https://gerrit.wikimedia.org/g/labs/codesearch/+/refs/changes/47/641847/1/write_config.py#35 Do you mean something larger?

My understanding from the wiki page and from very briefly skimming the script is that it automatically lists all repos in gerrit and in phab, and then has a hard-coded list of additional stuff to index. I'm proposing to change this script so it scans the relevant categories on mediawiki.org instead. This would include more 3rd party extensions, and remove the need for 3rd parties to either request inclusion or host their stuff on gerrit in order to be indexed.
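
For illustration, here is a minimal sketch of what scanning a category could look like. It is not taken from write_config.py. It assumes the standard MediaWiki Action API `list=categorymembers` query; the helper names (`category_query_url`, `page_titles`) and the sample response are made up for the example. The HTTP requests themselves, and the step of finding each page's repository URL, are left out.

```python
from urllib.parse import urlencode

# Assumption: the public Action API endpoint of mediawiki.org.
API = "https://www.mediawiki.org/w/api.php"


def category_query_url(category, cont=None):
    """Build the URL for one batch of category members (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    if cont:
        # Continuation token returned by the previous batch, if any.
        params["cmcontinue"] = cont
    return API + "?" + urlencode(params)


def page_titles(response):
    """Extract page titles from a decoded categorymembers response."""
    members = response.get("query", {}).get("categorymembers", [])
    return [member["title"] for member in members]


if __name__ == "__main__":
    print(category_query_url("Category:Stable_extension"))
    # Made-up response, shaped like the API's JSON output.
    sample = {"query": {"categorymembers": [{"title": "Extension:Example"}]}}
    print(page_titles(sample))
```

A real scan would also have to follow the continuation token until the category is exhausted.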

[…] and then has a hard-coded list of additional stuff to index. I'm proposing to change this script so it scans the relevant categories on mediawiki.org instead. […]

Among the additional stuff is a secondary index of https://raw.githubusercontent.com/MWStake/nonwmf-extensions/master/.gitmodules (ref) which is what you propose, but behind one extra anti-abuse step where MWStake has decided to list the repo. Whether to use this or some other mechanism, I think it's worth having something like that in place before feeding straight into Codesearch cloning, tracking and indexing the repo for the next 24 hours.
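
To make the secondary index concrete: a `.gitmodules` file is a series of submodule sections, each with a `path` and a `url` entry. The sketch below pulls the URLs out of such a file. It is only an illustration, not how Codesearch actually reads the MWStake list; the function name `submodule_urls` and the sample entries are hypothetical.

```python
import re

# Matches lines such as "\turl = https://example.org/repo.git".
URL_LINE = re.compile(r"^\s*url\s*=\s*(\S+)\s*$")


def submodule_urls(gitmodules_text):
    """Return the url value of every submodule entry, in file order."""
    urls = []
    for line in gitmodules_text.splitlines():
        match = URL_LINE.match(line)
        if match:
            urls.append(match.group(1))
    return urls


# Made-up sample content; not copied from the MWStake repository.
SAMPLE = '''[submodule "ExampleOne"]
\tpath = ExampleOne
\turl = https://github.com/example/ExampleOne.git
[submodule "ExampleTwo"]
\tpath = ExampleTwo
\turl = https://gitlab.com/example/ExampleTwo.git
'''

if __name__ == "__main__":
    for url in submodule_urls(SAMPLE):
        print(url)
```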

What's the problem this is trying to solve?

(Also we shouldn't embed a specific tool like CodeSearch in a long-term policy if it might be scrapped within a year or so – T268196: Figure out the future of codesearch in a GitLab world.)

In addition to what James and Krinkle said, I would also add that the goal of codesearch is not to index any MediaWiki code ever written that was dumped into a git repo. If the results are filled with unmaintained stuff, then it's not useful for developers, which is already beginning to be a problem: T241320: Allow certain or all GitHub repositories to be excluded from search results.

If someone is interested in this, I think a more actionable proposal would be to scrape the list of extensions, identify which are hosted in non-Gerrit repos, figure out those that are actively maintained, and propose those for inclusion to the MWStake repository.

scrape the list of extensions, identify which are hosted in non-Gerrit repos, figure out those that are actively maintained, and propose those for inclusion to the MWStake repository.

Which might benefit from T237470: Create and maintain a list of organization repos that are maintained on Gerrit, GitHub, and Diffusion first...

What's the problem this is trying to solve?

We want to be considerate of 3rd party extensions when deprecating and removing code. This is only possible if we can find the relevant 3rd party code. In practice, we rely on CodeSearch for this. To make this approach reliable and fair, there need to be clear criteria for inclusion in the search index, and a more-or-less self-service process for getting new extensions on there.

(Also we shouldn't embed a specific tool like CodeSearch in a long-term policy if it might be scrapped within a year or so – T268196: Figure out the future of codesearch in a GitLab world.)

It's not clear to me from that ticket if and why we would want to scrap codesearch. But even if we do, replacing the mention of one tool with another in the policy is not an issue, as long as the criteria for inclusion in the search index remain the same. In that case, it would be a mere editorial change, not requiring consensus building.

If someone is interested in this, I think a more actionable proposal would be to scrape the list of extensions, identify which are hosted in non-Gerrit repos, figure out those that are actively maintained, and propose those for inclusion to the MWStake repository.

My idea was that anything that is listed as "stable" on the wiki should be actively maintained. I agree that this is probably not the case, but the way to fix this (or the result of the activity you propose) would be to clean up the wiki. We could of course have an explicit "MWStake" or "MW ecosystem" status on the wiki as well.

The main point however is that it should be clear how a new extension can get itself included in the index. My idea was to make this lightweight and automatic: if your extension has a page on mediawiki.org and is actively maintained, it gets indexed. If it appears that it's no longer actively maintained, it gets dropped from the category on the wiki, and thus from the index.

Among the additional stuff is a secondary index of https://raw.githubusercontent.com/MWStake/nonwmf-extensions/master/.gitmodules (ref) which is what you propose, but behind one extra anti-abuse step where MWStake has decided to list the repo. Whether to use this or some other mechanism, I think it's worth having something like that in place before feeding straight into Codesearch cloning, tracking and indexing the repo for the next 24 hours.

Ah, I didn't know that list - it's a good start, but the gap between it and what is listed on mediawiki.org is rather big. The MWStake list contains 152 repos, Category:Stable_extension has 697 pages. Either the category urgently needs cleanup, or the process for getting included on the MWStake list needs to be streamlined. Or perhaps the criteria for inclusion are too different? Then that list indeed isn't what I was proposing.

But perhaps we do need some anti-abuse measure. I'm not quite sure about the attack model, and how it would be different from the current situation. I guess you could make a list of private phone numbers or something, and put them in a repo linked from the wiki. Private info gets indexed until someone notices and removes the repo from the index. This can already happen now. The main difference is whether someone has to manually approve a repo for inclusion, but how does that protect us if the "bad" data is pushed after the repo has been approved?

Anyway. We could also have a request process for inclusion in the "ecosystem" (and thus codesearch). Perhaps run by MWStake. Then, the "ecosystem" could be defined as whatever MWStake lists. If there is a clear process for that, which goes reasonably quickly, we could also build the policy on that. Is this process documented somewhere? Who runs it?

The main point however is that it should be clear how a new extension can get itself included in the index. My idea was to make this lightweight and automatic: if your extension has a page on mediawiki.org and is actively maintained, it gets indexed. If it appears that it's no longer actively maintained, it gets dropped from the category on the wiki, and thus from the index.

Making a page on mediawiki.org feels like more effort than sending a pull request to be included on MWStake's extension list. I can imagine arguments for requiring such a page, but being lightweight imho isn't one. We don't even have a nice form to create such a page. What we have is a preload template, which does not provide a good user experience.

Among the additional stuff is a secondary index of https://raw.githubusercontent.com/MWStake/nonwmf-extensions/master/.gitmodules (ref) which is what you propose, but behind one extra anti-abuse step where MWStake has decided to list the repo. Whether to use this or some other mechanism, I think it's worth having something like that in place before feeding straight into Codesearch cloning, tracking and indexing the repo for the next 24 hours.

Ah, I didn't know that list - it's a good start, but the gap between it and what is listed on mediawiki.org is rather big. The MWStake list contains 152 repos, Category:Stable_extension has 697 pages. Either the category urgently needs cleanup, or the process for getting included on the MWStake list needs to be streamlined. Or perhaps the criteria for inclusion are too different? Then that list indeed isn't what I was proposing.

It's apples and oranges. The MWStake list includes non-wmf extensions (or to be more accurate, extensions hosted outside Wikimedia Gerrit). Stable extensions category contains all extensions regardless where they are hosted. Extensions hosted in Gerrit are automatically included in code search, if I am not mistaken.

if your extension has a page on mediawiki.org and is actively maintained, it gets indexed.

I'd say quite a few of those thousands of pages are outdated, and there are no good criteria or even an update process for extension release status.

if your extension has a page on mediawiki.org and is actively maintained, it gets indexed.

I'd say quite a few of those thousands of pages are outdated, and there are no good criteria or even an update process for extension release status.

Yes. The idea is to clean it up, rather than keep it in the useless state it is in now.

Lists or categories that are not used for anything tend to go out of date. The only way to keep them useful is to rely on them for something.

I'm not insisting that we should use the category. However, if we use something else, we should kill the category.

Making a page on mediawiki.org feels like more effort than sending a pull request to be included on MWStake's extension list. I can imagine arguments for requiring such a page, but being lightweight imho isn't one. We don't even have a nice form to create such a page. What we have is a preload template, which does not provide a good user experience.

Ok, so "extensions need to be either on gerrit or listed in the MWStake list" to be part of the "ecosystem" may be an option. That would provide a clear enough way to get one's extension indexed.

It's apples and oranges. The MWStake list includes non-wmf extensions (or to be more accurate, extensions hosted outside Wikimedia Gerrit). Stable extensions category contains all extensions regardless where they are hosted. Extensions hosted in Gerrit are automatically included in code search, if I am not mistaken.

I was hoping we could create a single list of extensions "in the ecosystem", based on a small set of criteria. MWStake seems pretty restrictive, the category as it is now includes a lot of outdated stuff, and there are quite a few unmaintained things on gerrit...

I think telling people to go make a pull request for the list is fine, but I'd like to hear from @Hexmode first.

Making a page on mediawiki.org feels like more effort than sending a pull request to be included on MWStake's extension list. I can imagine arguments for requiring such a page, but being lightweight imho isn't one. We don't even have a nice form to create such a page. What we have is a preload template, which does not provide a good user experience.

Ok, so "extensions need to be either on gerrit or listed in the MWStake list" to be part of the "ecosystem" may be an option. That would provide a clear enough way to get one's extension indexed.

It's not "an option". It's the current reality.

Ok, so "extensions need to be either on gerrit or listed in the MWStake list" to be part of the "ecosystem" may be an option. That would provide a clear enough way to get one's extension indexed.

It's not "an option". It's the current reality.

An option for the definition of "ecosystem" in the policy. Which currently basically is "whatever is indexed in code search".

Ok, so "extensions need to be either on gerrit or listed in the MWStake list" to be part of the "ecosystem" may be an option. That would provide a clear enough way to get one's extension indexed.

It's not "an option". It's the current reality.

An option for the definition of "ecosystem" in the policy. Which currently basically is "whatever is indexed in code search".

Sure. I've rewritten the problem statement so that it is not outright false. I think there's consensus here that scraping MediaWiki pages is a bad model. Is this turning into "document the process more clearly"?

Krinkle renamed this task from Automatically index extensions in CodeSearch to Automatically index extensions in Codesearch. Nov 25 2020, 8:25 PM

Among the additional stuff is a secondary index of https://raw.githubusercontent.com/MWStake/nonwmf-extensions/master/.gitmodules (ref) which is what you propose, but behind one extra anti-abuse step where MWStake has decided to list the repo. Whether to use this or some other mechanism, I think it's worth having something like that in place before feeding straight into Codesearch cloning, tracking and indexing the repo for the next 24 hours.

Ah, I didn't know that list - it's a good start, but the gap between it and what is listed on mediawiki.org is rather big. The MWStake list contains 152 repos, Category:Stable_extension has 697 pages. Either the category urgently needs cleanup, or the process for getting included on the MWStake list needs to be streamlined. Or perhaps the criteria for inclusion are too different? Then that list indeed isn't what I was proposing.

I've been maintaining the list intermittently and this discussion reminds me of @Legoktm's request to separate skins from it.

Your comments also lead me to think about being more rigorous and systematic about what is done there.