Figure out the future of codesearch in a GitLab world
Closed, Resolved (Public)

Description

Once all repositories have moved to GitLab, do we still need codesearch? What features are missing from GitLab CE that codesearch provides, which users want?

This will help codesearch maintainers figure out how much resourcing/effort to put into keeping codesearch running, including potentially moving it into production.

Event Timeline

Once all repositories have moved to GitLab

It would be good to have an estimate of when this is expected to happen.

We don't yet have a timeline for "All". The roadmap has a date associated with step 3 ("Utility project migration") as the end of June '21 (aka, end of this fiscal year). The roadmap will be updated as we go with new details and dates as we have them.

fwiw, I believe that gitlab "CE" has a very limited search feature. We might well still need another solution for decent code search.

Gitlab CE comes with a search system backed by the database ( https://docs.gitlab.com/ee/user/search/#basic-search ). Searches for commits, code or comments are limited to the current project.

Global Search and Advanced Syntax leverage ElasticSearch. I don't know whether that is available in the community edition. There is more reading available at https://docs.gitlab.com/ee/integration/elasticsearch.html which might suggest that part is open source.

https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer/-/issues/54 seems to indicate the ElasticSearch system can be set up freely, but there is no way to enable the integration.

An older task, https://gitlab.com/gitlab-org/gitlab/-/issues/19319 , states that ElasticSearch support got pushed to CE "by mistake" and was rolled back.

So I guess the ElasticSearch integration that would leverage search in files across multiple repositories (what codesearch does) is unavailable in CE.

Note: Links should prefer https://docs.gitlab.com/ce/* to https://docs.gitlab.com/ee/* (CE, not EE, in case content / functionality differs at some point).

Thanks for the details @hashar. It looks like the elasticsearch stuff is close, but not exact. The syntax doesn't appear to support full regex searching which we tend to rely on for, well, advanced searches.

And then we'd also need to figure out how to keep indexing code that isn't maintained on GitLab, though I'd think mirroring it in might be sufficient.

Note: Links should prefer https://docs.gitlab.com/ce/* to https://docs.gitlab.com/ee/* (CE, not EE, in case content / functionality differs at some point).

Thank you for pointing that out, it has always confused me. I have looked at their docs source and on https://gitlab.com/gitlab-org/gitlab-docs/#projects-we-pull-from they state:

Note: Although GitLab Community Edition is generated, it is hidden from the website as it's the same as the Enterprise Edition. We generate it for consistency, until better redirects are implemented.

So the /ce/ and /ee/ URLs end up pointing to the same documentation, with features / paragraphs flagged with the tiers (starter, premium, ultimate) / cloud offers (bronze, silver, gold) they are available in.

@hashar: Hah, I had no idea. Thanks!

I had no idea either about the difference of ce vs ee docs, your comment made me look that up :]

Thanks for the details @hashar. It looks like the elasticsearch stuff is close, but not exact. The syntax doesn't appear to support full regex searching which we tend to rely on for, well, advanced searches.

And then we'd also need to figure out how to keep indexing code that isn't maintained on GitLab, though I'd think mirroring it in might be sufficient.

The ElasticSearch integration is not available in the free GitLab anyway, at least based on the documentation and an issue report flagging that.

Most probably, we would want to keep CodeSearch and have it index repositories hosted on Gitlab. It might be an opportunity to promote the system to production grade, which would open the path to replicating the repositories from Gerrit/Differential/Gitlab to the codesearch host. That might be a bit more effective than the half-hourly polling.

brennen edited projects, added GitLab (Integrations); removed GitLab.

Another use case, during the decade or more we'll be transitioning, is discovery. Codesearch is instrumental in discovery through a central portal, so at the very least for as long as repos exist outside GitLab, all GitLab repos should probably also be indexed in Codesearch.

As an example, the MediaWiki-Docker file in core assigns "MW_LANG", which appears to yield no results from an "Everywhere" query in Codesearch, so one might think it is an unused remnant of something. On GitLab, however, one can find that it is a custom variable specific to the mw-docker install script.

Change 877233 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[labs/codesearch@master] Add support for gitlab.wikimedia.org

https://gerrit.wikimedia.org/r/877233

Change 877233 merged by jenkins-bot:

[labs/codesearch@master] Add support for gitlab.wikimedia.org

https://gerrit.wikimedia.org/r/877233

Change 881657 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[labs/codesearch@master] Retrieve mediawiki release settings from gitlab

https://gerrit.wikimedia.org/r/881657

Change 881657 merged by jenkins-bot:

[labs/codesearch@master] Retrieve mediawiki release settings from gitlab

https://gerrit.wikimedia.org/r/881657

Gitlab code search across repositories is not available to us (it is behind a proprietary license). For reference, the documentation page is https://docs.gitlab.com/ee/user/search/advanced_search.html and it lists this as a premium feature.

https://codesearch.wmcloud.org/ is able to index any git repository regardless of its canonical host (Gerrit, Github, Gitlab). Most of the Gerrit ones are indexed, some of the Gitlab ones as well (iirc).

Almost all of the Gerrit repositories are mirrored to GitHub, which offers its own search system: https://github.com/search?q=org%3Awikimedia%20&type=code . There is a task to investigate whether Gitlab repositories should be mirrored to GitHub as well: T321597. If that is done, they would be covered by GitHub search too and we could rely on it, although that is not ideal since results do not point to the canonical repositories.

If we don't want to rely on GitHub search, then I guess codesearch should index Gitlab repositories.

What would it take (high-level) to do that?

If there is some way to know which repos to index and which not, the change is only a few adjustments to write_config.py (https://gerrit.wikimedia.org/r/plugins/gitiles/labs/codesearch/+/refs/heads/master/write_config.py). Also, it indexes the "master" branch and possibly needs adjusting to index "main" instead.
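A minimal sketch of the branch adjustment, assuming the project records come from the GitLab projects API, whose JSON items include `default_branch`, `path_with_namespace` and `http_url_to_repo`. The `repo_entry` helper and the layout of the dict it returns are hypothetical and are not taken from write_config.py.

```python
def repo_entry(project, fallback_branch="master"):
    """Build an indexing entry for one GitLab project record.

    `project` is a dict shaped like one item of the GitLab projects API.
    The returned layout is hypothetical, not the format write_config.py emits.
    """
    # Prefer the branch GitLab reports; empty repositories report None,
    # in which case we fall back to the historical default.
    branch = project.get("default_branch") or fallback_branch
    return {
        "name": project["path_with_namespace"],
        "url": project["http_url_to_repo"],
        "branch": branch,
    }
```

Reading the branch from the API response rather than hardcoding "main" would cover repositories that still use "master" after being moved.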

I think we would want all of them? In which case, we could use https://docs.gitlab.com/ee/api/repositories.html to get the list.

Something like curl https://gitlab.wikimedia.org/api/v4/groups/186/projects?include_subgroups=true returns the first page of projects under /repos. The link header in the response has the URL for the next page.
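A rough sketch of walking those pages, using only the Python standard library. The function names (`next_page_url`, `http_get`, `iter_projects`) are made up for illustration and do not exist in codesearch; the only assumptions about GitLab are the ones stated above: each page is a JSON list and the `Link` response header carries a `rel="next"` URL until the last page.

```python
import json
import re
import urllib.request

# Matches one `<url>; rel="next"` entry inside a Link header.
_NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')


def next_page_url(link_header):
    """Return the rel="next" URL from a Link header, or None if there is none."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def http_get(url):
    """Fetch one page; returns (decoded JSON body, Link header or None)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response), response.headers.get("Link")


def iter_projects(first_url, fetch=http_get):
    """Yield every project dict, following Link headers page by page."""
    url = first_url
    while url:
        projects, link_header = fetch(url)
        yield from projects
        url = next_page_url(link_header)
```

Calling `iter_projects("https://gitlab.wikimedia.org/api/v4/groups/186/projects?include_subgroups=true")` would then yield one dict per project. The `fetch` parameter is only there so the loop can be exercised without network access.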

I think we would want all of them? In which case, we could use https://docs.gitlab.com/ee/api/repositories.html to get the list.

Are we sure we want to scan every gitlab repo? Stuff like toolforge repos, debian repos, upstream repos we have vendored are a lot. That was sorta why (I think) we didn't index all of the wikimedia org on github.com. It'll drown out the actual code.

Are we sure we want to scan every gitlab repo? Stuff like toolforge repos, debian repos, upstream repos we have vendored are a lot.

I don't have a strong opinion here, but everything under /repos might be a reasonable place to start if we do want to keep noise down. (Edit: I see Ahmon suggests basically that above.)

I'm happy with it. My minor concern is the debs repos in SRE, which IIRC should be under /repos/sre (I lost track); we could find a way to exclude them though. I also think @Legoktm might have an opinion about this, as the main author and maintainer of the tool. Otherwise, feel free to make a patch to write_config.py and I'd be happy to review it.
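One way the "start with /repos, minus some exclusions" idea could look, as a sketch only. `select_paths` is a hypothetical helper, and the exclusion prefix in the usage note is a placeholder, since the thread does not pin down where the debs repos actually live.

```python
def select_paths(paths, include_prefix="repos/", exclude_prefixes=()):
    """Keep paths under include_prefix that match none of exclude_prefixes.

    `paths` are GitLab `path_with_namespace` strings (no leading slash).
    """
    # str.startswith accepts a tuple and returns False for an empty one.
    excluded = tuple(exclude_prefixes)
    return [
        path
        for path in paths
        if path.startswith(include_prefix) and not path.startswith(excluded)
    ]
```

For instance, `select_paths(all_paths, exclude_prefixes=["repos/sre/debs/"])` would keep everything under /repos except that (hypothetical) subtree.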

hashar claimed this task.

I am boldly marking this resolved:

  • the task is 3 years old
  • per my previous comment ( T268196#6637264 ) GitLab CE does not offer a global search
  • the last comments are about implementation details

In case someone is looking at having our Code Search index repositories hosted in GitLab, see T371992 / https://gerrit.wikimedia.org/r/c/labs/codesearch/+/1060493