
Allow search engines to crawl code hosted by phabricator
Closed, Declined · Public

Description

Developers greatly appreciate being able to search commit history (and ideally also code review comments). In [[Wikimedia technical search]] and similar tools, we are currently forced to rely on external mirrors (like gmane.org) because git.wikimedia.org has never been robust enough.

I see that some Diffusion URLs are currently indexed by search engines, e.g. https://phabricator.wikimedia.org/rMW23fab68274456f796563a5eac2ab70cb307afe1a , perhaps because it was linked from some task. However, robots.txt disallows crawling of /diffusion/, which makes the indexing spotty and inconsistent. Can we try removing that rule, or find some other way to ensure the main Diffusion URLs are consistently indexed?
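
For context, the blocking stanza in question is presumably along these lines (a reconstruction for illustration, not a verbatim copy; the live file is at https://phabricator.wikimedia.org/robots.txt):

User-Agent: *
Disallow: /diffusion/
Crawl-delay: 1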

Event Timeline

Nemo_bis raised the priority of this task from to Needs Triage.
Nemo_bis updated the task description. (Show Details)
Nemo_bis added a project: Phabricator.
Nemo_bis changed Security from none to None.
Nemo_bis subscribed.
Aklapper triaged this task as Lowest priority. Dec 8 2014, 11:20 AM

I have no idea what "Needs Volunteer" means for such a task.

It means that a patch is welcome. Then again, if @Chad or @Jdforrester-WMF want to increase the priority, this is their area.

A patch is not a way to do load testing. Chad, can you confirm that no load testing is needed and that a patch is all that's required?

Nemo_bis renamed this task from "Allow search engines to crawl diffusion/" to "Allow search engines to crawl diffusion/ and other diffusion pages". Jan 18 2015, 12:30 PM

Our Phabricator server is fairly well equipped; it should handle the load, especially since search engines generally try not to overload the sites they index.

It seems there is fairly good reason for the diffusion exclusion in robots.txt: see https://secure.phabricator.com/T4610

> It seems there is fairly good reason for the diffusion exclusion in robots.txt: see https://secure.phabricator.com/T4610

Sure, there are good reasons to have *some* robots.txt, just not the current one. From the upstream report, the problems may be:

  • file downloads (e.g. ?view=raw, ?diff=1, phab.wmfusercontent.org/file/data/* ?),
  • history for individual files (as in the history page, or somewhere else? I see no git stuff there).

According to the task description, we're mainly interested in indexing pages such as https://phabricator.wikimedia.org/diffusion/MW/history/master/ so that all the commits can be reached. There is probably other stuff though.

Maybe we could try something like:

User-Agent: *
Allow: /diffusion/
Disallow: /diffusion/*?view=raw
Disallow: /diffusion/*?diff=1
Disallow: /diffusion/*/diff/
# Try and block history with multiple slashes?
# Disallow: /diffusion/*/history/*/*
Crawl-delay: 1

A single wildcard works at least for Google; I still need to check whether the triple wildcard does.
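
To sanity-check the matching before changing anything, here is a small Python sketch that translates the proposed patterns into regexes and reports which URLs they would block. It assumes Google-style resolution (the longest matching pattern wins, Allow beats Disallow on ties); the rule list mirrors the proposal above, and the test paths are hypothetical examples, not real repository URLs:

import re

# The proposed rules from the robots.txt sketch above, in order.
RULES = [
    ("Allow", "/diffusion/"),
    ("Disallow", "/diffusion/*?view=raw"),
    ("Disallow", "/diffusion/*?diff=1"),
    ("Disallow", "/diffusion/*/diff/"),
]

def pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the
    # pattern at the end of the URL; everything else is literal.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_allowed(path):
    # Collect every rule whose pattern matches the URL path,
    # then let the longest pattern win; Allow wins ties.
    matches = [(directive, pat) for directive, pat in RULES
               if pattern_to_regex(pat).match(path)]
    if not matches:
        return True  # no rule applies, so crawling is allowed
    matches.sort(key=lambda m: (len(m[1]), m[0] == "Allow"))
    return matches[-1][0] == "Allow"

for path in ("/diffusion/MW/history/master/",
             "/diffusion/MW/browse/master/README?view=raw",
             "/diffusion/MW/diff/12345/"):
    print(path, "->", "allowed" if is_allowed(path) else "blocked")

Under those semantics the history page stays crawlable while the raw-file and diff URLs are blocked, which is the behaviour we're after.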

@Nemo_bis: I'm willing to try it, though epriestley had some suggestions in the upstream ticket (re-read the comment thread; there is new discussion as of today) which we might also want to consider. Specifically: indexing our PHP code with Diviner, which would generate nicely formatted, search-engine-friendly code documentation.

From upstream discussion:

Very good suggestions. I will look into running Diviner on our codebase to see how that looks; it might be as good as or better than what I was imagining in my previous comment. And Phabricator code search is definitely a nice and desirable feature, probably a lot more valuable than search-engine-friendly Diffusion.

greg raised the priority of this task from Lowest to Medium. Dec 2 2015, 7:04 PM
greg moved this task from To Triage to Backlog on the Gitblit-Deprecate board.
greg lowered the priority of this task from Medium to Lowest. Dec 2 2015, 7:07 PM

(bad drag/drop on the workboard)

Does gitblit allow indexing now?

> Does gitblit allow indexing now?

Nope. And we've had problems with misbehaving bots before and had to ban them by IP or user agent.

That's more a reflection of Gitblit's terrible performance at our scale than of any desire not to let the code be indexed.

mmodell renamed this task from "Allow search engines to crawl diffusion/ and other diffusion pages" to "Allow search engines to crawl code hosted by phabricator". Jun 9 2016, 7:53 PM

I propose declining this, as the plan is not to host any code canonically on Phabricator, per T191182.
Crawlers could index GitLab and Gerrit instead, and users following crawler results would land closer to the venue where they could contribute code changes.