
Allow search engines to crawl code hosted by phabricator
Open, LowestPublic

Description

The ability to search commit history (and ideally also code review comments) is greatly appreciated by developers. For [[Wikimedia technical search]] and similar purposes, we're currently forced to use external mirrors (like gmane.org) because git.wikimedia.org has never been robust enough.

I see that right now some diffusion URLs are indexed by search engines, e.g. https://phabricator.wikimedia.org/rMW23fab68274456f796563a5eac2ab70cb307afe1a , perhaps because it was linked by some task. However, robots.txt disallows crawling of /diffusion/, which makes the indexing spotty and inconsistent. Can we try removing the rule, or figure out something so that the main diffusion URLs are consistently indexed?

Related Objects

Event Timeline

Nemo_bis created this task.Dec 8 2014, 8:04 AM
Nemo_bis updated the task description. (Show Details)
Nemo_bis raised the priority of this task from to Needs Triage.
Nemo_bis added a project: Phabricator.
Nemo_bis changed Security from none to None.
Nemo_bis added a subscriber: Nemo_bis.
Qgil edited projects, added Gitblit-Deprecate; removed Phabricator.Dec 8 2014, 10:44 AM
Aklapper triaged this task as Lowest priority.Dec 8 2014, 11:20 AM
demon awarded a token.Dec 8 2014, 6:11 PM

I have no idea what "Needs Volunteer" means for such a task.

It means that a patch is welcome. Then again, if @Chad or @Jdforrester-WMF want to raise the priority, this is their area.

A patch is not a way to do load testing. Chad, can you confirm that no load testing is needed and that a patch is all that's required?

Nemo_bis renamed this task from Allow search engines to crawl diffusion/ to Allow search engines to crawl diffusion/ and other diffusion pages.Jan 18 2015, 12:30 PM

Our Phabricator server is fairly well equipped; it should handle the load, especially since search engines usually try not to overload the sites they index.

It seems there is fairly good reason for the diffusion exclusion in robots.txt: see https://secure.phabricator.com/T4610

> It seems there is fairly good reason for the diffusion exclusion in robots.txt: see https://secure.phabricator.com/T4610

Sure, there are good reasons to have *some* robots.txt, just not the current one. From the upstream report, problems may be:

  • file downloads (e.g. ?view=raw, ?diff=1, phab.wmfusercontent.org/file/data/* ?);
  • history for individual files (as in the history page, or something else? I see no git operations there).

According to the task description, we're mainly interested in indexing pages such as https://phabricator.wikimedia.org/diffusion/MW/history/master/ so that all the commits can be reached. There is probably other stuff though.

Maybe we could try something like:

User-Agent: *
Allow: /diffusion/
Disallow: /diffusion/*?view=raw
Disallow: /diffusion/*?diff=1
Disallow: /diffusion/*/diff/
# Try and block history with multiple slashes?
# Disallow: /diffusion/*/history/*/*
Crawl-delay: 1

A single wildcard works at least for Google; I still need to check whether the triple wildcard does.
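As a sanity check on the proposed rules, here is a small Python sketch (hypothetical helper names, not part of any Phabricator tooling) implementing Google-style robots.txt matching, where `*` matches any run of characters and the longest matching rule wins. It tests which Diffusion URLs would remain crawlable:

```python
import re

# Proposed rules from the comment above: (allow?, path pattern)
RULES = [
    (True,  "/diffusion/"),
    (False, "/diffusion/*?view=raw"),
    (False, "/diffusion/*?diff=1"),
    (False, "/diffusion/*/diff/"),
]

def _to_regex(pattern):
    # Google-style matching: '*' matches any character sequence;
    # a rule matches any URL path it matches as a prefix.
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))

def allowed(path):
    # Longest matching rule wins (Google's documented tie-breaking);
    # with no matching rule at all, crawling is allowed by default.
    verdict, best = True, -1
    for allow, pattern in RULES:
        if _to_regex(pattern).match(path) and len(pattern) > best:
            verdict, best = allow, len(pattern)
    return verdict

print(allowed("/diffusion/MW/history/master/"))           # True
print(allowed("/diffusion/MW/browse/master/f?view=raw"))  # False
print(allowed("/diffusion/MW/diff/123/"))                 # False
```

Under these semantics the history pages we care about stay indexable while raw-file and diff URLs are excluded; crawlers that only do plain prefix matching would ignore the wildcard rules, which is why the Google behaviour needs checking separately.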

@Nemo_bis: I'm willing to try it, though epriestley had some suggestions in the upstream ticket (re-read the comment thread, there is new discussion as of today) which we might also want to consider. Specifically: indexing our PHP code with Diviner, which would generate nicely formatted, search-engine-friendly code documentation.

From upstream discussion:

Very good suggestions; I will look into running Diviner on our codebase to see how that looks. It might be as good as or better than what I was imagining in my previous comment. And Phabricator code search is definitely a nice and desirable feature, probably a lot more valuable than search-engine-friendly Diffusion.

greg moved this task from To Triage to Backlog on the Gitblit-Deprecate board.Dec 2 2015, 7:04 PM
greg raised the priority of this task from Lowest to Normal.
greg lowered the priority of this task from Normal to Lowest.Dec 2 2015, 7:07 PM

(bad drag/drop on the workboard)

Qgil removed a subscriber: Qgil.Feb 15 2016, 10:59 AM
greg added a comment.Jun 9 2016, 6:50 PM

Does gitblit allow indexing now?

demon added a comment.Jun 9 2016, 6:57 PM

> Does gitblit allow indexing now?

Nope. And we've had problems with misbehaving bots and had to IP or UA ban them before.

That's more a reflection of Gitblit's terrible performance at our scale than a desire to keep it from being indexed.

Per above, adjusting projects.

Paladox added a subscriber: Paladox.Jun 9 2016, 7:04 PM
mmodell renamed this task from Allow search engines to crawl diffusion/ and other diffusion pages to Allow search engines to crawl code hosted by phabricator.Jun 9 2016, 7:53 PM