
Not all Phabricator tasks are crawled by search engines
Closed, InvalidPublic

Description

Currently, it looks like a good portion of Maniphest tasks are already indexed by search engines, which is great (I'll soon be able to update the [[Wikimedia technical search]]). However, I don't see the estimated number increasing beyond 30k or 40k.

This might be explained by the JavaScript-heavy nature of Phabricator and the fact that from extension pages we've only linked tags, which are partial lists: I'm not sure search engines are able to follow all the links they should. Do we need a sitemap?

Upstream: https://secure.phabricator.com/T7610

Workaround: https://www.mediawiki.org/wiki/Phabricator/Task_sitemap (as of creation, Wayback has 1300 tasks archived).
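
As a minimal sketch of the sitemap idea (assuming task IDs are sequential and that the maximum ID is known; both are assumptions, not confirmed anywhere in this task), a generator of the standard sitemap XML could look like this in Python:

  # sitemap_tasks.py: hypothetical generator of a sitemap for Phabricator tasks.
  # Assumes task IDs run from T1 to T<MAX_TASK_ID> with no relevant gaps;
  # protected tasks would be listed too, but crawlers simply cannot fetch them.
  MAX_TASK_ID = 80000  # placeholder; the real upper bound would have to be looked up

  def task_urls():
      for n in range(1, MAX_TASK_ID + 1):
          yield "https://phabricator.wikimedia.org/T%d" % n

  def write_sitemap(path="sitemap.xml"):
      with open(path, "w") as f:
          f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
          f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
          for url in task_urls():
              f.write("  <url><loc>%s</loc></url>\n" % url)
          f.write("</urlset>\n")

  if __name__ == "__main__":
      write_sitemap()

(The sitemap protocol caps a single file at 50,000 URLs, so a real version would need to split the output into several files behind a sitemap index.)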

Event Timeline

Nemo_bis raised the priority of this task from to Needs Triage.
Nemo_bis updated the task description.
Nemo_bis added a project: Phabricator.
Nemo_bis changed Security from none to None.
Nemo_bis added a subscriber: Nemo_bis.
Aklapper triaged this task as Lowest priority. Dec 8 2014, 11:11 AM

Well, how can we check this? (I honestly wonder.)

Tangentially, why was old-bugzilla disallowing everything? https://old-bugzilla.wikimedia.org/robots.txt (which led to wikibugs-l mirrors being the only Google search results... >.< )
I assume it had something to do with security. Does this need to be tested/tasked for Phabricator?
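
For what it's worth, whether a given robots.txt blocks a crawler can be checked mechanically. A small sketch using only Python's standard library (the URL is the one quoted above; a "Disallow: /" rule for all agents would make this print False for every path):

  # robots_check.py: test what a robots.txt allows a generic crawler to fetch.
  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser()
  rp.set_url("https://old-bugzilla.wikimedia.org/robots.txt")
  rp.read()  # fetch and parse the robots.txt
  print(rp.can_fetch("*", "https://old-bugzilla.wikimedia.org/show_bug.cgi?id=1"))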

Phabricator is incredibly search-engine-friendly. The number of tasks that they index is more a function of how important the search engines think our content is. Google will likely index everything, though they might detect a lot of it as duplicates because of the mailing list mirrors, which carry the original text from the old setup where bug reports went Bugzilla -> mailing list -> list archive -> search index.

Search engines want to index everything of value on the internet. Duplicates of the same content are not considered very valuable.

Do we need a sitemap?

Search engines should eventually re-crawl all the web archives of wikibugs-l. Since even http://bugzilla.wikimedia.org/robots.txt redirects to phabricator.wikimedia.org, they should follow the redirects from Bugzilla to the good Phabricator "TNNN" URLs.
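
One way to verify that claim is to follow the redirect chain from an old Bugzilla URL; a sketch using the third-party requests library (the bug ID is an arbitrary example):

  # redirect_check.py: follow redirects from an old Bugzilla URL.
  import requests

  r = requests.head("https://bugzilla.wikimedia.org/show_bug.cgi?id=12345",
                    allow_redirects=True, timeout=10)
  for hop in r.history:
      print(hop.status_code, hop.url)
  # If the redirects are in place, the final URL should be a
  # phabricator.wikimedia.org/TNNN page.
  print("final:", r.status_code, r.url)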

Tangentially, why was old-bugzilla disallowing everything?

That's T15881, which was fixed then regressed (and now with Phabricator it's correctly Declined). I couldn't find a security rationale for it; if a search engine can crawl to a security risk then telling the world "don't do that" isn't addressing the problem :)

Search engines should eventually re-crawl all the web archives of wikibugs-l

wikibugs-l is not archived and additionally lists.wikimedia.org prevents all crawling (which btw is catastrophically bad, but off topic here). Which "mirror" archive do you have in mind, and what makes you think search engines follow its links?

That's T15881, which was fixed then regressed (and now with Phabricator it's correctly Declined). I couldn't find a security rationale for it;

Merely a performance issue, IIRC. At some point kaulen was overloaded and changing robots.txt felt easier than actually looking into the cause.

Using Google as a reference:

  • Searching for "Ensure that all phabricator tasks are crawled by search engines": "No results found for "Ensure that all phabricator tasks are crawled by search engines". Results for Ensure that all phabricator tasks are crawled by search engines (without quotes):"
  • Searching for "Re-juggle shard and replica counts in Elasticsearch" (an older task, not migrated from Bugzilla): correct result found.
  • Searching for "optionWidget with check icon should not appear for invalid entries in the category input field" (a Bugzilla report migrated): Bugzilla URL appears first, Phabricator task right after. Good.

Haven't tried others, but your help running these tests is welcome.

All in all it looks like they are catching up.

Qgil claimed this task.
In T76991#841031, @Qgil wrote:

Using Google as a reference:

  • Searching for "Ensure that all phabricator tasks are crawled by search engines": "No results found for "Ensure that all phabricator tasks are crawled by search engines". Results for Ensure that all phabricator tasks are crawled by search engines (without quotes):"

Now the top result points here.

I'm closing this task as Resolved. If someone finds anything in our instance impeding crawling by search engines, please reopen.

I find no evidence that the crawl is complete: all numbers I see are in the tens of thousands at best, and I keep not finding reports that I know exist.

I reopened https://secure.phabricator.com/T7610

Just search Google for the task ID: T102920, for example, is indexed.

Well, and I find no evidence that the crawl is NOT complete.
If you have specific test cases, please provide them...

Aklapper changed the task status from Open to Stalled. Aug 1 2015, 5:47 PM

Assigning to Nemo per the last comment; feel free to reset after answering.

Nemo_bis changed the task status from Stalled to Open (edited). Aug 12 2015, 10:29 AM

Just search Google for the task ID: T102920, for example, is indexed

I'm not going to perform 100k Google searches; that would take ages with throttling. However, searching for any random number in the T79XXX range or higher (including T842XX) with "site:phabricator.wikimedia.org" finds nothing.

Thanks a lot for clarifying! I can reproduce this, indeed.

Aklapper renamed this task from Ensure that all phabricator tasks are crawled by search engines to Not all Phabricator tasks are crawled by search engines. Aug 12 2015, 11:39 AM

But searching for "phabricator T842" works for me.

It's not about T842 but about T842** numbers, like T84213.

@Aklapper: T84213 is protected, of course it isn't indexed.

Every example I've seen so far is either intentionally hidden by security policy or not reproducible.
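
A rough way to separate the two cases in bulk, sketched in Python under the assumption (not verified here) that protected tasks do not answer HTTP 200 to anonymous requests; if Phabricator instead serves a login page with a 200 status, the heuristic would need to inspect the response body:

  # visibility_check.py: rough test of whether tasks are anonymously visible.
  import requests

  def looks_public(task_id):
      # Assumption: a protected or missing task returns a non-200 status
      # to a logged-out client.
      r = requests.get("https://phabricator.wikimedia.org/T%d" % task_id,
                       timeout=10)
      return r.status_code == 200

  for tid in (842, 84213, 102920):  # examples mentioned in this discussion
      print("T%d" % tid, looks_public(tid))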

The funny thing with Google is that searching for "phabricator T105000" works for any valid task number, whereas searching for "phabricator wikimedia T105000" is far less reliable, and many times the same tasks are not found. "Wikimedia" is in the title of the pages and you can see it transcribed in the results, so it is really strange.

Google doesn't treat "wikimedia" as a distinct term: they conflate "wikimedia" and "wikipedia", which essentially makes "wikimedia" almost a generic term for the purposes of Google's algorithm.

T84213 is protected, of course it isn't indexed.

Ah, true that. I realize that range of task IDs corresponds to the tickets imported from RT (restricted access by default).

Which might make this task invalid.

Aklapper claimed this task.

I realize that range of task IDs corresponds to the tickets imported from RT (restricted access by default).
Which might make this task invalid.

So I'm closing this. If there are examples of public tasks not being crawled, please bring them up and reopen. Thanks!

This will presumably get worse with T110710. What avenues of access are left for crawlers to find all tasks?

Crawlers should find all public tasks (as in https://phabricator.wikimedia.org/T987654321 ).
If they do not, please provide a testcase.

@Nemo_bis: only workboards are blocked; crawlers can still find all of the projects, and from there they can find all tasks via the projects' task lists. Additionally, there is the master Maniphest task list at https://phabricator.wikimedia.org/maniphest/
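
A sketch of that discovery path, pulling TNNN links straight out of the master list with requests and a regular expression (pagination is ignored, and the exact markup of the page is an assumption):

  # discover_tasks.py: find task links on the master Maniphest list page.
  import re
  import requests

  html = requests.get("https://phabricator.wikimedia.org/maniphest/",
                      timeout=10).text
  # Task links look like /T12345; collect the distinct IDs.
  task_ids = sorted({int(m) for m in re.findall(r"/T(\d+)\b", html)})
  print(len(task_ids), "task links found; first few:", task_ids[:5])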