Pages with apostrophe character in their title aren't indexed by external search engines
Closed, Duplicate (Public)

Description

New pages with an apostrophe character in their title aren't indexed by external web search engines. Searching the web for a random sentence sampled from a page whose title contains the ' character yields no results. The examples below aren't too fresh, so they are old enough for search engines to have indexed them.

Old pages (about half a year old) containing an apostrophe character in their title are indexed, so this is probably a regression from the last month or so. I'm not sure whether it is a MediaWiki issue or a search engine issue (cross-posts: see below).

Examples
English Wikipedia examples:

Hebrew Wikipedia examples:

Debug info
I'm not sure it is a MW issue, but I tried to check the following options:

  • probably not an issue with the dumps script, e.g. bzcat DUMPFILE | grep _NEW_PAGE_WITH_APOSTROPHE
  • probably not an issue with API (there are no complaints on other issues that this would cause)
  • probably not an issue with the new pages Atom feed
  • no relevant exclusion in robots.txt (if I didn't miss anything)
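
The robots.txt check in the list above can be repeated mechanically. What follows is only a rough sketch using Python's standard urllib.robotparser: the helper names title_url_variants and blocked_variants are made up for illustration, and the rules passed in are whatever robots.txt lines you paste or download yourself. It also builds both URL forms of a title (literal apostrophe and %27), since those are the two spellings a crawler might meet.

```python
from urllib.parse import quote
from urllib.robotparser import RobotFileParser


def title_url_variants(base, title):
    """Return (literal, encoded) URL forms for a page title.

    Hypothetical helper: 'base' is something like
    "https://en.wikipedia.org/wiki/" and is used unchanged.
    """
    path = title.replace(" ", "_")
    # Keeping "'" in the safe set leaves the apostrophe literal.
    literal = base + quote(path, safe="/'")
    # Without it, quote() percent-encodes the apostrophe as %27.
    encoded = base + quote(path, safe="/")
    return literal, encoded


def blocked_variants(robots_lines, base, title, agent="*"):
    """List the URL variants that the given robots.txt lines disallow."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory, no network access
    return [
        url
        for url in title_url_variants(base, title)
        if not parser.can_fetch(agent, url)
    ]
```

An empty list from blocked_variants only says that these rules do not exclude the page; it says nothing about why a search engine skipped it.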

See also

Event Timeline

eranroz raised the priority of this task from to High.
eranroz updated the task description.
eranroz added subscribers: eranroz, Amire80.

@eranroz: did you check whether a double quote ( " ) character in the article name causes the same issue?
At least on hewiki, double quotes are somewhat common in article names.
(I would check myself, but could not devise a query for "recent article with quotes in name".)

peace.

Try "אה"מ פרינס אוף ויילס (1860)", @Kipod. It works well, I think.

It seems that the issue started (at least in hewiki) on Jul 22, perhaps around noon (local time):

Following a query of recent hewiki articles with an apostrophe in their name (thanks, IKhitron), I tried to see where the line is drawn. It seems that articles up to אנג'לה פלזנס and ג'אנקרלו דה סיסטי appear in searches, whereas אניה רבינוביץ', ג'לאל אבו טועמה and newer ones don't seem to appear in search results.

On which other wikis do we have this problem? We seem to have it on enwiki as well. From which time, exactly? Is it the same time as hewiki? How can I find out? @IKhitron, can you provide a similar query for enwiki?
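
One possible way to get such a list for any wiki is the MediaWiki API's recent changes list, restricted to page creations. This is a sketch, not the query IKhitron used: new_pages_query_url and titles_containing are hypothetical helper names, the api_base argument (for example https://en.wikipedia.org/w/api.php) is supplied by the caller, and recent changes only reach back a limited time, so older creations would need another source.

```python
from urllib.parse import urlencode


def new_pages_query_url(api_base, limit=50):
    """Build an API URL listing recently created main-namespace pages."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "new",            # page creations only
        "rcnamespace": 0,           # main (article) namespace
        "rcprop": "title|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def titles_containing(response, needle="'"):
    """Pick (timestamp, title) pairs whose title contains 'needle'.

    'response' is the decoded JSON of the query above.
    """
    changes = response.get("query", {}).get("recentchanges", [])
    return [
        (change["timestamp"], change["title"])
        for change in changes
        if needle in change["title"]
    ]
```

Passing needle='"' instead of the default gives the double-quote variant of the same list.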

Thanks. The cut-off here seems to be at around Jul-10 or slightly later. I'm not completely sure.

It seems that Judo at the 2015 Pan American Games – Men's 60 kg (page34) is not found whereas John R. O'Dea is found. I'm not entirely sure about the articles between them.

If that is indeed the case, we have different cut-off times for enwiki and hewiki.
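
To compare cut-offs between wikis less by eye, the manual observations can be run through something like the sketch below. It assumes you have noted, per article, its creation timestamp (in the UTC format the MediaWiki API returns, e.g. 2015-07-10T12:00:00Z), its title, and whether a web search found it. find_cutoff is a hypothetical helper, not an existing tool.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def find_cutoff(pages):
    """Locate the boundary between indexed and non-indexed pages.

    pages: iterable of (created, title, indexed) tuples.
    Returns (last_found, first_missing, clean), where the first two are
    (created, title) pairs or None, and 'clean' is False when some
    indexed page is newer than the oldest non-indexed one.
    """
    ordered = sorted(pages, key=lambda p: datetime.strptime(p[0], TS_FORMAT))
    last_found = None
    first_missing = None
    for created, title, indexed in ordered:
        if indexed:
            last_found = (created, title)       # newest indexed page so far
        elif first_missing is None:
            first_missing = (created, title)    # oldest non-indexed page
    clean = (
        last_found is None
        or first_missing is None
        or datetime.strptime(last_found[0], TS_FORMAT)
        < datetime.strptime(first_missing[0], TS_FORMAT)
    )
    return last_found, first_missing, clean
```

If clean comes back False, the two groups overlap in time and a single cut-off date may not describe the data well.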

Tzafrir, based on this data you can search for related changes in git (or, if you prefer a web interface: http://git.wikimedia.org/ ).
I think the relevant repositories are mediawiki/core and operations. Keep in mind that the relevant commit can be made a few weeks before it gets to production.

Could this be caused by the same change as T106793?

@matmarex, I don't have Opera12. Can you test the articles on hewiki and enwiki (those that seem to have the problem and those that don't) and see whether they show a redirect loop with Opera12?

Some do, some don't; I couldn't find a way to trigger it reliably. I'm just thinking this could be caused by 155d555b83eca6403e07d2094b074a8ed2f301ae, like that bug. The time it was committed mostly matches your investigation.