Investigate and Resolve Bing Crawling Invalid Wikipedia URLs
Open, Needs Triage | Public | Bug Report

Description

Steps to replicate the issue (include links if applicable):
Go to Bing Webmaster Tools and view the canonical link filter

Observe that Bing is crawling Wikipedia URLs with the following pattern:

https://en.wikipedia.org/?title=ann_cooper_hewitt&redirect=no

What happens?:

These URLs are generating "not indexed" warnings.

What should have happened instead?:
Bing should not be crawling bogus URLs.

Other information (browser name/version, screenshots, etc.):

We tried to understand why Bing is crawling URLs with the ?title= parameter, particularly those including &redirect=no, but because Bing's crawling logic is proprietary, the cause wasn't clear.

Bing is properly indexing the canonical article, but it might also be crawling lowercase versions of the canonical title even though those pages carry a noindex tag.

Proposal: drop the canonical link from the HTML of these pages to see if this resolves the issue, as Bing might follow canonical links even on 404s.

Next Steps:

  • Remove the canonical URL from noindex pages in core (possibly here).
  • Monitor the changes to see if this resolves the issue with Bing crawling invalid URLs.

Event Timeline

@NBaca-WMF This looks like a continuation of the parent issue. Can the web team continue to do the investigation and resolution? We are available to help with context and any possible reviews for changes.

I don't quite understand what's going on here. A page exists at https://en.wikipedia.org/wiki/Ann_Cooper_Hewitt. The URL being hit here is not a lowercase version of that URL. It's a rather awkward, non-standard way to ask for article content; I didn't even know that works.

I don't see the connection to noindex - Ann_Cooper_Hewitt doesn't have a noindex marker.

The error page returned by https://en.wikipedia.org/?title=ann_cooper_hewitt&redirect=no is marked as noindex, and it also points to a canonical URL, https://en.wikipedia.org/wiki/Ann_cooper_hewitt, which of course also doesn't exist and again has a noindex marker.
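To make the title mismatch concrete, here is a rough sketch of how the canonical URL seems to be derived from the ?title= request. It assumes the wiki uppercases only the first character of the title and leaves the rest as-is, which matches the canonical URL quoted above; the function name canonical_for is made up for illustration and is not MediaWiki code.

```python
from urllib.parse import urlsplit, parse_qs


def canonical_for(url):
    """Guess the canonical URL emitted for an index-style ?title= request.

    Assumption: only the first character of the title is uppercased;
    everything after it keeps the case the crawler sent.
    """
    parts = urlsplit(url)
    title = parse_qs(parts.query).get("title", [""])[0]
    if not title:
        return None
    title = title.replace(" ", "_")
    title = title[0].upper() + title[1:]
    return f"{parts.scheme}://{parts.netloc}/wiki/{title}"


crawled = "https://en.wikipedia.org/?title=ann_cooper_hewitt&redirect=no"
existing = "https://en.wikipedia.org/wiki/Ann_Cooper_Hewitt"

guess = canonical_for(crawled)
print(guess)
print("matches existing article:", guess == existing)
```

Under that assumption the all-lowercase title can never resolve to the existing article, because the "C" and "H" stay lowercase.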

But that doesn't explain why Bing is hitting https://en.wikipedia.org/?title=ann_cooper_hewitt&redirect=no in the first place.

@daniel

We suspect the issue lies with Bing. The URL https://en.wikipedia.org/?title=wwe_hall_of_fame_(2024)&redirect=no results in a 404 error, yet the HTML includes <link rel="canonical" href="https://en.wikipedia.org/wiki/Wwe_hall_of_fame_(2024)">. We recommend removing the canonical link from the HTML of bogus pages to see if this resolves the problem.
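For anyone who wants to check other URLs from the Bing report, a minimal sketch of pulling the two relevant signals (the robots noindex directive and the canonical link) out of a page's HTML might look like the following. The class and function names are hypothetical, and the sample markup is hand-written for illustration rather than fetched from the live site; only the canonical link in it is taken from the comment above.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect robots meta directives and the canonical link from HTML."""

    def __init__(self):
        super().__init__()
        self.robots = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            content = a.get("content") or ""
            self.robots += [d.strip().lower() for d in content.split(",")]
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")


def head_signals(html):
    """Return whether the page is noindex and which canonical URL it declares."""
    parser = HeadSignals()
    parser.feed(html)
    return {"noindex": "noindex" in parser.robots, "canonical": parser.canonical}


# Hypothetical sample: a stripped-down version of what the error page might contain.
sample = (
    '<html><head>'
    '<meta name="robots" content="noindex,nofollow">'
    '<link rel="canonical" href="https://en.wikipedia.org/wiki/Wwe_hall_of_fame_(2024)">'
    '</head><body></body></html>'
)
print(head_signals(sample))
```

A page that reports noindex together with a canonical link is the combination the proposed change would eliminate.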

It's possible that Bing's logic follows canonical links even when they lead to 404 errors. The exact method by which these URLs are being crawled is unclear (and there's a chance it's proprietary), but there might be a page generating these links in the HTML under the assumption that they exist.

@Jdlrobson Since you suggested that we file this request with MW Platform team, do you mind chiming in on that?

Nat says this is not a priority right now but we will get back to you when we have bandwidth.