
[SPIKE] Investigate Bing Console warning about canonical URLs
Closed, Resolved · Public · 2 Estimated Story Points · Spike

Description

Summary
We recently received the warning below in an email from Bing Webmaster Tools: "Large number of pages pointing to the same canonical URL for domain".

image (45).png (1×1 px, 627 KB)

We should investigate whether this is an issue and what we would need to do to resolve it, and create a separate ticket for remediation if necessary.

Parameters
As part of this ticket, we should log into and utilize Bing Webmaster Tools (specifically this url) to investigate:

  1. Why the domain is "http://wikipedia.org/" and not "https"
  2. Why this warning references the top-level domain
  3. Why we have, as the warning states, a large number of pages pointing to this domain, and whether this is a problem

Nat will grant Bing webmaster read/write access to all devs on this team.

Acceptance Criteria
This ticket is ultimately just to understand whether this is a problem and how big a problem it is. It is not to fix the issue or to dig into it in detail.

  • We have conducted the above exploration in the console and identified the root cause
  • [x] If found: we have created a separate ticket for remediation of said issue
  • [x] If not: we have reached out to our Bing partnership contact to set up a conversation to better understand this warning and its implications (n/a)

Event Timeline

Jdrewniak set the point value for this task to 5. May 9 2024, 5:53 PM

From Nat:

Will give access to whoever assigns this to themselves.

Jdrewniak triaged this task as Medium priority. May 9 2024, 5:57 PM

The "not indexed" warnings for URLs with the pattern https://en.wikipedia.org/?title=EXAMPLE_TITLE_HERE&redirect=no could be related to the concept of excessive parameterization and duplicate content issues with Bing.

  1. Each URL with the &redirect=no parameter might have once pointed to the same content that is also accessible without the parameter (e.g., https://en.wikipedia.org/wiki/EXAMPLE_TITLE_HERE). Search engines could see these as duplicate pages.
  2. Bing has a limited crawl budget for each site. If the crawler is spending time indexing multiple versions of the same content, it might not be able to efficiently index new or more important pages. URLs with the &redirect=no parameter might be considered less important compared to their canonical versions without the parameter.
  3. Bing might have lost trust in the URLs with the &redirect=no parameter if they detect that these pages are duplicates or if they consider the parameterized URLs to be less authoritative.
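To make the duplicate-content point concrete, here is a rough Python sketch of how a crawler might collapse the parameterized and path-based forms into one key. The mapping from ?title=X to /wiki/X is an assumption taken from the example URLs above, and canonical_url and group_duplicates are hypothetical helper names, not anything Bing or MediaWiki provides.

```python
from collections import defaultdict
from urllib.parse import parse_qs, quote, urlsplit


def canonical_url(url: str) -> str:
    """Map a URL to its assumed canonical /wiki/ form (illustrative only)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if "title" in query:
        # Assumption: ?title=X corresponds to /wiki/X; other parameters
        # such as redirect=no are dropped.
        title = query["title"][0].replace(" ", "_")
        return f"{parts.scheme}://{parts.netloc}/wiki/{quote(title, safe=':/()_,')}"
    # Path-based URL: keep the path, drop the query string and fragment.
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


def group_duplicates(urls):
    """Return only the canonical keys that more than one URL maps to."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_url(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    sample = [
        "https://en.wikipedia.org/?title=EXAMPLE_TITLE_HERE&redirect=no",
        "https://en.wikipedia.org/wiki/EXAMPLE_TITLE_HERE",
    ]
    for key, members in group_duplicates(sample).items():
        print(key, "<-", members)
```

Running something like this over the URLs flagged in the console would show how many reported pages collapse into a single canonical key, under the mapping assumed here.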

Possible Next Steps:

Use Bing webmaster tools to indicate that the &redirect=no should be ignored during indexing.

Robots.txt and Meta Tags: Use the robots.txt file to disallow crawling of URLs with the &redirect=no parameter, or use the noindex meta tag on these pages to prevent them from being indexed. For instance, add the following to the robots.txt file:

User-agent: *
Disallow: /*?title=*&redirect=no
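Before shipping a rule like this, it may be worth checking which URLs it would actually match. The Python sketch below hand-rolls the common wildcard semantics (* matches any run of characters, a trailing $ anchors the end, otherwise the pattern is a prefix match). Both function names are hypothetical, and real crawlers may differ in edge cases, so treat it as a sanity check rather than a statement of how Bing evaluates the rule.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern with * and optional $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with ".*" for each wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, pattern: str) -> bool:
    """True if the path (including query string) matches from its start."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


if __name__ == "__main__":
    rule = "/*?title=*&redirect=no"
    for candidate in (
        "/?title=EXAMPLE_TITLE_HERE&redirect=no",
        "/wiki/EXAMPLE_TITLE_HERE",
        "/?title=EXAMPLE_TITLE_HERE",
    ):
        print(candidate, is_disallowed(candidate, rule))
```

Note that under these semantics the rule is a prefix match, so a URL with further parameters after redirect=no would also be blocked.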

Ensure that internal and external links point to the canonical URLs without the &redirect=no parameter. This helps consolidate link equity and signals to search engines the preferred version of the URL. (How to do this is not clear)

@KSarabia-WMF to write the follow up ticket for this task.

We are curious why Bing is indexing these broken URLs in engineering-all.

Jdlrobson claimed this task.
Jdlrobson updated the task description. (Show Details)

Thanks Kim! Next steps look good to me!