2025 Commons SEO review
Closed, Resolved (Public)

Description

In response to two wishes in the Community Wishlist (1, 2), I reviewed the current situation with Google indexing of Commons and took some actions.

Some previous discussion was at T54647.

Situation as of project start, 1 June 2025

  • Indexed page count: 50,487,720. Commons indexable page count is ~140M so only ~36% of Commons is indexed.
  • Weekly clicks: 1.02M
  • Weekly impressions: 140M
  • Weekly crawl stats: 2.1M discovery, 25M refresh (92% refresh)
  • Googlebot Smartphone crawling requests all fail due to a redirect loop (~46% of Google crawl traffic). The desktop site redirects to the mobile site, but the mobile site is not indexed due to rel=canonical pointing to the desktop site. Details at T54647#10902206.
  • No enhanced indexing for video (T396168)
  • Discovery comes from a variety of links; there is no canonical source of discovery data.
  • Categories are not effective for discovery since the "next page" link in a multi-page category is denied by robots.txt.
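
The category pagination problem in the last bullet can be reproduced offline with Python's standard urllib.robotparser. This is a minimal sketch, not a copy of the live configuration: the two-line robots.txt excerpt and both example URLs are assumptions made for illustration, based on the assumption that "next page" links in a category use the /w/index.php script path while the first page uses /wiki/.

```python
from urllib.robotparser import RobotFileParser

# Assumed excerpt for illustration only; the real robots.txt is much longer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical category URLs: the first page, and a "next page" link.
first_page = "https://commons.wikimedia.org/wiki/Category:Example"
next_page = "https://commons.wikimedia.org/w/index.php?title=Category:Example&from=M"

first_allowed = parser.can_fetch("Googlebot", first_page)
next_allowed = parser.can_fetch("Googlebot", next_page)

print("first page crawlable:", first_allowed)
print("next page crawlable:", next_allowed)
```

Under these assumptions a crawler can read the first page of a category but cannot follow the pagination link, so files listed only on later pages are not discovered through the category.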

Actions taken

  • June 24: Disabled desktop to mobile redirect for Googlebot on Commons (T397267)
  • June 27: Call with Google representative. Discussed mobile redirect and sitemaps.
  • July 14: Raised CDN rate limit for Googlebot (T398668)
  • July 24: @Krinkle raised video indexing with Google rep. Google Search Console shows a consequent increase in video indexing (T396168)
  • August 6: Submitted a sitemap to Google for Commons (T400023)

Situation as of 19 July 2025

  • Indexed page count: 56,531,045, up 12%
  • Weekly clicks: 1.18M, up 16%
  • Weekly impressions: 162M, up 16%
  • Weekly crawl stats: 3.2M discovery, 187M refresh (98% refresh). Huge increase in refresh traffic following mobile redirect intervention, smaller increase in discovery traffic.
  • Decline in Core Web Vitals report due to CLS (Cumulative Layout Shift) on the desktop site for mobile clients

The performance chart below for the last 12 months shows that impressions declined steeply in Sep-Dec 2024, with some impact on clicks. In the last two weeks we have seen an encouraging recovery in clicks and impressions, with the click count hitting a new maximum for this 12-month period.

performance 12mo 2025-07-19.png (475×1 px, 102 KB)

Situation as of 30 August 2025

  • Indexed page count: 121,563,005, up 141% since 1 June
  • Indexed video count: 75,519, up 102,053% since 2 June
  • Weekly clicks: 1.80M, up 75% since 1 June
  • Weekly impressions: 232M, up 68% since 1 June
  • Crawl stats: 9.9M discovery, 109M refresh (92% refresh). We saw a large spike in discovery requests when the sitemap was submitted. This has settled down to a level still much higher than before the sitemap was submitted.
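
The "up N%" and "N% refresh" figures in these status lists are plain ratios. The sketch below shows the arithmetic using the numbers quoted in this description; the function names are made up for illustration. Figures that are only published rounded (clicks, impressions) may recompute a point or two away from the percentages reported here.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


def refresh_share(discovery, refresh):
    """Share of crawl requests that are refreshes, in percent."""
    return refresh / (discovery + refresh) * 100


# Indexed page counts from the description.
JUNE_1 = 50_487_720
JULY_19 = 56_531_045
AUGUST_30 = 121_563_005

print(f"indexed pages, 19 July: up {percent_change(JUNE_1, JULY_19):.0f}%")
print(f"indexed pages, 30 August: up {percent_change(JUNE_1, AUGUST_30):.0f}%")

# Weekly crawl stats (discovery, refresh) from the description.
print(f"refresh share, 1 June: {refresh_share(2.1e6, 25e6):.0f}%")
print(f"refresh share, 19 July: {refresh_share(3.2e6, 187e6):.0f}%")
print(f"refresh share, 30 August: {refresh_share(9.9e6, 109e6):.0f}%")
```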

Chart of weekly clicks

performance 2025-08-01.png (639×1 px, 44 KB)

Chart of weekly crawl requests

crawl stats 2025-08-01.png (679×1 px, 60 KB)

Event Timeline

tstarling triaged this task as Medium priority.
MusikAnimal changed the task status from Open to In Progress. Jul 21 2025, 6:08 PM
MusikAnimal changed the status of subtask T400023: Deploy sitemaps API for Commons from Open to In Progress.

I was looking through the new index (T400023) and noticed that, although we did some work on JSON-LD metadata in the past (T250317), the license data we add does not seem to be recognized by Google. We should probably see if that can be fixed, because otherwise it is just a waste of bytes.

[edit] I was mistaken; this was due to some licenses that Google doesn't recognize.

I updated the task description to add the most recent statistics from Google.

Turnilo: pageviews_daily (type: user, referer_name: Google, project: commons) from April to October. - https://w.wiki/FoFG

Screenshot 2025-10-24 at 16.15.49.png (1×2 px, 281 KB)

Notable dates:

  • April 1 - July 4: Stable pattern of 300-400K Google-referred page views per day.
  • June 23: We disable mobile redirect for Googlebot on Commons.
  • July 5: Rapid increase in Commons pageviews referred from Google, continues to increase for 11 weeks in a row until it finds a new high on Sep 20 at ~800K/day.
  • August 6: We submit sitemap for Commons.

Same, but from 2023 to 2025 (this dataset starts in April 2023):

Screenshot 2025-10-24 at 16.27.47.png (1×2 px, 279 KB)

On our end, we see a 100% increase, from 2.5M to 5M weekly pageviews. This matches the Google Search Console data in shape but not in absolute values: it reports a similar near-doubling, but from 1.0M to 1.8M weekly "clicks".

Zooming out even further via the projectview_hourly table in Hadoop. (This is cruder because this dataset only has referer_class "search engine", not referer_name "Google". But that should be close enough for this purpose, given that Google makes up 90% of that slice.)

Screenshot 2025-10-24 at 17.10.52.png (730×2 px, 138 KB)

_date _count
2016-02 12696326
2016-03 14350244
2016-04 13600581
2016-05 14579255
2016-06 13798953
2016-07 11244233
2016-08 11749857
2016-09 12180681
2016-10 11910618
2016-11 13384737
2016-12 10992998
2017-01 12957973
2017-02 10123910
2017-03 10359125
2017-04 9350087
2017-05 9498897
2017-06 8420826
2017-07 8832303
2017-08 8941976
2017-09 13085738
2017-10 17300798
2017-11 17881866
2017-12 16401194
2018-01 17605049
2018-02 18873044
2018-03 19328553
2018-04 15934382
2018-05 16070871
2018-06 13110366
2018-07 11581620
2018-08 10595033
2018-09 10855460
2018-10 12952883
2018-11 12855865
2018-12 12012066
2019-01 13951637
2019-02 12854853
2019-03 13147948
2019-04 11570086
2019-05 11701454
2019-06 9693999
2019-07 10051166
2019-08 11641703
2019-09 14982920
2019-10 12849608
2019-11 11995788
2019-12 11853525
2020-01 14544004
2020-02 15527909
2020-03 16622006
2020-04 19252801
2020-05 17022308
2020-06 14412907
2020-07 13498663
2020-08 13499957
2020-09 14643719
2020-10 16115092
2020-11 15964750
2020-12 13505856
2021-01 14319364
2021-02 13334903
2021-03 14745507
2021-04 13618217
2021-05 13714035
2021-06 12431294
2021-07 12928642
2021-08 14294150
2021-09 16333290
2021-10 17421904
2021-11 17804498
2021-12 16787253
2022-01 18347894
2022-02 18276033
2022-03 19659552
2022-04 17843593
2022-05 19223444
2022-06 18966391
2022-07 17427705
2022-08 18190499
2022-09 17977490
2022-10 18618616
2022-11 18537647
2022-12 16712784
2023-01 18446940
2023-02 17559625
2023-03 20823649
2023-04 18886740
2023-05 21433637
2023-06 19541856
2023-07 19638625
2023-08 20228362
2023-09 15024905
2023-10 14833005
2023-11 14380063
2023-12 14063027
2024-01 15894983
2024-02 13929193
2024-03 15145511
2024-04 18185844
2024-05 19517877
2024-06 15737637
2024-07 17149260
2024-08 18188501
2024-09 18946095
2024-10 15444540
2024-11 12338801
2024-12 11811710
2025-01 13652411
2025-02 13710899
2025-03 39195678
2025-04 14248299
2025-05 14188961
2025-06 12824161
2025-07 15925198
2025-08 21610620
2025-09 26328842

SELECT CONCAT(year, '-', LPAD(month, 2, '0')) _date,
       SUM(view_count) _count
FROM wmf.projectview_hourly
WHERE year > 0
  AND project = 'commons.wikimedia'
  AND agent_type = 'user'
  AND referer_class = 'external (search engine)'
GROUP BY year, month
ORDER BY _date ASC;
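
One way to read a long monthly series like this is to compare each month with the same month a year earlier, which sidesteps the seasonal dips visible in most years. The sketch below does that for the two most recent months; the four values are copied from the query output above, and the helper name is hypothetical.

```python
# Monthly search-engine-referred view counts, copied from the query output.
monthly = {
    "2024-08": 18_188_501,
    "2024-09": 18_946_095,
    "2025-08": 21_610_620,
    "2025-09": 26_328_842,
}


def year_over_year(series, month):
    """Percent change of `month` (YYYY-MM) against the same month one year earlier."""
    year, mm = month.split("-")
    previous = f"{int(year) - 1}-{mm}"
    return (series[month] - series[previous]) / series[previous] * 100


aug = year_over_year(monthly, "2025-08")
sep = year_over_year(monthly, "2025-09")

print(f"2025-08 vs 2024-08: {aug:+.1f}%")
print(f"2025-09 vs 2024-09: {sep:+.1f}%")
```

Note that these totals cover all search engines (referer_class), so they are only a rough proxy for the Google-specific numbers discussed earlier.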
This comment was removed by Krinkle.

Turnilo: pageviews_daily (type: spider, browser_family: Googlebot, project: commons) from 2020 to 2025. - https://w.wiki/FoGo

Beware of caveats with this.

  • This is an aggregate based on Googlebot UA, not Google IP/ASN (which we don't keep beyond 90 days), so this is subject to impersonations. Afaik we block such impersonations before they are counted, but that logic may very well have changed since 2020.
  • It isn't a given that the crawl rate should be constant, because activity on Commons isn't constant.
  • It isn't a given that Google Search (only) crawls pages from pageview URLs with a Googlebot UA. It also uses the REST API.

I expected a drop somewhere in 2023 or 2024, and then a recovery after our intervention. But that's not what we see. I guess that makes sense: despite the issue, Googlebot certainly kept trying, but it didn't like what it found (a mobile redirect with the canonical pointing in the opposite direction).

Screenshot 2025-10-24 at 18.04.58.png (928×2 px, 206 KB)

pageviews-Googlebot-2025.png (921×2 px, 226 KB)

Notable dates:

  • June 23: Disabled mobile redirect for Googlebot on Commons.
  • July 14: Raised rate limit for Googlebot.
  • August 6: Submitted sitemap for Commons.

So in short... visits to Commons originating from Google have almost doubled due to these interventions? That is amazing news!

For posterity, capturing the equivalent "after" shots for Commons here.

In T54647#10902206, @tstarling wrote on 10 Jun 2025:

46% of Googlebot requests use the smartphone User-Agent, and these mostly get a 302 redirect to commons.m.wikimedia.org. But the mobile site gives a link rel=canonical pointing back to commons.wikimedia.org, i.e. the redirect it just crawled.

Now that I have search console access to the mobile site, I can confirm that Google refuses to index the mobile site for this reason.
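
The loop described in this quote can be illustrated with a toy model. This is only a sketch of the general idea (follow redirects, then follow rel=canonical, and give up if a URL repeats); it is not a description of how Google actually resolves canonicals, and the File:Example.jpg URLs, the response dictionaries, and the function name are all hypothetical.

```python
DESKTOP = "https://commons.wikimedia.org/wiki/File:Example.jpg"
MOBILE = "https://commons.m.wikimedia.org/wiki/File:Example.jpg"

# Hypothetical responses seen by a smartphone crawler before the change:
# desktop redirects to mobile, and mobile declares desktop as canonical.
BEFORE = {
    DESKTOP: {"status": 302, "location": MOBILE},
    MOBILE: {"status": 200, "canonical": DESKTOP},
}

# After the redirect is disabled for the crawler, desktop answers directly.
AFTER = {
    DESKTOP: {"status": 200, "canonical": DESKTOP},
    MOBILE: {"status": 200, "canonical": DESKTOP},
}


def find_indexable_url(start, responses, max_hops=5):
    """Follow redirects and canonical hints; return None if they form a cycle."""
    seen = set()
    url = start
    while url not in seen and len(seen) < max_hops:
        seen.add(url)
        response = responses[url]
        if response["status"] in (301, 302):
            url = response["location"]      # follow the HTTP redirect
        elif response.get("canonical", url) != url:
            url = response["canonical"]     # page points elsewhere as canonical
        else:
            return url                      # 200 response that is self-canonical
    return None


print("before:", find_indexable_url(DESKTOP, BEFORE))
print("after:", find_indexable_url(DESKTOP, AFTER))
```

In the "before" model no URL is ever both reachable and self-canonical, which matches the not-indexed reasons reported for the mobile site; in the "after" model the desktop URL resolves immediately.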

mobile search console not indexed reasons.png (720×925 px, 79 KB)

Screenshot 2025-11-06 at 11.21.17.png (1×1 px, 210 KB)

In T400022 task description, wrote on 20 July 2025:

performance 12mo 2025-07-19.png (475×1 px, 102 KB)

Screenshot 2025-11-06 at 11.15.20.png (850×2 px, 208 KB)

Screenshot 2025-11-06 at 11.25.09.png (902×1 px, 99 KB)