
Disable search engine indexing (with noindex) in specific namespaces of Turkish Wikipedia
Closed, Resolved (Public)

Description

There was consensus at the village pump, and we want all pages in the "user" ("kullanıcı" in Turkish) and "user talk" ("kullanıcı mesaj" in Turkish) namespaces to be excluded from search engine indexing.

Google test case link: https://www.google.com/search?q=inurl%3Atr.wikipedia.org%2Fwiki%2FKullanıcı_mesaj%3A

Event Timeline

Change 606374 had a related patch set uploaded (by Majavah; owner: Majavah):
[operations/mediawiki-config@master] Disable NS_USER(_TALK) search engine indexing on trwiki

https://gerrit.wikimedia.org/r/606374

Change 606374 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable NS_USER(_TALK) search engine indexing on trwiki

https://gerrit.wikimedia.org/r/606374

Mentioned in SAL (#wikimedia-operations) [2020-06-22T11:07:26Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: defa81e: Disable NS_USER(_TALK) search engine indexing on trwiki (T255538) (duration: 00m 58s)

Deployed. Please re-open (via Add Action...Change Status) if this isn't working.

Hello @Majavah. Can you give information about when it will take effect?

In T255538#6244097, @Yagizhan49 wrote:

Hello @Majavah. Can you give information about when it will take effect?

The configuration change was just applied; it might take some time for search engines to pick it up.

@Majavah, both new and old pages are still appearing in Google search results. I guess this needs to be re-opened.

taavi moved this task from To deploy to Backlog on the Wikimedia-Site-requests board.
taavi subscribed.

Interesting. I'll let someone else take a look in case I've done something wrong and don't realize it.

Urbanecm assigned this task to taavi.
Urbanecm subscribed.

The new value did take effect: in the source code of https://tr.wikipedia.org/wiki/Kullan%C4%B1c%C4%B1:Martin_Urbanec/sand I can see <meta name="robots" content="noindex,follow"/>, which is exactly what we want there. As @Majavah said earlier, search engines can take a while to notice this, but that's beyond our control.
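The manual check described above (looking for the robots meta tag in the page source) can be sketched in Python. This is only an illustration, not a tool used in this task: the names RobotsMetaParser, robots_directives and is_noindexed are hypothetical, and the sketch works on HTML that has already been downloaded.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def is_noindexed(html):
    """True if the page asks search engines not to index it."""
    return "noindex" in robots_directives(html)


sample = '<html><head><meta name="robots" content="noindex,follow"/></head></html>'
print(sorted(robots_directives(sample)))
print(is_noindexed(sample))
```

To spot-check live pages, the HTML could be fetched first (for example with urllib.request) and passed to is_noindexed. The sketch only reads the meta tag, so it says nothing about whether a search engine has already dropped the page.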

As @Majavah said earlier, this can take search engines a while to notice, but that's beyond our control.

You could ask someone with search console access to blacklist it.

Vito-Genovese subscribed.

It seems to be working only partially. It has been brought to my attention that many sandbox pages are being indexed by Google, raising concerns in the community about self-promotion. This even includes a page that was created today and has since been deleted (google: Kullanıcı:Muhammetemreaydinofficial/deneme tahtası). Try googling the words "kullanıcı deneme tahtası vikipedi" without the quotation marks and you will get many hits for sandbox pages in the user namespace. Some main user pages also show up. They all have the meta tag that @Urbanecm refers to above, so I don't know what's causing this.

As long as the meta tags are present in the page, the change is working fine on Wikimedia's side (search engines can theoretically ignore the tag, but I doubt Google ignores this policy). I have spot-checked a few user pages: the tags are there (saying "noindex"), and I'm also unable to find those pages through Google.

The example you listed, "Kullanıcı:Muhammetemreaydinofficial/deneme tahtası", had __DİZİN__ in the (now-deleted) page text. According to Google Translate, that means __INDEX__, which is a special keyword recognized by MediaWiki that overrides the default indexing policy. As such, the page was indeed indexed by Google, because MediaWiki was told to allow indexing on this page regardless of the default indexing policy. If you want, you can create an AbuseFilter that prevents users from adding the keyword to user (sub)pages. When writing the filter, note that MediaWiki recognizes several versions of the keyword: it can also be written in English (i.e. __INDEX__), and I think (not tried) that lowercase is also recognized correctly. Let me know if you need any help with the filter.
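The condition such a filter would need to check can be sketched in Python. This is a sketch of the logic only, not AbuseFilter rule syntax and not the filter that was later created. The function names are hypothetical, the keyword spellings are the two mentioned in this thread (other aliases may exist), and lowercase variants are deliberately not handled because the thread leaves that untested.

```python
# Keyword spellings taken from this thread; other aliases may exist (assumption).
INDEX_SWITCHES = ("__INDEX__", "__DİZİN__")

# Namespace names on trwiki as given in the task description.
USER_NAMESPACES = ("Kullanıcı", "Kullanıcı mesaj")


def in_user_namespace(title):
    """True if the title is in the user or user talk namespace."""
    # Titles may come from URLs, where spaces appear as underscores.
    namespace, separator, _ = title.replace("_", " ").partition(":")
    return bool(separator) and namespace.strip() in USER_NAMESPACES


def forces_indexing(title, wikitext):
    """True if a user (sub)page contains a keyword that forces indexing."""
    return in_user_namespace(title) and any(
        switch in wikitext for switch in INDEX_SWITCHES
    )


print(forces_indexing("Kullanıcı:Example/deneme tahtası", "Text __DİZİN__"))
print(forces_indexing("Example article", "__INDEX__"))
```

A plain substring match like this would also flag the keyword inside nowiki tags or comments, and it would miss a keyword that arrives through a transcluded template, so a real filter may need a different test.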

@Urbanecm; Is there not the exemptFromUserRobotsControl flag?

Thank you @Urbanecm . I was able to confirm that only two pages from last month appear on Google and both had the localized behavior switch in question. I have now created https://tr.wikipedia.org/wiki/%C3%96zel:%C4%B0stismarS%C3%BCzgeci/89, so hopefully it will not happen again.

@Urbanecm; Is there not the exemptFromUserRobotsControl flag?

That (https://www.mediawiki.org/wiki/Manual:$wgExemptFromUserRobotsControl) is also an option; however, we'd need another consensus, this time to disallow user control of indexing in certain namespaces. It would also prevent trusted users from forcing indexing when there is a reason to. An AbuseFilter is a simpler solution, IMO.

We created an abuse filter for that. Hopefully that will solve our problem.