
Create robots.txt policy for datasets
Closed, Resolved · Public · 1 Estimated Story Point

Description

Our datasets get crawled all the time, and some of them are a few MB in size. We could disallow all crawling of datasets to help reduce bandwidth usage. But is there any good reason to have them crawled? We can link to specific folders from the wikis if we want them to be searchable on the web, right?
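
For reference, a minimal robots.txt along these lines could look like the following sketch (the /datasets/ path is an assumption for illustration; the actual directory layout on analytics.wikimedia.org may differ):

    # Block all crawlers from the datasets directory (path assumed for illustration)
    User-agent: *
    Disallow: /datasets/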

Event Timeline

Is there any reason we are actually concerned about bandwidth usage?

@Peachey88 not particularly; this is low priority, but it just seems like a bad idea to waste bandwidth for no reason, especially on larger files like datasets. A crawler downloads the whole file just to conclude: oh, not HTML, moving on.

Nuria set the point value for this task to 1.

Change 345634 had a related patch set uploaded (by Milimetric):
[analytics/analytics.wikimedia.org@master] Prevent datasets from being crawled

https://gerrit.wikimedia.org/r/345634

Change 345634 merged by Milimetric:
[analytics/analytics.wikimedia.org@master] Prevent datasets from being crawled

https://gerrit.wikimedia.org/r/345634