
Many small wikis missing from mediawiki_history dataset
Closed, Resolved; Public

Description

The Wikimedia cluster contains about 940 wikis (all.dblist contains 934, and the enwiki sites table contains 940).

However, mediawiki_history contains data on only 752 wikis. The missing ones (compared to the sites table) are listed in P8372.
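A comparison like this comes down to a set difference over wiki database names. The following is a minimal sketch, assuming the two lists have already been exported as plain Python lists; `missing_wikis` and the sample data are hypothetical and only reuse wiki names mentioned in this task, not the real contents of the sites table or of mediawiki_history.

```python
def missing_wikis(expected, present):
    """Return the sorted database names in `expected` that are absent from `present`."""
    return sorted(set(expected) - set(present))


# Hypothetical sample data, standing in for the sites table export
# and for the distinct wikis found in mediawiki_history.
sites = ["enwiki", "hiwikivoyage", "roa_rupwiki", "akwikibooks", "arbcom_enwiki"]
history = ["enwiki"]

missing = missing_wikis(sites, history)
print(len(missing), missing)
```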

Various types of wikis are excluded, from private coordination wikis (e.g. arbcom_enwiki), to closed content wikis (e.g. akwikibooks), to small, open content wikis (e.g. roa_rupwiki, hiwikivoyage).

All of these wikis should be included; the biggest problem is the missing open content wikis, but even for closed or private wikis we should at least have the data available.

Event Timeline

fdans moved this task from Incoming to Analytics Query Service on the Analytics board.

A lot of the wikis there are either private or closed. The rest can be added via a patch to the sqoop load list:

https://github.com/wikimedia/analytics-refinery/blob/master/static_data/mediawiki/grouped_wikis/labs_grouped_wikis.csv

Generally whenever we get pageviews for a new site we add the site to both the pageview whitelist and the sqoop list, but this hasn't always been consistent.

@Neil_P._Quinn_WMF the private wikis are not included on the labs replicas, and that is intentional; if you notice, we do not report pageview data for those wikis either. This is at the request of users of those wikis and seems like the right policy, as there are wikis of a delicate nature there. These wikis will continue not to be included.

Any small open content wiki should be included and we should work to make sure all are. I am assigning this ticket to @fdans so the small wikis missing can be added. The best measure here would be to source our labs_grouped_wikis.csv is at least as current as the pageview list which is always updated upon the creation of a new wiki.

Nuria raised the priority of this task from Low to Medium.

@Neil_P._Quinn_WMF the private wikis are not included on the labs replicas, and that is intentional; if you notice, we do not report pageview data for those wikis either. This is at the request of users of those wikis and seems like the right policy, as there are wikis of a delicate nature there. These wikis will continue not to be included.

Yeah, on reflection I agree with that. So I'm fine with just making sure all the public wikis are included (closed or not). There are more than a hundred of them, although the overall effect isn't huge since they're pretty much all small.

The best measure here would be to source our labs_grouped_wikis.csv is at least as current as the pageview list which is always updated upon the creation of a new wiki.

Sorry, what do you mean by source?

Change 507355 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Add 137 wikis that haven't been sqooped so far

https://gerrit.wikimedia.org/r/507355

Change 507355 merged by Nuria:
[analytics/refinery@master] Add 122 wikis that haven't been sqooped so far

https://gerrit.wikimedia.org/r/507355

Ping @fdans: let's make sure the prod list and the labs list match, as every snapshot sqoops from both.
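One way to check that the two lists match is to compare the wiki names in each file in both directions. This is a rough sketch, not the refinery's own tooling: it assumes each list is a CSV whose first column is the wiki database name, and `wiki_names`, `list_mismatch`, and the sample rows (including the placeholder second column) are hypothetical.

```python
import csv
import io


def wiki_names(csv_text):
    """Collect the first column of each non-empty row.

    Assumption: the first column holds the wiki database name.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip() for row in reader if row and row[0].strip()}


def list_mismatch(labs_csv, prod_csv):
    """Return (names only in the labs list, names only in the prod list)."""
    labs, prod = wiki_names(labs_csv), wiki_names(prod_csv)
    return sorted(labs - prod), sorted(prod - labs)


# Hypothetical sample contents; the second column is a placeholder value.
labs_sample = "enwiki,1\nhiwikivoyage,2\nroa_rupwiki,2\n"
prod_sample = "enwiki,1\nhiwikivoyage,2\n"

only_labs, only_prod = list_mismatch(labs_sample, prod_sample)
print("only in labs list:", only_labs)
print("only in prod list:", only_prod)
```

Two empty lists would mean the files name the same set of wikis; anything else shows which file needs a patch.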

Change 509938 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Add 122 wikis to prod sqoop list

https://gerrit.wikimedia.org/r/509938

Change 509938 merged by Fdans:
[analytics/refinery@master] Add 122 wikis to prod sqoop list

https://gerrit.wikimedia.org/r/509938

Closing: after sqooping, it looks like the only wiki that failed was hiwikisource, so the rest existed.