
Membership in "researchers" group for Srishti Sethi
Closed, Resolved · Public

Description

I (@srishakatux is my Phabricator/Gerrit/Wikimedia Developer account username) would like these permissions on the Analytics cluster to access the database for developing a data visualization of Toolforge/Cloud VPS usage on the wikis. More information about the script is here: T226663#5287195

Event Timeline

@Bmueller your help needed in approving my request!

Dzahn subscribed.

Adding Nuria as requested on Analytics access requests.

@srishakatux hello, can you be a bit more specific about what data you are after? Are you thinking eventlogging?

@Nuria I want to read MySQL credentials from /etc/mysql/conf.d/research-client.cnf via a script that @bd808 wrote to analyze Toolforge/Cloud VPS usage on wikis. So, for that, I need membership in the "researchers" group.

The next step from here would be to use Reportupdater for the datasets that the script generates.
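For illustration, a minimal sketch of the credential-reading step: MySQL client option files such as research-client.cnf are INI-style, so Python's configparser can parse them. The sample file contents and values below are invented stand-ins; only the general file format is assumed from the discussion.

```python
import configparser
import tempfile

# Hypothetical contents modeled on a MySQL client option file; the real
# /etc/mysql/conf.d/research-client.cnf values are of course different.
SAMPLE_CNF = """\
[client]
user = research
password = not-a-real-password
"""

def read_mysql_credentials(path):
    """Parse a MySQL option file and return the [client] user and password."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser["client"]["user"], parser["client"]["password"]

# Write the sample to a temp file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".cnf", delete=False) as f:
    f.write(SAMPLE_CNF)
    cnf_path = f.name

user, password = read_mysql_credentials(cnf_path)
print(user)  # research
```

A script with read access to the real file would pass those values to its MySQL client library instead of printing them.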

I can confirm that (reading those mysql credentials) is what the researchers group was originally created for.

Hardly any data can be found in MySQL; we are deprecating that storage this quarter, as we moved all data collection to Hadoop a while back. If you point me to the script, I can recommend the best path.

The data needed is the checkuser tables across the wiki fleet. This is data that is never replicated into the Wiki Replica hosts due to data sensitivity, so other than "live" production servers the mysql replicas in the Analytics realm are the only place to get the data we are looking for.

I see; this data also exists in Hadoop, not only on the Analytics replicas. The replicas are a cluster where different wikis are located on different hosts, whereas Hadoop has that same data already nicely available for you to query, and it is already denormalized by wiki. Please see:

hive (wmf)> desc geoeditors_edits_daily;
col_name                   data_type  comment
wiki_db                    string     The wiki database of origin
country_code               string     The 2-letter ISO country code this group of edits geolocated to, including Unknown (--)
user_fingerprint_or_id     string     If an anonymous user, this is the IP + UA, otherwise it is their user id in this wiki db
user_is_anonymous          boolean    Whether or not this user edited this group of edits anonymously
date                       string     The YYYY-MM-DD date for this group of edits
edit_count                 bigint     The total count of edits for this grouping
namespace_zero_edit_count  bigint     The total count of edits to namespace zero for this grouping
month                      string     The month in YYYY-MM format, all edits from that month from the cu_changes table are aggregated in this partition

# Partition Information
# col_name                 data_type  comment
month                      string     The month in YYYY-MM format, all edits from that month from the cu_changes table are aggregated in this partition

Time taken: 0.041 seconds, Fetched: 13 row(s)

Not only that, the data can be easily imported into druid where you could explore it, see, for example, the monthly aggregated version (already by country code) of that edit data: https://turnilo.wikimedia.org/#mediawiki_geoeditors_monthly

My recommendation would be to move the script to a Jupyter notebook that accesses Hadoop. I can help @srishakatux create one (do ping me on IRC), and she can basically port your Python logic there, which probably translates to just 20 lines to filter edits by IP. I think that would be a lot faster than trying to do this work in the replicas. Let me know what you think.
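As a rough sketch of what that IP-filtering logic might look like in a notebook: Python's ipaddress module can test whether an edit's source IP falls inside a set of networks. The Cloud VPS ranges and row shapes below are hypothetical stand-ins, not the real ones.

```python
import ipaddress

# Hypothetical Cloud VPS egress ranges; the real ranges would come from
# cloud-services-team documentation, not from this sketch.
CLOUD_NETWORKS = [ipaddress.ip_network("172.16.0.0/21"),
                  ipaddress.ip_network("185.15.56.0/25")]

def is_cloud_edit(ip_string):
    """Return True if the edit's source IP falls inside a Cloud VPS range."""
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        # Registered-user fingerprints are not IPs; treat them as non-cloud.
        return False
    return any(ip in net for net in CLOUD_NETWORKS)

# Toy rows shaped like (wiki_db, ip, edit_count) aggregates.
rows = [("enwiki", "172.16.1.9", 12),
        ("enwiki", "203.0.113.5", 3),
        ("dewiki", "185.15.56.4", 7)]

cloud_edits = sum(count for _, ip, count in rows if is_cloud_edit(ip))
print(cloud_edits)  # 19
```

In practice the rows would come from a Hive query over geoeditors_edits_daily rather than a literal list.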

Also worth checking the many edit dashboards on Superset (https://superset.wikimedia.org/chart/list/). We can also help build one of those, although, as I said, for the first exploratory work on this data it is really a very short Jupyter notebook.

SRE-Access-Requests, please give @srishakatux SSH access to the cluster if she does not have it already.

Also, a question to @mforns and @JAllemandou: since the cloud-services-team just wants to see edits from those IPs, it seems we could add a column _is_cloud_edit to this dataset, and they could compute their aggregates using this table, right? Does that seem like a good idea?

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/monthly/insert_geoeditors_daily_data.hql
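A toy sketch of how consumers could aggregate with such a column, assuming a boolean _is_cloud_edit per row (the row values below are invented for illustration):

```python
from collections import defaultdict

# Toy rows shaped like geoeditors_edits_daily plus the proposed boolean
# is_cloud_edit column (the column name is taken from the comment above
# and is hypothetical until actually added to the dataset).
rows = [
    {"wiki_db": "enwiki", "is_cloud_edit": True,  "edit_count": 12},
    {"wiki_db": "enwiki", "is_cloud_edit": False, "edit_count": 30},
    {"wiki_db": "dewiki", "is_cloud_edit": True,  "edit_count": 7},
]

# Aggregate edit counts per (wiki, is_cloud_edit) pair, which is the kind
# of aggregate the cloud-services-team would compute from the table.
totals = defaultdict(int)
for row in rows:
    totals[(row["wiki_db"], row["is_cloud_edit"])] += row["edit_count"]

print(totals[("enwiki", True)])  # 12
```

The real aggregation would be a GROUP BY in the HQL job linked above rather than a Python loop; this only illustrates the shape of the result.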

Change 537516 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] admin: add srishakatux to researchers

https://gerrit.wikimedia.org/r/537516

@srishakatux, do sync up with @bd808 about next steps toward a public dashboard. I think that in light of the data in Hadoop, we probably no longer need the script.

Change 537516 abandoned by Herron:
admin: add srishakatux to researchers

https://gerrit.wikimedia.org/r/537516

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks