
Membership in "researchers" group for Srishti Sethi
Closed, Resolved · Public

Description

I (@srishakatux is my Phabricator/Gerrit/Wikimedia Developer account username) would like these permissions on the Analytics cluster to access the database for developing a data visualization of Toolforge/Cloud VPS usage on the wikis. More information about the script is here: T226663#5287195

Event Timeline

@Bmueller your help needed in approving my request!

Dzahn subscribed.

Adding Nuria as requested on Analytics access requests.

@srishakatux hello, can you be a bit more specific about what data you are after? Are you thinking eventlogging?

@Nuria I want to read MySQL credentials from /etc/mysql/conf.d/research-client.cnf via a script that @bd808 wrote to analyze Toolforge/Cloud VPS usage on wikis. So, for that, I need membership in the "researchers" group.

The next step from here would be to use Reportupdater for the datasets that the script generates.
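For illustration, a minimal sketch of the credential-reading step: MySQL client option files such as research-client.cnf are INI-style, so Python's configparser can parse them. The sample file contents and values below are invented stand-ins; only the general file format is assumed from the discussion.

```python
import configparser
import tempfile

# Hypothetical contents modeled on a MySQL client option file; the real
# /etc/mysql/conf.d/research-client.cnf values are of course different.
SAMPLE_CNF = """\
[client]
user = research
password = not-a-real-password
"""

def read_mysql_credentials(path):
    """Parse a MySQL option file and return the [client] user and password."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser["client"]["user"], parser["client"]["password"]

# Write the sample to a temp file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".cnf", delete=False) as f:
    f.write(SAMPLE_CNF)
    cnf_path = f.name

user, password = read_mysql_credentials(cnf_path)
print(user)  # research
```

A script with read access to the real file would pass those values to its MySQL client library instead of printing them.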

I can confirm that (reading those mysql credentials) is what the researchers group was originally created for.

Hardly any data can be found in MySQL; we are deprecating that storage this quarter, as we moved all data collection to Hadoop a while back. If you point me to the script, I can recommend the best path.

The data needed is the checkuser tables across the wiki fleet. This is data that is never replicated into the Wiki Replica hosts due to data sensitivity, so other than "live" production servers the mysql replicas in the Analytics realm are the only place to get the data we are looking for.

I see; this data also exists in Hadoop, not only on the Analytics replicas. The replicas are a cluster where different wikis are located on different hosts, whereas Hadoop has that same data already nicely available for you to query, and it is already denormalized by wiki. Please see:

hive (wmf)> desc geoeditors_edits_daily;
col_name                   data_type  comment
wiki_db                    string     The wiki database of origin
country_code               string     The 2-letter ISO country code this group of edits geolocated to, including Unknown (--)
user_fingerprint_or_id     string     If an anonymous user, this is the IP + UA, otherwise it is their user id in this wiki db
user_is_anonymous          boolean    Whether or not this user edited this group of edits anonymously
date                       string     The YYYY-MM-DD date for this group of edits
edit_count                 bigint     The total count of edits for this grouping
namespace_zero_edit_count  bigint     The total count of edits to namespace zero for this grouping
month                      string     The month in YYYY-MM format, all edits from that month from the cu_changes table are aggregated in this partition

# Partition Information
# col_name                 data_type  comment
month                      string     The month in YYYY-MM format, all edits from that month from the cu_changes table are aggregated in this partition

Time taken: 0.041 seconds, Fetched: 13 row(s)

Not only that, the data can be easily imported into druid where you could explore it, see, for example, the monthly aggregated version (already by country code) of that edit data: https://turnilo.wikimedia.org/#mediawiki_geoeditors_monthly

My recommendation would be to move the script to a Jupyter notebook that accesses Hadoop. I can help @srishakatux create one (do ping me on IRC), and she can basically port your Python logic there, which probably translates to just 20 lines to filter edits by IP. I think that would be a lot faster than trying to do this work in the replicas. Let me know what you think.
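As a rough sketch of what that IP-filtering logic might look like in a notebook: Python's ipaddress module can test whether an edit's source IP falls inside a set of networks. The Cloud VPS ranges and row shapes below are hypothetical stand-ins, not the real ones.

```python
import ipaddress

# Hypothetical Cloud VPS egress ranges; the real ranges would come from
# cloud-services-team documentation, not from this sketch.
CLOUD_NETWORKS = [ipaddress.ip_network("172.16.0.0/21"),
                  ipaddress.ip_network("185.15.56.0/25")]

def is_cloud_edit(ip_string):
    """Return True if the edit's source IP falls inside a Cloud VPS range."""
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        # Registered-user fingerprints are not IPs; treat them as non-cloud.
        return False
    return any(ip in net for net in CLOUD_NETWORKS)

# Toy rows shaped like (wiki_db, ip, edit_count) aggregates.
rows = [("enwiki", "172.16.1.9", 12),
        ("enwiki", "203.0.113.5", 3),
        ("dewiki", "185.15.56.4", 7)]

cloud_edits = sum(count for _, ip, count in rows if is_cloud_edit(ip))
print(cloud_edits)  # 19
```

In practice the rows would come from a Hive query over geoeditors_edits_daily rather than a literal list.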

Also worth checking the many edit dashboards on Superset (https://superset.wikimedia.org/chart/list/). We can also help build one of those, although, as I said, for the first exploratory work on this data it is really a very short Jupyter notebook.

SRE-Access-Requests, please give @srishakatux SSH access to the cluster if she does not have it already.

Also, a question to @mforns and @JAllemandou: since the cloud-services-team just wants to see edits from those IPs, it seems we could add a column _is_cloud_edit to this dataset, and they could compute their aggregates using this table, right? Does that seem like a good idea?

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/monthly/insert_geoeditors_daily_data.hql
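A toy sketch of how consumers could aggregate with such a column, assuming a boolean _is_cloud_edit per row (the row values below are invented for illustration):

```python
from collections import defaultdict

# Toy rows shaped like geoeditors_edits_daily plus the proposed boolean
# is_cloud_edit column (the column name is taken from the comment above
# and is hypothetical until actually added to the dataset).
rows = [
    {"wiki_db": "enwiki", "is_cloud_edit": True,  "edit_count": 12},
    {"wiki_db": "enwiki", "is_cloud_edit": False, "edit_count": 30},
    {"wiki_db": "dewiki", "is_cloud_edit": True,  "edit_count": 7},
]

# Aggregate edit counts per (wiki, is_cloud_edit) pair, which is the kind
# of aggregate the cloud-services-team would compute from the table.
totals = defaultdict(int)
for row in rows:
    totals[(row["wiki_db"], row["is_cloud_edit"])] += row["edit_count"]

print(totals[("enwiki", True)])  # 12
```

The real aggregation would be a GROUP BY in the HQL job linked above rather than a Python loop; this only illustrates the shape of the result.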

Change 537516 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] admin: add srishakatux to researchers

https://gerrit.wikimedia.org/r/537516

@srishakatux, do sync up with @bd808 about next steps toward a public dashboard. I think that in light of the data in Hadoop, we probably no longer need the script.

Change 537516 abandoned by Herron:
admin: add srishakatux to researchers

https://gerrit.wikimedia.org/r/537516

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks