
Requesting access to wmf MediaWiki history for Tarun Chadha
Closed, Declined · Public · Request

Description

Requestor provided information and prerequisites

This section is to be completed by the individual requesting access.

  • Wikitech username: Tarun Chadha
  • Email address: tarun.chadha@id.ethz.ch
  • SSH public key (must be a separate key from Wikimedia cloud SSH access):

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOUED517R0rJiESj68KzkcagkI6doTm7y9ObRKtoD1cW tarunchadha@baby-yoda

  • Requested group membership:
  • Reason for access: I am a data scientist from ETH Zürich and am working with Prof. Jerome Hergueux from the University of Strasbourg (https://beta-economics.fr/annuaire/328/hergueux_jerome/). For our current project we are trying to identify the edits of a subset of english wiki editors which were reverted. For this we request access to the wmf data.
  • Name of approving party (manager for WMF/WMDE staff): WMF manager
  • Ensure you have signed the L3 Wikimedia Server Access Responsibilities document: Yes
  • Please coordinate obtaining a comment of approval on this task from the approving party.

SRE Clinic Duty Confirmation Checklist for Access Requests

This checklist should be used on all access requests to ensure that all steps are covered, including expansion to existing access. Please double check the step has been completed before checking it off.

This section is to be confirmed and completed by a member of the SRE team.

  • - User has signed the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document.
  • - User has a valid NDA on file with WMF legal. (All WMF Staff/Contractor hiring are covered by NDA. Other users can be validated via the NDA tracking sheet)
  • - User has provided the following: wikitech username, email address, and full reasoning for access (including what commands and/or tasks they expect to perform)
  • - User has provided a public SSH key. This ssh key pair should only be used for WMF cluster access, and not shared with any other service (this includes not sharing with WMCS access, no shared keys.)
  • - The provided SSH key has been confirmed out of band and is verified not being used in WMCS.
  • - access request (or expansion) has sign off of WMF sponsor/manager (sponsor for volunteers, manager for wmf staff)
  • - access request (or expansion) has sign off of group approver indicated by the approval field in data.yaml

For additional details regarding access request requirements, please see https://wikitech.wikimedia.org/wiki/Requesting_shell_access

Event Timeline

Hi @chadhat, thanks for taking the time to report this and welcome to Wikimedia Phabricator!

For our current project we are trying to identify the edits of a subset of english wiki editors which were reverted.

That sounds more like a research project. Note that a lot of data is publicly available: https://meta.wikimedia.org/wiki/Data_dumps
Could you please elaborate a bit on what made you file this access request, and how you came to the conclusion that public data does not cover your needs? Thanks!

Hi Aklapper. Thanks for coming back to us.

We are actually working to complete a research project that leverages 10 years' worth of en:wiki data, documented here: https://meta.wikimedia.org/wiki/Research:Dynamics_of_Online_Interactions_and_Behavior

We need to process many edits and identify those that were reverted via shasum, undo or semi-automated tools. We were directed to this resource by Isaac Johnson (WMF), as it appears that this dataset already has all the metadata that we need. So this seems much more efficient than having to implement separate data collection strategies to identify each revert type. Thanks!
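
For readers unfamiliar with checksum-based ("shasum") revert detection, here is a minimal sketch of the general idea: a revision counts as a revert when its content checksum matches that of an earlier revision of the same page, and the revisions in between are the reverted ones. This is a simplified illustration, not the procedure used to build the dataset discussed in this task. The function name `find_identity_reverts` and the `radius` parameter (how many earlier revisions to search) are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple


def find_identity_reverts(
    revisions: Iterable[Tuple[int, str]], radius: int = 15
) -> Dict[int, List[int]]:
    """Map each reverting revision id to the revision ids it reverted.

    `revisions` is ONE page's history in chronological order, given as
    (revision_id, sha1) pairs. `radius` is an assumed cut-off for how
    many earlier revisions are searched for a matching checksum.
    """
    window: List[Tuple[int, str]] = []  # recent revisions, oldest first
    reverts: Dict[int, List[int]] = {}
    for rev_id, sha1 in revisions:
        # Look for the most recent earlier revision with the same checksum.
        for i in range(len(window) - 1, -1, -1):
            if window[i][1] == sha1:
                reverted = [r for r, _ in window[i + 1:]]
                if reverted:  # identical back-to-back revisions revert nothing
                    reverts[rev_id] = reverted
                break
        window.append((rev_id, sha1))
        if len(window) > radius:
            window.pop(0)
    return reverts


# Toy history: revision 4 restores the content of revision 1.
history = [(1, "a"), (2, "b"), (3, "c"), (4, "a"), (5, "d")]
print(find_identity_reverts(history))
```

Note that a checksum match only catches exact restorations of earlier content. Partial undos and tool-assisted reverts that do not restore an identical page state would need other signals, such as edit tags.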

@Isaac I'm afraid I'm still a bit confused as to what access is needed here (and/or which data set is being referred to); can you help, since I gather you directed the requestor here, please?

I think my name is being brought up based on this email thread (though @SalimJah, let me know if I've given other advice that I'm not remembering). I don't think you'll need access to any special, private data so this ticket isn't necessary. If you just need revert-related data, the denormalized history dumps that I mentioned in my response are publicly available and should have everything you need. If you need content-related data, then you'll want to look at the various history dump options (*pages-meta-history*), which are also public, though these are far larger and more cumbersome.

In general, the only reason you would need private access to WMF's servers is if you're working on non-public data, such as some of the more sensitive aspects of readership (though to protect reader privacy, those sorts of studies are relatively rare), or if you're part of a formal collaboration with us where using our cluster resources would greatly simplify analysis. I think what you want is more easily accessible via the public data resources, so feel free to download the data that you need for your analysis and ask questions on the wiki-research-l listserv if they come up. You don't need any special permission/support from us.

@Isaac: thanks for your reply. You are correct.

We are reacting to this suggestion in the thread you mention, which we thought looked very efficient for our purposes (we want to collect shasum, "undo" and semi-automated reverts):

"I'll also highlight the excellent public dataset put together by the Wikimedia Foundation Data Engineering team that has the full edit history for each language edition and includes metadata such as whether the edit was a revert based on shasums as well as the edit tags. If you were processing many many edits, I'd suggest starting with this as it would have all the information you need in one place."

I don't see that we need access to WMF servers for this; we just need access to the dataset. Maybe this is what caused the confusion here...

Excellent -- sounds like this task can be resolved then. I'll allow SRE to handle that in case they have a specific process but good luck @SalimJah and don't hesitate to reach out with follow-up questions.

Hi @SalimJah,

not sure if this is helpful, but in addition to what @Isaac mentioned, there is also https://dumps.wikimedia.org/other/mediawiki_history/, which includes precalculated information about whether edits were reverted. Public documentation at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history_dumps has more information about what the schema looks like.
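
As a rough illustration of how such a dump could be filtered for reverted edits, here is a minimal sketch. It assumes the files are tab-separated text without a header row and that booleans are written as the string "true"; please check both against the documentation linked above. The column positions and the short field names used here (`event_entity`, `revision_id`, `is_reverted`) are hypothetical placeholders, so the caller has to supply the real positions from the published schema.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def reverted_revisions(
    lines: TextIO, columns: Dict[str, int]
) -> Iterator[Dict[str, str]]:
    """Yield revision rows whose 'is_reverted' flag is set.

    `columns` maps field names to 0-based column positions. The positions
    are NOT known to this sketch and must come from the schema docs.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Keep only the requested fields; skip positions a short row lacks.
        rec = {name: row[idx] for name, idx in columns.items() if idx < len(row)}
        if rec.get("event_entity") == "revision" and rec.get("is_reverted") == "true":
            yield rec


# Toy input with a made-up three-column layout (hypothetical positions).
sample = "revision\t101\ttrue\nrevision\t102\tfalse\npage\t7\ttrue\n"
layout = {"event_entity": 0, "revision_id": 1, "is_reverted": 2}
for rec in reverted_revisions(io.StringIO(sample), layout):
    print(rec["revision_id"])
```

For a real compressed file, the same function could be fed a text stream opened with `bz2.open(path, "rt", encoding="utf-8")` from the Python standard library, assuming the download is bz2-compressed.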

MatthewVernon claimed this task.

[thanks folks, I'll close this access request now]

Aklapper changed the task status from Resolved to Declined. Jun 7 2023, 2:06 PM
Aklapper removed MatthewVernon as the assignee of this task.