
SecurePoll should provide a way to download data
Open, Needs Triage · Public · Feature

Description

Elections run using SecurePoll are audited (commonly called scrutineering) by checkusers for fraudulent votes, sockpuppetry, etc. There is a scrutineering.py script written by User:Zzuuzz which takes most of the drudgework out of this, but you need to get the raw data onto your local machine. This is typically done by copy-pasting from the browser into a local file, which is a pain (and error-prone). Things get even more annoying if there are more than 500 votes and you have to copy from multiple pages.

This would all be a lot easier if there were a button on the [[Special:SecurePoll/list/xxx]] screen, visible to CUs, which would download the entire dataset in some standardized format. Either JSON or CSV would be fine. Then people could write whatever analysis scripts they wanted, to slice and dice the data in whatever way was convenient to them.

Event Timeline


What data do you need exactly? There's a link called "Dump" that lets you download anonymized votes. Do you instead need the voter metadata, including checkuser-ish stuff like IP? What fields does the Python script need, and what does the Python script do with the data?

Can you link to the Python script's source code if it's public?

The script is on cuwiki, but I don't think that's important. It reads the data as you would see it on the 'vote list' page, from a local text file, in tab-delimited (TSV) format. Fields include timestamp, name, duplicate, IP, and the rest, as it's displayed in the list table. It reads from a text file because it's too much coding-faff to log in and read directly from the site. The script is not public.

HTML tables are typically copied as TSV, which is the correct format. It might be tempting to copy-paste 500 votes at a time and append them to the text file. However, users can adjust the URL to display 5,000 records at a time and copy that in one action. I recommend that. Roy is requesting a direct download of this table, in TSV/CSV or similar.

BTW, last time I checked, one of Firefox or Chrome wouldn't copy-paste the text in a clean TSV format. If you come across that problem, it should work easily with the other one.
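For illustration only, here is a minimal sketch of reading a pasted vote list as TSV. It is not the scrutineering.py script, which is not public. The column order is an assumption based on the columns named in this thread (Strike, Time, Name, CSRF, Duplicate, IP, XFF, User agent), and `read_vote_list`, `COLUMNS` and `has_header` are hypothetical names.

```python
import csv
import io

# Assumed column order, based on the vote list columns named in this thread.
# Whether a pasted table includes a header row is something to verify by hand.
COLUMNS = ["strike", "time", "name", "csrf", "duplicate", "ip", "xff", "user_agent"]


def read_vote_list(text, has_header=False):
    """Parse tab-delimited vote list text into a list of dicts keyed by COLUMNS."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Drop rows that are empty or contain only whitespace.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if has_header and rows:
        rows = rows[1:]
    records = []
    for row in rows:
        # Pad short rows so every record has a value for every column.
        padded = row + [""] * (len(COLUMNS) - len(row))
        records.append(dict(zip(COLUMNS, (cell.strip() for cell in padded))))
    return records
```

Reading from a file would then be something like `read_vote_list(open("votes.tsv", encoding="utf-8").read())`, where `votes.tsv` is a placeholder file name.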

OK, so you just want an unpaginated version of the /list page, with checkuser-ish columns included, and formatted in TSV.

Sounds like we can do this as an API query or as a file download. As long as there's a way to get to the data.

An option to display 5,000 records per page would also get around the immediate issue. It doesn't need to be 5,000 (enwiki admin elections number around 600-ish), but that seems to be the norm.

The URL can be modified to do this: https://en.wikipedia.org/w/index.php?limit=5000&title=Special%3ASecurePoll%2Flist%2F893

Although that is hacky, so I suppose it'd be better to add 5000 to the dropdown list of "per page" options.
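A small sketch of building that URL programmatically, using only the `limit` and `title` parameters shown above. The function name `vote_list_url` is hypothetical, and whether the site honours a given `limit` value is not something this code checks.

```python
from urllib.parse import urlencode


def vote_list_url(election_id, limit=5000, base="https://en.wikipedia.org/w/index.php"):
    """Build a vote list URL with a custom per-page limit."""
    query = urlencode({
        "limit": limit,
        "title": f"Special:SecurePoll/list/{election_id}",
    })
    return f"{base}?{query}"
```

Calling `vote_list_url(893)` reproduces the URL quoted above.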

Yep, I'm talking about the checkuser stuff. Basically, all the data that's on the page:

Strike Time Name CSRF Duplicate IP XFF User agent

but now that I look at it closer, that's a subset of what's in Special:SecurePoll/details/xxxx, so just dump everything that's in /details and then CUs (or their scripts) can use or not use whatever they need. Another issue is that out of the box, the script grovels forever as it keeps hitting the anonymous API throttle. I can get around that because I know how to install my own OAuth keys, but not every CU will be capable of that.

As for the details of what the script does, I don't think there's anything particularly secret about the script, but at the moment it's not publicly accessible and the details really aren't important for this ticket, so I'll skip that.

> Another issue is that out of the box, the script grovels forever as it keeps hitting the anonymous API throttle.

Why would it be hitting an API throttle if you're pasting all the data into it? Does it do a foreach of every user after it receives the data?

It starts with the copy-pasted data, but then it makes a bunch of API calls to retrieve additional information about individual users and IPs.

In retrospect, the data in /details is structured enough that JSON would probably make more sense than csv or tsv.

A download button would probably be a lot simpler than an API call from the user's perspective. It would be easy enough to have the script make the required API call to grab the data, but then you're right back to the script needing to supply the user's credentials which will probably immediately devolve into the user having to figure out how to generate their own OAuth keys.

Good brainstorming. If you want to edit your original post with instructions on exactly what you want programmed, that'll make this more likely to get worked on. After the discussion, the following points are still unclear to me, and someone needs to decide on them before this can be easily worked on:

  • CSV, TSV, or JSON?
  • just the fields on the /list page, or all the fields on the /details page too? (the latter is more complicated and will slow down development, but it sounds like the latter would get rid of your foreach username API query throttle problem)

Another concern I just thought of: is WMF OK with copying the PII of several hundred or several thousand users to a local computer? I guess you're doing it already anyway via copy-paste, but it is a little concerning to have that much PII exiting our secured systems. If we can think of a way to do this without needing to export PII to local systems, that would be ideal. (Although that goes in the direction of building the scrutineering.py script into SecurePoll, which is complicated.)

One of the first things my script says is to securely handle and promptly delete any data. I'm pretty sure many checkusers handle local text files on a temporary basis - not for record keeping but to juggle the amount of data we sometimes have to look at. It's not something I particularly recommend though, especially if you're not disciplined in dealing with secure data. I'm not really a fan of just having a download-file button for several reasons.

A secure script should be able to read the list page on the site and analyse it there and then. Something like the &action=raw format that you can use on non-special pages shouldn't add any security concerns. Plus you could copy-paste it.

Sock checking is a dark art, but I'll explain a couple of things that my script does. The basics are to check for shared IPs, and it uses CIDRs to help with that. It also parses the XFF field, which usually contains several IPs. Other things are to check for obvious alt account names, and flag CSRF or duplicate failures which can be easy to miss. It also flags up blocked users, for example.
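To make the shared-IP idea concrete, here is a rough sketch using Python's standard `ipaddress` module. It is not the actual script: the /24 and /64 prefix lengths, the record keys (`name`, `ip`, `xff`) and all function names are assumptions chosen for illustration.

```python
import ipaddress
from collections import defaultdict


def parse_xff(xff):
    """Return the valid IP addresses found in a comma-separated XFF value."""
    addresses = []
    for part in xff.split(","):
        part = part.strip()
        if not part:
            continue
        try:
            addresses.append(ipaddress.ip_address(part))
        except ValueError:
            # Skip tokens such as "unknown" that are not IP addresses.
            continue
    return addresses


def network_of(address, v4_prefix=24, v6_prefix=64):
    """Return the enclosing CIDR range (prefix lengths are assumed defaults)."""
    prefix = v4_prefix if address.version == 4 else v6_prefix
    return ipaddress.ip_network(f"{address}/{prefix}", strict=False)


def shared_ranges(records):
    """Map each CIDR range to the voter names seen in it, keeping ranges with 2+ names."""
    seen = defaultdict(set)
    for rec in records:
        candidates = parse_xff(rec.get("xff", ""))
        try:
            candidates.append(ipaddress.ip_address(rec.get("ip", "").strip()))
        except ValueError:
            pass  # Missing or malformed IP field.
        for address in candidates:
            seen[network_of(address)].add(rec.get("name", ""))
    return {net: names for net, names in seen.items() if len(names) > 1}
```

This sketch treats every address in the XFF field equally, so internal proxy addresses that appear in many XFF values would produce false matches; a real check would need to filter those.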

My script contains options to bypass some or all API checks, and to insert a delay between API calls. Personally I would expect to be cooking dinner while the script runs. I've always said it's not a script for noobs. Basically I would say don't invest any effort into helping with this script. But some scripting is probably helpful when it comes to scrutineering.

I get your point about not storing data locally, but right now, you need to do that anyway. Wouldn't it be better to at least make that convenient rather than copy-paste from the screen?

FWIW, letting the script run while attending to dinner is exactly what I just did (sans credentials). I left it running when I went out. After 1 hour 57 minutes, it crashed with:

pywikibot.exceptions.TimeoutError: Maximum retries attempted without success.
CRITICAL: Exiting due to uncaught exception TimeoutError: Maximum retries attempted without success.

Are you running it on a non-privileged account? If you run it on a sysop account or a bot account, that'll give you noratelimit and apihighlimits. Some combination of those user rights should probably turn off throttling.

Also make sure you have a custom user agent set.

Please obey API etiquette. GET requests should be spaced sequentially, not in parallel. I think the API etiquette page used to say something about spacing requests 10 seconds apart, but I can't find that part with Ctrl-F anymore.

Batching, if available, could be a good strategy to reduce API queries.
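As a sketch of what batching plus sequential spacing could look like: the MediaWiki `list=users` query accepts several names in `ususers` separated by `|` (up to 50 per request without apihighlimits). The `fetch` callable, the one-second default delay and the function names are assumptions; `fetch` stands in for whatever HTTP or pywikibot call the script actually makes.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def user_query_params(names):
    """Build parameters for one batched list=users query with block info."""
    return {
        "action": "query",
        "list": "users",
        "ususers": "|".join(names),
        "usprop": "blockinfo",
        "format": "json",
    }


def run_batches(names, fetch, batch_size=50, delay=1.0, sleep=time.sleep):
    """Call `fetch` once per batch, strictly one after another, pausing between calls."""
    results = []
    for index, batch in enumerate(chunked(list(names), batch_size)):
        if index:
            sleep(delay)  # Pause between requests, not before the first one.
        results.append(fetch(user_query_params(batch)))
    return results
```

With 600 voters this makes 12 requests instead of 600, which on its own may be enough to stay clear of the throttle.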

Without commenting on the underlying request, it seems to me that if this proceeds, it should also accommodate WMF Board of Trustees elections, which range up to 7000 or so votes.

> Wouldn't it be better to at least make that convenient rather than copy-paste from the screen?

I would just like to stress the semantic difference (as well as HTTP and filesystem differences) between downloading a data file (a 'download button'), and viewing the raw data. The latter would provide convenience for either direct access from a script with suitable credentials, or manual handling without the extra auth code. I believe you're less likely to end up with stray data using this 'inline' method.

> Please obey API etiquette.

I believe the API calls are well optimised and etiquetted (and as I say they can also be selectively disabled and spaced out). I have some pending updates since the last time I was scrutineering (including a UA header), but as I'm not currently scrutineering, I'm not really in a position to test them out. It seems Roy has made some relevant adjustments.

OK, I've given this some more thought and there is value in what Zzuuzz says about the security aspects of having this data in a file on a local disk. Still, the current method of copy-pasting the data from a browser has all those security disadvantages combined with being inconvenient. So, yeah, I might be coming around to being OK with Novem's suggestion of adding an API.

On the other hand, if we do that, then we're back to keeping OAuth credentials in a local disk file. If anything, that's a bigger security exposure than just keeping a file of the scrutinizing data in a local disk file because the credentials file gives away everything I have access to. But that's a different discussion for another day.

Huji subscribed.

Exporting the data from the table can be done using a user script in JS. Even the pagination piece can be handled by a creative piece of JS code. I'm not sure if this kind of export is worthwhile implementing in SecurePoll; after all, the use case is an uncommon, external use case, not a core feature of the SecurePoll process.