Page MenuHomePhabricator

[REQUEST] List of email addresses of departed editors
Closed, ResolvedPublic

Description

What's requested:
A list of email addresses of departed editors.

Why it's requested:
Community Advocacy will be launching a large-scale survey on the topic of harassment on November 2. Given the nature of the topic and the knock-on effects it can have on editors' participation, we'd very much like to reach out via email to departed editors and invite them to participate to the survey, as they may have interesting feedback to offer.

When it's requested:
Given that we are launching the survey on the November 2nd, we'd love to have this by October 30th at the latest.

Other helpful information:
Have done a small Q&A for you:

How do we define departed editors for this survey?

To solidify the term for the purposes of running a query, we'd consider an editor as departed if their user account has registered zero activity in any of the Wikimedia projects for over a period of 6 months.

Is there a threshold of edits the user had to have made beforehand before they go inactive to filter out accounts that only make one edit and don't come back?

if not causing too much work, we'd like to look at two thresholds on this. One would be set at 100 edits and above, and one at 250 edits and above. The reason for this is so that we can see what kind of numbers we're looking at here, and review the scalability of sending those emails out.

Are we concerned about departed editors who have chosen to remove their email address from our system?

No. If one has removed their address, they probably *really* don't want to hear from us in any way or form.

Should blocked users be skipped? (There are cases where people who were harassed ended up getting blocked for various reasons, so I'm not sure about this...but we don't want to be emailing random vandals.)

Yes, please. The number of upset but legitimately blocked users is likely to be much higher that the number of harassed users being blocked as a knock-on effect of their reaction to the harassment. So, inviting this group of editors to participate would be more likely to skew the results than help.

Should bot accounts be skipped?

Yes, please. If possible.

How far back do we want to go for departed accounts? Do we want to be emailing people who were super active in 2004 and then left?

It makes little sense to reach out to somebody who's left the projects a decade ago to ask them if there were harassed....Let's set the threshold at editors who were considered as departed from January 1 2014, onwards. Which assume that last recorded activity would be 30 June 2013.

Ideally, it'd be good to know how many there are as soon as it is possible, so that we know how much time to allocate for sending the emails out (send in batched, etc.)

Question from me:
How long would you typically need for this, based on the above? I am conscious of your own time and other commitments, so I want to make sure we're not overwhelming you, considering this is a major favour I'm asking. If we need to make any adjustments on our part on the above parameters, that is something we can certainly accommodate. So, let me know.

Let me know if you have any more questions!! I'll be happy to help :)

Event Timeline

Kalliope raised the priority of this task from to Needs Triage.
Kalliope updated the task description. (Show Details)
Kalliope subscribed.

@Kalliope, thanks. This is exactly what I was looking for!

With the parameters you've given above, it would unfortunately be a lot of work—significantly more than I could commit to doing before 30 October.

However, as you suggested, there are a couple changes that should make it doable within that timeframe.

  1. Look at individual wikis separately. It would be very difficult to look at a user's contributions across all Wikimedia wikis, but much easier to run one query for departed enwiki users, another for departed frwiki users, and so on—we could even do that for dozens of wikis. This would miss users who stopped editing one wiki but remained active on another, but I don't there would be that many editors in that category.
  2. Don't skip blocked users. Unfortunately, it's very hard to figure out whether a user is currently blocked (the software manages, but it's not very efficient if you're crunching hundreds of thousands of accounts :). It's a bit easier, but still quite difficult, to filter out users who have ever been blocked. The easiest thing to do would be to ignore blocks entirely.

Would that work? If so, what format would you need the data in? Do you just want a list of email addresses, or do you want additional data like user name, home wiki, or last edit date too? Since this has email addresses, it's fairly sensitive data so I'd need to check with the Analytics team about the most secure way to transfer it to you.

In T116078#1740913, @Neil_P._Quinn_WMF wrote:
  1. Don't skip blocked users. Unfortunately, it's very hard to figure out whether a user is currently blocked (the software manages, but it's not very efficient if you're crunching hundreds of thousands of accounts :). It's a bit easier, but still quite difficult, to filter out users who have ever been blocked. The easiest thing to do would be to ignore blocks entirely.

Isn't it just a LEFT JOIN on the ipblocks table?

@Legoktm, ah, you're right! It looks like only current blocks are stored in this table; I guess they get deleted if they expire?

I didn't realize a table like that existed; I thought (as did the person I asked) blocks were just stored in the logging table, and I'd have to do some complicated logic to figure out whether the user was currently blocked.

@Kalliope, in that case, it shouldn't be a problem to exclude blocked users. So the only real caveat is that I'll have to look at each wiki you're interested in separately.

Ok, let's give it a go for each Wiki separately then. The interest is technically on all Wikis. So... all 16 of them... :/
I am hoping that we can then somehow cross reference this data to identify which of those users have been inactive across all Wikis...

Ok, let's give it a go for each Wiki separately then. The interest is technically on all Wikis. So... all 16 of them... :/

All 16 of them? I imagine you mean all "160 of them" or something like that :) There are a couple issues here:

  1. I could probably manage 20 wikis or so without much trouble, but running the query for a hundred or more might be quite difficult.
  2. Which wikis, specifically? If you look at meta:Wikimedia wikis, there are a lot of tiny ones like the Wikimania wikis, team and committee specific wikis, closed and completely inactive projects, and so on. Do you have a list of the ones you're interested in? Or at least a set of specific criteria?

There's also the issue of language with small wikis, but I imagine you've thought more about that than I have.

I am hoping that we can then somehow cross reference this data to identify which of those users have been inactive across all Wikis...

I don't think that will be possible. I'd be able to give you a list saying "users A, B, and C are departed from enwiki and users D, E, and F are departed from frwiki", but that won't tell you that user D is still happily editing at enwiki. I wish I could give you a global list but sadly the production databases are set up to run the site efficiently, not to make it easy to gather global statistics.

Also, how many email addresses do you want? Is it just as many as we can find (I don't have any how many that will be), or do you have a specific quota? If so and the results are over the quota, should I randomly select from the available choices or just raise the edit count threshold until the number's low enough?

All 16? All 160? What? We have over 800 wikis...

Ok, can we start with en.wp, to get an idea for numbers, as it's the largest project anyway?
We have no set number cap in mind, as we have no idea what those may be in the first place. If the number is massive *like 10,000 users then we probably want to set some limit. But if it's more like a few hundreds, we may use all. Does that make sense?

@Kalliope, sounds good. I'll do a test run on enwiki and get you the number returned if set the threshold at 100 edits, and if we set it at 250. I should have that by tomorrow.

Just started the test runs. Using the following code:

SET @START = "20130601000000";
SET @TODAY = "20151021000000";

SELECT COUNT( * )

FROM user

INNER JOIN (
	SELECT 
		rev_user,
		MAX( rev_id ),
		rev_timestamp
	FROM revision
	GROUP BY rev_user
	HAVING 
		rev_timestamp >= @START AND
		rev_timestamp < DATE_FORMAT( DATE_SUB( @TODAY, INTERVAL 180 DAY), "%Y%m%d%H%i%S")
	) last_edits
ON user_id = rev_user

LEFT JOIN ipblocks
ON user_id = ipb_id

LEFT JOIN user_groups
ON user_id = ug_user AND ug_group = "bot"

WHERE
	-- remove currently blocked users
	ipb_id IS NULL AND
	-- remove bots
	ug_group IS NULL AND
	-- remove users without email addresses
	user_email IS NOT NULL AND
	user_editcount > {{threshold}}
;

@Kalliope, here are the results.

Using 100 edits as the cutoff, we have about 17,000 departed editors:

+------------+
| COUNT( * ) |
+------------+
|      16950 |
+------------+
1 row in set (9 min 42.00 sec)

Using 250 edits, we have about 6,000:

+------------+
| COUNT( * ) |
+------------+
|       6302 |
+------------+
1 row in set (9 min 16.37 sec)

What threshold would you like to use? It can be a third option, of course. Also, both queries took about 10 minutes to run, so there's no reason I couldn't run the query (separately) for 100 wikis or more. I'm not sure querying lots of small wikis will give you a great deal of useful data, so I don't necessarily recommend that. But it's definitely an option.

Also, have you had a chance to think about the format you need the data in? If you don't have a preference, my default option is a TSV file, which you could easily convert to CSV or a spreadsheet file. Also, do you just want email addresses, or do you also want user names, last edit dates, wikis, or other data?

Hey Neil,

This is fantastic.
Ok, so, here's the slightly ammended requirements:

Please run the query for over-500 edits (as the over-100 and over-250 are giving pretty large numbers of results)

For the time being, you can run it for the following 8 wikis with a cap for each, set as follows:

  • en.wp : 400 email addresses
  • es.wp : 100 email addresses
  • de.wp : 100 email addresses
  • zh.wp : 100 email addresses
  • ru wikitionary : 100 email addresses
  • fr wiktionary : 100 email addresses
  • ar.wp : 50 email addresses
  • pt.wp : 50 email addresses

I'm flexible on the format - if we can get those addresses in any form of a list (CSV, excel, etc.) it'd be great.

@Kalliope, would you like those to be randomly selected for the over-500 departed editors, or selected starting with the highest edit counts and going downward? The second strategy would be a good bit easier.

In T116078#1744595, @Neil_P._Quinn_WMF wrote:
	-- remove users without email addresses
	user_email IS NOT NULL AND

You should also make sure the email address is confirmed by making sure user_email_authenticated (doc) isn't null.

Also, would it be possible for me to upload the addresses into Qualtrics directly? I think that would remove the need for a lot of precautions necessary before transferring the data.

@Neil
,
You can start with the highest edit counts and go downward.
I don't see why it should not be possible for you to upload them into Qualtrics yourself. Assuming you have an account there, I can share our survey with you, so that you can perform the action.

@Kalliope, can I just create a free Qualtrics account, or do I have to do something special so it's associated with the WMF?

Okay, I'm starting the runs now. Using this code (customized for each wiki):

mysql -h analytics-store.eqiad.wmnet -u research -e "
SELECT
	-- Match fields to Qualtrics format 
	user_name AS FirstName,
	NULL AS LastName, 
	user_email AS Email, 
	{{language}} AS Language

FROM {{db}}.user

INNER JOIN (
	SELECT 
		rev_user,
		MAX( rev_id ),
		rev_timestamp
	FROM {{db}}.revision
	GROUP BY rev_user
	HAVING 
      -- Limit to users whose last edit was after 1 June 2013 but before 26 April 2015
      rev_timestamp >= "20130601000000" AND
		rev_timestamp <  "20150426000000"
	) last_edits
ON user_id = rev_user

LEFT JOIN {{db}}.ipblocks
ON user_id = ipb_id

LEFT JOIN {{db}}.user_groups
ON user_id = ug_user AND ug_group = 'bot'

WHERE
	-- remove currently blocked users
	ipb_id IS NULL AND
	-- remove bots
	ug_group IS NULL AND
	-- remove users without confirmed emails
	user_email IS NOT NULL AND
	user_email_authenticated IS NOT NULL AND	
	user_editcount >= 250

ORDER BY user_editcount DESC
LIMIT {{quota}}
;" > ~/{{db}}_{{quota}}_departed_users.tsv

Well, the good news is that the runs finished. The bad news is that the lists they gave are wrong because I made some errors writing the query. I'll keep working on it, possibly this weekend, otherwise next week.

Have a chat with Edward Galvez, for a Qualtrics account. He can set it up for you.

Mdennis-WMF set Security to None.
Mdennis-WMF subscribed.

@Neil_P._Quinn_WMF - after some discussion, we'd like to set the final target number at 2000 addresses. Does that work on your end?

@PEarleyWMF, that should work.

However, how do you want that broken down between wikis? Just double the numbers Kalliope gave earlier? That would add up to 2000, but a lot of the small wikis have fewer departed editors than that would require (for example, it would require 200 editors from frwiktionary, but the project only has 87 when we limit it to those with at least 100 edits on the project). Should I stick as close to those numbers as possible, filling it up to 2000 using the large wikis like enwiki when necessary?

Neil, that sounds like a plan - double where you can, and bulk it up with the larger wikis if needed. Thanks!

Updated query with two important bugs fixed:

mysql -h analytics-store.eqiad.wmnet -u research -e "
SELECT
	-- Match fields to Qualtrics format 
	user_name AS FirstName,
	'' AS LastName, 
	user_email AS Email, 
	{{language}} AS Language

FROM {{db}}.user

INNER JOIN (
	SELECT 
		rev_user,
		MAX( rev_id ),
		MAX( rev_timestamp )
	FROM {{db}}.revision
	GROUP BY rev_user
	HAVING 
      -- Limit to users whose last edit was after 1 June 2013 but before 26 April 2015
      MAX( rev_timestamp ) >= "20130601000000" AND
		MAX( rev_timestamp ) <  "20150426000000"
	) last_edits
ON user_id = rev_user

LEFT JOIN {{db}}.ipblocks
ON user_id = ipb_user

LEFT JOIN {{db}}.user_groups
ON user_id = ug_user AND ug_group = 'bot'

WHERE
	-- remove currently blocked users
	ipb_id IS NULL AND
	-- remove bots
	ug_group IS NULL AND
	-- remove users without confirmed emails
	user_email IS NOT NULL AND
   user_email_authenticated IS NOT NULL AND
   -- some users have emails which are blank but not null
   user_email != '' AND
	user_editcount >= 100

ORDER BY user_editcount DESC
LIMIT {{quota}}
;" > ~/departed/{{db}}.tsv

Results so far

ProjectResultsCompared to request
frwiktionary85-115
ruwiktionary45-155
ptwiki124+24
arwiki115+15
zhwiki249+49
eswiki243+43
dewiki250+50
enwiki889+89
total20000

@Kalliope, @PEarleyWMF, I've completed the runs and almost everything looks good (I don't have a Qualtrics account yet, but I believe @egalvezwmf is working on it—once that's done, I can upload).

The only remaining issue is that there are 9 vanished users in the enwiki results; they might have something to say about harassment, but vanishing probably (I'm not very familiar with the process) indicates someone who wants to leave the project behind completely. Should I remove them?

@Neil
This is fantastic, thank you so much!!
Since our email is not an invitation to become active in the projects again, rather tell us if they left because of harassment, I'm inclined to keep those vanished users in our list and contact them anyway.

nshahquinn-wmf claimed this task.

@Kalliope, @PEarleyWMF, I have finished uploading the panel to Qualtrics; they're in the panel named "2015-10 departed editors". I uploaded the user name as the first name, left the last name field blank, and set each user's language based on the language of the project they were active on.

I forgot to trim the enwiki list like I planned so there are 2100 users in the panel instead of 2000; I imagine that should be fine though. There were nine duplicates, which I consolidated, and I left the vanished users in like you asked.

From my point of view this seems done but feel free to reopen the task if you see any loose ends!

Hey Neil,

Thank you so much for this. I see that the email addresses uploaded in one large group. Can those be uploaded in lists per language wiki they were drawn from?

Each language is going to be receiving the invitation in the relevant language, so they need to be separated by language in groups/mailing lists if you like, so that the right invitation is sent to the right email addresses.. And it seems that they need to be 'shared' with us. You need to go to 'Panel' and share those lists with myself and Haitham. Is that doable?

@Kalliope, when I uploaded the data, I set the correct language code for each person in the panel. You could use this to make and send a translated survey, which sounds like it would make analysis of the data easier by automatically grouping responses to the same question together across languages (and also allow the user to choose a different language than the one we set if they want). However, if you just want to send separate surveys, let me know and I'll do it in separate panels.

Also, I can't see a option to share the panel anywhere. Maybe it's not enabled for me? In that case you would have to talk to Edward.

@Neil_P._Quinn_WMF, I added you and Kalli to the group Safety and Support. @Kalliope, You want to add the group as a collaborator to the survey, (using the collaborate icon), and then you should be able to see the panel in your account.

Thanks Edward!

@Peachey88, I'm afraid I didn't filter based on that category (nor am I sure how to do that efficiently). It's always sad to learn about Wikipedians who have died, and I don't want to inadvertently harm someone when we send this survey invitation out. But do you think that possibility exists here? It seems to me that emails to any deceased Wikipedians would largely go unread.

Also, if there are any deceased editors in the sample (which is a possibility although in my opinion not a major one), they almost certainly wouldn't be caught by the category anyway.

@egalvezwmf I have added Safety and Support Group as a collaborator to the survey and have activated the later. But I still seem to be missing something, as the system asks me to add a panel (I expected that I'd see one there already).

nshahquinn-wmf raised the priority of this task from High to Needs Triage.Mar 29 2018, 9:07 AM
nshahquinn-wmf moved this task from Neil's in progress to Done on the Contributors-Analysis board.