Page MenuHomePhabricator

Generate report of Special:EmailUser usage
Closed, ResolvedPublic5 Estimated Story Points

Description

We want to know how frequently Special:EmailUser is used. This information is recorded in the checkuser database (scrubbed after 90 days), as well as in the echo notification database in x1 (This gets scrubbed every 5000 notifications or something).

  • How many emails are sent per day?
  • How many unique users are sending emails?
  • Less important: How many unique users are the recipient of emails?

It'd be nice to know this for a minimum of one week, but we will have to be flexible with the data we can access.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 14 2017, 8:53 PM

(I couldn't find anything on EventLogging or the database.)

This information is recorded in the checkuser database (scrubbed after 90 days), as well as in the echo notification database in x1 (This gets scrubbed every 5000 notifications or something). The information may also be in the postfix logs, not sure.

@Bawolff Oh fantastic! Thank you for the direction. :)

In addition, once T67478 is solved, you'll probably have a public Graphite metric tracking the number of attempted emailuser actions, and a private Logstash channel with details of aborted user actions (exceeding the rate limit).

TBolliger renamed this task from Measure Special:EmailUser usage to Generate report of Special:EmailUser usage.Aug 16 2017, 9:33 PM
TBolliger updated the task description. (Show Details)
TBolliger set the point value for this task to 5.Aug 17 2017, 7:28 PM

@TBolliger Are you interested in English Wikipedia or some other wiki?

dbarratt claimed this task.Aug 17 2017, 8:24 PM

@Bawolff It looks like I have checkuser permission, but how would I download a snapshot of the data? (or query it directly?)

Yes please — ENWP. Weekly would be preferable. Here's an example report:

MonthDayEmails sentTotal users sendingTotal users receiving
January1999999999
January2999999999
January3999999999
January4999999999

...

I don't think you can take the checkuser permission like that to query this data (besides, I don't think it's possible to do this without huge amounts of queries) because it wouldn't align with our privacy policy.

I don't think you can take the checkuser permission like that to query this data

Do you know what permission I would need?

(besides, I don't think it's possible to do this without huge amounts of queries)

From @TBolliger's examples, they look like simple COUNT queries with a date range.

because it wouldn't align with our privacy policy.

Can you be more specific?

The policy says:

We may give volunteer developers and researchers access to systems that contain your information to allow them to protect, develop, and contribute to the Wikimedia Sites.
We may also share non-personal or aggregated information with third parties interested in studying the Wikimedia Sites.
When we share information with third parties for these purposes, we put reasonable technical and contractual protections in place to protect your information consistent with this Policy.

I don't think you can take the checkuser permission like that to query this data

Do you know what permission I would need?

Probably you have to do a manual lookup in the database for that, but I'm not even sure if the recieving users are logged there at all; at least they aren't exposed to the interface at all. Checkuser can only tell that you send an email and when, not to whom and not with which content. (Actually, I meant this more in a "You're allowed" than a "You're able" way…)

(besides, I don't think it's possible to do this without huge amounts of queries)

From @TBolliger's examples, they look like simple COUNT queries with a date range.

Yeah, but you can only do this in the database, not in the onwiki interface.

because it wouldn't align with our privacy policy.

Can you be more specific?

The policy says:

We may give volunteer developers and researchers access to systems that contain your information to allow them to protect, develop, and contribute to the Wikimedia Sites.
We may also share non-personal or aggregated information with third parties interested in studying the Wikimedia Sites.
When we share information with third parties for these purposes, we put reasonable technical and contractual protections in place to protect your information consistent with this Policy.

I wasn't aware of this, but what I meant is this: You shouldn't query IPs of all users in enwiki just du get their email data. This is the thing that checkuser would do and it's not needed at all.

I just think checkuser is a completely wrong approach to what you're trying to do.

Huji added a subscriber: Huji.Aug 18 2017, 1:34 PM

Probably you have to do a manual lookup in the database for that, but I'm not even sure if the recieving users are logged there at all; at least they aren't exposed to the interface at all. Checkuser can only tell that you send an email and when, not to whom and not with which content.

Technically, you are correct except for the fact that "to whom" information is stored in hashed form (so you see that User:X sent an email to 7710ee82cefdbca90b7b1e7859e4417a but you won't know who 7710ee82cefdbca90b7b1e7859e4417a is. The hash is always the same for that same user, so if you see 10 records showing emails sent to 7710ee82cefdbca90b7b1e7859e4417a, it means that particular user was sent 10 emails.

Probably you have to do a manual lookup in the database for that, but I'm not even sure if the recieving users are logged there at all; at least they aren't exposed to the interface at all. Checkuser can only tell that you send an email and when, not to whom and not with which content. (Actually, I meant this more in a "You're allowed" than a "You're able" way…)

I figured as much. :) I've never seen the CheckUser database, so I have no idea what's in there other than what was mentioned in T170717#3440794

Yeah, but you can only do this in the database, not in the onwiki interface.

I just think checkuser is a completely wrong approach to what you're trying to do.

Yeah I wasn't talking about Special:CheckUser, I more meant the "checkuser database" (as mentioned in T170717#3440794)

Technically, you are correct except for the fact that "to whom" information is stored in hashed form (so you see that User:X sent an email to 7710ee82cefdbca90b7b1e7859e4417a but you won't know who 7710ee82cefdbca90b7b1e7859e4417a is. The hash is always the same for that same user, so if you see 10 records showing emails sent to 7710ee82cefdbca90b7b1e7859e4417a, it means that particular user was sent 10 emails.

Which should be perfect for what @TBolliger is looking for. :)

Yes, we just want total numbers of usage — not anything personally identifiable.

@kaldari Do you know how I would get a snapshot of the "checkuser" database? Or is there a place where it can be queried directly?

kaldari added a comment.EditedAug 21 2017, 7:58 PM

@dbarratt: It's unlikely you could get a snapshot as the checkuser data is sensitive. You can query it directly from terbium (https://wikitech.wikimedia.org/wiki/Terbium).

@kaldari Fantastic! Thanks!

FWIW, I have no idea what "checkuser database" refers to. I think they mean the cu_changes table that exists in each wiki database. Any time an email action is performed, it should record an entry in that table with the cuc_actiontext field set to "sent an email to user XXXX" (on English Wikipedia at least). You can see some example data, by running:

select * from cu_changes where cuc_actiontext LIKE 'sent an email to user%' LIMIT 5;

from the enwiki database via terbium.

dbarratt changed the task status from Open to Stalled.Aug 22 2017, 2:44 PM

@herron it looks like I have access to the server and database, but there is no cu_changes table visable to me (and if it helps, there are no passwords in the user table either)

In T173779 shell access and sudo rights to run as the apache and www-data users were provisioned. It sounds like additional db permissions are needed (or the query needs to be run using different creds) but I'm not sure off hand what those are. Perhaps someone with working access to these tables could help clarify what is needed?

dbarratt added a subscriber: bd808.Aug 29 2017, 4:44 PM

@bd808 helped me, I just needed to run sql enwiki rather than specifying a host.

@TBolliger This is from enwiki. Note that the first day in the list as well as the last day in the list will be incomplete.

  • How many emails are sent per day?
> SELECT SUBSTR(cuc_timestamp, 1, 8) AS `date`, COUNT(*) AS num_sent FROM cu_changes WHERE cuc_actiontext LIKE 'sent an email to user%' AND cuc_timestamp GROUP BY `date`;
+----------+----------+
| date     | num_sent |
+----------+----------+
| 20170531 |       10 |
| 20170601 |       56 |
| 20170602 |       53 |
| 20170603 |       29 |
| 20170604 |       36 |
| 20170605 |       39 |
| 20170606 |       49 |
| 20170607 |       34 |
| 20170608 |       36 |
| 20170609 |       34 |
| 20170610 |       27 |
| 20170611 |       44 |
| 20170612 |       47 |
| 20170613 |       28 |
| 20170614 |       37 |
| 20170615 |       27 |
| 20170616 |       35 |
| 20170617 |       24 |
| 20170618 |       24 |
| 20170619 |       36 |
| 20170620 |       38 |
| 20170621 |       23 |
| 20170622 |       57 |
| 20170623 |       52 |
| 20170624 |       58 |
| 20170625 |       29 |
| 20170626 |       33 |
| 20170627 |       39 |
| 20170628 |       57 |
| 20170629 |       40 |
| 20170630 |       40 |
| 20170701 |       26 |
| 20170702 |       50 |
| 20170703 |       23 |
| 20170704 |       37 |
| 20170705 |       41 |
| 20170706 |       61 |
| 20170707 |       67 |
| 20170708 |       48 |
| 20170709 |       35 |
| 20170710 |       57 |
| 20170711 |       43 |
| 20170712 |       47 |
| 20170713 |       64 |
| 20170714 |       55 |
| 20170715 |       33 |
| 20170716 |       33 |
| 20170717 |       32 |
| 20170718 |       58 |
| 20170719 |       35 |
| 20170720 |       26 |
| 20170721 |       61 |
| 20170722 |       25 |
| 20170723 |       27 |
| 20170724 |       47 |
| 20170725 |       59 |
| 20170726 |       45 |
| 20170727 |       33 |
| 20170728 |       44 |
| 20170729 |       34 |
| 20170730 |       41 |
| 20170731 |       58 |
| 20170801 |       55 |
| 20170802 |       33 |
| 20170803 |       43 |
| 20170804 |       35 |
| 20170805 |       38 |
| 20170806 |       30 |
| 20170807 |       31 |
| 20170808 |       37 |
| 20170809 |       42 |
| 20170810 |       48 |
| 20170811 |       45 |
| 20170812 |       31 |
| 20170813 |       38 |
| 20170814 |       50 |
| 20170815 |       42 |
| 20170816 |       43 |
| 20170817 |       38 |
| 20170818 |       31 |
| 20170819 |       33 |
| 20170820 |       33 |
| 20170821 |       37 |
| 20170822 |       36 |
| 20170823 |       29 |
| 20170824 |       52 |
| 20170825 |       56 |
| 20170826 |       33 |
| 20170827 |       31 |
| 20170828 |       48 |
| 20170829 |       22 |
+----------+----------+
  • How many unique users are sending emails?
> SELECT SUBSTR(cuc_timestamp, 1, 8) AS `date`, COUNT(DISTINCT cuc_user) AS unique_senders FROM cu_changes WHERE cuc_actiontext LIKE 'sent an email to user%' AND cuc_timestamp GROUP BY `date`;
+----------+----------------+
| date     | unique_senders |
+----------+----------------+
| 20170531 |              7 |
| 20170601 |             31 |
| 20170602 |             25 |
| 20170603 |             24 |
| 20170604 |             28 |
| 20170605 |             33 |
| 20170606 |             33 |
| 20170607 |             25 |
| 20170608 |             24 |
| 20170609 |             23 |
| 20170610 |             22 |
| 20170611 |             25 |
| 20170612 |             34 |
| 20170613 |             27 |
| 20170614 |             36 |
| 20170615 |             21 |
| 20170616 |             28 |
| 20170617 |             15 |
| 20170618 |             14 |
| 20170619 |             24 |
| 20170620 |             27 |
| 20170621 |             20 |
| 20170622 |             36 |
| 20170623 |             33 |
| 20170624 |             35 |
| 20170625 |             20 |
| 20170626 |             22 |
| 20170627 |             29 |
| 20170628 |             34 |
| 20170629 |             28 |
| 20170630 |             27 |
| 20170701 |             25 |
| 20170702 |             29 |
| 20170703 |             17 |
| 20170704 |             24 |
| 20170705 |             30 |
| 20170706 |             41 |
| 20170707 |             41 |
| 20170708 |             30 |
| 20170709 |             28 |
| 20170710 |             41 |
| 20170711 |             34 |
| 20170712 |             36 |
| 20170713 |             45 |
| 20170714 |             34 |
| 20170715 |             27 |
| 20170716 |             24 |
| 20170717 |             22 |
| 20170718 |             39 |
| 20170719 |             28 |
| 20170720 |             24 |
| 20170721 |             30 |
| 20170722 |             18 |
| 20170723 |             18 |
| 20170724 |             32 |
| 20170725 |             37 |
| 20170726 |             26 |
| 20170727 |             29 |
| 20170728 |             36 |
| 20170729 |             24 |
| 20170730 |             28 |
| 20170731 |             38 |
| 20170801 |             42 |
| 20170802 |             32 |
| 20170803 |             33 |
| 20170804 |             27 |
| 20170805 |             21 |
| 20170806 |             23 |
| 20170807 |             23 |
| 20170808 |             27 |
| 20170809 |             32 |
| 20170810 |             35 |
| 20170811 |             32 |
| 20170812 |             27 |
| 20170813 |             30 |
| 20170814 |             41 |
| 20170815 |             33 |
| 20170816 |             32 |
| 20170817 |             30 |
| 20170818 |             26 |
| 20170819 |             26 |
| 20170820 |             21 |
| 20170821 |             28 |
| 20170822 |             27 |
| 20170823 |             26 |
| 20170824 |             34 |
| 20170825 |             39 |
| 20170826 |             28 |
| 20170827 |             25 |
| 20170828 |             34 |
| 20170829 |             20 |
+----------+----------------+
  • How many unique users are the recipient of emails?
SELECT SUBSTR(cuc_timestamp, 1, 8) AS `date`, COUNT(DISTINCT cuc_actiontext) AS unique_recievers FROM cu_changes WHERE cuc_actiontext LIKE 'sent an email to user%' AND cuc_timestamp GROUP BY `date`; 
+----------+------------------+
| date     | unique_recievers |
+----------+------------------+
| 20170531 |                8 |
| 20170601 |               41 |
| 20170602 |               46 |
| 20170603 |               25 |
| 20170604 |               32 |
| 20170605 |               35 |
| 20170606 |               39 |
| 20170607 |               31 |
| 20170608 |               34 |
| 20170609 |               29 |
| 20170610 |               24 |
| 20170611 |               33 |
| 20170612 |               37 |
| 20170613 |               26 |
| 20170614 |               28 |
| 20170615 |               24 |
| 20170616 |               29 |
| 20170617 |               22 |
| 20170618 |               22 |
| 20170619 |               29 |
| 20170620 |               36 |
| 20170621 |               19 |
| 20170622 |               49 |
| 20170623 |               46 |
| 20170624 |               50 |
| 20170625 |               28 |
| 20170626 |               30 |
| 20170627 |               33 |
| 20170628 |               51 |
| 20170629 |               34 |
| 20170630 |               32 |
| 20170701 |               25 |
| 20170702 |               44 |
| 20170703 |               22 |
| 20170704 |               36 |
| 20170705 |               32 |
| 20170706 |               54 |
| 20170707 |               57 |
| 20170708 |               42 |
| 20170709 |               33 |
| 20170710 |               53 |
| 20170711 |               39 |
| 20170712 |               43 |
| 20170713 |               58 |
| 20170714 |               49 |
| 20170715 |               29 |
| 20170716 |               28 |
| 20170717 |               29 |
| 20170718 |               51 |
| 20170719 |               26 |
| 20170720 |               25 |
| 20170721 |               47 |
| 20170722 |               24 |
| 20170723 |               26 |
| 20170724 |               41 |
| 20170725 |               47 |
| 20170726 |               40 |
| 20170727 |               30 |
| 20170728 |               36 |
| 20170729 |               25 |
| 20170730 |               30 |
| 20170731 |               44 |
| 20170801 |               46 |
| 20170802 |               25 |
| 20170803 |               35 |
| 20170804 |               31 |
| 20170805 |               35 |
| 20170806 |               29 |
| 20170807 |               29 |
| 20170808 |               34 |
| 20170809 |               38 |
| 20170810 |               47 |
| 20170811 |               43 |
| 20170812 |               28 |
| 20170813 |               32 |
| 20170814 |               42 |
| 20170815 |               35 |
| 20170816 |               34 |
| 20170817 |               32 |
| 20170818 |               31 |
| 20170819 |               30 |
| 20170820 |               32 |
| 20170821 |               30 |
| 20170822 |               31 |
| 20170823 |               22 |
| 20170824 |               44 |
| 20170825 |               49 |
| 20170826 |               30 |
| 20170827 |               29 |
| 20170828 |               35 |
| 20170829 |               22 |
+----------+------------------+
bd808 removed a subscriber: bd808.Aug 29 2017, 5:35 PM
TBolliger closed this task as Resolved.Aug 29 2017, 6:39 PM
TBolliger moved this task from Code Review to Done on the Anti-Harassment (AHT Sprint 3) board.

Fascinating — a lot less actively used than I would have expected.

Thank you David!