Help panel: Extend EditorJourney data capturing to 14 days for help desk pages
Closed, ResolvedPublic

Description

One of the questions regarding the Help Panel is whether users who asked a question on the help desk return to view it. To be able to answer that, we are proposing to alter the rules for the EditorJourney schema so that it captures views of the help desk page (and just that) for up to 14 days after account registration. This task tracks that change.

Outstanding issues:

  • Figure out why Elena's user's visit to Wikipedia:Help_Desk on beta wasn't tracked
  • Implement a solution for tracking the primary kowiki help desk page in addition to its monthly archive pages.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 11 2018, 11:03 PM
MMiller_WMF added a subscriber: MMiller_WMF.

Jazmin will go through the Instrumentation DACI for this small change as she does the process for T209982.

kostajh added a comment. Edited · Dec 14 2018, 5:00 PM

@nettrom_WMF @MMiller_WMF some questions for you.

The key that we use to look up the hashing salt expires after 24 hours. So, let's say a user makes a few visits to the help desk within the first 24 hours of registering. We will currently have the page ID logged as e.g. ab6f8c9f12b752c1781d. If the user comes back on day 5, a new hash salt is generated for them, so the page ID for the help desk would be logged as e.g. 7484b7d2b3a1ae51a3b0. Further visits within the following 24 hours would record that same value, but a visit on day 8, for example, would log a different value for the same page ID.

I see three options here:

  1. Keep the same obfuscation logic with the 24-hour TTL for the key used to find the hash salt. This means that we'll know if a user is visiting a page or subpage of the help desk, but we won't be able to know if they were looking at the help desk or a subpage. (Side note: in any scenario we won't know if they viewed their individual question, since the fragment in the URL, e.g. wiki/Help_Desk#Help_panel_question_on_Main_Page_7, is interpreted in the browser and is not sent to the server.)
  2. Do not obfuscate details of visits to Help Desk and its subpages. We could do this either for any time period, or only after the initial 24-hour period has elapsed. So, option (A) is to obfuscate help desk and subpage visits for the first 24 hours and leave them unobfuscated after that. Option (B) is to never obfuscate those visits. Option B makes more sense to me. Note that unless we increase the TTL of the key used to find the hashing salt (see below), the obfuscated values for the query parameters return, returnto and search will vary across 24-hour periods.
  3. Increase the TTL of the key used to look up the hashing salt to 14 days. That way, a visit to the help desk on hour 5 of day 1 would be recorded as ab6f8c9f12b752c1781d and on day 5 it would still be recorded as ab6f8c9f12b752c1781d. We'd need to get another OK because of the security/privacy implications, but it seems reasonable. The downside is that we will not be able to know if the user was on the help desk main page or a subpage.

We could also consider a combination of options 2 and 3: increase the TTL of the hash salt to 14 days so we have consistent hashing of query parameters across the entire 14 day period, and treat help desk and subpages as non-sensitive content.
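
To make the effect of the salt TTL concrete, here is a minimal Python sketch. It is not the extension's code: the `SaltStore` class, the SHA-256 hash, the truncation to 20 hex characters (chosen only to match the length of the example values above), and the page ID "12345" are all assumptions for illustration. It also assumes the TTL counts from when the salt was created.

```python
import hashlib
import secrets

DAY = 24 * 60 * 60  # seconds


class SaltStore:
    """Toy stand-in for a cache that keeps one salt per user with a TTL (hypothetical)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._salts = {}  # user_id -> (salt, expiry time in seconds)

    def salt_for(self, user_id, now):
        salt, expires_at = self._salts.get(user_id, (None, 0))
        if salt is None or now >= expires_at:
            # Missing or expired: a fresh salt means earlier hashes no longer match.
            salt = secrets.token_hex(16)
            self._salts[user_id] = (salt, now + self.ttl_seconds)
        return salt


def obfuscate(value, salt):
    """Salted hash of a value; algorithm and length are assumptions, not the real scheme."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:20]


def logged_values(ttl_seconds, visit_days, user_id=1, page_id="12345"):
    """Return what would be logged for the same page on each visit day."""
    store = SaltStore(ttl_seconds)
    return [obfuscate(page_id, store.salt_for(user_id, day * DAY)) for day in visit_days]


# Visits on day 0, day 0.5, day 5 and day 8 after registration.
short_ttl = logged_values(1 * DAY, [0, 0.5, 5, 8])
long_ttl = logged_values(14 * DAY, [0, 0.5, 5, 8])

print(len(set(short_ttl)))  # three distinct values: days 0/0.5 share one, days 5 and 8 differ
print(len(set(long_ttl)))   # one value across the whole 14-day period
```

With the 24-hour TTL the same page cannot be correlated across days 0, 5 and 8; with the 14-day TTL (option 3) it can, which is the consistency the combined option relies on.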

Change 479722 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/WikimediaEvents@master] Editor Journey: Track visits to help desk (sub)page for 14 days

https://gerrit.wikimedia.org/r/479722

@kostajh : as far as I can see from looking up the Helpdesk pages in the data from the EditorJourney schema, we don't obfuscate those as they're in the Wikipedia namespace on both Czech and Korean? Or did I miss something?

Oops. You're correct, please disregard my comment. Thanks for looking through it.

Change 479722 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Editor Journey: Track visits to help desk (sub)page for 14 days

https://gerrit.wikimedia.org/r/479722

@Morten Warncke-Wang I saw a small cluster (17) of records with event_user_id=0:

MariaDB [log]>  SELECT event_action, event_page_title, event_title, event_is_mobile, timestamp  FROM EditorJourney_18504997 WHERE event_user_id=0;
+--------------+------------------+-------------+-----------------+----------------+
| event_action | event_page_title | event_title | event_is_mobile | timestamp      |
+--------------+------------------+-------------+-----------------+----------------+
| view         | Log out          | UserLogout  |               0 | 20181114211324 |
| view         | Log out          | UserLogout  |               0 | 20181114211824 |
| view         | Log out          | UserLogout  |               0 | 20181114212029 |
| view         | Log out          | UserLogout  |               0 | 20181114212144 |
| view         | Log out          | UserLogout  |               0 | 20181114213443 |
| view         | Log out          | UserLogout  |               0 | 20181114213625 |
| view         | Log out          | UserLogout  |               0 | 20181114213759 |
| view         | Log out          | UserLogout  |               0 | 20181114213914 |
| view         | Log out          | UserLogout  |               0 | 20181114214200 |
| view         | Log out          | UserLogout  |               0 | 20181115025839 |
| view         | Log out          | UserLogout  |               0 | 20181115050202 |
| view         | Log out          | UserLogout  |               1 | 20181115123130 |
| view         | Log out          | UserLogout  |               0 | 20181115123605 |
| view         | Log out          | UserLogout  |               0 | 20181115123913 |
| view         | Log out          | UserLogout  |               0 | 20181115124519 |
| view         | Log out          | UserLogout  |               0 | 20181115124943 |
| view         | Log out          | UserLogout  |               1 | 20181115190824 |
+--------------+------------------+-------------+-----------------+----------------+
17 rows in set (0.02 sec)

It'd be interesting to check if something like that is in production.

kostajh added a comment. Edited · Jan 8 2019, 5:41 PM

It looks like those events are from before this fix rEWMV11cb0168408e41910d942075b0094bf7cff730f9

Right. They are all from 11/14 and 11/15.

I ran a couple of queries of the EditorJourney data in the Data Lake, and there are no events there with user ID 0 that aren't log out events from before the fix was put in place. Neat to be able to verify that there wasn't a quality issue there, thanks for checking that @Etonkovidova ! And thanks @kostajh for identifying the fix!

@kostajh please review the following:

I created a test user ET109 with registration date:

MariaDB [enwiki]>  select user_id, user_name, user_registration  from user where user_id= 15846;
+---------+-----------+-------------------+
| user_id | user_name | user_registration |
+---------+-----------+-------------------+
|   15846 | ET109     | 20190103231757    |
+---------+-----------+-------------------+
1 row in set (0.00 sec)

ET109 has the following records in EditorJourney:

MariaDB [log]> select timestamp, event_page_title, event_action from EditorJourney_18504997 where event_user_id =15846;
+----------------+----------------------------------------------------------------------------+
| timestamp      | event_page_title                                                           |
+----------------+----------------------------------------------------------------------------+
| 20190103231758 | Welcome, ET109!                                                            |
| 20190103231759 | Welcome, ET109!                                                            |
| 20190103231925 | Thanks! Your responses have been saved.                                    |
| 20190104174814 | Recent changes                                                             |
| 20190104174902 | Wikipedia:Help desk                                                        |
| 20190104174942 | Revision history of Wikipedia:Help desk                                    |
| 20190104175152 | Recent changes                                                             |
| 20190104202126 | Wikipedia:Help desk                                                        |
| 20190104205810 |                                                                            |
| 20190104205811 | e5b7388db26ad2ae1b79917a09b1993a76e55c98af55be792470b4dfbf1dc8cfa ...      |
| 20190104205816 | Preferences                                                                |
| 20190104205825 | Preferences                                                                |
| 20190104205826 | Preferences                                                                |
| 20190104205834 |                                                                            |
| 20190104205834 | e5b7388db26ad2ae1b79917a09b1993a76e55c98af55be792470b4dfbf1dc8cfa ...      |
| 20190104205840 | Editing e5b7388db26ad2ae1b79917a09b1993a76e55c98af55be792470b4dfbf1dc8 ... |
| 20190104210001 | Wikipedia:Help desk                                                        |
+----------------+----------------------------------------------------------------------------+

However, ET109 kept visiting Wikipedia:Help desk after 20190104. The page's 'View history' shows edits by the user, but EditorJourney_18504997 has no corresponding records.

kostajh removed kostajh as the assignee of this task. · Jan 11 2019, 3:01 PM
kostajh moved this task from QA to In Progress on the Growth-Team (Current Sprint) board.

In addition to this bug, we need to also ensure that visits to the kowiki help desk (and not just monthly archives) are tracked. The current code will only look at the current month and its subpages.

kostajh updated the task description. · Jan 11 2019, 3:04 PM
Restricted Application added a subscriber: revi. · Jan 11 2019, 3:04 PM

Looking at PageViews::isHelpDeskVisit(), we could check subpage relationship in all directions between the visited page and the configured help desk.
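
As a rough illustration of what "subpage relationship in all directions" could mean, here is a minimal Python sketch. It is not the code of PageViews::isHelpDeskVisit(); the function names and the page titles in the usage lines are hypothetical, and it compares plain title strings rather than real title objects.

```python
def normalize(title):
    """Treat underscores and spaces as equivalent, as MediaWiki titles do."""
    return title.replace("_", " ").strip()


def is_subpage_of(child, parent):
    """True if `child` sits somewhere below `parent` in the '/' subpage hierarchy."""
    return normalize(child).startswith(normalize(parent) + "/")


def is_help_desk_visit(visited, configured):
    """Match the configured help desk itself, its subpages, and its parent pages."""
    visited, configured = normalize(visited), normalize(configured)
    return (
        visited == configured
        or is_subpage_of(visited, configured)   # e.g. a thread under the monthly page
        or is_subpage_of(configured, visited)   # e.g. the primary page above the monthly page
    )


# Hypothetical configuration: the help desk is set to a monthly page.
configured = "Project:Help desk/2019-01"
print(is_help_desk_visit("Project:Help_desk", configured))           # parent page
print(is_help_desk_visit("Project:Help desk/2019-01/Q1", configured))  # subpage
print(is_help_desk_visit("Project:Village pump", configured))        # unrelated page
```

Note that in this sketch a sibling page such as another month's archive would not match; only the configured page, its descendants, and its ancestors do.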

Change 484299 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/WikimediaEvents@master] GrowthExperiments: Log visits to parent page of help desk

https://gerrit.wikimedia.org/r/484299

Change 484299 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] GrowthExperiments: Log visits to parent page of help desk

https://gerrit.wikimedia.org/r/484299

Etonkovidova closed this task as Resolved. · Jan 18 2019, 12:57 AM
Etonkovidova claimed this task.
Etonkovidova updated the task description.
Etonkovidova moved this task from QA to Epics In Progress on the Growth-Team (Current Sprint) board.

After another restart of betalabs mysql (thx, @nettrom_WMF ) the EditorJourney table got updated.
(1) The user ET109 was created on Jan03/2019:

MariaDB [enwiki]> select user_id, user_name,user_registration from user where  user_name='ET109';
+---------+-----------+-------------------+
| user_id | user_name | user_registration |
+---------+-----------+-------------------+
|   15846 | ET109     | 20190103231757    |
+---------+-----------+-------------------+
1 row in set (0.00 sec)

The last visit to the help desk was recorded on Jan16/2019:

MariaDB [log]> select event_page_title,max(timestamp), event_user_id  from EditorJourney_18504997 where event_user_id=15846  and  event_page_title='Wikipedia:Help desk' ;
+---------------------+----------------+---------------+
| event_page_title    | max(timestamp) | event_user_id |
+---------------------+----------------+---------------+
| Wikipedia:Help desk | 20190116164644 |         15846 |
+---------------------+----------------+---------------+
1 row in set (0.00 sec)
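
For reference, the gap between the two timestamps above can be checked with a short Python sketch. The helper names are hypothetical, and treating the 14-day boundary as inclusive is an assumption rather than something confirmed by the extension's code.

```python
from datetime import datetime, timedelta

MW_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp layout used in the tables above


def parse_mw(ts):
    """Parse a YYYYMMDDHHMMSS timestamp string."""
    return datetime.strptime(ts, MW_FORMAT)


def within_window(registration, event, days=14):
    """True if the event falls within `days` after registration (inclusive, by assumption)."""
    elapsed = parse_mw(event) - parse_mw(registration)
    return timedelta(0) <= elapsed <= timedelta(days=days)


registered = "20190103231757"  # user_registration for ET109
last_visit = "20190116164644"  # max(timestamp) for Wikipedia:Help desk

elapsed = parse_mw(last_visit) - parse_mw(registered)
print(elapsed)                                # 12 days, 17:28:47
print(within_window(registered, last_visit))  # True
```

So the last recorded help desk visit is a little under 13 days after registration, well past the original 24-hour limit and inside the new 14-day window.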

(2) Regarding the following

Implement a solution for tracking the primary kowiki help desk page in addition to its monthly archive pages.

The data will be checked in production kowiki as a part of ongoing data analytical work.