
EditorJourney records HTML tags in page_title field
Closed, Resolved · Public

Description

Looks like the EditorJourney schema records HTML tags in the page_title field. For articles (pages in namespace 0), this appears to affect only articles whose title should be displayed in italics, where the page_title field becomes <i>[obfuscated page title string]</i>.

For pages in other namespaces, this appears to mainly affect user and user talk pages (namespaces 2 and 3). There, we find examples using both the font and span elements as well.

Currently this is unlikely to affect our data analysis, where the combination of namespace and title is used instead. In other words, this is definitely a "nice to have" on the prioritization scale.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 16 2019, 9:35 PM
JTannerWMF moved this task from Inbox to Upcoming Work on the Growth-Team board. · Jan 17 2019, 9:12 PM
kostajh removed kostajh as the assignee of this task. · Feb 4 2019, 9:32 PM
kostajh added a subscriber: kostajh.

<i>[obfuscated page title string]</i>

I assume you prefer to just have [obfuscated page title string] and not [obfuscated surrounding elements and page title string]. In other words, don't include the surrounding elements in the string that's going to be obfuscated, but purge them entirely. Is that right?

Currently this is unlikely to affect our data analysis,

Given that's the case, would it be reasonable to do nothing with this task?

I assume you prefer to just have [obfuscated page title string] and not [obfuscated surrounding elements and page title string]. In other words, don't include the surrounding elements in the string that's going to be obfuscated, but purge them entirely. Is that right?

Yes. The way I see it, for a non-obfuscated namespace, page_title should contain the full page title with HTML removed. If we then make it [obfuscated page title string] for obfuscated namespaces, it would be consistent.
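To make the intended behaviour concrete, here is a minimal Python sketch of "page title with HTML removed". It is only an illustration: the actual fix lives in the WikimediaEvents extension (PHP), and the function name `strip_html` is hypothetical. It keeps the text nodes and drops every tag, so an italicised title collapses to its plain form before any obfuscation is applied.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag (i, span, font, ...)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(display_title: str) -> str:
    """Return the title text with all HTML elements removed.

    Hypothetical helper for illustration; not the extension's code.
    """
    parser = _TextOnly()
    parser.feed(display_title)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_html("<i>Birdman</i> (film)"))
```

With this in place, an obfuscated namespace would hash the stripped string, and a non-obfuscated namespace would log it directly, which gives the consistency described above.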

Currently this is unlikely to affect our data analysis,

Given that's the case, would it be reasonable to do nothing with this task?

That's another "yes" from me. I opened this ticket primarily to document that this exists. Feel free to close it and prioritize other tasks. We can reopen it if this becomes an actual issue.

kostajh added a subscriber: MMiller_WMF.

That's another "yes" from me. I opened this ticket primarily to document that this exists. Feel free to close it and prioritize other tasks. We can reopen it if this becomes an actual issue.

Cool. @MMiller_WMF unless you disagree, I think we can close this and reconsider if EditorJourney has a new round of deployments in the future.

MMiller_WMF closed this task as Declined. · Feb 13 2019, 1:27 AM

Declining this task because we've decided it's not urgent.

kostajh reopened this task as Open. · Feb 26 2019, 9:16 PM
kostajh claimed this task.
kostajh moved this task from Needs PM Review to In Progress on the Growth-Team (Current Sprint) board.

Re-opening, as @nettrom_WMF noticed that this is impacting data analysis.

Change 493109 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/WikimediaEvents@master] EditorJourney: Remove HTML when obfuscating page title

https://gerrit.wikimedia.org/r/493109

Change 493109 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] EditorJourney: Remove HTML when obfuscating page title

https://gerrit.wikimedia.org/r/493109

The title retrieved from OutputPage::getPageTitle relates to the <h1> on the page (a.k.a. firstHeading) and, during page views, can be influenced by editors using the {{DISPLAYTITLE:}} parser magic word. Among other things, this can introduce styling and casing differences.

I don't know what the actual usage of this field is, but it seems like page_title is mostly a slightly less reliable version of title and action. Less reliable, in that you're more likely to miss certain privacy obfuscations, because HTML formatting, localisation messages, Skin hooks, and DISPLAYTITLE can all change how the title is rendered. It could be wiki-escaped, url-escaped, html-escaped (in more than one way), db-escaped (underscores not spaces), and any combination thereof. And then there's DISPLAYTITLE, which can also change casing, spacing, and underscores, and add extra HTML tags.

Presumably, for the privacy aspect, we'd have to detect and normalise all of these. It may be a bit late, but I suspect this field might be redundant with title and action together, assuming that would already cover things like "history", "edit" in a way that isn't subject to variance by wiki language, user language, skin, mobile/desktop, and per-wiki localisation overrides. Anyhow, if the phrasing of these is relevant to the study, then I suppose you'll want to normalise all the above in order to uphold the privacy intent.
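As a rough illustration of what "detect and normalise all of these" could involve, the Python sketch below folds several of the variants listed above (url-escaping, html-escaping, DISPLAYTITLE markup, underscores, casing) into one canonical form. The function name `normalise_title` and the order of the steps are assumptions made for this example, the regex-based tag removal is deliberately crude, and none of this is the code from the merged patch.

```python
import html
import re
from urllib.parse import unquote

# Crude tag matcher, good enough for a sketch; a real implementation
# would more likely use a proper HTML sanitiser.
_TAG_RE = re.compile(r"<[^>]*>")


def normalise_title(raw: str) -> str:
    """Fold common title encodings into one canonical, lower-case form."""
    text = unquote(raw)                        # url-escaped: %28 -> (
    text = html.unescape(text)                 # html-escaped: &lt; -> <
    text = _TAG_RE.sub("", text)               # drop DISPLAYTITLE markup
    text = text.replace("_", " ")              # db form: underscores -> spaces
    text = re.sub(r"\s+", " ", text).strip()   # collapse spacing differences
    return text.lower()                        # ignore casing differences


if __name__ == "__main__":
    for variant in ("<i>Birdman</i>_(film)", "Birdman_%28film%29"):
        print(normalise_title(variant))
```

The point of normalising before obfuscation is that all spellings of the same title end up with the same hash; without it, each variant would be obfuscated to a different value.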

Change 493119 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/WikimediaEvents@wmf/1.33.0-wmf.18] EditorJourney: Remove HTML when obfuscating page title

https://gerrit.wikimedia.org/r/493119

Change 493242 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/WikimediaEvents@master] (WIP) EditorJourney: Convert to lower case and decode chars

https://gerrit.wikimedia.org/r/493242

Change 493119 abandoned by Kosta Harlan:
EditorJourney: Remove HTML when obfuscating page title

Reason:
going with different approach

https://gerrit.wikimedia.org/r/493119

Change 493242 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] EditorJourney: Convert to lower case and decode chars

https://gerrit.wikimedia.org/r/493242

A few takeaways from the discussion between myself, @Catrope, @MMiller_WMF and @nettrom_WMF today:

  1. page_title is not as important as we thought when we first added it. The only thing we can't infer from other event properties is an editing conflict, which is relatively rare. If possible, we should set editingconflict in the permission_errors property
  2. Having reviewed the query property, we should add title and create to the list of parameters to hash
  3. Order of operations: we'll need to modify WikimediaEvents to send an empty string for page title, then update the schema to drop the property, then stop sending page_title altogether, then ask analytics to purge page_title.
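Takeaway 2 above can be sketched in Python as follows. This is a hedged illustration only: the keyed SHA-256 hash, the `secret` argument, and the names `SENSITIVE_PARAMS` and `redact_query` are assumptions for the example, and the thread does not say how WikimediaEvents actually hashes or redacts these parameters.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode

# Parameters whose values can reveal a page title (per takeaway 2).
SENSITIVE_PARAMS = {"title", "create"}


def redact_query(query: str, secret: bytes) -> str:
    """Replace the values of sensitive query parameters with a keyed hash.

    Assumption: HMAC-SHA256 with a private key; the real scheme may differ.
    """
    redacted = []
    for key, value in parse_qsl(query, keep_blank_values=True):
        if key in SENSITIVE_PARAMS:
            value = hmac.new(secret, value.encode("utf-8"),
                             hashlib.sha256).hexdigest()
        redacted.append((key, value))
    return urlencode(redacted)


if __name__ == "__main__":
    print(redact_query("title=Special:Block&action=view", b"example-key"))
```

Non-sensitive parameters such as action pass through unchanged, so the analysis can still tell what the user was doing without seeing which page it was.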
kostajh added a comment. · Edited · Mar 1 2019, 3:33 PM

@nettrom_WMF @MMiller_WMF I remember why we added page_title. It's for cases like this:

"event": {
  "user_id": 14,
  "page_title": "permission error",
  "title": "Block",
  "permission_errors": "",
  "namespace": -1,
  "request_method": "GET",
  "is_mobile": false,
  "path": "/index.php",
  "action": "view",
  "http_response_code": 200,
  "query": "title=Special:Block",
  "page_id": "0"
}

If you get access denied when viewing a page (e.g. a regular user tries to get to Special:Block), the HTTP response code is 200, there are no permission errors (because we only set the permission_errors property when action = edit), and if page_title were removed you'd have no way to tell that the user couldn't actually do anything on this page. While Special:Block may not be an especially relevant example to EditorJourney cohort users, maybe there are other similar pages where knowing that the user received a permission error is important.

As a hacky workaround for this use case I could check if page title is an exact match for the i18n string badaccess ("Permission error") and if so, append that to the permission errors property.
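A minimal Python sketch of that workaround, operating on an event shaped like the example above. Several details are assumptions for illustration: that permission_errors is a comma-separated string, that the title has already been lower-cased, and the names `BADACCESS_TITLE` and `flag_permission_error`. On a real wiki the comparison string would come from the localised badaccess message rather than a hard-coded English constant.

```python
# Assumed normalised form of the English "badaccess" message.
BADACCESS_TITLE = "permission error"


def flag_permission_error(event: dict) -> dict:
    """Return a copy of the event with 'badaccess' recorded when the page
    title is an exact match for the permission error heading."""
    updated = dict(event)
    title = updated.get("page_title", "").strip().lower()
    if title == BADACCESS_TITLE:
        # Assumption: permission_errors is a comma-separated string.
        errors = [e for e in updated.get("permission_errors", "").split(",") if e]
        if "badaccess" not in errors:
            errors.append("badaccess")
        updated["permission_errors"] = ",".join(errors)
    return updated


if __name__ == "__main__":
    event = {"page_title": "permission error", "title": "Block",
             "permission_errors": "", "action": "view"}
    print(flag_permission_error(event)["permission_errors"])
```

Once the error is carried in permission_errors, page_title is no longer needed to detect this case, which fits the plan to drop the property.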

Change 493713 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/WikimediaEvents@master] EditorJourney: Drop title, create, and page_title

https://gerrit.wikimedia.org/r/493713

Change 493713 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] EditorJourney: Redact title / create params, drop page_title

https://gerrit.wikimedia.org/r/493713

Change 494300 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/WikimediaEvents@wmf/1.33.0-wmf.19] EditorJourney: Redact title / create params, drop page_title

https://gerrit.wikimedia.org/r/494300

Change 494300 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.33.0-wmf.19] EditorJourney: Redact title / create params, drop page_title

https://gerrit.wikimedia.org/r/494300

Mentioned in SAL (#wikimedia-operations) [2019-03-05T00:44:51Z] <catrope@deploy1001> Synchronized php-1.33.0-wmf.19/extensions/WikimediaEvents/: Redact title/create params and drop page_title in EditorJourney schema (T213974) (duration: 00m 49s)

Cannot check the EventLogging data for cases where a user does not have permission to edit a page, due to T218370.

Etonkovidova added a subscriber: Morten-Haan. · Edited · Mar 20 2019, 2:57 PM

The EditorJourney schema cannot be checked client-side. I checked whether the other schemas, EditAttemptStep and HelpPanel, record formatting in the page title. That does not seem to be the case.

Previously there were some cases recorded e.g. event_page_title <i>Birdman</i> (film)
Ping @nettrom_WMF to confirm that such page titles are recorded correctly.

@Etonkovidova : Searched through the data from EditAttemptStep and HelpPanel in the Data Lake starting from 2019-01-01, and I didn't find any indications of HTML in the page_title field in either of those.

I also searched through the EditorJourney data since deployment of the fix, and didn't find an issue there either.

Etonkovidova closed this task as Resolved. · Mar 20 2019, 11:12 PM

Thanks, @nettrom_WMF. Closing the ticket as Resolved.