
Determine solution for logging clicks to external links
Closed, Declined · Public

Description

Our team has been discussing the need to track some external links. For example, on Czech wiki there is a prominent link in the top navigation to https://nastenka.wikimedia.cz/?source=cs.wikipedia-menu; we'd like to understand how first 24-hour users are utilizing this and other external resources.

Per the parent task, we'll have server-side logging of page views and redirects.

To track external links, we could either add client-side logging to external links, or modify external links to use a redirect mechanism on wiki. For example, instead of linking to https://nastenka.wikimedia.cz/?source=cs.wikipedia-menu, we would link to https://cs.wikipedia.org/wiki/Special:Redirect/externalUrl/https://nastenka.wikimedia.cz/?source=cs.wikipedia-menu, and then the visit to Special:Redirect would get logged and the user would be sent to the external URL. This approach has the nice side effect of reducing the information sent to external domains about the page the user was just on (external sites would see Special:Redirect as the referer rather than whatever URL they were on before).
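As a rough illustration of the link rewriting described above, here is a minimal Python sketch. The function name `redirect_url` is hypothetical, and percent-encoding the external query string (so it is not parsed as the wiki request's own query) is an assumption, not something the proof-of-concept patch is confirmed to do.

```python
from urllib.parse import quote


def redirect_url(wiki_host: str, external_url: str) -> str:
    """Wrap an external URL in the Special:Redirect/externalUrl form.

    Hypothetical helper for illustration only.
    """
    # Keep ":" and "/" readable, but percent-encode "?" and "=" so the
    # external query string stays inside the path (assumption).
    encoded = quote(external_url, safe=":/")
    return f"https://{wiki_host}/wiki/Special:Redirect/externalUrl/{encoded}"


print(redirect_url(
    "cs.wikipedia.org",
    "https://nastenka.wikimedia.cz/?source=cs.wikipedia-menu",
))
```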

I have a proof of concept patch for the server-side Special:Redirect approach, but I am open to doing this client-side if we think that will be a better solution.

Event Timeline

kostajh created this task. Oct 16 2018, 2:03 AM
Restricted Application removed a project: Patch-For-Review. Oct 16 2018, 2:03 AM
Restricted Application added subscribers: Urbanecm, Aklapper.

Change 467553 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/core@master] WIP: Proof-of-concept external URL redirection for event logging

https://gerrit.wikimedia.org/r/467553

Just a note: I have access to the access log for nastenka.wikimedia.cz. If you want that data, feel free to ask me, and I'll be happy to send it to you (preferably without IP addresses).

Also, if you think using Special:Redirect is better, I think we can change it on-wiki.

If you're interested in why the link is there: it is a redirect to outreachdashboard.wmflabs.org. I'm just interested in whether this link is used, so I used this redirect (a shortcut that is easier for Czechs to remember) with a query parameter, to be able to retrieve such data from the access log.

Thanks @Urbanecm.

Just a note: I have access to the access log for nastenka.wikimedia.cz. If you want that data, feel free to ask me, and I'll be happy to send it to you (preferably without IP addresses).

Thank you for offering. Ideally we will have all of the log data in one place through our EventLogging implementation, so I don't think we'll need to ask you for it.

Ok then. I was thinking mainly about data from the past, which is something you cannot get by other means.

Change 468047 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/core@master] Introduce Special:RedirectExternal

https://gerrit.wikimedia.org/r/468047

Change 467553 abandoned by Kosta Harlan:
WIP: Proof-of-concept external URL redirection for event logging

Reason:
Abandon in favor of 468047

https://gerrit.wikimedia.org/r/467553

@Catrope @SBisson here's what this looks like with PageViews schema:

{
  "wiki": "wiki",
  "uuid": "82309071ba5f52df81ad5508450c485c",
  "webHost": "dev.wiki.local.wmftest.net:8080",
  "timestamp": 1539798796,
  "recvFrom": "localhost",
  "seqId": 33,
  "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0",
  "revision": 7,
  "event": {
    "pageId": "0",
    "title": "RedirectExternal/https://google.com?query=1",
    "userId": 10,
    "httpResponseCode": 302,
    "namespace": -1,
    "pageTitle": "",
    "requestMethod": "GET",
    "permissionErrors": [],
    "query": "",
    "action": "",
    "path": "/wiki/Special:RedirectExternal/https://google.com%3Fquery%3D1",
    "isMobile": false
  },
  "schema": "PageViews"
}

@revi @Urbanecm -- basically this is a very similar idea to T206882, but this time for tracking links to external sites. The first thing would be to make a short list of the external links we might want to track. @kostajh already identified the "Kurzy" link in Czech Wikipedia as one that newcomers might click on. So the question for ambassadors is: what external links exist in your wiki that newcomers might click? Are there some handful that it would be good to track?

Change 468047 merged by jenkins-bot:
[mediawiki/core@master] Introduce Special:RedirectExternal

https://gerrit.wikimedia.org/r/468047

I have some concerns about the implementation. Copying from the gerrit comment:

Can this use a whitelist of domains? It's a common tactic among phishers to use open redirects as part of a phishing attack to make a URL look "official" when it really isn't.
Can this verify that the protocol is either http or https? AFAICT this would allow you to redirect to things like telnet://telnet.wmflabs.org or data:text/html,html%20here . Potentially some URI schemes may be unsafe (although browsers will usually ban anything too evil, we shouldn't rely on that).
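The two checks requested above could look roughly like the following Python sketch. It is not the code from the patch: the function name and the allowlist contents are hypothetical (the hosts are simply the ones mentioned in this task), and it uses only the standard library's `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
# Hypothetical domain allowlist, for illustration only.
ALLOWED_HOSTS = {"nastenka.wikimedia.cz", "outreachdashboard.wmflabs.org"}


def is_safe_redirect_target(url: str) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    # Rejects telnet:, data:, javascript: and any other non-web scheme.
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    # .hostname is lowercased and excludes any userinfo or port, so
    # "https://nastenka.wikimedia.cz@evil.example/" resolves to evil.example.
    return parts.hostname in ALLOWED_HOSTS
```

Comparing the parsed hostname, rather than matching a prefix of the raw string, is what keeps look-alike URLs with embedded credentials from passing the check.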

Question: Is it only about "real" external links? Or about internal links written as external ones? Technically, a link which is written like this: [https://cs.wikipedia.org/wiki/Help:Something Something] is an external link. This is frequently used when there is a need to pass query parameters with a page, like when linking to a guided tour (https://cs.wikipedia.org/wiki/Wikipedie:Pr%C5%AFvodce/P%C3%ADskovi%C5%A1t%C4%9B?tour=wikiedbebold is an example) or in T206882.

I'm not fully sure which links qualify as external in the context of this task.

Have the privacy aspects of this been considered? Tracking full URLs feels pretty icky, and I think we generally try to avoid collecting data that link users to what they're reading.

How will this impact referer data?

@Legoktm: I think tracking full URLs is acceptable in non-article space, or for whitelisted URLs. By the definition of this task, I think the team won't have access to data about links that don't go through Special:RedirectExternal, so there is an easy way to manage tracked and untracked links.

@Legoktm:

Have the privacy aspects of this been considered? Tracking full URLs feels pretty icky, and I think we generally try to avoid collecting data that link users to what they're reading.

Yes. We want to understand common pathways for accounts younger than 24 hours to try to improve retention and growth. Understanding when users go off-wiki (for example, to view social media links that a wiki has provided on their help desk) is helpful to improving our understanding.

How will this impact referer data?

Could you elaborate on this please?

@Bawolff:

Can this use a whitelist of domains? It's a common tactic among phishers to use open redirects as part of a phishing attack to make a URL look "official" when it really isn't.

Yes, we could, although that has its own issues. There are plenty of malicious Twitter/Facebook pages, for example. We could whitelist exact URLs, but then this is a little more cumbersome to work with.

Can this verify that the protocol is either http or https? AFAICT this would allow you to redirect to things like telnet://telnet.wmflabs.org or data:text/html,html%20here . Potentially some URI schemes may be unsafe (although browsers will usually ban anything too evil, we shouldn't rely on that).

Yes, we could do that. For now I'm going to revert this patch while we reconsider whether we could get this data via a client-side EventLogging action that wouldn't require anything like Special:RedirectExternal.

There are additional security issues with this patch - it entirely bypasses SpamBlacklist as well as nofollow.

Question: Is it only about "real" external links? Or about internal links written as external ones? Technically, a link which is written like this: [https://cs.wikipedia.org/wiki/Help:Something Something] is an external link. This is frequently used when there is a need to pass query parameters with a page, like when linking to a guided tour (https://cs.wikipedia.org/wiki/Wikipedie:Pr%C5%AFvodce/P%C3%ADskovi%C5%A1t%C4%9B?tour=wikiedbebold is an example) or in T206882.
I'm not fully sure which links qualify as external in the context of this task.

This was answered in a meeting. The answer is: Links outside the domain.

Change 468047 merged by jenkins-bot:
[mediawiki/core@master] Introduce Special:RedirectExternal
https://gerrit.wikimedia.org/r/468047

This was since reverted: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/468353

@Legoktm wrote:

Have the privacy aspects of this been considered? Tracking full URLs feels pretty icky, and I think we generally try to avoid collecting data that link users to what they're reading.

Yes. We want to understand common pathways for accounts younger than 24 hours to try to improve retention and growth. Understanding when users go off-wiki (for example, to view social media links that a wiki has provided on their help desk) is helpful to improving our understanding. [..]

Ignore me if the following was already known, but I believe @Legoktm was asking about the URL of external links. I may have missed it, but I couldn't find a need for it on the relevant tasks. Instead of logging that a user visited "en.wikipedia.org/wiki/Foo" and navigated to "https://books.google.com/1235/foo-bar-baz", could we only track that they navigated from there to a URL with hash b58a5f9 (e.g. a hash of the URL and a private salt for this purpose)?

The hash would be deterministic within the campaign. That still allows correlation of navigations by different users to the same url. It also still allows knowing which urls they went to for any urls you're specifically interested in. Either because it's a url of general interest (like the wikis' social page), or for other reasons. With a little extra work, it could even allow you to trace a specific unknown url (e.g. if it's popular, or poses a question for some reason), by hashing the links from a given page and finding the match. It also has the benefit of not needing a whitelist ahead of time, and (more importantly, for privacy) it means the urls are not visible in the EventLogging database to us, and anyone else with access there. This last point is important, because logging it to the EventLogging database would effectively grant a new form of restricted access.
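To make the salted-hash idea concrete, here is a minimal Python sketch. It uses HMAC-SHA256 as one possible keyed hash; that choice, the function names, and the salt value are all assumptions for illustration, not part of any existing EventLogging code.

```python
import hashlib
import hmac
from typing import Iterable, Optional

# Placeholder only; a real salt would be private and fixed per campaign.
CAMPAIGN_SALT = b"hypothetical-private-salt"


def hash_url(url: str, salt: bytes) -> str:
    """Deterministic keyed hash of a URL (same URL and salt, same digest)."""
    return hmac.new(salt, url.encode("utf-8"), hashlib.sha256).hexdigest()


def find_url(logged_hash: str, candidates: Iterable[str], salt: bytes) -> Optional[str]:
    """Trace a logged hash back to a URL by hashing the links of a given page."""
    for url in candidates:
        if hmac.compare_digest(hash_url(url, salt), logged_hash):
            return url
    return None
```

Because the digest is deterministic within a campaign, visits by different users to the same URL can still be correlated, while anyone without the salt sees only opaque hex strings in the log.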

There might be existing datasets that could be used to reverse-engineer what a given user read, but we have nothing today (AFAIK) that directly logs a user name with a wiki page they read. That internal tracking alone is groundbreaking (ReadingDepth doesn't do it). To also log it together with an external link would go quite far, in my opinion.

Perhaps no alternative suffices for this purpose, but then I think we'll need to publicly explain why. We may also want to double-check whether the current EventLogging destination and default access levels are acceptable.

@Krinkle and @Legoktm thank you for bringing up your concerns and critiques. I realize now I could have been more clear in my task description. Let me try to recap what our intention is with this task and also with T205759: Understanding first day: prototype instrumentation approach

  1. For this project (T205754: [EPIC] Growth: Understanding first day), we want to develop a funnel analysis of users younger than 24 hours in a way that protects user privacy. The goal of the funnel analysis is to understand which patterns are likely to lead to edits, where users might get stuck and abandon becoming editors, with the overall aim to improve growth and retention on mid-size wikis.
  2. In T205759: Understanding first day: prototype instrumentation approach we are developing a "PageViews" schema to use server-side EventLogging for users in the cohort we are interested in. You can see what we are proposing to log and also what we are doing to redact sensitive data, like exact URL, output title, page title, page ID, search query terms. Your feedback and critique of that patch would be very welcome there.
  3. While T205759 gets us anonymized information for on-wiki user interactions, our team was considering how we could also include new user usage of external, off-wiki links that are part of the onboarding / help guides. As a specific example, on Czech wikipedia there is a prominent link in the top navigation to Kurzy which leads users to outreachdashboard.wmflabs.org. If there is a correlation between first 24 hour users accessing this URL and making an edit, and/or continuing to make edits beyond the first 24 hours, that's important for our team to know.
    1. To be clear about what we are proposing to do in the scope of this task: we have no desire to track _every_ external link a new user visits, only a handful that are prominent components of the help / onboarding process for new users. In this task we proposed to rewrite these URLs using the (now reverted) Special:RedirectExternal mechanism.
    2. Logging a hash for external URLs (like we are doing overall in PageViews schema) would work fine as well, I think.

All of this said, we are re-evaluating whether we want to attempt including this handful of external links (even in hashed form) in the PageViews EventLogging. We'll update this task when there's a decision on that.

@MMiller_WMF please comment if I've missed anything.

Thanks for adding the detailed explanation, @kostajh. With respect to the larger "Understanding first day" project, we are being careful to narrowly constrain our measurement to just those namespaces that are relevant, and to not track article reading behavior (and to purge/aggregate/anonymize after 90 days). An important detail that I'll add to @kostajh's description is that our plan will record User ID along with page ID, but not for a set of more sensitive namespaces, such as Article, Article Talk, Draft, Draft Talk, Portal, Portal Talk, and some others. Therefore the browsing behavior on those content pages will be obscured. The plan will, however, record page IDs visited in namespaces like Help, Wikipedia, and Special, so that we can see which of those pages new editors learn from before editing.
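A minimal Python sketch of the namespace rule described above, under stated assumptions: the function name and event shape are hypothetical, the namespace IDs for Draft and Portal vary by wiki (the values below follow common Wikipedia configurations), and keeping the namespace number while dropping the page ID is one reading of "obscured", not the confirmed schema logic.

```python
from typing import Any, Dict

# Assumed IDs: 0 Article, 1 Article Talk, 100/101 Portal and Portal Talk,
# 118/119 Draft and Draft Talk. The real list includes "some others".
SENSITIVE_NAMESPACES = frozenset({0, 1, 100, 101, 118, 119})


def build_event(user_id: int, namespace: int, page_id: int) -> Dict[str, Any]:
    """Record the page ID only outside the sensitive namespaces."""
    event: Dict[str, Any] = {"userId": user_id, "namespace": namespace}
    if namespace not in SENSITIVE_NAMESPACES:
        # Help (12), Wikipedia (4), Special (-1) and similar pages keep their ID.
        event["pageId"] = page_id
    return event
```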

We'll post more details on the PageViews schema and its business logic. We are currently in touch with WMF privacy folks to make sure everything looks good. Once that information is up in T205763, we hope you'll take a look and weigh in. Let us know if you would like to discuss at any point. We're definitely open to anyone's thoughts on the right way to pursue our goal of understanding what newcomers do on their first day.

For this external links initiative, in addition to Kosta's description, I'll also say that we're still figuring out if this is even valuable to pursue. We're figuring out now what external links actually exist on Czech and Korean Wikipedia that are geared toward new editor learning (in T207306 and T207473). It may be that there just aren't very many. At T206882#4680525, we discussed this some more to get some basic counts around how often the "Kurzy" button is visited in Czech (which is the button that inspired this idea).

@MMiller_WMF based on the lack of external link usage in newbie/help guides on Czech and Korean wikis, I think we can drop this task.

If we decide to revisit it, based on the feedback we've received and my experience with coding this, I would suggest something like:

  1. Maintaining a whitelist of external help/onboarding URLs (exact match, not domain) that are used as part of the wiki's help/onboarding process. Probably there would be fewer than 10 per wiki.
  2. Implementing the proposed Special:RedirectExternal mechanism from a patchset on this issue, but ensuring that the redirect URL is in the whitelist.
  3. Not hashing the whitelisted URLs in EventLogging, as we would consider these to be non-sensitive namespace content.
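
The three steps above can be sketched roughly as follows in Python. Everything here is hypothetical: the per-wiki allowlist structure, the `cswiki` key, the `RedirectDecision` type, and the use of a 400 response for rejected targets are assumptions. Only the 302 status matches the sample PageViews event earlier in this task.

```python
from typing import Dict, FrozenSet, NamedTuple, Optional

# Hypothetical per-wiki allowlist of exact URLs (expected: fewer than 10 each).
ALLOWED_URLS: Dict[str, FrozenSet[str]] = {
    "cswiki": frozenset({"https://nastenka.wikimedia.cz/?source=cs.wikipedia-menu"}),
}


class RedirectDecision(NamedTuple):
    status: int                   # HTTP status to send
    location: Optional[str]       # redirect target, if any
    logged_target: Optional[str]  # URL recorded unhashed in the event, if any


def decide_redirect(wiki: str, target: str) -> RedirectDecision:
    """Redirect only to exact allowlisted URLs and log those unhashed."""
    if target in ALLOWED_URLS.get(wiki, frozenset()):
        return RedirectDecision(302, target, target)
    # Anything else is refused, so the page cannot act as an open redirect.
    return RedirectDecision(400, None, None)
```

Exact matching means that changing a query parameter on-wiki requires updating the allowlist as well, which is the extra maintenance cost noted earlier in the discussion.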

MMiller_WMF closed this task as Declined. Oct 22 2018, 6:01 PM

I agree that we can decline this task, given that it won't be all that important for Czech and Korean Wikipedias (given the info in T207306 and T207473). Thanks for recording your parting thoughts on this in case we need to revisit for other wikis. And thanks to @Bawolff and @Legoktm for weighing in.

sbassett triaged this task as Normal priority. Oct 16 2019, 4:38 PM
sbassett moved this task from Backlog to Done on the Privacy board.
sbassett removed a project: Patch-For-Review.