Is User-Agent data PII when associated with Action API requests?
Closed, Declined (Public)

Description

In T102079: Metrics about the use of the Wikimedia web APIs we have defined various rollup tables based on raw Action API (/w/api.php) activity. One of the dimensions that we are measuring is User-Agent to help answer questions like

  • Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis + all time (DevRel, to help identify the services making intensive use of our APIs)

@Nuria has pointed out that User-Agent data is scrubbed from webrequest tables after 90 days because of the protections assured by https://wikimediafoundation.org/wiki/Privacy_policy.

The open question is whether the requesting User-Agent is PII when accessing /w/api.php in the same way that it is when a human interactively reads or edits an article, considering https://meta.wikimedia.org/wiki/User-Agent_policy

Event Timeline

Probably, since on-wiki Common.js, default gadgets, core features such as edit stashing, and extensions such as MultimediaViewer all hit the API.

If the user-agent data is non-public and may be used to identify individual users, then it should be kept private under the Privacy Policy. It may be possible to aggregate some user-agent data to the point that it is no longer identifiable, but it would need to be reviewed before it's published.

Per Wikipedia, PII is information that can be used on its own or with other information to identify, contact, or locate a single person, or to identify an individual in context. That's pretty clearly not true of browser UAs - they don't identify anyone in isolation. In the very worst case, if someone were to use a super-rare browser/OS/plugin/whatever combination so the useragent is specific only to them, all the public dataset would reveal is the fact that they used the API.
OTOH we don't really care about browser UAs, so if they can be reliably identified, then replacing them with the browser family would be fine IMO.

As for bot UAs, they do contain email addresses sometimes, but those are put there with the intention to identify the owner. Is there really an expectation of privacy in such a case?

I suggest just going with the parsed UA version that gets you Browser (family, major, minor) and OS (family, major, minor). The issue with user agent strings is the same as the issue with search terms: they can include any kind of free text. For example, "My name is Dan Andreescu and my social security number is >>> >> >>>>" could be a UA string. So parsing out what you care about and aggregating is always a good idea. For an example, this is how we handle UA strings when we report statistics about them: https://analytics.wikimedia.org/dashboards/browsers/

Although it seems aggregating the agent still doesn't make it not PII, according to @APalmer_WMF in T151655#2855630.

I don't have permission to see that task, but yes, simply aggregating won't work, that's why I gave that dashboard as an example. We normalize, aggregate, and cut off the long tail. We looked at the results, thought about potential de-anonymization attacks quite a bit, and found it safe.
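
To make "normalize, aggregate, and cut off the long tail" concrete, here is a minimal sketch. It assumes the UA strings have already been normalized to a family label by some parser; the function name, the `min_share` threshold and the "Other" bucket are illustrative choices, not the actual dashboard code.

```python
from collections import Counter

def aggregate_with_cutoff(families, min_share=0.01, other_label="Other"):
    """Count normalized labels and fold every label below min_share into one bucket.

    `families` is an iterable of already-normalized labels (e.g. browser family).
    The threshold and bucket name are assumptions made for this sketch.
    """
    counts = Counter(families)
    total = sum(counts.values())
    result = Counter()
    for label, n in counts.items():
        # Rare labels are the ones most likely to single out one client,
        # so they are merged instead of being reported individually.
        key = label if n / total >= min_share else other_label
        result[key] += n
    return dict(result)
```

With 60 "Chrome", 38 "Firefox" and two one-off labels, a `min_share` of 0.05 reports the two big families and a single "Other" bucket of 2. Whether any particular threshold is safe against de-anonymization still needs the kind of review described above.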

Although it seems aggregating the agent still doesn't make it not PII, according to @APalmer_WMF in T151655#2855630.

That comment is about storing useragents in a way that's linked to users, which wouldn't happen here.

I suggest just going with the parsed UA version that gets you Browser (family, major, minor) and OS (family, major, minor).

That's fine for browsers; we don't really care about them. But there is no way to parse user agents set by API client libraries, much less UAs manually configured by users of such libraries.

Cutting the long tail doesn't help either, since the most active bots are actually more likely to have manually set UAs (there are multiple UAs with email addresses in the top 10, for example).

For example, "My name is Dan Andreescu and my social security number is >>> >> >>>>" could be a UA string.

That's a bit like putting your SSN on your user page. Can there be any expectation of privacy for information you have intentionally communicated to the public?

I suggest just going with the parsed UA version that gets you Browser (family, major, minor) and OS (family, major, minor).

That's fine for browsers; we don't really care about them. But there is no way to parse user agents set by API client libraries, much less UAs manually configured by users of such libraries.

I see, yeah, that's tricky; it would help me to know more clearly what questions we need this data to answer. We could augment UA Parser with regexes that pull out client libraries which follow our convention. If that's not enough we can revisit enforcing the convention more strictly.

Cutting the long tail doesn't help either, since the most active bots are actually more likely to have manually set UAs (there are multiple UAs with email addresses in the top 10, for example).

I think if a bot is following our UA string convention, it's fair to think they're ok with their UA string being public. We should make that explicit in the convention and communicate it, though.

For example, "My name is Dan Andreescu and my social security number is >>> >> >>>>" could be a UA string.

That's a bit like putting your SSN on your user page. Can there be any expectation of privacy for information you have intentionally communicated to the public?

Not the same. Writing a public blog post is not the same as searching for something on google. The blog post is known to be public, the search terms are assumed to be private. I think a User Agent is actually slightly more private than search terms.

We could augment UA Parser with regexes that pull out client libraries which follow our convention. If that's not enough we can revisit enforcing the convention more strictly.

We have a convention? https://meta.wikimedia.org/wiki/User-Agent_policy doesn't seem to establish much of anything that could be very useful for parsing.

The best time to plant a tree is 20 years ago. The second best time is now.
:)

I think we are getting sidetracked here. The user agents on the API are of a different nature than browser UAs, and there is no point in parsing/aggregating those with ua-parser, which is made for UAs that follow the usual "browser protocol". The point of this ticket is not to decide whether UAs "in general" are identifying (they are, and that is well established) but rather whether API UAs have any expectation of privacy.

It is a fundamentally different question, as in the UA policy (https://meta.wikimedia.org/wiki/User-Agent_policy) we explicitly ask users to include an e-mail or similar with their requests to the API so we can identify them. So it really seems that this data has no expectation of privacy, just as @Tgr pointed out early on.

To clarify, API logs contain some browser UAs (since the desktop and mobile web interface load JS which calls the API; and probably some third-party websites make CORS requests as well), we just aren't really interested in them. Whatever reliable method there is to standardize or anonymize or remove them will work fine, as long as it does not squash most non-browser UAs.

it would help me to know more clearly what questions we need this data to answer.

See T102079: Metrics about the use of the Wikimedia web APIs. One use case is to identify top users of our APIs; another (which might or might not be plausible) is to use it to approximate the number of "unique visitors" to the API.

I think we are getting sidetracked here. The user agents on the API are of a different nature than browser UAs, and there is no point in parsing/aggregating those with ua-parser, which is made for UAs that follow the usual "browser protocol". The point of this ticket is not to decide whether UAs "in general" are identifying (they are, and that is well established) but rather whether API UAs have any expectation of privacy.

It is a fundamentally different question, as in the UA policy (https://meta.wikimedia.org/wiki/User-Agent_policy) we explicitly ask users to include an e-mail or similar with their requests to the API so we can identify them. So it really seems that this data has no expectation of privacy, just as @Tgr pointed out early on.

User agents may be private if they contain identifiable information, even if that information is required in the User-Agent Policy, unless users have given permission for their data to be public.

Let's discuss over email or a call if you are interested in publishing this data.

Thanks for the correction, @Slaporte. The question about the data is for the API owners; on our end we have no use for API user agents. If @Tgr and @Anomie agree, we can simply delete the harvested data after 90 days.

Per discussion with @Slaporte and @APalmer_WMF:

  • making UA data public is hard. We need consent because it could include email addresses or other personal information, and there is no clear way to get that consent (even if the UA policy did ask for it, which it does not, we cannot assume that API users have seen it). We are not particularly interested in making the data public anyway.
  • increasing the retention period is easy if there is a clear need. Per T102079 the main "customer" for UAs is DevRel so @Qgil or @srishakatux can confirm whether this is still useful for them, and then we can get an exception from the 90-day retention in place (and add browser UA standardization to the requirements of T137321.)

@bd808: just to make sure there is no misunderstanding: ApiAction is purged after 90 days and this task is only about the aggregated UA data (action_ua_hourly), right?

Correct, at least by design. The raw tables should be purged using refinery-drop-hourly-partitions by some scheduled job in the analytics pipeline. I'm not exactly sure where that job schedule lives to verify. Only the action_* aggregate tables are intended to be kept for longer periods to allow year over year comparisons.

@bd808 and @Tgr: for purging to happen on a recurring basis, a cron job needs to be added here: https://github.com/wikimedia/operations-puppet/blob/production/modules/role/manifests/analytics_cluster/refinery/data/drop.pp

Let's plan on deleting all PII data; if/when someone asks for anything that might be PII, we can work with legal on granting an extension, which doesn't seem needed at this time.

The parent task (T102079: Metrics about the use of the Wikimedia web APIs) explicitly asks for what is currently considered PII (user-agent):

  • Number of user agents coming from Labs or third party services, on a monthly basis + all time (DevRel, to check whether our APIs are increasing adoption)

This one seems to be about count(distinct user-agent) which we can capture at a monthly granularity as a metric without retaining the UA long term.

  • Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis + all time (DevRel, to help identify the services making intensive use of our APIs)

This is explicitly about retaining and exposing the UA as part of the metric.

  • Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)

This seems more operational than historic and thus can probably fit in a model where the UA is discarded before the end of the 90-day window.

@Qgil should chime in here and help everyone understand if the ranking metric was a transitory desire that we can now ignore or if this is actually critical in some way to the success of developer relations outreach.

If that is the only place these are managed, then it looks like the CirrusSearch data set that the ActionApi data set was modeled on is not being purged either.

That looks correct. We should update this to drop both. I've pushed a patch to refinery that has a new drop script, and a [WIP] patch to puppet to be deployed after a refinery including the appropriate script has shipped.

  • increasing the retention period is easy if there is a clear need. Per T102079 the main "customer" for UAs is DevRel so @Qgil or @srishakatux can confirm whether this is still useful for them, and then we can get an exception from the 90-day retention in place (and add browser UA standardization to the requirements of T137321.)

Does the question to #DevRel refer to the "+ all time" bullet points currently listed under "Metrics requested" in the task description of T102079? Or other aspect(s)?

Clueless question: To get "all time" statistics, is storing each UA date needed? Or can they be aggregated to create monthly data (and each date be deleted as early as possible for privacy)?

  • increasing the retention period is easy if there is a clear need. Per T102079 the main "customer" for UAs is DevRel so @Qgil or @srishakatux can confirm whether this is still useful for them, and then we can get an exception from the 90-day retention in place (and add browser UA standardization to the requirements of T137321.)

Does the question to #DevRel refer to the "+ all time" bullet points currently listed under "Metrics requested" in the task description of T102079? Or other aspect(s)?

All time is in question, yes. But more urgently "Ranking of user agents" is in question. If UA is PII then such rankings could only be behind the 'firewall' and not ever published.

Clueless question: To get "all time" statistics, is storing each UA date needed? Or can they be aggregated to create monthly data (and each date be deleted as early as possible for privacy)?

Whether the UA is needed or not for year over year depends on the particular statistic. We could store something that said 'we saw N unique UAs in 2016-02' without keeping the UA data, but if what is wanted instead is 'we saw UA "foo bar baz" N times in 2017-02 which was an X% increase/decrease over the same UA in 2016-02' then we need to keep the UA data long term.
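
As a rough sketch of the first kind of statistic ("we saw N unique UAs in 2016-02"), the monthly count can be computed while the raw rows are still inside the retention window and only the number kept afterwards. The record layout (ISO date string plus UA string) and the function name are assumptions for illustration, not the actual ApiAction schema.

```python
from collections import defaultdict

def monthly_distinct_counts(records):
    """Return {"YYYY-MM": number of distinct UAs} from (iso_date, user_agent) pairs.

    Only the resulting counts need to be stored long term; the UA strings
    themselves can be dropped together with the raw data.
    """
    seen = defaultdict(set)
    for iso_date, user_agent in records:
        month = iso_date[:7]  # "2016-02-15" -> "2016-02"
        seen[month].add(user_agent)
    return {month: len(uas) for month, uas in seen.items()}
```

The second kind of statistic (the same UA compared across years) cannot be derived from these counts, which is exactly the trade-off described above.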

The original thinking was that User-Agent data sent by an API consumer was not PII if it was stored disjoint from a record of the actual activity that was done by that UA. The aggregation tables I designed (https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction) store:

  • UA + wiki + internal/external/labs + count
  • Action + wiki + internal/external/labs + count
  • Param + wiki + internal/external/labs + count

There is no way to tell from these aggregations what actions and/or params were used by a given UA (which I would most certainly consider to be private). If the privacy issue is the value but not necessarily the existence of the UA then we could hash the UA that is stored in the aggregate table. Hashing would obscure the value of the UA (e.g. "my cool browser/42.0") by replacing it with an opaque token (e.g. "bb4c90bbae9e4a123ef553f7a78e09896ff7c3d9"). The search space for UA strings is functionally infinite, so it could not be broken with a rainbow table in reasonable time especially if we added a pepper value to the UA string before hashing. It would still be possible to hash a known UA and find its related data across time. It would also still be possible to make year over year examinations for UAs that were interesting in a given period (e.g. "bb4c90bbae9e4a123ef553f7a78e09896ff7c3d9 is the top UA this month, where did they rank 18 months ago?").
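
A minimal sketch of the peppered-hash idea follows. It swaps in HMAC-SHA256 as the keyed hash rather than literally concatenating a pepper before hashing; the function name and the way the pepper is supplied are assumptions for illustration. The pepper would have to be kept secret and stable for as long as tokens need to be comparable across time.

```python
import hashlib
import hmac

def pseudonymize_ua(user_agent: str, pepper: bytes) -> str:
    """Replace a UA string with an opaque, stable token.

    The same UA and pepper always give the same token, so rows can still be
    joined across time, but the token cannot be recomputed without the pepper.
    """
    return hmac.new(pepper, user_agent.encode("utf-8"), hashlib.sha256).hexdigest()
```

Anyone holding the pepper can still hash a known UA and look up its history, which matches the "hash a known UA and find its related data" property described above.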

Does the question to #DevRel refer to the "+ all time" bullet points currently listed under "Metrics requested" in the task description of T102079?

For now the proposal is to extend data lifetime to one year (so it couldn't be used for an "all time" metric but could be used for a year-on-year metric).

Clueless question: To get "all time" statistics, is storing each UA date needed? Or can they be aggregated to create monthly data (and each date be deleted as early as possible for privacy)?

They are already aggregated on an hourly basis. We could use lower granularity, but it wouldn't really change the legal situation as far as I can see. The problem is not that user agents can be correlated with something, it's that they can contain personal data.

If UA is PII then such rankings could only be behind the 'firewall' and not ever published.

Not automatically, anyway. CoolBot/1.2.3 is not personal information; CoolBot/1.2.3 my@email.address is. So someone would have to sanitize them by hand. That seems unavoidable at this point (unless we come up with a new user agent policy, and figure out how to make it legally binding, which is something no one seems to be interested in).
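
For illustration only, a first automated pass could at least mask anything that looks like an email address before a human reviews the rest. The regular expression below is a deliberately simple assumption and will miss other personal data (names, URLs of personal pages, free text), so it does not replace the manual sanitizing mentioned above.

```python
import re

# Simplified pattern for "something@domain.tld"; an assumption for this sketch,
# not a complete email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(user_agent: str, placeholder: str = "<email>") -> str:
    """Return the UA string with every email-like substring masked."""
    return EMAIL_RE.sub(placeholder, user_agent)
```

Under this sketch, "CoolBot/1.2.3 my@email.address" becomes "CoolBot/1.2.3 <email>", while "CoolBot/1.2.3" is returned unchanged.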

If the privacy issue is the value but not necessarily the existence of the UA then we could hash the UA that is stored in the aggregate table.

Knowing that the top three users of the API in the last year were <hash>, <hash> and <hash> does not seem like useful information.

Sorry for being late to this party. From a Developer Relations point of view, being able to extract quarter-over-quarter data would be the hard requirement, and year-over-year would be very nice to have.

Can you add what specific questions we hope the data will answer? That would help us see what aggregation scheme needs to be used, etc.

Specific questions, iterating on what I wrote in the T102079 description more than a year ago plus our current plans to focus on onboarding and retaining developers:

Services calling our APIs from Wikimedia Labs

  • How many active services are there? Is Labs use increasing?
  • Which are the most used for reading/writing data? Which services are the most popular/impactful?
  • Are there new services becoming active, and can we trace them down to a project or a developer?
  • Are there significant changes in popularity i.e. new services going up, old services winding down?
  • What are these services doing, which types of action? Are these uses matching our developer offering in terms of documentation, support, feature requests? Do we have interesting features that seem to be unknown / underused by developers and could use some promotion?

Services calling our APIs from non-Wikimedia servers

  • Same questions adapted for third parties, if possible.

Bonus point if we could have this data separated by Labs / 3rd parties and then all aggregated.

None of these questions should require retention of UA data at all, as far as I can see.

Are there new services becoming active, and can we trace them down to a project or a developer?

This is the only one for which you might need some contact info, but certainly not beyond the 90-day period in which PII can be retained. Thus I'd say we have no need to retain user agent data and, just like the rest of the webrequest data in Hadoop, it should be dropped after 90 days.

Some more points:

Services calling our APIs from Wikimedia Labs

The work that @bd808 has done tracks only the main Wikipedia PHP API; other services, such as the Pageview API, are not included. Thus a number of your questions cannot be answered with this data, as it doesn't cover new services. I imagine the ex-Labs team (now Cloud Services) might have stats on services in Labs and their usage. Not sure, but in any case that should be another thread.

Anything that involves tracking usage levels is infeasible without user agents, for the same reason tracking usage of a web feature is infeasible without cookies: we care primarily about the number of users, not the number of requests, and the former is impossible to even estimate without some sort of information to tell clients apart. So none of these questions can be answered without user agents.

The work that @bd808 has done tracks only the main Wikipedia PHP API; other services, such as the Pageview API, are not included. Thus a number of your questions cannot be answered with this data, as it doesn't cover new services. I imagine the ex-Labs team (now Cloud Services) might have stats on services in Labs and their usage. Not sure, but in any case that should be another thread.

The task for that is T122245: REST API entry point web request statistics at the Varnish level.

so none of these questions can be answered without user agents.

Wait, wait... you need aggregates about "distinct" UAs, not the UAs themselves, so all the work to estimate distinct users can be done using, say, an md5(user-agent, salt) that is constant over, say, a 60-day period, right? That should be sufficient to extract information like "there are 5 parties that are responsible for 90% of our requests". Correct?
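
A minimal sketch of that proposal: hash each UA with a salt that stays fixed for the period, count requests per token, and report how many tokens are needed to cover a given share of traffic. The function names, the salt handling and the input shape (one UA string per request) are assumptions for illustration; md5 is used only because it is the function named above.

```python
import hashlib
from collections import Counter

def period_token(user_agent: str, period_salt: str) -> str:
    """Opaque token for a UA, stable only while period_salt is unchanged."""
    return hashlib.md5((period_salt + user_agent).encode("utf-8")).hexdigest()

def parties_covering(request_uas, period_salt, share=0.9):
    """Smallest number of distinct tokens that together account for `share` of requests."""
    counts = Counter(period_token(ua, period_salt) for ua in request_uas)
    total = sum(counts.values())
    covered = 0
    # Walk tokens from busiest to quietest until the target share is reached.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= share * total:
            return rank
    return 0  # no requests at all
```

Once the salt is rotated and discarded, tokens from different periods can no longer be linked, so this supports within-period concentration figures but not comparisons across periods.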

That would work in an ideal world where everyone uses a unique UA. In the real world, we have three groups: proper UAs, browser UAs from JS-based API usage (those are fine to hash or otherwise obscure but we need to be able to tell they are browser UAs) and framework UAs (like "Pywikibot" or "PHP"). Not being able to tell those groups apart would be a problem, I imagine.
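
One way to keep those three groups distinguishable would be to record a coarse class next to whatever token is stored. The heuristic below is only a sketch: the "Mozilla/" prefix test and the list of framework names are assumptions for illustration, and real traffic would need a more careful classifier.

```python
BROWSER_PREFIX = "Mozilla/"  # assumption: browser UAs start with this product token
FRAMEWORK_NAMES = {"pywikibot", "php", "python-requests"}  # hypothetical list

def classify_user_agent(user_agent: str) -> str:
    """Return "browser", "framework" or "proper" for a raw UA string."""
    ua = user_agent.strip()
    if ua.startswith(BROWSER_PREFIX):
        return "browser"
    # Name of the first product token, e.g. "Pywikibot/3.0 foo" -> "pywikibot".
    first_token = ua.split(" ", 1)[0]
    name = first_token.split("/", 1)[0].lower()
    if name in FRAMEWORK_NAMES:
        return "framework"
    return "proper"
```

Storing the class alongside a hashed UA would let browser traffic be filtered out and framework defaults be reported separately, without keeping the raw string.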

In any case, I feel the concerns around this are overblown. API client user agents are set for the explicit purpose of telling us about the user. (Browser UAs are an exception, and should be replaced as discussed in T154912#2961297.) We don't have a clear mandate to publish those UAs publicly so we won't do that, but storing them doesn't seem to be the same kind of concern as storing user PII - they are being sent with the request because users want us to know (and it might benefit them in the long term, if we can plan API support based on the level of demand). And the only information we would associate (beyond the 90-day retention period) with the UA is the number of requests per hour.

Jcross subscribed.

We are declining this ticket as it has been almost three years since the last comment. Should new work be required, please create a new ticket. Thank you!