
[Data Platform] Update referer job to use global country deny list instead of a hard-coded one
Closed, Resolved, Public

Description

The referer job excludes some countries from its computation results:
https://github.com/wikimedia/analytics-refinery/blob/master/hql/referrer/compute_referer_daily.hql#L28

The list of excluded countries is hard-coded, and seems a smaller version of the list of excluded countries maintained globally on the cluster:

SELECT * FROM canonical_data.countries WHERE is_protected;

I think we should update the referer job to use the global exclusion list.
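As a rough illustration of the proposed change, the sketch below filters a toy referrals table with an anti-join against the protected rows of canonical_data.countries instead of a hard-coded list. It runs on an in-memory SQLite database purely so it is self-contained. The `referrals` table, the `iso_code` and `country_code` column names, and the X1/X2/X3 country codes are assumptions made up for this example, and the HQL in the actual job may look different.

```python
import sqlite3

# In-memory stand-in for the cluster; ATTACH lets us use the
# schema-qualified name canonical_data.countries as in the ticket.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS canonical_data")

# Hypothetical schema and data: only `is_protected` comes from the ticket.
conn.execute(
    "CREATE TABLE canonical_data.countries ("
    "iso_code TEXT PRIMARY KEY, name TEXT, is_protected INTEGER)"
)
conn.executemany(
    "INSERT INTO canonical_data.countries VALUES (?, ?, ?)",
    [("X1", "Country one", 0), ("X2", "Country two", 1), ("X3", "Country three", 0)],
)
conn.execute(
    "CREATE TABLE referrals (country_code TEXT, referer TEXT, num_referrals INTEGER)"
)
conn.executemany(
    "INSERT INTO referrals VALUES (?, ?, ?)",
    [("X1", "search", 10), ("X2", "search", 5), ("X3", "social", 7)],
)

# Anti-join: keep a referral row only if its country has no matching
# protected entry in the global list.
QUERY = """
SELECT r.country_code, r.referer, r.num_referrals
FROM referrals AS r
LEFT JOIN canonical_data.countries AS c
  ON c.iso_code = r.country_code AND c.is_protected
WHERE c.iso_code IS NULL
ORDER BY r.country_code, r.referer
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With this shape, a country added to the global list is excluded from the output without any change to the job's query.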

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title: Use canonical_data.countries when populating the referer tables
Reference: repos/data-engineering/airflow-dags!516
Author: aqu
Source Branch: T348504_use_canonical_country_list_in_referer_daily
Dest Branch: main

Event Timeline

Ping @Isaac as you were the one creating the job - do you think the idea described in this ticket is valid?

Okay, did some digging because the hard-coded list is far smaller than the standard blocklist so making the switch would greatly reduce the size of the referrer dataset. I think (per below) that we should keep with the hard-coded set but if you'd like to check in with Privacy to see if this is still acceptable, let me know and happy to help with that. For what it's worth, this would currently fall in the medium risk category under: https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines.

Relevant docs:

Hi Isaac,
the reason I ask whether we can switch is not only privacy but also maintenance: once we have all forgotten about this hard-coded list (as I already had before coming back to it by chance), any update it needs will not be applied, whereas the global exclusion list, being THE place where we keep this information up to date, will be updated.
I don't think we can or should keep one list per dataset. If the Privacy team is ok to devise multi-level risk assessments that are applied to different datasets and stored in a single table, I'd be ok with that.
Otherwise there needs to be a way to track those hard-coded lists so that people review them when a change needs to happen in the protected country list.

Also, to quantify impact, I ran queries for yesterday and found that using the global exclusion list reduces the dataset by ~15% in number of lines (from 9672 lines to 8288) and by 12% in number of referrals (from 269671115 referrals to 234867572).

I'm sorry to be a pain, but I'd rather not forget to update that thing when it'll be needed :)
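For reference, the relative reductions can be recomputed from the raw counts quoted in this comment. This is only a small arithmetic sketch; the helper name `percent_reduction` is made up for the example.

```python
def percent_reduction(before: int, after: int) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return 100.0 * (before - after) / before


# Counts quoted in the comment (hard-coded list vs. global exclusion list).
rows_reduction = percent_reduction(9672, 8288)
referrals_reduction = percent_reduction(269_671_115, 234_867_572)

print(f"rows: {rows_reduction:.1f}% fewer")
print(f"referrals: {referrals_reduction:.1f}% fewer")
```

This gives roughly 14.3% fewer rows and 12.9% fewer referrals, in line with the approximate figures quoted.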

@JAllemandou that makes sense as far as reducing maintenance. I'm not sure I'm a great person to answer other than to say I'm happy to help support whatever choice hits the right balance. @Htriedman can I loop you in for some quick thoughts? In particular, wondering about a few things:

  • This dataset is a good candidate for differential privacy, which would push us into the low-risk category and remove the need for this country blocklist clause entirely. Is this a short-term-ish possibility?
  • Can you think of other simple ways of expressing the logic that @JFishback_WMF was applying to the dataset in the review that would allow us to pull directly from canonical_data.countries while still not throwing out data from Russia etc. that have very high numbers? Feels like this could help balance a single source of truth for country blocklists with having nuance in how it's applied. (Also James, would be happy to have your perspective too)
  • If not, thoughts on Joseph's proposal to have separate tables (or perhaps more than just a binary is_protected attribute?) for the country blocklist that are a little more use-specific?

Also, to quantify impact, I ran queries for yesterday and found that using the global exclusion list reduces the dataset by ~15% in number of lines (from 9672 lines to 8288) and by 12% in number of referrals (from 269671115 referrals to 234867572).

And thanks for this!

Hi! Thanks for flagging this, @Isaac! Definitely agree that this dataset is a great candidate for differential privacy (DP), which would also likely reduce the minimum publication threshold to <500. I'm happy to start working on that with you — it's a somewhat independent process from the discussion of the country protection list (CPL) and I think this dataset could benefit from it.

More broadly: this is yet another example of the conservatism and lack of customizability of the CPL impeding engineering efforts more than it clarifies. I have several ideas as to how we could move forward on the list (which I'll list below), but I think it's a much larger conversation that will take a while:

  • replace the current binary value with a tiered system: e.g. countries not on the list are tier 0, countries on the list that meet certain lower risk criteria (high traffic, relatively higher scores from Freedom on the Net / Reporters Without Borders, etc.) are tier 1, riskier countries are tier 2, ...
  • reduce risk scores / drop the CPL entirely when using DP: DP enables us to precisely quantify the worst-case risk to a specified privacy unit (usually, in WMF's case, an individual user), which is a function of a single number, epsilon. The closer the epsilon value is to 0, the lower the risk posed to data subjects. We could state that the CPL doesn't apply when DP is used in a data pipeline, but I feel like that would incentivize poorly-thought-out releases that could be harmful. We could use epsilon values to modify scores (i.e. multiplying a risk score by epsilon / k, where k is the expected max allowable privacy budget for a single release [probably something like 2, 3, or 5]; or stating that CPL countries should get {0.5, 0.2, 0.1} * epsilon), but OTOH it's entirely possible that an individualistic conception of privacy protection is inadequate for governments that may target people collectively based off of group behavior.

I think where I land is something like this: Instead of setting a binary value for countries based off of their FOTN / RWB scores, we should construct a composite measure that can take DP noise scale into account as a mitigation. FOTN and RWB both use a 0-100 index, where higher is better; in our case, just to make things easier, we can invert that, so lower is better. If a country's score goes above a certain threshold, we don't publish it. If DP mitigates that score to below the threshold, we do publish it. The general formula would be something like this:

(((100 - FOTN score) * 0.5) + ((100 - RWB score) * 0.5)) * (epsilon / k)

Let's consider the US, Egypt, and China (using 2023 scores) with no DP; that would lead to the following:

US:
((100 - 76) * 0.5) + ((100 - 71.22) * 0.5) = (24 * 0.5) + (28.78 * 0.5) = 12 + 14.39
= 26.39

Egypt:
((100 - 28) * 0.5) + ((100 - 33.37) * 0.5) = (72 * 0.5) + (66.63 * 0.5) = 36 + 33.32
= 69.32

China:
((100 - 9) * 0.5) + ((100 - 22.97) * 0.5) = (91 * 0.5) + (77.03 * 0.5) = 45.5 + 38.52
= 84.02

RWB and FOTN both set their threshold for a country being not free at 40/100, so we can invert that and say our threshold is 100-40 = 60. Now, let's imagine we wanted to do a DP release of data about these countries, where epsilon = 1.5 and k = 2.

US:
26.39 * (1.5 / 2) = 26.39 * 0.75 = 19.79    # below 60, ok to publish

Egypt:
69.32 * 0.75 = 51.99    # below 60, ok to publish

China:
84.02 * 0.75 = 63.02    # above 60, not ok to publish

Obviously, a lot depends on the expected value of k here, and the group vs. individual privacy concerns still stand. Anyhow, let me know if you have any questions.
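The proposal above can be sketched in a few lines of Python to make it easier to experiment with other values of epsilon and k. This is only an illustration of the formula as written in this comment, not an agreed implementation. The function names and the `SCORES_2023` dictionary are made up for the example, and the scores are the 2023 figures quoted above.

```python
THRESHOLD = 60.0  # 100 - 40, the inverted "not free" cut-off from the comment

# (FOTN score, RWB score) as quoted in the comment; higher means freer.
SCORES_2023 = {
    "US": (76, 71.22),
    "Egypt": (28, 33.37),
    "China": (9, 22.97),
}


def composite_risk(fotn: float, rwb: float) -> float:
    """Equal-weight average of the inverted indices (lower is better)."""
    return 0.5 * (100 - fotn) + 0.5 * (100 - rwb)


def dp_adjusted_risk(fotn: float, rwb: float, epsilon: float, k: float) -> float:
    """Scale the composite risk by epsilon / k to account for DP noise."""
    return composite_risk(fotn, rwb) * (epsilon / k)


def ok_to_publish(risk: float, threshold: float = THRESHOLD) -> bool:
    """A country is publishable only when its risk is below the threshold."""
    return risk < threshold


decisions = {}
for country, (fotn, rwb) in SCORES_2023.items():
    risk = dp_adjusted_risk(fotn, rwb, epsilon=1.5, k=2)
    decisions[country] = ok_to_publish(risk)
    print(f"{country}: {risk:.2f} -> {'publish' if decisions[country] else 'withhold'}")
```

Because the sketch does not round intermediate values, the printed scores can differ from the hand-computed ones in the last decimal place. The publish / withhold outcome for the three countries is the same.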

Thanks for your view @Htriedman, the solution you suggest is definitely better than what we have.
In the meantime, while we don't yet have differential privacy, do you agree we should go for the global exclusion list instead of keeping multiple versions of the list hard-coded in jobs?

@JAllemandou Thanks for the kind words! For the moment, yes — let's try to standardize use of the country protection list and try to avoid keeping multiple versions of the list hard-coded in jobs. I will work on the following:

  1. getting my proposed schema reviewed by legal and human rights
  2. implementing the new schema in hive
  3. updating documentation on wikitech
  4. getting this data release onto a DP framework (cc: @Isaac)

Thank you @Htriedman for the plan on your side.
We're gonna update the current job removing the hard-coded list and using the global exclusion list for now on our side, and we'll change the job to use Differential Privacy when it's ready.

Change 965771 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Use canonical_data.countries when generating referer table

https://gerrit.wikimedia.org/r/965771

Since I got pinged, I'll quickly weigh in with my thoughts, too. I like the approach of a "weighted" CPL. Also, I think @Htriedman is correct that DP is a valid mitigation that might obviate the need to use the CPL at all in some cases. When we first came up with the CPL years ago it was a stop-gap blunt instrument mitigation that worked at the time, but (thanks to Hal) we have better options now. And, to me, the increasingly widespread use of it lends support to the idea that we should have ONE version of it that we improve over time, rather than using a one-off every time (imho, and to the degree possible). FWIW I've also reached out to Legal a few times over the years to see if they have any feedback about improving the list, but it's not really been a high priority. Perhaps it should be since we seem to keep coming back to the well?

@JFishback_WMF I'll invite you to a meeting about this next week!

Change 965771 merged by TChin:

[analytics/refinery@master] Use canonical_data.countries when populating the referer tables

https://gerrit.wikimedia.org/r/965771

Ahoelzl renamed this task from Update referer job to use global country deny list instead of a hard-coded one to [Maintenance] Update referer job to use global country deny list instead of a hard-coded one. Oct 20 2023, 4:51 PM
Ahoelzl renamed this task from [Maintenance] Update referer job to use global country deny list instead of a hard-coded one to [Platform] Update referer job to use global country deny list instead of a hard-coded one.
Ahoelzl renamed this task from [Platform] Update referer job to use global country deny list instead of a hard-coded one to [Data Platform] Update referer job to use global country deny list instead of a hard-coded one. Oct 20 2023, 5:16 PM