
Analyze how many distinct devices edit per day from a given IP address
Closed, ResolvedPublic

Description

In the parent task, we are discussing what rate limit would be appropriate for the number of temporary account creations allowed per IP per hour.

Temporary accounts are created when a logged-out user makes their first edit on the wiki. The account will persist for that user's device as long as the user does not clear their browser cookies.

Currently, we allow:

  • 6 regular account creations per day per IP via $wgAccountCreationThrottle
  • 8 edits per minute per IP (= 480 per hour, or 11,520 per day) via $wgRateLimits
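
As a quick check of the figures above, the per-minute edit limit can be converted to hourly and daily ceilings. This is only a sketch of the arithmetic; the constant names are illustrative and are not MediaWiki settings.

```python
# Convert the per-minute edit limit into per-hour and per-day ceilings.
EDITS_PER_MINUTE = 8            # figure quoted above for $wgRateLimits
ACCOUNT_CREATIONS_PER_DAY = 6   # figure quoted above for $wgAccountCreationThrottle

edits_per_hour = EDITS_PER_MINUTE * 60
edits_per_day = edits_per_hour * 24

print(edits_per_hour)                               # 480
print(edits_per_day)                                # 11520
print(edits_per_day // ACCOUNT_CREATIONS_PER_DAY)   # 1920
```

The two existing limits therefore differ by a factor of 1,920 per day.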

There is of course a big difference between allowing 6 temporary account creations and 11,520 temporary account creations per day. And as noted in the parent task:

Some IPs are shared by a large number of people, e.g. covering a large geographical area. Rate limiting could significantly harm the ability of people using these IPs to edit

It would therefore be very useful if we could analyze current IP editing data, and try to work out how many unique user agents appear for the same IP address. That could help give us a clearer picture of what a reasonable rate limit would be for temp account creations. The user agent is an imperfect proxy for this information, especially since T242825: Deal with Google Chrome User-Agent deprecation, so we likely also need to make use of client hints data.

To summarize, we want to know: distinct user agents that appear for a given IP address per day.
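
A minimal sketch of that count, assuming each edit is available as an (ip, ISO timestamp, user agent) tuple. The row shape, the function name, and the sample values are hypothetical; they are not the schema of the CheckUser tables.

```python
from collections import defaultdict
from datetime import datetime


def distinct_agents_per_ip_day(rows):
    """Return {(ip, date): number of distinct user agents seen that day}.

    rows: iterable of (ip, iso_timestamp, user_agent) tuples (assumed shape).
    """
    agents = defaultdict(set)
    for ip, iso_timestamp, user_agent in rows:
        day = datetime.fromisoformat(iso_timestamp).date()
        agents[(ip, day)].add(user_agent)
    return {key: len(seen) for key, seen in agents.items()}


# Made-up sample rows using documentation-reserved IP addresses.
sample_rows = [
    ("192.0.2.1", "2024-01-01T08:00:00", "UA-A"),
    ("192.0.2.1", "2024-01-01T09:30:00", "UA-B"),
    ("192.0.2.1", "2024-01-01T10:00:00", "UA-A"),  # repeat agent, not recounted
    ("192.0.2.1", "2024-01-02T08:00:00", "UA-A"),
    ("198.51.100.7", "2024-01-01T12:00:00", "UA-C"),
]
counts = distinct_agents_per_ip_day(sample_rows)
print(max(counts.values()))  # 2
```

Grouping on the calendar date means an IP that changes hands overnight is counted separately on each day.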

After we do that, perhaps we could add the following variations:

  • only include unreverted edits
  • only include reverted edits
  • exclude from analysis any IPs known to iPoid-Service
  • only include IPs known to iPoid-Service
  • break down information by country: countries with fewer IP addresses will have more people editing from a smaller pool of IPs
  • exclude/include obvious bots by looking at user agent data
  • look at edit attempts (clicking the edit button), not just edits

It would be especially interesting to look at outliers for countries with fewer IP addresses, to make sure that we don't inadvertently shut out anonymous editing for users in those countries by having too restrictive a rate limit in place.

We should be able to use wmf_raw.mediawiki_private_cu_changes for this analysis, along with client hints data.

Event Timeline

kostajh renamed this task from Determine how many distinct devices edit per day from a given IP address to Analyze how many distinct devices edit per day from a given IP address. Feb 16 2024, 1:09 PM
kostajh updated the task description.
jwang triaged this task as High priority.
jwang moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.
kostajh updated the task description.

@kostajh, please see the findings below.

Methodology

We reviewed the distribution of the number of distinct user agents that appear for a given IP address per day on each pilot wiki candidate and on enwiki, the largest wiki.
We also reviewed the worst-case scenario: the maximum number of distinct user agents that appear for a given IP address per day across all wikis.
The analysis is limited to anonymous edits made between 2024-01-01 and 2024-01-31.
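
The daily summary statistics can be sketched as follows. This assumes linear interpolation between ranks for the percentiles; the interpolation method actually used in the analysis is not stated, and the function names are illustrative.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize_day(counts):
    """Summarize one day's per-IP distinct user agent counts."""
    return {
        "max": max(counts),
        "p99": percentile(counts, 99),
        "p75": percentile(counts, 75),
    }


# Made-up counts for four IPs on a single day.
print(summarize_day([1, 1, 1, 2])["p75"])  # 1.25
```

Because most IPs show a single user agent, the maximum is the statistic most sensitive to shared IPs, which is why it is reported alongside the percentiles.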

Summary

  • The table below summarizes the maximum number of distinct user agents that appear for a given IP address per day, on each wiki and across all wikis, during January 2024. The daily data, 99th percentile, and 75th percentile are in the sheet.
    • The maximum varies between 1 and 6 on the wikis we investigated.
    • Across all wikis, the maximum is 8.

    wiki                      Maximum number of distinct user agents per IP per day, Jan 2024
    fawiki                    3
    htwiki                    1
    zh_yuewiki                2
    enwikivoyage              2
    enwiki                    6
    across all wikis (700+)   8

  • We did not observe a big variance in the daily trend on each wiki. The maximum (100th percentile), 99th percentile, and 75th percentile remained consistent throughout January 2024.
  • We did not observe a big variance among large, medium, and small wikis. Given that, we suggest applying the same limit to all wikis.

Next steps

To add the following variations:

  • only include unreverted edits
  • only include reverted edits
  • break down information by country: countries with fewer IP addresses will have more people editing from a smaller pool of IPs
  • exclude/include obvious bots by looking at user agent data
  • look at edit attempts (clicking the edit button), not just edits

Thanks so much for this, @jwang!

One thing Jennifer and I are discussing is incorporating client hints into the analysis. It may be that the generic user agent in Google Chrome is masking the number of unique devices in the analysis above. If we look at client hints data (https://www.mediawiki.org/wiki/Extension:CheckUser/cu_useragent_clienthints_table, https://www.mediawiki.org/wiki/Extension:CheckUser/cu_useragent_clienthints_map_table), we should be able to determine if that's the case.

Including client hints

Goal
To add client hints to the analysis of English Wikipedia and evaluate how the additional information shifts the distribution of the number of unique user agents per IP.

Data
The table below summarizes the number of distinct user agents that appear for a given IP address per day on English Wikipedia during January 2024. The daily maximum, 99th percentile, and 75th percentile are in the enwiki tab of the sheet. The client hints added to the analysis are platformVersion, model, and fullVersionList.

                    Maximum of the month
  Max               8
  99th percentile   3
  75th percentile   2

Takeaways

  • After client hints were included in the analysis, the daily maximum number of unique user agents per IP increased by 2 on average.
  • Over January 2024, the highest daily maximum rose from 6 to 8 unique user agents per IP on English Wikipedia.
  • On some days, including client hints raised the maximum by as much as 5.
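
One way to picture why the maximum rises: treating the user agent plus the client hint fields as a combined device key separates devices that share a generic user agent string. The sketch below assumes client hints arrive as a dict keyed by hint name, with fullVersionList as a list of strings; that representation, the function name, and the sample values are hypothetical.

```python
def device_key(user_agent, client_hints=None):
    """Combine a user agent with selected client hints into a hashable key."""
    hints = client_hints or {}
    full_version_list = hints.get("fullVersionList") or ()
    return (
        user_agent,
        hints.get("platformVersion"),
        hints.get("model"),
        tuple(sorted(full_version_list)),  # order-insensitive comparison
    )


# Two made-up edits from one IP with an identical, generic user agent string.
generic_ua = "Mozilla/5.0 (Linux; Android 10; K) Chrome/121.0.0.0 Mobile"
edits = [
    (generic_ua, {"platformVersion": "13.0.0", "model": "Model-A"}),
    (generic_ua, {"platformVersion": "14.0.0", "model": "Model-B"}),
]

by_user_agent = {user_agent for user_agent, _ in edits}
by_device_key = {device_key(user_agent, hints) for user_agent, hints in edits}
print(len(by_user_agent), len(by_device_key))  # 1 2
```

Edits without client hints fall back to the user agent alone, so the combined key never yields fewer distinct values than the user agent does.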

Non-reverted edits vs reverted edits

Goal
To analyze the distribution of the number of user agents for non-reverted edits and for reverted edits.

Data
The table below shows the distribution of the number of distinct user agents that appear for a given IP address per day on English Wikipedia during January 2024. (Client hints are included in the analysis.)

                                                          Max   99th percentile   75th percentile
  Number of user agents per IP for non-reverted edits     8     3                 2
  Number of user agents per IP for reverted edits         3     2.78              2

Takeaways

  • The maximum number of user agents per IP is mainly determined by non-reverted edits.
  • The maximum number of user agents per IP for reverted edits is much lower than that for non-reverted edits.
  • The 75th percentile number of user agents per IP for both reverted and non-reverted edits is the same.
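
A sketch of how the two distributions can be separated before summarizing, assuming each edit carries a boolean reverted flag. The row shape (ip, day, device, was_reverted) and the sample values are hypothetical.

```python
from collections import defaultdict


def distinct_agents_by_revert_status(rows):
    """Return {was_reverted: [distinct device count per (ip, day), ...]}.

    rows: iterable of (ip, day, device, was_reverted) tuples (assumed shape).
    """
    devices = defaultdict(set)
    for ip, day, device, was_reverted in rows:
        devices[(bool(was_reverted), ip, day)].add(device)

    counts = {True: [], False: []}
    for (was_reverted, _ip, _day), seen in devices.items():
        counts[was_reverted].append(len(seen))
    return counts


# Made-up rows: one IP, one day, three devices, one reverted edit.
rows = [
    ("192.0.2.1", "2024-01-01", "device-1", False),
    ("192.0.2.1", "2024-01-01", "device-2", False),
    ("192.0.2.1", "2024-01-01", "device-3", True),
]
split = distinct_agents_by_revert_status(rows)
print(split[False], split[True])  # [2] [1]
```

Each list can then be summarized with the same maximum and percentile statistics as the combined data.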

By country

Goal
To analyze the distribution of the number of user agents by country.

Data
We analyzed English Wikipedia using data collected in January 2024. Client hints are included in the analysis. The results are in the data sheet User_agent_per_IP_GEO_country.

Takeaways
In 10 countries, the maximum equals or exceeds 6, the current cap. We recommend increasing the limit to minimize the impact on editors in those countries.
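
The comparison against the cap can be sketched as below. The function name is illustrative, and the country codes and values are placeholders, not figures from the data sheet.

```python
def countries_at_or_above_cap(max_by_country, cap=6):
    """Return countries whose per-IP daily maximum reaches or exceeds the cap."""
    return sorted(
        country for country, maximum in max_by_country.items() if maximum >= cap
    )


# Placeholder values for illustration only.
example_maxima = {"AA": 8, "BB": 6, "CC": 3}
print(countries_at_or_above_cap(example_maxima))  # ['AA', 'BB']
```

Using >= rather than > matches the wording above: a country whose maximum equals the cap is already at the point where one more device would be blocked.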

Code location

stat1010: /home/jiawang/share/T357771_python_v20240320.ipynb

Remaining tasks:

To add the following variations:

  • exclude/include obvious bots by looking at user agent data
  • look at edit attempts (clicking the edit button), not just edits