
WE4.3.1 - IP traffic
Closed, Resolved · Public

Assigned To
Authored By
XiaoXiao-WMF
Jun 25 2024, 11:38 AM

Description

Hypothesis statement: If we apply machine learning and data analysis tools to webrequest logs during known attacks, we'll be able to identify, with more than 80% precision, abusive IP addresses sending largely malicious traffic. We can then rate-limit them at the edge, improving reliability for our users.

This is a continuation of the collaboration between T&S and Research: https://phabricator.wikimedia.org/T353547

Event Timeline

Weekly update

  • Kick-off meeting scheduled for next week

Great kick-off meeting today!

Pablo, as promised, a recent silent-drop dataset (covering approx the previous month) is available at stat1009.eqiad.wmnet:/home/cdanis/silentdrop.2024-07-24.tsv.gz

Thanks @CDanis! I'll take a look at it and share the results with you all.

Weekly update (some notes from the kick-off meeting)

  • The expected deliverable will be an algorithm for an automated system that identifies malicious IPs (improving on existing intuition-based approaches), e.g., probably a list with a confidence value.
  • The system will not be real-time; the only part that might have a real-time component in the future is detecting when an attack starts and what its signature is.
  • Attack window detection is not a necessary feature, but it is appreciated.
  • API requests will be ignored because previous analysis revealed a large presence of good-faith bots (in the future, we could track concurrency).


@CDanis, I checked the webrequests of the most frequent silenced IPs and got results like:

(hidden IP) on 2024-06-26

| user_agent | uri_host | uri_path | count |
| --- | --- | --- | --- |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/check/tex | 358543 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/505a4ceef454c69dffd23792c84b90f488543743 | 39334 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/968b3d0e5694f5e14e2183af513213bf76865921 | 25655 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/4c5c34250859b6f6d2a77b4e8a2ceaa90638076d | 3100 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/5bcafc14a200f12f38c2dba7b3e735dbcb8c079e | 2480 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/e3894b9f2f8b04049d4e57d99acd665211de9813 | 1786 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/32465cf48f67dedbb740c47ecf2c855aecfc50af | 1339 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/2aae8864a3c1fec9585261791a809ddec1489950 | 1127 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/86a67b81c2de995bd608d5b2df50cd8cd7d92455 | 1121 |
| MediaWiki/1.40.1 | wikimedia.org | /api/rest_v1/media/math/render/mml/ffd2487510aa438433a2579450ab2b3d557e5edc | 1076 |

(hidden IP) on 2024-07-08

| user_agent | uri_host | uri_path | count |
| --- | --- | --- | --- |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/check/tex | 44585 |
| MediaWiki/1.38.2 ForeignAPIRepo/2.1 | commons.wikimedia.org | /w/api.php | 9091 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/961d67d6b454b4df2301ac571808a3538b3a6d3f | 1317 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/4232c9de2ee3eec0a9c0a19b15ab92daa6223f9b | 1148 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/add78d8608ad86e54951b8c8bd6c8d8416533d20 | 768 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/4b0bfb3769bf24d80e15374dc37b0441e2616e33 | 729 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/68baa052181f707c662844a465bfeeb135e82bab | 580 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/b4dc73bf40314945ff376bd363916a738548d40a | 559 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/7daff47fa58cdfd29dc333def748ff5fa4c923e3 | 530 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/f82cade9898ced02fdd08712e5f0c0151758a0dd | 500 |

(hidden IP) on 2024-07-03

| user_agent | uri_host | uri_path | count |
| --- | --- | --- | --- |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/check/tex | 44585 |
| MediaWiki/1.38.2 ForeignAPIRepo/2.1 | commons.wikimedia.org | /w/api.php | 9091 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/961d67d6b454b4df2301ac571808a3538b3a6d3f | 1317 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/4232c9de2ee3eec0a9c0a19b15ab92daa6223f9b | 1148 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/add78d8608ad86e54951b8c8bd6c8d8416533d20 | 768 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/4b0bfb3769bf24d80e15374dc37b0441e2616e33 | 729 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/68baa052181f707c662844a465bfeeb135e82bab | 580 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/b4dc73bf40314945ff376bd363916a738548d40a | 559 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/7daff47fa58cdfd29dc333def748ff5fa4c923e3 | 530 |
| MediaWiki/1.38.2 | en.wikipedia.org | /api/rest_v1/media/math/render/mml/f82cade9898ced02fdd08712e5f0c0151758a0dd | 500 |

I understand that most of the requests target the Math extension. Is this expected behavior, and is there any explanation for it?

No, this is not expected behavior. I'm guessing these are external MediaWiki instances using our "Mathoid as a service". I also see some requests there for "Instant Commons".

Thanks for finding this.

I'll change the silent-drop protection to ignore these calls.

Oh, and, silent-drop occurring during attack time windows should still make for a good signal, I think.

Change #1059126 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Exclude some requests from concurrency tracking

https://gerrit.wikimedia.org/r/1059126

Thanks @CDanis! One consistent pattern in the latest analysis is that attack time windows with very few abusive IPs involved (I consider an IP abusive if it makes more than 1 request per second) barely contain any IPs reported in AbuseIPDB (the best ground truth I have):

[16 images]
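
The working definition used above (an IP is abusive if it makes more than 1 request per second) can be sketched as follows. This is a minimal illustration, not the notebook code: the `(ip, timestamp)` record layout, the function name `flag_abusive_ips`, and the choice to average the rate over the whole attack window are all assumptions.

```python
from collections import Counter
from datetime import datetime


def flag_abusive_ips(requests, window_start, window_end, max_rate=1.0):
    """Return the IPs whose average rate in the window exceeds max_rate (req/s).

    `requests` is an iterable of (ip, timestamp) pairs. This layout is an
    assumption made for the example, not the webrequest schema.
    """
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0:
        raise ValueError("window_end must be after window_start")
    # Count only the requests that fall inside the half-open window.
    counts = Counter(
        ip for ip, ts in requests if window_start <= ts < window_end
    )
    return {ip for ip, n in counts.items() if n / window_seconds > max_rate}
```

With a one-minute window, an IP sending 120 requests is flagged, while one sending exactly 60 is not, because the comparison is strict.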

Below is an exploration of the first 3 "attacks".

Attack time window 1

The most common UA is Amazonbot (https://developer.amazon.com/support/amazonbot), but I also found other well-known bots (https://openai.com/gptbot, http://yandex.com/bots, http://mj12bot.com, http://www.bing.com/bingbot.htm, http://www.google.com/bot.html).

Attack time window 2

The most common UAs among the 4 abusive IPs are Timpibot (http://www.timpi.io) and ClaudeBot (claudebot@anthropic.com), along with the bots listed above (https://developer.amazon.com/support/amazonbot, https://openai.com/gptbot, http://yandex.com/bots, http://mj12bot.com, http://www.bing.com/bingbot.htm).

Attack time window 3

This is very similar to the previous case: many requests from ClaudeBot (claudebot@anthropic.com) and some from other bots such as https://developer.amazon.com/support/amazonbot.

Discussion

  • ClaudeBot made over 40K requests with cache_status in ('miss','pass','int-front') between 2024-07-26:20 and 2024-07-26:21.
  • I also think that silent-drop occurring during attack time windows might be a good signal... @CDanis, could you share with me an expanded version of the silentdrop dataset covering the entire month of July to see if these bots triggered any silentdrop action?

Absolutely, available at stat1009.eqiad.wmnet:/home/cdanis/silentdrop.2024-08-02.tsv.gz
It covers much of June and all of July, across all cache hosts.

Change #1059126 merged by CDanis:

[operations/puppet@production] haproxy: exclude some requests from concurrency tracking

https://gerrit.wikimedia.org/r/1059126

Weekly update:

  • Exploration of a subset of the most abusive IPs in July 2024 ('2001:67c:2f4c:2::xxx', '149.154.157.xxx', '2001:67c:2f4c:2::xxx', '192.116.36.xxx', '79.16.139.xxx', '75.89.152.xxx', '174.101.160.xxx'). Only 75.89.152.xxx was found in stat1009.eqiad.wmnet:/home/cdanis/silentdrop.2024-08-02.tsv.gz: { "time": "2024-07-09T11:59:28", "hostname": "cp1106", "prefix": "silent-drop_for_300s", "data": "75.89.152.xxx"}.
  • For all these IPs their webrequest time series (hourly granularity) were retrieved:
| 2001:67c:2f4c:2::xxx | 149.154.157.xxx | 2001:67c:2f4c:2::xxx |
| --- | --- | --- |
| image.png (428×578 px, 30 KB) | image.png (427×578 px, 66 KB) | image.png (427×578 px, 45 KB) |

| 192.116.36.xxx | 79.16.139.xxx | 174.101.160.xxx |
| --- | --- | --- |
| image.png (427×578 px, 45 KB) | image.png (427×569 px, 20 KB) | image.png (427×578 px, 37 KB) |

| 75.89.152.xxx |
| --- |
| image.png (427×578 px, 31 KB) |
  • As I understand from https://phabricator.wikimedia.org/T353547#9654460, when an IP address surpasses a threshold of concurrently-executing requests against our CDN, we temporarily block that IP address and stop replying to anything at all from it for 5 minutes. I am now running a notebook to analyze the webrequests of those IPs that do appear in the silent-drop datasets, to identify differences in their time series.
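
To make the mechanism described above concrete, here is a toy model of a concurrency-based silent drop. It is only a sketch under stated assumptions: the class and method names are hypothetical, the real protection lives in the CDN configuration and is not implemented this way, and the concurrency threshold is a required parameter because its actual value is not given in this task. Only the 300-second block duration comes from the dataset prefix ("silent-drop_for_300s").

```python
class SilentDropTracker:
    """Toy model of a concurrency-based silent drop (hypothetical names)."""

    def __init__(self, max_concurrent, block_seconds=300):
        self.max_concurrent = max_concurrent  # hypothetical threshold
        self.block_seconds = block_seconds    # 300 s, as in the dataset prefix
        self.in_flight = {}      # ip -> number of requests currently executing
        self.blocked_until = {}  # ip -> time (in seconds) when the block ends

    def start_request(self, ip, now):
        """Return True if the request is served, False if silently dropped."""
        if now < self.blocked_until.get(ip, 0.0):
            return False
        current = self.in_flight.get(ip, 0) + 1
        if current > self.max_concurrent:
            # Threshold exceeded: stop replying to this IP for block_seconds.
            self.blocked_until[ip] = now + self.block_seconds
            return False
        self.in_flight[ip] = current
        return True

    def finish_request(self, ip):
        """Record that one in-flight request from this IP has completed."""
        if self.in_flight.get(ip, 0) > 0:
            self.in_flight[ip] -= 1
```

Note that the check counts requests in flight at the same moment, not requests per unit of time.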

Note: I have added "xxx" suffixes to the IPs for privacy reasons on this public ticket.

Weekly update:

  • Below are the time series of the top 3 IPs in silentdrop.2024-08-02.tsv.gz (the most blocked ones). Although the silent-drop threshold is based on the number of concurrently-executing requests rather than on the volume of requests in a period of time, I am a bit surprised that their request volume is noticeably lower than that of the abusive IPs shown in my previous message (which were not affected by the silent-drop protection).
| 70.32.23.xxx | 2804:1454:1004:100::xxx | 2001:1600:4:13:1a66:daff:fea5:xxx |
| --- | --- | --- |
| image.png (432×569 px, 122 KB) | image.png (427×569 px, 69 KB) | image.png (427×569 px, 93 KB) |
  • Below is the ratio of IPs reported in AbuseIPDB as a function of the minimum number of silent-drop blocks an IP has received (i.e., IPs with at least that many blocks). The higher the number of blocks, the higher the ratio, up to a certain value (8) beyond which the trend reverses. This could be explained by the fact that the most-blocked IPs often correspond to (a) good-faith external MediaWiki instances using our "Mathoid as a service" (no longer affected by this protection), and (b) IPs sending requests to APIs such as the Commons API, typically used by good-faith volunteer bots. Out of curiosity, I checked a sample of IPs with at least 8 silent-drop blocks (the value at which the AbuseIPDB ratio peaks, at about 80%). Many cases involved requests to "www.wikipedia.org/" on July 31 and to "en.m.wikipedia.org/wiki/united_states" on July 15. These were not found by the in-progress abusive IP detection approach because it only considers cache-busting requests, i.e., those where cache_status IN ("miss","pass","int-front"), and these requests got hit-front. Given that only a few thousand IPs have multiple silent-drop blocks (fewer than 1K have at least 5), it is worth considering checking these IPs directly on AbuseIPDB in order to extend our identification to abusive IPs that do not perform cache busting.

image.png (298×513 px, 19 KB)

  • With the in-progress abusive IP detection approach, I identified the following attacks and IPs (with over 100 requests):
June 2024: [3 images]
July 2024: [9 images]
  • I created datasets for each month with features for every (attack, IP) pair including request_count, response_size_mean, uri_query_count, not_null_referer_ratio, user_agent_count, not_null_accept_language_ratio, is_pageview_ratio, desktop_ratio, mobile_web_ratio, mobile_app_ratio, not_null_referer_class_ratio, spider_ratio, abuseipdb_isReported. Then, first experiments with a standard XGBoost model were run:
| train: June 0.8 split | train: July 0.8 split | train: June+July 0.8 split | train: June |
| test: June 0.2 split | test: July 0.2 split | test: June+July 0.2 split | test: July |
| --- | --- | --- | --- |
| image.png (410×442 px, 16 KB) | image.png (410×442 px, 16 KB) | image.png (410×442 px, 18 KB) | image.png (412×451 px, 19 KB) |
  • Precision and recall are high when training and testing on splits of the same dataset (June, July, June+July), but lower when training on the June dataset and testing on the July one. Therefore, next steps will focus on further experimentation: dataset preparation (including Spur metadata and August IP data), feature engineering, model settings, etc.
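
As an illustration of the per-(attack, IP) dataset described above, the sketch below aggregates raw request rows into a few of the listed features. It is not the notebook code: the row layout, the `attack_id` key, the function name `build_features`, and the reading of the `*_count` features as counts of distinct values are assumptions made for the example.

```python
from collections import defaultdict


def build_features(rows):
    """Aggregate request rows into one feature dict per (attack_id, ip) pair.

    Each row is a dict with the keys attack_id, ip, response_size, uri_query,
    referer, user_agent and is_pageview (an assumed layout for illustration).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["attack_id"], row["ip"])].append(row)

    features = {}
    for key, reqs in groups.items():
        n = len(reqs)
        features[key] = {
            "request_count": n,
            "response_size_mean": sum(r["response_size"] for r in reqs) / n,
            # Assumption: the *_count features count distinct values.
            "uri_query_count": len({r["uri_query"] for r in reqs}),
            "user_agent_count": len({r["user_agent"] for r in reqs}),
            "not_null_referer_ratio": sum(r["referer"] is not None for r in reqs) / n,
            "is_pageview_ratio": sum(bool(r["is_pageview"]) for r in reqs) / n,
        }
    return features
```

The remaining ratio features in the list (desktop, mobile web, mobile app, spider) would follow the same pattern as `is_pageview_ratio`.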

Weekly update:

  • A new XGBoost model has been trained with data from June and July and tested with data from August. Results already meet the requirement of the hypothesis statement (precision above 80% for the abusive class, True):
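| | precision | recall | f1-score |
| --- | --- | --- | --- |
| False | 0.715686 | 0.947493 | 0.815435 |
| True | 0.92448 | 0.63068 | 0.749828 |
| accuracy | 0.787582 | 0.787582 | 0.787582 |
| macro avg | 0.820083 | 0.789086 | 0.782632 |
| weighted avg | 0.821075 | 0.787582 | 0.78232 |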
  • As one could expect, request_count and response_size_mean are the most important features for predicting abusive IPs.

image.png (393×586 px, 32 KB)

  • I have also assessed how the model improves further when only IPs with a minimum number of requests are considered: the ratio of false negatives decreases.
| request_count ≥ 0; 100% IPs | request_count ≥ 10; 85% IPs | request_count ≥ 100; 73% IPs | request_count ≥ 1K; 55% IPs | request_count ≥ 10K; 21% IPs | request_count ≥ 100K; 2% IPs |
| --- | --- | --- | --- | --- | --- |
| image.png (410×451 px, 18 KB) | image.png (410×451 px, 17 KB) | image.png (410×451 px, 18 KB) | image.png (410×451 px, 19 KB) | image.png (411×442 px, 17 KB) | image.png (413×434 px, 15 KB) |
  • All the notebooks are available at stat1009.eqiad.wmnet:/home/paragon/ip-traffic such as stat1009.eqiad.wmnet:/home/paragon/ip-traffic/3_analysis_2024_08.ipynb for data collection and stat1009.eqiad.wmnet:/home/paragon/ip-traffic/4_xgb_2024_train_0607_test_08.ipynb for model training/testing. @CDanis, it would be great if you could let me know how you would need me to transfer the code/model.

Reminder: our ground-truth implies that an IP is abusive if it is reported in AbuseIPDB. However, I showed in previous updates some (very) abusive IPs that were not reported in that database. As a consequence, the real number of false positives should be lower than the one I am able to compute :)
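
As a quick consistency check on the figures reported for the August test set, the F1 scores can be recomputed from the per-class precision and recall values. The snippet below only restates the standard F1 definition (harmonic mean of precision and recall); the numbers are copied from the report, and the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class (precision, recall) as reported for the August test set.
report = {
    "False": (0.715686, 0.947493),
    "True": (0.92448, 0.63068),
}

f1 = {label: f1_score(p, r) for label, (p, r) in report.items()}
macro_f1 = sum(f1.values()) / len(f1)

for label, value in f1.items():
    print(f"{label}: f1 = {value:.4f}")
print(f"macro avg f1 = {macro_f1:.4f}")
```

The recomputed values agree with the reported f1-score column to within rounding. The precision for the True (abusive) class, about 0.92, is the figure that is compared against the 80% target in the hypothesis statement.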

@CDanis we will close this task. Please reach out to me if you need further assistance.

Questions:

  1. What form does this model take?
  2. Where is the code?
  3. How can MW code use this model?
  4. Where's the documentation that explains how to retrain the model / deploy it?

@SCherukuwada

  1. We have provided two approaches: (1) hard-coded logic and (2) an ML model. The two approaches arrive at very similar performance metrics, which suggests that the model can be seen as merely a confirmation of the logic-based approach: if the ML model does not outperform the hard-coded logic, the problem at hand may not need an ML-based approach. Should we observe other behaviors in the future, we could revisit this.
  2. From the Aug 30 update: "All the notebooks are available at stat1009.eqiad.wmnet:/home/paragon/ip-traffic such as stat1009.eqiad.wmnet:/home/paragon/ip-traffic/3_analysis_2024_08.ipynb for data collection and stat1009.eqiad.wmnet:/home/paragon/ip-traffic/4_xgb_2024_train_0607_test_08.ipynb"
  3. We suggest that, for this application, T&S start by implementing the hard-coded logic. The reasons are:
    • Performance metrics: as stated in the first point, the performance of the logic and of the model does not differ.
    • Maintenance: the ML model would require periodic retraining, while the logic stays unchanged going forward. Given the similar performance metrics, we recommend using the logic-based approach for now.
    • Deployment: we anticipate deployment overhead, as the code currently lives only in the research environment and may not meet the required deployment standards.
  4. If we decide to use the model (instead of the logic), please see notebook stat1009.eqiad.wmnet:/home/paragon/ip-traffic/4_xgb_2024_train_0607_test_08.ipynb for model training/testing.

I will let @CDanis provide an update on the engineering adoption side of the project.

Thank you.

How do research folks handle version control? Is there a repo where this should be checked in? If there isn't, could you please choose one?

Questions:

  1. How can MW code use this model?

Given our current goal is to integrate this data in concurrency limiting at the TLS terminator layer of our CDN, we haven't considered exposing the data to MediaWiki. But, if I had to think of serving those data to MediaWiki, I'd probably make them available via ipoid with its own ingestion pipeline.

@SCherukuwada thank you for your questions.

@XiaoXiao-WMF thanks for your response, let me provide two additional comments:

  • ML was applied to formally verify the hypothesis statement. That said, as shown in T368389#10105538, request_count and response_size_mean are the most important features for predicting abusive IPs (i.e., IPs already reported in AbuseIPDB). Therefore, for integration, I would implement a threshold-based approach with those two features in place.
  • As for the repository, I will discuss today with @fkaelin the best way to make existing code available in Gitlab.
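
A threshold-based check on the two features mentioned above could look roughly like the sketch below. It is purely illustrative: the `ThresholdRule` name and every threshold value are hypothetical, and because this task does not state in which direction response_size_mean separates abusive from benign IPs, the rule takes a response-size range rather than assuming a direction. Real cut-offs would have to be derived from the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule over request_count and response_size_mean."""

    min_request_count: int
    response_size_min: float
    response_size_max: float

    def matches(self, request_count, response_size_mean):
        """Return True when both features fall inside the configured bounds."""
        return (
            request_count >= self.min_request_count
            and self.response_size_min <= response_size_mean <= self.response_size_max
        )


# Placeholder values chosen only to make the example runnable.
rule = ThresholdRule(
    min_request_count=1000,
    response_size_min=0.0,
    response_size_max=5000.0,
)
```

An IP with 20,000 requests and a mean response size of 1,200 bytes would match this placeholder rule, while one with 50 requests would not.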

The repo with the notebooks created for this project is https://gitlab.wikimedia.org/repos/research/abusive-ips (for privacy reasons, the data is not publicly released and all IPs in notebooks have been masked).