Add "Damn Small XSS Scanner" (DSXS) to list of known bots
Closed, Resolved, Public

Description

In T153699, it was found that this automated tool (https://github.com/stamparm/DSXS) generated a lot of artificial pageviews. Looking at more recent data across all projects, it accounted for up to 0.91% of our total "human" (agent_type = 'user') pageviews in December 2016, between 0.36% and 0.66% on January 6, 2017, and still between 0.2% and 0.4% on February 6, 2017.

SELECT day, ROUND(100* SUM(IF(user_agent = 'Damn Small XSS Scanner (DSXS) < 100 LoC (Lines of Code)', 1, 0))/SUM(1), 2) AS DSXS_percentage
FROM wmf.webrequest
WHERE year = 2016 AND month = 12
AND agent_type = 'user'
AND is_pageview = TRUE
GROUP BY day
ORDER BY day ASC
LIMIT 10000;

day     dsxs_percentage
8       0.78
9       0.84
10      0.91
11      0.59
12      0.52
13      0.63
14      0.61
15      0.36
16      0.44
17      0.42
18      0.3
19      0.45
20      0.3
21      0.33
22      0.39
23      0.38
24      0.38
25      0.3
26      0.27
27      0.26
28      0.26
29      0.27
30      0.29
31      0.36
24 rows selected (11881.762 seconds)

SELECT hour, ROUND(100* SUM(IF(user_agent = 'Damn Small XSS Scanner (DSXS) < 100 LoC (Lines of Code)', 1, 0))/SUM(1), 1) AS DSXS_percentage 
FROM wmf.webrequest
WHERE year = 2017 AND month = 2 AND day = 6
AND agent_type = 'user'
AND is_pageview = TRUE
GROUP BY hour
ORDER BY hour ASC
LIMIT 10000;

hour	dsxs_percentage
0	0.4
1	0.3
2	0.3
3	0.3
4	0.3
5	0.4
6	0.4
7	0.4
8	0.3
9	0.3
10	0.3
11	0.2
12	0.3
13	0.3
14	0.2
15	0.2
16	0.3
17	0.2
18	0.2
19	0.2
20	0.2
21	0.3
22	0.3
23	0.3
24 rows selected (808.264 seconds)

SELECT hour, ROUND(100* SUM(IF(user_agent = 'Damn Small XSS Scanner (DSXS) < 100 LoC (Lines of Code)', 1, 0))/SUM(1), 2) AS DSXS_percentage 
FROM wmf.webrequest
WHERE year = 2017 AND month = 1 AND day = 6
AND agent_type = 'user'
AND is_pageview = TRUE
GROUP BY hour
ORDER BY hour ASC
LIMIT 10000;

hour	dsxs_percentage
0	0.59
1	0.63
2	0.61
3	0.59
4	0.64
5	0.66
6	0.64
7	0.64
8	0.36
9	0.5
10	0.51
11	0.4
12	0.43
13	0.41
14	0.44
15	0.46
16	0.37
17	0.4
18	0.42
19	0.42
20	0.44
21	0.42
22	0.47
23	0.51
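
For readers without access to the cluster, the aggregation shared by the queries above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `dsxs_percentage`, the dict-based row format and the sample rows are made up here, and the rows are assumed to be already filtered to agent_type = 'user' and is_pageview = TRUE.

```python
from collections import defaultdict

# Exact user agent string matched in the queries above.
DSXS_UA = "Damn Small XSS Scanner (DSXS) < 100 LoC (Lines of Code)"


def dsxs_percentage(rows, key, digits=2):
    """Return {group: percentage of rows in that group whose user agent is DSXS_UA}.

    Mirrors ROUND(100 * SUM(IF(user_agent = ..., 1, 0)) / SUM(1), digits)
    with GROUP BY key and ORDER BY key ASC.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        if row["user_agent"] == DSXS_UA:
            hits[group] += 1
    return {g: round(100 * hits[g] / totals[g], digits) for g in sorted(totals)}


# Made-up sample rows (hypothetical, not real webrequest data).
sample = [
    {"day": 1, "user_agent": DSXS_UA},
    {"day": 1, "user_agent": "Mozilla/5.0"},
    {"day": 1, "user_agent": "Mozilla/5.0"},
    {"day": 1, "user_agent": "Mozilla/5.0"},
    {"day": 2, "user_agent": "Mozilla/5.0"},
]

print(dsxs_percentage(sample, "day"))  # {1: 25.0, 2: 0.0}
```

Passing "hour" instead of "day" as the key would correspond to the hourly variants of the query.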
Tbayer created this task. Feb 8 2017, 5:24 AM

Change 336575 had a related patch set uploaded (by HaeB):
Add DSXS (self-identified bot) to bot regex

https://gerrit.wikimedia.org/r/336575

JAllemandou moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Change 336575 merged by jenkins-bot:
Add DSXS (self-identified bot) to bot regex

https://gerrit.wikimedia.org/r/336575
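
As a rough illustration of what adding a self-identified bot to a bot regex amounts to, here is a minimal Python sketch. The pattern below is hypothetical and is not the regex from the Gerrit change, which is not reproduced in this task; the function name `agent_type` and the "spider" label for non-user traffic are likewise assumptions made for the example.

```python
import re

# Hypothetical pattern for illustration only; NOT the actual regex
# touched by change 336575. "DSXS" is the newly added alternative.
BOT_PATTERN = re.compile(r"bot|spider|crawl|DSXS", re.IGNORECASE)

DSXS_UA = "Damn Small XSS Scanner (DSXS) < 100 LoC (Lines of Code)"


def agent_type(user_agent):
    """Classify a user agent string; the "spider" label is an assumption."""
    return "spider" if BOT_PATTERN.search(user_agent) else "user"


print(agent_type(DSXS_UA))                            # spider
print(agent_type("Mozilla/5.0 (X11; Linux x86_64)"))  # user
```

Once such a pattern matches the DSXS user agent, those requests no longer count toward agent_type = 'user' pageviews.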

Nuria closed this task as Resolved. Feb 14 2017, 5:03 PM

Thanks for the quick merge! To document the effect on total pageviews for later reference, here is a plot of the daily percentage (for the timespan where data was still available; same query as above):

Awesome, thanks @Tbayer :)

Tbayer added a comment. May 9 2017, 7:05 PM

PS, to record this here as a small footnote: I also double-checked and confirmed that these undetected bot requests happened on desktop only.

SELECT day, ROUND(100* SUM(IF(user_agent = 'Damn Small XSS Scanner (DSXS) < 100 LoC (Lines of Code)', 1, 0))/SUM(1), 2) AS DSXS_percentage
FROM wmf.webrequest
WHERE year = 2017 AND month = 2 AND day <=10
AND agent_type = 'user'
AND is_pageview = TRUE
AND access_method != 'desktop'
GROUP BY day
ORDER BY day ASC
LIMIT 10000;

day	dsxs_percentage
1	0.0
2	0.0
3	0.0
4	0.0
5	0.0
6	0.0
7	0.0
8	0.0
9	0.0
10	0.0
10 rows selected (1691.009 seconds)