
20K events by a single user in the span of 20 mins
Closed, Declined | Public | 1 Estimated Story Points

Description

Running the following query in Hive returns 20,814 rows, which is odd to say the least:

USE event;
SELECT * FROM citationusage
WHERE event.session_token="1d8be34f432aaad1" AND year=2018 AND month=7;

The events all happen on the same page, and the action is always the same: 'extClick' (a click on an external link). The code that logs these events looks OK.

The user agent isn't identified as a bot, yet so many requests are coming from this single user in a short span of time. The only explanation I currently have is that these events are generated by a bot masquerading as a user and clicking on all external links.

What do you think? Have you seen something similar with other EventLogging schemas?

This is not a single case. We see other similar cases with other session_tokens:

session_token        events
some session token   7817
some session token   6797
some session token   6115
some session token   5757
some session token   4246
some session token   4241
some session token   4171
some session token   4011
some session token   2479
some session token   2470
some session token   2117
some session token   1762
some session token   1755
some session token   1704
some session token   1663
some session token   1625
some session token   1520
some session token   1494
some session token   1270

Event Timeline

bmansurov edited subscribers, added: Milimetric; removed: Unknown Object (User).

What do you think? Have you seen something similar with other EventLogging schemas?

Hmm... that session token is too short, no? It should be 64 bits, like: 80dr048q6b0buqrf4b44u2i11qvthbi4

We have definitely seen JavaScript-capable bots in EL data, and bots mostly crawl Wikipedia, which is what your schema is registering. Now, in this case your sessionToken looks wrong (if you have not abridged it).

@Nuria we're using mediaWiki.user.sessionId() to get the session_token. The documentation says that this is a 64-bit integer in hex format. So I converted the above session_token (from the task description) to a number in decimal and got 2129045178431482577, which looks correct to me.
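For anyone who wants to repeat the conversion, here is a minimal sketch in Python. The helper name `session_token_to_int` is hypothetical (it is not part of MediaWiki), and the check assumes only what is stated above: the token is a hex string holding at most 64 bits, i.e. at most 16 hex characters.

```python
def session_token_to_int(token: str) -> int:
    """Parse a hex session token and check that it fits in 64 bits.

    Hypothetical helper for illustration only.
    """
    value = int(token, 16)  # raises ValueError on non-hex characters
    if len(token) > 16 or value.bit_length() > 64:
        raise ValueError(f"token {token!r} does not fit in 64 bits")
    return value


print(session_token_to_int("1d8be34f432aaad1"))  # 2129045178431482577
```

Note that the 32-character example quoted earlier would be rejected by `int(token, 16)`, since it contains characters outside 0-9 and a-f.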

@bmansurov you are totally right, 16 chars in length is correct, my apologies.

This comment was removed by Nuria.

Also, all content crawled by session 1d8be34f432aaad1 is about cookies and cookie policy.

Milimetric triaged this task as High priority.
Milimetric lowered the priority of this task from High to Medium.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric moved this task from Operational Excellence to Data Quality on the Analytics board.
Milimetric added a project: Analytics-Kanban.
Milimetric moved this task from Next Up to Done on the Analytics-Kanban board.
Nuria set the point value for this task to 1. (Aug 24 2018, 3:03 PM)

Will be closing the ticket, as this is lawful traffic even if not human.

@Nuria anything we can do to mark such user agents as 'bot'?

@bmansurov nothing easy I can think of at this time on the server side. This schema suffers much more than others from bot traffic, given that by definition it translates much of the "crawlers" clicking around into events. I think any filtering is going to have to happen at analysis time, and UAs with more than, say, 6 events per minute (1 every 10 secs) should probably be discarded.
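As a rough illustration of that analysis-time filter, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than part of the EventLogging pipeline: events are taken to be `(key, timestamp)` pairs where the key is a user agent (or session token), "per minute" is interpreted as per calendar minute, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

MAX_EVENTS_PER_MINUTE = 6  # threshold suggested in the comment above


def find_noisy_keys(events, max_per_minute=MAX_EVENTS_PER_MINUTE):
    """Return the keys that exceed the threshold in any calendar minute.

    `events` is an iterable of (key, timestamp) pairs; the key could be a
    user agent or a session token (an assumption for this sketch).
    """
    per_minute = Counter(
        (key, ts.replace(second=0, microsecond=0)) for key, ts in events
    )
    return {key for (key, _minute), n in per_minute.items() if n > max_per_minute}


def drop_noisy(events, max_per_minute=MAX_EVENTS_PER_MINUTE):
    """Discard every event belonging to a key flagged as noisy."""
    events = list(events)
    noisy = find_noisy_keys(events, max_per_minute)
    return [(key, ts) for key, ts in events if key not in noisy]


# Made-up sample data: one key bursts 10 events within a single minute,
# the other produces 3 events spread over 10 minutes.
base = datetime(2018, 7, 1, 12, 0, 0)
sample = [("bot-ua", base + timedelta(seconds=i)) for i in range(10)]
sample += [("human-ua", base + timedelta(minutes=5 * i)) for i in range(3)]

print(sorted(find_noisy_keys(sample)))  # ['bot-ua']
print(len(drop_noisy(sample)))  # 3
```

Bucketing by calendar minute keeps the sketch short, but a burst that straddles a minute boundary can slip under the threshold; a sliding window would be stricter.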

@Nuria do you expect this task to be addressed by the better bot detection approach you're implementing? If yes, I'd like to close it.

It will be initially deployed just for pageviews so not quite yet.

The heuristics would work, however, so we just need to think about how we would plug them into this pipeline.

Aklapper changed the task status from Stalled to Open. (Nov 1 2020, 9:15 PM)
Aklapper subscribed.

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons...

(Smallprint, as general orientation for task management:
If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks...Edit Subtasks.
If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead.
If this task is stalled on an upstream project, then the Upstream tag should be added.
If this task is out of scope and nobody should ever work on this, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

Resetting assignee (inactive account)

phuedx subscribed.

Being bold. The CitationUsage instrument was removed in September 2020 and disabled roughly a year before per T262349#6445226.