Page MenuHomePhabricator

Bot field in edits_hourly dataset ignores username
Closed, ResolvedPublic

Description

The edits_hourly dataset (available both in Hive and Druid) has a user_is_bot field which should take into account both whether the user was in the bot group and whether the user has "bot" in their name.

However, the creating code only takes into account the first item. The line ARRAY_CONTAINS(event_user_groups_historical, 'bot') AS user_is_bot should instead be SIZE(event_user_is_bot_by_historical) > 0 AS user_is_bot.

This has major impacts on users of edits_hourly, since it can distort non-bot edit counts (a key metric) by millions.

Event Timeline

fdans triaged this task as High priority.
fdans edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
fdans moved this task from Smart Tools for Better Data to Data Quality on the Analytics board.

it looks like ARRAY_CONTAINS is used in multiple places in the HQL. Is it used correctly in these places?

it looks like ARRAY_CONTAINS is used in multiple places in the HQL. Is it used correctly in these places?

Yes, I looked through the remaining 8 uses and they're all correct.

In this case, the problem wasn't the use of ARRAY_CONTAINS in itself; the code was using it correctly to test whether the editor was in the bot group at the time of the edit. Rather, it was forgetting to also add the test for the editor having "bot" in the username. It just happens that both of these tests can be done at once using the SIZE function.

Change 575545 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Fix bot identification

https://gerrit.wikimedia.org/r/575545

Makes sense, thanks for the fix Neil. For future reference, totally feel free to submit this in gerrit. We can review and test and deploy from there and it's likely to get on my radar faster if I'm behind on Phab pings (I was out when this was assigned to me so I missed it until now). And we definitely welcome contributors to the refinery repo.

And @Tnegrin - I love the thorough checking, we should talk if this dataset has been valuable and you have questions beyond it (ala my recent email)

The patch will be deployed next week since today's Friday.

Change 575545 merged by Milimetric:
[analytics/refinery@master] Fix bot identification

https://gerrit.wikimedia.org/r/575545

I was able to fast track it due to another deployment, this will take effect with the next snapshot which should happen next week.

Thanks for taking care of this, @Milimetric! Uploading patches to Gerrit takes me a while since I don't do it much and I have to reread the tutorial every time, but I'll be getting more familiar with Gerrit soon because of T245223 😊

Yes, this looks fixed. The code is correct now, and I also checked Turnilo's classification of user talk edits on the Vietnamese Wikipedia, where we originally noticed this issue. Almost all those edits are correctly classified as bot edits now.