
Update reportupdater to be able to query the new db cluster that will replace dbstore1002
Closed, Resolved · Public

Event Timeline

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Something fun I realized today: THIS IS REALLY HARD :) Because we explode the reports by wiki, but we don't differentiate that explosion from other explosions. So the code to do this is not only messy, it has to rely on a loose convention. We'll need to do this and test it thoroughly on existing reports.
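
To illustrate with a made-up config (written here as the equivalent Python dict, not actual job syntax): explode_by can explode a report along any dimension, and nothing marks which dimension, if any, holds wiki names:

```
# Hypothetical report config. The code has no way to tell that 'wiki_db'
# needs a per-wiki shard lookup while 'editor' is just a regular dimension.
explode_by = {
    'wiki_db': 'enwiki, dewiki, frwiki',
    'editor': 'visualeditor, wikitext',
}
```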

I went through all enabled reportupdater jobs I know of (I think we're missing some jobs that people run outside of puppet). I categorized them by how they need to adjust to the new cluster:

Need to dynamically point to different shards to connect to multiple wikis:

https://github.com/wikimedia/analytics-reportupdater-queries/tree/master/page-creation
https://github.com/wikimedia/analytics-limn-ee-data/tree/master/ee
https://github.com/wikimedia/analytics-limn-ee-data/tree/master/ee-beta-features
https://github.com/wikimedia/analytics-limn-language-data/blob/master/language
https://github.com/wikimedia/analytics-limn-edit-data/blob/master/edit-beta-features
https://github.com/wikimedia/analytics-limn-flow-data/blob/master/flow-beta-features

Need a basic config update to point to a single shard (see the sketch after this list):

https://github.com/wikimedia/analytics-limn-language-data/tree/master/published_cx2_translations
https://github.com/wikimedia/analytics-limn-language-data/tree/master/mt_engines
https://github.com/wikimedia/analytics-limn-language-data/blob/master/cx
https://github.com/wikimedia/analytics-limn-flow-data/blob/master/flow
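
For these single-shard jobs, the change is just pointing the database config at the right host. Roughly (shown as the equivalent Python structure; host, port and paths are placeholders, not the real topology):

```
# Hypothetical single-shard database entry; only host/port really change.
databases = {
    'el': {
        'host': 'db-shard.example.wmnet',    # placeholder for the new shard host
        'port': 3306,                        # placeholder port
        'creds_file': '/path/to/creds.cnf',  # placeholder credentials file
        'db': 'log',
    },
}
```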

Only need EventLogging:

https://github.com/wikimedia/analytics-reportupdater-queries/tree/master/pingback

The good news is that every job uses "wiki_db" as the parameter name, so we could adopt that as the convention; jobs that don't use it would just fail.
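
For the jobs in the first list above, the shard lookup could then key off that parameter. A rough sketch (the wiki-to-section mapping source, hostnames and ports are all made up here, not the actual new cluster layout):

```
# Illustrative only: resolve the replica shard from the 'wiki_db'
# explode value. A real mapping would come from something like the
# mediawiki-config dblists.
WIKI_TO_SECTION = {
    'enwiki': 's1',
    'dewiki': 's5',
    'wikidatawiki': 's8',
}

def resolve_shard(wiki_db):
    """Return (host, port) of the replica holding wiki_db."""
    section = WIKI_TO_SECTION.get(wiki_db, 's3')  # s3 as the catch-all default
    return ('dbstore-{}.example.wmnet'.format(section), 3306)  # placeholders

def run_exploded_report(run_query, wikis):
    # One connection per exploded wiki, each to its own shard.
    for wiki_db in wikis:
        host, port = resolve_shard(wiki_db)
        run_query(host=host, port=port, database=wiki_db)
```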

Change 494734 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater@master] Connect to the right shard when querying wiki databases

https://gerrit.wikimedia.org/r/494734

Change 494752 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater@master] Adapt codebase for new flake8 upgrade

https://gerrit.wikimedia.org/r/494752

Change 494734 merged by Milimetric:
[analytics/reportupdater@master] Connect to the right shard when querying wiki databases

https://gerrit.wikimedia.org/r/494734

Change 494752 merged by Mforns:
[analytics/reportupdater@master] Adapt codebase for new flake8 upgrade

https://gerrit.wikimedia.org/r/494752

Change 494967 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-flow-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494967

Change 494972 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater@master] Fix bug in auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494972

Change 494974 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-edit-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494974

Change 494972 merged by Milimetric:
[analytics/reportupdater@master] Fix bug in auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494972

Change 494967 merged by Milimetric:
[analytics/limn-flow-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494967

Change 494974 merged by Milimetric:
[analytics/limn-edit-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494974

Change 495199 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-language-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/495199

Change 495199 merged by Milimetric:
[analytics/limn-language-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/495199

Change 495239 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-ee-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/495239

Change 495239 merged by Mforns:
[analytics/limn-ee-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/495239

Change 495253 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Reenable reportupdater jobs after analytics replica shard fix

https://gerrit.wikimedia.org/r/495253

Hi @chelsyx and @Amire80

When fixing these reportupdater jobs to be able to connect to the new analytics replica db scheme, I found one job that unfortunately cannot work the way it was meant to with the new database structure: https://github.com/wikimedia/analytics-limn-language-data/tree/master/cx
It connects to one database (log) to gather stats, and then outputs them into a specific table in another database (staging). That staging table is then queried from Superset.
The problem is that the log database and the staging database are now on different hosts, so that job won't work any more, sorry.
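
To make the problem concrete: the job effectively does a single-connection select-insert across the two databases, which is impossible once they live on different hosts. The only direct workaround would be two connections, which reportupdater's one-connection-per-report model doesn't support. A minimal sketch (hosts, credentials, columns and the query itself are placeholders, not the real job):

```
# Two-connection workaround that RU currently can't express.
# Everything below (hosts, creds, columns) is illustrative only.
import pymysql

log_conn = pymysql.connect(host='log-host.example', db='log',
                           read_default_file='~/.my.cnf')
staging_conn = pymysql.connect(host='staging-host.example', db='staging',
                               read_default_file='~/.my.cnf')

with log_conn.cursor() as cur:
    # Guessing at the source query; the real one lives in the cx job.
    cur.execute("SELECT DATE(timestamp) AS day, COUNT(*) "
                "FROM ContentTranslationAbuseFilter GROUP BY day")
    rows = cur.fetchall()

with staging_conn.cursor() as cur:
    cur.executemany("INSERT INTO daily_abuse_filter_count (day, n) "
                    "VALUES (%s, %s)", rows)
staging_conn.commit()
```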

@chelsyx, Nuria told me you're working on a project named Toledo, which she thinks might provide stats similar to the ones this data set was providing, is that right?

@Amire80, if the answer to the above question is negative, maybe we could move this job to Hive, and insert the results of the query into your database instead of staging? Then we could load that data into Druid, or try to query it from Superset as well...

Thanks!

Change 495253 merged by Ottomata:
[operations/puppet@production] Reenable reportupdater jobs after analytics replica shard fix

https://gerrit.wikimedia.org/r/495253

> @chelsyx, Nuria told me you're working on a project named Toledo, which she thinks might provide stats similar to the ones this data set was providing, is that right?

@mforns No, the Toledo project is using neither the output dataset (daily_abuse_filter_count) nor the source table (ContentTranslationAbuseFilter). The EventLogging schema for Toledo is ExternalGuidance -- there are some similar field names, but it's a different MediaWiki extension and it serves a different purpose. Please see T212414 and the report I'm working on for more details.

@chelsyx thanks for the clarification!

@Amire80 it seems, then, that we'll have to find another solution...

My only idea would be to adapt the RU job to run in Hive. The query would remain basically the same, but we'd need to wrap it in a simple bash script (that's the way RU queries Hive for now). Then, once the data is in a Hive table, we could try to query it from Superset, or load it into Druid to browse it with Turnilo.

Let me know if this makes sense, and I will start doing some tests to see whether this solution is feasible. I'm sorry that the initial idea of using RU to select-insert into another table stopped working because of the dbstore changes; it's something I should have seen beforehand.
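
Roughly, the wrapper would be an executable that receives the date range and prints TSV back to RU. Sketched in Python here rather than bash, with the Hive table, field names and invocation as assumptions I'd still need to verify:

```
#!/usr/bin/env python
# Rough sketch of a Hive-backed RU script report. RU would execute this with
# the report's start and end dates; table and column names are placeholders.
import subprocess
import sys

start_date, end_date = sys.argv[1], sys.argv[2]  # e.g. 2019-03-01 2019-03-08

HQL = """
SELECT TO_DATE(dt) AS day, COUNT(*) AS abuse_filter_count
FROM event.contenttranslationabusefilter  -- hypothetical Hive table
WHERE dt >= '{start}' AND dt < '{end}'
GROUP BY TO_DATE(dt)
""".format(start=start_date, end=end_date)

# -S silences Hive's logging; result rows come out tab-separated, which is
# what RU expects on stdout.
result = subprocess.check_output(['hive', '-S', '-e', HQL])
sys.stdout.write('day\tabuse_filter_count\n')
sys.stdout.write(result.decode('utf-8'))
```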