
Update reportupdater to be able to query the new db cluster that will replace dbstore1002
Closed, Resolved · Public

Event Timeline

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Something fun I realized today: THIS IS REALLY HARD :) Because we explode the reports by wiki, but we don't differentiate that explosion from other explosions. So the code to do this is not only messy, it has to rely on a loose convention. We'll need to do this and test it thoroughly on existing reports.
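
To illustrate with a made-up config (written here as the equivalent Python dict, not actual job syntax): explode_by can explode a report along any dimension, and nothing marks which dimension, if any, holds wiki names:

```
# Hypothetical report config. The code has no way to tell that 'wiki_db'
# needs a per-wiki shard lookup while 'editor' is just a regular dimension.
explode_by = {
    'wiki_db': 'enwiki, dewiki, frwiki',
    'editor': 'visualeditor, wikitext',
}
```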

I went through all enabled reportupdater jobs I know of (I think we're missing some jobs that people run outside of puppet). I categorized them by how they need to adjust to the new cluster:

Need to dynamically point to different shards to connect to multiple wikis:

https://github.com/wikimedia/analytics-reportupdater-queries/tree/master/page-creation
https://github.com/wikimedia/analytics-limn-ee-data/tree/master/ee
https://github.com/wikimedia/analytics-limn-ee-data/tree/master/ee-beta-features
https://github.com/wikimedia/analytics-limn-language-data/blob/master/language
https://github.com/wikimedia/analytics-limn-edit-data/blob/master/edit-beta-features
https://github.com/wikimedia/analytics-limn-flow-data/blob/master/flow-beta-features

Need a basic config update to point to a single shard (see the sketch after this list):

https://github.com/wikimedia/analytics-limn-language-data/tree/master/published_cx2_translations
https://github.com/wikimedia/analytics-limn-language-data/tree/master/mt_engines
https://github.com/wikimedia/analytics-limn-language-data/blob/master/cx
https://github.com/wikimedia/analytics-limn-flow-data/blob/master/flow
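
For these single-shard jobs, the change is just pointing the database config at the right host. Roughly (shown as the equivalent Python structure; host, port and paths are placeholders, not the real topology):

```
# Hypothetical single-shard database entry; only host/port really change.
databases = {
    'el': {
        'host': 'db-shard.example.wmnet',    # placeholder for the new shard host
        'port': 3306,                        # placeholder port
        'creds_file': '/path/to/creds.cnf',  # placeholder credentials file
        'db': 'log',
    },
}
```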

Only need EventLogging:

https://github.com/wikimedia/analytics-reportupdater-queries/tree/master/pingback

The good news is that every job uses "wiki_db" as the parameter name, so we could adopt that as the convention; jobs that don't use it would just fail.
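
For the jobs in the first list above, the shard lookup could then key off that parameter. A rough sketch (the wiki-to-section mapping source, hostnames and ports are all made up here, not the actual new cluster layout):

```
# Illustrative only: resolve the replica shard from the 'wiki_db'
# explode value. A real mapping would come from something like the
# mediawiki-config dblists.
WIKI_TO_SECTION = {
    'enwiki': 's1',
    'dewiki': 's5',
    'wikidatawiki': 's8',
}

def resolve_shard(wiki_db):
    """Return (host, port) of the replica holding wiki_db."""
    section = WIKI_TO_SECTION.get(wiki_db, 's3')  # s3 as the catch-all default
    return ('dbstore-{}.example.wmnet'.format(section), 3306)  # placeholders

def run_exploded_report(run_query, wikis):
    # One connection per exploded wiki, each to its own shard.
    for wiki_db in wikis:
        host, port = resolve_shard(wiki_db)
        run_query(host=host, port=port, database=wiki_db)
```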

Change 494734 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater@master] Connect to the right shard when querying wiki databases

https://gerrit.wikimedia.org/r/494734

Change 494752 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater@master] Adapt codebase for new flake8 upgrade

https://gerrit.wikimedia.org/r/494752

Change 494734 merged by Milimetric:
[analytics/reportupdater@master] Connect to the right shard when querying wiki databases

https://gerrit.wikimedia.org/r/494734

Change 494752 merged by Mforns:
[analytics/reportupdater@master] Adapt codebase for new flake8 upgrade

https://gerrit.wikimedia.org/r/494752

Change 494967 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-flow-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494967

Change 494972 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater@master] Fix bug in auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494972

Change 494974 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-edit-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494974

Change 494972 merged by Milimetric:
[analytics/reportupdater@master] Fix bug in auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494972

Change 494967 merged by Milimetric:
[analytics/limn-flow-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494967

Change 494974 merged by Milimetric:
[analytics/limn-edit-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/494974

Change 495199 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-language-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/495199

Change 495199 merged by Milimetric:
[analytics/limn-language-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/495199

Change 495239 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-ee-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/495239

Change 495239 merged by Mforns:
[analytics/limn-ee-data@master] Adapt config to use new auto_find_db_shard feature

https://gerrit.wikimedia.org/r/495239

Change 495253 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Reenable reportupdater jobs after analytics replica shard fix

https://gerrit.wikimedia.org/r/495253

Hi @chelsyx and @Amire80

When fixing these reportupdater jobs to be able to connect to the new analytics replica db scheme, I found one job that unfortunately cannot work the way it was meant to with the new database structure: https://github.com/wikimedia/analytics-limn-language-data/tree/master/cx
It connects to one database (log) to gather stats, and then outputs them into a specific table in another database (staging). That staging table is then queried from Superset.
The problem is that the log database and the staging database are now on different hosts, so that job won't work any more, sorry.
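
To make the problem concrete: the job effectively does a single-connection select-insert across the two databases, which is impossible once they live on different hosts. The only direct workaround would be two connections, which reportupdater's one-connection-per-report model doesn't support. A minimal sketch (hosts, credentials, columns and the query itself are placeholders, not the real job):

```
# Two-connection workaround that RU currently can't express.
# Everything below (hosts, creds, columns) is illustrative only.
import pymysql

log_conn = pymysql.connect(host='log-host.example', db='log',
                           read_default_file='~/.my.cnf')
staging_conn = pymysql.connect(host='staging-host.example', db='staging',
                               read_default_file='~/.my.cnf')

with log_conn.cursor() as cur:
    # Guessing at the source query; the real one lives in the cx job.
    cur.execute("SELECT DATE(timestamp) AS day, COUNT(*) "
                "FROM ContentTranslationAbuseFilter GROUP BY day")
    rows = cur.fetchall()

with staging_conn.cursor() as cur:
    cur.executemany("INSERT INTO daily_abuse_filter_count (day, n) "
                    "VALUES (%s, %s)", rows)
staging_conn.commit()
```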

@chelsyx, Nuria told me you're working on a project named Toledo, which she thinks might provide stats similar to the ones this data set was providing, is that right?

@Amire80, if the answer to the above question is negative, maybe we could move this job to Hive, and insert the results of the query into your database instead of staging? Then we could load that data into Druid, or try to query it from Superset as well...

Thanks!

Change 495253 merged by Ottomata:
[operations/puppet@production] Reenable reportupdater jobs after analytics replica shard fix

https://gerrit.wikimedia.org/r/495253

> @chelsyx, Nuria told me you're working on a project named Toledo, which she thinks might provide stats similar to the ones this data set was providing, is that right?

@mforns No, the Toledo project is using neither the output dataset (daily_abuse_filter_count) nor the source table (ContentTranslationAbuseFilter). The EventLogging schema for Toledo is ExternalGuidance -- there are some similar field names, but it's a different MediaWiki extension and it serves a different purpose. Please see T212414 and the report I'm working on for more details.

@chelsyx thanks for the clarification!

@Amire80 it seems, then, that we'll have to find another solution...

My only idea would be to adapt the RU job to run in Hive. The query would remain basically the same, but we'd need to wrap it in a simple bash script (that's the way RU queries Hive for now). Then, once the data is in a Hive table, we could try to query it from Superset, or load it into Druid to browse it with Turnilo.

Let me know if this makes sense, and I will start doing some tests to see whether this solution is feasible. I'm sorry that the initial idea of using RU to select-insert into another table stopped working because of the dbstore changes; it's something I should have seen beforehand.
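
Roughly, the wrapper would be an executable that receives the date range and prints TSV back to RU. Sketched in Python here rather than bash, with the Hive table, field names and invocation as assumptions I'd still need to verify:

```
#!/usr/bin/env python
# Rough sketch of a Hive-backed RU script report. RU would execute this with
# the report's start and end dates; table and column names are placeholders.
import subprocess
import sys

start_date, end_date = sys.argv[1], sys.argv[2]  # e.g. 2019-03-01 2019-03-08

HQL = """
SELECT TO_DATE(dt) AS day, COUNT(*) AS abuse_filter_count
FROM event.contenttranslationabusefilter  -- hypothetical Hive table
WHERE dt >= '{start}' AND dt < '{end}'
GROUP BY TO_DATE(dt)
""".format(start=start_date, end=end_date)

# -S silences Hive's logging; result rows come out tab-separated, which is
# what RU expects on stdout.
result = subprocess.check_output(['hive', '-S', '-e', HQL])
sys.stdout.write('day\tabuse_filter_count\n')
sys.stdout.write(result.decode('utf-8'))
```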