
Decide how to split wmf database into functional areas
Closed, ResolvedPublic1 Estimated Story Points

Description

In the Iceberg working session we decided to keep Iceberg tables in a separate database rather than mixing them in under wmf. The rationale is that although in Spark we can easily share one catalog, that is not the case for Presto.

On a similar thread, we were also considering naming databases by function. An example would be a superset database for tables that have been optimized for use by that system.

Should we keep this simple and name the database wmf_next or something else along those lines, or should we bite the bullet and do a full-on functional decomposition?

(We had started this conversation in Slack)

Editable proposed functional decomposition at https://docs.google.com/spreadsheets/d/15OoeDi4eWKhsGIR_ktsLeOPSIlVXaZXVvKAfg9s0lnk/edit#gid=0
Public view at https://docs.google.com/spreadsheets/d/e/2PACX-1vS_-9ymeJzftxRbKE2cdHljEVYm1gCAyUFMLEkMhr_lrDfvF0kria0KIzJCXyGc483OIK0YjJWZFDdE/pubhtml

Event Timeline

Copy pasting the conversation so far from Slack. Contributors: @Ottomata @Mayakp.wiki @mpopov @JAllemandou

(Side note: It would be great if Slack had a button to convert a thread into a phab! )

Andrew Otto

superset for tables that have been optimized to be used by that system

Andrew Otto

I think that’s a bad functional division

Andrew Otto

we’ll one day swap to a different dashboarding solution

Andrew Otto

We should think of name(s) that  will last

Andrew Otto

if we are talking about doing this for ‘trusted datasets’ / ‘data warehouse’ reasons

Andrew Otto

‘wmf_warehouse’?

Andrew Otto

or perhaps functional area is better?

Andrew Otto

wmf_traffic (webrequest, etc.)

Andrew Otto

wmf_readership (pageviews, etc.)
wmf_contributors (editors, etc.)
wmf_mediawiki (mediawiki imported data)

Andrew Otto

?

Xabriel Collazo Mojica

There are indeed 59 tables under wmf. So even though I wanted to sidestep the functional separation right now, I think I agree we should try and figure it out.

Xabriel Collazo Mojica

(actually 56, 3 seem to be temps)

Xabriel Collazo Mojica

Ok, I attempted to categorize current wmf tables into @otto’s suggested functional areas above. I came up with this:
wmf_traffic (webrequest, etc.)
	anomaly_detection
	aqs_hourly
	browser_general
	domain_abbrev_map
	mediacounts
	mediarequest
	interlanguage_navigation
	webrequest
	webrequest_actor_label_hourly
	webrequest_actor_metrics_hourly
	webrequest_actor_metrics_rollup_hourly
	webrequest_subset_tags
	referrer_daily


wmf_readership (pageviews, etc.)
	disallowed_cassandra_articles
	pageview_actor
	pageview_allowlist
	pageview_dataloss_202112_202201
	pageview_historical
	pageview_hourly
	pageview_unexpected_values
	projectview_hourly
	unique_devices_per_domain_daily
	unique_devices_per_domain_monthly
	unique_devices_per_project_family_daily
	unique_devices_per_project_family_monthly
	officewiki_webrequest_daily
	session_length_daily
	virtualpageview_hourly


wmf_contributors (editors, etc.)
	edit_hourly
	editors_daily
	geoeditors_blacklist_country (rename to denylist)
	geoeditors_edits_monthly
	geoeditors_monthly
	geoeditors_public_monthly
	unique_editors_by_country_monthly


wmf_mediawiki (mediawiki imported data)
	mediawiki_history
	mediawiki_history_archive
	mediawiki_history_reduced
	mediawiki_metrics
	mediawiki_page_history
	mediawiki_page_history_archive
	mediawiki_user_history
	mediawiki_user_history_archive
	mediawiki_wikitext_current
	mediawiki_wikitext_history


wmf_wikidata (wikidata imported data)
	wikidata_entity
	wikidata_item_page_link

wmf_experiments?
	iceberg_wikitext_content


not sure how to categorize:
	data_quality_stats
	hdfs_usage
	traffic_anomaly_checked_countries


tables that would not be migrated (i.e. appear to be deprecated / no recent data added):
	projectcounts_all_sites
	projectcounts_raw
	pagecounts_all_sites
	mobilewebuiclicktracking_10742159_15423246
	wdqs_extract

Xabriel Collazo Mojica

Does this make sense? (edited)

Xabriel Collazo Mojica

Consumers of wmf tables, please chime in.

Maya Kampurath

Categorization seems good. wondering if officewiki_webrequest_daily should be in wmf_traffic?

Mikhail Popov

I just put this into a spreadsheet for easier collaboration & iteration: https://docs.google.com/spreadsheets/d/15OoeDi4eWKhsGIR_ktsLeOPSIlVXaZXVvKAfg9s0lnk/edit#gid=0
Also the proposed categorization seems very sensible and +1 to Maya's suggestion – was about to say the same :smile:

Xabriel Collazo Mojica

Agreed on officewiki_webrequest_daily .
Wanted to clarify: the intention is to move tables to their new home while we also tackle some internal migration (moving to a technology called Apache Iceberg).
We intend to keep both versions (old and new) working in production for a reasonable time.

Maya Kampurath

ah ok! thanks for clarifying. my next question was gonna be "how soon is this happening!??"

Joseph Allemandou

While I think functional categorization is good, we also need a categorization allowing us to differentiate between data that is prepared so that it can be queried with Presto (smaller scale) versus data that needs to be queried via Spark (bigger scale).
I had in mind that for now the DB split was to help differentiate the Presto/Spark datasets, but maybe we need a different approach for this if we think the functional DB split is better.

Xabriel Collazo Mojica

There are a lot of good comments here that we should document. Let me move this into a phab ticket, and we can continue the discussion over there.
xcollazo changed the task status from Open to In Progress.May 26 2023, 2:34 PM
xcollazo triaged this task as High priority.
xcollazo updated the task description. (Show Details)
xcollazo set the point value for this task to 1.

@Mayakp.wiki :

ah ok! thanks for clarifying. my next question was gonna be "how soon is this happening!??"

We just started the migration, with work being tracked under T333013. There will also be schema changes, so that we can leverage 'hidden partitioning' in Iceberg. This means the way you SELECT will change slightly as well. We intend to document the table-level changes, so it should be clear what query modifications a user would need to make.
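To give a feel for the kind of query change hidden partitioning implies (this is a rough sketch; the column names and final schema are illustrative, not the actual migrated definitions): with the Hive table you filter on explicit partition columns, while with Iceberg you filter the timestamp directly and partition pruning happens transparently from table metadata.

```sql
-- Hive-style: explicit partition columns (sketch)
SELECT COUNT(1)
FROM wmf.webrequest
WHERE year = 2023 AND month = 5 AND day = 26;

-- Iceberg-style with hidden partitioning: filter the timestamp itself;
-- Iceberg prunes partitions without exposing partition columns (sketch)
SELECT COUNT(1)
FROM wmf_traffic.webrequest
WHERE dt >= TIMESTAMP '2023-05-26 00:00:00'
  AND dt <  TIMESTAMP '2023-05-27 00:00:00';
```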

But again, the idea is to support both versions of the table for a 'reasonable' time so that folks can migrate.

@JAllemandou :

While I think functional categorization is good, we also need a categorization allowing us to differentiate between data that is prepared so that it can be queried with Presto (smaller scale) versus data that needs to be queried via Spark (bigger scale).
I had in mind that for now the DB split was to help differentiate the Presto/Spark datasets, but maybe we need a different approach for this if we think the functional DB split is better.

I wanted to sidestep the functional split as well, as it requires more syncing. But I think tackling it now is better if we want to break folks only once.

We can still try to solve the smaller/bigger scale issue though. Do you think using DataHub tags would suffice? Or do we need a more explicit mechanism?

I had in mind that being able to access only "presto-ok" datasets in Presto was a good idea. We can do this through having multiple hive-metastores, but this makes schema-sync complicated. Maybe we can devise a catalog that uses the hive-metastore but only shows a subset of the tables?

I had in mind that being able to access only "presto-ok" datasets in Presto was a good idea. We can do this through having multiple hive-metastores, but this makes schema-sync complicated. Maybe we can devise a catalog that uses the hive-metastore but only shows a subset of the tables?

How about a bunch of VIEWs? The VIEWs can be simple statements like SELECT * FROM another_db.t, and they would be compartmentalized to a single well known database (wmf_dashboarding?). We'd have to test if Presto would do the right thing: give us equally good query plans with a direct SELECT or a simple VIEW like above.
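A minimal sketch of the passthrough-VIEW idea (the wmf_dashboarding database name and table names here are illustrative, taken from the proposal above, not a committed design):

```sql
-- Hypothetical 'presto-ok' database exposing curated tables as thin views
CREATE DATABASE IF NOT EXISTS wmf_dashboarding;

-- Each view simply forwards to the source table in its functional database
CREATE VIEW wmf_dashboarding.pageview_hourly AS
SELECT * FROM wmf_readership.pageview_hourly;
```

If Presto pushes predicates and projections through such trivial views, the query plans should match querying the base table directly; that is exactly the behavior that would need testing.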

We just started the migration, with work being tracked under T333013. There will also be schema changes, so that we can leverage 'hidden partitioning' in Iceberg. This means the way you SELECT will change slightly as well. We intend to document the table-level changes, so it should be clear what query modifications a user would need to make.

But again, the idea is to support both versions of the table for a 'reasonable' time so that folks can migrate.

Thanks @xcollazo for the heads up! We are anticipating changes to some of our widely used and critical tables like mediawiki_history due to the IP Masking project (see impacted list here) in Q2 2023. So it would be great if the Iceberg-related changes could be rolled out after that. cc @Milimetric

Mentioned in SAL (#wikimedia-analytics) [2023-05-30T15:52:15Z] <xcollazo> created HDFS folder /wmf/data/wmf_traffic (T335305 and T337562)

I added two comments in the Splitting wmf Hive database sheet on the categorization. Thank you for this work. This direction will be helpful for table discoverability.

Could this be a good opportunity to improve upon our naming conventions?

We currently have a few folders with wmf in the name:

  • wmf
  • wmf_raw
  • wmf_traffic
  • wmf_product

We're looking to divide up wmf, adding another prefix for each spin-off division. Current proposal:

  • wmf_traffic (webrequest, etc.)
  • wmf_readership (pageviews, etc.)
  • wmf_contributors (editors, etc.)
  • wmf_mediawiki (mediawiki imported data)
  • wmf_wikidata (wikidata imported data)
  • wmf_experiments?

As these folders will not be the unique holders of the data specified in their names (i.e., contributor data can also be found in wmf_product and in other presto_analytics_hive folders), might we benefit from another naming set-up?

I don't have a better proposal and do wonder about naming.
For example: what's true about these folders that is not true about the others? Are these DE-maintained tables whereas the others are not? Might it be helpful to name these with a data_engineering prefix?

I don't have a super constructive suggestion, but a lil comment:

Are these DE-maintained tables whereas the others are not? Might it be helpful to name these with a data_engineering prefix?

We try (but don't always succeed) to not put team names in code and data, as team names change relatively frequently (every few years).

Could this be a good opportunity to improve upon our naming conventions?

Yes!

...are these DE maintained tables whereas the others are not?

I think the prefix wmf was meant to convey officiality, as in, well supported by Data Engineering (and after the reorg, by 'Data Products').

Might it be helpful to name these with a data_engineering prefix?

Because of the current (and future) reorgs I would rather not use a prefix like that. But, you got me thinking.... do we need a prefix at all? We could just have the officiality/ownership in Wikitech?

We could just have the officiality/ownership in Wikitech?

or datahub? :)

I don't think the 'wmf' prefix is strictly needed. E.g. the 'event' database isn't 'wmf_event'. Having a prefix of some kind for official / trusted dataset databases might be nice though.

Instead of functional areas, could we organize by 'trusted'/data-productness? Could go over to the thesaurus, but some synonym for trusted, official, gold(?) 'published'?

Instead of functional areas, could we organize by 'trusted'/data-productness? Could go over to the thesaurus, but some synonym for trusted, official, gold(?) 'published'?

We could do that with Datahub tags / Wikitech as well? The issue with a database named trusted would be that we'd lump 56 tables on it again.

IIRC, wasn't the reason for doing this to find a nice new way of grouping things that will be Iceberg, that doesn't say "iceberg_xxxx"?

IIRC, wasn't the reason for doing this to find a nice new way of grouping things that will be Iceberg, that doesn't say "iceberg_xxxx"?

That is true, but then, we pivoted to a functional split as it seems like a better categorization strategy ( T337562#8883194 ).

I like that the folks who would use these tables on a daily basis thought that split was best. I also like that if we split functionally, then we could do the migration functionally as well; that seems like a good work division.

We will still get the iceberg grouping, as in, anything under wmf is Hive tables, and anything under the new databases will be Iceberg tables.
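Concretely, under this scheme a migrated table would be created as an Iceberg table in its functional database. A Spark SQL sketch, with illustrative table/column names and partitioning (not the actual migrated schema):

```sql
-- Sketch: Iceberg table under a new functional database, partitioned
-- via a hidden transform on the timestamp instead of explicit columns
CREATE TABLE wmf_traffic.webrequest (
  dt          TIMESTAMP,
  uri_host    STRING,
  http_status STRING
)
USING iceberg
PARTITIONED BY (days(dt));
```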

I had in mind that being able to access only "presto-ok" datasets in Presto was a good idea. We can do this through having multiple hive-metastores, but this makes schema-sync complicated. Maybe we can devise a catalog that uses the hive-metastore but only shows a subset of the tables?

How about a bunch of VIEWs? The VIEWs can be simple statements like SELECT * FROM another_db.t, and they would be compartmentalized to a single well known database (wmf_dashboarding?). We'd have to test if Presto would do the right thing: give us equally good query plans with a direct SELECT or a simple VIEW like above.

Wanted to get back to this issue. I investigated whether VIEWs will work, and the TLDR is no for now.

The longer version:

This is the current landscape around VIEWs:

  1. Hive tables support Hive-generated views, and Spark can read these Hive-generated views. However, when we throw Iceberg into the mix, this all breaks. We can create Spark-generated views, and then Spark is happy with Iceberg.
  2. Presto does support Hive-generated views when we enable the hive.views-execution.enabled flag. However, the Presto Iceberg catalog does not support them. Presto-generated views do not support Iceberg either.

It's a mess. There is ongoing work in Iceberg to implement a view specification that is SQL-engine agnostic, but that work is still in progress, so the ETA is unknown.
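For reference, the flag mentioned in point 2 is a Hive connector catalog property. A sketch of where it would live (the metastore URI is a placeholder, not our actual config):

```properties
# etc/catalog/hive.properties (sketch)
connector.name=hive
hive.metastore.uri=thrift://metastore.example.net:9083
# allow Presto to execute Hive-generated views
hive.views-execution.enabled=true
```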

Wanted to get back to this issue. I investigated whether VIEWs will work, and the TLDR is no for now.

Thank you for the investigation!

I'll leave this open till EOD Wednesday June 7 2023 for further consideration, but it does look like we are converging on the new database names, since there has been less activity on the spreadsheet.

I had in mind that being able to access only "presto-ok" datasets in Presto was a good idea. We can do this through having multiple hive-metastores, but this makes schema-sync complicated. Maybe we can devise a catalog that uses the hive-metastore but only shows a subset of the tables?

How about a bunch of VIEWs? The VIEWs can be simple statements like SELECT * FROM another_db.t, and they would be compartmentalized to a single well known database (wmf_dashboarding?). We'd have to test if Presto would do the right thing: give us equally good query plans with a direct SELECT or a simple VIEW like above.

Wanted to get back to this issue. I investigated whether VIEWs will work, and the TLDR is no for now.

The longer version:

This is the current landscape around VIEWs:

  1. Hive tables support Hive-generated views, and Spark can read these Hive-generated views. However, when we throw Iceberg into the mix, this all breaks. We can create Spark-generated views, and then Spark is happy with Iceberg.
  2. Presto does support Hive-generated views when we enable the hive.views-execution.enabled flag. However, the Presto Iceberg catalog does not support them. Presto-generated views do not support Iceberg either.

It's a mess. There is ongoing work in Iceberg to implement a view specification that is SQL-engine agnostic, but that work is still in progress, so the ETA is unknown.

We can always expedite the progress of the VIEWs implementation by contributing to the Iceberg project. CC @lbowmaker for consideration.

xcollazo renamed this task from Split wmf database into functional areas to Decide how to split wmf database into functional areas.Jun 7 2023, 4:11 PM

The proposed database names and table mappings are now available at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Iceberg#Changes_to_database_names.

Marking spreadsheet as deprecated, and closing this ticket.

Thank you all for your comments!