Page MenuHomePhabricator

Determine schema differences between labsdb1001 and labsdb1009
Closed, ResolvedPublic

Description

Audit the Wiki Replica database schemas in use on labsdb1001 to document the differences from the "standard" schema that is available on labsdb1009. This is will help us understand what custom alterations have been made historically for Toolforge analytics queries. We can then determine which changes should be scripted for application to future Wiki Replica hosts.

For now, changes have been recorded on: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/labs/db/views/extra-wikireplicas-only-indexes.sql

Event Timeline

bd808 claimed this task.Oct 2 2017, 3:33 PM
bd808 created this task.
Restricted Application added a project: User-bd808. · View Herald TranscriptOct 2 2017, 3:33 PM
jcrespo added a comment.EditedOct 2 2017, 3:37 PM

There is already 5 related things that, even nothing to do with this, we could integrate this on:

  • replication_sanitize, where triggers are created
  • replication_private_data check, where private data is checked
  • heartbeat_p, the extra database that handles the replication lag addition
  • watchlist_count, the regularly updated extra table that summarizes an otherwise private table
  • maintain-views scripts

I do not have the links handy, but you can search for those on puppet.

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.Oct 6 2017, 3:56 AM

There is already 5 related things that, even nothing to do with this, we could integrate this on:
(... snip ...)
I do not have the links handy, but you can search for those on puppet.

From role::labs::db::check_private_data we get tools that appear to be used on the sanitarium source hosts:

  • /usr/local/sbin/check_private_data.py: script to look for non-public databases, tables, and columns
  • /usr/local/sbin/check_private_data_report: script to email DBAs if non-public data is found
  • /usr/local/sbin/redact_sanitarium.sh: script to create redaction triggers

From role::labs::db::views we get tools that are used on the target Wiki Replica hosts (not the sanitarium source hosts):

  • /usr/local/sbin/maintain-views: create or replace ${WIKIDB}_p databases and their views
  • /usr/local/sbin/maintain-meta_p: creates and populates a meta_p database
  • /usr/local/src/heartbeat-views.sql: creates the heartbeat_p database and its heartbeat view of the replicated heartbeat table

The watchlist_count table is mentioned in modules/role/templates/labs/db/views/maintain-views.yaml, but I can't find reference to it anywhere else. I see it referenced in rOSOFc886c32dcd94: Replace impossible watchlist_counts custom view with full view of already… where a custom view with a similar name was removed.

From rOPUP2b2050646943: mariadb: Add file to control additional wikireplicas-only indexes we also now have the modules/role/files/labs/db/views/extra-wikireplicas-only-indexes.sql file which is not provisioned anywhere as far as I can tell, but does document the first ALTER TABLE to be applied to the new Wiki Replica cluster hosts. It looks like this is something that ideally we would apply to each new wikidb that is replicated.

Is this something that we should figure out how to add to maintain-views? Or am I misunderstanding and this change is applied on the sanitarium host rather than on the downstream replicas?

Deeper inspection by me is currently blocked by T178128: Access to raw database tables on labsdb* for wmcs-admin users, but my intent is to run a tool like mysqldbcompare or SchemaSync to compare labsdb1001 and labsdb1009 to see if we can preemptively find other extra indexes that have been hacked into the legacy wiki replicas and determine if they should be added to the new replicas and documented/automated for reapplication when servers are rebuilt and new wikis are added.

bd808 changed the task status from Open to Stalled.Oct 18 2017, 3:00 AM
bd808 changed the task status from Stalled to Open.Dec 1 2017, 5:43 AM

I used this query to collect a description of the indexes on enwiki tables from both labsdb1001 and labsdb1011:

SELECT TABLE_NAME,
       INDEX_NAME,
       GROUP_CONCAT(DISTINCT(COLUMN_NAME) ORDER BY SEQ_IN_INDEX SEPARATOR ', ') cols
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = 'enwiki'
GROUP BY TABLE_NAME, INDEX_NAME
ORDER BY TABLE_NAME, INDEX_NAME;

Then I diffed the resulting output and filtered the diff to only show missing or added indexes:

--- labsdb1011-enwiki.txt       2017-12-01 17:56:33.546968656 +0000
+++ labsdb1001-enwiki.txt       2017-12-01 17:56:44.938953674 +0000
+abuse_filter_log       afl_user        afl_user
-abuse_filter_log       user_timestamp  afl_user, afl_user_text, afl_timestamp
+ep_articles    ep_articles_course_id   article_course_id
-flaggedrevs    fr_user fr_user
-ipblocks       ipb_parent_block_id     ipb_parent_block_id
-logging        log_title_type_time     log_title, log_type, log_timestamp
-logging        log_user_type_time      log_user, log_type, log_timestamp
-ores_classification    oresc_rev_model_class   oresc_rev, oresc_model, oresc_class
+ores_classification    oresc_winner    oresc_rev, oresc_is_predicted
-pagelinks      pl_backlinks_namespace  pl_from_namespace, pl_namespace, pl_title, pl_from
+pagelinks      pl_backlinks_namespace  pl_namespace, pl_title, pl_from_namespace, pl_from
-recentchanges  tmp_2   rc_bot, rc_timestamp
-recentchanges  tmp_3   rc_namespace, rc_timestamp
-revision       page_user_timestamp     rev_page, rev_user, rev_timestamp
-revision       PRIMARY rev_id
-revision       rev_page_id     rev_page, rev_id
+revision       PRIMARY rev_page, rev_id
+revision       rev_id  rev_id
-revision       usertext_timestamp      rev_user_text, rev_timestamp
+revision       usertext_timestamp      rev_user_text, rev_timestamp, rev_user, rev_deleted, rev_minor_edit, rev_text_id, rev_comment
+watchlist_count        PRIMARY wl_namespace, wl_title
+watchlist_count        watchers        watchers
+__wmf_checksums        PRIMARY db, tbl, chunk
+__wmf_checksums        ts_db_tbl       ts, db, tbl

Some of the differences are only in the index name or field order which I imagine represent MediaWiki or production schema changes that were never applied on labsdb1001. A few others are for tables which exist on labsdb1001 but do not exist on labsdb1011 (watchlist_count, __wmf_checksums). The remaining differences are:

+abuse_filter_log       afl_user                afl_user
+ep_articles            ep_articles_course_id   article_course_id
+ores_classification    oresc_winner            oresc_rev, oresc_is_predicted

watchlist_count is T59617#3070203. __wmf_checksums can be ignored, it is an ops-only table (to check data consistency). Many other schema changes were not added to labsdb1001-3 because it was difficult to add them in a hot way, and those are documented, but difficult to search right now for them (phabricator search is not very useful right now.

For wikidatawiki, the substantive diff is:

+wb_items_per_site      wb_ips_site_page        ips_site_page
+wb_terms               wb_terms_language       term_language

Also lately there have been at least 3-4 schema changes that never arrived to labsdb1001 as it is in read only and replication isn't working

jcrespo updated the task description. (Show Details)Dec 1 2017, 6:53 PM

Looking over the diffs again, I think these are the potential indices to add:

-- exists in EducationProgram.sql
ALTER TABLE ep_articles ADD KEY `ep_articles_course_id` (`article_course_id`);

-- exists in Wikibase.sql
ALTER TABLE wb_items_per_site ADD KEY `wb_ips_site_page` (`ips_site_page`);

Both of these look like they are intended to exist in their associated Extension git repos. Does this mean that they are actually missing in production by some accident of replication or omission?

The others are either redundant with existing indices, or were dropped at one point or another in the history of their associated extensions according to git history.

Marostegui added a comment.EditedJan 10 2018, 6:25 AM

The index ep_articles_course_id was deleted per: T180166
The index wb_ips_site_page was deleted per: T179793

Those were not deleted from labsdb1001 because it was already on read_only

bd808 closed this task as Resolved.Jan 10 2018, 6:29 AM

Per T177223#3888671 and T177223#3888956 it seems like we are in parity.