Page MenuHomePhabricator

Drop index rev_page_id (rev_page, rev_id)
Open, LowPublic

Description

This index doesn't make sense, and almost everything that uses it is a bug.

The correct ordering of revisions within a page is by timestamp, except where there are duplicate timestamps, in which case rev_id is used to break the tie.

rev_id does not increase monotonically with rev_timestamp, so it's not appropriate to use it for revision ordering within a page. It can be assumed to reflect time only when rev_timestamp matches.

Let's consider the callers.

ApiQueryRevisions has startid/endid parameters. These can be mapped to a (rev_timestamp,rev_id) tuple in the proposed total order before performing the query. If an unknown rev_id is requested as an endpoint, it can fail. Currently startid/endid is broken, since it partitions by rev_id but sorts by rev_timestamp, and so causes queries to take tens of seconds (T163495).

Title::getNextRevisionID() gets the next revision, ordering by ID, getPreviousRevisionID() is the same in the opposite direction. The original use case for these functions were the prev/next links on the old revision page views. The user certainly expects a time-based ordering in that case. The other callers:

  • WatchedItemStore::getNotificationTimestamp() uses it to check if a given rev_id is the latest, which is totally broken. It should use page_latest. This is duplicated in ApiSetNotificationTimestamp and User::clearNotification().
  • Article::view() has a direction parameter, next or prev, which was introduced to avoid the need for an expensive getNextRevisionID() call on old revision page views. As mentioned above, this should use time-based ordering.
  • Article::setOldSubtitle() calls getPreviousRevisionID() to check if the "previous" link needs to be displayed on an old revision view, which rather defeats the efficiency purpose of the direction parameter. rev_parent_id could be used, avoiding the need for a DB query, assuming the column is fully populated.
  • RawAction has a direction parameter analogous to the action=view one, which should function in the same way as action=view.
  • The "undo" feature in EditPage and duplicated in ApiEditPage has a range of revisions specified by the rev_id of the start and end of the range. It uses Revision::getNext() to determine whether the start and end are consecutive revisions. This should use the time-based total order or rev_parent_id.
  • DifferenceEngine has a diff=prev, diff=next parameters and uses getNextRevisionID()/getPreviousRevisionID() to interpret these parameters. Like direction parameters, this would benefit from using a time-based order. HistoryAction::feedItem() uses getPreviousRevisionID() similarly.
  • CategoryMembershipChange uses getPreviousRevisionID() to get the timestamp for sending $lastTimestamp to RecentChange. It should ideally use the pre-update page_latest value like we do on page save.

{T159319} would be fixed by making this proposed change.

Searching for SQL with rev_id inequalities:

  • dumpBackup.php has --revrange, possibly a legitimate use case, but could be answered by scanning the table.
  • compressOld.php uses rev_id<page_latest, it should use rev_id<>page_latest instead.
  • Already using the proposed total order: populateParentId.php, MysqlUpdater::doSchemaRestructuring(), CategoryMembershipChangeJob.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 21 2017, 6:24 AM

List of unused indexes on db2070 (which means the index is in use):

root@db2070.codfw.wmnet[sys]> SELECT * FROM schema_unused_indexes;
+---------------+--------------------------------+------------------------------------+
| object_schema | object_name                    | index_name                         |
+---------------+--------------------------------+------------------------------------+
| enwiki        | abuse_filter                   | af_user                            |
| enwiki        | abuse_filter_history           | afh_user_text                      |
| enwiki        | abuse_filter_history           | afh_user                           |
| enwiki        | abuse_filter_log               | afl_ip                             |
| enwiki        | abuse_filter_log               | afl_rev_id                         |
| enwiki        | abuse_filter_log               | afl_log_id                         |
| enwiki        | arbcom1_vote                   | log_id                             |
| enwiki        | arbcom1_vote                   | log_user_key                       |
| enwiki        | babel                          | babel_lang_level                   |
| enwiki        | blob_tracking                  | bt_moved                           |
| enwiki        | blob_tracking                  | bt_cluster                         |
| enwiki        | category                       | cat_pages                          |
| enwiki        | categorylinks                  | cl_collation_ext                   |
| enwiki        | change_tag                     | change_tag_tag_id                  |
| enwiki        | change_tag                     | change_tag_log_tag                 |
| enwiki        | change_tag                     | ct_log_id                          |
| enwiki        | change_tag                     | change_tag_rc_tag                  |
| enwiki        | click_tracking                 | click_tracking_action_time         |
| enwiki        | click_tracking                 | click_tracking_event_id            |
| enwiki        | click_tracking                 | click_tracking_session_id          |
| enwiki        | click_tracking_events          | event_name                         |
| enwiki        | click_tracking_user_properties | ct_user_prop_id_name_value_version |
| enwiki        | click_tracking_user_properties | ct_version_name_value              |
| enwiki        | cur                            | jamesspecialpages                  |
| enwiki        | cur                            | name_title_dup_prevention          |
| enwiki        | cur                            | id_title_ns_red                    |
| enwiki        | cur                            | usertext_timestamp                 |
| enwiki        | cur                            | cur_title                          |
| enwiki        | cur                            | user_timestamp                     |
| enwiki        | cur                            | name_title_timestamp               |
| enwiki        | cur                            | cur_random                         |
| enwiki        | cur                            | cur_timestamp                      |
| enwiki        | cu_changes                     | cuc_xff_hex_time                   |
| enwiki        | cu_changes                     | cuc_user_time                      |
| enwiki        | edit_page_tracking             | user_timestamp                     |
| enwiki        | ep_articles                    | ep_articles_user_id                |
| enwiki        | ep_articles                    | ep_articles_page_title             |
| enwiki        | ep_articles                    | ep_articles_page_id                |
| enwiki        | ep_cas                         | ep_cas_visible                     |
| enwiki        | ep_courses                     | ep_course_lang                     |
| enwiki        | ep_courses                     | ep_course_token                    |
| enwiki        | ep_courses                     | ep_course_student_count            |
| enwiki        | ep_courses                     | ep_course_oa_count                 |
| enwiki        | ep_courses                     | ep_course_name                     |
| enwiki        | ep_courses                     | ep_course_level                    |
| enwiki        | ep_courses                     | ep_course_ca_count                 |
| enwiki        | ep_courses                     | ep_course_field                    |
| enwiki        | ep_events                      | ep_events_course_id                |
| enwiki        | ep_events                      | ep_events_user_id                  |
| enwiki        | ep_events                      | ep_events_type                     |
| enwiki        | ep_oas                         | ep_oas_visible                     |
| enwiki        | ep_orgs                        | ep_org_ca_count                    |
| enwiki        | ep_orgs                        | ep_org_student_count               |
| enwiki        | ep_orgs                        | ep_org_instructor_count            |
| enwiki        | ep_orgs                        | ep_org_oa_count                    |
| enwiki        | ep_orgs                        | ep_org_active                      |
| enwiki        | ep_orgs                        | ep_org_course_count                |
| enwiki        | ep_orgs                        | ep_org_country                     |
| enwiki        | ep_orgs                        | ep_org_city                        |
| enwiki        | ep_revisions                   | ep_revision_object_identifier      |
| enwiki        | ep_revisions                   | ep_revision_deleted                |
| enwiki        | ep_revisions                   | ep_revision_minor_edit             |
| enwiki        | ep_revisions                   | ep_revision_user_text              |
| enwiki        | ep_revisions                   | ep_revision_user_id                |
| enwiki        | ep_revisions                   | ep_revision_type                   |
| enwiki        | ep_revisions                   | ep_revision_object_id              |
| enwiki        | ep_students                    | ep_students_last_course            |
| enwiki        | ep_students                    | ep_students_active_enroll          |
| enwiki        | ep_students                    | ep_students_last_enroll            |
| enwiki        | ep_students                    | ep_students_first_course           |
| enwiki        | ep_students                    | ep_students_first_enroll           |
| enwiki        | ep_users_per_course            | ep_upc_time                        |
| enwiki        | exarchive                      | name_title_timestamp               |
| enwiki        | exrevision                     | rev_timestamp                      |
| enwiki        | exrevision                     | user_timestamp                     |
| enwiki        | exrevision                     | rev_id                             |
| enwiki        | exrevision                     | usertext_timestamp                 |
| enwiki        | exrevision                     | page_timestamp                     |
| enwiki        | externallinks                  | el_index                           |
| enwiki        | externallinks                  | el_backlinks_to                    |
| enwiki        | filearchive                    | fa_storage_group                   |
| enwiki        | filearchive                    | fa_deleted_timestamp               |
| enwiki        | filejournal                    | fj_batch_id                        |
| enwiki        | filejournal                    | fj_timestamp                       |
| enwiki        | flaggedpages                   | fp_quality_page                    |
| enwiki        | flaggedpages                   | fp_reviewed_page                   |
| enwiki        | flaggedpage_pending            | fpp_quality_pending                |
| enwiki        | flaggedrevs                    | page_time                          |
| enwiki        | flaggedrevs                    | page_qal_time                      |
| enwiki        | flaggedrevs                    | page_rev                           |
| enwiki        | flaggedrevs                    | page_qal_rev                       |
| enwiki        | gather_list                    | gl_user_label                      |
| enwiki        | gather_list                    | gl_visibility_updated              |
| enwiki        | gather_list_flag               | glf_list_reviewed                  |
| enwiki        | gather_list_flag               | glf_list_user_ip                   |
| enwiki        | gather_list_item               | gli_id_order                       |
| enwiki        | gather_list_item               | gli_id_ns_title                    |
| enwiki        | gather_list_item               | gli_ns_title                       |
| enwiki        | geo_tags                       | gt_page_id_id                      |
| enwiki        | global_block_whitelist         | gbw_by                             |
| enwiki        | hidden                         | hidden_by_user                     |
| enwiki        | hidden                         | hidden_user_text                   |
| enwiki        | hidden                         | page_title_timestamp               |
| enwiki        | hidden                         | name_title_timestamp               |
| enwiki        | hidden                         | hidden_on_timestamp                |
| enwiki        | image                          | img_size                           |
| enwiki        | interwiki                      | iw_prefix                          |
| enwiki        | iwlinks                        | iwl_prefix_from_title              |
| enwiki        | iwlinks                        | iwl_prefix_title_from              |
| enwiki        | job                            | job_timestamp                      |
| enwiki        | job                            | job_cmd_token_id                   |
| enwiki        | job                            | job_cmd_token                      |
| enwiki        | job                            | job_sha1                           |
| enwiki        | job                            | job_cmd                            |
| enwiki        | l10n_cache                     | lc_lang_key                        |
| enwiki        | langlinks                      | ll_lang                            |
| enwiki        | linkscc                        | lcc_pageid                         |
| enwiki        | localisation_file_hash         | lfh_timestamp                      |
| enwiki        | localisation_file_hash         | lfh_file_hash                      |
| enwiki        | localisation_file_hash         | lfh_file                           |
| enwiki        | logging                        | type_time                          |
| enwiki        | logging                        | times                              |
| enwiki        | logging                        | log_page_id_time                   |
| enwiki        | logging                        | log_user_type_time                 |
| enwiki        | log_search                     | ls_log_id                          |
| enwiki        | moodbar_feedback               | mbf_type_timestamp_id              |
| enwiki        | moodbar_feedback               | mbf_latest_response                |
| enwiki        | moodbar_feedback               | mbf_title_type_id                  |
| enwiki        | moodbar_feedback               | mbf_timestamp_id                   |
| enwiki        | moodbar_feedback               | mbf_type_userid_ip_timestamp_id    |
| enwiki        | moodbar_feedback               | mbf_userid_ip_timestamp_id         |
| enwiki        | moodbar_feedback_response      | mbfr_user_id                       |
| enwiki        | moodbar_feedback_response      | mbfr_mbf_id                        |
| enwiki        | moodbar_feedback_response      | mbfr_mbf_mbfr_id                   |
| enwiki        | msg_resource                   | mr_resource_lang                   |
| enwiki        | msg_resource_links             | mrl_message_resource               |
| enwiki        | objectcache                    | keyname                            |
| enwiki        | objectcache                    | exptime                            |
| enwiki        | oldimage                       | oi_sha1                            |
| enwiki        | ores_model                     | oresm_version                      |
| enwiki        | page                           | page_len                           |
| enwiki        | pagetriage_log                 | ptrl_timestamp                     |
| enwiki        | pagetriage_page                | ptrp_reviewed_updated              |
| enwiki        | pagetriage_page                | ptrp_updated_page_reviewed         |
| enwiki        | page_restrictions              | pr_typelevel                       |
| enwiki        | page_restrictions              | pr_level                           |
| enwiki        | prefstats                      | ps_pref_duration_start             |
| enwiki        | prefstats                      | ps_user_pref_start                 |
| enwiki        | prefswitch_survey              | pss_user_timestamp_name_question   |
| enwiki        | profiling                      | pf_name_server                     |
| enwiki        | recentchanges                  | rc_ip                              |
| enwiki        | recentchanges                  | tmp_2                              |
| enwiki        | recentchanges                  | rc_name_type_patrolled_timestamp   |
| enwiki        | recentchanges                  | rc_ns_usertext                     |
| enwiki        | securepoll_cookie_match        | spcookie_match_voter_1             |
| enwiki        | securepoll_cookie_match        | spcookie_match_voter_2             |
| enwiki        | securepoll_elections           | spel_title                         |
| enwiki        | securepoll_lists               | splists_member                     |
| enwiki        | securepoll_lists               | splists_name                       |
| enwiki        | securepoll_msgs                | spmsg_entity                       |
| enwiki        | securepoll_options             | spop_question                      |
| enwiki        | securepoll_options             | spop_election                      |
| enwiki        | securepoll_properties          | sppr_entity                        |
| enwiki        | securepoll_questions           | spqu_election_index                |
| enwiki        | securepoll_strike              | spstrike_vote                      |
| enwiki        | securepoll_voters              | spvoter_elec_name_domain           |
| enwiki        | securepoll_votes               | spvote_ip                          |
| enwiki        | securepoll_votes               | spvote_voter_domain                |
| enwiki        | securepoll_votes               | spvote_voter_name                  |
| enwiki        | sites                          | sites_domain                       |
| enwiki        | sites                          | sites_source                       |
| enwiki        | sites                          | sites_global_key                   |
| enwiki        | sites                          | sites_language                     |
| enwiki        | sites                          | sites_protocol                     |
| enwiki        | sites                          | sites_group                        |
| enwiki        | sites                          | sites_type                         |
| enwiki        | sites                          | sites_forward                      |
| enwiki        | site_identifiers               | site_ids_key                       |
| enwiki        | tag_summary                    | ts_log_id                          |
| enwiki        | tag_summary                    | tag_summary_rc_id                  |
| enwiki        | tag_summary                    | tag_summary_log_id                 |
| enwiki        | tag_summary                    | tag_summary_rev_id                 |
| enwiki        | titlekey                       | name_key                           |
| enwiki        | transcache                     | tc_url_idx                         |
| enwiki        | transcode                      | transcode_key_idx                  |
| enwiki        | transcode                      | transcode_time_inx                 |
| enwiki        | updates                        | up_sequence                        |
| enwiki        | updates                        | up_timestamp                       |
| enwiki        | uploadstash                    | us_timestamp                       |
| enwiki        | uploadstash                    | us_user                            |
| enwiki        | user                           | user_email_token                   |
| enwiki        | user_groups                    | ug_expiry                          |
| enwiki        | user_properties                | user_properties_property           |
| enwiki        | vote_log                       | log_id                             |
| enwiki        | vote_log                       | log_user_key                       |
| enwiki        | wikigrok_claims                | wgc_rejection_checker              |
| enwiki        | wikigrok_claims                | wgc_unique                         |
| enwiki        | wikigrok_questions             | wgq_campaign_page                  |
| enwiki        | wikigrok_questions             | wgq_version_page                   |
| enwiki        | wikigrok_questions             | wgq_version_random                 |
| enwiki        | wikigrok_questions             | wgq_version_campaign_random        |
| enwiki        | wikigrok_responses             | wgr_unique                         |
| enwiki        | wikigrok_responses             | wgr_wiki_page                      |
| enwiki        | wikigrok_responses             | wgr_user_name                      |
| enwiki        | wikilove_image_log             | wlil_timestamp                     |
| enwiki        | wikilove_image_log             | wlil_user_time                     |
| enwiki        | wikilove_log                   | wll_sender_time                    |
| enwiki        | wikilove_log                   | wll_receiver_time                  |
| enwiki        | wikilove_log                   | wll_timestamp                      |
| enwiki        | wikilove_log                   | wll_type_time                      |
| enwiki        | __wmf_checksums                | ts_db_tbl                          |
| test          | eltx                           | x                                  |
+---------------+--------------------------------+------------------------------------+
212 rows in set (0.07 sec)

Frequency of usage of indexes on revision (db2070):

root@db2070.codfw.wmnet[sys]> SELECT * FROM schema_index_statistics WHERE table_name='revision'\G
*************************** 1. row ***************************
  table_schema: enwiki
    table_name: revision
    index_name: page_user_timestamp
 rows_selected: 123465061
select_latency: 23.70 h
 rows_inserted: 0
insert_latency: 0 ps
  rows_updated: 0
update_latency: 0 ps
  rows_deleted: 46761
delete_latency: 0 ps
*************************** 2. row ***************************
  table_schema: enwiki
    table_name: revision
    index_name: PRIMARY
 rows_selected: 81679529
select_latency: 19.00 h
 rows_inserted: 0
insert_latency: 0 ps
  rows_updated: 403305
update_latency: 17.42 s
  rows_deleted: 0
delete_latency: 0 ps
*************************** 3. row ***************************
  table_schema: enwiki
    table_name: revision
    index_name: page_timestamp
 rows_selected: 76742746
select_latency: 16.62 h
 rows_inserted: 0
insert_latency: 0 ps
  rows_updated: 27
update_latency: 675.93 us
  rows_deleted: 644824
delete_latency: 0 ps
*************************** 4. row ***************************
  table_schema: enwiki
    table_name: revision
    index_name: rev_timestamp
 rows_selected: 302795552
select_latency: 31.50 m
 rows_inserted: 0
insert_latency: 0 ps
  rows_updated: 0
update_latency: 0 ps
  rows_deleted: 0
delete_latency: 0 ps
*************************** 5. row ***************************
  table_schema: enwiki
    table_name: revision
    index_name: rev_page_id
 rows_selected: 289906541
select_latency: 13.62 m
 rows_inserted: 0
insert_latency: 0 ps
  rows_updated: 0
update_latency: 0 ps
  rows_deleted: 16166
delete_latency: 0 ps
*************************** 6. row ***************************
  table_schema: enwiki
    table_name: revision
    index_name: user_timestamp
 rows_selected: 651192
select_latency: 10.01 m
 rows_inserted: 0
insert_latency: 0 ps
  rows_updated: 29758
update_latency: 7.05 s
  rows_deleted: 0
delete_latency: 0 ps
*************************** 7. row ***************************
  table_schema: enwiki
    table_name: revision
    index_name: usertext_timestamp
 rows_selected: 79558
select_latency: 8.19 m
 rows_inserted: 0
insert_latency: 0 ps
  rows_updated: 0
update_latency: 0 ps
  rows_deleted: 0
delete_latency: 0 ps
7 rows in set (0.08 sec)
TTO added a subscriber: TTO.EditedApr 21 2017, 7:37 AM

DifferenceEngine has a diff=prev, diff=next parameters and uses getNextRevisionID()/getPreviousRevisionID() to interpret these parameters. Like direction parameters, this would benefit from using a time-based order.

Quite a lot of articles on enwiki (at least) have had old history imported, creating non-monotonic revision ID sequences, and any diff links that use diff=prev and diff=next across imported and non-imported revisions, such as https://en.wikipedia.org/w/index.php?diff=next&oldid=140182, would change to refer to other revisions.

I'm not sure how much it is worth caring about this, but bear in mind people are presently able to assume that diff URLs are stable (modulo revision hiding), which is an assumption that would no longer hold true with this change.

As for rev_parent_id, it would be interesting to see how that copes with imports. I seem to remember that in the past (and possibly still today) rev_parent_ids were completely bogus on some or all imported revisions, and revisions proximate to imported revisions.

Marostegui added a subscriber: Marostegui.EditedApr 21 2017, 8:24 AM

Shall we hold: T132416 and T162611 until we decide what to do with rev_page_id index? Before starting to alter the pending hosts in eqiad?

Change 349448 had a related patch set uploaded (by Anomie):
[mediawiki/core@master] API: Convert rvstartid/rvendid to timestamps for query

https://gerrit.wikimedia.org/r/349448

Change 349501 had a related patch set uploaded (by Anomie):
[mediawiki/core@master] Have Title::get(Next|Previous)RevisionID sort by timestamp

https://gerrit.wikimedia.org/r/349501

The original use case for these functions were the prev/next links on the old revision page views. The user certainly expects a time-based ordering in that case.

This behavior not matching expectations is a bug filed as T4930: Edit history is based on timestamp while diff prev/next links are based on revision id (yup, that's old).

Shall we hold: T132416 and T162611 until we decide what to do with rev_page_id index? Before starting to alter the pending hosts in eqiad?

I think you should go ahead with those consistency changes, since it will take a while to get rid of all uses of rev_page_id and to verify that we have done so.

Shall we hold: T132416 and T162611 until we decide what to do with rev_page_id index? Before starting to alter the pending hosts in eqiad?

I think you should go ahead with those consistency changes, since it will take a while to get rid of all uses of rev_page_id and to verify that we have done so.

Ok! Thanks! I will continue then :-)

Change 349501 merged by jenkins-bot:
[mediawiki/core@master] Have Title::get(Next|Previous)RevisionID sort by timestamp

https://gerrit.wikimedia.org/r/349501

Change 349448 merged by jenkins-bot:
[mediawiki/core@master] API: Convert rvstartid/rvendid to timestamps for query

https://gerrit.wikimedia.org/r/349448

Sadly the last change seems to change the behavior of the API in some cases where the timestamps of two revisions are identical and messes with the RevisionSlider :-/. e.g.:

We have the id of a revision, in our example it is the newest revision. We use the API to see if there are any newer revisions so we use action=query&prop=revisions&rvstartid=123&rvdir=newer .

Since there are no newer revisions the API is expected to return only the current revision and this should also work when the revisions share the same timestamp. That worked like a charm before that patch. With the patch we now get two ( ore even more ) revisions as a result when the newest revision ( 123 ) has the same timestamp as the second newest revision ( e.g. 122 ).

I guess this is not intended and it would be really great if it would be fixed before this gets deployed.

Change 352632 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] ApiQueryRevisions: Restore use of rvstartid/rvendid as a tiebreaker

https://gerrit.wikimedia.org/r/352632

Change 352632 merged by jenkins-bot:
[mediawiki/core@master] ApiQueryRevisions: Restore use of rvstartid/rvendid as a tiebreaker

https://gerrit.wikimedia.org/r/352632

Anomie added a comment.May 9 2017, 2:24 AM

Sadly the last change seems to change the behavior of the API in some cases where the timestamps of two revisions are identical and messes with the RevisionSlider :-/. e.g.:

That should be fixed for you now, it will use the specified rev_id to break the tie when there are multiple revisions with the same rev_timestamp.

Note that if revisions somehow get out of order so revision 122 has a later timestamp than 123 it will still return both 123 and 122, but if you see that somehow then something is really screwed up.

That should be fixed for you now, it will use the specified rev_id to break the tie when there are multiple revisions with the same rev_timestamp.

Thanks for addressing the issue so quickly it works as expected again!

Note that if revisions somehow get out of order so revision 122 has a later timestamp than 123 it will still return both 123 and 122, but if you see that somehow then something is really screwed up.

Jepp, and thats how we would want it so everything is fine with that ;-).