Page MenuHomePhabricator

Spike: Figure feasability to emit (wiki_db, revision_id) pairs
Closed, ResolvedPublic

Description

Conclusion:

Yes it is feasible to emit (wiki_db, revision_id) pairs. We got the prototype code to be able to emit these rows for a year worths of revisions for enwiki in about 5 mins of runtime.

Considering enwiki is one of the biggest wiki, this is very good news. We were also able to use Spark's JDBC Data source with parallelization set to 8 concurrent connections to the Analytics Replica. The replicas were able to handle the load quite fine.

Details on the shared notebook below.

Details

TitleReferenceAuthorSource BranchDest Branch
Add notebook from Spike: Figure feasibility to emit (wiki_db, revision_id) pairsrepos/data-engineering/dumps/mediawiki-content-dump!29xcollazoadd-notebook-emit-feasibilitymain
Customize query in GitLab

Event Timeline

xcollazo changed the task status from Open to In Progress.Fri, Jun 14, 4:03 PM
xcollazo triaged this task as High priority.
xcollazo moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.

What I want to do is somehow explore each wiki using windows bound by rev_timestamp.

Let's start with an exploration of the revision table and its indexes:

MariaDB [eswiki]> describe revision;
+----------------+---------------------+------+-----+---------+----------------+
| Field          | Type                | Null | Key | Default | Extra          |
+----------------+---------------------+------+-----+---------+----------------+
| rev_id         | int(8) unsigned     | NO   | PRI | NULL    | auto_increment |
| rev_page       | int(8) unsigned     | NO   | MUL | 0       |                |
| rev_comment_id | bigint(20) unsigned | NO   |     | 0       |                |
| rev_actor      | bigint(20) unsigned | NO   | MUL | 0       |                |
| rev_timestamp  | binary(14)          | NO   | MUL | NULL    |                |
| rev_minor_edit | tinyint(1) unsigned | NO   |     | 0       |                |
| rev_deleted    | tinyint(1) unsigned | NO   |     | 0       |                |
| rev_len        | int(8) unsigned     | YES  |     | NULL    |                |
| rev_parent_id  | int(8) unsigned     | YES  |     | NULL    |                |
| rev_sha1       | varbinary(32)       | NO   |     |         |                |
+----------------+---------------------+------+-----+---------+----------------+
10 rows in set (0.002 sec)


MariaDB [eswiki]> show index from revision;
+----------+------------+--------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table    | Non_unique | Key_name                 | Seq_in_index | Column_name   | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+----------+------------+--------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| revision |          0 | PRIMARY                  |            1 | rev_id        | A         |   150607030 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_timestamp            |            1 | rev_timestamp | A         |   150607030 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_actor_timestamp      |            1 | rev_actor     | A         |    25101171 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_actor_timestamp      |            2 | rev_timestamp | A         |   150607030 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_actor_timestamp      |            3 | rev_id        | A         |   150607030 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_page_actor_timestamp |            1 | rev_page      | A         |    18825878 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_page_actor_timestamp |            2 | rev_actor     | A         |   150607030 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_page_actor_timestamp |            3 | rev_timestamp | A         |   150607030 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_page_timestamp       |            1 | rev_page      | A         |    30121406 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| revision |          1 | rev_page_timestamp       |            2 | rev_timestamp | A         |   150607030 |     NULL | NULL   |      | BTREE      |         |               | NO      |
+----------+------------+--------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
10 rows in set (0.001 sec)

How would a query look like? Let's EXPLAIN it:

EXPLAIN EXTENDED
SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE rev_timestamp >= NOW() - INTERVAL 1 WEEK
  AND rev_timestamp  < NOW() - INTERVAL 1 MINUTE
+------+-------------+----------+------+---------------+------+---------+------+-----------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows      | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+-----------+----------+-------------+
|    1 | SIMPLE      | revision | ALL  | rev_timestamp | NULL | NULL    | NULL | 150607798 |   100.00 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+-----------+----------+-------------+
1 row in set, 1 warning (0.001 sec)

So this is not good because ALL = full table scan. Perhaps we need to CAST to BINARY since rev_timestamp is weird like that?:

EXPLAIN EXTENDED
SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE rev_timestamp >= CAST((NOW() - INTERVAL 1 WEEK) AS BINARY)
  AND rev_timestamp  < CAST((NOW() - INTERVAL 1 MINUTE) AS BINARY)
+------+-------------+----------+-------+---------------+---------------+---------+------+------+----------+-----------------------+
| id   | select_type | table    | type  | possible_keys | key           | key_len | ref  | rows | filtered | Extra                 |
+------+-------------+----------+-------+---------------+---------------+---------+------+------+----------+-----------------------+
|    1 | SIMPLE      | revision | range | rev_timestamp | rev_timestamp | 14      | NULL | 1    |   100.00 | Using index condition |
+------+-------------+----------+-------+---------------+---------------+---------+------+------+----------+-----------------------+
1 row in set, 1 warning (0.002 sec)

Great, now we are picking up the rev_timestamp index. However:

SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE rev_timestamp >= CAST((NOW() - INTERVAL 1 WEEK) AS BINARY)
  AND rev_timestamp  < CAST((NOW() - INTERVAL 1 MINUTE) AS BINARY)
Empty set (0.001 sec)

We get no rows? Let's try instead applying the cast to the left side:

EXPLAIN EXTENDED
SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE CAST(rev_timestamp AS DATETIME) >= NOW() - INTERVAL 1 WEEK
  AND CAST(rev_timestamp AS DATETIME)  < NOW() - INTERVAL 1 MINUTE
+------+-------------+----------+------+---------------+------+---------+------+-----------+----------+-------------+
| id   | select_type | table    | type | possible_keys | key  | key_len | ref  | rows      | filtered | Extra       |
+------+-------------+----------+------+---------------+------+---------+------+-----------+----------+-------------+
|    1 | SIMPLE      | revision | ALL  | NULL          | NULL | NULL    | NULL | 150609133 |   100.00 | Using where |
+------+-------------+----------+------+---------------+------+---------+------+-----------+----------+-------------+
1 row in set, 1 warning (0.001 sec)

So that doesn't work either, likely because there is no way to traverse the index while CASTing.

Let's go back to the CASTing of the right side, because that should, in theory work. This is the behavior:

SELECT rev_timestamp, CAST(rev_timestamp AS DATETIME) as rev_as_dt, CAST(CAST(rev_timestamp AS DATETIME) AS BINARY) from revision limit 10;
+----------------+---------------------+-------------------------------------------------+
| rev_timestamp  | rev_as_dt           | CAST(CAST(rev_timestamp AS DATETIME) AS BINARY) |
+----------------+---------------------+-------------------------------------------------+
| 20010522001935 | 2001-05-22 00:19:35 | 2001-05-22 00:19:35                             |
| 20010522222913 | 2001-05-22 22:29:13 | 2001-05-22 22:29:13                             |
| 20010523195014 | 2001-05-23 19:50:14 | 2001-05-23 19:50:14                             |
| 20010524231034 | 2001-05-24 23:10:34 | 2001-05-24 23:10:34                             |
| 20010525072255 | 2001-05-25 07:22:55 | 2001-05-25 07:22:55                             |
| 20010525103106 | 2001-05-25 10:31:06 | 2001-05-25 10:31:06                             |
| 20010525115355 | 2001-05-25 11:53:55 | 2001-05-25 11:53:55                             |
| 20010525130541 | 2001-05-25 13:05:41 | 2001-05-25 13:05:41                             |
| 20010525192907 | 2001-05-25 19:29:07 | 2001-05-25 19:29:07                             |
| 20010525225352 | 2001-05-25 22:53:52 | 2001-05-25 22:53:52                             |
+----------------+---------------------+-------------------------------------------------+
10 rows in set (0.001 sec)

Weird, as the original rev_timestamp is, somehow not formatted as a DATETIME.

Looking at table definition, and at docs ( https://mariadb.com/kb/en/binary/ ), I can now see that BINARY(14) is very similar to a CHAR(14). A bit wasteful that we are encoding what is, effectively, a LONG, as a CHAR(14). Anyhow, let's try again:

SELECT rev_timestamp, CAST(rev_timestamp AS DATETIME) as rev_ts_as_dt, TO_CHAR(CAST(rev_timestamp AS DATETIME), 'YYYYMMDDHH24MISS')  from revision limit 10;
+----------------+---------------------+--------------------------------------------------------------+
| rev_timestamp  | rev_ts_as_dt        | TO_CHAR(CAST(rev_timestamp AS DATETIME), 'YYYYMMDDHH24MISS') |
+----------------+---------------------+--------------------------------------------------------------+
| 20010522001935 | 2001-05-22 00:19:35 | 20010522001935                                               |
| 20010522222913 | 2001-05-22 22:29:13 | 20010522222913                                               |
| 20010523195014 | 2001-05-23 19:50:14 | 20010523195014                                               |
| 20010524231034 | 2001-05-24 23:10:34 | 20010524231034                                               |
| 20010525072255 | 2001-05-25 07:22:55 | 20010525072255                                               |
| 20010525103106 | 2001-05-25 10:31:06 | 20010525103106                                               |
| 20010525115355 | 2001-05-25 11:53:55 | 20010525115355                                               |
| 20010525130541 | 2001-05-25 13:05:41 | 20010525130541                                               |
| 20010525192907 | 2001-05-25 19:29:07 | 20010525192907                                               |
| 20010525225352 | 2001-05-25 22:53:52 | 20010525225352                                               |
+----------------+---------------------+--------------------------------------------------------------+
10 rows in set (0.001 sec)

Bingo. Let's try again the window but with the new TO_CHAR approach:

EXPLAIN EXTENDED
SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE rev_timestamp >= TO_CHAR(NOW() - INTERVAL 1 WEEK  , 'YYYYMMDDHH24MISS')
  AND rev_timestamp  < TO_CHAR(NOW() - INTERVAL 1 MINUTE, 'YYYYMMDDHH24MISS')
+------+-------------+----------+-------+---------------+---------------+---------+------+--------+----------+-----------------------+
| id   | select_type | table    | type  | possible_keys | key           | key_len | ref  | rows   | filtered | Extra                 |
+------+-------------+----------+-------+---------------+---------------+---------+------+--------+----------+-----------------------+
|    1 | SIMPLE      | revision | range | rev_timestamp | rev_timestamp | 14      | NULL | 298822 |   100.00 | Using index condition |
+------+-------------+----------+-------+---------------+---------------+---------+------+--------+----------+-----------------------+
1 row in set, 1 warning (0.005 sec)


SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE rev_timestamp >= TO_CHAR(NOW() - INTERVAL 1 WEEK  , 'YYYYMMDDHH24MISS')
  AND rev_timestamp  < TO_CHAR(NOW() - INTERVAL 1 MINUTE, 'YYYYMMDDHH24MISS')
LIMIT 10
+-----------+----------------+-------------+---------+---------------------------------+
| rev_id    | rev_timestamp  | rev_deleted | rev_len | rev_sha1                        |
+-----------+----------------+-------------+---------+---------------------------------+
| 160609735 | 20240607183326 |           0 |    6810 | eos9cgcegqtoxb6womjjrjmhlzbecca |
| 160609736 | 20240607183328 |           0 |    6872 | k5sy29f3ugnsxi4ik7pwjfeh9esk5hs |
| 160609738 | 20240607183329 |           0 |    3126 | erlgbvt2lhvcwhz1udxrty1xx2h4zpq |
| 160609737 | 20240607183330 |           0 |  122555 | r4gb3eojvauy2l6717pstsy0mczmbn3 |
| 160609739 | 20240607183334 |           0 |    8611 | t5zlppuh8rrr29i0xjsbzu3sbyq1iuz |
| 160609740 | 20240607183341 |           0 |    8627 | nk9g9gilkzkszk8uv481yypykyl9ygj |
| 160609742 | 20240607183341 |           0 |   61289 | bn9sppdoaqeer3uy9bov78zdtuvtl79 |
| 160609741 | 20240607183342 |           0 |    7190 | eh5bp9ewpyfgqe0uxgf6urze697x975 |
| 160609743 | 20240607183352 |           0 |   40878 | gpowhaobpwy9w9x8a8x1jxkbech1hyt |
| 160609744 | 20240607183358 |           0 |    2761 | 8lb5hdwevs3xzxxvaji4d5abg2zm03r |
+-----------+----------------+-------------+---------+---------------------------------+
10 rows in set (0.001 sec)

Awesome. Let's see how it scales by forcing row counts on a subquery. One week:

SELECT COUNT(1) FROM (
SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE rev_timestamp >= TO_CHAR(NOW() - INTERVAL 1 WEEK  , 'YYYYMMDDHH24MISS')
  AND rev_timestamp  < TO_CHAR(NOW() - INTERVAL 1 MINUTE, 'YYYYMMDDHH24MISS')
) r
+----------+
| COUNT(1) |
+----------+
|   137043 |
+----------+
1 row in set (0.054 sec)

One month:

SELECT COUNT(1) FROM (
SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE rev_timestamp >= TO_CHAR(NOW() - INTERVAL 1 MONTH  , 'YYYYMMDDHH24MISS')
  AND rev_timestamp  < TO_CHAR(NOW() - INTERVAL 1 MINUTE, 'YYYYMMDDHH24MISS')
) r
+----------+
| COUNT(1) |
+----------+
|   623485 |
+----------+
1 row in set (0.205 sec)

One year:

SELECT COUNT(1) FROM (
SELECT rev_id,
       rev_timestamp,
       rev_deleted,
       rev_len,
       rev_sha1
FROM revision
WHERE rev_timestamp >= TO_CHAR(NOW() - INTERVAL 1 YEAR  , 'YYYYMMDDHH24MISS')
  AND rev_timestamp  < TO_CHAR(NOW() - INTERVAL 1 MINUTE, 'YYYYMMDDHH24MISS')
) r
+----------+
| COUNT(1) |
+----------+
|  8663426 |
+----------+
1 row in set (2.579 sec)

~2.6 seconds to go over one year worth of eswiki revisions. Certainly shows great promise!

After some more debugging, I an now able to emit (wiki_db, revision_id) pairs over periods of 6 months for enwiki, one of our biggest wikis, in about ~15 mins of runtime. Most of that time is spent in a single executor, doing a single read against the replica. The replica is responsive while the query is running, and doesn't show signs of overloading as per grafana.

I think this mechanism will certainly work fine for a daily run over the last 24 hours of each wiki.

It may need more scaling for the monthly run over all of the wikis, but I think the fact that we can run it smoothly over 6 months of enwiki is a good sign. If the monthly run takes several days, while mostly using one executor of the cluster, that would be ok I think.

More debugging, and I am now able to emit (wiki_db, revision_id) pairs over periods of 12 months for enwiki, one of our biggest wikis, in about ~5 mins of runtime.

The key change here is up to 8 parallel connections per query to our analytics replicas. I was under the impression that our replicas could not take the load, but they can, admirably. I will now cleanup the PySpark notebook I have been experimenting on, and share here.

xcollazo claimed this task.

Notebook now available at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/notebooks/Can_we_emit_wiki_db__revision_id_pairs_for_reconciliation.ipynb

This concludes this spike. Once we finalize the design of the mechanism proposed in T358373 we'll work on making this code prod quality.