
[4 hours] Investigate whether it's efficient to order by tag value (DBA input requested)
Closed, Resolved · Public · Spike

Description

For the request to order by pageviews, and to make sure we are scoping things in a way that can scale without ballooning, let's investigate whether the PageTriage schema can efficiently order pages by tag value.

Example query (sorting by category count):

SELECT page_namespace, page_title, ptrpt_value
FROM pagetriage_page_tags
JOIN page ON ptrpt_page_id = page_id
WHERE ptrpt_tag_id = 2
ORDER BY CAST(ptrpt_value AS SIGNED) DESC
LIMIT 10

~0.20 secs

Event Timeline

Niharika renamed this task from Investigate whether it's possible to order by tag value to [4 hours] Investigate whether it's possible to order by tag value. Jun 6 2019, 5:41 PM
Niharika triaged this task as Medium priority.
Niharika added a project: Spike.
Restricted Application changed the subtype of this task from "Task" to "Spike". Jun 6 2019, 5:41 PM

Something weird is going on. This should work:

SELECT page_namespace, page_title, ptrpt_value
FROM pagetriage_page_tags
JOIN page ON ptrpt_page_id = page_id
WHERE ptrpt_tag_id = 2 /* Category count */
ORDER BY ptrpt_value DESC
LIMIT 10

But I don't get any results that exceed 9:

+----------------+----------------------------+-------------+
| page_namespace | page_title                 | ptrpt_value |
+----------------+----------------------------+-------------+
|              0 | Erandol,_Maharashtra       | 9           |
|              0 | Nathan_McGuire             | 9           |
|              0 | Tymon_Tytus_Chmielecki     | 9           |
|            118 | Holophragma_mitrata        | 9           |
|              0 | Shochi_Niiyama             | 9           |
|              0 | Harry_C._Kelly_(physicist) | 9           |
|              0 | Christian_Pfleiderer       | 9           |
|              0 | Shahrom_Samiyev            | 9           |
|              0 | Ryo_Hozumi                 | 9           |
|              0 | Ameiva_praesignis          | 9           |
+----------------+----------------------------+-------------+
10 rows in set (0.04 sec)

But if we check for > 10:

SELECT page_namespace, page_title, ptrpt_value
FROM pagetriage_page_tags
JOIN page ON ptrpt_page_id = page_id
WHERE ptrpt_tag_id = 2 /* Category count */
AND ptrpt_value > 10
ORDER BY ptrpt_value DESC
LIMIT 10

we get what are probably more accurate results:

+----------------+---------------------------------------+-------------+
| page_namespace | page_title                            | ptrpt_value |
+----------------+---------------------------------------+-------------+
|              0 | Franco_Calero                         | 63          |
|              0 | Insurgency_in_Tripura                 | 56          |
|              2 | EMann3240                             | 50          |
|              2 | Oshawott_12/Userboxes                 | 42          |
|              0 | Arationality                          | 36          |
|              0 | Hell_Girl_(film)                      | 36          |
|              0 | Tornado_outbreak_sequence_of_May_2019 | 33          |
|              0 | Nightmare_(Halsey_song)               | 33          |
|              0 | Easther_Mayi_Kith                     | 30          |
|              0 | Susan_Beschta                         | 29          |
+----------------+---------------------------------------+-------------+
10 rows in set, 1 warning (1.86 sec)

Also check out the warning:

MariaDB [enwiki_p]> SHOW WARNINGS;
+---------+------+-----------------------------------------+
| Level   | Code | Message                                 |
+---------+------+-----------------------------------------+
| Warning | 1292 | Truncated incorrect DOUBLE value: '50+' |
+---------+------+-----------------------------------------+
1 row in set (0.00 sec)

Apparently there are values stored as '50+', such as incoming link counts:

SELECT page_namespace, page_title, ptrpt_value
FROM pagetriage_page_tags
JOIN page ON ptrpt_page_id = page_id
WHERE ptrpt_tag_id = 1 /* Link count */
AND ptrpt_value > 10
ORDER BY ptrpt_value DESC
LIMIT 10

results:

+----------------+----------------------------------------------+-------------+
| page_namespace | page_title                                   | ptrpt_value |
+----------------+----------------------------------------------+-------------+
|              0 | 1962_Drake_Bulldogs_football_team            | 50+         |
|              0 | 2019_Fiesta_Bowl_(December)                  | 50+         |
|              0 | Lupadium                                     | 50+         |
|              0 | Ceretapa                                     | 50+         |
|              0 | Virginia_Bedell                              | 50+         |
|              0 | Diocaesarea_(Isauria)                        | 50+         |
|              0 | Diocaesarea_(Cappadocia)                     | 50+         |
|              0 | 2019–20_NCAA_football_bowl_games             | 50+         |
|              0 | 2019_Russian_Super_Cup                       | 50+         |
|              0 | 2019_Russian_Championship_(women's_football) | 50+         |
+----------------+----------------------------------------------+-------------+

???

JTannerWMF subscribed.

It looks like Community-Tech is working on this, so I am moving this to External for the Growth Team.

So T225169#5243948 apparently happens because the ptrpt_value column is a VARBINARY. If we cast it to a signed integer, we get what we want:

SELECT page_namespace, page_title, ptrpt_value
FROM pagetriage_page_tags
JOIN page ON ptrpt_page_id = page_id
WHERE ptrpt_tag_id = 2 /* Category count */
ORDER BY CAST(ptrpt_value AS SIGNED) DESC
LIMIT 10

Results (2.37 sec):

+----------------+---------------------------------------+-------------+
| page_namespace | page_title                            | ptrpt_value |
+----------------+---------------------------------------+-------------+
|              0 | Committee_Against_Torture_(UN)        | 198         |
|              0 | Franco_Calero                         | 63          |
|              0 | Insurgency_in_Tripura                 | 56          |
|              2 | EMann3240                             | 50          |
|              2 | Oshawott_12/Userboxes                 | 42          |
|              0 | Arationality                          | 36          |
|              0 | Tornado_outbreak_sequence_of_May_2019 | 33          |
|              0 | Never_Really_Over                     | 31          |
|              0 | Easther_Mayi_Kith                     | 30          |
|              0 | Cross_Me                              | 30          |
+----------------+---------------------------------------+-------------+

The above query is scanning every row in pagetriage_page_tags, so I'd wager 2.37 seconds is pretty good, and it will be much faster in production. There is an index on ptrpt_tag_id and ptrpt_value.

As for the 50+ values, that comes from https://github.com/wikimedia/mediawiki-extensions-PageTriage/blob/master/includes/ArticleCompile/ArticleCompileInterface.php#L66. When storing pageviews, we could simply set the max value to something higher.
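
As a standalone illustration (not from the task itself), here is why an unCASTed sort on these values tops out at '9' while the CAST version surfaces 198; the values below are hypothetical:

SELECT val
FROM (SELECT '9' AS val UNION SELECT '198' UNION SELECT '50+') t
ORDER BY val DESC;
/* string comparison goes character by character, so '9' > '50+' > '198' */

SELECT val
FROM (SELECT '9' AS val UNION SELECT '198' UNION SELECT '50+') t
ORDER BY CAST(val AS SIGNED) DESC;
/* numeric order: 198, 50 ('50+' truncates to 50, raising warning 1292), 9 */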

@MusikAnimal If I am reading this right, you are suggesting it is possible to store pageviews for all pages in the feed and to allow sorting based on that?
Another thing we need to think about is updating this data, however often we do that. We can discuss that on T207238: Special:NewPageFeed - add option to filter by pageviews, though.

@MusikAnimal If I am reading this right, you are suggesting it is possible to store pageviews for all pages in the feed and to allow sorting based on that?

Yes I believe so, though we may want to do more performance testing.

2.37 sec is not good at all. Look at the query plan:

MariaDB [enwiki]> explain SELECT page_namespace, page_title, ptrpt_value
    -> FROM pagetriage_page_tags
    -> JOIN page ON ptrpt_page_id = page_id
    -> WHERE ptrpt_tag_id = 2 /* Category count */
    -> ORDER BY CAST(ptrpt_value AS SIGNED) DESC
    -> LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagetriage_page_tags
         type: ref
possible_keys: ptrpt_page_tag_id,ptrpt_tag_id_value
          key: ptrpt_tag_id_value
      key_len: 4
          ref: const
         rows: 260948
        Extra: Using where; Using index; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.pagetriage_page_tags.ptrpt_page_id
         rows: 1
        Extra:

In general, filesort over 260k rows can be problematic in production environments. Running the query itself, we see 10 rows in set, 1 warning (0.20 sec). The warning is:

MariaDB [enwiki]> show warnings;
+---------+------+------------------------------------------+
| Level   | Code | Message                                  |
+---------+------+------------------------------------------+
| Warning | 1292 | Truncated incorrect INTEGER value: '50+' |
+---------+------+------------------------------------------+
1 row in set (0.00 sec)

Hmm?

MusikAnimal added subscribers: Marostegui, jcrespo.

@Marostegui @jcrespo Apologies for the ping, but we were hoping to get the OK from DBAs on our proposed query to be used on the English Wikipedia. To summarize:

We want to order values in pagetriage_page_tags by ptrpt_value. An example query:

USE enwiki;
SELECT page_namespace, page_title, ptrpt_value
FROM pagetriage_page_tags
JOIN page ON ptrpt_page_id = page_id
WHERE ptrpt_tag_id = 2
ORDER BY CAST(ptrpt_value AS SIGNED) DESC
LIMIT 10

In production this runs in about 0.20 secs. Note we are using CAST because ptrpt_value is a VARBINARY. In our case the values can safely be cast to integers (as opposed to other values, which are intentionally strings).

The query plan:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagetriage_page_tags
         type: ref
possible_keys: ptrpt_page_tag_id,ptrpt_tag_id_value
          key: ptrpt_tag_id_value
      key_len: 4
          ref: const
         rows: 273588
        Extra: Using where; Using index; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.pagetriage_page_tags.ptrpt_page_id
         rows: 1
        Extra:

There are some 2.5 million rows in this table, and it's only scanning about 270,000 of them. However, it is apparently using filesort, and that's the part we're unsure about.

Is the runtime (~0.2 secs) enough to conclude this sort of query is OK?

The query will be used only when the user applies certain sorting options in the Special:NewPagesFeed interface, so we do not expect this query to be run very often.

MusikAnimal renamed this task from [4 hours] Investigate whether it's possible to order by tag value to [4 hours] Investigate whether it's efficient to order by tag value (DBA input requested). Jun 20 2019, 7:40 PM
MusikAnimal updated the task description. (Show Details)

I have been checking this query on enwiki and it doesn't seem to be too bad:

root@db1089.eqiad.wmnet[enwiki]> FLUSH STATUS; pager cat > /dev/null; SELECT page_namespace, page_title, ptrpt_value FROM pagetriage_page_tags JOIN page ON ptrpt_page_id = page_id WHERE ptrpt_tag_id = 2 ORDER BY CAST(ptrpt_value AS SIGNED) DESC LIMIT 10; ; nopager; SHOW STATUS like 'Hand%';
Query OK, 0 rows affected (0.00 sec)

PAGER set to 'cat > /dev/null'
10 rows in set, 1 warning (0.28 sec)

ERROR: No query specified

PAGER set to stdout
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 0      |
| Handler_icp_attempts       | 0      |
| Handler_icp_match          | 0      |
| Handler_mrr_init           | 0      |
| Handler_mrr_key_refills    | 0      |
| Handler_mrr_rowid_refills  | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 0      |
| Handler_read_key           | 17     |
| Handler_read_last          | 0      |
| Handler_read_next          | 136863 |
| Handler_read_prev          | 0      |
| Handler_read_retry         | 0      |
| Handler_read_rnd           | 0      |
| Handler_read_rnd_deleted   | 0      |
| Handler_read_rnd_next      | 0      |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_tmp_update         | 0      |
| Handler_tmp_write          | 0      |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
26 rows in set (0.00 sec)

I have also tested this query on an old host, not warmed up and with spinning disks, and the query takes 0.26 seconds, which is fast enough for a query that won't happen too often, as you mentioned.

Why the ORDER BY CAST(ptrpt_value AS SIGNED)? Please fix your values: store them as integers, and update the '50+' value to NULL, 50, or maxint, but don't put strings for integers in the database ('50+' is presentation logic, not data).

"But we need a string because tags are all blobs"

You may say?

Nope, avoid the EAV model except for the simplest of operations, or you will pay performance penalties. This is not a simple operation, so either it shouldn't be done or an appropriate data model should be used. Optimization may indeed be optional for a home server, but it is compulsory for a high-throughput website.
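
A hypothetical one-off cleanup along those lines (the choice of sentinel here is illustrative, not prescribed):

UPDATE pagetriage_page_tags
SET ptrpt_value = '50'   /* or NULL, or a maxint sentinel such as '4294967295' */
WHERE ptrpt_value = '50+';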

Reading back comments, I realized I am just repeating what was already suggested at T225169#5254216.

Thank you both for the detailed feedback!

Why the ORDER BY CAST(ptrpt_value AS SIGNED)? Please fix your values: store them as integers, and update the '50+' value to NULL, 50, or maxint, but don't put strings for integers in the database ('50+' is presentation logic, not data).

"But we need a string because tags are all blobs," you may say?

Nope, avoid the EAV model except for the simplest of operations, or you will pay performance penalties. This is not a simple operation, so either it shouldn't be done or an appropriate data model should be used. Optimization may indeed be optional for a home server, but it is compulsory for a high-throughput website.

I wholeheartedly agree. As you may know, PageTriage is pretty old; we did not make these design decisions. Indeed, there are text-based values in pagetriage_page_tags, e.g. where ptrpt_tag_id = 9, which is a snippet of the article content. Other types could conceivably be stored as integers. But I wonder if it's worth doing this level of refactoring right now. If we are able to get by using the above query, that may be preferable, because we only have so much time to work on this project. Pinging @Mooeypoo for input. If it is deemed that the performance impact is too great, we will need to rethink our strategy. I should add that with T207238, we'd be inserting an additional ~60,000 rows into the pagetriage_page_tags table.

I am not asking you to change PageTriage, nor was I blaming you for its current state. PageTriage can be good enough for simple usage; what I was warning against is using it for more complex (non-indexed) access. If more rows are going to be inserted, the 27K row scan will turn into a 100K row scan. The timing has to be analyzed in context: the wall-clock time may not be much, but if several queries are run in parallel under load, it could be much longer. This is OK for obscure queries such as admin-only tasks, but I don't consider Special:NewPages an obscure one.

My suggestion would be to create a virtual column with the values suggested at T225169#5277802 and put a functional index on it, so that the ORDER BY + LIMIT is very fast.
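
A sketch of what that could look like in MariaDB, as a standalone illustration: the column and index names are hypothetical, and values like '50+' would need to be normalized first (per the comments above), since the generated column casts every stored value.

ALTER TABLE pagetriage_page_tags
  /* Generated column holding the numeric value; PERSISTENT so it can be indexed */
  ADD COLUMN ptrpt_value_int INT AS (CAST(ptrpt_value AS SIGNED)) PERSISTENT,
  /* Composite index so the WHERE + ORDER BY + LIMIT below reads the index in order, with no filesort */
  ADD INDEX ptrpt_tag_id_value_int (ptrpt_tag_id, ptrpt_value_int);

SELECT page_namespace, page_title, ptrpt_value
FROM pagetriage_page_tags
JOIN page ON ptrpt_page_id = page_id
WHERE ptrpt_tag_id = 2
ORDER BY ptrpt_value_int DESC
LIMIT 10;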

I think we should discuss whether the feature this work would enable is worth the effort.

We've discussed this (and the wider implication of the entire feature) in the Engineering meeting, and agreed it's time to revisit whether the product is worth the significant effort here. We tried to estimate, generally, how long things may take.

Here is our general estimate of what it would take to store pageviews in the database and sort results by them, erring on the side of caution:

  • Add an indexed integer column to the PageTriage table. Since the table isn't extremely large, it will probably not take too long, but it does require DBA review and assistance. Estimated time: A month (with potential for more). A sketch of this schema change follows the list.
  • Refactor the code to use the new column. Estimated time: Days, leaning towards a couple of weeks with QA.
  • Create a cron job to fill in the values. This is honestly the most challenging part; it would require figuring out whether there are limits on how many rows we can update each cycle, and getting the cron job working and approved. There aren't many other cron jobs that do actual data fetching (cron jobs are meant for maintenance), so it will likely require some back-and-forth with ops on how best to do this and what limits to set.
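
A minimal sketch of the schema change from the first bullet, assuming a regular (non-generated) column on pagetriage_page_tags that the cron job would fill; all names are hypothetical:

ALTER TABLE pagetriage_page_tags
  /* Integer copy of the tag value, populated by the cron job */
  ADD COLUMN ptrpt_value_int INT UNSIGNED DEFAULT NULL,
  /* Index so sorting by value within a tag avoids a filesort */
  ADD INDEX ptrpt_tag_id_value_int (ptrpt_tag_id, ptrpt_value_int);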

Our overall estimate for this work, erring on the side of caution but not taking into account major setbacks, is a couple of months, which includes some back-and-forth and waiting for input on tickets, plus a large portion of actual coding.

On top of that, this will add a pretty significant item to continuously maintain, with cron jobs not being 100% reliable for fetching information (again, they're *meant* for maintenance).

@Niharika it is starting to look like this might be too large for us.

The alternative we talked about was displaying pageviews in the table as information, without being able to sort by them. This would be possible from the JavaScript side (fetched on page load from the pageview API) and would give users some extra information. Considering how massive the other option is, should we reconsider?

We've discussed this (and the wider implication of the entire feature) in the Engineering meeting, and agreed it's time to revisit whether the product is worth the significant effort here. We tried to estimate, generally, how long things may take.

Here is our general estimate of what it would take to store pageviews in the database and sort results by them, erring on the side of caution:

  • Add an indexed integer column to the PageTriage table. Since the table isn't extremely large, it will probably not take too long, but it does require DBA review and assistance. Estimated time: A couple of weeks (with some risk of this being months)

From what I have seen, this table only exists on enwiki, test2wiki and testwiki, right?
Two weeks is probably too optimistic; if the above is true, a month is probably more realistic. It is hard to predict: even if the table is small, it is big enough on enwiki that it would require us to do it host by host, and that can take some days, not to mention that we might have to switch priorities due to unexpected things arriving in DBA land (i.e. incidents).
I would suggest you estimate a month for it from the moment the ticket is picked up (it might be less if it goes fine, but that is not guaranteed).

Thanks @Marostegui for the detail.

I think this is further evidence that we shouldn't take on this task right now.

Thanks, @Marostegui -- I edited my comment to reflect the new estimate.