
ORES extension tables should be able to be populated using maintenance scripts
Closed, Resolved · Public

Description

We have this WIP patch by @awight but it needs some work

Event Timeline

Ladsgroup raised the priority of this task to Needs Triage.
Ladsgroup updated the task description. (Show Details)
Ladsgroup added subscribers: Ladsgroup, awight.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript · Jan 15 2016, 9:49 PM

Maybe Katie and Marius can help?

awight closed this task as Declined. Jan 27 2016, 3:57 AM
awight claimed this task.

We could determine how much time the cold cache adds to a revision feed pageview, if anyone thinks that's a good idea, but I'm sure it's only a fraction of a second.

hoo added a comment. Jan 27 2016, 9:06 AM

We could determine how much time the cold cache adds to a revision feed pageview, if anyone thinks that's a good idea, but I'm sure it's only a fraction of a second.

All of our loading times are measured in fractions of seconds (except for maybe page saves and page parsing w/o cache), so just adding a few hundred ms is quite terrible (the average web request is below 200ms, I think). How exactly do you plan for this to work? If you want to lazily populate it, please keep race conditions in mind.
This is probably the wrong ticket to talk about this; can you point me to the new one?

awight reopened this task as Open. Jan 27 2016, 9:15 AM

@hoo
I refreshed my memory of how the scores are populated... My last comment was misleading: we never make API requests during page load, so a cold cache adds zero time. Instead, we fetch revscoring data with a FetchScoreJob triggered after page save. When the cache is empty, revscoring data simply isn't available, so it shows up empty.

This was designed with RecentChanges in mind, where only the most recent changes matter, but now I realize that the Special:Contributions integration in particular will require us to go ahead with writing a maintenance script to populate historical scores.

hoo added a comment. Jan 27 2016, 9:24 AM

Oh, I see… so you want to back-populate it to "the beginning of time"? I'm not sure how much sense it makes to compute scores for changes from 2014 and earlier, other than for consistency.

Please keep in mind that that makes performance a much bigger matter… and also storage, which is a little tight on s5 right now.

hoo added a subscriber: jcrespo. Jan 27 2016, 9:25 AM

@jcrespo suggested moving this to the extension database shard. That would mean the data couldn't be joined directly against the main MediaWiki database, but it would also solve the storage capacity problems.

I don't think moving it to the extension database shard would be a good idea, but I think we can get pretty high performance while losing little or no functionality. My suggestions follow (with a rough SQL sketch after the list):

  • Only populate rows for edits that are in the recentchanges table or are unpatrolled (since the flag won't be shown when the edit is patrolled).
  • Don't insert rows where oresc_class = 'false': they serve no meaningful purpose, and the value of such a row (oresc_probability) is just 1 minus the value of the corresponding row with oresc_class = 'true'. This cuts the number of rows in half on all wikis (except enwiki, where it only cuts it to ~0.7 of the current number).
  • Change oresc_class from varchar to tinyint; 'true' would become 1 (in all binary models), and 'a' = 0, 'b' = 1, etc. in the wp10 model. That will save a lot of storage.
  • Change oresc_probability from DECIMAL(10,10) to DECIMAL(2,5): anything after the fifth decimal place is completely negligible and mostly computation error (like 0.79999), where rounding gives better results.

What do you think?
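A rough sketch of what the column-level suggestions above could look like as SQL (hypothetical statements, not the actual patch; DECIMAL(3,3) is used here because MySQL requires the precision to be at least the scale, and the exact precision is discussed further below):

-- Hypothetical schema tweaks for ores_classification (existing rows would need converting first)
ALTER TABLE ores_classification
  MODIFY oresc_class TINYINT UNSIGNED,    -- 'false'/'true' become 0/1; 'a', 'b', ... become 0, 1, ... for wp10
  MODIFY oresc_probability DECIMAL(3,3);  -- stores 0.000 to 0.999, plenty of precision for a probability
-- And the redundant oresc_class = 'false' rows would simply no longer be inserted when scores are written.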

You do not need to make a decision yet, but please say if it would be possible or too difficult. We cannot add more tables to the main databases now, but we may be able to do it in 3 months' time. Also, please give an estimate of the total size.

Halfak added a subscriber: Halfak. Jan 27 2016, 5:46 PM

We shouldn't rely on oresc_class = 'false' to decide whether to add rows or not. For some models, thresholds below 50% make sense. Also, for the goodfaith model, "false" is the interesting/rare class.

I'm not sure that moving oresc_class to tinyint makes sense. varchar would store 'a' in 2 bytes whereas tinyint would require 1 byte.

Re. decimal, I might not be reading the docs right, but I think that DECIMAL(3,3) makes the most sense. This would store anything between 0.000 and 1.00 with up to three digits after the decimal point.

The wp10 model works very differently than the edit quality models (reverted, damaging, goodfaith). I think it would make much more sense to request those scores as needed via the client than to build a cache on the recentchanges table.

Oh, sorry about the misunderstanding. Right now we have two rows in ores_classification for each edit in the damaging, reverted, and goodfaith models. E.g.

(18:20) root@localhost:[wiki]> select * from ores_classification where oresc_rev = 32;
+----------+-----------+-------------+-------------+-------------------+--------------------+
| oresc_id | oresc_rev | oresc_model | oresc_class | oresc_probability | oresc_is_predicted |
+----------+-----------+-------------+-------------+-------------------+--------------------+
|       57 |        32 |           1 | false       |      0.7700000000 |                  1 |
|       58 |        32 |           1 | true        |      0.2300000000 |                  0 |
+----------+-----------+-------------+-------------+-------------------+--------------------+
2 rows in set (0.00 sec)

but the first row is a duplicate of the second one, because 1 - 0.23 = 0.77 (we can simply adjust whatever query we want and still get the results). We don't lose any efficiency or data, etc.

Obviously we can't do the same for the wp10 model.
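As an illustration, if only the 'true' row were stored, the 'false' probability could still be recovered at query time (a hypothetical query against the table layout shown above):

SELECT oresc_rev,
       oresc_probability     AS true_probability,
       1 - oresc_probability AS false_probability
FROM ores_classification
WHERE oresc_rev = 32 AND oresc_model = 1 AND oresc_class = 'true';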

The problem with varchar is not "a": it actually stores 64 bytes, not 2 bytes, because in the definition we declared this column as VARCHAR(32), which means it can contain anything from "A" to "zzzzzzzzzz..." (32 Zs), and each cell would take 64 bytes no matter what. (Please correct me if I'm wrong.)

I agree on the wp10 usage; we can leave the wp10 model out and not store anything for now, until we add functionality related to that model (e.g. T124617) later.

You do not need to make a decision yet, but please say if it would be possible or too difficult. We cannot add more tables to the main databases now, but we may be able to do it in 3 months' time. Also, please give an estimate of the total size.

Okay, I've done some analysis.
Size of a rev_id cell in each wiki:

  • fawiki: 8 bytes
  • enwiki: 9 bytes
  • wikidata: 9 bytes

Current state, one model:

  • [(size of rev + 1) + size of rev + 1 (model) + 5 (class) + 12 (prob) + 1 (predicted)] * 2
    • fawiki: 72 bytes per rev-model, 3 models, 160191 edits per month; it'll grow at a rate of 1.1 MB/day
    • enwiki: 76 bytes, 3 models, 4389018 edits per month; it'll grow at a rate of 31.8 MB/day
    • wikidatawiki: 76 bytes, 1 model, 9687563 edits per month; it'll grow at a rate of 23.4 MB/day, and it'll triple soon

1st improvement, cutting the number of rows in half:

  • [(size of rev) + size of rev + 1 (model) + 5 (class) + 12 (prob) + 1 (predicted)]
    • fawiki: 35 bytes, same situation: 547 KB/day
    • enwiki: 37, 15.5 MB/day
    • wikidatawiki: 37, 11.4 MB/day (later: 34.2)

2nd improvement, changing decimal(10,10) to decimal(3,3):

  • [(size of rev) + size of rev + 1 (model) + 5 (class) + 5 (prob) + 1 (predicted)]
    • fawiki: 28 Bytes, Same situation: 438 KB/day
    • enwiki: 30, 12.57 MB/day
    • wikidatawiki: 30, 9.24 MB/day

(proposed) 3rd improvement, changing 'true' to 0 or 1 in oresc_class:

  • [(size of rev) + size of rev + 1 (model) + 1 (class) + 5 (prob) + 1 (predicted)]
    • fawiki: 24 Bytes, Same situation: 375 KB/day
    • enwiki: 26, 10.89 MB/day
    • wikidatawiki: 26, 8.01 MB/day

(proposed) 4th improvement, dropping the oresc_id primary key:

  • [size of rev + 1 (model) + 1 (class) + 5 (prob) + 1 (predicted)]
    • fawiki: 16 Bytes, Same situation: 250 KB/day
    • enwiki: 17, 7 MB/day
    • wikidatawiki: 17, 5.23 MB/day

Given that we have two indices on ores_classification, in the worst-case scenario you need to triple these numbers.

The ores_model table is completely negligible, with at most about 100 rows.
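As a quick sanity check of the arithmetic above (a hypothetical query; the per-row byte sizes and monthly edit counts are the ones quoted in this comment):

SELECT 72 * 3 * 160191 / 30 / POW(1024, 2) AS fawiki_current_mb_per_day,   -- ~1.1 MB/day
       16 * 3 * 160191 / 30 / POW(1024, 2) AS fawiki_improved_mb_per_day;  -- ~0.24 MB/day (~250 KB)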

We definitely cannot host those tables on the main cluster right now (I am assuming this could expand to support every wiki)- either we host them on the external cluster or it will have to wait until we provision more disk space on it.

@jcrespo, what would be a more reasonable range for these tables? Is there some threshold you are operating from?

We definitely cannot host those tables on the main cluster right now (I am assuming this could expand to support every wiki)- either we host them on the external cluster or it will have to wait until we provision more disk space on it.

No, there are only about 14 wikis supported (see this list), and most of them do not have the damaging model, which is vital to the extension, so only 4 wikis can use it (+ wikidata very soon). We can start with them for now until the next three months, or we can have a road map for that.

s1, s4, s5 (enwiki, commons, wikidata) are ok. The concern here would be performance, with wikidata in particular, for which I have already opened several tickets. Please coordinate with me on wikidata-related changes.
s2 and s3 are almost literally full - I am blocking anything there that adds significant data until we provision new hardware.
s6 and s7 have plenty of space and are not a concern at all.

The hardware requests for s2 and s3 (which are the main blockers) are almost ready, only waiting for my ok to choose a vendor/model, and the ok from those above me. I expect those to be here by 15 March.

Please continue optimizing the space, not so much for disk, which is only a temporary issue, but for performance: less disk == faster access.

s1, s4, s5 (enwiki, commons, wikidata) are ok. The concern here would be performance, with wikidata in particular, for which I have already opened several tickets. Please coordinate with me on wikidata-related changes.
s2 and s3 are almost literally full - I am blocking anything there that adds significant data until we provision new hardware.
s6 and s7 have plenty of space and are not a concern at all.

The hardware requests for s2 and s3 (which are the main blockers) are almost ready, only waiting for my ok to choose a vendor/model, and the ok from those above me. I expect those to be here by 15 March.

Please continue optimizing the space, not so much for disk, which is only a temporary issue, but for performance: less disk == faster access.

Thank you for helping. So you're implying that deploying this extension is okay on enwiki (s1) and fawiki (s7). For wikidata (s5) we need to be careful, but it's probably okay (correct me if I'm wrong). And it's not possible for trwiki and ptwiki (s2) until March. We only want to deploy this extension on these wikis for now, because the related models are not there yet for the others. Maybe we'll add more wikis later, but we'll definitely ask you beforehand.

Yes, that is a good translation.

Change 267425 had a related patch set uploaded (by Ladsgroup):
Database performance improvements

https://gerrit.wikimedia.org/r/267425

Change 267425 merged by jenkins-bot:
Database performance improvements

https://gerrit.wikimedia.org/r/267425

Change 268874 had a related patch set uploaded (by Ladsgroup):
Add PopulateDatabase.php

https://gerrit.wikimedia.org/r/268874

awight reassigned this task from awight to Ladsgroup. Feb 7 2016, 1:23 AM

Change 268874 merged by jenkins-bot:
Add PopulateDatabase.php

https://gerrit.wikimedia.org/r/268874

Ladsgroup closed this task as Resolved. Feb 16 2016, 6:33 PM
Ladsgroup set Security to None.

Hi,

most space issues have been corrected since the last time we talked, due to the datacenter failover + new hardware purchases (unless you intend to take something like half the size of the current data we hold, i.e. half a terabyte per shard). @Ladsgroup, your size estimation is not very clear to me. Please provide a "worst case scenario for 3 years" so I can assess which wikis this could be deployed on.

I am still suggesting starting the deployment on a separate shard (x1, for example), unless you tell me it is not efficient for code reasons (but of course I am open to discussion), given that the particular nature of this functionality seems to be metadata, not data itself - so the access patterns will probably be different too.

A last reminder is that creating new tables, as of now, is blocked by me (as in, has to have my OK) to assess potential data leaks due to labs filtering issues: https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change

@jcrespo Hey, worst case scenario after three years: 401 MB for fawiki, 12 GB for enwiki, 8.56 GB for wikidatawiki. Any other wiki will be in between.
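For reference, these figures appear to roughly follow from the per-day rates of the 3rd improvement above (a hypothetical sanity check of the arithmetic):

SELECT 375 * 365 * 3 / 1024 AS fawiki_3yr_mb,         -- ~401 MB (from 375 KB/day)
       10.89 * 365 * 3 / 1024 AS enwiki_3yr_gb,       -- ~11.6 GB (from 10.89 MB/day)
       8.01 * 365 * 3 / 1024 AS wikidatawiki_3yr_gb;  -- ~8.56 GB (from 8.01 MB/day)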

Two things I haven't considered yet:

  1. I haven't considered growth in the edit rate; it makes things a little complicated to estimate.
  2. We decided not to store scores for edits made by bots, but I haven't included that decision in the estimation. Taking it into account would cut the wikidata numbers roughly in half.

We wanted to use x1 before, but we join against other tables a lot and would run into efficiency issues.

Ok, those numbers are not that big, but let's deploy it incrementally (not to all wikis at the same time) and let's try to keep the per-row/revision overhead to a minimum.

Sure, that's the plan. Thanks :)

Halfak added a comment. Jun 7 2016, 6:25 PM

OK. Let's try out an example iterative plan. Let's say we do the iterations below with at least a week in between to address issues that come up. Would that be reasonable?

1st iteration:

  • fawiki

2nd iteration:

  • wikidatawiki

3rd iteration (wikis with best support):

  • enwiki
  • ruwiki
  • ptwiki
  • trwiki
  • nlwiki

4th iteration (wikis with basic support):

  • arwiki
  • dewiki
  • eswiki
  • etwiki
  • frwiki
  • hewiki
  • huwiki
  • idwiki
  • itwiki
  • plwiki
  • ukwiki
  • viwiki

This was a blocker to this morning's production release, and in the end @thcipriani had to manually create the database in a rush during SWAT. :-( This should have been scheduled properly and not just rammed into a SWAT window.

Dear @Jdforrester-WMF, this task is not about creating the tables; it's about populating them, which is being done using a maintenance script. I didn't know that maintenance/update.php doesn't work in prod and that tables need to be created manually.