Page MenuHomePhabricator

flaggedrevs_statistics is holding millions of rows while FR needs only a handful
Open, Needs TriagePublic

Description

Looking at the code, it stores the stats values every couple of hours but it won't select any of them, only the most recent dates and let the rest rot. It's already holding 2.4M rows in Polish wikisource and I can't make schema changes live on the table due to this. This needs clean up.

Event Timeline

I double checked, it doesn't need any non-most-recent row. Just to be safe, I'll delete anything older than 2022. Now deletion like that will cause read-only given the sheer size of crap in this lovely extension. I have to run it in batches.

There is some older tickets related to this: T163107, T44360 , T177951.

Couple of notes. The flaggedrevs_statistics table is currently the only place to see how Flaggedrews works over a longer period of time so even deleting or tidying the duplicate values should be done so that there is lest one row per day (even if the value is clearly broken = identical than last value).

An even better solution would move displaying stats to wikistats so that there would be formal crosswiki stats system and deprecate this but until then the least some broken logging is better than none.

@Ladsgroup I have used the data for generating the crosswiki reports on how flaggedrevs is working so even it is kind of broken it works for that. Also some data works on somewikis and not on others. No idea why.

We have dumps of this table: https://dumps.wikimedia.org/fiwiki/20221120/fiwiki-20221120-flaggedrevs_statistics.sql.gz you can simply download this one and look at old data. Anything older than a year can be accessed via this.

This would cover your usecase. Again, we need to simply and prune this table, this means sometimes barely used features have to go as well. Specially if they are not part of the standard extension.

And it would make the statistics gathering multiple times harder at same time. Also, even if there is millions rows the amount of the data in the disk is trivial.

The problem is not the disk size, I have explained in details why we need to be careful on what we keep in core databases. It's not the kind of problem you can throw hardware at: T294881#7743146

It's also adds a lot of maintenance burden, for example I can't run schema changes on it live, which turns a maint work that could take an hour to something that would take months requiring read-only times for each wiki.

Third, in any other extension with lower tech debt and steward, that'd be to some degree acceptable but FlaggedRevs has too many issues to count and simplifying its data storage and data model would help keeping it alive a lot.

Yeah, and i have explained here why the data for getting trends is important T44360#3235893.

In any case, you could tidy the database size down just to average the data to daily without losing the history

I believe something like this would work (though needs to be tested first and I don't have clue how it should be done in production).

CREATE TEMPORARY TABLE flaggedrevs_statistics_temp  (
   SELECT 
      FLOOR(AVG(frs_timestamp)) as frs_timestamp, 
      frs_stat_key, 
      FLOOR(AVG(frs_stat_val)) as frs_stat_val 
   FROM 
      flaggedrevs_statistics 
  GROUP BY  frs_stat_key, FLOOR(frs_timestamp/1000000)
);

TRUNCATE flaggedrevs_statistics;
INSERT INTO flaggedrevs_statistics SELECT * FROM  flaggedrevs_statistics_temp;

Also, I believe that updating (adding new rows) the stats is still running through cronjob or something and it could be running less than 12 per day if one would want to reduce the creation of new data without fixing the actual update script.

Yup, it has two sides:

  • Clean up old data
  • Store correctly onwards.

I can make it store only once per day and clean it up somehow. I can't run that query in production, it'll choke replication (large write queries do), but finding ways to do it in chunks shouldn't be hard.

Change 859589 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mediawiki: Reduce the frequency of flaggedrevs updates

https://gerrit.wikimedia.org/r/859589

Having up-to-date data is important, especially for motivation. If I worked hard to get a counter in the namespaces section to zero, and it’s still non-zero after hours, it’s disappointing. A better solution would be still collecting the data frequently, but dropping some older rows (e.g. the previous one unless it’s the one kept as daily statistics).

It’s even more important to have the “list of 5 most active reviewers in the last hour” table up-to-date (it makes no sense to have 23-hours-old data about the most active reviewers in the “last” hour), but that bit seems to actually be live, despite the note about stale data on the top.

I'm starting to clean up the old data (and keeping one per day) but I also realized it has been storing a lot of rows per timestamp as well. For example for plwikisource, it keeps 38 rows per timestamp:

wikiadmin@10.64.32.196(plwikisource)> select * from flaggedrevs_statistics order by frs_timestamp limit 40;
+----------------+--------------------------------+--------------+
| frs_timestamp  | frs_stat_key                   | frs_stat_val |
+----------------+--------------------------------+--------------+
| 20111013052535 | pendingLag-average             |            0 |
| 20111013052535 | reviewLag-average              |         3854 |
| 20111013052535 | reviewLag-median               |         1150 |
| 20111013052535 | reviewLag-percentile:35        |          395 |
| 20111013052535 | reviewLag-percentile:45        |          680 |
| 20111013052535 | reviewLag-percentile:55        |         1338 |
| 20111013052535 | reviewLag-percentile:65        |         1570 |
| 20111013052535 | reviewLag-percentile:75        |         2999 |
| 20111013052535 | reviewLag-percentile:85        |         5140 |
| 20111013052535 | reviewLag-percentile:90        |        10883 |
| 20111013052535 | reviewLag-percentile:95        |        22113 |
| 20111013052535 | reviewLag-sampleEndTimestamp   |   1318483533 |
| 20111013052535 | reviewLag-sampleSize           |           38 |
| 20111013052535 | reviewLag-sampleStartTimestamp |   1310707533 |
| 20111013052535 | reviewedPages-NS:0             |        16739 |
| 20111013052535 | reviewedPages-NS:10            |         1467 |
| 20111013052535 | reviewedPages-NS:100           |        43019 |
| 20111013052535 | reviewedPages-NS:102           |          385 |
| 20111013052535 | reviewedPages-NS:104           |          324 |
| 20111013052535 | reviewedPages-NS:12            |           20 |
| 20111013052535 | reviewedPages-NS:14            |         1798 |
| 20111013052535 | reviewedPages-NS:6             |           50 |
| 20111013052535 | syncedPages-NS:0               |        16739 |
| 20111013052535 | syncedPages-NS:10              |         1467 |
| 20111013052535 | syncedPages-NS:100             |        43019 |
| 20111013052535 | syncedPages-NS:102             |          385 |
| 20111013052535 | syncedPages-NS:104             |          324 |
| 20111013052535 | syncedPages-NS:12              |           20 |
| 20111013052535 | syncedPages-NS:14              |         1798 |
| 20111013052535 | syncedPages-NS:6               |           50 |
| 20111013052535 | totalPages-NS:0                |        21759 |
| 20111013052535 | totalPages-NS:10               |         1467 |
| 20111013052535 | totalPages-NS:100              |        43020 |
| 20111013052535 | totalPages-NS:102              |          385 |
| 20111013052535 | totalPages-NS:104              |          456 |
| 20111013052535 | totalPages-NS:12               |           20 |
| 20111013052535 | totalPages-NS:14               |         1798 |
| 20111013052535 | totalPages-NS:6                |           50 |
| 20111019200751 | pendingLag-average             |            0 |
| 20111019200751 | reviewLag-average              |         3677 |
+----------------+--------------------------------+--------------+
40 rows in set (0.050 sec)

Do we need all of these? Which one is safe to get rid of? Each type removed translates to millions of rows being cleaned up. Maybe syncedPages-NS*?

! In T323495#8419025, @Ladsgroup wrote:
Do we need all of these? Which one is safe to get rid of? Each type removed translates to millions of rows being cleaned up. Maybe syncedPages-NS*?

Important ones are

  • pendingLag-average
  • reviewedPages-*
  • syncedPages-*
  • totalPages-*

reviewLag-* values are known to be broken in some wikis and script is duplicating same data over and over. reviewLag can be detected to be broken by checking how many unique values there is. If the number is well below 500 then it is broken, but if i remember correctly in some wikis it has worked for some time and then broke.

I can check in which wikis it is working, but my guess is that in most of the wikis it is broken and reviewLags could be deleted.

MariaDB [fiwiki_p]> select count(distinct(frs_stat_val)) as number_of_distinct_values, sum(1) as number_of_rows, frs_stat_key from flaggedrevs_statistics group by frs_stat_key;
+---------------------------+----------------+-------------------------------------+
| number_of_distinct_values | number_of_rows | frs_stat_key                        |
+---------------------------+----------------+-------------------------------------+
|                     42253 |          45213 | pendingLag-average                  |
|                         9 |          43228 | reviewLag-anon-average              |
|                         9 |          43228 | reviewLag-anon-median               |
|                         9 |          43228 | reviewLag-anon-percentile:35        |
|                         9 |          43228 | reviewLag-anon-percentile:45        |
|                         8 |          43228 | reviewLag-anon-percentile:55        |
|                         8 |          43228 | reviewLag-anon-percentile:65        |
|                         8 |          43228 | reviewLag-anon-percentile:75        |
|                         8 |          43228 | reviewLag-anon-percentile:85        |
|                         9 |          43228 | reviewLag-anon-percentile:90        |
|                         9 |          43228 | reviewLag-anon-percentile:95        |
|                         6 |          43228 | reviewLag-anon-sampleEndTimestamp   |
|                         9 |          43228 | reviewLag-anon-sampleSize           |
|                         6 |          43228 | reviewLag-anon-sampleStartTimestamp |
|                       157 |           1985 | reviewLag-average                   |
|                        85 |           1985 | reviewLag-median                    |
|                        77 |           1970 | reviewLag-percentile:35             |
|                        90 |           1970 | reviewLag-percentile:45             |
|                        88 |           1970 | reviewLag-percentile:55             |
|                        70 |           1970 | reviewLag-percentile:65             |
|                        67 |           1970 | reviewLag-percentile:75             |
|                        59 |           1970 | reviewLag-percentile:85             |
|                        58 |           1970 | reviewLag-percentile:90             |
|                        46 |           1970 | reviewLag-percentile:95             |
|                       192 |           1985 | reviewLag-sampleEndTimestamp        |
|                        90 |           1985 | reviewLag-sampleSize                |
|                       180 |           1985 | reviewLag-sampleStartTimestamp      |
|                        10 |          43228 | reviewLag-user-average              |
|                         1 |          43228 | reviewLag-user-median               |
|                         1 |          43228 | reviewLag-user-percentile:35        |
|                         1 |          43228 | reviewLag-user-percentile:45        |
|                         3 |          43228 | reviewLag-user-percentile:55        |
|                         4 |          43228 | reviewLag-user-percentile:65        |
|                         4 |          43228 | reviewLag-user-percentile:75        |
|                         6 |          43228 | reviewLag-user-percentile:85        |
|                         8 |          43228 | reviewLag-user-percentile:90        |
|                        10 |          43228 | reviewLag-user-percentile:95        |
|                         6 |          43228 | reviewLag-user-sampleEndTimestamp   |
|                        10 |          43228 | reviewLag-user-sampleSize           |
|                         6 |          43228 | reviewLag-user-sampleStartTimestamp |
|                     40146 |          45213 | reviewedPages-NS:0                  |
|                      5979 |          45213 | reviewedPages-NS:10                 |
|                     16456 |          45213 | reviewedPages-NS:6                  |
|                       245 |          41370 | reviewedPages-NS:828                |
|                     38899 |          45213 | syncedPages-NS:0                    |
|                      6813 |          45213 | syncedPages-NS:10                   |
|                     16506 |          45213 | syncedPages-NS:6                    |
|                       254 |          41370 | syncedPages-NS:828                  |
|                     39060 |          45213 | totalPages-NS:0                     |
|                      6076 |          45213 | totalPages-NS:10                    |
|                     17425 |          45213 | totalPages-NS:6                     |
|                       248 |          41370 | totalPages-NS:828                   |
+---------------------------+----------------+-------------------------------------+

Here is some analysis.

  • Columns reviewLag-anon-average, reviewLag-user-average, have meaningful data in some wikis (such as trwiki and ukwiki)
  • data in other fields is pretty much repeating values

Personally i would drop the reviewLag-* values as in most of the wikis they are broken and it is likely some problems in updating the values in cases where it works least somehow.

Examples

Number of distinct values in selected reviewLag columns in different wikis.

wiki	reviewLag-anon-average	reviewLag-anon-median	reviewLag-anon-percentile:65	reviewLag-anon-sampleSize	reviewLag-anon-sampleStartTimestamp	reviewLag-average	reviewLag-median	reviewLag-percentile:65	reviewLag-percentile:65	reviewLag-user-average	reviewLag-user-median	reviewLag-user-percentile:65	reviewLag-user-sampleEndTimestamp
alswiki_p	3	2	2	3	2	1	1	1	1	2	1	1	1
arwiki_p	17726	36	35	45	23	1	1	1	1	17578	1	6	23
bewiki_p	2627	935	924	456	2544	315	226	224	224	7952	7	10	2543
bnwiki_p	1	1	1	1	1	1	1	1	1
bswiki_p	5426	891	1067	678	7337	5	5	5	5	9799	1	16	7337
cawikinews_p	1	1	1	1	1	1	7	1	1	3
cewiki_p	8	5	6	8	4	82	1	1	183
ckbwiki_p	1	1	1	1	1	1	1
dewiki_p	10	6	6	9	1	2	2	2	2	6	1	2	1
dewiktionary_p	2	2	2	2	2	1	1	1	1	1	1	1	1
elwikinews_p	3	3	2	2	3	8	4	3	3	7	4	4	5
enwiki_p	1	1	1	1	1	1	1	1	1
enwikibooks_p	5	4	4	5	4	2	2	2	2	7	7	7	4
enwikinews_p	1508	418	414	167	24637	219	49	45	45	7751	209	584	26327
eowiki_p	2	2	2	2	2	1	1	1	1	43199	1	1	1
fawiki_p	1	1	1	1	1	1	1
fawikinews_p	4	4	3	4	7	2	2	1	1	75	1	1	10
fiwiki_p	9	9	8	9	6	157	85	70	70	10	1	4	6
frwikinews_p	4	3	2	3	2	1	1	1	1	3	1	1	1
hewikisource_p	1	1	1	1	1	1	3	2	2	3
hiwiki_p	1	1	1	1	1	1	1	1	1
huwiki_p	4	2	2	4	1	1	1	1	1	6	1	1	1
iawiki_p	4	3	4	4	6	1	1	1	1	6	1	1	5
idwiki_p	8	8	7	8	7	1	1	1	1	12	2	3	7
kawiki_p	645	326	312	330	931	14	12	12	12	5558	1	1	930
plwiki_p	9	4	4	10	1	2	1	2	2	387	1	1	1
plwikisource_p	982	59	61	44	356	82	26	25	25	784	1	1	355
plwiktionary_p	10280	1346	1404	949	2079	198	125	121	121	6910	1	1	2078
ptwikibooks_p	8	7	8	7	5	2	2	1	1	6	1	4	5
ptwikinews_p	8	6	6	7	4	1	1	1	1	10	1	3	3
ptwikisource_p	60	32	33	21	284	15	9	8	8	181	4	6	283
ruwiki_p	2	2	2	2	1	1	1	1	1	43166	6	6	1
ruwikinews_p	268	66	47	54	4050	100	26	26	26	11277	3	17	4049
ruwikisource_p	6	5	4	5	5	2	1	1	1	11	1	3	5
ruwiktionary_p	583	22	24	22	24	4	3	3	3	287	1	1	22
sqwiki_p	3	2	2	3	1	2	2	1	1	3	2	3	2
test2wiki_p	2	2	1	2	2	3	3	3	3
trwiki_p	37692	3894	3971	1975	4455	442	374	378	378	32386	1	2	4456
trwikiquote_p	2550	313	307	303	3855	22	17	18	18	9954	6	98	3855
ukwiki_p	3479	460	455	398	456	52	45	45	45	9307	1	89	456
vecwiki_p	349	93	90	76	3022	158	41	46	46	1145	1	4	3021

Code for generating table

Having up-to-date data is important, especially for motivation. If I worked hard to get a counter in the namespaces section to zero, and it’s still non-zero after hours, it’s disappointing. A better solution would be still collecting the data frequently, but dropping some older rows (e.g. the previous one unless it’s the one kept as daily statistics).

Thanks, this is valid point.

It’s even more important to have the “list of 5 most active reviewers in the last hour” table up-to-date (it makes no sense to have 23-hours-old data about the most active reviewers in the “last” hour), but that bit seems to actually be live, despite the note about stale data on the top.

Afaik this doesn't change that list as if i understand correctly the code then the list of most active reviewers are generated on fly (?)

https://github.com/wikimedia/mediawiki-extensions-FlaggedRevs/blob/master/frontend/specialpages/reports/ValidationStatistics.php#L320

What's the difference between syncedPages-* and reviewedPages-*? They seem to have almost identical values in wikis I checked, even the percentage combines these two and I wonder if one can be deleted in favor of the other without much signal being lost.

What's the difference between syncedPages-* and reviewedPages-*?

totalPages-* = total pages in the namespace ( "pages" column Special:ValidationStatistics)
reviewedPages-* = least once reviewed pages in the namespace ( "Reviewed" column Special:ValidationStatistics)
syncedPages-* = number of pages which latest version is reviewed in the namespace ("Reviewed" column Special:ValidationStatistics)

"Synced/Reviewed" column is syncedPages / reviewedPages.
Outdated = reviewed - synced ( ie. pending changes )
Unreviewed = totalPages - reviewed

They seem to have almost identical values in wikis I checked, even the percentage combines these two and I wonder if one can be deleted in favor of the other without much signal being lost.

They cannot be deleted as they are used for calculating values. Also it depends on wiki (and how they are using flaggedrevs), and time when value is taken if there is significant difference in numbers. In example these are from ruwiki.

year	total_pages_ns0_avg	reviewed_pages_ns0_avg	synced_pages_ns0_avg	pending_changes_avg
2011	795339	716591	624527	92064
2012	846635	769079	658162	110917
2013	1020816	943144	778797	164347
2014	1131552	1058719	877582	181137
2015	1229720	1159341	968508	190833
2016	1322675	1267686	1035040	232646
2017	1404197	1342062	1089368	252694
2018	1481811	1419904	1140898	279006
2019	1551381	1490383	1188168	302215
2020	1645599	1579578	1255634	323944
2021	1735227	1667736	1351624	316112
2022	1827644	1756645	1418276	338369

Example: https://quarry.wmcloud.org/query/69228

Okay, I was a bit confused by the terminology in FR. If a synced page means it's fully reviewed and reviewed page means it has at least one reviewed revision, then I personally think "reviewedpage" doesn't bring much value. When dealing with FR in my home wiki, a page is fully reviewed or needs to be reviewed (whether it has one reviewed revision or not), the third piece in between doesn't bring much info to me. Is that correct?

Afaik this doesn't change that list as if i understand correctly the code then the list of most active reviewers are generated on fly (?)

Yes, I know, I just wanted to mention it for the sake of completeness.

Okay, I was a bit confused by the terminology in FR. If a synced page means it's fully reviewed and reviewed page means it has at least one reviewed revision, then I personally think "reviewedpage" doesn't bring much value. When dealing with FR in my home wiki, a page is fully reviewed or needs to be reviewed (whether it has one reviewed revision or not), the third piece in between doesn't bring much info to me. Is that correct?

The difference between unsynced but reviewed and unreviewed pages is that unreviewed pages are not “protected” by FlaggedRevs: their unsynced version is shown to readers instead of a synced one (because there’s no such version), making vandalism there more visible. Therefore it should be a priority to review them (although the practice is probably just the opposite, as reviewing a small diff takes less effort than reviewing a whole new article).

Change 859589 merged by Ladsgroup:

[operations/puppet@production] mediawiki: Reduce the frequency of flaggedrevs updates

https://gerrit.wikimedia.org/r/859589