
Stale data / missing pages in HTML ("enterprise")
Open, Needs Triage · Public · 8 Estimated Story Points · BUG REPORT

Description

  • Download enwiktionary HTML dump (April 1st, 2022)
  • Untar file
Stale data:
$ jq -r 'select(.name == "apreciable")' enwiktionary_*ndjson | head
{
  "name": "apreciable",
  "identifier": 2713698,
  "date_modified": "2021-03-19T05:53:16Z",
  "version": {
    "identifier": 62182446,
    "comment": "convert {{es-adj-old}} to new {{es-adj}} format",

What happens?:
The data returned is from March 2021. ("date_modified": "2021-03-19T05:53:16Z")

What should have happened instead?:
The data returned is from March 2022. (last edit 2022-03-09, diff)

Missing page:
$ jq -r 'select(.name == "paniaguarse")' enwiktionary_*ndjson
$

What happens?:
No output.

What should have happened instead?:
Data is returned for the page paniaguarse (created 2018-07-11)

There seem to be missing or outdated pages in all the recent (enwikt) HTML dumps I've tried. If it's useful, I can try to compile a list by diffing with the XML dump.
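
For reference, here is roughly the kind of comparison I have in mind (a rough sketch only: the file names are placeholders for the downloaded dumps, and the export schema version in the XML namespace is an assumption that depends on the dump):

import bz2
import json
import xml.etree.ElementTree as ET

# MediaWiki XML export schema namespace; the version (0.10) is an assumption.
MW = "{http://www.mediawiki.org/xml/export-0.10/}"

def xml_titles(path):
    """Collect NS0, non-redirect page titles from a pages-articles XML dump."""
    titles = set()
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == MW + "page":
                if elem.findtext(MW + "ns") == "0" and elem.find(MW + "redirect") is None:
                    titles.add(elem.findtext(MW + "title"))
                elem.clear()
    return titles

def html_names(path):
    """Collect the 'name' field of every record in the enterprise NDJSON dump."""
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["name"] for line in f}

# Placeholder file names for the two dumps being compared.
missing = (xml_titles("enwiktionary-20220401-pages-articles.xml.bz2")
           - html_names("enwiktionary_namespace_0.ndjson"))
print(len(missing), "pages missing from the HTML dump")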

To do

  • additional tickets to be created

Event Timeline


Just a thought: perhaps the HTML dumps should be generated from the XML dumps, so that the revisions in both match (and they can both be used interchangeably without consistency problems).

Protsack.stephan triaged this task as Medium priority.
Protsack.stephan set the point value for this task to 3.
Protsack.stephan moved this task from Incoming to In Progress on the Wikimedia Enterprise board.
Lena.Milenko changed the task status from Open to In Progress. May 5 2022, 2:29 AM

I think this might be related to T274359.

Lena.Milenko changed the task status from In Progress to Open. Jun 30 2022, 2:23 PM

Any updates on this? The task has been moved around a bit recently, but it's not clear what is happening. Is it difficult to fix?

Protsack.stephan raised the priority of this task from Medium to Needs Triage. Oct 12 2022, 9:44 AM

Another cache-failure-related ticket, though probably not related to this one: T226931

Hey, sorry for not providing updates. We are currently updating our infrastructure and code; the work is not directly related to this problem, but it may well fix it. Once the current updates are done, we'll double-check whether this is still the case and, if it is, take a closer look into it.

Thanks, are you referring to the deprecation of restbase/MCS? On the English Wiktionary, we're relying more and more on these dumps for statistics and maintenance tasks, and many editors have noticed problems with data derived from these dumps.

I crosschecked the HTML dump data with the XML dump (20230320). Around 2200 entries are missing completely from the HTML dump; here's a sample:

# | entry | date
2516 | азыхә бызшәа | 2022-07-07 13:19:40+00:00
17964 | sonlig | 2022-03-31 05:56:34+00:00
19714 | ܣܡܩ | 2022-03-26 05:41:16+00:00
21483 | サㇰカィ | 2022-03-30 14:39:10+00:00
21656 | オㇰカィポ | 2022-03-30 14:23:01+00:00
25842 | ḫamṭu | 2022-03-24 17:57:00+00:00
42570 | læcewyrt | 2022-03-24 21:26:26+00:00
42572 | gærsgrene | 2022-03-25 11:18:06+00:00
42573 | wyrtmete | 2022-03-25 11:32:47+00:00
42835 | cynegyrd | 2022-08-13 01:45:00+00:00
45618 | أفريقانية | 2022-06-29 10:14:25+00:00
46247 | قانوننامة | 2021-10-02 20:59:24+00:00
47425 | أهداب | 2022-03-19 13:57:52+00:00
47501 | جاودر | 2022-03-31 08:34:36+00:00

Additionally, around 3200 entries are older than reported in the XML dump, in extreme cases with a 12-year lag!

# | entry | date_modified_xml | date_modified_html | diff
7380220 | meith | 2022-03-28 17:15:17+00:00 | 2010-07-06T22:04:06Z | 4282 days 19:11:11
1384598 | admixtured | 2022-03-28 17:42:54+00:00 | 2010-08-09T08:20:10Z | 4249 days 09:22:44
5776613 | strēli | 2022-03-28 13:50:13+00:00 | 2012-11-18T12:51:05Z | 3417 days 00:59:08
5748728 | strēlām | 2022-03-28 13:57:26+00:00 | 2013-08-31T16:53:49Z | 3130 days 21:03:37
4689291 | アオリスト | 2022-03-28 02:10:10+00:00 | 2013-11-09T06:11:26Z | 3060 days 19:58:44
7370597 | ылгын чыкыйа | 2022-03-31 23:09:17+00:00 | 2014-07-16T19:36:27Z | 2815 days 03:32:50
4701424 | ガロッシュ | 2022-03-28 03:19:02+00:00 | 2014-10-26T21:37:29Z | 2709 days 05:41:33
1023219 | Roman brick | 2022-03-29 23:52:49+00:00 | 2015-10-16T23:15:32Z | 2356 days 00:37:17
4680036 | あんしんかん | 2022-03-31 09:46:43+00:00 | 2015-10-22T19:34:26Z | 2351 days 14:12:17
6736831 | República de Uganda | 2022-03-31 08:52:50+00:00 | 2015-11-04T18:21:01Z | 2338 days 14:31:49
1155239 | death cultist | 2022-03-30 16:53:02+00:00 | 2015-12-08T15:04:23Z | 2304 days 01:48:39
1156437 | huncher | 2022-04-01 01:34:29+00:00 | 2016-01-17T22:02:15Z | 2265 days 03:32:14

The last problem concerns entries that have a current timestamp, but where the content itself is outdated. This seems to happen when transcluded templates are updated after the last edit of the transcluding page. I don't have numbers for this one, but it likely affects a lot of pages.

Thanks for the stats. I'm not referring to the restbase/MCS deprecation; we are doing some work on the WME infrastructure.

The stats themselves are really useful; it would be nice to see what changes after we update our infra.

Ok, let me know once you have dumps available with the new infra and I'll regenerate the stats.

It looks like the situation has improved with the latest dump (20230601, enwikt):

SELECT xml.title,
       timestamp AS xml_timestamp,
       date_modified AS html_timestamp,
       date_diff('month', date_modified, timestamp) AS diff
FROM xml
INNER JOIN html ON xml.title = html.title
WHERE redirect IS NULL AND diff > 0 AND xml_timestamp < '2023-05-26'
ORDER BY diff DESC
title | xml_timestamp | html_timestamp | diff (months)
деактивированный | 2023-05-18 23:07:43 | 2018-06-11 04:13:53 | 59
𪓅 | 2023-03-17 20:47:10 | 2020-03-23 04:59:26 | 36
Mahoré | 2023-05-23 16:19:09 | 2020-07-31 07:00:08 | 34
collisionnel | 2023-05-23 11:46:00 | 2022-06-05 06:39:25 | 11
pédocriminel | 2023-05-23 11:45:57 | 2022-11-21 10:40:35 | 6
décisionnel | 2023-05-23 11:45:59 | 2022-11-13 10:47:19 | 6
kunniansa | 2023-04-08 18:32:25 | 2022-12-17 13:12:10 | 4
Traoré | 2023-05-23 16:19:06 | 2023-03-18 00:10:37 | 2
epistolar | 2023-05-23 11:45:56 | 2023-04-11 16:07:47 | 1
criminel | 2023-05-23 11:45:56 | 2023-04-01 22:15:58 | 1
doré | 2023-05-23 16:19:07 | 2023-04-01 20:11:44 | 1
solennel | 2023-05-23 11:45:58 | 2023-04-01 22:25:03 | 1

Only 12 entries have different timestamps. However, there are now duplicate entries in the HTML dump:

SELECT id, title, content_hash, count(*) AS COUNT FROM html GROUP BY id, title, content_hash HAVING count > 1 ORDER BY count DESC LIMIT 20;
id | title | content_hash | count
191204 | alternating current | 55fda023af99ede271daf28c518b5624e450fc6a | 37
206741 | sodium chloride | baf90a0961990ec882ea804f55a407a28b310690 | 31
210561 | Brassicaceae | 1e32d8fd4e708114c64a845cd499bc7a45ef8b18 | 31
233235 | binary star | 1f5af4ac3a7366130ceebf4bdb5ff2658c09c239 | 30
1139414 | powiedzieć | 79c6840aa53a48f315734f3044aaface3de54ec9 | 29
5079578 | 心房 | e5f870e5deec31313878bc9d03dc7ee0e8dce4c3 | 25
1577 | o | 723cbbb05154936dcd59bccf2eff6b288e1cdc20 | 25
6089371 | رسید | 0995b796cf926fc65a80a0b5964dd69fefd93e5a | 24
681927 | stringed instrument | 71c6c2d9728e980a9333ed1ee0f3e039112f3043 | 23
3071222 | ◌̅ | 1c82bd0691db86a65b74bd3142ebfeeb000086db | 22
9377006 | kʰau³⁵ | 9497fc20ebfc745cdec7665cae4ade483287ab05 | 20
1544820 | veado | da6b895496496da9c7e3dc6bfb9ba0df1148236f | 19
6724710 | ocky | 77c2fbc17abd195d3521ce7c37931d5e547c8288 | 18
35872 | toilet paper | faec4095ab15debc590f430a21da1dad06b2c370 | 18
9377338 | bajraktar | 5b930e4cfa7166f1c6cff5d3c8b29d00baba660d | 17
9379511 | ◌᪳ | c3a712be46de83fdaf77dd0ca0825f3d6bde25c3 | 17
254575 | اسد | 5f7c05236cca3daf06385c00ec072c00f4338449 | 17
306224 | Juno | e59da2eb719181c63068c823ad264b4d588703d0 | 17
39671 | autopoiesis | e10a3aa79b1084094e238d8cf74ced1528c07aaa | 17
623589 | take to the cleaners | 201295bae1ab4ee666220df0fced034ade59f33d | 16

There are around 114K entries which are listed more than once in the dump, and 96K are exact duplicates (matching content_hash).

Here are the changes for the top entry:

SELECT id, title, date_modified, content_hash FROM html WHERE title = 'alternating current' ORDER BY date_modified DESC LIMIT 10;
id | title | date_modified | content_hash
191204 | alternating current | 2023-05-31 04:24:41 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:23:56 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:19:51 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:19:11 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:18:00 | 55fda023af99ede271daf28c518b5624e450fc6a

There are ~150 entries missing from the HTML dump (compared to 2200 earlier):

SELECT xml.title, timestamp FROM xml LEFT OUTER JOIN html ON xml.title = html.title WHERE xml.redirect IS NULL AND html.title IS NULL AND timestamp < '2023-05-25' LIMIT 10;
title | timestamp
youngshits | 2023-04-10 21:19:27
Maegan | 2023-04-14 12:30:52
Connemaras | 2023-05-15 01:12:06
پھا | 2023-01-02 04:02:33
seekinis | 2023-01-30 23:33:17
deprotona | 2023-03-23 19:38:30
cordón de seguridad | 2023-03-24 15:35:33
s**ts | 2023-04-08 20:25:53
oldshits | 2023-04-10 21:20:20
Liszt fever | 2023-04-14 23:21:59

Conversely, around 200 entries are in the HTML dump, but not in the XML dump: in most cases these are entries that have been deleted.

title | date_modified
octaperfect | 2022-08-04 02:36:54
tärkiä | 2023-01-15 02:02:20
przeciwbakterynie | 2023-05-21 23:22:57
companable | 2023-05-21 21:16:48
folwes | 2018-11-27 07:20:42
trước đo | 2023-04-26 10:07:50
habinam | 2023-03-02 17:21:35
keeps it on the barber pole | 2022-08-09 00:10:30
sarouks | 2019-10-15 12:21:08
vô hữu bất như kỷ giả | 2023-04-27 07:13:13
Protsack.stephan renamed this task from Stale data / missing pages in HTML ("enterprise") dumps to [Investigation] Stale data / missing pages in HTML ("enterprise") dumps {5 days}. Jun 15 2023, 1:58 PM
Protsack.stephan changed the point value for this task from 3 to 8.

Why was this already marked as resolved? New dumps haven't even been published yet, so it's impossible to verify.

jberkel reopened this task as Open. Jul 21 2023, 2:39 PM

I just checked the latest dumps (2023-07-20), and it's now worse: there are around 2.5 million pages missing from the HTML dump (using the XML dump as a baseline).

Also note the difference in file size of the dumps:

enwiktionary-NS0-20230701-ENTERPRISE-HTML.json...> 01-Jul-2023 14:25         13021022629  (13.1 GB)
enwiktionary-NS0-20230720-ENTERPRISE-HTML.json...> 20-Jul-2023 13:44          7159413854  ( 7.2 GB)

@LDlulisa-WMF perhaps related to recent changes?

@jberkel Thank you for the updated info. The recent changes have not made it into our production infra yet, they are still in staging. I will keep you posted on when they do.

Ok, I hope this can be rolled out quickly; it can't get much worse than the current state.

Hasn't been fixed yet, data is still missing.

JArguello-WMF renamed this task from [Investigation] Stale data / missing pages in HTML ("enterprise") dumps {5 days} to Stale data / missing pages in HTML ("enterprise"). Jul 26 2023, 5:53 PM

The investigation, timeboxed for 5 days and carried out by @LDlulisa-WMF, has been completed. With the results of the investigation, we'll triage accordingly during the next sprint planning.

JArguello-WMF raised the priority of this task from Medium to Needs Triage. Jul 26 2023, 5:56 PM
JArguello-WMF renamed this task from Stale data / missing pages in HTML ("enterprise") to Stale data / missing pages in HTML ("enterprise"). Aug 2 2023, 2:24 PM
JArguello-WMF updated the task description. (Show Details)

BTW, there is a related issue, T300124, where some categories and templates are not always extracted.

file sizes from the most recent enwikt HTML dumps (NS0):

20230701: 13 GB
20230720: 7.1 GB
20230801: 1.1 GB
20230820: 4.6 GB
20230901: 7.2 GB
20230920: 3 GB
20231001: 5 GB

something's going really wrong there.

Every MediaWiki site publishes a count of its own "content" pages at the API endpoint https://de.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics . The definition of "content" can be confirmed through the https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces API, and for most wikis this will be strictly NS0.

Please include some kind of basic health check after each dump run, to verify that the article count matches these numbers.
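
A minimal sketch of what such a check could look like, assuming the NS0 dump is unpacked to a single NDJSON file with one article per line (the file name, User-Agent, and 2% tolerance below are made up; the siteinfo statistics call is the standard MediaWiki API, though the "articles" count excludes redirects and, depending on the wiki's article-count method, pages without links, so the comparison is approximate):

import json
import sys
from urllib.request import Request, urlopen

# Fetch the wiki's reported count of content pages.
API = ("https://en.wiktionary.org/w/api.php"
       "?action=query&meta=siteinfo&siprop=statistics&format=json")
req = Request(API, headers={"User-Agent": "dump-health-check-sketch/0.1"})
with urlopen(req) as resp:
    expected = json.load(resp)["query"]["statistics"]["articles"]

# Count the articles actually present in the NDJSON dump (placeholder path).
with open("enwiktionary-NS0-ENTERPRISE-HTML.json", encoding="utf-8") as f:
    actual = sum(1 for _ in f)

# Fail the run if the dump deviates by more than an arbitrary 2% tolerance.
if abs(actual - expected) > 0.02 * expected:
    sys.exit(f"FAIL: dump has {actual} articles, site reports {expected}")
print(f"OK: dump has {actual} articles, site reports {expected}")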

Sadly, the stats are not easily available for other namespaces (even though dumps are made for them as well). I think the best way to get those counts at the moment is to use the ElasticSearch endpoint, as described in T312200.

I've suggested this previously: the XML dumps have been around for a very long time, and compared to the HTML version, they are very reliable, so why not simply base the HTML dumps off them? Then you could easily compare the counts of the two files, it would also help users who consume both types of data.

I've suggested this previously: the XML dumps have been around for a very long time, and compared to the HTML version, they are very reliable, so why not simply base the HTML dumps off them? Then you could easily compare the counts of the two files, it would also help users who consume both types of data.

When you say "base off them", are you suggesting that the HTML dumps be produced by iterating over the XML dumps and then fetching the HTML content for each row? This would be a cumbersome approach since it introduces an unnecessary extra dependency. The HTML content of the new dumps is not directly derived from the XML dumps, so I don't see much advantage to this approach. I agree that it would be nice to snapshot the wiki content before dumping but this isn't feasible given the way that HTML rendering requires random access to all articles and templates.

I don't know how the HTML dumps are currently produced (the project is not very transparent), but you probably don't have to transform the XML itself: you could take the page ids and revision numbers from the XML dump as a starting point and then fetch the HTML of those revisions. Of course this is more complicated with template dependencies, but it would already be an improvement and would ensure that the data you get is complete.
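
Roughly along these lines (a sketch only, not a claim about how WME works internally: it fetches revisions via the public REST endpoint /api/rest_v1/page/html/{title}/{revision}, and the pages.tsv input with title/revision pairs extracted from the XML dump, as well as the output layout, are hypothetical):

import csv
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "https://en.wiktionary.org/api/rest_v1/page/html"

def fetch_html(title, revision):
    """Fetch the HTML of one specific revision via the public REST API."""
    url = f"{BASE}/{quote(title, safe='')}/{revision}"
    req = Request(url, headers={"User-Agent": "xml-baseline-sketch/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")

# pages.tsv: one "title<TAB>rev_id" line per page, taken from the XML dump.
with open("pages.tsv", encoding="utf-8") as f:
    for title, rev in csv.reader(f, delimiter="\t"):
        with open(f"{rev}.html", "w", encoding="utf-8") as out:
            out.write(fetch_html(title, rev))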

so why not simply base the HTML dumps off them?

This does not work. You cannot convert wikitext to HTML in offline mode; you need access to a running MediaWiki instance to render macros, templates, etc. So rendering HTML is by definition a dynamic process, and something that has not been available otherwise.

I'm not suggesting this; see my reply above. The XML dumps would serve as a baseline for which page ids / revisions to include.

I ran into another surprising quirk in these dumps: a page may appear multiple times, sometimes with a duplicate revision, or with several different revisions. Please deduplicate by page ID.

For example, the dewiki-20230820 NS0 dump contains these revisions of one article:

["Haus Lohe",192272812]
["Haus Lohe",236444682]
["Haus Lohe",236444682]
["Haus Lohe",236444682]
["Haus Lohe",236444682]

adding some more stale data examples I came across while working with a snapshot of the HTML dumps, specifically /public/dumps/public/other/enterprise_html/runs/20231201/enwiki-NS0-20231201-ENTERPRISE-HTML.json.tar.gz on PAWS: