
Stale data / missing pages in HTML ("enterprise")
Open, Needs Triage · Public · 8 Estimated Story Points · BUG REPORT

Description

  • Download enwiktionary HTML dump (April 1st, 2022)
  • Untar file
Stale data:
$ jq -r 'select(.name == "apreciable")' enwiktionary_*ndjson | head
{
  "name": "apreciable",
  "identifier": 2713698,
  "date_modified": "2021-03-19T05:53:16Z",
  "version": {
    "identifier": 62182446,
    "comment": "convert {{es-adj-old}} to new {{es-adj}} format",

What happens?:
The data returned is from March 2021. ("date_modified": "2021-03-19T05:53:16Z")

What should have happened instead?:
The data returned is from March 2022. (last edit 2022-03-09, diff)

Missing page:
$ jq -r 'select(.name == "paniaguarse")' enwiktionary_*ndjson
$

What happens?:
No output.

What should have happened instead?:
Data is returned for the page paniaguarse (created 2018-07-11)

There seem to be missing or outdated pages in all the recent (enwikt) HTML dumps I've tried. If it's useful, I can try to compile a list by diffing with the XML dump.
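
For reference, here is roughly the kind of comparison I have in mind (a rough sketch only: the file names are placeholders for the downloaded dumps, and the export schema version in the XML namespace is an assumption that depends on the dump):

import bz2
import json
import xml.etree.ElementTree as ET

# MediaWiki XML export schema namespace; the version (0.10) is an assumption.
MW = "{http://www.mediawiki.org/xml/export-0.10/}"

def xml_titles(path):
    """Collect NS0, non-redirect page titles from a pages-articles XML dump."""
    titles = set()
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == MW + "page":
                if elem.findtext(MW + "ns") == "0" and elem.find(MW + "redirect") is None:
                    titles.add(elem.findtext(MW + "title"))
                elem.clear()
    return titles

def html_names(path):
    """Collect the 'name' field of every record in the enterprise NDJSON dump."""
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["name"] for line in f}

# Placeholder file names for the two dumps being compared.
missing = (xml_titles("enwiktionary-20220401-pages-articles.xml.bz2")
           - html_names("enwiktionary_namespace_0.ndjson"))
print(len(missing), "pages missing from the HTML dump")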

To do

  • additional tickets to be created

Event Timeline


Just a thought: perhaps the HTML dumps should be generated from the XML dumps, so that the revisions in both match (and they can both be used interchangeably without consistency problems).

Protsack.stephan triaged this task as Medium priority.
Protsack.stephan set the point value for this task to 3.
Protsack.stephan moved this task from Incoming to In Progress on the Wikimedia Enterprise board.
Lena.Milenko changed the task status from Open to In Progress. May 5 2022, 2:29 AM

I think this might be related to T274359.

Lena.Milenko changed the task status from In Progress to Open. Jun 30 2022, 2:23 PM

Any updates on this? The task has been moved around a bit recently, but it's not clear what is happening. Is it difficult to fix?

Protsack.stephan raised the priority of this task from Medium to Needs Triage. Oct 12 2022, 9:44 AM

Another cache-failure-related ticket, though probably not related to this one: T226931

Hey, sorry for not providing updates. We are currently updating our infrastructure and code; the work is not directly related to this problem, but it may well fix it. Once the current updates are done, we'll double-check whether this is still the case and, if it is, take a closer look into it.

Thanks, are you referring to the deprecation of restbase/MCS? On the English Wiktionary, we're relying more and more on these dumps for statistics and maintenance tasks, and many editors have noticed problems with data derived from these dumps.

I crosschecked the HTML dump data with the XML dump (20230320). Around 2200 entries are missing completely from the HTML dump; here's a sample:

# | entry | date
2516 | азыхә бызшәа | 2022-07-07 13:19:40+00:00
17964 | sonlig | 2022-03-31 05:56:34+00:00
19714 | ܣܡܩ | 2022-03-26 05:41:16+00:00
21483 | サㇰカィ | 2022-03-30 14:39:10+00:00
21656 | オㇰカィポ | 2022-03-30 14:23:01+00:00
25842 | ḫamṭu | 2022-03-24 17:57:00+00:00
42570 | læcewyrt | 2022-03-24 21:26:26+00:00
42572 | gærsgrene | 2022-03-25 11:18:06+00:00
42573 | wyrtmete | 2022-03-25 11:32:47+00:00
42835 | cynegyrd | 2022-08-13 01:45:00+00:00
45618 | أفريقانية | 2022-06-29 10:14:25+00:00
46247 | قانوننامة | 2021-10-02 20:59:24+00:00
47425 | أهداب | 2022-03-19 13:57:52+00:00
47501 | جاودر | 2022-03-31 08:34:36+00:00

Additionally, around 3200 entries are older than reported in the XML dump, in extreme cases with a 12-year lag!

# | entry | date_modified_xml | date_modified_html | diff
7380220 | meith | 2022-03-28 17:15:17+00:00 | 2010-07-06T22:04:06Z | 4282 days 19:11:11
1384598 | admixtured | 2022-03-28 17:42:54+00:00 | 2010-08-09T08:20:10Z | 4249 days 09:22:44
5776613 | strēli | 2022-03-28 13:50:13+00:00 | 2012-11-18T12:51:05Z | 3417 days 00:59:08
5748728 | strēlām | 2022-03-28 13:57:26+00:00 | 2013-08-31T16:53:49Z | 3130 days 21:03:37
4689291 | アオリスト | 2022-03-28 02:10:10+00:00 | 2013-11-09T06:11:26Z | 3060 days 19:58:44
7370597 | ылгын чыкыйа | 2022-03-31 23:09:17+00:00 | 2014-07-16T19:36:27Z | 2815 days 03:32:50
4701424 | ガロッシュ | 2022-03-28 03:19:02+00:00 | 2014-10-26T21:37:29Z | 2709 days 05:41:33
1023219 | Roman brick | 2022-03-29 23:52:49+00:00 | 2015-10-16T23:15:32Z | 2356 days 00:37:17
4680036 | あんしんかん | 2022-03-31 09:46:43+00:00 | 2015-10-22T19:34:26Z | 2351 days 14:12:17
6736831 | República de Uganda | 2022-03-31 08:52:50+00:00 | 2015-11-04T18:21:01Z | 2338 days 14:31:49
1155239 | death cultist | 2022-03-30 16:53:02+00:00 | 2015-12-08T15:04:23Z | 2304 days 01:48:39
1156437 | huncher | 2022-04-01 01:34:29+00:00 | 2016-01-17T22:02:15Z | 2265 days 03:32:14

The last problem concerns entries that have a current timestamp, but where the content itself is outdated. This seems to happen when transcluded templates are updated after the last edit of the transcluding page. I don't have numbers for this one, but it likely affects a lot of pages.

Thanks for the stats. I'm not referring to the restbase/MCS deprecation; we are doing some work on the WME infrastructure.

The stats themselves are really useful; it would be nice to see what changes after we update our infra.

Ok, let me know once you have dumps available with the new infra and I'll regenerate the stats.

It looks like the situation has improved with the latest dump (20230601, enwikt):

SELECT xml.title,
       timestamp AS xml_timestamp,
       date_modified AS html_timestamp,
       date_diff('month', date_modified, timestamp) AS diff
FROM xml
INNER JOIN html ON xml.title = html.title
WHERE redirect IS NULL AND diff > 0 AND xml_timestamp < '2023-05-26'
ORDER BY diff DESC
title | xml_timestamp | html_timestamp | diff (months)
деактивированный | 2023-05-18 23:07:43 | 2018-06-11 04:13:53 | 59
𪓅 | 2023-03-17 20:47:10 | 2020-03-23 04:59:26 | 36
Mahoré | 2023-05-23 16:19:09 | 2020-07-31 07:00:08 | 34
collisionnel | 2023-05-23 11:46:00 | 2022-06-05 06:39:25 | 11
pédocriminel | 2023-05-23 11:45:57 | 2022-11-21 10:40:35 | 6
décisionnel | 2023-05-23 11:45:59 | 2022-11-13 10:47:19 | 6
kunniansa | 2023-04-08 18:32:25 | 2022-12-17 13:12:10 | 4
Traoré | 2023-05-23 16:19:06 | 2023-03-18 00:10:37 | 2
epistolar | 2023-05-23 11:45:56 | 2023-04-11 16:07:47 | 1
criminel | 2023-05-23 11:45:56 | 2023-04-01 22:15:58 | 1
doré | 2023-05-23 16:19:07 | 2023-04-01 20:11:44 | 1
solennel | 2023-05-23 11:45:58 | 2023-04-01 22:25:03 | 1

Only 12 entries have different timestamps. However, there are now duplicate entries in the HTML dump:

SELECT id, title, content_hash, count(*) AS COUNT FROM html GROUP BY id, title, content_hash HAVING count > 1 ORDER BY count DESC LIMIT 20;
id | title | content_hash | count
191204 | alternating current | 55fda023af99ede271daf28c518b5624e450fc6a | 37
206741 | sodium chloride | baf90a0961990ec882ea804f55a407a28b310690 | 31
210561 | Brassicaceae | 1e32d8fd4e708114c64a845cd499bc7a45ef8b18 | 31
233235 | binary star | 1f5af4ac3a7366130ceebf4bdb5ff2658c09c239 | 30
1139414 | powiedzieć | 79c6840aa53a48f315734f3044aaface3de54ec9 | 29
5079578 | 心房 | e5f870e5deec31313878bc9d03dc7ee0e8dce4c3 | 25
1577 | o | 723cbbb05154936dcd59bccf2eff6b288e1cdc20 | 25
6089371 | رسید | 0995b796cf926fc65a80a0b5964dd69fefd93e5a | 24
681927 | stringed instrument | 71c6c2d9728e980a9333ed1ee0f3e039112f3043 | 23
3071222 | ◌̅ | 1c82bd0691db86a65b74bd3142ebfeeb000086db | 22
9377006 | kʰau³⁵ | 9497fc20ebfc745cdec7665cae4ade483287ab05 | 20
1544820 | veado | da6b895496496da9c7e3dc6bfb9ba0df1148236f | 19
6724710 | ocky | 77c2fbc17abd195d3521ce7c37931d5e547c8288 | 18
35872 | toilet paper | faec4095ab15debc590f430a21da1dad06b2c370 | 18
9377338 | bajraktar | 5b930e4cfa7166f1c6cff5d3c8b29d00baba660d | 17
9379511 | ◌᪳ | c3a712be46de83fdaf77dd0ca0825f3d6bde25c3 | 17
254575 | اسد | 5f7c05236cca3daf06385c00ec072c00f4338449 | 17
306224 | Juno | e59da2eb719181c63068c823ad264b4d588703d0 | 17
39671 | autopoiesis | e10a3aa79b1084094e238d8cf74ced1528c07aaa | 17
623589 | take to the cleaners | 201295bae1ab4ee666220df0fced034ade59f33d | 16

There are around 114K entries which are listed more than once in the dump, and 96K are exact duplicates (matching content_hash).

Here are the changes for the top entry:

SELECT id, title, date_modified, content_hash FROM html WHERE title = 'alternating current' ORDER BY date_modified DESC LIMIT 10;
id | title | date_modified | content_hash
191204 | alternating current | 2023-05-31 04:24:41 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:23:56 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:22:46 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:19:51 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:19:11 | 55fda023af99ede271daf28c518b5624e450fc6a
191204 | alternating current | 2023-05-31 04:18:00 | 55fda023af99ede271daf28c518b5624e450fc6a

There are ~150 entries missing from the HTML dump (compared to 2200 earlier):

SELECT xml.title, timestamp FROM xml LEFT OUTER JOIN html ON xml.title = html.title WHERE xml.redirect IS NULL AND html.title IS NULL AND timestamp < '2023-05-25' LIMIT 10;
title | timestamp
youngshits | 2023-04-10 21:19:27
Maegan | 2023-04-14 12:30:52
Connemaras | 2023-05-15 01:12:06
پھا | 2023-01-02 04:02:33
seekinis | 2023-01-30 23:33:17
deprotona | 2023-03-23 19:38:30
cordón de seguridad | 2023-03-24 15:35:33
s**ts | 2023-04-08 20:25:53
oldshits | 2023-04-10 21:20:20
Liszt fever | 2023-04-14 23:21:59

Conversely, around 200 entries are in the HTML dump, but not in the XML dump: in most cases these are entries that have been deleted.

title | date_modified
octaperfect | 2022-08-04 02:36:54
tärkiä | 2023-01-15 02:02:20
przeciwbakterynie | 2023-05-21 23:22:57
companable | 2023-05-21 21:16:48
folwes | 2018-11-27 07:20:42
trước đo | 2023-04-26 10:07:50
habinam | 2023-03-02 17:21:35
keeps it on the barber pole | 2022-08-09 00:10:30
sarouks | 2019-10-15 12:21:08
vô hữu bất như kỷ giả | 2023-04-27 07:13:13
Protsack.stephan renamed this task from Stale data / missing pages in HTML ("enterprise") dumps to [Investigation] Stale data / missing pages in HTML ("enterprise") dumps {5 days}. Jun 15 2023, 1:58 PM
Protsack.stephan changed the point value for this task from 3 to 8.

Why was this already marked as resolved? New dumps haven't even been published yet, so it's impossible to verify.

jberkel reopened this task as Open. Jul 21 2023, 2:39 PM

I just checked the latest dumps (2023-07-20), and it's now worse: there are around 2.5 million pages missing from the HTML dump (using the XML dump as a baseline).

Also note the difference in file size of the dumps:

enwiktionary-NS0-20230701-ENTERPRISE-HTML.json...> 01-Jul-2023 14:25         13021022629  (13.1 GB)
enwiktionary-NS0-20230720-ENTERPRISE-HTML.json...> 20-Jul-2023 13:44          7159413854  ( 7.2 GB)

@LDlulisa-WMF perhaps related to recent changes?

@jberkel Thank you for the updated info. The recent changes have not made it into our production infra yet, they are still in staging. I will keep you posted on when they do.

Ok, I hope this can be rolled out quickly; it can't get much worse than the current state.

Hasn't been fixed yet, data is still missing.

JArguello-WMF renamed this task from [Investigation] Stale data / missing pages in HTML ("enterprise") dumps {5 days} to Stale data / missing pages in HTML ("enterprise"). Jul 26 2023, 5:53 PM

The investigation, timeboxed for 5 days and carried out by @LDlulisa-WMF, has been completed. With the results of the investigation, we'll triage accordingly during the next sprint planning.

JArguello-WMF raised the priority of this task from Medium to Needs Triage. Jul 26 2023, 5:56 PM
JArguello-WMF renamed this task from Stale data / missing pages in HTML ("enterprise") to Stale data / missing pages in HTML ("enterprise"). Aug 2 2023, 2:24 PM
JArguello-WMF updated the task description. (Show Details)

BTW, there is a related issue, T300124, where some categories and templates are not always extracted.

file sizes from the most recent enwikt HTML dumps (NS0):

20230701: 13 GB
20230720: 7.1 GB
20230801: 1.1 GB
20230820: 4.6 GB
20230901: 7.2 GB
20230920: 3 GB
20231001: 5 GB

something's going really wrong there.

Every MediaWiki site publishes a count of its own "content" pages at the API endpoint https://de.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics . The definition of "content" can be confirmed through the https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces API, and for most wikis this will be strictly NS0.

Please include some kind of basic health check after each dump run, to verify that the article count matches these numbers.
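
A minimal sketch of what such a check could look like, assuming the NS0 dump is unpacked to a single NDJSON file with one article per line (the file name, User-Agent, and 2% tolerance below are made up; the siteinfo statistics call is the standard MediaWiki API, though the "articles" count excludes redirects and, depending on the wiki's article-count method, pages without links, so the comparison is approximate):

import json
import sys
from urllib.request import Request, urlopen

# Fetch the wiki's reported count of content pages.
API = ("https://en.wiktionary.org/w/api.php"
       "?action=query&meta=siteinfo&siprop=statistics&format=json")
req = Request(API, headers={"User-Agent": "dump-health-check-sketch/0.1"})
with urlopen(req) as resp:
    expected = json.load(resp)["query"]["statistics"]["articles"]

# Count the articles actually present in the NDJSON dump (placeholder path).
with open("enwiktionary-NS0-ENTERPRISE-HTML.json", encoding="utf-8") as f:
    actual = sum(1 for _ in f)

# Fail the run if the dump deviates by more than an arbitrary 2% tolerance.
if abs(actual - expected) > 0.02 * expected:
    sys.exit(f"FAIL: dump has {actual} articles, site reports {expected}")
print(f"OK: dump has {actual} articles, site reports {expected}")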

Sadly, the stats are not easily available for other namespaces (even though dumps are made for them as well). I think the best way to get those counts at the moment is to use the ElasticSearch endpoint, as described in T312200.

I've suggested this previously: the XML dumps have been around for a very long time, and compared to the HTML version, they are very reliable, so why not simply base the HTML dumps off them? Then you could easily compare the counts of the two files, it would also help users who consume both types of data.

I've suggested this previously: the XML dumps have been around for a very long time, and compared to the HTML version, they are very reliable, so why not simply base the HTML dumps off them? Then you could easily compare the counts of the two files, it would also help users who consume both types of data.

When you say "base off them", are you suggesting that the HTML dumps be produced by iterating over the XML dumps and then fetching the HTML content for each row? This would be a cumbersome approach since it introduces an unnecessary extra dependency. The HTML content of the new dumps is not directly derived from the XML dumps, so I don't see much advantage to this approach. I agree that it would be nice to snapshot the wiki content before dumping but this isn't feasible given the way that HTML rendering requires random access to all articles and templates.

I don't know how the HTML dumps are currently produced (the project is not very transparent), but you probably don't have to transform the XML itself: you could take the page ids and revision numbers from the XML dump as a starting point and then fetch the HTML of those revisions. Of course this is more complicated with template dependencies, but it would already be an improvement and would ensure that the data you get is complete.
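
Roughly along these lines (a sketch only, not a claim about how WME works internally: it fetches revisions via the public REST endpoint /api/rest_v1/page/html/{title}/{revision}, and the pages.tsv input with title/revision pairs extracted from the XML dump, as well as the output layout, are hypothetical):

import csv
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "https://en.wiktionary.org/api/rest_v1/page/html"

def fetch_html(title, revision):
    """Fetch the HTML of one specific revision via the public REST API."""
    url = f"{BASE}/{quote(title, safe='')}/{revision}"
    req = Request(url, headers={"User-Agent": "xml-baseline-sketch/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")

# pages.tsv: one "title<TAB>rev_id" line per page, taken from the XML dump.
with open("pages.tsv", encoding="utf-8") as f:
    for title, rev in csv.reader(f, delimiter="\t"):
        with open(f"{rev}.html", "w", encoding="utf-8") as out:
            out.write(fetch_html(title, rev))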

so why not simply base the HTML dumps off them?

This does not work. You cannot convert wikitext to HTML in offline mode; you need access to a running MediaWiki instance to render macros, templates, etc. So rendering HTML is by definition a dynamic process, and something that has not been available otherwise.

I'm not suggesting this; see my reply above. The XML dumps would serve as a baseline for which page ids / revisions to include.

I ran into another surprising quirk in these dumps: a page may appear multiple times, sometimes with a duplicate revision, or with several different revisions. Please deduplicate by page ID.

For example, the dewiki-20230820 NS0 dump contains these revisions of one article:

["Haus Lohe",192272812]
["Haus Lohe",236444682]
["Haus Lohe",236444682]
["Haus Lohe",236444682]
["Haus Lohe",236444682]

adding some more stale data examples I came across while working with a snapshot of the HTML dumps, specifically /public/dumps/public/other/enterprise_html/runs/20231201/enwiki-NS0-20231201-ENTERPRISE-HTML.json.tar.gz on PAWS: