jberkel
User

User Details

User Since
Mar 31 2015, 8:12 PM (471 w, 3 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
Jberkel [ Global Accounts ]

Recent Activity

Mon, Mar 25

jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Mon, Mar 25, 11:02 AM · Wikimedia Enterprise (sprint 53), Dumps-Generation
jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

Can anyone clarify though? It seems that the new sub-tasks are now stuck again.

Mon, Mar 25, 11:00 AM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Mon, Mar 18

jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

It probably means the investigation has been "resolved". The main task is now T351712 + subtasks.

Mon, Mar 18, 2:13 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Mar 1 2024

jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Mar 1 2024, 1:47 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Feb 21 2024

jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Feb 21 2024, 8:52 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Feb 19 2024

jberkel added a comment to T349899: 'digero' tool uses an unreasonable amount of disk space.

I'll add a command to automatically clear the tmp storage, that should help

Feb 19 2024, 12:17 PM · Tools
jberkel added a comment to T349899: 'digero' tool uses an unreasonable amount of disk space.

I've deleted tmp and other unused stuff; it's now down to 16GB. Is that acceptable?

Feb 19 2024, 12:14 PM · Tools

Feb 5 2024

jberkel added a comment to T351712: Q3- Q4: Snapshots service is failing to decode some Kafka messages .

Could you explain a bit more what this means, please?

Feb 5 2024, 1:09 PM · Wikimedia Enterprise, Epic

Feb 2 2024

jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Feb 2 2024, 11:50 AM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Jan 26 2024

jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

Latest enwikt dump is now at 9.6 GB, still some way to go to the 13GB of the 20230701 dump (also incomplete, but still useful as a baseline).

Jan 26 2024, 12:10 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation
jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Jan 26 2024, 12:08 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Jan 9 2024

jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

OK. I think it might be worth putting a disclaimer somewhere, perhaps on https://dumps.wikimedia.org/other/enterprise_html/, to warn users that the dumps are incomplete.

Jan 9 2024, 7:14 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation
jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

@REsquito-WMF thanks! So this means the next dumps will have more data, but will still be incomplete until this other bug is fixed?

Jan 9 2024, 6:48 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Dec 31 2023

jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Dec 31 2023, 10:41 AM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Dec 11 2023

jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

@REsquito-WMF not sure if the changes were already in place, but the current enwiktionary NS0 dump is still at 3.5 GB (compared to 13 GB on 20230701).

Dec 11 2023, 8:15 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation
jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Dec 11 2023, 8:13 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Nov 6 2023

jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

Is there anything going to be done about this? The enterprise dumps have been in full failure mode for a few months now and are absolutely unusable. I really don't know how an obvious total failure of service can stay in triage hell for such a long time. I understand WMF resources are limited, but then at least let volunteers help out with this. My question about the code generating the dumps above is still unanswered. The transparency/communication on this whole issue has been miserable.

Nov 6 2023, 9:28 AM · Wikimedia Enterprise (sprint 53), Dumps-Generation
jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Nov 6 2023, 9:09 AM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Oct 27 2023

jberkel added a comment to T349899: 'digero' tool uses an unreasonable amount of disk space.

We don't really need to keep all the old dumps around; I've started the deletion of all dump files before 2023. Different files are needed for different purposes: for the stats, and for the "wanted entries" on Wiktionary. After generating the dumps, all the data "lives" on Wiktionary, except for the raw data, which is hosted on ~tools.digero/www and shouldn't be deleted. Right now it uses about 1.3G.

Oct 27 2023, 2:49 PM · Tools

Oct 24 2023

jberkel added a comment to T165935: "Lua error: not enough memory" on certain en.wiktionary pages.

@tstarling Thanks for unblocking this! 🙌

Oct 24 2023, 4:58 PM · Performance Issue, Scribunto, All-and-every-Wiktionary

Oct 20 2023

jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Oct 20 2023, 3:00 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Oct 5 2023

jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

Some random guessing: perhaps the error handling code is borked, and it just finishes the dump and closes the file (without erroring the process)? But why then would so many repositories hit errors at the same time? All the 7-20 dumps seem to be affected, maybe some site-wide network/server problems which weren't handled properly?

Oct 5 2023, 1:06 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Oct 4 2023

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

so why not simply base the HTML dumps off them?

This does not work. You cannot convert wikitext to HTML in offline mode. You need access to a running MediaWiki instance to render macros, templates, etc. So rendering HTML is by definition a dynamic process and something which has not been available otherwise.

Oct 4 2023, 8:29 AM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

I've suggested this previously: the XML dumps have been around for a very long time, and compared to the HTML version, they are very reliable, so why not simply base the HTML dumps off them? Then you could easily compare the counts of the two files, it would also help users who consume both types of data.

When you say "base off them", are you suggesting that the HTML dumps be produced by iterating over the XML dumps and then fetching the HTML content for each row? This would be a cumbersome approach since it introduces an unnecessary extra dependency. The HTML content of the new dumps is not directly derived from the XML dumps, so I don't see much advantage to this approach. I agree that it would be nice to snapshot the wiki content before dumping but this isn't feasible given the way that HTML rendering requires random access to all articles and templates.

Oct 4 2023, 8:25 AM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

I've suggested this previously: the XML dumps have been around for a very long time, and compared to the HTML version, they are very reliable, so why not simply base the HTML dumps off them? Then you could easily compare the counts of the two files, it would also help users who consume both types of data.

Oct 4 2023, 8:11 AM · Wikimedia Enterprise, Dumps-Generation
jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Oct 4 2023, 8:04 AM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Sep 20 2023

jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

Any idea why this would affect primarily non-wikipedia instances? Is the code which generates these dumps available somewhere?

Sep 20 2023, 4:04 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation
jberkel updated the task description for T345176: {Investigation} Different file sizes for dumps.
Sep 20 2023, 3:07 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Sep 15 2023

jberkel added a comment to T345176: {Investigation} Different file sizes for dumps.

Weirdly, there seems to be less variation in filesizes for Wikipedia dumps:

Sep 15 2023, 8:48 PM · Wikimedia Enterprise (sprint 53), Dumps-Generation

Aug 24 2023

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

file sizes from the most recent enwikt HTML dumps (NS0):

Aug 24 2023, 11:13 AM · Wikimedia Enterprise, Dumps-Generation

Jul 24 2023

jberkel reopened T305407: Stale data / missing pages in HTML ("enterprise") as "Open".

Hasn't been fixed yet, data is still missing.

Jul 24 2023, 5:54 PM · Wikimedia Enterprise, Dumps-Generation

Jul 21 2023

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

Ok, I hope this can be rolled out quickly, it can't get much worse than the current state

Jul 21 2023, 3:03 PM · Wikimedia Enterprise, Dumps-Generation
jberkel reopened T305407: Stale data / missing pages in HTML ("enterprise") as "Open".

I just checked the latest dumps (2023-07-20), and it's now worse: there are around 2.5 million pages missing from the HTML dump (using the XML dump as a baseline).

Jul 21 2023, 2:40 PM · Wikimedia Enterprise, Dumps-Generation
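
A minimal sketch of the comparison described in this comment (XML dump as baseline, HTML dump checked against it), assuming the page titles have already been extracted from both dumps. The function name and the sample titles are hypothetical; reading titles out of the actual dump files is not shown.

```python
from typing import Iterable, Set


def missing_titles(xml_titles: Iterable[str], html_titles: Iterable[str]) -> Set[str]:
    """Return titles present in the XML baseline but absent from the HTML dump."""
    return set(xml_titles) - set(html_titles)


# Hypothetical sample data standing in for titles extracted from each dump.
xml_sample = ["apple", "banana", "cherry"]
html_sample = ["apple", "cherry"]
print(sorted(missing_titles(xml_sample, html_sample)))  # prints ['banana']
```

Holding both title sets in memory is the simplest approach; for very large wikis a sorted merge over two title files would use less memory.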

Jul 12 2023

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

Why was this already marked as resolved? New dumps haven't even been published yet, so it's impossible to verify.

Jul 12 2023, 8:09 PM · Wikimedia Enterprise, Dumps-Generation

Jun 12 2023

jberkel closed T338770: Concurrent gradle jobs on toolforge as Invalid.

Closing this, maybe it'll be useful for future reference. I haven't added documentation to wikitech, not sure where it should go.

Jun 12 2023, 10:46 AM
jberkel added a comment to T338770: Concurrent gradle jobs on toolforge.

I'll see if I can prebuild the binaries and then just launch the commands without gradle to avoid this issue (so the locks are only held during building, not execution).

Jun 12 2023, 9:37 AM
jberkel added a comment to T338770: Concurrent gradle jobs on toolforge.

Have we tried declaring a different gradle home for each job?

https://github.com/gradle/gradle/issues/8750#issuecomment-605016788

Jun 12 2023, 9:24 AM
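
A minimal sketch of the "different gradle home for each job" idea, assuming each job is launched from a Python wrapper. `GRADLE_USER_HOME` is Gradle's environment variable for its user home directory; the helper name, the directory layout, and the job names are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def gradle_env_for_job(
    job_name: str,
    base_dir: Path,
    environ: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build an environment in which Gradle uses a per-job user home."""
    env = dict(os.environ if environ is None else environ)
    job_home = base_dir / f"gradle-home-{job_name}"
    job_home.mkdir(parents=True, exist_ok=True)
    # Each job gets its own Gradle user home, so jobs do not share lock files.
    env["GRADLE_USER_HOME"] = str(job_home)
    return env


# Hypothetical usage (not executed here):
#   import subprocess
#   subprocess.run(["./gradlew", "run"],
#                  env=gradle_env_for_job("stats", Path.home() / "gradle-jobs"))
```

The trade-off is that each job home keeps its own dependency cache, so disk usage and first-run download time grow with the number of jobs.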
jberkel created T338770: Concurrent gradle jobs on toolforge.
Jun 12 2023, 7:30 AM

Jun 10 2023

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

There are ~150 entries missing from the HTML dump (compared to 2200 earlier):

Jun 10 2023, 11:53 PM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

It looks like the situation has improved with the latest dump (20230601, enwikt):

Jun 10 2023, 11:24 PM · Wikimedia Enterprise, Dumps-Generation

Jun 9 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

looks like the files have finally been synced to toolforge!

Jun 9 2023, 9:45 AM · Wikimedia Enterprise, Dumps-Generation

Jun 7 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

still in progress?

Yes please :-) The rsync is still in progress!

Jun 7 2023, 8:36 AM · Wikimedia Enterprise, Dumps-Generation

Jun 5 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

The rsync, which copies the files over to the nfs share accessible to toolforge, is still in progress.

Jun 5 2023, 12:42 PM · Wikimedia Enterprise, Dumps-Generation

Jun 2 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

Looks like the data was copied successfully this time! I've downloaded the enwiktionary-NS0 dump and the checksum matches.

Jun 2 2023, 7:10 AM · Wikimedia Enterprise, Dumps-Generation

May 29 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

Looking into potential fixes and trying to figure out the best way to handle this.

May 29 2023, 3:42 PM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

It might be the case that we are just serving the checksum of the previous dump.
Meaning: we are grabbing the checksum before the upload has finished.

May 29 2023, 12:43 PM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

@ArielGlenn if the API side isn't fixed until the June run would it be possible to ignore the checksums and copy the files regardless? We've been dump-less for 2 months now…

May 29 2023, 9:43 AM · Wikimedia Enterprise, Dumps-Generation

May 25 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

@Protsack.stephan Where are the checksums calculated? Can you re-index the metadata of the dump files on the API side so that they match the actual file content? It looks like they might get calculated before the file is fully processed, or they are calculated from a different version of the file (as you indicated in your comment)?

May 25 2023, 9:36 AM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

@ArielGlenn Is the downloaded data usable, that is, can you decompress the files without error? If the files are OK, maybe it's a problem with the checksum generation: if the checksums are off only for some files, it could be related to the file size. Perhaps some sort of overflow where the hashes are calculated?

May 25 2023, 6:18 AM · Wikimedia Enterprise, Dumps-Generation
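
A minimal sketch of how a downloaded dump file could be checked against a published checksum, to help separate "the file is corrupt" from "the published checksum is stale". The hash algorithm used for the Enterprise dumps is not stated in this thread, so `sha256` is only an assumed default; the function names are hypothetical.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large dumps are never fully loaded into memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def checksum_matches(path: str, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

If the file decompresses cleanly but the digests differ, that points at the published checksum rather than the file content, which is the scenario raised in the comment above.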

May 22 2023

jberkel updated the task description for T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.
May 22 2023, 7:07 AM · Wikimedia Enterprise, Dumps-Generation
jberkel renamed T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs from Missing Enterprise Dumps in 2023-04-20 and 2023-05-01 runs to Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.
May 22 2023, 7:06 AM · Wikimedia Enterprise, Dumps-Generation

May 17 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

Another question: where are the enterprise dumps stored on toolforge now? They seem to have stopped updating in October last year.

$ ls /public/dumps/public/other/enterprise_html/runs/
20220720  20220801  20220820  20220901  20220920  20221001

The rsync job meant to update the dumps after files have been downloaded on the primary host has not been running since last year. It was recently fixed and we expect the data on clouddumps1002 to be updated on the next run.

May 17 2023, 2:46 PM · Wikimedia Enterprise, Dumps-Generation

May 16 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

Another question: where are the enterprise dumps stored on toolforge now? They seem to have stopped updating in October last year.

May 16 2023, 6:45 PM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T320343: Include "make" in all images.

Thanks for moving this one forward!

May 16 2023, 10:09 AM · Toolforge (Software install/update)

May 11 2023

jberkel added a comment to T331765: Outdated page / corrupt data in enwiki-NS0-20230220-ENTERPRISE-HTML.json.tar.gz.

Perhaps the same underlying issue as T305407.

May 11 2023, 3:26 PM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise

May 8 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

The files haven't materialized, guess something is still amiss…

May 8 2023, 7:56 AM · Wikimedia Enterprise, Dumps-Generation

May 4 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

Yes that's what I meant, thanks 🤞

May 4 2023, 12:53 PM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

Ok, so the files have been generated, but not copied? Can they be recovered?

May 4 2023, 6:15 AM · Wikimedia Enterprise, Dumps-Generation

May 2 2023

jberkel added a comment to T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.

Thanks! Is there any way to check the HTML dump progress/state "from the outside"? The XML dumps have a status page + the machine readable dumpstatus.json

May 2 2023, 1:15 PM · Wikimedia Enterprise, Dumps-Generation
jberkel created T335761: Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs.
May 2 2023, 11:07 AM · Wikimedia Enterprise, Dumps-Generation

Apr 11 2023

jberkel added a comment to T303652: Include more namespaces in Wiktionary HTML dumps.

Related to T318371

Apr 11 2023, 6:35 PM · Dumps-Generation, Wikimedia Enterprise

Mar 24 2023

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

Ok, let me know once you have dumps available with the new infra and I'll re-generate them.

Mar 24 2023, 12:35 PM · Wikimedia Enterprise, Dumps-Generation
jberkel added a comment to T303652: Include more namespaces in Wiktionary HTML dumps.

On the English Wiktionary we now use HTML dumps to generate our stats. Some of our content is not in the mainspace and therefore not reflected in the statistics. There are also problems generating information related to proto-languages, these live in the Reconstruction: namespace.

Mar 24 2023, 12:45 AM · Dumps-Generation, Wikimedia Enterprise
jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

Thanks, are you referring to the deprecation of restbase/MCS? On the English Wiktionary, we're relying more and more on these dumps for statistics and maintenance tasks, and many editors have noticed problems with data derived from these dumps.

Mar 24 2023, 12:25 AM · Wikimedia Enterprise, Dumps-Generation

Mar 22 2023

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

Another cache-failure ticket, though probably unrelated: T226931

Mar 22 2023, 9:58 PM · Wikimedia Enterprise, Dumps-Generation

Mar 16 2023

jberkel added a comment to T331906: Add Lua function to read out previous section heading.

Looks like T122934 is relevant and would help with this. Unfortunately, there's been no movement on that task recently.

Mar 16 2023, 12:10 PM · Scribunto, All-and-every-Wiktionary

Dec 5 2022

jberkel added a comment to T320343: Include "make" in all images.

It works when adding -t latest.

Dec 5 2022, 11:29 AM · Toolforge (Software install/update)

Dec 4 2022

jberkel added a comment to T320343: Include "make" in all images.

I've been looking at submitting a patch for this myself, but while building the docker images from https://gerrit.wikimedia.org/g/operations/docker-images/toollabs-images
I get the following error:

Dec 4 2022, 1:41 PM · Toolforge (Software install/update)

Nov 9 2022

jberkel updated the task description for T322725: Allow selection of the page title in 2017 Wikitext Editor on Vector 2022.
Nov 9 2022, 4:55 PM · Verified, MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), Editing-team (Kanban Board), VisualEditor
jberkel added a comment to T322725: Allow selection of the page title in 2017 Wikitext Editor on Vector 2022.

I have disabled all gadgets and beta features (except "Visual Editing" and "New wikitext mode"), still the same result.
I've also tried it with Safari (see screenshot).

Nov 9 2022, 4:47 PM · Verified, MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), Editing-team (Kanban Board), VisualEditor
jberkel updated the task description for T322725: Allow selection of the page title in 2017 Wikitext Editor on Vector 2022.
Nov 9 2022, 9:20 AM · Verified, MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), Editing-team (Kanban Board), VisualEditor
jberkel created T322725: Allow selection of the page title in 2017 Wikitext Editor on Vector 2022.
Nov 9 2022, 9:20 AM · Verified, MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), Editing-team (Kanban Board), VisualEditor

Oct 10 2022

jberkel added a comment to T315276: Latest English Wikipedia Wikimedia Enterprise HTML dumps do not seem to be updated.

The stats now have a correct timestamp, but there's still missing data. Can you please fix this? With this unpredictable mix of old and new data they're useless for most purposes right now, might as well not generate them at all.

Oct 10 2022, 11:33 AM · Dumps-Generation, Wikimedia Enterprise Engineering, Wikimedia Enterprise

Oct 9 2022

jberkel updated the task description for T320343: Include "make" in all images.
Oct 9 2022, 10:16 AM · Toolforge (Software install/update)
jberkel created T320343: Include "make" in all images.
Oct 9 2022, 10:16 AM · Toolforge (Software install/update)
jberkel closed T319269: Missing Enterprise Dumps from 2022-10-01 run as Resolved.
Oct 9 2022, 10:04 AM · Dumps-Generation

Oct 7 2022

jberkel added a comment to T319269: Missing Enterprise Dumps from 2022-10-01 run.

Hmm, dumps are still not available…

Oct 7 2022, 10:14 AM · Dumps-Generation

Oct 4 2022

jberkel added a comment to T319269: Missing Enterprise Dumps from 2022-10-01 run.

Thanks for the update!

Oct 4 2022, 6:35 PM · Dumps-Generation
jberkel created T319269: Missing Enterprise Dumps from 2022-10-01 run.
Oct 4 2022, 8:13 AM · Dumps-Generation

Oct 2 2022

jberkel added a comment to T315276: Latest English Wikipedia Wikimedia Enterprise HTML dumps do not seem to be updated.

@Protsack.stephan great! However, it looks like the October dumps haven't been generated yet?

Oct 2 2022, 6:23 PM · Dumps-Generation, Wikimedia Enterprise Engineering, Wikimedia Enterprise

Aug 22 2022

jberkel added a comment to T315276: Latest English Wikipedia Wikimedia Enterprise HTML dumps do not seem to be updated.

@nfliu Unfortunately, the HTML dumps don't seem to be very reliable at the moment.

Aug 22 2022, 4:01 PM · Dumps-Generation, Wikimedia Enterprise Engineering, Wikimedia Enterprise

Jul 1 2022

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

Any updates on this? The task has been moved around a bit recently, but it's not clear what is happening. Is it difficult to fix?

Jul 1 2022, 8:02 AM · Wikimedia Enterprise, Dumps-Generation

Apr 21 2022

jberkel added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

Just a thought: perhaps the HTML dumps should be generated from the XML dumps, so that the revisions in both match (and they can both be used interchangeably without consistency problems).

Apr 21 2022, 8:15 AM · Wikimedia Enterprise, Dumps-Generation

Apr 4 2022

jberkel updated the task description for T305407: Stale data / missing pages in HTML ("enterprise") .
Apr 4 2022, 9:06 PM · Wikimedia Enterprise, Dumps-Generation
jberkel created T305407: Stale data / missing pages in HTML ("enterprise") .
Apr 4 2022, 8:06 PM · Wikimedia Enterprise, Dumps-Generation

Mar 18 2022

jberkel added a project to T303652: Include more namespaces in Wiktionary HTML dumps: Dumps-Generation.
Mar 18 2022, 8:37 AM · Dumps-Generation, Wikimedia Enterprise

Mar 11 2022

jberkel renamed T303652: Include more namespaces in Wiktionary HTML dumps from Include more namespaces in Wiktionary HTML dump to Include more namespaces in Wiktionary HTML dumps.
Mar 11 2022, 10:26 PM · Dumps-Generation, Wikimedia Enterprise
jberkel created T303652: Include more namespaces in Wiktionary HTML dumps.
Mar 11 2022, 10:10 PM · Dumps-Generation, Wikimedia Enterprise

Dec 7 2021

jberkel added a comment to T122934: Section-scope declarations for Wiktionary template invocations.

Seeing that T114072 is marked as resolved, is it now possible to implement this?

Dec 7 2021, 8:39 PM · All-and-every-Wiktionary, Parsing-Team--ARCHIVED

Nov 21 2021

jberkel added a comment to T88797: #iferror should suppress scribunto error tracking category too.

Even if you don't want to change this behavior, it should probably be mentioned in the documentation of #iferror. Because of this limitation, the function is practically useless when used with Scribunto.

Nov 21 2021, 11:40 AM · MediaWiki-TrackingCategories, ParserFunctions, Scribunto

Nov 5 2021

jberkel added a comment to T295173: Interface 'Wikimedia\NormalizedException' not found.

Thanks for fixing this so quickly, I'll wait for rc2 and re-test.

Nov 5 2021, 5:27 PM · MediaWiki-Vendor, MW-1.37-release
jberkel created T295173: Interface 'Wikimedia\NormalizedException' not found.
Nov 5 2021, 4:50 PM · MediaWiki-Vendor, MW-1.37-release

Aug 8 2021

jberkel added a comment to T165935: "Lua error: not enough memory" on certain en.wiktionary pages.

If Lua on MediaWiki can't be upgraded to 5.2 or later (T178146 is stalled, with "re-evaluation in 2024"), maybe just the GC changes could be backported to 5.1, to have at least some predictable GC behaviour?

Aug 8 2021, 8:12 PM · Performance Issue, Scribunto, All-and-every-Wiktionary

Apr 11 2021

jberkel added a comment to T219351: Java jobs run the Stretch grid seem to require a very large memory reservation.

Thanks for investigating. I'm running a Spark job to parse some CBOR files (~ 100MB) to generate a list of missing words on Wiktionary. It really should not take so much memory but I noticed in the logs that Spark seems to aggressively request a lot of memory up front as some sort of buffer space/working memory. I'll see if I can tweak this. Update: solved.

Apr 11 2021, 11:12 AM · Toolforge

Apr 10 2021

jberkel added a comment to T219351: Java jobs run the Stretch grid seem to require a very large memory reservation.

I'm running into more problems with Java on the grid: the JVM gets killed (exit code 137 = SIGKILL), but if I do a qacct -j jobid I don't see any enforced limits:

Apr 10 2021, 11:16 PM · Toolforge

Mar 29 2021

jberkel added a comment to T165935: "Lua error: not enough memory" on certain en.wiktionary pages.

The problem seems to be that the version of Lua used by Scribunto does not run the garbage collector when a memory allocation fails, so memory is not reclaimed when it is needed most. I would consider that a bug of the implementation. With that behaviour it's hard to tell if we're really out of memory, or if we're just out of luck, because the GC didn't run in time.

Can you tell us how you reached that conclusion?

Mar 29 2021, 12:13 PM · Performance Issue, Scribunto, All-and-every-Wiktionary
jberkel added a comment to T165935: "Lua error: not enough memory" on certain en.wiktionary pages.

The problem seems to be that the version of Lua used by Scribunto does not run the garbage collector when a memory allocation fails, so memory is not reclaimed when it is needed most. I would consider that a bug of the implementation. With that behaviour it's hard to tell if we're really out of memory, or if we're just out of luck, because the GC didn't run in time.

Mar 29 2021, 10:07 AM · Performance Issue, Scribunto, All-and-every-Wiktionary

Nov 23 2020

jberkel closed T267550: Zotero returns different results for same query as Invalid.
Nov 23 2020, 1:16 AM · Citoid
jberkel added a comment to T267550: Zotero returns different results for same query.

In this case it's obviously not Zotero's fault: even when setting a modern user agent in the header, the server/CDN sometimes returns different content.

Nov 23 2020, 1:14 AM · Citoid
jberkel added a comment to T259685: Zeroconf VisualEditor/Parsoid doesn't work on SQLite .

SQLite might not be your typical production setup, but it's very handy for getting a MediaWiki instance up and running in a Docker container; I'm using it to process Wiktionary data. After switching to Parsoid, the performance suffered quite noticeably, which is related to the aggressive write locking mentioned above.

Nov 23 2020, 1:00 AM · Editing-team (Third-party), User-Ryasmeen, MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), Parsoid (Third-party), MW-1.35-notes, Platform Team Workboards (External Code Reviews), Patch-For-Review, VisualEditor, MW-1.35-release, SQLite