WMDE Technical Wishes developer
User Details
- User Since
- Oct 12 2014, 9:02 PM (595 w, 3 d)
- Availability
- Available
- IRC Nick
- awight
- LDAP User
- Awight
- MediaWiki User
- Adamw
Yesterday
Just for the record, I think it's a good policy to not pull binaries from servers outside of Wikimedia infrastructure—so I'm not asking for a change here, only providing information about what went wrong :-)
The artifact was defined like so, in wmde/config/artifacts.yaml:
page-summary-scraper-0.6.1.tgz:
  id: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/package_files/278501910/download
  source: url
For anyone who wants to monitor memory usage: Thanos
The job seems to be hitting an out-of-memory error now. I increased the memory from 2GB to 4GB but now it crashes in the third chunk of dewiki.
Wrote this too quickly—after looking into it more I see that the BashSensor actually uses the same krb5 credentials path cache so my shortcut *does* work.
Tue, Mar 10
I found that the BashSensor is not getting the correct KRB5CCNAME, and I'll have to implement an append_env which adds my additional variables to the executor's environment.
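A minimal sketch of the append_env idea, assuming it means overlaying extra variables on the inherited environment instead of replacing it. The name `append_env` comes from the comment above; the signature, the `base` parameter, and the example ticket-cache path are hypothetical and not taken from the actual patch.

```python
import os
from typing import Mapping, Optional


def append_env(extra: Mapping[str, str],
               base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the base environment with the extra variables overlaid.

    When no base is given, the current process environment is used, so the
    child process keeps PATH and friends and only gains the additions.
    """
    merged = dict(os.environ if base is None else base)
    merged.update(extra)  # extra variables win over inherited ones
    return merged


# Hypothetical values, for illustration only.
env = append_env({"KRB5CCNAME": "/tmp/krb5cc_example"}, base={"PATH": "/usr/bin"})
```

The resulting dict could then be handed to whatever launches the shell command, for example the `env` argument of `subprocess.run`.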
Mon, Mar 9
We already have a way to pin to a specific revision, for example using oldid in the URL:
https://en.wikipedia.org/w/index.php?title=User:Adamw/DraftTopic.js&oldid=855943446&action=raw&ctype=text/javascript
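As a sketch of how such a pinned URL can be assembled, the helper below rebuilds the example above from a title and a revision ID. The function name `pinned_raw_url` is hypothetical; the query parameters are the ones visible in the URL.

```python
from urllib.parse import urlencode


def pinned_raw_url(title: str, oldid: int,
                   base: str = "https://en.wikipedia.org/w/index.php") -> str:
    """Build a raw-content URL pinned to one revision via oldid."""
    params = {
        "title": title,
        "oldid": oldid,
        "action": "raw",
        "ctype": "text/javascript",
    }
    # Keep ':' and '/' unescaped so the result matches the familiar form.
    return f"{base}?{urlencode(params, safe=':/')}"


url = pinned_raw_url("User:Adamw/DraftTopic.js", 855943446)
```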
Fri, Mar 6
So far, it seems that mainBody (Parsoid-added refListItemId of the main ref, as an attribute on a main+details footnote marker) already carries the information we need. I'm still experimenting with transclusion edge cases.
We found that any new solution to finding a subref's main ref needs to *not* know about main refs (or main part of a main+details) that were produced by a transclusion and do not appear at the top-level document, to avoid perpetuating the issue from T412007: VE unexpectedly copies reference content from one sub-ref to another if the main ref is defined within a template. The main ref's InternalList item if produced by a transclusion should render as missing, "This reference is defined in a template"—until T355858: References from template transclusions should be included (read-only) in the internalList is solved.
Hi, Tech Wishes dev here! After a brief chat, we would recommend disabling the entire suite for now. We've noticed the flakiness as well and it counteracts any value that might be gotten from having the tests. I'm happy to do that if you agree that it's the right direction to go in.
Thu, Mar 5
Wed, Mar 4
Tue, Mar 3
Fri, Feb 27
I'm going to pick this up with a focus on the risk that even our newly-refactored solution could still be incompatible with the {{reflist}} template. The current assumption is that if we can solve this task using our new wiring, then the approach will work out overall since the lack of main content is exactly the hole provisionally filled by the synthetic ref.
(CC'ing data platform engineers who have generously helped us, and might be interested in watching the exciting conclusion.)
Thu, Feb 26
Wed, Feb 25
I think it works!
@brouberol That's amazing, thank you. I'll wait for the chart deployment and will post the outcome here.
Tue, Feb 24
Dumping the summary here before I sign out for the day:
dbname dewiki
snapshot_date 2026-02-02
identical_refs_count 194973
identical_refs_on_pages_with_25_or_less_refs_average 194972.95
identical_refs_on_pages_with_over_25_refs_average 0.7808942
identical_refs_on_pages_with_over_25_refs_count 93687
list_defined_ref_per_page_having_ref 0.36901948
list_defined_ref_sum 731064
max_ref_reuse_average 2.8919237
nested_ref_sum 578
page_count 3093332
pages_with_automatically_named_refs_count 116612
pages_with_identical_refs_and_over_25_refs_count 25736
pages_with_identical_refs_count 91934
pages_with_multiple_reflists_count 29991
pages_with_named_refs_count 896769
pages_with_nested_refs_count 243
pages_with_over_25_refs_count 119974
pages_with_ref_reuse_count 702476
pages_with_refs_count 1981077
pages_with_similar_refs_count 248833
pages_with_subrefs_count 5986
proportion_of_named_refs_uniquely_named_average 0.5766826
proportion_of_pages_with_identical_refs 0.04640607
proportion_of_pages_with_nested_refs 1.2266055E-4
proportion_of_pages_with_similar_refs 0.12560491
proportion_of_pages_with_refs 0.6404346
proportion_of_refs_from_transclusion 0.079681166
proportion_of_refs_having_transclusion 0.3740749
proportion_of_refs_named_average 0.26614386
proportion_of_refs_reused_average 0.118505105
ref_by_transclusion_average 0.7054597
ref_by_transclusion_count 1397570
ref_count 17539527
ref_count_per_page 5.6701083
ref_count_per_page_having_ref 8.853531
reflist_count 2015523
reflists_per_page_having_ref 1.0173875
refs_with_solely_transclusion_count 5475486
refs_with_transclusions_count 6561097
similar_refs_count 1038935
subrefs_sum 62401
transclusion_average 10.077638
transclusion_sum 31173480
wikitext_length_average 6913.2437
Data has landed in wmde.wiki_page_cite_references_raw (per-page) and wmde.wiki_page_cite_references_monthly (totals are in one row for dewiki).
January data is being imported manually and provides a verification of the queries here.
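One cheap sanity check on a totals row is to recompute a derived average from its two underlying counts. The sketch below assumes that identical_refs_on_pages_with_over_25_refs_average is defined as identical_refs_on_pages_with_over_25_refs_count divided by pages_with_over_25_refs_count; that definition is inferred from the numbers in the summary, not confirmed against the queries. The helper name `check_average` is hypothetical.

```python
import math


def check_average(total: int, pages: int, reported: float,
                  rel_tol: float = 1e-5) -> bool:
    """Return True when total / pages agrees with the reported average."""
    return math.isclose(total / pages, reported, rel_tol=rel_tol)


# Counts and average taken from the dewiki 2026-02-02 summary.
ok = check_average(total=93687, pages=119974, reported=0.7808942)
```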
This is the command line I used to process the saved chunks, in my home directory on stat1010:
mix scrape --dir ~/dewiki-chunks-2026-02-02/ --output=dewiki-2026-02-02-page-summary.ndjson
Note that the output file is reopened for each chunk, an unexpected side effect of combining two modes: reading saved chunks from separate files, and writing output to a file. I don't think this will hurt anything; the aggregation step rejects duplicate pages (regardless of whether the revision differs).
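To illustrate why duplicates are harmless, here is a minimal sketch of page-level deduplication over NDJSON lines. It is not the actual aggregation step: the field names `page_id` and `rev_id` and the first-seen-wins rule are assumptions made for the example.

```python
import json
from typing import Iterable, Iterator


def unique_pages(lines: Iterable[str], key: str = "page_id") -> Iterator[dict]:
    """Yield the first row seen for each page and skip later duplicates."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between appended chunks
        row = json.loads(line)
        if row[key] in seen:
            continue  # same page again, even if the revision differs
        seen.add(row[key])
        yield row


sample = [
    '{"page_id": 1, "rev_id": 10}',
    '{"page_id": 1, "rev_id": 11}',
    '{"page_id": 2, "rev_id": 20}',
]
rows = list(unique_pages(sample))
```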
Mon, Feb 23
One more +1 for Spark 3.5.
I've downloaded and stored the 2026-02-02 snapshot chunks on stat1010 and created a new input mode that will allow the scraper to read from those files. This makes it possible to finish processing even after the snapshot is replaced with newer revisions.
The throughput was 20 rows/sec even after batching heavily (100 rows/statement). This is far slower than we can accept, so I'm abandoning the approach.
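For a sense of scale, the sketch below groups rows into statements of 100 and estimates the load time at the observed rate. It assumes one row per page and borrows the dewiki page_count of 3,093,332 from the summary elsewhere in this feed; both helper names are hypothetical.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], size: int = 100) -> Iterator[List[T]]:
    """Group rows into lists of at most `size` items, one per statement."""
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def hours_to_load(row_count: int, rows_per_sec: float) -> float:
    """Estimate wall-clock hours needed at a steady insert rate."""
    return row_count / rows_per_sec / 3600


batches = list(batched(range(250), size=100))
estimate = hours_to_load(3_093_332, 20)  # assumes one row per dewiki page
```

Under those assumptions the estimate comes to roughly 43 hours for a single wiki.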
Fri, Feb 20
After discussion with @xcollazo, I'll take a simpler path and write files to a temporary filesystem. Will be described in a new task...
Wed, Feb 18
Paving over the above errors by sending complex types as strings for the moment.
Running into two issues which strangely didn't appear on the test server.
Seems ready to work on now?
Implementation has been smoke-tested on the Analytics testing cluster. It's verified as able to perform inserts and queries, and can authenticate and encrypt through Kerberos.