
Add Welsh Wikipedia to Structured Contents snapshots for NLW Hackathon [2 days]
Closed, Resolved, Public

Description

The National Library of Wales is running its annual Welsh-history hackathon on 20 Feb 2026, with an AI focus this year, and would like to use Structured Contents data from Welsh Wikipedia.

Welsh Wikipedia size: ~283,960 articles, ~1.9 GB

TO DO

  • Roll out Structured Contents snapshot generation for Welsh Wikipedia (cywiki) and include it as a supported language in our SC snapshots.
  • (Jackeline/Jolan/Stephanie) Set up checklist and follow release steps - checklist done, public release is in progress
  • (Stephanie) Download one-off dataset to Gdrive (?) to use for event - Done by @JWuyts-WMF

Context/links

Details

Due Date
Feb 10 2026, 5:00 AM

Event Timeline

For future additions, add the steps needed so that engineers can do this themselves.

JArguello-WMF renamed this task from Welsh Wikipedia Structured Contents dataset for NLW Hackathon to Welsh Wikipedia Structured Contents dataset for NLW Hackathon [2 days]. Jan 28 2026, 2:09 PM
JArguello-WMF updated the task description.
SDelbecque-WMF renamed this task from Welsh Wikipedia Structured Contents dataset for NLW Hackathon [2 days] to Add Welsh Wikipedia to Structured Contents snapshots for NLW Hackathon [2 days]. Jan 28 2026, 2:20 PM
SDelbecque-WMF updated the task description.
SDelbecque-WMF updated the task description.
JArguello-WMF set Due Date to Feb 10 2026, 5:00 AM.

Hi everyone. I just wanted to check the progress on this, as our Hackathon event is tomorrow and it would be great to have access to the data. Many thanks @Wittylama @JArguello-WMF @KMontalva-WMF

Hiya @Jason.nlw! Apologies that we weren't able to provide this dataset to you earlier. I've had word from our engineers that this dataset is now ready for processing, which means the Structured Contents Snapshot will be generated tonight. I will send you the full Welsh Structured Contents bundle over WeTransfer to your email address first thing tomorrow. Here are the API calls you and other participants can make once the dataset has been processed, i.e. starting tomorrow morning:

  1. Go through the Getting Started section of the docs to sign up for an account and use the Login method to get an access_token. Make sure to enter your username all lowercase when using the Login method!
  2. Get Structured Contents Snapshot Bundle Info for cywiki. Enter your access token where the code says 'ACCESS_TOKEN':
curl -L "https://api.enterprise.wikimedia.com/v2/snapshots/structured-contents/cywiki_namespace_0" -H "Authorization: Bearer ACCESS_TOKEN"
  3. Download the Structured Contents Snapshot Bundle for cywiki:
curl -L "https://api.enterprise.wikimedia.com/v2/snapshots/structured-contents/cywiki_namespace_0/download" -H "Authorization: Bearer ACCESS_TOKEN"

The commands above are cURL commands formatted to run in a cmd.exe shell. If you're using a different type of shell, you can replace the double quotes with single quotes.
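For anyone who would rather script this in a POSIX shell, here is a minimal sketch that wraps the two calls above. The function names, the `-f` flag, and the default output filename (including its `.tar.gz` extension) are assumptions made for illustration and are not taken from the Enterprise docs. Only the two URLs and the `Authorization` header come from the commands above. Note that the download command as written prints the bundle to the terminal, so the sketch adds `-o` to save it to a file.

```shell
#!/bin/sh
# Sketch only: helper functions around the two API calls quoted above.
API_BASE="https://api.enterprise.wikimedia.com/v2/snapshots/structured-contents"

# Build the bundle-info URL for a snapshot identifier such as cywiki_namespace_0.
snapshot_info_url() {
  printf '%s/%s\n' "$API_BASE" "$1"
}

# Build the download URL for the same identifier.
snapshot_download_url() {
  printf '%s/download\n' "$(snapshot_info_url "$1")"
}

# Lowercase a username, since the Login method expects it in all lowercase.
lowercase_username() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

# Print the bundle info, then save the bundle to a file.
# ACCESS_TOKEN must be set beforehand; the default filename is an assumption.
download_snapshot() {
  id="$1"
  out="${2:-$id.tar.gz}"
  curl -fL "$(snapshot_info_url "$id")" \
    -H "Authorization: Bearer $ACCESS_TOKEN" || return 1
  curl -fL "$(snapshot_download_url "$id")" \
    -H "Authorization: Bearer $ACCESS_TOKEN" -o "$out"
}

# Usage (nothing runs until you call it):
#   export ACCESS_TOKEN=...   # value returned by the Login method
#   download_snapshot cywiki_namespace_0
```

Passing a second argument to `download_snapshot` overrides the output filename, which helps if the bundle turns out not to be a gzipped tarball.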

I've forwarded Welsh Structured Contents to @Jason.nlw via email. I'm adding the WeTransfer link here for others who would like to download it. This link expires in 3 days: https://we.tl/t-2zHySC6tmK

@JWuyts-WMF Brilliant. Thanks Jolan! I'll let you know if anyone decides to work with it.