Maniphest T332047

Script to fetch the latest HTML dump for a given wiki
Closed, Resolved (Public)

Assigned To

None

Authored By

	awight
	Mar 14 2023, 4:26 PM

Description

Write a scraper interface which takes as its input a wiki database name like "hawiki" and scrapes the dump pages at https://dumps.wikimedia.org/other/enterprise_html/ to find the most recent NS0 dump for that wiki. Output should be a URL to the dump tarball.
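
As a rough illustration of the lookup described above, here is a minimal Python sketch (the code under review is a separate implementation; this is not it). The `runs/<date>/<wiki>-NS0-<date>-ENTERPRISE-HTML.json.tar.gz` layout is taken from the URL in the error log further down this task. Everything else is an assumption: that the index pages are plain directory listings with `href="YYYYMMDD/"` links, and all function names, which are hypothetical. It walks runs from newest to oldest, so a run that does not (yet) contain the wiki is skipped.

```python
import re
from typing import Callable, List, Optional
from urllib.request import urlopen

# Assumed location of the per-run directories (matches the URL in the error log).
BASE_URL = "https://dumps.wikimedia.org/other/enterprise_html/runs/"


def list_runs(index_html: str) -> List[str]:
    """Return run directory names (YYYYMMDD) linked from a listing page, oldest first."""
    return sorted(set(re.findall(r'href="(\d{8})/?"', index_html)))


def find_dump(run_html: str, wiki: str) -> Optional[str]:
    """Return the NS0 tarball file name for `wiki` in one run listing, or None."""
    pattern = rf'href="({re.escape(wiki)}-NS0-\d{{8}}-ENTERPRISE-HTML\.json\.tar\.gz)"'
    match = re.search(pattern, run_html)
    return match.group(1) if match else None


def fetch_page(url: str) -> str:
    """Download a listing page as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def latest_dump_url(wiki: str, fetch: Callable[[str], str] = fetch_page) -> Optional[str]:
    """Return the URL of the newest NS0 tarball for `wiki`, or None if no run has one."""
    for run in reversed(list_runs(fetch(BASE_URL))):
        name = find_dump(fetch(f"{BASE_URL}{run}/"), wiki)
        if name:
            return f"{BASE_URL}{run}/{name}"
    return None
```

Calling `latest_dump_url("hawiki")` would hit the network. Passing a different `fetch` callable makes the lookup testable against canned listing pages.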

Write a script which simply downloads that tarball to the local drive. Include a flag for streaming only a sample (e.g. the first 10k lines) without downloading the entire file.
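
One way the sampling flag could work, sketched in Python under stated assumptions: the tarball holds newline-delimited JSON files (the error log below mentions `dewiki_0.ndjson`), and the archive can be read strictly front to back. The function names are hypothetical. Opening the archive in `tarfile`'s stream mode (`r|gz`) means no seeking is needed, so the input can be an HTTP response and reading can stop once enough lines have been copied.

```python
import itertools
import tarfile
from typing import BinaryIO, Iterator


def iter_lines(stream: BinaryIO) -> Iterator[bytes]:
    """Yield lines from every regular file in a gzipped tar stream, in archive order."""
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            extracted = archive.extractfile(member)
            if extracted is None:  # directories and other non-file members
                continue
            yield from extracted


def sample(stream: BinaryIO, out: BinaryIO, limit: int = 10_000) -> int:
    """Copy at most `limit` lines from the tar stream to `out`; return the line count."""
    count = 0
    for line in itertools.islice(iter_lines(stream), limit):
        out.write(line)
        count += 1
    return count
```

For a real run, `stream` could be the object returned by `urllib.request.urlopen(url)` and `out` a local file opened in binary write mode. Only the bytes needed for the sample are pulled from the server, plus whatever has already been buffered.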

Code to review: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/5

Related Objects

Status	Assigned	Task
Resolved	None	T345411 Scraper: destroy Cloud VPS runner instance
Resolved	None	T341751 Publish dump scraper reports
Resolved	None	T335411 Scraper: produce spreadsheet of scraped statistics for comparing wikis
Resolved	awight	T332032 Create baseline statistics for reference usage (2023)
Resolved	None	T332162 Run scraper on samples from several wikis
Resolved	None	T332047 Script to fetch the latest HTML dump for a given wiki

Event Timeline

awight created this task. Mar 14 2023, 4:26 PM

awight renamed this task from "Convenience to pull the HTML dump for a given wiki" to "Script to fetch the latest HTML dump for a given wiki". Mar 15 2023, 9:22 AM

awight added a parent task: T332162: Run scraper on samples from several wikis. Mar 15 2023, 12:40 PM

lilients_WMDE moved this task from Incoming to In progress on the WMDE-References-FocusArea board. Mar 15 2023, 2:15 PM

awight claimed this task. Mar 20 2023, 9:10 AM

awight moved this task from Sprint Backlog to Doing on the WMDE-TechWish-Sprint-2023-03-14 board.

I've mentioned the use case for a "latest" link, in T332544: Provide an updated "latest" link for HTML dumps.

awight added a subtask: T332562: Don't expose partial dumpfiles. Mar 20 2023, 12:03 PM

awight updated the task description. Mar 20 2023, 1:54 PM

The script can now fetch files, sample, and parse directly from a web stream.

I haven't implemented sampling while parsing from the web, and I'm not sure we need this use case. Any serious run on sampled lines should also preserve a record of the samples.

We should profile the streaming memory usage a bit, in a follow-up task. Here's the first minute of parsing enwiki directly from the web:

(This graph is produced by calling :observer.start() from parse_wiki.exs)

awight removed a subtask: T332562: Don't expose partial dumpfiles. Mar 24 2023, 1:35 PM

WMDE-Fisch moved this task from Tech Review to Done on the WMDE-TechWish-Sprint-2023-03-14 board. Mar 27 2023, 2:00 PM

There's a superficial glitch when sampling: the pipeline shuts down in an obnoxiously faily-looking way. Outputs seem to be unaffected, however, so this can be a low-priority follow-up.

Filed the issue upstream as https://github.com/akash-akya/exile/issues/15.

Error log:

Reading from https://dumps.wikimedia.org/other/enterprise_html/runs/20230320/dewiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz
Writing to dewiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson
write(): Bad file descriptor
** (EXIT from #PID<0.96.0>) an exception was raised:
    ** (MatchError) no match of right hand side value: {:error, 9}
        (exile 0.1.0) lib/exile/stream.ex:18: anonymous fn/4 in Collectable.Exile.Stream.Sink.into/1
        (elixir 1.14.3) lib/enum.ex:1519: anonymous fn/3 in Enum.reduce_into_protocol/3
        (elixir 1.14.3) lib/stream.ex:1799: anonymous fn/3 in Enumerable.Stream.reduce/3
        (elixir 1.14.3) lib/stream.ex:272: anonymous fn/4 in Stream.chunk_while_fun/2
        (elixir 1.14.3) lib/stream.ex:1651: Stream.do_element_resource/6
        (elixir 1.14.3) lib/stream.ex:1811: Enumerable.Stream.do_each/4
        (elixir 1.14.3) lib/enum.ex:1518: Enum.reduce_into_protocol/3
        (elixir 1.14.3) lib/enum.ex:1502: Enum.into_protocol/2

/usr/bin/tar: dewiki_0.ndjson: Wrote only 4096 of 10240 bytes

gzip: stdin: unexpected end of file
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now
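
The log above looks like what happens when a consumer stops reading while an external `tar` process is still writing. The following Python sketch shows the general pattern of taking the first N lines from a subprocess and then stopping it deliberately. It does not fix or diagnose the exile issue. The helper name is hypothetical, POSIX behaviour is assumed, and silencing stderr also hides genuine errors, so a real script would probably filter the output instead.

```python
import itertools
import subprocess
from typing import List, Sequence


def head_of_command(command: Sequence[str], limit: int) -> List[bytes]:
    """Return the first `limit` stdout lines of `command`, then stop it on purpose."""
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # assumption: shutdown complaints are not wanted
    )
    try:
        return list(itertools.islice(proc.stdout, limit))
    finally:
        proc.stdout.close()  # stop reading so the pipe is released
        proc.terminate()     # ask the producer to exit, so it doesn't fail on a closed pipe
        proc.wait()          # reap the process so it does not linger as a zombie
```

Because the producer is terminated on purpose, its non-zero exit status is expected here and is not treated as a failure.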

awight closed this task as Resolved. Mar 28 2023, 10:16 AM

awight moved this task from In progress to Done on the WMDE-References-FocusArea board. Oct 23 2024, 7:06 AM

	F36921513: image.png
	Mar 21 2023, 10:10 AM
