Script to fetch the latest HTML dump for a given wiki
Closed, Resolved · Public

Description

Write a scraper interface that takes a wiki database name such as "hawiki" as input and scrapes the dump pages under https://dumps.wikimedia.org/other/enterprise_html/ to find the most recent NS0 dump for that wiki. The output should be a URL pointing to the dump tarball.
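
A rough sketch of the lookup described above, in Python purely for illustration (the actual project is not written in Python). The file name pattern is taken from the dump URL quoted in the error log on this task. Everything else is an assumption: that the `runs/` index links to run directories as `href="YYYYMMDD/"`, and the hypothetical names `latest_dump_url`, `find_dump`, and the `fetch` callable (URL in, HTML text out).

```python
import re
from typing import Callable, Optional

# Base of the run directories, as seen in the dump URL quoted on this task.
BASE_URL = "https://dumps.wikimedia.org/other/enterprise_html/runs/"


def find_dump(run_listing_html: str, wiki: str, run: str) -> Optional[str]:
    """Return the NS0 dump URL for `wiki` if the run listing links to it."""
    name = f"{wiki}-NS0-{run}-ENTERPRISE-HTML.json.tar.gz"
    hrefs = re.findall(r'href="([^"]+)"', run_listing_html)
    # Compare on the last path segment so relative and absolute links both match.
    if any(href.rsplit("/", 1)[-1] == name for href in hrefs):
        return f"{BASE_URL}{run}/{name}"
    return None


def latest_dump_url(wiki: str, fetch: Callable[[str], str]) -> Optional[str]:
    """Walk the runs newest-first and return the first NS0 dump found for `wiki`.

    Assumes (unverified) that the index links to runs as href="YYYYMMDD/".
    """
    runs = set(re.findall(r'href="(\d{8})/?"', fetch(BASE_URL)))
    # YYYYMMDD strings sort chronologically, so reverse order is newest-first.
    for run in sorted(runs, reverse=True):
        url = find_dump(fetch(f"{BASE_URL}{run}/"), wiki, run)
        if url is not None:
            return url
    return None
```

In a real run, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read().decode()`. Passing it in as a parameter keeps the scraping logic testable without network access. Walking the runs newest-first also covers the case where the latest run does not (yet) contain the requested wiki.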

Write a script that downloads that tarball to the local drive. Include a flag for streaming only a sample (e.g. the first 10k lines) without downloading the entire file.
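
One way the sampling flag could work, sketched in Python for illustration only. It assumes the tarball is a gzipped tar whose members are newline-delimited JSON files (the log on this task mentions a member named `dewiki_0.ndjson`). `sample_lines` is a hypothetical name, not something from the merge request.

```python
import tarfile
from typing import BinaryIO, Iterator


def sample_lines(stream: BinaryIO, limit: int = 10_000) -> Iterator[bytes]:
    """Yield at most `limit` lines from the files inside a gzipped tar stream."""
    remaining = limit
    # Mode "r|gz" reads strictly forward, so `stream` only needs read();
    # it does not have to be seekable (an HTTP response body would do).
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if remaining <= 0:
                return
            if not member.isfile():
                continue  # skip directories and other non-file entries
            extracted = archive.extractfile(member)
            for line in extracted:
                yield line
                remaining -= 1
                if remaining <= 0:
                    # Stop without reading the rest of the archive.
                    return
```

With the standard library, the stream could come from `urllib.request.urlopen(url)`, so only as much of the tarball is transferred as is needed to produce the sample (plus whatever has been buffered along the way).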

Code to review: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/5

Event Timeline

awight renamed this task from "Convenience to pull the HTML dump for a given wiki" to "Script to fetch the latest HTML dump for a given wiki". Mar 15 2023, 9:22 AM
awight moved this task from Doing to Tech Review on the WMDE-TechWish-Sprint-2023-03-14 board.

The script can now fetch files, sample them, and parse directly from a web stream.

I haven't implemented sampling while parsing from the web; I'm not sure we need this use case. Any serious run on sampled lines should also preserve a record of the samples.
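
For illustration, a small Python sketch of keeping such a record. The output name follows the pattern visible in the log on this task (`dewiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson`). The function names `sample_filename` and `save_sample` are hypothetical.

```python
import os
import posixpath
from typing import Iterable
from urllib.parse import urlparse

DUMP_SUFFIX = ".json.tar.gz"


def sample_filename(dump_url: str, sample_size: int) -> str:
    """Derive the sample file name from the dump URL and the sample size."""
    name = posixpath.basename(urlparse(dump_url).path)
    if name.endswith(DUMP_SUFFIX):
        name = name[: -len(DUMP_SUFFIX)]
    return f"{name}-sample{sample_size}.ndjson"


def save_sample(lines: Iterable[bytes], dump_url: str, sample_size: int,
                directory: str = ".") -> str:
    """Write the sampled lines to disk and return the path of the record."""
    path = os.path.join(directory, sample_filename(dump_url, sample_size))
    with open(path, "wb") as out:
        out.writelines(lines)
    return path
```

Encoding the dump date and the sample size in the file name makes it possible to tell later which dump a given result was computed from.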

We should profile the streaming memory usage a bit, in a follow-up task. Here's the first minute of parsing enwiki directly from the web:

[Image: image.png (618×870 px, 187 KB)]

(This graph is produced by calling :observer.start() from parse_wiki.exs)

There's a superficial glitch when sampling: the pipeline shuts down in a way that looks like a failure. The outputs seem to be unaffected, however, so this can be a low-priority follow-up.

Filed the issue upstream as https://github.com/akash-akya/exile/issues/15.

Error log:

Reading from https://dumps.wikimedia.org/other/enterprise_html/runs/20230320/dewiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz
Writing to dewiki-NS0-20230320-ENTERPRISE-HTML-sample10000.ndjson
write(): Bad file descriptor
** (EXIT from #PID<0.96.0>) an exception was raised:
    ** (MatchError) no match of right hand side value: {:error, 9}
        (exile 0.1.0) lib/exile/stream.ex:18: anonymous fn/4 in Collectable.Exile.Stream.Sink.into/1
        (elixir 1.14.3) lib/enum.ex:1519: anonymous fn/3 in Enum.reduce_into_protocol/3
        (elixir 1.14.3) lib/stream.ex:1799: anonymous fn/3 in Enumerable.Stream.reduce/3
        (elixir 1.14.3) lib/stream.ex:272: anonymous fn/4 in Stream.chunk_while_fun/2
        (elixir 1.14.3) lib/stream.ex:1651: Stream.do_element_resource/6
        (elixir 1.14.3) lib/stream.ex:1811: Enumerable.Stream.do_each/4
        (elixir 1.14.3) lib/enum.ex:1518: Enum.reduce_into_protocol/3
        (elixir 1.14.3) lib/enum.ex:1502: Enum.into_protocol/2

/usr/bin/tar: dewiki_0.ndjson: Wrote only 4096 of 10240 bytes

gzip: stdin: unexpected end of file
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now
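
The log above looks consistent with the consumer stopping after the sample while data is still being pushed into the external process, though that is a guess and not a diagnosis of the Exile issue. As a Python analogy only (not a reproduction or a fix), here is a sketch of a pipeline that takes a fixed number of lines from a subprocess and treats the resulting pipe error in the feeder as a normal end of work. `take_lines` and the pass-through child command are hypothetical stand-ins for the real `tar`/`gzip` stage.

```python
import subprocess
import sys
import threading

# Stand-in for the external tar/gzip stage: a child that copies stdin to stdout.
PASS_THROUGH = [
    sys.executable, "-c",
    "import shutil, sys; shutil.copyfileobj(sys.stdin.buffer, sys.stdout.buffer)",
]


def take_lines(chunks, limit, command=PASS_THROUGH):
    """Feed `chunks` to a subprocess and return the first `limit` output lines."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        try:
            for chunk in chunks:
                proc.stdin.write(chunk)
        except (OSError, ValueError):
            pass  # the pipe went away because the reader stopped early
        finally:
            try:
                proc.stdin.close()
            except OSError:
                pass  # flushing leftover buffered data can fail the same way

    feeder = threading.Thread(target=feed, daemon=True)
    feeder.start()

    lines = []
    for line in proc.stdout:
        lines.append(line)
        if len(lines) >= limit:
            break

    # Stop the child deliberately instead of letting it trip over a closed pipe.
    proc.terminate()
    proc.wait()
    feeder.join()
    proc.stdout.close()
    return lines
```

The design choice here is to make early termination an expected path: the reader decides when the sample is complete, and the feeder treats a broken pipe as the signal to stop rather than as an error to report.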