Investigate possible corruption, or bad string handling
Closed, Resolved; Public

Description

Multibyte scripts pose particular challenges, and a quick pilot run produced string-handling errors on several wikis. For example, on kbpwiki:

13:42:59.177 [error] GenServer #PID<0.870.0> terminating
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not a bitstring

    :erlang.byte_size(nil)
    (scrape_wiki_dump 0.1.0) lib/html_page_parser.ex:21: Wiki.HtmlPageParser.parse_line/1
    (flow 1.2.4) lib/flow/materialize.ex:761: anonymous fn/4 in Flow.Materialize.mapper/2
    (flow 1.2.4) lib/flow/materialize.ex:729: Flow.Materialize."-mapper_ops/1-lists^foldl/2-1-"/3
    (flow 1.2.4) lib/flow/materialize.ex:729: anonymous fn/5 in Flow.Materialize.mapper_ops/1
    (flow 1.2.4) lib/flow/map_reducer.ex:67: Flow.MapReducer.handle_events/3
    (gen_stage 1.2.1) lib/gen_stage.ex:2578: GenStage.consumer_dispatch/6
    (gen_stage 1.2.1) lib/gen_stage.ex:2767: GenStage.take_pc_events/3
Last message: {:"$gen_consumer", {#PID<0.648.0>, #Reference<0.3385915574.4211343365.114942>}, ["{\"name\":\"1913\",\"identifier\":2284,\"abstract\":\"Pɩnaɣ 1913 kɛ pɩnaɣ 1913 ñɩŋga kɛ palʋlʋʋ Yeesu Krɩstʋ yɔ. Pɩnzɩ mɩnɩŋ nɛɛlɛ taa lɛ, pɩnaɣ 1913 kɛŋna hiu nɛ naadozo ñɩŋga. Kɛwɛ pɩnzɩ 1912 nɛ 1914 hɛkʋ taa.\",\"date_modified\":\"2018-01-20T10:06:21Z\",\"version\":{\"identifier\":10665},\"url\":\"https://kbp.wikipedia.org/wiki/1913\",\"namespace\":{\"identifier\":0},\"in_language\":{\"identifier\":\"kbp\"},\"main_entity\":{\"identifier\":\"Q2080\",\"url\":\"https://www.wikidata.org/entity/Q2080\"},\"additional_entities\":[{\"identifier\":\"Q2080\",\"url\":\"https://www.wikidata.org/entity/Q2080\",\"aspects\":[\"S\"]}],\"categories\":[{\"name\":\"Catégorie:Pɩnzɩ\",\"url\":\"https://kbp.wikipedia.org/wiki/Catégorie:Pɩnzɩ\"},{\"name\":\"Catégorie:Pɩnzɩ mɩnɩŋ nɛɛlɛ\",\"url\":\"https://kbp.wikipedia.org/wiki/Catégorie:Pɩnzɩ_mɩnɩŋ_nɛɛlɛ\"}],\"is_part_of\":{\"identifier\":\"kbpwiki\",\"url\":\"https://kbp.wikipedia.org\"},\"article_body\":{\"html\":\"\\u003c!DOCTYPE html\\u003e\\n\\u003chtml prefix=\\\"dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/\\\" about=\\\"https://kbp.wikipedia.org/wiki/Special:Redirect/revision/10665\\\"\\u003e\\u003chead prefix=\\\"mwr: https://kbp.wikipedia.org/wiki/Special:Redirect/\\\"\\u003e\\u003cmeta property=\\\"mw:TimeUuid\\\

Also, tee error logging to a file so that errors can be reviewed more easily.
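
One possible way to get error output into a file alongside the console is to attach a second handler to Erlang's :logger. This is only a sketch, not what the project does: the module name ErrorFileLog, the handler id :error_file and the file name errors.log are all hypothetical, and it assumes an OTP release whose :logger_std_h accepts the file option.

```elixir
defmodule ErrorFileLog do
  # Hypothetical helper for illustration; not part of scrape_wiki_dump.

  # Attaches an extra :logger handler that writes error-level (and more
  # severe) events to `path`. The default console handler stays in place,
  # so errors show up in both places.
  def attach(path \\ "errors.log") do
    :logger.add_handler(:error_file, :logger_std_h, %{
      level: :error,
      # :logger_std_h expects the file name as a charlist.
      config: %{file: String.to_charlist(path)}
    })
  end
end
```

Calling ErrorFileLog.attach() once at startup would be enough. Entries written this way use the Erlang default formatter unless a formatter is configured for the handler, so they may look different from the console output.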

Event Timeline

This was too trivial to merit a task: blanked redirect pages sometimes show up without an .article_body.wikitext value, so the parser ended up calling byte_size on nil. Handled in merge request https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/64
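
For illustration only, a guard against the missing value could look like the sketch below. It is not the code from the merge request: the module WikitextGuard and the function wikitext_size/1 are hypothetical, and it assumes each dump line has already been decoded from JSON into a map with string keys.

```elixir
defmodule WikitextGuard do
  # Hypothetical helper for illustration; the actual fix is in the linked
  # merge request and may look different.

  # Matches only when .article_body.wikitext is present and is a binary,
  # so byte_size/1 never receives nil.
  def wikitext_size(%{"article_body" => %{"wikitext" => wikitext}})
      when is_binary(wikitext) do
    {:ok, byte_size(wikitext)}
  end

  # Covers a missing article_body, a missing wikitext key, and an explicit nil.
  def wikitext_size(_page), do: {:skip, :missing_wikitext}
end
```

With this shape, a page whose article_body holds only html, like the kbpwiki record in the trace, falls through to the second clause and can be logged and skipped instead of terminating the GenServer.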

awight claimed this task.