Page MenuHomePhabricator

Revision_text field of mediawiki_wikitext_current is Not properly mapped
Closed, ResolvedPublicBUG REPORT

Description

The hive table mediawiki_wikitext_current seems to have an issue in mapping the Revision_text record field. The field breaks up after the first newline char, it seems to generate multiple NULL records.

Steps to Reproduce:
Run the following query in Hive:

select * from mediawiki_wikitext_current m
where m.snapshot = '2020-04'
and m.wiki_db = "enwiki"
and m.page_id = 5413726
and m.page_namespace = 0

Actual Results:

page_id	page_namespace	page_title	page_redirect_title	page_restrictions	user_id	user_text	revision_id	revision_parent_id	revision_timestamp	revision_minor_edit	revision_comment	revision_text_bytes	revision_text_sha1	revision_text	revision_content_model	revision_content_format	snapshot	wiki_db
page_id	page_namespace	page_title	page_redirect_title	page_restrictions	user_id	user_text	revision_id	revision_parent_id	revision_timestamp	revision_minor_edit	revision_comment	revision_text_bytes	revision_text_sha1	revision_text	revision_content_model	revision_content_format	snapshot	wiki_db
5413726	0	Lenny Randle		[]	-1	2600:4040:111A:5600:A4FA:598D:51F9:D607	950443035	947663655	2020-04-12T03:27:43Z	false	/* 1977 punching incident */better word choice	11669	pgldgd9gd9f2b4ro5h7917gd0brok9u	{{Infobox baseball biography	NULL	NULL	NULL	NULL
NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL

Expected Results:
Only 1 full record is expected

Event Timeline

Hi @DED - The problem is with showing the data only, not decoding rows.
For instance:

select
  page_id,
  page_namespace,
  page_title,
  page_redirect_title,
  page_restrictions,
  user_id,
  user_text,
  revision_id,
  revision_parent_id,
  revision_timestamp,
  revision_minor_edit,
  revision_comment,
  revision_text_bytes,
  revision_text_sha1,
  revision_content_model,
  revision_content_format,
  snapshot,
  wiki_db
from wmf.mediawiki_wikitext_current m
where m.snapshot = '2020-04'
  and m.wiki_db = 'enwiki'
  and m.page_id = 5413726;

shows values for all fields, and

select
  revision_text
from wmf.mediawiki_wikitext_current m
where m.snapshot = '2020-04'
  and m.wiki_db = 'enwiki'
  and m.page_id = 5413726;

also shows the expected text.

Finally, I suggest you use Spark for querying the data :)

Marking the task invalid.

DED changed the task status from Invalid to Resolved.May 25 2020, 8:40 AM

To run the query on Hive if some fields contain a newline char:

set hive.query.result.fileformat = SequenceFile;

reference: https://www.phdata.io/hive-corruption-due-to-newlines-and-carriage-returns/