The hive table mediawiki_wikitext_current seems to have an issue in mapping the Revision_text record field. The field breaks up after the first newline char, it seems to generate multiple NULL records.
Steps to Reproduce:
Run the following query in Hive:
select * from mediawiki_wikitext_current m where m.snapshot = '2020-04' and m.wiki_db = "enwiki" and m.page_id = 5413726 and m.page_namespace = 0
Actual Results:
page_id page_namespace page_title page_redirect_title page_restrictions user_id user_text revision_id revision_parent_id revision_timestamp revision_minor_edit revision_comment revision_text_bytes revision_text_sha1 revision_text revision_content_model revision_content_format snapshot wiki_db page_id page_namespace page_title page_redirect_title page_restrictions user_id user_text revision_id revision_parent_id revision_timestamp revision_minor_edit revision_comment revision_text_bytes revision_text_sha1 revision_text revision_content_model revision_content_format snapshot wiki_db 5413726 0 Lenny Randle [] -1 2600:4040:111A:5600:A4FA:598D:51F9:D607 950443035 947663655 2020-04-12T03:27:43Z false /* 1977 punching incident */better word choice 11669 pgldgd9gd9f2b4ro5h7917gd0brok9u {{Infobox baseball biography NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
Expected Results:
Only 1 full record is expected