
SLF4J errors when querying mediawiki_wikitext_history
Closed, ResolvedPublic

Description

A query against mediawiki_wikitext_history is throwing errors. Would it be possible to get this fixed?

hive (wmf)> select * from mediawiki_wikitext_history where snapshot = '2019-03' and wiki_db = 'enwiki' limit 1;
OK
page_id	page_namespace	page_title	page_redirect_title	page_restrictions	user_id	user_text	revision_id	revision_parent_id	revision_timestamp	revision_minor_edit	revision_comment	revision_text_bytes	revision_text_sha1	revision_text	revision_content_model	revision_content_format	snapshot	wiki_db
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-pig-bundle-1.5.0-cdh5.16.1.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-hadoop-bundle-1.5.0-cdh5.16.1.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-format-2.1.0-cdh5.16.1.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-exec-1.1.0-cdh5.16.1.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-jdbc-1.1.0-cdh5.16.1-standalone.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [shaded.parquet.org.slf4j.helpers.NOPLoggerFactory]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
	at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
	at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:212)
	at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:118)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:674)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:324)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:446)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:415)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:140)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2071)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:246)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:175)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:389)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:634)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:141)

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 27 2019, 7:01 PM

Explanation: by selecting all columns with LIMIT 1, you're not actually running a MapReduce job; the client reads the file locally to print all columns. The files storing the text compress the data a lot, so uncompressing it leads to memory issues in the client.
Bumping the Hive client memory before running your query solved the issue for me:

export HADOOP_HEAPSIZE=2048 && hive
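
In case it's useful, the same bump can be applied to a single invocation instead of exporting it for the whole shell session (a sketch using the Hive CLI's -e option; the heap value and query are the ones from this task):

HADOOP_HEAPSIZE=2048 hive -e "select * from mediawiki_wikitext_history where snapshot = '2019-03' and wiki_db = 'enwiki' limit 1;"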

Something else to keep in mind: querying text is very expensive in terms of resources. For instance, enwiki represents 8.3 TB, and querying over it means reading all of it.
You can ping me on the analytics IRC channel to discuss your use case if you want :)
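
To get a rough idea of what a query will touch before running it, the standard Hive commands below can help (a sketch, assuming the table is partitioned by snapshot and wiki_db as the query in the Description suggests; output details vary by Hive version):

-- list the (snapshot, wiki_db) partitions the table exposes
show partitions mediawiki_wikitext_history;

-- show the execution plan without actually running the query
explain select page_id from mediawiki_wikitext_history where snapshot = '2019-03' and wiki_db = 'enwiki' limit 1;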

JAllemandou closed this task as Resolved. · Aug 29 2019, 2:05 PM
Nuria added a subscriber: Nuria. · Aug 29 2019, 2:12 PM

@dr0ptp4kt Reducing the query size (fewer columns, a more explicit WHERE clause) will help as well. Closing.
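
For illustration, a sketch of a narrower query along those lines; the column list comes from the result header in the Description, and the page_title filter is only a hypothetical example of a more explicit WHERE clause:

select page_id, revision_id, revision_timestamp, revision_text_bytes
from mediawiki_wikitext_history
where snapshot = '2019-03'
  and wiki_db = 'enwiki'
  and page_title = 'Example'  -- hypothetical filter; adjust to the pages you actually need
limit 10;

Since the table is stored as Parquet (a columnar format, as the stack trace above shows), leaving revision_text out of the select list should avoid reading the heavy text column at all.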

Thanks! Okay, expanding the heap size helped here. Thank you for the offer of assistance as well!