
[Airflow] extract_wikibase_item fails for ores_predictions_weekly since 11.01 run
Closed, Resolved · Public · 5 Estimated Story Points

Description

The extract_wikibase_item task failed during the last run of ores_predictions_weekly. The following was reported on the worker:

# Pastebin wy4LQBxW
Container: container_e25_1601916545561_148082_01_000001 on analytics1064.eqiad.wmnet_8041
===========================================================================================
LogType:stderr
Log Upload Time:Sun Nov 08 00:00:38 +0000 2020
LogLength:4524
Log Contents:
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/var/lib/hadoop/data/j/yarn/local/filecache/44379/spark-2.4.4-assembly.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/08 00:00:35 INFO SignalUtils: Registered signal handler for TERM
20/11/08 00:00:35 INFO SignalUtils: Registered signal handler for HUP
20/11/08 00:00:35 INFO SignalUtils: Registered signal handler for INT
20/11/08 00:00:35 INFO SecurityManager: Changing view acls to: analytics-search
20/11/08 00:00:35 INFO SecurityManager: Changing modify acls to: analytics-search
20/11/08 00:00:35 INFO SecurityManager: Changing view acls groups to: 
20/11/08 00:00:35 INFO SecurityManager: Changing modify acls groups to: 
20/11/08 00:00:35 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(analytics-search); groups with view permissions: Set(); users  with modify permissions: Set(analytics-search); groups with modify permissions: Set()
20/11/08 00:00:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/08 00:00:35 INFO ApplicationMaster: Preparing Local resources
20/11/08 00:00:36 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
20/11/08 00:00:37 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1601916545561_148082_000001
20/11/08 00:00:37 INFO ApplicationMaster: Starting the user application in a separate Thread
20/11/08 00:00:37 INFO ApplicationMaster: Waiting for spark context initialization...
20/11/08 00:00:37 ERROR ApplicationMaster: User application exited with status 1
20/11/08 00:00:37 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User application exited with status 1)
20/11/08 00:00:37 ERROR ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
	at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
	at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 1
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:106)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
20/11/08 00:00:37 INFO ApplicationMaster: Deleting staging directory hdfs://analytics-hadoop/user/analytics-search/.sparkStaging/application_1601916545561_148082
20/11/08 00:00:37 INFO ShutdownHookManager: Shutdown hook called

LogType:stdout
Log Upload Time:Sun Nov 08 00:00:38 +0000 2020
LogLength:351
Log Contents:
venv/bin/python3.7: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by venv/bin/python3.7)
venv/bin/python3.7: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.25' not found (required by venv/bin/python3.7)
venv/bin/python3.7: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.26' not found (required by venv/bin/python3.7)

LogType:container-localizer-syslog
Log Upload Time:Sun Nov 08 00:00:38 +0000 2020
LogLength:1028
Log Contents:
2020-11-08 00:00:33,442 WARN [ContainerLocalizer Downloader] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:analytics-search (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2020-11-08 00:00:33,444 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2020-11-08 00:00:33,445 WARN [ContainerLocalizer Downloader] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:analytics-search (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

Event Timeline

Gehel set the point value for this task to 5. (Nov 9 2020, 4:51 PM)

The critical part of the log is:

venv/bin/python3.7: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.25' not found (required by venv/bin/python3.7)

venv/bin/python3.7 gives a hint that we aren't running the system python here; we are using a deployed virtualenv. Looking up the relevant task definition, we find that mw_sql_to_hive indeed has a custom virtualenv.

I've seen this specific error before, when we first deployed airflow to debian 10. What's happening is that shipping a virtualenv ships the actual python executable in addition to any compiled binaries from individual dependencies. This environment was built on stat1007, which runs debian 10, and it is not compatible with the hadoop worker nodes, which run debian 9.
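
A quick way to confirm the mismatch is to compare the glibc available on a worker against the versions the shipped binary asks for. A minimal sketch, assuming it is run with the system python on a debian 9 worker (stretch ships glibc 2.24, older than the 2.25/2.26/2.28 requested in the stdout log above):

```python
import ctypes

# Ask the loaded libc for its version. On a debian 9 (stretch) worker this
# prints 2.24, while venv/bin/python3.7 was built on debian 10 against
# symbols that only exist in glibc >= 2.25/2.26/2.28.
libc = ctypes.CDLL("libc.so.6")
libc.gnu_get_libc_version.restype = ctypes.c_char_p
print(libc.gnu_get_libc_version().decode())
```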

Additional Context:

  • Virtualenvs are built by scap/scripts/build_deployment_virtualenvs.sh. The only scap-specific thing about this script is that it requires SCAP_REV_PATH to be set to the base of the repository checkout (see the sketch after this list).
  • Hadoop worker nodes are all on debian 9, but hadoop client nodes are on debian 10.
  • These environments rarely change; typically a new environment only comes about as a result of a new script with specific external needs.
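
For reference, a hedged sketch of invoking the build script outside of scap; only the script path and the SCAP_REV_PATH requirement come from the repository, the checkout location below is hypothetical:

```python
import os
import subprocess

# Hypothetical location of a wikimedia/discovery/analytics checkout.
rev_path = "/srv/deployment/wikimedia/discovery/analytics"

# The script's only scap-specific dependency is this environment variable,
# which must point at the base of the repository checkout.
env = dict(os.environ, SCAP_REV_PATH=rev_path)
subprocess.run(
    [os.path.join(rev_path, "scap/scripts/build_deployment_virtualenvs.sh")],
    env=env,
    check=True,
)
```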

Solutions for unblocking today:

  • The virtualenvs are sourced from an hdfs deployment of the repository, and the path to that deployment is held in an airflow variable. Today this variable points at 'current', which is whatever the latest deployment is. We could instead point it at a specific deployed directory, so that airflow doesn't pick up a new environment just because the repo was deployed to hdfs (see the sketch after this list).
    • Change wikimedia_discovery_analytics_hdfs_path in airflow/config/wmf_conf.json from hdfs://analytics-hadoop/wmf/discovery/current to hdfs://analytics-hadoop/wmf/discovery/2020-10-02T19.25.00+00.00-scap_sync_2020-10-02_0005-dirty and deploy.
    • The variable change will apply to the next attempted run of the task.
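
As a sketch of the effect, assuming the DAGs resolve the deploy path through airflow's Variable API (the archive path at the end is illustrative, not the repository's actual layout):

```python
from airflow.models import Variable

# While the variable points at .../current, every hdfs deploy silently swaps
# the environments underneath scheduled DAGs. Pinning it to a dated deploy
# directory freezes the venvs the spark tasks ship until we choose to update.
hdfs_path = Variable.get(
    "wikimedia_discovery_analytics_hdfs_path",
    default_var="hdfs://analytics-hadoop/wmf/discovery/"
    "2020-10-02T19.25.00+00.00-scap_sync_2020-10-02_0005-dirty",
)

# Illustrative only: how a task definition might reference a shipped venv.
venv_archive = hdfs_path + "/environments/mw_sql_to_hive/venv.zip"
```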

Available longer-term solutions:

  • We could write a skein spec (yaml) that ships the repository to a hadoop worker node, runs the build script, then copies the results into hdfs (see the sketch after this list).
  • The docker-registry.wikimedia.org/releng/tox-pyspark:0.6.0 docker image, used in CI for running our test suite, is on debian 9 and can be used to build the virtualenvs.
    • The virtualenvs could be built on the developer's machine and uploaded to archiva in the same way we upload python wheels. Scap deploy would stop building venvs and simply fetch them via git-fat. But this doesn't feel like a rigorous way to go about things.
    • Alternatively, CI could have a triggerable task that builds the venvs and pushes them to archiva; this would require CI to create and merge a commit updating the git-fat hashes.
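
For the skein option, a rough sketch using skein's Python client rather than a raw yaml spec; the resource sizes, localized file names, and hdfs destination are all placeholders:

```python
import skein

# Ship an archive of the repository to a (debian 9) worker container, build
# the venvs there, and copy the results back into hdfs. All names below are
# placeholders for illustration.
spec = skein.ApplicationSpec(
    name="build-discovery-venvs",
    master=skein.Master(
        resources=skein.Resources(memory="4 GiB", vcores=2),
        # Archives listed in files are localized and unpacked in the container.
        files={"analytics": "discovery-analytics.tar.gz"},
        script="""
            export SCAP_REV_PATH=$PWD/analytics
            analytics/scap/scripts/build_deployment_virtualenvs.sh
            hdfs dfs -put -f analytics/environments hdfs:///wmf/discovery/built-on-worker
        """,
    ),
)

with skein.Client() as client:
    app_id = client.submit(spec)
    print("submitted", app_id)
```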

Reasons to go with the simple solution:

  • The last deployments of this repo to hdfs were 2019-12-19, 2020-03-10, 2020-10-02, and 2020-11-06, and some of those deployments were only for oozie and didn't change the python environments. Maybe in 6 months we will have debian 10 worker nodes, and based on past deployments we might ship 1 or 2 environment updates in that time.

Change 640349 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikimedia/discovery/analytics@master] Point to working venv

https://gerrit.wikimedia.org/r/640349

Change 640349 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Point to working venv

https://gerrit.wikimedia.org/r/640349