
mediawiki_history_snapshot_config_dag fails since the last change to the AQS config table
Closed, Resolved · Public · BUG REPORT

Description

mediawiki_history_snapshot_config_dag is failing while trying to insert the tuple:

24/04/02 10:19:58 ERROR AppendDataExec: Data source write support CassandraBulkWrite(org.apache.spark.sql.SparkSession@3fae357f,com.datastax.spark.connector.cql.CassandraConnector@30d3f583,TableDef(aqs,config,ArrayBuffer(ColumnDef(param,PartitionKeyColumn,VarCharType)),ArrayBuffer(),Stream(ColumnDef(value,RegularColumn,VarCharType)),Stream(),false,false,Map()),WriteConf(RowsInBatch(1024),1000,Partition,LOCAL_QUORUM,false,false,5,None,TTLOption(DefaultValue),TimestampOption(DefaultValue),true,None),StructType(StructField(param,StringType,false), StructField(value,StringType,false)),org.apache.spark.SparkConf@48064a5a) aborted.
24/04/02 10:19:58 ERROR SparkSQLDriver: Failed in [
INSERT INTO ${aqs_config_table}
SELECT
    '${property_name}' AS param,
    '${property_value}' AS value
]

To see the full log:
sudo -u analytics yarn logs -appOwner analytics -applicationId application_1707226456123_385652
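
For reference, the TableDef in the error above describes a simple key/value table. Reconstructed as CQL (a sketch inferred from that log line, not the deployed DDL):

-- Sketch only, reconstructed from TableDef(aqs,config,...) in the log above;
-- param is the partition key, value is a regular column, both varchar/text.
CREATE TABLE aqs.config (
    param text PRIMARY KEY,
    value text
);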

It seems this is the first time this DAG has run since the change that made the AQS config table store the recently created mediawiki_history_snapshot.

Details

Title: Make driver and executor memory configurable
Reference: repos/data-engineering/airflow-dags!672
Author: milimetric
Source Branch: debug-mediawiki-history-snapshot-config
Dest Branch: main

Event Timeline

@VirginiaPoundstone We think this bug is not urgent because it's related to a change we haven't deployed yet (the one that automates mediawiki_history_snapshot for the edit and editor analytics services).

Milimetric subscribed.

I am not sure this is 100% squashed because the behavior is so weird. Here's what I found, in short:

The spark-submit command (pasted below for reference) was failing when issued by Airflow via skein. However, when copy-pasted and run exactly as-is from an-launcher1002, it succeeded. I tried different configurations to see if I could get it to fail, but it always worked.

On a whim, I made driver and executor memory configurable (the MR mentioned above). I set both to 2G while hard-coding cores to 1. Once deployed, the new spark-submit command ran fine even through Airflow/skein. I'm going to leave those configurable for now so we can debug more if the job fails again. It would be interesting to binary-search for the exact point where it fails someday, but we have other priorities.

spark3-submit \
    --driver-cores 1 \
    --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark3 \
    --conf spark.executorEnv.SPARK_CONF_DIR=/etc/spark3/conf \
    --master yarn \
    --conf spark.sql.catalog.aqs=com.datastax.spark.connector.datasource.CassandraCatalog \
    --conf spark.sql.catalog.aqs.spark.cassandra.connection.host=aqs1010-a.eqiad.wmnet:9042,aqs1011-a.eqiad.wmnet:9042,aqs1012-a.eqiad.wmnet:9042 \
    --conf spark.sql.catalog.aqs.spark.cassandra.auth.username=aqsloader \
    --conf spark.sql.catalog.aqs.spark.cassandra.auth.password=cassandra \
    --conf spark.sql.catalog.aqs.spark.cassandra.output.batch.size.rows=1024 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.maxExecutors=16 \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.yarn.maxAppAttempts=1 \
    --conf spark.yarn.appMasterEnv.SPARK_CONF_DIR=/etc/spark3/conf \
    --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark3 \
    --jars hdfs:///wmf/cache/artifacts/airflow/analytics/spark-cassandra-connector-assembly-3.2.0-WMF-1.jar,hdfs:///wmf/cache/artifacts/airflow/analytics/refinery-job-0.2.17-shaded.jar \
    --executor-cores 2 \
    --executor-memory 4G \
    --driver-memory 1G \
    --keytab analytics.keytab \
    --principal analytics/an-launcher1002.eqiad.wmnet@WIKIMEDIA \
    --name mediawiki_history_shapshot_config__load_cassandra__20240201 \
    --class org.apache.spark.sql.hive.thriftserver.WMFSparkSQLCLIDriver \
    --queue production \
    --deploy-mode client \
    hdfs:///wmf/cache/artifacts/airflow/analytics/wmf-sparksqlclidriver-1.0.0.jar \
    -f hdfs://analytics-hadoop/wmf/refinery/current/hql/cassandra/load_cassandra_aqs_config.hql \
    -d property_name=mediawiki_history_reduced_druid_datasource \
    -d property_value=mediawiki_history_reduced_2024_02 \
    -d aqs_config_table=aqs.aqs.config
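
For reference, substituting the -d parameters above into the template from the error log, the statement being executed resolves to:

-- The failing INSERT with parameters filled in (derived mechanically from
-- the -d flags above; shown here only for readability).
INSERT INTO aqs.aqs.config
SELECT
    'mediawiki_history_reduced_druid_datasource' AS param,
    'mediawiki_history_reduced_2024_02' AS value;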