I was trying to run the following query on notebook1003 with "PySpark - YARN (large)" kernel. Here's all the code in my notebook:
wikidataParquetPath = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204' spark.read.parquet(wikidataParquetPath).createOrReplaceTempView('wikidata') articles_viewed = spark.sql(""" WITH toledo_articles AS ( select distinct r.page_id, w.database_code from wmf.webrequest r join canonical_data.wikis w on CONCAT(r.pageview_info.project, '.org') = w.domain_name where year = 2019 and month =5 and webrequest_source = 'text' and is_pageview and namespace_id=0 and x_analytics_map['translationengine'] = 'GT' and parse_url(referer, 'QUERY') like '%client=srp%' and (regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'id' or regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'id') ) select t.page_id as toledo_page_id, t.database_code, d1.item_id, d2.page_id as idwiki_page_id from toledo_articles t left join wikidata d1 on (t.page_id=d1.page_id and t.database_code=d1.wiki_db) left join wikidata d2 on (d1.item_id=d2.item_id and d2.wiki_db='idwiki' and d2.page_namespace=0) """) from pyspark.sql.functions import countDistinct counts = articles_viewed.agg(countDistinct('toledo_page_id', 'item_id', 'idwiki_page_id')) counts.show()
However, after the pyspark shell started around 50 minutes, it shutdown automatically: https://yarn.wikimedia.org/cluster/app/application_1555511316215_187017
The link in diagnostics and the links to the logs doesn't work, so I don't know how to investigate the issue. @JAllemandou Can you help? Thanks!