
PySpark shell shut down automatically
Closed, Resolved · Public

Description

I was trying to run the following query on notebook1003 with the "PySpark - YARN (large)" kernel. Here's all the code in my notebook:

wikidataParquetPath = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204'
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView('wikidata')
articles_viewed = spark.sql("""

WITH toledo_articles AS (
    select distinct r.page_id, w.database_code
    from wmf.webrequest r join canonical_data.wikis w on CONCAT(r.pageview_info.project, '.org') = w.domain_name
    where
        year = 2019
        and month = 5
        and webrequest_source = 'text'
        and is_pageview
        and namespace_id=0
        and x_analytics_map['translationengine'] = 'GT'
        and parse_url(referer, 'QUERY') like '%client=srp%'
        and (regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'id'
        or regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'id')
)
select t.page_id as toledo_page_id, t.database_code, d1.item_id, d2.page_id as idwiki_page_id
from toledo_articles t left join wikidata d1 on (t.page_id=d1.page_id and t.database_code=d1.wiki_db)
left join wikidata d2 on (d1.item_id=d2.item_id and d2.wiki_db='idwiki' and d2.page_namespace=0)

""")

from pyspark.sql.functions import countDistinct
counts = articles_viewed.agg(countDistinct('toledo_page_id', 'item_id', 'idwiki_page_id'))
counts.show()

However, around 50 minutes after the PySpark shell started, it shut down automatically: https://yarn.wikimedia.org/cluster/app/application_1555511316215_187017
The link in the diagnostics and the links to the logs don't work, so I don't know how to investigate the issue. @JAllemandou Can you help? Thanks!

Event Timeline

@chelsyx, once a YARN application is done, you can view all of the logs via the CLI.

yarn logs -applicationId application_1555511316215_187017

We also have a wrapper script that tries to make those logs a little less verbose, especially when viewing Spark logs:

/srv/deployment/analytics/refinery/bin/yarn-logs application_1555511316215_187017

fdans moved this task from Incoming to Ops Week on the Analytics board.

I have reproduced the error. I think the problem comes from driver memory. I have been able to make the computation succeed for 1 day of data in a Python notebook, and for 1 month in the CLI with higher driver memory.

@Ottomata: Can we please bump the driver memory of the PySpark - YARN (large) kernel to 4G?

I have also rewritten the query, to be sure I was understanding it correctly:

wikidataParquetPath = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204'
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView('wikidata')

articles_viewed = spark.sql("""

WITH wikidata_idwiki_pages AS (
  SELECT DISTINCT
    item_id AS idwiki_item_id,
    page_id AS idwiki_page_id
  FROM wikidata
  WHERE wiki_db = 'idwiki'
    AND page_namespace = 0
),

wikidata_items_reduced AS (
  SELECT DISTINCT
    d1.wiki_db AS wiki_db,
    d1.item_id AS item_id,
    d1.page_id AS page_id,
    d2.idwiki_page_id AS idwiki_page_id
  FROM wikidata d1
    LEFT JOIN wikidata_idwiki_pages d2
      ON d1.item_id = d2.idwiki_item_id
),

toledo_base AS (
    SELECT DISTINCT
      r.page_id AS page_id,
      CONCAT(r.pageview_info.project, '.org') AS project
    FROM wmf.webrequest r
    WHERE webrequest_source = 'text'
        AND year = 2019
        AND month = 5
        AND is_pageview
        AND namespace_id = 0
        AND x_analytics_map['translationengine'] = 'GT'
        AND parse_url(referer, 'QUERY') like '%client=srp%'
        AND (regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) = 'id'
          OR regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) = 'id')
),

toledo_proj AS (
  SELECT
    tb.page_id,
    w.database_code AS wiki_db
  FROM toledo_base tb
    JOIN canonical_data.wikis w
      ON tb.project = w.domain_name
)

SELECT
  t.page_id AS toledo_page_id,
  t.wiki_db,
  d.item_id,
  d.idwiki_page_id AS idwiki_page_id
FROM toledo_proj t
  LEFT JOIN wikidata_items_reduced d
    ON t.page_id = d.page_id
      AND t.wiki_db = d.wiki_db
""")
articles_viewed.count()

Thanks @JAllemandou!
I've tried:

spark.conf.set('spark.driver.memory', '4g')

in the first cell of the notebook, but it doesn't seem to work?

Hi @chelsyx :)

The Spark driver is not launched from the notebook but by the kernel, and its configuration cannot be updated on the fly, so I'm not surprised it doesn't work.
The solution is to bump driver memory at the kernel level (see my ping to Andrew and Luca in the previous comment).

In the meantime you can generate the dataset in the spark-shell/pyspark CLI (where you can set driver memory at launch), write it as Parquet to a dedicated folder, then load and query it from notebooks.
Does that sound like an acceptable solution for now?
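
A minimal sketch of that workaround, assuming the CLI session was launched with a larger driver (e.g. pyspark --master yarn --driver-memory 4g; the exact launcher on the analytics hosts may differ) and that articles_viewed is the DataFrame built with the query above. The output path is hypothetical, pick one under your own HDFS user directory:

# Inside the CLI session: 'spark' is the SparkSession provided by the shell,
# and 'articles_viewed' is the DataFrame from the query above.
output_path = '/user/chelsyx/toledo_articles_viewed'  # hypothetical HDFS path
articles_viewed.write.mode('overwrite').parquet(output_path)

# Later, from a notebook kernel with the default (smaller) driver memory:
articles_viewed = spark.read.parquet(output_path)
articles_viewed.count()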

I will update driver memory for the included large kernel.

@chelsyx you can configure custom notebook kernels for yourself too. See: https://wikitech.wikimedia.org/wiki/SWAP#Custom_Spark_Kernels
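
For illustration, this is the general pattern a custom kernel relies on: the driver options have to be in place before the driver JVM is launched, which is why spark.conf.set cannot change them afterwards. A rough sketch from a plain Python kernel, assuming pyspark is importable there; the SWAP page above is the authoritative reference for the actual kernel setup:

import os

# Driver options must be set before the SparkSession (and its JVM) is created;
# they cannot be changed on a running session. The trailing 'pyspark-shell'
# token is required by PySpark's launcher.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master yarn --driver-memory 4g pyspark-shell'

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('toledo-analysis').getOrCreate()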

Change 516262 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/jupyterhub/deploy@master] Bump Spark driver memory to 4g for large Spark kernels

https://gerrit.wikimedia.org/r/516262

Change 516262 merged by Ottomata:
[analytics/jupyterhub/deploy@master] Bump Spark driver memory to 4g for large Spark kernels

https://gerrit.wikimedia.org/r/516262