Hello,
I would appreciate any tips and feedback you can provide.
It looks like both of these Spark queries could use some optimization. As of Friday, both are running slower than when I last tried them: the translation query ran smoothly a week ago, and the pageview query worked fine when I last ran it in November.
In the first query, I would like to know which articles from a list of articles were translated.
In the second query, I would like to know, for a list of articles, the sum of pageviews per article between Jan 11th and Feb 11th.
from pyspark.sql.types import ArrayType, StringType
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Which of the listed articles have a revision tagged "contenttranslation"?
att_r = '''
SELECT wiki_db AS wikicode,
       page_id,
       revision_tags
FROM wmf.mediawiki_history
WHERE snapshot = "{MWH_SNAPSHOT}"
  AND page_namespace = 0
  AND array_contains(revision_tags, "contenttranslation")
  AND wiki_db = 'pawiki'
  AND page_id IN {clean_pageids}
GROUP BY wiki_db, page_id, revision_tags
'''

# Pageviews per article from Jan 11 (inclusive) to Feb 11 (exclusive).
pv_since_contest_start_r = '''
SELECT page_id,
       SUM(view_count) AS views_1M
FROM wmf.pageview_hourly
WHERE year = 2020
  -- month and day are unpadded integers, so comparing CONCAT(month, day)
  -- against 0111/0211 gives wrong results (e.g. Feb 1 becomes "21");
  -- spell the window out per month instead:
  AND ((month = 1 AND day >= 11) OR (month = 2 AND day < 11))
  AND agent_type = 'user'
  AND country_code = 'IN'
  AND project = 'pa.wikipedia'
  AND page_id IN {clean_pageids}
-- Group by page_id only; adding view_count to the GROUP BY would split
-- each page into one row per distinct hourly count instead of summing.
GROUP BY page_id
'''

pv_since_contest_start = spark.sql(pv_since_contest_start_r.format(**quality_vars)).toPandas()
att = spark.sql(att_r.format(**quality_vars)).toPandas()
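As a side note on correctness rather than speed: the original `CONCAT(month, day) >= 0111` filter silently drops rows, because `month` and `day` in `pageview_hourly` are unpadded integers. Here is a small plain-Python sketch (no Spark needed) that mimics the two filters to show the difference; `concat_filter` and `range_filter` are hypothetical names for illustration:

```python
def concat_filter(month, day):
    # Mimics CONCAT(month, day) >= 0111 AND CONCAT(month, day) < 0211.
    # Spark casts the concatenated string back to a number for the
    # comparison, so Feb 1 -> "21" -> 21, which falls outside [111, 211).
    val = int(str(month) + str(day))
    return 111 <= val < 211

def range_filter(month, day):
    # Explicit per-month bounds: Jan 11 (inclusive) to Feb 11 (exclusive).
    return (month == 1 and day >= 11) or (month == 2 and day < 11)

# Feb 1 is inside the window, but the concat version drops it:
print(concat_filter(2, 1))   # False (wrongly excluded)
print(range_filter(2, 1))    # True
# Nov 1 -> "111" -> 111 would even pass the concat filter if the
# query ever covered a year without a month constraint.
```

The explicit range also lets Spark use the `month`/`day` partition columns directly, which should help the pageview query's speed as well.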