
Clean up issues with jobs after Hadoop Upgrade
Closed, ResolvedPublic

Description

Grouping issues here:

Event Timeline

Change 663065 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Clean up jobs after Hadoop upgrade

https://gerrit.wikimedia.org/r/663065

Change 663065 merged by Milimetric:
[analytics/refinery@master] Clean up jobs after Hadoop upgrade

https://gerrit.wikimedia.org/r/663065

Change 663096 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Clean up jobs after Hadoop Upgrade

https://gerrit.wikimedia.org/r/663096

mediacounts and mediarequest have the same problem now that the syntax has been worked out: some hours, but not all, fail because of the way UDFs return structs. I think the best idea is Joseph's, to run them as Spark SQL (or PySpark if that's easier). But in the long term we need to standardize these kinds of jobs and run them all the same way, with the same boilerplate and only the query changing.

For now this remains a TODO: when fixed, resume both coordinators from 2021-02-09T06:
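To make the failure mode concrete, here is a minimal sketch of the struct-returning-UDF pattern; the UDF name, class, and field are hypothetical placeholders, not the actual mediacounts/mediarequest code:

-- Hypothetical sketch only: a UDF returning a struct, dereferenced inline.
-- Under the upgraded Hive, some hours (but not all) fail on this pattern.
CREATE TEMPORARY FUNCTION parse_media_url AS 'org.wikimedia.analytics.refinery.hive.SomeStructUdf';  -- placeholder class name

SELECT
    parse_media_url(uri_path).classification AS classification,
    COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2021 AND month = 2 AND day = 9 AND hour = 6
GROUP BY parse_media_url(uri_path).classification;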

Change 663096 merged by Milimetric:
[analytics/refinery@master] Clean up jobs after Hadoop Upgrade

https://gerrit.wikimedia.org/r/663096

@Ottomata I noticed that eventlogging_legacy refine had some REFINE_FAILED flags for some events, together with a REFINED flag. Since there were a lot of hours to re-refine, I tried the following:

sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --table_whitelist_regex='CentralNoticeImpression|ContentTranslationError|ContentTranslationSuggestion|EchoMail|EditAttemptStep|EditorActivation|EventError|GuidedTourExited|GuidedTourGuiderHidden|InukaPageView|KaiOSAppFeedback|MobileWebSearch|MobileWebSectionUsage|MobileWebUIActionsTracking|MobileWikiAppABTest|MobileWikiAppIntents|MobileWikiAppLinkPreview|MobileWikiAppOnThisDay|MobileWikiAppRandomizer|MobileWikiAppSessions|MobileWikiAppSuggestedEdits|MobileWikiAppSuggestedEditsFeed|MobileWikiAppTabs|MobileWikiAppTalk|MobileWikiAppToCInteraction|MobileWikiAppiOSReadingLists|MobileWikiAppiOSSessions|ReferencePreviewsBaseline|TranslationRecommendationUIRequests|UploadWizardStep|WikidataCompletionSearchClicks' --since='2021-02-09T00:00:00' --until='2021-02-10T00:00:00' --ignore_failure_flag=true --ignore_done_flag=true

I used ignore_done_flag due to the presence of both REFINED and REFINE_FAILED (I thought it was a weirdness that happened before you changed the Spark assembly), but I made it worse: now the double REFINED/REFINE_FAILED is more widespread :(

Change 663191 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update hadoop and hive dependencies versions

https://gerrit.wikimedia.org/r/663191

Ottomata renamed this task from Clean up issues with oozie jobs after Hadoop Upgrade to Clean up issues with jobs after Hadoop Upgrade. Feb 10 2021, 2:48 PM

eventlogging_legacy refine had some REFINE_FAILED flags for some events, together with a REFINED flag

There were hours with both? Hm, I guess that makes sense; I don't think _REFINE_FAILED is removed if _REFINED happens later. But if a dir has _REFINED, it shouldn't need a re-run, unless you are fixing some bug where the hour needs to be re-refined.

Oh, it looks like you ran the refine_eventlogging_legacy job for tables that are not yet migrated to Event Platform. Non-migrated tables use the refine_eventlogging_analytics job. See: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp#L6

I'm rerunning the same command you just ran but with refine_eventlogging_analytics instead:

sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_analytics --table_whitelist_regex='CentralNoticeImpression|ContentTranslationError|ContentTranslationSuggestion|EchoMail|EditAttemptStep|EditorActivation|EventError|GuidedTourExited|GuidedTourGuiderHidden|InukaPageView|KaiOSAppFeedback|MobileWebSearch|MobileWebSectionUsage|MobileWebUIActionsTracking|MobileWikiAppABTest|MobileWikiAppIntents|MobileWikiAppLinkPreview|MobileWikiAppOnThisDay|MobileWikiAppRandomizer|MobileWikiAppSessions|MobileWikiAppSuggestedEdits|MobileWikiAppSuggestedEditsFeed|MobileWikiAppTabs|MobileWikiAppTalk|MobileWikiAppToCInteraction|MobileWikiAppiOSReadingLists|MobileWikiAppiOSSessions|ReferencePreviewsBaseline|TranslationRecommendationUIRequests|UploadWizardStep|WikidataCompletionSearchClicks' --since='2021-02-09T00:00:00' --until='2021-02-10T00:00:00' --ignore_failure_flag=true --ignore_done_flag=true

Change 663245 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Replace UNION ALL with UNION to unbreak data_quality_stats job

https://gerrit.wikimedia.org/r/663245
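For reference, a minimal sketch of the shape of this change (table and column names are assumed, not the actual data_quality_stats query):

-- Before: UNION ALL failed to plan under the upgraded Hive in this job.
-- After: plain UNION parses and runs (note it also deduplicates rows,
-- which UNION ALL did not).
SELECT query_name, metric, value FROM data_quality_stats_incoming
UNION
SELECT query_name, metric, value FROM data_quality_stats_current;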

Change 663305 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Use temp directory hack on mediacounts

https://gerrit.wikimedia.org/r/663305

Change 663191 merged by jenkins-bot:
[analytics/refinery/source@master] Update hadoop and hive dependencies versions

https://gerrit.wikimedia.org/r/663191

Change 663312 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Fix interlanguage job syntax

https://gerrit.wikimedia.org/r/663312

Change 663305 merged by Milimetric:
[analytics/refinery@master] Use temp directory hack on mediacounts

https://gerrit.wikimedia.org/r/663305

Change 663312 merged by Milimetric:
[analytics/refinery@master] Fix interlanguage job syntax

https://gerrit.wikimedia.org/r/663312

Change 663372 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] [WIP] The mediarequest per file job had a syntax error that I fix here, but it also has the UNION ALL syntax that I understand doesn't work. In this case it would be prohibitively expensive to use UNION; it's a LOT of data. So I'm leaving this fix for tomorrow.

https://gerrit.wikimedia.org/r/663372

Update: all Cassandra jobs restarted and seem OK, except mediarequests per_file daily; the WIP patch for that is above. See the note in the description: once figured out, it needs to restart at 2021-02-09. Other problems are fixed and those jobs restarted.

NOTE: there may be plenty of problems yet to come with the monthly jobs that just haven't run yet, but hopefully just syntax issues and whatever is happening with the mediacounts job.

Change 663575 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] [WIP] Move mediarequest oozie job to sparksql

https://gerrit.wikimedia.org/r/663575

Change 663582 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Fix oozie sharelib creation script

https://gerrit.wikimedia.org/r/663582

Change 663582 merged by Elukey:
[operations/puppet@production] Fix oozie sharelib creation script

https://gerrit.wikimedia.org/r/663582

Change 663245 merged by Mforns:
[analytics/refinery@master] Replace UNION ALL with UNION to unbreak data_quality_stats job

https://gerrit.wikimedia.org/r/663245

Change 663372 merged by Milimetric:
[analytics/refinery@master] Update syntax for new hive version

https://gerrit.wikimedia.org/r/663372

While checking the hive-server2.log file on an-coord1001, I found a recurrence of the following, right after a virtualpageview hourly job starts:

2021-02-13T00:38:00,776 ERROR [8a3b024d-3aca-4c23-ae05-966e6e88080d HiveServer2-Handler-Pool: Thread-164680] parse.CalcitePlanner: CBO failed, skipping CBO. 
org.apache.hadoop.hive.ql.parse.SemanticException: Line 20:4 Invalid function ''project''
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:939) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1359) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:76) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:229) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.optimizer.calcite.translator.JoinCondTypeCheckProcFactory.genExprNode(JoinCondTypeCheckProcFactory.java:58) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genJoinRelNode(CalcitePlanner.java:1953) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genJoinLogicalPlan(CalcitePlanner.java:2167) ~[hive-exec-2.3.6.jar:2.3.6]
...
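The stack trace goes through JoinCondTypeCheckProcFactory and genJoinRelNode, so the failing ''project'' expression sits in a join condition. A purely hypothetical reconstruction of the kind of pattern involved (not the actual virtualpageview query):

-- Hypothetical: a join condition dereferencing the 'project' map key,
-- which the CBO type-checker seems to mis-read as a function call.
SELECT COUNT(*)
FROM wmf.virtualpageview_hourly v
JOIN wmf.pageview_actor p ON (
    v.project = p.pageview_info['project']
    AND v.year = p.year AND v.month = p.month AND v.day = p.day
)
WHERE v.year = 2021 AND v.month = 2 AND v.day = 13;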

This pops up in hive-server2.log very often:

2021-02-13T03:15:15,802 ERROR [2d5044e1-67a1-4c02-88e6-ebcb6352842b HiveServer2-Handler-Pool: Thread-171789] parse.CalcitePlanner: CBO failed, skipping CBO. 
org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Table Sample specified for webrequest. Currently we don't support Table Sample clauses in CBO, turn off cbo for queries on tableSamples.
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genTableLogicalPlan(CalcitePlanner.java:2185) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genLogicalPlan(CalcitePlanner.java:3957) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1315) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1261) ~[hive-exec-2.3.6.jar:2.3.6]
        at org.apache.calcite.tools.Frameworks$1.apply(Frameworks.java:113) ~[calcite-core-1.10.0.jar:1.10.0]
        at org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:997) ~[calcite-core-1.10.0.jar:1.10.0]
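The message itself names the trigger: any query with a TABLESAMPLE clause on webrequest makes CBO bail out. A minimal sketch of the pattern (the bucket column is an assumption, not a specific production query):

-- CBO does not support table samples; queries like this log the error
-- and fall back to non-CBO planning (they still run).
SELECT uri_host, COUNT(*) AS hits
FROM wmf.webrequest TABLESAMPLE (BUCKET 1 OUT OF 64 ON hostname) w
WHERE year = 2021 AND month = 2 AND day = 13 AND hour = 3
GROUP BY uri_host;

-- Per the message, the log noise can be avoided by disabling CBO for
-- such queries: SET hive.cbo.enable=false;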

Every time pageview_actor hourly is run, I see the following in hive-server2.log:

2021-02-13T00:29:20,114 ERROR [c69ed050-16b1-47f3-9306-8740b72a3aaf HiveServer2-Handler-Pool: Thread-163900] parse.CalcitePlanner: CBO failed, skipping CBO.
java.lang.NullPointerException

Change 664172 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Update oozie sharelib creation

https://gerrit.wikimedia.org/r/664172

Change 664172 merged by Elukey:
[operations/puppet@production] Update oozie sharelib creation

https://gerrit.wikimedia.org/r/664172

Change 664674 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/reportupdater-queries@master] Fix use of reserved keywords

https://gerrit.wikimedia.org/r/664674

Change 664674 abandoned by Milimetric:
[analytics/reportupdater-queries@master] Fix use of reserved keywords

Reason:
oh no! Adam already did it :(

https://gerrit.wikimedia.org/r/664674

I'm running into an issue when testing my AQS pageviews/per-country Oozie job that didn't appear when I tested it a few weeks ago.

This log is from one of my runs; the error can also be reproduced by running the Oozie job's Hive query through Hue: https://hue.wikimedia.org/hue/jobbrowser/#!id=task_1612875249838_45243_m_000000

This is the main part of the error:

Error: Error running query: java.lang.AssertionError: Internal error: While invoking method 'public org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator$Frame org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator.decorrelateRel(org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveProject) throws org.apache.hadoop.hive.ql.parse.SemanticException' (state=,code=0)

Furthermore, when I run the query through the command line (hive), I receive a different error:

NoViableAltException(350@[()* loopback of 430:20: ( ( LSQUARE ^ expression RSQUARE !) | ( DOT ^ identifier ) )*])

Again, this error did not appear when I first ran the query several weeks ago.

The query file is located at /home/lexnasser/oozie/cassandra/daily/pageview_top_percountry.hql on stat1007, and can be run, for example, as:

hive -f pageview_top_percountry.hql -d refinery_hive_jar_path=hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.1.1.jar -d destination_directory=/user/lexnasser/test -d source_table=wmf.pageview_actor -d country_blacklist_table=wmf.geoeditors_blacklist_country -d separator=\\t -d year=2021 -d month=2 -d day=16

Because this error only started occurring after the Hadoop upgrade, I think the issue is related to it, but I'm not sure. Any help or suggestions would be greatly appreciated :)

@lexnasser: I ran into the same error testing the query manually (NoViableAltException(350@[()* loopback of 430:20: ( ( LSQUARE ^ expression RSQUARE !) | ( DOT ^ identifier ) )*])). I don't know what's going on there :(

I have, however, found a solution for the `Error: Error running query: java.lang.AssertionError: Internal error: While invoking method 'public org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator$Frame org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator.decorrelateRel(org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveProject) throws org.apache.hadoop.hive.ql.parse.SemanticException' (state=,code=0)` one: the problem comes from the Calcite cost-based optimizer, and we can disable it with `SET hive.cbo.enable=false;`.

We won't take advantage of the optimizations, but at least the query should run (if we get rid of the parsing issue -MEH).
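For reference, the setting is per-session, so it goes at the top of the HQL file (or the beeline session) before the affected statement; a sketch with an illustrative query:

-- Disable the Calcite cost-based optimizer for this session only.
SET hive.cbo.enable=false;

-- The query that hit the decorrelateRel assertion runs here, e.g.:
SELECT COUNT(*)
FROM wmf.pageview_actor
WHERE year = 2021 AND month = 2 AND day = 16;

-- Optionally re-enable CBO for the rest of the session.
SET hive.cbo.enable=true;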

@JAllemandou Thanks for finding the hive.cbo.enable option! That fixed the HiveRelDecorrelator issue, but now I'm getting another error:

Error: Error while compiling statement: FAILED: SemanticException [Error 10250]: Line 23:8 Invalid SubQuery expression '1': SubQuery can contain only 1 item in Select List. (state=42000,code=10250)
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10250]: Line 23:8 Invalid SubQuery expression '1': SubQuery can contain only 1 item in Select List.

Here's the full log: https://hue.wikimedia.org/jobbrowser/apps#!id=job_1612875249838_48256
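For context, this error class comes from Hive's correlated-subquery rewrite, which now insists the subquery select list contain exactly one item. A sketch of a passing form, using a single TRUE item (assumed tables; this mirrors the NOT EXISTS shape used later in this task):

-- Fails: more than one item in the correlated subquery select list, e.g.
--   NOT EXISTS (SELECT b.country_code, 1 FROM ... WHERE ...)
-- Passes: exactly one item.
SELECT COUNT(*)
FROM wmf.pageview_actor source
WHERE year = 2021 AND month = 2 AND day = 16
    AND NOT EXISTS (
        SELECT TRUE
        FROM wmf.geoeditors_blacklist_country b
        WHERE b.country_code = source.geocoded_data['country_code']
    );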

I'm thinking of just iteratively rewriting the query from a blank slate and seeing what components cause the issue and how I can circumvent them. Does this sound like a reasonable approach to you?

I'm thinking of just iteratively rewriting the query from a blank slate and seeing what components cause the issue and how I can circumvent them.

This is my way of testing, using deconstruction instead of construction :)
Let me know how it goes!

Just finished fixing up the Hive query for the Oozie job to load the data into Cassandra for the top per-country AQS pageviews endpoint.

In my description below, I will be referencing line numbers from the following file: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/654924/5/oozie/cassandra/daily/pageview_top_percountry.hql.

There were 3 issues I encountered, all of which I believe were introduced by the Hadoop upgrade:

  1. A redundant partition predicate is required. As expected, when querying a table such as wmf.pageview_actor, a partition predicate is required by Hive (i.e., WHERE year = ${year} AND month = ${month} AND day = ${day}). However, Hive now errors if this partition predicate is not repeated in the subquery. See lines 40-42 and 57-59, and the sketch after this list.
  2. org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelDecorrelator.decorrelateRel assertion error for grouping sets. @elukey probably understands why better than I do, but Calcite doesn't like the grouping sets I had, so I converted them to a simple UNION. This is pretty ugly and would be far worse if more grouping sets were needed. See line 87.
  3. Errors when double quotes are not escaped in single-quoted strings. Note that this is only an issue I've encountered when running via the hive CLI, not when Oozie runs the job, so I'm not sure it affects the functionality of existing Oozie jobs. The issue itself is straightforward, albeit difficult to diagnose: if a string literal contains double quotes, they must be escaped, even when the literal is single-quoted. See lines 162-168, and the sketch after this list.
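A minimal sketch of issues 1 and 3 (dates and literals are illustrative only):

-- Issue 1: the partition predicate must be repeated inside the subquery.
SELECT COUNT(*)
FROM wmf.pageview_actor outer_pv
WHERE year = 2021 AND month = 2 AND day = 16
    AND EXISTS (
        SELECT TRUE
        FROM wmf.pageview_actor inner_pv
        WHERE inner_pv.actor_signature = outer_pv.actor_signature
            -- Without repeating the next line, Hive now raises a
            -- partition-predicate error for the subquery:
            AND inner_pv.year = 2021 AND inner_pv.month = 2 AND inner_pv.day = 16
    );

-- Issue 3: double quotes inside single-quoted literals must be escaped
-- when running via the hive CLI.
SELECT '{\"article\":\"Foo\"}';   -- works
-- SELECT '{"article":"Foo"}';    -- errors in the hive CLI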

Let me know if you have any questions, suggestions, or thoughts as to why these issues occur, and if there are cleaner solutions.

Thanks to Luca for helping me with issue #2 and to Joseph for helping me with the overall fixing process.

Change 668111 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add backticks to reserved word date in geoeditors monthly job

https://gerrit.wikimedia.org/r/668111

Change 668111 merged by Elukey:
[analytics/refinery@master] Add backticks to reserved word date in geoeditors monthly job

https://gerrit.wikimedia.org/r/668111
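The shape of that fix, as a sketch (the table and columns here are hypothetical; only the back-quoting is the point): date is a reserved keyword in the newer Hive, so the column has to be back-quoted wherever it appears:

-- Before: SELECT date, ... (now a parse error)
-- After:
SELECT `date`, SUM(distinct_editors) AS distinct_editors
FROM geoeditors_monthly_staging   -- hypothetical table name
GROUP BY `date`;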

@lexnasser heya - I've found time to do a deeper analysis of the errors you encountered in your HQL query.
I have a query version that is closer to the original one you wrote, and that also improves computation performance quite a bit over the last one you suggested (which I merged to move faster); the difference can be seen in the query plan, for instance :)
Here are the two versions, and a check that they generate the same result:

The new version, rewritten just a bit so that it works for me:

ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar;

DROP TABLE IF EXISTS joal.test_lex_pageview_top_percountry;
CREATE TABLE joal.test_lex_pageview_top_percountry AS
WITH raw_ungrouped AS (
    SELECT
        geocoded_data['country_code'] AS country_code,
        REGEXP_REPLACE(access_method, ' ', '-') AS access,
        pageview_info['project'] AS project,
        REFLECT('org.json.simple.JSONObject', 'escape', REGEXP_REPLACE(pageview_info['page_title'], '\t', '')) AS page_title,
        LPAD(year, 4, '0') as year,
        LPAD(month, 2, '0') as month,
        LPAD(day, 2, '0') as day,
        actor_signature
    FROM wmf.pageview_actor source
    WHERE
        year = 2021
        AND month = 3
        AND day = 3
        AND agent_type = 'user'
        AND pageview_info IS NOT NULL
        AND geocoded_data IS NOT NULL
        AND is_pageview
        -- Remove special unknown pageview (see T117346)
        AND pageview_info['page_title'] != '-'
        AND geocoded_data['country_code'] != '--'
        -- This NOT EXISTS executes as a map join, and was chosen after benchmarking against LEFT JOIN and NOT IN
        AND NOT EXISTS (
            SELECT country_blacklist.country_code
            FROM wmf.geoeditors_blacklist_country country_blacklist
            WHERE
                country_blacklist.country_code = source.geocoded_data['country_code']
        )
),
raw AS (
    SELECT
        access,
        year,
        month,
        day,
        country_code,
        project,
        page_title,
        COUNT(1) AS total_view_count,
        COUNT(DISTINCT actor_signature) AS unique_actor_count
    FROM raw_ungrouped
    WHERE
        year = 2021
        AND month = 3
        AND day = 3
    GROUP BY
        access,
        year,
        month,
        day,
        country_code,
        project,
        page_title
    -- Union must be used rather than grouping sets due to issue with HiveRelDecorrelator.decorrelateRel
    UNION
    SELECT
        'all-access' AS access,
        year,
        month,
        day,
        country_code,
        project,
        page_title,
        COUNT(1) AS total_view_count,
        COUNT(DISTINCT actor_signature) AS unique_actor_count
    FROM raw_ungrouped
    WHERE
        year = 2021
        AND month = 3
        AND day = 3
    GROUP BY
        year,
        month,
        day,
        country_code,
        project,
        page_title
),
ranked AS (
    SELECT
        country_code,
        access,
        project,
        page_title,
        year,
        month,
        day,
        CEIL(total_view_count / 100) * 100 AS views_ceil,
        rank() OVER (PARTITION BY access, country_code, year, month, day ORDER BY total_view_count DESC, page_title) as rank,
        row_number() OVER (PARTITION BY access, country_code, year, month, day ORDER BY total_view_count DESC, page_title) as rn
    FROM raw
    WHERE unique_actor_count > 1000
),
max_rank AS (
    SELECT
        country_code,
        access,
        year,
        month,
        day,
        rank as max_rank
    FROM ranked
    WHERE rn = 1001
    GROUP BY
        country_code,
        access,
        year,
        month,
        day,
        rank
)

SELECT
        ranked.country_code as country_code,
        ranked.access as access,
        ranked.year as year,
        ranked.month as month,
        ranked.day as day,
        CONCAT(
            '[',
            CONCAT_WS(
                ',',
                COLLECT_SET(
                    CONCAT(
                        '{\"article\":\"',
                        ranked.page_title,
                        '\",\"project\":\"',
                        ranked.project,
                        '\",\"views_ceil\":',
                        CAST(ranked.views_ceil AS STRING),
                        ',\"rank\":',
                        CAST(ranked.rank AS STRING),
                        '}'
                    )
                )
            ),
            ']'
        ) as top_values
FROM ranked
LEFT JOIN max_rank ON (
    ranked.country_code = max_rank.country_code
    AND ranked.access = max_rank.access
    AND ranked.year = max_rank.year
    AND ranked.month = max_rank.month
    AND ranked.day = max_rank.day
)
WHERE
    ranked.rank < COALESCE(max_rank.max_rank, 1001)
GROUP BY
    ranked.country_code,
    ranked.access,
    ranked.year,
    ranked.month,
    ranked.day
;

My suggested version:

ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar;

SET hive.cbo.enable=false;
SET hive.mapred.mode=nonstrict;

DROP TABLE IF EXISTS joal.test_joal_pageview_top_percountry;
CREATE TABLE joal.test_joal_pageview_top_percountry AS
WITH base_data AS (
    SELECT
        geocoded_data['country_code'] AS country_code,
        regexp_replace(access_method, ' ', '-') AS access,
        pageview_info['project'] AS project,
        reflect('org.json.simple.JSONObject', 'escape', regexp_replace(pageview_info['page_title'], '\t', '')) AS page_title,
        actor_signature,
        LPAD(year, 4, '0') as year,
        LPAD(month, 2, '0') as month,
        LPAD(day, 2, '0') as day
    FROM wmf.pageview_actor source
    WHERE
        year = 2021
        AND month = 3
        AND day = 3
        AND agent_type = 'user'
        AND pageview_info IS NOT NULL
        AND geocoded_data IS NOT NULL
        AND is_pageview
        -- Remove special unknown pageview (see T117346)
        AND pageview_info['page_title'] != '-'
        AND geocoded_data['country_code'] != '--'
        -- This NOT EXISTS executes as a map join, and was chosen after benchmarking against LEFT JOIN and NOT IN
        AND NOT EXISTS (
            SELECT TRUE
            FROM wmf.geoeditors_blacklist_country country_blacklist
            WHERE country_blacklist.country_code = source.geocoded_data['country_code']
        )
),
raw AS (
    SELECT
        country_code,
        COALESCE(access, 'all-access') AS access,
        project,
        page_title,
        year,
        month,
        day,
        COUNT(1) AS total_view_count,
        COUNT(DISTINCT actor_signature) AS unique_actor_count
    FROM base_data
    GROUP BY
        country_code,
        access,
        project,
        page_title,
        year,
        month,
        day
    GROUPING SETS (
        (
            country_code,
            access,
            project,
            page_title,
            year,
            month,
            day
        ),
        (
            country_code,
            project,
            page_title,
            year,
            month,
            day
        )
    )
),
ranked AS (
    SELECT
        country_code,
        access,
        project,
        page_title,
        year,
        month,
        day,
        CEIL(total_view_count / 100) * 100 AS views_ceil,
        rank() OVER (PARTITION BY access, country_code, year, month, day ORDER BY total_view_count DESC, page_title) as rank,
        row_number() OVER (PARTITION BY access, country_code, year, month, day ORDER BY total_view_count DESC, page_title) as rn
    FROM raw
    WHERE unique_actor_count > 1000
),
max_rank AS (
    SELECT
        country_code,
        access,
        year,
        month,
        day,
        rank as max_rank
    FROM ranked
    WHERE rn = 1001
    GROUP BY
        country_code,
        access,
        year,
        month,
        day,
        rank
)

SELECT
        ranked.country_code as country_code,
        ranked.access as access,
        ranked.year as year,
        ranked.month as month,
        ranked.day as day,
        CONCAT('[',
            CONCAT_WS(',', collect_set(
                CONCAT('{\"article\":\"', ranked.page_title,
                    '\",\"project\":\"', ranked.project,
                    '\",\"views_ceil\":', CAST(ranked.views_ceil AS STRING),
                    ',\"rank\":', CAST(ranked.rank AS STRING), '}'))
            ),']') AS top_values
FROM ranked
LEFT JOIN max_rank ON (
    ranked.country_code = max_rank.country_code
    AND ranked.access = max_rank.access
    AND ranked.year = max_rank.year
    AND ranked.month = max_rank.month
    AND ranked.day = max_rank.day
)
WHERE
    ranked.rank < COALESCE(max_rank.max_rank, 1001)
GROUP BY
    ranked.country_code,
    ranked.access,
    ranked.year,
    ranked.month,
    ranked.day
;

And the check of output correctness (if the two versions agree, every result row should show true in the first column):

 select
    (l.top_values = j.top_values),
    count(1) as c
from test_joal_pageview_top_percountry j
    inner join test_lex_pageview_top_percountry l
    on (
        j.country_code = l.country_code
       and j.access = l.access
       and j.year = l.year
       and j.month = l.month
       and j.day = l.day)
group by (l.top_values = j.top_values)
;
Milimetric triaged this task as Medium priority. May 10 2021, 3:12 PM
Milimetric updated the task description.
Milimetric moved this task from In Progress to Done on the Analytics-Kanban board.