
Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link
Closed, Resolved · Public

Description

The mediawiki-history-drop-snapshot.service unit is in an error state; the journalctl logs have been cleared, but I found the error in syslog:

razzi@an-launcher1002:/var/log$ zgrep 'HDFS directories to check' *
...
syslog.7.gz:Mar 30 06:26:25 an-launcher1002 kerberos-run-command[20455]: 2022-03-30T06:26:25 ERROR  Selected partitions extracted from table specs ({'snapshot=2022-01-24', 'snapshot=2022-01-31'}) does not match selected partitions extracted from data paths (set()). HDFS directories to check: []

Running the command with --verbose and --dry-run showed which table was producing the error:

2022-04-06T19:52:52 DEBUG  Processing table wikidata_entity keeping 6 snapshots
2022-04-06T19:52:52 DEBUG  Getting partitions to drop...
2022-04-06T19:52:52 DEBUG  Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; SHOW PARTITIONS wikidata_entity;
2022-04-06T19:53:05 DEBUG  Getting directories to remove...
2022-04-06T19:53:05 DEBUG  Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; DESCRIBE FORMATTED wikidata_entity;
2022-04-06T19:53:17 DEBUG  Running: hdfs dfs -ls -d hdfs://analytics-hadoop/wmf/data/wmf/wikidata/entity/*/_SUCCESS
2022-04-06T19:53:19 ERROR  Selected partitions extracted from table specs ({'snapshot=2022-01-31', 'snapshot=2022-02-07', 'snapshot=2022-01-24'}) does not match selected partitions extracted from data paths (set()). HDFS directories to check: []
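
For context on the error: the empty set() on the data-path side is what trips the comparison. The script derives partition specs from the HDFS paths it lists, so if the listing yields nothing there is nothing to match against the partitions Hive reports. A minimal sketch of that derivation, assuming it simply parses the snapshot=... component out of each path (the actual refinery code may differ):

import re

# Hypothetical sketch: derive partition specs like 'snapshot=2022-01-24'
# from HDFS data paths. The real refinery-drop-mediawiki-snapshots code
# may extract them differently.
def partitions_from_paths(paths):
    specs = set()
    for path in paths:
        match = re.search(r'snapshot=\d{4}-\d{2}-\d{2}', path)
        if match:
            specs.add(match.group(0))
    return specs

# An empty directory listing yields an empty set, which can never match
# the partitions returned by SHOW PARTITIONS:
print(partitions_from_paths([]))  # -> set()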

@BTullis and I had thought we could fix this by manually adding some _SUCCESS files, but the files are already there:

razzi@an-launcher1002:~$ hdfs dfs -ls /wmf/data/wmf/wikidata/entity/snapshot=*/_SUCCESS
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:55 /wmf/data/wmf/wikidata/entity/snapshot=2022-01-24/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-01-31/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-07/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-28/_SUCCESS

I wonder whether the partitions and directories got out of sync; the script will not fix the situation on its own, because the check of partitions versus directories happens before anything is removed:

if not non_strict:
    check_partitions_vs_directories(partitions, directories)
drop_partitions(hive, table, partitions, dry_run)
remove_directories(hive, table, directories, dry_run)
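
For reference, a rough sketch of what the strict check presumably does (the real check_partitions_vs_directories in refinery may behave differently, e.g. log an error and exit rather than raise): compare the partition specs Hive reports against the specs recovered from the HDFS paths, and bail out before anything is dropped if they disagree.

# Hypothetical sketch of the strict check; reuses partitions_from_paths from
# the sketch above. The actual refinery implementation may differ.
def check_partitions_vs_directories(partitions, directories):
    partition_specs = set(partitions)
    directory_specs = partitions_from_paths(directories)
    if partition_specs != directory_specs:
        raise RuntimeError(
            'Selected partitions extracted from table specs ({}) does not match '
            'selected partitions extracted from data paths ({}).'.format(
                sorted(partition_specs), sorted(directory_specs)))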

So I ran the script with --dry-run and --non-strict:

PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python /usr/local/bin/kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-drop-mediawiki-snapshots --verbose --dry-run --non-strict
...
2022-04-06T20:38:34 DEBUG  Processing table wikidata_entity keeping 6 snapshots
2022-04-06T20:38:34 DEBUG  Getting partitions to drop...
2022-04-06T20:38:34 DEBUG  Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; SHOW PARTITIONS wikidata_entity;
2022-04-06T20:38:47 DEBUG  Getting directories to remove...
2022-04-06T20:38:47 DEBUG  Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; DESCRIBE FORMATTED wikidata_entity;
2022-04-06T20:38:59 DEBUG  Running: hdfs dfs -ls -d hdfs://analytics-hadoop/wmf/data/wmf/wikidata/entity/*/_SUCCESS
2022-04-06T20:39:02 INFO   Dropping 3 partitions from wmf.wikidata_entity
2022-04-06T20:39:02 DEBUG       snapshot='2022-02-07'
2022-04-06T20:39:02 DEBUG       snapshot='2022-01-31'
2022-04-06T20:39:02 DEBUG       snapshot='2022-01-24'

Here's the full output, since running it takes a while: https://phabricator.wikimedia.org/P24175

Two Hive tables are affected: wmf.wikidata_item_page_link and wmf.wikidata_entity.

So it looks like running it with --non-strict would work. Before I do this, I'm hoping somebody on the team can weigh in on my understanding: @JAllemandou and/or @mforns perhaps?

Event Timeline

Milimetric triaged this task as High priority.
Milimetric added subscribers: Ottomata, Milimetric.

Removing Razzi, resetting to high, and pinging @Ottomata, who's swapping ops weeks with me. Sorry I just found this now, but it might be related to the Airflow SLA weirdness around the dependent jobs (I replied to the alert emails).

I think the problem here is that the migrated Airflow job does not generate _SUCCESS files after the output data has been generated.
We didn't identify the need for them, since we only looked at Oozie's data dependencies.
Also, there's task T303988, in which we want to modify the deletion script to stop looking at _SUCCESS files and instead check Hive partitions and HDFS folder existence.
If we do it the way T303988 suggests, we won't need to add _SUCCESS file writers to all Airflow jobs whose data needs deletion (see the sketch after this comment).
Let's discuss tomorrow in standup.
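
For illustration, a minimal sketch of the existence-based check T303988 suggests, assuming the script only needs to verify that each partition's directory exists on HDFS rather than looking for _SUCCESS markers (the function name and call pattern here are hypothetical, not the actual refinery code):

import subprocess

# Hypothetical sketch: check that a partition directory exists on HDFS
# instead of requiring a _SUCCESS file. 'hdfs dfs -test -d' exits 0 when
# the path exists and is a directory.
def partition_directory_exists(base_path, partition_spec):
    path = '{}/{}'.format(base_path.rstrip('/'), partition_spec)
    return subprocess.call(['hdfs', 'dfs', '-test', '-d', path]) == 0

# Example with a table location from this task:
# partition_directory_exists('/wmf/data/wmf/wikidata/entity', 'snapshot=2022-01-24')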
Thanks @JAllemandou for remembering about that task!