
Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link
Closed, Resolved · Public

Description

The mediawiki-history-drop-snapshot.service unit is in an error state; the journalctl logs have been cleared, but I found the error in syslog:

razzi@an-launcher1002:/var/log$ zgrep 'HDFS directories to check' *
...
syslog.7.gz:Mar 30 06:26:25 an-launcher1002 kerberos-run-command[20455]: 2022-03-30T06:26:25 ERROR  Selected partitions extracted from table specs ({'snapshot=2022-01-24', 'snapshot=2022-01-31'}) does not match selected partitions extracted from data paths (set()). HDFS directories to check: []

Running the command with --verbose and --dry-run showed which table was producing the error:

2022-04-06T19:52:52 DEBUG  Processing table wikidata_entity keeping 6 snapshots
2022-04-06T19:52:52 DEBUG  Getting partitions to drop...
2022-04-06T19:52:52 DEBUG  Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; SHOW PARTITIONS wikidata_entity;
2022-04-06T19:53:05 DEBUG  Getting directories to remove...
2022-04-06T19:53:05 DEBUG  Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; DESCRIBE FORMATTED wikidata_entity;
2022-04-06T19:53:17 DEBUG  Running: hdfs dfs -ls -d hdfs://analytics-hadoop/wmf/data/wmf/wikidata/entity/*/_SUCCESS
2022-04-06T19:53:19 ERROR  Selected partitions extracted from table specs ({'snapshot=2022-01-31', 'snapshot=2022-02-07', 'snapshot=2022-01-24'}) does not match selected partitions extracted from data paths (set()). HDFS directories to check: []
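
For context on the error: the empty set() on the data-path side is what trips the comparison. The script derives partition specs from the HDFS paths it lists, so if the listing yields nothing there is nothing to match against the partitions Hive reports. A minimal sketch of that derivation, assuming it simply parses the snapshot=... component out of each path (the actual refinery code may differ):

import re

# Hypothetical sketch: derive partition specs like 'snapshot=2022-01-24'
# from HDFS data paths. The real refinery-drop-mediawiki-snapshots code
# may extract them differently.
def partitions_from_paths(paths):
    specs = set()
    for path in paths:
        match = re.search(r'snapshot=\d{4}-\d{2}-\d{2}', path)
        if match:
            specs.add(match.group(0))
    return specs

# An empty directory listing yields an empty set, which can never match
# the partitions returned by SHOW PARTITIONS:
print(partitions_from_paths([]))  # -> set()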

@BTullis and I had thought we could fix this by manually adding some _SUCCESS files, but the files are already there:

razzi@an-launcher1002:~$ hdfs dfs -ls /wmf/data/wmf/wikidata/entity/snapshot=*/_SUCCESS
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:55 /wmf/data/wmf/wikidata/entity/snapshot=2022-01-24/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-01-31/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-07/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/_SUCCESS
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-28/_SUCCESS

I wonder whether the partitions and directories got out of sync; the script will not fix the situation on its own, because the check of partitions versus directories happens before anything is removed:

if not non_strict:
    check_partitions_vs_directories(partitions, directories)
drop_partitions(hive, table, partitions, dry_run)
remove_directories(hive, table, directories, dry_run)
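
For reference, a rough sketch of what the strict check presumably does (the real check_partitions_vs_directories in refinery may behave differently, e.g. log an error and exit rather than raise): compare the partition specs Hive reports against the specs recovered from the HDFS paths, and bail out before anything is dropped if they disagree.

# Hypothetical sketch of the strict check; reuses partitions_from_paths from
# the sketch above. The actual refinery implementation may differ.
def check_partitions_vs_directories(partitions, directories):
    partition_specs = set(partitions)
    directory_specs = partitions_from_paths(directories)
    if partition_specs != directory_specs:
        raise RuntimeError(
            'Selected partitions extracted from table specs ({}) does not match '
            'selected partitions extracted from data paths ({}).'.format(
                sorted(partition_specs), sorted(directory_specs)))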

So I ran the script with --dry-run and --non-strict:

PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python /usr/local/bin/kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-drop-mediawiki-snapshots --verbose --dry-run --non-strict
...
2022-04-06T20:38:34 DEBUG  Processing table wikidata_entity keeping 6 snapshots
2022-04-06T20:38:34 DEBUG  Getting partitions to drop...
2022-04-06T20:38:34 DEBUG  Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; SHOW PARTITIONS wikidata_entity;
2022-04-06T20:38:47 DEBUG  Getting directories to remove...
2022-04-06T20:38:47 DEBUG  Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; DESCRIBE FORMATTED wikidata_entity;
2022-04-06T20:38:59 DEBUG  Running: hdfs dfs -ls -d hdfs://analytics-hadoop/wmf/data/wmf/wikidata/entity/*/_SUCCESS
2022-04-06T20:39:02 INFO   Dropping 3 partitions from wmf.wikidata_entity
2022-04-06T20:39:02 DEBUG       snapshot='2022-02-07'
2022-04-06T20:39:02 DEBUG       snapshot='2022-01-31'
2022-04-06T20:39:02 DEBUG       snapshot='2022-01-24'

Here's the full output, since running it takes a while: https://phabricator.wikimedia.org/P24175

Two Hive tables are affected: wmf.wikidata_item_page_link and wmf.wikidata_entity.

So it looks like running it with --non-strict would work. Before I do this, I'm hoping somebody on the team can weigh in on my understanding: @JAllemandou and/or @mforns perhaps?

Event Timeline

Milimetric triaged this task as High priority.
Milimetric added subscribers: Ottomata, Milimetric.

Removing Razzi, resetting to high, and pinging @Ottomata, who's swapping ops weeks with me. Sorry I just found this now, but it might be related to the Airflow SLA weirdness around the dependent jobs (I replied to the alert emails).

I think the problem here is that the migrated Airflow job does not generate _SUCCESS files after the output data has been generated.
We didn't identify the need for them, since we only looked at Oozie's data dependencies.
Also, there's task T303988, in which we want to modify the deletion script to stop looking at _SUCCESS files and instead check Hive partitions and HDFS folder existence.
If we do it the way T303988 suggests, we won't need to add _SUCCESS file writers to all Airflow jobs whose data needs deletion (see the sketch after this comment).
Let's discuss tomorrow in standup.
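
For illustration, a minimal sketch of the existence-based check T303988 suggests, assuming the script only needs to verify that each partition's directory exists on HDFS rather than looking for _SUCCESS markers (the function name and call pattern here are hypothetical, not the actual refinery code):

import subprocess

# Hypothetical sketch: check that a partition directory exists on HDFS
# instead of requiring a _SUCCESS file. 'hdfs dfs -test -d' exits 0 when
# the path exists and is a directory.
def partition_directory_exists(base_path, partition_spec):
    path = '{}/{}'.format(base_path.rstrip('/'), partition_spec)
    return subprocess.call(['hdfs', 'dfs', '-test', '-d', path]) == 0

# Example with a table location from this task:
# partition_directory_exists('/wmf/data/wmf/wikidata/entity', 'snapshot=2022-01-24')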
Thanks @JAllemandou for remembering about that task!