Skip sync tasks for private wikis, or allow them to fail gracefully
Closed, Resolved, Public

Description

Our current method of running the dumps v1 DAGs always creates sync tasks for every wiki.

However, in the case of private wikis, there is no /mnt/dumpsdata/xmldatadumps/${wiki} directory, so the sync tasks fail.

image.png (404×717 px, 44 KB)

We should determine how to handle this. Options include:

  • Skipping the sync_batch_* tasks for private wikis
  • Allowing the sync_batch_* tasks to fail gracefully if the directory does not exist
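As a rough illustration of the second option, a sync task could check for the source directory before doing any work. This is only a sketch: the helper name `sync_source_or_none` is hypothetical and not part of the existing DAG code, and the only detail taken from this task is the `/mnt/dumpsdata/xmldatadumps/${wiki}` path.

```python
from pathlib import Path
from typing import Optional

# Base path taken from the task description; everything else is illustrative.
DUMPS_ROOT = Path("/mnt/dumpsdata/xmldatadumps")


def sync_source_or_none(wiki: str, root: Path = DUMPS_ROOT) -> Optional[Path]:
    """Return the directory to sync for `wiki`, or None if it does not exist.

    Hypothetical helper: a caller can treat None as "nothing to sync"
    instead of letting the sync command fail on a missing directory.
    """
    source = root / wiki
    return source if source.is_dir() else None
```

Inside an Airflow task, a `None` result could be turned into a skipped task (for example by raising `AirflowSkipException`) rather than a failure, so the DAG run stays green for wikis that have no directory.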

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Skip sync tasks for private wikis | repos/data-engineering/airflow-dags!1299 | btullis | skip_private_wiki_dump_syncs | main

Event Timeline

My first thought is that we have a managed list of private wikis available.
Perhaps we could add a new task to cat this list and save it as an xcom, then check against this list before creating a sync task.

www-data@mediawiki-dumps-legacy-toolbox-c465d7598-96d97:~$ cat /srv/mediawiki/dblists/private.dblist 
# NOTE: This file is automatically generated. Do not edit it directly, run 'composer manage-dblist' instead.
advisorswiki
arbcom_cswiki
arbcom_dewiki
arbcom_enwiki
arbcom_fiwiki
arbcom_itwiki
arbcom_nlwiki
arbcom_ruwiki
arbcom_zhwiki
auditcomwiki
boardgovcomwiki
boardwiki
chairwiki
chapcomwiki
checkuserwiki
collabwiki
ecwikimedia
electcomwiki
execwiki
fdcwiki
grantswiki
id_internalwikimedia
iegcomwiki
ilwikimedia
internalwiki
legalteamwiki
movementroleswiki
noboard_chapterswikimedia
officewiki
ombudsmenwiki
otrs_wikiwiki
projectcomwiki
searchcomwiki
spcomwiki
stewardwiki
sysop_itwiki
sysop_plwiki
techconductwiki
transitionteamwiki
u4cwiki
wg_enwiki
wikimaniateamwiki
www-data@mediawiki-dumps-legacy-toolbox-c465d7598-96d97:~$
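The idea of checking against this list before creating sync tasks could look roughly like the sketch below. It assumes the dblist content has already been read into a string (for example, pulled from an XCom); the function names `parse_dblist` and `wikis_to_sync` are hypothetical and not taken from the airflow-dags repository.

```python
from typing import Iterable, List, Set


def parse_dblist(text: str) -> Set[str]:
    """Parse dblist content into a set of wiki names.

    Blank lines and comment lines starting with '#' are ignored,
    matching the format of the private.dblist output shown above.
    """
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names


def wikis_to_sync(all_wikis: Iterable[str], private: Set[str]) -> List[str]:
    """Keep only the wikis that are not in the private list, preserving order."""
    return [wiki for wiki in all_wikis if wiki not in private]


# Shortened sample of the file content, for illustration only.
sample = """# NOTE: This file is automatically generated.
advisorswiki
arbcom_cswiki
"""

private = parse_dblist(sample)
print(wikis_to_sync(["enwiki", "advisorswiki", "dewiki"], private))
# prints ['enwiki', 'dewiki']
```

Sync tasks would then be generated only for the wikis returned by `wikis_to_sync`, leaving the dump tasks themselves untouched.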

Why do we consider private wikis at all? AFAIK, there is no need to dump them?

That's a good question. I suppose it's because skipping them would represent a change in behaviour.
These private wikis are currently dumped, but the files just live on the dumpsdata servers and do not get synced to the clouddumps servers.

Perhaps changing the behaviour so that they are not dumped at all would be a good idea anyway.

I've asked the question in Slack, too.

I'll start by just modifying the DAG to skip the sync jobs, since that represents a no-op to the current dumps behaviour.
If we want to skip the dumping part as well, then I can come back to that.

I don't know of any product reasons to create dumps of the private wikis. We could always be surprised, but for now I agree that they should be skipped.