After switching refreshLinks to the Kafka queue, there has been one instance of an exception that was not observed before (for all of them see https://logstash.wikimedia.org/goto/bb6bf23b741ea699c0d874950005f799):
Exception executing job: refreshLinks Template:API causeAction=edit-page causeAgent=Tgr (WMF) pages={"266691":[0,"Wikibase/API/ja"]} requestId=Wo8QIQpAMD0AAGlEvgUAAABA rootJobSignature=e7ab18aa1dac3512ab187653f487ac00050727c2 rootJobTimestamp=20180222184658 triggeredRecursive=1 : RuntimeException: Could not acquire lock 'LinksUpdate:job:pageid:266691'.
with the following stack trace:
#0 /srv/mediawiki/php-1.31.0-wmf.22/includes/jobqueue/jobs/RefreshLinksJob.php(148): LinksUpdate::acquirePageLock(Wikimedia\Rdbms\DatabaseMysqli, integer, string)
#1 /srv/mediawiki/php-1.31.0-wmf.22/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#2 /srv/mediawiki/php-1.31.0-wmf.22/extensions/EventBus/includes/JobExecutor.php(51): RefreshLinksJob->run()
#3 /srv/mediawiki/rpc/RunSingleJob.php(79): JobExecutor->execute(array)
#4 {main}
Simultaneously with the spike in errors like this, there has been a huge spike in refreshLinks job processing latency: the mean (and p99) went up to 2 minutes. Immediately after these, there was also a bunch of Retry Count Exceeded logs from change-prop.
The rate of refreshLinks jobs on test wikis is very low, so it's not clear whether this is a one-off issue caused by some other outage or a symptom of a larger problem. This needs to be investigated before proceeding with rolling out the refreshLinks job to other wikis.
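
To help reason about the "Could not acquire lock" error, here is a minimal Python sketch of one way such a failure can arise: two jobs for the same page id run at the same time, and the second one gives up after waiting for the per-page lock. This is only an illustration, not a diagnosis. It has not been established that concurrent or duplicate execution is what happened here, and everything in the sketch (the in-process lock registry, `acquire_page_lock`, `run_job`, `simulate`, and the timeout values) is hypothetical and unrelated to the actual MediaWiki code.

```python
import threading

# Hypothetical in-process stand-in for a named lock service. The real lock
# lives elsewhere, and its timeout is not stated in this report.
_locks = {}
_registry_guard = threading.Lock()


def acquire_page_lock(page_id, timeout):
    """Take the per-page lock, or raise if it is still held after `timeout` seconds."""
    key = f"LinksUpdate:job:pageid:{page_id}"
    with _registry_guard:
        lock = _locks.setdefault(key, threading.Lock())
    if not lock.acquire(timeout=timeout):
        raise RuntimeError(f"Could not acquire lock '{key}'.")
    return lock


def run_job(page_id, work_seconds, timeout, results):
    """Simulate one job: take the lock, do some work, release the lock."""
    try:
        lock = acquire_page_lock(page_id, timeout)
    except RuntimeError as exc:
        results.append(str(exc))
        return
    try:
        threading.Event().wait(work_seconds)  # stand-in for the actual update
        results.append("ok")
    finally:
        lock.release()


def simulate(page_id=266691, work_seconds=0.3, timeout=0.05):
    """Run two jobs for the same page at once and return their outcomes, sorted."""
    results = []
    threads = [
        threading.Thread(target=run_job, args=(page_id, work_seconds, timeout, results))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

With the work time longer than the lock timeout, one simulated job succeeds and the other fails with the same message as in the log. If the investigation shows the same job being delivered or retried more than once, that would be consistent with this pattern; if not, the cause lies elsewhere.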