Investigate duplicate IPInfo log entries [8H]
Closed, Resolved · Public · BUG REPORT

Description

What is the problem?

I have seen two instances on testwiki where the IPInfo log records two entries for the same IP, performer, etc., for viewing an infobox within a 24-hour period:

(You will need Administrator or Check User rights to view these.)

I have not found reliable reproduction steps yet.

Curiously, the first example occurred when I had only looked up information for an IP once.

Environment

  • Wiki(s): https://test.wikipedia.org
  • MediaWiki 1.39.0-wmf.4 (94515a4) 01:47, 23 March 2022
  • IP Info 0.0.0 (d28439a) 21:50, 22 March 2022

Acceptance criteria

Either:

  • We understand the problem, and a task is filed (or a fix made, if simple enough)

Or:

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 25 2022, 2:14 PM
ARamirez_WMF renamed this task from Investigate duplicate IPInfo log entries to Investigate duplicate IPInfo log entries [8H]. Apr 5 2022, 4:18 PM

Quick notes so I don't forget:

  • I think, but need to confirm, that these are queued by the job
  • check if the job is retrying
  • maybe jobs are running out of order? E.g. a later one gets added first, which I think (but need to check again) would still allow an earlier log to pass the delay check

This should be scoped to the Logger class (and where it is called from). In particular, Logger::logViewInfobox and Logger::logViewPopup, since these are the log calls that should be debounced (they call Logger::debouncedLog).
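
For orientation, the shape being described is roughly the following. This is a sketch only: the class, method, parameter and job names are assumptions based on this task, not code copied from the extension.

    use MediaWiki\MediaWikiServices;
    use MediaWiki\User\UserIdentity;

    class Logger {
        public function logViewInfobox( UserIdentity $performer, string $ip, int $timestamp ) {
            $this->debouncedLog( $performer, $ip, 'view_infobox', $timestamp );
        }

        public function logViewPopup( UserIdentity $performer, string $ip, int $timestamp ) {
            $this->debouncedLog( $performer, $ip, 'view_popup', $timestamp );
        }

        private function debouncedLog( UserIdentity $performer, string $ip, string $action, int $timestamp ) {
            // The log write itself is deferred to a job; the "debounce"
            // (skip if an equivalent entry already exists within the delay
            // window) is checked when that job runs, not here.
            MediaWikiServices::getInstance()->getJobQueueGroup()->push(
                new JobSpecification( 'ipinfoLogIPInfoAccess', [
                    'performer' => $performer->getName(),
                    'ip' => $ip,
                    'log_action' => $action,
                    'timestamp' => $timestamp,
                ] )
            );
        }
    }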

Change 779018 had a related patch set uploaded (by STran; author: STran):

[mediawiki/extensions/IPInfo@master] Add adhoc logging around logger job

https://gerrit.wikimedia.org/r/779018
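
(In MediaWiki, "adhoc logging" like this is usually a handful of PSR-3 debug() calls on a named channel. A minimal sketch of the pattern, with the channel name assumed; the message strings are the ones that show up in the logstash summary further down this task:)

    use MediaWiki\Logger\LoggerFactory;

    // Channel name 'IPInfo' is an assumption, not taken from the patch.
    $logger = LoggerFactory::getInstance( 'IPInfo' );

    $ip = '192.0.2.1'; // example value only
    $logger->debug( 'Job queued', [ 'ip' => $ip ] ); // where the job is pushed
    $logger->debug( 'Job run', [ 'ip' => $ip ] );    // at the top of the job's run()
    $logger->debug( 'Logging', [ 'ip' => $ip ] );    // just before the duplicate check and insert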

I couldn't reproduce this locally despite breaking my setup in various ways, testing around the job and spamming views. I:

  • spammed a LOT of views
  • forced errors/return false
  • tossed sleeps in a lot of areas around jobs
  • made an attempt to force retries but don't think I did this one right
  • did some +/- on the timestamp

I thought at first it might have been an asynchronous job, but given that the timestamps for both log lines are identical, I am wondering if the job is instead being retried, or run twice somehow. I've added logging in a patch to try and trace the job through its steps.

@STran Thanks for testing all this. Just wondered if you've tried testing these with the versions of core/IPInfo from the task description checked out? Just to see if it was something in an old version that has since been fixed.

@STran @Tchanders One thing I have noticed is that I can reproduce it reliably with IPv6 addresses.

If you are looking up IPInfo for a log entry where the performer and target are an IPv6 address (e.g. an anonymous user creates a User page User:$ipv6_address), you can get duplicate log entries.

For example, user with IP address 2001:8b0:7acf:5d4e:2e09:4dff:fe00:bd2 created the page User:2001:8b0:7acf:5d4e:2e09:4dff:fe00:bd2. In the logs, I see:

[DBQuery] Wikimedia\Rdbms\Database::selectRow [0.004s] unknown: SELECT  *  FROM logging    WHERE log_type = 'ipinfo' AND log_action = 'view_popup' AND log_actor = 1 AND log_namespace = 2 AND log_title = '2001:8b0:7acf:5d4e:2e09:4dff:fe00:bd2' AND (log_timestamp > '20220412091032') AND (log_params LIKE '%ipinfo-view-full%' ESCAPE '`' )  LIMIT 1  
[DBQuery] ManualLogEntry::insert [0s] unknown: INSERT INTO logging (log_type,log_action,log_timestamp,log_actor,log_namespace,log_title,log_page,log_params,log_comment_id) VALUES ('ipinfo','view_popup','20220413091032',1,2,'2001:8b0:7acf:5d4e:2e09:4dff:fe00:bd2',0,'a:1:{s:8:"4::level";s:16:"ipinfo-view-full";}',1)
...
[DBQuery] Wikimedia\Rdbms\Database::selectRow [0s] unknown: SELECT  *  FROM logging    WHERE log_type = 'ipinfo' AND log_action = 'view_popup' AND log_actor = 1 AND log_namespace = 2 AND log_title = '2001:8B0:7ACF:5D4E:2E09:4DFF:FE00:BD2' AND (log_timestamp > '20220412091032') AND (log_params LIKE '%ipinfo-view-full%' ESCAPE '`' )  LIMIT 1  
[DBQuery] ManualLogEntry::insert [0s] unknown: INSERT INTO logging (log_type,log_action,log_timestamp,log_actor,log_namespace,log_title,log_page,log_params,log_comment_id) VALUES ('ipinfo','view_popup','20220413091032',1,2,'2001:8B0:7ACF:5D4E:2E09:4DFF:FE00:BD2',0,'a:1:{s:8:"4::level";s:16:"ipinfo-view-full";}',1)

Notice that the first is looking up '2001:8b0:7acf:5d4e:2e09:4dff:fe00:bd2' (lowercase) and the second is looking up '2001:8B0:7ACF:5D4E:2E09:4DFF:FE00:BD2' (uppercase).

But I guess this is a separate bug, as the duplicate entries on testwiki in the description are both IPv4 addresses. I have raised T306013.
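
For what it's worth, one plausible fix direction for this particular mismatch is to normalize the address before it is written as log_title or used in the debounce query, e.g. with IPUtils from the wikimedia/ip-utils library. This is only an illustrative sketch, not the actual fix:

    use Wikimedia\IPUtils;

    // IPUtils::sanitizeIP() uppercases IPv6 addresses and expands "::"
    // abbreviations, so both forms seen above normalize to the same string.
    $raw = '2001:8b0:7acf:5d4e:2e09:4dff:fe00:bd2';
    $normalized = IPUtils::sanitizeIP( $raw ); // e.g. '2001:8B0:7ACF:5D4E:2E09:4DFF:FE00:BD2'

    // Using the normalized form consistently, both for log_title and in the
    // WHERE clause of the debounce SELECT, would make the two requests above
    // hit the same row. (IPv4 leading zeroes may need separate handling.)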

@Tchanders brought up that IPv4 can have leading zeroes. Leaving a comment here so I can remember to investigate that as well.

Just wondered if you've tried testing these with the versions of core/IPInfo from the task description checked out?

I was on the most up-to-date version, so I'm wondering if the timestamp fix also fixed this issue.

I'm in favour of adding the logging, until we deploy to more production sites. I've made a deployment-blocking task for removing the logging: T306212

@Niharika Are we OK to log the IP, as discussed on gerrit? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/IPInfo/+/779018

This sounds good to me.

Change 779018 merged by jenkins-bot:

[mediawiki/extensions/IPInfo@master] Add adhoc logging around logger job

https://gerrit.wikimedia.org/r/779018

The debug logging was switched off (see T306212). It was live for a couple of weeks, during which time, on Special:Log/ipinfo at testwiki (https://test.wikipedia.org/wiki/Special:Log/ipinfo):

  • There were 29 log entries
  • There were no duplicates

The data in logstash (https://logstash.wikimedia.org/goto/633b1d65c62f6731ba0fa1feed73bf09) show:

  • Logs starting from 3 May
  • 32 debug log lines for "Job queued"
  • 49 debug log lines for "Job run"
  • 49 debug log lines for "Logging" <-- note that this comes before the check for an existing log entry in the logging table, so doesn't always result in a new entry

Discrepancies:

  • 29 log items added, but 32 jobs enqueued. That's fine, since we check whether a similar log item already exists after queuing a job but before adding a new log item (roughly as in the sketch below)
  • 32 jobs queued, but 49 jobs run. This seems odd and is worth investigating
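
That ordering (queue unconditionally, check for an existing entry only when the job runs) would look roughly like the sketch below. The table and column values echo the queries quoted earlier in this task, but the job class itself is illustrative, not the extension's actual code:

    use MediaWiki\MediaWikiServices;

    class LogIPInfoAccessJob extends Job {
        public function __construct( array $params ) {
            parent::__construct( 'ipinfoLogIPInfoAccess', $params );
        }

        public function run() {
            $services = MediaWikiServices::getInstance();
            $performer = $services->getUserFactory()->newFromName( $this->params['performer'] );
            if ( !$performer ) {
                // Returning false marks the job as failed; with retries
                // allowed (the Job.php default discussed below), it can
                // then be re-run.
                return false;
            }

            // The "Logging" debug line was emitted around this point, i.e.
            // before the existence check, which is why 49 "Logging" lines can
            // still correspond to only 29 actual log entries.
            $dbw = $services->getDBLoadBalancer()->getConnection( DB_PRIMARY );
            $target = Title::makeTitle( NS_USER, $this->params['ip'] );
            $cutoff = $dbw->timestamp( (int)$this->params['timestamp'] - 86400 );

            $existing = $dbw->selectRow(
                'logging',
                '*',
                [
                    'log_type' => 'ipinfo',
                    'log_action' => 'view_popup',
                    'log_namespace' => NS_USER,
                    'log_title' => $target->getDBkey(),
                    'log_timestamp > ' . $dbw->addQuotes( $cutoff ),
                ],
                __METHOD__
            );
            if ( $existing ) {
                // Debounced: an equivalent entry already exists in the window.
                return true;
            }

            $entry = new ManualLogEntry( 'ipinfo', 'view_popup' );
            $entry->setPerformer( $performer );
            $entry->setTarget( $target );
            $entry->setParameters( [ '4::level' => 'ipinfo-view-full' ] );
            $entry->insert( $dbw );
            return true;
        }
    }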

I went through every set of logs by IP and sorted out the ones with a mismatch in queued/run/logging:

ipv4:
https://logstash.wikimedia.org/goto/a73b3e512dc0b3cf8b7ae8434486d11e
https://logstash.wikimedia.org/goto/4aead9120c3fb3cea7ac95cce62d332c

ipv6:
https://logstash.wikimedia.org/goto/fa8537c680416e188908439c10d2ecc4
https://logstash.wikimedia.org/goto/4aead9120c3fb3cea7ac95cce62d332c
https://logstash.wikimedia.org/goto/6611b4985e702b9005b390f47d0fa16d

In particular, https://logstash.wikimedia.org/goto/fa8537c680416e188908439c10d2ecc4 was queued up 3 times and spawned a total of 14 jobs. That seems to account for the bulk of the mismatches. Additionally, if you sort by time, we appear to be missing a "Job queued" log line. I'm really not sure what happened here with the missing log line (the other examples all show the expected order of queued -> run -> log, e.g. https://logstash.wikimedia.org/goto/846a2215e9161a9fafa48fdc1fb57994). It happens again here: https://logstash.wikimedia.org/goto/c226a1e4eb430f3b96f29387ab18d4d0, although this one only re-runs the job once.

If you ignore the 14-job IP, each job in every other anomaly only gets re-run once. I suspect that's because jobs are retryable by default, per Job.php:

	/**
	 * @inheritDoc
	 * @stable to override
	 */
	public function allowRetries() {
		return true;
	}
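
(For completeness: a job subclass can opt out of that default by overriding the method, e.g. in the sketch job class from an earlier comment. As concluded below, keeping retries is actually desirable here, so this is illustration only.)

    class LogIPInfoAccessJob extends Job {
        /**
         * @inheritDoc
         */
        public function allowRetries() {
            // Opt out of the default retry behaviour shown above; not what we
            // want for IPInfo, since we do want the log to eventually land.
            return false;
        }
    }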

It's a little unfortunate that the "Logging" line didn't run after the log was actually inserted. I wonder if we would have seen duplicates after the log line was written, or if the log line wasn't written in the first job and that's why it kicked off another job (based on what I can tell from skimming JobRunner.php). Each "duplicate" runs ~1.5 minutes after what I believe to be its parent. I was unable to find the errors I would expect (either Invalid performer or Invalid dataContext, based on where we return false), but I may be bad at searching the dashboards.

Ultimately, from what I can tell:

  • the discrepancy mostly arose from a one-off instance (maybe a server issue at the time? Dom reported that the log lines weren't showing up for whatever reason (https://phabricator.wikimedia.org/T306013#7902158) and suggested it was a problem with beta at the time)
  • some logs were dropped but the jobs still appeared to run (I have no idea; maybe the same issue? It happened around the same time)
  • the rest seem to be requeues, which are allowed

Given that 1. there are no duplicate log lines, 2. we want to allow retries (because we definitely want the logs), and 3. the bulk of the issue arose from a documented incident (allegedly temporary, and I didn't see any follow-up tasks suggesting it was our fault), I'm going to say the oddity was a fluke and there's nothing actionable.

Sounds like this is working as expected: jobs can be re-run, either if unsuccessful or for some other reason (e.g. the user looks at the same item twice), but the actual logging is being debounced.

Thanks for the summary, @STran. (And for helpfully finding the match with T306013#7902158.) We can look again at the analysis here if this bug recurs enough to be problematic in the future. In the meantime I think we can consider this task done.