
[SPIKE] Investigate getting global contribution information for IP Info [8H]
Closed, ResolvedPublicSpike

Description

From T270319: Contribs info in the popup [M], getting the global number of edits is more complicated than getting the local numbers and that work should be split up so as to not block the other contribution information going in. There are some outstanding questions about the task:

  • At a high level, "global edits" refer to the sum of all of a user's edits across wikis
  • Is there an existing best-practice way to gather that sort of information from every wiki? There's some precedent, but for reasons I'm not specifically aware of, it can't be directly translated here.
  • What's the performance hit of gathering this data?
  • How up to date does this information have to be? Can we cache it?
Open Questions

How accurate does this data have to be?

  • Does it have to be up to date?
  • What if the user doesn't have the right to see "deleted revisions?"
  • What if [there's a database error and] a wiki doesn't return data?

All of these add +/-1 to the accuracy of a "global" count.

Event Timeline

phuedx renamed this task from Investigate getting global contribution information for IP Info to [SPIKE] Investigate getting global contribution information for IP Info.Oct 6 2021, 10:42 AM
phuedx added a project: Spike.
Restricted Application changed the subtype of this task from "Task" to "Spike". Oct 6 2021, 10:42 AM

A couple of tools that do cross-wiki IP contribution searches might be helpful:

There does not appear to be a good API for getting an IP's cross-wiki activity; with users you can use CentralAuth but for IPs I believe the only way is to fetch the wiki list and do a for loop over them.

ARamirez_WMF renamed this task from [SPIKE] Investigate getting global contribution information for IP Info to [SPIKE] Investigate getting global contribution information for IP Info [8H].Oct 6 2021, 4:06 PM

tl;dr: I think doing what Danny did in his FindIPActivity.js script but in PHP is the way to go. Thanks to @GeneralNotability for providing these references!

I looked into CentralAuth but agree that it can't be used for what we want. From what I can glean from the docs and from skimming some uses of CentralAuth (SecurePoll), it's user-based and therefore won't have the IP info we need. Given that, I focused my investigation on ways we can do a loop and count. Danny's script is well commented and I highly recommend reading through it as well.

I made some notes on a possible implementation. As it turns out, it was shockingly annoying to get a local wiki farm going, so this is done to the best of my knowledge. 🙈

All of this presumably lives in ContributionInfoRetriever.php

Get list of wikis

  1. WikiMap

https://doc.wikimedia.org/mediawiki-core/master/php/classWikiMap.html

  2. SiteMatrix

Danny's script uses SiteMatrix, which from what I can tell is WMF-specific:
https://meta.wikimedia.org/w/api.php?action=sitematrix
https://www.mediawiki.org/wiki/Extension:SiteMatrix

I don't know which is a more canonical/preferred list. SiteMatrix seems to require some special handling of edge wikis whereas WikiMap appears to Just Work (fwiw SecurePoll uses WikiMap).
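
For illustration, here's a minimal sketch of pulling the list from WikiMap. This is a guess at the right call, assuming every wiki in the farm is registered in $wgConf/$wgLocalDatabases (which is what WikiMap reads):

// A minimal sketch; assumes the farm's wikis are all registered in $wgConf.
$wikiIds = array_keys( WikiMap::getCanonicalServerInfoForAllWikis() );
// e.g. [ 'enwiki', 'dewiki', 'metawiki', ... ]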

Get edit count

  1. ApiQueryAllRevisions and ApiQueryAllDeletedRevisions

https://doc.wikimedia.org/mediawiki-core/master/php/classApiQueryAllRevisions.html
https://doc.wikimedia.org/mediawiki-core/master/php/classApiQueryAllDeletedRevisions.html

  2. Manually*

From WikiMap, we have access to the DBDomain, which can be passed to getConnection(): https://doc.wikimedia.org/mediawiki-core/master/php/interfaceWikimedia_1_1Rdbms_1_1ILoadBalancer.html#a14b388a0093e7446f741ebe5254c94c0

It's worth reading through ApiQueryAllRevisions and ApiQueryAllDeletedRevisions to see what cases they handle but since we don't need details on the edits, it might be possible for us to write a custom COUNT query. For a local example, ContributionInfo currently uses the following to get the count of local edits:

		// Count this IP's edits recorded in ip_changes over the last 24 hours.
		$numRecentEdits = $this->database->selectRowCount(
			'ip_changes',
			'*',
			[
				'ipc_hex' => $hexIP,
				'ipc_rev_timestamp > ' . $this->database->addQuotes( $this->database->timestamp( $oneDayTS ) ),
			],
			__METHOD__
		);

Could this same query be run on all other wikis? That's a lot of db connections though...

Additionally, seeing deleted revisions isn't a guaranteed right so in either case, whatever we choose will have to manage the possibility of (probably silently?) failing those requests. Does the concept of a deleted revision exist in the ip_changes table? Do we care about those? Or can we get by with just aggregating every wiki's ip_changes counts?

*I don't know how cross-wiki permissions work. If I have the right to see deleted revisions on my main wiki, does that apply to other wikis?
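
Putting the pieces above together, here's a rough, untested sketch of the loop-and-count approach: wiki list from WikiMap, the same ip_changes count that ContributionInfo already runs locally, and no deleted revisions. Whether getMainLB()/getConnection() can be pointed at every foreign wiki like this is exactly the part that needs verifying on a real farm:

use MediaWiki\MediaWikiServices;
use Wikimedia\IPUtils;

$lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
$hexIP = IPUtils::toHex( $ip );
$globalEdits = 0;

foreach ( array_keys( WikiMap::getCanonicalServerInfoForAllWikis() ) as $wikiId ) {
	// Replica connection to the target wiki's database.
	$dbr = $lbFactory->getMainLB( $wikiId )->getConnection( DB_REPLICA, [], $wikiId );
	// Same shape as the local ContributionInfo count, minus the 24-hour window.
	$globalEdits += $dbr->selectRowCount(
		'ip_changes',
		'*',
		[ 'ipc_hex' => $hexIP ],
		__METHOD__
	);
}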

Summing

+=

  • It might be more interesting to talk about error handling here. Since it's just a summed number, what happens if a request to a wiki fails? Not sure how often that happens but is it worth still showing the number if most wikis return values?

Caching

Depending on how IP Info is used (and the patterns of vandalism), it might be good to consider whether we can cache some of this information. It's of course possible to check on a rolling 24-hour time range, but that would mean every time someone wants to check an IP, the API has to make a call to... at least 460 or so wikis (a very quick ballpark based on counting how many languages there are)? SiteMatrix returns 977 wikis. That adds up.
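
If caching turns out to be acceptable, core's WANObjectCache would be the obvious place to start. A minimal sketch, where the key name, the 24-hour TTL and the countGlobalEdits() helper are all illustrative assumptions rather than anything decided:

use MediaWiki\MediaWikiServices;

$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$globalEdits = $cache->getWithSetCallback(
	// Global key so every wiki serving IP Info shares the cached value.
	$cache->makeGlobalKey( 'ipinfo-global-edits', $hexIP ),
	$cache::TTL_DAY,
	function () use ( $hexIP ) {
		// The expensive part: loop over all wikis and sum the counts.
		return $this->countGlobalEdits( $hexIP ); // hypothetical helper
	}
);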


Some notes for me (or someone else to follow up on):

  • I mentioned it in the ticket and a lot in here, but I think someone else needs to help gauge what the performance cost of doing this is. It seems like a lot, and hitting up all the wikis has been problematic performance-wise in the past (looking at you, SecurePoll).
  • Someone should figure out and own what happens in less-than-ideal conditions (e.g. one wiki throws an error for whatever reason but 459 have returned good responses).
  • How is editing information used? Is daily the important metric? Weekly? Knowing that helps determine if caching is something we can look into more.
  • Can we safely cache this information in a store of some kind for other users of the tool? Does it pose some sort of information-leak risk? Where would we store it?

cc @Niharika @Prtksxna We discussed this in engineering and think there are a few design questions that need to be answered before we move forward w/any technical implementation:

How accurate does this data have to be?

  • Does it have to be up to date?
  • What if the user doesn't have the right to see "deleted revisions?"
  • What if a wiki doesn't return data?

All of these add +/-1 to the accuracy of a "global" count.

How quickly does this data have to be returned?
It will take time to ask for and sum up all these numbers. It's reasonable to expect this to take some time, especially if it happens to be on demand. It's also very possible we'll hit a technical limitation on doing this on-load just like we did w/SecurePoll. If we have to make it a job, is that acceptable?

Would it be possible to cache the result?
As alluded to earlier, some users could possibly see different numbers based on their permissions. Would it be possible to save an "official" number somewhere and update it as necessary? Perhaps when a user re-requests the information? That would save some time on load.

How accurate does this data have to be?

  • Does it have to be up to date?

It doesn't need to be live. We can figure out what freshness of data is acceptable and then show that (along with text explaining that the data might be a little old).

  • What if the user doesn't have the right to see "deleted revisions?"

Does this mean that different users will see different numbers based on their permissions? Would it make sense to not count deleted revisions in this case?

  • What if a wiki doesn't return data?

I am not sure how this is specific to global edits, sorry! Could you elaborate on what you're asking here?

It will take time to ask for and sum up all these numbers. It's reasonable to expect this to take some time, especially if it happens to be on demand. It's also very possible we'll hit a technical limitation on doing this on-load just like we did w/SecurePoll. If we have to make it a job, is that acceptable?

If the data points in the box and the popup are loaded separately and the global contribs won't block the other information, then I think it's ok if it takes more time. Do we have an idea of how much more time this could be? Seconds? Minutes?

Would it be possible to cache the result?

If the load time can be reduced, this sounds like a great idea. Let's talk more about the deleted revisions issue.

If we have to make it a job, is that acceptable?

AIUI the job queue isn't meant to be used for tasks that should be performed at regular intervals. Instead, we should create a maintenance script that's invoked as a cron job. We even have a Puppet module to simplify doing this, for which there are plenty of examples: https://codesearch.wmcloud.org/puppet/?q=mediawiki%3A%3Amaintenance
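
For reference, such a script would follow the usual Maintenance boilerplate. The file name, class name and description below are hypothetical, and the actual counting/writing is left as a stub:

<?php
// maintenance/updateGlobalIpEdits.php (hypothetical name; a sketch only)

$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";

class UpdateGlobalIpEdits extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Recount global IP edits and write them to the results table' );
	}

	public function execute() {
		// Loop over all wikis, sum the per-wiki ip_changes counts, and write
		// the results to the shared table discussed above (left as a stub).
	}
}

$maintClass = UpdateGlobalIpEdits::class;
require_once RUN_MAINTENANCE_IF_MAIN;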

Would it be possible to cache the result?
How quickly does this data have to be returned?

If we build this as a maintenance script that writes its results into a table, then these questions are answered (yes and this isn't an issue).

Regarding the results table, we'd need to select an appropriate wiki to write to/read from. Other extensions have used wikishared but metawiki might also be acceptable.

Edit

metawiki wouldn't be acceptable because anons could edit it…

What if a wiki doesn't return data?

We do have a Logstash dashboard for tracking errors connecting to a database. Looking at the last 7 days, there were ~200 errors/hour and in the last 30 days, there were ~600 errors/hour. In both cases, there were one or two days with huge error spikes that were pronounced over a relatively much lower error rate. All of this is to say that this is absolutely a case that we need to consider.

I think we have two options:

  1. We could not capture the error and let the maintenance script fail, leaving the data stale for $interval hours; or
  2. We could capture the error, log it, and not update the data in the results table, leaving the data incomplete for $interval hours

In either case, we should probably add messaging to the UI that the data might be inaccurate.

If we build this as a maintenance script that writes its results into a table, then these questions are answered (yes and this isn't an issue).

Great! Let's go with this. As we discussed during stand-up, it's okay to run this script once every day, so let's go with that cadence.

  1. We could not capture the error and let the maintenance script fail, leaving the data stale for $interval hours; or
  2. We could capture the error, log it, and not update the data in the results table, leaving the data incomplete for $interval hours

In either case, we should probably add messaging to the UI that the data might be inaccurate.

I think it's better to have stale data (1) rather than no data at all (2), as long as we can correctly capture and show that the data is N hours old.

  • What if the user doesn't have the right to see "deleted revisions?"

Would it be possible to not count deleted revisions for all users (whether or not they can see them)? It's better if everyone with the feature sees the same data.

@Niharika to add notes from today's AHT Design check-in meeting.

Some notes from meeting:

  • We should keep the data stale if we don't get data from a wiki
  • We want to show an indicator for how old the data is ("Updated X hours ago")
  • On the first load of the data for an IP, continue to show “Not available” for the number (rather than a loading indicator) until we get the data.
  • How long do we want to keep the rows in the table?
    • Concerns with the cron job taking a long time
    • Can we automatically capture it every time a new IP makes an edit/logged action? It is likely for them to be checked during patrolling.
    • We can store data for an IP for 30 days since the last time they made an edit or the last time they were checked with the tool.

@Prtksxna @STran I captured some notes from the meeting. I was not religious with my note taking so this is missing a lot of details - please feel free to add more information.

@phuedx We wanted your thoughts on these, particularly the bolded question.

  • How long do we want to keep the rows in the table?

We're discussing a month to 90 days over in T293263: [SPIKE] Investigate getting number of past block for IP (including IP ranges) [8H].

Alternatively, we could query the recentchanges table, which only holds data about changes to the wiki from the last $wgRCMaxAge seconds (30 days in production). If the maintenance script fetches all edits from IPs from all recentchanges tables and updates the table, then the longest we'll hold information about an IP is $wgRCMaxAge + 24 hours.

  • Concerns with the cron job taking a long time

I think this is governed by how fast the queries are to pull out the numbers of edits by IP per wiki. While there are approaching 1,000 wikis, the vast majority don't have an edit rate anywhere near the top wikis, so I'd expect the queries on wikis outside the largest (by number of active editors) to return fast.

select
  actor.actor_name,
  count(*)
from
  recentchanges
  join actor
    on recentchanges.rc_actor = actor.actor_id
where
  ( recentchanges.rc_type = 0 or recentchanges.rc_type = 1 ) -- RC_EDIT = 0, RC_NEW = 1
  and actor.actor_user is null
group by
  actor.actor_name
;

should be fast enough.

@Ladsgroup: I would greatly appreciate your input on this.

  • Can we automatically capture it every time a new IP makes an edit/logged action? It is likely for them to be checked during patrolling.

We could but that's different from the idea of having a maintenance script that runs periodically. Is this a hard requirement?

Moving this to Done: Q2 2021-22. @ARamirez_WMF: You mentioned that you were going to talk to @Niharika about this.

  • How long do we want to keep the rows in the table?

We're discussing a month to 90 days over in T293263: [SPIKE] Investigate getting number of past block for IP (including IP ranges) [8H].

Alternatively, we could query the recentchanges table, which only holds data about changes to the wiki from the last $wgRCMaxAge seconds (30 days in production). If the maintenance script fetches all edits from IPs from all recentchanges tables and updates the table, then the longest we'll hold information about an IP is $wgRCMaxAge + 24 hours.

  • Concerns with the cron job taking a long time

I think this is governed by how fast the queries are to pull out the numbers of edits by IP per wiki. While there are approaching 1,000 wikis, the vast majority don't have an edit rate anywhere near the top wikis, so I'd expect the queries on wikis outside the largest (by number of active editors) to return fast.

select
  actor.actor_name,
  count(*)
from
  recentchanges
  join actor
    on recentchanges.rc_actor = actor.actor_id
where
  ( recentchanges.rc_type = 0 or recentchanges.rc_type = 1 ) -- RC_EDIT = 0, RC_NEW = 1
  and actor.actor_user is null
group by
  actor.actor_name
;

should be fast enough.

@Ladsgroup: I would greatly appreciate your input on this.

I'm slightly unsure about this. I'm doing a bit of speculation here, but the problem with that query is that it's quite unpredictable depending on the wiki and time. So having a regular (minutely?) cron (we have deprecated crons in production, using systemd timers now, but it's mostly semantics) might cause systems to choke when it comes to enwiki. The other problem is that while it's going to be fast, you still have to open a connection to that wiki, run the query, and close it; it keeps a connection open and there is also a non-zero chance of network faults, etc. If you want to do this minutely over 1k wikis, I think it's going to get more complicated than expected. Extra systemd timers need operational maintenance, etc.

  • Can we automatically capture it every time a new IP makes an edit/logged action? It is likely for them to be checked during patrolling.

We could but that's different from the idea of having a maintenance script that runs periodically. Is this a hard requirement?

With some modifications, I think the idea of push instead of pull is much more performant db-wise.

  • Make it a job so it doesn't slow down the edit (simply hook into edit hooks in core; if it's an IP edit, just queue a job).
    • That would also keep the table more up to date than the cron (the job run time is usually under a minute).
  • As long as rows are immutable and you don't modify them afterwards, inserts are quite cheap (that goes back to how MVCC works).

Regarding purge, just put an index on edit time and it'll be quite fast.
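
For instance, something like the following (the table name, column name and cutoff timestamp are hypothetical placeholders; the cutoff is whatever retention window we settle on):

-- one-off: index on the edit timestamp so the purge can use it
alter table ipinfo_ip_changes add index ipc_timestamp (ipc_timestamp);

-- run by the purge job: drop rows older than the retention window
delete from ipinfo_ip_changes
where ipc_timestamp < '20211101000000';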

Keep me in the loop about this. Thank you for the great work; looking forward to it!

That seems like a much neater solution, @Ladsgroup, which, as far as I can tell:

  • Makes the data as fresh as possible
  • Keeps database query load low
  • Is less operationally complex

If you don't mind, I'd like to restate your suggestion to make sure that I've understood it correctly:

  • We define a table, e.g. ipinfo_ip_changes, which has a schema like:
    • ipc_id: int(10) unsigned (required, automatically incrementing, primary key)
    • ipc_ip_hex: varbinary(35) (required)
      • This is subject to your thoughts on how to best store IP addresses (per T293263#7516229).
    • ipc_timestamp: binary(14) (required)
  • We create a PageSaveComplete hook handler that:
    • Checks if the editor is logged out; and, if so
    • Queues a job that inserts a row into/purges rows that are older than X seconds from the ipinfo_ip_changes table in the wikishared database
  • In the IP Info RESTful API, we calculate the number of global edits for a specific IP address with the following query:
// PSEUDOCODE!!1

$ipHex = IPUtils::toHex( $ip );
MediaWikiServices::getInstance()
  ->getDBLoadBalancerFactory()
  ->getMainLB( 'wikishared' )
  ->getConnection( ILoadBalancer::DB_REPLICA, [], 'wikishared' )
  ->selectRowCount(
    'ipinfo_ip_changes', // table first, then the field to count
    '*',
    [
      'ipc_ip_hex' => $ipHex,
    ],
    __METHOD__
  );
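
For completeness, here's a rough sketch of the hook-handler half described above. The PageSaveComplete parameters are core's, but the class name, job name and job parameters are illustrative only:

// Also pseudocode-ish; registered via the PageSaveComplete hook in extension.json.
use Wikimedia\IPUtils;

class IPInfoHooks {
	public static function onPageSaveComplete( $wikiPage, $user, $summary, $flags, $revisionRecord, $editResult ) {
		// Only anonymous (IP) edits are of interest here.
		if ( $user->isRegistered() ) {
			return;
		}

		// Hypothetical job name/params; the job would insert a row into (and
		// purge old rows from) ipinfo_ip_changes on wikishared.
		JobQueueGroup::singleton()->push( new JobSpecification(
			'ipInfoRecordGlobalEdit',
			[
				'ipc_ip_hex' => IPUtils::toHex( $user->getName() ),
				'ipc_timestamp' => wfTimestampNow(),
			]
		) );
	}
}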

If this is acceptable, then I'll create tasks for Anti-Harassment to work on as soon as possible and CC you on them for visibility.

  • This is subject to your thoughts on how to best store IP addresses (per T293263#7516229).

Yes. Sorry it took me so long to finish the investigation. That was way more complicated than I thought. Here are my findings:

It has been recommended in lots of different places to use INT UNSIGNED for IPv4 and BINARY(16) for IPv6. That possibly means we need two tables, one for IPv4 and one for IPv6, but that's okay given that there is no case where you would want to query both tables at the same time (correct?). Here are some useful links:

Basically there are some functions that can make the IP much smaller: https://www.php.net/manual/en/function.ip2long.php turns IPv4 into a long integer and https://www.php.net/manual/en/function.inet-pton.php turns it into a binary that can be saved in the database (and supports IPv6). Using these two would make the storage much smaller than the current hex storage (ip_changes is quite big on some wikis, so if you want to sum them up from all wikis forever, that's going to be complicated). It will be at most half the size, and likely closer to 10% of the hex size.
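
For illustration, the round trip in PHP and the resulting sizes (bin2hex() here is only for readable output; the raw binary is what would actually be stored):

$v4 = inet_pton( '198.51.100.7' );            // 4 bytes vs the 8-character hex "C6336407"
$v6 = inet_pton( '2001:db8::8a2e:370:7334' ); // 16 bytes vs up to 35 characters of MediaWiki's hex form

echo strlen( $v4 );      // 4
echo strlen( $v6 );      // 16
echo bin2hex( $v6 );     // readable form for debugging
echo inet_ntop( $v6 );   // back to "2001:db8::8a2e:370:7334"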

But here are the cons I could think of:

  • We need to write the code that converts them back and forth (it's not hard though)
    • i.e. It's nice we have the code written for hex already.
  • Debugging can get hard. I tried var_dumping the inet_pton output and it was an empty string; I assume it'll look like gibberish in the database
    • Looking like gibberish in the database makes it harder for people on Cloud/Toolforge to build tools on top of it
  • I haven't tested whether these binaries would work with our database abstractions; they probably would, but it needs testing
  • There is a problem of portability: MediaWiki supports SQLite and Postgres too, and we need a way to store them there as well.
    • If it's supposed to be a separate extension, you can simply not care about other RDBMSes
  • It also needs testing whether/how MySQL would be able to handle CIDR queries (assuming BETWEEN); a sketch follows below
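
To make the last point concrete, a CIDR lookup over the binary form would presumably look something like this (MySQL-only functions; the ipinfo_ipv6_changes table and ipc_ip column are hypothetical):

-- count edits from 2001:db8::/112 by bounding the binary column
select count(*)
from ipinfo_ipv6_changes
where ipc_ip between inet6_aton('2001:db8::')
                 and inet6_aton('2001:db8::ffff');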

So TLDR is that if you want to store it for a short period of time, even three months, just go with Hex; it's simpler, ready, etc. But if you want to keep it forever, then we probably should look into other methods of storing them (likely two tables, one with int unsigned for v4 and one with binary(16) for v6). I'm adding @tstarling, who originally developed the hex representation for IP storage in this commit in 2005. There might be some valid reasons for choosing the hex representation of IPs for storage that I'm not aware of, or some reasons that were valid in 2005 but not anymore.

This actually got me thinking: do we need to keep all ip_changes entries forever? Why not purge them after three months or a year? I think the main use case for them is finding vandalism, and looking at CIDR ranges for really old data is useless. This is also more aligned with the data retention policies.

(Unrelated to the IP storage issue)
Do you want to add a column for wiki and revid, so people could find the actual edit?

Sorry it took me so long. This week has been quite messy.

So TLDR is that if you want to store it for a short period of time, even three months, just go with Hex; it's simpler, ready, etc. But if you want to keep it forever, then we probably should look into other methods of storing them (likely two tables, one with int unsigned for v4 and one with binary(16) for v6). I'm adding @tstarling, who originally developed the hex representation for IP storage in this commit in 2005. There might be some valid reasons for choosing the hex representation of IPs for storage that I'm not aware of, or some reasons that were valid in 2005 but not anymore.

This actually got me thinking: do we need to keep all ip_changes entries forever? Why not purge them after three months or a year? I think the main use case for them is finding vandalism, and looking at CIDR ranges for really old data is useless. This is also more aligned with the data retention policies.

Firstly, thank you for all of the detail included in that comment.

@Niharika: With @Ladsgroup's comment in mind, would counting global IP edits over a time window of, say, 90 days (the proposed time window for tracking IP blocks in T293263: [SPIKE] Investigate getting number of past block for IP (including IP ranges) [8H]) be acceptable?

(Unrelated to the IP storage issue)
Do you want to add a column for wiki and revid, so people could find the actual edit?

AIUI this table will only ever be used to show the number of global edits by an IP. @Niharika: Do you foresee a need for tracking the wiki and revid along with the IP address?

Being bold and moving this back into Blocked/Stalled for visibility.

So TLDR is that if you want to store it for a short period of time, even three months, just go with Hex; it's simpler, ready, etc. But if you want to keep it forever, then we probably should look into other methods of storing them (likely two tables, one with int unsigned for v4 and one with binary(16) for v6). I'm adding @tstarling, who originally developed the hex representation for IP storage in this commit in 2005. There might be some valid reasons for choosing the hex representation of IPs for storage that I'm not aware of, or some reasons that were valid in 2005 but not anymore.

This actually got me thinking: do we need to keep all ip_changes entries forever? Why not purge them after three months or a year? I think the main use case for them is finding vandalism, and looking at CIDR ranges for really old data is useless. This is also more aligned with the data retention policies.

Firstly, thank you for all of the detail included in that comment.

@Niharika: With @Ladsgroup's comment in mind, would counting global IP edits over a time window of, say, 90 days (the proposed time window for tracking IP blocks in T293263: [SPIKE] Investigate getting number of past block for IP (including IP ranges) [8H]) be acceptable?

The number of global edits an account has made is representative of how active and experienced that editor is. With the proposed solution, it sounds like if I have been editing for many years but haven't edited recently, it would show nil global edits for my IP. I am not sure if this number will actually be useful for a patroller. @Prtksxna what do you think?

(Unrelated to the IP storage issue)
Do you want to add a column for wiki and revid, so people could find the actual edit?

AIUI this table will only ever be used to show the number of global edits by an IP. @Niharika: Do you foresee a need for tracking the wiki and revid along with the IP address?

I don't think we would want to link to all of the actual edits at any point. @Prtksxna do you agree?

Random note: except for a very small portion of IPs, most of them get reassigned after several months. With mobile providers it's even worse, and they can get reassigned daily. Most reassigned IPs don't end up in the same range either.

The number of global edits an account has made is representative of how active and experienced that editor is. With the proposed solution, it sounds like if I have been editing for many years but haven't edited recently, it would show nil global edits for my IP. I am not sure if this number will actually be useful for a patroller. @Prtksxna what do you think?

I think it is useful (maybe even more useful than all-time edits). As @Ladsgroup points out, there can only be very few long-term editors, and even they could at some point lose their IPs. Since IPs move so quickly, I am now thinking of this number more as "what has this IP been up to recently".

I don't think we would want to link to all of the actual edits at any point. @Prtksxna do you agree?

I agree, we don't need to track that.

The number of global edits an account has made is representative of how active and experienced that editor is. With the proposed solution, it sounds like if I have been editing for many years but haven't edited recently, it would show nil global edits for my IP. I am not sure if this number will actually be useful for a patroller. @Prtksxna what do you think?

I think it is useful (maybe even more useful than all-time edits). As @Ladsgroup points out, there can only be very few long-term editors, and even they could at some point lose their IPs. Since IPs move so quickly, I am now thinking of this number more as "what has this IP been up to recently".

I don't think we would want to link to all of the actual edits at any point. @Prtksxna do you agree?

I agree, we don't need to track that.

Sounds good to me!

Thank you all for your time and comments. I think this task is now done and so I'm being bold and moving this to Done: Q2 2021-22. We now need to create tasks based on my comment above (T292623#7530870), making sure that the time after which we purge rows is configurable so that we can change it as necessary.

Niharika claimed this task.