
Log source port for anonymous users and expose it for sysops/checkusers
Closed, Resolved (Public)

Description

Some ISPs deploy carrier-grade NAT (CGN), which relies on the port in addition to the IP address to identify the client connection.
With this design, when sysops report trolls and vandals to ISPs and provide the specific IP and the time of the edit, that isn't always enough to identify the specific client, so it is important to also provide the source port in such cases.
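
To illustrate why IP and time alone can be ambiguous under CGN, here is a minimal sketch. The `NatMapping` record, the `find_subscribers` helper, and the sample log are all hypothetical; real ISP translation logs will differ in format and granularity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NatMapping:
    """One entry in a hypothetical CGN translation log kept by an ISP."""
    public_ip: str
    public_port: int
    start: int        # Unix seconds, inclusive
    end: int          # Unix seconds, exclusive
    subscriber: str


def find_subscribers(log: List[NatMapping], public_ip: str, when: int,
                     public_port: Optional[int] = None) -> List[str]:
    """Return the subscribers whose mappings match the reported details."""
    matches = set()
    for m in log:
        if m.public_ip != public_ip or not (m.start <= when < m.end):
            continue
        # Without a port, every subscriber sharing the IP at that time matches.
        if public_port is not None and m.public_port != public_port:
            continue
        matches.add(m.subscriber)
    return sorted(matches)


# Two subscribers share one public IP during the same time window.
LOG = [
    NatMapping("198.51.100.7", 40001, 1000, 2000, "subscriber-a"),
    NatMapping("198.51.100.7", 40002, 1000, 2000, "subscriber-b"),
]

by_ip_only = find_subscribers(LOG, "198.51.100.7", 1500)
by_ip_and_port = find_subscribers(LOG, "198.51.100.7", 1500, 40002)
```

In this toy log, the IP-only lookup matches both subscribers, while adding the source port narrows it to one.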

It would be nice to log the source port of the client ($_SERVER['REMOTE_PORT'] ?), so some privileged users (sysops or checkusers) could access it later if needed for reporting to ISPs.

Event Timeline

Krenair subscribed.

<Krenair> I would expect $_SERVER['REMOTE_PORT'] to be useless inside WMF infrastructure
<Krenair> I would not expect nginx and varnish to use the same source port as the client

ema triaged this task as Medium priority. Nov 27 2017, 8:33 AM

<Krenair> I would expect $_SERVER['REMOTE_PORT'] to be useless inside WMF infrastructure

Yeah I'm also not really sure if adding the client source port to the current list of fields available in webrequests would help much. What type of questions could we answer if we had such a field that we cannot answer now? Analytics, any input?

<Krenair> I would not expect nginx and varnish to use the same source port as the client

The source port as seen by nginx is indeed the remote end's source port. We could thus simply map $remote_port to a new header as follows, and then add the relevant varnishkafka config:

proxy_set_header X-Real-Port $remote_port;

<Krenair> I would expect $_SERVER['REMOTE_PORT'] to be useless inside WMF infrastructure

Yeah I'm also not really sure if adding the client source port to the current list of fields available in webrequests would help much. What type of questions could we answer if we had such a field that we cannot answer now? Analytics, any input?

The request here is to make the source port available to MediaWiki's CheckUser extension, so that trusted users can look it up when filing abuse reports with ISPs, which need to know the source port to identify the source of the abuse.

Per @Legoktm this has nothing to do with Analytics as far as I am aware.

<Krenair> I would not expect nginx and varnish to use the same source port as the client

The source port as seen by nginx is indeed the remote end's source port. We could thus simply map $remote_port to a new header as follows, and then add the relevant varnishkafka config:

proxy_set_header X-Real-Port $remote_port;

Yeah, so nginx will see the actual remote source port, then use it to set an X-Real-Port header which I guess Varnish will simply pass on to MW servers.
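
As a rough sketch of the receiving side of that plan, the backend would read the forwarded header only when the request arrives from a known proxy, and otherwise use the port of the direct connection. This is written in Python against a WSGI-style environ purely for illustration (MediaWiki itself is PHP). The `client_source_port` helper, the trusted-proxy set, and the presence of `REMOTE_PORT` in the environ are assumptions, not anything deployed.

```python
from typing import Mapping, Optional, Set


def _parse_port(value: Optional[str]) -> Optional[int]:
    """Convert a header or CGI value to a valid TCP port, or None."""
    try:
        port = int(value)
    except (TypeError, ValueError):
        return None
    return port if 1 <= port <= 65535 else None


def client_source_port(environ: Mapping[str, str],
                       trusted_proxies: Set[str]) -> Optional[int]:
    """Return the client's source port as best the backend can tell.

    Hypothetical helper: X-Real-Port is the header name proposed in this
    thread; WSGI exposes it as HTTP_X_REAL_PORT.
    """
    if environ.get("REMOTE_ADDR") in trusted_proxies:
        # The TCP peer is our own proxy, so its REMOTE_PORT is meaningless
        # for identifying the client; rely on the forwarded header only.
        return _parse_port(environ.get("HTTP_X_REAL_PORT"))
    # Direct connection: ignore any client-supplied header.
    return _parse_port(environ.get("REMOTE_PORT"))


proxied = {
    "REMOTE_ADDR": "10.0.0.1",
    "REMOTE_PORT": "33000",
    "HTTP_X_REAL_PORT": "51234",
}
port = client_source_port(proxied, {"10.0.0.1"})
```

Ignoring the header on untrusted connections matters because a client could otherwise send its own X-Real-Port value and pollute the logs.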

I did a bit of searching around and found some people talking about X-Forwarded-Port, but it sounds like they are using it for the destination port rather than source port - https://mattrobenolt.com/handle-x-forwarded-port-header-in-django/
Is there a chance the outermost layer of LVS might screw up anything in this plan?

Is this still desirable for checkusers? Infrastructure has changed since then and is still changing, but we could probably find a way to pass the data along in a header.

It is desirable when there are trolls using ISPs that use CGN (and maybe in other cases). I think this is quite a rare case, but when it is required it's important to have it.

For Wikimedia wikis: I'm not familiar enough with the infra in this area (or with how much effort it requires), but I guess it would be good to have it logged at least somewhere (the MW database accessible to the CheckUser extension, or some logs available to ops and analytics), so that the data can be retrieved when needed.

As for the right logs and whether or not to keep it logged in the MW database, this may involve privacy/legal/bureaucratic aspects. My guess is that the easiest approach would be to log it somewhere, and analytics can give this data to checkusers upon request. If this becomes a common request (very unlikely), we can explore going the extra mile of exposing it to checkusers directly in the extension.

Is this still desirable for checkusers? Infrastructure has changed since then and is still changing, but we could probably find a way to pass the data along in a header.

It is indeed desirable. Just today I attempted to disclose IP data (after WMF approval) to an ISP; however, they said that IPs are not enough and source ports are also needed.

Having the source port would have allowed the ISP to identify a long-term abuser (LTA) today.

This comment was removed by Huji.

While I understand how this can be helpful when reporting abusers to ISPs, this use case is narrow and uncommon. If we decide to add this to CU logs, we should certainly not show it in typical CU results; it would clutter the interface.

However, I'm not sure this should ever be in every CU's hands. The use case (of reporting an abuser to their ISP) is something that I think WMF should handle, not volunteer CUs. So even if we decide that CU logs are the most appropriate place to store this data, it should not be shown in any of the CU interfaces. Only those with shell access (through a direct query of the DB) should be able to pull this, and in my opinion, volunteers with shell access should be expected not to take care of such requests either. If we do want to have a separate web-based view in CU that exposes this data, it should be restricted through another permission setting that is false for CUs and only true for certain WMF-employee users.

Lastly, if what I just wrote is agreeable, that raises the question: are the CU logs really the best place to store this at all? If varnish/nginx already keeps a log of all requests, could that log be matched by timestamp to the CU logs, and the port data extracted on demand? Why store something in two places, if it is already stored in one place and can be queried with a reasonable amount of effort?
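
A minimal sketch of that on-demand matching, assuming a request log that records a timestamp, client IP, and source port per request. The `RequestLogEntry` fields, the `ports_for_edit` helper, and the five-second tolerance are all hypothetical and do not reflect any actual log schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RequestLogEntry:
    """One row of a hypothetical edge request log."""
    timestamp: int     # Unix seconds
    client_ip: str
    source_port: int


def ports_for_edit(request_log: Iterable[RequestLogEntry], client_ip: str,
                   edit_time: int, tolerance: int = 5) -> List[int]:
    """Return source ports seen from client_ip within tolerance seconds of the edit."""
    return sorted({
        entry.source_port
        for entry in request_log
        if entry.client_ip == client_ip
        and abs(entry.timestamp - edit_time) <= tolerance
    })


REQUESTS = [
    RequestLogEntry(1_700_000_000, "198.51.100.7", 40001),
    RequestLogEntry(1_700_000_003, "198.51.100.7", 40002),
    RequestLogEntry(1_700_000_900, "198.51.100.7", 40555),
    RequestLogEntry(1_700_000_002, "203.0.113.9", 51000),
]

# An edit recorded in the CU log at 1_700_000_001 from 198.51.100.7.
candidates = ports_for_edit(REQUESTS, "198.51.100.7", 1_700_000_001)
```

Several ports can fall inside the window, so a lookup like this yields candidates rather than a single guaranteed answer; a tighter join key (for example a request ID, if one were logged on both sides) would remove that ambiguity.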

If T265692 ends up being easy to do, that supports my last point above.

I think this shouldn't go in mw side of things, it should be part of the analytics data lake (webrequest hadoop table for example).

I'm inclined to close this as declined in favor of T271953: Add client TCP source port to webrequest, which basically lets people who have access to Hadoop see the source port. In case it's needed for reporting, let people know and they can get it for you. I assume this happens pretty rarely.

Change 657416 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] varnish: include X-Client-Port in X-Analytics

https://gerrit.wikimedia.org/r/657416

Change 657416 abandoned by Effie Mouzeli:
[operations/puppet@production] varnish: include X-Client-Port in X-Analytics

Reason:
rebase probs

https://gerrit.wikimedia.org/r/657416

Change 658567 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] varnish: include X-Client-Port in X-Analytics

https://gerrit.wikimedia.org/r/658567

Change 658567 merged by Vgutierrez:
[operations/puppet@production] varnish: include X-Client-Port in X-Analytics

https://gerrit.wikimedia.org/r/658567

This is half resolved/half declined. The data is now available in the data lake and can be disclosed by people with access if needed. But we shouldn't add this information to MediaWiki, where it would make it even easier for CUs and admins to identify users, unless there's a strong benefit from it (which I can't see).

Urbanecm closed subtask Restricted Task as Declined. Jan 23 2023, 8:00 PM