
webiron batched abuse reports 1/23/2017 for coibot and linkwatcher
Closed, Invalid · Public

Description

We have gotten a few emails from these folks in the past day. They indicate the following IPs are problematic:

  • 208.80.155.143
  • 208.80.155.178

Some views of these reports:

https://www.webiron.com/abuse_feed/abuse@wmflabs.org
https://www.webiron.com/iplookup/208.80.155.178
https://www.webiron.com/abuse_feed/208.80.155.143

Unwanted and or Abusive Web Requests:

Offending/Source IP: 208.80.155.143
- Issue: Source has attempted the following botnet activity: Orphan Malware Scanner
- Block Type: New Ban
- Time: 2017-01-23 13:02:12-07:00
- Port: 80
- Service: http
- Report ID: 3142fba8-e198-41a2-a99f-023d2d3cc5a6
- Bot Fingerprint: 1332903b7e59e47342fcd6c65b8b7858
- Bot Information: https://www.webiron.com/bot_lookup/1332903b7e59e47342fcd6c65b8b7858
- Bot Node Feed: https://www.webiron.com/bot_feed/1332903b7e59e47342fcd6c65b8b7858
- Abused Range: 45.79.136.0/24
- Requested URI: /about/
- User-Agent: COIParser/2.0
- Time: 2017-01-23 14:16:59-07:00
- Port: 80
- Service: http
- Report ID: cfa4e4db-a20a-49e8-93f9-e745dcbc6bef
- Bot Fingerprint: 1332903b7e59e47342fcd6c65b8b7858
- Bot Information: https://www.webiron.com/bot_lookup/1332903b7e59e47342fcd6c65b8b7858
- Bot Node Feed: https://www.webiron.com/bot_feed/1332903b7e59e47342fcd6c65b8b7858
- Abused Range: 50.116.5.0/24
- Requested URI: /
- User-Agent: COIParser/2.0

All the activity they report I can track back to coibot and linkwatcher (https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Beetstra), which are run by @Beetstra.

The nodes in question (each exec node in Tools has its own external IP):

79a41ff1-de61-4061-a4b0-7cd1d25d658f  tools-exec-1403  ACTIVE  public=10.68.17.239, 208.80.155.143
f284b92f-3e86-4127-bdd1-d7e32bb65809  tools-exec-1411  ACTIVE  public=10.68.17.209, 208.80.155.178

I can see the expected bots running:

557434 0.35652 coibot     tools.coibot r     12/01/2016 05:03:23 continuous MASTER

I see where the user is setting an appropriate (and reported) user-agent:

/data/project/coibot

/data/project/coibot# grep -Ri COIParser *
Parser.pl:$diffFetcher->agent("COIParser/2.0");

/data/project/linkwatcher

linkwatcher.pm: my $agent = shift || 'LinkWatcher'; #user-specified agent or default to 'LinkWatcher'
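The Perl snippets above set the agent string via LWP's agent() method. For illustration only (this is not the bots' actual code), the same idea in Python: a descriptive user-agent, including a contact hint, which is a common courtesy for bot operators. The contact URL here is a made-up example.

```python
import urllib.request

# A descriptive user-agent helps site operators identify automated
# traffic and contact the operator; the URL below is illustrative only.
BOT_AGENT = "COIParser/2.0 (+https://wikitech.wikimedia.org/wiki/Example_bot_page)"

def fetch_head(url: str) -> urllib.request.Request:
    """Build a HEAD request that announces itself as the bot."""
    return urllib.request.Request(
        url, headers={"User-Agent": BOT_AGENT}, method="HEAD"
    )

req = fetch_head("https://example.org/about/")
```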

My thought at the moment is that whatever "Orphan Malware Scanner" is, these reports are not identifying it correctly.

Event Timeline

chasemp added a project: Cloud-Services.
chasemp added a subscriber: RobH.

Replied:

Please be more specific about the behavior you are seeing that is being flagged as harmful.  I have looked into the bot using the user-agent COIParser and I do not see anything that could be mistaken for Orphan Malware Scanner from either of these IPs:

    208.80.155.143
    208.80.155.178


Chase Pettet
chasemp renamed this task from webiron batched abuse reports to webiron batched abuse reports 1/23/2017.Jan 23 2017, 10:47 PM

I also need to know what they see as harmful. Coibot and Linkwatcher check added links for viability: whether they are redirects, and whether they contain typical 'money making schemes'. If they see a lot of traffic, it is because those links are being added to Wikipedia at a somewhat alarming rate.

I will have a look into the nature of this use....

Checked - I do not see insertions of webiron.com itself, and the 40 major projects do not link to it, so it is unlikely that linkwatcher or COIBot saw additions taking place. It must be a domain they are watching that triggered linkwatcher and COIBot, which in turn triggered their reports.

Requested URI: /about/

Requested URI: /

My first suspicion is that they have some honeypot domains and count anything requesting any URL at one of those domains as a "hit". And someone put URLs at one or more of those domains somewhere where Coibot and Linkwatcher would find them.

That is also one of my suspicions. The other is that a domain owner noticed a bot requesting data from their site and wants to know whether that is/was legit, or that the site itself got added a lot in some tracking template in places where Linkwatcher and coibot would notice (the latter being odd, but not impossible). I would need to know the specific requests that triggered this now ...

Just to clarify:

  • linkwatcher parses every Wikipedia diff (cross-wiki) and extracts the added external links, which get stored in its database. It then:
    • does some simple stats on its database (along the lines of 'how often did <user> add <domain>' vs. 'how often did <domain> get added in total'; if those two numbers match, that is reason to report the links, which go to COIBot).
    • gets the IP of the server the webpage is hosted on.
    • loads the header of the full URL and looks at 404s, 202s, etc., detecting whether the site is a redirect or a 'true webpage' - redirects get reported.
    • loads a small chunk of the page and parses out typical trackers and webbugs - spammers generally use the same tracker on different domains - and runs stats on those. From the same chunk it looks for HTML-based redirect pages (blogspot, for example, allows those).
    • checks the links, IPs, redirects, and identifiers against blacklists (to feed en.wikipedia's XLinkBot).
  • From this data it produces an IRC feed.

  • COIBot looks at <username> vs. <pagename> (from the Wikimedia IRC feed) for 'conflict of interest', and at <username> vs. <domain> (from the linkwatcher feed) for 'conflict of interest'.
    • If the statistics match, COIBot goes deeper, doing among other things the same thing that LiWa3 does, but parsing a bit more of the page content. That is the 'COIParser' that seems to get the hits.
    • It saves reports for suspicious cases; those are either user-reported (typically in spam-blacklist requests) or arise from the statistics that linkwatcher did. That part uses agent 'COIBot' (it should actually be 'COILinkSaver' for clarity - I changed it and am waiting for it to restart). The LinkSaver also does the things that linkwatcher and the COIBot parser do.

Hence, I do not see what is the 'offending' behaviour at the moment.
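Roughly, the add-ratio statistic and the header classification described above amount to something like the following. This is an illustrative Python sketch with made-up names; the actual bots are Perl and certainly more involved.

```python
def user_dominates_domain(user_adds: int, total_adds: int) -> bool:
    """linkwatcher-style statistic: if one user accounts for all
    additions of a domain, the links are worth reporting to COIBot."""
    return total_adds > 0 and user_adds == total_adds

def classify_status(status: int) -> str:
    """Rough classification of the HTTP status seen when loading a
    link's headers: redirects get reported, missing pages noted."""
    if 300 <= status < 400:
        return "redirect"
    if status == 404:
        return "missing"
    if 200 <= status < 300:
        return "ok"
    return "other"
```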

@Beetstra thank you for your proactive response. FWIW I don't see, and don't currently think there is, any abusive behavior or wrongdoing on the part of the bots in question. Looking forward to further details from the reporter, but the language used in the emails doesn't leave me the impression this is a high-value reporter.

  • gets the IP of the server the webpage is hosted on.

Hmm. Are any of these IPs in 45.79.136.0/24 or 50.116.5.0/24, since they claim those as the "abused range"?

With hundreds of edits per minute to the 800 wikis that are checked, there are likely domains in every thinkable range ...

@chasemp did you get any response or further requests? There is no intentional wrongdoing on my side, but there may be unintentionally alarming behaviour due to the coding of my bots. I do want to know how to avoid it (do I have to set flags? Does my 'agent' need to be set in a clearer way to show this is an automated process?)

chasemp changed the visibility from "Custom Policy" to "Public (No Login Required)".

@Beetstra I never heard anything back, and I believe we stopped getting their report emails (which are almost certainly automated). I haven't witnessed the bots in question doing anything malicious, and I feel comfortable saying this was a low-value, low-insight report from a low-quality reporter.

That being said, I think these bots look fairly similar to a webcrawler from the perspective of those seeing the traffic.

A few points that are probably useful to that effect:

  • Ideally the page mentioned above would indicate valid source IPs for the bot itself. This lets people reason about 'reputation' and intended operation more easily. For Cloud-Services this is 208.80.155.128/25 at the moment.
  • Ideally the page mentioned above would have instructions on contacting the bot operator.
  • A mechanism that reads a blacklist of sites to /never/ hit (read into memory on startup?), so we can mitigate an issue like this immediately and the bot itself isn't an all-or-nothing proposition.
  • Understanding and respecting at least the robots.txt basics. This may be tricky, as it could possibly be turned against citation validation, but in general respecting crawl delays and full blocks is probably good behavior until proven otherwise. See our version for Phabricator itself.
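A blacklist-plus-robots.txt gate like the last two bullets describe could look roughly like this. This is a minimal Python sketch using the standard-library robots.txt parser; the blacklist contents, function names, and default agent are illustrative, not anything the bots actually implement.

```python
import urllib.robotparser
from urllib.parse import urlsplit

# Hypothetical never-hit blacklist, read into memory on startup.
NEVER_HIT = {"honeypot.example"}

def blacklisted(host: str) -> bool:
    """True if the host (or any subdomain of it) is on the never-hit list."""
    return any(host == d or host.endswith("." + d) for d in NEVER_HIT)

def allowed(url: str, robots_txt: str, agent: str = "COIParser") -> bool:
    """Return True only if the domain is not blacklisted and the site's
    robots.txt permits this agent to fetch the path."""
    host = urlsplit(url).hostname or ""
    if blacklisted(host):
        return False
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

robots = "User-agent: *\nDisallow: /private/\n"
```

The same parser also exposes crawl_delay() and request_rate() accessors, which would cover the delay-respecting half of the robots.txt suggestion.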

Thank you for caring about this. I am going to close this issue as invalid until we get further information that says otherwise. I am also going to make it public, as there seems to be no sensitive information here.

chasemp renamed this task from webiron batched abuse reports 1/23/2017 to webiron batched abuse reports 1/23/2017 for coibot and linkwatcher.Feb 1 2017, 2:46 PM