Create a centralized logging API for tracking and reporting dead link fixes
Closed, Resolved · Public · 8 Story Points

Description

We should create a centralized logging API on Tool Labs that keeps track of which pages have been processed for dead links, when, and by what agent/bot.

The API should accept as input the following information:

  • wiki
  • page name (possibly page id)
  • timestamp (possibly revision id)
  • number of links fixed (could be zero)
  • agent/bot
  • archive service used
  • whether the links were actually fixed or merely reported on the talk page.
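
For illustration only, a minimal sketch of how the endpoint might read and validate those inputs, using the parameter names from the example request that appears later in this thread (the deployed code may differ):

// Hypothetical parameter handling for the logging endpoint.
$required = [ 'wiki', 'page', 'num', 'id', 'rev', 'service', 'status' ];
$input = [];
foreach ( $required as $param ) {
    if ( !isset( $_GET[$param] ) ) {
        http_response_code( 400 );
        die( "Missing required parameter: $param" );
    }
    $input[$param] = $_GET[$param];
}
// The number of links fixed may legitimately be zero, but it must be numeric.
if ( !ctype_digit( $input['num'] ) ) {
    http_response_code( 400 );
    die( "Parameter 'num' must be a non-negative integer" );
}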
kaldari created this task. · Feb 9 2016, 7:23 PM
kaldari added a project: Community-Tech.
kaldari added a subscriber: kaldari.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Feb 9 2016, 7:23 PM
kaldari edited the task description. · Feb 9 2016, 7:24 PM
kaldari set Security to None.
kaldari edited a custom field. · Feb 9 2016, 7:26 PM
DannyH triaged this task as "Normal" priority.
DannyH edited the task description. · Feb 9 2016, 7:29 PM

I created a new Tool Labs project called deadlinks where we can set this up. @Niharika and @Fhocutt have been added as project maintainers.

Also created a new database for it called deadlinks. Just type "mysql" from the command line after becoming the deadlinks project and it will log you in to the new database.

Niharika claimed this task. · Feb 17 2016, 1:45 PM

Thanks, @kaldari. You need to tell us the secret behind your incredible efficiency. :P

This is now working, yay! I had to fight with labs a bit for it.
So something like http://tools.wmflabs.org/deadlinks/?wiki=frwiki&page=22&num=3&id=1&rev=112&service=IA&status=fixed will now record the information in the db.
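
A hedged sketch of what recording such a request might look like on the server side; the table and column names here are guesses for illustration, not the actual schema (which is linked below):

// Hypothetical server-side handling for a request like the one above.
// $dbUser/$dbPass/$dbName would come from the tool's config file;
// 'tools-db' was the Tool Labs user-database host at the time.
$link = mysqli_connect( 'tools-db', $dbUser, $dbPass, $dbName );
$params = [];
foreach ( [ 'wiki', 'page', 'num', 'id', 'rev', 'service', 'status' ] as $key ) {
    // Escape each value before interpolating it into the SQL.
    $params[$key] = trim( mysqli_real_escape_string( $link, $_GET[$key] ) );
}
mysqli_query( $link,
    "INSERT INTO log (wiki, page, num_links, bot_id, rev_id, service, status, log_time)
     VALUES ('{$params['wiki']}', '{$params['page']}', '{$params['num']}',
             '{$params['id']}', '{$params['rev']}', '{$params['service']}',
             '{$params['status']}', NOW())" );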

@Cyberpower678: Does this logging API sound usable for Cyberbot or are there any changes that need to be made?

@Cyberpower678 - here's the db table schema: https://github.com/Niharika29/Deadlink_logger/blob/master/db/createLogTable.sql
Does that look alright? Anything we might have missed?

Bot logs aren't much of a priority for me at the moment, but I have no objections to using this sometime in the future. With all the pages Cyberbot edits, you're going to need some serious indexing on that table; it wouldn't take long to fill it to the point where queries take five minutes.
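
For illustration, the kind of indexing being suggested here might look like the following one-off DDL run from the tool (column names are assumptions based on this thread, not the actual schema):

// Hypothetical one-off migration: index the columns that queries are
// most likely to filter by. $link is an open mysqli connection.
$ddl = [
    'CREATE INDEX idx_wiki_page ON log (wiki, page)',
    'CREATE INDEX idx_bot ON log (bot_id)',
    'CREATE INDEX idx_time ON log (log_time)',
];
foreach ( $ddl as $statement ) {
    mysqli_query( $link, $statement );
}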

Also, we need some kind of authentication to prevent DB spamming.

Will bots need to be whitelisted in order to use the interface?

I don't think it would be needed. Do you see any reason it would be?

DannyH added a subscriber: DannyH. · Feb 23 2016, 5:15 PM

@Niharika: Per our discussion, it would be good to change the bot field to a string and divide num_links into num_links_fixed and num_links_not_fixed.
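
A sketch of the migration that change implies, assuming the columns are currently named bot and num_links (the actual DDL in the repo may differ):

// Hypothetical migration: bot becomes a string, and num_links is
// split into fixed / not-fixed counts. $link is an open mysqli connection.
$ddl = [
    'ALTER TABLE log MODIFY bot VARCHAR(255) NOT NULL',
    'ALTER TABLE log CHANGE num_links num_links_fixed INT NOT NULL DEFAULT 0',
    'ALTER TABLE log ADD COLUMN num_links_not_fixed INT NOT NULL DEFAULT 0',
];
foreach ( $ddl as $statement ) {
    mysqli_query( $link, $statement );
}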

Taking care of the authentication concern in T128111: Create a password system for the dead links logging API. Thanks for pointing it out.
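
A minimal sketch of the kind of shared-key check T128111 could introduce; the 'key' parameter and the stored hash are assumptions for illustration, not the actual implementation:

// Hypothetical shared-secret check run before any write is accepted.
// The hash would live in a config file outside public_html that
// simply returns the expected SHA-256 digest as a string.
$storedHash = require __DIR__ . '/../config/apikey.php';
if ( !isset( $_GET['key'] ) ||
     !hash_equals( $storedHash, hash( 'sha256', $_GET['key'] ) ) ) {
    http_response_code( 403 );
    die( 'Invalid or missing API key' );
}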

To prevent SQL injection attacks, you'll need to do some escaping on the input in addLogRecord() before you write it to the database. I usually do something like:

// Escape each input value before it is interpolated into the SQL statement.
foreach ( $vars as $key => $value ) {
    $vars[$key] = trim( mysqli_real_escape_string( $link, $value ) );
}

Also the inputs for the SELECT statements will need to be escaped in index.php. Sorry I missed that earlier. See https://www.mediawiki.org/wiki/SQL_injection for more info on avoiding SQL vulnerabilities.
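
For example, a query against the log might escape its filter input like this (a sketch; variable names are illustrative):

// Hypothetical: escape a user-supplied filter before interpolating it
// into a SELECT. $link is an open mysqli connection.
$wiki = trim( mysqli_real_escape_string( $link, $_GET['wiki'] ) );
$result = mysqli_query( $link, "SELECT * FROM log WHERE wiki = '$wiki'" );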

Where the data from the database is being output to the table, you'll need to escape it with htmlspecialchars(), in order to avoid XSS attacks (since we're outputting arbitrary user-generated strings). See https://www.mediawiki.org/wiki/Security_for_developers#Cross-site_scripting_.28XSS.29 or https://www.mediawiki.org/wiki/Cross-site_scripting.
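
The output loop might then look something like this sketch, with every cell escaped on the way out:

// Hypothetical: escape every user-generated value before echoing it
// into the HTML table.
while ( $row = mysqli_fetch_assoc( $result ) ) {
    echo '<tr>';
    foreach ( $row as $value ) {
        echo '<td>' . htmlspecialchars( $value ) . '</td>';
    }
    echo '</tr>';
}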

Sorry, both of those comments actually belong on T126364 rather than here.

Right now, the API code is mostly duplicated between /public_html/api.php and /public_html/api/index.php. Let's settle on one API endpoint and remove the other one, so that we don't have to maintain two.
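
One transitional option (a sketch, not necessarily what was actually done) is to reduce the extra file to a thin shim until callers have migrated:

// Hypothetical contents of /public_html/api.php: delegate to the
// canonical endpoint, then delete this file once nothing requests it.
require __DIR__ . '/api/index.php';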

Done. The inputs to the SELECT statements in index.php are now escaped.

Done. All values output from the database into the table are now escaped with htmlspecialchars().

Done. Settled on a single API endpoint and removed the duplicate.

kaldari closed this task as "Resolved". · Mar 9 2016, 5:18 PM
kaldari moved this task from Needs Review/Feedback to Done on the Community-Tech-Sprint board.
DannyH moved this task from Backlog to Archive on the Community-Tech board. · Mar 14 2016, 11:15 PM