Create a centralized logging API for tracking and reporting dead link fixes
Closed, Resolved | Public | 8 Estimated Story Points

Description

We should create a centralized logging API on Tool Labs that keeps track of which pages have been processed for dead links, when, and by what agent/bot.

The API should accept as input the following information:

  • wiki
  • page name (possibly page id)
  • timestamp (possibly revision id)
  • number of links fixed (could be zero)
  • agent/bot
  • archive service used
  • whether the links were actually fixed or just posted on the talk page.

Event Timeline

kaldari raised the priority of this task to Needs Triage.
kaldari updated the task description.
kaldari added a project: Community-Tech.
kaldari subscribed.
kaldari set Security to None.
DannyH triaged this task as Medium priority. Feb 9 2016, 7:27 PM
DannyH moved this task from Needs Discussion to Up Next (May 6-17) on the Community-Tech board.

I created a new Tool Labs project called deadlinks where we can set this up. @Niharika and @Fhocutt have been added as project maintainers.

Also created a new database for it called deadlinks. Just type "mysql" from the command line after becoming the deadlinks project and it will log you in to the new database.

Thanks, @kaldari. You need to tell us the secret behind your incredible efficiency. :P

This is now working, yay! I had to fight with labs a bit for it.
So something like http://tools.wmflabs.org/deadlinks/?wiki=frwiki&page=22&num=3&id=1&rev=112&service=IA&status=fixed will now record the information in the db.
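A bot could assemble such a request with PHP's http_build_query(). This is only a sketch: the parameter names are copied from the example URL above, while the helper name buildLogUrl() is hypothetical and the exact meaning of each parameter is not spelled out in this thread.

```php
<?php
// Hypothetical helper: builds a logging request URL from a bot's report.
// Parameter names mirror the example URL; their exact meaning is assumed.
function buildLogUrl(string $base, array $report): string
{
    // RFC 3986 encoding keeps spaces as %20 rather than "+".
    $query = http_build_query($report, '', '&', PHP_QUERY_RFC3986);
    return $base . '?' . $query;
}

$url = buildLogUrl('http://tools.wmflabs.org/deadlinks/', [
    'wiki'    => 'frwiki',
    'page'    => 22,
    'num'     => 3,
    'id'      => 1,
    'rev'     => 112,
    'service' => 'IA',
    'status'  => 'fixed',
]);

echo $url, "\n";
```

Building the query string from an array avoids hand-encoding values such as page titles that contain spaces or ampersands.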

@Cyberpower678: Does this logging API sound usable for Cyberbot or are there any changes that need to be made?

Bot logs aren't much of a priority for me at the moment, but I have no objections to using it sometime in the future. With all the pages it edits, you're going to need some serious indexing for that table. It wouldn't take very long for the table to fill up enough that queries take 5 minutes.
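To illustrate the indexing concern, here is a minimal sketch that uses an in-memory SQLite database through PDO as a stand-in for the Tool Labs MySQL database. The table name, column names, and index are all hypothetical, since the thread does not show the real schema, and the sketch assumes the pdo_sqlite extension is available.

```php
<?php
// Stand-in database: in-memory SQLite via PDO (the real tool uses MySQL).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema; the real table layout is not shown in this task.
$db->exec('CREATE TABLE deadlink_log (
    id INTEGER PRIMARY KEY,
    wiki TEXT NOT NULL,
    page_id INTEGER NOT NULL,
    logged_at TEXT NOT NULL
)');

// Composite index covering the assumed common lookup: one page on one wiki.
$db->exec('CREATE INDEX idx_wiki_page ON deadlink_log (wiki, page_id)');

// Ask the query planner how it would run that lookup.
$stmt = $db->prepare(
    'EXPLAIN QUERY PLAN SELECT * FROM deadlink_log WHERE wiki = ? AND page_id = ?'
);
$stmt->execute(['frwiki', 22]);
$plan = implode("\n", array_column($stmt->fetchAll(PDO::FETCH_ASSOC), 'detail'));

echo $plan, "\n";
```

If the printed plan names idx_wiki_page, the lookup is served by the index rather than a full table scan. Which columns deserve an index depends on the queries the tool actually runs.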

Also, we need some kind of authentication to prevent DB spamming.

> Also, we need some kind of authentication to prevent DB spamming.

Will bots need to be whitelisted in order to use the interface?

I don't think it would be needed. Do you see any reason it would be?

@Niharika: Per our discussion, it would be good to change the bot field to a string and divide num_links into num_links_fixed and num_links_not_fixed.

> Also, we need some kind of authentication to prevent DB spamming.

> Will bots need to be whitelisted in order to use the interface?

> I don't think it would be needed. Do you see any reason it would be?

Taking care of this in T128111: Create a password system for the dead links logging API. Thanks for pointing it out.

To prevent SQL injection attacks, you'll need to do some escaping on the input in addLogRecord() before you write it to the database. I usually do something like:

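// Escape every input value before it is used in a SQL string.
// $link is the open mysqli connection.
foreach( $vars as $key => $value ) {
    $vars[$key] = trim( mysqli_real_escape_string( $link, $value ) );
}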

Also the inputs for the SELECT statements will need to be escaped in index.php. Sorry I missed that earlier. See https://www.mediawiki.org/wiki/SQL_injection for more info on avoiding SQL vulnerabilities.
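As an alternative to escaping each value by hand, prepared statements keep user input out of the SQL text altogether. The sketch below uses PDO with in-memory SQLite as a stand-in so that it runs without a MySQL server. The table and column names are hypothetical, and this is not the code in addLogRecord() or index.php.

```php
<?php
// Stand-in database: in-memory SQLite via PDO (assumes pdo_sqlite is enabled).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE log (wiki TEXT, bot TEXT)');

// A hostile value that would break out of a naively concatenated query.
$bot = "x'); DROP TABLE log; --";

// Placeholders send the values separately from the SQL text.
$insert = $db->prepare('INSERT INTO log (wiki, bot) VALUES (:wiki, :bot)');
$insert->execute([':wiki' => 'frwiki', ':bot' => $bot]);

// The same approach applies to SELECT inputs.
$select = $db->prepare('SELECT bot FROM log WHERE wiki = :wiki');
$select->execute([':wiki' => 'frwiki']);
$stored = $select->fetchColumn();

// The hostile string is stored verbatim as data, not executed as SQL.
var_dump($stored === $bot);
```

mysqli offers the same mechanism through its own prepare and bind calls, so the idea carries over to the existing mysqli-based code.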

Where the data from the database is being output to the table, you'll need to escape it with htmlspecialchars(), in order to avoid XSS attacks (since we're outputting arbitrary user-generated strings). See https://www.mediawiki.org/wiki/Security_for_developers#Cross-site_scripting_.28XSS.29 or https://www.mediawiki.org/wiki/Cross-site_scripting.
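A minimal sketch of that output escaping follows. The helper name cell() is hypothetical and the markup is simplified; the tool's actual table-rendering code is not shown in this task.

```php
<?php
// Hypothetical helper: escape one database value for use in an HTML table cell.
function cell(string $value): string
{
    // ENT_QUOTES also escapes single quotes; the charset is stated explicitly.
    return '<td>' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '</td>';
}

// A stored value containing markup is rendered as inert text.
$row = cell('<script>alert("x")</script>');
echo $row, "\n";
```

Escaping at output time, rather than when the record is stored, keeps the database holding the original string while still rendering it safely.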

Sorry, both of those comments actually belong on T126364 rather than here.

Right now, the API code is mostly duplicated between /public_html/api.php and /public_html/api/index.php. Let's settle on 1 API end-point and remove the other one (so that we don't have to maintain 2).

> Also the inputs for the SELECT statements will need to be escaped in index.php. Sorry I missed that earlier. See https://www.mediawiki.org/wiki/SQL_injection for more info on avoiding SQL vulnerabilities.

Done.

> Where the data from the database is being output to the table, you'll need to escape it with htmlspecialchars(), in order to avoid XSS attacks (since we're outputting arbitrary user-generated strings). See https://www.mediawiki.org/wiki/Security_for_developers#Cross-site_scripting_.28XSS.29 or https://www.mediawiki.org/wiki/Cross-site_scripting.

Done.

> Right now, the API code is mostly duplicated between /public_html/api.php and /public_html/api/index.php. Let's settle on 1 API end-point and remove the other one (so that we don't have to maintain 2).

Done.