
Investigate potential issues with globe-coordinate validation
Closed, Resolved · Public

Description

In T144248: No RDF builder defined for data type globe-coordinate nor for value type bad in DispatchingValueSnakRdfBuilder::getValueBuilder it appears that some invalid globe-coordinate data found its way into the database. While this may be due to older issues with validation, to ensure this does not happen again we should investigate whether there are problems with globe-coordinate validation that could produce malformed data like the following:

{
  "value": {
    "latitude": -90,
    "longitude": 0,
    "altitude": null,
    "precision": 146706019195900,
    "globe": "http://www.wikidata.org/entity/Q308"
  },
  "type": "globecoordinate",
  "error": "$precision needs to be between -360 and 360"
}
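
For reference, the attached error corresponds to a plain range check on the precision field. A minimal sketch of that check (Python, for illustration only; the actual validation lives in the Wikibase/DataValues PHP libraries and is more involved):

def check_precision(value: dict) -> None:
    """Raise if the globe-coordinate precision is outside the accepted range."""
    precision = value.get("precision")
    if precision is not None and not -360 <= precision <= 360:
        raise ValueError("$precision needs to be between -360 and 360")

# The malformed value quoted above fails this check:
try:
    check_precision({
        "latitude": -90,
        "longitude": 0,
        "precision": 146706019195900,
        "globe": "http://www.wikidata.org/entity/Q308",
    })
except ValueError as error:
    print(error)  # $precision needs to be between -360 and 360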

Acceptance Criteria

  • If any issues with validation are found, either create a ticket or fix the issue.

Event Timeline

Having examined both the wbparsevalue and wbsetclaim API modules, it appears that there is no way for invalid globe-coordinate precision values to go undetected while editing existing values in the web interface. Up next:

  • wbeditentity: modification-failed error on invalid precision
  • wbcreateclaim: invalid-snak error on bad precision
  • wbsetclaimvalue

It appears that there is no way to add an invalid globe-coordinate without triggering a snak validation error or, at the very least, a data value parsing error.
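
To double-check the wbcreateclaim behaviour listed above by hand, a probe along these lines could be used (a sketch only; the test.wikidata.org endpoint, item Q42, property P125 and globe Q2 are placeholder choices, not values from this investigation):

import json
import requests

API = "https://test.wikidata.org/w/api.php"
session = requests.Session()

# Fetch a CSRF token, which the API requires even for an edit that will fail.
token = session.get(API, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]

bad_value = {
    "latitude": -90,
    "longitude": 0,
    "precision": 146706019195900,  # outside the allowed [-360, 360] range
    "globe": "http://www.wikidata.org/entity/Q2",
}

response = session.post(API, data={
    "action": "wbcreateclaim",
    "entity": "Q42",        # placeholder item
    "property": "P125",     # placeholder globe-coordinate property
    "snaktype": "value",
    "value": json.dumps(bad_value),
    "token": token,
    "format": "json",
}).json()

# Expected outcome, per the list above: an invalid-snak error, meaning the
# malformed precision never reaches the database.
print(response.get("error", {}).get("code"))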

As far as I understand, the above are the only ways an inaccurate precision could reach the database. Are there any other paths I'm unaware of?

So I think it's good so far, thanks for doing it. My worry is about edit paths we might have missed: when there's a leak like this we'd have to check everything, and in MediaWiki that isn't really possible (e.g. edits can also be made using maintenance scripts).

My suggested course of action since the main bits have been checked:

  • Wait for the results of T283576: Look in the database for malformed globe-coordinate precisions
  • If the problematic cases were introduced only around 2014 and none have been introduced since, call this ticket done.
  • If such cases are still being introduced, dig deeper. For example, if any occurred in the last three months, we can check Hadoop and Logstash for more information.

HTH

I think this is ready for another look now that the database investigation ticket is done.
Removed @ItamarWMDE as he is no longer part of the camp.

So, given that T283576 found only two items (Q3629997, Q3642430; see their edit histories) with an invalid precision, and both of those precision values were entered in 2014 by the same bot, I guess we can close this task as done?

Itamar's investigation didn't find any issues, and Amir pointed out that we probably don't have reason to expect any undiscovered issues.

We technically got it done, but I would like to know how we are going to avoid having these exceptions in production. The bad values were introduced in 2014 and won't happen again, but: 1) We need to fix the currently affected items (possibly by removing the offending statements?). 2) We need to avoid surfacing an exception when a user requests RDF output; we can't and shouldn't change the content of an old revision, so requesting RDF output for a specific revid should not cause an exception.
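
For point 2, a rough sketch of the general pattern (Python for illustration only, with made-up names; the actual change would live in the Wikibase RDF builder PHP code): log and skip the offending value instead of letting the exception propagate.

import logging

logger = logging.getLogger("rdf-builder-sketch")

def add_statement_to_rdf(graph, statement, build_value):
    """Try to emit RDF for one statement; on a malformed value, log and skip
    it, so that requesting an old revision's RDF output never throws."""
    try:
        graph.append(build_value(statement))
    except ValueError as error:  # e.g. no builder for value type "bad"
        logger.warning("Skipping statement with malformed value: %s", error)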

Good points. For the purely technical aspect of the logspam there is already T283569. But I feel the two issues you mention also touch on some product considerations. I'll check in with the product team, and when we know more we can move forward with this.