[Task] create script to populate data for Special:Nearby
Closed, Resolved · Public

Description

We need a script to populate the coordinates and images for Special:Nearby.

Event Timeline

Lydia_Pintscher raised the priority of this task to High.
Lydia_Pintscher updated the task description.
Lydia_Pintscher moved this task to Backlog on the Wikidata-Sprint-2015-09-29 board.

The maintenance script refreshLinks could be used to batch-update page images together with geo coordinates for all items.

@hoo: is it enough to use the refreshLinks script or do we need another one?

As far as I can see, calling refreshLinks.php with no parameters should be sufficient to update all page_props, including the PageImages and GeoData properties (but only once T75482 and T112865 are merged and deployed). Problems I can see:

  • There is no namespace filter. We do not need to update all namespaces, only namespaces that support statements.
  • refreshLinks.php does a bit more, e.g. deleting stuff for non-existing pages. We probably don't want that.

Given how it is going to be implemented now (https://gerrit.wikimedia.org/r/243673) that would suffice, yes (I also tested it).

That would be nice to change, yes... but I don't think it will block us here. Most pages on Wikidata are entities, so just refreshing everything isn't terribly bad. Deleting old links also really shouldn't cause any problems, and the queries for that look sane enough to scale to Wikidata.

@hoo refreshLinks will work, but it's going to be slow... sharding would be nice, is that possible?

Yes... not in a very convenient way, but we can provide it with start and end page ids, thus we can shard using that.
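The sharding idea above could be sketched roughly as follows. This is only an illustration, not the procedure that was actually run: it assumes refreshLinks.php takes the start page id as a positional argument and the end page id via `--e`, and that `--wiki=wikidatawiki` selects the wiki. The helper names `shard_ranges` and `refresh_links_command`, the script path, and the example maximum page id and shard count are all hypothetical.

```python
def shard_ranges(max_page_id, shards):
    """Split page ids 1..max_page_id into contiguous, non-overlapping ranges."""
    if max_page_id < 1 or shards < 1:
        raise ValueError("max_page_id and shards must be positive")
    size = -(-max_page_id // shards)  # ceiling division
    ranges = []
    start = 1
    while start <= max_page_id:
        end = min(start + size - 1, max_page_id)
        ranges.append((start, end))
        start = end + 1
    return ranges


def refresh_links_command(start, end, wiki="wikidatawiki"):
    """Build one refreshLinks.php invocation for a single shard.

    Assumption: the start id is positional and the end id is passed as --e.
    """
    return [
        "php",
        "maintenance/refreshLinks.php",  # hypothetical path
        f"--wiki={wiki}",
        "--e",
        str(end),
        str(start),
    ]


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    for start, end in shard_ranges(max_page_id=1000, shards=4):
        print(" ".join(refresh_links_command(start, end)))
```

Each printed command could then be run in its own process, so the shards proceed in parallel without touching the same page ids.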

We are still working on enabling GeoData, but I tried running refreshLinks on 100 pages on Wikidata anyway, and it's rather slow.

I think it could still work, but I propose we have a dedicated script for this anyway:

  1. In case refreshLinks doesn't work well enough.
  2. For some reason, refreshLinks doesn't work for updating Cirrus.

As far as I know, refreshLinks has no option to restrict it to pages in a specific namespace, which would help somewhat, and I think it does a lot of work that we don't need.

Making refreshLinks (or something like it) be able to filter on namespace and/or be more selective in what work it does is probably useful in general, even if we end up not needing it now.

How slow was it?

If refreshLinks doesn't work for updating Cirrus, then whatever we come up with here likely won't work either? So perhaps we should either test whether it works or defer dealing with that until it happens. Do you have a suspicion about what specifically we should test?

I am already updating Cirrus :) It's a separate script.

refreshLinks is needed for updating the geo_tags and page_props tables.

As a not-so-nice workaround for the moment, I am purging pages with my bot, using the forcelinkupdate option. This might be good enough this time, especially since a bot can request a purge for a batch of pages at once. I can see that this is working when I look at the relevant database tables.
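A rough sketch of what such batched purge requests might look like is below. It only builds the POST parameters for the MediaWiki `action=purge` API module with `forcelinkupdate` set; it does not send anything. The helper names `batched` and `purge_params` and the batch size of 50 are assumptions for illustration (the actual per-request limit depends on the account's rights), and this is not the bot code that was actually used.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be positive")
    for i in range(0, len(items), size):
        yield items[i:i + size]


def purge_params(page_ids):
    """Build POST parameters for one action=purge request that also
    forces a links update for every page in the batch."""
    return {
        "action": "purge",
        "format": "json",
        "forcelinkupdate": "1",
        # Multi-value API parameters are pipe-separated.
        "pageids": "|".join(str(p) for p in page_ids),
    }


if __name__ == "__main__":
    # Made-up page ids, purely for illustration.
    page_ids = list(range(1, 121))
    for batch in batched(page_ids, 50):  # 50 per request is an assumed limit
        params = purge_params(batch)
        print(len(batch), "pages:", params["pageids"][:30], "...")
```

Each parameter dict would be sent as a POST request to the wiki's api.php by an authenticated bot session.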

I was trying to add tests to my hacky script and could finish that, but in the end it's not a very nice solution and would be a one-time thing that I would prefer not go into Wikibase (versus fixing refreshLinks and making it more flexible).

I also made a patch (already merged) to help fix the refreshLinks script. The new code will be deployed later today, so I can try refreshLinks.php again after that.

Change 248061 abandoned by Aude:
Script for populating geodata and pageimages

Reason:
found another, better way to get this done

https://gerrit.wikimedia.org/r/248061