We need a script to populate the coordinates and images for Special:Nearby.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Script for populating geodata and pageimages | mediawiki/extensions/Wikibase | master | +571 -0 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | aude | T112531 [Story] see items around me (Special:Nearby) |
| Resolved | aude | T115325 [Task] run refreshLinks on wikidatawiki |
| Resolved | Lydia_Pintscher | T114868 [Task] create script to populate data for Special:Nearby |
Event Timeline
The maintenance script refreshLinks could be used to batch update page images together with geo coordinates for all items.
As far as I can see, calling refreshLinks.php with no parameters should be sufficient to update all page_props, including the PageImages and GeoData properties (only once T75482 and T112865 are merged and deployed). Problems I can see:
- There is no namespace filter. We do not need to update all namespaces, only namespaces that support statements.
- refreshLinks.php does a bit more, e.g. deleting stuff for non-existing pages. We probably don't want that.
Given how it is going to be implemented now (https://gerrit.wikimedia.org/r/243673) that would suffice, yes (I also tested it).
That would be nice to change, yes... but I don't think it will block us here. Most pages on Wikidata are entities, so just refreshing everything isn't terribly bad. Also, deleting old links really shouldn't cause any problems; the queries for that also look sane enough to scale to Wikidata.
@hoo refreshLinks will work, but it's going to be slow... sharding would be nice, is that possible?
Yes... not in a very convenient way, but we can provide it with start and end page ids, thus we can shard using that.
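For illustration, sharding by page id could look roughly like the sketch below. It only computes inclusive page-id ranges and prints one command line per shard. The function names, the shard size, and the exact invocation (`-e <end>` plus a positional start id) are assumptions and should be checked against the script's `--help` before use.

```python
from typing import Iterator, List, Tuple


def shard_ranges(max_page_id: int, shard_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) page-id ranges covering 1..max_page_id."""
    if max_page_id < 1 or shard_size < 1:
        raise ValueError("max_page_id and shard_size must be positive")
    start = 1
    while start <= max_page_id:
        end = min(start + shard_size - 1, max_page_id)
        yield start, end
        start = end + 1


def shard_commands(max_page_id: int, shard_size: int) -> List[str]:
    """Build one command line per shard.

    ASSUMPTION: refreshLinks.php takes the last page id via -e and the
    first page id as a positional argument; verify before running.
    """
    return [
        f"php maintenance/refreshLinks.php -e {end} {start}"
        for start, end in shard_ranges(max_page_id, shard_size)
    ]


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to keep the output short.
    for command in shard_commands(max_page_id=1000, shard_size=250):
        print(command)
```

Each printed command could then be run in a separate process. Whether the end id is treated as inclusive is worth checking, although refreshing a boundary page twice should be harmless.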
We are still working on enabling GeoData, but anyway I tried running refreshLinks on 100 pages for Wikidata and it's rather slow.
I think it could still work, but I propose we still have a script for this:
- In case refreshLinks doesn't work so well.
- For some reason, it doesn't work for updating Cirrus.
As far as I know, refreshLinks has no option to restrict the run to pages in a specific namespace, which would help somewhat, and I think it does a lot that we don't need.
Making refreshLinks (or something like it) be able to filter on namespace and/or be more selective in what work it does is probably useful in general, even if we end up not needing it now.
How slow was it?
If refreshLinks doesn't work for updating Cirrus, then whatever we come up with here likely won't work either. So perhaps we should either test whether it works or defer dealing with that until it happens. Do you have a suspicion about what specifically we should test?
I am already updating Cirrus :) It's a separate script.
refreshLinks is needed for updating the geo_tags and page_props tables.
As a not-so-nice workaround at the moment, I am purging pages with my bot, using the forcelinkupdate option. This might just be okay enough this time, especially since a bot can request a purge of a batch of pages at a time. I can see that this is working when looking at the relevant database tables.
I was trying to add tests to my hacky script and could finish that, but in the end it's not so nice a solution and would be a one-time thing that I'd prefer not to put into Wikibase (versus fixing refreshLinks and making it more flexible).
I also made a patch (already merged) to help fix the refreshLinks script. The new code will be deployed later today, so I can try refreshLinks.php again after that.
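For reference, the bot purge workaround mentioned above amounts to sending batched `action=purge` requests with `forcelinkupdate` set. The sketch below only builds the request parameters and sends nothing; the helper names and the batch size of 50 are assumptions, not the bot's actual code.

```python
from typing import Dict, Iterator, List


def batched(titles: List[str], size: int) -> Iterator[List[str]]:
    """Split titles into consecutive batches of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be positive")
    for i in range(0, len(titles), size):
        yield titles[i:i + size]


def purge_payloads(titles: List[str], batch_size: int = 50) -> Iterator[Dict[str, str]]:
    """Yield one set of POST parameters per batch for the MediaWiki purge API.

    ASSUMPTION: a batch size of 50 is a conservative default; accounts
    with higher API limits may be allowed more titles per request.
    """
    for batch in batched(titles, batch_size):
        yield {
            "action": "purge",
            "forcelinkupdate": "1",      # also refresh the links tables
            "titles": "|".join(batch),   # multiple titles are pipe-separated
            "format": "json",
        }


if __name__ == "__main__":
    # Hypothetical item ids, for illustration only.
    for payload in purge_payloads(["Q1", "Q2", "Q3"], batch_size=2):
        print(payload)
```

Each payload would be POSTed to the wiki's api.php by an authenticated client; login and error handling are left out here.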
Change 248061 abandoned by Aude:
Script for populating geodata and pageimages
Reason:
found another, better way to get this done