Provide tools for processing obfuscated Chinese geodata (GCJ-02, BD-09)
Open, Low, Public

Description

Restrictions on geographic data in China force all approved Chinese map providers to run an obfuscation algorithm on the WGS-84 coordinates in all parts of their maps of China. This scheme, known as GCJ-02, is a "public secret", and deobfuscation methods are available. However, Wikipedians often record such coordinates without deobfuscating them first, so the equivalent of the "China GPS shift (or offset) problem" ends up polluting Wikidata and Wikipedia itself. (Baidu applies a further twist on top of GCJ-02, known as BD-09, which is also a public secret.)

This problem can affect Wikipedia, GeoHack, and Wikidata in various ways. When importing data, Wikipedians may take coordinates from commercial maps and save them without checking; when reading the data in GeoHack, Wikipedians will find the Google and Baidu map links "wrong" because of the shift. Therefore, the following should be done for all recorded Chinese coordinates (mainland China, minus the South China Sea Islands, where everyone uses satellite imagery):

  • Tag all existing Chinese coordinates as "check required".
  • Provide a tool for reviewing tagged coordinates along with deobfuscated values (original, unGCJ, unBD) on satellite imagery and on OSM.
  • Remind users to check new Chinese coordinates for deviation. Perhaps just add a "check required" tag.
  • In GeoHack, provide format strings for these scrambled coordinates. Probably just {lat,lon}degdec_cn{gcj,bd}.
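As a rough illustration of the three review values in the list above (original, unGCJ, unBD), here is a minimal Python sketch. The forward transforms and their constants are the ones found in widely circulated open-source implementations, not an official specification, so they should be checked against a reference implementation before any real use. The names `wgs_to_gcj`, `gcj_to_bd`, `invert`, and `review_candidates` are made up for this example. The inverse is a plain fixed-point iteration, which works here only because both transforms stay very close to the identity.

```python
import math

# Constants as they appear in commonly circulated code (assumption, not a spec).
A = 6378245.0                  # semi-major axis
EE = 0.00669342162296594323    # eccentricity squared
X_PI = math.pi * 3000.0 / 180.0


def _dlat(x, y):
    r = -100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y + 0.2 * math.sqrt(abs(x))
    r += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    r += (20.0 * math.sin(y * math.pi) + 40.0 * math.sin(y / 3.0 * math.pi)) * 2.0 / 3.0
    r += (160.0 * math.sin(y / 12.0 * math.pi) + 320.0 * math.sin(y * math.pi / 30.0)) * 2.0 / 3.0
    return r


def _dlon(x, y):
    r = 300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y + 0.1 * math.sqrt(abs(x))
    r += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    r += (20.0 * math.sin(x * math.pi) + 40.0 * math.sin(x / 3.0 * math.pi)) * 2.0 / 3.0
    r += (150.0 * math.sin(x / 12.0 * math.pi) + 300.0 * math.sin(x / 30.0 * math.pi)) * 2.0 / 3.0
    return r


def wgs_to_gcj(lat, lon):
    """Forward WGS-84 -> GCJ-02 (deterministic part only, no region check)."""
    dlat = _dlat(lon - 105.0, lat - 35.0)
    dlon = _dlon(lon - 105.0, lat - 35.0)
    radlat = math.radians(lat)
    magic = 1.0 - EE * math.sin(radlat) ** 2
    sqrtmagic = math.sqrt(magic)
    # Convert the metre-scale offsets to degrees.
    dlat = dlat * 180.0 / (A * (1.0 - EE) / (magic * sqrtmagic) * math.pi)
    dlon = dlon * 180.0 / (A / sqrtmagic * math.cos(radlat) * math.pi)
    return lat + dlat, lon + dlon


def gcj_to_bd(lat, lon):
    """Forward GCJ-02 -> BD-09."""
    z = math.hypot(lon, lat) + 0.00002 * math.sin(lat * X_PI)
    theta = math.atan2(lat, lon) + 0.000003 * math.cos(lon * X_PI)
    return z * math.sin(theta) + 0.006, z * math.cos(theta) + 0.0065


def wgs_to_bd(lat, lon):
    return gcj_to_bd(*wgs_to_gcj(lat, lon))


def invert(forward, lat, lon, rounds=10):
    """Undo a near-identity transform by fixed-point iteration."""
    glat, glon = lat, lon
    for _ in range(rounds):
        flat, flon = forward(glat, glon)
        glat += lat - flat
        glon += lon - flon
    return glat, glon


def review_candidates(lat, lon):
    """The three values a reviewer would compare on imagery."""
    return {
        "original": (lat, lon),
        "unGCJ": invert(wgs_to_gcj, lat, lon),
        "unBD": invert(wgs_to_bd, lat, lon),
    }
```

A review tool could plot all three values from `review_candidates(lat, lon)` on satellite imagery and OSM and let the reviewer pick the one that lands on the actual feature. The sketch does no "is this point in China" check; that has to happen before calling it.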

Thanks to @Liuxinyu970226 for bringing this up in T127950#3123960. Previous discussions: on enwp, on zhwp.


The "Chinese" part mentioned above can be determined simply from the region output of Google's Geocoding API. It excludes MO, HK, TW, disputed territories for which Google didn't buy data from Chinese sources, and most of the South China Sea islands. For a local solution, use OSM's Chinese border for land and the 12-nautical-mile border (without the South China Sea) for sea.
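For the local solution, the membership test could be a plain point-in-polygon check against a border ring extracted from OSM. The sketch below is a standard ray-casting test in Python; `point_in_polygon` and `TOY_RING` are hypothetical names, and the ring is a placeholder rectangle, not a real border. It treats latitude and longitude as planar coordinates, which is an approximation that should be acceptable for a coarse "check required" filter.

```python
def point_in_polygon(lat, lon, ring):
    """Ray casting; ring is a list of (lat, lon) vertices, closed implicitly."""
    inside = False
    n = len(ring)
    for i in range(n):
        lat1, lon1 = ring[i]
        lat2, lon2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude.
            cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross:
                inside = not inside
    return inside


# Placeholder rectangle for illustration only; a real tool would load the
# OSM border ring (and the 12-nautical-mile line for sea) here instead.
TOY_RING = [(20.0, 100.0), (20.0, 120.0), (40.0, 120.0), (40.0, 100.0)]

needs_check = point_in_polygon(30.0, 110.0, TOY_RING)
```

Coordinates for which the test is true would get the "check required" tag; everything else is left alone.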


For people looking for a way to do it manually: read https://en.wikipedia.org/wiki/User:Artoria2e5/coord-notice

Event Timeline

Arthur2e5 updated the task description.

For people who need a solution right now, I have set up a simple user gadget for conversion (web version). I have added links to the web version to the CN section of GeoTemplate on zhwiki and enwiki. I plan to add the map previews mentioned above to the web demo, but that might take a few weeks of procrastination.

Isn't systematic copying of coordinates from other sources problematic anyway, since it means the data infringes copyright and/or database rights in many jurisdictions?

Good question. With my AGF hat on, I would say that maps served in mainland China, whether Google (cn) or Baidu, also provide distorted/shifted aerial imagery to mask the deviation applied, so "good guys" who look for places on the wrong aerial map are still screwed. The same is said to apply to some handheld GPS devices (but not smartphones) sold in China. Smartphone users still suffer from purposefully distorted readings of their current location in map apps, though.

With that hat off by a bit, I should mention that quite a few people produce stuff tagged with source:GoogleMaps…

According to some pages I have read from Google, it seems that in the US only the compilation of data is protected, while the data itself is not, and the creation of a database also needs some creativity for the database to qualify under copyright law. In the EU there is additional protection for the investment put into collecting, arranging and presenting the data. So it seems this should not be a problem under US law in most cases, although it might be better to let a legal expert answer the question.

A small addition to that "AGF hat" part. It turns out that these twisted variations of WGS-84 do see widespread use outside map vendors' own maps, and that's not limited to the government's mandatory GCJ-02. The Ningbo Government's database of historical and cultural sites, for example, uses Baidu's BD-09 (twisted GCJ-02) system so that its internal APIs can use Baidu's maps smoothly. (Kudos to User:Siyuwj for pointing that out.) As administrative documents of state organs, such data should be public domain under China's copyright law.

These systems are like Newspeak, but are much more pervasive in terms of usage. Thanks to GCJ-02's early implementation, the common prole doesn't even know what WGS-84 looks like in China.

See also Wikidata:Property proposal/Coordinate reference system, a property proposal that may address this issue by assigning a property.

I doubt that proposal's usefulness in our case, as WGS84-like (including ITRS) lat-lon is fairly ubiquitous on our planet. Adding a property for solving Earth problems also means complicating things like Template:Coord, which is something I don't want to do for "systems" created out of bad faith and riddled with errors (look at GCJ-02's ellipsoid and its supposed data source).

The earlier suggestion to store GCJ02 data in a separate property seems preferable ..

For now, at least GCJ-02 data can be justified, since stuff like Geolocator and Coordinates use Google Maps. Baidu also has a click-to-see-coordinates feature, so I guess BD-09 data, although definitely limited to one proprietary source that everyone uses, may be justified as well.

I am still against seriously storing GCJ-02 data. The entire thing is created for obscurity from WGS-84, and given how reversible its deterministic part is there's no point in letting the clients do their own work.

MPhamWMF subscribed.

Closing out low/est priority tasks over 6 months old with no activity within last 6 months in order to clean out the backlog of tickets we will not be addressing in the near term. Please feel free to reopen if you think a ticket is important, but bear in mind that given current priorities and resourcing, it is unlikely for the Search team to pick up these tasks for the indefinite future. We hope that the requested changes have either been addressed by or made irrelevant by work the team has done or is doing -- e.g. upgrading Elasticsearch to a newer version will solve various ES-related problems -- or will be subsumed by future work in a more generalized way.

Does this belong to the Search team?