
Jupyter notebook for reconciliation
Closed, Resolved · Public

Description

After learning the basics of Jupyter notebooks, I would like to create one to tackle Wikidata reconciliation.

I am interested in reconciling places with coordinates. We can also work together on different reconciliation scenarios.

This is my imagined workflow. Maybe it's not the best one. All ideas are more than welcome!

Read the data to be reconciled

  • Import a csv file including coordinate data.
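The import step above can be sketched with the standard library alone. The column names (`name`, `lat`, `lon`) are assumptions for illustration; the actual dataset may use different headers.

```python
import csv
import io

# Hypothetical sample standing in for the real CSV file; the actual
# column names in the dataset may differ.
sample = """name,lat,lon
Salmon Lake,43.283,-74.317
Long Pond,43.975,-74.426
"""

def read_places(fileobj):
    """Parse rows into dicts, converting the coordinates to floats."""
    reader = csv.DictReader(fileobj)
    places = []
    for row in reader:
        row["lat"] = float(row["lat"])
        row["lon"] = float(row["lon"])
        places.append(row)
    return places

places = read_places(io.StringIO(sample))
print(places[0])  # {'name': 'Salmon Lake', 'lat': 43.283, 'lon': -74.317}
```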

Read best matches from Wikidata for each entry

  • The query will produce a table of data for each entry
  • Sort entries by the closest coordinates
  • Also read a set of chosen properties.
  • Filter by a property if needed

Evaluate the data
Try different methods; ideas are listed below. Any single step is useful on its own.

  • Measure distance of the coordinates, set distance threshold, rate match based on distance
  • Name matching: Take aliases and languages into account, rate based on names
  • Authority ID matching: check whether an authority ID from the source data is present in Wikidata.
  • Type matching: the item is an instance of a suitable Wikidata class, or of one of its subclasses.
  • Geographic shape. Find out if the coordinate is inside the shape of a known Wikidata item.
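The first evaluation idea, distance with a threshold, can be sketched in pure Python with the haversine formula. The threshold value and the linear scoring are illustrative choices, not part of any existing tool.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def rate_by_distance(d_km, threshold_km=5.0):
    """Score 100 at zero distance, falling linearly to 0 at the threshold."""
    return max(0.0, 100.0 * (1 - d_km / threshold_km))

# Two nearby points (a few hundred metres apart) as a smoke test
d = haversine_km(60.1699, 24.9524, 60.1695, 24.9525)
print(round(d, 3), round(rate_by_distance(d), 1))
```

In a real notebook you would compute this score for each candidate returned by the Wikidata query and combine it with the name- and type-matching scores.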

How to use the matched data?

  • Mass match highly rated matches. Update Wikidata directly?
  • Export csv to be used in another tool like OpenRefine
  • Explore non-matches individually in the best possible ways: for example, change the criteria, omit some, or choose only some.
  • Mark / create new items.

Event Timeline

We have a GitHub repo for this now at https://github.com/AvoinGLAM/jupyter-reconciliation. The dataset is uploaded there.

I think that this will still work as an example for reading a CSV/TSV row by row and updating Wikidata based on values from the file.
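The row-by-row pattern described here can be sketched as follows. The column names and the placeholder property `P9999` are hypothetical; a real script would hand the edits to pywikibot or QuickStatements instead of just collecting them.

```python
import csv
import io

# Hypothetical TSV: a matched QID per row plus a value to write.
# Real column names depend on the exported file.
sample = "qid\tauthority_id\nQ30607084\tABC123\nQ30607517\tDEF456\n"

def build_edits(fileobj, prop="P9999"):
    """Yield one (item, property, value) edit per row; the actual
    Wikidata update would be done by pywikibot or QuickStatements."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    for row in reader:
        yield (row["qid"], prop, row["authority_id"])

edits = list(build_edits(io.StringIO(sample)))
print(edits[0])  # ('Q30607084', 'P9999', 'ABC123')
```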

Hey @Susannaanas,
I would like to contribute to this task, would you describe which tasks are to be done and how I may proceed.
Thanks in advance.

Hi @Palak199! @Fuzheado might have prepared a sketch notebook already, so I think we would build upon that. We may also collect pieces of code/workflow and put them together at the hackathon.

I think we will

  1. Learn the basics for those who don't already know Jupyter. The event is T281420 at 16 UTC Saturday, see the program https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2021/Schedule
  2. Meet in Jitsi after that and coordinate work for a main notebook. We must schedule this.
  3. Work independently or together (depending on what makes sense)

It is possible to start working already before the Intro to PAWS/Jupyter notebooks for Python beginners.

Thought it might be worth posting the suggestions for geo-reconciling here.

It looks like the meeting is at 1AM my time. I'm not sure that I'll be able to make it. Hopefully the major points make it to the Telegram channel or this ticket. I'll collect my thoughts and update as appropriate.

For starters, I'm more used to using OpenRefine than Jupyter notebooks. OpenRefine supports reconciling values to a source using an external reconciliation service. I.e. the service is not built into the OpenRefine app, but rather is connected to via HTTP. The communication protocol is described here.

There is a list of reconciliation services at the Reconciliation Census. Strictly speaking, what's listed under Services are closer to "server frameworks". I.e. tools for building services. You can find services under Clients -> Testbench.

I wrote one of these server frameworks. Namely csv-reconcile. My main goal was to make it as easy as possible to fire up a service using a simple CSV file. Obviously, there's a lot of variability here (CSV or TSV? Quoted entries or not? etc.) so I tried to make it configurable without the configuration getting in the way for the simplest cases.

As far as integrating with Jupyter notebooks goes, you could start up the service and interact with it through HTTP. Doing that would give you access to any service available to OpenRefine. The Python project reconciler listed under Clients in the Reconciliation Census does just that.
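To make the HTTP route concrete, here is a sketch of a batch in the reconciliation-protocol query shape that such a service accepts. The exact fields supported vary by service; the type QID and the POST form-field convention are stated as assumptions, and the network call itself is left commented out.

```python
import json

def build_queries(names, qtype=None, limit=5):
    """Batch of queries in the OpenRefine reconciliation-protocol shape:
    {"q0": {"query": ..., "limit": ...}, "q1": ...}."""
    queries = {}
    for i, name in enumerate(names):
        q = {"query": name, "limit": limit}
        if qtype:
            q["type"] = qtype
        queries[f"q{i}"] = q
    return queries

payload = build_queries(["Salmon Lake", "Long Pond"], qtype="Q23397")  # Q23397 = lake
print(json.dumps(payload))

# A client would then POST this as the form field "queries" to the
# service endpoint, e.g. with requests:
#   requests.post(endpoint, data={"queries": json.dumps(payload)})
```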

Another possibility for my server tool is to use it directly as a Python library. The only thing that's missing to do that is a way to build a query and that shouldn't be too hard to add. This latter may be an easy way to add reconciliation against an arbitrary CSV file to Jupyter notebooks.

Sorry if much of this was already known to people here. I just thought the background might be useful to newcomers.

Also @Susannaanas, I see the project has a zip file with a CSV file in it. I'm assuming the goal is to reconcile these values with Wikidata. Is that right?

It's worth pointing out that given two lists you're trying to reconcile you can approach it one of two ways. You can view either list as being the target (you can think of this as the list of "true" values) and the other as having values you're looking up against that target.

When the target is all of Wikidata, you can imagine that this is not an easy task. One approach is to make the target some subset of Wikidata. Reconciliation in OpenRefine allows you to define the "type" of values you're reconciling against as well as other fields to filter out target values not matching those fields. IMHO, this is a fairly crude way to restrict the target. I think a more flexible way is to use SPARQL to generate your target list explicitly. This way you have the full flexibility of the SPARQL language to choose which values you want to reconcile against (i.e. use as a target list). I created csv-reconcile to allow you to reconcile against a hand-generated list. You simply export your SPARQL results to TSV and then use that with csv-reconcile.
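As an illustration of generating a target list with SPARQL, the query below selects lakes in New York State with coordinates. The QIDs and property paths are my assumptions for this example; you would run it at the Wikidata Query Service and export the results to TSV for use with csv-reconcile.

```python
# Hypothetical SPARQL query for a target list: lakes (Q23397) located in
# New York State (Q1384) that have coordinates (P625). P31/P279* matches
# instances of lake or any of its subclasses.
query = """
SELECT ?item ?itemLabel ?coord WHERE {
  ?item wdt:P31/wdt:P279* wd:Q23397 ;
        wdt:P131* wd:Q1384 ;
        wdt:P625 ?coord .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
# Run at https://query.wikidata.org and export the result table to TSV.
print(query.strip().splitlines()[0])
```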

With that said, if the goal is to reconcile the CSV in the project against Wikidata, is there any more information about what this list is? Is it a list of locations in Finland? Are they geographic features? Cities? etc.

Sorry for posting too much, but since I won't make the meeting, I thought it would be best to get this down here.

For clarity, my csv-reconcile service is able to reconcile using geographic distance to do the fuzzy matching. The wikidata reconciliation service currently does not allow that, but there is an open issue to add it.

Actually, csv-reconcile allows you to make your own fuzzy matching routine using plugins. Geo-matching is done with the csv-reconcile-geo plugin. It uses geopy to calculate the distance. You might want to use that library directly.

I did some refactoring at HEAD to be able to use csv-reconcile as a module rather than as a service. I think you'd probably want to add a wrapper on top of this, but the following worked.

import sqlite3
from csv_reconcile.initdb import init_db
import csv_reconcile_dice
from csv_reconcile.score import reconcileStrings

dbname = '/tmp/reconcile.db'
db = sqlite3.connect(dbname)

# This line is required
db.row_factory = sqlite3.Row

# Remove 'lake', 'reservoir' and 'pond' for the purposes of scoring
scoreOptions = dict(stopwords=('lake', 'reservoir', 'pond'))

# Normalize options
csv_reconcile_dice.processScoreOptions(scoreOptions)

init_db(db,
        '/tmp/NY-lakes.tsv',
        'item',
        'itemLabel',
        csvkwargs=dict(delimiter='\t'),
        scoreOptions=scoreOptions)

items = ['Salmon Lake', 'Lake Beaver', 'Long Pond']

ret = reconcileStrings(db, items, threshold=30.0, limit=5, **scoreOptions)

db.close()

from pprint import pprint as pp
pp(ret)

Generating the following output:

$ poetry run python /tmp/testing.py 
/Users/douglasmennella/Library/Caches/pypoetry/virtualenvs/csv-reconcile-6QidCc17-py3.7/lib/python3.7/site-packages/normality/__init__.py:72: ICUWarning: Install 'pyicu' for better text transliteration.
  text = ascii_text(text)
[('Salmon Lake',
  {'result': [{'id': 'Q30607084',
               'match': True,
               'name': 'Salmon Lake',
               'score': 100.0},
              {'id': 'Q48739795',
               'match': False,
               'name': 'Little Salmon Lake',
               'score': 58.8235294117647},
              {'id': 'Q30609970',
               'match': False,
               'name': 'Balsam Lake',
               'score': 40.0}]}),
 ('Lake Beaver',
  {'result': [{'id': 'Q30607832',
               'match': False,
               'name': 'Beaver Lake',
               'score': 100.0},
              {'id': 'Q48739489',
               'match': False,
               'name': 'Belvedere Lake',
               'score': 46.15384615384615},
              {'id': 'Q4875981',
               'match': False,
               'name': 'Beacon Reservoir',
               'score': 40.0},
              {'id': 'Q48742671',
               'match': False,
               'name': 'Weatherby Pond',
               'score': 30.76923076923077},
              {'id': 'Q5181892',
               'match': False,
               'name': 'Cranberry Lake',
               'score': 30.76923076923077}]}),
 ('Long Pond',
  {'result': [{'id': 'Q30607517',
               'match': True,
               'name': 'Long Pond',
               'score': 100.0},
              {'id': 'Q6675638',
               'match': False,
               'name': 'Loon Lake',
               'score': 66.66666666666667},
              {'id': 'Q30608273',
               'match': False,
               'name': 'Iron Lake',
               'score': 33.333333333333336},
              {'id': 'Q6478417',
               'match': False,
               'name': 'Lake Washington',
               'score': 33.333333333333336},
              {'id': 'Q30623252',
               'match': False,
               'name': 'Colton Flow',
               'score': 30.76923076923077}]})]
$

It might be nice to represent the reconciler as an object so that you can keep several around and apply them. You likely also want to hide the details of the backing sqlite db. There are a couple of gotchas here, but nothing too severe.
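The object idea above could look something like this generic sketch: a class holding one target list plus a pluggable scoring function, so several reconcilers can coexist. The class and the bigram-Dice scorer are illustrative only, not csv-reconcile's actual API.

```python
class Reconciler:
    """Holds one target list plus a scoring function, so several
    reconcilers (e.g. one per SPARQL-generated list) can coexist."""

    def __init__(self, targets, score):
        # targets: iterable of (id, label); score: (query, label) -> 0..100
        self.targets = list(targets)
        self.score = score

    def reconcile(self, query, threshold=30.0, limit=5):
        """Return up to `limit` candidates scoring at or above `threshold`."""
        scored = [
            {"id": tid, "name": label, "score": self.score(query, label)}
            for tid, label in self.targets
        ]
        scored = [s for s in scored if s["score"] >= threshold]
        scored.sort(key=lambda s: s["score"], reverse=True)
        return scored[:limit]

def bigrams(s):
    """Set of character bigrams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Sørensen–Dice coefficient over character bigrams, scaled to 0..100."""
    x, y = bigrams(a), bigrams(b)
    return 200.0 * len(x & y) / (len(x) + len(y)) if x and y else 0.0

r = Reconciler([("Q30607084", "Salmon Lake"), ("Q30607517", "Long Pond")], dice)
print(r.reconcile("Salmon Lake")[0]["id"])  # Q30607084
```

Hiding the sqlite connection behind such an object would also let you swap the backing store without changing notebook code.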

@Susannaanas It would be great to get answers to my question above about what the sample data is meant to reconcile to. If I knew the intent I would have used that as an example of what's possible.

Also, since it's called "csv-reconcile", you're assumed to have a CSV around. You could, however, initialize the database with just a list of values rather than reading it off of disk. I don't know whether that offers any practical benefit, as I'm guessing you will usually have a CSV file handy anyway.

Thanks for participating in the Wikimedia Hackathon 2021! We hope you had a great time.

  • If this task was being worked on and resolved at the Hackathon: Please change the task status to "resolved" via the Add Action...Change Status dropdown.
  • If this task is still valid and should stay open: Please add another active project tag to this task, so others can find this task (as likely nobody in the future will look back at Wikimedia-Hackathon-2021 tasks when trying to find something they are interested in).
  • In case there is nothing else to do for this task, or nobody plans to work on this task anymore: Please set the task status to "declined".

Thank you,
your Hackathon venue housekeeping service

No reply to the previous comment; assuming there are no follow-up actions and closing this task.