
Alternative, affordable, lower-barrier approach(es) to reconciliation
Open, Needs Triage · Public

Description

Epic / high-level task to discuss innovative/alternative approaches to Reconciliation that are

  1. Affordable to deploy and maintain even by non-technical people and by organizations with few or no financial, technical, or human resources
  2. Simple to integrate by software developers who want to incorporate lightweight and easy-to-use reconciliation processes in tools
  3. Providing clarity and configurability to non-technical, non-specialist end users

Event Timeline

(Repost from meta:Talk:Artificial_intelligence/Bellagio_2024)

By reconciliation I mean, for instance, checking which names in a dataset match people entities on Wikidata, or matching author and publisher names against VIAF. The current approach is to use an emerging protocol / API, which requires each provider of a dataset to have the technical and financial capacity to build and maintain a web service, making this process maintainable only by exceptionally well-resourced organizations in the Minority World. As a significant example, the Wikimedia movement hasn't been able to bring together this capacity for Wikidata and Wikibase (see [T244847]). I think that should be seen as one sign that a fresh approach would be extremely welcome.
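For concreteness, here is a minimal sketch of what a client sends to such a reconciliation endpoint, following the draft Reconciliation Service API conventions. The names and the entity type "Q5" (Wikidata's "human") are purely illustrative, not tied to any particular service:

```python
import json

# A batch of reconciliation queries, keyed q0, q1, ... as in the
# draft Reconciliation Service API. "Q5" (human) is a Wikidata type
# used here only as an example; each service advertises its own types.
queries = {
    "q0": {"query": "Douglas Adams", "type": "Q5", "limit": 5},
    "q1": {"query": "Ursula K. Le Guin", "type": "Q5", "limit": 5},
}

# Clients conventionally POST this as a form field named "queries",
# JSON-encoded as a single string:
payload = {"queries": json.dumps(queries)}
print(payload["queries"])
```

The protocol surface itself is small; the hard part is the web service behind it, which is exactly the capacity gap described above.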

I see this reconciliation process in practice a lot in my work (both inside and beyond Wikimedia), observing users, and I also hear input from developers who want to incorporate such reconciliation processes in their software. From both sides, more flexibility, clarity, configurability, and ease of use are strongly asked for. The ideal is an approach that is deployable even by an individual without financial resources or technical capacity, that is intuitive and clear for the people doing the matching, and that can be re-used with ease and flexibility by developers who build software that integrates a reconciliation process.

I regularly hear suggestions that artificial-intelligence / LLM-based approaches have potential here. By creating this ticket I want to indicate that I would be very excited and happy if this were investigated.

A family of solutions here, or even one flexible one, would be very impactful.

But @Spinster, perhaps this epic is too abstract to feel easy to triage. What sort of deployable thing do you have in mind? If not a web service, is this a tool? A library? Can you illustrate a demonstration use case in detail?

One of our plans with DB2Rest is to provide a simple instant Recon API for database tables.
It will be a web app, like OpenRefine, and will allow a user to instantly create a Recon API from local files or an existing database.

Imagine the following:

User has Existing Database

  • A user of some client software that can work with a Recon API (let's say not OpenRefine, since many other Recon Clients could eventually exist), whose organization already has some data in its database tables.
  • They would spin up DB2Rest locally (or on a server) on Linux, macOS, or Windows; we would provide a natively installable instance of DB2Rest (some Java files, DB drivers, config files) to make this painless for users.
  • They would work with the DB2Rest instance in a web browser (similar to OpenRefine), navigating to their local DB2Rest app at https://127.0.0.1:8088 and seeing a nice, simple web app where they point it at an existing database running locally, on their network, or remotely in the cloud.
  • DB2Rest would analyze the table content, ask the user some questions, and then configure the Recon API: caching, matching algorithms, etc. (This is the harder part for us, and we're working behind the scenes on some of it now.)
  • The user would then use their Recon Client of choice and add the endpoint URL of their DB2Rest instance.
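The last step ends with a client pointed at the DB2Rest endpoint. A hedged sketch of the kind of candidate list a Recon Client would then receive (field names follow Reconciliation Service API conventions; the IDs and scores are invented for illustration, not DB2Rest's actual output):

```python
# Hypothetical reconciliation response for one query, "q0".
# Each candidate carries an id, a display name, a relevance
# score, and whether the service considers it a confident match.
response = {
    "q0": {
        "result": [
            {"id": "row-1042", "name": "Douglas Adams",
             "score": 97.5, "match": True},
            {"id": "row-2211", "name": "Douglas Adamson",
             "score": 61.0, "match": False},
        ]
    }
}

# A client typically surfaces the highest-scoring candidate first:
best = max(response["q0"]["result"], key=lambda c: c["score"])
print(best["id"])
```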

User has no database, but only local Files
For those users who do not already have a database to point to, but want to reconcile with local files (say CSVs, SQL files, or other forms of data like JSON files), DB2Rest would have a "Create local DB with your own local files" option, where they choose their file(s) and DB2Rest then creates a local SQLite database by default. Why SQLite? Because it's one file. You can transfer it to a friend or throw it on a USB stick, and your friend could spin up their own Recon API with your data and reconcile with their own Recon Client of choice.
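A rough sketch of that import step, assuming a tiny CSV of names (the table and column names are illustrative): load the rows into a single-file SQLite database that the Recon API could then index.

```python
import csv
import io
import sqlite3

# Illustrative CSV content; in the real app the user would pick a file.
csv_text = "name,country\nDouglas Adams,UK\nUrsula K. Le Guin,USA\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# One file ("recon.db") holds the whole database, so it can be
# copied to a USB stick or sent to a friend as-is.
conn = sqlite3.connect("recon.db")
conn.execute("DROP TABLE IF EXISTS people")
conn.execute("CREATE TABLE people (name TEXT, country TEXT)")
conn.executemany("INSERT INTO people VALUES (:name, :country)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(count)
```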

There will be other DB types available for a user to choose from, depending on their needs and data. We'll try to make the web app easy to understand, and do preliminary analysis on their datasets to help them pick and spin up a local database. We're thinking of SQLite and MongoDB Community initially.

The long-term part comes into play with the matching (search) engine, where we hope to make it relatively painless for users to auto-choose matching algorithms. Behind the scenes we may support Apache Solr (which now has DenseVectorField), Meilisearch (which has attribute ranking and experimental AI vector embeddings), and other open-source search engines.

We had to put the extra effort into first building DB2Rest as an instant API program and adding support for lots of database dialects (Oracle, even older 9i/10; PostgreSQL; MySQL; MongoDB; SQLite; etc.). The next step is adding an instant Recon API feature and a web application interface for ease of use by non-developers. We hope to have something to demonstrate by the end of 2024 or a bit later.

Having said the above, the reality is this: DB2Rest might not have to do anything!
I am actively working to convince a few people at Meilisearch and some other open-source projects to initiate a Recon API extension and frontend config tool, and to help them do so. The only problem is that it would be limited to the databases each open-source search engine supports. However, Meilisearch and Apache Solr are in fact specialized databases and can import files into their embedded databases, whereas DB2Rest is trying to support as many databases as are useful to users.

For users, it would be painless: download the Instant Recon API software, install it, run it locally, and use its web-browser app to point at an existing DB or import local files. A Recon API is then created and exposed for Recon Clients like OpenRefine, and the web app can be used to tweak ranking, scoring, and property relevancies as needed.

Personally, I'm thinking of just leveraging Apache Solr, creating custom integration with Wikidata Schemas (to deduce features/properties for Solr features and relevancy ranking), and then calling it...

OpenRecon

As a community, we could begin OpenRecon (based on Apache Solr, Tailwind CSS, etc.) in a new GitHub project and begin to work on it. If the OpenRefine org wanted to host the project under /OpenRefine/openrecon, maybe even better for all.

This comment was removed by Spinster.

Aside: the link to DB2Rest should be db2rest.com (GitHub).

I am wondering if the Wikidata:Embedding Project can play some role here?