Context
As part of the WMDE x Purdue University program in which students have been looking for mismatches, scripts were created to make uploading mismatch files easier. It has been discussed that these scripts could be added to the wmde/wikidata-mismatch-finder repo on GitHub as a resource for the broader community.
These scripts can be found in the root of the wikidata/Purdue-Data-Mine-2024 repo on GitHub. The files and descriptions of their use are:
- check_mismatch_file.py (see the first sketch after this list)
  - Loads a target CSV into a pandas DataFrame
  - Includes the function check_mf_formatting, which checks that the file is valid for upload according to the Mismatch Finder user guide
  - Reports that the file is ready for upload, or, if the file is not valid, prints the steps needed to fix it
  - At the start of the process, also warns the user if the file is larger than the upload file size limit of 10 MB (see the next file)
- split_mismatch_file.py (see the second sketch after this list)
  - Written in response to the upload limit of 10 MB for the Mismatch Finder API (see T360436)
  - Takes a path to a CSV and, if the file is larger than the upload limit, creates CSV subsets in a directory, each below the limit
  - A path to where the subset CSVs should be saved can also be passed, and the resulting directory is checked to make sure it contains only CSVs
  - Whether the original CSV should be deleted can also be passed as an argument
- upload_mismatches.py (see the third sketch after this list)
  - Takes a path to a CSV or to a directory of CSVs
  - Uses Python requests to make the HTTP request (the equivalent of the documented cURL command), with r.raise_for_status() raising an exception and the returned errors being printed if the upload is unsuccessful
  - Further arguments include the required access token, a description, the external source, the URL for the external source, and verbosity
  - Assertions are made to ensure the arguments are correct
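The first sketch below approximates the kind of check that check_mf_formatting performs. It is not the script's actual implementation: the expected column names and the file name mismatches.csv are assumptions based on the Mismatch Finder user guide, so refer to the script in Purdue-Data-Mine-2024 for the authoritative version.

```python
import os
import pandas as pd

# Assumed import-file columns (check the user guide for the current list).
EXPECTED_COLUMNS = [
    "item_id", "statement_guid", "property_id", "wikidata_value",
    "meta_wikidata_value", "external_value", "external_url", "type",
]
UPLOAD_LIMIT_BYTES = 10 * 1024 * 1024  # 10 MB upload limit


def check_mismatch_csv(csv_path: str) -> None:
    """Warn about file size and report missing columns, or confirm readiness."""
    if os.path.getsize(csv_path) > UPLOAD_LIMIT_BYTES:
        print("Warning: file exceeds the 10 MB upload limit and should be split.")

    df = pd.read_csv(csv_path)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        print(f"File is not ready for upload. Missing columns: {missing}")
    else:
        print("File is ready for upload.")


check_mismatch_csv("mismatches.csv")  # hypothetical file name
```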
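The second sketch outlines one way the splitting could work, assuming a simple row-count-based strategy; the real split_mismatch_file.py may differ in its splitting logic and argument names, and it additionally verifies that the output directory contains only CSVs.

```python
import math
import os
from pathlib import Path

import pandas as pd

UPLOAD_LIMIT_BYTES = 10 * 1024 * 1024  # 10 MB Mismatch Finder upload limit


def split_mismatch_csv(csv_path: str, output_dir: str, delete_original: bool = False) -> None:
    """Split a CSV that exceeds the upload limit into smaller CSVs in output_dir."""
    file_size = os.path.getsize(csv_path)
    if file_size <= UPLOAD_LIMIT_BYTES:
        print("File is already below the upload limit; no split needed.")
        return

    df = pd.read_csv(csv_path)
    # Estimate how many subsets are needed so each stays below the limit.
    n_subsets = math.ceil(file_size / UPLOAD_LIMIT_BYTES)
    rows_per_subset = math.ceil(len(df) / n_subsets)

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(n_subsets):
        subset = df.iloc[i * rows_per_subset:(i + 1) * rows_per_subset]
        subset.to_csv(out / f"{Path(csv_path).stem}_{i}.csv", index=False)

    if delete_original:
        os.remove(csv_path)
```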
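The third sketch condenses the upload step with Python requests. The endpoint URL and form field names here are assumptions based on the Mismatch Finder API documentation rather than taken from upload_mismatches.py itself, so treat them as illustrative only.

```python
import requests

API_URL = "https://mismatch-finder.toolforge.org/api/imports"  # assumed endpoint


def upload_mismatch_csv(csv_path: str, access_token: str, description: str,
                        external_source: str, external_source_url: str) -> None:
    """POST a mismatch CSV to the Mismatch Finder API and surface any errors."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
    }
    data = {
        "description": description,
        "external_source": external_source,
        "external_source_url": external_source_url,
    }
    with open(csv_path, "rb") as f:
        r = requests.post(API_URL, headers=headers, data=data,
                          files={"mismatch_file": f})

    if not r.ok:
        # Print the API's error details before raising, mirroring the script's behavior.
        print(r.text)
    r.raise_for_status()
    print(f"Uploaded {csv_path} successfully.")
```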
Open questions
I've found using these scripts to upload mismatches much easier than using cURL, where the errors were not returned, or figuring out where all the needed arguments should go in a request interface like Postman. Whether the second script (split_mismatch_file.py) should be folded into the third (upload_mismatches.py) is definitely something that should be considered based on end-user feedback.
Please let me know if there are any questions!