Make available raw dumps of:
- image suggestions
  - ALIS (image_suggestions/cassandra.py)
  - SLIS (image_suggestions/section_image_suggestions.py)
- section topics (section_topics/pipeline.py)
- entity/image pairs (image_suggestions/entity_images.py)
At a minimum, we should publish a simple manual dump once. Ideally, dumps would be published automatically on a monthly basis.
Note that publishing dumps risks draining the available data (i.e. if all suggested images end up being added, then no suggestions remain for notifications or the newcomer tool).
Given the sheer amount of data, this is most likely a purely theoretical issue (and even then, newly added images will likely yield new suggestions elsewhere), but it still merits a quick worst-case discussion with the existing consumers of this data to ensure the case is handled adequately in the unlikely event it happens. [DONE] See T337925#8906957
AC:
- Choose a format: compressed CSV
- Publish the latest snapshots at https://analytics.wikimedia.org/published/datasets/one-off/
- Select a location to publish the dumps (see also T337253: [M] Publish image suggestions evaluation data): https://dumps.wikimedia.org/other/
- Make image_suggestions/entity_images.py write its output to HDFS (see the first sketch after this list)
- Implement a script that converts the outputs to compressed CSVs; dumps should contain as much data as possible (e.g. no confidence cutoff). See the second sketch after this list.
- Automate monthly publishing by adding hdfs_rsync_jobs to stats.pp in the operations-puppet repo
- Document the data in the respective readme.html files
  - ALIS
  - SLIS
  - section topics
  - entity/image pairs
- Add paths to other_index.html
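
A minimal sketch of how image_suggestions/entity_images.py could persist its output to HDFS, assuming the job already builds a Spark DataFrame of entity/image pairs. The HDFS path and the placeholder source table are hypothetical, not the final implementation:

```python
# Sketch, assuming entity_images.py already produces a Spark DataFrame.
# The source table and HDFS output path below are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entity_images").getOrCreate()

# ... existing logic that produces the entity/image pairs DataFrame ...
entity_images_df = spark.read.table("some_intermediate_table")  # placeholder

# Write the result to HDFS so the dump/conversion job can pick it up later.
(
    entity_images_df
    .write
    .mode("overwrite")
    .parquet("hdfs:///user/analytics/image_suggestions/entity_images")
)
```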
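And a sketch of the conversion script, assuming the pipeline outputs are stored on HDFS as Parquet; the dataset paths, the single-file coalesce, and the gzip codec are assumptions for illustration:

```python
# Sketch of a conversion job: read each pipeline output from HDFS and
# re-write it as a gzip-compressed CSV per dataset. Paths are examples only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image_suggestions_dumps").getOrCreate()

DATASETS = {
    # dataset name -> HDFS location of the pipeline output (hypothetical)
    "alis": "hdfs:///user/analytics/image_suggestions/alis",
    "slis": "hdfs:///user/analytics/image_suggestions/slis",
    "section_topics": "hdfs:///user/analytics/section_topics",
    "entity_images": "hdfs:///user/analytics/image_suggestions/entity_images",
}

for name, path in DATASETS.items():
    df = spark.read.parquet(path)
    # No confidence cutoff: dump every row the pipeline produced.
    (
        df.coalesce(1)  # one CSV file per dataset, for easier downloading
        .write
        .mode("overwrite")
        .option("header", True)
        .option("compression", "gzip")
        .csv(f"hdfs:///tmp/image_suggestions_dumps/{name}")
    )
```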
Update
One-off datasets: