Paste P17254

(An Untitled Masterwork)
Authored by EBernhardson on Sep 8 2021, 4:49 PM.
Web UI for cirrus debug/devel features:
- Settings dump
- Mappings dump
- Copy version of settings+mappings suitable to create index with curl
- cirrusDumpQuery
- cirrusDumpResult
- cirrusExplain
- cirrusUserTesting
Top level idea is to make it easy to access all of these things. Could be
a userscript run on-page in the wiki. Could be an SPA run from tool labs
(or even people.wikimedia.org).
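A minimal sketch of the "easy access" idea, assuming each debug feature is toggled by appending its name as a URL parameter to an ordinary full-text search. The base URL, the helper name `debug_search_urls`, and the placeholder value `1` are assumptions for illustration; some flags may expect specific values, and cirrusUserTesting is left out because it needs one.

```python
from urllib.parse import urlencode

# Flag names come from the list above; the value "1" is a placeholder assumption.
DEBUG_FLAGS = ("cirrusDumpQuery", "cirrusDumpResult", "cirrusExplain")


def debug_search_urls(base_url, query):
    """Return {flag: url} for a full-text search with each debug flag set."""
    urls = {}
    for flag in DEBUG_FLAGS:
        params = {"search": query, "fulltext": "1", flag: "1"}
        urls[flag] = base_url + "?" + urlencode(params)
    return urls


if __name__ == "__main__":
    # Hypothetical wiki endpoint, used only to show the output shape.
    for flag, url in debug_search_urls("https://wiki.example/w/index.php", "hello world").items():
        print(flag, url)
```

A userscript or SPA would do the same thing client-side and render one link (or one fetched JSON blob) per flag.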
============
Docker setup to initialize elasticsearch, import the latest cirrus dump, and
attach a kibana instance for UI. Probably with a modified mapping more
amenable to kibana inspection.
============
Some script to manage elasticsearch allocation manually via api? Pointless, but
perhaps fun.
===========
phabricator formatted export for jupyter
- problem: images?
-- Seems we would need to upload them separately and then reference them in the final output
-- There is an api for this, but then we can't just emit something to paste into a field;
the whole export needs to happen over the api then.
- better, but worse: data-uri's would be great. But I don't know if phab is built for megabyte-sized posts. It also
doesn't support data-uri's. Browsers also hate when you copy/paste excessive amounts of data.
==========
Custom implementation to find similar images in commons:
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.5151&rep=rep1&type=pdf
- http://www.deepideas.net/building-content-based-search-engine-quantifying-similarity/
- Convert image into a feature vector
- Use clustering to generate an image signature
- Find k-nearest-neighbors via Earth Mover Distance (EMD), can utilize pyemd library.
- It's not at all obvious how the signature + weight gets plugged into pyemd
- EMD is expensive, no clue how this would scale to millions of images
- This would probably perform poorly; it is more interesting as a way to understand some of the history of similar image retrieval
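One possible way to plug a signature into an EMD solver, sketched under assumptions: treat every cluster center from both images as a bin, give each image a histogram that is zero on the other image's bins, and use pairwise Euclidean distance between centers as the ground distance. The helper name `emd_inputs` and the `(center, weight)` signature layout are hypothetical. The three returned values match the shape of pyemd's `emd(first_histogram, second_histogram, distance_matrix)` arguments, which would still need converting to NumPy float64 arrays.

```python
import math


def emd_inputs(sig_a, sig_b):
    """Turn two signatures into (hist_a, hist_b, distance_matrix).

    Each signature is a list of (center, weight) pairs, where center is a
    tuple of floats (a cluster centroid) and weight is that cluster's mass.
    """
    centers = [c for c, _ in sig_a] + [c for c, _ in sig_b]
    total_a = sum(w for _, w in sig_a)
    total_b = sum(w for _, w in sig_b)
    # Both histograms span all bins; bins owned by the other image get 0.
    hist_a = [w / total_a for _, w in sig_a] + [0.0] * len(sig_b)
    hist_b = [0.0] * len(sig_a) + [w / total_b for _, w in sig_b]
    # Ground distance: Euclidean distance between every pair of centers.
    dist = [[math.dist(p, q) for q in centers] for p in centers]
    return hist_a, hist_b, dist
```

Normalizing the weights makes both histograms sum to 1, which sidesteps the question of how to penalize unequal total mass. The distance matrix grows with the square of the combined cluster count, which is part of why scaling to millions of images looks doubtful.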
=========
https://github.com/beniz/deepdetect.git ?
- Use pre-trained ML to detect objects in images and then label those objects.
- Can compare similarity of objects detected for similar images. Can probably
extend with color information
- Do we actually have a use case for images similar to other images? Perhaps on upload?
==========
Elasticsearch cluster balance simulator
- Simulate and evaluate how the cluster balancing performs under various simulated conditions
- no way this could be done in a weekend hackathon. It would probably be
completely wrong as well and simulate some idealized cluster that doesn't act
like ours.
==========
Prototype Lire plugin for elasticsearch
- Lire = Lucene Image REtrieval
- I know nothing about it, other than it exists
- A plugin already exists for solr, so how hard could it be?
- Maybe try it out standalone with some small test set to see what it does
==========
Potential daemon for serving up similarity, bring your own image vector.
- https://github.com/facebookresearch/faiss/wiki
- Could use vectors from above "custom implementation", but probably not too fancy
- Could use opencv: https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_surf_intro/py_surf_intro.html
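The "bring your own vector" contract can be sketched without any library: the caller supplies vectors keyed by some identifier, and the daemon answers with the k closest ones. This is a plain exhaustive L2 scan, not faiss's API; it only shows the lookup that a real index would be there to speed up. The function name `nearest_vectors` is hypothetical.

```python
import heapq
import math


def nearest_vectors(query, vectors, k):
    """Return the k (distance, key) pairs closest to query by L2 distance."""
    scored = ((math.dist(query, vec), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)
```

Because the scan touches every stored vector, cost grows linearly with collection size, which is the motivation for putting an approximate index behind the same interface.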
=========
Updatable doc values in elasticsearch!
* Guaranteed to suck!
* Needs to query doc value on update to put into new document (in source?)
* Should there be some stupid hack that makes requesting field from source return the doc value?
* Otherwise, different results from different places. Fun!
==========
Segment-level sidecar data?
==========
Scroll to section / Scroll to snippet from search results
* Javascript string search?
* Elasticsearch highlighter to be aware of heading positions in text?
* lookup
==========
Index pHash for images into elasticsearch
- Do some crazy expensive query to measure hamming distance
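A minimal sketch of the distance measure itself, assuming each pHash is stored as an integer: XOR the two hashes and count the set bits. The helper names and the in-memory dict are hypothetical stand-ins for whatever elasticsearch would hold, and the linear scan mirrors why the query would be expensive.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def nearest_by_hamming(query_hash, hashes, max_distance):
    """Return (distance, name) pairs within max_distance, closest first."""
    hits = [(hamming_distance(query_hash, h), name) for name, h in hashes.items()]
    return sorted(hit for hit in hits if hit[0] <= max_distance)
```

For example, with hashes `{"a.jpg": 0b1111, "b.jpg": 0b1000, "c.jpg": 0b0000}` and query `0b1110`, a `max_distance` of 2 keeps `a.jpg` (1 bit apart) and `b.jpg` (2 bits apart) and drops `c.jpg` (3 bits apart).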
=========
extract transliteration from mediawiki into a composer library
build a small server over the library as a transliteration service
=========
Make progress on extension.json for CirrusSearch
========
Expose cirrussearch sort orders to api/ui
========
Convert cindy's actual runner into a docker container