Mon, Sep 13
Thu, Sep 9
Hey @mforns thanks for starting this.
Aug 6 2021
Aug 4 2021
PR at https://github.com/mirrys/ImageMatching/pull/28
All checks (main branch) are green again.
Aug 3 2021
Many thanks for this! Just wanted to give an ack that login on the host worked.
Jul 26 2021
Jul 23 2021
Jul 14 2021
The June run completed successfully on 2021-07-13 at 16:00 UTC / 18:00 CEST.
Jul 13 2021
I have a couple of questions re integration:
Jul 2 2021
Disclaimer: total MW noob here :).
Jul 1 2021
Jun 24 2021
Jun 23 2021
The datasets we generated for PoC are available at https://analytics.wikimedia.org/published/datasets/one-off/platform-imagematching/api/.
Jun 21 2021
Jun 10 2021
I'd prefer not to have to maintain a fork, but this change is not on the critical path for now. No rush :).
Jun 9 2021
The May training/ingestion run completed successfully.
I would have opened a PR for this, but I wanted to validate one of our use cases first.
wmfdata helper methods set spark.driver.memory via SparkSession's builder. This setting is ignored when Spark runs in client mode, which is the default for the configs you ship: in client deployment mode, spark.driver.memory must be set before the JVM starts. For example, we should pass the value to spark-submit as spark-submit --driver-memory <size>.
You can find more about this behaviour in Spark's documentation: https://spark.apache.org/docs/latest/configuration.html.
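For what it's worth, a minimal sketch of the workaround (the helper name build_submit_cmd is hypothetical; --driver-memory and --deploy-mode are real spark-submit flags):

```python
def build_submit_cmd(app, driver_memory="8g"):
    # Hypothetical helper: in client mode the driver JVM is already running
    # by the time SparkSession.builder config is read, so driver memory has
    # to be fixed on the spark-submit command line instead.
    return ["spark-submit", "--deploy-mode", "client",
            "--driver-memory", driver_memory, app]

cmd = build_submit_cmd("train.py", driver_memory="16g")
```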
Jun 8 2021
- DAG dir and distribution
We'll need to set a directory in which the Airflow scheduler will look for DAG files. Perhaps we can just add an airflow/dags directory in refinery and configure the scheduler to look there?
This will be determined per instance. For now we are using refinery/airflow/dags for the analytics instance.
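If it helps, one way to wire this up is via Airflow's documented environment-variable override for the [core] dags_folder setting (the path below is just the relative path mentioned above, not a confirmed deployment location):

```python
import os

# AIRFLOW__CORE__DAGS_FOLDER is Airflow's standard env override for
# [core] dags_folder; the value here is a hypothetical example path.
os.environ["AIRFLOW__CORE__DAGS_FOLDER"] = "refinery/airflow/dags"
```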
Jun 7 2021
Adding a summary of https://github.com/mirrys/ImageMatching/pull/26#issuecomment-849519963 for posterity:
Jun 3 2021
I had a chat with @ArielGlenn today; I can help with a one-off analysis, but I'd need to understand the needs and scope. Before moving forward, let's make sure we would not be replicating analysis work already made available by @Addshore.
May 31 2021
May 27 2021
We dedicated some time to fine-tuning the Spark job (not the code itself, but the cluster that executes it) and troubleshooting out-of-memory errors caused by extracting labels for all languages. Our findings are here: https://github.com/mirrys/ImageMatching/pull/26#issuecomment-849519963
PoC code developed for this spike can be found at https://github.com/gmodena/wmf-streaming-imagematching/pulls. The stated goal of a demo was not achievable within the budgeted effort; a functionally equivalent aggregation query has been provided instead.
This PoC shows basic approaches to packaging a Flink application, and the moving parts required to deploy clusters atop YARN on WMF's Hadoop cluster.
The Image Matching model is trained with a monthly schedule. During the month, the state of a recommendation can change. For example:
- A recommendation has been rejected and should not be offered again.
- A recommendation has been accepted, a page is now illustrated and should not receive further recommendations.
- A page has been illustrated by a workflow external to ImageMatching and should not receive further recommendations.
With our current setup, we'll need to wait until the new training completes to see these changes reflected in the data.
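To make the staleness issue concrete, here's a toy sketch of the kind of filter a serving layer could apply between training runs (the function, field names, and events structure are all hypothetical, not part of the current pipeline):

```python
def still_valid(rec, events):
    """Hypothetical check: drop recommendations invalidated since the
    last monthly training run."""
    if rec["page"] in events["rejected"]:
        return False  # rejected: should not be offered again
    if rec["page"] in events["illustrated"]:
        return False  # page got an image (via our flow or an external one)
    return True

recs = [{"page": "Q1"}, {"page": "Q2"}, {"page": "Q3"}]
events = {"rejected": {"Q1"}, "illustrated": {"Q2"}}
fresh = [r for r in recs if still_valid(r, events)]
```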
May 21 2021
@Ottomata thanks for the summary & overview of the .deb status.
May 17 2021
@Marostegui today we ran Similarusers ingestion of April data. Some stats:
May 6 2021
May 5 2021
May 4 2021
Right now it is due to memory constraints: we encountered a number of out-of-memory errors when trying to retrieve a large set of labels from enwiki.
There are a few things we can do to fine-tune the memory footprint of the algorithm, but first we experimented with restricting the result set.
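As an illustration, the "restrict the result set" experiment amounts to something like the following (cap_labels and the cap value are hypothetical; the real job does this inside Spark, not in plain Python):

```python
def cap_labels(labels_by_lang, max_per_lang=100):
    # Bound memory by keeping at most max_per_lang labels per language
    # instead of materialising the full label set.
    return {lang: labels[:max_per_lang]
            for lang, labels in labels_by_lang.items()}

capped = cap_labels({"en": ["a", "b", "c", "d"], "de": ["x"]}, max_per_lang=2)
```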
May 3 2021
The following instances should be added to our filter list:
Apr 29 2021
Local Cassandra docker-compose PoC (under review): https://github.com/gmodena/wmf-cassandra-imagematching
This discussion, with relevant stakeholders, is ongoing at https://phabricator.wikimedia.org/T280042
The full dataset for ImageMatching, generated on 321 wikis, is 2.6GB. It contains 23,585,365 records.
To be clear, a record as it is referred to here is one globally unique primary key, and the corresponding columns, yes?
Apr 26 2021
Maybe premature optimisation, but this dataset stores text fields (part of a potential primary key) that can be relatively long (page titles, image names). Do we have guidelines for hashing/storing long keys?
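In the absence of guidelines, one common option is to hash the long text fields into a fixed-width digest and keep the original columns for lookups. A sketch (the scheme, separator, and function name are my own, not an established convention here):

```python
import hashlib

def key_digest(wiki, page_title, image_name):
    # Hypothetical scheme: collapse long text fields into a fixed-width
    # sha256 key; \x1f is a separator unlikely to occur in titles.
    raw = "\x1f".join((wiki, page_title, image_name)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

digest = key_digest("enwiki", "Some Very Long Page Title", "File:Example.jpg")
```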
The full dataset for ImageMatching, generated on 321 wikis, is 2.6GB. It contains 23,585,365 records. In prod we might want to store multiple snapshots (prev/current months), and possibly variants (to satisfy ad-hoc clients or A/B testing).
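Some back-of-envelope arithmetic on those numbers (assuming the 2.6GB figure is decimal gigabytes, and a purely illustrative snapshot count):

```python
records = 23_585_365
total_bytes = 2.6e9                          # 2.6 GB, decimal
avg_record = total_bytes / records           # roughly 110 bytes per record
snapshots = 3                                # hypothetical: prev + current + one variant
projected_gb = total_bytes * snapshots / 1e9 # roughly 7.8 GB
```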
Apr 21 2021
@Marostegui @Eevans thanks for the input!
I should have stats re dataset sizes of the 300+ wikis towards the end of this week. Crunching is still in progress; it takes a while to cycle through all languages.
Apr 19 2021
Apr 15 2021
Thanks for the detailed reply and the constructive feedback.
Apr 14 2021
Apr 13 2021
The job completed successfully at 2021-04-13 15:37:22,710.
Some stats for the ingested datasets:
The ingestion part of the data pipeline kicked off at 2021-04-13 09:05:37,296.
It is set with