Currently, the hashing algorithm runs on static sample data downloaded from the webrequest table. We would like to:
- Make it a Spark job that can run over any given time interval.
- Measure run time as a function of the number of records.
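
A minimal sketch of the run-time measurement, assuming plain Python rather than Spark so it can run anywhere. The actual hashing algorithm is not described here, so `hash_record` uses SHA-256 purely as a stand-in. The synthetic records and the names `time_hashing` and `benchmark` are likewise hypothetical.

```python
import hashlib
import time


def hash_record(record: str) -> str:
    """Stand-in hash (SHA-256). An assumption, not the task's real algorithm."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def time_hashing(num_records: int) -> float:
    """Return the seconds taken to hash `num_records` synthetic records."""
    # Synthetic input; a real test would read records from the source table.
    records = [f"synthetic-record-{i}" for i in range(num_records)]
    start = time.perf_counter()
    for record in records:
        hash_record(record)
    return time.perf_counter() - start


def benchmark(sizes):
    """Map each record count to (total seconds, seconds per record)."""
    results = {}
    for n in sizes:
        elapsed = time_hashing(n)
        results[n] = (elapsed, elapsed / n)
    return results


if __name__ == "__main__":
    for n, (total, per_record) in benchmark([1_000, 10_000, 100_000]).items():
        print(f"{n:>7} records: {total:.4f} s total, {per_record * 1e6:.2f} us/record")
```

Timing several input sizes rather than one shows whether cost grows linearly with the record count. In a Spark job the numbers would also include scheduling and shuffle overhead, so a single-process measurement like this is only a rough lower bound.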
Later, the resulting hashes will be loaded into an OpenSearch instance for search and exploration.
Sample code: https://gitlab.wikimedia.org/xiaoxiao/web_scraping/-/tree/experimental?ref_type=heads