Page MenuHomePhabricator

PoC on anomaly detection with Flink
Open, Needs TriagePublic13 Estimated Story Points


As the team having a use case on Flink (WDQS Streaming Updater) we want to explore sharing that platform so that other teams can profit from it and we can share the maintenance burden.

A good use case for a proof of concept outside of WDQS Streaming Updater seems to be anomaly detection in the context of SRE workload. @CDanis has some ideas (T257527), we should pair him with @dcausse / @Zbyszko and see if we can come up with a quick demo using Flink table API and sql client.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
CBogen set the point value for this task to 13.

Looking at existing solutions based on flink in this area I don't think this is a good fit for the table API and/or SQL unless the usecase is relatively simple (does not require fine control on the state nor specific timers).
Most solutions I've seen describe a similar architecture:

  • event ingestion (exactly what eventgate does)
  • flink pipeline:
    • read from existing event sources and possibly join multiple ones
    • key (partitioning)
    • feature extraction (time operation/aggregation/...)
    • anomaly detection (applying rules/models)
  • front-end (alerts/UI)

The main reasons on why a streaming approach is preferred in these cases (as apposed to a classic polling or batch approach) is:

  • low latency requirements
  • high cardinality states

For a simple POC if we have precise rules in mind the table API/SQL might be usable.

@CDanis if you have something relatively precise & simple in mind this is something I could try to build a skeleton for and start from here.
I see that you started collecting some data for T257527, is this data available in a kafka topic yet? Additionally would you have some rules or metrics you would like to extract from it?

I might also be interested in applying a more complex approach on top of some data that I know (e.g. cirrus backend requests) and see if this is something that could be generalized.

Hm, I think we need to find a use case that can be done using Stream SQL repl. I don't think SRE will deploy a Java app e.g. during a DDoS.

Can Flink's SQL REPL do something like or even better with windows?

Yes it definitely can support such queries e.g (extract all api requests from mediawiki.apiaction grouped by their action param and database where the avg backend time is > 100ms over a 1 minute window).

   TUMBLE_START(dt, INTERVAL '1' MINUTE) as ts_win,
   count(1) as cnt,
   avg(backend_time_ms) as avg_backend_time
FROM api_action
  database, params['action']
HAVING avg(backend_time_ms) > 100

Will output:

          ts_win                  database                    EXPR$2                       cnt          avg_backend_time
2020-10-28T14:35                    srwiki         feedrecentchanges                         4                       228
2020-10-28T14:35                    svwiki         feedrecentchanges                         2                       467
2020-10-28T14:35                    bgwiki                 stashedit                         6                       245
2020-10-28T14:35                    igwiki            languagesearch                         2                       146
2020-10-28T14:35                    ruwiki                    cxsave                         6                       108
2020-10-28T14:35                    fiwiki         feedrecentchanges                         4                       702
2020-10-28T14:35              plwiktionary                   options                         2                       696
2020-10-28T14:35                   astwiki                     parse                        10                       349
2020-10-28T14:35                    bswiki                     parse                         5                       110
2020-10-28T14:35               jawikibooks                     query                         8                       115
2020-10-28T14:35                    viwiki                 stashedit                         8                       479
2020-10-28T14:35              enwikisource         feedrecentchanges                         2                       626
2020-10-28T14:35              bnwikisource                     purge                         6                       346
2020-10-28T14:35              zhwikisource                     query                        11                       260
2020-10-28T14:35                    ukwiki         feedrecentchanges                         3                       672
2020-10-28T14:35                    elwiki         feedrecentchanges                         4                       401

I'll work out some other examples using a python + the table API (possibly doing joins), the SQL CLI is nice for quickly looking at results but rather limited.

Change 643021 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/rdf@master] Flink Swift FS plugin with TempAuth

Change 643257 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/flink-swift-plugin@master] Flink Swift FS plugin with TempAuth

Change 643279 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[integration/config@master] Added Flink Swift Plugin

Change 643021 abandoned by ZPapierski:
[wikidata/query/rdf@master] Flink Swift FS plugin with TempAuth

Moved to own repo - /643257

Change 643279 merged by jenkins-bot:
[integration/config@master] jjb / Zuul: Added CI for Flink Swift Plugin

Mentioned in SAL (#wikimedia-releng) [2020-12-04T18:46:06Z] <James_F> Zuul: Added CI for Flink Swift Plugin T262942

Made but stopped actively working on this for the moment, have hit issues with python env.