Page MenuHomePhabricator

PoC on anomaly detection with Flink
Open, Needs TriagePublic13 Estimated Story Points

Description

As the team having a use case on Flink (WDQS Streaming Updater) we want to explore sharing that platform so that other teams can profit from it and we can share the maintenance burden.

A good use case for a proof of concept outside of WDQS Streaming Updater seems to be anomaly detection in the context of SRE workload. @CDanis has some ideas (T257527), we should pair him with @dcausse / @Zbyszko and see if we can come up with a quick demo using Flink table API and sql client.

Event Timeline

Gehel created this task.Sep 15 2020, 3:13 PM
Restricted Application added a project: Wikidata. · View Herald TranscriptSep 15 2020, 3:13 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Gehel updated the task description. (Show Details)Sep 15 2020, 3:14 PM
mforns edited projects, added Analytics-Radar; removed Analytics.Sep 17 2020, 3:56 PM
CBogen set the point value for this task to 13.

Looking at existing solutions based on flink in this area I don't think this is a good fit for the table API and/or SQL unless the usecase is relatively simple (does not require fine control on the state nor specific timers).
Most solutions I've seen describe a similar architecture:

  • event ingestion (exactly what eventgate does)
  • flink pipeline:
    • read from existing event sources and possibly join multiple ones
    • key (partitioning)
    • feature extraction (time operation/aggregation/...)
    • anomaly detection (applying rules/models)
  • front-end (alerts/UI)

The main reasons on why a streaming approach is preferred in these cases (as apposed to a classic polling or batch approach) is:

  • low latency requirements
  • high cardinality states

For a simple POC if we have precise rules in mind the table API/SQL might be usable.

@CDanis if you have something relatively precise & simple in mind this is something I could try to build a skeleton for and start from here.
I see that you started collecting some data for T257527, is this data available in a kafka topic yet? Additionally would you have some rules or metrics you would like to extract from it?

I might also be interested in applying a more complex approach on top of some data that I know (e.g. cirrus backend requests) and see if this is something that could be generalized.

Hm, I think we need to find a use case that can be done using Stream SQL repl. I don't think SRE will deploy a Java app e.g. during a DDoS.

Can Flink's SQL REPL do something like https://www.confluent.io/stream-processing-cookbook/ksql-recipes/syslog-pattern-detection-alerting/ or even better https://www.confluent.io/stream-processing-cookbook/ksql-recipes/detecting-abnormal-transactions/ with windows?

Yes it definitely can support such queries e.g (extract all api requests from mediawiki.apiaction grouped by their action param and database where the avg backend time is > 100ms over a 1 minute window).

SELECT
   TUMBLE_START(dt, INTERVAL '1' MINUTE) as ts_win,
   database,
   params['action'],
   count(1) as cnt,
   avg(backend_time_ms) as avg_backend_time
FROM api_action
GROUP BY
  TUMBLE(dt, INTERVAL '1' MINUTE),
  database, params['action']
HAVING avg(backend_time_ms) > 100

Will output:

          ts_win                  database                    EXPR$2                       cnt          avg_backend_time
2020-10-28T14:35                    srwiki         feedrecentchanges                         4                       228
2020-10-28T14:35                    svwiki         feedrecentchanges                         2                       467
2020-10-28T14:35                    bgwiki                 stashedit                         6                       245
2020-10-28T14:35                    igwiki            languagesearch                         2                       146
2020-10-28T14:35                    ruwiki                    cxsave                         6                       108
2020-10-28T14:35                    fiwiki         feedrecentchanges                         4                       702
2020-10-28T14:35              plwiktionary                   options                         2                       696
2020-10-28T14:35                   astwiki                     parse                        10                       349
2020-10-28T14:35                    bswiki                     parse                         5                       110
2020-10-28T14:35               jawikibooks                     query                         8                       115
2020-10-28T14:35                    viwiki                 stashedit                         8                       479
2020-10-28T14:35              enwikisource         feedrecentchanges                         2                       626
2020-10-28T14:35              bnwikisource                     purge                         6                       346
2020-10-28T14:35              zhwikisource                     query                        11                       260
2020-10-28T14:35                    ukwiki         feedrecentchanges                         3                       672
2020-10-28T14:35                    elwiki         feedrecentchanges                         4                       401

I'll work out some other examples using a python + the table API (possibly doing joins), the SQL CLI is nice for quickly looking at results but rather limited.

Change 643021 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/rdf@master] Flink Swift FS plugin with TempAuth

https://gerrit.wikimedia.org/r/643021

Change 643257 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/flink-swift-plugin@master] Flink Swift FS plugin with TempAuth

https://gerrit.wikimedia.org/r/643257

Change 643279 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[integration/config@master] Added Flink Swift Plugin

https://gerrit.wikimedia.org/r/643279

Change 643021 abandoned by ZPapierski:
[wikidata/query/rdf@master] Flink Swift FS plugin with TempAuth

Reason:
Moved to own repo - https://gerrit.wikimedia.org/r/c/wikidata/query/flink-swift-plugin/ /643257

https://gerrit.wikimedia.org/r/643021

Change 643279 merged by jenkins-bot:
[integration/config@master] jjb / Zuul: Added CI for Flink Swift Plugin

https://gerrit.wikimedia.org/r/643279

Mentioned in SAL (#wikimedia-releng) [2020-12-04T18:46:06Z] <James_F> Zuul: Added CI for Flink Swift Plugin T262942