
AQS 2.0: Extract production testing data for Druid-based endpoints
Closed, Resolved, Public

Description

The tests for the AQS 2.0 Druid-based endpoints will need data from our production data lake to execute against. See T316849: Audit tests for Druid-based endpoints for a list of tests and an initial analysis of the necessary data.

Write queries to extract this data in a format suitable for ingestion into the Druid testing environment described in T311190: Establish testing procedure for Druid-based endpoints. Document the resulting queries, including the procedure required to execute them and ingest the data into the testing environment.

This task does not cover publishing and making accessible a Docker image for the Druid testing environment containing the ingested data. That work should be covered in a separate task.

Completion criteria:

  • queries are written and successfully executed against production
  • testing data is ingested into the Druid testing environment (locally, not published anywhere)
  • all steps required to perform the above are documented

Event Timeline

The purpose of this comment is to document what I've learned so far about writing and executing Druid queries against WMF's Druid installation. The target audience is other AQS 2.0 developers who may (like me) be unfamiliar with Druid and how to query it. This is much more of a "just enough to get started" explanation than a full technical description.

First, it isn't strictly necessary to query Druid for our testing data. The data is also available in Hadoop/HDFS and can be queried via Hive/Spark/Hue (although Hue is being deprecated). However, I chose to query actual Druid by posting Druid native queries from the command line. There's a fair amount of discussion on T311190: Establish testing procedure for Druid-based endpoints about the various options, as well as numerous related pages on Wikitech. The rest of this comment focuses on the approach I chose.

Druid can accept queries via either:

  • Druid SQL, which looks a lot like the SQL I was already familiar with from relational databases like MySQL/MariaDB
  • Native queries, in which the query is described as a JSON object

The existing production AQS code uses native queries. I found them convenient and understandable (at least for my relatively simple use cases). The ability to save queries as .json files was nice when working from the command line, and they are also convenient to represent in code, either as embedded strings or as external asset files. It is also trivial to save the output to a file.
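
As an aside, native queries are also easy to post from code. Here's a minimal sketch of posting a query held as an embedded string, assuming the Python requests library and the same broker host/port as the curl examples below (an illustration only, not the production approach):

import json

import requests

# A trivial native query held as an embedded dict; it could just as easily
# be loaded from an external .json asset file. "timeBoundary" is used here
# only because it's about the simplest possible query type.
QUERY = {
    "queryType": "timeBoundary",
    "dataSource": "mediawiki_history_reduced_2022_08",
}

# Druid broker endpoint, same host/port as the curl examples below.
# Assumption: this runs somewhere that can reach the broker (e.g. a stat server).
DRUID_URL = "http://druid1004.eqiad.wmnet:8082/druid/v2/"

response = requests.post(DRUID_URL, json=QUERY, timeout=60)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))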

Manually executing native queries from the command line requires a developer account and production shell access. Anyone needing shell access should coordinate with their manager, and is welcome to use my request for shell access as an example (hopefully that's still the right procedure...)

Here are the steps I took to execute a Druid native query:

  • ssh stat1005.eqiad.wmnet (I assume other stat servers would work too, I just happened to use 1005)
  • kinit (I'm not absolutely sure this is needed for Druid; it may just be for Hadoop, but it doesn't hurt)
  • vim my-query.json, then write/paste your query and save the file (or use your favorite command line editor, or scp your locally-created .json query files to your home directory on the stat server)
  • curl -L -X POST "http://druid1004.eqiad.wmnet:8082/druid/v2/?pretty" -H "content-type: application/json" -d @my-query.json

That will display the output directly in your terminal, which may be fine for quick analysis, confirming your query works, etc. If you'd like to save the results, you can instead tell curl to write to a file, like this:

  • curl -L -X POST "http://druid1004.eqiad.wmnet:8082/druid/v2/?pretty" -H "content-type: application/json" -o bill-scan-three-output.json -d @bill-scan-three.json

Disclaimer: as with most command line operations, there are multiple ways to get similar results. I'm not claiming that what I documented above is ideal for every use case, just that it worked for mine. If you prefer a different approach that accomplishes your goals, go for it.

So, what do queries look like? Druid supports a number of native query types, including aggregation queries such as "topN" and raw-row queries such as "scan". I won't pretend to understand the subtleties of all of them, but it appears to me that for purposes of this task - extracting raw rows to ingest as testing data - the "scan" query type will be very useful.

Here are some example queries:

top_revisions_per_project_2022_07.json

{
  "queryType": "topN",
  "dataSource": "mediawiki_history_reduced_2022_08",
  "dimension": "project",
  "granularity": "month",
  "metric": "revisions_created",
  "threshold": 10,
  "filter": {
    "type": "and",
    "fields": [{
        "type": "selector",
        "dimension": "event_entity",
        "value": "revision"
      },
      {
        "type": "selector",
        "dimension": "event_type",
        "value": "create"
      }
    ]
  },
  "aggregations": [{
    "type": "longSum",
    "name": "revisions_created",
    "fieldName": "events"
  }],
  "intervals": "2022-07-01/2022-08-01"
}

Output from the above is:

[ {
  "timestamp" : "2022-07-01T00:00:00.000Z",
  "result" : [ {
    "revisions_created" : 23914911,
    "project" : "wikidata"
  }, {
    "revisions_created" : 9116716,
    "project" : "commons.wikimedia"
  }, {
    "revisions_created" : 5649871,
    "project" : "en.wikipedia"
  }, {
    "revisions_created" : 915244,
    "project" : "en.wiktionary"
  }, {
    "revisions_created" : 760184,
    "project" : "fr.wikipedia"
  }, {
    "revisions_created" : 739002,
    "project" : "de.wikipedia"
  }, {
    "revisions_created" : 736428,
    "project" : "ru.wikipedia"
  }, {
    "revisions_created" : 582137,
    "project" : "es.wikipedia"
  }, {
    "revisions_created" : 550489,
    "project" : "zh.wikipedia"
  }, {
    "revisions_created" : 480457,
    "project" : "ja.wikipedia"
  } ]
} ]

create-page events on en.wikipedia during July 1, 2022 for namespace 0:

"bill-scan-four.json" 
{
   "queryType": "scan",
   "dataSource": "mediawiki_history_reduced_2022_08",
   "resultFormat": "list",
   "columns":[],
  "intervals": "2022-07-01/2022-07-02",
   "batchSize":20480,
  "filter": {
    "type": "and",
    "fields": [{
        "type": "selector",
        "dimension": "page_namespace",
        "value": "0"
      },
      {
        "type": "selector",
        "dimension": "project",
        "value": "en.wikipedia"
      },
      {
        "type": "selector",
        "dimension": "event_type",
        "value": "create-page"
      }
    ]
  }
 }

The above found 1059 occurrences. One example:

{
    "__time" : 1656633600000,
    "project" : "en.wikipedia",
    "event_entity" : "page",
    "event_type" : "create-page",
    "user_text" : "Zyxw",
    "user_type" : "user",
    "page_title" : "TeamNL",
    "page_namespace" : "0",
    "page_type" : "content",
    "other_tags" : null,
    "revisions" : "0",
    "text_bytes_diff" : null,
    "text_bytes_diff_abs" : null,
    "text_bytes_diff_sum" : 0,
    "text_bytes_diff_abs_sum" : 0,
    "revisions_sum" : 0,
    "events" : 1
  }

Next up: write the actual queries needed to extract our testing data.

I ended up writing this as one query, as follows:

{
  "queryType": "scan",
  "dataSource": "mediawiki_history_reduced_2022_08",
  "resultFormat": "compactedList",
  "columns":[],
  "intervals": "2022-07-01/2022-07-02",
  "batchSize":20480,
  "filter": {
    "type": "or",
    "fields": [
      {
        "type": "and",
        "fields": [
          {
            "type": "in",
            "dimension": "page_title",
            "values": [
              "Jess_Wells",
              "Věra_Lišková",
              "Chipolin",
              "Karmarong_dialect",
              "Project:WikiProject_Bangladesh_Premier_League/Article_alerts/Archive",
              "Timeline_of_the_Joe_Biden_presidency_(2022_Q2)"
            ]
          },
          {
            "type": "selector",
            "dimension": "project",
            "value": "en.wikipedia"
          }
        ]
      },
      {
        "type": "and",
        "fields": [
          {
            "type": "in",
            "dimension": "page_title",
            "values": [
              "Julie_Lemieux",
              "36_Ursae_Majoris",
              "Manaus_slender-legged_tree_frog",
              "TANS_Perú_Flight_204",
              "Category:1925_comedy_movies"
            ]
          },
          {
            "type": "selector",
            "dimension": "project",
            "value": "simple.wikipedia"
          }
        ]
      },
      {
        "type": "and",
        "fields": [
          {
            "type": "in",
            "dimension": "user_text",
            "values": [
              "-Alabama-",
              "AAlertBot",
              "AnomieBOT"
            ]
          },
          {
            "type": "selector",
            "dimension": "project",
            "value": "en.wikipedia"
          }
        ]
      }
    ]
  }
}

I'll get all this into a proper repository in WMF GitLab once I finish up a few more details. But for now, here's the Python script I wrote to convert the Druid compactedList output JSON into a CSV for ingestion into the Druid testing environment:

https://gist.github.com/bpirkle/461e952ea39b2ef43cfd082d2b871ad9

Maybe there's a way I could have fed the Druid JSON response directly back in, but this approach wasn't too hard, the CSV is a little more compact, and (probably most important) it is easier to review the testing data outside Druid using tools like an everyday spreadsheet or text editor.
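
To show roughly what that conversion involves, here's a minimal sketch (the gist above is the real script). It assumes the scan query's compactedList output format, where each result batch carries a "columns" list plus an "events" list of rows, each row being a list of values in column order; the file names are just placeholders.

import csv
import json

# Placeholder file names; adjust to match whatever curl wrote and whatever
# the Druid test environment expects.
INPUT_JSON = "query-output-compacted-list.json"
OUTPUT_CSV = "aqs_druid_test_data.csv"

with open(INPUT_JSON) as f:
    batches = json.load(f)

with open(OUTPUT_CSV, "w", newline="") as f:
    writer = csv.writer(f)
    header_written = False
    for batch in batches:
        # Each compactedList batch has a "columns" list and an "events"
        # list of rows whose values are in that same column order.
        if not header_written:
            writer.writerow(batch["columns"])
            header_written = True
        for row in batch["events"]:
            writer.writerow(row)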

The Druid environment is currently at https://github.com/bpirkle/aqs-docker-druid-test-env . As noted above, I'll get it into a better repository. But I'll do that work under a different phab task.

Steps to query and ingest Druid data into a local env:

  1. ssh stat1005.eqiad.wmnet (presumably, this could be done from several servers. This is just the one I happened to use.)
  2. get the above query onto the stat server, by scp'ing it, running vim on the stat server and pasting it in, or whatever you prefer
  3. curl -L -X POST "http://druid1004.eqiad.wmnet:8082/druid/v2/?pretty" -H "content-type: application/json" -o query-output-compacted-list.json -d @test-data-query.json
  4. copy query-output-compacted-list.json to your local machine (again, however you prefer)
  5. python3 druid-to-csv.py (or however you prefer to run the script that converts the json to csv)
  6. clone the Druid environment repository
  7. copy in the .csv to that repository's directory under the name aqs_druid_test_data.csv
  8. docker build -t bpirkle/aqs-docker-druid-test-env . (this creates the Docker image with the data ingested)
  9. docker run -p 8081:8081 -p 8082:8082 -p 8888:8888 -it bpirkle/aqs-docker-druid-test-env (this actually runs the image in a container)
  10. hit http://localhost:8888/ in your browser and confirm data is present (or use the command-line check sketched just after this list). Note that it may take several minutes for Druid to fully launch. In my experience, the Druid interface came up almost immediately, but I had to wait a bit for the data to appear, even though it was already ingested in the build step.
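
If you'd rather confirm the data from the command line than the browser, a quick check like the following should also work. It assumes the container's coordinator is reachable on the 8081 port mapped in step 9, and it may return an empty list until Druid finishes loading:

import requests

# Ask the locally running Druid which datasources it has loaded.
# Assumption: the coordinator listens on localhost:8081, as mapped in step 9.
resp = requests.get(
    "http://localhost:8081/druid/coordinator/v1/datasources", timeout=10
)
resp.raise_for_status()
print(resp.json())  # expect a non-empty list once loading completes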

The above is still a little more fiddly and manual than I'd like. I'll try to clean it up when I create the real repository, but I'll do that under another task. The above is sufficient to get a working env with test data, and therefore meets the completion criteria of this task.

BPirkle moved this task from Backlog to QA/Review on the API Platform board.