Onboarding for Isaac around code / data for sockpuppet detection
Closed, Resolved · Public

Description

Things to understand:

  • Code: where it lives, version control, and which scripts are relevant to which part of the pipeline (data gathering + preprocessing, model training, prediction)
      • Access provided by DD, along with a description of the full pipeline and what each script does and produces
  • Model: architecture (and alternatives considered / rejected), features being used (or explicitly not used)
      • XGBoost; good understanding of the features being used and why
  • Documentation: where it lives
      • Work in progress, but I have enough informal documentation to know what's going on
  • Future work: what improvements are prioritized right now
      • Working with DD to identify key areas for improvement

Event Timeline

Weekly update: set up a meeting for next week to start the onboarding process.

Weekly update: met with DD and was given an overview of the model choices and future directions. Will be receiving a pointer to code / documentation in the near future. For now, though, I have a decent understanding of the current state of the project, which will hopefully be enough to make interpretation of the code relatively straightforward.

Weekly update: no progress. I hope to look at some of the code next week to familiarize myself with it.

Weekly update: received access to the code repository, and a date is set to discuss it with DD later in August.

Weekly update:

  • DD sent the code along with an overview. I'm going through it, taking notes, and looking for places where it could be made more efficient with little loss in model performance, so that the pipeline is quicker and easier to iterate on. So far it is pretty straightforward, though at some point I'll have to decide whether to leave it in Scala, which I'm not particularly familiar with and so would have trouble debugging / updating, or move it to PySpark, where I'm much more comfortable.
  • Working on reproducing model training to make sure I understand the features / architecture / performance. The data generation component takes much longer to run but is also relatively straightforward.

@Isaac thanks for the update. Re the choice of Scala or not: if you can delay the decision until the research engineer starts, that'd be my recommendation, as I'd expect that part of the pipeline optimization to be done by them (possibly a good first project).

@leila sounds good -- we can discuss next meeting

Weekly update: I'm going to resolve this task. I'm obviously still learning some of the intricacies of this work, but at this point I have access to the code pipeline, have a good understanding of what's currently going on, and am now at the stage of thinking about improvements.