Onboarding for Isaac around code / data for sockpuppet detection
Closed, Resolved (Public)

Assigned To

Authored By

	Isaac
	Jul 13 2020, 7:42 PM

Description

Things to understand:

Code: where is it, version control, what scripts are relevant to which part of the pipeline (data gathering + preprocessing, model training, prediction)
	Access provided by DD along with description of full pipeline and what each script does and produces
Model: architecture (and alternatives considered / rejected), features being used (or explicitly not used)
	XGBoost and good understanding of features being used and why
Documentation: where it lives
	Work in progress but I have enough informal documentation to know what's going on
Future work: what improvements are prioritized right now
	Working with DD to identify key areas for improvement

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• DarTar	T171251 [Objective 3.1.2] Models for sockpuppet and toxic discussion detection
		Resolved		Isaac	T257870 Onboarding for Isaac around code / data for sockpuppet detection

Event Timeline

Isaac created this task. Jul 13 2020, 7:42 PM

Isaac moved this task from Backlog to FY2020-21-Research-July-September on the Research board.

Isaac edited projects, added Research (FY2020-21-Research-July-September); removed Research.

Weekly update: set up a meeting for next week to start the onboarding process.

Weekly update: met with DD and was given an overview of the model choices and future directions. Will be receiving a pointer to code / documentation in the near future. For now, though, I have a decent understanding of the current state of the project, which will hopefully be enough to make interpretation of the code relatively straightforward.

Weekly update: no progress. Will hopefully look at some code next week to familiarize myself with it.

Weekly update: now have access to the code repository, and a date is set to discuss it with DD later in August.

Weekly update:

DD sent code along with an overview. I'm going through it, taking notes, and looking for areas where it might be made more efficient with little loss in model performance, so the pipeline is quicker and easier to iterate on. So far it's pretty straightforward, though at some point I'll have to decide whether to leave it in Scala, with which I'm not particularly familiar and so would have trouble debugging / updating, or move it to PySpark, where I'm much more comfortable.
Working on reproducing model training to make sure I understand the features / architecture / performance. The data generation component takes much more time to run but is also relatively straightforward.

@Isaac thanks for the update. Re the choice of Scala or not: if you can delay the decision until the research engineer starts, that'd be my recommendation as I'd expect that part of the pipeline optimization to be done by them. (a possible good first project)


@leila sounds good -- we can discuss next meeting

Weekly update: I'm going to resolve this task. I'm obviously still learning some of the intricacies of this work, but at this point I have access to the code pipeline, have a good understanding of what's currently going on, and am now at the stage of thinking about improvements.

Isaac updated the task description. Aug 28 2020, 3:34 PM

Isaac closed this task as Resolved. Sep 3 2020, 7:33 PM
