Help requested from an R expert to help tweak phlogiston (burnup chart scripts)
Closed, Resolved | Public | 2 Estimated Story Points

Authored By

	• JAufrecht
	Aug 24 2015, 5:58 PM

Description

Priorities for review:

1. Anything in the R code that would affect the integrity of the data and the interpretation of the results.
- The R scripts generate graphs directly from csv files; data processing is all done in SQL.
2. Making it easier to continue working with the code.
- The file ve_report.R is the master R reporting file. The other *_report.R files are copy-pasted; it would be very helpful to get guidance on the best way to clean this up and parameterize the files so that there is no code duplication.
- Any egregious standards violations that would make it harder for other people to use this.
3. Saving Joel some time figuring out various R things.
- On ve-age_of_resolved_count.png and ve-age_of_resolved.png, the scale in the legend should be flipped so that red is on the bottom, just as red is on the bottom in the chart.
- On ve-backlog_burnup_crop.png, the VisualEditor Interrupt data should be plotted descending from the X axis.
- On ve_backlog_status, why are the data shapes out of alignment? They didn't use to be.
- On ve-trancheN_burnup.png, what is the right way to make the "open" data on each chart match the colors from ve-backlog_burnup_crop.png (so that, e.g., the "open" in tranche1 is teal), and to make "resolved" a black line instead of an area?

Here is fresh data for the script:

data.zip (30 KB)

Code is at https://github.com/wikimedia/phab_task_history

Related Objects

Status	Assigned	Task
Resolved	• JAufrecht	T107482 Create third version of VE burnup report
Resolved	• Awjrichards	T108645 Get code review for Phab burnup reporting scripts (phlogiston)
Resolved	Ironholds	T110080 Help requested from an R expert to help tweak phlogiston (burnup chart scripts)
Declined	None	T115740 Simplify Phlogiston R files by moving duplicated code into functions
Resolved	• JAufrecht	T115743 Replace duplicated R files with parameterized call

Event Timeline

• JAufrecht created this task.Aug 24 2015, 5:58 PM

• JAufrecht claimed this task.

• JAufrecht raised the priority of this task from to Medium.

• JAufrecht updated the task description. (Show Details)

• JAufrecht added a project: VisualEditor.

• JAufrecht added subscribers: gerritbot, • JAufrecht, Aklapper.

• Deskana added a project: Discovery-Analysis (Current work).Aug 24 2015, 6:33 PM

• Deskana set Security to None.

@Jdforrester-WMF requested the assistance of Discovery-ARCHIVED here. I checked with the analysts and since this is a fairly small, well scoped task, we can probably help out pretty fast.

I choose you, @Ironholds

• JAufrecht updated the task description. (Show Details)Aug 25 2015, 4:00 PM

Jdforrester-WMF moved this task from To Triage to TR0: Interrupt on the VisualEditor board.Aug 25 2015, 7:04 PM

• Ironholds_backup edited a custom field.Aug 26 2015, 11:25 AM

@JAufrecht so I discussed this with James yesterday; it initially sounded like a 30-minute task, but given that it is 300 lines of plotting code, it is actually a lot more than that ;).

I can do three things (an AND, not an OR):

Offer you some general guidance on style and common R gotchas;
Write up an example of the format I'd use for such code;
Answer specific questions ('how do I do X, broadly-speaking?' Rather than 'make X do Y') within a timeboxed window.

For detailed debugging of 300 lines of code I have informed James that my hourly rate is 125 bucks ;p

• Ironholds_backup moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.Aug 27 2015, 3:49 AM

Awaiting response.

• JAufrecht added a parent task: T108645: Get code review for Phab burnup reporting scripts (phlogiston).Aug 27 2015, 9:12 PM

Problem 1: The R script generates 19 graphs, generally only the input file and maybe 1-2 other things are different. Are there general bad practices in this approach, or in this basic block of code? How can I set more defaults or otherwise avoid so much duplication?

Problem 2: This is just for VE; I will have similar lists of graphs for other projects. What's the best way to recycle this code so that I can generate the same charts for other projects, just changing the titles and the source data, without having cut-paste code that gets out of sync?

Problem 3: I'd like to make a chart like ve-backlog_burnup_crop.png, but with one set of data shown descending from the X axis, rather than ascending. How should I go about this?

Problem 4: In the same chart, the black burnup line is confusing because it shows the total points resolved, but in fact each of the different color bands could have its own points resolved. Can you think of any visual way to communicate this information legibly on one chart, rather than on a separate chart for each color band?

In T110080#1582334, @JAufrecht wrote:

Problem 1: The R script generates 19 graphs, generally only the input file and maybe 1-2 other things are different. Are there general bad practices in this approach, or in this basic block of code? How can I set more defaults or otherwise avoid so much duplication?

Hopefully the example will demonstrate some of this.

Problem 2: This is just for VE; I will have similar lists of graphs for other projects. What's the best way to recycle this code so that I can generate the same charts for other projects, just changing the titles and the source data, without having cut-paste code that gets out of sync?

By using functions rather than raw calls. Check out the ?source function, too (php's 'include' but for R).

Problem 3: I'd like to make a chart like ve-backlog_burnup_crop.png, but with one set of data shown descending from the X axis, rather than ascending. How should I go about this?

I mean, honestly this isn't how I'd choose to visualise this at all. Is there a reason you're not just showing, say, the ratio of cleared points to current points each week?

Problem 4: In the same chart, the black burnup line is confusing because it shows the total points resolved, but in fact each of the different color bands could have its own points resolved. Can you think of any visual way to communicate this information legibly on one chart, rather than on a separate chart for each color band?

Not visualise it at all; see above ;p
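For anyone who still wants the descending band from Problem 3, one common approach is to flip the sign of that series before plotting, so that a stacked area chart draws it below zero. The following is only a sketch on made-up data: the column names (`week`, `category`, `points`) and the category labels are assumptions, not the actual layout of the phlogiston csv files.

```r
library(ggplot2)

# Made-up data: one ordinary band and one "interrupt" band.
d <- data.frame(
  week     = rep(1:6, times = 2),
  category = rep(c("Tranche 1", "Interrupt"), each = 6),
  points   = c(10, 12, 15, 15, 18, 20, 2, 3, 3, 5, 6, 8)
)

# Flip the sign of the series that should hang below the X axis.
is_interrupt <- d$category == "Interrupt"
d$points[is_interrupt] <- -d$points[is_interrupt]

p <- ggplot(d, aes(week, points, fill = category)) +
  geom_area() +
  geom_hline(yintercept = 0) +
  # Show magnitudes on the axis rather than negative numbers.
  scale_y_continuous(labels = function(x) as.character(abs(x))) +
  labs(x = "Week", y = "Story points")

out_file <- file.path(tempdir(), "descending_band.png")
ggsave(out_file, plot = p, width = 8, height = 5)
```

The sign flip is purely a display device; the underlying csv data stays unchanged.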

• Ironholds_backup moved this task from Stalled/Waiting to In progress on the Discovery-Analysis (Current work) board.Aug 28 2015, 5:38 PM

Ironholds moved this task from In progress to Stalled/Waiting on the Discovery-Analysis (Current work) board.Sep 1 2015, 8:08 PM

Moving back to Stalled/Waiting while we await response.

• ksmith renamed this task from Get R expert to help tweak report to Help requested from an R expert to help tweak phlogiston (burnup chart scripts).Sep 3 2015, 4:32 PM

• ksmith added a project: Team-Practices (This-Week).Sep 3 2015, 9:18 PM

• ggellerman moved this task from To Do to In Progress on the Team-Practices (This-Week) board.Sep 3 2015, 9:23 PM

• JAufrecht claimed this task.Sep 10 2015, 9:14 PM

Declining this task, as it's sat stalled waiting for feedback for a very long time. @JAufrecht, feel free to reopen if you still have specific questions you need our input on.

• Deskana moved this task from Stalled/Waiting to Done on the Discovery-Analysis (Current work) board.Sep 17 2015, 8:05 PM

Sorry for the delay. I could definitely still use some help; struggling to figure out the best way to ask. Regarding the last round of questions:

What do you mean by example? What do you need from me to produce an example?
The scripts generate a bunch of charts from fully prepared data. Each chart is currently about fourteen lines of code, where I specify the output file, one or two input files, legends and titles, the chart type, font sizes and chart sizes, and custom scale. All of this is then copy-pasted to each chart for each project.
1. What's the best way to put the real boilerplate stuff, like chart size and font size, into a single location like a default or stylesheet?
2. For the stuff that varies, like file names, what is the best approach? The current approach with repetitious code, or to parameterize it somehow and have a data file along the lines of {my_input.csv, my_output.png, "My Title"}, or something else?
3. If ?source is the right approach, should I research that or would that be in the example?
"Is there a reason you're not just showing, say, the ratio of cleared points to current points each week?" By showing the backlog and burnup in a chart, we can visually answer a lot of questions that would otherwise require a lot of numbers that I think would be harder to comprehend simultaneously. E.g.:
1. What is the trend in backlog growth?
2. What is the trend in velocity?
3. What is the relative scope of work for upcoming milestones, and what is the trend in scope creep?
4. What is the relationship between velocity and scope creep?
  
  On the other hand, all of these are better answered with numbers, so we probably want both a picture and specific numbers, to allow conceptual navigation through the data.

• JAufrecht reopened this task as Open.Sep 17 2015, 9:27 PM

@JAufrecht Thanks. Moving this into the review column in the sprint accordingly.

• JAufrecht added a project: Phlogiston.Sep 17 2015, 9:34 PM

• JAufrecht moved this task from To Be Triaged to Technical Debt Backlog on the Phlogiston board.Sep 21 2015, 3:47 PM

@JAufrecht Thanks for the questions. I've reprioritised this now. Unfortunately it's fairly deep in our sprint backlog, as we've got a lot of other work to do first. We'll get back to you when we're working on this.

• JAufrecht moved this task from In Progress to Blocked or Waiting on the Team-Practices (This-Week) board.Sep 22 2015, 8:50 PM

• JAufrecht added a project: Team-Practices.Sep 25 2015, 4:43 PM

• JAufrecht removed a project: Team-Practices.Sep 25 2015, 4:44 PM

In T110080#1651403, @JAufrecht wrote:

Sorry for the delay. I could definitely still use some help; struggling to figure out the best way to ask. Regarding the last round of questions:

What do you mean by example? What do you need from me to produce an example?

The scripts generate a bunch of charts from fully prepared data. Each chart is currently about fourteen lines of code, where I specify the output file, one or two input files, legends and titles, the chart type, font sizes and chart sizes, and custom scale. All of this is then copy-pasted to each chart for each project.

What's the best way to put the real boilerplate stuff, like chart size and font size, into a single location like a default or stylesheet?

Okay, the easiest way to do this is to simply wrap it all in a function. We actually have an example of this; if you look at https://github.com/Ironholds/wmf/blob/master/R/dataviz.R you'll see a function called theme_fivethirtynine(), our modification of the fivethirtyeight standard plotting form, which we use as the standard for our visual display of data. To call it:

ggplot(df, aes(x,y)) + geom_line() + theme_fivethirtynine()

So yeah: function calls!

In addition, ggplot objects are post-hoc modifiable:

plot_obj <- ggplot(df, aes(x, y)) + geom_line()
one_themed_plot_obj <- plot_obj + theme_fivethirtynine()
two_themed_plot_obj <- plot_obj + theme_bw()

This doesn't just apply to themes, it also applies to any aesthetics or geoms after the first one - so you can avoid duplicating the initial ggplot() call if you want to display the same data multiple times with different aesthetics.
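As a rough illustration of both points (a theme wrapped in a function, and one base plot reused with different additions), here is a minimal sketch. The name `theme_phlogiston()`, its styling choices, and the data are all made up for the example; this is not the code in the wmf package or in phlogiston.

```r
library(ggplot2)

# Hypothetical house theme: every shared styling choice lives here,
# so a change to font size or legend position happens in one place.
theme_phlogiston <- function(base_size = 14) {
  theme_bw(base_size = base_size) +
    theme(legend.position = "bottom",
          plot.title = element_text(face = "bold"))
}

# Made-up data standing in for one of the prepared csv files.
df <- data.frame(week = 1:10,
                 points = cumsum(c(3, 5, 2, 8, 1, 4, 6, 2, 7, 3)))

# Build the base plot once, then derive variants from it.
base_plot  <- ggplot(df, aes(week, points)) + geom_line()
house_plot <- base_plot + theme_phlogiston()
big_plot   <- base_plot + theme_phlogiston(base_size = 20)
```

Because the defaults sit in the function signature, a chart that needs a larger font only overrides `base_size` and inherits everything else.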

For the stuff that varies, like file names, what is the best approach? The current approach with repetitious code, or to parameterize it somehow and have a data file along the lines of {my_input.csv, my_output.png, "My Title"}, or something else?

The parameterisation is what I do, yeah. Function wrappers are your friend!
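A minimal sketch of what that parameterisation could look like: one function holds the boilerplate, and a small table with one row per chart supplies the parts that vary. The function name `make_burnup_chart()`, the column names `date` and `points`, and the file names are assumptions made for the example, and the input files are generated on the fly so that it runs on its own.

```r
library(ggplot2)

# Hypothetical helper: boilerplate (size, theme, labels) lives here,
# and only the things that vary between charts are arguments.
make_burnup_chart <- function(input_csv, output_png, title,
                              width = 8, height = 5) {
  d <- read.csv(input_csv)
  d$date <- as.Date(d$date)
  p <- ggplot(d, aes(date, points)) +
    geom_line() +
    labs(title = title, x = "Date", y = "Story points") +
    theme_bw(base_size = 14)
  ggsave(output_png, plot = p, width = width, height = height)
  invisible(output_png)
}

# Fake input data so the sketch is self-contained.
out_dir <- tempdir()
for (name in c("ve_backlog", "ve_interrupt")) {
  write.csv(
    data.frame(date = as.character(as.Date("2015-08-01") + 7 * 0:4),
               points = c(10, 14, 19, 21, 30)),
    file.path(out_dir, paste0(name, ".csv")),
    row.names = FALSE
  )
}

# One row per chart; this table could equally be read from a csv file.
charts <- data.frame(
  input  = file.path(out_dir, c("ve_backlog.csv", "ve_interrupt.csv")),
  output = file.path(out_dir, c("ve_backlog.png", "ve_interrupt.png")),
  title  = c("VE backlog", "VE interrupt"),
  stringsAsFactors = FALSE
)

# Call the function once per row.
for (i in seq_len(nrow(charts))) {
  make_burnup_chart(charts$input[i], charts$output[i], charts$title[i])
}
```

Adding a chart for another project then means adding a row to the table rather than pasting another block of plotting code.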

If ?source is the right approach, should I research that or would that be in the example?

source is good for pulling in code from multiple files (it's analogous to include) but I'd really recommend parameterisation here. Doesn't mean you can't store those functions in a file, and source that into a file that actually makes the plots, though (I do that all the time).
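To make the source() pattern concrete, here is a small sketch. The file name `chart_functions.R` and the helper `chart_title()` are invented for the example; the file is written to a temporary directory only so the snippet is runnable by itself.

```r
# Stand-in for a shared functions file kept alongside the report scripts.
functions_file <- file.path(tempdir(), "chart_functions.R")
writeLines(c(
  "chart_title <- function(project, chart) {",
  "  paste(project, chart, sep = \": \")",
  "}"
), functions_file)

# A per-project script would start with this line instead of pasted code.
source(functions_file)

# The sourced function is now available in this session.
chart_title("VisualEditor", "Backlog burnup")
```

Each project script stays short: it sources the shared file and then calls the functions with its own titles and data files.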

"Is there a reason you're not just showing, say, the ratio of cleared points to current points each week?" By showing the backlog and burnup in a chart, we can visually answer a lot of questions that would otherwise require a lot of numbers that I think would be harder to comprehend simultaneously. E.g.:

What is the trend in backlog growth?

What is the trend in velocity?

What is the relative scope of work for upcoming milestones, and what is the trend in scope creep?

What is the relationship between velocity and scope creep?

On the other hand, all of these are better answered with numbers, so we probably want both a picture and specific numbers, to allow conceptual navigation through the data.

That makes sense. I'd suggest using multiple plots to reduce the visual complexity, though. Using multiple plots doesn't mean you need different slides, of course - you can plot them as part of the same image. Lemme know if that sounds like the right approach and I'll build an example.
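One way to put several plots into a single image, sketched here on made-up data, is to keep the data in long format and let ggplot2 facet it into panels. The measure names and values below are placeholders, not phlogiston output.

```r
library(ggplot2)

# Made-up long-format data: one row per week per measure.
weeks <- 1:8
d <- rbind(
  data.frame(week = weeks, measure = "Backlog size",
             points = c(40, 44, 47, 52, 55, 60, 62, 66)),
  data.frame(week = weeks, measure = "Points resolved",
             points = c(0, 5, 9, 15, 22, 26, 33, 41))
)

# One panel per measure, stacked vertically and sharing the X axis.
p <- ggplot(d, aes(week, points)) +
  geom_line() +
  facet_wrap(~ measure, ncol = 1) +
  labs(x = "Week", y = "Story points")

out_file <- file.path(tempdir(), "combined.png")
ggsave(out_file, plot = p, width = 8, height = 6)
```

By default the panels share the same scales, which suits measures in the same units.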

Ironholds moved this task from Backlog to Needs review on the Discovery-Analysis (Current work) board.Oct 15 2015, 11:53 AM

Ironholds moved this task from Needs review to Stalled/Waiting on the Discovery-Analysis (Current work) board.

Created new tasks for followup work to change Phlogiston to incorporate functions and parameterization, leaving I think only this one issue for this task:

Using multiple plots doesn't mean you need different slides, of course - you can plot them as part of the same image. Lemme know if that sounds like the right approach and I'll build an example.

I think I've gone down this road pretty far; see Phlogiston VE. I'm now showing the same or similar data in two or three different charts as we experiment to see which is most useful for different purposes. Do you have something in mind other than this?

Naw, that works fine - I just meant literally storing them in the same image to compress, e.g., number of slides. But that only works well when you're using the same scales and legend - yours looks great for this kind of data :)

I think you can close this; the new subtasks cover the specific followup work. Thanks.

Ironholds moved this task from Stalled/Waiting to Done on the Discovery-Analysis (Current work) board.Oct 21 2015, 5:37 PM

• JAufrecht closed this task as Resolved.Oct 22 2015, 9:04 PM

• JAufrecht moved this task from Blocked or Waiting to Done on the Team-Practices (This-Week) board.

• JAufrecht added a project: OKR-Work.Oct 27 2015, 6:19 PM

• JAufrecht closed subtask T115743: Replace duplicated R files with parameterized call as Resolved.Oct 27 2015, 9:51 PM

• Deskana moved this task from Done to Resolved on the Discovery-Analysis (Current work) board.Nov 20 2015, 5:26 AM

Aklapper closed subtask T115740: Simplify Phlogiston R files by moving duplicated code into functions as Declined.Sep 17 2020, 2:29 PM

Restricted Application added projects: User-Ryasmeen, Product-Analytics. · View Herald TranscriptSep 17 2020, 2:29 PM

Aklapper removed projects: Product-Analytics, User-Ryasmeen.Sep 17 2020, 2:34 PM
