Help requested from an R expert to help tweak phlogiston (burnup chart scripts)
Closed, Resolved · Public · 2 Story Points

Description

Priorities for review.

  1. Anything in the R code that would affect the integrity of the data and interpretation of the results

    The R scripts generate graphs directly from csv files; data processing is all done in SQL
  2. Making it easier to continue working with the code
    • The file ve_report.R is the master R reporting file. The other *_report.R files are copy-pasted; it would be very helpful to get guidance on the best way to clean this up and parameterize the files so that there is no code duplication.
    • any egregious standards violations that would make it harder for other people to use this
  3. Save Joel some time figuring out various R things
    • on ve-age_of_resolved_count.png and ve-age_of_resolved.png, the scale in the legend should be flipped so that red is on the bottom, just as red is on the bottom in the chart.
    • on ve-backlog_burnup_crop.png, the VisualEditor Interrupt data should be plotted descending from the X axis.
    • on ve_backlog_status, why are the data shapes out of alignment? They didn't use to be.
    • on ve-trancheN_burnup.png, what is the right way to make the "open" data on each chart match the colors from ve-backlog-burnup_crop.png (so that, e.g., the "open" in tranche1 is teal), and to make "resolved" a black line instead of an area?
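
For the legend-order and below-the-axis items above, here is a minimal ggplot2 sketch. It assumes the charts are stacked areas with a discrete fill legend; the data, column names, and category labels are made up for illustration and are not taken from the Phlogiston scripts.

```r
library(ggplot2)

# Made-up data: two categories counted upward plus an "Interrupt" category.
df <- data.frame(
  week     = rep(1:4, times = 3),
  category = rep(c("Tranche 1", "Tranche 2", "Interrupt"), each = 4),
  points   = c(10, 12, 15, 18,  5, 6, 8, 9,  2, 3, 3, 5)
)

# Storing the interrupt series as negative values makes it stack downward
# from the X axis, because position_stack() stacks negatives separately.
is_interrupt <- df$category == "Interrupt"
df$points[is_interrupt] <- -df$points[is_interrupt]

p <- ggplot(df, aes(week, points, fill = category)) +
  geom_area() +
  # Reverse the order of the legend keys without touching the chart itself.
  guides(fill = guide_legend(reverse = TRUE))
```

If the legend is a continuous colour bar rather than discrete keys, guide_colourbar(reverse = TRUE) plays the same role.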

Here is fresh data for the script:


Code is at https://github.com/wikimedia/phab_task_history

JAufrecht updated the task description.
JAufrecht raised the priority of this task from to Normal.
JAufrecht claimed this task.
JAufrecht added a project: VisualEditor.
JAufrecht added subscribers: gerritbot, JAufrecht, Aklapper.

@Jdforrester-WMF requested the assistance of Discovery here. I checked with the analysts and since this is a fairly small, well scoped task, we can probably help out pretty fast.

JAufrecht updated the task description. Aug 25 2015, 4:00 PM

@JAufrecht so I discussed this with James yesterday; it initially sounded like a 30-minute task, but given that it is 300 lines of plotting code, it is actually a lot more than that ;).

I can do three things (an AND, not an OR):

  1. Offer you some general guidance on style and common R gotchas;
  2. Write up an example of the format I'd use for such code;
  3. Answer specific questions ('how do I do X, broadly-speaking?' Rather than 'make X do Y') within a timeboxed window.

For detailed debugging of 300 lines of code I have informed James that my hourly rate is 125 bucks ;p

Deskana changed the task status from Open to Stalled.

Awaiting response.

Problem 1: The R script generates 19 graphs, generally only the input file and maybe 1-2 other things are different. Are there general bad practices in this approach, or in this basic block of code? How can I set more defaults or otherwise avoid so much duplication?

Problem 2: This is just for VE; I will have similar lists of graphs for other projects. What's the best way to recycle this code so that I can generate the same charts for other projects, just changing the titles and the source data, without having cut-paste code that gets out of sync?

Problem 3: I'd like to make a chart like ve-backlog_burnup_crop.png, but with one set of data shown descending from the X axis, rather than ascending. How should I go about this?

Problem 4: In the same chart, the black burnup line is confusing because it shows the total points resolved, but in fact each of the different color bands could have its own points resolved. Can you think of any visual way to communicate this information legibly on one chart, rather than on a separate chart for each color band?

Problem 1: The R script generates 19 graphs, generally only the input file and maybe 1-2 other things are different. Are there general bad practices in this approach, or in this basic block of code? How can I set more defaults or otherwise avoid so much duplication?

Hopefully the example will demonstrate some of this.

Problem 2: This is just for VE; I will have similar lists of graphs for other projects. What's the best way to recycle this code so that I can generate the same charts for other projects, just changing the titles and the source data, without having cut-paste code that gets out of sync?

By using functions rather than raw calls. Check out the ?source function, too (PHP's 'include', but for R).

Problem 3: I'd like to make a chart like ve-backlog_burnup_crop.png, but with one set of data shown descending from the X axis, rather than ascending. How should I go about this?

I mean, honestly this isn't how I'd choose to visualise this at all. Is there a reason you're not just showing, say, the ratio of cleared points to current points each week?

Problem 4: In the same chart, the black burnup line is confusing because it shows the total points resolved, but in fact each of the different color bands could have its own points resolved. Can you think of any visual way to communicate this information legibly on one chart, rather than on a separate chart for each color band?

Not visualise it at all; see above ;p

Moving back to Stalled/Waiting while we await response.

ksmith renamed this task from Get R expert to help tweak report to Help requested from an R expert to help tweak phlogiston (burnup chart scripts). Sep 3 2015, 4:32 PM
Deskana closed this task as Declined. Sep 17 2015, 8:05 PM

Declining this task, as it's sat stalled waiting for feedback for a very long time. @JAufrecht, feel free to reopen if you still have specific questions you need our input on.

Sorry for the delay. I could definitely still use some help; struggling to figure out the best way to ask. Regarding the last round of questions:

  1. What do you mean by example? What do you need from me to produce an example?
  2. The scripts generate a bunch of charts from fully prepared data. Each chart is currently about fourteen lines of code, where I specify the output file, one or two input files, legends and titles, the chart type, font sizes and chart sizes, and custom scale. All of this is then copy-pasted to each chart for each project.
    1. What's the best way to put the real boilerplate stuff, like chart size and font size, into a single location like a default or stylesheet?
    2. For the stuff that varies, like file names, what is the best approach? The current approach with repetitious code, or to parameterize it somehow and have a data file along the lines of {my_input.csv, my_output.png, "My Title"}, or something else?
    3. If ?source is the right approach, should I research that or would that be in the example?
  3. "Is there a reason you're not just showing, say, the ratio of cleared points to current points each week?" By showing the backlog and burnup in a chart, we can visually answer a lot of questions that would otherwise require a lot of numbers that I think would be harder to comprehend simultaneously. E.g.:
    1. What is the trend in backlog growth?
    2. What is the trend in velocity?
    3. What is the relative scope of work for upcoming milestones, and what is the trend in scope creep?
    4. What is the relationship between velocity and scope creep?

      On the other hand, all of these are better answered with numbers, so we probably want both a picture and specific numbers, to allow conceptual navigation through the data.
JAufrecht reopened this task as Open. Sep 17 2015, 9:27 PM

@JAufrecht Thanks. Moving this into the review column in the sprint accordingly.

@JAufrecht Thanks for the questions. I've reprioritised this now. Unfortunately it's fairly deep in our sprint backlog, as we've got a lot of other work to do first. We'll get back to you when we're working on this.

Sorry for the delay. I could definitely still use some help; struggling to figure out the best way to ask. Regarding the last round of questions:

  1. What do you mean by example? What do you need from me to produce an example?
  2. The scripts generate a bunch of charts from fully prepared data. Each chart is currently about fourteen lines of code, where I specify the output file, one or two input files, legends and titles, the chart type, font sizes and chart sizes, and custom scale. All of this is then copy-pasted to each chart for each project.
    1. What's the best way to put the real boilerplate stuff, like chart size and font size, into a single location like a default or stylesheet?

Okay, the easiest way to do this is to simply wrap it all in a function. We actually have an example of this; if you look at https://github.com/Ironholds/wmf/blob/master/R/dataviz.R you'll see a function called theme_fivethirtynine(), our modification of the fivethirtyeight standard plotting form, which we use as standard for our visual display of data. To call it:

ggplot(df, aes(x,y)) + geom_line() + theme_fivethirtynine()

So yeah: function calls!
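
A minimal sketch of that idea applied to the "chart size and font size" question. The names theme_phlogiston() and save_chart() are hypothetical (they are not part of Phlogiston or the wmf package), and the specific sizes are placeholders: font settings live in one theme function, and chart dimensions live in the defaults of one save function.

```r
library(ggplot2)

# Hypothetical shared theme: the single place where font size and legend
# placement are decided for every chart.
theme_phlogiston <- function(base_size = 14) {
  theme_bw(base_size = base_size) +
    theme(
      legend.position = "bottom",
      plot.title = element_text(face = "bold")
    )
}

# Hypothetical save helper: the single place where chart dimensions are set.
# Individual charts can still override width and height when needed.
save_chart <- function(plot, path, width = 8, height = 5) {
  ggsave(path, plot, width = width, height = height)
}

# Made-up data, only to show the theme being applied.
df <- data.frame(week = 1:4, points = c(10, 14, 19, 25))
p <- ggplot(df, aes(week, points)) + geom_line() + theme_phlogiston()
```

Changing a default in either function then changes every chart that uses it, which is roughly what a stylesheet would give you.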

In addition, ggplot objects are post-hoc modifiable:

plot_obj <- ggplot(df, aes(x, y)) + geom_line()
one_themed_plot_obj <- plot_obj + theme_fivethirtynine()
two_themed_plot_obj <- plot_obj + theme_bw()

This doesn't just apply to themes, it also applies to any aesthetics or geoms after the first one - so you can avoid duplicating the initial ggplot() call if you want to display the same data multiple times with different aesthetics.

    2. For the stuff that varies, like file names, what is the best approach? The current approach with repetitious code, or to parameterize it somehow and have a data file along the lines of {my_input.csv, my_output.png, "My Title"}, or something else?

The parameterisation is what I do, yeah. Function wrappers are your friend!
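
One way this could look, as a sketch under assumptions: make_chart() is a hypothetical function, the CSV columns week and points are invented, and the input files are generated on the spot only so that the example runs on its own. The table of {input, output, title} rows is the only part that would differ between projects.

```r
library(ggplot2)

# Hypothetical chart function; the column names `week` and `points` are
# assumptions about the CSV layout, not the real Phlogiston schema.
make_chart <- function(input, output, title) {
  data <- read.csv(input)
  plot <- ggplot(data, aes(week, points)) + geom_line() + ggtitle(title)
  ggsave(output, plot, width = 8, height = 5)
  output
}

# Stand-in input files so the sketch is self-contained; real CSVs replace these.
out_dir <- tempdir()
for (name in c("backlog", "velocity")) {
  write.csv(
    data.frame(week = 1:4, points = c(3, 5, 8, 13)),
    file.path(out_dir, paste0("ve-", name, ".csv")),
    row.names = FALSE
  )
}

# One row per chart. This table could equally be read from a file
# with read.csv() instead of being written inline.
specs <- data.frame(
  input  = file.path(out_dir, c("ve-backlog.csv", "ve-velocity.csv")),
  output = file.path(out_dir, c("ve-backlog.png", "ve-velocity.png")),
  title  = c("VE backlog", "VE velocity"),
  stringsAsFactors = FALSE
)

# Call make_chart() once per row of the table.
written <- unname(unlist(Map(make_chart, specs$input, specs$output, specs$title)))
```

Adding a chart, or a whole new project, then means adding rows to the table rather than copying plotting code.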

    3. If ?source is the right approach, should I research that or would that be in the example?

source is good for pulling in code from multiple files (it's analogous to include) but I'd really recommend parameterisation here. Doesn't mean you can't store those functions in a file, and source that into a file that actually makes the plots, though (I do that all the time).
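
A minimal sketch of that layout, using only base R. The file name chart_functions.R and the function chart_title() are hypothetical; the helper file is written to a temporary directory here purely so the example is self-contained, whereas in practice it would be a normal file in the repository.

```r
# Stand-in for a real chart_functions.R that holds the shared functions.
helper_path <- file.path(tempdir(), "chart_functions.R")
writeLines(c(
  "chart_title <- function(project, chart) {",
  "  paste(project, chart, sep = \": \")",
  "}"
), helper_path)

# In the script that actually makes the plots: pull in the shared functions,
# then call them with project-specific arguments.
source(helper_path)
title <- chart_title("VisualEditor", "Backlog burnup")
```

Each project's report script can then source the same file, so the shared functions exist in exactly one place.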

  3. "Is there a reason you're not just showing, say, the ratio of cleared points to current points each week?" By showing the backlog and burnup in a chart, we can visually answer a lot of questions that would otherwise require a lot of numbers that I think would be harder to comprehend simultaneously. E.g.:
    1. What is the trend in backlog growth?
    2. What is the trend in velocity?
    3. What is the relative scope of work for upcoming milestones, and what is the trend in scope creep?
    4. What is the relationship between velocity and scope creep?

      On the other hand, all of these are better answered with numbers, so we probably want both a picture and specific numbers, to allow conceptual navigation through the data.

That makes sense. I'd suggest using multiple plots to reduce the visual complexity, though. Using multiple plots doesn't mean you need different slides, of course - you can plot them as part of the same image. Lemme know if that sounds like the right approach and I'll build an example.
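
For reference, one way to get several plots into a single image is faceting, sketched here with made-up data and assuming the series share the same X axis. The column names and series labels are illustrative only.

```r
library(ggplot2)

# Made-up data in long format: one row per week per series.
df <- data.frame(
  week   = rep(1:4, times = 2),
  series = rep(c("Backlog", "Resolved"), each = 4),
  points = c(40, 46, 51, 60, 5, 12, 20, 31)
)

# facet_wrap() draws one panel per series, stacked vertically in one image
# and sharing the same scales by default.
p <- ggplot(df, aes(week, points)) +
  geom_line() +
  facet_wrap(~ series, ncol = 1)
```

When the panels are unrelated plots rather than slices of one data set, gridExtra::grid.arrange() is another option for placing separate ggplot objects on one page.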

JAufrecht reassigned this task from JAufrecht to Ironholds. Oct 16 2015, 6:18 PM

Created new tasks for followup work to change Phlogiston to incorporate functions and parameterization, leaving, I think, only this one issue for this task:

Using multiple plots doesn't mean you need different slides, of course - you can plot them as part of the same image. Lemme know if that sounds like the right approach and I'll build an example.

I think I've gone down this road pretty far; see Phlogiston VE. I'm now showing the same or similar data in two or three different charts as we experiment to see which is most useful for different purposes. Do you have something in mind other than this?

Naw, that works fine - I just meant literally storing them in the same image to compress, e.g., number of slides. But that only works well when you're using the same scales and legend - yours looks great for this kind of data :)

I think you can close this; the new subtasks cover the specific followup work. Thanks.

JAufrecht closed this task as Resolved.