Report on the central tendency for length of pages which are edited for VisualEditor performance benchmarking {lion}
Closed, Declined (Public)

Description

For VisualEditor's performance benchmarking, it would be hugely helpful to have an understanding of the central tendency of the length of pages which are edited (by all users using all editing tools), rather than all articles that exist (because lots of articles don't get edited frequently/from year to year, and some get edited multiple times a day). Not sure what measures will be most useful – median & s.d.?

Event Timeline

Jdforrester-WMF assigned this task to Milimetric.
Jdforrester-WMF raised the priority of this task from to Needs Triage.
Jdforrester-WMF updated the task description.
Jdforrester-WMF added subscribers: Jdforrester-WMF, ori.

Thanks for the tag, Grace.

James, how about a breakdown of the "success" metric by "size of page" where the size can be:

  • more than 1 SD below the median
  • within 1 SD below the median
  • within 1 SD above the median
  • more than 1 SD above the median

So you'd have total rate of success, and rate by each of those four classes? This could be visualized as a stacked area chart under the total success rate.
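
A minimal sketch of the breakdown proposed above, assuming each edit attempt is available as a (page length in bytes, succeeded) pair. The function names and the input shape are hypothetical and are not taken from the VisualEditor or EventLogging schemas. It places each edit in one of the four size classes around the median and reports the success rate per class.

```python
import statistics


def size_class(length, median, sd):
    """Place a page length in one of four classes around the median."""
    if length < median - sd:
        return "more than 1 SD below median"
    if length < median:
        return "within 1 SD below median"
    if length <= median + sd:
        return "within 1 SD above median"
    return "more than 1 SD above median"


def success_by_size_class(edits):
    """Return the success rate per size class.

    edits: non-empty list of (page_length_bytes, succeeded) tuples
    (a hypothetical input shape chosen for this illustration).
    """
    lengths = [length for length, _ in edits]
    median = statistics.median(lengths)
    sd = statistics.pstdev(lengths)  # population SD of the observed lengths

    totals = {}  # class name -> (attempts, successes)
    for length, ok in edits:
        cls = size_class(length, median, sd)
        attempts, successes = totals.get(cls, (0, 0))
        totals[cls] = (attempts + 1, successes + int(ok))

    return {cls: s / a for cls, (a, s) in totals.items()}
```

Page lengths tend to be right-skewed, so the lowest class may well come out empty (median minus 1 SD can be negative); classes with no edits are simply absent from the result.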

That'd be great. :-) The initial impetus of this request is to try to come up with a coherent, reasonable article/set of articles for which we can measure performance numbers using our synthetic benchmarks, as part of the "is VisualEditor fast enough" requirement. However, this sounds like a really great extension of the idea.

Hm. Then, in the short term, would you rather I give you a set of articles that had N articles in each of these sixteen categories:

  • four quartiles by size of article
  • four quartiles by relative editing traffic

? If so, what N would you like?

The fact that page length is not set in the event [1] means this will be a little trickier, as we have to make prepared statements and join to every wiki db.

[1] https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/master/modules/ve-mw/init/ve.init.mw.trackSubscriber.js#L85
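
One way the sixteen size-by-traffic categories suggested above could be sampled is sketched here. It assumes the page list has already been assembled as (title, length in bytes, edit count) tuples, which is the hard part given the cross-wiki join mentioned above; the function names, the tuple layout, and the fixed seed are all assumptions made for illustration.

```python
import random
import statistics


def quartile_index(value, cuts):
    """Return 0..3: how many of the three quartile cut points the value exceeds."""
    return sum(value > c for c in cuts)


def stratified_sample(pages, n, seed=0):
    """Pick up to n titles from each (size quartile, traffic quartile) cell.

    pages: list of (title, length_bytes, edit_count) tuples (hypothetical shape).
    Returns a dict keyed by (size_quartile, traffic_quartile), each 0..3.
    """
    size_cuts = statistics.quantiles([p[1] for p in pages], n=4)
    traffic_cuts = statistics.quantiles([p[2] for p in pages], n=4)

    cells = {}
    for title, length, edits in pages:
        key = (quartile_index(length, size_cuts),
               quartile_index(edits, traffic_cuts))
        cells.setdefault(key, []).append(title)

    rng = random.Random(seed)  # fixed seed so the benchmark set is reproducible
    return {key: rng.sample(titles, min(n, len(titles)))
            for key, titles in cells.items()}
```

Cells that contain fewer than n pages return whatever they have, and cells with no pages do not appear at all, so a caller would need to check coverage before relying on all sixteen.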

That'd be great. N of 4? I don't know, what would you recommend?

Yeah, not available on the client, sadly, so an additional client request we wouldn't otherwise need.

Change 195895 had a related patch set uploaded (by Milimetric):
Analyze page size impact on editing

https://gerrit.wikimedia.org/r/195895

kevinator renamed this task from Report on the central tendency for length of pages which are edited for VisualEditor performance benchmarking to Report on the central tendency for length of pages which are edited for VisualEditor performance benchmarking {lion}. Apr 27 2015, 4:35 PM

Doing this requires ad-hoc cron jobs to synchronize data from event logging and the other MediaWiki databases into a combined staging table. A data analyst could do this task, or we could build infrastructure to solve this problem in general. But the analytics team wants to stay away from ad-hoc work going forward and favors infrastructure work (as per our new infrastructure denomination :))

I'm removing the analytics projects from this task to make it clear we'd rather have an analyst on the editing team take on this type of work.

@Jdforrester-WMF,

Are you looking for a set of summary statistics for different categories of articles by editing frequency? For example, "articles which have been edited in the past month have an average length of X kB, with a standard deviation of Y kB." Or would you prefer what Milimetric suggested: a (more-or-less) random selection of pages in each quartile of article size and editing traffic?
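
For the first option, a rough sketch of the summary statistics is given below. It assumes a list of (length in kB, edit count) pairs for pages edited in the chosen window; that input shape and the function name are hypothetical. It reports the figures both per page and per edit, since the task description asks about pages as they are edited, and frequently edited pages then count more than once.

```python
import statistics


def edited_page_length_summary(edited_pages):
    """Summarize page length for pages edited within some time window.

    edited_pages: non-empty list of (length_kb, edit_count) tuples
    (a hypothetical input shape chosen for this illustration).
    """
    # One observation per page, regardless of how often it was edited.
    per_page = [length for length, _ in edited_pages]
    # One observation per edit, so busy pages are weighted by their edit count.
    per_edit = [length for length, count in edited_pages for _ in range(count)]

    def summarize(values):
        return {
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "sd": statistics.pstdev(values),
        }

    return {"per_page": summarize(per_page), "per_edit": summarize(per_edit)}
```

The two views can differ sharply when a few long pages attract most of the edits, which is the distinction the description draws between articles that exist and articles that get edited.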

Change 195895 abandoned by Milimetric:
Analyze page size impact on editing

https://gerrit.wikimedia.org/r/195895