
Investigate how to predict fundraising banner impressions using a general traffic metric
Closed, Resolved · Public

Event Timeline

nshahquinn-wmf triaged this task as High priority.
nshahquinn-wmf renamed this task from Investigate how well different readership metrics predict banner fundraising revenue to Investigate how to predict fundraising campaign banner impressions using general traffic metrics. Feb 8 2025, 3:19 AM
nshahquinn-wmf renamed this task from Investigate how to predict fundraising campaign banner impressions using general traffic metrics to Investigate how to predict fundraising banner impressions using a general traffic metric. Feb 8 2025, 3:22 AM

This week's update:

  • Progress updates
    • Worked on method selection and other detailed planning for the traffic metric forecasting stage (RG3). See this section of the project doc for the current plan.
    • Consulted with Mikhail and Morten on the time series forecasting methods
    • Cleaned, wrangled, and explored banner impressions data
  • Any emerging blockers or risks
    • I have not done forecasting before, which means the forecasting stage (RG3) will involve a lot of learning. Additionally, I haven't found anyone available to be a dedicated consultant–reviewer, although there are good folks who are usually available for ad-hoc help, which I've already benefited a lot from. To deal with this, I'm continuing to try to limit the scope and I've extended the timeline for RG3 a bit. I'll continue asking around about a consultant–reviewer (suggestions welcome!).
    • I have decided to do the forecasting in R rather than Python. In my experience, R has a much richer ecosystem of statistical tools than Python, and I’ve had previous projects where I struggled trying to do modelling in Python that would be quite simple in R. The resources I’ve been consulting (e.g. Forecasting: Principles and Practice) also use R rather than Python. However, I’m much less fluent with R, so this will also make the work slower. Like me, Joseph rarely works with R, so it will also make it more complicated to build on this work in the future.
  • Any unresolved dependencies?
    • Nothing significant
  • Have there been any new lessons from the hypothesis project?
    • No
  • Have there been any changes to the hypothesis project scope or timeline?
    • In order to keep the scope manageable, I plan to produce traffic forecasts only for the chosen campaign groups (e.g. en6C, jaJP) and not for subgroups (e.g. enUS, enUS mobile, jaJP desktop). Looking at groups and subgroups as hierarchical time series could improve the forecast performance and give us more insight into the determinants of the group forecast, but it would add complexity and make it difficult to generate prediction intervals.
    • Since I don't plan to forecast mobile and desktop traffic separately, I will not be able to use page previews, which cover desktop only, as the predictor for banner impressions. As mentioned before, Comscore data covers only the US and Google Search Console data is only available for the past 16 months, so both are also unsuitable. That leaves user page views, despite its potential flaws, as the only possible predictor, which makes RG2, the step of selecting the predictor, unnecessary. Despite those flaws, I expect that user page views will work reasonably well, because the banner impressions metric is likely flawed in very similar ways (such as excessive scraping traffic).
    • My projected timeline had me finishing the analysis part of the traffic trend investigation (RG1) today. Since I spent this week on other parts of the project (planning for RG3 and wrangling the banner impressions data, which I planned to use in RG2), I’m adjusting the timeline accordingly and expect to have the analysis finished by Thursday of next week.
    • Because of the risks mentioned above, I’m increasing the time budgeted for RG3 (the final stage) to 2.5 weeks, which has me completing it on 26 March. If this isn’t acceptable, I can look for ways to reduce scope further.
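For context on the hierarchical option set aside above: forecasting subgroups and summing them (bottom-up) is the alternative to modelling the group series directly. A toy Python sketch of the two approaches, using a deliberately simple seasonal-naive baseline; all series names and numbers here are invented, not the project's data:

```python
import numpy as np

def naive_seasonal_forecast(series, season=7, horizon=7):
    """Forecast by repeating the last full season (a very simple baseline)."""
    last_season = series[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]

# Invented daily page view series for two subgroups of one campaign group.
rng = np.random.default_rng(0)
days = np.arange(56)
enUS_desktop = 100 + 10 * np.sin(days * 2 * np.pi / 7) + rng.normal(0, 2, 56)
enUS_mobile = 150 + 20 * np.sin(days * 2 * np.pi / 7) + rng.normal(0, 3, 56)
group_total = enUS_desktop + enUS_mobile

# Direct: model the aggregate group series itself (the approach chosen here).
direct = naive_seasonal_forecast(group_total)

# Bottom-up: forecast each subgroup, then sum (the hierarchical alternative).
bottom_up = naive_seasonal_forecast(enUS_desktop) + naive_seasonal_forecast(enUS_mobile)
```

For this additive baseline the two forecasts coincide exactly; with separately fitted models (e.g. ETS or ARIMA per series) they generally differ, which is where hierarchical reconciliation, and the extra complexity noted above, comes in.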

This week's update:

  • Progress updates
    • Did a comparative analysis of our user page views and Google Search Console and Comscore data
    • I’m almost done with the analysis part of traffic trend comparison; I just need to look at page preview data and Google Search Console data for the three extra wikis I just got access to. I expect to wrap that up on Monday and then wrap up the report by Wednesday.
  • Have there been any changes to the hypothesis scope or timeline?
    • I expect to finish the report a couple of days later than planned. This is partly because other work took up more time than I expected this past week, but that should clear up after I finish a couple of things on Monday. I'm still on track to finish the whole project on or before the 31 March deadline.

This week's update:

  • Progress updates
    • I made some tweaks to the trends report (most notably replacing figure 2 and improving the layout) and discussed it with lots of people. I’m planning to make a few more changes and then share it widely on Monday.
    • I gathered and cleaned data and started exploration for page view forecasting
  • Any emerging blockers or risks
    • As mentioned previously, I will be wrapping up this work by 1 April (and will be on vacation 2-4 April).
    • Given the limited time remaining and the challenges I outlined with forecasting, I probably will not have much time to iterate and explore, so my models may not be very refined. However, they should still serve as a useful first draft and a good starting point for Movement Insights to continue to build on.

Last week's update:

  • Progress updates
    • I made some last improvements to the trends report and shared it out.
    • I refamiliarized myself with R and its ecosystem and figured out a workflow for forecast model training and cross-validation using the fable package.
    • I experimented with different pageview forecast models, with somewhat disappointing results.
    • I will share a set of first-draft forecasts tomorrow, before I go on vacation.
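The cross-validation workflow mentioned above (stretching the series into successively longer training windows, forecasting from each origin, and scoring the forecasts) is language-neutral, even though the actual work uses R's fable/tsibble tools. A minimal Python sketch of rolling-origin cross-validation; the window sizes, toy "model", and data are all invented for illustration:

```python
import numpy as np

def rolling_origin_cv(series, initial=24, step=4, horizon=4):
    """Yield (train, test) splits with an expanding training window,
    mimicking tsibble's stretch_tsibble() followed by forecasting."""
    n = len(series)
    for end in range(initial, n - horizon + 1, step):
        yield series[:end], series[end:end + horizon]

def mean_forecast(train, horizon):
    """Toy model: predict the training mean for every future step."""
    return np.full(horizon, np.mean(train))

series = np.arange(40, dtype=float)  # trending toy series
errors = []
for train, test in rolling_origin_cv(series):
    fc = mean_forecast(train, len(test))
    errors.append(np.mean(np.abs(test - fc)))  # MAE at each forecast origin

mae = float(np.mean(errors))  # average MAE across origins
```

Averaging the error across origins, rather than scoring a single train/test split, is what makes this a fairer estimate of out-of-sample forecast accuracy for a time series.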

Last week's update:

I've finished my model explorations and generated first-draft page view forecasts for all the campaigns we wanted to forecast, plus a global forecast.

I've actually gotten results that are a bit better than the "somewhat disappointing" ones I was seeing yesterday. However, the forecasts are still based on limited, noisy data and haven't been peer-reviewed, so please don't go out and, say, buy a bunch of servers based on them!

I haven't yet made a notebook to display the forecast results; I will do that when I'm back on Monday, and it shouldn't take more than a day. For now, here are graphs of a couple of forecasts as teasers.
{F59007626} {F59007631}

If you want to look at my modelling notebook (full repo), please do! It's quite clean and has a bunch of narration so it should be a reasonable reading experience if you don't mind stats and R code.

This week's update: