
Create analytics.wikimedia.org
Closed, Resolved · Public · 5 Estimated Story Points

Description

Create analytics.wikimedia.org to host the new tools/data that we are working on as part of the migration of stats.wikimedia.org

Event Timeline

Will it be a wiki? microsite? redirect to a page on meta?

Will it be a wiki? microsite? redirect to a page on meta?

It sounds like this is more of a request for a server to host/run tools than a request for just a DNS A record. I'm curious why Wikimedia Labs is insufficient.

There's some debate about this. We haven't used data.wikimedia.org in the past because of the possible confusion with wikidata.

The wmflabs domain seems less "production" simply because of the word "labs" in the name. I have nothing against it technically and I don't think others do either.

The specific question that started this conversation was where should we put this dashboard, given that it'll be linked-to from wikistats:

https://browser-reports.wmflabs.org/

It sounds like this is more of a request for a server to host/run tools than a request for just a DNS A record. I'm curious why Wikimedia Labs is insufficient.

Being a prod rather than labs domain speaks to these tools being fully supported, similar to stats.wikimedia.org (which we are working on migrating data out of) and performance.wikimedia.org.

Also, for SEO and discoverability, pretty URLs work better.

Nuria renamed this task from Create data.wikimedia.org to Create analytics.wikimedia.org. Apr 13 2016, 3:49 PM
Nuria updated the task description. (Show Details)

If we're doing this in production, the frontend should probably be through cache_misc. I'm not sure what the backend looks like at all role/software-wise...

There's some debate about this. We haven't used data.wikimedia.org in the past because of the possible confusion with wikidata.

Yeah, this is a valid concern. The other thought I had was that data.wikimedia.org would be a domain serving RESTBase's API or MediaWiki's api.php or something.

I think I like analytics.wikimedia.org better.

The wmflabs domain seems less "production" simply because of the word "labs" in the name. I have nothing against it technically and I don't think others do either.

I'm still a bit unclear on the scope of this task. I may have missed previous discussions. Which official tools (mentioned in the task description) are we talking about?

I guess I'm wondering whether this task is about (for example) "promoting" the page view API from Labs to production or if you just want a "better" domain name (one without "labs" in it) that will continue to point at Labs infrastructure?

Potentially tangentially: I'm unclear what the distinction between Labs and production is when the Wikimedia Foundation is running/operating both. Are there other consequences to such a promotion or demotion from one to the other? Would the underlying hosting/hardware change? Would the level of support and response time for an outage be different?

The pageview API is running on wikimedia.org; that's the prod cluster. This task right now is about having a production domain for reports like this: https://browser-reports.wmflabs.org/

In the future, we may transition more of the data and reports that wikistats provides to dynamic reports like these. But in any case we'll keep both up so people can give feedback and compare the two approaches. Personally, I want to get to the point where we have a single place to find all our data products and analytics APIs. It makes sense that place could be analytics.wikimedia.org, but that's a conversation for later.

Potentially tangentially: I'm unclear what the distinction between Labs and production is when the Wikimedia Foundation is running/operating both. Are there other consequences to such a promotion or demotion from one to the other? Would the underlying hosting/hardware change? Would the level of support and response time for an outage be different?

This is a bit of a tangent, yes. The underlying hosting/hardware does change. The support and response time depends more on the team that's supporting the tool rather than the environment, though our operations team helps more with the production-hosted tools than the labs-hosted tools. This has been changing though, as we recognize (to my great satisfaction) that some of our most critical tools run in Labs.

I hope over time that the difference between labs and production is simply that Labs is open to the public to deploy and manage everything, but supported as much as possible, and that production is the same but just limits access to people who sign the NDA. I've tried to approach this equalization in different ways at WMF, first with wikimetrics and now with broader infrastructure efforts. But it's still one of my main goals.

Potentially tangentially: I'm unclear what the distinction between Labs and production is when the Wikimedia Foundation is running/operating both. Are there other consequences to such a promotion or demotion from one to the other? Would the underlying hosting/hardware change? Would the level of support and response time for an outage be different?

Yes, Yes, and Yes.

Better support. Stricter standards for how things are deployed/run/managed (in terms of things like config management and redundancy). More roots have access to the service and its configuration and know how to operate on it. Better raw availability. Better downtime responsiveness. etc...

Also notable: we can't deploy secure private data to labs, as everything there is open to the world in configuration management terms (and less secure in general by its nature against indirect attacks on private information). Being able to securely deploy and manage snippets of truly private data (generally speaking, that means things like authentication tokens, passwords, SSL private keys, etc.) is often a requirement for production services to access other production services.

@BBlack: just in case there's some concern about what the purpose of analytics.wikimedia.org is. We will never use it to proxy to services / dashboards on labs. We're going to change some very simple configuration [1] and begin deploying those reports to a subfolder on analytics.wikimedia.org. We understand the reason to keep dependencies from production limited only to other production services, and we have no requirements that would make us disagree in the future.

[1] https://github.com/wikimedia/analytics-dashiki/blob/master/fabfile.py#L13
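For illustration only, the kind of "very simple configuration" change described above might look something like the sketch below, assuming a Fabric 1.x fabfile with a per-environment target table; every host name and path here is a placeholder, not the project's actual configuration.

    # Hypothetical sketch -- not the actual fabfile. It shows the sort of small
    # target-table change being described: pointing an existing deploy stanza
    # at a production docroot subfolder instead of a labs instance.
    from fabric.api import env

    DEPLOY_TARGETS = {
        'labs': {                                   # current labs-style target
            'hosts': ['dashiki-01.eqiad.wmflabs'],  # placeholder host
            'docroot': '/srv/static/browser-reports.wmflabs.org',
        },
        'production': {                             # hypothetical prod target
            'hosts': ['analytics-web.example.wmnet'],
            'docroot': '/srv/static/analytics.wikimedia.org/browser-reports',
        },
    }

    def use_target(name):
        """Select a deploy target, e.g.: fab use_target:production deploy"""
        env.hosts = DEPLOY_TARGETS[name]['hosts']
        env.docroot = DEPLOY_TARGETS[name]['docroot']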

@Milimetric - I guess what I'm missing here is the disconnect between our public termination of analytics.wikimedia.org (on, say, cache_misc) and what "a subfolder on analytics.wikimedia.org" is.

Normally there would be some production backend service defined for this kind of thing. I really don't know what this is structured like, but making a bunch of assumptions I'd expect something like:

  1. Traffic layer terminates analytics.wikimedia.org in the cache_misc cluster, routes all requests to internal service hostname analytics-web.svc.eqiad.wmnet
  2. analytics-web.svc.eqiad.wmnet defined in LVS/pybal terms like all other standard internal services, backending to some redundant cluster of real hosts named analytics-web1001.eqiad.wmnet and so-on.
  3. Some kind of service software runs on analytics-web1001.eqiad.wmnet and friends, which hosts the actual code and/or content you're trying to make available (via the public production hostname analytics.wikimedia.org).

If this service is literally just static data that's regenerated and synced periodically, the service could be very simplistic I guess, but there's still puppetization and engineering of that deployment/software/sync-process or whatever it is to do.

(for services small enough to not need a cluster of their own hardware, I think we do have solutions where we virtualize smaller services on ganeti, too. The above is just the typical example of a full-fledged service)

Thank you for the explanations and clarifications here. I really appreciate them.

Also see T126281 (I think we should not fix/redirect stats.wikipedia.org, but say that there is just stats.wikimedia.org and this new analytics.wikimedia.org).

@BBlack
This would not be a full-fledged service. What we would be deploying, either via puppet or fab, is just html/js, so we only really need an apache install via puppet. Data sourced by this client-side app will come from datasets.wikimedia.org just like it does now.

This is the puppet role we use in our labs instances now: https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=952291af-53fa-4d81-b060-b4e79a4e063a&project=dashiki&region=eqiad

role::simplestatic

The build and deploy of the actual js/html we handle with fab.

@Nuria - thanks for the details!

We still need to sort out an actual place for the js/html to live at in production (which, if it's as simple as it sounds, can probably be a ganeti virtual host inside our private networks), and we'll need to sort out the process around the deployment/updates of it via fab (which I'm also not familiar with).

@BBlack: ganeti sounds fine, as really the majority of the time requests are going to be served by varnish.

The fabfile we use to deploy to labs is here: https://github.com/wikimedia/analytics-dashiki/blob/master/fabfile.py
Now, this strategy works well if everyone has ssh permissions to the host (or virtual instance) where the code is ultimately hosted.

Hopefully this makes sense.
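For readers unfamiliar with fab, here is a minimal sketch of the build-locally-then-rsync flow described above, assuming Fabric 1.x; the build command, host, and paths are placeholders, not taken from the real dashiki fabfile linked above.

    # Minimal sketch (Fabric 1.x) of "compile JS locally, rsync static files up".
    # Build command, host, and paths are placeholders, not dashiki's real ones.
    from fabric.api import env, local, task
    from fabric.contrib.project import rsync_project

    env.hosts = ['dashboards-01.eqiad.wmflabs']      # placeholder target host
    BUILD_DIR = 'dist/'                              # placeholder build output dir
    REMOTE_DOCROOT = '/srv/static/browser-reports.wmflabs.org/'

    @task
    def deploy():
        # Compile/minify the client-side JS and HTML on the deployer's machine.
        local('npm install && npm run build')        # placeholder build step
        # Push the resulting static files to the serving host; this is why every
        # deployer needs ssh access to that host (or virtual instance).
        rsync_project(remote_dir=REMOTE_DOCROOT, local_dir=BUILD_DIR, delete=True)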

@BBlack: let us know if you think we can proceed with this and whether fab is an acceptable way to deploy

I really have no idea about the fab deployment method (whether it's ok, how we automate it and grant access, where it's fetching data from, etc), or how/when we're going to schedule work for this (to figure out if/how fab data deploy works out, and then puppetize and deploy this new kind of backend service host).

Once those things are complete, the addition of the service to cache_misc is relatively-trivial, but still: there's a lot of other things going on for all of the team driven by pre-defined goals, and this is an interrupt coming from outside of that process that probably involves significant work on our end to be scheduled.

@akosiaris might have more input on building a backend service host in ganeti and/or its puppetization. @mark may have more input on priority/schedule. @Ottomata I assume should be involved in this discussion too.

@BBlack
This would not be a full-fledged service. What we would be deploying, either via puppet or fab, is just html/js, so we only really need an apache install via puppet.

So we are talking about a static site. We are already hosting a couple of these on bromine.eqiad.wmnet, which hosts other static sites, e.g. the transparency report, the annualreport, and a static copy of bugzilla. The way this is described, it seems like a perfect candidate.

I really have no idea about the fab deployment method (whether it's ok, how we automate it and grant access, where it's fetching data from, etc), or how/when we're going to schedule work for this (to figure out if/how fab data deploy works out, and then puppetize and deploy this new kind of backend service host).

All these microsites/static sites are kind of one-offs that all use the git::clone puppet define to just fetch the latest HEAD of a git branch in gerrit. These static sites are updated so rarely that anything more than that (fabric, for example) is overkill. If that is sufficient for this site, I am fine with it. @Nuria, would that be sufficient?

@akosiaris might have more input on building a backend service host in ganeti and/or its puppetization.

Looking at the microsites role and the annualreport and transparency modules it seems like not a lot of puppetization is required. Assuming we follow the same approach ofc.

Everything Alex already said :) I set up bromine and most of those microsites and yeah, it's meant for small static sites. In addition to one of those small puppet roles we'll just need to request a Gerrit repo which holds the files. I would put that under wikimedia, so wikimedia/analytics or so, like wikimedia/annualreport and others. Requesting Gerrit repos/projects used to be an on-wiki thing and then there was discussion to move that into Phabricator, but I'm not sure of the current status.

Well, the impedance mismatch here between the standard static bromine setup and what analytics is asking for may then be all about the static-ness and deployment process. It sounds to me like they want to do content updates fairly regularly, and they have this fab tool set up to do that today. I don't know how well that maps in terms of data update rate or method.

It would mostly just be about who has +2 on the gerrit repo that holds the actual site content. If the puppet role on our site git clones with "ensure latest" then there need to be no regular changes in the ops/puppet repo and deployment is just merging in the content repo. That being said, I don't know about the fab tool.

It would mostly just be about who has +2 on the gerrit repo that holds the actual site content. If the puppet role on our site git clones with "ensure latest" then there need to be no regular changes in the ops/puppet repo and deployment is just merging in the content repo.

That would be true if we deployed from source, but we build our code, so updating from source alone will not work. We could add a step to puppet to build our code and such, but that is what we do with fab:

https://github.com/wikimedia/analytics-dashiki/blob/master/fabfile.py#L13

Let us know, however, if fab doesn't seem like a good strategy for deploying to prod.

When you say "build our code" do you mean building client-side javascript code that's ultimately static content from the server's perspective, or do you mean building server-side code that powers some URLs of this service?

"build code" and "static site" are confusing me a bit. the kind of static site we host on bromine means HTML and CSS and some images.

When you say "build our code" do you mean building client-side javascript code that's ultimately static content from the server's perspective, or do you mean building server-side code that powers some URLs of this service?

Based on the fabfile, it looks like all the build steps could be performed locally, except the creation of some directories. So I would suggest:

  • Declare the relevant directories in Puppet.
  • Change the build script to run locally, and commit the result -- either into a special 'build' subfolder, or into a separate deployment branch.
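A rough sketch of that suggestion, under the assumption that the build can be driven by Fabric's local() and that a dedicated 'deploy' branch plus a build/ subfolder are acceptable conventions (neither is settled in this discussion); the server side could then stay a plain git clone with "ensure latest".

    # Rough sketch of "build locally, commit the result" (Fabric 1.x).
    # Branch name, build command, and directory layout are assumptions.
    from fabric.api import local, task

    @task
    def build_and_commit(branch='deploy'):
        local('npm run build')                       # placeholder build step
        local('git checkout %s' % branch)            # dedicated deployment branch
        local('mkdir -p build && cp -r dist/* build/')
        local('git add build/')
        local('git commit -m "Update built dashboards"')
        local('git push origin %s' % branch)
        local('git checkout -')                      # return to the working branch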

Hm, I had assumed we would just host analytics.wikimedia.org on stat1001.

I think we'd like analytics.wikimedia.org to eventually supersede stats.wikimedia.org, and also to replace datasets.wikimedia.org. We'd like to eventually move the data files in datasets.wikimedia.org to a sub path in analytics.wikimedia.org. I can also foresee using it for other things.

In general, analytics.wikimedia.org will host static files (html, js, tsvs, etc.), but not just for one service (dashiki / reportcard).

(a-team, correct me if I am wrong there.)

In general, analytics.wikimedia.org will host static files (html, js, tsvs, etc.), but not just for one service (dashiki / reportcard).

Correct, some dashiki plots will be one of the many things we hope to host

I want to explain the current setup in labs a little bit and point out that it won't work as-is in prod - it needs some re-working. The idea is we have a single instance that has multiple folders with static content, in, say, /srv/static. The folders are set up this way:
/srv/static/edit-analysis.wmflabs.org
/srv/static/browser-reports.wmflabs.org
Hiera + puppet work together to create a virtualhost entry for each of these domains, such that the content in the respective folder is served when you hit edit-analysis.wmflabs.org. All the fabric deployer does is some local JS compilation and an rsync of the static files to the relevant folder.

In prod, these dashboards would have to be served at analytics.wikimedia.org/browser-reports, analytics.wikimedia.org/edit-reports, etc., and the apache setup will have to change to do that. Either way, it's not a single repo that can be git cloned and updated rarely. The compiled dashboards are not in git - they are compiled locally and copied to the remote (which is what fabric does). We would also need the flexibility to deploy each of the dashboards separately and frequently.
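To make that concrete, here is a sketch of how the same rsync approach could map onto a single production docroot with one subfolder per dashboard, each deployable separately; the docroot, dashboard names, and local paths are placeholders, not a settled layout.

    # Sketch: one production docroot, one subfolder per dashboard (Fabric 1.x).
    # All names and paths are placeholders for illustration.
    # The target host would be supplied at run time, e.g. via `fab -H <host> ...`.
    from fabric.api import task
    from fabric.contrib.project import rsync_project

    PROD_DOCROOT = '/srv/static/analytics.wikimedia.org'   # hypothetical path
    DASHBOARDS = {
        'browser-reports': 'dist/browser-reports/',         # local build outputs
        'edit-reports': 'dist/edit-reports/',
    }

    @task
    def deploy_dashboard(name):
        """Deploy a single dashboard, e.g.: fab deploy_dashboard:browser-reports"""
        rsync_project(
            remote_dir='%s/%s/' % (PROD_DOCROOT, name),
            local_dir=DASHBOARDS[name],
            delete=True,
        )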

these dashboards would have to be served at analytics.wikimedia.org/browser-reports, analytics.wikimedia.org/edit-reports etc. And the apache setup will have to change to do that.

It will have to be a new apache setup for prod, ja, but since they will be hosted on a single domain, the puppetization doesn't need any knowledge of the subdirectories of content that will be hosted. This is a fairly standard static site setup.

The compiled dashboards are not in git - they are compiled locally and copied to the remote (which is what fabric does).

'generated' might be a less confusing term than 'compiled' here. What are they generated locally from? Generated datasets from somewhere, ja? Does this need to be done locally to prod server? Could we do this on a deployment server and use scap3?

and use scap3

Actually, I don't think we can use scap3 if they aren't in git, since scap3 deploys via git.

It will have to be a new apache setup for prod, ja, but since they will be hosted on a single domain, the puppetization doesn't need any knowledge of the subdirectories of content that will be hosted. This is a fairly standard static site setup.

Right.

'generated' might be a less confusing term than 'compiled' here. What are they generated locally from? Generated datasets from somewhere, ja? Does this need to be done locally to prod server? Could we do this on a deployment server and use scap3?

Compiled is right. The data does not play a part; the only thing we are doing is building javascript (minifying & splitting dependencies). Only the frontend of the dashboards is deployed, not the data. Thus no scap3 process is needed; that seems a bit overkill when we could commit the built source and deploy that way by having puppet git-update the repo.

This is what @ori mentioned above "Change the build script to run locally, and commit the result -- either into a special 'build' subfolder, or into a separate deployment branch"

This is what I think needs to be done here to resolve this ticket as soon as possible (cc-ing @BBlack and @Ottomata for confirmation):

  1. Host analytics.wikimedia.org on stat1001.
  2. Have a static puppet configuration that serves files under analytics.wikimedia.org/dashboards/browser-reports. We shall be deploying other dashboards at analytics.wikimedia.org/dashboards/other-path-that-we-do-not-know-yet.
  3. Add a /build directory to our dashboards tool where we commit the built source and deploy from there via a git update of the repo in puppet.

If you guys agree with 1) and 2), I can take care of 3).

@BBlack: can you confirm whether it is OK with ops to deploy this domain to stat1001?

Change 286948 had a related patch set uploaded (by Ottomata):
Add analytics.wikmiedia.org pointing at misc cluster (stat1001)

https://gerrit.wikimedia.org/r/286948

Change 286948 merged by Ottomata:
Add analytics.wikmiedia.org pointing at misc cluster (stat1001)

https://gerrit.wikimedia.org/r/286948

Change 286950 had a related patch set uploaded (by Ottomata):
Add analytics.wikimedia.org to list of domains served by misc varnish backend stat1001

https://gerrit.wikimedia.org/r/286950

Change 286957 had a related patch set uploaded (by Ottomata):
Set up analytics.wikimedia.org site on stat1001

https://gerrit.wikimedia.org/r/286957

Change 286957 merged by Ottomata:
Set up analytics.wikimedia.org site on stat1001

https://gerrit.wikimedia.org/r/286957

Change 286950 merged by Ottomata:
Add analytics.wikimedia.org to list of domains served by misc varnish backend stat1001

https://gerrit.wikimedia.org/r/286950

Ottomata set the point value for this task to 5. May 5 2016, 4:17 PM