
Relocate CI generated docs and coverage reports
Closed, Resolved · Public

Description

Part of / blocker of T133300

Another blocker, and potentially a prerequisite, is publishing the generated documentation. The drawing below and https://www.mediawiki.org/wiki/Continuous_integration/Documentation_generation give an overview of the documentation flow.

[Drawing: 201812 (was CI Target 2016) - Doc / coverage publishing (private)]

As summarized by @thcipriani :

Current

  • Doc and coverage reports are generated on Jenkins Docker instances in Docker containers (and they have a lot of build dependencies)
  • The job on the instance runs rsync to a WMCS instance integration-publisher02
  • A Jenkins job, publish-on-contint1001, runs on contint1001.wikimedia.org to rsync the doc from the publisher instance.

We would need a space in production (Ganeti VM?) to host the material with PHP7 (some docs need that, see T206046). It should most probably be isolated from the rest of the network: although code published there gets Code-Reviewed +2, we never know.

Future?

We would need some intermediary system for the CI building instances to push to (currently WMCS instance integration-publishing02).

  • Doc and coverage reports are still generated on CI Docker instances
  • Building instances push to a publisher system
    • might reuse integration-publisher02
  • Jenkins (or another system) runs a task that fetches from the publisher system into the doc.wikimedia.org document root.

We need to find a target host to migrate the documentation to. It would have Apache / PHP7 and run code as generated by the various code repositories that have doc/coverage enabled.

Flow originating from webhost

Proto | source host | source IP | dest network | dest Host | dest IP | dest port | description
TCP | webhost | ??? | WMCS project | integration-publishing02 | 172.16.4.5 | 873 | Jenkins job doing a rsync originating from Webhost to fetch material
WARNING: That is where the devil resides. Can a Ganeti instance be allowed to fetch from a WMCS instance? That might require adding some routing and punching a hole in the firewall. Otherwise, maybe we can expose the rsync daemon on a public IP and restrict it to traffic from the Webhost Ganeti instance.

Flow going to webhost

Proto | source network | source host | source IP | dest Host | dest port | description
TCP | labs support hosts | contint1001.wikimedia.org | 208.80.154.17 | webhost | 22 | Jenkins master ssh connection
TCP | production | misc varnish | - | webhost | 80 | Misc cache to Apache serving doc.wikimedia.org

Event Timeline

hashar added a subscriber: chasemp.

I have updated the task with a basic overview. The doc is generated on labs instances and rsynced to the labs instance integration-publisher.

The system hosting the documentation is attached to Jenkins as a slave so that once the doc is generated, a job runs to rsync the material from integration-publisher to the Apache document root /srv/org/integration/doc.

Looks like the webhost system would have to be on the labs support network just like scandium so it can be accessed by Jenkins and can access the labs instance.

It would need to be fairly network isolated though in case the doc has some bad code (unlikely, that should only be merged / reviewed changes). Still it is probably not a good idea to have PHP scripts running straight in the labs support hosts network.

Restricted Application added a project: Operations. · Jul 12 2016, 2:11 PM

From a discussion I had with @thcipriani, I have been overthinking the requirement to move the doc hosting out of gallium / its replacement. My point was that running PHP code in the labs support network did not seem like a good idea. However, the PHP code is limited to:

  • doxygen PHP search script
  • oojs demos

Both are running on gallium right now and we can get them to run on the gallium replacement machine. It should not block the migration.

That is still worth addressing later on as a Technical-Debt action.

hashar lowered the priority of this task from High to Normal. · Jul 13 2016, 3:35 PM
hashar removed hashar as the assignee of this task. · Sep 12 2016, 3:05 PM

Not working on it for now. It is staying on gallium/contint1001.

This project is selected for the Developer-Wishlist voting round and will be added to a MediaWiki page very soon. To the subscribers, or proposer of this task: please help modify the task description: add a brief summary (10-12 lines) of the problem that this proposal raises, topics discussed in the comments, and a proposed solution (if there is any yet). Remember to add a header with a title "Description," to your content. Please do so before February 5th, 12:00 pm UTC.

Dzahn added a comment. · Feb 4 2017, 12:17 AM

modify the task description: add a brief summary (10-12 lines) of the problem that this proposal raises, topics discussed in the comments, and a proposed solution (if there is any yet). Remember to add a header with a title "Description," to your content.

It looks to me like this is already the case. There is a header with description, it summarizes the issues and suggests a solution. There is even a diagram to go with it.

@Dzahn It looks great. This is just our attempt to stay consistent with content for all the proposals and to get a brief summary that we could add on the MW page!

hashar updated the task description. · Dec 14 2018, 10:22 AM

I have updated the drawing.

doc/coverage can probably be hosted on a Ganeti VM; I think that would be sufficient. The devil is in figuring out how to transfer the material from the CI slaves in WMCS to a Webhost instance in production.

We so far have been using an intermediary WMCS instance (integration-publishing02) which CI build hosts can rsync to (same WMCS tenant). There is then a job on the CI master (contint1001) which has the appropriate routing/firewall rules to reach a WMCS instance.

With the new Webhost being moved out of contint1001, it loses the ability to reach WMCS instances.

I am lost on finding an appropriate solution.

I think we can drop the integration-publishing02 proxy entirely. A publishing job would trigger a job on contint1001 and pass it the instance IP/port and the workspace. That is sufficient to initiate a rsync from contint1001. Once synced on contint1001, the job would rsync away to the new webhost on a VM. This way there are no more issues about communicating between WMCS and the webhost machine.

I can't remember why we went with the publishing proxy. Maybe because I could not find a way to retrieve the instance IP/port on which the job ran. When coding Castor, I used the Parameterized Trigger Plugin (trigger-builds in JJB), which passes SSH_CONNECTION, which holds the instance IP/port. That lets us reach the instance directly.
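The lookup described above can be sketched in shell. OpenSSH sets SSH_CONNECTION to four space-separated values: client IP, client port, server IP, server port. Assuming the variable reaches the triggered job unchanged and that the CI instance is the server end of the connection (both assumptions, and the helper name is hypothetical), the instance address would be fields three and four:

```shell
# Sketch only: split an OpenSSH SSH_CONNECTION value
# ("client_ip client_port server_ip server_port") and print the
# server end as "ip port". ssh_connection_server is a hypothetical helper.
ssh_connection_server() (
    set -f        # disable globbing while word-splitting the value
    set -- $1     # split into positional parameters
    [ "$#" -eq 4 ] || { echo "unexpected SSH_CONNECTION format" >&2; exit 1; }
    printf '%s %s\n' "$3" "$4"
)

# Example with made-up documentation addresses:
ssh_connection_server "192.0.2.10 51234 198.51.100.7 22"
```

The function body runs in a subshell, so the word-splitting does not disturb the caller's positional parameters.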

Seems we would need a Ganeti VM and then refactor the way publishing is handled in the Jenkins jobs. The doc.wikimedia.org site would also need to be migrated (rsync current data, change DNS, etc.).

Change 480657 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] (WIP) Overhaul publishing

https://gerrit.wikimedia.org/r/480657

hashar claimed this task. · Dec 19 2018, 5:15 PM

Change 480817 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] doc: copy new httpd config for doc.wm.org to its own place

https://gerrit.wikimedia.org/r/480817

Change 480817 merged by Dzahn:
[operations/puppet@production] doc: copy new httpd config for doc.wm.org to its own place

https://gerrit.wikimedia.org/r/480817

Change 480821 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] doc: httpd, add file handler for .php files -> PHP-FPM

https://gerrit.wikimedia.org/r/480821

Change 480821 merged by Dzahn:
[operations/puppet@production] doc: httpd, add file handler for .php files -> PHP-FPM

https://gerrit.wikimedia.org/r/480821

Change 480828 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/docroot@master] scap configuration for deployment

https://gerrit.wikimedia.org/r/480828

Change 480832 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] scap configuration for integration/docroot.git

https://gerrit.wikimedia.org/r/480832

Change 480830 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] doc: remove "deprecated user of DefaultType"

https://gerrit.wikimedia.org/r/480830

Change 480830 merged by Dzahn:
[operations/puppet@production] doc: remove "deprecated user of DefaultType"

https://gerrit.wikimedia.org/r/480830

The scap config for integration/docroot is:

The repository will then be deployed at /srv/deployment/integration/docroot. So we will probably want to update the Apache document root to point to it and have rsyncd expose org/wikimedia/org relative to that (instead of /srv/).

Change 480828 abandoned by Hashar:
scap configuration for deployment

Reason:
Will do with a simple git::clone and manual pull. that easier to handle.

https://gerrit.wikimedia.org/r/480828

Change 480832 abandoned by Hashar:
scap configuration for integration/docroot.git

Reason:
Will do with a simple git::clone and manual pull. that easier to handle.

https://gerrit.wikimedia.org/r/480832

Change 480879 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] doc: clone integration/docroot

https://gerrit.wikimedia.org/r/480879

Change 480881 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] doc: relocate from /srv to /srv/docroot

https://gerrit.wikimedia.org/r/480881

Change 480879 merged by Dzahn:
[operations/puppet@production] doc: clone integration/docroot

https://gerrit.wikimedia.org/r/480879

Change 480881 merged by Dzahn:
[operations/puppet@production] doc: relocate from /srv to /srv/docroot

https://gerrit.wikimedia.org/r/480881

Mentioned in SAL (#wikimedia-traffic) [2018-12-20T18:06:37Z] <mutante> doc1001 - meged gerrit:480881 and then manually moved the entire /srv/org/wikimedia/doc/ structure into /srv/docroot/srv/org/wikimedia/ and deleted the old dirs T137890

Mentioned in SAL (#wikimedia-releng) [2018-12-21T11:17:18Z] <hashar> updating all publish jobs to also publish to doc1001.eqiad.wmnet | https://gerrit.wikimedia.org/r/#/c/integration/config/+/480657/ | T137890

Change 480657 merged by jenkins-bot:
[integration/config@master] Simplify doc publishing, experiment on a new host

https://gerrit.wikimedia.org/r/480657

All publish jobs now also trigger the publish-to-doc1001 job. It seems good so far and I think I caught all potential issues it might have. I will monitor it until next year.

If all is fine, the migration would consist of:

  • rsync from contint1001 /srv/org/wikimedia/doc to doc1001.eqiad.wmnet /srv/docroot/org/wikimedia/doc
  • edit the doc-publish macro to stop triggering the publish-on-contint1001 job and refresh jobs
  • switch doc.wikimedia.org backend in varnish to doc1001.

Merry Christmas and a happy new year

hashar added a comment. · Jan 7 2019, 4:34 PM

The new publisher is wrong: I made a mistake in the rsync command that publishes to doc1001. The source directory is transferred as is instead of its contents. For example, the documentation for CollaborationKit on contint1001 is:

/srv/org/wikimedia/doc/CollaborationKit/master/php/index.html

On doc1001.eqiad.wmnet

/srv/docroot/org/wikimedia/CollaborationKit/master/php/php/index.html

Ditto for coverage reports: for labs-tools-heritage, the target directory has the coverage under ./coverage.

I forgot to normalize the source path so that it always ends with "/.".
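A minimal sketch of that normalization, assuming the intended suffix is "/.": rsync copies a directory's contents rather than the directory itself when the source ends in "/" or "/.". The helper name is hypothetical and this is not the code the job uses:

```shell
# Sketch only: make an rsync source path end in "/." so rsync transfers
# the directory's contents instead of the directory itself.
# normalize_rsync_src is a hypothetical helper, not the job's actual code.
normalize_rsync_src() (
    src=$1
    # Strip an existing "/." suffix, then any trailing slashes.
    src=${src%/.}
    while [ "${src%/}" != "$src" ]; do
        src=${src%/}
    done
    printf '%s/.\n' "$src"
)

# All three spellings of the same directory normalize to "php/.":
normalize_rsync_src "php"
normalize_rsync_src "php/"
normalize_rsync_src "php/."
```

Normalizing to one spelling means the publish step behaves the same whether or not the caller supplied a trailing slash.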

Change 482662 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Fix rsync source dir when publishing to doc1001

https://gerrit.wikimedia.org/r/482662

hashar added a comment. · Jan 7 2019, 4:36 PM

I will keep doing verifications to make sure content is properly published.

Change 482662 merged by jenkins-bot:
[integration/config@master] Fix rsync source dir when publishing to doc1001

https://gerrit.wikimedia.org/r/482662

Change 480536 had a related patch set uploaded (by Hashar; owner: Dzahn):
[operations/puppet@production] cache/trafficserver: switch doc.wikimedia.org to doc1001 backend

https://gerrit.wikimedia.org/r/480536

hashar added a comment. · Jan 9 2019, 9:11 AM

The broken OOJS page (T206046) does work properly on doc1001.eqiad.wmnet

CI has been publishing on both contint1001 and doc1001. On contint1001 I rsynced the whole doc directory and did a compare:

# Mirror doc1001's published tree locally, then diff the directory listings (4 levels deep).
rsync -apz --delete \
   rsync://doc1001.eqiad.wmnet/doc/ /srv/doc1001.eqiad.wmnet/ \
   && colordiff -U0 <( cd /srv/org/wikimedia/doc && find . -maxdepth 4 -type d ) \
                    <( cd /srv/doc1001.eqiad.wmnet && find . -maxdepth 4 -type d )

doc1001 just has some extra empty directories I have created for testing the new host:

--- /dev/fd/63	2019-01-09 09:10:27.941278092 +0000
+++ /dev/fd/62	2019-01-09 09:10:27.941278092 +0000
@@ -298,0 +299 @@
+./hashar-test2
@@ -1262,0 +1264 @@
+./hashar-test
@@ -1411,0 +1414 @@
+./ParsoidDOCKER
@@ -1735,0 +1739 @@
+./hashar-testdocpublish

Will clean them up when we get sudo access on the host.

In short: we can change the backend in Varnish.

Change 480536 merged by Ema:
[operations/puppet@production] cache/trafficserver: switch doc.wikimedia.org to doc1001 backend

https://gerrit.wikimedia.org/r/480536

Change 483120 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Stop publishing doc to contint1001

https://gerrit.wikimedia.org/r/483120

Mentioned in SAL (#wikimedia-releng) [2019-01-09T13:27:31Z] <hasharAway> shutting down instance integration-publishing02 [172.16.4.5] No more used | T137890

Change 483120 merged by jenkins-bot:
[integration/config@master] Stop publishing doc to contint1001

https://gerrit.wikimedia.org/r/483120

Change 483126 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] Remove ci::publisher, no more used

https://gerrit.wikimedia.org/r/483126

Change 483126 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ci::publisher, no more used

https://gerrit.wikimedia.org/r/483126

hashar added a comment. · Jan 9 2019, 1:33 PM

https://doc.wikimedia.org/ is now served by doc1001.eqiad.wmnet and working as expected.

Left to do:

Change 483128 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docroot on doc1001.eqiad.wmnet needs manual deploy

https://gerrit.wikimedia.org/r/483128

Change 483128 merged by jenkins-bot:
[integration/config@master] docroot on doc1001.eqiad.wmnet needs manual deploy

https://gerrit.wikimedia.org/r/483128

Change 484194 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] doc: force users umask for wikidev group

https://gerrit.wikimedia.org/r/484194

Change 484304 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] rsync: readd incoming and outgoing chmod

https://gerrit.wikimedia.org/r/484304

Change 484308 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] doc: make published files group writable

https://gerrit.wikimedia.org/r/484308

Change 484194 merged by Dzahn:
[operations/puppet@production] doc: force users umask for wikidev group

https://gerrit.wikimedia.org/r/484194

Dzahn reopened this task as Open. · Jan 14 2019, 11:03 PM

Left to do:

  • get us sudo access for doc-publisher user: done
  • clean up puppet: todo

todo ?

Change 484317 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] doc: ferm, allow http connections from deployment hosts

https://gerrit.wikimedia.org/r/484317

Change 484317 merged by Dzahn:
[operations/puppet@production] doc: ferm, allow http connections from deployment hosts

https://gerrit.wikimedia.org/r/484317

Change 484321 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] contint: delete unused doc.wikimedia.org site config

https://gerrit.wikimedia.org/r/484321

Change 484321 merged by Dzahn:
[operations/puppet@production] contint: delete unused doc.wikimedia.org site config

https://gerrit.wikimedia.org/r/484321

Mentioned in SAL (#wikimedia-operations) [2019-01-15T21:23:38Z] <mutante> contint1001 rmdir /srv/org/wikimedia/integration/coverage ; rmdir /srv/org/wikimedia/integration/logs (T137890)

Mentioned in SAL (#wikimedia-releng) [2019-01-21T15:40:23Z] <hashar> contint1001: removing all generated doc/cover from /srv/org/wikimedia/doc | T137890

hashar added a comment. · Edited · Jan 21 2019, 3:48 PM

It is almost done. I just need files received from rsync to be group writable. That requires a configuration tweak in rsyncd.conf, achieved by the chain of patches:

And I have updated the documentation ( https://www.mediawiki.org/w/index.php?title=Continuous_integration/Documentation_generation&diff=3044742&oldid=2370952&diffmode=source )
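To illustrate the group-writable requirement mentioned above: rsyncd.conf has an "incoming chmod" parameter that takes chmod-style strings applied to received files. The exact value used in the patches is not shown in this task, so the "Dg+w,Fg+w" string named in the comment below is an assumption, as are the helper names. The sketch applies the equivalent permission change to an already-received tree and checks the result:

```shell
# Sketch only: add group write to an already-received tree, which is the
# effect an rsyncd "incoming chmod = Dg+w,Fg+w" style setting would have
# on incoming files. The mode string and helper names are assumptions.
make_group_writable() {
    chmod -R g+w -- "$1"
}

# Print the group-write bit ("w" or "-") of a path, read from `ls -ld`.
group_write_bit() {
    ls -ld -- "$1" | cut -c6
}

# Demo on a throwaway directory.
demo=$(mktemp -d)
touch "$demo/index.html"
chmod 644 "$demo/index.html"
make_group_writable "$demo"
group_write_bit "$demo/index.html"
rm -rf -- "$demo"
```

Setting the mode on the daemon side avoids depending on the umask of whichever client pushes the files.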

hashar closed this task as Resolved. · Mar 18 2019, 9:55 AM

This one is resolved. There are a few pending changes but nothing interesting or urgent for the service.

I ended up skipping integration-publisher02 and I have deleted it.

Mentioned in SAL (#wikimedia-releng) [2019-03-18T09:55:50Z] <hashar> deleting shutdowned instance integration-publisher02 , we do not use it anymore since doc publishing got overhauled ( T137890 ) # T218146
