
Create jenkins slave instance dedicated to Android runs
Closed, ResolvedPublic

Description

We want to create a dedicated instance with a single executor to run the Android job. The job boots the Android emulator, which takes a full core, and runs a Java process to execute the tests.

The new labs instance should thus have at least 2 cores, and most probably a lot more to accommodate the heavy tests being run in parallel.

We would want a puppet class that includes the required packages and role::ci::slave::labs::common which adds Java / Jenkins key etc.

Once ready, the slave will need to be added to the Jenkins master via https://integration.wikimedia.org/ci/computer/

From there, the slave can be labelled with something like AndroidEmulator and we can get the job tied to that label so it will only run on this slave.

Event Timeline

hashar created this task. Jul 29 2015, 8:02 PM
hashar raised the priority of this task from to Normal.
hashar updated the task description.
hashar added subscribers: gerritbot, Sniedzielski, Dzahn and 9 others.
Restricted Application added a subscriber: Aklapper. Jul 29 2015, 8:02 PM

Thanks, @hashar! Since we will likely be enabling parallelization of our Gradle builds, moar cores = good! In particular, it'd be great to at least see 8 cores on the new server so we can at least match the performance of our laptops.

+ Release-Engineering-Team, we want to have this task dealt with by someone from the team instead of me :-D I have poked our internal mailing list about it. Potentially we will arrange some 1:1 pairing to have it done.

Thanks, @hashar! Since we will likely be enabling parallelization of our Gradle builds, moar cores = good! In particular, it'd be great to at least see 8 cores on the new server so we can at least match the performance of our laptops.

Do you really need 8 cores? That would consume resources from the shared labs infrastructure, and for a job that runs once in a while that might be overspending. But maybe idle cores can be reused by other instances; ops from labs would know for sure.

hashar set Security to None.

@hashar, thanks again! Happy to pair.

We can probably get by with fewer cores but we definitely want tests that run fast. In particular, we would like to extend the tests to run on a variety of emulated configurations in parallel (once we get the single configuration case working, of course). It seems reasonable to ask for a machine that's at least on par with a laptop to help us with something so important.

hashar updated the task description. Jul 29 2015, 8:20 PM

I rephrased the task to get more cores eventually. The integration labs project still has enough quota (https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=displayquotas&projectname=integration):

Cores: 58/80
RAM: 116736/204800
Floating IPs: 0/0
Instances: 22/29
Security Groups: 0/10

@thcipriani created https://integration.wikimedia.org/ci/computer/integration-slave-jessie-1003/ which comes with Jenkins labels: contintLabsSlave and AndroidEmulator.

Puppet is failing, though, due to T94836: Create CI slaves using Debian Jessie

I ran the job on the new slave at https://integration.wikimedia.org/ci/job/test-T62720-android-emulator/46/console

It fails with:

00:00:01.608 $ /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb start-server
00:00:01.614 FATAL: Cannot run program "/mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb": error=2, No such file or directory

From the command line:

jenkins-deploy@integration-slave-jessie-1003:~$ /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb
-su: /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb: No such file or directory

$ ldd  /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb 
	not a dynamic executable

$ file /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb
/mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.6.24, BuildID[sha1]=7af6e53b5decfaeb34a35c718afa27d8bd17fca2, not stripped

$ ls -l /lib/ld-linux.so.2
ls: cannot access /lib/ld-linux.so.2: No such file or directory

That is annoying, isn't it?

That is because the Puppet run is broken and fails to install a bunch of .deb dependencies from contint::packages::labs. From a mail I sent:


role::ci::slave::labs::light which includes:

role::ci::slave::labs::common (for Jenkins)
include authdns::lint (for gdnsd validation)
package { 'etcd': ensure => latest } (for joe)

What can be done is extract the Android related projects from the huge:

modules/contint/manifests/packages/labs.pp

Create a new class such as contint::packages::android

Then create a new role for android sdk slave:

class role::ci::slave::labs::android {
    include role::ci::slave::labs::common
    include contint::packages::android
}

There is also a bit to add jenkins-deploy user to the kvm group so the
emulator can access /dev/kvm .

This way we get a runnable puppet and package::labs is a bit lighter, ie
the other slaves do not come with Android SDK.
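The kvm group requirement mentioned in the mail can be checked by hand before starting the emulator. A minimal sketch, assuming the device node is /dev/kvm and the account is jenkins-deploy as described above; the helper names group_list_contains and can_use_kvm are hypothetical and not part of any Puppet module:

```shell
# Succeed when group name $2 appears in the space-separated list $1.
group_list_contains() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed when /dev/kvm exists and user $1 is a member of the kvm group.
# `id -nG` prints the names of all groups the user belongs to.
can_use_kvm() {
    [ -e /dev/kvm ] || return 1
    group_list_contains "$(id -nG "$1" 2>/dev/null)" kvm
}

# Example (hypothetical usage on the slave):
#   can_use_kvm jenkins-deploy && echo "kvm ok" || echo "kvm not usable"
```

Group membership is only one precondition; the permissions on /dev/kvm itself would still need to allow the group to open it.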


I'm not an expert but it kind of looks like we have the 32b tools instead of the 64b (preferred). If we want to run 32b tools (we don't), I think you need to install ia32-libs.
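The "No such file or directory" error on a binary that exists usually points at a missing ELF interpreter (loader), which is what the file and ls output earlier in this task shows for /lib/ld-linux.so.2. A minimal sketch of that check, assuming file(1) prints an "interpreter <path>" field as it does in the output above; interpreter_from_file_output and check_loader are hypothetical helper names:

```shell
# Print the ELF interpreter path found in one line of file(1) output.
# Prints nothing when the line has no "interpreter" field.
interpreter_from_file_output() {
    printf '%s\n' "$1" | sed -n 's/.*interpreter \([^,]*\).*/\1/p'
}

# Report whether the loader requested by binary $1 exists on this host.
check_loader() {
    loader=$(interpreter_from_file_output "$(file "$1")")
    if [ -z "$loader" ]; then
        echo "$1: no ELF interpreter reported"
    elif [ -e "$loader" ]; then
        echo "$1: loader $loader is present"
    else
        echo "$1: loader $loader is missing"
    fi
}

# Example (path taken from this task):
#   check_loader /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb
```

A missing 32-bit loader on a 64-bit host is consistent with the guess that the 32-bit SDK tools were installed without the 32-bit compatibility libraries.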

Moved this machine over to use Ubuntu Trusty while we're trying to resolve the issues with puppet roles and Jessie. The new machine is at https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1022/

Currently, the builds are having an issue where they abort with the error message in console:

[android] Emulator did not appear to start; giving up

Looking a little bit into this issue I found this ticket: https://issues.jenkins-ci.org/browse/JENKINS-14901
The proposed solution there is, essentially:

sudo mv /mnt/jenkins-workspace/tools/android-sdk/tools/emulator /mnt/jenkins-workspace/tools/android-sdk/tools/emulator.bak
sudo ln -s /mnt/jenkins-workspace/tools/android-sdk/tools/emulator-arm /mnt/jenkins-workspace/tools/android-sdk/tools/emulator

Which I tried and then ran this build to no avail: https://integration.wikimedia.org/ci/job/test-T62720-android-emulator/52/console

After fiddling with some android flags, I'm now back to this point: https://phabricator.wikimedia.org/T62720#1375536

Build 61 (https://integration.wikimedia.org/ci/job/test-T62720-android-emulator/61/console) is just repeating:

22:31:42 $ /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb connect localhost:9478
22:31:56 $ /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb -s localhost:9478 shell getprop init.svc.bootanim
22:31:56 error: device offline

netstat -tlnp shows:

tcp        0      0 127.0.0.1:9477          0.0.0.0:*               LISTEN      15221/emulator64-x8
tcp        0      0 127.0.0.1:9478          0.0.0.0:*               LISTEN      15221/emulator64-x8

and nc -vzw1 127.0.0.1 9478 seems to indicate I can connect to these ports via tcp:

thcipriani@integration-slave-trusty-1022:~$ nc -vzw 1 127.0.0.1 9478
Connection to 127.0.0.1 9478 port [tcp/*] succeeded!
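The manual netstat inspection above can be reduced to a small filter. A minimal sketch, assuming the column layout of `netstat -tlnp` shown above (state in column 6, PID/program in column 7); emulator_listen_ports is a hypothetical helper name:

```shell
# Read `netstat -tlnp` output on stdin and print the local port of every
# listening socket whose owning program name contains "emulator".
emulator_listen_ports() {
    awk '$6 == "LISTEN" && $7 ~ /emulator/ { n = split($4, part, ":"); print part[n] }'
}

# Example (hypothetical usage; -p needs root to show other users' programs):
#   sudo netstat -tlnp | emulator_listen_ports
```

A listening port only shows that the emulator process has bound the socket; it does not say whether adb considers the device online.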

This may explain why the android emulator plugin is failing with a loop trying to detect if the emulator is booted when it is seemingly booted. The device the android emulator keeps trying to use is localhost:[port] which fails on the command line with error: device 'localhost:[port]' not found; however, using emulator-[port] as the device seems to work:

thcipriani@integration-slave-trusty-1022:~$ sudo netstat -tlnp | grep -i 6787
tcp        0      0 127.0.0.1:6787          0.0.0.0:*               LISTEN      15607/emulator64-x8
thcipriani@integration-slave-trusty-1022:~$ /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb -s localhost:6787 shell getprop dev.bootcomplete
error: device 'localhost:6787' not found
thcipriani@integration-slave-trusty-1022:~$ /mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb -s emulator-6787 shell getprop dev.bootcomplete
1
thcipriani@integration-slave-trusty-1022:~$ echo $?
0
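The difference between the two device names above suggests polling with the emulator-[port] serial rather than localhost:[port]. A minimal sketch, assuming the adb path used in this task and that dev.bootcomplete reports 1 once the device has booted, as in the session above; wait_for_boot, emulator_serial, and the ADB override variable are hypothetical names, and this is not how the Android Emulator Plugin itself is implemented:

```shell
# Path to adb; override by setting ADB before calling the functions.
ADB=${ADB:-/mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb}

# Build the emulator-style serial for port $1, e.g. 6787 -> emulator-6787.
emulator_serial() {
    printf 'emulator-%s\n' "$1"
}

# Poll device $1 until dev.bootcomplete is 1.
# $2 = maximum attempts (default 60), $3 = seconds between attempts (default 5).
wait_for_boot() {
    serial=$1
    attempts=${2:-60}
    delay=${3:-5}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Strip carriage returns, which some adb versions add to shell output.
        state=$("$ADB" -s "$serial" shell getprop dev.bootcomplete 2>/dev/null | tr -d '\r') || state=""
        if [ "$state" = "1" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (port taken from the session above):
#   wait_for_boot "$(emulator_serial 6787)" && echo "booted"
```

Bounding the number of attempts avoids the endless detection loop seen in build 61.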

@thcipriani, thanks for picking this up!

sudo mv /mnt/jenkins-workspace/tools/android-sdk/tools/emulator /mnt/jenkins-workspace/tools/android-sdk/tools/emulator.bak
sudo ln -s /mnt/jenkins-workspace/tools/android-sdk/tools/emulator-arm /mnt/jenkins-workspace/tools/android-sdk/tools/emulator

We definitely don't want to do this. This will drop support for other ABIs like x86.

These commands work when logged in as jenkins-deploy:

/mnt/jenkins-workspace/tools/android-sdk/tools/android create avd -f -a -s WXGA720 -n test -t android-22 --abi x86_64
/mnt/jenkins-workspace/tools/android-sdk/tools/emulator -ports 9477,9478 -prop persist.sys.language=en -prop persist.sys.country=US -avd test -no-snapshot-load -no-snapshot-save -wipe-data -no-window -noaudio
/mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb shell

When I run a Jenkins build, I can see the following command fail:

/mnt/jenkins-workspace/tools/android-sdk/platform-tools/adb connect localhost:8322

However, the same command succeeds _during_ the build when logged in over ssh as jenkins-deploy. I can even run adb shell pretty quickly. At least, on one iteration. On a subsequent iteration, I couldn't connect for about five minutes. The server load seems really in flux. Sometimes I'll be watching the log and suddenly get a WMF error / unavailable page. I'm wondering if these issues are due to performance issues. Can we get some more dedicated cycles?

Change 230004 had a related patch set uploaded (by Niedzielski):
Set generous CI test device timeout

https://gerrit.wikimedia.org/r/230004

I threw everything we had at it: 8 cores and 16GB of ram (integration-slave-trusty-1023)

Still ended up at failure, but it was a marked improvement—speed-wise: https://integration.wikimedia.org/ci/job/test-T62720-android-emulator/72/console

@thcipriani, thanks for trying that out! It's a frustrating result because that's actually what I'm running on my laptop (2014 MacBook Pro with Linux Mint). I agree on the performance though! For whatever reason, I think the Gradle cache got cleared on your build so it's even more dramatic. Here are two comparable builds before[0] and after[1] the hardware bump. Note: I have a couple patches coming to fix the test failures[2] and increase the timeout duration[3] I hacked in for builds in [0] and [1].

[0] 4:38, https://integration.wikimedia.org/ci/job/test-T62720-android-emulator/69/console
[1] 3:06, https://integration.wikimedia.org/ci/job/test-T62720-android-emulator/74/console
[2] https://gerrit.wikimedia.org/r/#/c/229993/
[3] https://gerrit.wikimedia.org/r/#/c/230004/
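For reference, the two wall times in [0] and [1] work out to 278 s and 186 s, a reduction of about 33%. A minimal sketch of that arithmetic, assuming durations written as M:SS; to_seconds and percent_reduction are hypothetical helper names:

```shell
# Convert a duration written as M:SS into seconds.
to_seconds() {
    minutes=${1%%:*}
    seconds=${1##*:}
    # Drop one leading zero so values such as "08" are not read as octal.
    minutes=${minutes#0}
    seconds=${seconds#0}
    echo $(( ${minutes:-0} * 60 + ${seconds:-0} ))
}

# Print the whole-percent reduction in wall time from $1 to $2.
# Integer division truncates the result; $1 must not be 0:00.
percent_reduction() {
    before=$(to_seconds "$1")
    after=$(to_seconds "$2")
    echo $(( (before - after) * 100 / before ))
}

# Example with the build times quoted above:
#   percent_reduction 4:38 3:06
```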

@thcipriani, any chance we keep those extra cores and memories? I'd love to keep this as close to a laptop in terms of performance as possible, if not better.

Change 230004 merged by jenkins-bot:
Set generous CI test device timeout

https://gerrit.wikimedia.org/r/230004

@thcipriani, one more question. It looks like making a Zuul job out of this requires jenkins-job-builder support for the Android Emulator Plugin. There is a patch on OpenStack[0] that was nearly merged but now seems untended. The changes suggested by reviewers seem pretty straightforward. Do I need to see this patch gets merged on OpenStack or can I make a similar patch on our Wikimedia fork?

[0] https://review.openstack.org/#/c/146990

@thcipriani, I've created a draft patch[0] that is pending on the aforementioned Android Emulator support.

[0] https://gerrit.wikimedia.org/r/#/c/230260/

@thcipriani / @greg any guidance on my last few comments? I'm particularly curious about the cores / mem and what the next steps are for the JJB Android patch.

@thcipriani, any chance we keep those extra cores and memories? I'd love to keep this as close to a laptop in terms of performance as possible, if not better.

For now the quota on the integration project looks ok, so I'll leave that instance setup and kill the m1.large I previously setup.

I'll try to poke the upstream jjb patch a bit, make the suggested changes maybe. If all else fails, we've got our own fork (as you know).

Change 231722 had a related patch set uploaded (by Niedzielski):
Add Android emulator wrapper

https://gerrit.wikimedia.org/r/231722

@thcipriani, we're excited to get our tests running. Any progress upstream, or should we pursue a local patch[0]?

[0] https://gerrit.wikimedia.org/r/231722

Change 231722 abandoned by Hashar:
Add Android emulator wrapper

Reason:
Thanks Stephen! Our integration/jenkins-job-builder.git repo is merely a snapshot of the upstream repo, not a fork. We get patches sent, reviewed and merged upstream before using them.

I am a core developer of JJB and there is a healthy reviewing community upstream, so it is usually quite fast.

Will now review and potentially approve the patch upstream ( https://review.openstack.org/#/c/146990/ ). If it lands, I will update our JJB snapshot.

https://gerrit.wikimedia.org/r/231722

I have forked JJB support to sub task T110307

Change 233923 had a related patch set uploaded (by Hashar):
Create apps-android-wikipedia-emulator

https://gerrit.wikimedia.org/r/233923

hashar added a comment. Sep 3 2015, 1:40 PM

The Jenkins job apps-android-wikipedia-gradlew is tied to Jenkins label contintLabsSlave && AndroidEmulator which is applied on integration-slave-trusty-1023.

@Niedzielski and @thcipriani I think this task is completed, isn't it?

@hashar, I think so! I will move these to done on the sprint board. BTW, there's one related patch I just rebased that could use your review when you have time https://gerrit.wikimedia.org/r/#/c/231624/

hashar closed this task as Resolved. Sep 4 2015, 10:25 AM
hashar claimed this task.

@Niedzielski thanks for the confirmation and congratulations on the JJB work!
@thcipriani Kudos on the slave setup!

The Jenkins job apps-android-wikipedia-gradlew is tied to Jenkins label contintLabsSlave && AndroidEmulator which is applied on Jenkins slave integration-slave-trusty-1023.

Nothing left to do hence resolving this task.