⚓ T76308 Zotero not running in production

Status	Assigned	Task
Resolved	Mvolz	T62768 Cite: Auto-filled references based on destination URL / DOI etc.
Resolved	Jdforrester-WMF	T90281 Improve citoid production service
Resolved	Jdforrester-WMF	T89656 Get citoid working in production
Resolved	Mvolz	T89058 Diagnose cause of 500 error in citoid in production
Resolved	Ottomata	T89057 Give mvolz access to sha machine i.e. http://citoid.wikimedia.org/
Resolved	akosiaris	T89059 Figure out how to make Zotero more reliable
Resolved	akosiaris	T76308 Zotero not running in production
Resolved	akosiaris	T89874 Configure zotero to use an outbound proxy
Resolved	akosiaris	T89869 Assign hardware for the zotero service
Resolved	akosiaris	T89867 Puppetize zotero
Declined	akosiaris	T89866 Backport and using zotero-standalone for the zotero service
Resolved	• mobrovac	T89875 Configure citoid to use outbound proxy
Resolved	Mvolz	T90341 Let outbound proxy be configurable
Resolved	Mvolz	T89756 Use "WikimediaBot" as UA when making public requests
Resolved	Mvolz	T89757 Let the UserAgent be configurable
Resolved	Mvolz	T89968 Make requests.js an object and pass through citoid config

• GWicke created this task.Nov 30 2014, 10:28 PM

• GWicke raised the priority of this task from to Needs Triage.

• GWicke updated the task description. (Show Details)

• GWicke added a project: Citoid.

• GWicke changed Security from none to None.

• GWicke subscribed.

• GWicke updated the task description. (Show Details)Nov 30 2014, 10:33 PM

• GWicke added subscribers: Catrope, Mvolz.

Yup, that error means Zotero is not running.

And that is the extent to which I can help. Roan, LMK if you need anything else from me.

So, to start Zotero you do:

./translation-server/run_translation-server.sh &

If should give you output about where it is running.

If it's already running on that port it will probably complain at you.

There are two processes associated with Zotero running;

one is the .sh file, which doesn't need to be alive, the other is what the sh file executes

./xpcshell -v 180 -mn translation-server/init.js

(Also I updated the deploy repo today to a version that shouldn't fall over when Zotero isn't running.)

Jdforrester-WMF renamed this task from Server at http://localhost:1969/%s does not appear to be running. to Zotero not running in production.Jan 24 2015, 4:59 AM

Jdforrester-WMF triaged this task as Unbreak Now! priority.

Mvolz moved this task from Backlog to Production on the Citoid board.Feb 7 2015, 2:55 PM

James, Roan, is this resolved given T76949 is marked resolved?

Apparently not :). Okay.

Mvolz added a project: acl*sre-team.Feb 9 2015, 4:30 PM

Mvolz added a project: Services.Feb 9 2015, 4:54 PM

Hello,

I 've encountered the zotero not running issue during the Dev Summit and I have talked with @Catrope about it and how to solve it. The issue stems from:

a) The fact that the zotero citoid dependency was something that slipped alongside citoid during the initial phase of the citoid deployment. I should have caught it on one hand, on the other hand it should not happened anyway.
What I mean by this is that zotero, while being one of citoid's dependencies is a different service altogether and as such should have followed it's own path in production and not just tag along citoid. The result of this is that zotero is not running, is monitored/people alerted for it or is even known to pretty much anyone aside from me, @Catrope and @Mvolz

b) The deployment method of zotero which right now is distributing a set of shared object files

https://git.wikimedia.org/tree/mediawiki%2Fservices%2Fcitoid%2Fdeploy/a55c024283e6772c7bb3141214db1d6157fbd963/translation-server

yes, .so files including libssl.so.3 as well as others, via a git branch of the citoid software. This was (thankfully) bound to fail due to missing/incompatible libraries. Btw the installation/running instruction on https://github.com/zotero/translation-server are quite possibly some of the worst I 've ever seen for production use. And of course it has security repercussions (shipping a version libssl.so.3 ? So that when the next Heartbleed shows up it does not get patched ?)

I am working on solving that via providing a backport of zotero from Debian Jessie (https://packages.debian.org/jessie/zotero-standalone) to Ubuntu Trusty (before you ask, no there is no phab ticket I am still assessing what needs to be done)

c) Even if it worked, zotero would be unable to access the internet due to the private IPs of the servers it ran on. So no scraping of any form would happen. The same holds true for any kind of citoid (and not zotero) initiated outbound connections. They both need a proxy server

So,

Actionables (I will convert them into tickets):

Investigate zotero-standalone package and see if it is possible to use the package to provide the zotero service
Puppetize zotero as a service with documentation/configuration/monitoring/firewalling/backups that a service needs
Assign corresponding hardware
Assign a service IP
Assign a person responsible for maintaining the zotero service (a point person for ops to contact in case everything goes haywire)
Stop deploying zotero via the citoid/deploy branch cause .so files in deploy is bad, bad, bad
Deploy zotero via the debian package
Configure citoid to use the new zotero service
Configure both to use a forward proxy server for outbound connections

Moving back to High, since unbreak now is kind of impossible (especially the now! part)

akosiaris lowered the priority of this task from Unbreak Now! to High.Feb 9 2015, 5:35 PM

@akosiaris: Thanks for looking into a saner way to deploy zotero.

Lets not over-complicate things though: I think it's fine for zotero & citoid to share hardware and IP. The zotero service should only be used internally, so should never directly see connections that aren't from citoid. We can either bind it to localhost only or firewall it off. Both zotero & citoid are stateless services, so don't need backups. Both do however need to have access to the internet, as the main task is fetching metadata about a URL from external repositories.

• mobrovac subscribed.Feb 9 2015, 6:42 PM

• GWicke updated the task description. (Show Details)Feb 9 2015, 6:53 PM

• GWicke updated the task description. (Show Details)

In T76308#1025322, @akosiaris wrote:

b) The deployment method of zotero which right now is distributing a set of shared object files

https://git.wikimedia.org/tree/mediawiki%2Fservices%2Fcitoid%2Fdeploy/a55c024283e6772c7bb3141214db1d6157fbd963/translation-server

yes, .so files including libssl.so.3 as well as others, via a git branch of the citoid software. This was (thankfully) bound to fail due to missing/incompatible libraries. Btw the installation/running instruction on https://github.com/zotero/translation-server are quite possibly some of the worst I 've ever seen for production use. And of course it has security repercussions (shipping a version libssl.so.3 ? So that when the next Heartbleed shows up it does not get patched ?)

I take full blame for that. The reason I did this is because xulrunner isn't packaged in Ubuntu (but it used to be available, and will be available again in the future, or something weird like that) and because Zotero breaks unless you run it with a very specific version of xulrunner (29.0). So I just followed the installation instructions, which as Alexandros points out are horrible.

I didn't notice libssl was in there, that means I've done something much worse than I thought I was doing. Thanks for cleaning this up.

The proxy server thing was something I should totally have seen coming, my bad.

Some additional factors, which you probably have considered but just writing down here:

The service, translation-server, is not the same thing as zotero-standalone. Zotero[1] is a submodule of translation-server[2]. Zotero[1] is also a submodule of Zotero-standalone[3]. It is probably possible to point to the built version of zotero inside zotero-standalone from translation-server, instead of using the submodule; but then we will also have to consider CI issues here: i.e. the version of zotero built from zotero-standalone may differ from the version pointed to from translation-server.

Translation-server requires translators. These are not a submodule of translation-server. These are in their own separate repository[4]; they are "installed" by manually editing the .sh file of translation-server to point to the path, and then building translation-server (although it's a fairly trivial process and a different build script could be written). These are the most rapidly changing part of Zotero and need to be updated fairly regularly.

[1] https://github.com/zotero/zotero
[2] https://github.com/zotero/zotero-standalone-build/tree/master/modules
[3] https://github.com/zotero/translation-server/tree/master/modules
[4] https://github.com/zotero/translators

In T76308#1025555, @Catrope wrote:

I didn't notice libssl was in there, that means I've done something much worse than I thought I was doing. Thanks for cleaning this up.

While scary-sounding, I don't think old libssl actually matters much in a private service that's only locally accessed through citoid using plain http. There could be other security issues that are more relevant though, so it's definitely a good idea to use a supported deb if available.

More generally, we should improve our sandboxing at the OS and network level. Ideally, we'd run Zotero in its own container with a firewall that only allows communication from citoid & outgoing connections to the internet. In the current shared sca cluster where each process runs on bare metal, apparmor could perhaps be a good stepping stone in that direction.

Mvolz added a parent task: T89059: Figure out how to make Zotero more reliable.Feb 9 2015, 11:15 PM

Trying to answer to every comment in sequence, please bear with me on this.

@GWicke, yes, let's not overcomplicate things more. They are already complicated enough as is. The zotero service should have it's own service IP to allow for LVS High Availability and monitoring but otherwise they will share the same hardware (sca cluster). While firewalling will happen, binding it to localhost is not an option if we want some HA.

@Catrope, I am as much to blame for this. Don't fret too much, let's make this a lesson for the future rather than anything else. We can do better

@Mvolz. You are really shining a light and I appreciate it a lot. I 've been coming to that conclusion myself today and your comments really helped.

@GWicke. An old libssl matters in multiple ways. The most obvious one in this case is reverse heartbleed (incompatibilities, unfixed bugs, very very difficult debugging etc are other ways). Granted, zotero probably does not hold any important information itself but it may have access to unitialized memory from other (now dead) processes. While one can argue that is the other processes' fault and go into a bike-shed of who's fault it is the end result would be leaked memory from a bug that would have been patched in the rest of fleet. The next heartbleed might allow code execution, who knows? The containerization approach would not have helped in this case.

In T76308#1028376, @akosiaris wrote:

Trying to answer to every comment in sequence, please bear with me on this.

@GWicke, yes, let's not overcomplicate things more. They are already complicated enough as is. The zotero service should have it's own service IP to allow for LVS High Availability and monitoring but otherwise they will share the same hardware (sca cluster). While firewalling will happen, binding it to localhost is not an option if we want some HA.

Running one instance of zotero per citoid worker can actually be more HA than a simple LVS setup, as it can also detect hanging backends & restart them as needed. We have done this before with mathoid when it was using phantomjs as a backend. Phantomjs also had trouble executing requests in parallel, so we fed it one request at a time.

@GWicke. An old libssl matters in multiple ways. The most obvious one in this case is reverse heartbleed (incompatibilities, unfixed bugs, very very difficult debugging etc are other ways).

My point is that libssl isn't actually used at all in this setup.

Running one instance of zotero per citoid worker can actually be more HA than a simple LVS setup, as it can also detect hanging backends & restart them as needed. We have done this before with mathoid when it was using phantomjs as a backend. Phantomjs also had trouble executing requests in parallel, so we fed it one request at a time.

I am not in love that approach either.

My point is that libssl isn't actually used at all in this setup.

My point is that it is. When zotero is accessing HTTPS links.

Mvolz added a parent task: T62768: Cite: Auto-filled references based on destination URL / DOI etc..Feb 16 2015, 3:34 PM

Mvolz added a parent task: T89656: Get citoid working in production.Feb 16 2015, 3:37 PM

Mvolz removed a parent task: T62768: Cite: Auto-filled references based on destination URL / DOI etc..

Catrope added a project: Blocked-on-Operations.Feb 18 2015, 6:47 PM

Catrope added a project: Scrum-of-Scrums.

Catrope moved this task from Scheduled to blocked by Operations on the Scrum-of-Scrums board.

My proposal for the way forward is this:

in the short term (absent containers), contain citoid and the zotero xulrunner using
- a tight apparmor policy (no writes or execs), and
- a firewall that only allows connections outside the cluster for the citoid user (using [iptables --uid-owner](http://linuxpoison.blogspot.com/2010/11/how-to-limit-network-access-by-user.html)).

as soon as possible, get rid of xulrunner by figuring out a way to use the zotero scrapers directly in nodejs. Much of the framework logic it uses seems to be defined here.

Hello, so I 've finally got some time to work on this for this and next week (other priorities before that, I am afraid). I 've kind of already started putting various building blocks in place like https://gerrit.wikimedia.org/r/191385. I 'll create tasks for the various issues I 've identified above and start working on them.

• GWicke added a subscriber: • csteipp.Feb 18 2015, 7:57 PM

akosiaris updated the task description. (Show Details)Feb 18 2015, 8:11 PM

@GWicke well, if getting rid of xulrunner is possible (and merging zotero functionality into citoid, if I understand correctly what you are saying), it is making things way way way easier and many of the tasks above not needed.

Can we please not make this already-6-months-late project even later purely for technical architecture reasons?

Can we please not make this already-6-months-late project even later purely for technical architecture reasons?

+1 for being pragmatic in the short term. I share the concerns about xulrunner etc for the longer term, but also think that we can lock it down far enough to get the first iteration out of the door soon.

Jdforrester-WMF added a parent task: T90281: Improve citoid production service.Feb 20 2015, 9:28 PM

Note that xulrunner used to be packaged for Ubuntu, but it was dropped in the Oneiric release to make it easier to keep pace with Mozilla's rapid release process by focusing on essential packages (i.e., firefox): See https://blueprints.launchpad.net/ubuntu/+spec/desktop-o-mozilla-rapid-release-maintenance.

This does mean that there exists a high-quality (one would hope) Debianianization of xulrunner. With any luck it will be easy to forward-port it to Trusty. The last version that was packaged is http://packages.ubuntu.com/lucid/xulrunner-1.9.2.

Debian bug 362190 is relevant, too.

I used LD_DEBUG=files on citoid.wmflabs.org to see which libraries zotero depends on but does not bundle, and then dpkg to figure out which packages provide them. The missing dependencies are:

libasound2
libatk1.0-0
libc6
libcairo2
libdatrie1
libdbus-1-3
libexpat1
libfontconfig1
libgcc1
libgdk-pixbuf2.0-0
libglib2.0-0
libgraphite2-3
libgtk2.0-0
libharfbuzz0b
libice6
libnspr4
libnss3
libpango-1.0-0
libpangocairo-1.0-0
libpangoft2-1.0-0
libpcre3
libpixman-1-0
libselinux1
libsm6
libthai0
libuuid1
libxcb-render0
libxcb-shm0
libxcomposite1
libxcursor1
libxdamage1
libxfixes3
libxi6
libxinerama1
libxrandr2
libxrender1
libxt6
zlib1g

All of these are reverse-dependencies for 'firefox', so the simplest thing may be to just require_package('firefox')

Also worth considering, especially on jessie: https://packages.debian.org/sid/xulrunner-24.0

akosiaris added subtasks: T89875: Configure citoid to use outbound proxy, T89874: Configure zotero to use an outbound proxy, T89873: Configure citoid to use the new zotero service, T89872: Update the citoid/deploy branch to not contain zotero deploy, T89869: Assign hardware for the zotero service, T89867: Puppetize zotero, T89866: Backport and using zotero-standalone for the zotero service.Feb 24 2015, 2:00 PM

Mvolz removed subtasks: T89873: Configure citoid to use the new zotero service, T89875: Configure citoid to use outbound proxy.Feb 24 2015, 3:03 PM

@akosiaris Any news on the zotero front perhaps?

akosiaris closed subtask T89866: Backport and using zotero-standalone for the zotero service as Declined.Mar 4 2015, 5:40 PM

@mobrovac yes I have. So zotero seems to run OK under xulrunner (firefox will not do), some LD_LIBRARY_PATH, redefinition of the GRE directory, defaults in a different place and undoing some of the weirdness build.sh does.

Performance wise, I 've been doing some tests but I don't think you will like them. They're on my ADSL line so I will repeat them from a labs machine but the gist is, with a concurrency level of 2:

Percentage of the requests served within a certain time (ms)

With a concurrency level of 3:

Percentage of the requests served within a certain time (ms)

We 'll have to port xulrunner-24 to trusty. I am already on it

akosiaris updated the task description. (Show Details)Mar 4 2015, 5:57 PM

akosiaris updated the task description. (Show Details)

@akosiaris - ouch. Is there any control there/ did you test different times for the response from the server the url is being requested from? Response times from the server could be a pretty significant source of variability. If speed is problem with zot in particular we can try pulling away from it in certain cases. So for now, if we have DOI we try Zotero first, but we could prioritise crossref over Zotero, which usually has pretty good results for instance, to try to get the overall times down.

@Mvolz, so these are on my DSL line so take them with a grain of salt. Also they are for the exact same content so they may not be variable enough. I am working towards having a working zotero installation in labs today and then to repeat the tests in a more stable environment. I am hoping to disprove the first findings.

The request benchmark was:

ab -n 10 -c 3 -t 30 -p postfile -T 'application/json' http://127.0.0.1:1969/web

with postfile having the content:

{"url":"http://www.tandfonline.com/doi/abs/10.1080/15424060903167229","sessionid":"abc123"}

You might try varying the sessionid every request and see if that helps. (I have experienced some issues which I think may be due to having requests made for the same sessionid and same url concurrently). Also labs will be a lot faster, based on personal experience.

I wouldn't worry too much about external resources being slow at this point, as there's not much we can do apart from doing those requests only once & storing the result.

After puppetization ( https://gerrit.wikimedia.org/r/#/c/194495/ ) deployment-zotero01 in live in Beta and surving requests. My simple benchmark is faster for sure and more reliable.

Percentage of the requests served within a certain time (ms)

at the same time those are not the best numbers around....

Feel free to get a labs machine and hit it with

ab -n 10 -c 1 -t 30 -T 'application/json' -p postfile 'http://deployment-zotero01.eqiad.wmflabs:1969/web'

with the same postfile (or whatever other you feel like)

Is this resolved?

akosiaris closed subtask T89869: Assign hardware for the zotero service as Resolved.Mar 9 2015, 5:48 PM

akosiaris closed subtask T89867: Puppetize zotero as Resolved.

akosiaris closed subtask T89874: Configure zotero to use an outbound proxy as Resolved.Mar 9 2015, 5:51 PM

akosiaris updated the task description. (Show Details)

akosiaris removed a subtask: T89872: Update the citoid/deploy branch to not contain zotero deploy.Mar 9 2015, 5:59 PM

So, we got a running zotero service in production running on sca1001, sca1002 with puppetized with LVS, monitoring, pages. Resolving

akosiaris updated the task description. (Show Details)Mar 9 2015, 6:02 PM

@akosiaris Great! Thanks for your effort! Greatly appreciated

Jdforrester-WMF added a project: VisualEditor 2014/15 Q3 blockers.Mar 11 2015, 2:59 PM

Jdforrester-WMF moved this task from Nominated to Done on the VisualEditor 2014/15 Q3 blockers board.

Jdforrester-WMF added a project: VisualEditor pre-2015 work.Aug 5 2015, 12:58 AM

Restricted Application added a subscriber: Matanya. · View Herald TranscriptAug 5 2015, 12:58 AM

akosiaris mentioned this in T187194: zotero translation server: code stewardship request.Feb 13 2018, 2:05 PM

Zotero not running in production
Closed, ResolvedPublic
Actions

Description

Related Objects
Search...

Event Timeline

Zotero not running in productionClosed, ResolvedPublicActions

Description

Related ObjectsSearch...

Event Timeline

Zotero not running in production
Closed, ResolvedPublic
Actions

Related Objects
Search...