
Zotero not running in production
Closed, ResolvedPublic

Description

When testing against sca1001, I get the following error:

curl -XPOST -d'format=mediawiki' -d'url=http://link.springer.com/chapter/10.1007/11926078_68' http://sca1001:1970/url
"Internal server error"

The same is true with JSON:

curl -d '{"url":"http://link.springer.com/chapter/10.100":"mediawiki"}' --header "Content-Type: application/json" http://sca1001:1970/url

In /var/log/citoid/main.log, I see

Request made for: http://link.springer.com/chapter/10.1007/11926078_68
Server at http://localhost:1969/%s does not appear to be running.

It looks as if zotero is not actually running. Might be a dependency issue in zotero (from /var/log/citoid/zotero.log):

./xpcshell: error while loading shared libraries: libXrender.so.1: cannot open shared object file: No such file or directory
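
The loader stops at the first library it cannot resolve, so the log shows only one missing file at a time. A minimal sketch for listing all unresolved libraries in one pass, assuming `ldd` is available on the host and is run from the directory that holds xpcshell (the helper name `missing_libs` is made up for illustration):

```shell
#!/bin/sh
# Print the names of shared libraries that ldd reports as unresolved.
# Reads ldd output on stdin, so it can be exercised without the real binary.
missing_libs() {
    # Unresolved entries look like: "libXrender.so.1 => not found"
    awk '/=> not found/ { print $1 }' | sort -u
}

# Usage (the path is an assumption):
#   ldd ./xpcshell | missing_libs
```

Any name this prints still has to be mapped to a package by hand or with a package query tool.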

Moved the actionables here from my comment below to have them at an easy to view place:

  • Puppetize zotero as a service with documentation/configuration/monitoring/firewalling/backups that a service needs T89867
  • Assign corresponding hardware T89869
  • Assign a service IP T89870
  • Configure zotero to use a forward proxy server for outbound connections T89874

Event Timeline

GWicke updated the task description.
GWicke raised the priority of this task from to Needs Triage.
GWicke added a project: Citoid.
GWicke changed Security from none to None.
GWicke added a subscriber: GWicke.
GWicke updated the task description. Nov 30 2014, 10:33 PM
GWicke added subscribers: Catrope, Mvolz.

Yup, that error means Zotero is not running.

And that is the extent to which I can help. Roan, LMK if you need anything else from me.

Mvolz added a comment.Dec 1 2014, 11:53 PM

So, to start Zotero you do:

./translation-server/run_translation-server.sh &

It should give you output about where it is running.

If it's already running on that port it will probably complain at you.

There are two processes associated with a running Zotero: one is the .sh file, which doesn't need to stay alive; the other is what the .sh file executes:

./xpcshell -v 180 -mn translation-server/init.js

(Also I updated the deploy repo today to a version that shouldn't fall over when Zotero isn't running.)

Jdforrester-WMF renamed this task from Server at http://localhost:1969/%s does not appear to be running. to Zotero not running in production.Jan 24 2015, 4:59 AM
Jdforrester-WMF triaged this task as Unbreak Now! priority.
Mvolz moved this task from Backlog to Production on the Citoid board.Feb 7 2015, 2:55 PM

James, Roan, is this resolved given T76949 is marked resolved?

Mvolz added a comment.Feb 9 2015, 4:30 PM

Apparently not :). Okay.

Hello,

I've encountered the zotero not running issue during the Dev Summit and I have talked with @Catrope about it and how to solve it. The issue stems from:

a) The fact that the zotero citoid dependency was something that slipped in alongside citoid during the initial phase of the citoid deployment. On one hand I should have caught it; on the other hand it should not have happened anyway.
What I mean by this is that zotero, while being one of citoid's dependencies, is a different service altogether and as such should have followed its own path into production and not just tagged along with citoid. The result of this is that zotero is not running, is not monitored, nobody is alerted about it, and it is not even known to pretty much anyone aside from me, @Catrope and @Mvolz.

b) The deployment method of zotero which right now is distributing a set of shared object files

https://git.wikimedia.org/tree/mediawiki%2Fservices%2Fcitoid%2Fdeploy/a55c024283e6772c7bb3141214db1d6157fbd963/translation-server

yes, .so files including libssl.so.3 as well as others, via a git branch of the citoid software. This was (thankfully) bound to fail due to missing/incompatible libraries. Btw the installation/running instructions on https://github.com/zotero/translation-server are quite possibly some of the worst I've ever seen for production use. And of course it has security repercussions (shipping a version of libssl.so.3? So that when the next Heartbleed shows up it does not get patched?)

I am working on solving that by providing a backport of zotero from Debian Jessie (https://packages.debian.org/jessie/zotero-standalone) to Ubuntu Trusty (before you ask: no, there is no phab ticket yet, I am still assessing what needs to be done).

c) Even if it worked, zotero would be unable to access the internet due to the private IPs of the servers it ran on. So no scraping of any form would happen. The same holds true for any kind of citoid (and not zotero) initiated outbound connections. They both need a proxy server

So,

Actionables (I will convert them into tickets):

  • Investigate zotero-standalone package and see if it is possible to use the package to provide the zotero service
  • Puppetize zotero as a service with documentation/configuration/monitoring/firewalling/backups that a service needs
  • Assign corresponding hardware
  • Assign a service IP
  • Assign a person responsible for maintaining the zotero service (a point person for ops to contact in case everything goes haywire)
  • Stop deploying zotero via the citoid/deploy branch cause .so files in deploy is bad, bad, bad
  • Deploy zotero via the debian package
  • Configure citoid to use the new zotero service
  • Configure both to use a forward proxy server for outbound connections

Moving back to High, since unbreak now is kind of impossible (especially the now! part)

akosiaris lowered the priority of this task from Unbreak Now! to High.Feb 9 2015, 5:35 PM
GWicke added a comment.EditedFeb 9 2015, 6:41 PM

@akosiaris: Thanks for looking into a saner way to deploy zotero.

Let's not over-complicate things though: I think it's fine for zotero & citoid to share hardware and IP. The zotero service should only be used internally, so should never directly see connections that aren't from citoid. We can either bind it to localhost only or firewall it off. Both zotero & citoid are stateless services, so don't need backups. Both do however need to have access to the internet, as the main task is fetching metadata about a URL from external repositories.

GWicke updated the task description. Feb 9 2015, 6:53 PM
GWicke updated the task description.

b) The deployment method of zotero which right now is distributing a set of shared object files

https://git.wikimedia.org/tree/mediawiki%2Fservices%2Fcitoid%2Fdeploy/a55c024283e6772c7bb3141214db1d6157fbd963/translation-server

yes, .so files including libssl.so.3 as well as others, via a git branch of the citoid software. This was (thankfully) bound to fail due to missing/incompatible libraries. Btw the installation/running instructions on https://github.com/zotero/translation-server are quite possibly some of the worst I've ever seen for production use. And of course it has security repercussions (shipping a version of libssl.so.3? So that when the next Heartbleed shows up it does not get patched?)

I take full blame for that. The reason I did this is because xulrunner isn't packaged in Ubuntu (but it used to be available, and will be available again in the future, or something weird like that) and because Zotero breaks unless you run it with a very specific version of xulrunner (29.0). So I just followed the installation instructions, which as Alexandros points out are horrible.

I didn't notice libssl was in there, that means I've done something much worse than I thought I was doing. Thanks for cleaning this up.

The proxy server thing was something I should totally have seen coming, my bad.

Mvolz added a comment.Feb 9 2015, 6:59 PM

Some additional factors, which you probably have considered but just writing down here:

The service, translation-server, is not the same thing as zotero-standalone. Zotero[1] is a submodule of translation-server[2]. Zotero[1] is also a submodule of Zotero-standalone[3]. It is probably possible to point to the built version of zotero inside zotero-standalone from translation-server, instead of using the submodule; but then we will also have to consider CI issues here: i.e. the version of zotero built from zotero-standalone may differ from the version pointed to from translation-server.

Translation-server requires translators. These are not a submodule of translation-server. These are in their own separate repository[4]; they are "installed" by manually editing the .sh file of translation-server to point to the path, and then building translation-server (although it's a fairly trivial process and a different build script could be written). These are the most rapidly changing part of Zotero and need to be updated fairly regularly.

[1] https://github.com/zotero/zotero
[2] https://github.com/zotero/translation-server/tree/master/modules
[3] https://github.com/zotero/zotero-standalone-build/tree/master/modules
[4] https://github.com/zotero/translators

I didn't notice libssl was in there, that means I've done something much worse than I thought I was doing. Thanks for cleaning this up.

While scary-sounding, I don't think old libssl actually matters much in a private service that's only locally accessed through citoid using plain http. There could be other security issues that are more relevant though, so it's definitely a good idea to use a supported deb if available.

More generally, we should improve our sandboxing at the OS and network level. Ideally, we'd run Zotero in its own container with a firewall that only allows communication from citoid & outgoing connections to the internet. In the current shared sca cluster where each process runs on bare metal, apparmor could perhaps be a good stepping stone in that direction.

Trying to answer to every comment in sequence, please bear with me on this.

@GWicke, yes, let's not overcomplicate things more. They are already complicated enough as is. The zotero service should have its own service IP to allow for LVS High Availability and monitoring but otherwise they will share the same hardware (sca cluster). While firewalling will happen, binding it to localhost is not an option if we want some HA.

@Catrope, I am as much to blame for this. Don't fret too much, let's make this a lesson for the future rather than anything else. We can do better

@Mvolz. You are really shining a light and I appreciate it a lot. I 've been coming to that conclusion myself today and your comments really helped.

@GWicke. An old libssl matters in multiple ways. The most obvious one in this case is reverse heartbleed (incompatibilities, unfixed bugs, very very difficult debugging etc are other ways). Granted, zotero probably does not hold any important information itself but it may have access to uninitialized memory from other (now dead) processes. While one can argue that is the other processes' fault and go into a bike-shed over whose fault it is, the end result would be leaked memory from a bug that would have been patched in the rest of the fleet. The next heartbleed might allow code execution, who knows? The containerization approach would not have helped in this case.

GWicke added a comment.EditedFeb 10 2015, 6:03 PM

Trying to answer to every comment in sequence, please bear with me on this.

@GWicke, yes, let's not overcomplicate things more. They are already complicated enough as is. The zotero service should have its own service IP to allow for LVS High Availability and monitoring but otherwise they will share the same hardware (sca cluster). While firewalling will happen, binding it to localhost is not an option if we want some HA.

Running one instance of zotero per citoid worker can actually be more HA than a simple LVS setup, as it can also detect hanging backends & restart them as needed. We have done this before with mathoid when it was using phantomjs as a backend. Phantomjs also had trouble executing requests in parallel, so we fed it one request at a time.

@GWicke. An old libssl matters in multiple ways. The most obvious one in this case is reverse heartbleed (incompatibilities, unfixed bugs, very very difficult debugging etc are other ways).

My point is that libssl isn't actually used at all in this setup.

Running one instance of zotero per citoid worker can actually be more HA than a simple LVS setup, as it can also detect hanging backends & restart them as needed. We have done this before with mathoid when it was using phantomjs as a backend. Phantomjs also had trouble executing requests in parallel, so we fed it one request at a time.

I am not in love with that approach either.

My point is that libssl isn't actually used at all in this setup.

My point is that it is. When zotero is accessing HTTPS links.

GWicke added a comment.EditedFeb 18 2015, 7:50 PM

My proposal for the way forward is this:

  • in the short term (absent containers), contain citoid and the zotero xulrunner using
    • a tight apparmor policy (no writes or execs), and
    • a firewall that only allows connections outside the cluster for the citoid user (using iptables --uid-owner).
  • as soon as possible, get rid of xulrunner by figuring out a way to use the zotero scrapers directly in nodejs. Much of the framework logic it uses seems to be defined here.
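
A rough sketch of what the per-user firewall from the proposal above could look like, using the iptables `owner` match. It only prints the rules rather than applying them. The cluster CIDR is an assumption (the thread does not give one), and `print_egress_rules` is a made-up helper name; a real ruleset would need review against the existing firewall configuration.

```shell
#!/bin/sh
# Print (not apply) iptables rules that let only one local user open
# connections to hosts outside the cluster network.
# The CIDR argument is an assumption: the thread names the citoid user
# but does not state the cluster's address range.
print_egress_rules() {
    user=$1
    cluster_cidr=$2
    # Loopback traffic (e.g. citoid talking to a local zotero) is untouched.
    echo "iptables -A OUTPUT -o lo -j ACCEPT"
    # Traffic that stays inside the cluster is left alone.
    echo "iptables -A OUTPUT -d $cluster_cidr -j ACCEPT"
    # Packets on connections that already exist are allowed.
    echo "iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    # New outbound connections are allowed only for the named user.
    echo "iptables -A OUTPUT -m owner --uid-owner $user -j ACCEPT"
    # Everything else leaving the host is rejected.
    echo "iptables -A OUTPUT -j REJECT"
}

print_egress_rules citoid 10.0.0.0/8
```

The `--uid-owner` match only works in the OUTPUT and POSTROUTING chains, since it relies on the local socket that created the packet.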

Hello, so I've finally got some time to work on this during this week and the next (other priorities before that, I am afraid). I've kind of already started putting various building blocks in place like https://gerrit.wikimedia.org/r/191385. I'll create tasks for the various issues I've identified above and start working on them.

akosiaris updated the task description. Feb 18 2015, 8:11 PM

@GWicke well, if getting rid of xulrunner is possible (and merging zotero functionality into citoid, if I understand correctly what you are saying), it would make things way, way, way easier and render many of the tasks above unnecessary.

Can we please not make this already-6-months-late project even later purely for technical architecture reasons?

Can we please not make this already-6-months-late project even later purely for technical architecture reasons?

+1 for being pragmatic in the short term. I share the concerns about xulrunner etc for the longer term, but also think that we can lock it down far enough to get the first iteration out of the door soon.

ori added a subscriber: ori.Feb 20 2015, 9:28 PM

Note that xulrunner used to be packaged for Ubuntu, but it was dropped in the Oneiric release to make it easier to keep pace with Mozilla's rapid release process by focusing on essential packages (i.e., firefox): See https://blueprints.launchpad.net/ubuntu/+spec/desktop-o-mozilla-rapid-release-maintenance.

This does mean that there exists a high-quality (one would hope) Debianization of xulrunner. With any luck it will be easy to forward-port it to Trusty. The last version that was packaged is http://packages.ubuntu.com/lucid/xulrunner-1.9.2.

Debian bug 362190 is relevant, too.

ori added a comment.Feb 20 2015, 11:06 PM

I used LD_DEBUG=files on citoid.wmflabs.org to see which libraries zotero depends on but does not bundle, and then dpkg to figure out which packages provide them. The missing dependencies are:

  • libasound2
  • libatk1.0-0
  • libc6
  • libcairo2
  • libdatrie1
  • libdbus-1-3
  • libexpat1
  • libfontconfig1
  • libgcc1
  • libgdk-pixbuf2.0-0
  • libglib2.0-0
  • libgraphite2-3
  • libgtk2.0-0
  • libharfbuzz0b
  • libice6
  • libnspr4
  • libnss3
  • libpango-1.0-0
  • libpangocairo-1.0-0
  • libpangoft2-1.0-0
  • libpcre3
  • libpixman-1-0
  • libselinux1
  • libsm6
  • libthai0
  • libuuid1
  • libxcb-render0
  • libxcb-shm0
  • libxcomposite1
  • libxcursor1
  • libxdamage1
  • libxfixes3
  • libxi6
  • libxinerama1
  • libxrandr2
  • libxrender1
  • libxt6
  • zlib1g

All of these are reverse-dependencies for 'firefox', so the simplest thing may be to just require_package('firefox')
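
For the dpkg half of the lookup described above, a minimal sketch that reduces `dpkg -S` output to a list of unique package names. The helper name `owning_packages` and the input file name are hypothetical, and the sketch ignores edge cases such as files owned by several packages.

```shell
#!/bin/sh
# Reduce `dpkg -S` output to a sorted list of unique package names.
# Input lines look like "libxrender1:amd64: /usr/lib/.../libXrender.so.1".
owning_packages() {
    # The package name is everything before the first colon.
    cut -d: -f1 | sort -u
}

# Usage, assuming missing.txt holds one library file name per line
# (the file name is hypothetical):
#   while read -r lib; do dpkg -S "$lib"; done < missing.txt | owning_packages
```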

Also worth considering, especially on jessie: https://packages.debian.org/sid/xulrunner-24.0

@akosiaris Any news on the zotero front perhaps?

@mobrovac yes I have. So zotero seems to run OK under xulrunner (firefox will not do), given some LD_LIBRARY_PATH settings, a redefinition of the GRE directory, defaults in a different place and undoing some of the weirdness build.sh does.

Performance-wise, I've been doing some tests but I don't think you will like them. They're on my ADSL line so I will repeat them from a labs machine, but the gist is, with a concurrency level of 2:

Percentage of the requests served within a certain time (ms)

50%   3609
66%   3751
75%   4168
80%   5135
90%  30003

With a concurrency level of 3:

Percentage of the requests served within a certain time (ms)

50%   2231
66%   2646
75%   2874
80%   3238
90%   4677
95%   9304
98%  14915
99%  14915

We'll have to port xulrunner-24 to trusty. I am already on it.

akosiaris updated the task description. Mar 4 2015, 5:57 PM
akosiaris updated the task description.
Mvolz added a comment.Mar 5 2015, 3:12 PM

@akosiaris - ouch. Is there any control there / did you test different times for the response from the server the url is being requested from? Response times from the server could be a pretty significant source of variability. If speed is a problem with Zotero in particular we can try pulling away from it in certain cases. For now, if we have a DOI we try Zotero first, but we could prioritise crossref, which usually has pretty good results, over Zotero to try to get the overall times down.

@Mvolz, so these are on my DSL line so take them with a grain of salt. Also they are for the exact same content so they may not be variable enough. I am working towards having a working zotero installation in labs today and then to repeat the tests in a more stable environment. I am hoping to disprove the first findings.

The request benchmark was:

ab -n 10 -c 3 -t 30 -p postfile -T 'application/json' http://127.0.0.1:1969/web

with postfile having the content:

{"url":"http://www.tandfonline.com/doi/abs/10.1080/15424060903167229","sessionid":"abc123"}

Mvolz added a comment.Mar 5 2015, 3:54 PM

You might try varying the sessionid every request and see if that helps. (I have experienced some issues which I think may be due to having requests made for the same sessionid and same url concurrently). Also labs will be a lot faster, based on personal experience.
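
One way to vary the sessionid per request, as a sketch: `ab` posts the same file every time, so this uses a plain loop around curl instead. It is sequential (concurrency 1), the `bench-<pid>-<n>` session id format is an assumption, and the helper names are made up for illustration.

```shell
#!/bin/sh
# Build the JSON body for one request to the translation server's /web
# endpoint, with a caller-supplied session id.
make_payload() {
    printf '{"url":"%s","sessionid":"%s"}\n' "$1" "$2"
}

# Send n sequential requests, each with a distinct session id, and print
# curl's total time in seconds for each one.
bench() {
    endpoint=$1
    url=$2
    n=$3
    i=1
    while [ "$i" -le "$n" ]; do
        # -d @- reads the request body from stdin.
        make_payload "$url" "bench-$$-$i" |
            curl -s -o /dev/null -w '%{time_total}\n' \
                 -H 'Content-Type: application/json' -d @- "$endpoint"
        i=$((i + 1))
    done
}

# Usage:
#   bench http://127.0.0.1:1969/web \
#       http://www.tandfonline.com/doi/abs/10.1080/15424060903167229 10
```

Because the requests run one after another, the timings are not directly comparable with the `ab -c 3` figures.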

I wouldn't worry too much about external resources being slow at this point, as there's not much we can do apart from doing those requests only once & storing the result.

After puppetization ( https://gerrit.wikimedia.org/r/#/c/194495/ ) deployment-zotero01 is live in Beta and serving requests. My simple benchmark is faster for sure and more reliable.

Percentage of the requests served within a certain time (ms)

50%   1538
66%   1847
75%   1992
80%   2168
90%   2763
95%   2945

at the same time those are not the best numbers around....

Feel free to get a labs machine and hit it with

ab -n 10 -c 1 -t 30 -T 'application/json' -p postfile 'http://deployment-zotero01.eqiad.wmflabs:1969/web'

with the same postfile (or whatever other you feel like)

Mvolz added a comment.Mar 9 2015, 1:59 PM

Is this resolved?

akosiaris closed this task as Resolved.Mar 9 2015, 6:02 PM
akosiaris claimed this task.

So, we have got a zotero service running in production on sca1001 and sca1002, puppetized, with LVS, monitoring and paging. Resolving.

akosiaris updated the task description. Mar 9 2015, 6:02 PM

@akosiaris Great! Thanks for your effort! Greatly appreciated

Restricted Application added a subscriber: Matanya. Aug 5 2015, 12:58 AM