OCG needs to migrate away from rdb1002 and get its own Redis instance
Closed, Resolved · Public

Description

OCG uses rdb1002 (a slave Job Queue Redis instance) and needs to be migrated away from it.

Event Timeline

Please ensure you remove the inappropriate project tags from the subtask next time. Thank you.

FWIW: I think "Patch-For-Review" should be removed globally, because it causes more extra work than it does good, and "Tracking" could maybe be replaced by a workboard. I would like to create subtasks without always hitting this issue with the copied tags.

Re Patch-For-Review: I agree. I would create a rule for that, but I don't have enough rights.

Re Tracking-Neverending: I'm a bit confused. E.g., in {T123525#2032039} you are against turning tracking bugs into projects, which would obviously solve this issue...

Fair. I think what I actually want is to keep using tracking bugs and subtasks; the technical issue is that Phabricator always copies the tags to a subtask, and while that makes sense for most tags that are about a certain topic or product, it does not make sense for this special "tracking" tag. What is the higher goal we want to achieve by adding the tracking tag? On the overview page I see the definition "a report that will never be "fixed"". That is not the case for many of our "tracking" bugs: often they are resolved and closed once all subtasks are resolved, for example "upgrade all these servers to jessie".

Please move off-topic "how to (not) use patch-for-review" discussions to T95309, T104413, T61831, or whatever. (Likely same for "tracking".) Thanks.

I read some documentation:

https://wikitech.wikimedia.org/wiki/OCG#Monitoring

The job queue related to rendering is empty, while the job status data kept for caching is not (but that shouldn't be a problem):

elukey@neodymium:~$ sudo -i salt -t 120 ocg100[123]* cmd.run 'curl -s http://localhost:8000/?command=health'
ocg1001.eqiad.wmnet:
    {"host":"ocg1001","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":679375},"time":1457365084818,"requestTime":1}
ocg1003.eqiad.wmnet:
    {"host":"ocg1003","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":679376},"time":1457365085121,"requestTime":1}
ocg1002.eqiad.wmnet:
    {"host":"ocg1002","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":679382},"time":1457365085923,"requestTime":1}

So the next step is to switch the job queue to another Redis instance to allow maintenance on rdb1002.
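
Before (and after) the switch, one way to confirm which Redis backend the OCG nodes are actually talking to is to look at their established connections on the Redis port (a sketch, assuming the default port 6379):

elukey@ocg1001:~$ sudo ss -tnp | grep ':6379'   # lists TCP connections to Redis with the owning process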

Change 275510 had a related patch set uploaded (by Elukey):
Move the OCG Redis Job queue away from rdb1002 to rdb1007 for maintenance.

https://gerrit.wikimedia.org/r/275510

Useful info:

elukey@rdb1002:~$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)"
127.0.0.1:6379> GET ocg_render_job_queue
(nil)
127.0.0.1:6379> keys *ocg*
1) "ocg_job_status"
(9.54s)
127.0.0.1:6379> keys ocg*
1) "ocg_job_status"
(2.25s)
127.0.0.1:6379> type ocg_job_status
hash

So the job queue key doesn't seem to be there anymore.
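
Side note: KEYS scans the whole keyspace and blocks the server while doing it (hence the 9.54s above); on a live instance, SCAN is the safer equivalent. A sketch of the same lookup via redis-cli's built-in SCAN wrapper:

elukey@rdb1002:~$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" --scan --pattern 'ocg*'   # iterates incrementally instead of blocking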

Change 275510 merged by Elukey:
Move the OCG Redis Job queue away from rdb1002 to rdb1007 for maintenance.

https://gerrit.wikimedia.org/r/275510

All right, the patch is merged and rdb1002's client connections have been dropped in favor of rdb1007. I checked on the latter, and the Redis queues have been created and populated correctly, even though some "missing key" errors are still present in the ocg100[123] logs (they should go away soon).
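
To double-check that the clients really moved, the connected client addresses on rdb1007 can be summarized like this (a sketch; $AUTH stands for the masterauth value grepped out of the Redis config as above):

elukey@rdb1007:~$ redis-cli -a "$AUTH" CLIENT LIST | grep -o 'addr=[^ ]*' | sort | uniq -c   # counts connected clients per addr=ip:port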

The plan is to wait until tomorrow (EU time) and then re-image rdb1002 with Debian.

OCG's health:

elukey@neodymium:~$ sudo -i salt -t 120 ocg100[123]* cmd.run 'curl -s http://localhost:8000/?command=health'
ocg1003.eqiad.wmnet:
    {"host":"ocg1003","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":16364},"time":1457375341419,"requestTime":1}
ocg1001.eqiad.wmnet:
    {"host":"ocg1001","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":16364},"time":1457375341423,"requestTime":1}
ocg1002.eqiad.wmnet:
    {"host":"ocg1002","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":16364},"time":1457375341430,"requestTime":2}

Checked again this morning:

elukey@neodymium:~$ sudo -i salt -t 120 ocg100[123]* cmd.run 'curl -s http://localhost:8000/?command=health'
ocg1003.eqiad.wmnet:
    {"host":"ocg1003","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":111934},"time":1457426186810,"requestTime":1}
ocg1001.eqiad.wmnet:
    {"host":"ocg1001","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":111934},"time":1457426186817,"requestTime":1}
ocg1002.eqiad.wmnet:
    {"host":"ocg1002","directories":{"temp":{"path":"/mnt/tmpfs","size":0},"output":{"path":"/srv/deployment/ocg/output","size":0},"postmortem":{"path":"/srv/deployment/ocg/postmortem","size":0}},"JobQueue":{"name":"ocg_render_job_queue","length":0},"StatusObjects":{"name":"ocg_job_status","length":111934},"time":1457426186824,"requestTime":1}

Still seeing "missing key" errors in the OCG logs, but at a very low volume. Moreover, the new Redis keys on rdb1007 grew a lot overnight, a good sign that everything is working fine. I didn't notice any alert about OCG, so I'm proceeding with re-imaging rdb1002 to Debian.
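
Since ocg_job_status is a hash (see the TYPE output above), its growth can be tracked directly with HLEN (again a sketch, with $AUTH as a stand-in for the masterauth value):

elukey@rdb1007:~$ redis-cli -a "$AUTH" HLEN ocg_job_status   # number of cached job status entries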

rdb1002 has been successfully moved to Debian, but the OCG Redis instance still needs to be placed in another location (a Ganeti VM?), away from the rdb Job Queue hosts.

Keeping this task open to complete the work.