
Tool Labs CDNjs mirror not up-to-date
Closed, Resolved (Public)

Description

Our mirror of CDNjs seems to be out of sync. Specifically, TensorFlow.js was added in April (see https://github.com/cdnjs/cdnjs/search?o=asc&q=tensorflow&s=committer-date&type=Commits) and is available on the main mirror (https://cdnjs.com/libraries/tensorflow), but it is not in ours. Could there be a problem with the syncing?

Event Timeline

It is running into a memory error during the sync job. Need to work around that.

Andrew triaged this task as Medium priority.

So the issue is that memory tends to run out when deserializing JSON to Python objects. I'll test just asking for more memory from gridengine, but it is likely that a pretty big rework is needed. The Python streaming deserialization story is not great. I've done this in Go with little difficulty, but here the libraries are a bit weird and often depend on C libs. The tricky part is sorting by GitHub stars, since a full sort depends on having every entry in memory. Dumping each entry to a file as the stream comes in is easy. It may be a matter of constructing the HTML files during streaming while determining the sort order from a much smaller data structure of stubs, or of storing the data temporarily in something like Redis.
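A minimal sketch of the "stream, dump to files, sort small stubs" idea, using only the standard library. It assumes the input is a top-level JSON array of objects, and the field names `name` and `stars`, as well as both function names, are hypothetical; the real sync job and the real CDNjs payload may look different.

```python
import json
import os


def iter_json_array(stream, chunk_size=65536):
    """Yield elements of a top-level JSON array without loading it whole.

    Assumes every element is a JSON object, so a truncated element can
    never be mistaken for a complete one.
    """
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    while True:
        buf = buf.lstrip()
        if not started:
            if not buf:
                chunk = stream.read(chunk_size)
                if not chunk:
                    return  # empty input
                buf += chunk
                continue
            if buf[0] != "[":
                raise ValueError("expected a top-level JSON array")
            buf = buf[1:]
            started = True
            continue
        if buf.startswith(","):
            buf = buf[1:].lstrip()
        if buf.startswith("]"):
            return
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # Element is incomplete: pull in more text and retry.
            chunk = stream.read(chunk_size)
            if not chunk:
                raise
            buf += chunk
            continue
        yield obj
        buf = buf[end:]


def spill_and_rank(stream, out_dir):
    """Write each library to its own file; keep only small stubs in memory."""
    stubs = []
    for n, lib in enumerate(iter_json_array(stream)):
        path = os.path.join(out_dir, f"{n}.json")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(lib, fh)
        # Hypothetical fields: "stars" for the sort key, "name" for ties.
        stubs.append((lib.get("stars", 0), lib["name"], path))
    # Most-starred first, then alphabetical.
    stubs.sort(key=lambda s: (-s[0], s[1]))
    return stubs
```

Only the `(stars, name, path)` tuples stay resident, so the sort no longer needs the full objects; the per-library pages could then be rendered by walking the sorted stubs and reading one file at a time. The retry-on-error parsing is simple rather than fast, and a dedicated streaming parser would likely do better on very large elements.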

If I can just ask for 3GB of memory, and mess with the ulimits, we might be fine.
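One way to check from inside the job whether a larger memory request actually took effect is to read the process limits. This is only a sketch: it assumes the scheduler's memory request surfaces as the address-space limit (`RLIMIT_AS`), which is not confirmed in this thread, and both helper names are made up for illustration.

```python
import resource


def describe_limit(value):
    """Render a byte limit as GiB, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 1024**3:.2f} GiB"


def address_space_limits():
    """Return the (soft, hard) address-space limits as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    return describe_limit(soft), describe_limit(hard)
```

Logging `address_space_limits()` at the start of the sync job would show whether the limit the process sees matches what was requested.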

Good news! Asking for more memory from gridengine did succeed in not getting a memory error so far (got past the whole JSON deserialization). However, I now get:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 14224560: ordinal not in range(128)

Somebody introduced a fancy quote character. That's an easy fix, at least.
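For illustration, here is roughly what this class of error and fix looks like in Python 3. The patch itself is not shown in this thread, so this is an assumption about its shape rather than the actual change; the function names are hypothetical.

```python
def ascii_encoding_fails(text):
    """Return True if text cannot be encoded with the 'ascii' codec."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def write_text_utf8(path, text):
    """Write text with an explicit encoding instead of the locale default.

    Without encoding=..., open() uses the locale's preferred encoding,
    which can be ASCII in a bare batch-job environment.
    """
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
```

A string containing `'\u2019'` (the right single quotation mark) makes `ascii_encoding_fails` return `True`, while `write_text_utf8` writes it without complaint regardless of the job's locale settings.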

> Good news! Asking for more memory from gridengine did succeed in not getting a memory error so far (got past the whole JSON deserialization).

Yay! Hm, I wonder if the memory limit will be an occasionally recurring issue as more and more libraries (and their versions) are added to CDNjs over time.

Thank you so much for looking into this and fixing it, @Bstorm! :)

Change 439380 had a related patch set uploaded (by Bstorm; owner: Brooke Storm):
[labs/tools/cdnjs-index@master] cdnjs-index: force encoding on writing files

https://gerrit.wikimedia.org/r/439380

I'll just merge that patch, since I'm already running it via a local hack :-p

Change 439380 merged by Bstorm:
[labs/tools/cdnjs-index@master] cdnjs-index: force encoding on writing files

https://gerrit.wikimedia.org/r/439380

Also, I see tensorflow is now on the index page 💥

@Bstorm: thank you very much!!!

Vvjjkkii renamed this task from Tool Labs CDNjs mirror not up-to-date to bgbaaaaaaa. Jul 1 2018, 1:06 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Bstorm as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from bgbaaaaaaa to Tool Labs CDNjs mirror not up-to-date. Jul 2 2018, 1:26 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to Bstorm.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added subscribers: gerritbot, Aklapper.