Add support for warp.da.ndl.go.jp as archive provider
Open, Medium / Public / Feature

Description

https://warp.ndl.go.jp/ (choose "English" from the pulldown in the top bar to view the website in English).

Sample URLs:

https://warp.ndl.go.jp/info:ndljp/pid/11986456/www.kesennuma.miyagi.jp/sec/s023/020/010/010/020/020/agenda2.20200603.pdf
https://warp.ndl.go.jp/info:ndljp/pid/11766846/www.kesennuma.miyagi.jp/sec/s023/020/010/010/020/20200721092320.html
https://warp.ndl.go.jp/collections/content/info:ndljp/pid/11986456/www.kesennuma.miyagi.jp/sec/s002/020/030/050/010/100/010/2020-01-28_zaisei.pdf

These URLs don't contain 14-digit timestamps. At a minimum, they should be skipped.
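
For illustration, a minimal sketch of how such URLs could be recognized so that they can at least be skipped. The pattern is inferred only from the sample URLs in this task (with or without the `collections/content/` prefix, and with either the `warp.ndl.go.jp` or `warp.da.ndl.go.jp` host); WARP may use other forms. The names `WARP_URL_RE`, `WarpUrl`, and `parse_warp_url` are hypothetical and not part of any existing bot code.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample URLs in this task; other URL forms may exist.
WARP_URL_RE = re.compile(
    r"^https?://warp\.(?:da\.)?ndl\.go\.jp/"  # both hostnames seen in this task
    r"(?:collections/content/)?"              # optional prefix seen in one sample
    r"info:ndljp/pid/(?P<pid>\d+)/"           # numeric id; note: not a timestamp
    r"(?P<original>.+)$"                      # original URL, without a scheme
)


class WarpUrl(NamedTuple):
    pid: str       # the numeric id that follows "pid/"
    original: str  # the archived site's URL as it appears in the path


def parse_warp_url(url: str) -> Optional[WarpUrl]:
    """Return the pid and original URL, or None if this is not a WARP URL."""
    m = WARP_URL_RE.match(url)
    if m is None:
        return None
    return WarpUrl(m.group("pid"), m.group("original"))
```

Because the path carries no capture date, a caller would still need to fetch the page to learn when the snapshot was taken.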

The date is available by scraping the page for "collectDate". For example:

https://warp.ndl.go.jp/info:ndljp/pid/12767547/jweld.jp/

contains the string:

<span class="warp_textArea_collectDate">2023年3月23日</span>

i.e., March 23, 2023
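
A rough sketch of that scraping step, assuming the page HTML has already been fetched and that the span holds the date as three numbers in year, month, day order, as in the sample above. The function name `extract_collect_date` is hypothetical. It reads only the digit groups, so it does not depend on the separator characters between them.

```python
import re
from datetime import date
from typing import Optional

# Matches the span shown above; assumes the class attribute is written exactly so.
COLLECT_DATE_RE = re.compile(
    r'<span class="warp_textArea_collectDate">(.*?)</span>', re.DOTALL
)


def extract_collect_date(html: str) -> Optional[date]:
    """Return the capture date from a WARP page, or None if it is not found."""
    m = COLLECT_DATE_RE.search(html)
    if m is None:
        return None
    # Assumption: exactly three digit groups, ordered year, month, day.
    numbers = re.findall(r"\d+", m.group(1))
    if len(numbers) != 3:
        return None
    year, month, day = (int(n) for n in numbers)
    return date(year, month, day)


sample = '<span class="warp_textArea_collectDate">2023年3月23日</span>'
print(extract_collect_date(sample))  # 2023-03-23
```

A date written with a month name rather than a number would yield fewer than three digit groups, so this sketch returns None for it instead of guessing.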


https://en.wikipedia.org/w/index.php?diff=1142300795&diffmode=source

The bot doesn't recognize the archive provider even though it is valid.

Event Timeline

Some secondary websites to verify the legitimacy of the URL:

Harej triaged this task as Medium priority. Mar 20 2023, 8:14 PM

Extracting the archive date from pages archived on WARP is tricky because the date is embedded in the page, and its format depends on the language you choose on their home page, https://warp.da.ndl.go.jp/

Example archive page:
https://warp.da.ndl.go.jp/info:ndljp/pid/9597364/www.dus.emb-japan.go.jp/profile/japanisch/j_wirtschaft/j_DUS.htm
When you specify "Japanese (日本語)" on WARP's home page as
https://warp.da.ndl.go.jp/?_lang=ja
then the date portion in the archived page will show:
"ご覧いただいているのは国立国会図書館が保存した<span class="warp_textArea_collectDate">2016年1月6日</span>時点のページです。"
In this mode, the date format is yyyy年m月d日.
https://icu4c-demos.unicode.org/icu-bin/locexp?d_=en&_=ja_JP

When you specify "English" on WARP's home page as
https://warp.da.ndl.go.jp/?_lang=en
then the archived page becomes:
"You are viewing an archived web page captured on <span class="warp_textArea_collectDate">6 Jan 2016</span> by the National Diet Library, Japan."
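
To make the two formats concrete, here is a hedged sketch that accepts either date string and pads it into a 14-digit timestamp. It assumes the English format is always "day, three-letter month abbreviation, year" as in the single example above, and it fills the unknown time of day with zeros. The names `parse_collect_date` and `to_timestamp14` are hypothetical.

```python
import re
from datetime import date

# English month abbreviations; assumed from the single "6 Jan 2016" example.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

JA_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")    # yyyy年m月d日
EN_RE = re.compile(r"(\d{1,2}) ([A-Za-z]{3}) (\d{4})")  # d Mon yyyy


def parse_collect_date(text: str) -> date:
    """Parse the collectDate text in either the Japanese or the English format."""
    text = text.strip()
    m = JA_RE.fullmatch(text)
    if m:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    m = EN_RE.fullmatch(text)
    if m and m.group(2).lower() in MONTHS:
        return date(int(m.group(3)), MONTHS[m.group(2).lower()], int(m.group(1)))
    raise ValueError(f"unrecognized collectDate format: {text!r}")


def to_timestamp14(d: date) -> str:
    """Format as YYYYMMDDhhmmss; the time is unknown, so it is zero-filled."""
    return d.strftime("%Y%m%d") + "000000"


print(to_timestamp14(parse_collect_date("2016年1月6日")))  # 20160106000000
print(to_timestamp14(parse_collect_date("6 Jan 2016")))   # 20160106000000
```

An explicit month table is used instead of `datetime.strptime` with `%b`, since `%b` depends on the process locale.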

By default, they seem to ignore the browser's "preferred language for displaying pages" setting and always show pages in the Japanese format (tested with Firefox after clearing all cookies and putting "English [en]" at the top of the Webpage Language Settings). But I am not certain about this.