Wikimedia cross-wiki coordination and L10n/i18n. Mainly active on Wikiquote, Wiktionary, Wikisource, Commons, Wikidata, Wikibooks. And of course Meta-Wiki, translatewiki.net.
Contact me by MediaWiki.org email or user talk.
I'm closing this task as unclear and not pertaining to MediaWiki core, mostly because it mixes different user groups and permissions some of which are Wikimedia-specific.
This reminds me a bit of https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool, which I believe focused on identifying easy concepts like numbers. I've not used it in years.
@Mazevedo Here's an example of an old ticket which may or may not be relevant any more. :)
Do you want to focus on the exonyms in languages which are supported by MediaWiki core (or at least translatewiki.net) but not in CLDR?
That was with all namespaces.
Current status
After the latest run
Mostly fixed upstream.
It's not clear to me why doi:10.1038/s41586-023-06291-2 got an arXiv ID but not a PMC ID: https://en.wikipedia.org/w/index.php?title=PubMed&diff=prev&oldid=1195324840
The new round seems to go fine so far https://en.wikipedia.org/w/index.php?title=Special:Contributions/OAbot&target=OAbot&dir=prev&offset=20240107000000&limit=50
The non-Unpaywall side continues at T228702.
We're still discarding excess merges from Dissemin, similar to the 2019 logic: https://github.com/dissemin/oabot/commit/e3c74bff735c1ef16ee333dde2ac4bdd20949635 . We're not currently using the Dissemin title matches, but if we did, checking for a title, author, and year match would not be enough: https://en.wikipedia.org/w/index.php?title=User_talk%3AOAbot&diff=1194216712&oldid=1193993325 .
There are over 6.5 million PMC matches and only 650k matches by title and author, of which some 60k appear without a PMCID match, so perhaps we can just ignore those europepmc matches:
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep -c "oa repository (via pmcid lookup)"
6499014
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep -c "oa repository (via OAI-PMH title and first author match)"
637491
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep "oa repository (via OAI-PMH title and first author match)" | grep -vc "oa repository (via pmcid lookup)"
62310
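The same tally can be sketched in Python, which is less prone to grep matching "pmc" anywhere in the line; a minimal sketch assuming the standard Unpaywall snapshot schema (`is_oa`, `oa_locations`, with `url` and `evidence` per location):

```python
import json

def count_pmc_evidence(lines):
    """Count OA records whose PMC-ish locations come from a pmcid lookup
    vs. an OAI-PMH title/first-author match, and how many have only the
    latter (evidence strings as they appear in Unpaywall dumps)."""
    by_pmcid = by_title = title_only = 0
    for line in lines:
        rec = json.loads(line)
        if not rec.get("is_oa"):
            continue
        evidences = {loc.get("evidence")
                     for loc in rec.get("oa_locations", [])
                     if "pmc" in (loc.get("url") or "")}
        if "oa repository (via pmcid lookup)" in evidences:
            by_pmcid += 1
        if "oa repository (via OAI-PMH title and first author match)" in evidences:
            by_title += 1
            if "oa repository (via pmcid lookup)" not in evidences:
                title_only += 1
    return by_pmcid, by_title, title_only
```

In practice the lines would come from streaming the bzipped snapshot rather than loading it whole.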
Both papers on Unpaywall have evidence "oa repository (via OAI-PMH title and first author match)", although the PMC side exposes a link to the correct DOI. The CrossRef API has the page range, like "113-128" or "283-288", so it may be possible to check the number of pages.
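Such a check could be sketched like this (the helper is hypothetical; the "113-128" style range is what the CrossRef API returns):

```python
import re

def page_count(pages):
    """Number of pages in a CrossRef-style range like '113-128'.
    Returns None when the value isn't a simple numeric range."""
    m = re.fullmatch(r"(\d+)\s*-\s*(\d+)", pages.strip())
    if not m:
        return None
    first, last = int(m.group(1)), int(m.group(2))
    return last - first + 1 if last >= first else None
```

A repository match could then be rejected when the candidate PDF's page count differs from the published range by more than some small tolerance.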
So we won't suggest edits like this either https://en.wikipedia.org/w/index.php?title=Saccharomyceta&curid=68064105&diff=1194087545&oldid=1182890284 as we don't get non-repository URLs from other sources.
A sample of what kind of URLs we're talking about:
Only 35k or so of these are in the best_oa_location (sometimes even when a separate match for arxiv exists, like doi:10.1002/rsa.20071 / oai:CiteSeerX.psu:10.1.1.237.8456 / oai:arXiv.org:math/0209357 ).
Not sure how to narrow this down; we're talking about some 500k matches from CiteSeerX (out of 900k):
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep citeseerx | grep "oa repository (via OAI-PMH doi match)" | jq -r 'select(.oa_locations | .[] | .endpoint_id == "CiteSeerX.psu" and .evidence == "oa repository (via OAI-PMH doi match)" )|.doi' | wc -l
505747
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep -c citeseerx
887759
Another example where URL priorities changed: https://en.wikipedia.org/w/index.php?title=Balbinot_1&diff=prev&oldid=1193722831 (but there was no doi-access=free).
The recent change to sort all URLs https://github.com/dissemin/oabot/commit/ddab25a5ee71e2f23fe4b8dfb5a28c8da333a922 allowed the bot to perform https://en.wikipedia.org/w/index.php?title=Serafim_Kalliadasis&diff=prev&oldid=1193717235 , while previously it would probably only have suggested the first URL https://eprints.qut.edu.au/134215/1/134215p.pdf . http://hdl.handle.net/10044/1/55290 is the 3rd suggestion from Unpaywall and https://arxiv.org/abs/1609.05938 is the 8th.
That's hopefully fixed in https://github.com/dissemin/oabot/commit/1cd61525a8cc5d8378e60f63555cf291e1bb4660 .
I've manually updated the leaderboard with https://github.com/nemobis/oabot/commit/4917289ac7b49ca5176129d9f19ae5355ac84b72
The last row created was
https://en.wikipedia.org/w/index.php?title=Lyman_E._Johnson&diff=prev&oldid=1191724248 was not supposed to happen, as the existing URL returns a PDF.
Latest run
Still room for improvement
Some doi-access=free parameters are being re-added now:
$ find -maxdepth 1 -type f -print0 | xargs -0 -P16 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("doi-access=free")) | .orig_string' | grep doi | grep -Eo 'doi *= *[^"|]+' | grep -Eo '10\.[0-9]+/[a-z]+(\.([a-z]{,8}|[0-9-]{9})\b)?' | sort | uniq -c | sort -nr | head -n 40
    546 10.1146/annurev
    409 10.1007/s
    186 10.4202/app.
    178 10.1016/j.
    176 10.1016/j.cub
    156 10.1126/science.
    124 10.1038/s
     96 10.1016/j.cretres
     84 10.1111/pala.
     78 10.1017/jpa.
     72 10.1074/jbc.
     66 10.1002/ar.
     61 10.5252/geodiversitas
     56 10.11646/zootaxa.
     52 10.5852/ejt.
     52 10.5852/cr
     52 10.1016/j.palaeo
     52 10.1002/spp
     48 10.1016/j.jhevol
     46 10.1093/zoolinnean
     44 10.5962/bhl.part
     44 10.1111/j.
     42 10.1016/s
     41 10.3140/bull.geosci
     39 10.1016/j.cell
     39 10.1002/ajb
     38 10.4049/jimmunol.
     38 10.1017/pab.
     33 10.1038/nature
     32 10.1111/j.1475-4983
     31 10.37828/em.
     31 10.1093/mnras
     28 10.1111/j.1096-3642
     27 10.5962/p.
     27 10.2476/asjaa.
     25 10.7203/sjp.
     25 10.1016/j.revpalbo
     23 10.1002/ajpa.
     21 10.24425/agp.
     21 10.1093/bioinformatics
Currently with some 160k pages found:
$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
  15725 www.jstor.org
  14451 dx.doi.org
  12927 doi.org
   9520 www.sciencedirect.com
   6442 www.researchgate.net
   5630 www.tandfonline.com
   5491 onlinelibrary.wiley.com
   4498 www.cambridge.org
   3824 pubmed.ncbi.nlm.nih.gov
   3477 link.springer.com
   3182 muse.jhu.edu
   3024 linkinghub.elsevier.com
   2928 www.nature.com
   2770 journals.sagepub.com
   2065 www.academia.edu
   1934 pubs.acs.org
   1896 academic.oup.com
   1736 www.persee.fr
   1520 www.science.org
   1473 semanticscholar.org
   1247 www.journals.uchicago.edu
   1210 archive.org
   1128 books.google.com
    956 ieeexplore.ieee.org
    854 www.oxforddnb.com
    789 brill.com
    707 doi.wiley.com
    646 www.semanticscholar.org
    620 zenodo.org
    571 www.degruyter.com
After a broader run
$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
   3020 dx.doi.org
   2666 www.jstor.org
   2569 doi.org
   2116 www.sciencedirect.com
   1217 www.researchgate.net
   1105 onlinelibrary.wiley.com
   1011 www.tandfonline.com
    822 www.cambridge.org
    789 pubmed.ncbi.nlm.nih.gov
    748 linkinghub.elsevier.com
    685 link.springer.com
    630 www.nature.com
    522 journals.sagepub.com
    453 muse.jhu.edu
    435 pubs.acs.org
    361 www.academia.edu
    351 semanticscholar.org
    341 academic.oup.com
    338 www.science.org
    301 archive.org
    244 www.persee.fr
    210 www.journals.uchicago.edu
    187 books.google.com
    180 ieeexplore.ieee.org
    157 pubs.geoscienceworld.org
    150 doi.wiley.com
    149 www.semanticscholar.org
    120 pubs.rsc.org
    119 brill.com
    108 link.aps.org
How to sample JSTOR DOIs which look closed:
$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("doi-access=|")) | .orig_string' | grep 2307 | grep -Eo "10.2307/[0-9]+" | sort | shuf -n 40
Currently the most represented domains would be:
$ find -maxdepth 1 -type f -mtime -1 -print0 | xargs -0 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
    916 dx.doi.org
    723 www.sciencedirect.com
    658 doi.org
    519 www.jstor.org
    312 onlinelibrary.wiley.com
    292 linkinghub.elsevier.com
    267 www.researchgate.net
    221 www.tandfonline.com
    218 www.cambridge.org
    204 link.springer.com
    182 pubmed.ncbi.nlm.nih.gov
    179 www.nature.com
    152 journals.sagepub.com
    131 pubs.acs.org
    102 www.science.org
     94 academic.oup.com
     93 semanticscholar.org
     87 archive.org
     79 www.academia.edu
     74 pubs.geoscienceworld.org
     55 doi.wiley.com
     54 www.journals.uchicago.edu
     52 pubs.rsc.org
     50 muse.jhu.edu
     49 www.semanticscholar.org
     47 ieeexplore.ieee.org
     43 iopscience.iop.org
     42 link.aps.org
     37 xlink.rsc.org
     35 aip.scitation.org
Need to check how many url-access=limited we'd add to non-DOI citations, like this AdsAbs one: https://en.wikipedia.org/w/index.php?title=T_Scorpii&diff=prev&oldid=1188735108
We should not replace an existing url-access value with another for the same URL, as happened at https://en.wikipedia.org/w/index.php?title=Soft_skills&diff=prev&oldid=1188731807 (even though I'd argue the archive.org inlibrary items are more "limited" than "registration").
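A guard for that case could be as simple as refusing to touch citations that already carry a url-access value; a minimal sketch, assuming the template parameters are available as a dict (which is not OAbot's actual data model):

```python
def can_add_url_access(params):
    """True only when it is safe to add a |url-access= flag: the citation
    has a URL and no url-access value was already set (by a human or an
    earlier bot run), so we never overwrite e.g. 'registration' with
    'limited' for the same URL."""
    return "url" in params and not params.get("url-access", "").strip()
```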
I've manually deleted the older suggestions so now the numbers will be lower.
find ~/www/python/src/bot_cache -mtime +3 -delete
Some ISSNs
$ find ~/www/python/src/bot_cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep issn | grep -Eo 'issn *= *[0-9-]{8,9}' | grep -Eo '[0-9-]{8,9}' | sort | uniq -c | sort -nr | head -n 40
     87 0036-8075
     46 0004-637
     45 1476-4687
     45 0004-6256
     39 0191-2917
     39 0098-7484
     33 0028-0836
     28 1044-0305
     25 0067-0049
     24 0080-4606
     24 0021-8693
     19 2156-2202
     19 1396-0466
     18 1538-4365
     17 0148-0227
     17 0031-4005
     17 0022-0949
     16 0950-9232
     16 0304-3975
     16 0278-2715
     16 0140-6736
     16 0035-8711
     16 0028-646
     16 0002-7294
     15 1944-8007
     15 1538-4357
     15 0301-4223
     15 0031-949
     15 0006-3568
     15 0003-9926
     14 2330-4804
     14 1475-4983
     14 0271-5333
     13 0272-4634
     13 0097-3165
     13 0080-4630
     12 2515-5172
     12 1631-0683
     12 1364-5021
     12 0094-8276
Or, to catch some more ISSNs:
$ find ~/www/python/src/bot_cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep doi= | grep -Eo 'doi *=[^"|]+' | grep -Eo '10\.[0-9]+/[a-z]+\b(\.?([a-z]{,8}|[0-9-]{8,9})\b)?' | sort | uniq -c | sort -nr | head -n 40
    390 10.1126/science.
    260 10.1001/jama.
    244 10.1074/jbc.
    235 10.1038/sj.onc
    155 10.1098/rsbm.
    116 10.1098/rstb.
    111 10.1525/aa.
    110 10.1098/rspa.
    104 10.1242/jeb.
    104 10.1111/j.
    100 10.5210/fm.
    100 10.1377/hlthaff.
     99 10.1016/j.
     91 10.1098/rstl.
     86 10.1093/mnras
     74 10.1242/jcs.
     68 10.1167/iovs.
     68 10.1001/archinte.
     62 10.1542/peds.
     61 10.1111/j.1469-8137
     60 10.1098/rsta.
     57 10.1111/j.1558-5646
     55 10.1001/archneur.
     53 10.1111/j.1096-3642
     52 10.1001/archpsyc.
     48 10.3732/ajb.
     46 10.1002/art.
     43 10.1038/sj.mp
     43 10.1016/j.febslet
     42 10.1093/hmg
     41 10.1111/j.1432-1033
     41 10.1016/j.jacc
     40 10.1093/acrefore
     40 10.1001/archopht.
     39 10.1098/rspb.
     39 10.1093/molbev
     38 10.1001/archpedi.
     37 10.1242/dev.
     37 10.1111/j.1475-4983
     36 10.1016/j.jasms
Some of the most common DOI segments slated for doi-access=free removal in today's run:
$ find ~/www/python/src/bot_cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep doi= | grep -Eo 'doi *=[^"|]+' | grep -Eo '10\.[0-9]+/[a-z]+(\.([a-z]{,8}|[0-9-]{9})\b)?' | sort | uniq -c | sort -nr | head -n 40
    392 10.1126/science.
    351 10.1074/jbc.
    260 10.1001/jama.
    236 10.1038/sj.onc
    209 10.1007/s
    176 10.1016/s
    173 10.1038/s
    155 10.1098/rsbm.
    147 10.1146/knowable
    139 10.1038/d
    116 10.1098/rstb.
    111 10.1525/aa.
    110 10.1098/rspa.
    104 10.1242/jeb.
    104 10.1111/j.
    100 10.5210/fm.
    100 10.1377/hlthaff.
     99 10.1016/j.
     91 10.1098/rstl.
     86 10.1093/mnras
     76 10.1242/jcs.
     75 10.1167/iovs.
     68 10.1001/archinte.
     62 10.1542/peds.
     61 10.1111/j.1469-8137
     60 10.1098/rsta.
     57 10.1111/j.1558-5646
     55 10.1001/archneur.
     53 10.1111/j.1096-3642
     52 10.1001/archpsyc.
     48 10.3732/ajb.
     46 10.1038/nature
     46 10.1002/art.
     45 10.1038/sj.mp
     43 10.1016/j.febslet
     42 10.1111/j.1432-1033
     42 10.1093/hmg
     41 10.1016/j.jacc
     41 10.1007/bf
     40 10.1093/acrefore
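The prefix tally from the shell pipeline can also be sketched in Python (a rough equivalent; how the cached .orig_string values are read from the bot_cache files is left out):

```python
import re
from collections import Counter

# Python version of the grep -E pattern: prefix, then optionally a dot
# followed by up to 8 letters or a 9-char ISSN-like run ({,8} in ERE
# means {0,8}, so a bare trailing dot also matches).
DOI_PREFIX = re.compile(r"10\.\d+/[a-z]+(?:\.(?:[a-z]{0,8}|[0-9-]{9}))?")

def doi_prefix_counts(orig_strings):
    """Tally leading DOI segments found in citation template strings."""
    counts = Counter()
    for s in orig_strings:
        m = re.search(r'doi\s*=\s*([^"|]+)', s)
        if not m:
            continue
        prefix = DOI_PREFIX.match(m.group(1).strip())
        if prefix:
            counts[prefix.group(0)] += 1
    return counts
```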
You can look at the effect of the captcha on known-human users (e.g. IPs from some institutional range).
And currently
$ find ~/www/python/src/cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep url= | grep -Eo 'url=[^"|]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 40
   1427 doi.org
   1229 dx.doi.org
   1180 www.sciencedirect.com
    940 www.jstor.org
    875 web.archive.org
    736 onlinelibrary.wiley.com
    606 www.researchgate.net
    591 www.nature.com
    586 www.tandfonline.com
    408 www.cambridge.org
    376 archive.org
    337 link.springer.com
    328 linkinghub.elsevier.com
    310 www.escholarship.org
    302 journals.sagepub.com
    283 www.academia.edu
    265 academic.oup.com
    261 pubmed.ncbi.nlm.nih.gov
    259 www.biodiversitylibrary.org
    244 books.google.com
    238 www.science.org
    224 babel.hathitrust.org
    220 zenodo.org
    212 nrs.harvard.edu
    184 ieeexplore.ieee.org
    177 digitalcommons.law.yale.edu
    176 www.journals.uchicago.edu
    166 urn.kb.se
    164 pubs.acs.org
    123 www.bioone.org
    118 nbn-resolving.de
    117 philarchive.org
    110 muse.jhu.edu
    110 link.aps.org
    105 www.research.manchester.ac.uk
    100 bioone.org
     87 www.aeaweb.org
     86 www.osti.gov
     79 pubs.rsc.org
     77 dspace.lboro.ac.uk
I made reports upstream for the Journal of Biological Chemistry (already fixed), Journal of Asian Studies/Duke University Press, Annual Review of Public Health, AAS journals, and AME journals. I manually removed their doi-access=free removals from the queue (they were around 10% of the total, I think, including all 10.1146/annurev DOIs, some of which are not open yet).
The most popular domains to be replaced can be found with:
$ find ~/www/python/src/cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep url= | grep -Eo 'url=[^"|]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 40
   1110 doi.org
    940 dx.doi.org
    893 www.sciencedirect.com
    724 www.jstor.org
    639 web.archive.org
    571 onlinelibrary.wiley.com
    469 www.researchgate.net
    451 www.tandfonline.com
    444 www.nature.com
    316 www.cambridge.org
    277 link.springer.com
    259 linkinghub.elsevier.com
    259 archive.org
    227 www.escholarship.org
    210 journals.sagepub.com
    197 academic.oup.com
    196 www.academia.edu
    192 pubmed.ncbi.nlm.nih.gov
    191 www.biodiversitylibrary.org
    184 books.google.com
I opened a PR upstream, https://github.com/ourresearch/oadoi/pull/141#issuecomment-1830788674
Most common prefixes of DOIs which would be removed:
There are currently about 20k edits in the queue which would remove a doi-access=true parameter (but are currently doing nothing).
Comment on doi.org links: https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:OABOT&diff=prev&oldid=1172247256 .
........Equity_premium_puzzle ......
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data/project/oabot/www/python/src/app.py", line 218, in get_proposed_edits
    filtered = list([e for e in all_templates if e.proposed_change])
  File "/data/project/oabot/www/python/src/app.py", line 218, in <listcomp>
    filtered = list([e for e in all_templates if e.proposed_change])
  File "/data/project/oabot/www/python/src/oabot/main.py", line 387, in add_oa_links_in_references
    edit.propose_change(only_doi)
  File "/data/project/oabot/www/python/src/oabot/main.py", line 118, in propose_change
    link, oa_status = get_oa_link(paper=dissemin_paper_object, doi=doi, only_unpaywall=only_doi)
  File "/data/project/oabot/www/python/src/oabot/main.py", line 337, in get_oa_link
    if 'citeseerx.ist.psu.edu' in resp['best_oa_location']['url_for_landing_page']:
TypeError: argument of type 'NoneType' is not iterable
.List_of_topics_characterized_as_pseudoscience .................
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data/project/oabot/www/python/src/app.py", line 218, in get_proposed_edits
    filtered = list([e for e in all_templates if e.proposed_change])
  File "/data/project/oabot/www/python/src/app.py", line 218, in <listcomp>
    filtered = list([e for e in all_templates if e.proposed_change])
  File "/data/project/oabot/www/python/src/oabot/main.py", line 387, in add_oa_links_in_references
    edit.propose_change(only_doi)
  File "/data/project/oabot/www/python/src/oabot/main.py", line 118, in propose_change
    link, oa_status = get_oa_link(paper=dissemin_paper_object, doi=doi, only_unpaywall=only_doi)
  File "/data/project/oabot/www/python/src/oabot/main.py", line 368, in get_oa_link
    return url, resp['oa_status']
UnboundLocalError: local variable 'resp' referenced before assignment
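Both tracebacks point at unguarded access to the Unpaywall response: best_oa_location can be null in the JSON (the TypeError), and resp is only bound on some code paths (the UnboundLocalError). The kind of guard needed in get_oa_link could look like this (a simplified sketch, not the actual main.py logic):

```python
def pick_oa_url(resp):
    """Extract (url, oa_status) from an Unpaywall-style response dict,
    tolerating a missing response or a null best_oa_location."""
    if not resp:
        return None, None
    best = resp.get("best_oa_location") or {}
    url = best.get("url_for_landing_page")
    if url and "citeseerx.ist.psu.edu" in url:
        # Skip CiteSeerX landing pages, as the check at main.py:337 intends.
        return None, None
    return url, resp.get("oa_status")
```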
Some repositories are inevitably stricter than others and will block us; there's little we can do about it. However,
That should be fixed by https://github.com/dissemin/oabot/pull/90 .
A bug in the current version
34 more examples which seem to be bronze OA from my manual check, out of the 71 OAbot found that Unpaywall says are closed (the rest I mostly couldn't verify).
The 400+ edits today (with about 1300 more cached) were the outcome of a regularly scheduled OAbot refresh, which took about 57 hours to prefill with 10 parallel threads. With one thread it would presumably take at least 3 weeks, but perhaps a monthly update is enough. I'll revisit the multiprocessing after the next run.
More examples: