Wikimedia cross-wiki coordination and L10n/i18n. Mainly active on Wikiquote, Wiktionary, Wikisource, Commons, Wikidata, Wikibooks. And of course Meta-Wiki, translatewiki.net.
Contact me by MediaWiki.org email or user talk.
Nowadays we only use Unpaywall and the number of ResearchGate or Academia.edu suggestions is negligible.
Given how many years it has taken us to babysit OAbot on the English Wikipedia to do a fraction of what was originally envisioned, I'm starting to wonder whether this should instead be done with an extension, similar to the SecureLinkFixer extension. After all, sending visitors to the websites of legacy publishers is a clear and present danger. The Unpaywall snapshot could be imported in a way similar to what the Tor extension does, or redirection could be delegated to oadoi.org. Ways can be devised to leave more control to local wikis.
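For illustration only, a minimal sketch of the kind of lookup such an extension (or the oadoi.org delegation) would boil down to, using the public Unpaywall v2 API; the helper name and the contact email are placeholders, not part of any existing code.

```python
# Minimal sketch (not the proposed extension): resolve a DOI to its best
# open-access URL via the Unpaywall (formerly oadoi.org) v2 API, so readers
# can be sent to a repository copy instead of the publisher's paywall.
import requests

UNPAYWALL_API = "https://api.unpaywall.org/v2/"
CONTACT_EMAIL = "example@example.org"  # placeholder; Unpaywall requires a contact email


def best_oa_url(doi: str) -> str | None:
    """Return the best open-access URL for a DOI, or None if it is closed."""
    response = requests.get(UNPAYWALL_API + doi, params={"email": CONTACT_EMAIL}, timeout=10)
    response.raise_for_status()
    location = response.json().get("best_oa_location") or {}
    return location.get("url_for_pdf") or location.get("url")


if __name__ == "__main__":
    # DOI mentioned elsewhere on this page as an open-access example.
    print(best_oa_url("10.3897/zookeys.43.390"))
```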
Generally speaking, this has been working fine for a while. Example: https://en.wikipedia.org/w/index.php?title=MIM_Museum&diff=prev&oldid=1291114435
Current most popular DOI prefixes
$ find ~/www/python/src/bot_cache -maxdepth 1 -type f -print0 | xargs -0 -P16 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("doi-access=|")) | .orig_string' | grep doi | grep -Eo 'doi *= [^"|]+' | grep -Eo '10\.[0-9]+/[a-z]+(\.([a-z]{,8}|[0-9-]{9})\b)?' | sort | uniq -c | sort -nr | head -n 40
jq: error: Could not open file /data/project/oabot/www/python/src/bot_cache/ISO#IEC_2022.json: No such file or directory
parse error: Invalid numeric literal at line 1, column 6
   1933 10.1074/jbc.
   1194 10.1038/sj.onc
    705 10.1126/science.
    512 10.1098/rsbm.
    396 10.4049/jimmunol.
    385 10.1093/hmg
    370 10.1111/syen.
    304 10.1096/fj.
    284 10.1001/jama.
    250 10.1242/jcs.
    213 10.1096/fasebj.
    204 10.11646/zootaxa.
    202 10.1182/blood
    162 10.1016/j.febslet
    138 10.1038/sj.mp
    127 10.1182/blood.
    111 10.1242/dev.
    103 10.1016/s
    100 10.1111/j.
    100 10.1002/art.
     87 10.1210/jcem.
     87 10.1167/iovs.
     85 10.1111/j.1432-1033
     81 10.1093/brain
     81 10.1016/j.
     80 10.1038/onc.
     80 10.1001/archinte.
     77 10.1093/humupd
     76 10.1038/sj.leu
     75 10.1242/jeb.
     75 10.1098/rstl.
     74 10.1093/mnras
     74 10.1002/ijc.
     73 10.1001/archneur.
     72 10.1007/s
     70 10.4269/ajtmh.
     70 10.1146/annurev
     66 10.1016/j.cell
     64 10.1542/peds.
     62 10.1124/pr.
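For comparison, a rough Python sketch of the same count, which also sidesteps the jq failures above (the file name containing "#" and the non-JSON cache entry); the cache field names are taken from the jq filter, the rest is illustrative.

```python
# Rough Python equivalent of the shell pipeline above: count the most common
# DOI prefixes among proposed doi-access edits in the bot cache, skipping
# unreadable or non-JSON cache files instead of erroring out.
import json
import re
from collections import Counter
from pathlib import Path

CACHE_DIR = Path.home() / "www/python/src/bot_cache"
DOI_RE = re.compile(r'doi *= *([^"|]+)')
PREFIX_RE = re.compile(r"10\.[0-9]+/[a-z]+(?:\.(?:[a-z]{0,8}|[0-9-]{9})\b)?")

counts = Counter()
for path in CACHE_DIR.glob("*"):
    if not path.is_file():
        continue
    try:
        cached = json.loads(path.read_text())
    except (ValueError, OSError):
        continue  # skip non-JSON or unreadable cache entries
    for edit in cached.get("proposed_edits", []):
        if "doi-access=|" not in edit.get("proposed_change", ""):
            continue
        for doi in DOI_RE.findall(edit.get("orig_string", "")):
            counts.update(PREFIX_RE.findall(doi))

for prefix, count in counts.most_common(40):
    print(count, prefix)
```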
Rare cases of removal of url-access=subscription do not seem very useful: https://en.wikipedia.org/w/index.php?title=Economy_of_Russia&diff=prev&oldid=1291484551
The most common proposed changes in the bot queue, currently not acted upon, are:
I think we should decline this for good, since the Wikidata graph split has been completed and the future of WikiCite data on Wikidata remains uncertain.
As rephrased, the issue has been solved.
Thanks for sharing this discussion between those two users. The tool already avoids adding URLs when doi-access=true is confirmed to be correct (cf. T344114). Links to repositories are added for additional safety when the DOI link appears to be closed.
A [February 2025 RfC](https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(policy)/Archive_201#h-RFC:_Allow_for_bots_(e.g._Citation_bot)_to_remove_redundant_URLs_known_to_not_ho-20250217092500) on the English Wikipedia has explicitly endorsed removing PubMed and OCLC URLs which do not provide a full text.
All the cases mentioned above now seem fine in Unpaywall, judging from some spot checks.
No longer relevant as Dissemin has closed: T394853.
Not sure whether this is still happening.
In the past year, redundant links have grown from 90k to 120k, so clearly this is more necessary than ever...
I've finally merged the PR as oabot has been running that code for over a year without problems now. https://github.com/dissemin/oabot/pull/91
34 more examples which seem to be bronze OA from my manual check, out of 71 cases OAbot found where Unpaywall says they are closed (the rest I mostly couldn't verify).
Thanks for the report and sorry for the annoying experience. Errors about individual DOIs are best reported to Unpaywall directly. The issue has since been fixed, as doi:10.1007/BF02124750 is now considered closed access.
It's true it would be good to link e.g. doi:10.3897/zookeys.43.390 if it weren't linked already, but the bot already does that.
Thanks for the report. Next time please apply the edit and revert it, or include the suggested citation, or at least mention the DOI. Links to suggestions expire after a few weeks, as they get deleted from the cache.
OABot adds URLs to http://pdfs.semanticscholar.org/8775/3fa9d86e28e1fb332f1509f3519e5b3a9c0d.pdf which redirects
The s2cid parameter does not autolink, so it's not a substitute for the url parameter. See also Why does the oabot tool make edits the bot doesn't?.
It seems clear to me that we need a mirror of Wikimedia Commons files. Ideally we would have kept both the media tarballs at your.org and the WikiTeam collection at the Internet Archive up to date, but we have not managed to keep them up to date since 2012 and 2016 respectively.
Also, I've tried the link from a recent post and it doesn't even work: it produces an empty post after one or two redirects. It seems nobody is using those links, since nobody noticed.
Another reason to do this is that Facebook doesn't even allow sharing links to some Wikimedia projects.
Thanks for the update on the XML data dumps list. I see there's progress on the other side: https://phabricator.wikimedia.org/T382947#10476420 . Hopefully this will allow the dumps to be re-enabled soon.
IIRC these (and the OAI feeds) were added back in the day when the WMF got some corporate contribution to provide specialised data feeds. I imagine any contractual obligations have long expired (if they even existed), but I don't know who could verify that.
The query itself will remain, so getting fresh results should be nothing more than a query submission away.
By running more tests and using the Mann–Whitney U test, we know whether a performance regression is statistically significant. That way we can make sure that we only alert on real regressions, which decreases the number of false alerts and the time spent investigating them.
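For reference, a minimal sketch of that kind of check in Python with scipy; the sample timings and the 0.05 threshold are made up for illustration.

```python
# Minimal sketch of a Mann-Whitney U significance check between two sets of
# timing samples; only alert when the difference is statistically significant.
from scipy.stats import mannwhitneyu

# Made-up example data: page load times in milliseconds.
baseline = [512, 498, 530, 505, 520, 515, 499, 508, 525, 510]
current = [560, 548, 571, 555, 567, 559, 550, 563, 572, 558]

# One-sided test: is the current run slower than the baseline?
statistic, p_value = mannwhitneyu(baseline, current, alternative="less")

ALPHA = 0.05  # illustrative significance threshold
if p_value < ALPHA:
    print(f"Likely regression (p = {p_value:.4f})")
else:
    print(f"No significant difference (p = {p_value:.4f})")
```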
We certainly don't want to be in the way. Feel free to delete the VMs. I was hoping to double check there's nothing to salvage in the local mounts but usually there shouldn't be anyway.
As an update, I created the account and luckily we were still in time for this round of submissions (CLDR 46). It's always a good time to ask me for a CLDR account! Six months tend to fly by.
Maybe it could be retrieved from a very early dump or by some other means.
@Hydriz Can I upgrade the VMs to Debian 11 one of these weekends? The only reason against it that I can think of is that some scripts may require Python 2, but that's still available in Debian 11.
@HShaikh Please don't propagate myths. https://aeon.co/essays/the-tragedy-of-the-commons-is-a-false-and-dangerous-myth
I'm closing this task as unclear and not pertaining to MediaWiki core, mostly because it mixes different user groups and permissions some of which are Wikimedia-specific.
This reminds me a bit of the primary sources tool (https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool), which I believe focused on identifying easy concepts like numbers. I've not used it in years.
https://www.mediawiki.org/wiki/Special:RecentChanges?useskin=vector&uselang=ksh after disabling JavaScript:
@Mazevedo Here's an example old ticket which may or may not be relevant any more. :)
Do you want to focus on the exonyms in languages which are supported by MediaWiki core (or at least translatewiki.net) but not in CLDR?
That was with all namespaces.
Current status
After the latest run
Mostly fixed upstream.
Not clear to me why this doi:10.1038/s41586-023-06291-2 got an arXiv ID but not a PMC ID: https://en.wikipedia.org/w/index.php?title=PubMed&diff=prev&oldid=1195324840
The new round seems to be going fine so far: https://en.wikipedia.org/w/index.php?title=Special:Contributions/OAbot&target=OAbot&dir=prev&offset=20240107000000&limit=50
The non-Unpaywall side continues at T228702.
We're still discarding excess merges from Dissemin, similarly to the 2019 logic: https://github.com/dissemin/oabot/commit/e3c74bff735c1ef16ee333dde2ac4bdd20949635 . We're not currently using the Dissemin title matches, but if we did, checking for a title, author and year match would not be enough: https://en.wikipedia.org/w/index.php?title=User_talk%3AOAbot&diff=1194216712&oldid=1193993325 .
There are about 6.5 million PMC matches and only some 640k matches by title and author, of which about 62k appear without a PMCID match, so perhaps we can just ignore those europepmc matches:
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep -c "oa repository (via pmcid lookup)"
6499014
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep -c "oa repository (via OAI-PMH title and first author match)"
637491
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep "oa repository (via OAI-PMH title and first author match)" | grep -vc "oa repository (via pmcid lookup)"
62310
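The same kind of count can also be done per record rather than per line, which is more robust when a record has several OA locations; this sketch only assumes the snapshot fields visible above and in the jq filter further down (is_oa, oa_locations, evidence).

```python
# Sketch: count Unpaywall OA locations per evidence type, reading the snapshot
# record by record instead of grepping lines (a record can hold several
# oa_locations, so line-level grep can over- or under-count).
import bz2
import json
from collections import Counter

SNAPSHOT = "unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2"

evidence_counts = Counter()
title_match_without_pmcid = 0

with bz2.open(SNAPSHOT, "rt") as snapshot:
    for line in snapshot:
        record = json.loads(line)
        if not record.get("is_oa"):
            continue
        evidences = [loc.get("evidence", "") for loc in record.get("oa_locations") or []]
        evidence_counts.update(evidences)
        if ("oa repository (via OAI-PMH title and first author match)" in evidences
                and "oa repository (via pmcid lookup)" not in evidences):
            title_match_without_pmcid += 1

print(evidence_counts.most_common(10))
print("title/author matches without a PMCID match:", title_match_without_pmcid)
```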
Both papers on Unpaywall have the evidence "oa repository (via OAI-PMH title and first author match)", although the PMC side exposes a link to the correct DOI. The CrossRef API has the page range (such as "113-128" or "283-288"), so it may be possible to check the number of pages.
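A sketch of that page-count idea against the CrossRef REST API; the "page" field with ranges like "113-128" is what the API exposes for journal articles, while the helper itself is only illustrative and not part of OAbot.

```python
# Sketch: fetch the page range from the CrossRef REST API and derive a page
# count, which could then be compared against the repository record before
# trusting a title/author match.
import requests


def crossref_page_count(doi: str) -> int | None:
    """Return the number of pages for a DOI, if CrossRef has a numeric page range."""
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    response.raise_for_status()
    pages = response.json()["message"].get("page")  # e.g. "113-128"
    if not pages or "-" not in pages:
        return None
    try:
        first, last = (int(p) for p in pages.split("-", 1))
    except ValueError:
        return None  # non-numeric pages such as article IDs
    return last - first + 1


if __name__ == "__main__":
    # DOI mentioned above, used here only as an example call.
    print(crossref_page_count("10.1007/BF02124750"))
```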
So we won't suggest edits like this either https://en.wikipedia.org/w/index.php?title=Saccharomyceta&curid=68064105&diff=1194087545&oldid=1182890284 as we don't get non-repository URLs from other sources.
A sample of the kind of URLs we're talking about:
Only 35k or so of these are in the best_oa_location (sometimes even when a separate match for arxiv exists, like doi:10.1002/rsa.20071 / oai:CiteSeerX.psu:10.1.1.237.8456 / oai:arXiv.org:math/0209357 ).
Not sure how to narrow this down; we're talking about some 500k matches from CiteSeerX (out of about 890k):
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep citeseerx | grep "oa repository (via OAI-PMH doi match)" | jq -r 'select(.oa_locations | .[] | .endpoint_id == "CiteSeerX.psu" and .evidence == "oa repository (via OAI-PMH doi match)" )|.doi' | wc -l
505747
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep -c citeseerx
887759
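One possible way to narrow it down, as a sketch assuming only the snapshot fields used in the jq filter above: split the CiteSeerX DOI matches by whether they are the best_oa_location or only a secondary location, since the secondary ones would presumably matter less.

```python
# Sketch: split the CiteSeerX DOI matches by whether they are the
# best_oa_location or only a secondary location, to see how many records
# would actually be affected if we skipped them.
import bz2
import json

SNAPSHOT = "unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2"


def is_citeseerx_doi_match(location: dict) -> bool:
    return (location.get("endpoint_id") == "CiteSeerX.psu"
            and location.get("evidence") == "oa repository (via OAI-PMH doi match)")


best, secondary = 0, 0
with bz2.open(SNAPSHOT, "rt") as snapshot:
    for line in snapshot:
        record = json.loads(line)
        locations = record.get("oa_locations") or []
        if not any(is_citeseerx_doi_match(loc) for loc in locations):
            continue
        if is_citeseerx_doi_match(record.get("best_oa_location") or {}):
            best += 1
        else:
            secondary += 1

print("CiteSeerX DOI match is the best location:", best)
print("CiteSeerX DOI match only as a secondary location:", secondary)
```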
Another example where URL priorities changed: https://en.wikipedia.org/w/index.php?title=Balbinot_1&diff=prev&oldid=1193722831 (but there was no doi-access=free).