This page is an archive. Do not edit the contents of this page. Please direct any additional comments to the current main page.
The pdfs.semanticscholar.org URLs which HTTP 301 redirect to www.semanticscholar.org are actually dead links. There are quite a few now. A link to the Wayback Machine is possible, but I believe the InternetArchiveBot would not normally add it. Nemo 21:15, 28 April 2021 (UTC)
Nemo, looks done, let me know if you see any problems. -- Green C 16:43, 30 April 2021 (UTC)
Results
|url-status=dead in preexisting archive URLs
Wikipedia currently contains citations and source references to the websites TracesOfWar.com and TracesOfWar.nl (EN-NL bilingual), but also to the former websites ww2awards.com, go2war2.nl and oorlogsmusea.nl. However, these websites have been integrated into TracesOfWar in recent years, so the source references are now incorrect on hundreds of pages, and in a multiple of that number of individual references. Fortunately, ww2awards and go2war2 currently still redirect to the correct page on TracesOfWar, but this is no longer the case for oorlogsmusea.nl. I have been able to correct all the sources for oorlogsmusea.nl manually. For ww2awards and go2war2 the redirects will stop in the short term, which will result in thousands of dead links, while they could be pointed at the same source. A short example: the person Llewellyn Chilson (TracesOfWar person id 35010) now has a source reference to http://en.ww2awards.com/person/35010, but this should be https://www.tracesofwar.com/persons/35010/. In short, old URL format to new URL format, same ID.
In my opinion, that should make it possible to convert everything of the form http://en.ww2awards.com/person/[id] (old English) or http://nl.ww2awards.com/person/[id] (old Dutch) to https://www.tracesofwar.com/persons/[id] (new English) or https://www.tracesofwar.nl/persons/[id] (new Dutch) respectively. The same applies to go2war2.nl, but with a slightly different format: http://www.go2war2.nl/artikel/[id] becomes https://www.tracesofwar.nl/articles/[id]. The same has already been done on the Dutch Wikipedia, via a similar bot request. Lennard87 ( talk) 18:50, 29 April 2021 (UTC)
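A minimal sketch of the mapping described above, written as sed rewrite rules. This is only an illustration of the URL patterns given in the request, not the actual bot's code, and it does not cover extra variants such as the /award/ → /awards/ paths mentioned below:

# hypothetical rewrite rules for the ww2awards/go2war2 → TracesOfWar migration
rewrite_url() {
  echo "$1" | sed -E \
    -e 's#^https?://en\.ww2awards\.com/person/([0-9]+).*#https://www.tracesofwar.com/persons/\1/#' \
    -e 's#^https?://nl\.ww2awards\.com/person/([0-9]+).*#https://www.tracesofwar.nl/persons/\1/#' \
    -e 's#^https?://www\.go2war2\.nl/artikel/([0-9]+).*#https://www.tracesofwar.nl/articles/\1#'
}

rewrite_url "http://en.ww2awards.com/person/35010"   # -> https://www.tracesofwar.com/persons/35010/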
@ Lennard87:, seeing around 500 mainspace URLs on enwiki for all domains combined. Can you verify I'm not missing any? -- Green C 22:18, 1 May 2021 (UTC)
@ Lennard87: results for ww2awards: it moved 251 URLs. Five examples show different types of problems: [1] [2] [3] [4] [5] .. the variations on "WW2 Awards" and its location in the cite are difficult. (BTW, instead of /person/ some have /award/, which at the new site is /awards/; Example) -- Green C 18:43, 2 May 2021 (UTC)
Results for go2war2 are similar: it moved 48 URLs: [6] [7] -- Green C 19:26, 2 May 2021 (UTC)
Ancient History Encyclopedia has rebranded to World History Encyclopedia and moved domain to worldhistory.org. There are many references to the site across Wikipedia. All references pointing to ancient.eu should instead point to worldhistory.org. Otherwise the URL structure is the same (ie. https://www.ancient.eu/Rome/ is now https://www.worldhistory.org/Rome/). — Preceding unsigned comment added by Thamis ( talk • contribs)
@ Thamis:, this url works but this url does not. The etc.ancient.eu sub-domain did not transfer, but still works at the old site. For these it will skip as the link still works and I don't want to add an archive URL to live links if it will be transferred in the future to worldhistory.org. Can be revisited later. -- Green C 16:03, 23 April 2021 (UTC)
@ Thamis: it is done. In addition to the URLs it also changed/added |work= etc. to World History Encyclopedia. It got about 90%, but the string "Ancient History Encyclopedia" still exists in 89 pages/cites; they will require manual work to convert (the URLs are converted, only the string is not). They are mostly free-form cites with unusual formatting and would benefit from manual cleanup, probably ideally conversion to {{cite encyclopedia}}. -- Green C 01:07, 24 April 2021 (UTC)
Results
@ GreenC: Thanks a lot for sorting this out! Greatly appreciated. :-) — Preceding unsigned comment added by Thamis ( talk • contribs)
Hello, I think all links to oxfordjournals.org subdomains in the url parameter of {{ cite journal}} should be removed, as long as there's at least a doi, pmid, pmc, or hdl parameter set. Those links are all broken, because they redirect to an HTTPS version which uses a certificate valid only for silverchair.com (example: http://jah.oxfordjournals.org/content/99/1/24.full.pdf ).
The DOI redirects to the real target URL, which nowadays is somewhere in academic.oup.com, so there's no point in keeping or adding archived URLs or url-status parameters. These URLs have been broken for years already, so it's likely they will never be fixed. Nemo 07:13, 25 April 2021 (UTC)
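A rough sketch of the per-citation logic this implies, assuming the wikitext of a single {{cite journal}} is held in a shell variable. The citation text, DOI value, and sed-based approach below are illustrative only, not the actual bot's code:

cite='{{cite journal |title=Example article |journal=The Journal of American History |url=http://jah.oxfordjournals.org/content/99/1/24.full.pdf |doi=10.1093/jahist/example |access-date=1 May 2021}}'

# strip the broken oxfordjournals.org |url= (and its |access-date=) only when the
# citation also carries a persistent identifier (doi, pmid, pmc or hdl)
if echo "$cite" | grep -q 'oxfordjournals\.org' && \
   echo "$cite" | grep -Eq '[|][[:space:]]*(doi|pmid|pmc|hdl)[[:space:]]*='; then
  cite=$(echo "$cite" | sed -E 's/[|][[:space:]]*(url|access-date)[[:space:]]*=[[:space:]]*[^|}]*//g')
fi
echo "$cite"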
@ Nemo bis: edited 20 articles: 1 2 3 4 5 - I forgot to remove |access-date= in a few cases. Do you see any other problems? -- Green C 00:50, 6 May 2021 (UTC)
@ GreenC and Nemo bis: Just saw an edit about this, but the links seem to work fine now? Thanks. Mike Peel ( talk) 18:39, 7 May 2021 (UTC)
@ Nemo bis: The first pass is done, with some problems. There are cases of non-{{cite journal}} templates that contain DOIs etc. ( Example); the bot was programmed for journal + aliases only. And I missed {{vcite journal}} [9]. There are cases of {{doi}} it is not set up to detect [10]. There were 1,750 archive URLs added, so these problems would be in that group, though most of them are fine. -- Green C 18:45, 7 May 2021 (UTC)
[http://mollus.oxfordjournals.org/content/77/3/273.full Bouchet P., Kantor Yu.I., Sysoev A. & Puillandre N. (2011) A new operational classification of the Conoidea. Journal of Molluscan Studies 77: 273–308.] → {{cite journal|first1=P.|last1=Bouchet|first2=Y. I.|last2=Kantor|first3=A.|last3=Sysoev|first4=N.|last4=Puillandre|title=A new operational classification of the Conoidea (Gastropoda)|journal=Journal of Molluscan Studies|date=1 August 2011|pages=273–308|volume=77|issue=3|doi=10.1093/mollus/eyr017|url=https://archimer.ifremer.fr/doc/00144/25544/23686.pdf}}
# count the most frequently linked oxfordjournals article paths (/volume/issue/page) in the Quarry query output
$ curl -s https://quarry.wmflabs.org/run/550945/output/1/tsv | grep -Eo "(/[0-9]+){3}" | sort | uniq -c | sort -nr | head
     69 /21/7/1361
     60 /22/10/1964
     51 /77/3/273
     29 /24/6/1300
     28 /19/7/1008
     25 /24/20/2339
     21 /55/6/912
     17 /22/2/189
     16 /19/1/2
     15 /11/3/257
Have you tried contacting the journal about these issues? Since the links *do* work (possibly unless you apply extra restrictions), I don't think these removals should be happening without asking the wider community first. Thanks. Mike Peel ( talk) 08:28, 8 May 2021 (UTC)
Related to #Fix pdfs.semanticscholar.org links, or rather the work that followed it at phabricator:T281631, there are a few hundred {{dead link}} notices which can be removed (together with the associated URL) because the DOI or HDL can be expected to provide the canonical permanent link. See a simple search at:
This is not nearly as urgent as the OUP issue above, and if it's complicated I may also do it manually, but it seems big enough to benefit from a bot run at some point. Nemo 16:26, 5 May 2021 (UTC)
If the citation has |doi-access=free or |hdl-access=free and has a {{dead link}} attached, remove the {{dead link}} (plus {{cbignore}}) and the |url=. -- Green C 20:11, 5 May 2021 (UTC)
|pmid=? -- Green C 18:03, 6 May 2021 (UTC)
Hello. As SR/Olympics has been shut down, several SR/Olympics templates are broken. They are Template:SR/Olympics country at games (250 usages), Template:SR/Olympics sport at games and Template:SR/Olympics sport at games/url (both 63 usages). See for example Algeria at the 2012 Summer Olympics and Football at the 2012 Summer Olympics. I'm not sure if InternetArchiveBot can work with these templates. I was wondering how these links could be fixed with archived URLs like at Template:Sports reference. Thanks! -- MrLinkinPark333 ( talk) 19:35, 10 May 2021 (UTC)
There is an |archive= argument, so it's just a matter of updating each instance with a 14-digit timestamp, e.g. |archive=20161204010101. The last one is used by the second one, which is why it has the same count; nothing to do there. For the first two, I guess it would require some custom code to find a working timestamp and add it. This is why I dislike custom templates: they don't work with standard tools, and each instance is a custom programming job. I'll see what I can do. -- Green C 20:04, 10 May 2021 (UTC)
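One possible way to find a working timestamp for a given link is the Wayback Machine availability API, sketched below. The sports-reference URL is illustrative only; the real paths would come from the template parameters, and the actual custom code mentioned above is an assumption here:

# look up the closest Wayback Machine snapshot for a (now dead) sports-reference URL
url="https://www.sports-reference.com/olympics/countries/ALG/summer/2012/"   # illustrative URL only
curl -s -G "https://archive.org/wayback/available" --data-urlencode "url=${url}" |
  grep -Eo '"timestamp": ?"[0-9]{14}"'
# the 14-digit timestamp in the response is what would go into |archive=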
The new Reuters website redirected all subdomains to www.reuters.com and broke all links. That's about 50k articles on the English Wikipedia alone, I believe. I see that the domain is whitelisted on InternetArchiveBot, not sure whether that's intended. Nemo 20:13, 1 May 2021 (UTC)
Wow that's major. Domains can become auto-whitelisted if the bot is receiving confusing messages by way of user reverts (of the bot). Looks like some subdomains still work [12]. Or correctly return 404 and would be picked up by IABot - except for the whitelist [13]. Or are soft-404'ing [14]. How to determine a soft 404 is an art; in this case it's easy enough, it redirects to a page with the title "Homepage", but there are probably other unknown landing locations. WaybackMedic should be able to do this, it has good code for following redirects, checking headers and verifying (known) soft 404s. Will not be able to start for at least a week, to catch up on other things. Then it will take a while due to the size. -- Green C 21:59, 1 May 2021 (UTC)
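For illustration, a bash sketch of the kind of soft-404 heuristic described above. The "Homepage" title check is taken from this Reuters case; the function name and example URL are made up, and real link checking would also need to look at HTTP status codes and redirect targets, as WaybackMedic does:

# follow redirects and flag a likely soft 404 if the final page title is "Homepage"
check_reuters_url() {
  local url="$1"
  local title
  title=$(curl -sL --max-redirs 10 "$url" | grep -oiE '<title[^>]*>[^<]*</title>' | head -1)
  case "$title" in
    *Homepage*) echo "soft404: $url" ;;
    "")         echo "dead or empty: $url" ;;
    *)          echo "alive: $url" ;;
  esac
}

check_reuters_url "http://uk.reuters.com/article/example"   # example subdomain URL, for illustration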
Reuters is complete. Typical entry in the IABot database: [15], i.e. GreenC bot detected the URL is dead and set it to blacklisted. Different case: [16] - set Blacklisted, add an archive URL when available. Third type: removes the archive URL if it is not working and there is no replacement. In total it checked 165k unique URLs and edited about 69k; about 42% are now blacklisted, the rest still work ( Example). Next step would be to run IABot on all the pages with a reuters URL (any hostname, and .com and .co.uk) on any wiki language sites supported; or, IABot will find them in time. -- Green C 01:52, 21 May 2021 (UTC)
I frequently find "expired" links to amazonaws on Wikipedia: is it possible to automatically repair them? Jarble ( talk) 18:31, 21 May 2021 (UTC)
Expires=1612051241 appears to be a Unix date which can be converted here to January 2021. And sure enough the archive post-dates it (from April) and doesn't work, returning a 4xx code. So it would need a {{dead link}}. In fact even if it had not expired yet, it probably will soon enough, so it should be treated as "dead" and an archive found and added ASAP. We currently have no mechanism for automating search-and-archive processes on certain URLs on a recurring basis. It wouldn't be difficult to process these 370, but going forward as new ones are added, I'll need to think about it. -- Green C 19:02, 21 May 2021 (UTC)
Ok, I created a bot that will search for the URLs daily, and if it finds a URL not seen before, issue a Save Page Now at Wayback. This should ensure any new additions are immediately archived, assuming Wayback is even capable of archiving them. The bot has public logs at https://tools-static.wmflabs.org/botwikiawk/awsexp/ and the source is there also. Once the URLs are archived, it's just a matter of IABot or some other process adding the archives into Wiki at our leisure, regardless of whether the expiration date has passed. -- Green C 20:26, 21 May 2021 (UTC)
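A minimal sketch of that save-on-sight loop, assuming a plain Save Page Now request is enough. The search step and bookkeeping of the actual awsexp bot are not shown; urls.txt (the day's search results) and seen.txt (URLs already handled) are placeholders:

# for each amazonaws URL found by the daily search, trigger a Wayback Machine
# Save Page Now request if we haven't seen it before
touch seen.txt
while read -r url; do
  grep -qxF "$url" seen.txt && continue                       # already archived on a previous run
  curl -s -o /dev/null "https://web.archive.org/save/$url"    # Save Page Now
  echo "$url" >> seen.txt
done < urls.txt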
An AWS link was added 12:31 pm, May 17, 2021 with an &Expires= timestamp of 13:30:53 pm or about 1 hour later. Holy cow that is a short expiration period. No wonder these links are broken and have no archive URLs available. The bot would need to run every 50 minutes or so, assuming there are none with even shorter expiration. -- Green C 21:17, 21 May 2021 (UTC)
Jarble, I've processed the links in 584 articles and converted most of them to a {{dead link}}, or added a working archive.org URL. It was difficult because my bot is not designed to find URLs by keywords in the path (&Expires=) vs. a specific domain name, which could be anything in this case. As such, there were 31 cases it could not determine, and rather than debug and test why, it will be faster to fix them manually. I have too much other work to do, and so am moving on. In case you want to fix them, they are in the following articles: -- Green C 19:57, 24 May 2021 (UTC)
The article about Estradiol (a hormone) contains a non-working external link in the references. Reference number 71, the last external link (a PDF), cites the values from this source: "Establishment of detailed reference values for luteinizing hormone, follicle stimulating hormone, estradiol, and progesterone during different phases of the menstrual cycle on the Abbott ARCHITECT analyzer".
This external link redirects to a 404 server error and needs to be replaced with a working link. The original research document is available on the laboratory's website.
How do I change this link? I don't know how to use a bot. I'm thankful for any help. — Preceding unsigned comment added by Jerome.lab ( talk • contribs) 13:11, 30 April 2021 (UTC)
Fixed. You can find archive URLs at archive.org and replace them in the article when the link is dead. -- Green C 00:53, 25 May 2021 (UTC)