From Wikipedia, the free encyclopedia

Copyright

The article contains false information: "archive.today removes archived pages in response to DMCA takedown requests from copyright holders." As a webmaster, I've had my site scraped against my will and sent a properly formatted DMCA notice to both the site and its ISP. It is a scraping site, masquerading as an archiving service. 80.62.117.71 (talk) 13:36, 26 June 2014 (UTC)

Well it's hosted by a Russian company, Hostkey abuse@hostkey.nl and abuse@hostkey.ru -- after a year of complaining to Cloudflare they told me that. The owner of the site, Denis Petrov, many times does the scraping himself. He will scrape an entire website in a matter of hours. He runs a botnet with endless IPs constantly changing their headers and Cloudflare cannot block their scraper. I asked him extremely politely to remove my site a bit ago and he scraped the entire site in hours. He was doing 100 pages a second all from a botnet with many IPs. While I don't agree with every site on the spam blacklist of Wikipedia, I agree with this site on it, though he has mirror domains such as .fo and .ec that should be added. I don't edit here very much and mostly by IP so I'm not sure where to suggest those. Plimitarmed ( talk) 05:28, 8 November 2016 (UTC) reply
I noticed in the history of the article, the IP address 2a01:4f8:b10:5003:a:8:1:2. When Archive.is scrapes my website I often find that IP address among them. It belongs to that website. When I complained to Hostkey, they say I need to ship documents to them via airmail. Plimitarmed ( talk) 08:02, 8 November 2016 (UTC) reply
An IP came along, 113.180.75.25, and reduced my references to one. That IP is one I had seen from the archive.is scraper on my logs. Then a second IP came and removed the whole section. 80.221.159.67 had a problem with the article giving the current host of where Archive.is was hosted. Plimitarmed ( talk) 16:20, 9 November 2016 (UTC) reply
Agree. Repeated copyright violator. Complaints routinely ignored. It is not an archive. It is an unauthorised partial duplicator of content. There are numerous clones. It might be possible to pursue each site, if an ISP (preferably not Russian) can be identified.
As for the 'Functionality' comment
"If a page has already been archived, archive.is asks the user to confirm archiving a new revision, instead of immediately archiving it."
That is incorrect. The "archive" owner has never contacted me. Nor has anyone else proposing my site content to the owner.

See also

 Ark25  ( talk) 19:06, 26 July 2013 (UTC) reply

How does automatic archiving work?

Discussion of features/bugs, not the article

If you look at ro:Biserica de lemn din Hilișeu-Crișan, there is a dead link:

Archive.is knows this link: http://archive.is/http://www.ziarullumina.ro/articole;1418;1;3759;0;Schit-de-maici-cu-o-biserica-unicat.html Strangely, it has only a "newest shot" (6 Jul 2013 03:25) which is an error page and that makes me wonder if it ever had older "shots" and then maybe it deletes the older shots? That would be bad..

The link has been there since 29 November 2010, so I guess it was archived on Archive.is before 6 Jul 2013 (almost all external links of Wikipedia (all Wikipedias, not only English) were archived in May 2012, says the Archive.is owner here: Wikipedia talk:Link rot#Archive.is).

It's a very very good idea to archive automatically all the external links of Wikipedia, but then it's very bad to delete them and to replace with newer shots, which will eventually end up in showing "dead link".

It very much looks like Archive.is keeps only the newest shots when it archives pages automatically. —  Ark25  ( talk) 23:48, 26 July 2013 (UTC) reply

Hi. The page was not archived before 6 Jul 2013. It has the short URL http://archive.is/HSzOE. This means it has the sequential ID 47136582 and can also be accessed as http://archive.is/id/47136582 (this is not a public URL but something like a debugging tool; do not use it for linking). If you have a look at the snapshots with IDs around 47136582 (for example http://archive.is/id/47136581 or http://archive.is/id/47136583) you will see that all of them were made on 6 Jul 2013.
Some snapshots are re-archived and overwritten. These are snapshots from URLs like http://www.google.com/sorry/indexredirect?continue=http://another.url/. Re-archiving would help when the server responds with a 500 error or a captcha.
Realtime tracking of the recent changes in all national Wikipedias is a relatively new feature (it has been fully in operation since May-June 2013), so it is no wonder that some links which had been in Wikipedia for years were archived for the first time only in 2013. Rotlink (talk) 03:16, 18 August 2013 (UTC)

Useful feature

Archive.is can archive pages in the Google search cache. Once the content is archived, archive.is attributes it to the original website URL and not to Google's cache URL. This feature is useful when a site goes offline, that fact is noticed within a few days, the page isn't already archived in the Internet Archive, WebCite or elsewhere, and the only remaining copy of the page appears to be in the Google search cache. - 81.157.199.46 ( talk) 20:50, 29 July 2013 (UTC) reply

If this is a fact intended to go into the article, then an independent reliable source, or at the very least primary, published online documentation will be required to support it in an inline citation. This will necessitate more hard HTML documentation or help pages at archive.is. Statements by non-published (non-notable) or anonymous authors in blogs/wikis/forums cannot meet WP:RS requirements. -- Lexein ( talk) 07:47, 3 October 2013 (UTC) reply

wiki.dandascalescu.com

In the comments made in the AfD discussion I don't see a consensus for removing this citation. — rybec 14:51, 21 September 2013 (UTC) reply

It's a wiki; that's a gnarly WP:RS problem, even if Dan Dascalescu is an established or published expert in the field or academia, cited by others. The blog might be assessed as RS if we can establish Dan's bona fides.-- Lexein ( talk) 07:51, 3 October 2013 (UTC) reply

Robot Exclusion Standard

The article seems to be getting mixed up regarding the Robot Exclusion Standard, and the fact that Archive.is does not honor the standard, and what this means. The purpose of my recent edits was to clarify that this standard is used by the main archives (like WayBack and WebCite) to avoid infringing on copyrights, whereas Archive.is does not honor this standard, so there is a large amount of material re-hosted on Archive.is that is in violation of copyright law, specifically, the Digital Millennium Copyright Act (DMCA).

Some other editors deleted the link I provided to the Robot Exclusion Standard (saying it is a "dead link", although I have no trouble accessing it), and then inserted the statement: "... however, the protocol is used against malware robots in general, which routinely scan the web for security vulnerabilities and email-address harvesters used by spammers. Archive.is does not obey the robot exclusion standard designed against spammers." I frankly don't understand these words. The Robot Exclusion Standard doesn't provide any protection against malware robots, nor against spammers. It is a voluntary standard that is used by responsible organizations to work together to avoid unintended interactions, among which are copyright violations (which of course are NOT discretionary).

So, I propose to trim the words about malware and spam, and just go back to the relevant and well-sourced statements about how archives use robot exclusion to avoid copyright infringement, and the well-sourced and undisputed fact that Archive.is does not honor this standard. I'll also add the requested citation for the DMCA. Weakestletter ( talk) 21:57, 22 September 2013 (UTC) reply

By the way, as I was editing the article, I noticed that the words about malware and spam actually make no sense at all, because they say "the protocol is used against malware", and yet the cited reference says just the opposite: "...malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention. " These words signify that the protocol is NOT useful against malware or spammers. So those edits to the article are clearly not right. I've deleted them. Weakestletter ( talk) 22:04, 22 September 2013 (UTC) reply
Yeah, that looks like a basic misunderstanding of what robots.txt actually does -- i.e. nothing by itself -- it is the robots that choose to act or not act on it. —   HELLKNOWZ  ▎ TALK 22:58, 22 September 2013 (UTC) reply
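To illustrate the point that robots.txt does nothing by itself, here is a minimal Python sketch of the check a cooperating robot performs before fetching a page. The robots.txt content, the bot name "ExampleArchiveBot" and the example.com URLs are made up for illustration; nothing here reflects how Archive.is or any other archive is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A well-behaved robot asks first and skips disallowed pages.
    for url in ("http://example.com/private/page.html",
                "http://example.com/public/page.html"):
        print(url, may_fetch("ExampleArchiveBot", url))
```

The file only expresses a request. A robot that never calls a check like may_fetch (or ignores its answer) is not stopped by anything in the protocol, which is why it offers no protection against malware or spam robots.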
The article seems to imply that pages are retrieved when someone requests that they be archived, and that archive.is doesn't retrieve pages en masse but singly when someone requests a page. If it's not crawling sites, it's debatable whether it's a bot.
If someone were to spider and republish the contents of a Web site, the mere absence of a robots.txt would make a poor defence against claims of copyright infringement. The Wikipedia article doesn't mention the word "copyright"--the protocol is not a way to give or withhold permission to republish.
The DMCA is a US law. The .is top-level domain suggests that archive.is may be in Iceland, which is not (yet) part of the United States. The DMCA makes exceptions for libraries and archives, so if archive.is does happen to be under US jurisdiction, it might be able to claim it is an archive, and republish without violating copyright. — rybec 01:15, 23 September 2013 (UTC) reply
It's certainly true that there's a difference between not honoring robot exclusion files and not honoring copyright. In theory, an archive service could try to contact each individual website owner and request permission to re-host their material. Strictly speaking, the copyright laws probably require this... but that would make web archiving almost impossible in practice. So, working in good faith, the responsible archivers (e.g., WayBack, WebCite) honor robot exclusion files. They advertise that if any copyright holder doesn't want their material archived, just place a robot exclusion file in the root directory, and they will respect your copyright and not re-host your material. If you don't, then the archives will take the absence of a robot file for tacit permission to archive your site. This isn't a perfect arrangement, and it's rather heavily biased in favor of the archives, but it's workable.
Now, an archiving service like Archive.is comes along, and says they do not honor robot exclusion files, and they will re-host any material they want to, even if there is a robot file saying the author does not give permission. That's a clear violation of copyright, and moreover, it is clear that Archive.is is not even attempting to honor copyright.
By the way, the DMCA does NOT allow archives to re-host copyrighted material. (Brick and mortar libraries obviously are allowed to loan out purchased paper copies of copyrighted material, but that is completely different, and even they are under strict prohibition against creating any new copies, beyond the physical copies they purchased.) In fact, the whole purpose of the DMCA take-down agreement is so that large sites can avoid prosecution for copyright violation IF they agree to promptly take down and de-link (and even remove from search results) any site that is re-hosting copyrighted material without permission.
One more comment - The fact that Archive.is is hosted outside the United States is not particularly relevant, because they are lobbying to be used as citations for Wikipedia articles (for example), and Wikipedia does business in the US, and strives to avoid violating US copyright laws. So the whole ostensible purpose of Archive.is is undermined by its failure (so far) to implement some effective means of honoring copyright law. Until it does so, I don't think it can be adopted by any reputable web site. Weakestletter ( talk) 02:32, 24 September 2013 (UTC) reply
I see that the DMCA exemption for libraries and archives (look for section 404 at [1]) only allows the copied material to be used on the premises, not over the Internet:

any such copy or phonorecord that is reproduced in digital format is not otherwise distributed in that format and is not made available to the public in that format outside the premises of the library or archives.

Is there a reliable source that says archive.is is "lobbying to be used as citations for Wikipedia articles"? I took out the sentence about Wikipedia because it seemed to belong on a Wikipedia project page, not in a regular article. — rybec 03:44, 24 September 2013 (UTC) reply
That is its entire function. See the Wikipedia project page, where this is being promoted:
http://en.wikipedia.org/wiki/Wikipedia:Using_Archive.is
In particular, note that "Archive.is monitors RecentChanges of many wiki projects (including all national wikipedias) in order to authomaticaly archive new links as soon as possible after the editors added them to the articles." You see? Archive.is is designed specifically to re-host Wikipedia links, for the purpose of having stable references. Its whole reason for existence is to be used to link from Wikipedia articles, rather than linking to the original web sites which, of course, are under the control of those unreliable copyright holders. What the founders of Archive.is seem to have overlooked is that Wikipedia needs to scrupulously adhere to the copyright laws of the countries where it operates, including the US, and this prohibits them from linking to unauthorized re-hosted copies of copyrighted material. Weakestletter (talk) 14:36, 24 September 2013 (UTC)
You are right, there must be " User:RotlinkBot monitors ..." not "Archive.is monitors ...". It is known from User:RotlinkBot comments on Wikipedia talk pages, not from Archive.is FAQ or Twitter. 88.15.83.61 ( talk) 16:13, 24 September 2013 (UTC) reply
It is also not clear if User:RotlinkBot is still doing it after he/she/it was banned on Wikipedia. 88.15.83.61 ( talk) 16:31, 24 September 2013 (UTC) reply
As of April 17, 2019, archive.is has only 2901 URLs belonging to en.wikipedia. It is self-evident that the overwhelming majority have been moderated, censored and manually deleted over time. — Preceding unsigned comment added by 94.38.235.96 (talk) 16:28, 17 April 2019 (UTC)
Nah, a whole lot more than 2k. [2] If anyone was doing wholesale deletions we'd know about it. This thread is very old; many new developments have occurred since 2013. -- Green C 17:51, 17 April 2019 (UTC)

filter misbehaviour

The filter which prevents adding links to pages on the archive.is website also prevents adding links to the Archive.is article, i.e. links such as [[Archive.is]]. — Preceding unsigned comment added by 188.162.64.30 (talk • contribs) 19:50, 8 July 2015 (UTC)

Not a concern anymore, for I think Wikipedia stopped blacklisting archive.is links a year or more ago. However, what is a concern is this statement from this article:
"Web pages cannot be duplicated from archive.is to web.archive.org as second-level backup, as archive.is places an exclusion for Wayback Machine[why?][9]"
Generally, and from my experience, I think this is true. On the contrary, these archive.is links were archived without a problem into web.archive.org:
It is neat that the Wayback Machine can archive some or all archive.is no-JavaScript/frozen-JavaScript pages. -- User123o987name ( talk) 00:01, 13 December 2020 (UTC) reply

Country blocking

Information about the country blocking is self-evidently available on the web.

You can easily google for currently active proxies in the countries in question and then run something like "Chrome.exe --proxy-server=socks5://37.27.205.217:35101 http://archive.is"
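For anyone who wants to repeat that check against several proxies, here is a small Python sketch that builds the same Chrome command line. Chrome's --proxy-server flag is real; the function name chrome_proxy_command is invented for this example, and the proxy address is simply the one quoted above, which may no longer be active.

```python
import subprocess


def chrome_proxy_command(proxy_host: str, proxy_port: int, url: str,
                         chrome: str = "Chrome.exe") -> list[str]:
    """Build the argument list for launching Chrome through a SOCKS5 proxy."""
    return [chrome, f"--proxy-server=socks5://{proxy_host}:{proxy_port}", url]


if __name__ == "__main__":
    cmd = chrome_proxy_command("37.27.205.217", 35101, "http://archive.is")
    print(" ".join(cmd))
    # Uncomment to actually launch the browser (assumes Chrome.exe is on PATH):
    # subprocess.run(cmd, check=False)
```

Note that a page failing to load through one proxy is weak evidence by itself, because public proxies are often unreliable; trying several proxies in the same country gives a better picture.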

Removal of archived content

The article needs to be updated, as there is a "report" button where it's possible to report archived content to be taken down for a wide variety of reasons. nyuszika7h ( talk) 09:24, 16 April 2016 (UTC) reply

The person/s running archive.is don't necessarily remove archived content; none of my content has ever been removed despite my asking them to. — Preceding unsigned comment added by Nothappy21010 (talk • contribs) 17:06, 30 November 2016 (UTC)
Agree. My personal non-commercial site has been repeatedly copied, in part, over several years. Repeated direct complaint (by email or by the "Report" button) elicits no reply nor action.

Hatnote

What's better here - "Not to be confused with Internet Archive." or some variant of "For the San Francisco-based nonprofit website at archive.org, see Internet Archive."? User:94.230.146.228 is concerned that by being specific we're implying that the two websites are connected, but I think it's more misleading to say "not to be confused with Internet Archive" because that can be easily read as "not to be confused with archiving on the internet in general" - a reader actually looking for archive.org (without knowing its URL) might not think to click that and assume that they're already at the right article. -- McGeddon ( talk) 10:02, 11 May 2016 (UTC) reply

Perhaps link to Archive.org in the hatnote, which is a redirect to Internet Archive? nyuszika7h ( talk) 20:18, 11 May 2016 (UTC) reply
Saying "For the San Francisco-based nonprofit website at archive.org, see Internet Archive." has a false connotation of "archive.is is sort of archive.org but for-profit" or even "there is a single company with non-profit and for-profit products; the former is archive.org and the latter is archive.is"; Anyway "...for non-profit see ..." implies that the following text is about something which is not non-profit albeit archive.is is non-profit as well 94.230.146.228 ( talk) 16:37, 12 May 2016 (UTC) reply

"Not to be confused with archive.org." sounds good. Any objections? -- McGeddon ( talk) 18:06, 14 May 2016 (UTC) reply

There is an RfC at Wikipedia:Archive.is RFC 4 with the proposal "Remove archive.is from the Spam blacklist and permit adding new links (Oppose/Support)". Cunard ( talk) 06:20, 23 May 2016 (UTC) reply

'Worldwide availability' section

Does the section contain valuable information? Virtually every website is geo-banned somewhere. JustPaste.it is blocked in almost the same set of countries, Facebook is banned in China, etc.

If this information is valuable, I would suggest creating a table or a List of geo-blocked websites. 59.11.121.66 (talk) 03:40, 28 May 2016 (UTC)

I think it's encyclopedic information, notably because the host of the website has blocked connections from certain countries and given reasons for that. A comprehensive table is not needed, but blocks that have been covered by the media should be recounted here. –  Finnusertop ( talkcontribs) 14:05, 8 June 2016 (UTC) reply

Use cases

I agree with PeterTheFourth actually, I was going to bring that up. It makes archive.is sound like some service which is only used by "authors and hacktivists", when it can be used by anyone, really. nyuszika7h ( talk) 11:35, 22 June 2016 (UTC) reply

RfC: long or short URL

RfC open if we should use long or short URLs when linking to archive.is -- Green C 23:17, 5 July 2016 (UTC) reply

In hindsight

Given hindsight, blacklisting Archive.is and bot-spamming all Wikipedia articles that used to use it did a lot of damage to our portal that cannot be reversed easily. Here's just one of the countless examples of Wikipedia articles referenced to no longer active websites, which were archived by Archive.is and never archived by the Wayback Machine: en.wikipedia.org » DRB Class 52, General Government, Kriegslokomotive » snapshots from host old.pkp.pl including » http://archive.is/bWNJt → Please take a look at my attempts at trying to reverse the damage in just one Wikipedia article: the Holocaust train. Would be nice to see another bot designed specifically to undo the deletions prompted by the original bot. Poeticbent talk 16:19, 2 August 2016 (UTC) reply

archive.fo

archive.fo seems to be one of its domains, and sometimes archive.is redirects there. This is my original research of course, but food for references. 80.221.159.67 ( talk) 23:31, 30 August 2016 (UTC) reply

How do I meet verifiability to say the site is hosted at Hostkey (hostkey.nl / hostkey.ru)?

Cloudflare told me it is hosted there after a year of complaining about this malicious scraper botnet site. Plimitarmed ( talk) 05:29, 8 November 2016 (UTC) reply

@ Plimitarmed: Hostkey is the primary host (servers in Amsterdam and Moscow), distributed by Cloudflare. [1] [2] [see references]
-- John Navas ( talk) 20:06, 3 December 2016 (UTC) reply

References

  1. ^ "Netcraft - Search Web by Domain". Netcraft Services. Netcraft Ltd. Retrieved 3 December 2016.
  2. ^ "About HOSTKEY". HOSTKEY. HOSTKEY. Retrieved 3 December 2016.

Some links I found citing it's now at Hostkey.ru / Hostkey.nl

Most of these links say Hostkey then Cloudflare, however Cloudflare also told me it's hosted there so I can be sure it's not moved from Hostkey. Plimitarmed ( talk) 07:53, 8 November 2016 (UTC) reply

@ Plimitarmed: Hostkey is the primary host (servers in Amsterdam and Moscow), distributed by Cloudflare. [1] [2] [see references]
-- John Navas ( talk) 20:05, 3 December 2016 (UTC) reply

References

  1. ^ "Netcraft - Search Web by Domain". Netcraft Services. Netcraft Ltd. Retrieved 3 December 2016.
  2. ^ "About HOSTKEY". HOSTKEY. HOSTKEY. Retrieved 3 December 2016.
Since the current references mentioned by Jnavas2 only mention either Cloudflare or the servers locations (but not that Hostkey is the provider for archive.is), I added one of the links suggested by Plimitarmed. Saturnalia0 ( talk) 01:12, 24 January 2017 (UTC) reply

Data centers

It would be interesting if something like the Google Data Centers article could be written about how the site runs. Plimitarmed (talk) 19:50, 11 November 2016 (UTC)

@ Plimitarmed: Distribution is handled by Cloudflare. -- John Navas ( talk) 20:03, 3 December 2016 (UTC) reply

Indeed. Who owns it? Who runs it? And who owns them? Millions of people use their service without even asking who and what they are. This article doesn't even begin to be Wikipedian. — Preceding unsigned comment added by 86.239.242.222 ( talkcontribs) 15:44, 5 June 2018 (UTC) reply

Exactly. Who are these people? Incidentally I corresponded with the maintainer way back in the day (not sure I should disclose who he is), but as it says a few sections above, I'm not a RS. -- Dandv 08:02, 3 January 2021 (UTC) reply
@ Dandv:, The people with such questions mostly work for reputation agencies and are paid for purging articles on the Internet. Being rejected at the archives, they try to find their connections to bother them. You have just subscribed to a stream of removal requests 188.162.64.96 ( talk) 03:32, 6 January 2021 (UTC) reply

Reliability

@ Rhododendrites: I maintain that reliability is a serious issue for archive sites, and added relevant information to the Article. Rhododendrites disagrees. Let's discuss. -- John Navas ( talk) 20:02, 3 December 2016 (UTC) reply

Jnavas2 - Well, let's start with the first rather obvious problem: sourcing. You've provided no source alongside your claim: "On 2 December 2016 the site became unavailable with browsers displaying Loading spinners indefinitely. It resumed normal operation late in the day." For all I know, this could be localized to you and your internet connection specifically. But let's assume for a moment that you've included a source with the claim and look at it from an encyclopaedic perspective. Wikipedia isn't a collection of all knowledge. Site crashes are generally discussed in some greater context. For example, many sites - including Wikipedia - went on blackout in response to SOPA and PIPA. Another example would be major DDoS attacks like October's Dyn cyberattack. So, what is the greater context here? Again, for all I know this could be a localized effect, or server maintenance. It's of no encyclopaedic value in either of these cases. Has it been attacked, or is there something interesting about this event that would make it notable in some way? Can you also provide a source to back the addition of this content? Note, a source on this won't necessarily guarantee its encyclopaedic value, but at least I'll have something to work off of. Mr rnddude (talk) 22:21, 3 December 2016 (UTC)
@ Mr rnddude: Why wasn't the issue handled this way to begin with?
  • Detail: Assuming you're not questioning my basic competence, I'd be happy to provide more detail. I ran tests with multiple browsers and diagnostic utilities on multiple devices over different Internet connections, including VPN connections to different countries, cross-checked with colleagues. I just kept the content brief so as to avoid excessive detail. How much is enough or too much? Would a long footnote be more appropriate?
  • Context: It seems self-evident to me that reliability is a significant issue for archive sites that profess to create "permanent" records, particularly with distribution by Cloudflare, and I was trying to avoid extraneous detail. Would you like more explanation of the issue? Perhaps by footnote?
  • Sourcing: My material was the result of my own work. I didn't learn it from some other source. Would you really want the detailed output of the utilities I used? Or must I find some 3rd party to vouch for my results? And how would the reliability of the 3rd party be judged?
-- John Navas ( talk) 23:01, 3 December 2016 (UTC) reply
John, your own observations fall under WP:Original research. This is not allowed on Wikipedia. If the outage of Archive.is on 2 December was not important enough to be noticed by any WP:Reliable source that we can cite, then it doesn't belong in our article. EdJohnston ( talk) 23:34, 3 December 2016 (UTC) reply
"If the outage of Archive.is on 2 December was not important enough to be noticed by any WP:Reliable source that we can cite, then it doesn't belong in our article." - That's such a narrow perspective. What WP:RS of any caliber cares enough to notice a brief outage of archive.is? This is very niche knowledge, that you'll never find on CNN. -- Dandv 08:06, 3 January 2021 (UTC) reply
(edit conflict) Thanks for the response. In terms of this being your work and the work of your colleagues, Wikipedia doesn't accept original research. The reason for this is that if it did, anybody could put anything into any article. It would be a true free-for-all of information and misinformation. It'd be impossible to distinguish between legitimate additions and bogus additions. To your point about sourcing: this brings up the question of reliable sources and event notability simultaneously. Reliable sources are those that are published by respected book publishers, news organizations, journals and other academic sources. In this case, I'd assume that news organizations might mention this information if it's significant enough. Event is generally for stand-alone articles, but you can vet basic additions using the same principle. Is the event widely covered in the news or is it a passing mention - a footnote in history, if you will? I had a quick skim of the internet for "archive.is down" and what I did find was that the only detected downtime was on November 17, 2016, for a period of around 35 minutes. I also wouldn't hold the source to any particular measure of reliability. Other than that, I haven't found anything about it. As somebody who edits and develops history articles on Wikipedia, I am well aware of the pains that WP:OR brings up, and also of how annoying it is not to be allowed to draw your own conclusions, no matter how obvious or trivial. The simplest detail has to be drawn from another published source. If you can get your hands on a reliable third-party source, then I'd be able to vet it myself and assess its notability; if not, it falls under the purview of original research. Mr rnddude (talk) 23:39, 3 December 2016 (UTC)
@ Mr rnddude: I suspect that would be a fool's errand and thus not a good use of my time, so I will simply pass. Sic transit gloria mundi -- John Navas ( talk) 23:48, 3 December 2016 (UTC) reply
As are many things. The balance, it's never quite right. If you have too many sources then little disagreements between them become a battleground between editors, if there are too few then it's borderline impossible to write anything informative about the topic and it will invariably end up at AfD. Mr rnddude ( talk) 23:52, 3 December 2016 (UTC) reply
@ Mr rnddude: Sourcing is an illusion of reliability because the Internet is so full of unreliable sources and bad information. It can have merit where data can be authenticated, but even there it's open to distortion, as in the case of climate change denial. Plus it's simply not possible to source everything in an article. So it ultimately becomes a subjective values proposition, with insiders defending their values against outsiders. As I said, a fool's errand, so I will simply pass. Too much pain for too little gain. -- John Navas ( talk) 00:10, 4 December 2016 (UTC) reply
@ Jnavas2: There is also the issue of reliability for the service, there are quite a few organizations trying to shut them down or block them. I've tried to visit their site several times over the past few weeks with none of the popular domains working anymore. I'm not sure if Archive.is has shut down or is being censored by someone. Lassitergregg ( talk) 03:38, 3 June 2018 (UTC) reply
@ Lassitergregg: https://downforeveryoneorjustme.com/archive.is

Starting on Sept 6th 2021, a multi-day outage occurred. As of 07:20am UTC, archive.today is still down. Gabefair ( talk) 07:21, 8 September 2021 (UTC) reply

How many pages?

Each of their shortened URLs has 5 characters (A-Z a-z 0-9) (4 characters until 2012).

62 possibilities per character.

  • 4 characters = 14,776,336 possibilities.
  • 5 characters = 916,132,832 possibilities, more than Archive.org has saved pages.

How many pages has Archive.is saved so far? Already beyond 14776336? -- 84.147.46.123 ( talk) 01:27, 14 November 2018 (UTC) reply
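The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting argument: it assumes every combination of the 62 characters is a usable short URL, which is not confirmed anywhere on this page, and the names ALPHABET and id_space are made up for the example.

```python
import string

# 26 uppercase + 26 lowercase + 10 digits = 62 symbols (assumed alphabet).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def id_space(length: int) -> int:
    """Number of distinct short URLs of the given length over ALPHABET."""
    return len(ALPHABET) ** length


if __name__ == "__main__":
    for n in (4, 5):
        print(f"{n} characters = {id_space(n):,} possibilities")
```

The count is an upper bound on the number of snapshots a given URL length can address, not an estimate of how many pages have actually been saved.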

Contents are regularly deleted or filtered

The article needs more information, although it seems to be very difficult to find, e.g. about who owns the website and who manages it.

Search results are sponsored by Google or yandex.ru, as is done by any independent, non-commercial company on the web.

An address space formed by only 5 alphanumeric characters is enough for all the internet requests simply because many of the saved results are deleted, censored and made unavailable to the public. Some of them are embedded into one result, which is shown by the search box and continues to link to the other saved pages. — Preceding unsigned comment added by 84.223.69.200 (talk) 19:12, 15 December 2018 (UTC)

Who owns/runs this site?

A strange omission from the article! I note somebody above in this talk page names a "Denis Petrov". Equinox 18:14, 26 December 2018 (UTC) reply

That sounds about right. I corresponded with the maintainer of the site many years ago. But I'm not a WP:RS, so what do I know. -- Dandv 08:08, 3 January 2021 (UTC) reply

https not working?

Looks like https doesn't work at the moment. Site is still available on http though. Don't know whether this is a permanent change? Evert ( talk) 15:41, 3 February 2019 (UTC) reply

Working ok for me. -- Green C 15:49, 3 February 2019 (UTC) reply

https://www.ssllabs.com/ssltest/analyze.html?d=archive.today

Something not quite right there... Evert ( talk) 16:07, 3 February 2019 (UTC) reply

Hmm.. do you know if this is a new condition, or just now noticed? -- Green C 17:59, 3 February 2019 (UTC) reply

Looks like the certificates for this site/these sites have been fixed, so I guess all is ok now Evert ( talk) 07:07, 4 February 2019 (UTC) reply

Evert, I'm getting "Assessment failed: Unable to connect to the server". Perhaps it has been blocked. "it usually happens due to firewall restrictions". -- Green C 15:41, 4 February 2019 (UTC) reply
I used a different checker and it reports SSL is working. -- Green C 15:43, 4 February 2019 (UTC) reply

Blocked in New Zealand?

Can anyone confirm Archive.today has been blocked in New Zealand following the Christchurch mosque shootings? Muzilon ( talk) 15:49, 17 March 2019 (UTC) reply

It is pingable from a server in New Zealand, but there might still be blocks at the protocol or ISP level. -- Green C 15:56, 17 March 2019 (UTC) reply
I suspect the latter. A bit excessive because it's not really a video-hosting site (which is what the New Zealand ISPs are trying to target in this case). Muzilon ( talk) 16:05, 17 March 2019 (UTC) reply

Digital preservation open archives are unavoidable for a long-term content provider based on external primary sources like Wikipedia

A total ban of archive.is from Wikipedia would be simply foolish and suicidal for the encyclopedia. Wikipedia aims at long-term digital preservation of its contents, but it does not claim to be a primary source of information, even if the honesty of its contributors, the quality of its policies, and the number of reviewers for each page and every single edit make it much more reliable and objective than many other renowned and prestigious encyclopedias. But the points are that:

  • Wikipedia is based solely on external sources;
  • the average lifetime of a linked Web page is on the order of weeks or a few months, so broken or migrated links have to be fixed continuously.

Such a type of content provider strongly needs one (or more) permanent archive(s). For example, the French Wikipedia uses a private archive (http://archive.wikiwix.com, as in w:fr:François Mitterrand#Notes et références): not all references are archived and not all archived contents are publicly readable by anyone, e.g. for legal reasons. I think that this choice was adopted in order to avoid copyright infringements and to have a private and independent external certification that a given source did exist and was linked to a Wikipedia oldid in the past. But any administrator can decide:

  • what has to be archived in the long term;
  • what is archived but not readable (as in a private archive);
  • which sources can't be ignored, excluded, or left unpreserved.

Such a system would be completely inappropriate for an open project, whose sources must be reliable and verifiable by anyone.

For copyright reasons, the Internet Archive has also made many saved entries unavailable, so that the copy is lost or can't be used as the archive-url parameter in the Wikipedia citation templates. On the Web we basically have the Internet Archive or Archive.is, given that WebCite is only for particular kinds of selected material. So Archive.is has become an unavoidable choice. — Preceding unsigned comment added by 94.38.234.134 ( talkcontribs)

I'm not aware of anyone suggesting we block or ban archive.today .. are they somewhere? -- Green C 13:19, 22 June 2019 (UTC) reply
It was banned in 2016. -- Dandv 08:13, 3 January 2021 (UTC) reply
Today, archive.is has introduced a Google reCAPTCHA verification for every single link that users try to save. So, the web service has been strongly limited, if not totally shut down.
Use archive.org for volume saves; it is unlimited. -- Green C 03:13, 8 November 2019 (UTC) reply
Or create your own WARC files, host them anywhere files can be hosted, and replay them using embedded archive software ("client-side replay technology"). In effect you are the archive provider. [3] -- Green C 03:17, 8 November 2019 (UTC) reply
So there are no reasons to ban archive.is from Wikipedia, since it has itself strongly limited the opportunities for copyright infringement. As an aside, anyone who publishes a web page on the Internet must accept the risk that it is permanently archived locally on some PCs or elsewhere on the Web.

Supported browsers

Following the introduction of a Google reCAPTCHA some months ago, archive.is has since February 2020 stopped supporting browsers like Waterfox, which don't share users' data with Google's partners. Indeed, archive.is can be accessed only through Opera, Chrome, Internet Explorer or Mozilla Firefox. Maybe this will change in the upcoming weeks. — Preceding unsigned comment added by 78.14.139.65 ( talk) 21:09, 22 February 2020 (UTC) reply

I get a malware warning when I try to access archive.is pages in Firefox. Something seems to be going on. Mini apolis 23:00, 9 April 2020 (UTC) reply
For every archive? Security Add-ons that might be interfering? The computer clock working/set? -- Green C 00:00, 10 April 2020 (UTC) reply
Every archive.is page I've tried; sometimes I need to check a reference as part of a copyedit. (I'm working on The Grand Budapest Hotel, which seems to use archive.is exclusively.) Fortunately, most of the links are still live. Computer clock set and working, and I have default FF security. Maybe there's something in the archive.is code that triggers the warning. Stay well and all the best, Mini apolis 13:42, 10 April 2020 (UTC) reply
@ Miniapolis: Looking at citation # 102 ("Olsen, Mark") the archive URL is http://archive.today/iaRVK .. when I open it redirects to https://archive.vn/iaRVK (note the "https:" and ".vn"). Can you try connecting direct to https://archive.today/iaRVK and https://archive.vn/iaRVK do either cause an error? (I don't get errors using FF). -- Green C 15:39, 10 April 2020 (UTC) reply
Many thanks for the help. Both those links work; apparently the https:// makes a difference. Pinging @ DAP388:, who wants to take the article to FA; those http://archive.is links may be a problem. The one privacy FF add-on I have is Privacy Badger, but that doesn't seem to be an issue. All the best, Mini apolis 15:53, 10 April 2020 (UTC) reply
I don't know much about this stuff, but my guess is that PB may have triggered the FF warning because of the redirect; both "good" links were identical. Mini apolis 15:58, 10 April 2020 (UTC) reply
This is how .today domain works. It always redirects to another domain: https://twitter.com/archiveis/status/1249475103584370689 — Preceding unsigned comment added by 188.143.233.210 ( talk) 00:13, 14 April 2020 (UTC) reply
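Since the https:// form of the links worked in this thread while the http:// form triggered a warning, one option for an article's citations is to rewrite the scheme. A minimal sketch follows; the host list is an illustrative assumption limited to domains mentioned in this thread, and `upgrade_to_https` is a hypothetical helper, not part of any Wikipedia tool.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: illustrative set of archive.today domains named in this thread;
# it is not a complete or authoritative list of mirrors.
ARCHIVE_HOSTS = {"archive.today", "archive.is", "archive.vn"}


def upgrade_to_https(url: str) -> str:
    """Return the URL with http:// replaced by https:// for known archive hosts."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in ARCHIVE_HOSTS:
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


print(upgrade_to_https("http://archive.today/iaRVK"))
```

URLs on other hosts, and URLs that already use https, are returned unchanged.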
We have a free add-on for Mozilla Firefox that automates the steps needed to archive a page. It will probably be extended to other browsers in the future. — Preceding unsigned comment added by 78.14.139.236 ( talk) 23:56, 14 May 2020 (UTC) reply
I just archived a web page using Lynx. Perhaps the captcha has been removed or disabled by now? 2001:16B8:2C0F:2600:D91C:72F:4C3C:2D39 ( talk) 13:02, 29 July 2020 (UTC) reply

past tense?

For some time archive.today and all its mirrors have been unavailable (I tried opening them directly in my browser and checked with services like "downforeveryoneorjustme.com" or "isitdownrightnow.com"). I can't find any recent info about the site closing or any kind of technical malfunction. The last tweet is from April 2nd (2020), unrelated to the situation. Does it mean that the website has been shut down? If yes, shouldn't the article describe Archive.today in the past tense?

-- 37.30.20.131 ( talk) 00:26, 11 December 2020 (UTC) reply

Most of the time since then it has been down. But different URLs have been up for very brief amounts of time so it's not dead completely.-- 76.123.193.174 ( talk) 03:45, 20 December 2020 (UTC) reply

I suspect DNS resolver is the problem. The site is not actually down, but some DNS resolvers are not supported by archive.today so it appears down. Try some from the list at Public recursive name server -- Green C 04:09, 20 December 2020 (UTC) reply

Possible Mistake

The article states that "[archive.today] retrieves one page at a time similar to WebCite, smaller than 50MB each" or that, in other words, it can archive pages up to 50MB large. However, it later goes on to say that "Individual users can only archive and/or retrieve approximately 10 to 20 megabytes of data per day.", which means it would be impossible to archive pages larger than 10-20MB. I believe the person who wrote that part may have meant to write 10 to 20 gigabytes, rather than 10 to 20 megabytes. In the first place, a 10 to 20 megabyte/day limit is pretty ridiculous. The only way I can see that not being a mistake would be if 50MB actually meant 50 megabits, which is equal to 6.25 megabytes. User:Poudink User talk:Poudink 16:48, 26 February 2021 (UTC) reply
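The unit arithmetic in the comment above can be checked with a short sketch. The limits are the ones quoted from the article; the helper names are hypothetical.

```python
BITS_PER_BYTE = 8


def megabits_to_megabytes(megabits: float) -> float:
    """Convert megabits to megabytes (8 bits per byte)."""
    return megabits / BITS_PER_BYTE


def fits_daily_quota(page_mb: float, quota_mb: float) -> bool:
    """True if a single page of page_mb megabytes fits within a daily quota."""
    return page_mb <= quota_mb


# A 50-megabyte page exceeds even the upper 20 MB/day figure quoted above.
print(fits_daily_quota(50, 20))
# Read as megabits, 50 Mb is 6.25 MB, which fits within a 10 MB/day quota.
print(megabits_to_megabytes(50), fits_daily_quota(megabits_to_megabytes(50), 10))
```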

FrescoBot

@ Thibaut120094: my understanding is there is a reason the IP added the nobots tag for FrescoBot. Let's discuss before reverting again. -- Green C 21:28, 2 July 2021 (UTC) reply

Sorry I didn’t get your ping (see mw:Manual:Echo#Technical details to know why).
The IP explained the reason, it was a false positive from the bot (although it would be better to use a secondary source, per WP:OR). -- Thibaut ( talk) 17:10, 3 July 2021 (UTC) reply
Ah so after reading Echo, it looks like after making those typos, I should have deleted the section and reposted the entire thing as a new edition. -- Green C 19:25, 3 July 2021 (UTC) reply

Blocked in Australia?

Is there a way to confirm this? See [4] which is contra though not definitive. -- Green C 14:57, 31 July 2021 (UTC) reply

Down September 2021

Archive.today has been down for at least a week as of 06 September 2021. cagliost ( talk) 14:44, 6 September 2021 (UTC) reply

I can confirm @Cagliost's observation regarding a multi-day, seemingly global, outage of the service. Gabefair ( talk) 07:26, 8 September 2021 (UTC) reply

Was down for me ever since like yesterday. This + webcitation both being down at the same time seems strange. Archive.org is still there, and I think archive.today and webcitation should both be up soon. Again, they are free websites, so I don't really mind downtime. Also this is probably why there is space for up to 7(?) web archives in the webarchive template. Rlink2 ( talk) 14:35, 8 September 2021 (UTC) reply

Working for me, and nothing in the blog about an outage. Try a different DNS that doesn't go through CloudFlare eg. 1.1.1.1 .. archive.today outages are often due to certain DNS resolvers. -- Green C 15:47, 8 September 2021 (UTC) reply

Could someone with access add the IP address to the Wikipedia article? cagliost ( talk) 16:24, 8 September 2021 (UTC) reply

I could think about writing a section about the DNS thing, but it would need approval from GreenC first Rlink2 ( talk) 01:39, 11 September 2021 (UTC) reply

Ownership

Does anyone know who runs it? cagliost ( talk) 14:45, 6 September 2021 (UTC) reply

No, I just created a sub-section about it. 1e100 ( talk) 13:38, 21 January 2022 (UTC) reply

I disagree about some of that being appropriate for Wikipedia. First it could be the wrong person and we cause someone trouble. Second it violates a basic rule about Original Research. Third it's speculation. -- Green C 16:35, 21 January 2022 (UTC) reply

But we should say something about this. https://lookup.icann.org/en/lookup says "Organization: Data Protected. Mailing Address: ON, CA . Redacted for privacy: some of the data in this object has been removed." This is strange. Are there no RS sources here? No journalist tried to look into this? PS. Removed information can be seen here; I agree it is ORish, my point is - we should say something. Is this an NGO? A company? From which country? Can such a shady service be considered reliable? Piotr Konieczny aka Prokonsul Piotrus| reply here 06:50, 18 October 2022 (UTC) reply
That Canada mailing address could be anything including a default security address issued by ICANN. According to archive.vn/faq it is "privately funded" so not an NGO. And apparently they want to retain personal privacy. Not sure that makes them a "shady service", there are good reasons, laws about web archiving are themselves a bit shady and change by country. For example Ghostarchive.org is similar. You either need to be 'too big to fail' like Wayback with institutional support in case of legal attack, or small and quick. The question is their reputation and archive.today has a good reputation for being reliable and not manipulating data. They are also pretty open with a blog answering user questions. They appear to host things globally using services like AWS. There's no journalism about it for Wikipedia RS purposes. -- Green C 17:41, 18 October 2022 (UTC) reply

Is commercial site? (Reliable source?)

In the infobox, there is a citation linking to a page with ads. This is used to infer that Archive.today is a commercial website.
This seems like either original research or synthesis.
Also, I think I remember reading elsewhere that the ads are just to help cover operating costs. (Not commercial?)
Is there a reliable source that actually states that the site is commercial (or otherwise)?
-- 50.89.193.43 ( talk) 10:08, 21 December 2021 (UTC) reply

There is no indication it is non-profit (non-commercial). Infoboxes suck for this sort of thing as they are black and white data points lacking ambiguity. If you think it shouldn't be in the info box then remove it. However, it should be mentioned in the article body with some kind of explanation. It is permissible to link to a primary source. My understanding is they do take money for ads, but it's limited. It has been discussed on the blog before, so would need old blog posts. Funding of archive.today remains opaque; the more said about it the better. It would be great if someone researched and reported what has been written about it in the blog. -- Green C 15:45, 21 December 2021 (UTC) reply

Cloudflare DNS service

The text states that "Since at least May 2018 it has not been possible to reach the site when using Cloudflare's 1.1.1.1 DNS service". A query this morning (20 February 2022) suggests that both "archive.is" and "archive.today" resolve using "1.1.1.1". Can others verify the site works using the 1.1.1.1 DNS service? Pvanheus ( talk) 06:32, 20 February 2022 (UTC) reply

When you say "resolve", do you mean that nslookup or dig returns an IP, or that the site is accessible from a browser that is using 1.1.1.1 as its resolver? My experience has been that weird things happen: it can work sometimes and not others, etc. -- Green C 16:31, 20 February 2022 (UTC) reply
Not working. Could be OpenDNS squabble again. 2601:643:8800:F320:C557:F7D7:3897:7B51 ( talk) 17:40, 24 April 2023 (UTC) reply
Tried both Quad9 and Cloudflare DNS servers today and archive.is goes into a captcha loop.
And I'm not the only one: https://news.ycombinator.com/item?id=37077049 Alex O. ( talk) 02:47, 16 August 2023 (UTC) reply
I've been able to get through on a Chromium browser but not Firefox. The site remains accessible, but as usual it's having trouble. VintageVernacular ( talk) 03:45, 16 August 2023 (UTC) reply
I got a captcha loop on both FF and ungoogled chromium.
Most likely one of your browsers uses DOH and the other does not, or they both do via different providers. Alex O. ( talk) 04:25, 16 August 2023 (UTC) reply
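The two meanings of "resolve" raised in this thread (a name lookup returning an IP versus the site actually loading) can be checked separately. The sketch below is only an illustration under stated assumptions: it uses whatever resolver the operating system is configured with, so to test 1.1.1.1 specifically the machine itself would have to be pointed at that resolver, and it does not detect a captcha loop, which still returns an HTTP response. The function names are hypothetical.

```python
import http.client
import socket
import urllib.error
import urllib.request


def resolves(host: str) -> list:
    """Return the IP addresses the system resolver gives for host ([] on failure)."""
    try:
        infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def reachable(url: str, timeout: float = 10.0) -> bool:
    """True if an HTTP(S) request to url completes with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def diagnose(addresses: list, http_ok: bool) -> str:
    """Summarise which of the two checks passed."""
    if not addresses:
        return "name does not resolve"
    if not http_ok:
        return "name resolves, but the site is not reachable"
    return "name resolves and the site is reachable"
```

For example, `diagnose(resolves("archive.today"), reachable("https://archive.today/"))` reports the state as seen from the machine running it, which may differ between resolvers and networks.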

Outage

22 June 2022: detected outage at 18:34 UTC archive pages return "Server Outage" error. Home page works. -- Green C 18:35, 22 June 2022 (UTC) reply

Back up in a few hours. -- Green C 03:05, 23 June 2022 (UTC) reply
Been quite a significant outage the past two days now — can't find any RS for that though — TNT ( talk • she/her) 09:03, 21 July 2022 (UTC) reply

Error 1001 for a while now in South Africa

Are other places facing the same problem? Thanks, Maqdisi ( talk) 12:35, 23 June 2023 (UTC) reply

Getting the same error weeks later, in the US - the article's statement about the Cloudflare issues being resolved might be a tad premature. CleverTitania ( talk) 06:59, 16 August 2023 (UTC) reply

Completely dead?

None of the pages seem to work anymore. Has the entire site died? (I did a quick Google but found no one else complaining recently.)

All down:

Netizen ( talk) 12:22, 11 March 2024 (UTC) reply

Works for me. Archive.today has weird things going on with DNS and CloudFlare. Try to use a DNS provider that is not on CloudFlare. That's probably the case with isitdownrightnow.com for example this site says it is up: https://downforeveryoneorjustme.com/archive.today -- Green C 14:48, 11 March 2024 (UTC) reply
There is a deadlock from both sides (Archive.today and CF), it is unlikely to be solved
https://twitter.com/archiveis/status/1772082674556965373 90.131.35.142 ( talk) 03:30, 14 April 2024 (UTC) reply