From Wikipedia, the free encyclopedia

What is happening

Uncle G's major work 'bot (see the bot approval discussion) is reverting or blanking all of the articles created by Darius Dhlomo, based upon a list supplied by VernoWhitney and a contributor copyright investigation (CCI). There are about 10,000 articles in this list. We have already determined that a significant number of these articles contain prose copied from other sources. There is a further list of just over 13,000 articles that have had significant text added to them by Darius Dhlomo. These, too, are part of the investigation. A more general description of the incident can be found here.

The first-pass operation of the 'bot is to blank the 10,000 articles that were created by Darius Dhlomo. What to do about the further 13,000 articles in the second list is still being discussed, but it is likely to involve a similar mass-editing task by a 'bot, following on from this one. Our current plan is to revert each article on the second list to the revision prior to any additions by Darius Dhlomo, and to add a notice to the article informing editors that this has happened and that the article requires review.
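
The planned second-pass revert can be sketched as follows. This is a minimal illustration in Python, not the 'bot's actual code: the `Revision` record, the `revert_target` function, and the sample history (including the editor names `ExampleEditorA` and `ExampleEditorB`) are all hypothetical, and the sketch assumes that a page's history is available as a list ordered oldest first.

```python
from typing import NamedTuple, Optional, Sequence


class Revision(NamedTuple):
    rev_id: int  # hypothetical revision identifier
    user: str    # name of the editor who saved this revision


def revert_target(history: Sequence[Revision], editor: str) -> Optional[Revision]:
    """Return the last revision saved before `editor` first edited the page.

    `history` must be ordered oldest first. Returns None when the editor made
    the very first edit (there is no earlier revision) or never edited the page.
    """
    for index, revision in enumerate(history):
        if revision.user == editor:
            return history[index - 1] if index > 0 else None
    return None


# Invented sample history, for illustration only.
history = [
    Revision(101, "ExampleEditorA"),
    Revision(102, "ExampleEditorB"),
    Revision(103, "Darius Dhlomo"),
    Revision(104, "ExampleEditorA"),
]
target = revert_target(history, "Darius Dhlomo")
print(target)
```

When the editor under investigation created the page, there is no earlier revision to return to, which corresponds to the first-pass case where the article is blanked instead.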

This process is not about determining the motive for the copyright infringement. That is being discussed separately. This is about the cleanup of the result. Our best understanding of the copyright law that applies is that we cannot, having become aware of this mass infringement, do nothing. [fn 1]

Why this is happening

This is being done because, after investigation, it was determined that Darius Dhlomo had been violating copyright on a number of occasions. It turned out that this was happening on quite a large scale, and with a regular pattern. As a consequence, every article that xe has created is now suspect and has to be reviewed for potential copyright infringement. Unfortunately, that turned out to be a huge number of articles, too many for the normal contributor copyright investigation process, where a small number of dedicated editors manually look through a list of a few hundred articles. The number of articles to review is over twenty-three thousand. A handful of people cannot cope with that amount of work.

Instead, we have opted for a process where articles are blanked and the editor community in general is asked to diligently and carefully review those articles that interest them for copyright problems.

The articles are being blanked as a precautionary measure. They aren't being deleted. The edit history remains. However, we cannot legitimately continue to have Wikipedia publish the text of what we suspect will be hundreds if not thousands of copyright violations, until such time as we get around to reviewing each article. [fn 2]

How many and which of those articles actually infringe?

The short factual answer is that we don't know. The long factual answer is that we don't know. At least three volunteers have independently sampled selections of a few hundred articles from the list, and have concluded that around 10% of the articles contain copyright violations. We have fairly good reason, based upon the editing patterns, to conclude that this extends to all 23,000 articles being investigated. The editing patterns found upon investigation have also led us to conclude that any prose content (i.e. anything more than just raw numbers, names, and dates) is likely not this editor's original writing.
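
As a rough sense of scale, the sampled rate can be applied to the list sizes given on this page. The sketch below is illustrative arithmetic only: it assumes that the rate of around 10% holds uniformly across both lists, and the function name `expected_infringing` is hypothetical.

```python
def expected_infringing(article_count: int, sampled_rate: float) -> int:
    """Point estimate of infringing articles, assuming a uniform rate."""
    return round(article_count * sampled_rate)


SAMPLED_RATE = 0.10  # "around 10%" from the volunteers' samples
CREATED = 10_000     # approximate size of the list of created articles
EXPANDED = 13_000    # approximate size of the list of expanded articles

created_estimate = expected_infringing(CREATED, SAMPLED_RATE)
total_estimate = expected_infringing(CREATED + EXPANDED, SAMPLED_RATE)
print(created_estimate, total_estimate)
```

This is only a point estimate under the stated assumption. It indicates roughly how many articles may be affected, not which ones.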

The problem is that we have no mechanical way to determine which 10%. The way that the text was copied defeats automated comparison systems such as CorenSearchBot, which detected only a very few of the copyright violations. This is partly why no warning flags were raised earlier [fn 3]. Individual sentences or paragraphs were taken wholesale from prose sources, but were re-ordered. Sometimes very light textual revisions were made, producing close paraphrases. Such revisions do not answer the charge of either copyright violation or plagiarism, but they are enough to defeat automated text comparison mechanisms.
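
The effect of re-ordering on an order-sensitive comparison can be illustrated with a small Python sketch. This is not how CorenSearchBot works; it is an assumed toy comparison built on the standard library's `difflib.SequenceMatcher`, the function names are hypothetical, and the sample text is invented for the example.

```python
from difflib import SequenceMatcher


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break on full stops and trim whitespace."""
    return [part.strip() for part in text.split(".") if part.strip()]


def ordered_similarity(a: str, b: str) -> float:
    """Order-sensitive score: compares the two sentence sequences in order."""
    return SequenceMatcher(None, split_sentences(a), split_sentences(b)).ratio()


def sentence_overlap(a: str, b: str) -> float:
    """Order-insensitive score: shared sentences over all distinct sentences."""
    set_a, set_b = set(split_sentences(a)), set(split_sentences(b))
    return len(set_a & set_b) / len(set_a | set_b)


# Invented sample text, for illustration only.
source = (
    "She was born in a small town. She began running at school. "
    "She won her first national title at nineteen. She retired ten years later."
)
# The same four sentences, copied wholesale but in reverse order.
copied = ". ".join(reversed(split_sentences(source))) + "."

ordered_score = ordered_similarity(source, copied)
overlap_score = sentence_overlap(source, copied)
print(f"ordered: {ordered_score:.2f}, overlap: {overlap_score:.2f}")
```

In this toy case the order-sensitive score is low even though every sentence was copied verbatim, while the order-insensitive score is at its maximum. Light rewording of each sentence would defeat the exact-sentence overlap as well, which is why human review is still needed.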

Furthermore, and equally unfortunately, the copied prose has sometimes been included amongst legitimate contributions made by other Wikipedians, or has been later modified and edited by other Wikipedians (creating derivative works, which we also have to exclude).

Thus we need humans to review the contents of the articles. This is where you come in.

What happens next

What happens next is you. You can help. We want you to help. If you came here because a link to this page turned up in an edit summary on your watchlist, we'd like you to review the articles that you are watching. The idea is that if everyone reviews just a few articles, this mountain ends up being moved by a thousand teaspoons all digging together.

Please read the instructions for what to do and help.

Where this was discussed

The full discussion, including the original CCI case discussion from late August and the subsequent discussion at the administrators' noticeboard for incidents, can be found at Wikipedia:Administrators' noticeboard/Incidents/CCI, where we analyzed samples of articles, discussed options, and tried to come up with a means of managing such a huge investigation. There is also relevant discussion at User talk:Uncle G and User talk:Moonriddengirl (and their respective archives for September 2010).

Footnotes

  1. ^ This explanation from Moonriddengirl:

    There is a duty of care implicit in 17 U.S.C. § 512(c)(1)(A)(ii). As explicated in Report 105-551 pt.2 by the House of Representatives:

    New subsection (c)(1)(A)(ii) can best be described as a “red flag” test. As stated in new subsection (c)(1), a service provider need not monitor its service or affirmatively seek facts indicating infringing activity (except to the extent consistent with a standard technical measure complying with new subsection (h)), in order to claim this limitation on liability (or, indeed any other limitation provided by the legislation). However, if the service provider becomes aware of a “red flag” from which infringing activity is apparent, it will lose the limitation of liability if it takes no action. The “red flag” test has both a subjective and an objective element. In determining whether the service provider was aware of a “red flag,” the subjective awareness of the service provider of the facts or circumstances in question must be determined. However, in deciding whether those facts or circumstances constitute a “red flag”—in other words, whether infringing activity would have been apparent to a reasonable person operating under the same or similar circumstances—an objective standard should be used.

  2. ^ Even aside from the legal obligations to take action once we become aware of the problem, we have an obligation to people who re-use Wikipedia content, which we intend to be freely re-usable. People who, for example, create printed books mirroring Wikipedia articles would end up fixing copyright violations into print if we left articles untouched until we got around to them at some indefinite point in the future.
  3. ^ In any case, CorenSearchBot only reviews new articles. It only looked over the 10,000 created articles and only their initial revisions at that.