Artificial neural networks (ANNs) are models created using machine learning to perform a variety of tasks. Their creation was inspired by biological neural circuitry.[1][a] While some of the computational implementations of ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by the psychologist
Frank Rosenblatt, who developed the
perceptron.[1] Little research was conducted on ANNs in the 1970s and 1980s, with the
AAAI calling that period an "
AI winter".[2]
Later, advances in hardware and the development of the
backpropagation algorithm, as well as recurrent neural networks and convolutional neural networks, renewed interest in ANNs. The 2010s saw the development of a deep neural network (a neural network with many layers) called AlexNet.[3] It greatly outperformed other image recognition models and is thought to have launched the ongoing AI spring, further increasing interest in ANNs.[4] The
transformer architecture was first described in 2017 as a method to teach ANNs grammatical dependencies in language,[5] and is the predominant architecture used by
large language models, such as
GPT-4.
Diffusion models were first described in 2015, and began to be used by image generation models such as
DALL-E in the 2020s.[citation needed]
Linear neural network
The simplest kind of
feedforward neural network is a linear network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated in each node. The
mean squared error between these calculated outputs and the given target values is minimized by adjusting the weights. This technique has been known for over two centuries as the
method of least squares or
linear regression. It was used as a means of finding a good rough linear fit to a set of points by
Legendre (1805) and
Gauss (1795) for the prediction of planetary movement.[6][7][8][9][10]
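The computation can be illustrated directly. The following is a minimal sketch (assuming NumPy; the sample data and dimensions are made up) of a single-layer linear network fitted by least squares:

```python
import numpy as np

# Toy data: 50 points with 3 input features and 1 target each (made-up values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)   # noisy targets

# Each output node computes the sum of products of weights and inputs: X @ w.
# Least squares chooses w to minimize the mean squared error of X @ w - y.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

predictions = X @ w   # the linear network's outputs for the training inputs
```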
Perceptrons and other early neural networks
Warren McCulloch and
Walter Pitts[11] (1943) also considered a non-learning computational model for neural networks.[12] This model paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to
artificial intelligence. This work led to research on nerve networks and their link to
finite automata.[13]
In the late 1940s,
D. O. Hebb[14] created a learning hypothesis based on the mechanism of
neural plasticity that became known as
Hebbian learning. Hebbian learning is
unsupervised learning. This evolved into models for
long-term potentiation. Researchers started applying these ideas to computational models in 1948 with
Turing's B-type machines. Farley and
Clark[15] (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by
Rochester, Holland, Habit and Duda (1956).[16]
Rosenblatt[1] (1958) created the
perceptron, an algorithm for pattern recognition. Using mathematical notation, Rosenblatt described circuitry beyond the basic perceptron, such as the exclusive-or circuit, which could not be processed by neural networks at the time. In 1959, a biological model proposed by
Nobel laureates Hubel and
Wiesel was based on their discovery of two types of cells in the
primary visual cortex:
simple cells and
complex cells.[17]
Some say that research stagnated following
Minsky and
Papert (1969),[18] who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks. However, by the time their book came out, methods for training
multilayer perceptrons (MLPs) by
deep learning were already known.[9]
Self-organizing maps
Self-organizing maps (SOMs) create internal representations reminiscent of the
cortical homunculus,[39] a distorted representation of the
human body, based on a neurological "map" of the areas and proportions of the
human brain dedicated to processing
sensory functions, for different parts of the body.
Convolutional neural networks
The origin of the CNN architecture is the "
neocognitron"[40] introduced by
Kunihiko Fukushima in 1980.[41][42]
It was inspired by work of
Hubel and
Wiesel in the 1950s and 1960s which showed that cat
visual cortices contain neurons that individually respond to small regions of the
visual field.
The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.
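The two layer types can be sketched as follows (a minimal NumPy illustration with a made-up input and filter, not Fukushima's original formulation):

```python
import numpy as np

def convolve2d(image, filt):
    """Valid 2-D convolution: every output unit applies the same (shared)
    filter to a small receptive field of the previous layer."""
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out

def average_pool(activations, size=2):
    """Downsampling layer: each unit outputs the average of a patch,
    giving tolerance to small shifts of the input."""
    h, w = activations.shape
    h, w = h - h % size, w - w % size   # trim to a multiple of the patch size
    patches = activations[:h, :w].reshape(h // size, size, w // size, size)
    return patches.mean(axis=(1, 3))

image = np.random.default_rng(1).random((8, 8))      # toy input "retina"
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])   # one shared filter
feature_map = convolve2d(image, edge_filter)         # convolutional layer
downsampled = average_pool(feature_map)              # downsampling layer
```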
The
time delay neural network (TDNN) was introduced in 1987 by
Alex Waibel and was one of the first CNNs, as it achieved shift invariance.[45] It did so by utilizing weight sharing in combination with
backpropagation training.[46] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.[45]
In 1988, Wei Zhang et al. applied
backpropagation
to a CNN (a simplified neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system.[47][48]
In 1989,
Yann LeCun et al. trained a CNN with the purpose of
recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.[49] Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types.
Subsequently, Wei Zhang et al. modified their model by removing the last fully connected layer and applied it to medical image object segmentation in 1991[50] and breast cancer detection in mammograms in 1994.[51]
In 1990, Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling to realize a speaker-independent isolated word recognition system.[52]
In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.[53][54][55][56] Max-pooling is often used in modern CNNs.[57]
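A max-pooling downsampling unit differs from the averaging variant only in the reduction it applies; a minimal NumPy sketch (patch size and input are illustrative):

```python
import numpy as np

def max_pool(activations, size=2):
    """Max-pooling: each downsampling unit propagates the maximum activation
    in its patch, instead of the spatial average used in the neocognitron."""
    h, w = activations.shape
    h, w = h - h % size, w - w % size   # trim to a multiple of the patch size
    patches = activations[:h, :w].reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))

feature_map = np.random.default_rng(1).random((7, 7))  # toy feature map
pooled = max_pool(feature_map)                          # 3x3 output
```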
LeNet-5, a 7-level CNN by
Yann LeCun et al. in 1998,[58] classifies digits and was applied by several banks to recognize hand-written numbers on checks (British English: cheques) digitized in 32x32 pixel images. Processing higher-resolution images requires larger and deeper CNNs, so the technique is constrained by the availability of computing resources.
In 2010, backpropagation training through
max-pooling was accelerated by GPUs and shown to perform better than other pooling variants.[59]
Behnke (2003) relied only on the sign of the gradient (
Rprop)[60] on problems such as image reconstruction and face localization. Rprop is a
first-order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992.[61]
ANNs were able to guarantee shift invariance when dealing with small and large natural objects in large cluttered scenes only when invariance extended beyond shift to all ANN-learned concepts, such as location, type (object class label), scale, and lighting. This was realized in Developmental Networks (DNs),[65] whose embodiments are the Where-What Networks, WWN-1 (2008)[66] through WWN-7 (2013).[67]
Artificial curiosity and generative adversarial networks
In 1991,
Juergen Schmidhuber published adversarial
neural networks that contest with each other in the form of a
zero-sum game, where one network's gain is the other network's loss.[68][69][70] The first network is a
generative model that models a
probability distribution over output patterns. The second network learns by
gradient descent to predict the reactions of the environment to these patterns. This was called "artificial curiosity." Earlier adversarial machine learning systems "neither involved unsupervised neural networks nor were about modeling data nor used gradient descent."[70]
In 2014, this adversarial principle was used in a
generative adversarial network (GAN) by
Ian Goodfellow et al.[71] Here the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set. This can be used to create realistic
deepfakes.[72]
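The adversarial game can be sketched in a short training step (a toy PyTorch illustration with made-up network sizes, data, and optimizers, not Goodfellow et al.'s original implementation):

```python
import torch
import torch.nn as nn

# Toy networks with made-up sizes; the 2014 GAN used deeper models.
generator = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
discriminator = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(64, 2) + 3.0   # stand-in "real" data set
noise = torch.randn(64, 16)

# Discriminator step: label real samples 1 and generated samples 0
# (one network's gain is the other's loss: a zero-sum game).
d_opt.zero_grad()
d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
         bce(discriminator(generator(noise).detach()), torch.zeros(64, 1))
d_loss.backward()
d_opt.step()

# Generator step: try to make the discriminator output 1 for its samples.
g_opt.zero_grad()
g_loss = bce(discriminator(generator(noise)), torch.ones(64, 1))
g_loss.backward()
g_opt.step()
```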
In 1992, Schmidhuber also published another type of gradient-based adversarial neural networks where the goal of the
zero-sum game is to create disentangled representations of input patterns. This was called predictability minimization.[73][74]
Nvidia's
StyleGAN (2018)[75] is based on the Progressive GAN by Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.[76] Here the GAN generator is grown from small to large scale in a pyramidal fashion. StyleGANs improve consistency between fine and coarse details in the generator network.
Transformers
Basic ideas for the Transformer go back a long way: in 1992, Juergen Schmidhuber published the Transformer with "linearized self-attention" (save for a normalization operator),[80]
which is also called the "linear Transformer."[81][82][9] He advertised it as an "alternative to RNNs"[80] that can learn "internal spotlights of attention,"[83] and experimentally applied it to problems of variable binding.[80] Here a slow
feedforward neural network learns by
gradient descent to control the fast weights of another neural network through
outer products of self-generated activation patterns called "FROM" and "TO", which in Transformer terminology are called "key" and "value" for "
self-attention."[82] This fast weight "attention mapping" is applied to queries. The 2017 Transformer[77] combines this with a
softmax operator and a projection matrix.[9]
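The construction can be sketched as follows (a NumPy illustration with made-up dimensions; the Transformer's projection matrices and scaling factor are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 5                       # feature size and sequence length (illustrative)
keys = rng.normal(size=(n, d))    # "FROM" patterns
values = rng.normal(size=(n, d))  # "TO" patterns
queries = rng.normal(size=(n, d))

# Fast weight programming: sum the outer products value_i (x) key_i
# into the fast weight matrix of the second network.
W_fast = np.zeros((d, d))
for k, v in zip(keys, values):
    W_fast += np.outer(v, k)

# Linearized self-attention: apply the fast weight "attention mapping"
# to each query; row i equals sum_j (k_j . q_i) v_j.
linear_attention = queries @ W_fast.T

# The 2017 Transformer instead normalizes key-query matches with a softmax.
scores = queries @ keys.T
softmax = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
softmax_attention = softmax @ values
```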
Deep learning with unsupervised or self-supervised pre-training
In the 1980s,
backpropagation did not work well for deep FNNs and RNNs. Here the word "deep" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth.[85] The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For an FNN, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For RNNs, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.
To overcome this problem,
Juergen Schmidhuber (1992) proposed a self-supervised hierarchy of RNNs pre-trained one level at a time by
self-supervised learning.[86] This "neural history compressor" uses
predictive coding to learn
internal representations at multiple self-organizing time scales.[9]
The deep architecture may be used to reproduce the original data from the top level feature activations.[86]
The RNN hierarchy can be "collapsed" into a single RNN by "distilling" a higher-level "chunker" network into a lower-level "automatizer" network.[86][9] In 1993, a chunker solved a deep learning task whose CAP depth exceeded 1000.[87]
Such history compressors can substantially facilitate downstream supervised deep learning.[9]
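The compression principle can be illustrated with a deliberately simplified toy (not Schmidhuber's RNN formulation): a lower level predicts the next symbol, and only the unexpected symbols are passed up, so higher levels see a shorter sequence on a slower time scale.

```python
def compress(sequence, predict):
    """Pass only unpredicted symbols up to the next level.
    `predict(prefix)` guesses the next symbol from what was seen so far."""
    unexpected = []
    for i, symbol in enumerate(sequence):
        if predict(sequence[:i]) != symbol:
            unexpected.append(symbol)   # prediction failed: forward the surprise
    return unexpected

# Toy predictor: guess that the next symbol repeats the previous one.
repeat_last = lambda prefix: prefix[-1] if prefix else None

level_0 = list("aaaabbbbbbcccc")
level_1 = compress(level_0, repeat_last)   # ['a', 'b', 'c']: only the changes
```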
Sepp Hochreiter's diploma thesis (1991)[92] was called "one of the most important documents in the history of machine learning" by his supervisor
Juergen Schmidhuber.[9] Hochreiter not only tested the neural history compressor,[86] but also identified and analyzed the
vanishing gradient problem.[92][93] He proposed recurrent
residual connections to solve this problem. This led to the deep learning method called
long short-term memory (LSTM), published in 1997.[94] LSTM
recurrent neural networks can learn "very deep learning" tasks[85] with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. The "vanilla LSTM" with forget gate was introduced in 1999 by
Felix Gers,
Schmidhuber and Fred Cummins.[95] LSTM has become the most cited neural network of the 20th century.[9]
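One step of an LSTM cell with forget gate can be sketched as follows (a minimal NumPy illustration with made-up sizes and random weights; biases and initialization details of the published LSTM are simplified):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One time step of an LSTM cell with forget gate.
    W, U, b hold the stacked parameters of the four gate computations."""
    z = W @ x + U @ h + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c + i * g    # additive update: the gradient path through time
    h = o * np.tanh(c)   # hidden state read out from the cell
    return h, c

d_in, d_h = 3, 5                                   # illustrative sizes
rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)

h = c = np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):              # run over a toy sequence
    h, c = lstm_step(x, h, c, W, U, b)
```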
In 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used
LSTM principles to create the
Highway network, a
feedforward neural network with hundreds of layers, much deeper than previous networks.[96][97] Seven months later, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun won the ImageNet 2015 competition with an open-gated or gateless
Highway network variant called
Residual neural network.[98] This has become the most cited neural network of the 21st century.[9]
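The relationship between the two can be sketched as follows (illustrative NumPy, not the published implementations): a highway layer gates between a transformation and the identity, and a residual layer is the special case in which the identity path is always fully open.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
d = 8
W_h, b_h = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)  # transform H
W_t, b_t = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)  # gate T

def highway_layer(x):
    h = np.tanh(W_h @ x + b_h)      # candidate transformation H(x)
    t = sigmoid(W_t @ x + b_t)      # gate T(x) in (0, 1)
    return h * t + x * (1.0 - t)    # mix transformation and identity

def residual_layer(x):
    h = np.tanh(W_h @ x + b_h)
    return h + x                    # "gateless" highway: identity always open

deep = rng.normal(size=d)
for _ in range(100):                # many layers stay trainable because the
    deep = highway_layer(deep)      # identity path lets gradients flow through
```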
In 2011, Xavier Glorot, Antoine Bordes and
Yoshua Bengio found that the
ReLU[43] of
Kunihiko Fukushima helps to overcome the vanishing gradient problem[99] better than the activation functions that were widely used prior to 2011.
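A quick numeric illustration of the effect (assuming NumPy): the sigmoid's derivative never exceeds 0.25, so repeated multiplication across layers shrinks gradients toward zero, whereas the ReLU's derivative is exactly 1 for positive inputs.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # peaks at 0.25, so it always shrinks gradients

def relu_grad(x):
    # Exactly 1 wherever the input is positive: no shrinkage there.
    return np.where(np.asarray(x) > 0, 1.0, 0.0)

# Gradient surviving 20 layers at the most favorable point of each nonlinearity:
print(sigmoid_grad(0.0) ** 20)    # 0.25**20, about 9.1e-13: vanished
print(relu_grad(1.0) ** 20)       # 1.0: preserved
```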
Computational devices were created in
CMOS for both biophysical simulation and
neuromorphic computing inspired by the structure and function of the human brain.
Nanodevices[101] for very large scale
principal components analyses and
convolution may create a new class of neural computing because they are fundamentally
analog rather than
digital (even though the first implementations may use digital devices).[102]
Ciresan and colleagues (2010)[103] in Schmidhuber's group showed that, despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks.
Ciresan and colleagues won
pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[110] the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge[111] and others. Their neural networks were the first pattern recognizers to achieve human-competitive/superhuman performance[62] on benchmarks such as traffic sign recognition (IJCNN 2012) or the
MNIST handwritten digits problem.
Researchers demonstrated (2010) that deep neural networks interfaced to a
hidden Markov model with context-dependent states that define the neural network output layer can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search.[citation needed]
GPU-based implementations[112] of this approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[110] the ISBI 2012 Segmentation of neuronal structures in EM stacks challenge,[111] the
ImageNet Competition[63] and others.
Deep, highly nonlinear neural architectures similar to the
neocognitron[113] and the "standard architecture of vision",[114] inspired by
simple and
complex cells, were pre-trained with unsupervised methods by Hinton.[90][89] A team from his lab won a 2012 contest sponsored by
Merck to design software to help find molecules that might identify new drugs.[115]
Notes
^ Neurons generate an action potential (the release of neurotransmitters that are chemical inputs to other neurons) based on the sum of their incoming chemical inputs.
^ McCulloch, Warren; Walter Pitts (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics. 5 (4): 115–133. doi:10.1007/BF02478259.
^ Farley, B.G.; W.A. Clark (1954). "Simulation of Self-Organizing Systems by Digital Computer". IRE Transactions on Information Theory. 4 (4): 76–84. doi:10.1109/TIT.1954.1057468.
^ Rochester, N.; J.H. Holland; L.H. Habit; W.L. Duda (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer". IRE Transactions on Information Theory. 2 (3): 80–93. doi:10.1109/TIT.1956.1056810.
^ Linnainmaa, Seppo (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors (Masters) (in Finnish). University of Helsinki. pp. 6–7.
^ Griewank, Andreas (2012). "Who Invented the Reverse Mode of Differentiation?". Optimization Stories. Documenta Mathematica, Extra Volume ISMP. pp. 389–400. S2CID 15568746.
^ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, R. J. (1986). "Learning Internal Representations by Error Propagation". In Rumelhart, David E.; McClelland, James L.; the PDP research group (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press.
^ Amari, Shun-Ichi (1972). "Learning patterns and pattern sequences by self-organizing nets of threshold elements". IEEE Transactions on Computers. C-21: 1197–1206.
^ Von der Malsburg, C (1973). "Self-organization of orientation sensitive cells in the striate cortex". Kybernetik. 14 (2): 85–100. doi:10.1007/bf00288907. PMID 4786750. S2CID 3351573.
^ a b Fukushima, K. (1969). "Visual feature extraction by a multilayered network of analog threshold elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. doi:10.1109/TSSC.1969.300225.
^ Ramachandran, Prajit; Barret, Zoph; Quoc, V. Le (October 16, 2017). "Searching for Activation Functions". arXiv:1710.05941 [cs.NE].
^ a b Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks. Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.
^ Riedmiller, Martin; Braun, Heinrich (1992). "Rprop – A Fast Adaptive Learning Algorithm". Proceedings of the International Symposium on Computer and Information Science VII.
^ Schmidhuber, Jürgen (1991). "A possibility for implementing curiosity and boredom in model-building neural controllers". Proc. SAB'1991. MIT Press/Bradford Books. pp. 222–227.
^ Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks (PDF). Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014). pp. 2672–2680. Archived (PDF) from the original on 22 November 2019. Retrieved 20 August 2019.
^ a b Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
^ Schmidhuber, Jürgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". ICANN 1993. Springer. pp. 460–463.
^ He, Cheng (31 December 2021). "Transformer in CV". Towards Data Science.
^ Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
^ Srivastava, Rupesh K; Greff, Klaus; Schmidhuber, Juergen (2015). "Training Very Deep Networks". Advances in Neural Information Processing Systems. 28. Curran Associates, Inc.: 2377–2385.
^ Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; Stewart, D. R.; Williams, R. S. (2008). "Memristive switching mechanism for metal/oxide/metal nanodevices". Nature Nanotechnology. 3 (7): 429–433. doi:10.1038/nnano.2008.160. PMID 18654568.
^ a b Cireşan, Dan; Meier, Ueli; Masci, Jonathan; Schmidhuber, Jürgen (August 2012). "Multi-column deep neural network for traffic sign classification". Neural Networks. Selected Papers from IJCNN 2011. 32: 333–338. CiteSeerX 10.1.1.226.8219. doi:10.1016/j.neunet.2012.02.023. PMID 22386783.
^ a b Ciresan, Dan; Giusti, Alessandro; Gambardella, Luca M.; Schmidhuber, Juergen (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q. (eds.). Advances in Neural Information Processing Systems 25 (PDF). Curran Associates, Inc. pp. 2843–2851.
^ Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position". Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608.
^ Riesenhuber, M; Poggio, T (1999). "Hierarchical models of object recognition in cortex". Nature Neuroscience. 2 (11): 1019–1025. doi:10.1038/14819. PMID 10526343. S2CID 8920227.