A Close Reading of the Review Paper "Deep Learning"

Deep Learning

Yann LeCun, Yoshua Bengio & Geoffrey Hinton

Nature volume 521, pages 436–444 (2015)

https://www.nature.com/articles/nature14539



Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

Main

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
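
The idea of composing simple but non-linear modules can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names (`linear`, `relu`, `compose`) and the hand-picked weights are hypothetical, whereas in a real deep-learning system the weights would be learned from data.

```python
def linear(weights, biases):
    """Return a module that computes one weighted sum per output unit."""
    def module(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, biases)]
    return module


def relu(x):
    """Element-wise non-linearity: negative values are set to zero."""
    return [max(0.0, v) for v in x]


def compose(*modules):
    """Chain modules so each one transforms the previous representation."""
    def model(x):
        for module in modules:
            x = module(x)
        return x
    return model


# Hypothetical hand-picked weights (not learned), chosen so that the
# composition computes |x0 - x1|, a function no single linear module can express.
layer1 = linear([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
layer2 = linear([[1.0, 1.0]], [0.0])
model = compose(layer1, relu, layer2)

print(model([3.0, 1.0]))  # [2.0]
print(model([1.0, 3.0]))  # [2.0]
```

Removing `relu` from the chain would collapse the two linear modules into a single linear map, which is why the non-linearity between levels matters.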

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
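
To make the score vector and the objective function concrete, here is a small sketch. It assumes a squared-distance objective and a one-hot "desired pattern of scores"; the text does not fix either choice, and the score values below are invented for illustration.

```python
CATEGORIES = ["house", "car", "person", "pet"]


def desired_scores(label):
    """Desired pattern of scores: 1 for the labelled category, 0 elsewhere."""
    return [1.0 if category == label else 0.0 for category in CATEGORIES]


def squared_error(output, target):
    """Objective: squared distance between output scores and desired scores."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


def predicted_category(output):
    """The category with the highest score."""
    return CATEGORIES[output.index(max(output))]


# Hypothetical score vectors for one image labelled "pet".
untrained_output = [0.3, 0.4, 0.2, 0.1]
trained_output = [0.05, 0.05, 0.1, 0.8]
target = desired_scores("pet")

print(predicted_category(untrained_output), squared_error(untrained_output, target))
print(predicted_category(trained_output), squared_error(trained_output, target))
```

Before training the highest score falls on the wrong category and the error is large; training aims to move the weights so that the error shrinks and the labelled category wins.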

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
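
The definition of the gradient given above can be checked numerically. The sketch below nudges each weight by a tiny amount, measures the change in error, and then takes one step in the opposite direction. The error function, the step size `learning_rate` and the nudge size `eps` are assumptions chosen for illustration; practical systems compute gradients analytically with backpropagation rather than by finite differences.

```python
def error(weights):
    """Hypothetical error surface with its minimum at weights = [1.0, -2.0]."""
    w0, w1 = weights
    return (w0 - 1.0) ** 2 + (w1 + 2.0) ** 2


def numerical_gradient(f, weights, eps=1e-6):
    """For each weight, estimate how much f changes per unit increase of that weight."""
    grad = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += eps
        grad.append((f(nudged) - f(weights)) / eps)
    return grad


def step(weights, grad, learning_rate=0.1):
    """Move the weight vector in the direction opposite to the gradient."""
    return [w - learning_rate * g for w, g in zip(weights, grad)]


weights = [0.0, 0.0]
grad = numerical_gradient(error, weights)   # close to [-2.0, 4.0]
new_weights = step(weights, grad)           # close to [0.2, -0.4]

print(error(weights), error(new_weights))   # the error decreases after the step
```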

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques[18]. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine: its ability to produce sensible answers on new inputs that it has never seen during training.
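
The SGD procedure described above can be sketched on a toy problem. The following is a minimal illustration under stated assumptions, not the procedure used in any particular system: the model is a single linear unit `w * x + b`, the data are synthetic, and the loop runs for a fixed number of steps instead of monitoring when the objective stops decreasing. All names and hyperparameters are hypothetical.

```python
import random


def make_examples(n, rng):
    """Hypothetical data: targets follow y = 2x + 1 with slight noise."""
    examples = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        examples.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)))
    return examples


def mean_squared_error(w, b, examples):
    """Objective function averaged over a set of examples."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


def sgd(train, batch_size=8, learning_rate=0.1, steps=500, seed=0):
    """Repeatedly adjust (w, b) using the average gradient of a small random batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(train, batch_size)  # a few examples: a noisy gradient estimate
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


data_rng = random.Random(42)
train_set = make_examples(200, data_rng)
test_set = make_examples(50, data_rng)   # never used during training

w, b = sgd(train_set)
print("learned w, b:", w, b)
print("test error:", mean_squared_error(w, b, test_set))
```

Measuring the error on `test_set`, which the training loop never sees, is the toy counterpart of testing generalization on new inputs.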

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
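
A two-class linear classifier as described here fits in a few lines. The weights, threshold and feature vectors below are made-up values for illustration only.

```python
def linear_classify(features, weights, threshold):
    """Return True if the weighted sum of the features is above the threshold."""
    weighted_sum = sum(w * f for w, f in zip(weights, features))
    return weighted_sum > threshold


# Hypothetical weights and threshold; in practice they would be learned.
weights = [0.5, -1.0, 2.0]
threshold = 1.0

print(linear_classify([1.0, 0.0, 1.0], weights, threshold))  # weighted sum 2.5 -> True
print(linear_classify([1.0, 1.0, 0.0], weights, threshold))  # weighted sum -0.5 -> False
```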

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane[19]. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma: one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods[20], but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples[21]. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
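
The half-space limitation can be seen on a much smaller problem than wolves and Samoyeds. The sketch below uses the XOR pattern as a stand-in (it is not an example from the paper): no threshold on a weighted sum of the two raw inputs classifies all four points correctly, but adding one hand-designed non-linear feature, the product of the inputs, makes the problem linearly separable. The grid of candidate weights and the feature map are assumptions chosen for illustration.

```python
from itertools import product

# XOR pattern: the label is True when exactly one input is 1.
XOR = [((0, 0), False), ((0, 1), True), ((1, 0), True), ((1, 1), False)]


def classify(features, weights, threshold):
    """Two-class linear classifier: weighted sum compared with a threshold."""
    return sum(w * f for w, f in zip(weights, features)) > threshold


def separates(weights, threshold, feature_map):
    """True if the classifier labels every XOR point correctly."""
    return all(classify(feature_map(x), weights, threshold) == label
               for x, label in XOR)


def raw(x):
    """Use the two inputs as they are."""
    return x


def with_product(x):
    """Hand-designed feature extractor: append the product of the inputs."""
    return (x[0], x[1], x[0] * x[1])


# Try every weight pair and threshold on a coarse grid from -3.0 to 3.0.
grid = [i / 2 for i in range(-6, 7)]
raw_solutions = [(w, t) for w in product(grid, repeat=2) for t in grid
                 if separates(w, t, raw)]

print(len(raw_solutions))                                # 0: no candidate works
print(separates((1.0, 1.0, -2.0), 0.5, with_product))    # True
```

The empty grid search is only suggestive on its own, but the result holds in general: the two True points require each weight to exceed the threshold, which forces the weighted sum at (1, 1) above the threshold as well. Here the extra feature was designed by hand, which is exactly the engineering effort that learned features aim to avoid.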

To be continued…


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!