
Deep learning

Yann LeCun*, Yoshua Bengio*, Geoffrey Hinton

Abstract

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.


Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users' interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.


Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as 'knobs' that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
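
The gradient just described can be approximated numerically for any scalar error function by nudging each weight by a tiny amount and watching the error. A minimal sketch; the quadratic error surface here is an illustrative assumption, not an example from the paper:

```python
def numerical_gradient(f, w, eps=1e-6):
    """Approximate df/dw_i by central differences: by how much the
    error would increase or decrease if each weight were nudged."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w); w_plus[i] += eps
        w_minus = list(w); w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad

# Toy error surface: E(w) = (w0 - 1)^2 + (w1 + 2)^2
error = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
g = numerical_gradient(error, [0.0, 0.0])
# Adjusting the weights opposite to the gradient reduces the error.
w_new = [wi - 0.1 * gi for wi, gi in zip([0.0, 0.0], g)]
```

Moving against the gradient lowers the error here, exactly as the text describes; backpropagation (discussed later) computes the same quantities far more cheaply.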

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine: its ability to produce sensible answers on new inputs that it has never seen during training.
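
The SGD loop above can be sketched in a few lines. The least-squares objective, learning rate and toy data here are illustrative assumptions, not taken from the paper:

```python
import random

def sgd(examples, w, lr=0.05, epochs=200, batch=4, seed=0):
    """Stochastic gradient descent: show a few examples, compute the
    average gradient for them, adjust the weights, and repeat."""
    rng = random.Random(seed)
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch):
            chunk = data[start:start + batch]
            grad = [0.0] * len(w)
            for x, y in chunk:
                # Error of the objective 0.5*(w.x - y)^2 on this example
                err = sum(wi * xi for wi, xi in zip(w, x)) - y
                for i, xi in enumerate(x):
                    grad[i] += err * xi / len(chunk)  # noisy gradient estimate
            w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Fit the linear rule y = 2*x0 - x1 from a handful of examples.
train = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0),
         ([2.0, 1.0], 3.0), ([1.0, 2.0], 0.0), ([-1.0, 1.0], -3.0)]
weights = sgd(train, [0.0, 0.0])
```

Each small batch gives only a noisy estimate of the full gradient, yet the weights still settle close to the values that generated the data.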

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
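
The two-class linear classifier described above fits in a few lines; the feature values, weights and threshold below are made-up placeholders standing in for hand-engineered features:

```python
def linear_classify(features, weights, threshold):
    """Two-class linear classifier: a weighted sum of the feature
    vector components, compared against a threshold."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score > threshold else 0

# score = 0.5*2.0 + (-0.2)*1.0 = 0.8, which is above the threshold 0.5
label = linear_classify([2.0, 1.0], [0.5, -0.2], threshold=0.5)
```

The decision boundary of such a classifier is a hyperplane, which is exactly the limitation the next paragraphs discuss.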

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other 'shallow' classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma: one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details (distinguishing Samoyeds from white wolves) and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

Backpropagation to train multilayer architectures

From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
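
The chain-rule bookkeeping can be traced by hand on the smallest possible network. The two-layer net with one hidden unit, logistic non-linearity and squared error below is an assumption chosen for illustration, not the paper's setup:

```python
import math

def forward_backward(x, y, w1, w2):
    """One forward and one backward pass through a two-layer net,
    showing the chain rule at work. E = 0.5*(out - y)^2."""
    # Forward: input -> hidden -> output
    h_in = w1 * x
    h = 1.0 / (1.0 + math.exp(-h_in))   # logistic hidden unit
    out = w2 * h
    # Backward: gradient w.r.t. each module's output, top to bottom
    d_out = out - y                      # dE/d(out)
    d_w2 = d_out * h                     # dE/dw2
    d_h = d_out * w2                     # dE/dh, input of the module above
    d_h_in = d_h * h * (1.0 - h)         # back through the logistic
    d_w1 = d_h_in * x                    # dE/dw1
    return out, d_w1, d_w2
```

Checking these analytic gradients against finite differences is the standard sanity test; they agree to numerical precision.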

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(0, z). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(-z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).
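
A single feedforward layer of the kind just described is a weighted sum followed by a non-linearity. The weights and inputs below are arbitrary numbers for illustration:

```python
import math

def relu(z):
    """The half-wave rectifier f(z) = max(0, z)."""
    return max(0.0, z)

def logistic(z):
    """A smoother non-linearity, 1/(1 + exp(-z)), used in earlier nets."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weight_rows, nonlin):
    """One layer: each unit computes a weighted sum of its inputs
    from the previous layer and passes it through a non-linearity."""
    return [nonlin(sum(w * x for w, x in zip(row, inputs)))
            for row in weight_rows]

hidden = layer([2.0, 1.0], [[0.5, 0.25], [1.0, 1.0]], relu)
# unit 1: 0.5*2 + 0.25*1 = 1.25 ; unit 2: 1.0*2 + 1.0*1 = 3.0
```

Swapping `relu` for `logistic` in the call reproduces the older, smoother style of network; stacking such layers gives the multilayer architectures of Fig. 1.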

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima: weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

Interest in deep feedforward networks was revived around 2006 by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By 'pre-training' several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some 'source' tasks but very few for some 'target' tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

Convolutional neural networks

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
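
Computing one feature map means sliding a single shared filter over the input array. The 1x2 edge filter and the tiny image below are made up for illustration (real ConvNets learn their filter banks), and the sum is written in the cross-correlation orientation commonly used in ConvNet implementations:

```python
def feature_map(image, filt):
    """Slide one filter (a shared set of weights) over a 2-D array:
    each output unit is a weighted sum of a local patch, followed by
    a ReLU non-linearity -- a discrete convolution."""
    fh, fw = len(filt), len(filt[0])
    out = []
    for i in range(len(image) - fh + 1):
        row = []
        for j in range(len(image[0]) - fw + 1):
            s = sum(filt[a][b] * image[i + a][j + b]
                    for a in range(fh) for b in range(fw))
            row.append(max(0.0, s))   # ReLU after the local weighted sum
        out.append(row)
    return out

# A filter that responds where pixel values increase from left to right.
edge = [[-1.0, 1.0]]
img = [[0, 0, 5, 5],
       [0, 0, 5, 5]]
fmap = feature_map(img, edge)
```

The same weights fire wherever the motif (here, a vertical edge) appears, which is exactly the weight-sharing idea described above.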

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
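
The typical pooling unit described above is a maximum over a local patch. The 2x2 patch size and stride of 2 below are a common but assumed choice, and the feature-map values are arbitrary:

```python
def max_pool(fmap, size=2, stride=2):
    """Max pooling: each output unit is the maximum of a local patch,
    with neighbouring patches shifted by more than one row or column,
    coarse-graining positions and shrinking the representation."""
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[a][b]
                           for a in range(i, i + size)
                           for b in range(j, j + size)))
        out.append(row)
    return out

pooled = max_pool([[1, 3, 0, 2],
                   [4, 2, 1, 1],
                   [0, 0, 5, 6],
                   [0, 1, 2, 7]])
```

A 4x4 map shrinks to 2x2, and shifting a strong response by one position inside its patch leaves the pooled output unchanged: the invariance to small shifts the text describes.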

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey's inferotemporal cortex. ConvNets have their roots in the neocognitron, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands, and for face recognition.


Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.

Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization has reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups, to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.


Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).
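The first advantage can be made concrete with a toy count: n binary feature units jointly encode 2^n distinct patterns, whereas a purely local (one-of-N) code needs one unit per pattern. An illustrative check:

```python
from itertools import product

n = 4
# n binary feature units jointly encode 2**n distinct patterns,
# including combinations never seen during training.
distributed_patterns = list(product([0, 1], repeat=n))

# A purely local (one-of-N) code needs a separate unit per pattern.
local_units_needed = 2 ** n
```

Here 4 binary features cover 16 patterns; a local code would need all 16 units.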

The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple 'micro-rules'. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
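The one-of-N input coding and the first-layer word vectors described above can be sketched in a few lines; the toy vocabulary and the random (untrained) weight matrix here are illustrative assumptions:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]    # toy vocabulary
V, d = len(vocab), 3
rng = np.random.default_rng(1)
E = rng.standard_normal((V, d))               # first-layer weights: one row per word

def one_of_n(word):
    """One-of-N coding: one component is 1, the rest are 0."""
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# Multiplying the one-of-N vector by the first-layer weights simply
# selects that word's row, i.e. its word vector.
word_vector = one_of_n("cat") @ E
```

In a trained language model the rows of this matrix would be learned so that words used in similar contexts end up with similar vectors.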

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast 'intuitive' inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Fig. 4).
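The counting scheme behind N-gram models can be sketched with a toy bigram (N = 2) example; the corpus below is a made-up illustration, and the V^N blow-up shows why such tables become sparse as N grows:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count occurrences of length-2 symbol sequences (bigrams, N = 2).
bigrams = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p_next(word, context):
    """P(word | context) estimated from relative bigram frequencies."""
    return bigrams[(context, word)] / context_counts[context]

# 'the' is followed by 'cat' twice and by 'mat' once in this corpus.
p = p_next("cat", "the")   # 2/3
```

Because every word is an atomic unit here, nothing learned about "cat" transfers to "dog"; a neural language model shares statistical strength through the word vectors instead.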


Recurrent neural networks

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a 'state vector' that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
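A minimal forward pass makes the 'state vector' idea concrete; the dimensions and random weights below are illustrative assumptions, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 4                          # illustrative sizes
W_xh = 0.1 * rng.standard_normal((d_in, d_h))
W_hh = 0.1 * rng.standard_normal((d_h, d_h))

def rnn_forward(inputs):
    """Process the sequence one element at a time, carrying a 'state
    vector' h that implicitly summarizes all past elements."""
    h = np.zeros(d_h)
    states = []
    for x in inputs:
        h = np.tanh(x @ W_xh + h @ W_hh)  # same weights reused at every step
        states.append(h)
    # Backpropagating through this loop multiplies gradients by W_hh at
    # every step, which is why they tend to explode or vanish over time.
    return states

seq = rng.standard_normal((5, d_in))
states = rnn_forward(seq)
```

Each element of `states` plays the role of one 'layer' of the unfolded network, all sharing the same weights.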

Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English 'encoder' network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French 'decoder' network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.
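The decoding loop described above (pick a word from the predicted distribution, feed it back in, stop at the full stop) can be sketched as follows; `decoder_step` is a hypothetical stand-in for a trained decoder RNN, and the toy vocabulary is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["le", "chat", "est", "noir", "."]   # toy vocabulary; "." is the full stop

def decoder_step(thought_vector, prev_word):
    """Hypothetical stand-in for a trained decoder RNN: returns a
    probability distribution over the vocabulary for the next word."""
    logits = rng.standard_normal(len(vocab))
    p = np.exp(logits)
    return p / p.sum()

def generate(thought_vector, max_len=10):
    words, prev = [], None
    while len(words) < max_len:
        p = decoder_step(thought_vector, prev)
        prev = vocab[int(np.argmax(p))]       # greedy choice; one could sample from p
        words.append(prev)
        if prev == ".":                       # stop once the full stop is chosen
            break
    return words

sentence = generate(np.zeros(8))              # thought vector would come from the encoder
```

Sampling from `p` instead of taking the argmax yields the probability distribution over whole French sentences that the text describes.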

Instead of translating the meaning of a French sentence into an English sentence, one can learn to 'translate' the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently.
RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
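A stripped-down version of the memory cell described here (a self-connection of weight one, multiplicatively gated) can be written directly; this is a simplification of a full LSTM cell, and the weights are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell_step(c, x, w_in, w_forget):
    """One step of a simplified memory cell: the cell copies its own
    real-valued state through a self-connection of weight one and
    accumulates the external signal; a learned multiplicative gate
    decides when to clear the stored content."""
    gate = sigmoid(float(w_forget @ x))   # near 1 = remember, near 0 = clear
    return gate * c + float(w_in @ x)     # gated self-connection + new input

# Made-up weights: a strongly positive gate input keeps the memory open.
c = 0.0
w_in = np.array([0.5])
w_forget = np.array([10.0])
for _ in range(3):
    c = memory_cell_step(c, np.array([1.0]), w_in, w_forget)
# c has accumulated roughly 3 * 0.5, because the gate stays close to 1
```

If the gate input were strongly negative instead, the gate would sit near 0 and the cell would forget its state at each step.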

LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine, in which the network is augmented by a 'tape-like' memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.
Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught 'algorithms'. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game, and after reading a story, they can answer questions that require complex inference[90]. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as "where is Frodo now?".


The future of deep learning

Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.

Ul
timatel
y, major progress in artificialintel
ligence wil
lcome aboutth rough systems
th atcombine representationl
earningwith compl
exreasoning. A l
th ough deepl
earning
and simpl
e reasoning h ave been used for speech and h andwriting recognition for a
l
ongtime, new paradigmsare needed to repl
ace rul
e- based manipul
ationofsymbol
ic
expressionsby operationsonl
arge vectors.

