Article
SUMMARY

A core goal of auditory neuroscience is to build quantitative models that predict cortical responses to natural sounds. Reasoning that a complete model of auditory cortex must solve ecologically relevant tasks, we optimized hierarchical neural networks for speech and music recognition. The best-performing network contained separate music and speech pathways following early shared processing, potentially replicating human cortical organization. The network performed both tasks as well as humans and exhibited human-like errors despite not being optimized to do so, suggesting common constraints on network and human performance. The network predicted fMRI voxel responses substantially better than traditional spectrotemporal filter models throughout auditory cortex. It also provided a quantitative signature of cortical representational hierarchy—primary and non-primary responses were best predicted by intermediate and late network layers, respectively. The results suggest that task optimization provides a powerful set of tools for modeling sensory systems.

INTRODUCTION

Human listeners extract a remarkable array of information about the world from sound. These abilities are enabled by neuronal processing that transforms the sound waveform entering the ear into cortical representations thought to render behaviorally important sound properties explicit. Although much is known about the peripheral processing of sound, auditory cortex is less understood, particularly in computational terms. There is growing consensus that frequency and modulation tuning explain aspects of primary auditory cortical responses (Depireux et al., 2001; Humphries et al., 2010; Miller et al., 2002; Santoro et al., 2014), but the organization of the rest of auditory cortex into regions and pathways remains unresolved, particularly in humans (Norman-Haignere et al., 2015; Rauschecker and Scott, 2009; Recanzone and Cohen, 2010).

Our understanding of auditory cortex is limited in part by the lack of quantitative models of how neural circuitry transforms sound waveforms into representations that enable behavior. Existing models of auditory processing are mostly limited to one or two stages, typically based on linear filtering of spectrogram-like input (Carlson et al., 2012; Chi et al., 2005; Dau et al., 1997; McDermott and Simoncelli, 2011; Młynarski and McDermott, 2018). Such models explain aspects of auditory perception (McDermott et al., 2013; Patil et al., 2012) and cortical responses (Norman-Haignere et al., 2015; Santoro et al., 2014; Schönwiesner and Zatorre, 2009), but they are clearly incomplete. Neural responses are known to be nonlinear functions of the spectrogram (Christianson et al., 2008; David et al., 2009), and state-of-the-art machine hearing systems are highly nonlinear (Hershey et al., 2017), suggesting that auditory recognition requires invariances that cannot be obtained from the linear operations typically employed in auditory models.

In this paper, we develop a multi-stage computational model that performs real-world auditory tasks. The underlying hypothesis was that everyday recognition tasks may impose strong constraints on the auditory system, such that a model optimized to perform such tasks might converge to brain-like representational transformations. We optimized a deep neural network to map sound waveforms to behaviorally meaningful categories (words or musical genres), leveraging recent advances in what has become known as deep learning (LeCun et al., 2015). Although some aspects of such networks deviate substantially from biological systems, they are currently the only known model class that attains human-level performance on many real-world classification tasks. Following early hopes that such models would yield biological insights (Lehky and Sejnowski, 1988; Zipser and Andersen, 1988), contemporary deep neural networks have been shown to replicate key aspects of visual system
limitation (e.g., the number of neurons). The resulting network architecture is consistent with recent evidence for segregated speech and music pathways in non-primary auditory cortex (Angulo-Perkins et al., 2014; Leaver and Rauschecker, 2010; Norman-Haignere et al., 2015; Tierney et al., 2013).

Comparison of Model and Human Behavior
A complete model of the auditory system should replicate human auditory behavior. We thus began by measuring human performance for the word and genre tasks, comparing both absolute performance and the pattern of errors to that of the network. For the word classification task, listeners typed what they heard using an interface that auto-completed the 587 words. For the genre classification task, listeners selected the five most likely genres for the 2 s excerpt they heard. A "top 5" task was used to ensure that the task was reasonably well defined given overlap between different genres (e.g., "New Age" versus "Ambient"; see Figure S6B for overlap matrix). In both cases the speech and music excerpts were presented in different types of background noise at a range of SNRs. We measured network performance on the same stimuli and tasks.

Word Recognition Behavioral Comparison
Human listeners' word recognition performance improved with SNR, as expected, but some types of background noise impaired performance more than others (Figure 2A). The network exhibited similar absolute performance levels and similar dependence of performance on background noise (r² = 0.92, p < 10⁻¹³; Figures 2B and 2C). The pattern of performance was not explained by simple measures of distortion in the cochlear representation of speech (r² = 0.37, lower than that with network performance, p < 10⁻⁴; Figures 2D–2F). Figure S1A shows similar results when restricting the distortion measure to only those cochleagram bins with substantial speech energy; Figure S1B shows the results of distortion measured in network layers.
Figure 4. Using the Network to Probe Hierarchical Organization in Human Auditory Cortex
(A) Median variance explained across all voxels in auditory cortex for each network layer. Orange denotes results for trained network (break in curve corresponds
to branch point). Black denotes results for identical architecture but with random, untrained filter weights. Gray line plots the variance explained by the spec-
trotemporal filter model for comparison. Error bars are within-subject SEM.
(B) Map of best-predicting layer for each voxel (median across subjects). Inset on right shows color scale and histograms for each hemisphere.
(C) Map of the difference in voxel response variance explained by later and intermediate network layers for the trained network (conv5 versus conv3). Maps on left
show mean across subjects; right is three example individual subjects. Below is histogram/color bar of all values for each hemisphere. Black outlines show three
anatomically defined sub-divisions of primary auditory cortex (TE 1.1, 1.0, and 1.2), taken from probabilistic maps (Morosan et al., 2001).
(D) Map of the difference in voxel variance explained by higher and lower network layers for the untrained, random-filter network (conv5 versus conv3).
Conventions, including color scale, are same as (C).
(E) Map of the difference in variance explained by later and intermediate layers (conv5 versus conv3), excluding speech and music stimuli from the regressions.
Analogous to (C), with same conventions and color scale as (C) and (D).
(F) Layerwise variance explained in functionally localized regions of interest for the trained network and random-filter network. Same plotting conventions as (A)
were used, except that, for clarity, only intermediate layers (where predictions exceeded those of the spectrotemporal model) are plotted. Error bars are within-
subject SEM.
DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: May 18, 2017
Revised: December 22, 2017
Accepted: March 23, 2018
Published: April 19, 2018

STAR+METHODS

Detailed methods are provided in the online version of this paper and include the following:

- KEY RESOURCES TABLE
- CONTACT FOR REAGENT AND RESOURCE SHARING
- EXPERIMENTAL MODEL AND SUBJECT DETAILS
  - Word & genre recognition psychophysics

SUPPLEMENTAL INFORMATION

Supplemental Information includes seven figures and three tables and can be found with this article online at https://doi.org/10.1016/j.neuron.2018.03.044.

ACKNOWLEDGMENTS

The authors thank Nancy Kanwisher for sharing fMRI data, Steve Shannon for MR support, Kevin Woods for analysis of speakers for the representational

REFERENCES

Cammoun, L., Thiran, J.P., Griffa, A., Meuli, R., Hagmann, P., and Clarke, S. (2015). Intrahemispheric cortico-cortical connections of the human auditory cortex. Brain Struct. Funct. 220, 3537–3553.
Carlson, N.L., Ming, V.L., and Deweese, M.R. (2012). Sparse codes for speech predict spectrotemporal receptive fields in the inferior colliculus. PLoS Comput. Biol. 8, e1002594.
Chang, E.F., Rieger, J.W., Johnson, K., Berger, M.S., Barbaro, N.M., and Knight, R.T. (2010). Categorical speech representation in human superior temporal gyrus. Nat. Neurosci. 13, 1428–1432.
Formisano, E., De Martino, F., Bonte, M., and Goebel, R. (2008). "Who" is saying "what"? Brain-based decoding of human voice and speech. Science 322, 970–973.
Garofolo, J.S., and Consortium, L.D. (1993). TIMIT: Acoustic-Phonetic Continuous Speech Corpus (Linguistic Data Consortium).
Glasberg, B.R., and Moore, B.C.J. (1990). Derivation of auditory filter shapes from notched-noise data. Hear. Res. 47, 103–138.
Güçlü, U., and van Gerven, M.A.J. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J. Neurosci. 35, 10005–10014.
Hershey, S., Chaudhuri, S., Ellis, D.P.W., Gemmeke, J.F., Jansen, A., Moore, C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B., et al. (2017). CNN architectures for large-scale audio classification. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135.
Humphries, C., Liebenthal, E., and Binder, J.R. (2010). Tonotopic organization of human auditory cortex. Neuroimage 50, 1202–1211.
Huth, A.G., de Heer, W.A., Griffiths, T.L., Theunissen, F.E., and Gallant, J.L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458.
Jones, E., Oliphant, T.E., and Peterson, P. (2001). SciPy: Open source scientific tools for Python.
Kaas, J.H., and Hackett, T.A. (2000). Subdivisions of auditory cortex and processing streams in primates. Proc. Natl. Acad. Sci. USA 97, 11793–11799.
Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T., and Banno, H. (2008). TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing (IEEE International Conference), pp. 3933–3936.
Khaligh-Razavi, S.-M., and Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10, e1003915.
Montufar, G., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, eds. (Curran Associates), pp. 2924–2932.
Morosan, P., Rademacher, J., Schleicher, A., Amunts, K., Schormann, T., and Zilles, K. (2001). Human primary auditory cortex: Cytoarchitectonic subdivisions and mapping into a spatial reference system. Neuroimage 13, 684–701.
Mustafa, K., and Bruce, I.C. (2006). Robust formant tracking for continuous speech with speaker variability. IEEE Trans. Audio Speech Lang. Process. 14, 435–444.
Naselaris, T., Kay, K.N., Nishimoto, S., and Gallant, J.L. (2011). Encoding and decoding in fMRI. Neuroimage 56, 400–410.
Norman-Haignere, S., Kanwisher, N., and McDermott, J.H. (2013). Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. J. Neurosci. 33, 19451–19469.
Norman-Haignere, S., Kanwisher, N.G., and McDermott, J.H. (2015). Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron 88, 1281–1296.
Obleser, J., Leaver, A.M., Vanmeter, J., and Rauschecker, J.P. (2010). Segregation of vowels and consonants in human auditory cortex: evidence for distributed hierarchical organization. Front. Psychol. 1, 232.
Okada, K., Rong, F., Venezia, J., Matchin, W., Hsieh, I.H., Saberi, K., Serences, J.T., and Hickok, G. (2010). Hierarchical organization of human auditory cortex: evidence from acoustic invariance in the response to intelligible speech. Cereb. Cortex 20, 2486–2495.
Oliphant, T.E. (2006). Guide to NumPy (Brigham Young University).
Overath, T., McDermott, J.H., Zarate, J.M., and Poeppel, D. (2015). The cortical analysis of speech-specific temporal structure revealed by responses to sound quilts. Nat. Neurosci. 18, 903–911.
Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Alex Kell (alexkell@
mit.edu).
METHOD DETAILS
Convolutional layer
For any input array X of shape (nin, nin, dchannels_in), the output of a convolutional layer is an array Y of shape (nin/ls, nin/ls, nk):

Y(i, j, k) = \mathrm{relu}\left( b[k] + \frac{1}{k_s^2} \sum K[:, :, :, k] \odot N_{k_s}(X,\ l_s \cdot i,\ l_s \cdot j) \right),

where i and j range over (1, ..., nin/ls), ⊙ denotes pointwise array multiplication, and Nks(X, q, r) selects the square neighborhood across adjacent time and frequency bins of size ks by ks, centered at location (q, r) in X, returning an array of shape (ks, ks, dchannels_in). The sum is over all elements of this 3D array. The convolution is done in "same" mode, meaning that the edges are padded with zeros to produce an output that would have the same dimensionality as the input if the stride were set to 1. Finally, relu denotes the rectified linear operator, a pointwise nonlinearity applied to every element of the output:

\mathrm{relu}(x) = \max(0, x).
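For concreteness, the following is a minimal numpy sketch of this operation, written as an explicit loop with 0-based indexing and the 1/ks² scaling as reconstructed above; it is illustrative only, not the TensorFlow implementation used for the actual networks.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv_layer(X, K, b, ls):
    """Sketch of the convolutional layer defined above.

    X: input of shape (n_in, n_in, d_in); K: kernels of shape (ks, ks, d_in, n_k);
    b: biases of shape (n_k,); ls: stride in time and frequency.
    """
    n_in, _, d_in = X.shape
    ks, _, _, n_k = K.shape
    n_out = n_in // ls
    pad = ks // 2
    # "same" zero-padding: a stride-1 output would match the input size
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    Y = np.zeros((n_out, n_out, n_k))
    for i in range(n_out):
        for j in range(n_out):
            # square time-frequency neighborhood centered at (ls*i, ls*j)
            patch = Xp[i * ls : i * ls + ks, j * ls : j * ls + ks, :]
            for k in range(n_k):
                Y[i, j, k] = relu(b[k] + (patch * K[:, :, :, k]).sum() / ks**2)
    return Y
```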
Normalization layer
A normalization layer implements divisive normalization of a unit by its neighboring filters at the identical time-frequency bin. It is defined by three parameters, α, β, and nadjacent, which had values of 0.001, 0.75, and 5, respectively. It operates on an array X of shape (nin, nin, dchannels_in) and returns an array Y of the identical shape:

Y(i, j, k) = \frac{X[i, j, k]}{\left( 1 + \alpha \sum_{l \in A} X[i, j, l]^2 \right)^{\beta}},

where A is the set of neighboring input channels whose responses are included in the normalization and k indexes filter kernels. A is defined as [k − (nadjacent − 1)/2, k − (nadjacent − 1)/2 + 1, ..., k, ..., k + (nadjacent − 1)/2 − 1, k + (nadjacent − 1)/2], which in our case of nadjacent = 5 results in A = [k−2, k−1, k, k+1, k+2]. This procedure thus normalizes a filter's response in a given time-frequency bin by the responses of the other filters at that same time-frequency bin. We handled boundaries by zero-padding.
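A minimal numpy sketch of this normalization, assuming the parameter values given above, might look as follows (illustrative only):

```python
import numpy as np

def norm_layer(X, alpha=0.001, beta=0.75, n_adjacent=5):
    """Sketch of the divisive normalization layer described above.

    Each unit X[i, j, k] is divided by a function of the squared responses of
    neighboring channels at the same time-frequency bin, with zero-padding
    at the channel boundaries.
    """
    half = (n_adjacent - 1) // 2
    d = X.shape[-1]
    Xp = np.pad(X, ((0, 0), (0, 0), (half, half)))  # zero-pad the channel axis
    Y = np.empty_like(np.asarray(X, dtype=float))
    for k in range(d):
        neighborhood = Xp[:, :, k : k + n_adjacent]  # channels k-2 .. k+2
        denom = (1.0 + alpha * (neighborhood ** 2).sum(axis=-1)) ** beta
        Y[:, :, k] = X[:, :, k] / denom
    return Y
```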
Pooling layer
A pooling layer downsamples its input by aggregating values across nearby time and frequency bins. It operates on an array X of shape (nin, nin, dchannels_in) and returns an array Y of shape (nin/ls, nin/ls, dchannels_in) via the following function:

Y(i, j, k) = \left( \frac{1}{p_s^2} \sum N_{p_s}(X,\ l_s \cdot i,\ l_s \cdot j)[:, :, k]^{p_o} \right)^{1/p_o},

where Nps is the local square neighborhood in time and frequency. The sum is over all elements in the resulting (ps, ps) matrix. po = 1 yields pooling via the mean, po = 2 yields pooling via the root mean square ("L2 pooling"), and po = ∞ yields max pooling.
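A minimal numpy sketch of this pooling operation (illustrative only; it assumes nonnegative inputs, as is the case after a relu):

```python
import numpy as np

def pool_layer(X, ps, ls, po=2):
    """Sketch of the pooling layer described above.

    X: array of shape (n_in, n_in, d_channels); ps: pooling window size;
    ls: stride; po: pooling exponent (1 = mean, 2 = RMS, large po ~ max).
    """
    n_in, _, d = X.shape
    n_out = n_in // ls
    pad = ps // 2
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    Y = np.zeros((n_out, n_out, d))
    for i in range(n_out):
        for j in range(n_out):
            patch = Xp[i * ls : i * ls + ps, j * ls : j * ls + ps, :]
            # average the po-th power over the (ps, ps) window, then take the 1/po root
            Y[i, j, :] = ((patch ** po).mean(axis=(0, 1))) ** (1.0 / po)
    return Y
```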
Fully connected layer
Each unit in a fully connected layer computes a rectified weighted sum of its inputs:

y(i) = \mathrm{relu}\left( \sum_{j} w_{ij} x_j \right),

where j iterates over the input elements from the preceding layer, wij is a weight (a real-valued number), and the pointwise nonlinearity (relu) is as defined before. If the preceding layer is another fully connected layer, the input is a vector of length nT. If the preceding layer is a convolutional layer, a normalization layer, or a pooling layer, then the input is an array of shape (nin, nin, dchannels_in) unwrapped into a single vector X of length nT = nin² × dchannels_in.
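As a small illustration (names and shapes here are hypothetical), the flattening and fully connected computation can be sketched as:

```python
import numpy as np

def fully_connected(x, W):
    """Sketch of a fully connected layer: y_i = relu(sum_j W_ij * x_j)."""
    return np.maximum(0.0, W @ x)

# Flatten a (n_in, n_in, d_channels) activation into a vector of length
# n_T = n_in**2 * d_channels before the first fully connected layer.
activations = np.random.rand(8, 8, 512)
x = activations.reshape(-1)
W = np.random.randn(4096, x.size) * 0.01   # 4096 fully connected units
y = fully_connected(x, W)
```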
Dropout during training
During training, we instantiated a dropout layer after each fully connected layer that was not the final classification layer. During each
batch of training data, a randomly drawn fifty percent of the fully connected layer connections were eliminated (i.e., their connection
weight was set to zero). Dropout is common in neural network training and can be seen as a form of model averaging. For evaluation,
we followed the conventional procedure by removing this dropout layer and multiplying the output of each fully connected unit by the
probability of dropout during training (i.e., 0.5).
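A sketch of this scheme in numpy (the standard dropout formulation; not the paper's exact TensorFlow code):

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=None):
    """Sketch of dropout as described above."""
    rng = np.random.default_rng() if rng is None else rng
    if training:
        # zero a randomly drawn half of the values on each training batch
        mask = rng.random(x.shape) >= p_drop
        return x * mask
    # evaluation: dropout is removed and outputs are scaled by 0.5, per the text
    return x * 0.5
```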
Softmax classifier
The final layer in the network is a fully connected layer where nout is the number of target classes (587 or 41, respectively in the case of
the word or genre task) and the relu operator is replaced with a softmax, an operation that receives a vector as its input and whose
element-wise output is defined as:
y(i) = \frac{\exp\left( \sum_{j=1}^{n_T} w_{ij} x_j \right)}{\sum_{k=1}^{n_{out}} \exp\left( \sum_{l=1}^{n_T} w_{kl} x_l \right)}.
Each element of the output of the softmax is nonnegative and together they sum to one, and can thus be interpreted as a probability
distribution over the classes (either words or genres).
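A minimal numpy sketch of this classification layer (the max-subtraction is a standard numerical-stability trick, not part of the equation above):

```python
import numpy as np

def softmax_layer(x, W):
    """Sketch of the softmax classifier: x has length n_T, W has shape (n_out, n_T),
    where n_out is the number of classes (587 words or 41 genres)."""
    logits = W @ x                 # sum_j w_ij * x_j for each class i
    logits -= logits.max()         # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()     # nonnegative and sums to one
```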
Convolving in frequency
The convolutional neural networks we used in this paper convolved filter kernels with the cochleagram in both time and frequency.
Applying a given kernel in the first layer yields an nspectral by ntemporal feature map – i.e., outputs values at ntemporal different time bins for
each of nspectral different frequencies. Convolving in time is natural for a model of the auditory system because sound waveforms are
naturally ordered in a temporal sequence. The naturalness of convolving in frequency may be less obvious. However, convolution in
frequency can be conceptualized as having nspectral different model units, each centered at a different center frequency, where each
of the nspectral units is constrained to have the same filter weights. Instantiating the same filter at different frequencies might be
reasonable given that sounds are to some extent translation invariant in frequency, and is present in standard models of spectrotem-
poral filtering (Chi et al., 2005). Furthermore, we empirically found in pilot experiments that task performance was better when convo-
lution was applied in frequency as well as time, likely because this substantially reduces the number of parameters to be learned and
thus may have acted as a useful form of regularization.
- The first six stages of processing consisted of the following layers in the following order: convolution, normalization, pooling, convolution, normalization, pooling. Although this order was identical across all architectures we considered, we varied some of the architectural hyperparameters that define each of these stages, as we describe below.
Additionally, the number of filters in each of these potential convolutional layers was fixed (in order of convolutional layer number):
96, 256, 512, 1024, and 512. The number of fully connected units (other than the output layer) was set to 4096. The pooling window
size was set to three, and normalization was performed over neighborhoods of five filters. To simplify our search over architectures,
we also fixed the convolutional filters to be ‘‘square,’’ with equal spectral and temporal extent.
The choice of this template architecture was made based on pilot experiments and experience with networks trained on visual ob-
ject recognition tasks. Within this template architecture, there were a number of degrees of freedom that we searched over:
During training, there was a dropout layer after fc1_W and fc1_G.
Although most of our analyses are based on this final network, we also report voxel prediction results for the two totally separate
networks (i.e., that shared no layers), to examine the effects of training on only one of the two tasks (Figure S3E).
Control model: A spectrotemporal modulation filter bank
To compare the neural network to an existing model of auditory cortex, we also performed analyses with a standard spectrotemporal
filter model (Chi et al., 2005). The model consists of a bank of linear filters tuned to spectrotemporal modulations at different acoustic
frequencies, spectral scales, and temporal rates (see Figure 1F for example filters). We used the NSL MATLAB Toolbox (http://www.
isr.umd.edu/Labs/NSL/Software.htm) implementation of the spectrotemporal model, with 63 audio frequencies (ranging between
20 Hz and 8 kHz), 17 temporal rates (logarithmically spaced between 0.5 Hz and 128 Hz), and 13 spectral scales (logarithmically
spaced between 0.125 and 8 cycles/octave), yielding a spectrotemporal modulation filter bank with 29,736 filters: 1071 temporal modulation filters (17 rates by 63 frequencies), 819 spectral modulation filters (13 scales by 63 frequencies), and 27,846 spectrotemporal filters (17 temporal rates by 13 spectral scales by 63 frequencies, by two orientations corresponding to upward and downward
frequency modulations). The filters were applied to a cochleagram. The cochleagrams that were used as input to the CNN were re-
sampled in frequency to match the log spacing of the spectrotemporal model. To evaluate the model response we passed a sound
through the model, extracted the magnitude (absolute value) of each filter’s response at each time step and averaged these magni-
tudes over time, to get 29,736 measures for each sound. To ensure that the comparison with the spectrotemporal filter model was
fair, we verified that this parameterization of the spectrotemporal filter model saturated the amount of variance that a spectrotem-
poral filter model could explain in the voxel data (see Figure S2C, and methods below for more details).
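The filter-count bookkeeping and the time-averaging of filter magnitudes can be illustrated as follows (the actual filtering used the NSL MATLAB toolbox; the filter outputs below are random placeholders):

```python
import numpy as np

n_freqs, n_rates, n_scales = 63, 17, 13
n_temporal = n_rates * n_freqs                 # 1071 temporal modulation filters
n_spectral = n_scales * n_freqs                # 819 spectral modulation filters
n_joint = n_rates * n_scales * n_freqs * 2     # 27,846 joint filters (2 orientations)
n_filters = n_temporal + n_spectral + n_joint  # 29,736 filters in total

# Given each filter's (complex) output over time, the feature for a sound is
# the time-averaged response magnitude, one value per filter.
n_timesteps = 200
outputs = np.random.randn(n_filters, n_timesteps) + 1j * np.random.randn(n_filters, n_timesteps)
features = np.abs(outputs).mean(axis=1)
assert features.shape == (29736,)
```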
Using neural network layers as voxelwise encoding models to predict cortical responses
Feature extraction from CNN
We first generated cochleagrams for each of the 165 sounds from the fMRI experiment and passed them through the CNN, recording
each model unit’s response in each layer to each sound (schematic in Figure 3A). We performed this procedure for all sounds and
layers.
Because the neural network’s input (a cochleagram) had a temporal dimension, the extracted responses from all layers (except for
the fully connected layers) had a temporal dimension. This temporal component was critical to the computation of the features
throughout the layers of the deep network, as temporal information at each layer is integrated by model filters at the succeeding layer
to compute a signal that depends on both time and frequency. However, the hemodynamic signal to which we were comparing the
model blurs the temporal variation of the cortical response, thus a fair comparison of the model to the fMRI data involved predicting
each voxel’s time-averaged response to each sound from time-averaged model responses. We therefore averaged the model re-
sponses over the temporal dimension after extraction. As a result of the relu and softmax operators, all responses in all layers
were nonnegative, such that averaging preserved the mean response amplitude. For each layer that had a temporal dimension
(i.e., each convolutional layer, normalization layer, and pooling layer), we first extracted a three-dimensional array of responses of
shape (nspectral, ntemporal, nkernels). We then reshaped this 3d array into a matrix where each row corresponded to a kernel at a given
frequency and each column corresponded to a time point. We finally took the average of each row over columns, yielding for each
sound a vector whose length was the product of nspectral and nkernels.
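A sketch of this reshaping and time-averaging step in numpy (shapes are illustrative):

```python
import numpy as np

n_spectral, n_temporal, n_kernels = 64, 32, 128
layer_response = np.random.rand(n_spectral, n_temporal, n_kernels)  # one sound

# Rows are (frequency, kernel) pairs, columns are time points; average over time.
mat = layer_response.transpose(0, 2, 1).reshape(n_spectral * n_kernels, n_temporal)
regressors = mat.mean(axis=1)
assert regressors.shape == (n_spectral * n_kernels,)
```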
As a result of this procedure, each layer produced the following number of regressors for each sound: conv1 (8160), rnorm1 (8160),
pool1 (4032), conv2 (5632), rnorm2 (5632), pool2 (2816), conv3 (6656), conv4_W (15360), conv4_G (15360), conv5_W (8704),
conv5_G (8704), pool5_W (4096), and pool5_G (4096). Each fully connected layer did not have a temporal dimension, and so we sim-
ply extracted each model unit’s response yielding the following number of features: fc1_W (4096), fc1_G (4096), fctop_W (587), and
fctop_G (41). We used each of these 17 feature sets to predict the fMRI responses to the natural sound stimuli.
Voxelwise modeling: Regularized linear regression and cross validation
We modeled each voxel’s time-averaged response as a linear combination of a layer’s time-averaged unit responses. We first gener-
ated 10 randomly selected train/test splits of the 165 sound stimuli into 83 training sounds and 82 testing sounds. For each split, we
estimated a linear map from model units to voxels on the 83 training stimuli and evaluated the quality of the prediction using the re-
maining 82 testing sounds (described below in greater detail). For each voxel-layer pair, we took the median across the 10 splits.
The linear map was estimated using regularized linear regression. Given that the number of regressors (i.e., time-averaged model
units) typically exceeded the number of sounds used for estimation (83), regularization was critical. We used L2-regularized (‘‘ridge’’)
regression, which can be seen as placing a zero-mean Gaussian prior on the regression coefficients. Introducing the L2-penalty on
the weights results in a closed-form solution to the regression problem, which is similar to the ordinary least-squares regression
normal equation:
w = \left( X^T X + n \lambda I \right)^{-1} X^T y,
where w is a d-length column vector (the number of regressors – i.e., the number of time-averaged units for the given layer), y is an
n-length column vector containing the voxel’s mean response to each sound (length 83), X is a matrix of regressors (n stimuli by d re-
gressors), n is the number of stimuli used for estimation (83), λ is the ridge regularization parameter, and I is the identity matrix (d by d). We demeaned each column of the regressor matrix. The noise-corrected variance explained by a layer's prediction of a voxel's response was quantified as:

r^2_{v, \hat{v}} = \frac{r(v_{123}, \hat{v}_{123})^2}{r'_{v}\, r'_{\hat{v}}},
where v123 is the voxel response to the 82 left-out sounds averaged over the three scans, v̂123 is the predicted response to the 82 left-out sounds (with regression weights learned from the other 83 sounds), r is a function that computes the correlation coefficient, r'v is the estimated reliability of that voxel's response to the 83 sounds, and r'v̂ is the estimated reliability of that predicted voxel's response. r'v is the median of the correlation between all 3 pairs of scans (scan 0 with scan 1; scan 1 with scan 2; and scan 0 with scan 2), which is then Spearman-Brown corrected to account for the increased reliability that would be expected from tripling the amount of data (Spearman, 1910). Figure S2A shows the histograms of the pairwise correlation and the Spearman-Brown corrected correlation (i.e., the estimate of the reliability of the average data across three scans). r'v̂ is analogously computed by taking the median of the correlations for all pairs of predicted responses and Spearman-Brown correcting this measure. Note that for very noisy voxels, this division by the estimated reliability can be unstable and can cause corrected variance explained measures to exceed one.
To ameliorate this problem, we limited both the reliability of the prediction and the reliability of the voxel response to be greater than
some value k (Huth et al., 2016). For k = 1, the denominator would be constrained to always equal one, and thus the "corrected" variance explained measure would be identical to the uncorrected value. For k = 0, the corrected estimated variance explained measure is unaffected by the value of k. This k-correction can be seen through the lens of a bias-variance tradeoff: the correction reduces the amount of variance in the estimate of variance explained across different splits of stimuli, but does so at the expense of a downward bias of those variance explained metrics (by inflating the reliability measure for unreliable voxels). We used a k of 0.128, which is the
p < 0.05 significance threshold for the correlation of two 165-dimensional Gaussian variables (i.e., with the same length as our 165-
dimensional voxel response vectors).
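The two computations above can be sketched as follows (a toy example with random data; the regularization strength and the reliability values are placeholders, and the paper's procedure for choosing the ridge parameter is not reproduced here):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form L2-regularized regression: w = (X'X + n*lam*I)^(-1) X'y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def corrected_r2(v, v_hat, r_v, r_vhat, k=0.128):
    """Noise-corrected variance explained: r(v, v_hat)^2 divided by the product
    of the voxel and prediction reliabilities, each floored at k."""
    r = np.corrcoef(v, v_hat)[0, 1]
    return r**2 / (max(r_v, k) * max(r_vhat, k))

rng = np.random.default_rng(0)
X_train, X_test = rng.standard_normal((83, 500)), rng.standard_normal((82, 500))
y_train, y_test = rng.standard_normal(83), rng.standard_normal(82)
mu = X_train.mean(axis=0)                       # demean each regressor column
w = ridge_weights(X_train - mu, y_train, lam=1.0)
y_pred = (X_test - mu) @ w
print(corrected_r2(y_test, y_pred, r_v=0.8, r_vhat=0.7))
```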
Voxelwise modeling: Summary
We repeated this procedure for each layer and voxel ten times, once each for 10 random train/test splits, and took the median ex-
plained variance across the ten splits for a given layer-voxel pair. We performed this procedure for all 17 layers and all 7694 selected
individual voxels. For comparison, we performed an identical procedure with the layers of a random-filter network with the same base
architecture as our main network (Figures 3B, 4A, 4D, 4F, S2D, and S3B), single-task networks (Figure S2E), and for the time-aver-
aged magnitudes for a bank of spectrotemporal filters (Figures 3B, 4A, 4B, S2C, and S3B–S3E). Additionally, we performed the
regression for 78 randomly chosen random-filter networks with architectures that were not selected, and we summarized these pre-
dictions in ROIs by taking the median across all networks (Figures 3B and S3B).
When computing mean r2 values throughout this paper, we averaged the values after Fisher transforming the correlation values.
That is, we took the square root of each of the r2 values we sought to average, performed the Fisher r-to-z transform, took the average
across these values, performed the inverse of the Fisher z-to-r transform, and then squared the result. Averaging z-transformed
values is appealing because the sampling distribution of correlation coefficients is skewed, and averaging in z-space reduces the
bias of the estimate of the true mean.
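A small numpy sketch of this averaging procedure:

```python
import numpy as np

def mean_r2_via_fisher(r2_values):
    """Average r^2 values as described above: sqrt to r, Fisher r-to-z,
    average in z, transform back to r, then square."""
    r = np.sqrt(np.asarray(r2_values))
    z = np.arctanh(r)              # Fisher r-to-z transform
    return np.tanh(z.mean()) ** 2  # inverse transform, then square

print(mean_r2_via_fisher([0.2, 0.5, 0.8]))
```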
All regression and analysis code was written in Python, making heavy use of the numpy and scipy libraries (Jones et al., 2001; Oli-
phant, 2006).
Varying the parameterization of the spectrotemporal filter bank
For voxelwise predictions, our baseline model was a spectrotemporal filter bank (Chi et al., 2005). To give as generous a comparison as possible, we sought to maximize the ability of the spectrotemporal model to explain cortical responses. We generated thirty-six different parameterizations of the spectrotemporal model, each with 63 acoustic frequencies, crossing six different sets of temporal rates with six different sets of spectral scales, all of which were logarithmically spaced.
Temporal rates: 3 rates (0.5, 4, 32 Hz), 5 rates (0.5, 2, 8, 32, 128 Hz), 9 rates (0.5, 1, 2, 4, 8, 16, 32, 64, 128 Hz), 13 rates (0.5, 0.79,
1.26, 2, 3.17, 5.04, 8, 12.70, 20.16, 32, 50.80, 80.63, 128 Hz), 17 rates (0.5, 0.71, 1, 1.41, 2, 2.83, 4, 5.66, 8, 11.31, 16, 22.63, 32, 45.25,
64, 90.51, 128 Hz), 23 rates (0.5, 0.66, 0.87, 1.15, 1.52, 2, 2.64, 3.48, 4.59, 6.06, 8, 10.56, 13.93, 18.38, 24.25, 32, 42.22, 55.72, 73.52,
97.01, 128 Hz).
Spectral scales: 2 scales (0.25, 2 cycles/octave), 4 scales (0.125, 0.5, 2, 8 cycles/octave), 7 scales (0.125, 0.25, 0.5, 1, 2, 4, 8 cy-
cles/octave), 10 scales (0.125, 0.20, 0.32, 0.5, 0.79, 1.26, 2, 3.17, 5.04, 8 cycles/octave), 13 scales (0.125, 0.18, 0.25, 0.35, 0.5, 0.71,
1.0, 1.41, 2.0, 2.83, 4.0, 5.66, 8 cycles/octave), 16 scales (0.125, 0.16, 0.22, 0.29, 0.38, 0.5, 0.66, 0.87, 1.15, 1.52, 2, 2.64, 3.48, 4.59,
6.06, 8 cycles/octave).
Examining relationship between network task performance and auditory cortical variance explained
To examine the relationship between how well a network performed the task on which it was trained and how well it predicted audi-
tory cortical responses, we examined task performance and voxelwise variance explained in a subset of the networks that were
generated in the process of our first step of architectural hyperparameter search (Figure 7). We examined a random subset of net-
works rather than all networks because of practical constraints due to the amount of compute time and disk space required to
perform the following analysis. We examined fifty-seven different architectures at fourteen different time points during task training
(for a total of 798 different networks). Each network was trained for either word recognition or genre classification (respectively, 392
and 406 networks), and we measured how well each network performed the task for which it was trained using approximately 25,000
left-out validation stimuli. On each of these 798 networks, we trained a softmax classifier to convergence (using SGD) while holding
the rest of the network parameters constant.
For each of these 798 networks we measured the median voxelwise variance explained, across all selected individual-subject vox-
els in auditory cortex. For a given voxel and a given network layer, we performed a split-half (across sounds) cross-validation scheme similar to the one described above, except that we predicted responses for each scan separately. We determined which layer best pre-
dicted each voxel by averaging the variance explained across the predictions of each layer for two sessions, taking the argmax
across layers, and then noting the variance explained by that layer in the left-out third session. We repeated this procedure three
times, leaving each session out. We took the median of the explained variance for the three left-out splits. For each network, we iter-
ated this procedure over all 7694 selected voxels, and summarized how well that network predicted auditory cortical responses by
taking the median across these 7694 voxelwise variance explained measures. We iterated this procedure over all 798 neural
networks.
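The session-wise layer-selection procedure can be sketched as follows for a single voxel (the variance-explained values below are random placeholders):

```python
import numpy as np

def best_layer_variance(r2_by_layer_and_session):
    """Sketch of the procedure described above: for each held-out session, pick
    the layer with the highest mean variance explained on the other sessions,
    record that layer's variance explained on the held-out session, and return
    the median over held-out sessions."""
    r2 = np.asarray(r2_by_layer_and_session)   # shape (n_layers, n_sessions)
    n_sessions = r2.shape[1]
    held_out = []
    for s in range(n_sessions):
        other = [t for t in range(n_sessions) if t != s]
        best_layer = r2[:, other].mean(axis=1).argmax()
        held_out.append(r2[best_layer, s])
    return np.median(held_out)

r2 = np.random.rand(17, 3)   # 17 layers x 3 scan sessions for one voxel
print(best_layer_variance(r2))
```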
Figure 7A shows the results across the 392 word-trained networks. Figure 7C shows the results for the 406 genre-trained networks.
Figures 7B and 7D show the same data, but highlight the trajectories of individual architectures over training time – each subplot
highlights the 14 network checkpoints over training for a given architecture.
Psychophysics statistics
For the plots of human psychophysical performance on the word and genre tasks (Figures 2A and 2G; also in Figures 2C, 2F, and 2I),
error bars are within-subject SEM. For corresponding plots of network psychophysical performance (Figures 2B and 2H; also in Fig-
ures 2C and 2I), error bars are SEM, bootstrapped over stimuli and classes (either words or genres). For distortion plots (Figures 2E,
2F, S1A, and S1B), error bars are bootstrapped over stimuli. To examine the similarity between pairs of genre confusion matrices
The TensorFlow network implementation described above, including the trained filter weights, will be made available on the
McDermott lab website.
Code and data are available by request to the Lead Contact (alexkell@mit.edu).