GTC Silicon Valley-2019: Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding


GTC Silicon Valley-2019 ID: S9251: Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding

Nan Rosemary Ke (MILA, University of Montreal). Learning long-term dependencies in extended temporal sequences requires credit assignment to events far in the past. The most common method for training recurrent neural networks, backpropagation through time, requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. We'll describe how this becomes computationally expensive or even infeasible when used with long sequences. Although biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states, humans are often reminded of past memories or mental states associated with their current mental states. We'll discuss the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state.


Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding

Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, back-propagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. This becomes computationally expensive or even infeasible when used with long sequences. Importantly, biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states (consider days, months, or years.) However, humans are often reminded of past memories or mental states which are associated with the current mental state. We consider the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state. Based on this principle, we study a novel algorithm which only back-propagates through a few of these temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states. We demonstrate in experiments that our method matches or outperforms regular BPTT and truncated BPTT in tasks involving particularly long-term dependencies, but without requiring the biologically implausible backward replay through the whole history of states. Additionally, we demonstrate that the proposed method transfers to longer sequences significantly better than LSTMs trained with BPTT and LSTMs trained with full self-attention.

1 Introduction

Humans have a remarkable ability to remember events from the distant past which are associated with the current mental state (Ciaramelli et al., 2008). Most experimental and theoretical analyses of memory have focused on understanding the deliberate route to memory formation and recall. But automatic reminding—when memories pop into one’s head—can have a potent influence on cognition. Reminding is normally triggered by contextual features present at the moment of retrieval which match distinctive features of the memory being recalled (Berntsen et al., 2013; Wharton et al., 1996), and can occur more often following unexpected events (Read & Cesa, 1991). Thus, an individual’s current state of understanding can trigger reminding of a past state. Reminding can provide distracting sources of irrelevant information (Forbus et al., 1995; Novick, 1988), but it can also serve a useful computational role in ongoing cognition by providing information essential to decision making (Benjamin & Ross, 2010).

In this paper, we identify another possible role of reminding: to perform credit assignment across long time spans. Consider the following scenario. As you drive down the highway, you hear an unusual popping sound. You think nothing of it until you stop for gas and realize that one of your tires has deflated, at which point you are suddenly reminded of the pop. The reminding event helps determine the cause of your flat tire, and probably leads to synaptic changes by which a future pop sound while driving would be processed differently. Credit assignment is critical in machine learning. Back-propagation is fundamentally performing credit assignment. Although some progress has been made toward credit-assignment mechanisms that are functionally equivalent to back-propagation (Lee et al., 2014; Scellier & Bengio, 2016; Whittington & Bogacz, 2017), it remains very unclear how the equivalent of back-propagation through time, used to train recurrent neural networks (RNNs), could be implemented by brains. Here we explore the hypothesis that an associative reminding process could play an important role in propagating credit across long time spans, also known as the problem of learning long-term dependencies in RNNs, i.e., of learning to exploit statistical dependencies between events and variables which occur temporally far from each other.

1.1 Credit Assignment in Recurrent Neural Networks

RNNs are used to process sequences of variable length. They have achieved state-of-the-art results for many machine learning sequence processing tasks. Examples where models based on RNNs shine include speech recognition (Miao et al., 2015; Chan et al., 2016), image captioning (Vinyals et al., 2015; Lu et al., 2017), and machine translation (Luong et al., 2015).

It is common practice to train RNNs using gradients computed with back-propagation through time (BPTT), wherein the network states are unrolled in time over the whole trajectory of discrete time steps and gradients are back-propagated through the unrolled graph. The network unfolding procedure of BPTT does not seem biologically plausible because it requires storing these events and playing them back much later (at the end of a trajectory of $T$ time steps) in reverse order to propagate gradients backwards. If a discrete time instant corresponds to a saccade (about 200-300 ms), then a trajectory of 100 days would require replaying computations back through over 42 million time steps. This is not only inconvenient; more importantly, a small error in any one of these steps could vanish or blow up and cause catastrophic outcomes. Also, if this unfolding and back-propagation is done only over shorter sequences, then learning typically will not capture longer-term dependencies linking events across temporal spans larger than the length of the back-propagated trajectory.
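As a rough check of that figure (our own arithmetic, assuming one time step per 200 ms saccade):

\[
\frac{100~\text{days} \times 86{,}400~\text{s/day}}{0.2~\text{s/step}} \approx 4.3 \times 10^{7}~\text{steps},
\]

which is indeed over 42 million time steps.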

What are the alternatives to BPTT? One approach we explore here exploits associative reminding of past events, which may be triggered by the current state and added to it, thus making it possible to turn gradients with respect to the current state into approximate gradients with respect to the state corresponding to the recalled event. The approximation comes from not backpropagating through the unfolded ordinary recurrence across long time spans, but only through this memory retrieval mechanism. Completely different approaches, such as methods based on the online estimation of gradients (Ollivier et al., 2015), are also possible, but they are not currently close to BPTT in terms of learning performance on large networks. Assuming that no exact gradient estimation method is possible (which seems likely), it could well be that brains combine multiple estimators.

In machine learning, the most common practical alternative to full BPTT is truncated BPTT (TBPTT) (Williams & Peng, 1990). In TBPTT, a long sequence is sliced into a number of (possibly overlapping) subsequences, gradients are backpropagated only for a fixed, limited number of time steps into the past, and the parameters are updated after each backpropagation through a subsequence. Unfortunately, this truncation makes capturing dependencies across distant timesteps nigh-impossible, because no error signal reaches further back into the past than TBPTT’s truncation length.
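To make the truncation concrete, here is a minimal PyTorch sketch (our own, not the paper's code) of a TBPTT training loop; detaching the carried recurrent state at each chunk boundary is exactly what prevents any error signal from reaching further back than the truncation length:

```python
import torch
import torch.nn as nn

def tbptt_train(lstm: nn.LSTM, readout: nn.Linear, opt, xs, ys, k_trunc=20):
    """Truncated BPTT sketch: xs is (seq_len, batch, input_dim), ys is (seq_len, batch)
    with integer class targets."""
    loss_fn = nn.CrossEntropyLoss()
    h = c = None
    for start in range(0, xs.size(0), k_trunc):
        x_chunk = xs[start:start + k_trunc]
        y_chunk = ys[start:start + k_trunc]
        if h is None:
            out, (h, c) = lstm(x_chunk)
        else:
            # Detach the carried state: gradients cannot flow past this boundary,
            # which is TBPTT's truncation.
            out, (h, c) = lstm(x_chunk, (h.detach(), c.detach()))
        logits = readout(out)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), y_chunk.reshape(-1))
        opt.zero_grad()
        loss.backward()   # reaches at most k_trunc steps into the past
        opt.step()
```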

Neurophysiological findings support the existence of memory replay and its involvement in credit assignment and learning in biological systems. In particular, hippocampal recordings in rats indicate that brief sequences of prior experience are replayed both in the awake resting state and during sleep, conditions which are both linked to memory consolidation and learning (Foster & Wilson, 2006; Davidson et al., 2009; Gupta et al., 2010; Ambrose et al., 2016). Thus, the mental look back into the past seems to occur exactly when credit assignment is to be performed. It is therefore plausible that hippocampal replay could be a way of doing temporal credit assignment (and possibly BPTT) on a short time scale, but here we argue for a solution which could handle credit assignment over much longer durations.

1.2 Novel Credit Assignment Mechanism: Sparse Attentive Backtracking

Inspired by the ability of brains to selectively reactivate memories of the past based on the current context, we propose here a novel solution called Sparse Attentive Backtracking (SAB) that incorporates a differentiable, sparse (hard) attention mechanism to select from past states. Inspired by the cognitive analogy of reminding, SAB is designed to retrieve one or very few past states. This may also be advantageous in focusing the credit assignment, although this hypothesis remains to be tested. SAB meshes well with TBPTT, yet allows gradients to propagate over distances far in excess of the TBPTT truncation length. We experimentally answer the following questions in the affirmative:

Q1: Can Sparse Attentive Backtracking (SAB) capture long-term dependencies? Yes; see the results for seven tasks supporting this in § 4.

Q2: Does SAB generalize and transfer? See the strong transfer results in § 4.

Q3: How does SAB compare to the Transformer (Vaswani et al., 2017)? SAB outperforms the Transformer on the harder CIFAR10 task (comparison in § 4).

Q4: Is sparsity important for SAB, and does it learn to retrieve meaningful memories? See the results on the importance of sparsity and Table 3 in § 4.

2 Related Machine Learning Work

Skip-connections and gradient flow.

Neural architectures such as Residual Networks (He et al., 2016) and Dense Networks (Huang et al., 2016) allow information to skip over convolutional processing blocks of an underlying convolutional network architecture. This construction provably mitigates the vanishing gradient problem by allowing the gradient at any given layer to be bounded. Densely-connected convolutional networks alleviate the vanishing gradient problem by allowing a direct path from any layer in the network to the output layer. In contrast, in this work we propose and explore what one might regard as a form of dynamic skip connection, modulated by an attention mechanism corresponding to a reminding process, which matches the current state with an older state which is retrieved from memory.

The Transformer network

The Transformer network (Vaswani et al., 2017) takes sequence processing using attention to its logical extreme: it uses attention only, not relying on RNNs at all. The attention mechanism is a softmax not over the sequence itself but over the outputs of the previous self-attention layer. In order to attend to multiple parts of the layer outputs simultaneously, the Transformer uses 8 small attention “heads” per layer (instead of a single large head) and combines the attention heads’ outputs by concatenation. No attempt is made to make the attention weights sparse, and the authors do not test their models on sequences longer than the Transformer’s intermediate representations. Since brains clearly involve recurrent computation, this approach would seem to miss an important characteristic of biological credit assignment through time. Another implausible aspect of the Transformer architecture is the simultaneous access to (and linear combination of) all past memories (as opposed to a handful with SAB).

3 Sparse Attentive Backtracking

Mindful that humans use a very sparse subset of past experiences in credit assignment, and are capable of direct random access to past experiences and their relevance to the present, we present here SAB: the principle of learned, dynamic, sparse access to, and replay of, relevant past states for credit assignment in neural network models, such as RNNs.


In the limit of maximum sparsity (no access to the past), SAB degenerates to the use of a regular static neural network. In the limit of minimum sparsity (full access to the past), SAB degenerates to the use of a full self-attention mechanism. For the purposes of this paper, we explore the gap between these with a specific variety of augmented LSTM models; but SAB does not refer to any particular architecture, and the augmented LSTM described herein is used purely as a vehicle to explore and validate our hypotheses in §1.

Broadly, an SAB neural network is required to do two things:

During the forward pass, manage a memory unit and select at most a sparse subset of past memories at every timestep. We will call this sparse retrieval .

During the backward pass, propagate gradient only to that sparse subset of memory and its local surroundings. We will call this sparse replay .

3.1 Sparse retrieval of memories

Just as humans make selective use of past memories to inform their decisions in the present, so must an SAB model learn to remember and dynamically select only a few memories that could be potentially useful in the present. There are several alternative implementations of this concept. An important class of them are attention mechanisms, especially self-attention over a model’s own past states. Closely linked to the question of dynamic access to memory is the structure of the memory itself; for instance, in the Differentiable Neural Computer (DNC) (Graves et al., 2016), the memory is a fixed-size tensor accessed with explicit read and write operations, while in Bahdanau et al. (2014), the memory is implicitly a list of past hidden states that continuously grows.

For the purposes of this paper, we choose a simple approach similar to Bahdanau et al. (2014). Many other options are possible, and the question of memory representation in humans (faithful to actual brains) and machines (with good computational properties) remains open. Here, to test the principle of SAB without having to answer that question, we use an approach already shown to work well in machine learning. We augment a unidirectional LSTM with a memory of every $k_{att}$’th hidden state from the past, with a modified hard self-attention mechanism limited to selecting at most $k_{top}$ memories at every timestep. Future work should investigate more realistic mechanisms for storing memories, e.g., based on saliency, novelty, etc. But this simple scheme allows us to test the hypothesis that neural network models can still perform well even when compelled at every timestep to access their past sparsely. If they cannot, then it would be meaningless to further encumber them with a bounded-size memory.

SAB-augmented LSTM

We now describe the sparse retrieval mechanism that we have settled on. It determines which memories will be selected on the forward pass of the RNN, and therefore also which memories will receive gradient on the backward pass during training.

At time $t$, the underlying LSTM receives a vector of hidden states $\bm{h}^{(t-1)}$, a vector of cell states $\bm{c}^{(t-1)}$, and an input $\bm{x}^{(t)}$, and computes new cell states $\bm{c}^{(t)}$ and a provisional hidden state vector $\bm{\hat{h}}^{(t)}$ that also serves as a provisional output. We next use an attention mechanism that is similar to Bahdanau et al. (2014), but modified to produce sparse attention decisions. First, the provisional hidden state vector $\bm{\hat{h}}^{(t)}$ is concatenated to each memory vector $\bm{m}^{(i)}$ in the memory $\mathcal{M}$. Then, an MLP with one hidden layer maps each such concatenated vector to a scalar, non-sparse, raw attention weight $a^{(t)}_{i}$ representing the salience of memory $i$ at the current time $t$. The MLP is parametrized with weight matrices $\bm{W}_{1}$, $\bm{W}_{2}$ and $\bm{W}_{3}$.
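As an illustration, here is a minimal PyTorch sketch of such a one-hidden-layer scoring MLP (our own code, not the released implementation; the module name, the additive Bahdanau-style split of the first layer into two blocks acting on $\bm{\hat{h}}^{(t)}$ and $\bm{m}^{(i)}$ respectively, and the tanh nonlinearity are assumptions):

```python
import torch
import torch.nn as nn

class RawAttentionScorer(nn.Module):
    """Scores each stored memory against the provisional hidden state h_hat.
    Equivalent to an MLP on the concatenation [h_hat; m_i]: the first layer is split
    into two blocks (W1 for h_hat, W2 for the memory), followed by a scalar read-out W3."""
    def __init__(self, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, attn_dim, bias=False)  # acts on h_hat
        self.W2 = nn.Linear(hidden_dim, attn_dim, bias=False)  # acts on each memory
        self.W3 = nn.Linear(attn_dim, 1, bias=False)           # scalar raw weight a_i

    def forward(self, h_hat: torch.Tensor, memories: torch.Tensor) -> torch.Tensor:
        # h_hat: (batch, hidden_dim); memories: (batch, n_mem, hidden_dim)
        hidden = torch.tanh(self.W1(h_hat).unsqueeze(1) + self.W2(memories))
        return self.W3(hidden).squeeze(-1)                     # (batch, n_mem) raw weights
```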

The raw attention weights are then sparsified by subtracting the $(k_{top}+1)$’th-largest raw weight from all the others, passing the intermediate result through a ReLU, and then normalizing the result to sum to 1. This effectively implements a discrete, hard decision to drop all but $k_{top}$ memories, and weights the selected memories by their margin over the others rather than by their raw values. This is different from typical attention mechanisms that normalize attention weights using a softmax function (Bahdanau et al., 2014), whose output is never sparse.

A summary vector $\bm{s}^{(t)}$ is then computed as a simple sum of the selected memories, weighted by their respective sparsified attention weights. Given that this sum is very sparse, the summary operation is very fast. This summary is then added to the provisional hidden state $\bm{\hat{h}}^{(t)}$ computed previously to obtain the final hidden state $\bm{h}^{(t)}$.

Lastly, to compute the SAB-augmented LSTM cell’s output $\bm{y}^{(t)}$ at time $t$, we concatenate $\bm{h}^{(t)}$ and the summary vector $\bm{s}^{(t)}$, then apply an affine output transform parametrized with learned weight matrices $\bm{V}_{1}$ and $\bm{V}_{2}$ and bias vector $\bm{b}$.
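Putting the pieces together, a minimal sketch of the sparsification, summary, and output steps described above could look as follows (again our own code and names; it assumes at least $k_{top}+1$ stored memories, and expresses the affine transform on the concatenation $[\bm{h}^{(t)}; \bm{s}^{(t)}]$ as the two blocks $\bm{V}_{1}$ and $\bm{V}_{2}$):

```python
import torch

def sparsify_attention(raw: torch.Tensor, k_top: int) -> torch.Tensor:
    """Subtract the (k_top+1)-th largest raw weight, apply ReLU, renormalize to sum to 1.
    At most k_top entries per row remain non-zero. Assumes raw has >= k_top+1 columns."""
    thresh = raw.topk(k_top + 1, dim=-1).values[..., -1:]      # (k_top+1)-th largest value
    sparse = torch.relu(raw - thresh)
    return sparse / sparse.sum(dim=-1, keepdim=True).clamp_min(1e-8)

def sab_output(h_hat, memories, raw_attn, k_top, V1, V2, b):
    """h_hat: (batch, d); memories: (batch, n_mem, d); V1, V2: (out_dim, d); b: (out_dim,)."""
    alpha = sparsify_attention(raw_attn, k_top)                # (batch, n_mem), mostly zeros
    s = torch.bmm(alpha.unsqueeze(1), memories).squeeze(1)     # summary vector s^(t)
    h = h_hat + s                                              # final hidden state h^(t)
    y = h @ V1.t() + s @ V2.t() + b                            # affine output over [h; s]
    return h, s, y
```

With `raw_attn` produced by the scoring module sketched above, this reproduces the forward path of §3.1 under those naming assumptions.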

The forward pass into a hidden state $\bm{h}^{(t)}$ therefore has two paths contributing to it. One path is the regular sequential forward path in an RNN; the other is through the dynamic but sparse skip connections in the attention mechanism that connect the present state to potentially very distant past experiences.

3.2 Sparse replay

Humans are trivially capable of assigning credit or blame to events even a long time after the fact, and do not need to replay all events from the present to the credited event sequentially and in reverse to do so. But that is effectively what RNNs trained with full BPTT require, and this does not seem biologically plausible when considering events which are far from each other in time. Even less plausible is TBPTT, because it ignores time dependencies beyond the truncation length $k_{trunc}$.

SAB networks’ twin paths during the forward pass (the sequential connection and the sparse skip connections) allow gradients to flow not just from $\bm{h}^{(t)}$ to $\bm{h}^{(t-1)}$, but also to the at most $k_{top}$ memories $\bm{m}^{(i)}$ retrieved by the attention mechanism (and to no others). Learning to deliver gradient directly (and sparsely) where it is needed (and nowhere else) (1) avoids competition for the limited information-carrying capacity of the sequential path, (2) is a simple form of credit assignment, and (3) imposes a trade-off that is absent in previous, dense self-attentive mechanisms: opening a connection to an interesting or useful timestep comes at the price of excluding others. This competition for a limited budget of $k_{top}$ connections results in interesting timesteps receiving frequent attention and strong gradient flow, while uninteresting timesteps are ignored and starve.

Mental updates

If we not only allow gradient to flow directly to a past timestep, but also on to a few local timesteps around it, we have mental updates: a type of local credit assignment around a memory. There are various ways of enabling this. In our SAB-augmented LSTM, we choose to perform TBPTT locally before the selected timesteps (over the $k_{trunc}$ timesteps preceding a selected one).
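One simple way to realize this (a sketch under our own assumptions, using an `nn.RNNCell`-style cell rather than the paper's augmented LSTM) is to detach the recurrent state every $k_{trunc}$ steps while keeping the stored memories attached to their local graph, so that gradient arriving at a retrieved memory can flow back through at most the $k_{trunc}$ steps preceding it:

```python
import torch
import torch.nn as nn

def unroll_with_local_truncation(cell: nn.RNNCell, xs, k_trunc: int = 5, k_att: int = 2):
    """xs: (seq_len, batch, input_dim). Returns the final state and the stored memories.
    Backprop into memories[i] reaches at most k_trunc steps before that memory."""
    h = torch.zeros(xs.size(1), cell.hidden_size)
    memories = []
    for t in range(xs.size(0)):
        if t % k_trunc == 0:
            h = h.detach()          # truncation boundary: gradient stops here
        h = cell(xs[t], h)          # one recurrent step (graph kept between boundaries)
        if t % k_att == 0:
            memories.append(h)      # stored with its local graph attached ("mental update")
    return h, memories
```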


4 Experimental Setup and Results

We compare SAB to two baseline models for all tasks. The first is an LSTM trained using either full BPTT or TBPTT with various truncation lengths. The second is an LSTM augmented with full self-attention trained using full BPTT. For the pixel-by-pixel CIFAR10 classification task, we also compare to the Transformer (Vaswani et al., 2017) architecture.

Copying and Adding problems (Q1)

The copy and adding problems defined in Hochreiter & Schmidhuber (1997) are synthetic tasks specifically designed to evaluate a model’s performance on long-term dependencies by testing its ability to remember a sub-sequence for a large number of timesteps. The performance of SAB almost matches the performance of LSTMs augmented with self-attention trained using full BPTT. Note that our copy and adding LSTM baselines are more competitive than those reported in the existing literature (Arjovsky et al., 2016). These findings support our hypothesis that, at any given time step, only a few past events need to be recalled to correctly predict the output of the current timestep.

Table 2 reports the cross-entropy (CE) of the model predictions on unseen sequences in the adding task. The LSTM with full self-attention trained using BPTT obtains the lowest CE loss, followed by the LSTM trained using BPTT; the LSTM trained with truncated BPTT performs significantly worse. When $T=200$, SAB’s performance is comparable to the best baseline models. With longer sequences ($T=400$), SAB outperforms TBPTT, but is outperformed by full BPTT. For more details regarding the setup, refer to the supplementary material.

Character level Penn TreeBank (PTB) (Q1)

Details about our experimental setup can be found in the supplementary material; note that we did not carry out any additional hyperparameter search for our model. We evaluate performance using the bits-per-character (BPC) metric. Table 2 reports the BPC of the models’ predictions on the test set. SAB significantly outperforms LSTMs trained using TBPTT, with or without self-attention, and almost matches full BPTT, which is roughly what one expects from an approximate-gradient method like SAB.

Permuted pixel-by-pixel MNIST (Q1)

This task is a sequential version of the MNIST classification dataset. The task involves predicting the label of the image after being given its pixels as a sequence permuted in a fixed, random order. Our experimental setup can be found in the supplementary material. Table 5 shows that SAB performs well compared to BPTT.

CIFAR10 classification (Q1,Q3)

We test our model’s performance on pixel-by-pixel CIFAR10 (no permutation). This task involves predicting the label of the image after being given it as a sequence of pixels. This task is relatively difficult compared to the other tasks, as the sequences are substantially longer (length 1024). Our method outperforms Transformers and LSTMs trained with BPTT (Table 5).

Learning long-term dependencies (Q1)

Table 1 reports both accuracy and cross-entropy (CE) of the models’ predictions on unseen sequences for the copy memory task. The best-performing baseline model is the LSTM with full self-attention trained using BPTT, followed by vanilla LSTMs trained using BPTT. Far behind are LSTMs trained using truncated BPTT. Table 1 demonstrates that SAB is able to learn the task almost perfectly for all copy lengths $T$. Further, SAB outperforms all LSTM baselines and matches the performance of LSTMs with full self-attention trained using BPTT on the copy memory task. This becomes particularly noticeable as the sequence length increases.

Transfer Learning (Q2)

We examine the generalization ability of SAB compared to an LSTM trained with full BPTT and an LSTM with full self-attention. The experiment is set up as follows: for the copy task of length $T=100$, we train SAB, an LSTM trained with BPTT, and an LSTM with full self-attention to convergence. We then take the trained models and evaluate them on the copy task for an array of larger $T$ values. The results are shown in Table 5. Although all three models have similar performance at $T=100$, performance for all three drops as $T$ grows. However, SAB still manages to complete the task at $T=5000$, whereas by $T=2000$ both the vanilla LSTM and the LSTM with full self-attention do no better than random guessing ($1/8=12.5\%$).

Importance of Sparsity and Mental Updates (Q4)

We study the necessity of sparsity and mental updates by running an ablation study on the copying problem. The ablation study focuses on two variants. The first model attends to all events in the past while performing a truncated update. This can be seen either as a dense version of SAB or as an LSTM with full self-attention trained using TBPTT. Empirically, we find that such models are both more difficult to train and do not reach the same performance as SAB. The second ablation experiment tests the necessity of mental updates, without which the model would only attend to past time steps without passing gradients through them to preceding time steps. We observe a degradation of model performance when blocking gradients to past events. This effect is most evident when attending to only one timestep in the past ($k_{top}=1$).


We evaluate SAB on language modeling with the Penn TreeBank (PTB) (Marcus et al., 1993) and Text8 (Mahoney, 2011) datasets. For models trained using truncated BPTT, performance drops as $k_{\textrm{trunc}}$ shrinks. We found that on PTB, SAB with $k_{\textrm{trunc}}=20$ and $k_{\textrm{top}}=10$ performs almost as well as full BPTT. For the larger Text8 dataset, SAB with $k_{\textrm{trunc}}=10$ and $k_{\textrm{top}}=5$ outperforms an LSTM trained using BPTT.

Comparison to Transformer (Q3)

We test how SAB compares to the Transformer model (Vaswani et al., 2017), which is based on a self-attention mechanism. On pMNIST, the Transformer outperforms our best model, as shown in Table 5. On CIFAR10, however, our proposed model performs much better.

5 Conclusions

By considering how brains could perform long-term temporal credit assignment, we developed an alternative to the traditional method of training recurrent neural networks by unfolding of the computational graph and BPTT. We explored the hypothesis that a reminding process which uses the current state to evoke a relevant state arbitrarily far back in the past could be used to effectively teleport credit backwards in time to the computations performed to obtain the past state. To test this idea, we developed a novel temporal architecture and credit assignment mechanism called SAB for Sparse Attentive Backtracking, which aims to combine the strengths of full backpropagation through time and truncated backpropagation through time. It does so by backpropagating gradients only through paths for which the current state and a past state are associated. This allows the RNN to learn long-term dependencies, as with full backpropagation through time, while still allowing it to only backtrack for a few steps, as with truncated backpropagation through time, thus making it possible to update weights as frequently as needed rather than having to wait for the end of very long sequences.

Cognitive processes in reminding serve not only as the inspiration for SAB, but suggest two interesting directions of future research. First, we assumed a simple content-independent rule for selecting microstates for inclusion in the macrostate, whereas humans show a systematic dependence on content: salient, extreme, unusual, and unexpected experiences are more likely to be stored and subsequently remembered. These landmarks of memory should be useful for connecting past to current context, just as an individual learns to map out a city via distinctive geographic landmarks. Second, SAB determines the relevance of past microstates to the current state through a generic, flexible mapping, whereas humans perform similarity-based retrieval. We conjecture that a version of SAB with a strong inductive bias in the mechanism to select past states may further improve its performance.

6 Acknowledgement

The authors would like to thank Hugo Larochelle, Walter Senn, Alex Lamb, Remi Le Priol, Matthieu Courbariaux, Gaetan Marceau Caron, and Sandeep Subramanian for useful discussions, as well as NSERC, CIFAR, Google, Samsung, SNSF, Nuance, IBM, Canada Research Chairs, and National Science Foundation awards EHR-1631428 and SES-1461535 for funding. We would also like to thank Compute Canada and NVIDIA for computing resources, and Alex Lamb for code review. Finally, the authors would like to express a debt of gratitude towards those who contributed to Theano over the years (now that it is being sunset), for making it such a great tool.

  • Ambrose et al. (2016) Ambrose, R. Ellen, Pfeiffer, Brad E., and Foster, David J. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron , 91(5):1124 – 1136, 2016.
  • Arjovsky et al. (2016) Arjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary evolution recurrent neural networks. In International Conference on Machine Learning , pp. 1120–1128, 2016.
  • Bahdanau et al. (2014) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 , 2014.
  • Benjamin & Ross (2010) Benjamin, A. S. and Ross, B. H. The causes and consequences of reminding. In Benjamin, A. S. (ed.), Successful remembering and successful forgetting: A Festschrift in honor of Robert A. Bjork . Psychology Press, 2010.
  • Berntsen et al. (2013) Berntsen, Dorthe, Staugaard, Søren Risløv, and Sørensen, Louise Maria Torp. Why am i remembering this now? predicting the occurrence of involuntary (spontaneous) episodic memories. Journal of Experimental Psychology: General , 142(2):426, 2013.
  • Chan et al. (2016) Chan, William, Jaitly, Navdeep, Le, Quoc, and Vinyals, Oriol. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on , pp.  4960–4964. IEEE, 2016.
  • Ciaramelli et al. (2008) Ciaramelli, E, Grady, C L, and Moscovitch, M. Top-down and bottom-up attention to memory: A hypothesis on the role of the posterior parietal cortex in memory retrieval. Neuropsychologia , 46(7):1828–1851, 2008.
  • Cooijmans et al. (2016) Cooijmans, Tim, Ballas, Nicolas, Laurent, César, Gülçehre, Çağlar, and Courville, Aaron. Recurrent batch normalization. arXiv preprint arXiv:1603.09025 , 2016.
  • Davidson et al. (2009) Davidson, Thomas J, Kloosterman, Fabian, and Wilson, Matthew A. Hippocampal replay of extended experience. Neuron , 63(4):497–507, 2009.
  • Forbus et al. (1995) Forbus, K D, Gentner, D, and Law, K. Mac/fac: A model of similarity-based retrieval. Cognitive Science , 19:141–205, 1995.
  • Foster & Wilson (2006) Foster, David J and Wilson, Matthew A. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature , 440(7084):680–683, 2006.
  • Graves et al. (2016) Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwińska, Agnieszka, Colmenarejo, Sergio Gómez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. Hybrid computing using a neural network with dynamic external memory. Nature , 538(7626):471, 2016.
  • Gupta et al. (2010) Gupta, Anoopum S, van der Meer, Matthijs AA, Touretzky, David S, and Redish, A David. Hippocampal replay is not a simple function of experience. Neuron , 65(5):695–705, 2010.
  • He et al. (2016) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pp.  770–778, 2016.
  • Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation , 9(8):1735–1780, 1997.
  • Huang et al. (2016) Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 , 2016.
  • Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 , 2014.
  • Lee et al. (2014) Lee, Dong-Hyun, Zhang, Saizheng, Biard, Antoine, and Bengio, Yoshua. Target propagation. CoRR , abs/1412.7525, 2014. URL http://arxiv.org/abs/1412.7525 .
  • Lu et al. (2017) Lu, Jiasen, Xiong, Caiming, Parikh, Devi, and Socher, Richard. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , volume 6, 2017.
  • Luong et al. (2015) Luong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 , 2015.
  • Mahoney (2011) Mahoney, Matt. Large text compression benchmark. URL: http://www. mattmahoney. net/text/text. html , 2011.
  • Marcus et al. (1993) Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of english: The penn treebank. Computational linguistics , 19(2):313–330, 1993.
  • Miao et al. (2015) Miao, Yajie, Gowayyed, Mohammad, and Metze, Florian. Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on , pp.  167–174. IEEE, 2015.
  • Novick (1988) Novick, R L. Analogical transfer, problem similarity, and expertise. Journal of Experimental Psychology: Learning, Memory, & Cognition , 14:510–520, 1988.
  • Ollivier et al. (2015) Ollivier, Yann, Tallec, Corentin, and Charpiat, Guillaume. Training recurrent networks online without backtracking. arXiv preprint arXiv:1507.07680 , 2015.
  • Read & Cesa (1991) Read, S J and Cesa, I L. Expectation failures in reminding and explanation. Journal of Experimental Social Psychology , 27:1–25, 1991.
  • Scellier & Bengio (2016) Scellier, Benjamin and Bengio, Yoshua. Towards a biologically plausible backprop. CoRR , abs/1602.05179, 2016. URL http://arxiv.org/abs/1602.05179 .
  • Vaswani et al. (2017) Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, and Polosukhin, Illia. Attention is all you need. In Advances in Neural Information Processing Systems , pp. 6000–6010, 2017.
  • Vinyals et al. (2015) Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition , pp.  3156–3164, 2015.
  • Wharton et al. (1996) Wharton, C M, Holyoak, K J, and Lange, T E. Remote analogical reminding. Memory & Cognition , 24:629–643, 1996.
  • Whittington & Bogacz (2017) Whittington, James CR and Bogacz, Rafal. An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation , 29(5):1229–1262, 2017.
  • Williams & Peng (1990) Williams, Ronald J and Peng, Jing. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural computation , 2(4):490–501, 1990.

7 Supplementary material

7.1 Synthetic Experiments: The Copying Memory Problem

Each input sequence consists of $T+20$ entries: a) 10 (randomly generated) digits (from 1 to 8); followed by b) $T$ blank inputs; followed by c) a special end-of-sequence character; followed by d) 10 additional blank inputs. After the end-of-sequence character, the network must output a copy of the initial 10 digits.
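A minimal data generator for this task (our own sketch; the token codes, 0 for blanks and 9 for the end-of-sequence marker, are assumptions, and we keep the total length at $T+20$ by placing the marker after $T-1$ blanks, a detail the description above leaves ambiguous):

```python
import torch

def make_copy_batch(batch_size, T, blank=0, eos=9):
    """Copy-memory data roughly as described above (sketch; token codes assumed:
    digits 1-8, 0 = blank, 9 = end-of-sequence). Total sequence length is T+20."""
    digits = torch.randint(1, 9, (batch_size, 10))            # 10 random digits in 1..8
    x = torch.full((batch_size, T + 20), blank, dtype=torch.long)
    x[:, :10] = digits            # the 10 digits to remember
    x[:, T + 9] = eos             # end-of-sequence marker
    y = torch.full((batch_size, T + 20), blank, dtype=torch.long)
    y[:, -10:] = digits           # target: reproduce the digits after the marker
    return x, y
```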

The adding task

The adding task requires the model to sum two specific entries in a sequence of $T$ (input) entries (Hochreiter & Schmidhuber, 1997). In the spirit of the copying task, larger values of $T$ require the model to keep track of longer-term dependencies. The exact setup is as follows. Each example in the task consists of two input vectors of length $T$. The first is a vector of values generated uniformly between 0 and 1. The second vector encodes a binary mask which indicates the two entries in the first input to be added (the mask vector consists of $T-2$ zeros and 2 ones). The mask is randomly generated with the constraint that the masked-in entries must be from different halves of the first input vector.
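For concreteness, a data generator following this description might look as follows (our own sketch; stacking the values and the mask into a two-channel input is an assumption):

```python
import torch

def make_adding_batch(batch_size, T):
    """Adding-task data as described above: values in [0, 1], a binary mask with one 1
    in each half of the sequence, and the target is the sum of the two masked values."""
    values = torch.rand(batch_size, T)
    mask = torch.zeros(batch_size, T)
    first = torch.randint(0, T // 2, (batch_size,))        # one marked entry in the first half
    second = torch.randint(T // 2, T, (batch_size,))       # one marked entry in the second half
    rows = torch.arange(batch_size)
    mask[rows, first] = 1.0
    mask[rows, second] = 1.0
    x = torch.stack([values, mask], dim=-1)                # (batch, T, 2) input sequence
    y = (values * mask).sum(dim=1, keepdim=True)           # (batch, 1) target sum
    return x, y
```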

Hyperparameters

The hyperparameters for the baselines and SAB are kept the same. All models have 128 hidden units and use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 1e-3. The first model in the ablation study (the dense version of SAB) was more difficult to train; we therefore explored learning rates ranging from 1e-3 to 1e-5 and report the best-performing model.

7.2 Character-Level Penn TreeBank

We follow the setup in Cooijmans et al. (2016): all of our models use 1000 hidden units and a learning rate of 0.002. We used non-overlapping sequences of length 100 in batches of 32, as in Cooijmans et al. (2016). All models were trained for up to 100 epochs with early stopping on the validation set. We evaluate the performance of our model using the bits-per-character (BPC) metric.

7.3 Character-Level Text8

We follow the setup of Mikolov et al. (2012): we use the first 90M characters for training, the next 5M for validation, and the final 5M characters for testing. We train on non-overlapping sequences of length 180. Due to computational constraints, all baselines use 1000 hidden units. We trained all models using a batch size of 64, and trained SAB for a maximum of 30 epochs.

7.4 Permuted Pixel-by-pixel MNIST

All models use an LSTM with 128 hidden units. The prediction is produced by passing the final hidden state of the network into a softmax. We used a learning rate of 0.001. We trained our model for about 100 epochs, and did early stopping based on the validation set.

7.5 Comparison to LSTM + Self-Attention (with truncation)

While SAB is trained with truncated BPTT (and the vanilla LSTM with full self-attention is not), here we argue that training the vanilla LSTM with self-attention using truncation works less well on the more challenging Text8 language modelling dataset.

8 Computational Complexity of SAB

If the memory were allowed to grow unbounded in size, the computational complexity would scale linearly with the length of the history; humans, however, have a bounded memory. In a computer-science setting with unbounded memory, the time complexity of the forward pass of both training and inference in SAB is $O(t^{2}n^{2})$, with $t$ the number of timesteps and $n$ the size of the hidden state. The space complexity of the forward pass of training is unchanged at $O(tn)$, but the space complexity of inference in SAB is now $O(tn)$ rather than $O(n)$. The time cost of the backward pass of training is, however, difficult to characterize. Hidden states depend on a sparse subset of past microstates, but each of those past microstates may itself depend on several other, even earlier microstates. The web of active connections is therefore akin to a directed acyclic graph, and it is quite possible in the worst case for a backpropagation starting at the last hidden state to touch all past microstates several times. However, if the number of microstates truly relevant to a task is low, the attention mechanism will repeatedly focus on them to the exclusion of all others, and pathological runtimes will not be encountered.

8.1 Gradient Flow

Our method approximates the true gradient, but in a sense this is no different from the approximation made with truncated gradients, except that instead of truncating to the last $k_{\textit{trunc}}$ time steps, we truncate to one skip-step in the past, which can be arbitrarily far back. This provides a way of combating exploding and vanishing gradients while learning long-term dependencies. To verify this, we ran our model on all the datasets (Text8, pixel-by-pixel MNIST, character-level PTB) with and without gradient clipping. We found empirically that gradient clipping was needed only for the Text8 dataset; for all other datasets we observed little or no difference with gradient clipping.
