

Recent News about HTM

Hierarchical Temporal Memory (HTM)

Previous posts related to HTM

http://enginius.tistory.com/236

http://enginius.tistory.com/240

http://enginius.tistory.com/256

http://enginius.tistory.com/265


AMA with Juergen Schmidhuber



Question

 The LSTM unit is delicately crafted to solve a specific problem in training RNNs. Do you see the need for other similarly "high-complexity" units in RNNs or CNNs, like for example Hinton's "capsules"? On the topic of CNNs and capsules, do you agree with Hinton's assessment that the efficacy of pooling is actually a disaster? (I do, for what it's worth)

LSTM was carefully designed to solve a specific problem in training RNNs. Do you see a need for other similarly "high-complexity" units, such as Dr. Hinton's capsules? And do you agree with his assessment that pooling is a disaster?


Answer 1

 I am not Dr. Schmidhuber, but I would like to weigh in on this since I talked to Hinton in person about his capsules. Now please take this with a grain of salt, since it is quite possible that I misinterpreted him :) Dr. Hinton seems to believe that all information must somehow still be somewhat visible at the highest level of a hierarchy. With stuff like maxout units, yes, information is lost at higher layers. But the information isn't gone! It's still stored in the activations of the lower layers. So really, we could just grab that information again. Now this is probably very difficult for classifiers, but in HTM-style architectures (where information flows in both the up and down directions), it is perfectly possible to use both higher-layer abstracted information as well as lower layer "fine-grained" information simultaneously. For MPFs (memory prediction frameworks, a generalization of HTM) this works quite well since they only try to predict their next input (which in turn can be used for reinforcement learning). 

Also, capsules are basically columns in HTM (he said that himself IIRC), except in HTM they are used for storing contextual (temporal) information, which to me seems far more realistic than storing additional feature-oriented spatial information like Dr. Hinton seems to be using them for.

I am not Dr. Schmidhuber, but since I have talked to Hinton in person about his capsules, I will weigh in. Dr. Hinton seems to believe that all information must somehow remain visible at the highest level of the hierarchy. With things like maxout units, information is lost at the higher layers. But the information isn't gone! It is still stored in the activations of the lower layers, so in practice we can grab that information again. This is probably very difficult for classifiers, but in HTM-style architectures (where information flows both up and down), it is possible to use the higher-layer abstracted information and the lower-layer fine-grained information simultaneously. Memory prediction frameworks (MPFs, a generalization of HTM) work quite well this way, since they focus only on predicting their next input.

Capsules basically have the same structure as columns in HTM (Hinton said so himself, if I recall correctly). The difference is that in HTM they are used to store contextual (temporal) information, which personally seems more sensible to me than storing additional feature-oriented spatial information, as Dr. Hinton appears to intend.


Answer 2

 I think pooling is a disaster only if you want to do everything with a single feedforward network and don't have a more general reversible (possibly separate) system that retains the information in all observations. As mentioned in a previous reply: While a problem solver is interacting with the world, it should store and compress (e.g., as in this 1991 paper) the entire raw history of observations. The data is ‘holy’ as it is the only basis of all that can be known about the world (see this 2009 paper). If you have enough storage space to encode the entire data, do not throw it away! For example, universal AIXI is mathematically optimal only because it never abandons the limited number of observations so far. Brains may have enough storage capacity to store 100 years of lifetime at reasonable resolution (see again this 2009 paper). On top of that, they presumably have lots of little algorithms in subnetworks (for pooling and other operations) that look at parts of the data, and process it under local loss of information, depending on the present goal, e.g., to achieve good classification. That's ok as long as it is efficient and successful, and does not have to affect the information-preserving parts of the system. 

Pooling is a disaster only if you try to do everything with a single feedforward network and have no separate system that retains all the information in the observations. ... (abridged) ...

The data is 'holy'. As long as you have enough storage capacity, it is best to keep all of it. Do not throw it away.

What already works well does so because it pursues only a simple goal like classification, so it can afford to discard information.



Differences between Deep Learning and Hierarchical Temporal Memory (HTM)



There’s been a somewhat less than convivial history between two of the theories of neurally-inspired computation systems over the last few years. When a leading protagonist of one school is asked a question about the other, the answer often varies from a kind of empty semi-praise to downright dismissal and the occasional snide remark. The objections of one side to the others’ approach are usually valid, and mostly admitted, but the whole thing leaves one with a feeling that it is not a very scientific way to proceed or behave. 


This post describes an idea which might go some way to resolving this slightly unpleasant impasse and suggests that the discrepancies may simply be a result of two groups using the same name for two quite different things.


Differences between HTM and Deep Networks

In HTM, Jeff Hawkins’ plan is to identify the mechanisms which actually perform computation in real neocortex, abstracting them only far enough that the details of the brain’s bioengineering are simplified out, and hopefully leaving only the pure computational systems in a form which allows us to implement them in software and reason about them. On the other hand, Hinton and LeCun’s neural networks are each built “computation-first,” drawing some inspiration from and resembling the analogous (but in detail very different) computations in neocortex. The results (ie the models produced), inevitably, are as different at all levels as their inventors’ approaches and goals.


 For example, one criterion for the Deep Network developer is that her model is susceptible to a set of mathematical tools and techniques, which allow other researchers to frame questions, examine and compare models, and so on, all in a similar mathematical framework. HTM, on the other hand, uses neuroscience as a standard test, and will not admit to a model any element which is known to be contradicted by observation of natural neocortex. The Deep Network people complain that the models of HTM cannot be analysed like theirs can (indeed it seems they cannot), while the HTM people complain that the neurons and network topologies in Deep Networks bear no relationship with any known brain structures, and are several simplifications too far. 


Even Yann LeCun acknowledges Jeff! But it is hard to put into practice

Yann LeCun said recently on Reddit (with a great summary): Jeff Hawkins has the right intuition and the right philosophy. Some of us have had similar ideas for several decades. Certainly, we all agree that AI systems of the future will be hierarchical (it’s the very idea of deep learning) and will use temporal prediction. But the difficulty is to instantiate these concepts and reduce them to practice. Another difficulty is grounding them on sound mathematical principles (is this algorithm minimizing an objective function?). 


The weaknesses of HTM, as this post's author sees them

I think Jeff Hawkins, Dileep George and others greatly underestimated (?) the difficulty of reducing these conceptual ideas to practice. As far as I can tell, HTM has not been demonstrated to get anywhere close to state of the art on any serious task. The topic of HTM and Jeff Hawkins was second out of all the major themes in the Q&A session, reflecting the fact that people in the field view this as an important issue, and (it seems to me) wish that the impressive progress made by Deep Learning researchers could be reconciled with the deeper explanatory power of HTM in describing how the neocortex works. 


Of course, HTM people seldom refuse to play their own role in this spat, saying that a Deep Network sacrifices authenticity in favour of mathematical tractability and getting high scores on artificial “benchmarks”. We explain or excuse the fact that our models are several steps smaller in hierarchy and power, making the valid claim that there are shortcuts and simplifications we are not prepared to make, and speculating that we will – like the tortoise – emerge alone at the finish with the prize of AGI in our hands. 


The problem is, however, a little deeper and more important than an aesthetic argument (as it sometimes appears). This gap in acknowledging the valid accomplishments of the two models, coupled with a certain defensiveness, causes a “chilling effect” when an idea threatens to cross over into the other realm. This means that findings in one regime are very slow to be noticed or incorporated in the other. I’ve heard quite senior HTM people actually say things like “I don’t know anything about Deep Learning, just that it’s wrong” – and vice versa. This is really bad science. From reading their comments, I’m pretty sure that no really senior Deep Learning proponent has any knowledge of the current HTM beyond what he’s read in the popular science press, and the reverse is nearly as true. I consider a very good working knowledge of Deep Learning to be a critical part of any area of computational neuroscience or machine learning. Obviously I feel at least the same way about HTM, but recognise that the communication of our progress (or even the reporting of results) in HTM has not made it easy for “outsiders” to achieve the levels of understanding they feel they need to take part. (A fair point: the HTM community mostly shares its progress only among itself.)


There are historical reasons for much of this, but it’s never too late to start fixing a problem like this, and I see this post (and one of my roles) as a step in the right direction.

The Neuron as the Unit of Computation

In both models, we have identified the neuron as the atomic unit of computation, and the connections between neurons as the location of the memory or functional adjustment which gives the network its computational power. This sounds fine, and clearly the brain uses neurons and connections in some way like this, but this is exactly where the two schools mistakenly diverge. Jeff Hawkins rejects the NN integrate-and-fire model (weighted sum + sigmoid) and builds a neuron with vastly higher complexity. Geoff Hinton admits that, while impossible to reason about mathematically, HTM’s neuron is far more realistic if your goal is to mimic neocortex. Deep Learning, using neurons like Lego bricks, can build vast hierarchies and huge networks, find cats in Youtube videos, and win prizes in competitions. HTM, on the other hand, struggles for years to fit together its “super-neurons” and builds a tiny, single-layer model which can find features and anomalies in low-dimensional streaming data.


What is called Deep Learning and HTM are talking about entirely different things.

Looking at this, you’d swear these people were talking about entirely different things. They’ve just been using the same names for them. And, it’s just dawned on me, therein lies both the problem and its solution. The answer’s been there all the time: Each and every neuron in HTM is actually a Deep Network. In an HTM neuron, there are two types of dendrite. One is the proximal dendrite, which contains synapses receiving inputs from the feedforward (mainly sensory) pathway. The other is a set of coincidence-detecting, largely independent, distal dendrite segments, which receive lateral and top-down predictive inputs from the same layer or higher layers and regions in neocortex.

Proximal dendrite: receives the feedforward (mainly sensory) input.

Distal dendrite segments: receive lateral (same-layer) and top-down (higher-layer) predictive inputs.
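To make this structure concrete, here is a minimal sketch in Python of a neuron holding these two input pathways. The function name, the dictionary fields, and the random initial permanences are illustrative assumptions on my part, not Numenta's actual data structures.

```python
import numpy as np

def make_htm_neuron(n_ff_inputs, n_context, n_segments, rng):
    """Illustrative HTM-style neuron: one proximal dendrite plus several
    largely independent distal segments, each modelled as a vector of
    synapse permanences."""
    return {
        # proximal dendrite: permanences on the feedforward (sensory) input
        "proximal": rng.uniform(0.0, 1.0, n_ff_inputs),
        # distal segments: each holds its own permanences on the lateral /
        # top-down (predictive) context input
        "distal": [rng.uniform(0.0, 1.0, n_context) for _ in range(n_segments)],
    }

# example: 16 feedforward inputs, 4 distal segments over 8 context inputs
neuron = make_htm_neuron(16, 8, 4, np.random.default_rng(0))
```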


My thesis here is that a single neuron can be seen as composed of many elements which have direct analogues in various types of Deep Learning networks, and that there are enough of these, with a sufficient structural complexity, that it’s best to view the neuron as a network of simple, Deep Learning-sized nodes, connected in a particular way. I’ll describe this network in some detail now, and hopefully it’ll become clear how this approach removes much of the dichotomy between the models. 


Firstly, a synapse in HTM is very much like a single-input NN node, where HTM’s permanence value is akin to the bias in a NN node, and the weight on the input connection is fixed at 1.0. If the input is active, and the permanence exceeds the threshold, the synapse produces a 1. In HTM we call such a synapse connected, in that the gate is open and the signal is passed through. The dendrite or dendrite segment is like the next layer of nodes in NN, in that it combines its inputs and passes the result up. The proximal dendrite effectively acts as a semi-rectifier, summing inputs and generating a scalar depolarisation value to the cell body. The distal segments, on the other hand, act like thresholded coincidence detectors and produce a depolarising spike only if the sum of the inputs exceeds a threshold. These depolarising inputs (feedforward and recurrent) are combined in the cell body to produce an activation potential. This only potentially generates the output of the entire neuron, because a higher-level inhibition system is used to identify those neurons with highest potential, allow those to fire (producing a binary 1), and suppress the others to zero (a winner-takes-all step with multiple local winners in the layer). 
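As a rough sketch of the computation just described, assuming the neuron structure from the snippet above: the synapse acts as a permanence-gated single-input node, the proximal dendrite as a semi-rectifier, each distal segment as a thresholded coincidence detector, and the layer applies a k-winners-take-all inhibition step. The threshold constants and the relative weighting of feedforward versus predictive depolarisation are placeholder values of mine, not figures from any HTM reference implementation.

```python
import numpy as np

CONNECTED_PERM = 0.5     # permanence threshold above which a synapse is "connected"
SEGMENT_THRESHOLD = 3    # active connected synapses needed for a distal spike
PREDICTIVE_WEIGHT = 0.5  # assumed weighting of recurrent vs feedforward depolarisation

def synapse_outputs(inputs, permanences):
    """A synapse passes a 1 iff its input is active and its permanence exceeds
    the threshold (input weight fixed at 1.0, permanence acting like a bias)."""
    return (inputs > 0) & (permanences > CONNECTED_PERM)

def proximal_depolarisation(ff_inputs, proximal_perms):
    """Proximal dendrite as a semi-rectifier: sum of connected active synapses,
    passed to the cell body as a scalar depolarisation value."""
    return synapse_outputs(ff_inputs, proximal_perms).sum()

def distal_depolarisation(distal_segments, context):
    """Distal segments as coincidence detectors: a segment spikes only if
    enough of its synapses are connected and active."""
    return sum(
        synapse_outputs(context, perms).sum() >= SEGMENT_THRESHOLD
        for perms in distal_segments
    )

def layer_activations(neurons, ff_inputs, context, k=2):
    """Combine feedforward and recurrent depolarisation per neuron, then apply
    k-winners-take-all inhibition: the k most depolarised neurons emit 1."""
    potentials = np.array([
        proximal_depolarisation(ff_inputs, n["proximal"])
        + PREDICTIVE_WEIGHT * distal_depolarisation(n["distal"], context)
        for n in neurons
    ])
    output = np.zeros(len(neurons), dtype=int)
    output[np.argsort(potentials)[-k:]] = 1   # winners fire, the rest are suppressed
    return output
```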


So, an HTM layer is a network of networks, a hierarchy in which neuron-networks communicate through connections between their sub-parts. At the HTM layer level, each neuron has two types of input and one output, and we wire them together as such, but each neuron is really hiding an internal, network-like structure of its own.
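A short usage sketch tying the pieces above together (this assumes the make_htm_neuron and layer_activations sketches from the earlier snippets; the sizes and the sparsity k are arbitrary):

```python
rng = np.random.default_rng(1)
layer = [make_htm_neuron(16, 8, 4, rng) for _ in range(10)]  # a tiny ten-neuron "layer"
ff = rng.integers(0, 2, 16)    # binary feedforward (sensory) input
ctx = rng.integers(0, 2, 8)    # binary lateral / top-down predictive context
print(layer_activations(layer, ff, ctx, k=2))  # sparse binary output: exactly k neurons fire
```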