Back from IJCNN19 (International Joint Conference on Neural Networks) in Budapest, I want to recollect some of the insights I gathered from the leading figures of the neural network community. Like Dante Alighieri in his Convivio, I have been catching some of the crumbs that fall from the table of the most influential neural network researchers, and I want to feed a discussion on this. Read on...
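"E io adunque, che non seggio a la beata mensa, ma, fuggito de la pastura del vulgo, a' piedi di coloro che seggiono ricolgo di quello che da loro cade" (Dante Alighieri, Convivio, Trattato Primo) [In English: "And so I, who do not sit at the blessed table but have fled the pasture of the common herd, gather at the feet of those who sit there some of what falls from them."]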
Image: Luca Signorelli - Own work by Georges Jansoone (JoJan), taken on 30 April 2008, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=834493
Deep Learning is a troublemaker, in a sense. While it paved the way for a plethora of engineering applications, speeding up the work of researchers and engineers alike in their respective domains, it irritated quite a lot of researchers who were following different routes, fighting hard against mathematical problems without obtaining the results that Deep Learning now yields in the blink of an eye. Deep Learning as we know it now can also fuel bad research habits and, in general, it is very hard to extract insightful knowledge from deep neural networks.
One of the most intriguing moments of the IJCNN conference was a panel titled "Deep Learning: Hype or Hallelujah?", chaired by Vladimir Cherkassky. The panel was composed of several experts who concisely presented their points of view. It concluded with a lively Q&A session.
The Panels: claims and contents
The panelists had different takes on the topic. Vladimir Cherkassky raised some excellent questions for debate. I'll recap them as I understood them, hoping not to misrepresent the author's view:
1. "Data deluge makes Science obsolete", i.e. we don't need good models anymore, since statistics will know how to handle the data. This breaks with, e.g., Karl Popper's view: Science starts from problems, not from empirical data.
2. We are, thus, shifting from "first principle knowledge" (hypothesis --> experiment --> theory) to "data-driven knowledge" (program + data = knowledge). The question is: is this really knowledge? Can we make sense of it?
3. No real theoretical progress has been made since the 1990s. Cherkassky argues that Deep Learning is more of a marketing/umbrella term for a combination of old techniques, powerful new hardware and large datasets.
4. Deep Learning claims to be biologically inspired, but this is not true.
I have reflected on these points myself for a long time, and I strongly agree with the first two. Specifically, knowledge should be regarded as the ability to autonomously generalize. If a neural network learns to predict the fall of an object under gravity, I myself will never learn to do so from it. Newton's second law of motion is probably a far better model for predicting how objects fall: it generalizes to any pair of m and a, it is more accurate, it is easier to compute, it doesn't require a computer, and it is symbolic, so it can be plugged into other equations to generate new models. The knowledge distilled by the neural network, conversely, is unusable for us. Making sense of it would be like transplanting a teacher's brain into a disciple's to transfer knowledge. Our current way of teaching is different: we use symbols, associate them with a shared semantics, and manipulate them through language.
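To make the contrast concrete, here is a minimal sketch of my own (it was not shown at the panel). It assumes free fall from rest with g = 9.81 m/s^2, and it uses a plain interpolation table as a stand-in for a data-driven model; this is not a neural network, just the simplest thing that "learns" from samples. All names are hypothetical.

```python
from bisect import bisect_right

G = 9.81  # assumed gravitational acceleration, m/s^2


def fall_distance_law(t):
    """Symbolic model: distance fallen from rest after t seconds."""
    return 0.5 * G * t ** 2


# "Training data": samples of the fall between 0 s and 2 s only.
T_TRAIN = [i / 10 for i in range(21)]
D_TRAIN = [fall_distance_law(t) for t in T_TRAIN]


def fall_distance_table(t):
    """Data-driven stand-in: linear interpolation over the training samples.

    Outside the sampled range it can only repeat the nearest known value.
    """
    if t <= T_TRAIN[0]:
        return D_TRAIN[0]
    if t >= T_TRAIN[-1]:
        return D_TRAIN[-1]
    i = bisect_right(T_TRAIN, t)
    t0, t1 = T_TRAIN[i - 1], T_TRAIN[i]
    d0, d1 = D_TRAIN[i - 1], D_TRAIN[i]
    return d0 + (d1 - d0) * (t - t0) / (t1 - t0)


if __name__ == "__main__":
    for t in (1.05, 5.0):
        print(t, fall_distance_law(t), fall_distance_table(t))
```

Inside the sampled range the two agree closely. At t = 5 s the law gives about 122.6 m, while the table is stuck at the last value it has seen, about 19.6 m. Real networks extrapolate in less trivial ways, but the point stands: the formula can be read, reused and composed, while the table cannot.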
Point 3 is more blurry. Another panelist (my memory fails me on the name) took issue with it. He stated that Deep Learning opens interesting theoretical questions worth investigating and that there have been many new developments since the 1990s. One aspect is generalization: the speaker pointed out that a recent ICML paper showed how the usual U-shaped training/validation error curves can, in reality, decrease again after an "interpolation threshold", which is unexpected in the current theoretical framework. I think the speaker was referring to the paper by Belkin et al. listed at the end of this post, but I'm guessing.
Point 4 was explained as follows:
Biological (e.g. human) beings learn efficiently from little data (Plato's problem), while deep networks require very large datasets.
Nobody really knows how the brain works, and the Deep Learning pioneers were computer scientists, not neuroscientists.
Deep networks do not understand natural language.
Indeed, deep neural networks do not emulate any known biological system. The most common example, convolutional neural nets (CNNs), are claimed to be inspired by the visual cortex but, in practice, there is only a loose connection between a biological visual cortex and the CNNs we handle nowadays in computer science. Furthermore, we process any kind of data with CNNs, not only visual information.
There are, of course, many research works that are truly biologically inspired (see e.g. spiking neural nets). However, these are not considered part of the Deep Learning bubble and they still get worse results in terms of classification and regression performance. (This does not mean they are not worth investigating. Quite the opposite.)
On the other hand, if we look at the larger picture, the learning process we have in Deep Learning can still be considered a biologically inspired process, and it has replaced more traditional methods based on, e.g., digital signal processing. All in all, this has been a huge change in many scientific domains, and it shifted attention from linear mathematical models to nonlinear, biologically inspired learning circuits.
My objections to the claims
Some of the claims were supported by evidence that I don't find convincing.
Cherkassky explicitly excluded unsupervised learning methods from his talk. This introduces a bias that helped him draw his conclusion about the need for large datasets. Unsupervised and one-shot learning allow for unlabelled or smaller datasets.
"Supervised DL requires large datasets": true, but for smaller datasets we still have other techniques. And look at it the other way round: traditional techniques fail with large datasets. That's why DL started to gain attention.
"DL works mainly for images". First, don't recurrent networks belong to DL methods? Furthermore, a lot of 1D signals (e.g. audio) are converted to 2D (e.g. by STFT and log-mel transforms). Even if these 2D matrices are not isotropic (one axis is time, the other is frequency), DL methods still work well. Bottom line: DL can be used in other domains as well.
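For readers who have never seen the 1D-to-2D conversion, here is a minimal sketch of a log-magnitude STFT in plain Python. It is my own toy example: the frame length, hop size, Hann window and test tone are all assumptions, the mel filterbank step is left out, and in practice one would use an FFT library rather than this naive DFT.

```python
import cmath
import math


def stft_log_magnitude(signal, frame_len=64, hop=32):
    """Turn a 1D signal into a 2D matrix: rows are time frames, columns are frequency bins."""
    # Hann window (assumed choice) to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT coefficient for bin k (O(N^2), fine for a demo).
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            # Small offset avoids log(0) on silent bins.
            row.append(math.log(abs(coeff) + 1e-10))
        spectrogram.append(row)
    return spectrogram


# Assumed test input: a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
TONE = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(512)]

if __name__ == "__main__":
    spec = stft_log_magnitude(TONE)
    print(len(spec), "frames x", len(spec[0]), "bins")
```

The result is a matrix that can be fed to a CNN like an image, even though moving along a row (frequency) means something quite different from moving along a column (time).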
Last argument. Deep networks are not easily "explainable". One of the speakers pointed out differences between deep and shallow networks. One of the points, if I recall correctly, was that shallow networks can be dealt with mathematically. I would like to debate this point at length. Most scientists prefer to steer clear of problems that cannot be solved analytically. I can understand this, but we have to go further and build models for complexity. In the 1960s many scientists faced complexity in control theory, physics, biology and chemistry. I'll mention one name: Ilya Prigogine. Nonlinear chaotic dynamical systems are found everywhere and require quantitative methods. Traditional thermodynamics, after all, is about dealing with extremely complex systems on a macro scale, and it is an old and well-understood theory.
In the field of Deep Learning, I think we really need to go further with theory. We should stop regarding dull domain-specific applications of neural networks as Science, and spend more time on analyzing the behavior of these architectures. This may require dropping some analytical techniques that were feasible only with shallow networks, but mathematicians should not be afraid of facing complexity*.
* this would be easier if governments did not push for the publish or perish paradigm.
One of the speakers claimed, in support of DL, that it has allowed "a lot of successful applications". Everyone knows this, but I still wonder whether it is a slightly biased claim (do we publish papers when it fails to work?). And even so, this seems to be the only argument in support of DL: it works.
That's the engineer's view, not the scientist's!
Now, I'm only wondering: what should an engineering researcher do? Favor the first or the second point of view? What would you do? ;)
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, "Reconciling modern machine learning and the bias-variance trade-off", https://arxiv.org/pdf/1812.11118.pdf