Information Bottleneck and its Applications in Deep Learning
Subject Areas: Machine Learning
Authors: Hassan Hafez Kolahi 1, Shohreh Kasaei 2
1 - Sharif University of Technology
2 - Sharif University of Technology
Keywords: Machine Learning, Information Theory, Information Bottleneck, Deep Learning, Variational Auto-Encoder.
Abstract:
Information Theory (IT) has been used in Machine Learning (ML) since the early days of the field. In the last decade, advances in Deep Neural Networks (DNNs) have led to surprising improvements in many ML applications, prompting a paradigm shift in the community toward revisiting earlier ideas within this new framework. Ideas from IT are no exception. One idea being revisited by many researchers in this new era is the Information Bottleneck (IB), a formulation of optimal information extraction grounded in IT. The IB is promising both for analyzing and for improving DNNs. The goal of this survey is to review the IB concept and demonstrate its applications in deep learning. The information-theoretic nature of IB also makes it a good vehicle for illustrating, more generally, how IT can be used in ML. Two themes are highlighted in this narrative: i) the concise and universal view that IT provides of seemingly unrelated ML methods, demonstrated by explaining how IB relates to minimal sufficient statistics, stochastic gradient descent, and variational auto-encoders; and ii) the common technical mistakes and pitfalls that arise when applying ideas from IT, discussed through a careful study of some recent methods that suffer from them.
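For reference, the IB formulation mentioned above can be sketched as follows (this is the standard objective of Tishby, Pereira, and Bialek, stated here for orientation; the symbols X, Y, T, and the trade-off parameter beta follow common IB notation rather than anything defined in this abstract). The IB seeks a stochastic representation T of an input X that is as compressed as possible while remaining informative about a target Y:

\[
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y), \qquad \beta \ge 0,
\]

where \(I(\cdot\,;\cdot)\) denotes mutual information and \(\beta\) controls the trade-off between compressing X and preserving information about Y.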