Convolutional Neural Networks (CNNs) are among the most effective machine learning approaches for a variety of important real-world problems. CNNs require a very large amount of computation, so it is important to make the best use of available hardware resources.
The im2col approach, which lowers convolution to a general matrix multiplication (GEMM) over an expanded input matrix, has been highly successful in Deep Neural Network (DNN) frameworks such as Caffe, Theano and Torch. However, a major downside of im2col is the space explosion caused by building that input matrix: for a convolution with a 2D k × k kernel, the associated Toeplitz matrix is k² times larger than the original image.
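To make the blow-up concrete, the following is a minimal NumPy sketch of im2col for a single-channel image with unit stride and no padding (the function name and data layout here are illustrative assumptions; production frameworks additionally handle channels, strides and padding):

    import numpy as np

    def im2col(image, k):
        # Lower an H x W image to a (k*k) x ((H-k+1)*(W-k+1)) matrix in
        # which every sliding k x k patch occupies one column.
        H, W = image.shape
        out_h, out_w = H - k + 1, W - k + 1
        cols = np.empty((k * k, out_h * out_w), dtype=image.dtype)
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = image[i:i + k, j:j + k].ravel()
        return cols

    # Convolution becomes a single GEMM: flatten the kernel to a row
    # vector and multiply it against the lowered matrix.
    image = np.random.rand(32, 32).astype(np.float32)
    kernel = np.random.rand(3, 3).astype(np.float32)
    cols = im2col(image, k=3)                    # 9 x 900 lowered matrix
    output = (kernel.ravel() @ cols).reshape(30, 30)

    # Each pixel is replicated into up to k*k columns, so the lowered
    # matrix approaches k^2 times the size of the input as H and W grow:
    print(cols.nbytes / image.nbytes)            # ~7.9 here; limit is k^2 = 9

This replication is the price im2col pays for being able to reuse highly optimized GEMM routines.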
In this paper we propose a new approach to DNN convolution that allows us to exploit existing optimized GEMM routines for accelerators and processors, but does not require the costly input transformation.