Learning Document Image Features With SqueezeNet Convolutional Neural Network

Document Type: Original Article


Department of Computer Science Engineering, Shahid Beheshti University, Tehran, Iran


Classifying document images into classes is an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered the current state-of-the-art models for this task. However, these classifiers have two major drawbacks: training them demands huge computational power, and they contain a very large number of weights. Previous successful attempts at learning document image features have been based on training very large CNNs. SqueezeNet is a CNN architecture that achieves accuracy comparable to other state-of-the-art CNNs while containing up to 50 times fewer weights, but it has not previously been evaluated on document image classification tasks. In this research we take a novel approach to learning document image features by training a very small CNN, SqueezeNet. We show that an ImageNet-pretrained SqueezeNet achieves an accuracy of approximately 75 percent over 10 classes on the Tobacco-3482 dataset, which is comparable to other state-of-the-art CNNs. We then visualize saliency maps of the gradient of our trained SqueezeNet's output with respect to its input, which show that the network is able to learn meaningful features that are useful for document classification. Previous works in this field have placed no emphasis on visualizing the learned document features. The prominence in the extracted saliency maps of features such as the existence of handwritten text, document titles, text alignment and tabular structures indicates that the network does not overfit to redundant representations of the rather small Tobacco-3482 dataset, which contains only 3482 document images over 10 classes.
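
As an illustration of the saliency maps described above, the gradient of a class score with respect to the input pixels can be sketched as follows. This is a minimal sketch under stated assumptions, not the network or code used in this work: a tiny randomly initialized two-layer ReLU classifier stands in for the trained SqueezeNet, the 8x8 input stands in for a document image, and the names `forward` and `saliency_map` are hypothetical.

```python
import numpy as np


def forward(x, W1, b1, W2, b2):
    """Stand-in classifier (assumption): flatten -> ReLU hidden layer -> class scores."""
    h_pre = W1 @ x.ravel() + b1          # hidden pre-activations
    h = np.maximum(h_pre, 0.0)           # ReLU
    scores = W2 @ h + b2                 # unnormalized class scores
    return scores, h_pre


def saliency_map(x, W1, b1, W2, b2, target_class=None):
    """Absolute gradient of one class score with respect to every input pixel."""
    scores, h_pre = forward(x, W1, b1, W2, b2)
    if target_class is None:
        target_class = int(np.argmax(scores))   # default: predicted class
    # Backpropagate d(score_c)/d(input) by hand through the two layers.
    grad_h = W2[target_class] * (h_pre > 0)     # ReLU passes gradient only where active
    grad_x = W1.T @ grad_h                      # gradient w.r.t. the flattened input
    return np.abs(grad_x).reshape(x.shape), target_class


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
height, width, hidden, classes = 8, 8, 16, 10
W1 = rng.normal(size=(hidden, height * width))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(classes, hidden))
b2 = rng.normal(size=classes)

image = rng.random((height, width))             # placeholder "document image"
sal, cls = saliency_map(image, W1, b1, W2, b2)
print("class used for the map:", cls)
print("saliency map shape:", sal.shape)
```

Pixels with large values in `sal` are those whose small changes most affect the chosen class score. With a real trained network, the same quantity would normally be obtained through a framework's automatic differentiation rather than a hand-written backward pass.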

