Created: June 24, 2016 / Updated: March 22, 2020 / Status: in progress / 12 min read (~2244 words)
Handwriting recognition was one of the first tasks to interest machine learning and AI researchers. The initial goal was to rapidly process IRS forms and convert them into their digital equivalent. As a result, a large amount of handwritten content was available; all that was missing was to label it and then develop tools to convert the character images into their ASCII equivalent.
- BCCS: BiConnected/Binary Connected Components
- Can handwriting recognition be taught through the process of teaching a network how to write?
- How to detect characters?
- How to support multi-scale character recognition?
- How to group characters together to form words/large numbers?
- How do you group together words in order to form lines?
- How to improve word recognition accuracy using a vocabulary?
- How do you properly classify characters?
- Given that a typical MNIST neural network is trained on 28x28 images with 256 gray values, the input space to cover contains $256^{784}$ possible images
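To give a sense of scale for that input space: a 28x28 image has 784 pixels, and each pixel takes one of 256 gray values, so the number of distinct images works out as follows (the power-of-ten figure is rounded):

$$256^{784} = \left(2^{8}\right)^{784} = 2^{6272} \approx 10^{1888}$$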
- Content is written on white sheets of paper
- Paper might have lines (loose leaf sheet)
- Paper may have various formats (generally 8.5"x11")
- Glyphs are generally written using a color that has a large contrast with the sheet/background color
- Text may be blurry due to the capture device
- Image size is expected to vary (due to the capture device)
- PPI may vary
- Language may vary and many languages may be used within the same page
- Page may contain one or many images/drawings
- Preprocess image
- RGB to gray (0-255) or black and white (0-1)
- Canny edge detection
- Noise removal (Gaussian, salt and pepper)
- Line extraction
- Cropping of the text region
- Vertical scan to find blank rows => line separator
- Letter extraction
- Horizontal scan to find blank columns => character separator
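The steps above can be sketched in plain Python as a projection-profile segmentation. This is a minimal sketch under stated assumptions: images are nested lists rather than arrays, the gray conversion uses the ITU-R BT.601 luma weights, the threshold of 128 is arbitrary, and Canny edge detection and noise removal are left out. All function names are hypothetical.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to 0-255 gray (BT.601 luma weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]


def binarize(gray, threshold=128):
    """Map dark pixels to 1 (ink) and light pixels to 0 (paper)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def runs(has_ink):
    """Return (start, end) spans, end exclusive, of consecutive True entries."""
    spans, start = [], None
    for i, flag in enumerate(has_ink):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(has_ink)))
    return spans


def extract_lines(binary):
    """Scan rows top to bottom; blank rows separate text lines."""
    return runs([any(row) for row in binary])


def extract_characters(binary, line):
    """Scan columns inside one text line; blank columns separate characters."""
    top, bottom = line
    width = len(binary[0])
    columns = [any(binary[y][x] for y in range(top, bottom)) for x in range(width)]
    return runs(columns)


# Toy page: '#' is ink, '.' is paper (assumed already binarized).
page = [
    "..........",
    ".##..#....",
    ".##..#....",
    "..........",
    "...###....",
    "..........",
]
binary = [[1 if c == "#" else 0 for c in row] for row in page]
lines = extract_lines(binary)
print(lines)                                  # [(1, 3), (4, 5)]
print(extract_characters(binary, lines[0]))   # [(1, 3), (5, 6)]
```

This only works when lines and characters are separated by fully blank rows and columns; slanted lines or touching characters would need a different approach.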
- A memory model that is used to remember where we are currently looking (either through landmarks or some form of (x, y) coordinates)
- A reading (attention) model that knows how to scan pages (depending on the language)
- A character recognition model
DNN: 2 layers of dense 512 units with relu activation, with a dropout layer of ratio 0.2, softmax activation on the output layer, optimizing categorical cross-entropy
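As a rough check on the size of this network, the sketch below counts its trainable parameters and writes out the softmax output and categorical cross-entropy loss in plain Python. The 28x28 input and the 10 output classes are assumptions borrowed from the MNIST setting, not values stated for this experiment. Dropout adds no parameters, so it does not appear in the count.

```python
import math


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total trainable parameters of a stack of dense layers."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs):
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


# Assumed sizes: 28*28 inputs, two hidden layers of 512 units, 10 classes.
sizes = [28 * 28, 512, 512, 10]
print(mlp_params(sizes))  # 669706

probs = softmax([0.0, 0.0])
print(probs)  # [0.5, 0.5]
print(categorical_cross_entropy([1, 0], probs))  # log(2), about 0.693
```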
CNN: 2 convolutional layers of kernel size (3, 3) with relu activation, then max pooling (2, 2), 0.25 dropout, flatten, dense 128 with relu activation, 0.5 dropout and finally a softmax output layer, optimizing categorical cross-entropy
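To see how the feature map shrinks through this architecture, the sketch below walks the layer shapes and counts parameters. Several values are assumptions, since the description does not give them: a 28x28x1 input, 32 and 64 filters in the two convolutions, "valid" padding with stride 1, and 10 output classes. The helper names are hypothetical.

```python
def conv2d(shape, filters, kernel=3):
    """Output shape and parameter count of a 'valid', stride-1 convolution."""
    height, width, channels = shape
    out_shape = (height - kernel + 1, width - kernel + 1, filters)
    params = kernel * kernel * channels * filters + filters
    return out_shape, params


def max_pool(shape, size=2):
    """Output shape of non-overlapping max pooling (no parameters)."""
    height, width, channels = shape
    return (height // size, width // size, channels)


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def cnn_summary(input_shape=(28, 28, 1), filters=(32, 64), num_classes=10):
    """Return (shape after pooling, flattened size, total parameters)."""
    shape, total = input_shape, 0
    for n_filters in filters:
        shape, params = conv2d(shape, n_filters)
        total += params
    shape = max_pool(shape)
    flat = shape[0] * shape[1] * shape[2]
    total += dense_params(flat, 128) + dense_params(128, num_classes)
    return shape, flat, total


print(cnn_summary())  # ((12, 12, 64), 9216, 1199882)
```

Under these assumptions almost all of the parameters sit in the dense layer that follows the flatten step, not in the convolutions.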
Test set size is 25% of the training set size (thus 20% of the total data set size). The training and test sets are fixed.
| Network | Character classes | Training set size | Training duration per epoch (s) | Test loss | Test accuracy |
|---|---|---|---|---|---|
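The fixed split described above (a test set 25% the size of the training set, hence 20% of the total) can be sketched as follows. The function name and the list of integers standing in for samples are hypothetical; the split takes no random shuffle, so the partition is the same on every run.

```python
def fixed_split(samples, test_ratio_of_train=0.25):
    """Split without shuffling; the test size is a fraction of the training size."""
    n_total = len(samples)
    # total = train + ratio * train, so train = total / (1 + ratio)
    n_train = round(n_total / (1 + test_ratio_of_train))
    return samples[:n_train], samples[n_train:]


train, test = fixed_split(list(range(1000)))
print(len(train), len(test))         # 800 200
print(len(test) / len(train))        # 0.25
print(len(test) / (len(train) + len(test)))  # 0.2
```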
- To decide what should be improved first, note that a sequential pipeline is most affected by its earliest components: errors made early on propagate to, and accumulate in, every later stage
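A simple way to quantify this is to assume that the $n$ stages fail independently and that a later stage cannot recover from an earlier mistake. Both are simplifying assumptions, and the 95% figure below is purely illustrative, not a measured result. With $p_i$ the accuracy of stage $i$:

$$P_{\text{end-to-end}} = \prod_{i=1}^{n} p_i, \qquad \text{e.g. } 0.95^{3} \approx 0.857$$

Under these assumptions, three stages that are each 95% accurate already lose about 14% of the input end to end.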
- Parents read to children while pointing at the part of the text they are reading
- Children point to objects they recognize and link the word to the object
- An association between known words, their phonetics and how they are written is built up over time
- Partial alphabetic
- Full alphabetic
- Consolidated alphabetic
- Multi-scale character detection via sliding window classification
- Use of Random Ferns due to the large number of categories (62: 26 uppercase + 26 lowercase + 10 digits)
- Naturally multi-class and efficient both to train and test
- The features consist of applying randomly chosen thresholds on randomly chosen entries in a HOG descriptor computed at the window location
- Application of non-maximal suppression
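The last step, non-maximum suppression, can be sketched as a greedy filter over scored sliding-window detections. This is a class-agnostic sketch, not necessarily the variant used in the approach summarized above. The box format `(x1, y1, x2, y2, score)` and the 0.5 overlap threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det, other) <= iou_threshold for other in kept):
            kept.append(det)
    return kept


detections = [
    (10, 10, 30, 30, 0.9),  # strongest response
    (12, 12, 32, 32, 0.8),  # overlaps the first box heavily (IoU about 0.68)
    (50, 10, 70, 30, 0.7),  # separate location
]
print(non_max_suppression(detections))
# [(10, 10, 30, 30, 0.9), (50, 10, 70, 30, 0.7)]
```

With a multi-scale sliding window, the same glyph tends to fire at several neighbouring positions and scales, and this step keeps one detection per location.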
- Glyphs satisfy a continuity prior: a pixel next to a glyph pixel, in any direction, is likely to be part of the same glyph, unless its value is too different, in which case it is likely not part of the glyph
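One way to act on this continuity prior is connected-component labeling: starting from an ink pixel, grow the glyph through its neighbours. The sketch below assumes the image is already binarized, so "too different" collapses to ink versus paper, and it uses 8-connectivity. Both are assumptions, and the function name is hypothetical.

```python
from collections import deque


def connected_components(binary):
    """Label 8-connected groups of ink pixels; returns (labels, count)."""
    height, width = len(binary), len(binary[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or labels[y][x]:
                continue
            # New glyph candidate: flood-fill outward from this seed pixel.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and binary[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


page = [
    "##..#",
    "#...#",
    "..#..",
]
binary = [[1 if c == "#" else 0 for c in row] for row in page]
labels, count = connected_components(binary)
print(count)  # 3
for row in labels:
    print(row)
```

On a grayscale image, the same loop could compare neighbouring intensities against a tolerance instead of testing a binary mask.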
- Ba, Jimmy, Volodymyr Mnih, and Koray Kavukcuoglu. "Multiple object recognition with visual attention." arXiv preprint arXiv:1412.7755 (2014).
- Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber. "Multi-column deep neural networks for image classification." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
- Goodfellow, Ian J., et al. "Multi-digit number recognition from street view imagery using deep convolutional neural networks." arXiv preprint arXiv:1312.6082 (2013).
- Graves, Alex, et al. "A novel connectionist system for unconstrained handwriting recognition." IEEE transactions on pattern analysis and machine intelligence 31.5 (2009): 855-868.
- Graves, Alex, and Jürgen Schmidhuber. "Offline handwriting recognition with multidimensional recurrent neural networks." Advances in neural information processing systems. 2009.
- Larochelle, Hugo, and Geoffrey E. Hinton. "Learning to combine foveal glimpses with a third-order Boltzmann machine." Advances in neural information processing systems. 2010.
- Neumann, Lukáš, and Jiří Matas. "Real-time scene text localization and recognition." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
- Wang, Kai, Boris Babenko, and Serge Belongie. "End-to-end scene text recognition." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
- Bluche, Théodore, Jérôme Louradour, and Ronaldo Messina. "Scan, attend and read: End-to-end handwritten paragraph recognition with mdlstm attention." arXiv preprint arXiv:1604.03286 (2016).
- Lucas, Simon M., et al. "ICDAR 2003 robust reading competitions." Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on. IEEE, 2003.
- Karatzas, Dimosthenis, et al. "ICDAR 2013 robust reading competition." Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013.
- Non-maximum suppression - http://www.pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/