One-shot learning and Computer Vision
admin | 22 April 2021
Artificial Intelligence has made its way into most major sectors and much of our daily life. Deep neural networks have certainly contributed to this, as they can capture complex, dynamic patterns in data and make predictions based on them. However, one of the major problems in training such networks is the huge amount of data they require. Especially in computer vision tasks, annotating and managing huge datasets makes a project computationally expensive and time-consuming.
To solve this data problem, especially where getting hold of that much data is not possible, people have tried to refine the approach so that a model can be trained with less data. We as humans need much less ‘supervision’ to learn anything new. In comparison, the best deep learning algorithms take hundreds to thousands of images to learn a face or classify an image. Hence the arrival of one-shot learning, where only one or very few images or data examples are used for training. Applications of one-shot learning are seen primarily in face recognition and image recognition tasks.
The main idea behind the one-shot learning approach is to convert the classification problem into a distance evaluation problem. For example, if we try to solve face identification/verification problems with a traditional approach, we need many images of every person to be identified. Not only that, but adding a new face to the system also requires retraining with hundreds of images. Obviously, this is not a scalable solution, and it poses computational difficulties for large-scale identification (e.g. at airports). Instead of trying to classify a person from a given input image, we evaluate the “distance” between an ‘Anchor’ image and a ‘Positive’ or ‘Negative’ image, which makes life much easier. This approach is massively scalable: at prediction time, even for a person the model has never seen, one anchor image and one input image can be enough to get good results. Usually, instead of a conventional CNN classifier, we use a Siamese network (with Triplet loss) to train such a model.
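To make the distance idea concrete, here is a minimal sketch in plain Python. It assumes the images have already been turned into embedding vectors by some trained network; the three-dimensional vectors, the `margin` of 0.2 and the `threshold` of 0.5 are made-up values for illustration, not numbers from any particular system. It uses plain Euclidean distance; some formulations of Triplet loss use the squared distance instead.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise it grows with the violation."""
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


def is_same_identity(anchor, candidate, threshold=0.5):
    """Verification at prediction time: one anchor, one input, one distance."""
    return euclidean_distance(anchor, candidate) < threshold


# Hypothetical embeddings (assumed to come from a trained Siamese network).
anchor = [0.10, 0.90, 0.30]    # reference photo of person A
positive = [0.12, 0.88, 0.31]  # another photo of person A
negative = [0.80, 0.10, 0.50]  # photo of person B

well_separated = triplet_loss(anchor, positive, negative)  # nothing to fix
violated = triplet_loss(anchor, negative, positive)        # roles swapped
match = is_same_identity(anchor, positive)
mismatch = is_same_identity(anchor, negative)
```

Note that enrolling a new person only means storing one more anchor embedding. Under these assumptions nothing has to be retrained, which is where the scalability comes from.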
Training a Siamese network still requires a good number of images or data examples, though it is much easier to curate such APN (anchor, positive, negative) triplets. Researchers have also looked into reducing the number of training samples in the first place. A well-known study on dataset distillation (Wang et al., 2018) tried to shrink the training set by identifying the information that actually matters. It effectively reduced the MNIST dataset (a famous ML dataset of handwritten digits) from 60,000 training images to just 10 synthetic ones without significantly affecting model performance. More recent research (Sucholutsky and Schonlau, 2020) reduces the number of training samples even further by ‘soft’ labeling the data, so that a model can learn N classes from fewer than N samples and may never need to see an actual example of some of the classes it is later tested on.
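The soft labeling idea can be illustrated with a toy example: two training points on a line, each carrying a probability distribution over three classes instead of a single hard label. The sketch below is only in the spirit of that research, not the authors' implementation; the function name, the inverse-distance weighting and all the numbers are assumptions chosen for illustration.

```python
def classify_with_soft_labels(x, prototypes, eps=1e-9):
    """Sum each prototype's soft label, weighted by inverse distance to x,
    and return the index of the class with the highest total score."""
    n_classes = len(prototypes[0][1])
    scores = [0.0] * n_classes
    for position, soft_label in prototypes:
        # eps avoids division by zero when x sits exactly on a prototype.
        weight = 1.0 / (abs(x - position) + eps)
        for cls, prob in enumerate(soft_label):
            scores[cls] += weight * prob
    return max(range(n_classes), key=scores.__getitem__)


# Two hypothetical training points, three classes (0, 1, 2).
# No point has class 1 as its most likely label.
prototypes = [
    (0.0, [0.6, 0.4, 0.0]),  # mostly class 0, partly class 1
    (1.0, [0.0, 0.4, 0.6]),  # mostly class 2, partly class 1
]

# Query points near the left point, in the middle, and near the right point.
predictions = [classify_with_soft_labels(x, prototypes) for x in (0.1, 0.5, 0.9)]
```

In this toy setup the middle of the line is assigned to class 1 even though neither training point is primarily labeled as class 1, so three classes are separated using only two samples.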
Approaches like one-shot learning or zero-shot learning have their own set of problems as well. They are very specific to the task and cannot be generalized to other use cases. Also, it is easier to fool such systems (in face verification tasks, by wearing fake beards, goggles etc.) than a conventional classification algorithm. Considering both pros and cons, this topic is worth more research and attention, both for further improvements and for our own understanding.
- Sucholutsky, Ilia, and Matthias Schonlau. “‘Less Than One’-Shot Learning: Learning N Classes From M < N Samples.” arXiv preprint arXiv:2009.08449 (2020).
- Wang, Tongzhou, et al. “Dataset distillation.” arXiv preprint arXiv:1811.10959 (2018).