From reading numbers to understanding videos

Omid Abdollahi Aghdam
-Jan 12, 2023

Promising yet time-consuming

Handwritten Digit Recognition is considered a fairly simple problem today. However, when Yann LeCun and his colleagues at Bell Labs developed the first Convolutional Neural Network (CNN), it was a challenging problem due to hardware and software constraints. Although Artificial Neural Networks (ANNs) have been around since the 1950s, it was not until 1989 that Yann LeCun and his colleagues first applied the CNN, a variant of the ANN, to the handwritten digit recognition problem. From 1988 to 1993, they developed CNNs and trained them on a dataset of handwritten digits (from 0 to 9) for the digit recognition task. They published their research in a paper that has been cited 51,692 times at the time of writing this article, and they released the MNIST database, which together with their paper paved the way for further developments in Deep Learning in the following years. You can see a demo of their first version of LeNet at work in the video below.

LeNet 1 demo: the first convolutional network to recognize handwritten digits with good speed and accuracy.

The advent of the Internet and the launch of search engines in the late 90s had a great impact on the proliferation of online textual and visual data, which made it feasible for researchers to collect, annotate, and share datasets. Yet one problem remained to be solved: the computational cost of training and running ANNs and CNNs on CPUs. To put it into perspective, according to a comment by Yann LeCun on LinkedIn, the model used in the above video was trained on a Sun 4, which took about 10 days.

Yann LeCun’s answer to a question about the training time of LeNet1 on a LinkedIn post.

Researchers Never Stop

Researchers’ difficulty in dealing with large-scale Neural Networks was a temporary setback in the adoption and development of deep learning. Researchers from academia and industry did not stop there and came up with innovative solutions to overcome the computational cost of deep learning. In 2009, Rajat Raina, Anand Madhavan, and Andrew Ng published “Large-scale Deep Unsupervised Learning using Graphics Processors”, which introduced GPUs to neural networks. Not long after, Dan Claudiu Ciresan et al. broke the accuracy record on the MNIST database by leveraging GPUs to train a deep neural network. Meanwhile, new Computer Vision datasets were introduced in 2009, namely CIFAR-10, CIFAR-100, and ImageNet.

Deep Learning Gains Popularity

From 2010 to 2017, the ImageNet project organized the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In 2012, the work of Alex Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, outperformed all the classic methods by a large margin and inspired the research community to further improve deep learning architectures. The following Google Trends figure demonstrates the increasing interest in Deep Learning during these years. In the years that followed, each year's new CNN architecture surpassed the performance of the previous year's, eventually reaching human-level performance on ImageNet. Some widely used architectures are VGG, Inception, ResNet, and SENet.

Google Trends data from 2010 to 2022 showing the increasing interest in deep learning.

Deep Learning Spring

The arrival of Nvidia GPUs and CUDA, open-source frameworks such as PyTorch and TensorFlow, open-access conferences sponsored by the Computer Vision Foundation such as CVPR, ICCV, and ECCV, and Kaggle competitions have been the catalysts for the proliferation of applications and startups in the field of Computer Vision. We believe that it is just early spring and that we will witness unprecedented adoption of deep learning in the years to come. Nowadays, from medical image analysis to self-driving cars, and from fashion to EHS solutions, there are numerous startups that will continue to foster the field of Deep Learning. Apart from that, big tech companies use deep learning to understand videos and images as soon as you upload them: YouTube analyzes uploaded videos to optimize its recommendation system and to search for copyright violations, Instagram and Facebook do the same, and Tesla Autopilot relies entirely on visual sensors to understand the road and navigate.

Bonus: Can You Defeat the State of the Art?

Although Deep Learning has revolutionized many areas of technology and industry, we are still far from the dream of Artificial General Intelligence (AGI). Given enough data and resources, Deep Learning is capable of defeating humans in many specific tasks, such as AlphaGo winning against the Go world champion. However, even the most advanced deep learning models, such as CLIP (Contrastive Language-Image Pre-Training), which was trained on a dataset of 400 million (image, text) pairs collected from the internet, can sometimes misclassify a sheepdog as a mop. Using CLIP, we tried to answer a series of questions posted by Karen Zack (@teenybiscuit) on Twitter about the similarity between animals and food or other objects. We conducted zero-shot image classification using CLIP and achieved impressive results on images that can be challenging for humans as well. Below you can see the results for the Sheepdog or Mop challenge.

To check for yourself, visit the GitHub repo.
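
For readers curious about how zero-shot classification decides between "sheepdog" and "mop", here is a minimal sketch of the scoring step only. It is not the code from our repo and does not load CLIP. The three-dimensional embeddings are made-up toy values standing in for the vectors a real image encoder and text encoder would produce, the function names are hypothetical, and the scale of 100 is an assumption about the similarity scaling typically applied before the softmax. The idea is to compare the image embedding with the embedding of each candidate text prompt by cosine similarity, then turn the similarities into probabilities.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, scale=100.0):
    """Return a probability for each text label given one image embedding.

    `scale` is an assumed similarity multiplier, not a value taken from our repo.
    """
    image = l2_normalize(image_embedding)
    labels = list(label_embeddings)
    similarities = []
    for label in labels:
        text = l2_normalize(label_embeddings[label])
        similarities.append(sum(a * b for a, b in zip(image, text)))
    probabilities = softmax([scale * s for s in similarities])
    return dict(zip(labels, probabilities))


# Toy embeddings (hypothetical values, NOT real CLIP outputs).
toy_image = [0.9, 0.1, 0.3]
toy_labels = {
    "a photo of a sheepdog": [1.0, 0.0, 0.2],
    "a photo of a mop": [0.2, 1.0, 0.1],
}

probs = zero_shot_classify(toy_image, toy_labels)
prediction = max(probs, key=probs.get)
print(prediction)
```

No training on sheepdog or mop images is involved in this step, which is why the approach is called zero-shot: adding a new class only requires adding a new text prompt and its embedding.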
