Image Segmentation and Classification Using Deep Learning
Updated: Dec 15, 2020
-- Before you read this: try looking for something unusual in the photo above. --
Did you notice the huge toothbrush on the left? Probably not. When humans search a scene for a particular object, we often miss objects whose size is inconsistent with the rest of the scene, according to scientists in the Department of Psychological & Brain Sciences at UC Santa Barbara. “When something appears at the wrong scale, you will miss it more often because your brain automatically ignores it,” said UCSB professor Miguel Eckstein, who specializes in computational human vision, visual attention, and search.
The experiment used scenes of ordinary objects featured in computer-generated images that varied in color, viewing angle, and size, mixed with “target-absent” scenes. The researchers asked 60 viewers to search for these objects (e.g., toothbrush, parking meter, computer mouse) while eye-tracking software monitored the paths of their gaze. The researchers found that people tended to miss the target more often when it was not scaled to appropriate dimensions (too large or too small) — even when looking directly at the target object. Computer vision (a subfield of AI), by contrast, doesn’t have this issue, the scientists reported.
In the last few years, computer vision techniques such as object detection, classification, and segmentation have improved significantly thanks to advances in deep learning. We can now build computer vision models that detect objects, determine their shapes, and much more.
Computers can understand images at various levels and for each of these levels, there is a problem defined in the computer vision domain. Let’s look at a few of them below:
- Silicon Valley (Hot dog / Not hot dog app) -
Image classification is one of the fundamental building blocks of computer vision. Given an image, the task is to assign it to one of a set of classes. In this task, we assume there is only one object (not multiple) in the image.
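As a minimal sketch of this idea (the labels and raw scores below are made up; in practice the scores would come from a trained network's final layer), classification reduces to a softmax over the per-class scores followed by an argmax:

```python
import numpy as np

# Hypothetical class labels and raw model scores (logits) for one image.
labels = ["hot dog", "not hot dog"]
logits = np.array([2.3, 0.4])

# Softmax turns raw scores into probabilities that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The predicted class is the one with the highest probability.
predicted = labels[int(np.argmax(probs))]
print(predicted)  # hot dog
```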
-- Original image vs segmented mask --
In semantic segmentation, we perform the classification at the pixel level. The output of a semantic segmentation model is not a label or bounding-box parameters but a high-resolution image of the same size as the input image, in which each pixel is classified into a particular class.
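A sketch of what “classification at the pixel level” means, assuming a model that outputs a score map with one channel per class (mocked here with a small array): the predicted mask is the per-pixel argmax over the class channels, and it has the same height and width as the input.

```python
import numpy as np

# Hypothetical per-pixel class scores of shape (num_classes, H, W),
# as a segmentation network would produce: 2 classes, 2x3 image.
scores = np.array([
    [[0.9, 0.2, 0.1],
     [0.8, 0.3, 0.4]],   # class 0: background
    [[0.1, 0.8, 0.9],
     [0.2, 0.7, 0.6]],   # class 1: object
])

# Semantic segmentation output: one class label per pixel,
# with the same spatial size as the input image.
mask = scores.argmax(axis=0)
print(mask)
# [[0 1 1]
#  [0 1 1]]
```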
-- Difference between semantic segmentation and instance segmentation --
Instance segmentation identifies object boundaries at a detailed pixel level and, unlike semantic segmentation, distinguishes separate instances of the same class. For example, in the above image there is a clear demarcation between the chairs in the instance segmentation output, as opposed to the semantic segmentation output.
UNET for Semantic Segmentation
UNET is a deep learning model developed by Olaf Ronneberger et al. that can perform semantic as well as instance segmentation. It was initially developed for biomedical image segmentation.
- An example of UNET applied on a slice of dental CT Scan (i) Input image (ii) UNET output (iii) Mask overlayed on the input image (Left to Right) -
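The core idea of UNET is an encoder-decoder with skip connections: encoder features are concatenated into the decoder so that fine spatial detail survives the downsampling. Below is a minimal PyTorch sketch of that idea (a toy module with one down/up step, not Ronneberger et al.'s exact architecture; all layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy UNET-style network: encode, downsample, upsample,
    then fuse the encoder features via a skip connection."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # After concatenating the skip connection: 16 + 16 channels.
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)  # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                     # full-resolution encoder features
        m = self.mid(self.down(e))          # bottleneck at half resolution
        u = self.up(m)                      # back to full resolution
        d = self.dec(torch.cat([u, e], 1))  # fuse the skip connection
        return self.head(d)

net = TinyUNet()
out = net(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64])
```

The output keeps the input's spatial size, one score map per class, which is exactly the per-pixel classification described above.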
Let’s consider a real business problem: identifying grocery objects in an image. Major use cases include inventory management and self-checkout, which help simplify the grocery purchasing experience. A real-world example of such a system is Amazon Go.
-- Amazon Go: A chain of convenience stores where grocery purchasing is partially-automated --
We at Zignite worked on a smaller-scale POC object identification system. The task was to segment the objects in an image and classify each one into one of 25,000 predetermined labels. Identifying grocery objects therefore involves two main tasks: first segmenting all the objects in the image, then identifying each segmented object. This involves segmentation and classification.
The challenge was that we did not have many data points to train the model. We did have the advantage that the background of the input images did not vary much. In essence, the task was to make the model distinguish background from object at the pixel level. This way, even with very few training images, the model could perform well on unseen data.
The first step was to create training data for the UNET model. To do this, we used AWS SageMaker and manually annotated the objects.
-- AWS SageMaker semantic segmentation tool --
We trained on very few examples (fewer than 30); below is an example of the UNET model’s output on an unseen image.
-- Segmenting grocery objects from images --
The next step after performing semantic segmentation is to add a bounding box for each segmented object. A bounding box gives us the location of the object in the image in pixel coordinates.
-- Using UNET mask to find the bounding boxes --
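The mask-to-boxes step can be sketched as connected-component labeling of the binary mask followed by taking the min/max pixel coordinates of each component. Below is a plain-Python/numpy sketch (in practice a library routine such as `scipy.ndimage.label` or OpenCV contours would do the labeling):

```python
import numpy as np
from collections import deque

def mask_to_boxes(mask):
    """Return one (row_min, col_min, row_max, col_max) box per connected
    foreground component of a binary mask (4-connectivity)."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    boxes = []
    for r0 in range(mask.shape[0]):
        for c0 in range(mask.shape[1]):
            if mask[r0, c0] and not seen[r0, c0]:
                # Flood-fill this component, tracking its extents.
                rmin = rmax = r0
                cmin = cmax = c0
                q = deque([(r0, c0)])
                seen[r0, c0] = True
                while q:
                    r, c = q.popleft()
                    rmin, rmax = min(rmin, r), max(rmax, r)
                    cmin, cmax = min(cmin, c), max(cmax, c)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                                and mask[nr, nc] and not seen[nr, nc]):
                            seen[nr, nc] = True
                            q.append((nr, nc))
                boxes.append((rmin, cmin, rmax, cmax))
    return boxes

# A tiny binary mask with two separate objects.
mask = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
])
print(mask_to_boxes(mask))  # [(0, 0, 1, 1), (1, 3, 2, 3)]
```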
-- Finding similarity with all images in the database --
Having successfully segmented the grocery objects in the image, the next step is to identify them. For this, we maintain a database of reference objects to compare against. An embedding similarity-based model then assigns a label to each segmented image. The idea is to leverage pre-trained models (such as VGG Net) to convert images into flat vector representations (embeddings), and then measure similarity by how close two vectors are to each other. Because we compute these vectors for all images in our database ahead of time, the method is fast and scalable; a further advantage is that it requires no model training. Below is a result of passing the segmented images through the embedding similarity-based model.
-- Embedding similarity-based model output labels --
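The lookup itself can be sketched as cosine similarity between a query embedding and the precomputed database embeddings (pure numpy; the embeddings would really come from a pretrained network such as VGG, and the vectors and labels below are made up for illustration):

```python
import numpy as np

# Hypothetical precomputed embeddings for the database images
# (in practice: feature vectors extracted by a pretrained CNN).
db_labels = ["cereal box", "milk carton", "soda can"]
db_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])

def identify(query, db_embeddings, db_labels):
    """Label a segmented image by its nearest database embedding
    under cosine similarity."""
    db_norm = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    sims = db_norm @ q_norm  # cosine similarity to every database entry
    return db_labels[int(np.argmax(sims))]

# Embedding of one segmented crop (made up for the sketch).
query = np.array([0.2, 0.8, 0.2])
print(identify(query, db_embeddings, db_labels))  # milk carton
```

Because the database embeddings are normalized once up front, each query is just a matrix-vector product and an argmax, which is what makes the approach fast and scalable.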
This approach of using UNET for semantic segmentation and an embedding similarity-based model for object identification performed well on most of our test images. The method has its own disadvantages, however. For example, when objects are very close to each other in an image, semantic segmentation models are not equipped to identify the boundaries between the different objects. In such cases, instance segmentation would be a better approach.
-- UNET mask fails at identifying the boundary between the different bottles --