Carzam is a simple web application that uses object recognition and machine learning to predict the make, model, and production date range of a user-uploaded vehicle. Simply find a large photo of a supported vehicle you would like to identify and upload it to the app to receive a prediction.
Carzam supports 102 classes of vehicles. Please refer to the supported vehicles section at the bottom of the demo website for a list of all supported vehicles. If you upload an image of a vehicle not on this list, the application may return an erroneous result, as it has not been trained to recognize that vehicle.
Computer Vision and Machine Learning:
YOLOv5 ("You Only Look Once") is a real-time object detection framework. We leverage YOLOv5 to recognize the vehicle in the image, draw a bounding box around it, and crop the background out of the image.
Why do this?
The machine learning model takes this image and examines every pixel for similarities to the data it has previously learned. Take, for instance, a machine learning model that is fed many photos of Jeep Wranglers driving along beaches.
What if we try to identify an image of a Corvette that happens to be driving along a beach? The model may see "beach" and associate the photo with "Jeep Wrangler". This would be an incorrect association as a Corvette is in fact not a Jeep Wrangler.
Cropping reduces the "background noise" of the image, allowing the machine learning model to focus on the vehicle itself.
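The cropping step can be sketched as a small helper that takes YOLOv5-style detections and keeps only the highest-confidence vehicle box. This is a minimal illustration, not the project's actual code: the function name is hypothetical, and the class ids (2 = car, 5 = bus, 7 = truck) follow the COCO classes YOLOv5 is trained on.

```python
import numpy as np

# COCO class ids for vehicles, as used by the standard YOLOv5 models.
VEHICLE_CLASSES = {2, 5, 7}

def crop_to_vehicle(image: np.ndarray, detections) -> np.ndarray:
    """Crop an (H, W, C) image to the most confident detected vehicle.

    `detections` is a sequence of (x1, y1, x2, y2, confidence, class_id)
    tuples, the layout YOLOv5 reports for each bounding box.
    """
    vehicles = [d for d in detections if int(d[5]) in VEHICLE_CLASSES]
    if not vehicles:
        return image  # no vehicle found: leave the image unchanged
    x1, y1, x2, y2, _, _ = max(vehicles, key=lambda d: d[4])
    return image[int(y1):int(y2), int(x1):int(x2)]
```

In the real pipeline the detections would come from running the YOLOv5 model on the uploaded image; the cropped result is what gets passed on to the classifier.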
PyTorch is an open source machine learning library. We chose PyTorch because it has extensive documentation and is one of the most popular machine learning libraries, with support for many pre-trained models.
We relied on transfer learning to build our machine learning model, as building a model from scratch was unfeasible. Transfer learning is a process where a machine learning model developed for one task is reused as the starting point for a new task. This allowed us to reuse a model previously trained on a dataset much larger than we have access to, then tweak it for our needs.
Pre-trained machine learning models come in many shapes and sizes. The strengths and weaknesses of each model essentially boil down to three important considerations: the size of the model, the speed of the model, and the accuracy of the model. Typically, the models with the greatest accuracy also have the largest sizes and slowest speeds, because they were trained on a much larger set of images, resulting in a much larger amount of data that needs to be processed.
We chose the ResNet34 pre-trained model, which has been pre-trained on the ImageNet dataset of over 1.2 million images across 1,000 categories.
By default, ResNet34 provides decent accuracy: 73.3% for correctly predicting an image on the first attempt (top-1), and 91.42% for correctly predicting it within five attempts (top-5). It has a model size of ~100 MB and an average inference time of 5 ms on GPU.
In comparison, the ResNeXt-101 32x8d model offers 79.3% top-1 accuracy and 94.6% top-5 accuracy. It has a model size of ~350 MB and an average inference time of 32 ms.
For a small loss in accuracy, ResNet34 is roughly one third the size of ResNeXt-101 32x8d and roughly six times faster at making a prediction (5 ms vs. 32 ms). Size was the determining factor for our team, since we integrated the model into a web application with limited capacity.
Backend Web Application:
Flask is a lightweight web framework written in Python. We chose Flask for consistency: it is written in Python, and we were already working in Python with the PyTorch library.
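A minimal sketch of what a Flask upload-and-predict endpoint can look like. The route name, field name, and `predict_vehicle` helper are assumptions for illustration, not the project's actual API.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_vehicle(image_bytes: bytes) -> str:
    # Placeholder: the real app would run YOLOv5 cropping and the
    # PyTorch classifier here and return the predicted vehicle class.
    return "unknown"

@app.route("/predict", methods=["POST"])
def predict():
    # Expect the uploaded photo in a multipart form field named "image".
    file = request.files.get("image")
    if file is None:
        return jsonify(error="no image uploaded"), 400
    return jsonify(prediction=predict_vehicle(file.read()))
```

Keeping the prediction logic in a plain function like this also makes it easy to test the endpoint with Flask's built-in test client.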
Docker is an operating-system-level virtualization tool. It allows you to encapsulate your project within a container, guaranteeing that the application will run regardless of the operating system installed on the end user's machine.
We decided to containerize our project in Docker primarily to overcome Heroku's 500 MB slug size limitation.
The slug size is essentially the size of a compressed copy of our application with all of the image recognition and machine learning dependencies installed.
Heroku waives the slug size limitation for Docker deployments, and instead imposes a dyno boot time limitation of 60 seconds on the application.
A secondary benefit of containerizing our project in Docker is that it makes our project more portable if we decide to change cloud providers in the future.
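A minimal Dockerfile along these lines could look like the sketch below. File names such as `requirements.txt` and `app.py`, and the use of gunicorn, are assumptions for illustration, not the project's actual setup.

```dockerfile
# Sketch of a container image for a Flask + PyTorch app on Heroku.
FROM python:3.9-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Heroku assigns the port at runtime via $PORT; bind to it (shell form
# is used so the variable is expanded).
CMD gunicorn --bind 0.0.0.0:$PORT app:app
```

Because Heroku waives the slug size limit for container deployments, an image like this can carry the full ~2 GB of machine learning dependencies, subject to the 60-second boot time limit mentioned above.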
Heroku is a cloud platform-as-a-service with a robust set of free resources for deploying our project to the cloud. It greatly reduced the complexity of deployment by integrating with and automatically deploying from our project's GitHub repository.
The ResNet34 model that we chose as a baseline requires all input images to conform to specific image transformations in order to normalize the data. One of these transformations resizes the image to 400 x 400 pixels.
We noticed that if we allowed users to upload photos in which the vehicle was smaller than 400 x 400 pixels after the background was cropped out, the accuracy of the model decreased. Imagine taking a small image and enlarging it. What happens? It becomes distorted.
Initially we imposed a 400 x 400 pixel minimum on the post-crop image. This ended up being too restrictive, as our application rejected too many photos. We settled on a middle ground of a 200 x 200 pixel minimum for the post-crop vehicle within the image. This allows most reasonably sized images to be accepted by the application and keeps accuracy adequately high.
However, it is still not an ideal solution because it is unreasonable for us to expect the user to know the size of the image they are uploading, let alone the size of the vehicle image after the background has been cropped out.
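The size check itself is simple; the sketch below (function name hypothetical) shows the rule applied to the post-crop dimensions, which is exactly why the user cannot easily predict whether their upload will pass.

```python
MIN_SIDE = 200  # minimum post-crop dimension accepted by the app

def crop_large_enough(width: int, height: int) -> bool:
    # Reject crops smaller than 200 x 200: upscaling such crops to the
    # model's 400 x 400 input distorts the image and hurts accuracy.
    return width >= MIN_SIDE and height >= MIN_SIDE
```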
More importantly, this limitation prevented us from using one of the most robust free car make/model datasets available today: the Stanford Cars Dataset, containing roughly 16,000 images of vehicles across 196 unique vehicle models.
Given the scope and timeline of the project, we decided to move forward with image cropping, the ResNet34 model, and a smaller dataset of roughly 100 classes of vehicles and 4,000 images as a proof of concept. This is an obvious area for improvement in future iterations.
It turns out that computer vision and machine learning require a lot of disk space! Our application ended up being around 2 gigabytes in size once all of the required dependencies were installed.
This presented a major issue for deploying our application on the web, as many hosting services impose a 500 MB limit. As described above in the Docker section, we overcame this issue by containerizing our web application, allowing us to circumvent Heroku's slug size limitation.