PhotoBooth Lite on Raspberry Pi with TensorFlow Lite
News category: AI Videos
Source: blog.tensorflow.org
Posted by Lucia Li, TensorFlow Lite Intern
Figure: Illustration of the Smart Photo Booth application running in real time.
Why should we build an application on Raspberry Pi?
Raspberry Pi is not only a widely used embedded platform, but also tiny in size and cheap in price. We decided to use TensorFlow Lite because it is designed specifically for mobile and IoT devices, which makes it a perfect fit for Raspberry Pi.
What do we need to build the Photo Booth App demo?
We implemented our Photo Booth App on a Raspberry Pi 3B+ with 1GB of RAM, running a 32-bit ARMv7 operating system. The application takes both image and audio input, so we also need a camera and a microphone, as well as a monitor for display. The total cost is under $100 USD. The details are listed below:
- A Raspberry Pi ($35)
  - Quad-core 64-bit processor clocked at 1.4GHz
  - 1GB LPDDR2 SDRAM
- A camera to capture images (~$15+)
- A microphone to sample audio data (~$5+)
- A 7-inch monitor (~$20+)
How do we detect smiling faces?
Using a single model to detect faces and predict the resulting smiling score, with both high accuracy and low latency, is difficult. Thus, we detect a smiling face in three steps:
Figure: Smiling Face Detection Workflow
- Apply a face detection model to detect whether there is a face in the given image.
- If there is a face, crop it from the original image.
- With the cropped face image, apply a facial attribute classification model to measure if it is a smiling face.
We applied several optimizations to reduce memory usage and latency:
- We leveraged the TensorFlow Model Optimization Toolkit's post-training quantization. In this tutorial, you can see how easy it is to use with your own TensorFlow Lite model.
- We resized the original image captured from the camera while keeping its aspect ratio fixed. The downscaling factor can be 4 or 2, depending on the original size; we try to keep the image smaller than 160x160 (the originally designed input size is 320x320). Smaller inputs significantly reduce inference time, as shown in the table below. In our application, the image captured from the camera is 640x480, so we resized it to 160x120.
- Instead of using the original image for facial attribute classification, we cropped standard faces and discarded the background. This reduces the input size while keeping the useful information.
- We used multiple threads for inference.
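The three-step workflow above can be sketched as follows. `detect_faces`, `crop_face`, and `classify_smile` are hypothetical stand-ins for the TensorFlow Lite model invocations described in this post, so the orchestration, not the stub values, is the point:

```python
# Sketch of the three-step smile-detection pipeline. The three functions
# below are hypothetical stubs standing in for the actual TensorFlow Lite
# models; only the control flow mirrors the post.

def detect_faces(image):
    """Stub for the face detection model: returns a list of
    (bounding_box, landmarks) pairs. Here we fake one detection."""
    return [((10, 10, 90, 90), [(30, 30), (70, 30), (50, 50),
                                (50, 70), (15, 45), (85, 45)])]

def crop_face(image, landmarks):
    """Stub for the face cropper: would rotate/crop/resize to 128x128."""
    return "128x128-face"

def classify_smile(face):
    """Stub for the attribute classifier: returns a smiling probability."""
    return 0.92

def smiling_faces(image, threshold=0.5):
    results = []
    for box, landmarks in detect_faces(image):
        face = crop_face(image, landmarks)
        prob = classify_smile(face)
        results.append((box, prob, prob >= threshold))
    return results

# One (box, probability, is_smiling) tuple per detected face.
print(smiling_faces("camera-frame"))
```

In the real application each stub is an interpreter invocation on the corresponding quantized model, and steps 2 and 3 only run when step 1 actually finds a face.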
Table: Face Detection Latency Comparison
Face detection
Our face detection model consists of an 8-bit quantized, modified MobileNet v1 body and an SSD-Lite head with a 0.25 depth multiplier. Its size is only slightly larger than 200KB. Why is this model so small? First, a TensorFlow Lite model is based on FlatBuffers, which is smaller in size than the protobuf-based TensorFlow format. Second, we applied 8-bit quantization. Third, our modified MobileNet v1 has fewer channels than the original. Similar to most face detection models, our model outputs the position of a bounding box and 6 landmarks: the left eye, right eye, nose tip, mouth center, left ear tragion, and right ear tragion. We also apply non-maximum suppression to filter repeated faces. The inference time of our face detection TensorFlow Lite model is about 30ms, which means it can detect faces on Raspberry Pi in real time.
Figure: Example of the bounding box and 6 landmarks.
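The non-maximum suppression mentioned above can be illustrated with a minimal greedy implementation; the boxes, scores, and the 0.5 IoU threshold are illustrative, not values taken from the app:

```python
# Minimal greedy non-maximum suppression (illustrative; the app's exact
# implementation and threshold are not specified in the post).

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two heavily overlapping detections of the same face plus one distinct face.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the duplicate detection is suppressed
```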
Face cropper
The detected face may have various orientations and sizes. To unify them for better classification, we rotate, crop, and resize the original image. The inputs to this function are the positions of the 6 landmarks from the face detection model. With these landmarks, we can compute the rotation Euler angles and resize ratios, which gives us a 128x128 standard face. The figure below shows an example of our face cropper function. The blue bounding box is the output of the face detection model, while the red bounding box is our calculated cropping bounding box. For pixels outside the image, we duplicate the border pixels.
Figure: Face Cropper Illustration
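As a sketch of the alignment math, the in-plane (roll) rotation can be estimated from the two eye landmarks alone; the real cropper uses all 6 landmarks to compute full Euler angles, and the target inter-eye distance below is an assumed value for illustration:

```python
import math

# Sketch: estimate the roll angle from the two eye landmarks so the face
# can be rotated upright before cropping to the 128x128 standard face.
# The post's cropper uses all 6 landmarks; this shows only the eye-based
# roll angle and scale, with an assumed target inter-eye distance.

def roll_angle(left_eye, right_eye):
    """Angle (degrees) to rotate the image so the eyes are horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def resize_ratio(eye_distance, target_distance=48.0):
    """Scale factor mapping the inter-eye distance to a fixed size inside
    the 128x128 face (target_distance=48 is an illustrative assumption)."""
    return target_distance / eye_distance

print(roll_angle((40, 60), (80, 60)))   # level eyes -> 0.0
print(roll_angle((40, 60), (80, 100)))  # tilted face -> 45.0
```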
Face attribute classification
Our face attribute classification model is also an 8-bit quantized MobileNet model. With a 128x128 standard face as input, the model outputs a float between 0 and 1 predicting the smiling probability, as well as a 90-d vector predicting age from 0 to 90. Its inference time on Raspberry Pi is around 30ms.
How do we recognize speech commands?
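Decoding these two outputs might look like the sketch below. The post does not say how the 90-d age vector is reduced to a single age; taking the expectation of the normalized vector is our assumption, as is the 0.5 smile threshold:

```python
# Sketch: decoding the classifier's two outputs. The smile output is a
# probability in [0, 1]; the age output is a 90-d vector over ages.
# Treating that vector as a distribution and taking its expected value
# is an assumption; the post does not specify the decoding.

def decode_outputs(smile_prob, age_vector, smile_threshold=0.5):
    total = sum(age_vector)
    probs = [v / total for v in age_vector]          # normalize to a distribution
    expected_age = sum(age * p for age, p in enumerate(probs))
    return {"smiling": smile_prob >= smile_threshold,
            "age": round(expected_age)}

# A distribution peaked around age 25.
age_vec = [0.0] * 90
age_vec[24], age_vec[25], age_vec[26] = 0.2, 0.6, 0.2

print(decode_outputs(0.8, age_vec))  # {'smiling': True, 'age': 25}
```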
Real-time speech command recognition can also be divided into three steps:
- Pre-processing: we use a sliding window to store the latest 1s of audio data, shifted by 512 frames from the previous recording.
- Inference: given a 1s audio input, we apply a speech command recognition model to get probabilities for four categories ("yes"/"no"/"silence"/"unknown").
- Post-processing: we average the current inference result with previous ones. When the average probability of a word exceeds a threshold, we decide that a speech command has been detected.
Pre-processing:
We use PortAudio, an open-source audio I/O library, to capture audio data from the microphone. The following figure shows how we store the audio data.
Figure: Audio Stream Processing
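The sliding window described above (the latest 1s of 16kHz audio, advanced in 512-frame chunks) can be sketched with a ring buffer; here the chunks are simulated rather than delivered by PortAudio:

```python
from collections import deque

# Sketch of the sliding audio window: keep the latest 1s of samples
# (16000 at 16kHz) and advance by 512-frame chunks. In the app PortAudio
# delivers the chunks; here we simulate them with constant values.

SAMPLE_RATE = 16000
CHUNK = 512

window = deque([0.0] * SAMPLE_RATE, maxlen=SAMPLE_RATE)

def push_chunk(chunk):
    """Append a new 512-frame chunk; the oldest samples fall off the front."""
    window.extend(chunk)
    return list(window)  # a 1s snapshot to feed the recognition model

# Simulate a few chunks arriving from the microphone.
for i in range(3):
    snapshot = push_chunk([float(i)] * CHUNK)

print(len(snapshot), snapshot[-1])  # 16000 2.0
```

Because consecutive windows overlap by 16000 - 512 samples, the model sees each stretch of audio many times, which is what makes the averaging in the post-processing step meaningful.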
Speech command recognition
The speech command recognition model we used can be found in many public TensorFlow examples. It is composed of an audio spectrogram, MFCC features, 2 convolutional layers, and 1 fully-connected layer. The input to this model is 1s of audio data at a 16kHz sampling rate. The dataset is public, or you can train the model yourself. It contains 30 categories of speech command data; since we only need "yes" and "no", we relabel all other categories as "unknown". Additionally, we used several methods to improve latency:
- We cut the number of channels in half. The TensorFlow Lite model size is about 1.9 MB after compression.
- We used 4 output channels in the last fully-connected layer rather than the usual 12, as we only need 4 categories.
- We used multiple threads for inference.
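The relabeling of the dataset's 30 command categories onto the 4 the model predicts can be sketched as a simple mapping; note that in the public dataset "silence" is typically synthesized from background-noise recordings rather than being one of the 30 word labels:

```python
# Sketch: mapping dataset labels onto the 4 categories the model predicts.
# Any word other than "yes"/"no" becomes "unknown"; "silence" is kept as
# its own category (in practice it is generated from background noise).

CATEGORIES = ["silence", "unknown", "yes", "no"]  # ordering is illustrative

def remap_label(label):
    return label if label in ("yes", "no", "silence") else "unknown"

print(remap_label("yes"))   # yes
print(remap_label("left"))  # unknown
```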
Post-processing:
Figure: Audio Stream Post-processing
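The averaging-and-threshold step described above can be sketched as follows; the window length of 3 results and the 0.7 threshold are illustrative assumptions, not values from the app:

```python
from collections import deque

# Sketch of the post-processing step: average the most recent inference
# results and report a command only when the mean probability for one
# word crosses a threshold. Window length (3) and threshold (0.7) are
# illustrative assumptions.

LABELS = ["yes", "no", "silence", "unknown"]
HISTORY = deque(maxlen=3)        # the last few probability vectors
THRESHOLD = 0.7

def postprocess(probs):
    HISTORY.append(probs)
    n = len(HISTORY)
    mean = [sum(p[i] for p in HISTORY) / n for i in range(len(LABELS))]
    best = max(range(len(LABELS)), key=lambda i: mean[i])
    return LABELS[best] if mean[best] >= THRESHOLD else None

# A noisy frame followed by consistent "yes" frames: averaging suppresses
# the one-off spike, and "yes" fires only once the evidence is sustained.
for probs in ([0.2, 0.1, 0.6, 0.1],
              [0.9, 0.05, 0.03, 0.02],
              [0.9, 0.05, 0.03, 0.02],
              [0.9, 0.05, 0.03, 0.02]):
    print(postprocess(probs))  # None, None, None, yes
```

Averaging over overlapping windows trades a little detection latency for robustness: a single spurious high-probability frame can no longer trigger a command.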