Introduction to TensorFlow Lite

TensorFlow Lite is an open-source deep learning framework designed for on-device inference, a form of edge computing. It enables developers to deploy trained machine learning models directly on edge devices such as mobile phones, embedded systems and IoT devices without relying on cloud resources.

It is a production-ready, cross-platform framework used for deploying ML on mobile devices and embedded systems.
Designed specifically for on-device and edge machine learning
Supports platforms like Android, iOS, embedded Linux and microcontrollers (MCUs)
Enables execution of trained models directly on edge devices
Reduces dependency on cloud-based inference
Ideal for real-time applications such as object detection and image recognition

Architecture of TensorFlow Lite

The architecture of TensorFlow Lite is designed to enable efficient on-device machine learning by converting and executing models in a lightweight runtime environment.

Figure: TensorFlow Lite Architecture

Components of TensorFlow Lite Architecture

TensorFlow Model: A trained TensorFlow model that is saved on disk after training using standard TensorFlow workflows.
TensorFlow Lite Converter: A conversion tool that transforms the trained TensorFlow model into the TensorFlow Lite format by applying optimizations such as quantization.
TensorFlow Lite Model File (.tflite): A lightweight platform-independent model format based on FlatBuffers, optimized for low latency, high performance and minimal memory usage.

Deployment and Execution

Mobile Application: The TensorFlow Lite model file is deployed inside a mobile or embedded application for on-device inference.
Java API: A high-level wrapper over the C++ API used mainly for Android application development.
C++ API: A core API available on both Android and iOS that loads the TensorFlow Lite model and invokes the Interpreter.
Interpreter: Executes the model using optimized operators with selective loading, requiring very small memory compared to TensorFlow Mobile.
Hardware Acceleration: Uses Android NNAPI for faster execution on supported devices, otherwise runs on CPU.
Custom Kernels: Developers can implement custom operators using the C++ API which can be executed by the TensorFlow Lite Interpreter.

Supported Models in TensorFlow Lite

MobileNet: A lightweight vision model designed for efficient image classification on mobile and embedded devices.
Inception v3: An image recognition model offering higher accuracy than MobileNet but with a larger model size.
Smart Reply: An on-device conversational model that provides quick reply suggestions in messaging applications, commonly used on Android Wear.

Workflow of TensorFlow Lite

The workflow of TensorFlow Lite involves a simple and efficient pipeline to deploy machine learning models on edge devices. It starts with training a model using TensorFlow and ends with running optimized inference on resource-constrained devices.

Train the Model: Build and train the machine learning model using TensorFlow on high-performance systems.
Convert to TensorFlow Lite: Convert the trained model into .tflite format using the TensorFlow Lite Converter, applying optimizations if required.
Optimize the Model: Apply techniques like quantization to reduce model size and improve inference speed.
Deploy on Edge Device: Integrate the .tflite model into mobile, embedded or IoT applications.
Run Inference: Execute the model using the TensorFlow Lite interpreter for fast, on-device predictions.
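
The quantization mentioned in the optimization step can be illustrated with plain arithmetic. The sketch below is a minimal, standard-library-only illustration of affine 8-bit quantization, where a real value is approximated as scale * (q - zero_point). The function names and the sample weights are hypothetical, and the real converter performs this per tensor or per channel with its own calibration logic, so treat this as a picture of the idea rather than the converter's implementation.

```python
def quantization_params(min_val, max_val, qmin=-128, qmax=127):
    """Derive a scale and zero point that map [min_val, max_val] onto [qmin, qmax]."""
    # The range is widened to include 0.0 so that zero is represented exactly.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = int(round(qmin - min_val / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to clamped 8-bit integers."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [scale * (q - zero_point) for q in q_values]


# Hypothetical weights, for illustration only.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = quantization_params(min(weights), max(weights))
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("restored:", restored)
print("largest round-trip error:", max_error)
```

Each quantized value fits in one byte instead of the four bytes of a float32, which is where the size reduction comes from. The round-trip error is the source of the accuracy trade-off.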

TensorFlow Lite uses FlatBuffers as its model file format instead of Protocol Buffers. FlatBuffers allow direct access to serialized data without an unpacking step, resulting in faster loading and lower memory usage, which is crucial for edge and mobile environments.
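
To make "direct access without an unpacking step" concrete, the sketch below reads two header fields straight out of a byte buffer using only the standard library. It relies on two facts about the format: a FlatBuffers file starts with a 4-byte little-endian offset to the root table, and the TensorFlow Lite schema declares the file identifier TFL3, stored in bytes 4 to 7. The helper names and the synthetic buffer are hypothetical, and this is not the FlatBuffers API, only an illustration of reading fields in place.

```python
import struct

TFLITE_FILE_IDENTIFIER = b"TFL3"


def read_flatbuffer_header(buffer):
    """Return (root_table_offset, file_identifier) without parsing the rest of the buffer."""
    view = memoryview(buffer)
    if len(view) < 8:
        raise ValueError("buffer too small to hold a FlatBuffers header")
    # Fields are read at fixed offsets; nothing else in the buffer is touched.
    (root_offset,) = struct.unpack_from("<I", view, 0)
    identifier = bytes(view[4:8])
    return root_offset, identifier


def looks_like_tflite(buffer):
    """Cheap sanity check on the file identifier; it does not validate the model."""
    _, identifier = read_flatbuffer_header(buffer)
    return identifier == TFLITE_FILE_IDENTIFIER


# Synthetic bytes for illustration only; this is not a valid model.
fake_model = struct.pack("<I", 16) + b"TFL3" + bytes(24)

print(read_flatbuffer_header(fake_model))
print(looks_like_tflite(fake_model))
```

The same check could be applied to the bytes of a real .tflite file read with open(path, "rb").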

Step By Step Implementation

This example uses a DistilBERT model for text classification, converts it to TensorFlow Lite (FP32 and FP16) for efficient deployment on edge devices and demonstrates how to classify input text. It loads the model and tokenizer, exports a SavedModel, converts it to TFLite and runs inference to predict the text's class label.

Step 1: Install and Import Required Libraries

Import TensorFlow for building and running deep learning models.
Import NumPy for handling numerical computations.
Import Hugging Face Transformers components: TFDistilBertForSequenceClassification for the model and DistilBertTokenizer for text tokenization.

Python

!pip install -q transformers tensorflow accelerate

import tensorflow as tf
import numpy as np
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer

Step 2: Load Model and Tokenizer

Define the model name, maximum sequence length and class labels.
Specify the directory to save or load the TensorFlow model.
Initialize the DistilBERT tokenizer from the pretrained model.
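Load the DistilBERT sequence classification model using pretrained weights. The base checkpoint does not include a trained classification head, so the head is randomly initialized and the predicted labels are placeholders until the model is fine-tuned.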

Python

MODEL_NAME = "distilbert-base-uncased"
MAX_LEN = 128
CLASS_LABELS = ["Class 0", "Class 1"]
SAVED_MODEL_DIR = "tf_distilbert_saved_model"

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tf_model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, from_pt=True)

print("Model loaded successfully.")

Step 3: Define Serving Function and Save Model

Create a TensorFlow function (@tf.function) with input signatures for input_ids and attention_mask.
Pass inputs to the model and return the logits.
Save the TensorFlow model in the SavedModel format with the defined serving function.

Python

@tf.function(
    input_signature=[
        tf.TensorSpec([None, MAX_LEN], tf.int32, name="input_ids"),
        tf.TensorSpec([None, MAX_LEN], tf.int32, name="attention_mask"),
    ]
)
def serving_fn(input_ids, attention_mask):
    outputs = tf_model(input_ids=input_ids, attention_mask=attention_mask)
    return {"logits": outputs.logits}

tf.saved_model.save(tf_model, SAVED_MODEL_DIR, signatures={"serving_default": serving_fn})

print("SavedModel exported successfully.")

Step 4: Convert and Save TensorFlow Lite Models

Create a TFLite converter from the saved TensorFlow model.
Convert the model to FP32 precision and save it as distilbert_fp32.tflite.
Enable default optimizations and set supported type to FP16.
Convert the model to FP16 precision and save it as distilbert_fp16.tflite.

Python

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
tflite_fp32 = converter.convert()
with open("distilbert_fp32.tflite", "wb") as f:
    f.write(tflite_fp32)
print("FP32 TFLite model saved.")


converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()
with open("distilbert_fp16.tflite", "wb") as f:
    f.write(tflite_fp16)
print("FP16 TFLite model saved.")
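
As a rough sanity check on what FP16 conversion should achieve, the sketch below estimates weight storage from a parameter count and shows the precision lost when one value is stored as a 16-bit float. The parameter count of about 66 million is an assumption based on the commonly quoted size of DistilBERT base, the helper names are hypothetical, and real file sizes also include graph structure and metadata, so the numbers are only indicative.

```python
import struct


def estimated_weight_megabytes(num_parameters, bytes_per_parameter):
    """Approximate storage needed for the weights alone, in MB."""
    return num_parameters * bytes_per_parameter / (1024 * 1024)


def round_trip_float16(value):
    """Store a Python float as IEEE 754 half precision and read it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


ASSUMED_PARAMETERS = 66_000_000  # assumption: approximate size of DistilBERT base

fp32_mb = estimated_weight_megabytes(ASSUMED_PARAMETERS, 4)  # float32 uses 4 bytes
fp16_mb = estimated_weight_megabytes(ASSUMED_PARAMETERS, 2)  # float16 uses 2 bytes
print(f"Estimated FP32 weights: {fp32_mb:.1f} MB")
print(f"Estimated FP16 weights: {fp16_mb:.1f} MB")

weight = 0.1234567  # hypothetical weight value
stored = round_trip_float16(weight)
print("original:", weight, "as float16:", stored, "error:", abs(weight - stored))
```

Halving the bytes per weight roughly halves the model file, at the cost of a small rounding error in each weight.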

Step 5: Load and Inspect TFLite Model

Load the FP16 TFLite model using tf.lite.Interpreter.
Allocate tensors for the interpreter to prepare it for inference.
Retrieve input and output details of the model.

Python

interpreter = tf.lite.Interpreter(model_path="distilbert_fp16.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

print("Input details:", input_details)
print("Output details:", output_details)

Step 6: Prepare Input Text and Set Tensors

Define the input text for inference.
Tokenize the text using the DistilBERT tokenizer with padding and truncation to MAX_LEN.
Convert input_ids and attention_mask to NumPy arrays of type int32.
Set the tokenized input tensors in the TFLite interpreter.

Python

text = "TensorFlow Lite is optimized for edge devices."


inputs = tokenizer(
    text, 
    max_length=MAX_LEN, 
    padding="max_length", 
    truncation=True, 
    return_tensors="tf"
)

input_ids = inputs["input_ids"].numpy().astype(np.int32)
attention_mask = inputs["attention_mask"].numpy().astype(np.int32)


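# Match tensors by name: the interpreter's input order may differ from the signature order.
for detail in input_details:
    if "input_ids" in detail["name"]:
        interpreter.set_tensor(detail["index"], input_ids)
    elif "attention_mask" in detail["name"]:
        interpreter.set_tensor(detail["index"], attention_mask)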
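
The padding and truncation that the tokenizer applies in this step can be pictured with a small pure-Python sketch. It assumes a padding ID of 0, which matches the [PAD] token of the uncased BERT vocabulary, and uses hypothetical token IDs. Unlike the real tokenizer, it truncates naively and does not preserve the closing special token, so it is only an illustration of how input_ids and attention_mask relate.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token_ids to max_len and build the matching attention mask."""
    kept = list(token_ids[:max_len])
    padding = max_len - len(kept)
    input_ids = kept + [pad_id] * padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


# Hypothetical token IDs, for illustration only.
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=8)
print(ids)
print(mask)
```

Because every input is brought to the same fixed length, the converted model can use a static sequence dimension of MAX_LEN.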

Step 7: Run Inference and Get Predictions

Invoke the TFLite interpreter to perform inference.
Retrieve the output logits from the model.
Determine the predicted class by taking the argmax of the logits.

Python

interpreter.invoke()


logits = interpreter.get_tensor(output_details[0]['index'])
predicted_class = np.argmax(logits, axis=1)


print("\n--- TFLite Inference Results ---")
print("Input text:", text)
print("Logits:", logits)
print("Predicted class ID:", predicted_class)
print("Predicted label:", [CLASS_LABELS[i] for i in predicted_class])

Output:

Figure: TensorFlow Lite FP16 DistilBERT inference output
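
The argmax step above returns only a class ID. If a confidence score is also wanted, the logits can be converted to probabilities with a softmax. The sketch below uses only the standard library and hypothetical logit values. It is not part of the original pipeline, just one common way to interpret logits.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a two-class model, for illustration only.
example_logits = [0.2, -0.4]
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print("probabilities:", probs)
print("predicted class ID:", predicted)
```

The predicted class is the same one argmax would pick from the raw logits, since softmax preserves their order.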

Step 8: Compare Model Sizes

Define a utility function to calculate the size of a model file or directory.
Handle both SavedModel directories and single TFLite files.
Compute the total size in bytes and convert it to megabytes (MB).
Print and compare the sizes of the original TensorFlow model, FP32 TFLite model and FP16 TFLite model.

Python

import os

def print_model_size(model_path, name):
    if os.path.isdir(model_path):
        total_size = sum(
            os.path.getsize(os.path.join(dp, f))
            for dp, dn, filenames in os.walk(model_path)
            for f in filenames
        )
    else:
        total_size = os.path.getsize(model_path)
    print(f"{name} size: {total_size / (1024*1024):.2f} MB")


print_model_size(SAVED_MODEL_DIR, "Original SavedModel")
print_model_size("distilbert_fp32.tflite", "TFLite FP32 Model")
print_model_size("distilbert_fp16.tflite", "TFLite FP16 Model")

Difference between TensorFlow and TensorFlow Lite

Here we compare TensorFlow Lite with TensorFlow:

Feature	TensorFlow	TensorFlow Lite
Purpose	Used to build, train and evaluate ML/DL models	Used mainly for running inference on already trained models
Target Devices	High-performance systems like servers, desktops and GPUs	Mobile devices, embedded systems and edge devices
Model Training	Fully supports model training and fine-tuning	Designed for inference; models are trained in TensorFlow
Resource Usage	Requires more memory and computational power	Optimized for low memory and low computation
Model Size	Models are generally large	Uses optimized and compressed .tflite models
Optimization Techniques	Limited edge-specific optimizations	Supports quantization and other optimizations

Applications

Mobile Applications: Used for on-device ML tasks like image, text and speech processing in Android and iOS apps.
IoT Devices: Enables real-time inference on smart sensors, wearables and embedded systems without cloud dependency.
Edge Computing: Performs low-latency inference on edge devices such as Raspberry Pi and microcontrollers.
Computer Vision: Applied in object detection, face recognition, OCR and real-time camera-based applications.
Speech and Audio Processing: Supports speech recognition, keyword spotting and audio classification on-device.
Healthcare Applications: Used in portable medical systems for disease detection and medical image analysis.

Advantages

Model Conversion: Easily converts TensorFlow models into optimized TensorFlow Lite models suitable for mobile and edge devices.
Minimal Latency: Provides faster inference, making it ideal for real-time applications.
User-Friendly: Simplifies integration of machine learning models into Android and iOS applications.
Offline Inference: Enables on-device execution without internet connectivity, which is useful for remote and mission-critical applications.

Limitations

No Training Support: TensorFlow Lite is designed for inference; model training and fine-tuning must be done using TensorFlow.
Limited Operator Support: Some TensorFlow operations are not supported, which may cause issues during model conversion.
Complex Model Conversion: Large and advanced models such as transformer-based architectures can be difficult to convert efficiently.
Limited Debugging Tools: Debugging and profiling TensorFlow Lite models is more challenging compared to full TensorFlow.
Accuracy Trade-offs: Optimization techniques like quantization may lead to a slight reduction in model accuracy.
Hardware Dependency: Performance improvements depend on device hardware and supported delegates like GPU or NNAPI.
