TensorFlow Lite is an open-source deep learning framework designed for on-device inference, an approach often described as edge machine learning. It enables developers to deploy trained machine learning models directly on edge devices such as mobile phones, embedded systems and IoT devices without relying on cloud resources.
- It is a production-ready, cross-platform framework used for deploying ML on mobile devices and embedded systems.
- Designed specifically for on-device and edge machine learning
- Supports platforms like Android, iOS, embedded Linux and microcontrollers (MCUs)
- Enables execution of trained models directly on edge devices
- Reduces dependency on cloud-based inference
- Ideal for real-time applications such as object detection and image recognition
Architecture of TensorFlow Lite
The architecture of TensorFlow Lite is designed to enable efficient on-device machine learning by converting and executing models in a lightweight runtime environment.

Components of TensorFlow Lite Architecture
- TensorFlow Model: A trained TensorFlow model that is saved on disk after training using standard TensorFlow workflows.
- TensorFlow Lite Converter: A conversion tool that transforms the trained TensorFlow model into the TensorFlow Lite format and can apply optimizations such as quantization along the way.
- TensorFlow Lite Model File (.tflite): A lightweight platform-independent model format based on FlatBuffers, optimized for low latency, high performance and minimal memory usage.
Deployment and Execution
- Mobile Application: The TensorFlow Lite model file is deployed inside a mobile or embedded application for on-device inference.
- Java API: A high-level wrapper over the C++ API used mainly for Android application development.
- C++ API: A core API available on both Android and iOS that loads the TensorFlow Lite model and invokes the Interpreter.
- Interpreter: Executes the model using optimized operators with selective loading, requiring far less memory than the older TensorFlow Mobile runtime.
- Hardware Acceleration: Uses delegates such as Android NNAPI or the GPU delegate for faster execution on supported devices and falls back to the CPU otherwise.
- Custom Kernels: Developers can implement custom operators using the C++ API which can be executed by the TensorFlow Lite Interpreter.
Supported Models in TensorFlow Lite
- MobileNet: A lightweight vision model designed for efficient image classification on mobile and embedded devices.
- Inception v3: An image recognition model offering higher accuracy than MobileNet but with a larger model size.
- Smart Reply: An on-device conversational model that provides quick reply suggestions in messaging applications, commonly used on Android Wear.
Workflow of TensorFlow Lite
The workflow of TensorFlow Lite involves a simple and efficient pipeline to deploy machine learning models on edge devices. It starts with training a model using TensorFlow and ends with running optimized inference on resource-constrained devices.

- Train the Model: Build and train the machine learning model using TensorFlow on high-performance systems.
- Convert to TensorFlow Lite: Convert the trained model into .tflite format using the TensorFlow Lite Converter, applying optimizations if required.
- Optimize the Model: Apply techniques like quantization to reduce model size and improve inference speed.
- Deploy on Edge Device: Integrate the .tflite model into mobile, embedded or IoT applications.
- Run Inference: Execute the model using the TensorFlow Lite interpreter for fast, on-device predictions.
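The quantization mentioned in the steps above maps floating-point values onto a small integer range. The following is a minimal NumPy sketch of affine 8-bit quantization, the general idea behind integer quantization. The helper names `quantize_int8` and `dequantize_int8` are illustrative and are not TensorFlow Lite APIs, and the converter's real scheme (calibration, per-channel scales) differs in detail.

```python
import numpy as np

def quantize_int8(x):
    """Map float32 values to int8 with an affine scale and zero point (illustrative helper)."""
    x_min, x_max = float(x.min()), float(x.max())
    # Widen the range so that 0.0 is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate float32 values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.array([-1.0, -0.25, 0.0, 0.5, 2.0], dtype=np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)

print("int8 values:", q)
print("scale:", scale, "zero point:", zero_point)
print("max round-trip error:", np.abs(weights - restored).max())
```

Each value now occupies one byte instead of four, and the round-trip error stays within about half of `scale`, which is the source of the small accuracy loss that quantization can introduce.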
TensorFlow Lite uses FlatBuffers as its model file format instead of Protocol Buffers. FlatBuffers allow direct access to serialized data without an unpacking step, which results in faster model loading and lower memory usage. Both are crucial for edge and mobile environments.
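As a small illustration of reading a FlatBuffer without unpacking it, the sketch below inspects only the first 8 bytes of a buffer: a little-endian offset to the root table followed by a 4-character file identifier, which is `TFL3` in the TensorFlow Lite schema. The helper names are hypothetical, the header used in the demo is synthetic, and this is a cheap sanity check rather than a full validation of a model file.

```python
import struct

TFLITE_FILE_IDENTIFIER = b"TFL3"  # file identifier declared in the TensorFlow Lite schema

def read_flatbuffer_header(data):
    """Return (root_table_offset, file_identifier) from the first 8 bytes of a buffer."""
    if len(data) < 8:
        raise ValueError("buffer too short to be a FlatBuffer")
    # Bytes 0-3: little-endian uint32 offset to the root table.
    (root_offset,) = struct.unpack_from("<I", data, 0)
    # Bytes 4-7: optional 4-character file identifier.
    identifier = bytes(data[4:8])
    return root_offset, identifier

def looks_like_tflite(path):
    """Read only the header from disk; the rest of the model is never parsed."""
    with open(path, "rb") as f:
        header = f.read(8)
    return len(header) == 8 and read_flatbuffer_header(header)[1] == TFLITE_FILE_IDENTIFIER

# Synthetic header for demonstration only (not a real model).
demo_header = struct.pack("<I", 28) + TFLITE_FILE_IDENTIFIER
print(read_flatbuffer_header(demo_header))
```

Once a `.tflite` file exists, `looks_like_tflite("distilbert_fp16.tflite")` could be used as a quick check before handing the file to the interpreter.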
Step By Step Implementation
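This example uses a DistilBERT model for text classification, converts it to TensorFlow Lite (FP32 and FP16) for efficient deployment on edge devices and demonstrates how to classify input text. It loads the model and tokenizer, exports a SavedModel, converts it to TFLite and runs inference to predict the text’s class label. Note that distilbert-base-uncased is a base checkpoint without a fine-tuned classification head, so the predicted labels are placeholders until the model is fine-tuned on a labeled dataset.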
Step 1: Install and Import Required Libraries
- Import TensorFlow for building and running deep learning models.
- Import NumPy for handling numerical computations.
- Import the Hugging Face Transformers components: TFDistilBertForSequenceClassification for the model and DistilBertTokenizer for text tokenization.
!pip install -q transformers tensorflow accelerate
import tensorflow as tf
import numpy as np
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer
Step 2: Load Model and Tokenizer
- Define the model name, maximum sequence length and class labels.
- Specify the directory to save or load the TensorFlow model.
- Initialize the DistilBERT tokenizer from the pretrained model.
- Load the DistilBERT sequence classification model using pretrained weights.
MODEL_NAME = "distilbert-base-uncased"
MAX_LEN = 128
CLASS_LABELS = ["Class 0", "Class 1"]
SAVED_MODEL_DIR = "tf_distilbert_saved_model"
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tf_model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, from_pt=True)
print("Model loaded successfully.")
Step 3: Define Serving Function and Save Model
- Create a TensorFlow function (@tf.function) with input signatures for input_ids and attention_mask.
- Pass inputs to the model and return the logits.
- Save the TensorFlow model in the SavedModel format with the defined serving function.
@tf.function(
    input_signature=[
        tf.TensorSpec([None, MAX_LEN], tf.int32, name="input_ids"),
        tf.TensorSpec([None, MAX_LEN], tf.int32, name="attention_mask"),
    ]
)
def serving_fn(input_ids, attention_mask):
    outputs = tf_model(input_ids=input_ids, attention_mask=attention_mask)
    return {"logits": outputs.logits}

tf.saved_model.save(tf_model, SAVED_MODEL_DIR, signatures={"serving_default": serving_fn})
print("SavedModel exported successfully.")
Step 4: Convert and Save TensorFlow Lite Models
- Create a TFLite converter from the saved TensorFlow model.
- Convert the model to FP32 precision and save it as distilbert_fp32.tflite.
- Enable default optimizations and set supported type to FP16.
- Convert the model to FP16 precision and save it as distilbert_fp16.tflite.
# FP32 conversion: no optimizations applied.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
tflite_fp32 = converter.convert()
with open("distilbert_fp32.tflite", "wb") as f:
    f.write(tflite_fp32)
print("FP32 TFLite model saved.")

# FP16 conversion: default optimizations with float16 as the supported type.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()
with open("distilbert_fp16.tflite", "wb") as f:
    f.write(tflite_fp16)
print("FP16 TFLite model saved.")
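To give a rough sense of what FP16 conversion buys, the NumPy sketch below casts a float32 array to float16 and compares storage and round-trip error. The array is a random stand-in for a weight matrix (the 768 x 768 shape is only an assumption for illustration), not actual DistilBERT weights. A real `.tflite` file also stores graph structure and metadata, so its size ratio may not be exactly 2.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for a weight matrix; real model weights are not loaded here.
weights_fp32 = rng.uniform(-1.0, 1.0, size=(768, 768)).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

# float16 uses 2 bytes per value instead of 4.
print("FP32 bytes:", weights_fp32.nbytes)
print("FP16 bytes:", weights_fp16.nbytes)

# Casting back to float32 shows how much precision the narrower type loses.
max_error = float(np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max())
print("Max round-trip error:", max_error)
```

The storage halves while the per-weight error stays small for values in this range, which is why FP16 is often a low-risk first optimization to try.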
Step 5: Load and Inspect TFLite Model
- Load the FP16 TFLite model using tf.lite.Interpreter.
- Allocate tensors for the interpreter to prepare it for inference.
- Retrieve input and output details of the model.
interpreter = tf.lite.Interpreter(model_path="distilbert_fp16.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("Input details:", input_details)
print("Output details:", output_details)
Step 6: Prepare Input Text and Set Tensors
- Define the input text for inference.
- Tokenize the text using the DistilBERT tokenizer with padding and truncation to MAX_LEN.
- Convert input_ids and attention_mask to NumPy arrays of type int32.
- Set the tokenized input tensors in the TFLite interpreter.
text = "TensorFlow Lite is optimized for edge devices."
inputs = tokenizer(
    text,
    max_length=MAX_LEN,
    padding="max_length",
    truncation=True,
    return_tensors="tf"
)
input_ids = inputs["input_ids"].numpy().astype(np.int32)
attention_mask = inputs["attention_mask"].numpy().astype(np.int32)
# The order of input_details is not guaranteed to match the signature,
# so match each input tensor by name instead of by position.
for detail in input_details:
    if "attention_mask" in detail["name"]:
        interpreter.set_tensor(detail["index"], attention_mask)
    else:
        interpreter.set_tensor(detail["index"], input_ids)
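To make the padding and attention mask concrete, here is a minimal sketch of what `padding="max_length"` with truncation produces for a single sequence. The helper `pad_and_mask` is hypothetical and skips the actual WordPiece tokenization. The special-token IDs are those of the BERT uncased vocabulary that DistilBERT reuses, and the two body token IDs are arbitrary values chosen for illustration.

```python
import numpy as np

PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # [PAD], [CLS], [SEP] in the BERT uncased vocabulary

def pad_and_mask(token_ids, max_len):
    """Add [CLS]/[SEP], truncate to max_len, pad, and build the attention mask."""
    body = list(token_ids)[: max_len - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                  # 1 marks real tokens
    padding = max_len - len(ids)
    ids += [PAD_ID] * padding
    mask += [0] * padding                  # 0 marks padding to be ignored
    # Shape (1, max_len) and dtype int32, matching the serving signature.
    return np.array([ids], dtype=np.int32), np.array([mask], dtype=np.int32)

# Arbitrary token IDs, used only to show the layout.
example_ids, example_mask = pad_and_mask([7592, 2088], max_len=8)
print(example_ids)   # [[ 101 7592 2088  102    0    0    0    0]]
print(example_mask)  # [[1 1 1 1 0 0 0 0]]
```

The fixed length matters because the exported signature declares `MAX_LEN` columns, so every input must be padded or truncated to that width.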
Step 7: Run Inference and Get Predictions
- Invoke the TFLite interpreter to perform inference.
- Retrieve the output logits from the model.
- Determine the predicted class by taking the argmax of the logits.
interpreter.invoke()
logits = interpreter.get_tensor(output_details[0]['index'])
predicted_class = np.argmax(logits, axis=1)
print("\n--- TFLite Inference Results ---")
print("Input text:", text)
print("Logits:", logits)
print("Predicted class ID:", predicted_class)
print("Predicted label:", [CLASS_LABELS[i] for i in predicted_class])
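The argmax gives only the winning class. If a confidence score is also useful, the logits can be turned into probabilities with a softmax. The sketch below uses made-up logits rather than real model output, and `softmax` is a small helper written here, not a TensorFlow Lite function.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Made-up logits with shape (batch, num_classes); not actual model output.
example_logits = np.array([[0.0, 1.0]], dtype=np.float32)
probs = softmax(example_logits)

print("Probabilities:", probs)
print("Predicted class ID:", np.argmax(probs, axis=-1))
```

Because softmax preserves ordering, the predicted class is the same as taking the argmax of the raw logits.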
Step 8: Compare Model Sizes
- Define a utility function to calculate the size of a model file or directory.
- Handle both SavedModel directories and single TFLite files.
- Compute the total size in bytes and convert it to megabytes (MB).
- Print and compare the sizes of the original TensorFlow model, FP32 TFLite model and FP16 TFLite model.
import os
def print_model_size(model_path, name):
    if os.path.isdir(model_path):
        # SavedModel: sum the sizes of all files in the directory tree.
        total_size = sum(
            os.path.getsize(os.path.join(dp, f))
            for dp, dn, filenames in os.walk(model_path)
            for f in filenames
        )
    else:
        # Single .tflite file.
        total_size = os.path.getsize(model_path)
    print(f"{name} size: {total_size / (1024*1024):.2f} MB")

print_model_size(SAVED_MODEL_DIR, "Original SavedModel")
print_model_size("distilbert_fp32.tflite", "TFLite FP32 Model")
print_model_size("distilbert_fp16.tflite", "TFLite FP16 Model")
Difference between TensorFlow and TensorFlow Lite
The table below compares TensorFlow with TensorFlow Lite.
| Feature | TensorFlow | TensorFlow Lite |
|---|---|---|
| Purpose | Used to build, train and evaluate ML/DL models | Used mainly for running inference on already trained models |
| Target Devices | High-performance systems like servers, desktops and GPUs | Mobile devices, embedded systems and edge devices |
| Model Training | Fully supports model training and fine-tuning | Focused on inference; training and fine-tuning are normally done in TensorFlow |
| Resource Usage | Requires more memory and computational power | Optimized for low memory and low computation |
| Model Size | Models are generally large | Uses optimized and compressed .tflite models |
| Optimization Techniques | Limited edge-specific optimizations | Supports quantization and other optimizations |
Applications
- Mobile Applications: Used for on-device ML tasks like image, text and speech processing in Android and iOS apps.
- IoT Devices: Enables real-time inference on smart sensors, wearables and embedded systems without cloud dependency.
- Edge Computing: Performs low-latency inference on edge devices such as Raspberry Pi and microcontrollers.
- Computer Vision: Applied in object detection, face recognition, OCR and real-time camera-based applications.
- Speech and Audio Processing: Supports speech recognition, keyword spotting and audio classification on-device.
- Healthcare Applications: Used in portable medical systems for disease detection and medical image analysis.
Advantages
- Model Conversion: Easily converts TensorFlow models into optimized TensorFlow Lite models suitable for mobile and edge devices.
- Minimal Latency: Runs inference on the device itself and avoids a network round trip to a server, which makes it well suited to real-time applications.
- User-Friendly: Simplifies integration of machine learning models into Android and iOS applications.
- Offline Inference: Enables on-device execution without internet connectivity, which is useful for remote and mission-critical applications.
Limitations
- No Training Support: TensorFlow Lite is built for inference; model training and fine-tuning are normally done using TensorFlow.
- Limited Operator Support: Some TensorFlow operations are not supported which may cause issues during model conversion.
- Complex Model Conversion: Large and advanced models such as transformer-based architectures can be difficult to convert efficiently.
- Limited Debugging Tools: Debugging and profiling TensorFlow Lite models is more challenging compared to full TensorFlow.
- Accuracy Trade-offs: Optimization techniques like quantization may lead to a slight reduction in model accuracy.
- Hardware Dependency: Performance improvements depend on device hardware and supported delegates like GPU or NNAPI.
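Since performance depends on the device and on the precision chosen, it is worth measuring latency directly on the target hardware. Below is a minimal timing sketch using only the standard library. The `benchmark` helper and its `warmup` and `runs` parameters are assumptions made for illustration, and the workload here is a stand-in rather than a real model.

```python
import statistics
import time

def benchmark(fn, warmup=3, runs=20):
    """Return (median, worst) latency in milliseconds for a zero-argument callable."""
    # Warm-up calls let caches and lazy initialization settle before timing.
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms), max(timings_ms)

# Stand-in workload; with a loaded model, interpreter.invoke could be passed instead.
median_ms, worst_ms = benchmark(lambda: sum(range(10_000)))
print(f"median: {median_ms:.3f} ms, worst: {worst_ms:.3f} ms")
```

Reporting the median alongside the worst case helps separate typical latency from occasional spikes, and running the same helper on the FP32 and FP16 models gives a like-for-like comparison on a given device.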