Introduction to TensorFlow Lite

Last Updated : 29 Dec, 2025

TensorFlow Lite is an open-source deep learning framework designed for on-device inference, a form of edge computing. It enables developers to deploy trained machine learning models directly on edge devices such as mobile phones, embedded systems and IoT devices without relying on cloud resources.

  • Production-ready, cross-platform framework for deploying ML on mobile devices and embedded systems
  • Designed specifically for on-device and edge machine learning
  • Supports platforms like Android, iOS, embedded Linux and microcontrollers (MCUs)
  • Enables execution of trained models directly on edge devices
  • Reduces dependency on cloud-based inference
  • Ideal for real-time applications such as object detection and image recognition

Architecture of TensorFlow Lite

The architecture of TensorFlow Lite is designed to enable efficient on-device machine learning by converting and executing models in a lightweight runtime environment.

Figure: TensorFlow Lite architecture

Components of TensorFlow Lite Architecture

  • TensorFlow Model: A trained TensorFlow model that is saved on disk after training using standard TensorFlow workflows.
  • TensorFlow Lite Converter: A conversion tool that transforms the trained TensorFlow model into the TensorFlow Lite format by applying optimizations such as quantization.
  • TensorFlow Lite Model File (.tflite): A lightweight platform-independent model format based on FlatBuffers, optimized for low latency, high performance and minimal memory usage.

Deployment and Execution

  • Mobile Application: The TensorFlow Lite model file is deployed inside a mobile or embedded application for on-device inference.
  • Java API: A high-level wrapper over the C++ API used mainly for Android application development.
  • C++ API: A core API available on both Android and iOS that loads the TensorFlow Lite model and invokes the Interpreter.
  • Interpreter: Executes the model using optimized operators with selective loading, requiring very small memory compared to TensorFlow Mobile.
  • Hardware Acceleration: Uses delegates such as Android NNAPI or the GPU delegate for faster execution on supported devices; otherwise the model runs on the CPU.
  • Custom Kernels: Developers can implement custom operators using the C++ API which can be executed by the TensorFlow Lite Interpreter.

Supported Models in TensorFlow Lite

  • MobileNet: A lightweight vision model designed for efficient image classification on mobile and embedded devices.
  • Inception v3: An image recognition model offering higher accuracy than MobileNet but with a larger model size.
  • Smart Reply: An on-device conversational model that provides quick reply suggestions in messaging applications, commonly used on Android Wear.

Workflow of TensorFlow Lite

The workflow of TensorFlow Lite involves a simple and efficient pipeline to deploy machine learning models on edge devices. It starts with training a model using TensorFlow and ends with running optimized inference on resource-constrained devices.

Figure: TensorFlow Lite workflow

  • Train the Model: Build and train the machine learning model using TensorFlow on high-performance systems.
  • Convert to TensorFlow Lite: Convert the trained model into .tflite format using the TensorFlow Lite Converter, applying optimizations if required.
  • Optimize the Model: Apply techniques like quantization to reduce model size and improve inference speed.
  • Deploy on Edge Device: Integrate the .tflite model into mobile, embedded or IoT applications.
  • Run Inference: Execute the model using the TensorFlow Lite interpreter for fast, on-device predictions.
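
The quantization mentioned in the optimization step can be illustrated with a minimal sketch. It assumes the common affine scheme, real ≈ scale × (q − zero_point), with signed 8-bit codes. The function names and sample weights are hypothetical, and the code is plain Python for illustration rather than what the converter runs internally.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point so that [min, max] maps onto [qmin, qmax]."""
    # The representable range must include 0.0 so that zero stays exact.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to int8 codes, clamping to the valid range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [scale * (q - zero_point) for q in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.5]  # hypothetical sample weights
scale, zero_point = choose_qparams(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print("largest round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, and the round-trip error stays within about half a quantization step, which is the source of the small accuracy loss that quantization can introduce.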

TensorFlow Lite uses FlatBuffers as its model file format instead of Protocol Buffers. FlatBuffers allow direct access to serialized data without an unpacking step, which gives faster loading and lower memory usage, both crucial in edge and mobile environments.
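
As a small illustration of reading fields in place, the sketch below checks whether a file looks like a TensorFlow Lite model by inspecting its first eight bytes. It relies on the FlatBuffers layout (a 4-byte little-endian root offset followed by a 4-byte file identifier) and on the identifier "TFL3" declared in the TensorFlow Lite schema. The helper name and the demo files are hypothetical, and the check is a quick sanity test, not a full validation of the model.

```python
import os
import struct
import tempfile

TFLITE_IDENTIFIER = b"TFL3"  # file identifier declared in the TensorFlow Lite schema


def looks_like_tflite(path):
    """Return True if the file carries the TFLite FlatBuffers identifier."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        return False
    # Bytes 0-3 hold the offset to the root table; bytes 4-7 hold the identifier.
    return header[4:8] == TFLITE_IDENTIFIER


# Demonstration with hand-made files, not real models.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = os.path.join(tmp, "fake.tflite")
    with open(fake_model, "wb") as f:
        f.write(struct.pack("<I", 28) + TFLITE_IDENTIFIER)
    other_file = os.path.join(tmp, "notes.txt")
    with open(other_file, "wb") as f:
        f.write(b"not a model file")
    is_model = looks_like_tflite(fake_model)
    is_other = looks_like_tflite(other_file)

print(is_model, is_other)
```

The same helper can be pointed at any .tflite file produced by the converter to confirm that the export wrote a FlatBuffers model.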

Step By Step Implementation

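This example takes a DistilBERT model for text classification, converts it to TensorFlow Lite (FP32 and FP16) for efficient deployment on edge devices and demonstrates how to classify input text. It loads the model and tokenizer, exports a SavedModel, converts it to TFLite and runs inference to predict the text's class label. Because distilbert-base-uncased is a base checkpoint without a fine-tuned classification head, the predicted labels are placeholders: the example demonstrates the conversion pipeline, not classification accuracy.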

Step 1: Install and Import Required Libraries

  • Import TensorFlow for building and running deep learning models.
  • Import NumPy for handling numerical computations.
  • Import the Hugging Face Transformers components: TFDistilBertForSequenceClassification for the model and DistilBertTokenizer for text tokenization.
Python
!pip install -q transformers tensorflow accelerate

import tensorflow as tf
import numpy as np
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer

Step 2: Load Model and Tokenizer

  • Define the model name, maximum sequence length and class labels.
  • Specify the directory to save or load the TensorFlow model.
  • Initialize the DistilBERT tokenizer from the pretrained model.
  • Load the DistilBERT sequence classification model using pretrained weights.
Python
MODEL_NAME = "distilbert-base-uncased"
MAX_LEN = 128
CLASS_LABELS = ["Class 0", "Class 1"]
SAVED_MODEL_DIR = "tf_distilbert_saved_model"

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tf_model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, from_pt=True)

print("Model loaded successfully.")

Step 3: Define Serving Function and Save Model

  • Create a TensorFlow function (@tf.function) with input signatures for input_ids and attention_mask.
  • Pass inputs to the model and return the logits.
  • Save the TensorFlow model in the SavedModel format with the defined serving function.
Python
@tf.function(
    input_signature=[
        tf.TensorSpec([None, MAX_LEN], tf.int32, name="input_ids"),
        tf.TensorSpec([None, MAX_LEN], tf.int32, name="attention_mask"),
    ]
)
def serving_fn(input_ids, attention_mask):
    outputs = tf_model(input_ids=input_ids, attention_mask=attention_mask)
    return {"logits": outputs.logits}

tf.saved_model.save(tf_model, SAVED_MODEL_DIR, signatures={"serving_default": serving_fn})

print("SavedModel exported successfully.")

Step 4: Convert and Save TensorFlow Lite Models

  • Create a TFLite converter from the saved TensorFlow model.
  • Convert the model to FP32 precision and save it as distilbert_fp32.tflite.
  • Enable default optimizations and set supported type to FP16.
  • Convert the model to FP16 precision and save it as distilbert_fp16.tflite.
Python
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
tflite_fp32 = converter.convert()
with open("distilbert_fp32.tflite", "wb") as f:
    f.write(tflite_fp32)
print("FP32 TFLite model saved.")


converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()
with open("distilbert_fp16.tflite", "wb") as f:
    f.write(tflite_fp16)
print("FP16 TFLite model saved.")
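
To make the FP32 versus FP16 trade-off concrete, here is a minimal sketch that uses only the standard library. It rounds a few values to IEEE 754 half precision with the struct module's "e" format and compares storage size and rounding error. The sample weights and the helper name are hypothetical, and the sketch illustrates the number format only, not the converter's internals.

```python
import struct


def to_fp16_and_back(value):
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weights = [0.1, -0.7531, 3.14159, 1e-3]  # hypothetical sample weights
fp32_bytes = len(weights) * struct.calcsize("<f")  # 4 bytes per value
fp16_bytes = len(weights) * struct.calcsize("<e")  # 2 bytes per value
rel_errors = [abs(to_fp16_and_back(w) - w) / abs(w) for w in weights]

print("FP32 bytes:", fp32_bytes, "FP16 bytes:", fp16_bytes)
print("Largest relative error:", max(rel_errors))
```

Storage is halved, while each value in the normal half-precision range keeps a relative error of at most 2^-11, which is why FP16 conversion usually costs little accuracy.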

Step 5: Load and Inspect TFLite Model

  • Load the FP16 TFLite model using tf.lite.Interpreter.
  • Allocate tensors for the interpreter to prepare it for inference.
  • Retrieve input and output details of the model.
Python
interpreter = tf.lite.Interpreter(model_path="distilbert_fp16.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

print("Input details:", input_details)
print("Output details:", output_details)
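
The raw detail dictionaries are verbose, so a small helper can make them easier to read. The sketch below is a hypothetical convenience function that uses only the "name", "index", "shape" and "dtype" keys of each entry. The mock entries and their tensor names are hand-written stand-ins for illustration, not output captured from the model above.

```python
def summarize_details(details):
    """Return one readable line per tensor from interpreter detail dictionaries."""
    lines = []
    for d in details:
        # NumPy dtypes expose __name__; plain strings fall back to str().
        dtype = getattr(d["dtype"], "__name__", str(d["dtype"]))
        shape = tuple(int(n) for n in d["shape"])
        lines.append(f"[{d['index']}] {d['name']}: shape={shape}, dtype={dtype}")
    return lines


# Hand-written stand-ins for interpreter.get_input_details(); real entries
# hold NumPy arrays and dtypes and carry additional keys.
mock_input_details = [
    {"name": "serving_default_attention_mask:0", "index": 1, "shape": [1, 128], "dtype": "int32"},
    {"name": "serving_default_input_ids:0", "index": 0, "shape": [1, 128], "dtype": "int32"},
]
summary = summarize_details(mock_input_details)
print("\n".join(summary))
```

With a real interpreter, summarize_details(interpreter.get_input_details()) gives the same one-line view and makes it easy to see which position each named input occupies.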

Step 6: Prepare Input Text and Set Tensors

  • Define the input text for inference.
  • Tokenize the text using the DistilBERT tokenizer with padding and truncation to MAX_LEN.
  • Convert input_ids and attention_mask to NumPy arrays of type int32.
  • Set the tokenized input tensors in the TFLite interpreter.
Python
text = "TensorFlow Lite is optimized for edge devices."


inputs = tokenizer(
    text, 
    max_length=MAX_LEN, 
    padding="max_length", 
    truncation=True, 
    return_tensors="tf"
)

input_ids = inputs["input_ids"].numpy().astype(np.int32)
attention_mask = inputs["attention_mask"].numpy().astype(np.int32)


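# Input order in a converted model is not guaranteed, so look tensors up by name.
def input_index(keyword):
    return next(d["index"] for d in input_details if keyword in d["name"])

interpreter.set_tensor(input_index("input_ids"), input_ids)
interpreter.set_tensor(input_index("attention_mask"), attention_mask)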

Step 7: Run Inference and Get Predictions

  • Invoke the TFLite interpreter to perform inference.
  • Retrieve the output logits from the model.
  • Determine the predicted class by taking the argmax of the logits.
Python
interpreter.invoke()


logits = interpreter.get_tensor(output_details[0]['index'])
predicted_class = np.argmax(logits, axis=1)


print("\n--- TFLite Inference Results ---")
print("Input text:", text)
print("Logits:", logits)
print("Predicted class ID:", predicted_class)
print("Predicted label:", [CLASS_LABELS[i] for i in predicted_class])

Output:

Figure: TensorFlow Lite FP16 DistilBERT inference output
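
Logits are unnormalized scores, so taking the argmax gives a class but no sense of confidence. A minimal sketch of converting logits to probabilities with a numerically stable softmax follows. The logit values are hypothetical, not output from the model above, and the helper is plain Python.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_labels = ["Class 0", "Class 1"]
sample_logits = [0.3, -0.1]  # hypothetical values for illustration
probs = softmax(sample_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(class_labels[best], [round(p, 3) for p in probs])
```

With the tutorial's variables, the same function could be applied as softmax(logits[0].tolist()) to report a probability next to the predicted label.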

Step 8: Compare Model Sizes

  • Define a utility function to calculate the size of a model file or directory.
  • Handle both SavedModel directories and single TFLite files.
  • Compute the total size in bytes and convert it to megabytes (MB).
  • Print and compare the sizes of the original TensorFlow model, FP32 TFLite model and FP16 TFLite model.
Python
import os

def print_model_size(model_path, name):
    if os.path.isdir(model_path):
        total_size = sum(
            os.path.getsize(os.path.join(dp, f))
            for dp, dn, filenames in os.walk(model_path)
            for f in filenames
        )
    else:
        total_size = os.path.getsize(model_path)
    print(f"{name} size: {total_size / (1024*1024):.2f} MB")


print_model_size(SAVED_MODEL_DIR, "Original SavedModel")
print_model_size("distilbert_fp32.tflite", "TFLite FP32 Model")
print_model_size("distilbert_fp16.tflite", "TFLite FP16 Model")

Output:

Figure: Model size comparison
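
As a sanity check on the measured sizes, the weight storage can be estimated from the parameter count alone. The sketch below assumes a round figure of 66 million parameters for DistilBERT base. That figure and the helper name are assumptions for illustration, and real files also contain graph structure and metadata, so measured sizes will differ somewhat.

```python
def estimated_weight_size_mb(num_parameters, bytes_per_parameter):
    """Rough size of the stored weights alone, in MB (1 MB = 1024 * 1024 bytes)."""
    return num_parameters * bytes_per_parameter / (1024 * 1024)


APPROX_PARAMETERS = 66_000_000  # assumed round figure for DistilBERT base

fp32_mb = estimated_weight_size_mb(APPROX_PARAMETERS, 4)  # 4 bytes per FP32 weight
fp16_mb = estimated_weight_size_mb(APPROX_PARAMETERS, 2)  # 2 bytes per FP16 weight
print(f"FP32 estimate: {fp32_mb:.1f} MB")
print(f"FP16 estimate: {fp16_mb:.1f} MB")
```

If the FP16 file is roughly half the size of the FP32 file, the conversion stored the weights in half precision as intended.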


Difference between TensorFlow and TensorFlow Lite

The table below compares TensorFlow Lite with TensorFlow.

| Feature | TensorFlow | TensorFlow Lite |
|---|---|---|
| Purpose | Used to build, train and evaluate ML/DL models | Used mainly for running inference on already trained models |
| Target Devices | High-performance systems like servers, desktops and GPUs | Mobile devices, embedded systems and edge devices |
| Model Training | Fully supports model training and fine-tuning | Does not support training, only inference |
| Resource Usage | Requires more memory and computational power | Optimized for low memory and low computation |
| Model Size | Models are generally large | Uses optimized and compressed .tflite models |
| Optimization Techniques | Limited edge-specific optimizations | Supports quantization and other optimizations |

Applications

  • Mobile Applications: Used for on-device ML tasks like image, text and speech processing in Android and iOS apps.
  • IoT Devices: Enables real-time inference on smart sensors, wearables and embedded systems without cloud dependency.
  • Edge Computing: Performs low-latency inference on edge devices such as Raspberry Pi and microcontrollers.
  • Computer Vision: Applied in object detection, face recognition, OCR and real-time camera-based applications.
  • Speech and Audio Processing: Supports speech recognition, keyword spotting and audio classification on-device.
  • Healthcare Applications: Used in portable medical systems for disease detection and medical image analysis.

Advantages

  • Model Conversion: Easily converts TensorFlow models into optimized TensorFlow Lite models suitable for mobile and edge devices.
  • Minimal Latency: Runs inference on the device itself, avoiding network round trips, which makes it well suited to real-time applications.
  • User-Friendly: Simplifies integration of machine learning models into Android and iOS applications.
  • Offline Inference: Enables on-device execution without internet connectivity, which is useful for remote and mission-critical applications.

Limitations

  • No Training Support: TensorFlow Lite supports only inference; model training and fine-tuning must be done using TensorFlow.
  • Limited Operator Support: Some TensorFlow operations are not supported which may cause issues during model conversion.
  • Complex Model Conversion: Large and advanced models such as transformer-based architectures can be difficult to convert efficiently.
  • Limited Debugging Tools: Debugging and profiling TensorFlow Lite models is more challenging compared to full TensorFlow.
  • Accuracy Trade-offs: Optimization techniques like quantization may lead to a slight reduction in model accuracy.
  • Hardware Dependency: Performance improvements depend on device hardware and supported delegates like GPU or NNAPI.