How to Install PySpark in Kaggle

Last Updated : 14 Mar, 2026

PySpark is a Python API for distributed data processing built on Apache Spark, an open-source big data framework maintained by the Apache Software Foundation. It enables fast processing of large datasets using parallel computing. Running PySpark in Kaggle notebooks allows cloud-based big data analysis without complex local setup.

Why PySpark in Kaggle Matters:

  • Helps practice real big data processing in a cloud-based environment
  • Useful for data science competitions and large dataset analysis
  • Saves time by avoiding local Spark installation and configuration
  • Enables scalable data processing even on limited local hardware

Prerequisites:

Before proceeding, ensure:

  • Active Kaggle account with notebook access
  • Basic Python knowledge
  • Basic understanding of distributed computing (optional but helpful)

How to Install PySpark via Kaggle Notebook

To install PySpark in Kaggle, follow these simple steps:

Step 1: Open New Kaggle Notebook

  • Sign in to your Kaggle account
  • Click Create → New Notebook
  • Wait for the notebook environment to start
Figure: Create new notebook

Step 2: Install PySpark

In the first cell of the Kaggle notebook, enter the following command to install PySpark. The leading ! tells the notebook to run the line as a shell command rather than as Python code. Make sure the notebook has internet access.

!pip install pyspark

Then:

  • Press Shift + Enter OR
  • Click Run Cell

This installs PySpark in the notebook environment.

Figure: Start the execution
Figure: Installation successful
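
As an optional check, the sketch below looks up whether a package is importable and which version is installed, using only the Python standard library. The helper name installed_version is made up for illustration and is not part of PySpark or Kaggle.

```python
import importlib.metadata
import importlib.util


def installed_version(package):
    """Return the installed version of `package`, or None if it cannot be imported."""
    # find_spec locates a top-level module without importing it.
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this exact name.
        return "unknown"


version = installed_version("pyspark")
if version is None:
    print("PySpark is not installed; run the install cell first.")
else:
    print(f"PySpark {version} is available.")
```

Running this in a cell after the install step should report the installed PySpark version; before installation it reports that PySpark is missing.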

Step 3: Import PySpark

  • After installation, import it:

import pyspark

Step 4: Initialize Spark Session

Create a Spark session:

Python
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
        .master("local") \
        .appName("MyApp") \
        .getOrCreate()

# Verify Spark Session
print(spark.version)

This starts a local Spark session inside Kaggle and prints the Spark version.

Output:

Figure: Initializing the Spark session
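
The master setting controls how many worker threads the local session uses: "local" runs Spark with a single thread, "local[N]" uses N threads, and "local[*]" uses as many threads as there are logical cores. The following is a minimal sketch of a session builder that makes this choice explicit. The helper names local_master and build_session are hypothetical and chosen only for this example.

```python
def local_master(threads=None):
    """Build a Spark master URL for local mode.

    threads=None gives "local[*]" (one worker thread per logical core);
    threads=N gives "local[N]".
    """
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return f"local[{threads}]"


def build_session(app_name="MyApp", threads=None):
    """Create (or reuse) a local SparkSession; requires PySpark to be installed."""
    # Imported here so local_master() works even before PySpark is installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(threads))
        .appName(app_name)
        .getOrCreate()
    )


print(local_master())   # local[*]
print(local_master(2))  # local[2]
```

In a notebook, calling spark = build_session() returns a session that uses all available cores, and spark.stop() releases its resources when the work is finished.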

Step 5: Verify the PySpark Installation

To confirm that PySpark is correctly installed, run one of the simple checks below.

  • Check Spark Version: This is the easiest way to verify that PySpark is installed. To check the version, run the following code.

pyspark.__version__

Figure: Verifying the PySpark Installation

  • Simple PySpark Example: Another way to verify the installation is to run a small example that uses the Spark session created in Step 4. Try this sample code to confirm that everything works.
Python
data = [('Rahul', 25), ('Aman', 30), ('Ravi', 28)]
columns = ['Name', 'Age']

df = spark.createDataFrame(data, columns)
df.show()

Output:

Figure: PySpark example to ensure installation

This code creates a PySpark DataFrame and displays it. If the table of names and ages appears, PySpark is installed and running correctly in your Kaggle notebook.

Troubleshooting Common Issues

Some issues may come up while installing PySpark in Kaggle, but most of them are easy to troubleshoot and fix. Common ones are listed below:

  • Installation Failure: If the installation fails, check that the notebook has internet access and that the command is typed correctly. Try enabling the 'Turn on internet' option in the notebook settings.
Figure: Turn on internet option in the notebook settings
  • Spark session not launching: If Spark does not start, restart the notebook session from the editor menu and run the cells again. Also check that the installed PySpark version is compatible with the notebook's Python version.
  • Memory Limit Issues: Kaggle notebooks have resource limits. Possible solutions:
    • Use smaller datasets
    • Process data in batches
    • Cache only required data
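
The "process data in batches" suggestion can be sketched with a small helper that splits any list of inputs into fixed-size groups, so only one group is loaded at a time. The helper name batches and the file names are hypothetical, used only for illustration.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical file names, used only for illustration.
files = [f"part-{i:03d}.csv" for i in range(5)]

for group in batches(files, 2):
    # In a notebook, each group could be passed to spark.read.csv(group, header=True),
    # processed, and written out before the next group is loaded.
    print(group)
```

With five files and a batch size of 2, the loop handles two files, then two more, then the last one. For caching, one approach is to select only the columns you need before calling cache() on a DataFrame, so less data is held in memory.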