How to run Llama 2 locally on your Mac or PC

If you like the idea of ChatGPT, Google Gemini, Microsoft Copilot, or any other AI assistant but have concerns about privacy or cost, Llama 2 is worth a look. Llama 2 is a freely available large language model developed by Meta, with variants ranging from 7 billion to 70 billion parameters.

Because the model weights are freely available under Meta's community license, you can download the model, modify it, and run it on your own hardware. If you want to give it a try on a Linux, Mac, or Windows machine, here's how.

Requirements

You'll need the following to run Llama 2 locally:

A capable Nvidia GPU (AMD GPUs also work on Linux)
An internet connection
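
Before installing anything, it can help to confirm which GPU the machine actually has. The following is a minimal sketch, assuming a Bash-compatible shell and the nvidia-smi tool that ships with Nvidia's drivers; the gpu_summary function name is made up for illustration. On a Mac or an AMD system it simply reports that no Nvidia driver was found.

```shell
#!/usr/bin/env bash
# Print the GPU name and total VRAM if an Nvidia driver is installed.
# gpu_summary is a hypothetical helper name, not part of any tool.
gpu_summary() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # --query-gpu and --format=csv are standard nvidia-smi options.
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
  else
    echo "nvidia-smi not found: no Nvidia driver detected on this machine"
  fi
}

gpu_summary
```

The memory figure matters most here, since the model has to fit in VRAM to run at full speed.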


How to run Llama 2 on a Mac or Linux using Ollama

If you have a Mac or a Linux machine, you can use Ollama to run Llama 2. It's by far the easiest route on any platform, as it takes minimal setup. All you need is time to download the model, as it's a large file.

Step 1: Download Ollama

The first thing you'll need to do is download Ollama. It runs on Mac and Linux and makes it easy to download and run multiple models, including Llama 2. You can also run it in a Docker container with GPU acceleration if you'd prefer a setup that's easy to configure.

Once Ollama is downloaded, extract it to a folder of your choice and run it.
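
As a rough sketch of the two routes mentioned above, the snippet below checks whether the Ollama command is available and wraps the Docker alternative in a function. It assumes a Bash-compatible shell; the function names check_ollama and start_ollama_docker are hypothetical. The Docker command assumes an Nvidia GPU with the NVIDIA Container Toolkit installed, and it is only defined here, not executed.

```shell
#!/usr/bin/env bash
# Start Ollama in Docker with Nvidia GPU access.
# Assumes Docker plus the NVIDIA Container Toolkit; call this manually if wanted.
start_ollama_docker() {
  docker run -d --gpus=all \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama ollama/ollama
}

# Report whether the native Ollama CLI is on the PATH.
check_ollama() {
  if command -v ollama >/dev/null 2>&1; then
    ollama --version
  else
    echo "ollama is not installed or not on PATH"
  fi
}

check_ollama
```

If you go the Docker route, the model commands in the next steps would be run inside the container, for example with docker exec -it ollama followed by the ollama command.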

Step 2: Download the Llama 2 model

Once Ollama is installed, run the following command to pull the 13 billion parameter Llama 2 model.

ollama pull llama2:13b

This may take a while, so give it time to run. It's a 7.4GB file and may be slow on some connections.
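
To confirm that the download finished, you can look for the model in the output of ollama list. The helper below is a small sketch under the assumption that the first column of that output is the model name; model_present is a made-up function name.

```shell
#!/usr/bin/env bash
# Succeed if the given model tag appears in the first column of `ollama list`.
# model_present is a hypothetical helper for illustration.
model_present() {
  ollama list 2>/dev/null |
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

if model_present "llama2:13b"; then
  echo "llama2:13b is downloaded and ready"
else
  echo "llama2:13b is not available yet"
fi
```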

Step 3: Run Llama 2 and interact with it


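Next, run the following command to launch and interact with the model you just pulled. Use the same llama2:13b tag as in the previous step, as plain llama2 refers to the default tag and would trigger a separate download.

ollama run llama2:13b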

This launches the model and drops you into an interactive prompt where you can chat with it. You're done!
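
Beyond the interactive prompt, Ollama also serves a local HTTP API, by default on port 11434, which is handy for scripting. The sketch below assumes the Ollama server is running and that the prompt contains no quotes or backslashes, since it does no JSON escaping; build_payload and ask are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Build the JSON body for Ollama's /api/generate endpoint.
# Assumption: model and prompt contain no characters that need JSON escaping.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# Send a single prompt to the local Ollama server and print the raw JSON reply.
ask() {
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}

# Example (requires the Ollama server to be running):
# ask "llama2:13b" "Why is the sky blue?"
```

With "stream" set to false, the reply arrives as one JSON object, and the generated text is in its "response" field.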

How to run Llama 2 on Windows using a web GUI

If you're using a Windows machine, there's no need to fret: it's just as easy to set up, though it takes a few more steps. You clone a GitHub repository and run it locally, and that's all there is to it.

Step 1: Download and run the Llama 2 Web GUI

If you're familiar with running Stable Diffusion locally through a web GUI, this is basically the same idea. oobabooga's text generation Web UI GitHub repository is inspired by that project and works in much the same way.

Download oobabooga's text generation Web UI repository from GitHub
Run start_windows.bat, start_linux.sh, or start_macos.sh depending on what platform you're using
Select your GPU and allow it to install everything that it needs
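
The steps above can be sketched from a terminal as follows. This assumes a Bash-compatible shell (Git Bash or similar on Windows); start_script_for is a made-up helper that maps the operating system name to the launcher scripts named above. The repository URL in the comments is an assumption, as the original link is not reproduced here.

```shell
#!/usr/bin/env bash
# Map the output of `uname -s` to the matching launcher script.
# start_script_for is a hypothetical helper for illustration.
start_script_for() {
  case "$1" in
    Linux*)               echo "start_linux.sh" ;;
    Darwin*)              echo "start_macos.sh" ;;
    MINGW*|MSYS*|CYGWIN*) echo "start_windows.bat" ;;
    *)                    echo "unknown" ;;
  esac
}

echo "Launcher for this machine: $(start_script_for "$(uname -s)")"

# Typical sequence, left commented out (the URL is an assumption):
# git clone https://github.com/oobabooga/text-generation-webui
# cd text-generation-webui
# ./start_linux.sh
```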

Step 2: Access the Llama 2 Web GUI

Once the installer finishes, the terminal window shows a local IP address for the web GUI. Open it in your browser and you should see the web GUI.

Click around and familiarize yourself with the UI. The first thing you'll see is a chat window, but it won't work until you load a model.

Step 3: Load a Llama 2 model

Now you'll need to load a model. This will take some time, as the model has to be downloaded first, but you can do that from inside the web GUI.

Click the Model tab at the top
On the right, enter TheBloke/Llama-2-13B-chat-GPTQ and click Download
While it downloads, you should see a progress bar in your command prompt as it fetches the relevant files.
When it finishes, refresh the model list on the left and click the downloaded model.
Click Load, making sure that the model loader says GPTQ-for-LLaMa

It may take a moment to load, as these models require a lot of VRAM.
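
To get a feel for how much VRAM is involved, here is a back-of-the-envelope estimate of the weight size alone. It is a sketch that assumes 4-bit quantized weights, which is what GPTQ builds of this model typically use; real usage is higher, since the context cache and runtime overhead come on top. estimate_gb is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Estimate weight size in GB: parameters (in billions) * bits per weight / 8.
# This counts weights only, not the context cache or runtime overhead.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

for size in 7 13 70; do
  echo "${size}B at 4-bit: about $(estimate_gb "$size" 4) GB of weights"
done
# Prints 3.5, 6.5, and 35.0 GB for the 7B, 13B, and 70B variants respectively.
```

By this estimate, the 13 billion parameter model needs roughly 6.5GB for its weights, which leaves headroom on a 16GB card, while the 70 billion parameter variant would not fit on one.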

Step 4: Interact with Llama 2!

All going well, you should now have Llama 2 running on your PC! You can interact with it through your browser without an internet connection, so long as you have the hardware needed to run it. On my RTX 4080 with 16GB of VRAM, it generates nearly 20 tokens per second, which is significantly faster than you'll get on most free plans for hosted LLMs such as ChatGPT.

You could also try LM Studio, which offers pre-built Llama 2 models.