Run Llama3 with Ollama on Google Colab (T4) Free Tier

0𝕏koji

2 min read · May 16, 2024

Ollama

Get up and running with large language models.

ollama.com

Llama3

Meta Llama 3

Build the future of AI with Meta Llama 3. Now available with both 8B and 70B pretrained and instruction-tuned versions…

llama.meta.com

In this post, we will run Llama3 with Ollama on the Google Colab free tier (T4 GPU) and use it as an API server.

Step 1. Setup Ollama

First, switch the Colab runtime to a T4 GPU (Runtime > Change runtime type, then choose T4 GPU).
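Before installing anything, it can help to confirm that the runtime really has a GPU attached. The cell below is a minimal sketch, not part of the original notebook: `gpu_name` is a hypothetical helper that asks `nvidia-smi` for the GPU name and returns `None` when no GPU is visible.

```python
import shutil
import subprocess


def gpu_name():
    """Return the GPU name reported by nvidia-smi, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not on PATH, so this is probably a CPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    name = result.stdout.strip()
    return name if result.returncode == 0 and name else None


print(gpu_name())
```

On a T4 runtime the printed name should mention T4. If it prints `None`, change the runtime type before continuing.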

GoogleColab/ollama.ipynb at main · koji/GoogleColab

github.com

!curl -fsSL https://ollama.com/install.sh | sh

!echo 'debconf debconf/frontend select Noninteractive' | sudo debconf-set-selections
!sudo apt-get update && sudo apt-get install -y cuda-drivers

import os
# Set LD_LIBRARY_PATH so Ollama can find the system NVIDIA libraries
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})
# Listen on all interfaces instead of only localhost
os.environ.update({'OLLAMA_HOST': '0.0.0.0'})

!wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
!dpkg -i cloudflared-linux-amd64.deb

import subprocess
import threading
import time
import socket

def iframe_thread(port):
    # Wait until the Ollama server accepts connections on the given port.
    while True:
        time.sleep(0.5)
        # The with-block closes the socket on every iteration, including the last one.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            if sock.connect_ex(('127.0.0.1', port)) == 0:
                break

    # Open a Cloudflare tunnel to the local server and print its public URL.
    p = subprocess.Popen(["cloudflared", "tunnel", "--url", f"http://127.0.0.1:{port}"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for line in p.stderr:
        text = line.decode()
        if "trycloudflare.com " in text:
            print("\n\n\n\n\n")
            print("running ollama server\n\n", text[text.find("http"):], end='')
            print("\n\n\n\n\n")

threading.Thread(target=iframe_thread, daemon=True, args=(11434,)).start()

!ollama serve

When Step 1 runs successfully, the output includes a trycloudflare.com URL like the one at the end of the following log.

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFV6YKzw4J7jqYfTNrwkB6h41sdeGDY644nCHUJCPqor

2024/05/16 05:00:08 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-16T05:00:08.999Z level=INFO source=images.go:704 msg="total blobs: 0"
time=2024-05-16T05:00:08.999Z level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-16T05:00:09.000Z level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"
time=2024-05-16T05:00:09.000Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2437403823/runners






running ollama server

 https://dakota-tape-figures-facial.trycloudflare.com                                      |
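The URL is printed with padding and a trailing `|` from the cloudflared log box, so copying it by hand is easy to get wrong. As a minimal sketch, a regular expression can pull out just the URL. `extract_tunnel_url` is a hypothetical helper, and the pattern assumes the tunnel hostname consists only of lowercase letters, digits, and hyphens, as in the example above.

```python
import re

# Assumed shape of a tunnel URL: https://<words-and-hyphens>.trycloudflare.com
TUNNEL_URL = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def extract_tunnel_url(log_line):
    """Return the first trycloudflare.com URL in a log line, or None."""
    match = TUNNEL_URL.search(log_line)
    return match.group(0) if match else None


line = " https://dakota-tape-figures-facial.trycloudflare.com                                      |"
print(extract_tunnel_url(line))
# prints https://dakota-tape-figures-facial.trycloudflare.com
```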

Step 2. Install Llama3

This step takes a while because the model has to be downloaded.
To try a different model, browse the Ollama model library.
https://ollama.com/library

In the commands below, replace url_you_got with the trycloudflare.com URL from Step 1.

curl url_you_got/api/pull -d '{ "name": "llama3" }'
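The same pull request can be sent from Python. The following is a minimal sketch using only the standard library. It posts the same JSON body as the curl command and assumes the server replies with one JSON status object per line, which is how the progress output of the curl command appears. The names `build_pull_request`, `iter_status`, and `pull_model` are hypothetical helpers.

```python
import json
import urllib.request


def build_pull_request(base_url, model="llama3"):
    """Build a POST request for /api/pull with the same body as the curl command."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/pull",
        data=json.dumps({"name": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_status(lines):
    """Decode each non-empty line as one JSON status object."""
    for raw in lines:
        if raw.strip():
            yield json.loads(raw)


def pull_model(base_url, model="llama3"):
    """Send the request and print each status update as it arrives."""
    with urllib.request.urlopen(build_pull_request(base_url, model)) as resp:
        for update in iter_status(resp):
            print(update.get("status"))


# Example call; needs the live tunnel URL from Step 1:
# pull_model("url_you_got")
```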

Step 3. Test prompt

curl url_you_got/api/chat -d '{ "model": "llama3", "stream": false, "messages": [{ "role": "user", "content": "What is LLM?" }]}'

If you have jq installed, you can print just the reply text:

curl url_you_got/api/chat -d '{ "model": "llama3", "stream": false, "messages": [{ "role": "user", "content": "What is LLM?" }]}' | jq ".message.content"
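If you would rather call the API from a script, here is a minimal Python sketch of the same chat request. It sends the same JSON body as the curl command and reads `message.content` from the reply, like the jq filter does. `build_chat_request`, `extract_reply`, and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, model="llama3"):
    """Build a non-streaming POST request for /api/chat."""
    payload = {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return message.content from the JSON reply (same as jq ".message.content")."""
    return json.loads(response_body)["message"]["content"]


def ask(base_url, prompt, model="llama3"):
    """Send the prompt and return the model's reply text."""
    with urllib.request.urlopen(build_chat_request(base_url, prompt, model)) as resp:
        return extract_reply(resp.read())


# Example call; needs the live tunnel URL from Step 1:
# print(ask("url_you_got", "What is LLM?"))
```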