Ollama

Ollama is an open-source GenAI engine for running the large-language models built upon Llama. It offers an opportunity to run GenAI engine on-premises to avoid data security and privacy concerns with Cloud-based engines.

This guide will show you how to run an ollama server on the HPC cluster serving the codellama model, and use it as a coding assisstent.

Running Ollama server with a slurm job

In the HPC cluster, the program ollama and a model for coding codellama are installed in /opt/ollama.

User can setup the runtime enviroment by loading the ollama module, and start a ollama server using the Slurm job script /opt/ollama/ollama_serve.sh using the commands below:

$ module load ollama
$ sbatch --time=1:00:00 /opt/ollama/ollama_serve.sh

The job requests 1 CPU, 1 GPU and 64GB memory resource to run on, and for 1 hour. Once the job is started, check the output file ollama-gpu-{jobid}.out and find the server endpoint.

The example below shows you the OLLAMA job 46131868 is started and listen on endpoint dccn-c083.dccn.nl:11434.

$ cat ollama-gpu-46131868.out
...
starting OLLAMA to listen on dccn-c083.dccn.nl:11434 ...
...

Note

As the started server is capable for serving multiple clients at the same time, it is possible to use the same server endpoint in multiple clients as long as the client can access the DCCN network (e.g. via eduVPN Trigon full access profile).

Connecting clients to the OLLAMA server

On the access node, open a new terminal and run

$ module load ollama
$ OLLAMA_HOST=dccn-c083.dccn.nl:11434 ollama run codellama

Note

The command above requests the Ollama server to load the model codellama. A list of available models can be found using the command below:

$ OLLAMA_HOST=dccn-c083.dccn.nl:11434 ollama list
NAME                ID              SIZE      MODIFIED
deepseek-r1:14b     ea35dfe18182    9.0 GB    14 hours ago
codellama:13b       9f438cb9cd58    7.4 GB    39 hours ago
llama3.3:latest     a6eb4748fd29    42 GB     39 hours ago
codellama:latest    8fdf8f752f6e    3.8 GB    4 months ago

When selecting a model, pay attention to the SIZE of the model. It should fit in the total GPU memories (the GPU memory is part of the GPU model, it is NOT the --mem of the job request) of the job that starts the Ollama server.

It can take a while when the server is loading the model. Once succeeded, you should get a “chat” prompt back as below:

>>> Send a message (/? for help)

Now you can chat with the model just like using the ChatGPT, for example,

>>> hello

Hey there! How's it going?

>>> who are you?

I am LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a
conversational manner. I am trained on a massive dataset of text from the internet and can answer questions
or provide information on a wide range of topics.

Connecting VSCode to the OLLAMA server

On VSCode IDE, one can install the plugin Continue to connect to the OLLAMA server.

After the installation, the Continue panel is accessible via the Ctrl/Cmd + l key-binding. Before starting using it, one needs to configure the model provider endpoint as follows:

open the Continue configuration file, see this instruction.

given the ollama server endpoint dccn-c083.dccn.nl:11434 and the codellama model, we modify the configuration JSON file accordingly as below:

{
    "models": [
        {
        "title": "Code Llama",
        "provider": "ollama",
        "apiBase": "http://dccn-c083.dccn.nl:11434",
        "model": "codellama"
        }
    ],
    ...
}

After that, in the Continue chat pannel toggled with the Ctrl/Cmd + l key-binding, you can start chatting with the model.