How to Create an OpenAI-Compatible Wrapper for Ollama
As large language models become more integral to modern applications, developers often face the challenge of switching between providers like OpenAI and local solutions like Ollama. Fortunately, the OpenAI Python client has become a de facto standard for interacting with chat models, offering powerful features, clean syntax, and extensive community support.
By creating a wrapper that mimics the OpenAI API, developers can keep this standard interface while gaining the flexibility to swap between OpenAI and Ollama models simply by changing the base URL and API key. This abstraction not only simplifies model management but also enables rapid prototyping and seamless integration across environments such as Jetson devices, cloud servers, and local desktops.
Installing Ollama
To get started with Ollama, install it using the following command:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, you can pull a model. For example, to pull the llama3.2 model:
ollama pull llama3.2
You can find a complete list of available models at https://ollama.com/search.
To test a model interactively, try the following:
ollama run llama3.2
Installing OpenAI Python Client
Install the official OpenAI Python client with pip:
pip install openai
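Before building the wrapper, it can help to confirm that the OpenAI client can actually reach the local Ollama server. The snippet below is only a quick sanity check, assuming Ollama is running on its default port and exposes the OpenAI-compatible /v1/models listing endpoint; it should print the models you have pulled.

from openai import OpenAI

# Point the client at the local Ollama server; a key is required by the
# client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="fake-key")

# List the models the server currently knows about.
for model in client.models.list():
    print(model.id)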
Creating the Ollama Wrapper Using the OpenAI Client
We can now create a class that wraps the OpenAI client but points to the local Ollama server. This allows you to interact with Ollama models as if they were OpenAI models.
from openai import OpenAI

class ChatModel:
    def __init__(self, base_url, key):
        # The OpenAI client works with any OpenAI-compatible endpoint.
        self.client = OpenAI(
            base_url=base_url,
            api_key=key,
        )

    def chat_completion(self, model, messages):
        response = self.client.chat.completions.create(
            model=model,
            messages=messages
        )
        return response

BASE_URL = "http://localhost:11434/v1"  # Default Ollama local URL
chatModel = ChatModel(base_url=BASE_URL, key="fake-key")  # A key is required by the client but unused by Ollama

messages = [
    {"role": "system", "content": "You are a Jetson-based assistant."},
    {"role": "user", "content": "How can I optimize GPU usage on Jetson Nano?"},
    {"role": "assistant", "content": "Use TensorRT for inference and disable unused services."},
    {"role": "user", "content": "Got it, thanks!"}
]

response = chatModel.chat_completion(model="llama3.2", messages=messages)
print(response.choices[0].message.content)
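Because the response follows the OpenAI schema, continuing the conversation is just a matter of appending the assistant's reply to the message list and calling the wrapper again. A minimal sketch (the follow-up question is only an example):

# Keep the full conversation context by appending the assistant's reply.
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({"role": "user", "content": "Which power mode should I use?"})

follow_up = chatModel.chat_completion(model="llama3.2", messages=messages)
print(follow_up.choices[0].message.content)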
Using the Same Wrapper with OpenAI Models
To switch to OpenAI's hosted models, you only need to change the base URL and provide your actual OpenAI API key:
from openai import OpenAI

class ChatModel:
    def __init__(self, base_url, key):
        self.client = OpenAI(
            base_url=base_url,
            api_key=key,
        )

    def chat_completion(self, model, messages):
        response = self.client.chat.completions.create(
            model=model,
            messages=messages
        )
        return response

BASE_URL = "https://api.openai.com/v1"  # Default OpenAI API URL
chatModel = ChatModel(base_url=BASE_URL, key="your-openai-key")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The LA Dodgers won in 2020."},
    {"role": "user", "content": "Where was it played?"}
]

response = chatModel.chat_completion(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
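Because both backends share the same wrapper, a convenient pattern is to read the base URL, key, and model name from environment variables, so the same code runs against Ollama or OpenAI without any source changes. A rough sketch, using placeholder variable names (CHAT_BASE_URL, CHAT_API_KEY, and CHAT_MODEL are assumptions, not a convention of either API):

import os

# Fall back to the local Ollama defaults when the variables are not set.
BASE_URL = os.environ.get("CHAT_BASE_URL", "http://localhost:11434/v1")
API_KEY = os.environ.get("CHAT_API_KEY", "fake-key")
MODEL = os.environ.get("CHAT_MODEL", "llama3.2")

chatModel = ChatModel(base_url=BASE_URL, key=API_KEY)
response = chatModel.chat_completion(model=MODEL, messages=messages)
print(response.choices[0].message.content)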
Setting Model Parameters
The chat.completions.create method also supports several parameters that let you fine-tune model behavior. Some useful ones include:
max_completion_tokens: Limits the number of tokens in the completion.
temperature: Controls randomness (higher is more random).
top_p: An alternative to temperature, using nucleus sampling.
stop: Specifies sequences where the model should stop generating.
Example (this version of the chat_completion method goes inside the ChatModel class):

def chat_completion(self, model, messages):
    response = self.client.chat.completions.create(
        model=model,
        messages=messages,
        max_completion_tokens=200,
        temperature=0.7
    )
    return response
You can find a full list of parameters in the OpenAI API reference: https://platform.openai.com/docs/api-reference/chat/create
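Because the useful parameters vary by model and backend, an alternative is to have the wrapper forward arbitrary keyword arguments instead of hard-coding them. The following is only a sketch of such a chat_completion variant, not part of the original wrapper:

def chat_completion(self, model, messages, **kwargs):
    # Any extra keyword arguments (temperature, top_p, stop, ...) are
    # passed straight through to the underlying API call.
    response = self.client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs
    )
    return response

With this variant, callers can set parameters per request:

response = chatModel.chat_completion(
    model="llama3.2",
    messages=messages,
    temperature=0.7,
    max_completion_tokens=200
)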