llama.cpp
LMQL also supports llama.cpp as an inference backend, which can run efficiently on CPU-only and mixed CPU/GPU systems.
Before using llama.cpp models, make sure you have installed its Python bindings via pip install llama-cpp-python in the same environment as LMQL. You also need either the sentencepiece or the transformers package installed for tokenization.
Also make sure your model weights have first been converted to the .bin format, as described in the llama.cpp documentation.
Just like Transformers models, you can load llama.cpp models either locally or via a long-lived lmql serve-model inference server.
Model Server
To start a llama.cpp model server, you can run the following command:
lmql serve-model llama.cpp:<PATH TO WEIGHTS>.bin
This will launch an LMTP inference endpoint on localhost:8080, which can be used from LMQL with a query program like this:
argmax
    "Say 'this is a test':[RESPONSE]"
from
    "llama.cpp:<PATH TO WEIGHTS>.bin"
where
    len(TOKENS(RESPONSE)) < 10
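The same query can also be issued from Python code against the running server. The following is a minimal sketch assuming the @lmql.query decorator form of LMQL's Python API; the model identifier must match the one used when starting the server.

import lmql

# Sketch: query the model served by `lmql serve-model` (LMTP endpoint on localhost:8080).
@lmql.query(model="llama.cpp:<PATH TO WEIGHTS>.bin")
def say_test():
    '''lmql
    "Say 'this is a test':[RESPONSE]" where len(TOKENS(RESPONSE)) < 10
    return RESPONSE
    '''

if __name__ == "__main__":
    # when called from synchronous code, the query runs to completion and returns RESPONSE
    print(say_test())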
Running Without a Model Server
To load the llama.cpp model directly in the Python process that executes your LMQL query, use the following syntax:
argmax
    "Say 'this is a test':[RESPONSE]"
from
    "local:llama.cpp:<PATH TO WEIGHTS>.bin"
where
    len(TOKENS(RESPONSE)) < 10
Configuring the Llama(...) instance
Any parameters passed to lmql serve-model and, when running locally, to lmql.model(...) will be passed to the Llama(...) constructor.
For example, to configure the Llama(...) instance to use an n_ctx value of 1024, run:
lmql serve-model llama.cpp:<PATH TO WEIGHTS>.bin --n_ctx 1024
Or, when running locally, you can use lmql.model("local:llama.cpp:<PATH TO WEIGHTS>.bin", n_ctx=1024).
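The same constructor arguments can also be forwarded programmatically when embedding the query in Python. This is a minimal sketch assuming the @lmql.query decorator accepts an lmql.model(...) handle; n_gpu_layers is included purely as an illustrative example of another Llama(...) constructor argument.

import lmql

# Sketch: load the weights in-process ("local:" prefix) and forward
# constructor arguments (n_ctx, and n_gpu_layers as an example) to Llama(...)
m = lmql.model("local:llama.cpp:<PATH TO WEIGHTS>.bin", n_ctx=1024, n_gpu_layers=32)

@lmql.query(model=m)
def say_test():
    '''lmql
    "Say 'this is a test':[RESPONSE]" where len(TOKENS(RESPONSE)) < 10
    return RESPONSE
    '''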