| title | Cactus Python Package |
|---|---|
| description | Python package and ctypes bindings for the Cactus on-device AI inference engine. |
Python bindings for Cactus Engine via FFI. They are installed automatically when you run `source ./setup`.

Model bundles: pre-built runtime bundles for all supported models are available at huggingface.co/Cactus-Compute.
```bash
git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
cactus build --python
```

```bash
# Download pre-built bundles (defaults to the generic CPU variant)
cactus download LiquidAI/LFM2-VL-450M
cactus download openai/whisper-small --platform apple  # CoreML/NPU variant

# Optional: set your Cactus Cloud API key for automatic cloud fallback
cactus auth
```

```python
import json

from cactus import ensure_model
from cactus import cactus_init, cactus_complete, cactus_destroy

# Downloads the pre-built bundle from HuggingFace if not already present
bundle = ensure_model("LiquidAI/LFM2-VL-450M")
model = cactus_init(str(bundle), None, False)

messages = json.dumps([{"role": "user", "content": "What is 2+2?"}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])

cactus_destroy(model)
```

All functions are module-level and mirror the C FFI directly. Handles are plain `int` values (C pointers).
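Because handles are raw C pointers, nothing releases them automatically if an exception interrupts your code. A minimal sketch of a context manager that pairs any create/destroy functions; `managed_handle` is a hypothetical helper, not part of the package, and the usage comment assumes you pass `cactus_init` and `cactus_destroy`:

```python
from __future__ import annotations

from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def managed_handle(create: Callable[[], int], destroy: Callable[[int], None]) -> Iterator[int]:
    """Yield a handle from create() and always release it with destroy()."""
    handle = create()
    try:
        yield handle
    finally:
        # Runs on normal exit and when the with-body raises.
        destroy(handle)


# Usage with the bindings (hypothetical helper, not executed here):
#   with managed_handle(lambda: cactus_init(str(bundle), None, False), cactus_destroy) as model:
#       result = cactus_complete(model, messages, None, None, None)
```

The same helper works for index handles by passing `cactus_index_init` and `cactus_index_destroy`.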
Download pre-built bundles programmatically (no CLI needed):
```python
from cactus import ensure_model, get_bundle_dir

# ensure_model downloads the pre-built bundle if missing and returns its Path
bundle = ensure_model("openai/whisper-tiny")

# Or resolve the expected on-disk location explicitly
bundle_dir = get_bundle_dir("openai/whisper-tiny", bits=4, platform=None)
# -> Path("transpiled/whisper-tiny-cq4") (or `-cq4-apple` with platform="apple")
```

```python
model = cactus_init(model_path: str, corpus_dir: str | None, cache_index: bool) -> int
cactus_destroy(model: int)
cactus_reset(model: int)   # clear KV cache
cactus_stop(model: int)    # abort ongoing generation
cactus_get_last_error() -> str | None
```

`cactus_complete` returns a dict with `success`, `error`, `cloud_handoff`, `response`, an optional `thinking` field (present only when the model emits chain-of-thought content; it is placed before `function_calls`), `function_calls`, `segments` (always `[]` for completion; populated only in transcription responses), `confidence`, timing and memory stats (`time_to_first_token_ms`, `total_time_ms`, `prefill_tps`, `decode_tps`, `ram_usage_mb`), and token counts (`prefill_tokens`, `decode_tokens`, `total_tokens`).
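A small sketch for consuming that dict. `check_completion` and `perf_summary` are hypothetical helpers (not part of the package) that read only the fields listed above; the `success` check is purely defensive, since the bindings already raise `RuntimeError` on failure. The sample values are copied from the `cactus_complete` response-format example in this document.

```python
from __future__ import annotations

from typing import Any


def check_completion(result: dict[str, Any]) -> tuple[str, str | None]:
    """Return (response, thinking); raise if the result reports failure."""
    if not result.get("success", False):
        raise RuntimeError(result.get("error") or "completion failed")
    # "thinking" is only present when the model emitted chain-of-thought content.
    return result["response"], result.get("thinking")


def perf_summary(result: dict[str, Any]) -> str:
    """One-line summary of the timing, memory and token-count fields."""
    return (
        f"ttft={result['time_to_first_token_ms']:.1f}ms "
        f"total={result['total_time_ms']:.1f}ms "
        f"prefill={result['prefill_tokens']}tok@{result['prefill_tps']:.1f}/s "
        f"decode={result['decode_tokens']}tok@{result['decode_tps']:.1f}/s "
        f"ram={result['ram_usage_mb']:.1f}MB"
    )


sample = {
    "success": True, "error": None, "cloud_handoff": False, "response": "4",
    "function_calls": [], "segments": [], "confidence": 0.92,
    "time_to_first_token_ms": 45.2, "total_time_ms": 163.7,
    "prefill_tps": 619.5, "decode_tps": 168.4, "ram_usage_mb": 512.3,
    "prefill_tokens": 28, "decode_tokens": 12, "total_tokens": 40,
}
print(check_completion(sample))  # ('4', None)
print(perf_summary(sample))
```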
```python
result = cactus_complete(
    model: int,
    messages_json: str,                           # JSON array of {role, content}
    options_json: str | None,                     # optional inference options
    tools_json: str | None,                       # optional tool definitions
    callback: Callable[[str, int], None] | None,  # streaming token callback
    pcm_data: list[int] | None = None             # optional raw audio bytes
) -> dict
```

```python
# With options and streaming
options = json.dumps({"max_tokens": 256, "temperature": 0.7})

def on_token(token: str, token_id: int) -> None:
    print(token, end="", flush=True)

result = cactus_complete(model, messages, options, None, on_token)
if result["cloud_handoff"]:
    # response already contains the cloud result
    pass
```

Response format:
```json
{
  "success": true,
  "error": null,
  "cloud_handoff": false,
  "response": "4",
  "function_calls": [],
  "segments": [],
  "confidence": 0.92,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 163.7,
  "prefill_tps": 619.5,
  "decode_tps": 168.4,
  "ram_usage_mb": 512.3,
  "prefill_tokens": 28,
  "decode_tokens": 12,
  "total_tokens": 40
}
```

`cactus_prefill` pre-processes input text and populates the KV cache without generating output tokens. This reduces latency for subsequent calls to `cactus_complete`.
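In the prefill example that follows, the messages passed to `cactus_complete` repeat the prefilled conversation and append one new user turn. A minimal sketch that builds the second list from the first so the shared prefix cannot drift; `extend_messages` is a hypothetical helper. Assumption: this document does not state how the engine matches the cached prefix against later messages, so keeping the prefix identical, as the example does, is the conservative choice.

```python
from __future__ import annotations

import json
from typing import Any


def extend_messages(prefilled_json: str, *new_messages: dict[str, Any]) -> str:
    """Return a messages_json string that keeps the prefilled turns and appends new ones."""
    messages = json.loads(prefilled_json)
    if not isinstance(messages, list):
        raise ValueError("messages_json must be a JSON array")
    return json.dumps(messages + list(new_messages))


prefilled = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "It's sunny in Paris!"},
])
completion_messages = extend_messages(prefilled, {"role": "user", "content": "What about SF?"})
print(completion_messages)
```

With the bindings, `prefilled` would go to `cactus_prefill` and `completion_messages` to `cactus_complete`.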
```python
cactus_prefill(
    model: int,
    messages_json: str,                # JSON array of {role, content}
    options_json: str | None,          # optional inference options
    tools_json: str | None,            # optional tool definitions
    pcm_data: list[int] | None = None  # optional raw audio bytes
) -> None
```

```python
tools = json.dumps([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City, State, Country"}
            },
            "required": ["location"]
        }
    }
}])

messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"}
])
cactus_prefill(model, messages, None, tools)

completion_messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"},
    {"role": "user", "content": "What about SF?"}
])
result = cactus_complete(model, completion_messages, None, tools, None)
```

Response format:
```json
{
  "success": true,
  "error": null,
  "prefill_tokens": 25,
  "prefill_tps": 166.1,
  "total_time_ms": 150.5,
  "ram_usage_mb": 245.67
}
```

`cactus_transcribe` returns a dict with the `response` field (transcribed text), the `segments` array, and other metadata. Each segment has the form `{"start": <sec>, "end": <sec>, "text": "<str>"}`, and its granularity depends on the model:

- Whisper: phrase-level, from timestamp tokens
- Parakeet TDT: word-level, from frame timing
- Parakeet CTC and Moonshine: one segment per transcription window (consecutive VAD speech regions up to 30 s)
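Since every segment carries `start`, `end` and `text`, the `segments` array maps directly onto subtitle formats. A minimal sketch that renders segments as SRT; `srt_timestamp` and `segments_to_srt` are hypothetical helpers, and stripping surrounding whitespace from `text` is a choice of this sketch:

```python
from __future__ import annotations

from typing import Any, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[dict[str, Any]]) -> str:
    """Convert [{"start", "end", "text"}, ...] segments into an SRT document."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # SRT separates cues with a blank line.
    return "\n".join(blocks)


print(segments_to_srt([
    {"start": 0.0, "end": 1.5, "text": " Hello."},
    {"start": 1.5, "end": 3.25, "text": " World."},
]))
```

With the bindings, pass `result["segments"]` from `cactus_transcribe`.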
```python
result = cactus_transcribe(
    model: int,
    audio_path: str | None,
    prompt: str | None,
    options_json: str | None,
    callback: Callable[[str, int], None] | None,
    pcm_data: list[int] | bytes | None
) -> dict
```

Custom vocabulary biases the decoder toward domain-specific words (supported for Whisper and Moonshine models). Pass `custom_vocabulary` and `vocabulary_boost` in `options_json`:
```python
options = json.dumps({
    "custom_vocabulary": ["Omeprazole", "HIPAA", "Cactus"],
    "vocabulary_boost": 3.0
})
result = cactus_transcribe(model, "medical_notes.wav", None, options, None, None)
```

```python
result = cactus_transcribe(model, "/path/to/audio.wav", None, None, None, None)
print(result["response"])
for seg in result["segments"]:
    print(f"[{seg['start']:.3f}s - {seg['end']:.3f}s] {seg['text']}")
```

```python
embedding = cactus_embed(model: int, text: str, normalize: bool) -> list[float]
embedding = cactus_image_embed(model: int, image_path: str) -> list[float]
embedding = cactus_audio_embed(model: int, audio_path: str) -> list[float]
```

```python
tokens = cactus_tokenize(model: int, text: str) -> list[int]
result = cactus_score_window(model: int, tokens: list[int], start: int, end: int, context: int) -> dict
```

```python
result = cactus_rag_query(model: int, query: str, top_k: int) -> dict
```

Returns a dict with a `chunks` array. Each chunk has `score` (float), `source` (str, from document metadata), and `content` (str):
```json
{
  "chunks": [
    {"score": 0.0142, "source": "doc.txt", "content": "relevant passage..."}
  ]
}
```

```python
index = cactus_index_init(index_dir: str, embedding_dim: int) -> int
cactus_index_add(index: int, ids: list[int], documents: list[str],
                 metadatas: list[str] | None, embeddings: list[list[float]])
cactus_index_delete(index: int, ids: list[int])
result = cactus_index_get(index: int, ids: list[int]) -> dict
result = cactus_index_query(index: int, embedding: list[float], options_json: str | None) -> dict
cactus_index_compact(index: int)
cactus_index_destroy(index: int)
```

`cactus_index_query` returns `{"results": [{"id": <int>, "score": <float>}, ...]}`. `cactus_index_get` returns `{"results": [{"document": "...", "metadata": <str|null>, "embedding": [...]}, ...]}`.
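`cactus_index_add` takes parallel lists, and each embedding must match the `embedding_dim` the index was created with. A minimal sketch that validates a batch before the call and JSON-encodes metadata dicts into the `str` values the signature expects; `prepare_index_batch` is a hypothetical helper, and both the JSON encoding and the uniqueness check are choices of this sketch rather than documented requirements:

```python
from __future__ import annotations

import json
from typing import Any, Sequence


def prepare_index_batch(
    ids: Sequence[int],
    documents: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    embedding_dim: int,
    metadatas: Sequence[dict[str, Any]] | None = None,
) -> tuple[list[int], list[str], list[str] | None, list[list[float]]]:
    """Return (ids, documents, metadatas, embeddings) in cactus_index_add argument order."""
    n = len(ids)
    if len(documents) != n or len(embeddings) != n:
        raise ValueError("ids, documents and embeddings must have the same length")
    if metadatas is not None and len(metadatas) != n:
        raise ValueError("metadatas must have the same length as ids")
    if len(set(ids)) != n:
        raise ValueError("ids must be unique within a batch")
    for doc_id, vec in zip(ids, embeddings):
        if len(vec) != embedding_dim:
            raise ValueError(f"id {doc_id}: embedding has dim {len(vec)}, expected {embedding_dim}")
    meta_strs = None if metadatas is None else [json.dumps(m) for m in metadatas]
    return list(ids), list(documents), meta_strs, [[float(x) for x in v] for v in embeddings]


# Usage with the bindings (not executed here):
#   cactus_index_add(index, *prepare_index_batch(ids, docs, vectors, 384, metas))
```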
```python
cactus_log_set_level(level: int)  # 0=DEBUG 1=INFO 2=WARN (default) 3=ERROR 4=NONE
cactus_log_set_callback(callback: Callable[[int, str, str], None] | None)
```

```python
cactus_set_telemetry_environment(framework: str, cache_location: str | None, version: str | None)
cactus_set_app_id(app_id: str)
cactus_telemetry_flush()
cactus_telemetry_shutdown()
```

Functions that return a value raise `RuntimeError` on failure. `cactus_prefill`, `cactus_index_add`, `cactus_index_delete`, and `cactus_index_compact` also raise `RuntimeError` on failure, even though they return nothing. The remaining void functions never raise: `cactus_destroy`, `cactus_reset`, `cactus_stop`, `cactus_index_destroy`, and the logging and telemetry functions.
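`cactus_log_set_callback` accepts a Python callable taking `(int, str, str)`. A minimal sketch that forwards engine logs into the standard `logging` module using the documented level numbers; `make_log_bridge` is a hypothetical helper. Assumption: the two strings are a tag/component and the message, in that order; this document does not name them, so verify before relying on it.

```python
from __future__ import annotations

import logging
from typing import Callable

# Cactus log levels as documented: 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=NONE.
CACTUS_TO_PY_LEVEL = {
    0: logging.DEBUG,
    1: logging.INFO,
    2: logging.WARNING,
    3: logging.ERROR,
}


def make_log_bridge(logger: logging.Logger) -> Callable[[int, str, str], None]:
    """Return a callback that re-emits Cactus log lines on `logger`."""

    def on_log(level: int, tag: str, message: str) -> None:
        py_level = CACTUS_TO_PY_LEVEL.get(level)
        if py_level is None:  # 4=NONE or an unknown level: drop it
            return
        logger.log(py_level, "[%s] %s", tag, message)

    return on_log


# Usage with the bindings (not executed here):
#   bridge = make_log_bridge(logging.getLogger("cactus"))
#   cactus_log_set_callback(bridge)
```

Keeping a reference such as `bridge` alive while it is registered is a cheap safeguard in case the binding does not hold one itself, since ctypes callbacks must not be garbage-collected while native code can still call them.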
Pass images via an `images` list on the message for vision-language models (LFM2-VL, LFM2.5-VL, Gemma4, Qwen3.5):

```python
messages = json.dumps([{
    "role": "user",
    "content": "Describe this image",
    "images": ["path/to/image.png"]
}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])
```

Pass audio files via an `audio` list on the message for models with native audio understanding (Gemma4):

```python
messages = json.dumps([{
    "role": "user",
    "content": "Transcribe the audio.",
    "audio": ["path/to/audio.wav"]
}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])

# Combined vision + audio
messages = json.dumps([{
    "role": "user",
    "content": "Describe the image and transcribe the audio.",
    "images": ["path/to/image.png"],
    "audio": ["path/to/audio.wav"]
}])
result = cactus_complete(model, messages, None, None, None)
```

The Graph API provides a tensor computation graph for building and executing dataflow pipelines on the Cactus kernel layer:

```python
import numpy as np

from cactus.bindings.cactus import Graph

g = Graph()
a = g.input((2, 2))
b = g.input((2, 2))
y = ((a - b) * (a + b)).abs().pow(2.0).view((4,))

g.set_input(a, np.array([[2, 4], [6, 8]], dtype=np.float16))
g.set_input(b, np.array([[1, 2], [3, 4]], dtype=np.float16))
g.execute()
print(y.numpy())  # elementwise ((a - b) * (a + b))**2 -> 9, 144, 729, 2304
```

Supported ops: `+`, `-`, `*`, `/`, `abs`, `pow`, `view`, `flatten`, `concat`, `cat`, `relu`, `sigmoid`, `tanh`, `gelu`, `softmax`.
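For sanity-checking Graph results, a pure-Python reference of the example expression is handy. A minimal sketch, assuming the arithmetic operators are elementwise and `view((4,))` flattens in row-major order; `graph_example_reference` is a hypothetical helper:

```python
from __future__ import annotations


def graph_example_reference(a: list[list[float]], b: list[list[float]]) -> list[float]:
    """Pure-Python ((a - b) * (a + b)).abs().pow(2.0).view((4,))."""
    out = []
    for row_a, row_b in zip(a, b):          # row-major traversal matches the flat view
        for x, y in zip(row_a, row_b):
            out.append(abs((x - y) * (x + y)) ** 2.0)
    return out


print(graph_example_reference([[2, 4], [6, 8]], [[1, 2], [3, 4]]))
# [9.0, 144.0, 729.0, 2304.0]
```

Compare it against `y.numpy().tolist()`. The graph inputs are float16, so use a tolerance for general inputs.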
Run the full test suite:
```bash
python python/test.py      # compact output
python python/test.py -v   # verbose
```

Tests are in `python/tests/` and cover the bindings, CLI, server, graph, model, transpile, and component partitioning. Add a new `test_*.py` file to extend them.
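A sketch of a new test module, hypothetically `python/tests/test_bundle_dir.py`, that checks the bundle-directory naming documented for `get_bundle_dir`. Assumptions: `python/test.py` collects standard `unittest.TestCase` classes, and `get_bundle_dir` only resolves a path without downloading; the module skips itself when the package cannot be imported.

```python
import unittest

try:
    from cactus import get_bundle_dir
except (ImportError, OSError):  # package missing or native library not built
    get_bundle_dir = None


@unittest.skipIf(get_bundle_dir is None, "cactus Python package not available")
class BundleDirTest(unittest.TestCase):
    def test_generic_bundle_name(self):
        path = get_bundle_dir("openai/whisper-tiny", bits=4, platform=None)
        self.assertEqual(path.name, "whisper-tiny-cq4")

    def test_apple_bundle_name(self):
        path = get_bundle_dir("openai/whisper-tiny", bits=4, platform="apple")
        self.assertEqual(path.name, "whisper-tiny-cq4-apple")
```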
- Cactus Engine API: Full C API reference that the Python bindings wrap
- Cactus Index API: Vector database API for RAG applications
- Fine-tuning Guide: Train and deploy custom LoRA fine-tunes
- Runtime Compatibility: Weight versioning across releases
- Apple Build Step: Builds Apple native artifacts used by bindings
- Android Build Step: Builds Android native artifacts used by bindings
- Swift Bindings: Swift C-module bindings
- Kotlin Bindings: Kotlin/JNI bindings
- Flutter Bindings: Dart FFI bindings
- Rust Bindings: Raw Rust FFI declarations