title: Cactus Python Package
description: Python package and ctypes bindings for the Cactus on-device AI inference engine.
keywords: Python package, Python bindings, on-device AI, Python FFI, embeddings, transcription, RAG

Cactus Python Package

Python bindings for the Cactus Engine via FFI. They are installed automatically when you run source ./setup.

Model bundles: pre-built runtime bundles for all supported models are available at huggingface.co/Cactus-Compute.

Getting Started

git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
cactus build --python

# Download pre-built bundles (defaults to the generic CPU variant)
cactus download LiquidAI/LFM2-VL-450M
cactus download openai/whisper-small --platform apple   # CoreML/NPU variant

# Optional: set your Cactus Cloud API key for automatic cloud fallback
cactus auth

Quick Example

from cactus import ensure_model
from cactus import cactus_init, cactus_complete, cactus_destroy
import json

# Downloads the pre-built bundle from HuggingFace if not already present
bundle = ensure_model("LiquidAI/LFM2-VL-450M")

model = cactus_init(str(bundle), None, False)
messages = json.dumps([{"role": "user", "content": "What is 2+2?"}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])
cactus_destroy(model)

API Reference

All functions are module-level and mirror the C FFI directly. Handles are plain int values (C pointers).

Model Downloads

Download pre-built bundles programmatically (no CLI needed):

from cactus import ensure_model, get_bundle_dir

# ensure_model downloads the pre-built bundle if missing, returns its Path
bundle = ensure_model("openai/whisper-tiny")

# Or resolve the expected on-disk location explicitly
bundle_dir = get_bundle_dir("openai/whisper-tiny", bits=4, platform=None)
# -> Path("transpiled/whisper-tiny-cq4")  (or `-cq4-apple` with platform="apple")

Init / Lifecycle

model = cactus_init(model_path: str, corpus_dir: str | None, cache_index: bool) -> int
cactus_destroy(model: int)
cactus_reset(model: int)   # clear KV cache
cactus_stop(model: int)    # abort ongoing generation
cactus_get_last_error() -> str | None
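
Because handles are plain integers, Python will not release them on its own, so every cactus_init needs a matching cactus_destroy. The sketch below shows one way to guarantee that pairing with a context manager. managed_handle is a hypothetical helper, not part of the package; it takes the init and destroy functions as arguments, so the same pattern could wrap cactus_index_init and cactus_index_destroy.

```python
from contextlib import contextmanager


@contextmanager
def managed_handle(init_fn, destroy_fn, *init_args):
    """Hypothetical helper: create a handle and always destroy it on exit."""
    handle = init_fn(*init_args)
    try:
        yield handle
    finally:
        # Runs on normal exit and when the body raises.
        destroy_fn(handle)


# Intended use with the real bindings (not executed here):
#   with managed_handle(cactus_init, cactus_destroy, str(bundle), None, False) as model:
#       result = cactus_complete(model, messages, None, None, None)
```

The try/finally keeps the destroy call on the error path as well, which matters because most value-returning functions raise RuntimeError on failure.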

Completion

Returns a dict with success, error, cloud_handoff, response, function_calls, segments, and confidence; timing stats (time_to_first_token_ms, total_time_ms, prefill_tps, decode_tps, ram_usage_mb); and token counts (prefill_tokens, decode_tokens, total_tokens). An optional thinking field is present only when the model emits chain-of-thought content, and is placed before function_calls. segments is always [] for completion; it is populated only in transcription responses.

result = cactus_complete(
    model: int,
    messages_json: str,              # JSON array of {role, content}
    options_json: str | None,        # optional inference options
    tools_json: str | None,          # optional tool definitions
    callback: Callable[[str, int], None] | None,  # streaming token callback
    pcm_data: list[int] | None = None              # optional raw audio bytes
) -> dict

# With options and streaming (messages is the JSON string from the Quick Example)
options = json.dumps({"max_tokens": 256, "temperature": 0.7})

def on_token(token, token_id):
    # Called once per generated token; print it as soon as it arrives
    print(token, end="", flush=True)

result = cactus_complete(model, messages, options, None, on_token)
if result["cloud_handoff"]:
    # The response already contains the cloud result
    pass
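
When the streamed text is needed afterwards rather than printed, the callback can accumulate tokens instead. The following is a minimal sketch: TokenCollector is a hypothetical class, and it assumes only the (token, token_id) callback shape shown in the signature above. The stream is simulated here so the snippet runs without a model.

```python
class TokenCollector:
    """Hypothetical callback object that accumulates streamed tokens."""

    def __init__(self):
        self.tokens = []
        self.token_ids = []

    def __call__(self, token, token_id):
        # Same (str, int) shape as the streaming callback.
        self.tokens.append(token)
        self.token_ids.append(token_id)

    @property
    def text(self):
        return "".join(self.tokens)


collector = TokenCollector()
# With the real bindings (not executed here):
#   cactus_complete(model, messages, options, None, collector)
# Simulated stream for illustration:
for tok, tid in [("The", 1), (" answer", 2), (" is 4", 3)]:
    collector(tok, tid)
print(collector.text)
```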

Response format:

{
    "success": true,
    "error": null,
    "cloud_handoff": false,
    "response": "4",
    "function_calls": [],
    "segments": [],
    "confidence": 0.92,
    "time_to_first_token_ms": 45.2,
    "total_time_ms": 163.7,
    "prefill_tps": 619.5,
    "decode_tps": 168.4,
    "ram_usage_mb": 512.3,
    "prefill_tokens": 28,
    "decode_tokens": 12,
    "total_tokens": 40
}
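
As a rough illustration of consuming this dict, the sketch below checks success first and then reports where the answer came from and how fast it was generated. summarize_completion is a hypothetical helper; the sample values are copied from the response format above.

```python
def summarize_completion(result):
    """Hypothetical helper: one-line summary of a completion result dict."""
    if not result["success"]:
        return f"failed: {result['error']}"
    source = "cloud" if result["cloud_handoff"] else "on-device"
    return (
        f"{source}: {result['decode_tokens']} tokens at "
        f"{result['decode_tps']:.1f} tok/s, "
        f"first token after {result['time_to_first_token_ms']:.0f} ms"
    )


sample = {
    "success": True,
    "error": None,
    "cloud_handoff": False,
    "response": "4",
    "decode_tokens": 12,
    "decode_tps": 168.4,
    "time_to_first_token_ms": 45.2,
}
print(summarize_completion(sample))
# on-device: 12 tokens at 168.4 tok/s, first token after 45 ms
```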

Prefill

Pre-processes input text and populates the KV cache without generating output tokens. This reduces latency for subsequent calls to cactus_complete.

cactus_prefill(
    model: int,
    messages_json: str,              # JSON array of {role, content}
    options_json: str | None,        # optional inference options
    tools_json: str | None,          # optional tool definitions
    pcm_data: list[int] | None = None              # optional raw audio bytes
) -> None

tools = json.dumps([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City, State, Country"}
            },
            "required": ["location"]
        }
    }
}])

messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"}
])
cactus_prefill(model, messages, None, tools)

completion_messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"},
    {"role": "user", "content": "What about SF?"}
])
result = cactus_complete(model, completion_messages, None, tools, None)

Response format:

{
    "success": true,
    "error": null,
    "prefill_tokens": 25,
    "prefill_tps": 166.1,
    "total_time_ms": 150.5,
    "ram_usage_mb": 245.67
}
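
In the example above, the completion messages start with exactly the prefilled messages and add one new user turn. Assuming that shared prefix is what lets the cached state be reused, building both JSON strings from a single list avoids the two copies drifting apart. with_followup is a hypothetical helper, and the history here is shortened for illustration.

```python
import json


def with_followup(history, user_text):
    """Return (prefill_json, completion_json) that share the same message prefix."""
    prefill_json = json.dumps(history)
    completion_json = json.dumps(history + [{"role": "user", "content": user_text}])
    return prefill_json, completion_json


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"},
]
prefill_json, completion_json = with_followup(history, "What about SF?")

# With the real bindings (not executed here):
#   cactus_prefill(model, prefill_json, None, tools)
#   result = cactus_complete(model, completion_json, None, tools, None)
```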

Transcription

Returns a dict with the response field (transcribed text), the segments array, and other metadata. Each segment is timestamped as {"start": <sec>, "end": <sec>, "text": "<str>"}. Segment granularity depends on the model: Whisper produces phrase-level segments from timestamp tokens; Parakeet TDT produces word-level segments from frame timing; Parakeet CTC and Moonshine produce one segment per transcription window (consecutive VAD speech regions up to 30s).

result = cactus_transcribe(
    model: int,
    audio_path: str | None,
    prompt: str | None,
    options_json: str | None,
    callback: Callable[[str, int], None] | None,
    pcm_data: list[int] | bytes | None
) -> dict

Custom vocabulary biases the decoder toward domain-specific words (supported for Whisper and Moonshine models). Pass custom_vocabulary and vocabulary_boost in options_json:

options = json.dumps({
    "custom_vocabulary": ["Omeprazole", "HIPAA", "Cactus"],
    "vocabulary_boost": 3.0
})
result = cactus_transcribe(model, "medical_notes.wav", None, options, None, None)

result = cactus_transcribe(model, "/path/to/audio.wav", None, None, None, None)
print(result["response"])
for seg in result["segments"]:
    print(f"[{seg['start']:.3f}s - {seg['end']:.3f}s] {seg['text']}")
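
Since each segment carries start and end times in seconds, the array maps directly onto subtitle formats. The sketch below renders segments as SubRip (SRT) text. to_srt is a hypothetical helper, not part of the package, and the sample segments are made up for illustration.

```python
def to_srt(segments):
    """Render [{"start", "end", "text"}, ...] (times in seconds) as SubRip text."""

    def stamp(seconds):
        # SRT timestamps are HH:MM:SS,mmm
        total_ms = int(round(seconds * 1000))
        hours, rem = divmod(total_ms, 3_600_000)
        minutes, rem = divmod(rem, 60_000)
        secs, ms = divmod(rem, 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"


sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 61.25, "text": " General update."},
]
print(to_srt(sample_segments))
```

With the real bindings, the input would be result["segments"] from cactus_transcribe.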

Embeddings

embedding = cactus_embed(model: int, text: str, normalize: bool) -> list[float]
embedding = cactus_image_embed(model: int, image_path: str) -> list[float]
embedding = cactus_audio_embed(model: int, audio_path: str) -> list[float]
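
All three functions return a plain list[float], so embeddings can be compared without extra dependencies. The sketch below is a generic cosine similarity, not something the package provides. If normalize=True yields unit-length vectors (an assumption about what the flag does), the plain dot product would give the same result.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


# With the real bindings (not executed here):
#   sim = cosine_similarity(cactus_embed(model, "cat", True),
#                           cactus_embed(model, "kitten", True))
```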

Tokenization

tokens = cactus_tokenize(model: int, text: str) -> list[int]
result = cactus_score_window(model: int, tokens: list[int], start: int, end: int, context: int) -> dict

RAG

result = cactus_rag_query(model: int, query: str, top_k: int) -> dict

Returns a dict with a chunks array. Each chunk has score (float), source (str, from document metadata), and content (str):

{
    "chunks": [
        {"score": 0.0142, "source": "doc.txt", "content": "relevant passage..."}
    ]
}
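
A common next step is to fold the retrieved chunks into a prompt. The sketch below joins chunks into a context block with a character budget. build_context is a hypothetical helper, and it assumes the chunks arrive already ranked, so it keeps their order rather than re-sorting by score.

```python
def build_context(rag_result, max_chars=2000):
    """Hypothetical helper: join chunks, in returned order, into a prompt block."""
    parts, used = [], 0
    for chunk in rag_result["chunks"]:
        entry = f"[{chunk['source']}] {chunk['content']}"
        if used + len(entry) > max_chars:
            break  # stop once the budget would be exceeded
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


sample = {
    "chunks": [
        {"score": 0.0142, "source": "doc.txt", "content": "relevant passage..."},
        {"score": 0.0101, "source": "faq.txt", "content": "another passage"},
    ]
}
print(build_context(sample))
```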

Vector Index

index = cactus_index_init(index_dir: str, embedding_dim: int) -> int
cactus_index_add(index: int, ids: list[int], documents: list[str],
                 metadatas: list[str] | None, embeddings: list[list[float]])
cactus_index_delete(index: int, ids: list[int])
result = cactus_index_get(index: int, ids: list[int]) -> dict
result = cactus_index_query(index: int, embedding: list[float], options_json: str | None) -> dict
cactus_index_compact(index: int)
cactus_index_destroy(index: int)

cactus_index_query returns {"results":[{"id":<int>,"score":<float>}, ...]}. cactus_index_get returns {"results":[{"document":"...","metadata":<str|null>,"embedding":[...]}, ...]}.
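
Because cactus_index_query returns only ids and scores, fetching the matching documents takes a second call to cactus_index_get. The sketch below pairs the two results. merge_hits is a hypothetical helper, and it assumes cactus_index_get returns one entry per requested id in the order the ids were passed, which the text above does not state.

```python
def merge_hits(query_result, get_result):
    """Pair query hits with fetched documents, assuming matching order."""
    hits = query_result["results"]
    docs = get_result["results"]
    if len(hits) != len(docs):
        raise ValueError("expected one fetched document per hit")
    return [
        {
            "id": hit["id"],
            "score": hit["score"],
            "document": doc["document"],
            "metadata": doc["metadata"],
        }
        for hit, doc in zip(hits, docs)
    ]


# With the real bindings (not executed here):
#   hits = cactus_index_query(index, embedding, None)
#   docs = cactus_index_get(index, [h["id"] for h in hits["results"]])
#   merged = merge_hits(hits, docs)
```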

Logging

cactus_log_set_level(level: int)  # 0=DEBUG 1=INFO 2=WARN (default) 3=ERROR 4=NONE
cactus_log_set_callback(callback: Callable[[int, str, str], None] | None)
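
One use for the log callback is routing engine messages into Python's standard logging module. The sketch below maps the level numbers listed above onto logging levels. forward_to_logging is a hypothetical function, and the argument order (level, component, message) is an assumption; the signature above only states the types (int, str, str).

```python
import logging

# Level numbers as listed above; 4=NONE has no logging equivalent.
_LEVELS = {0: logging.DEBUG, 1: logging.INFO, 2: logging.WARNING, 3: logging.ERROR}


def forward_to_logging(level, component, message):
    """Assumed argument order: (level, component, message)."""
    logger = logging.getLogger(f"cactus.{component}")
    logger.log(_LEVELS.get(level, logging.INFO), message)


# With the real bindings (not executed here):
#   cactus_log_set_callback(forward_to_logging)
```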

Telemetry

cactus_set_telemetry_environment(framework: str, cache_location: str | None, version: str | None)
cactus_set_app_id(app_id: str)
cactus_telemetry_flush()
cactus_telemetry_shutdown()

Error handling: functions that return a value raise RuntimeError on failure. cactus_prefill, cactus_index_add, cactus_index_delete, and cactus_index_compact also raise RuntimeError on failure despite not returning a value. The remaining void functions never raise: cactus_destroy, cactus_reset, cactus_stop, cactus_index_destroy, and the logging and telemetry functions.
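
For call sites that prefer a returned error over an exception, the RuntimeError can be caught in one place. The sketch below is a generic wrapper; try_call is a hypothetical helper, not part of the package.

```python
def try_call(fn, *args, **kwargs):
    """Hypothetical helper: (value, None) on success, (None, message) on RuntimeError."""
    try:
        return fn(*args, **kwargs), None
    except RuntimeError as exc:
        return None, str(exc)


# With the real bindings (not executed here):
#   embedding, err = try_call(cactus_embed, model, "hello", True)
#   if err is not None:
#       print("embedding failed:", err)
```

Only RuntimeError is caught, so programming errors such as a TypeError from a wrong argument still surface normally.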

Vision (VLM)

Pass images in the messages content for vision-language models (LFM2-VL, LFM2.5-VL, Gemma4, Qwen3.5):

messages = json.dumps([{
    "role": "user",
    "content": "Describe this image",
    "images": ["path/to/image.png"]
}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])

Audio (Multimodal)

Pass audio files in messages for models with native audio understanding (Gemma4):

messages = json.dumps([{
    "role": "user",
    "content": "Transcribe the audio.",
    "audio": ["path/to/audio.wav"]
}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])

# Combined vision + audio
messages = json.dumps([{
    "role": "user",
    "content": "Describe the image and transcribe the audio.",
    "images": ["path/to/image.png"],
    "audio": ["path/to/audio.wav"]
}])
result = cactus_complete(model, messages, None, None, None)
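
The vision and audio examples differ only in which media keys the message carries. The sketch below builds such messages with one function. user_message is a hypothetical helper; the "images" and "audio" keys are the ones used in the examples above.

```python
import json


def user_message(content, images=None, audio=None):
    """Build one user message; media keys are added only when provided."""
    message = {"role": "user", "content": content}
    if images:
        message["images"] = list(images)
    if audio:
        message["audio"] = list(audio)
    return message


messages = json.dumps([
    user_message(
        "Describe the image and transcribe the audio.",
        images=["path/to/image.png"],
        audio=["path/to/audio.wav"],
    )
])
# With the real bindings (not executed here):
#   result = cactus_complete(model, messages, None, None, None)
```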

Compute Graph

The Graph API provides a tensor computation graph for building and executing dataflow pipelines on the Cactus kernel layer:

from cactus.bindings.cactus import Graph
import numpy as np

g = Graph()
a = g.input((2, 2))
b = g.input((2, 2))
y = ((a - b) * (a + b)).abs().pow(2.0).view((4,))

g.set_input(a, np.array([[2, 4], [6, 8]], dtype=np.float16))
g.set_input(b, np.array([[1, 2], [3, 4]], dtype=np.float16))
g.execute()

print(y.numpy())  # [9. 144. 729. 2304.]

Supported ops: +, -, *, /, abs, pow, view, flatten, concat, cat, relu, sigmoid, tanh, gelu, softmax.
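
When checking graph output, a plain-Python reference for the same expression can help. The sketch below evaluates ((a - b) * (a + b)).abs().pow(2.0) on nested lists, assuming the ops are elementwise and that view((4,)) flattens in row-major order. reference is a hypothetical function, not part of the Graph API.

```python
def reference(a, b):
    """Elementwise abs((a - b) * (a + b)) ** 2, flattened row by row."""
    return [
        float(abs((x - y) * (x + y)) ** 2)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    ]


print(reference([[2, 4], [6, 8]], [[1, 2], [3, 4]]))
# [9.0, 144.0, 729.0, 2304.0]
```

All four values are exactly representable in float16, so the graph result for these inputs should match without tolerance under the assumptions above.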

Testing

Run the full test suite:

python python/test.py        # compact output
python python/test.py -v     # verbose

Tests live in python/tests/ and cover bindings, CLI, server, graph, model, transpile, and component partitioning. Add a new test_*.py file to extend the suite.
