README.md

title: Cactus Python Package
description: Python package and ctypes bindings for the Cactus on-device AI inference engine.
keywords: Python package, Python bindings, on-device AI, Python FFI, embeddings, transcription, RAG

Cactus Python Package

Python bindings for Cactus Engine via FFI. Auto-installed when you run source ./setup.

Model bundles: Pre-built runtime bundles for all supported models at huggingface.co/Cactus-Compute.

Getting Started

git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
cactus build --python
# Download pre-built bundles (defaults to the generic CPU variant)
cactus download LiquidAI/LFM2-VL-450M
cactus download openai/whisper-small --platform apple   # CoreML/NPU variant

# Optional: set your Cactus Cloud API key for automatic cloud fallback
cactus auth

Quick Example

from cactus import ensure_model
from cactus import cactus_init, cactus_complete, cactus_destroy
import json

# Downloads the pre-built bundle from HuggingFace if not already present
bundle = ensure_model("LiquidAI/LFM2-VL-450M")

model = cactus_init(str(bundle), None, False)
messages = json.dumps([{"role": "user", "content": "What is 2+2?"}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])
cactus_destroy(model)

API Reference

All functions are module-level and mirror the C FFI directly. Handles are plain int values (C pointers).

Model Downloads

Download pre-built bundles programmatically (no CLI needed):

from cactus import ensure_model, get_bundle_dir

# ensure_model downloads the pre-built bundle if missing, returns its Path
bundle = ensure_model("openai/whisper-tiny")

# Or resolve the expected on-disk location explicitly
bundle_dir = get_bundle_dir("openai/whisper-tiny", bits=4, platform=None)
# -> Path("transpiled/whisper-tiny-cq4")  (or `-cq4-apple` with platform="apple")

Init / Lifecycle

model = cactus_init(model_path: str, corpus_dir: str | None, cache_index: bool) -> int
cactus_destroy(model: int)
cactus_reset(model: int)   # clear KV cache
cactus_stop(model: int)    # abort ongoing generation
cactus_get_last_error() -> str | None
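
Because handles are plain ints that stay valid until the matching destroy call, it can help to tie their lifetime to a with block. The following is a minimal sketch, not part of the package: managed_handle is a hypothetical helper that takes the init and destroy functions as arguments, so it can wrap cactus_init/cactus_destroy or cactus_index_init/cactus_index_destroy.

```python
from contextlib import contextmanager


@contextmanager
def managed_handle(init_fn, destroy_fn, *init_args):
    """Yield a handle from init_fn and always pass it to destroy_fn afterwards.

    Hypothetical helper: init_fn and destroy_fn are expected to be a pair
    such as cactus_init / cactus_destroy, supplied by the caller.
    """
    handle = init_fn(*init_args)
    try:
        yield handle
    finally:
        # Runs on normal exit and when the body raises.
        destroy_fn(handle)
```

With the real bindings this would read, for example, with managed_handle(cactus_init, cactus_destroy, str(bundle), None, False) as model: followed by the calls that use model.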

Completion

Returns a dict with the following fields:

  • success, error, cloud_handoff, response
  • thinking: optional; present only when the model emits chain-of-thought content, and placed before function_calls
  • function_calls
  • segments: always [] for completion; populated only in transcription responses
  • confidence
  • timing stats: time_to_first_token_ms, total_time_ms, prefill_tps, decode_tps, ram_usage_mb
  • token counts: prefill_tokens, decode_tokens, total_tokens

result = cactus_complete(
    model: int,
    messages_json: str,              # JSON array of {role, content}
    options_json: str | None,        # optional inference options
    tools_json: str | None,          # optional tool definitions
    callback: Callable[[str, int], None] | None,  # streaming token callback
    pcm_data: list[int] | None = None              # optional raw audio bytes
) -> dict

# With options and streaming
options = json.dumps({"max_tokens": 256, "temperature": 0.7})
def on_token(token, token_id): print(token, end="", flush=True)

result = cactus_complete(model, messages_json, options, None, on_token)
if result["cloud_handoff"]:
    # response already contains cloud result
    pass
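
The streaming callback receives each token together with its id. If the streamed text is needed after the call as well, a small callable object can echo and record tokens at the same time. This is a sketch only; TokenCollector is a hypothetical name, and it relies on nothing beyond the (token, token_id) callback signature shown above.

```python
class TokenCollector:
    """Streaming callback that optionally echoes tokens and keeps them."""

    def __init__(self, echo=True):
        self.echo = echo
        self.tokens = []
        self.token_ids = []

    def __call__(self, token, token_id):
        # Same (token, token_id) signature as on_token above.
        self.tokens.append(token)
        self.token_ids.append(token_id)
        if self.echo:
            print(token, end="", flush=True)

    @property
    def text(self):
        """All tokens received so far, joined in arrival order."""
        return "".join(self.tokens)
```

An instance would be passed in place of on_token, and collector.text read once the call returns.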

Response format:

{
    "success": true,
    "error": null,
    "cloud_handoff": false,
    "response": "4",
    "function_calls": [],
    "segments": [],
    "confidence": 0.92,
    "time_to_first_token_ms": 45.2,
    "total_time_ms": 163.7,
    "prefill_tps": 619.5,
    "decode_tps": 168.4,
    "ram_usage_mb": 512.3,
    "prefill_tokens": 28,
    "decode_tokens": 12,
    "total_tokens": 40
}
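
For logging or benchmarking, the result dict can be condensed into a single line. The sketch below uses only fields from the response format above; summarize_completion is a hypothetical helper, and the sample dict is copied from that response format.

```python
def summarize_completion(result):
    """Return a one-line summary of a completion result dict."""
    if not result.get("success"):
        return f"failed: {result.get('error')}"
    where = "cloud" if result.get("cloud_handoff") else "on-device"
    return (
        f"{where}: {result['decode_tokens']} tokens decoded at "
        f"{result['decode_tps']:.1f} tok/s, "
        f"first token after {result['time_to_first_token_ms']:.1f} ms, "
        f"confidence {result['confidence']:.2f}"
    )


# Sample values copied from the response format above.
sample = {
    "success": True, "error": None, "cloud_handoff": False, "response": "4",
    "function_calls": [], "segments": [], "confidence": 0.92,
    "time_to_first_token_ms": 45.2, "total_time_ms": 163.7,
    "prefill_tps": 619.5, "decode_tps": 168.4, "ram_usage_mb": 512.3,
    "prefill_tokens": 28, "decode_tokens": 12, "total_tokens": 40,
}
print(summarize_completion(sample))
# on-device: 12 tokens decoded at 168.4 tok/s, first token after 45.2 ms, confidence 0.92
```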

Prefill

Pre-processes input text and populates the KV cache without generating output tokens. This reduces latency for subsequent calls to cactus_complete.

cactus_prefill(
    model: int,
    messages_json: str,              # JSON array of {role, content}
    options_json: str | None,        # optional inference options
    tools_json: str | None,          # optional tool definitions
    pcm_data: list[int] | None = None              # optional raw audio bytes
) -> None

tools = json.dumps([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City, State, Country"}
            },
            "required": ["location"]
        }
    }
}])

messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"}
])
cactus_prefill(model, messages, None, tools)

completion_messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"},
    {"role": "user", "content": "What about SF?"}
])
result = cactus_complete(model, completion_messages, None, tools, None)

Response format (as reported by the underlying engine call; the Python signature above returns None):

{
    "success": true,
    "error": null,
    "prefill_tokens": 25,
    "prefill_tps": 166.1,
    "total_time_ms": 150.5,
    "ram_usage_mb": 245.67
}
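
In the prefill example, the completion call repeats the prefilled messages verbatim and appends one new user turn. Deriving the second message list from the first avoids accidental differences between the two. This is a sketch under that reading of the example; append_user_turn and shares_prefix are hypothetical helpers, and whether the engine requires an exact prefix match to reuse the cache is an assumption.

```python
import json


def append_user_turn(history_json, user_text):
    """Return a new messages JSON string: the given history plus one user turn."""
    history = json.loads(history_json)
    return json.dumps(history + [{"role": "user", "content": user_text}])


def shares_prefix(history_json, extended_json):
    """True if extended_json starts with exactly the messages in history_json."""
    history = json.loads(history_json)
    extended = json.loads(extended_json)
    return extended[:len(history)] == history
```

With the example above, completion_messages could be built as append_user_turn(messages, "What about SF?").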

Transcription

Returns a dict with the response field (transcribed text), a segments array, and other metadata. Each segment has the form {"start": <sec>, "end": <sec>, "text": "<str>"}; its granularity depends on the model:

  • Whisper: phrase-level, from timestamp tokens
  • Parakeet TDT: word-level, from frame timing
  • Parakeet CTC and Moonshine: one segment per transcription window (consecutive VAD speech regions up to 30s)

result = cactus_transcribe(
    model: int,
    audio_path: str | None,
    prompt: str | None,
    options_json: str | None,
    callback: Callable[[str, int], None] | None,
    pcm_data: list[int] | bytes | None
) -> dict

Custom vocabulary biases the decoder toward domain-specific words (supported for Whisper and Moonshine models). Pass custom_vocabulary and vocabulary_boost in options_json:

options = json.dumps({
    "custom_vocabulary": ["Omeprazole", "HIPAA", "Cactus"],
    "vocabulary_boost": 3.0
})
result = cactus_transcribe(model, "medical_notes.wav", None, options, None, None)

# Basic transcription without options: print the text, then each timestamped segment
result = cactus_transcribe(model, "/path/to/audio.wav", None, None, None, None)
print(result["response"])
for seg in result["segments"]:
    print(f"[{seg['start']:.3f}s - {seg['end']:.3f}s] {seg['text']}")
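
Segment times are given in seconds, so turning them into subtitle-style timestamps takes only a little arithmetic. The following sketch assumes nothing beyond the {"start", "end", "text"} segment shape described above; format_timestamp and segments_to_lines are hypothetical helpers.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(segments):
    """Render each segment as 'start --> end  text'."""
    return [
        f"{format_timestamp(s['start'])} --> {format_timestamp(s['end'])}  {s['text'].strip()}"
        for s in segments
    ]
```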

Embeddings

embedding = cactus_embed(model: int, text: str, normalize: bool) -> list[float]
embedding = cactus_image_embed(model: int, image_path: str) -> list[float]
embedding = cactus_audio_embed(model: int, audio_path: str) -> list[float]
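
The embedding functions return plain lists of floats, so two results can be compared without extra dependencies. Below is a minimal cosine-similarity sketch; cosine_similarity is a hypothetical helper. Whether normalize=True yields unit-length vectors is an assumption, so the function normalizes explicitly and works either way.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)
```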

Tokenization

tokens = cactus_tokenize(model: int, text: str) -> list[int]
result = cactus_score_window(model: int, tokens: list[int], start: int, end: int, context: int) -> dict

RAG

result = cactus_rag_query(model: int, query: str, top_k: int) -> dict

Returns a dict with a chunks array. Each chunk has score (float), source (str, from document metadata), and content (str):

{
    "chunks": [
        {"score": 0.0142, "source": "doc.txt", "content": "relevant passage..."}
    ]
}
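
A common next step is to fold the retrieved chunks into a prompt for cactus_complete. The sketch below joins chunks into one context string under a character budget. build_context is a hypothetical helper, and it assumes that a higher score means a more relevant chunk, which the text above does not state.

```python
def build_context(rag_result, max_chars=2000):
    """Join retrieved chunks, highest score first, into a context string.

    Assumption: a higher score means a more relevant chunk.
    """
    chunks = sorted(rag_result.get("chunks", []), key=lambda c: c["score"], reverse=True)
    parts = []
    used = 0
    for chunk in chunks:
        entry = f"[{chunk['source']}] {chunk['content']}"
        if used + len(entry) > max_chars:
            break  # stop once the character budget is exhausted
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)
```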

Vector Index

index = cactus_index_init(index_dir: str, embedding_dim: int) -> int
cactus_index_add(index: int, ids: list[int], documents: list[str],
                 metadatas: list[str] | None, embeddings: list[list[float]])
cactus_index_delete(index: int, ids: list[int])
result = cactus_index_get(index: int, ids: list[int]) -> dict
result = cactus_index_query(index: int, embedding: list[float], options_json: str | None) -> dict
cactus_index_compact(index: int)
cactus_index_destroy(index: int)

cactus_index_query returns {"results":[{"id":<int>,"score":<float>}, ...]}. cactus_index_get returns {"results":[{"document":"...","metadata":<str|null>,"embedding":[...]}, ...]}.
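
Since cactus_index_query returns only ids and scores, the documents have to be fetched in a second step with cactus_index_get. The following sketch works on dicts of the two shapes just described. filter_hits and attach_documents are hypothetical helpers, and two things are assumptions: that a higher score is better, and that cactus_index_get returns results in the order of the requested ids.

```python
def filter_hits(query_result, min_score):
    """Return [(id, score), ...] for hits at or above min_score, best first."""
    hits = [(r["id"], r["score"]) for r in query_result.get("results", [])]
    hits = [h for h in hits if h[1] >= min_score]
    return sorted(hits, key=lambda h: h[1], reverse=True)


def attach_documents(hits, get_result):
    """Pair each hit with its document from a cactus_index_get-shaped dict.

    Assumption: get_result["results"] follows the order of the ids in hits.
    """
    docs = [r["document"] for r in get_result["results"]]
    return [{"id": i, "score": s, "document": d} for (i, s), d in zip(hits, docs)]
```

In use, the ids from filter_hits would be passed to cactus_index_get, and its result handed to attach_documents.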

Logging

cactus_log_set_level(level: int)  # 0=DEBUG 1=INFO 2=WARN (default) 3=ERROR 4=NONE
cactus_log_set_callback(callback: Callable[[int, str, str], None] | None)
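
A log callback can forward engine messages to Python's standard logging module. The sketch below maps the level numbers listed above onto logging levels. make_log_bridge is a hypothetical helper, and the meaning of the three callback arguments as (level, component, message) is an assumption; only their types are given above.

```python
import logging

# Level numbers taken from the cactus_log_set_level comment above.
_LEVELS = {0: logging.DEBUG, 1: logging.INFO, 2: logging.WARNING, 3: logging.ERROR}


def make_log_bridge(logger_name="cactus"):
    """Build a callback that forwards engine log records to Python logging.

    Assumption: the callback arguments are (level, component, message).
    """
    logger = logging.getLogger(logger_name)

    def bridge(level, component, message):
        # Unknown level numbers fall back to INFO.
        logger.log(_LEVELS.get(level, logging.INFO), "[%s] %s", component, message)

    return bridge
```

The returned function would then be registered with cactus_log_set_callback(make_log_bridge()).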

Telemetry

cactus_set_telemetry_environment(framework: str, cache_location: str | None, version: str | None)
cactus_set_app_id(app_id: str)
cactus_telemetry_flush()
cactus_telemetry_shutdown()

Error handling is the same across the API. Functions that return a value raise RuntimeError on failure. cactus_prefill, cactus_index_add, cactus_index_delete, and cactus_index_compact also raise RuntimeError on failure even though they return nothing. The remaining void functions never raise: cactus_destroy, cactus_reset, cactus_stop, cactus_index_destroy, and the logging and telemetry functions.

Vision (VLM)

Pass images in the messages content for vision-language models (LFM2-VL, LFM2.5-VL, Gemma4, Qwen3.5):

messages = json.dumps([{
    "role": "user",
    "content": "Describe this image",
    "images": ["path/to/image.png"]
}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])

Audio (Multimodal)

Pass audio files in messages for models with native audio understanding (Gemma4):

messages = json.dumps([{
    "role": "user",
    "content": "Transcribe the audio.",
    "audio": ["path/to/audio.wav"]
}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])

# Combined vision + audio
messages = json.dumps([{
    "role": "user",
    "content": "Describe the image and transcribe the audio.",
    "images": ["path/to/image.png"],
    "audio": ["path/to/audio.wav"]
}])
result = cactus_complete(model, messages, None, None, None)
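
The vision and audio messages above differ only in which optional keys they carry, so a small builder keeps them consistent. This is a sketch; user_message is a hypothetical helper that uses only the "images" and "audio" keys shown in the examples.

```python
import json


def user_message(text, images=None, audio=None):
    """Build one user message dict, adding media keys only when given."""
    message = {"role": "user", "content": text}
    if images:
        message["images"] = list(images)
    if audio:
        message["audio"] = list(audio)
    return message


# Serialized the same way as the hand-written examples above.
messages = json.dumps([
    user_message(
        "Describe the image and transcribe the audio.",
        images=["path/to/image.png"],
        audio=["path/to/audio.wav"],
    )
])
```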

Compute Graph

The Graph API provides a tensor computation graph for building and executing dataflow pipelines on the Cactus kernel layer:

from cactus.bindings.cactus import Graph
import numpy as np

g = Graph()
a = g.input((2, 2))
b = g.input((2, 2))
y = ((a - b) * (a + b)).abs().pow(2.0).view((4,))

g.set_input(a, np.array([[2, 4], [6, 8]], dtype=np.float16))
g.set_input(b, np.array([[1, 2], [3, 4]], dtype=np.float16))
g.execute()

print(y.numpy())  # [9. 144. 729. 2304.], i.e. ((a - b) * (a + b))^2 element-wise

Supported ops: +, -, *, /, abs, pow, view, flatten, concat, cat, relu, sigmoid, tanh, gelu, softmax.
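
To check what the graph expression in the example computes, it can be evaluated by hand in plain Python. The sketch below assumes the ops are element-wise and that view((4,)) flattens in row-major order; reference_result is a hypothetical helper and does not use the Graph API.

```python
def reference_result(a, b):
    """Plain-Python version of ((a - b) * (a + b)).abs().pow(2.0).view((4,)).

    Assumes element-wise ops and a row-major flatten.
    """
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    return [abs((x - y) * (x + y)) ** 2.0 for x, y in zip(flat_a, flat_b)]


print(reference_result([[2, 4], [6, 8]], [[1, 2], [3, 4]]))
# [9.0, 144.0, 729.0, 2304.0]
```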

Testing

Run the full test suite:

python python/test.py        # compact output
python python/test.py -v     # verbose

Tests are in python/tests/ — bindings, CLI, server, graph, model, transpile, and component-partition coverage. Add a new test_*.py to extend.

See Also

  • Cactus Engine API — Full C API reference that the Python bindings wrap
  • Cactus Index API — Vector database API for RAG applications
  • Fine-tuning Guide — Train and deploy custom LoRA fine-tunes
  • Runtime Compatibility — Weight versioning across releases
  • Apple Build Step — Builds Apple native artifacts used by bindings
  • Android Build Step — Builds Android native artifacts used by bindings
  • Swift Bindings — Swift C-module bindings
  • Kotlin Bindings — Kotlin/JNI bindings
  • Flutter Bindings — Dart FFI bindings
  • Rust Bindings — Raw Rust FFI declarations