# RKEngine

OpenAI-compatible API server for RKLLM models.

## Features

- **OpenAI-Compatible API**: Provides `/v1/models`, `/v1/chat/completions`, `/health`, `/ready`, `/live`, and `/metrics` endpoints
- **Multiple Parser Support**: Mistral, Llama, Qwen, GPT-OSS, and base parsers
- **Streaming Support**: Both streaming and non-streaming responses
- **CLI Interface**: Command-line interface with clap for argument parsing
- **CORS Enabled**: All origins allowed (authentication managed externally)
- **Structured Logging**: Uses `tracing` for production-grade logging
- **Prometheus Metrics**: Built-in metrics endpoint at `/metrics`
- **Concurrency Control**: Single LLM inference at a time (other endpoints run concurrently)
- **FFI Bindings**: Real `librkllmrt.so` support with mock fallback

## Building

### Development Build

```bash
cargo build
```

### Release Build

```bash
cargo build --release
```

The release build is optimized and located at `target/release/rkengine`.

## Running

### Development Mode

```bash
cargo run -- --model path/to/model.rkllm --parser mistral --host 0.0.0.0 --port 8080
```

### Production Mode

```bash
./target/release/rkengine --model path/to/model.rkllm --parser mistral --host 0.0.0.0 --port 8080
```

## CLI Arguments

| Argument | Description | Default | Required |
|----------|-------------|---------|----------|
| `--model` | Path to the `.rkllm` model file | - | Yes |
| `--parser` | Output parser type | `none` | No |
| `--platform` | Target platform | `rk3588` | No |
| `--lib-path` | Path to `librkllmrt.so` | Auto-detect | No |
| `--host` | Host to bind to | `0.0.0.0` | No |
| `--port` | Port to listen on | `8080` | No |
| `--thinking` | Enable thinking/reasoning mode | `false` | No |
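
For example, a fully specified invocation combining these flags (the paths are illustrative, and `--thinking` is shown as a bare boolean flag, which is an assumption):

```bash
./target/release/rkengine \
  --model /models/qwen.rkllm \
  --parser qwen \
  --platform rk3588 \
  --lib-path /usr/lib/librkllmrt.so \
  --host 0.0.0.0 \
  --port 8080 \
  --thinking
```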

## Parser Types

- `none` - No parsing (base parser)
- `mistral` - Mistral format
- `gpt-oss` - GPT-OSS format
- `llama` - Llama format
- `qwen` - Qwen format

## Environment Variables

### Logging

- `RUST_LOG` - Controls the logging level (default: `info`)
  - Example: `RUST_LOG=debug cargo run -- ...`
  - Levels: `error`, `warn`, `info`, `debug`, `trace`
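
Because logging is built on `tracing`, `RUST_LOG` will typically also accept per-target directives if the server installs an env-filter; a sketch (the `rkengine` target name is an assumption based on the binary name):

```bash
# Debug logs from the server itself, info-level everywhere else
RUST_LOG=rkengine=debug,info ./target/release/rkengine --model path/to/model.rkllm
```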

## API Endpoints

### Health Check

```
GET /health
```

Response:

```json
{
  "status": "ok"
}
```

### Readiness Check

```
GET /ready
```

Response:

```json
{
  "status": "ok"
}
```

### Liveness Check

```
GET /live
```

Response:

```json
{
  "status": "ok"
}
```
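
All three probes can be exercised with curl (assuming the server is listening on `localhost:8080`):

```bash
curl -s http://localhost:8080/health
curl -s http://localhost:8080/ready
curl -s http://localhost:8080/live
```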

### List Models

```
GET /v1/models
```

Response:

```json
{
  "object": "list",
  "data": [
    {
      "id": "model-name",
      "object": "model",
      "created": 0,
      "owned_by": "rkllm"
    }
  ]
}
```
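
The model id from this response is what a client passes to the chat endpoint; a quick way to capture it (requires `jq`):

```bash
MODEL=$(curl -s http://localhost:8080/v1/models | jq -r '.data[0].id')
echo "$MODEL"
```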

### Chat Completions

```
POST /v1/chat/completions
```

Request:

```json
{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": false,
  "tools": null
}
```

Response (non-streaming):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Response text",
        "reasoning_content": null,
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}
```
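
The same request from the shell, in both modes (the streaming body is assumed to follow the OpenAI convention of `data: {...}` server-sent-event chunks, given the API's stated compatibility):

```bash
# Non-streaming: a single JSON body as documented above
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "model-name", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'

# Streaming: -N disables curl's buffering so chunks print as they arrive
curl -sN http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "model-name", "messages": [{"role": "user", "content": "Hello!"}], "stream": true}'
```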

### Prometheus Metrics

```
GET /metrics
```

Returns Prometheus-formatted metrics for:

- HTTP request counts by method, endpoint, and status
- Request durations
- Error counts
- Model inference counts
- Token counts
- Inference durations
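
Since the endpoint serves the Prometheus text exposition format, it can be spot-checked directly:

```bash
curl -s http://localhost:8080/metrics | head -n 20
```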

## Concurrency

The server uses a semaphore to limit LLM inference to one request at a time: only a single chat completion is processed at any given moment, while other endpoints (health, metrics, models, etc.) run concurrently without restriction.

This design means:

- Multiple instances can run in parallel (horizontal scaling)
- There is no caching between requests (each inference is independent)
- Other endpoints stay responsive during inference
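
This is easy to observe from the shell (a quick sketch, assuming a server on `localhost:8080`): while a completion is in flight, the probes still answer immediately, and a second completion would queue behind the semaphore.

```bash
# Kick off a completion in the background...
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "model-name", "messages": [{"role": "user", "content": "Write a long story."}]}' \
  > /dev/null &

# ...the health probe is not gated by the inference semaphore
curl -s http://localhost:8080/health
wait
```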

## FFI Bindings

The server supports both a real mode and a mock mode:

- **Real mode**: When `librkllmrt.so` is available, the server uses FFI to call the actual RKLLM C library for inference
- **Mock mode**: When the library is not found, the server simulates responses for development and testing

The server automatically detects and uses the real library if available. To specify a custom library path:

```bash
rkengine --model path/to/model.rkllm --lib-path /custom/path/librkllmrt.so
```
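
To check up front which mode you will get, look for the runtime in the usual library locations (the search paths below are illustrative; install locations vary by distribution and BSP):

```bash
find /usr/lib /usr/local/lib /lib -name 'librkllmrt.so*' 2>/dev/null
```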

## Testing

Run all tests:

```bash
cargo test
```

Run with verbose output:

```bash
cargo test -- --nocapture
```
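
Cargo can also run just the tests whose names match a substring (the `parser` filter here is only an example):

```bash
cargo test parser
```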

## Docker

Build the Docker image:

```bash
docker build -t rkengine .
```

Run the container:

```bash
docker run -p 8080:8080 -v /path/to/models:/models rkengine --model /models/model.rkllm --parser mistral
```
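
To run detached and verify the server came up (the container name and model path are illustrative):

```bash
docker run -d --name rkengine -p 8080:8080 \
  -v /path/to/models:/models \
  rkengine --model /models/model.rkllm --parser mistral
curl -s http://localhost:8080/health
```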

## License

ISC License - see the LICENSE file for details.