# RKEngine

OpenAI-compatible API server for RKLLM models.

## Features

- **OpenAI-Compatible API**: Provides `/v1/models`, `/v1/chat/completions`, `/health`, `/ready`, `/live`, and `/metrics` endpoints
- **Multiple Parser Support**: Mistral, Llama, Qwen, GPT-OSS, and base parsers
- **Streaming Support**: Both streaming and non-streaming responses
- **CLI Interface**: Command-line interface with clap for argument parsing
- **CORS Enabled**: All origins allowed (authentication managed externally)
- **Structured Logging**: Uses `tracing` for production-grade logging
- **Prometheus Metrics**: Built-in metrics endpoint at `/metrics`
- **Concurrency Control**: Single LLM inference at a time (other endpoints run concurrently)
- **FFI Bindings**: Real `librkllmrt.so` support with mock fallback

## Building

### Development Build

```bash
cargo build
```

### Release Build

```bash
cargo build --release
```

The release build is optimized and located at `target/release/rkengine`.

## Running

### Development Mode

```bash
cargo run -- --model path/to/model.rkllm --parser mistral --host 0.0.0.0 --port 8080
```

### Production Mode

```bash
./target/release/rkengine --model path/to/model.rkllm --parser mistral --host 0.0.0.0 --port 8080
```

## CLI Arguments

| Argument | Description | Default | Required |
|----------|-------------|---------|----------|
| `--model` | Path to the `.rkllm` model file | - | Yes |
| `--parser` | Output parser type | `none` | No |
| `--platform` | Target platform | `rk3588` | No |
| `--lib-path` | Path to `librkllmrt.so` | Auto-detect | No |
| `--host` | Host to bind to | `0.0.0.0` | No |
| `--port` | Port to listen on | `8080` | No |
| `--thinking` | Enable thinking/reasoning mode | `false` | No |
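
For example, a fully specified invocation combining these flags (the paths are illustrative, and `--thinking` is shown as a bare boolean flag, which is an assumption):

```bash
./target/release/rkengine \
  --model /models/qwen.rkllm \
  --parser qwen \
  --platform rk3588 \
  --lib-path /usr/lib/librkllmrt.so \
  --host 0.0.0.0 \
  --port 8080 \
  --thinking
```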

## Parser Types

- `none` - No parsing (base parser)
- `mistral` - Mistral format
- `gpt-oss` - GPT-OSS format
- `llama` - Llama format
- `qwen` - Qwen format

## Environment Variables

### Logging

- `RUST_LOG` - Controls the logging level (default: `info`)
  - Example: `RUST_LOG=debug cargo run -- ...`
  - Levels: `error`, `warn`, `info`, `debug`, `trace`
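
Because logging is built on `tracing`, `RUST_LOG` will typically also accept per-target directives if the server installs an env-filter; a sketch (the `rkengine` target name is an assumption based on the binary name):

```bash
# Debug logs from the server itself, info-level everywhere else
RUST_LOG=rkengine=debug,info ./target/release/rkengine --model path/to/model.rkllm
```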

## API Endpoints

### Health Check

```
GET /health
```

Response:

```json
{
  "status": "ok"
}
```

### Readiness Check

```
GET /ready
```

Response:

```json
{
  "status": "ok"
}
```

### Liveness Check

```
GET /live
```

Response:

```json
{
  "status": "ok"
}
```
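
All three probes can be exercised with curl (assuming the server is listening on `localhost:8080`):

```bash
curl -s http://localhost:8080/health
curl -s http://localhost:8080/ready
curl -s http://localhost:8080/live
```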

### List Models

```
GET /v1/models
```

Response:

```json
{
  "object": "list",
  "data": [
    {
      "id": "model-name",
      "object": "model",
      "created": 0,
      "owned_by": "rkllm"
    }
  ]
}
```
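
The model id from this response is what a client passes to the chat endpoint; a quick way to capture it (requires `jq`):

```bash
MODEL=$(curl -s http://localhost:8080/v1/models | jq -r '.data[0].id')
echo "$MODEL"
```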

### Chat Completions

```
POST /v1/chat/completions
```

Request:

```json
{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": false,
  "tools": null
}
```

Response (non-streaming):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Response text",
        "reasoning_content": null,
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}
```
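
The same request from the shell, in both modes (the streaming body is assumed to follow the OpenAI convention of `data: {...}` server-sent-event chunks, given the API's stated compatibility):

```bash
# Non-streaming: a single JSON body as documented above
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "model-name", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'

# Streaming: -N disables curl's buffering so chunks print as they arrive
curl -sN http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "model-name", "messages": [{"role": "user", "content": "Hello!"}], "stream": true}'
```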

### Prometheus Metrics

```
GET /metrics
```

Returns Prometheus-formatted metrics for:

- HTTP request counts by method, endpoint, and status
- Request durations
- Error counts
- Model inference counts
- Token counts
- Inference durations
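
Since the endpoint serves the Prometheus text exposition format, it can be spot-checked directly:

```bash
curl -s http://localhost:8080/metrics | head -n 20
```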

## Concurrency

The server uses a semaphore to limit LLM inference to one request at a time: only a single chat completion is processed at any given moment, while other endpoints (health, metrics, models, etc.) run concurrently without restriction.

This design means:

- Multiple instances can run in parallel (horizontal scaling)
- There is no caching between requests (each inference is independent)
- Other endpoints stay responsive during inference
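
This is easy to observe from the shell (a quick sketch, assuming a server on `localhost:8080`): while a completion is in flight, the probes still answer immediately, and a second completion would queue behind the semaphore.

```bash
# Kick off a completion in the background...
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "model-name", "messages": [{"role": "user", "content": "Write a long story."}]}' \
  > /dev/null &

# ...the health probe is not gated by the inference semaphore
curl -s http://localhost:8080/health
wait
```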

## FFI Bindings

The server supports both a real mode and a mock mode:

- **Real mode**: When `librkllmrt.so` is available, the server uses FFI to call the actual RKLLM C library for inference
- **Mock mode**: When the library is not found, the server simulates responses for development and testing

The server automatically detects and uses the real library if available. To specify a custom library path:

```bash
rkengine --model path/to/model.rkllm --lib-path /custom/path/librkllmrt.so
```
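
To check up front which mode you will get, look for the runtime in the usual library locations (the search paths below are illustrative; install locations vary by distribution and BSP):

```bash
find /usr/lib /usr/local/lib /lib -name 'librkllmrt.so*' 2>/dev/null
```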

## Testing

Run all tests:

```bash
cargo test
```

Run with verbose output:

```bash
cargo test -- --nocapture
```
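
Cargo can also run just the tests whose names match a substring (the `parser` filter here is only an example):

```bash
cargo test parser
```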

## Docker

Build the Docker image:

```bash
docker build -t rkengine .
```

Run the container:

```bash
docker run -p 8080:8080 -v /path/to/models:/models rkengine --model /models/model.rkllm --parser mistral
```
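
To run detached and verify the server came up (the container name and model path are illustrative):

```bash
docker run -d --name rkengine -p 8080:8080 \
  -v /path/to/models:/models \
  rkengine --model /models/model.rkllm --parser mistral
curl -s http://localhost:8080/health
```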

## License

ISC License - see the LICENSE file for details.