# RKEngine
OpenAI-compatible API server for RKLLM models.
## Features

- OpenAI-Compatible API: Provides `/v1/models`, `/v1/chat/completions`, `/health`, `/ready`, `/live`, and `/metrics` endpoints
- Multiple Parser Support: Mistral, Llama, Qwen, GPT-OSS, and base parsers
- Streaming Support: Both streaming and non-streaming responses
- CLI Interface: Command-line interface with clap for argument parsing
- CORS Enabled: All origins allowed (authentication managed externally)
- Structured Logging: Uses tracing for production-grade logging
- Prometheus Metrics: Built-in metrics endpoint at `/metrics`
- Concurrency Control: Single LLM inference at a time (other endpoints concurrent)
- FFI Bindings: Real `librkllmrt.so` support with mock fallback
## Building
### Development Build
```bash
cargo build
```

### Release Build (Recommended for Production)

```bash
cargo build --release
```

The release build is optimized and located at `target/release/rkengine`.
## Running

### Development Mode

```bash
cargo run -- --model path/to/model.rkllm --parser mistral --host 0.0.0.0 --port 8080
```

### Production Mode

```bash
./target/release/rkengine --model path/to/model.rkllm --parser mistral --host 0.0.0.0 --port 8080
```
## CLI Arguments

| Argument | Description | Default | Required |
|---|---|---|---|
| `--model` | Path to the `.rkllm` model file | - | Yes |
| `--parser` | Output parser type | `none` | No |
| `--platform` | Target platform | `rk3588` | No |
| `--lib-path` | Path to `librkllmrt.so` | Auto-detect | No |
| `--host` | Host to bind to | `0.0.0.0` | No |
| `--port` | Port to listen on | `8080` | No |
| `--thinking` | Enable thinking/reasoning mode | `false` | No |
### Parser Types

- `none` - No parsing (base parser)
- `mistral` - Mistral format
- `gpt-oss` - GPT-OSS format
- `llama` - Llama format
- `qwen` - Qwen format
## Environment Variables

### Logging

- `RUST_LOG` - Control logging level (default: `info`)
  - Example: `RUST_LOG=debug cargo run -- ...`
  - Levels: `error`, `warn`, `info`, `debug`, `trace`
## API Endpoints

### Health Check

```
GET /health
```

Response:

```json
{
  "status": "ok"
}
```
### Readiness Check

```
GET /ready
```

Response:

```json
{
  "status": "ok"
}
```
### Liveness Check

```
GET /live
```

Response:

```json
{
  "status": "ok"
}
```
### List Models

```
GET /v1/models
```

Response:

```json
{
  "object": "list",
  "data": [
    {
      "id": "model-name",
      "object": "model",
      "created": 0,
      "owned_by": "rkllm"
    }
  ]
}
```
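With the server running locally on the default host and port (an assumption based on the defaults above), the endpoint can be queried with curl:

```bash
curl http://localhost:8080/v1/models
```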
### Chat Completions

```
POST /v1/chat/completions
```

Request:

```json
{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": false,
  "tools": null
}
```
Response (non-streaming):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Response text",
        "reasoning_content": null,
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}
```
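As a usage sketch, assuming the server is listening on the default host and port and `model-name` matches the loaded model, the request above can be sent with curl; set `"stream": true` to receive a streamed response instead:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```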
### Prometheus Metrics

```
GET /metrics
```

Returns Prometheus-formatted metrics for:

- HTTP request counts by method, endpoint, and status
- Request durations
- Error counts
- Model inference counts
- Token counts
- Inference durations
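To inspect the exported metrics directly (again assuming the default host and port):

```bash
curl -s http://localhost:8080/metrics
```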
## Concurrency

The server uses a semaphore to limit LLM inference to one request at a time: only one chat completion is processed at any moment, while the other endpoints (health, metrics, models, etc.) run concurrently without restriction.

This design allows for:

- Multiple instances to be run in parallel (horizontal scaling), as sketched below
- No caching between requests (each inference is independent)
- Other endpoints remaining responsive during inference
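A minimal sketch of running two instances side by side (the paths, ports, and the assumption that the hardware can host two model instances are all illustrative); a reverse proxy or load balancer would then distribute requests across them:

```bash
# Each instance handles one inference at a time; running several in parallel
# scales throughput horizontally.
./target/release/rkengine --model /models/model.rkllm --parser mistral --port 8080 &
./target/release/rkengine --model /models/model.rkllm --parser mistral --port 8081 &
```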
## FFI Bindings

The server supports both real mode and mock mode:

- Real mode: When `librkllmrt.so` is available, the server uses FFI to call the actual RKLLM C library for inference
- Mock mode: When the library is not found, the server simulates responses for development and testing

The server automatically detects and uses the real library if available. To specify a custom library path:

```bash
rkengine --model path/to/model.rkllm --lib-path /custom/path/librkllmrt.so
```
## Testing

Run all tests:

```bash
cargo test
```

Run with verbose output:

```bash
cargo test -- --nocapture
```
## Docker

Build the Docker image:

```bash
docker build -t rkengine .
```

Run the container:

```bash
docker run -p 8080:8080 -v /path/to/models:/models rkengine --model /models/model.rkllm --parser mistral
```
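To run the container in the background and confirm it is serving requests (the container name and port mapping below are assumptions matching the command above):

```bash
# Start the container detached
docker run -d --name rkengine -p 8080:8080 -v /path/to/models:/models \
  rkengine --model /models/model.rkllm --parser mistral

# Check health and readiness once it is up
curl http://localhost:8080/health
curl http://localhost:8080/ready
```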
## License
ISC License - See LICENSE file for details.