multistral (0.0.1)

Published 2025-12-15 13:27:45 +00:00 by damien in damien/multistral

Installation

pip install --index-url  multistral

About this package

Multistral: A multimodal MoE language model combining text, vision, and audio capabilities

Multistral 🚀

Multistral is a cutting-edge multimodal Mixture of Experts (MoE) language model that seamlessly combines text, vision, and audio processing capabilities. Built on top of state-of-the-art architectures including Ministral-3, Pixtral, and Voxtral, Multistral provides a unified interface for multimodal AI applications.

🌟 Features

  • 🔤 Text Processing: Advanced language understanding and generation based on Ministral-3
  • 👁️ Vision Understanding: Image analysis and visual reasoning with Pixtral integration
  • 🎵 Audio Processing: Speech and audio understanding through Voxtral components
  • 🧠 Mixture of Experts: Efficient scaling with a sparse MoE architecture
  • 🔧 Easy Installation: Simple pip install and a comprehensive API
  • 📝 Custom Tokenizer: Specialized tokenizer with 255 special tokens for multimodal tasks
  • ⚡ Optimized Performance: Support for bfloat16, CUDA, and efficient inference

🚀 Quick Start

Installation

# Basic installation
pip install multistral

# With all optional dependencies
pip install multistral[all]

# Development installation
git clone https://github.com/your-username/multistral.git
cd multistral
pip install -e ".[dev]"

Basic Usage

from multistral import MultistralForConditionalGeneration, MultistralTokenizer

# Load model and tokenizer
model = MultistralForConditionalGeneration.from_pretrained("path/to/model")
tokenizer = MultistralTokenizer.from_pretrained("path/to/tokenizer")

# Text generation
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Multimodal Usage

import torch
from PIL import Image
from multistral import MultistralForConditionalGeneration, MultistralTokenizer

# Load model
model = MultistralForConditionalGeneration.from_pretrained("path/to/model")
tokenizer = MultistralTokenizer.from_pretrained("path/to/tokenizer")

# Text + Image
image = Image.open("image.jpg")
text = "What do you see in this image?"

# Process multimodal input
inputs = tokenizer(text, return_tensors="pt")
# Add image processing here (implementation depends on your specific setup)

# Generate response
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
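
The image-handling step above is intentionally left open because it depends on how the checkpoint was exported. As a minimal sketch continuing from the snippet above, and assuming the checkpoint ships a Pixtral-style image processor config and that generate accepts pixel_values alongside the text prompt (both are assumptions, not documented API):

from transformers import AutoImageProcessor

# Hypothetical: assumes an image processor config is bundled with the checkpoint.
image_processor = AutoImageProcessor.from_pretrained("path/to/model")
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]

# Hypothetical: assumes <|image|> marks where the image features are injected
# and that the model consumes pixel_values during generation.
prompt = "<|image|> What do you see in this image?"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, pixel_values=pixel_values, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))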

📖 Documentation

Model Architecture

Multistral combines three powerful architectures:

  1. Text Model (Ministral-3):
    • 3B parameter language model
    • 32 attention heads, 8 key-value heads
    • 262K context length support
    • Advanced RoPE positioning
  2. Vision Model (Pixtral):
    • 24-layer vision transformer
    • 1540x1540 image resolution
    • Patch size: 14x14
    • 16 attention heads
  3. Audio Model (Voxtral):
    • 12-layer audio encoder
    • 16kHz sample rate support
    • 80 mel-frequency features
    • Real-time processing capability
  4. MoE Architecture:
    • 4-8 experts per layer
    • Top-3 expert selection
    • Load balancing with auxiliary loss
    • Efficient sparse computation (a minimal routing sketch follows below)
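
To make the routing concrete, here is a generic top-k MoE feed-forward layer in PyTorch. It is an illustrative sketch, not Multistral's actual implementation: the hidden/FFN sizes, module names, and the Switch-Transformer-style auxiliary loss are assumptions chosen to match the numbers quoted above (4 experts, top-3 selection).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative top-k routed MoE feed-forward block (not Multistral's actual code)."""

    def __init__(self, hidden_size=3072, ffn_size=8192, num_experts=4, top_k=3):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        tokens = x.reshape(-1, x.shape[-1])                       # (n_tokens, hidden)
        router_probs = F.softmax(self.gate(tokens), dim=-1)       # (n_tokens, n_experts)
        weights, chosen = router_probs.topk(self.top_k, dim=-1)   # top-3 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)     # renormalize kept weights

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = chosen[:, slot] == e                       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])

        # Load-balancing auxiliary loss (Switch-Transformer style): fraction of tokens
        # dispatched to each expert (top-1) times the mean router probability per expert.
        dispatch = F.one_hot(chosen[:, 0], num_classes=router_probs.shape[-1]).float().mean(dim=0)
        aux_loss = router_probs.shape[-1] * (dispatch * router_probs.mean(dim=0)).sum()
        return out.reshape_as(x), aux_loss

# Quick check on random activations
moe = SparseMoE()
y, aux = moe(torch.randn(2, 16, 3072))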

Configuration

from multistral import MultistralConfig

config = MultistralConfig(
    vocab_size=128000,
    num_experts=4,
    moe_top_k=3,
    text_config={
        "hidden_size": 3072,
        "num_hidden_layers": 12,
        "num_attention_heads": 32,
    },
    vision_config={
        "hidden_size": 1024,
        "num_hidden_layers": 24,
        "image_size": 1540,
    }
)
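
If the package follows the usual Hugging Face pattern (an assumption, not documented above), a config like this can be used to instantiate a model with freshly initialized weights:

from multistral import MultistralConfig, MultistralForConditionalGeneration

config = MultistralConfig(num_experts=4, moe_top_k=3)

# Assumes the Hugging Face convention of constructing a model directly from its
# config; this creates randomly initialized weights rather than loading a checkpoint.
model = MultistralForConditionalGeneration(config)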

Special Tokens

Multistral includes comprehensive special tokens for multimodal tasks:

from multistral import SpecialTokens

# Access special tokens
print(SpecialTokens.IMAGE.value)    # <|image|>
print(SpecialTokens.AUDIO.value)    # <|audio|>
print(SpecialTokens.VIDEO.value)    # <|video|>
print(SpecialTokens.BEGIN.value)    # <|begin|>
print(SpecialTokens.END.value)      # <|end|>
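
As an example of how these markers might be used, the snippet below splices them into a prompt string before tokenization. The exact prompt layout the model expects is an assumption here, not documented behavior:

from multistral import MultistralTokenizer, SpecialTokens

tokenizer = MultistralTokenizer.from_pretrained("path/to/tokenizer")

# Assumed layout: wrap the turn in <|begin|>/<|end|> and mark where image
# features should be injected with <|image|>.
prompt = (
    f"{SpecialTokens.BEGIN.value}"
    f"Describe this picture: {SpecialTokens.IMAGE.value}"
    f"{SpecialTokens.END.value}"
)
inputs = tokenizer(prompt, return_tensors="pt")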

🔧 Advanced Usage

Custom Training

from multistral import MultistralForConditionalGeneration
from torch.optim import AdamW

model = MultistralForConditionalGeneration.from_pretrained("path/to/model")

# Freeze all parameters except specific expert
target_expert = 0
for name, param in model.named_parameters():
    if f".experts.{target_expert}." in name:
        param.requires_grad = True
    else:
        param.requires_grad = False

# Setup optimizer over the unfrozen expert parameters
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-5)

# Training loop (assumes `dataloader` yields batches with input_ids, attention_mask, and labels)
model.train()
for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

📦 Installation Options

Core Installation

pip install multistral

With Training Support

pip install multistral[training]

With Audio Processing

pip install multistral[audio]

With Vision Processing

pip install multistral[vision]

Development Installation

pip install multistral[dev]

All Features

pip install multistral[all]

Memory Optimization

import torch
from multistral import MultistralForConditionalGeneration

# Load with memory optimization
model = MultistralForConditionalGeneration.from_pretrained(
    "path/to/model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    low_cpu_mem_usage=True
)

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

Parameter Information

import torch

# Pick a device and move the model (loaded as above) onto it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
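
To relate these totals to the per-component table below, you can also tally parameters per top-level submodule. This uses only generic PyTorch introspection, so no Multistral-specific attribute names are assumed:

# Walk the model's top-level children and report each one's parameter count.
for name, module in model.named_children():
    count = sum(p.numel() for p in module.parameters())
    print(f"{name}: {count / 1e9:.2f}B parameters")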

🎯 Model Performance

Component      Parameters  Memory (bfloat16)  Features
Text Model     ~3.0B       ~6.0GB             Language understanding/generation
Vision Model   ~0.8B       ~1.6GB             Image analysis
Audio Model    ~0.4B       ~0.8GB             Audio processing
Total          ~4.2B       ~8.4GB             Multimodal AI

Benchmarks

  • Text Generation: Competitive with Mistral-7B on common benchmarks
  • Vision Understanding: High accuracy on VQA and image captioning tasks
  • Audio Processing: Real-time speech recognition and understanding
  • Multimodal: Strong performance on combined text+vision+audio tasks

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Clone repository
git clone https://github.com/your-username/multistral.git
cd multistral

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest tests/

# Run linting
black multistral/
isort multistral/
flake8 multistral/

📄 License

This project is licensed under the ISC License - see the LICENSE file for details.

🙏 Acknowledgments

  • Mistral AI for the Ministral-3 architecture
  • Mistral AI for Pixtral vision components
  • Mistral AI for Voxtral audio processing
  • Hugging Face for the transformers library
  • PyTorch team for the deep learning framework

📞 Support

🗺️ Roadmap

  • v0.2.0: Enhanced multimodal fusion
  • v0.3.0: Streaming inference support
  • v0.4.0: Quantization and optimization
  • v0.5.0: Fine-tuning utilities
  • v1.0.0: Production-ready release

Built with ❤️ by the Multistral Team

Requirements

Requires Python: >=3.8

Versions (5)
0.0.7 2025-12-25
0.0.4 2025-12-24
0.0.3 2025-12-24
0.0.2 2025-12-21
0.0.1 2025-12-15