Multistral 🚀
Multistral is a cutting-edge multimodal Mixture of Experts (MoE) language model that seamlessly combines text, vision, and audio processing capabilities. Built on top of state-of-the-art architectures including Ministral-3, Pixtral, and Voxtral, Multistral provides a unified interface for multimodal AI applications.
🌟 Features
- 🔤 Text Processing: Advanced language understanding and generation based on Ministral-3
- 👁️ Vision Understanding: Image analysis and visual reasoning with Pixtral integration
- 🎵 Audio Processing: Speech and audio understanding through Voxtral components
- 🧠 Mixture of Experts: Efficient scaling with sparse MoE architecture
- 🔧 Easy Installation: Simple pip install with comprehensive API
- 📝 Custom Tokenizer: Specialized tokenizer with 255 special tokens for multimodal tasks
- ⚡ Optimized Performance: Support for bfloat16, CUDA, and efficient inference
🚀 Quick Start
Installation
# Basic installation
pip install multistral
# With all optional dependencies
pip install multistral[all]
# Development installation
git clone https://github.com/your-username/multistral.git
cd multistral
pip install -e ".[dev]"
Basic Usage
from multistral import MultistralForConditionalGeneration, MultistralTokenizer
# Load model and tokenizer
model = MultistralForConditionalGeneration.from_pretrained("path/to/model")
tokenizer = MultistralTokenizer.from_pretrained("path/to/tokenizer")
# Text generation
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
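Note that max_length caps the prompt plus the continuation, so a long prompt can leave little room to generate. Assuming Multistral's generate mirrors the transformers API (an assumption, not verified against this package), max_new_tokens bounds only the newly generated tokens:
# Generate up to 50 new tokens regardless of prompt length
outputs = model.generate(**inputs, max_new_tokens=50)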
Multimodal Usage
import torch
from PIL import Image
from multistral import MultistralForConditionalGeneration, MultistralTokenizer
# Load model
model = MultistralForConditionalGeneration.from_pretrained("path/to/model")
tokenizer = MultistralTokenizer.from_pretrained("path/to/tokenizer")
# Text + Image
image = Image.open("image.jpg")
text = "What do you see in this image?"
# Process multimodal input
inputs = tokenizer(text, return_tensors="pt")
# Add image processing here (implementation depends on your specific setup)
# Generate response
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
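The image-processing step above is intentionally left open. As one possible shape for it, the sketch below resizes the image to the vision tower's 1540x1540 input and passes it as pixel_values; both the preprocessing and the pixel_values keyword are assumptions modeled on other Hugging Face multimodal models, not Multistral's documented API:
import torchvision.transforms as T
# Hypothetical preprocessing: match the Pixtral input resolution and tensorize
transform = T.Compose([T.Resize((1540, 1540)), T.ToTensor()])
pixel_values = transform(image).unsqueeze(0)  # (1, 3, 1540, 1540)
# Hypothetical call signature mirroring other multimodal models
outputs = model.generate(**inputs, pixel_values=pixel_values, max_length=100)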
📖 Documentation
Model Architecture
Multistral combines three modality-specific models under a shared Mixture of Experts routing layer:
- Text Model (Ministral-3):
  - 3B-parameter language model
  - 32 attention heads, 8 key-value heads
  - 262K context length support
  - RoPE positional encoding
- Vision Model (Pixtral):
  - 24-layer vision transformer
  - 1540x1540 image resolution
  - Patch size: 14x14
  - 16 attention heads
- Audio Model (Voxtral):
  - 12-layer audio encoder
  - 16kHz sample rate support
  - 80 mel-frequency features
  - Real-time processing capability
- MoE Architecture (see the routing sketch after this list):
  - 4-8 experts per layer
  - Top-3 expert selection
  - Load balancing with auxiliary loss
  - Efficient sparse computation
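To make the routing concrete, here is a minimal PyTorch sketch of top-k expert selection with a load-balancing auxiliary loss in the Switch Transformer style. It is illustrative only; TopKRouter is a hypothetical name and does not correspond to Multistral's internal modules:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Scores experts per token and keeps the top k (illustrative sketch)."""
    def __init__(self, hidden_size: int, num_experts: int = 4, top_k: int = 3):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.num_experts, self.top_k = num_experts, top_k

    def forward(self, hidden_states):  # (batch, seq, hidden)
        probs = F.softmax(self.gate(hidden_states), dim=-1)       # (b, s, E)
        top_probs, top_idx = probs.topk(self.top_k, dim=-1)       # (b, s, k)
        top_probs = top_probs / top_probs.sum(-1, keepdim=True)   # renormalize weights
        # Load-balancing loss: fraction of tokens routed to each expert
        # times that expert's mean gate probability, summed over experts.
        routed = F.one_hot(top_idx, self.num_experts).float().sum(dim=2)  # (b, s, E)
        fraction = routed.mean(dim=(0, 1)) / self.top_k
        mean_prob = probs.mean(dim=(0, 1))
        aux_loss = self.num_experts * (fraction * mean_prob).sum()
        return top_probs, top_idx, aux_loss
Each token's output is then the weighted sum of its k selected experts' outputs, and aux_loss is added to the language-modeling loss to keep expert usage even.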
Configuration
from multistral import MultistralConfig
config = MultistralConfig(
vocab_size=128000,
num_experts=4,
moe_top_k=3,
text_config={
"hidden_size": 3072,
"num_hidden_layers": 12,
"num_attention_heads": 32,
},
vision_config={
"hidden_size": 1024,
"num_hidden_layers": 24,
"image_size": 1540,
}
)
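Assuming the model class accepts a config object in the usual Hugging Face style (an assumption; check the package docs for the supported constructor), a randomly initialized model can be built from it directly:
from multistral import MultistralForConditionalGeneration
# Build an untrained model from the config (hypothetical constructor usage)
model = MultistralForConditionalGeneration(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")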
Special Tokens
Multistral includes comprehensive special tokens for multimodal tasks:
from multistral import SpecialTokens
# Access special tokens
print(SpecialTokens.IMAGE.value) # <|image|>
print(SpecialTokens.AUDIO.value) # <|audio|>
print(SpecialTokens.VIDEO.value) # <|video|>
print(SpecialTokens.BEGIN.value) # <|begin|>
print(SpecialTokens.END.value) # <|end|>
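Because these are plain strings, they can be spliced into prompts so the tokenizer maps each placeholder to its reserved ID. The exact prompt format Multistral expects is an assumption here; this only shows the mechanics:
# Mark where image features should be attended to (hypothetical prompt layout)
prompt = f"{SpecialTokens.BEGIN.value}{SpecialTokens.IMAGE.value} What do you see?{SpecialTokens.END.value}"
inputs = tokenizer(prompt, return_tensors="pt")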
🔧 Advanced Usage
Custom Training
from multistral import MultistralForConditionalGeneration
from torch.optim import AdamW  # transformers' AdamW is deprecated/removed; use torch's
model = MultistralForConditionalGeneration.from_pretrained("path/to/model")
# Freeze all parameters except those belonging to one target expert
target_expert = 0
for name, param in model.named_parameters():
    if f".experts.{target_expert}." in name:
        param.requires_grad = True
    else:
        param.requires_grad = False
# Optimize only the unfrozen parameters
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-5)
# Training loop (each batch must include labels so outputs.loss is populated)
for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
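Before a long run, it is worth confirming the freeze actually took effect; the ".experts.N." substring above is assumed to match the real module paths:
# List what remains trainable after freezing
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")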
📦 Installation Options
Core Installation
pip install multistral
With Training Support
pip install multistral[training]
With Audio Processing
pip install multistral[audio]
With Vision Processing
pip install multistral[vision]
Development Installation
pip install multistral[dev]
All Features
pip install multistral[all]
Memory Optimization
import torch
from multistral import MultistralForConditionalGeneration
# Load with memory optimization (device_map="auto" requires the accelerate package)
model = MultistralForConditionalGeneration.from_pretrained(
    "path/to/model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
# Enable gradient checkpointing
model.gradient_checkpointing_enable()
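A quick sanity check that the weights really loaded in reduced precision, and where they landed:
# Inspect the dtype and device of the first parameter tensor
param = next(model.parameters())
print(param.dtype, param.device)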
Parameter Information
import torch
# Resolve the compute device (use it when placing the model and inputs)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
🎯 Model Performance
| Component | Parameters | Memory (bfloat16) | Features |
|---|---|---|---|
| Text Model | ~3.0B | ~6.0GB | Language understanding/generation |
| Vision Model | ~0.8B | ~1.6GB | Image analysis |
| Audio Model | ~0.4B | ~0.8GB | Audio processing |
| Total | ~4.2B | ~8.4GB | Multimodal AI |
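The memory column is just the weight footprint at two bytes per bfloat16 parameter; activations, KV cache, and any optimizer state come on top of it:
# Weights-only footprint: parameters x 2 bytes (bfloat16)
params = 4.2e9
print(f"{params * 2 / 1e9:.1f} GB")  # 8.4 GB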
Benchmarks
- Text Generation: Competitive with Mistral-7B on common benchmarks
- Vision Understanding: High accuracy on VQA and image captioning tasks
- Audio Processing: Real-time speech recognition and understanding
- Multimodal: Strong performance on combined text+vision+audio tasks
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
Development Setup
# Clone repository
git clone https://github.com/your-username/multistral.git
cd multistral
# Install development dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
# Run tests
pytest tests/
# Run linting
black multistral/
isort multistral/
flake8 multistral/
📄 License
This project is licensed under the ISC License - see the LICENSE file for details.
🙏 Acknowledgments
- Mistral AI for the Ministral-3 architecture
- Mistral AI for Pixtral vision components
- Mistral AI for Voxtral audio processing
- Hugging Face for the transformers library
- PyTorch team for the deep learning framework
📞 Support
- 📧 Email: contact@multistral.ai
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📖 Documentation: https://multistral.readthedocs.io
🗺️ Roadmap
- v0.2.0: Enhanced multimodal fusion
- v0.3.0: Streaming inference support
- v0.4.0: Quantization and optimization
- v0.5.0: Fine-tuning utilities
- v1.0.0: Production-ready release
Built with ❤️ by the Multistral Team