Requirements
Host System Requirements
- Docker Engine must be installed.
- The host system must include an NVIDIA GPU with a driver recent enough to support CUDA >= 12.8. Per NVIDIA's documentation, this currently means driver >= 570.124.06 on Linux or >= 572.61 on Windows.
- The NVIDIA Container Toolkit must be installed on Linux host environments.
- Windows is not recommended as a host environment; for testing purposes, see Support for GPU-enabled Containers on Windows.
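The driver minimums above can be checked programmatically. A minimal sketch in Python, comparing a driver version string against the CUDA 12.8 thresholds listed above (how you obtain the version string is an assumption; on a live host it would typically come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`):

```python
# Minimum NVIDIA driver versions for CUDA >= 12.8, per the requirements above.
MIN_DRIVER = {"linux": (570, 124, 6), "windows": (572, 61)}

def driver_supported(version: str, platform: str) -> bool:
    """Return True if the driver version string meets the CUDA 12.8 minimum."""
    parsed = tuple(int(part) for part in version.split("."))
    return parsed >= MIN_DRIVER[platform]
```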
Compute Resource Requirements
Hardware Requirements
Recommended Configuration
For optimal performance, we recommend:
- GPU: NVIDIA GPU with at least 12GB VRAM (16GB+ preferred)
- RAM: 16GB minimum (24GB recommended)
- CPU: 4+ cores, modern x86-64 architecture
- Storage: Minimum 50GB available for model files and application
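The thresholds above can be enforced with a small preflight check before deployment. A sketch (the numbers mirror the recommended configuration; the function name and how you measure the actual figures on a given host are assumptions):

```python
def preflight_warnings(vram_gb: float, ram_gb: float,
                       cpu_cores: int, free_disk_gb: float) -> list[str]:
    """Compare measured resources against the recommended configuration above."""
    warnings = []
    if vram_gb < 12:
        warnings.append(f"GPU VRAM {vram_gb} GB below recommended 12 GB (16 GB+ preferred)")
    if ram_gb < 16:
        warnings.append(f"RAM {ram_gb} GB below 16 GB minimum (24 GB recommended)")
    if cpu_cores < 4:
        warnings.append(f"{cpu_cores} CPU cores below recommended 4+")
    if free_disk_gb < 50:
        warnings.append(f"{free_disk_gb} GB free below 50 GB minimum for model files")
    return warnings
```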
GPU Support
The application is optimized for NVIDIA GPUs and automatically uses:
- CUDA for GPU acceleration
- 4-bit quantization (NF4 format) to reduce memory requirements
- BFloat16 precision when supported by the GPU
When multiple GPUs are available, the application will currently use only the first GPU. Future updates may enable multi-GPU support.
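Since only the first GPU is used, multi-GPU hosts can avoid surprises by pinning the process to GPU 0 before the ML framework initializes. Restricting visibility via `CUDA_VISIBLE_DEVICES` is standard CUDA behavior; doing it at application startup is an assumption about this application's entrypoint:

```python
import os

# CUDA reads CUDA_VISIBLE_DEVICES at initialization time, so this must run
# before the ML framework is imported. "0" exposes only the first GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```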
CPU-Only Operation (Not For Production)
CPU-only operation should not be considered a functional alternative to GPU deployment. If no GPU is available, the application automatically falls back to CPU-only mode with:
- Adaptive memory usage based on available system RAM (minimum 16GB)
- 4-bit quantization to reduce memory requirements
- Significantly slower inference times compared to GPU deployment
Note: CPU-only deployment is not recommended for production environments due to severely limited performance.
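The GPU-first selection with CPU fallback described above reduces to a small decision function. A pure-Python sketch (the flag names, the returned dict shape, and the CPU dtype are assumptions; the real application derives these from its ML framework):

```python
def select_runtime(cuda_available: bool, bf16_supported: bool, ram_gb: float) -> dict:
    """Choose device, precision, and quantization per the fallback rules above."""
    if cuda_available:
        return {
            "device": "cuda:0",  # only the first GPU is used
            "dtype": "bfloat16" if bf16_supported else "float16",
            "quantization": "nf4",  # 4-bit NF4 to reduce memory requirements
        }
    if ram_gb < 16:
        raise RuntimeError("CPU-only mode requires at least 16 GB of system RAM")
    return {"device": "cpu", "dtype": "float32", "quantization": "nf4"}
```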
Performance Expectations
| Hardware Configuration | Approximate Inference Time | Max Tokens/Second |
|---|---|---|
| High-end GPU (24GB+) | 0.5-2 seconds | 15-30 |
| Mid-range GPU (8-16GB) | 2-5 seconds | 5-15 |
| CPU-only (32GB RAM) | 60+ seconds | 0.5-2 |
Inference times are approximate and vary based on prompt length, generation parameters, and hardware specifics.
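The throughput column translates into a rough latency estimate: wall-clock time ≈ time to first token + tokens generated / tokens per second. A sketch of that arithmetic (the additive model is a simplification, not a documented formula for this application):

```python
def estimated_seconds(new_tokens: int, tokens_per_second: float,
                      first_token_latency: float = 0.0) -> float:
    """Rough wall-clock estimate for generating `new_tokens` tokens."""
    return first_token_latency + new_tokens / tokens_per_second

# e.g. 256 new tokens on a mid-range GPU at ~10 tok/s, ~2 s to first token
midrange_estimate = estimated_seconds(256, 10, first_token_latency=2.0)
```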
Monitoring Recommendations
We recommend monitoring:
- GPU memory usage (should stay below 90% to avoid OOM errors)
- System RAM usage
- API response times
- CPU utilization (especially in CPU-only mode)
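The 90% GPU-memory threshold can be monitored by parsing the output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader` (standard nvidia-smi query options). A sketch of the parsing and threshold logic; the function name and alert format are illustrative:

```python
def gpu_memory_alerts(nvidia_smi_csv: str, threshold: float = 0.90) -> list[str]:
    """Flag GPUs whose memory usage exceeds the threshold (to avoid OOM errors)."""
    alerts = []
    for index, line in enumerate(nvidia_smi_csv.strip().splitlines()):
        used_str, total_str = line.split(",")
        used = float(used_str.strip().split()[0])   # e.g. "11500 MiB" -> 11500.0
        total = float(total_str.strip().split()[0])
        if used / total > threshold:
            alerts.append(f"GPU {index}: {used / total:.0%} of VRAM in use")
    return alerts
```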