
Requirements

Host System Requirements

Compute Resource Requirements

Hardware Requirements

Recommended Configuration

For optimal performance, we recommend:

  • GPU: NVIDIA GPU with at least 12GB VRAM (16GB+ preferred)
  • RAM: 16GB minimum (24GB recommended)
  • CPU: 4+ cores, modern x86-64 architecture
  • Storage: at least 50GB free for model files and the application
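The minimums in this list can be turned into a pre-flight check. The following is a minimal sketch, not part of the application: the function names `check_host` and `probe_host` are hypothetical, the RAM probe relies on `os.sysconf` (available on Linux), and the VRAM figure is expected to be supplied by the caller.

```python
import os
import shutil

# Thresholds copied from the recommended configuration list.
MIN_VRAM_GB = 12
MIN_RAM_GB = 16
MIN_CPU_CORES = 4
MIN_FREE_DISK_GB = 50


def check_host(ram_gb, cpu_cores, free_disk_gb, vram_gb=None):
    """Return a list of problems; an empty list means every minimum is met."""
    problems = []
    if vram_gb is None:
        problems.append("GPU: none detected (CPU-only fallback, not for production)")
    elif vram_gb < MIN_VRAM_GB:
        problems.append(f"GPU: {vram_gb:.0f}GB VRAM is below the {MIN_VRAM_GB}GB minimum")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.0f}GB is below the {MIN_RAM_GB}GB minimum")
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"CPU: {cpu_cores} cores is below the {MIN_CPU_CORES}-core minimum")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"Storage: {free_disk_gb:.0f}GB free is below the {MIN_FREE_DISK_GB}GB minimum")
    return problems


def probe_host(path="."):
    """Collect RAM, core count and free disk space with the standard library (Linux)."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return {
        "ram_gb": ram_bytes / 1024**3,
        "cpu_cores": os.cpu_count() or 0,
        "free_disk_gb": shutil.disk_usage(path).free / 1024**3,
    }
```

A call such as `check_host(**probe_host(), vram_gb=16)` returns the list of unmet minimums. Operating systems usually report slightly less RAM than the nominal figure, so a small tolerance on the RAM threshold may be appropriate.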

GPU Support

The application is optimized for NVIDIA GPUs and automatically uses:

  • CUDA for GPU acceleration
  • 4-bit quantization (NF4 format) to reduce memory requirements
  • BFloat16 precision when supported by the GPU
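To illustrate why 4-bit quantization matters for the VRAM figures above, here is a rough back-of-the-envelope estimate of weight memory at different precisions. The 7-billion-parameter model size is a hypothetical example, not the application's actual model, and the estimate ignores quantization metadata, activations, and the KV cache.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
params = 7_000_000_000

bf16_gb = weight_memory_gb(params, 16)  # BFloat16: 2 bytes per parameter
nf4_gb = weight_memory_gb(params, 4)    # NF4: half a byte per parameter

print(f"BFloat16 weights: {bf16_gb:.1f} GB")
print(f"NF4 weights:      {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take a quarter of the space of the BFloat16 weights, which is what lets a model of this size fit on a 12GB card with room left for activations.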

When multiple GPUs are available, the application currently uses only the first GPU. Multi-GPU support may be added in a future update.
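On a multi-GPU host, which device counts as "first" can usually be controlled with the standard `CUDA_VISIBLE_DEVICES` environment variable. The sketch below assumes the application follows normal CUDA device enumeration, which this page does not confirm; `env_for_gpu`, `launch`, and the command passed to it are hypothetical.

```python
import os
import subprocess


def env_for_gpu(physical_index, base_env=None):
    """Return a copy of the environment that exposes a single physical GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA renumbers the visible devices from 0, so the chosen GPU
    # becomes the "first" GPU from the process's point of view.
    env["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return env


def launch(command, physical_index):
    """Start a command (hypothetical application launcher) pinned to one GPU."""
    return subprocess.run(command, env=env_for_gpu(physical_index), check=True)
```

For example, `launch(["./start-app"], 1)` would run a placeholder launcher with only physical GPU 1 visible.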

CPU-Only Operation (Not For Production)

CPU-only operation should not be treated as a functional alternative to GPU deployment. If no GPU is available, the application automatically falls back to CPU-only mode with:

  • Adaptive memory usage based on available system RAM (minimum 16GB)
  • 4-bit quantization to reduce memory requirements
  • Significantly slower inference times compared to GPU deployment

Note: CPU-only deployment is not recommended for production environments due to severely limited performance.

Performance Expectations

Hardware Configuration    Approximate Inference Time    Max Tokens/Second
High-end GPU (24GB+)      0.5-2 seconds                 15-30
Mid-range GPU (8-16GB)    2-5 seconds                   5-15
CPU-only (32GB RAM)       60+ seconds                   0.5-2

Inference times are approximate and vary based on prompt length, generation parameters, and hardware specifics.
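Because these figures depend on the prompt and the hardware, it can be worth measuring them on your own deployment. The following is a minimal timing sketch: `measure` is a hypothetical helper, and `generate` stands in for whatever function calls the application and returns the number of tokens it produced.

```python
import time


def measure(generate, prompt):
    """Time one generation call and derive tokens per second.

    `generate` is a placeholder callable that takes a prompt and
    returns the number of tokens generated.
    """
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    rate = n_tokens / elapsed if elapsed > 0 else float("inf")
    return {"seconds": elapsed, "tokens": n_tokens, "tokens_per_second": rate}


def fake_generate(prompt):
    """Stand-in for a real request; sleeps briefly and reports 20 tokens."""
    time.sleep(0.01)
    return 20


result = measure(fake_generate, "Hello")
print(f"{result['tokens']} tokens in {result['seconds']:.3f}s "
      f"({result['tokens_per_second']:.1f} tokens/s)")
```

Running the measurement over several representative prompts and averaging gives a more stable figure than a single call.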

Monitoring Recommendations

We recommend monitoring:

  • GPU memory usage (keep below 90% to avoid out-of-memory errors)
  • System RAM usage
  • API response times
  • CPU utilization (especially in CPU-only mode)
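The GPU memory check can be scripted against `nvidia-smi`. This is a minimal sketch under the assumption that `nvidia-smi` is on the PATH; the helper names are hypothetical, and the 90% threshold comes from the list above.

```python
import subprocess

# Prints one "used, total" line per GPU, in MiB, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Convert nvidia-smi CSV output into a list of used/total fractions."""
    fractions = []
    for line in text.strip().splitlines():
        used, total = (float(field) for field in line.split(","))
        fractions.append(used / total)
    return fractions


def gpus_over_threshold(text, threshold=0.90):
    """Return the indices of GPUs whose memory usage is at or above the threshold."""
    return [i for i, frac in enumerate(parse_memory_csv(text)) if frac >= threshold]


def read_gpu_memory():
    """Run nvidia-smi and return its raw CSV output."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

Calling `gpus_over_threshold(read_gpu_memory())` from a periodic job returns the GPUs that need attention. Keeping the parsing separate from the subprocess call makes it possible to test the threshold logic on a machine without a GPU.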
