Scope for MVP:
Features:
- Local model storage
  - Unified storage location
  - Downloading models (asynchronously; sketched below)
  - Updating models
  - Deleting models (asynchronously)
  - Reporting available storage space
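
A minimal sketch of the asynchronous download path, assuming huggingface_hub for fetching and a dedicated IO thread pool so the event loop never blocks (pool size and MODEL_ROOT are illustrative):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from pathlib import Path

from huggingface_hub import snapshot_download

# Illustrative: dedicated pool for IO-bound work; size would come from config.
IO_POOL = ThreadPoolExecutor(max_workers=8, thread_name_prefix="io")
MODEL_ROOT = Path("/models")  # assumed unified storage location


async def download_model(repo_id: str, hf_token: str | None = None) -> Path:
    """Fetch a model into the unified location without blocking the event loop."""
    target = MODEL_ROOT / repo_id.replace("/", "--")
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(
        IO_POOL,
        partial(snapshot_download, repo_id=repo_id, local_dir=target, token=hf_token),
    )
    return target
```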
 
- Deploy vLLM Docker containers (sketched below)
  - Deploy instance (asynchronously)
  - Stop instance (asynchronously)
  - Instance status (via Docker health check)
  - Attach GPUs to instance
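
A sketch of the deploy call using the Docker SDK for Python; the image tag, mount paths, and port mapping are assumptions, and since docker-py is blocking, the call would itself run in the IO pool above:

```python
import docker
from docker.types import DeviceRequest

client = docker.from_env()  # talks to the host Docker daemon


def deploy_vllm(model_path: str, gpu_ids: list[str], port: int = 8000):
    """Launch a vLLM OpenAI-compatible server with specific GPUs attached."""
    return client.containers.run(
        "vllm/vllm-openai:latest",  # tag would come from the version dropdown
        command=["--model", model_path],
        detach=True,
        ports={"8000/tcp": port},
        volumes={"/models": {"bind": "/models", "mode": "ro"}},  # assumed model root
        device_requests=[DeviceRequest(device_ids=gpu_ids, capabilities=[["gpu"]])],
    )
```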
 
- vLLM container version control
  - Dropdown for selecting the vLLM container version
  - Checkbox to always use the most recent version
  - Nightly job to pull the list of available versions and populate the dropdown (sketched below)
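
The nightly tag refresh could query Docker Hub's public v2 tags endpoint for the vllm/vllm-openai repository; the URL shape and page size below are assumptions about that API:

```python
import httpx

TAGS_URL = "https://hub.docker.com/v2/repositories/vllm/vllm-openai/tags"


async def fetch_vllm_tags(limit: int = 100) -> list[str]:
    """Nightly job: list available vLLM image tags to populate the dropdown."""
    async with httpx.AsyncClient() as http:
        resp = await http.get(TAGS_URL, params={"page_size": limit})
        resp.raise_for_status()
        return [tag["name"] for tag in resp.json()["results"]]
```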
 
- Deploy OpenWebUI (sketched below)
  - Initial configuration
  - Stop (asynchronously)
  - Delete
  - Instance status (via Docker health check)
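
Deployment mirrors the vLLM sketch; the image, internal port, and environment variable below reflect OpenWebUI's published container but should be treated as assumptions:

```python
def deploy_openwebui(vllm_url: str, port: int = 3000):
    """Launch OpenWebUI pointed at a vLLM instance's OpenAI-compatible API."""
    return client.containers.run(  # `client` from the vLLM sketch above
        "ghcr.io/open-webui/open-webui:main",
        detach=True,
        ports={"8080/tcp": port},  # OpenWebUI listens on 8080 inside the container
        environment={"OPENAI_API_BASE_URL": vllm_url},
    )
```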
 
- GPU state monitoring (sketched below)
  - VRAM load
  - Utilization
  - Power consumption
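
A sketch of the GPU poller using NVML via the nvidia-ml-py bindings (the dict keys are illustrative); the calls are blocking, so the poller would run in a thread pool on the configured refresh interval:

```python
import pynvml  # nvidia-ml-py NVML bindings


def read_gpu_state() -> list[dict]:
    """Snapshot VRAM, utilization, and power draw for every NVIDIA GPU."""
    pynvml.nvmlInit()
    try:
        stats = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            stats.append({
                "gpu": i,
                "vram_used_mib": mem.used // 2**20,
                "vram_total_mib": mem.total // 2**20,
                "utilization_pct": util.gpu,
                "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000,  # mW -> W
            })
        return stats
    finally:
        pynvml.nvmlShutdown()
```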
 
- Async operations for IO-bound tasks (sketched below)
  - Container management
  - Image operations
  - Model downloads
  - Graceful thread pool shutdown
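
The general pattern, sketched with two dedicated executors (sizes are placeholders to be read from config): blocking work is pushed off the event loop, and shutdown waits for queued work to drain:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")

# Placeholder sizes; the real values come from host config.
io_pool = ThreadPoolExecutor(max_workers=16, thread_name_prefix="io")
cpu_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="cpu")


async def run_io(func: Callable[..., T], *args) -> T:
    """Run a blocking IO-bound callable (docker-py, downloads) off the event loop."""
    return await asyncio.get_running_loop().run_in_executor(io_pool, func, *args)


def shutdown_pools() -> None:
    """Graceful shutdown: let queued work finish before the process exits."""
    io_pool.shutdown(wait=True)
    cpu_pool.shutdown(wait=True)
```

Container and image operations would then be wrapped the same way, e.g. `await run_io(client.images.pull, "vllm/vllm-openai")`, keeping FastAPI handlers responsive.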
 
- Use Docker health checks for containers (sketched below)
- Unified local model location
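
Health checks can be attached at container creation and read back from the daemon. The check below assumes vLLM's GET /health endpoint and that curl exists in the image; durations are in nanoseconds, as docker-py expects:

```python
NANOS = 1_000_000_000  # docker-py expresses healthcheck durations in nanoseconds

vllm_healthcheck = {
    "test": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
    "interval": 30 * NANOS,
    "timeout": 5 * NANOS,
    "retries": 3,
}  # passed as healthcheck=... to client.containers.run


def container_health(container) -> str:
    """Return Docker's verdict: 'starting', 'healthy', or 'unhealthy'."""
    container.reload()  # refresh cached attrs from the daemon
    return container.attrs["State"]["Health"]["Status"]
```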
 
Config
LLM Specific
- Ability to set all vLLM launch arguments when deploying an instance, modeled with Pydantic (sketched below)
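
A sketch of the Pydantic representation, covering only an illustrative subset of vLLM's server arguments; the real model would mirror them all:

```python
from pydantic import BaseModel, Field


class VllmLaunchConfig(BaseModel):
    """Illustrative subset of vLLM launch arguments."""
    model: str
    tensor_parallel_size: int = 1
    gpu_memory_utilization: float = Field(0.9, ge=0.0, le=1.0)
    max_model_len: int | None = None
    dtype: str = "auto"

    def to_cli_args(self) -> list[str]:
        """Render set fields as --flag value pairs for the container command."""
        args: list[str] = []
        for name, value in self.model_dump(exclude_none=True).items():
            args += [f"--{name.replace('_', '-')}", str(value)]
        return args
```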
 
Host Specific
- Unified model storage location
- The port/interface the service is deployed on (interface defaulting to 0.0.0.0)
- NVIDIA Container Toolkit installed
- Refresh interval for GPU state
- Refresh interval for container state
- Hugging Face (HF) token
- Thread pool sizes for IO and CPU operations
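
A sketch of the host config as a Pydantic model loaded from YAML; every field name and default here is illustrative:

```python
from pathlib import Path

import yaml
from pydantic import BaseModel


class HostConfig(BaseModel):
    """Host-level settings; field names are illustrative."""
    storage_root: Path = Path("/models")
    bind_interface: str = "0.0.0.0"
    port: int = 8080
    gpu_refresh_seconds: int = 5
    container_refresh_seconds: int = 10
    hf_token: str | None = None
    io_pool_size: int = 16
    cpu_pool_size: int = 4


def load_host_config(path: str = "config.yml") -> HostConfig:
    with open(path) as f:
        return HostConfig(**yaml.safe_load(f))
```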
 
Operating Data
- Which assets are deployed via Docker
- Thread pool utilization metrics
 
Architecture
- Async operations with specialized thread pools
- FastAPI with asyncio for a non-blocking event loop
- Proper exception handling and propagation
- Graceful application shutdown (sketched below)
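
Graceful shutdown could hang off FastAPI's lifespan hook so the thread pools drain before the process exits (a sketch; `shutdown_pools` is the helper from the thread-pool sketch above):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: create the Docker client, thread pools, and background pollers here.
    yield
    # Shutdown: drain pools so in-flight downloads and deploys finish cleanly.
    shutdown_pools()


app = FastAPI(lifespan=lifespan)
```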
 
Stack
- Deploy InferAdmin itself in Docker
- InferAdmin interacts with the host Docker daemon to deploy containers for inference and interface services
- FastAPI backend (async mode)
- Vue.js + shadcn frontend
- YAML for config and data, Pydantic for representation (example below)
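
An example config.yml matching the HostConfig sketch above (keys and values illustrative):

```yaml
storage_root: /models
bind_interface: 0.0.0.0
port: 8080
gpu_refresh_seconds: 5
container_refresh_seconds: 10
hf_token: null
io_pool_size: 16
cpu_pool_size: 4
```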
 
Ideas
- Proxying inference requests in front of vLLM to route them to the correct model
- Placeholder /llms endpoint for this functionality (sketched below)
- Engines other than vLLM
- Multiple storage locations
- Analytics for vLLM instances, collected from vLLM's Prometheus metrics
- Enhanced logging system to replace print statements
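
A sketch of the placeholder routing layer: an OpenAI-style endpoint that looks up which vLLM instance serves the requested model and forwards the request (the route path and registry are hypothetical):

```python
import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()

# Hypothetical registry: model name -> base URL of the vLLM instance serving it.
MODEL_ROUTES: dict[str, str] = {"llama-3-8b": "http://localhost:8001"}


@app.post("/llms/v1/chat/completions")
async def proxy_chat(request: Request) -> Response:
    """Forward an OpenAI-style request to the vLLM instance hosting the model."""
    body = await request.json()
    upstream = MODEL_ROUTES[body["model"]]
    async with httpx.AsyncClient(timeout=120) as http:
        resp = await http.post(f"{upstream}/v1/chat/completions", json=body)
    return Response(content=resp.content, status_code=resp.status_code,
                    media_type="application/json")
```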
 
Initial assumptions
- NVIDIA GPUs
- All GPUs are the same type
- The interface listens on all network interfaces (binds to 0.0.0.0)