Scope for MVP:
Features:
- Local Model storage
  - Unified location
  - Downloading models (asynchronously)
  - Updating models
  - Deleting models (asynchronously)
  - Available storage reporting
- Deploy vLLM docker
  - Deploy instance (asynchronously)
  - Stop instance (asynchronously)
  - Status of instance (Docker health check)
  - Attach GPUs to instance
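Deployment with GPU attachment could be sketched as composing the `docker run` invocation, using the NVIDIA Container Toolkit's `--gpus` flag. `vllm/vllm-openai` is the upstream vLLM server image; the mount path, port mapping, and flag subset here are illustrative assumptions.

```python
from typing import Sequence

def build_vllm_run_cmd(
    name: str,
    host_model_dir: str,
    gpu_ids: Sequence[int],
    port: int = 8000,
    tag: str = "latest",
) -> list[str]:
    """Compose a docker run command for a vLLM instance (sketch)."""
    device_list = ",".join(str(i) for i in gpu_ids)
    return [
        "docker", "run", "-d",
        "--name", name,
        "--gpus", f"device={device_list}",   # attach specific GPUs
        "-p", f"{port}:8000",                # host port -> container port
        "-v", f"{host_model_dir}:/models",   # mount the unified model location
        f"vllm/vllm-openai:{tag}",
        "--model", "/models",
    ]
```

Running this through the Docker SDK (or a subprocess) inside a thread pool keeps the deploy endpoint non-blocking.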
- vLLM container version control
  - Dropdown for selecting vLLM container version
  - Checkbox for most recent version
  - Nightly job to pull available versions to populate the dropdown
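The nightly job's post-processing step could look like this: given the tag names pulled from the registry, keep only release tags such as `v0.6.3` and order them newest-first for the dropdown. The `vX.Y.Z` tag convention is an assumption based on vLLM's published images.

```python
import re

# Matches release tags like v0.6.3; pre-release and floating tags are skipped.
_TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")

def release_tags_newest_first(tags: list[str]) -> list[str]:
    """Filter registry tags to releases, sorted newest-first (sketch)."""
    versioned = []
    for tag in tags:
        m = _TAG_RE.match(tag)
        if m:
            versioned.append((tuple(int(x) for x in m.groups()), tag))
    return [tag for _, tag in sorted(versioned, reverse=True)]
```

The "most recent version" checkbox then simply selects the first element of this list (or the `latest` tag directly).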
- Deploy OpenWebUI
  - Initial config
  - Stop instance (asynchronously)
  - Delete instance
  - Status of instance (Docker health check)
- GPU state
  - VRAM load
  - Utilization
  - Power consumption
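One way to collect these three metrics is parsing `nvidia-smi` CSV output, e.g. `nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu,power.draw --format=csv,noheader,nounits`. Those query fields are real `nvidia-smi` options; the parsing and dictionary shape below are an illustrative sketch (a production version might use NVML bindings instead).

```python
def parse_gpu_state(csv_output: str) -> list[dict]:
    """Parse nvidia-smi CSV (noheader, nounits) into per-GPU dicts (sketch)."""
    gpus = []
    for line in csv_output.strip().splitlines():
        index, mem_used, mem_total, util, power = (f.strip() for f in line.split(","))
        gpus.append({
            "index": int(index),
            "vram_used_mib": int(mem_used),
            "vram_total_mib": int(mem_total),
            "utilization_pct": int(util),
            "power_w": float(power),
        })
    return gpus
```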
- Async operations for IO-bound tasks
  - Container management
  - Image operations
  - Model downloads
  - Graceful thread pool shutdown
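The specialized-pool pattern with graceful shutdown could be sketched as below. Pool sizes are illustrative; the config section makes them tunable.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class Pools:
    """Separate executors for IO-bound and CPU-bound work (sketch)."""

    def __init__(self, io_workers: int = 8, cpu_workers: int = 2):
        self.io = ThreadPoolExecutor(io_workers, thread_name_prefix="io")
        self.cpu = ThreadPoolExecutor(cpu_workers, thread_name_prefix="cpu")

    async def run_io(self, fn, *args):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.io, fn, *args)

    async def run_cpu(self, fn, *args):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.cpu, fn, *args)

    def shutdown(self):
        # wait=True lets in-flight tasks finish before the process exits.
        self.io.shutdown(wait=True)
        self.cpu.shutdown(wait=True)
```

Calling `shutdown()` from the application's shutdown hook gives the graceful behavior listed above.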
- Use health checks for containers
- Unified local model location
Config
LLM Specific
- Ability to set all vLLM args when launching an instance (Pydantic)
Host Specific
- Unified location of models
- The port/interface (default 0.0.0.0) the service is deployed on
- NVIDIA Container Toolkit installed
- Refresh interval for GPU state
- Refresh interval for container state
- HF token
- Thread pool sizes for IO and CPU operations
Operating Data
- What assets are deployed via docker
- Thread pool utilization metrics
Architecture
- Async operations with specialized thread pools
- FastAPI with asyncio for a non-blocking event loop
- Proper exception handling and propagation
- Graceful application shutdown
Stack
- Deploy InferAdmin in docker
- InferAdmin interacts with the host docker daemon to deploy containers for inference/interface
- FastAPI for backend (async mode)
- Frontend: Vue.js + shadcn
- YAML for config and data, Pydantic for representation
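The YAML-for-config, Pydantic-for-representation split could look like the sketch below (assuming PyYAML and Pydantic v2). The field names and defaults are illustrative assumptions, not a fixed schema; bad config fails at load time rather than mid-request.

```python
import yaml
from pydantic import BaseModel

class HostConfig(BaseModel):
    """Host-specific settings backed by a YAML file (sketch)."""
    models_dir: str = "/var/lib/inferadmin/models"
    bind_interface: str = "0.0.0.0"
    gpu_refresh_seconds: int = 5
    container_refresh_seconds: int = 10
    io_pool_size: int = 8
    cpu_pool_size: int = 2

def load_host_config(text: str) -> HostConfig:
    # safe_load returns None for an empty document; fall back to defaults.
    return HostConfig(**(yaml.safe_load(text) or {}))
```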
Ideas
- Proxying inference requests in front of vLLM to route to the correct model
  - Placeholder /llms endpoint for this functionality
- Engines other than vLLM
- Have multiple storage locations
- Add analytics for vLLM instances collected from vLLM's Prometheus metrics endpoint
- Enhanced logging system to replace print statements
Initial assumptions
- NVIDIA GPUs
- All GPUs are the same type
- The interface listens on all network interfaces (0.0.0.0)