High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Fast Inference of MoE Models with CPU-GPU Orchestration