SkyPilot Versions

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

v0.5.0

2 months ago

SkyPilot v0.5.0: SkyServe, New Provisioner, LLMs, Kubernetes, and More Clouds

We are excited to release SkyPilot v0.5.0, which introduces a significant number of new features and enhancements, including:

  • SkyPilot Serving
  • New provisioner
  • LLM recipes for the latest open models and engines
  • Kubernetes support improvements
  • 4 new clouds (contributed by the cloud providers!)

and more!

Release Highlights

New Features

  • Multiple candidate resources: SkyPilot now supports multiple candidate resources for a single task (using multiple accelerators, any_of or ordered in resources), allowing users to significantly enlarge the resource pool and get higher availability.
  • New Provisioner: The provisioner has a new implementation that is 2x faster and more reliable on supported clouds. It supports launching clusters with more than 100 nodes, and dependency requirements for clouds are significantly reduced.
  • Disk Tier: Introducing the best disk tier, which lets you choose the best disk for performance and cost on any cloud. (#2434)
  • Allow 2x as many spot jobs to run concurrently
  • Mount storage back after cluster restart
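
To illustrate why multiple candidate resources enlarge the resource pool, here is a minimal Python sketch that models an ordered list of candidates as plain dictionaries and picks the first one reported as obtainable. The candidate values, the AVAILABLE table, and the pick_candidate helper are hypothetical stand-ins for illustration; this is not SkyPilot's API or its optimizer logic.

```python
from typing import Optional

# Hypothetical task resources: any listed candidate is acceptable,
# tried in the order given.
task_resources = {
    "ordered": [
        {"cloud": "aws", "accelerators": "A100:8"},
        {"cloud": "gcp", "accelerators": "A100:8"},
        {"cloud": "azure", "accelerators": "V100:8"},
    ]
}

# Stand-in for real capacity checks; the values are made up.
AVAILABLE = {
    ("aws", "A100:8"): False,
    ("gcp", "A100:8"): True,
    ("azure", "V100:8"): True,
}


def pick_candidate(resources: dict, available: dict) -> Optional[dict]:
    """Return the first candidate, in listed order, that is available."""
    for cand in resources["ordered"]:
        if available.get((cand["cloud"], cand["accelerators"]), False):
            return cand
    return None  # No candidate can be provisioned right now.


chosen = pick_candidate(task_resources, AVAILABLE)
print(chosen)  # {'cloud': 'gcp', 'accelerators': 'A100:8'}
```

With a single candidate (only the first entry), the same lookup would return None; each extra candidate adds another chance to find capacity.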

SkyServe

SkyServe is a serving system on top of SkyPilot that deploys and scales any HTTP service across one or more regions or clouds, with autoscaling, load balancing, and more.

  • Introducing SkyServe: deploy and scale your AI models across multiple regions or clouds. (#2458)
  • Autoscaler: Request-rate-based autoscaling policy. (#2868, #2878)
  • Autoscaler: Support scaling to 0 when there are no requests (#2938)
  • Rolling update: Support rolling update for existing services (#2935, #3057)

Other Enhancements

  • Environment variable support in services field (#3078)
  • Override task configurations with CLI arguments (#2979)
  • Logging improvements for replicas (#2924, #2949)
  • Smoke tests for SkyServe (#2911)
  • Documentation for SkyServe (#3022, #2794, #2864, #2894, #2922, #2989, #3182)
  • UX improvements for SkyServe (#2895, #2940, #2961, #3054, #3176, #3094)
  • Bug fixes and robustness improvements (#2811, #2822, #2860, #2995, #2983, #3058, #3075, #3226)

New LLM Recipes

  • Gemma: Serve your Gemma on any cloud (#3207, #3220)
  • SGLang: Speed up your LLM deployments with SGLang for 5x throughput on SkyServe (#3126, #3140, #3170, #3145)
  • Mixtral 8x7B: Serving and scaling the Mixtral 8x7B model in any region or cloud (#2857, #2888, #3017, #3067, #2882)
  • Mistral 7B: Official docs for hosting Mistral 7B from mistral.ai (#2615, #2856)
  • CodeLlama: Hosting CodeLlama model with SkyServe and accessing it with API, chat or VSCode (#3050, #3143)
  • LoRAX: efficient multi-lora LLM inference (#2883)
  • axolotl: a recent LLM tool for finetuning AI models, running on SkyPilot (#2784, #2789)
  • Tabby: Self-host coding assistant Tabby on SkyPilot (#2597, #3068)
  • vLLM: Serve with vLLM to expose OpenAI API for Vicuna and Mixtral (#2614, #2643, #2616, #2786, #2791, #2948, #3118)
  • TGI: Scale the inference engine TGI with SkyServe (#3121)

Kubernetes

Kubernetes support received a number of new features and enhancements.

  • Multi-node support for Kubernetes (#2609, #3019)
  • Open ports support for Kubernetes (#2588, #2713, #2997, #3200)
  • Support Coreweave label for GPUs in Kubernetes (Coreweave support under development) (#2650)
  • Starting a Kubernetes GPU cluster locally with sky local up (#2890)
  • Custom Image Support for Kubernetes Instances (#2729, #3019, #3210)
  • New provisioner for Kubernetes for better performance and robustness (#3019)
  • Supporting Kubernetes cluster launched with k3s and Rancher (#3148)

Other Enhancements

  • Support H100 80GB in Kubernetes (#2840)
  • Share SSH jump pod across users to reduce resource consumption (#2826)
  • Allow KUBECONFIG env var for config file specification (#3169)
  • Make Kubernetes cluster removal more robust (#3043)
  • Fix GPU labeller (#2636, #2653)
  • UX and robustness improvements (#2638, #2712, #2589, #2785, #2551, #2795, #2884, #2913)
  • Documentation improvements (#2595, #2705, #2957, #2991, #2997, #3119)

More Clouds

SkyPilot now supports 13 cloud providers, including 4 new provider-contributed clouds: VMware vSphere, RunPod, Fluidstack and Cudo Compute.

  • RunPod: RunPod is a specialized AI cloud, with additional capacity for high-end GPUs. (#2980, #3018)
  • Fluidstack: Fluidstack offers accessible GPUs for AI at low cost. (#3086, #3224)
  • Cudo Compute: A GPU cloud that provides low-cost GPUs powered by green energy. (#2975, #3224)
  • VMware vSphere: you can now bring your own vSphere cluster to SkyPilot. (docs) (#3000)

Clouds

AWS

New Features

  • New provisioner for AWS: >2x faster for multi-node provisioning and more reliable for cluster launching. (#1702, #2719, #2792)
  • Support for AWS Trainium accelerator (#2690)
  • Support null for proxy command to filter regions (#2756)
  • Support CUDA 12.1 with default image updates (#2788)
  • Job scheduling on Inferentia and Trainium (#2969, #2798)
  • Allow specifying security_group (#3133)

Enhancements

  • Make public / private subnet selection robust (#2867)
  • Avoid hanging for restarting an instance in STOPPING state (#2998)
  • Remove sunset instance types (#2610)
  • Add docs for custom VPC support (#2776)

Fixes

  • Fix conda installation on AWS default image (#3206)
  • Robustify the custom image support (#3216)
  • Fix subnet selection for AWS and autodown for spot instances (#2921)
  • Fix minimal permission for AWS (#2978)
  • Improve opening ports for AWS (#2716)
  • Autostop with new provisioner (#2719)

GCP

New Features

  • Security: Custom VPC support for GCP. (#2764, #2772, #2854, #2944)
  • Security: Support private IP with proxy jump on GCP. (#2819)
  • New provisioner: Adopted the new provisioner for GCP, with >2x faster and more robust provisioning (#2681, #2719, #2943)
  • Automatically use reserved instances from multiple reserved pools (#2836, #2681)
  • Support L4 accelerator for GCP (#2724)
  • Allow stopping spot clusters on GCP (#2877)

Enhancements

  • Allow stopping VM with local SSD (#2587)
  • Update default runtime version for TPU node (#2601, #2602)
  • Handle transient errors when launching GCP clusters (#2669)
  • Update GCSFuse version to 1.3.0 for GCS storage mount (#2887)
  • Set TPU VM the default option for TPU accelerators (#1758)
  • Ignore missing gcp credentials for latest gcloud and avoid duplicating credentials (#3028, #3172, #3234)

Fixes

  • Fix custom docker image support (#3218)
  • Fix minimal roles required for GCP (#2704)
  • Robustify the catalog fetching (#3141)
  • Fix ports on TPU VM and cluster launched before 0.4.0 (#2641)
  • Fix backward compatibility issue with GCP clusters (#2604)
  • Fix --disk-size for Custom Machine Images (#2718)
  • Update catalog fetcher with more options (#2562)
  • Assign GCP VMs with service account (#2972)
  • Fix machine image support (#3030, #3236)
  • Fix error handling for failed provisioning (#2852)
  • Leave out TPU v5 in catalog as it is not supported (#2656)
  • Fix GCP minimal permission (#2947, #2770, #2761)

Azure

Enhancements

  • Make port opening more robust (#2649, #2891, #3084)
  • Additional arguments for Azure catalog fetcher and support H100 (#2561, #2844, #2847)
  • Support CUDA 12.1 with default image updates (#2468)
  • Support spot instances on Azure (#2871)

Fixes

  • Fix custom docker image support (#3218)
  • UX: Fix Azure disk tier explicitly shown in resources str (#3064)
  • Fix status query for Azure (#3015)

SCP

  • Fix SCP error raised in sky check (#3038)

CLI & Core interfaces

New Features

  • Multi-node jobs fail fast for single node failure (#3081)
  • Add configurations for not uploading credentials (#2904)
  • Adding sky status --endpoints CLI (#3199)
  • Support more characters in cluster name (#3130)
  • Show all regions and more accurate price in sky show-gpus (#2583, #2892, #2933, #2946, #3083, #3149, #3113)
  • Allow inferring cloud from region or zone (#2632)
  • Add --commit and --version for sky CLI (#2720, #2731, #2733)

Enhancements

  • Robustify runtime initialization on remote cluster (#3132)
  • Better error message for YAML parsing (#3040)
  • Smarter GPU name completion (#3014)
  • Speed up retry until up by not doing exponential backoff (#2821)
  • Add schema validation for config (#2645)
  • Allow --disk-tier none override (#2906)
  • sky check improvement (#3174, #3212, #3160)
  • Better logging for CLIs (#2535, #2691, #2728, #3139, #3175)
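
To see why dropping exponential backoff speeds up retry until up, the arithmetic can be sketched in Python as below. The attempt count and the 10-second base delay are made-up numbers for illustration; the release notes do not state SkyPilot's actual retry intervals.

```python
def total_wait_seconds(attempts: int, base_delay: float, exponential: bool) -> float:
    """Total time spent waiting across `attempts` retries."""
    return sum(
        base_delay * (2 ** i if exponential else 1)
        for i in range(attempts)
    )


constant_wait = total_wait_seconds(5, 10.0, exponential=False)
backoff_wait = total_wait_seconds(5, 10.0, exponential=True)
print(constant_wait)  # 50.0
print(backoff_wait)   # 310.0
```

With a doubling delay, the waiting time grows geometrically with the number of attempts, so a fixed gap between retries brings a cluster up sooner once capacity appears.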

Fixes

  • Fix permission issues for SSH config file on specific linux distributions (#3151)
  • Fix sky_logs and mounting directory (#2667, #2845)
  • Fix job related commands (#2662, #2767)
  • Fix sky logs with --sync-down (#2660)

Deprecations

  • Deprecate cpunode/gpunode/tpunode, hide admin (#2800)
  • Remove deprecated Local cloud which is now replaced by Kubernetes support (#3037, #3186)

Backend/Provisioner

New Features

  • Support multiple candidate resources (#2498, #2803, #2833, #2886, #3107)
  • Support launching 100-node cluster for AWS, GCP, Kubernetes, and RunPod (#3004, #3005)
  • Support spaces in paths (#2762)
  • Support long local username with special characters (#3105, #3130)

Enhancements

  • Robustify termination of failed clusters during failover (#2990)
  • Improve the ssh check for clusters just provisioned (#2797)
  • Robustify failover to avoid terminating clusters that have user data (#2977)
  • Move ssh config to ~/.ssh/generated/ssh instead of directly editing ~/.ssh/config (#2706, #3069)
  • Code refactoring and cleanup (#2541, #2736, #3046, #2633, #2870, #2925, #3087, #3088, #3153)
  • Improve usage collection (#2654, #2672)
  • Better explanation of failover in docs (#2850, #2834)

Fixes

  • Avoid backward compatibility issue with provisioner (#2682)
  • Fix cloud provisioning internal file mount cache (#2715)
  • Fix optimization for DAG when some resources provided are not feasible (#2657)
  • Fix runtime installation on remote VM (#2909, #2912)
  • Fix cluster termination when the cluster is not fully UP (#3025)
  • Fixes for tests (#2651, #2976, #3023, #3166, #3167, #3202)
  • Improve logging (#2594, #2678, #2696, #3003)

Managed spot

New Features

  • Allow 2x as many spot jobs to run concurrently (#3191, #3208)

Enhancements

  • Better logging and UX (#2630)
  • Add docs for customizing spot controller (#2753)
  • Add spot pipeline docs (#2936)

Fixes

  • Fix private VPC support for spot jobs (#2874)
  • Fix ~/.sky/config.yaml for spot jobs (#2876)
  • Fix OOM for long running spot jobs (#2675)
  • Fix AWS NoCredentialError caused by credential rotation (#2695)
  • Fix Azure dependency on spot controller (#2875)

Storage

New Features

  • Mount storage back to clusters after restart (#2322, #2804)

Enhancements

  • Clarify the syntax for external and managed storage (#3162, #2804)
  • Confirmation prompt for sky storage delete, and --yes flag to skip it (#2726)
  • Refactor and clean up storage code (#2774, #2986)

Fixes

  • Fix permission issue for S3 mounting on specific images (#3215)
  • Fix spaces in source path for storages (#2835)

Dependencies

  • Recommend nightly build in docs for better performance and robustness (#2984)
  • Automatic build for nightly Docker image (#2229)
  • Avoid ray dependency locally for AWS, GCP, and Kubernetes (#2625, #2943, #3019)
  • Remove AWS dependency by default for better setup time and fewer conflicts (#2841, #2942)
  • Fix GCP dependency by updating google-api-python-client (#2577, #2759)
  • Pin remote dependency for ray job (#2659)
  • Robustify dependencies (#2642, #2679, #3024)

Examples

  • NeMo distributed training for BERT and GPT3 (#2533)
  • Add docker compose example to run multiple containers (#2745)
  • Distributed ray train example (#2828)
  • Benchmark Torch DDP (#2987)
  • Example updates for supported models (#2637, #2825)

Full Changelog: https://github.com/skypilot-org/skypilot/compare/v0.4.0...v0.5.0

Thanks to all contributors!

New contributors: @rtalaricw, @jackyk02, @Vaibhav2001, @rohanvaidya45, @Shrinandan, @manishiitg, @amitkumarj441, @tgaddair, @aseriesof-tubes, @changxiaohui, @thams, @kishb87, @PratikKumar125, @mmcclean, @dtran24, @davidwagnerkc, @mjibril, @kbrgl, @msehsah1, @JungleCatSW, @Ying1123

Many thanks to all contributors who contributed to this release!

Contributors: @Michaelvll, @concretevitamin, @cblmemo, @romilbhardwaj, @MaoZiming, @landscapepainter, @sunny0826, @suquark, @Vaibhav2001, @infwinston, @hemildesai, @asaiacai, @Shrinandan, @kishb87, @rtalaricw, @iojw, @aseriesof-tubes, @manishiitg, @jackyk02, @mmcclean, @thams, @amitkumarj441, @rohanvaidya45, @saihtaungkham, @tgaddair, @davidwagnerkc, @PratikKumar125, @dtran24, @changxiaohui, @mjibril, @kbrgl, @msehsah1, @JungleCatSW, @Ying1123

v0.4.1

6 months ago

This is a patch release to ship bug fixes faster to our users! This release includes many feature updates and bug fixes, including the new provisioner for AWS, fixing OOM and credential issues for long-running spot jobs, and some additional improvements.

Detailed changelog coming up in v0.5!

v0.4.0

7 months ago

SkyPilot v0.4.0: Kubernetes, native containers, ports and new clouds

We are excited to release SkyPilot v0.4.0, which brings a host of new features and improvements, including Kubernetes support, native container support, ability to open ports, and more.

Release Highlights

New Features

  • Kubernetes support: SkyPilot tasks and clusters can now run on Kubernetes clusters, including on-prem and cloud hosted deployments (GKE, EKS).
    • If you have a working kubeconfig, simply run sky check and sky launch --cloud kubernetes to run your task on Kubernetes.
    • If desired, tasks can also fail over to the cloud when the Kubernetes cluster does not have enough resources. The same SkyPilot YAMLs and CLI work seamlessly across Kubernetes and clouds.
  • Opening ports on clusters: Open ports on your clusters with the ports field. These ports are publicly accessible and can be used for hosting LLM inference endpoints, Jupyter notebooks, web servers, Tensorboard, and other services.
  • Native container support: If your task uses docker containers, SkyPilot's setup and run commands can now directly be executed in that container. This allows you to wrap your environment in a container and run it on any cloud with SkyPilot.
  • Reservation support: This release adds support for GCP reservations. SkyPilot will now prioritize using your reservations on the cloud to save costs and get higher availability.
  • New Managed Spot Features

New LLM Recipes

More Clouds

SkyPilot now supports 8 clouds, including community contributed support for two new clouds:

SkyPilot now also supports IBM COS buckets (#1966).

Core and UX Improvements

  • Faster failover: 30x faster failover with our new quota optimization which checks if quotas are available before launching a cluster (Supported on GCP, AWS).
  • Easily get VM IPs: The new --ip flag for sky status returns the public IP address of the cluster (e.g., sky status --ip mycluster). Use this to access services such as LLM inference endpoints, Jupyter notebooks and more.
  • Improved scriptability: SkyPilot YAMLs and CLI are more scriptable than ever - file_mounts can be dynamically defined with environment variables (docs, example), environment variables can be set through a dotenv file with the new --env-file flag (#2296).
  • Core optimizations: Multi-node clusters stop 4x faster (#2199), sky status updates for stopped clusters are 10x faster (#2288), and the job queue is more memory efficient (#1636).
  • Nightly releases: We now release nightly versions of SkyPilot. To get the cutting edge of SkyPilot without installing from source, run pip install skypilot-nightly (#1446)
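
The idea behind the faster failover above, checking quotas before attempting a launch, can be sketched in a few lines of Python. The region names, the quota table, and the regions_worth_trying helper are hypothetical; this shows the general filtering step only, not SkyPilot's code.

```python
from typing import Dict, List


def regions_worth_trying(candidates: List[str], quota: Dict[str, int]) -> List[str]:
    """Drop regions whose known quota is zero; keep regions with unknown quota."""
    return [r for r in candidates if quota.get(r, 1) > 0]


candidates = ["us-east-1", "us-west-2", "eu-west-1"]
quota = {"us-east-1": 0, "us-west-2": 8}  # eu-west-1 is unknown, so it is kept.
to_try = regions_worth_trying(candidates, quota)
print(to_try)  # ['us-west-2', 'eu-west-1']
```

Skipping a region with zero quota avoids a launch attempt that is certain to fail, which is where the time saving comes from.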

Deprecation

  • SkyPilot On-prem is now deprecated and Kubernetes will be the recommended mode of running SkyPilot on on-prem clusters.

Below is a detailed list of changes.

Managed Spot

New Features

  • Spot pipeline support: automatically handles a pipeline of spot jobs. (#1982)
  • Spot dashboard is now available with sky spot dashboard: you can now see all your spot jobs in GUI (#2103, #2136)
  • Spot callback - users can now run custom code when spot job status changes (#2106, #2364)
  • Resource configuration of the spot controller can now be customized (docs, #2040)

Enhancements

  • SkyPilot now shows the spot job's resources and estimated cost before confirmation (#2524)
  • Switch to eager failover recovery policy for better spot lifetime (#2234)
  • Reduce the logging for launching spot controller (#2056)

Fixes

  • We now show PENDING spot job in the spot queue before it starts (#2044)
  • Robustness fixes (#2102, #2153, #2119, #2004, #2330, #1998)

CLI & YAML interfaces

New Features

  • Users can now use environment variables to dynamically define file_mounts (docs, #2146)
  • sky status can now show the head IP of the cluster with -a or --ip flags (#2305, #2563)
  • sky down/stop/start defaults to a unique cluster if it exists and sky cancel without cluster cancels the latest task (#2325)
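
As a rough picture of environment variables in file_mounts, the Python sketch below substitutes ${VAR} references in a mapping of mount paths to sources. The ${VAR} syntax, the variable names, and the expand_file_mounts helper are assumptions made for this example; refer to the SkyPilot docs for the actual expansion rules.

```python
from string import Template


def expand_file_mounts(file_mounts: dict, env: dict) -> dict:
    """Substitute ${VAR} references in both mount paths and sources.

    Raises KeyError if a referenced variable is missing from `env`.
    """
    return {
        Template(dst).substitute(env): Template(src).substitute(env)
        for dst, src in file_mounts.items()
    }


# Hypothetical values for illustration.
file_mounts = {"/data/${DATASET}": "s3://${BUCKET}/${DATASET}"}
env = {"DATASET": "imagenet", "BUCKET": "my-bucket"}

expanded = expand_file_mounts(file_mounts, env)
print(expanded)  # {'/data/imagenet': 's3://my-bucket/imagenet'}
```

The same task definition can then point at a different dataset or bucket by changing only the environment, which is what makes the YAML more scriptable.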

Enhancements

  • sky check output is now friendlier with more hints for disabled clouds (#2002, #2017, #2196, #2114, #2221, #2377)
  • sky down progress bar now reflects clusters that failed to terminate (#1595, #2005)
  • We now fail early if rsync is not installed locally (#2168)
  • Better messages and hints for CLI (#2027, #2028, #2077, #2083, #2085)

Fixes

  • Fixed the order of VMs in optimizer table when --cpus is provided (#2037)
  • Better handling when sky launch is interrupted (#2206, #2252)

Backend

New Features

  • Users can now open ports for their clusters with the ports field (docs, #2210, #2477)
  • Docker support in image_id - tasks can now be run inside docker containers (docs, #1910)
  • Users can now clone a cluster from an existing cluster's disk with the --clone-disk-from flag (#2098)
  • Users can now launch their own ray cluster on a SkyPilot cluster (#2020)

Enhancements

  • 30x faster failover for AWS and GCP when quotas are not available (#1953, #2187, #2313)
  • Faster sky launch by caching cluster IP address (#2400)
  • Job queue is now more resource efficient, with significant memory consumption reduction on remote cluster (#1636)
  • Cluster names no longer map directly to cloud cluster names. Instead, they are mapped to a unique cluster name on the cloud. This helps with isolation across users sharing cloud accounts. (#2403)
  • More efficient and robust stopping/termination for AWS (#2121)
  • sky status --refresh for STOPPED cluster is 10x faster (#2079)
  • Empty YAML fields are now allowed (#1890)

Fixes

  • Manually started/stopped clusters are now better handled (#2130, #2203, #2389)
  • Fix edge case where existing clusters were terminated when resources are not available (#2170)
  • Fixes for disk_tier UX (#2156, #2215)
  • Robustness fixes (#2033, #2061, #2009, #2491, #2290, #1259, #2074, #2023, #2042)

Storage

New Features

  • IBM COS is now supported (#1966)
  • sky spot launch will now exclude files from .gitignore (#2018)

Enhancements

  • Deletion is now parallelized for faster deletion (#2058)
  • UX improvements for sky storage CLI (#2063, #2177)
  • GCS bucket mounting now uses gcsfuse v1.0.1 (#2470)

Fixes

  • Fix transient failures when uploading to GCS from macOS due to multiprocessing bug (#2125)
  • Robustness fixes (#2049, #2117, #2165, #2259, #2326, #2250)

Dependencies

  • Avoid buggy grpcio versions (#2055)
  • Pydantic is pinned to <2.0 (#2157)
  • PyYAML is pinned to >3.13, != 5.4.* to avoid issues with Cython 3 (#2256, #2514)
  • Ray <= 2.6.3 is supported on local machines (#2401)
  • pycryptodome, oauth2client are no longer required (#2515)

Clouds

AWS

  • H100 GPUs are now supported (#2323)
  • New docs for AWS cloud administrator about advanced login option (SSO and account switching) (#1888)
  • Insufficient permission is now handled gracefully (#2415, #2456)
  • Fixed a bug where existing AWS cluster would end up in INIT state after changing identity (#2442)
  • Fix fetching AZ when describe zones permission does not exist in all regions (#2463)

GCP

  • Nvidia L4 GPUs are now supported (#2212)
  • Machine Images are now supported (#2280)
  • GCP reservations are now supported (#2352)
  • SkyPilot optimizer is 4x faster for GCP instances (#2410)
  • GCP pricing is now dynamically fetched and is more robust (#2118, #2076, #2131)
  • Default image has been updated to Debian 11 (#2279)
  • New docs for minimal permission required by GCP account to use SkyPilot for administrator (#2100, #2112)
  • Robustness fixes (#2135, #2199, #2124, #1879, #2116)
  • TPU support is now more robust (#2310, #2471, #2350, #2540)

Azure

  • westus3 region is now supported (#2149)
  • Fix status refresh for Azure (#2120)
  • Fix Azure disk tier interruption for optimize progress (#2111)
  • Azure catalog fetching is more robust (#2115, #2553)

Lambda

  • Add H100 support for Lambda Cloud (#2010, #2323)
  • API rate limit is now handled with backoff and retry (#2265)
  • Errors are now more detailed (#2371)
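
Handling an API rate limit with backoff and retry generally looks like the Python sketch below. RateLimitError, call_with_backoff, and the delay values are hypothetical names and numbers for illustration; they are not taken from SkyPilot or the Lambda Cloud API.

```python
import time


class RateLimitError(Exception):
    """Stand-in for the error an API client raises when rate limited."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # Out of retries: surface the error to the caller.
            sleep(base_delay * 2 ** attempt)


# Demo: a call that is rate limited twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("rate limited")
    return "ok"


# Record delays instead of actually sleeping.
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the demo instant and makes the retry schedule easy to inspect.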

Oracle Cloud Infrastructure (OCI)

  • OCI is now supported (#1909, #2047, #2057, #2068, #2034, #2070, #2069, #2062, #2067, #2092, #2095, #2099)

Samsung Cloud Platform (SCP)

  • Samsung Cloud Platform (SCP) is now supported for single-node clusters (#1941, #2001, #2014)

Examples

  • New DeepSpeed example (#2208)
  • New Distributed Tensorflow example (#1721)
  • New DVC example (#2444)
  • Examples dependencies are now up to date (#2145, #2223, #2359)

Full changelog

Thanks to all contributors!

New contributors: @JGoo1, @tobi, @HysunHe, @blucz, @shethhriday29, @MaoZiming, @ksasi, @pushmatrix, @hzeng-0, @saihtaungkham, @fozziethebeat, @n10dollar, @asaiacai, @mtaku3, @gbmarc1, @alex000kim, @steve-marmalade, @xzrderek, @sunny0826.

Many thanks to all contributors who contributed to this release!

@Michaelvll, @concretevitamin, @romilbhardwaj, @cblmemo, @HysunHe, @landscapepainter, @shethhriday29, @infwinston, @alex000kim, @suquark, @sunny0826, @gbmarc1, @MaoZiming, @xzrderek, @tobi, @steve-marmalade, @saihtaungkham, @pushmatrix, @n10dollar, @mtaku3, @ksasi, @hzeng-0, @fozziethebeat, @blucz, @asaiacai, @WoosukKwon, @JGoo1, @mraheja, @iojw, @hemildesai, @ewzeng, @aviweit, @Saikrishna-Achalla, @Cohen-J-Omer

v0.3.3

9 months ago

This patch release brings many bug fixes and features, including new mechanics for stop/down, callbacks for spot jobs, and a critical dependency fix for PyYAML after the release of Cython 3.

Detailed changelog coming up in v0.4!

v0.3.2

10 months ago

This is a patch release to ship bug fixes faster to our users! This release includes many feature updates and bug fixes, including the pydantic dependency issue, disk cloning, file mounts, and cloud-specific improvements.

Detailed changelog coming up in v0.4!

v0.3.1

11 months ago

This is a patch release to ship several important enhancements and bug fixes:

Enhancements

  • On-demand H100 GPU from Lambda is supported! sky launch --gpus h100
    • To use it, remove any previous Lambda catalog: rm -rf ~/.sky/catalogs/v5/lambda
  • Managed spot: make job cancellation during failover more robust to mitigate a rare FAILED_SETUP error (#1998)

Fixes

  • Provisioner / Backend
    • Fix provision failover encountering FileNotFoundError (#2005)
    • Fix user-level ray cluster causing SkyPilot cluster to be in INIT state (#2020)
  • Logging
    • Fix certain logs of multi-node jobs not being streamed due to Ray 2.4 log dedup (#2026)
    • Fix logs being created in current pwd $PWD/~/sky_logs in some cases (#2009)
  • Managed spot
    • Fix sky spot launch --retry-until-up to make it actually retry until up (#2004)
  • Storage
    • Fix a rare storage cloud check error if sky check has never been called (#2017)
  • On-prem
    • Fix detecting A5000 and A6000 GPUs (#2023)

Full Changelog: https://github.com/skypilot-org/skypilot/compare/v0.3.0...v0.3.1

v0.3.0

11 months ago

SkyPilot v0.3.0: LLM Support, New Clouds, Enhanced Production-Readiness

We are excited to release SkyPilot v0.3, the most significant release thus far in the project's history.

v0.3 focuses on:

  • LLM support (Vicuna, LLaMA)
  • New clouds (Lambda Cloud; IBM; Cloudflare R2)
  • Enhanced production readiness

See the release blog post for a deep-dive into highlights.

Release notes below are as compared to v0.2 (full changelog).

Release Highlights

  • LLM support
    • Vicuna LLM chatbot trained using SkyPilot for $300 on spot instances!
    • Serve your own LLaMA LLM chatbot on any cloud: full example, blog, repo
    • Significantly expanded GPU availability by leveraging the widest selection of clouds (see below)
  • More clouds, more choices: delivering the highest GPU availability & cost savings
    • Lambda Cloud is now supported!
      • This brings high-end GPUs at lower costs to SkyPilot. (#1557, #1838)
      • Simply run sky check to set it up. Docs here.
    • IBM Cloud is now supported!
      • This brings the first hyperscaler cloud after AWS/GCP/Azure to SkyPilot. (#1598)
    • Cloudflare R2 object store is now supported!
      • This brings zero-egress cost object storage to SkyPilot. (#1736)
      • To use it, see setup docs and usage docs.
  • Managed Spot is made significantly more robust via a host of fixes/enhancements.
  • Cluster leakage prevention and detection are significantly improved.
  • CLI/API & Backend shipped many new features:
    • sky cost-report; fine-grained optimizer; user identity; AWS SSO; private IP-only VPCs; Ray runtime is decoupled from user's Ray clusters; ...

CLI/API

New Features

Enhancements

Fixes

Managed spot

New Features

  • Latest in-progress spot jobs are shown in sky status (#1270, #1467, #1691)
  • Detailed reasons for failed spot jobs are exposed in sky spot queue -a (#1655)

Enhancements

Fixes

TPU

Robustness is enhanced for TPUs in various modes: VMs, pods, spot (#1500, #1279, #1359, #1483, #1562, ...).

Provisioner

Enhancements

  • Cluster leakage prevention is significantly improved!
    • Skip Ray's launch hash check, which caused many leaks (#1671)
    • Launch existing cluster in the same zone to avoid leakage (#1700)
    • Existing cluster's cluster YAML will keep certain fields unchanged across re-launch (#1235, #1251)
    • Fix leakage of an existing cluster when it failed to start https://github.com/skypilot-org/skypilot/pull/1497
  • Disable unattended-upgrade (nondeterministic APT lock) on cluster start
    • Previously, apt install ... in setup may non-deterministically fail due to APT lock being held by background unattended upgrades
    • Now: for AWS, cloud-init ensures unattended-upgrade is disabled at boot (#1949, #1954); for other clouds, we kill the processes (#1347)
  • Generate valid cluster names when username has invalid characters https://github.com/skypilot-org/skypilot/pull/1526

Fixes

Storage

New Features

Enhancements

Fixes

Backend

New Features

  • New feature: Fine-grained optimizer
    • Optimizing & provisioning retries at the granularity of regions/zones https://github.com/skypilot-org/skypilot/pull/975
    • In other words, SkyPilot now automatically recognizes and optimizes across the cost differences between zones (e.g., AWS zones have different prices for the same spot instance type) or regions
  • New feature: User identity is associated with each cluster (#1513, #1550, #1809)
    • Identities are e.g., different AWS profiles / GCP projects
    • With this, users are free to switch across identities, and SkyPilot will properly protect each cluster
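
Optimizing at the granularity of zones, as the fine-grained optimizer does, can be pictured as ranking (region, zone) pairs by price and retrying in that order. The Python sketch below uses made-up prices and a hypothetical failover_order helper; it illustrates the idea and is not SkyPilot's optimizer.

```python
# Made-up hourly spot prices for one instance type, keyed by (region, zone).
prices = {
    ("us-east-1", "us-east-1a"): 1.20,
    ("us-east-1", "us-east-1b"): 0.95,
    ("us-west-2", "us-west-2a"): 1.05,
}


def failover_order(zone_prices: dict) -> list:
    """Return (region, zone) pairs from cheapest to most expensive."""
    return sorted(zone_prices, key=zone_prices.get)


order = failover_order(prices)
print(order[0])  # ('us-east-1', 'us-east-1b')
```

A region-level ranking would treat both us-east-1 zones as one entry and miss that us-east-1b is cheaper than us-west-2a while us-east-1a is not.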

Enhancements

Fixes

  • Fix SKY_NODE_RANK environment variable https://github.com/skypilot-org/skypilot/pull/1291
  • Make .ssh/config more robust (#1763, #1683)
  • Mitigate the "database is locked" problem (#1509, #1576)
  • Many other fixes (#1224, #1257, #1325, #1421, #1695, #1806, #1713, #1595, ...)

Cloud: AWS

New Features

Enhancements

Fixes

Cloud: GCP

Enhancements

Fixes

Cloud: Azure

Enhancements

Fixes

Catalog

New Features

  • AWS catalog is refreshed periodically via GitHub actions, so users get up-to-date prices (#1451)
    • On any call that uses the catalog, SkyPilot automatically pulls the latest catalog from the catalog repo if it detects the local copy is too old
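
The "too old" check for a local catalog copy could look like the Python sketch below, which compares a file's modification time against an age threshold. The 7-hour threshold, the file name, and the catalog_is_stale helper are assumptions for illustration; the release notes do not state how SkyPilot decides a catalog is outdated.

```python
import os
import tempfile
import time


def catalog_is_stale(path, max_age_hours=7.0, now=None):
    """True if the file is missing or was modified more than max_age_hours ago."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600
    return age_hours > max_age_hours


with tempfile.TemporaryDirectory() as tmp:
    catalog = os.path.join(tmp, "vms.csv")  # Hypothetical catalog file.
    missing = catalog_is_stale(catalog)     # No local copy yet.
    with open(catalog, "w") as f:
        f.write("instance_type,price\n")
    fresh = catalog_is_stale(catalog)       # Just written.
    eight_hours_later = os.path.getmtime(catalog) + 8 * 3600
    old = catalog_is_stale(catalog, now=eight_hours_later)

print(missing, fresh, old)  # True False True
```

A caller would pull the latest catalog whenever the check returns True and otherwise keep using the local copy.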

Enhancements

Fixes

  • Misc fixes (#1426, #1492, #1505, #1525, #1835, #1786, ...)

Thanks to all contributors!

New contributors: @dongreenberg, @turian, @scruel, @vivekkhimani, @stephenbalaban, @landscapepainter, @cblmemo, @Saikrishna-Achalla, @datlife, @Cohen-J-Omer (IBM Cloud support!), @zetavg.

Many thanks to all contributors who contributed to this release!

@Michaelvll, @concretevitamin, @romilbhardwaj, @infwinston, @ewzeng, @michaelzhiluo, @WoosukKwon, @iojw, @sumanthgenz, @landscapepainter, @suquark, @dongreenberg, @cblmemo, @mraheja, @vivekkhimani, @turian, @stephenbalaban, @scruel, @lhqing, @datlife, @Saikrishna-Achalla, @Cohen-J-Omer, @zetavg

v0.2.5

1 year ago

Another patch release to ship bug fixes faster to our users! This release includes many fixes, including those for managed spot and cloud specific improvements.

Detailed changelog coming up in v0.3!

v0.2.4

1 year ago

This patch release brings more bug fixes, including fixes for cloud-specific networking and VPC configuration and managed spot.

Detailed changelog coming up in v0.3!

v0.2.3

1 year ago

What's Changed

This is a patch release with lots of bug fixes across the board, including many cloud-specific networking and VPC fixes.

Stay tuned for a detailed changelog coming up in v0.3!