Pai Versions

Resource scheduling and cluster management for AI

v1.8.0

2 years ago

Release v1.8.0

New Features

  • Marketplace related update

  • Alert manager

    • Send alerts to users when job status changes #5337
  • Webportal

    • Support UX of Job Priority #5417
  • Others

    • Customizable Autoscaler #5412
    • Add custom ssl port support #5386
    • Clean up repo. Remove obsolete code #5489

v1.7.0

3 years ago

Release v1.7.0

New Features

  • Marketplace related update

  • New job submission page

    • Please refer to the new submission tutorial for how to use the new submission page.
    • The new submission page replaces Advanced with More info and places it under each section to improve user experience.
    • In the new submission page, the sidebar can be shrunk to give the main area more visual space.
    • The new submission page moves the yaml editor onto its own page, which lets users focus on either setting config or editing the yaml protocol.
    • The new submission page improves the responsive design at small and medium resolutions.

    Known Issue: The Tensorboard tool is not implemented in the new submission page yet. If you need it, please use the old version.

  • Alert system enhancement

    • Add alert & auto-fix for GPU perf issue #5342 #5383
    • Refine kill-low-efficiency-job-alert email templates #5384
    • Add alert for API server cert expired #5334
  • Support sort by completionTime for get job list API #5347

  • Deployment
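
As a rough illustration of the completionTime sort listed above (#5347), the sketch below builds a job-list request URL. The `/api/v2/jobs` path and the `order=<field>,<ASC|DESC>` query format are assumptions here, not details stated in these notes, so check the REST API reference for your cluster before relying on them. The helper name `build_job_list_url` and the host name are hypothetical.

```python
from urllib.parse import urlencode


def build_job_list_url(base_url, order_field="completionTime",
                       descending=True, limit=20, offset=0):
    """Build a job-list URL sorted by the given field.

    Assumption: the server accepts `order=<field>,<ASC|DESC>` together
    with `limit` and `offset` query parameters.
    """
    direction = "DESC" if descending else "ASC"
    query = urlencode(
        {"order": f"{order_field},{direction}", "limit": limit, "offset": offset},
        safe=",",  # keep the comma readable instead of encoding it as %2C
    )
    return f"{base_url.rstrip('/')}/api/v2/jobs?{query}"


# Hypothetical host; replace with your own rest-server address.
url = build_job_list_url("https://pai.example.com/rest-server")
print(url)
```

Sorting on the server keeps paging consistent: the client asks for one page at a time and never has to download the full job list to find the most recently completed jobs.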

Bug fixes:

  • Webportal package build issue #5378

v1.6.0

3 years ago

Release v1.6.0

Upgrade Guide

Before upgrading, we recommend checking this issue first.

New Features

  • Job protocol update: Add prerequisites #5145

  • Marketplace related update

  • Introduce an optional docker cache in cluster #5290

  • A regular GPU utilization report can be set up for admins #5281, #5294, #5324, #5331

    • #5324 introduces a schema change for pai-bearer-token in the alert-manager section. The old configuration still works but is deprecated. If you have configured pai-bearer-token of alert-manager, please refer to #5331 to modify the previous configuration.
  • Users can save frequently-used SSH public keys on the profile page #5223

  • Improve log experience #5271 #5272

  • Reduce ansible logs during deployment #5305

Bug Fixes:

  • Database controller: Tolerate wrong framework spec #5284
  • Database controller: Remove sensitive fields in db #5289
  • Database controller: Fix memory leak #5309
  • Set correct launchTime in rest-server #5307
  • Database may use unmounted host path #5343

v1.5.0

3 years ago

Release v1.5.0

New Features

  • Improve Web Portal Experience

    • Fix Home page overlap issue #5213 #5180
    • Add filter, search box and export csv button in task detail list #5175
  • Create a new page for yaml editor #5172

  • Marketplace related update

  • Support different types of computing hardware #5138

  • Deployment process refinement

    • master.csv + worker.csv -> layout.yaml
    • Move config.yaml and layout.yaml under the quick-start folder, and remove all the argument-parsing logic
    • Add support for cpu-only worker installation
    • Add support for heterogeneous workers
    • Unify version requirements: pai version, pai image tag
    • Set default value in config files
    • Generate hiveD config with layout.yaml #5179
    • Check layout before installing k8s #5184 #5181
    • Config folder structure arrangement
    • Refine installation logs
    • Add skip service list argument #5193
  • Log manager

    • Change get logs api return code #5125
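
To make the `master.csv + worker.csv -> layout.yaml` change above more concrete, here is a minimal sketch of merging two host lists into a single machine list. It assumes each CSV row is `hostname,ip` and that layout entries use keys such as `hostname`, `hostip`, `machine-type`, `pai-master` and `pai-worker`. These field names and the SKU names are assumptions for illustration, not taken from these notes, and the functions are hypothetical helpers rather than part of the deployment scripts.

```python
import csv
import io


def read_hosts(csv_text):
    """Parse 'hostname,ip' rows, skipping blank lines and '#' comments."""
    hosts = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or row[0].strip().startswith("#"):
            continue
        hosts.append((row[0].strip(), row[1].strip()))
    return hosts


def build_machine_list(master_csv, worker_csv,
                       master_sku="master-machine", worker_sku="gpu-machine"):
    """Merge the two old host lists into one layout-style dictionary.

    Key names and SKU names are assumed, not confirmed by the release notes.
    """
    machines = []
    for hostname, ip in read_hosts(master_csv):
        machines.append({"hostname": hostname, "hostip": ip,
                         "machine-type": master_sku, "pai-master": "true"})
    for hostname, ip in read_hosts(worker_csv):
        machines.append({"hostname": hostname, "hostip": ip,
                         "machine-type": worker_sku, "pai-worker": "true"})
    return {"machine-list": machines}


layout = build_machine_list(
    "pai-master,10.0.0.1\n",
    "# workers\npai-worker1,10.0.0.2\npai-worker2,10.0.0.3\n",
)
for machine in layout["machine-list"]:
    print(machine["hostname"], machine["hostip"], machine["machine-type"])
```

A single file with a per-machine type field is what makes heterogeneous and cpu-only workers expressible: each worker can point at a different SKU instead of every row in `worker.csv` sharing one implicit hardware description.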

v1.4.1

3 years ago

Release v1.4.1

Bug Fixes

  • Marketplace
    • Fix initializing blob data issue (#5189)
  • Log Collection
    • Fix getting wrong log for retried task & frontend crash issue (#5190)

v1.4.0

3 years ago

Release v1.4.0

New Features

  • multi-cluster (https://github.com/microsoft/pai/issues/4929)
    • Support job transfer (#5082, #5088)
  • Autoscaler
    • Update docs for Cluster Autoscaler on AKS Engine (#5057)
  • Log Collection (https://github.com/microsoft/pai/issues/4992)
    • Rest API
    • Webportal
  • HTTPS configuration document (#5076, #5078)
  • Marketplace (https://github.com/microsoft/openpaimarketplace/issues/73)
    • Data
      • Move NFS to Azure Blob as backend
      • Upload Job output to Azure Blob
      • Download data from Azure Blob to local
      • Use Azure storage SDK for privacy
      • Refactor data usage logic after changing storage to blob
      • Update project development doc and manual
    • Service Deployment
      • Start Local Rest Server
      • Deploy Rest Server in PAI
      • Start database and save items into it
      • Register in PAI pylon (#5066)
      • Add azure storage to service configuration (#5104)
  • Web Portal
    • Fix stop job button issue #5079
  • Admin Experience
    • Prometheus alert rules update (#5021)
    • Refine deployment process (#5077, #5085)

Bug Fixes

  • Fix updateUserGroupList API issue (#5121)
  • Fix hived config issue caused by k8s coreDNS deployment (#5071)

v1.3.0

3 years ago

Release v1.3.0

New Features

  • Marketplace
    • New templates in marketplace (microsoft/openpaimarketplace#60)
  • HiveD Scheduler
    • Support cluster autoscale with HiveD scheduler on AKS (#4868)
    • Support dynamic sku types for different vc on webportal (#4900)
  • Advanced job debug mode
    • Add per task retry history (microsoft/frameworkcontroller#62, #4958, #4966)
    • Expose Kubernetes events (#4939, #4975)
  • GPU monitoring and utilization
    • Support job tagging (#4924)
    • Stop low GPU utilization job with alert-manager (#4940)
    • Cordon node with GPU ECC Errors (#4942)
  • Documentation
    • Fix document according to DRI tickets (#4828)
    • Add distributed examples (#4821)
  • Webportal
    • Add help info for items on webportal (#4950)

Known Issues

  • Job stop button gives no feedback after a successful click (#5023)
  • Alert handler stop-job notice is not clear to end users (#5021)
  • DB framework / Rest-Server job inconsistency (#5027)

v1.2.1

3 years ago

Release v1.2.1

Bug Fixes:

  • Fix config generation bug #4970
  • Fix database controller dependency #4978

v1.2.0

3 years ago

Release v1.2.0

New Features

  • Database
    • New RestServer Arch: RestServer -> DB -> ApiServer (#4651)
  • Webportal
    • Job list paging in server side (#4651)
    • Change job createdTime to submissionTime (#4761)
    • VC && Group experience for Admin (AAD Mode) (#4800)
    • Support SKU count and SKU type in job submission page (#4796)
    • Upgrade api/v1 code to api/v2 (#4704)
    • Show "More Diagnostics" in job detail page (#4670)
  • Marketplace
  • Others
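
The `RestServer -> DB -> ApiServer` arrangement and server-side paging listed above can be pictured with a toy model. The notes give only the arrow diagram, so this is one possible reading: the rest server records job requests in a database, and a separate step pushes unsynced records to the API server. All class and function names below are hypothetical, and in-memory dictionaries stand in for the real database and Kubernetes API server.

```python
class Database:
    """In-memory stand-in for the job database."""

    def __init__(self):
        self.jobs = {}

    def upsert(self, name, spec):
        self.jobs[name] = {"spec": spec, "synced": False}


class ApiServer:
    """In-memory stand-in for the cluster API server."""

    def __init__(self):
        self.frameworks = {}

    def apply(self, name, spec):
        self.frameworks[name] = spec


class RestServer:
    """Accepts requests and talks only to the database."""

    def __init__(self, db):
        self.db = db

    def submit_job(self, name, spec):
        self.db.upsert(name, spec)

    def list_jobs(self, offset=0, limit=10):
        # Paging is applied on the server side, against the database.
        names = sorted(self.db.jobs)
        return names[offset:offset + limit]


def sync_pending(db, api_server):
    """Push every unsynced record to the API server; return the count."""
    synced = 0
    for name, record in db.jobs.items():
        if not record["synced"]:
            api_server.apply(name, record["spec"])
            record["synced"] = True
            synced += 1
    return synced


db, api = Database(), ApiServer()
rest = RestServer(db)
rest.submit_job("job-b", {"image": "example"})
rest.submit_job("job-a", {"image": "example"})
print(rest.list_jobs(limit=1))   # first page, served from the database
print(sync_pending(db, api))     # number of records pushed to the API server
```

In this reading, list queries never touch the API server, which is what allows paging to be done in one place.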

Improvements

  • HiveD improvement (#4868)
  • Robustness improvement (#4694)

Bug Fixes

Known Issues

v1.1.1

3 years ago

Release v1.1.1

Bug Fixes:

  • Fix SDK request timeout #4756