Pai Versions

Resource scheduling and cluster management for AI

v1.1.0

3 years ago

Release v1.1.0

New Features

  • Storage:
    • Support readonly storage. (#4523)
  • Security
    • If ssl is enabled, all requests will use https. (#4550)
  • Authentication
    • Support nested AD group in AAD Mode. (#4639)
  • Marketplace

Improvements

  • Add stress test for PAI API. (#4665)
  • Resolve jobs always retrying on port conflict. (#4384)
  • Webportal/VScode use JS SDK + SDK improvement. (#4660)
  • Align webportal submit default value with backend. (#4682)
  • Documentation enhancements. (#4700)

Bug Fixes

  • Fix incorrect logdir in TensorBoard v2.
  • Fix broken help link for job submission in webportal.
  • Fix ssh barrier bug.

v1.0.1

3 years ago

Release v1.0.1

Bug Fixes:

  • Fix network related issues in installation #4560 #4565 #4568
  • Fix write permission for dir under home folder in storage manager #4562
  • Remove default webportal plugins in quick start #4581
  • Fix private registry use problem #4586 #4592

v1.0.0

3 years ago

Release v1.0.0

New Features

  • AAD Support
    • mapping between AAD groups and VCs #3275
    • End-to-End AAD test cases #3362
  • Introduce kubespray to deploy clusters #3757
    • Kubernetes Deployment, GPU drivers installation, Nvidia docker runtime #3757 #3842 #3873
    • Add new worker node #3846
    • Clean up cluster environment where OpenPAI is deployed before #3879 #3883
    • Ansible playbooks to uninstall GPU drivers installed by apt #3899
  • Framework Controller
    • Deployment #3435
    • Foreground stop all frameworks #3664
  • hivedscheduler provides a Kubernetes Scheduler Extender for multi-tenant GPU clusters.
    • Hived scheduler deployment #3495, #3579
    • Hived as the default k8s scheduler #3599
    • Job Near FIFO scheduling #3726, #3731
    • Expose LazyPreemptionStatus #3917
    • Disable leader election #3928
    • HiveD intra-vc preemption for restart #3861
    • Check suggested nodes after preemption #3843
    • Update hived config validation #3812
    • HiveD reconfiguration #3768
  • openpai-runtime is a module that provides runtime support to job containers.
    • Port kube runtime #3013
    • Job ssh for kube-runtime #3153, #3729
    • Add PAI env variables in init scripts #3154
    • Generate random ports for scheduling #3224
    • Refine init and runtime script in k8s pods #3245
    • Port conflict check #3259
    • Kubernetes ErrorSpec #3585
    • Add job exit code #3559
    • Add sshbarrier to ssh plugin #3587
    • Clean ${PAI_WORK_DIR} before mv content to this folder #3695
    • Force to flush after user command finished #3794
    • Decompress the framework when the size is large #3820
    • Apt package cache #4226
  • openpaisdk provides a JavaScript SDK designed to help OpenPAI developers offer a user-friendly experience.
  • openpaimarketplace provides a webportal plugin that stores examples and job templates. Users can use openpaimarketplace to share their jobs or to run and learn from jobs shared by others.
  • Enable RBAC
    • RBAC for Prometheus #3716, #3799, #3844, #3865, #3896
    • RBAC for framework-controller, hived-scheduler, kube-runtime #3709, #3739
    • RBAC for watchdog #3721
    • RBAC for rest-server #3719, #3433, #3750
  • Device plugin
    • InfiniBand device plugin in HCA mode for k8s #3732
    • GPU device plugin #3744
    • Host device plugin #3792
  • Storage
    • K8S managed NFS + SMB storage #3826
    • Use extras to save team-storage selection #3814
    • Storage Rest API, Refine Storage Data and CLI #3809, #3754
    • Change query storage configs/servers rest-api to GET #3791
    • Refactor storage by leveraging persistent volumes #4157
  • Limited internal storage and postgres db #3813
    • Use postgres db to show job history #4164
  • TensorBoard integration #3257

Improvements

  • Deployment
    • Support to manage a list of services in paictl #3432
    • Choose services for different cluster type #3528
    • Improve deployment process, reduce the time cost #4022
    • Add deployment pre-check #1893
    • Remove yarn version components in k8s version object model #4027
    • Inform user of pai cluster id, configuration, username, password PR #4267
    • Clean docker image in service-boot.sh PR #4248
    • Disable yarn value in cluster-type #4445
    • Remove yarn content from deployment doc #4447
  • Webportal
    • Quick wizards or templates for users #3430
    • Enhance data storage functions #3416
    • Available GPU chart and virtual cluster list #3265
    • User filter #3310
    • Show VC utilization and alerts notification #3411
    • A confirm dialog for stop actions #3408
    • Add a button in job-detail page to get merged stdout&stderr log #3282
    • Add more information when SSH is disabled #3389
    • Job history UI #3831
    • User profile page #3804, #3853, #3884
    • Add SSH config to webportal for pure-k8s based PAI #3596
    • SSH Generator for webportal #3644
    • Hide pages and links that are not supported in pure k8s version #3574
    • Hide grafana dashboard in k8s webportal #3688
    • Map rest server level job status completing, retry pending to running, waiting #3636
    • Separate 'waiting' and 'running' states on task role's statistics #3727
    • Display stopped task count in task role's header #3840
    • Disable utilization charts' animation #3730
    • Render clone job button as a link if possible #3854
    • Add stopping status to task #3868
    • Lazy loading of monaco editor component #3871
    • Static content caching optimization #3852
    • Webportal redirection to the origin page after login #3914
    • Show CPU/Mem usage in home page for normal user #3784
    • new UX design
      • New header, left navigation and command bar #3985 #4057
      • New homepage #3712 #4074
      • WebUI Font Format Adjustment #3877
    • Replace code injection in webportal with new plugin implementation #3823
  • Rest-server
    • Cluster info api #3281
    • Update hived config validation #3812
    • Job history API #3831
    • Token API #3774 #3834 #3835
    • Change to default k8s scheduler #3599
    • Check groups every time restserver start #3458
    • Add pod GPU number for default scheduler #3642
    • Map rest server level job status completing, retry pending to running, waiting #3636
    • Make -1 compatible in launcher completion policy #3870
    • Update hived resource validation #3867
    • Update restart policy to avoid stuck pending pods #3856
    • Filter pods without nodename and completed pods #3841
    • Reverse encoded framework name #3824
    • Mask secret in framework annotations #3821
    • Update priority class owner references #3808
    • Refine all APIs and documents #4355
    • Change gpuType(s) to skuType(s) #4362
    • Update virtual cluster metrics using scheduler api #4329
  • openpai-protocol provides a specification of OpenPAI job protocol.
    • Remove job name length limit #3935 #4069
  • openpaivscode provides a VSCode extension to connect OpenPAI clusters, submit AI jobs, simulate jobs locally, manage files, and so on.
    • Support AAD login in VSCode Extension #3647
  • Security
    • Enable https for K8S Dashboard #4025
    • Add REST API for checking node/pod status #3892
    • Pass secret field in job config to runtime #3572
  • Migrate components to separate repos #4319 #4307 #4311 #4324
  • AMD GPU Support #4127
    • GPU scheduling for jobs #4093
    • GPU metrics in exporter #4258
    • Rocm job examples on how to use amd gpu #4093
  • Others
    • Access Log Manager through Pylon #3600
    • Remove invalid chart in Grafana Dashboard #4020
    • Watchdog auto delete leaked priorityclass #3866
    • Check ACS Docker Image in initContainer #3572
    • For basic authentication mode, prepare document to explain how to create a group and associate users to the group #4130
    • Support to add Pod Creation http error pattern to errorspec #4125
    • Change watchdog default mem limit #4413

Documentation

  • Update job submission doc #3347
  • AAD end2end document #3362
  • Upgrade document #4238
  • Add installation FAQs and troubleshooting PR #4249
  • Rest-server API documents refinement #4355
  • End-to-end manual for cluster users and administrators #4023
  • Remove outdated docs #4446

Bug Fixes

  • Docker's data-root will be lost on Azure node restart #3307
  • Fix kubelet.service in add machines #3807
  • Fix missing dependencies during installation #3800
  • Fix VC view link bug #3689
  • Fix job list page's stopping status #3869
  • Fix bug in hived resources calculation #3595
  • Fix grafana can not be accessed behind the gateway #3659
  • Fix job issues on k8s based PAI #3555
  • Remove framework owner reference for priority class & default not to create priority class #4131
  • Job retry link invalid with unknown reason #4008
  • Do not change the semantic meaning of user submitted/cloned job config #3823
  • Storage manager constantly restart #4081
  • Wrong retry log path in job history issue #4237
  • API server overloaded by job detail page when containers are too many #4270 #4279
  • Job API not return correct appLaunchedTime #4295
  • Fix Azure File issues in storage #4438
  • Fix job retry url #4442
  • Add rate limit for RESTful API #4418 #4422

Known Issues

  • Weave net cause MPI job hang #4394
  • Hivedscheduler is prone to misconfig due to daemon Pods, such as weave net and nginx proxy #4331
  • Cert expiration will fail the access to the bed #4216
  • Can not access job pod in k8s-dashboard #4181
  • A job (or a pod of a job) may get stuck in a state neither running nor waiting #4141
  • Job config is modified after it is imported/uploaded/cloned #3823
  • Get recursive nested AD Users in AAD Mode #3440

v0.17.0

4 years ago

This is an intermediate release, mainly in preparation for the upcoming PureK8S version. Because there are breaking changes between PAI's K8S+YARN version and the PureK8S version, if you currently use PAI's K8S+YARN version in production, please stay on version 0.14.0 and plan to upgrade later.

v0.14.0

4 years ago

Release v0.14.0

New Features

  • Web portal:

    • New job submission page for pai protocol v2 #3026
    • Update home page to support dedicated vc #2995
  • Python SDK:

    • Sdk release v0.4.00 #3018
  • New scheduler:

    • Dedicated vc support #2960
  • PAI vscode extension:

    • Submit job v2 support #2913
    • Add schema and snippets for YAML text editor. #2978
  • Team storage plugin:

    • New team-wise manage cli #2943

Improvements

  • Web portal:

    • Refine job detail page's task list #2953
    • add new webHDFSUri in env.js.template (#3048)
    • Tweak job submission page layout (#3043)
    • css tweak (#3041)
    • refine submit job UI page (#3037)
    • refine home page's error handling (#3196)
    • renew docker image list and add tooltip (#3181)
    • remove prettier config file (#3184)
    • remove tachyons css to avoid classname error (#3173)
    • Add confirm dialog before batch edit admin's password (#3174)
    • fix UI broken if choose all of the VC when create user (#3177)
    • redesign batch edit behavior (#3172)
    • trim the docker url after job submission; fix job detail page's clone button's padding (#3169)
    • Change 'Import CSV' to 'Create Bulk Users' in user management (#3136)
    • update command section's placeholder (#3150)
    • move documents link to top nav bar (#3126)
    • tweak home page's gpu chart's height (#3131)
    • display red border when a task role is invalid (#3072)
    • change label of container size to resources per instance (#3101)
    • add error message when command is empty (#3098)
    • disable edit user form's auto fill (#3107)
  • Rest server:

    • Add get job api v2 #2851
    • Fix error in Docker runtime detection #2888
    • update user's vc permission when create/remove vc #2939
    • Remove deprecated api in rest server #2954
    • AAD & New Schema of User and Group (#3034)
    • change get vc api (#3149)
  • Yarn cluster && Framework launcher:

    • Upgrade zookeeper version to 3.4.14 #2884
    • Upgrade to Zookeeper 3.4.14 #2911
  • Deployment:

    • client tolerate datanode replacement failure #2895
    • "paictl service delete" supports to delete log #2926
    • Update maximum resources in yarn-site.xml #2942
    • 418.56 support #3020
    • upgrade pai version in k8s (609680)
    • upgrade pai version (#3056)
    • Remove duplicate version prefix (#3124)
  • Plugins:

    • Upgrade fstream version to fix vulnerability #2886
    • Refine error message for marketplace #2998
  • Security:

    • Update js-yaml #2890 #2891 #2892
    • Add dependabot config file #2893
    • Update handlebars #2899 #2900
    • Update dependabot config file #2914
    • Update is-my-json-valid #2931
    • Update twisted #2912

Documentation

  • Rewording and some format fixes. #2927
  • Chinese translation and placeholder. #2919
  • add submit v2 job to readme (#3017)
  • api doc update (#3216)
  • add release note (#3204)
  • pai upgrade doc (#3195)
  • add job submission docs (#3183)
  • change link of external project like python sdk (#3237)

Bug Fixes

  • Web portal:

    • hide 0 gpu nodes from Available Nodes Chart #2915
    • fix job detail page's gpu attributes display bug. #3027
    • disable submit if command is empty (#3080)
    • auto remove empty lines in command (#3074)
    • keep mount config selection state (0af835)
    • fix the set-state warning after clone job (#3070)
    • fix clone job bug (#3068)
    • fix <p> tag and prop-types warning (#3067)
    • Return empty command if no teamwise mounts (4afd2e)
    • Adjust team-mount-list view (1b4190)
    • Align job submit page's submission section to task role (#3065)
    • Add tooltips to job submission page's field label. (#3046)
    • remove plugin in webportal config (#3063)
    • remove pylon address dependency (#3040)
    • fix export yaml bug (#3047)
    • fix webhdfs wrong request (#3044)
    • customized docker image inputField may disappear (#3033)
    • add dependency of joi for node server (#3031)
    • remove duplicated v of feedback (#3230)
    • change webportal doc link (#3229)
    • fix stdout/stderr's full log link bug when pylon is not used (#3219)
    • change tutorial link of home page (#3213)
    • fix bug of GPU available number (#3210)
    • hot fix for hdfs CORS problem (#3145)
    • fix docker bug #3134
    • hot fix hdfscli proxy problem (#3130)
    • fix virtual cluster's default value after job clone (#3128)
    • refine hdfs check for robustness (#3116)
    • change to lowercase letter for 'Completion Policy' (#3119)
    • fix data command error (#3111)
    • fix job submission page's jobRetryCount and taskRetryCount field (#3112)
    • redirect v2 job to default submission page if plugin not installed (#3091)
    • fix docker (#3097)
    • add empty key check to key-value list control (#3096)
    • change command section's default comments to placeholder (#3095)
    • fix deployment field missing bug (#3238)
    • [Web Portal] fix port list bug (#3240)
  • Rest server:

    • Add quotes for masked secrets field in protocol
    • Trap SIGTERM in entrypoint to avoid yarn container early stop #2947
    • fix bug #3009
    • Update http errors in get job v2 #3022
    • User Migrate Script Fix. (#3090)
    • Fix issue in updateUserVirtualCluster of rest-server (565073)
    • Fix user migration issue (#3036)
    • api permission fix (#3211)
    • Fix AAD group in dedicated vc create/delete (#3143)
    • API to create vc and remove vc and do the same operation to group (#3064)
    • Groupname schema (#3099)
  • Hadoop:

    • Increase HDFS client default replica to 3 #2924
    • port conflict #3012
    • Pre check vclist from yarn and remove the vc type group which is not in. (#3158)
    • fix and improvement (#3093)
  • Deployment:

    • fix configMigration script bug (#3220)
    • user migrate script issue fix (#3209)
    • Move npm version to dockerfile #3109 (#3110)

Known Issues

  • All lines in the command are concatenated with &&, so using # or \ in the command will cause bugs. This will be fixed in the future.
  • According to the official documentation, different GPU driver versions may support different CUDA versions. In our tests, the current 384.111 GPU driver version does not support the cuda10 image.
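The first known issue can be illustrated with a small sketch. It assumes the lines are joined with a plain " && " separator; the helper name join_command and the sample commands are hypothetical and are not taken from the OpenPAI code base.

```python
def join_command(lines):
    """Join command lines with ' && ' (assumed separator), as the known issue describes."""
    return " && ".join(lines)


# A '#' comment on one line hides every later command from the shell,
# because everything after '#' on the joined line is treated as a comment.
with_comment = join_command(["echo start  # setup", "python train.py"])
print(with_comment)  # echo start  # setup && python train.py

# A trailing '\' no longer continues the line; it lands directly before
# the separator, so the second line is run as a separate (broken) command.
with_backslash = join_command(["python train.py \\", "--epochs 10"])
print(with_backslash)  # python train.py \ && --epochs 10
```

Until the fix lands, keeping each command on a single line without comments or line continuations avoids both cases.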

v0.13.0

4 years ago

Release v0.13.0

New Features

  • OpenPAI protocol:

  • Web portal:

    • Add login page for guests (#2544)

    • Add user home page (#2614)

      • Job Status
      • My virtual clusters
      • Available GPU nodes (whole cluster)
      • My recent jobs

    • Add new user management page (#2726, #2796)

    • User Management UX refactoring with new layout and themes (#2726, #2796)

Improvements

  • OpenPAI protocol:

    • Update example jobs in marketplace v2 for OpenPAI protocol (#2827)
  • Web portal:

    • Refine styles in job pages (#2829, #2856, #2858, #2862)
    • Refine alert message in job pages (#2698)
    • Reduce the build bundle size to improve webportal performance (#2715)
  • Rest server:

    • Add job v1 config to v2 converter (#2756)
    • Check default runtime before starting Docker (#2754)
  • Framework launcher:

    • Upgrade to Hadoop 2.9.0 (#2704)
  • Job exporter:

    • Change triggering rule for exporter hangs (#2766)
    • Add GPU temperature detection (#2757)
  • Watchdog:

    • Use /api/v1/pods to get all pods (#2750)
  • Deployment:

    • Allow user to use Backspace in paictl input (#2769)
    • Disable InfiniBand driver installation by default (#2595)

Documentation

  • Refine document of VS Code extension (#2707)
  • Add document for PAI storage (#2822)
  • OpenPAI protocol specification document (#2260)
  • Job submission v2 plugin document (#2820)
  • Update RESTful API document for API v2 (#2816)
  • Fix typos in document (#2818)

Bug Fixes

  • Web portal:

    • Fix text broken when create or edit user (#2849)
    • Fix token authentication bug (#2843)
    • Fix retry count's margin-top (#2845)
    • Fix job clone bug (#2836)
    • Fix home page's responsive layout (#2805)
    • Fix job list page filter bug (#2787)
    • Fix home page failed to load virtual cluster list bug (#2774)
  • Rest server:

    • Check duplicate job in submission v2 (#2837)
  • Hadoop:

    • Increase YARN kill container timeout (#2778)
    • Remove cross origin in resource manager (#2758)
    • Fix Hadoop AI matching nvidia-smi regex (#2681)

Known Issues

  • Deployment issues on NVIDIA DGX2 (#2742)

v0.12.0

5 years ago

Release v0.12.0

New Features

  • Web portal:

    • Display error message in job detail page #2456
    • Import users from CSV file directly and show the final results #2495
    • Add TotalGpuCount and TotalTaskCount into job list #2499
  • Deployment

    • Add cluster version info #2528
    • Check if the nodes are ubuntu 16.04 #2520
    • Check duplicate hostname #2403

Improvements

  • Web portal:
    • Replace the suffix if a cloned job is resubmitted #2451
    • Refine view full log #2431
    • Job list: optimize filter #2444
    • Replace the url module with the querystring module #1825
  • REST server:
    • Follow REST protocol in job create controller #2481
    • Add task state; Add job's retry details; Refine job config #2306
    • Remove error message #2464
  • Framework Launcher:
    • Add more info into SummarizedFrameworkInfo #2435
  • Alert manager:
    • Send resolved emails and let users configure the repeat interval #2438
    • Monitor process memory consumption and alert for omiagent and omsagent #2419

Documentation

  • Doc refactoring and update hello-world sample #2445
  • Add Chinese translation #2344

Bug Fixes

  • Web portal:

    • Add validation when submitting job by json #2375
    • Job List-filter UI fix #2479
    • Fix job detail "jobConfig is null" bug #2500
    • Fix job detail page's "retry link" #2478
    • Fix job v2 detail page rendering error #2480
  • REST server:

    • code_dir_size report incorrect error message #2388
    • fix script entrypoint #2522
    • Fixed jq invocation errors with numeric taskRoles #2405
  • Hadoop:

    • Remove duplicate diagnostics #2527
  • Alert manager:

    • Fix alert label error #2521
  • Drivers:

    • Add an optional configuration to skip ib drivers installation. #2514
    • Fix delete script of rollback nvidia runtime #2370
    • Fix driver parse #2458
  • Storage plugin

    • Add environment and handle corner cases #2525

Known Issues

N/A

Upgrading from Earlier Release

Please follow Upgrading to v0.12 for detailed instructions.

v0.11.0

5 years ago

Release v0.11.0

New Features

  • Support team wise NFS storage, including:

    • An NFS configuration plug-in and a commandline tool. #2346
    • A simple NFS-job submit plug-in. #2358

    Refer to Simplified Job Submission for OpenPAI + NFS deployment for more details.

  • New alerts for unhealthy GPUs, currently including following alerts #2209:

    • gpu used by zombie container
    • gpu used by external process
    • gpu ecc error
    • gpu hangs
    • gpu memory leak
  • Admins can see all running jobs on a node. #2197

  • Filter support in Job List View. #302

  • Hold the Env for failed jobs which are caused by user error. #2272

Improvements

Service

  • Webportal:

    • New job list page look and feel. #302
    • New job detail page: #2211
  • Alert-manager: Increase node memory and CPU threshold to reduce false alerts. #2345, #2296

  • Hadoop: Persist yarn and hdfs service log to host. #2244

  • Runtime: Support samba shares in container. #2318

Documentation

  • Add troubleshooting guide for jobs. #2305
  • Refine document for new user to submit job. #2278

Examples

  • Remove TensorFlow mpi example which cannot be run currently. #2337

Others

  • Operations: Add a commandline tool to query unhealthy gpu information from prometheus. #2319

Notable Fixes

  • Hadoop: Scheduler may get stuck in an indefinite loop. #2365
  • Hadoop: Sometimes, hadoop-ai can't detect ecc error. #2343
  • Runtime: Users might see unallocated gpus. #2352
  • Runtime: Jobs might get a free retry when exceeding their memory limit. #1108
  • Drivers: Fix IB installation bugs. #2278, #2271, #2269

Known Issues

  • There might be a mismatch between linux kernel and driver. #2446
  • Retry link of new job details page is missing. #2466

Upgrading from Earlier Release

Please follow Upgrading to v0.11 for detailed instructions.

v0.10.1

5 years ago

Release v0.10.1

New Features

  • Admin can configure MaxCapacity through REST API for a given Virtual Cluster so that the virtual cluster can use idle resources as bonus. #2147
  • Support Azure RDMA. #2091; how-to doc
  • New Disk Cleaner for abnormal disk usage: the disk cleaner checks disk usage every 60 seconds (configurable). If usage is above 94% (configurable), it kills the container that uses the most disk space with a specific signal (10); the container exits with code 1 and the related job fails. Admins and users can find the reason in the job logs. #2119
  • Web portal: add "My jobs" filter button. #2111
  • "Submit Simple Job" web portal plugin. #2131 Document
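The Disk Cleaner policy described above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the actual OpenPAI implementation: the names pick_container_to_kill, CHECK_INTERVAL_SECONDS, USAGE_THRESHOLD_PERCENT, and KILL_SIGNAL are hypothetical, and the per-container disk sizes are assumed to be collected elsewhere.

```python
# Defaults taken from the release note; the constant names are hypothetical.
CHECK_INTERVAL_SECONDS = 60      # how often disk usage is checked (configurable)
USAGE_THRESHOLD_PERCENT = 94.0   # usage above this triggers a kill (configurable)
KILL_SIGNAL = 10                 # signal number named in the release note


def pick_container_to_kill(usage_percent, container_sizes,
                           threshold=USAGE_THRESHOLD_PERCENT):
    """Return the id of the container using the most disk space, or None.

    usage_percent   -- current disk usage of the host, in percent
    container_sizes -- dict mapping container id to bytes used (assumed input)
    """
    if usage_percent <= threshold or not container_sizes:
        return None  # below the threshold, or nothing to kill
    return max(container_sizes, key=container_sizes.get)


# Example: usage is above 94%, so the largest container is selected.
victim = pick_container_to_kill(95.5, {"job-a": 10_000, "job-b": 750_000})
print(victim)  # job-b
```

A real cleaner would run this check every CHECK_INTERVAL_SECONDS and send KILL_SIGNAL to the selected container; that part is omitted here because it depends on the container runtime.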

Improvements

Service

  • Hadoop: Improved log readability by disabling an unused HDFS short-circuit setting. #2027
  • Extended the job log retention time from 7 days to 30 days, and made the retention time configurable for admins. #2034
  • Optimized the RM and Yarn's default configurations for PAI to reduce the resource usage by AM. #2072
  • Pylon: WebHDFS library compatibility. #2134
  • Extend the NM expiry time from 15 mins to 60 mins to better tolerate NM downtime. #2142
  • Alert Manager: Make alerts clearer when a service is not up. #2105
  • Web Portal: Allow jsonc in job submission. #2084

Deployment

  • Only restart the docker daemon if the configuration is updated. #2138

Documentation

  • Update document about docker data root's configuration. #2052
  • Improved how-to-setup-dev-box.md with more details. #2087
  • Improved hdfs_service.md with more details. #2096

Examples

  • Add an example of horovod with rdma & intel mpi. #2112

Others

  • Build: Add error message when image build failed. #2133

Bug Fixes

  • Issue #2099 is fixed by
    • Launcher: Revise the definition of Framework running state. #2135
    • REST server: Classify two states to WAITING. #2154
  • Kubernetes: Disable kubernetes's pod eviction. #2124
  • Grafana: Use yarn's metrics in cluster view. #2148
  • Add /usr/local/cuda/extras/CUPTI/lib64 to LD_LIBRARY_PATH. #2043

Upgrading from Earlier Release

Known Issue

Issue: There is a known issue #2433 in the v0.10.1 upgrade that some users might hit. When it occurs, deploying a Kubernetes cluster with OpenPAI hangs. Resolution: We have provided a hotfix #2441 for it. If your organization has no urgent need to upgrade to v0.10.1 by the end of March 2019, you can postpone the upgrade for a week, by which time we will release v0.11.0 #2307, in which the known issue is officially fixed.

Please follow Upgrading to v0.10 for detailed instructions.

v0.9.1

5 years ago

Release v0.9.1

Bug Fixes:

  • REST Server: Fix admin permission, Closes #2172