Pai Versions

Resource scheduling and cluster management for AI

v1.1.0

3 years ago

Release v1.1.0

New Features

Storage:
- Support readonly storage. (#4523)
Security
- If ssl is enabled, all requests will use https. (#4550)
Authentication
- Support nested AD group in AAD Mode. (#4639)
Marketplace
- Integrate with new version of PAI marketplace.

Improvements

Add stress test for PAI API. (#4665)
Resolve jobs always retrying on port conflict. (#4384)
Webportal/VScode use JS SDK + SDK improvement. (#4660)
Align webportal submit default value with backend. (#4682)
Documentation enhancements. (#4700)

Bug Fixes

Fix incorrect logdir in TensorBoard v2.
Fix broken help link on the webportal job submission page.
Fix ssh barrier bug.

v1.0.1

3 years ago

Release v1.0.1

Bug Fixes:

Fix network related issues in installation #4560 #4565 #4568
Fix write permission for dir under home folder in storage manager #4562
Remove default webportal plugins in quick start #4581
Fix private registry use problem #4586 #4592

v1.0.0

3 years ago

Release v1.0.0

New Features

AAD Support
- mapping between AAD groups and VCs #3275
- End-to-End AAD test cases #3362
Introduce kubespray to deploy clusters #3757
- Kubernetes Deployment, GPU drivers installation, Nvidia docker runtime #3757 #3842 #3873
- Add new worker node #3846
- Clean up cluster environment where OpenPAI is deployed before #3879 #3883
- Ansible playbooks to uninstall GPU drivers installed by apt #3899
Framework Controller
- Deployment #3435
- Foreground stop all frameworks #3664
hivedscheduler provides a Kubernetes Scheduler Extender for multi-tenant GPU clusters.
- Hived scheduler deployment #3495, #3579
- Hived as the default k8s scheduler #3599
- Job Near FIFO scheduling #3726, #3731
- Expose LazyPreemptionStatus #3917
- Disable leader election #3928
- HiveD intra-vc preemption for restart #3861
- Check suggested nodes after preemption #3843
- Update hived config validation #3812
- HiveD reconfiguration #3768
openpai-runtime is a module that provides runtime support to job containers.
- Port kube runtime #3013
- Job ssh for kube-runtime #3153, #3729
- Add PAI env variables in init scripts #3154
- Generate random ports for scheduling #3224
- Refine init and runtime script in k8s pods #3245
- Port conflict check #3259
- Kubernetes ErrorSpec #3585
- Add job exit code #3559
- Add sshbarrier to ssh plugin #3587
- Clean ${PAI_WORK_DIR} before moving content into this folder #3695
- Force to flush after user command finished #3794
- Decompress the framework when the size is large #3820
- Apt package cache #4226
openpaisdk provides a JavaScript SDK designed to help OpenPAI developers offer a user-friendly experience.
openpaimarketplace provides a webportal plugin that stores examples and job templates. Users can use openpaimarketplace to share their jobs or to run and learn from jobs shared by others.
Enable RBAC
- RBAC for Prometheus #3716, #3799, #3844, #3865, #3896
- RBAC for framework-controller, hived-scheduler, kube-runtime #3709, #3739
- RBAC for watchdog #3721
- RBAC for rest-server #3719, #3433, #3750
Device plugin
- InfiniBand device plugin in HCA mode for k8s #3732
- GPU device plugin #3744
- Host device plugin #3792
Storage
- K8S managed NFS + SMB storage #3826
- Use extras to save team-storage selection #3814
- Storage Rest API, Refine Storage Data and CLI #3809, #3754
- Change query storage configs/servers rest-api to GET #3791
- Refactor storage by leveraging persistent volumes #4157
Limited internal storage and postgres db #3813
- Use postgres db to show job history #4164
TensorBoard integration #3257

Improvements

Deployment
- Support to manage a list of services in paictl #3432
- Choose services for different cluster type #3528
- Improve deployment process, reduce the time cost #4022
- Add deployment pre-check #1893
- Remove yarn version components in k8s version object model #4027
- Inform user of pai cluster id, configuration, username, password PR #4267
- Clean docker image in service-boot.sh PR #4248
- Disable yarn value in cluster-type #4445
- Remove yarn content from deployment doc #4447
Webportal
- Quick wizards or templates for users #3430
- Enhance data storage functions #3416
- Available GPU chart and virtual cluster list #3265
- User filter #3310
- Show VC utilization and alerts notification #3411
- A confirm dialog for stop actions #3408
- Add a button in job-detail page to get merged stdout&stderr log #3282
- Add more information when SSH is disabled #3389
- Job history UI #3831
- User profile page #3804, #3853, #3884
- Add SSH config to webportal for pure-k8s based PAI #3596
- SSH Generator for webportal #3644
- Hide pages and links that are not supported in pure k8s version #3574
- Hide grafana dashboard in k8s webportal #3688
- Map rest server level job status completing, retry pending to running, waiting #3636
- Separate 'waiting' and 'running' states on task role's statistics #3727
- Display stopped task count in task role's header #3840
- Disable utilization charts' animation #3730
- Render clone job button as a link if possible #3854
- Add stopping status to task #3868
- Lazy loading of monaco editor component #3871
- Static content caching optimization #3852
- Webportal redirection to the origin page after login #3914
- Show CPU/Mem usage in home page for normal user #3784
- new UX design
  - New header, left navigation and command bar #3985 #4057
  - New homepage #3712 #4074
  - WebUI Font Format Adjustment #3877
- Replace code injection in webportal with new plugin implementation #3823
Rest-server
- Cluster info api #3281
- Update hived config validation #3812
- Job history API #3831
- Token API #3774 #3834 #3835
- Change to default k8s scheduler #3599
- Check groups every time restserver start #3458
- Add pod GPU number for default scheduler #3642
- Map rest server level job status completing, retry pending to running, waiting #3636
- Make -1 compatible in launcher completion policy #3870
- Update hived resource validation #3867
- Update restart policy to avoid stuck pending pods #3856
- Filter pods without nodename and completed pods #3841
- Reverse encoded framework name #3824
- Mask secret in framework annotations #3821
- Update priority class owner references #3808
- Refine all APIs and documents #4355
- Change gpuType(s) to skuType(s) #4362
- Update virtual cluster metrics using scheduler api #4329
openpai-protocol provides a specification of OpenPAI job protocol.
- Remove job name length limit #3935 #4069
openpaivscode provides a VSCode extension to connect OpenPAI clusters, submit AI jobs, simulate jobs locally, manage files, and so on.
- Support AAD login in VSCode Extension #3647
Security
- Enable https for K8S Dashboard #4025
- Add REST API for checking node/pod status #3892
- Pass secret field in job config to runtime #3572
Migrate components to separate repos #4319 #4307 #4311 #4324
AMD GPU Support #4127
- GPU scheduling for jobs #4093
- GPU metrics in exporter #4258
- Rocm job examples on how to use amd gpu #4093
Others
- Access Log Manager through Pylon #3600
- Remove invalid chart in Grafana Dashboard #4020
- Watchdog auto delete leaked priorityclass #3866
- Check ACS Docker Image in initContainer #3572
- For basic authentication mode, prepare document to explain how to create a group and associate users to the group #4130
- Support to add Pod Creation http error pattern to errorspec #4125
- Change watchdog default mem limit #4413

Documentation

Update job submission doc #3347
AAD end2end document #3362
Upgrade document #4238
Add installation FAQs and troubleshooting PR #4249
Rest-server API documents refinement #4355
End-to-end manual for cluster users and administrators #4023
Remove outdated docs #4446

Bug Fixes

Docker's data-root will be lost on Azure node restart #3307
Fix kubelet.service in add machines #3807
Fix missing dependencies during installation #3800
Fix VC view link bug #3689
Fix job list page's stopping status #3869
Fix bug in hived resources calculation #3595
Fix Grafana being inaccessible behind the gateway #3659
Fix job issues on k8s based PAI #3555
Remove framework owner reference for priority class & default not to create priority class #4131
Job retry link invalid with unknown reason #4008
Do not change the semantic meaning of user submitted/cloned job config #3823
Storage manager constantly restarts #4081
Wrong retry log path in job history issue #4237
API server overloaded by job detail page when containers are too many #4270 #4279
Job API does not return correct appLaunchedTime #4295
Fix Azure File issues in storage #4438
Fix job retry url #4442
Add rate limit for RESTful API #4418 #4422

Known Issues

Weave Net causes MPI jobs to hang #4394
Hivedscheduler is prone to misconfiguration due to daemon Pods, such as weave net and nginx proxy #4331
Cert expiration will fail the access to the bed #4216
Can not access job pod in k8s-dashboard #4181
A job (or a pod of a job) may get stuck in a state neither running nor waiting #4141
Job config is modified after it is imported/uploaded/cloned #3823
Get recursive nested AD Users in AAD Mode #3440

v0.17.0

4 years ago

This is an intermediate release, mainly in preparation for the upcoming PureK8S release. Because there are breaking changes between PAI's K8S+YARN version and the PureK8S version, if you currently use PAI's K8S+YARN version in production, please stay on version 0.14.0 and plan to upgrade later.

v0.14.0

4 years ago

Release v0.14.0

New Features

Web portal:
- New job submission page for pai protocol v2 #3026
- Update home page to support dedicated vc #2995
Python SDK:
- Sdk release v0.4.00 #3018
New scheduler:
- Dedicated vc support #2960
PAI vscode extension:
- Submit job v2 support #2913
- Add schema and snippets for YAML text editor. #2978
Team storage plugin:
- New team-wise manage cli #2943

Improvements

Web portal:
- Refine job detail page's task list #2953
- add new webHDFSUri in env.js.template (#3048)
- Tweak job submission page layout (#3043)
- css tweak (#3041)
- refine submit job UI page (#3037)
- refine home page's error handling (#3196)
- renew docker image list and add tooltip (#3181)
- remove prettier config file (#3184)
- remove tachyons css to avoid classname error (#3173)
- Add confirm dialog before batch edit admin's password (#3174)
- fix UI broken if choose all of the VC when create user (#3177)
- redesign batch edit behavior (#3172)
- trim the docker url after job submission; fix job detail page's clone button's padding (#3169)
- Change 'Import CSV' to 'Create Bulk Users' in user management (#3136)
- update command section's placeholder (#3150)
- move documents link to top nav bar (#3126)
- tweak home page's gpu chart's height (#3131)
- display red border when a task role is invalid (#3072)
- change label of container size to resources per instance (#3101)
- add error message when command is empty (#3098)
- disable edit user form's auto fill (#3107)
Rest server:
- Add get job api v2 #2851
- Fix error in Docker runtime detection #2888
- update user's vc permission when create/remove vc #2939
- Remove deprecated api in rest server #2954
- AAD & New Schema of User and Group (#3034)
- change get vc api (#3149)
Yarn cluster && Framework launcher:
- Upgrade zookeeper version to 3.4.14 #2884
- Upgrade to Zookeeper 3.4.14 #2911
Deployment:
- client tolerate datanode replacement failure #2895
- "paictl service delete" supports to delete log #2926
- Update maximum resources in yarn-site.xml #2942
- 418.56 support #3020
- upgrade pai version in k8s (609680)
- upgrade pai version (#3056)
- Remove duplicate version prefix (#3124)
Plugins:
- Upgrade fstream version to fix vulnerability #2886
- Refine error message for marketplace #2998
Security:
- Update js-yaml #2890 #2891 #2892
- Add dependabot config file #2893
- Update handlebars #2899 #2900
- Update dependabot config file #2914
- Update is-my-json-valid #2931
- Update twisted #2912

Documentation

Rewording and some format fixes. #2927
Chinese translation and placeholder. #2919
add submit v2 job to readme (#3017)
api doc update (#3216)
add release note (#3204)
pai upgrade doc (#3195)
add job submission docs (#3183)
change link of external project like python sdk (#3237)

Bug Fixes

Web portal:
- hide 0 gpu nodes from Available Nodes Chart #2915
- fix job detail page's gpu attributes display bug. #3027
- disable submit if command is empty (#3080)
- auto remove empty lines in command (#3074)
- keep mount config selection state (0af835)
- fix the set-state warning after clone job (#3070)
- fix clone job bug (#3068)
- fix <p> tag and prop-types warning (#3067)
- Return empty command if no teamwise mounts (4afd2e)
- Adjust team-mount-list view (1b4190)
- Align job submit page's submission section to task role (#3065)
- Add tooltips to job submission page's field label. (#3046)
- remove plugin in webportal config (#3063)
- remove pylon address dependency (#3040)
- fix export yaml bug (#3047)
- fix webhdfs wrong request (#3044)
- customized docker image inputField may disappear (#3033)
- add dependency of joi for node server (#3031)
- remove duplicated v of feedback (#3230)
- change webportal doc link (#3229)
- fix stdout/stderr's full log link bug when pylon is not used (#3219)
- change tutorial link of home page (#3213)
- fix bug of GPU available number (#3210)
- hot fix for hdfs CORS problem (#3145)
- fix docker bug #3134
- hot fix hdfscli proxy problem (#3130)
- fix virtual cluster's default value after job clone (#3128)
- refine hdfs check for robustness (#3116)
- change to lowercase letter for 'Completion Policy' (#3119)
- fix data command error (#3111)
- fix job submission page's jobRetryCount and taskRetryCount field (#3112)
- redirect v2 job to default submission page if plugin is not installed (#3091)
- fix docker (#3097)
- add empty key check to key-value list control (#3096)
- change command section's default comments to placeholder (#3095)
- fix deployment field missing bug (#3238)
- [Web Portal] fix port list bug (#3240)
Rest server:
- Add quotes for masked secrets field in protocol
- Trap SIGTERM in entrypoint to avoid yarn container early stop #2947
- fix bug #3009
- Update http errors in get job v2 #3022
- User Migrate Script Fix. (#3090)
- Fix issue in updateUserVirtualCluster of rest-server (565073)
- Fix user migration issue (#3036)
- api permission fix (#3211)
- Fix AAD group in dedicated vc create/delete (#3143)
- API to create vc and remove vc and do the same operation to group (#3064)
- Groupname schema (#3099)
Hadoop:
- Increase HDFS client default replica to 3 #2924
- port conflict #3012
- Pre check vclist from yarn and remove the vc type group which is not in. (#3158)
- fix and improvement (#3093)
Deployment:
- fix configMigration script bug (#3220)
- user migrate script issue fix (#3209)
- Move npm version to dockerfile #3109 (#3110)

Known Issues

All lines in the command are concatenated with &&, so using # or \ in the command will cause bugs. This will be fixed in the future.
According to the official documentation, different GPU driver versions may support different CUDA versions. In our tests, the current 384.111 GPU driver does not support CUDA 10 images.
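
The first known issue above can be illustrated with a minimal sketch. It assumes the command lines are joined with " && " exactly as the note describes; the helper name join_command_lines and the sample commands are hypothetical and are not taken from the OpenPAI code base.

```python
def join_command_lines(lines):
    """Join command lines the way the note describes (assumed: ' && ' separator)."""
    return " && ".join(lines)


# A trailing "#" comment swallows every command joined after it,
# because the shell treats the rest of the single line as a comment.
with_comment = join_command_lines(["echo start  # setup step", "echo done"])

# A line-continuation backslash no longer precedes a newline;
# after joining it escapes the following space instead.
with_backslash = join_command_lines(["python train.py \\", "--epochs 10"])

print(with_comment)
print(with_backslash)
```

In the first result, "echo done" sits behind the comment marker and would never run. In the second, the backslash is followed by " && ", so the option line is treated as a separate command. Until the fix lands, keeping each command on one line without comments avoids both cases.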

v0.13.0

4 years ago

Release v0.13.0

New Features

OpenPAI protocol:
- Introduce OpenPAI protocol and job submission v2 (#2260)
- Add new job submission v2 plugin (#2461)
Web portal:
- Add login page for guests (#2544)
- Add user home page (#2614)
  - Job Status
  - My virtual clusters
  - Available GPU nodes (whole cluster)
  - My recent jobs
- Add new user management page (#2726, #2796)
- User Management UX refactoring with new layout and themes (#2726, #2796)

Improvements

OpenPAI protocol:
- Update example jobs in marketplace v2 for OpenPAI protocol (#2827)
Web portal:
- Refine styles in job pages (#2829, #2856, #2858, #2862)
- Refine alert message in job pages (#2698)
- Reduce the build bundle size to improve webportal performance (#2715)
Rest server:
- Add job v1 config to v2 converter (#2756)
- Check default runtime before starting Docker (#2754)
Framework launcher:
- Upgrade to Hadoop 2.9.0 (#2704)
Job exporter:
- Change triggering rule for exporter hangs (#2766)
- Add GPU temperature detection (#2757)
Watchdog:
- Use /api/v1/pods to get all pods (#2750)
Deployment:
- Allow user to use Backspace in paictl input (#2769)
- Disable InfiniBand driver installation by default (#2595)

Documentation

Refine document of VS Code extension (#2707)
Add document for PAI storage (#2822)
OpenPAI protocol specification document (#2260)
Job submission v2 plugin document (#2820)
Update RESTful API document for API v2 (#2816)
Fix typos in document (#2818)

Bug Fixes

Web portal:
- Fix text broken when create or edit user (#2849)
- Fix token authentication bug (#2843)
- Fix retry count's margin-top (#2845)
- Fix job clone bug (#2836)
- Fix home page's responsive layout (#2805)
- Fix job list page filter bug (#2787)
- Fix home page failed to load virtual cluster list bug (#2774)
Rest server:
- Check duplicate job in submission v2 (#2837)
Hadoop:
- Increase YARN kill container timeout (#2778)
- Remove cross origin in resource manager (#2758)
- Fix Hadoop AI matching nvidia-smi regex (#2681)

Known Issues

Deployment issues on NVIDIA DGX2 (#2742)

v0.12.0

5 years ago

Release v0.12.0

New Features

Web portal:
- Display error message in job detail page #2456
- Import users from CSV file directly and show the final results #2495
- Add TotalGpuCount and TotalTaskCount into job list #2499
Deployment
- Add cluster version info #2528
- Check if the nodes are ubuntu 16.04 #2520
- Check duplicate hostname #2403

Improvements

Web portal:
- Replace the suffix if a cloned job is resubmitted #2451
- Refine view full log #2431
- Job list: optimize filter #2444
- Replace the url module with the querystring module #1825
REST server:
- Follow REST protocol in job create controller #2481
- Add task state; Add job's retry details; Refine job config #2306
- Remove error message #2464
Framework Launcher:
- Add more info into SummarizedFrameworkInfo #2435
Alert manager:
- Send resolved email and allow users to configure the repeat interval #2438
- Monitor process memory consumption and alert for omiagent and omsagent #2419

Documentation

Doc refactoring and update hello-world sample #2445
Add Chinese translation #2344

Bug Fixes

Web portal:
- Add validation when submitting job by json #2375
- Job List-filter UI fix #2479
- Fix job detail "jobConfig is null" bug #2500
- Fix job detail page's "retry link" #2478
- Fix job v2 detail page rendering error #2480
REST server:
- code_dir_size report incorrect error message #2388
- fix script entrypoint #2522
- Fixed jq invocation errors with numeric taskRoles #2405
Hadoop:
- Remove duplicate diagnostics #2527
Alert manager:
- Fix alert label error #2521
Drivers:
- Add an optional configuration to skip ib drivers installation. #2514
- Fix delete script of rollback nvidia runtime #2370
- Fix driver parse #2458
Storage plugin
- Add environment and handle corner cases #2525

Known Issues

N/A

Upgrading from Earlier Release

Please follow Upgrading to v0.12 for detailed instructions.

v0.11.0

5 years ago

Release v0.11.0

New Features

Support team wise NFS storage, including:
- An NFS configuration plug-in and a commandline tool. #2346
- A simple NFS-job submit plug-in. #2358
Refer to Simplified Job Submission for OpenPAI + NFS deployment for more details.
New alerts for unhealthy GPUs, currently including the following alerts #2209:
- gpu used by zombie container
- gpu used by external process
- gpu ecc error
- gpu hangs
- gpu memory leak
Admins can see all running jobs on a node. #2197
Filter support in Job List View. #302
Hold the Env for failed jobs which are caused by user error. #2272

Improvements

Service

Webportal:
- New job list page look and feel. #302
- New job detail page: #2211
Alert-manager: Increase node memory and CPU threshold to reduce false alerts. #2345, #2296
Hadoop: Persist yarn and hdfs service log to host. #2244
Runtime: Support samba shares in container. #2318

Documentation

Add troubleshooting guide for jobs. #2305
Refine document for new user to submit job. #2278

Examples

Remove TensorFlow mpi example which cannot be run currently. #2337

Others

Operations: Add a commandline tool to query unhealthy gpu information from prometheus. #2319

Notable Fixes

Hadoop: Scheduler may get stuck in an indefinite loop. #2365
Hadoop: Sometimes, hadoop-ai can't detect ecc error. #2343
Runtime: Users might see unallocated gpus. #2352
Runtime: Jobs might get a free retry when they use excess memory. #1108
Drivers: Fix IB installation bugs. #2278, #2271, #2269

Known Issues

There might be a mismatch between linux kernel and driver. #2446
Retry link of new job details page is missing. #2466

Upgrading from Earlier Release

Please follow Upgrading to v0.11 for detailed instructions.

v0.10.1

5 years ago

Release v0.10.1

New Features

Admins can configure MaxCapacity for a given Virtual Cluster through the REST API so that the virtual cluster can use idle resources as a bonus. #2147
Support Azure RDMA. #2091; how-to doc
New Disk Cleaner for abnormal disk usage: The disk cleaner checks disk usage every 60 seconds (configurable). If disk usage is above 94% (configurable), it kills the container that uses the most disk space with a specific signal (10); the container exits with code 1 and the related job fails. Admins and users can track the reason in the job logs. #2119
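The disk cleaner's decision rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name pick_container_to_kill, the constant names, and the shape of the container_sizes mapping are hypothetical, and only the 60-second interval, the 94% threshold, and signal 10 come from the note.

```python
CHECK_INTERVAL_SECONDS = 60   # configurable, per the release note
USAGE_THRESHOLD_PERCENT = 94  # configurable, per the release note
KILL_SIGNAL = 10              # signal number stated in the release note


def pick_container_to_kill(disk_usage_percent, container_sizes,
                           threshold=USAGE_THRESHOLD_PERCENT):
    """Return the container using the most disk if usage is above the threshold.

    container_sizes is a hypothetical mapping of container name -> bytes used.
    Returns None when usage is at or below the threshold or no containers exist.
    """
    if disk_usage_percent <= threshold or not container_sizes:
        return None
    return max(container_sizes, key=container_sizes.get)


sizes = {"job-a": 10_000, "job-b": 35_000, "job-c": 20_000}
print(pick_container_to_kill(95.0, sizes))
print(pick_container_to_kill(80.0, sizes))
```

A real cleaner would call a function like this once per check interval and then send the signal to the selected container; how containers are enumerated and measured is outside what the note specifies.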
Web portal: add "My jobs" filter button. #2111
"Submit Simple Job" web portal plugin. #2131 Document

Improvements

Service

Hadoop: Improved log readability by disabling an unused HDFS short-circuit setting. #2027
Extended the job log retention time from 7 days to 30 days. Made the log retention time configurable by admins. #2034
Optimized the RM and Yarn's default configurations for PAI to reduce the resource usage by AM. #2072
Pylon: WebHDFS library compatibility. #2134
Extend the NM expiry time from 15 mins to 60 mins to better tolerate NM downtime. #2142
Alert Manager: Make the alert clearer when a service is not up. #2105
Web Portal: Allow jsonc in job submission. #2084

Deployment

Only restart the Docker daemon if the configuration is updated. #2138

Documentation

Update document about docker data root's configuration. #2052
Improved how-to-setup-dev-box.md with more details. #2087
Improved hdfs_service.md with more details. #2096

Examples

Add an example of horovod with rdma & intel mpi. #2112

Others

Build: Add error message when image build failed. #2133

Bug Fixes

Issue #2099 is fixed by
- Launcher: Revise the definition of Framework running state. #2135
- REST server: Classify two states to WAITING. #2154
Kubernetes: Disable kubernetes's pod eviction. #2124
Grafana: Use yarn's metrics in cluster view. #2148
Add /usr/local/cuda/extras/CUPTI/lib64 to LD_LIBRARY_PATH. #2043

Upgrading from Earlier Release

Known Issue

Issue: There is a known issue #2433 in the v0.10.1 upgrade that some users might hit. When it occurs, deploying a Kubernetes cluster with OpenPAI will hang. Resolution: We have provided a hotfix #2441 for it. If your organization has no urgent need to upgrade to v0.10.1 by the end of March 2019, you can postpone the upgrade by a week, by which time we will release v0.11.0 #2307, in which the known issue is officially fixed.

Please follow Upgrading to v0.10 for detailed instructions.

v0.9.1

5 years ago

Release v0.9.1

Bug Fixes:

REST Server: Fix admin permission, Closes #2172