Resource scheduling and cluster management for AI
${PAI_WORK_DIR} before mv content to this folder #3695
openpaimarketplace to share their jobs or run and learn from jobs shared by others.
completing, retry pending to running, waiting #3636
This is an intermediate release ahead of the upcoming major PureK8S version release. Because there are breaking changes between PAI's K8S+YARN version and the PureK8S version, if you are currently using PAI's K8S+YARN version in production, please stay on version 0.14.0 and plan to upgrade later.
Web portal:
Python SDK:
New scheduler:
PAI vscode extension:
Team storage plugin:
Web portal:
Rest server:
Yarn cluster && Framework launcher:
Deployment:
Plugins:
Security:
Web portal:
<p>
tag and prop-types warning (#3067)
jobRetryCount and taskRetryCount field (#3112)
Rest server:
Hadoop:
Deployment:
&&, so using # or \ in the command will cause bugs. This will be fixed in the future.
OpenPAI protocol:
Web portal:
OpenPAI protocol:
Web portal:
Rest server:
Framework launcher:
Job exporter:
Watchdog:
/api/v1/pods to get all pods (#2750)
Deployment:
Web portal:
Rest server:
Hadoop:
Web portal:
Deployment
Web portal:
REST server:
Hadoop:
Alert manager:
Drivers:
Storage plugin
N/A
Please follow the Upgrading to v0.12 guide for detailed instructions.
Support team-wise NFS storage, including:
Refer to Simplified Job Submission for OpenPAI + NFS deployment for more details.
New alerts for unhealthy GPUs, currently including the following alerts #2209:
Admins can see all running jobs on a node. #2197
Filter support in Job List View. #302
Hold the Env for failed jobs that are caused by user error. #2272
Web portal:
Alert-manager: Increase node memory and CPU threshold to reduce false alerts. #2345, #2296
Hadoop: Persist YARN and HDFS service logs to the host. #2244
Runtime: Support Samba shares in containers. #2318
Please follow the Upgrading to v0.11 guide for detailed instructions.
Issue: There is a known issue #2433 in the v0.10.1 upgrade that some users might hit. When it occurs, deploying a Kubernetes cluster with OpenPAI will hang.
Resolution: We have provided a hotfix #2441 for it. If your organization has no urgent need to upgrade to v0.10.1 by the end of March 2019, you can postpone the upgrade for a week, by which time we will release v0.11.0 #2307, in which the known issue is officially fixed.
Please follow the Upgrading to v0.10 guide for detailed instructions.