XLearning Versions

AI on Hadoop

4 years ago

Release XLearning 1.4

Support running applications in Docker
Support MPI applications
ClusterDef is available for the TensorFlow Distribution Strategy API
Allow memory to be set separately for the chief and estimator workers of a TensorFlow application
Specify the YARN node label for job execution
Upload the output with multiple threads
Allow incremental upload of intermediate results
Support regular expression matching for input paths
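
As an illustration of regular expression matching for input paths, the sketch below filters a list of candidate paths against a pattern. It is a generic Python example, not XLearning's implementation; the function name `match_input_paths` and the sample paths are hypothetical.

```python
import re


def match_input_paths(candidates, pattern):
    """Return the candidate paths that fully match the regular expression.

    Hypothetical helper for illustration only; not part of XLearning.
    """
    compiled = re.compile(pattern)
    return [path for path in candidates if compiled.fullmatch(path)]


# Sample paths are made up for the example.
paths = [
    "/data/train/part-00000",
    "/data/train/part-00001",
    "/data/train/_SUCCESS",
    "/data/eval/part-00000",
]

# Keep only the five-digit part files under /data/train.
selected = match_input_paths(paths, r"/data/train/part-\d{5}")
```

Using `fullmatch` rather than `search` means the whole path must match, so marker files such as `_SUCCESS` are excluded.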

The memory usage adjustment prompt is displayed only when the application finishes with a successful status.

5 years ago

Support lightLDA; see examples/lightLDA for usage
Support xflow; see examples/xflow for usage
Support user-defined environment variables, set through configuration parameters at submission
Set the last worker as the estimator role of the distributed TensorFlow application if the user sets tf-evaluator to true; see examples/tfEstimators for usage
Define the index of the single worker that saves the output by setting output-index
Optimize the port reservation mechanism
Data-local container allocation priority mechanism
Display resource request and usage information
Expand the ps role: more convenient rendering of metrics usage information, and output upload
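
For readers unfamiliar with port reservation, the sketch below shows one common general technique: bind to port 0 so the operating system picks a free port, and keep the socket open until the port is handed over. This is an assumption about the general idea only; the notes above do not describe how XLearning reserves ports, and `reserve_port` is a hypothetical name.

```python
import socket


def reserve_port(host="127.0.0.1"):
    """Reserve a free TCP port by binding to port 0.

    Returns the open socket and the port number. The port stays reserved
    for as long as the socket remains open. Hypothetical helper, not
    XLearning code.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 lets the OS choose an unused port
    port = sock.getsockname()[1]
    return sock, port


sock, port = reserve_port()
# The port number could now be advertised to the other containers.
# Close the socket just before the training process binds the port itself.
sock.close()
```

There is a short window between closing the socket and the real process binding the port, so a production mechanism needs to handle the case where another process takes the port in between.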

In distributed mode, a container could hang while waiting for the remaining machines' port addresses after another container failed
Redundant container requests are released after the workers are requested, and a remove-request operation is added
Application failure caused by overly long input environment variables in PLACEHOLDER mode
Control of the conditions used to judge job execution failure
Incorrect status code returned when a container exits successfully

6 years ago

Client prints the containers' status information when the state changes
Add the xlearning.localresource.timeout configuration to control the local resource download
Support VisualDL; see examples/mxnetVisualDL for usage
Support local caching when the input strategy is inputformat with epoch greater than 1

6 years ago

Worker or ps memory is automatically scaled when the application retries after a failure
Application exits as failed when container allocation exceeds the time limit
Support the user's job jars via --jars at application submission
Add CPU metrics to the web display. Note: if the Hadoop version is lower than 2.6.4, see the FAQ first.
Support more distributed deep learning frameworks, such as XGBoost and LightGBM. See the FAQ for usage details.
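
The memory auto-scaling on retry listed above can be pictured with the sketch below. The notes do not state the actual scaling policy, so the growth factor, the cap, and the function name `scaled_memory_mb` are all assumptions chosen for illustration.

```python
def scaled_memory_mb(base_mb, attempt, factor=1.5, limit_mb=65536):
    """Return the memory request for a given retry attempt.

    Hypothetical policy: multiply the base request by `factor` for each
    retry (attempt 0 is the first run) and cap the result at `limit_mb`.
    Not taken from XLearning.
    """
    return min(int(base_mb * factor ** attempt), limit_mb)


# Memory requested on the first run and on the next two retries.
requests = [scaled_memory_mb(2048, attempt) for attempt in range(3)]
```

Capping the request keeps a repeatedly failing job from asking for more memory than a single container could ever be granted.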