## Configuration

[**中文文档**](./configure_cn.md)

The default value of each configuration property can be overridden in one of two ways: set the property in `$XLEARNING_HOME/conf/xlearning-site.xml` on the XLearning client, or pass it with the `--conf` parameter when submitting the application.
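
For illustration, the sketch below renders a couple of overrides as an `xlearning-site.xml` document. It assumes the file follows the Hadoop-style `<configuration>`/`<property>` layout, which this page does not spell out. The `render_site_xml` helper is hypothetical, and the chosen values are only examples.

```python
from xml.sax.saxutils import escape


def render_site_xml(overrides):
    """Render property overrides as a Hadoop-style configuration document.

    Assumption: xlearning-site.xml uses <configuration>/<property> entries
    with <name> and <value> children.
    """
    lines = ["<configuration>"]
    for name, value in overrides.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines)


# Example overrides taken from the property tables; values are illustrative.
site_xml = render_site_xml({
    "xlearning.worker.num": 2,
    "xlearning.input.strategy": "STREAM",
})
print(site_xml)
```

Escaping the names and values keeps the output well-formed even if a value contains characters such as `&` or `<`.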

### Application Configuration

Property Name | Default | Meaning  
---------------- | --------------- | ---------------  
xlearning.am.memory | 1024MB | amount of memory to use for the AM process  
xlearning.am.cores | 1 | number of cores to use for the AM process  
xlearning.worker.num | 1 | number of worker containers to use for the application  
xlearning.worker.memory | 1024MB | amount of memory to use for the worker process  
xlearning.worker.cores | 1 | number of cores to use for the worker process    
xlearning.chief.worker.memory | 1024MB | amount of memory to use for the chief worker (the index 0 worker of a TensorFlow application); defaults to the worker memory setting  
xlearning.evaluator.worker.memory | 1024MB | amount of memory to use for the evaluator worker of a TensorFlow Estimator application; defaults to the worker memory setting  
xlearning.ps.num | 0 | number of ps containers to use for the application  
xlearning.ps.memory | 1024MB | amount of memory to use for the ps process  
xlearning.ps.cores | 1 | number of cores to use for the ps process   
xlearning.app.queue | DEFAULT | the queue to which the application is submitted  
xlearning.app.priority | 3 | priority of the application, from level 0 to 5, corresponding to DEFAULT, VERY\_LOW, LOW, NORMAL, HIGH, VERY\_HIGH  
xlearning.input.strategy | DOWNLOAD | loading strategy for input files: DOWNLOAD, STREAM, or PLACEHOLDER  
xlearning.inputfile.rename | false | whether to rename downloaded files under the DOWNLOAD input strategy  
xlearning.stream.epoch | 1 | number of times the input files are loaded under the STREAM input strategy  
xlearning.input.stream.shuffle | false | whether to shuffle the input splits under the STREAM input strategy  
xlearning.inputformat.class | org.apache.hadoop.mapred.TextInputFormat.class | the inputformat implementation to use under the STREAM input strategy  
xlearning.inputformat.cache | false | whether to cache the inputformat file locally when the stream epoch is greater than 1  
xlearning.inputformat.cachefile.name | inputformatCache.gz | the local cache file name for inputformat  
xlearning.inputformat.cachesize.limit | 100*1024 | size limit of the local cache file (in MB)  
xlearning.output.local.dir | output | local directory for output files, used when no local output path is specified  
xlearning.output.strategy | UPLOAD | strategy for handling output files: UPLOAD or STREAM  
xlearning.outputformat.class | TextMultiOutputFormat.class | the outputformat implementation to use under the STREAM output strategy  
xlearning.interresult.dir | /interResult_ | HDFS subdirectory to which intermediate output files are uploaded  
xlearning.interresult.upload.timeout | 30 * 60 * 1000 | timeout for uploading the intermediate output (in milliseconds)  
xlearning.interresult.save.inc | false | whether to upload intermediate output files incrementally; by default, all output files are uploaded each time  
xlearning.tf.evaluator | false | whether to use the last worker as the evaluator in a distributed TensorFlow job that uses the Estimator API  
xlearning.tf.distribution.strategy | false | whether to use the TensorFlow distribution strategy API  
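
As a rough illustration of how the container properties above add up, the following sketch estimates the memory an application asks for: one AM, `xlearning.worker.num` workers, and `xlearning.ps.num` ps containers. It assumes every memory value is a plain megabyte count, and it ignores the chief and evaluator overrides as well as any rounding the YARN scheduler may apply. The `requested_memory_mb` helper is hypothetical.

```python
# Defaults copied from the table above, as megabyte counts.
DEFAULTS = {
    "xlearning.am.memory": 1024,
    "xlearning.worker.num": 1,
    "xlearning.worker.memory": 1024,
    "xlearning.ps.num": 0,
    "xlearning.ps.memory": 1024,
}


def requested_memory_mb(overrides=None):
    """Sum the AM, worker, and ps memory for a set of overrides (estimate only)."""
    conf = {**DEFAULTS, **(overrides or {})}
    return (
        conf["xlearning.am.memory"]
        + conf["xlearning.worker.num"] * conf["xlearning.worker.memory"]
        + conf["xlearning.ps.num"] * conf["xlearning.ps.memory"]
    )


default_total = requested_memory_mb()
scaled_total = requested_memory_mb({"xlearning.worker.num": 4, "xlearning.ps.num": 2})
print(default_total)  # 1024 + 1 * 1024 + 0 * 1024 = 2048
print(scaled_total)   # 1024 + 4 * 1024 + 2 * 1024 = 7168
```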


### Board Service Configuration  

Property Name | Default | Meaning  
---------------- | --------------- | ---------------  
xlearning.tf.board.enable | true | whether to start the Board service; set to false if the service is not needed  
xlearning.tf.board.worker.index | 0 | index of the worker that starts the Board service  
xlearning.tf.board.log.dir | eventLog | directory in which the TensorBoard event log is saved  
xlearning.tf.board.history.dir | /tmp/XLearning/eventLog | HDFS path to which the TensorBoard event log is uploaded  
xlearning.tf.board.reload.interval | 1 | how often the TensorBoard backend loads more event log data (in seconds)  
xlearning.board.modelpb | "" | model proto in ONNX format for VisualDL  
xlearning.board.cache.timeout | 20 | memory cache timeout duration in seconds for VisualDL  
xlearning.tf.board.path | tensorboard | path to TensorBoard  
xlearning.board.path | visualDL | path to VisualDL  


### System Configuration

Property Name | Default | Meaning  
---------------- | --------------- | ---------------  
xlearning.container.extra.java.opts | "" | extra JVM options that the ApplicationMaster passes when launching a container  
xlearning.allocate.interval | 1000ms | interval at which the AM fetches the container allocation state from the RM  
xlearning.status.update.interval | 1000ms | interval at which the AM reports its state to the RM  
xlearning.task.timeout | 5 * 60 * 1000 | communication timeout between the AM and a container (in milliseconds)  
xlearning.task.timeout.check.interval | 3 * 1000 | how often the AM checks for container timeouts (in milliseconds)  
xlearning.localresource.timeout | 5 * 60 * 1000 | timeout for downloading the localResources (in milliseconds)  
xlearning.messages.len.max | 1000 | maximum size (in bytes) of the message queue  
xlearning.execute.node.limit | 200 | maximum number of nodes that the application may use  
xlearning.staging.dir | /tmp/XLearning/staging | HDFS directory to which the application's local resources are uploaded  
xlearning.cleanup.enable | true | whether to delete the resources after the application finishes  
xlearning.container.maxFailures.rate | 0.5 | maximum fraction of containers that may fail  
xlearning.download.file.retry | 3 | maximum number of retries for an input file download under the DOWNLOAD input strategy  
xlearning.download.file.thread.nums | 10 | number of threads used to download input files under the DOWNLOAD strategy  
xlearning.upload.output.thread.nums | 10 | number of threads used to upload output files under the UPLOAD strategy  
xlearning.container.heartbeat.interval | 10 * 1000 | interval between heartbeats from each container to the AM (in milliseconds)  
xlearning.container.heartbeat.retry | 3 | maximum number of retries when a container sends a heartbeat to the AM  
xlearning.container.update.appstatus.interval | 3 * 1000 | how often the containers fetch the state of the application process (in milliseconds)  
xlearning.container.auto.create.output.dir | true | If set to true, the containers create the local output path automatically  
xlearning.log.pull.interval | 10000 | interval at which the client fetches the log output of the AM (in milliseconds)  
xlearning.user.classpath.first | true | whether the user job jar should be first on the classpath  
xlearning.worker.mem.autoscale | 0.5 | ratio by which worker memory is automatically scaled when the application retries after a failure  
xlearning.ps.mem.autoscale | 0.2 | ratio by which ps memory is automatically scaled when the application retries after a failure  
xlearning.app.max.attempts | 1 | number of application attempts; with the default of 1, the application is not retried after a failure  
xlearning.report.container.status | true | whether the client reports the status of the containers  
xlearning.env.maxlength | 102400 | maximum length of an environment variable when a container executes the user program  
xlearning.am.env.[EnvironmentVariableName] | (none) |  Add the environment variable specified by EnvironmentVariableName to the AM process. The user can specify multiple of these to set multiple environment variables.  
xlearning.container.env.[EnvironmentVariableName] | (none) | Add the environment variable specified by EnvironmentVariableName to the Container process. The user can specify multiple of these to set multiple environment variables.  
xlearning.am.nodeLabelExpression | (none) | A YARN node label expression that restricts the set of nodes AM will be scheduled on.  
xlearning.worker.nodeLabelExpression | (none) | A YARN node label expression that restricts the set of nodes Worker will be scheduled on.  
xlearning.ps.nodeLabelExpression | (none) | A YARN node label expression that restricts the set of nodes PS will be scheduled on.  
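
The `xlearning.am.env.[EnvironmentVariableName]` and `xlearning.container.env.[EnvironmentVariableName]` entries expand to one property per variable. A minimal sketch of that naming pattern follows; the `env_properties` helper and the variable names and values are hypothetical and serve only as examples.

```python
def env_properties(role, env):
    """Build one property per environment variable for the AM or the containers.

    role is "am" or "container", matching the two property prefixes in the table.
    """
    if role not in ("am", "container"):
        raise ValueError("role must be 'am' or 'container'")
    return {f"xlearning.{role}.env.{name}": value for name, value in env.items()}


# Illustrative variables; pick whatever your program actually needs.
container_props = env_properties(
    "container",
    {"PYTHONUNBUFFERED": "1", "OMP_NUM_THREADS": "4"},
)
for key, value in container_props.items():
    print(f"{key} = {value}")
```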


### History Configuration  

Property Name | Default | Meaning   
---------------- | --------------- | ---------------  
xlearning.history.log.dir | /tmp/XLearning/history | the HDFS directory that saves the history log  
xlearning.history.log.delete-monitor-time-interval | 24 * 60 * 60 * 1000 | interval at which the application history logs are checked for cleanup (in milliseconds)  
xlearning.history.log.max-age-ms | 24 * 60 * 60 * 1000 | how long a history log is kept (in milliseconds)  
xlearning.history.port | 10021 | port for the history service  
xlearning.history.address | 0.0.0.0:10021 | address for the history service  
xlearning.history.webapp.port | 19886 | port for the history http web service  
xlearning.history.webapp.address | 0.0.0.0:19886 | address for the history http web service  
xlearning.history.webapp.https.port | 19885 | port for the history https web service  
xlearning.history.webapp.https.address | 0.0.0.0:19885 | address for the history https web service  
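
Several defaults on this page are written as products, such as `24 * 60 * 60 * 1000`. The sketch below works out the plain millisecond count and a readable duration for such an expression. Whether the configuration file itself accepts the product form is not stated here, so the sketch only does the arithmetic; both helper names are hypothetical.

```python
from functools import reduce
from operator import mul


def eval_product(expr):
    """Evaluate a default written as a product of integers, e.g. '5 * 60 * 1000'."""
    return reduce(mul, (int(factor) for factor in expr.split("*")), 1)


def describe_ms(ms):
    """Express a millisecond count in the largest unit that divides it evenly."""
    units = (("d", 86_400_000), ("h", 3_600_000), ("min", 60_000), ("s", 1_000))
    for unit, size in units:
        if ms % size == 0:
            return f"{ms // size} {unit}"
    return f"{ms} ms"


max_age_ms = eval_product("24 * 60 * 60 * 1000")
print(max_age_ms, "->", describe_ms(max_age_ms))  # 86400000 -> 1 d
print(describe_ms(eval_product("30 * 60 * 1000")))  # 30 min
print(describe_ms(eval_product("3 * 1000")))  # 3 s
```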


### MPI Configuration

Property Name | Default | Meaning   
---------------- | --------------- | ---------------  
xlearning.mpi.install.dir | /usr/local/openmpi | installation path of Open MPI  
xlearning.mpi.extra.ld.library.path | (none) | extra library path that Open MPI needs  
xlearning.mpi.container.update.status.retry | 3 | number of retries for the container status update  


### Docker Configuration

Property Name | Default | Meaning   
---------------- | --------------- | ---------------  
xlearning.container.type | yarn | type of container in which the application runs  
xlearning.docker.registry.host | (none) | Docker registry host  
xlearning.docker.registry.port | (none) | Docker registry port  
xlearning.docker.image | (none) | Docker image name  
xlearning.docker.worker.dir | /work | working directory inside the Docker container
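
As one possible way to pass the Docker properties at submission time, the sketch below flattens a set of overrides into `--conf` arguments. It assumes the `--conf key=value` form and that `docker` is the value that selects Docker containers; neither is confirmed on this page. The registry host, port, and image name are hypothetical placeholders, as is the `conf_args` helper.

```python
def conf_args(overrides):
    """Flatten overrides into ['--conf', 'key=value', ...] (assumed syntax)."""
    args = []
    for key, value in overrides.items():
        args += ["--conf", f"{key}={value}"]
    return args


docker_args = conf_args({
    "xlearning.container.type": "docker",  # assumed alternative to the default "yarn"
    "xlearning.docker.registry.host": "registry.example.com",  # hypothetical host
    "xlearning.docker.registry.port": 5000,  # hypothetical port
    "xlearning.docker.image": "example/tensorflow:latest",  # hypothetical image
})
print(" ".join(docker_args))
```

The resulting list can be appended to whatever submit command you already use.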