Parallel Optimization mode¶
Overview¶
Oftentimes, each experiment testing a knob configuration runs for a long time: hours, or even days.
Optimizer Studio can distribute the workload across multiple computing nodes, testing different knob configurations in parallel.
This both speeds up the optimization process and makes better use of the available hardware resources.
Pitfalls¶
Parallel processing is not without its pitfalls:

- Optimizer Studio explores only a limited number of configurations concurrently. By default, fewer than 10 configurations are explored at once, and each configuration is measured at least `min_samples_per_config` times. This places a practical limit on the number of useful parallel computing nodes of roughly `7 x min_samples_per_config`.
- The computers running the workload should be very similar and produce similar performance results for the same knob configuration, with a reasonable standard deviation. Otherwise, Optimizer Studio will not be able to distinguish between an improvement achieved by a configuration choice and one caused by better-performing hardware.
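This practical ceiling can be estimated directly. A minimal sketch (the function name is illustrative; the factor of 7 reflects the default of fewer than 10 concurrently explored configurations mentioned above):

```python
def max_useful_workers(min_samples_per_config: int, concurrent_configs: int = 7) -> int:
    """Rough ceiling on the number of worker nodes that can stay busy:
    the engine keeps about concurrent_configs configurations in flight,
    each needing at least min_samples_per_config samples."""
    return concurrent_configs * min_samples_per_config

# With min_samples_per_config = 3, more than ~21 workers would mostly sit idle.
print(max_useful_workers(3))  # 21
```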
Parallel Optimization Workflow¶
- Select a main (admin) node to run Optimizer Studio on, and one or more worker nodes. The main node can double as a worker node.
- Install the Optimizer Studio package on each computer, both the admin and the worker nodes. Note that the license must be activated on the admin node only.
- Install the knob file(s) and workload script(s) on each node in the same location, so that they can be accessed via the same absolute path.
- Arrange a password-less SSH connection from the admin node to the worker nodes, so that no interactive login credentials are required. The recommended way is to use SSH keys with no passphrase.
- Enable HTTP access to the admin node, so that worker nodes can communicate with the optimization engine via the REST API. The default HTTP port is 8421; it can be changed via the command-line switch `optimizer-ctl init ... --http-port=8421`.
- Add a `parallel` subsection to the knob file's `workload` section.
- Optimizer Studio invokes the workload starter using the parameters passed in the `parallel` subsection. The workload starter invokes `optimizer-studio.worker` on the worker node(s), e.g. remotely via SSH.
Optimizer Studio Configuration in Parallel Mode¶
Workload Starter Script(s)¶
Optimizer Studio comes with two starter scripts, located in the Optimizer Studio installation directory:

- `optimizer-studio.starter.ssh`
- `optimizer-studio.starter.local`

Users can modify these scripts to meet their environment requirements.
Each starter script accepts its own parameters: the `optimizer-studio.starter.ssh` script accepts a list of IP addresses of the worker nodes, while the `optimizer-studio.starter.local` script accepts the number of (local) worker processes.
The purpose of the starter script is to initiate all the worker processes and exit - the worker processes are expected to keep running on their own.
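This launch-and-exit contract can be sketched in Python. The sketch is illustrative, not the shipped starter script: it uses `sleep` as a stand-in worker command so that it stays runnable, where a real starter would launch `optimizer-studio.worker` (e.g. over SSH).

```python
import subprocess

def start_workers(num_workers: int, worker_cmd=("sleep", "1")):
    """Mimic a starter script: spawn every worker process detached, then return.
    start_new_session=True detaches each worker from the starter's process
    group, so the workers keep running after the starter exits."""
    procs = []
    for _ in range(num_workers):
        procs.append(subprocess.Popen(worker_cmd, start_new_session=True))
    return procs  # a real starter script would simply exit here

workers = start_workers(2)
print(all(p.poll() is None for p in workers))  # workers are still running
```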
Worker process(es)¶
The worker process can be initiated either by the starter script or in any other way, as appropriate.
The worker process runs until the communication channel with the admin node goes down, at which point it initiates self-shutdown.
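This lifecycle can be modeled as a poll loop that gives up after a run of consecutive communication failures (the same idea as the `http_retry_limit` parameter). The fetch and run callbacks below are hypothetical stand-ins for the real REST client:

```python
def worker_loop(fetch_next_config, run_workload, retry_limit=5):
    """Sketch of a worker's lifecycle: poll the admin node for knob
    configurations and self-shutdown after retry_limit consecutive
    communication failures."""
    failures = 0
    while failures < retry_limit:
        try:
            config = fetch_next_config()  # stand-in for an HTTP request to the admin node
        except ConnectionError:
            failures += 1                 # only consecutive failures count
            continue
        failures = 0                      # channel is up again, reset the counter
        if config is None:                # admin signals there is no more work
            break
        run_workload(config)
    # falling out of the loop is the self-shutdown

# Simulated admin node that serves two configurations, then goes down for good:
pending = [{"A": 1}, {"A": 2}]
def fake_fetch():
    if pending:
        return pending.pop(0)
    raise ConnectionError("admin unreachable")

executed = []
worker_loop(fake_fetch, executed.append, retry_limit=3)
print(executed)  # [{'A': 1}, {'A': 2}]
```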
knobs.yaml Configuration File¶
Workload Section¶
To set up Optimizer Studio in parallel mode, a `parallel` subsection has to be added to the normal `workload` section (the workload declarative definition).
Example [ssh]:

```yaml
global_settings:
  ...
  pending_config_timeout: 0
  http_port: 8421
  http_buffer_size: 20480
  http_retry_limit: 5
domain:
  common:
    knobs:
      ...
metrics:
  result:
    kind: file
    path: /tmp/result.${WORKER_ID}
target: result:max
workload:
  kind: sync
  parallel:
    mode: ssh
    workers:
      - user@host1[:port]
      - user@host2[:port]
  run:
    command: |
      echo "{{A + B + C}}" > /tmp/result.${WORKER_ID}
```
Example [local]:

```yaml
...
target: result:max
workload:
  kind: sync
  parallel:
    mode: local
    num_workers: 2
  run:
    command: |
      echo "{{A + B + C}}" > /tmp/result.${WORKER_ID}
```
Parallel-specific configuration parameters¶
Parallel mode uses the normal knobs.yaml file. This section covers the parameters relevant mainly to parallel mode.
pending_config_timeout¶
Configuration attempt scheduling policy.
`pending_config_timeout: 0`
The configurations will be attempted sequentially, e.g. configurations A, B, C, D would be attempted as `AAABBBBCCDDD`. This policy is appropriate for the singular operating mode.
`pending_config_timeout: T` (where T is a time specification, e.g. `1m30s` or `1h30m`)
This policy is appropriate for the parallel operating mode. The configurations will be attempted interleaved, with timeout `T`. For example, given configurations A, B, C, D and `min_samples_per_config = 2`, the configurations would be attempted as `AABBCCDDABDB`. If more than `T` time passes between consecutive attempts of the same configuration, that configuration is retired and the next configuration is attempted.
Since the purpose of the timeout is to prevent the optimization from getting stuck on the same configuration forever, its value should be chosen with generous slack: `T = ~10 x (#configurations) x min_samples_per_config x max-execution-time`.
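The suggested slack can be computed directly; the sample numbers below are illustrative:

```python
def suggested_timeout_s(num_configs, min_samples_per_config, max_execution_time_s, slack=10):
    """T = ~slack x #configurations x min_samples_per_config x max-execution-time."""
    return slack * num_configs * min_samples_per_config * max_execution_time_s

# e.g. 7 concurrent configurations, 3 samples each, 60 s per workload run:
print(suggested_timeout_s(7, 3, 60))  # 12600 seconds, i.e. 3h30m
```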
http_port¶
Override the default HTTP port (8421).
http_buffer_size¶
HTTP buffer size [bytes], used, among other things, for passing the knob configuration to the worker nodes. For longer knob lists, this value may need to be increased beyond the default of 20480 bytes.
http_retry_limit¶
The maximum number of retries a worker attempts after an HTTP transaction failure before declaring the communication channel with the admin node down and initiating self-shutdown.