Parallel Optimization Mode

Overview

Often, each experiment that tests a knob configuration runs for a long time: hours or even days. This can make the whole optimization process impractically long.
Optimizer Studio can speed up the optimization process by distributing the workload across multiple computing nodes, testing different knob configurations in parallel.

Limitations

Parallel processing has the following limitations:

  1. Optimizer Studio explores only a limited number of configurations concurrently. By default, fewer than 10 configurations are explored at once, and each configuration is measured at least min_samples_per_config times. This puts a practical limit on the number of parallel computing nodes of roughly 7 x min_samples_per_config (see the example after this list).

  2. The computers running the workload should be very similar and produce similar performance results for the same knob configuration, with a reasonable standard deviation. Otherwise, Optimizer Studio cannot distinguish between an improvement achieved by a configuration choice and one caused by better-performing hardware.
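
For example, with min_samples_per_config = 3, the practical limit is roughly 7 x 3 = 21 parallel computing nodes; nodes beyond that number would mostly sit idle.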

Parallel Optimization Workflow

  1. Select a main node to run Optimizer Studio on, and one or more worker nodes. The main node can double as a worker node.
  2. Install the Optimizer Studio package on every computer, both the main node and the worker nodes.
  3. Install the workload script(s) on each worker node in the same location, so that they can be accessed via the same absolute path.
  4. Set up a password-less SSH connection from the main node to the worker nodes, so that no interactive login credentials are required. The recommended way is to use SSH keys with no passphrase (see the sketch after this list).
  5. Enable HTTP access to the main node so that the worker nodes can communicate with the optimization engine via the REST API. The default HTTP port is 8421; it can be changed with the command line switch, e.g. optimizer-studio ... --http-port=8421.
  6. Add a parallel subsection to the workload section of the knobs file.
  7. Optimizer Studio invokes the workload starter with the parameters given in the parallel subsection. The workload starter invokes optimizer-studio.worker on the worker node(s), e.g. remotely via ssh. All terminal output from the worker nodes is forwarded over ssh to the main node's terminal console.
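
As a minimal sketch of the password-less SSH setup from step 4, assuming standard OpenSSH client tools on the main node (the key type, user and host names are placeholders):

# On the main node: create a key pair with an empty passphrase (illustrative only)
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519

# Copy the public key to every worker node
ssh-copy-id user@host1
ssh-copy-id user@host2

# Verify that the connection no longer prompts for a password
ssh user@host1 true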

Optimizer Studio Configuration in Parallel Mode

Workload Starter Script(s)

Optimizer Studio comes with two starter scripts, optimizer-studio.starter.ssh and optimizer-studio.starter.local, located in the Optimizer Studio installation directory. Users can modify these scripts to meet the requirements of their environment.

Each starter script accepts its own parameters: the optimizer-studio.starter.ssh script accepts the list of IP addresses of the worker nodes, while the optimizer-studio.starter.local script accepts the number of (local) worker processes.
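
Purely as an illustration of the parameters listed above (the argument format shown here is an assumption; in normal operation Optimizer Studio invokes the starter itself with the values taken from the parallel subsection of knobs.yaml):

# Hypothetical invocations, shown only to illustrate the parameters:
./optimizer-studio.starter.ssh 192.168.1.11 192.168.1.12   # worker node IP addresses
./optimizer-studio.starter.local 2                         # number of local worker processes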

knobs.yaml Configuration File

Workload Section

To set up Optimizer Studio in parallel mode, both the workload and workload_settings sections need to be configured.
Concertio is currently working on simplifying this configuration process.

Example [ssh]:

global_settings:
  ...
  pending_config_timeout: 0
  http_buffer_size: 20480

domain:
  common:
    ...
    target: workload.target:max

workload:
  kind: sync
  parallel:
    mode: ssh
    workers:
      - user@host1[:port]
      - user@host2[:port]

workload_settings:
  workloads:
    -
      script: ./workload.sh
      metrics:
        my_target: /tmp/my_target.${WORKER_ID}

  target: my_target

Example [local]:

domain:
  common:
    ...
    target: workload.target:max

workload:
  kind: sync
  parallel:
    mode: local
    num_workers: 2

workload_settings:
  workloads:
    -
      script: ./workload.sh
      metrics:
        my_target: /tmp/my_target.${WORKER_ID}

  target: my_target
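
The workload script itself is user-provided. The following is a rough sketch of what ./workload.sh might look like, assuming that the script is expected to write the measured value as a number into the file named in the metrics entry (/tmp/my_target.${WORKER_ID}) and that WORKER_ID is available in the script's environment; the benchmark command is a placeholder:

#!/bin/bash
# Placeholder benchmark command -- replace with the real workload.
START=$(date +%s.%N)
./my_benchmark
END=$(date +%s.%N)

# Report the inverse of the runtime so that a higher value is better,
# matching the workload.target:max setting above.
SCORE=$(echo "1 / ($END - $START)" | bc -l)

# Write the measured value where the metrics entry expects to find it.
echo "$SCORE" > /tmp/my_target.${WORKER_ID}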

Parallel mode uses a normal knobs.yaml file. This section describes the parameters and sections that are mainly relevant to parallel mode.

pending_config_timeout parameter

This parameter controls the configuration attempt scheduling policy.

pending_config_timeout: 0

With a zero timeout, the configurations are attempted sequentially, e.g. configurations A, B, C, D would be attempted as AAABBBBCCDDD.
This policy is appropriate for the singular (non-parallel) operating mode.

pending_config_timeout: T
(T is a time specification, e.g. 1m30s or 1h30m)

With a nonzero timeout, the configurations are attempted in an interleaved fashion.
For example, given configurations A, B, C, D and min_samples_per_config = 2, the configurations could be attempted as AABBCCDDABDB. If more than T passes between consecutive attempts of the same configuration, that configuration is retired and the next configuration is attempted.
This policy is appropriate for the parallel operating mode.
Since the purpose of the timeout is only to prevent the optimization from being stuck on the same configuration forever, its value should be chosen with generous slack:
T ≈ (number of concurrently explored configurations, ~10) x min_samples_per_config x max-execution-time.
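
For example, with assumed numbers: if about 10 configurations are explored at once, min_samples_per_config is 3, and a single workload run takes at most 5 minutes, then T ≈ 10 x 3 x 5m = 150m:

global_settings:
  ...
  pending_config_timeout: 2h30m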

http_buffer_size parameter

The HTTP buffer size, in bytes, used, among other things, for passing the knob configuration to the worker nodes. For longer knob lists, this value may need to be increased beyond the default of 20480 bytes.
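
For instance, to double the buffer (40960 is an arbitrary illustrative value):

global_settings:
  ...
  http_buffer_size: 40960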