wiki:ClusterInstallGuide
Last modified on 08/16/18 11:04:21

Setting up Cobalt For Use on Cluster Systems

Requirements

  • Python 2.6 or later
  • openssl
  • m2-crypto

Cobalt Installation

Installation Via RPM

Cobalt comes as two RPMs: one for the core daemons and administrative commands, and one for the clients. For cluster systems, the 'cobalt' RPM should be installed on the Cobalt host, usually a service node, where the daemons will run. The 'cobalt-clients' RPM must be installed on all login and compute nodes, as well as on the service node.

Building RPMs

You may build RPMs for your platform using the cobalt.spec file in the misc directory of the source tree. The spec file has been tested against RHEL 6 and should be compatible with RHEL-derived distributions.

Cluster System Configuration

Init Script

The Cobalt init script (included in the source tree as misc/cobalt and typically installed at /etc/init.d/cobalt) should be edited so that the list of components started by /etc/init.d/cobalt start is: slp cqm bgsched cluster_system system_script_forker user_script_forker.
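As a sketch, the component list in the init script might look like the following. The variable name SUBSYSTEMS is an assumption; check your copy of misc/cobalt for the actual name used by your version.

```shell
# Hypothetical excerpt from /etc/init.d/cobalt; the SUBSYSTEMS variable name
# is an assumption -- consult your copy of misc/cobalt for the real one.
SUBSYSTEMS="slp cqm bgsched cluster_system system_script_forker user_script_forker"
echo "$SUBSYSTEMS"
```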

Generate Keys

Cobalt uses SSL for its internal communication, and X.509 certificates must be generated. An example of how to generate these certificates is below.

# openssl req -batch -x509 -nodes -subj "/C=$COUNTRY/ST=$STATE/L=$LOCATION/CN=$HOSTNAME" -days 1000 -newkey rsa:2048 -keyout /etc/cobalt.key -noout
# openssl req -batch -new -subj "/C=$COUNTRY/ST=$STATE/L=$LOCATION/CN=$HOSTNAME" -key /etc/cobalt.key | openssl x509 -req -days 1000 -signkey /etc/cobalt.key -out /etc/cobalt.cert

These must be readable by all Cobalt components and by the setgid wrapper for client commands.
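After generating the pair, you can sanity-check that the key and certificate actually match by comparing their public-key moduli. The sketch below demonstrates this on a throwaway pair; substitute /etc/cobalt.key and /etc/cobalt.cert on a real system.

```shell
# Sketch: verify a key/certificate pair match by comparing moduli.
# Demonstrated on a scratch pair; on a real system substitute
# /etc/cobalt.key and /etc/cobalt.cert generated above.
key=$(mktemp); cert=$(mktemp)
openssl req -batch -x509 -nodes -subj "/CN=demo" -days 1 \
    -newkey rsa:2048 -keyout "$key" -out "$cert" 2>/dev/null
kmod=$(openssl rsa -noout -modulus -in "$key")
cmod=$(openssl x509 -noout -modulus -in "$cert")
# The two moduli must be identical for the pair to work together.
[ "$kmod" = "$cmod" ] && echo "key and cert match"
rm -f "$key" "$cert"
```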

cobalt.hostfile

All scheduled nodes need an entry in cobalt.hostfile. For example, for the four-node cluster copper:

vs1.copper
vs2.copper
vs3.copper
vs4.copper

The service and login nodes should not be included in this file, only the compute resource hostnames.
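A hostfile like the example above can also be generated with a short loop. This sketch writes to /tmp for illustration; install the result as /etc/cobalt.hostfile.

```shell
# Sketch: generate the four-node copper hostfile shown above.
# Writes to /tmp for illustration; install the result as /etc/cobalt.hostfile.
hostfile=/tmp/cobalt.hostfile
for i in 1 2 3 4; do
    echo "vs$i.copper"
done > "$hostfile"
cat "$hostfile"
```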

Auxiliary Scripts

Cluster systems can be further customized using auxiliary scripts, and may require them to properly configure hardware and/or access for jobs. In addition to cqm prologue and epilogue scripts, there are additional prologue and epilogue scripts specified for the cluster_system component that are run on the compute nodes to prepare for and clean up after jobs.

cqm scripts have a standard set of arguments provided to them, as described here. Additionally, a script included with Cobalt's clients, prologue_helper.py, should be added to the job_prescripts entry of the Cobalt configuration file for cluster systems. This script may be run at any point in the list of scripts provided to that option.

The cluster_system prologue and epilogue scripts are intended for node-by-node initialization of services and access controls. The prologue is called with the arguments jobid and user, where jobid is the id of the Cobalt job requesting the resources and user is the submitting user of the job. The epilogue is called with the same arguments. By default, these scripts are given a 60-second timeout, which can be changed via entries in the cobalt.conf file. These scripts must return an exit status of 0 on successful completion and a non-zero status in the event of an error; a timeout is considered an error. If the prologue fails, job initiation fails and the Cobalt job proceeds to cleanup. If the epilogue fails, whether by non-zero status or by timeout, the node is marked down and removed from normal scheduling until an administrator addresses the failure and marks it up again.
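A minimal cluster_system prologue might look like the following sketch. The argument handling and exit-status convention come from the description above; the echo stands in for real setup work (mounts, access controls, and so on), which is site-specific.

```shell
#!/bin/sh
# Minimal sketch of a cluster_system prologue, invoked as:
#   prologue.sh <jobid> <user>
# Real node setup would replace the echo below.
prologue() {
    jobid="$1"
    user="$2"
    if [ -z "$jobid" ] || [ -z "$user" ]; then
        echo "usage: prologue <jobid> <user>" >&2
        return 1          # non-zero reports failure; Cobalt aborts job initiation
    fi
    echo "preparing node for job $jobid (user $user)"
    return 0              # 0 tells Cobalt the node is ready
}

# Sample invocation with illustrative arguments; as a standalone script,
# the last line would instead be: prologue "$@"
prologue 1234 alice
```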

Utility Functions

A site may specify its own utility functions as described here. Default (FIFO) and high-priority utility functions are provided as built-ins to Cobalt.

Cobalt Configuration File

The following entries should be added to the Cobalt configuration file, which is /etc/cobalt.conf by default. Environment variables added to paths in the configuration file will be expanded.

[bgsched]
utility_file: /etc/cobalt.utility
 
[cqm]
log_dir: /var/log
job_prescripts: <path-to-prologue-helper>/prologue_helper.py

[system]
size: <number-of-compute-nodes>

[cluster_system]
launcher: <path-to-cobalt-launcher>/cobalt-launcher.py
hostfile: /etc/cobalt.hostfile
prologue: <path-to-node-prologue>/prologue.sh
prologue_timeout: 60
epilogue: <path-to-node-epilogue>/epilogue.sh
epilogue_timeout:  600
epi_epilogue: /bin/true

[components]
service-location=https://<host-running-slp>:8256
python=/usr/bin/python2.6

[communication]
key: /etc/cobalt.key
cert: /etc/cobalt.cert
ca: /etc/cobalt.cert
password: passwordyoushouldreallychange

[statefiles]
location: <path-to-statefiles>

[logger]
to_syslog: true
syslog_level: INFO
to_console: true
console_level: INFO

As a note, the cobalt-launcher path must be the path to cobalt-launcher.py on the compute nodes; by default this is /usr/bin/cobalt-launcher.py.

If you would like to try running a cluster system in a simulation mode, you can do so by adding the following lines to the [cluster_system] section of the configuration file:

[cluster_system]
simulation_mode: true
simulation_executable: <path-to-cluster_simulator_run>/cluster_simulator_run.py

The cluster_simulator_run file is included in the Cobalt source at src/clients/cluster_simulator_run.py. This lets you run a mock-up of your configuration without requiring hardware, which can be useful when trying out new scheduling options.

Initial Startup

When starting up Cobalt, first check that all components are running. This can be done via the 'slpstat' command; you should see output similar to this:

Name                  Location              Update Time
======================================================================
queue-manager         https://login1:45673  Tue Sep  3 22:50:37 2013
scheduler             https://login1:57998  Tue Sep  3 22:50:40 2013
user_script_forker    https://login1:53072  Tue Sep  3 22:50:42 2013
system                https://login1:53870  Tue Sep  3 22:50:37 2013
system_script_forker  https://login1:57026  Tue Sep  3 22:49:35 2013

This indicates that all components have started and initialized. Initialization should not take more than a few minutes for the cluster_system component, depending on system size.

Next, you will need to create a queue to associate with the compute hardware:

cqadm --addq default

After that, enable the queue so that jobs submitted to it can run:

cqadm --startq default

Finally, you will have to associate the queue with hardware. For the "copper" cluster described in the hostfile above, this would be:

nodeadm --queue default vs1.copper vs2.copper vs3.copper vs4.copper

At this point the system should be ready to run jobs submitted to the default queue (or those submitted without a -q argument to qsub).

Security Considerations

  • The cobalt.conf file has a shared secret used to authenticate components. Because of this, and the level of privilege at which Cobalt runs, it must not be world-readable and should be readable only by its own group, such as a cobalt group. The setgid wrapper (default: /usr/bin/wrapper) should be made part of this group so that clients can function.
  • Scripts run via the system_script_forker (helper scripts run by the cqm and cluster_system components) are executed as the user of system_script_forker, which is usually root or at least a privileged user.
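The permissions described in the first point can be sketched as follows. The real targets, /etc/cobalt.conf and the setgid wrapper, require root to modify, so the demonstration uses scratch files; 'cobalt' as the group name is an assumption.

```shell
# Demonstrated on scratch files; the real targets are /etc/cobalt.conf and
# the setgid wrapper (default /usr/bin/wrapper). 'cobalt' is an assumed group.
conf=$(mktemp)
wrapper=$(mktemp)
chmod 640 "$conf"        # config: owner rw, group r, no world access
chmod 2755 "$wrapper"    # wrapper: setgid bit so clients run with the group
# chgrp cobalt "$conf" "$wrapper"   # requires root and an existing cobalt group
stat -c '%a %n' "$conf" "$wrapper"
rm -f "$conf" "$wrapper"
```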