SMAP

NAME

smap - graphically view information about SLURM jobs, partitions, and set configuration parameters.

SYNOPSIS

smap [OPTIONS...]

DESCRIPTION

smap is used to graphically view job, partition and node information for a system running SLURM. Note that information about nodes and partitions to which a user lacks access will always be displayed to avoid obvious gaps in the output. This is equivalent to the --all option of the sinfo and squeue commands.

OPTIONS

-c, --commandline

Print output to the command line instead of using the curses interface.

-D <option>, --display=<option>

Sets the display mode for smap, showing relevant information about a specific view along with a corresponding node chart. While in any display, a user can switch views by typing a different view letter. This is true in all modes except configure mode, in which the user types "quit" to leave configure mode only; typing "exit" ends configure mode and exits smap. Note that unallocated nodes are indicated by a "." and nodes in the DOWN, DRAINED or FAIL state by a "#".

j

Displays information about jobs running on the system.

s

Displays information about SLURM partitions on the system.

b

Displays information about BlueGene (BG) partitions on the system.

c

Displays current node states and allows users to configure the system.

−h, −−noheader

Do not print a header on the output.

--help

Print a message describing all smap options.

-i <seconds>, --iterate=<seconds>

Print the state on a periodic basis, sleeping for the indicated number of seconds between reports. The user can exit at any time by typing "q" or pressing the Return key. In configure mode, type "exit" to exit the program or "quit" to leave configure mode.

-p, --parse

Used with the -c (--commandline) option. Do not format the output; instead, write fields delimited by a single tab to stdout.
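As a rough illustration of consuming -c -p output from a script, the Python sketch below splits each line on tabs. It assumes, without confirmation from this page, that the first line is a header naming the fields (pass fieldnames explicitly when using -h). The function name parse_tab_output and the sample text are hypothetical.

```python
def parse_tab_output(text, fieldnames=None):
    """Turn tab-delimited `smap -c -p` output into a list of dicts.

    Assumption: unless fieldnames is given, the first non-blank line
    is treated as a header row naming the fields.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if fieldnames is None:
        if not rows:
            return []
        fieldnames, rows = rows[0], rows[1:]
    return [dict(zip(fieldnames, row)) for row in rows]


# Hypothetical sample; the real column set depends on the display mode.
sample = "ID\tJOBID\tST\na\t12345\tR\nb\t12346\tR\n"
for record in parse_tab_output(sample):
    print(record["JOBID"], record["ST"])
```

Keeping the header optional lets the same helper handle output produced with or without -h.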

-R <RACK_MIDPLANE_ID/XYZ>, --resolve=<RACK_MIDPLANE_ID/XYZ>

Return the XYZ coordinates for a rack/midplane ID, or vice versa.

To get the XYZ coordinates for a rack/midplane ID, enter -R R101, where 10 is the rack and 1 is the midplane.

To get the rack/midplane ID from XYZ coordinates, enter -R 101 with no leading "R", where X=1, Y=0 and Z=1.
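The two input forms accepted by -R can be told apart as in the Python sketch below. It only classifies and splits the argument (rack digits, then one midplane digit; or one digit per X, Y, Z); the actual rack/midplane to XYZ lookup is performed by smap and is not reproduced here. The function name parse_resolve_arg is hypothetical.

```python
import re


def parse_resolve_arg(arg):
    """Classify a --resolve argument without performing the lookup itself.

    'R101' -> rack '10', midplane '1'; '101' -> X=1, Y=0, Z=1.
    """
    # Leading 'R': all digits but the last are the rack, the last is the midplane.
    m = re.fullmatch(r"R(\d+)(\d)", arg)
    if m:
        return {"kind": "rack_midplane", "rack": m.group(1), "midplane": m.group(2)}
    # No leading 'R': exactly one digit per dimension.
    m = re.fullmatch(r"(\d)(\d)(\d)", arg)
    if m:
        x, y, z = (int(g) for g in m.groups())
        return {"kind": "xyz", "x": x, "y": y, "z": z}
    raise ValueError(f"unrecognized resolve argument: {arg!r}")


print(parse_resolve_arg("R101"))
print(parse_resolve_arg("101"))
```

The rack is kept as a string so that leading zeros in IDs such as R001 are preserved.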

--usage

Print a brief message listing the smap options.

-V, --version

Print version information and exit.

INTERACTIVE OPTIONS

When using smap in curses mode you can scroll through the different windows using the arrow keys. The up and down arrow keys scroll the window containing the grid, and the left and right arrow keys scroll the window containing the text information.

OUTPUT FIELD DESCRIPTIONS

AVAIL

Partition state: up or down.

BG_BLOCK

BlueGene Block Name.

CONN

Connection Type: TORUS or MESH or SMALL (for small blocks).

ID

Key to identify the nodes associated with this entity in the node chart.

MODE

Mode Type: COPROCESS or VIRTUAL.

NAME

Name of the job.

NODELIST or BP_LIST

Names of nodes or base partitions associated with this configuration/partition.

NODES

Count of nodes or base partitions with this particular configuration.

PARTITION

Name of a partition. Note that the suffix "*" identifies the default partition.

ST

State of a job in compact form. Possible states include: PD (pending), R (running), S (suspended), CG (completing), CD (completed), F (failed), TO (timeout), and NF (node failure). See JOB STATE CODES section below for more information.

STATE

State of the nodes. Possible states include: allocated, completing, down, drained, draining, fail, failing, idle, and unknown, plus their abbreviated forms: alloc, comp, down, drain, drng, fail, failg, idle, and unk respectively. Note that the suffix "*" identifies nodes that are presently not responding. See the NODE STATE CODES section below for more information.

TIMELIMIT

Maximum time limit for any user job, in days-hours:minutes:seconds. The value "infinite" identifies jobs or partitions without a job time limit.
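A script comparing limits may want them in seconds. The Python sketch below converts the days-hours:minutes:seconds form and maps "infinite" to None. Accepting the shorter hours:minutes:seconds and minutes:seconds forms is an assumption based on the TIME column in the example output further down; the function name parse_time_limit is hypothetical.

```python
def parse_time_limit(raw):
    """Convert a TIMELIMIT value to seconds; 'infinite' returns None.

    Accepts days-hours:minutes:seconds. The shorter hours:minutes:seconds
    and minutes:seconds forms are an assumption, not documented behavior.
    """
    text = raw.strip().lower()
    if text == "infinite":
        return None
    has_days = "-" in text
    days = 0
    if has_days:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [int(f) for f in text.split(":")]
    allowed = (3,) if has_days else (2, 3)
    if len(fields) not in allowed:
        raise ValueError(f"unsupported time format: {raw!r}")
    while len(fields) < 3:
        fields.insert(0, 0)  # right-align: "43:12" is minutes:seconds
    hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(parse_time_limit("1-02:03:04"))
print(parse_time_limit("infinite"))
```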

TOPOGRAPHY INFORMATION

The node chart is designed to indicate the relative locations of the nodes. On most Linux clusters this will represent a one-dimensional array of nodes. Larger clusters will use multiple lines as needed, with the right side of one line logically followed by the left side of the next line.
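The wrapping rule for a one-dimensional chart can be sketched in a few lines of Python. The line width and the single space between symbols are assumptions for illustration; smap chooses its own width from the window size. The function name wrap_chart is hypothetical.

```python
def wrap_chart(symbols, width):
    """Lay a one-dimensional node chart out over lines of `width` symbols.

    Node i + 1 follows node i; when a line is full, the next node starts
    at the left side of the following line.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    return [" ".join(symbols[i:i + width]) for i in range(0, len(symbols), width)]


for line in wrap_chart(list("aaabbb..#c"), 4):
    print(line)
```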

On BlueGene systems, the node chart will indicate the three-dimensional topography of the system. The X dimension will increase from left to right on a given line. The Y dimension will increase in planes from bottom to top. The Z dimension will increase within a plane from the back line to the front line of a plane. Note the example below:

a a a a b b d d
a a a a b b d d
a a a a b b c c
a a a a b b c c

a a a a b b d d
a a a a b b d d
a a a a b b c c
a a a a b b c c

a a a a . . d d
a a a a . . d d
a a a a . . e e      Y
a a a a . . e e      |
                     |
a a a a . . d d      0----X
a a a a . . d d     /
a a a a . . . .    /
a a a a . . . #   Z

ID JOBID PARTITION BG_BLOCK USER   NAME ST TIME  NODES BP_LIST
a  12345 batch     RMP0     joseph tst1 R  43:12 32k   bgl[000x333]
b  12346 debug     RMP1     chris  sim3 R  12:34 8k    bgl[420x533]
c  12350 debug     RMP2     danny  job3 R  0:12  4k    bgl[622x733]
d  12356 debug     RMP3     dan    colu R  18:05 8k    bgl[600x731]
e  12378 debug     RMP4     joseph asx4 R  0:34  2k    bgl[612x713]
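The axis rules can be checked against this example with a Python sketch that rebuilds the chart from the BP_LIST column: planes are printed from the highest Y down, lines within a plane run from Z=0 (back) to the front, and X runs left to right. The function names, the 8x4x4 dimensions and the down node at X=7, Y=0, Z=3 are assumptions read off the example rather than smap internals, and only the single-digit bgl[XYZxXYZ] form is parsed.

```python
import re


def parse_bp_range(bp_list):
    """Parse a BP_LIST like 'bgl[000x333]' into inclusive (x, y, z) corners."""
    m = re.fullmatch(r"\w+\[(\d)(\d)(\d)x(\d)(\d)(\d)\]", bp_list)
    if m is None:
        raise ValueError(f"unsupported BP_LIST: {bp_list!r}")
    d = [int(g) for g in m.groups()]
    return tuple(d[:3]), tuple(d[3:])


def render_chart(jobs, down, dims):
    """jobs: [(chart_id, bp_list)]; down: set of (x, y, z); dims: (nx, ny, nz)."""
    nx, ny, nz = dims
    grid = {(x, y, z): "." for x in range(nx) for y in range(ny) for z in range(nz)}
    for chart_id, bp_list in jobs:
        (x0, y0, z0), (x1, y1, z1) = parse_bp_range(bp_list)
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                for z in range(z0, z1 + 1):
                    grid[(x, y, z)] = chart_id
    for coord in down:
        grid[coord] = "#"
    planes = []
    for y in reversed(range(ny)):  # top plane of the chart is the highest Y
        lines = [" ".join(grid[(x, y, z)] for x in range(nx))  # X left to right
                 for z in range(nz)]  # Z from the back line to the front line
        planes.append("\n".join(lines))
    return "\n\n".join(planes)


JOBS = [("a", "bgl[000x333]"), ("b", "bgl[420x533]"), ("c", "bgl[622x733]"),
        ("d", "bgl[600x731]"), ("e", "bgl[612x713]")]
print(render_chart(JOBS, down={(7, 0, 3)}, dims=(8, 4, 4)))
```

Running it prints the four planes shown above (without the axis legend).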

CONFIGURATION INSTRUCTIONS

For administrator use. From this screen, one can create a configuration file that is used to partition and wire the system into usable blocks.

OUTPUT

BG_BLOCK

BlueGene Block Name.

CONN

Connection Type: TORUS or MESH or SMALL (for small blocks).

ID

Key to identify the nodes associated with this entity in the node chart.

MODE

Mode Type: COPROCESS or VIRTUAL.

INPUT COMMANDS

resolve <RACK_MIDPLANE_ID/XYZ>

Return the XYZ coordinates for a rack/midplane ID, or vice versa.

To get the XYZ coordinates for a rack/midplane ID, enter resolve R101, where 10 is the rack and 1 is the midplane.

To get the rack/midplane ID from XYZ coordinates, enter resolve 101 with no leading "R", where X=1, Y=0 and Z=1.

load <bluegene.conf file>

Load an existing bluegene.conf file. This verifies and maps out the file. Once loaded, the configuration may be edited and saved as a new file.

create <size> <options>

Submit a request for partition creation. The size may be specified either as a count of base partitions or as specific dimensions in the X, Y and Z directions separated by "x", for example "2x3x4". A variety of options may be specified; the valid options are listed below. Note that the options and their values are case-insensitive (e.g. "MESH" and "mesh" are equivalent).

Start = XxYxZ

Identify where to start the partition. This is primarily for testing purposes. For convenience, one may specify only the X coordinate, or only X and Y (XxY). The default value is 0x0x0.

Connection = MESH | TORUS | SMALL

Identify how the nodes should be connected in the network. The default value is TORUS.

Small

Equivalent to "Connection=Small". If a small connection is specified, the chosen base partition is divided into smaller partitions based on the NodeCards and Quarters options. These numbers will be adjusted so that the smaller partitions take up the entire base partition. Size does not need to be specified with a small request; it always defaults to one base partition for allocation.

Mesh

Equivalent to "Connection=Mesh".

Torus

Equivalent to "Connection=Torus".

Rotation = TRUE | FALSE

Specifies that the geometry specified in the size parameter may be rotated in space (e.g. the Y and Z dimensions may be switched). The default value is FALSE.

Rotate

Equivalent to "Rotation=true".

Elongation = TRUE | FALSE

If TRUE, permit the geometry specified in the size parameter to be altered as needed to fit available resources. For example, an allocation of "4x2x1" might be used to satisfy a size specification of "2x2x2". The default value is FALSE.

Elongate

Equivalent to "Elongation=true".
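The create syntax described above (a size, then options with their stated defaults and shorthand keywords, all case-insensitive) can be modelled with the Python sketch below. It is not smap's parser: it assumes options are written as key=value without spaces (as in "Connection=Small"), and it assumes omitted Start coordinates default to 0. CreateRequest and the parse_* helpers are hypothetical names.

```python
from dataclasses import dataclass

# Shorthand keywords and the key=value pairs they stand for.
SHORTHANDS = {
    "small": ("connection", "small"),
    "mesh": ("connection", "mesh"),
    "torus": ("connection", "torus"),
    "rotate": ("rotation", "true"),
    "elongate": ("elongation", "true"),
}
VALID_KEYS = {"start", "connection", "rotation", "elongation"}


@dataclass
class CreateRequest:
    size: tuple                      # ("count", n) or ("dims", (x, y, z))
    start: tuple = (0, 0, 0)
    connection: str = "TORUS"
    rotation: bool = False
    elongation: bool = False


def parse_size(text):
    parts = text.lower().split("x")
    if len(parts) == 1:
        return ("count", int(parts[0]))
    if len(parts) == 3:
        return ("dims", tuple(int(p) for p in parts))
    raise ValueError(f"bad size: {text!r}")


def parse_start(text):
    # Assumption: omitted trailing coordinates ("1" or "1x2") default to 0.
    parts = [int(p) for p in text.lower().split("x")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"bad start: {text!r}")
    return tuple(parts + [0] * (3 - len(parts)))


def parse_bool(text):
    value = text.lower()
    if value not in ("true", "false"):
        raise ValueError(f"expected TRUE or FALSE, got {text!r}")
    return value == "true"


def parse_create(args):
    """Parse the arguments of a 'create' command into a CreateRequest."""
    tokens = args.split()
    size = None
    if tokens and "=" not in tokens[0] and tokens[0].lower() not in SHORTHANDS:
        size = parse_size(tokens.pop(0))
    opts = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            key = key.lower()
        elif key.lower() in SHORTHANDS:
            key, value = SHORTHANDS[key.lower()]
        else:
            raise ValueError(f"unknown option: {token!r}")
        if key not in VALID_KEYS:
            raise ValueError(f"unknown option: {token!r}")
        opts[key] = value
    connection = opts.get("connection", "torus").upper()
    if connection not in ("MESH", "TORUS", "SMALL"):
        raise ValueError(f"bad connection type: {connection!r}")
    if size is None:
        if connection != "SMALL":
            raise ValueError("a size is required unless the connection is SMALL")
        size = ("count", 1)  # small requests default to one base partition
    return CreateRequest(
        size=size,
        start=parse_start(opts.get("start", "0x0x0")),
        connection=connection,
        rotation=parse_bool(opts.get("rotation", "false")),
        elongation=parse_bool(opts.get("elongation", "false")),
    )


print(parse_create("2x3x4 Start=1 mesh Rotate"))
```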

copy <id> <count>

Submit a request for a partition to be copied. You may copy a specific partition by specifying its ID; by default, the last configured partition is copied. You may also specify the number of copies to be made; by default, one copy is made.

delete <id>

Delete the specified block.

down <node_range>

Set a specific node or range of nodes to the down state, for example 000, 000-111 or [000x111].

up <node_range>

Bring a specific node or range of nodes up, for example 000, 000-111 or [000x111].
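The node ranges accepted by down and up can be expanded into individual coordinates as in the Python sketch below. It assumes, by analogy with the BP_LIST column (e.g. bgl[000x333]), that a range names two opposite corners of a box with one digit per dimension, so 000-111 covers 2x2x2 = 8 nodes. The function name expand_node_range is hypothetical.

```python
from itertools import product


def expand_node_range(spec):
    """Expand '000', '000-111' or '[000x111]' into (x, y, z) tuples.

    Assumption: a range names two opposite corners of a box, one digit
    per dimension, as in the BP_LIST column.
    """
    text = spec.strip().strip("[]")
    for sep in ("-", "x"):
        if sep in text:
            low, high = text.split(sep)
            break
    else:
        low = high = text  # a single node
    start = [int(c) for c in low]
    end = [int(c) for c in high]
    if len(start) != 3 or len(end) != 3:
        raise ValueError(f"expected three-digit XYZ coordinates: {spec!r}")
    return list(product(*(range(a, b + 1) for a, b in zip(start, end))))


print(len(expand_node_range("000-111")))  # 8 nodes: a 2x2x2 box
```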

alldown

Set all nodes to down state.

allup

Set all nodes to up state.

save <file_name>

Save the current configuration to a file. If no file_name is specified, the configuration is written to a file named "bluegene.conf" in the current working directory.

clear

Clear all partitions created.

NODE STATE CODES

Node state codes are shortened as required for the field size. If the node state code is followed by "*", this indicates the node is presently not responding and will not be allocated any new work. If the node remains non-responsive, it will be placed in the DOWN state (except in the case of COMPLETING, DRAINED, DRAINING, FAIL, and FAILING nodes).

If the node state code is followed by "~", this indicates the node is presently in a power saving mode (typically running at reduced frequency).

ALLOCATED

The node has been allocated to one or more jobs.

ALLOCATED+

The node is allocated to one or more active jobs plus one or more jobs are in the process of COMPLETING.

COMPLETING

All jobs associated with this node are in the process of COMPLETING. This node state will be removed when all of the job’s processes have terminated and the SLURM epilog program (if any) has terminated. See the Epilog parameter description in the slurm.conf man page for more information.

DOWN

The node is unavailable for use. SLURM can automatically place nodes in this state if some failure occurs. System administrators may also explicitly place nodes in this state. If a node resumes normal operation, SLURM can automatically return it to service. See the ReturnToService and SlurmdTimeout parameter descriptions in the slurm.conf(5) man page for more information.

DRAINED

The node is unavailable for use per system administrator request. See the update node command in the scontrol(1) man page or the slurm.conf(5) man page for more information.

DRAINING

The node is currently executing a job, but will not be allocated to additional jobs. The node state will be changed to state DRAINED when the last job on it completes. Nodes enter this state per system administrator request. See the update node command in the scontrol(1) man page or the slurm.conf(5) man page for more information.

FAIL

The node is expected to fail soon and is unavailable for use per system administrator request. See the update node command in the scontrol(1) man page or the slurm.conf(5) man page for more information.

FAILING

The node is currently executing a job, but is expected to fail soon and is unavailable for use per system administrator request. See the update node command in the scontrol(1) man page or the slurm.conf(5) man page for more information.

IDLE

The node is not allocated to any jobs and is available for use.

UNKNOWN

The SLURM controller has just started and the node’s state has not yet been determined.
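The abbreviations listed under STATE, the "*" and "~" suffixes, and the rule for persistently non-responding nodes can be combined in a small Python sketch. It is an illustration only: timing (see SlurmdTimeout in slurm.conf(5)) is not modelled, the ALLOCATED+ form is not handled, and the names parse_node_state and state_if_still_unresponsive are hypothetical.

```python
# Abbreviated node states (from the STATE field description) and their full forms.
NODE_STATE_ABBREV = {
    "alloc": "allocated",
    "comp": "completing",
    "down": "down",
    "drain": "drained",
    "drng": "draining",
    "fail": "fail",
    "failg": "failing",
    "idle": "idle",
    "unk": "unknown",
}
FULL_STATES = set(NODE_STATE_ABBREV.values())

# States that are not moved to DOWN when the node stays non-responsive.
KEEPS_STATE_WHEN_UNRESPONSIVE = {"completing", "drained", "draining", "fail", "failing"}


def parse_node_state(field):
    """Split a STATE value such as 'drng*' or 'idle~' into its parts.

    Returns (full_state, not_responding, power_saving).
    """
    code = field.strip().lower()
    not_responding = power_saving = False
    while code and code[-1] in "*~":
        if code[-1] == "*":
            not_responding = True
        else:
            power_saving = True
        code = code[:-1]
    state = NODE_STATE_ABBREV.get(code, code)  # full names pass through unchanged
    if state not in FULL_STATES:
        raise ValueError(f"unknown node state: {field!r}")
    return state, not_responding, power_saving


def state_if_still_unresponsive(state):
    """State a node is placed in if it remains non-responsive."""
    return state if state in KEEPS_STATE_WHEN_UNRESPONSIVE else "down"


state, not_responding, _ = parse_node_state("drng*")
print(state, not_responding, state_if_still_unresponsive(state))
```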

JOB STATE CODES

Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of each state follows.

CA CANCELLED

Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.

CD COMPLETED

Job has terminated all processes on all nodes.

CG COMPLETING

Job is in the process of completing. Some processes on some nodes may still be active.

F FAILED

Job terminated with a non-zero exit code or other failure condition.

NF NODE_FAIL

Job terminated due to failure of one or more allocated nodes.

PD PENDING

Job is awaiting resource allocation.

R RUNNING

Job currently has an allocation.

S SUSPENDED

Job has an allocation, but execution has been suspended.

TO TIMEOUT

Job terminated upon reaching its time limit.
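The compact codes shown in the ST column map to the state names above; a Python sketch of that lookup follows. Treating CANCELLED, COMPLETED, FAILED, NODE_FAIL and TIMEOUT as "finished" is an interpretation of the descriptions above, and the names job_state_name and job_is_finished are hypothetical.

```python
# Compact ST codes and the full job state names they abbreviate.
JOB_STATE_CODES = {
    "CA": "CANCELLED",
    "CD": "COMPLETED",
    "CG": "COMPLETING",
    "F": "FAILED",
    "NF": "NODE_FAIL",
    "PD": "PENDING",
    "R": "RUNNING",
    "S": "SUSPENDED",
    "TO": "TIMEOUT",
}

# States whose descriptions say the job has terminated or was cancelled.
FINISHED_STATES = {"CANCELLED", "COMPLETED", "FAILED", "NODE_FAIL", "TIMEOUT"}


def job_state_name(code):
    """Expand a compact ST code such as 'PD' into its full state name."""
    try:
        return JOB_STATE_CODES[code.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown job state code: {code!r}") from None


def job_is_finished(code):
    """True once the job will no longer run (COMPLETING is not yet finished)."""
    return job_state_name(code) in FINISHED_STATES


print(job_state_name("PD"), job_is_finished("CG"))
```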

ENVIRONMENT VARIABLES

The following environment variables can be used to override settings compiled into smap.

SLURM_CONF

The location of the SLURM configuration file.

COPYING

Copyright (C) 2004-2007 The Regents of the University of California. Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). LLNL-CODE-402394.

This file is part of SLURM, a resource management program. For details, see <https://computing.llnl.gov/linux/slurm/>.

SLURM is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

SLURM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

SEE ALSO

scontrol(1), sinfo(1), squeue(1), slurm_load_ctl_conf(3), slurm_load_jobs(3), slurm_load_node(3), slurm_load_partitions(3), slurm_reconfigure(3), slurm_shutdown(3), slurm_update_job(3), slurm_update_node(3), slurm_update_partition(3), slurm.conf(5)
