cluster cannot accept SSH connections #393

@YqI777

Hi, I ran into a problem when using sscha to simulate the H3S example.
The sscha.Cluster module was designed for a specific workflow:

Standard workflow: the user runs the main Python script on a local workstation or on the cluster’s login node. The script submits and distributes computational tasks to remote compute nodes via ssh and scp, and manages the files.

My actual situation: I run the main Python script directly on a compute node (after submitting it with sbatch), and, for security reasons, this compute node itself has SSH service disabled.

This creates a fundamental conflict: I am executing a module that requires SSH for its operation inside an environment (my HPC compute node) that cannot accept SSH connections.
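To make the conflict concrete, here is a minimal diagnostic sketch that could be run at the top of the main script. It is not part of sscha: `inside_slurm_job` and `can_ssh` are hypothetical helper names, and the check assumes that Slurm exports `SLURM_JOB_ID` inside a job and that the `ssh` client accepts the standard OpenSSH options `BatchMode` and `ConnectTimeout`.

```python
import os
import shutil
import subprocess


def inside_slurm_job(env=None):
    """Return True when the process appears to run inside a Slurm job.

    Assumption: Slurm sets SLURM_JOB_ID in the environment of every job step.
    """
    env = os.environ if env is None else env
    return bool(env.get("SLURM_JOB_ID"))


def can_ssh(host, timeout_s=5):
    """Return True if `ssh host true` succeeds without prompting.

    Returns False when no host is given, when no ssh client is installed,
    or when the connection fails or times out.
    """
    if not host or shutil.which("ssh") is None:
        return False
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",                  # never ask for a password
        "-o", f"ConnectTimeout={timeout_s}",    # give up quickly
        host,
        "true",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s + 5)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

For example, printing `inside_slurm_job()` and `can_ssh("login1")` from the compute node would show whether the script is running under sbatch and whether the login node is reachable from there.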
What should I do?
The output is:

(base) [login1 H3S]$ cat 2out.dat 
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   cpu11
  Local device: mlx5_0
--------------------------------------------------------------------------
ssh: Could not resolve hostname none: Name or service not known
Error with cmd: ssh  None 'echo "/public/home/yan/qe/H3S"'

EXITSTATUS: 255; attempt = 1
THREAD 41879 EXECUTE COMMAND: ssh  None 'echo "/public/home/yan/qe/H3S"'
Traceback (most recent call last):
  File "/public/home/yan/qe/H3S/H3S_relax.py", line 126, in <module>
    my_hpc.setup_workdir()
  File "/public/home/yan/apps/anaconda3/envs/sscha/lib/python3.10/site-packages/sscha/Cluster.py", line 1400, in setup_workdir
    workdir = self.parse_string(self.workdir)
  File "/public/home/yan/apps/anaconda3/envs/sscha/lib/python3.10/site-packages/sscha/Cluster.py", line 1453, in parse_string
    status, output = self.ExecuteCMD(cmd, return_output = True, raise_error= True)
  File "/public/home/yan/apps/anaconda3/envs/sscha/lib/python3.10/site-packages/sscha/Cluster.py", line 402, in ExecuteCMD
    raise IOError("Error while communicating with the cluster. More than %d attempts failed." % (i+1))
OSError: Error while communicating with the cluster. More than 1 attempts failed.
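The `ssh  None` in the log suggests that the hostname was never set, so Python's `None` was formatted into the command string and `ssh` tried to resolve a host with that literal name. The sketch below reproduces that effect in isolation. It is an illustration only, not the code in `Cluster.py`: `build_ssh_command` and `build_ssh_command_checked` are hypothetical helpers.

```python
def build_ssh_command(hostname, remote_cmd):
    """Format an ssh command line the naive way (hypothetical helper).

    If hostname is None, str.format() silently inserts the text "None".
    """
    return "ssh {} '{}'".format(hostname, remote_cmd)


def build_ssh_command_checked(hostname, remote_cmd):
    """Same as above, but fail early with a clear message if no host is set."""
    if not hostname:
        raise ValueError("hostname is not set; cannot build an ssh command")
    return build_ssh_command(hostname, remote_cmd)


# An unset hostname ends up as the literal word "None" in the command.
print(build_ssh_command(None, 'echo "/public/home/yan/qe/H3S"'))
```

Checking the hostname before building the command turns the opaque "Could not resolve hostname" failure into an immediate, readable error.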

My input.py:
#my_hpc = sscha.Cluster.Cluster(mpi_cmd=r"srun -n 40",AlreadyInCluster=True)
my_hpc = sscha.Cluster.Cluster(mpi_cmd=r"srun -n 40")
#my_hpc.hostname = "login1"
my_hpc.workdir = "/public/home/yan/qe/H3S/run"
my_hpc.binary = "/public/home/apps/qe/qe-7.3.1/bin/pw.x -npool NPOOL -i PREFIX.pwi > PREFIX.pwo"
#Then we need to specify if some modules must be loaded in the submission script
my_hpc.load_modules = """##!/bin/bash
#SBATCH  --job-name=sscha
#SBATCH  --partition=cpu
#SBATCH  --nodes=1
#SBATCH  --ntasks-per-node=40
#SBATCH  --time=14-00:00:00

source /public/env/intel2021
source /public/env/openmpi-4.1.5_icc

"""
my_hpc.n_cpu = 40 # We will use 40 processors
my_hpc.n_nodes = 1 # On 1 node
my_hpc.n_pool = 4 # Quantum ESPRESSO specific: the parallel CPUs are divided into 4 pools

#We can also choose how many jobs to submit simultaneously, and how many configurations to run in each job
my_hpc.batch_size = 4
my_hpc.job_number = 8
#With these values, up to 4 * 8 = 32 configurations are processed at a time

my_hpc.set_timeout(300) # Timeout of 300 seconds
my_hpc.time = "00:20:00" # Time limit for each job

my_hpc.setup_workdir()
