Fixing a failing SSH connection to a remote computer#
Today’s blog post features a user story from Virginie de Mestral, who kindly shared her experience tackling tricky SSH communication issues between AiiDA and a remote HPC system. This post walks you through the challenges and how they could ultimately be resolved. We hope you’ll find it helpful!
The exposition#
Recently, I configured a new remote host in the land of AiiDA, a supercomputer going by the name of Alps.
While Alps is quite powerful, it is also very fond of security.
Thus, it requires every user who demands its services to validate their SSH key every 24 hours.
Unfortunately, if a user’s job lasts longer than those 24 hours, he/she has to re-validate their key and, if needed, re-play
their simulation job[1].
Thanks to AiiDA’s provenance tracking and checkpointing, re-playing a job usually goes smoothly—past information is preserved and passed on to the next calculation.
However, it can happen that the key re-validation process fails, in which case the initial calculation is put on hold indefinitely.
This can occur (and did so for me), in spite of following exactly the same procedure for key validation as before.
The rising action#
I validated my key to access Alps the other day and sent a routine job.
Everything was working fine until the next 24 hours had passed.
After another re-validation of my key (business as usual), it suddenly had become impossible to reconnect to Alps.
The verdi computer test alps
command prompted the following message:
Error: paramiko.ssh_exception.SSHException: key cannot be used for signing
Error:
Error: Error connecting to 'daint.alps.cscs.ch' through SSH: [SshTransport] No existing session, connect_args were: {'username': 'iam', 'port': 22, 'look_for_keys': True, 'key_filename': '', 'timeout': 60, 'allow_agent': True, 'proxy_jump': '', 'proxy_command': 'ssh -i ~/.ssh/cscs-key iam@ela.cscs.ch /usr/bin/netcat daint.alps.cscs.ch 22', 'compress': True, 'gss_auth': False, 'gss_kex': False, 'gss_deleg_creds': False, 'gss_host': 'daint.alps.cscs.ch'}
[FAILED]: Error while trying to connect to the computer
Apparently, AiiDA could not find the right key anymore.
The climax#
Not knowing the exact origin of the problem, one evident attempt was to explicitly tell AiiDA where to find the key.
It might have been looking for the one among many (yes, I have a conservative tendency to accumulate SSH keys) and picked the wrong one (no offense AiiDA, everybody can make mistakes).
I therefore re-configured the SSH setup for my computer using verdi computer configure core.ssh alps
, setting the Look for keys
option to False, and instead explicitly provided the absolute path to my SSH key file
[2]:
$ verdi computer configure core.ssh alps
Report: enter ? for help.
Report: enter ! to ignore the default and set no value.
User name [iam]:
Port number [22]:
Look for keys [Y/n]: n
SSH key file [/user/.ssh/key]: /home/iam/.ssh/cscs-key
Connection timeout in s [60]:
Allow ssh agent [Y/n]: n
SSH proxy jump []: ela.cscs.ch
SSH proxy command []:
Compress file transfers [Y/n]:
GSS auth [False]:
GSS kex [False]:
GSS deleg_creds [False]:
GSS host [daint.alps.cscs.ch]:
Load system host keys [Y/n]:
Key policy (RejectPolicy, WarningPolicy, AutoAddPolicy) [RejectPolicy]:
Use login shell when executing command [Y/n]:
Connection cooldown time (s) [30.0]:
Report: Configuring computer daint.alps for user aiida@localhost.
Success: alps successfully configured for aiida@localhost
This time, verdi computer test alps
confirmed that the remote host Alps was successfully configured[3].
The falling action#
Thanks to the reconfiguration of the remote host, I could send jobs to Alps and retrieve calculations after completion.
So, problem solved?! Unfortunately, not quite…
As it turned out, I still could not access individual jobs while they were running using the handy gotocomputer
command:
(aiida) [iam@local ~/.aiida]$ verdi calcjob gotocomputer <pk>
Report: going to the remote work directory...
iam@ela.cscs.ch: Permission denied (publickey).
kex_exchange_identification: Connection closed by remote host
This is despite being able to SSH into Alps outside of AiiDA via the proxy ela
without any problem.
But when doing so using verdi calcjob gotocomputer
, I was denied permission.
The resolution#
The issue can be understood with the knowledge that the gotocomputer
command calls the regular SSH protocol of the operating system (as one ends up in an actual shell session on the remote machine).
For OS level OpenSSH, the configuration is usually given in the ~/.ssh/config
file, and mine read[4]:
Host ela
Hostname ela.cscs.ch
User <my-user>
IdentityFile ~/.ssh/cscs-key
Host alps
Hostname daint.alps.cscs.ch
User <my-user>
Proxyjump ela
Forwardagent yes
IdentityFile ~/.ssh/cscs-key
AddKeysToAgent yes
However, the actual SSH command constructed by AiiDA when calling verdi calcjob gotocomputer
uses the values given during verdi computer configure
, among others, the SSH Proxy jump
[5].
Can you spot the difference?
ela.cscs.ch
for the SSH Proxy jump
(see verdi computer configure
above), but Host ela
in my ~/.ssh/config
.
Thus, the issue could finally be resolved by adding ela.cscs.ch
to the Host
in the ~/.ssh/config
file:
Host ela, ela.cscs.ch
Hostname ela.cscs.ch
User <my-user>
IdentityFile ~/.ssh/cscs-key
Host alps
Hostname daint.alps.cscs.ch
User <my-user>
Proxyjump ela
Forwardagent yes
IdentityFile ~/.ssh/cscs-key
AddKeysToAgent yes
The denouement#
Since then, jobs run smoothly on Alps, and can be checked regularly while running via verdi calcjob gotocomputer <pk>
.
Peace has returned to the land of AiiDA.
Afterword (by the devs)#
In summary, the following two problems occurred here:
Due to changes in the local configuration of SSH keys, the correct key could not be identified automatically by AiiDA (using
paramiko
), which broke all communication of AiiDA with the HPC → This could be fixed by manually passing the correct absolute path to the key file.A discrepancy between the host that was defined in the
~/.ssh/config
file (ela
) and the host defined for theProxyJump
in AiiDA’s computer configuration (ela.cscs.ch
) led to a failure of thegotocomputer
command → This could be fixed by modifying the~/.ssh/config
file accordingly.
The occurrence of both issues in succession made this problem tricky to solve, and we hope this story is helpful if you ever find yourself encountering failures in the communication of AiiDA with a remote machine via SSH.
Lastly, we would like to expand a bit on the technical background information for the interested reader:
In AiiDA, the entity responsible for communication with the remote machine via SSH is the SshTransport
plugin
(src
here).
It has to adhere to the general Transport
interface (defined by an abstract base class), and uses the paramiko
SSH library for its actual implementation.
This is why AiiDA’s SSH communication with a remote machine differs from OS level OpenSSH calls (ssh
in the terminal) most people are used to.
Instead, as already mentioned above, the verdi calcjob gotocomputer
command is somewhat of an odd one out, as it executes an OS OpenSSH call (necessary to be able to enter a shell session on a remote machine), but actually uses configuration options from the AiiDA setup (that are used to configure paramiko
), rather than the OS ~/.ssh/config
file.
The situation can be summarized in the following table:
Route |
Implementation |
Configuration |
---|---|---|
Via the terminal |
OpenSSH of the OS |
|
AiiDA transport tasks |
paramiko |
options set during |
AiiDA |
OpenSSH of the OS |
options set during |
We are well aware of this issue, so to solve it, in the upcoming AsyncSshTransport
plugin[6], intended to be included in the upcoming AiiDA v2.7 release, the default behavior is that the ~/.ssh/config
file is automatically used to set authentication parameters, instead of having to re-enter them manually during computer configuration.
This reduces the effort for the user, as well as the number of possible error sources.
The AsyncSshTransport
plugin can be selected through the core.ssh_async
entry point during verdi computer setup
.
A configuration YAML file could look like the following:
append_text: ''
default_memory_per_machine: <lots-of-memory>
description: ''
hostname: daint.alps.cscs.ch
label: alps-async
mpiprocs_per_machine: <many-cores>
mpirun_command: mpirun -np {tot_num_mpiprocs}
prepend_text: ''
scheduler: core.slurm
shebang: '#!/bin/bash'
transport: core.ssh_async # AsyncSshTransport selected here
use_double_quotes: false
work_dir: /capstor/scratch/cscs/<my-user>/aiida
The configuration procedure then looks something like the following:
$ verdi computer configure core.ssh_async alps-async
Report: enter ? for help.
Report: enter ! to ignore the default and set no value.
Machine(or host) name as in `ssh <your-host-name>` command. (It should be a password-less setup): alps
`ssh alps` successful!
Maximum number of concurrent I/O operations. [8]:
Local script to run *before* opening connection (path) [None]:
Use login shell when executing command [Y/n]:
Connection cooldown time (s) [30.0]:
Report: Configuring computer daint-async for user aiida@localhost.
Success: alps-async successfully configured for aiida@localhost
As you can see, it’s as simple as entering the host
in the same way as one would typically do in the terminal when running the ssh
command.
The configuration, as given in the ~/.ssh/config
file is then picked up by AiiDA.
With the AsyncSshTransport
plugin, we could succesfully circumvent the second issue presented in this post.
For further reading, you can find the original Discourse post that lead to this blog story
here,
AiiDA’s general documentation on setting up remote connections
here,
and, lastly, the PR that adds the AsyncSshTransport
plugin
here.
Footnotes