Preventing SSH timeouts: A probable solution

Recently we saw some dropped SSH connections that caused our SSH-dependent processes to fail. I analyzed both hosts involved in the dropped connection, and below I present what might have triggered the failure.

Analyzing SSH connection drops due to network inactivity:

The connection tracking procedures implemented in proxies and firewalls keep track of all connections that pass through them. Because of the physical limits of these machines, they can only keep a finite number of connections in memory. The most common and logical policy is to keep the newest connections and to discard old and inactive connections first. This can be one of the reasons for connection drops, but it does not look to be the reason in our case, as our hosts are not behind NAT. For scenarios where hosts are behind NAT and are seeing dropped SSH connections, we probably want to set the keep-alive time (/proc/sys/net/ipv4/tcp_keepalive_time) to a value less than the NAT timeout. Even then, that only works for programs which have enabled keep-alive (ssh and sshd both do by default, and this is controlled by the TCPKeepAlive option).
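The check described above (keep-alive time lower than the NAT timeout) can be sketched in a few lines of Python. This is only an illustration: the helper names and the 600-second NAT timeout are assumptions, since the real timeout depends on the NAT device in the path. On Linux the kernel default for tcp_keepalive_time is 7200 seconds, which is longer than many NAT idle timeouts.

```python
from pathlib import Path

# Kernel setting named in the post: idle seconds before the first keep-alive probe.
KEEPALIVE_PATH = Path("/proc/sys/net/ipv4/tcp_keepalive_time")


def read_keepalive_time(path=KEEPALIVE_PATH):
    """Return the keep-alive idle time in seconds, as stored in the given file."""
    return int(Path(path).read_text().strip())


def keepalive_beats_nat(keepalive_seconds, nat_timeout_seconds):
    """True when a probe is sent before the NAT device would expire the idle entry."""
    return keepalive_seconds < nat_timeout_seconds


# Assumed NAT idle timeout (hypothetical value, check your own device).
ASSUMED_NAT_TIMEOUT = 600

if KEEPALIVE_PATH.exists():  # only present on Linux
    current = read_keepalive_time()
    if keepalive_beats_nat(current, ASSUMED_NAT_TIMEOUT):
        print(f"tcp_keepalive_time={current}s is below the assumed NAT timeout")
    else:
        print(f"tcp_keepalive_time={current}s is too high for a "
              f"{ASSUMED_NAT_TIMEOUT}s NAT timeout")
```

If the check fails, the value in that file has to be lowered (as root) to something below the NAT timeout.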

Now let's discuss a scenario which might have triggered the SSH connection drop in our case. It looks like a rare case, and you may have to convince people to believe you, but it is the strongest suspect here. Both our client and server hosts are 100% CPU busy for around 4 hours during a certain time window while doing some processing, and the failure occurred during that window. So, on the day the process failed due to the SSH connection drop, the hosts were probably so busy that the connection stayed idle for a long time, causing sshd to terminate it.


Workaround:

We can rely on the TCPKeepAlive option in /etc/ssh/sshd_config (enabled by default), so that keep-alive messages are sent at regular intervals to show that the connection is still active. But in our case both the server and client hosts remain busy for a long time (around 4 hours), so if the client does not respond to the TCP keep-alive messages, sshd will drop the connection in the belief that the client host has crashed, which we do not want to happen.


The right option for us is to use either ClientAliveInterval or ServerAliveInterval. The following is what the man page says about these options:

ServerAliveInterval
             Sets a timeout interval in seconds after which if no data has
             been received from the server, ssh(1) will send a message through
             the encrypted channel to request a response from the server.  The
             default is 0, indicating that these messages will not be sent to
             the server.

This needs to be set on the client side, in /etc/ssh/ssh_config.

Along with this, we can use the ServerAliveCountMax option in ssh_config. This option sets the number of server alive messages which may be sent without ssh(1) receiving any messages back from the server. If this threshold is reached while server alive messages are being sent, ssh will disconnect from the server, terminating the session.

ClientAliveInterval
             Sets a timeout interval in seconds after which if no data has
             been received from the client, sshd(8) will send a message
             through the encrypted channel to request a response from the
             client.  The default is 0, indicating that these messages will
             not be sent to the client.

This needs to be set on the server side, in /etc/ssh/sshd_config.

Along with this, we can use the ClientAliveCountMax option in sshd_config. This option sets the number of client alive messages which may be sent without sshd(8) receiving any messages back from the client. If this threshold is reached while client alive messages are being sent, sshd will disconnect the client, terminating the session.
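To see how the interval and count-max options combine, here is a small sketch of the arithmetic. The function name is made up for illustration. It relies on the documented behaviour that an unresponsive peer is dropped after roughly interval times count-max seconds, with both ServerAliveCountMax and ClientAliveCountMax defaulting to 3.

```python
def seconds_until_disconnect(alive_interval, alive_count_max=3):
    """Approximate time an unresponsive peer is tolerated before disconnect.

    Returns None when alive_interval is 0, because no alive messages are
    sent in that case and this mechanism never drops the connection.
    """
    if alive_interval < 0:
        raise ValueError("alive_interval must not be negative")
    if alive_count_max < 1:
        # This sketch only covers counts of 1 or more.
        raise ValueError("alive_count_max must be at least 1")
    if alive_interval == 0:
        return None
    return alive_interval * alive_count_max


print(seconds_until_disconnect(15))      # 45
print(seconds_until_disconnect(60, 10))  # 600
print(seconds_until_disconnect(0))       # None
```

The first call matches the example in the sshd_config man page: an interval of 15 with the default count of 3 disconnects an unresponsive client after about 45 seconds. For hosts that can be unresponsive for long stretches, a larger product gives more slack before the session is terminated.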

In our case, since both client and server remain busy, we will set both options: ServerAliveInterval on the client (in ssh_config) and ClientAliveInterval on the server (in sshd_config). I have decided to set both because if one of the hosts is too busy to send a keep-alive message, the keep-alive message from the other host can keep the idle TCP connection alive. This assumes that the client and server will not both be so busy at the same time that neither of them can send a keep-alive message. Of course, we have to identify a suitable value for both options for our scenario, so that they neither put additional load on our server nor generate too much additional traffic.
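As a rough sketch of what setting both sides could look like, the snippet below prints the lines for each file. Per the man page excerpts above, the ServerAlive* options belong in the client's ssh_config and the ClientAlive* options in the server's sshd_config. The helper name and the values 60 and 10 are placeholders, not tuned recommendations, and the "Host *" block is just one way of applying the client settings to every host.

```python
def render_keepalive_settings(interval_seconds, count_max):
    """Return the config lines for each side, keyed by the file they belong in."""
    client_lines = [
        "Host *",
        f"    ServerAliveInterval {interval_seconds}",
        f"    ServerAliveCountMax {count_max}",
    ]
    server_lines = [
        f"ClientAliveInterval {interval_seconds}",
        f"ClientAliveCountMax {count_max}",
    ]
    return {
        "/etc/ssh/ssh_config": "\n".join(client_lines) + "\n",    # client side
        "/etc/ssh/sshd_config": "\n".join(server_lines) + "\n",   # server side
    }


# Hypothetical values: one alive message a minute, up to 10 unanswered.
for path, text in render_keepalive_settings(60, 10).items():
    print(f"# {path}")
    print(text)
```

Note that sshd has to reload its configuration before the server-side values take effect.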
