Introduction

rsh is the remote shell "r command" that has long been superseded by SSH, which provides much stronger authentication. While you should generally not use rsh for remote access, it does have some valid use cases.

In an HPC environment, for example, where the cluster sits on a private network with no connection to the internet, various MPI implementations can use rsh to launch parallel jobs onto multiple compute nodes, and rsh conveniently offers password-less, host-based authentication to those nodes.
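
For example, Open MPI can be pointed at rsh as its remote process launcher (a minimal sketch, assuming Open MPI; the MCA parameter below is Open MPI-specific, and my_hostfile and my_app are placeholder names):

# Open MPI-specific MCA parameter; my_hostfile and my_app are placeholders
mpirun --mca plm_rsh_agent rsh -np 16 --hostfile my_hostfile ./my_app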

In this article, we will look at why rsh can be very slow and suggest a workaround for this slowness.


Prerequisites

In this example, the rsh client hostname is client1 with an IP address of 192.168.122.200. The rsh server hostname is server1 with an IP address of 192.168.122.202. If you are not using DNS for name resolution, make sure each machine has an /etc/hosts entry for both hosts.
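
For example, /etc/hosts on both machines would include:

192.168.122.200    client1
192.168.122.202    server1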


Installing the rsh Client

On client1, install the rsh package:

sudo yum install rsh


Installing the rsh Server

The rsh server runs under the xinetd superserver. Install both packages on server1:

sudo yum install xinetd rsh-server


Configuring the rsh Server

The xinetd configuration file for rsh is /etc/xinetd.d/rsh, and the service is disabled by default:

service shell
{
    socket_type     = stream
    wait            = no
    user            = root
    log_on_success  += USERID
    log_on_failure  += USERID
    server          = /usr/sbin/in.rshd
    disable         = yes
}

To enable rsh, change disable from yes to no. Then, start or restart the xinetd service:

sudo service xinetd restart
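
Alternatively, on CentOS 6, chkconfig can toggle xinetd-managed services for you, flipping the disable flag and reloading xinetd:

sudo chkconfig rsh on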


Adjusting iptables

rsh uses a range of privileged ports from tcp/514 to tcp/1023. Add the following iptables rule on the client:

-A INPUT -m state --state NEW -m tcp -p tcp -s 192.168.122.202 --dport 514:1023 -j ACCEPT

Similarly, add the following iptables rule on the server:

-A INPUT -m state --state NEW -m tcp -p tcp -s 192.168.122.200 --dport 514:1023 -j ACCEPT

Restart the iptables service on both client and server after adding these rules, adjusting the IP addresses to match your configuration.
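
On CentOS 6, these rules typically live in /etc/sysconfig/iptables; after editing that file on each machine, restart the service:

sudo service iptables restart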


Configuring Host-Based Authentication

The authentication mechanism for rsh is based on the contents of /etc/hosts.equiv and/or $HOME/.rhosts. Both files contain a list of hostnames from which remote users are allowed access.

In this example, we only need the system-wide /etc/hosts.equiv for testing rsh. On server1, /etc/hosts.equiv should have at least a single line for client1:

[giovanni@server1 ~]$ sudo cat /etc/hosts.equiv 
client1

The permissions of this file should be 0600.
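
If needed, set the permissions explicitly:

sudo chmod 0600 /etc/hosts.equiv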


Testing From the Command Line

Test the rsh connection from client1 to server1, bracketing the remote command with a pair of date commands to see how long it takes to complete:

[giovanni@client1 ~]$ date +%s.%3N; /usr/bin/rsh server1 /usr/bin/uptime; date +%s.%3N
1439740132.050
 11:48:52 up 15:40,  1 user,  load average: 0.00, 0.00, 0.00
1439740132.158

The +%s format parameter of the date command outputs the current time in seconds since the epoch. The %3N outputs the first three digits of the nanoseconds field, effectively milliseconds.
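
If you prefer a single elapsed figure over two raw timestamps, here is a minimal sketch using bc (assuming bc is installed):

# assumes bc is installed
start=$(date +%s.%3N)
/usr/bin/rsh server1 /usr/bin/uptime
end=$(date +%s.%3N)
echo "$end - $start" | bc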

The latency to return the output of the uptime command from the remote server is excellent and typical: only 108 ms.

In an HPC cluster environment, there may be hundreds, even thousands, of nodes in a given cluster.

Let us generate a set of fake hosts in /etc/hosts.equiv: start with an empty file, append 99 fake node entries, then append client1 at the end:

sudo cp /dev/null /etc/hosts.equiv
for i in {001..099}; do sudo -i sh -c "echo node$i >> /etc/hosts.equiv"; done
sudo -i sh -c "echo client1 >> /etc/hosts.equiv"
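
A quick sanity check confirms the entry count:

sudo wc -l /etc/hosts.equiv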

There are now 100 hosts in the /etc/hosts.equiv file, with client1 being the last in the list. Test the rsh connection again from client1 to server1, same as before:

[giovanni@client1 ~]$ date +%s.%3N; /usr/bin/rsh server1 /usr/bin/uptime; date +%s.%3N
1439740856.004
 12:01:00 up 15:52,  1 user,  load average: 0.00, 0.00, 0.00
1439740859.902

What just happened? Now it takes almost 4 seconds to return the output from the remote server, and there is no immediate indication of why. A great debugging tool for Linux is strace, which records the system calls and signals an application makes from userspace to kernel space, along with the kernel's responses. We will now use strace on the server to capture system calls and see what the rshd daemon is doing.


Testing with Strace

First, make sure you have the strace package installed. If not, installing it is easy:

sudo yum install strace

In the xinetd rsh configuration file shown above, the server parameter specifies the path to the rsh daemon:

    server          = /usr/sbin/in.rshd

We will replace this server with a custom script that will use strace to launch the rsh daemon. Replace the server parameter value with the following:

    server          = /tmp/rsh-strace.sh

Now, create the /tmp/rsh-strace.sh bash script with the following two lines:

#!/bin/bash
/usr/bin/strace -f -tt -o /tmp/rsh-server.strace /usr/sbin/in.rshd

The strace parameters used are:

-f trace child processes as they are created

-tt print timestamps with microsecond precision

-o write the output to a specified file instead of stderr

The script must be executable. Change the permissions to at least 0755 before restarting xinetd:

sudo chmod 0755 /tmp/rsh-strace.sh
sudo /sbin/service xinetd restart

Be sure to check /var/log/messages to make sure there were no errors restarting the xinetd service.
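
For example, to view recent xinetd log entries:

sudo grep xinetd /var/log/messages | tail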

Now, every time the rsh server daemon is invoked, it will write the strace output to /tmp/rsh-server.strace, overwriting the file each time.

From client1, rsh to server1 again, then have a look at the resulting strace output to see why the rsh command became slow after adding more nodes to the /etc/hosts.equiv file.

Buried among all the system calls are an open() and a read() of /etc/hosts.equiv:

11800 19:09:24.244476 open("/etc/hosts.equiv", O_RDONLY) = 4
[...]
11800 19:09:24.244559 read(4, "node001\nnode002\nnode003\nnode004\n"..., 4096) = 799

The open() function returns a file descriptor (4), which the read() function uses as its first argument.

Subsequently, there are many open() and read() calls on /etc/hosts, with each open() returning the same file descriptor, 5:

11800 19:09:24.283178 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 5
11800 19:09:24.283279 read(5, "127.0.0.1   localhost localhost."..., 4096) = 253
11800 19:09:24.283327 read(5, "", 4096) = 0

In fact, this pattern repeats itself many times. Using grep, we can count how many times this occurs:

[giovanni@server1 ~]$ grep open.*etc.*hosts.*5 /tmp/rsh-server.strace | wc -l
100

There were exactly 100 open() calls to /etc/hosts, and client1, the rsh client, is at line 100 of /etc/hosts.equiv. It appears the rsh daemon serially checks /etc/hosts for each host listed in /etc/hosts.equiv, falling back to a DNS lookup if the host is not found in /etc/hosts.

Let's see what happens when there are 1000 hosts in /etc/hosts.equiv:

sudo cp /dev/null /etc/hosts.equiv
for i in {001..999}; do sudo -i sh -c "echo node$i >> /etc/hosts.equiv"; done
sudo -i sh -c "echo client1 >> /etc/hosts.equiv"

From client1, rsh to server1 again:

[giovanni@client1 ~]$ date +%s.%3N; /usr/bin/rsh server1 /usr/bin/uptime; date +%s.%3N
1439946641.429
 21:11:16 up 3 days,  1:02,  1 user,  load average: 0.12, 0.04, 0.01
1439946675.667

This time it takes over 30 seconds. Using grep again, we can count how many times the rsh daemon opens and reads the /etc/hosts file:

[giovanni@server1 ~]$ grep open.*etc.*hosts.*5 /tmp/rsh-server.strace | wc -l
1000

It really does read /etc/hosts once for every host in the /etc/hosts.equiv file. So, if the rsh client is near the bottom of a very long hosts.equiv file, it can take tens of seconds to return a command prompt or command output from the remote server.
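
You can see how many lookups a given client will trigger by locating its line number in the file; here, client1 sits at line 1000:

sudo grep -n client1 /etc/hosts.equiv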


Workaround using Netgroups

Clearly, listing every node directly in /etc/hosts.equiv does not scale. Either abandon rsh or, if you really must use it, work around the problem.

We can work around the problem using netgroups on the rsh server. Netgroups are a very old way of specifying access control, common back when NIS (Network Information Service) was still heavily used. Nonetheless, they work well in this situation.

Netgroups are tuples of the form (hostname, username, domain) and are specified in /etc/netgroup. This file does not exist by default on CentOS 6, and the system is not configured to use it out of the box, so we will need to configure our server to do so.
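
For example, a netgroup named trusted containing two hosts (hypothetical names) is written on a single line, with a dash in the username and domain fields since only the hostname field matters for hosts.equiv checks:

trusted (host1,-,-) (host2,-,-)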

First, edit the /etc/nsswitch.conf file. The default nsswitch configuration file uses nisplus:

netgroup:   nisplus

Change this to files:

netgroup:   files

Next, we'll construct the netgroup file. The first column is the name of the netgroup; the second is a list of netgroup tuples, one per host. The beauty of this solution is that we can list 100 or 1000 hosts on the same line in the same netgroup. Keep in mind that there may be a line-length limit at some point.

We'll call the name of our netgroup nodes:

sudo touch /etc/netgroup
sudo -i sh -c "echo -n nodes >> /etc/netgroup"
for i in {001..999}; do sudo -i sh -c "echo -n ' (node$i,-,-)' >> /etc/netgroup"; done
sudo -i sh -c "echo ' (client1,-,-)' >> /etc/netgroup"
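
Before wiring the netgroup into hosts.equiv, confirm that the system resolves it; getent queries the netgroup database configured in nsswitch.conf:

getent netgroup nodes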

Nodes 001 through 999 plus client1 are now in the nodes netgroup. Next, we need to adjust the hosts.equiv file to use the new netgroup:

sudo cp /dev/null /etc/hosts.equiv
sudo -i sh -c "echo +@nodes >> /etc/hosts.equiv"
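
/etc/hosts.equiv now contains a single line that refers to the netgroup:

[giovanni@server1 ~]$ sudo cat /etc/hosts.equiv
+@nodes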

Now, let's retest rsh again from client1:

[giovanni@client1 ~]$ date +%s.%3N; /usr/bin/rsh server1 /usr/bin/uptime; date +%s.%3N
1439948137.505
 21:35:38 up 3 days,  1:27,  1 user,  load average: 0.00, 0.03, 0.00
1439948137.731

Well, that was fast: about 226 ms. Let's see how many times rshd opened the /etc/hosts file with the same file descriptor:

[giovanni@server1 ~]$ grep open.*etc.*hosts.*5 /tmp/rsh-server.strace | wc -l
0

None, so let's check the /etc/netgroup file:

[giovanni@server1 ~]$ grep open.*etc.*netgroup.*5 /tmp/rsh-server.strace | wc -l
1

The per-host lookup shifted from repeatedly reading /etc/hosts to a single read of /etc/netgroup, which speeds up the rsh server's response dramatically.


Conclusion

Again, you are probably not using rsh, and for good reason; rsh should only be used on private networks that have no direct connection to the Internet. However, if you are using it, say for launching parallel jobs in an HPC cluster, then hopefully this article explains why rsh is sometimes slow and how to work around the slowness using netgroups.

For more information, see the man pages for rshd and hosts.equiv.

