The indicbhaaratii collaborative portal

Trouble Shoot

by sijisunny last modified 2008-01-08 13:57

If your workstation doesn't boot after you have followed the previous documentation, you need to start troubleshooting the installation.

The first thing to do is figure out how far through the boot process the workstation has gotten.


7.1. Troubleshooting Etherboot floppy image

When you boot from the floppy, you should see something similar to this:

loaded ROM segment 0x0800 length 0x4000 reloc 0x9400
Etherboot 5.0.1 (GPL) Tagged ELF for [LANCE/PCI]
Found AMD Lance/PCI at 0x1000, ROM address 0x0000
Probing...[LANCE/PCI] PCnet/PCI-II 79C970A base 0x1000, addr 00:50:56:81:00:01
Searching for server (DHCP)...
<sleep>

The above example shows what you can expect to see on the screen when booting from a floppy. If you don't see those messages, indicating that Etherboot has started, then you may have a bad floppy disk, or you didn't write the image to it properly.

If you see a message like the following, it probably indicates that the Etherboot image you have generated is not the correct image for your network card.

ROM segment 0x0800 length 0x8000 reloc 0x9400
Etherboot 5.0.2 (GPL) Tagged ELF for [Tulip]
Probing...[Tulip]No adapter found
<sleep>
<abort>

If it does get to the point where it detects the network card and displays the proper MAC address, then the floppy is probably good.


7.2. Troubleshooting DHCP

Once the network card is initialized, it will send out a DHCP broadcast on the local network, looking for a DHCP server.

If the workstation gets a valid reply from the DHCP server, then it will configure the network card. You can tell if it worked properly if the IP address information is displayed on the screen. Here's an example of what it should look like:

ROM segment 0x0800 length 0x4000 reloc 0x9400
Etherboot 5.0.1 (GPL) Tagged ELF for [LANCE/PCI]
Found AMD Lance/PCI at 0x1000, ROM address 0x0000
Probing...[LANCE/PCI] PCnet/PCI-II 79C970A base 0x1000, addr 00:50:56:81:00:01
Searching for server (DHCP)...
<sleep>
Me: 192.168.0.1, Server: 192.168.0.254, Gateway 192.168.0.254

If you see the line that starts with 'Me:', followed by an IP address, then you know that DHCP is working properly. You can move on to checking whether TFTP is working.

If instead you see the following message on the workstation, followed by a long run of <sleep> messages, then something is wrong. One or two <sleep> messages before the DHCP server replies are common and are not a cause for concern.

Searching for server (DHCP)...  

Figuring out what is wrong can sometimes be difficult, but here are some things to look for.


7.2.1. Check the connections

Is the workstation physically connected to the same network that the server is connected to?

With the workstation turned on, make sure that the link lights are lit at all of the connections.

If you are connecting directly between the workstation and the server (no hub or switch), make sure that you are using a cross-over cable. If you are using a hub or switch, then make sure you are using a normal straight-through cable, both between the workstation and hub, and also between the hub and server.


7.2.2. Is dhcpd running?

You need to determine whether dhcpd is running on the server. We can find the answer a couple of ways.

dhcpd normally sits in the background, listening on udp port 67. Try running the netstat command to see if anything is listening on that port:

netstat -an | grep ":67 " 

You should see output similar to the following:

udp     0    0   0.0.0.0:67         0.0.0.0:*

The 4th column contains the IP address and port, separated by a colon (':'). An address of all zeroes ('0.0.0.0') indicates that it is listening on all interfaces. That is, you may have an eth0 and an eth1 interface, and dhcpd is listening on both interfaces.
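As a rough illustration of the column layout just described, the following shell sketch reads netstat-style output on standard input and reports whether any UDP socket has a local address (4th column) ending in a given port. The function name udp_listener_on_port is made up for this example; it is not part of netstat or of IndClient.

```shell
#!/bin/sh
# Hypothetical helper (not a standard tool): succeeds if the "netstat -an"
# style text on stdin contains a UDP line whose 4th column ends in ":PORT".
udp_listener_on_port() {
    awk -v port="$1" '
        # $1 is the protocol, $4 is the local address:port pair
        $1 ~ /^udp/ && $4 ~ (":" port "$") { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

It could be used as `netstat -an | udp_listener_on_port 67 && echo "something is listening on udp port 67"`. Like the grep above, it only shows that the port is in use, not which program owns it.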

Just because netstat shows that something is listening on udp port 67, it doesn't mean that it is definitely dhcpd that is listening. It could be bootpd, but that is unlikely, because bootp is no longer included on most distributions of Linux.

To make sure that it is the dhcpd that is running, try running the ps command.

ps aux | grep dhcpd 

You should see something like the following:

root 23814 0.0 0.3 1676 820 ?      S 15:13 0:00 /usr/sbin/dhcpd
root 23834 0.0 0.2 1552 600 pts/0  S 15:52 0:00 grep dhcp

The first line shows that dhcpd is running. The second line is just our grep command.

If you don't see any lines showing that dhcpd is running, then you need to check that the server is configured for runlevel 5, and that dhcpd is configured to start in runlevel 5. On Red Hat-based systems, you can run ntsysv and scroll down to make sure dhcpd is configured to start.

You can try starting dhcpd with this command:

service dhcpd start

Pay attention to the output; it may show errors.


7.2.3. Double-check the dhcpd configuration

Does the /etc/dhcpd.conf file have an entry for our workstation?

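You should double-check the 'hardware ethernet' setting in that entry, to make sure it exactly matches the MAC address of the card in the workstation, and the 'fixed-address' setting, to make sure it is the IP address you intend the workstation to have.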
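In ISC dhcpd, a host entry normally ties a workstation to its network card with a 'hardware ethernet' statement. The sketch below checks whether a configuration file contains such a statement for a given MAC address, ignoring anything after a '#' comment marker. The function name dhcpd_conf_has_mac is an assumption made for this example, and the check does not try to understand the rest of the file.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the dhcpd configuration file $1 contains
# an uncommented "hardware ethernet <MAC>;" statement for the MAC address $2.
dhcpd_conf_has_mac() {
    # Strip comments first so a commented-out entry does not count,
    # then match case-insensitively (MAC digits may be upper or lower case).
    sed 's/#.*//' "$1" |
        grep -i -q "hardware[[:space:]]\{1,\}ethernet[[:space:]]\{1,\}$2[[:space:]]*;"
}
```

For example, `dhcpd_conf_has_mac /etc/dhcpd.conf 00:50:56:81:00:01 || echo "no entry for this card"`, using the MAC address that Etherboot printed on the workstation.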


7.2.4. Is ipchains or iptables blocking the request?

7.2.4.1. Checking for ipchains

Run the following command to see what it says:

ipchains -L -v 

If you see something like this:

Chain input (policy ACCEPT: 229714 packets, 115477216 bytes):
Chain forward (policy ACCEPT: 10 packets, 1794 bytes):
Chain output (policy ACCEPT: 188978 packets, 66087385 bytes):

Then it isn't ipchains that is getting in the way.


7.2.4.2. Checking for iptables

Run the following command to see what it says:

iptables -L -v 

If you see something like this:

Chain INPUT (policy ACCEPT 18148 packets, 2623K bytes)
 pkts bytes target     prot opt in     out     source               destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
Chain OUTPUT (policy ACCEPT 17721 packets, 2732K bytes)
 pkts bytes target     prot opt in     out     source               destination

Then it is not iptables getting in the way.


7.2.5. Is the workstation sending the request?

Try watching the /var/log/messages file while the workstation is booting. You can do that with the following command:

tail -f /var/log/messages 

This will 'follow' the log file as new records are written to it.

server dhcpd: DHCPDISCOVER from 00:50:56:81:00:01 via eth0
server dhcpd: no free leases on subnet WORKSTATIONS
server dhcpd: DHCPDISCOVER from 00:50:56:81:00:01 via eth0
server dhcpd: no free leases on subnet WORKSTATIONS

If you see messages like those above, saying 'no free leases', that indicates that dhcpd is running, but it doesn't know anything about the workstation that is requesting an IP address.
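When the log is busy, it can help to reduce it to the list of cards that have actually asked for an address. The following sketch pulls the MAC addresses out of DHCPDISCOVER lines in the format shown above. The function name dhcp_discover_macs is hypothetical, and the pattern assumes the colon-separated MAC format seen in these log lines.

```shell
#!/bin/sh
# Hypothetical helper: reads syslog text on stdin and prints each distinct
# MAC address that appears in a "DHCPDISCOVER from <MAC>" line.
dhcp_discover_macs() {
    # A MAC address written as aa:bb:cc:dd:ee:ff is 17 characters long.
    sed -n 's/.*DHCPDISCOVER from \([0-9A-Fa-f:]\{17\}\).*/\1/p' | sort -u
}
```

Running `dhcp_discover_macs < /var/log/messages` while the workstation boots should list its MAC address. If the address never appears, the request is probably not reaching the server at all.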


7.3. Troubleshooting TFTP

Etherboot uses TFTP to retrieve a Linux kernel from the server. This is a fairly simple protocol, but sometimes there are problems trying to get it to work.

If you see a message similar to this:

Loading 192.168.0.254:/lts/vmlinuz-2.4.24-IndClient-4......... 

with the dots filling the screen rather quickly, that normally indicates that TFTP is working properly, and the kernel is downloading.

If, instead, you don't see the dots, there is a problem. Possible problems include:


7.3.1. tftpd is not running

If tftpd isn't configured to run, then it certainly won't be able to answer the request from the workstation. To see whether it is running, you can use the netstat command, like this:

[root@bigdog]# netstat -anp | grep ":69 "
udp     0   0 0.0.0.0:69         0.0.0.0:*                 453/inetd

If you don't see any output from that command, then tftpd is likely not running.

There are two common methods for invoking tftpd: inetd and the newer xinetd.

inetd uses a configuration file called /etc/inetd.conf. In that file, make sure that the line for starting tftpd is NOT commented out. This is what the line should look like:

tftp dgram udp wait nobody /usr/sbin/tcpd  /usr/sbin/in.tftpd -s /tftpboot                 

xinetd uses a directory of individual configuration files. One file for each service. If your server is using xinetd, then the configuration file for tftpd is called /etc/xinetd.d/tftp. Below is an example:

service tftp
{
  disable          = no
  socket_type      = dgram
  protocol         = udp
  wait             = yes
  user             = root
  server           = /usr/sbin/in.tftpd
  server_args      = -s /tftpboot
}

Make sure that the disable line doesn't say yes.
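The check on the disable line can be sketched as a small shell function. The name xinetd_service_enabled is made up for this example. It only looks for an uncommented 'disable = yes' line and does not validate the rest of the file.

```shell
#!/bin/sh
# Hypothetical helper: succeeds unless the xinetd service file $1 contains
# an uncommented "disable = yes" line.
xinetd_service_enabled() {
    if sed 's/#.*//' "$1" |
        grep -q '^[[:space:]]*disable[[:space:]]*=[[:space:]]*yes'; then
        return 1    # the service is explicitly disabled
    fi
    return 0
}
```

For example, `xinetd_service_enabled /etc/xinetd.d/tftp || echo "tftp is disabled in xinetd"`. xinetd normally has to be restarted or told to reload before a change to this file takes effect.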


7.3.2. Kernel not where tftpd expects to find it

The kernel needs to be in a location that the tftpd daemon can access. If the '-s' option is specified for the tftpd daemon, then whatever the workstation asks for is interpreted relative to the directory given with '-s', which is /tftpboot in the examples above. So, if the filename entry in the /etc/dhcpd.conf file is set to /lts/vmlinuz-2.4.24-IndClient-4, then the kernel actually needs to be at /tftpboot/lts/vmlinuz-2.4.24-IndClient-4.
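The path arithmetic is simple, but it is easy to get wrong when typing by hand. A minimal sketch, with tftp_real_path as a made-up function name, joins the '-s' directory and the filename from dhcpd.conf so the result can be checked on the server:

```shell
#!/bin/sh
# Hypothetical helper: prints the path on the server's disk that corresponds
# to a TFTP request, given the tftpd "-s" directory ($1) and the filename
# entry from dhcpd.conf ($2).
tftp_real_path() {
    # Drop one trailing slash from the root and one leading slash from the
    # filename so that exactly one slash joins them.
    printf '%s/%s\n' "${1%/}" "${2#/}"
}
```

For example, `ls -l "$(tftp_real_path /tftpboot /lts/vmlinuz-2.4.24-IndClient-4)"` should list the kernel if it is where tftpd expects it.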


7.4. Troubleshooting NFS root filesystem

There are several things that can prevent the root filesystem from being mounted, including the following:


7.4.1. No init found

If you get the following error:

Kernel panic: No init found.  Try passing init= option to kernel.  

Then it is most likely that either you are mounting the wrong directory as the root filesystem, or the /opt/IndClient/i386 directory is empty.


7.4.2. Server returned error -13

If you get the following error:

Root-NFS: Server returned error -13 while mounting /opt/IndClient/i386 

This indicates that the /opt/IndClient/i386 directory isn't listed correctly in the /etc/exports file.

Take a look in the /var/log/messages file to see if there are any clues. An entry like this:

Jul 20 00:28:39 bigdog rpc.mountd: refused mount request from ws004 for /opt/IndClient/i386 (/): no export entry

confirms our suspicion that the entry in /etc/exports isn't correct.
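A quick way to see whether the directory is listed at all is to compare it with the first field of each line in the exports file. The sketch below does only that. The function name exports_lists_dir is an assumption for this example, and it does not check which hosts or options the entry allows.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the exports file $1 has an uncommented
# line whose first field is exactly the directory $2.
exports_lists_dir() {
    sed 's/#.*//' "$1" |
        awk -v dir="$2" '
            $1 == dir { found = 1 }
            END { exit (found ? 0 : 1) }
        '
}
```

For example, `exports_lists_dir /etc/exports /opt/IndClient/i386 || echo "directory is not exported"`. After correcting the file, the NFS server usually has to re-read it (for instance with `exportfs -ra`) before the change takes effect.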


7.4.3. NFS Daemon problems (portmap, nfsd & mountd)

NFS can be a complex and difficult service to troubleshoot, but understanding what should be set up and what tools are available to diagnose problems will make it easier.

Three daemons need to be running on the server for NFS to work properly: portmap, nfsd and mountd.


7.4.3.1. The Portmapper (portmap)

If you get the following messages:

Looking up port of RPC 100003/2 on 192.168.0.254
portmap: server 192.168.0.254 not responding, timed out
Root-NFS: Unable to get nfsd port number from server, using default
Looking up port of RPC 100005/2 on 192.168.0.254
portmap: server 192.168.0.254 not responding, timed out
Root-NFS: Unable to get mountd port number from server, using default
mount: server 192.168.0.254 not responding, timed out
Root-NFS: Server returned error -5 while mounting /opt/IndClient/i386
VFS: unable to mount root fs via NFS, trying floppy.
VFS: Cannot open root device "nfs" or 02:00
Please append a correct "root=" boot option
Kernel panic: VFS: Unable to mount root fs on 02:00

This most likely is caused by the portmap daemon not running. You can confirm whether or not the portmapper is running by using the ps command:

ps -e | grep portmap 

If the portmapper is running, you should see output like this:

30455 ?        00:00:00 portmap 

Another test is to use netstat. The portmapper uses port 111 on both TCP and UDP. Try running this:

netstat -an | grep ":111 " 

You should see the following output:

tcp   0   0 0.0.0.0:111       0.0.0.0:*          LISTEN
udp   0   0 0.0.0.0:111       0.0.0.0:*

If you don't see similar output, then the portmapper isn't running. You can start the portmapper by running:

/etc/rc.d/init.d/portmap   start 

Then, you should make sure that the portmapper is set up to start when the server boots. Run ntsysv to make sure it is selected to run.


7.4.3.2. The NFS and MOUNT daemons (nfsd & mountd)

NFS has two daemons that need to be running: nfsd and mountd. They are both started by the /etc/rc.d/init.d/nfs script.

You can run the ps command to make sure that they are running.

ps -e | grep nfs
ps -e | grep mountd

If it shows that one or both of the daemons are not running, then you will need to start them.

Normally, you should be able to run the startup script with the restart argument to make them both start up, but for some reason the /etc/rc.d/init.d/nfs script doesn't restart nfsd that way. It only restarts mountd (possibly a bug). So, you should instead run the following sequence of commands:

/etc/rc.d/init.d/nfs  stop
/etc/rc.d/init.d/nfs  start

You may get errors on the stop command, but that is OK. The start command should show OK as the status.

If the daemons are running, but NFS is still not working, you can verify that they have registered themselves with the portmapper by using the rpcinfo command.

rpcinfo -p localhost 

You should see results similar to the following:

   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  32771  nlockmgr
    100021    3   udp  32771  nlockmgr
    100021    4   udp  32771  nlockmgr
    100005    1   udp    648  mountd
    100005    1   tcp    651  mountd
    100005    2   udp    648  mountd
    100005    2   tcp    651  mountd
    100005    3   udp    648  mountd
    100005    3   tcp    651  mountd
    100024    1   udp    750  status
    100024    1   tcp    753  status

This indicates that nfs (nfsd) and mountd are both running and have registered themselves with the portmapper.
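Scanning the rpcinfo listing by eye is error-prone, so a small filter can report which of the expected services are missing. This is only a sketch. The function name rpc_missing_services is made up, and it assumes the service name is the last column of each line, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: reads "rpcinfo -p" output on stdin and prints every
# service named in the arguments that does not appear in the last column.
rpc_missing_services() {
    awk -v want="$*" '
        { seen[$NF] = 1 }              # remember each service name seen
        END {
            n = split(want, names, " ")
            for (i = 1; i <= n; i++)
                if (!(names[i] in seen))
                    print names[i]
        }
    '
}
```

For example, `rpcinfo -p localhost | rpc_missing_services portmapper nfs mountd` prints nothing when all three are registered, and prints the name of each one that is not.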


7.5. Troubleshooting the Xserver

Probably the single most difficult part of setting up an IndClient workstation is getting the X server configured properly. If you are using a fairly new video card that is supported by the Xorg X servers, and a fairly new monitor that can handle a large range of frequencies and resolutions, then it is fairly straightforward. In that case, if it doesn't work, the most likely cause is that the wrong X server was chosen for that card.

When an X server doesn't work with your card, it is usually pretty obvious. Either the X server won't start, or the display will be incorrect.

When the workstation is ready to start the X server, it calls the startx script, which starts the X server on the local workstation with a -query option pointing to a server where a display manager, such as XDM, GDM or KDM, is running.

Because the X server is started by the startx script, which is itself started by the init program, when it fails, init will attempt to run it again. init will continue this loop of trying to run the X server 10 times, then give up, because it thinks that it is re-spawning too quickly. After it finally gives up, the error message from the X server should be left on the screen.

Waiting for the X server to fail 10 times can be rather irritating, so a simple way to avoid the repeated failures is to start the workstation in runlevel 3, so that the X server is NOT started automatically. Instead, when you boot the workstation, you will get a bash prompt. From the bash prompt, you can start the X server manually with the following command:

sh  /tmp/start_ws 

The X server will attempt to start, and when it fails, it will return to the bash prompt, so you can see the reason for the failure.


7.6. Troubleshooting the Display manager

The display manager is the daemon that runs on the server, waiting for an X server to make contact with it. Once contact has been made, it will display a login dialog box on the screen, offering the user a chance to log into the server.

The three most common display managers are:

  • XDM - It's been around forever. It is included with the standard X Window System.

  • GDM - The 'Gnome Display Manager'. This is part of the Gnome package.

  • KDM - The 'KDE Display Manager'. This is part of the K Desktop system.

Most recent GNU/Linux distributions include all three display managers.


7.6.1. Grey screen with large X cursor

This indicates that the X server is running, but it has not been able to make contact with a display manager. Some possible reasons for that are:

  1. The display manager may not be running

    On recent versions of Red Hat (7.0 and above), the display manager is started from init. In the /etc/inittab file, there is a line that looks like this:

    x:5:respawn:/etc/X11/prefdm -nodaemon 

    The prefdm script will make the determination of which display manager to run.

    The default display manager depends on which packages have been installed. If Gnome is installed, then GDM is the default display manager. If Gnome is not installed, then the prefdm script will check to see if KDE is installed. If it is, then KDM will be the default display manager. If KDE also is not installed, then XDM will be the default display manager.

    Using the netstat command, you should be able to see if there is a display manager running. On the server, run the following command:

    netstat -ap | grep xdmcp 

    You should see results showing that there is a process listening on the xdmcp port (177).

    udp     0   0 *:xdmcp            *:*               1493/gdm 

    This shows clearly that gdm is running with a PID of 1493, and it is listening on the xdmcp port.

    If you see a line like the one shown above, indicating that there is definitely a display manager listening, then you need to make sure that the workstation is sending the XDMCP query to the correct server.

    In the lts.conf file, you can have an entry which specifies the IP address of the server that is running the display manager. The entry is optional, but if present, it should look like this:

    XDM_SERVER  =  192.168.0.254

    Of course, the IP address for your network may be different from the example above.

    If the 'XDM_SERVER' entry is not present, the workstation will use the value of the 'SERVER' entry, if present. If that is not present either, it will use 192.168.0.254.

    Whichever way it is specified, you just need to make sure that the IP address is actually the correct address of the server running the display manager.

  2. The display manager may be configured to ignore requests from remote hosts.

    If you've determined that the display manager is running, then it is possible that it has been configured to ignore XDMCP requests from remote hosts. You will need to check the configuration files of the particular running display manager, to determine if it is configured properly.

    • XDM

      The default configuration for Red Hat is to disable the ability for workstations to get login access from XDM. The IndClient_initialize script will take care of enabling this for you, but if it's not working, you should check the /etc/X11/xdm/xdm-config file. Look for an entry that looks like this:

      DisplayManager.requestPort:     0 

      This entry MUST be commented out in order for XDM to listen on port 177 for remote requests.

      Another configuration file is also important for XDM to serve remote login requests. The file /etc/X11/xdm/Xaccess MUST have a line that starts with an asterisk '*'. The line is normally included in the file, but Red Hat leaves it commented out. The IndClient_initialize script will fix the line for you, but if XDM doesn't seem to be working, you should check this file. A valid line should look like this:

      *        #any host can get a login window

    • KDM

      Newer versions of KDM have a file called kdmrc. Different Linux distributions store that file in different locations. For Red Hat 7.2, it is /etc/kde/kdm/kdmrc. For other distributions, you can run the locate command to find out where it is stored.

      The entry that controls whether remote workstations can get a login is in the [Xdmcp] section. Make sure that the Enable entry is set to true.

      Older versions of KDM use the XDM configuration files, located in /etc/X11/xdm.

    • GDM

      GDM uses a different set of configuration files. They are located in the /etc/X11/gdm directory.

      The main file to look at is the gdm.conf file. Look for the [xdmcp] section. You should see an entry within that section called 'Enable'. It must be set to '1' or 'true', depending on the version of GDM. Here is an example:

      [xdmcp]
      Enable=true
      HonorIndirect=0
      MaxPending=4
      MaxPendingIndirect=4
      MaxSessions=16
      MaxWait=30
      MaxWaitIndirect=30
      Port=177

      Notice the 'Enable=true' line. Older versions of GDM use '0' and '1' to signify whether to Disable or Enable the remote XDMCP. Newer versions use 'false' and 'true'.

  3. The display manager may be unable to map the workstation's IP address to a hostname.

    If the display manager is definitely running, and it is listening for requests from remote workstations, the problem may simply be that it cannot map the workstation's IP address to a hostname. The workstation either needs to be listed in the /etc/hosts file, or it needs to be correctly set up in DNS.
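The fallback order for the display manager address described under the first item (XDM_SERVER, then SERVER, then 192.168.0.254) can be written with nested shell parameter expansion. This is only a sketch of that order, not the actual IndClient startup code. The function name resolve_xdm_server is made up, and it assumes the lts.conf values are already available as shell variables.

```shell
#!/bin/sh
# Hypothetical helper: prints the address the workstation would query,
# following the order XDM_SERVER, then SERVER, then the built-in default.
resolve_xdm_server() {
    # ${VAR:-fallback} expands to the fallback when VAR is unset or empty.
    printf '%s\n' "${XDM_SERVER:-${SERVER:-192.168.0.254}}"
}
```

Comparing the printed address with the server where netstat showed a display manager listening on the xdmcp port is a quick way to rule out a misdirected query.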




