The Basics of Server Monitoring

So you’re on your first job as a systems or server admin and wondering what you need to monitor. The answer is: it depends. Depending on the application the server runs, there will be mission-critical services you have to watch 24/7. Some applications rely on built-in OS services in addition to their own, while others need only their own. That said, regardless of the application, there is a set of basic components that must be monitored. Let’s take a look at some of them here.


Monitor CPU

The CPU is the brains of the server hardware. The CPU, or Central Processing Unit, is the part of a computer system that carries out the instructions of a program, performing the system’s basic arithmetic, logic, and input/output operations. A server whose CPU is pegged at 100% for several minutes, or even hours, is an unhappy server: it has no spare time (cycles) to service additional requests, whether they are mission-critical or not. Depending on what is causing the spike, you can upgrade the CPU, add more CPUs, or shut down frivolous services that are hogging these critical resources. There is no single magic percentage that CPU usage should stay below, since the right figure depends on your server application; but if your server’s CPU usage is always at 75% or higher, you should consider one of the steps above.
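
The distinction between a brief spike and sustained load can be sketched in a few lines of Python. This is only an illustration: the function name, the sample list, and the default of five consecutive readings are assumptions, and the 75% default simply echoes the rule of thumb above. How the readings are collected is left to whatever monitoring tool you use.

```python
from typing import Iterable


def sustained_high_cpu(samples: Iterable[float],
                       threshold: float = 75.0,
                       min_consecutive: int = 5) -> bool:
    """Return True if `min_consecutive` back-to-back samples reach `threshold`.

    `samples` is a hypothetical series of CPU usage percentages taken at a
    fixed interval (for example, once a minute).
    """
    run = 0  # length of the current streak of high readings
    for pct in samples:
        run = run + 1 if pct >= threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

A single reading of 100% resets nothing and triggers nothing on its own; only an unbroken run of high readings does, which is what separates a busy moment from a server that needs attention.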


Monitor RAM

RAM, or Random Access Memory, is a form of data storage. A server loads the information its applications need into RAM for faster access, which improves overall application performance. RAM is faster because it is solid-state memory with no moving parts, whereas a traditional hard disk depends on mechanical components. Unlike disk, RAM is also volatile, so its contents are lost when the power goes off. When a server runs low on RAM, the operating system uses a reserved portion of the hard drive as virtual memory and moves less recently used data there. This process is called swapping, and it degrades performance because the hard drive is much slower than RAM (think thousands of times slower, or more). Heavy swapping also adds disk I/O that competes with the application’s own reads and writes, which degrades overall server performance further. If RAM usage is constantly rising, check whether a single process is responsible (a memory leak, for example) and consider adding more RAM. Adding RAM is often a cheap way to boost server performance.
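
As a rough sketch of how memory and swap usage might be computed, the Python below parses text in the format of Linux's /proc/meminfo. It assumes a Linux server whose kernel reports the MemTotal, MemAvailable, SwapTotal, and SwapFree fields; the function names are made up for this example, and other operating systems expose the same information differently.

```python
from typing import Dict, Tuple


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])  # sizes are reported in kB
    return values


def memory_pressure(info: Dict[str, int]) -> Tuple[float, float]:
    """Return (percent of RAM in use, percent of swap in use)."""
    ram_used_pct = 100.0 * (1 - info["MemAvailable"] / info["MemTotal"])
    swap_total = info.get("SwapTotal", 0)
    if swap_total == 0:
        return ram_used_pct, 0.0  # no swap configured
    swap_used_pct = 100.0 * (1 - info["SwapFree"] / swap_total)
    return ram_used_pct, swap_used_pct
```

On a Linux machine this could be fed with open("/proc/meminfo").read(). Swap usage that keeps climbing alongside RAM usage is the pattern worth alerting on.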


Monitor Disk

The hard disk is the device the server uses to store data. A hard disk drive consists of several rigid rotating discs (platters) coated with magnetic material, with magnetic heads positioned to write data to and read data from them. The data is non-volatile: unlike the contents of RAM, it survives a reboot and remains available until it is deliberately erased. It is important to monitor the hard disk for a couple of reasons. The operating system needs disk space for normal operation, including paging files and certain caches. The application running on the server also needs space, both for temporary data it caches for efficient operation and for permanent data that users will access. Low free space on a drive is also one of the causes of file system fragmentation, which can lead to severe performance issues.
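
A free-space check is one of the simplest monitors to write. The sketch below uses shutil.disk_usage from Python's standard library; the function names and the 10% default threshold are assumptions chosen for illustration, not a recommended limit.

```python
import shutil


def free_space_percent(path: str) -> float:
    """Return the percentage of free space on the volume containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems that report no size
    return 100.0 * usage.free / usage.total


def is_low_on_space(path: str, min_free_pct: float = 10.0) -> bool:
    """Return True if free space has dropped below `min_free_pct` percent."""
    return free_space_percent(path) < min_free_pct
```

Calling is_low_on_space("/") on a schedule, once per mounted volume, is enough to catch a disk that is filling up before the operating system or the application runs out of room.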


Hardware faults and performance

Last, but certainly not least, there are hardware components of a server that must be monitored.
1. CPU Fan: This is a fan that draws heat away from the CPU. It may draw cooler air into the case from the outside, expel warm air from inside, or move air across a heatsink to cool a particular component. If the CPU fan fails, the server will eventually overheat and either fail or perform an emergency shutdown to prevent serious damage to its components. Either way, your server will become unavailable. To prevent this, monitor the fan speed (RPM) and make sure it stays within safe levels. Also keep an eye on historical fan RPM: a spike that lasts several minutes or hours is an indicator of a serious issue (maybe the AC is not working, the fan vents are blocked, etc.).
2. Power Supply: A power supply unit (PSU) converts mains AC to low-voltage regulated DC power for the internal components of the computer. Modern personal computers universally use a switched-mode power supply. You need to monitor the current (amperage), voltage, and power (wattage) of the power supply.
3. Temperature: The temperature of the system board, or motherboard, is another important parameter to monitor. Unusually high temperatures can cause permanent damage to the server and will adversely affect its performance. You can obtain the safe working temperature limits from the manufacturer and monitor the board to make sure it stays within that range. Also, with virtualization in the mix, server density per rack has increased. This makes monitoring temperatures, both inside the server and in the surrounding environment, even more critical. You can use temperature measurements to plan where to deploy virtual servers. For example, a high-load web or database server should not be deployed on a physical server that is approaching its temperature limit for safe operation.
4. Environmental: The temperature, air flow, and humidity of the physical space that houses your server (data center, closet, etc.) are other important parameters to monitor. Problems with some of the parameters mentioned above may be a direct result of faulty A/C, improper air flow, or dangerous humidity levels.
5. Other: The CMOS battery, disk array health, chassis intrusion detection, and CPU hardware status must also be monitored.
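
Most of the hardware checks in the list above come down to comparing a sensor reading against a safe range. Here is a minimal sketch in Python, assuming the readings have already been collected by some other means. The function name, the sensor names, and the limits in the usage note are placeholders made up for this example; real limits should come from the hardware manufacturer.

```python
from typing import Dict, List, Tuple


def out_of_range(readings: Dict[str, float],
                 limits: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return one alert message per reading that falls outside its limits.

    `limits` maps a sensor name to an inclusive (minimum, maximum) pair.
    Readings with no configured limits are skipped.
    """
    alerts: List[str] = []
    for name, value in sorted(readings.items()):
        if name not in limits:
            continue
        low, high = limits[name]
        if value < low:
            alerts.append(f"{name}: {value} is below the minimum of {low}")
        elif value > high:
            alerts.append(f"{name}: {value} is above the maximum of {high}")
    return alerts
```

For instance, with hypothetical limits of (1000, 6000) for "cpu_fan_rpm" and (10, 70) for "board_temp_c", a fan reading of 0 would produce an alert while a board temperature of 45 would not. Checking both ends of the range matters: a stopped fan is as much a problem as one running flat out.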



A good server health and performance monitoring tool will monitor all of these aspects to give you a complete picture of what is going on with your server farm. By collecting these key performance metrics, you can also analyze performance and report on server capacity.


