Linux System Monitoring: Basic Tools and Performance Metrics

Monitoring is a fundamental skill for keeping Linux servers stable. When a website suddenly becomes unavailable or a server starts to lag and the root cause is not obvious, monitoring tools let you “diagnose” the state of the system. This article introduces the most commonly used Linux monitoring tools and the key performance indicators to watch, so that you can detect anomalies quickly.

I. Common Basic Monitoring Tools

1. Process Monitoring: Who is “consuming” system resources?

When the system slows down, first check which processes are hogging resources. ps is the basic tool for listing processes; the most common invocation is:

ps aux

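a: Show processes of all users that are attached to a terminal
u: Use the user-oriented output format (owner, CPU/memory usage, start time, etc.)
x: Also include processes without a controlling terminal (daemons and other background processes)

These are BSD-style options and are written without a leading dash; ps -aux is a different command and may print a warning or be interpreted differently.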

Key columns to focus on:
- PID: Process ID (unique identifier for a process)
- %CPU: Percentage of CPU usage by the process
- %MEM: Percentage of memory usage by the process
- USER: User running the process
- COMMAND: Command line that started the process

To sort by CPU usage:

ps aux --sort=-%cpu | head -10  # Show top 10 CPU-consuming processes
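
To pick out only the processes above a given CPU threshold, the ps aux output can be filtered with awk. The following is a minimal sketch; the function name cpu_over and the threshold of 50 are illustrative and not part of ps.

```shell
# cpu_over THRESHOLD: read `ps aux` output on stdin and print the PID, %CPU
# and command name of every process at or above THRESHOLD percent CPU.
cpu_over() {
    awk -v limit="$1" 'NR > 1 && $3 + 0 >= limit + 0 { print $2, $3, $11 }'
}

# Usage on a live system:
#   ps aux | cpu_over 50
```

Reading from stdin keeps the filter independent of how the process list was produced, so it can also be run against a saved ps aux snapshot.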

2. Real-time System Dashboard: `top` Command

top is more dynamic than ps, providing real-time system status like a “dynamic dashboard”. Run:

top

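Load Average: Three values covering the last 1, 5, and 15 minutes. Each is the average number of tasks that are running, waiting for a CPU, or (on Linux) blocked in uninterruptible I/O. On a 4-core machine, a 1-minute load at or below 4 generally means the CPUs are not oversubscribed.
CPU Usage (Cpu(s) row):
us: CPU time spent in user space (persistently high values point to CPU-heavy programs, e.g., infinite loops)
sy: CPU time spent in kernel space (high values suggest heavy system-call activity, e.g., frequent I/O)
wa: Time the CPU sat idle waiting for I/O to complete (high values suggest slow disk I/O; check for disk bottlenecks)
id: Idle CPU time (the higher the value, the more headroom the system has)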
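
The load rule of thumb above can be checked by dividing the 1-minute load by the core count. This is a minimal sketch, assuming a Linux /proc/loadavg and the coreutils nproc command; load_per_core is a hypothetical helper name.

```shell
# load_per_core LOAD CORES: print LOAD divided by CORES, to two decimals.
# A result above 1.00 means more runnable tasks than cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores + 0 <= 0) exit 1       # guard against division by zero
        printf "%.2f\n", load / cores
    }'
}

# Usage on a live system:
#   read -r load1 _ < /proc/loadavg
#   load_per_core "$load1" "$(nproc)"
```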

3. Enhanced Process Tool: `htop`

htop is an enhanced alternative to top with mouse support and a process tree view, which makes process management more intuitive. It is usually not installed by default; once it is installed, run:

htop

Press F5 for process tree, F9 to kill a process, P to sort by CPU, M by memory, q to quit.

4. Memory “Physical Exam”: `free` Command

Memory shortages are common; free summarizes how memory is being used:

free -h  # -h for human-readable units (MB/GB)

Output explanation:
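- total: Total installed memory
- used: Memory in use by processes and the kernel (recent versions of free do not count buff/cache here)
- free: Completely unused memory (excluding cache)
- buff/cache: Memory used for kernel buffers and the page cache to speed up reads and writes
- available: Estimate of the memory available to new applications without swapping (includes reclaimable cache)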

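Note: buff/cache is not “wasted” memory; it accelerates I/O operations, and the kernel reclaims it when applications need memory. Judge memory pressure by available rather than by free.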
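
The available figure can also be read straight from /proc/meminfo, which is where free gets its numbers. This is a minimal sketch that reports available memory as a percentage of total; mem_available_pct is a hypothetical helper name, and the MemAvailable field assumes Linux 3.14 or later.

```shell
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a percentage of MemTotal, to one decimal place.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1       # no MemTotal line found
            printf "%.1f\n", avail * 100 / total
        }
    '
}

# Usage on a live system:
#   mem_available_pct < /proc/meminfo
```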

5. Disk Space “Manager”: `df` and `du`

When disk space is running out:
- df -h: Check overall usage per filesystem (Use% above 85% is a common signal to clean up)

  df -h  # Example: /dev/vda1 at 75% used; keep an eye on it

- du -sh /path: Check the size of a directory (-s summarize, -h human-readable)

  du -sh /var/log  # Check the log directory size before cleaning up
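
The 85% rule above can be automated by filtering df output. This is a minimal sketch that uses df -P (the POSIX format) so that each filesystem stays on one line; fs_over is a hypothetical helper name, and mount points containing spaces are not handled.

```shell
# fs_over LIMIT: read `df -P` output on stdin and print the mount point and
# Use% of every filesystem whose usage is at or above LIMIT percent.
fs_over() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)               # "91%" -> "91"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage on a live system:
#   df -P | fs_over 85
#   du -xh --max-depth=1 /var 2>/dev/null | sort -rh | head -5   # GNU du/sort
```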

6. Disk I/O Performance: `iostat`

If disk I/O is slow:

iostat -x 1  # -x for detailed stats, 1s refresh interval

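Key metrics:
- r/s and w/s: Read and write requests per second (the basic iostat -d report shows the combined figure as tps; high = disk busy)
- rkB/s and wkB/s: Data read/written per second (kB_read/s and kB_wrtn/s in the basic report; high = I/O pressure)
- %util: Percentage of time the device was busy (sustained >80% suggests a disk bottleneck, although devices that serve requests in parallel, such as SSDs and RAID arrays, can still have headroom at high %util)

The first report shows averages since boot; each later report covers the preceding interval.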
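
Because the column layout of iostat -x differs between sysstat versions, a script that watches %util is safer locating the column by its header name than by position. This is a minimal sketch under that assumption; disk_util_over is a hypothetical helper name.

```shell
# disk_util_over LIMIT: read `iostat -dx` output on stdin and print the name
# and %util of every device at or above LIMIT percent.
disk_util_over() {
    awk -v limit="$1" '
        $1 ~ /^Device/ {                 # header row: find the %util column
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}

# Usage on a live system (two reports, 1 second apart; LC_ALL=C keeps the
# decimal separator a dot):
#   LC_ALL=C iostat -dx 1 2 | disk_util_over 80
```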

7. Network Connection “Detective”: `ss` and `netstat`

For network issues (e.g., port conflicts, excessive connections):
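- ss -tuln: List listening TCP/UDP sockets (t TCP, u UDP, l listening, n numeric addresses and ports)

  ss -tuln  # Example: 0.0.0.0:22 (SSH port in use)

- ss -s: Summarize socket counts by protocol and state (ESTABLISHED, TIME_WAIT, etc.)

netstat, from the older net-tools package, accepts the same flags (netstat -tuln) but has largely been superseded by ss.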
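
For a per-state breakdown that is easier to post-process than ss -s, the first column of ss -tan can be tallied. This is a minimal sketch; count_states is a hypothetical helper name. Note that ss abbreviates state names in this column (for example ESTAB and TIME-WAIT).

```shell
# count_states: read `ss -tan` output on stdin and print each TCP state
# with the number of sockets in it, sorted by state name.
count_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Usage on a live system:
#   ss -tan | count_states
```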

II. Key Performance Indicator Interpretation

1. CPU Metrics

Load Average: Sustained load above the number of CPU cores → tasks are queuing and the system slows down (e.g., 4-core CPU, load > 4 = alert)
CPU Usage: High us = CPU-heavy user processes; high wa = disk I/O bottleneck
Context Switches: Frequent context switches (see the cs column of vmstat) consume CPU time
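
The kernel also exposes a running total of context switches as the ctxt line in /proc/stat; sampling it twice gives a per-second rate comparable to the cs column of vmstat. This is a minimal sketch, with ctxt_total as a hypothetical helper name.

```shell
# ctxt_total: read /proc/stat-style text on stdin and print the cumulative
# number of context switches since boot (the "ctxt" line).
ctxt_total() {
    awk '$1 == "ctxt" { print $2 }'
}

# Usage on a live system: context switches per second over a 5-second window
#   a=$(ctxt_total < /proc/stat); sleep 5; b=$(ctxt_total < /proc/stat)
#   echo $(( (b - a) / 5 ))
```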

2. Memory Metrics

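Memory Usage: used/total ≤ 80% recommended; sustained usage above 80% together with low available memory suggests adding memory or reducing the workload
Swap Usage: Swap that stays in use and keeps growing (Swap row of free), or continuous swap-in/swap-out activity (si/so columns of vmstat), indicates a memory shortage
Cache Release: sync; echo 3 > /proc/sys/vm/drop_caches (requires root; use with caution, since the discarded cache has to be re-read from disk and dropping it does not fix a real shortage)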
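
Swap usage can be tracked numerically from /proc/meminfo, which makes it easy to log over time and see whether it is persistent. This is a minimal sketch; swap_used_kb is a hypothetical helper name.

```shell
# swap_used_kb: read /proc/meminfo-style text on stdin and print the amount
# of swap in use, in kB (SwapTotal minus SwapFree).
swap_used_kb() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END { print total - free }
    '
}

# Usage on a live system:
#   swap_used_kb < /proc/meminfo
```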

3. Disk Metrics

Partition Usage: df -h Use% near 100% → clean up large files
I/O Performance: iostat %util > 80% or wa > 20% in top → investigate the disk workload

4. Network Metrics

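Throughput: Use iftop to watch real-time traffic and spot links approaching their bandwidth limit
Connection Count: ss -s to check ESTABLISHED connections; many TIME_WAIT sockets usually mean a high rate of short-lived connections, while a growing number of CLOSE_WAIT sockets (ss -tan state close-wait) points to an application that is not closing its connections, i.e., a leak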

III. Practical Scenario: What to Do When the System Slows Down?

Troubleshooting steps for system lag:
1. top: Check load and CPU (wa high = disk I/O; us high = CPU-bound processes)
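2. free -h: Check available memory and swap usage; if memory is short, find the processes using it (ps aux --sort=-%mem) before resorting to dropping caches
3. df -h: Check for full filesystems, then use du -sh to locate large directories (e.g., logs, backups)
4. ss -tuln: Check for unexpected listening ports (unknown or malicious services)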
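
The checklist above can be collected into a single snapshot. This is a minimal sketch, assuming the tools are on PATH; triage is a hypothetical function name, uptime stands in for the interactive top to show the load averages, and any tool that is missing is skipped rather than treated as an error.

```shell
# triage: print a one-shot snapshot of load, memory, disk and socket state.
triage() {
    for cmd in "uptime" "free -h" "df -h" "ss -s"; do
        tool=${cmd%% *}                  # first word is the program name
        if command -v "$tool" >/dev/null 2>&1; then
            echo "== $cmd =="
            $cmd 2>&1 || true            # unquoted on purpose: split into words
        else
            echo "== $cmd == (not installed, skipped)"
        fi
    done
}

# Usage:
#   triage | less
```

Saving the output with a timestamp during an incident gives a baseline to compare against once the system has recovered.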

IV. Summary

Linux monitoring requires mastering tools (ps, top, df, iostat) and metrics (CPU, memory, disk, network). Use the “Observe → Analyze → Optimize” cycle to maintain stability. Practice daily tasks (e.g., df -h for disk checks) and gradually tackle complex issues to quickly identify bottlenecks.
