Linux system monitoring is a fundamental skill for keeping servers stable. Imagine your website suddenly becomes unavailable or the server lags, and you cannot identify the root cause: this is where monitoring tools come in to “diagnose” the system. This article introduces the most commonly used Linux monitoring tools and key performance indicators so that you can detect system anomalies quickly.
I. Common Basic Monitoring Tools
1. Process Monitoring: Who is “consuming” system resources?
When the system slows down, check which processes are hogging resources. ps is the basic process-viewing tool with the simple command:
ps aux
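The three option letters are BSD-style, so they are written without a leading dash:
- a: Show processes of all users (not just your own)
- u: Use the user-oriented output format (owner, CPU/memory usage, start time, etc.)
- x: Also show processes that have no controlling terminal (daemons and background services)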
Key columns to focus on:
- PID: Process ID (unique identifier for a process)
- %CPU: Percentage of CPU usage by the process
- %MEM: Percentage of memory usage by the process
- USER: User running the process
- COMMAND: Process command name
To sort by CPU usage:
ps aux --sort=-%cpu | head -10 # Show top 10 CPU-consuming processes
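When many small processes belong to the same service, per-user totals can be easier to read than a per-process list. The sketch below sums the %CPU column of ps aux output for each user. The function name cpu_by_user is made up for this illustration, and the column positions assume the procps version of ps.

```shell
#!/usr/bin/env bash
# cpu_by_user: hypothetical helper, not part of ps.
# Reads `ps aux` output on stdin and prints "USER total%CPU", highest first.
cpu_by_user() {
    awk 'NR > 1 { cpu[$1] += $3 }              # skip header; $1 = USER, $3 = %CPU
         END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }' |
        sort -k2,2nr
}

# Live usage, assuming procps ps is installed:
if command -v ps >/dev/null 2>&1; then
    ps aux | cpu_by_user | head -n 5
fi
```

Because the function only reads stdin, it can also be fed a saved ps aux snapshot when comparing a slow period against a normal one.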
2. Real-time System Dashboard: top Command
top is more dynamic than ps, providing real-time system status like a “dynamic dashboard”. Run:
top
- Load Average: Shown as 1min load, 5min load, 15min load. It is the average number of tasks that are running, waiting for a CPU, or blocked in uninterruptible I/O. For a 4-core CPU, a 1min load at or below 4 is generally healthy.
- CPU Usage (Cpu(s) row):
  - us: CPU time spent in user-space processes (high = busy or runaway programs, e.g., infinite loops)
  - sy: CPU time spent in kernel space (high = heavy system-call activity, e.g., frequent I/O)
  - wa: Time the CPU sits idle waiting for I/O (high = slow disk I/O, check for disk bottlenecks)
  - id: Idle CPU time (higher = more spare capacity)
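The "load versus core count" rule can be checked without opening top. The following is a minimal sketch that divides the 1min load by the number of cores; load_per_core is a hypothetical helper name, and the live part assumes a Linux system with /proc/loadavg and nproc available.

```shell
#!/usr/bin/env bash
# load_per_core LOAD CORES: hypothetical helper.
# Prints LOAD divided by CORES with two decimals; above 1.00 means more
# runnable tasks than cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live usage: the first field of /proc/loadavg is the 1min load average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    echo "1min load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

For the 4-core example above, a 1min load of 6.00 gives 1.50 per core, which is above the healthy range.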
3. Enhanced Process Tool: htop
htop is a top enhancement with mouse support and process tree visualization, ideal for intuitive process management. After installation:
htop
- Press F5 for the process tree, F9 to kill a process, P to sort by CPU, M to sort by memory, and q to quit.
4. Memory “Physical Exam”: free Command
Memory shortages are common; free visualizes memory usage:
free -h # -h for human-readable units (MB/GB)
Output explanation:
- total: Total memory
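- used: Memory in use by processes and the kernel (recent versions of free do not count buff/cache here)
- free: Memory that is completely unused (not even holding cache)
- buff/cache: System cache for read/write optimization (not wasted)
- available: Estimate of memory that new applications can use without swapping (free plus reclaimable cache)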
Note: buff/cache is not “wasted” memory; it accelerates I/O operations.
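Since available is the column that matters for capacity decisions, it can be useful to express it as a percentage. The sketch below reads the MemTotal and MemAvailable fields that the kernel publishes in /proc/meminfo; mem_available_pct is a hypothetical helper name, and the MemAvailable field is assumed to be present (it is on reasonably recent kernels).

```shell
#!/usr/bin/env bash
# mem_available_pct: hypothetical helper.
# Reads /proc/meminfo-style text on stdin and prints MemAvailable as a
# percentage of MemTotal, with one decimal place.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

# Live usage on Linux:
if [ -r /proc/meminfo ]; then
    echo "available memory: $(mem_available_pct < /proc/meminfo)%"
fi
```

A low percentage here is a stronger warning sign than a low free value, because free ignores cache that the kernel can reclaim.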
5. Disk Space “Manager”: df and du
When disk space is full:
- df -h: Check overall partition usage (Use% > 85% requires cleanup)
df -h # Example: /dev/vda1 75% used—monitor this
- du -sh /path: Check directory size (-s summarize, -h human-readable)
du -sh /var/log # Check log directory size for cleanup
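The 85% rule of thumb can be turned into a quick filter. This is a rough sketch that uses df -P (the POSIX output format, one line per filesystem) and prints only the mount points above a limit; over_threshold is a hypothetical helper name, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# over_threshold LIMIT: hypothetical helper.
# Reads `df -P` output on stdin and prints "mountpoint use%" for every
# filesystem whose usage is above LIMIT percent.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # "90%" -> "90"
        if (use + 0 > limit) print $6, $5
    }'
}

# Live usage:
if command -v df >/dev/null 2>&1; then
    df -P | over_threshold 85
fi
```

No output means every filesystem is at or below the limit; any mount point that is listed is a candidate for a du -sh check.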
6. Disk I/O Performance: iostat
If disk I/O is slow:
iostat -x 1 # -x for detailed stats, 1s refresh interval
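Key metrics (exact column names vary between sysstat versions):
- tps: I/O requests per second (shown by plain iostat; with -x it is split into r/s and w/s; high = disk busy)
- kB_read/s, kB_wrtn/s: Data read/written per second (rkB/s and wkB/s with -x; high = I/O pressure)
- %util: Percentage of time the device was busy (sustained >80% suggests a bottleneck, although SSDs and RAID arrays that serve requests in parallel can show a high %util while still having headroom)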
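Because the position of the %util column differs between sysstat versions, a filter is safer if it looks the column up by name. The sketch below does that and prints devices above a limit; busy_devices is a hypothetical helper name, and the header line is assumed to start with "Device".

```shell
#!/usr/bin/env bash
# busy_devices LIMIT: hypothetical helper.
# Reads `iostat -x` output on stdin, finds the %util column from the header,
# and prints "device %util" for devices above LIMIT.
busy_devices() {
    awk -v limit="$1" '
        $1 ~ /^Device/ {                      # header line: locate %util
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 > limit { print $1, $col }'
}

# Live usage, assuming sysstat is installed (two reports, one second apart;
# the first report shows averages since boot):
if command -v iostat >/dev/null 2>&1; then
    iostat -dx 1 2 | busy_devices 80
fi
```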
7. Network Connection “Detective”: ss and netstat
For network issues (e.g., port conflicts, excessive connections):
- ss -tuln: List TCP/UDP listening ports (t TCP, u UDP, l listening, n numeric ports)
ss -tuln # Example: 0.0.0.0:22 (SSH port in use)
- ss -s: Summarize connection states (ESTABLISHED, TIME_WAIT, etc.)
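For a per-state breakdown that is easy to log or compare over time, the State column of ss -tan can be counted directly. This is a minimal sketch; count_states is a hypothetical helper name, and it assumes the layout ss prints for TCP-only queries, where State is the first column and states use the short names ss prints (ESTAB, TIME-WAIT, and so on).

```shell
#!/usr/bin/env bash
# count_states: hypothetical helper.
# Reads `ss -tan` output on stdin and prints "STATE count", sorted by state.
count_states() {
    awk 'NR > 1 { n[$1]++ }                    # skip header; $1 = State
         END { for (s in n) print s, n[s] }' |
        sort
}

# Live usage, assuming iproute2 ss is installed:
if command -v ss >/dev/null 2>&1; then
    ss -tan | count_states
fi
```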
II. Key Performance Indicator Interpretation
1. CPU Metrics
- Load Average: > CPU cores → system slowdown (e.g., 4-core CPU, load >4 = alert)
- CPU Usage: High us = abnormal user processes; high wa = disk I/O bottlenecks
- Context Switches: Frequent context switches (check the cs column in vmstat) consume CPU
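If vmstat is not installed, a similar figure can be approximated from the kernel's cumulative counter on the ctxt line of /proc/stat. The sketch below samples that counter twice and prints the difference per second; per_second and read_ctxt are hypothetical helper names, and the one-second interval is an arbitrary choice.

```shell
#!/usr/bin/env bash
# per_second BEFORE AFTER SECONDS: hypothetical helper.
# Integer rate of change between two readings of a cumulative counter.
per_second() {
    echo $(( ($2 - $1) / $3 ))
}

# read_ctxt: total context switches since boot, from the "ctxt" line.
read_ctxt() {
    awk '$1 == "ctxt" { print $2 }' /proc/stat
}

# Live usage on Linux:
if [ -r /proc/stat ]; then
    before=$(read_ctxt)
    sleep 1
    after=$(read_ctxt)
    echo "context switches/s: $(per_second "$before" "$after" 1)"
fi
```

What counts as "frequent" depends on the workload, so the number is most useful when compared against the same machine's normal baseline.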
2. Memory Metrics
- Memory Usage: used/total ≤ 80% recommended; sustained >80% = consider adding memory
- Swap Usage: Persistent swap used (non-zero in free) = memory shortage
- Cache Release: echo 3 > /proc/sys/vm/drop_caches (requires root; use with caution, because dropping the cache temporarily slows I/O)
3. Disk Metrics
- Partition Usage: df -h Use% near 100% → clean up large files
- I/O Performance: iostat %util >80% or wa >20% → optimize disk
4. Network Metrics
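- Throughput: Use iftop to watch real-time traffic and spot links approaching their bandwidth limit
- Connection Count: ss -s to check ESTABLISHED connections; a large TIME_WAIT count usually means many short-lived connections, while accumulating CLOSE_WAIT points to an application that is not closing its sockets (a connection leak)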
III. Practical Scenario: What to Do When the System Slows Down?
Troubleshooting steps for system lag:
1. top: Check load and CPU (wa high = disk I/O; us high = CPU-bound processes)
2. free -h: Check available memory and swap usage (low available with growing swap = memory pressure)
3. df -h: Use du -sh to locate large directories (e.g., logs, backups)
4. ss -tuln: Check for abnormal ports (malicious processes or leaks)
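The four steps can be collected into one snapshot script so that the same information is captured every time. This is a rough sketch rather than a complete diagnostic: run_step is a hypothetical helper name, and uptime stands in for the load line of top because top is interactive by default.

```shell
#!/usr/bin/env bash
# run_step TITLE COMMAND [ARGS...]: hypothetical helper.
# Prints a section header, then runs the command if it is installed,
# or notes that it was skipped.
run_step() {
    local title=$1
    shift
    echo "== $title =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "($1 not installed, skipped)"
    fi
}

run_step "load and CPU" uptime
run_step "memory" free -h
run_step "disk space" df -h
run_step "listening ports" ss -tuln
```

Redirecting the output to a timestamped file during an incident gives a record to compare against a snapshot taken when the system is healthy.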
IV. Summary
Linux monitoring requires mastering tools (ps, top, df, iostat) and metrics (CPU, memory, disk, network). Use the “Observe → Analyze → Optimize” cycle to maintain stability. Practice daily tasks (e.g., df -h for disk checks) and gradually tackle complex issues to quickly identify bottlenecks.