Performance monitoring ----------------------- This TIL will be a bit longer because it aims to list some of the most useful and most important performance monitoring tools available on Linux. Some are usually installed by default and some must be installed manually. The tools are split into several categories. The base of this page has been constructed by selecting information from: https://www.tecmint.com/command-line-tools-to-monitor-linux-performance/. It has then been enhanced based on my personal preference and uses. General ~~~~~~~ This section shows tools displaying general information about the system state and usage. When troubleshooting a Linux machine, these are the tools to be inspected first. **top** The first tool to execute, to inspect te running process on a system interactively. :: top - 06:49:14 up 1 day, 23:04, 1 user, load average: 0.52, 0.86, 1.12 Tasks: 220 total, 1 running, 219 sleeping, 0 stopped, 0 zombie %Cpu(s): 2.9 us, 0.7 sy, 0.0 ni, 95.9 id, 0.0 wa, 0.3 hi, 0.1 si, 0.0 st MiB Mem : 7879.9 total, 814.6 free, 2198.2 used, 4867.1 buff/cache MiB Swap: 2294.0 total, 2271.5 free, 22.5 used. 5041.8 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1034 john 20 0 4007988 373052 120692 S 8.6 4.6 6:07.41 gnome-shell 10943 john 20 0 9240860 653416 143408 S 3.3 8.1 3:39.64 firefox 1352 john 20 0 503356 43776 32716 S 1.0 0.5 0:06.27 gnome-terminal- 1 root 20 0 220192 8968 6852 S 0.3 0.1 0:03.90 systemd 563 root 20 0 57172 6336 5500 S 0.3 0.1 0:00.64 systemd-logind **vmstat** This less known tool is also very useful to get an overview about a lot of components from the system, but must be installed manually. The tool provides several options, but the most simple use case is: :: $ vmstat procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 23004 790600 341872 4640592 0 2 723 581 514 1102 22 5 73 0 0 The default command displays summary values since the system boot. To understand all values, have a look at the man page of the tool. Below are some more examples of ``vmstat`` commands: * ``vmstat 3 10``: execute vmstat every 3 seconds for 10 times * ``vmstat -a``: display active an inactive memory * ``vmstat -s``: display a variety of event counters and statistics * ``vmstat -d``: display disk statistics * ``vmstat -S M``: use megabyte unit instead of kilobyte **nmon** Global performance and system utilization interactive monitoring tool. It can be used interactively or store data in CSV files. It can collect a lot of information and present nicely: CPU, Memory, Disk Usage, Network, Processes, NFS, Kernel, etc. The panels must be toggled inside the application by pressing the corresponding letter. You can toggle help with ``h``. If you use always the same panels, you can load them automatically by setting the ``NMON`` variable. :: NMON=cmdnk nmon Processes ~~~~~~~~~ **htop** It is an advanced and interactive top command. Actually it is my advised replacement for htop. **progress** Monitoring tool to show the progress of basic coreutils linux commands such as: cp, mv, dd, tar, gzip, ... The official documentation is located here: https://github.com/Xfennec/progress A few examples from the official documentation: Monitor all current and upcoming instances of coreutils commands in a simple window: :: watch progress -q See how your download is progressing: :: watch progress -wc firefox Look at your Web server activity: :: progress -c httpd File system and I/O ~~~~~~~~~~~~~~~~~~~ Some of the most useful tools for debugging file system and disk IO issues. **iotop** It is a live and real time monitoring of disk I/O operations per process. It shows the number of bytes read from / written to the disk, the IO capacity usaged of the media, etc. Very useful to find why your disk is slow or heavily used. :: $ iotop Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init 2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd] 3 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_gp] 4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_par_gp] 6 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/0:0H-kblockd] 8 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [mm_percpu_wq] 9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0] 10 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_preempt] **lsof** This command lists all open files and the owning process. As everything is a file in Linux, this can display disk files, processes, pipes, mqueues, devices, sockets, ... The default command is very verbose. Look at the man page to get some insights on what the values mean. Some useful switches: * ``lsof -u geoffrey``: list the files opened by the user geoffrey * ``lsof -i TCP:22``: find running processes with connection on port 22 * ``lsof -i 4``: list IPv4 network files * ``lsof -i 6``: list IPv6 network files * ``lsof -i TCP:1-100``: list open connections on port range 1 to 100 * ``lsof -p 1234``: list open files belonging to process id 1234 :: $ lsof COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME systemd 983 john cwd DIR 8,2 4096 2 / systemd 983 john rtd DIR 8,2 4096 2 / systemd 983 john txt REG 8,2 1411208 14142 /usr/lib/systemd/systemd systemd 983 john mem REG 8,2 561040 12192 /usr/lib/libsystemd.so.0.23.0 systemd 983 john mem REG 8,2 333728 41178 /usr/lib/libdbus-1.so.3.19.8 systemd 983 john mem REG 8,2 133000 60752 /usr/lib/libnl-3.so.200.26.0 **iostat** The ``iostat`` tool reports global CPU and input / output statistics for each partition on the system. :: $ iostat Linux 4.18.6-arch1-1-ARCH (core-m) 09/14/2018 _x86_64_ (4 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 21.87 0.45 5.60 0.15 0.00 71.93 Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 0.88 27.86 43.78 8422261 13232504 dm-0 0.72 23.94 40.29 7236249 12177052 **df** This is the global disk usage analysis tool. Can be combined with ``-h`` for human readable data. The result is immediate - the tool doesn't scan the file system. :: $ df -h Filesystem Size Used Avail Use% Mounted on dev 3.9G 0 3.9G 0% /dev run 3.9G 1.3M 3.9G 1% /run /dev/sda2 14G 11G 1.8G 86% / tmpfs 3.9G 32M 3.9G 1% /dev/shm tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup tmpfs 3.9G 1.7M 3.9G 1% /tmp /dev/sda3 2.0G 386M 1.5G 22% /opt tmpfs 788M 24K 788M 1% /run/user/120 /dev/mapper/_dev_sda4 100G 72G 23G 76% /data tmpfs 788M 2.6M 786M 1% /run/user/1000 **du** This tool can give the size of a single files or directories. It will scan the disk to compute the size, so it might be slow. My best use case is ``du -hs folder`` to get the size of folder. Note that it will start to scan mounted device. :: $ du -hs source 396K source Network stats ~~~~~~~~~~~~~~ The best tool to show which application is sending which amount of data is `nethogs` **nethogs** The result is self-describing: :: NetHogs version 0.8.1 PID USER PROGRAM DEV SENT RECEIVED 10364 john /usr/share/spotify/spotify tun0 3.707 40.496 KB/sec 9162 john /usr/lib/gvfs/gvfsd-http tun0 0.454 2.342 KB/sec 8608 john firefox tun0 0.000 0.000 KB/sec 9398 john thunderbird tun0 0.000 0.000 KB/sec 8220 john dropbox tun0 0.000 0.000 KB/sec 10238 john /usr/share/spotify/spotify tun0 0.000 0.000 KB/sec 8224 john /usr/bin/owncloud tun0 0.012 0.000 KB/sec ? root unknown TCP 0.000 0.000 KB/sec TOTAL 4.173 42.838 KB/sec **iftop** Live and real time network bandwidth vizualization tool. Can be useful to check network speed, see through which interface the traffic is going and to which endpoint. An example use case to get the network traffic on interface `eth0`:: iftop -i eth0 -B **nload** `nload` is yet another traffic stats reporting connection speed. It shows current, peak and average speed with a funky animation. A typical usage example for getting the network traffic on interface `eth0` and speed in MBytes/s:: nload -u M eth0 Network traffic analysis ~~~~~~~~~~~~~~~~~~~~~~~~~ **tcpdump** Tcpdump could have its own dedicated TIL because it has so many options. But here are some useful simple commands: * ``tcpdump -i eth0``: capture packets from the interface eth0 * ``tcpdump -XX -i eth0``: display packets and its data in HEX and ASCII format * ``tcpdump -w capture.pcap -i eth0``: save packets to a capture.pcap file * ``tcpdump -r capture.pcap``: read from a previously saved capture file * ``tcpdump -i eth0 src 172.26.10.10``: capture packets for a specific destination IP * ``tcpdump -i eth0 dst 172.26.10.10``: capture packets coming from specific IP **netstat** Netstat can monitor open ports and incoming and outgoing traffic. Below some useful commands: * ``netstat -plunt``: list listening applications * ``netstat -a``: list all TCP and UDP ports * ``netstat -s``: statistics by protocol, ca be combined with ``u`` (UDP) and ``t`` (TCP). Can show bad segments, retransmissions, failures, ... * ``netstat -at``: only TCP connections * ``netstat -au``: ony UDP connections * ``netstat -l``: all listening connections, can be combined with ``u`` (UDP) and ``t`` (TCP) * ``netstat -lx``: all UNIX listening ports * ``netstat -tp``: display PID and program name * ``netstat -r``: display kernel routing table * ``netstat -ie``: show kernel interface table * ``netstat -i``: show network interface packet statistics * ``netstat -c 10``: print continuously, every 10 seconds * ``netstat --statistics --raw``: display a lot of network statistics like number of packets received, errors, ... **iptraf** More advanced iftop tool, which collects in additional and more advanced informaiton. **arpwatch** Monitors ARP address resolution happening on the network. Useful to debug ARP resolution on the network and ARP spoofing. It can even send e-mail alerts when addresses change. Load average ~~~~~~~~~~~~ Linux makes use of three "magic" numbers that get used in several of these tools to describe the load average of the system. They could be quite confusing for beginners, but actually there are easy to understand. This is one example:: load average: 0.20, 1.05, 5.09 The numbers define the load average for the last 1 minute (1.05), the last 5 minutes (0.70) and the last 15 minutes (5.09). The number defines the number of processes which are in running, waiting or uninterruptible sleep states (e.g. waiting for I/O). The number have to be compared to the number of CPU on the system. So if the example above is a system with 2 CPUs, it means that the system is fully loaded when the value is equal 2: * over the last minute, the system was mostly idle * over the last 5 minutes, it was half used * over the last 15 minutes, it was overloaded: there were many processes waiting for the CPU