Monitor Linux Server Incidents

Monitoring and managing incidents is serious business. It is impossible to predict what the next problem will be, so here is a series of aids for tracking it down.

A warning or alert from your monitoring system, such as Nagios, indicates that the server is down. What do you do?

  1. Don’t panic.
  2. Gather information about what is going on.
  3. Connect to the server via SSH.

If the server is out of memory or the processors are overloaded, logging in could take a long time. If you really can’t connect to the server via SSH, that’s bad: you will need to log in through the serial console or restart the server.

Commands to use once you are logged in to the server via SSH.

Is anyone already working on the problem? Either of these commands lists the users who are logged in:

#:~$ users
#:~$ who

To see the running processes, use top (or htop, if it is installed). It reports memory and CPU usage.

#:~$ top
#:~$ htop

It will show you something like:

top - 11:11:25 up 48 days, 14:40,  2 users,  load average: 49.67, 48.85, 33.94
Tasks: 158 total,   1 running, 157 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.8%sy,  0.0%ni, 11.5%id, 86.7%wa,  0.0%hi,  0.6%si,  0.1%st
Mem:   8147096k total,  8108144k used,    38952k free,    28712k buffers
Swap:  4194296k total,   164740k used,  4029556k free,  2919400k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                          
30261 root      20   0 5148m 4.6g 5884 S   18 58.6   6931:26 java                                                                              
 4470 n2        20   0 14004  964  396 S    1  0.0 391:12.84 n2txd                                                                             
 3645 root      20   0     0    0    0 D    0  0.0  11:18.61 flush-202:1                                                                       
22092 root      20   0  9768  604  568 S    0  0.0   0:54.42 tail                                                                              
22839 guest     20   0 19236 1420 1040 R    0  0.0   0:00.04 top                                                                               
    1 root      20   0 23832 1120  532 S    0  0.0   0:17.27 init                                                                              
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd                                                                          
    3 root      RT   0     0    0    0 S    0  0.0   0:00.82 migration/0                                                                       
    4 root      20   0     0    0    0 S    0  0.0   0:13.66 ksoftirqd/0                                                                       
    5 root      RT   0     0    0    0 S    0  0.0   0:00.94 watchdog/0                                                                        
    6 root      RT   0     0    0    0 S    0  0.0   0:00.92 migration/1

The summary at the top usually tells you what you need to know. On a web server like this one, the first process in the list should be java.

#:~$ ps aux | grep java
root     30261 79.9 58.5 5260904 4774052 ?     Sl   May30 6932:43 /usr/lib/jvm/java-6-sun/bin/java -Djava.util.logging.config.file=/usr/local/tomcat7/conf/logging.properties -server

The first number (30261) is the process ID. With it we can kill the running process and start it again.

#:~$ sudo kill -9 30261
#:~$ sudo /etc/init.d/tomcat start
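
kill -9 ends a process immediately, without giving it a chance to shut down cleanly, so a gentler sequence is often preferable. The following is a minimal sketch, not part of the procedure above: stop_gracefully is a hypothetical helper name and the 10-second default is an arbitrary assumption. It sends SIGTERM first and falls back to SIGKILL only if the process is still there after the timeout.

```shell
# stop_gracefully PID [SECONDS]   (hypothetical helper, not a standard command)
# Send SIGTERM, wait up to SECONDS (assumed default: 10) for the process
# to exit, and only then send SIGKILL.
stop_gracefully() {
    pid=$1
    timeout=${2:-10}

    # If the signal cannot be delivered, the process is already gone
    # or we lack permission to signal it.
    kill -TERM "$pid" 2>/dev/null || return 1

    while [ "$timeout" -gt 0 ]; do
        # kill -0 sends no signal; it only checks that the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done

    # Still running after the timeout: force it.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

For a root-owned process such as the java process above, the function would have to be defined and run in a root shell, because sudo cannot call shell functions. There, stop_gracefully 30261 30 would give the process 30 seconds before forcing it.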

If a server shows warnings or errors related to CPU load while the sites still open, the server is probably busy doing something heavy.

The load average should ideally not exceed 1.0 per core. In practice, loads of up to 5.0 per core are still acceptable.
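
To compare a load average against the per-core rule of thumb, divide it by the number of cores. The sketch below assumes a hypothetical helper name, load_per_core, and takes both values as arguments so that it can be tried with any numbers.

```shell
# load_per_core LOAD CORES   (hypothetical helper)
# Print LOAD divided by CORES with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}
```

On a Linux server, the current 1-minute load and the core count can be passed in like this: load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)".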

top - 11:11:25 up 48 days, 14:40,  2 users,  load average: 49.67, 48.85, 33.94
Tasks: 158 total,   1 running, 157 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.8%sy,  0.0%ni, 11.5%id, 86.7%wa,  0.0%hi,  0.6%si,  0.1%st
Mem:   8147096k total,  8108144k used,    38952k free,    28712k buffers
Swap:  4194296k total,   164740k used,  4029556k free,  2919400k cached

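The load average is 49.67, which is not good: with a load like this the server is practically unusable. Note also the 86.7%wa in the Cpu(s) line, which means the processors spend most of their time waiting for disk I/O. Let’s look for the cause: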

Too many visitors on the website or app?

#:~$ sudo tail -f /var/log/apache2/access.log 
#:~$ sudo tail -f /var/log/apache2/other_vhosts_access.log
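
Watching the log scroll by shows that traffic is arriving, but not who sends most of it. As a rough sketch, the function below counts requests per client. The name top_clients is hypothetical, and the position of the client address is an assumption that depends on the log format: it is field 2 in the log samples in this article (which start with the virtual host), and field 1 in Apache's plain combined format.

```shell
# top_clients FIELD [COUNT]   (hypothetical helper)
# Read an access log on stdin and print the COUNT (default 10) most frequent
# values of whitespace-separated field FIELD, each preceded by its hit count.
top_clients() {
    field=$1
    count=${2:-10}
    awk -v f="$field" '{ print $f }' | sort | uniq -c | sort -rn | head -n "$count"
}
```

For example: sudo tail -n 10000 /var/log/apache2/other_vhosts_access.log | top_clients 2. A single address far ahead of the others suggests a scan or an attack rather than many ordinary visitors.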

The error.log file in the same folder will show you whether there are serious errors.

#:~$ sudo tail /var/log/apache2/error.log

DDoS, hacking attempts, security scans, brute force: how to monitor them

If someone is accessing the system more than 1000 times per second, it will cause high load. The Apache logs will show unusual entries if this is happening.

To see whether anyone is attempting a brute-force attack via SSH or against the Apache web server or app, check the following log files:

/var/log/apache2/error.log
/var/log/apache2/access.log
/var/log/apache2/other_vhosts_access.log
/var/log/auth.log

Attackers usually run a popular tool that scans for known bugs in common software, such as PHP applications. There is no real danger that such a scan will compromise our systems as long as the software it probes for is not installed.

However, serving an error still costs Apache time. If the tool sends a very large number of requests per second, the server will become slow.

If you ever need to run a script such as a web GUI or CGI, protect it with an extra login in Apache (http://www.elated.com/articles/password-protecting-your-pages-with-htaccess/).

#:~$ sudo tail /var/log/apache2/access.log
*:80 110.173.1.118 - - [30/May/2015:03:37:15 +0200] "GET //scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:15 +0200] "GET //admin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:16 +0200] "GET //admin/pma/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:17 +0200] "GET //admin/phpmyadmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:17 +0200] "GET //db/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:18 +0200] "GET //dbadmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:19 +0200] "GET //myadmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:20 +0200] "GET //mysql/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:20 +0200] "GET //mysqladmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:21 +0200] "GET //typo3/phpmyadmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:22 +0200] "GET //phpadmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:23 +0200] "GET //phpMyAdmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:23 +0200] "GET //phpmyadmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:24 +0200] "GET //phpmyadmin1/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:25 +0200] "GET //phpmyadmin2/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:25 +0200] "GET //pma/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:27 +0200] "GET //web/phpMyAdmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:28 +0200] "GET //xampp/phpmyadmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:29 +0200] "GET //web/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:29 +0200] "GET //php-my-admin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:30 +0200] "GET //websql/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:31 +0200] "GET //phpmyadmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:32 +0200] "GET //phpMyAdmin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:32 +0200] "GET //phpMyAdmin-2/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:33 +0200] "GET //php-my-admin/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:34 +0200] "GET //phpMyAdmin-2.2.3/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:36 +0200] "GET //phpMyAdmin-2.2.6/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:37 +0200] "GET //phpMyAdmin-2.5.1/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:37 +0200] "GET //phpMyAdmin-2.5.4/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:38 +0200] "GET //phpMyAdmin-2.5.5-rc1/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"
*:80 110.173.1.118 - - [30/May/2015:03:37:39 +0200] "GET //phpMyAdmin-2.5.5-rc2/scripts/setup.php HTTP/1.1" 404 1808 "-" "-"

#:~$ sudo grep w00t /var/log/apache2/error.log
[May 30 03:14:53 2015] [error] [client 193.200.124.171] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /w00tw00t.at.ISC.SANS.DFind:)
[May 11 17:47:18 2015] [error] [client 50.57.84.107] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /w00tw00t.at.ISC.SANS.test0:)
[Jun 02 08:24:29 2015] [error] [client 95.211.37.204] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /w00tw00t.at.ISC.SANS.DFind:)
[Aug 04 06:51:10 2015] [error] [client 95.211.37.224] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): /w00tw00t.at.ISC.SANS.DFind:)
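
A scan like the one in the access log above shows up as a long run of 404 responses to a single address. The sketch below counts 404 responses per client. The name not_found_by_ip is hypothetical, and the field positions are an assumption about the log format: the status code is taken to sit eight fields after the client address, which holds for the sample lines above (address in field 2, status in field 10) when the requested path contains no spaces.

```shell
# not_found_by_ip IPFIELD   (hypothetical helper)
# Read an access log on stdin and print, per client address, the number of
# requests answered with status 404, highest count first.
# Assumption: the status code is 8 whitespace-separated fields after the address.
not_found_by_ip() {
    awk -v ip="$1" '
        $(ip + 8) == 404 { hits[$ip]++ }
        END { for (addr in hits) print hits[addr], addr }
    ' | sort -rn
}
```

For example: sudo cat /var/log/apache2/access.log | not_found_by_ip 2.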

Example: monitoring an SSH attack

#:~$ sudo tail /var/log/auth.log
May 30 07:01:17 serverX sshd[15598]: Failed password for root from 31.6.80.232 port 51975 ssh2
May 30 07:01:17 serverX sshd[15598]: Received disconnect from 31.6.80.232: 11: Bye Bye [preauth]
May 30 07:01:17 serverX sshd[15600]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=31.6.80.232  user=root
May 30 07:01:19 serverX sshd[15600]: Failed password for root from 31.6.80.232 port 52235 ssh2
May 30 07:01:19 serverX sshd[15600]: Received disconnect from 31.6.80.232: 11: Bye Bye [preauth]
May 30 07:01:20 serverX sshd[15602]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=31.6.80.232  user=root
May 30 07:01:22 serverX sshd[15602]: Failed password for root from 31.6.80.232 port 52505 ssh2
May 30 07:01:22 serverX sshd[15602]: Received disconnect from 31.6.80.232: 11: Bye Bye [preauth]
May 30 07:01:22 serverX sshd[15604]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=31.6.80.232  user=root
May 30 07:01:24 serverX sshd[15604]: Failed password for root from 31.6.80.232 port 52767 ssh2
May 30 07:01:24 serverX sshd[15604]: Received disconnect from 31.6.80.232: 11: Bye Bye [preauth]
May 30 07:01:25 serverX sshd[15606]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=31.6.80.232  user=root
May 30 07:01:27 serverX sshd[15606]: Failed password for root from 31.6.80.232 port 53009 ssh2
May 30 07:01:27 serverX sshd[15606]: Received disconnect from 31.6.80.232: 11: Bye Bye [preauth]
May 30 07:01:27 serverX sshd[15609]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=31.6.80.232  user=root

They are trying to log in as root, using some common passwords.
Again, these attempts cannot succeed, since we don’t allow root login
(in /etc/ssh/sshd_config we have PermitRootLogin no).
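
To see at a glance which addresses are behind the failed logins, the log can be summarized per source address. This is a minimal sketch, assuming the "Failed password ... from ADDRESS port ..." line format shown in the auth.log sample above; failed_ssh_by_ip is a hypothetical helper name.

```shell
# failed_ssh_by_ip   (hypothetical helper)
# Read auth.log lines on stdin and count sshd "Failed password" entries per
# source address, highest count first.
failed_ssh_by_ip() {
    awk '
        /sshd.*Failed password/ {
            # The address follows the last "from" on the line; searching
            # backwards avoids being fooled by a user name of "from".
            for (i = NF - 1; i >= 1; i--)
                if ($i == "from") { hits[$(i + 1)]++; break }
        }
        END { for (addr in hits) print hits[addr], addr }
    ' | sort -rn
}
```

For example: sudo cat /var/log/auth.log | failed_ssh_by_ip.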

To find out where the IP address comes from, just type:

#:~$ whois 31.6.80.232

If the IP address or the user name does not look familiar, for example because the address belongs to a country from which you expect no logins, you can block the IP by adding it to /etc/hosts.deny:

#:~$ sudo vi /etc/hosts.deny

Add the line:

ALL: 31.6.80.232

The server is swapping

When a server gets very slow, the cause could be swapping. Here is part of the output of the top command shown earlier. The Mem row indicates that the server is using almost all of its memory, and the Swap row shows 164740k in swap.

Mem:   8147096k total,  8108144k used,    38952k free,    28712k buffers
Swap:  4194296k total,   164740k used,  4029556k free,  2919400k cached

When a server runs out of memory and has a swap partition configured, it uses the swap partition as overflow storage. If no swap space is available when memory runs out, the Linux kernel starts killing processes that use a lot of memory, usually starting with java. If swap space is available, the kernel does not kill anything, but the server may become slow because it is busy reading from and writing to the disk.

Check the partitions and swap:

#:~$ cat /etc/fstab

It will show something like:

# /etc/fstab: static file system information.
#
# Use 'blkid -o value -s UUID' to print the universally unique identifier
# for a device; this may be used with UUID= as a more robust way to name
# devices that works even if disks are added and removed. See fstab(5).
#
# proc /proc proc nodev,noexec,nosuid 0 0
# / was on /dev/sda1 during installation
/dev/xvda1 / ext3 errors=remount-ro 0 1
/dev/xvda2 none swap sw 0 0
dev /dev tmpfs rw 0 0

Disable swap

#:~$ sudo /sbin/swapoff /dev/xvda2

Enable swap

#:~$ sudo /sbin/swapon /dev/xvda2

If there is not enough memory available, the server will refuse to deactivate the swap partition.
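
Before running swapoff, it can help to check whether the data currently in swap would fit back into memory. The sketch below is a rough heuristic, not a guarantee: it reads text in the format of /proc/meminfo and compares the kernel's MemAvailable estimate (present in kernels 3.14 and later) with the amount of swap in use. The names swap_used_kb and can_disable_swap are hypothetical.

```shell
# swap_used_kb   (hypothetical helper)
# Read /proc/meminfo-style text on stdin and print the swap in use, in kB.
swap_used_kb() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END { print total - free }
    '
}

# can_disable_swap   (hypothetical helper)
# Succeed (exit status 0) only if MemAvailable exceeds the swap in use.
can_disable_swap() {
    awk '
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapTotal:"    { total = $2 }
        $1 == "SwapFree:"     { free = $2 }
        END { exit !(avail > total - free) }
    '
}
```

For example: can_disable_swap < /proc/meminfo && sudo /sbin/swapoff /dev/xvda2.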