How do I solve this weird server problem?

Roosh

Cardinal
Orthodox
I'm having a problem on the ROK server that is hard to solve. The server runs on LAMP with cPanel. ROK runs on WordPress.

On Monday, I noticed the server load was running high (above 5 on 16 CPUs, when normally it's under 1). Traffic wasn't any higher. The server would then spike to loads of over 150, crash, and come back online at a load of around 5 or so. It would then spike again before crashing once more. Working on the back end to edit articles was extremely slow.

The first thing I did was disable WordPress's Heartbeat function, but that didn't help. Then I disabled most of the plugins. One plugin uses the wp-admin admin-ajax function; disabling it reduced the load somewhat, but the spikes kept happening. I'm pretty sure it's not the WordPress installation.

Then I changed Cloudflare's security level to the highest ("I'm Under Attack!"). This immediately halved the server load to around 3 and stopped the spikes, but the load is still higher than before. Whenever I turned that mode off and lowered the security to "High", the load immediately doubled and I soon saw another spike.

When I looked in Cloudflare's analytics, there was hardly anything under Threats. (The browser challenges spiked when I changed the security level.)

[attachment=39554]

It does seem like memory usage is high:

[attachment=39555]

I used to know a top command where you can take a snapshot over a period of time (e.g. 5 minutes), but I forgot how to do that.

Any ideas?
 

Attachments

  • screenshot-dash.cloudflare.com-2018.07.21-16-11-23.png
  • memory-usage.png

Atomic

Robin
top -n 1 -b > some-filename.out

-n is number of iterations top runs before it exits
-b is batch

Not sure of a way to get it to capture a snapshot over 5 minutes outside of just increasing the -n value. You can always "man top" to get the docs.
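
For the five-minute capture specifically, here is a rough sketch, not tested on this server. top's -d flag sets the delay between updates, so pairing it with -n spreads the snapshots over a time window. The helper names iterations_for and snapshot_top and the output filename are made up for illustration.

```shell
# Hypothetical helper: how many top iterations cover a time window.
# Usage: iterations_for <total_seconds> <interval_seconds>
iterations_for() {
    echo $(( $1 / $2 ))
}

# Hypothetical wrapper: run top in batch mode, one snapshot every
# <interval> seconds for roughly <total> seconds, saved to <outfile>.
snapshot_top() {
    total=$1; interval=$2; outfile=$3
    top -b -d "$interval" -n "$(iterations_for "$total" "$interval")" > "$outfile"
}

# Example (takes about 5 minutes): 60 snapshots, 5 seconds apart.
# snapshot_top 300 5 top-5min.out
```

A shorter interval gives finer detail at the cost of a bigger file, so it is worth tuning to how quickly the spikes come and go.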

Any new/updated plugins?
 

chicane

Woodpecker
Gold Member
My first guess is that somebody found an exploit and is running some sort of malware on your server. Given the load, I don't think it is a miner, but it could be one that is careful to not saturate the server. I suggest looking at network traffic, as in what kinds of outgoing connections are being made and to where.
 

Roosh

Cardinal
Orthodox
Atomic said:
top -n 1 -b > some-filename.out

-n is number of iterations top runs before it exits
-b is batch

Not sure of a way to get it to capture a snapshot over 5 minutes outside of just increasing the -n value. You can always "man top" to get the docs.

Any new/updated plugins?

Plugins are on recent versions. I'll run this command when I'm home.

chicane said:
My first guess is that somebody found an exploit and is running some sort of malware on your server. Given the load, I don't think it is a miner, but it could be one that is careful to not saturate the server. I suggest looking at network traffic, as in what kinds of outgoing connections are being made and to where.

What command do I run?
 

chicane

Woodpecker
Gold Member
Roosh said:
What command do I run?

If you have a command-line console, "top" will show you the top processes, and "?" will get you options, such as sorting by CPU usage. The display refreshes every few seconds by default, so that might find something. Running "tail -f /var/log/messages" or "tail -f /var/log/syslog" (which one exists depends on the distribution) will show you what is being logged to the system log. It can be a bit challenging to read, but look for anything that seems off. "crontab -l" will show you anything scheduled to run under that account, and "cat /etc/cron.d/*" will display yet another set of crontabs that are managed by the system. If you are unknowingly hosting a spammer, try "tail -f /var/log/maillog".

Some of these commands require admin privileges, so you may need to put "sudo" before them.
 

Roosh

Cardinal
Orthodox
Some top action:

top - 20:32:21 up 6 days, 10:19, 1 user, load average: 2.98, 2.72, 2.80
Tasks: 422 total, 5 running, 416 sleeping, 0 stopped, 1 zombie
Cpu(s): 29.7%us, 1.9%sy, 0.0%ni, 68.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12186024k total, 11252544k used, 933480k free, 999648k buffers
Swap: 4095996k total, 61864k used, 4034132k free, 7974940k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3320 rok 20 0 0 0 0 Z 90.4 0.0 0:03.00 php-cgi <defunct>
3328 rok 20 0 207m 22m 8588 R 49.1 0.2 0:01.49 php-cgi
3330 rok 20 0 338m 63m 46m R 39.6 0.5 0:01.20 php-cgi
3331 rok 20 0 338m 64m 46m R 35.0 0.5 0:01.06 php-cgi
3329 rok 20 0 338m 63m 46m S 29.0 0.5 0:00.88 php-cgi
335 mysql 20 0 2607m 534m 6680 S 25.4 4.5 78:26.70 mysqld
3297 nobody 20 0 208m 8984 2812 S 1.3 0.1 0:00.04 httpd
3242 root 20 0 13396 1484 912 R 0.7 0.0 0:00.20 top


top - 20:32:18 up 6 days, 10:19, 1 user, load average: 2.72, 2.66, 2.78
Tasks: 423 total, 9 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 19.1%us, 1.8%sy, 0.0%ni, 79.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12186024k total, 11244548k used, 941476k free, 999636k buffers
Swap: 4095996k total, 61864k used, 4034132k free, 7976616k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3318 rooshv 20 0 331m 45m 34m R 15.6 0.4 0:00.47 php-cgi
3319 reaxxion 20 0 330m 42m 33m R 13.9 0.4 0:00.42 php-cgi
335 mysql 20 0 2607m 533m 6680 S 9.9 4.5 78:25.93 mysqld
3320 rok 20 0 331m 37m 26m R 8.6 0.3 0:00.26 php-cgi
3321 rooshv 20 0 330m 33m 24m R 6.6 0.3 0:00.20 php-cgi
3322 rok 20 0 328m 28m 21m R 4.0 0.2 0:00.12 php-cgi
3323 rok 20 0 327m 27m 20m R 3.3 0.2 0:00.10 php-cgi
3324 rok 20 0 326m 24m 18m R 1.7 0.2 0:00.05 php-cgi
3213 nobody 20 0 208m 8192 1916 S 1.0 0.1 0:00.05 httpd

top - 20:32:12 up 6 days, 10:19, 1 user, load average: 2.78, 2.67, 2.78
Tasks: 417 total, 1 running, 416 sleeping, 0 stopped, 0 zombie
Cpu(s): 14.2%us, 1.3%sy, 0.0%ni, 84.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12186024k total, 11054304k used, 1131720k free, 999616k buffers
Swap: 4095996k total, 61864k used, 4034132k free, 7850024k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
335 mysql 20 0 2607m 533m 6680 S 4.3 4.5 78:25.45 mysqld
3184 nobody 20 0 208m 8060 1912 S 1.3 0.1 0:00.06 httpd
3242 root 20 0 13396 1484 912 R 1.0 0.0 0:00.14 top
2612 root 20 0 103m 5004 3756 S 0.7 0.0 0:00.09 sshd
3145 nobody 20 0 208m 8268 1924 S 0.7 0.1 0:00.05 httpd
3162 nobody 20 0 209m 9344 2840 S 0.7 0.1 0:00.04 httpd
1437 nobody 20 0 209m 9400 2916 S 0.3 0.1 0:00.61 httpd
2006 root 20 0 0 0 0 S 0.3 0.0 5:43.09 kondemand/0
2427 nobody 20 0 208m 9216 2912 S 0.3 0.1 0:00.17 httpd
2740 nobody 20 0 209m 9332 2824 S 0.3 0.1 0:00.16 httpd

top - 20:32:12 up 6 days, 10:19, 1 user, load average: 2.78, 2.67, 2.78
Tasks: 417 total, 1 running, 416 sleeping, 0 stopped, 0 zombie
Cpu(s): 14.2%us, 1.3%sy, 0.0%ni, 84.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12186024k total, 11054304k used, 1131720k free, 999616k buffers
Swap: 4095996k total, 61864k used, 4034132k free, 7850024k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
335 mysql 20 0 2607m 533m 6680 S 4.3 4.5 78:25.45 mysqld
3184 nobody 20 0 208m 8060 1912 S 1.3 0.1 0:00.06 httpd
3242 root 20 0 13396 1484 912 R 1.0 0.0 0:00.14 top
2612 root 20 0 103m 5004 3756 S 0.7 0.0 0:00.09 sshd
3145 nobody 20 0 208m 8268 1924 S 0.7 0.1 0:00.05 httpd
3162 nobody 20 0 209m 9344 2840 S 0.7 0.1 0:00.04 httpd
1437 nobody 20 0 209m 9400 2916 S 0.3 0.1 0:00.61 httpd
2006 root 20 0 0 0 0 S 0.3 0.0 5:43.09 kondemand/0
2427 nobody 20 0 208m 9216 2912 S 0.3 0.1 0:00.17 httpd
2740 nobody 20 0 209m 9332 2824 S 0.3 0.1 0:00.16 httpd
3213 nobody 20 0 208m 8140 1916 S 0.3 0.1 0:00.02 httpd
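
Snapshots like the ones above get easier to compare if the %CPU column is totalled per account. This is a minimal sketch, assuming the default column layout shown in the output (USER in field 2, %CPU in field 9); the function name cpu_by_user is made up for illustration. If the saved file holds several snapshots, the totals are summed across all of them.

```shell
# Sum the %CPU column of `top -b` output per user, busiest first.
# Reads the saved snapshot text on stdin. Only lines that start with a
# numeric PID are counted, so the summary and header lines are skipped.
cpu_by_user() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 12 { cpu[$2] += $9 }
         END { for (u in cpu) printf "%.1f %s\n", cpu[u], u }' | sort -rn
}

# Example: cpu_by_user < some-filename.out
```

This makes it quicker to see which account's php-cgi processes account for most of the CPU in a given capture.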

I ran "tail -f /var/log/messages" and it shows a UDP IN block every 30 seconds or so.

Jul 21 20:36:19 kernel: [556113.311168] Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:0c:29:a0:c2:a2:08:00 SRC=[server ip removed] DST=255.255.255.255 LEN=138 TOS=0x00 PREC=0x00 TTL=64 ID=0 PROTO=UDP SPT=5678 DPT=5678 LEN=118

I can't find the syslog file.

Running "crontab -l" showed mostly cPanel update checkers. "cat /etc/cron.d/*" looked clean.

"tail -f /var/log/maillog" showed a log entry every 5-10 minutes.

I ran "netstat" to check connections and that looked clean too.
 

tomcat

Chicken
Roosh said:
What command do I run?

To get a basic idea of the connections on your server, run: netstat -a. Take a close look at the tcp/udp protos.
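
Raw netstat output is long, so it can help to collapse it into a count per remote address. This is a rough sketch, assuming the usual Linux netstat column order (foreign address in field 5, state in field 6); the function name count_by_remote is made up for illustration.

```shell
# Count TCP connections per remote address, busiest first.
# Reads `netstat -ant` output on stdin and ignores listening sockets.
count_by_remote() {
    awk '$1 ~ /^tcp/ && $6 != "LISTEN" {
             sub(/:[0-9]+$/, "", $5)   # strip the remote port
             n[$5]++
         }
         END { for (ip in n) print n[ip], ip }' | sort -rn
}

# Example: netstat -ant | count_by_remote | head
```

A handful of addresses holding most of the connections would be worth a closer look, bearing in mind that behind a proxy the remote side is the proxy rather than the visitor.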
 

budoslavic

Eagle
Orthodox
Gold Member
Roosh said:
It does seem like memory usage is high:
....
....
Any ideas?
I'm not a server expert, but my expertise is in web development and we always run into server-related issues. It sounds like you need to check which processes may be consuming the CPU.

Try these troubleshooting sources.
https://martincarstenbach.wordpress.com/2013/06/25/troubleshooting-high-load-average-on-linux/ (start with this one first)

https://unix.stackexchange.com/ques...high-load-when-there-are-no-obvious-processes

https://www.tummy.com/articles/isolating-heavy-load/

Edit.


Video description:
This Tutorial explains Linux top commands. All concepts are explained with elaborated examples.

Tutorial Topics:-

1. What is the use of top command in linux/Unix.
2. Understanding Time field in top command.
3. Understanding Task Field in top command
4. Understanding Cpu Field in top command.
5. Understanding Memory Field in top command.
6. Understanding Process Field in top command.
7. How to sort process by memory utilization
8. How to sort process by cpu utilization
9. How to sort process by custom (PID, NI etc) values.
10. List running process by absolute path.
11. Kill process from top.
12. Color output.
13. Renice process priority.
14. Utilization of multiple Cpus.
15. Change the refresh rate.
16. Reduce the number of listed process.
18. How to make output uncluttered (l,t,m).
19. View selected user process in top.
20. view specific process ids in top.
21. Batch Mode in Top for taking the output in a file.
22. View threads in a processes and analyse using top.
 

Atomic

Robin
Are you running those top commands during the high-latency times? The memory and CPU load look light.

Also, it looks like the forum, rok, and reaxxion are all running on the one server? The forum doesn't seem to be experiencing any problems. Are you seeing the same issues on your WP panel for reaxxion?
 

Tigre

Kingfisher
Gold Member
ApacheTop will give you a clearer picture of the profile of requests coming through. If there's a low-key denial of service attack, you might notice it there when you wouldn't see it with just a casual inspection of the log file.

Sometimes that symptom can happen when you get sustained requests that make your server do busy-work (such as searching articles).

One way to gather more info is selectively disabling features. See if you can observe the load going down as a result of disabling a feature of the website.
 

Atomic

Robin
Tigre said:

I used to just do some basic grep/awk commands to parse the log file when I needed real-time analytics. I'll have to check out ApacheTop, it looks awesome. Thanks.
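
For anyone curious, the grep/awk approach looks roughly like this. It is a sketch that assumes the access log is in Apache's common or combined format, where the request path is the 7th whitespace-separated field. The function name top_urls and the log path in the example are placeholders, not real paths on this server.

```shell
# List the most requested URLs in an Apache access log, busiest first.
# Reads log lines on stdin; the optional argument limits the number of
# rows shown (default 10).
top_urls() {
    awk '{ print $7 }' | sort | uniq -c | sort -rn \
        | head -n "${1:-10}" | awk '{ print $1, $2 }'
}

# Example (placeholder path): tail -n 5000 /path/to/access_log | top_urls 20
```

If one URL such as a search or feed endpoint dominates the list during a spike, that is a good hint about what is making the server do busy-work.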

If someone is spamming searches it would still show up as a high memory load with top, but it's worth looking into.

If a search is killing the server, it would be caused by mysql running out of memory. You can check the mysql log with:
tail /var/log/mysql.log

If it's a server-side error it will show something out of place in top, ApacheTop, or the mysql log.

While I don't have access to it, I have heard of a private forum here. I recommend you move this thread over there, especially if you will be sharing top/apachetop/mysql log outputs. That minimizes the data a public attacker can leverage.
 

Roosh

Cardinal
Orthodox
Atomic said:
Are you running those top commands during the high-latency times? The memory and CPU load look light.

Also, it looks like the forum, rok, and reaxxion are all running on the one server? The forum doesn't seem to be experiencing any problems. Are you seeing the same issues on your WP panel for reaxxion?

Yeah, I'm running it when the load is low (late at night on the weekend). I think I have to simulate high-load conditions and do it again.

The forum and ROK are on different servers.
 

Roosh

Cardinal
Orthodox
I took off the Cloudflare security. The server load increased within minutes by about 75%.

top - 09:12:32 up 6 days, 22:59, 1 user, load average: 5.02, 5.53, 4.33
Tasks: 420 total, 3 running, 417 sleeping, 0 stopped, 0 zombie
Cpu(s): 34.9%us, 2.7%sy, 0.2%ni, 62.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12186024k total, 9530204k used, 2655820k free, 890148k buffers
Swap: 4095996k total, 61336k used, 4034660k free, 6309848k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27077 rok 20 0 340m 65m 46m R 99.2 0.6 0:01.28 php-cgi
27079 rok 20 0 336m 60m 45m R 99.2 0.5 0:00.72 php-cgi
26912 nobody 20 0 209m 8672 1896 S 5.8 0.1 0:00.10 httpd

The only process that consumed more than 1% of CPU was mysql:

mysql 335 12.2 5.0 3196228 609604 ? Sl Jul21 168:26

avg-cpu: %user %nice %system %iowait %steal %idle
34.88 0.19 2.74 0.02 0.00 62.18

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 43.79 1530.06 728.49 920330692 438187400

total used free shared buffers cached
Mem: 11900 9643 2256 266 871 6373
-/+ buffers/cache: 2398 9501
Swap: 3999 59 3940

I ran netstat twice, once without Cloudflare protection and once with.

With the protection, there were 264 TCP connections. I turned protection off, and then the count jumped to 443 TCP connections. So when the protection is disabled, the connections almost double. And of course those connections come from Cloudflare IPs. Is this a big clue?
 

Roosh

Cardinal
Orthodox
I also ran apachetop. Without protection enabled:

last hit: 14:45:53 atop runtime: 0 days, 00:05:20 14:45:56
All: 167 reqs ( 0.5/sec) 73.8K ( 236.8B/sec) 452.4B/req
2xx: 165 (98.8%) 3xx: 0 ( 0.0%) 4xx: 2 ( 1.2%) 5xx: 0 ( 0.0%)
R (300s): 161 reqs ( 0.5/sec) 73.8K ( 251.8B/sec) 469.3B/req
2xx: 159 (98.8%) 3xx: 0 ( 0.0%) 4xx: 2 ( 1.2%) 5xx: 0 ( 0.0%)

REQS REQ/S KB KB/S URL
158 0.53 0.0 0.0**
2 0.29 19.7 2.8 /feed
1 0.02 54.1 1.0 /whm-server-status


With protection:

last hit: 14:53:49 atop runtime: 0 days, 00:04:50 14:53:55
All: 128 reqs ( 0.4/sec) 53.5K ( 189.7B/sec) 428.2B/req
2xx: 128 ( 100%) 3xx: 0 ( 0.0%) 4xx: 0 ( 0.0%) 5xx: 0 ( 0.0%)
R (290s): 128 reqs ( 0.4/sec) 53.5K ( 189.0B/sec) 428.2B/req
2xx: 128 ( 100%) 3xx: 0 ( 0.0%) 4xx: 0 ( 0.0%) 5xx: 0 ( 0.0%)

REQS REQ/S KB KB/S URL
127 0.44 0.0 0.0**
1 0.00 53.5 0.2 /whm-server-status
 

chicane

Woodpecker
Gold Member
This thread reminded me that I had not set up logwatch on a couple of my servers. That has been remedied. It's a somewhat useful tool.
 

redpillage

 
Banned
Gold Member
Roosh said:
I ran netstat twice, once without Cloudflare protection and once with.

With the protection, there were 264 TCP connections. I turned protection off, and then the count jumped to 443 TCP connections. So when the protection is disabled, the connections almost double. And of course those connections come from Cloudflare IPs. Is this a big clue?

Okay, I'm not officially a sys admin, I just play one on TV ;-)

Looking at your netstat and top output, I think you're looking at an external threat. And yes, the fact that enabling Cloudflare protection lowers the number of open connections suggests that the server is being targeted. It's not a DoS attack per se. Since it's php-cgi that's blowing up, it reminds me of this:

http://seclists.org/fulldisclosure/2014/May/21

Some suggestions for mitigation:

https://www.nightlionsecurity.com/blog/news/2014/phpstress-dos-attack-php-nginx-apache/

By the way, this is a few years old, though. Have you properly patched your server?

Also, it sounds like you are dealing with this all by yourself. Do you know/have a good sys admin?

FWIW, if you don't know much about top, netstat, ps, etc. and Linux server administration in general, then you are ill-equipped to deal with a live attack. Do yourself a favor and spring for a decent sys admin. You are obviously a target and will remain one going forward. Be prepared.
 

Roosh

Cardinal
Orthodox
Problem persists.

Without protection, here's the number of active connections:

# netstat -nap | grep 80 | grep EST | wc -l
58

And then I turned the protection back on:

# netstat -nap | grep 80 | grep EST | wc -l
4
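
One caveat with the pipeline above: "grep 80" matches the characters 80 anywhere on the line (a PID, part of an IP address, a port such as 8080), so the counts can be off. A stricter count might look like the sketch below, which assumes the usual Linux netstat columns (local address in field 4, state in field 6) and also counts port 443 in case the site is served over HTTPS. The function name count_web_established is made up for illustration.

```shell
# Count ESTABLISHED TCP connections whose local port is 80 or 443.
# Reads `netstat -ant` (or `netstat -nap`) output on stdin.
count_web_established() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" && $4 ~ /:(80|443)$/ { c++ }
         END { print c + 0 }'
}

# Example: netstat -ant | count_web_established
```

Running it with protection on and off should give a cleaner comparison than the raw grep counts.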
 

rosario

Chicken
I am a unix sysadmin and have a few suggestions:

As root, check to see if there are issues with swap space:

free -h

If that is good, run the ps auwx command and see if there is anything running that is unusual. Any unusual user or process is a red flag. If you find one, delete that ID and then clean up the server.

Change all your passwords.

Analyse your log files (apache, mysql, php). Look at the IP addresses and see if there are any patterns or unusual activity. if so,

Patch these systems when you can.
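
As a starting point for the log analysis, something like the sketch below ranks client addresses by request count. It assumes an Apache access log in common or combined format, where the client address is the first field. The function name top_client_ips and the log path in the example are placeholders. Note that behind Cloudflare the first field may be a Cloudflare address rather than the visitor's, unless the server has been set up to restore the real address from the CF-Connecting-IP header.

```shell
# List the client addresses with the most requests, busiest first.
# Reads access log lines on stdin; the optional argument limits the
# number of rows shown (default 10).
top_client_ips() {
    awk '{ print $1 }' | sort | uniq -c | sort -rn \
        | head -n "${1:-10}" | awk '{ print $1, $2 }'
}

# Example (placeholder path): top_client_ips 20 < /path/to/access_log
```

An address that sends far more requests than the rest, especially to the same few URLs, is the kind of pattern to look for.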
 