Recently I had to help figure out why some customers were getting slow performance from their VMs. Reservations were in place but didn’t help. So how did I find the issue? We are a VMware shop using vCloud Suite Enterprise, which gives us vCenter Operations Manager (it has since been renamed vRealize Operations Manager, but I will always call it vCOPS) and the ability to build custom dashboards. Sadly, I had not set it up to use LDAP or shared dashboards.

After getting that going so our Operations staff can log in and see the Top-N graphs as well as the vCloud Director dashboards, I immediately started seeing several VMs with high CPU Ready%. It wasn’t long before I could see that these VMs were hammering their vCPUs, but the default Pay-as-You-Go organization setting in vCD was limiting the vCPU speed to 1GHz. No wonder, right? The problem is that you have to make that change in the organization’s VDC and then restart the vApp, which is not something you can do in the middle of the day.
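The triage above (high CPU Ready%, but the VM is already running flat out at its cap) can be written down as a simple rule of thumb. It works because ESXi counts time a VM spends held back by a CPU limit as ready time (esxtop breaks that portion out as %MLMTD), so high Ready% with usage pinned at the limit points at the limit rather than at host contention. This is a hypothetical sketch: the `VmSample` fields, the 10% Ready threshold and the 95%-of-limit cutoff are assumptions for illustration, not vCOPS settings or VMware guidance.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
READY_PCT_THRESHOLD = 10.0   # treat CPU Ready % at or above this as "high" (assumption)
NEAR_LIMIT_FRACTION = 0.95   # usage at >= 95% of the limit counts as pinned (assumption)


@dataclass
class VmSample:
    """Hypothetical per-VM snapshot, e.g. copied from a Top-N dashboard."""
    name: str
    ready_pct: float            # CPU Ready %
    usage_mhz: float            # CPU usage in MHz
    limit_mhz: Optional[float]  # CPU limit in MHz, None if unlimited


def classify(vm: VmSample) -> str:
    """Rough triage of one VM's CPU Ready value (heuristic, not a VMware rule)."""
    if vm.ready_pct < READY_PCT_THRESHOLD:
        return "ok"
    if vm.limit_mhz is not None and vm.usage_mhz >= NEAR_LIMIT_FRACTION * vm.limit_mhz:
        # Usage is pinned at the cap, so much of the ready time is likely limit throttling.
        return "limit"
    # High ready time without hitting a cap: more likely host contention or sizing.
    return "contention"


if __name__ == "__main__":
    # Made-up sample values, not data from the dashboards in this post.
    samples = [
        VmSample("vm-a", 25.0, 1990.0, 2000.0),
        VmSample("vm-b", 12.0, 800.0, None),
        VmSample("vm-c", 2.0, 500.0, 2000.0),
    ]
    for vm in samples:
        print(vm.name, classify(vm))
```

Running it prints `vm-a limit`, `vm-b contention` and `vm-c ok`. A VM flagged as "contention" is the other case mentioned at the end of this post, where right-sizing or host capacity is the real question.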

I created the dashboard below for our biggest customer so they can see how their environment is doing.

[Image: high-CPU-Ready]

Notice that the top VM is at 31.5% CPU Ready. The issue is that this VM runs in vCloud Director with its vCPU speed limited to 1GHz. I changed the setting to 4GHz, but because we cannot reboot this VM in the middle of the day, I removed the limit directly in vCenter.
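vCOPS shows CPU Ready as a percentage, while vCenter’s performance charts report it as a summation in milliseconds per sample interval (20 seconds for real-time charts, 300 seconds for the past-day rollup). A minimal sketch of the commonly used conversion between the two, assuming the summation covers the whole VM and is divided by the vCPU count to get a per-vCPU average; the function names are hypothetical:

```python
def ready_ms_to_pct(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """CPU Ready summation (ms per sample) -> average CPU Ready % per vCPU."""
    # Share of the sample interval spent ready-but-not-running, spread across vCPUs.
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


def ready_pct_to_ms(ready_pct: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Inverse of ready_ms_to_pct: per-vCPU CPU Ready % -> summation in ms per sample."""
    return ready_pct * interval_s * 1000.0 * vcpus / 100.0


if __name__ == "__main__":
    # 6,300 ms of ready time in one 20 s real-time sample on a 1-vCPU VM:
    print(ready_ms_to_pct(6300))           # 31.5
    # The same percentage on a 2-vCPU VM corresponds to twice the summation:
    print(ready_pct_to_ms(31.5, vcpus=2))  # 12600.0
```

For past-day data, pass `interval_s=300` instead of relying on the real-time default.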

[Image: high-CPU-Ready1]

You can’t tell since I blurred out their VM names, but the VM that was being limited is now off the list. The VM now at the top was actually second in the first picture.

[Image: high-cpu-ready3]

The above is right after I removed the limit. Notice the spike to just above 3,750MHz for this 2-vCPU VM. With the 1GHz limit, a 2-vCPU VM could reach at most 2,000MHz (1GHz x 2 vCPUs), but demand at this time was higher. They could have added more vCPUs, but if that became our standard answer we could run into other issues, such as too many vCPUs per host. Now, I’m not saying high CPU Ready% will always come down to this; it could also mean we need to right-size VMs across the board, or that the hosts are maxed out and VMs are waiting for CPU time. So use the tools you have and monitor as often as you can if you are providing services to customers. This is one case where I was able to find a problem before the customer reported it. That’s a win in my books!
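The arithmetic behind that spike, as a minimal sketch assuming the Pay-as-You-Go vCPU speed acts as a per-VM limit of speed x vCPU count. The function names are hypothetical, and the 3,750MHz demand is read approximately off the chart:

```python
def vcpu_cap_mhz(vcpu_speed_mhz: float, vcpus: int) -> float:
    """Per-VM CPU limit implied by a per-vCPU speed setting."""
    return vcpu_speed_mhz * vcpus


def unmet_demand_mhz(demand_mhz: float, cap_mhz: float) -> float:
    """CPU demand above the cap, i.e. what the limit was holding back."""
    return max(0.0, demand_mhz - cap_mhz)


if __name__ == "__main__":
    old_cap = vcpu_cap_mhz(1000, 2)  # 1GHz x 2 vCPUs
    new_cap = vcpu_cap_mhz(4000, 2)  # 4GHz x 2 vCPUs
    print(old_cap, unmet_demand_mhz(3750, old_cap))  # 2000 1750
    print(new_cap, unmet_demand_mhz(3750, new_cap))  # 8000 0.0
```

Under the same assumption, the 4GHz setting would give this VM an 8,000MHz cap, well above the demand seen once the limit was removed.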

 

 

 
