Subject: [RFC -v6 PATCH 0/8] directed yield for Pause Loop Exiting
Date: Thursday 20th January 2011 21:31:27 UTC
When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in e.g. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU.  The target is a VCPU that is runnable
but got preempted, hopefully the lock holder.

v6:
- implement yield_task_fair in a way that works with task groups,
  this allows me to actually get a performance improvement!
- fix another race Avi pointed out, the code should be good now
v5:
- fix the race condition Avi pointed out, by tracking vcpu->pid
- also allows us to yield to vcpu tasks that got preempted while in
  qemu userspace
v4:
- change to newer version of Mike Galbraith's yield_to implementation
- chainsaw out some code from Mike that looked like a great idea, but
  turned out to give weird interactions in practice
v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some
  situations and has little downside potential
v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff yet, since
  it is not entirely clear how to do that right (suggestions welcome)

Benchmark "results":

Two 4-CPU KVM guests are pinned to the same 4 physical CPUs.

One guest runs the AMQP performance test, the other guest runs
0, 2 or 4 infinite loops, for CPU overcommit factors of 1 (no
overcommit), 1.5 and 2.

The AMQP perftest is run 30 times, with 8 and 16 threads.

8thr     no overcommit   1.5x overcommit   2x overcommit
no PLE   223801          135137            104951
PLE      224135          141105            118744

16thr    no overcommit   1.5x overcommit   2x overcommit
no PLE   222424          126175            105299
PLE      222534          138082            132945

Note: this is with the KVM guests NOT running inside cgroups.  There
seems to be a CPU load balancing issue with cgroup fair group
scheduling, which often results in one guest getting only 80% CPU time
and the other guest 320%.  That will have to be fixed to get meaningful
results with cgroups.

CPU time division between the AMQP guest and the infinite loop guest
was not exactly fair, but the guests got close to the same amount of
CPU time in each test run.

There is a substantial amount of randomness in CPU time division
between guests, but the performance improvement is consistent between
multiple runs.

-- 
All rights reversed.
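
For readers who want a concrete picture of the directed yield idea
described above, here is a minimal userspace sketch in Python.  It is
a toy model only, not the code from this series: the Vcpu class, the
pick_yield_target() helper and the round-robin starting point are all
assumptions made for illustration.  It just shows the selection rule
"prefer a sibling VCPU that is runnable but not currently on a CPU".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vcpu:
    """Toy stand-in for a guest VCPU thread (hypothetical, not a KVM type)."""
    runnable: bool  # wants CPU time (not blocked or halted)
    on_cpu: bool    # currently executing on a physical CPU


def pick_yield_target(vcpus: List[Vcpu], spinner: int,
                      last_boosted: int) -> Optional[int]:
    """Return the index of a runnable-but-preempted sibling, or None.

    The scan starts just after the VCPU that was boosted last time and
    wraps around, so repeated calls do not keep picking the same target.
    """
    n = len(vcpus)
    for offset in range(1, n + 1):
        idx = (last_boosted + offset) % n
        candidate = vcpus[idx]
        if idx == spinner:
            continue  # never yield to ourselves
        if not candidate.runnable:
            continue  # not waiting for CPU time, a boost does nothing
        if candidate.on_cpu:
            continue  # already running, a boost does nothing
        return idx
    return None


# Four VCPUs: 0 is spinning on a CPU, 1 and 3 were preempted, 2 is halted.
guest = [Vcpu(True, True), Vcpu(True, False),
         Vcpu(False, False), Vcpu(True, False)]
first = pick_yield_target(guest, spinner=0, last_boosted=0)
second = pick_yield_target(guest, spinner=0, last_boosted=first)
```

In this model the spinner would hand the rest of its timeslice to the
returned VCPU, and fall back to a plain yield when the result is None.
Whether the chosen VCPU really holds the lock is unknown to the host,
which is why the cover letter says "hopefully the lock holder".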