From: Dominik Vogt <vogt <at> linux.vnet.ibm.com>
Subject: Lock elision test results
Newsgroups: gmane.comp.lib.glibc.alpha
Date: Friday 14th June 2013 10:26:53 UTC
These are test results on a zEC12 with eight CPUs, with Andi's
lock elision v10 patches ported to z/Architecture.  Unfortunately
I cannot provide the source code used for the tests at the
moment, but I can share relative performance data.  I plan to
create a collection of test programs that can be used to measure
elision performance in specific cases.

The tests were run on 13 June 2013.

Test 1
======

Setup
-----

Two concurrent threads using pthread mutexes (m1, m2) and
counters c1, c2, c3.  All static data structures are allocated
in separate cache lines.

thread 1:

  barrier
  repeat <n> times
    lock m1
    lock m2
    increment c1
    unlock m1
    increment c2
    repeat <m> times
      waste a minimal amount of cpu
    unlock m2
    signal that thread 1 has finished its work
  barrier

thread 2:

  barrier
  get start timestamp
  while thread 1 has not finished
    lock m1
    increment c3
    unlock m1
  get end timestamp

Performance is measured as the number of loop iterations
completed by thread 2 divided by the time taken.

Test execution
--------------

The test is run ten times each with four different versions and
setups of glibc:

(1) current glibc without elision patches (2506109403de)
(2) glibc-2.15
(3) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=none
(4) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=elision

The best results of all runs for each glibc setup are compared.
The result for (1) is the reference (i.e. 100%).  Higher values
mean higher relative performance.

Result
------

(1) unpatched  : 100.00%
(2) old glibc  : 101.83%
(3) elision off:  77.87%
(4) elision on :  29.37%

The abort ratio in (4) is >= 75% on thread 1 and < 1% on thread 2.

Test 2 (nested locks)
======

Setup
-----

Three concurrent threads using pthread mutexes (m1, ..., m10) and
counters c1, ..., c10.  All static data structures are allocated
in separate cache lines.

all threads:

  barrier
  take start timestamp (only thread 1)
  repeat <n> times
    lock m1, increment c1
    lock m2, increment c2
    ...
    lock m10, increment c10
    unlock m10
    unlock m9
    ...
    unlock m1
  barrier
  take end timestamp (only thread 1)

Performance is measured as the inverse of the time taken on
thread 1.

Test execution
--------------

Identical to test 1.

Result
------

(1) unpatched  : 100.00%
(2) old glibc  : 134.35%
(3) elision off:  56.45%
(4) elision on :  31.31%

The abort ratio in (4) in all threads is between 5% and 10%.

Test 3 (cacheline pingpong)
======

Setup
-----

Four concurrent threads using a pthread mutex m and counters
c1, ..., c4.  All static data structures are allocated in
separate cache lines.

thread <i>:

  barrier
  take start timestamp (only thread 1)
  barrier
  repeat <n> times
    lock m
    increment c<i>
    unlock m
  barrier
  take end timestamp (only thread 1)

Performance is measured as the inverse of the time taken on
thread 1.

Test execution
--------------

Identical to test 1.

Result
------

(1) unpatched  : 100.00%
(2) old glibc  : 103.94%
(3) elision off:  76.25%
(4) elision on : 373.38%

The abort ratio in (4) in all threads is < 0.01%.

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany
 