Wednesday, August 1, 2007

Chomp Sample at Large Delta

The previous method applied the delta-threshold limit only once per sample window (periodically). This version applies the limit to any individual sample, whenever a large delta occurs.


Cache Friendliness Score Responsiveness

Threshold biased for Cache Friendly Score

Synthetic Bursty Application

The application burned cycles at random intervals and copied random amounts of memory. It is not the best example of burstiness, but the goal was only to see the responsiveness of various sliding window sizes.
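
A minimal sketch of what such a synthetic bursty workload could look like. This is not the original test program: the names `lcg_next`, `burn_cycles` and `bursty_step`, the small LCG used for reproducible randomness, and the `cap` and `max_spin` parameters are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: small LCG so that runs are reproducible. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Spin for `iters` iterations; volatile keeps the loop from being removed. */
static void burn_cycles(uint32_t iters)
{
    volatile uint32_t sink = 0;
    for (uint32_t i = 0; i < iters; i++)
        sink = sink + i;
}

/*
 * One burst: spin for a random number of iterations (below max_spin),
 * then copy a random number of bytes (1..cap) from src to dst.
 * Returns the number of bytes copied.
 */
static size_t bursty_step(uint32_t *rng, char *dst, const char *src,
                          size_t cap, uint32_t max_spin)
{
    assert(cap > 0 && max_spin > 0);
    burn_cycles(lcg_next(rng) % max_spin);
    size_t n = 1 + (size_t)(lcg_next(rng) % cap);
    memcpy(dst, src, n);
    return n;
}
```

Calling `bursty_step` in a loop with a fixed seed gives a repeatable sequence of idle gaps and copy sizes, which makes it easier to compare window sizes against the same input.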

mencoder Analysis (AMD64/ccNUMA)

This is a plot showing L2 cache utilization by mencoder over part of its lifetime with different methods for calculating cache utilization at different rates in time (average versus sliding window versus real).
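
To make the difference between the two estimators concrete, here is a rough sketch of a lifetime (running) average next to a sliding-window average. The struct and function names, and the `WINDOW_MAX` bound, are hypothetical and not taken from the code that produced the plots.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW_MAX 64  /* assumed upper bound on the window length */

/* Lifetime average: every sample since the start has equal weight. */
struct running_avg {
    double sum;
    size_t count;
};

static double running_avg_add(struct running_avg *a, double sample)
{
    a->sum += sample;
    a->count++;
    return a->sum / (double)a->count;
}

/* Sliding-window average over the most recent `size` samples. */
struct sliding_avg {
    double ring[WINDOW_MAX];
    double sum;
    size_t size;    /* window length, 1..WINDOW_MAX */
    size_t filled;  /* samples currently held */
    size_t head;    /* slot that the next sample overwrites */
};

static void sliding_avg_init(struct sliding_avg *w, size_t size)
{
    assert(size > 0 && size <= WINDOW_MAX);
    w->sum = 0.0;
    w->size = size;
    w->filled = 0;
    w->head = 0;
}

static double sliding_avg_add(struct sliding_avg *w, double sample)
{
    if (w->filled == w->size)
        w->sum -= w->ring[w->head];  /* evict the oldest sample */
    else
        w->filled++;
    w->ring[w->head] = sample;
    w->sum += sample;
    w->head = (w->head + 1) % w->size;
    return w->sum / (double)w->filled;
}
```

With a window of 2 and the samples 2, 4, 12, the running average ends at 6 while the sliding window ends at 8. The window forgets the early low sample, which is the responsiveness trade-off the plot is meant to show.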



Here are the timing results of mencoder on cpu0, cpu2 and cpu6.

CPU 0:
real 51.6
user 50.4
sys 1.0

CPU 2:
real 49.9
user 48.6
sys 1.1

CPU 6:
real 51.5
user 50.1
sys 1.3

L2 Cache Utilization (Sliding window versus Average)





L2 Cache Utilization Over Program Lifetime (Average, Sliding Window and Sliding window versus Original)

This is for a memcpy() program.


Friday, July 27, 2007

x4600 Diagram

After I brute-forced a topology that is reasonable and matches the observed performance characteristics, Abdullah Kayi found a diagram on a German site showing CPU module use in the 4-socket x4600 AMD64 system. It seems the topology I worked out was correct.

Memory Bandwidth on Solaris with CPU Binding on 4-socket x4600

Read Latency on Solaris with CPU Binding

The lat_mem_rd microbenchmark, which is part of lmbench, was used to measure latency. Note that the x and y axes are in log scale and that the machine was a 4-socket x4600. An array size of 128MB was used.

Sunday, July 22, 2007

Confusion with Solaris lgrp Behavior on AMD64 System

The whole point of locality groups is to maximize (or at least attempt to maximize) the locality of the resources that applications depend on. I was expecting to see consistent behavior across the processors. This is not the case, and I still haven't found a valid reason why this is occurring.

memcpy on Solaris 10 on ccNUMA AMD64 System (4-socket)

For more information on system configuration please see the first ccNUMA Solaris post.


Friday, July 20, 2007

ccNUMA Effect on Solaris

The memcpy test that was used on the 8-socket AMD64 system running Linux was run on a 4-socket AMD64 Solaris system. I was interested in seeing the effect of strong lgrp affinity on Solaris.

[sbahra@numa ~/ccnuma/tests/malloc] lgrpinfo
lgroup 0 (root):
Children: 5-8
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 120
lgroup 1 (leaf):
Children: none, Parent: 5
CPUs: 0 1
Memory: installed 3.5G, allocated 290M, free 3.2G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.000198
Latency: 50
lgroup 2 (leaf):
Children: none, Parent: 6
CPUs: 2 3
Memory: installed 4.0G, allocated 295M, free 3.7G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 0
Latency: 50
lgroup 3 (leaf):
Children: none, Parent: 7
CPUs: 4 5
Memory: installed 4.0G, allocated 248M, free 3.8G
Lgroup resources: 3 (CPU); 3 (memory)
Load: 0.5
Latency: 50
lgroup 4 (leaf):
Children: none, Parent: 8
CPUs: 6 7
Memory: installed 4.0G, allocated 689M, free 3.3G
Lgroup resources: 4 (CPU); 4 (memory)
Load: 0
Latency: 50
lgroup 5 (intermediate):
Children: 1, Parent: 0
CPUs: 0-5
Memory: installed 12G, allocated 832M, free 11G
Lgroup resources: 1-3 (CPU); 1-3 (memory)
Latency: 83
lgroup 6 (intermediate):
Children: 2, Parent: 0
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 83
lgroup 7 (intermediate):
Children: 3, Parent: 0
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 83
lgroup 8 (intermediate):
Children: 4, Parent: 0
CPUs: 2-7
Memory: installed 13G, allocated 1.2G, free 11G
Lgroup resources: 2-4 (CPU); 2-4 (memory)
Latency: 83

[sbahra@numa ~/ccnuma/tests/malloc]





The following two plots have memory size in bytes on the x-axis and CPU ticks (returned from rdtsc) on the y-axis. This is a simple memcpy test with a hot cache.
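
For reference, a hot-cache memcpy measurement along these lines might be sketched as follows. This is an assumption about the shape of the test, not the original source: `read_ticks`, `time_hot_memcpy` and `COMPILER_BARRIER` are made-up names, the barrier uses GCC/Clang inline assembly, and the non-x86 fallback returns nanoseconds instead of TSC ticks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang-specific: stops the compiler from moving or dropping copies. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds: nanoseconds rather than TSC ticks. */
static uint64_t read_ticks(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/*
 * Time one memcpy of n bytes after an untimed warm-up copy.
 * The cache is only "hot" if both buffers fit in it, and rdtsc is not
 * serialized here, so very small sizes will be noisy.
 */
static uint64_t time_hot_memcpy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);  /* warm-up: touches both buffers */
    COMPILER_BARRIER();
    uint64_t start = read_ticks();
    memcpy(dst, src, n);
    COMPILER_BARRIER();
    uint64_t end = read_ticks();
    return end - start;
}
```

Repeating the call for each buffer size and keeping the minimum or median of several runs would reduce the noise from interrupts and scheduling.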



Explicit lgrp affinity


No explicit lgrp affinity.



Saturday, July 7, 2007

Parallel Smith Waterman with Unified Parallel C on Sun T1 processor

The machine used was a Sun T2000. The algorithm is a traditional Smith-Waterman, parallelized through a wavefront mechanism. To improve cache utilization, the matrix was transposed to allow for more "horizontal" accesses (which lessens cache line sharing). Alpha defines data distribution; workload distribution is a quotient of Alpha and Beta. This work was done by myself and Mohammad Bakhouya (bakhouya@gmail.com).
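
As an illustration of the wavefront idea only, here is a sequential plain C sketch that fills the score matrix one anti-diagonal at a time. It is not the UPC implementation: it leaves out the transposed layout and the Alpha/Beta distribution, and the function name `sw_wavefront`, the linear gap penalty and the scoring parameters are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int max2(int a, int b) { return a > b ? a : b; }

/*
 * Best local alignment score of a and b with a linear gap penalty.
 * `gap` is a positive penalty; `mismatch` is typically negative.
 */
static int sw_wavefront(const char *a, const char *b,
                        int match, int mismatch, int gap)
{
    size_t m = strlen(a), n = strlen(b);
    /* (m+1) x (n+1) matrix; row 0 and column 0 stay zero. */
    int *h = calloc((m + 1) * (n + 1), sizeof *h);
    int best = 0;
    assert(h != NULL);

    /* Anti-diagonal d holds the cells (i, j) with i + j == d. */
    for (size_t d = 2; d <= m + n; d++) {
        size_t lo = d > n ? d - n : 1;
        size_t hi = d - 1 < m ? d - 1 : m;
        /*
         * Cells on one anti-diagonal read only diagonals d-1 and d-2,
         * so this inner loop is the part a parallel version can split.
         */
        for (size_t i = lo; i <= hi; i++) {
            size_t j = d - i;
            int s = a[i - 1] == b[j - 1] ? match : mismatch;
            int v = max2(0, h[(i - 1) * (n + 1) + (j - 1)] + s);
            v = max2(v, h[(i - 1) * (n + 1) + j] - gap);
            v = max2(v, h[i * (n + 1) + (j - 1)] - gap);
            h[i * (n + 1) + j] = v;
            best = max2(best, v);
        }
    }
    free(h);
    return best;
}
```

For example, `sw_wavefront("ACGT", "AGT", 2, -1, 1)` scores the alignment of `ACGT` with `A-GT`: three matches and one gap.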

ccNUMA factor of AMD64 on Linux Performance

These were done on a Sun X4600 8-socket dual-core system. Note that the time values are half of what they should be. This is a work in progress; plots were done by myself and Abdullah Kayi (apokayi@gwu.edu).