Friday, July 27, 2007

x4600 Diagram

After I brute-forced a topology that was reasonable and consistent with the observed performance characteristics, Abdullah Kayi found a diagram on a German site showing how the CPU modules are used in the 4-socket x4600 AMD64 system. It seems the topology I guessed was correct.

Memory Bandwidth on Solaris with CPU Binding on 4-socket x4600

Read Latency on Solaris with CPU Binding

The lat_mem_rd microbenchmark, which is part of lmbench, was used to measure latency. Note that the x and y axes are in log scale and that the machine was a 4-socket x4600. An array size of 128MB was used.
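
For readers unfamiliar with how this kind of latency measurement works, here is a minimal sketch of a pointer-chasing loop in the spirit of lat_mem_rd. It is not lmbench's code: the function names (chain_build, chain_walk, chain_latency_ns) are made up for illustration, and it times with clock_gettime rather than whatever lmbench uses internally. Each load depends on the result of the previous one, so the average time per load approximates read latency for the chosen array size and stride.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Build a circular chain inside one buffer: slot i holds the address of
 * the slot stride_bytes further on, wrapping around at the end.
 * Hypothetical helper, not part of lmbench. Caller frees the buffer. */
static void **chain_build(size_t bytes, size_t stride_bytes)
{
    size_t n = bytes / sizeof(void *);
    size_t step = stride_bytes / sizeof(void *);
    void **buf;
    size_t i;

    assert(n > 0 && step > 0);
    buf = malloc(n * sizeof(void *));
    if (buf == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        buf[i] = &buf[(i + step) % n];
    return buf;
}

/* Perform `loads` dependent loads and return the slot reached. */
static void **chain_walk(void **start, size_t loads)
{
    void **p = start;

    while (loads--)
        p = (void **)*p;
    return p;
}

/* Average nanoseconds per dependent load over one walk. */
static double chain_latency_ns(void **start, size_t loads)
{
    struct timespec a, b;
    void **volatile sink; /* keeps the walk from being optimized away */

    assert(start != NULL && loads > 0);
    clock_gettime(CLOCK_MONOTONIC, &a);
    sink = chain_walk(start, loads);
    clock_gettime(CLOCK_MONOTONIC, &b);
    (void)sink;
    return ((double)(b.tv_sec - a.tv_sec) * 1e9 +
            (double)(b.tv_nsec - a.tv_nsec)) / (double)loads;
}
```

Calling chain_build with a 128MB size and a fixed stride, then chain_latency_ns with a few million loads, would give one point on a plot like the one above. Sweeping the array size gives the full curve.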

Sunday, July 22, 2007

Confusion with Solaris lgrp Behavior on AMD64 System

The whole point of locality groups is to maximize (or at least attempt to maximize) the locality of the resources applications depend on. I was expecting to see consistent behavior across the processors. This is not the case, and I still haven't found a valid reason why this is occurring.

memcpy on Solaris 10 on ccNUMA AMD64 System (4-socket)

For more information on system configuration please see the first ccNUMA Solaris post.


Friday, July 20, 2007

ccNUMA Effect on Solaris

The memcpy test that was used on the 8-socket AMD64 system running Linux was run on a 4-socket AMD64 Solaris system. I was interested in seeing the effect of strong lgrp affinity on Solaris.

[sbahra@numa ~/ccnuma/tests/malloc] lgrpinfo
lgroup 0 (root):
Children: 5-8
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 120
lgroup 1 (leaf):
Children: none, Parent: 5
CPUs: 0 1
Memory: installed 3.5G, allocated 290M, free 3.2G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.000198
Latency: 50
lgroup 2 (leaf):
Children: none, Parent: 6
CPUs: 2 3
Memory: installed 4.0G, allocated 295M, free 3.7G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 0
Latency: 50
lgroup 3 (leaf):
Children: none, Parent: 7
CPUs: 4 5
Memory: installed 4.0G, allocated 248M, free 3.8G
Lgroup resources: 3 (CPU); 3 (memory)
Load: 0.5
Latency: 50
lgroup 4 (leaf):
Children: none, Parent: 8
CPUs: 6 7
Memory: installed 4.0G, allocated 689M, free 3.3G
Lgroup resources: 4 (CPU); 4 (memory)
Load: 0
Latency: 50
lgroup 5 (intermediate):
Children: 1, Parent: 0
CPUs: 0-5
Memory: installed 12G, allocated 832M, free 11G
Lgroup resources: 1-3 (CPU); 1-3 (memory)
Latency: 83
lgroup 6 (intermediate):
Children: 2, Parent: 0
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 83
lgroup 7 (intermediate):
Children: 3, Parent: 0
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 83
lgroup 8 (intermediate):
Children: 4, Parent: 0
CPUs: 2-7
Memory: installed 13G, allocated 1.2G, free 11G
Lgroup resources: 2-4 (CPU); 2-4 (memory)
Latency: 83

[sbahra@numa ~/ccnuma/tests/malloc]





The following two plots have memory size in bytes on the x-axis and CPU ticks (as returned by rdtsc) on the y-axis. This is a simple memcpy test with a hot cache.



Explicit lgrp affinity


No explicit lgrp affinity.
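
A rough sketch of what a hot-cache memcpy timing loop of this kind can look like is given here. This is not the original test program: the names ticks_now and memcpy_ticks are hypothetical, the number of warm-up copies is an assumption, and the lgrp affinity setup is left out entirely. On x86 it reads the time stamp counter through the compiler builtin __builtin_ia32_rdtsc (GCC and Clang); elsewhere it falls back to clock_gettime, in which case the "ticks" are nanoseconds.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Read a tick counter: rdtsc on x86, nanoseconds elsewhere (fallback). */
static uint64_t ticks_now(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_ia32_rdtsc();
#else
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Time one memcpy of `bytes` bytes. The `warmups` untimed copies run
 * first so that, for small sizes, source and destination are already
 * in cache when the timed copy starts ("hot cache"). */
static uint64_t memcpy_ticks(void *dst, const void *src, size_t bytes,
                             int warmups)
{
    uint64_t start, end;
    int i;

    assert(dst != NULL && src != NULL && bytes > 0);
    for (i = 0; i < warmups; i++)
        memcpy(dst, src, bytes);

    start = ticks_now();
    memcpy(dst, src, bytes);
    end = ticks_now();
    return end - start;
}
```

Running memcpy_ticks over a range of sizes, once with the thread bound to an lgroup and once without, would produce two series comparable to the plots above. A single timed copy is noisy, so a real run would repeat each size and keep the minimum or median.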



Saturday, July 7, 2007

Parallel Smith Waterman with Unified Parallel C on Sun T1 processor

The machine used was a Sun T2000. The algorithm is a traditional Smith-Waterman parallelized through a wavefront mechanism. To improve cache utilization, the matrix was transposed to allow for more "horizontal" accesses (which lessens cache line sharing). Alpha defines the data distribution; the workload distribution is the quotient of Alpha and Beta. This work was done by Mohammad Bakhouya (bakhouya@gmail.com) and me.
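
To make the wavefront idea concrete, here is a small serial sketch in plain C (not UPC, and not the code used for these results). It fills the Smith-Waterman matrix one anti-diagonal at a time. Every cell on an anti-diagonal depends only on cells from the two previous anti-diagonals, so the inner loop is the part a parallel version would split across threads. The scoring values (match 2, mismatch -1, linear gap -1) and the name sw_wavefront are assumptions for illustration; the matrix transposition and the Alpha/Beta distribution described above are not modeled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative scoring scheme (an assumption, not from the original work). */
enum { SW_MATCH = 2, SW_MISMATCH = -1, SW_GAP = -1 };

static int sw_max(int a, int b)
{
    return a > b ? a : b;
}

/* Best local alignment score of a and b, computed in wavefront order.
 * Returns -1 if the score matrix cannot be allocated. */
static int sw_wavefront(const char *a, const char *b)
{
    size_t n, m, w, d, i, j, lo, hi;
    int *h, s, v, best = 0;

    assert(a != NULL && b != NULL);
    n = strlen(a);
    m = strlen(b);
    w = m + 1;                            /* row width of the matrix */
    h = calloc((n + 1) * w, sizeof *h);   /* row 0 and column 0 stay 0 */
    if (h == NULL)
        return -1;

    /* Anti-diagonal d holds the cells with i + j == d. */
    for (d = 2; d <= n + m; d++) {
        lo = d > m ? d - m : 1;           /* keeps j <= m */
        hi = d - 1 < n ? d - 1 : n;       /* keeps j >= 1 */
        /* Cells on one anti-diagonal are independent of each other:
         * this is the loop a parallel version would distribute. */
        for (i = lo; i <= hi; i++) {
            j = d - i;
            s = a[i - 1] == b[j - 1] ? SW_MATCH : SW_MISMATCH;
            v = sw_max(0, h[(i - 1) * w + (j - 1)] + s);
            v = sw_max(v, h[(i - 1) * w + j] + SW_GAP);
            v = sw_max(v, h[i * w + (j - 1)] + SW_GAP);
            h[i * w + j] = v;
            best = sw_max(best, v);
        }
    }
    free(h);
    return best;
}
```

With this scoring, sw_wavefront("ACGT", "AGT") returns 5: A and A match, one gap is taken against C, then G and T match.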

ccNUMA factor of AMD64 on Linux Performance

These measurements were taken on a Sun X4600 8-socket dual-core system. Note that the time values are half of what they should be. This is a work in progress; the plots were done by Abdullah Kayi (apokayi@gwu.edu) and me.