Tuesday, July 31, 2007
Friday, July 27, 2007
x4600 Diagram
So, after I brute-forced a topology that is reasonable and matches the performance characteristics we observed, Abdullah Kayi found a German site with a diagram showing CPU module use in the 4-socket x4600 AMD64 system. It seems the topology I guessed was correct.
Posted by sbahra at 27.7.07
Memory Bandwidth on Solaris with CPU Binding on 4-socket x4600
Posted by sbahra at 27.7.07
Read Latency on Solaris with CPU Binding
The lat_mem_rd microbenchmark, which is part of lmbench, was used to measure latency. Note that both the x and y axes are in log scale and that the machine was a 4-socket x4600. An array size of 128MB was used.
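For readers unfamiliar with this kind of measurement, here is a minimal sketch of a pointer-chasing read latency loop in the spirit of lat_mem_rd. It is not lmbench's actual code: the function names (build_chain, chase, ns_per_load) are hypothetical, and it times with the standard clock() rather than whatever timer lmbench uses. The idea is that every load depends on the result of the previous one, so the time per step approximates the latency of a single read.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Keeps the result of the chase observable so the loop is not removed. */
static void *volatile sink;

/* Build a circular pointer chain through buf: slot i points to the slot
 * `stride` positions ahead, wrapping around at the end of the array. */
static void build_chain(void **buf, size_t n, size_t stride)
{
    assert(n > 0 && stride > 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = &buf[(i + stride) % n];
}

/* Follow the chain for `steps` dependent loads and return where we end up. */
static void **chase(void **start, size_t steps)
{
    void **p = start;
    while (steps--)
        p = (void **)*p;
    return p;
}

/* Average nanoseconds per dependent load over `steps` loads.
 * clock() has coarse resolution, so `steps` should be large. */
static double ns_per_load(void **start, size_t steps)
{
    assert(steps > 0);
    clock_t t0 = clock();
    sink = chase(start, steps);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

To mimic the setup above, one would allocate a 128MB array of pointers, build the chain with a stride at least as large as a cache line, and bind the process to a CPU before calling ns_per_load; the binding itself happens outside this code.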
Posted by sbahra at 27.7.07
Sunday, July 22, 2007
Confusion with Solaris lgrp Behavior on AMD64 System
The whole point of locality groups is to maximize (or at least attempt to maximize) the locality of the resources that applications depend on. I expected to see consistent behavior across the processors. This is not the case, and I still haven't found a valid reason why this is occurring.
Posted by sbahra at 22.7.07
Friday, July 20, 2007
ccNUMA Effect on Solaris
The memcpy test that was used on the 8-socket AMD64 system running Linux was run on a 4-socket AMD64 Solaris system. I was interested in seeing the effect of strong lgrp affinity on Solaris.
[sbahra@numa ~/ccnuma/tests/malloc] lgrpinfo
lgroup 0 (root):
    Children: 5-8
    CPUs: 0-7
    Memory: installed 16G, allocated 1.5G, free 15G
    Lgroup resources: 1-4 (CPU); 1-4 (memory)
    Latency: 120
lgroup 1 (leaf):
    Children: none, Parent: 5
    CPUs: 0 1
    Memory: installed 3.5G, allocated 290M, free 3.2G
    Lgroup resources: 1 (CPU); 1 (memory)
    Load: 0.000198
    Latency: 50
lgroup 2 (leaf):
    Children: none, Parent: 6
    CPUs: 2 3
    Memory: installed 4.0G, allocated 295M, free 3.7G
    Lgroup resources: 2 (CPU); 2 (memory)
    Load: 0
    Latency: 50
lgroup 3 (leaf):
    Children: none, Parent: 7
    CPUs: 4 5
    Memory: installed 4.0G, allocated 248M, free 3.8G
    Lgroup resources: 3 (CPU); 3 (memory)
    Load: 0.5
    Latency: 50
lgroup 4 (leaf):
    Children: none, Parent: 8
    CPUs: 6 7
    Memory: installed 4.0G, allocated 689M, free 3.3G
    Lgroup resources: 4 (CPU); 4 (memory)
    Load: 0
    Latency: 50
lgroup 5 (intermediate):
    Children: 1, Parent: 0
    CPUs: 0-5
    Memory: installed 12G, allocated 832M, free 11G
    Lgroup resources: 1-3 (CPU); 1-3 (memory)
    Latency: 83
lgroup 6 (intermediate):
    Children: 2, Parent: 0
    CPUs: 0-7
    Memory: installed 16G, allocated 1.5G, free 15G
    Lgroup resources: 1-4 (CPU); 1-4 (memory)
    Latency: 83
lgroup 7 (intermediate):
    Children: 3, Parent: 0
    CPUs: 0-7
    Memory: installed 16G, allocated 1.5G, free 15G
    Lgroup resources: 1-4 (CPU); 1-4 (memory)
    Latency: 83
lgroup 8 (intermediate):
    Children: 4, Parent: 0
    CPUs: 2-7
    Memory: installed 13G, allocated 1.2G, free 11G
    Lgroup resources: 2-4 (CPU); 2-4 (memory)
    Latency: 83
[sbahra@numa ~/ccnuma/tests/malloc]
The following two plots have memory size in bytes on the x-axis and CPU ticks (as returned by rdtsc) on the y-axis. This is a simple memcpy test with a hot cache.
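The test program itself is not reproduced in this post, so what follows is only a sketch of how a hot-cache memcpy timing loop of this kind could look. The names ticks and time_memcpy are hypothetical, as is the choice to report the minimum over several repetitions. On x86 it reads the time stamp counter through the compiler's __rdtsc intrinsic; on other architectures it falls back to clock(), which does not count TSC ticks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time stamp counter (GCC/Clang intrinsic). */
static uint64_t ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 builds: processor time, not TSC ticks. */
static uint64_t ticks(void) { return (uint64_t)clock(); }
#endif

/* Return the smallest tick count observed for copying `size` bytes.
 * One untimed copy runs first so both buffers are already cached
 * ("hot cache") when the timed copies start. */
static uint64_t time_memcpy(void *dst, const void *src, size_t size, int reps)
{
    assert(reps > 0);
    uint64_t best = UINT64_MAX;

    memcpy(dst, src, size);          /* warm-up, not timed */
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = ticks();
        memcpy(dst, src, size);
        uint64_t t1 = ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling time_memcpy for a range of sizes would give one point per size, matching the axes of the plots. Note that rdtsc is not a serializing instruction, so very small copies may be measured imprecisely.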
Posted by sbahra at 20.7.07
Saturday, July 7, 2007
Parallel Smith Waterman with Unified Parallel C on Sun T1 processor
The machine used was a Sun T2000. The algorithm is a traditional Smith-Waterman parallelized through a wave-front mechanism. To improve cache utilization, the matrix was transposed to allow for more "horizontal" accesses (which lessens cache line sharing). Alpha defines the data distribution; the workload distribution is a quotient of Alpha and Beta. This work was done by Mohammad Bakhouya (bakhouya@gmail.com) and me.
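As a rough illustration of the wave-front idea, here is a serial sketch in plain C, not the UPC code used for these results. It fills the scoring matrix one anti-diagonal at a time: every cell on a diagonal depends only on the two previous diagonals, so the inner loop is the part that could be split across threads. The function name sw_wavefront, the MAXLEN bound, and the scoring values (match 2, mismatch -1, linear gap -1) are assumptions for the example, and neither the transposed layout nor the Alpha/Beta distribution is modeled.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 64   /* hypothetical bound that keeps the sketch allocation-free */

static int max2(int a, int b) { return a > b ? a : b; }

/* Smith-Waterman local alignment score with a linear gap penalty,
 * computed in wave-front (anti-diagonal) order. */
static int sw_wavefront(const char *a, const char *b)
{
    const int match = 2, mismatch = -1, gap = -1;   /* illustrative scores */
    int n = (int)strlen(a), m = (int)strlen(b);
    assert(n <= MAXLEN && m <= MAXLEN);

    int H[MAXLEN + 1][MAXLEN + 1];
    memset(H, 0, sizeof H);          /* row 0 and column 0 stay zero */
    int best = 0;

    /* Cells with i + j == d form one anti-diagonal. */
    for (int d = 2; d <= n + m; d++) {
        int lo = max2(1, d - m);
        int hi = d - 1 < n ? d - 1 : n;
        /* Iterations of this loop are independent of each other. */
        for (int i = lo; i <= hi; i++) {
            int j = d - i;
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            int h = max2(0, H[i - 1][j - 1] + s);
            h = max2(h, H[i - 1][j] + gap);
            h = max2(h, H[i][j - 1] + gap);
            H[i][j] = h;
            best = max2(best, h);
        }
    }
    return best;
}
```

In a parallel version, the threads would synchronize once per diagonal, and how the cells of each diagonal are assigned to threads is where a data distribution parameter would come in.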





Posted by sbahra at 7.7.07
ccNUMA factor of AMD64 on Linux Performance
These measurements were taken on a Sun X4600 8-socket dual-core system. Note that the time values are half of what they should be. This is a work in progress; the plots were done by Abdullah Kayi (apokayi@gwu.edu) and me.




Posted by sbahra at 7.7.07





