Friday, August 3, 2007
Wednesday, August 1, 2007
Chomp Sample at Large Delta
The previous method applied the delta-threshold limit once per sample window (periodically). This one applies the limit at any individual sample, however sporadic.
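As a rough illustration of limiting at every sample, here is a minimal sketch. It assumes that "chomping" means clamping a sample whose change from the previous (already limited) value exceeds a threshold. The function name `chomp_per_sample` and the parameter `max_delta` are hypothetical and are not taken from the original tool.

```python
def chomp_per_sample(samples, max_delta):
    """Clamp each sample so it moves at most max_delta from the previous output.

    Assumption: "chomp" is interpreted here as clamping; the original
    implementation may have limited samples differently.
    """
    if not samples:
        return []
    limited = [samples[0]]
    for sample in samples[1:]:
        previous = limited[-1]
        delta = sample - previous
        if delta > max_delta:
            sample = previous + max_delta   # upward spike: cap the rise
        elif delta < -max_delta:
            sample = previous - max_delta   # downward spike: cap the fall
        limited.append(sample)
    return limited
```

Because the check runs on every sample rather than once per window, a single spike is limited as soon as it appears.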

Posted by sbahra at 1.8.07 (0 comments)
Synthetic Bursty Application
The application burned cycles at random intervals and copied random amounts of memory. It is not the best example of burstiness, but the goal was only to see the responsiveness of various sliding window sizes.
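A workload of this kind might look roughly like the sketch below. It is not the original application: the function name `bursty_workload`, its parameters, and the use of a spin loop plus a byte-buffer copy are all assumptions made for illustration.

```python
import random
import time


def bursty_workload(bursts, max_spin, max_copy_bytes, max_idle_s, seed=None):
    """Run `bursts` bursts and return a log of (spin_iterations, bytes_copied).

    Each burst idles for a random interval, burns a random number of loop
    iterations, then copies a random amount of memory. All names and
    parameters here are hypothetical.
    """
    rng = random.Random(seed)
    source = bytearray(max_copy_bytes)
    log = []
    for _ in range(bursts):
        time.sleep(rng.uniform(0, max_idle_s))   # random gap between bursts
        spin = rng.randint(1, max_spin)
        acc = 0
        for i in range(spin):                    # burn cycles
            acc += i
        nbytes = rng.randint(1, max_copy_bytes)
        copied = bytes(source[:nbytes])          # copy a random amount of memory
        log.append((spin, len(copied)))
    return log
```

Passing a fixed `seed` makes the burst pattern repeatable, which helps when comparing different window sizes against the same input.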
Posted by sbahra at 1.8.07 (0 comments)
mencoder Analysis (AMD64/ccNUMA)
This plot shows L2 cache utilization by mencoder over part of its lifetime, using different methods for calculating cache utilization at different rates in time (average versus sliding window versus real).
Here are the timing results of mencoder on cpu0, cpu2 and cpu6.
CPU 0:
real 51.6
user 50.4
sys 1.0
CPU 2:
real 49.9
user 48.6
sys 1.1
CPU 6:
real 51.5
user 50.1
sys 1.3
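The difference between the average and sliding-window methods in the plot above can be sketched as follows. This is a minimal illustration over a plain list of samples. The function names and the `window` parameter are assumptions, and "average" is taken here to mean the running average over all samples seen so far.

```python
from collections import deque


def running_average(samples):
    """Average of all samples seen so far, one value per input sample."""
    averages = []
    total = 0.0
    for count, sample in enumerate(samples, 1):
        total += sample
        averages.append(total / count)
    return averages


def sliding_window_average(samples, window):
    """Average of only the most recent `window` samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    recent = deque()
    total = 0.0
    for sample in samples:
        recent.append(sample)
        total += sample
        if len(recent) > window:
            total -= recent.popleft()   # drop the oldest sample from the sum
        averages.append(total / len(recent))
    return averages
```

A smaller window tracks the real signal more closely but is noisier, while the running average smooths out bursts entirely as the sample count grows.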
Posted by sbahra at 1.8.07 (0 comments)
Tuesday, July 31, 2007
Friday, July 27, 2007
x4600 Diagram
After I brute-forced a topology that is reasonable and describes the observed performance characteristics, Abdullah Kayi found a German site with a diagram showing CPU module use in the 4-socket x4600 AMD64 system. It seems the topology I guessed was correct.
Posted by sbahra at 27.7.07 (1 comment)
Memory Bandwidth on Solaris with CPU Binding on 4-socket x4600
Posted by sbahra at 27.7.07 (0 comments)
Read Latency on Solaris with CPU Binding
The lat_mem_rd microbenchmark, which is part of lmbench, was used to measure latency. Note that the x and y axes are in log scale and that the machine was a 4-socket x4600. An array size of 128MB was used.
Posted by sbahra at 27.7.07 (0 comments)
Sunday, July 22, 2007
Confusion with Solaris lgrp Behavior on AMD64 System
The whole point of locality groups is to maximize (or at least attempt to maximize) the locality of the resources that applications depend on. I expected to see consistent behavior across the processors. That is not the case, and I still haven't found a valid reason why this is occurring.
Posted by sbahra at 22.7.07 (0 comments)
Friday, July 20, 2007
ccNUMA Effect on Solaris
The memcpy test that was used on the 8-socket AMD64 system running Linux was run on a 4-socket AMD64 Solaris system. I was interested in seeing the effect of strong lgrp affinity on Solaris.
[sbahra@numa ~/ccnuma/tests/malloc] lgrpinfo
lgroup 0 (root):
Children: 5-8
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 120
lgroup 1 (leaf):
Children: none, Parent: 5
CPUs: 0 1
Memory: installed 3.5G, allocated 290M, free 3.2G
Lgroup resources: 1 (CPU); 1 (memory)
Load: 0.000198
Latency: 50
lgroup 2 (leaf):
Children: none, Parent: 6
CPUs: 2 3
Memory: installed 4.0G, allocated 295M, free 3.7G
Lgroup resources: 2 (CPU); 2 (memory)
Load: 0
Latency: 50
lgroup 3 (leaf):
Children: none, Parent: 7
CPUs: 4 5
Memory: installed 4.0G, allocated 248M, free 3.8G
Lgroup resources: 3 (CPU); 3 (memory)
Load: 0.5
Latency: 50
lgroup 4 (leaf):
Children: none, Parent: 8
CPUs: 6 7
Memory: installed 4.0G, allocated 689M, free 3.3G
Lgroup resources: 4 (CPU); 4 (memory)
Load: 0
Latency: 50
lgroup 5 (intermediate):
Children: 1, Parent: 0
CPUs: 0-5
Memory: installed 12G, allocated 832M, free 11G
Lgroup resources: 1-3 (CPU); 1-3 (memory)
Latency: 83
lgroup 6 (intermediate):
Children: 2, Parent: 0
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 83
lgroup 7 (intermediate):
Children: 3, Parent: 0
CPUs: 0-7
Memory: installed 16G, allocated 1.5G, free 15G
Lgroup resources: 1-4 (CPU); 1-4 (memory)
Latency: 83
lgroup 8 (intermediate):
Children: 4, Parent: 0
CPUs: 2-7
Memory: installed 13G, allocated 1.2G, free 11G
Lgroup resources: 2-4 (CPU); 2-4 (memory)
Latency: 83
[sbahra@numa ~/ccnuma/tests/malloc]
The following two plots show memory size in bytes on the x-axis and CPU ticks (as returned by rdtsc) on the y-axis. This is a simple memcpy test with a hot cache.
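The shape of such a test can be sketched as below. This is not the original benchmark, which called memcpy and read the timestamp counter with rdtsc. As a stand-in, this sketch copies between two byte buffers and uses `time.perf_counter_ns`, so it reports nanoseconds rather than CPU ticks. The function name `time_copies` and the `repeats` parameter are hypothetical.

```python
import time


def time_copies(sizes, repeats=5):
    """Return (size_in_bytes, best_elapsed_ns) for a buffer copy of each size.

    Assumption: a warm-up copy before timing approximates the "hot cache"
    condition of the original test.
    """
    results = []
    for size in sizes:
        source = bytearray(size)
        target = bytearray(size)
        target[:] = source                      # warm-up copy, not timed
        best = None
        for _ in range(repeats):
            start = time.perf_counter_ns()
            target[:] = source                  # the timed copy
            elapsed = time.perf_counter_ns() - start
            if best is None or elapsed < best:
                best = elapsed
        results.append((size, best))
    return results
```

Taking the minimum over several repeats reduces noise from scheduling. Binding the process to a CPU, as in the posts above, would have to be done outside this sketch.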
Posted by sbahra at 20.7.07 (0 comments)
Saturday, July 7, 2007
Parallel Smith Waterman with Unified Parallel C on Sun T1 processor
The machine used was a Sun T2000. The algorithm is a traditional Smith-Waterman, parallelized through a wavefront mechanism. To improve cache utilization, the matrix was transposed to allow for more "horizontal" accesses, which lessens cache-line sharing. Alpha defines the data distribution; the workload distribution is the quotient of Alpha and Beta. This work was done by myself and Mohammad Bakhouya (bakhouya@gmail.com).
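The wavefront idea can be illustrated with a short sequential sketch. It is written in Python rather than Unified Parallel C, and it does not model the matrix transposition or the Alpha and Beta distribution parameters. The scoring values (`match`, `mismatch`, `gap`) and the function name are assumptions chosen for the example.

```python
def smith_waterman_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score, filling the matrix by anti-diagonals.

    Every cell on anti-diagonal d = i + j depends only on cells from
    diagonals d - 1 and d - 2, so the inner loop over i could be split
    across threads. Here it simply runs sequentially.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for d in range(2, rows + cols - 1):          # one wavefront per anti-diagonal
        i_low = max(1, d - (cols - 1))
        i_high = min(rows - 1, d - 1)
        for i in range(i_low, i_high + 1):       # independent cells on this front
            j = d - i
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            best = max(best, score[i][j])
    return best
```

In a parallel version, each wavefront would end with a barrier before the next diagonal starts, since the next diagonal reads the values just written.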
Posted by sbahra at 7.7.07 (0 comments)
ccNUMA factor of AMD64 on Linux Performance
These measurements were taken on a Sun X4600 8-socket dual-core system. Note that the time values are half of what they should be. This is work in progress; the plots were done by myself and Abdullah Kayi (apokayi@gwu.edu).
Posted by sbahra at 7.7.07 (0 comments)