|
demos are
explained here;
a menu at top column right indexes actual topic demos.
Here we demo stat.
problem
To study systems with activity consisting of very many events with varying sizes — like i/o operations — Wil likes to count them all and find both averages and standard deviations. Essentially this is a kind of summary: N operations map to triple (n, avg, sdev) no matter how large N gets. A few years ago Wil wrote class yrsum — meaning real (sequence) sum — to cheaply capture necessary information to calculate average and variance using only count, sum, and sum of squares. Here are the state members: struct yrsum { // real sequence sum maps to standard dev «
long long m_len; // number of points observed
double m_sum; // sum of all points
double m_squares; // sum of squared points
yrsum() : m_len(0), m_sum(0.0), m_squares(0.0) { }
void clear() { m_len = 0; m_sum = 0.0; m_squares = 0.0; }
Since all parts might as well be public, yrsum is just a struct with methods. It starts all zeroes, and can be cleared at will to restart. Adding a new observation is simple: void add(double x) { // add new event of size x «
++m_len; m_sum += x; m_squares += x * x; }
Using add() as a base, Wil defines overloaded operator+=() variants to highlight expected usage: yrsum& operator+=(double x) { add(x); return *this; } «
yrsum& operator+=(long x) { add((double) x); return *this; }
yrsum& operator+=(long long x) { add(x); return *this; }
yrsum& operator+=(u32 x) { add((double) x); return *this; }
yrsum& operator+=(u64 x) { add((double) x); return *this; }
Wil also defines difference of two yrsum instances, as well as division by a common factor: yrsum operator-(const yrsum& x) const { «
yrsum copy(*this);
copy.m_len -= x.m_len;
copy.m_sum -= x.m_sum;
copy.m_squares -= x.m_squares;
return copy;
}
yrsum& operator-=(const yrsum& x) {
m_len -= x.m_len;
m_sum -= x.m_sum;
m_squares -= x.m_squares;
return *this;
}
yrsum& operator/=(unsigned divisor) {
m_len /= divisor;
m_sum /= divisor;
m_squares /= divisor;
return *this;
}
Wil uses operator/=() only when stats need a periodic decay step to diminish effect of earlier observations. We're getting close to the interesting part now, which is variance; note standard deviation is defined as the square root of variance. Below are getters for parts of the primary (n, avg, sdev) triple. Following the code, we'll look at an explanation for why the variance() code works. long long size() const { return m_len; } «
operator long long() const { return m_len; } «
double mean() const { // synonym for average «
return (m_len>1)? m_sum/m_len : m_sum; }
double avg() const { return this->mean(); } «
double variance() const { // square of standard dev «
if (m_len > 1) {
double mu = this->mean();
double n_mu_sq = m_len * (mu * mu);
return (m_squares - n_mu_sq)/(m_len - 1);
}
return 0.0;
}
double sdv() const { // standard deviation «
double v = this->variance(); // std dev is sqrt(v)
return (v < 0.0)? ::sqrt(-v): ::sqrt(v);
}
}; // struct yrsum
Why does code shown in variance() work? What's the definition of variance? And how can we calculate the value given only count, sum, and sum of squares? (The following presentation was last shown in August2007, but the original version dates back to 2005.) The field of statistics supplies what we want. You can calculate average and standard deviation without keeping the entire population for analysis. To calculate variance, all you need is the three values we have:
The definition of variance is the sum of terms (xi - μ)2 divided by (N-1), where μ is the average value (aka the mean). If you expand the squared term by simple multiplication into xi2 - 2xiμ + μ2, you find you can remove the awkward xi component because xiμ is equal to μ2 in the average case, looking at all terms in the summation. So (-2xiμ + μ2) → -μ2, and variance is just (Σxi2-Σμ2) /(N-1). Obviously since you can't divide by zero, this only makes sense for values of N over one: a single value doesn't vary. For this reason, variance() is zero for only one observation, which makes sense.
floating point
"What about all those floating point operations?" Dex objected. "Don't they slow things down? What about floating point processor contention? I hear that's bad." "Of course I've been asked that before," Wil noted. "So far the problem has only been theory. I haven't seen it, at all." "Maybe you didn't look," Dex slyly accused. "I have a guy," Wil responded, "who does profiling for me using hand crafted tools suitable for seeing latency due to cache line misses on specific assembler instructions." "Um," Dex mumbled. "He was one of my picks," Wil continued. "He cut his teeth on old school super computer assembler. So far he hasn't said a thing about any of the floating point. All that matters is where I touch lots of memory. What can I say?" "Maybe he's wrong," Dex persisted. "Yeah, that's it."
license
All this code is available only under the BriarPig mu-babel license described fully on the rights page. You do not have permission to reprint this page in any way. No feeds or repackaging is allowed. You can link this page if you want folks to read it. |
A submenu for demos appears below, letting you
go to the page on a topic written as a demo (as the
demos page defines it).
menu
thorn: todo, names, fd, iovec, assert, log, run, hex, crc, buf, in, out, quote, escape, compare, file, deck, cow, arc, blob, tree, slice, rand, time, stat « Þ, hash, heap, node, primes, page, book, pile, stack, atomic, lock, mutex, thread, map, meter, list, iter, ctype (mu: toy, peg, imm, tag, box, symbol, token, number, bigint, class, method, reader, writer, eval, env, vm, gc, world, pcode, compiler, asm, lathe, lisp, smalltalk, design, weight, jar, card, harp, debug, profile) Some demos are stubs: todo is a demo guide. See toy for mu updates on language pages; names introduces naming schemes.
boxmuller
The following sample code empirically verifies mutual consistency of yrsum (shown column left) calculating standard deviation from a sequence, and class yr64g in the rand demo which generates a pseudo random stream of values with a specified mean and standard deviation (cf ») using a Box-Muller tranformation. (This material last appeared in August2007, based on blog material from 2001. This is a light rewrite.) The following sample code basically tests a sequence of sample outputs from yr64g using a target mean and standard deviation (see yr64g::grandom()) while summing all these values with an instance of yrsum. Each value is also plotted in a histogram, which is later printed to visually illustrate the bell shape curve. The final mean and standard deviation reported by yrsum closely match those targeted by yr64g arguments. For a change, let's start with test program output printed on stdout. Note observed average reported is 1502.05 (target: 1500.0) with standard deviation 447.30 (target: 450.0). avg=1502.05 var=200076.06 sd=447.30 cv= 0.30
0000-0100: +
0100-0200: *
0200-0300: *
0300-0400: #*
0400-0500: ###-
0500-0600: ###+
0600-0700: #######-
0700-0800: ###########-
0800-0900: #############
0900-1000: ##########################*
1000-1100: ##########################
1100-1200: #####################################*
1200-1300: ###################################*
1300-1400: ##########################################
1400-1500: ############################################-
1500-1600: ##########################################-
1600-1700: ##############################################*
1700-1800: #########################################*
1800-1900: ###############################*
1900-2000: ########################
2000-2100: ###########################
2100-2200: #################
2200-2300: #######*
2300-2400: #######*
2400-2500: ####
2500-2600: ##+
2600-2700: ##+
2700-2800: #-
2800-2900: +
2900-3000: +
This output is based on 2048 data points falling into 30 histogram buckets according to these defined constants: #define TMP_ARRY_LEN 2048 /* total population to generate*/
#define TMP_HILL_LEN 30 /* number of buckets in histogram*/
The test starts by zeroing histogram hill: int yr64g_examples() { «
// make gaussian points, find variance, graph plot
int hill[ TMP_HILL_LEN + 1 ]; // collect gaussian population
double ary[ TMP_ARRY_LEN + 1 ];
u32 k;
for (k = 0; k < TMP_HILL_LEN; k++)
hill[k] = 0;
Then it declares yrsum to measure stats and yr64g with target mean and standard deviation to generate a normal distribution. The loop fills array ary with each random value, in case further analysis is desired later. Each value is added to yrsum. yrsum rsum; // double sequence to (len, sum, sum-of-squares)
u32 seed = 31555;
yr64g nm(&seed, /*avg*/ 1500.0, /*sdev*/ 450.0);
for (int i = 0; i < TMP_ARRY_LEN; i++) {
double point = ++nm;
ary[ i ] = point;
rsum.add(point);
u32 j = (u32) (point / 100.0);
if (j < TMP_HILL_LEN)
++hill[j];
}
This part writes the first summary line in output: double avg = rsum.mean(); // average value
double var = rsum.variance(); // square of std dev
double sd = rsum.sdv(); // standard deviation
fprintf(stdout, "avg=%5.2lf var=%5.2lf sd=%5.2lf cv=%5.2lf\n",
avg, var, sd, sd/avg);
The last loop prints each histogram bucket as a single line using stars() below to print a tally. for (k = 0; k < TMP_HILL_LEN; k++) { // for each bucket
fprintf(stdout, "\n%04d-%04d: ", k*100, (k+1)*100); // label
stars(hill[k]); // helper printing hits in the bucket
}
fputc('\n', stdout); fflush(stdout); // final trailing newline
return 0; // return value is ignored
}
A simple-minded tally is printed by stars() using letters to symbolize one, two, three, and four tally marks each using -, +, *, and # respectively. void stars(int n) { // show one histogram bucket «
int top = n / 4; // histogram bucket has one '#' per 4 hits
for (int i = 0; i < top; i++) // one '#' per group of 4
fputc('#', stdout); // (note glyph # has four lines)
int last = n % 4; // the last partial group
if (last) { // a last partial group to show as another byte?
static const char* odd = " -+*"; // 0 1 2 3
int c = odd[last]; // each glyph has last tick marks
fputc(c, stdout); // 1:- 2:+ 3:*
}
}
|
demos « Þ
+ todo + names + fd + iovec + assert + log + run + hex + crc + buf + in + out + quote + escape + compare + file + deck + cow + arc + blob + tree + slice + rand + time + stat « Þ + hash + heap + node + primes + page + book + pile + stack + atomic + lock + mutex + thread + map + meter + list + iter + ctype |