|
demos are
explained here;
a menu at top column right indexes actual topic demos.
Here we demo run.
problem
Wil must often operate on memory using only starting address and length. A large set of utility methods can be implemented knowing only this much about content. A yv object for contiguous octet sequences represented in pointer-plus-length form was already introduced at the start in the demos root — so this page is a continuation of that demo. But here we begin again with basic state members. Code for intersecting two yv runs won't be shown again; you need to see demos for that (cf vintersect«). But vintersect() is used by other yv methods (such as for slicing, cf »).
memory
Struct yv is a description of contiguous memory with length in octets, just like an iovec but with convenience methods like construction and conversion with other types. And hash, and compare. And whatever else you want to do with a pointer and length. Wil also tends to add new yv methods porting std C string library methods for nul terminated strings, but without need for nul termination.
struct yv { // vector of bytes (like iovec) «
u8* v_p; // byte pointer «
n32 v_n; // length in bytes «
yv() : v_p(0), v_n(0) { } «
yv(const void* p, n32 n) : v_p((u8*) p), v_n(n) { }
explicit yv(const iovec& v) «
: v_p((u8*) v.iov_base), v_n(v.iov_len) { }
yv& operator=(const iovec& v) {
v_p = (u8*) v.iov_base; v_n = v.iov_len; return *this; }
operator iovec() const { // auto convert yv to iovec «
iovec v; v.iov_base = v_p; v.iov_len = v_n; return v; }
// (continued ...)
};
The last three inline methods above are casually used everywhere in demos with iovecs, making iovecs and yv interchangeable as values in many places because they convert to one another. Just because memory is described doesn't mean it exists — in some apps it can make sense to use yv instances for unmapped memory, calculating values with pointers never dereferenced. Knowing what can be touched is someone else's job — yv doesn't care. The empty constructor inits both members to zero. This means you can't make vectors of yv that are not touched in construction. (You should use iovec for that when necessary — it has no constructor; your iovec vectors have no time cost until used.)
init
In addition to api shown above, several other ways exist to init yv instances. Default assignment and copy construction are fine. struct yv { // (... continued)
explicit yv(const char* cstr) «
: v_p((u8*) cstr), v_n((cstr)? ::strlen(cstr) : 0) { }
explicit yv(const std::string& str) «
: v_p((u8*) str.c_str()), v_n((n32) str.size()) { }
yv& operator=(const std::string& s) {
v_p = (u8*) s.c_str(); v_n = (n32) s.size(); return *this; }
operator std::string() const { // convert to std string
return std::string((const char*) v_p, v_n); }
Wil uses vinit() to re-construct, and vclear() to zero. Conversion to and from std::string is just as easy as iovec. Of course, if you construct yv from std::string, the space only lives as long as std::string keeps it; so take care.
order
Method veq() comparing two yv instances for equality appeared first in the demos root (cf «) with more comments. Ordering two octet sequences lexicographically is done by vcmp() which orders bytes much the way memcmp() does: in binary. bool veq(const yv& v) const { // true iff identical content «
n32 n = v.v_n;
return n == v_n && (!n || 0 == ::memcmp(v_p, v.v_p, n));
}
bool operator==(yv const& v) const { return this->veq(v); } «
bool operator!=(yv const& v) const { return !this->veq(v); } «
static int vcmp(const yv& one, const yv& two); // caseful «
static int vicmp(const yv& one, const yv& two); // caseless
bool operator< (yv const& v) const { return vcmp(*this, v)< 0; }
bool operator<=(yv const& v) const { return vcmp(*this, v)<=0; }
bool operator> (yv const& v) const { return vcmp(*this, v)> 0; }
bool operator>=(yv const& v) const { return vcmp(*this, v)>=0; }
But vcmp() differs from memcmp() by using separate lengths for each run, so either can end first and be considered less than when all preceding bytes are equal. You want this code rewritten in assembler when called often enough to show in profiles as a big consumer of cycles. This version is simple but doesn't pre-load memory, nor align addresses, nor compare multiple bytes at once. /*static*/ int yv::vcmp(const yv& one, const yv& two) { «
u8* a = one.v_p; // first byte of one
u8* ax = a + one.v_n; // 1 beyond last byte
u8* b = two.v_p; // first byte of two
u8* bx = b + two.v_n; // 1 beyond last byte
while (a < ax && b < bx) { // reached end of neither?
if (*a != *b)
return ((int)*a) - ((int)*b);
++a; ++b;
}
// either a or b or both was exhausted (both or one only):
if (a >= ax) // a only, or a and b both?
return (b >= bx)? 0 : -1; // 0=equal: both; 1=a>b: a only
else
return 1; //-1:a<b if b exhausted but not a too
}
Case insensitive compare in vicmp() is not shown for brevity, and because it's based on tolower() in ctype.h in the standard C library and is something of a kludge since it presumes too much about encoding. But the only lines differing from vcmp() are these: if (*a != *b) { // first mismatching octet? «
int la = tolower(*a); int lb = tolower(*b);
if (la != lb) return ((int)la) - ((int)lb);
}
This approach to case insensitive compare follows an optimization principle: do not canonicalize case until necessary. Be lazy.
prefix and suffix
(Wil shows prefixes and suffixes here because the code uses vicmp() for case insensitive compare just described above.) These two methods select N bytes from the start or end of a run: vprefix(N) returns at most N leading bytes and vsuffix(N) returns at most N trailing bytes. At most v_n bytes is returned. struct yv { // (... continued)
public:
yv vprefix(u32 n) const { return yv(v_p, (n<v_n)? n: v_n); } «
yv vsuffix(u32 n) const { // return yv of final n octets «
if (n > v_n) // more than all of the run?
return yv(*this);
else {
u8* x = v_p + v_n; // one after last byte in run
return yv(x - n, n);
}
}
}; // struct yv
Boolean predicates vstarts() and vends() are true if yv starts or ends with exactly the pattern argument supplied. This is true when the subset of that length is equal to the pattern: (Yes vends() is only one trailing letter different in name from vend() used with iterators elsewhere in this api (cf ») — sorry.) bool yv::vstarts(const yv& head) const { // head is a prefix? «
return v_n >= head.v_n && this->vprefix(head.v_n) == head;
}
bool yv::vends(const yv& tail) const { // tail is a suffix? «
return v_n >= tail.v_n && this->vsuffix(tail.v_n) == tail;
}
Those boolean predicates also have case insensitive forms, but they assume an encoding understood by tolower() since that's how vicmp() makes case canonical when difference is seen. bool yv::vistarts(yv const& head) const { // head is ci prefix? «
if ( v_n >= head.v_n ) { // as long as prefix head?
yv pre = vprefix(head.v_n); // my prefix with length of head
return yv::vicmp(pre,head) == 0; // case insens cmp is equal
}
return false;
}
bool yv::viends(yv const& tail) const { // tail is ci suffix? «
if ( v_n >= tail.v_n ) { // as long as suffix tail?
yv suf = vsuffix(tail.v_n); // my suffix with length of tail
return yv::vicmp(suf,tail) == 0; // case insens cmp is equal
}
return false;
}
If tolower() for lowercasing doesn't suit an actual encoding in your content, you can use this code as a template for your code.
compare
This section shows the beginning of code to compare with intent to find longest matches. More of this sort of code appears in the compare demo, so this is only a taste showing the idea, which is counting bytes that match starting at a known position. Like vcmp() in the last section for yv ordering, this code can be redone in assembler when vsame() becomes a profile bottleneck. But it won't when more exotic variants more specific to a situation are used instead. n32 yv::vsame(const yv& pattern) const { «
const u8* t = pattern.v_p;
const u8* s = (const u8*) v_p;
n32 cmpLen = pattern.v_n;
if (s && t) {
if (cmpLen > v_n) // pattern is longer than this?
cmpLen = v_n; // min(v_n, pattern.v_n)
const u8* x = s + cmpLen; // one past last byte to see
--s; // prepare for preincrement
while ( ++s < x ) { // another byte to examine?
if ( *s != *t++ ) { // found non-matching byte?
return s - v_p; // distance from mismatch to start
}
}
return cmpLen; // all the requested bytes matched
}
return 0; // empty
}
Wil mainly uses vsame() to test other code in compare demos, so speed is less important than a clear result in simple code. Overloading to match leading bytes in an iovec vector below is trivial using the ydvp iovec iterator shown in the iovec demo. n32 yv::vsame(const ydvp& pattern) const { «
ydvp iter(pattern, 0); // locally modified copy
n32 outLen = 0; // value returned by this method
yv copy(*this); // local copy of this run we can modify
for ( ; iter; ++iter) {
yv next(*iter); // construct from next iovec in iter
n32 nextLen = copy.vsame(next);
outLen += nextLen;
if (nextLen < next.v_n) // did not match all of next?
return outLen; // stop matching now
copy.vskip(nextLen); // advance beyond bytes just matched
}
return outLen;
}
A call to vskip() toward the end uses a method shown in the next section — it removes the leading N bytes from a run.
parsing
Some yv methods divide octet sequences into pieces while parsing. Wil uses multiple temporary stack yv instances to keep track of various positions while scanning content in octet runs. void vtail() { if (v_n) { ++v_p; --v_n; } } // skip 1st u8 «
void vskip(u32 n) { if (v_n>=n) { v_p+=n; v_n-=n; } } «
Wil often uses vtail() in conjunction with vsplit() below, for reasons that will become apparent. Note vsplit() resembles method vindex() shown column right (cf »). u8* yv::vsplit(register int c, yv& outSecondHalf) { «
// vsplit() returns same value returned by vindex() (right).
// But vsplit also alters this and outSecondHalf, so this
// run ends just before first c, and outSecondHalf starts w/
// the first c. Callers use vtail() to remove the leading c.
register u8* p = v_p;
u8* end = p + v_n; // one beyond last byte in the run
--p; // prepare for preincrement
while ( ++p < end ) { // another byte to examine?
if ( *p == c ) { // found c? need to split run in 2 parts?
this->v_n = p - v_p; // bytes remaining in 1st half
outSecondHalf.vinit(p, end - p); // u8s to 2nd half
return p;
}
}
// c was not found inside run anywhere, so 2nd half is empty:
outSecondHalf.vinit(end, 0); // no match, but return end
return (u8*) 0; // nil
}
Since vsplit() modifies the original yv (as well as arg outSecondHalf) you must pass in copies if you need originals unchanged; the return is the same as index() for two reasons: it helps document behavior, and you can test it for nil to see if dividing the original run in two happened. The following iterator class is based on vsplit(). struct yv { // (... continued)
public: // iter api
class Vp;
inline Vp vbegin(int c) const; // see after yv::Vp
inline y0x* vend() { return (y0x*) 0; } «
}; // struct yv
Here's an example of a nested class starting with the last letter of the containing class but in uppercase. And since p means pointer or iter, the name Vp means vector iter. Normally that would iterate vector elements. But since that's boring and trivial for yv, instead this class visits yv subsets divided by an octet passed to vbegin(). class yv::Vp { // yv iter: visit marker separated substrings «
private:
yv p_v; // current substring before byte c found in orig run
yv p_m; // more substring following first c found in run
int p_c; // marker octet between substrings in orig run
void _pnext() { «
p_v = p_m; // resume search in remainder after last c seen
if (p_v.vsplit(p_c, p_m))
p_m.vtail(); // skip leading c in remainder
}
public:
Vp(yv const& v, int c) : p_v(v), p_m(v), p_c(c) {
if (p_v.vsplit(p_c, p_m))
p_m.vtail(); // skip leading c in remainder
}
operator bool() const { return p_v.v_n || p_m.v_n; }
bool operator==(const y0x*) const {
return !p_v.v_n && !p_m.v_n; }
bool operator!=(const y0x*) const {
return p_v.v_n || p_m.v_n; }
yv operator*() const { return p_v; } «
void operator++() { _pnext(); } // prefix
// prefix is better; we act just like prefix here (beware)
void operator++(int) { _pnext(); } // postfix
}; // class yv::Vp
Note the definition of inline yv::vbegin() after the definition of class yv::Vp so that class is defined. Some folks might not know it's this easy to delay inline definition. Wil does it casually at need. Wil sometimes uses yv::Vp to iterate clumps of command line arguments passed as a single arg. Here's a bit of sample code showing what the iterator returns given a sample C string using colon as a separator marking subset divisions: yv args("age=49:eyes=blue::hair=brown:fun=no"); «
yv::Vp p = args.vbegin(':');
for ( ; p != args.vend(); ++p)
yout << "(" << *p << ")" << yendl;
yout << ynow; // flush
This code puts parentheses around each value returned from the iterator to make results clear. The output for this appears on stdout below. Note the empty string between 2nd and 4th values resulting from two adjacent colons in the input. (age=49)
(eyes=blue)
()
(hair=brown)
(fun=no)
The purpose of using vtail() in the code is to remove colon : from iter output returned. (The name tail here treats a run like a list of octets where head is the first and tail is the rest. So calling vtail() removes the head octet and keeps the rest.) It's trivially easy to iterate (for example) HTTP query parameters using this iterator, using a second nested iterator in the loop to find equal sign (=) between key and value. But a separate iterator for this specifically is often simpler, and more easily distinguishes all legal forms.
scanning
Especially when reading yv content from non-critical code, Wil just uses the standard C library to do simple things, even when it forces a copy to get a required end nul byte for C string termination. This tactic is good for placating coworkers who only ever use atoi() to convert strings to numbers; in test code Wil doesn't care. These methods get used in config and test code, but seldom elsewhere: i32 yv::vatoi32() const { // MAN ATOI(3) «
if (v_p && v_n) {
char copy[ 512 + 1]; // so we can add an end nul
u32 cpyLen = (v_n < 512)? v_n : 512;
::memcpy(copy, v_p, cpyLen);
copy[cpyLen] = 0;
return ::atoi(copy);
}
return -1;
}
i64 yv::vatoi64() const { // MAN ATOLL(3) «
if (v_p && v_n) {
char copy[ 512 + 1]; // so we can add an end nul
u32 cpyLen = (v_n < 512)? v_n : 512;
::memcpy(copy, v_p, cpyLen);
copy[cpyLen] = 0;
return ::atoll(copy);
}
return -1;
}
Wil uses scanf() on yv instances using vscanf(), which makes a temp local copy with an end nul first. The 2K upper limit on max content size is a reasonable compromise in normal coding contexts. This is usually another config time feature used during app setup. // vscanf() makes temp stack copy ensuring end nul char:
int yv::vscanf(const char* fmt, ...) { // uses ::vsscanf() «
const char* p = (const char*) v_p;
u32 n = v_n;
if (!p || !n) { return -1; }
char temp[ 2048 + 2 ]; // temp buf used if necessary
if (p[n-1]) { // no nul byte in last byte of vector?
if (n > 2048) n = 2048; // copy no more than temp buf
::memcpy(temp, p, n);
temp[n] = 0; // write nul after last byte copied
p = temp;
}
va_list args;
va_start(args,fmt);
int outVal = ::vsscanf(p, fmt, args);
va_end(args);
return outVal;
}
Since a constant purpose of yv is avoiding memory scans to find end nuls in strings, this code using vsscanf() is a casual, expedient surrender under practical circumstances.
allocation
Wil suballocates memory from yv instances, especially when they describe a page or book meant for suballocation. In this sense yv acts like a heap that allocates by pointer bumping. All the yv space allocation methods below are inlines. The following 4K constant is used to define size of one page; this isn't a portable way to do it, and it's wrong on some systems — typically those with 8K pages. But it makes this api simple, so go with it for now. #define yv_PAGESZ (4096) /*alignment of vpage() « */
Method vtake() returns the next N bytes of yv or returns nil when N is more than what remains; the code looks like vskip() above (cf «) except a pointer is returned. The toy language under mu uses vtake() often for space allocation inside pages. struct yv { // (... continued)
u8* vtake(unsigned n) { // like vskip() but return leading n «
if (n <= v_n) { // enough bytes remain in this octet vector?
u8* p = v_p; v_p = p + n; v_n -= n; return p;
}
else // the unlikely case: not enough space
return 0; // nil when yv space is exhausted
}
When new space must contain zeroes, Wil uses vcalloc() below instead of vtake(), which merely adds a memset() call. Pointer v_p isn't checked for nil first here or in vtake() because it's one of those places Wil thinks a crash is good on nil in runs used as heaps. /// \brief vcalloc() is like vtake() but zeroes allocated space
u8* vcalloc(unsigned n) { // like vskip() but returns b bytes «
if (n <= v_n) { // enough bytes remain in this octet vector?
u8* p = v_p; v_p = p + n; v_n -= n;
::memset(p, 0, n);
return p;
}
else // unlikely case:
return 0; // nil when the yv space is exhausted
}
The behavior of method vpage() resembles system library call valloc() in effect: it returns page-aligned memory from yv, skipping as many bytes as needed to force alignment. Other kinds of aligned allocation are easily generalized from this code. Note yip32 is understood as a synonym for ptrdiff_t (as explained on the types page). /// \brief vpage() acts like valloc() but allocating from yv
u8* vpage(unsigned sz) { // alloc sz page-aligned bytes «
u8* p = v_p;
u32 n = v_n;
yip32 a = (yip32) p; // address p as ptrdiff_t to align
yip32 lo = (yv_PAGESZ-1) & a; // low bits of a (in a page)
if (0 != lo) { // p not already aligned? (add complement?)
yip32 rest = yv_PAGESZ - lo; // complement: rest of page
if (rest > (yip32) n) // skipping rest exhausts remainder?
return 0; // allocation cannot succeed
p += rest; n -= rest; // p now aligned; n is new remainder
}
if (n > sz) { // able to alloc sz bytes?
v_p = p + sz; // next alloc follows this one
v_n = n - sz;
return p; // sz bytes allocated at page aligned boundary
}
return 0; // not enough bytes left in run
}
}; // struct yv
Because Wil already has heap apis to allocate pages, Wil uses this vpage() method only in special circumstances like suballocation from big swaths of pre-allocated space like shared memory.
search
Occasionally one needs a method for naive pattern search — for example when writing unit tests to show your superior Boyer-Moore algorithm returns correct answers, or when writing unit tests showing de-dup algorithms find expected patterns. Naive vfind() tries every possible position in turn while testing for equality. u8* yv::vfind(yv const& pat) const { // find pattern, else 0 «
// plain vanilla, naive brute force search -- not BoyerMoore
if (v_n >= pat.v_n) { // big enough to contain pat?
u32 extra = v_n - pat.v_n; // extra u8s in excess of pat.v_n
u8* p = v_p;
u8* end = p + extra; // 1 past last u8 that might start pat
for (/*prep preincr*/--p; ++p < end; ) {
if (::memcmp(p, pat.v_p, pat.v_n)==0) // match?
return p;
}
}
return 0;
}
An iterator finding all instances of a pattern is trivial to write using vfind() since you can restart the search in a smaller yv starting just after the returned address. Case insensitive vifind() looks the same except for the loop: for (/*prep preincr*/--p; ++p < end; ) { // vifind() «
yv here(p, pat.v_n); // next u8 position with pat length
if (yv::vicmp(here, pat)==0) // case insens match?
return p;
}
The yv api aims to cover simple problems using pointer and length. Obviously production code should use faster search. |
A submenu for demos appears below, letting you
go to the page on a topic written as a demo (as the
demos page defines it).
menu
thorn: todo, names, fd, iovec, assert, log, run « Þ, hex, crc, buf, in, out, quote, escape, compare, file, deck, cow, arc, blob, tree, slice, rand, time, stat, hash, heap, node, primes, page, book, pile, stack, atomic, lock, mutex, thread, map, meter, list, iter, ctype (mu: toy, peg, imm, tag, box, symbol, token, number, bigint, class, method, reader, writer, eval, env, vm, gc, world, pcode, compiler, asm, lathe, lisp, smalltalk, design, weight, jar, card, harp, debug, profile) Some demos are stubs: todo is a demo guide. See toy for mu updates on language pages; names introduces naming schemes.
run
Wil uses term run as a noun meaning "a contiguous sequence of x", and refers to yv as a run of octets, or just a run. This confuses a few folks (especially speakers of English as a second language) who think run is only a verb, and prefer words have only one meaning. But words have many meanings. Wil often uses run as a short synonym for sequence; this demo uses it to mean yv.
strings
A yv run is not itself a string, but might contain a string. So yv has methods to manipulate some kinds of strings, e.g. runs of Latin1 octets (and later, Unicode codepoints). But it's up to code using yv to decide if string api use is right, and which encoding applies, because yv has no idea — it's all mechanism with no policy about content inside. The only thing yv represents is location and length in bytes. The meaning of content is app specific. Later methods will support Unicode too, when Wil needs it. But yv still won't know if Unicode codepoints occur inside. Length of yv is always in octets, even when 16-bit or 32-bit character codepoints are inside.
const
Member v_p is not const, and const is cast away when you supply a const pointer as a value. This is intentional as a simplification. Wil experimented with a const ykv object with a const kv_p. But using this class induced much complexity in interfaces, implying baroque type hierarchies. So Wil passed on the idea; thus you must remember when content is const. Wil often uses yv const& to mean content is const, even though technically it only means yv itself is const, not what it references.
printing
The result of printing most þ objects depends on whether content inside or object structure is desired. The latter print — called a dump — reveals object structure, and is requested by quoting a value (see the quote demo) when using operator << writing to an out stream (see the out demo). yv abc("abc"); // run of three octets [1] «
yout << "# three octets and newline:" << yendl; // [2]
yout << abc << yendl; // abc alone on a line [3]
yout << "# dump of abc with crc, etc:" << yendl; // [4]
yout << abc.quote(); // call yv::vdebug() [5]
yout << yendl << ynow; // newline and flush [6]
The usage example above emits output on stdout shown below. Line [3] writes the same thing that would appear if the left hand side was an instance of yfdw from the fd demo instead of the instance of yo (from the out demo) that it is. Line [5] calls vdump() whose code is listed below. # three octets and newline:
abc
# dump of abc with crc, etc:
<yv p=0xee2d n=3 crc='0x352441c2:3'>
00000: 61 62 63 ; abc
</yv>
The interface for line [5] uses the nested yv::Pq class used for quoting (cf quote») from this part of the yv class api: struct yv { // (... continued)
struct Ve { yv const& e_v; Ve(yv const& v): e_v(v) { } }; «
Ve escape() const { return Ve(*this); } // for yvw::wescape() «
struct Vq { yv const& q_v; Vq(yv const& v): q_v(v) { } }; «
Vq quote() const { return Vq(*this); } // to request dump «
void vprint() const; // dump to stdout via yout
void vdump(yo& o) const; void vcite(yo& o) const;
void vshow(yo& o, const char* kind, u32 max, u32 base=0) const;
void vhexmax(yo& o, u32 max, u32 base=0) const; // up to max hex
}; // struct yv
inline yo& operator<<(yo& o, yv::Vq const& x) {
x.q_v.vdump(o); return o; }
inline yo& operator<<(yo& o, yct<yv> const& x) {
x.c_t.vcite(o); return o; }
Most of this is boilerplate common to printing all þ objects: the quote, print, dump, and cite methods are standard, and only nested struct name Vq changes, because it always uppercases the prefix letter (V in this case) and merely wraps a yv instance to bind the desired version of overloaded operator<<(). The last line above uses yct to request cite format is covered more in the quote demo; another example appears in the iovec demo (cf iovec«). In code shown below, hex dumps are done by yv::vhexmax() which appears in the hex demo instead of this demo. Similarly, the code handling crc generation appears in the crc demo instead of here. (Several other demo pages are little more than sub topics of this run demo.) Dumping yv just calls vshow() with specific values, including the tag to use and max number of bytes to hex dump. An upper limit of 16K is arbitrary, and is driven mainly by subjective irritation at long dumps. The code for vshow() below in turn merely puts XML style tags around the hex dump. The crc32 hash aims to simplify rapid comparison of multiple dumps in output. void yv::vshow(yo& o, const char* kind, u32 max, u32 base) const { «
if (v_n) {
yh32 h; h << *this; // yh32::hadd() in crc demo
o.oftn("<%s p=%#lx n=%ld crc='%#lx:%lu'>", kind,
(long) v_p, (long) v_n, (long) h.hcrc(), (long) h.hlen());
this->vhexmax(o, max, base); // see yv::vhexmax()
o.ouend(kind); // out untab end (tag)
}
else
o.of("<%s p=%#lx n=%ld/>", kind, (long) v_p, (long) v_n);
}
The print wrapper around dump methods usually looks exactly like this one, and is intended for use under gdb. And cite methods emit one line only. All the formatting work is done by yo's api (see out) which aims to make typical print usage as terse as possible. (Atypical comments are added here.) void yv::vprint() const { «
yout << yendl; this->vdump(yout); yout << yendl << ynow;
}
void yv::vcite(yo& o) const { «
yh32 h; h << *this; // yh32::hadd() in crc demo
o.of("<yv p=%#lx n=%ld crc='%#lx:%lu'/>",
(long) v_p, (long) v_n, (long) h.hcrc(), (long) h.hlen());
}
hash
Basic vhash() is a trivial wrapper around use of yh32's api for crc32 as shown in the crc demo. But case insensitive hashing involves a couple issues: batching consecutive runs for crc32 when possible, and choice of encoding. (This code uses isupper() and to tolower(), working well with Latin1.) u32 yv::vhash() const { yh32 h; h << *this; return h.hcrc(); } «
u32 yv::vihash() const { // tolower bytes only at need before crc «
// Send contiguous non-uppercase to crc32 as-is from where they sit;
// but copy uppercase to a temp buf for lowercase before crc32.
yh32 hash; // where the crc hash gets accumulated
char temp[ 64 + 1 ]; // buf to copy bytes lowercased on the fly
const u8* p = v_p; const u8* x = p + v_n;
while (p < x) { // at least one more byte to hash?
const u8* p0 = p; // start of successive bytes done this time
int c = *p++; // first byte in remaining bytes to process
if (isupper(c)) { // find longest consecutive uppercase run?
char* s = temp; char* sx = temp + 64;
*s++ = tolower(c); // copy to temp buf as lowercase
while (p < x && s < sx && isupper(*p)) { // 1 more upper fits?
*s++ = tolower(*p); ++p; // copy as lower then advance beyond
}
// assert: p - p0 is number of lowercased bytes put into temp
yv lowerCopy(temp, p - p0); // contiguous lower copy in temp
hash << lowerCopy; // add to crc via yh32::hadd()
}
else { // find longest NON-uppercase sequence in original vec
while (p < x && !isupper(*p)) // another non-upper?
++p; // just advance beyond each non-upper byte
// assert: p points at x or p points at 1st upper byte before x
yv lower(p0, p - p0); // contiguous original non-upper bytes
hash << lower; // add to crc via yh32::hadd()
}
}
return hash.hcrc();
}
Logic to batch octets for conversion resembles code in the escape demo for delayed encoding. An earlier example appeared in the fd demo (cf «). When Wil needs a generalized partial mapping hash, he can pass in a map parameter showing which octets need to change. (And of course to hash strings in other codepoints, an efficient codepoint iter is needed for an encoding.)
queries
Many yv methods answer questions about content described without altering it. Such methods — returning boolean true or false — are called predicates by convention and answer such questions as (e.g. vall()) Other queries count things in a run, like all appearances of an octet value: unsigned yv::vcount(int c) const { // count number of c's in run «
unsigned sum = 0;
register const u8* p = v_p;
const u8* end = p + v_n; // one beyond last byte in the run
--p; // prepare for preincrement
while ( ++p < end ) {
if ( *p == c )
++sum;
}
return sum;
}
Here's the api for several boolean predicates: struct yv { // (... continued)
operator bool() const { return v_n && v_p; } // is empty «
bool vempty() const { return 0==v_n || 0==v_p; } // is empty «
bool visdigit() const;
bool vall(yum const& map) const; // true only if all true for map[]
bool vall(yutm const& map) const; // true only if all true for map[]
};
Of the last three visdigit() is easiest to explain, so it's shown first; it returns true if no octet returned false from isdigit(). Wil sometimes uses it when deciding whether a number seems intended. Note true is returned when a run is empty, so that must be checked separately if it matters. bool yv::visdigit() const { // true if all bytes are isdigit() «
if (v_p && v_n) {
register u8* p = v_p;
u8* end = p + v_n; // one beyond last byte in the run
--p; // prepare for preincrement
while ( ++p < end ) { // another byte to examine?
if ( !isdigit(*p) )
return false;
}
}
return true;
}
Both variants of vall() return true as long as no byte in the run returns false from the map[] predicate passed as an argument. Both yum and yutm are maps where u is short for u8 (ie an octet); yum is an octet map and yutm is an octet type map. Both are explained and shown in the ctype demo, so coverage here is minimal. In effect, either type can be tested to see an octet is mapped, so map[c] here is similar to use of isdigit(c) above. You can make arbitrary maps with yum because it has a value for all 256 possible octet inputs; yutm represents ctype.h semantics. bool yv::vall(yum const& map) const { // false ONLY on !map[c] «
if (v_p && v_n) {
register u8* p = v_p;
u8* end = p + v_n; // one beyond last byte in the run
--p; // prepare for preincrement
while ( ++p < end ) { // another byte to examine?
if ( !map[*p] ) // not mappped? (see ctype)
return false;
}
}
return true;
}
bool yv::vall(yutm const& map) const { // false ONLY on !map[c]
if (v_p && v_n) {
register u8* p = v_p;
u8* end = p + v_n; // one beyond last byte in the run
--p; // prepare for preincrement
while ( ++p < end ) { // another byte to examine?
if ( !map[*p] ) // not mapped by ctype predicate?
return false;
}
}
return true;
}
See the ctype demo for yum and yutm apis. Other queries return different values, but it's hard to generalize about queries as a class. However, many examine octets one at a time — from the left or right side — until a specific condition is found, such as the first appearance of an octet value, or class of octet values. The next two methods act like std C library methods index() and rindex() (see man pages) without end nuls.
std lib
Wil wrote several yv methods similar to standard C library calls, but using explicit v_n length in yv instead of terminating nuls in C strings. u8* yv::vindex(register int c) const { // 1st c in this, leftward «
register u8* p = v_p;
u8* end = p + v_n; // one beyond last byte in the run
--p; // prepare for preincrement (start before first byte)
while ( ++p < end ) { // until we pass beyond last byte
if ( *p == c )
return p; // 1st c in this (from left), else nil
}
return (u8*) 0;
}
u8* yv::vrindex(register int c) const { // reverse vindex() «
u8* p = v_p;
register u8* x = p + v_n; // one beyond last byte in the run
while ( --x >= p ) { // until we pass before first byte
if ( *x == c )
return x; // 1st c in this from RIGHT, else nil
}
return (u8*) 0;
}
Both vspn() and vcspn() below use types yum and yutm mentioned in the last section on vall() (cf «). These are more efficient ways to express charsets than nul terminated C strings — each is a map in vector representation. Method vspn() has the same meaning as strspn() in ISO C90 (see a man page for strspn) but avoids use of nul termination. In short, vspn() returns a count of octets at the start of yv true for acceptMap (ie acceptMap[c] is nonzero). You could build vall() shown earlier with this since:
u32 yv::vspn(yum const& acceptMap) const { // like std ::strspn() «
/// vspn() returns count of leading bytes appearing in acceptMap[]
u8* p = v_p; // first byte
u8* end = p+v_n; // one beyond last byte
while (p < end && acceptMap[*p]) // another leading match?
++p; // skip
return p - v_p; // p ONLY moved on a match above
}
u32 yv::vspn(yutm const& acceptMap) const { // like std ::strspn()
/// vspn() returns count of leading bytes appearing in acceptMap[]
u8* p = v_p; // first byte
u8* end = p+v_n; // one beyond last byte
while (p < end && acceptMap[*p]) // another leading match?
++p; // skip
return p - v_p; // p ONLY moved on a match above
}
Method vcspn() spans the complement of the charset expressed by rejectMap, and has the same meaning as strcspn() in ISO C90 (see a man page for strcspn) but avoids use of nul termination. In short, vcspn() returns a count of octets at the start of yv which are false for predicate rejectMap (ie rejectMap[c] is zero). If you find all these double negatives confusing, you might avoid using vcspn(). But if you already use strcspn() in C's std library, this is a direct replacement if you convert your charset to a yum or yutm map following recipes in the ctype demo. u32 yv::vcspn(yum const& rejectMap) const { // like std ::strcspn() «
/// vcspn() (c=complement): count of leading bytes NOT in map[]
u8* p = v_p; // first byte
u8* end = p+v_n; // one beyond last byte
while (p < end && !rejectMap[*p]) // another leading match?
++p; // skip
return p - v_p; // p ONLY moved on a match above
}
u32 yv::vcspn(yutm const& rejectMap) const { // like std ::strcspn()
/// vcspn() (c=complement): count of leading bytes NOT in map[]
u8* p = v_p; // first byte
u8* end = p+v_n; // one beyond last byte
while (p < end && !rejectMap[*p]) // another leading match?
++p; // skip
return p - v_p; // p ONLY moved on a match above
}
The advantage of having both vspn() and vcspn() is that you can use one yum or yutm instance to count either matching or non-matching octets without needing to generate the complement of your one map. (These two methods almost appeared in the ctype demo only, along with the yum or yutm map types; example usage is shown there. But other methods here are sufficiently similar to warrant listing them all together.) As you can see from methods in this section, code using yum looks exactly the same as code using yutm when only operator[] is called. For this reason, only methods for yum need be listed; the version for yum looks the same. The just completed ctype demo shows new yv method vpick() (cf vpick») resembling vspn() above, but selecting octets matching a yutm predicate and writing them to a yo out stream.
trim
Wil must sometimes make a new yv instance with leading and trailing octets removed that appear in a map (satisfying the predicate it represents). The next vtrim() method shows how this is done for a yum map. As explained in the last paragraph above, the version for yutm looks exactly the same. (Just remember vtrim() is overloaded for yum and ytum.) // vtrim() removes both leading and trailing bytes true for map[]
u32 yv::vtrim(yum const& map) { // return count of trimmed bytes «
u32 outCount = 0;
register u8* p = v_p;
register u8* end = p + v_n;
if ( p && p < end ) { // vec not nil and not empty?
--p; // prepare for preincrement
while ( ++p < end && map[*p] ) {
++outCount; // skipping another leading map byte
}
if ( p < end ) { // any bytes remaining in the vec?
while ( --end > p && map[*end] ) {
++outCount; // skipping another trailing map byte
}
++end; // we went one too far
}
v_n = end - p;
v_p = p;
}
return outCount;
}
The yutm version is identical except input type is yutm. Note only the yv instance is modified — the content bytes are immutable.
copy
Here we show rare yv methods modifying bytes described by a destination yv — vcopy() writes content of one yv (the src) into another (this) provided room is present. This is most likely true after space has just been allocated with the desired size. Writing to iovecs in the general case is often better done using yb in the buf demo. The simplest vcopy() is this one: u32 yv::vcopy(yv const& src) { // v_n = min(v_n, src.v_n) then copy «
if (v_n > src.v_n)
v_n = src.v_n;
if (v_p && src.v_p && v_n)
::memcpy(v_p, src.v_p, v_n);
return v_n;
}
Instead of copying all bytes, you might want to pick and choose, copying only octets accepted by a map parameter. Here's a version of vcopy() which only copies a source octet c when acceptMap[c] is nonzero: // vcopy copies only src bytes that fit, and true for acceptMap
u32 yv::vcopy(const yv& src, yum const& acceptMap) { // skip misses «
u8* s = src.v_p; // first byte
u8* sx = s+src.v_n; // one beyond last byte
u8* p = v_p; // first byte
u8* x = p+v_n; // one beyond last byte
while (p < x && s < sx) { // another octet?
int c = *s++;
if (acceptMap[c])
*p++ = (u8) c;
}
v_n = p - v_p;
return v_n;
}
The yutm version of vcopy() is identical except input type is yutm. Content in destination this is not immutable because it's updated to make a copy of the source, to get a deep copy of the run. (A shallow copy of yv by copy construction is trivial; you must grasp why vcopy() is different.)
mutation
Other than vcopy() above, Wil has few yv methods that alter memory inside content described by yv. This section shows a few more miscellaneous methods treating the content as mutable. int yv::vread(yi& src) {
// neg when src.iread() returns neg on error «
return (v_p && v_n)? src.iread(v_p, v_n): 0;
}
The in demo shows in stream apis include this yi::iread() method that resembles a ::read() system call. void yv::vreverse() { // invert order of all bytes «
register u8* p = v_p;
if ( p ) { // not nil and not empty?
register u8 c; // temp for swaps
register u8* x = p + v_n; // one past last used byte
for (/*prep preincr*/--p ; ++p < --x; ) { // 2 more to swap?
c = *p; // save earlier byte
*p = *x; // move later byte forward
*x = c; // finish swap at end
}
}
}
Reversing a string in place is a classic interview question Wil has answered countless times. Wil likes to counter with the question about how you exchange two adjacent subsets while maintaining order in each subset. (Yes, just reverse each subset separately and then everything together.)
slices
Slices are treated more fully in the slice demo, and the iovec demo has a long section on slicing scatter-gather lists using iovec vectors, so this section is short. In constrast with most other objects when sliced, yv eagerly returns a new yv instance instead of delaying evaluation as long as possible by joining the original object with a slice subclass. (Wil fears over-subtle code would result from delayed yv slices using a hypothetical yvz subclass of yz.) struct yv { // (... continued) // slice inlines:
// vz: negative p is relative to v_p+v_n; neg n is relative to v_n;
yv vz(zp32 p, zn32 n) const; // slice this vector «
yv vz(yz const& slice) { return this->vz(slice.z_p, slice.z_n); }
yv operator()(zp32 p, zn32 n) const { return this->vz(p, n); }
}; // struct yv
Instead of using yz::zabsolute() to resolve relative slices with negative values, which would force slices to resolve only inside yv, Wil currently allows both positions and lengths to be negative after adding negative values to v_n because he can still resolve the result by intersection with the original yv (see the demos root for vintersect() code). Wil interprets negative length to mean a run's start is the "end" at an address before v_p. It's unclear whether this is useful. yv yv::vz(zp32 p, zn32 n) const { // slice this vector «
u8* pz = (p < 0)? ((v_p+v_n)+p) : (v_p + p);
i32 nz = (n < 0)? ((i32) v_n + n) : n;
// nz can still be neg if abs(n)>abs(v_n); if so, len is "negative"
yv slice(pz, (u32) nz); // assume nz is positive
if (nz < 0) // neg length makes "end" of slice come before start?
slice.vinit(pz + nz, (u32) -nz); // make earlier "end" the start
return this->vintersect(slice); // return overlap of this and slice
}
This sample illustrates positive and negative slice values: static const char* lo = "abcdefghijklmnopqrstuvwxyz";
yv vlo(lo); // yv::yv(const char* cstr); «
yout << "# vlo(5, 10): ten at pos=5" << yendl;
yout << vlo(5, 10) << yendl;
yout << "# vlo(3, -20): six (26 - 20 = 6) at pos=3" << yendl;
yout << vlo(3, -20) << yendl;
yout << "# vlo(-8, 7): seven of last eight" << yendl;
yout << vlo(-8, 7) << yendl;
yout << "# vlo(-8, 1024): 1024 of last eight" << yendl;
yout << vlo(-8, 1024) << yendl; // saved by vintersect()
yout << ynow; // flush
That writes the following on stdout. Note how use of vintersect() in the last example limits a slice to a subset of the original yv. # vlo(5, 10): ten at pos=5
fghijklmno
# vlo(3, -20): six (26 - 20 = 6) at pos=3
defghi
# vlo(-8, 7): seven of last eight
stuvwxy
# vlo(-8, 1024): 1024 of last eight
stuvwxyz
|
demos « Þ
+ todo + names + fd + iovec + assert + log + run « Þ + hex + crc + buf + in + out + quote + escape + compare + file + deck + cow + arc + blob + tree + slice + rand + time + stat + hash + heap + node + primes + page + book + pile + stack + atomic + lock + mutex + thread + map + meter + list + iter + ctype |