demos are explained here; a menu at top column right indexes actual topic demos. Here we demo escape.
problem
When content is written, it sometimes needs a transformation that makes its syntax acceptable to downstream consumers. Simply stuffing in raw bytes can be the wrong thing. This demo looks at one example of content transformation: escaping strings in a standard way, while delaying costs until necessary. The yvw api shown next is used implicitly when you escape a yv run instance instead of quoting one, like this:
yout << yescape(v); // calls yvw::wescape()
Expression yescape(v) creates an instance of yesc<T> using inline yescape(), both shown top column right (cf »). Unless you want to read, alter, or clone the yvw::wescape() code that actually backslash-escapes octets when a sequence of them is written to yo, you needn't be aware of class yvw, aka (u8) vector writer, nor of the singleton instance g_1vw that maps octets needing a backslash escape encoding when written.
class yvw { // yv writer (print u8 vector as string literal) «
private: // clone of yb2ow code escaping string literals
u8 w_hex[256]; // u8s needing hex when writing strings
public:
~yvw() { } «
yvw(); // init map of bytes needing backslash escape
static yvw g_1vw; // singleton global instance of yvw
/// \brief escape ctl chars and non-ascii bytes (> 0x7F)
void wescape(yo& o, yv const& v) const; // escape to out stream
enum We_control { // special cases «
// http://en.wikipedia.org/wiki/C0_and_C1_control_codes
we_bs = 0x8, // "\b" backspace ^H
we_ht = 0x9, // "\t" horizontal tab ^I
we_lf = 0xA, // "\n" linefeed (aka newline) ^J
we_vt = 0xB, // "\v" vertical tab ^K
we_ff = 0xC, // "\f" form feed ^L
we_cr = 0xD, // "\r" carriage return ^M
};
}; // class yvw
inline yo& operator<<(yo& o, yesc<yv> const& x) { «
yvw::g_1vw.wescape(o, x.e_t); return o; }
inline yo& operator<<(yo& o, yv::Ve const& x) { «
yvw::g_1vw.wescape(o, x.e_v); return o; }
The We_control enum appears only because the w_hex map initialization below wants to set those values specially; these names are consistent with names used later. Code for yvw::wescape() below resembles yfdw::wescape() in the fd demo (cf «) because the latter was derived from this code, and was merely shown much earlier than this original version.
yvw yvw::g_1vw; // global singleton instance «
yvw::yvw() { // yv writer (print u8 vec as string literal) «
register u8* map = w_hex;
::memset(map, 0, 128); // not all ascii needs to be escaped
::memset(map + 128, 'x', 128); // but all non-ascii needs hex
::memset(map, 'x', 0x20); // ... but control chars need hex
map[ 0x7F ] = 'x'; // including e_del
// need specific letter code escapes instead of hex (non 'x')
map[we_bs] = 'b'; // backspace 0x8
map[we_cr] = 'r'; // '\r'; // 0xD CR carriage return
map[we_lf] = 'n'; // '\n'; // 0xA newline
map[we_ht] = 't'; // '\t'; // 0x9 HT horizontal tab
map[we_ff] = 'f'; // '\f'; // 0xC FF (page)
map[we_vt] = 'v'; // 0xB VT vertical tab
map['\\'] = '\\'; // \ maps to self (after 1st backslash)
map['"'] = '"'; // quote maps to self (after 1st backslash)
}
The format of w_hex inside yvw is quite direct: any octet c used as an index must be escaped with a leading \ backslash if the value w_hex[c] is nonzero, and the actual value says what form the rest of the escape must take. In the normal case, 'x' as the value means hex follows the backslash. Any value other than 'x' means the value itself is what follows the leading backslash.
void yvw::wescape(yo& o, yv const& s) const { «
// backslash escape
register const u8* escape = w_hex; // map u8s needing escape
const u8* p = s.v_p;
if (p && s.v_n) { // nonempty?
const u8* start = p; // last byte not yet written
const u8* end = p + s.v_n; // one beyond last u8 to write
for (/*prep preincr*/ --p; ++p < end; ) {
register int c = *p;
if (escape[c]) { // need to escape this byte?
if (p > start) { // need to write earlier bytes 1st?
yv before(start, p - start); // u8s before p
o << before;
}
o.o1c('\\'); // escape always starts with backslash
if ('x' == escape[c]) // write as hex?
o.of("x%02x", (int) c); // 'x' then 2 hex digits
else // write as the value stored in escape[c]
o.o1c(escape[c]);
start = p + 1; // next u8 to write is after current p
} // if (escape[c])
} // for
if (p > start) { // at least final u8 was not escaped?
yv last(start, p - start); // trailing u8s before end
o << last; // yo::operator<<(yv const&)
}
} // if (p && s.v_n)
}
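The table-driven loop above can be sketched using only the standard library. This is a minimal illustration, not the mu code itself: EscapeMap and escape_bytes are hypothetical names standing in for yvw and yvw::wescape(), and std::string stands in for the yo output stream.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for yvw's w_hex map: 0 means "copy as-is",
// 'x' means "write \xNN", any other value is the char after the backslash.
struct EscapeMap {
    std::array<unsigned char, 256> code{};
    EscapeMap() {
        for (int c = 0; c < 0x20; ++c) code[c] = 'x';   // control chars
        for (int c = 0x7F; c < 256; ++c) code[c] = 'x'; // DEL and non-ascii
        code['\b'] = 'b'; code['\t'] = 't'; code['\n'] = 'n';
        code['\v'] = 'v'; code['\f'] = 'f'; code['\r'] = 'r';
        code['\\'] = '\\'; code['"'] = '"';
    }
};

// Append the escaped form of [p, p+n) to out, copying unescaped runs in bulk.
inline void escape_bytes(const EscapeMap& m, const unsigned char* p,
                         std::size_t n, std::string& out) {
    if (!p || !n) return;                 // nothing to write
    const unsigned char* start = p;       // first byte not yet written
    const unsigned char* end = p + n;     // one beyond the last byte
    for (; p < end; ++p) {
        unsigned char how = m.code[*p];
        if (!how) continue;               // plain byte: keep extending the run
        out.append(reinterpret_cast<const char*>(start), p - start);
        out.push_back('\\');              // every escape starts with backslash
        if (how == 'x') {                 // hex form: 'x' then two hex digits
            char hex[4];
            std::snprintf(hex, sizeof hex, "x%02x", static_cast<unsigned>(*p));
            out.append(hex);
        } else {
            out.push_back(static_cast<char>(how));
        }
        start = p + 1;                    // next run begins after this byte
    }
    out.append(reinterpret_cast<const char*>(start), end - start);
}
```

As in the original, unescaped runs are flushed in one append rather than byte by byte, so input that needs no escaping costs a single copy.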
The format written by yvw::wescape() happens to suit the same style of escapes used in literal C strings, which Wil plans to support in many places in the toy languages he writes, whether the overall syntax looks more like Lisp, or Smalltalk, or whatever. Wil's later Unicode extensions should be obvious once codepoint scanners are defined. (Here we scan by pointer bumping.)
A submenu for demos appears below, letting you
go to the page on a topic written as a demo (as the
demos page defines it).
menu
thorn: todo, names, fd, iovec, assert, log, run, hex, crc, buf, in, out, quote, escape « Þ, compare, file, deck, cow, arc, blob, tree, slice, rand, time, stat, hash, heap, node, primes, page, book, pile, stack, atomic, lock, mutex, thread, map, meter, list, iter, ctype (mu: toy, peg, imm, tag, box, symbol, token, number, bigint, class, method, reader, writer, eval, env, vm, gc, world, pcode, compiler, asm, lathe, lisp, smalltalk, design, weight, jar, card, harp, debug, profile) Some demos are stubs: todo is a demo guide. See toy for mu updates on language pages; names introduces naming schemes.
yescape
The inline method for yescape() was introduced in the quote demo (cf «), where wrapper types like yesc<T> are explained in more detail. Here we merely observe that any object of type X can be wrapped in an instance of yesc<X> just by calling yescape() below.
template <typename T> struct yesc { // mu.h: yesc escape template «
T const& e_t; // a T wrapper requesting escape rather than dump
yesc(T const& t) : e_t(t) { } // just capture pointer value &t
};
template <typename T> yesc<T> // Stu was here
yescape(T const& t) { return yesc<T>(t); } // escape wrapper «
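The same wrapper idea can be tried against std::ostream. The sketch below is an assumption-laden analog, not part of mu: Escaped and escaped() are hypothetical names mirroring yesc<T> and yescape(), and the escape rules are deliberately cut down to a few cases.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical analog of yesc<T>: holds a reference and does no work yet.
template <typename T> struct Escaped {
    T const& ref;
    explicit Escaped(T const& t) : ref(t) {}
};

// Analog of yescape(): deduces T so callers never spell out the wrapper type.
template <typename T> Escaped<T> escaped(T const& t) { return Escaped<T>(t); }

// The cost of escaping is paid only here, when the wrapper is streamed.
inline std::ostream& operator<<(std::ostream& o, Escaped<std::string> const& x) {
    for (char ch : x.ref) {
        if (ch == '\\' || ch == '"') o << '\\' << ch;  // maps to self
        else if (ch == '\n') o << "\\n";
        else if (ch == '\t') o << "\\t";
        else o << ch;                                   // plain char
    }
    return o;
}
```

Because the wrapper stores only a reference, it should be streamed in the same expression that creates it, as in `os << escaped(s)`; keeping one around after the wrapped object is gone would leave a dangling reference.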
The yescape(vbuf) expression in the sample code in the next section returns an instance of yesc<yv>, permitting operator<<() to be overloaded with yesc<yv> on the right hand side. The original version of this code was defined entirely in the yv api as shown below, using nested class yv::Ve and method yv::escape() for the same purpose as yesc<T> and yescape() shown above, which are generic to all types instead of being specific to yv alone.
struct yv { // sequence of octets «
struct Ve { yv const& e_v; Ve(yv const& v): e_v(v) { } }; «
Ve escape() const { return Ve(*this); } // to request escape «
// ...
}; // struct yv
inline yo& operator<<(yo& o, yv::Ve const& x) {
x.e_v.vescape(o); return o; }
This old yv api seems best retired to favor generic yescape().
sample
This example of actual usage shows what gets printed when non-ascii and control chars appear in content, as well as typical backslash escaped values like tab and carriage return. yv lower("abcdefghijklmnopqrstuvwxyz"); «
char buf[ 26 ]; // not enough room for end nul
memcpy(buf, lower.v_p, lower.v_n); // mutable copy
buf[1] = '\\'; // replace 'b' with '\\'
buf[5] = 0xff; // replace 'f' with '\xff'
buf[6] = 0x08; // replace 'g' with '\b'
buf[17] = yvw::we_cr; // replace 'r' with '\r'
buf[19] = '\t'; // replace 't' with '\t'
buf[20] = 0; // replace 'u' with '\x00'
yv vbuf(buf, 26);
yout << "# vbuf.quote():" << yendl << vbuf.quote() << yendl;
yout << "# yescape(vbuf):" << yendl << yescape(vbuf) << yendl;
yout << ynow;
The following output appears on stdout; the last line, written by yescape(vbuf), is the real result of interest since it's output from yvw::wescape() above, which backslash-escapes any octet written that yvw says needs an escape.
# vbuf.quote():
<yv p=0xbffffad6 n=26 crc='0x8a606218:26'>
00000: 61 5c 63 64 65 ff 08 68 69 6a 6b 6c ; a\cde..hijkl
0000c: 6d 6e 6f 70 71 0d 73 09 00 76 77 78 ; mnopq.s..vwx
00018: 79 7a ; yz
</yv>
# yescape(vbuf):
a\\cde\xff\bhijklmnopq\rs\t\x00vwxyz
Note the decision whether to add surrounding quotes is up to the caller. Since a caller might as easily want single as double quotes, yvw::wescape() really should escape both for flexibility.
variation
You shouldn't get the idea this is the only acceptable way to escape string literals when printed. Instead you should adapt this technique as many times as you need for each different escape context. Maybe you'll find a use for a piece of code like this:
template <typename T> struct yesc2 { // yesc2 escape template «
T const& e_t; // a T wrapper requesting escape rather than dump
yvw const& e_w; // a specific wrapper map, not g_1vw singleton
yesc2(T const& t, yvw const& w) : e_t(t), e_w(w) { }
};
template <typename T> yesc2<T> // capture both T and yvw
yescape2(T const& t, yvw const& w) { return yesc2<T>(t, w); } // «
inline yo& operator<<(yo& o, yesc2<yv> const& x) { «
x.e_w.wescape(o, x.e_t); return o; }
Though a kludge, this might suggest whatever you need.
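One way to picture a per-context map is a wrapper that carries its rules along, so the same text can be escaped for double-quoted or single-quoted output. This is only a sketch under assumptions: Rules, EscapedWith, and escaped_with() are hypothetical names, and the rules cover just backslash, newline, and one quote character.

```cpp
#include <array>
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical per-context rules: a nonzero entry is the char after backslash.
struct Rules {
    std::array<char, 256> after{};
    explicit Rules(char quote) {
        after[static_cast<unsigned char>('\\')] = '\\';
        after[static_cast<unsigned char>(quote)] = quote; // only this quote
        after[static_cast<unsigned char>('\n')] = 'n';
    }
};

// Analog of yesc2<T>: captures both the value and the rules to apply.
struct EscapedWith {
    std::string const& text;
    Rules const& rules;
};

inline EscapedWith escaped_with(std::string const& t, Rules const& r) {
    return EscapedWith{t, r};
}

inline std::ostream& operator<<(std::ostream& o, EscapedWith const& x) {
    for (unsigned char c : x.text) {
        if (char a = x.rules.after[c]) o << '\\' << a; // escaped form
        else o << static_cast<char>(c);                // plain byte
    }
    return o;
}
```

With `Rules dq('"')` and `Rules sq('\'')`, the caller picks the quoting context at the point of output; both the text and the rules must outlive the wrapper, since it holds references to each.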
license
All this code is available only under the BriarPig mu-babel license described fully on the rights page. You do not have permission to reprint this page in any way. No feeds or repackaging allowed. You can link this page if you want folks to read it.