Þ   briarpig  » mu  » imm


problem

     imm — an immediate is a scalar value fitting in a pointer-sized slot. In particular, immediates below fit into a peg: the canonical slot format in box structs stored in gc space. Immediate schemes here are specific to toy programming language lathe: a lisp dialect plus smalltalk class system, using þ library under mu, using the mu-babel license. (Updates? See toy.)

     For context, see peg on pointers that contain immediate values, and see box for struct values used when a peg does not contain an immediate. Much of the api shown on this page is actually the yp peg api.

bits «

     The following figure shows 32 bits inside yp, with low (least significant) bits on the right. The lowest two are ilk bits — ilk is a synonym for kind. The uses of all other bitfields depend on the two ilk bits. If zero, the peg is actually a pointer to a box: there's no immediate at all.

     Any nonzero value in ilk bits means the peg contains an immediate. Values 1, 2, and 3 mean different things.

/* 3 2 1 « 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | value |--ply--:cue:ilk| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | | | | | | ply bits ----------------------o-o-o-o | | | | | | | | cue bits ------------------------------o-o | | | | ilk bits ----------------------------------o-o **/

     The following parts of the yp api show what values mean in each of the ilk, cue, and ply fields.

struct yp { // (4) bytes of value in one slot of a box of pegs «

typedef ptrdiff_t p_t; // ptr type converts to int for masks « typedef i32 lid_t; // sizeof(lid_t) must be >= sizeof(void*) «

union { // overlay low order tag bits with high order values ybox* p_box; // the pointer when a peg holds a pointer « lid_t p_lid; // all (32) bits as single int « } u; // union «

     When p_lid contains nonzero in the low two bits, the value is one specified in Pilk below. (Note the leading P each time just means the type is nested inside the peg api.)

enum Pilk { i_box = 0, i_int = 1, i_nil = 2, i_cue = 3 }; «

     As already mentioned, zero (0) ilk means box pointer. One (1) is i_int, meaning the high 30 bits are a signed integer, shifted over two bits. Two (2) is i_nil, meaning nil or () at the language level, with the high 30 bits unused.

     But three (3) means i_cue, which says the next two bits contain a cue value taken from the Pcue enum below — it's essentially an escape value expanding the two bit ilk field into more bits for designating formats using less than 30 bits.

enum Pcue { c_ply = 0, c_one = 1, c_err = 2, c_chr = 3 }; «

     The two bit cue field is only used when ilk is equal to i_cue. The meaning of other bits depends on what value appears in the cue field. For example, a value of three (3) means c_chr, which uses the high 24 bits to hold a 21-bit Unicode codepoint — actually ascii when all but the low seven bits are zero. Charactors stored in c_chr also use the four ply bits to contain MIT-style "bucky" bits as modifiers.

     A cue of one (1) is not yet used — it's reserved for future use (immediates have more bit capacity than Wil can use).

     A cue of two (2) is an error code stored in the high 24 bits; however, Wil hasn't yet defined a suite of error codes.

     Finally, a cue of zero (0) means c_ply, which is yet another escape code — this time using the four bit ply field to denote immediate format. The Pply enum below itemizes:

enum Pply { // cue==c_ply==0 « p_false = 0, p_true, p_void, p_none, // 0..3 p_one, p_two, p_three, p_nobox, // 4..7 unused p_next, p_rest, p_key, p_val, // 8..b p_eof, p_setter, p_getter, p_token // c..f };

     Note second row (p_one, p_two, p_three, and p_nobox) is unused: more excess capacity reserved for future use.

     Several ply values denote unique constants: p_false, p_true, and p_eof denote #false, #true, and #eof.

     Several other ply values denote halfbaked ideas: for example, p_void and p_none denote #void and #none, which aren't used in either Scheme or Smalltalk, but give Wil more magic values to use internally.

     And yet another class of ply values denote typed integers of 24-bits in size: p_setter, p_getter, and p_token denote setter and getter indexes, and token variables, respectively. Setter and getter immediates basically specify an index for a box slot — these immediates can be used as object methods.

     The other ply values not yet mentioned? Wil's thinking about them, but there's nothing substantive to say yet.

enum Psole { // special "sole" constants put into p_lid « // p_lid constants (shifted by 4 to reach ply bits) s_nil = i_nil, s_false = ((p_false << 4) | i_cue), // c_ply == 0 s_true = ((p_true << 4) | i_cue), // c_ply == 0 s_void = ((p_void << 4) | i_cue), // c_ply == 0 s_none = ((p_none << 4) | i_cue), // c_ply == 0 s_next = ((p_next << 4) | i_cue), // c_ply == 0 s_rest = ((p_rest << 4) | i_cue), // c_ply == 0 s_key = ((p_key << 4) | i_cue), // c_ply == 0 s_val = ((p_val << 4) | i_cue), // c_ply == 0 s_eof = ((p_eof << 4) | i_cue), // c_ply == 0 };

     The Psole enum above constructs the actual immediate constant values stored in pegs. So for example, the value #t for true is encoded as s_true with the right values in both the cue and ply bitfields. The name sole here implies only one known constant uses each combination of those bitfields.

enum Pquad { // least significant 4-bits of p_lid « q_c21 = ((c_chr << 2) | i_cue), q_err = ((c_err << 2) | i_cue) };

     The Pquad enum above constructs the four low bits used by character and error immediates. These constants are used by code actually building character and error immediates.

enum Perr { // cue==c_err==2 (error immediates) « r_cease = 0, // eof encountered unexpectedly at line r_delim, // wrong delimitor at line (paren vs bracket) r_overdot, // too many expressions after dot in list r_underdot, // missing expression after dot in list r_multidot, // multiple dots in same list r_badref, // #n= or #n# refs nothing when n is unknown };

     The Perr enum above specifies a small number of errors currently found by the Lisp parser. (Errors involving tokens are separate, so these errors involve combinations of tokens and not tokens themselves.) As noted by a section in the right column, current handling of error immediate encoding is quite odd, as if Wil never gave it much attention yet.

     The code comments above say what each error means in reasonably clear terms — but let's expand a bit more here: r_cease means end-of-file was encountered in a strange place; r_delim means delimitors don't balance because parentheses and brackets are mixed; r_overdot means more than one expression followed the dot in a list; r_underdot means no expression followed the dot in a list; r_multidot means more than one dot appeared in a list; and r_badref means a reference of form #n= or #n# was not defined.

enum Poct { // least significant 8-bits of p_lid « // no need to include (c_ply << 2) when c_ply is zero: o_nobox = ((p_nobox << 4) | i_cue), // | (c_ply << 2) o_setter = ((p_setter << 4) | i_cue), // | (c_ply << 2) o_getter = ((p_getter << 4) | i_cue), // | (c_ply << 2) o_token = ((p_token << 4) | i_cue), // | (c_ply << 2)

// must use (c_err << 2) with i_cue: c_err is NOT zero: o_cease = ((r_cease << 4) | q_err), o_delim = ((r_delim << 4) | q_err), o_overdot = ((r_overdot << 4) | q_err), o_underdot = ((r_underdot << 4) | q_err), o_multidot = ((r_multidot << 4) | q_err), o_badref = ((r_badref << 4) | q_err), };

     The Poct enum defines constants corresponding to the low eight bits of an immedate value, taken all together.

     The first of four values are for ply immediates that use the high 24 bits for an integer value; the low eight bits is o_nobox, o_getter, o_setter, or o_token.

     The next six oct constants define actual error immediates, combining err values with the low for bits equal to q_err.

     All the oct constants shown are used by inline methods shown column right to build actual immediate values by adding the missing integer put into the higher bits.

turtles all the way down «

     "Hey Wil," Zé said. "Elsewhere you said everything in Lathe was an object: turtles all the way down."

     "That's right," Wil confirmed. "Every value is an object, and every object has a class — and every object can dispatch polymorphic duck-typed messages."

     "Every value has a class?" Stu repeated. "Even immediates? How do you make immediates into objects? Where is the class for each immediate type?"

     "I told Stu you'd be able to explain this," Zé apologized. "I'd like to hear the answer myself."

     "Yes, well, every primitive object type will have an associated yclassb class box," Wil assured. "So every value, including immediates, will be a smalltalk style object."

     "Where is it?" Stu asked. "Where do I find the class for an immediate object?"

     "That will be somewhere in the vm after the class system is bootstrapped," Wil claimed. "I'll make it as fast to find as I can, so every (x class) expression yields the class."

     "You'll probably add a method like peg2class() to the vm api," Zé suggested. "And a box2class() method."

     "Yeah," Wil agreed. "It'll just be part of the runtime contract. But you'll need a virtual machine instance."

     "Hmmm," Stu pulled his lip. "What about monkey-patching? What effect will that have?"

     Wil glanced at Zé then back at Stu. "That's a slight change in topic," Wil noted. "What did you have in mind?"

     "I was thinking about folks screwing with the classes for standard builtin objects," Stu said. "It makes me worried. It seemed slightly related."

     "Okay," Wil spread his hands. "I was thinking each virtual machine might get a copy-on-write reference to standard builtin classes — maybe edited if sandboxing is in effect."

     "So if any VM edits standard classes," Zé anticipated, "it would only affect the local copy? Like that?"

     "Something along those lines," Wil nodded, "so you'd only be able to screw up the world for yourself."

     Stu pointed a finger: "That sounds like a later feature," Stu said. "Not a day one, first release feature, right?"

     "I dunno," Wil admitted. "You're probably right. But it's square in my path. It fits the whole idea of having module systems with facile symbol scope control."

     "Did you ever read a famous 1958 short story by Alfred Bester named The Men Who Murdered Mohammed?" Zé asked. "It was one of my favorites."

     "I don't know why we're talking about science fiction," Wil sounded annoyed. "If it's one of your favorites, it must involve time travel. Am I right?"

     "Guy catches his wife cheating on him," Zé said, "So he builds a time machine and goes back in time to whack her parents, so she'd never be born."

     "Ah, the indirect approach," Wil smiled. "I take it he was a physics professor or something?"

     "Did I forget to mention that?" Zé agonized. "Anyway, it doesn't work: his wife is still there afterward. So he goes and whacks her grandparents too."

     "This has a creepy tone to it," Wil objected.

     "Nothing works," Zé continued. "So he keeps traveling in time, whacking famous people to see if that makes any difference. When he comes back, people have trouble seeing him."

     "Uh, oh," Wil saw where this was going. "I take it he was whacking his own sandbox in time?"

     Zé nodded. "Basically, yeah," he said. "After that he can only communicate with other time travelers who also whacked their own time continuums the same way."

     "If I say I get the analogy," Wil pleaded, "will you stop telling me about 50 year old science fiction?"

menu

     mu, toy, peg, imm « Þ, tag, box, symbol, token, number, bigint, class, method, reader, writer, eval, env, vm, gc, world, pcode, compiler, asm, lathe, lisp, smalltalk, design, weight, jar, card, harp, debug, profile

     (thorn, todo, names, fd, iovec, assert, log, run, hex, crc, buf, in, out, quote, escape, compare, file, deck, cow, arc, blob, tree, slice, rand, time, stat, hash, heap, node, primes, page, book, pile, stack, atomic, lock, mutex, thread, map, meter, list, iter, ctype)

cache lines «

     Parsing the format of immediates inside a peg might seem complex. But it's relatively fast once a given peg has been touched and a cache line holds the value. (Note it can also be loaded into a register, which is even better.)

     Cache line misses are spectacularly expensive. So on average, once you've touched one slot in a source box, then looking at an immediate in another slot is very cheap compared to following a pointer to a new box that's not in the cache.

     Even if a lot of if statements are executed in pursuit of an immediate's meaning, this can still be much faster than touching another box not yet in a cache line. So Wil has an incentive to pack as many different kinds of value into a peg as he can.

     As long as a literal will fit in 30-bits or less, Wil is happy to pack them in like sardines if the alternative is allocating another box in gc space. Densely packed and already in hand is fast.

     Note small apps whose memory footprint fits entirely into cache lines is an unusual situation (perhaps even a micro-benchmark kind of situation) that might actually go slightly slower when unpacking values from immediates encoded in a peg. But this scenario is neither likely nor interesting to Wil when apps under strong memory pressure are his main concern.

errors «

     The error immediates need revision, mainly because they use two few bits for error numbers. Apparently only four bits of error code were used last time — as if the Lisp reader would find no more than 16 distinct sorts of basic parser error that were not badly constructed tokens.

     The current error immediates also contain the line number at which the parse error was seen. (But this is not particularly important beyond printing messages, because the generated parse tree it itself shows the error value in the spot where the parse error happened — which is typically more precise than the line.)

     Twenty-four bits are used for the line number, which is way too many since 20 bits is enough for a million lines (and if you have source files with more than a million lines, Wil doesn't care if your error messages show correct line numbers).

     So minimally, another four bits should go into error numbers, taking four away from the line number field. However, the number of bits used for shifting occurs too many places to edit in five minutes, and Wil is in no hurry. It's simpler to write a note here saying "todo: fix error format."

constructors and operators «

     The following yp constructors and operators use immediate peg values.

explicit yp(u32 v) { u.p_lid = v; } «

u32 operator=(u32 v) { u.p_lid = v; return v; } « bool operator==(u32 v) const { return v == u.p_lid; } « bool operator!=(u32 v) const { return v != u.p_lid; } «

explicit yp(ybox* b) { u.p_box = b; } « explicit yp(Pilk i) { u.p_lid = i; } « explicit yp(Psole s) { u.p_lid = s; } «

inlines «

     The yp peg api defines the following inline methods for working with immediate values stored inside yp. To help clarify the purpose of similar inlines, the following prefixes appear on method names to consistently mean testing the type, getting the value, or setting the value:

  • t - ask if immediate has that type (predicate)
  • p - read the actual immediate value (getter)
  • x - set yp to that immediate value (setter)

explicit yp(ybox* b) { u.p_box = b; } « explicit yp(Pilk i) { u.p_lid = i; } « explicit yp(Psole s) { u.p_lid = s; } «

u32 operator=(Pilk i) { u.p_lid = i; return i; } « u32 xilk(Pilk i) { u.p_lid = i; return i; } « bool operator==(Pilk i) const { return i == u.p_lid; } « bool operator!=(Pilk i) const { return i != u.p_lid; } «

u32 operator=(Psole s) { u.p_lid = s; return s; } « u32 xsole(Psole s) { u.p_lid = s; return s; } « bool operator==(Psole s) const { return s == u.p_lid; } « bool operator!=(Psole s) const { return s != u.p_lid; } «

// peg type is nil? (low bits equal two?): bool isnil() const { return u.p_lid == i_nil; } «

/// \return not zero and not nil bool nonnil() const { // peg type is not nil? « return u.p_lid && (i_nil != u.p_lid); }

// tlid() asks "is a lid peg?" (an imm? nonzero low bits?) bool tlid() const { return (3 & (p_t) u.p_box) != 0; } « lid_t plid() const { return u.p_lid; } // encoding of imm « operator lid_t() const { return u.p_lid; } // the lid bits «

// tbox() asks "is a box peg?" (not imm? zero low bits?) bool tbox() const { return (3 & (p_t) u.p_box) == 0; } « ybox* pbox() const { return u.p_box; } // 4-byte aligned « operator ybox*() const { return u.p_box; } // the ptr bits « void xbox(const void* b) { u.p_box = (ybox*) (void*) b; } «

// peg type is imm int? (low bits are equal to one?) bool ti4() const { return (3 & u.p_lid) == i_int; } « int pi4() const { return u.p_lid >> 2; } // remove ilk bits « void xi4(int x) { u.p_lid = (x << 2) | i_int; } « // 30-bit ints use 30th bit for sign; only 29 bits remain:

enum { // limits of immediate integers // e_i4max = 0x1fffffff (29 bits) e_i4max = (1<<29)-1, // biggest pos int preserved by xi4() « // e_i4min = 0xe0000000 (30 bits) e_i4min = (-1L<<29), // smallest neg int preserved by xi4() « e_i4bits = 30 « }; // e_i4max: biggest pos int from xi4(): 0x1fffffff (29 bits) // e_i4min: smallest neg int from xi4(): 0xe0000000 (30 bits)

// peg type is err? (low 4 bits equal to quad err value?) bool terr() const { return (u.p_lid & 0x0f) == q_err; } « Perr perr() const { return (Perr) ((u.p_lid >> 4) & 0x0f); } « u32 perrline() const { return (u32) (u.p_lid >> 8); } «

// peg type is cease? (low 8 bits equal to oct cease value?) bool tcease() const { return (u.p_lid & 0x0ff) == o_cease; } « u32 pceaseline() const { return (u32) (u.p_lid >> 8); } « void xcease(u32 line) { u.p_lid = (line << 8) | o_cease; } «

// peg type is delim? (low 8 bits equal to oct delim value?) bool tdelim() const { return (u.p_lid & 0x0ff) == o_delim; } « u32 pdelimline() const { return (u32) (u.p_lid >> 8); } « void xdelim(u32 line) { u.p_lid = (line << 8) | o_delim; } «

// type is overdot? (low 8 bits equal oct overdot val?) bool toverdot() const { « return (u.p_lid & 0x0ff) == o_overdot; } u32 poverdotline() const { return (u32) (u.p_lid >> 8); } « void xoverdot(u32 line) { « u.p_lid = (line << 8) | o_overdot; }

bool tunderdot() const { « return (u.p_lid & 0x0ff) == o_underdot; } u32 punderdotline() const { return (u32) (u.p_lid >> 8); } « void xunderdot(u32 line) { « u.p_lid = (line << 8) | o_underdot; }

bool tmultidot() const { « return (u.p_lid & 0x0ff) == o_multidot; } u32 pmultidotline() const { return (u32) (u.p_lid >> 8); } « void xmultidot(u32 line) { « u.p_lid = (line << 8) | o_multidot; }

bool tbadref() const { return (u.p_lid & 0x0ff)==o_badref; } « u32 pbadrefline() const { return (u32) (u.p_lid >> 8); } « void xbadref(u32 line) { u.p_lid = (line << 8) | o_badref; } «

bool tc21() const { return (u.p_lid & 0x0f) == q_c21; } « c32 pc21() const { return (c32) (u.p_lid >> 8); } «

// peg bucky bits (second four bits): int pbucky() const { return (u.p_lid >> 4) & 0x0f; } « void xc(c32 c) { u.p_lid = (c << 8) | q_c21; } « void xc21(c32 c) { u.p_lid = (c << 8) | q_c21; } « void xc21(c32 c, int bucky) { « u.p_lid = (c << 8) | ((bucky & 0x0f) << 4) | q_c21; }

// char->integer according to MIT Scheme with bucky bits: i32 pc21asinteger() const { c32 lid = u.p_lid; « int bucky = (lid >> 4) & 0x0f; // ply bits return (i32) ((bucky << 21) | (u.p_lid >> 8)); }

typedef unsigned u_t; // shorter unsigned synonym for below «

bool tnobox() const { return o_nobox == (u.p_lid & 0xff); } « u_t pnobox() const { return (u_t) (u.p_lid >> 8); } « void xnobox(u_t tag) { u.p_lid = (tag << 8) | o_nobox; } «

bool tsetter() const { return o_setter == (u.p_lid & 0xff); } « u_t psetter() const { return (u_t) (u.p_lid >> 8); } « void xsetter(u_t x) { u.p_lid = (x << 8) | o_setter; } «

bool tgetter() const { return o_getter == (u.p_lid & 0xff); } « u_t pgetter() const { return (u_t) (u.p_lid >> 8); } « void xgetter(u_t x) { u.p_lid = (x << 8) | o_getter; } «

bool ttoken() const { return o_token == (u.p_lid & 0xff); } « u_t ptoken() const { return (u_t) (u.p_lid >> 8); } « void xtoken(u_t x) { u.p_lid = (x << 8) | o_token; } «