microcosm

What are the constituent components of data in computing systems? If we look at data at the smallest possible granularity, what do we see? In the world of physical objects, matter is composed of atoms. And even atoms are divisible into smaller subatomic particles. Data in computing systems has a similar smallest possible granularity of divisibility. Typically the smallest addressable object in memory is a byte. And each byte is composed of eight bits. So we might think of bytes as atoms for building larger structures in data. And there's nothing smaller than a bit in computer data. All the data in your computer system is composed of bytes in ordered patterns.

Although we might liken bytes to atoms for building larger structures, the resulting systems in computer data have only weak bonds to each other. Bytes are not joined together by strong forces like the chemical bonds between atoms in molecules. So bytes only associate in patterns by virtue of exact placement and care to avoid scattering influences. A computer data structure is like a house of cards, and can easily be disarranged by small amounts of accidental force. So a great deal of effort in software development is devoted to avoiding casual disruption of data systems. Enormous discipline is needed to keep everything just right during the execution of computer applications. With enough discipline present in software, end users perceive a stable system seemingly safe from casual disturbance. With good software, a user sees robust virtual models that encourage play and exploration without a sense of undue fragility.

(Image: old style core memory.)

The image above shows a design for computer memory that preceded semiconductor memory chips. This is called core memory, and consists of individually magnetized rings of ferrite, an iron-based magnetic ceramic. Each tiny ring on the right can be selected individually by current flowing through both the horizontal and vertical wires intersecting inside the ring. Because each ring can have two different magnetic states, one denoting zero and another denoting one, each ring encodes a single bit of data. This image of core memory might help readers envision the concept of individual bits of data. Today this is handled very differently in RAM chips, and the bits are individually much smaller in size. But the principles are similar.

Most computer architectures today are byte addressable, which means every single byte in memory can be read or written individually. Each byte consists of eight bits of data, which is enough to distinguish 256 (two to the eighth power) different states. There are 256 different ways in which one can arrange zeroes and ones in eight bits, and this is why a byte has 256 possible integer states. But why does a byte have eight bits, and not seven bits or nine bits instead? Well, it's arbitrary, but this is now an utterly pervasive standard in computer systems.
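The count of 256 is easy to check by brute force. The following is only a minimal Python sketch, with illustrative variable names, that lists every arrangement of zeroes and ones in eight positions:

```python
from itertools import product

# Every way to arrange zeroes and ones in eight bit positions.
patterns = ["".join(bits) for bits in product("01", repeat=8)]

print(len(patterns))          # how many distinct states one byte has
print(patterns[0])            # the pattern with every bit zero
print(patterns[-1])           # the pattern with every bit one
print(int(patterns[-1], 2))   # the largest integer one byte can hold
```

Reading each pattern as a binary number gives the integers from zero up to the largest byte value, with no gaps and no repeats.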

Bytes are traditionally associated with letter characters in the minds of software developers, merely because a byte is big enough to hold values encoding characters in ASCII format. ASCII is a standard character set which assigns arbitrary integer values to letters of the alphabet. This is similar in principle to the Morse code invented in the nineteenth century for use on the telegraph. Except Morse code uses a variable number of dots and dashes, and ASCII uses the same number of bits for every letter. (ASCII actually uses only seven of eight bits, and the eighth bit is typically zero.)

The image at the right shows many ways to represent the three letters 'B', 'I', and 'T', in the word bit. This variety might help demonstrate the way codes can assign arbitrary patterns to denote letters, without any concern for the way in which other codes make similar assignments. The pattern for each letter need only be different from that used by any other letter in the same code system. For example, the code used in braille at the bottom effectively uses six bits of information for each letter, since each point on a six dot grid can be either a bump or a blank spot. Braille has no relation to either Morse code or ASCII.

Although code assignments might be arbitrary, once made they can't be changed again without altering a standard. So in ASCII the hex integer 0x42 means 'B' and this never changes. (Note some coding system prototypes evolve when they are new, before they become standards. This can have the effect of breaking old data created with early versions of the prototypes.)
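As a small illustration, the ASCII patterns for the letters in the word bit can be listed with a few lines of Python. The function name ascii_bits is hypothetical and exists only for this sketch:

```python
def ascii_bits(text):
    """Return (letter, hex code, eight-bit pattern) for each ASCII letter."""
    rows = []
    for byte in text.encode("ascii"):
        rows.append((chr(byte), f"0x{byte:02X}", f"{byte:08b}"))
    return rows

for row in ascii_bits("BIT"):
    print(row)
```

Each pattern starts with a zero, which is the unused eighth bit.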

Codes often have internal patterns that make some tasks more convenient when applying the codes. For example, Morse code uses a single dot and a single dash to represent letters E and T respectively, and this is convenient in English because those are the two most commonly occurring letters in typical bodies of English text. For another example, ASCII assigns sequential integer values to letters A through Z, and this makes it easier to derive the ordinal positions 1 to 26 through simple arithmetic. So although code assignments need only keep things distinct, they might use additional patterns within a code that make it cheaper to perform typical tasks. In the case of Morse code, the choice of shorter codes for common letters can make a coded message smaller.
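Both patterns can be sketched briefly in Python. The names ordinal and MORSE are hypothetical, and the Morse table is only a five-letter excerpt, not the full code:

```python
def ordinal(letter):
    """Position of a letter in the alphabet: 1 for A through 26 for Z."""
    return ord(letter.upper()) - ord("A") + 1

# A tiny excerpt of Morse code, enough to show the variable lengths.
MORSE = {"E": ".", "T": "-", "I": "..", "B": "-...", "Q": "--.-"}

for letter in "BIT":
    print(letter, ordinal(letter), MORSE[letter])
```

The subtraction works only because ASCII happens to number the letters consecutively. A code without that internal pattern would need a lookup table instead, just as Morse does here.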

As a rule, encoding systems for character sets are not terribly complex, although Unicode is enormously more complex than ASCII. These systems create data in formats rather uniform and well-defined in structure. As a result, the software dealing with these formats is not intrinsically complex either. So problems in standardizing document formats in the computing industry are not really caused by these basic kinds of data. Issues in handling text, integers, and other primitive types are easy enough in isolation. Documents composed of nothing but vast expanses of the same kind of data are not a problem. This is one reason why text-based document formats like XML grow in popularity as a solution for problems in mixing varied content types in complex aggregations. It's quite difficult to get confused about the content of such documents as text per se. However, it can still be hard to grasp the complex content expressed inside as text, even though the text medium itself is transparent.

We have the most trouble with document formats permitting almost anything to be placed inside a single document file. This is true whether a file has a more human readable text format like XML, or a more opaque binary format like structured storage database systems. The reason is that such files can require many kinds of software to understand every single kind of data inside. As a general rule, the more varied the content put in just one file, the greater the number of software modules needed to use it.

The problem is that no data is independent of code required to understand the data. This is not at all obvious when data types used are very simple. Then any code involved is so commonplace that a dependency is not considered consequential even if noticed at all. But any very complex kind of data requires a complex kind of software to cope with the data. In this case it's more obvious that data is not usable without the presence of code understanding the data. Then the scope of this problem is compounded when a single file contains several distinct kinds of very complex data, all mixed together in aggregation. When many complex data formats are used jointly in compound document formats, the resulting documents depend on many kinds of software simultaneously. This greatly increases the likelihood that a document becomes at least partly unusable when the software evolves for handling any one of the several data formats used.

Any file format which is very tolerant of casually mixing many kinds of data format is most likely to get burned by this problem. Such tolerant formats attract experimental software developers as customers, because such formats cope with more extremes as developers push the envelope in ongoing research. However, software near the cutting edge in research is more likely than anything else to change in the near future as it gets further refined. And changes in software can require changes in data format. Or even more subtly, a change in software can change the meaning of old data when in use, even if the format does not change. When a single document mixes content from ten different cutting edge vendors, chances are not very good such a document has a long useful lifespan with unchanged quality and fidelity. In this case, such documents are truly like houses made of cards, stacked to dizzying and unstable heights.


VEX

What's with the biological imagery above? Are you one of those nuts who fantasize that computers are alive? Or are you just trying to borrow the grandeur of success in genetics to pump up the mystique of your own field of study? How can a DNA molecule say anything useful about computer data?


GED

No, computer systems aren't alive. The biological imagery and DNA molecule support a nice analogy that says bytes in computers are like atoms in biology. Both are simple and lifeless, but can be joined in big patterns. Solid objects consist of atoms, and data objects consist of bytes.


ROZ

The DNA molecule shows a sudden change happens at an atomic level between lifeless matter and something more that uses matter as an underlying substrate. It's not the atoms and molecules in DNA that imbue life. Instead the information is present somewhere in the organization.


GED

Right, and the info in DNA is useless by itself. It only makes sense in the context of a host of other biological materials interacting with DNA molecules. DNA serves a function within cells only with other molecules in its company.


ROZ

The same thing applies to both code and data in computers. They depend on the presence of more code and data to provide additional context. Much art in computers involves keeping understandings suitably aligned. We avoid confusing which bytes mean text and which bytes mean images, etc.


YEN

I have this fear you'll tell me about ones and zeroes again. I know everything in computers is binary ones and zeroes, but I don't care. I can use my computer without considering ones and zeroes. So why tell me about bits and bytes now? It all seems arbitrary. What do I really need to know? How often do you developers think about ones and zeroes?


GED

We rarely think about ones and zeroes per se, so we're more like you than you might think. We often write code doing arithmetic with data, where we know the representation of ones and zeroes is important, but we don't need any details. I can write hash functions without knowing ASCII codes.
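A sketch of what Ged means, in Python. The hash below is FNV-1a, chosen here only as a familiar example and not because the conversation names it. It treats each byte as a small integer and never needs to know which letter a byte stands for:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    value = 0x811C9DC5                  # FNV offset basis
    for byte in data:                   # each byte is just an integer 0..255
        value ^= byte
        value = (value * 0x01000193) & 0xFFFFFFFF   # FNV prime, keep 32 bits
    return value

print(hex(fnv1a_32("BIT".encode("ascii"))))
```

The arithmetic depends on the bytes being there, but nothing in the function mentions 0x42 or any other character code.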


ROZ

When we want ones and zeroes we can see them, but it's rarely necessary. I drive to a local grocery store all the time, and I neither use a map nor look at the street names. I know where I'm going, so I just drive. If I wanted to notice street names I could do so. But I have neither desire nor reason.


GED

We tell you about bits and bytes now so you know where it all stops at the bottom, and so you have some feel for size. It helps to have some intuitive feel for size in small data, just like it helps to know the size of a dollar when shopping. Would you want to pay $500 for a cup of coffee?


ROZ

Compared to atoms in physics, bytes in computers are simple. We don't have scads of mysterious subatomic particles. Each byte has the same simple structure. Every byte has eight ordered bits. (Vex, keep quiet.) You can look at a byte as eight bits, or as a small integer in the range zero to 255.


GED

But you don't need to know that. You need only know a byte can distinguish few things, since it can have only 256 different values. When you have more things, a byte is too small and you'll need more. For example, when writing text in a two-byte Unicode encoding, each letter uses two bytes instead of just one.


ROZ

We once wrote text in ASCII (American Standard Code for Information Interchange) nearly always. This is an arbitrary code that assigns different small integers to mean different things. For example, 0x42 in hex denotes the letter 'B'. But ASCII has a very small repertoire for letters. It's for the Roman (Latin) alphabet as used in English.


GED

As a standard, ASCII is now extremely pervasive, which makes it very convenient. Unicode is a newer standard with a much larger standard repertoire for letters, because even with just two bytes it can distinguish 64K (i.e. 64 * 1024) different letters.
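A short Python sketch of the sizes involved. It assumes UTF-16 as the two-byte form, which is a choice made for this example since the conversation names no specific encoding:

```python
letter = "B"

one_byte = letter.encode("ascii")       # one byte per letter
two_byte = letter.encode("utf-16-le")   # two bytes per letter, no byte order mark

print(len(one_byte), len(two_byte))
print(2 ** 8)     # values one byte can distinguish
print(2 ** 16)    # values two bytes can distinguish
```

The second byte for 'B' is simply zero, which is the extra memory Vex is about to complain about.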


VEX

But I hate Unicode with a passion even if it's politically correct to support alphabets I don't care about. It doubles the amount of memory I need to represent text. And using Unicode is a performance nightmare because I need a virtual machine now in places where CPU arithmetic worked fine for ASCII.


YEN

Okay, I understand that text formats can undergo technical change. But can you get back to why I should care? How does this affect my data and where it lives? All data is not text, is it? You never fully answered my question about what I really need to know about bits and bytes. Do I really need to know a byte has eight bits? And what do those names mean anyway? Bit and byte sound so arbitrary.


GED

Okay, what you really need to know is what can go wrong. To understand any potential risk of loss to your data, it helps to know what would make it corrupt. That might help you grasp how a vendor is trying to prevent this, and how this affects the movement of your data. Many issues involve bytes.


ROZ

You don't need to know a byte has eight bits. That's arbitrary, but it doesn't change anymore. It's not much more interesting than knowing decimal numbers use ten digits from 0 to 9. The names for bits and bytes are somewhat arbitrary. Bit originally stood for binary digit, if that helps.


GED

Let's go back to the risk of data loss. The risk of loss in data composed of bytes is similar to the risk of damage in DNA composed of atoms and molecules in biological systems. Almost any change to a DNA sequence can damage it. DNA info is composed of a sequence of base pairs drawn from a small repertoire.


ROZ

Of course, DNA seems to have some noise, so some changes do not cause damage. But let's ignore that and work with the model that any change is damage. Swapping old DNA pairs with different new pairs is damage. So is cutting pairs or adding pairs. Similar things in data can also be damage.


GED

Computer data is not only sensitive to bit values, but also to number of bytes and their relative positions. Altering bytes, or moving them around, or changing the length of a sequence can all drastically change meaning, sometimes to the extent of making data completely unintelligible.


ROZ

So computer systems often focus on making perfect replicas of data to prevent loss. Changing data just a little can have small to large effects. A change in content might make a barely perceptible change to users. But changes in metacontent might cause any content described to become unreadable.
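The difference between damage to content and damage to metacontent can be sketched in Python. The record layout below is invented for this example: a two-byte length field (the metacontent) followed by ASCII text (the content). The function names are hypothetical.

```python
import struct

def pack_record(text):
    """Store text as a 2-byte big-endian length followed by ASCII bytes."""
    body = text.encode("ascii")
    return struct.pack(">H", len(body)) + body

def unpack_record(data):
    """Read back a record written by pack_record."""
    (length,) = struct.unpack(">H", data[:2])
    body = data[2:2 + length]
    if len(body) != length:
        raise ValueError("record is truncated or its length field is damaged")
    return body.decode("ascii")

good = pack_record("BIT")
altered = good[:2] + b"C" + good[3:]            # one content byte changed
swapped = good[:2] + good[2:][::-1]             # same bytes, different order
damaged = bytes([good[0] ^ 0x01]) + good[1:]    # one bit flipped in the length

print(unpack_record(good))
print(unpack_record(altered))
print(unpack_record(swapped))
try:
    unpack_record(damaged)
except ValueError as error:
    print("unreadable:", error)
```

Changing or reordering content bytes still yields readable text, just different text. Flipping a single bit in the length field makes the whole record unreadable, even though every content byte is intact.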


GED

When text formats or other formats undergo technical change, this can affect the integrity of Yen's data. This is because Yen's data moves around, and the code using the data might not be the same everywhere. The more standards change, the bigger the chance that code systems vary in operation.


ROZ

A computer system might succeed in making perfect replicas of data bytes. But the code in a receiving system might assign slightly different meanings to the bytes. This would make the data seem different in a receiving computer even if the bytes were identical to those sent from a source.
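A minimal Python sketch of that situation. The choice of UTF-8 for the sender and Latin-1 for the mistaken receiver is only an assumption for illustration:

```python
sent = "caf\u00e9".encode("utf-8")      # bytes produced by the sender
received = bytes(sent)                  # a perfect, byte-for-byte replica

as_utf8 = received.decode("utf-8")      # receiver agrees with the sender
as_latin1 = received.decode("latin-1")  # receiver assumes a different code

print(received == sent)
print(as_utf8)
print(as_latin1)
```

Not one byte was lost or altered in transit, yet the second receiver shows a five-letter word where the sender wrote a four-letter one.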


GED

This affects where Yen's data lives, because Yen's data can vary in meaning in different places, even when perfect copies are passed from point to point. This is more likely to happen when standards undergo technical change than otherwise.


VEX

Basically, the longer your data lives, the bigger a pain it is to me. This is because the code on your computer changes when you install new software. And the code on my server changes when I write new software or buy into the hottest new fad. Either can alter the meaning of your old data.


GED

After a long period of technical change, Vex can lose sympathy with your desire to preserve old data. He thinks you should have upgraded to the newest things long ago. If you suffer as a result of not upgrading, he thinks this is partly your fault.


VEX

You make it sound like a bad thing that I want Yen to upgrade all his software. I mean c'mon. He might be using products written a couple years ago, or (shudder) even as long as five years ago. Get real! The world doesn't stop just because Yen stopped shelling out bucks for new software.


YEN

You make it sound like you have no duty unless I pay regular software fees on a subscription model, whether or not that was our original understanding. When I pay for something I expect it to work. Now I see all software slowly decays over a few years. Does the free web app work that way, too?


VEX

Hey, babe, I live on internet time. By now I think in terms of months instead of years. I'll guarantee your data until next quarter at the very least. The informal industry statute of limitations gets shorter all the time. It doesn't last much longer than it takes me to go find a new gig.


YEN

I can control my anger. I'm counting to ten. Ah, somebody distract me so my blood pressure goes down again. Roz, tell me your business model doesn't have planned obsolescence built into it like Vex's. Are you shipping, and how much does Rozware cost these days?


VEX

Whoa, don't get all testy! Geez, users get awfully uptight. I gotta make a living you know. Just because I give you something for free doesn't mean I have no plan to shake you by the ankles and collect all your spare change. You keep me in snappy threads and cars, and I'll watch your data.


ROZ

Sorry, we're not shipping yet. Avoiding obsolescence is complex, and I only see good results from following a two pronged approach. If we make data available in both long term and short term formats, we can split our efforts between protecting old assets and making new technical progress.


GED

The cutting edge involves a bit of chaos, and this chaos eats away at the safety of your data, just like the ocean eats away the coastline more during storms than other times. But we can hedge the risk by paying for a bit of redundancy in the system for storing data. It makes software a bit more costly.


ROZ

For example, we might use XML as a safer and lower rate of change format, if we can see how to avoid thrash in document schema standards. If we make data self descriptive enough, it's hard for content to become unreadable. That's one prong.
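As a rough sketch of what self descriptive means here, consider a tiny XML document read with Python's standard library. The element names and values are invented for this example and do not come from any real schema:

```python
import xml.etree.ElementTree as ET

document = """<note>
  <author>Yen</author>
  <subject>backups</subject>
  <body>Remember to copy the old files.</body>
</note>"""

root = ET.fromstring(document)

# Each piece of content carries its own name, so a reader needs no
# separate description of byte offsets or field order.
fields = {child.tag: child.text for child in root}

for name, value in fields.items():
    print(name, "=", value)
```

Even a program that has never seen this kind of note can list the fields and their names, which is the property that makes content hard to lose entirely.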


GED

The other prong of the approach is to use more experimental formats in our chase for more performance and utility. As long as your data is not trapped in less stable formats, change caused by technical evolution will not put your data at risk. That way innovation doesn't hold you hostage.


ROZ

But two prongs will make our development more expensive. It's not easy to fully support two entirely different storage approaches with equal levels of fidelity and code quality. We must make our code abstract to achieve this. But we actually think that abstraction can improve the code quality.


GED

In the time we ship one system done right, Vex might ship two squirrelly ones and have an IPO under his belt as well. Maybe we can split the market with Vex. He can chase the easy money from folks new to the scene, and we can chase the folks looking for more quality and support.


VEX

Heh, but you forgot one thing. I'm also filing patents left and right as fast as I can put up legal fences across the landscape. So you losers will walk a minefield the whole way, with me suing you for license fees you can't afford. Maybe I'll let you sharecroppers keep some of the harvest. Hah! Man, I love the law when it works so well in my favor.


22Oct00  Images are the property of respective copyright holders. All text is
Copyright © 1999-2007 briarpig.