What are the constituent components of data in computing systems?
If we look at data at the smallest possible granularity, what
do we see?
In the world of physical objects, matter is composed of atoms.
And even atoms are divisible into smaller subatomic particles.
Data in computing systems has a similar smallest possible
granularity of divisibility. Typically the smallest addressable
object in memory is a byte. And each byte is composed of eight bits.
So we might think of bytes as atoms for building larger structures
in data.
And there's nothing smaller than a bit in computer data.
All the data in your computer system is composed of bytes in
ordered patterns.
Although we might liken bytes to atoms
for building larger structures, the resulting systems
in computer data have only weak bonds to each other.
Bytes are not joined together by strong forces like
the chemical bonds between atoms in molecules. So bytes
only associate in patterns by virtue of exact placement
and care to avoid scattering influences. A computer
data structure is like a house of cards, and can easily
be disarranged by small amounts of accidental force.
So a great deal of effort in software development is
devoted to avoiding casual disruption of data systems.
Enormous discipline is needed to keep everything
just right during the execution of computer applications.
With enough discipline present in software, end users
perceive a stable system seemingly safe from casual
disturbance. With good software, a user sees robust
virtual models that encourage play and exploration without a
sense of undue fragility.
(Old style core memory.)
The image above shows a design for computer memory that preceded
semiconductor memory chips. This is called core memory, and consists
of individually magnetized rings of ferrite, a magnetic material made from iron oxide. Each tiny ring
on the right
can be selected individually by current flowing through both the
horizontal and vertical wires intersecting inside the ring.
Because each ring can have two different magnetic states,
one denoting zero and another denoting one, each
ring encodes a single bit of data. This image of core memory
might help readers envision the concept of individual bits of data.
Today this is handled very differently in
RAM chips, and the bits are individually
much smaller in size. But the principles are similar.
Most computer architectures today are byte addressable,
which means every single byte in memory can be read or
written individually. Each byte consists of eight bits
of data, which is enough to distinguish 256 (two to the
eighth power) different states. There are 256 different
ways in which one can arrange zeroes and ones in eight
bits, and this is why a byte has 256 possible integer states.
But why does a byte have eight bits, and not seven bits or nine bits
instead? Well, it's largely arbitrary, and some early machines used other
sizes, but eight bits is now an utterly pervasive standard in computer systems.
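The count of 256 can be checked directly. The short Python sketch below is only an illustration (the variable names are ours): it enumerates every arrangement of zeroes and ones in eight positions, then shows one byte both as a bit pattern and as a small integer.

```python
from itertools import product

# Every possible arrangement of zeroes and ones in eight bit positions.
patterns = ["".join(bits) for bits in product("01", repeat=8)]

print(len(patterns))   # number of distinct eight-bit patterns
print(2 ** 8)          # the same count, computed as two to the eighth power

# One byte viewed two ways: as a bit pattern and as a small integer.
value = 0b01000010
print(format(value, "08b"), value)
```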
Bytes are traditionally associated
with letter characters in the minds of
software developers, merely because a
byte is big enough to hold values
encoding characters in
ASCII
format. ASCII
is a standard character set which assigns
arbitrary integer values to letters
of the alphabet. This is similar in
principle to the Morse code invented in
the nineteenth century for use on the
telegraph. The difference is that Morse code uses a
variable number of dots and dashes,
while ASCII
uses the same number of bits for every
letter. (ASCII
actually uses only seven of eight bits,
and the eighth bit is typically zero.)
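To make the contrast concrete, here is a small Python sketch under our own assumptions: the `MORSE` table holds only three International Morse entries, written as dots and dashes, and the word "BIT" is just a convenient sample. It shows each ASCII letter occupying one byte with a zero top bit, while the Morse letters vary in length.

```python
# A few International Morse entries, written as dots and dashes.
MORSE = {"B": "-...", "I": "..", "T": "-"}

word = "BIT"

# ASCII: each letter becomes one byte.
ascii_bytes = word.encode("ascii")
for letter, byte in zip(word, ascii_bytes):
    print(letter, format(byte, "08b"), hex(byte))

# Morse: the number of symbols varies from letter to letter.
for letter in word:
    print(letter, MORSE[letter], len(MORSE[letter]), "symbols")

# Every ASCII value fits in seven bits, so the eighth (highest) bit is zero.
high_bits = [byte >> 7 for byte in ascii_bytes]
print(high_bits)
```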
The image at the right shows many
ways to represent the three letters
'B', 'I', and 'T', in the word bit.
This variety might help demonstrate
the way codes can assign arbitrary
patterns to denote letters, without
any concern for the way in which other
codes make similar assignments. The
pattern for each letter need only be
different from that used by any other
letter in the same code system.
For example, the code used in braille at the
bottom effectively uses six bits of
information for each letter, since
each point on a six dot grid can
be either a bump or a blank spot.
Braille has no relation to either
Morse code or ASCII.
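The six bits in a braille cell can be sketched in a few lines of Python. The dot sets below are the standard English braille letters for B, I, and T. The order in which we write the six bits (dot 1 first) is our own choice for illustration, not part of braille itself.

```python
# Dots are numbered 1-3 down the left column and 4-6 down the right column.
BRAILLE_DOTS = {"B": {1, 2}, "I": {2, 4}, "T": {2, 3, 4, 5}}

def dots_to_bits(dots):
    """Return six characters: '1' where a dot is raised, '0' where it is blank."""
    return "".join("1" if n in dots else "0" for n in range(1, 7))

for letter, dots in BRAILLE_DOTS.items():
    print(letter, dots_to_bits(dots))

# Six positions with two states each allow 2**6 different cells.
possible_cells = 2 ** 6
print(possible_cells)
```

Each letter only needs a pattern different from every other letter in the same code, which is all the table above guarantees.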
Although code assignments might be arbitrary, once
made they can't be changed again without altering
a standard.
So in ASCII the hex integer
0x42 means 'B' and this never
changes. (Note some coding system prototypes evolve
when they are new, before they become standards.
This can have the effect of breaking old data
created with early versions of the prototypes.)
Codes often have internal patterns that make
some tasks more convenient when applying the codes.
For example, Morse code uses a single dot and a
single dash to represent letters E and T
respectively, and this
is convenient in English because those are the
two most commonly occurring letters in typical
bodies of English text. For another example,
ASCII assigns sequential
integer values to letters A through Z, and this
makes it easier to derive the ordinal positions
1 to 26 through simple arithmetic. So although
code assignments need only keep things distinct,
they might also use additional patterns within
a code that make it cheaper to perform typical tasks.
In the case of Morse code, the choice of shorter
codes for common letters can make a coded message
smaller.
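Both conveniences are easy to see in a short Python sketch. The helper name `ordinal` and the four-entry `MORSE` table are ours, chosen only for illustration.

```python
def ordinal(letter):
    """Position of an uppercase ASCII letter in the alphabet: 1 for 'A', 26 for 'Z'."""
    return ord(letter) - ord("A") + 1

print(ordinal("A"), ordinal("B"), ordinal("Z"))

# Morse gives the two most common English letters the shortest codes,
# and rarer letters such as Q and Z longer ones.
MORSE = {"E": ".", "T": "-", "Q": "--.-", "Z": "--.."}
for letter, code in MORSE.items():
    print(letter, code, len(code))
```

The subtraction works only because ASCII happens to number A through Z consecutively. A code that scattered the letters would need a lookup table instead.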
As a rule, encoding systems for character sets are not
terribly complex, although Unicode is enormously
more complex than ASCII.
These systems create data in formats rather
uniform and well-defined in structure. As
a result, the software dealing with these formats
is not intrinsically complex either. So
problems in standardizing document formats
in the computing industry are not really
caused by these basic kinds of data.
Issues in handling text, integers, and other
primitive types are easy enough in isolation.
Documents composed of nothing but vast expanses
of the same kind of data are not a problem.
This is one reason why text-based document
formats like XML grow in popularity as a solution
for problems in mixing varied content types in
complex aggregations. It's quite difficult to
get confused about the content of such documents
as text per se. However, it can still be hard
to grasp the complex content expressed inside as text,
even though the text medium itself is transparent.
We have the most trouble with document formats
permitting almost anything to be placed inside a
single document file. This is true whether a
file has a more human readable text format like XML, or
a more opaque binary format like structured storage
database systems. The reason is that such files
can require many kinds of software to understand every
single kind of data inside. As a general rule, the
more varied the content put in just one file, the
greater the number of software modules needed to use it.
The problem is that no data is independent of code
required to understand the data. This is not at all
obvious when data types used are very simple. Then any code
involved is so commonplace that a dependency is
not considered consequential even if noticed at all.
But any very complex kind of data requires a complex
kind of software to cope with the data. In this case
it's more obvious that data is not usable without the
presence of code understanding the data. Then the
scope of this problem is compounded when a single
file contains several distinct kinds of very complex
data, all mixed together in aggregation. When many
complex data formats are used jointly in compound
document formats, the resulting documents depend on
many kinds of software simultaneously. This greatly
increases the likelihood that a document becomes
at least partly unusable when the software evolves
for handling any one of the several data formats used.
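A toy model may make this dependency easier to see. Everything in the Python sketch below is hypothetical: the content type names, the `HANDLERS` registry, and the `read_document` helper are invented for illustration and do not describe any real document format. The point is only that each part of a compound document is readable exactly when matching code is present.

```python
import json
import struct

# Hypothetical registry: each content type maps to the code that understands it.
HANDLERS = {
    "text/ascii": lambda raw: raw.decode("ascii"),
    "int32/le": lambda raw: struct.unpack("<i", raw)[0],
    "json": lambda raw: json.loads(raw.decode("utf-8")),
}

# A compound "document": a list of (content type, raw bytes) parts.
document = [
    ("text/ascii", b"BIT"),
    ("int32/le", struct.pack("<i", 256)),
    ("json", b'{"bits": 8}'),
    ("vendor/experimental-v3", b"\x01\x02\x03"),  # no handler installed for this
]

def read_document(parts, handlers):
    """Decode every part we have code for; report the rest as unreadable."""
    readable, unreadable = [], []
    for content_type, raw in parts:
        handler = handlers.get(content_type)
        if handler is None:
            unreadable.append(content_type)
        else:
            readable.append(handler(raw))
    return readable, unreadable

readable, unreadable = read_document(document, HANDLERS)
print(readable)
print(unreadable)
```

The bytes of the last part are perfectly intact, yet the document is partly unusable because one of the handlers it depends on is absent.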
Any file format which is very tolerant of casually mixing
many kinds of data format is most likely to get burned
by this problem. Such tolerant formats attract
experimental software developers as customers, because
such formats cope with more extremes as developers
push the envelope in ongoing research. However, software
near the cutting edge in research is more likely than
anything else to change in the near future as it gets
further refined. And changes in software can require
changes in data format. Or even more subtly, a change
in software can change the meaning of old data when in
use, even if the format does not change. When a single
document mixes content from ten different cutting edge
vendors, chances are not
very good that such a document has a long useful
lifespan with unchanged quality and fidelity. In this
case, such documents are truly like houses made of
cards, stacked to dizzying and unstable heights.
|

VEX
|
What's with the biological imagery above? Are you one of those
nuts who fantasize that computers are alive? Or are you just
trying to borrow the grandeur of success in genetics to pump
up the mystique of your own field of study? How can a DNA
molecule say anything useful about computer data?
|

GED
|
No, computer systems aren't alive. The biological imagery
and DNA molecule support a nice analogy that says
bytes in computers are like atoms in biology. Both
are simple and lifeless, but can be joined in big patterns.
Solid objects consist of atoms, and data
objects consist of bytes.
|

ROZ
|
The DNA molecule shows that a sudden change happens at an
atomic level between lifeless matter and something more
that uses matter as an underlying substrate. It's not
the atoms and molecules in DNA that imbue life. Instead
the information is present somewhere in the organization.
|

GED
|
Right, and the info in DNA is useless by itself.
It only makes sense in the context of a host of other
biological materials interacting with DNA molecules.
DNA serves a function within cells only with other molecules
in its company.
|

ROZ
|
The same thing applies to both code and data in computers.
They depend on the presence of more code and data to provide
additional context. Much art in computers involves keeping
understandings suitably aligned. We avoid confusing which
bytes mean text and which bytes mean images, etc.
|

YEN
|
I have this fear you'll tell me about ones and zeroes again.
I know everything in computers is binary ones and zeroes, but I
don't care. I can use my computer without considering ones and
zeroes. So why tell me about bits and bytes now? It all seems
arbitrary. What do I really need to know? How often do you
developers think about ones and zeroes?
|

GED
|
We rarely think about ones and zeroes per se, so we're more
like you than you might think. We often write code
doing arithmetic with data, where we know the representation
of ones and zeroes is important, but we don't need any details.
I can write hash functions without knowing
ASCII codes.
|

ROZ
|
When we want ones and zeroes we can see them, but it's
rarely necessary. I drive to a local grocery store all
the time, and I neither use a map nor look at the street
names. I know where I'm going, so I just drive. If I
wanted to notice street names I could do so.
But I have no desire nor reason.
|

GED
|
We tell you about bits and bytes now so you know where it
all stops at the bottom, and so you have some feel for size.
It helps to have some intuitive feel for size in small data,
just like it helps to know the size of a dollar when shopping.
Would you want to pay $500 for a cup of coffee?
|

ROZ
|
Compared to atoms in physics, bytes in computers are simple.
We don't have scads of mysterious subatomic particles.
Each byte has the same simple structure. Every byte has
eight ordered bits. (Vex, keep quiet.) You can look at a
byte as eight bits, or as a small integer in the range
zero to 255.
|

GED
|
But you don't need to know that. You need only know a
byte can distinguish few things, since it can have only 256
different values. When you have more things, a byte is
too small and you'll need more. For example, when writing
text in a two-byte Unicode encoding, each letter uses two bytes instead of just one.
|

ROZ
|
We once wrote text in ASCII
(American Standard Code for Information
Interchange) nearly always. This is an arbitrary
code that assigns different small integers to mean different
things. For example, 0x42 in hex
denotes the letter 'B'. But ASCII has
a very small repertoire for letters. It covers only the basic Latin alphabet.
|

GED
|
As a standard, ASCII is now
extremely pervasive, which makes it very convenient.
Unicode is a newer standard with a much larger standard
repertoire for letters. Its original design could distinguish 64K
(i.e. 64 * 1024)
different letters using two bytes, and it has since grown past that limit.
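As a rough check on these sizes, the Python sketch below uses the built-in `utf-8` and `utf-16-le` codecs to count the bytes a few sample characters need. The sample characters are our own picks, written as escapes so the listing stays plain ASCII.

```python
# Sample characters: Latin B, e-acute, Greek Omega, a CJK ideograph, an emoji.
samples = ["B", "\u00e9", "\u03a9", "\u4e2d", "\U0001F600"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")   # little-endian, no byte order mark
    print(f"U+{ord(ch):04X}", len(utf8), "bytes in UTF-8,",
          len(utf16), "bytes in UTF-16")

# Two bytes hold sixteen bits, which can distinguish 64 * 1024 values.
two_byte_values = 2 ** 16
print(two_byte_values)
```

Characters numbered above that two-byte range, such as the emoji here, take four bytes in UTF-16.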
|

VEX
|
But I hate Unicode with a passion even if it's politically
correct to support alphabets I don't care about. It doubles
the amount of memory I need to represent text. And using
Unicode is a performance nightmare because I need a virtual
machine now in places where
CPU arithmetic worked fine for
ASCII.
|

YEN
|
Okay, I understand the view that text formats can undergo
technical change. But can you get back to why I should care?
How does this affect my data and where it lives?
Not all data is text, is it? You never
fully answered my question about what I really need to know about
bits and bytes. Do I really need to know a byte has eight bits?
And what do those names mean anyway? Bit and byte sound so arbitrary.
|

GED
|
Okay, what you really need to know is what can go wrong.
To understand any potential risk of loss to your data, it
helps to know what would make it corrupt. That might help
you grasp how a vendor is trying to prevent this, and
how this affects the movement of your data. Many
issues involve bytes.
|

ROZ
|
You don't need to know a byte has eight bits. That's arbitrary,
but it doesn't change anymore. It's not much more interesting
than knowing decimal numbers use ten digits from 0 to 9. The
names for bits and bytes are somewhat arbitrary. Bit originally
stood for binary digit, if that helps.
|

GED
|
Let's go back to the risk of data loss. The risk of loss in
data composed of bytes is similar to the risk of damage in
DNA composed of atoms and molecules in biological systems.
Almost any change to a DNA sequence can damage it.
DNA info is composed of a sequence of pairs from a repertoire.
|

ROZ
|
Of course, DNA seems to have some noise, so some changes
do not cause damage. But let's ignore that and work with the
model that any change is damage. Swapping old DNA pairs with
different new pairs is damage. So is cutting pairs or
adding pairs. Similar things in data can also be damage.
|

GED
|
Computer data is not only sensitive to bit values, but also
to the number of bytes and their relative positions.
Altering bytes, or moving them around, or changing the length
of a sequence can all drastically change meaning, sometimes
to the extent of making data completely unintelligible.
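A small sketch of those three kinds of damage, assuming a four-byte big-endian unsigned integer as the data (the value 1000 and the helper `read_u32` are ours, for illustration only):

```python
import struct

original = struct.pack(">I", 1000)   # four bytes, big-endian unsigned integer

def read_u32(raw):
    """Interpret exactly four bytes as a big-endian unsigned integer."""
    return struct.unpack(">I", raw)[0]

# 1. Alter a single bit in the first byte.
flipped = bytes([original[0] ^ 0b10000000]) + original[1:]

# 2. Move bytes around without changing any of their values.
reordered = original[::-1]

# 3. Change the length by dropping the last byte.
truncated = original[:-1]

print(read_u32(original))
print(read_u32(flipped))
print(read_u32(reordered))

try:
    read_u32(truncated)
except struct.error as exc:
    print("unreadable:", exc)
```

The first two changes yield numbers that look valid but are wrong, and the third cannot be read at all.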
|

ROZ
|
So computer systems often focus on making perfect
replicas of data to prevent loss. Changing data just a
little can have small to large effects. A change
in content might make a barely perceptible change to users.
But changes in metacontent might cause any content
described to become unreadable.
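The difference between content and metacontent can be sketched with a made-up record format: one length byte followed by that many ASCII characters. The format and the helpers `pack_record` and `unpack_record` are hypothetical, invented only to show the effect, and they assume text shorter than 256 bytes.

```python
def pack_record(text):
    """One length byte (metacontent) followed by ASCII text (content)."""
    body = text.encode("ascii")
    return bytes([len(body)]) + body

def unpack_record(raw):
    """Read the length byte, then require exactly that many content bytes."""
    length = raw[0]
    body = raw[1:]
    if len(body) != length:
        raise ValueError(f"header says {length} bytes, found {len(body)}")
    return body.decode("ascii")

record = pack_record("eight bits")

# Flip the lowest bit of one content byte: a small, visible change.
content_damaged = bytearray(record)
content_damaged[1] ^= 0b00000001
print(unpack_record(bytes(content_damaged)))

# Flip the lowest bit of the length byte: the content no longer matches it.
header_damaged = bytearray(record)
header_damaged[0] ^= 0b00000001
try:
    unpack_record(bytes(header_damaged))
except ValueError as exc:
    print("unreadable:", exc)
```

One flipped bit in the content changes a single letter. The same flip in the length byte makes this reader reject the whole record.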
|

GED
|
When text formats or other formats undergo technical change,
this can affect the integrity of Yen's data. This is because
Yen's data moves around, and the code using the data might
not be the same everywhere. The more standards change, the
bigger the chance that code systems vary in operation.
|

ROZ
|
A computer system might succeed in making perfect replicas
of data bytes. But the code in a receiving system might
assign slightly different meanings to the bytes. This would
make the data seem different in a receiving computer even if
the bytes were identical to those sent from a source.
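Here is one way this can happen, sketched in Python with two real character encodings, `latin-1` and `cp437`. Which encodings the source and the receiver use is our assumption for the example. The bytes arrive unchanged, yet the two receivers read different text.

```python
# "caf\u00e9" is the word cafe with an acute accent on the final e.
sent = "caf\u00e9".encode("latin-1")   # bytes produced by the source system

received = bytes(sent)                  # a perfect, byte-for-byte replica
print(received == sent)

# Two receivers apply different codes to the very same bytes.
as_latin1 = received.decode("latin-1")
as_cp437 = received.decode("cp437")

# ascii() shows non-ASCII characters as escapes, so the output is unambiguous.
print(ascii(as_latin1))
print(ascii(as_cp437))
print(as_latin1 == as_cp437)
```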
|

GED
|
This affects where Yen's data lives, because Yen's data can vary
in meaning in different places, even when perfect copies are
passed from point to point. This is more likely to happen when
standards undergo technical change than otherwise.
|

VEX
|
Basically, the longer your data lives, the bigger a pain
it is to me. This is because the code on your computer changes
when you install new software. And the code on my server
changes when I write new software or buy into the hottest new
fad. Either can alter the meaning of your old data.
|

GED
|
After a long period of technical change, Vex can lose
sympathy with your desire to preserve old data. He thinks
you should have upgraded to the newest things long ago.
If you suffer as a result of not upgrading, he thinks this
is partly your fault.
|

VEX
|
You make it sound like a bad thing that I want Yen to
upgrade all his software. I mean c'mon. He might be
using products written a couple years ago, or (shudder)
even as long as five years ago. Get real! The world
doesn't stop just because Yen stopped shelling out
bucks for new software.
|

YEN
|
You make it sound like you have no duty
unless I pay regular software fees on a subscription
model, whether or not that was our original understanding.
When I pay for something I expect it to work. Now I see
all software slowly decays over a few years. Does the
free web app work that way, too?
|

VEX
|
Hey, babe, I live on internet time. By now I think in
terms of months instead of years. I'll guarantee your
data until next quarter at the very least. The
informal industry statute of limitations gets shorter
all the time. It doesn't last much longer than it takes
me to go find a new gig.
|

YEN
|
I can control my anger. I'm counting to ten. Ah,
somebody distract me so my blood pressure goes down again.
Roz, tell me your business model doesn't have planned
obsolescence built into it like Vex's. Are you shipping,
and how much does Rozware cost these days?
|

VEX
|
Whoa, don't get all testy! Geez, users get awfully uptight.
I gotta make a living you know. Just because I give you
something for free doesn't mean I have no plan to shake you
by the ankles and collect all your spare change. You keep
me in snappy threads and cars, and I'll watch your data.
|

ROZ
|
Sorry, we're not shipping yet. Avoiding obsolescence is
complex, and I only see good results from following a two
pronged approach. If we make data available in both long
term and short term formats, we can split our efforts between
protecting old assets and making new technical progress.
|

GED
|
The cutting edge involves a bit of chaos, and this chaos
eats away at the safety of your data, just like the ocean eats
away the coastline more during storms than other times. But
we can hedge the risk by paying for a bit of redundancy in the
system for storing data. It makes software a bit more costly.
|

ROZ
|
For example, we might use XML as a safer format with a lower rate
of change, if we can see how to avoid thrash in
document schema standards. If we make data self descriptive
enough, it's hard for content to become unreadable.
That's one prong.
|

GED
|
The other prong of the approach is to use more experimental
formats in our chase for more performance and utility.
As long as your data is not trapped in less stable formats,
change caused by technical evolution will not put your
data at risk. That way innovation doesn't hold you hostage.
|

ROZ
|
But two prongs will make our development more expensive.
It's not easy to fully support two entirely different storage
approaches with equal levels of fidelity and code quality.
We must make our code abstract to achieve this. But we actually
think that abstraction can improve the code quality.
|

GED
|
In the time it takes us to ship one system done right, Vex might ship
two squirrelly ones and have an IPO under his belt as well.
Maybe we can split the market with Vex. He can chase the easy
money from folks new to the scene, and we can chase the folks
looking for more quality and support.
|

VEX
|
Heh, but you forgot one thing. I'm also filing patents left
and right as fast as I can put up legal fences across the
landscape. So you losers will walk a minefield the whole way,
with me suing you for license fees you can't afford. Maybe
I'll let you sharecroppers keep some of the harvest. Hah!
Man, I love the law when it works so well in my favor.
|
|