A beginner's tutorial in plain English, (with sound effects!) This will take you on a journey through many aspects of Personal Computing. At times we will travel deep into the inner workings of the computer itself, to see exactly how something is achieved. On other occasions we will skim lightly over the surface. The main topic is Databases because they represent a big proportion of all computing activity. This tutorial will provide the average reader with an easy to follow, easy to understand insight into databases without the usual industry jargon or in-depth analysis of database theory. This is for those people who know of the "word" database but know absolutely nothing about databases and even less about computers. It will cater to the average thirst for knowledge: perhaps provide the bare basics for the would-be database designer or programmer. It should be interesting to those employed in Education, Industry & Commerce or Science, who may be curious, but always work at the edges and never get an opportunity to be any closer than walking past the door of the Computer Department. This is not a lesson on how to design a database, nor on database programming. If you read it very carefully however, you will under-stand considerably more than many who call themselves experts. We start by clearing up an all too common misconception. What are these
things we call databases? Many people believe a Database to be a collection of some names and addresses and other information about people or companies. Well, just as a motor car is not necessarily a Ford, databases are not always lists of names and addresses. Databases are built and maintained in every conceivable industry and profession. Information the Data provides is used in many ways. Data captured during manufacturing plant operations is reused immediately to maintain strict process control. Data gathered daily in a factory sales office is often used the next day to place orders for raw materials. Data gathered by Telecom is used to make up accounts. Data extracted from computer drawings is used to generate office documentation, again by computer. References to Databases throughout this article will apply mainly to what are popularly known as "Relational Database Management Systems" designed for use on a Personal Computer. At times I will use the word "System" when referring to this combination of both the databases themselves and the Computer Programming Languages associated with them. We will also cover briefly the fundamental difference between these Systems and the original, truly Relational Database Management Systems. PC will always mean "Personal Computer", the desktop variety. Most of us have seen or at least heard about them. Computers increase in size from there on up to the huge Mainframe Units, singularly capable of the combined workload of a great many smaller, "personal" units. Irrespective of whether a Database System is used on a Personal Computer or a Mainframe or, whether or not it is a Relational Database System, all computer databases and the relevant Database Software come under the general term DBMS. This stands for "DataBase Management System(s)". You may have seen it before, now you know! As for the word "Database" however, much of what is included here applies
literally to any database whatsoever, on any kind of computer. Where there are differences between the
capabilities, or the requirements of one kind of DBMS and another, no attempt will be made to explain
them. Those differences would interest only the Computer Software Developers and Programmers who like motor
car enthusiasts, often believe their System is the only System worth using and they argue over this,
sometimes with great enthusiasm. Database is a word that has grown from the 20th Century technology we know as "Electronic Computerisation". Since the advent of Electronic Computerisation, mainly from about the 1940s, another term, "Information Technology", has grown rapidly. From its earliest beginnings, at the time when Digital Computers were thought to be at their most useful, employed as nothing more than super fast calculators (number crunchers), Information Technology has evolved to the stage where now it is a science of its own: A science growing in parallel with the extremely rapid developments in Digital Electronics. Information Technology refers to the use of Computers for the purpose of storing, evaluating, manipulating and disseminating information. Deep within the intricate framework of this technology there lies the word Database. In its simplest form "Database" refers to a collection of
data. More specifically in Computer terminology a Database might comprise ONE simple disk file or it
might take the form of several hundred or even thousands of disk files. The files that make up a computer
database will generally be either DATA files or INDEX files but the DATA files are the prime component. INDEX
files simply assist with the manipulation of the data. We will explain all of these terms in detail before we
finish. Before we continue to get involved with Data, let's look briefly at the PC. We'll explore Hard Disks and Disk Files, we'll cover the building blocks of Electronic Data and meaning of Computer Memory for those who have not yet ventured far enough to learn about these things. When we refer to a disk in this instance we refer to a DATA storage device in a computer. They are quite round and flat. Some boffins might not like the following analogy but if you can imagine something close to a cross between an old LP record and a cassette tape, you have a very broad idea of the nature of a storage disk in a computer. The LP is ROUND and the cassette tape has a SPECIAL COATING. A storage disk
is round, it spins rapidly and it too has this special coating. The information, or data, is read to and from
the computer by a HEAD that sits just above the surface of the disk. Whilst the disk is spinning the head
never touches the disk but it is so close you could not see the gap. The actual distance between the disk and
the head is a fraction of the thickness of a human hair. As the disk spins information is read from, or
written onto the special coating. If ever you have seen static electricity jump from one object to another, or from your hand to a door handle, imagine something similar to that but with magnetic impulses being transmitted rather than electrical charges. Each time an impulse is "sent" to the surface of the disk it writes another "BIT" of information. In electrical terms these "bits" could be likened to either a positive charge or a negative charge. When you touch the door handle and the static electricity "jumps" between you and the door, that might be regarded as a very positive charge, you certainly feel it anyway because you've been "Zapped". The special coating on the Disk is Iron Oxide, and even though it doesn't
actually get "Zapped" by a short, sharp electrical charge feel quite free to think of it in those
terms. Only a technical minded few will be interested to know that the state of the magnetic flux is
altered by reversal of the polarity of a magnet in the read/write head. This is not new technology, we've had
tape recorders around us for many years. The first IBM PC's were built with "cassette tape" connections. There are several varieties of disk used in computers. Those that are fitted into the computer and fixed into place, are known as hard disks. Disks on which we can store information outside the computer, are known as floppy disks. They are not as thick nor as heavy and don't spin as fast as a hard disk. A floppy disk is inserted into a "floppy disk drive", the information is transferred, then the floppy disk can be stored away for future reference. From here on we'll confine this discussion to hard disks, the permanent type. A hard disk spins continuously at 3,600 revs per minute; that is 60 revs per second. In other words the part of the disk under the Head passes by, and is available for reading or writing 60 times during every second. It may surprise you to know that despite this speed disk activity is by far the slowest operation in a computer. The Head reads or writes only when instructed by the computer. Most heads will transfer "data" to and from the surface of the disk at the rate of about five million "bits" per second. In very slow motion you might imagine the head going "Zap ...ZapZap...Zap" as the disk passes by underneath. That's slow motion. When a head is reading data "from" a hard disk and sending it to the computer at normal speed, many "PC"s have trouble keeping up with this extremely high transfer rate. Even though disk activity is the "slowest" operation, those five million bits of data (each second) must be sorted, stored and then made available to the Central Processing Unit (CPU) in the order required. The sheer bulk of work done "after" the data leaves the disk, can be enough to keep even the fastest processors working flat out. There is a very simple method employed to slow down the transfer rate so that the computer's electronic memory can catch up with what's happening. Essentially this amounts to the Head reading only some of the available data on each revolution. Naturally, the head is then transferring data to the computer at a slower overall rate. The amount of available data that is read on each revolution varies with the
type of hardware used. Irrespective of the hardware though, the setting of this transfer rate, establishes
what is known as the disk's "Interleave Factor". We'll discuss this again in little while. Eight (8) of these "bits" make up what is known as a Byte. One Byte stores "one character", for instance the letter "a". The way in which the computer recognizes a character is by the pattern or arrangement of those eight bits. There are 256 different combinations. By now you may be wondering "what exactly are these BITS?". Let's take a moment to identify them in more precise terms. Have you ever learnt anything about Binary Numbers? If yes, you will know that the Binary Numbering System uses only "Zero"s and "One"s to represent values. A Binary "One" is stored on disk as a "positive" charge (or magnetic state). A Binary "Zero" is stored as the "negative" charge (or magnetic state). These are known by a variety of definitions, ON/OFF, MARK/SPACE, HIGH/LOW but whatever school or university you attended, they always come back in the end to a simple "0" or "1". In reality you are most unlikely to get "Zapped" by a "Binary Zero", or hit between the fingernails with a "Binary One", but this old and simple numbering system, comprising nothing but "Zero"s and "One"s, lends itself perfectly to the "Zap..Zap...Zapping" of impulses onto a thin metal disk. What more could the computer designers have asked for? Imagine a high level meeting of stressed-out Electrical Engineers in 1940 searching desperately for ways to represent the letter "a" with an electric current. It didn't happen this way obviously, but you can imagine this enlightened conversation when one says to another 'Hey, what about the Binary System we learnt at school! Can't we design something that sends a current when we want a "1" and when we want a "0" we just turn the current of again?' Yes, it's as simple as that. Electronic Computers are built on nothing more than "Zero"s and "One"s: CURRENT or NO CURRENT We'll come back to the arrangement of the "bits" later. You may have heard the term 20 Megabyte Hard Disk. That size
of hard disk is capable of storing over 160 million "bits". How many Bytes is that? How many Characters? How
many Words, Letters, Books? You should calculate it because the facts can be quite startling. Remember eight
"bits" is one letter (or character). File is simply the term used to describe some of the "related items" of data stored somewhere on the disk. What do we mean by "related items"? One Disk File might contain a thank-you letter or, some other form of correspondence. It might contain all of the figures from an Accounting Spreadsheet, or perhaps even some special machine instructions that will run a computer game program. This grouping of related items is a naturally convenient method of storage. We do it in the office, we keep accounts and receipts in one folder, correspondence in another. We do it in the kitchen with eating utensils in one drawer, pots and pans in the cupboard. It's no different with computers. Take this article you are reading now for instance and every other article you read in this magazine. They are all stored in individual files on a computer's disk. Someone writing a book might store each new "chapter" of that book as a separate file for convenience. Later the individual files could be joined and sent to the printer as one. Files take on a huge variety of formats. Likewise there can be some thousands of files on a single computer disk and many different sizes. Sizes will range from just a few "bits" to thousands, even millions of "bits". This storage is highly organized. Every part of the disk surface is
accurately mapped into circular tracks, again like the old LP record. Each file has a specific address,
on one or more of those tracks. Can you see a problem developing though? Are you beginning to visualize that the individual sectors on a track near the centre might be much shorter than those sectors on a track near the outside edge of the disk. Won't the tracks at the wide end of the wedge be much longer than those located near the pointed end, at the centre? OK, now you understand the principle of a "sector" on a disk. If not, please read the last three paragraphs again because it's important to this discussion. The problem of the length of these sectors is overcome in various ways. In this case the sectors are spaced evenly around each track of the disk so that every sector of each track contains the same amount of data. Obviously all of the sectors on each track will have a number. Then, as the disk passes by underneath, the Head will read from, or write to, a few sectors of one "track" at a time. In the same fashion you and I can look at this page and we can "SEE" all the lines, but we can "READ" only a few words on one line, at a time. The sectors on a disk are arranged into little groups and a disk file will use up one or more of these groups. The groups of sectors are known as "CLUSTERS". On my machine each "CLUSTER" com-prises four (4) sectors. Each sector will store 512 Bytes, or if you like, 4096 "bits". When you take four of these sectors and join them to make one cluster, it's easy to see that a "cluster" on this disk will store 2,048 Bytes. How many "bits" is that? The "smallest area" set aside to store one file, is "one cluster"! My smallest file is 4 Bytes, or just 32 "bits". It occupies one whole cluster which has enough room to store 16,384 "bits". Seems like a lot of wasted space doesn't it. Yes! the space is definitely wasted but there's a good reason for that. We'll explain why later. With the surface of the disk mapped so accurately, the start of each file can be found very quickly. The size of a file is stored on the disk as well, so that the computer can instruct the head to stop reading when it has all the information it needs. When a new file (or part of a file) is being written to disk, the computer will also store the new file size on the disk for next time. Sometimes a file can be broken up into pieces and these pieces might be spread over many locations. Such a file is referred to as being Fragmented. This is not desirable but it is unavoidable and causes some problems. A fragmented file will occur when the space available in one location does not match the size of a new file we are writing to disk. Since "clusters" are all one size and files are all different sizes it's easy to see that some files might use only one cluster whilst other files might occupy a great many clusters. If the computer runs out of space when writing a file to disk and finds that the next cluster is "occupied", it must seek another available empty spot in order to continue. That takes up valuable time. Naturally also a computer will take more time to read a fragmented file because it has to find each "next" piece in order to continue sending data to the CPU. If a hard disk has been used in a computer for a long time, where files of many different sizes have been written to disk, changed, deleted and whatever: The space that is officially "FREE SPACE" and avail-able for use, can actually be almost all "single clusters". That is, no two adjacent areas on one track being both available at the one time. If this happens to be the case, what dramas will occur when we come along to write a very big file onto that disk? A file that takes up (say) 20, 30 or even 100 clusters? It will be totally "Fragmented" and hard disk "read/write performance" drops dramatically. Special computer programs, known as "Disk Optimizers" are avail-able to reduce the incidence of fragmentation. This type of program rearranges the files on a disk in order to keep the larger ones "all in one piece". When a file occupying more than one cluster is all together on consecutively numbered sectors it is known to be CONTIGUOUS. (kon-tig-you-es) Disk Performance is a critical factor in the measurement of a PC's overall speed. Whereas in some cases the 3,600 revs per minute is more than enough, think about the problems of the HEAD reading from another track. Say for argument sake we are getting data from the extreme outside track and the file we are reading is fragmented. What if the next bit of data in that file is located on a track near the centre of the disk? The Head is able to read from only one track at a time so it must move as quickly as possible to the new location. By the time it arrives the disk has probably done about another 20 revs, completely messing up our otherwise acceptable data transfer rate: But even when the head does find the right track it must wait for the correct sector to come around. This movement of the head from one track to another is referred to as a "TRACK to TRACK seek". The head is usually on the end of a pivoting arm and the arm just swings left or right as required. A few lines back we mentioned that there was a reason for cluster sizes being what they are. It has to do with achieving optimum disk "PERFORMANCE". Perhaps now you might be able to imagine how pathetically slow a Fragmented Hard Disk would be, if the cluster sizes were any smaller. When reading a large file the CPU would have no option but to wait whilst the head was running all over the place picking up little bits and pieces of data from many different locations. The disk's cluster size is a trade-off: Some wasted space being preferred over the problem of large files getting even more fragmented and hence more severe degradation of disk performance. Remember also we mentioned the disk's Interleave Factor! By now you should be able to follow this part quite easily. We'll pick this up with a disk spinning at 3,600 rpm and the head reading data from the sector directly underneath. It waited for the right sector to come around, it now sends "bits" to the CPU at tremendous speed. In a flash that sector has passed by, the next one contains data from the same file so the HEAD keeps on reading, still pushing "bits" up to the CPU at enormous speeds, and this goes on for several milliseconds UNTIL... the CPU is grossly overloaded, cannot keep up with the pace and fails to accept 512 bytes of the file. Big trouble. How do we overcome that? Simple and easy. By now you could probably answer the question without reading any further. The sectors on a disk are numbered in such a way that No.2 is not right beside No. l and so on. When a disk's Interleave Factor is 1:1, the sectors are all adjacent. From there on the factor can go as high as 10:1 or even 15:1 but such low performance is not common any more. Each time, the consecutively numbered sectors are spaced further apart around the disk. A badly selected "Interleave Factor" can greatly reduce a disk's
performance. Add that to the problem of fragmented files and it's easy to see why the fine tuning of a
computer is as important as fine tuning the engine of your motor car. There are programs available that will
analyse the situation and give you a full and comprehensive report of your hard disk. Selecting the right
program can also have its hazards. There is a very special set of programs that do all this reading and writing and storing of file sizes and addresses etc. They are known as the computer's Operating System. The Operating System communicates with other programs that might be running on the computer and generally between them they manage to get it right. I'm sure it must still be a bit tricky though all this reading and writing of millions of tiny little "bits" of magnetic data at extremely high speeds. We will come back to the "bits" again later but in the meantime, have you ever heard of the term "DOS"? Can you work out what DOS stands for? Yes, you have it! Disk Operating System. In the world of Personal Computing there are other Operating Systems but for now DOS is the most widely used, we'll stick with it during this exercise. Sometimes another program will completely take over from the normal Operating System, run itself, managing the entire machine for the duration, then hand control back again when finished. Now you'll be able to join in many more conversations and no doubt you'll
surprise some friends with meaningful questions. Try this one next time you are talking to an "expert". Ask
them "How many sectors are there in each track on a hard disk?" or, if you don't want to see them shy away so
quickly, try something a little more simple like, "How fast does a hard disk spin?" or, "What is a Byte?".
Incidentally four (4) bits is officially known as a nibble, believe it or not. We must discuss this because many people get confused between a Computer's "MEMORY", and "STORAGE CAPACITY". As we have just learnt, storage capacity refers to how much information we can store on the particular size of hard disk being used. Memory refers to ELECTRONIC MEMORY, activated in silicon chips. Generally these chips are useless if there is no supply of electricity to the computer. However, when the machine is powered-up and it comes to life with a "whirrr" then it's an entirely different story. Once those cleverly designed little pieces of silicon have electricity supplied to them they are capable of doing much work very quickly and storing lots of TEMPORARY information. Temporary because when the electrical power is turned off again, the contents of the MEMORY disappear for ever. Memory is used to hold computer programs, help perform calculations, and to store data temporarily whilst it's being processed. The contents of the memory are either obtained from the computer's disk or "calculated" during program operation. Any information that is read from the disk into the memory, and later changed, must be written back to the disk in its new form, before the electrical power is turned of. Otherwise, the changes are lost forever and once again we revert to the old, inefficient method of relying upon Human Memory. The results of any calculations performed or other work done must also be written to disk or they'll be lost as well when the power is turned off. The type of memory we discussed above is known as RAM. This is "Random Access Memory". Each silicon chip has many storage locations, just like the disk has. When electrical power is supplied and some information has been stored into a RAM chip, it can be accessed later "at random". In other words as with a disk, each bit of information has an address, it can be written to or read, as required. There is another type of Memory that does not require electricity to HOLD information. To complete this part of the lesson we should mention it briefly so you become aware of the terminology. This type of Memory is known as ROM. This is "Read Only Memory".
Obviously it cannot be written to. The main use for ROM in computers we are discussing is the storage of the
machine's "start-up" routines and some parts of the Operating System. Remember, ROM does not need a supply of
electricity in order to hold the information it contains. I promised to cover the arrangement of "bits" on the surface of a disk. So before we go back to databases we'll explain how the data is stored and how a computer can be made to read it. Before we finish this I'll have you imitating a computer. Yes, you read it right! Consider the following two ways of writing a story ....
Method 2.
Method 1 is generally accepted as our preferred standard. Agree? What is the main difference between Methods 1 and 2? The format applied to the first one includes spaces, punctuation and capital letters to break up what would otherwise be an ongoing stream of characters. We include this punctuation so that a reader can know when to pause or to change gear or stop for a complete break. When we become more ambitious and write a whole page most of us leave an extra blank line between each paragraph to emphasize a point or to separate ideas, arguments or topics. All of this "extra" we include could be classified as "overhead". Despite the overhead with Method No. 1, we used about half as many lines on the page. Our use of available space was more efficient. When data is stored on a disk it contains similar overhead, only much more. When we see data in tabular form on the screen or on a computer print-out, or an article like this one printed on the pages of a book, we do not see the computer's overhead: it exists only to give computer programs information they need. Just like when you read a story to a child at bed time. The child hears and enjoys the story, blissfully unaware of the sentence punctuation you encountered and followed absolutely: all to make the story "sound just right" on the occasion. We have already learnt that a disk is mapped into circular tracks just like the LP record. The tracks on a disk are not grooves as you would find on an LP. Instead, the state of the magnetic coating on the disk is altered in a way that enables it to be read by the "Read/Write Head". We'll avoid delving into that any further, just accept it as a situation where irrespective of the physics involved, the head writes "bits" to the disk and it can read what it has written. Remember that every character is represented by eight "binary digits", so there will be lots of "bits" in a row just to make up one word or one paragraph. Remember also in our discussion on disks and data transfer, we learnt that most "Read/Write Heads" are capable of transferring data at the rate of about 5 million "bits" per second. That speed begins to take on greater significance as one looks deeper into the realms of disk storage. At first, "five million of anything", per second, is something we cannot comprehend. Soon you will begin to understand that the amount of overhead involved and the work required to identify and separate the overhead from the essential data is truly staggering. Let's utilize our short story we wrote above to demonstrate in simple fashion the sort of overhead a computer disk will contain. For our example we'll use the following "bit patterns" (in this case letters of the alphabet that you and I can understand), to break up the ongoing stream of "bits" (words). We will then attempt to simulate the activity of a computer and read the story exactly as a computer program would read it. To make this exercise more understandable for Humans I've invented these "bit patterns" especially for the occasion.
We Begin Simulating the Computer's activity. The story will be found in a book. We walk to the bookcase and scan along the shelf marked "stories", reading the name on each book, comparing it with the name we want We are following our instructions step by step. We find the book, reach out and remove it from the shelf, we open it and find the index. We look into the index and find the correct page number. We search through the pages and when we get to the correct page, we can then begin to read the story. In order to go through this exercise properly you should not memorize those simulated bit patterns. You should look back every time to check what each one means and for instructions on how to interpret the words when each special pattern is encountered. This is exactly the sort of thing a computer must be programmed to do although this exercise has LESS steps than the real thing. You must constantly be on the lookout for the start of a new line, the end of a story or end of page. That means you must compare "every character string" with the list above so that you'll know when some other instruction is to be carried out. Note also that if a character string is "not found" in the list above, then it is to be taken as part of the story text (or data).
I've made it easy by highlighting the actual words for you. As well as that I've exaggerated a bit by including the spaces because normally a space is just part of any text. A space between words is simply another character and it's quite different to what is known as a NUL character. We'll explain the Nul character soon. Apart from that, imagine the number of comparisons a processor must make to interpret one small chapter of a book read from a hard disk. No doubt you will have the general idea by now! Little wonder disk
manufacturers continually strive to improve on the 5 million "bits per second". "ASCII" stands for "American Standard Code for Information Interchange". Remember every character is made up of eight (8) binary digits. The binary digits are derived from a character's numeric value. How can we place a numeric value on a letter of the alphabet? The use of binary numbers makes it necessary for a unique numeric value to be assigned to every character that will ever be handled by the computer. It must be something that can eventually be changed into Binary Form. This requirement led to the development of the ASCII codes. Each of 256 different characters can be represented by its ASCII value. Those that we humans use start at 32. The word "Digit", converted to ASCII and then to binary form, is as follows:
That's the last semi-technical topic we'll cover in this article. From here
on when we refer to a "file" you will be able to think of it as some bits of information stored on a
thing that spins around very quickly inside the computer. When we refer to the computer's "electronic
memory" we are talking about the temporary storage areas in silicon chips. Let's get back to the main
topic. Data (or "Information") evolved from the word Datum .... ....where, according to the English Dictionary Datum means "a fact, or proposition, granted or known, from which other facts, etc, are to be deduced". In Engineering or Architecture the builder uses a datum point or benchmark upon which he bases all his building levels. This kind of base level could be referred to as a "physical" datum. In the world of Information Technology, the datum is the "Base Information" that we gather, evaluate and manipulate and from which we determine or deduce many other forms of information. This collection of "base data" is referred to as a database. We can now define the terms "information" and "data" a little more precisely. The base "data", or the "facts" that we store in a database, provide us with information. There is no need to dwell on this but you should be aware of the distinction. I will use either word when referring to data stored on disk, often just to suit the occasion. But, in the strictest sense, only the word "data" truly describes the "bits" and/or "bytes" stored in computer files. On the other hand, information is something we can spread around and give to other people, either verbally, or in written form, in letters and books. We extract information from Data. We store the data. We sample the data. We sort the data. But we would not tell anyone some data would we? I could give you some data, in the form of computer files, containing no information whatsoever. In a database of all the people in your town, the base data might simply comprise one data file with name, address, date of birth, single or married, name of spouse if married, education obtained, occupation, salary level, how long at present address, which school attended, last address if moved, the names of your parents and their present address. This base data is all that is required for the owner (of the
data) to be able to produce a complete and comprehensive study of the entire community: a total
socio-economic analysis and profile containing information about your town that you didn't know
existed. Every person added to the database would cause another record to be added to a data file. Just like the old Library Card File, where a new card would go into the box for each new borrower and/or each new book. In a data file however, the new records are not generally inserted into the correct spot as the library cards were. Each new data file record is simply added at the end of the existing file. The computer can sort these records very quickly when required.
Unless one has a special need to actually insert a record between two existing records, it's
much quicker to let the computer do some work at the time we need the information. The computer needs some definite order about things for it to function efficiently. Let's compare human ability with computer ability. Let's say you and I are given a box of odd library cards. These cards are taken from different libraries all over the country, every one of them was designed and written by a different individual. As we can imagine, some librarians would like to put the borrower's name in the top right hand corner, others might prefer to see it in the middle of the card. The names could be written virtually anywhere on each of these cards, but if you and I were asked to sit in the corner and find the surname on each card, we could do that easily I'm sure. Our general knowledge or awareness of surnames would enable us to scan each card and pick most of them fairly quickly, and even if there was a name we had never seen I'm sure we could recognise it as being a surname. I'm just as certain that if there was a card with no name, we would recognise that as well. Computers are not so smart. They cannot read, they do not have any memory unless the information is provided at the time a job is done and they would need extremely intricate and complex programming, plus some very special reading equipment, if ever they were to perform the same sort of task. Why then, if we Humans are so clever, would each Librarian have a tendency to write the borrower's name in the same spot on each card? Obviously, to enable a quick search. Computers don't have a problem with speed though, so why bother? Couldn't we just rely upon the computer's speed to search through everything, quickly and efficiently? NO! The problem with computers is that they cannot see! They must be given very specific instructions. Not only must they be provided with what to look for, they must also be told exactly where to find it. If you think back to our discussion on disks you will remember we said that Every part of a computer's hard disk is accurately mapped. Let's follow that idea a little further. The computer is getting data from the disk and it must obtain a surname from one of the records in a Library Membership Data File. It locates the file easily because the address of the file is always available. However, when it finds the file, it must know where in that file it will find the particular surname. Even if we know the exact length of every record in the file and we tell it to get the surname from the 17th record, it must be told where, or, in what part of record number 17, will it find the surname. We don't necessarily want the very first bit of data, it might be the date of joining!! Without finite instructions a computer is simply a box of wire, solder, silicon, screws, and rust coated onto a rapidly spinning disk. Each record of a Data File is divided into separate and distinct sections, with the same section of each record set aside to contain exactly the same bit of information. Just like a well organised set of library cards. Each of these sections of a record, is known as a field. A Data File like the one we were discussing earlier would have a separate field within each record for the person's surname, another field for given name, one for date of birth and so on. As well as that, the fields are in exactly the same order in every record. Refer Figure 1 The various fields of each record in a data file are generally of fixed length and this length would be determined at the time the database is designed. The length of a field is determined by the number of characters it will hold. A field that is 10 characters long will not hold the name "McGuillicuddy", whereas a field 25 characters long will have much wasted space when we store the name "Wong". It is important, especially if we might ever need to copy a surname into another file, that the other data file's surname field be of the correct size. Or in other words, have the same amount of character storage space set aside. Records with other types of fields, that is fields of variable length are sometimes used, but we are not discussing those here. Just remember that some Systems do use Variable Length Fields and in cases where they are used, the Computer Program must have the ability to establish the length of a field on the run, to cater for a longer or shorter name and so forth. Files containing records with Fixed length fields will tend to take up more room on a disk, but this is largely overcome by efficient database design. From this point on we will refer only to fields of fixed length. Before we discuss the content of Fig. l let's take a closer look at the format of a data file. A data file is made up of two distinct sections. The first section is known as the "Header". This is where information is stored that tells a computer program the "file size and date", the "length of each record", the various "field names", the "size" of each field, the "type of information" stored in that field, the "number of records" in the file plus lots of other necessary detail. The second section contains the records, starting at Record No.1 We have looked at Fig. 1 a couple of times, take another look at it. Unfortunately data cannot be written to disk in the same manner. We have already covered the way innd how the overhead assists with identification of the data. How the shorter, "individual streams" of "bits" are separated by the inclusion of special "bit patterns", or overhead, that a computer program will recognize in the same way you and I recognize sentence punctuation Remember the bedtime story? In exactly the same manner, a data file, the file header, and the start of the database records are also "marked" for a computer program to be able to identify where one finishes and the next begins. In a data file the contents of Record No.2 will follow immediately after the end of Record No.1. The records follow, one immediately after the other, with no separation between them. The computer program must count characters until it knows it has reached the record it needs. Then it must begin to count again until it reaches the correct field. All this time, checking against the values contained in the Data File Header. Now we will go back to our sample file in Fig. 1 and examine its content. By far the best way to visualize these things is in chart form. Our data file has four records in it and there are five fields within each record.
Notice that the structure of each new record added to the file is identical. In this case the fields are of fixed length, hence each new record will have exactly the same overall length. The name of a field is decided at the time the data file is created. We use the term Field Name. Soon we will cover sorting and indexing, but when we get to that point, remember that we will be discussing reorganising the records, not the fields. For instance if we sorted the above file into alphabetical order of surname, the entire record Number 2 containing all of the data on Mary Abbott, would go to the top. We would not split a record and move just some of its fields. Let's compare this with a ring binder. We go to a lecture, we make notes on different pages throughout the lecture then when we go home we sort out the notes and re-arrange some of the pages in the binder. That's like re-arranging the records in a data file. We would never cut each page into thin strips and begin swapping separate
lines of notes between pages. Likewise, we cannot break a data file record into smaller pieces to move
only some of the fields. If you had the above data file, containing four (4) records, written on a page of an exercise book in front of you, and you had no other copies, and you had to sort the records into alphabetical order of surname before you handed the book to a teacher for correction. How would you do it? Let's do the first one together, the others will become obvious. Remember, this is our only copy of the data and we will be changing the order in which it is written. We know from our knowledge of the alphabet that A (for Abbott) will come before any of the others. So clearly Mary's record will go to the top. We can also see at a glance that if we swap Record No.1 with Record No.2 then our sorting exercise is half finished because the Jones and Jackson records will stay further down the file. So we must swap the data contained in the first two records. My method would be to write the data from Record No.2 onto a piece of paper, then using an eraser, rub out that second Record. I would then copy the first Record's data into the now empty second spot and erase the first Record. Finally, I would read Mary's data from the piece of paper on the desk, and write it into the top spot where John's data was previously held. Effectively I have now swapped all of the data held in those first two records. If my memory was superb and I could memorize all of Mary's data, I wouldn't have needed to write it down on a separate piece of paper. It would have been much quicker to do it that way but, alas, I am human and the capacity of the old memory box is limited. HOW would you imagine a computer would handle that job? Well, for a start, the computer would need a program. The program would follow almost exactly the same overall steps but with two small exceptions. Instead of temporarily storing Mary's data onto the disk, it would probably store that in the computer's electronic memory. Secondly, the data stored on disk is not actually erased nor deleted, any existing data would simply be over-written. Every one of those overall steps would comprise a great number of smaller steps and that is where we humans have the edge. You and I can see the page, we can process the initial requirement very quickly, we can decide to act almost instantly, we can remember where we left the eraser last time we used it, we know how to use the eraser, we can see the pad, pencil and desk top, we understand the paper will sit there until we need it, we know "A" comes before "B", we know how to read and write, plus a whole lot more. The computer must be programmed to do every single step. To this extent, a
computer is no more clever than its programmer. Can you think of areas where the computer might have a
big advantage? The word Index has many evolved uses, but generally any broad description of this word would include a reference to "pointing out" or "pointing to" something. An index in a book is used to tell you where you can find something inside the book. Our index finger is the one we use when pointing to something. When workers' wages are indexed to the Inflation Rate, the inflation rate indicates or shows by what percentage or value the wages will change. In the world of PC Databases an index is used for two purposes. Its primary use is "randomly accessing records" in a Data File. This ability to find specific records, extremely fast, enables us to use indexes to display or output those records in a different order. An index will show the data file records in the order they would appear if they had been physically sorted, instead of as they do exist and generally that is, entered in no particular order whatsoever. Imagine a data file of names and addresses. This information must be entered by a human operator, at least on the first occasion anyway. Two problems exist with this kind of file. (a) We simply cannot add the records in alphabetical order.
That would be an utter waste of time and virtually impossible anyway. We must be aware that all of the original records in a file can be physically sorted if required, and sometimes this may be highly desirable. Whenever that is done the records are totally re-written into the new order. Most files are never actually physically sorted though, they are simply indexed instead. When a data file is indexed, a separate index file is created. An index file contains records that are tied back to a specific record in the data file. These records cause the base data in the data file to be displayed or compared or printed in the order governed by the index file, not by the physical order of the data file records. We can say that an Index File Record points to the correct record in the Data File. Indexes can be created based upon one or more fields of every record. A simple example of this is where we might wish to see our "person" data file displayed by order of age. In such an instance we would index the data file on the "DATE OF BIRTH" field. The computer program would compare the date field of each record and sort them into the index file. Later when we viewed the data file with the help of the index, the computer program would skip through the index file records, these would point to the appropriate record in the data file and the data file records would be displayed in date order, as we wanted to see them. Similarly if we wished to display or output our file contents in the order
of AGE under each SURNAME, we would index the data file on the SURNAME field, plus the
DATE OF BIRTH field, all at once. The indexing program would proceed to sort the surnames first
and then taking each surname in turn, would sort those records according to the birthdates of that
name, before moving on to process the next name. We would end up with all the "A"s in birthdate order, then
all the "B"s and so on. Indexes are often used to relate data held in databases. The words "related data" in this case take on the meaning, "associated with". When we relate data in two or more files, the data contained in one file, will point to some associated or matching data in another file. There are two "breeds" of Relational DBMS. One is a truly Relational DBMS and the other simply provides us with the facility to relate data. In a truly Relational DBMS, data is related by pointers. Each data file record contains pointers that point to other, associated data. This was the first, or the original variety of Relational Database. They are still very much in use today but generally only in the large Mainframe Computer environment. The second "breed" is the more recent and popular DBMS designed for the Personal Computer. That's the one we are discussing here. This type of DBMS has a special programming language that enables us to establish a relationship between selected data. We do this with the help of index files. In order for us to set up a relationship between two or more data files, we must design our files so that each related file has an identical indexing field, of the same name, and containing exactly the same data. We index each data file on this special indexing field. The content of this identical indexing field is known as the "Index Key". Often it will be entered automatically and simultaneously into the appropriate field of the corresponding record of each file, by the computer program, at the time of data input. Whilst this is the most efficient way for most modern Relational Database Systems to create indexes, it is not absolutely necessary to create a special index key. The fields that we already have can be quite suitable. For instance the surname field can be used as an index key. The very slight inconvenience that arises from this is that an index file will be much larger when the index key is (say) 30 characters long, big enough to hold the longest surname. An index key which is only 5 characters long will result in a smaller index file and this can be an important consideration, especially on large databases. Almost any field in a file can be used as an indexing field but the key must be exactly the same in each of the related files. Thus we have a natural requirement to keep this index key as small and simple as possible. Another very important requirement, is that the index key of each record in any one file, should be unique in that file. Otherwise the index file will contain duplicate records and duplicates severely reduce, in fact they can destroy, the effectiveness of an index. Let's examine indexing and relationships with specific examples. We have modified our first sample file, adding a special new field for indexing. The field named IDKY is 5 characters long and the content of this field, the Index key, is unique in each record. The index keys were created by the computer program during data entry. The file is indexed on IDKY and the records appear on the printout in that order. (Fig.2) As the operator keyed in data the program stored parts of that data in two separate files although the operator wasn't aware of this. Each entry into the second file was given the same index key, so that relating these files and extracting the correct information at a later stage was indeed simple and easy. The second file (Fig.3) is also indexed on IDKY so that we can establish that vital relationship.
Figure 2. A picture is better than a thousand words, isn't it! How would you find the general state of health for John Bennett? The index key for John's record is BE770. To obtain the correct answer we simply search through the index keys of the second file (Fig.3) until we find the matching key. That's exactly what a computer program will do when undergoing the same exercise. The program will use both indexes though, because it must be able to determine the physical location of each matching record.
The first file in this instance is the main file and the second
file is known as the target file. Using correct PC Relational Database Terminology we would say that a
relationship was established from the first file into the second file. In order to establish a Unique Index Key, in this instance, the computer program was designed to take the first two letters of each surname and then add the next available sequential number for that combination of letters. Can you see a problem with this? Yes, when we reach Jones number 999, we will lose the ability to create a unique key for the next person named JONES. The same limit applies to all surnames here. You and I can see this quite easily. As well as that, any database designer who tried to implement such a system would deserve more than a smack on the hand for fraudulent representation of their ability. However, the situation is occurring often these days where old designs are no longer suitable. Whilst the above index key design might be suitable for storing data on the people who live in your street, clearly it could not be used to do the same job for the entire City or State. Much thought should be put into the design of every database, and one of the most critical considerations is future use of the stored data. Several pages back we mentioned a database of all the people in our town, we have some base data stored on each person. We will take another brief look at this database and explore just how we can use related data; where it becomes extremely valuable, making computers appear to be clever. If we extend our database to include the type of motor vehicle each person drives then we would most probably create another data file that contained just their vehicle data. There are good reasons for this and we will discuss them in a later article. This vehicle data file would contain a special
indexing field that would tie the vehicle data records back to the correct record in the person
file. Should we wish to view the names of all people who drive Maseratis we would have the computer program set a relationship from the vehicle file into the person file. Then, as we select only the Maseratis from the vehicle file and browse through them, comparing each colour and model number etc, we will also be able to see the name of the owner because the computer program will use the vehicle file's index to randomly access the appropriate matching record from the related person file. This relationship between files provides benefits in both directions. If the occasion ever arose where we wanted to list the huge variety of vehicles driven by people named Taig, we would have the program set a relationship from the person file into the vehicle file. We would list or view all of the Taigs in the person file and as we did this the program would find the matching record in the vehicle file and display the required vehicle information. What happens when there is a Taig who does not own or drive a motor vehicle? PC Relational Database Systems make it very easy for us. When the computer program is operating on two related data files, and we move onwards to view or output another record in the main file, the program quickly locates the matching record in the related file. If and when there is no matching record, and the program cannot find in the related file, any record with the same index key, it sets up a special condition that can be read by the program. Traditionally this condition is known as EOF which stands for end of file. The computer program will test for the EOF condition immediately after each move to another record. If there was no matching record in the target file it will do something else, like display an empty space, or in this case, print "No vehicle!" So, Relational Database Systems, a very popular and
useful part of the world of Information Technology, can be used not only to show you information
but also to tell you when it isn't there!! Now that we have seen how relatively simple it is (couldn't avoid that word), to relate information in separate data files, you might imagine the ease with which all information can be compared and evaluated. The software I use will allow me to create and maintain multiple relationships between data files. With that facility, the possibilities are simply endless. If we take our person file containing information on everybody in one town, obtain some more files and set up some multiple relationships, we can determine a great many things, very quickly and accurately. Imagine this.... Let's say that our townfolk have recently completed a survey by a Fast Food Chain, their eating habits are all on file. From previous surveys their voting habits might be held in another file, their hobbies and interests in another, their credit card purchases in another, their preference for entertainment in another, their Bank, bank balance and net wealth in another, their favourite TV channel in another, the route they travel to work etc etc etc. Somewhere there could be all sorts of information on our townfolk stored away in a variety of databases. If we are able to gain possession of each of these data files we could set up our main file with some multiple relationships, that is a relationship into every other file. Who could possibly use the information we would extract from that data??? Fast Food Shop Owner Joe Blo could establish exactly where to locate his new "burger joint" based upon the concentration of takeaway food devotees in each area. The local politician could decide exactly where he should devote his energies and which community developments to support as well as those he could ignore, based upon which area contains the most votes for him and which areas do not. The Bank could determine exactly where to locate its new branch based upon the concentration of wealth and the numbers of children in each family. Wait! Wouldn't we need a Common Index Key in order to do
that? NO? YES! In this case, Yes! However there are many other ways of matching data in files.
Relating them with a common index key just happens to be one effective way of doing it. List the First Name, Surname and Address for every male in town who drives a blue car, older than 1985 model, who did purchase tickets to some form of entertainment in the east side of town, using credit card, on Friday last, who is not taller than 170 cm, whose surname starts with "P" and first name that sounds like John or Con or Ron. If you have the information on file, a PC Relational Database System will extract a great many things. Now there might be one person in town who fits that description but if you are the Police Force looking for a murder suspect or on the other hand, an Insurance Sales Representative looking for suitably qualified sales prospects, then consider the immense impact of computer databases. Information sufficiently detailed to enable this search doesn't exist in one place, not quite yet. We are getting close however. Almost every time you open a Credit Account, nearly every time you buy a major electrical appliance, and certainly every time you sit for a driving test or sign onto the Electoral Roll, your details are being entered into one more government, semi-government or privately owned database. We have a National Government that wanted to bring about the inevitable a little prematurely recently with the introduction of the Australia Card. Just think about it. Things certainly have changed very rapidly in this world in recent years but the condition very nearly existed today where if you were not in the database, then you simply didn't exist!! Can you imagine having any privacy left at all if ever the day came when the databases held by the Banks and Building Societies, American Express and MasterCard, Myer, Billy Guyatt, Waltons and others like them, were all related to the Databases held by the Australia Card, the Taxation, Customs, Social Security, Commonwealth Statisticians, Health and Education Departments together with the Births, Deaths and Marriages Offices, Prisons, Baby Health Centres, Schools, Hospitals and Retirement Villages. The Australian Government, determined in its efforts to catch a few cheats very nearly removed every last bit of privacy any of us ever had, with the Australia card. No matter how good their intentions might have been, once established the Australia Card System would have paved the way for future Governments to pry into the lives of the average Australian in a way that most of us would regard as totally unacceptable. The Australia Card debate saw the introduction of par-allel Legislation designed to protect our privacy and when the Gov-ernment finally succeeded in 1988 in passing the Tax File Number Legislation, they were forced to pass parallel Legislation to offer some protection against this sort of intrusion. The Privacy Act was passed and this is supposed io regulate the flow of assembled data between Government Agencies. It has limited affect on data held by Private Agencies. Perhaps Social Engineering is not that far away after all. How long will it take? The technology is there! The capacity is very close behind it. In less than 50 years there won't be a single person in this country, perhaps even the world, who is not just another number on some massive Government Database. The index key will be name, place and date of birth supported by some other unique item such as fingerprint, maybe even our DNA Signature, gathered by a hospital at birth, totally beyond our control. This database could be updated every time we purchase something, every time we move house or apply for a job, every time we travel anywhere. Almost every single thing we do can be recorded in a database, and available. Let's hope the Privacy Act 1988 is further developed in parallel with the advances in modern technology. When the first mechanical computer was designed and developed to assist with solving mathematical problems many, many years ago, how could they have known that electronics would follow? When the first electromechanical relays were employed to do this same job a great many times faster and more efficiently, just 45 odd years ago, how could they have known what was to follow? One thing is almost certain. When Bell Laboratories developed the first transistor in 1948 and this was quickly followed by the Vacuum Tube Computers of the 1950s and the Solid State Circuits of the 1960s, the people involved almost certainly knew and understood a little of the impact their designs would have upon the future of this world. Not one of us should ever doubt the absolutely total impact that Computer Technology, Information Technology and the use or misuse of these things, will have upon all of mankind, for ever more. Written
permission will be granted for this article to be re-printed in any appropriate publication provided that it
is not altered in any way and provided that credit remains with the Author. If reprinted, the ENTIRE text
MUST be included and this note must remain as part of the text. Anyone wishing to reprint this article should
contact the author on (03) 700 7700 for written authority and text on disk. Copyright © Gary Taig, 1990
(all rights reserved). |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||