The Task
We wish to read a text file containing a collection of numbers and then output some simple statistics. The text file is of the form:
Each number is on a separate line. We will use standard input and standard output for input and output. What this cryptic statement means is that we use command line redirection operators to do all the file reading and writing. For example, if our program was named STATS.EXE and our number file was named NUMBERS.TXT, we would type at the DOS command line: STATS < NUMBERS.TXT
This directs the contents of the numbers file into our program, and the output is then displayed on the screen. The statistics this program will output are the number of lines read, the sum of the numbers and their average.
Now I know you can easily do this in Microsoft Excel, but this program will work just as well with 100 lines of data or 100 thousand lines.
The C Language
The C programming language evolved from two previous languages, BCPL and B. BCPL was developed in 1967 as a language for writing operating systems software and compilers. B was used to create early versions of the UNIX operating system in 1970. C was developed by Dennis Ritchie at Bell Laboratories in 1972 and was initially known as the development language for the UNIX operating system. Today virtually all operating systems are written in either C or its object oriented successor C++. Because of C's widespread acceptance, C programs are comparatively easy to port between operating systems and different hardware. As an operating system programming language it has access to the hardware at device level and is extremely flexible in its methods of viewing and manipulating data. This flexibility does have its drawbacks, in that obscure bugs can appear in the code. C++ has gone part way to alleviating this problem, but for extremely critical programming tasks, for example in nuclear power stations and military applications, C is still regarded as not having the required stability. The first rendition of our sample program is written in the C language.
You will recall that we discussed above how the program will be read and processed by a compiler. There are many different compilers but, apart from usually minor differences, they all perform essentially the same task. A compiler processes a source code file one line at a time. This is what each line of our sample C program does: The first lines between the /* and the */ are comments. In the C language, anything between these character pairs is ignored by the compiler. The comments can be on one line or a continuous block of multiple lines, and are used for general record keeping and descriptions of the operation of the programming code.
This instructs the compiler to include the text from the file STDIO.H into the program. In effect the compiler will see the included text as part of the program. The file STDIO.H contains the declarations and definitions needed for input and output. We include this header file when doing any file reading or writing; this includes writing to the screen.
Every C program includes main(). The parentheses after main indicate that it is a function, and if the program has any command line parameters they are passed to main inside the parentheses. In this case there are no command line parameters so the parentheses are empty. Every C program begins executing at main().
This tells the compiler that we want to reserve a location in memory for an integer or whole number that will be of size "int". In today's PCs an int (integer) is 4 bytes long (32 bits) and can store any whole number ranging in value from -2,147,483,648 to 2,147,483,647. This variable that I've named number_of_lines will hold the number of lines that our program reads from the input file.
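As a minimal sketch (not part of the sample program), the size and range of an int on a particular machine can be checked using the constants INT_MIN and INT_MAX from the standard header limits.h. The function name print_int_range is made up for this illustration.

```c
#include <limits.h>
#include <stdio.h>

/* Print how many bytes an int occupies and the smallest and
   largest values it can hold on the machine running this code.
   (print_int_range is a hypothetical name used for illustration.) */
void print_int_range(void)
{
    printf("int uses %zu bytes\n", sizeof(int));
    printf("smallest int: %d\n", INT_MIN);
    printf("largest int:  %d\n", INT_MAX);
}
```

Calling print_int_range() from main() prints the three values for whichever compiler and machine are in use.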
These three statements reserve locations in memory for three double-precision numbers. A double-precision number is a floating point number that is accurate to about 15 significant decimal digits and can hold magnitudes up to about 1.8 x 10^308 (roughly 18 followed by 307 zeros). A double reserves 8 bytes of memory and is used for any floating point quantity like an average or an interest rate. As the numbers we are reading in have decimal points, we will store each one as it is read into "this_number" and add it to our running total "sum". Finally the result of the average calculation will be stored in the variable named "average".
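In the same spirit, the limits of a double can be inspected through the standard header float.h. This is only an illustrative sketch; the function name print_double_limits is an assumption and does not appear in the sample program.

```c
#include <float.h>
#include <stdio.h>

/* Print the storage size, the number of decimal digits that are
   guaranteed to survive a round trip, and the largest finite value
   of a double on this machine.
   (print_double_limits is a hypothetical name used for illustration.) */
void print_double_limits(void)
{
    printf("double uses %zu bytes\n", sizeof(double));
    printf("reliable decimal digits: %d\n", DBL_DIG);
    printf("largest double: %e\n", DBL_MAX);
}
```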
Now that the variables have been declared, the program can actually start to do something. The first thing we do is set the value of the line counter variable "number_of_lines" to zero and set the value of sum to zero. If this was not done, the memory locations that these variables refer to could contain any random collection of bits. When a program starts to run, the operating system simply gives it a chunk of system memory to use. It doesn't prepare that memory in any way, so the memory used may contain all sorts of bytes left over from whatever has used it since the computer was last turned on. Not initialising variables is one of the major causes of errors.
The "while()" statement executes a loop of statements while the expression inside the parentheses is true. The expression in this case is a bit complex, but it could be something like while(counter is greater than one), written as while(counter > 1). If the expression is true, then we execute the next statements between the curly braces {}, otherwise we ignore them. In this case, our expression is feof(stdin). This function is part of the library that comes with the C compiler to assist with file manipulation. It reports whether an earlier read from the input (here stdin, fed by the file redirection) has run into the end of the file; once that has happened, meaning there is no more input, it returns true. Note that feof() does not look ahead. It only becomes true after a read has actually been attempted at the end of the file. Now we want our while expression to only operate when there remains input to read, so we must flip the result from feof(stdin) to its opposite. We place the not symbol, an exclamation mark !, in front of it. So the while loop will execute only if the end of file marker has not yet been encountered.
This expression reads the next number from the file, converts it to a floating point number and stores it in the variable "this_number". The reason for all this complexity for what seems like a simple operation is that the lines of the file do not contain numbers but collections of number characters and decimal point characters. The scanf function formats the string of characters to the data type determined by the "%lf" expression. In this case we are telling it that the characters on each line should be formatted as a long floating point number (ie. double). If the characters on the line read "FRED" then scanf would not be able to format these characters as a number. It reports this through its return value, which is the number of items it managed to convert (zero in this case). This is another chance for a logic error. You may reasonably assume that your datafile is a long list of numbers, but if there are a few alphabetical characters in it and the return value is not checked, the program will give unexpected results as scanf fails to format the characters correctly. The & symbol in front of the this_number variable instructs scanf to store the formatted number at the variable's address in memory.
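The way the conversion succeeds or fails can be sketched with sscanf, the standard library variant of scanf that reads from a string instead of from standard input. The helper name parse_number is hypothetical and is not taken from the sample program.

```c
#include <stdio.h>

/* Try to convert the characters in text to a double using "%lf".
   sscanf returns the number of items it converted, so a result of 1
   means success. Returns 1 on success and 0 on failure.
   (parse_number is a hypothetical helper used for illustration.) */
int parse_number(const char *text, double *result)
{
    return sscanf(text, "%lf", result) == 1;
}
```

With this helper, parse_number("45.67", &x) succeeds and stores the value in x, while parse_number("FRED", &x) returns 0 and leaves x unchanged.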
After the scanf statement the variable this_number should be equal to the number at the current line of the file. If we have reached the end of the file at the previous scanf, feof(stdin) will be equal to true and the break statement will be executed. This will break us out of the while loop and transfer us to the first statement past the end brace. If we haven't reached the end of the file, continue on in the while loop.
We haven't reached the end of the file so the ++ following the variable name tells the program to increment the "number_of_lines" variable by one.
Print the number we have read from the file, to the screen. The "%f\n" in part tells printf to format the number as a floating point, and after writing it to the screen, to write a newline (the \n character).
This is a shortcut way of saying sum = sum + this_number. It increments the sum variable by the amount this_number. The while loop continues until all the data has been read.
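A different way of controlling the reading loop, offered here only as a sketch and not as the method used in the sample program, is to test the value returned by the read itself instead of calling feof(). The loop then stops both at the end of the file and at the first item that is not a number. The function name sum_numbers and its parameters are assumptions made for this example; passing stdin as the first argument reads the redirected file.

```c
#include <stdio.h>

/* Read numbers from the stream until the end of the input or until
   something that is not a number is met. The running total is stored
   in *sum and the count of numbers read is returned.
   (sum_numbers is a hypothetical name used for illustration.) */
long sum_numbers(FILE *in, double *sum)
{
    long count = 0;
    double value;

    *sum = 0.0;                      /* initialise the running total */
    while (fscanf(in, "%lf", &value) == 1) {
        count++;                     /* one more number read         */
        *sum += value;               /* add it to the running total  */
    }
    return count;
}
```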
The first statement out of the while loop calculates the average of all the data by dividing the sum of the data by the number of lines. A logic error could occur here if the file was empty. The number of lines will be zero, and dividing by zero does not give a valid average. An integer division by zero typically halts the program with a fatal error, while a floating point division such as this one usually produces a meaningless "not a number" result. A well written program (good logic) would detect a file with no data and gracefully stop the program to display an error message. For simplicity I've left out that bit.
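The check that was left out could look something like the following sketch. The function name compute_average and its return convention are assumptions made for this illustration.

```c
/* Compute sum / count, guarding against an empty file.
   Returns 1 and stores the result in *average when count is not zero.
   Returns 0 and leaves *average untouched when there is no data.
   (compute_average is a hypothetical helper used for illustration.) */
int compute_average(double sum, int count, double *average)
{
    if (count == 0) {
        return 0;            /* no data, so there is no average */
    }
    *average = sum / count;
    return 1;
}
```

The caller can then print an error message and stop when the function returns 0, instead of printing a meaningless average.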
These three statements output the values of the collated statistics. The sum and average are both floating point numbers so they are formatted with the "f" specifier. The number_of_lines variable is an integer so we use the "d" specifier to format it. Each printf statement also prints a newline after the number.
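To illustrate the "d" and "f" specifiers, here is a small sketch that uses snprintf, which formats text exactly as printf does but writes it into a character buffer instead of to the screen. The function name format_summary and the labels "Lines", "Sum" and "Average" are invented for this example and are not taken from the sample program.

```c
#include <stdio.h>

/* Format the three statistics into buffer, one per line.
   %d formats the integer line count; %f formats each double with
   six digits after the decimal point. Returns the number of
   characters that the complete text requires.
   (format_summary and its labels are hypothetical.) */
int format_summary(char *buffer, size_t size,
                   int lines, double sum, double average)
{
    return snprintf(buffer, size,
                    "Lines: %d\nSum: %f\nAverage: %f\n",
                    lines, sum, average);
}
```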
We have reached the end of the program and, although this is not essential, the exit(0) statement passes the value of zero back to the operating system upon completion. If we were to halt the program because there was an error, for example if the file was empty, we could call the exit function with a value other than zero to indicate to the operating system that the program halted because of an error. The operating system can then indicate to the user that a program has failed and indicate what caused the failure. The program ends with a closing curly brace and another comment line.
Java
The other example is the same program written in Java. Java was first released in 1995 by Sun Microsystems. It is an object oriented programming language with strong typing and advanced features such as garbage collection. What does all this mean? First of all an object oriented language combines data with functions to make objects or classes. A function is a small subprogram that does a specific task such as draw a window or draw a character on the screen. The data might be the size of the window or the type of font and the character to draw. By combining these two entities together the program does not run any faster, but the design becomes more logical, and in large software projects reusable components are easily utilised. Strong data typing ensures that data types are not incorrectly reassigned. For example a byte might represent a letter of the alphabet or it could be a number from 0 to 255. A strongly typed language will not allow, for instance, the value of an integer variable to be assigned to a character variable without an explicit conversion. If that was attempted the compiler would flag an error. This catches a lot of subtle bugs at the compiling stage. Garbage collection is a feature normally seen in very high level programming languages such as Smalltalk. Normally the programmer is responsible for freeing up system memory that is used in the execution of a program.
The program might access a megabyte of system memory to (say) load an image and manipulate it. When that megabyte of memory is no longer required, it is the programmer's responsibility to free the memory so that it is again available to the operating system. Not performing this memory release causes what is known as a memory leak. The longer the program runs, the lower the system's memory resources, until eventually the system crashes. Garbage collection keeps track of what memory is being used by the program and, when there are no longer any references to a piece of memory, makes it available for reuse. Memory management is controlled by the Java runtime environment, not by the programmer. Apart from these advanced programming features, Java's real forte is as a cross platform programming language. To run a Java program you run it in the Java virtual machine (JVM). There is a JVM written for each computer system whether it be Macintosh, Linux, Windows or whatever. When a Java program is compiled the object code produced is called byte code. This byte code is translated by the JVM into native instructions for that particular architecture. What this means is that a program compiled on a Windows PC can be placed on disk and transferred to a Macintosh and executed as though it were a Macintosh application. The Java version of our program is shown in Figure 3.
Comments in a Java program can be // on a line-by-line basis, or C's /* .. */ characters to cover multiple lines. Java also includes a third comment indicator, /** */, which is interpreted by the JAVADOC program to produce linked HTML pages for program documentation.
This means that all the classes in the java.io package are available for use in this program. As in the C program the io (input/output) package has existing routines for reading from and writing to files.
In Java all code belongs to a class, even the program itself. The name of the public class must be the same as the program source code file name (without the .java extension).
Public means that this method is visible everywhere throughout the program. Static means that the method belongs to the class itself rather than to any one object, so main can be called without first creating an instance of the class. String args[] is an array or list of strings which are the command line parameters. This part of the Java syntax is very similar to C.
These four variables are the same as C with int being a four byte whole number and double an eight byte floating point number.
This states that the variable named thisline is of type String. The String class is used to store a series of adjacent characters such as "Hello" or "45.67". The class has methods that allow strings to be compared and manipulated. For example to compare a string s1 with a string s2 we would write: s1.equals(s2)
The equals method of the String object s1 compares its own contents with those of the String s2.
The FileReader class translates bytes from a file into a stream of characters. The BufferedReader class breaks the stream of characters into blocks of characters, for example the lines of a file.
As with C, initialise the counting and summing variables to zero.
This is one of the features of Java, exception handling. The next section of code within the curly braces reads input data. The "try" keyword signifies that if there is an error reading the input data then an exception will occur and the program will pass control to the "catch" section (further down the code). The statements in the catch section will then be executed. This normally includes a message referring to the error type that has occurred.
Create an instance of the FileReader class with standard input (the redirected number file) as the file from which to read a stream of characters.
Create a new instance of the BufferedReader class using our new FileReader class instance as the source of the characters to buffer.
The BufferedReader class has a method readLine() which reads a line of characters. We assign the line of characters to the String variable named "thisline".
If we have reached the end of the file "thisline" will equal null. Null is a special value meaning that the variable does not refer to any object. So in effect, if the BufferedReader cannot supply a string of characters (the end of the file is reached and there are none left) then readLine() returns null and that is what is assigned to "thisline". The while statement says keep executing the group of statements between the braces while thisline is a proper String.
The C program used scanf() to read from the input file and convert the characters to a floating point number in one single statement. This Java program takes two steps. The parseDouble method takes a String of characters and formats them as a double-precision number. The result is assigned to "this_number".
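For comparison, the same two-step approach (read a line as text, then convert the text) can also be sketched in C, using the standard functions fgets and strtod in place of scanf. This is an illustration only. The function name read_number_line and the 256-character line limit are assumptions, and neither appears in the sample programs.

```c
#include <stdio.h>
#include <stdlib.h>

/* Step 1: read one line of characters from the stream.
   Step 2: convert those characters to a double.
   Returns 1 if a number was read, or 0 at the end of the input or
   when the line does not start with a number.
   (read_number_line is a hypothetical helper used for illustration.) */
int read_number_line(FILE *in, double *value)
{
    char line[256];   /* assumed maximum line length */
    char *end;

    if (fgets(line, sizeof line, in) == NULL) {
        return 0;                     /* no more lines to read */
    }
    *value = strtod(line, &end);      /* end marks where conversion stopped */
    return end != line;               /* nothing converted means not a number */
}
```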
Same as in C, increment the counter variable named number_of_lines.
This line matches the printf() in C. Here, System refers to a standard Java class (in the java.lang package) which gives access to operating system facilities. The "out" portion pertains to the fact that output will be to standard out (ie. the screen). We don't need a format specifier as println() automatically converts "this_number" to its text form before writing it to the screen.
Same as C again. Add this_number to the sum.
Get the next line from the input stream. If it is not null, then we continue through the loop again.
This is the catch section and it is activated if an error occurs during the data reading operation. The "try" section "throws" an exception and this part "catches" it. If there is an error, a message is written to the screen together with an error number.
The average of all the numbers in the file is the sum divided by the number of lines - same as the C program.
Use the println() method to output the values of the three variables. Note the use of the "+" sign to add character strings together, to form one large string.
Same purpose as in the C program. The program has completed successfully, return 0 (zero) to the operating system. The program ends with two curly braces, one to signify the end of the main() method and the other to end the "stat" class. Finally there is a comment line.
Resources
Compilers for C and Java are freely available on the Net and are also available on magazine cover CDs. Borland has a free downloadable version of its C++ Builder compiler at: http://www.borland.com/bcppbuilder/freecompiler/. Also early versions of the standard C++Builder Integrated Development Environment (IDE) come up regularly on magazine cover disks. This is an entire package which includes an editor, compiler, debugger and a form builder similar to that of Visual Basic. A similar product called JBuilder is available for Java. Borland is to be commended for this initiative as it allows users to work with industrial quality software building environments without the enormous cost outlay. The only drawback is that the software is the base edition and usually a few versions behind the current. Unless you are using very advanced features this is not a concern. If you wish to use something a bit simpler there is the DJGPP compiler, which runs from the DOS command line: http://www.delorie.com/djgpp/. This compiler has also appeared on magazine cover disks from time to time. For a Java compiler it is best to go direct to the source: http://java.sun.com/j2se/1.3/. Here they have the SDK (Software Development Kit) for Windows, Linux and Solaris. Often Java books will include a CD with the SDK included. They can be a little out of date but it saves a large download from the Net. There is also a Sun product called Forte which is an integrated development environment for Java. The community version of Forte is free. This is an excellent product but it is extremely memory intensive. The recommended system memory is 256 megabytes.
Although the compilers may be free, it can become expensive if you start to purchase reference books. I suggest with C to try the academic secondhand bookshops. C programming has been a standard at most universities for many years and there are numerous textbooks available. Another advantage of Linux is that it goes well with university texts as they are usually aimed at a UNIX environment. Java versions are increasing fairly rapidly so there are usually some older reference books about at reduced prices. When looking for Java references I suggest looking at Java version 1.1 or later. Version 1.1 included some major changes to graphics handling which have made much of version 1.0 redundant. Currently Java is at version 1.3.
If you are interested in programming, my own opinion is that Linux offers the best environment. Linux comes standard with a C and C++ compiler and there are numerous editors and programming utilities. With some of the larger distributions the full Java SDK is supplied and some include Forte for Linux.
In conclusion, Java has a number of advantages over C: it was developed as an object oriented language, it includes advanced memory management techniques and, because of its strong typing, it tends to be more stable. Java is also more Internet friendly, with a specific Net package to facilitate network connections and utilise common Web protocols. Its cross platform ability is also important for software that needs to be executed on different architectures. The major drawback of Java is its performance, due to it being partially interpreted code. For normal applications using modern processors with lots of RAM this is not a problem. But graphics programs, especially those using 3D graphics, like games, are hampered by its performance characteristics.
C's history as an operating system language means that it produces fast, lean code. It originates from a time when memory and processor speed were worth many times what they are today. By using C++ you can have the advantage of the object oriented methodology, while keeping the performance. C++ also offers some of the advanced features described above for Java, such as stronger data typing and exception handling. This explains why much software is now written in C++.