Indexing tutorial

Make comments, ask questions, or just complain about the software on this site. Or comment on any educational software.
Please note that by clicking on links that may appear in these posts that you may be leaving the Dale Harris Educational Software website and that the content of those sites is the sole resposibility of the authors of those sites.

Moderators:daleadmin, Dale Harris, Alan, Andrew

User avatar
Dale Harris
Forum Owner
Posts:1171
Joined:Sun Dec 28, 2003 10:19 pm
Location:Chicago
Contact:
Indexing tutorial

Post by Dale Harris » Fri Dec 16, 2005 9:30 pm

Small asked me for a tutorial on indexing files in QuickBASIC but I do not have his email address. So I am posting it here in the hops that he will find it. Plus anyone else can have it who wants it.

The file http://keyhut.com/index.zip contains two files. The source code for the INDEX.BAS program and the following text file in RANDFILE.TXT

<hr>

This document and the program file INDEX.EXE are a tutorial on indexing files.

INDEX.EXE will allow you to create a data file which will contain 5 lists of random numbers from 20 up to 999,999 numbers. The number lists will comprise all the numbers from 1 to the maximum, there will be no missing numbers and no duplicates. The program will create a new database when you start the program for the first time or choose the menu option "1. Create database." You will be asked to enter the largest number in the database. The first time you do this you may want to choose 20 so that the entire database can be displayed on the screen at one time. The database will have the file name RANDFILE.DAT and the number for the size of the database will be in the file RANDFIL1.DAT

Each number list will be held in one column of the database and will be identified by a letter from A to E. Each record of the database will contain the ?th number from each column. For example the 9th record will hold the 9th number from column A, and the 9th number from column B, and the 9th number from column C, etc. The point of this is what if the columns did not contain numbers but other data instead, like personal data. For example the data columns could be name, address, city/state/zip, phone number, email address. But that would require a whole lot of typing to create a sample database with thousands of records while a computer can create numbers automatically. Plus it is real easy to tell when numbers are in order. So while the database is all numbers think of it as being personal data.

This also means that if you display number 256 from column A and then column B displays 483 then if you later ask the program to display 483 from column B then column A will then display 256. Or think of it this way, if you have a personal info database and display DALE HARRIS in the name column and 5654 OLCOTT displays in the address column then if you later display 5654 OLCOTT in the address column then of course DALE HARRIS would display in the name column. This means that the data in each record will be displayed and will be the same for each column no matter which column you choose to bring up that record. This will be true unless you choose to create a new database.

This means that if you choose to display the database in column A order that all the other columns will appear in seemingly random order. If you then display the database in column C order then only column C will be in order and all the other columns will appear to be random. But they are not really random. Remember that a particular number in column A will be always be partnered with a particular number in column B and a particluar number in each of the other columns. Remember in the personal database a particular name will always partner with the same address, and the same phone number, etc.

After you have created your database you must index it using the menu option "2. Index database." The index for each column will be in the file A.IDX for column A, B.IDX for column B, etc. An index will have two columns, the first column will have the data (the numbers from the database for that column) in sorted order and the second column will have a pointer which will point to the record in the database that contins that number. For example if number 5265A is in record 2663 in the database then the 5265th record in A.IDX will be 5265A 2663. Of course it is kind of pointless to have the actual number 5265 stored in record 5265 because if you know the record number you will know what data is stored there. But this is only true if you are storing sequential numbers. If your database was nonsequential numbers, like phone numbers, or text like names, then record 5265 could contain anything.

Once you have created and indexed your database you can see it by choosing "3. Display database." from the menu. You will then be offered choices.

If you press [R] the database will be displayed in record order. This means that you will see the database exactly as it is in the disk file. If you press [A] to [E] then the database will be displayed in the order of the column you have chosen.

To display the database in column order the program must open up the correct index file. It will get the pointer for record #1 in the IDX file which will tell the program where in the database it will find the first number to be displayed. For example if you choose to display the database in column "B" order the program will open B.IDX and retrieve the first record which may contain "1B 7442", then the program will get record 7442 in the RANDFILE.BAT which may contain "2563A 1B 6415C 533D 9651E" and it will display those numbers in the top line of the screen. The program will then get the second record from B.IDX which may be "2B 1574" and will the retrive record 1574 from RANDFILE.DAT to retrieve "258A 2B 6451C 8713D 2105E" and will display those numbers on line #2 of the screen. It will repeat this until all 20 lines of the screen have been filled in.

The first line of the screen is filled in with the record number contained in the variable OFFSET and then the next 19 lines are filled in with the records OFFSET+1 to OFFSET+19. If you press one of the arrow keys, PAGE keys, HOME, or END the value of OFFSET will change accordingly. Whenever one of the above keys are pressed the entire screen is repainted so it may seem as though the screen is scrolling but actually the screen is being repainted.

If you press [F2] you may enter a number to be searched for in the active column, for example if you are displaying the database in column C order the the number being searched for will be found in column C. Since the index for each column is in order you can use a binary tree search to find the number. A binary tree search will start from the middle of the database and look at the number stored there. It will then ask if the number found is higher or lower than the search number. If lower it will look half the distance to the end of the list and look at that number, if that number is higher it will look half the distance from there to the middle. It will keep doing thtis until the number is found. For an example leys assume that the number you are looking for is 5 out of 32 numbers. A binary tree search will look at 16, 8, 4, 6, 5 and then it is found. Of course in using an index that contains all of the numbers from 1 up in sequential order this is pointless, number 5635 has to be in record 5635, but if the numbers are not in sequential order (but still in numeric order) or the index contains text in alphabetical order then a tree search makes sense.

Well why does it make sense? If you had 100,000 items in a list and wanted to find one of them then on average you would have to look though 50,000 items to find the find the one you are looking for by checking every item from #1 upward until you find it. Sure it is possible that the item you are looking for is in record #56 but it is just as possible that it is in record #999,944, so the average is 50,000 peeks. However with a binary tree search the maximum number of peeks into the index for 100,000 records is 16, for a million it would be 19 peeks, 10 million equals 23 peeks and 100 million is 27 peeks, for the population of the U.S. it would be 29 peeks. Much better than checking an average of 130,000,000 records to find you uncle Earl.

When displaying the database if you press [F1] you will be able to choose a different column to display in order, however the top record will not change (assuming that in the new order the record will be 20 or less from the maximum.) What the program will do is to look at the number currently displayed in the new column, use a binary tree search to find that number in the new IDX file, and then display the records in the new column order starting with the same number in the chosen column. Of course since the top record will be the same the numbers in the other columns on the top will not change either. What is the purpose of this? Let's say that you have a database of personal information and you want to find those people who live near a particular person and that the address column is indexed in such a way that addresses close to each other are together, for example instead of...

5654 OCONTO
5654 OLCOTT
5654 OSCELOA

you would have...

OLCOTT 5654
OLCOTT 5655
OLCOTT 5657

Well then you would find the record by searching for the name, then switch to the address column with the same name at the top of the page but the records below being neighbors of the top record.

From the menu if you choose "4. Display pointers." the index files will be displayed.
Dale

Post Reply

Who is online

Users browsing this forum: No registered users and 67 guests