![]() | ||
| Database | ![]() ![]() ![]() ![]() | Tuning and Efficiency |
Efficiency in an application is difficult to achieve by tuning after the system is developed. If efficiency is a concern, the best time to consider these issues is at the design stage. The first thing to determine is whether efficiency is a major concern and to identify possible areas where these concerns may arise. For example:
The
Common vars are retrieved very efficiently regardless of a particular record type that is being processed. (Note that common vars cannot be used as keys in secondary indexes as they are not physically part of the record.)
Using the default, when a data block fills up, it is split, with half of the records staying in the original block and half going to the new block. Fifty percent is a good figure for active databases. It means however, that as much as fifty percent of the
data file may be empty. This may be unacceptable on large databases and on
relatively static databases.
The amount of free space is controlled with the loading factor clause on database updating and database creating commands and utilities.
When a database is reloaded or imported, the data is in sequence and the default loading factor is set to .99 to make maximum use of disk space.
The larger the size of the keys, the larger the index. The maximum size keys in any record in a database affects the size of the index. If an application has one record type with a much longer key than all others, try to reduce this if possible. For example, do not have one record type indexed on a 60 character name, if all other record types in the database have unique numbers.
The maximum possible size of the database key or any secondary index is 320 characters.
In a series of record types that share higher level keyfields, each of these records store much of the same key information. It is therefore sensible to minimise the size of these keys.
For example, a text retrieval system could use words to index documents. However a word can be very long and storing these as keys for documents is wasteful of space. Assign each word a number, such that the text of the word is only stored once and all other key indexing is through the word number.
SIR2002 compresses string variables by stripping trailing blanks to hold only the data. Specifying a long maximum length for string variable incurs little overhead provided it is not used in any keys or secondary indexes. Note that the maximum record size is limited to 32k bytes and is tested assuming all strings are at maximum defined length.
The size of integers depends on the maximum value. One byte holds integers in the range of -127 through +123; two bytes holds integers in the range -127*256 through +127*256 (approximately 32,000), four bytes holds integers in the range -127*256*256*256 through +127*256*256*256 (approximately 2,100,000,000).
SIR2002 stores an actual value in a data field to indicate missing values. A variable can have four possible missing values. SIR2002 uses the upper four values on integers for the three missing values that can be specified and the system missing value
Real numbers are stored in 4 or 8 bytes.
For example, a variable with an input format of "I3" requires 2 bytes of storage because any value between -99 and 999 in the input field can be input. If this field will only contain a 2-digit variable with a leading blank or plus sign (+), specify the format as "1X,I2". This saves one byte of storage space in each data record containing this variable since 2-digit variables are stored in a single byte.
For example, a categorical variable might be a list of the names of American
states. If 'Alabama' were the first entry in the list, when 'Alabama' is input,
'1' is stored.
The list is held in the data dictionary and is searched sequentially. It is a very simple and easy to use facility for short lists which are not updated very often. A categorical variable takes one byte (for up to 123 values) or two bytes for longer lists.
Do not use categorical variables if there are hundreds or thousands of entries, or there is more information about each entry than just the name, or users have to modify the entries, use tables with indexes to store this type of reference data.
For example, suppose a variable
If the precision for the percentage example were 2 digits after the decimal
point, specify an input format of I6 (nnn.nn) allowing for the decimal point and
5 numbers and specify
Even more storage may be saved with
Whenever possible use the keys to retrieve records. To retrieve a single record,
specify the whole key. To retrieve a set of records, specify the high level keys
that define the set. Whenever possible, specify the keys as part of the
retrieval statement, rather than retrieving all the records and testing values
in the program.
Define secondary indexes to provide access to subsets of records. Both databases and tables provide automatic secondary indexes.
Consider sorting the input transactions to ensure that any serial processing happens only once.
Consider adding additional indexes or keys to avoid an application having to do serial searches of records to find those of interest,
One common design issue involves processing records after a certain amount of time has elapsed. For example, sending a letter to all patients who have not attended for six months. Consider a secondary index by date for planned future attendances. Update this at the time the visit data is updated and then the system can process by date rather than serially searching.
For the most efficient processing, sort the data for a batch data input run into the same sequence as the data base key. Sort on :
This way, the batch data input can be accomplished by an almost sequential processing of the data base.
It is also efficient to process the records by record type. Each time a new record type is processed, the description of this record type must be loaded from the dictionary. If different record types are processed together and there are multiple records for each case, this saves accesses to the case block but requires multiple access to the dictionary. The most efficient processing depends on the exact mix of input.
When a record is added, it is stored on the correct block in sequence. This means that a block can get full. If this happens SIR2002 creates a second block to store the additional data, and creates another entry in the index. New blocks are created as necessary. A new block is either allocated from existing available blocks or from new blocks at the end of the file. Blocks become available if the data on them is deleted.
SIR2002 holds the case id, record number and key fields as the key. All keys are the same length, which is either the maximum length of a defined key or the
At the lowest level, an index consists of a key and a pointer to the data block which has that key as the lowest value. At the highest level there is a single Master Index block. This contains a key and a pointer to the index block which has that key as the lowest value. If necessary, because of the size of a database, there may be further index levels between the Master index and the lowest level index. When a key is specified, SIR2002 uses the master index (and any other index levels) which point to the lowest level index block which corresponds to the value given and retrieves that data block.
When a block is created on disk, it is assigned a number known as the PRU or physical record unit which can then be used to retrieve the block directly. In operating system terms, a SIR block consists of multiple physical disk blocks since most operating systems write in fixed blocks.
The
Data block size depends on the size of records defined. If there are not any very long records, SIR2002 uses one block (i.e. 2K). SIR2002 tries to allocate a block size which is big enough to hold 4 of the largest records. The largest block size is 32K. If the maximum record length is between 512 bytes and 8k bytes, then SIR2002 allocates a block size between 2K and 32K. A data record is held in one block. That is, a record is not split across blocks so the maximum size for a single record type is 32K.
A loading factor can be specified on a retrieval update, a batch data input or on a utility update run such as RELOAD. This affects the way a full block is split. The factor is a number between 0 and 1 and the default is .5 on updates and .99 on imports and reloads. The most efficient database is one where each block is loaded to the maximum since this minimises the amount of disk space used and makes retrievals more efficient by
reducing the number of disk I/Os. However, a high loading factor for existing blocks can affect the ways that updates work. To take some examples:
Now suppose that a Retrieval Update adds records randomly using a loading factor of .8. At some point a block becomes full. The record being added at that time is inserted in the correct place and 20% of the space on that block is made available by copying those records to a new block and entering that into the index. If the original block is added to further with data that belongs in that block (i.e. with a key lower than an existing record in that block or than the lowest key in the next block) then again it becomes full and again split with the next new block again taking 20%. Thus it is possible under some sequences of updates that many new blocks are only 20% full. If the loading factor were higher, the result would be even worse. Adding data in reverse key sequence with a high loading factor would produce very poor block usage.
The best loading factor depends on the nature of the activity at the time. In general, adding in sequence at the end of the database is best served by factors nearer to 1. Randomly adding data throughout the database is best served by having enough space available for the inserts to work without splitting blocks and, without specific knowledge as to the sequence of updates, a loading factor of .5 should be used.
SIR2002 uses .5 as a default for updates and .99 as a default for reloads and imports.
The actual, exact loading is reported by the
Each key has the same length - its length is either defined implicitly in the schema or by the
The index holds the key of the lowest record in each data block. An index block holds 'n' entries depending on the size of the key. If normal size keys are specified, say up to about 80 bytes, SIR2002 uses the minimum 2K block size; after that SIR2002 increases the index block size. The index block size is always a multiple of the minimum size. If there are very large keys or a very high number of data blocks, SIR2002 increase the size of the index block to cope with this.
To illustrate the way index levels work, assume there are 80 keys per index block. One index block can point to 80 other blocks. If there are less than 80 blocks of data, then there are only two index blocks. The master index and one low level index block. The master index only has one entry. With 81 to 160 data blocks, there are three index blocks, the master index with two entries, one index block for the first 80 data blocks and the second for the next eighty blocks. This continues on until there are 80*80 data blocks, 80 index blocks and one master index block with 80 entries. When the next data block is added, one of the low level index blocks is used to create two new low level blocks. The original low level index block is now a third level index which contains just two entries pointing to the two new low level indexes. As records are added, indexes split as necessary. The third level takes the index capacity up to 80*80*80 data blocks. This process continues as necessary.
At no one point in time is there any major overhead or any need to reorganise the database assuming that none of the limits specified in the schema definition are reached.
Each secondary index is physically very similar to a standard database. It contains index blocks and data blocks. The sizes of these blocks are calculated in a similar way to the block size calculations for standard database blocks to ensure reasonably efficient processing given the size of the secondary index key and the maximum number of records of that type. Each index potentially has different block sizes.
Each record in the data block in a secondary index has the secondary key as the key and contains the standard database key as the data. Thus the size of these data blocks is affected by the size of both keys.
The following discussion refers to the
CIR Size = 5 words plus key 2 = 7. 254
divide by 7 = 36 per block. 1000 cases in database means 1000
CIRs. 1000 divided by 36 means 28 blocks.
CIR = 28 blocks Rec1 = 40
blocks Rec2 = 32 blocks Total = 100 blocks
This shows how many data blocks the database requires and which record types use
the space.
Second level = 2,500/40 = 63 (Rounding
up). Third level = 63/40 = 2. Total required 2,500 + 63 + 2 + 1 = 2,566.
This gives how many index blocks are needed. To translate these two figures into
physical disk blocks or megabytes on a particular operating system, multiply by
the appropriate factors:
Take the data block size and index block size from
This is how much space the data and indexes are going to take for a database.
Disk Space
The amount of space that a database occupies on the disk can be a concern with larger databases. There are often trade-offs between processing efficiency and storage and there are several things that can be done to limit the size of databases.Database Subsets
All of the data may not be needed on-line. There are utilities which create a subset of a database. There are also utilities that merge subsets into the master database. For example, if an application normally only deals with data from the current year, archive the data for previous years and conserve disk space. If the old information is needed for year end reports, reload it, use it and then archive it again.CIR Size
The common information record or CIR on a case structured database occurs once for each case and holds both record counts and common data.Record Counts
The CIR holds space to count occurrences of each possible record type on the case. Allowing for large numbers of record types means a large CIR. In particular, it is very wasteful to allow a large number of record types (MAX REC TYPES) with the intention of using very few e.g do not define a MAX REC TYPES 1000 just to use a few record types in various ranges (100+, 200+, 300+) to mean something. For maximum efficiency, record types should start at 1 and be assigned sequentially.MAX REC COUNT affects the size of each counter (1, 2 or 4 bytes). So a MAX REC TYPES of 1,000 and a MAX REC TYPES of 1,000,000 would mean 4K of record counts per CIR.Common Vars
Defining a variable as a common var means that it is physically stored in the CIR, not in a record. Very often, there is only one occurrence of a common var in a case, so storage is identical whether held as a common or a record variable. If a common var is defined in a record that does occur multiple times, it is only physically stored once, this value being the latest value written.The Loading Factor
Records are stored in "data blocks", the exact size of a data block varying from database to database. Records are added to a data block, and when that gets filled up, another block is created and so on. Records are maintained in sequence within a block and some empty space is left on each data block for the insertion of new records. The amount of space on each data block is controlled by the loading factor and is expressed as a decimal representing a percentage. The default loading factor on updates is .5 or 50 percent.The Database Index
The index to the data records in a database is built from the key variables. The key of the first record in each data block is in the index. This means that there is some redundancy between data in the records and data in the index.Variable Sizes
It takes more space to store strings than integers. Whenever there is a choice, storing a number is more space efficient. If a string has a defined set of values, either define the variable as an integer and assign value labels or define it as a categorical variable. UNDEFINED.Schema Specifications
SIR2002 assigns internal formats according to the external format of the data defined in the schema. Disk space can be saved by a careful choice of schema specifications.VAR RANGES
Specify VAR RANGES if the variable has a narrower range of values than given by the number of digits. The value is used to calculate the minimum number of bytes needed to store the data on disk. For example, specifying a VAR RANGE of -99 to +99 on a variable where 3 input columns are allowed saves space. Consider a potential saving of disk space by defining a proper VAR RANGE.CATEGORICAL
Categorical variables offer an efficient way to store strings which are predefined. A categorical variable is a character string which has a limited number of values specified as an ordered list. When the data is input as a string, it is compared to the list and the number which corresponds to the matching position in the list is stored instead of the value. This has the advantage that only valid entries are held and considerable space is saved. In programs and reports, the full string is displayed and retrieved.SCALED VARS
SCALED VARS stores numbers as integers when they have a predefined number of decimal places. This is more efficient than using floating point R*8 and can be more accurate than R*4.XPCT which holds a percentage and can have a range of 0 through 100 and a precision of one decimal point: Define XPCT as integer with an "I4" input format and specify SCALED VARS XPCT (-1). On input, supply the data as a number which includes the physical decimal point, i.e 10.3, 40.0. The XPCT scaled integer is only going to require two bytes to store (since the maximum physical digits stored are 1000 i.e. 100.0).VAR RANGES (0.00,100.00) which tells
SIR2002 that 2 bytes are sufficient.SCALED VARS, on numbers
which are very small but have only a small number of significant digits. For
example, the specific gravity of fluids in the human body (blood, urine, etc.),
which are measured often with a 3 digit precision. To maintain precision in
floating point, specify an INPUT FORMAT of D4.3. SIR2002 would
use 8 bytes of storage because of the precision. If this variable is read as
"I2" integer and specify SCALED VARS (-3), 6 bytes is saved
per value and accuracy is preserved. (The variable can hold values up to 32.763
which is ample for an S.G. measurement.)
Processing Efficiency
Disk Input/Output (I/O) is the most time consuming operation on a computer and
retrievals should be designed to minimise I/O.Using Keys
The index is used to retrieve records whenever keys are specified in a retrieval
statement. In case structured databases, the index is ordered by case, record
type and by the key variables. In caseless databases, the index is ordered by
record type and then key variables.Efficient On-Line Access
If the key values are known, then data can be retrieved efficiently. Without
keys, an alternative access route is needed. Doing a serial search for
particular records on-line, without knowing the high level keys is a slow
process.
Efficient Batch Processing
Batch processing (the running of jobs in a non-interactive way), typically means that a user is not at a terminal waiting for the job to finish. Processing speed tends not to be of the same concern as it is for on-line access. A process which takes 2 minutes as compared to 1 minute is unlikely to be of concern to anyone. However there may be some concerns when processing thousands of transactions which run for hours.Efficiency in Batch Data Input
Batch data input is the loading of data from files into the database through the
batch data input utilities. This can be done interactively or in batch mode.
Database Internal Structure
The records in a SIR2002 database are stored in a direct access file with an internal index sequential B-Tree index. The database contains two types of blocks: data blocks and index blocks. Data blocks contain the data records, index blocks contain the information needed to access any record in the data base. Blocks can be in any sequence on the disk. Within one block, records are held in sequence. The first record in each block is indexed.MAX KEY SIZE specified. Pay attention to the size of the key. A key is held for each data block in the index and the key is held for each record in each data block. Defining a very large key for one record type impacts the overall database size, regardless of the number of occurrences of that record type.
Block Organisation
SIR2002 calculates the size of data blocks and index blocks for a particular database based on keysize, maximum record size and maximum numbers of records when it first puts any data into the database. The block size is between a minimum and maximum (from 2K bytes to 32K bytes on all current systems). The data blocks and index blocks in a database may be different sizes though in a particular database all data blocks are the same size and all index blocks are the same size.LIST STATS command gives information about the database including the 'INDEX/DATA BLOCK SIZE'. This gives the sizes of the SIR2002 index and data blocks. Sizes are given in double words - eight bytes on current systems. Sizes are also given less the control information SIR2002 holds on each block. A logical block of 2K is 256 doublewords. A typical size for logical blocks for small keys and small data records would be 253/254.Data Blocks
Records are stored in blocks in the order of the keys:case 1 CIR of case 1
records of type 1 within case 1
records of type 2 within case 1
...
...
case 2 CIR of case 2
records of type 1 within case 2
records of type 2 within case 2
...
...
case 3 CIR of case 3
...
SIR2002 holds all of the records in a data block in sequence and to do this it constructs an extra key area at the beginning of each record and holds keys there separately from the data. All record key areas are the same length which is the same as the keys held for the index.
Loading Factor
When SIR2002 needs to insert data which does not fit in the original data blocks, it creates a new block and splits the original data leaving some space on each block. The amount of space left on a block when it is split is determined by the "Loading
Factor".Example Loading Factor Effects
If a loading factor of .99 is specified on a RELOAD, then all the blocks are approximately full. Suppose that Batch Data Input is then used to add a large amount of data at the end of the database, say with an .99% loading factor. Again all blocks are approximately full. (Blocks have to hold whole records and each record is a different length. So when a record does not fit into a block a certain amount of space is left free. This space varies from block to block.)VERIFY FILE command. The number reported gives an average over all blocks in the database. SIR2002 does not split records across data blocks and each block contains complete records only. A block contains a mix of records. For example an 80% full block in the EMPLOYEE database might contain data for 3 or 4 employees and as such could have say 4 CIRs, 4 Employee records, 7 Position records and 13 review records. Thus the exact loading of the block depends on the exact mix of records.Index Blocks
The key is comprised of:MAX KEY SIZE command.Index Levels
There are always at least two levels of index, a Master Index which is a single index block, and a low level index. There may be up to 6 levels of index. A six level index can point to the number of keys in one index block raised to the power of 6. For example, with 36 keys in a block, a six level index copes with over 2,000,000,000 data blocks.Secondary Indexes
All secondary indexes are held on a separate database file (.sr6). This is created when the first index is created and deleted if the last index is deleted.
Size Estimating
Once records have been added to the database, each physical data block contains
a number of different record types. For size estimation, calculating the number
of data blocks each record type would take gives a reasonable estimate of disk
space requirements. In addition, space is required
for the dictionary and procedures but typically these are relatively small
requirements. See Procedure File
for further details on managing a procedure file.LIST STATS
output listed at the bottom of this page.LIST STATS, find the data block size in
double words. For example 254.LIST STATS find the keysize in bytes.
Convert this to double words by dividing by eight and rounding up. For example,
keysize 12 is 2 double words.LIST STATS.
Take the size of the record in words and add the keysize from step 2 to find the
space this record type takes. For example, "Size in Words" 8 plus key size 2
equals 10.
Next calculate the index size:
For example, if Max Index Entries is 40 and there are 100,000 data blocks.
Bottom Level = 100,000/40 = 2,500. LIST
STATS. For example, 126. LIST STATS
in double words and convert to physical blocks or bytes. There is a small
overhead on each physical block such that the reported size is smaller than the
real physical size. For example, a data block of 254 double words is 2K (which is four 512 byte blocks on some Windows file systems). 1,000 254 double word data blocks would take approximately 2 megabytes of disk space.
Verify database statistics
---------------------------
Data records on database 17803
Cases on database 53
Index blocks read 10
Data blocks read 658
Average index block loading 0.87
Average data block loading 0.77
Warning messages 0
Correctable errors 0
Non-correctable errors 0
Verification complete with no errors
Database name HEART
Creation Date/Time 08/30/01 09:08:04
Last update Date/Time 08/30/01 13:42:20
Update level 165
Average Records per Case 20000
Max/Current Number of Cases 100000/53
Max/Current Number of Records 2000000000/17803
Max/Current Number of Record Types 117/85
Maximum Input Columns/Lines 136/32
Rectype Columns 1-3
Sequence Column None
Journal For Database OFF
Case Id Variable SSNUM (A)
Number of Index Levels 2
Max Entries Per Index Block 76
Index/Data Block Size 761/1019
Active/Inactive Data Blocks 658/0
Active/Inactive Index Blocks 10/0
Keysize In Bytes 72
Min/Max Record Size 3/240
Number of Temporary Variables 0
Maximum Number of Data Variables 842
Record Type Number of Maximum Total In Size In Entry Use
No. Name Variables Per Case Database Words Count
-------------- --------- -------- -------- ------- ---------
0 CIR 9 1 53 29 2
1 DEMO 72 1 51 81 2
2 HOSP 244 100 151 123 2
3 CLINPRES 842 100 40 240 2
4 CATH 498 100 51 201 2
......



