Database homecontents start chapter top of pagebottom of pagenext page indexIntroduction

SIR2002 Databases

This section of the documentation describes how to manage databases using SIR2002. Anyone intending to create or change the definitions of a database or act as a database administrator should be familiar with this material. If you are simply going to access SIR2002 databases using applications developed by someone else, then the
Database Overview provides a good introduction.

SIR2002 database management is covered under a number of topics:

Many of the features are available through the menus and dialogs which are the normal way to manage a database on a regular basis.

SIR2002 provides a concurrent environment if multiple people need to update a database at the same time across a network.

Overview

There can be any number of different databases for different applications. A SIR2002 session can have any number of databases connected at any one time and one of these is the current or default database. Any database operations happen on the default database. SQL, Forms, PQLForms and VisualPQL can access multiple databases.

A database is made up of one or more record types. Each record type contains variables. Each record type may have one or more variables defined as keys. Keys are used to identify each individual record.

A record may have secondary indexes allowing access by VisualPQL and PQLForms through values in non-key variables.

The definition of the database consists of overall information about the database plus definitions for each record type with definitions of each variable and any secondary indexes. This is collectively known as the Schema. Records and variables are retrieved by name and applications do not need to know anything about the physical organisation of data. i.e. Applications are independent of the physical structure of data. Every application that accesses data from a database uses the schema which means that the information is always verified and conforms to the schema definition. An application can access any record type within a database and can access multiple record types for complex processing. The schema can be defined through interactive dialogs or by creating a set of commands and running them in a similar way to a running a program. The resulting database is exactly the same regardless of the way that it is defined.

The database definition can be modified even after the data has been loaded without, in many cases, having to UNLOAD and RELOAD data.

There is a second method to store data known as Tables. A table holds a single record type and is independent of any database. Tables are held on tabfiles. A tabfile can hold many tables. An application can operate on many tables and many tabfiles.

CASE Structured Databases

SIR2002 is a true relational database and databases can be created which are simply sets of record types. However SIR2002 also allows databases which can take advantage of structure in the data:

Some databases have a natural structure known as a Case structure. For example, a medical database might hold information about patients; there could be a number of record types such as patient demographics, visits, procedures, followups, etc. most with multiple occurrences but all of the information about a single individual makes up a case. i.e. A case is a set of records which all refer to one single entity.

A case structure can be easier to use. Queries are simpler to write and less error prone within a case. A case structure can also be faster. Since all of the information for any given case is available with a minimum of searching, access is very fast.

A case structure can be found in many applications. The following guidelines may help determine whether a particular set of data has a case structure.

The Common Information Record

In a case structured database some information is maintained about each case. This is held in the Common Information Record or CIR. The CIR is a summary of data about the case. It holds the case identifier, counts of records belonging to the case and other specified common variables. It typically holds values which only occur once in the case although it can hold a copy of the last value entered for variables which occur multiple times in the case.

CASELESS Databases

Records in a caseless database have no single common element that relates them. If a single database has various records types which are each independent from the other (e.g. Parts, Customers, Employees), then it should be a caseless database or perhaps the various entities should be in individual case structured databases.

Keys

Each record in a SIR2002 database is unique and that uniqueness is defined in terms of the values of keys. A record may have one or more variables which make up the key. No two records in the database can have the same combination of record type and key values.

The main index which SIR2002 maintains to locate the records is built from the key and the key determines how the record is physically stored in the database.

Because the keys go into the index, they are relatively more "expensive" to maintain than non-key variables. Do not declare more keyfields than needed to make records unique or to define relationships. Whenever there is a choice as to which variables to use for a key, use variables which are short and well defined. Use numeric codes wherever possible, especially in place of long alphabetic strings. Using strings as keys leads to misspellings and possible confusions as well as being inefficient. Avoid floating point numbers as key fields.

Secondary indexes can be used to retrieve data in other sequences.

Order of Variables in Keys

In keys made up of multiple variables, the order in which the variables are specified in the key is important.

The most efficient retrieval of a single record is one in which the whole key is specified. Since there is an index built from the key, the record can be directly retrieved.

Records can also be located from just part of the key. The higher level key variables must be specified to find records efficiently. Higher level keys should be the ones that are more likely to retrieve by.

Joining Records with Keyfields

Data from two record types can be joined by using a keyfield that is common to both record types. If keyfields in different record types have the same name and are in the same order, these record types are implicitly joined. A record type with one extra keyfield defines a hierarchy. For example, one record type may have CUSTNO as the key, a second record type CUSTNO and ORDERNO, a third record type CUSTNO, ORDERNO, LINENO, etc. These can go down through many levels of record types if needed, and the implicit joins make retrievals simpler and more efficient.

Defining Variables

Every variable in a database has a name and a structured definition. Whenever a variable is entered into the database, it is checked to ensure that it conforms to the definition.

Variable Names

Database variable names are up to eight characters, begin with a letter and can contain letters, numbers and four special characters (dollar sign $, hash sign #, at sign @ , underscore _ ). Letters in names are translated to uppercase. (These same naming conventions apply to all database names such as record names and passwords.)

Variables are always referred to by their name. Descriptive names are usually best. For example, a variable called 'PHONE' is easier to remember than one called 'VAR112'.

Within a record type, variable names must be unique. Variables in different record types may have the same name. Beware of using the same name for different things in different record types in the same database. For example, it would be a mistake to use the name 'DATE' for variables when they are different dates, such as Birth Date, Visit Date, etc.

Variable Data Types

Every variable is one of three basic data types:

INTEGER
Integers are the set of natural numbers both positive and negative including zero.

FLOATING POINT
Floating point or real numbers, are numbers that may have decimal characters. They may be either single or double precision.

STRING
String or character variables are variables whose values are alphanumeric.

Variable Special Data Types

Although every variable is stored as one of the three basic types, there are several special data types which automatically translate information as they are entered or retrieved. These are:

CATEGORICAL
Categorical variables offer an efficient way to store predefined strings.

A categorical variable is a character string which has a limited number of values specified as an ordered list. When the variable is input as a string, it is compared to the list and the number which corresponds to the matching position in the list is stored instead of the value. This has the advantage that only valid entries are held and considerable space is saved. In programs and reports, the full string is displayed and retrieved.

For example, a categorical variable might be a list of the names of American states. If 'Alabama' were the first entry in the list, a variable entered as 'Alabama' results in '1' being stored.

The list is held in the data dictionary and is searched sequentially. It is a very simple and easy to use facility for short lists which are not updated very often. If there are hundreds or thousands of entries, or there is more information about each entry than just the name, or users need to modify the entries, use tables or record types with indexes to store this type of reference data.

DATE
A date variable is entered as a character string and stored as an integer which is the number of days since the start of the Gregorian calendar (October 15, 1582). A date has a format or date map such as 'MM/DD/YYYY'. As dates are stored as the number of days since a predetermined point in the past, it is very simple to perform date based calculations and there are no potential problems at the end of centuries or millennia.

TIME
A time variable is entered as a character string and stored as an integer which is the number of seconds since midnight. A time has a format or time map such as 'HHMMSS'.

SCALED
A scaled integer represent numbers multiplied by a power of ten (the power is the scale factor). These can include a decimal component (a negative scale factor) or can be a number with a set number of trailing zeros (a positive scale factor). This is efficient for data which has this characteristic of a fixed scale. For example, Money in dollars and cents would be scaled at -2.

The software handles any scaling issues involved. For example, if a scaled integer has a scaling factor of 2 (hundreds), and it is computed equal to 50 x 4, the database stores a 2. If the variable is printed, 200 is output. If used in another calculation, it would be 200. For all practical purposes it has the value 200, except it saves storage space.

The major limitation on scaled integers is that the maximum integer value is 2,147,483,643. If scaled to a negative power, this may not be large enough. For example, scaling to -2 for money, gives a maximum value of 21,474,836.43. (For larger money values, use whole cents in a double precision floating point variable:R*8)

Variable Formats

Variables can be defined in terms of an external format. These formats are:

Aw
A string w long. Strings are stored up to this length with trailing blanks compressed so very long fields (up to 4094) can be defined with little overhead. If using strings as key fields or as fields used in secondary indexes, make them as short as possible. If the variable is a categorical variable, the variable is stored as a number which varies in internal size depending on how many entries are in the list of allowed values.

Iw
An integer w long. For integers, the number of external digits (or the maximum size if specified as a range) determines the internal size. For example two digits holds up to 99, but this can be stored in one byte internally. Numbers with one or two external positions are stored in one byte, numbers with three or four external positions are stored in two bytes and numbers with five and over external positions are stored in 4 bytes.

Scaled integers with decimal places (negatively scaled) have external formats identical to floating point, although the internal storage is as an integer.

Date 'format'
A date. The external format is a string and the internal format is a 4 byte integer.

Time 'format'
A time. The external format is a string and the internal format is a 4 byte integer.

Fw.n
A single precision floating point number w long with an assumed n decimal places. Floating point numbers can be defined as single precision regardless of their external format. For example a floating point number might take 15 columns and be single-precision and it would be stored in 4 bytes at single precision accuracy.

Dw.n
A double precision floating point number w long with an assumed n decimal places.

When defining the external format of floating point numbers, specify the number of digits after the decimal point. This sets the format for printing and the default for input. If a physical decimal point is present in the input, this overrides the format i.e. the default is only used if there is no explicit decimal point.

Floating point numbers can also be input in scientific notation, that is +n.nnnE+nnn where n.nnn is the number, E indicates exponentiation and +nnn is the power. For example, if the variable has four decimal places specify D(12) to hold the 12 columns of input required (+1.3456E+009).

Variable Size

The size of a variable is defined by either the external format or internal storage type:

For most data definition purposes, specify how the variable looks externally and the appropriate internal format is created automatically. However it is as well to understand these transformations.

Variable Internal Formats

Variables are stored as:

I* 1 | 2 | 4
Integers of 1, 2 or 4 bytes.
I*1 have a range of -128 to +123;
I*2 have a range of -32,768 to +32,763;
I*4 have a range of -2,147,483,648 to 2,147,483,643.

R* 4 | 8
Floating Point of 4 bytes for single-precision; 8 bytes for double-precision.

To input a negative number (whether integer or floating point), simply precede the number with a minus sign (-).

STRING*n
Character Strings of up to 4094 bytes.

Variable Quality Control

Whenever variables are entered, they are checked to make sure that the input conforms to the defined data type, size and format. Several checks may be specified:

Valid Values

Valid Values are a list of values which are allowable for a given variable. Input that does not match the valid value list for a variable is rejected.

Variable Ranges

Variable Ranges are ranges of values which are allowable for a variable. Input that is outside the ranges is rejected.

Missing Values

Missing values define values which are held in the database but excluded in any calculations. Up to 3 missing values can be specified per variable.

For example, a variable might have two valid values such as 'Y', and 'N' meaning 'Yes' and 'No'. A further three missing values could be defined such as 'X' for 'Not relevant', 'R' for 'Refused to Answer' and 'Z' for 'Invalid Answer'. When producing statistics on that variable, only the Y and N answers are included as the others are defined as missing values.

Blanks may be declared as a missing value. If a numeric field is blank on input and blank has been defined as a missing value, the variable is stored as missing. If blanks are not declared as a missing value for the variable, zero is stored when a numeric field is blank on input.

Batch Data Input

Certain checks may be specified which are applied by the Batch Data Input utilities. Consistency checks between variables in a record type can be specified. The values of data items can be computed before being stored in the database.

Variable Labels

Variables on screen displays and printed output can be displayed with the variable name or with an optional, longer label. Names as are used as labels if there is no further description.

Value Labels

Labels can be defined for specific values of variables. For instance, for a variable Gender, the value 1 may have the label 'Male' and the value 2 may have the label 'Female'. These labels may be used in reports, etc. but generally, internally within programs, the numeric values are used.

Decimal Points

When reading numbers from external files or specifying numbers in definitions or programs, an actual decimal point "." can be present in the number. In definitions and programs, if the decimal point is omitted, it is assumed at the right of the number. For example:
COMPUTE A = 1.2
COMPUTE B = 100

Some existing computer files may not have explicit decimal points but may have an implied decimal point at a given position. For example, a format of F4.1 or a scale of (-1) means that the number in these four positions has one decimal place if an explicit decimal point is not quoted. If this field contained "0012", it is read as "1.2", however if it contained 1.234, then 1.234 is the value that would be stored.

The decimal positions describe an implicit input format and an implicit output format. They do not describe the maximum number of decimal positions that can be stored in a field. If a number is defined as D10.0, it can still be computed equal to "0.12345" and .12345 is stored.

On some schema definition commands (VALUE LABELS, VAR RANGES, MISSING VALUES etc.), numbers can be specified. If the number has a decimal component, specify an explicit decimal point and the actual value regardless of any input definition of implied decimals. For example:
MISSING VALUES RESULT (99.99)

homecontents start chapter top of pagebottom of pagenext page index