
Theon

Our new database system (or revamped system if you prefer) needs a proper name. Following a suggestion from Iain Rae I think we should call it Theon. Euclid (the mathematician) is most famous for writing a set of 13 books called Elements. However, historians no longer believe Euclid had sole authorship: many other mathematicians of the time wrote commentaries that were later incorporated into the work we know today as Elements, and Theon of Alexandria was one of these. Given that our local system was originally to be replaced by the Euclid project, but is now going to be retained to preserve local functionality supplemental to Euclid, this name seems most appropriate, and perhaps heralds a more collaborative approach to future development.

XML Schema

Proposal! Individual tables will be represented in XML Schema. We will use “annotation” and a separate name space to support meta-data associated with the data model which cannot be represented by XML Schema itself – that is, internal table, column, key, relation and cluster documentation (also used in the interface) as well as “appinfo” types for inline SQL as necessary. We will use multiple name spaces – one for the base data model and one for each fork (organisational unit). The final schema for any database defines the name spaces it sources its model from and imports the specific table definitions it needs. XML Schema also supports an inheritance model, so, for example, one fork could extend a base table definition to easily add columns local to that fork (name space). In principle, using name spaces means the model could be completely distributed; however, in the first instance we will start with one central repository. Structurally our repository will look like this.

  • core/
    • tables/
      • table.xsd
        Each table is represented in XML Schema using the name space of the parent (“core” in this case).
    • clusters/
      • cluster.xsd
        Each cluster is represented in XML Schema using the name space of the parent (“core” in this case). A cluster simply imports a number of tables which form a natural grouping and can act as an independent unit – for example, clusters representing “staff data”, “location data” or “inventory data”. A table will often belong to more than one cluster.
  • unit/
    • name/
      Directory for each local organisational unit (name space); this would be “inf” for example. Remote organisational units (name spaces) would (virtually) inherit the core into a separate repository (with the same defined structure). They would also fork the release mechanism, in that a release would be tagged relative to the “core” release, whereas local units use the same release tag as the “core”.

      • tables/
        • table.xsd
          Additional tables (or extensions/rewrites to “core” tables) for this unit are represented in XML Schema using the name space of the parent (“inf” for example).
      • clusters/
        • cluster.xsd
          Each additional cluster for this unit is represented in XML Schema using the name space of the parent (“inf” for example). Often a cluster definition here will be a straight copy of one from “core”, but with some or all of the table definitions replaced by ones from the name space of this tree.

The structure above is release managed in a similar fashion to LCFG and “database profiles” (see below) can choose whether to use the current “develop” (essentially live), “testing” (what will be the next stable) or “stable” release. The SQL DDL is built automatically from the XML Schema definition (for each profile) and distributed to upgrade the appropriate database based on the defined release schedule.

A “database profile” is much like an LCFG machine profile but refers to the current data model schema for a database server rather than a machine configuration. It is written in XML Schema and imports clusters and tables from name spaces in the “core” or specific “units”. The release for a profile is defined via an “appinfo” annotation. This can set a specific release number or one of the pseudo-releases above. A database profile is always live, just like an LCFG machine profile, and changes will propagate immediately to the corresponding server. The structure (in the same repository as above) is as below.

  • live/
    • name/
      Directory for each local organisational unit (name space), this would be “inf” for example.

      • database.xsd
        Each database is represented in XML Schema using the name space of the parent (“inf” for example). Essentially a database definition defines the name spaces it will be using (“core” or each “unit”) to source cluster and/or table definitions. It will then explicitly import the cluster or table definitions from the required name spaces that will define the data model for the database. Generally we would expect just clusters to be imported. A database definition will also have additional annotation to define the release it is using as well as the “push” destination for the model (where the actual server is running and so on). A database definition could in principle import another one, and could itself define tables and clusters – but largely this is not encouraged (much like putting configuration other than specific hardware into an LCFG machine profile).

Issues:

  • need to look at what happens when multiple (identical) schema definitions are imported – for example, different clusters can use the same table, and if multiple clusters referring to the same table are imported into a database that will result in multiple instances in the schema; this may just end up being ignored or will have to be processed out
  • need to define how the “changes in schema” will result in suitable DDL between any two releases
  • need to define the schema template and annotation name spaces (also held in repository somewhere)

The backend to the repository must do validation, compilation and publication. Validation confirms that the XML Schema for any “database profile” is correct. Compilation converts the XML Schema for each profile into SQL (generally just the changes from the previous release, although an initialisation SQL must also be produced), as well as generating SQL to alter (or initialise) the user interface for the database as defined. Publication pushes the SQL onto the database server, which then runs it to alter its internal structure to match for the particular database. Publication must also push out the user interface DDL to the server (both must be applied for consistency). Publication is also likely to generate web pages on the web site describing the schema for a particular database (as well as the abstract “core” and “unit” schemas), and perhaps the differences between them, to replace the existing generated data model pages.
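To make the compilation step concrete, here is a minimal sketch of the initialisation and changes SQL it might emit on the schema side, assuming a purely hypothetical “staff” table (the table and column names are illustrative only, not part of the actual model):

    -- Initialisation SQL: creates the table from scratch for a new database.
    CREATE TABLE staff (
        uun   varchar(16) NOT NULL,
        name  varchar(80),
        PRIMARY KEY (uun)
    );

    -- Changes SQL: upgrades an existing database from the previous release.
    ALTER TABLE staff ADD COLUMN room varchar(16);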

Essentially our mapping process will populate the “core” structure above. We will then define a basic “inf” unit that, at the outset, just uses all of “core”. We will then define a basic “inf” database profile that just uses all of the “inf” unit name space definitions. We will then need to write a prototype back end to do an initial production of SQL to populate PostgreSQL and the user interface, to support further testing before we do the real back end (including validation and automated publication).

Abstract Framework

Been thinking about this. It needs to do a few things:

  • represent the data model schema in an abstracted way, that is, we really don’t want raw SQL in here, although we may have to allow an opt-out for some twiddly bits that can’t be easily otherwise represented
  • support a back-end to produce an SQL DDL for the data model (for initializing a PostgreSQL database with the schema)
  • support a back-end to populate the user interface database with a description of the schema (since the user interface won’t currently talk to the internal system catalog of PostgreSQL)
  • support a back-end that can produce a “changes SQL DDL” that will alter table structures from one particular release of the model to another (and the complementary user interface data) so that we can implement something akin to the develop/testing/stable LCFG release mechanism
  • provide a modular approach to the data model – such that different bits can be combined together and function independently of the whole

After some thought it looks like the best way to do all of the above will be to use XML-Schema as the representational format in a multi-file structured/validated Subversion repository with some back-end hooks to do the raw SQL creation. More details on the overall structure and how releases will be managed shortly.

Mapping Summary

Table Names:

Ingres | PostgreSQL
------ | ----------
NAME   | NAME, or ing_NAME (if the name then conflicts with a reserved word)

Column Names (“tecdb”):

Ingres   | PostgreSQL
-------- | ----------
NAME#    | NAME
NAME     | ing_NAME (if the name conflicts with a reserved word or another mapped name in the same table)
“NAME”   | ing_NAME

Column Names (“infdb”):

Ingres   | PostgreSQL
-------- | ----------
oid@     | _gui_oid
NAME@    | NAME
NAME#    | NAME
_NAME$d  | _NAME
NAME$d   | _NAME
NAME     | ing_NAME (if the name conflicts with a reserved word or another mapped name in the same table)
“NAME”   | ing_NAME

Column Types:

Ingres           | Dictionary         | PostgreSQL                       | New Dictionary
---------------- | ------------------ | -------------------------------- | --------------------------
char(N)          | Character/*/N      | varchar(N)                       | Character/(as defined)/N
varchar(N)       | Character/*/N      | varchar(N)                       | Character/(as defined)/N
integer          | Numeric/Fixed/*    | integer                          | Numeric/Fixed/(as defined)
i1               | Numeric/Fixed/*    | integer                          | Numeric/Fixed/(as defined)
tinyint          | Numeric/Fixed/*    | integer                          | Numeric/Fixed/(as defined)
smallint         | Numeric/Fixed/*    | integer                          | Numeric/Fixed/(as defined)
float            | Numeric/Floating/8 | float                            | Numeric/Floating/8
date             | Abstract/Date      | date                             | Abstract/Date
char(N)          | Abstract/Set/N     | ENUMERATED TYPE ing_TABLE_COLUMN | Abstract/Set/N
varchar(N)       | Abstract/Set/N     | ENUMERATED TYPE ing_TABLE_COLUMN | Abstract/Set/N
any integer type | Numeric/Set/N      | ENUMERATED TYPE ing_TABLE_COLUMN | Numeric/Set/N
any integer type | Abstract/Boolean   | boolean                          | Abstract/Boolean
money            | Abstract/Money     | float/decimal depending          | Abstract/Money

Column Constraints:

Ingres               | PostgreSQL
-------------------- | ----------
not null not default | not null
not null             | not null

Other:

Dictionary    | PostgreSQL
------------- | ----------
Key and AKMap | PRIMARY KEY, REFERENCES and UNIQUE column groups
Relation      | REFERENCES definitions (always with ON DELETE/UPDATE CASCADE)
Parameter     | CREATE TYPE (for enumerated types)
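As a rough illustration of the name and type rows above, here is how a small hypothetical “infdb” table might come out after mapping (the “exam” table and its columns are invented purely for illustration):

    -- Ingres (hypothetical original):
    --   exam@  varchar(8)
    --   mark#  char(2)
    --   total  money
    -- PostgreSQL after the name and type mappings:
    CREATE TABLE exam (
        exam   varchar(8),   -- exam@ -> exam
        mark   varchar(2),   -- mark# -> mark, char -> varchar
        total  float         -- money -> float (or decimal, see Column Type Mapping below)
    );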

Process

The mapping process for “infdb” will work as follows. We will use the dictionary information held in “tecdb” as the master, as this is easier to process than the raw Ingres schema and provides virtual type, key and relation data. We will apply the mappings and produce an “abstract framework” for the schema. From this abstract framework we will generate the schema for PostgreSQL as well as the dictionary information (in the new PostgreSQL format) for the “infdb” database, which gets added into the “tecdb” database. Ultimately the abstract framework will become the productised environment for creating/changing the PostgreSQL schema, with dictionary information for “tecdb” also generated for as long as the user interface continues to require it.

The mapping process for “tecdb” is different, as the master for the schema and the self-referential base data is the user interface. So a branch of the user interface is made, with the standard mapping applied to the data that drives the creation of the dictionary structures, to make them PostgreSQL compliant. The branch will also have the support for PostgreSQL database connections folded in. It will also need changes to support the new types – “boolean” and “enumerated” – although these can probably be mapped at the driver level, leaving the primary code base unchanged. A conversion function will also be needed in query construction for “money” type display – however, again this can probably be achieved in the driver by always mapping “float” or “decimal” to two decimal places for display.

Subsequently the generated schema will need to be back-checked by hand against the original Ingres schema (with a semi-automated difference to cover the bulk). Then equivalent mappings must be applied to all the conduits, GenRep scripts and Interface scripts. The Ingres -> PostgreSQL sync process will then need to be implemented to transfer “data” only.

Constraints

In the Ingres schema the only column constraints used in “tecdb” and “infdb” are “not null” and “not null not default”. Most columns are nullable; the exceptions are primary keys, which are “not null not default” (they must have a non-null value on insert), and non-obligatory foreign keys. PostgreSQL supports the “not null” constraint so no explicit mapping is required. There is no equivalent in PostgreSQL of “not default”, but since it is always combined with “not null” a value (other than null) must be provided on entry anyway, so it is superfluous and the mapping will be to drop it (in PostgreSQL, having no default is the normal behaviour for a column unless an explicit DEFAULT clause is provided).
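As a tiny sketch of that mapping, assuming a hypothetical “student” table (names invented for illustration):

    -- Ingres:  matric  char(8) not null not default
    -- PostgreSQL: "not default" is dropped, "not null" carries straight over
    CREATE TABLE student (
        matric  varchar(8) NOT NULL,
        name    varchar(80)            -- nullable, as most columns are
    );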

Uniqueness constraints are done in Ingres by modifying the internal table structure to BTREE or HASH unique on one or more columns. Additional BTREE/HASH indexes may be defined for performance. Tables are also all defined with “no duplicates”, possibly unnecessarily in many cases (probably done to guarantee that an earlier incarnation of the user interface would work). Generally uniqueness constraint columns parallel primary key columns, however this is not always the case. PostgreSQL does not have table structure modifications like Ingres; instead the CREATE INDEX syntax is used to create BTREE or HASH indexes. PostgreSQL does not have a direct equivalent to “no duplicates”, although a UNIQUE clause including all columns could be used to the same effect. We won’t map “no duplicates”. We can use dictionary information to replace uniqueness constraints defined as explicit table structure modifications with PRIMARY KEY or UNIQUE clauses as appropriate. However, there will be some cases that will need to be manually mapped.
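A sketch of that replacement, assuming the hypothetical “student” table from above had its Ingres structure modified to be BTREE unique on “matric”:

    -- Where the unique column parallels the primary key:
    ALTER TABLE student ADD PRIMARY KEY (matric);
    -- or, where it does not, an explicit unique index instead:
    CREATE UNIQUE INDEX student_matric_idx ON student USING btree (matric);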

Keys/Relations

In the Ingres schema there is no information about what columns in a table form primary or foreign identifiers, nor the relations between tables. However PostgreSQL has the syntax to make this explicit in the schema using the PRIMARY KEY, FOREIGN KEY and REFERENCES constraints. Hence we should add this into the schema as part of the mapping. This can be achieved by using information in the user interface database.
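A sketch of what making this explicit might look like, continuing with the hypothetical “student” table and adding an invented “exam_result” table, using the REFERENCES/ON DELETE/UPDATE CASCADE convention from the mapping summary:

    CREATE TABLE exam_result (
        matric  varchar(8) NOT NULL
                REFERENCES student (matric)
                ON DELETE CASCADE ON UPDATE CASCADE,
        exam    varchar(8) NOT NULL,
        mark    float,
        PRIMARY KEY (matric, exam)
    );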

Table Mapping

We shouldn’t have to make any changes for mapping table names from Ingres to PostgreSQL. In the unlikely event that a table name clashes with a reserved word we will simply prefix it with “ing_”.

Column Type Mapping

The types for columns used in the “tecdb” database in Ingres are: “char”, “varchar”, “tinyint”, “integer”, “smallint” and “date”. Except for “tinyint” (also called “i1”), all of these are standard SQL and the same in PostgreSQL, so they don’t need to be mapped. However, for performance reasons we should map “char” to “varchar”. In doing so we need to be careful that the subtle semantic differences between the two won’t affect input/output via the user interface and reports. We could safely map “tinyint” to “smallint”, as it is just a one byte signed integer and PostgreSQL does not have an equivalent; however, again for performance reasons, we will simply map “tinyint”/“i1” and “smallint” to “integer”. Mapping of the “date” type is likely to be more problematic, as the Ingres date format differs by default from the ANSI date format. We will need to look at this when it comes to the actual values, how they are displayed by default in SQL output and what the current values are (the Ingres default date type supports times as well as intervals, which is not the SQL standard).

For the “infdb” database the following column types are used in addition to those above: “money” and “float”. The “money” type is used extensively across mark processing, both for input value columns and derived value columns. The “float” type is used in some intermediate mark combination columns (not generally user accessed). The “money” type was used largely for easy entry/display (two digits shown after the decimal point by default). For some reason which is lost in the mists of time, “money” was preferred over something like “decimal(7,4)” (perhaps “decimal” wasn’t available in Ingres back then). It is likely that “money” was used because the input/output of a “money” type value was more user manageable (that is, it is essentially an 8 byte integer for fast processing and accuracy, but is displayed/rounded to two digits after the decimal point, which is much easier for users to view than something like 22.00030303, for example). Processing of the “decimal” type is very slow according to the PostgreSQL documentation.

In summary, there are accuracy issues with the “money” type (if you use multiplication/division and need greater precision than four decimal places), performance/display issues with the “decimal” type and display issues with the “float” type. So we should choose a data type with higher precision and just round the result for display. This means either “decimal” or “float” as the mapping for all “money” and “float” columns (some may need higher precision as they aggregate). Ideally we would use “float” for performance (and since we don’t tend to compare the column values) or “decimal” for accuracy, but either would be rounded on display (and in reports) to two decimal places. This can be achieved by wrapping every access in a rounding function, but this would mean altering the user interface and reports. Also, the comparative QA of mark calculation using “float” (or “decimal” with greater precision) against “money” will be tricky. In general we should now be avoiding the use of “money” for calculations. Calculations on “float”/“money” types may differ from “decimal” types in the handling of NaN situations; this will need to be looked at more carefully.
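For example, if the mapping ends up as “float” (or “decimal”), the round-on-display approach would mean wrapping accesses roughly like this (table and column names hypothetical):

    -- stored at full precision, rounded only for display/reports;
    -- round(x, 2) needs a numeric argument, hence the cast for a float column
    SELECT matric, round(mark::numeric, 2) AS mark
    FROM exam_result;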

The “integer” type is used in a lot of places to represent a “boolean”, since Ingres did not have this specific type. The user interface dictionary identifies these so that an integer tagged as a boolean is displayed differently to the user. These should all really now be mapped to “boolean” in PostgreSQL. This must be done with care however – the user interface will need to be changed to quote 0 or 1 as a value for such a column (or use the preferred TRUE/FALSE SQL standard keywords). Also, booleans are returned as the letters ‘t’ or ‘f’, and the user interface would need to be updated to manage this. It is likely that boolean columns used in SQL reports are treated as equivalent to integers, as that is the native base type in Ingres – so we would need to map back to an integer to maintain compatibility. The actual values would need to be translated when copied from Ingres to PostgreSQL.
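A sketch of the boolean mapping and the report compatibility point, using a hypothetical “visiting” flag:

    -- Ingres: visiting was an integer (0/1) tagged as a boolean in the dictionary
    CREATE TABLE staff_record (
        uun       varchar(16) PRIMARY KEY,
        visiting  boolean
    );
    -- values come back as 't'/'f'; a report expecting an integer could cast:
    SELECT uun, visiting::int AS visiting FROM staff_record;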

The other virtual data type is a “set”. This is represented as a “char” or “varchar”, but the user interface is aware of a fixed set of possible values, presenting this as a list of options to choose from rather than a free text entry field. This can be achieved in PostgreSQL using enumerated types (the CREATE TYPE statement). Hence the mapping of a “char” or “varchar” column which is tagged as a “set” will be to a custom enumerated type defined by CREATE TYPE. The list of values can be extracted automatically from the user interface dictionary database in order to populate the list of options for the set. The name of the set should be the name of the table and the name of the original column, separated by an underscore and with a leading “ing_”. For example, table “course” with column “choice”, which is a “char(15)”, will be mapped to the “ing_course_choice” type, an enumerated type created specifically for that column.
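So the “course”/“choice” example would come out roughly as follows (the value list shown here is invented, since in practice it would be extracted from the user interface dictionary):

    -- enumerated type generated from the dictionary value list (values are illustrative)
    CREATE TYPE ing_course_choice AS ENUM ('compulsory', 'optional', 'outside');
    CREATE TABLE course (
        code    varchar(8) PRIMARY KEY,
        choice  ing_course_choice
    );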

Key Columns in “tecdb”

Of course we need to look at mapping column names in “tecdb” (the user interface database) as well as “infdb”. This is slightly different. The only suffix used is “#”, which in this database indicates a column that is (or is part of) a primary/foreign key. So we will just map “wibble#” to “wibble” for this database. Any resultant name conflicts will be mapped with a leading “ing_”.

Mappings in “tecdb” will have to be replicated into the user interface build data structures so that the bootstrap cache refers to the correct new names. This in itself is not difficult, but it will necessitate starting a branched version of the interface. However, this will need to be done for PostgreSQL support anyway, and putting the whole thing under version control is part of the Phase 1 work.