Annotation

Ok, it looks like we should use name space extensions under <xs:appinfo> for our extensions, as it’s the standard way and won’t give us any back end problems. This does make the resulting schema definition clunky, but it would not necessarily always be hand edited – we would expect a simple ER design tool to generate it, for example, and we envisage translators going from SQL DDL (or any other data model representation) to a Theon XML Schema. We could even provide a slightly abstracted, more user friendly version of the schema language that is then translated via XSLT to the base standard. So we are not overly concerned about the clunkiness at this level. Next is to define the structures under <xs:appinfo> to represent: a simple RBAC (Role Based Access Control) system (because XACML looks fiendish and massively over complex for what we need); triggers and procedure wrappers (procedures themselves are code and could be in all manner of different base languages, so we don’t intend to do anything other than provide a way of encapsulating them for version control and distribution); and indexes, as these are used extensively for performance improvement in the current system.
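
To make that concrete, a table definition might carry something like the following under its annotation. This is only a rough sketch of the shape of the idea – the “thn” prefix, its namespace URI and all of the element and attribute names are placeholders, not a settled design:

  <xs:annotation>
    <xs:appinfo xmlns:thn="http://example.org/theon/ancillary">
      <!-- illustrative only: the prefix, namespace URI and element names are placeholders -->
      <thn:role name="staff_admin">
        <thn:permit operation="select"/>
        <thn:permit operation="update"/>
      </thn:role>
      <thn:index name="person_surname_idx" method="btree">
        <thn:column name="surname"/>
      </thn:index>
      <thn:trigger name="person_audit" when="after" event="insert">
        <!-- the procedure body itself stays opaque; we only wrap it for
             version control and distribution -->
        <thn:procedure language="plpgsql" ref="audit_person"/>
      </thn:trigger>
    </xs:appinfo>
  </xs:annotation>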

Extending XML-Schema

Natively in XML Schema we can nicely express tables, columns, keys and referential integrity. However, our data model schema needs more than this – we need to express permissions/roles, triggers/procedures and indexes/views at the very least. We already use the annotation extension feature of XML Schema for adding some meta data (documentation and data supporting the existing user interface) so what are the options for adding these other structures, which are not strictly speaking meta data (although they would not be present in an XML document that instantiated content from the schema)?

  • Use a separate name space with structures under <xs:appinfo>, as we do for meta data.
  • Use foreign namespaced attributes. These can be added to any XSD element and work a bit like extensions under <xs:appinfo>, but are more restricted since they can only take simple types.
  • Extend our “implicit” structure of database->clusters->tables->columns with a parallel “abstract” structure representing the additional information. This would mean, for example, defining <xs:element name="wibble" abstract="true" type="Role"> where “Role” is defined like any other complexType. Such elements could never be instantiated but are still part of the schema, which we process directly. This is probably the most user friendly approach, but we would need a separate schema to define it (i.e. this would move out of our existing schema representation into an “instantiation” (substituted type) of some kind of “ancillary data” schema).
  • Copy and extend the base XML-Schema directly. The problem with this approach is that it would likely break standard parsing tools.

For now we will assume a structure under <xs:appinfo> with a separate namespace defining the schema for it as we do for meta data.
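
For illustration, the chosen approach looks roughly like this (the “anc” prefix, namespace URI and element names below are invented for the sketch and are not our actual meta data vocabulary):

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:anc="http://example.org/theon/ancillary"
             elementFormDefault="qualified">
    <xs:element name="person">
      <xs:annotation>
        <xs:documentation>Internal documentation for the table.</xs:documentation>
        <xs:appinfo>
          <!-- content in a foreign namespace: ignored by standard validators,
               read by our own back end -->
          <anc:displayName>People</anc:displayName>
        </xs:appinfo>
      </xs:annotation>
      <xs:complexType>
        <xs:sequence>
          <xs:element name="surname" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>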

PostgreSQL Limitations

So, as anticipated, we are going to have to do significant rewrites of the existing triggers and procedures. PostgreSQL triggers are unfortunately just not as sophisticated as Ingres rules. PostgreSQL does not support triggers that fire only on updates to specific columns. PostgreSQL does not support WHERE-type clauses to further constrain when triggers fire. Most critically, PostgreSQL does not support passing the triggering data to “statement level” triggers (it does for “row level” triggers) – this is used a lot in existing mark processing for efficiency, as the data set from the triggering statement is used as a virtual temporary table to join against (limit the data sets of) other tables for updates.

Prototype XML Schema

We now have a prototype XML Schema built from existing dictionary information. It includes some annotation which preserves meta-data from the dictionary that just passes through to the backend. The schema structure roughly matches the proposal. Next is to produce SQL DDL from this to load the schema into PostgreSQL, as well as to load the user interface data into PostgreSQL. We are also now looking at how to incorporate rules/triggers, procedures and permissions into all of this.

Theon

Our new database system (or revamped system if you prefer) needs a proper name. Following a suggestion from Iain Rae I think we should call it Theon. Euclid (the mathematician) is most famous for writing a set of 13 books called Elements. However, historians no longer believe Euclid had sole authorship. Many other mathematicians at the time wrote commentaries that later became incorporated into the work we know today as Elements. Theon of Alexandria was one of these. Given our local system was originally to be replaced by the Euclid project, but is now going to be retained to preserve local functionality supplemental to Euclid, this name seems most appropriate and perhaps heralds a more collaborative approach to future development.

XML Schema

Proposal! Individual tables will be represented in XML Schema. We will use “annotation” and a separate name space to support meta-data associated with the data model which cannot be represented by XML Schema itself – this is internal table, column, key, relation and cluster documentation (also used in the interface) as well as “appinfo” types for inline SQL as necessary. We will use multiple name spaces – one for the base data model and one for each fork (organisational unit). The final schema for any database defines the name spaces it sources its model from and imports the specific table definitions into its schema. XML Schema also supports an inheritance model, so, for example, one fork could extend a base table definition to easily add columns local to that fork (name space). In principle, using name spaces means the model could be completely distributed; however, in the first instance we will start with one central repository. Structurally our repository will look like this (an illustrative table definition is sketched after the structure).

  • core/
    • tables/
      • table.xsd
        Each table is represented in XML Schema using the name space of the parent (“core” in this case).
    • clusters/
      • cluster.xsd
        Each cluster is represented in XML Schema using the name space of the parent (“core” in this case). A cluster simply imports a number of tables which form a natural grouping and can act as an independent unit – for example, you might have a cluster to represent “staff data”, “location data” or “inventory data”. A table will often belong to more than one cluster.
  • unit/
    • name/
      Directory for each local organisational unit (name space); this would be “inf” for example. Remote organisational units (name spaces) would (virtually) inherit the core into a separate repository (with the same defined structure). They would also fork the release mechanism, in that a release would be tagged relative to the “core” release, whereas local units use the same release tag as the “core”.

      • tables/
        • table.xsd
          Additional tables (or extensions/rewrites to “core” tables) for this unit are represented in XML Schema using the name space of the parent (“inf” for example).
      • clusters/
        • cluster.xsd
          Each additional cluster for this unit is represented in XML Schema using the name space of the parent (“inf” for example). Often a cluster definition here will be a straight copy of one from “core” but replacing some or all of the table definitions from the name space of this tree.
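
As a purely illustrative sketch of the above (namespace URIs, file names and columns are invented, not part of any agreed layout), core/tables/person.xsd might define a table as a named complex type in the “core” name space, and a unit such as “inf” could then extend it to add a local column:

  <!-- core/tables/person.xsd -->
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             targetNamespace="http://example.org/theon/core"
             elementFormDefault="qualified">
    <xs:complexType name="person">
      <xs:sequence>
        <xs:element name="surname" type="xs:string"/>
        <xs:element name="forename" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:schema>

  <!-- unit/inf/tables/person.xsd : extends the core table with a local column -->
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             targetNamespace="http://example.org/theon/inf"
             xmlns:core="http://example.org/theon/core"
             elementFormDefault="qualified">
    <xs:import namespace="http://example.org/theon/core"
               schemaLocation="../../core/tables/person.xsd"/>
    <xs:complexType name="person">
      <xs:complexContent>
        <xs:extension base="core:person">
          <xs:sequence>
            <xs:element name="uun" type="xs:string"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:schema>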

The structure above is release managed in a similar fashion to LCFG and “database profiles” (see below) can choose whether to use the current “develop” (essentially live), “testing” (what will be the next stable) or “stable” release. The SQL DDL is built automatically from the XML Schema definition (for each profile) and distributed to upgrade the appropriate database based on the defined release schedule.

A “database profile” is much like an LCFG machine profile but refers to the current data model schema for a database server rather than a machine configuration. It is written in XML Schema and imports clusters and tables from name spaces in the “core” or specific “units”. The release for a profile is defined via an “appinfo” annotation. This can set a specific release number or one of the pseudo-releases above. A database profile is always live just like an LCFG machine profile and changes will propagate immediately to the corresponding server. The structure (in the same repository as above) is as below, followed by a sketch of an example profile.

  • live/
    • name/
      Directory for each local organisational unit (name space), this would be “inf” for example.

      • database.xsd
        Each database is represented in XML Schema using the name space of the parent (“inf” for example). Essentially a database definition defines the name spaces it will be using (“core” or each “unit”) to source cluster and/or table definitions. It will then explicitly import the cluster or table definitions from the required name spaces that will define the data model for the database. Generally we would expect just clusters to be imported. A database definition will also have additional annotation to define the release it is using as well as the “push” destination for the model (where the actual server is running etc.). A database definition could in principle import another one and could itself define tables and clusters – largely this is not encouraged (much like putting configuration other than specific hardware into an LCFG machine profile).
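
A minimal sketch of such a profile (again, all names, URIs and annotation elements are illustrative only – the real annotation name space for release and push settings is still to be defined):

  <!-- live/inf/infdb.xsd : illustrative only -->
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             targetNamespace="http://example.org/theon/live/inf"
             elementFormDefault="qualified">
    <xs:annotation>
      <xs:appinfo xmlns:prof="http://example.org/theon/profile">
        <!-- per-profile settings carried as annotation rather than instance data -->
        <prof:release>stable</prof:release>
        <prof:push>pgserver.example.org</prof:push>
      </xs:appinfo>
    </xs:annotation>
    <!-- pull in the cluster definitions that make up this database's model -->
    <xs:import namespace="http://example.org/theon/inf"
               schemaLocation="../../unit/inf/clusters/staff.xsd"/>
  </xs:schema>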

Issues:

  • need to look at what happens when multiple (identical) schema definitions are imported; for example, different clusters can use the same table, and if multiple clusters referring to the same table are imported into a database, that will result in multiple instances in the schema – these may just end up being ignored or will have to be processed out.
  • need to define how the “changes in schema” will result in suitable DDL between any two releases
  • need to define the schema template and annotation name spaces (also held in repository somewhere)

The backend to the repository must do validation, compilation and publication. Validation confirms that the XML Schema for any “database profile” is correct. Compilation converts the XML Schema for each profile into SQL (generally the changes from the previous release, though an initialisation SQL must also be produced), as well as generating SQL to alter (or initialise) the user interface for the database as defined. Publication pushes the SQL onto the database server, which then runs it to alter its internal structure to match for the particular database. Publication must also push out the user interface DDL to the server (both must be applied for consistency). Publication is also likely to generate web pages describing the schema for a particular database (as well as the abstract “core” and “unit” schemas), plus maybe the differences between them, onto the web site to replace the existing generated data model pages.

Essentially our mapping process will populate the “core” structure above. We will then define a basic “inf” unit that, at the outset, just uses all of “core”. We will then define a basic “inf” database profile that just uses all of the “inf” unit name space definitions. We will then need to write a prototype back end to do an initial production of SQL to populate PostgreSQL and the user interface, to support further testing before we do the real back end (including validation and automated publication).

Abstract Framework

Been thinking about this. It needs to do a few things:

  • represent the data model schema in an abstracted way, that is, we really don’t want raw SQL in here, although we may have to allow an opt-out for some twiddly bits that can’t easily be represented otherwise
  • support a back-end to produce an SQL DDL for the data model (for initializing a PostgreSQL database with the schema)
  • support a back-end to populate the user interface database with a description of the schema (since the user interface won’t currently talk to the internal system catalog of PostgreSQL)
  • support a back-end that can produce a “changes SQL DDL” that will alter table structures from one particular release of the model to another (and the complementary user interface data) so that we can implement something akin to the develop/testing/stable LCFG release mechanism
  • provide a modular approach to the data model – such that different bits can be combined together and function independently of the whole

After some thought it looks like the best way to do all of the above will be to use XML-Schema as the representational format in a multi-file structured/validated Subversion repository with some back-end hooks to do the raw SQL creation. More details on the overall structure and how releases will be managed shortly.

Mapping Summary

Table Names:

  Ingres    PostgreSQL
  NAME      NAME, or ing_NAME if the name would then conflict with a reserved word

Column Names (“tecdb”):

  Ingres    PostgreSQL
  NAME#     NAME
  NAME      ing_NAME (if name conflicts with reserved word or other mapped name in same table)
  “NAME”    ing_NAME

Column Names (“infdb”):

  Ingres     PostgreSQL
  oid@       _gui_oid
  NAME@      NAME
  NAME#      NAME
  _NAME$d    _NAME
  NAME$d     _NAME
  NAME       ing_NAME (if name conflicts with reserved word or other mapped name in same table)
  “NAME”     ing_NAME

Column Types:

  Ingres            Dictionary          PostgreSQL                        New Dictionary
  char(N)           Character/*/N       varchar(N)                        Character/(as defined)/N
  varchar(N)        Character/*/N       varchar(N)                        Character/(as defined)/N
  integer           Numeric/Fixed/*     integer                           Numeric/Fixed/(as defined)
  i1                Numeric/Fixed/*     integer                           Numeric/Fixed/(as defined)
  tinyint           Numeric/Fixed/*     integer                           Numeric/Fixed/(as defined)
  smallint          Numeric/Fixed/*     integer                           Numeric/Fixed/(as defined)
  float             Numeric/Floating/8  float                             Numeric/Floating/8
  date              Abstract/Date       date                              Abstract/Date
  char(N)           Abstract/Set/N      ENUMERATED TYPE ing_TABLE_COLUMN  Abstract/Set/N
  varchar(N)        Abstract/Set/N      ENUMERATED TYPE ing_TABLE_COLUMN  Abstract/Set/N
  any integer type  Numeric/Set/N       ENUMERATED TYPE ing_TABLE_COLUMN  Numeric/Set/N
  any integer type  Abstract/Boolean    boolean                           Abstract/Boolean
  money             Abstract/Money      float/decimal depending           Abstract/Money

Column Constraints:

  Ingres                PostgreSQL
  not null not default  not null
  not null              not null

Other:

  Dictionary     PostgreSQL
  Key and AKMap  PRIMARY KEY, REFERENCES and UNIQUE column groups
  Relation       REFERENCES definitions (always with ON DELETE/UPDATE CASCADE)
  Parameter      CREATE TYPE (for enumerated types)
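
For the Abstract/Set and Parameter rows above, one way the abstract framework could capture an enumerated column is as an XML Schema restriction, which the back end would then turn into the corresponding CREATE TYPE. This is only a sketch – the type name and values are invented for illustration:

  <xs:simpleType name="visit_status">
    <xs:restriction base="xs:string">
      <xs:enumeration value="pending"/>
      <xs:enumeration value="confirmed"/>
      <xs:enumeration value="cancelled"/>
    </xs:restriction>
  </xs:simpleType>
  <!-- a column declared with this type would map to an enumerated type
       named along the lines of ing_TABLE_COLUMN in PostgreSQL -->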

Process

The mapping process for “infdb” will work as follows. We will use the dictionary information held in “tecdb” as the master, as this is easier to process than the raw Ingres schema and provides virtual type, key and relation data. We will apply the mappings and produce an “abstract framework” for the schema. From this abstract framework we will generate the schema for PostgreSQL, as well as the dictionary information (in the new PostgreSQL format) for the “infdb” database, which gets added into the “tecdb” database. Ultimately the abstract framework will become the productised environment for creating/changing the PostgreSQL schema, with dictionary information for “tecdb” also generated while the user interface continues to require it.

The mapping process for “tecdb” is different, as the master for the schema and the self-referential base data is the user interface. So a branch of the user interface is made which will have the standard mapping applied to the data driving the creation of the dictionary structures, to make them PostgreSQL compliant. The branch will also have the support for PostgreSQL database connections folded in. It will also need changes to support the new types – “boolean” and “enumerated” – although these can probably be mapped at the driver level, leaving the primary code base unchanged. A conversion function will also be needed in query construction for “money” type display – however, again, this can probably be achieved in the driver by always mapping “float” or “decimal” to 2 decimal places for display.

Subsequently the generated schema will need to be back-checked by hand against the original Ingres schema (a semi-automated difference will cover the bulk). Then equivalent mappings must be applied to all the conduits, GenRep scripts and Interface scripts. The Ingres -> PostgreSQL sync process will then need to be implemented to transfer “data” only.

Constraints

In the Ingres schema the only column constraints used in “tecdb” and “infdb” are “not null” and “not null not default”. Most columns are nullable; the exceptions are primary keys, which are “not null not default” (a non-null value must be supplied on insert), and non-obligatory foreign keys. PostgreSQL supports the “not null” constraint so no explicit mapping is required. There is no equivalent in PostgreSQL of “not default”, but since it is always combined with “not null” a value (other than null) must be provided on entry anyway, so it’s superfluous and the mapping will be to drop it (in PostgreSQL “not default” is the effective column behaviour unless an explicit DEFAULT clause is provided).

Uniqueness constraints are done in Ingres by modifying the internal table structure to BTREE or HASH unique on one or more columns. Additional BTREE/HASH indexes may be defined for performance. Tables are also all defined with “no duplicates”, possibly unnecessarily in many cases (probably done to guarantee an earlier incarnation of the user interface would work). Generally uniqueness constraint columns parallel primary key columns, however this is not always the case. PostgreSQL does not have table structure modifications like Ingres; instead the CREATE INDEX syntax is used to create BTREE or HASH indexes. PostgreSQL does not have a direct equivalent to “no duplicates”, although a UNIQUE clause including all columns would have the same effect. We won’t map “no duplicates”. We can use dictionary information to replace uniqueness constraints defined as explicit table structure modifications with PRIMARY KEY or UNIQUE clauses as appropriate. However, there will be some cases that will need to be manually mapped.
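
Where a uniqueness constraint is carried through into the XML Schema representation, it could be expressed natively (rather than under appinfo) and then emitted by the back end as a UNIQUE clause. A minimal sketch, with invented table and column names:

  <!-- illustrative only: a uniqueness constraint expressed natively in XML Schema,
       which the back end could map to a UNIQUE (surname, forename) clause -->
  <xs:element name="people">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="surname" type="xs:string"/>
              <xs:element name="forename" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:unique name="person_name_unique">
      <xs:selector xpath="person"/>
      <xs:field xpath="surname"/>
      <xs:field xpath="forename"/>
    </xs:unique>
  </xs:element>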