Been thinking about this. It needs to do a few things:
- represent the data model schema in an abstracted way; that is, we really don’t want raw SQL in here, although we may have to allow an opt-out for some twiddly bits that can’t easily be represented otherwise
- support a back-end to produce SQL DDL for the data model (for initializing a PostgreSQL database with the schema)
- support a back-end to populate the user interface database with a description of the schema (since the user interface won’t currently talk to the internal system catalog of PostgreSQL)
- support a back-end that can produce a “changes SQL DDL” that will alter table structures from one particular release of the model to another (and the complementary user interface data) so that we can implement something akin to the develop/testing/stable LCFG release mechanism
- provide a modular approach to the data model – such that different bits can be combined together and function independently of the whole
After some thought it looks like the best way to do all of the above will be to use XML-Schema as the representational format in a multi-file structured/validated Subversion repository with some back-end hooks to do the raw SQL creation. More details on the overall structure and how releases will be managed shortly.
The mapping process for “infdb” will work as follows. We will use the dictionary information held in “tecdb” as the master, as this is easier to process than the raw Ingres schema and provides virtual type, key and relation data. We will apply the mappings and produce an “abstract framework” for the schema. From this abstract framework we will generate the schema for PostgreSQL as well as the dictionary information (in the new PostgreSQL format) for the “infdb” database, which gets added into the “tecdb” database. Ultimately the abstract framework will become the production environment for creating/changing the PostgreSQL schema, with dictionary information for “tecdb” also generated for as long as the user interface continues to require it.
The mapping process for “tecdb” is different, as the master for the schema and self-referential base data is the user interface. So a branch of the user interface is made, to which the standard mapping is applied on the data driving the creation of the dictionary structures, to make them PostgreSQL compliant. The branch will also have the support for PostgreSQL database connections folded in. It will also need changes to support the new types, “boolean” and “enumerated”, although these can probably be mapped at the driver level, leaving the primary code base unchanged. A conversion function will also be needed in query construction for “money” type display; again, this can probably be achieved in the driver by always mapping “float” or “decimal” to two decimal places for display.
Subsequently the generated schema will need to be back-checked by hand against the original Ingres schema (a semi-automated diff should cover the bulk). Then equivalent mappings must be applied to all the conduits, GenRep scripts and Interface scripts. The Ingres -> PostgreSQL sync process will then need to be implemented to transfer “data” only.
In the Ingres schema the only column constraints used in “tecdb” and “infdb” are “not null” and “not null not default”. Most columns are nullable; the exceptions are primary keys, which are “not null not default” (must have a non-null value on insert), and non-obligatory foreign keys. PostgreSQL supports the “not null” constraint so no explicit mapping is required. There is no equivalent in PostgreSQL of “not default”, but since it is always combined with “not null” a value (other than null) must be provided on entry anyway, so it is superfluous and the mapping will be to drop it (in PostgreSQL “not default” is effectively the default behaviour for a column unless an explicit DEFAULT clause is provided).
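A minimal sketch of this constraint mapping as it might look in one of the back-end hooks (the function name and the plain-string input format are illustrative assumptions, not an existing tool):

```python
def map_constraint(ingres_constraint: str) -> str:
    """Map an Ingres column constraint string to its PostgreSQL equivalent.

    Both "not null" and "not null not default" become NOT NULL: the
    "not default" part is dropped because a PostgreSQL column has no
    default anyway unless an explicit DEFAULT clause is given.
    """
    if ingres_constraint.strip().startswith("not null"):
        return "NOT NULL"
    return ""  # nullable column: no constraint emitted

print(map_constraint("not null not default"))  # NOT NULL
```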
Uniqueness constraints are done in Ingres by modifying the internal table structure to BTREE or HASH unique on one or more columns. Additional BTREE/HASH indexes may be defined for performance. Tables are also all defined with “no duplicates”, possibly unnecessarily in many cases (probably done to guarantee that an earlier incarnation of the user interface would work). Generally uniqueness constraint columns parallel primary key columns, but this is not always the case. PostgreSQL does not have table structure modifications like Ingres; instead the CREATE INDEX syntax is used to create BTREE or HASH indexes. PostgreSQL has no direct equivalent to “no duplicates”, although a UNIQUE clause covering all columns would have the same effect. We won’t map “no duplicates”. We can use dictionary information to replace uniqueness constraints defined as explicit table structure modifications with PRIMARY KEY or UNIQUE clauses as appropriate. However, there will be some cases that will need to be mapped manually.
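As a sketch, the replacement of an Ingres structure modification could be generated along these lines (the helper and the index naming convention are invented for illustration; the dictionary would tell us whether the indexed columns are in fact the primary key):

```python
def unique_ddl(table: str, columns: list, is_primary_key: bool) -> str:
    """Replace an Ingres BTREE/HASH "unique" structure modification with
    the equivalent PostgreSQL DDL."""
    cols = ", ".join(columns)
    if is_primary_key:
        return f"ALTER TABLE {table} ADD PRIMARY KEY ({cols});"
    # non-key uniqueness becomes a unique index; plain performance
    # indexes would use CREATE INDEX without UNIQUE
    return f"CREATE UNIQUE INDEX {table}_{'_'.join(columns)}_key ON {table} ({cols});"

print(unique_ddl("person", ["person_id"], True))
# ALTER TABLE person ADD PRIMARY KEY (person_id);
```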
In the Ingres schema there is no information about what columns in a table form primary or foreign identifiers, nor the relations between tables. However PostgreSQL has the syntax to make this explicit in the schema using the PRIMARY KEY, FOREIGN KEY and REFERENCES constraints. Hence we should add this into the schema as part of the mapping. This can be achieved by using information in the user interface database.
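Sketched as table-constraint generation (the input shape and the table/column names are invented for illustration), the key information pulled from the user interface database would come out something like:

```python
def key_clauses(pk_cols, fks):
    """Build PRIMARY KEY and FOREIGN KEY table constraints from
    dictionary information.

    fks is a list of (local_columns, referenced_table, referenced_columns).
    """
    clauses = [f"PRIMARY KEY ({', '.join(pk_cols)})"]
    for local, ref_table, ref in fks:
        clauses.append(
            f"FOREIGN KEY ({', '.join(local)}) "
            f"REFERENCES {ref_table} ({', '.join(ref)})"
        )
    return clauses

for clause in key_clauses(["mark_id"], [(["person"], "person", ["person_id"])]):
    print(clause)
```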
We shouldn’t have to make any changes for mapping table names from Ingres to PostgreSQL. In the unlikely event where a table name would clash with a reserved word we will simply prefix with “ing_”.
The types for columns used in the “tecdb” database in Ingres are: “char”, “varchar”, “tinyint”, “integer”, “smallint” and “date”. Except for “tinyint” (also called “i1”), all of these are standard SQL and are the same in PostgreSQL, so they don’t need to be mapped. However, for performance reasons we should map “char” to “varchar”. In doing so we need to be careful that the subtle semantic differences between the two won’t affect input/output via the user interface and reports. We could safely map “tinyint” to “smallint”, as it is just a one byte signed integer and PostgreSQL does not have an equivalent; however, again for performance reasons, we will simply map “tinyint”/“i1” and “smallint” to “integer”. Mapping of the “date” type is likely to be more problematic as the Ingres date format differs by default from the ANSI date format. We will need to look at this when it comes to the actual values: how they are displayed by default in SQL output and what the current values are (the Ingres default date type supports times as well as intervals, which is not the SQL standard).
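The type mapping so far, sketched as a simple lookup (the handling of “date” is deliberately left as a pass-through pending the investigation above):

```python
# Proposed Ingres -> PostgreSQL type mapping for "tecdb": "char" becomes
# "varchar" and all the small integer types collapse to "integer", both
# for performance.
TYPE_MAP = {
    "char": "varchar",
    "varchar": "varchar",
    "tinyint": "integer",
    "i1": "integer",        # Ingres alias for tinyint
    "smallint": "integer",
    "integer": "integer",
    "date": "date",         # semantics still to be checked
}

def map_type(ingres_type: str) -> str:
    return TYPE_MAP[ingres_type]

print(map_type("tinyint"))  # integer
```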
For the “infdb” database the following column types are used in addition to those above: “money” and “float”. The “money” type is used extensively across mark processing, both for input value columns and derived value columns. The “float” type is used in some intermediate mark combination columns (not generally user accessed). The “money” type was used largely for easy entry/display (two digits shown after the decimal point by default). For some reason lost in the mists of time, “money” was preferred over something like “decimal(7,4)” (perhaps “decimal” wasn’t available in Ingres back then). It is likely that “money” was used because the input/output of a “money” value was more manageable for users: it is essentially an 8 byte integer for fast processing and accuracy, but displayed/rounded to two digits after the decimal point, which is much easier for users to view than something like 22.00030303 for example. Processing of the “decimal” type is very slow according to the PostgreSQL documentation. In summary, there are accuracy issues with the “money” type (if you use multiplication/division and need greater precision than four decimal places), performance/display issues with the “decimal” type and display issues with the “float” type. So we should choose a data type with higher precision and just round the result for display. This means either “decimal” or “float” as the mapping for all “money” and “float” columns (some may need higher precision as they aggregate). Ideally we would use “float” for performance (and since we don’t tend to compare the column values) and “decimal” for accuracy, but both would be rounded on display (and in reports) to two decimal places. This can be achieved by wrapping every access in a rounding function, but that would mean altering the user interface and reports. Also the comparative QA between mark calculation using “float” (or “decimal” with greater precision) and “money” will be tricky.
In general we should now be avoiding the use of “money” for calculations. Calculations on “float”/“money” types may differ from “decimal” types in the handling of NaN situations; this will need to be looked at more carefully.
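If we go the rounding-function route, the query-construction change in the driver might look like the following sketch. Note that in PostgreSQL round(x, 2) is only defined for “numeric”, so a float column needs a cast first; the function name and type labels here are made up:

```python
def display_expr(column: str, pg_type: str) -> str:
    """Wrap a former "money" (or "float") column for two-decimal display."""
    if pg_type in ("float", "real", "double precision"):
        # round(double precision, int) doesn't exist in PostgreSQL,
        # so cast to numeric before rounding
        return f"round(({column})::numeric, 2)"
    return f"round({column}, 2)"  # numeric/decimal columns round directly

print(display_expr("exam_mark", "float"))   # round((exam_mark)::numeric, 2)
print(display_expr("exam_mark", "decimal")) # round(exam_mark, 2)
```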
The “integer” type is used in a lot of places to represent a “boolean”, since Ingres did not have this specific type. The user interface dictionary identifies these, so that an integer tagged as a boolean is displayed differently to the user. These should all really now be mapped to “boolean” in PostgreSQL. This must be done with care however – the user interface will need to be changed to quote 0 or 1 as a value for such a column (or use the preferred TRUE/FALSE SQL standard keywords). Also, PostgreSQL returns booleans as the letters ‘t’ or ‘f’, and the user interface would need to be updated to manage this. It is likely that boolean columns used in SQL reports are treated as equivalent to integers, as that is the native base type in Ingres – so we would need to map back to an integer to maintain compatibility. The actual values would also need to be translated when copied from Ingres to PostgreSQL.
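The value translation needed in each direction could be as small as this (a sketch; the driver-level hooks that would call these are hypothetical):

```python
def int_to_bool_literal(value: int) -> str:
    """Translate an Ingres 0/1 integer into a PostgreSQL boolean literal,
    for use when copying data across."""
    return "TRUE" if value else "FALSE"

def bool_output_to_int(output: str) -> int:
    """Translate PostgreSQL's 't'/'f' boolean output back to the 0/1
    integers the user interface and reports currently expect."""
    return 1 if output == "t" else 0

print(int_to_bool_literal(1))   # TRUE
print(bool_output_to_int("f"))  # 0
```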
The other virtual data type is a “set”. This is represented as a “char” or “varchar”, but the user interface is aware of a fixed set of possible values, presenting this as a list of options to choose from rather than a free text entry field. This can be achieved in PostgreSQL using enumerated types (and the CREATE TYPE statement). Hence the mapping of a “char” or “varchar” column which is tagged as a “set” will be to a custom enumerated type defined by CREATE TYPE. The list of values can be extracted automatically from the user interface dictionary database in order to populate the list of options for the set. The name of the type will be the table name and the original column name separated by an underscore, with a leading “ing_”. For example, table “course” with column “choice”, which is a “char(15)”, will be mapped to the “ing_course_choice” type, an enumerated type created specifically for that column.
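The generation of an enumerated type for a tagged “set” column could be sketched as follows (the value quoting here is naive and would need proper escaping in the real hook):

```python
def enum_type_name(table: str, column: str) -> str:
    """Name an enumerated type after its table and column, with the
    leading "ing_" prefix."""
    return f"ing_{table}_{column}"

def create_enum_ddl(table: str, column: str, values: list) -> str:
    """Emit the CREATE TYPE statement for a "set" column; the value list
    would be pulled from the user interface dictionary database."""
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"CREATE TYPE {enum_type_name(table, column)} AS ENUM ({quoted});"

print(create_enum_ddl("course", "choice", ["core", "optional"]))
# CREATE TYPE ing_course_choice AS ENUM ('core', 'optional');
```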
Of course we need to look at mapping column names in “tecdb” (the user interface database) as well as “infdb”. This is slightly different. The only suffix used is “#”, which in this database indicates a column that is (or is part of) a primary/foreign key. So we will just map “wibble#” to “wibble” for this database. Any resultant name conflicts will be mapped with a leading “ing_”.
Mappings in “tecdb” will have to be replicated into the user interface build data structures so that the bootstrap cache refers to the correct new names. This in itself is not difficult, but it will necessitate starting a branched version of the interface to do it. However this will need to be done for PostgreSQL support anyway, and putting the whole thing under version control is part of the Phase 1 work.
Remembered how this works and what it was for. Essentially this is a user-interface-generated unique record identifier. It uses the interface database for persistence (hence the id is actually unique across all tables and databases used by the interface, although I don’t think that was a requirement, just a simplification to avoid having per-table id tables). Why not just use rules to create an increasing numerical value in the database back end? Because the user interface wants to know the id of the record it has just added so it can guarantee to find it again – otherwise it would have to assume that the table is unique across all the column values and use a query lookup on all the column values entered for a new record, which might work for some tables but certainly not all. It’s clearly still a useful thing to be able to do; MySQL lets you do this with the “SELECT LAST_INSERT_ID()” command for example. Now if there is a PostgreSQL equivalent to that … however it will still mean functional change to the user interface, so for Phase 1 we are going to stick with the “oid@” columns as they are. We will rename these as “_gui_oid” to indicate that it is a generated column and that the source for the values is the interface.
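For the record, a PostgreSQL analogue does exist: the RETURNING clause on INSERT (available from PostgreSQL 8.2), or currval() on a sequence, hands back the generated id in the same round trip. A sketch of what a Phase 2 statement could look like (the table, column and id names are invented for illustration):

```python
def insert_returning(table: str, columns: list) -> str:
    """Build an INSERT that returns the auto-generated id, the PostgreSQL
    equivalent of MySQL's SELECT LAST_INSERT_ID()."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) RETURNING id"

print(insert_returning("person", ["surname", "forename"]))
# INSERT INTO person (surname, forename) VALUES (%s, %s) RETURNING id
```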
The identifiers (table and column names) used in the Ingres data model schema have character suffixes in certain cases to indicate usage. These are “@” to indicate a column that is (or is part of) a primary or foreign key, “#” as shorthand for “number” in the column name, and “$d” to indicate a column whose value is not user entered but is derived by rules and procedures triggered from some other change. There is inconsistency in the use of “#”: it sometimes just indicates a numerical column. So we might have “person@” as the primary key in the person table, “part#” as “part number” and “mark$d” as a mark derived from something automatically. They can also be used in combination, for example “slot#$d”, which is a column with a generated value which is the “slot number”. None of these characters are allowed in an identifier in standard SQL, and none will work in PostgreSQL, where identifiers can only include letters, numbers and underscore (actually PostgreSQL also allows $, but that’s not standard SQL). We could make them all work by using quoted identifiers, but then every existing reference to an identifier would need to be quoted, which would mean altering the SQL generated by the user interface – so not really an option when we want to minimize functional change. So how to map? We don’t really need “@” any more in PostgreSQL because we will be using the PRIMARY KEY/FOREIGN KEY syntax to notate keys. The “#” is not currently used consistently. So on schema conversion “wibble@” and “wibble#” should just become “wibble”. The only issue this might cause is a possible name conflict with something in the same table already called “wibble” – this should be unlikely, but if it happens these will need to be mapped to something else. The “$d” still seems useful though, as you can quickly tell if a column is system maintained.
A leading underscore is also used in columns that have an auto-generated value associated with audit (a visible record of who made changes); for example “_date$d” would be the date something in the record was changed. Since it is still useful to be able to distinguish columns with system-generated values, we will map a trailing “$d” to a leading “_” where an “_” does not already exist. We won’t allow/use quoted identifiers for now, so the few that are quoted will be mapped to something else. We will stick with exclusively lower case for column names (standard SQL is case insensitive). Where a column name clashes with a reserved word in PostgreSQL it will also need to be mapped to something else. Where a name is being mapped because of a conflict we should be consistent, so for now we will just prefix such names with “ing_” (short for “ingres”) to indicate the origin of the name. There are also some special columns in a number of tables called “oid@”. The values in these are an increasing numeric sequence auto-generated from a central id table (so unique across the database). Can’t at present remember the usage requirement (other than it’s something to do with the user interface, as you can define a table with “has_oid” set) – will need to check; however it could probably be replaced by some built-in auto-increment type. This would have to happen at a later stage, as part of rule/procedure mapping, and may need user interface change, so we will probably retain it until Phase 2.
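Pulling the suffix rules together, the name mapping could be sketched as below. The reserved-word list is a tiny illustrative subset of the real PostgreSQL list, and the special-cased “oid@” rename is the one described above:

```python
# Illustrative subset only; the real mapping would use the full
# PostgreSQL reserved word list.
PG_RESERVED = {"user", "order", "group", "table", "select"}

def map_column_name(name: str, existing=frozenset()) -> str:
    """Apply the Ingres -> PostgreSQL column name mapping rules."""
    if name == "oid@":
        return "_gui_oid"           # interface-generated record id
    derived = name.endswith("$d")
    new = name[:-2] if derived else name
    new = new.rstrip("@#").lower()  # key ("@") and number ("#") suffixes go
    if derived and not new.startswith("_"):
        new = "_" + new             # keep system-maintained columns visible
    if new in PG_RESERVED or new in existing:
        new = "ing_" + new          # consistent prefix on any conflict
    return new

print(map_column_name("slot#$d"))   # _slot
print(map_column_name("person@"))   # person
```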
Of course mapping the column names in the schema is only part of what needs to be done – we also need to map all the internal (user interface) dictionary references to the columns and all the references to the columns in the SQL in reports. There are also probably references in scripts as well (in GenRep). So no small task – needs to be something that can be automated and consistent so that it can be bundled into the sync process.