Mock REF Reviewer System

This project will provide a simple paper referee and review system for the mock REF exercise. The expectation is that College will produce something for the actual REF submission (in 2021), so in this context we want something that can be thrown together fairly quickly. The workflow is also not entirely clear at the outset, so we want something that can be quickly altered. I therefore decided to use a Theon managed service. This means the database schema and corresponding UI can be prototyped and turned into a production system very quickly, while still allowing the live system to be easily updated to account for any late design changes. We also get a RESTful API for free by doing this, and if a more custom UI turns out to be required it will be relatively quick to build using the existing API.

Production Hadoop Service

Chris and I have started working on this project. The idea is for both of us to get up to speed with installing and running a Hadoop cluster in the first instance.

To date we have got one node running all services and successfully running a job. This used secure ssh for connection. We have started looking at converting this to use Kerberos before trying to build more nodes. The configuration seems to be working but we have an issue using host principals to ssh.

Free Machine Finder – Draft Design and Implementation Proposal


Provide a way for our students to easily locate free DICE desktops in AT. In addition, provide something like the functionality in the student-written MAPP demonstration system that helps students find out which labs their friends are working in.


This is the first draft, so it is likely to see some amendments.

User Interface

We will provide an Android App as the primary user interface. If this is positively accepted we would then look at also providing an iOS equivalent. We will use a REST API so as not to preclude other interfaces, such as desktop web browser and command line, but will not necessarily implement any of these as part of this project.

The App will have a live home screen widget which simply displays the name of the AT lab that currently has the most free machines. An additional icon will be shown alongside the lab name if there are “friends” logged onto machines in that lab (see details on the “friends” feature below).

Opening the App will show one main window. This will contain an ordered scrollable list of AT lab names with those at the top of the list having the most available free machines and those at the bottom having the least available. An additional icon will be shown alongside each lab name where there are one or more “friends” logged onto machines in that particular lab.

The status of AT lab machine usage and logged in “friends” presented by the App will be refreshed automatically every minute while the App is running.

There will be a “Refresh” icon on the main App window which can be touched to force a manual refresh.

There will be a “Friends” icon on the main App window. This can be touched to open the App “friends” window. See details below.

Friends Feature

The “friends” feature is configured in the App “friends” window. At the top of the window will be an on/off “opt-in” slide toggle which will be off by default. Nothing else is shown on the window in this state. When this toggle is off (or switched to off) the user’s UUN is not available to other users to search for and invite to be a “friend” (and any previously made associations will then be hidden from those users, see below). Also the “friends” icon will no longer be shown alongside the AT lab name, irrespective of whether previously made associations would normally result in the icon being shown.

When the toggle is slid to on the following additional items are shown.

“Enter Friend UUN” which is an entry box where you can enter a UUN. Alongside this is a “Search” icon. Touch this (or touch Return on the onscreen keyboard) to search for the UUN among only those users that have chosen to “opt-in”. If the UUN is not found an error like “UUN not found (invalid) or UUN has not opted-in” will be displayed. If the UUN is found a modal will be shown with the message “UUN (Firstname Lastname) found”. This will have two buttons – “Invite” and “Cancel”. Touching “Invite” will close the modal, add the UUN to the list underneath (see below) and clear the “Enter Friend UUN” box. Touching “Cancel” will close the modal and do nothing else.

Below the “Enter Friend UUN” entry box will be a list of friends, each shown as “UUN (Firstname)”. Those awaiting acceptance have the text “Invited” alongside. In addition, requests from other users to this one will be separated at the top of the list, each with an “Accept” button alongside. Touch the button to confirm.

Swiping any friend item will display a modal with an option to delete the association or cancel. The entry does not have to be a confirmed association: swipe to delete can also reject an invitation from another user or cancel an invitation the user has made (in which case an orphaned acceptance would then be discarded on both sides). Once deleted, the association will need to be requested again from either side and confirmed. There is no explicit notification on removal for the other user (the corresponding UUN will just drop off their list on the next refresh).

A “friend” association is always a two way pairing. It does not matter whether A invites and B accepts or B invites and A accepts; once this has happened A will have B listed as a friend and B will have A listed as a friend.

Once friend associations exist (which requires invitation AND acceptance), the user will be alerted when any of the associated UUNs in the list are logged into machines by the “friends” icon being displayed alongside the corresponding AT lab name. The exception is where a user has subsequently turned off “friends”, in which case their UUN will (silently) not be included in the search, although they will still be listed on the associated user’s “friends” window, possibly with an “offline” status shown.


For ease of use it would be impractical for the user to go through DICE authentication every time they wanted to use the App. So instead, on first opening the App the user will go through some kind of DICE authentication (possibly by just making an authenticated REST/API call, although it is not yet entirely clear how this will work from a mobile app) for which they receive a unique token (random hash value). The token is stored on their phone and the token/UUN mapping is stored in the back end. Subsequently the App just uses this stored token for authentication (by passing it as an argument in every REST/API call).

Compromise by loss of phone is obviously possible, but if we are informed the relevant token/uun pair can be revoked by deleting the record at the back end, requiring the user to re-authenticate in order to use the App. Deletion of the token/uun pair at the back end would also delete the user’s “opt-in” status (reverting to the default which is off) and any existing “friend” associations. Increased security is possible, for example by constructing the hash based on UUN and phone UDID, but it is not clear that this is really necessary here (and it may cause some users concern over tracking).
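The token lifecycle described above (issue on first authentication, look up on every call, revoke on loss of phone) could be sketched roughly as follows. This is only an in-memory illustration: the `BackEnd` class, its attribute names and its method names are hypothetical, and plain Python containers stand in for the PostgreSQL records.

```python
import secrets


class BackEnd:
    """Hypothetical in-memory stand-in for the back end records."""

    def __init__(self):
        self.uun_by_token = {}   # token -> UUN
        self.opted_in = set()    # UUNs that have turned "friends" on
        self.friends = set()     # frozenset({uun_a, uun_b}) pairs

    def _drop_tokens(self, uun):
        for token in [t for t, u in self.uun_by_token.items() if u == uun]:
            del self.uun_by_token[token]

    def issue(self, uun):
        """Issue a fresh random token after a successful DICE authentication."""
        self._drop_tokens(uun)  # any earlier token for this UUN stops working
        token = secrets.token_hex(32)
        self.uun_by_token[token] = uun
        return token

    def resolve(self, token):
        """Return the UUN a token maps to, or None if it is unknown."""
        return self.uun_by_token.get(token)

    def revoke(self, uun):
        """Delete the token/UUN pair along with opt-in status and associations."""
        self._drop_tokens(uun)
        self.opted_in.discard(uun)
        self.friends = {pair for pair in self.friends if uun not in pair}
```

After `revoke` the old token no longer resolves, so the App would have to go back through DICE authentication to obtain a new one.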


Machine Status Collection

Each desktop in the AT student labs will have a cron job that runs a command every minute. This command will make an authenticated REST/API call to update the back end PostgreSQL database with the current status. This updates the corresponding machine’s record with the current console UUN (or empty if no user is logged into the console). It need not wake up the machine to run during sleep – the last reported state should in general indicate no console user (unless they left themselves logged on).

An alternative (maybe preferable) to running the status update command every minute would be to call the command directly from console login and console logout, which could be done through the PAM stack for example. It would, however, also be necessary to run the command on machine reboot and GUI restart.

Authentication will use the machine hostclient principal. Using an LCFG spanning map and pgluser we will be able to automatically create a mapping role for each host principal on PostgreSQL for each AT lab machine as and when they go in and out of service. We can also use pgluser to grant the role permission to access the “status” table view (below).

On PostgreSQL these roles will use a view that constrains their access to only the record for the machine corresponding to the host principal. This can be trivially achieved since the SESSION_USER will be the “machine name” (derived from host principal) and by using a view with update rules based on SESSION_USER.

In summary:

  • the client will use hostclient/
  • a spanning map and pgluser will automatically maintain a corresponding “HOSTNAME” user in PostgreSQL for each hostclient principal
  • similarly pgluser will grant suitable access to the “status” table view for each HOSTNAME role (by assigning the relevant access role to each HOSTNAME user role).
  • when the hostclient principal connects the SESSION_USER will be “HOSTNAME”
  • a “status” table in PostgreSQL will have two columns – “hostname” and “console_uun”
  • a view will be defined on the “status” table that is “select” and “update” constrained to only rows where “hostname” matches SESSION_USER – hence the given “HOSTNAME” role can only see and change the value of the row in the table which has a matching hostname value. A similar “insert” rule will be defined on the view so that only a row with matching hostname value can be added. The REST/API will use an “UPSERT” to insert or update the row it owns as appropriate.
  • the “status” table will be unique on hostname to avoid any risk of double entry by accident or design flaw.
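The intent of the points above can be illustrated with a small simulation. This is only a sketch under stated assumptions: SQLite (from the Python standard library) stands in for PostgreSQL so that the example is self-contained, the `report_status` function is hypothetical, and the `session_user` argument plays the part that SESSION_USER and the view rules would play in the real database.

```python
import sqlite3

# SQLite is used only to keep the sketch self-contained; in the proposal the
# check below would be enforced by a PostgreSQL view keyed on SESSION_USER.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE status (hostname TEXT PRIMARY KEY, console_uun TEXT)"
)


def report_status(conn, session_user, hostname, console_uun):
    """Insert or update the status row owned by the connecting host role."""
    if hostname != session_user:
        # Mirrors the view constraint: a host may only touch its own row.
        raise PermissionError(f"{session_user} may not update {hostname}")
    # The primary key on hostname means a second report replaces the first.
    conn.execute(
        "INSERT OR REPLACE INTO status (hostname, console_uun) VALUES (?, ?)",
        (hostname, console_uun),
    )
    conn.commit()


# Made-up host and user names, for illustration only.
report_status(conn, "lab1pc01", "lab1pc01", "s1234567")  # console login
report_status(conn, "lab1pc01", "lab1pc01", None)        # console logout
print(conn.execute("SELECT hostname, console_uun FROM status").fetchall())
```

However many times a machine reports, it only ever has one row, and an attempt to report on behalf of a different hostname is refused.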

Obviously status updates will not work during online examinations since the connection will be firewalled. This is reasonable behaviour (all locked down labs will show as fully available during the exam).

Unfortunately it will probably not be possible at present to get a feed of timetabled lab room bookings (for tutorials etc) from the central timetabling system as remote programmatic access to that system was de-scoped in the original project. Consequently this service may show rooms having many free machines that are not actually available to students to use as there is a tutorial happening in the room at that time, for example.

Support Data

A feed of data will populate an “active_user” table from the Informatics School Database with known UUNs and corresponding names. The known UUNs will only be those that have a valid account entitlement (effectively the same data as provided via the Prometheus user view). Once a UUN drops off that feed the authenticated REST/API can no longer be used to retrieve a token for that UUN and any existing token/uun pairing and associated data will be marked for deletion and will be fully purged after a short (glitch handling) delay. By using a FDW and TheonCoupler all the feed management can be done entirely in PostgreSQL and no supporting infrastructure should be required.

An additional feed of data will be required to populate a “room” table. This will contain machine to room mapping data. This data could come from LCFG or the inventory. Directly from the inventory would probably be preferable and can be achieved with a simple REST/API query against that and internalized as an FDW. Alternatively it may be possible this information could be extracted from clientreport or LCFG information on the client itself and returned in the status update along with the console login status.
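Once the machine to room mapping and the per-machine status are both available, the ordered list of labs follows from a simple count. The sketch below shows one way this could be computed; the `rank_labs` function is hypothetical, dictionaries stand in for the “room” and “status” tables, and it assumes that a machine with no status row counts as free.

```python
from collections import Counter


def rank_labs(room_by_host, console_uun_by_host):
    """Return (lab, free_count) pairs ordered from most free to least free.

    room_by_host maps hostname -> lab name (the "room" table);
    console_uun_by_host maps hostname -> console UUN or None (the "status" table).
    """
    # Start every known lab at zero so fully occupied labs still appear.
    free = Counter({lab: 0 for lab in room_by_host.values()})
    for host, lab in room_by_host.items():
        # Assumption: no status row, or an empty console UUN, means free.
        if not console_uun_by_host.get(host):
            free[lab] += 1
    # Sort by free count descending, then by lab name for a stable order.
    return sorted(free.items(), key=lambda item: (-item[1], item[0]))


# Made-up hosts and labs, for illustration only.
rooms = {"pc01": "Lab A", "pc02": "Lab A", "pc03": "Lab B", "pc04": "Lab B"}
status = {"pc01": "s1234567", "pc02": None, "pc03": None, "pc04": None}
print(rank_labs(rooms, status))
```

The first entry of the result is what the home screen widget would display.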


Initial Authentication and Connection. When the App has no token it will require DICE authentication. It can then make an authenticated connection to the REST/API as that UUN.

An authenticated URI to REST/API can request a token for the authenticated UUN (or a call to the authenticated URI will first bounce the user through an authentication step). Once the App has the token it no longer uses the authenticated URI.

All other connections to REST/API use the unauthenticated URI always passing the token as one of the arguments.

All REST/API calls return JSON.

The authenticated API simply returns the corresponding token. To do so it will perform an “UPSERT” operation on the “user” table. This table contains three columns: “uun”, “token” and “friend_optin”. The upsert will be view limited (for the REST/API credentials) so that it can only set the “uun” value – the token will be generated and set automatically and “friend_optin” will be left as NULL. The authenticated API will use credentials held on the server to perform this operation, but that is all these credentials will be able to do. The table will be unique on “uun”, preventing double entry by accident or design flaw. Note that an attempt to re-authenticate (if the user loses their token by deleting the App or getting a new phone) will update the existing record and re-generate a token.
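As a rough illustration of the upsert behaviour described above, the sketch below uses SQLite purely as a self-contained stand-in for PostgreSQL. The `request_token` function is hypothetical, and keeping any existing “friend_optin” value when a token is regenerated is an assumption of the sketch rather than something this draft settles.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "user" ('
    "uun TEXT PRIMARY KEY, token TEXT NOT NULL, friend_optin INTEGER)"
)


def request_token(conn, uun):
    """Create or refresh the row for an authenticated UUN and return its token."""
    token = secrets.token_hex(32)
    # Re-authentication: refresh the token on the existing row, nothing else.
    updated = conn.execute(
        'UPDATE "user" SET token = ? WHERE uun = ?', (token, uun)
    ).rowcount
    if updated == 0:
        # First authentication: only uun and token are set, so friend_optin
        # is left as NULL.
        conn.execute(
            'INSERT INTO "user" (uun, token) VALUES (?, ?)', (uun, token)
        )
    conn.commit()
    return token


first = request_token(conn, "s1234567")   # made-up UUN
second = request_token(conn, "s1234567")  # e.g. after getting a new phone
print(first != second)
print(conn.execute('SELECT COUNT(*) FROM "user"').fetchone()[0])
```

The uniqueness constraint on “uun” is what guarantees there is never more than one live token per user.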

The unauthenticated REST/API provides the following calls:

  • Get a list of labs and free machines – arguments are “token”. This returns an array of lab objects in order from most free to least free. The App will put the first entry in the home screen widget and display all the entries on the main App window. Each lab object contains the lab name and a flag indicating whether friends are in it. Where the supplied token does not match a stored token/uun pair the result will always be empty. Where it does match, the array of lab objects will be returned. Where the user has not turned on “opt-in” the friends flag will always be false. Where the token/uun friend “opt-in” value is true the “friend” table will be scanned to set the friend flag appropriately. The “friend” table has two “uun” columns (invitor and invitee, although once an association is confirmed the two are synonymous) and a “status” column which if true indicates the association has been confirmed (the invite has been accepted). When scanned, the uun corresponding to the token is looked for under both columns (i.e. find friends that the user invited and also friends that invited that user). The table also has a “timestamp” column set on row creation, used later for automatic expiry of requests. The App makes this API call automatically every minute or when the “refresh” icon is touched.
  • Enable friends – arguments are “token”. This is called to opt-in to the “friends” feature. Sets the relevant flag in the “user” table and returns confirmation message.
  • Get friend – arguments are “token” and “uun”. This is called to return a match for the passed “uun” value. A match is only returned if the UUN matches an entry in the “user” table where “friend_optin” is true. The returned match object includes the corresponding firstname and lastname from the “active_user” table.
  • Invite friend (PUT) – arguments are “token” and “uun”. Adds a record in “friend” table with “uun,friend_uun” (where “uun” is that corresponding to the token and “friend_uun” is the “uun” passed directly as argument) and the default for status boolean (cannot be set via add api) which is NULL (meaning “request”). Only a UUN that would be returned as valid in the “get friend” call above will be added. Only one instance can be added (the table is unique on each uun+friend_uun combination).
  • Approve friend request – arguments are “token” and “uun”. Updates the “friend” table by locating the row where “the uun corresponding to the token” = “friend_uun” and “uun” equals the passed “uun” (the requester) and status is NULL and sets status to TRUE.
  • Decline friend (either an invite, a request for approval or a confirmed association) – arguments are “token” and “uun” (of friend). Searches “friend” table and deletes the row where uun corresponding to token equals “uun” and “friend_uun” equals passed “uun” or vice versa.
  • Get friends – arguments are “token”. Returns data from “friend” table where current uun (corresponding to token) equals “friend_uun” or the “uun” columns. An ordered array of objects is returned suitable for direct display on the “friends” App window. Three types will be returned: “requests” where status is NULL and current uun matched the “uun” column; “invites” where status is NULL and current_uun matched the “friend_uun” column; “associated” where status is TRUE and current uun matches either column. Data included in the object will be the names corresponding to the UUN and a type flag. This API call is made when the “friends” App window is opened, every minute while it is open and immediately after the “enable/invite/approve/decline” requests above.
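The invite, approve and decline calls above amount to a small state machine over rows of the “friend” table. The following sketch models that in memory; the `FriendTable` class and its methods are hypothetical, and refusing an invite when a row already exists in the opposite direction is an assumption of the sketch (the text above only requires uniqueness on each uun+friend_uun combination).

```python
class FriendTable:
    """Hypothetical in-memory model of the "friend" table."""

    def __init__(self):
        # (uun, friend_uun) -> status; None means requested, True confirmed.
        self.rows = {}

    def invite(self, uun, friend_uun):
        """Add a pending row from uun (invitor) to friend_uun (invitee)."""
        # Assumption: a row in either direction blocks a new invite.
        if (uun, friend_uun) in self.rows or (friend_uun, uun) in self.rows:
            return False
        self.rows[(uun, friend_uun)] = None
        return True

    def approve(self, uun, requester):
        """Confirm a pending row; only the invitee can do this."""
        key = (requester, uun)
        if key in self.rows and self.rows[key] is None:
            self.rows[key] = True
            return True
        return False

    def decline(self, uun, other):
        """Delete the row between two users, whatever its status."""
        removed = False
        for key in ((uun, other), (other, uun)):
            if key in self.rows:
                del self.rows[key]
                removed = True
        return removed

    def confirmed_friends(self, uun):
        """Return confirmed friends, looking for uun under both columns."""
        result = set()
        for (a, b), status in self.rows.items():
            if status is True and uun in (a, b):
                result.add(b if a == uun else a)
        return result
```

Because `confirmed_friends` checks both columns, it does not matter which side sent the original invitation once it has been approved.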

Note that rows in the “friend” table where status is NULL (invites that have not been accepted) are deleted automatically by a purge function after a period of time, e.g. 7 days.
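The purge rule can be expressed as a simple filter. In the proposal this would be a function inside PostgreSQL; the sketch below is a hypothetical Python equivalent in which each row is a dictionary with “status” and “timestamp” keys and the 7 day period is the example value from above.

```python
from datetime import datetime, timedelta


def purge_stale_invites(rows, now, max_age=timedelta(days=7)):
    """Return only the rows that survive the purge.

    Confirmed rows (status True) are always kept; pending rows (status None)
    are kept only while they are no older than max_age.
    """
    return [
        row for row in rows
        if row["status"] is True or now - row["timestamp"] <= max_age
    ]
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rule easy to test against a simulated data set.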

To alleviate tracking concerns for “friends” data we could consider (as in the demonstration application) hashing all UUN values (with a random salt value held only on the server) within PostgreSQL so that casual internal support access (and accidental end user remote access by API bug or deliberate hack) reveals no useful information.
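One possible shape for this is sketched below in Python rather than in PostgreSQL. It swaps in a keyed HMAC-SHA256 digest for the plain salted hash mentioned above, which is a design choice of the sketch and not something taken from the demonstration application; `SERVER_SALT` and `hash_uun` are hypothetical names.

```python
import hashlib
import hmac
import secrets

# Assumption: a random salt generated once and held only on the server.
SERVER_SALT = secrets.token_bytes(32)


def hash_uun(uun, salt=SERVER_SALT):
    """Return a keyed SHA-256 digest of the UUN as a hex string."""
    return hmac.new(salt, uun.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same UUN always maps to the same digest for a given salt, so lookups and joins on the hashed columns still work, but the stored values cannot be turned back into UUNs without the salt.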

All the API functionality is implemented directly in PostgreSQL using views and appropriate update rules. The REST/API will connect locally as a specific set user (no connections from outside the server for that user). It would be easily implementable with Python/FlaskRestful for example, although the authenticated side might need more work.

The App simply makes the necessary REST/API URI calls and displays the results; it needs little other logic. It should be possible to simulate it via Curl calls. It holds no state other than its connection token.
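Since the App only has to build a URI that carries the token and any other arguments, the client side of each call can be very small. The sketch below shows one way of constructing such URIs; the base URI and the endpoint names (“labs”, “friend”) are placeholders invented for the example, as this draft does not define them.

```python
from urllib.parse import urlencode

# Hypothetical base URI; the real service location is not defined here.
BASE_URI = "https://example.org/api"


def build_call(endpoint, token, **arguments):
    """Return the URI for an unauthenticated call, always passing the token."""
    query = urlencode({"token": token, **arguments})
    return f"{BASE_URI}/{endpoint}?{query}"


print(build_call("labs", "abc123"))
# https://example.org/api/labs?token=abc123
print(build_call("friend", "abc123", uun="s1234567"))
# https://example.org/api/friend?token=abc123&uun=s1234567
```

The resulting strings could equally be passed to Curl, which is what makes the App straightforward to simulate.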

Implementation Sequence

The steps below will probably be done in two discrete stages, each running through some or all of them. The first would be to implement the core support for finding free machines (all steps). The second would be to implement support for the “friends” feature (steps 8 through 12).

  1. Create back end PostgreSQL service.
  2. Add LCFG spanning map and pgluser configuration to map hostclient principals to db users.
  3. Create REST/API for client “status” table update.
  4. Create machine client.
  5. Test then deploy onto all student lab machines.
  6. Monitor content on PostgreSQL to confirm functionality.
  7. Implement and test feed for “active_user” table update.
  8. Implement PostgreSQL functionality for REST/API calls including permissions. Test directly in PostgreSQL.
  9. Implement a Python/Flask REST/API interface.
  10. Test REST/API functionality with Curl.
  11. Implement a full test suite using Curl and a simulated data set of some kind.
  12. Write the Android App.

Adopting Theon – Part 3.4

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. The second part covered the necessary packages and installation. This final part is an introductory tutorial.

In the first part of this tutorial we created a database and started to manage it in Theon. In the second part we configured TheonCoupler to load data automatically into the database. In the third part we configured TheonUI to allow access to the database and made a custom desktop. In this fourth and final part we will package and distribute the result.

The Theon toolkit has some commands to build and package a ModelLibrary for distribution. The resulting distribution will then have the toolkit as a pre-requisite.

We need to do some work with Git first, as the ModelLibrary is held as a Git repository.

rm -fr derived
git add factory schemat
git commit --message "first pass"

Now we can build a release (which will have version number 1.1 by default).

ttkm release

Check the result.

ttkm display
* >
  Establishing UnitArchive
  - Checking for UnitArchive in Folder[CWD]
    - Found Unit[packages] in Folder[CWD]
    - This[packages]
* packages[CWD] >
  History of Unit[packages]
  - Release Directory:         CWD::.release
  - Release Directory Index:   CWD::.release/index
  - Release SpecFile:          CWD::packages.spec
  - Current Release Version:   packages-1.1
  - Current Release Revision:  07f1a86d37bc74a6b5a2bd0c5157650c60501b3a

Now build a source and binary package for the current release above and list the results.

ttkm package rebuild
ls package
packages-1.1-1.src.rpm                   packages-theon-library-devel-1.1-1.noarch.rpm
packages-theon-library-1.1-1.noarch.rpm  packages-theon-library-ui-1.1-1.noarch.rpm

Install the packages.

rpm -U package/*.noarch.rpm

Now we can test the installation. First drop our existing live database, change directory and install from the distribution.

dropdb packages
cd ~/
ttkm packages install

Feed in the data in the same way but using the coupler configuration in the distribution.

rpm -qa --qf '"%{NAME}","%{VERSION}","%{RELEASE}","%{ARCH}","%{BUILDTIME}"\n'| ttkm packages stream agent rpm refresh

rpm -qa --qf '["%{NAME}","%{ARCH}","%{REQUIRENAME}","%{REQUIREVERSION}"\n]'| ttkm packages stream agent rpm refresh to packdep couple packdep

Finally start up TheonUI to access the content using the configuration and metadata in the distribution.

ttkm packages server ui

Note that the server configuration is built for the distribution using the LiveProfile attached to the ModelLibrary at build time. In most distributions this would more than likely be configured separately.

There is more extensive documentation on Theon, TheonCoupler, TheonToolkit and TheonUI in The Theon Book.

Adopting Theon – Part 3.3

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. The second part covered the necessary packages and installation. This final part is an introductory tutorial.

In the first part of this tutorial we created a database and started to manage it in Theon. In the second part we configured TheonCoupler to load data automatically into the database. In this third part we will configure TheonUI to allow access to the database and make a custom desktop. In the fourth and final part we will package and distribute the result.

Clear out existing derived content and generate the TheonUI metadata content and configuration file.

rm -fr derived
ttkm derive ui

Start up a server instance using the just generated configuration file and content.

ttkm server ui using config=derived/ui/default/ui.conf

On a browser running on the local machine go to localhost:8000/ui.html to access our instance of TheonUI. Once there we can access the data in the package or rpmdata tables as follows.

From the Choose Desktop menu select Browser.
From the Choose Cluster menu select All Clusters.
From the Choose Table menu select package.
Then the content will be accessible.
Choose Use Grid option to see the content in a tabular form.

We can construct a desktop to show content from multiple related tables simultaneously. However at the moment our database is too simplistic for this. So first we will add another table which is a child of package and create a stream to load it with data. All aspects of this have been covered in part one and part two of this tutorial so much of the detail will be glossed over here.

Create a new table related to package and a corresponding data source table, import the new structure into Theon, then undo our changes and upgrade from Theon.

psql packages <<EOF
CREATE TABLE packdep ( name TEXT, arch TEXT, requires TEXT, version TEXT );
CREATE TABLE rpmdeps AS SELECT * FROM packdep;
ALTER TABLE package ADD PRIMARY KEY (name,architecture);
ALTER TABLE packdep ADD FOREIGN KEY (name,arch) REFERENCES package (name,architecture);
EOF
ttkm import
psql packages <<EOF
DROP TABLE packdep;
DROP TABLE rpmdeps;
ALTER TABLE package DROP CONSTRAINT package_pkey;
EOF
ttkm export
ttkm upgrade

Now modify the existing rpm Stream to add a new Couple with a custom source table and feed of data. Start the TheonUI first.

ttkm self server ui

Now enter the necessary data using a locally running browser pointed at localhost:8000/ui.html.

Open the Streaming desktop.
On the first panel (Choose Model) ensure the right model is shown (in this case packages should be the only one) and then open the Stream panel and add one new row into the Couples box:
 Position SourceEntity TargetEntity Name[Info]
 2 rpmdeps packdep rpmpackagedeps
Open the Couple panel, make sure the rpmpackagedeps couple is shown and enter and save the following for the couple (keep quotes):
 Source Origin Value = 'RPM'
 Reload = Pipe
 Then add two new rows in the Bonds box:
 Position Type
 1 Handle
 2 Column

Open the Bond panel and add rows in the Pairs box for each bond type as below:
 For Handle:
 Position SourceColumn TargetColumn
 1 name name
 2 arch arch
 3 requires requires
 For Column:
 Position SourceColumn TargetColumn
 1 version version

For the couple we have defined the source table (rpmdeps) and target table (packdep). We have also defined the handle pair (name:name + arch:arch + requires:requires) to tie records together between source and target.

Now exit the server process with Control-C.

Export the modified schema and upgrade the live database structures.

ttkm export
ttkm upgrade

Update the stream configuration.

ttkm derive coupler

Run the stream, load the data and process the new couple to load the target table content. Then check that content.

rpm -qa --qf '["%{NAME}","%{ARCH}","%{REQUIRENAME}","%{REQUIREVERSION}"\n]'| ttkm stream agent rpm refresh to packdep couple packdep
psql packages -c "SELECT arch, COUNT(*) AS total FROM packdep GROUP BY arch;"
  arch  | total 
 (none) |      1
 i486   |     99
 noarch |  17006
 i686   |   2631
 x86_64 |  63068
(5 rows)

Now we have some related tables and data we can construct a desktop in TheonUI that in one panel allows package selection and shows the related package dependencies. Start up TheonUI for the Theon management database.

ttkm self server ui

Now enter the necessary data using a locally running browser pointed at localhost:8000/ui.html.

Open the Reflection desktop.
On the first panel (Choose Model) ensure the right model is shown (in this case packages should be the only one).
Add a new row in the Desktops box:
 Desktop-Label = packages
 Name[Info] = Packages
Open the Panel panel and add one new row:
 Panel-Label = package
 Name[Info] = Package
 ? Base Entity = package
 Selector = Scroll
and save the changes. Then add one new row in the Connectors box:
 Position ?JoiningRelationship Type Label
 1 packdep_name_fkey ForwardNormal Requirements
and save the changes. Then add one new row in the Presenters box:
 Presenter-Label Style Height
 Requirements Grid 16
and save the changes. Then add the following rows in the Fields box:
 Position ?Connector ?Attribute Presenter Width Name[Info]
 1 -NONE- name - 80 Name
 2 -NONE- architecture - 20 Architecture
 3 -NONE- version - 20 Version
 4 -NONE- release - 20 Release
 5 Requirements requires Requirements 80 Name
 6 Requirements version Requirements 20 Version
and save the changes.
Open the Desktop panel and add one new row into the Panels box:
 Position ?Panel Link-Label
 1 package Package

Now exit the server process with Control-C.

Export the modified schema and derive the new UI metadata.

ttkm export
rm -fr derived
ttkm derive ui

Before starting the server tweak the configuration manually so that the initial data fetch size is limited. Then start up a server instance using the just generated configuration file and content.

Add at the top of the "config = {" section in derived/ui/default/ui.conf:
  "limit_rows" : 250,
  "limit_rows_conditionally" : True,

ttkm server ui using config=derived/ui/default/ui.conf

On a browser running on the local machine go to localhost:8000/ui.html to access our instance of TheonUI.

From the Choose Desktop menu select Packages.

This is a very simple demonstration desktop. For example, you can use it to look up a package by name at the top, and each pre-requisite package and version will be shown in the lower box. Or you can query for a pre-requisite package name and all the packages dependent on it will be shown.

Adopting Theon – Part 3.2

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. The second part covered the necessary packages and installation. This final part is an introductory tutorial.

In the first part of this tutorial we created a database and started to manage it in Theon. In this second part we will use TheonCoupler to load data automatically into the database. In the third part we will use TheonUI and create a desktop for the database. In the fourth and final part we will package and distribute the result.

External data can be brought into (or accessed from) PostgreSQL in many ways. Here we will transfer it in at regular intervals from an external source. The command below will act as our live source of data.

rpm -qa --qf '"%{NAME}","%{VERSION}","%{RELEASE}","%{ARCH}","%{BUILDTIME}"\n'

This returns information on all installed packages in a CSV format. We need to create a table in our database to hold this data. In this example it will have the same structure as our package table, but we can’t use that table directly since it is the sync target table. In most cases the external source data does not align directly with an existing table as it does here. Since in this case it does, we can cheat and create it using our existing table as a template.

psql packages -c "CREATE TABLE rpmdata AS SELECT * FROM package;"

Since this change was made on the live database we need to import it back into Theon.

ttkm import

Next we need to configure the Stream and Couple that will process the data. This needs to be done via TheonUI. So first create an instance of the server for the Theon management database. The default configuration will suffice.

ttkm self server ui

Then using a locally running browser point it at localhost:8000/ui.html to bring up the user interface. Follow the steps below to add the Stream.

Open the Streaming desktop.
On the first panel (Choose Model) ensure the right model is shown (in this case packages should be the only one) and then:
Add a new row in the Streams box:
 Stream-Label = rpm
 Origin = RPM
 ? Feed Entity = rpmdata
 Source = Pipe
 Format = CSV
 Name[Info] = rpm

The above creates a stream agent called rpm which takes data from a standard unix pipe in CSV format and maintains the rpmdata table as a copy of that content.
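To make the shape of the piped data concrete, the sketch below reads records in the same quoted CSV layout and turns each into a dictionary keyed by the package table's column names. It is only an illustration of the feed format and is not how TheonCoupler itself is implemented; `parse_feed` and `COLUMNS` are hypothetical names and the sample record is made up.

```python
import csv
import io

# Column order follows the rpm --qf format string: NAME, VERSION, RELEASE,
# ARCH, BUILDTIME, mapped onto the package table's column names.
COLUMNS = ["name", "version", "release", "architecture", "buildtime"]


def parse_feed(stream):
    """Yield one dict per CSV record read from a file-like object."""
    for record in csv.reader(stream):
        if record:  # skip blank lines
            yield dict(zip(COLUMNS, record))


# A made-up record in the same layout as the rpm output.
sample = io.StringIO('"examplepkg","1.0","2","noarch","1500000000"\n')
print(list(parse_feed(sample)))
```

In a real pipeline the stream argument would be standard input rather than an in-memory string.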

Now define a Couple to sync data from the source to our target table. We only need one in this instance.

Open the Stream panel and ensure rpm is shown
and then add one new row into the Couples box:
 Position SourceEntity TargetEntity Name[Info]
 1 rpmdata package rpmpackage

Open the Couple panel and enter and save the following for the couple (keep quotes):
 Source Origin Value = 'RPM'
 Then add two new rows in the Bonds box:
 Position Type
 1 Handle
 2 Column

Open the Bond panel and add rows in the Pairs box for each bond type as below:
 For Handle:
 Position SourceColumn TargetColumn
 1 name name
 2 architecture architecture
 For Column:
 Position SourceColumn TargetColumn
 1 version version
 2 release release
 3 buildtime buildtime

For the couple we have defined the source table (rpmdata) and target table (package). We have also defined the unique handle pair (name:name + architecture:architecture) to tie records together between source and target. Finally we have defined the data column pairs, the values of which will be maintained by the stream process.
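The roles of the handle and column pairs can be illustrated with a generic keyed synchronisation. The sketch below is not TheonCoupler's implementation (which works through generated database structures); it is a hypothetical `sync` function, assuming identically named source and target columns, that shows how a handle identifies matching records and how the column values are then maintained.

```python
def sync(source_rows, target_rows, handle, columns):
    """Bring target_rows in line with source_rows and report what changed.

    Rows are dicts; handle and columns are lists of column names. Matching
    target row dicts are updated in place.
    """
    def key(row):
        return tuple(row[name] for name in handle)

    source = {key(row): row for row in source_rows}
    target = {key(row): row for row in target_rows}
    changes = {"added": [], "updated": [], "deleted": []}

    for k, src in source.items():
        if k not in target:
            # No target record with this handle: add one.
            target[k] = {name: src[name] for name in handle + columns}
            changes["added"].append(k)
        elif any(target[k][name] != src[name] for name in columns):
            # Same handle, different data: refresh the column values.
            target[k].update({name: src[name] for name in columns})
            changes["updated"].append(k)

    # Target records whose handle has vanished from the source are removed.
    for k in [k for k in target if k not in source]:
        del target[k]
        changes["deleted"].append(k)

    return list(target.values()), changes
```

With `handle=["name", "architecture"]` and `columns=["version", "release", "buildtime"]` this corresponds to the pairs entered above for the rpmpackage couple.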

Now exit the server process with Control-C.

Export the modified schema and review the new content.

ttkm export
ls -R schemat
live  packages.xsd  process  relation  stream


_th_coupler_package_check.xsd  _th_rpm_package_couple.xsd

archcount.xsd  package.xsd  rpmdata.xsd


The rpmdata table has been added. The stream configuration has been added. Some stream management processes have been created. There will also be some changes to the target table (package) structure (metadata columns and triggers that have been added to support the content being managed by TheonCoupler).

Now we need to upgrade the live database structures. First remove the temporarily created rpmdata table (see part one for why).

psql packages -c "DROP TABLE rpmdata;"
ttkm upgrade

Create the stream configuration for the toolkit and have a look at it.

ttkm derive coupler
cat derived/coupler/rpm.cpl

Now we can run the stream, load the data and process all associated couples (just the one here) to load the database content. Then check that content using the view.

rpm -qa --qf '"%{NAME}","%{VERSION}","%{RELEASE}","%{ARCH}","%{BUILDTIME}"\n'| ttkm stream agent rpm refresh
psql packages -c "SELECT * FROM archcount;"
  arch  | total 
 (none) |     7
 i686   |   111
 x86_64 |  2614
 i486   |     1
 noarch |  1750
(5 rows)

Rows in package will now be added, deleted and updated automatically based on what is in the source table (rpmdata) and ultimately the content of the live data fed in from the rpm command every time the above is run.
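As a quick cross-check of the archcount view, the same per-architecture totals can be computed directly from the feed. The sketch below is only an illustration: the function name arch_counts is made up, and it assumes the five quoted fields arrive in the order given to rpm above, with no embedded quotes or commas.

```shell
#!/bin/sh
# Hypothetical helper (not part of Theon): count packages per architecture
# from the quoted CSV produced by the rpm query format used above.
# Assumes field 4 is the architecture and no field contains the text '","'.
arch_counts() {
    awk -F'","' '
        NF >= 5 { count[$4]++ }                        # field 4 = architecture
        END     { for (a in count) print a, count[a] }
    ' | sort
}

# Usage against the live feed:
#   rpm -qa --qf '"%{NAME}","%{VERSION}","%{RELEASE}","%{ARCH}","%{BUILDTIME}"\n' | arch_counts
```

If the stream has just been refreshed, the totals printed here should agree with those reported by the view.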

Next we will create configuration for TheonUI to support easier user access to the database content.

Adopting Theon – Part 3.1

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. The second part covered the necessary packages and installation. This final part will be an introductory tutorial.

In the first part of this tutorial we will create a database, manage it in Theon, make some changes, and use the factory to template some schema.  In the second part we will use TheonCoupler to load data automatically into the database. In the third part we will use TheonUI and create a desktop for the database. In the fourth and final part we will package and distribute the result.

Before doing anything, note that we need a running PostgreSQL installation on which we have permission to create databases. In this tutorial we use one on localhost with /disk/scratch as the socket connection directory.

First we will create a simple database with one table. This will be used to hold installed packages.

createdb packages
psql packages <<EOF
CREATE TABLE package (
  name TEXT,
  version TEXT,
  release TEXT,
  architecture TEXT
);
EOF

Create the Theon management database.

ttkm create

Now create a directory to hold our ModelLibrary.

cd /tmp
ttkm packages launch at .
cd packages

Import the database we have just created as the schema. In order to do this we need to create a default LiveProfile with the necessary connection details.

ttkm attach using database=packages host=/disk/scratch
ttkm import

Next test what we imported by exporting the schema.

ttkm export
cat schemat/relation/package.xsd

The above is the XML representation of the table we created. We will now check it properly by dropping the original database and then recreating it from the derived SQL.

dropdb packages
ttkm install
psql packages -c "\d"
         List of relations
 Schema |   Name    | Type  | Owner 
 public | _th_theon | table | timc
 public | package   | table | timc
(2 rows)

Note the additional metadata table Theon adds. This is not mandatory (and can be removed) but it is useful for subsequent upgrades, as we will see shortly.

Next we will define a view using the factory. To do this we first create the necessary hierarchy and add a template definition.

mkdir -p factory/templates
cat >factory/templates/archcount.fat   <<EOF
%%include "relation/view.grg"
%%equate relation_ViewName "archcount"
%%equate relation_ViewValueBlock "SQL"
%%block SQL
    architecture AS arch,
    COUNT(*) AS total
  FROM package

Then we process the factory templates and load them into the Theon management database.

ttkm gather

Next export the changed schema and check.

ttkm export
cat schemat/relation/archcount.xsd

Now we need to upgrade the installed database to add the view. This is automatic because we have the metadata table; manual intervention would be required otherwise.

ttkm upgrade
psql packages -c "\d"
         List of relations
 Schema |   Name    | Type  | Owner 
 public | _th_theon | table | timc
 public | archcount | view  | timc
 public | package   | table | timc
(3 rows)

The view has been successfully added.

Finally we want to add an extra column in the package table for the build time. There are various ways of doing this; here we will alter the live database and re-import the change so the schema in Theon is kept up to date. However, in order to keep consistent metadata, we will drop the column and then re-create it by upgrade. This is necessary in this case as we are working with a development database which is also the live database.

psql packages -c "ALTER TABLE package ADD COLUMN buildtime INTEGER;"
ttkm import
ttkm export
psql packages -c "ALTER TABLE package DROP COLUMN buildtime;"
ttkm upgrade
psql packages -c "\d package"
       Table "public.package"
    Column    |  Type   | Modifiers 
 name         | text    | 
 version      | text    | 
 release      | text    | 
 architecture | text    | 
 buildtime    | integer |

The column has been successfully added back by upgrade.

Now it is time to get some data into the database.

Adopting Theon – Part 2

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. This part covers the necessary packages and installation. The final part will be an introductory tutorial.

Theon is available from here directly from the relevant Git repositories. Pre-built packages (RPMs) are also available from here for Scientific Linux 7; these would probably work as-is on comparable distributions, for example Red Hat Enterprise Linux, CentOS and Fedora.

Theon contains three components. The base component contains the framework, the schema for Theon itself, standard transforms, the factory core templates and all the documentation. The Theon Toolkit component contains the command line interface. The Theon UI component contains the generic web browser graphical database interface. All of these are needed to use Theon. There are also some development packages, but these are only necessary to make changes to Theon itself.

Theon requires two custom support tools: Gurgle, which is used as the factory template engine, and Schematic, which is used to process the XML-Schema files and apply transforms. Both of these are also available from here, either direct from the Git repositories or as pre-built packages.

There are also a number of pre-requisite third party packages required, such as PostgreSQL, Python and various modules. Theon UI uses YUI2 – the necessary software for this has been re-packaged and is also available directly from here.

If you use Yum everything can be obtained from the Theon Yum Repository. In order to use this first create a suitable Yum repository configuration file as below.

cat >/etc/yum.repos.d/theon.repo <<EOF
name=Theon Repository

Then install the core packages.

yum install gurgle-postgresql schematic yuidoc yuicompressor yui2 theon theon-library theon-library-book theon-library-ui theon-tk theon-ui

Yum will install other pre-requisite software automatically, assuming it is available for your platform.

If you want to use Theon as a PostgreSQL extension module then you will also need the package below, built for version 9.2.23.

yum install theon-library-extension_9.2.23

The following packages are only necessary for doing Theon development.

yum install theon-library-devel theon-ui-devel

Alternatively you can clone the Git repositories and build from source for your target platform. You will need to first install all pre-requisite software (see the “.spec” file). The builds are done simply as part of the package build and so you will also need rpm/rpmbuild available on your platform, or you will need to extract and script the content of the package “.spec” file.

Gurgle
cd /tmp
git clone git://
cd gurgle
rpmbuild -ba ${RPMROOT}/SPECS/gurgle.spec
rpm -U ${RPMROOT}/RPMS/x86_64/gurgle-postgresql-*.rpm
Schematic
cd /tmp
git clone git://
cd schematic
rpmbuild -ba ${RPMROOT}/SPECS/schematic.spec
rpm -U ${RPMROOT}/RPMS/noarch/schematic-*.rpm
Theon Toolkit
cd /tmp
git clone git://
cd theon-tk
rpm -i ${RPMROOT}/SRPMS/x86_64/theon-tk-1.1-1.src.rpm
rpmbuild -ba ${RPMROOT}/SPECS/theon-tk.spec
rpm -U ${RPMROOT}/RPMS/noarch/theon-tk-*.rpm
Theon
cd /tmp
git clone git://
cd theon
rpm -i ${RPMROOT}/SRPMS/x86_64/theon-1.1-1.src.rpm
rpmbuild -ba ${RPMROOT}/SPECS/theon.spec
rpm -U ${RPMROOT}/RPMS/noarch/theon-*.rpm
Theon UI
cd /tmp
git clone git://
cd theon-ui
rpm -i ${RPMROOT}/SRPMS/x86_64/theon-ui-1.1-1.src.rpm
rpmbuild -ba ${RPMROOT}/SPECS/theon-ui.spec
rpm -U ${RPMROOT}/RPMS/noarch/theon-ui-*.rpm

Theon will be installed at the following paths by default.


You should now be ready to go.

Migrating from SVN to Git while Splitting Repository

As the final stage of a recent project we were transferring our large Subversion repository into multiple Git repositories. We wanted to retain as much of the Subversion history as possible. Interestingly this turned out to be considerably more complex than expected! See the end for links to pages that various other people have written about this process.

We initially considered splitting up the repository at the same time as doing the migration. However, it turned out to be very difficult to preserve the full history this way. Doing the process in two steps – migration then splitting – was a better approach. Another reason for doing two steps was that not all of the original repository was going to be split off. A large amount of content that was legacy would be left behind as an archive.

So the plan was that we would do an initial migration to Git. We would retain this as a standalone read-only archive of everything (about 1.6GB). Active ongoing content would then be split (copied) into separate Git repositories (which would only be about 100MB in total).

Subversion to Git Migration

The normal migration approach for Subversion to Git seems to be to use git svn. This tool, however, appears to be more of a live synchronisation tool so that people can use both Git and Subversion at the same time. For this reason, and others, we did not find it ideal for a one-off migration. We eventually discovered an excellent tool called reposurgeon which is designed precisely for one-off migration. This tool is hugely configurable, fast, and supports a lot of the spurious baggage that comes with an old Subversion repository (which in our case had started out life as an even older CVS repository). Another advantage of reposurgeon is that it automatically creates a Makefile to define migration options and manage each migration step. This encapsulates the whole process, which is important as we found that a number of migration attempts were necessary before it looked like everything had worked correctly.

The first problem we hit was that when our Subversion repository was created it had not been structured into trunk and branches. It was only at a later point (when a formal branch was actually needed) that it had been reformatted into this standard structure. The reposurgeon tool assumes the standard structure and as a result history prior to the restructuring and creation of trunk was effectively lost (buried in pseudo branches that reposurgeon creates to represent the previous root structure, complicating history extraction when subsequently splitting up the repository). Since we had only ever had one formal branch (albeit a big one) we opted to migrate using a flat structure (by using --nobranch as a read option in the reposurgeon Makefile). This meant that the repository was treated as an ordinary directory tree from the root – trunk and branches had no special meaning.

The second problem we had was that, by default, the Subversion revision numbers were not preserved. These revision numbers were important for us as they had been frequently used: in comments in our issue tracking system (as a reference to the commit where a fix had been made); in formal release notes; and within commits (for example, simple cross-references like “this change is also associated with the fix in revision r12345”). To resolve this we needed to add the --legacy write option in the reposurgeon Makefile so that every commit message was automatically annotated with the original Subversion revision number.
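Once the commit messages carry the old revision number, an old reference such as r12345 can be turned into a search of the migrated history. The helper below is a small sketch, and its name legacy_pattern is made up. It assumes the annotation appears on a line of its own in the form "Legacy-ID: 12345"; check a few migrated commits to confirm the exact wording before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: build a `git log --grep` pattern for an old
# Subversion revision reference such as "r12345" or "12345".
# ASSUMPTION: migrated commit messages contain a line of the form
# "Legacy-ID: <revision>"; adjust the prefix if your repository differs.
legacy_pattern() {
    rev=${1#r}                         # accept "r12345" as well as "12345"
    printf '^Legacy-ID: %s$\n' "$rev"  # anchored so 1234 does not match 12345
}

# Usage inside the migrated repository:
#   git log --all --grep="$(legacy_pattern r12345)"
```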

After this the migration went pretty smoothly and we had a clean Git repository with the full Subversion repository history.

There were a couple of minor leftover problems with our new Git repository. One of the directories in our Subversion repository had contained content from a third party which was actually just a Git repository. At the end of the migration git status reported lots of error: Invalid path 'trunk/tools/.../.git/...' messages. Since we didn’t care about retaining the actual embedded .git directory content we just did git reset --hard to remove these errors and leave a clean repository.

The other issue we had was that in the Subversion repository some files had altered over time from being normal files with content to being symbolic links and then back to being normal files with content again. In the migrated Git repository some of these files (but not all) were broken as the file was still a symbolic link pointing to a file which had a name which was the concatenation of what should have been the file content! We did not pick up on this error until later when a clone of the split off repository containing these files failed to checkout because the resulting file names were too long. We remain unclear what could have caused this. Presumably something odd at the time the particular Subversion commit was made that reverted the symbolic link back to a file and something reposurgeon wasn’t consequently able to deal with automatically. We fixed this in the end by careful rewrite of the offending commit in the history of the split off Git repository.

Splitting the Git Repository

There seem to be two approaches to this: git subtree, or git filter-branch with the --subdirectory-filter option. Both scan the entire commit history and keep just those commits containing the specified path (removing the rest). Both have limitations if you want the full history. With subtree it is only possible to scan one branch, so history recorded in any other branch will be lost. This was a significant issue until we chose to migrate the Subversion repository as a flat directory structure. However, a principal failing of both approaches is that the specified path must have remained the same throughout the repository history. So if, at any point in the history, the specified path has changed name or parentage, or content in that path has arrived from elsewhere in the repository, then all that history will be lost. Since a significant amount of restructuring at both file and directory level had taken place in our repository over time, this limitation was a significant obstacle to preserving full history.

One option would be to just accept the loss of history. This was not entirely unreasonable as we were retaining the original complete migration into Git as a permanent archive. However, we are not able to make that archive public because of some of the legacy content. This would have made a lot of relevant history for content in active development unavailable to anyone else. Nor would it have been convenient, even for a local developer, not to have the full history of a file self-contained within the split off repository, so that you couldn’t just do git log --follow FILE for example. Instead you would have to clone 1.6GB locally just to access the tiny fraction of it needed to get one file’s full history.

In the end we managed to find a way to do what we wanted (with one caveat), using a combination of techniques and taking advantage of our access to the full file rename history in our old Subversion repository.

Our approach was to first identify all the current files in a directory that was being split off. For each of these files the Subversion repository was then accessed to extract a full file name and path history. Then every commit in the Git repository was filtered so that it contained only these files (when they existed in that commit). The result became the split repository, where every file contained its full history irrespective of whether it changed name or path in its history. The precise steps we took for each directory being split off are outlined below.

First we created a file containing a list of all the files (including all previous names and locations) in the directory to be split off, trunk/root1 in this example case. We built this from a checked out working copy of the Subversion repository.

cd /path/to/repo/svn/working/copy
REPOURL=`svn info | awk '/^URL:/{print$2}'`
(for f in `find trunk/root1 -type f`; do svn log $f | grep -o -E '^r[0-9]+' | xargs -L 1 -irev svn info -rev $f | grep '^URL:'; done)| sed -e "s?^URL: $REPOURL/??1" | sort -u > /tmp/root1.files
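The last stage of that pipeline, which turns the URL lines printed by svn info into repository-relative paths, can be tried on its own. The sketch below uses a made-up function name (urls_to_paths) and an example repository URL. It uses sort -u so that duplicate paths are removed even when they are not adjacent.

```shell
#!/bin/sh
# Hypothetical helper: turn "URL: <repo>/<path>" lines (as printed by
# `svn info`) into a de-duplicated list of repository-relative paths.
# $1 is the repository root URL; it is assumed to contain no '?' characters,
# because '?' is used as the sed delimiter (as in the command above).
urls_to_paths() {
    grep '^URL:' | sed -e "s?^URL: $1/??" | sort -u
}

# Example with a made-up repository URL:
#   svn info -r12345 trunk/root1/a.c | urls_to_paths http://svn.example.org/repo
```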

Next we created a clone of our migrated Subversion repository so everything we did was isolated and the original could be used again to split off another repository.

cd /tmp
git clone --no-hardlinks /path/to/repo.git clone1
cd clone1

Then we ran the filter. We used --index-filter as it is a lot faster than --tree-filter and has an identical result in this particular case. For each commit this filter first removes all the files and then selectively re-instates only those files which existed in that commit and which are also listed as being part of the split directory (eventually, even if not at this particular point).  The filter also removes any resulting empty commits and re-writes tags.

git filter-branch --prune-empty --index-filter 'git rm --cached -qr -- . && cat /tmp/root1.files | xargs -n 1 git reset -q $GIT_COMMIT --' --tag-name-filter cat -- --all

Now we had a modified repository just containing the current files in the root1 directory (actually trunk/root1) but with their full history. Next we tidied up by cloning (a simple way of leaving behind all the objects that have now been orphaned by the filter) and did some shuffling so that the content of trunk/root1 became the content at the top level of the new repository.

cd /tmp
mkdir root1
cd root1
git init
git remote add origin ../clone1
git pull origin master
git remote rm origin
git mv trunk/root1 .
git clean -d -x -n
git clean -fd
git mv root1/* .
git clean -d -x -n
git clean -fd
git status -s
git commit
git status
git log

After this we simply cloned again into a bare repository and pushed this onto our Git server. Then we moved onto the next directory to be split off.

There is one major caveat with this approach. Only current files (those in the most recent commit) and their history are included. Any files that used to exist in the split directory path but have at some point been deleted will not be included. This may or may not matter. For us it did not, as one of the reasons for splitting into separate repositories was to lose the 1.5GB of legacy files that were no longer of any relevance, so we did not want them retained. Also we still had the fallback of referring to the original migrated archive with all content as a last resort.

Another problem with this approach was performance. Our repository had over 12,000 commits. The filter has to process each commit one by one. This was alright when the number of files (and their path and name history) for a split directory was small, but when it was in the thousands or more (as some were in our case) the time to process each commit became significant (from fractions of a second to tens of seconds).
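To put rough numbers on this, the total filter time is simply the number of commits multiplied by the average time per commit. The snippet below is a back-of-envelope sketch: filter_hours is a made-up name, 12000 is the lower bound on our commit count, and the per-commit figures of 1 and 10 seconds are illustrative values rather than measurements.

```shell
#!/bin/sh
# Back-of-envelope estimate: total filter time in whole hours, given the
# number of commits ($1) and an assumed average seconds per commit ($2).
filter_hours() {
    echo $(( $1 * $2 / 3600 ))
}

filter_hours 12000 1     # about 3 hours at one second per commit
filter_hours 12000 10    # about 33 hours at ten seconds per commit
```

So a split that is comfortable at fractions of a second per commit becomes a job of a day or more once each commit takes tens of seconds.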

There is probably a better way to do this, perhaps using a combination of this approach to get the unique file path changes and then using the filter to do specific commit history grafting at each change point.


Using Mock to Test RPM Packages

Our PkgForge service is a locally developed build farm. We use it to build .src.rpm files automatically for the multiple platforms and architectures we often have to support concurrently. The service uses Mock underneath. It works very well for known working packages and a “send and forget” approach. It is not so good, however, for debugging build issues (although that was never its intended purpose). I really hit this recently when trying to build a new locally developed package for wider distribution. Being used to having only a few issues to resolve (commonly just missing pre-requisite packages), I started off using the build farm service, but the numerous iterations needed and the slow rebuild cycles were getting tedious. Eventually an issue that really needed direct access to the build root to debug was the final nail in the coffin. I had previously been put off using Mock manually by assuming that it would be complicated. It turns out to be an absolute breeze, and definitely the way to pre-test new packages, or even do the final build if you want speed and controlled (re)submission (particularly when you only need to worry about distribution for one architecture and platform). So here is a brief howto (as root).

yum install mock

Mock already comes with the necessary configuration file for our main current platform and architecture, so in principle you need do nothing more than:

mock -r epel-7-x86_64.cfg --rebuild PACKAGE.src.rpm

You can then follow the output live as it builds. If you are running this on your desktop you will also find it runs refreshingly fast. The build root and results are also kept around for subsequent analysis in:


Commonly you may need to include our own local repositories. This is also simple, for example to use our “inf” repository:

cp /etc/mock/epel-7-x86_64.cfg /tmp

Edit and add the following within the set of repositories:



mock -r /tmp/epel-7-x86_64.cfg --rebuild PACKAGE.src.rpm

If you get failures during build you can simply chroot and test the build environment, do further debugging and manual package/file installations and rebuilds:

chroot /var/lib/mock/epel-7-x86_64/root

Note to the wicked: if you are using Java you are likely to find that after doing the chroot you also need to do:

mount -t proc none /proc

PkgForge can automatically build multiple locally dependent packages in one go. If you need to do this with a local mock setup then you need to manually build each package one by one and submit them so they are accessible to mock (or you could investigate the feature of mock that means it can dynamically generate and update a yum repository from the results of previous builds).