All posts by timc

Adopting Theon – Part 3.4

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. The second part covered the necessary packages and installation. This final part is an introductory tutorial.

In the first part of this tutorial we created a database and started to manage it in Theon. In the second part we configured TheonCoupler to load data automatically into the database. In the third part we configured TheonUI to allow access to the database and made a custom desktop. In this fourth and final part we will package and distribute the result.

The Theon toolkit has some commands to build and package a ModelLibrary for distribution. The resulting distribution will then have the toolkit as a pre-requisite.

We need to do some work with Git first, as the ModelLibrary is held as a Git repository.

rm -fr derived
git add factory schemat
git commit --message "first pass"

Now we can build a release (which will have version number 1.1 by default).

ttkm release

Check the result.

ttkm display
* >
  Establishing UnitArchive
  - Checking for UnitArchive in Folder[CWD]
    - Found Unit[packages] in Folder[CWD]
    - This[packages]
* packages[CWD] >
  History of Unit[packages]
  - Release Directory:         CWD::.release
  - Release Directory Index:   CWD::.release/index
  - Release SpecFile:          CWD::packages.spec
  - Current Release Version:   packages-1.1
  - Current Release Revision:  07f1a86d37bc74a6b5a2bd0c5157650c60501b3a

Now build source and binary packages for the current release above and list the results.

ttkm package rebuild
ls package
packages-1.1-1.src.rpm                   packages-theon-library-devel-1.1-1.noarch.rpm
packages-theon-library-1.1-1.noarch.rpm  packages-theon-library-ui-1.1-1.noarch.rpm
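
If you want to check what actually went into one of the binary packages, a standard rpm query lists its files:

rpm -qpl package/packages-theon-library-1.1-1.noarch.rpm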

Install the packages.

rpm -U package/*.noarch.rpm

Now we can test the installation: first drop our existing live database, change directory, and install from the distribution.

dropdb packages
cd ~/
ttkm packages install

Feed in the data in the same way but using the coupler configuration in the distribution.

rpm -qa --qf '"%{NAME}","%{VERSION}","%{RELEASE}","%{ARCH}","%{BUILDTIME}"\n'| ttkm packages stream agent rpm refresh

rpm -qa --qf '["%{NAME}","%{ARCH}","%{REQUIRENAME}","%{REQUIREVERSION}"\n]'| ttkm packages stream agent rpm refresh to packdep couple packdep

Finally start up TheonUI to access the content using the configuration and metadata in the distribution.

ttkm packages server ui

Note that the server configuration is built for the distribution using the LiveProfile attached to the ModelLibrary at build time. In most distributions this would more likely be configured separately.
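
For example, a deployment might attach its own LiveProfile with local connection details after installing. The invocation below is an assumption, mirroring the ttkm attach command from part one applied to the installed unit:

# assumption: re-attach connection details for the installed unit,
# mirroring "ttkm attach using ..." from part one
ttkm packages attach using database=packages host=/var/run/postgresql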

There is more extensive documentation on Theon, TheonCoupler, TheonToolkit and TheonUI in The Theon Book.

Adopting Theon – Part 3.3

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. The second part covered the necessary packages and installation. This final part is an introductory tutorial.

In the first part of this tutorial we created a database and started to manage it in Theon. In the second part we configured TheonCoupler to load data automatically into the database. In this third part we will configure TheonUI to allow access to the database and make a custom desktop. In the fourth and final part we will package and distribute the result.

Clear out existing derived content and generate the TheonUI metadata content and configuration file.

rm -fr derived
ttkm derive ui

Start up a server instance using the just generated configuration file and content.

ttkm server ui using config=derived/ui/default/ui.conf

On a browser running on the local machine go to localhost:8000/ui.html to access our instance of TheonUI. Once there we can access the data in the package or rpmdata tables as follows.

From the Choose Desktop menu select Browser.
From the Choose Cluster menu select All Clusters.
From the Choose Table menu select package.
Then the content will be accessible.
Choose Use Grid option to see the content in a tabular form.

We can construct a desktop to show content from multiple related tables simultaneously. However, at the moment our database is too simple for this. So first we will add another table which is a child of package and create a stream to load it with data. All aspects of this have been covered in part one and part two of this tutorial, so much of the detail will be glossed over here.

Create a new table related to package and a corresponding data source table, import the new structure into Theon, then undo our changes and upgrade from Theon.

psql packages <<EOF
CREATE TABLE packdep ( name TEXT, arch TEXT, requires TEXT, version TEXT );
CREATE TABLE rpmdeps AS SELECT * FROM packdep;
ALTER TABLE package ADD PRIMARY KEY (name,architecture);
ALTER TABLE packdep ADD FOREIGN KEY (name,arch) REFERENCES package (name,architecture);
EOF
ttkm import
psql packages <<EOF
DROP TABLE packdep;
DROP TABLE rpmdeps;
ALTER TABLE package DROP CONSTRAINT package_pkey;
EOF
ttkm export
ttkm upgrade

Now modify the existing rpm Stream to add a new Couple with a custom source table and feed of data. Start the TheonUI first.

ttkm self server ui

Now enter the necessary data using a locally running browser pointed at localhost:8000/ui.html.

Open the Streaming desktop.
On the first panel (Choose Model) ensure the right model is shown (in this case packages should be the only one) and then open the Stream panel and add one new row into the Couples box:
 Position SourceEntity TargetEntity Name[Info]
 2 rpmdeps packdep rpmpackagedeps
Open the Couple panel, make sure the rpmpackagedeps couple is shown and enter and save the following for the couple (keep quotes):
 Source Origin Value = 'RPM'
 Reload = Pipe
 Then add two new rows in the Bonds box:
 Position Type
 1 Handle
 2 Column

Open the Bond panel and add rows in the Pairs box for each bond type as below:
 For Handle:
 Position SourceColumn TargetColumn
 1 name name
 2 arch arch
 3 requires requires
 For Column:
 Position SourceColumn TargetColumn
 1 version version

For the couple we have defined the source table (rpmdeps) and target table (packdep). We have also defined the handle pairs (name:name + arch:arch + requires:requires) to tie records together between source and target.

Now exit the server process with Control-C.

Export the modified schema and upgrade the live database structures.

ttkm export
ttkm upgrade

Update the stream configuration.

ttkm derive coupler

Run the stream, load the data and process the new couple to load the target table content. Then check that content.

rpm -qa --qf '["%{NAME}","%{ARCH}","%{REQUIRENAME}","%{REQUIREVERSION}"\n]'| ttkm stream agent rpm refresh to packdep couple packdep
psql packages -c "SELECT arch, COUNT(*) AS total FROM packdep GROUP BY arch;"
  arch  | total 
--------+-------
 (none) |     1
 i486   |    99
 noarch | 17006
 i686   |  2631
 x86_64 | 63068
(5 rows)

Now that we have some related tables and data, we can construct a desktop in TheonUI that in one panel allows package selection and shows the related package dependencies. Start up TheonUI for the Theon management database.

ttkm self server ui

Now enter the necessary data using a locally running browser pointed at localhost:8000/ui.html.

Open the Reflection desktop.
On the first panel (Choose Model) ensure the right model is shown (in this case packages should be the only one).
Add a new row in the Desktops box:
 Desktop-Label = packages
 Name[Info] = Packages
Open the Panel panel and add one new row:
 Panel-Label = package
 Name[Info] = Package
 ? Base Entity = package
 Selector = Scroll
and save the changes. Then add one new row in the Connectors box:
 Position ?JoiningRelationship Type Label
 1 packdep_name_fkey ForwardNormal Requirements
and save the changes. Then add one new row in the Presenters box:
 Presenter-Label Style Height
 Requirements Grid 16
and save the changes. Then add the following rows in the Fields box:
 Position ?Connector ?Attribute Presenter Width Name[Info]
 1 -NONE- name - 80 Name
 2 -NONE- architecture - 20 Architecture
 3 -NONE- version - 20 Version
 4 -NONE- release - 20 Release
 5 Requirements requires Requirements 80 Name
 6 Requirements version Requirements 20 Version
and save the changes.
Open the Desktop panel and add one new row into the Panels box:
 Position ?Panel Link-Label
 1 package Package

Now exit the server process with Control-C.

Export the modified schema and derive the new UI metadata.

ttkm export
rm -fr derived
ttkm derive ui

Before starting the server, tweak the configuration manually so that the initial data fetch size is limited. Then start up a server instance using the just generated configuration file and content.

Add at the top of the "config = {" section in derived/ui/default/ui.conf:
  "limit_rows" : 250,
  "limit_rows_conditionally" : True,

ttkm server ui using config=derived/ui/default/ui.conf

On a browser running on the local machine go to localhost:8000/ui.html to access our instance of TheonUI.

From the Choose Desktop menu select Packages.

This is a very simple demonstration desktop. You can use it to look up a package by name at the top, for example, and each pre-requisite package and version will be shown in the lower box. Or you can query for a pre-requisite package name and all the packages dependent on it will be shown.

Adopting Theon – Part 3.2

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. The second part covered the necessary packages and installation. This final part is an introductory tutorial.

In the first part of this tutorial we created a database and started to manage it in Theon. In this second part we will use TheonCoupler to load data automatically into the database. In the third part we will use TheonUI and create a desktop for the database. In the fourth and final part we will package and distribute the result.

External data can be brought into (or accessed from) PostgreSQL in many ways. Here we will transfer it in at regular intervals from an external source. The command below will act as our live source of data.

rpm -qa --qf '"%{NAME}","%{VERSION}","%{RELEASE}","%{ARCH}","%{BUILDTIME}"\n'

This returns information on all installed packages in CSV format. We need to create a table in our database to hold this data. In this example it will have the same structure as our package table, but we can’t use that table directly since it is the sync target table. In most cases the external source data will not align directly with an existing table; here it does, so we can cheat and create the new table using our existing one as a template.

psql packages -c "CREATE TABLE rpmdata AS SELECT * FROM package;"

Since this change was made on the live database we need to import it back into Theon.

ttkm import

Next we need to configure the Stream and Couple that will process the data. This needs to be done via TheonUI. So first create an instance of the server for the Theon management database. The default configuration will suffice.

ttkm self server ui

Then using a locally running browser point it at localhost:8000/ui.html to bring up the user interface. Follow the steps below to add the Stream.

Open the Streaming desktop.
On the first panel (Choose Model) ensure the right model is shown (in this case packages should be the only one) and then:
Add a new row in the Streams box:
 Stream-Label = rpm
 Origin = RPM
 ? Feed Entity = rpmdata
 Source = Pipe
 Format = CSV
 Name[Info] = rpm

The above creates a stream agent called rpm which takes data from a standard unix pipe in CSV format and maintains the rpmdata table as a copy of that content.

Now define a Couple to sync data from the source to our target table. We only need one in this instance.

Open the Stream panel and ensure rpm is shown
and then add one new row into the Couples box:
 Position SourceEntity TargetEntity Name[Info]
 1 rpmdata package rpmpackage

Open the Couple panel and enter and save the following for the couple (keep quotes):
 Source Origin Value = 'RPM'
 Then add two new rows in the Bonds box:
 Position Type
 1 Handle
 2 Column

Open the Bond panel and add rows in the Pairs box for each bond type as below:
 For Handle:
 Position SourceColumn TargetColumn
 1 name name
 2 architecture architecture
 For Column:
 Position SourceColumn TargetColumn
 1 version version
 2 release release
 3 buildtime buildtime

For the couple we have defined the source table (rpmdata) and target table (package). We have also defined the unique handle pair (name:name + architecture:architecture) to tie records together between source and target. Finally we have defined the data column pairs, the values of which will be maintained by the stream process.
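
Conceptually, each refresh of this couple has an effect along the lines of the SQL below. This is only an illustrative sketch of the handle and column pair semantics, not what TheonCoupler actually executes (which also maintains its metadata columns and origin tracking):

psql packages <<EOF
-- illustration only: insert source rows whose handle is missing from the target
INSERT INTO package (name, architecture, version, release, buildtime)
  SELECT name, architecture, version, release, buildtime FROM rpmdata r
  WHERE NOT EXISTS (SELECT 1 FROM package p
                    WHERE p.name = r.name AND p.architecture = r.architecture);
-- update the column pairs wherever the handle matches
UPDATE package p
  SET version = r.version, release = r.release, buildtime = r.buildtime
  FROM rpmdata r
  WHERE p.name = r.name AND p.architecture = r.architecture;
-- delete target rows whose handle no longer exists in the source
DELETE FROM package p
  WHERE NOT EXISTS (SELECT 1 FROM rpmdata r
                    WHERE r.name = p.name AND r.architecture = p.architecture);
EOF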

Now exit the server process with Control-C.

Export the modified schema and review the new content.

ttkm export
ls -R schemat
schemat:
live  packages.xsd  process  relation  stream

schemat/live:
packages.xsd

schemat/process:
_th_coupler_package_check.xsd  _th_rpm_package_couple.xsd

schemat/relation:
archcount.xsd  package.xsd  rpmdata.xsd

schemat/stream:
rpm.xsd

The rpmdata table has been added. The stream configuration has been added. Some stream management processes have been created. There will also be some changes to the target table (package) structure (metadata columns and triggers that have been added to support the content being managed by TheonCoupler).

Now we need to upgrade the live database structures. First remove the temporarily created rpmdata table (see part one for why).

psql packages -c "DROP TABLE rpmdata;"
ttkm upgrade
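
If you are curious about those additions, describe the target table; the coupler support columns and triggers (their names are internal to Theon) should now appear alongside the original columns:

psql packages -c "\d package"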

Create the stream configuration for the toolkit and have a look at it.

ttkm derive coupler
cat derived/coupler/rpm.cpl

Now we can run the stream, load the data and process all associated couples (just the one here) to load the database content. Then check that content using the view.

rpm -qa --qf '"%{NAME}","%{VERSION}","%{RELEASE}","%{ARCH}","%{BUILDTIME}"\n'| ttkm stream agent rpm refresh
psql packages -c "SELECT * FROM archcount;"
  arch  | total 
--------+-------
 (none) |     7
 i686   |   111
 x86_64 |  2614
 i486   |     1
 noarch |  1750
(5 rows)

Rows in package will now be added, deleted and updated automatically based on what is in the source table (rpmdata) and ultimately the content of the live data fed in from the rpm command every time the above is run.
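
To keep the table in step with the live system the refresh can be scheduled, for example from cron. Below is a sketch with an illustrative script path, assuming the ModelLibrary directory used in this tutorial; the command is wrapped in a script because % characters are special in crontab entries:

cat >/usr/local/bin/refresh-packages <<'EOF'
#!/bin/sh
cd /tmp/packages
rpm -qa --qf '"%{NAME}","%{VERSION}","%{RELEASE}","%{ARCH}","%{BUILDTIME}"\n' | ttkm stream agent rpm refresh
EOF
chmod +x /usr/local/bin/refresh-packages

Then an hourly crontab entry would be:

0 * * * * /usr/local/bin/refresh-packages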

Next we will create configuration for TheonUI to support easier user access to the database content.

Adopting Theon – Part 3.1

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. The second part covered the necessary packages and installation. This final part will be an introductory tutorial.

In the first part of this tutorial we will create a database, manage it in Theon, make some changes, and use the factory to template some schema. In the second part we will use TheonCoupler to load data automatically into the database. In the third part we will use TheonUI and create a desktop for the database. In the fourth and final part we will package and distribute the result.

Before doing anything, note that we need a running PostgreSQL installation on which we have access to create databases. In this tutorial we use one on localhost with /disk/scratch as the socket connection directory.

First we will create a simple database with one table. This will be used to hold installed packages.

createdb packages
psql packages <<EOF
CREATE TABLE package (
  name TEXT,
  version TEXT,
  release TEXT,
  architecture TEXT
);
EOF

Create the Theon management database.

ttkm create

Now create a directory to hold our ModelLibrary.

cd /tmp
ttkm packages launch at .
cd packages

Import the database we have just created as the schema. In order to do this we need to create a default LiveProfile with the necessary connection details.

ttkm attach using database=packages host=/disk/scratch
ttkm import

Next test what we imported by exporting the schema.

ttkm export
cat schemat/relation/package.xsd

The above is the XML representation of the table we created. We will now check this properly by dropping the original database and then, by deriving the necessary SQL, recreating it.

dropdb packages
ttkm install
psql packages -c "\d"
         List of relations
 Schema |   Name    | Type  | Owner 
--------+-----------+-------+-------
 public | _th_theon | table | timc
 public | package   | table | timc
(2 rows)

Note the additional metadata table Theon adds. This is not mandatory (and can be removed) but it is useful for subsequent upgrades, as we will see shortly.
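
Its layout is internal to Theon, but you can describe it if you are curious:

psql packages -c "\d _th_theon"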

Next we will define a view using the factory. To do this we first create the necessary hierarchy and add a template definition.

mkdir -p factory/templates
cat >factory/templates/archcount.fat <<EOF
%%include "relation/view.grg"
%%equate relation_ViewName "archcount"
%%equate relation_ViewValueBlock "SQL"
%%block SQL
  SELECT
    architecture AS arch,
    COUNT(*) AS total
  FROM package
  GROUP BY
    arch
%%end
EOF

Then we process the factory templates and load them into the Theon management database.

ttkm gather

Next export the changed schema and check.

ttkm export
cat schemat/relation/archcount.xsd

Now we need to upgrade the installed database to add the view. This is automatic as we have the metadata table; manual machinations would be required otherwise.

ttkm upgrade
psql packages -c "\d"
         List of relations
 Schema |   Name    | Type  | Owner 
--------+-----------+-------+-------
 public | _th_theon | table | timc
 public | archcount | view  | timc
 public | package   | table | timc
(3 rows)

The view has been successfully added.

Finally we want to add an extra column to the package table for the build time. There are various ways of doing this; here we will alter the live database and re-import the change so the schema in Theon is kept up-to-date. However, in order to keep consistent metadata, we will drop the column and then re-create it by upgrade. This is necessary in this case as we are working with a development database which is also the live database.

psql packages -c "ALTER TABLE package ADD COLUMN buildtime INTEGER;"
ttkm import
ttkm export
psql packages -c "ALTER TABLE package DROP COLUMN buildtime;"
ttkm upgrade
psql packages -c "\d package"
       Table "public.package"
    Column    |  Type   | Modifiers 
--------------+---------+-----------
 name         | text    | 
 version      | text    | 
 release      | text    | 
 architecture | text    | 
 buildtime    | integer |

The column has been successfully added back by upgrade.

Now it is time to get some data into the database.

Adopting Theon – Part 2

The first part of this series discussed why you might use Theon as a Schema Management Tool for your PostgreSQL database. This part covers the necessary packages and installation. The final part will be an introductory tutorial.

Theon is available here, directly from the relevant Git repositories. Pre-built packages (RPMs) are also available here for Scientific Linux 7, and would probably work as-is on comparable distributions, for example Red Hat Enterprise Linux, CentOS and Fedora.

Theon contains three components. The base component contains the framework, the schema for Theon itself, standard transforms, the factory core templates and all the documentation. The Theon Toolkit component contains the command line interface. The Theon UI component contains the generic web browser graphical database interface. All of these are needed to use Theon. There are also some development packages, but these are only necessary to make changes to Theon itself.

Theon requires a couple of custom support tools: Gurgle, which is used as the factory template engine, and Schematic, which is used to process the XML-Schema files and apply transforms. Both of these are also available here, either direct from the Git repositories or as pre-built packages.

There are also a number of pre-requisite third party packages required, such as PostgreSQL, Python and various modules. Theon UI uses YUI2 – the necessary software for this has been re-packaged and is also available directly from here.

If you use Yum everything can be obtained from the Theon Yum Repository. In order to use this, first create a suitable Yum repository configuration file as below.

cat >/etc/yum.repos.d/theon.repo <<EOF
[theon]
name=Theon Repository
baseurl=http://groups.inf.ed.ac.uk/theon/repo/rpms/os/el7/x86_64/
gpgcheck=0
enabled=1
EOF

Then install the core packages.

yum install gurgle-postgresql schematic yuidoc yuicompressor yui2 theon theon-library theon-library-book theon-library-ui theon-tk theon-ui

Yum will install other pre-requisite software automatically, assuming it is available for your platform.

If you want to use Theon as a PostgreSQL extension module then you will also need the package below, built for version 9.2.23.

yum install theon-library-extension_9.2.23

The following packages are only necessary for doing Theon development.

yum install theon-library-devel theon-ui-devel

Alternatively you can clone the Git repositories and build from source for your target platform. You will need to first install all pre-requisite software (see the “.spec” file). The builds are done as part of the package build, so you will also need rpm/rpmbuild available on your platform, or you will need to extract and script the content of the package “.spec” file.

Gurgle
cd /tmp
git clone git://afsgit.inf.ed.ac.uk/timc/gurgle
cd gurgle
./makerpm
rpmbuild -ba ${RPMROOT}/SPECS/gurgle.spec
rpm -U ${RPMROOT}/RPMS/x86_64/gurgle-postgresql-*.rpm
Schematic
cd /tmp
git clone git://afsgit.inf.ed.ac.uk/timc/schematic
cd schematic
./makerpm
rpmbuild -ba ${RPMROOT}/SPECS/schematic.spec
rpm -U ${RPMROOT}/RPMS/noarch/schematic-*.rpm
Theon Toolkit
cd /tmp
git clone git://afsgit.inf.ed.ac.uk/timc/theon-tk
cd theon-tk
./makerpm
rpm -i ${RPMROOT}/SRPMS/x86_64/theon-tk-1.1-1.src.rpm
rpmbuild -ba ${RPMROOT}/SPECS/theon-tk.spec
rpm -U ${RPMROOT}/RPMS/noarch/theon-tk-*.rpm
Theon
cd /tmp
git clone git://afsgit.inf.ed.ac.uk/timc/theon
cd theon
./makerpm
rpm -i ${RPMROOT}/SRPMS/x86_64/theon-1.1-1.src.rpm
rpmbuild -ba ${RPMROOT}/SPECS/theon.spec
rpm -U ${RPMROOT}/RPMS/noarch/theon-*.rpm
Theon UI
cd /tmp
git clone git://afsgit.inf.ed.ac.uk/timc/theon-ui
cd theon-ui
./makerpm
rpm -i ${RPMROOT}/SRPMS/x86_64/theon-ui-1.1-1.src.rpm
rpmbuild -ba ${RPMROOT}/SPECS/theon-ui.spec
rpm -U ${RPMROOT}/RPMS/noarch/theon-ui-*.rpm

Theon will be installed at the following paths by default.

/usr/share/theon
/etc/theon
/usr/lib/theon
/usr/lib/python2.7/site-packages/theontk
/usr/lib/python2.7/site-packages/theonui
/usr/bin/theontk*
/usr/bin/ttk*

You should now be ready to go.

Migrating from SVN to Git while Splitting Repository

As the final stage of a recent project we were transferring our large Subversion repository into multiple Git repositories. We wanted to retain as much of the Subversion history as possible. Interestingly this turned out to be considerably more complex than expected! See the end for links to pages that various other people have written about this process.

We initially considered splitting up the repository at the same time as doing the migration. However, it turned out to be very difficult to preserve the full history this way. Doing the process in two steps – migration then splitting – was a better approach. Another reason for doing two steps was that not all of the original repository was going to be split off. A large amount of legacy content would be left behind as an archive.

So the plan was that we would do an initial migration to Git. We would retain this as a standalone read-only archive of everything (about 1.6GB). Active ongoing content would then be split (copied) into separate Git repositories (which would only be about 100MB in total).

Subversion to Git Migration

The normal migration approach for Subversion to Git seems to be to use git svn. This tool however appears to be more of a live synchronisation tool, so that people can use both Git and Subversion at the same time. For this reason, and others, we did not find it ideal for a one-off migration. We eventually discovered an excellent tool called reposurgeon which is designed precisely for one-off migration. This tool is hugely configurable, fast, and supports a lot of the spurious baggage that comes with an old Subversion repository (which in our case had started out life as an even older CVS repository). Another advantage of reposurgeon is that it automatically creates a Makefile to define migration options and manage each migration step. This encapsulates the whole process, which is important as we found that a number of migration attempts were necessary before it looked like everything had worked correctly.

The first problem we hit was that when our Subversion repository was created it had not been structured into trunk and branches. It was only at a later point (when a formal branch was actually needed) that it had been reformatted into this standard structure. The reposurgeon tool assumes the standard structure and as a result history prior to the restructuring and creation of trunk was effectively lost (buried in pseudo branches that reposurgeon creates to represent the previous root structure, complicating history extraction when subsequently splitting up the repository). Since we had only ever had one formal branch (albeit a big one) we opted to migrate using a flat structure (by using --nobranch as a read option in the reposurgeon Makefile). This meant that the repository was treated as an ordinary directory tree from the root – trunk and branches had no special meaning.

The second problem we had was that, by default, the Subversion revision numbers were not preserved. These revision numbers were important for us as they had been frequently used: in comments in our issue tracking system (as a reference to the commit where a fix had been made); in formal release notes; within commits (for example, simple cross-references like “this change is also associated with the fix in revision r12345”). To resolve this we needed to add the --legacy write option in the reposurgeon Makefile so that every commit message was automatically annotated with the original Subversion revision number.

After this the migration went pretty smoothly and we had a clean Git repository with the full Subversion repository history.

There were a couple of minor leftover problems with our new Git repository. One of the directories in our Subversion repository had contained content from a third party which was actually just a Git repository. At the end of the migration git status reported lots of error: Invalid path 'trunk/tools/.../.git/...' messages. Since we didn’t care about retaining the actual embedded .git directory content we just did git reset --hard to remove these errors and leave a clean repository.

The other issue we had was that in the Subversion repository some files had altered over time from being normal files with content, to being symbolic links, and then back to being normal files with content again. In the migrated Git repository some of these files (but not all) were broken: the file was still a symbolic link pointing to a file whose name was the concatenation of what should have been the file content! We did not pick up on this error until later, when a clone of the split off repository containing these files failed to checkout because the resulting file names were too long. We remain unclear what could have caused this. Presumably something odd at the time the particular Subversion commit was made that reverted the symbolic link back to a file, and something reposurgeon consequently wasn’t able to deal with automatically. We fixed this in the end by careful rewrite of the offending commit in the history of the split off Git repository.

Splitting the Git Repository

There seem to be two approaches to this: use git subtree, or git filter-branch with the --subdirectory-filter option. Both scan the entire commit history and keep just those commits containing the specified path (removing the rest). Both have limitations if you want the full history. With subtree it is only possible to scan one branch, so history recorded in any other branch will be lost. This was a significant issue until we chose to migrate the Subversion repository as a flat directory structure. However, a principal failing of both approaches is that the specified path must have remained the same throughout the repository history. So, if at any point in the history the specified path has changed name or parentage, or content in that path has arrived into it from elsewhere in the repository, then all that history will be lost. Since a significant amount of re-structuring at both file and directory level had taken place in our repository over time, this limitation was a major obstacle to preserving full history.

One option would be to just accept the loss of history. This was not entirely unreasonable as we were retaining the original complete migration into Git as a permanent archive. However, we are not able to make that archive public because of some of the legacy content. This would have made a lot of relevant history for content in active development unavailable to anyone else. Nor would it have been that convenient, even as a local developer, to not have the full history of a file self-contained within the split off repository – so you couldn’t just do git log --follow FILE for example. Instead, you would have to locally clone 1.6GB just to access a tiny fraction of that to get one file’s full history.

In the end we managed to find a way to do what we wanted (with one caveat), using a combination of techniques, helped by the fact that we also had access to the full file rename history in our old Subversion repository.

Our approach was to first identify all the current files in a directory that was being split off. For each of these files the Subversion repository was then accessed to extract a full file name and path history. Then every commit in the Git repository was filtered so that it contained only these files (when they existed in that commit). The result became the split repository, where every file contained its full history irrespective of whether it changed name or path in its history. The precise steps we took for each directory being split off are outlined below.

First we created a file containing a list of all the files (including all previous names and locations) in the directory to be split off, trunk/root1 in this example case. We built this from a checked out working copy of the Subversion repository.

cd /path/to/repo/svn/working/copy
REPOURL=`svn info | awk '/^URL:/{print$2}'`
(for f in `find trunk/root1 -type f`; do svn log $f | grep -o -E '^r[0-9]+' | xargs -L 1 -irev svn info -rev $f | grep '^URL:'; done)| sed -e "s?^URL: $REPOURL/??1" | uniq > /tmp/root1.files

Next we created a clone of the Git repository produced by the migration, so everything we did was isolated and the original could be used again to split off another repository.

cd /tmp
git clone --no-hardlink /path/to/repo.git clone1
cd clone1

Then we ran the filter. We used --index-filter as it is a lot faster than --tree-filter and has an identical result in this particular case. For each commit this filter first removes all the files and then selectively re-instates only those files which existed in that commit and which are also listed as being part of the split directory (eventually, even if not at this particular point).  The filter also removes any resulting empty commits and re-writes tags.

git filter-branch --prune-empty --index-filter 'git rm --cached -qr -- . && cat /tmp/root1.files | xargs -n 1 git reset -q $GIT_COMMIT --' --tag-name-filter cat -- --all

Now we had a modified repository just containing the current files in the root1 directory (actually trunk/root1) but with their full history. Next we tidied up by cloning (a simple way of leaving behind all the objects that have now been orphaned by the filter) and did some shuffling so that the content of trunk/root1 became the content at the top level of the new repository.

cd /tmp
mkdir root1
cd root1
git init
git remote add origin ../clone1
git pull origin master
git remote rm origin
# move trunk/root1 up one level, then its content to the top
git mv trunk/root1 .
git clean -d -x -n   # dry run first to review what will be removed
git clean -fd
git mv root1/* .
git clean -d -x -n
git clean -fd
git status -s
git commit
git status
git log

After this we simply cloned again into a bare repository and pushed this onto our Git server. Then we moved onto the next directory to be split off.
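
That last step is plain Git; a sketch (the server URL is illustrative):

cd /tmp
git clone --bare root1 root1.git
cd root1.git
git push --mirror git@gitserver.example.org:root1.git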

There is one major caveat with this approach. Only current files (those in the most recent commit) and their history are included. Any files that used to exist in the split directory path but have at some point been deleted will not be included. This may or may not matter. For us it did not, as one of the reasons for splitting into separate repositories was to lose the 1.5GB of legacy files that were no longer of any relevance, so we did not want them retained. Also we still had the fallback of referring to the original migrated archive with all content as a last resort.

Another problem with this approach was performance. Our repository had over 12,000 commits. The filter has to process each commit one by one. This was alright when the number of files (and their path and name history) for a split directory was small, but when it was in the thousands or more (as some were in our case) the time to process each commit became significant (from fractions of a second to tens of seconds).

There is probably a better way to do this. Perhaps one could use a combination of this approach to get the unique file path changes and then use the filter to do specific commit history grafting at each change point.

References

http://www.catb.org/~esr/reposurgeon/reposurgeon.html

https://www.atlassian.com/blog/git/tear-apart-repository-git-way

https://stackoverflow.com/questions/14759345/how-to-split-a-git-repository-and-follow-directory-renames

https://medium.com/making-soft-things/where-is-git-split-98dfbaec3d9a

https://superuser.com/questions/241660/how-can-i-trace-the-history-of-a-files-name-in-svn

https://stackoverflow.com/questions/19954485/extract-multiple-directories-using-git-filter-branch

https://stackoverflow.com/questions/34568849/git-split-repository-directory-preserving-move-renames-history

https://kevin.deldycke.com/2011/02/moving-git-subtree-repository/

http://git.661346.n2.nabble.com/Remove-all-files-except-a-few-files-using-filter-branch-td7567155.html

using mock to test rpm packages

Our PkgForge service is a locally developed build farm. We use this for automatically building .src.rpm files for the multiple platforms and architectures we often have to support concurrently. The service uses Mock underneath. It works very well for known working packages and a “send and forget” approach. It is not so good, however, for debugging build issues (although that was never its intended purpose). We really hit this recently when trying to build a new locally developed package for wider distribution. Being used to normally having only a few issues to resolve (commonly just missing pre-requisite packages), we started off using the build farm service. But the numerous iterations needed and the slow rebuild cycles were getting tedious. Eventually an issue that really needed direct access to the build root to debug was the final nail in the coffin. We had previously been put off using Mock manually by assuming that it would be complicated. It turns out to be an absolute breeze and definitely the way to pre-test new packages, or even do the final build if you want speed and controlled (re)submission (particularly when you only need to worry about distribution for one architecture and platform). So here is a brief howto (as root).

yum install mock

Mock comes with the necessary configuration file for our main current platform and architecture already, so in principle you need do nothing more than:

mock -r epel-7-x86_64.cfg --rebuild PACKAGE.src.rpm

You can then follow the output live as it builds. If you are running this on your desktop you will also find it runs refreshingly fast. The build root and results are also kept around for subsequent analysis in:

/var/lib/mock/epel-7-x86_64/root
/var/lib/mock/epel-7-x86_64/result/{build,root,state}.log

Commonly you may need to include our own local repositories. This is also simple; for example, to use our “inf” repository:

cp /etc/mock/epel-7-x86_64.cfg /tmp

Edit and add the following within the set of repositories:

[el7inf]
name=el7inf
baseurl=http://http.pkgs.inf.ed.ac.uk/rpms/os/el7/x86_64/inf
priority=1
enabled=1
gpgcheck=0

then:

mock -r /tmp/epel-7-x86_64.cfg --rebuild PACKAGE.src.rpm

If you get failures during build you can simply chroot and test the build environment, do further debugging and manual package/file installations and rebuilds:

chroot /var/lib/mock/epel-7-x86_64/root

Note to the wicked: if you are using Java you are likely to find that after doing the chroot you also need to do:

mount -t proc none /proc

PkgForge can automatically build multiple locally dependent packages in one go. If you need to do this with a local mock setup then you need to manually build each package one by one and submit them so they are accessible to mock (or you could investigate mock’s ability to dynamically generate and update a yum repository from the results of previous builds; a manual equivalent is sketched below).
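
As a sketch of that last idea done manually, assuming the standard result path and the createrepo tool: publish previous build results as a yum repository, then add it to the set of repositories in your copy of the mock configuration:

createrepo /var/lib/mock/epel-7-x86_64/result

[local-results]
name=local-results
baseurl=file:///var/lib/mock/epel-7-x86_64/result
priority=1
enabled=1
gpgcheck=0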

Python Command Scripts

You would think trying to install and package a Python package and module with a command script would be easy. It turns out this is a twisty minefield on top of shifting quicksand, just itching to catch the unwary. We found this article, though, which was most helpful.
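
As a pointer, one common mechanism is a setuptools console_scripts entry point. A minimal sketch, with all names illustrative:

cat >setup.py <<'EOF'
from setuptools import setup

setup(
    name="mytool",              # illustrative project name
    version="1.0",
    packages=["mytool"],
    entry_points={
        "console_scripts": [
            # installs a "mytool" command that calls main() in mytool/cli.py
            "mytool = mytool.cli:main",
        ],
    },
)
EOF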

Adopting Theon – Part 1

Theon is a Schema Version Management system for PostgreSQL databases. So the first question to ask yourself is why you would want to bother with such a thing at all.

Why use Theon?

  • your data model is in constant flux, needs to be adapted frequently and responsively to meet new requirements
  • your data model is large and complex
  • you need to manage different instances of your data model for production and development servers, for handling continuous release/testing cycles, for deployment at multiple organisations
  • many different people are concurrently working on making changes to your data model
  • your data model needs to be kept in alignment with other outside systems, you need it to be automatically maintained based on the requirements of those systems
  • you want to develop your data model at a higher abstracted layer than physical database tables, columns and relationships
  • you or others need to maintain/upgrade external applications so that they remain in alignment with specific versions of your data model
  • others maintain independent systems/applications using your data model and need a robust/automatic upgrade path

If you want/need to do any/all of the above then Theon may help you out. If not then adopting a Schema Version Management system might still be the right thing to do – after all, it is considered sensible to version control source code, so why not databases (see Database Refactoring) – but you will probably get less immediate benefit.

There are other Schema Version Management systems that support PostgreSQL: Liquibase, Flyway, sqlHawk, MIGRATEdb, schema-evolution-manager, Alembic, Sqitch, DBSteward. All these vary in functionality, complexity and price. Follow the links and make your evaluation of the pros and cons of each depending on your own specific needs. Architecturally Theon is probably closest to DBSteward.

Why choose Theon?

  • open source under GPLv3
  • unix command line scriptable tool chain
  • extensible
  • change sets are just SQL DDL
  • built in graphical schema editor
  • make changes by direct DDL manipulation on development database instances
  • import and go

Why not?

  • immature
  • evolving
  • only supports a subset of the PostgreSQL feature set

The second part of this article examines what you need to get started using Theon. The final part will demonstrate managing a database in Theon.