In December last year, I talked a bit about the issues affecting the OpenLDAP proxycaching project. This post gives an update on the current situation.
Testing and Stability
As a result of problems with slapd crashing, we turned on debugging in the LCFG openldap component in mid-December so that a core file would be generated for each crash, giving us an opportunity to debug the problem(s). We have now accumulated 18 core files. This compares with approximately 4 crashes in other labs not running proxycaching, so clearly the proxycaching clients are less stable than our standard DICE clients (which use our own slaprepl technology to sync with the master). Unfortunately, the core files do not point to a single problem or bug: they are almost all different, with no backtrace appearing more than twice. There is a suspicion that memory corruption may be occurring in some cases. It might also be worth building against Heimdal Kerberos rather than MIT Kerberos to see whether that has any effect, as some core files appear to point in this direction. However, it is highly debatable whether continuing to debug these core files has any real benefit, given (a) the time needed to do so and (b) the fact that the OpenLDAP developers are now focusing on version 2.4, which brings us to the choice of version.
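To illustrate what "no backtrace appearing more than twice" means in practice, here is a minimal sketch of how a set of backtraces could be grouped by their top few frames. It is not the tooling we actually used. It assumes each backtrace is plain text in the usual gdb `bt` layout (`#N  [0xADDR in] function (args) ...`), and the names `signature` and `group_backtraces` are made up for this example.

```python
import re
from collections import Counter

# Assumed gdb "bt" frame layout: "#N  [0xADDR in] function (args) ..."
# Only the function name is captured, so differing addresses do not matter.
FRAME_RE = re.compile(
    r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)\s*\(",
    re.MULTILINE,
)


def signature(backtrace, depth=3):
    """Return the innermost `depth` function names of one backtrace."""
    return tuple(FRAME_RE.findall(backtrace)[:depth])


def group_backtraces(backtraces, depth=3):
    """Count how many backtraces share the same top-of-stack signature."""
    return Counter(signature(bt, depth) for bt in backtraces)
```

Calling `group_backtraces(list_of_backtrace_texts).most_common()` lists the most frequent signatures first. A count of 1 or 2 for every signature would match the situation described above, where no single bug stands out.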
2.3 vs 2.4
At the December development meeting, it was thought that sticking with version 2.3.43 of OpenLDAP would be the best approach for the moment. However, the lack of stability outlined above means that we would not be happy implementing this solution. A previous drawback of 2.4, described in ITS#5756, has now been fixed, making 2.4 a more attractive candidate.
Following some discussion, we now believe that the best approach is to increase testing of proxycaching using OpenLDAP 2.4 (current version is 2.4.13, with 2.4.14 imminent) to see if it suffers the same stability problems as found in 2.3. This will inevitably delay the project, but increasingly seems to be the right course to take. The urgent need for this project, as shown a couple of years ago when typically 20 machines a day would suffer LDAP replication failures, has disappeared (most likely because of increased memory in commodity hardware). It would probably be sensible either to put the project into a stalled state, or to set a deadline a few months away, so that 2.4 can be properly evaluated.