Semantic data types

Back in 2005 Joel Spolsky wrote a blog post about Making Wrong Code Look Wrong.

Joel’s basic argument is that programmers should look for conventions that make incorrect code stand out, or look wrong. He goes on to argue that Applications Hungarian notation is a good thing, as it conveys semantic information about the variable.

As an example he used the Microsoft Word code base, where the developers had added prefixes to variables, such as xl for horizontal coordinates relative to the layout and cb for a count of bytes. This trivial example makes a lot of sense: it is easy to see that xl = cb should never be allowed to happen.

With a more complex expression, reading the code takes more thought, becomes less clear, and requires memorising a growing number of prefixes. Seeing code such as (pcKesselRun/msFoo)/msBar could quite conceivably lead to a thought process such as: Looks right, but why is Kessel Run in parsecs? Is that a typo? I can’t remember what ms is. I assume it is milliseconds, but I am sure someone used it as meters/second somewhere.

With more modern languages shouldn’t there be a better way?

Types to the rescue

The concept of Applications Hungarian notation surfaced (according to Wikipedia) around 1978, when fewer languages allowed user-defined types. Modern languages with better type systems allow concise definitions of types and operators. Code then doesn’t just look wrong; it fails either to compile or to run.

Once the type system is used to represent the semantic type of a variable, much of the ambiguity and chance of making mistakes goes away. When everything is treated as a real number, type conversions are almost always valid:

(pcKesselRun/msFoo)/msBar => (real/real)/real => real/real => real

By contrast, using data types that represent the semantic type, with limited possibilities for conversion, gives more automatic checking of assignments and conversions. The result type of the expression would then be acceleration:

(pcKesselRun/msFoo)/msBar => (parsec/millisecond)/millisecond => velocity/millisecond => acceleration

This means that rather than relying on humans to see the wrongness in code, the compiler does it for us.
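The same reduction can be sketched in Python. This is only an illustration: the Distance, Time, Velocity and Acceleration classes are hypothetical names, not part of any library, and the units are simplified to meters and seconds rather than parsecs and milliseconds. Each type defines division only for the operand that makes physical sense, so a nonsensical expression raises a TypeError instead of quietly producing a number. Being Python, the check happens when the expression runs rather than at compile time.

```python
class Time:
    def __init__(self, seconds):
        self.seconds = seconds


class Acceleration:
    def __init__(self, meters_per_second_squared):
        self.meters_per_second_squared = meters_per_second_squared


class Velocity:
    def __init__(self, meters_per_second):
        self.meters_per_second = meters_per_second

    def __truediv__(self, other):
        # velocity / time => acceleration; any other operand is rejected.
        if not isinstance(other, Time):
            return NotImplemented
        return Acceleration(self.meters_per_second / other.seconds)


class Distance:
    def __init__(self, meters):
        self.meters = meters

    def __truediv__(self, other):
        # distance / time => velocity; any other operand is rejected.
        if not isinstance(other, Time):
            return NotImplemented
        return Velocity(self.meters / other.seconds)


kessel_run = Distance(12.0)
foo = Time(2.0)
bar = Time(3.0)

# (distance / time) / time => velocity / time => acceleration
result = (kessel_run / foo) / bar
```

Returning NotImplemented lets Python raise the TypeError itself, so an expression such as (kessel_run / foo) / kessel_run fails at the division that makes no sense.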

Too much hard work?

In small projects I would agree that this is in no way required. At that point the entire project fits in the programmer’s head and there is probably just one programmer working on it.

In languages with adequate type systems, the amount of code needed to define a new type is minimal (excuse my appalling Haskell):

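newtype Velocity = Velocity Double deriving (Eq, Ord, Read, Show)

-- Assumes Time and Acceleration are newtypes like Velocity, that Prelude's
-- (/) is hidden, and that Prelude is also imported qualified as P.
(/) :: Velocity -> Time -> Acceleration
Velocity v / Time t = Acceleration (v P./ t)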

Once a project starts to require interfaces that allow different people’s code to interact, it becomes worth the added complexity. It would allow different teams to use different systems of measurement. Using this sort of system could have avoided the Mars Climate Orbiter crash, which was caused by one team using metric and the other imperial units of measurement.
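As a rough sketch of how a type can let two teams keep their own units, here is a hypothetical Length class in Python. It is not a reconstruction of any spacecraft software; the class and method names are made up for illustration. The value is stored in one canonical unit, and the unit has to be named whenever a value enters or leaves the type.

```python
METERS_PER_FOOT = 0.3048  # exact by definition of the international foot


class Length:
    """A length stored internally in meters, whatever unit it was created from."""

    def __init__(self, meters):
        self._meters = meters

    @classmethod
    def from_meters(cls, meters):
        return cls(meters)

    @classmethod
    def from_feet(cls, feet):
        return cls(feet * METERS_PER_FOOT)

    def to_meters(self):
        return self._meters

    def to_feet(self):
        return self._meters / METERS_PER_FOOT

    def __add__(self, other):
        # Only lengths can be added to lengths; bare numbers are rejected.
        if not isinstance(other, Length):
            return NotImplemented
        return Length(self._meters + other._meters)


# One team works in feet, the other in meters; the type reconciles them.
altitude = Length.from_feet(1000.0) + Length.from_meters(100.0)
```

Because a bare number cannot be added to a Length, a value whose unit was never stated cannot slip across the interface unnoticed.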

Even if it wouldn’t have helped with the Mars Climate Orbiter it would have helped with SDP, where at one point half the system worked in radians and the other half in degrees. Ogre3D solves the problem by having two angle types and taking advantage of the implicit type conversion feature of C++.
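The two-angle-type idea can be sketched in Python as well. This is not Ogre3D's API; the Radians and Degrees classes and the sine helper are hypothetical. Python has no implicit conversions of the C++ kind, so the conversion is an explicit method call here.

```python
import math


class Radians:
    def __init__(self, value):
        self.value = value

    def to_radians(self):
        return self

    def to_degrees(self):
        return Degrees(math.degrees(self.value))


class Degrees:
    def __init__(self, value):
        self.value = value

    def to_radians(self):
        return Radians(math.radians(self.value))

    def to_degrees(self):
        return self


def sine(angle):
    # Accepts either angle type; a bare float has no to_radians and fails loudly.
    return math.sin(angle.to_radians().value)
```

Calling sine(Degrees(90.0)) and sine(Radians(math.pi / 2)) both work, while sine(90.0) raises an AttributeError instead of silently treating 90 as radians.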

But dynamic languages…

It doesn’t matter whether the type system is dynamic or static; it just tends to fail at a different point. Whereas a static language would fail at compile time, a dynamic language would fail at run time. The following hypothetical Python example would fail when the division is executed if the time object doesn’t have a to_seconds method defined.

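class Distance(object):
    def __init__(self, meters):
        # Storing meters is an assumption; the original constructor is elided.
        self.meters = meters

    def to_meters(self):
        return self.meters

    def __truediv__(self, time):
        # Velocity is assumed to be defined elsewhere with this constructor.
        # Raises AttributeError here if time has no to_seconds method.
        return Velocity.from_meters_per_second(self.to_meters() / time.to_seconds())

    __div__ = __truediv__  # Python 2 name for the same operator hook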

With a dynamic language, types are only checked at run time, when the code is executed, as opposed to static languages, in which the types are checked at compile time. A dynamic system that raises an exception is better than one that gives the wrong result, as it is far clearer to the developer that there is a problem and it highlights exactly where the error is. This happening in production still isn’t ideal when the exception would lead to the loss of a $193 million spacecraft. The probability of such errors can be reduced with unit tests that exercise the code paths in use.
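A minimal sketch of such a test, using Python's standard unittest module and assuming Python 3, where the division hook is named __truediv__. The Time and Velocity classes are hypothetical stand-ins filled in just enough for the example to run.

```python
import unittest


class Time:
    def __init__(self, seconds):
        self.seconds = seconds

    def to_seconds(self):
        return self.seconds


class Velocity:
    def __init__(self, meters_per_second):
        self.meters_per_second = meters_per_second

    @classmethod
    def from_meters_per_second(cls, value):
        return cls(value)


class Distance:
    def __init__(self, meters):
        self.meters = meters

    def to_meters(self):
        return self.meters

    def __truediv__(self, time):
        return Velocity.from_meters_per_second(self.to_meters() / time.to_seconds())


class DistanceDivisionTest(unittest.TestCase):
    def test_distance_over_time_is_velocity(self):
        velocity = Distance(100.0) / Time(20.0)
        self.assertIsInstance(velocity, Velocity)
        self.assertEqual(velocity.meters_per_second, 5.0)

    def test_distance_over_bare_number_raises(self):
        # A float has no to_seconds method, so the division fails at run time.
        with self.assertRaises(AttributeError):
            Distance(100.0) / 20.0
```

Saved in a module, the tests can be run with python -m unittest. The second test pins down the failure mode, so the mistake shows up on a developer's machine rather than in production.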

Was Joel wrong?

In Joel’s defence, in a language that is restricted in some way, such as not being able to define new data types, Applications Hungarian notation may prove an improvement. In a language that supports user-defined types, Applications Hungarian should be avoided. Using a type system to the fullest extent shouldn’t just make wrong code look wrong. It should make wrong code noticeably fail, either by not compiling or by raising an exception.