Saturday, January 8, 2011

Boost Threads and the Data Hiding Zoo

In C++ the concept of data hiding is implemented via private data. This is a fairly nice concept but it is really more like putting the data behind bars and telling the viewer not to touch it.

Very much like visiting animals at a zoo you can look at all the data and so long as you don't touch it everyone is happy.

However, this also means that we have to see the data even when we don't want to. This has a lot of detrimental effects, not the least of which is that it adds to compile times because of C++'s lame include model.

C++ and for that matter pretty much all languages give us the ability to not only hide the data but to make it invisible. The simplest and most effective method is to put the data that you don't need to know about into a DLL and then provide a simple function call to perform the action that you want to make public.

A good example of this is a dialog. Let's assume that you have a class that has an associated dialog that is designed to edit the class. All you need is a simple function call that takes a reference to the data that you want modified. Then the only public interface to the dialog is a plain old function call.
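To make that concrete, here is a minimal sketch of the idea squeezed into one listing. Every name in it (Widget, EditWidget, WidgetDialog and the file names in the comments) is made up for illustration, and the "dialog" just pretends the user pressed OK, since a real one would depend on whatever GUI toolkit you use.

```cpp
#include <string>

// ---- edit_widget.h (hypothetical): everything a caller has to see ----
struct Widget;                      // a forward declaration is enough to declare the function
bool EditWidget(Widget& widget);    // returns true if the user accepted the changes

// ---- widget.h (hypothetical): the data being edited ----
struct Widget {
    std::string name;
    int size = 0;
};

// ---- edit_widget.cpp (hypothetical): the dialog class never leaves this file ----
namespace {

class WidgetDialog {
public:
    explicit WidgetDialog(const Widget& w) : draft_(w) {}

    // Stand-in for showing a real dialog: pretend the user edited
    // both fields and pressed OK.
    bool Run() {
        draft_.name += " (edited)";
        draft_.size += 1;
        return true;
    }

    const Widget& Result() const { return draft_; }

private:
    Widget draft_;                  // working copy, so Cancel leaves the original untouched
};

}  // namespace

bool EditWidget(Widget& widget) {
    WidgetDialog dialog(widget);
    if (!dialog.Run()) {
        return false;               // cancelled: caller's data is unchanged
    }
    widget = dialog.Result();
    return true;
}
```

In a real project the three sections would be separate files, and the file that dispatches dialogs would include only the two-line header at the top. Everything the dialog class drags in stays behind the .cpp (or the DLL boundary).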

In some recent work to reduce the time it takes to compile our program, I have run across cases where the simple function prototype and the class for the dialog are in the same header file. This may seem logical since they are related, but by the time you count up all the lines that the dialog class pulls in from the hundreds of header files it ultimately includes, you can easily have 300,000 lines of included files to get at the one-line prototype that you need.

The problem gets compounded in the file that is used to dispatch dialogs, which can end up including dozens of similar headers. Certainly there is overlap, so in the end the file may only bloat to about half a million lines of code when they are all included. But why pay the cost of half a million lines of code when 100 lines will do?

I also spent part of last week encapsulating the boost thread locks and mutexes which on Windows end up pulling in 400,000 lines of code. I ended up with a nearly identical templated set of classes where the headers run less than 100 lines of code total including everything that they include. This was done by using the pimpl (Pointer to Implementation) idiom.
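The shape of such a wrapper might look roughly like the sketch below. This is not the actual code I wrote: the names Mutex and ScopedLock are illustrative, and std::mutex is used as a stand-in for the boost mutex so that the listing compiles on its own. The point is only that the heavy include is needed by the .cpp section and never by the header section.

```cpp
#include <mutex>   // stand-in for the heavy header; only the .cpp section below needs it

// ---- light_mutex.h (hypothetical): no threading headers included here ----
class Mutex {
public:
    Mutex();
    ~Mutex();
    void lock();
    void unlock();

private:
    Mutex(const Mutex&);             // declared but not defined: non-copyable
    Mutex& operator=(const Mutex&);

    struct Impl;                     // defined only in the .cpp
    Impl* impl_;                     // the "pointer to implementation"
};

// The scoped lock is a template, but it needs nothing beyond lock()/unlock(),
// so it costs the includer only a handful of lines.
template <typename Lockable>
class ScopedLock {
public:
    explicit ScopedLock(Lockable& m) : m_(m) { m_.lock(); }
    ~ScopedLock() { m_.unlock(); }

private:
    ScopedLock(const ScopedLock&);
    ScopedLock& operator=(const ScopedLock&);

    Lockable& m_;
};

// ---- light_mutex.cpp (hypothetical): the only place that sees the real mutex ----
struct Mutex::Impl {
    std::mutex native;
};

Mutex::Mutex() : impl_(new Impl) {}
Mutex::~Mutex() { delete impl_; }
void Mutex::lock() { impl_->native.lock(); }
void Mutex::unlock() { impl_->native.unlock(); }
```

The price is one heap allocation per mutex and an extra pointer hop on each lock and unlock. For most code that is noise compared with what every includer saves at compile time.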

In talking to one of my colleagues I commented that it seemed wholly surprising to me that people as smart as the guys that put together the boost library would impose such a horrendous overhead when it was nearly trivial to avoid it. He commented that this is what you get when you have a header only library. To which I replied, but boost threads requires that you link against a DLL.

The absurdity of a difference of nearly four orders of magnitude in the number of lines of code needed to implement a simple concept makes me wonder what the implementors were thinking. I can understand an inexperienced or even a somewhat experienced person making this sort of rookie blunder, but these are supposed to be experts.

I am becoming more and more conscious of the fact that many so-called experts in the C++ community are really not. They are really smart, but that doesn't make them experts. That just makes them dangerous.

While there are some really good, top-notch open source software projects out there, the vast majority, including boost, are putting out average or below average software.

The same can be said about the standard template library. While many implementations of the STL are good, I have yet to find one that includes less than 20,000 lines of code for something as simple as a string. Ask yourself why std::string has to be a template. The correct answer is that there is no good reason.

Template library creators have lost sight of, or never grasped, the basic programming concepts of data hiding, data invisibility, and the proper design of a library.

Rather than hide the implementation in a .cpp file they expose everything to your compilation unit through header includes and force the compiler to work excessively hard to compile the same thing over and over again.

Rather than give us the option of using the explicit instantiation model and forward declarations of template classes, they force upon us a one-size-fits-all mentality, exposing our compilation unit to tens or hundreds of thousands of lines of code when a couple of dozen is all that is needed.
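For anyone who hasn't used the explicit instantiation model, a minimal sketch looks like this. The Accumulator class and the file names are invented for the example. The header section carries declarations only, and the member definitions plus the explicit instantiations sit in a single .cpp that is compiled once.

```cpp
// ---- accumulator.h (hypothetical): declarations only, no member bodies ----
template <typename T>
class Accumulator {
public:
    Accumulator();
    void add(T value);
    T total() const;

private:
    T total_;
};

// ---- accumulator.cpp (hypothetical): bodies are compiled exactly once ----
template <typename T>
Accumulator<T>::Accumulator() : total_() {}

template <typename T>
void Accumulator<T>::add(T value) { total_ += value; }

template <typename T>
T Accumulator<T>::total() const { return total_; }

// Explicit instantiation definitions: the library author decides which
// types are supported, and the linker resolves every other file's uses
// against these.
template class Accumulator<int>;
template class Accumulator<double>;
```

The trade-off is that a type not listed at the bottom, say Accumulator<long>, compiles against the header but fails at link time. For something like a string class, where almost everyone wants the same one or two instantiations, that seems like a fair deal.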

I don't get it. Why isn't the C++ community up in arms about this irresponsible imposition of insanely heavy compilation costs?

I suppose part of the problem is that adding 1 more straw to the pile isn't noticeable. Eventually you break the back of the proverbial camel or boil the proverbial frog alive.

The problem is that the library creators aren't trying to use their libraries in real world applications. Their trivial test applications compile in seconds and so they don't notice and don't have to pay the cost of the fatal flaw in their designs.

Demand better. If the "experts" aren't going to make the data invisible then we are forced to do it. That adds a little work for all of us that could be done once.
