Sunday, December 12, 2010

Code Invariance and C++ Templates

I have been working to improve compilation times recently using all the usual techniques: interfaces, pointer to implementation (PIMPL) wrappers, forward declarations, and so on.
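
As a reference point, here is a minimal sketch of the PIMPL idiom as I apply it; Widget and WidgetImpl are stand-in names for illustration:

// widget.h -- clients of Widget see no implementation headers at all
class WidgetImpl;                 // a forward declaration is enough here

class Widget {
public:
    Widget();
    ~Widget();
    void draw();
private:
    WidgetImpl *m_impl;           // the heavy includes hide in widget.cpp
};

// widget.cpp -- the only file that pays for the implementation headers
#include "widget.h"
#include "widget_impl.h"          // hypothetical heavy header

Widget::Widget() : m_impl(new WidgetImpl) {}
Widget::~Widget() { delete m_impl; }
void Widget::draw() { m_impl->draw(); }

Files that include widget.h no longer recompile when widget_impl.h changes; only widget.cpp does.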

In doing this my biggest target is the boost libraries. These are orders of magnitude larger than you might think. In analyzing the headers that a typical boost header pulls in, I've found them to be 2-10 times the size of a typical STL header.

The STL headers are already bloated. For example, <string> typically runs to about 30,000 lines.

First of all, std::string is really an instantiation of a template, std::basic_string, which produces the two common variants std::string and std::wstring. While it could produce more, I can't think of any others that I would typically need or want.

Design Guideline:

Don't use templates when a non-template implementation is just as good or better.

If you look at the std::string API you discover that you should be able to prototype it in less than 200 lines of code. There are performance arguments that make inlining it a nice option.

However, most of the time I don't need the performance benefits. It is also very difficult to forward declare a string, and the STL implementors don't provide us with a declaration that we can use. They provide <iosfwd>, but no other forward declarations for any of the container classes.
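
To see why it is difficult, here is roughly what a hand-rolled forward declaration has to contain, because std::string is really a typedef for a template specialization. Adding declarations to namespace std like this is technically not allowed, and implementations are free to use internal versioned namespaces that would break it, which is exactly why the implementors need to ship such a header for us:

namespace std {
    template<class T> class allocator;
    template<class C> struct char_traits;
    template<class C, class T, class A> class basic_string;
    typedef basic_string<char, char_traits<char>, allocator<char> > string;
}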

The result is that any file that uses a string, even a 10-line file, effectively becomes a 30,000-line file.

This is a huge price to pay when the code it generates is invariant. Now I know that compilers are smart and can cache this, and some do a pretty good job. And computers are a lot faster today. But when I end up spending several hours a day waiting for compiles, productivity suffers.

The point here is that templates are effectively programs that generate code. That code only needs to change when the template changes, and the generation only needs to be done once for each combination of types that you use.

With the design of C++ this is instead done once for each compilation unit, and redone any time a compilation unit is invalidated. This leads to an O(N^2) amount of work on top of needing to parse thousands of lines of code.

There are other invariants in compilation as well. In fact any code that hasn't changed and won't generate different machine code is invariant. Compilers and programmers move invariant code out of repetitive tasks.

We need a compiler design that does the same thing.

Today we see typical compilation times of minutes and hours when in reality they could often be milliseconds. Currently, if I change one character in a header that is included by all of my code, then all of it needs to be recompiled. If that character results in a one-byte difference in the generated machine code, the rebuild should be nearly instantaneous.

Effectively this is the same problem as we used to have with typewriters. If I add 1 character I may need to retype the whole page. If I add a paragraph I may need to retype all the following pages. This is why we used to have many secretaries in businesses. Now we have word processors and we are far more efficient.

Language designers need to apply the lessons of complexity to the design of the language. Library writers need to apply the lessons of complexity to the design of libraries.

A small amount of work on the part of the STL library designers would allow us to have massive savings in the time it takes to compile.

The C++ language already gives us options for making the use of templates less invariant.
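
One mechanism is explicit instantiation: generate the template's code once, in a single compilation unit, and suppress the generation everywhere else. A sketch, with the caveat that extern template is technically a C++0x feature (though several current compilers already accept it as an extension), and that whether you may do this for std::string depends on your implementation, since many standard libraries already instantiate basic_string<char> themselves; for your own templates the pattern is straightforwardly legal:

// string_code.cpp -- the one place the template code gets generated
#include <string>
template class std::basic_string<char>;

// everywhere else -- use std::string freely, but don't generate its
// code in this compilation unit
#include <string>
extern template class std::basic_string<char>;

This removes the redundant code generation, but every compilation unit still parses the full 30,000-line header, which is why the headers themselves also need to change.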

Let's take std::string as an example. While I don't think std::string as a template is even marginally a good idea, let's assume for a minute that it is. To give C++ users the most flexibility we need the following:

<string_fwd> - Provides a forward declaration of std::string in as few lines as possible so it can be used to reduce the header load in header files that only need a forward declaration.

<string_proto> - Provides the prototype for std::string, without implementing any methods, in as few lines of code as possible. This lets me use std::string in compilation units without paying the compile-time price of compiling all of string's methods, optimizing their inline use, and emitting the template code.

<string> - Provides the full std::string implementation as it stands now so that I can use it in cases where performance is critical and can also explicitly instantiate it where needed.

This pattern, applied to all of the STL, would be easy to implement and would give us far more control over how long compiles take.
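
To make the proposal concrete, here is a sketch of how the three headers might be used; the header names are the hypothetical ones proposed above and exist in no implementation today:

// person.h -- only names std::string, so the cheap header is enough
#include <string_fwd>             // hypothetical proposed header

class Person {
public:
    const std::string &name() const;
private:
    std::string *m_name;          // declaration-only uses are fine
};

// person.cpp -- calls string methods, so it needs the prototypes; the
// generated template code would live in the library, instantiated once
#include <string_proto>           // hypothetical proposed header
#include "person.h"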

While I suppose I could create a wrapper class for std::string, std::vector and the other containers are much more compelling as templates. I have been playing with writing my own forward declarations, and that seems feasible, but it carries some extra maintenance burden.

With any luck someone will start listening and fix some of these issues.

Boost Filesystem Version 3

I've been looking at the boost filesystem as a possible replacement for file manipulation and directory parsing.

In doing so I discovered that the comparison operator for boost::filesystem3::path is horribly slow. I'm not entirely sure why; it may be the iterator.

In doing this I implemented a simplistic wrapper to avoid the massive include load. The filesystem version 3 path.hpp header looks like it pulls in well over 100,000 lines of headers. My wrapper runs only about 30,000 lines, almost all of which is because I'm including <string>.

In my operator< comparison function I replaced path's compare with a .native() string comparison and found what appears to be about a 10X improvement in performance on the Mac.
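
For reference, a sketch of the replacement, assuming the version 3 path API. Comparing the native strings directly skips the element-by-element walk that path's own comparison performs, so it may order some equivalent paths differently and is only appropriate where any consistent ordering will do, such as a std::map key:

#include <boost/filesystem/path.hpp>

inline bool pathLess(const boost::filesystem::path &lhs,
                     const boost::filesystem::path &rhs)
{
    // one flat string comparison instead of an iterator walk
    return lhs.native() < rhs.native();
}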

The more I dig into boost the more I'm convinced that it must be used with extreme caution.

Saturday, December 4, 2010

The Include Model: Yet Another Bad Idea

It's probably been at least 20 years now since I first recognized that the include model for programming languages was a bad idea. Others have been saying this for much longer.

First let me explain the model. The idea is that you have a set of items that you want to be the same in many contexts. For example, in C++ you have class prototypes, defines, and so on. In HTML you may have style sheets and other shared items.

So what happens is you code up these items in a file that gets included at the point they are needed. You write it once and then use it many times. So far so good.

Now the not so obvious problem is that each include may include several others. And those in turn may include more. The result is an explosion of code. In C++ we put include guards in to prevent multiple inclusion and this helps a lot, but even so a single include can lead to thousands of lines of code.
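
For those who haven't seen the idiom spelled out, an include guard looks like this (MY_WIDGET_H is a hypothetical guard macro). It prevents the body from being compiled twice, but the preprocessor still has to open and scan the file on every inclusion:

#ifndef MY_WIDGET_H
#define MY_WIDGET_H

class Widget { /* ... */ };

#endif // MY_WIDGET_H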

In the C++ standard template library the inclusion of something like <vector> is about 20,000 lines of code.

So why is it that we keep making the same language design mistake?

Probably because the model is so simple.

The result is that we can end up with compilers taking several minutes or even hours to compile code.

Why should it take more than a few milliseconds to update my executable when all I did was change one line of code?

My challenge to the C++ and other standardization committees is to once and for all expunge the idea of an inclusion model for the language from all standards.

In reality all that is needed is to move to a database model of compilation. Conversion from the current text-based model would be required, but compilers already do much of this work internally.

I'm sure there are challenges, like what to do with preprocessing, but those can be overcome.

While I'm not an expert on compilers, I do know something about them, having written a couple in my career. I've also talked to people who write compilers who agree that this is feasible; in fact, behind the scenes a lot of what is already done is very close to what we need.

All that we are missing is for the standardization committees to move forward.

Let's not waste time on more language niceties until we fix the fundamental flaws.

The include model is a fatal fundamental flaw of many languages that collectively costs software companies billions of dollars in lost productivity.

As I write this I am also reminded that char * is another bad idea that came from the same guys who brought us #include. char * as a C and C++ language construct is also responsible for billions of dollars of lost productivity over the years. Think about how much time used to be wasted looking for that off-by-one error, or the security holes caused by buffer overruns in char * arrays...
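
The canonical example, for anyone who hasn't spent a night hunting one of these:

#include <cstring>

void overrun()
{
    char buf[5];
    std::strcpy(buf, "hello");  // off-by-one: "hello" needs 6 bytes
                                // counting the terminating '\0', so this
                                // silently writes past the end of buf
}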

It is unfortunate that some of these ideas don't get fixed sooner. I sometimes wonder if the language creators are writing code. If they were it would seem like they would be a lot more motivated to fix these issues.

I would love to hear some counter point arguments as to why the include model is still with us.

Unexpected Technology

It is the technology that is unanticipated that excites us the most. For example we have recently seen the ability to scan credit cards on a smartphone. This is probably something that many have thought of over the years, but it is something that I personally didn’t think of before it happened.

When people see technology that they didn’t anticipate it’s like a magic trick where the outcome is a total surprise to the audience.

The real trick is coming up with a bit of technological magic that is unanticipated.

Blindly Following the Advice of Respected Writers

Over the last couple of weeks I have had two occasions where I found fatally flawed arguments in the writings of very respected writers in the C++ community.

In both cases my gut told me when I first read these things that they were wrong. In one case I and others took the advice and ran with it for several years only to be burned later on by what should have been obvious.

Case 1: In his book Effective STL, Scott Meyers advises using std::for_each, contending that it is faster, less prone to error, and more readable. However, after following this advice and seeing how these things are implemented, especially in conjunction with boost::bind or tr1::bind, it is clear that all three arguments are simply wrong.

First, it is not faster. If you analyze how an STL iterator loop compiles, you find that iterators are in fact almost identical to the fastest possible hand-coded C++ loop that you can write. In addition, he contends that the implementors may not write the fastest possible iterators, and then in almost the same breath says that the implementors know more about their implementation than we do, so they can write a faster for_each.

To which I say: if they can't write a fast iterator, what makes you think they will write a faster for_each?

In analyzing the for_each implementations, I found that they are a thin wrapper over a for loop, but with a side effect: end() is called only once, before the loop. Most of the time this isn't an issue, but sometimes it is.
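
A sketch of what a typical implementation amounts to; the standard effectively mandates this shape, with the end iterator evaluated by the caller exactly once before the loop starts:

template<class InputIt, class Func>
Func my_for_each(InputIt first, InputIt last, Func f)
{
    for (; first != last; ++first)
        f(*first);
    return f;
}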

The second contention is that it is less prone to error. Unfortunately, this is simply not true. The syntax is harder to decipher, and because of that it is very easy to create a functor that does extra copies, so in the end it is also slower.

Finally, he contends that it is more readable, which in my experience is clearly false. Stuffing too much information onto one line is bad.

Consider

for(int i = 0; i < n; i++) { temp[i] = b[i]; b[i] = a[i]; a[i] = temp[i]; }

vs.

for(int i = 0; i < n; i++) {
    temp[i] = b[i];
    b[i] = a[i];
    a[i] = temp[i];
}

Which is easier to read? To me the second is the clear choice, and most people I know would agree.

Stuffing this into a std::for_each would mean writing a functor or reaching for something like a lambda library. However, the more obvious way to make this more readable is to write a function, as sketched below.
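
That is, extract the body into a named function and keep the plain loop; swapElements is a hypothetical name:

void swapElements(int a[], int b[], int temp[], int n)
{
    for (int i = 0; i < n; i++) {
        temp[i] = b[i];
        b[i] = a[i];
        a[i] = temp[i];
    }
}

The call site then reads as one self-describing line, with no functor in sight.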

So in the end in the light of day the analysis, my gut, and actual experience says that it is bad advice.

Case 2: In the book C++ Coding Standards, the authors state that you shouldn't have a standard for the placement of things that don't affect the readability of the code, such as curly braces. The interesting thing is that, if you read carefully, they also state that they chose a consistent formatting for the book itself to enhance readability, which on its own should be enough to call their logic into question.

They also suggest that you be consistent and that you use the standard in the file you are in.

The problem is that if you work with a team of people in C++ you are often adding virtual functions to a class hierarchy, which may be a symptom of bad design, but is more often than not just part of extending the capability of the software. In doing this you will often be working across a class hierarchy that is maintained by more than one person.

This leads to a clear problem. Let's say I am adding a virtual function. Do I write it like this:

bool T::supportsMyNewFeature() const {
    return true;
}


or

bool T::supportsMyNewFeature() const
{
    return true;
}

If one file uses one style of curly brace formatting and another file uses the other, then I'm constantly switching between styles depending on the file.

If I impose my own style instead of following each file's, files quickly become inconsistent, and once there is no file-specific style left, the only reasonable option is to default to the team standard.

If I apply the team standard, files that don't yet follow it are briefly inconsistent, but they converge toward the standard over time.

Either way the reasoning ends in the same place: use the team standard, and there is no problem.

There are cases where imposing a team style may not be needed, such as when one person is responsible for an entire module. While this works, over time that module will eventually be taken over by someone else, or by several people, who may not share your view on formatting. They either have to adopt your formatting or reformat to theirs or to the team standard.

Team standards make these decisions easier.

So back to the main point of this post: don't believe everything you read. The writers of these books are very smart people, but they do make mistakes. Some of the mistakes are obvious when you examine them. Others may not be so obvious and can lead you down a path that causes problems. When you realize you have been misled, it is time to let everyone know and change how you do things.

BOOST_FOREACH: A Bad Implementation of a Good Idea

I am more and more often running across constructs, especially from boost, that I find to be intuitively poor.

BOOST_FOREACH is the latest one that I've looked at.

First let me state that I like the concept of for each as implemented in many languages. C++ doesn't happen to implement it, so some really smart people decided it would be cool to create something with the expressive niceties of for each.

One result is BOOST_FOREACH. However, the include file runs to over 67,000 lines of code, and in my opinion the readability is worse.

Why?

Consider the following snippet.

BOOST_FOREACH(T & a, myClass.getAVector()) {
    foo(a);
}

Here is another way to write it:

std::vector<T> &aVec = myClass.getAVector();
for(std::vector<T>::iterator it = aVec.begin(); it != aVec.end(); ++it) {
    foo(*it);
}

And another which is more like std::for_each:

std::vector<T> &aVec = myClass.getAVector();
std::vector<T>::iterator begin = aVec.begin();
std::vector<T>::iterator end = aVec.end();
for(std::vector<T>::iterator it = begin; it != end; ++it) {
    foo(*it);
}

Which is easier to understand?

Is the invariant call to getAVector() hoisted out of the loop?

Is end() called on every iteration, or once at the beginning like std::for_each?

Did you know that std::for_each behaves like my last example, not the middle one?

If there is a side-effect of calling getAVector() what is the behavior?

What happens if the size of the container changes during the loop?

Is the expanded macro efficient?

Can everyone on your team answer these questions?

Can new developers that may come into your team answer these questions?

Should we spend 67,000 lines of extra code for every compilation unit that uses BOOST_FOREACH to save a few dozen characters?

My conclusion is that BOOST_FOREACH is not more readable and the cost is not worth the header load even if it were.

I will not be using it.

Boost: Not Ready for Professional Projects?

Over the last several weeks I've been doing some general code maintenance, a necessary part of software development. One of the problems with using C++, or any include-based language for that matter, is the geometric growth in the number of lines of code that can potentially get included.

If you have an include file that includes 2 other files and those in turn include 2 others you can see that very quickly you can get to a point where the total number of lines of code included runs into the hundreds of thousands of lines.

I got curious about just how big a hit this was and decided to write a simple include file analyzer. My analyzer is pretty brain dead: all it looks for is #include statements. It doesn't respect the preprocessor, so the numbers it spits back are pretty much worst case, but it does give some very useful results.
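
A minimal sketch of that kind of analyzer, under the stated assumptions: it ignores the preprocessor entirely and does not resolve include search paths (it treats each include name as a relative path), so its totals are a worst case:

#include <fstream>
#include <set>
#include <string>

// Count the lines a file pulls in, recursing into anything that looks
// like an #include. The 'seen' set crudely models include guards by
// counting each header only once.
static long countIncludedLines(const std::string &file,
                               std::set<std::string> &seen)
{
    if (!seen.insert(file).second)
        return 0;                       // already counted this header
    std::ifstream in(file.c_str());
    if (!in)
        return 0;                       // not found on our (naive) path
    long lines = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++lines;
        std::string::size_type pos = line.find("#include");
        if (pos == std::string::npos)
            continue;
        // pull out the name between <...> or "..."
        std::string::size_type open = line.find_first_of("<\"", pos);
        if (open == std::string::npos)
            continue;
        std::string::size_type close = line.find_first_of(">\"", open + 1);
        if (close != std::string::npos)
            lines += countIncludedLines(
                line.substr(open + 1, close - open - 1), seen);
    }
    return lines;
}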

For example, many of the STL containers, like map and vector, tend to run about 20,000 lines and a couple of hundred includes. This was surprisingly large to me. The good news is that there is a lot of overlap, so if I include both map and vector it's only incrementally larger.

In our work we use many of the boost constructs, like boost::shared_ptr. These are some very nice and useful constructs that are an important conceptual addition to the language.

When I analyzed the boost headers I was very unpleasantly surprised at how big some things were. For example the boost shared_ptr was running about 200,000 lines. Of course some of this is platform specific code, but the point is that it is a massive hit.

As a sanity check one of my colleagues ran all of our code through the preprocessors and dumped the output into a single file. It was about 25 gigabytes of data!

While it is obvious that we should be aware of these issues, in general it is really easy to miss the cost of including header files.

Boost is providing a great service to the C++ community, but they too have fallen victim to the geometric growth of include based libraries.

Since shared_ptr is now part of tr1, we have moved over to using that. In analyzing the tr1 memory header's inclusion load, it is an order of magnitude smaller than the boost implementation.
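
The change itself is small. The spelling varied by compiler in this era (gcc wants <tr1/memory>, while Visual C++ puts the tr1 types in <memory>), so this is the gcc form:

#include <tr1/memory>

struct Widget {};                       // stand-in type for illustration

std::tr1::shared_ptr<Widget> w(new Widget());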

I write this not to say that boost is bad, but to say that you need to be aware of the cost of using it. To that end I suggest the following:

  • Don’t include any boost header in any commonly used header file. Use the Pointer to Implementation (PIMPL) idiom, forward declarations or simply don’t use boost.
  • Prefer stl solutions, although you still have a massively large header file load. But 20,000 lines is nothing compared to 200,000 lines.
  • Always check the cost of our includes. It is often a bit like boiling a frog in water in that a single change is hardly noticeable. But then one day you look up and your compile time is 30 minutes.

In any case it is my hope that the writers of boost take a look at putting it on a diet. It has become a massively bloated beast.

Saturday, October 16, 2010

Problem Solving Skills Revisited

Over the past several years I have been contemplating the idea that one needs to have good problem solving skills to be a good software engineer. I happen to agree with this assessment and work pretty hard to find engineers that are good at solving problems.

One way to do this is to ask them to solve certain programming problems, such as reversing a string, or writing a recursive or iterative factorial function.
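
For example, the factorial question in both forms; a minimal sketch that ignores overflow:

// recursive form
long factorial(int n)
{
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// iterative form
long factorialIter(int n)
{
    long result = 1;
    for (int i = 2; i <= n; ++i)
        result *= i;
    return result;
}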

We also ask less well defined problem solving questions, such as how many gas stations are there in Seattle. All of these questions are designed to figure out how a candidate approaches problem solving.

This works really well, but while solving problems is a great skill to have, most software suffers from a set of problems that are conceptually trivial to solve.

These problems mostly fall into the category of poor organization. Highly creative people are often horrible about organization; while they know intellectually how to organize, they don't.

In some cases this leads to code that is highly disorganized which becomes a big problem to maintain.

It is my opinion that a software team needs a variety of skill sets. One of which is the organizer. These are people who love organizing code to make it more readable.

They understand well what makes a class hierarchy good or poor, and they like to improve the code by massaging it through refactoring. Much of the time these types of tasks require 1% inspiration and 99% perspiration.

The skill set of this type of engineer includes the ability to stick with a repetitive task for a long period of time, enjoyment of making the code better, reasonable problem solving skills, and the ability to leverage the rest of the team to figure out the best way to solve problems.

In addition, the manager of the team needs to be a good politician/salesman who gets the rest of the team on board with all the tasks that the organizer does. One thing I find frustrating is how often a small percentage of engineers have a very strong affinity to a particular code formatting style. Selling these engineers on accepting the team formatting style is very important.

The organizer must also support and understand well the team formatting style as they will be imposing this on the code as they go through it. If they disagree with the team standard and don't follow it they become useless as an organizer.

While it is nice to find a highly skilled problem solver who is also an organizer, these two skills often don't appear in the same individual, and there is no need for exceptional problem solving skills to organize code.

When building a team you should try to get a good mix of skill sets. I'm not entirely sure how many organizers you need on your team, but I'm inclined to think that you need at least as many organizers as problem solvers.

What do Computer Scientists Do?

Over the past couple of weeks I have had the opportunity to talk to some computer science students as well as a number of professors and others involved in a Computer Science (CS) education. This topic came up several times.

In one context the issue was that some students get into a CS program based on some vision of what a computer scientist does, only to find out that it is quite different. The result is that these students drop out of computer science.

While learning that CS is not for you early is great, when we look at this problem from the flip side this may mean that many who would like to study CS think it is something else and thus don't even consider CS as an educational choice.

In recruiting interns I ask what they expect to get out of an internship. Often the answer is that they want to find out what it is like to work in industry.

The answer to this question is that it depends. There are many routes that one can take with a CS degree.

One can variously be a software engineer, do research, teach, be a software test engineer, be a user interface designer, etc.

Since my expertise is primarily in the area of software engineering I will limit this discussion to that area.

In a nutshell a software engineer is responsible for creating a software solution to solve a customer problem.

An example of this is the blog you are reading now. The problem being solved is to provide a simple way of posting a blog. A collection of tools that allow editing of the blog, the style, etc. have been created that make this easier than writing straight html.

In software engineering one is expected to become an expert in at least two knowledge domains. The primary domain is that of designing, writing and testing software. The secondary domain is in the problem that you are trying to solve.

For example at one point in my career I worked on printing for a CAD program. To do printing correctly I had to learn about many things including paper sizes. I ended up learning more about paper sizes for CAD than pretty much any of our customers that use the program.

Often software engineers don't spend enough time in the secondary domain learning what their customer needs. This can lead to poorly designed solutions to the customer problem.

Let's take an aside here and talk about customer need vs. want. It is a classic mistake to give a customer exactly what they ask for. The problem is this: the customer is not an expert in the software domain. Thus when they ask for something, they are asking from the standpoint of their limited knowledge of software.

By taking the time to really learn what problem they are trying to solve, you as a software engineer become an expert in that problem and can apply your software expertise to come up with a solution the customer may not have thought possible, one far better than anything they could have suggested.

On the flip side one shouldn't discount what the customer is asking for as it may be that their idea is better than what you would come up with.

Software is tightly coupled to technology. Often it is the technology that attracts students to CS. As a software engineer it is your responsibility to match the right technology to solving the customer problem.

There can be many competing technologies. For example there are many programming languages such as Java, C, C++, C#, Perl, Python, Ruby, Basic, Fortran, Cobol, Lisp, and dozens of others.

Choosing the right technology can be tough. That is where continuous education in what is available is important. It is also true that in some cases there is no best choice, or that there are many good choices. In some cases the right choice may be to use the technology you are most familiar with as that will allow you to finish the task soonest.

In other cases it may mean you need to learn something new. For example in the last year I've spent time learning more about XML, .NET, Qt, Ruby, CMAKE, and several other technologies.

Over time it becomes more and more difficult to be an expert in various technologies so often software engineers specialize in certain areas. For example I've invested a great deal of time in learning how to program in C++. While I can program in many other languages I spend most of my time using C++. As such it becomes the language I gravitate to when there is no clearly better choice.

Once you have made your choice of technologies to solve the problem, you get into the real grunt work: writing the software. This is typically where most engineers spend the majority of their time. The decisions you make in choosing the technology can be crucial to how hard or easy this is, but ultimately it comes down to making it all work.

Being able to write bug-free, maintainable software is a skill that requires a lot of effort. Some who can make good decisions about hardware or solve problems well don't have the attention to detail that is required to be good at this.

Some people view software as a means to an end. They think that it doesn't matter what it looks like so long as it gets the job done. These are the guys who would be happy with a sports car that has a big engine, great suspension, and a great transmission, but a body full of dents, bondo, and rust.

The code you write should be a work of art. Most people who drive a car don't really care what the engine looks like so long as it is reliable and does what they expect. To a car enthusiast what the engine looks like is as important as the paint job, interior, etc.

One way to put this is that the ugly of software will eventually become visible to the customer. Usually this is in the form of bugs. Pride of workmanship, attention to detail, continuous improvement to the code are hallmarks of what a software engineer does.

This can be as trivial as formatting the code in a consistent way or abiding by a consistent naming convention or as complex as sweating the details in a very complicated algorithm.

Another aspect of software engineering is economics. As a software engineer you are expected to keep the bottom line in mind. Unless you are independently wealthy and can do whatever you like, eventually you need to make some money. Whether you are working for yourself or a company, the bottom line is important: the work you do needs to pay for the time you invest in it.

This is where you need to become good at time management. People who are poor at time management make poor software engineers, or they require a lot more handholding from their managers. Managers generally don't like doing this, so you need to be good at time management.

You need to know how long it takes to do stuff. I have found that most people tend to be horrible at this. I have lost count of the number of times that I have underestimated the time it will take to do a task. What this means is that you need to account for this tendency when you estimate how long it will take to do things. Here are some of my thoughts on time estimation.

There is obviously a lot more to software engineering than I can cover in this post, but the basic idea is that software engineering is about communicating between people. Sometimes people picture a software engineer as someone who spends all night programming and is very non-social.

While there is a place for people that fall into this category in software engineering they need a lot of extra direction and management to keep them on track. Often these highly creative people come up with brilliant solutions to problems that no one cares about.

The really good engineers spend as much time working with people as they do with their nose in code. As a software engineer it is your job to bridge the gap between real life and technology.

Tuesday, March 16, 2010

Evolution and Software Engineering, "Survival of the Fittest Code"

The theory of evolution, as postulated by Charles Darwin and further extended by Herbert Spencer, who coined the term "Survival of the Fittest," has an interesting parallel in software development.

Anyone who has worked on any code base for an extended period of time knows that some code is in a constant state of flux. Often that same code is where most of the defects occur as well. At the same time there is code that almost never changes.

In software development methods a lot of emphasis is placed on things such as code reviews, design patterns, requirements, testing, and many other concepts that are supposed to produce good, maintainable code that is relatively defect free. In practice these methods generally don't produce exceptionally better code.

I have seen a lot of well written, well designed code that is easy to read and maintain that is also in a state of flux. Usually this is only for a relatively short time, but sometimes things take a really long time to jell no matter how much good design work was done up front.

I have also seen a lot of poorly written, badly designed code that is hard to read and difficult to maintain that hardly ever changes. And in many cases this code is core to the survival of the software program.

Often very good programmers who take a look at the code label it as "bad code" because it is really hard to understand.

And to a large extent they are correct.

The problem is, it works. It does the job well and it's bulletproof, not by design, but because over time it evolved through many fixes to the point where it works really well.

This code survives. Often much longer than really well written code.

In the end, who is to say that ugly code is bad code? Sure, it may be hard to maintain, but if it does the job it was designed to do and doesn't require much or any maintenance, then maybe it really is good code.

Now I'm not advocating that we write ugly code. However, this points out that the ultimate measure of good software is that it does what it was designed to do.

When we write code we should strive to make it maintainable and readable by others, but in the end if it doesn't do what it is supposed to do it is garbage that will get thrown away or refactored to do the job.

Another concept that "Survival of the Fittest Code" encompasses is that code evolves over time. In other words creating good software can be achieved by constant fixing and massaging of the code until the customer is happy with it. The fewer iterations to achieve this the more successful we will be.

Traditionally we call this the "Test and Fix" paradigm of software development.

When we look at DNA we find that a random or planned breeding produces results that are possibly good and possibly bad. The customer is the environment that the offspring has to live in. If the environment is really harsh to the offspring it may die before it reproduces. If on the other hand the environment is really easy for the offspring to survive in then it will thrive.

I have observed cases where an unintended "bug" became a useful "feature". In some cases the feature was unknown to the engineer, and later on, when they fixed the bug, the customer got really upset that the feature they were using was gone.

This sort of interaction will more than likely result in the reinstatement of the feature. Since the engineer now knows about the use case of the customer they have the opportunity to evolve the software in another direction that the customer likes.

"Survival of the Fittest Code" can be an effective software development strategy.

In fact I believe it may be the single most effective software development strategy.

Just think how much more effective it could be if we combine it with just a little bit of planning and good coding habits.