Wednesday, November 30, 2011

Optimizing C++ Build Times

I've been fighting to improve build times in C++ for years, and over that time I've managed to put together a few good practices that help.

1) Use good header inclusion isolation.

Use forward declarations as much as possible.
Use the PIMPL idiom to provide better isolation (see the sketch after this list).
Avoid include-only (header-only) libraries. Some are marginally OK, such as most of the STL. Others, especially certain Boost libraries, can create enormous burdens. Wrap these using PIMPL or a "has-a" relationship instead of inheritance.
Use good encapsulation by splitting your classes into reasonably small DLLs. This reduces link times and helps enforce other good design practices.
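
To make the forward declaration and PIMPL points concrete, here is a minimal sketch. The Widget and Renderer names are made up for illustration; the point is that widget.h pulls in no heavy headers, so every file that includes it compiles quickly and doesn't recompile when the implementation details change.

    // widget.h -- the only header clients ever include; nothing heavy leaks out of it.
    class Renderer;                        // forward declaration instead of #include "renderer.h"

    class Widget
    {
    public:
        Widget();
        ~Widget();
        void Draw(Renderer& renderer);
    private:
        class Impl;                        // implementation details hide behind this pointer
        Impl* m_impl;

        Widget(const Widget&);             // non-copyable, to keep the sketch simple
        Widget& operator=(const Widget&);
    };

    // widget.cpp -- the one place that pays for the heavy headers.
    #include "widget.h"
    #include "renderer.h"                  // heavy header confined to this translation unit

    class Widget::Impl
    {
    public:
        // heavy members (Boost types, large containers, etc.) live here, invisible to clients
        void Draw(Renderer& renderer) { /* real drawing code */ }
    };

    Widget::Widget() : m_impl(new Impl) {}
    Widget::~Widget() { delete m_impl; }
    void Widget::Draw(Renderer& renderer) { m_impl->Draw(renderer); }

Client code only ever includes widget.h, so changes to renderer.h or to Widget::Impl never trigger a rebuild of the clients.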

In theory this should be all you need.

But there are some other things that work well.

2) Use IncrediBuild if you are building under Windows.

There are other distributed build systems as well, but so far I've not had any luck getting them to improve over local builds on today's multi-core hardware.

3) Use multi-core machines.

The more cores the better.

4) Use SSD drives to eliminate disk bottlenecks.

Depending on the amount of code you are compiling, you may not need an SSD drive. However, on my current project I've seen a 30% improvement in build times.

I've also found that using IncrediBuild in combination with an SSD can really improve throughput, allowing you to use many more machines. I've recently run tests with up to 100 agents and still saw significant improvements going from 50 to 100.

5) Figure out what the hardware bottlenecks really are.

Memory speed. This is still a potential bottleneck in today's hardware.

Memory amount. This one is easy: if you typically use more than about 3/4 of your memory while building, get more.

Disk speed. The disk cache can get into a swamped state where it takes as much as several minutes to finish writing all the cached data. This can lead to randomly long link or compilation times. Use SSD drives to take care of this.

Network speed. If you are using distributed builds then go to at least a gigabit network. Beyond that I'm unsure that there would be any gain. Below that it is clearly a bottleneck.

CPU speed. Faster is better unless the bottleneck is memory speed, disk speed, or network speed.

Number of cores. Since compilation can be highly parallel, more is better unless memory speed or disk speed is an issue. Usually more cores means faster memory, so it is normally just disk speed that can slow you down.

6) Turn off active virus scanning.

This should be obvious, but every .obj, .dll, or .exe that you create gets scanned, which slows your compilations considerably.

I find it amazing how often this is overlooked.


Does this work? Yes. I've taken a build that was pushing 15 minutes and reduced it to about 8 using the methods in 1). I've taken a build that was pushing 10 minutes and dropped it to about 7 using an SSD drive. I further cut the same build down to about 4 minutes by taking the number of IncrediBuild clients from 20 to 100 (it was about 5 minutes with 50 clients).

Slow builds are productivity suckers. The slowest build I ever saw was 30 hours. I only did that once. At one point I worked on a project that went from 5 minutes to 1 1/2 hours in a year.

Somewhere not far past 5 minutes you get into a state where flow gets interrupted. If you find your flow getting interrupted often, you will be far less productive. Don't let that happen.

Ideally you should strive to keep your builds under 1 minute. I believe this is possible for typical builds with the right encapsulation of data regardless of the total size of your project. While an entire system built from scratch could still take days, your normal builds should not need to take long. If they do, you are wasting resources.

Sunday, November 27, 2011

What is the Best Software Process?

In order to write good software you need some sort of process. As it turns out, it doesn't really matter much which process you choose, but there are a few key elements to a good process.

1) It must focus on your customer.

The first thing to get right is to solve the problem that is most needed by your customer. This can be a bit hard to determine because in most cases the customer will tell you what they think you can solve rather than what they actually need solved.

This means that sometimes they want you to do the impossible and you need to scale back their expectations, or that they have already scaled back their expectations and what they really need solved is easy to do.

There is a real art in figuring this out but it is mostly about spending enough time understanding what it is that your customer does. This often means you need to learn something new outside the domain of computer science.

To me the chance to learn something new is the best part.

2) The process must be iterative.

No matter how you end up doing things there needs to be some iteration involved. This has led to many different process methodologies. All the best ones use some sort of iteration.

Basically this recognizes the fact that you can't create a great design without writing some code. Eventually you need to rewrite the code to deal with changes. Sometimes you get it right to start with, but usually that is for a trivial problem that you can already write the code for in your head.

3) The process should be collaborative.

Writing code by yourself is OK, but two heads are better than one. Code that is written and maintained by multiple people will get better or worse over time depending on the team.

If it is getting worse then you have a problem with your team. Either you or your teammates need to learn to work better together. Almost always it is you that needs to improve. How can that be, you ask?

Well, what it boils down to is that most software engineers think their code is better than their colleagues'. While that may be true in some cases, it is almost always true that you can learn something from your teammates about writing better code.

Looking for that and adopting the best habits of others will make you a better engineer. Once you get to that point, it is probably true that your teammates need to improve. They will be a lot more likely to adopt your good habits if you have first adopted theirs.

4) The process must include testing.

This is obvious, but there are some good and bad ways of testing. While you should test your own code, it should always be tested by someone else as well. Testing should be independent and done by a team that doesn't also do development.

The testing team should take on a somewhat adversarial role with respect to the code. However, it should be more of a friendly rivalry. The goal of the developer is zero bugs. The goal of the tester is to find many bugs.

Failure to achieve either of these should be a sign that the respective team has room to improve.

5) The process should include documentation.

The first documentation should be written by the software engineer. It should be written (no videos), and it should clearly state what the feature should do (the requirements) and how it works (the specification).

Each of these can be used to produce other documents: a test plan, user documentation, etc.

6) There needs to be accountability.

This can come in many forms. For example, source control maintains a record of who did what and when; a bug tracking database records defects and is used to assign them to the appropriate party; and a schedule gives management, marketing, and other stakeholders a prediction of when they can expect delivery so that they can plan accordingly.

7) Return on Investment Must be Estimated.

Technically this goes under accountability, but it is so often ignored that I call it out specifically.

Software that costs more to develop than it saves should never be started. For example, if it takes $100,000 to automate a shipping process where the savings will only be $20,000 over the next 5 years, then there are probably better things to do.

Not only is this a losing proposition, it is likely that the project will get canceled, making it an even bigger waste of money and time. It will also piss off the engineers who work on it.

If you are an engineer starting work on a project like this then make sure management knows it is a waste of resources. If they don't listen, move on. You don't need the aggravation, and there are plenty of software development shops around that do work on profitable projects.

On the other hand if adding a feature for $100,000 will add $1,000,000 a year in revenue then it is most likely a good thing to work on.

Spend your development resources where they will make you the most money. It's good business and it's good for the team.

8) Focus on the End Game.

Getting from feature completeness to release can often take several weeks. One thing, such as finishing documentation, the web site, marketing materials, etc., can end up taking a surprisingly long time. For a given release date, these needs can move the date you must be feature complete earlier by a month or two.

Gantt charts are very helpful in planning your end game to ensure that everything that depends on the software being complete gets done.

9) Use Technology Only When Appropriate.

There are some technologies that are pretty much a given. For example you need a compiler.

Other things should be a given such as using source control and a bug tracking database.

Then there are technologies that may or may not be helpful. Most things, such as managing requirements, specifications, etc., can be done manually. Tools exist to do these things, but they are not necessary to write software. Use them only if they really help you.

If they are black holes into which information is poured but never used, then get rid of them; they are just taking up time.

10) The Process Must be Simple.

If you can't explain it in 10 minutes to your mother and have her understand it, then it's too complicated.

A complicated process will fail. It won't get buy-in from your team. It will be interpreted differently by different team members. It can take a long time to be fully adopted.

11) Use Iteration to Improve the Process.

With each release you can sit down and examine your process. Figure out what worked well and what didn't work so well. Focus on the areas that didn't work well and figure out how to make them work better.

One thing that happens frequently over time is that as things improve you hit a release where the quality slips. This tends to come from complacency. You have to keep working to improve.

One way to deal with this is to keep raising the bar. Recognize that you can always improve some part of the process, as there is no perfect process.

12) Adapt Process to Your Needs.

Don't just adopt a process. No process is perfect and every team has slightly different needs.

For example, a young inexperienced team can greatly benefit from pair programming, while a highly seasoned team would only be slowed down by it. Some would argue that point with me, but 30 years of development experience tells me otherwise.

Adapt process ideas to your team's needs. Don't just blindly adopt a process, as it won't work if you and your team don't understand why you are doing things in a particular way.

Conclusion

With a little effort you can write good software. With a lot more effort you may be able to write slightly better software. But if it doesn't make your company money, it's a bad process.

The traveling salesman problem is a great example of a problem where achieving the perfect solution can take so much compute time that the salesman could have gone with a quick, less-than-optimal solution and been back from his trip before the computation of the perfect solution is finished.

Don't let your process become such a large cost that you can't get the job done.

Saturday, January 8, 2011

Boost Threads and the Data Hiding Zoo

In C++ the concept of data hiding is implemented via private data. This is a fairly nice concept but it is really more like putting the data behind bars and telling the viewer not to touch it.

Very much like visiting animals at a zoo, you can look at all the data, and so long as you don't touch it everyone is happy.

However, this also means that we have to see the data even when we don't want to. This has a lot of detrimental effects, not the least of which is that it adds to compile times because of C++'s lame include model.

C++, and for that matter pretty much every language, gives us the ability not only to hide the data but to make it invisible. The simplest and most effective method is to put the data that you don't need to know about into a DLL and then provide a simple function call to perform the action that you want to make public.

A good example of this is a dialog. Let's assume that you have a class that has an associated dialog that is designed to edit the class. All you need is a simple function call that takes a reference to the data that you want modified. Then the only public interface to the dialog is a plain old function call.
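
As a sketch of that idea (the Person, PersonDialog, and EditPerson names are invented for illustration, and DoModal/IDOK is just the familiar MFC-style modal call), the entire public interface exported from the DLL is one forward declaration and one function:

    // edit_person.h -- the complete public interface to the dialog.
    class Person;                          // forward declaration is all the caller needs

    bool EditPerson(Person& person);       // shows the dialog; returns true if the user accepted

    // edit_person.cpp -- lives inside the DLL, where the GUI headers are paid for once.
    #include "edit_person.h"
    #include "person.h"
    #include "person_dialog.h"             // the heavy dialog class, invisible to callers

    bool EditPerson(Person& person)
    {
        PersonDialog dialog(person);       // hypothetical dialog class wrapping the GUI toolkit
        return dialog.DoModal() == IDOK;   // modal call; none of the details escape this file
    }

Callers get their one-line prototype without dragging in a single GUI header.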

In some recent work I've been doing to reduce the time it takes to compile our program, I have run across cases where the simple function call prototype and the class for the dialog are in the same header file. This may seem logical since they are related, but by the time you count up all the lines that the dialog class pulls in from the hundreds of header files it ultimately includes, you can easily have 300,000 lines of included files just to get at the 1-line prototype that you need.

The problem gets compounded in the file that is used to dispatch dialogs, which can end up including dozens of similar headers. Certainly there is overlap, so in the end the file may only bloat to about half a million lines of code when they are all included. But why pay the cost of half a million lines of code when 100 lines will do?

I also spent part of last week encapsulating the Boost thread locks and mutexes, which on Windows end up pulling in 400,000 lines of code. I ended up with a nearly identical templated set of classes where the headers run less than 100 lines of code total, including everything that they include. This was done using the PIMPL (pointer to implementation) idiom.
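
The real classes were templated, but a rough non-template sketch of the approach looks like this (the names are illustrative). The wrapper header mentions nothing from Boost; only the wrapper's .cpp file includes <boost/thread/mutex.hpp>, so client code never pays for it:

    // mutex.h -- a tiny header with no Boost include anywhere in it.
    class Mutex
    {
    public:
        Mutex();
        ~Mutex();
        void Lock();
        void Unlock();
    private:
        class Impl;
        Impl* m_impl;                      // pointer to implementation

        Mutex(const Mutex&);               // non-copyable
        Mutex& operator=(const Mutex&);
    };

    // RAII helper so callers still get scoped locking without any Boost headers.
    class ScopedLock
    {
    public:
        explicit ScopedLock(Mutex& mutex) : m_mutex(mutex) { m_mutex.Lock(); }
        ~ScopedLock() { m_mutex.Unlock(); }
    private:
        Mutex& m_mutex;
    };

    // mutex.cpp -- the only translation unit that pulls in the Boost headers.
    #include "mutex.h"
    #include <boost/thread/mutex.hpp>

    class Mutex::Impl
    {
    public:
        boost::mutex m;
    };

    Mutex::Mutex() : m_impl(new Impl) {}
    Mutex::~Mutex() { delete m_impl; }
    void Mutex::Lock() { m_impl->m.lock(); }
    void Mutex::Unlock() { m_impl->m.unlock(); }

Callers write ScopedLock guard(someMutex); and the Boost headers stay out of every file except mutex.cpp.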

In talking to one of my colleagues I commented that it seemed wholly surprising to me that people as smart as the guys who put together the Boost library would impose such a horrendous overhead when it was nearly trivial to avoid it. He commented that this is what you get when you have a header-only library. To which I replied: but Boost threads requires that you link against a DLL anyway.

The absurdity of a difference of nearly four orders of magnitude in the number of lines of code needed to implement a simple concept makes me wonder what the implementors were thinking. I can understand an inexperienced or even a somewhat experienced person making this sort of rookie blunder, but these are supposed to be experts.

I am becoming more and more conscious of the fact that many so called experts in the C++ community are really not. They are really smart, but that doesn't make them experts. That just makes them dangerous.

While there are some really good top notch open source software projects out there the vast majority, including boost, are putting out average or below average software.

The same can be said about the standard template library. While many implementations of the STL are good, I have yet to find one that includes less than 20,000 lines of code for something as simple as a string. Ask yourself why std::string has to be a template. The correct answer is that there is no good reason.

Template library creators have lost sight of, or never grasped, the basic programming concepts of data hiding, data invisibility, and the proper design of a library.

Rather than hide the implementation in a .cpp file, they expose everything to your compilation unit through header includes and force the compiler to work excessively hard to compile the same thing over and over again.

Rather than give us the option of using the explicit instantiation model with forward declarations of template classes, they force upon us a one-size-fits-all mentality of exposing our compilation unit to tens or hundreds of thousands of lines of code when a couple of dozen is all that is needed.
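
For your own templates the explicit instantiation model is straightforward. Here is a minimal sketch using an invented Table class: the header holds only declarations, the member definitions and the instantiations for the types you actually use live in one .cpp file, and every other translation unit compiles against a couple of dozen lines instead of the whole implementation.

    // table.h -- declarations only; no member function bodies, no heavy includes.
    template <typename T>
    class Table
    {
    public:
        Table();
        ~Table();
        void Add(const T& value);
        int  Size() const;
    private:
        Table(const Table&);               // non-copyable, to keep the sketch simple
        Table& operator=(const Table&);

        T*  m_data;
        int m_count;
        int m_capacity;
    };

    // table.cpp -- member definitions plus explicit instantiations for the supported types.
    #include "table.h"
    #include <algorithm>                   // heavy includes are confined to this one file

    template <typename T> Table<T>::Table() : m_data(0), m_count(0), m_capacity(0) {}
    template <typename T> Table<T>::~Table() { delete [] m_data; }

    template <typename T> void Table<T>::Add(const T& value)
    {
        if (m_count == m_capacity)
        {
            int newCapacity = m_capacity ? m_capacity * 2 : 8;
            T* grown = new T[newCapacity];
            std::copy(m_data, m_data + m_count, grown);
            delete [] m_data;
            m_data = grown;
            m_capacity = newCapacity;
        }
        m_data[m_count++] = value;
    }

    template <typename T> int Table<T>::Size() const { return m_count; }

    // The template is compiled exactly once, here, for each type we choose to support.
    template class Table<int>;
    template class Table<double>;

The trade-off is that clients can only use the instantiations listed in table.cpp; anything else fails at link time. But that is exactly the control I want: you decide which instantiations exist and where their cost is paid.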

I don't get it. Why isn't the C++ community up in arms about this irresponsible imposition of insanely heavy compilation costs?

I suppose part of the problem is that adding 1 more straw to the pile isn't noticeable. Eventually you break the back of the proverbial camel or boil the proverbial frog alive.

The problem is that the library creators aren't trying to use their libraries in real world applications. Their trivial test applications compile in seconds and so they don't notice and don't have to pay the cost of the fatal flaw in their designs.

Demand better. If the "experts" aren't going to make the data invisible then we are forced to do it. That means each of us has to do a little extra work that could have been done once.