DEC Alpha, gcc, cc, and the Tower of Hanoi
(Running time peculiarities)

M. Kolar
Feb 1 - 10, 1998

Abstract: The relative speed of two different C programs may depend significantly on the compiler used. The GNU gcc compiler can produce much slower code (at least on DEC Alpha machines) than the standard cc compiler, and "optimized" executable code may be slower than code compiled without any optimization.


The first version of my Tower of Hanoi algorithms page, posted on Feb. 1, 1998 (before the formulation of Algorithms no. 3 and 4), contained the running times in CPU seconds for Er's algorithm and my Algorithms no. 1 and 2, compiled with the GNU gcc compiler with default optimization turned on (gcc -O) on a five-year-old 100 MHz DEC Alpha 300L (OSF1 V3.2) workstation (and executed on the same machine). Tables 1 to 4 below are taken over from that first version. They originally led me to conclude that Er's bare-bones algorithm is faster than Algorithms 1 and 2.

Table 4
Running times of Algorithm no. 2 and Er's algorithm (bare-bones versions)

Actually, Table 4 does not refer to the present form of Algorithm no. 2, but to a variation of it containing a goto statement, shown in Fig. 7old. The reason was that this variation gave faster code when compiled with the -O flag (with both gcc and cc) on the DEC Alpha 300L. Later I found that the present version of Algorithm no. 2 gives faster code when compiled with either compiler without optimization on the Alpha 300L, and with or without optimization on the other two machines used in the final version of the Algorithms page. That is what I had originally expected (that the code with the goto would be slower).

/* Kolar's Hanoi Towers algorithm - faster version */
 .
 .
 .

 fr = a;
 if(n&1) { to = c; sp = b; }
 else    { to = b; sp = c; }

 while(*c > 1) {
  if((to[--(*to)] = fr[(*fr)++])&1) {
   odd:
    p=sp; sp=to;
    if(fr[*fr] > p[*p]) { to=fr; fr=p; }
    else to=p;
  } else {
    p=sp; sp=fr; fr=p;
    /* (call to) tower visualization code */
    to[--(*to)] = fr[(*fr)++];
    goto odd;
  }
  /* (call to) tower visualization code */
 }
}
Fig. 7old. Algorithm no. 2 as presented in the first, Feb. 1, version of the Algorithms page
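
The figure shows only the core loop; the function header and the tower initialization are elided. For readers who want to experiment with it, the following is a minimal self-contained driver built on my assumptions about the representation implied by the code: each tower is an int array of n+2 elements whose element 0 holds the index of the topmost disk, the disks occupy indices *t to n (smallest disk at the lowest index), and element n+1 is a sentinel larger than any disk, so that an empty tower compares as carrying the largest possible disk. This is only a sketch, not the original code.

/* Hypothetical driver for the fragment in Fig. 7old (a sketch only;
   the representation described above is my assumption). */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
 int n = (argc > 1) ? atoi(argv[1]) : 10;
 int *a = (int *) malloc((n+2)*sizeof(int));
 int *b = (int *) malloc((n+2)*sizeof(int));
 int *c = (int *) malloc((n+2)*sizeof(int));
 int *fr, *to, *sp, *p;
 int i;

 a[0] = 1;                        /* tower a holds disks 1..n         */
 for(i = 1; i <= n; i++) a[i] = i;
 b[0] = c[0] = n+1;               /* towers b and c start empty       */
 a[n+1] = b[n+1] = c[n+1] = n+1;  /* sentinels for the empty case     */

 fr = a;
 if(n&1) { to = c; sp = b; }
 else    { to = b; sp = c; }

 while(*c > 1) {
  if((to[--(*to)] = fr[(*fr)++])&1) {
   odd:
    p=sp; sp=to;
    if(fr[*fr] > p[*p]) { to=fr; fr=p; }
    else to=p;
  } else {
    p=sp; sp=fr; fr=p;
    to[--(*to)] = fr[(*fr)++];
    goto odd;
  }
 }

 printf("tower c:");              /* should print 1 2 ... n           */
 for(i = 1; i <= n; i++) printf(" %d", c[i]);
 printf("\n");
 free(a); free(b); free(c);
 return 0;
}

If the assumptions are right, the driver ends with all n disks on tower c, printed in order from the smallest to the largest.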

But only when I started testing Algorithm no. 3 did I realize that the running times on the DEC Alpha 300L are somewhat strange. The times for Algorithm no. 3 (bare-bones version, compiled with gcc -O) were rather longer than those for Er's algorithm. So I also tried the standard cc compiler, and both compilers with optimization turned off. The CPU times I obtained are presented in Table 5. I was surprised that gcc -O gives longer times than gcc without optimization, and that gcc gives much slower code than cc. I found that the latter is also true for all the other algorithms, although for them the running times of the gcc-optimized code are at least shorter than those of the unoptimized code. I recalled that the DEC Alpha 300L was one of the first DEC Alpha RISC workstations, and it was apparently not yet very good at using its resources efficiently (e.g., I myself found that it sometimes leaves unused bytes in memory between the elements of an array).

Table 5
Running times of Algorithm no. 3 (bare-bones version) on the DEC Alpha 300L for code compiled with different compilers and optimization settings
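
The timing harness itself is not shown on this page, but CPU seconds of this kind can be obtained portably with the ISO C clock() function, as in the sketch below. Note that run_algorithm() is only a dummy placeholder workload performing 2^n - 1 loop iterations, not one of the algorithms discussed here.

/* Sketch of CPU-time measurement with the ISO C clock() function. */
#include <stdio.h>
#include <time.h>

static volatile unsigned long sink;

static void run_algorithm(int n)  /* placeholder workload only */
{
 unsigned long i, moves = (1UL << n) - 1;  /* 2^n - 1 "moves"   */
 for(i = 0; i < moves; i++) sink += i;
}

int main(void)
{
 int n = 25;
 clock_t start, finish;

 start = clock();
 run_algorithm(n);
 finish = clock();
 printf("n = %d: %.2f CPU seconds\n", n,
        (double)(finish - start) / CLOCKS_PER_SEC);
 return 0;
}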

After that I tried my algorithms on several other machines with C compilers installed. On the 400 MHz DEC Alpha, I still found that gcc -O always gives longer times than the cc compiler; at least gcc -O gave shorter times there than gcc without any optimization. For the cc compiler, optimization (-O) did not matter (except for Algorithm no. 4 with simple tower visualization, for which -O produced somewhat faster code). Actually, for Algorithm no. 3 and large n, the code obtained with cc -O gives running times that are a fraction of a second longer than those of the code obtained with cc without optimization! On the SPARCstation, optimization always decreased running times significantly.

Another point that surprised me on both DEC Alpha machines was that when, in Algorithm no. 4, the line

e[--(*e)] = 1; (*o1)++;
which uses the fact that the smallest disk (disk 1) is moved on each odd-numbered move, was replaced by the more complex line
e[--(*e)] = o1[(*o1)++];
(requiring an additional array-element access), the resulting code ran faster. No such behaviour was observed on the SPARCstation, where assigning the constant 1 was, as one would expect, faster than assigning the contents of an array element (which we know must contain 1 when this particular line of code is executed).
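
This kind of effect can be probed in isolation. The following hypothetical microbenchmark (none of it is code from the Algorithms page) arranges two small arrays so that o1[*o1] always contains 1, mirroring the situation on the odd-numbered moves, and then times the two statement forms against each other. Whether the indexed form really comes out faster will, as the observations above suggest, depend on the machine and the compiler.

/* Hypothetical microbenchmark contrasting the two statement forms. */
#include <stdio.h>
#include <time.h>

#define N    1000
#define REPS 100000L   /* adjust so each loop runs for a few seconds */

int e[N+2], o1[N+2];

static void reset(void)
{
 int i;
 e[0] = N+1;                      /* e starts empty                  */
 o1[0] = 1;                       /* o1 holds N copies of disk 1     */
 for(i = 1; i <= N; i++) o1[i] = 1;
}

int main(void)
{
 long r; int i; clock_t t0, t1;

 t0 = clock();
 for(r = 0; r < REPS; r++) {
  reset();
  for(i = 0; i < N; i++) { e[--(*e)] = 1; (*o1)++; }      /* constant */
 }
 t1 = clock();
 printf("constant form: %.2f CPU seconds\n",
        (double)(t1 - t0) / CLOCKS_PER_SEC);

 t0 = clock();
 for(r = 0; r < REPS; r++) {
  reset();
  for(i = 0; i < N; i++) { e[--(*e)] = o1[(*o1)++]; }     /* indexed  */
 }
 t1 = clock();
 printf("indexed form : %.2f CPU seconds\n",
        (double)(t1 - t0) / CLOCKS_PER_SEC);

 printf("check: %d %d\n", e[1], *o1);  /* keep the work observable   */
 return 0;
}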


It would be interesting to know how other machines and compilers treat these algorithms. I had not previously noticed such large differences between the running times of executable code produced by cc and by gcc. When I now recompiled some of my other older programs with both cc -O and gcc -O, I found that for some of them the gcc-generated code also runs slightly slower on the DEC Alpha 300L (by fractions of a second for CPU times of the order of tens of seconds) than the cc-generated code.