String functions that meet the needs of i[45]86 
  November, 1994
  Ulrich Drepper <drepper@ipd.info.uni-karlsruhe.de>


Torböjrn Granlund made in 1991 a great job while he wrote the
assembler versions of nearly all the essential strings functions in
the GNU libc. Most are written in assembler for speed in the features
of the processor are used for better speed.

At that time the i386 was the CPU one found. In the development for the
i486 Intel manage to implement the most important assembler instructions
to use only one clock cycle. But this could not be done for all instruc-
tions. So some of them now use the same or a bigger number of cycles.
That's the way the CISC CPU get closer to RISC.

Among those instructions which suffer from speedup are the "unimportant"
string instructions: lods, stos, movs, scas, cmps, ins, and outs.
Formerly very fast in comparison with the two or three instructions they
substitute now they take longer. E.g.

              lodsl  movl (%esi),%eax         stosl  movl %eax,(%edi)
                     addl $4,%esi                    addl $4,%edi

 i386          5        4 + 2                   4        2 + 2

 i486          5        1 + 1                   5        1 + 1


The only situation where the string instructions can still compete is
when they are compined with 'rep' prefix and can process 32 bits in one
step. This is the case for functions like 'memset', 'memcpy' etc. You
will not find these functions in the collection.


In the process of writing this functions I found several more techniques
that can be used when breaking the string instructions in the sequence.
For understanding this one must know that all kind of jumps (transfer of
control) is expensive. The main goal should be to keep the main path of
execution free of jumps. Of course we need loops but with some tricks we
can get more power out of a simple loop.


Some optimization techniques

This is not intend to be a course in assembler programming but just a
collection of the experiences I made. Perhaps somebody doing work on
further functions (or even myself) want to read this instead of
re-invent the wheel.

0. (As said above)  Optimize the path of execution to use as little
    _TAKEN_ jumps as possible.

    In conditional jumps the main path should follow the way of a not
    fulfilled condition. 

1. Unroll loops.

    The contents of loops should be unrolled to have the loop check to
    occure not every step. A good number I found is 4: nearly all the
    functions I wrote have a four times unfolded loop.

    To achieve this for directly bounded loops (such as in strncmp,
    bound by the length parameter) one have to do a check for the
    whole loop. This leads to some extra code which can process the
    remaining bytes but this is limited to a small range. E.g. the
    loop processes 16 bytes each round one has to write a tail code
    which can process 0 to 15 bytes. This number is so small that one
    need not process it in a loop but instead write a straight code.

2. Use the registers which are used for register variables only if
    necessary. 

    The binary format specification for a.out and ELF on i386 CPUs
    says that the registers %ebx, %edi, and %esi must not be changed
    by a function (they are 'callee-save-registers'). A saving is
    generally done by a push/pop pair. This costs 5 cycles. Not much
    but we are looking for every cycle :).

    Before using tempories on the stack there is another possibility.
    Most those functions are tail-functions. I.e. they won't call any
    other functions. Therefor the base register is not needed. This
    way one has another register available. (Caution: you won't be
    able to use a debugger anymore if you use this technique.)  I
    would not suggest to use this trick if you have to access the
    parameter and locale variables of the function frequently because
    the addressing mode relative to %ESP is expensive: because it was
    introduced late, first in i386, it always has a prefix. This prefix
    needs (at least upto i486) one extra cycle to process.

3. Exploit the adressing modes of the i386 CPUs

    The processor has a rather complex general adressing mode:

           offset + base register + index register * index factor

    where   offset           is a constant
            base register    is a general register
            index register   another general register
            index factor     is 1, 2, 4, or 8

    One possible use is found in strcmp. We have to process two
    strings and therefor have to deal with two pointers. But these
    pointers are NOT unrelated. If we used one pointer and a constant,
    but only known at runtime, offset we can write loops simpler. An
    example:

               str1 =>  %edi           str2 => %esi


         normal code                        optimized code    
 
                                             subl %esi,%edi

      L:   mov (%edi),...              L:   mov (%edi,%esi),...
           mov (%esi),...                   mov (%esi),

           ...                              ...

           add $x,%edi                       add $x,%esi
           add $x,%esi

           jmp L                             jmp L

    The CPU is able to compute the address (most of the time) in one
    cycle.


4. Use 32 bit for processing
    (This is what gains the most speedup !!!)

    The nature of the string functions generally appears to be only
    solvable by processing byte by byte. One has to match a character
    and also look for the end of the string (NULL char).

    But this not true. Basing on an algorithm I found in the C
    version of 'memchr' by Roland McGrath based on a work of (again)
    Torböjrn Granlund one has an efficient possiblity of matching a
    specific character out of a 32-bit word. I don't describe it here
    (look into the source code of the assembler version of 'memchr').
    The only thing to say here is that it can be implemented with very
    cheap assembler instruction.

    (Unfortunately the test is only a necessary condition but not
     sufficient. If you look through the code you will known what I
     mean.)

    But does this extra arithmetic is it worth?  Is a simple
    comparison not of similar or greater speed?

    With a simple word: no. An extreme example: 'strlen'. One could
    think of a naïve implementation, perhaps with an unfolded loop.
    But we have one comparison and one conditional jump for each byte.
    With this nice little trick we have some arithmetic and two
    conditionals for each 32-bit word. That's worth it.

    But there must be some problems. The world would not wait so long
    with the invention if the gain is so big. And in fact: there is a
    problem.

    The address space in modern OS is a virtual one. One effect of
    this is that memory access to some areas of the address space is
    illegal. If we now process string which is located at the top of
    a valid area and an 32-bit memory access crosses the boundary to
    the not mapped area we get a SEGFAULT. But what to do?  We cannot
    check before every memory access if it is valid. This would
    nullify the gains.

    The solution comes if we put some knowledge about the environment
    in the algorithm. The MMU of the i386 CPU is only able to map
    memory areas on 4096-bytes boundaries (I assume the big grainity
    which is found in all interesting OS). If we now align all memory
    access on a 4-byte boundary we can never cross the line between two
    areas with one memory access. If we access an illegal address
    (perhaps because of a non-terminated string) the simple algorithm
    would make this, too. The behaviour is identical!  (Another
    profit from aligned memory access is that it is faster.)

    You might say that this sounds nice but if you have two strings to
    process you cannot assure to align them both simultaneously. And
    you are right. This was one of the biggest problems I came over
    but if you look through the code of 'strcmp' you will note that
    there is a solution.

NOTE NOTE NOTE:  PLEASE check it out !!!!!!!!!!

    One can identify two big areas of memory where the 32 bit access
    is valid:  The normal data space upto the '___brk_addr' and the
    stack (from %esp upto 0xc0000000). At the beginning of 'strcmp'
    and 'strncmp' we check if both pointers fall into one of these
    areas. If not we use the simple byte-by-byte algorithm.

    In the other case there are three possibilities:
    a) both pointers are in the data areas

       Here it is sufficient to align the higher pointer because it is
       the only one which could cause a boundary crossing.

    b) both pointers are in the stack area

       We need not do any special action. The first illegal address
       is 0xc000000. But at the top of the stack there is (at least)
       the return address for the 'main' function and its parameter
       pointers. Every access in this area can only be a fault and
       perhaps a resulting SEGFAULT will help to debug.
       (If somebody says that this is not right we also here can align
        the higher pointer.)

    c) one pointer in data the other in stack area

       As one can guess we have to align the pointer in the data area.
       This works with the same arguments as above.


5. Align loops.

    It is important to align the main loop according to the needs of
    the processor. The i386 only needs a 4-byte alignment but the
    i486 a 16-byte alignment (no info about i586). I use the generic
    macro 'ALIGN' which is defined in 'asm-ops.h'.

    In some files you find another optimization with this alignment.
    The aligning is done by padding with a fill-instruction: NOP.
    This cannot be executed for free, it costs on i386 3 cycles and on
    i486 1 cycle. Many functions have a jump to an instruction
    immediately in front of the loop. If we now arrange the code that
    the instructions are placed behind the NOP many times the NOPs
    need not be executed. Please look for a concrete example. It is
    a rather ugly method but the assembler does not have a directive
    like 'align to XXX but place the NOP before YYY'.