String functions that meet the needs of i[45]86 November, 1994 Ulrich Drepper Torböjrn Granlund made in 1991 a great job while he wrote the assembler versions of nearly all the essential strings functions in the GNU libc. Most are written in assembler for speed in the features of the processor are used for better speed. At that time the i386 was the CPU one found. In the development for the i486 Intel manage to implement the most important assembler instructions to use only one clock cycle. But this could not be done for all instruc- tions. So some of them now use the same or a bigger number of cycles. That's the way the CISC CPU get closer to RISC. Among those instructions which suffer from speedup are the "unimportant" string instructions: lods, stos, movs, scas, cmps, ins, and outs. Formerly very fast in comparison with the two or three instructions they substitute now they take longer. E.g. lodsl movl (%esi),%eax stosl movl %eax,(%edi) addl $4,%esi addl $4,%edi i386 5 4 + 2 4 2 + 2 i486 5 1 + 1 5 1 + 1 The only situation where the string instructions can still compete is when they are compined with 'rep' prefix and can process 32 bits in one step. This is the case for functions like 'memset', 'memcpy' etc. You will not find these functions in the collection. In the process of writing this functions I found several more techniques that can be used when breaking the string instructions in the sequence. For understanding this one must know that all kind of jumps (transfer of control) is expensive. The main goal should be to keep the main path of execution free of jumps. Of course we need loops but with some tricks we can get more power out of a simple loop. Some optimization techniques This is not intend to be a course in assembler programming but just a collection of the experiences I made. Perhaps somebody doing work on further functions (or even myself) want to read this instead of re-invent the wheel. 0. (As said above) Optimize the path of execution to use as little _TAKEN_ jumps as possible. In conditional jumps the main path should follow the way of a not fulfilled condition. 1. Unroll loops. The contents of loops should be unrolled to have the loop check to occure not every step. A good number I found is 4: nearly all the functions I wrote have a four times unfolded loop. To achieve this for directly bounded loops (such as in strncmp, bound by the length parameter) one have to do a check for the whole loop. This leads to some extra code which can process the remaining bytes but this is limited to a small range. E.g. the loop processes 16 bytes each round one has to write a tail code which can process 0 to 15 bytes. This number is so small that one need not process it in a loop but instead write a straight code. 2. Use the registers which are used for register variables only if necessary. The binary format specification for a.out and ELF on i386 CPUs says that the registers %ebx, %edi, and %esi must not be changed by a function (they are 'callee-save-registers'). A saving is generally done by a push/pop pair. This costs 5 cycles. Not much but we are looking for every cycle :). Before using tempories on the stack there is another possibility. Most those functions are tail-functions. I.e. they won't call any other functions. Therefor the base register is not needed. This way one has another register available. (Caution: you won't be able to use a debugger anymore if you use this technique.) I would not suggest to use this trick if you have to access the parameter and locale variables of the function frequently because the addressing mode relative to %ESP is expensive: because it was introduced late, first in i386, it always has a prefix. This prefix needs (at least upto i486) one extra cycle to process. 3. Exploit the adressing modes of the i386 CPUs The processor has a rather complex general adressing mode: offset + base register + index register * index factor where offset is a constant base register is a general register index register another general register index factor is 1, 2, 4, or 8 One possible use is found in strcmp. We have to process two strings and therefor have to deal with two pointers. But these pointers are NOT unrelated. If we used one pointer and a constant, but only known at runtime, offset we can write loops simpler. An example: str1 => %edi str2 => %esi normal code optimized code subl %esi,%edi L: mov (%edi),... L: mov (%edi,%esi),... mov (%esi),... mov (%esi), ... ... add $x,%edi add $x,%esi add $x,%esi jmp L jmp L The CPU is able to compute the address (most of the time) in one cycle. 4. Use 32 bit for processing (This is what gains the most speedup !!!) The nature of the string functions generally appears to be only solvable by processing byte by byte. One has to match a character and also look for the end of the string (NULL char). But this not true. Basing on an algorithm I found in the C version of 'memchr' by Roland McGrath based on a work of (again) Torböjrn Granlund one has an efficient possiblity of matching a specific character out of a 32-bit word. I don't describe it here (look into the source code of the assembler version of 'memchr'). The only thing to say here is that it can be implemented with very cheap assembler instruction. (Unfortunately the test is only a necessary condition but not sufficient. If you look through the code you will known what I mean.) But does this extra arithmetic is it worth? Is a simple comparison not of similar or greater speed? With a simple word: no. An extreme example: 'strlen'. One could think of a naïve implementation, perhaps with an unfolded loop. But we have one comparison and one conditional jump for each byte. With this nice little trick we have some arithmetic and two conditionals for each 32-bit word. That's worth it. But there must be some problems. The world would not wait so long with the invention if the gain is so big. And in fact: there is a problem. The address space in modern OS is a virtual one. One effect of this is that memory access to some areas of the address space is illegal. If we now process string which is located at the top of a valid area and an 32-bit memory access crosses the boundary to the not mapped area we get a SEGFAULT. But what to do? We cannot check before every memory access if it is valid. This would nullify the gains. The solution comes if we put some knowledge about the environment in the algorithm. The MMU of the i386 CPU is only able to map memory areas on 4096-bytes boundaries (I assume the big grainity which is found in all interesting OS). If we now align all memory access on a 4-byte boundary we can never cross the line between two areas with one memory access. If we access an illegal address (perhaps because of a non-terminated string) the simple algorithm would make this, too. The behaviour is identical! (Another profit from aligned memory access is that it is faster.) You might say that this sounds nice but if you have two strings to process you cannot assure to align them both simultaneously. And you are right. This was one of the biggest problems I came over but if you look through the code of 'strcmp' you will note that there is a solution. NOTE NOTE NOTE: PLEASE check it out !!!!!!!!!! One can identify two big areas of memory where the 32 bit access is valid: The normal data space upto the '___brk_addr' and the stack (from %esp upto 0xc0000000). At the beginning of 'strcmp' and 'strncmp' we check if both pointers fall into one of these areas. If not we use the simple byte-by-byte algorithm. In the other case there are three possibilities: a) both pointers are in the data areas Here it is sufficient to align the higher pointer because it is the only one which could cause a boundary crossing. b) both pointers are in the stack area We need not do any special action. The first illegal address is 0xc000000. But at the top of the stack there is (at least) the return address for the 'main' function and its parameter pointers. Every access in this area can only be a fault and perhaps a resulting SEGFAULT will help to debug. (If somebody says that this is not right we also here can align the higher pointer.) c) one pointer in data the other in stack area As one can guess we have to align the pointer in the data area. This works with the same arguments as above. 5. Align loops. It is important to align the main loop according to the needs of the processor. The i386 only needs a 4-byte alignment but the i486 a 16-byte alignment (no info about i586). I use the generic macro 'ALIGN' which is defined in 'asm-ops.h'. In some files you find another optimization with this alignment. The aligning is done by padding with a fill-instruction: NOP. This cannot be executed for free, it costs on i386 3 cycles and on i486 1 cycle. Many functions have a jump to an instruction immediately in front of the loop. If we now arrange the code that the instructions are placed behind the NOP many times the NOPs need not be executed. Please look for a concrete example. It is a rather ugly method but the assembler does not have a directive like 'align to XXX but place the NOP before YYY'.