working for fast code - c++/asm optimizations

Hacktarux · Jan 10, 2005

BGNG said:
I do not doubt what you have said, Slougi, but I am curious as to the reasoning behind that loop counting down thing. Why is counting down faster than up for loops? Is it because knowing the maximum value eliminates the need to check for data size, overflow, etc?

Because of the x86 LOOP opcode i suppose, a well written compiler will use it : it decrements ecx and loop automatically while ecx is greater than 0.

BGNG said:
Also... Your mentioning of link lists reminded me of a very important thing:

To optimize for small code is not necessarily to optimize for fast code. In fact, the fastest code tends to be larger and uses more memory than the smallest code.

That's usually true (although in some cases it can be false) but only when your code is so optimized that any speed improvement involves more memory or the other way around. There's no need to say that it's a really hard to prove condition, otherwise everybody would be doing pretty optimized code

BGNG · Jan 10, 2005

Well, take cufunha's loop unrolling concept for example...

A)
For I = 1 To 100
X = X + I
Next

B)
For I = 1 To 97 Step 4 'for(I = 1; I <= 97; I += 4) {
X = X + 4 * I + 6
Next

...Both A and B give the same result, but B uses more code and memory to do so. Nevertheless, B is faster than A.

EDIT:
Where smaller code is faster is typically something like the following:

A)
X = Z + Z + Z + Z + Y

B)
X = 4 * Z + Y

In Assembly, it's easier to shift left two bits and add once than to add four times.

EDIT 2:

I guess if we wanted to be all hoity-toity about it... (-:

B)
For I = 97 To 1 Step -4
X = X + 4 * I + 6
Next

EDIT 3:

And just for the heck of it, here is some NES Assembly of the examples, which is the Assembly I've been working with for the past while... Feel free to optimize 'em. I'm sure there must be a number of improvements to be made:

Example 1:
The variable I is stored in memory byte 0, and variable X is stored in memory byes 1 and 2, where the high byte is byte 1

A) The total cycle time for all of the included instructions for code A is 51 cycles, and for B is 81 cycles. So at most (which is an impossible scenario), A would take 5100 (51 * 100 iterations) and B would take 2025 (81 * 25 iterations) cycles to complete.

Code:

LDA #$00          ; A = 0
STA $01           ; Mem(1) = A
STA $02           ; Mem(2) = A
LDA #$01          ; A = 1
STA $00           ; Mem(0) = A

LoopTop:          ; LoopTop:
LDA $02           ; A = Mem(2)
CLC               ; Carry = 0
ADC $00           ; A = A + Mem(0) + Carry
BCS CarrySet      ; If Carry = 1 Then GoTo CarrySet
BCC CarryClear    ; If Carry = 0 Then GoTo CarryClear

CarrySet:         ; CarrySet:
CLC               ; Carry = 0
LDY $01           ; Y = Mem(1)
INY               ; Y = Y + 1
STY $01           ; Mem(1) = Y

CarryClear:       ; CarryClear:
STA $02           ; Mem(2) = A
LDY $00           ; Y = Mem(0)
INY               ; Y = Y + 1
STY $00           ; Mem(0) = Y
CMP $65           ; Void = A - 101
BNE LoopTop       ; If Zero = 0 Then GoTo LoopTop

B) Remember, this version only has 1/4 the iterations as A, so it's faster despite the fact each iteration takes longer.

And as luck would have it, it doesn't use any more memory than the first example did. Probably because of the inefficient memory method I used in the first place (I still have the X register I haven't used)

Code:

LDA #$00          ; A = 0
STA $01           ; Mem(1) = A
STA $02           ; Mem(2) = A
LDA #$01          ; A = 1
STA $00           ; Mem(0) = A

LoopTop:          ; LoopTop:
LDA $00           ; A = Mem(0)
ASL A             ; A = A * 2
ASL A             ; A = A * 2
ADC #$06          ; A = A + 6
JSR CarryOne      ; GoSub CarryOne
CLC               ; Carry = 0
ADC $02           ; A = A + Mem(2) + Carry
BCS CarrySet      ; If Carry = 1 Then GoTo CarrySet
BCC CarryClear    ; If Carry = 0 Then GoTo CarryClear

CarryOne:         ; CarryOne:
BCC NoCarry       ; If Carry = 0 Then GoTo NoCarry
CLC               ; Carry = 0
LDY $01           ; Y = Mem(1)
INY               ; Y = Y + 1
STY $01           ; Mem(1) = Y
NoCarry:          ; NoCarry:
RTS               ; Return

CarrySet:         ; CarrySet:
CLC               ; Carry = 0
LDY $01           ; Y = Mem(1)
INY               ; Y = Y + 1
STY $01           ; Mem(1) = Y

CarryClear:       ; CarryClear:
STA $02           ; Mem(2) = A
LDY $00           ; A = Mem(0)
ADC #$04          ; A = A + 4 + Carry
STA $00           ; Mem(0) = A
CMP $65           ; Void = A - 101
BNE LoopTop       ; If Zero = 0 Then GoTo LoopTop

Example 2:
The variable X goes to memory byte 0; Z to byte 1; and Y to byte 2. The cycle time for ADC with Zero Page addressing as shown below is 3 cycles, and ASL with Accumulator is 2 cycles. So B is 8 cycles faster, and 2 instructions shorter (and happens to be 6 bytes shorter compiled) than A.

A)

Code:

LDA #$00     ; A = 0
ADC $01      ; A = A + Mem(1)
ADC $01      ; A = A + Mem(1)
ADC $01      ; A = A + Mem(1)
ADC $01      ; A = A + Mem(1)
ADC $02      ; A = A + Mem(2)
STA $00      ; Mem(0) = A

B)

Code:

LDA #$01     ; A = Mem(1)
ASL A        ; A = A * 2
ASL A        ; A = A * 2
ADC $02      ; A = A + Mem(2)
STA $00      ; Mem(0) = A

zAlbee · Jan 10, 2005

BGNG said:
Well, take cufunha's loop unrolling concept for example...

A)
For I = 1 To 100
X = X + I
Next

B)
For I = 1 To 97 Step 4 'for(I = 1; I <= 97; I += 4) {
X = X + 4 * I + 6
Next

...Both A and B give the same result, but B uses more code and memory to do so. Nevertheless, B is faster than A.

This example is just silly... You would never use a loop to do this... If you do find a problem with many steps that you could condense formulaically, then condense ALL the steps you can.

x = x + 5050;

i optimized it

BGNG · Jan 10, 2005

The purpose of this thread is to discuss optimization techniques. Adding every number from 1 to 100 is stupid at best, but it's a simple example to use when describing said techniques for program optimization.

Optimize this:

For I = 1 To 100
X = X + ArrayOfNumbers(I)
Next

...X = X + ArrayOfNumbers(5050)? Not hardly. That's what cufunha described in his first post, which I implemented above.

EDIT:

And besides, didn't you see my post on the first page?:

For I = 1 To Y
X = X + I
Next

is the same as

X = X + (1 + Y) * (Y / 2)

aprentice · Jan 10, 2005

I thought the purpose of this thread was to discuss optimizations that relate to actual programming techniques used not about theories and such :huh:

I haven't been able to relate any of bgng's optimizations to my programs..

BGNG · Jan 11, 2005

"Technique" by definition implies nothing more than application of understanding of a concept, which is defined by "theory" itself. Posting theoretical descriptions of techniques better provides for understanding, which is why I presented my ideas as such.

And I find it hard to believe that in none of your programs do you use arithmetic operators, For loops, or conditional statements.

(I find the intent of your last post questionable.)

aprentice · Jan 11, 2005

BGNG said:
And I find it hard to believe that in none of your programs do you use arithmetic operators, For loops, or conditional statements.

The theories you have presented thus far illustrate very simple optimization techniques that are pretty much well known and just plain common sense. I can honestly say that I have known each and every optimization technique you have presented so far and I am pretty sure most of our readers have as well. I am more interested in advanced optimization techniques that go beyond second grade arithmetic. I have however found the majority of the posts in this thread informative.

(I find the intent of your last post questionable.)

If you are implying that I have something against you then you have been horribly mistaken; as a matter of fact, I have nothing against you and believe I am entitled to my own opinion without a reasonable doubt. I would appreciate that you do not imply such accusations in future as I find them somewhat offending.

Slougi · Jan 11, 2005

Well a lot of the loop optimization stuff we have seen is basically optimizing geometric and arithmetic series. So do study math

ector · Jan 11, 2005

WildNinji said:
This is true. But I was talking about f-pointers in C and f-pointers in C++, not vfuncs.

Code:

/* C f-pointer */ table[0] = one_func; table[1] = one_func; table[2] = another_func; (*table[x])(); /* C++ f-pointer */ table[0] = &CPU::one_func; table[1] = &CPU::one_func; table[2] = &CPU::another_func; (*table[x])();

The C++ (*table[x])(); will be at least 2x slower because of how &CPU:ne_func is encoded (as a offset to a calling table, one load, one addition and one comparison is needed).

Nope. Since by your usage these member functions of the CPU class that you describe above are static functions, i.e. they don't use a "this" pointer, they will act, perform and compile exactly as your C example.

You're absolutely correct. But there are still some little things (like functions that deal with a lot of pointers) that need C because the C++ compilers can't do many assumpion about memory usage and will produce slow(er) code.

Show me an example, with assembly output.

WildNinji · Jan 11, 2005

ector said:
Nope. Since by your usage these member functions of the CPU class that you describe above are static functions, i.e. they don't use a "this" pointer, they will act, perform and compile exactly as your C example.

Show me an example, with assembly output.

You're correct, but I was wrinting that with non-static functions in my mind.

About C vs C++ the differences are all about aliasing and __restricted pointers. The C++ language have no (standard) means to say if one pointer influences another. If the compiler supports __restricted pointers also for C++ (and not only for C), then the two languages are equivalent.

I've attached the sources and asm files (created by gcc 3.3.3).

ector · Jan 11, 2005

You are right that there is no standard way to do that in C++ yet, I hope they fix that soon. In the meantime I just use the "Assume no aliasing" option. It does the same as "restricted" on every function in the program. If your code doesn't survive that, it's in most cases badly written and deserves to break

cufunha · Jan 12, 2005

I'm making a n64 emulator. If I was to take only ONE variable of the whole emulator and assign the register c++ modifier to it - in order to make it faster, which variable do you suggest? I guess it is MIPS PC counter, because its accessed everytime, am I right?

euphoria · Jan 12, 2005

cufunha said:
I'm making a n64 emulator. If I was to take only ONE variable of the whole emulator and assign the register c++ modifier to it - in order to make it faster, which variable do you suggest? I guess it is MIPS PC counter, because its accessed everytime, am I right?

If i had understood it correctly the best time to use a register variable would be the place where it's accessed all the time. I don't think that the MIPS PC wouldn't be the best option since there is a whole lot of code executed in between.
And also if i remember correctly the register doesn't make any promises, it just makes the compiler to try to use a register for the variable.
Also threading where there is a lot of thread switching would destroy the usage of variables defined as register.

I have very likely misunderstood what the register really does, if so please correct.

Slougi · Jan 13, 2005

euphoria said:
I have very likely misunderstood what the register really does, if so pease correct.

Nope, afaik that's right.

Search

Search

working for fast code - c++/asm optimizations

Hacktarux

Emulator Developer

BGNG

New member

zAlbee

Keeper of The Iron Tail

BGNG

New member

aprentice

Moderator

BGNG

New member

aprentice

Moderator

Slougi

New member

ector

Emulator Developer

WildNinji

New member

ector

Emulator Developer

cufunha

New member

euphoria

Emutalk Member

Slougi

New member