Well, take cufunha's loop unrolling concept for example...
A)
For I = 1 To 100
X = X + I
Next
B)
For I = 1 To 97 Step 4
'for(I = 1; I <= 97; I += 4) {
X = X + 4 * I + 6
Next
...Both A and B give the same result (5050), but B uses more code and memory to do so. Nevertheless, B is faster than A, because it pays the loop overhead (increment, compare, branch) 25 times instead of 100.
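If you want to convince yourself the two loops really agree, here is a quick sketch in Python rather than BASIC (the function names are made up for illustration). Each pass of B adds I + (I+1) + (I+2) + (I+3), which is 4*I + 6, so 25 passes cover the same 100 terms.

```python
def loop_a():
    """Plain loop: add every I from 1 to 100."""
    x = 0
    iterations = 0
    for i in range(1, 101):
        x += i
        iterations += 1
    return x, iterations


def loop_b():
    """Unrolled by four: I + (I+1) + (I+2) + (I+3) == 4*I + 6."""
    x = 0
    iterations = 0
    for i in range(1, 98, 4):  # I = 1, 5, 9, ..., 97
        x += 4 * i + 6
        iterations += 1
    return x, iterations


print(loop_a())  # (5050, 100)
print(loop_b())  # (5050, 25)
```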
EDIT:
A case where the smaller code is also the faster code typically looks something like this:
A)
X = Z + Z + Z + Z + Y
B)
X = 4 * Z + Y
In Assembly, it's easier to shift left two bits and add once than to add four times.
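The same idea as a small Python sketch (hypothetical function names): shifting left by two bits is the same as multiplying by 4, so one shift-shift-add replaces the chain of additions.

```python
def four_adds(z, y):
    """Version A: add Z four times, then Y."""
    return z + z + z + z + y


def shift_and_add(z, y):
    """Version B: shift Z left two bits (times 4), then add Y once."""
    return (z << 2) + y


# Spot-check that both forms agree over a small range of inputs.
for z in range(64):
    for y in range(16):
        assert four_adds(z, y) == shift_and_add(z, y) == 4 * z + y

print(shift_and_add(10, 3))  # 43
```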
EDIT 2:
I guess if we wanted to be all hoity-toity about it... (-:
B)
For I = 97 To 1 Step -4
X = X + 4 * I + 6
Next
EDIT 3:
And just for the heck of it, here is some NES Assembly of the examples, which is the Assembly I've been working with for the past while... Feel free to optimize 'em. I'm sure there must be a number of improvements to be made:
Example 1:
The variable I is stored in memory byte 0, and variable X is stored in memory bytes 1 and 2, where the high byte is byte 1.
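A) Adding up each listed instruction once at its base cycle cost gives a total of 51 cycles for code A and 81 cycles for code B. These are rough figures: the exact count shifts by a few cycles with the addressing modes used and with which branches are taken. So at most (which is an impossible scenario, since no single pass executes every instruction), A would take 5100 cycles (51 * 100 iterations) and B would take 2025 cycles (81 * 25 iterations) to complete.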
Code:
LDA #$00 ; A = 0
STA $01 ; Mem(1) = A
STA $02 ; Mem(2) = A
LDA #$01 ; A = 1
STA $00 ; Mem(0) = A
LoopTop: ; LoopTop:
LDA $02 ; A = Mem(2)
CLC ; Carry = 0
ADC $00 ; A = A + Mem(0) + Carry
BCS CarrySet ; If Carry = 1 Then GoTo CarrySet
BCC CarryClear ; If Carry = 0 Then GoTo CarryClear
CarrySet: ; CarrySet:
CLC ; Carry = 0
LDY $01 ; Y = Mem(1)
INY ; Y = Y + 1
STY $01 ; Mem(1) = Y
CarryClear: ; CarryClear:
STA $02 ; Mem(2) = A
LDY $00 ; Y = Mem(0)
INY ; Y = Y + 1
STY $00 ; Mem(0) = Y
CPY #$65 ; Void = Y - 101
BNE LoopTop ; If Zero = 0 Then GoTo LoopTop
B) Remember, this version only has 1/4 as many iterations as A, so it's faster despite the fact that each iteration takes longer.
And as luck would have it, it doesn't use any more memory than the first example did, probably because of the inefficient memory method I used in the first place (I still haven't used the X register).
Code:
LDA #$00 ; A = 0
STA $01 ; Mem(1) = A
STA $02 ; Mem(2) = A
LDA #$01 ; A = 1
STA $00 ; Mem(0) = A
LoopTop: ; LoopTop:
LDA $00 ; A = Mem(0)
ASL A ; A = A * 2
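ASL A ; A = A * 2 (bit 7 shifts into Carry)
JSR CarryOne ; GoSub CarryOne (4 * I passes 255 once I reaches 64)
ADC #$06 ; A = A + 6 + Carry (Carry is 0 here)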
JSR CarryOne ; GoSub CarryOne
CLC ; Carry = 0
ADC $02 ; A = A + Mem(2) + Carry
BCS CarrySet ; If Carry = 1 Then GoTo CarrySet
BCC CarryClear ; If Carry = 0 Then GoTo CarryClear
CarryOne: ; CarryOne:
BCC NoCarry ; If Carry = 0 Then GoTo NoCarry
CLC ; Carry = 0
LDY $01 ; Y = Mem(1)
INY ; Y = Y + 1
STY $01 ; Mem(1) = Y
NoCarry: ; NoCarry:
RTS ; Return
CarrySet: ; CarrySet:
CLC ; Carry = 0
LDY $01 ; Y = Mem(1)
INY ; Y = Y + 1
STY $01 ; Mem(1) = Y
CarryClear: ; CarryClear:
STA $02 ; Mem(2) = A
LDA $00 ; A = Mem(0)
ADC #$04 ; A = A + 4 + Carry
STA $00 ; Mem(0) = A
CMP #$65 ; Void = A - 101
BNE LoopTop ; If Zero = 0 Then GoTo LoopTop
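Both listings keep X as two bytes and bump the high byte whenever an 8-bit add carries. Here is a small Python model of just that bookkeeping (not an emulator; `add8`, `sum_a`, and `sum_b` are names invented for this sketch). One thing it makes visible: 4 * I no longer fits in a byte once I reaches 64, so the bit shifted out has to end up in the high byte too.

```python
def add8(a, b, carry=0):
    """8-bit add with carry in; returns (result byte, carry out), like ADC."""
    total = a + b + carry
    return total & 0xFF, total >> 8


def sum_a():
    """Model of listing A: X += I for I = 1..100, X held as (high, low) bytes."""
    hi, lo = 0, 0
    for i in range(1, 101):
        lo, carry = add8(lo, i)
        hi = (hi + carry) & 0xFF  # carry out of the low byte bumps the high byte
    return hi, lo


def sum_b():
    """Model of listing B: X += 4*I + 6 for I = 1, 5, ..., 97."""
    hi, lo = 0, 0
    for i in range(1, 98, 4):
        term = i << 2                    # two left shifts; 4 * 97 = 388 needs 9 bits
        hi = (hi + (term >> 8)) & 0xFF   # bit shifted out of the byte goes to the high byte
        t, carry = add8(term & 0xFF, 6)  # + 6, which can carry as well
        hi = (hi + carry) & 0xFF
        lo, carry = add8(lo, t)          # add the term to the low byte of X
        hi = (hi + carry) & 0xFF
    return hi, lo


print([hex(b) for b in sum_a()])  # ['0x13', '0xba'], i.e. $13BA = 5050
print([hex(b) for b in sum_b()])  # ['0x13', '0xba']
```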
Example 2:
The variable X goes to memory byte 0; Z to byte 1; and Y to byte 2. ADC, LDA and STA with Zero Page addressing, as shown below, take 3 cycles each, LDA with Immediate addressing takes 2 cycles, and ASL with Accumulator addressing takes 2 cycles. So B is 7 cycles faster (13 versus 20), 2 instructions shorter, and happens to be 6 bytes shorter assembled than A. A assumes the Carry flag is clear on entry, and both assume the result fits in one byte.
A)
Code:
LDA #$00 ; A = 0
ADC $01 ; A = A + Mem(1)
ADC $01 ; A = A + Mem(1)
ADC $01 ; A = A + Mem(1)
ADC $01 ; A = A + Mem(1)
ADC $02 ; A = A + Mem(2)
STA $00 ; Mem(0) = A
B)
Code:
LDA $01 ; A = Mem(1)
ASL A ; A = A * 2
ASL A ; A = A * 2
ADC $02 ; A = A + Mem(2)
STA $00 ; Mem(0) = A
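To see that the two sequences land on the same byte, here is a rough Python model of an 8-bit accumulator (the helper names are invented, it assumes the Carry flag is clear on entry, and it ignores every other flag). It only checks inputs where 4 * Z + Y fits in one byte.

```python
def adc(a, operand, carry):
    """ADC: A = A + operand + Carry; returns (A, carry out)."""
    total = a + operand + carry
    return total & 0xFF, total >> 8


def asl(a):
    """ASL A: shift left one bit; bit 7 falls into Carry."""
    return (a << 1) & 0xFF, a >> 7


def version_a(z, y):
    a, c = 0, 0                # start from 0, Carry assumed clear
    for _ in range(4):
        a, c = adc(a, z, c)    # add Z four times
    a, c = adc(a, y, c)        # add Y
    return a


def version_b(z, y):
    a = z                      # load Z
    a, c = asl(a)              # A = Z * 2
    a, c = asl(a)              # A = Z * 4
    a, c = adc(a, y, c)        # add Y (Carry is 0 when Z < 64)
    return a


# The two agree whenever 4 * Z + Y fits in one byte.
for z in range(64):
    for y in range(256 - 4 * z):
        assert version_a(z, y) == version_b(z, y) == 4 * z + y

print(version_a(10, 3), version_b(10, 3))  # 43 43
```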