CMM2: Scrolling Performance, Left vs. Right
Nelno

Regular Member

Joined: 22/01/2021
Location: United States
Posts: 59
Posted: 08:19pm 06 Mar 2021

While working on a generic scrolling tile engine that could be used across a variety of games, I noticed a few things.

First, operations on pages that aren't in the STM local RAM are a fair bit slower. This was expected and has been mentioned before, including in Peter's graphics tutorial posts, but I just wanted to get that out of the way as a known and expected result.

However, what I'm seeing in my tests is a bit odd and I'm not sure that I have a good explanation for it.

The following table shows timings for PAGE SCROLL using 1, 2, 3 and 4 pixels at a time and using -1 with PAGE SCROLL to tell it to do nothing with the edge pixels.

I'm choosing to show 800x600x8 and 640x400x8 because page 1 in 800x600x8 is not in STM internal memory, while it is in 640x400x8 mode. I have done this for various other modes and recorded similar results, which is that scrolling left is some factor more expensive than scrolling right, independent of alignment, and scrolling left on a page not in STM internal memory is prohibitively expensive at any delta <> 4 bytes. This is irrelevant for some of the lower res modes, but it's a killer for the 640x480 and 800x600 modes.


---------------------------------------------
| MODE       | PAGE | DIR   | DELTA | AVG MS|
---------------------------------------------
| 800x600x8  | 1    | LEFT  | 1     | 46.1  |
|            | 1    | LEFT  | 2     | 31.0  |
|            | 1    | LEFT  | 3     | 46.1  |
|            | 1    | LEFT  | 4     | 8.4   |
|            | 1    | RIGHT | 1     | 5.2   |
|            | 1    | RIGHT | 2     | 5.7   |
|            | 1    | RIGHT | 3     | 5.6   |
|            | 1    | RIGHT | 4     | 5.6   |
| 800x600x8  | 0    | LEFT  | 1     | 6.6   |
|            | 0    | LEFT  | 2     | 6.4   |
|            | 0    | LEFT  | 3     | 6.6   |
|            | 0    | LEFT  | 4     | 1.7   |
|            | 0    | RIGHT | 1     | 2.0   |
|            | 0    | RIGHT | 2     | 2.0   |
|            | 0    | RIGHT | 3     | 2.0   |
|            | 0    | RIGHT | 4     | 2.0   |
---------------------------------------------
| 640x400x8  | 1    | LEFT  | 1     | 3.5   |
|            | 1    | LEFT  | 2     | 3.5   |
|            | 1    | LEFT  | 3     | 3.5   |
|            | 1    | LEFT  | 4     | 0.9   |
|            | 1    | RIGHT | 1     | 1.1   |
|            | 1    | RIGHT | 2     | 1.1   |
|            | 1    | RIGHT | 3     | 1.1   |
|            | 1    | RIGHT | 4     | 0.9   |
| 640x400x8  | 0    | LEFT  | 1     | 3.5   |
|            | 0    | LEFT  | 2     | 3.5   |
|            | 0    | LEFT  | 3     | 3.5   |
|            | 0    | LEFT  | 4     | 0.9   |
|            | 0    | RIGHT | 1     | 1.1   |
|            | 0    | RIGHT | 2     | 1.1   |
|            | 0    | RIGHT | 3     | 1.1   |
|            | 0    | RIGHT | 4     | 0.9   |
---------------------------------------------


These are averages over 32 scrolls in a particular direction, and a particular delta (1, 2, 3, or 4 pixels, i.e. bytes in 8 bit modes, which is the important distinction to be made).

In this context a "left" scroll means PAGE SCROLL is passed a negative delta and a "right" scroll means PAGE SCROLL is passed a positive delta.

A few things I (think I) know:
- In all cases, the MMBASIC code performs these operations line-by-line, copying the source into a line buffer and then writing that line back into video memory.
- Scrolling right and left uses essentially the same code path in MMBASIC, at least when it comes to the actual copying, with the only real difference being whether or not the offset is applied to the source or destination.
- Scrolling up/down and left/right are always independent operations, making a diagonal scroll around 2x more expensive -- but I'm only showing horizontal scrolls in this table.

For a left scroll:
- writes should always be 4-byte aligned.
- for a 4-pixel delta, reads are also 4-byte aligned.
- for any other delta, reads are at some offset of 1, 2 or 3 bytes.

For a right scroll:
- reads are always 4-byte aligned.
- writes are 4-byte aligned for a 4-pixel delta.
- for all other deltas, writes are at some offset of 1, 2 or 3 bytes

A lot of this makes perfect sense. Unaligned, single-byte reads/writes are slowest, and reads/writes that are multiples of 4 are fastest.
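The alignment rules listed above can be checked with a small C sketch. It is only an illustration, not firmware code: `scroll_alignment` and `scroll_align` are hypothetical names, and it assumes each row starts on a 4-byte boundary (true when the page base is aligned and the row width is a multiple of 4, as with 800 and 640 bytes per row in 8-bit modes).

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset (0..3) of an address from the previous 4-byte boundary. */
static unsigned misalign4(uintptr_t addr)
{
    return (unsigned)(addr & 3u);
}

/* Misalignment of the first read and first write of one row copy. */
typedef struct {
    unsigned read_off;
    unsigned write_off;
} scroll_align;

/* dx < 0 (left scroll): the source starts |dx| bytes into the row and the
 * destination is the row start.
 * dx > 0 (right scroll): the source is the row start and the destination
 * starts dx bytes into the row. */
static scroll_align scroll_alignment(uintptr_t row_base, int dx)
{
    scroll_align a;
    uintptr_t src = row_base;
    uintptr_t dst = row_base;

    assert((row_base & 3u) == 0); /* assumption: rows are 4-byte aligned */

    if (dx < 0)
        src += (uintptr_t)(-dx);
    else
        dst += (uintptr_t)dx;

    a.read_off = misalign4(src);
    a.write_off = misalign4(dst);
    return a;
}
```

For a row at a hypothetical address 0x1000, a delta of -1 gives a read offset of 1 and a write offset of 0, while a delta of 3 gives a read offset of 0 and a write offset of 3. Deltas of -4 and 4 leave both sides aligned.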

Scrolling right is absolutely doable at half vsync rate in these modes with tons of time left over for gameplay and graphics. It's even manageable at vsync rate, depending on what your game logic will be. But scrolling left is untenable, which is a bit unfortunate since that's the direction every side-scroller I can think of uses. I was so close to having a smooth-scrolling, any-directional tiling engine working at high resolutions (by scrolling the center and just updating edge tiles) until I hit this bump. I'm doing this all in BASIC for now, too, no CSUBs (because I hate messing with getting build pipelines working -- that sounds like my day job).

So, in particular, what I'm wondering is why left scrolling is so much slower than right scrolling. The main difference seems to be that when scrolling left it's the reads that are unaligned, and when scrolling right it's the writes that are unaligned. This makes me suspect that unaligned reads from non-internal memory are terribly, terribly slow compared to unaligned writes. It occurs to me there should be some ways to verify this assumption, especially using CSUBs, so I have another reason to get those working.

At the least I would expect this to amount to only a minor difference though, as in general it should be possible to decompose each operation into an initial unaligned read/write, a bunch of aligned read/writes, and another unaligned read/write, i.e. the vast majority of read/writes can be aligned. I believe the MMBASIC code is doing something along these lines. It definitely looks at the alignment and takes different paths, and calls a fast path copy in fully aligned cases.
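The head/body/tail decomposition described above might look roughly like the following C sketch. This is not the MMBASIC firmware's copy routine; `copy_words_forward` is a hypothetical name. One caveat it makes visible: when source and destination differ in alignment (any scroll delta that is not a multiple of 4), only one side can be word-aligned in the body. This sketch aligns the writes, so the reads in the body stay unaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Forward copy of n bytes, split into an unaligned head, a word-sized body
 * and a byte tail. Safe for overlapping buffers only when dst starts at or
 * before src (the left-scroll case) or entirely after the source range. */
static void copy_words_forward(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst <= src || dst >= src + n);

    /* Head: single bytes until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* Body: one 32-bit word per step. The fixed-size memcpy calls let the
     * compiler emit word loads/stores without breaking aliasing rules. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, src, 4);
        memcpy(dst, &w, 4);
        dst += 4;
        src += 4;
        n -= 4;
    }

    /* Tail: the remaining 0..3 bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

With this shape, nearly all writes are aligned word writes, but a left scroll by 1, 2 or 3 bytes still issues unaligned word reads for the whole body, which is where slow external memory would hurt most if the suspicion above is right.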

Apologies to Geoff and Peter if I got any details wrong, it's been a couple of weeks since I looked into this due to things like Texas freezing over + work, so I need to go back and refresh my memory on what's happening here in the MMBASIC code. Still, I'm interested to hear if anyone else has noticed this and has any theory on left vs. right scrolling speeds or if I'm just missing something obvious.

-Jonathan
Edited 2021-03-07 06:21 by Nelno
 
bar1010
Senior Member

Joined: 10/08/2020
Location: United States
Posts: 195
Posted: 10:53pm 06 Mar 2021

What are the results for 1024x768?
 
epsilon

Senior Member

Joined: 30/07/2020
Location: Belgium
Posts: 255
Posted: 08:21am 07 Mar 2021

  Nelno said  
So, in particular, the problem I'm wondering about is why is left scrolling so much slower than right scrolling? The main difference seems to be that in the case of scrolling left, it's the reads that are unaligned, and in the case of going right, it's the writes that are unaligned. This makes me suspect that unaligned reads are terribly, terribly slow from non-internal memory compared to unaligned writes. It occurs to me there should be some ways to verify this assumption, especially using CSUBS, so I have another reason to get those working.


I suspect that memory access latency is high and that the processor pipeline will quickly stall on outstanding read operations, because it needs those read results to move on. The write operations on the other hand get posted into a write buffer and the processor moves on, without waiting for the writes to complete.
That would mean the performance of a copy operation is limited by read performance, not write performance. And if that's the case, it would make sense that an unaligned read is more expensive than an unaligned write.
Edited 2021-03-07 18:23 by epsilon
Epsilon CMM2 projects
 
matherp
Guru

Joined: 11/12/2012
Location: United Kingdom
Posts: 8516
Posted: 09:42am 07 Mar 2021

I can't replicate the slow times - what version are you using and what type of CMM2 - board,cpu

Please give me the exact command you are using
 
Nelno

Regular Member

Joined: 22/01/2021
Location: United States
Posts: 59
Posted: 03:36am 09 Mar 2021

  matherp said  I can't replicate the slow times - what version are you using and what type of CMM2 - board,cpu

Please give me the exact command you are using


Oops. I meant to follow up with the exact code I was using after my first post. Sorry about that.

I'm running 5.07.00b14.

I'm using a Retromax from CircuitGizmos.

The text on the CPU is poorly printed and difficult to read, but I think it is STM32H743IIT6 (https://www.digikey.com/en/products/detail/stmicroelectronics/STM32H743IIT6/7915904)

The Winbond memory chip is W9864G6KH-6 (https://www.digikey.com/en/products/detail/winbond-electronics/W9864G6KH-6/4490112)

? mm.info$(cpuspeed)
480000000


option explicit

dim integer vsyncCount = 0
dim integer workPage = 0
dim integer displayPage = workPage
dim integer res = 1
dim integer bitdepth = 8

sub vblank
 inc vsyncCount,1
end sub

mode res,bitdepth,0,vblank

page write workPage
page display displayPage

const ITERATIONS% = 32

main

end

sub main
 local float t1
 scrollTest -1,0,"Press key to scroll left 1",t1

 local float t2
 scrollTest -2,0,"Press key to scroll left 2",t2

 local float t3
 scrollTest -3,0,"Press key to scroll left 3",t3

 local float t4
 scrollTest -4,0,"Press key to scroll left 4",t4

 local float t5
 scrollTest 1,0,"Press key to scroll right 1",t5

 local float t6
 scrollTest 2,0,"Press key to scroll right 2",t6

 local float t7
 scrollTest 3,0,"Press key to scroll right 3",t7

 local float t8
 scrollTest 4,0,"Press key to scroll right 4",t8

 mode 1,8,0
 cls

 ? "Times for mode " + str$(res) + "," + str$(bitdepth) + ", Page " + str$(workPage)  
 ? ""
 ? "Scroll Left 1 : " + str$(t1,3,3)
 ? "Scroll Left 2 : " + str$(t2,3,3)
 ? "Scroll Left 3 : " + str$(t3,3,3)
 ? "Scroll Left 4 : " + str$(t4,3,3)
 ? ""
 ? "Scroll Right 1: " + str$(t5,3,3)
 ? "Scroll Right 2: " + str$(t6,3,3)
 ? "Scroll Right 3: " + str$(t7,3,3)
 ? "Scroll Right 4: " + str$(t8,3,3)
 
 wait 0,fix(mm.vres/2),"Press any key to exit..."
end sub

sub wait x as integer, y as integer, message as string
 text x,y,message
 do while (inkey$ = "")
 loop
end sub

sub fillScreen
 text mm.hres - 12,0,chr$(254)
 text 0,mm.vres - 12,chr$(254)
 text mm.hres - 12,mm.vres - 12,chr$(254)
end sub

sub scroll dx as integer, dy as integer
 page scroll workPage,dx,dy,-1
end sub

sub scroll2 dx as integer, dy as integer
 local dm = dx mod 4
 if (workPage = 0 or dm = 0) then
   page scroll workPage,dx,dy,-1
 elseif (dx > 0) then
   blit 0,0,dx,dy,mm.hres - abs(dx),mm.vres - abs(dy),workPage
 else
   blit -dx,-dy,0,0,mm.hres - abs(dx),mm.vres - abs(dy),workPage
 endif
end sub

sub scrollTest dx as integer, dy as integer, msg as string, t as float
 cls

 fillScreen

 wait 0,0,msg

 t = 0.0    
 local float start
 local integer i
 local integer vsc
 for i = 1 to ITERATIONS%
   start = timer
   vsc = vsyncCount
   scroll dx,dy
'    page scroll workPage,dx,dy,-1
'    do while (vsc >= vsyncCount)
'    loop
   t = t + (timer - start)
 next i
 
 t = t/ITERATIONS%

end sub


I know that's more complex than is ideal but I was able to repro this with a program < 10 lines before I went all out and made this one for testing multiple modes and comparing blit times. Unfortunately I didn't save that version. I'll make a shorter one, verify I get the same thing, then post it here.

Note to test scrolling on page 1 vs 0 just change workPage at the top of the program.

-Jonathan
Edited 2021-03-09 13:49 by Nelno
 
Nelno

Regular Member

Joined: 22/01/2021
Location: United States
Posts: 59
Posted: 04:54am 09 Mar 2021

Here's the simplest program I could get to reproduce it.


page write 1
page display 0
timer = 0
page scroll 1,-1,0,-1
t = timer
page write 0
? t


outputs 36.374.

A slightly more complex program with write and display pages the same:


page write 1
page display 1
timer = 0
page scroll 1,-1,0,-1
t = timer
page write 0
page display 0
? t


outputs 44.447.

Note, however, those aren't setting the mode. What's even crazier is I had that first program written on a single line and I was executing it from immediate mode and getting ~36 ms, but now I'm seeing ~7.6 ms!

I'm not sure what is happening here. If I force mode 1,16 or mode 1,12 I see ~26 and ~53 ms times, neither of which match the 36 ms time.

I am sure I was getting 36 ms with that exact code (not setting mode), because I am simply using the up arrow to go back in my command history and run it again. In between seeing the ~36 ms and seeing the ~7.6 I did 'edit "scrollleft3.bas"' and typed the same thing into the editor and now I'm seeing the lower time.

I'm thoroughly confused right now. I'm 99.9% positive I saw the slow scrolling with the test code. I'm 100% sure I saw it in my tiling engine code in mode 1,8 but maybe it has some case where it can end up writing to the display page (not sure how that could possibly change when scrolling left vs. right since it's the same code with just the sign of the scroll delta changed).

I can only think of three things to explain this:
- my machine gets in a bad state where performance suffers, either rebooting or something else (maybe creating a new program, since I did do that in between when it was slow and it wasn't) resets it.
- I'm dumb.
- I'm crazy.

Note that it may be some combination of the above possibilities.

For now, don't waste any more time on this. I'll spend some more time trying to get a reliable repro or proving myself entirely wrong. I thought I had tested this very specifically and reproducibly, but something isn't making sense here.

I had my CMM2 running for a while without any reset before this, something that isn't true today (I unplugged it earlier to remove the case and get the chip IDs).
Edited 2021-03-09 15:12 by Nelno
 
Nelno

Regular Member

Joined: 22/01/2021
Location: United States
Posts: 59
Posted: 06:15am 09 Mar 2021

First, let me put up a new table that shows the difference between when the write and display pages are the same vs. different:


--------------------------------------------------------
| MODE       | WRITE | DISPLAY | DIR   | DELTA | AVG MS|
|            | PAGE  | PAGE    |       |       |       |
--------------------------------------------------------
| 800x600x8  | 1     | 1       | LEFT  | 1     | 44.5  |
|            | 1     | 1       | LEFT  | 2     | 30.0  |
|            | 1     | 1       | LEFT  | 3     | 44.4  |
|            | 1     | 1       | LEFT  | 4     | 8.2   |
|            | 1     | 1       | RIGHT | 1     | 8.8   |
|            | 1     | 1       | RIGHT | 2     | 8.8   |
|            | 1     | 1       | RIGHT | 3     | 8.8   |
|            | 1     | 1       | RIGHT | 4     | 8.6   |
--------------------------------------------------------
| 800x600x8  | 1     | 0       | LEFT  | 1     | 7.6   |
|            | 1     | 0       | LEFT  | 2     | 7.6   |
|            | 1     | 0       | LEFT  | 3     | 7.6   |
|            | 1     | 0       | LEFT  | 4     | 7.6   |
|            | 1     | 0       | RIGHT | 1     | 8.7   |
|            | 1     | 0       | RIGHT | 2     | 8.0   |
|            | 1     | 0       | RIGHT | 3     | 8.7   |
|            | 1     | 0       | RIGHT | 4     | 7.6   |
--------------------------------------------------------
| 800x600x8  | 0     | 0       | LEFT  | 1     | 6.4   |
|            | 0     | 0       | LEFT  | 2     | 6.4   |
|            | 0     | 0       | LEFT  | 3     | 6.4   |
|            | 0     | 0       | LEFT  | 4     | 1.6   |
|            | 0     | 0       | RIGHT | 1     | 2.0   |
|            | 0     | 0       | RIGHT | 2     | 1.9   |
|            | 0     | 0       | RIGHT | 3     | 1.9   |
|            | 0     | 0       | RIGHT | 4     | 1.6   |
--------------------------------------------------------
| 800x600x8  | 0     | 1       | LEFT  | 1     | 6.3   |
|            | 0     | 1       | LEFT  | 2     | 6.3   |
|            | 0     | 1       | LEFT  | 3     | 6.3   |
|            | 0     | 1       | LEFT  | 4     | 1.6   |
|            | 0     | 1       | RIGHT | 1     | 1.9   |
|            | 0     | 1       | RIGHT | 2     | 1.9   |
|            | 0     | 1       | RIGHT | 3     | 1.9   |
|            | 0     | 1       | RIGHT | 4     | 1.6   |
--------------------------------------------------------


Clearly the abysmal times are when using a page not in internal STM memory as both the write and display page. Seems pretty obvious that in this case the display scanout and the scrolling would be competing for memory bandwidth.

Fair enough, this was just something stupid I was doing in the test code, probably as an artifact of changing it so I could actually see the scrolling. I can't think of a good reason to use a write page other than 0 if you really want to write and display the same page.

This is the updated code that I used for the table above. The main changes are I added some page setting before and after the tests so that the text is always shown on a visible page:


option explicit

dim integer vsyncCount = 0
dim integer workPage = 1
dim integer displayPage = 0
dim integer res = 1
dim integer bitdepth = 8

sub vblank
 inc vsyncCount,1
end sub

mode res,bitdepth,0,vblank

page write workPage
page display displayPage

const ITERATIONS% = 32

main

end

sub main
 local float t1
 scrollTest -1,0,"Press key to scroll left 1",t1

 local float t2
 scrollTest -2,0,"Press key to scroll left 2",t2

 local float t3
 scrollTest -3,0,"Press key to scroll left 3",t3

 local float t4
 scrollTest -4,0,"Press key to scroll left 4",t4

 local float t5
 scrollTest 1,0,"Press key to scroll right 1",t5

 local float t6
 scrollTest 2,0,"Press key to scroll right 2",t6

 local float t7
 scrollTest 3,0,"Press key to scroll right 3",t7

 local float t8
 scrollTest 4,0,"Press key to scroll right 4",t8

 page display 0
 page write 0
 mode 1,8,0
 cls

 ? "Times for mode " + str$(res) + "," + str$(bitdepth)  
 ? "Write Page: " + str$(workPage) + " Display Page: " + str$(displayPage)
 ? ""
 ? "Scroll Left 1 : " + str$(t1,3,3)
 ? "Scroll Left 2 : " + str$(t2,3,3)
 ? "Scroll Left 3 : " + str$(t3,3,3)
 ? "Scroll Left 4 : " + str$(t4,3,3)
 ? ""
 ? "Scroll Right 1: " + str$(t5,3,3)
 ? "Scroll Right 2: " + str$(t6,3,3)
 ? "Scroll Right 3: " + str$(t7,3,3)
 ? "Scroll Right 4: " + str$(t8,3,3)
 
 wait 0,fix(mm.vres/2),"Press any key to exit..."
end sub

sub wait x as integer, y as integer, message as string
 page write displayPage
 text x,y,message
 page write workPage
 text x,y,message
 do while (inkey$ = "")
 loop
end sub

sub fillScreen
 text mm.hres - 12,0,chr$(254)
 text 0,mm.vres - 12,chr$(254)
 text mm.hres - 12,mm.vres - 12,chr$(254)
end sub

sub scroll dx as integer, dy as integer
 page scroll workPage,dx,dy,-1
end sub

sub scroll2 dx as integer, dy as integer
 local dm = dx mod 4
 if (workPage = 0 or dm = 0) then
   page scroll workPage,dx,dy,-1
 elseif (dx > 0) then
   blit 0,0,dx,dy,mm.hres - abs(dx),mm.vres - abs(dy),workPage
 else
   blit -dx,-dy,0,0,mm.hres - abs(dx),mm.vres - abs(dy),workPage
 endif
end sub

sub scrollTest dx as integer, dy as integer, msg as string, t as float
 cls

 fillScreen

 wait 0,0,msg

 t = 0.0    
 local float start
 local integer i
 local integer vsc
 for i = 1 to ITERATIONS%
   start = timer
   vsc = vsyncCount
   scroll dx,dy
'    page scroll workPage,dx,dy,-1
'    do while (vsc >= vsyncCount)
'    loop
   t = t + (timer - start)
   if (workPage <> displayPage) then
     page copy workPage to displayPage
   endif
 next i
 
 t = t/ITERATIONS%

end sub



In my tiling engine, however, I'm setting mode 1,8 and I'm flipping between page 0 and page 1 as display and write pages. I've got code in there, right at the point where I'm calling PAGE SCROLL, to confirm that I'm never writing to the same page I'm displaying, so that seems fine.

I'm using an X delta of 1 or -1 and a Y delta of 0 so my call looks like:

PAGE SCROLL workPage,-1,0,-1

I'm also timing only the PAGE SCROLL call, for each page independently.

For the left scroll I see (durations in ms; the number is the write page):

scroll page 0 duration: 6.3
scroll page 1 duration: 24.7

and for right scrolling I see:

scroll page 0 duration: 1.9
scroll page 1 duration: 7.2

I'll work on trying to distill that specific behavior into a simpler test (which I thought I had done already, but, apparently not -- I think I messed that up at some point by setting the write and display pages to the same thing). I'm still confident that I saw this outside of my tiling engine with a simpler test, but unfortunately I didn't save that version and for all I know I did the same thing with display and write pages in that test.

My first guess is that in my tiling engine I have some other bandwidth usage going on, but I don't see what that would be. I'm not doing anything in a callback / timer other than swapping the display pages and incrementing a frame count. I'm not using PAGE COPY anywhere and I'm not playing any sounds, but that PAGE SCROLL call is 3x worse when called on page 1 with a negative delta. That doesn't match the time I see with write page = display page, either, so it seems to be something else going on.

I'll post again if I find it. For now, let's assume programmer (me) error.
 
matherp
Guru

Joined: 11/12/2012
Location: United Kingdom
Posts: 8516
Posted: 08:21am 09 Mar 2021

Thanks for the diagnostics, I can now reproduce.

Please could you try the attached for both speed and functionality in both 8 and 16 bit modes, and let me know if it improves the speed for you and still works properly.


CMM2V1.5.zip
 
Nelno

Regular Member

Joined: 22/01/2021
Location: United States
Posts: 59
Posted: 08:06pm 09 Mar 2021

  matherp said  Thanks for the diagnostics I can now reproduce.

Please could you try the attached for both speed and functionality in both 8 and 16 bit modes and let me know if it improves the speed for you and also still works properly


CMM2V1.5.zip


Just tried this firmware and it fixes the problem, both in my test program and my tiling engine.

To be clear, in the test programs I was seeing the slowdown when the write and display pages were the same. In my tiling engine I was seeing a lesser, but still significant, slowdown when using different display and write pages. Both are fixed.

In my test app, left and right are essentially the same timings now (~8 ms), whether or not the display page is the same as the write page.

In my tiling engine left scrolling is a bit faster in 8 bit mode (~6 ms vs ~8 ms) now.

Those are all for 800x600 8-bit modes with a 1 pixel delta.

For 16 bit, it looks like all of the times are right about 20 - 22 ms, no matter the scroll delta used.

After running through a bunch of different modes it would be difficult to summarize the timings without another table, so I'll post a comprehensive one later for reference.

Suffice to say, it's all greatly improved. Thanks!

Also, just for my learning and to satisfy my curiosity, what was the problem? I spent some time looking at the MMBASIC code and didn't see significant differences between what a right or left scroll would do.

Also, small request for your backlog, but having horizontal + vertical scrolling as a single copy operation at some point would be great. I think the times are fast enough that I can still do diagonal scrolling (should be about 16 ms to scroll diagonally, leaving 14 ms for gameplay and sprites), but saving another 8 ms there would make it easier to hold a consistent framerate and use that other 8 ms for gameplay.
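As a rough illustration of that request, a combined horizontal + vertical scroll can be done in one pass over the rows. The C sketch below is hypothetical and not taken from the firmware: `scroll_2d` is an invented name, it works on a plain byte buffer with one byte per pixel, and its sign convention (positive `dx` moves content right, positive `dy` moves content down) is this sketch's own choice, not necessarily PAGE SCROLL's. Edge pixels are left untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scroll an 8-bit page in place by (dx, dy) with one row copy per line. */
static void scroll_2d(uint8_t *page, int width, int height, int dx, int dy)
{
    int adx = dx < 0 ? -dx : dx;
    int ady = dy < 0 ? -dy : dy;

    assert(width > 0 && height > 0);
    if (adx >= width || ady >= height)
        return; /* nothing would remain on screen */

    int n = width - adx;         /* bytes copied per row */
    int src_x = dx < 0 ? adx : 0;
    int dst_x = dx > 0 ? adx : 0;
    int rows = height - ady;     /* rows that stay on screen */

    for (int i = 0; i < rows; i++) {
        /* Moving content down: walk bottom-up so every source row is read
         * before it is overwritten. Otherwise walk top-down. */
        int d = dy > 0 ? height - 1 - i : i;
        int s = d - dy;

        /* memmove also covers dy == 0, where source and destination
         * overlap within the same row. */
        memmove(page + (size_t)d * (size_t)width + (size_t)dst_x,
                page + (size_t)s * (size_t)width + (size_t)src_x,
                (size_t)n);
    }
}
```

Each pixel is read and written once, instead of once for the horizontal pass and again for the vertical pass, which is where the hoped-for saving would come from.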

Thanks again for the quick turn-around! I know how difficult it can be to keep on top of fixes since I basically write and maintain a runtime for a living. Your work and dedication is very much appreciated!

-Jonathan
Edited 2021-03-10 06:07 by Nelno
 
matherp
Guru

Joined: 11/12/2012
Location: United Kingdom
Posts: 8516
Posted: 10:46pm 09 Mar 2021

  Quote  Also, just for my learning and to satisfy my curiosity, what was the problem?


Old code:
} else if(pixels<0){
    pixels=-pixels;
    for(y=0;y<lmaxH;y++){
        d=(uint8_t *)((y * maxW) + wpa);
        s=(uint8_t *)((y * maxW) + wpa);
        s+=pixels;
        mycopy(d,s,maxW-pixels);
    }
}


New code:
} else if(pixels<0){
    pixels=-pixels;
    for(y=0;y<lmaxH;y++){
        d=(uint8_t *)((y * maxW) + wpa);
        s=(uint8_t *)((y * maxW) + wpa);
        mycopy(linebuff,s,maxW);
        mycopy(d,linebuff+pixels,maxW-pixels);
    }
}


linebuff is in the fastest tightly coupled memory with guaranteed 1 cycle access
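For anyone who wants to try the idea outside the firmware, here is a self-contained C sketch of the same two-step copy. It is an approximation under stated assumptions: `memcpy` stands in for the firmware's `mycopy`, `MAX_LINE` and `scroll_left_buffered` are hypothetical names, and on a desktop machine `linebuff` is ordinary RAM, so this only demonstrates the data movement and not the speed-up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LINE 1024 /* hypothetical upper bound on row width in bytes */

/* Stands in for the firmware's line buffer in tightly coupled memory. */
static uint8_t linebuff[MAX_LINE];

/* Scroll an 8-bit page left by `pixels`, leaving the right edge untouched. */
static void scroll_left_buffered(uint8_t *page, size_t width, size_t height,
                                 size_t pixels)
{
    assert(width <= MAX_LINE && pixels <= width);

    for (size_t y = 0; y < height; y++) {
        uint8_t *row = page + y * width;

        /* Step 1: read the whole row from its (aligned) start into the
         * line buffer. */
        memcpy(linebuff, row, width);

        /* Step 2: write back to the (aligned) row start. The unaligned
         * access now only touches the line buffer. */
        memcpy(row, linebuff + pixels, width - pixels);
    }
}
```

Compared with the old code, the slow page memory sees only accesses that begin at the row start. The offset of 1, 2 or 3 bytes is applied when reading from the line buffer instead.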
Edited 2021-03-10 08:47 by matherp
 
Nelno

Regular Member

Joined: 22/01/2021
Location: United States
Posts: 59
Posted: 01:11am 10 Mar 2021

  matherp said  
linebuff is in the fastest tightly coupled memory with guaranteed 1 cycle access


Ah... that makes perfect sense now. Thanks!
 