Forum Index : Microcontroller and PC projects : CMM2: Scrolling Performance, Left vs. Right
Nelno Regular Member Joined: 22/01/2021 Location: United StatesPosts: 59 |
While working on a generic scrolling tile engine that could be used across a variety of games, I noticed a few things.

First, operations on pages that aren't in the STM local RAM are a fair bit slower. This was expected and has been mentioned before, including in Peter's graphics tutorial posts, but I just wanted to get that out of the way as a known and expected result.

However, what I'm seeing in my tests is a bit odd and I'm not sure that I have a good explanation for it. The following table shows timings for PAGE SCROLL using 1, 2, 3 and 4 pixels at a time and using -1 with PAGE SCROLL to tell it to do nothing with the edge pixels. I'm choosing to show 800x600x8 and 640x400x8 because page 1 in 800x600x8 is not in STM internal memory, while it is in 640x400x8 mode.

I have done this for various other modes and recorded similar results, which is that scrolling left is some factor more expensive than scrolling right, independent of alignment, and scrolling left on a page not in STM internal memory is prohibitively expensive at any delta <> 4 bytes. This is irrelevant for some of the lower res modes, but it's a killer for the 640x480 and 800x600 modes.
---------------------------------------------
| MODE      | PAGE | DIR   | DELTA | AVG MS |
---------------------------------------------
| 800x600x8 | 1    | LEFT  | 1     | 46.1   |
|           | 1    | LEFT  | 2     | 31.0   |
|           | 1    | LEFT  | 3     | 46.1   |
|           | 1    | LEFT  | 4     | 8.4    |
|           | 1    | RIGHT | 1     | 5.2    |
|           | 1    | RIGHT | 2     | 5.7    |
|           | 1    | RIGHT | 3     | 5.6    |
|           | 1    | RIGHT | 4     | 5.6    |
| 800x600x8 | 0    | LEFT  | 1     | 6.6    |
|           | 0    | LEFT  | 2     | 6.4    |
|           | 0    | LEFT  | 3     | 6.6    |
|           | 0    | LEFT  | 4     | 1.7    |
|           | 0    | RIGHT | 1     | 2.0    |
|           | 0    | RIGHT | 2     | 2.0    |
|           | 0    | RIGHT | 3     | 2.0    |
|           | 0    | RIGHT | 4     | 2.0    |
---------------------------------------------
| 640x400x8 | 1    | LEFT  | 1     | 3.5    |
|           | 1    | LEFT  | 2     | 3.5    |
|           | 1    | LEFT  | 3     | 3.5    |
|           | 1    | LEFT  | 4     | 0.9    |
|           | 1    | RIGHT | 1     | 1.1    |
|           | 1    | RIGHT | 2     | 1.1    |
|           | 1    | RIGHT | 3     | 1.1    |
|           | 1    | RIGHT | 4     | 0.9    |
| 640x400x8 | 0    | LEFT  | 1     | 3.5    |
|           | 0    | LEFT  | 2     | 3.5    |
|           | 0    | LEFT  | 3     | 3.5    |
|           | 0    | LEFT  | 4     | 0.9    |
|           | 0    | RIGHT | 1     | 1.1    |
|           | 0    | RIGHT | 2     | 1.1    |
|           | 0    | RIGHT | 3     | 1.1    |
|           | 0    | RIGHT | 4     | 0.9    |
---------------------------------------------

These are averages in milliseconds over 32 scrolls in a particular direction with a particular delta (1, 2, 3, or 4 pixels, i.e. bytes in 8-bit modes, which is the important distinction to be made). In this context a "left" scroll means PAGE SCROLL is passed a negative delta and a "right" scroll means PAGE SCROLL is passed a positive delta.

A few things I (think I) know:
- In all cases, the MMBASIC code performs these operations line by line, copying the source into a line buffer, then writing that line back into the video memory.
- Scrolling right and left use essentially the same code path in MMBASIC, at least when it comes to the actual copying, with the only real difference being whether the offset is applied to the source or the destination.
- Scrolling up/down and left/right are always independent operations, making a diagonal scroll around 2x more expensive -- but I'm only showing horizontal scrolls in this table.

For a left scroll:
- writes should always be 4-byte aligned.
- for a 4-pixel delta, reads are also 4-byte aligned.
- for any other delta, reads are at some offset of 1, 2 or 3 bytes.

For a right scroll:
- reads are always 4-byte aligned.
- writes are 4-byte aligned for a 4-pixel delta.
- for all other deltas, writes are at some offset of 1, 2 or 3 bytes.

A lot of this makes perfect sense. Unaligned, single-byte reads/writes are slowest, and reads/writes that are multiples of 4 are fastest. Scrolling right is absolutely doable at half vsync rate in these modes with tons of time left over for gameplay and graphics. It's even manageable at vsync rate, depending on what your game logic will be. But scrolling left is untenable, which is a bit unfortunate since that's the direction every side-scroller I can think of uses.

I was so close to having a smooth-scrolling, any-directional tiling engine working at high resolutions (by scrolling the center and just updating edge tiles) until I hit this bump. I'm doing this all in BASIC for now, too, no CSUBs (because I hate messing with getting build pipelines working -- that sounds like my day job).

So, in particular, the problem I'm wondering about is: why is left scrolling so much slower than right scrolling? The main difference seems to be that when scrolling left it's the reads that are unaligned, and when scrolling right it's the writes that are unaligned. This makes me suspect that unaligned reads from non-internal memory are terribly, terribly slow compared to unaligned writes. It occurs to me there should be some ways to verify this assumption, especially using CSUBs, so I have another reason to get those working.
At the least I would expect this to amount to only a minor difference though, as in general it should be possible to decompose each operation into an initial unaligned read/write, a bunch of aligned read/writes, and another unaligned read/write, i.e. the vast majority of read/writes can be aligned. I believe the MMBASIC code is doing something along these lines. It definitely looks at the alignment and takes different paths, and calls a fast path copy in fully aligned cases.

Apologies to Geoff and Peter if I got any details wrong, it's been a couple of weeks since I looked into this due to things like Texas freezing over + work, so I need to go back and refresh my memory on what's happening here in the MMBASIC code.

Still, I'm interested to hear if anyone else has noticed this and has any theory on left vs. right scrolling speeds or if I'm just missing something obvious.

-Jonathan

Edited 2021-03-07 06:21 by Nelno |
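The head / aligned body / tail decomposition described in the post above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `copy_dst_aligned` is a hypothetical helper, not the firmware's `mycopy`, and it aligns the destination only, so the source may still sit at an offset of 1, 2 or 3 bytes, which is exactly the left-scroll case discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the firmware's mycopy): copy n bytes between
 * non-overlapping buffers as an unaligned head, a body of 4-byte moves
 * that start on a 4-byte destination boundary, and a short tail. */
static void copy_dst_aligned(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    /* Head: single-byte copies until dst reaches a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* Body: 4 bytes at a time. memcpy with a fixed size of 4 is used so
     * the code stays valid C even when src is misaligned; compilers
     * typically turn each call into a single word load or store. */
    while (n >= 4) {
        uint32_t word;
        memcpy(&word, src, 4);
        memcpy(dst, &word, 4);
        dst += 4;
        src += 4;
        n -= 4;
    }

    /* Tail: the remaining 0 to 3 bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

With a copy shaped like this, only the head and tail are byte-sized, so any remaining left/right difference would have to come from the misaligned word reads in the body.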
||||
bar1010 Senior Member Joined: 10/08/2020 Location: United StatesPosts: 195 |
What are the results for 1024x768? |
||||
epsilon Senior Member Joined: 30/07/2020 Location: BelgiumPosts: 255 |
I suspect that memory access latency is high and that the processor pipeline will quickly stall on outstanding read operations, because it needs those read results to move on. The write operations on the other hand get posted into a write buffer and the processor moves on, without waiting for the writes to complete. That would mean the performance of a copy operation is limited by read performance, not write performance. And if that's the case, it would make sense that an unaligned read is more expensive than an unaligned write. Edited 2021-03-07 18:23 by epsilon Epsilon CMM2 projects |
||||
matherp Guru Joined: 11/12/2012 Location: United KingdomPosts: 8516 |
I can't replicate the slow times. What version are you using, and what type of CMM2 (board, CPU)? Please give me the exact command you are using. |
||||
Nelno Regular Member Joined: 22/01/2021 Location: United StatesPosts: 59 |
Oops. I meant to follow up with the exact code I was using after my first post. Sorry about that.

I'm running 5.07.00b14. I'm using a Retromax from CircuitGizmos. The text on the CPU is printed poorly and difficult to read but I think it is STM32H743IIT6 (https://www.digikey.com/en/products/detail/stmicroelectronics/STM32H743IIT6/7915904)

The Winbond memory chip is W9864G6KH-6 (https://www.digikey.com/en/products/detail/winbond-electronics/W9864G6KH-6/4490112)

? mm.info$(cpuspeed)
480000000

option explicit

dim integer vsyncCount = 0
dim integer workPage = 0
dim integer displayPage = workPage
dim integer res = 1
dim integer bitdepth = 8

' vsync interrupt: counts frames
sub vblank
  inc vsyncCount,1
end sub

mode res,bitdepth,0,vblank
page write workPage
page display displayPage

const ITERATIONS% = 32

main
end

sub main
  local float t1
  scrollTest -1,0,"Press key to scroll left 1",t1
  local float t2
  scrollTest -2,0,"Press key to scroll left 2",t2
  local float t3
  scrollTest -3,0,"Press key to scroll left 3",t3
  local float t4
  scrollTest -4,0,"Press key to scroll left 4",t4
  local float t5
  scrollTest 1,0,"Press key to scroll right 1",t5
  local float t6
  scrollTest 2,0,"Press key to scroll right 2",t6
  local float t7
  scrollTest 3,0,"Press key to scroll right 3",t7
  local float t8
  scrollTest 4,0,"Press key to scroll right 4",t8

  mode 1,8,0
  cls
  ? "Times for mode " + str$(res) + "," + str$(bitdepth) + ", Page " + str$(workPage)
  ? ""
  ? "Scroll Left 1 : " + str$(t1,3,3)
  ? "Scroll Left 2 : " + str$(t2,3,3)
  ? "Scroll Left 3 : " + str$(t3,3,3)
  ? "Scroll Left 4 : " + str$(t4,3,3)
  ? ""
  ? "Scroll Right 1: " + str$(t5,3,3)
  ? "Scroll Right 2: " + str$(t6,3,3)
  ? "Scroll Right 3: " + str$(t7,3,3)
  ? "Scroll Right 4: " + str$(t8,3,3)
  wait 0,fix(mm.vres/2),"Press any key to exit..."
end sub

' show a message and block until a key is pressed
sub wait x as integer, y as integer, message as string
  text x,y,message
  do while (inkey$ = "")
  loop
end sub

' put a marker in three corners so the scrolling is visible
sub fillScreen
  text mm.hres - 12,0,chr$(254)
  text 0,mm.vres - 12,chr$(254)
  text mm.hres - 12,mm.vres - 12,chr$(254)
end sub

sub scroll dx as integer, dy as integer
  page scroll workPage,dx,dy,-1
end sub

' alternative using blit for comparison; not called by the test below
sub scroll2 dx as integer, dy as integer
  local dm = dx mod 4
  if (workPage = 0 or dm = 0) then
    page scroll workPage,dx,dy,-1
  elseif (dx > 0) then
    blit 0,0,dx,dy,mm.hres - abs(dx),mm.vres - abs(dy),workPage
  else
    blit -dx,-dy,0,0,mm.hres - abs(dx),mm.vres - abs(dy),workPage
  endif
end sub

' returns the average time of ITERATIONS% scrolls in t
sub scrollTest dx as integer, dy as integer, msg as string, t as float
  cls
  fillScreen
  wait 0,0,msg
  t = 0.0
  local float start
  local integer i
  local integer vsc
  for i = 1 to ITERATIONS%
    start = timer
    vsc = vsyncCount
    scroll dx,dy
    ' page scroll workPage,dx,dy,-1
    ' do while (vsc >= vsyncCount)
    ' loop
    t = t + (timer - start)
  next i
  t = t/ITERATIONS%
end sub

I know that's more complex than is ideal, but I was able to repro this with a program < 10 lines before I went all out and made this one for testing multiple modes and comparing blit times. Unfortunately I didn't save that version. I'll make a shorter one, verify I get the same thing, then post it here.

Note: to test scrolling on page 1 vs 0, just change workPage at the top of the program.

-Jonathan

Edited 2021-03-09 13:49 by Nelno |
||||
Nelno Regular Member Joined: 22/01/2021 Location: United StatesPosts: 59 |
Here's the simplest program I could get to reproduce it.

page write 1
page display 0
timer = 0
page scroll 1,-1,0,-1
t = timer
page write 0
? t

outputs 36.374.

A slightly more complex program with write and display pages the same:

page write 1
page display 1
timer = 0
page scroll 1,-1,0,-1
t = timer
page write 0
page display 0
? t

outputs 44.447.

Note, however, those aren't setting the mode. What's even crazier is I had that first program written on a single line and I was executing it from immediate mode and getting ~36 ms, but now I'm seeing ~7.6 ms! I'm not sure what is happening here. If I force mode 1,16 or mode 1,12 I see ~26 and ~53 ms times, neither of which match the 36 ms time.

I am sure I was getting 36 ms with that exact code (not setting mode), because I am simply using the up arrow to go back in my command history and run it again. In between seeing the ~36 ms and seeing the ~7.6 ms I did 'edit "scrollleft3.bas"' and typed the same thing into the editor, and now I'm seeing the lower time.

I'm thoroughly confused right now. I'm 99.9% positive I saw the slow scrolling with the test code. I'm 100% sure I saw it in my tiling engine code in mode 1,8, but maybe it has some case where it can end up writing to the display page (not sure how that could possibly change when scrolling left vs. right, since it's the same code with just the sign of the scroll delta changed).

I can only think of three things to explain this:
- my machine gets into a bad state where performance suffers, and either rebooting or something else (maybe creating a new program, since I did do that in between when it was slow and when it wasn't) resets it.
- I'm dumb.
- I'm crazy.

Note that it may be some combination of the above possibilities.

For now, don't waste any more time on this. I'll spend some more time trying to get a reliable repro or proving myself entirely wrong. I thought I had tested this very specifically and reproducibly, but something isn't making sense here.
I had my CMM2 running for a while without any reset before this, something that isn't true today (I unplugged it earlier to remove the case and get the chip IDs). Edited 2021-03-09 15:12 by Nelno |
||||
Nelno Regular Member Joined: 22/01/2021 Location: United StatesPosts: 59 |
First, let me put up a new table that shows the difference between when the write and display pages are the same vs. different:

--------------------------------------------------------
| MODE      | WRITE | DISPLAY | DIR   | DELTA | AVG MS |
|           | PAGE  | PAGE    |       |       |        |
--------------------------------------------------------
| 800x600x8 | 1     | 1       | LEFT  | 1     | 44.5   |
|           | 1     | 1       | LEFT  | 2     | 30.0   |
|           | 1     | 1       | LEFT  | 3     | 44.4   |
|           | 1     | 1       | LEFT  | 4     | 8.2    |
|           | 1     | 1       | RIGHT | 1     | 8.8    |
|           | 1     | 1       | RIGHT | 2     | 8.8    |
|           | 1     | 1       | RIGHT | 3     | 8.8    |
|           | 1     | 1       | RIGHT | 4     | 8.6    |
--------------------------------------------------------
| 800x600x8 | 1     | 0       | LEFT  | 1     | 7.6    |
|           | 1     | 0       | LEFT  | 2     | 7.6    |
|           | 1     | 0       | LEFT  | 3     | 7.6    |
|           | 1     | 0       | LEFT  | 4     | 7.6    |
|           | 1     | 0       | RIGHT | 1     | 8.7    |
|           | 1     | 0       | RIGHT | 2     | 8.0    |
|           | 1     | 0       | RIGHT | 3     | 8.7    |
|           | 1     | 0       | RIGHT | 4     | 7.6    |
--------------------------------------------------------
| 800x600x8 | 0     | 0       | LEFT  | 1     | 6.4    |
|           | 0     | 0       | LEFT  | 2     | 6.4    |
|           | 0     | 0       | LEFT  | 3     | 6.4    |
|           | 0     | 0       | LEFT  | 4     | 1.6    |
|           | 0     | 0       | RIGHT | 1     | 2.0    |
|           | 0     | 0       | RIGHT | 2     | 1.9    |
|           | 0     | 0       | RIGHT | 3     | 1.9    |
|           | 0     | 0       | RIGHT | 4     | 1.6    |
--------------------------------------------------------
| 800x600x8 | 0     | 1       | LEFT  | 1     | 6.3    |
|           | 0     | 1       | LEFT  | 2     | 6.3    |
|           | 0     | 1       | LEFT  | 3     | 6.3    |
|           | 0     | 1       | LEFT  | 4     | 1.6    |
|           | 0     | 1       | RIGHT | 1     | 1.9    |
|           | 0     | 1       | RIGHT | 2     | 1.9    |
|           | 0     | 1       | RIGHT | 3     | 1.9    |
|           | 0     | 1       | RIGHT | 4     | 1.6    |
--------------------------------------------------------

Clearly the abysmal times occur when a page not in internal STM memory is used as both the write page and the display page. It seems pretty obvious that in this case the display scanout and the scrolling would be competing for memory bandwidth. Fair enough, this was just something stupid I was doing in the test code, probably as an artifact of changing it so I could actually see the scrolling.
I can't think of a good reason to use a write page other than 0 if you really want to write and display the same page.

This is the updated code that I used for the table above. The main change is that I added some page setting before and after the tests so that the text is always shown on a visible page:

option explicit

dim integer vsyncCount = 0
dim integer workPage = 1
dim integer displayPage = 0
dim integer res = 1
dim integer bitdepth = 8

' vsync interrupt: counts frames
sub vblank
  inc vsyncCount,1
end sub

mode res,bitdepth,0,vblank
page write workPage
page display displayPage

const ITERATIONS% = 32

main
end

sub main
  local float t1
  scrollTest -1,0,"Press key to scroll left 1",t1
  local float t2
  scrollTest -2,0,"Press key to scroll left 2",t2
  local float t3
  scrollTest -3,0,"Press key to scroll left 3",t3
  local float t4
  scrollTest -4,0,"Press key to scroll left 4",t4
  local float t5
  scrollTest 1,0,"Press key to scroll right 1",t5
  local float t6
  scrollTest 2,0,"Press key to scroll right 2",t6
  local float t7
  scrollTest 3,0,"Press key to scroll right 3",t7
  local float t8
  scrollTest 4,0,"Press key to scroll right 4",t8

  ' report the results on page 0 so they are always visible
  page display 0
  page write 0
  mode 1,8,0
  cls
  ? "Times for mode " + str$(res) + "," + str$(bitdepth)
  ? "Write Page: " + str$(workPage) + " Display Page: " + str$(displayPage)
  ? ""
  ? "Scroll Left 1 : " + str$(t1,3,3)
  ? "Scroll Left 2 : " + str$(t2,3,3)
  ? "Scroll Left 3 : " + str$(t3,3,3)
  ? "Scroll Left 4 : " + str$(t4,3,3)
  ? ""
  ? "Scroll Right 1: " + str$(t5,3,3)
  ? "Scroll Right 2: " + str$(t6,3,3)
  ? "Scroll Right 3: " + str$(t7,3,3)
  ? "Scroll Right 4: " + str$(t8,3,3)
  wait 0,fix(mm.vres/2),"Press any key to exit..."
end sub

' draw the message on both pages, then block until a key is pressed
sub wait x as integer, y as integer, message as string
  page write displayPage
  text x,y,message
  page write workPage
  text x,y,message
  do while (inkey$ = "")
  loop
end sub

' put a marker in three corners so the scrolling is visible
sub fillScreen
  text mm.hres - 12,0,chr$(254)
  text 0,mm.vres - 12,chr$(254)
  text mm.hres - 12,mm.vres - 12,chr$(254)
end sub

sub scroll dx as integer, dy as integer
  page scroll workPage,dx,dy,-1
end sub

' alternative using blit for comparison; not called by the test below
sub scroll2 dx as integer, dy as integer
  local dm = dx mod 4
  if (workPage = 0 or dm = 0) then
    page scroll workPage,dx,dy,-1
  elseif (dx > 0) then
    blit 0,0,dx,dy,mm.hres - abs(dx),mm.vres - abs(dy),workPage
  else
    blit -dx,-dy,0,0,mm.hres - abs(dx),mm.vres - abs(dy),workPage
  endif
end sub

' returns the average time of ITERATIONS% scrolls in t
sub scrollTest dx as integer, dy as integer, msg as string, t as float
  cls
  fillScreen
  wait 0,0,msg
  t = 0.0
  local float start
  local integer i
  local integer vsc
  for i = 1 to ITERATIONS%
    start = timer
    vsc = vsyncCount
    scroll dx,dy
    ' page scroll workPage,dx,dy,-1
    ' do while (vsc >= vsyncCount)
    ' loop
    t = t + (timer - start)
    ' the page copy is outside the timed section
    if (workPage <> displayPage) then
      page copy workPage to displayPage
    endif
  next i
  t = t/ITERATIONS%
end sub

In my tiling engine, however, I'm setting mode 1,8 and I'm flipping between page 0 and page 1 as display and write pages. I've got code in there, right at the point where I'm calling PAGE SCROLL, to confirm that I'm never writing to the same page I'm displaying, so that seems fine. I'm using an X delta of 1 or -1 and a Y delta of 0, so my call looks like:

PAGE SCROLL workPage,-1,0,-1

I'm also timing only the PAGE SCROLL call, and timing it for each page independently.
For the left scroll I see (scrd means "scroll duration" and the number is the write page):

scroll page 0 duration: 6.3
scroll page 1 duration: 24.7

and for right scrolling I see:

scroll page 0 duration: 1.9
scroll page 1 duration: 7.2

I'll work on trying to distill that specific behavior into a simpler test (which I thought I had done already, but apparently not -- I think I messed that up at some point by setting the write and display pages to the same thing). I'm still confident that I saw this outside of my tiling engine with a simpler test, but unfortunately I didn't save that version, and for all I know I did the same thing with display and write pages in that test.

My first guess is that in my tiling engine I have some other bandwidth usage going on, but I don't see what that would be. I'm not doing anything in a callback / timer other than swapping the display pages and incrementing a frame count. I'm not using PAGE COPY anywhere and I'm not playing any sounds, but that PAGE SCROLL call is more than 3x slower when called on page 1 with a negative delta (24.7 ms vs. 7.2 ms). That doesn't match the time I see with write page = display page, either, so something else seems to be going on. I'll post again if I find it. For now, let's assume programmer (me) error. |
||||
matherp Guru Joined: 11/12/2012 Location: United KingdomPosts: 8516 |
Thanks for the diagnostics. I can now reproduce it. Please could you try the attached for both speed and functionality in both 8 and 16 bit modes, and let me know if it improves the speed for you and also still works properly.

CMM2V1.5.zip |
||||
Nelno Regular Member Joined: 22/01/2021 Location: United StatesPosts: 59 |
Just tried this firmware and it fixes the problem, both in my test program and my tiling engine. To be clear, in the test programs I was seeing the slowdown when the write and display pages were the same. In my tiling engine I was seeing a lesser, but still significant, slowdown when using different display and write pages. Both are fixed.

In my test app, left and right now have essentially the same timings (~8 ms), whether or not the display page is the same as the write page. In my tiling engine, left scrolling is now a bit faster in 8-bit mode (~6 ms vs ~8 ms). Those are all for 800x600 8-bit modes with a 1 pixel delta. For 16-bit, it looks like all of the times are right about 20 - 22 ms, no matter the scroll delta used.

After running through a bunch of different modes it would be difficult to summarize the timings without another table, so I'll post a comprehensive one later for reference. Suffice to say, it's all greatly improved. Thanks!

Also, just for my learning and to satisfy my curiosity, what was the problem? I spent some time looking at the MMBASIC code and didn't see significant differences between what a right or left scroll would do.

Also, a small request for your backlog: having horizontal + vertical scrolling as a single copy operation at some point would be great. I think the times are fast enough that I can still do diagonal scrolling (should be about 16 ms to scroll diagonally, leaving 14 ms for gameplay and sprites), but saving another 8 ms there would make it easier to hold a consistent framerate and use that other 8 ms for gameplay.

Thanks again for the quick turn-around! I know how difficult it can be to keep on top of fixes since I basically write and maintain a runtime for a living. Your work and dedication are very much appreciated!

-Jonathan

Edited 2021-03-10 06:07 by Nelno |
||||
matherp Guru Joined: 11/12/2012 Location: United KingdomPosts: 8516 |
Old code:

} else if(pixels<0){
    pixels=-pixels;
    for(y=0;y<lmaxH;y++){
        d=(uint8_t *)((y * maxW) + wpa);
        s=(uint8_t *)((y * maxW) + wpa);
        s+=pixels;
        mycopy(d,s,maxW-pixels);
    }
}

New code:

} else if(pixels<0){
    pixels=-pixels;
    for(y=0;y<lmaxH;y++){
        d=(uint8_t *)((y * maxW) + wpa);
        s=(uint8_t *)((y * maxW) + wpa);
        mycopy(linebuff,s,maxW);
        mycopy(d,linebuff+pixels,maxW-pixels);
    }
}

linebuff is in the fastest tightly coupled memory with guaranteed 1 cycle access.

Edited 2021-03-10 08:47 by matherp |
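For readers who want to try the staging idea from the new code off-target, here is a self-contained C sketch of it. It is a simulation under assumptions, not the firmware itself: `scroll_left_staged`, `MAX_LINE` and the use of `memcpy` in place of `mycopy` are all hypothetical, and the `linebuff` here is an ordinary static array rather than tightly coupled memory, so it demonstrates the data movement only, not the speed-up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LINE 1024  /* assumed upper bound on bytes per line */

/* Stand-in for the firmware's linebuff. On the CMM2 this sits in tightly
 * coupled memory; here it is a plain static array. */
static uint8_t linebuff[MAX_LINE];

/* Scroll an 8-bit page left by `pixels` bytes on every line. The
 * rightmost `pixels` bytes of each line are left untouched, matching
 * what the firmware fragment above does for a negative delta. */
static void scroll_left_staged(uint8_t *page, size_t width, size_t height,
                               size_t pixels)
{
    assert(page != NULL && width <= MAX_LINE && pixels <= width);

    for (size_t y = 0; y < height; y++) {
        uint8_t *line = page + y * width;

        /* Read the whole line starting at the line start, so the slow
         * page memory sees a read that is as aligned as the line is. */
        memcpy(linebuff, line, width);

        /* Write back to the line start. The misaligned offset is now
         * applied to the staging buffer, not to the page memory. */
        memcpy(line, linebuff + pixels, width - pixels);
    }
}
```

The design point is that both accesses to the page start at the beginning of a line; whether that is 4-byte aligned depends on the page base address and line width, which this sketch does not check.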
||||
Nelno Regular Member Joined: 22/01/2021 Location: United StatesPosts: 59 |
Ah... that makes perfect sense now. Thanks! |
||||