![]() |
Forum Index : Microcontroller and PC projects : Need Basic language compression code
Page 1 of 2 ![]() ![]() |
|||||
Author | Message | ||||
lizby Guru ![]() Joined: 17/05/2016 Location: United StatesPosts: 3378 |
I'm looking for MMBasic (or any variety basic) code to compress a text file. The file, which might be multiple megabytes, would be on the SD card of an Explore-64, and I'd like to compress it before downloading to a PC using XMODEM (and I'm assuming that XMODEM doesn't do compression on the fly). It's likely to be highly compressible. The code wouldn't have to try to wring out every bit of savings, but must be lossless. Of course, I would need to decompress it again on the PC side--with MMBasic DOS. PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
twofingers![]() Guru ![]() Joined: 02/06/2014 Location: GermanyPosts: 1593 |
Hi, not the complete solution but maybe helpful? https://rosettacode.org/wiki/LZW_compression Regards Michael causality ≠correlation ≠coincidence |
||||
lizby Guru ![]() Joined: 17/05/2016 Location: United StatesPosts: 3378 |
The BBC Basic version looks hopeful. Needs some translation to MMBasic. PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
twofingers![]() Guru ![]() Joined: 02/06/2014 Location: GermanyPosts: 1593 |
I think that is an interesting topic (also useful as benchmark?). Perhaps we should someday create a CFunction for that ... causality ≠correlation ≠coincidence |
||||
CaptainBoing![]() Guru ![]() Joined: 07/09/2016 Location: United KingdomPosts: 2170 |
This is a big ask on a small machine. I have been playing with the BBCTrans version (just the encode for now), and got it working, but the results are very hit n miss. (I tweaked a lot of the variable names so you can actually understand what is happening) firstly the size of strings makes it difficult to build up a single output (i.e. your zipped string) because the output can very often be bigger than the input and two bytes can represent a single character of input, so if it doesn't zip well it will easily over-run the maximum size. This could probably be worked around but I think you'd need to stream the output to something which then makes having a self-contained function difficult as you'd need to feed chunks of your file in without upsetting the dictionary each time. Most importantly, the memory size would be a big impact. The example given with the input as "TOBEORNOTTOBEORTOBEORNOT" isn't really representative (even that is 8 chars longer than the input). Even producing phrases that I would have thought lent heavily in the direction of being compressible, weren't and produce 300+ dictionary entries. have a play with the below code. The final two figures output are the pre & post compression string lengths. On one string, the input is 131 chars, and the output 222 - that is uncomfortably close to the limit, so you'd have to use longstrings probably anyway. Unless I have got something horribly wrong (which is always on the cards), I reckon this is an interesting mental exercise but difficult to put into practice on anything other than an MMX or ArmMite... even then aspirations have to be tempered by the realities - multi-megabyte is going to take a huge dictionary - because MMBasic uses fixed length strings, you have to size the dictionary before hand which could use your memory at a startling rate. Also, depending on what you are zipping, you might need to think about monitoring the success of the dictionary and making a decision to ditch it when the efficiency drops to restore the matches. option base 0 'un-REM whichever of the following you want to play with 'plaintext$ = "A B C D E G H I J K L M N O P Q R S T U V W X Y Z" 'plaintext$ = "ABCDEGHIJKLMNOPQRSTUVWXYZ" 'plaintext$ = "TOBEORNOTTOBEORTOBEORNOT" 'plaintext$ = "TTOBEORNOTTOBEORTOBEORNOTTOBEORNOTTOBEORTOBEORNOTTOBEORNOTTOBEORTOBEORNOTTOBEORNOTTOBEORTOBEORNOTTOBEORNOTTOBEORTOBEORN OT" plaintext$ = "following may not be necessary due to wifi for above construct mains outlets, power shelf and G8 lightning string (12VDC beefy PSU)" 'plaintext$ = "a can opener can open anything a can opener can but if it can't open anything a can opener can it can't be called a can opener can it?" zipped$ = encodeLZW$(plaintext$) ? len(plaintext$), len(zipped$) ? end FOR i% = 1 TO LEN(zipped$) 'STEP 2 PRINT ASC(MID$(zipped$,i%)); '+ 256*ASC(MID$(encodeLZW$,i%+1)) " " ; NEXT 'end '? zipped$ '? FNdecodeLZW(zipped$) Function encodeLZW$(ii$) local integer newindex, d, slicesize, freeptr local string o$, sliceword$ Local string dict$(4095) length 8 FOR d = 0 TO 255 : dict$(d) = CHR$(d) : NEXT freeptr = d 'first free place in the dictionary is 256 slicesize = 1 sliceword$ = LEFT$(ii$,1) do d = 0 do newindex = d IF slicesize > LEN(ii$) THEN EXIT FOR d = 1 TO freeptr-1 IF sliceword$ = dict$(d) THEN EXIT FOR NEXT IF d < freeptr Then slicesize =slicesize+ 1 sliceword$ =sliceword$+ MID$(ii$, slicesize, 1) EndIf loop UNTIL d >= freeptr ? freeptr,"!"+sliceword$+"!" dict$(freeptr) = sliceword$ freeptr =freeptr+ 1 sliceword$ = RIGHT$(sliceword$, 1) o$ =o$+ CHR$(newindex MOD 256) + CHR$(newindex \ 256) loop UNTIL slicesize >= LEN(ii$) encodeLZW$= o$ memory End Function 'I have not worked on this part at all beyond obvious syntax translation Function decodeLZW$(ii$) local integer c, i, l local string o$, t$, w$ local string dict$(4095) length 4 FOR i = 0 TO 255 : dict$(i) = CHR$(i) : NEXT i l = i c = ASC(ii$) + 256*ASC(MID$(ii$,2)) w$ = dict$(c) o$ = w$ IF LEN(i$) < 4 THEN decodeLZW$= o$:Exit Function FOR i = 3 TO LEN(i$) STEP 2 c = ASC(MID$(ii$,i)) + 256*ASC(MID$(ii$,i+1)) IF c < l Then t$ = dict$(c) ELSE t$ = w$ + LEFT$(w$,1) EndIf o$ =o$+ t$ dict$(l) = w$ + LEFT$(t$,1) l =l+ 1 w$ = t$ NEXT decodeLZW$= o$ End Function Beware, I think I can see duplication in the dictionary so I'm not convinced the algorithm is that great - but then I can't see why the "if slice$ isn't in the dictionary" logic would fail so may be there is some nuance I am missing. EDIT:No it looks good... spaces: RUN 256 !a ! 257 ! c! 258 !ca! 259 !an! 260 !n ! 261 ! o! 262 !op! 263 !pe! 264 !en! 265 !ne! 266 !er! 267 !r ! 268 ! ca! 269 !an ! 270 ! op! 271 !pen! 272 !n a! 273 !any! 274 !yt! 275 !th! 276 !hi! 277 !in! 278 !ng! 279 !g ! 280 ! a! 281 !a c! 282 !can! 283 !n o! 284 !ope! 285 !ene! 286 !er ! 287 ! can! 288 !n b! 289 !bu! 290 !ut! 291 !t ! 292 ! i! 293 !if! 294 !f ! 295 ! it! 296 !t c! 297 !can'! 298 !'t! 299 !t o! 300 !open! 301 !n an! 302 !ny! 303 !yth! 304 !hin! 305 !ng ! 306 ! a ! 307 ! can ! 308 ! ope! 309 !ener! 310 !r c! 311 !can ! 312 ! it ! 313 ! can'! 314 !'t ! 315 ! b! 316 !be! 317 !e ! 318 ! cal! 319 !ll! 320 !le! 321 !ed! 322 !d ! 323 ! a c! 324 !can o! 325 !opene! 326 !er c! 327 !can i! 328 !it! 329 !t?! Flash: 2K ( 3%) Program (86 lines) 58K (97%) Free RAM: 38K (75%) 11 Variables 0K ( 0%) General 12K (25%) Free 134 148 > |
||||
lizby Guru ![]() Joined: 17/05/2016 Location: United StatesPosts: 3378 |
Thanks for the thoughts. I'll give the programs a try. I had an idea. My first job after college was as a proofreader for a "computerized" typesetting output. The text to be typeset was held on 5-bit paper tape, which I came to be able to read. Here's how it managed to get nearly all ascii printable characters in 5 bits. In default mode (say the third 32-character row in a standard ascii table, containing the upper-case letters--see https://www.asciitable.com/), the top two letters, "^" and "_", 0x1E and 0x1F, would represent shift-to-lower-set and shift-to-upper-set respectively. In those modes, the highest character, "?" and DEL respectively (0x1F) would mean shift back to default mode. You probably need "?", so use "~" to represent that. And you need line feed (representing CR & LF in DOS/Windows text), so replace backwards quote (0x60) with that. If you needed, say, TAB, I'd consider "{" and "}" expendable. So you lose "^", "_", "~", DEL, and the backwards single quote, but you reduce the file size by 37.5%--from 8 bits per character to 5 (nearly--not counting the shift-in and shift-out codes). PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
twofingers![]() Guru ![]() Joined: 02/06/2014 Location: GermanyPosts: 1593 |
@Andrew(CaptainBoing) Thanks for your efforts! But I guess there may be something wrong - or needs to be optimized (in the BBC code?). ![]() If I compare your output with the output from the LZW-Compressor then I get different compression rates. Eg for the 122 bytes in Your code generates 88 bytes vs. 44 bytes for LZW-Compressor. The 131 bytes Your code generates 222 bytes vs. 114 bytes for LZW-Compressor. @lizby I think thats could be a better/faster way than LZW. Maybe you can implement a ESCape character for seldom used characters (ascii/ansi >127, ascii/ansi <32) in the output file followed by the ascii code of the original. This way you will not loss any character. EDIT: I think that means the same. ![]() EDIT2: I fear we need more than 5 bits to store the whole alphabet, upper and lower case. ![]() causality ≠correlation ≠coincidence |
||||
CaptainBoing![]() Guru ![]() Joined: 07/09/2016 Location: United KingdomPosts: 2170 |
cheers for that - I suspected as much. I did go over the wiki for LZW and the algorithm is fairly straightforward but I spent more time on getting the BBC code to run than checking the algorithm of it. There are several enhancements over the original 1984 work so I guess the tool you used included those maybe. I only had an academic interest so i am not heartbroken ![]() A contemporary algorithm would be nice to see working. h |
||||
lizby Guru ![]() Joined: 17/05/2016 Location: United StatesPosts: 3378 |
Just as a typewriter uses a real shift key to double the number of characters relative to the number of strikers, 5-bit codes can use 2 virtual shift keys to nearly triple the number--30 characters in normal mode (ascii 0x40-0x5d), 31 in downshifted, and 31 in upshifted--92 minus one for LF, or 91 of the 96 printable ascii characters (if you count DEL as printable). I saw it done using 5-bit paper tape (I don't remember, but there may have been an additional shift for more special characters). PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
TassyJim![]() Guru ![]() Joined: 07/08/2011 Location: AustraliaPosts: 6283 |
The original RTTY systems used 5 bit. Check out "Baudot". I assume you are looking at transferring data you have collected. You might be better served by compressing your data as it is collected, rather than doing it when you want to transfer. Jim VK7JH MMedit |
||||
Volhout Guru ![]() Joined: 05/03/2018 Location: NetherlandsPosts: 5091 |
As long as you don't need to differentiate between upper and lower case, baudot can be used, as TassyJim proposes. It can be efficient, but if there is a high mixture of numbers and characters it becomes awfully inefficient (change set eachtime costs another 5 bits). And if the data you want to compress is pure numerical...why not send only 4 bits per character....50% compression. Or you can compress by only storing the changes... either only store when a change happens, or store the delta with last measurement. It is all depending on what compression you want to achieve, and what the data actually is. And if you need a loss-les compression or not. PicomiteVGA PETSCII ROBOTS |
||||
lizby Guru ![]() Joined: 17/05/2016 Location: United StatesPosts: 3378 |
Thanks for the suggestion, Jim. The wikipedia article on Baudot is good. It was probably a variant that the company I was working for used. The photo of the paper tape with the tiny clocking holes (or maybe they were feeding holes) looks just like what I remember. I don't recall whether "e" was a single bit or a truncated ascii-type code. I do remember DEL as all 5 bits punched--we spliced paper tape corrections in by copying a string while making corrections with leading and trailing DELs, then overpunching DEL through the original string, then cutting the paper tape and splicing in the correction by aligning the leading and trailing DELs. (This in addition to walking to work uphill in both directions--young folks these days!) Volhout--I was already considering "nibble" coding for numerals plus space, comma, semicolon, LF, and have gotten confirmation that that is all there will be, so 4-bit it is. PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
CaptainBoing![]() Guru ![]() Joined: 07/09/2016 Location: United KingdomPosts: 2170 |
if you can use CR instead of LF, this fits directly with your four bits (0Dh instead of 10h) I know you will be mapping things anyway but jus' sayin' |
||||
twofingers![]() Guru ![]() Joined: 02/06/2014 Location: GermanyPosts: 1593 |
@lizby Yes, the reference to Baudot (or CCITT) helps a lot! It would help to have a list of characters which the program should handle. After that we need to determine the "shift" codes. How many "code pages" or character sets? As a first draft I would say we should have a set (SET1) of 26 characters (a-z) + Space + EOF (= 28 char). Additionally a set (SET2) A-Z + Space + CR. And finaly a set (SET3) of numbers (0-9) and extra characters (<>,.;:-?!"()/=+° etc.) + Space + EOF. We could - for example - have an escape character (eg "^") in SET3 which allows us to insert any ansi/ascii code. Just my 2 cents ... ![]() EDIT: forgot CR causality ≠correlation ≠coincidence |
||||
lizby Guru ![]() Joined: 17/05/2016 Location: United StatesPosts: 3378 |
Ok, this is just for proof of concept, and I don't expect to pursue it further unless I find a need. One program compresses the Declaration of Independence (8114 bytes) into 5-bit code (5340 bytes)--compressed to 68%. The other decompresses. For another test, I ran the text of the Wikipedia Baudot article--12903 to 8890 bytes (69%). [code] ' encode-5bit DIM as string A, B, C, D, ch, fn dim buffChars(5) AS STRING length 1 dim as integer i, j, k, l, m, n, iCh, nchars, iBuff dim as integer iMode ' modes are 1=&H40-&H5F; 2=&H20-&H3F; 3=&H60-&H7F ' fn = "declaration" fn = "dec" ' fn = "WikiBaudot" open fn+".txt" for input as #1 open fn+"_encoded.txt" for output as #2 iMode = 1 nchars = 0 iBuff = 0 do while not eof(#1) A = input$(250,#1) l = len(A) for i = 1 to l ch = mid$(A,i,1) if ch = " " then : ch = chr$(&H60) ' shift some punctuation to set 3 elseif ch = "," then : ch = "{" elseif ch = "." then : ch = "|" elseif ch = ":" then : ch = "}" endif iCh = ASC(ch) ' print ch, iCH; : print " "; if iCh <> 13 then ' ignore carriage return if iCh = 10 then iCh = &H7E ' line feed becomes tilde elseif iCh < &H20 then iCh = &H5E ' unknown--becomes "?" elseif ch = "?" or ch = "^" or ch = "_" then iCh = &H5E ' becomes "?" elseif ch = ">" then iCh = &H5D ' becomes "]" endif if iCh > &H5F then ' mode 3 if imode = 1 then addCH(&H1F) ' shift to upper set elseif imode = 2 then addCH(&H1F) ' shift to normal set addCH(&H1F) ' shift to upper set endif iMode = 3 elseif iCh > &H3F then ' mode 1 if imode <> 1 then addCH(&H1F) ' shift to normal set iMode = 1 endif elseif iCh > &H1F then ' mode 2 if imode = 1 then addCH(&H1E) ' shift to lower set iMode = 2 elseif imode = 3 then addCH(&H1F) ' shift to normal set addCH(&H1E) ' shift to lower set iMode = 2 endif endif ' print " | ", iCH; : print " : "; addCH(iCH MOD 32) endif next i ' print A; loop close #1 if imode = 2 then : addCH(&H1F) : addCH(&H1F) : imode = 3 : endif ' shift to upper set if imode = 1 then : addCH(&H1F) : imode = 3 : endif ' shift to upper set if nchars <> 0 then for i = nchars to 8 addCH(0) ' space next i endif close #2 print "" print fn+" encoded" end sub addCH(localiCH as integer) ' lower 5 bits of localiCH contain encoding nchars = nchars + 1 iBuff = (iBuff << 5) + localiCH ' print localiCH; : print " > "; print iBuff if nchars = 8 then ' iBuff contains 40 significant bits buffChars(1) = chr$(iBuff >> 32) buffChars(2) = chr$((iBuff AND &HFFFFFFFF) >> 24) buffChars(3) = chr$((iBuff AND &HFFFFFF) >> 16) buffChars(4) = chr$((iBuff AND &HFFFF) >> 8) buffChars(5) = chr$(iBuff AND &HFF) print #2, buffChars(1)+buffChars(2)+buffChars(3)+buffChars(4)+buffChars(5); nchars = 0 iBuff = 0 ' end ' 8 characters only endif end sub [/code] [code] ' decode-5bit DIM as string A, B, C, D, ch, fn dim as string cShiftNormal length 1, cShiftUp length 1, cShiftDown length 1 dim chString AS STRING length 8 dim buffChars(8) AS STRING length 1, cOutputBuff as string length 250 dim as integer i, j, k, l, m, n, iCh, nchars, iBuffChars(5), jStop dim as integer iMode ' modes are 1=&H40-&H5F; 2=&H20-&H3F; 3=&H60-&H7F cShiftNormal = chr$(&H1F) ' mode-dependent--for modes 2 & 3 cShiftDown = chr$(&H1E) ' mode-dependent--for mode1 cShiftUp = chr$(&H1F) ' for mode1 fn = "dec" ' fn = "WikiBaudot" open fn+"_encoded.txt" for input as #1 open fn+"_decoded.txt" for output as #2 iMode = 1 nchars = 0 iBuff = 0 cOutputBuff = "" chString="" k = 0 ' count of characters in cOutputBuff j = 0 ' count of loops jStop = 0 ' debug while j<jStop do while not eof(#1) A = input$(5,#1) l = len(A) ich = 0 for i = 1 to l iBuffChars(i) = asc(mid$(A,i,1)) If j < jStop then : print bin$(iBuffChars(i),8)+" "; : endif next i if j < jStop then : print "" : endif ' 11111222 22333334 44445555 56666677 77788888 ' 8 chars in 5 bytes buffChars(1) = chr$(iBuffChars(1) >> 3) buffChars(2) = chr$(((iBuffChars(1) and &B111) << 2) + (iBuffChars(2) >> 6)) buffChars(3) = chr$((iBuffChars(2) and &B00111110) >> 1) buffChars(4) = chr$(((iBuffChars(2) and &B1) << 4) + (iBuffChars(3) >> 4)) buffChars(5) = chr$(((iBuffChars(3) and &B1111) << 1) + (iBuffChars(4) >> 7)) buffChars(6) = chr$((iBuffChars(4) and &B01111100) >> 2) buffChars(7) = chr$(((iBuffChars(4) and &B11) << 3) + (iBuffChars(5) >> 5)) buffChars(8) = chr$(iBuffChars(5) and &B11111) for i = 1 to 8 ch = buffChars(i) if j < jStop then : print bin$(asc(ch),5)+" "; : endif select case ch CASE cShiftNormal if iMode = 1 then iMode = 3 ' lower case else iMode = 1 ' shift to normal--upper case endif CASE cShiftDown : if iMode = 1 then : iMode = 2 : endif ' numbers end select Select case iMode case 1 if ch = cShiftNormal or ch = cShiftDown then ch = chr$(0) else ch = chr$(asc(ch) + &H40) endif case 2 if ch = cShiftNormal or ch = cShiftDown then ch = chr$(0) else ch = chr$(asc(ch) + &H20) endif case 3 if ch = cShiftNormal then ch = chr$(0) else ch = chr$(asc(ch) + &H60) endif end select if ch = chr$(&H60) then : ch = " " ' remap some characters elseif ch = "{" then : ch = "," elseif ch = "|" then : ch = "." elseif ch = "}" then : ch = ":" elseif ch = "~" then : ch = chr$(13) endif if ch <> chr$(0) then : cOutputBuff = cOutputBuff + ch : endif k = k + 1 if k = 250 then print #2, cOutputBuff; cOutputBuff = "" k = 0 endif if ch = chr$(13) then cOutputBuff = cOutputBuff + chr$(10) k = k + 1 if k = 250 then print #2, cOutputBuff; cOutputBuff = "" k = 0 endif endif if j < jStop then : chString = chString + ch : endif next i if j < jStop then : print "" : print chString : endif chString = "" j = j + 1 ' exit do ' only one group of 5 bytes / 8 characters loop print #2, cOutputBuff close #1 close #2 print "" print fn+"_decoded.txt output" end [/code] My first pass didn't achieve much--mainly because the most common punctuation characters (space, comma, period, colon) were not in the same set as the most common letters--the lower case letters. I shifted them up, replacing backquote, "{","|","}". Next--at some point--I'll code for numerals, space, comma, colon, LF into 4 bits. Texts zipped below. 2018-10-16_111126_dec.zip 2018-10-16_111156_WikiBaudot.zip I did the development on MMBasic for DOS. The programs ran in not much more than a second. PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
Volhout Guru ![]() Joined: 05/03/2018 Location: NetherlandsPosts: 5091 |
Another way of compressing existing data is simply storing the pointers. The declaration of independance can be compressed into a simple hyperlink....... PicomiteVGA PETSCII ROBOTS |
||||
twofingers![]() Guru ![]() Joined: 02/06/2014 Location: GermanyPosts: 1593 |
@lizby Congratulation! Good job and nearly perfect. ![]() I compared WikiBaudot.txt vs WikiBaudot_decoded.txt using Totalcommander/notepad++ and found some - only a few - differences. not surprising. hmmm ... It found seven differences, mainly because of "foreign characters". An idea about more "shift codes": A "shift code" followed by a "shift code" expands to the next shift level. causality ≠correlation ≠coincidence |
||||
lizby Guru ![]() Joined: 17/05/2016 Location: United StatesPosts: 3378 |
Yes, additional shift codes could be used. Right now I'm using &B11111 as "shift-up" when in the ascii &H40-5F mode, and as "shift-to-&H40" when in either of the other modes--but yes, taking another code could allow you to continue to shift into as many 32-byte groupings as you could want. In fact, you'd only need the shift-down code in the non-primary groupings. &B11111 from &H40 would shift to &H60, &B11111 again would shift to &H20, and again would shift to whatever next grouping you wanted. From any grouping other than &H40, &B11110 would shift down one. Regarding "1865-1946" yielding "1865v1946", my guess would be that the dash is actually an en-dash or some other special character, and outside my coding scheme. (But I was hoping that unrecognized characters were going to end up as "?"--though I actually never checked for codes above &H7F.) PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
lizby Guru ![]() Joined: 17/05/2016 Location: United StatesPosts: 3378 |
Cap'n--I chose LF because in Linux mode, LF is used in place of what in DOS is CRLF. I don't know why that would be relevant here, but that was my rationale. PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
twofingers![]() Guru ![]() Joined: 02/06/2014 Location: GermanyPosts: 1593 |
Thats correct, it is chr$(150) - typographically a must. Maybe a good idea to use lookup-tables? Another point: "unrecognized characters" should IMHO be replaced with an unique character? causality ≠correlation ≠coincidence |
||||
Page 1 of 2 ![]() ![]() |
![]() |
![]() |
The Back Shed's forum code is written, and hosted, in Australia. | © JAQ Software 2025 |