https://web.archive.org/web/20010309140512/noquake.de/domination/lzw-secrets.htm

Cached with the author’s permission — He emailed me! =)

By Antitrack/Legend, Sept.22nd, a.d.1998, for Domination paper edition

LZW on Wikipedia

Table of Contents [hide]

1 Introduction
2 Normal Operation
3 A Simple Idea of Improvement
4 Life is not so simple at all
5 A tour de Memory
6 The source part 1
7 More sources
8 The source part 2
9 Summary and final analysis

Introduction ∞

Every commodore 64 – scener has most likely used these fiendish, evil, little utilities known as “crunchers”. Whatever work of yours is about to finish – your demo or your crack – one of your last steps in completing your work will most likely be to compress it. And what a step it is! Many a cracker spent a whole night waiting for the compression program to finish its work. A 230 block program which is already packed with an equal-char-packer may well eat up to seven hours of crunching. Thats not exactly overwhelming if you consider that people might want to finish (and, hell, yes, PACK!) their demos at a party in time! Other, more clever folks, may have tried to convert their data to another computer, in order to find out that compression utilities on these devices tend to finish their work faster by dozens of times! Obviously, this cannot only be done by sheer faster computing power… the algorithms on other computers are supposed to be better!

In this article, we will take an in-depth look at how an equal-sequence-cruncher works on the c64, and we will find the reason why he is so slow. We will find ways how to speed up things, and what means, hardware– or software-wise, are required in order to accomplish that. At the end of the article, we will try to optimize all these means and will generally tend to juggle around with them in order to get fast compression with a reasonable amount of extra hardware (most notably, memory). Surprisingly, we will also try to use other hardware to improve crunching!

Normal Operation ∞

In order to improve anything, we better got to get a deep understanding what’s going on already. So, what, basically, does a LZW (equal-sequence) cruncher do? The answer is, of course, he scans for similar strings in a given file, and tries to eliminate them, so they appear only once. Let me try to draw an example:

An input file consists of the following bytes (all values in hexadecimal):

ADDRESS BYTES
$1000::::   $00 $01 $03 $03 $0a $0d $01 $03 $03 $0a $03 $03 $0a

Yes, this string is packable very well. A LZW cruncher will detect all similar strings, and will only store the largest of the similar string in the output file. Any time the string should appear in the decompressed memory, the LZW cruncher will store a some special codes instead of the string there. These special codes will tell the uncompressor where to look for the bytes, and how many of them.

In the example above, the result might look like this :

$00 $01 $03 $03 $0a $0d (<code for get "earlier mem"> <code for "4 bytes"> <code for offset>)
(<again code for "get earlier mem"> <code for "3 bytes"> <code for "offset">)
(<again code for "get earlier mem"> <code for "3 bytes"> <code for "offset">)

The decruncher needs to know at least 3 things:

shall I simply put these bytes into memory or shall i get a string from the memory earlier?
If I get a string, how many bytes do I have to copy, and
from what memory address?

There are really a hell lot of methods to encode your information so that the decruncher will know -in an orderly manner- when to simply put data into memory, when to get a string from memory, and when to stop. For example, some compressors ALWAYS think they have to copy data from memory unless explicitly told otherwise! These compressors have “chunks” of data everywhere, which are superseded by a header information in front of each chunk of data, much like every disk sector on the disk has a header telling the disk drive where it is.

However, we are not interested in the format of the headers and bytes that are encoded. We are interested in why the encoding process is so slow, notably on large files, and what we can do about it.

To understand this, let us return to the example above:

ADDRESS BYTES
$1000::::    $00 $01 $03 $03 $0a $0d $01 $03 $03 $0a $03 $03 $0a</P>

What is the cruncher supposed to do? The cruncher has a pointer, say, $0002/$0003 (2 zeropage addresses) to our program at $1000. it looks in the whole memory from $1000 to $100c for a sequence. Here, it is reasonable to assume any sequence consists of at least 2 bytes, so the cruncher will look in $1000 to $100c for the bytes “$00 $01“. He will not find them, so he will simply output $00 $01 without any special notice.

Next the cruncher will try to find the next 2 bytes, $01 $03 in the memory, and he will find them (at $1001, $1006) and proceed to crunch them.

Crunchers operation will continue at $1005 (at the $0d byte), then at $1006 ($01 $03 again), then at $100b (another sequence).

So basically, on the c-64, the cruncher is stuck up in a double loop. For each 2 byte sequence there is in the memory, the program will scan the whole other memory! This looks (and behaves) much like a basic loop in the following style:

For i=1 to 50000 do
for j=1 to 50000 
do
.....(crunching algorithm)
next j
next i

A double loop is, unfortunately, not a real fast or favourable way to handle stuff like this. What do we need to speed up all this? The answer is, after due consideration, we have, somehow to avoid the double loop. The first idea a programmer might have, is: “Hmmm. It would be something wonderful if we had some big, big arrays that contained all the memory addresses of all sequences!” Any time the computer looks for a sequence, he would scan the array instead of the whole memory.

How would such a code look like?

A Simple Idea of Improvement ∞

We remember that the cruncher tends to look for a sequence that is at least 2 bytes long. Well, then it would be cool to have some HEAVY extra memory and have all addresses of all 2-byte-values in some big arrays. Each array would have to be 64 kb big, of course, since our 64kb input file may consist of only these 2 bytes appearing all over again. Let us, for a (strange) moment imagine that we have had a sort of super-c64 with infinite memory, and try to code some “search next sequence” routine. Somehow (magically) we already have the addresses of all 2-byte-values in some big arrays:

We have: all addresses of all $0000 values in an array going from $0000-0000 to $0000-$ffff

$0001   $0001-0000   $0001-$ffff
$0002   $0002 0000   $0002-$ffff
.
 .
$ffff   $ffff-0000   $ffff-$ffff

Ui, a rather big bunch of 65536 arrays,
you will say. Yep. And how would our search routine look like? Answer: The
search routine would look very simple, we imagine we have a super-c64 that can
address 32 bit addresses:

       ldy #$0000          (16 bit y register, rite)
loop   lda $ea2d0000,y     (we are looking for the sequence $ea2d)
                           (akku is a 16 bit accu, miraculously)
       beq end_of_crunch   (if no more sequences, end)
       jsr encode          (crunch slave to the work!)
       iny
       bne loop
end_of_crunch   rts

Wow, this would be great! Just some self-modifying code at the “loop” label, and we are set.

However, life is not that simple. We can code the example above with 8 bit registers (akku and y register, typically by using zeropage) but what’s not so easy to get is a commodore-c64 with no less than 2^32 bits ram = 4194 megabytes of ram!

Obviously, we have to think of something less memory-eating. Here is how.

Life is not so simple at all ∞

The basic idea is to use a chained list. A “chained list” is a rather clever data structure that is composed of, basically, three things:

a) An easy-to-find start pointer to the chained list
b) the middle of the chained list, consisting of link and data (in whatever order you decide)
c) the end of the list

Case b) is unique : The middle of the chained list consists of two things. The “link” is a pointer to where I can find the next chain of the list. “DATA” is any data we want, and what we want, is, of course, still, the 16-bit address of where our 2-byte-sequence resides in c64’s normal memory. We might reserve special values for “link” and “data” for special reasons! For example, it is reasonable to assume that our input file will never be larger than $0801-$ffff, so we might assign special meanings to the values $0000-$07ff as “DATA”. However, all for now, we only need one special value to help us detect case (c) (which is, the end of the list), so we will use a combination of data=$0000 and link=$0000 for this very case.

Most readers are getting uneasy here, so let me sum up the monumentous task that lies ahead of us:

We are going to split the task of LZW crunching in TWO parts. Part one will be to construct an array of big chained lists, and for programming convenience reasons we are going to waste all extra 512 kb ram of external memory just for this “big array of chained lists”. This big array of chained lists will contain all the addresses for all 16-bit-values that we can find inside the c64’s memory (which, of course, contains the data we want to crunch). As you will see later, it really is convenient to have so much RAM to waste. Trying to make all this less ram-wasting tends very much to increase programming work!

Part two of our work is dedicated to acually make use of these mess of chained lists, which is rather simplistic anyway. For any sequence that lies ahead of us, we will look into the chained list of this specific sequence in order to determine where the next 2 bytes of this sequence are located inside the c64’s memory. This is almost as simplistic as the imaginary 16-bit-supercode program that was mentioned above, so hold your breath, it’s worth the trouble! But first lets write some code to actually use the big bunch of “pseudo arrays with links in between”.

A tour de Memory ∞

Let us try to draw a memory map:

The 512 kb REU is being split into 8 banks, bank 0 up to bank 7, so lets try to make use of that:

We will use:

Bank 0 to hold the low byte of all start pointers to all chained lists
bank 1 to hold the hi byte of all start pointers to all chained lists
bank 2 to hold the low byte of all END pointers to all chained lists
bank 3 to hold the high byte of all END pointers to all chained lists
bank 4-7 will hold all the chained lists.

“Wait a moment! Where is the middle byte of the pointers?” Some clever readers may ask. “You see, we have 512 kb ram, well, at least 256 KB ram to access but only 16 bits (2 bytes: low and hibyte) of pointers! Where are the other 2 bits? ”

Well, this is a valid question. Indeed if we have to have pointers to a big area we need more than 16 bits…DO WE? There is a little trick clause here: We assume that our small pieces of chains are at an even address only, since they take 4 bytes for each piece of link. So, each piece of link will be at an address like $xxx0, $xxx4, $xxx8 or $xxxC. There is no need to store the least significant 2 bits, since they will always be zero! Nice trick, huh? This also means, we will have to shift 18 bit addresses to 16 bit addresses, and when we get them back, we will have to shift 2 zero-bits in again to get the 16 bit address. Simple, but saving tons of memory!

Anyhow – how are we going to construct this list?

First, we are going to clear the whole 512 kb of extra RAM with zero. You will see that this task will not only clear everything but save us lots of work later.

What we, basically, need, is a single integer-variable, a sort of a counter, that keeps track of where some free memory in the REU resides. Every time we insert some of our chunks into the REU memory, we will have to increase this 16-bit-counter in order to keep track up to where we have wasted REU memory for our chunks.

Now, we will start with a pointer to the start of the data that is to be packed. We will get 2 bytes there. (Let us assume, for readability, 2 random bytes, e.g. $ea $db). So, we will try to get the start pointer. If the start pointer is $0000 (no start has been set yet), the solution is simple: put a chunk of data into the next free REU memory (our counter points there), then put the start pointer (REU pages 0 and 1) there, and also put the end pointer (REU pages 2 and 3) there.

If, however, the start pointer is not $0000, it means the chunk already exists (somewhere in REU memory) so we have to follow a more tricky strategy of memory allocation: First, we will still put our chunk at the REU place the counter tells us, but then we will look at the end pointer (pages 2 and 3 of REU) for the selected linked list, and change the link info from $0000 (end of list) to $xxxx (whereever the counter told us the “free” mem in the REU is).

In all cases, at the end of this routine we have to correct the end-of-list pointer from the old value to the new value (our trusty counter tells us where).

This was tricky, wasn’t it? Lets look together at the source code that does it all. We will mainly assume our part 1 code has been properly inserted into any version of Darksqueezer, and a JMP command was nicely placed into the DSQ cruncher after he loaded all c64 mem with the data to be packed, and before the actual packing process starts.

The source part 1 ∞

Spoiler

COUNTLO = $FE         ; this is the 16 bit counter that tells us where to find the next
COUNTHI = $FF         ; free REU address (18 bit wide, between pages 4-7 thus 256 kb)
   *= $1000           ; some nifty start address
      SEI
      LDA #$2F
      STA $00
      LDA #$37
      STA $01
      LDA #$80        ; reu page 4 mem $0000-$0200 is reserved
      STA COUNTLO     ; for temp storage ---> our chunks will never
      LDA #$00        ; start before reu page 4 mem $0200 .
      STA COUNTHI
      LDA #$00        ; the main DSQ code didnt tell us yet the proper
      SEC             ; start/end address yet, so we have to correct this
      SBC $AE         ; here
      STA $AE
      STA ENDE+1      ; some tiny self modifying code here,
      LDA #$00        ; in order to remember $ae/$af data start/end
      SBC $AF
      STA $AF
      STA ENDE+5
      LDA #$00        ; start address in c64: $0200 (temp variable)
      STA $DF02
      LDA #$02
      STA $DF03
      LDA #$01        ; one byte to transfer
      STA $DF07
      LDA #$00        ; one byte to transfer
      STA $DF08
      LDY #$00
      STY $01         ; $01 =  00 : RAM only on            
      LDA ($AE),Y     ; get program byte

      STA $03         ; into mini ringpuffer at $02/$03
MAININS   LDA $03     ; move ring puffer
      STA $02
      LDY #$00
      STY $01
      INY
      LDA ($AE),Y     ; get next byte off c64 memory
      STA $03         ; into ringpuffer
      LDA #$37        ; enable ram expansion
      STA $01
      STA $D020       ; do some stupid $d020-flicker
      LDY #$00        ; get start pointer $00xxyy and $01xxyy
      STY $D020       ; from REU page 0 and 1
      LDA $02
      LDX $03
      JSR SET456      ; get page 0 reu pointer lowbyte for this chunk
      JSR READ        ; 
      BNE COMPLINS    ; if value of pointer not 0 choose complex strategy
      INC $DF06       ; get page 1 reu pointer hibyte for this chunk     
      JSR READ
      BEQ SIMPLINS    ; hibyte AND lowbyte are zero, this means
                      ; the link doesnt even exist
                      ; so choose simple strategy
COMPLINS   LDA #$02   ; this is the complex strategy
      STA $DF06       ; get existing upper end of linked list
      JSR READ        ; from $02-xxxx (bank 2 reu)
      STA $0100       ; store at $0100
      INC $DF06       ; get hibyte of upper end bank 3 reu
      JSR READ        ;
      STA $0101       ; store at $0101
      JSR MULT4       ; calc 16 to 18 bit, 2 least sig. bits = 0
      INC $DF04       ; add 2 to get to the link
      INC $DF04
      LDA COUNTLO     ; write current chunk (16bit) lobyte
      JSR WRITE       ; to end of linked list
      INC $DF04
      LDA COUNTHI     ; write hibyte of current chunk to end of linked list
      JSR WRITE
      JMP COMPLCON    ; continue with complex insert
SIMPLINS   LDA $02    ; this is simple insert.
      LDX $03         ; write current link address to $00-xxxx                      
      LDY #$00
      JSR SET456
      LDA COUNTLO
      JSR WRITE       ; and hibyte to $01-xxxx
      INC $DF06
      LDA COUNTHI
      JSR WRITE

; now that all pointer work was done, we do the real work of storing the chunk : this is, storing the value and storing the link to the next chunk, and, remembering the proper end of list.

COMPLCON   LDA $02    ; we will remember the end of list at $02-xxxx
      LDX $03
      LDY #$02
      JSR SET456
      LDA COUNTLO     ; the end is of course at countlo/hi in REU
      JSR WRITE       ; storing at $02-xxxx
      INC $DF06       ; end of list hibyte pointer at $03-xxxx
      LDA COUNTHI
      JSR WRITE       ; storing at $03-xxxx

; now we have stored all sort of links only,
; its a good time to store the actual chunk. Thus we convert the 16 bit counter to 18 bit again,
; then we store the chunk at the 18 bit address. 

      LDA COUNTLO     ; store countlo/hi in $0100 as a temp variable
      STA $0100
      LDA COUNTHI
      STA $0101
      JSR MULT4       ; shift 16 to 18 bits, 2 0-bits inserted at bit 0 and 1
      LDA $AE         ; write data to chunk address c64 mem lobyte
      JSR WRITE
      INC $DF04       ; write data to chunk: address c64 mem hibyte
      LDA $AF
      JSR WRITE

; now we would actually have to write $0000 to the new chunk at address COUNTLO/HI in REU.

; However we don't need to do this coz the whole REU was filled, earlier, with $00-bytes.

   ; INC $DF04 
   ; LDA #$00         ; write "end of list" link value $0000 (not necessary).
   ; JSR WRITE
   ; INC $DF04
   ; LDA #$00
   ; JSR WRITE
      INC COUNTLO     ; since we have stored our chunk at address "COUNT"
      BNE IN1         ; in REU, we have to increment COUNT here.
      INC COUNTHI
IN1      INC $AE      ; we also increment to the next address inside the c-64 mem.
      BNE *+4
      INC $AF
      LDA $AF
      AND $AE
      CMP #$FF
      BEQ ENDE        ; until end of memory
      JMP MAININS     ; end of mem not reached : goto main loop.
ENDE      LDA #$00    ; end of memory reached: Restore some important
      STA $AE         ; memory zp locations ($00ae/af ) using self modifying code
      LDA #$00        ; attention Self modifying code
      STA $AF
      LDY #$00
      STY $01
      LDA ($AE),Y     ; set an internal flag ($87) for DSQ
      EOR #$FF
      STA $87
      DEC $01
      RTS

MULT4      LDA #$00   ; mult4: convert 16 bits at $0100/$0101 to 18 bits
      STA $0102       ; shift in 2 zero bits at bit 0 and bit 1.
      ASL $0100
      ROL $0101
      ROL $0102
      CLC
      ASL $0100
      ROL $0101
      ROL $0102
      LDA $0102
      CLC
      ADC #$04        ; add 4: select bank 4 as starting bank
      STA $DF06       ; and store $0100/1/2 into REU address select registers:
      LDA $0101       ; $df04/5/6
      STA $DF05
      LDA $0100
      STA $DF04
      RTS

SET456      STA $DF04 ; select REU address manually
      STX $DF05
      STY $DF06
      RTS
WRITE      STA $0200  ; write a byte to reu: store byte at $0200 then write it to
      LDA #$FC        ; current REU address
      STA $DF01
      RTS
READ      LDA #$FD    ; read a byte from reu: store byte from reu at $0200
      STA $DF01       ; then get byte into accu.
      LDA $0200
      RTS

———————————— (insert coffee break here) —————————————–

If you have done all your homework and carefully followed the mazelike array of linked lists, you might have been somewhat uncomfortable. No, not due to the fact that you might be lacking brains to follow all these links, but that so much memory has been wasted. After all, do we really need these none less than 128 kb memory…..just to find the end of the linklist, stored in bank 2 and bank3 ?

Answer: No, we did not really need these extra 128 kb. Instead of having a huge array of end pointers, we might follow the list up to its end every time we decide to chain another chunk to the end of the list. However this would mean we had to scan the entire list each time we decide to insert another chunk. This is eating valuable crunching time, thus is rather unacceptable.

Phew! We are nearly set! Now let us take a close look at what DSQ really does 99.5% all of the crunching time, and lets write the code to improve that performance by using the carefully constructed maze of lists above. Much to our pleasant surprise, we will find out it’s rather simple to code and short too!

But first let us disassemble what DSQ does 99% all of the time. Suck and see: If you crunch a large file using DSQ 2.0 or 2.2 and if you press your freezer cartridge button, you will most likely end up inside a little code fragment that’s inside the zeropage. Let’s disasm that one (and yes, of course it’s some clever self modifying code….):

More sources ∞

Spoiler

; akku contains the 8 bit low byte of the sequence

; we actively search for the sequence by finding the first 8 bits, first ....

0030  D9 00 EE   CMP EE00,Y   ; searching for lower 8 byte of 2-byte sequence....
0033  F0 0C   BEQ 0041        ; similar? good....lets do work
0035  C8   INY                ; no, scan the rest of this page
0036  D0 F8   BNE 0030
0038  E6 32   INC 32          ; still not, scan rest of, uhhhhh, whole memory
003A  F0 04   BEQ 0040     
003C  C6 2F   DEC 2F          ; or scan at least as many pages as in $2f ....
003E  D0 F0   BNE 0030
0040  60   RTS

0041  AA TAX                  ; store accu in x for later use       
0042  84 31   STY 31          ; self modifying, clever code
0044  A0 01   LDY #01         ; try to determine sequence length
0046  B9 2E EC   LDA EC2E,Y   ; self modifying code
0049  D1 31   CMP  (31 ),Y
004B  D0 05   BNE 0052
004D  C8 INY
004E  C0 FF   CPY #FF
0050  D0 F4   BNE 0046
0052  88 DEY
0053  C4 02   CPY 02          ; compare with minimum profitable sequence length in $02
0055  B0 09   BCS 0060        ; bigger? Yes: do real crunching
0057  8A TXA
0058  A4 31   LDY 31
005A  A2 00   LDX #00
005C  86 31   STX 31
005E  F0 D5   BEQ 0035        ; shit its smaller than profitable, keep scanning
0060  4C F8 11   JMP 11F8     ; do real crunching

Ok, so here is the summary: This heavy self-modifying beast searches for the next sequence by scanning the whole memory for suitable (equal) strings. $02 seems to hold the current “at least wanted” string length. $0030 and $0031 hold the start of memory, $0047 and $0048 hold a pointer to the sequence in the middle of the data. This is rather important since we will have to use the same pointers in our source.

What can we do about it?

Well… $0030 and $31 hold some important pointer, so it would be unwise to modify anything here. The BEQ command at $0033 actually could serve us to do something useful. That is to say, if a sequence seems to be immediately ahead, this very BEQ will be executed. So we will leave it alone as well. But, at $0035 the INY is part of the dreaded slooow loop that scans the memory. We will mercilessly place a JMP at $0035 to our own code:

$0035 JMP $0338

As you will soon see, we will have enough memory in the …. TAPE BUFFER … to do all the scanning that will replace the loop at $0035 up to $003e. Yes, our routine will be longer, but not that considerably longer!

Let’s sum up again what our routine has to do:

Our routine has to scan for a sequence, whose LSB we know, (in the accu) and whose hibyte is in $0031. The sequence address of the same sequence must be bigger or equal than the number we have identified to reside in memory location $0047/48. 47/48 also point to a sequence with the same two bytes as in $0030/$0031. We will use pointer 47/48 in our case only.

We will add some special flags that will make our life easier if multiple instances of the same sequence are being needed each “call” to our routine. (Well its not a call, its a jump, but who cares, you get the meaning).

Simple enough? Yes it is. Lets look at the code now!

The source part 2 ∞

Spoiler

ACCU   = $80            ; zeropage wasting code :)
XREG   = $81
YREG   = $82
H1   = $83
H2   = $84
BYTE1   = $85
BYTE2   = $86
LASTBYTE1   = $87
LASTBYTE2   = $88
GOT1   = $89
GOT2   = $8A
OLDDF04   = $8B
OLDDF05   = $8C
OLDDF06   = $8D
  *= $0338              ; our jmp at $0035 gets us here

MAINPRG   STA ACCU
     STX XREG
NOFF      LDY #$00
     LDA ($47),Y        ; store 16 bit sequence in h1 and h2
     STA H1             ; (help variables)
     INY
     LDA ($47),Y
     STA H2        
     LDY #$FF           ; switch to normal ram and enable REU registers
     STY $01
     LDX BYTE1          ; flag if we have already scanned for this sequence
     BEQ NEXTFETCH      ; flag = 0 : scan from the lists beginning
     CMP LASTBYTE2      ; do some sanity check if its really the last used sequence
     BNE NEXTFETCH      ; (h2 is in accu)
     LDA H1
     CMP LASTBYTE1      ; if its really the last used sequence
     BEQ NOCHMAL        ; we will take a shortcut :-)
NEXTFETCH
     LDA H1             ; no shortcut, sniff ....
     STA $DF04          ; thus, let us scan the start of the list
     LDA H2             ; get start pointer of list into "got1"
     STA $DF05
     LDA #$00           ; bank 0 contains lobyte of pointer
     STA $DF06
     JSR READ           ; get it into variable "GOT1"
     LDA GOT1           ; LOWBYTE
     PHA                ; notice it (TAX might work also)
     INC $DF06          ; get bank 1 hibyte of pointer
     JSR READ
     LDA GOT1             
     STA OLDDF05        ; move it to $df05
     PLA
     STA OLDDF04        ; and $df04  (lo and med address in REU)
     JMP CONT

; if we should get the same link again we didnt need to follow the procedure above. However we have to do a sanity check first if we are already at the end of the list.........

NOCHMAL   LDA OLDDF04   ;
     ORA OLDDF05        ;
     BEQ GIBTSNICHT     ; if link = 0000 then end of list, unsucessful search for
                        ; a sequence.

; in any given case, we have to convert 16 bit to 18 bit address to follow the link.....that s whatthe ROLROUT subroutine does. Moreover the same subroutine also reads the bytes into the GOT1/GOT2 zeropage variable.

CONT      JSR ROLROUT
GOODMOV   LDA GOT2      ; now we have to check if the found sequence's address is  
     CMP $48            ; really bigger than the address found in $0047/48.
     BEQ TEST2
     BCS NICHTNOCHMAL   ; bigger address:        "not again"
     BCC NOCHMAL        ; smaller address -----> scan again
TEST2      LDA GOT1     ; comparing 16 bit lobytes:
     CMP $47            ; address equal: check again
     BEQ NOCHMAL
     BCC NOCHMAL        ; smaller: check again

; the found address of the sequence is bigger than the address of 47/48. This means we found a fine 2 byte address in the sequence, and can proceed to try to crunch. For this reason we will simply jumpback to the zeropage-code responsible: the one at $0041.


NICHTNOCHMAL
     LDA GOT2           ; store "result" in 0031/32
     STA $32
     LDA H1        
     STA LASTBYTE1      ; remember this "last used" sequence in case we have
     LDA H2             ; to look for her again next time!                            
     STA LASTBYTE2
     LDX #$00           ; correct $01               
     STX $01
     INX
     STX BYTE1          ; set flag "sucessfully found sequence"
     LDY GOT1           ; restore accu and set y properly
     LDA ACCU
     JMP !$41           ; jump back to original zeropage code

; this is the code if the sequence was not found at all (end of linked list reached). 

GIBTSNICHT              ; "not there"
     LDA #$00
     STA BYTE1          ; set flag "unsucessful"
     TAY                ; correct registers and quit
     STA $01
     RTS
ROLROUT   LDA #$00      ; convert 16 to 18 bits and add 4
     STA OLDDF06
     ASL OLDDF04
     ROL OLDDF05
     ROL OLDDF06
     ASL OLDDF04
     ROL OLDDF05
     ROL OLDDF06
     LDA OLDDF06
     CLC
     ADC #$04
     STA OLDDF06
     STA $DF06
     LDA OLDDF05
     STA $DF05
     LDA OLDDF04
     STA $DF04
READ      LDA #$FD      ; read byte from REU
     STA $DF01
     RTS

Summary and final analysis ∞

Uh, this was quite some code. Can you sum up yourself what it does?

It looks for a 2 byte sequence similar to the one at $0047/8. If that is the same 2 byte sequence that the one we have been looking for earlier (a very common case!) then we will simply re-use the old registers with the proper set values to follow the link. If not, we scan the chained list fromt the start. In any case, we determine if the c64 address from the chained list’s DATA is smaller than the current $47/48 address. If so, we proceed to scan the list either until the end of the list or a proper address was found. If a proper address was found, we store its value into $30/$31 and jump back to the crunchers routines. Otherwise, the end of the chained list will be reached soon, and the program will exit with code “sequence not found”.

Instead of scanning the whole memory, we now scan a rather tiny list (tiny: compared to the whole memory that is). If you press reset in the middle of crunching and if you decide to look at (and follow) these chained lists in REU memory yourself, you will find out they are rather small, compared to a whole 60kb of memory to scan!

So actually, what is the worst case? The worst case would be e.g. a input data file consisting of ONLY 0 – bytes. There would be only ONE sequence in this whole file, and only ONE single chained list would be produced by the program from pass one. These are those rare cases where normal Darksqueezer also tends to go haywire (i.e, very slow). No surprise you are advised to crunch your data first with an equal-character-cruncher to eliminate such cases.

Let us rather look at a typical case. Let’s look at a game like “ELITE“. It does have some sequences that can be eliminated, but, obviously, not too many of them. An equal-char packed version of ELITE has something like 203 blocks, the packed version is 161 blocks. 42 blocks can be gained…. this is impressive, but not too much. Normal DSQ needs 1 hour and 35 minutes to pack it. Our improved REU version takes gasp! Only one minute and 35 seconds! What an exciting improvement! A gain in speed by 63 times, with the same output result, you know! Very, very impressive. And how can we explain this? One of the main speed gains is the fact that our chained lists are rather small and thus, convenient to scan. The other (main) reason is, they immidately tell us when it is not worth looking for a sequence! Many times, DSQ normally scans the whole memory for a sequence, but, of course FAILS! Because the sequence only appears once, or realllly few times. Now, with our brushed-up REU routines, DSQ next-to-immidately knows when its not worth looking for a sequence! If a sequence appears only once, its linked chain is only one item small, and DSQ immidately knows its not worth scanning the memory! A very typical case when you crunch data that is difficult to crunch (e.g. an already crunched program…..try the difference yourself!)

If, on the other hand, we have input data which is VERY WELL crunchable (i.e. “Dizzy down the Rapids“, an old game) we will acknowledge that crunching takes considerably longer than just one minute and 35 seconds. On dizzy down the rapids, it took like 15 minutes… the longest crunchtime I have experienced with my 512kb-REU-Darksqueezer version! Pretty long? No way – CruelCruncher needed the whole night again -with 3 blocks WORSE result. (So far about “superior packers”…I think I know why I really stuck to DSQ all my time…..)

But of course, I still hear some people moan that they have only a 256 kb ram expansion (and, arguably, still some more time left to waste for crunching). Lets see what we can do for these folks….but not tonight! I will leave this for another, and hopefully kewler, article!

Yours truly,

Antitrack/Legend in 1998 !!!

2 comments on “The Secret of Fast LZW Crunching”

spiralofhope on 2005-02-20 at 12:00 am said:

(the date of this comment is estimated)

– Wiki-ized with love. All that source code was a pain. =)
— I should have used spaces instead of tabs. Oh well.
– Fixed minor typos/grammar

spiralofhope on 2019-12-28 at 8:24 pm said:

ported

corrected the source code to remove tabs and line up comments nicely.

spiralofhope

The Secret of Fast LZW Crunching

Introduction ∞

Normal Operation ∞

A Simple Idea of Improvement ∞

Life is not so simple at all ∞

A tour de Memory ∞

The source part 1 ∞

More sources ∞

The source part 2 ∞

Summary and final analysis ∞

2 comments on “The Secret of Fast LZW Crunching”

Leave a Reply Cancel reply

Search