Lesson
M11 - Simple RLE
There will certainly be times when memory is short - and one option
is compression!
RLE is one of the simplest compression techniques, and
is used by formats like GIF images...
Lets learn how to use a simple RLE decompressor to show an image
onscreen
CPC_RleByte.asm SMS_RleByte.asm
RLE - Run Length Encoding
Run Length encoding works when consecutive bytes are the same -
suppose we want to compress the string "111100000223335" (15
characters)
We could do this by storing a letter - and a count for that letter.
this would result in a string '1105223351' (10 characters)...we've
saved a few characters!
We'd also want a way of storing bytes that are not the same (we'll
call it linear data)... lets look at the real algorithm
111100000223335 1x40x52x23x35x1 1405223351
Our actual format is more advanced, well a bit!
We'll have a 'header byte' -
If the whole byte is zero - then we're reached the end of the
compressed sequence
If the top bit is 1, the next sequence is RLE compressed... the
remaining bits are a bytecount (1-127)
If the top bit is 0, the next sequence is LINEAR uncompressed
data... the remaining bits are a bytecount (1-127)
Compressed data:
Uncompressed Data:
If you want to RLE compress a file in this format, you can use
AkuSprite Editor, the FileProcessor option
will compress in RLE format
The RLE
compression alrorithm here certainly isn't the best! it was
shamelessly stolen *ahem* borrowed from Phantasy Star!...
At the request of a patreon, the author of these tutorials did a
disassembly of the routines here
... This tutorial is based on what
was learned!
Z80 Decompressor
When we want to decompress data, we need
to specify a source in HL... and a destination in DE
The RLE decompressor is split into 3 parts... the Header
byte processor detects if the next block is Linear, Rle, or if
the stream had ended...
We use JP M, This is 'Jump if Minus' - effectively this
tests the top bit of the registers, which is what we need to detect
the RLE block
The Linear section handles uncompressed data.
The RLE section handles the compressed data
Of course we'll need our Compressed RLE data, and some storage
space for the destination after it's decompressed.
Here is the decompressed data.
Of course we can compress anything!... here's a simple CPC screen!
This screen is 16k... but compressed it's just 2.7k!
Of course that's only because it's so basic, but you get the point!
Some
Data will RLE compress well - others will end up bigger than they
started!
You need to pick what data to compress, and what to leave alone,
if possible, design your graphics to be simple if you need to RLE
compress them in this way
SMS Bitmap data - splitting bitplanes for better RLE
Lets use our RLE compressor to compress the bitmap data of the
Chibiko mascot we used in a previous example...
Unfortunately, the raw format of the file is 'problematic'
Looking ad the data, it looks a bit complicated, and maybe like it
won't compress so well (We would save just 3 bytes)... but we need
to think of what the data represents - and use our decompressor
accordingly!
Our graphics is 16 color - and is split into 4 bitplanes... Bitplane
0, Bitplane 1,Bitplane 2,Bitplane 3
Because Chibiko is only 4 color, bitplanes 3 and 4 are totally
unused... If we split up our file, and store each bitplane in a
separate RLE we'll get far more efficiency.
Splitting the file in this way reduces the file size by over 50%!
We can create 4 files split in this way with the 'Split
Four Bytes' option in Akusprite Editor
Filling One Bitplane of the SMS with RLE
Again The RLE decompressor is split into 3 parts... the Header
byte processor detects if the next block is Linear, Rle, or if
the stream had ended...
We use JP M, This is 'Jump if Minus' - effectively this
tests the top bit of the registers, which is what we need to detect
the RLE block
The Linear section handles uncompressed data.
The RLE section handles the compressed data
Note that we're OUTing the decompressed data to the vdp, and first
we're selecting the VRAM address in DE ... DE must contain VRAM
address +&4000 (&40000 denotes WRITE)...
We add 4 to DE each time, so we're sending the decompressed data to
just one bitplane
We need to include all 4 bitplane files in our source code.
When we want to stream the data to the 4 bitplanes, we load the
starting VRAM address in DE, and the RLE data in HL
We repeat, backing up the destination address before each loop, and
INCing the Vram address for each consecutive bitplane
HL will point to the start of the next bank of RLE data after each
execution, so we don't need to change it between bitplanes.
The image will look the same, but it took less memory!
We've not saved
much memory, but it's only an example of how you can use RLE
compression, and change the way you're writing (OUTing) data to
VRAM to optimize the data you're sending.
If you're doing a small game it's not going to be a problem, but
if you're making a bigger game, you may need to keep things as
small as possible.
Lesson
M12 - Stack Tricks!
We've looked at the stack before and discussed some tricks we can do
with it...
Lets look in more detail, and learn how we can manipulate the stack
for parameter passing, allocating temporary ram and more!
Test_StackFun.asm
Passing parameters to Subroutines
If we want to save program code and we want to pass parameters to
a function, rather than loading them into registers, we can store
them in the code after the sub with DB and DW statements...
We can read these in from within our sub, and modify the return
address to skip over the parameters so code execution resumes after
the parameters
To get the address of the parameters, we transfer the top of the
stack into IX...
We then use IX to read in the parameters.
Finally we alter IX to skip the parameters,
adding enough bytes to jump over the parameters, and put the new
return address back on the stack
A and HL have been loaded in by the sub
Another way of passing parameters to a function is to push
parameters on to the stack... this is handy if we need the
registers for other purposes.
We get the stack pointer into IX and use it to read
in the parameters.
Once we've taken the parameters, We need to move the return address
to the top of the stack with the parameters removed
Here are the results.
Returning parameters on the stack
Of course, we can reverse the procedure, taking the return
address off the stack, putting some
parameters on the stack, then putting the return
address back on the stack
We can then read the parameters back after
the sub finishes
Here are the results
Allocating Temp Ram on the stack
If we need a small amount of temporary ram for our subroutine,
rather than having a permanently allocated bank, it may make more
sense to allocate it on the stack, and set IX or IY to this
allocated data.
In this example, we've allocated six bytes...
when we're done, we reset the stack pointer
The data will be on the stack... above the
return address... so make sure you fix the stack
before returning!
Allocating a
few bytes of temp ram with a label will of course be much easier,
but what about if you're subroutine will need to be nested?...
you'd need to allocate those bytes for each iteration of the
function
By using the stack, once the function completes, all the bytes are
freed - just make sure your stack isn't too small - or something
may get overwritten!
Return as a Jump!
Lets take a look at a simple test program...
This program has a large number of branches, all of which end up
with a JumP to CaseThanks
a JP command takes 3 bytes each... and if space is super tight this
may be a problem if there are a lot of them... but a RET command is
just one byte
If we push an address onto the stack - then
any RETurn command will effectively jump to
that address - if there are a lot of Jumps we can save a few
bytes...
If there are some cases we don't want that jump to occur we can just
POP the return address off the stack.
This is just a
simple example of what you can do with the stack and RETurn... you
can of course do much more... like simulating a JMP (BC)
command - jump to a numbered entry in a vector table, or
whatever you can figure out the code for!
Of course, these are all advanced 'tricky' things, so don't worry
about them if you're starting out.
Jump with RST (Like making our own command!)
It's not really related to the
stack... but if we're looking for ways to save a byte or too, don't
forget RST!
There are 8 RST's, which effetively call to low addresses... on the
CPC RST6 is reserved for the user to use... that means we can do
with it what we want,without harming the firmware.
RST6 (RST &30) is a 1 byte command... and in this case we're
setting it to JP RstTest... As RST is a
call, this means RST6 is equivalent to CALL RstTest
Our 'RstTest' function does the same as our earlier example - taking
a word from the following bytes as a string to show to the screen...
Effectively we've made our RST6 into a
'PrintString' command.. it prints the following word- and this saves
3 bytes compared to 'LD HL,STRING / CALL RstTest'
Here is the result
Stack Misuse (For fast Read/Write)
We've covered it before - but
lets look again at Stack Misuse... this is where we use the stack
for bulk reading or writing...
We just point the stack pointer to the bottom of
the area we want to read, or top of the area we want to write,
and PUSH or POP away.
Of course we can't use the stack for anything else when we're doing
this, so no CALLs!
We need to restore the stack pointer once
we're done so we can CALL again!
This example copies the screen, scrolling the screen up 8 lines
The routine will cause a vertical scroll of 8 lines when we run
it!
We used the stack to copy the whole screen quickly
If
you're misusing the stac kin this way, you'll probably want to
disable interrupts first!.
Because interrupts are effectively CALLs, and calls use the stack,
they may draw corruption to your screen, and make quite a mess!
If your super clever you may be able to work around this - but
it's very tough so it's best to disable interrupts and make your
life easy!
Lesson
M13 - Fast Multiplication and Division.
We've looked at some very simple Multiplication and Division before,
which just repeatedly added or subtracted a value... that's one
solution, but it's pretty slow.
Lets look at some better ways to do things!
MultDivTest.asm
Multiplication
Lets look at
Multiplication!....
If we want to multiply 102 by 10 by hand, we would multiply 10 by
each of the digits in 102 (or the other way round!)
We can do the same in assembly...however instead we'll work in bits
- bit-shifting the value we want to multiply, and adding the amount
we want to multiply by each time.
This function will effectively set HL = HL*A
Through each iteration of the program, we rotate the bits in A to
the right... causing one bit from the right to be copied into the
Carry
If the bit is 1, we add DE to HL
Whether it was or not, we double DE - setting it to the correct
value to add for the next bit.
Here we can see what happens to the registers... on the right is
the bits in A and the Carry (C)
When the Carry reaches 1 - we add the current value of DE (1020) to
HL... this effectively multiplies the value that WAS in HL at the
start of the function into by the bit position (now it's shifted
into the Carry)
We multiplied &10 by &0102...
the result was &1020
Division
Lets Divide 1023 by 10
When we do division by hand, we need to work out how many times we
can take 10 from each digit - starting from the left moving to the
right - any remaining value is the remainder.
Of course, we'll have to do things in bits when it comes to our ASM
version.
In our example we'll perform HL=HL/A... where HL=1023 and A=10
We actually work sort-of backwards with the division - rather than
moving A right through the bits of HL, we'll move HL left (by adding
HL to itself - doubling it).
We'll move HL LEFT - popping the leftmost (top) bit into A... with
each shift we'll also double A - when A is greater than, or equal to
the amount we wanted to divide by (10) we subtract that amount - and
add 1 to HL
That '1' that was shifted into HL will continue shifting with HL -
until we've rotated HL round a full 16 bit 'loop'...
At the end HL will have been divided by the value (10)... and A will
now contain any remainder!
Here's the registers... and a bit representation of HL on the
right.
We move the bits left in HL (Blue highlighted bits) -and when a bit
gets pushed out, we move it into A (Yellow Highlighted)
A also shifts left each iteration, and once this exceeds the divisor
we add 1 to to HL (shown Yellow)
We repeat this loop 16 times - HL ends up containing the whole
number part... and A is the remainder.
We divided HL by A
The result is in HL... the remainder is in A
This is just
two examples of a dividing a 16 bit number by an 8 bit one - if
you need more, check out Grauw's
siteWhich has many other z80
algorithms for multiplication and division with great
documentation!
Lesson
M14 - SpeedTile software Tilemap Engine (1/3) - (Basic Tilemap)
It's time to start a new series... We're going to look at a
multiplatform software Tile/Sprite Engine
This allows for scrolling tilemaps, and even sprites with clipping
(meaning they can be partially offscreen) of pretty much any size.
Lets Learn how!
SpeedTile is a software tilemap... it uses 2 frame buffers to
facilitate software scrolling, and works at 256x192 on most systems.
It breaks up images into 8x8 tiles, and has optimized drawing
routines which are designed to work 'around' the screen co-ordinate
difficulties of each system.
The tile code has the potential to use multiple drawing engines for
flexibility - Normal (PSET tiles) Fill (solid colors for fast tiles)
Zero Transparent (byte zero not drawn) and Transparent (Tile
completely unused)
Tilemaps can be any size, and small 'tilemaps' can be used as
sprites - as the code allows tile reuse (compression) and flood fill
for solid color tiles (speed) and flexibility (small areas
transparent only when needed) using the tile code for sprites can be
extremely efficient.
Software
scrolling means we'll use two screen buffers, and redraw the
entire screen every frame - this is slower than Hardware
scrolling, but it's much easier and is less platform specific.
Many systems cannot hardware scroll (SAM/Speccy) and systems like
the CPC have limited ability - plus Hardware scroll is difficult,
as memory addresses change every time the screen moves, and if a
lot of stuff moves onscreen, it could even end up slower than soft
scroll as we'd end up redrawing everything anyway.
And beware! If you're struggling to understand the software scroll
tutorial here... you'll really find hardware scroll too hard!
The Logical Tilemap
There are 3 kinds of tilemap
we need to consider.
The Visible tilemap.... What we see onscreen
The Logical tilemap... this is a set of pointers to the data used to
draw the screen - it's slightly bigger than the screen to allow for
scrolling
The Supertile map... this is the 'compressed' levelmap... it's many
time bigger than the screen.
The logical tilemap is pretty simple... each 8x8 tile is represented
by 1 word in the format:
%AAAAAAAA AAAAAATF
Where A is the address of the bitmap data in ram for the 8x8 tile
(with the bottom 2 bits as 00)... eg &FF84
F=1 denotes a filled tile that uses 2 bytes to fill the entire tile
(Faster)... eg &FF84+1
F=1 and T=1 denotes a transparent tile ... eg &FF84+3
A=00000000 + F=1 denotes an empty tile - the tile is not drawn -
skipping this 8x8 block... eg &0+1
We use symbols to simplify defining these.
The Tilemap is rotated 90 degrees clockwise... so the XY
co-ordinates of data in the Logical tilemap go DOWN then ACROSS...eg
DW [Y0 X0],[Y1 X0],[Y 2X0]
DW [Y0 X1],[Y1 X1],[Y2 X1]
DW [Y0 X2],[Y1 X2],[Y2 X2]
This is because the drawing order of the screen goes down then
across.
Because
speed is paramount the logical tilemap code is platform
specific,however they're all pretty similar.
We'll take a look at the CPC version here, but the SAM and Speccy
version are pretty similar, it's just 'newline calculations'
differ
Drawing the Logical Tilemap
We need to draw the tilemap... we set DE to the first byte of the
logical tilemap to draw.
A/C are the XY offset (in 1/4 tiles)... to move a full tile we need
to alter the DE offset.
The CPC can only move the tilemap in 4 pixel chunks... ***
We need to calculate the base VRAM address of the top-left corner of
the tilemap - we use IX to calculate the start address in VRAM to
draw.
Our Logical tilemap will usually be taller than our visible
screen... Our Screen is 24 tiles tall - but what if our Logical
tilemap is 32 tiles tall???
After drawing each vertical strip to screen, there will be 8 tiles
unused... we'll need to skip over them... we calculate the amount to
skip here for later.
We set IYH and IYL for the end of the tilemap sizes.
This is consistent for the tilemap... but will be different when we
use the tilemap code for sprite drawing.
*** NOTE: The 4
pixel vertical shift is actually handled by an alternate version
of this module "CPC_V1_SpeedTile_ShiftedTile.asm"... this is
because the routine is optimized for 8x8 blocks, and the
optimizations are different for an effective 4 pixel vertical
offset... using extra ram for a second version makes more sense
than trying to write a single version that works for both
positions and lose speed we don't have.
Starting The Tile Draw
We use the Stack for fast reading during the main loop, so we back
up the stack pointer at the start.
shadow C is used for fast calculations on the CPC to move down a
line (add 8 to Vram H)
Note we're using shadow registers to allow less use of ram to store
paramters
At the start of a line, we reset the screen memory position to
draw the next vertical strip
We need to draw a tile... we load in the address of the bitmap
data for the tile from HL
we check the bottom bit of the address... if it's 1 this is not a
'normal' tile - it's either Filled or Transparent
Drawing a bitmap tile
we're drawing a normal tile.
We set the stack pointer to the start of the Tile bitmap data and
POP a pair of bytes from the data (one lines worth on the CPC)
We write the first byte to the screen and move right (INC L) then we
write the second byte.
we then move down a line ... we add 8 (from C) to the accumulator
(Which contains H) - we then update H
We then repeat the process... this time we write the bytes in
reverse to save unnecessary moves.
We repeat this for all 8 lines of the tile.
Finishing the tile
When a tile is complete we need to fix up HL... HL will now point
one tile below the last one - ready for the next tile.
We check B and repeat until the strip is done.
Once a strip is done, we skip any unused tiles (The tilemap will
usually be taller than the visible screen)
We need to move 8 pixels to the right... we INC C twice... moving
across the screen two bytes (8 pixels) and repeat to do the next
vertical strip
When there are no strips left, we restore SP and return.
The Solid tile
The Solid tile is basically the same, however alternating lines
are filled with just 1 byte - this gives the tile a filled color
(with checkerboard pattern color)
Because there's just one bytepair read It's faster than than the
regular draw
The Empty tile
Empty tiles have no drawn pixels - we just change HL to move down
to the next tile.
Transparent tiles (Zero Byte)
The 'Transparent' tile version is for sprite drawing.
It's basically the same as the regular tile routine, but before each
draw we test the byte, if it's zero, we don't put it onscreen.
this is used for the sprite drawing routines we'll look at in a
later episode.
Lesson
M15 - SpeedTile software Tilemap Engine (2/3) - (SuperTile map)
The logical tilemap is quick to draw, but using 2 bytes per tile is
a problem... if our game world used 256x128 tiles then the total
tilemap would be 64k - our entire memory!
To solve this, we use 'Supertiles' - each one contains 4x4 sprites
(16) ... making our 256x128 tilemap out of 1 byte supertiles takes
just 2048 bytes!
SpeedTileSprite.asm
V1_SpeedTile_SuperTileMap.asm
Supertile definitions and SuperTile Map
The SuperTile map uses 1 byte per supertile defining the tile
number
The tilemap is organized horizontally, so each byte has the
position:
db [X0 Y0],[X1 Y0],[X2 Y0]
db [X0 Y1],[X1 Y1],[X2 Y1]
db [X0 Y2],[X1 Y2],[X2 Y2]
Each Supertile defines a 4x4 square of tiles, so it's definition
is 16 tiles / 32 bytes in total
Using the SuperTileMap code
We're going to update the Visible/Logical tilemap based on a
'World position'
This uses a 16 bit X and Y co-ordinate whic splits into 3 parts....
%--------SSSSOOHQ
The Top byte -------- is unused (our world is
only 16 supertiles wide)
The SSSS bits are the position in the
SuperTilemap for the top corner of the Logical Tilemap OO is the tile offset of the Logical tilemap in
whole tiles HQis the offset of the
tile in Half/Quarter tiles (2 pixel blocks)... The SAM can use
quarter shifts... the CPC can only do half shifts... the SPECCY can
only do full block shifts
We have a single routine 'CalculateAndDrawTilemap' which will
update the Logical tilemap as required, and draw a full screen
tilemap.
IY needs to point to the variables defining the current position,
and last calculated position
Filling the Logical Tilemap from the Supertile Map
The Logical tilemap is 4 tiles (1 supertile) larger than the
visible tilemap on all sides, this means we can scroll up to 4 tiles
in any direction before updating the logical tilemap.
First we check the current position within the Supertilemap... if
the position of the currently calculated logical tilemap has not
changed we don't need to do anything.
If it HAS , we need to refill the Logical tilemap from the super
tile map
We need to convert the XY position within the supertile map into a
byte position within the supertile map.
The sample tile map is 16 tiles wide... so the offset address is:
SuperTileMap+(Ypos*16)+(Xpos)
We're going to transfer supertiles into the logical tilemap... we
load in a supertile from IX... and get the supertile definition from
"SuperTileDefs"
Each supertile is 4x4 tiles - 2 bytes each (32 bytes)
We need to transfer all all 16 bytes of each of the 4 lines of the
supertile definition into the tilemap.
We read data from the supertile definition by POPing it out
we repeat for all 4 lines
We repeat moving along the logical tilemap
We've done one 4 tile tall row of the Logical tilemap, we need to
move down a row and repeat.
Once the Logical tilemap fill is complete we restore the stack
pointer
You'll notice in
some parts of the code there are two versions - one that uses self
modifying code, and the other that uses symbols for other memory
addresses.
Self-Modifying is faster, but doesn't work if we're working from
ROM - we want both options so we can best server RAM systems like
the CPC, and ROM systems like the SMS
Calculating the Logical Tilemap area to draw.
We need to calculate the 'whole tile' offset of the logical
tilemap (Defined by bits 2,3 of our XY tilemap pos)
Each Tile contains two bytes, so to move down 1 tile, we multiply
the offset by two
We add this to the address of the start of the logical tilemap
Because the tiles are organized vertically in the logical tilemap,
To move across one tile, we need to add the height of the tilemap
(32) * the number of bytes per tile (2)
So we multiply the Xpos by 64, and add this to the address of the
start of the logical tilemap
We get the X 'Quarter tile' offset into A... and the Y offset into
C (for systems that can use it)
Due to the screen layout on systems like the Amstrad CPC it's not
easy to vertically shift the screen - to work around this, we have
an alternative draw routine to do a 4 pixel vertical offset.
Lesson
M16 - SpeedTile software Tilemap Engine (3/3) - (Software Sprites
& Sprite Clipping)
The Tilemap engine can be used to draw areas of any size, we can
therefore use it to create small 'Sprites' using the same graphics
code.
This will draw a small graphic, but we'll need to deal with the
instances where the sprite is partially onscreen - this is known as
'Clipping'
SpeedTileSprite.asm
V1_SpeedTile_SuperTileMap.asm
Drawing a sprite
Here's a 'Chibiko' sprite drawn with the Sprite/Tile engine...
because the graphics is being done by the tile code, we have all the
Tile options - Transparency, 8x8 fill areas and the possibility to
'reuse' 8x8 tiles without increasing the size of the memory
footprint.
The sprite can be any size, but it must be a multiple of 8x8 (of
course part can be transparent.
The DrawSprite routine is called to show the sprite to screen...
We pass a pointer to the 'Sprite Tilemap' in DE... a set of pointers
in the same format as the logical tilemap - one word per 8x8 tile.
We pass the X,Y pos in B,C
We specify the Width,Height (in Tiles) in H,L
This example automatically makes the Chibiko character move towards
the bottom right of the screen.
The Sprite needs to be defined in the same format as
the 'Logical Tilemap'... a grid of pointers to the tile data
Drawing a sprite - Cropping Check
The X,Y pos is NOT specified
in tiles - the co-ordinates are specified in pairs of pixels
'quarter tiles'... therefore our screen is 128x96 in logical units
So we can have clipping 0,0 is NOT the top left visible pixel of the
screen - there is an 'offscreen' area around the visible screen...
the first visible unit is 64,80 and the last is 192,176
We need to look at the position of our sprite, and see if it needs
any of its sides cropping
Sprite clipping takes time, so we'll use a marker (Shadow reg D) to
note if anything needs doing...
IX/IY registers will store the amount of cropping needed.
IXL - is the pixels to remove from Left
IYL - is the pixels to remove From Top
IXH - is the pixels to remove From Right
IYH - is the pixels to remove from Bottom
We can only crop full tiles ... there is a 1/2 or 3/4 tile border
around the tilemap so this is not noticeable!
For each direction we check if any tiles need to be cropped, and
store the number of tiles to remove - we'll calculate how to do this
later.
If the result of the 'crop' is no tiles are onscreen then we return
without drawing anything.
We check the Shadow D register, and run the crop routine if
required.
We also set the 'Y unused tiles' to zero - this will need changing
later if cropping occurs.
Drawing a sprite - Cropping
Because the Logical Tilemap uses 2 bytes per tile and goes down
the tilemap, to crop the left we have to remove an entire column of
tiles (2 * L)
To crop a row from the top, we move down the tilemap by 2 bytes per
tile... but we then need to skip those (in TileY unused)
To crop a row from the bottom, we reduce L... but we then need to
skip those (in TileY unused)
To crop the right hand side, we just decrease H
Hardware Sprite Support
If we're on a system like the
SMS we could use hardware sprites to draw the graphic - provided
we've not yet used all the hardware sprites this frame... we check
and branch to a special 'Hardware sprite' routine if possible
Drawing the sprite
We need to calculate the VRAM address of the first tile to draw.
This is Platform specific, there is a version for each system.
We've calculated the area to draw, and the address to draw to.
We run DrawTiles to do the job...
There's also an 'alternate' version for the CPC - which needs a
special version to draw with a vertical 4 pixel offset (due to
screen ram layout)
Using the
Tilemap code for sprites works surprisingly well!... Firstly
there's only one piece of code to port to new systems, and
splitting a sprite up into 8x8 blocks gives us opportunities for
saving speed and space... by making only small areas transparent -
reusing repeating parts, and solid filling blocks where possible.