NES Emulation Saga - Writing a NES emulator Part II: the PPU

This is a series (part1, part3) of articles on the development of my own NES emulator. It is written in a mixture of C and C++, uses Allegro 4, and is hosted on Bitbucket (link to repository).

The PPU is the NES's Picture Processing Unit, and emulating it is fundamental to a NES emulator. The first thing a game usually does is check PPU registers in a loop to wait for two frames. After the reset code runs, many games then go into an infinite loop and have everything run in the vblank interrupt handler:

    ; Clear the vblank flag if it was set at reset time
    bit PPU_STATUS
    ; Wait 2 vblanks
-   bit PPU_STATUS
    bpl -
-   bit PPU_STATUS
    bpl -
    ; Rest of init code
    ; [...]
loop:
    jmp loop
vblank:
    ; Most of game code
    rti

Main emulator loop

Therefore, the main emulator loop ends up structured around the PPU, and looks like this:
unsigned int cyclesCount = 0;
const unsigned int scanlineCycles = 113; //CPU cycles per scanline, rounded down
for(int scanline = 0; scanline < 262; scanline++){
    switch(scanline){
    case 0 ... 239: //Visible lines (case ranges are a GCC extension)
        RenderScanline(nes_mem, scanline, screen);
        break;
    case 241: //First vblank line
        if(nes_mem.ppu[PPUCTRL] & PPUCTRL_NMI)
            machine.NMI();
        nes_mem.ppu[PPUSTATUS] |= PPUSTATUS_VBLANK;
        //Usually, since it is the most predictable state,
        //emulators do savestate saving/loading here also
        break;
    case 261: //Last line before rendering starts anew
        nes_mem.ppu[PPUSTATUS] &= (255u - PPUSTATUS_VBLANK);
        nes_mem.ppu[PPUSTATUS] &= (255u - PPUSTATUS_SPR0HIT);
        break;
    default:
        break;
    }
    //Run the CPU for one scanline's worth of cycles;
    //any overshoot is carried into the next scanline
    while(cyclesCount < scanlineCycles){
        int cycles = machine.DoStep();
        cyclesCount += cycles;
    }
    cyclesCount -= scanlineCycles;
}
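A note on the 113-cycle budget: on NTSC hardware a scanline is 341 PPU dots and the CPU runs one cycle per 3 dots, so a scanline is really 113 and two thirds CPU cycles. An integer budget of 113 therefore runs about 29606 cycles per frame instead of roughly 29780. One possible way to avoid the drift, sketched below, is to keep the budget in PPU dots. This is only an illustration: FakeDoStep is a hypothetical stand-in for machine.DoStep(), and RunFrameCycles is not part of the emulator.

```cpp
// Hypothetical stand-in for machine.DoStep(): pretends that every
// instruction takes exactly 2 CPU cycles.
static int FakeDoStep() { return 2; }

// Runs one frame's worth of scanlines and returns the CPU cycles executed.
// Assumes NTSC timing: 341 PPU dots per scanline, 3 dots per CPU cycle.
unsigned int RunFrameCycles()
{
    const int dotsPerScanline = 341;
    const int dotsPerCpuCycle = 3;
    int dotBudget = 0;               // PPU dots the CPU still has to cover
    unsigned int totalCycles = 0;

    for(int scanline = 0; scanline < 262; scanline++){
        dotBudget += dotsPerScanline;
        while(dotBudget > 0){
            int cycles = FakeDoStep();
            dotBudget -= cycles * dotsPerCpuCycle; // overshoot carries over
            totalCycles += cycles;
        }
    }
    return totalCycles;
}
```

Counting in dots keeps the fractional third of a cycle from being dropped on every scanline, at the cost of one extra multiplication per instruction.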

Scrolling storm

The main difficulty in emulating the PPU is the complex behaviour around scrolling and register writes. This seems to have been first documented well in a famous 1999 document called "SKINNY.TXT", now well known as loopy's PPU doc. It is described in more detail on the NESdev wiki pages PPU rendering and PPU scrolling.

Rather than emulate it closely, my first (and quite successful) attempt was a less accurate, hacky, trial-and-error emulation:
class NESMemory : public Memory {
public:
    uint8_t ram[0x800];
    uint8_t ppu[8];
    //[...]
    uint8_t ppu_pal[0x20];
    uint16_t ppu_addr;
    uint8_t ppu_x, ppu_y, ppu_latch;
    //[...]

void NESMemory::Write(const MemoryByte &mb)
{
    //[...]
    if(mb.ptr == ppu + PPUDATA){
        GetPPU(ppu_addr) = ppu[PPUDATA];
        if(ppu[PPUCTRL] & PPUCTRL_INC)
            ppu_addr += 32;
        else
            ppu_addr++;
    }
    if(mb.ptr == ppu + PPUADDR){
        if(!(ppu_latch & 1)){
            //First write: high byte of the address
            ppu_addr &= 255u;
            ppu_addr |= ppu[PPUADDR] << 8u;
            //Hack based on loopy's PPU doc
            ppu[PPUCTRL] &= 0xFFu - PPUCTRL_NAMETBL;
            ppu[PPUCTRL] |= (ppu[PPUADDR] >> 2u) & PPUCTRL_NAMETBL;
        }
        else{
            //Second write: low byte of the address
            ppu_addr &= 255u << 8u;
            ppu_addr |= ppu[PPUADDR];
        }
        ppu_latch++;
    }
    if(mb.ptr == ppu + PPUSCROLL){
        if(!(ppu_latch & 1)){
            ppu_x = ppu[PPUSCROLL];
        }
        else{
            ppu_y = ppu[PPUSCROLL];
        }
        ppu_latch++;
    }
    //[...]

void NESMemory::Read(const MemoryByte &mb)
{
    if(mb.ptr == ppu + PPUSTATUS){
        ppu[PPUSTATUS] &= (255u - PPUSTATUS_VBLANK);
        ppu[PPUSTATUS] &= (255u - PPUSTATUS_SPR0HIT);
        //Reading PPUSTATUS also resets the write latch
        //shared by PPUADDR and PPUSCROLL
        ppu_latch = 0;
    }
    if(mb.ptr == ppu + PPUDATA){
        ppu[PPUDATA] = GetPPU(ppu_addr);
        if(ppu[PPUCTRL] & PPUCTRL_INC)
            ppu_addr += 32;
        else
            ppu_addr++;
    }


While this is still quite simple, with the hack mentioned in the comment even Super Mario Bros, a notoriously difficult game to emulate, plays fine. The way the renderer calculates the addresses for graphical data is also quite hacked together:
for(unsigned int scr_x = 0, x = nes_mem.ppu_x; scr_x < 256; scr_x++, x++){
    const unsigned int coarse_x_lo = (x >> 3u) & (bit5 - 1u);
    const unsigned int coarse_y_lo = ((y >> 3u) % 30);
    const unsigned int coarse_x_hi = (x >> 3u) >= 32u; // (x >> 3u) >= 32;
    const unsigned int coarse_y_hi = (y >> 3u) >= 30u;
    const unsigned int fine_x = x & (bit3 - 1u);
    const unsigned int fine_y = y & (bit3 - 1u);
    const unsigned int nametbl = nes_mem.ppu[PPUCTRL] & PPUCTRL_NAMETBL;
    unsigned int bg_name_addr = (coarse_x_lo)
        | (coarse_y_lo << 5u)
        | ((nametbl << 10u) ^ ((coarse_x_hi << 10u) | (coarse_y_hi << 11u)))
        | bit13;
    const unsigned int bg_name = nes_mem.GetPPU(bg_name_addr);
    const unsigned int bg_table = (nes_mem.ppu[PPUCTRL] & PPUCTRL_BGADDR) >> 4u;
    const unsigned int bg_plane0_addr = fine_y | 0 | (bg_name << 4u) | (bg_table << 12u);
    const unsigned int bg_plane1_addr = fine_y | bit3 | (bg_name << 4u) | (bg_table << 12u);
I suppose this must look terrible to someone familiar with how the NES PPU operates, or to the author of a more accurate emulator.
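To make the address arithmetic easier to check in isolation, here is the same calculation pulled out into two standalone functions. The function names are hypothetical, and the sketch assumes that each bitN constant equals 1u << N and that nametbl is the 2-bit base nametable index from PPUCTRL.

```cpp
// Assumed meaning of the bitN constants used by the renderer.
constexpr unsigned int bit3 = 1u << 3, bit5 = 1u << 5, bit13 = 1u << 13;

// Nametable address of the tile under pixel (x, y), where x and y
// already include the scroll offset.
unsigned int NameTableAddr(unsigned int x, unsigned int y, unsigned int nametbl)
{
    const unsigned int coarse_x_lo = (x >> 3u) & (bit5 - 1u); // tile column 0..31
    const unsigned int coarse_y_lo = (y >> 3u) % 30u;         // tile row 0..29
    const unsigned int coarse_x_hi = (x >> 3u) >= 32u;        // crossed into the next nametable horizontally
    const unsigned int coarse_y_hi = (y >> 3u) >= 30u;        // crossed into the next nametable vertically
    // The XOR flips the base nametable bits when a boundary was crossed.
    return coarse_x_lo
         | (coarse_y_lo << 5u)
         | ((nametbl << 10u) ^ ((coarse_x_hi << 10u) | (coarse_y_hi << 11u)))
         | bit13;
}

// Pattern table address of one row of one bitplane of a tile.
unsigned int PatternAddr(unsigned int tile, unsigned int fine_y,
                         unsigned int bg_table, bool plane1)
{
    return fine_y | (plane1 ? bit3 : 0u) | (tile << 4u) | (bg_table << 12u);
}
```

For example, NameTableAddr(0, 0, 0) is 0x2000, the first byte of the first nametable, while x = 256 with the same base lands in the nametable at 0x2400, and with base nametable 1 it wraps back to 0x2000.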

Glitches in graphical detail

I already shared some interesting, glitch-art-like screens on Twitter when I first started coding this.

Before I implemented reading from the PPU, which is something a few, mostly older, games do, some games would lack collision with the world. In PacMan, famously, you can go right through walls. I don't think it looks as ridiculous or amusing as Mappy, however:


Limitations of the approach

The approach to scrolling mentioned above works well for games that scroll horizontally, but it seems to fail when games do vertical scrolling at a split point, a more involved technique that requires both PPUSCROLL and PPUADDR writes in a particular sequence. This issue can be easily seen in The Legend of Zelda and in Duck Tales:


I don't think it is worth sticking with this hacky method; it makes more sense to rewrite the PPU to do the correct thing. It shouldn't be too much work, but reading all the specs with great attention to detail will be annoying.
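As a rough idea of what the correct thing involves, the NESdev wiki's PPU scrolling page describes the register writes in terms of four internal registers: v, t, x and w. The sketch below follows that description. It is not code from the emulator, and ScrollRegs, its methods and SetSplitScroll are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical struct; the v/t/x/w names follow the NESdev wiki.
struct ScrollRegs {
    uint16_t v = 0;     // current VRAM address (15 bits)
    uint16_t t = 0;     // temporary VRAM address (15 bits)
    uint8_t  x = 0;     // fine X scroll (3 bits)
    bool     w = false; // write toggle shared by PPUSCROLL and PPUADDR

    void WriteCtrl(uint8_t d)   // PPUCTRL: nametable select goes to t bits 10-11
    {
        t = (t & 0xF3FF) | ((d & 0x03) << 10);
    }
    void ReadStatus()           // PPUSTATUS: resets the write toggle
    {
        w = false;
    }
    void WriteScroll(uint8_t d) // PPUSCROLL
    {
        if(!w){ // first write: coarse X into t, fine X into x
            t = (t & 0xFFE0) | (d >> 3);
            x = d & 0x07;
        }
        else{   // second write: fine Y and coarse Y into t
            t = (t & 0x8C1F) | ((d & 0x07) << 12) | ((d & 0xF8) << 2);
        }
        w = !w;
    }
    void WriteAddr(uint8_t d)   // PPUADDR
    {
        if(!w){ // first write: high 6 bits of t, top bit cleared
            t = (t & 0x00FF) | ((d & 0x3F) << 8);
        }
        else{   // second write: low byte of t, then t is copied to v
            t = (t & 0xFF00) | d;
            v = t;
        }
        w = !w;
    }
};

// The PPUADDR, PPUSCROLL, PPUSCROLL, PPUADDR sequence the wiki describes
// for changing both scroll axes in the middle of a frame.
void SetSplitScroll(ScrollRegs &r, uint8_t scrollX, uint8_t scrollY, uint8_t nametbl)
{
    r.ReadStatus(); // make sure the write toggle starts cleared
    r.WriteAddr(static_cast<uint8_t>(nametbl << 2));
    r.WriteScroll(scrollY);
    r.WriteScroll(scrollX);
    r.WriteAddr(static_cast<uint8_t>(((scrollY & 0xF8) << 2) | (scrollX >> 3)));
}
```

After that sequence v holds fine Y in bits 12-14, the nametable in bits 10-11, coarse Y in bits 5-9 and coarse X in bits 0-4. Because the toggle is shared between the two registers, keeping PPUSCROLL and PPUADDR as separate variables cannot reproduce this interleaving, which would explain why the split-point scrolling breaks.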

(This hasn't been completed as of this writing)