I also dispute your assertion that NES games could not have a large scale or a ton of text. It took me almost 100 hours to beat Dragon Warrior 3 back in the day, and it was filled with countless towns of NPCs who all had something to tell me. I think SMB3 (or late NES games like StarTropics 2) pretty much proved that if the developer was willing to pay to put the right hardware in the cart, the NES could match the depth of any 16-bit console game.
I built a retro-console back in 2014 (basically a PC preloaded with emulators) and BY FAR my favorite old console to play today is the NES, for the reasons outlined. The graphics look so clean and fantastic on my big plasma that I am not just playing games for nostalgia; I actually beat a few I never beat back in the day just because it was such a pleasant experience.
Dragon Warrior 3 was near the end of the NES life cycle when carts were starting to push 512k to 1 MB+ (How big was DW3? I can't look right now). 512k+ was a lot for a system which implemented most games in 8-64k.
Dragon Warrior 3 was awesome.
I remember stealing the poison needle at night and save scumming the hell out of the monster gambling for infinite money at the start of the game.
DW3 and FF3 were both massive for 8-bit NES games. They were some of the first games to have second world maps and such, and they started breaking some 8-bit memory limitations once ROMs got bigger and cheaper and more advanced mappers could bank all that data into the allotted 32k PRG-ROM space.
While 1 MB or so was the largest NES ROM, and that only came near the end of the NES, it was where 16-bit ROM sizes started. The two biggest consumers of ROM space on the NES were text strings and level maps. And you couldn't really compress them a whole lot because you had a roughly 1.79 MHz (NTSC) 8-bit CPU with 2K of RAM, one quarter of which is taken up by the 6502's zero page ($0000-$00FF) and stack ($0100-$01FF). Then the OAM (sprite) hardware is slow to read/write or can't be accessed during VDRAW or something, and there isn't enough time to update everything by hand in VBLANK when you can write, but there is time to DMA 256 bytes during VBLANK. So most devs would keep a shadow copy of OAM (64 sprites x 4 bytes) in RAM, which is another 256 bytes gone ($0200-$02FF), and you end up with 1280 bytes of RAM left at $0300-$07FF.
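To double check the RAM arithmetic above, here is a minimal sketch in Python. The region list just restates the layout described in the post; the shadow OAM page at $0200 is a common convention, not something the hardware forces.

```python
# NES internal work RAM as described above (CPU addresses).
RAM_SIZE = 0x0800  # 2 KB of internal RAM, $0000-$07FF

regions = [
    ("zero page",  0x0000, 0x00FF),
    ("stack",      0x0100, 0x01FF),
    ("shadow OAM", 0x0200, 0x02FF),  # 64 sprites x 4 bytes, by convention
]

# Each region is inclusive of both end addresses.
used = sum(end - start + 1 for _, start, end in regions)
free = RAM_SIZE - used
free_start = max(end for _, _, end in regions) + 1

print(f"used: {used} bytes, free: {free} bytes "
      f"(${free_start:04X}-${RAM_SIZE - 1:04X})")
```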
Large blocky games like SMB can store their levels as coords of larger meta objects (pipes, bushes, clouds, block platforms, etc.) instead of individual tiles. Games like Dragon Warrior at best can use a fast RLE. Other than a castle or village, the sea and mountain tiles and so on don't fall into square, aligned regions, so they have to be expressed at the tile level. What I mean is mountains and grass aren't in nice 128 x 64 blocks or anything. They run diagonally and vary from row to row.
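A rough sketch of the meta object idea, in Python for readability. This is not SMB's actual level format; the object names, shapes, and the `expand` helper are made up purely to show how a handful of (object, x, y) entries can stand in for many individual tiles.

```python
# Hypothetical meta objects: each is a small block of tiles, drawn here as characters.
OBJECTS = {
    "pipe":  ["[]", "||", "||"],
    "cloud": ["(~~)"],
}

def expand(level_w, level_h, placements):
    """Expand (name, x, y) placements into a full tile grid (list of strings)."""
    grid = [[" "] * level_w for _ in range(level_h)]
    for name, x, y in placements:
        for dy, line in enumerate(OBJECTS[name]):
            for dx, ch in enumerate(line):
                # Clip anything that would fall outside the level.
                if 0 <= y + dy < level_h and 0 <= x + dx < level_w:
                    grid[y + dy][x + dx] = ch
    return ["".join(row) for row in grid]

# Two placements describe a level of 8 x 4 = 32 tiles.
level = expand(8, 4, [("pipe", 1, 1), ("cloud", 3, 0)])
for row in level:
    print(row)
```

This only pays off when the level really is made of a few repeated shapes, which is why it suits SMB far better than an overworld where terrain changes from row to row.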
I'm trying to remember what I ended up doing for my 4096 x 4096 world map scroller on the NES. This would be a grid of 512 x 512 1 byte tile indices = 256K (or 2 megabit!) JUST for the world map. For starters I think I recognized that there aren't 256 tiles, only like 8 or so (grass, sea, mountain, desert, town, forest, etc), so you can start bit packing.
If you use 4 bits per entry, that cuts you down by half and gives you 16 tiles (mountain, snow, desert, grass, shallow water, deep water, forest, bridge, town, etc.). I also recognized that most of the world map is runs of the same tile: runs of ocean, runs of forest, runs of mountain, etc.
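The 4 bit packing step can be sketched like this. Python is used only for illustration (on the real thing this would be a few shifts and an AND in 6502 assembly), and the function names and the high-nibble-first order are my own assumptions.

```python
def pack_nibbles(tiles):
    """Pack tile indices (0-15) two per byte, first tile in the high nibble."""
    tiles = list(tiles)
    if len(tiles) % 2:
        tiles.append(0)  # pad odd-length input with tile 0
    out = bytearray()
    for i in range(0, len(tiles), 2):
        hi, lo = tiles[i], tiles[i + 1]
        if not (0 <= hi < 16 and 0 <= lo < 16):
            raise ValueError("tile index does not fit in 4 bits")
        out.append((hi << 4) | lo)
    return bytes(out)

def unpack_nibbles(data, count):
    """Inverse of pack_nibbles; count drops any padding nibble."""
    out = []
    for b in data:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out[:count]

MAP_W = MAP_H = 512
raw_size = MAP_W * MAP_H      # one byte per tile: 262144 bytes (256K)
packed_size = raw_size // 2   # two tiles per byte: 131072 bytes (128K)
```

So bit packing alone takes the map from 256K to 128K, which is still far too big, hence the run-length step.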
I think in the end I got my 4096 x 4096 map down to like 20 kilobytes, and it was easily parsed left to right using some combination of RLE and a skip list (which could be static and also in ROM) to aid in jumping into the middle of a row instead of going through all 512 positions to find the start of the max=33 tile wide section that could cover a row... The worst case scenario with RLE is that sections with runs of 1 took up more space, but these only occurred when terrain frequency increased, e.g. a town surrounded by water with a bridge = run of n grass -> 1 water -> 1 grass -> 1 town -> 1 grass -> 1 water -> run of n grass. I just accepted this, because features like a town or a bridge didn't happen often and you more than made up for it when you'd have a span of 30 water or 20 mountain tiles in a row.
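Here is a minimal sketch of the RLE plus skip table idea, in Python. It is not the format I actually used (I don't remember it well enough); the byte layout (tile in the high nibble, run length minus one in the low nibble), the `CHUNK` spacing, and the function names are all assumptions for illustration.

```python
CHUNK = 64  # hypothetical checkpoint spacing, in tiles

def encode_row(row):
    """Return (rle_bytes, checkpoints) for one row of 4-bit tile indices.

    Each byte is (tile << 4) | (run - 1), so runs are 1..16 tiles long.
    Runs never cross a CHUNK boundary, so checkpoints[k] is the byte
    offset at which column k * CHUNK starts.
    """
    out = bytearray()
    checkpoints = []
    i = 0
    while i < len(row):
        if i % CHUNK == 0:
            checkpoints.append(len(out))
        tile = row[i]
        limit = min(16, CHUNK - i % CHUNK, len(row) - i)
        run = 1
        while run < limit and row[i + run] == tile:
            run += 1
        out.append((tile << 4) | (run - 1))
        i += run
    return bytes(out), checkpoints

def decode_window(rle, checkpoints, x0, width):
    """Decode tiles x0 .. x0+width-1, starting at the nearest checkpoint.

    x0 must be a valid column of the row.
    """
    k = x0 // CHUNK
    pos, x = checkpoints[k], k * CHUNK
    end = x0 + width
    out = []
    while x < end and pos < len(rle):
        tile, run = rle[pos] >> 4, (rle[pos] & 0x0F) + 1
        pos += 1
        lo, hi = max(x, x0), min(x + run, end)  # overlap of run and window
        if hi > lo:
            out.extend([tile] * (hi - lo))
        x += run
    return out

GRASS, SEA, TOWN = 0, 1, 2  # hypothetical terrain indices
ocean = [SEA] * 512
island = ([SEA] * 200 + [GRASS] * 10 + [SEA, GRASS, TOWN, GRASS, SEA]
          + [GRASS] * 10 + [SEA] * 287)
ocean_rle, ocean_cp = encode_row(ocean)
island_rle, island_cp = encode_row(island)
window = decode_window(island_rle, island_cp, 190, 33)
```

With these assumptions an all-ocean row of 512 tiles costs 32 bytes, and fetching a 33-tile window never scans more than one chunk's worth of runs before reaching the first visible tile.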
DW3, if I recall, kept the smaller details like towns and bridges out of the main map to achieve more compression, then used SMB-style objects to go back and patch in the details. I think they even did the same thing with the water + anything else combinations, patching them manually on the fly so they didn't have to waste CHR-ROM on two dozen or more variations of water left of dirt, water right of dirt, water top..... water bottom of mountain, etc. transitions for the white water borders. It also means fewer bits per tile index, which makes your world map smaller. Just think: if you had water + everything else x 4 directions, you'd have to use a byte or more for each tile and we are back to 2 megabit world maps lol, vs. a couple of instructions in your RLE decode loop that detect a transition to or from water and manually override with the appropriate tile. There was no cache, memory wait state, or deep pipeline on a ~1.79 MHz 6502, so there was no extra penalty for branching in loops and stuff like that.
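A sketch of the "work out the shoreline at decode time" trick, in Python. I don't know DW3's actual scheme; the 4-bit neighbour mask, `SHORE_BASE`, and the `base_chr` lookup are hypothetical, and a real decoder would do this inline in the RLE loop rather than on a finished grid.

```python
WATER = 1          # hypothetical terrain index for water
SHORE_BASE = 0x40  # hypothetical: 16 consecutive CHR tiles, one per edge mask

def shore_mask(grid, x, y):
    """4-bit mask of which neighbours are land: N=1, E=2, S=4, W=8.

    Off-map neighbours are treated as water.
    """
    h, w = len(grid), len(grid[0])
    mask = 0
    for bit, (dx, dy) in zip((1, 2, 4, 8), ((0, -1), (1, 0), (0, 1), (-1, 0))):
        nx, ny = x + dx, y + dy
        if 0 <= nx < w and 0 <= ny < h and grid[ny][nx] != WATER:
            mask |= bit
    return mask

def display_tile(grid, x, y, base_chr):
    """Pick the CHR tile to draw; only water needs the neighbour check."""
    terrain = grid[y][x]
    if terrain == WATER:
        return SHORE_BASE + shore_mask(grid, x, y)
    return base_chr[terrain]

# A single land tile (0) in the middle of water.
grid = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
base_chr = {0: 0x10}  # hypothetical CHR tile for plain grass
```

The stored map only ever contains "water"; the 16 shoreline variants exist in CHR but never cost a bit in the map data.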
I didn't have collisions either. You'd want to keep collision info with the tile so you'd only store it once, but the NES keeps tile data in a separate ROM on the PPU's own bus, so you need a collision map for your tile set in PRG-ROM, etc.
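For what it's worth, the collision map I never wrote could be as small as one byte per terrain type. A minimal sketch in Python, where the terrain indices, flag bits, and `can_enter` are all hypothetical:

```python
# Hypothetical terrain indices, matching whatever the map data uses.
GRASS, SEA, MOUNTAIN, TOWN = 0, 1, 2, 3

# Flag bits: which kind of movement may enter the tile.
WALK, SHIP = 0x01, 0x02

# One byte per terrain type; on a cartridge this table would sit in
# PRG-ROM, since the CPU cannot read CHR data directly.
COLLISION = bytes([
    WALK,  # GRASS
    SHIP,  # SEA
    0x00,  # MOUNTAIN: impassable
    WALK,  # TOWN
])

def can_enter(terrain, on_ship=False):
    """True if the party may move onto a tile of this terrain type."""
    flags = COLLISION[terrain]
    return bool(flags & (SHIP if on_ship else WALK))
```

With 16 terrain types that is a 16-byte table, indexed by the same 4-bit value the map decoder already produces.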
Ok I'll stop now... sorry.
Back to my home built PC... keyboard clock and data to 1 to 8 shift register... to pin 19 of the 8259 for IRQ1 then parallel out byte to the 8255 port A mapped at 60h...zzzzz