I'd argue that 128-bit is the maximum memory bus width for Dane, and by the time third-party developers are able to give feedback, it's likely far too late to change it (unlike the quantity of RAM, which is just swapping out standardised components and could in theory be done a few months before launch). Besides, the limit on bus width isn't just physical space or cost (though those aren't trivial); it's power consumption.
It's difficult to find hard numbers on RAM power consumption, but these Micron slides (PDF link) suggest LPDDR4 consumes about 6.5 pJ/bit, and the graph in this article (sourced from Samsung) indicates LPDDR5 consumes about 37.5% less power per bit than LPDDR4, which would put LPDDR5 at just over 4 pJ/bit. Going to a 256-bit bus width at the full 6400MT/s would therefore consume over 6.5W for memory alone, which is a huge amount for a device like the Switch (by these numbers the original model maxed out at around 1.33W for 25.6GB/s). Even with a 128-bit bus, they're still looking at around 3.3W for the RAM at peak clocks, and will likely have to clock the memory down significantly in handheld mode. It's notable that even the Steam Deck, which has a much higher power budget than the Switch, limits its LPDDR5 to 5500MT/s, and I'd imagine power consumption played a part in that decision.
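The arithmetic behind those watt figures is just bus width × transfer rate × energy per bit. Here's a quick sketch; the pJ/bit values are the Micron/Samsung estimates above, and the original Switch line assumes its known 64-bit LPDDR4 interface at 3200MT/s (which is what gives the 25.6GB/s figure):

```python
def dram_power_watts(bus_bits, mts, pj_per_bit):
    """Peak DRAM interface power: bits/s transferred times energy per bit."""
    bits_per_second = bus_bits * mts * 1e6  # MT/s -> transfers/s, 1 bit per pin per transfer
    return bits_per_second * pj_per_bit * 1e-12  # pJ -> J

# LPDDR5 at ~4 pJ/bit (37.5% below LPDDR4's ~6.5 pJ/bit)
print(dram_power_watts(256, 6400, 4.0))  # 256-bit bus: ~6.55 W
print(dram_power_watts(128, 6400, 4.0))  # 128-bit bus: ~3.28 W

# Original Switch: 64-bit LPDDR4 at 3200 MT/s (25.6 GB/s)
print(dram_power_watts(64, 3200, 6.5))   # ~1.33 W
```

These are peak numbers at full transfer rate; real-world draw depends on utilisation and clocks, which is why downclocking in handheld mode helps so much.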
Fortunately there's a more efficient approach, which is to rely more heavily on caches. I don't mean "infinity cache" (which is just marketing speak for a really big cache), just generally bigger caches than you'd usually get on a chip like this. Accessing data in cache is far, far more power efficient than accessing data from off-chip DRAM, so if you want to maximise performance within a tight power budget, then big caches and a narrow memory bus are a better approach than small caches and a wide bus. Orin is moving in that direction, with 50% higher L1 cache on the SMs vs desktop Ampere, a 4MB L2, and then another 4MB system-level cache above both the CPU and GPU. Dane won't have the same amount of L2, and probably not the SLC, but it's an indication that they're taking the right approach. The TX1 GPU only has 24KB of L1 cache per SM and a 256KB L2 cache, so even moving to 192KB L1 per SM and perhaps 1MB L2 would be a big jump.
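To see why bigger caches pay off under a tight power budget, here's a toy average-energy model. The cache energy figure and hit rates are purely illustrative assumptions on my part (on-chip SRAM access is typically an order of magnitude or more cheaper per bit than off-chip DRAM); only the ~4 pJ/bit DRAM number comes from the estimates earlier in the post:

```python
E_CACHE_PJ = 0.4  # assumed on-chip cache access energy, pJ/bit (illustrative)
E_DRAM_PJ = 4.0   # LPDDR5 estimate from the Samsung-sourced graph

def effective_pj_per_bit(hit_rate):
    """Average energy per bit read, blending cache hits and DRAM misses."""
    return hit_rate * E_CACHE_PJ + (1 - hit_rate) * E_DRAM_PJ

for hr in (0.5, 0.7, 0.9):
    print(f"hit rate {hr:.0%}: {effective_pj_per_bit(hr):.2f} pJ/bit")
```

Under these assumptions, pushing the hit rate from 50% to 90% cuts average memory energy by roughly two-thirds, which is exactly the lever a bigger L1/L2 gives you without widening the bus.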