You're too focused on the "game predicts user input" part. Inputs are polled 60 times a second (for a 60fps game). The game doesn't need some super complex logic that looks into all possible scenarios and decides which one is the most likely. If the game just assumes "all players are pressing the exact same buttons they did 16.6ms ago", it will be right most of the time. Think about how many button presses you do every second, and how far that is from 60.
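To put a number on that, here is a rough Python sketch of the "same buttons as last frame" guess. The function names and the input data are made up for illustration (one player holding right for a second and tapping jump for two frames), not taken from any real game.

```python
def predict_inputs(last_known):
    """Guess this frame's inputs by repeating the last known inputs."""
    return dict(last_known)


def prediction_accuracy(actual_frames):
    """Fraction of frames where 'same as the previous frame' was the right guess."""
    hits = 0
    for prev, cur in zip(actual_frames, actual_frames[1:]):
        if predict_inputs(prev) == cur:
            hits += 1
    return hits / (len(actual_frames) - 1)


# Hypothetical second of input at 60fps: hold right, tap jump for 2 frames.
frames = (
    [{"p1": "right"}] * 30
    + [{"p1": "right+jump"}] * 2
    + [{"p1": "right"}] * 28
)

accuracy = prediction_accuracy(frames)  # wrong only when a button changes
```

In this made-up second the guess is wrong only on the two frames where the jump button changes state, so 57 of the 59 predictions are correct.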
With 4 players, though, the game will be wrong more often and it will take longer to calculate the new state.
You have a state and you have a set of inputs. There should only be one outcome.
ex: State: all characters are on the ground. Input: Player 1 pressed dpad right, Players 2 and 4 did nothing, Player 3 pressed the jump button. Outcome: character 1 moved to the right and started the walking animation, characters 2 and 4 stayed in the same place, character 3 moved up and started the jumping animation.
Then a lag happens and the game assumes they all kept doing the same thing. Single outcome: character 1 moved even more to the right and character 3 even more up.
Then the game receives the real inputs, and it turns out that at some point Player 1 pressed the jump button too. So the game rewinds to the frame it predicted wrong and recalculates the outcome: character 1 moved even more to the right and up, while character 3 moved even more up.
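The example above can be sketched in Python roughly like this. It is a toy model under loud assumptions: positions are just (x, y) tuples, "right" adds 1 to x and "jump" adds 1 to y each frame, and `step` and `resimulate` are hypothetical names. A real game would also have gravity, collisions, animations and so on.

```python
def step(state, inputs):
    """Deterministic: the same state plus the same inputs gives one outcome."""
    new_state = {}
    for player, (x, y) in state.items():
        buttons = inputs.get(player, set())
        if "right" in buttons:
            x += 1
        if "jump" in buttons:
            y += 1
        new_state[player] = (x, y)
    return new_state


def resimulate(snapshot, inputs_per_frame):
    """Replay frames from a saved snapshot with the given per-frame inputs."""
    state = snapshot
    for inputs in inputs_per_frame:
        state = step(state, inputs)
    return state


# Saved state before the lag: everyone on the ground (y == 0).
start = {1: (0, 0), 2: (5, 0), 3: (10, 0), 4: (15, 0)}

# Last inputs received: Player 1 holds right, Player 3 holds jump.
seen = {1: {"right"}, 3: {"jump"}}

# During the lag the game guesses that nothing changed for 3 frames.
predicted = [seen, seen, seen]
shown = resimulate(start, predicted)

# The real inputs arrive late: Player 1 also pressed jump from frame 2 on.
late = {1: {"right", "jump"}, 3: {"jump"}}
actual = [seen, late, late]

# Rewind to the snapshot and recalculate with the real inputs.
corrected = resimulate(start, actual)
```

Here `shown` has character 1 at (3, 0), while `corrected` has it at (3, 2). Character 3 ends up at (10, 3) both times because the guess for that player was right.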
The problem is that if the game is on Frame 6 and it predicted Frame 3 wrong, for example, it has to recalculate the correct state for Frames 3, 4, 5, 6 AND 7 before it can display the 7th frame. So it needs the headroom to calculate 5 states and still have time to render the frame within 16.6ms, or the game will freeze (assuming 60fps).
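The headroom math as a small Python sketch. The per-state and render timings passed in below are invented numbers just to show the check, and the function names are hypothetical.

```python
FRAME_BUDGET_MS = 1000 / 60  # about 16.7ms per frame at 60fps


def frames_to_resimulate(mispredicted_frame, frame_to_display):
    """Count the states to recalculate, both end frames included."""
    return frame_to_display - mispredicted_frame + 1


def fits_in_budget(mispredicted_frame, frame_to_display, step_ms, render_ms):
    """True if resimulating and rendering still fits inside one frame."""
    n = frames_to_resimulate(mispredicted_frame, frame_to_display)
    return n * step_ms + render_ms <= FRAME_BUDGET_MS


states = frames_to_resimulate(3, 7)  # Frames 3, 4, 5, 6 and 7

# Assumed timings: 8ms to render, 1ms or 2ms per simulated state.
ok_fast = fits_in_budget(3, 7, step_ms=1.0, render_ms=8.0)
ok_slow = fits_in_budget(3, 7, step_ms=2.0, render_ms=8.0)
```

With these assumed timings, 5 states at 1ms each plus rendering takes 13ms and fits, while 5 states at 2ms each takes 18ms and blows the budget.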