LLMs are trained on hundreds of terabytes of data to a few petabytes at most. You are off by 3 to 6 orders of magnitude in your estimate of the training data. They aren't literally trained on "all the data of the internet"; that would be a divergent nightmare, and catastrophic forgetting is still a problem for neural networks and ML algorithms in general.

Humans are probably trained on less than half an exabyte of data over a lifetime, given the ~1 Gbps of sensory data we receive. That's still ~20 petabytes by age 5. On the LLM side, a 400B-parameter model trained on ~100 tokens per parameter sees ~40 trillion tokens, which is roughly 80 TB at 2 bytes per token (or ~160 TB of raw UTF-8 text at ~4 bytes per token). That's the order of magnitude of current models.
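Quick back-of-envelope check of those numbers (the ~1 Gbps sensory bandwidth, 400B parameters, and ~100 tokens per parameter are the assumptions above; the bytes-per-token figures are my own rough guesses):

```python
# Rough sanity check of the estimates above; every constant is an assumption.

SECONDS_PER_YEAR = 365 * 24 * 3600            # ~3.15e7 s

# Human sensory input, assuming ~1 Gbps of total sensory bandwidth.
sensory_bytes_per_sec = 1e9 / 8               # 1 Gbps -> 125 MB/s
human_age_5  = sensory_bytes_per_sec * SECONDS_PER_YEAR * 5    # by age 5
human_life   = sensory_bytes_per_sec * SECONDS_PER_YEAR * 80   # ~80-year lifetime

# LLM training corpus, assuming 400B parameters and ~100 tokens per parameter.
params = 400e9
tokens = params * 100                         # ~4e13 tokens
corpus_lo = tokens * 2                        # ~2 bytes per token (compact encoding)
corpus_hi = tokens * 4                        # ~4 bytes per token (raw UTF-8 text)

TB, PB, EB = 1e12, 1e15, 1e18
print(f"human by age 5:  {human_age_5 / PB:.0f} PB")   # ~20 PB
print(f"human lifetime:  {human_life / EB:.2f} EB")    # ~0.3 EB
print(f"LLM corpus:      {corpus_lo / TB:.0f}-{corpus_hi / TB:.0f} TB")  # ~80-160 TB
```

So the human sensory stream and current LLM corpora really are a couple of orders of magnitude apart, not the other way around.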