So my adventure of me versus openAI.com continues:
So far, the code I made from scratch (except the little pre-processor, though I mostly know how it works) gets a score of 19,033,243 bytes losslessly compressed from the 100,000,000 bytes fed in (enwik8.txt). So I'm now about 4MB away from where I "should" be. Still more to come; this is only the beginning.
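(For reference, that works out to 19,033,243 × 8 / 100,000,000 ≈ 1.52 bits per input byte.)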
The text completion results you've seen had improved, as you can see in post #221 in the first link below, but those are from my 20MB score, because the code was older and wasn't running the pre-processor during text completion, so that's a big difference.
https://encode.su/threads/3594-CM-design-discussion/page6 (project page)
https://encode.su/threads/3595-Star-Engine-AI-data-compressor

Not listed in the "how it works" at that link is that I have half set up the usage of hole matching, delay matching of context, and delayed prediction (it predicts ahead of time). It works on the word level thanks to the pre-processor, so it can recognize "walked very fast to the" as matching "walked fast in the >>> new store they made" and predict ahead of time "new [store]", so we get "walked very fast to the store". Many matches contribute their predictions to produce a new set of predictions in the form of probabilities, which are then sent to an evaluation function.
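To make that concrete, here's a rough sketch of the idea (not the real Star Engine code; the function names, parameters, and toy sentences are all made up for illustration): match the recent context against the history while tolerating a few "holes", then pool the word that follows each match one or more positions ahead (the delayed / ahead-of-time prediction) into probabilities.

```python
from collections import Counter

def holes_match(context, window, max_misses=1):
    # A "hole" here is a position where the two word sequences disagree;
    # we still count it as a match if only a few positions miss.
    misses = sum(1 for a, b in zip(context, window) if a != b)
    return misses <= max_misses

def predict_next(words, context_len=5, max_misses=1, lookahead=1):
    # Scan the history for windows that roughly match the recent context
    # and pool the word that follows each match ("lookahead" positions later,
    # i.e. a delayed prediction) into a probability distribution.
    context = words[-context_len:]
    counts = Counter()
    for i in range(len(words) - context_len - lookahead):
        window = words[i:i + context_len]
        if holes_match(context, window, max_misses):
            counts[words[i + context_len + lookahead - 1]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

history = "he walked fast to the new store they made".split()
recent  = "she walked fast to the".split()
print(predict_next(history + recent, lookahead=1))  # {'new': 1.0}
print(predict_next(history + recent, lookahead=2))  # {'store': 1.0}
```

That's only the bare bones of it; the real matcher works on the word tokens the pre-processor produces and pools many such matches at once before the evaluation function sees them.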
What's left to do after this is translation for recency boosting and matching, mirror ghosting (seeing a partial match and, since the two unmatched items are similar, treating my items as similar too even though the topic words differ), patterns of delay and hole errors, weighting tricks to tie it all together for clearer predictions, and a slew of other little things.
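As a hand-wavy illustration of the weighting part (again, invented names and numbers, not the actual code), blending the distributions from several matches could look something like this, where a long clean match gets more weight than a short match full of holes:

```python
def mix_predictions(dists, weights):
    # Blend several probability distributions into one, weighting each
    # source by how much we trust it (match length, recency, holes, etc.).
    total_w = sum(weights)
    mixed = {}
    for dist, w in zip(dists, weights):
        for word, p in dist.items():
            mixed[word] = mixed.get(word, 0.0) + (w / total_w) * p
    return mixed

long_exact_match  = {"store": 0.8, "shop": 0.2}   # trusted more
short_holey_match = {"store": 0.4, "house": 0.6}  # trusted less
print(mix_predictions([long_exact_match, short_holey_match], [3.0, 1.0]))
# roughly {'store': 0.7, 'shop': 0.15, 'house': 0.15}
```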
Next time someone says they don't know what GPT is learning, or how the code works, tell them to look harder at the code and at the dataset: there are just several common patterns in it you can find that allow predictions, and it all starts with exact matches too.