A natural and explainable brain? Another visit to my design.
New pictures are provided too.
I have been working on AGI full-time for 5 years now (I don't get paid), mostly on text as the sensory input, and it has turned out to be very insightful. I have a very large design with many of the "bells and whistles", while the architecture that runs it all is very simple and has made some of my friends cringe. Below I lay out a good portion of my work. I really hope you can help my direction, or that I can help you.
Nearly every part of my design/architecture can be found in the AI field. Hierarchies, yup. Weights, yup. Rewards, yup. Reward Update, yup. Activation Function, yup. Word2Vec, yup. Seq2Seq, yup. Energy, yup. Pruning, yup. Online Learning, yup. Pooling, yup. Mixing Predictions, yup. Etc. It's when I unify them that I start getting a new view that no one else shares. I'm able to look into my net and understand every part and how they all work together; that's why I was able to learn so much about the architecture.
I've coded my own Letter Predictor, and it compresses 100MB to 21.8MB; the world record is 14.8MB. Mine updates frequencies Online and mixes predictions, and more. I still have tons to add to it, so I think I will likely come close to the world record. How it works is in the supplementary attached file.
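To make the link between prediction and compressed size concrete, here is a minimal sketch, not my actual code (that is in the attached file). It assumes a plain order-2 letter model with add-one smoothing over a 256-symbol alphabet; `compressed_bits`, `order`, and `alphabet` are made-up names for illustration. An ideal arithmetic coder spends about -log2(p) bits on a letter that was predicted with probability p, so better predictions mean a smaller file.

```python
import math
from collections import Counter, defaultdict

def compressed_bits(text, order=2):
    """Ideal code length in bits for `text` under an adaptive letter model."""
    counts = defaultdict(Counter)  # context string -> counts of the next letter
    alphabet = 256                 # assumption: smoothing over 256 possible symbols
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]          # the previous `order` letters
        seen = counts[ctx]
        total = sum(seen.values())
        p = (seen[ch] + 1) / (total + alphabet)  # add-one smoothed probability
        bits += -math.log2(p)                    # cost of coding this letter
        seen[ch] += 1                            # online update, after coding
    return bits
```

Dividing the result by 8 gives an estimated size in bytes. Repetitive text gets cheaper as the counts build up, which is the whole game.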
So below I'm going to present a lot of my design, showing how I unify things together in a single net, and you can tell me if there's a more natural way or not. I've tested other people's algorithms like GPT-2, and they can accomplish what I present, but the natural way to do it is never shown in an image or explained like I explain it; they just stack black boxes on each other.
See this image to get a basic view of my architecture. It's a toy example.
https://ibb.co/p22LNrN It's a hierarchy of features that get re-used/shared to build larger memories. The brain only stores a word or phrase once and links every sentence it has ever heard to it. That makes for an extremely powerful breeding ground. Note the brain doesn't store a complete pyramid like I show in my image, just bits and parts: a collection of small hierarchies. So think of my image as a razor tooth saw, not a single very tall pyramid triangle.
https://ibb.co/d4JVm55 Notice all the nodes are too "perfectly" clear? Well, nodes can be merged ("whalkinge"), have variable weights ("wALkinG"), and be pruned ("my are cute") to get a "compressed" fuzzy-like network, but for now we can keep a clean hierarchy so we can easily see what is going on!
I have a working algorithm (trie/tree-based) that updates the connection weights in the tree whenever it accesses a feature (in the same time order, ex. a>b>c; cba is a different feature), so it knows how many times it has seen 'z' or 'hello' or 'hi there' in its life so far! Frequencies! This is my Online Training for weights. In my tests, adding more data has always improved my predictor/model. I Evaluate my model's predictions using not Perplexity but Lossless Compression. So now you can imagine my razor tooth hierarchies with counts (weights) placed on the connections. Good so far. It's starting to look like a real network and can function like one too!
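The counting trie described above can be sketched like this. It is a toy version under my own assumptions, not the attached code: `Node`, `observe`, `frequency`, and the `max_len` cap are names and choices made up for illustration.

```python
class Node:
    """One feature in the trie; `count` is how many times it has been seen."""
    def __init__(self):
        self.count = 0
        self.children = {}

def observe(root, text, max_len=4):
    """Online update: bump the count of every substring up to max_len letters."""
    for start in range(len(text)):
        node = root
        for ch in text[start:start + max_len]:
            node = node.children.setdefault(ch, Node())  # create node on first sight
            node.count += 1                              # strengthen the connection

def frequency(root, feature):
    """How many times `feature` has been seen, in that exact order."""
    node = root
    for ch in feature:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count
```

After `observe(root, "abcabc")` on a fresh `Node()`, `frequency(root, "abc")` is 2 while `frequency(root, "cba")` is 0, since order matters. Each shared prefix is stored once and re-used by every longer feature.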
https://ibb.co/hC8gkFC Now for the cool part I want to slap on here. I hope you know Word2Vec or Seq2Seq. It translates by discovering cat=dog based on shared contexts. The key question we will focus on now is: how does the brain find out cat=dog using the same network hierarchy? Here's my answer below, and I want to know if you knew this or if you have a more natural way.
https://ibb.co/F4BL1Ys Notice I highlighted the cats and dogs nodes? The brain may see "my cats eat food" 5 times and then, tomorrow, may see "my dogs eat food" 6 times. Only through their shared contexts will energy leak and trigger cats from dogs. I see no other realistic way this would occur. The brain is finding out cats is similar to dogs on its own, by shared strengthened paths leaking energy. So next time it sees "dogs" in an unseen sentence like "dogs play", it will activate both the dogs and cats nodes by some amount.
We ignore common words like "the" or "I" because they will be around most words; sharing them doesn't mean cats=boat. High frequency nodes are ignored.
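Here is a minimal sketch of the shared-context idea, counting instead of leaking energy. It assumes a window of 2 words on each side and a hand-given stop-word list standing in for "high frequency nodes"; `context_counts` and `shared_energy` are hypothetical names, and taking the min of the two counts is just one simple way to score overlap.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=2, stop_words=frozenset()):
    """For each word, count its neighbours together with their relative position."""
    ctx = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i and words[j] not in stop_words:  # skip very common words
                    ctx[word][(j - i, words[j])] += 1
    return ctx

def shared_energy(ctx, a, b):
    """Overlap between two words: sum of the smaller count over shared contexts."""
    return sum(min(n, ctx[b][c]) for c, n in ctx[a].items())
```

With "my cats eat food" seen 5 times, "my dogs eat food" 6 times, "my boat is red" once, and "my" as a stop word, cats/dogs score 10 and cats/boat score 0. Without the stop word, cats/boat would score 1 just from sharing "my", which is why common words need to be ignored.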
Word2Vec and similar methods can look at both sides around a word to translate, use long windows or skip-gram windows, give closer words in time more impact, and especially weigh the number of times seen (frequencies). My hierarchy can naturally do all that. Word2Vec also uses Negative Sampling, and my design can likewise use inhibition for both next-word and translation Prediction.
Word2Vec uses vectors to store words in many dimensions and then compares which are closer in the space, whereas my design just triggers related nodes by how many local contexts are shared. No vectors are stored in the brain... Nor do we need Backprop to update connections. We increment counts, prune low frequency nodes, merge nodes, etc.; we don't need Backprop to "find" this out, we just need to know how/why we update weights!
There is such a thing as contextual word vectors. Say we see "a beaver was near the bank"; here we disambiguate "bank". In my design, it triggers river or wood more than TD Trust or a financial building, because although "near the bank and the building" and "near the bank with wood" both share bank, the beaver in my input sentence triggers the latter sentence more than the financial one.
Word2Vec can do the "king is to queen as man is to what?" by subtracting from king the dimensions that man doesn't have, then finding where queen is dimensionally without those king dimensions, to land at woman. Or USA is to Canada as China is to India: here, instead of one lacking a context, they both share it, but the location is slightly off in number. But the brain doesn't do this naturally; just try "cakes are to toast as chicken is to what?" Naturally the brain picks a word with all 3 properties.
To do the king/woman thing, we need to see that the only difference is that man isn't royal, so the answer is the word most related to queen but not royal, hence woman. This involves a NOT operation, somehow.
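For reference, the Word2Vec-style version of the analogy is plain vector arithmetic: take queen, subtract king, add man, and pick the nearest remaining word. The sketch below uses hand-made 3-number vectors (royal, male, female) purely as an assumption for illustration; real Word2Vec learns its dimensions from text and they are not labelled like this. `VECTORS`, `cosine`, and `analogy` are made-up names.

```python
import math

# Hypothetical hand-made features: (royal, male, female).
VECTORS = {
    "king":   (1.0, 1.0, 0.0),
    "queen":  (1.0, 0.0, 1.0),
    "man":    (0.0, 1.0, 0.0),
    "woman":  (0.0, 0.0, 1.0),
    "throne": (1.0, 0.0, 0.0),
}

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c):
    """'a is to b as c is to ?': nearest word to b - a + c, excluding the inputs."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))
```

`analogy("king", "queen", "man")` returns "woman" here: subtracting king removes the royal part, which plays the role of the NOT operation.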
Ok so, when my architecture is presented with "walking down the", it activates multiple nodes like "alking down the" and "lking....." and "king...." ..... and "down the" and "the", and also skip-gram nodes ex. "walking the", as well as related nodes ex. "running up that" and "walking along the". My code BTW does this, but not the related or skip-gram nodes yet! What occurs now is that all the activated nodes have shared parent predictions on the right-hand side to predict the next letter or word. So "down the" and "the" and "up this" all leak energy forward to "street". This Mixing (see the Hutter Prize or PPM) improves Prediction. You can only repeat the alphabet forward because it was stored that way. Our nodes have now mixed their predictions to decide on a better set of predictions.
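The mixing step can be sketched at the word level like this. It is a simplified stand-in, not PPM and not my actual code: it only mixes exact suffix matches (no skip-gram or related nodes), and the weight of `n * n` for a match of length n is an arbitrary assumption so that longer matches get more say. `train` and `mix` are hypothetical names.

```python
from collections import Counter, defaultdict

def train(words, max_order=3):
    """Count which word follows each context of 1..max_order words."""
    table = defaultdict(Counter)  # context tuple -> counts of the next word
    for i in range(len(words)):
        for n in range(1, max_order + 1):
            if i - n >= 0:
                table[tuple(words[i - n:i])][words[i]] += 1
    return table

def mix(table, context, max_order=3):
    """Blend the next-word predictions of every matching suffix of `context`."""
    scores = Counter()
    for n in range(1, min(max_order, len(context)) + 1):
        nxt = table.get(tuple(context[-n:]))
        if not nxt:
            continue                    # this suffix was never seen
        total = sum(nxt.values())
        weight = n * n                  # assumption: longer matches count more
        for word, count in nxt.items():
            scores[word] += weight * count / total
    norm = sum(scores.values())
    return {w: s / norm for w, s in scores.items()} if norm else {}
```

Given "walking down the", the suffixes "the", "down the", and "walking down the" each vote for their own next words and the votes are summed. A context never seen in full, like "hopping past the", still gets predictions from the shorter suffix "the".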
https://ibb.co/Zz91jQQ My design is therefore recognizing nodes despite typos or related words. It can also handle rearranged words like "down walking the" by time delay from children nodes. Our "matches" in the hierarchy are many, and we have many forward predictions now, so we can take the top 10 predicted words. We usually pick the top prediction, but mutation makes it imperfect on purpose; that's important.
You may wonder, why does thinking in the brain only hear 1 of the top 10 predictions? All 10 nodes are activated, and so are recently heard nodes that are kept Active! If they were all heard, you'd hear them in your mind, surely? If you imagine video in your brain, it'd be very odd to predict the next frame as a dog, cat, horse, and sheep; it would be all blended like a monster. The brain needs precision. So Pooling, as done in CNNs, is used in picking from the top 10 predictions! Other nodes and predictions are still activated, just not as much.
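Picking one winner out of the top 10, with a bit of mutation, can be sketched with ordinary top-k sampling. That is a standard technique swapped in here as an assumption, not necessarily how Pooling would work inside the hierarchy. `pick`, `k`, and `temperature` are hypothetical names; a temperature of 0 always takes the top prediction, and higher values allow more mutation.

```python
import random

def pick(predictions, k=10, temperature=1.0, rng=random):
    """Keep the k strongest predictions, then say exactly one of them."""
    top = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:k]
    if temperature <= 0:
        return top[0][0]                       # no mutation: always the winner
    # Sharpen (temperature < 1) or flatten (temperature > 1) the probabilities.
    weights = [p ** (1.0 / temperature) for _, p in top]
    return rng.choices([w for w, _ in top], weights=weights, k=1)[0]
```

Only one word is returned, so nothing gets blended, while the other candidates keep their scores in `predictions` for later use.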
Also, Pooling in my architecture can be done on every node's outputs, not just the final high layer! Pooling helps recognition by focusing. Pooling can be controlled indirectly to make the network Summarize or Elaborate or keep Stable. It simply says or doesn't say the more or less important nodes, based on their probability of being said. Like you may ignore all the "the"s, or you may say a lot of filler content that isn't even rewarding the way talking about food is (see below).
When given a prompt ex. "What do you want to eat? What?", you may first parrot the start exactly, and some of it may be said in your own loved words: I, fries, etc. Or you may just say the entailment. You might just say what they said and stop the forward energy flow. And you might just say fries in place of "What?". Why!? Because their words, and your loved words (fries, I, etc.), are pre-active.
One more thing I'll go through is Temporary Energy and Permanent Energy in my architecture. You can see Facebook's new chatbot Blender is like GPT-2, but it has a Dialog Persona that makes it always say certain words/nodes. So if it likes food or communism, it will bring it up somehow in everything. Just look at what I'm writing, it's all AI related! Check out the latter half of this guy's video:
In my design, positive and inhibitory reward is installed on just a few nodes at birth, and it can transfer reward to related nodes to update its goals. It may see contextually that food=money, so now it starts talking about money. Artificial rewards are changeable; the root goal is not as modifiable.
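A rough sketch of reward transfer, under assumptions of my own: relatedness is given as a number between 0 and 1 per pair of nodes, a fixed `rate` controls how much reward leaks, and root nodes are simply frozen (the text above only says they are less modifiable). `transfer_reward`, `rate`, and `root` are hypothetical names.

```python
def transfer_reward(reward, similarity, rate=0.5, root=frozenset()):
    """Leak reward (positive or inhibitory) from each node to its related nodes."""
    updated = dict(reward)
    for (a, b), sim in similarity.items():
        for src, dst in ((a, b), (b, a)):      # relatedness works both ways
            if dst in root:
                continue                       # assumption: root goals stay fixed
            gain = rate * sim * reward.get(src, 0.0)
            if gain:
                updated[dst] = updated.get(dst, 0.0) + gain
    return updated
```

With `reward = {"food": 1.0}`, `similarity = {("food", "money"): 0.8}`, and `root = {"food"}`, money picks up a reward of 0.4 while food stays at 1.0. A negative starting reward leaks the same way, giving inhibition on related nodes.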
For Temporarily Active nodes: you can remember that a password is "car" and then forget it, but of course you retain the car node. This is a different kind of forgetting than pruning weak weights forever. GPT-2 is probably using the last 1,000 words for prediction by this very mechanism. The brain already has to keep the last 10 words in memory, so any predicted nodes that are pre-active from being held in memory get a boost. If you read "the cat and cat saw cats cat then a cute", you predict cat, and the cat node has already been activated 4 times just recently. You're holding the words in your hierarchy nodes, not on paper anymore. So yes, energy is retained for a while and affects the predicted Probabilities!
I once played Pikmin for half the day, and when I went into the kitchen, things looked like Pikmin faces, or I saw them faintly but still somewhat vividly running around things. In dreams it causes more random predictions from the top 10 or 100 predictions; the predictions in dreams aren't really good ones.
You can see how this helps. Say you have only read 100,000 bytes of data so far, and you now read "the tree leaves fell on the root of the tree and the". You have little training data so far, but you can predict well that the next word is probably related to tree, leaves, etc., so leaf, tree, branch, and twig all get boosted by the recently read related words. And it's really powerful; I've done tests in this area as well. Hutter Prize entries use a slew of variants of what I presented, like looking at the last 1,000 letters to boost the likeliest next letter. That's good, but not as commonly accurate or flexible as word prediction using related words instead of Exact letters! Big difference.
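The temporary-energy boost can be sketched as follows. This is only an illustration under assumptions: energy fades by a fixed `decay` per word of age, relatedness comes from a hand-given table, and the boost multiplies the base probability. `boosted`, `decay`, and `strength` are hypothetical names, not values from my tests.

```python
def boosted(base, recent, related, decay=0.9, strength=1.0):
    """Re-weight next-word probabilities using energy left by recently read words."""
    energy = {}
    for age, word in enumerate(reversed(recent)):   # age 0 = most recent word
        fade = decay ** age                         # older words have leaked more
        energy[word] = energy.get(word, 0.0) + fade
        for other, sim in related.get(word, {}).items():
            energy[other] = energy.get(other, 0.0) + fade * sim
    scores = {w: p * (1.0 + strength * energy.get(w, 0.0)) for w, p in base.items()}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total else {}
```

If `recent` is the words of "the tree leaves fell on the root of the tree and the" and `related` says tree and leaves are related to branch, then branch ends up with a higher probability than it had in `base`, and unrelated candidates lose share. A word repeated several times recently stacks up energy each time, like the cat example above.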
I look forward to your thoughts. I hope I provided some insight into my design and tests, and I hope you can help me if there is something I'm missing, as my design does do a lot in a single architecture. I don't see why it's a good idea to study it as a stack of black boxes without fully understanding how it makes the decisions that improve Evaluation (prediction). While my design may be inefficient, it may be the natural way it all fits together using the same nodes.
To learn more, I have a video and a different but similar run through my design in this file (and how my code works exactly):
https://workupload.com/file/Y4XhZPYHzqy