So you mean a sparse network, then, instead of computing the whole matrix? Yeah, my AI would actually fit that bill really well, if I can finish it. My network can still come out small and fast, as explained in my last post. Now, you say hashes, hmm, like the word 'walking' gets converted into its ord values and that picks the location in a list in Python code. Hmm, yes, that's very fast... but I bet the RAM will suffer. Let's see: you have 100MB of text, and must store every ~4 words of it and find them fast. To make the 100MB hashable, you need to store in a small list (small as in 'using the hash table method') the ~4 words + their prediction entailment, which works out to roughly 5x the text size: 100MB of text needs 500MB, and 40GB needs 200GB. So the extra-large 40GB text dataset that GPT holds in RAM as 12GB would come to 200GB needing to be in RAM. Or does it need to be in RAM at all if it is hashable (fast find)? Anybody know? So for 10GB of text it would work OK: 50GB of RAM needed then.
But don't forget you need to find ALL matches of "[we [walked [down [the]]]] ?___?", and combine the predictions to get a set of predicted words for all 4 matches. Yup, so 'the' alone has nearly 800,000 matches in 100MB of wiki LOL, when they could all be put into a tree with a max 50K vocab. You also need a semantic web like word2vec, and need to store those embeds or connections.
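To make the matching step concrete, here is a rough sketch of the hash-table version, assuming a plain Python dict keyed on word tuples. The names build_table and predict are made up for illustration, and summing raw counts is just one simple way to combine the predictions.

```python
from collections import Counter, defaultdict

def build_table(tokens, max_context=4):
    """Map every context of 1..max_context words to a Counter of next words."""
    table = defaultdict(Counter)
    for i in range(1, len(tokens)):
        for n in range(1, max_context + 1):
            if i - n < 0:
                break  # not enough words to the left for a longer context
            table[tuple(tokens[i - n:i])][tokens[i]] += 1
    return table

def predict(table, context, max_context=4):
    """Combine next-word counts from every suffix of the context."""
    combined = Counter()
    context = tuple(context)
    for n in range(1, min(max_context, len(context)) + 1):
        # .get() avoids inserting empty entries into the defaultdict
        combined.update(table.get(context[-n:], Counter()))
    return combined

tokens = "we walked down the road and we walked down the hill".split()
table = build_table(tokens)
prediction = predict(table, ["we", "walked", "down", "the"])
print(prediction.most_common())
```

Every position in the text adds up to four keys here, which is exactly where the RAM blow-up comes from.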
So it's big....and slow...
I don't know where you got your numbers from, but certainly not from an understanding of hashcodes. Take a word that has four characters: in simple ASCII that's 4 bytes, and a 32-bit hashcode is also only 4 bytes. But a word like "pneumonoultramicroscopicsilicovolcanoconiosis" is 45 characters (45 bytes), and the word is also represented by other components, which means there are even more bytes involved, since each word is stored with its OL. All of that reduces to a 32-bit hashcode! The ontological component and the descriptor component provide feature or property states for each word, and there is only one instance of those structures for each word. I have something like 790,000 words stored with the OL database and it's only 391MB, but its hashcode store is only 3.2MB with 32-bit codes and 6.4MB with 64-bit codes! My older server has 128GB and the new system has 256GB. The 391MB plus the additional 3.2MB is but a drop in the bucket of all the RAM I have! The descriptor component is just starting out, but right now it averages 20,000 bytes per word; at 790,000 words that's 16GB to cache it, yet its hashcode per word reduces to just 4 to 8 bytes!
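A quick sketch of the fixed-width point, assuming zlib.crc32 as a stand-in 32-bit hash (the actual hash function isn't specified here, so treat that choice and the name hashcode32 as illustrative only):

```python
import struct
import zlib

def hashcode32(word: str) -> int:
    """Stand-in 32-bit hashcode; crc32 returns an unsigned 32-bit int in Python 3."""
    return zlib.crc32(word.encode("utf-8"))

words = ["walk", "pneumonoultramicroscopicsilicovolcanoconiosis"]
packed = {w: struct.pack("<I", hashcode32(w)) for w in words}

for w in words:
    # the word length varies, the packed hashcode is always 4 bytes
    print(len(w), "chars ->", len(packed[w]), "bytes")
```

Different words can land on the same 32-bit code, so a real store would still have to deal with collisions somehow.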
Ok, so you might argue: but you have to index those features as well. And you're right. The current descriptor DB feature index averages 881 bytes per word, so 790,000 words would be just 700MB, where each feature is a single instance with a HashSet that stores a reference to the descriptor component. Again, a drop in the bucket of all the RAM I have! So even as the data grows as the system learns and makes those descriptor components more complex, there is plenty of room and the hashcode burden is trivial.
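For anyone who wants to check the arithmetic, here is a small sketch that recomputes the figures quoted above in decimal MB/GB. The variable names are made up; the per-word sizes are the ones from the posts.

```python
WORDS = 790_000

MB = 1_000_000
GB = 1_000_000_000

hash32_bytes = WORDS * 4              # one 32-bit code per word
hash64_bytes = WORDS * 8              # one 64-bit code per word
descriptor_bytes = WORDS * 20_000     # average descriptor size per word
feature_index_bytes = WORDS * 881     # average feature index size per word

print(f"32-bit hashcodes: {hash32_bytes / MB:.2f} MB")
print(f"64-bit hashcodes: {hash64_bytes / MB:.2f} MB")
print(f"descriptors:      {descriptor_bytes / GB:.1f} GB")
print(f"feature index:    {feature_index_bytes / MB:.0f} MB")
```

The results land at about 3.16 MB, 6.32 MB, 15.8 GB and 696 MB, which agrees with the rounded numbers in the posts.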
If you remember my post on the time chunking scheme, I ran through 128GB in 32 minutes, but later I simply stored threshold deltas of stimuli, which kept everything manageable for days. Yes, eventually you'll need to manage the temporal resources by writing to disk, which in this case is an NVMe gen3 or gen4 SSD. The approach keeps things pretty responsive when having to find data on the disk, which is indexed with hashcodes and inter-file locations.
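The exact delta scheme isn't spelled out here, so the following is only one possible reading of "threshold deltas": keep a sample only when it has moved at least some threshold away from the last sample that was kept. The function name and the sample readings are hypothetical.

```python
def threshold_deltas(samples, threshold):
    """Yield (index, value) only when value differs from the last stored value by >= threshold."""
    last = None
    for i, value in enumerate(samples):
        if last is None or abs(value - last) >= threshold:
            last = value
            yield i, value

readings = [0.50, 0.51, 0.50, 0.52, 0.90, 0.91, 0.89, 0.30]
stored = list(threshold_deltas(readings, threshold=0.1))
print(stored)  # 3 of the 8 readings survive
```

With slowly changing stimuli most samples get dropped, which is how something like this would stretch memory from minutes to days.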
Now here's your problem with ANNs: you have to iterate through the entire network, and that doesn't really work in a way that can represent meaning as a point in memory. Your ANN distributes the description of words across the entire network, which is why you have to iterate through the entire matrix to get an output. My approach doesn't. The data is focused into structures that have single instances, which can even change dynamically in real time, meaning the system can learn while it executes! As stimuli enter the system they are converted into hashcode sets that look up the relevant data, which is associated with functions or processes to respond. So I don't have to iterate through the entire dataset as ANNs do, and I can change the associations to those structures instantly, with no retraining of the entire system. Also, remember that other data relating to the descriptors is reached by reference, and those instances reference other data in turn. So algorithmically I can gain access to data that provides more capabilities without having to randomly search for it, since it's right there for the taking because of how relationships are linked/referenced. Again, that speeds up processing, and I'm not iterating through billions of other neurodes that aren't really representative of what is needed but whose contribution to the output you have to calculate regardless.
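A toy sketch of the lookup idea, under some loud assumptions: the stimulus is a list of tokens, the "hashcode set" is a frozenset of crc32 codes, and the associated process is a plain Python callable. AssociationStore, bind and respond are invented names, not the real system.

```python
import zlib

def hashcode_set(tokens):
    """Turn a stimulus (list of tokens) into an order-insensitive set of 32-bit codes."""
    return frozenset(zlib.crc32(t.encode("utf-8")) for t in tokens)

class AssociationStore:
    def __init__(self):
        self._handlers = {}  # hashcode set -> function that responds

    def bind(self, tokens, handler):
        """Create or replace an association at runtime; nothing is retrained."""
        self._handlers[hashcode_set(tokens)] = handler

    def respond(self, tokens):
        """Direct dict lookup; no pass over the rest of the stored data."""
        handler = self._handlers.get(hashcode_set(tokens))
        return handler(tokens) if handler else None

store = AssociationStore()
store.bind(["open", "door"], lambda t: "opening")
first = store.respond(["door", "open"])      # same set of codes, so it matches
store.bind(["open", "door"], lambda t: "door is locked")
second = store.respond(["open", "door"])     # association changed on the fly
missing = store.respond(["close", "window"])
```

Rebinding a key swaps the behaviour immediately, which is the "learn while it executes" point in miniature.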
With an ANN you can't just find the functional data points with a query, but with this approach you can, and that's why only a fraction of the computational horsepower is needed compared to an ANN. Here's another advantage: I can still use ANNs, but they are much, much smaller because they are focused on the semantic interpretations of a query that's initiated from stimuli, whose generalized states can be evaluated into patterns. Realize the ANN is called only after the data is matched to the stimuli. So the problem domain is much smaller than that of GPT3, which tries to encode everything into one big ANN. Also, this approach isn't trapped into an ANN-only solution, so it opens up the framework to a universe of solutions, e.g. genetic algorithms, differential equations, Bayesian inference, etc.