I think I'm onto something big. So in my big AGI design there is a smaller part just for recognizing objects. When I tried thinking about how it'd work for images instead of text, I realized something new was needed, and it should help text recognition too. A human can see a new image they have never seen before, just 1 example of it, and then, given dummy images, can easily spot which one is the same image despite it being rotated, brighter, stretched, blurred, having parts rotated, colorless, flipped horizontally, bigger, etc. Same for music: a slowed down, higher pitch, louder version of Jingle Bell Rock sounds very recognizable despite being a different "image". It's because there really is not much difference; relatively, all the parts are the same. Let me explain.
So, originally you store a node in the brain for a text word, ex. "hello", so if you see "hell" you still recognize it to some degree because of #1 how many of the expected parts are there (ex. 4 of 5 letters are there, so it is 80% triggered / recognized), and #2 the time delay of where it expects the parts, ex. "olleh" / "hellzzzzo" / "hZeZlZlZo". So it's flexible, it recognizes typos and delays in location. "you how arrrre ? doing" lol.
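Here's a tiny sketch of that scoring idea in Python (my own toy code; the function name and the exact penalty formula are just assumptions, not part of the design): count how many expected parts showed up, and penalize how far each one is from where the node expected it.

def recognition_score(stored, seen):
    # Part 1: how many of the stored letters appear, in order, in what we see.
    # Part 2: a penalty for how far each matched letter is from its expected position.
    matched = 0
    total_delay = 0
    cursor = 0  # where we are while scanning `seen`
    for expected_pos, letter in enumerate(stored):
        found = seen.find(letter, cursor)
        if found == -1:
            continue  # this part is simply missing
        matched += 1
        total_delay += abs(found - expected_pos)
        cursor = found + 1
    presence = matched / len(stored)  # "hell" vs "hello" -> 4/5 = 0.8 triggered
    delay_penalty = total_delay / (total_delay + len(stored))
    return presence * (1.0 - delay_penalty)

print(recognition_score("hello", "hell"))       # 0.8: 4 of 5 parts there, right on time
print(recognition_score("hello", "hellzzzzo"))  # lower: all parts there, but the "o" is late
print(recognition_score("hello", "hZeZlZlZo"))  # lower: every part is a bit delayed
print(recognition_score("hello", "olleh"))      # much lower: parts are out of place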
So with an image of a stop sign, say we see one that is just much brighter. Obviously if we total up each pixel to do the match, the global sum of brightness difference will be huge: each pixel is 6 shades brighter than the pixel we compare to in the original memory image. Yet an image of a frog, not so bright at all, will have a very similar global sum of brightness (how bright the whole image is), obtained by just shuffling the pixels around. Ex. 3 pixels of brightness 3, 6, 2... and if the image is 2 times brighter it is 6, 12, 4... it still looks like a stop sign image, just brighter... but if we take the original stop sign image 3, 6, 2 and shuffle it, we get an image of a frog, ex. 6, 3, 2! The time delay is of course off, but it won't help us now. In fact, you can see in this video how they cut out real wood pieces and, just by rearranging them, get a completely different image! Clearly their arrangement alone is not usable to see a stop sign when it now looks like a human face:
https://light.informatik.uni-bonn.de/papers/IseringhausenEtAl-ComputationalParquetry-TOG2020.pdf
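To make the numbers concrete, here is a toy Python sketch (my own illustration, pixel values made up): the global brightness sum says the shuffled "frog" is identical to the stored stop sign and the brighter stop sign is far away, while looking at the consistency of the per-pixel errors says the opposite.

stored   = [3, 6, 2, 5, 4]           # remembered stop-sign pixel brightnesses
brighter = [p + 6 for p in stored]   # same sign, every pixel 6 shades brighter
shuffled = [6, 3, 5, 2, 4]           # same pixels rearranged ("the frog")

# Naive global brightness: the frog looks like a perfect match, the brighter sign looks wrong.
print(sum(stored), sum(brighter), sum(shuffled))   # 20, 50, 20

def error_pattern(stored, suspect):
    # Per-pixel error, plus how much the errors vary: low spread = one shared pattern.
    errors = [b - a for a, b in zip(stored, suspect)]
    mean = sum(errors) / len(errors)
    spread = sum((e - mean) ** 2 for e in errors) / len(errors)
    return errors, spread

print(error_pattern(stored, brighter))  # ([6, 6, 6, 6, 6], 0.0): every pixel off by the same amount
print(error_pattern(stored, shuffled))  # ([3, -3, 3, -3, 0], 7.2): errors jump around, no shared pattern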
So how do we realize that a much brighter stop sign, although much further from the memory than the frog (the frog gotten by shuffling the pixels around, not any brighter than the original stop sign!), is actually a stop sign and not a frog? Time delay is of some help but clearly won't help us enough and will even lie to us (frog = stop sign), so how do we realize the brighter stop sign is actually a stop sign? Because each pixel is of a much different brightness, they could be arranged differently for all we know. Again, the frog image has less brightness difference in global sum total. So with the brighter stop sign, we need to not just compare pixel to original pixel, we need to look at it like this: suspect image pixel 1 is 6 shades brighter than original image pixel 1, and the other pixels, when compared, are also 6 shades brighter. That's the pattern.

In text this looks like this: we see "hello" and then see "hzzzezzzlzzzlzzzo". Obviously the location delay is big, but after the first painful wait of 3 z's (zzz), it sees another wait of the same length (zzz again), it is less upset and expects the locations of the rest of "hello", and it is hello. If it were (without the spaces and CAPS) "H sfw E uowfg L rtl L opywds O", it's obviously not spaced evenly like "H rts E jui L dfg L awq O". Although both look really silly, the latter one has the pattern "hello" in it because there are 3 random letters between each of the letters we want; the other has random letters all over it and doesn't say hello. It is expected error. And really useful for image and music recognition, I believe.
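Here is the same idea on the text example, as a quick sketch (again my own toy code, not a spec): measure the gaps between where the wanted letters actually show up; evenly spaced gaps mean the error itself has a pattern.

def letter_gaps(target, noisy):
    # Where each letter of `target` appears (in order) in `noisy`, and the gaps between them.
    positions = []
    cursor = 0
    for letter in target:
        found = noisy.find(letter, cursor)
        if found == -1:
            return None  # a wanted letter is missing entirely
        positions.append(found)
        cursor = found + 1
    return [b - a for a, b in zip(positions, positions[1:])]

print(letter_gaps("hello", "hzzzezzzlzzzlzzzo"))       # [4, 4, 4, 4]: same wait every time, still "hello"
print(letter_gaps("hello", "hrtsejuildfglawqo"))       # [4, 4, 4, 4]: 3 random letters between each wanted one
print(letter_gaps("hello", "hsfweuowfglrtllopywdso"))  # [4, 6, 3, 2]: uneven, random letters all over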
For images, it uses location, brightness, and color: relative expected error. So if alllll the image is brighter, or stretched, it sees the same gap in expectation across the whole image. It still works fine if the stop sign's top-right and bottom-left patches have inverted brightness: it will process each of the 4 squares with less upsetness, and then when it compares the 4 squares themselves to each other it will see square 1 to square 2 are very different brightnesses because of the inversion, but so are the other 2 corners.
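A rough sketch of the hierarchical version (the patch layout and numbers are my assumptions): run the same consistency check inside each patch, then run it again on the patch-level offsets themselves, so a whole-image change shows up as one clean pattern one level up.

def offset_and_spread(stored, suspect):
    # Mean per-pixel error and how much the errors vary around that mean.
    errors = [b - a for a, b in zip(stored, suspect)]
    mean = sum(errors) / len(errors)
    spread = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, spread

stored_img  = {"TL": [3, 4], "TR": [6, 7], "BL": [5, 6], "BR": [2, 3]}   # 4 patches, 2 pixels each
suspect_img = {k: [p + 6 for p in v] for k, v in stored_img.items()}     # the whole sign is brighter

# Level 1: inside each patch the error is perfectly consistent (offset +6, spread 0).
patch_offsets = {k: offset_and_spread(stored_img[k], suspect_img[k]) for k in stored_img}
print(patch_offsets)

# Level 2: the patch offsets themselves form one pattern (+6 everywhere, spread 0),
# so the higher-layer node can still fire even though every raw pixel is "wrong".
offsets = [off for off, _ in patch_offsets.values()]
mean_off = sum(offsets) / len(offsets)
print(mean_off, sum((o - mean_off) ** 2 for o in offsets) / len(offsets))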
For a music track, it can be stretched longer, louder, higher pitch (size, brightness, color ;p), and it should work fine by seeing the expected error across it. If we feed a small hierarchy network the music track flipped, played backwards, it doesn't really look at it like that; it just puts each "pixel" into the bottom of the network hierarchy, and then if the end of the song has the expected brightnesses and locations and pitch, it will light up parts of the node in the higher layer of the neural network (the start of that node, despite being at the end).
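For the music case, a small sketch under my own assumptions (toy (pitch, duration) pairs, not real Jingle Bell Rock data): a transposed, slowed-down copy of a stored melody is off by the same amount at every note, which is exactly the expected-error pattern.

original = [(64, 1.0), (64, 1.0), (64, 2.0), (64, 1.0), (64, 1.0), (64, 2.0), (67, 1.0), (60, 1.0)]
# The same melody, 5 semitones higher and stretched 1.5x slower (it could be louder too).
transformed = [(pitch + 5, dur * 1.5) for pitch, dur in original]

pitch_errors    = [b[0] - a[0] for a, b in zip(original, transformed)]
duration_ratios = [b[1] / a[1] for a, b in zip(original, transformed)]

print(pitch_errors)     # [5, 5, 5, 5, 5, 5, 5, 5]: every note off by the same amount
print(duration_ratios)  # [1.5, 1.5, ...]: every note stretched by the same factor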
Does it make sense to you?
If you blur, stretch, or brighten a stop sign, it looks very different to a computer matching it to a memory, but obviously it is really only a little different, and it matches a lot, because relatively, between all its parts, everything is as expected relative to each other: the parts are relative, each is equally off in error. When it comes to pattern finding, you start at exact matches (only I know this, others don't...); it roots all other pattern matching. For example, if you see "cat ran, cat ran, cat ran, cat sleeps" and are given the prompt cat>?, by frequency "ran" is probably 75% likely and "sleeps" 25% likely as the next predicted word. And if we want to do better we can recognize translation, ex. dog ran, dog sleeps, etc. just as much, so cat = dog, and so maybe cat barks, dog meows; they are extremely close words. And look at this rarer pattern problem: "hat man scarf, book building library, he frog she, wind dine blown, truck rat ?" - it is a vehicle that comes next, because in every triplet the 1st and 3rd items are similar, so we match each triplet to the others, then to the last triplet, and see that its 1st and 3rd should also be a translation pair... lots of translation matching here lol!!!!

When it comes to images, the pixel may be brighter, so it isn't exactly exact text matching, but it still starts at exact matches as evidence for anything further: pixel a is a bit brighter than pixel b. It's rooted at something. Exact match. So back to image distortions: there is lots of relativeness in them, there is really not that much distortion, it is clearly a stop sign, and this uses pattern matching to reach the activation / conclusion. It's very simple. But the AI field is not showing how simple it is, and I think that's where they really lack, and it withholds better understanding.
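Back to the "cat ran" frequency step for a second, a minimal sketch (mine, just to show the counting): count what followed "cat" in the observed text and turn the counts into the 75% / 25% prediction.

from collections import Counter

text = "cat ran cat ran cat ran cat sleeps".split()
followers = Counter(nxt for word, nxt in zip(text, text[1:]) if word == "cat")
total = sum(followers.values())
print({w: count / total for w, count in followers.items()})   # {'ran': 0.75, 'sleeps': 0.25}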
I'm going to try to code it in the coming months, but first I'm busy finishing the guide to AGI into a more stable version, so this may have to wait a bit to get worked on... but I thought I'd share some of the idea with you ahead of time, as it is very interesting / simple and something the AI field can't even do as well; they need many examples of a stop sign and still fail over silly small 1-pixel attacks.
When I ask myself why it is still a stop sign image, I start at the pixels we start with, and then know we will do worse if some parts of the stop sign are missing, and I look at the difference in brightness, the location arrangement of parts of parts in a hierarchy network, color too, and then from there, what can we do next? The relative error difference ("how similar is the error") in brightness, etc. So although there is error / missing parts while matching a stop sign, we can also *relieve* the error a lot by seeing the pattern across the errors. When you stretch an image object, it adds lots of error (though not tons, else we wouldn't easily recognize it), but it can be resolved by my trick of seeing the pattern of error across the object.
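Putting that together as one score, a hedged sketch (the exact discounting rule is my own guess at how the "relief" could work, not a fixed formula): the raw mismatch is forgiven in proportion to how consistent that mismatch is across the object.

def relieved_error(stored, suspect):
    signed = [b - a for a, b in zip(stored, suspect)]
    raw = sum(abs(e) for e in signed) / len(signed)           # plain per-pixel mismatch
    mean = sum(signed) / len(signed)
    spread = sum((e - mean) ** 2 for e in signed) / len(signed)
    consistency = 1.0 / (1.0 + spread)                        # 1.0 when every pixel is off by the same amount
    return raw * (1.0 - consistency)                          # consistent error gets relieved, scattered error does not

stop_sign = [3, 6, 2, 5, 4]
print(relieved_error(stop_sign, [p + 6 for p in stop_sign]))  # 0.0: much brighter, but clearly the same sign
print(relieved_error(stop_sign, [6, 3, 5, 2, 4]))             # ~2.1: the shuffled "frog" stays badly matched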
Do note there are other pattern helpers for prediction, ex. every image I showed you for the past hour was a cloud, so the next image is probably a cloud even if it doesn't look like a cloud. However, this only applies if you have a history; if there is no history, you rely on the image alone. And the reward system simply weights predictions toward loved words / images; it is not telling you what really is in an image.
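One last tiny sketch (the weights and numbers are my assumptions, just to make the idea concrete) of how that history prior could be blended with the image-alone match, without the history ever replacing what the image says:

def blended_belief(image_match, history_prior, history_weight=0.5):
    # Both scores are in [0, 1] for "this is a cloud"; history_weight says how much the past hour counts.
    return history_weight * history_prior + (1.0 - history_weight) * image_match

print(blended_belief(image_match=0.3, history_prior=0.95))                      # an hour of clouds drags a weak match up
print(blended_belief(image_match=0.3, history_prior=0.0, history_weight=0.0))   # no history: rely on the image alone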