There are hundreds of different techniques and methodologies used by various systems to achieve varying degrees of accuracy in image recognition. What follows is a very brief explanation; this is an extremely involved and complex topic to cover in a short description.
The word ‘convolution’ (in this context) basically means ‘filter’.
There are usually two main stages in an image-recognition convolutional network. The first applies various ‘convolutional’ filters to the image; these are designed to enhance or extract the main features: angles, lines, gradients, shadows, etc.
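As a rough illustration, here is what one such filter looks like in practice, a minimal NumPy/SciPy sketch using a hand-written vertical-edge kernel (real networks learn their kernels rather than using fixed ones like this):

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny 5x5 greyscale "image" (1 = bright, 0 = dark)
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

# A simple vertical-edge filter (Sobel-style kernel)
kernel = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
])

# 'Convolving' the image with the kernel filters out everything
# except vertical edges: large-magnitude values in the output mark
# the left and right boundaries of the bright block.
edges = convolve2d(image, kernel, mode='valid')
print(edges)
```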
One convolutional filter might highlight all the lines in the source image. The system has a pre-stored/learned collection of smaller images (feature maps) that show small lines at various angles of rotation. The original image is then scanned/searched with each of these smaller images (lines), and matches are noted along with their positions. A single feature map might be a horizontal line segment, so every location in the original image (with the convolution filter applied to highlight lines) where a horizontal line is found is recorded.
The same is done for colour boundaries, gradients, etc.
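A toy version of that scanning/recording step might look like this (a minimal sketch; the 1x3 ‘feature map’ and the perfect-match threshold are simplifications made up for illustration):

```python
import numpy as np

# A 7x7 binary image containing a horizontal line segment at row 2
image = np.zeros((7, 7))
image[2, 1:6] = 1

# A small 'feature map' / template: a 1x3 horizontal line
feature = np.ones((1, 3))

# Slide the feature over the image and record where it matches well
matches = []
fh, fw = feature.shape
for y in range(image.shape[0] - fh + 1):
    for x in range(image.shape[1] - fw + 1):
        patch = image[y:y+fh, x:x+fw]
        score = (patch * feature).sum()  # how strongly the patch responds
        if score == feature.sum():       # perfect match: full line present
            matches.append((y, x))

print(matches)  # positions where a horizontal line segment was found
```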
The resulting positions of the recognised smaller feature maps are then fed into the neural net for learning or recognition as complete object features. The neural net learns which collections of the smaller feature maps, and their locations relative ‘to each other’, make up an object feature; and which features make up an object.
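As a hand-wavy sketch of that last step, imagine the detected features flattened into a vector and scored against learned weights. Everything below (the feature encoding, the regions, the weights) is made up by hand purely to stand in for what a net would learn:

```python
import numpy as np

# Hypothetical encoding of what the scanning stage found:
# 1.0 = feature detected in that region of the image, 0.0 = not found.
#                              [top,  right, bottom, left]
horizontal_lines = np.array([1.0, 0.0, 1.0, 0.0])  # lines at top and bottom
vertical_lines   = np.array([0.0, 1.0, 0.0, 1.0])  # lines at right and left
features = np.concatenate([horizontal_lines, vertical_lines])

# Hand-set weights standing in for learned ones: 'horizontal lines top
# and bottom, plus vertical lines left and right, in those relative
# positions' scores highly as a square.
square_weights = np.array([ 1.0, -1.0,  1.0, -1.0,   # wants horizontals top/bottom
                           -1.0,  1.0, -1.0,  1.0])  # wants verticals left/right

score = features @ square_weights
print('square score:', score)  # 4.0 here: all four sides found -> square
```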
In your example of the figure 8, the bottom row shows the original image after it has been run through the various convolutional filters. The pooling layers then downsample those convolved images, keeping only the strongest response in each small region, so that the features found in the convolution layer generalise across small shifts in position. Once past pooling layer 2, the line represents the output of the recognition network.
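For example, max pooling (one common pooling operation) just keeps the strongest value in each small block; a minimal NumPy sketch:

```python
import numpy as np

# A 4x4 'response map' from a convolution layer
responses = np.array([
    [1, 3, 0, 2],
    [4, 2, 1, 0],
    [0, 1, 5, 2],
    [2, 0, 3, 1],
])

# 2x2 max pooling: keep the strongest response in each 2x2 block,
# so the exact pixel position of a feature matters less downstream.
pooled = responses.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 2], [2 5]]
```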
“All the images on Google show a stack of images on the far left as many copies of the SAME image, why?”
This is the stack of images with various convolution filters applied.
“I get that a local window scans the input, but why is the input saved as 50 slight variations stacked behind each other?”
This stack represents the small feature maps that are scanned/searched for within the convolved images.
“Are these *assorted images* from *all of its life*?”
These images are the outputs of the generalising (pooling) stages of the network: they represent the collections of feature maps found and their positions relative to each other.
The basic idea is that an image is broken down into its basic parts, the parts are recognised from a set of pre-defined templates, and then the neural net guesses what’s in the image based on the bits it recognises. lol.
If it looks like a duck... walks like a duck... quacks like a duck... then chances are it's a duck.
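To make the whole pipeline concrete, here is a minimal sketch of such a network in Keras (the layer sizes and the 28x28 greyscale input are arbitrary choices for illustration, not anyone’s specific architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),         # e.g. a small greyscale image
    # Convolution layers: learn the small feature maps (lines, edges, ...)
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),             # pooling: blur out exact positions
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    # Dense layers: learn which combinations of features make up an object
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),  # e.g. one output per object class
])
model.summary()
```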
Just for fun, here’s an example... I will use you as the neural net in a convolutional network.
I’ve drawn a red shape (triangle, square or circle), and searched the image with thousands of feature maps… this is the output result… these were found in the image…
What shape is it?