Are all ANNs just vectors of numbers multiplying together and propagating? That seems rather simplistically dull... I know they get a lot done doing that. I still want to know what Transformers do.

For the most part, yes, that's all neural networks are--linear algebra. There is a minor exception, though. In the book 'Neural Networks: A Comprehensive Foundation', Simon Haykin mentions near the end that if neural networks are ever going to become something more than linear algebra, it will be because of their nonlinear transfer function, which is the only part that is too hard to model easily and is therefore not fully understood. Unfortunately, despite looking several times, I have been unable to find that key quote again, maybe because it was in the first edition of his book and not in the second edition that the library has.

My recommendation is to stay away from neural networks as a foundation for research, because they are hardware, and hardware is not the essence of an architecture. The essence of an architecture is the middle of the three levels--hardware, algorithm, and goal--described in David Marr's book 'Vision'. First figure out how to solve the problem of general intelligence, then figure out which hardware will do what you want. If neural networks fit well with what you want to do, then go ahead and choose that hardware. I lost several years of my life because I didn't have that wisdom earlier, so maybe you can save yourself a few years with this advice.
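
To make the "it's all linear algebra except the transfer function" point concrete, here is a minimal NumPy sketch (my own illustration, not from either book): stacked linear layers always collapse into a single matrix, and only the nonlinearity prevents that collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))   # first layer weights
W2 = rng.standard_normal((2, 5))   # second layer weights
x = rng.standard_normal(3)         # input vector

# Without a nonlinear transfer function, two layers are just one
# linear map: W2 @ (W1 @ x) equals (W2 @ W1) @ x for every x.
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True

# Insert a nonlinearity (here tanh) between the layers and no single
# matrix can reproduce the mapping -- this is the part of a network
# that is more than linear algebra.
y = W2 @ np.tanh(W1 @ x)
```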

----------

(p. 23)

It is a mathematical theorem that these conditions define the operation of addition, which is therefore the appropriate computation to use.

The whole argument is what I call the computational theory of the cash register. Its important features are (1) that it contains separate arguments about what is computed and why and (2) that the resulting operation is defined uniquely by the constraints it has to satisfy. In the theory of visual processes, the underlying task is to reliably derive properties of the world from images of it; the business of isolating constraints that are both powerful enough to allow a process to be defined and generally true of the world is a central theme of our inquiry.

In order that a process shall actually run, however, one has to realize it in some way and therefore choose a representation for the entries that the process manipulates. The second level of the analysis of a process, therefore, involves choosing two things: (1) a representation for the input and for the output of the process and (2) an algorithm by which the transformation may actually be accomplished. For addition, of course, the input and output representations can both be the same, because they both consist of numbers. However, this is not true in general. In the case of a Fourier transform, for example, the input representation may be the time domain, and the output, the frequency domain. If the first of our levels specifies what and why, this second level specifies how.
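
Marr's Fourier example is easy to see in code. A rough sketch (mine, not Marr's) using NumPy's FFT: the input representation is amplitudes over time, the output representation is amplitudes over frequency, and the transform is the algorithm that maps one into the other.

```python
import numpy as np

# A 440 Hz tone sampled at 8 kHz. The input representation is the
# time domain: one amplitude per sample.
fs = 8000                               # sampling rate in Hz
t = np.arange(fs) / fs                  # one second of sample times
signal = np.sin(2 * np.pi * 440 * t)

# The FFT is the algorithm; its output representation is the
# frequency domain: one complex amplitude per frequency bin.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

print(freqs[np.argmax(np.abs(spectrum))])  # -> 440.0, the tone we put in
```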

There are three important points here. First, there is usually a wide choice of representation. Second, the choice of algorithm often depends rather critically on the particular representation that is employed. And third, even for a given fixed representation, there are often several possible algorithms for carrying out the same process. Which one is chosen will usually depend on any particularly desirable or undesirable characteristics that the algorithms may have; for example, one algorithm may be much

(p. 24)

more efficient than another, or another may be slightly less efficient but more robust (that is, less sensitive to slight inaccuracies in the data on which it must run). Or again, one algorithm may be parallel, and another, serial.
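
Floating-point summation is a handy instance of this point. Here is a sketch (my example, not Marr's) of two algorithms for the same process: the naive one is cheaper, while Kahan's compensated version does slightly more work but is more robust to rounding error in the data path.

```python
import math

def naive_sum(xs):
    # Straightforward accumulation: fast, but rounding error builds up.
    total = 0.0
    for x in xs:
        total += x
    return total

def kahan_sum(xs):
    # Kahan's compensated summation: a little more work per element,
    # but far less sensitive to floating-point rounding -- "more robust".
    total, comp = 0.0, 0.0
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

data = [0.1] * 1_000_000
print(naive_sum(data))   # drifts measurably from the true sum
print(kahan_sum(data))   # stays within rounding distance of it
print(math.fsum(data))   # correctly rounded reference value
```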

**The choice, then, may depend on the type of hardware or machinery in which the algorithm is to be embodied physically.**

This brings us to the third level, that of the device in which the process is to be realized physically. The important point here is that, once again, the same algorithm may be implemented in quite different technologies. The child who methodically adds two numbers from right to left, carrying a digit when necessary, may be using the same algorithm that is implemented by the wires and transistors of the cash register in the neighborhood supermarket, but the physical realization of the algorithm is quite different in these two cases. Another example: Many people have written computer programs to play tic-tac-toe, and there is a more or less standard algorithm that cannot lose. This algorithm has in fact been implemented by W. D. Hillis and B. Silverman in a quite different technology, in a computer made out of Tinkertoys, a children's wooden building set. The whole monstrously ungainly engine, which actually works, currently resides in a museum at the University of Missouri in St. Louis.
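
The child's procedure is simple to write down. A quick sketch (my code, not the book's) of that same right-to-left, carry-a-digit algorithm, which a child, the cash register's transistors, or even a Tinkertoy machine could equally well realize:

```python
def add_digits(a, b):
    # The schoolchild's algorithm: add digit pairs right to left,
    # carrying a digit when necessary. Digits are stored least
    # significant first, e.g. 407 -> [7, 0, 4].
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        carry, digit = divmod(da + db + carry, 10)
        result.append(digit)
    if carry:
        result.append(carry)
    return result

print(add_digits([7, 0, 4], [8, 9]))  # 407 + 98 -> [5, 0, 5], i.e., 505
```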

**Some styles of algorithm will suit some physical substrates better than others.** For example, in conventional digital computers, the number of connections is comparable to the number of gates, while in the brain, the number of connections is much larger (×10^4) than the number of nerve cells. The underlying reason is that wires are rather cheap in biological architecture, because they grow individually and in three dimensions. In conventional technology, wire laying is more or less restricted to two dimensions, which quite severely restricts the scope for using parallel techniques and algorithms; the same operations are often better carried out serially.

**The Three Levels**

We can summarize our discussion in something like the manner shown in Figure 1-4, which illustrates the different levels at which an information-processing device must be understood before one can be said to have understood it completely. At one extreme, the top level, is the abstract computational theory of the device, in which the performance of the device is characterized as a mapping from one kind of information to another, the abstract properties of this mapping are defined precisely, and its appropriateness and adequacy for the task at hand are demonstrated. In the center is the choice of representation for the input and output and the algorithm to be used to transform one into the other. And at the other extreme are the details of how the algorithm and representation are realized physically--the detailed computer architecture, so to speak.

(p. 25)

-----

- **Computational theory:** What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?
- **Representation and algorithm:** How can this computational theory be implemented? In particular, what is the representation for the input and output, and what is the algorithm for the transformation?
- **Hardware implementation:** How can the representation and algorithm be realized physically?

Figure 1-4. The three levels at which any machine carrying out an information-processing task must be understood.

-----

**These three levels are coupled, but only loosely.** The choice of an algorithm is influenced, for example, by what it has to do and by the hardware in which it must run. But there is a wide choice available at each level, and the explication of each level involves issues that are rather independent of the other two.

Each of the three levels of description will have its place in the eventual understanding of perceptual information processing, and of course they are logically and causally related. But an important point to note is that since the three levels are only rather loosely related, some phenomena may be explained at only one or two of them. This means, for example, that a correct explanation of some psychophysical observation must be formulated at the appropriate level. In attempts to relate psychophysical problems to physiology, too often there is confusion about the level at which problems should be addressed. For instance, some are related mainly to the physical mechanisms of vision--such as afterimages (for example, the one you see after staring at a light bulb) or such as the fact that any color can be matched by a suitable mixture of the three primaries (a consequence principally of the fact that we humans have three types of cones). On the other hand, the ambiguity of the Necker cube (Figure 1-5) seems to demand a different kind of explanation. To be sure, part of the explanation of its perceptual reversal must have to do with a bistable neural network (that is, one with two distinct stable states) somewhere inside the

(p. 26)

brain, but few would feel satisfied by an account that failed to mention the existence of two different but perfectly plausible three-dimensional interpretations of this two-dimensional image.
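
A "bistable network" can be sketched in a few lines. Here is a toy model (my illustration, not Marr's): two units that inhibit each other, so the system always settles into one of two stable states depending on which unit starts slightly ahead--loosely analogous to the two rival interpretations of the Necker cube.

```python
import numpy as np

def settle(x, steps=200, inhibition=2.0, rate=0.1):
    # Two units, each excited by a constant input and inhibited by the
    # other unit's activity. The symmetric state is unstable, so the
    # network falls into one of two stable states: unit 0 wins or
    # unit 1 wins.
    x = np.array(x, dtype=float)
    for _ in range(steps):
        drive = 1.0 - inhibition * x[::-1]            # cross-inhibition
        x = np.clip(x + rate * (drive - x), 0.0, 1.0)
    return x

print(settle([0.51, 0.49]))  # -> approximately [1, 0]
print(settle([0.49, 0.51]))  # -> approximately [0, 1]
```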

For some phenomena, the type of explanation required is fairly obvious. Neuroanatomy, for example, is clearly tied principally to the third level, the physical realization of the computation. The same holds for synaptic mechanisms, action potentials, inhibitory interactions, and so forth.

Marr, David. 1982. *Vision: A Computational Investigation into the Human Representation and Processing of Visual Information*. Cambridge, Massachusetts: The MIT Press.

----------

(p. 4)

From the above discussion, it is apparent that a neural network derives its computing power through, first, its massively parallel distributed structure and, second, its ability to learn and therefore generalize; generalization refers to the neural network producing reasonable outputs for inputs not encountered during training (learning). These two information-processing capabilities make it possible for neural networks to solve complex (large-scale) problems that are currently intractable. In practice, however, neural networks cannot provide the solution working by themselves alone. Rather, they need to be integrated into a consistent system engineering approach. Specifically, a complex problem of interest is decomposed into a number of relatively simple tasks, and neural networks are assigned a subset of the tasks (e.g., pattern recognition, associative memory, control) that match their inherent capabilities. **It is important to recognize, however, that we have a long way to go (if ever) before we can build a computer architecture that mimics a human brain.**

The use of neural networks offers the following useful properties and capabilities:

**1. Nonlinearity.** A neuron is basically a nonlinear device. Consequently, a neural network, made up of an interconnection of neurons, is itself nonlinear. Moreover, the nonlinearity is of a special kind in the sense that it is distributed throughout the network. Nonlinearity is a highly important property, particularly if the underlying physical mechanism responsible for the generation of an input signal (e.g., speech signal) is inherently nonlinear.

**2. Input-Output Mapping.**
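
Haykin's nonlinearity point has a classic illustration: no purely linear network can compute XOR, but a tiny network with a nonlinear transfer function can. A sketch with hand-picked weights (my example, not Haykin's):

```python
import numpy as np

def xor_net(x1, x2):
    # Hidden layer: two threshold ("step") units computing OR and AND.
    # np.heaviside(z, 0) is the nonlinear transfer function here;
    # remove it and the whole network collapses to one linear map,
    # which provably cannot represent XOR.
    hidden = np.heaviside(np.array([x1 + x2 - 0.5,    # fires on OR
                                    x1 + x2 - 1.5]),  # fires on AND
                          0.0)
    # Linear output layer: OR minus AND equals XOR.
    return int(hidden[0] - hidden[1])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # 0 0->0, 0 1->1, 1 0->1, 1 1->0
```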

Haykin, Simon. 1994. *Neural Networks: A Comprehensive Foundation*. New York, New York: Macmillan College Publishing Company.