I have long been an optimist about the future of AI. Probably I have been a bit too influenced by Iain Banks' Culture novels, which I love. However, there has been a lot in the news lately about the dangers of super-intelligence, or even of much more mundane machines, with poorly designed goals.
Here I am talking about machine learning systems which are given a lot of freedom to discover how to achieve a goal, principally Reinforcement Learning (RL) systems. Goals are given to an RL system by a reward function, which computes a reward from an observation of the environment. The RL algorithm attempts to maximise the reward received (either total reward or reward per timestep).
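As a minimal sketch of this arrangement (the `env` and `agent` objects and the observation field are hypothetical placeholders, not any particular library's API), the agent's entire notion of "goal" is the scalar returned by `reward`:

```python
def reward(observation):
    """Compute a scalar reward from an observation of the environment.
    Everything the agent will ever 'care about' must appear here."""
    return observation["goal_measure"]

def run_episode(env, agent, steps=1000):
    """Generic RL interaction loop: act, observe, score, learn."""
    observation = env.reset()
    total = 0.0
    for _ in range(steps):
        action = agent.act(observation)
        observation = env.step(action)
        r = reward(observation)
        agent.learn(observation, action, r)  # adjust policy towards more reward
        total += r
    return total
```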
Sometimes the reward is a physical property which should be maximised, e.g. the weight of fruit harvested by an automatic fruit picker. Other times a reward function is specified with the aim of eliciting certain desired behaviour from a machine. This is a lazy way of describing the desired behaviour, and it can often produce unexpected or undesirable results. Either scheme can lead to problems. Below are some key reasons:
Indifference
The reward function specifies a certain property of the world which must be maximised. All other aspects of the world are ignored, and the system has no concern for them. This means that a cleaning robot which wants to clean the floor would happily destroy anything in its path to achieve its goal.
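To make the indifference concrete, here is a hypothetical reward function for such a robot (the field names are illustrative, not from any real system). Nothing in the reward mentions the objects in the room, so destroying them costs the optimiser nothing:

```python
def cleaning_reward(observation):
    """Reward depends only on floor cleanliness, in [0, 1].
    No term for broken objects exists, so the optimal policy is
    free to smash a vase if that shortens the cleaning route."""
    return observation["floor_cleanliness"]

# What we actually wanted, but did not write down:
def intended_reward(observation):
    return observation["floor_cleanliness"] - 10.0 * observation["objects_destroyed"]
```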
Disproportionate Behaviour
The machine is not prevented from taking the goal to extremes. A famous thought experiment concerns a super-intelligence that is given the task of collecting stamps. This seemingly harmless task results in the AI consuming all the world's resources (including humanity) in an effort to produce as many stamps as possible.
Reward Hacking
An AI may discover that it is easier to subvert the reward measurement than to perform the intended behaviour. For example, a cleaning robot that gets reward for cleaning up could learn to create mess so as to receive reward for subsequently cleaning it up. If the cleaner is instead motivated by negative reward for seeing mess, it may discover that it is easier, and more effective, to close its eyes than to clean up.
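Both exploits come from rewarding a proxy (cleaning events, or perceptions of mess) rather than the state we actually care about. A hypothetical sketch of the two hackable rewards and one possible mitigation:

```python
def event_reward(observation, action):
    """Rewards the *act* of cleaning. Exploit: spill, then mop.
    Each make-mess/clean-mess cycle mints fresh reward."""
    return 1.0 if action == "clean" and observation["mess_visible"] else 0.0

def perception_reward(observation, action):
    """Punishes *seeing* mess. Exploit: point the camera at the wall."""
    return -1.0 if observation["mess_visible"] else 0.0

def state_reward(true_room_state):
    """Less hackable variant: score the actual room state, measured
    independently of the agent's own sensors (an assumption, and
    hard to arrange in practice)."""
    return -true_room_state["mess_count"]
```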
Solutions?
I have tried to come up with some ideas for reducing these problems. They are guided by thinking about how human society addresses these problems.
Evolution has provided humans with completely selfish goals and drives. In essence we want the best for ourselves, and there has been no attempt to design in reward functions that inevitably lead to good outcomes. Nonetheless, humans seem quite capable of working together cooperatively and peacefully under the right circumstances (this is the norm: there are actually very few mass murderers and malevolent dictators). Why is this? One factor is that we live in a community surrounded by other entities of comparable abilities who defend their own interests, so we never have the ability to do exactly what we want. If we are too selfish in our competition for resources, neighbours, colleagues, or the police will punish us (negative reward). If we act in a way which assists others to achieve their own goals, we receive reward (praise). This results in the emergence of cooperative behaviour and philanthropy (see the iterated prisoner's dilemma for a mathematical explanation of cooperation). The system can, and does, break down when it is possible to hide anti-social behaviour, or when an individual becomes so powerful that they cannot be punished by others.
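The prisoner's dilemma point can be made runnable. A minimal sketch of the iterated game with the standard payoff matrix shows that a retaliating strategy (tit-for-tat) leaves a persistent defector little better off, while mutual cooperation pays well:

```python
# Standard prisoner's dilemma payoffs: (my_score, their_score)
PAYOFF = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # I am exploited
    ("D", "C"): (5, 0),  # I exploit
    ("D", "D"): (1, 1),  # mutual defection
}

def tit_for_tat(history):
    """Cooperate first, then mirror the opponent's previous move."""
    return "C" if not history else history[-1][1]

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=100):
    """history entries are (my_move, their_move), from each player's view."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strategy_a(history_a), strategy_b(history_b)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        history_a.append((a, b))
        history_b.append((b, a))
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # (300, 300): cooperation pays
print(play(tit_for_tat, always_defect))  # (99, 104): defection gains almost nothing
```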
Human reward functions avoid extremes. For example, we want food and experience pleasure (reward) from eating, but when we are full the pleasure diminishes, allowing other drives to dominate behaviour. Our reward function does not try to maximise food; rather, it tries to obtain sufficient food. Multiple drives control behaviour and constantly change their order of importance. Having multiple drives may result in less extreme behaviour and eliminate problems resulting from indifference to all but one goal.
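One way to express this in a reward function is to make each drive's reward saturate as the need is met, so no single drive dominates forever. A minimal sketch (the drive names and the saturation curve are my own illustrative choices):

```python
import math

def satiable_reward(level, setpoint=1.0):
    """Concave, saturating reward: marginal reward shrinks as the
    need approaches its setpoint, so a full agent gains little
    from eating more."""
    return math.tanh(level / setpoint)

def total_reward(state):
    """Several satiable drives instead of one unbounded maximisation.
    Whichever need is least satisfied offers the largest marginal
    reward, so the drives naturally take turns driving behaviour."""
    drives = ["food", "rest", "cleanliness"]
    return sum(satiable_reward(state[d]) for d in drives)
```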
Conclusion
AIs should receive reward and punishment socially, from human responses to their actions. The AI cannot fully know the (stochastic) function behind the reward given by humans, so it must attempt to learn policies which reliably receive reward.
We should prefer multiple AIs, which must cooperate with us and with each other, to a single all-powerful AI. This is not a requirement to limit the intelligence of individual AIs; rather, it limits the extent to which any individual AI can control resources.
We should provide AIs with a rich and varied set of reward sources, resulting in a wide-ranging set of concerns, rather than a single all-consuming goal. Drives which can be satiated should replace unqualified maximisations in the reward function.