Anytime Universal Intelligence
One of the most debated issues in the fields of artificial intelligence, cognitive science, psychometrics, biology and anthropology is the definition of intelligence and its measurement.
Measuring human intelligence is of the utmost relevance for many application areas. Consequently, psychometric tests are commonly used to (i) help children and students learn efficiently (education), (ii) assess personnel (selection and recruitment) and (iii) treat different diseases and learning problems (learning therapies).
In a quite similar way, measurements of machine intelligence are required in areas such as (i) making agents, robots and other kinds of “intelligent systems” acquire knowledge quickly (knowledge acquisition), (ii) selecting or assessing the abilities of “intelligent systems” (accreditation/certification) and (iii) fixing or improving the capabilities of these systems (intelligent system design or amendment).
Although the terminology is different, the similarities of both areas are clear. Additionally, a third germane area is comparative psychology or comparative cognition for several kinds of animals, where there is also a debate about how to assess cognitive abilities in higher animals (especially great apes and cetaceans).
The case of machine intelligence is especially relevant because of several issues:
- A science or technology cannot advance if we do not have proper measurement techniques to assess its progress. It is difficult to develop aeronautics, even from watching birds fly, if we do not have measurements for weight, altitude or aerodynamics. Artificial Intelligence has a reference (Homo sapiens) but suffers from a lack of measures and measurement techniques for its artifacts.
- The ubiquity of bots, assistants, majordomos and other kinds of autonomous agents performing tasks on behalf of humans makes it difficult to deal with them appropriately in new spaces for collaboration where virtual interaction substitutes for physical interaction: collaborative platforms, social networks, service-oriented architectures, e-bureaucracy, Web 2.0, etc. In many applications we need to differentiate between stupid single-task bots, general and more intelligent bots, and humans.
o On the one hand, some applications require allowing bots to participate in projects or granting them permissions as interlocutors for management, negotiations or agreements. Bots are therefore expected to be effective in carrying out their delegated tasks, for which they need to be intelligent (non-intelligent bots are to be avoided).
o On the other hand, in some cases we want to prevent bots of all kinds from doing certain tasks (e.g. malicious bots could create millions of accounts in a free email service). For this purpose, CAPTCHAs [von Ahn et al 2002] [von Ahn et al 2008] are widely used, but more precise and reliable mechanisms will be needed in the near future.
Let us give an example. Imagine a collaborative environment where several agents (some of them humans and some of them machines) have to build a specialised website about a specific topic (e.g. a Wikipedia entry, a European project outline, a recent event, a music library, a geographical project, etc.). The team will have to decide the rules of interaction (how information is accepted, integrated and published). In order to become part of the team, each member will need to certify that she/he/it has some basic reasoning/learning abilities. If an inept agent (or person) tries to join the team, we would need some kind of assessment to decide whether to let him/her/it in.
Are mechanisms like this available nowadays? The answer is no. Apart from CAPTCHAs, which only measure some very specific abilities, we do not have any kind of test with which we can quickly and reliably evaluate the intelligence of an intelligent system (an agent).
The purpose of this project is to explore the possibility of constructing tests which are:
· Universal. They can be applied to humans, bots, animals, communities or even hybrids. In order to meet this property, the test must be derived from universal non-anthropomorphic principles grounded in computer science and information theory. This goal, by itself, is very challenging, since existing tests are based on anthropomorphic assumptions.
· Anytime. They can adapt to the time scale of the examinee, and can dynamically adapt their questions, interactions or items to the examinee in order to assess its intelligence more quickly. The test should be flexible enough to evaluate very quickly (with low reliability) for some applications (e.g. those where CAPTCHAs are used today) or to provide highly reliable estimations if more time is available (e.g. applications in which serious consequences derive from a stupid agent meddling around).
STATE OF THE ART
There are two kinds of tests that are of interest in this area: psychometric tests and machine intelligence tests.
Psychometric tests [Martínez-Arias et al 2006] have a long history [Spearman 1904]; they are effective, easy to administer, fast and quite stable when used on the same individual over time. In fact, they have provided one of the best practical definitions of intelligence: “intelligence is what is measured by intelligence tests”. However, psychometric tests are anthropomorphic: they cannot evaluate the intelligence of systems other than Homo sapiens.
New approaches in psychometrics such as Item Response Theory (IRT) allow for item selection based on cognitive demand features, providing results that help understand what is being measured and adapting the test to the level of the individual being examined. Items generated from cognitive theory and analysed with IRT are a promising tool, but these models are not yet fully implemented in testing [Embretson & McCollam 2000].
Even though these and other efforts have tried to establish “a priori” what an intelligence test should be (e.g. [Embretson 1998]) and then find adaptations for different kinds of subjects, in general we need different versions of the psychometric tests to evaluate children at different ages, since the psychometric tests for adult Homo sapiens rely on some knowledge and skills that children have not yet acquired.
The same happens with other animals. Comparative psychologists and other scientists in the area of comparative cognition usually devise specific tests for different species. An example of such specialised tests for children and apes can be found in [Herrmann et al 2007]. Additionally, it has been shown that psychometric tests do not work for machines at the current stage of progress in artificial intelligence [Sanghi and Dowe 2003], since they can be tricked by very simple computer programs. But the main drawback of using psychometric tests to evaluate subjects other than humans is that there is no mathematical definition behind them. Hence it is difficult to devise a test which will work on any intelligent subject (machine, human, animal, etc.), and not only on the specific kinds of subjects on which it has been experimentally shown to work.
Machine intelligence tests have been proposed since Alan Turing [Turing 1950] introduced the imitation game in 1950, now commonly known as the Turing test. In this test, a system is considered intelligent if it is able to imitate a human (i.e., to be indistinguishable from a human) during a period of time, subject to a (tele-text) dialogue with one or more judges. Although still accepted as a reference to eventually check whether artificial intelligence will reach the intelligence of humans, it has generated debate ever since. Of course, many variants and alternatives have been suggested. The Turing test and related ideas present several problems as a machine intelligence test: the Turing Test is anthropomorphic (it measures humanity, not intelligence), it is not gradual (it does not give a score), it is not practical (it is becoming easy to cheat and requires a long time to get reliable assessments) and it requires a human judge.
A recent and naïve approach to machine intelligence testing is the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) [von Ahn et al 2002] [von Ahn et al 2008]. A CAPTCHA is any kind of simple question that can be easily answered by a human but not by current artificial intelligence technology. Typical CAPTCHAs are character recognition problems where the letters appear distorted. These distortions make it difficult for machines (bots) to recognise the letters. An example of a CAPTCHA from Gmail can be seen below:
The immediate objective of a CAPTCHA is to tell humans and machines apart. The ultimate goal is to prevent bots and other kinds of machine agents or programs from being able to create thousands of accounts (or perform other tasks that only humans should do). Note that bots could block or spoil many Internet services we use on a daily basis if these and other control techniques did not exist.
The problem with CAPTCHAs is that they are becoming more and more difficult for humans, since bots are being specialised and improved to read them. Whenever a new CAPTCHA technique is developed, new bots appear that can get through the test. This forces CAPTCHA developers to change the tests again, and so on. The reason is that CAPTCHAs are specific: they rely on particular tasks. Although they work reasonably well today, in 10 or 20 years they will need to be so difficult and general that humans will require more time and several attempts to get through. In fact, this is already happening, and we lose more and more time on CAPTCHAs every day.
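To make the generate/verify protocol behind CAPTCHAs concrete, here is a toy, text-only sketch in Python. Real CAPTCHAs distort rendered images; the "distortion" here (random case plus inserted noise characters) and the function names are purely illustrative:

```python
import random
import string

# Toy sketch of the CAPTCHA protocol: generate a challenge that is
# trivially "distorted" and verify the response. Only the
# generate/verify structure is illustrated here.
def make_captcha(length=6, seed=None):
    rng = random.Random(seed)
    answer = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    # Randomise the case of each letter...
    distorted = "".join(c.upper() if rng.random() < 0.5 else c for c in answer)
    # ...and insert noise characters a human learns to ignore.
    noisy = "~".join(distorted)
    return noisy, answer

def check_captcha(response, answer):
    # Case-insensitive comparison against the stored answer.
    return response.strip().lower() == answer

challenge, answer = make_captcha(seed=42)
print(challenge)
print(check_captcha(answer.upper(), answer))  # → True
```

The arms race described above corresponds to making `make_captcha` harder once bots learn to strip the noise, which is exactly why task-specific tests keep breaking.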
Apart from tests, a much more theoretical and formal approach to machine intelligence has been undertaken by prominent scientists in the 20th century such as A.M. Turing, R.J. Solomonoff, E.M. Gold, C.S. Wallace, J.J. Rissanen, L. Blum and M. Blum, G.J. Chaitin and others. The landmark is the development of Algorithmic Information Theory (also known as Kolmogorov Complexity) (see [Li & Vitányi 2008] for the most complete reference on it), its relation to learning (inductive inference and prediction) [Solomonoff 1964] [Wallace and Boulton 1968] [Solomonoff 1986] [Wallace and Dowe 1999] [Wallace 2005] and ultimately its relation to intelligence.
These ideas led some members of the team presenting this project to introduce formal definitions of intelligence, namely the works [Dowe & Hajek 1997, 1998] [Hernandez-Orallo & Minaya-Collado 1998] [Hernandez-Orallo 2000a] [Hernandez-Orallo 2000b]. All of these tests and definitions are mathematical, non-anthropomorphic (i.e. universal), meaningful and intuitive. In this sense, they do not take Homo sapiens as reference and judge (as the Turing Test does), and they have not evolved by trial and error through a century of experimentation on test subjects (as generally happened in psychometrics); instead, they are constructed on fundamental mathematical notions from computation theory. Some of them succeeded in being practical (feasible as an intelligence test) at the cost of being partial (in the sense that they measure necessary traits of intelligence, but not all of them). Let us see these approaches in more detail below:
On the one hand, Dowe & Hajek [Dowe & Hajek 1997, 1998] suggested introducing inductive inference problems in a somewhat induction-enhanced or compression-enhanced Turing Test, in order to, among other things, deal with Searle’s "Chinese room" paradox, and also because inductive inference ability is a necessary (though possibly "not sufficient") requirement for intelligence.
On the other hand, almost simultaneously, quite similarly and yet independently, [Hernandez-Orallo & Minaya-Collado 1998] [Hernandez-Orallo 2000a] defined intelligence as the ability to comprehend, giving a formal definition of comprehension as the identification of a ‘predominant’ pattern in given evidence, derived from Solomonoff’s induction and prediction concepts, Kolmogorov complexity, Levin’s variants and optimal search. The definition was operationalised as a test, called the C-test [Hernandez-Orallo & Minaya-Collado 1998], formed by computationally-obtained series of increasing complexity. The sequences were formatted and presented much like those in psychometric tests and, as a result, the test was administered to humans. Nonetheless, the main goal was that the test could eventually be administered to other kinds of intelligent systems as well.
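As an illustration of how C-test-like items can be generated computationally, the following Python sketch produces letter series from tiny "programs" (a start letter plus a cyclic increment pattern) and uses the size of that description as a crude stand-in for the complexity measure. This is only an assumption-laden toy: the actual C-test derives complexity from Levin-style measures over real programs.

```python
import itertools
import string

LETTERS = string.ascii_lowercase

def generate_series(start, increments, length):
    """Apply the cyclic increment pattern to produce a letter series."""
    idx = LETTERS.index(start)
    series = [start]
    for step in itertools.islice(itertools.cycle(increments), length - 1):
        idx = (idx + step) % 26
        series.append(LETTERS[idx])
    return series

def description_complexity(start, increments):
    """Crude complexity proxy: size of the generating description."""
    return 1 + len(increments)

# Items of increasing complexity, presented as in the C-test.
items = [("a", [1]), ("a", [2]), ("a", [1, 2])]
for start, incs in items:
    print(description_complexity(start, incs), generate_series(start, incs, 6))
```

For instance, the pattern `("a", [1, 2])` yields `a, b, d, e, g, h`, and an examinee is asked to continue the series; harder items come from longer generating descriptions.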
A factorisation (and hence extension) of these inductive inference tests was sketched in order to explore which other abilities could make up a complete (and hence sufficient) test [Hernandez-Orallo 2000b]. In order to apply the test to systems of low intelligence, (still) unable to understand natural language, the proposal for a dynamic extension of the C-test in [Hernandez-Orallo 2000a] was expressed like this: “the presentation of the test must change slightly. The exercises should be given one by one and, after each guess, the subject must be given the correct answer (rewards and penalties could be used instead)”.
Recent works by Legg and Hutter (e.g. [Legg and Hutter 2007]) have followed the previous steps and, strongly influenced by Hutter’s theory of AIXI optimal agents [Hutter 2005], have given another (and related) definition of machine intelligence, dubbed “Universal Intelligence”, also grounded in Kolmogorov complexity and Solomonoff's induction.
The basic idea is to evaluate the intelligence of an agent p on several environments m, chosen using the universal distribution (the one derived from Kolmogorov complexity, i.e., p(m) = 2^(-K(m))), using rewards that are accumulated during the interaction with the environment. Intelligence is thus defined as the competence of an agent across different environments, where simple environments have higher probability than complex ones.
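The structure of this measure can be illustrated with a computable toy stand-in for K(m): environment descriptions are enumerated as bitstrings and weighted by 2^(-length), so simple environments dominate the sum. The reward function below is a hypothetical example, and this toy prior is left unnormalised:

```python
import itertools

def environments(max_len):
    """Enumerate all bitstring descriptions up to max_len bits."""
    for l in range(1, max_len + 1):
        for bits in itertools.product("01", repeat=l):
            yield "".join(bits)

def universal_score(reward_of, max_len=8):
    """Weighted sum of rewards: an environment of length l gets
    prior weight 2^-l, so simple environments dominate."""
    return sum(2 ** -len(m) * reward_of(m) for m in environments(max_len))

# Toy agent reward: suppose the agent masters environments whose
# description starts with '0' (reward 1) and fails the rest (reward 0).
score = universal_score(lambda m: 1.0 if m.startswith("0") else 0.0)
print(round(score, 4))  # → 4.0 (each length contributes total weight 1)
```

In the real definition the sum runs over all environments and K() replaces the description length, which is precisely what makes it uncomputable as stated.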
The comparison with the works by Hernandez-Orallo (and also Dowe & Hajek's compression test) is summarised in [Legg and Hutter 2007] with the following table:
In fact, the definition based on the C-test can be considered an instance of Legg and Hutter’s setting, where the environment outputs no rewards and the agent is not allowed to act until a number of observations has been seen (the inductive inference sequence). In our opinion, one of the most relevant contributions of their work is that their definition of Universal Intelligence allows one to formally evaluate the theoretical performance of various agents: a random agent, a specialised agent, ..., or a super-intelligent agent such as AIXI [Hutter 2005], which is claimed to be the agent that, if ever constructed, would score best in the Universal Intelligence test. In a nutshell, Legg and Hutter's work is another step towards shaping what should be addressed in the near future in order to eventually reach a theory of machine intelligence.
There are, however, five problems we have identified in their definition with regard to making it practical. First, it involves an infinite sum over all environments. Second, it also involves an infinite sum over all possible rewards (the agent's life in each environment is infinite). Third, K() is not computable. Fourth, and most importantly, time is not taken into account. And fifth, there is a conflation between induction and prediction.
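Regarding the third problem, a standard practical workaround — sketched here as an assumption, not as the project's final choice — is to upper-bound K() with the output size of a real compressor, in the spirit of MML practice; finite environment samples and finite interaction horizons would likewise tame the two infinite sums:

```python
import random
import zlib

# K() is uncomputable, but a real compressor gives a computable
# upper bound on the complexity of a description.
def complexity_proxy(description: bytes) -> int:
    """Upper-bound K(m) by the compressed size of m's description."""
    return len(zlib.compress(description, 9))

random.seed(0)
regular = b"ab" * 500                                          # highly patterned
irregular = bytes(random.randrange(256) for _ in range(1000))  # noise-like
print(complexity_proxy(regular) < complexity_proxy(irregular))  # → True
```

Any such proxy inherits the compressor's blind spots (e.g. zlib misses many algorithmic regularities), which is one reason the choice of complexity approximation must be handled carefully in an actual test.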
In this work we propose a modification and extension of the previous definitions and tests in order to construct the first general, necessary, formal, feasible intelligence definition and test. The definition and test will be grounded on previous developments on tests based on Kolmogorov complexity [Dowe & Hajek 1997, 1998] [Hernandez-Orallo & Minaya-Collado 1998] [Hernandez-Orallo 2000a, 2000b] [Legg and Hutter 2007]. The main idea in all of them is that the universal distribution is used for the generation of questions and environments, and hence any particular bias towards a specific individual, species or culture is avoided. This makes the test universal for any possible kind of subject. But apart from this “universal” condition, we will focus on some additional practical requirements:
- It must allow one to measure any kind of intelligent system (biological or computational) which exists now or can be built in the future (any system).
- The test must quickly adapt to the level of intelligence and time scale of the system. It must be able to evaluate inept and brilliant systems (any intelligence level) as well as very slow to very fast systems (any time scale).
- The quality of the assessment will depend on the time allowed for the test. This means that the test can be interrupted at any time, producing an approximation of the intelligence score. The more time allowed for the test, the better the assessment (anytime test).
Because of these requirements we will call the tests "anytime universal intelligence tests". If we succeed, science will be able to measure the intelligence of higher animals (e.g. apes), humans, machines, hybrids or communities of humans and machines and, eventually, extraterrestrial beings, in an absolute way.
The main difficulty in making this feasible is that the more general we intend the test to be, the fewer things about the agent can be assumed. This means that we may require more time to evaluate intelligence, because we can rely neither on common knowledge nor on a common language to explain the instructions or take things for granted. The problem is similar to evaluating children or animals, where everything must be quite symbolic and simple, and subjects must be led toward a goal through rewards and penalties. That is why these tests are more interactive than traditional psychometric tests for adult Homo sapiens.
In fact, there is a common thesis in all intelligence tests (whether from psychometrics, comparative cognition or artificial intelligence): the time required to evaluate the intelligence of a subject depends (1) on the knowledge or features which are expected and known about the subject (same species, same culture, same language, etc.) and (2) on the adaptability of the examiner. While (1) is taken to the extreme in human psychometrics, where tests are just a form with questions explained in natural language, (2) is taken to the extreme in the Turing Test or any kind of interview evaluation, where the examiners are humans who dynamically adapt their questions according to the subject’s answers.
Consequently, if we want to evaluate different kinds of subjects without relying on any assumption about their nature or knowledge, and if we want to get a reliable assessment in a reduced (practical) period of time, we must necessarily devise interactive and adaptive tests.
Although the endeavour looks ambitious and bold, there are some facts and some reasonable hypotheses which strongly support at least attempting a first prototype of this test and a preliminary evaluation, in order to establish whether such tests are constructible.
§ We can only assume a few things about the subject to be examined. It will be an open, active system with some sensors and some actuators, and, additionally, there will be a way to guide its behaviour through rewards and penalties. All the subjects we contemplate (humans, animals, reactive machines with rewards) satisfy this, and we think there is no need to assume anything else about the physics, implementation or behaviour of the subject in order to evaluate it.
§ Interacting with a sample of environments under the universal distribution is sufficient to evaluate general intelligence. While this seems to measure only inductive abilities, many environments will also require memory and deductive abilities to succeed. This does not exclude the design of factorial tests, by posing specific constraints on the kind of environments. However, this factorisation is beyond the scope of this project.
§ A relatively small sample of environments can be used instead of all of them, in order to make the test as short as possible. The complexity of the environments can also be progressively adjusted to match the intelligence of the agent, and their rewards weighted accordingly, in order to make the test more effective.
§ Time can be included by allowing the agent to interact with a single environment for a fixed time. This time limit can progressively grow to make the test anytime. Rewards might be time-dependent so that the agent can realise that time is relevant.
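The adaptive loop suggested by the last three points can be sketched as follows. Here `run_episode` is a hypothetical stand-in for interacting with a sampled environment of a given complexity under a time limit, and the doubling schedule and success threshold are illustrative choices, not fixed by the proposal:

```python
# Minimal sketch of the anytime loop: difficulty and time-per-environment
# grow while the agent succeeds, and the test can be interrupted at any
# point, returning the estimate computed so far.
def anytime_test(run_episode, max_episodes=20):
    difficulty, time_limit = 1, 10
    history, estimate = [], 0.0
    for _ in range(max_episodes):
        reward = run_episode(difficulty, time_limit)  # reward in [0, 1]
        history.append((difficulty, reward))
        if reward > 0.5:                  # success: harder and longer
            difficulty += 1
            time_limit *= 2
        else:                             # failure: back off
            difficulty = max(1, difficulty - 1)
        estimate = sum(r for _, r in history) / len(history)
    return estimate, history

# Toy agent that copes with environments up to difficulty 3.
est, hist = anytime_test(lambda d, t: 1.0 if d <= 3 else 0.0, max_episodes=10)
print(round(est, 2))  # → 0.6
```

The point of keeping `estimate` updated on every episode is precisely the anytime property: stopping the loop early still yields a (coarser) intelligence estimate.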
Finally, it may seem difficult to imagine what one of these "anytime universal intelligence tests" might look like, but similar things are everywhere: children's games, for instance, where children start with easy ones and soon move on to more sophisticated ones. For an even closer parallel, the anytime intelligence test may be quite similar to a random collection of videogames where we first play for a short time with the easy ones and are then given more time to play with more difficult ones if we succeed on the easiest. As usual, easy levels can be passed quickly and difficult levels require more time. The intelligence of an agent would be some kind of average of the scores at all the levels, possibly using some kind of log-loss measure on the complexity of each level. The key point here is that these games or environments will be generated according to a distribution which is independent of (and hence neutral over) the different kinds of subjects (humans, machines, animals...) to be evaluated.
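One possible way to realise such a log-loss-weighted average — an illustrative assumption, not the project's final formula — is to weight each level's score by its complexity measured in bits, so that harder levels contribute more to the aggregate:

```python
import math

def aggregate_score(levels):
    """levels: list of (complexity, score) pairs with score in [0, 1].
    Each level is weighted by log2(1 + complexity), i.e. its
    description size in bits."""
    weights = [math.log2(1 + c) for c, _ in levels]
    return sum(w * s for w, (_, s) in zip(weights, levels)) / sum(weights)

# Four "videogame levels" of growing complexity; the agent does well
# on the easy ones and fails the hardest.
levels = [(1, 1.0), (3, 1.0), (7, 0.5), (15, 0.0)]
print(round(aggregate_score(levels), 3))  # → 0.45
```

Other weightings (e.g. the 2^(-complexity) prior from the Universal Intelligence measure) invert this emphasis, so the choice of aggregation is itself a design decision the project would have to settle empirically.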
The project spans one and a half years, starting in July 2009. It is divided into three stages of about six months each.
If the results of the project are successful, we will follow up with extensive publication of the results and will also start other, more traditional projects exploiting them, in order to enhance the tests, factorise them and make them publicly available to several scientific disciplines: artificial intelligence, psychometrics, comparative cognition, etc.
Eventually, getting back to the similarities between machine intelligence tests and psychometrics, the acceptance and use of these tests would allow at least the same applications that psychometrics has today in education, recruitment and therapy to be carried over to the world of artificial intelligence, where these areas are known as knowledge acquisition, agent cognitive certification and agent redesign.
[Herrmann et al 2007] Esther Herrmann, Josep Call, María Victoria Hernández-Lloreda, Brian Hare, Michael Tomasello "Humans Have Evolved Specialized Skills of Social Cognition: The Cultural Intelligence Hypothesis", Science, 7 September 2007, Vol. 317, no. 5843, pp. 1360-1366, DOI: 10.1126/science.1146282
[Hernandez-Orallo & Minaya-Collado 1998] Hernández-Orallo, José; Minaya-Collado, N.: A Formal Definition of Intelligence Based on an Intensional Variant of Kolmogorov Complexity, Proceedings of the International Symposium of Engineering of Intelligent Systems (EIS'98) ICSC Press 1998, pp. 146-163
[Hernandez-Orallo 2000a] José Hernández-Orallo: Beyond the Turing Test. Journal of Logic, Language and Information 9(4): 447-466 (2000)
[Hernandez-Orallo 2000b] Hernández-Orallo, José "On the Computational Measurement of Intelligence Factors", in A. Meystel (ed.), Performance Metrics for Intelligent Systems Workshop, National Institute of Standards and Technology, Gaithersburg, MD, USA, August 14-16, 2000, pp. 1-8, section XXI.
[Legg and Hutter 2007] Legg, Shane; Hutter, Marcus "Universal Intelligence: A Definition of Machine Intelligence", Minds and Machines, vol. 17, no. 4, pp. 391-444, November 2007. http://www.vetta.org/documents/UniversalIntelligence.pdf
[Legg 2008] Legg, Shane "Machine Super Intelligence", PhD Thesis, Department of Informatics, University of Lugano, June 2008. http://www.vetta.org/documents/Machine_Super_Intelligence.pdf
[Levin 73] Levin, L.A. "Universal search problems" Problems Inform. Transmission, 9:265-266, 1973.
[Li & Vitányi 2008] Li, Ming; Vitányi, Paul "An Introduction to Kolmogorov Complexity and its Applications" 3rd Edition. Springer-Verlag, New York, 2008.
[Martínez-Arias et al 2006] Martínez-Arias, M.Rosario; Hernández-Lloreda, M.Victoria; Hernández-Lloreda, M.José. “Psicometría”, Alianza Editorial 2006.
[Solomonoff 1964] Solomonoff, R.J. "A Formal Theory of Inductive Inference", Information and Control, vol. 7, pp. 1-22, March 1964; also pp. 224-254, June 1964.
[Solomonoff 1986] Solomonoff, R.J. "The Application of Algorithmic Probability to Problems in Artificial Intelligence", in L.N. Kanal and J.F. Lemmer (eds.), Uncertainty in Artificial Intelligence, Elsevier Science, pp. 473-491, 1986.
[Spearman 1904] Spearman, C. “ ‘General Intelligence’ objectively determined and measured” American Journal of Psychology 15, 201-293, 1904.
[Turing 1950] Turing, A.M. "Computing Machinery and Intelligence", Mind, vol. 59, pp. 433-460, 1950. Reprinted in Minds and Machines, edited by Alan Ross Anderson, Englewood Cliffs, N.J., Prentice Hall, 1964.
[von Ahn et al 2002] von Ahn, L.; Blum, M.; Langford, J. “Telling Humans and Computers Apart (Automatically) or How Lazy Cryptographers do AI” Communications of the ACM, 2002. www.captcha.net.
[von Ahn et al 2008] von Ahn, L.; Maurer, B.; McMillen, C.; Blum, M. "reCAPTCHA: Human-Based Character Recognition via Web Security Measures" Science, Vol. 321, September 12th, 2008.
[Wallace 2005] Wallace, C.S. "Statistical and Inductive Inference by Minimum Message Length", Information Science and Statistics, Springer Verlag, May 2005.
[Wallace and Boulton 1968] Wallace, C.S. and Boulton, D. M. "An Information Measure for Classification" Computer Journal, 11, 2, pp. 185-195, 1968.
[Wallace and Dowe 1999] Wallace, C.S.; Dowe, D.L. "Minimum Message Length and Kolmogorov Complexity", Computer Journal, vol. 42, no. 4, pp. 270-283, 1999.
 For instance, JAntiCaptcha is a tool developed by JDTeam and included in JDownloader. It allows cracking CAPTCHAs from the most popular download sites (e.g. MegaUpload, RapidShare).
 The universal distribution is defined over a particular reference machine, but this choice only affects Kolmogorov complexity by a constant. As such, we take the liberty of saying "the" universal distribution.
 More precisely, it is really "Solomonoff prediction", but we show it as it appears in the original table.
 Sensors may be eyes and ears, a camera and a microphone, a communication-listening e-service, etc. Actuators may be hands, a muzzle, fins, a robotic pincer, an e-service invocation software module, etc.