arXiv:0708.1362v2 [cond-mat.stat-mech] 23 Oct 2008

Physical limits of inference

David H. Wolpert

MS 269-1, NASA Ames Research Center, Moffett Field, CA 94035, USA

Abstract

I show that physical devices that perform observation, prediction, or recollection share an underlying mathematical

structure. I call devices with that structure “inference devices”. I present a set of existence and

impossibility results concerning inference devices. These results hold independent of the precise physical

laws governing our universe. In a limited sense, the impossibility results establish that Laplace was wrong

to claim that even in a classical, non-chaotic universe the future can be unerringly predicted, given sufficient

knowledge of the present. Alternatively, these impossibility results can be viewed as a non-quantum mechanical

“uncertainty principle”. Next I explore the close connections between the mathematics of inference

devices and of Turing Machines. In particular, the impossibility results for inference devices are similar to

the Halting theorem for TM’s. Furthermore, one can define an analog of Universal TM’s (UTM’s) for inference

devices. I call those analogs “strong inference devices”. I use strong inference devices to define the

“inference complexity” of an inference task, which is the analog of the Kolmogorov complexity of computing

a string. However no universe can contain more than one strong inference device. So whereas the

Kolmogorov complexity of a string is arbitrary up to specification of the UTM, there is no such arbitrariness

in the inference complexity of an inference task. I end by discussing the philosophical implications of these

results, e.g., for whether the universe “is” a computer.

Key words: Turing machine, automata, observation, prediction, multiverse, Kolmogorov complexity

PACS: 03.65.Ta, 89.20.Ff, 02.70.-c, 07.05.Tp, 89.70.Eg, 01.70.+w

1. Introduction

Some of the most fruitful investigations of the foundations of physics began by identifying a

set of features that are present in all physical realizations of a particular type of information processing.

The next step in these investigations was to abstract and formalize those shared features.

Once that was done, one could explore the mathematical properties of those features, and thereby

Email address: dhw@ptolemy.arc.nasa.gov (David H. Wolpert).

URL: ti.arc.nasa.gov/people/dhw (David H. Wolpert).

Preprint submitted to Elsevier 23 October 2008

analyze some aspects of the relationship between physics and information processing. Examples

of such investigations include the many decades of work on the relationship between physics

and computation [11,12,13,14,15,16,17,18,19,20,21,22,23,22,24], the work on observation that

started with Everett’s seminal paper [25], and more recent work that considers what possible

forms physical reality might have [26,27,28,29,30,31,32,33,34,35,36].

In this spirit, here we first present archetypal examples of physical devices that perform observation,

of physical devices that perform prediction, and of physical devices that perform recollection.

We then identify a set of features common to those examples. This is our first contribution,

that such physical devices share those features.

Next we formalize those features, defining any device possessing them to be an “inference

device”. To do this requires our second contribution: a formalization of the concept of semantic

information content. 1 Loosely speaking, we define the semantic information content of a variable

s concerning a variable r to be what an external scientist can infer about what the value of r

is in their particular universe by knowing the state of s. Note the central role in this definition of

the scientist external to the device. As discussed below, in the context of using inference devices

for observation, this central role of the external scientist is in some ways more consistent with

Wigner’s view of the observation process than with the many-worlds view of that process.

For the remainder of the paper we develop the theory of inference devices, thereby analyzing

numerous aspects of the relationship between physics and information processing. Our goal in

this endeavor is to illustrate the breadth of the theory of inference devices; an exhaustive analysis

of any one aspect of that theory is beyond what can fit into this single paper.

A recurring theme in our analysis of inference devices is their relationship with Turing Machines

(TM’s). In particular, there are impossibility results for inference devices that are similar

to the Halting theorem for TM’s. Furthermore, one can define an analog of Universal TM’s

(UTM’s) for inference devices. We call those analogs “strong inference devices”.

A central result of this paper is how to use strong inference devices to define the “inference

complexity” of an inference task, which is the analog of the Kolmogorov complexity of computing

a string. A task-independent bound is derived on how much the inference complexity of

an inference task can differ for two different inference devices. This is analogous to the “encoding”

bound governing how much the Kolmogorov complexity of a string can differ between two

UTM’s used to compute that string. However no universe can contain more than one strong inference

device. So whereas the Kolmogorov complexity of a string is arbitrary up to specification

of the UTM, there is no such arbitrariness in the inference complexity of an inference task.

After presenting inference complexity, we informally discuss the philosophical implications

of all of our results to that point. In particular, we discuss what it might mean for the universe

to “be” a computer. We also show how much of philosophy can be reduced to constraint satisfaction

problems, potentially involving infinite-dimensional spaces. We follow this discussion by

deriving some graph-theoretic properties governing the possible inference relationships among

any set of multiple inference devices in the same universe.

Our next contribution is an extension of the inference devices framework to include physical

devices that are used for control. Associated impossibility results provide fundamental limits on

the capabilities of physical control systems. After this we present an extension of the framework

to probabilistic inference devices. Of all the results in this paper, it is the impossibility results

concerning probabilistic inference devices that are the most similar to quantum mechanical impossibility results. We end by presenting an extension of the framework that clarifies its relation with semantic information.

1 In contrast to the concept of syntactic information content, whose formalization by Shannon is the basis of conventional information theory [37].

The crucial property underlying our results is that inference devices are embodied in the very

physical system (namely the universe) about which they are making inferences. This embedding

property and its consequences have nothing to do with the precise laws governing the underlying

universe. In particular, those consequences do not involve chaotic dynamics as in [17,18],

nor quantum mechanical indeterminism. Similarly, they apply independent of the values of any

physical constants (in contrast, for example, to the work in [12]), and more generally apply to

every universe in a multiverse. Nor do the results presume limitations on where in the Chomsky

hierarchy an inference device lies. So for example they would apply to oracles, if there can be

oracles in our universe. In the limited sense of our impossibility results, Laplace was wrong to

claim that even in a classical, non-chaotic universe the future can be unerringly predicted, given

sufficient knowledge of the present [38]. Alternatively, these impossibility results can be viewed

as a non-quantum mechanical “uncertainty principle”.

All non-trivial proofs are in App. A. An earlier analysis addressing some of the issues considered

in this paper can be found in [26].

1.1. Notation

We will take the set of binary numbers B to equal {−1, 1}, so that logical negation is indicated by the minus sign. We will also take Θ to be the Heaviside theta function that equals 1 if its argument is non-negative, 0 otherwise. N is the set of natural numbers, 1, 2, . . .. For any function Γ with domain U, we will write the image of U under Γ as Γ(U). For any function Γ with domain U that we will consider, we implicitly assume that Γ(U) contains at least two distinct elements. For any (potentially infinite) set W, |W| is the cardinality of W. For any real number a ∈ R, ⌈a⌉ is the smallest integer greater than or equal to a. Given two functions Γ1 and Γ2 with the same domain U, we write Γ1 ⊗ Γ2 for the function with domain U obeying u ∈ U ↦ (Γ1(u), Γ2(u)), and with some abuse of terminology refer to this as the “product” of Γ1 and Γ2.

Given a function Γ with domain U, we say that the partition induced by Γ is the family of subsets {Γ−1(γ) : γ ∈ Γ(U)}. Intuitively, it is the family of subsets of U each of which consists of all elements having the same image under Γ. We will say that a partition A over a space U is a fine-graining of a partition B over U (or equivalently that B is a coarse-graining of A) iff every a ∈ A is a subset of some b ∈ B. Two partitions A and B are fine-grainings of each other iff A = B. Say a partition A is finite and a fine-graining of a partition B. Then |A| = |B| iff A = B.

Given a probability measure, the mutual information between two associated random variables a, b conditioned on event c is written M(a, b | c). The Shannon entropy of random variable a is H(a).
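As a concrete illustration of the partition machinery above, the following sketch (not from the paper; the helper names `induced_partition` and `is_fine_graining` are our own) builds the partition induced by a function over a small finite U and checks the fine-graining relation:

```python
# Sketch of the "partition induced by a function" and the fine-graining
# relation from the Notation section, over a small finite U.

def induced_partition(gamma, U):
    """Family of subsets of U on which gamma is constant: {gamma^-1(g) : g in gamma(U)}."""
    blocks = {}
    for u in U:
        blocks.setdefault(gamma(u), set()).add(u)
    return [frozenset(b) for b in blocks.values()]

def is_fine_graining(A, B):
    """True iff every block a in A is a subset of some block b in B."""
    return all(any(a <= b for b in B) for a in A)

U = range(8)
A = induced_partition(lambda u: u % 4, U)   # finer: 4 blocks
B = induced_partition(lambda u: u % 2, U)   # coarser: 2 blocks

assert is_fine_graining(A, B)        # A refines B
assert not is_fine_graining(B, A)    # but not vice versa
# Two partitions are fine-grainings of each other iff they are equal:
assert is_fine_graining(A, A) and is_fine_graining(B, B)
```

Note that the blocks of the induced partition are exactly the preimages Γ⁻¹(γ), so a fine-graining of that partition corresponds to a function carrying at least as much information about u as Γ does.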

2. Archetypal examples

We now illustrate that many (if not all) physical realizations of the processes of observation,

prediction, and memory share a certain mathematical structure. We do this by semi-formally

describing each of those processes, one after the other. Each such description uses language

that is purposely very similar to the other descriptions. It is that very similarity of language that

demonstrates that the same mathematical structure arises as part of each of the processes. In the


following sections of this paper we will formalize that mathematical structure, and then present

our formal results concerning it. 2

If the reader becomes convinced of this shared mathematical structure before reading through

all the examples, (s)he is encouraged to skip to the next section. It is in that section that we

formalize the shared mathematical structure, as an “inference device”.

In all of the examples in this section, U is the space of all worldlines of the entire universe

that are consistent with the laws of physics (whatever they may be), and u indicates an element

of U. 3

Example 1: We start by describing a physical system that is a general-purpose observation device,

capable of observing different aspects of the universe. Let S be some particular variable

concerning the universe whose value at some time t2 we want our device to observe. If the universe’s

worldline is u, then the value of S at t2 is given by some function of u (e.g., it could be

given by a component of u). Write that function as Γ; S (t2) = Γ(u).

The observation device consists of two parts: an observation apparatus, and a scientist who

uses (and interprets) that apparatus. To make our observation, the scientist must first configure

the observation apparatus to be in some appropriate state at some time t1 < t2. Since the apparatus exists in the universe, the state imposed on it at t1 is given by a function of the entire universe’s worldline u; write that function as χ, with range χ(U). The hope is that if the apparatus is configured properly, then at some time t3 > t2, the output display of the observation apparatus accurately reflects S (t2). Again, that output display exists in the universe. So its state at t3 is a function of u; write that function as ζ.

The scientist reads the output of the apparatus and interprets that output as this attempted

observation of S (t2). It is this interpretation that imbues that output with semantic information.

Without such interpretation the output is just a meaningless (!) pattern, one that happens to be

physically coupled with the variable being observed. (As an extreme example of such meaningless

coupling, if a tree falls in a forest, but the video that recorded the fall is encrypted in a way

that the scientist cannot undo, then the scientist does not “observe” that the tree fell by watching

the video.)

To formalize what such interpretation means, we must define “semantic information”. As mentioned

above, we want the semantic information of a variable s concerning a variable r to be what

an external scientist can infer about r by knowing the state of s. In the current example this means

we require that the scientist can ask questions of the sort, “Does S (t2) = K?” at t3, and that ζ(u)

provides the scientist with (possibly erroneous) answers to such questions. As an example, say

that ζ(u) is a display presenting integers from 0 to 1000, inclusive, with a special ’error’ symbol

for integers outside of that range. Since the scientist interprets the value on that display at t3 as

the outcome of the observation of S (t2), by looking at the display at t3 the scientist is provided with (possibly erroneous) answers to the question “Does S (t2) = K?” for all 1001 values of K that can be on the display.

2 Some might quibble that one or another of these examples should involve additional structure, that what is presented in that example does not fully capture the physical processes it claims to describe. (See App. B.) The important point is that the structure presented in these examples is always found in real-world instances of the associated physical processes. Whether or not there is additional structure that “should” be assumed is not relevant. The structure that is assumed in the examples is sufficient to establish our formal results.

3 For expository simplicity we use the language of non-quantum mechanical systems in this paper. However most of what follows holds just as well for a quantum-mechanical universe, if we interpret quantum mechanics appropriately.

To make this more precise, first note that any question like “Does S (t2) = K?” can either be

answered ’yes’ or ’no’, and therefore is a binary function of u. For every K, write this associated

binary function of u as qK; ∀K, ∀u ∈ U, qK(u) = 1 if S (t2) = Γ(u) = K, and it equals −1 otherwise.

Next, note that the brain of the scientist exists in the universe. So which (if any) of a set of such

possible binary questions concerning the universe the scientist is asking at t3 is also a function

of u. We write that function as Q. In particular, we presume that any question qK is one of the

elements in the range of Q, i.e., it is one of the questions that (depending on the state of the

scientist’s brain then) the scientist might be asking at t3.

Now for any particular question qK the scientist might be asking at t3, the answer that the

scientist provides by interpreting the apparatus’ output is a bit. The value of that bit is specified

by the state of the scientist’s brain at t3. (The premise being that the state of the scientist’s brain

was affected by the scientist’s reading and then interpreting the apparatus’ output.) So again,

since the scientist’s brain exists in the universe, the value of that answer bit is a function of u. We

write that function as Y.

It is the combination of Q and Y that comprises the scientist’s “interpretation” of ζ, and thereby imbues any particular ζ(u) with semantic content. Q(u) specifies a question qK. ζ(u) then causes Y(u) to have some associated value. We take that value to be (the scientist’s interpretation of) the apparatus’ answer to the question of whether qK(u) = 1 or qK(u) = −1 (i.e., of whether S (t2) = K). Combining, ζ(u) causes Y(u) to have a value that we take to be (the scientist’s interpretation of) the apparatus’ answer to whether [Q(u)](u) = 1 or [Q(u)](u) = −1.

This scenario provides a set of requirements for what it means for the combination of the

observation apparatus and the scientist using that apparatus to be able to successfully observe the

state of S at t2: First, we require that the scientist can configure the apparatus in such a way that

its output at t3 gives Γ(u). We also require that the scientist can read and interpret that output. This means at a minimum that for any question of the form “Does Γ(u) = K?” the scientist can both ask that question at t3 and interpret ζ(u) to accurately answer it.

To make this fully formal, we introduce a set of binary functions with domain Γ(U): ∀K, fK : γ → 1 iff γ = K. Note that we have one such function for every K ∈ Γ(U). Our requirement

for successful observation is that the observation apparatus can be configured so that, for any fK,

if the scientist were to consider an associated binary question at t3 and interpret ζ(u) to answer

the question, then the scientist’s answer would necessarily equal fK(��(u)). In other words, there

is a value c ∈ χ(U) such that for any K ∈ Γ(U), there is an associated qK ∈ Q(U) such that the combination of χ(u) = c and Q(u) = qK implies that Y(u) = fK(Γ(u)).

Intuitively, for the scientist to use the apparatus to “observe S (t2)” only means the scientist

must configure the apparatus appropriately; the scientist must force the universe to have a worldline

u such that χ(u) = c, and that must in turn cause ζ(u) to accurately give Γ(u). In particular,

to “observe S (t2)” does not require that the scientist impose any particular value on Q(u). Rather

Q’s role is to provide a way to interpret ζ(u). The only requirement made of Q is that if the scientist

were to ask a question like “Does S (t2) equal K?”, then Q(u) — determined by the state

of the scientist’s brain at t3 — would equal that question, and the scientist’s answer Y(u) would

be appropriately set by ζ(u). It is by using Q this way that we formalize the notion that ζ(u)

conveys information to the scientist concerning S (t2). The “observation is successful” if for any

such question the scientist might pose (as reflected in Q(u)), their associated answer (as reflected

in Y(u)) properly matches the state of S at t2.

We can motivate this use of Q in a less nuanced, more direct way. Consider a scenario where


the scientist cannot both pose all binary-valued questions fK concerning S (t2) and correctly answer

them using the apparatus output, ζ(u). It would seem hard to justify the view that in this

scenario the combination of the scientist with the apparatus makes a “successful observation”

concerning S (t2).

Note that by defining an observation device as the combination of an observation apparatus

with the external scientist who is using that apparatus, we are in a certain sense arriving at

a Wignerian approach to observation. In contrast to a more straight-forward many-worlds approach,

we require that the state of the observation apparatus not just be correlated with the

variable being observed, but in fact contain semantic information concerning the variable being

observed. This makes the external scientist using the observation apparatus crucial in our

approach, in contrast to the case with the many-worlds approach.

Example 2: This example is a slight variant of Ex. 1. In this variant, there is no scientist, just

“inanimate” pieces of hardware.

We change the apparatus of Ex. 1 slightly. First, we make the output ζ be binary-valued. We also change the configuration function χ, so that in addition to its previous duties, it also specifies a question of the form, “Does Γ(u) equal K?”. Then observation is successful if for any K ∈ Γ(U), the apparatus can be configured appropriately, so that its output correctly answers the question of whether S (t2) equals K. In other words, observation is successful if for any K ∈ Γ(U) there is an associated c ∈ χ(U) such that having χ(u) = c implies that Y(u) = fK(Γ(u)).

Example 3: We now describe a physical system that is a general-purpose prediction device,

capable of correctly predicting different aspects of the universe’s future. Let S be some particular

variable concerning the universe whose value at some time t2 we want our device to predict. If

the universe’s worldline is u, then the value of S at t2 is given by some function of u which we

write as ��; S (t2) = ��(u).

The prediction device consists of two parts, a physical computer, and a scientist who programs

that computer to make the prediction and interprets the computer’s output as that prediction. To

“program the computer” means that the scientist initializes it at some time t1 < t2, giving it information about the state of the universe together with a simulation program to run. Since the computer exists in the universe, its state at t1 is given by a function of the entire universe’s worldline u; write that function as χ, with range χ(U). The hope is that the computer’s processing successfully completes at some time t3 > t1, and at that time displays a correct prediction of S (t2) on its output. (In general we would like to also have t3 < t2, so that the prediction is produced before the predicted event.) As in Ex. 1, the scientist interprets the computer’s output by posing binary questions of the form “Does S (t2) = K?”, so the formal requirement for successful prediction parallels the requirement for successful observation.

Example 4: Finally, we describe a physical system that is a general-purpose recollection device, capable of accurately recalling different aspects of the universe’s past. Let S be some particular variable concerning the universe whose value at some time t2 we want our device to recall; as before, write S (t2) = Γ(u). The recollection device consists of two parts: a recording apparatus, and a scientist who queries that apparatus and interprets its output. To make the recollection, the scientist must impose an appropriate state on some variable concerning the apparatus at a time t1 > t2.

(The idea is that by changing how the apparatus is queried, the scientist can change what aspect

of the universe’s past the apparatus displays to the scientist.) That state imposed on the variable

concerning the apparatus at t1 is also given by a function of the entire universe’s worldline u,

since the apparatus exists in the universe. Write that function as χ, with range χ(U).

The hope is that if the apparatus functions properly and is properly queried at t1, then it retrieves

an accurate recording of S (t2), and displays that recording on its output at some time

t3 > t1. Again, that output display of the apparatus exists in the universe. So its state at t3 is a


function of u; write that function as ζ.

The scientist reads the output of the apparatus and interprets it as this recording of S (t2),

thereby imbuing that output with semantic meaning. More precisely, for the value ζ(u) to convey

information to the scientist at t3, we require that the scientist can ask questions of the sort, “Does

S (t2) = K?” at t3, and that ζ(u) provides the scientist with (possibly erroneous) answers to such

questions.

As in Ex. 1, to make this more formal, we note that any such question is a binary function of u,

of the sort qK presented in Ex. 1. Also as in Ex. 1, the brain of the scientist exists in the universe.

So which (if any) of a set of possible questions concerning the universe the scientist is asking at

t3 is also a function of u, which we again write as Q. Also as in Ex. 1, the answer of the scientist

to any such question is a bit that the scientist generates by interpreting ζ(u). Since that answer is

given by the state of the scientist’s brain at t3, it is a function of u, which as before we write as Y.

So for the combination of the apparatus and the scientist using that apparatus to be able to

successfully record and recall the state of S at t2 means two things: First, we require that the

scientist can query the apparatus in such a way that its output at t3 gives Γ(u). We also require that the scientist can read and interpret that output. More precisely, our requirement for successful recording and recollection is that the apparatus can be queried so that, for any fK, if the scientist were to consider an associated binary question at t3 and interpret ζ(u) to answer the question, then the scientist’s answer would necessarily equal fK(Γ(u)). In other words, there is a value c ∈ χ(U) such that for any K ∈ Γ(U), there is an associated qK ∈ Q(U) such that the combination of χ(u) = c and Q(u) = qK implies that Y(u) = fK(Γ(u)).

Just as in Ex. 1, for the scientist to use the apparatus to “recall S (t2)” only means the scientist

must query the apparatus appropriately; the scientist must force the universe to have a worldline

u such that χ(u) = c, and that must in turn cause ζ(u) to accurately give Γ(u). In particular, to

“recall S (t2)” does not require that the scientist impose any particular value on Q(u). As before,

Q’s role is to provide a way to interpret ζ(u).

Note that nothing in this example specifies how the recording process operates. This is just like

how nothing in Ex. 1 specifies how the observation apparatus couples with S, and how nothing

in Ex. 3 specifies what simulation the computer runs.

See [39,11,30] for discussion about the crucial role that recollection devices play in the psychological

arrow of time, and of the crucial dependence of such devices on the second law of

thermodynamics. As a result of their playing such a role, the limitations on recollection devices

derived below have direct implications for the psychological and thermodynamic arrows of time.

Just as Ex. 2 varies Ex. 1 by removing the scientist, so Ex.’s 3 and 4 can be varied to remove

the scientist.

3. Basic concepts

In this section we first formalize the mathematical structure that is shared among Ex.’s 1-4 of

Sec. 2. In doing so we substantially simplify that structure. After this formalization of the shared

structure in the examples we present some elementary results concerning that structure.


3.1. Inference devices

Definition 1: An (inference) device over a set U is a pair of functions (X, Y), both with domain

U. Y is called the conclusion function of the device, and is surjective onto B. X is called the

setup function of the device.

As an illustration, in all of Ex.’s 1-4, the setup function is the composite function (χ, Q), and

the conclusion function is Y. The value of X(u) can loosely be interpreted as how the device is

“initialized / configured”. 4 The value of Y(u) should instead be viewed as all that the device

predicts / observes / recollects when it is done. A priori, we assume nothing about how X and Y

are related. Note that we do not require that the compound map (X, Y) : u ∈ U → (X, Y)(u) be

surjective. There can be pairs of values x ∈ X(U), y ∈ Y(U) that never arise for the same u.

Given some function Γ with domain U and some γ ∈ Γ(U), we are interested in setting up a device so that it is assured of correctly answering whether Γ(u) = γ for the actual universe u. Loosely speaking, we will formalize this with the condition that Y(u) = 1 iff Γ(u) = γ for all u that are consistent with some associated setup value of the device, i.e., for all u such that X(u) = x. If this condition holds, then setting up the device to have setup value x guarantees that the device will make the correct conclusion concerning whether Γ(u) = γ. (Hence the terms “setup function” and “conclusion function” in Def. 1.)

and “conclusion function” in Def. 1.)

Note that this desired relationship between X, Y and Γ can hold even if X(u) = x doesn’t

fix a unique value for Y(u). Such non-uniqueness is typical when the device is being used for

observation. Setting up a device to observe a variable outside of that device restricts the set of

possible universes; only those u are allowed that are consistent with the observation device being

set up that way to make the desired observation. But typically just setting up an observation

device to observe what value a variable has doesn’t uniquely fix the value of that variable.

In general we will want to predict / observe / recollect a function Γ that can take on more than two values. This is done by appropriately choosing X(u). As mentioned, X(u) specifies what is known about the outside world together with a simulation program (in the case of computer-based prediction), or a specification of how to set up an observation apparatus (in the case of observation), or a specification of what to remember (in the case of a memory device). But in addition, in all those cases X(u) specifies one of the possible values of Γ(u) (i.e., it specifies a question of the form “Does Γ(u) = γ?”). We then view the device’s conclusion bit as saying whether Γ(u) does / doesn’t have that specified value. So for example if our device is a computer

being used to predict the value of some variable concerning the state of the world, then formally

speaking, the setup of the computer specifies a particular one of the possible values of that variable

(in addition to specifying other information like what simulation to run, what is known about

the outside world, etc.). Our hope is that the computer’s conclusion bit correctly answers whether

the variable has that value specified in how the computer is set up.

Intuitively, this amounts to using a unary representation of Γ(U). To formalize this with minimal

notation, we will use the following shorthand:

Definition 2: Let A be a set having at least two elements. A probe of A is a mapping from A onto

B that equals 1 for one and only one argument a ∈ A.

4 Care should be taken with this interpretation though. For example, in Ex. 1, χ concerns the state of u at time t1, and Q

concerns the state of u at t3. So X “straddles multiple times”.


So a probe of A is a function that picks out a single one of A’s possible values, i.e., it is a

Kronecker delta function whose second argument is fixed, and whose image value 0 is replaced

by -1.
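The description above translates directly into code. The following minimal sketch (our own, not from the paper; the helper name `make_probe` is an assumption) constructs a probe of a set as in Def. 2:

```python
# A probe of a set A (Def. 2): a map from A onto {-1, 1} that equals 1 for
# exactly one element -- a Kronecker delta with its second argument fixed,
# and with the image value 0 replaced by -1.

def make_probe(a_star):
    """Return the probe that picks out the single element a_star."""
    return lambda a: 1 if a == a_star else -1

A = {"red", "green", "blue"}
f = make_probe("green")

assert f("green") == 1
assert f("red") == -1 and f("blue") == -1
# There is exactly one probe per element of A, so |pi(A)| = |A|:
assert sum(make_probe(x)(y) == 1 for x in A for y in A) == len(A)
```

Since a probe answers a single membership question about A, the set of all probes π(A) is in bijection with A itself, which is what makes probes a unary (one-bit-per-value) representation.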

3.2. Notation for inference devices

We now have the tools to define what it means for an inference device to successfully observe

/ predict / recall. Before presenting that definition we introduce some useful notation.

Unless specified otherwise, a device written as “Ci” for any integer i is implicitly presumed

to have domain U, with setup function Xi and conclusion function Yi (and similarly for no subscript).

Similarly, unless specified otherwise, expressions like “min_xi” mean min over xi ∈ Xi(U).

We define a probe of a device to be a probe of the image of the device’s conclusion function.

Given a function Γ with domain U and a probe f of Γ(U), we write f(Γ) as shorthand for the function u ∈ U ↦ f(Γ(u)). We write π(A) to indicate the set of all probes of a set A, and π(Γ) to indicate the set of functions over U, {f(Γ) : f ∈ π(Γ(U))}.

Probes are a shorthand way of posing queries concerning membership in a set (e.g., queries

like “is it true that u ∈ Y−1(y) for some particular value y?”). All such queries are binary-valued

(which is why the range of probes is B). So couching the analysis in terms of probes essentially

amounts to representing all associated spaces in terms of bits. This has the advantage that it

allows us to avoid considering the ranges of any functions that arise in the analysis. In particular,

it allows us to avoid concern for whether one such range “matches up” with the domains and/or

ranges of other functions. For example, it allows us to avoid concern for such matching between

the spaces defining two different inference devices when considering whether they infer each

other. (See [26] for a more elaborate way of circumventing the need of those ranges to match.)

Say we are given a set of functions over U, {D1, d1, D2, d2, . . . , E1, e1, E2, e2, . . .}. Then with some abuse of terminology, we write “D1 = d1, D2 = d2, . . . ⇒ E1 = e1, E2 = e2, . . .” as shorthand for “∃ u ∈ U such that D1(u) = d1(u), D2(u) = d2(u), . . ., and ∀ u ∈ U such that D1(u) = d1(u), D2(u) = d2(u), . . ., it is the case that E1(u) = e1(u), E2(u) = e2(u), . . .”. We will often abuse notation even further by allowing d1 to be an element of D1’s range. In this case, “D1 = d1 ⇒ E1 = e1” is shorthand for “∃ u ∈ U such that D1(u) = d1, and ∀ u ∈ U such that D1(u) = d1, it is also the case that E1(u) = e1(u)”.
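The “⇒” shorthand has two parts: the left-hand conditions must be satisfiable, and every u satisfying them must satisfy the right-hand conditions. A minimal sketch over a finite toy U (our own construction; the name `implies` and the use of range elements rather than functions for the d’s follow the paper’s “further abuse” of notation):

```python
# The shorthand "D1 = d1, ... => E1 = e1, ..." rendered as a predicate over
# a finite U. lhs and rhs are lists of (function, value) pairs.

def implies(U, lhs, rhs):
    """True iff some u satisfies lhs, and every such u also satisfies rhs."""
    satisfying = [u for u in U if all(D(u) == d for D, d in lhs)]
    return bool(satisfying) and all(
        all(E(u) == e for E, e in rhs) for u in satisfying
    )

U = range(6)
X = lambda u: u % 2
Y = lambda u: 1 if u % 2 == 0 else -1

assert implies(U, [(X, 0)], [(Y, 1)])       # X = 0 forces Y = 1
assert not implies(U, [(X, 1)], [(Y, 1)])   # X = 1 does not
```

The existence clause matters: a vacuously true implication (no u satisfies the left-hand side) does not count, which is why Def. 3 below requires that a suitable setup value actually be achievable.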

3.3. Weak inference

We can now formalize inference as follows:

Definition 3: A device C (weakly) infers a function Γ over U iff ∀ f ∈ π(Γ), ∃ x such that X = x ⇒ Y = f(Γ).

So using the definitions in the previous subsection, C weakly infers Γ iff ∀ f ∈ π(Γ), ∃ x ∈ X(U) such that for all u ∈ U for which X(u) = x, Y(u) = f(Γ(u)).

Recall our stipulation that all functions over U take on at least two values, and so in particular Γ must. Therefore π(Γ) is non-empty. We will write C > Γ if C infers Γ. Expanding our shorthand notation, C > Γ means that for all γ ∈ Γ(U), ∃ x ∈ X(U) with the following property: ∀ u ∈ U such that X(u) = x, it must be that Y(u) = fγ(Γ(u)), where fγ : Γ(U) → B is the probe of Γ’s range that equals 1 iff its argument equals γ.


Intuitively, to have C > Γ means that if the image value of Γ is expressed as a list of answers to questions of the form “Does Γ(u) = γ?”, then we can set up the device so that it is guaranteed to correctly conclude any particular answer in that list. Alternatively, the requirement that there be an appropriate x for any probe function of Γ can be viewed as shorthand; in the definition of inference we are considering the ability of a device to correctly answer any member of a list of binary-valued questions, a set that is “generated” by Γ. So weak inference is a worst-case definition: if a device C weakly infers Γ, then no matter what probe f ∈ π(Γ) a malicious demon might choose, the scientist could guarantee that Y = f(Γ) by choosing an associated value x for the value of X.

To illustrate this, consider again Ex. 1. Identify the Y in Def. 3 with the Y in Ex. 1, and similarly identify the Γ’s with each other. Then identify the function X in Def. 3 as the product of functions, χ ⊗ Q. (X, Y) specifies a device C. The functions fK in Ex. 1 are the probes in π(Γ). So if C > Γ, then the aggregate system of scientist and observation apparatus can observe S (t2). Note that ζ ends up being irrelevant. In essence, it serves only as a conduit to transfer information into the scientist’s brain.

In the many-worlds definition of an observation, any particular result of the observation is

identified with a solitary worldline u. Intuitively, this might be worrisome; a solitary u is just a

single point in a space, with no intrinsic mathematical structure. The properties of such a single

point can be drastically modified by an appropriate isomorphism over U. In particular, as has

been pointed out by many authors, in the many-worlds definition what gets “observed” can be

modified if one changes the basis of U. (This is one of the major motivations for the work on

decoherence [40,41].)

However if a scientist makes an observation, then that scientist could provide the value of

any (binary-valued) function of the result of the observation, if they were asked to. So formally

requiring that the scientist be able to provide such values doesn’t preclude real-world instances

of observation. At the same time, adding such a requirement has substantial consequences. In

fact, it drives many of the results presented below concerning weak inference. This is why this

requirement is incorporated into the definition of weak inference. In other words, it is why the

definition of weak inference inherently involves multiple worldlines u, in contrast to the many-worlds definition of observation.

See Sec. 6.2 for a discussion of the philosophical aspects of weak inference. The relation

between weak inference and the theory of knowledge functions [42,43,44,45] is briefly discussed

in Sec. 9. App. B contains a discussion of how unrestrictive the definition of weak inference is.

Finally, some alternative definitions of devices and weak inference are considered in App. C.

3.4. Elementary results concerning weak inference

We say that a device C1 infers a set of functions if it infers every function in that set. We also

say C1 infers a device C2 iff C1 > Y2. In general inference among devices is non-transitive. In

addition we have the following elementary properties of devices:

Proposition 1: Let {Γi} be a set of functions with domain U and W ⊂ U.

i) If ∀ i, |Γi(W)| ≥ 2, then there is a device over U that infers {Γi}.

ii) For any device C, there is a binary-valued function that C does not infer.

Prop. 1(ii) means in particular that there are sets {Γi} such that no device can infer every function in that set.
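Prop. 1(ii) can also be verified by brute force on a small example: enumerate every non-constant binary-valued function over a toy finite U and check that at least one of them escapes the device. A sketch, with an invented four-point universe:

```python
from itertools import product

U = ['u1', 'u2', 'u3', 'u4']
X = {'u1': 1, 'u2': 1, 'u3': 2, 'u4': 2}
Y = {'u1': 1, 'u2': -1, 'u3': -1, 'u4': 1}

def weakly_infers(U, X, Y, G):
    """Def. 3 checked by exhaustive search over setup values."""
    for gamma in set(G.values()):
        f = lambda g: 1 if g == gamma else -1
        if not any(all(Y[u] == f(G[u]) for u in U if X[u] == x)
                   for x in set(X.values())):
            return False
    return True

# Enumerate every non-constant binary-valued function over U.
not_inferred = []
for vals in product([1, -1], repeat=len(U)):
    if len(set(vals)) < 2:
        continue  # functions over U must take at least two values
    G = dict(zip(U, vals))
    if not weakly_infers(U, X, Y, G):
        not_inferred.append(G)
```

For this device `not_inferred` is non-empty (for instance the function −Y is never inferred), as Prop. 1(ii) guarantees.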

In a limited sense, when applied to prediction (cf. Ex. 1), Prop. 1(ii) means that Laplace was

wrong: even if the universe were a giant clock, he would not have been able to reliably predict the

universe’s future state before it occurred. 5 Viewed differently, Prop. 1(ii) means that regardless

of noise levels and the dimensions and other characteristics of the underlying attractors of the

physical dynamics of various systems, there cannot be a time-series prediction algorithm [48]

that is always correct in its prediction of the future state of such systems.

Note that time does not appear in Def. 3’s model of a prediction system. So in particular in Ex. 3 we could have t3 < t just as well as t3 > t.

Let G be the set of all time-t states of the universe in which C’s output display is +1. The laws of physics can be used to evolve G forward to time t′. Label that evolved set of time-t′ states of the universe as H. Let Γ be the binary-valued question, “does the state of the universe at t′ lie outside of H?”. There is no information concerning H that can be programmed into C at some time t− < t that guarantees C will correctly answer this question.

Corollary 1: Given any device C = (X, Y), there is a function that C infers only if |X(U)| ≥ 2.

Prop. 1(ii) tells us that any inference device C can be “thwarted” by an associated function. However it does not forbid the possibility of some second device that can infer the function that thwarts C. To analyze issues of this sort, and more generally to analyze the inference relationships within sets of multiple functions and multiple devices, we start with the following definition:

Definition 4: Two devices (X1, Y1) and (X2, Y2) are (setup) distinguishable iff ∀ x1, x2, ∃ u ∈ U

s.t. X1(u) = x1, X2(u) = x2.

No device is distinguishable from itself. Distinguishability is non-transitive in general. Having

two devices be distinguishable means that no matter how the first device is set up, it is always

possible to set up the second one in an arbitrary fashion; the setting up of the first device does not

preclude any options for setting up the second one. Intuitively, if two devices are not distinguishable,

then the setup function of one of the devices is partially “controlled” by the setup function

of the other one. In such a situation, they are not two fully separate, independent devices.

By choosing the negation probe f (y ∈ B) = −y we see that no device can weakly infer itself.

We also have the following:

Theorem 1: No two distinguishable devices can weakly infer each other.

Thm. 1 says that no matter how clever we are in designing a pair of inference devices, so long as they are distinguishable from each other, one of them must thwart the other, by providing a function that the other device cannot infer. Whereas the impossibility result of Prop. 1(ii) relies on constructing a special function Γ matched to C, the implications of Thm. 1 are broader, in that they establish that a whole class of functions cannot be inferred by C (namely the conclusion functions of devices that are distinguishable from C and also can infer C). It is important to note that the distinguishability condition is crucial to Thm. 1; mutual weak inference can occur between non-distinguishable devices.

Example 5: Consider a rectangular grid of particle pairs, each pair consisting of a yellow particle

and a purple particle. Say that all particles can either be spin up or spin down. Write the spin of

the purple particle at grid location (i, j) as sp(i, j), and the spin of the yellow particle there as

sy(i, j).

Such a grid is a set U consisting of all quadruples {i, j, sp(i, j), sy(i, j)}. Assume there are at least two i values, and at least one purple spin is up and at least one is down. Then we can define a “purple inference device” Cp by Xp(i, j, sp(i, j), sy(i, j)) ≜ i and Yp(i, j, sp(i, j), sy(i, j)) ≜ sp(i, j). Similarly, a “yellow inference device” can be defined by Xy(i, j, sp(i, j), sy(i, j)) ≜ j and Yy(i, j, sp(i, j), sy(i, j)) ≜ sy(i, j) (assuming there are at least two j’s and at least one yellow particle is spin up and at least one is spin down).

These two devices are distinguishable. In addition, Cp > Cy if there is some i′ such that

sp(i′, j) = sy(i′, j) for all j, and also some i′′ such that sp(i′′, j) = −sy(i′′, j) for all j. In such

a situation we can set up the purple device with a value (i′) that guarantees that its conclusion

correctly answers the question, “Does sy point up?”. Similarly, we can set it up with a value that

guarantees that its conclusion correctly answers the question, “Does sy point down?”.

However if there is such an i′ and i′′, then clearly there cannot also be both a value j′ and a

value j′′ that the yellow inference device can use to answer whether sp points up and whether

sp points down, respectively. This impossibility holds regardless of the size of the grid and the

particular pattern of yellow and purple particles on the grid. Thm. 1 generalizes this impossibility

result.
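Example 5 (together with the distinguishability condition of Def. 4) can be instantiated directly. In the sketch below the 2 × 2 grid size and the spin pattern are assumptions chosen so that i′ = 0 copies the yellow spins and i′′ = 1 inverts them:

```python
from itertools import product

# Example 5's grid: worldlines are quadruples (i, j, sp, sy).
def sy(i, j): return 1 if j == 0 else -1          # yellow spins (assumed pattern)
def sp(i, j): return sy(i, j) if i == 0 else -sy(i, j)  # purple copies/inverts

U = [(i, j, sp(i, j), sy(i, j)) for i, j in product(range(2), range(2))]

Xp = {u: u[0] for u in U}; Yp = {u: u[2] for u in U}   # purple device Cp
Xy = {u: u[1] for u in U}; Yy = {u: u[3] for u in U}   # yellow device Cy

def weakly_infers(U, X, Y, G):
    for gamma in set(G.values()):
        f = lambda g: 1 if g == gamma else -1
        if not any(all(Y[u] == f(G[u]) for u in U if X[u] == x)
                   for x in set(X.values())):
            return False
    return True

def distinguishable(U, X1, X2):
    """Def. 4: every joint setup (x1, x2) is realized by some worldline."""
    return all(any(X1[u] == x1 and X2[u] == x2 for u in U)
               for x1 in set(X1.values()) for x2 in set(X2.values()))
```

Running the checks confirms the text's claim: the devices are distinguishable, Cp weakly infers Cy (setting i = 0 answers “does sy point up?” and i = 1 answers “does sy point down?”), but Cy cannot return the favor, exactly as Thm. 1 requires.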

As a general comment, the definition of what it means for a device to infer Γ can be re-expressed in terms of the pre-images in U of Γ, {Γ−1(γ) : γ ∈ Γ(U)}. 6 Now in this paper we only consider weak inference of Γ’s that are functions. So none of those pre-images of Γ intersect the others; they comprise a partition of U. However more generally, one might be interested in inference of Γ when some of the pre-images of Γ have non-empty intersection with one another.

For example, one might wish to observe if some physical variable is in the range [0, 10], the

range [5, 20], or the range [15, 30]. Formally, the generalization to overlapping pre-images of Γ arises by allowing Γ to be a correspondence rather than a function. The generalization of the

formalism to explicitly accommodate such correspondences is beyond the scope of this paper.

Note though that since devices are pairs of functions, that generalization is not relevant for much

of the analysis concerning the inference of one device by another.

4. Turing machines, Universal Turing machines, and inference

There are several connections between inference and results in computer science [49]. In this

section we introduce some elementary concepts for exploring those connections.

4.1. Turing machines and inference

Consider a deterministic Turing Machine (TM) and write its internal state at iteration t as g(t),

with the state of its tape then being written as h(t). So the operation of the TM on a particular

initial value of its tape h(t0) produces an infinite sequence {h(t0), g(t0), h(t0 + 1), g(t0 + 1), . . .}. (If

g(t) is the halt state, then for completeness we define g(t′) = g(t), h(t′) = h(t) ∀t′ > t.) Which

such sequence the TM executes is determined by the value h(t0) (assuming a default value for

g(t0)).

Next take U to be the set of worldlines consistent with the laws of physics in our universe (and no other worldlines). Hypothesize that it is consistent with those laws of physics to have some particular TM T be physically instantiated in our universe, with iteration number t corresponding to time in some particular reference frame. Then which sequence T actually executes can be

6 Writing it out, if C infers Γ, then ∀ γ ∈ Γ(U), ∃ x ∈ X(U) such that [X−1(x) ∩ Y−1(1)] = [X−1(x) ∩ Γ−1(γ)].


cast as a projection function of the worldline u ∈ U. (Recall that worldlines extend across all time.) Accordingly we can identify any T as a function Γ with domain U. The set of all possible sequences of T that can occur in our universe is simply a set of functions Γ.

To be more precise, fix t0, and let HT be the set of all possible initial (time t0) values of T’s tape. Define MT as the map by which T takes h(t0) ∈ HT to the associated infinite sequence {h(t0), g(t0), h(t0 + 1), g(t0 + 1), . . .}. MT can be viewed as defining T. Equivalently, we can express T as a function over U, ΓT : ΓT projects every u ∈ U in which T has initial tape state h ∈ HT to MT (h). MT and ΓT have the same range (namely the set of all sequences that T can generate), but different domains (HT and U, respectively).
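The correspondence between MT and ΓT can be sketched with a toy deterministic machine. The machine below, its one-bit tape, and its two-step halting behavior are invented stand-ins for a real TM, and the sequences are truncated rather than infinite:

```python
# A toy deterministic machine standing in for a TM: internal state g,
# one-bit "tape" h.  It flips the tape bit once and then halts.
def step(g, h):
    if g == 'halt':
        return g, h                     # halted machines repeat their state forever
    return ('halt', 1 - h) if g == 's1' else ('s1', h)

def M_T(h0, n_steps=4):
    """M_T: initial tape value -> the (truncated) sequence {h(t), g(t)}."""
    g, h = 's0', h0
    seq = []
    for _ in range(n_steps):
        seq.append((h, g))
        g, h = step(g, h)
    return tuple(seq)

# Worldlines pair the machine's initial tape with unrelated environment state,
# so Gamma_T projects each u onto the sequence determined by h(t0).
H_T = [0, 1]                                  # possible initial tapes
U = [(h0, env) for h0 in H_T for env in ('e1', 'e2', 'e3')]
Gamma_T = {u: M_T(u[0]) for u in U}
```

As in the text, `M_T` and `Gamma_T` have the same range (the set of sequences the machine can generate) but different domains (H_T versus the larger U).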

Now construct an inference device CT ≡ (XT , YT ) where XT (U) ≡ {(h, f ) : h ∈ HT , f ∈ π(ΓT )}. Write the two components of any value XT (u) as XTh(u) and XTf(u), where XTh(u) is defined to be the value h(t0) for the TM T when the worldline is u. So XTh “initializes” the TM. Note that the second component of XT , XTf , maps u onto a space of functions over U (namely, the space π(ΓT )). Finally, define YT : u → 1 iff XTf(u)[MT (XTh(u))] = 1.

If XT is set up to be a particular initial state of T’s tape, together with a particular probe concerning the resultant sequence of internal and tape states, then for any u the conclusion YT (u) is the actual value of that probe for the sequence of internal and tape states specified in u. Since probes are simply a way to imbue the conclusion of the device with semantic meaning (recall Ex. 3 in Sec. 2), this means we can view CT as equivalent to T. In particular, CT infers the TM, i.e., CT > ΓT .

We can generalize this example, to identify inference devices in general as analogs of TM’s, with inference being the analog of TM-style computation. All of the impossibility results presented above apply to these analogs of TM’s. To illustrate this, Prop. 1(ii) can be taken to mean that for any such inference-based analog of a TM, there is some function that the device cannot “compute”. In particular, this is true for the device CT that essentially equals the TM T. In this sense, Prop. 1(ii) can be viewed as the analog for inference devices of the Halting theorem, which concerns TM’s. Moreover, this reasoning concerning physical realizations of TM’s applies just as well to other members of the Chomsky hierarchy besides TM’s, providing us with “halting theorems” for those other members.

As a final comment on the relation between inference and TM-style computation, note that

inference by a device C is not a form of counter-factual “computation”. Inference by C does not

compute the answer to a question of the form “If {axioms} then {implications}”, unless there is

some x such that “{axioms}” actually holds for all u ∈ U that C induces by setting X(u) = x. In

particular, if in our universe there is no physical instantiation of some particular TM, then there

is no device in our universe whose inference is computationally equivalent to that TM.

4.2. Universal Turing machines and inference

Now we investigate how to define an analog of Universal Turing Machines (UTM’s) for inference

devices. More precisely, we consider how to define what it means for one device C1 to

emulate the inference process of another device C2. (Just like a UTM emulates the computational process of another TM.) One natural desideratum for such a definition is that for C1 to “emulate” C2 implies, at a minimum, that C1 > C2. So for example, if the two devices are both being used for prediction, this would mean that C1 can correctly predict what prediction C2 will make (whether or not that prediction by C2 is itself correct).


However we want C1 to be able to do more than infer the value of Y2(u); we want C1 to be able to emulate the entire mapping taking any x2 to the associated value(s) Y2(X2−1(x2)). We want C1 to be able to infer what inference C2 might make for any setup value x2, not just the inference that C2 makes for the members of a set X2[X1−1(x1)] picked out by some particular x1. This means that all x2’s must be allowed.

One way to formalize this second desideratum is to require that C1 can infer C2 using a setup

value that forces a unique x2, and can do so for any desired x2. More precisely, consider a particular

case where we want C1 to emulate the inference performed by C2 when X2(u) = x2. We

can do this if C1 infers Y2, while the value x1 used in that inference guarantees that X2(u) = x2.

That guarantee means that C1 infers the conclusion of C2 when C2 has the setup value x2. Given

this interpretation of what it means for C1 to emulate C2 when X2(u) = x2, to have C1 emulate

C2 in full simply means that we require that such emulation be possible for any x2 ∈ X2(U). So

formally, we require that ∀ f ∈ π(Y2), ∀x2, ∃x1 such that X1 = x1 ⇒ X2 = x2, Y1 = f (Y2).

A second formalization takes the opposite approach, and stipulates that the value x1 used by C1 to infer C2 places no restrictions on x2 whatsoever. Formally, this means that ∀ f ∈ π(Y2), ∀ x2, ∃ x1 such that X1−1(x1) ∩ X2−1(x2) ≠ ∅ and X1 = x1 ⇒ Y1 = f (Y2).

In analogy with UTM’s, one might say that under the first formalization C1 specifies the “input tape” to C2 for which C1 will emulate C2, and then successfully carries out that emulation, i.e., successfully “computes” what C2 will produce in response to that input tape. To do this though C1 must interfere with C2, forcing it to have that desired input tape. In contrast, under the second formalization, there is no requirement that X1 force a particular value of X2. In particular, the second formalization is obeyed if ∀ f ∈ π(Y2), ∃ x1 such that X1 = x1 ⇒ Y1 = f (Y2) while at the same time X1−1(x1) ∩ X2−1(x2) ≠ ∅ ∀ x2. In such a situation, C1 can emulate C2 using an x1 that doesn’t reflect how C2 is set up. (Physically, this usually requires that the system underlying C1 be coupled with the system underlying C2 at some time, so that x2 can be made known to C1.)

Despite this apparent difference, these two formalizations of our second desideratum reflect the same underlying mathematical structure. To see this, define a composite device C′ = (X′, Y′) where X′ : u → (X1(u), X2(u)) and Y′ = Y1. Then under our second formalization of “emulation”, for C1 to emulate C2 implies that ∀ f ∈ π(Y2), ∀ x2, ∃ x′ such that X′−1(x′) ∩ X2−1(x2) ≠ ∅ and X′ = x′ ⇒ X2 = x2, Y′ = f (Y2). However X′−1(x′) ∩ X2−1(x2) ≠ ∅ means that X′ = x′ ⇒ X2 = x2, by definition of X′. So this second formalization of what it means for C1 to emulate C2 stipulates a relation between C′ and C2 that is identical to the relation between C1 and C2 under the first formalization. In this sense, our second formalization reduces to our first. Accordingly, we concentrate on the first formalization, and make the following definition:

Definition 5: A device (X1, Y1) strongly infers a device (X2, Y2) iff ∀ f ∈ π(Y2) and all x2, ∃ x1

such that X1 = x1 ⇒ X2 = x2, Y1 = f (Y2).

If (X1, Y1) strongly infers (X2, Y2) we write (X1, Y1) ≫ (X2, Y2). 7 See App. B for a discussion of

how minimal the definition of strong inference really is.
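Def. 5 can also be checked exhaustively on a finite toy universe. In the sketch below, U is constructed (an invented illustration, not an example from the text) so that X1’s value forces both X2’s value and which probe of Y2 the conclusion Y1 answers; the asymmetry of strong inference then shows up directly:

```python
# Worldlines u = (x2, f_label, s): the second device's setup x2, which probe
# of Y2 the first device is pointed at, and a one-bit state s fixing Y2.
U = [(x2, fl, s) for x2 in (0, 1) for fl in ('id', 'neg') for s in (1, -1)]

X2 = {u: u[0] for u in U}
Y2 = {u: u[2] for u in U}
X1 = {u: (u[0], u[1]) for u in U}                     # X1 forces (x2, probe)
Y1 = {u: u[2] if u[1] == 'id' else -u[2] for u in U}  # Y1 = f(Y2) by design

def strongly_infers(U, X1, Y1, X2, Y2):
    """Def. 5: for each probe f of Y2 and each x2, some x1 forces X2 = x2
    while guaranteeing Y1 = f(Y2)."""
    for f in (lambda y: y, lambda y: -y):   # the two probes of a binary Y2
        for x2 in set(X2.values()):
            if not any(all(X2[u] == x2 and Y1[u] == f(Y2[u])
                           for u in U if X1[u] == x1)
                       for x1 in set(X1.values())):
                return False
    return True
```

Here `strongly_infers(U, X1, Y1, X2, Y2)` holds, while the reverse direction fails, consistent with Thm. 3's prohibition of mutual strong inference.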

Say we have a TM T1 that can emulate another TM T2, e.g., T1 is a UTM. This means that T1

can calculate anything that T2 can. The analogous property holds for strong and weak inference.

7 Note that there are only two probes of Y2, the identity probe f (y2) = y2 and the negation probe f (y2) = −y2. Indicate those two probes by f = 1 and f = −1, respectively. Then we can express X1 = x1 ⇒ X2 = x2, Y1 = f (Y2) in set-theoretic terms, as X1−1(x1) ⊆ X2−1(x2) ∩ (Y1Y2)−1( f ), where Y1Y2 is the function u ∈ U → Y1(u)Y2(u).


In addition, like UTM-style emulation (but unlike weak inference), strong inference is transitive.

These results are formalized as follows:

Theorem 2: Let C1, C2 and C3 be a set of inference devices over U and Γ a function over U. Then:

i) C1 ≫ C2 and C2 > Γ ⇒ C1 > Γ.

ii) C1 ≫ C2 and C2 ≫ C3 ⇒ C1 ≫ C3.

Strong inference implies weak inference, i.e., C1 ≫ C2 ⇒ C1 > C2. We also have the following strong inference analogs of Prop. 1(ii) and Coroll. 1 (which concerns weak inference):

Proposition 2: Let C1 be a device over U.

i) There is a device C2 such that C1 does not strongly infer C2.

ii) Say that ∀ x1, |X1−1(x1)| > 2. Then there is a device C2 such that C2 ≫ C1.

Recall that the Halting problem concerns whether there is a UTM T with the following property:

Given any TM T′ and associated input string s′, if T′ and s′ are encoded as an input string

to T, then T always correctly decides whether T′ halts on input s′. The Halting theorem then

says that there can be no such UTM T. Intuitively, Prop. 2(i) can be viewed as an analog of this

theorem, in the context of inference. (See also Prop. 7 below.)

In general we are not interested in whether a device can strongly infer an arbitrary set of other devices, but rather in the strong inference relationships among the members of a particular set of devices. Just like with weak inference, no device can strongly infer itself. This can be generalized to concern a set of multiple devices as follows:

Theorem 3: No two devices can strongly infer each other.

Note that Thm. 3 does not require distinguishability, in contrast to Thm. 1.

5. Inference Complexity

In computer science, given a TM T, the Kolmogorov complexity of an output string s is defined as the length of the smallest input string s′ that when input to T produces s as output. To construct our inference device analog of this, we need to define the “length” of an input region of an inference device C. To do this, we assume we are given a measure dμ over U, and for simplicity restrict attention to functions G over U with countable range. Then we define the length of g ∈ G(U) as −ln[∫G−1(g) dμ], i.e., the negative logarithm of the volume of all u ∈ U such that G(u) = g. We write this length as LC(g), or just L(g) for short. 8

Definition 6: Let C be a device and Γ a function over U where X(U) and Γ(U) are countable and C > Γ. The inference complexity of Γ with respect to C is defined as

8 If ∫ dμ = ∞, then we instead work with differences in logarithms of volumes, evaluated under an appropriate limit of dμ that takes ∫ dμ → ∞. For example, we might work with such differences when U is taken to be a box whose size goes to infinity. This is just the usual physics trick for dealing with infinite volumes.


C(Γ | C) ≜ Σ_{f ∈ π(Γ)} min_{x : X=x ⇒ Y=f(Γ)} L(x).
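With a finite U and the uniform measure, Def. 6 reduces to counting. A sketch on an invented four-point universe, in which each setup value picks out half of U and so each probe costs ln 2:

```python
from math import log

U = ['u1', 'u2', 'u3', 'u4']
X = {'u1': 1, 'u2': 1, 'u3': 2, 'u4': 2}
Y = {'u1': 1, 'u2': -1, 'u3': -1, 'u4': 1}
Gamma = {'u1': 'a', 'u2': 'b', 'u3': 'a', 'u4': 'b'}

def L(x):
    """Length of setup value x: -ln of the uniform measure of X^{-1}(x)."""
    return -log(sum(1 for u in U if X[u] == x) / len(U))

def inference_complexity():
    """Def. 6: sum over probes of the minimal length of an admissible x,
    the min ranging over x with X = x  =>  Y = f(Gamma)."""
    total = 0.0
    for gamma in set(Gamma.values()):
        f = lambda g: 1 if g == gamma else -1
        total += min(L(x) for x in set(X.values())
                     if all(Y[u] == f(Gamma[u]) for u in U if X[u] == x))
    return total
```

For this device each of the two probes of Γ is answered by a fiber of measure 1/2, giving C(Γ | C) = 2 ln 2.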

The inference complexity of Γ with respect to C is the sum of a set of “complexities”, one for each probe f of Γ. Loosely speaking, each of those complexities is the minimal amount of Shannon information that must be imposed in C’s setup function in order to ensure that C correctly concludes what value f has. In particular, if Γ corresponds to a potential future state of some system S external to C, then C(Γ | C) is a measure of how difficult it is for C to predict that future state of S . Loosely speaking, the more sensitively that future state depends on current conditions, the more complex is the computation of that future state.

Example 6: Consider a conventional real-world computer, with a subsection of its RAM set aside to contain the program it will run, and a separate subsection set aside to contain the conclusion that the program will produce. Say the total number of bits in the program subsection of the RAM is 2^k + k for some integer k. Refer to any set of 2^k + k bits as a “complete string”; the set of all complete strings is the set of all possible bit strings in the program subsection of the RAM.

Let K be the set of all bit strings s consisting of at least k bits such that the first k bits are a binary encoding of the total number of bits in s beyond those first k bits. So every element of K can be read into the beginning of the RAM’s program subsection. For any s ∈ K define an associated “partial string” as the set of all complete strings whose first bits are s. Intuitively, for any such complete string, all of its bits beyond s are “wild cards”. (Such partial strings are just the “files” of real-world operating systems.) With some abuse of terminology, when we write “s” we will sometimes actually mean the partial string that s specifies.

We can identify a particular program input to the computer as such a partial string in its program subsection. If we append certain bits to such an s (modifying the contents of the first k bits appropriately) to get a new longer program partial string, s′, the set of complete strings consistent with s′ is a proper subset of the set of complete strings consistent with s.

Define the length of a partial string s as the negative of the logarithm of the number of complete strings that have s at their beginning, minus k. This matches the usual definition of the length of a string used in computer science. In particular, if s′ contains n more bits than does s, then there are 2^n times as many complete strings consistent with s as there are consistent with s′. Accordingly, if we take logarithms to have base 2, the length of s′ equals the length of s, plus n.
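The length arithmetic in this passage can be checked numerically. The sketch below (with k = 3 assumed for concreteness) confirms that appending n bits to a partial string increases its length by exactly n; since only differences of lengths matter, the additive constant is irrelevant:

```python
from math import log2

k = 3                        # header bits; program subsection has 2**k + k bits
total_bits = 2**k + k

def completions(n_content_bits):
    """Number of complete strings consistent with a partial string whose
    content part (beyond the k header bits) has n_content_bits bits."""
    return 2 ** (total_bits - (k + n_content_bits))

def length(n_content_bits):
    """-log2 of the number of consistent complete strings, minus k."""
    return -log2(completions(n_content_bits)) - k
```

For example, a partial string with 5 content bits has 2^3 = 8 completions, one with 2 content bits has 2^6 = 64, and the lengths differ by exactly 3 bits.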

Now view our physical computer as an inference device, with U the Cartesian product of the

set of all possible bit strings in the RAM of the computer together with some countable-valued

variables concerning the world outside of the computer. Refer to the components of any u ∈ U

specifying the bit string in the program subsection of the RAM as the “program subsection of u”,

and similarly for the “conclusion subsection of u”.

For the computer to be an inference device means that the conclusion subsection of u consists

of a single bit, i.e., Y maps all u ∈ U to the (bit) value of the conclusion subsection of the

computer’s RAM as specified by u. For all u ∈ U, have X(u) be the bit string at the beginning of

the program subsection of u whose length is given by the first k bits of that program subsection

of u. So x is a partial string of the RAM’s program subsection. In general, there are many sets

each consisting of multiple u ∈ U that have the same image under X, i.e., there are many x such

that X−1(x) consists of multiple elements. If we adopt the uniform point measure dμ, then L(x)

is just the negative logarithm of the number of such elements in X−1(x), i.e., the length of the

partial string x in the program subsection of the computer’s RAM.

Now say we want our computer to make a prediction concerning the value of Γ(u), where Γ is one of the variables associated with the world outside of the computer. As usual, we interpret this to mean that for any γ ∈ Γ(U), there is some partial string we can read into the computer’s program subsection that contains enough information concerning Γ and the state of the world so that the computer’s conclusion will correctly say whether Γ(u) = γ. The inference complexity of that prediction of Γ is the sum, over all probes f of Γ, of the length of the shortest partial string in the computer’s program subsection that causes it to correctly conclude the value of f .

The min over x’s in Def. 6 is a direct analog of the min in the definition of Kolmogorov complexity (there the min is over those strings that when input to a particular UTM result in the desired output string). A natural modification to Def. 6 is to remove the min by considering all x’s that cause Y = f (Γ), not just one of them:

Ĉ(Γ | C) ≜ Σ_{f ∈ π(Γ)} −ln[ μ( ∪_{x : X=x ⇒ Y=f(Γ)} X−1(x) ) ] = Σ_{f ∈ π(Γ)} −ln[ Σ_{x : X=x ⇒ Y=f(Γ)} e^{−L(x)} ],

where the equality follows from the fact that for any x′ ≠ x, X−1(x) ∩ X−1(x′) = ∅. The argument of the ln in this modified version of inference complexity has a direct analog in TM theory: the sum, over all input strings s to a UTM that generate a desired output string s′, of 2^−n(s), where n(s) is the bit length of s.

We now bound how much more complex a function can appear to C1 than to C2 if C1 can strongly infer C2.

Theorem 4: Let C1 and C2 be two devices and Γ a function over U where Γ(U) is finite, C1 ≫ C2, and C2 > Γ. Then

C(Γ | C1) − C(Γ | C2) ≤ |Γ(U)| max_{x2} min_{x1 : X1=x1 ⇒ X2=x2, Y1=Y2} [L(x1) − L(x2)].

Note that since L(x1) − L(x2) = ln[ μ(X2−1(x2)) / μ(X1−1(x1)) ], the bound in Thm. 4 is independent of the units with which one measures volume in U. (Cf. footnote 8.) Furthermore, recall that X1 = x1 ⇒ X2 = x2, Y1 = Y2 iff X1−1(x1) ⊆ X2−1(x2) ∩ (Y1Y2)−1(1). (Cf. footnote 7.) Accordingly, for all (x1, x2) pairs arising in the bound in Thm. 4, μ(X2−1(x2)) / μ(X1−1(x1)) ≥ 1. So the bound in Thm. 4 is always non-negative.

An important result in the theory of UTM’s is an upper bound on the difference between the

Kolmogorov complexity of a string using a particular UTM T1 and its complexity if using a

different UTM, T2. This bound is independent of the computation to be performed, and can be

viewed as the Kolmogorov complexity of T1 emulating T2.

The bound in Thm. 4 is the analog of this UTM result, for inference devices. In particular, the

bound in Thm. 4 is independent of all aspects of �� except the cardinality of ��(U). Intuitively,

the bound is |��(U)| times the worst-case amount of “computational work” that C1 has to do to

“emulate” C2’s behavior for some particular value of x2.


6. Realities and copies of devices

In this section the discussion is broadened to allow sets of many functions to be inferred and

/ or inference devices. Some of the philosophical implications of the ensuing results are then

discussed.

6.1. Formal results

To analyze relationships among multiple devices and functions, define a reality as a pair

(U; {Fφ}) where U is a space and {Fφ} is a (perhaps uncountable) non-empty set of functions

all having domain U. We will sometimes say that U is the domain of the reality. We are particularly

interested in device realities in which some of the functions are binary-valued, and we

wish to pair each of those functions uniquely with some of the other functions. Such realities can

be written as the triple (U; {(Xα, Yα)}; {Γβ}) ≡ (U; {Cα}; {Γβ}) where {Cα} is a set of devices over U and {Γβ} a set of functions over U.

Define a universal device as any device in a reality that can strongly infer all other devices

and weakly infer all functions in that reality. Thm. 3 means that no reality can contain more than

one universal device. So in particular, if a reality contains at least one universal device, then it has

a unique natural choice for an inference complexity measure, namely the inference complexity

with respect to its (unique) universal device. (This contrasts with Kolmogorov complexity, which

depends on the arbitrary choice of what UTM to use.)

It is useful to define the reduced form of a reality (U; {Fφ}) as the range of ⊗φ Fφ. Expanding, this equals ∪u∈U [⊗φ Fφ](u), the union over all u of the tuples formed by the Cartesian product, running over all φ, of the values Fφ(u). In particular, the reduced form of a device reality is the set of all tuples ([x1, y1], [x2, y2], . . . ; γ1, γ2, . . .) for which ∃ u ∈ U such that simultaneously X1(u) = x1, Y1(u) = y1, X2(u) = x2, Y2(u) = y2, . . . ; Γ1(u) = γ1, Γ2(u) = γ2, . . ..
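The reduced form is straightforward to compute for a finite toy reality (the three-point universe, device, and function below are invented for illustration); note that it is generally a proper subset of the full Cartesian product of the individual ranges:

```python
# A miniature device reality: one device (X, Y) and one function Gamma over U.
U = ['u1', 'u2', 'u3']
X = {'u1': 0, 'u2': 0, 'u3': 1}
Y = {'u1': 1, 'u2': -1, 'u3': 1}
Gamma = {'u1': 'a', 'u2': 'b', 'u3': 'b'}

# The reduced form: the set of joint tuples ([x, y]; gamma) realized by some u.
reduced_form = {((X[u], Y[u]), Gamma[u]) for u in U}

# Size of the unconstrained product of the individual ranges, for comparison.
full_product_size = (len(set(X.values())) * len(set(Y.values()))
                     * len(set(Gamma.values())))
```

Here only 3 of the 8 joint tuples are realized by some worldline; the "laws" embodied in U show up precisely as the missing tuples.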

As an example, take U to be the set of all worldlines consistent with the laws of physics (and

no other worldlines). So for example, if one wants to consider a universe in which the laws of

physics are time-reversible and deterministic, then we require that no two distinct members of

U can intersect. Similarly, properties like time-translation invariance can be imposed on U, as

can more elaborate laws involving physical constants. Which such particular properties of U are

imposed depends on what the laws of physics are.

Next, have {Γβ} be a set of physical characteristics of the universe, each characteristic perhaps defined in terms of the values of one or more physical variables at multiple locations and/or multiple times. Finally, have {Cα} be all prediction / observation systems concerning the universe that any scientists might ever be involved in.

This example is the conventional way to interpret our universe as a reality. In this example the laws of physics are embodied in U. The implications of those laws for the relationships among the scientist devices {Cα} and the other characteristics of the universe {Γβ} are embodied in the reduced form of the reality. Viewing the universe this way, it is the u ∈ U, specifying the universe’s state for all time, that has “physical meaning”. The reduced form instead is a logical implication of the laws of the universe. In particular, our universe’s u picks out the tuple [⊗α Cα(u)] × [⊗β Γβ(u)] from the reduced form of the reality.

As an alternative we can view the reduced form of the reality as encapsulating the “physical

meaning” of the universe. In this alternative u does not have any physical meaning. It is only

the relationships among the inferences about u that one might want to make and the devices


with which to try to make those inferences that have physical meaning. One could completely

change the space U and the functions defined over it, but if the associated reduced form of the

reality does not change, then there is no way that the devices in that reality, when considering

the functions in that reality, can tell that they are now defined over a different U. In this view, the

laws of physics, i.e., a choice for the set U, are simply a calculational shortcut for encapsulating

patterns in the reduced form of the reality. It is a particular instantiation of those patterns that has

physical meaning, not some particular element u ∈ U.

Given a reality (U; {(X1, Y1), (X2, Y2), . . .}), we say that a pair of devices in it are pairwise

distinguishable if they are distinguishable. We say that a device (Xi, Yi) in that reality is outside distinguishable iff ∀ xi ∈ Xi(U) and all x′_{−i} in the range of ⊗_{j≠i} Xj, there is a u ∈ U such that simultaneously Xi(u) = xi and Xj(u) = x′_j ∀ j ≠ i. (Note that that range may be a proper subset of ×_{j≠i} Xj(U).) We say that the reality as a whole is mutually (setup) distinguishable iff ∀ x1 ∈ X1(U), x2 ∈ X2(U), . . . ∃ u ∈ U s.t. X1(u) = x1, X2(u) = x2, . . ..
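Setup distinguishability of a pair of devices is directly checkable on finite toy examples. A minimal Python sketch, with hypothetical setup functions of our own devising:

```python
from itertools import product

def setup_distinguishable(U, X1, X2):
    """Check pairwise setup distinguishability: every joint value (x1, x2)
    in X1(U) x X2(U) must be realized by some u in U."""
    return all(any(X1[u] == x1 and X2[u] == x2 for u in U)
               for x1, x2 in product(set(X1.values()), set(X2.values())))

U = [0, 1, 2, 3]
X1 = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
X2 = {0: 'p', 1: 'q', 2: 'p', 3: 'q'}   # all four joint values occur
X3 = {0: 'p', 1: 'p', 2: 'q', 3: 'q'}   # ('a','q') is never realized

print(setup_distinguishable(U, X1, X2))  # True
print(setup_distinguishable(U, X1, X3))  # False
```

In the second case X3 simply relabels X1's partition, so fixing X1's value forces X3's value and the two setups are maximally entangled rather than independent.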

Proposition 3:

i) There exist realities (U;C1,C2,C3) where each pair of devices is setup distinguishable

and C1 > C2 > C3 > C1.

ii) There exists no reality (U; {Ci : i ∈ N ⊆ ℕ}) where the devices are mutually

distinguishable and for some integer n, C1 > C2 > . . . > Cn > C1.

iii) There exists no reality (U; {Ci : i ∈ N ⊆ ℕ}) where for some integer n, C1 ≫ C2 ≫

. . . ≫ Cn ≫ C1.

Consider a reality with a countable set of devices {Ci}. There are many ways to view such

a reality as a graph, for example by having each node be a device while the edges between

the nodes concern distinguishability of the associated devices, or concern whether one weakly

infers the other, etc. There are restrictions on what graphs of those various sorts can exist. As an

example, given a countable reality, define an associated directed graph by identifying each device

with a separate node in the graph, and by identifying each relationship of the form Ci ≫ Cj with

a directed edge going from node i to node j. We call this the strong inference graph of the

reality.
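A strong inference graph is an ordinary directed graph, so root-finding is elementary. The sketch below takes the strong-inference relation as given data (the edges are hypothetical, not derived from actual devices):

```python
# Sketch: find the root nodes (no incoming edge) of a strong inference graph
# supplied as a set of directed edges (i, j), meaning "device i strongly
# infers device j". The particular edges here are purely illustrative.
def roots(n_nodes, edges):
    """Return the nodes of the graph with no incoming edge."""
    has_parent = {j for (_, j) in edges}
    return [i for i in range(n_nodes) if i not in has_parent]

# Hypothetical: device 0 strongly infers devices 1, 2, 3; device 1 infers 3.
edges = {(0, 1), (0, 2), (0, 3), (1, 3)}
print(roots(4, edges))  # [0]
```

A universal device, by the discussion that follows, must appear as such a root, and the graph as a whole must be acyclic.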

Thm. 3 means that a universal device in a reality must be a root node of the strong inference

graph of the reality. Applying Thm. 3 again shows that the strong inference graph of a reality with

a universal device must contain exactly one root. In addition, by Thm. 2(ii), we know that every

node in a reality’s strong inference graph has edges that lead directly to every one of its successor

nodes (whether or not there is a universal device in the reality). By Prop. 3(iii) we also know that

a reality’s strong inference graph is acyclic. This latter fact establishes the following:

Proposition 4: Let D be a finite subset of the devices in a reality, where the strong inference

graph of the reality is weakly connected over D. Say that any pair of distinct devices in D that

are not connected by an edge of the strong inference graph are setup distinguishable.

Then the strong inference graph of the reality has one and only one root over D.

Results of this sort mean there are unavoidable asymmetries in the strong inference graphs of

realities. These asymmetries provide a preferred direction of strong inference in realities, akin to

the preferred direction in time provided by the second law of thermodynamics.

Note that even if a device C1 can strongly infer all other devices Ci>1 in a reality, it may

not be able to infer them simultaneously (strongly or weakly). For example, define Γ : u → (Y2(u), Y3(u), . . .). Then the fact that C1 is a universal device does not mean that ∀ f ∈ π(Γ) ∃ x1 such that X1 = x1 ⇒ Y1 = f(Γ). See the discussion in [26] on “omniscient devices” for more on this point.

We now define what it means for two devices to operate in an identical manner:

Definition 7: Let U and ˆU be two (perhaps identical) sets. Let C1 be a device in a reality with

domain U. Let R1 be the relation between X1 and Y1 specified by the reduced form of that reality,

i.e., x1R1y1 iff the pair (x1, y1) occurs in some tuple in the reduced form of the reality. Similarly

let R2 be the relation between X2 and Y2 for some separate device C2 in the reduced form of a

reality having domain ˆU .

Then we say that C1 mimics C2 iff there is an injection ρX : X2(Û) → X1(U) and a bijection ρY : Y2(Û) ↔ Y1(U) such that ∀ x2, y2, x2R2y2 ⇔ ρX(x2)R1ρY(y2). If both C1 mimics C2 and

vice-versa, we say that C1 and C2 are copies of each other.

Note that because ρX in Def. 7 may not be surjective, one device may mimic multiple other devices. (Surjectivity of ρY simply reflects the fact that since we’re considering devices, Y1(U) = Y2(Û) = B.) The relation of one device mimicking another is reflexive and transitive. The relation of two devices being copies is an equivalence relation.
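For tiny ranges, Def. 7 can be checked by brute force over all candidate maps ρX and ρY. A hypothetical Python sketch (the relations below are made up for illustration):

```python
from itertools import permutations

def mimics(R1, X1_vals, Y1_vals, R2, X2_vals, Y2_vals):
    """Brute-force check of Def. 7: does device 1 (relation R1 between its
    setup and conclusion values) mimic device 2 (relation R2)?  Searches all
    injections rho_X : X2 -> X1 and bijections rho_Y : Y2 -> Y1."""
    if len(Y1_vals) != len(Y2_vals):
        return False
    for image in permutations(X1_vals, len(X2_vals)):   # injections rho_X
        rho_X = dict(zip(X2_vals, image))
        for perm in permutations(Y1_vals):              # bijections rho_Y
            rho_Y = dict(zip(Y2_vals, perm))
            if all(((x2, y2) in R2) == ((rho_X[x2], rho_Y[y2]) in R1)
                   for x2 in X2_vals for y2 in Y2_vals):
                return True
    return False

# Hypothetical relations read off two toy reduced forms.
R2 = {('p', 1), ('q', -1)}
R1 = {('a', -1), ('b', 1), ('c', 1), ('c', -1)}
print(mimics(R1, ['a', 'b', 'c'], [-1, 1], R2, ['p', 'q'], [-1, 1]))  # True
print(mimics(R2, ['p', 'q'], [-1, 1], R1, ['a', 'b', 'c'], [-1, 1]))  # False
```

The asymmetric output illustrates the point in the text: since ρX need only be injective, the larger device mimics the smaller one here, but no injection runs the other way, so they are not copies.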

Intuitively, when expressed as devices, two physical systems are copies if they follow the

same inference algorithm with ρX and ρY translating between those systems. In particular, say a

reality contains two separate physical computers that are inference devices, both being used for

prediction. If those devices are copies of each other, then they form the same conclusion for the

same value of their setup function, i.e., they perform the same computation for the same input.

As another example, say that the states of some physical system S at a particular time t and

shortly thereafter at t+δ are identified as the setup and conclusion values of a device C1. In other

words, C1 is given by the functions (X1(u), Y1(u)) ≜ (S(u_t), S(u_{t+δ})). In addition, let RS be the

relation between X1 and Y1 specified by the reduced form of the reality containing the system.

Say that the time-translation of C1, given by the two functions S(u_{t′}) and S(u_{t′+δ}), also obeys the relation RS. Then the pair of functions (X2(u), Y2(u)) ≜ (S(u_{t′}), S(u_{t′+δ})) is another device that is a copy of C1. So for example, the same physical computer at two separate pairs of moments is

two separate devices, devices that are copies of each other, assuming they have the same set of

allowed computations.

Say that an inference device C2 is being used for observation and C1 mimics C2. The fact that

C1 mimics C2 does not imply that C1 can emulate the observation that C2 makes of some outside

function Γ. The mimicry property only relates C1 and C2, with no concern for relationships

with any third function. (This is why for one device to “emulate” another is defined in terms of

strong inference rather than in terms of mimicry.)

Next, for future use, we note the following fact, which is almost obvious despite its complicated statement:

Lemma 1: Let K1 be the set of reduced forms of all device realities. Let K2 be the set of all sets k with the following property: k can be written as {[⊗_{α∈A} (s^r_α, t^r_α)] × [⊗_{β∈B} v^r_β] : r ∈ R} for some associated A, B and R such that for all α ∈ A, ∪_r t^r_α = B and |∪_r s^r_α| ≥ 2, while for all β ∈ B, |∪_r v^r_β| ≥ 2. Then K1 = K2. In particular, any k ∈ K2 is the reduced form of a reality (U; {Cα}, {Γβ}), where for all α ∈ A, β ∈ B, u ∈ U, there is some associated r ∈ R such that simultaneously Xα(u) = s^r_α, Yα(u) = t^r_α, and Γβ(u) = v^r_β.

Next, fix a counting number m and a set of m cardinalities, {κ_i : i = 1, . . . , m}. Let M be the set of all realities each of which comprises m functions, where the ranges of those m functions have the associated cardinalities {κ_i : i = 1, . . . , m}.

Now say we ask whether there is a reality in M whose m functions have some particular relationship(

s) with one another. (Answers to such questions form most of the results of the earlier

parts of this paper.) Lemma 1 allows us to transform this question into a constraint satisfaction

problem over an associated space of tuples. This transformation changes the set of “specified

relationship(s)” into a set of simultaneous constraints over the associated space of tuples. The

precise type of constraint satisfaction problem produced by the transformation (integer-valued,

real-valued, etc.) is determined by the space of tuples under consideration, i.e., by the cardinalities

of the images of the functions that constitute the reality.

Often though we can use Lemma 1 more directly to answer questions concerning realities,

without invoking any techniques for solving constraint satisfaction problems. An example occurs

in the proof of the following result:

Proposition 5: Let C1 be a copy of C2.

i) It is possible that C1 and C2 are distinguishable and C1 > C2, even for finite X1(U), X2(U).

ii) It is possible that C1 ≫ C2, but only if X1(U) and X2(U) are both infinite.

6.2. Philosophical implications

Return now to the case where U is a set of laws of physics (i.e., the set of all worldlines consistent

with a set of such laws). The results of this subsection provide general restrictions that must

relate any devices in such a universe, regardless of the detailed nature of the laws of that universe.

In particular, these results would have to be obeyed by all universes in a multiverse [27,28,29].

Accordingly, it is interesting to consider these results from an informal philosophical perspective.

Say we have a device C in a reality that is outside distinguishable. Such a device can be

viewed as having “free will”, in that the way the other devices are set up does not restrict how

C can be set up. Under this interpretation, Thm. 1 means that if two devices both have free will,

then they cannot predict / recall / observe each other with guaranteed complete accuracy. A reality

can have at most one of its devices that has free will and can predict / recall / observe the other

devices in that reality with guaranteed complete accuracy. (Similar conclusions hold for whether

the devices can “control” each other; see Sec. 7 below.)

Thm. 3 then goes further and considers devices that can emulate each other. It shows that

independent of concerns of free will, no two devices can unerringly emulate each other. (In other

words, no reality can have more than one universal device.) Somewhat tongue in cheek, taken

together, these results could be called a “monotheism theorem”.

Now suppose that the domain of a reality is a set of worldlines extending across time, and

consider “physical” devices that are identified with systems evolving in time. (See discussion

just after Def. 7.) Prop. 5 tells us that any universal device must be infinite (have infinite X(U)) if

there are other devices in the reality that are copies of it. Since the time-translation of a physical

device is a copy of that device, this means any physical device that is ever universal must be

infinite. In addition, the impossibility of multiple universal devices in a reality means that if any

physical device is universal, it can only be so at one moment in time. (Its time-translation cannot

be universal.) Again somewhat tongue in cheek, taken together this second set of results could

be called an “intelligent design theorem”. (See Sec. 7 for related limitations concerning devices

that are used to control one another.)


In addition to the questions addressed by the monotheism and intelligent design theorems,

there are many other semi-philosophical questions one can ask of the form “Can there be a

reality with the following properties?”. As mentioned above, Lemma 1 can be used to reduce

all such questions to a constraint satisfaction problem, potentially involving infinite-dimensional

spaces. In other words, much of philosophy can be reduced to constraint satisfaction problems.

As a final comment, while it is most straight-forward to apply the results of this subsection

to physical universes, they can be applied more widely. In particular, somewhat speculatively,

one can consider applying them to mathematical logic itself. In such an application each u ∈ U

would be a (perhaps infinite) string over some alphabet. For example, U might be defined as

the set of all strings that are “true” under some encoding that translates a string into axioms and

associated logical implications. Then an inference device would be a (perhaps fallible) theorem-proving algorithm, embodied within U itself. The results of this subsection would then concern

the relation among such theorem-proving algorithms.

7. Control devices

In weak inference there is no causal arrow from Γ to X. In fact, the only causal arrow goes from the device to the function being inferred (in that X’s value forces something about Γ’s value)

rather than vice-versa. This reflects what it means for us to be able to set up a device so that it is

guaranteed correct in its prediction / observation / memory.

This causal arrow from the device to the function does not mean that the device controls the

function. The reason is that X’s value doesn’t set Γ’s value, but only forces that value to be

consistent with Y. This motivates the following definition:

Definition 8: A device C controls a function Γ over U iff ∀ f ∈ π(Γ), ∀ b ∈ B, ∃ x such that X = x ⇒ Y = f(Γ) = b. C semi-controls Γ iff ∀ γ ∈ Γ(U), ∃ x such that X = x ⇒ Γ = γ.

Semi-control has nothing to do with the conclusion function Y of the device; that function enters when one strengthens the definition of semi-control to get the definition of control. To see this, note that C semi-controls Γ iff ∀ f ∈ π(Γ), ∃ x such that X = x ⇒ f(Γ) = 1. However if X = x forces f(Γ) = 1, then for any probe f′ ≠ f, X = x forces f′(Γ) = −1. So C semi-controls Γ iff ∀ f ∈ π(Γ), ∀ b ∈ B, ∃ x such that X = x ⇒ f(Γ) = b. This is just the definition of control, without the extra condition that control imposes on the value of Y. We say that one device C (semi-)controls another if it (semi-)controls the conclusion function of that second device.
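The semi-control condition of Def. 8 is easy to test exhaustively on a finite toy universe. A hypothetical Python sketch (the particular X and Γ below are ours):

```python
def semi_controls(U, X, Gamma):
    """Def. 8 sketch: for every value gamma in Gamma(U) there must be some
    setup value x that forces Gamma = gamma for all consistent u."""
    for gamma in set(Gamma.values()):
        if not any(all(Gamma[u] == gamma for u in U if X[u] == x)
                   for x in set(X.values())):
            return False
    return True

U = [0, 1, 2, 3]
X = {0: 'a', 1: 'a', 2: 'b', 3: 'c'}
Gamma = {0: 'red', 1: 'red', 2: 'blue', 3: 'red'}
print(semi_controls(U, X, Gamma))    # True: x='a' forces red, x='b' forces blue

X2 = {0: 'a', 1: 'a', 2: 'a', 3: 'b'}
Gamma2 = {0: 'red', 1: 'blue', 2: 'red', 3: 'red'}
print(semi_controls(U, X2, Gamma2))  # False: no x forces Gamma2 = 'blue'
```

Note that the check never consults a conclusion function Y, matching the remark above that semi-control concerns the setup function alone.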

The weakness of the semi-control concept is that it stipulates nothing concerning whether C

“knows” (infers) that some value x forces Γ into the state f^−1(b). In this, it doesn’t capture the

intuitive notion of “control”. Accordingly, in the formalization of Def. 8, we stipulate that you

do not fully control a function if you force it to have some value but don’t know what that value

is.

If the partition induced by X is a refinement of the partition induced by Γ [50], and in particular if it is a fine-graining of that partition, then C semi-controls Γ. Note also that if Γ is binary-valued, then having C semi-control Γ means there is both an x such that X(u) = x ⇒ u ∈ Γ^−1(1) and an x′ such that X(u) = x′ ⇒ u ∈ Γ^−1(−1). In the language of formal epistemology [42,43,45,44], this means that X^−1(x) and X^−1(x′) are the values of a “knowledge function” evaluated for two arguments: the subset Γ^−1(1) and the subset Γ^−1(−1), respectively. (See Sec. 9 below.)

Clearly control implies semi-control. In addition, if one device C1 strongly infers another


device C2, then C1 semi-controls X2, though it may not semi-control Y2. Control implies weak

inference, i.e., if C1 controls a function Γ then C1 > Γ. The logical converse need not hold though.

Since control implies weak inference, all impossibility results concerning weak inference also

apply to control. In particular, no device can control itself, and no two distinguishable devices

can control each other. In fact we can make the following stronger statement, which essentially

states that if two partitions are refinements of each other, they must be identical:

Theorem 5: If two devices C1 and C2 simultaneously semi-control one another’s setup functions,

then the partitions induced by X1 and X2 are identical.

Intuitively, Thm. 5 means that if two devices simultaneously semi-control one another’s setup

functions, then those setup functions are identical, up to a relabeling of their ranges. This provides

the following results contrasting with Thm. 1 and Thm. 3:

Corollary 3: Let C1 and C2 be two devices that simultaneously semi-control one another’s setup

functions.

i) C1 > C2 ⇔ C2 > C1.

ii) Neither device strongly infers the other.

iii) Neither device controls the other’s setup function.

8. Stochastic devices

In the analysis above there is no probability measure P over U. There are several ways to

extend the analysis to incorporate such a probability measure, so that functions over U become

random variables. One starts as follows:

Definition 9: Let P(u ∈ U) be a probability measure, Γ a function with domain U and finite range, and ǫ ∈ [0.0, 1.0]. Then we say that a device (X, Y) (weakly) infers Γ with (covariance) accuracy ǫ iff

(1 / |Γ(U)|) Σ_{f ∈ π(Γ)} max_x [E_P(Y f(Γ) | x)] = ǫ.

As an example, if P is nowhere 0 and C weakly infers Γ, then C infers Γ with accuracy 1.0. 9

9 A subtlety with the definition of an inference device arises in this stochastic setting: we can either require that Y be surjective, as in Def. 1, or instead require that Y be stochastically surjective: ∀ y ∈ B, ∃ u with non-zero probability density such that Y(u) = y. The distinction between requiring surjectivity and stochastic surjectivity of Y will not arise here.

There are several reasonable alternatives to this definition. As an example, recall the “malicious demon” interpretation of f introduced just below Def. 3. That interpretation suggests a change to Def. 9 in which we replace the sum over all probes f and the associated division by |Γ(U)| with a minimum over all probes f.

Note though that it does not seem reasonable to define inference accuracy in terms of mutual information expressions like M(Y, f(Γ) | X = x). To see why, consider the case where f is a probe of Γ that equals 1 iff Γ = γ, and let x be a value where X = x ⇒ Y = −f(Γ). In this case the mutual information conditioned on x between Y and f(Γ) would be maximal. However the device would have probability zero of correctly answering the question, “does Γ have value γ?”. It would either say “yes” when in fact Γ does not equal γ, or it would say “no” when in fact Γ does equal γ.

This is an illustration of the fact that the definition of inference assigns semantic content to

Y = 1: it means that the device’s answer is “yes”. In contrast, information theoretic quantities

like mutual information are (in)famous for not involving semantic content.
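As a toy illustration of Def. 9, the following Python sketch computes the covariance accuracy directly from the definition, with one probe f_g per value g of Γ. The universe, measure, and device below are all hypothetical:

```python
def accuracy(U, P, X, Y, Gamma):
    """Sketch of Def. 9: average over the probes f_g of Gamma of
    max over x of E_P(Y f_g(Gamma) | x).  Y and probes take values in {-1,+1}."""
    gammas = sorted(set(Gamma.values()))
    total = 0.0
    for g in gammas:
        f = lambda gam, g=g: 1 if gam == g else -1   # the probe f_g
        best = max(
            sum(P[u] * Y[u] * f(Gamma[u]) for u in U if X[u] == x)
            / sum(P[u] for u in U if X[u] == x)
            for x in set(X.values()))
        total += best
    return total / len(gammas)

U = [0, 1, 2, 3]
P = {u: 0.25 for u in U}                 # uniform, nowhere 0
X = {0: 'p', 1: 'p', 2: 'q', 3: 'q'}
Gamma = {0: 'r', 1: 'b', 2: 'r', 3: 'b'}
Y = {0: 1, 1: -1, 2: -1, 3: 1}           # x='p' answers probe f_r; x='q' answers f_b
print(accuracy(U, P, X, Y, Gamma))       # 1.0
```

Here for each probe there is a setup value making the conclusion equal the probe's value with probability one, so the accuracy is 1, consistent with the example following Def. 9.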

While inference is a semantic concept, distinguishability is not, which motivates the following

definition:

Definition 10: Let P(u ∈ U) be a probability measure. Then we say that the (setup) mutual information-distinguishability of two devices (X1, Y1) and (X2, Y2) is

1 − M_P(X1, X2) / [H_P(X1) + H_P(X2)].

Mutual-information distinguishability is bounded between 0 and 1.

Note that variables can be distinguishable in the sense of Def. 4 even if their mutual information

distinguishability is less than 1. (They can be partially correlated but still distinguishable in

the sense of Def. 4.) This motivates the following alternative definition, for simplicity phrased

for countable X(U):

Definition 11: Let P(u ∈ U) be a probability measure. Then we say that the counting distinguishability of two devices (X1, Y1) and (X2, Y2) is

1 − [Σ_{x1,x2 : ∃ u : X1(u)=x1, X2(u)=x2} 1] / [|X1(U)| × |X2(U)|].
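Both quantities are straightforward to compute on finite toy examples. The Python sketch below implements Defs. 10 and 11 exactly as written, on hypothetical data of our own choosing:

```python
from math import log2

def mi_distinguishability(P_joint):
    """Def. 10 sketch: 1 - M_P(X1; X2) / (H_P(X1) + H_P(X2)), computed from
    a joint pmf P_joint[(x1, x2)] over the two setup values."""
    p1, p2 = {}, {}
    for (a, b), p in P_joint.items():
        p1[a] = p1.get(a, 0.0) + p
        p2[b] = p2.get(b, 0.0) + p
    H1 = -sum(p * log2(p) for p in p1.values() if p > 0)
    H2 = -sum(p * log2(p) for p in p2.values() if p > 0)
    M = sum(p * log2(p / (p1[a] * p2[b]))
            for (a, b), p in P_joint.items() if p > 0)
    return 1.0 - M / (H1 + H2)

def counting_distinguishability(U, X1, X2):
    """Def. 11 as printed: 1 minus the fraction of joint setup values
    (x1, x2) realized by some u."""
    realized = {(X1[u], X2[u]) for u in U}
    return 1.0 - len(realized) / (len(set(X1.values())) * len(set(X2.values())))

# Independent, uniform setup values: zero mutual information.
P_ind = {(a, b): 0.25 for a in 'ab' for b in 'pq'}
print(mi_distinguishability(P_ind))            # 1.0

U = [0, 1, 2, 3]
X1 = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
X2 = {0: 'p', 1: 'p', 2: 'q', 3: 'q'}          # only (a,p) and (b,q) occur
print(counting_distinguishability(U, X1, X2))  # 0.5
```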

There are many analogs of Thm. 1 that relate quantities like the accuracy with which device

C1 infers device C2, the accuracy with which C2 infers C1, how distinguishable they are, the

entropies of the random variables X1 and X2, etc. To present perhaps the simplest such example,

define H as the four-dimensional hypercube {0, 1}^4, k(z) as the map taking any z ∈ H to z1 + z4 − z2 − z3, m(z) as the map taking any z ∈ H to z2 − z4, and n(z) as the map taking any z ∈ H to z3 − z4.

Proposition 6: Let P be a probability measure over U, and C1 and C2 two devices whose mutual-information distinguishability is 1, where X1(U) = X2(U) = B. Define P(X1 = −1) ≡ α and P(X2 = −1) ≡ β. Say that C1 infers C2 with accuracy ǫ1, while C2 infers C1 with accuracy ǫ2. Then

ǫ1ǫ2 ≤ max_{z∈H} | αβ[k(z)]^2 + α k(z)m(z) + β k(z)n(z) + m(z)n(z) |.

In particular, if α = β = 1/2, then

ǫ1ǫ2 ≤ max_{z∈H} | (z1 − z4)^2 − (z2 − z3)^2 | / 4 = 1/4.


The maximum for α = β = 1/2 can occur in several ways. One is when z1 = 1, and z2, z3, z4 all

equal 0. At these values, both devices have an inference accuracy of 1/2 at inferring each other.

Each device achieves that accuracy by perfectly inferring one probe of the other device, while

performing randomly for the remaining probe.
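The α = β = 1/2 case of Prop. 6 is easy to verify numerically by enumerating the sixteen corners of the hypercube:

```python
from itertools import product

# Check of Prop. 6 at alpha = beta = 1/2: over H = {0,1}^4 the general
# bound with k, m, n as defined above reduces to
# |(z1 - z4)^2 - (z2 - z3)^2| / 4, whose maximum is 1/4.
H = list(product([0, 1], repeat=4))
k = lambda z: z[0] + z[3] - z[1] - z[2]
m = lambda z: z[1] - z[3]
n = lambda z: z[2] - z[3]

general = max(abs(0.25 * k(z)**2 + 0.5 * k(z) * m(z)
                  + 0.5 * k(z) * n(z) + m(z) * n(z)) for z in H)
reduced = max(abs((z[0] - z[3])**2 - (z[1] - z[2])**2) for z in H) / 4
print(general, reduced)  # 0.25 0.25
```

Both expressions attain their maximum of 1/4, e.g. at z = (1, 0, 0, 0), the corner mentioned in the text.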

Similarly, say that we have a volume measure dμ over U, as in Sec. 5, together with a probability

measure P over U. Then we can modify the definition of the length of x to be −H(U | x),

the negative of the Shannon entropy under prior dμ of P(u | x). If as in statistical physics P is

proportional to dμ across the support of P, then P(u | x) ∝ dμ(u | x), and these two definitions of

the length of x are the same.

There are several ways to combine this new definition of length with the concept of inference accuracy to define a stochastic analog of inference complexity. In particular, we can define the stochastic inference complexity of a function Γ with respect to C for accuracy ǫ as

C̄_ǫ(Γ | C) ≜ Σ_{f ∈ π(Γ)} min_{x : E_P(Y f(Γ)|x) ≥ ǫ} [−H(U | x)],

assuming the sum exists for ǫ. So for example, if P is proportional to dμ across the support of P and C > Γ, then for ǫ = 1, C̄_ǫ(Γ | C) = C(Γ | C).

One can extend this stochastic framework to include inference of the probability of an event, e.g., have the device say whether P(Γ = γ) has some specified value. Such inference contrasts with inference accuracy, which (like non-stochastic inference) simply concerns a device’s concluding whether an event occurs, e.g., concluding whether Γ(u) = γ. One can also define stochastic analogs of (semi-)control, strong inference, etc. Such extensions are beyond the scope of this paper.

9. Self-aware devices

We now return to scenarios where U has no associated probability measure. We consider

devices that know what question they are trying to answer, or at least “think they do”. Rather

than encode that knowledge in the conclusion function of the device, we split the conclusion

function into two parts. The value of one of those parts is (explicitly) a question for the device,

and the other part is a possible associated answer. We formalize this as follows:

Definition 12: A self-aware device is a triple (X, Y, Q) where (X, Y) is an inference device, Q is

a question function with domain U where each q ∈ Q(U) is a binary function of U, and Y ⊗ Q is

surjective onto B × Q(U).

Intuitively, a self-aware device is one that (potentially) knows what question it is answering in its

conclusion. When the universe’s state is u, we interpret q = Q(u) as the question about the state of the universe

(i.e., about which subset of U contains the actual u) that the conclusion Y(u) is supposed to

answer. The reason we require that Y ⊗ Q be surjective onto B × Q(U) is so that the device is

allowed to have any conclusion for any of its questions; it’s the appropriate setting of X(u) that

should determine what conclusion it actually makes.

So one way to view “successful inference” is the mapping of any q ∈ Q(U) to an x such that X(u) = x both implies that the device’s conclusion to question q is correct, i.e., Y(u) = q(u), and also implies that the device is sure it is asking question q, i.e., Q(u) = q. As an example, say we have a computer that we want to use to make a prediction. That computer can be viewed as


an inference device. In this case the question q that the device is addressing is specified in the

mind of the external scientist. This means that the question is a function of u (since the scientist

exists in the universe), but need not be stored directly in the inference device. Accordingly, the

combination of the computer with the external scientist who programs the computer is a self-aware device.

To formalize this concept, we must first introduce some notation that is frankly cumbersome, but necessary for complete precision. Let b be a value in some space. Then we define \underline{b} as the constant function over U whose value is b, i.e., u ∈ U → b. Intuitively, the underline operator takes any constant and produces an associated constant-valued function over U. As a particular example, let Γ be a function with domain U. Then \underline{Γ} is the constant function over U whose value is the function Γ, i.e., u ∈ U → Γ. Similarly, let B be a set of functions with domain U, and let A be a function with domain U whose range is B (so each A(u) is a function over U). Then we define \overline{A} as the function taking u ∈ U → [A(u)](u). So the overline operator turns any function over U whose range is functions over U into a single function over U. Both the underline and overline operators turn mathematical structures into functions over U; they differ in what type of argument they take. In particular, for any function Γ over U, \overline{\underline{Γ}} = Γ. (Using this notation is more intuitive in practice than these complicated definitions might suggest.)

Next, recall from Sec. 1.1 that for any probe f of a function Γ with domain U, f(Γ) is the function u ∈ U → f(Γ(u)).

Definition 13: Let D = (X, Y, Q) be a self-aware device.

i) A function Γ is intelligible to D iff ∀ f ∈ π(Γ), f(Γ) ∈ Q(U).

ii) D is infallible iff ∀ u ∈ U, Y(u) = [Q(u)](u).

We say that D is infallible for Q′ ⊆ Q(U) iff ∀ q ∈ Q′, ∀ u ∈ U such that Q(u) = q, Y(u) = q(u). So D is infallible iff it is infallible for Q(U), iff Y = \overline{Q}, iff Y\overline{Q} = 1. If a device is not infallible, we say that it is fallible.
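The infallibility condition of Def. 13(ii) can be made concrete on a toy universe. In the Python sketch below (ignoring, for brevity, the surjectivity condition of Def. 12, which a two-state U cannot satisfy anyway), a question Q(u) is itself encoded as a dict over U:

```python
# Toy check of Def. 13(ii).  Each question is a binary function of U,
# encoded as a dict mapping u' to a value in {-1, +1}.  All data hypothetical.
U = [0, 1]
q_is_zero = {0: 1, 1: -1}         # "is the state equal to 0?"
q_is_one  = {0: -1, 1: 1}         # "is the state equal to 1?"
Q = {0: q_is_zero, 1: q_is_one}   # which question the device asks at each u
Y = {0: 1, 1: 1}                  # the device's conclusion at each u

# Infallible iff Y(u) = [Q(u)](u) for every u.
infallible = all(Y[u] == Q[u][u] for u in U)
print(infallible)  # True
```

At u = 0 the device asks "is the state 0?" and answers 1 (yes); at u = 1 it asks "is the state 1?" and again answers 1; in both cases the conclusion correctly answers the question actually asked.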

Recall that Y ⊗ Q is supposed to represent the original conclusion function “split into two

parts”. Accordingly, in keeping with the terminology used with weak inference, we say that a

self-aware device (X′, Y′, Q′) is intelligible to a self-aware device (X, Y, Q) iff (Y′, Q′) is intelligible

to (X, Y, Q).

Def. 13 provides the extra concepts needed to analyze inference with self-aware devices. Def.

13(i) means that D is able to ask what the value is of every probe of Γ. Def. 13(ii) ensures that

whatever the question D is asking, it is correctly answering that question. Finally, the third part

of “successful inference” — having the device be sure it is asking the question q — arises if D

semi-controls its question function.

These definitions are related to inference by the following results:

Theorem 6: Let D1 be an infallible, self-aware device.

i) Let Γ be a function intelligible to D1 and say that D1 semi-controls Q1. Then (X1, Y1) > Γ.

ii) Let D2 be a device where Y2 is intelligible to D1, D1 semi-controls (Q1, X2), and (Q1, X2)

is surjective onto Q1(U) × X2(U). Then (X1, Y1) ≫ (X2, Y2).

Thm. 6 allows us to apply results concerning weak and strong inference to self-aware devices.

Note that a special case of having D1 semi-control Q1 is where X = χ ⊗ Q1 for some function

χ, as in Ex. 1. For such a case, Y and X “share a component”, namely the question being asked,

specified in Q1.


The following result concerns just intelligibility, without any concern for semi-control or infallibility.

Theorem 7: Consider a pair of self-aware devices D ≡ (X, Y, Q) and D′ ≡ (X′, Y′, Q′) where there are functions R, P, R′, P′ such that P and P′ have domain U, Q = R(P) and Q′ = R′(P′). If P is intelligible to D′ and P′ is intelligible to D, then the following hold:

i) |Q(U)| = |Q′(U)| = |P(U)| = |P′(U)|.

ii) If Q(U) is finite, Q′(U) = π(P) = π(Q) and Q(U) = π(P′) = π(Q′).

In particular, take R and R′ to be identity functions over the associated domains, so that P = Q

and P′ = Q′. Using this choice, Thm. 7 says that if each self-aware device can try to determine

what question the other one is considering, then neither device can try to determine anything

else.

An immediate corollary of Thm. 7 is the following:

Corollary 4: No two self-aware devices whose question functions have finite ranges are intelligible

to each other.

Note that Coroll. 4 does not rely on the devices being distinguishable (unlike Thm. 1). Indeed,

it holds even if the two devices are identical; a self-aware device whose question function has a

finite range cannot be intelligible to itself.

Coroll. 4 is a powerful limitation on any pair of self-aware devices, D and D′. It says that

for at least one of the devices, say D, there is some question q′ ∈ Q′(U) and bit b′, such that D

cannot even ask, “Does D′ pose the question q′ and answer with the bit b′?”. So whether D could

correctly answer such a question is moot.

To circumvent Coroll. 4 we can consider self-aware devices whose conclusion functions alone

are intelligible to each other. However combining Thms. 1 and 3(i) gives the following result:

Corollary 5: Let D1 and D2 be two self-aware devices that are infallible, semi-control their

questions, and are distinguishable. If in addition they infer each other, then it is not possible that

both Y2 is intelligible to D1 and Y1 is intelligible to D2.

With self-aware devices a device C1 may be able to infer whether a self-aware device D2

correctly answers the question that D2 is considering. To analyze this issue we start with the following

definition:

Definition 14: If D1 is a device and D2 a self-aware device, then D1 corrects D2 iff ∃ x1 such that X1 = x1 ⇒ Y1 = Y2\overline{Q2}.

Def. 14 means that Y1 = 1 iff Y2 = \overline{Q2}, i.e., Y2(u) = [Q2(u)](u). Intuitively, if a device D1 corrects D2, then there is an x1 where having X1 = x1 means that D1’s conclusion tells us whether D2 correctly answers q2. 10

10 Say that D1 is also self-aware, and that Y2\overline{Q2} has both bits in its range (so that probes of it are well-defined). Then we can modify the definition to say that D1 corrects D2 iff two conditions are met: all probes in π(Y2\overline{Q2}) are intelligible to D1, and D1 is infallible for π(Y2\overline{Q2}).


Note how weak Def. 14 is. In particular, there is no sense in which it requires that D1 can assess

whether Y2(u) = q2(u) for all questions q2 ∈ Q2(U). So long as D1 can make that assessment for

any question in Q2(U), we say that D1 corrects D2. Despite this weakness, we have the following

impossibility result, which is similar to Prop. 2(i):

Proposition 7: For any device D1 there is a self-aware device D2 that D1 does not correct.

There are similar results for the definition of correction in footnote 10, and for the (im)possibility

of correction among multiple devices.

Finally, while there is not room to do so here, many of the concepts investigated above for

inference devices can be extended to self-aware devices. For example, one might want to modify

the definition of inference complexity slightly for self-aware devices. Let D be a self-aware infallible device that semi-controls its question function and Γ a function over U where Γ(U) is countable and Γ is intelligible to D. Then rather than C(Γ | (X, Y)), it may be more appropriate to consider the self-aware inference complexity of Γ with respect to D, defined as

D(Γ | (X, Y, Q)) ≜ Σ_{f ∈ π(Γ)} min_{x : X=x ⇒ Q=f(Γ)} [L(x)].

Similarly, consider a reality that includes self-aware devices, i.e., a reality (U; {Fφ}) that can be written as (U; {Cα}; {Dδ}; {Γβ}) where in addition to the set of functions {Γβ} and devices {Cα}, we have a set of self-aware devices {Dδ}. For such a reality it often makes sense to consider an augmented reduced form,

∪_{u∈U} [ ⊗_α (Xα(u), Yα(u)) ⊗ ⊗_β Γβ(u) ⊗ ⊗_δ (Xδ(u), Yδ(u), Qδ(u)) ⊗ ⊗_δ [Qδ(u)](U) ].

The last term means we include in the tuples all instances of the form [Q(u)](u′) in which a self-aware device’s question for one u is evaluated at a different u′ ≠ u.

Due to page limits the analysis of such extensions is beyond the scope of this paper.

We close with some comments on the relation between inference with self-aware devices and

work in other fields. Loosely speaking, in the many-worlds interpretation of quantum mechanics

[25], “observation” only involves the relationship between Y and Γ (in general, for a Y whose

range is more than binary). As discussed above, such relationships cannot imbue the observation

with semantic meaning. It is by introducing X and Q into the definition of self-aware devices that

we allow an act of “observation” to have semantic meaning. This is formalized in Thm. 6, when

it is applied to scenarios where weak inference is interpreted as successful observation.

Much of formal epistemology concerns “knowledge functions” which are maps from subsets

of U to other subsets of U [42,43,45,44]. Ki(A), the knowledge function Ki evaluated for an

argument A ⊆ U, is interpreted as the set of possible worlds in which individual i knows that

A is true. The set A is analogous to specification of the question being asked by a self-aware

device. So by requiring the specification of A, knowledge functions involve semantic meaning,

in contrast to the process of observation in the many-worlds interpretation.

A major distinction between inference devices and both the theory of knowledge functions and

the many-worlds definition of observation is that inference devices require that the individual /

observer be able to answer multiple questions (one for each probe concerning the function being

inferred). As mentioned above, this requirement certainly holds in all real-world instances of


“knowledge” or “observation”. Yet it is this seemingly innocuous requirement that drives many

of the results presented above.

Future work involves exploring what inference device theory has to say about issues of interest

in the theory of knowledge functions. For example, analysis of common knowledge starts with a

formalization of what it means for “individual i to know that individual j knows A”. The inference

device analog would be a formalization of what it means for “device D to infer that device C

infers Γ”. Now for this analog to be meaningful, since D can only infer functions with at least

two values in their range, there must be some sense in which the set U both contains “u under

which C infers Γ” and contains u under which it does not. Formally, this means two things. First,

it must not be the case simply that C > Γ, since that would mean that C infers Γ under all u. Second,

there must be a proper subset UC ⊂ U such that if U were redefined to be UC (and C and Γ were

redefined to have UC as their domains in the obvious way), then it would be the case that C > Γ.

This proper subset specifies a binary-valued function ΓC by ΓC(u) = 1 ⇔ u ∈ UC. The question

of whether “D knows that C knows Γ” then becomes whether D can infer ΓC.

ACKNOWLEDGEMENTS: I would like to thank Nihat Ay, Charlie Bennett, John Doyle,

Michael Gogins, and Walter Read for helpful discussion.

APPENDIX A: Proofs

This section presents miscellaneous proofs. Since many of the results may be counter-intuitive,

the proofs are presented in elaborate detail. The reader should bear in mind though that many of

the proofs simply amount to “higher order” versions of the Cretan liar paradox, Cantor diagonalization,

or the like (just like many proofs in Turing machine theory). At the same time, in the

interest of space, little pedagogical discussion is inserted. Unfortunately, the combination makes

many of the proofs a bit of a slog.

Proof of Prop. 1: To prove (i), choose a device (X, Y) where Y(u) = −1 ⇔ u ∈ W. Also have

X(u) take on a separate unique value for each u ∈ W, i.e., ∀w ∈ W, u ∈ U : w ≠ u, X(w) ≠ X(u).

(Note that by definition of W, it contains at least two elements.) So by appropriate choice of an

x, X(u) = x forces u to be any desired element of W.

Choose i. Pick any γ ∈ Γi(U), and examine the probe f that equals 1 iff its argument is γ. If

for no u ∈ W does Γi(u) = γ, then choose any x that forces u ∈ W. By construction, X(u) = x ⇒

Y(u) = −1, and in addition X(u) = x ⇒ f(Γi(u)) = −1. So X(u) = x ⇒ Y(u) = f(Γi(u)), as

desired.

Now say that there is a u ∈ W such that Γi(u) = γ. By hypothesis, ∃u′′ ∈ W : Γi(u′′) ≠ γ. By

construction, there is an x such that X(u′) = x ⇒ u′ = u′′. So X(u′) = x ⇒ u′ ∈ W, Γi(u′) ≠ γ.

The first of those two conclusions means that Y(u′) = −1. The second means that f(Γi(u′)) = −1.

So again, X(u) = x ⇒ Y(u) = f(Γi(u)), as desired. There are no more cases to consider.

To prove (ii), choose b ∈ B and let Γ be a function with domain U where Γ(u) = b for all u

obeying Y(u) = −1 and for no others. (The surjectivity of Y ensures there is at least one such u.)

Consider the probe f of Γ(U) that equals +1 iff its argument is b. For all u ∈ U, f(Γ(u)) = −Y(u). QED.
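The diagonal construction in part (ii) can be checked mechanically: given any device, build Γ by tagging exactly the u's where Y(u) = −1 with a distinguished value b, so that the probe for b disagrees with Y everywhere. The six-element U and the particular (X, Y) below are arbitrary choices of mine.

```python
U = range(6)
# an arbitrary device; this particular (X, Y) is just an example
X = {u: u % 3 for u in U}
Y = {u: +1 if u in (0, 3) else -1 for u in U}

b, other = "b", "c"
Gamma = {u: b if Y[u] == -1 else other for u in U}
# the probe "+1 iff the argument is b", written as a function over U
probe = {u: +1 if Gamma[u] == b else -1 for u in U}

# f(Gamma(u)) = -Y(u) everywhere, so no setup value can force Y = f(Gamma)
assert all(probe[u] == -Y[u] for u in U)
assert not any(all(Y[u] == probe[u] for u in U if X[u] == x)
               for x in set(X.values()))
print("the diagonal function Gamma defeats this device")
```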

Proof of Coroll. 2: To prove the first part of the corollary, let α and β be the partitions induced

by X and Y, respectively. If |X(U)| = |α| = 2, then since Y is surjective onto B, |α| = |β| = 2. Since α is a fine-graining of β, this means


that α = β. So without loss of generality we can label the elements of X(U) so that X = Y.

Now hypothesize that C > Γ for some Γ. Recall that we require that |Γ(U)| ≥ 2. Let γ and γ′

be two distinct elements of Γ(U) where Γ(u) = γ for some u ∈ X⁻¹(−1). Define fγ to be the probe

of Γ(U) that equals 1 iff its argument is γ, and define fγ′ similarly. C > Γ means ∃ xγ ∈ X(U)

such that X(u) = xγ ⇒ fγ(Γ(u)) = Y(u) = X(u) = xγ. Since ∃ u ∈ X⁻¹(−1) such that Γ(u) = γ,

and since Y(u) = −1 ∀u ∈ X⁻¹(−1), xγ must equal 1.

This means that Γ(u) equals γ across all of X⁻¹(xγ) ⊂ U. Therefore ∃ u ∈ X⁻¹(−xγ) such that

Γ(u) = γ′. Moreover, since xγ = Y(X⁻¹(xγ)) = 1, Y(X⁻¹(−xγ)) = −1. Therefore ∃ u ∈ X⁻¹(−xγ)

such that fγ′(Γ(u)) ≠ Y(u). Similarly, ∀ u ∈ X⁻¹(xγ), fγ′(Γ(u)) ≠ Y(u). Therefore there is no

xγ′ ∈ X(U) such that X(u) = xγ′ ⇒ fγ′(Γ(u)) = Y(u). So our hypothesis is wrong; there is no

function that C infers.

Now consider the case where |α| > 2. Label the two elements of β as +1 and −1. Since α is a

fine-graining of β, and since |β| = 2, there are at least two distinct elements of α that are contained

in the same element of β, having label b. Choose one of those elements of α, a, and let a′ be one

of the other elements of α that are contained in that element of β with label b.

Form the union of a with all elements of α that are contained in the element of β with label −b.

That union is a proper subset of all the elements of α. Therefore it picks out a proper subset of U,

W. (Note that W has non-empty overlap with both partition elements of β.) So choose Γ to

be binary-valued, with values given by Γ(u) = b iff u ∈ W. Then for X(u) = a, Γ(u) = b = Y(u).

On the other hand, for X(u) = a′, Γ(u) = −b = −Y(u). So for both probes f of Γ, there is a value

x such that X = x ⇒ Y = f(Γ). QED.

Proof of Thm. 1: Let C1 and C2 be the two devices. Since Y for any inference device is surjective,

Y2(U) = B, and therefore there are two probes of Y2(U). Since by hypothesis C1 weakly infers

C2, using the identity probe f(y ∈ B) = y establishes that ∃ x1 s.t. X1(u) = x1 ⇒ Y1(u) = Y2(u).

Similarly, since C2 weakly infers C1, using the negation probe f (y) = −y establishes that ∃ x2 s.t.

X2(u) = x2 ⇒ Y2(u) = −Y1(u). Finally, by the hypothesis of setup distinguishability, ∃ u∗ ∈ U

s.t. X1(u∗) = x1, X2(u∗) = x2. Combining, we get the contradiction Y1(u∗) = Y2(u∗) = −Y1(u∗).

QED.
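For a small universe this theorem can also be verified exhaustively. The sketch below enumerates every pair of binary-setup devices over a four-element U (with conclusion functions surjective onto B, as the definition of a device requires) and confirms that no setup-distinguishable pair mutually weakly infers. The four-element U and binary setup functions are my own restriction to keep the search finite.

```python
from itertools import product

B = (-1, 1)
U = range(4)

def forces(X, Y, target):
    # exists a setup value x such that X(u) = x implies Y(u) = target(u)
    return any(all(Y[u] == target[u] for u in U if X[u] == x)
               for x in set(X))

def weakly_infers(X1, Y1, X2, Y2):
    # both probes of Y2(U) = B must be answerable: identity and negation
    return (forces(X1, Y1, Y2)
            and forces(X1, Y1, tuple(-y for y in Y2)))

def distinguishable(X1, X2):
    # every joint setup value (a, b) must occur for some u
    return all(any(X1[u] == a and X2[u] == b for u in U)
               for a in set(X1) for b in set(X2))

count = 0
for X1, Y1, X2, Y2 in product(product(B, repeat=4), repeat=4):
    if set(Y1) != {-1, 1} or set(Y2) != {-1, 1}:
        continue                     # conclusions must be surjective onto B
    if not distinguishable(X1, X2):
        continue
    count += 1
    assert not (weakly_infers(X1, Y1, X2, Y2) and weakly_infers(X2, Y2, X1, Y1))
print("checked", count, "distinguishable device pairs; no mutual weak inference")
```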

Proof of Thm. 2: To establish (i), let f be any probe of Γ(U). C2 > Γ ⇒ ∃ x2 such that

X2(u) = x2 ⇒ Y2(u) = f(Γ(u)). In turn, C1 ≫ C2 ⇒ ∃ x1 such that X1 = x1 ⇒ Y1 = Y2, X2 = x2

(by choosing the identity probe of Y2(U)). Combining, X1 = x1 ⇒ Y1 = f(Γ). So C1 > Γ, as claimed

in (i).

To establish (ii), let f be any probe of Y3(U), and x3 any member of X3(U). C2 ≫ C3 ⇒ ∃ x2 ∈

X2(U) such that X2(u) = x2 ⇒ X3(u) = x3, Y2(u) = f (Y3(u)). C1 ≫ C2 then implies that ∃ x1

such that X1(u) = x1 ⇒ X2(u) = x2, Y1(u) = Y2(u) (by choosing the identity probe of Y2(U)).

Combining, X1(u) = x1 ⇒ X3(u) = x3, Y1(u) = f (Y3(u)), as desired. QED.

Proof of Prop. 2: To establish the first claim, simply take Y2 to be the function Γ in Prop. 1(ii).

To establish the second claim, focus attention on any x1 ∈ X1(U), and define W ≡ X1⁻¹(x1).

Choose X2 so that X2(u) takes on a separate unique value for each u ∈ W, i.e., ∀w ∈ W, u ∈ U :

w ≠ u, X2(w) ≠ X2(u).

First consider the case where Y1(W) has a single element, i.e., Y1(u) is the same bit across all of

X1⁻¹(x1). Without loss of generality take that bit to be 1. Choose Y2(w′) = 1 for some w′ ∈ W,

and Y2(w) = −1 for all other w ∈ W. Then choose x2 so that X2(u) = x2 ⇒ u = w′. Therefore

X2(u) = x2 ⇒ X1(u) = x1, Y2(u) = 1. So for the probe f of Y1(U) that equals Y1, X2(u) =


x2 ⇒ Y2(u) = f(Y1(u)). On the other hand, by hypothesis ∃ w′′ ∈ W that differs from w′, and

∃ x2′ ∈ X2(U) such that X2(u) = x2′ ⇒ u = w′′. Moreover, Y2(w′′) = −1, by construction of Y2.

So consider the probe f′ of Y1(U) that equals −Y1. For all u ∈ W, f′(Y1(u)) = −1. In particular,

this is the case for u = w′′. Combining, X2(u) = x2′ ⇒ X1(u) = x1, Y2(u) = f′(Y1(u)). Since f

and f′ are the only probes of Y1(U), there are no more cases to consider for the situation where

Y1(W) is a singleton.

If Y1(W) is not a singleton, since W contains at least three elements, there is a proper subset

of W, W′, on which Y1 takes both values. So by Prop. 1(i) there is a device C over W that infers

the restriction of Y1 to domain W. Define (X2, Y2) to be the same as that C for all u ∈ W, with

all members of X2(W) given values that are not found in X2(U − W). Since X1(w) = x1 for all

w ∈ W, this means that ∀ f ∈ π(Y1), ∃ x2 such that X2(u) = x2 ⇒ X1(u) = x1, Y2(u) = f(Y1(u)).

Combining, since Y1(X1⁻¹(x1)) either is or is not a singleton for each x1 ∈ X1(U), we can build

a “partial” device C2 that strongly infers C1 for each region X1⁻¹(x1). Furthermore, those regions

form a partition of U. So by appropriately “stitching together” the partial C2’s built for each

x1 ∈ X1(U), we build an aggregate device C2 that strongly infers C1 over all U, as claimed.

QED.

Proof of Thm. 3: Let C1 and C2 be two devices and hypothesize that they can strongly infer each

other. Since C1 can strongly infer C2, it can force X2 to have any desired value and simultaneously

correctly infer the value of Y2 under the identity probe. In other words, there is a function

ξ¹I : X2(U) → X1(U) such that for all x2, X1 = ξ¹I(x2) ⇒ X2 = x2 and Y1 = Y2. Let x̂1 be any element

of ξ¹I(X2(U)).

Similarly, by hypothesis C2 can force X1 to have any desired value and simultaneously correctly

infer the value of Y1 under the negation probe. In other words, there is a function

ξ²−I : X1(U) → X2(U) such that for all x1, X2 = ξ²−I(x1) ⇒ X1 = x1 and Y1 = −Y2.

Define x̂2 ≡ ξ²−I(x̂1). Then X1(u) = ξ¹I(x̂2) ⇒ X2(u) = x̂2 = ξ²−I(x̂1) and Y1(u) = Y2(u).

The first of those two conclusions in turn means that Y1(u) = −Y2(u). Combining, we see that

X1(u) = ξ¹I(x̂2) ⇒ Y2(u) = Y1(u) = −Y2(u), which is impossible. QED.

Proof of Thm. 4: Since C2 > Γ, ∀ f ∈ π(Γ), ∃ x2 such that X2 = x2 ⇒ Y2 = f(Γ). Therefore

the set argmin_{x2 : X2=x2 ⇒ Y2=f(Γ)}[L(x2)] is non-empty. Accordingly, ∀ f ∈ π(Γ), we can define an

associated value x2^f ∈ X2(U) as some particular element of argmin_{x2 : X2=x2 ⇒ Y2=f(Γ)}[L(x2)].

Now since C1 ≫ C2, ∀ x2, ∃ x1 such that X1 = x1 ⇒ X2 = x2, Y1 = Y2. In particular, ∀ f ∈ π(Γ),

∃ x1 : X1 = x1 ⇒ X2 = x2^f, Y1 = Y2. So by definition of x2^f, ∀ f ∈ π(Γ), ∃ x1 : X1 = x1 ⇒

X2 = x2^f, Y1 = f(Γ).

Combining, ∀ f ∈ π(Γ),

min_{x1 : X1=x1 ⇒ Y1=f(Γ)}[L(x1)] ≤ min_{x1 : X1=x1 ⇒ X2=x2^f, Y1=Y2}[L(x1)].

Accordingly,

C(Γ | C1) − C(Γ | C2) ≤ Σ_{f ∈ π(Γ)} min_{x1 : X1=x1 ⇒ X2=x2^f, Y1=Y2}[L(x1) − L(x2^f)]

 ≤ Σ_{f ∈ π(Γ)} max_{x2} min_{x1 : X1=x1 ⇒ X2=x2, Y1=Y2}[L(x1) − L(x2)]

 = |π(Γ)| max_{x2} min_{x1 : X1=x1 ⇒ X2=x2, Y1=Y2}[L(x1) − L(x2)].

Using the equality |π(Γ)| = |Γ(U)| completes the proof. QED.
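The bound can be checked numerically on a toy pair of devices. The construction below (eight-element U, unique setup values for C1, a coarser C2, a three-valued Γ, and arbitrary length functions L1, L2) is entirely my own; C1 strongly infers C2 by design.

```python
U = range(8)
X1 = {u: u for u in U}                       # fine setup: each u individually forceable
X2 = {u: u // 2 for u in U}                  # coarser setup with four cells
Y1 = {u: +1 if u % 2 == 0 else -1 for u in U}
Y2 = {u: -1 if u < 4 else +1 for u in U}     # constant on each X2 cell
Gamma = {0: 'c', 1: 'c', 2: 'b', 3: 'b', 4: 'a', 5: 'a', 6: 'a', 7: 'a'}
L1 = {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3, 7: 3}   # length functions
L2 = {0: 1, 1: 1, 2: 2, 3: 2}

# sanity check: C1 >> C2 (since X1 pins down u, forcing X1 forces X2,
# and each X2 cell contains u's realizing both probes of Y2)
assert all(any(X2[u] == x2 and Y1[u] == s * Y2[u] for u in U)
           for x2 in set(X2.values()) for s in (+1, -1))

def complexity(X, Y, L):
    # C(Gamma | (X, Y)): for each probe f_g, cheapest x forcing Y = f_g(Gamma)
    total = 0
    for g in set(Gamma.values()):
        costs = [L[x] for x in set(X.values())
                 if all((Y[u] == 1) == (Gamma[u] == g) for u in U if X[u] == x)]
        total += min(costs)
    return total

c1, c2 = complexity(X1, Y1, L1), complexity(X2, Y2, L2)
slack = max(min(L1[u] - L2[x2] for u in U if X2[u] == x2 and Y1[u] == Y2[u])
            for x2 in set(X2.values()))
assert c1 <= c2 + len(set(Gamma.values())) * slack   # the Thm. 4 inequality
print(c1, "<=", c2, "+", len(set(Gamma.values())), "*", slack)
```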

Proof of Thm. 5: By hypothesis, for any x2′ ∈ X2(U), ∃ x1 such that X1 = x1 ⇒ X2 = x2′. Write

the function mapping any such x2′ to the associated x1 as ξ1. Similarly, there is a function ξ2

that maps any x1 ∈ X1(U) to an x2 ∈ X2(U) such that X2 = ξ2(x1) ⇒ X1 = x1. Using the axiom

of choice, this provides us with a single-valued mapping from X1(U) into X2(U) and vice-versa.

Since having X2(u) = ξ2(x1) forces X1(u) = x1, the set of u ∈ U such that X2(u) = ξ2(x1) must

be a subset of those u ∈ U such that X1(u) = x1, i.e., ∀ x1, X2⁻¹[ξ2(x1)] ⊆ X1⁻¹(x1). Similarly,

∀ x2, X1⁻¹[ξ1(x2)] ⊆ X2⁻¹(x2). This second inclusion means in particular that X1⁻¹[ξ1(ξ2(x1))] ⊆

X2⁻¹(ξ2(x1)). Combining, X1⁻¹[ξ1(ξ2(x1))] ⊆ X1⁻¹(x1).

However ∀ x1, X1⁻¹[ξ1(ξ2(x1))] is non-empty. Since X1 is single-valued, this means that ∀ x1,

ξ1(ξ2(x1)) = x1. Combining, we see that ∀ x1, X1⁻¹(x1) ⊆ X2⁻¹[ξ2(x1)], and therefore X2⁻¹[ξ2(x1)] =

X1⁻¹(x1). This in turn means that the set X2[X1⁻¹(x1)] equals the singleton {ξ2(x1)} for any x1 ∈

X1(U). Accordingly ∀ u ∈ X1⁻¹(x1), X2(u) = ξ2(x1) = ξ2(X1(u)). In addition, every u ∈ U obeys

u ∈ X1⁻¹(x1) for some x1. Therefore we conclude that for all u ∈ U, ξ2(X1(u)) = X2(u).

This establishes that the partition induced by X1 is a fine-graining of the partition induced

by X2. Similar reasoning establishes that the partition induced by X2 is a fine-graining of the

partition induced by X1. This means that the two partitions must be identical. QED.

Proof of Coroll. 3: By Thm. 5, we can relabel the image values of the two devices’ setup functions

to express them as C1 = (X, Y1) and C2 = (X, Y2).

To prove (i), note that C1 > C2 means ∃ x ∈ X(U) such that X = x ⇒ Y1 = Y2 and ∃ x′ ∈ X(U)

such that X = x′ ⇒ Y1 = −Y2. But those two properties in turn mean that C2 > C1. A similar

argument establishes that C2 > C1 ⇒ C1 > C2.

To prove (ii), note that C1 ≫ C2 means that ∀x ∈ X(U), f ∈ π(Y2), ∃ x′ such that X = x′ ⇒

X = x, Y1 = f(Y2). In particular, ∀x ∈ X(U), ∃ x′ such that X = x′ ⇒ X = x, Y1 = Y2, and ∃ x′′

such that X = x′′ ⇒ X = x, Y1 = −Y2. But X = x′ ⇒ X = x forces x′ = x, and likewise x′′ = x,

so x′ = x′′. But that means it is impossible to have both Y1 = Y2 and Y1 = −Y2.

To prove (iii), hypothesize that C1 controls X. This means in particular that ∀x ∈ X(U), ∃ x′ ∈

X(U) such that X = x′ ⇒ Y1 = δX,x = 1 (choose b = 1 and have f be the probe that equals

1 iff its argument equals x). To have δX,x = 1 means X = x, which in turn means x′ = x. So

X = x ⇒ Y1 = 1. This is true for all x ∈ X(U), so Y1(u) = 1 ∀u ∈ U. However by definition,

the range of Y1 must be B. Therefore the hypothesis is wrong. The same argument shows that C2

cannot control X. QED.

Proof of Thm. 6: To prove (i), let f be any probe of Γ. Intelligibility means f ∈ Q1(U). Since

D1 semi-controls its question function, ∃x1 : X1 = x1 ⇒ Q1 = f. Infallibility then implies that

for any u such that X1(u) = x1, Y1(u) = [Q1(u)](u) = f(u). This proves (i).

Next, let f be any probe of Y2, and x2 any element of X2(U). Intelligibility means f ∈ Q1(U).

Since D1 semi-controls (Q1, X2) and (Q1, X2) is surjective, ∃x1 such that X1 = x1 ⇒ Q1 = f, X2 =

x2. Infallibility then implies that for any u such that X1(u) = x1, Y1(u) = [Q1(u)](u) = f(u). This

proves (ii). QED.

Proof of Thm. 7: The cardinality of π(P) is the cardinality of P(U), |P(U)|. Let f1 and f2 be two

separate such probes, so that f1 : P(U) → B differs from f2 : P(U) → B. Then as functions

over U, f1(P) and f2(P) differ. Therefore by hypothesis they correspond to two distinct q’s in


Q′(U). So |Q′(U)| ≥ |P(U)|. In turn, |Q(U)| = |R(P(U))| ≤ |P(U)|. So |Q′(U)| ≥ |Q(U)|. Similar

reasoning establishes that |Q(U)| ≥ |Q′(U)|. So |Q(U)| = |Q′(U)|. Therefore |Q(U)| = |P(U)| and

|Q′(U)| = |P′(U)|. This proves (i).

Now since P′ is intelligible to D, every f ∈ π(P′) is an element of Q(U). Therefore for |Q(U)|

finite, (i)’s conclusion that |Q(U)| = |P′(U)| means that there is no q ∈ Q(U) that is not an element

of π(P′). In other words, Q = π(P′). Next, (i)’s conclusion that |P′(U)| = |R′(P′(U))| establishes

that the partition induced by P′ is identical to the partition induced by R′(P′). So π(P′) = π(Q′).

Similar reasoning establishes that Q′ = π(P) = π(Q). This establishes (ii). QED.

Proof of Coroll. 4: Choose P = (Y, Q) and R : (Y, Q)(u) → Q(u). (So R is a projection map.)

Since (Y, Q) is surjective, |P(U)| = |(Y, Q)(U)| = 2|Q(U)|. By Thm. 7, this is impossible if the two

self-aware devices are intelligible to one another. QED.

Proof of Prop. 3: The validity of the claim in (i) is independent of the question function of the

devices, so they can be set arbitrarily. Choose X1(U) = X2(U) = X3(U) = {0, 1}. Then choose the

reduced form of the setup and conclusion functions of the devices in the reality to be the following

four tuples: ([0, 0], [0, 0], [0, 0]); ([0, 0], [1, 0], [1, 1]); ([1, 1], [0, 0], [1, 0]); ([1, 1], [1, 0], [0, 1]).

It is straightforward to verify that each pair of devices is distinguishable and that C1 > C2 >

C3 > C1.

To prove (ii), note that under the hypothesis, C1 > C2 ⇒ ∃ x1 : X1 = x1 ⇒ Y1 = Y2, C2 >

C3 ⇒ ∃ x2 : X2 = x2 ⇒ Y2 = Y3, . . . , Cn−1 > Cn ⇒ ∃ xn−1 : Xn−1 = xn−1 ⇒ Yn−1 = Yn, and

Cn > C1 ⇒ ∃ xn : Xn = xn ⇒ Yn = −Y1. Mutual distinguishability means that there is a tuple

in the reduced form of the reality having that set of xi values. However that would mean that the

tuple has y1 = −y1. So our hypothesis is wrong.

To prove (iii), simply combine Thm. 3 and Thm. 2. QED.
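The possibility claimed in part (i) can also be confirmed computationally. The reality below is a variant construction of my own (eight tuples rather than the paper's four), built so that the parity of the joint setup values fixes which conclusions agree; it exhibits three pairwise-distinguishable devices with C1 > C2 > C3 > C1.

```python
from itertools import product

# A "reality" in reduced form: the set of joint values
# (X1, Y1, X2, Y2, X3, Y3) that occur for some u in U.
# x's in {0, 1}; y's in {-1, +1}.
reality = []
for x1, x2, x3 in product([0, 1], repeat=3):
    if (x1 + x2 + x3) % 2 != 0:
        continue                      # odd-parity joint setups never occur
    for y1 in (-1, +1):
        y2 = y1 * (1 - 2 * x1)        # x1 = 0 forces y1 = y2; x1 = 1 forces y1 = -y2
        y3 = y2 * (1 - 2 * x2)        # likewise for the (2, 3) pair
        reality.append((x1, y1, x2, y2, x3, y3))

def weakly_infers(i, j):
    """C_i > C_j: for each probe of Y_j(U) = {-1,+1} (identity and negation),
    some setup value x_i forces Y_i = f(Y_j) across all tuples with X_i = x_i."""
    xs = {t[2 * i] for t in reality}
    for f in (lambda y: y, lambda y: -y):          # the two probes of B
        if not any(all(t[2 * i + 1] == f(t[2 * j + 1])
                       for t in reality if t[2 * i] == x)
                   for x in xs):
            return False
    return True

def distinguishable(i, j):
    """Every joint setup value (x_i, x_j) occurs in some tuple."""
    return all(any(t[2 * i] == xi and t[2 * j] == xj for t in reality)
               for xi, xj in product([0, 1], repeat=2))

assert weakly_infers(0, 1) and weakly_infers(1, 2) and weakly_infers(2, 0)
assert all(distinguishable(i, j) for i in range(3) for j in range(3) if i != j)
print("C1 > C2 > C3 > C1 with pairwise distinguishable devices")
```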

Proof of Prop. 4: Since D is acyclic and finite, it contains at least one root node. Label one such

node as C1. Hypothesize that there is some other root node in the graph.

Given any D′ ⊆ D, define S(D′) as the union of D′ with the set of all nodes in D that are

successors of a node in D′. Similarly, define P(D′) as the union of D′ with the set of all nodes

in D that are predecessors of a node in D′. S({C1}) ⊂ D since by hypothesis there is more than

one root node. Since D is weakly connected, this means that S({C1}) ⊂ P[S({C1})]. Since D is

acyclic and finite, this means that there is a node Cj ∈ S({C1}) that has a root node predecessor

Ck where Ck ≠ C1. […]

[…] (e.g., X1 = 1 ⇒ Y1 = −Y2). Similarly, by inspection C1 and C2 are distinguishable,

and copies of each other. This completes the proof of (i).


To prove the first part of (ii), first note that C1 ≫ C2 requires that for all x2, there is (an x1

that forces X2 = x2 and Y1 = Y2), and (an x1 that forces X2 = x2 and Y1 = −Y2). In other words,

there is a single-valued map ξ : X2(U) → X1(U) such that the quadruple (X1 = ξ(x2), Y1 =

y1, X2 = x2, Y2 = y1) occurs for some y1 in some tuple in the reduced form of the reality while

(X1 = ξ(x2), Y1 = y1, X2 = x2′, Y2 = y2) does not occur for any y2 if x2′ ≠ x2, and also does not

occur for y2 = −y1 if x2′ = x2. Similarly, there is a single-valued map ξ′ : X2(U) → X1(U) such

that the quadruple (X1 = ξ′(x2), Y1 = y1, X2 = x2, Y2 = −y1) occurs for some y1 in some tuple in

the reduced form of the reality while (X1 = ξ′(x2), Y1 = y1, X2 = x2′, Y2 = y2) does not occur for

any y2 if x2′ ≠ x2, and also does not occur for y2 = y1 if x2′ = x2. By construction, both ξ and ξ′

are invertible. Furthermore, for all x2, ξ(x2) ≠ ξ′(x2). So |X1(U)| ≥ 2|X2(U)|. On the other hand,

|X1(U)| = |X2(U)| because C1 and C2 are copies of each other. Therefore they must have infinite

setup functions.

The existence proof for (ii) is by example. Define a set of quadruples

T ≡ {(1, −1, 1, −1); (2, 1, 1, −1); (3, −1, 2, 1); (4, 1, 2, 1); (5, −1, 3, −1); (6, 1, 3, −1); . . .}

 = {(i, 1 − 2(i mod 2), ⌈i/2⌉, 1 − 2(⌈i/2⌉ mod 2)) : i ∈ N}.

Next, fix any set of spaces σ, where the spaces {y1} = {y2} ≡ B and {x1} = {x2} ≡ N all

occur in σ. Let S be a subset of the Cartesian product of the spaces in σ. Say that for every

t ∈ T, (x1, y1, x2, y2) = t for exactly one element of S, and no element of S contains a quadruple

(x1, y1, x2, y2) ∉ T. […]

C > Γ does not imply that there is exactly one probe of Γ for which the

associated conclusion value is 1. (This is true even though π(Γ(U)) is a full unary representation

of Γ(U).) Formally, C > Γ does not imply that there is exactly one probe f of Γ such that

∃ x : X = x ⇒ Y = f(Γ) = 1. There may be more than one such f, or even none. So as embodied

in weak inference, for C to predict (something concerning the future state of the universe as

encapsulated in the function) Γ does not mean that for each γ ∈ Γ(U) there is some associated

question x that if embodied in X guarantees that Y correctly says, “yes, in this universe u, γ is

the value that will occur; Γ(u) = γ”. Weak inference only requires that for each γ and associated

probe, X can be set up so that the device’s answer Y(u) must be correct, not that it can be set up

to be correct and answer in the affirmative.

Similarly, C > Γ does not imply that C can infer a “coarse-grained” version of Γ. It implies

that C can correctly answer, “does Γ(u) equal γ1?” for some γ1 ∈ Γ(U), and that it can correctly

answer “does Γ(u) equal γ2?” for some γ2 ∈ Γ(U). However it does not imply that C can correctly

answer, “does Γ(u) equal either γ1 or γ2 or both?”. In particular, for two functions over U, Γ and

Γ′, C > (Γ, Γ′) does not imply C > Γ.

As another example of how weak Def. 3 is, recall that Y is to be interpreted as including all that

the device “knows”. On the other hand, it is X that includes a specification of what inference task

the device is being asked to perform. So in the definition of inference, we don’t even require that

the device knows what inference task it is being asked to perform. We just ask if it can be given

such a task and then come to the right conclusion, even if it doesn’t know what its conclusion

“means”.

There is no reason one could not introduce additional formal structure in the definition of

inference to embody some (or all) of these attributes. For example, say we want to analyze the

property of a device C both inferring some �� while also being capable of correctly answering

“does ��(u) equal either γ1 or γ2 or both?”. We could do this by strengthening the definition of

weak inference to also require that for any union of probes of ��, , there is an x ∈ X(U) such

that X(u) = x implies that Y(u) = 1 ⇔ f (��(u)) = 1 for some f ∈ . (In general the x ∈ X(U)

that force the device to infer such unions of multiple probes are different from the x ∈ X(U) that

force the device to infer single probes.) As another example, say we want to have C infer some

�� while also knowing how it is set up (so in particular it knows what probe of �� it is inferring).

We can accomplish this by requiring C > (��, X).

Whatever difficulties such additional structures might impose, they are in addition to the impossibility

results we derive below; the results below apply no matter what such additional structures

are imposed.

In addition, in Def. 3 there are no restrictions on how, physically, the function Γ gets mapped

to the setup value x. So there are no stipulations, implicit or otherwise, about how x is interpreted.

A mechanism for forcing X(u) to have the desired value for its inference will typically

exist in any real device. In fact, in general to infer different functions will require different such

mechanisms. So in the real world there is typically a way to replace one such mechanism with

another, depending on the function Γ being inferred.

By leaving the mechanism out of the formal definition of inference, all such complications are

avoided. In Def. 3, we simply say there exists some appropriate x ∈ X(U) for any f(Γ), with

nothing mentioned about how to force the inference device (and therefore u) to have what the

device is supposed to compute, f(Γ), reflected in the value x.

Indeed, given any device C, we can define a new device C′ ≡ (X′, Y′) where X′(u) itself


specifies the f(Γ) that we wish to answer using the original device (X, Y). So for example, say

(X, Y) is a computer running a physical simulation program whose initialized state is given by

X(u). Then C′ is that computer modified by having a “front end” program that runs first to figure

out how to initialize the simulation to have the bit it produces as a conclusion answer the question

of interest. In this case, trivially, there is no issue in mapping from Γ to x; that mapping is part of

the setup function of our new device, X′(.).

In particular, say that there is an “external” scientist who types into the computer C a specification

of the system whose evolution is to be simulated in the computer (i.e., forces X(u) to have

a value that is interpreted as that specification). Then one can define C′ so that the scientist is

embodied in X′(.). In this definition, we view the human scientist as “part of” the device (s)he is

using.

In summary, and speaking very colloquially, one can view weak inference as a necessary

condition for saying that a device “knows” the actual value of a function of the state of the

universe. Whatever else such knowledge entails, it means that the device can, by whatever means,

correctly answer (with a yes or a no), “Does the value of the function of the state of the universe

equal z?” for any value z in the codomain of the function.

Like with weak inference, there is no requirement that a device knows how it has been set up

for it to strongly infer another device. Similarly, there is no requirement that it be able to strongly

infer the unions of probes, no requirements concerning its position in the Chomsky hierarchy,

etc. Despite being so pared-down, the definition of strong inference is still sufficient to exhibit

some non-trivial behavior.

APPENDIX C: Alternative definitions of weak inference

There are alternatives to Def. 3 that accommodate the case where |Γ(U)| > 2 without employing

multiple probes. One such alternative uses multiple devices in concert, each sharing the

same setup function, and each device’s conclusion giving a different bit concerning Γ’s value. As

an example, say that Γ’s range is R. Then we could assign each device to a separate real number,

and require that for all u one and only one device’s conclusion equals 1, namely the device

corresponding to the value of Γ(u).

To formalize this, say we have a set of devices {Cz : z ∈ R} and some function Γ : U → R. In

addition suppose there is some vector x with components xz running over all z ∈ R such that

i) ∩z∈R Xz⁻¹(xz) ≡ ÛΓ ≠ ∅.

ii) u ∈ ÛΓ ⇒ ∀z ∈ R, Yz(u) = 1 iff Γ(u) = z.

iii) ∀γ ∈ Γ(U), ∃u ∈ ÛΓ such that Yγ(u) = 1.

Then we can jointly set up the set of devices so that their joint conclusion gives Γ(u), and we can

do so without precluding any element of Γ(U). In this sense, the set of devices “jointly infers” Γ.
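Conditions i)–iii) can be illustrated on a finite toy model in which Γ(U) has three elements and each device Cz answers "is Γ(u) = z?". The construction below (six-element U, constant setup functions) is my own illustrative choice.

```python
U = range(6)
Gamma = {u: u % 3 for u in U}     # function to be jointly inferred

# One device per value z: a single setup value suffices here, and
# Y_z answers "is Gamma(u) = z?"
X = {z: {u: 0 for u in U} for z in set(Gamma.values())}
Y = {z: {u: +1 if Gamma[u] == z else -1 for u in U}
     for z in set(Gamma.values())}

x_vec = {z: 0 for z in X}         # the joint setup vector

# i) the jointly forced set is non-empty
U_hat = [u for u in U if all(X[z][u] == x_vec[z] for z in X)]
assert U_hat

# ii) on U_hat, exactly the device indexed by Gamma(u) concludes +1
assert all((Y[z][u] == 1) == (Gamma[u] == z) for u in U_hat for z in Y)

# iii) no value of Gamma is precluded on U_hat
assert all(any(Y[g][u] == 1 for u in U_hat) for g in set(Gamma.values()))
print("joint inference conditions i)-iii) hold on U_hat of size", len(U_hat))
```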

Alternatively, we could use a single device, where we modify the definition of “device” to

allow arbitrary cardinality of the range of Y. With this modification, the conclusion function of

the device does not answer the question of what the value of a particular function of Γ(U) is.

Rather it directly encodes the value of Γ(u).

It would appear that under such an alternative we do not need to have the value of X(u) specify


the bit concerning Γ(u) that we want to infer, and do not need to consider multiple probes. So for

example, it would appear that when the device is being used for prediction, under this alternative

X(u) need only specify what is known concerning the current state of the system whose future

state is being predicted, without specifying a particular bit concerning that future state that we

wish our device to predict. The conclusion Y (or set of conclusions, as the case might be) would

specify the prediction in full.

Things are not so simple unfortunately. If we wish to allow the device to infer functions Γ with

different ranges, then under this alternative we have to allow different functions relating Y(u) and

Γ(u). This need is especially acute if we want to allow |Γ(U)| to vary.

Such functions should be surjective, to ensure that our device can conclude every possible

value of Γ(U). (This surjectivity is analogous to the requirement that we consider all probes in

Def. 3.) For any such function φ : Y(U) → Γ(U), we would interpret a particular value Y(u) as

saying “Γ(u) = φ(Y(u))”. (This contrasts with the situation when Y(U) = B, where we interpret

Y(u) = +1/−1 to mean “yes/no”, respectively, in response to the question of whether some

associated probe has the value +1.)

One immediate problem with this alternative definition of inference is that it does not allow a

device (X, Y) to infer any function Γ where |Γ(U)| > |Y(U)|. Such difficulties do not hold for

Def. 3. For example, if |X(U)| = 3, X is a fine-graining of Y with two of its elements contained in

Y⁻¹(−1), and Γ is a fine-graining of X, then (X, Y) > Γ. (For every probe of Γ(U), x is chosen to

be one of the two elements that cause Y(u) = −1. The precise x chosen for a particular probe f

is the one that lies in (f(Γ))⁻¹(−1).)
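This parenthetical construction can be checked directly. The six-element U below is one concrete instantiation, with labels of my own choosing: a device with |Y(U)| = 2 infers a Γ taking six values.

```python
U = range(6)
X = {u: u // 2 for u in U}                       # |X(U)| = 3; cells {0,1}, {2,3}, {4,5}
Y = {u: -1 if X[u] < 2 else +1 for u in U}       # two X-cells lie inside Y^{-1}(-1)
Gamma = {u: u for u in U}                        # fine-graining of X; |Gamma(U)| = 6 > |Y(U)| = 2

def infers(X, Y, Gamma):
    # weak inference per Def. 3: one probe per value of Gamma
    for g in set(Gamma.values()):
        probe = {u: +1 if Gamma[u] == g else -1 for u in U}
        if not any(all(Y[u] == probe[u] for u in U if X[u] == x)
                   for x in set(X.values())):
            return False
    return True

assert infers(X, Y, Gamma)
print("(X, Y) > Gamma even though |Gamma(U)| = 6 > |Y(U)| = 2")
```

For each probe, the satisfying x is whichever of the two Y = −1 cells does not contain the probed value, exactly as the parenthetical remark describes.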

Other difficulties arise when we try to specify this alternative definition in full. For example,

one possible such definition is that C infers Γ iff ∃ x and function φ : Y(U) → Γ(U) such that

X = x ⇒ φ(Y) = Γ. Such a definition is unsatisfying in that by not fixing φ ahead of time, it

leaves unspecified how the conclusion of the device is to be physically interpreted as an encoding

of Γ(u). (This is in addition to the lack of a fixed mapping from Γ to x, a lack which also arises

in Def. 3.)

To get around this problem we could pre-fix a set of φ’s, one for every member of a set of

ranges {Γ(U)}. We could then have u pick out the precise φ to use. This requires introduction

of substantial additional structure into the definition of devices however. (A somewhat related

notion is considered in Sec. 9.) Another possible solution would be along the lines of “∀φ :

Y(U) → Γ(U), ∃x such that X = x ⇒ φ(Y) = Γ”. But this returns us to a definition of inference

involving multiple functions relating Y and Γ.

All of these other difficulties also apply to the definition above of joint inference involving

multiple devices. In particular, say we wish to use the same set of devices to jointly infer functions

having different ranges from one another. Then we have to specify something about how to map

the joint conclusion of the devices into an inference in any of those ranges. For example, if the

set of devices is {Cz : z ∈ R} and Γ(U) is non-numeric, we would need to specify something

about how a joint conclusion {Yz(u)} gets mapped into that non-numeric space.

As a final possibility, we could stick with a single device and have Y(U) = B, but use some

representation of Γ(U) in X other than the unary representation implicit in Def. 3. For example,

we could require that for all binary representations φ of Γ(U), for all bits i in that representation,

there is an x such that X = x ⇒ Y = φi(Γ). This would allow smaller spaces X(U) in general. But

it would still require consideration of multiple functions relating Y and Γ. It would also raise the

issue of how to encode the elements of Γ(U) as bits.

For simplicity, in the text we avoid these issues and restrict attention to the original definitions.


