Active Learning ML To Create New Quantum Experiments
I. INTRODUCTION

Automated procedures are indispensable in modern science. Computers rapidly perform large calculations, help us visualize results, and specialized robots perform preparatory laboratory work to staggering precision. But what is the true limit of the utility of machines for science? To what extent can a machine help us understand experimental results? Could it – the machine – discover new useful experimental tools? The impressive recent results in the field of machine learning, from reliably analysing photographs [1] to beating the world champion in the game of Go [2], support optimism in this regard. Researchers from various research fields now employ machine learning algorithms [3], and the success of machine learning applied to physics [4–7] in particular is already noteworthy. Machine learning has claimed its place as the new data analysis tool in the physicist's toolbox [8]. However, the true limit of machines lies beyond data analysis, in the domain of the broader theme of artificial intelligence (AI), which is much less explored in this context. In an AI picture, we consider devices (generally called intelligent agents) that interact with an environment (the laboratory) and learn from previous experience [9]. To broach the broad questions raised above, in this work we address the specific yet central question of whether an intelligent machine can propose novel and elucidating quantum experiments. We answer in the affirmative in the context of photonic quantum experiments, although our techniques are more generally applicable. We design a learning agent, which interacts with (the simulations of) optical tables, and learns how to generate novel and interesting experiments. More concretely, we phrase the task of the development of new optical experiments in a reinforcement learning (RL) framework [10], vital in modern AI [2, 11].

The usefulness of automated designs of quantum experiments has been shown in Ref. [12]. There, the algorithm MELVIN starts from a toolbox of experimentally available optical elements to randomly create simulations of setups. Those setups are reported if they result in high-dimensional multipartite entangled states. The algorithm uses handcrafted rules to simplify setups and is capable of learning by extending its toolbox with previously successful experimental setups. Several of these experimental proposals have been implemented successfully in the lab [13, 14].

Inspired by the success of the MELVIN algorithm in the context of finding specific optical setups, here we investigate the broader potential of learning machines and artificial intelligence in designing quantum experiments [16]. Specifically, we are interested in their potential to contribute to novel research, beyond rapidly identifying solutions to fully specified problems. To investigate this question, we employ a more general model of a learning agent, formulated within the projective simulation (PS) framework for artificial intelligence [16], which we apply to the concrete test-bed of Ref. [12]. In the process of generating specified optical experiments, the learning agent builds up a memory network of correlations between different optical components – a feature it later exploits when asked to generate targeted experiments efficiently. In the process of learning, it also develops notions (technically speaking, composite clips [16]) for components that “work well” in combination. In effect, the learn-

∗ These authors contributed equally to this work.
[Figure 1 schematic: a, the agent-environment interaction, with percepts (optical setups), actions (placings of an optical element), and an analyzer issuing rewards; b, the memory network with percept clips s1, s2, …, sN and action clips a1, …, a30 (e.g. BS_ab, BS_bc, DP_b, Refl_b, Holo_a,2, Holo_d,2); toolbox of 6×BS, 4×DP, 4×Refl, 16×Holo optical elements.]
Figure 1. The learning agent. a, An agent is always situated in an environment [9]. Through sensors it perceives optical setups and with actuators it can place optical elements in an experiment. On the side, an analyzer evaluates a proposed experiment corresponding to the current optical setup and gives rewards according to a specified task.^a b, The memory network that represents the internal structure of the PS agent. Dashed arrows indicate possible transitions from percept clips (blue circles) to action clips (red circles). Solid, colored arrows depict a scenario where a sequence of actions leads to the experiment {BS_bc, DP_b, Refl_b, BS_bc, Refl_b, Holo_a,2}. Arrows between the percepts correspond to deterministic transitions from one experiment to another.

^a Part of this figure is adapted from Ref. [15]. Image of the optical table is a part of the (3, 3, 2)-experiment (see main text) in Ref. [13], by Manuel Erhard (University of Vienna).
ing agent autonomously discovers sub-setups (or gadgets) which are useful outside of the given task of searching for particular high-dimensional multiphoton entanglement.¹

We concentrate on the investigation of multipartite entanglement in high dimensions [17–19] and give two examples where learning agents can help. The understanding of multipartite entanglement remains one of the outstanding challenges of quantum information science, not only because of the fundamental conceptual role of entanglement in quantum mechanics, but also because of the many applications of entangled states in quantum communication and computation [20–25]. As the first example, the agent is tasked to find the simplest setup that produces a quantum state with a desired set of properties. Such efficient setups are important as they can be robustly implemented in the lab. The second task is to generate as many experiments, which create states with such particular properties, as possible. Having many different setups is important for understanding the structure of the state space reachable in experiments and for exploration of different possibilities that are available in experiments. In both tasks the desired property is chosen to be a certain type of entanglement of the generated states. More precisely we target high-dimensional (d > 2) many-particle (n > 2) entangled states. The orbital angular momentum (OAM) of photons [26–29] can be used for the investigations into both high-dimensional [30–34] and multiphoton entanglement [35, 36] and, since recently, also both simultaneously [13]. OAM setups have been used as test-beds for other new techniques for quantum optics experiments [12, 37], and they are the system of choice in this work as well.

II. RESULTS

Our learning setup can be put in terms of a scheme visualized in Fig. 1(a). The agent (our learning algorithm) interacts with a virtual environment – the simulation of an optical table. The agent has access to a set of optical elements (its toolbox), with which it generates experiments: it sequentially places the chosen elements on the (simulated) table; following the placement of an element, the quantum state generated by the corresponding setup (i.e. configuration of optical elements) is analyzed. Depending on the state, and the chosen task, the agent receives a reward (or not). The agent then observes the current table configuration and places another element; then the procedure is iterated. We are interested in finite experiments, so the maximal number of optical elements in one experiment is limited. This goes along with practical restrictions: due to accumulation of imperfections (e.g. through misalignment and interferometric instability), for longer experiments we expect a decreasing overall fidelity of the resulting state or signal. The experiment ends (successfully) when the reward is given or (unsuccessfully) when the maximal number of elements on the table is reached without obtaining the reward. Over time, the agent learns which element to place next, given a table.

Initially, we fix the analyzer in Fig. 1(a) to reward the successful generation of a state out of a certain class of

¹ The discovery of such devices is in fact a byproduct stemming from the more involved structure of the learning model, even when applied to an otherwise simple-to-specify activity (search for specific optical table configurations).
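The interaction cycle just described (place an element, analyze the generated state, receive a reward or not, repeat until the table is full) can be sketched in a few lines of Python. This is a minimal sketch: the element names, the reward condition, and the length limit are illustrative stand-ins, not the actual toolbox or analyzer used in the paper.

```python
import random

# Hypothetical toolbox: stand-ins for the optical elements in the text.
TOOLBOX = ["BS_ab", "BS_bc", "DP_b", "Refl_b", "Holo_a2", "Holo_d2"]
MAX_ELEMENTS = 8  # maximal number of optical elements per experiment (L)

def run_experiment(choose_element, is_rewarded):
    """One agent-environment episode: place elements until the analyzer
    gives a reward or the (simulated) table is full."""
    setup = ()  # the current table configuration, i.e. the percept
    while len(setup) < MAX_ELEMENTS:
        element = choose_element(setup)   # action: place one element
        setup = setup + (element,)        # new percept: updated table
        if is_rewarded(setup):            # the analyzer checks the state
            return setup, True            # successful experiment
    return setup, False                   # table full without reward

# Toy usage: a random agent, rewarded when a (made-up) pair of elements
# appears at the end of the setup.
random.seed(0)
setup, ok = run_experiment(
    choose_element=lambda s: random.choice(TOOLBOX),
    is_rewarded=lambda s: s[-2:] == ("BS_bc", "Refl_b"),
)
```

Over many such episodes, the `choose_element` policy is what learning improves; the PS rules that do so are specified in the Methods section.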
Figure 2. Results of learning new experiments. a, Average length of experiment and success probability in each of the 6 × 10^4 experiments. The maximal length of an experiment is L = 8. During the first 5 × 10^4 experiments an agent is rewarded for obtaining a (3, 3, 2) state; during the last 10^4 experiments the same agent is rewarded when finding a (3, 3, 3) state. The average success probability shows how probable it is for the PS agent to find a rewarded state in a given experiment. Solid/dashed lines show simulations of the PS agent that learns how to generate a (3, 3, 3) state from the beginning with/without prior learning of setups that produce a (3, 3, 2) state. b, Average number of interesting experiments obtained after placing 1.2 × 10^5 optical elements. Data points are shown for L = 6, 8, 10 and 12. Dashed/solid blue and red lines correspond to PS with/without action composition and automated search with/without action composition, respectively (see main text). Vertical bars indicate the mean squared deviation. a, b, All data points are obtained by averaging over 100 agents. Parameters of the PS agents are specified in the Methods section.
base of such experiments would allow us to deepen our understanding of the structure of the set of entangled states that can be accessed by optical tables. In particular, such a database could then be further analyzed to identify useful sub-setups or gadgets [12] – certain few-element combinations that appear frequently in larger setups – which are useful for generating complex states. With our second task, we have challenged the agent to generate such a database by finding high-dimensional entangled states. As a bonus, we found that the agent does the post-processing above for us implicitly, and in runtime. The outcome of this post-processing is encoded in the structure of the memory network of the PS agent. Specifically, the sub-setups which were particularly useful in solving this task are clearly visible, or embodied, in the agent's memory.

To find as many different high-dimensional three-photon entangled states as possible, we reward the agent for every new implementation of an interesting experiment. To avoid trivial extensions of such implementations, a reward is given only if the obtained SRV was not reached before within the same experiment. Fig. 2(b) displays the total number of new, interesting experiments designed by the basic PS agent (solid blue) and the PS agent with action composition [16] (dashed blue). Action composition allows the agent to construct new composite actions from useful optical setups (i.e. placing multiple elements in a fixed configuration), thereby autonomously enhancing the toolbox (see Methods for details). It is a central ingredient for an AI to exhibit even a primitive notion of creativity [46] and was also used in Ref. [12] to augment automated random search. For comparison, we provide the total number of interesting experiments obtained by the automated search algorithm with and without action composition (solid and dashed red curves). As we will see later, action composition will allow for additional insight into the agent's behavior and helps provide useful information about quantum optical setups in general. We found that the PS model discovers significantly more interesting experiments than both automated random search and automated random search with action composition, see Fig. 2(b).

C. Ingredients for successful learning

In general, successful learning relies on a structure hidden in the task environment (or dataset). The results presented thus far show that PS is highly successful in the task of designing new interesting experiments, and here we elucidate why this should be the case. The following analysis also sheds light on other settings where we can be confident that RL techniques can be applied as well.

First, the space of optical setups can be illustrated using a graph as given in Fig. 3(c), where the building of an optical experiment corresponds to a walk on the directed graph. Note that optical setups that create a certain state are not unique: two or more different setups can generate the same quantum state. Due to this fact, this graph does not have a tree structure but rather resembles a maze. Navigating in a maze, in turn, constitutes one of the classic textbook RL problems [10, 39, 47, 48]. Second, our empirical analysis suggests that experiments generating high-dimensional multipartite entanglement tend to have some structural similarities [12] (see Fig. 3(a)-(b), which partially displays the exploration space). The figure shows regions where the density of interesting experiments (large colored nodes) is high and others where it is low – interesting experiments seem to be clustered (cf. Fig. 6). In turn, RL is particularly useful when one needs to handle situations which are similar to those
Figure 3. Exploration space of optical setups. Different setups are represented by vertices with colors specifying an associated SRV (bipartite entangled states are depicted in blue (a,b) or black (c)). Arrows represent the placing of optical elements. a, A randomly generated space of optical setups. Here we allow up to 6 elements on the optical table and a standard toolbox of 30 elements. Large, colored vertices represent interesting experiments. If two nodes share a color, they can generate a state with the same SRV. Running for 1.6 × 10^4 experiments, the graph that is shown here has 45605 nodes, out of which 67 represent interesting setups. b, A part of the graph (a), which demonstrates a non-trivial structure of the network of optical setups. c, A detailed view of a part of the big network. The depicted colored maze represents an analogy between the task of finding the shortest implementation of an experiment and the task of navigating in a maze [10, 39, 47, 48]. Arrows of different colors represent distinct optical elements that are placed in the experiment. The initial state is represented by an empty table ∅. The shortest path to a setup that produces a state with SRV (3, 3, 2) and (3, 3, 3) is highlighted. Labels along this path coincide with the labels of the percept clips in Fig. 1(b).
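The graph picture of Fig. 3 is straightforward to reproduce in code. The sketch below (with an assumed toy toolbox and a made-up `state_of` function standing in for the analyzer) builds the directed graph in which every experiment is a walk from the empty table; because different walks can reach setups generating the same state, the space resembles a maze rather than a tree.

```python
from collections import defaultdict

TOOLBOX = ["BS_ab", "BS_bc", "DP_b", "Refl_b"]  # illustrative element names
MAX_LEN = 2  # maximal number of elements, kept tiny for illustration

def state_of(setup):
    # Made-up analyzer: two setups yield the "same state" here when they
    # contain the same elements regardless of order.
    return frozenset(setup)

# Vertices are setups (tuples of placed elements); an arrow
# setup -> setup + (element,) represents placing one optical element.
edges = defaultdict(list)
frontier = [()]
while frontier:
    setup = frontier.pop()
    if len(setup) == MAX_LEN:
        continue
    for element in TOOLBOX:
        nxt = setup + (element,)
        edges[setup].append(nxt)
        frontier.append(nxt)

# Two different walks from the empty table reach setups with the same state,
# which is why the exploration space is maze-like rather than tree-like.
assert state_of(("BS_ab", "DP_b")) == state_of(("DP_b", "BS_ab"))
```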
previously encountered – once one maze (optical experiment) is learned, similar mazes (experiments) are tackled more easily, as we have seen before. In other words, whenever the experimental task has a maze-type underlying structure, which is often the case, PS can likely help – and critically, without having any a priori information about the structure itself [39, 49]. In fact, PS gathers information about the underlying structure throughout the learning process. This information can then be extracted by an external user or potentially be utilized further by the agent itself.

D. The potential of learning from experiments

Thus far we have established that a machine can indeed design new quantum experiments, in the setting where the task is precisely specified (via the rewarding rule). Intuitively, this could be considered the limit of what a machine can do for us, as machines are specified by our programs. However, this falls short of what, for instance, a human researcher can achieve. How could we, even in principle, design a machine to do something (interesting) we have not specified it to do? To develop an intuition for the type of behavior we could hope for, consider, for the moment, what we may expect a human, say a good PhD student, would do in situations similar to those studied thus far.

To begin with, a human cannot go through all conceivable optical setups to find those that are interesting. Arguably, she would try to identify prominent sub-setups and techniques that are helpful in the solving of the problem. Furthermore, she would learn that such techniques are probably useful beyond the specified tasks and may provide new insights in other contexts. Could a machine, even in principle, have such insight? Arguably, traces of this can be found already in our, comparatively simple, learning agent. By analyzing the memory network of the agent (ranking clips according to the sum of the weights of their incident edges), specifically the composite actions it learned to generate, we can extract sub-setups that have been particularly useful in the endeavor of finding (creating a database of) many different interesting experiments.

For example, PS composes and then extensively uses a combination corresponding to an optical interferometer, as displayed in Fig. 4(a), which is usually used to sort OAM modes with different parities [50] – in essence, the agent (re)discovered a parity sorter. This interferometer has already been identified as an essential part of many quantum experiments that create high-dimensional multipartite entanglement [12], especially those involving more than two photons (cf. Fig. 2(a) and Ref. [13]).

One of the PS agent's assignments was to discover as many different interesting experiments as possible. In the process of learning, even if a parity sorter was often (implicitly) rewarded, over time it will no longer be, as in this scenario only novel experiments are rewarded. This, again implicitly, drives the agent to “invent” new, different-looking configurations, which are similarly useful.

Indeed, in many instances the most rewarded action is
ACKNOWLEDGMENTS
[1] Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q. (eds.) Adv. Neural Inf. Process. Syst., vol. 25, 1097–1105 (Curran Associates, Inc., 2012).
[2] Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
[3] Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349, 255–260 (2015).
[4] Carrasquilla, J. & Melko, R. G. Machine learning phases of matter. Nat. Phys. 13, 431–434 (2017).
[5] van Nieuwenburg, E. P. L., Liu, Y.-H. & Huber, S. D. Learning phase transitions by confusion. Nat. Phys. 13, 435–439 (2017).
[6] Schmidt, M. & Lipson, H. Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009).
[7] Carleo, G. & Troyer, M. Solving the quantum many-body problem with artificial neural networks. Science 355, 602–606 (2017).
[8] Zdeborová, L. Machine learning: New tool in the box. Nat. Phys. 13, 420–421 (2017).
[9] Russell, S. & Norvig, P. Artificial Intelligence - A Modern Approach (Prentice Hall, New Jersey, 2010), 3rd edn.
[10] Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT press, Cambridge, 1998).
[11] Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
[12] Krenn, M., Malik, M., Fickler, R., Lapkiewicz, R. & Zeilinger, A. Automated search for new quantum experiments. Phys. Rev. Lett. 116, 090405 (2016).
[13] Malik, M. et al. Multi-photon entanglement in high dimensions. Nat. Photon. 10, 248–252 (2016).
[14] Schlederer, F., Krenn, M., Fickler, R., Malik, M. & Zeilinger, A. Cyclic transformation of orbital angular momentum modes. New J. Phys. 18, 043019 (2016).
[15] Adapted from https://openclipart.org/detail/266420/Request, created by squarehipster, and licensed under the CC0 1.0 Universal.
[16] Briegel, H. J. & De las Cuevas, G. Projective simulation for artificial intelligence. Sci. Rep. 2, 400 (2012).
[17] Lawrence, J. Rotational covariance and Greenberger-Horne-Zeilinger theorems for three or more particles of any dimension. Phys. Rev. A 89, 012105 (2014).
[18] Huber, M. & de Vicente, J. I. Structure of multidimensional entanglement in multipartite systems. Phys. Rev. Lett. 110, 030501 (2013).
[19] Huber, M., Perarnau-Llobet, M. & de Vicente, J. I. Entropy vector formalism and the structure of multidimensional entanglement in multipartite systems. Phys. Rev. A 88, 042328 (2013).
[20] Horodecki, R., Horodecki, P., Horodecki, M. & Horodecki, K. Quantum entanglement. Rev. Mod. Phys. 81, 865–942 (2009).
[21] Amico, L., Fazio, R., Osterloh, A. & Vedral, V. Entanglement in many-body systems. Rev. Mod. Phys. 80, 517–576 (2008).
[22] Pan, J.-W. et al. Multiphoton entanglement and interferometry. Rev. Mod. Phys. 84, 777–838 (2012).
[23] Hein, M., Eisert, J. & Briegel, H. J. Multiparty entanglement in graph states. Phys. Rev. A 69, 062311 (2004).
[24] Scarani, V. & Gisin, N. Quantum communication between N partners and Bell's inequalities. Phys. Rev. Lett. 87, 117901 (2001).
[25] Scarani, V. et al. The security of practical quantum key distribution. Rev. Mod. Phys. 81, 1301–1350 (2009).
[26] Allen, L., Beijersbergen, M. W., Spreeuw, R. & Woerdman, J. Orbital angular momentum of light and the transformation of Laguerre–Gaussian laser modes. Phys. Rev. A 45, 8185 (1992).
[27] Mair, A., Vaziri, A., Weihs, G. & Zeilinger, A. Entanglement of the orbital angular momentum states of photons. Nature 412, 313–316 (2001).
[28] Molina-Terriza, G., Torres, J. P. & Torner, L. Twisted photons. Nat. Phys. 3, 305–310 (2007).
[29] Krenn, M., Malik, M., Erhard, M. & Zeilinger, A. Orbital angular momentum of photons and the entanglement of Laguerre–Gaussian modes. Phil. Trans. R. Soc. A 375, 20150442 (2017).
[30] Vaziri, A., Weihs, G. & Zeilinger, A. Experimental two-photon, three-dimensional entanglement for quantum communication. Phys. Rev. Lett. 89, 240401 (2002).
[31] Dada, A. C., Leach, J., Buller, G. S., Padgett, M. J. & Andersson, E. Experimental high-dimensional two-photon entanglement and violations of generalized Bell inequalities. Nat. Phys. 7, 677–680 (2011).
[32] Agnew, M., Leach, J., McLaren, M., Roux, F. S. & Boyd, R. W. Tomography of the quantum state of photons entangled in high dimensions. Phys. Rev. A 84, 062101 (2011).
[33] Krenn, M. et al. Generation and confirmation of a (100×100)-dimensional entangled quantum system. Proc. Natl. Acad. Sci. 111, 6243–6247 (2014).
[34] Zhang, Y. et al. Engineering two-photon high-dimensional states through quantum interference. Sci. Adv. 2, e1501165 (2016).
[35] Hiesmayr, B., de Dood, M. & Löffler, W. Observation of four-photon orbital angular momentum entanglement. Phys. Rev. Lett. 116, 073601 (2016).
[36] Wang, X.-L. et al. Quantum teleportation of multiple degrees of freedom of a single photon. Nature 518, 516–519 (2015).
[37] Krenn, M., Hochrainer, A., Lahiri, M. & Zeilinger, A. Entanglement by path identity. Phys. Rev. Lett. 118, 080401 (2017).
[38] Mautner, J., Makmal, A., Manzano, D., Tiersch, M. & Briegel, H. J. Projective simulation for classical learning agents: a comprehensive investigation. New Generat. Comput. 33, 69–114 (2015).
[39] Melnikov, A. A., Makmal, A. & Briegel, H. J. Projective simulation applied to the grid-world and the mountain-car problem. arXiv:1405.5459 (2014).
[40] Bjerland, Ø. F. Projective Simulation compared to reinforcement learning. Master's thesis, Dept. Comput. Sci., Univ. Bergen, Bergen, Norway (2015).
[41] Melnikov, A. A., Makmal, A., Dunjko, V. & Briegel, H. J. Projective simulation with generalization. arXiv:1504.02247 (2015).
[42] Hangl, S., Ugur, E., Szedmak, S. & Piater, J. Robotic playing for hierarchical complex skill learning. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2799–2804 (2016).
[43] Paparo, G. D., Dunjko, V., Makmal, A., Martin-Delgado, M. A. & Briegel, H. J. Quantum speed-up for active learning agents. Phys. Rev. X 4, 031002 (2014).
[44] Dunjko, V., Friis, N. & Briegel, H. J. Quantum-enhanced deliberation of learning agents using trapped ions. New J. Phys. 17, 023006 (2015).
[45] Friis, N., Melnikov, A. A., Kirchmair, G. & Briegel, H. J. Coherent controlization using superconducting qubits. Sci. Rep. 5, 18036 (2015).
[46] Briegel, H. J. On creative machines and the physical origins of freedom. Sci. Rep. 2, 522 (2012).
[47] Mirowski, P. et al. Learning to navigate in complex environments. arXiv:1611.03673 (2016).
[48] Mannucci, T. & van Kampen, E.-J. A hierarchical maze navigation algorithm with reinforcement learning and mapping. In Proc. IEEE Symposium Series on Computational Intelligence (2016).
[49] Makmal, A., Melnikov, A. A., Dunjko, V. & Briegel, H. J. Meta-learning within projective simulation. IEEE Access 4, 2110–2122 (2016).
[50] Leach, J., Padgett, M. J., Barnett, S. M., Franke-Arnold, S. & Courtial, J. Measuring the orbital angular momentum of a single photon. Phys. Rev. Lett. 88, 257901 (2002).
[51] Klyshko, D. N. A simple method of preparing pure states of an optical field, of implementing the Einstein–Podolsky–Rosen experiment, and of demonstrating the complementarity principle. Sov. Phys. Usp. 31, 74 (1988).
[52] Aspden, R. S., Tasca, D. S., Forbes, A., Boyd, R. W. & Padgett, M. J. Experimental demonstration of Klyshko's advanced-wave picture using a coincidence-count based, camera-enabled imaging system. J. Mod. Opt. 61, 547–551 (2014).
[53] Schwab, K. The Fourth Industrial Revolution (Penguin, London, 2017).
[54] King, R. D. et al. The automation of science. Science 324, 85–89 (2009).
[55] Hangl, S. Evaluation and extensions of generalization in the projective simulation model. Bachelor's thesis, Univ. Innsbruck, Innsbruck, Austria (2015).

METHODS

A. Projective simulation

In the following, we describe the PS model and specify the structure of the learning agent used in this paper. For a detailed exposition of the PS model, we refer the reader to Refs. [16, 38]. The PS model is based on information processing in a specific memory system, called episodic and compositional memory, which is represented as a weighted network of clips. Clips are the units of episodic memory, which are remembered percepts and actions, or short sequences thereof [16]. The PS agent perceives a state of an environment, deliberates using its network of clips and outputs an action, as shown schematically in Fig. 1. The weighted network of clips is a graph with clips as vertices and directed edges that represent the possible (stochastic) transitions between clips. In this paper, the clips represent the percepts and actions, and directed edges connect each percept to every action, as illustrated in Fig. 1(b). An edge (i, j) connects a percept s_i, i ∈ [1, N(t)], to an action a_j, j ∈ [1, K(t)], where N(t) and K(t) are the total number of percepts and actions at time step t, respectively (each time step corresponds to one interaction cycle shown in Fig. 1(a)). Each edge has a time-dependent weight h_{ij}^{(t)}, which defines the probability p_{ij}^{(t)} of a transition from percept s_i to action a_j happening. This transition probability is proportional to the weights, also called h-values, and is given by the following expression:

\[ p_{ij}^{(t)} = \frac{h_{ij}^{(t)}}{\sum_{k=1}^{K(t)} h_{ik}^{(t)}}. \tag{2} \]

Initially, all h-values are set to 1, hence the probability of performing action a_j is equal to p_{ij} = 1/K, that is, uniform. In other words, the agent's initial behaviour is fully random.

As random behaviour is rarely optimal, changes in the probabilities p_{ij} are necessary. These changes should result in an improvement in the (average) probability of success, i.e. in the chances of receiving a nonnegative reward λ^{(t)}. Learning in the PS model is realized by making use of feedback from the environment, by updating the weights h_{ij}. The matrix with entries h_{ij}, the so-called h-matrix, is updated after each interaction with an environment according to the following learning rule:

\[ h^{(t+1)} = h^{(t)} - \gamma\left(h^{(t)} - \mathbb{1}\right) + \lambda^{(t)} g^{(t)}, \tag{3} \]

where \(\mathbb{1}\) is an “all-ones” matrix and γ is the damping parameter of the PS model, responsible for forgetting. The g-matrix, or so-called glow matrix, allows rewards to be internally redistributed such that decisions that were made in the past are rewarded less than more recent decisions. The g-matrix is updated in parallel with the h-matrix. Initially, all g-values are set to zero, but if the edge (i, j) was traversed during the last decision-making process, the corresponding glow value g_{ij} is set to 1. In order to determine the extent to which previous actions should be internally rewarded, the g-matrix is also updated after each interaction with an environment:

\[ g^{(t+1)} = (1 - \eta)\, g^{(t)}, \tag{4} \]

where η is the so-called glow parameter of the PS model. While finding the optimal values for the γ and η parameters is, in general, a difficult problem, these parameters can be learned by the PS model itself [49], although at the expense of longer learning times. In this paper, the γ and η parameters were set to time-independent values without any extensive search for optimal values, which could, in principle, further improve our results.

In addition to the basic PS model described above, two additional mechanisms were used that dynamically alter the structure of the PS network: action composition and clip deletion.

1. Action composition

When action composition [16] is used, new actions can be created as follows: if a sequence of l ∈ (1, L) optical elements (actions) results in a rewarded state, then this sequence is added as a single, composite, action to the
agent's toolbox. Having such additional actions available allows the agent to access the rewarded experiments in a single decision step, without reproducing the placement of optical elements one by one. Moreover, composite actions can be placed at any time during an experiment, which effectively increases the likelihood of an experiment containing the composite action as a sub-setup. Although the entire sequence of l elements can be represented as a single action, this action still has length l. That is, the agent is not allowed to place such an element if there are already L − l + 1 optical elements on the table. Because of the growing size of the action space, rewards coming from the analyzer are rescaled proportionally to the size of the toolbox K(t),

edges. These probabilities are given by

\[ p^{\mathrm{del}}_{a_j}(t) = \left( \frac{N(t)}{\sum_{k=1}^{N(t)} h_{kj}(t)} \right)^{N(t)}. \tag{6} \]

Action deletion is triggered just after a reward is obtained and deactivated again until a new reward is encountered. The intuition behind Eq. (6) is the following. If the ingoing edges of action a_j have, to a large extent, comparatively small weights (i.e. either the action has not received many rewards or the h-values have been damped significantly since the last rewards were received), then the probability p^{\mathrm{del}}_{a_j}(t) can be approximated as
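The update rules of Eqs. (2)–(4) and the deletion probability of Eq. (6) can be collected into a compact sketch. The class below is illustrative, not the implementation used in the paper: the parameter values and the toy usage at the end are assumptions.

```python
import random

class PSAgent:
    """Minimal sketch of the basic PS update rules (Eqs. 2-4)."""

    def __init__(self, n_percepts, n_actions, gamma=0.01, eta=0.1):
        self.gamma = gamma  # damping (forgetting) parameter, Eq. (3)
        self.eta = eta      # glow parameter, Eq. (4)
        self.h = [[1.0] * n_actions for _ in range(n_percepts)]
        self.g = [[0.0] * n_actions for _ in range(n_percepts)]

    def act(self, i):
        # Eq. (2): p_ij = h_ij / sum_k h_ik; sample an action from percept i.
        row = self.h[i]
        r, acc = random.random() * sum(row), 0.0
        for j, hij in enumerate(row):
            acc += hij
            if r <= acc:
                break
        self.g[i][j] = 1.0  # the traversed edge starts glowing
        return j

    def learn(self, reward):
        for hrow, grow in zip(self.h, self.g):
            for j in range(len(hrow)):
                # Eq. (3): h <- h - gamma*(h - 1) + reward * g
                hrow[j] += -self.gamma * (hrow[j] - 1.0) + reward * grow[j]
                # Eq. (4): g <- (1 - eta) * g
                grow[j] *= 1.0 - self.eta

def deletion_probability(h, j):
    # Eq. (6): p_del = (N / sum_k h_kj) ** N over the ingoing edges of a_j,
    # so never-rewarded (or fully damped) actions are deleted with
    # probability close to 1, well-rewarded ones almost never.
    n = len(h)
    return (n / sum(row[j] for row in h)) ** n

# Toy usage: one decision from percept 0, followed by a reward.
agent = PSAgent(n_percepts=3, n_actions=4)
a = agent.act(0)         # initially uniform: each action has probability 1/4
agent.learn(reward=1.0)  # the glowing edge (0, a) is strengthened
```

After this single rewarded step, `deletion_probability` equals 1 for every never-rewarded action and drops below 1 only for the strengthened one, matching the intuition given for Eq. (6).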
[Figure 5, panels a–e: plots. Panel e legend: AS and PS for L = 6, 8, 10, 12; horizontal axis lists SRV classes 222, 322, 332, 333, 432, 442, 532, 542, 543, 644, 652, 662, 763.]
Figure 5. Details of learning performance. a, The average number of experiments performed before finding a (3, 3, 2)
experiment as a function of number of (3, 3, 2) states already found. b,c, Average fraction of interesting experiments (see main
text for definitions) in the cases of L = 6 (b) and 12 (c) is shown as a function of the number of experiments. d, Average
number of different SRV classes observed by PS and automated search as a function of L. e, Average number of different
experiments that produce states from different SRV classes are shown for PS and automated search (AS). a-e, All curves and
data points are obtained by averaging over 100 agents.
time steps. Here we show that the rate at which interesting experiments are found grows over time. The fraction of interesting experiments at different times is shown in Fig. 5(b)-(c). The fraction of interesting experiments is calculated as the ratio of the number of new, interesting experiments to the number of all experiments performed at time step t. Initially, the average fraction of interesting experiments is the same for all agents and automated search. But, in contrast to the automated search algorithms (red), the rate at which new experiments are discovered by the PS agents (blue) keeps increasing for a much longer time (Fig. 5(b)-(c)). The automated random search algorithm, as expected, always obtains approximately the same fraction of interesting experiments independent of the number of experiments that have already been simulated. In fact, the fraction of interesting experiments found by automated random search slowly decreases since it has the chance to encounter interesting experiments that have already been encountered. One can see that all 4 red curves converge to certain values before they start decreasing again, indicating that on average for each automated search agent there exists a maximum rate with which new useful experiments can be obtained.

A tantalizing assignment, related to the task of finding many interesting experiments, is that of finding many different SRVs. Even though the task's assignment was not to find many different SRVs, this and the actual task seem to be related, since PS actually found on average more different SRVs than automated random search (see Fig. 5(d)-(e)).

We have demonstrated that the PS agent is able to explore a variety of photonic quantum experiments. This suggests that there are correlations (or structural connections) between different interesting experiments which the agent exploits. If this were not the case, we would