
The Videogame Affordances Corpus

Gerard R. Bentley, Joseph C. Osborn

Pomona College
gbkh2015@mymail.pomona.edu, joseph.osborn@pomona.edu

Abstract

Videogames are created for human players whose common-sense knowledge of real-world objects and interactions (and their familiarity with other games) primes them for successful play. Action games feature recurring formal elements including a directly controlled avatar, moving enemies, resource pickups, and portals to new map areas; mapping these onto culturally significant symbols helps players learn to play quickly. We present a schema, annotation tool, and dataset for codifying screenshots containing game objects in terms of their affordances, which is suitable for AI agents and machine learning algorithms for a variety of interesting and significant applications.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

Videogame play from vision has been increasingly successful since the development of Deep Q-Learning (Mnih et al. 2013). While strategy games like StarCraft and Defense of the Ancients 2 can be tackled with substantial computing resources (Vinyals et al. 2019; OpenAI 2018), even relatively simple adventure games like Montezuma's Revenge have posed a significant challenge, where superhuman play has been achieved only recently (Ecoffet et al. 2019). This may be in part because adventure games rely on human interpretation of hints and because diverse (but targeted) exploration of the state space is more important than optimizing short sequences.

Even where automated game players have been successful, they are known to be sample-inefficient relative to human players, often requiring lifetimes of practice to achieve their impressive results. While the way in which a machine plays and the way in which a human plays are necessarily different, it seems likely that a major advantage held by humans is their familiarity with cultural signifiers (McCoy et al. 2010) as a powerful prior on the likely outcomes of game object interactions. Humans learn these beliefs through their experiences with real-world objects, socialization, and game literacy—for example, they might suspect a round object can roll, that a skull indicates danger, and that a stationary glowing orb is likely to be beneficial to touch. This suspicion is supported by previous research, which shows that the human advantage collapses when a game's graphics betray player expectations (Dubey et al. 2018). Knowing the interactions that game objects afford, such as "this block will block any agent's movement" or "this key can be picked up by my agent", helps humans guess what they can (and should) do to progress in a game. We define videogame affordances as the actions that agents are capable of performing on objects in the environment. In contrast, an algorithm like Deep Q-Learning or Go-Explore might only implicitly learn to avoid rolling skulls or spiked pits.

Incorporating object affordance data has been helpful for agents that must act in open-ended problem spaces besides games. In the case of improvisational interactions involving a prop, organizing possible actions by the prop's affordances facilitates search and enables stronger performances (Jacob and Magerko 2018). Affordances of real-world objects are determined by their visual and physical properties, and videogame graphics analogously suggest their affordances by genre convention and internal visual consistency. Since we cannot model every game's simulation rules and art direction, we have developed a tool for quickly tagging game screenshots with per-pixel object affordances and built a small corpus of game object affordances. We have exercised this dataset on a prototype model which predicts, from a videogame screenshot, the likely affordances at each pixel location. Applications for this dataset and our model could include:

• Priors for game-playing agents
• Human-legible world representations for Explainable AI
• Transfer learning between games and genres
• Style transfer between games
• Interaction-aware procedural content generation

Schema

Our initial dataset (https://github.com/gerardrbentley/Videogame-Affordances-Corpus) comes from the Nintendo Entertainment System games The Legend of Zelda and Super Mario Bros. 3 (113 and 24 images, respectively), with pixel-wise labels for a reasonably general and complete set of nine affordances (Table 1).
Solid        Blocks an agent's movement
Movable      An agent can move it
Destroyable  Can be eliminated from the game world
Dangerous    Hurts an agent
Gettable     Can be acquired by an agent
Portal       Reveals more of the game world
Usable       An agent can directly interact with it
Changeable   Can change form
UI           Non-game user interface elements

Table 1: The nine key affordances in our dataset.

We also have broken-out images and affordance data for background graphics and character sprites appearing in these games, and we are at work coding screenshots from additional games.

We selected our nine affordances based on the authors' expert knowledge of videogame objects and their semantics in NES action-adventure and role-playing games and the first-person shooter DOOM. Each affordance is a zero-or-one label, with more complex objects like breakable doors built out of several individual affordances.

These affordances capture important interactions between player agents and a game environment and seem to generalize to many games involving moving an agent and interacting with environment objects (for example, they seem well suited to Videogame Description Language (VGDL) games (Thompson et al. 2013)). For example, solid objects in a game level usually dictate the path a player agent will follow, while destroyable objects may reveal something useful when destroyed. We note that Changeable and UI are somewhat vague labels compared to the others. The former suggests possibly indirect interactions with the objects, while the latter states that the objects are non-diegetic and exist outside the world of the agents. A common example of a changeable object is a closed door (which is also solid and a portal) that opens under some condition. A door with a lock emblem usually suggests that the player must directly unlock it to change it, which adds the label Usable. Conversely, a closed door with no apparent lock often requires the player to perform some action (defeating all enemies, pushing a block in the room somewhere) to change it into an open doorway (non-solid, non-changeable, and a portal).

The primary component of our corpus is the set of annotated screenshots: 256 × 224 images taken from game play alongside binary encodings of each affordance at each pixel. This is the native resolution of NES and SNES games, but the important thing is that the game screenshot and the affordance maps have the same resolution.
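As a concrete illustration of this encoding, the sketch below shows how a screenshot and its nine-channel per-pixel labels might be loaded and queried. The file layout, helper names, and the assumption that masks are stored as a (9, 224, 256) boolean array are ours for illustration; the corpus's on-disk format may differ.

```python
import numpy as np
from PIL import Image

# Table 1 affordances in a fixed channel order (an assumed convention).
AFFORDANCES = ["solid", "movable", "destroyable", "dangerous", "gettable",
               "portal", "usable", "changeable", "ui"]

def load_example(image_path, mask_path):
    """Load a screenshot and its per-pixel affordance labels.

    Assumes the labels are stored as a (9, 224, 256) boolean array,
    one channel per affordance, at the NES screen resolution.
    """
    screenshot = np.asarray(Image.open(image_path).convert("RGB"))  # (224, 256, 3)
    mask = np.load(mask_path).astype(bool)                          # (9, 224, 256)
    assert screenshot.shape[:2] == mask.shape[1:]
    return screenshot, mask

def labels_at(mask, x, y):
    """Return every affordance that holds at pixel (x, y).

    A locked door, for instance, could yield
    ["solid", "portal", "usable", "changeable"].
    """
    return [name for name, channel in zip(AFFORDANCES, mask) if channel[y, x]]
```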
Our schema treats tile-based and non-tile-based game screenshots in the same way, but our labeling tool is specialized for tile-based games and records some extra data (per-tile and per-sprite affordances) for such games. A tool meant for labeling, e.g., 3D game screenshots could work in terms of textures, shaders, or 3D models instead of tiles (Richter et al. 2016).

Annotation Tool

In addition to the annotated screenshots and tiles, we also present the tool we used to process our data. We hope that the ready availability of our tooling will support community or crowd-sourced expansion of the dataset.

Figure 1: Affordance Annotation Tool. The single blue bounding box marks the current section to be labelled. The nine images correspond to activations of each affordance, where white means the affordance holds.

Tile-based games offer interesting challenges for computer-assisted labeling. Like previous work in annotating game screenshots (Summerville et al. 2016; Guzdial and Riedl 2016), we use OpenCV's template matching (Bradski 2000) to identify sprites and tiles that have already-known affordance labels. Given a new screenshot, our tool (Figure 1) performs automatic sprite- and tile-wise template matching on a user-adjustable grid. While it is easy (for most NES games) to identify tiles by splitting the image on a uniform 16 × 16 or 8 × 8 pixel grid, the underlying grid is not always aligned with the corners of the screen (as in the scrolling Super Mario Bros. 3), which requires manual shifting in our process.
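A simplified sketch of this tile-wise matching pass follows. The dictionary layout, the threshold value, and the function names are illustrative rather than the tool's actual code; only the use of OpenCV template matching on a shiftable grid reflects the process described above.

```python
import cv2
import numpy as np

TILE = 16  # most NES background tiles are 16x16 (8x8 also occurs)

def match_known_tiles(screenshot_gray, known_tiles, offset=(0, 0), threshold=0.95):
    """Assign affordance labels to grid cells that match an already-labelled tile.

    `known_tiles` maps a tile name to (16x16 grayscale template, label list).
    `offset` shifts the grid for screens (e.g. scrolling SMB3) whose tiles are
    not aligned with the screen corners.
    """
    labels = {}
    h, w = screenshot_gray.shape
    ox, oy = offset
    for y in range(oy, h - TILE + 1, TILE):
        for x in range(ox, w - TILE + 1, TILE):
            cell = screenshot_gray[y:y + TILE, x:x + TILE]
            for name, (template, tile_labels) in known_tiles.items():
                # Normalized cross-correlation; 1.0 is a perfect match.
                score = cv2.matchTemplate(cell, template, cv2.TM_CCOEFF_NORMED)[0, 0]
                if score >= threshold:
                    labels[(x, y)] = tile_labels
                    break  # cell identified; move on to the next grid cell
    return labels  # cells with no entry are passed to the human annotator
```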
Additionally, we need to handle game objects that move off the grid—sprites. Because the number of sprites is usually small compared to the number of tiles, we currently require a spritesheet for each game, which an annotator labels in advance. Unfortunately, spritesheets collected by enthusiasts may not exhaust all sprite orientations and may use different color palettes than the emulators used to collect game screenshots (the famous, apocryphal expansion of NTSC is "Never Twice the Same Color"). To make the system more robust to these differences, we use a suite of sprite detectors in a variety of spaces and require that they all match (though at different thresholds): grayscale, RGB, Sobel-derivative, Laplace-derivative, and Canny edges. We believe that a semi-supervised machine learning approach to identifying sprites could greatly assist in labelling new games, but we leave it to future work.
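The consensus requirement can be sketched as below. The specific per-space thresholds and derivative parameters are placeholders (the paper only states that each space uses its own threshold), and the function is an illustration of the idea rather than the tool's implementation.

```python
import cv2
import numpy as np

def sprite_matches(region_color, sprite_color, thresholds=None):
    """Require a template match in several feature spaces before accepting a sprite.

    `region_color` is a screenshot area at least as large as the sprite, in the
    color format returned by cv2.imread. Threshold values here are placeholders.
    """
    thresholds = thresholds or {"grayscale": 0.8, "rgb": 0.7, "sobel": 0.6,
                                "laplace": 0.6, "canny": 0.5}

    def spaces(img):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return {
            "grayscale": gray,
            "rgb": img,
            "sobel": cv2.Sobel(gray, cv2.CV_32F, 1, 1, ksize=3),
            "laplace": cv2.Laplacian(gray, cv2.CV_32F),
            "canny": cv2.Canny(gray, 100, 200),
        }

    region_views, sprite_views = spaces(region_color), spaces(sprite_color)
    for name, threshold in thresholds.items():
        score = cv2.matchTemplate(region_views[name].astype(np.float32),
                                  sprite_views[name].astype(np.float32),
                                  cv2.TM_CCOEFF_NORMED).max()
        if score < threshold:
            return False  # every space must agree before we accept the detection
    return True
```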
After detecting and labelling known tiles and sprites, our tool asks the user for the affordances of unknown tiles in the image, including previously unseen tiles and tiles that are obstructed by sprites (a keyboard-based interface is available for ergonomics). Newly labelled tiles are saved to the collection of known game tiles, but pixels belonging to overlapped tiles are only tagged in the current image. We believe this step could be made more automatic by a system that removes sprites from the background, allowing previously covered tiles to be matched. After this process the image is fully labelled except in locations where sprite detection failed, in which case hand annotation is employed.

Applications

Our immediate use case is to develop a model for predicting game object affordances from vision. This is a multiple-label classification problem: multiple labels may apply to a single pixel, and we want the model to predict all of them. The multiple-label setting distinguishes this work from object detection, semantic segmentation, and image classification problems, and to our knowledge it has not been explored in the videogame domain.

Our prototype (a reasonable baseline) adapts the SegNet-Basic architecture (Badrinarayanan, Kendall, and Cipolla 2017) to the multi-label setting: a fully convolutional encoder-decoder network outputs a 9-channel affordance map from a grayscale input image (full RGB data did not significantly improve performance). Figure 2 shows the output of our model for a screenshot from The Legend of Zelda.

We evaluated our baseline using 3-fold validation, with Hamming loss as the validation metric. The mean Hamming loss in our worst fold was 0.0214 (our best was 0.0189), which averages 11,044 mispredicted labels per image (out of a possible 516,096 predictions). On visualizing the worst-labeled images, we believe the errors are mainly due to the model not seeing relatively rare objects like enemy sprites.
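The multi-label adaptation and the evaluation metric can be made concrete with the following PyTorch sketch. It is not our released training code: the backbone object and its out_channels attribute are placeholders standing in for SegNet-Basic. The essential points are the independent sigmoid per affordance channel (rather than a softmax over mutually exclusive classes) and the Hamming loss as the fraction of mispredicted pixel-affordance pairs.

```python
import torch
import torch.nn as nn

NUM_AFFORDANCES = 9  # see Table 1

class AffordanceNet(nn.Module):
    """Hypothetical encoder-decoder wrapper; only the multi-label head matters here."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone                    # grayscale image -> feature map
        self.head = nn.Conv2d(backbone.out_channels, NUM_AFFORDANCES, kernel_size=1)

    def forward(self, gray_image):                  # (B, 1, 224, 256)
        return self.head(self.backbone(gray_image)) # (B, 9, 224, 256) logits

# One independent binary decision per channel: a pixel may be, e.g.,
# solid, portal, and changeable all at once.
criterion = nn.BCEWithLogitsLoss()

def hamming_loss(logits, targets):
    """Fraction of mispredicted (pixel, affordance) pairs: 9 x 224 x 256 = 516,096 per image."""
    preds = (torch.sigmoid(logits) > 0.5).float()
    return (preds != targets).float().mean()
```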
This task is the most direct application of our dataset and supports research in guiding AI game play with human-like priors without encoding explicit game- or genre-specific knowledge. Participants in the General Video Game AI competition (Perez-Liebana et al. 2016) have shown that developing domain knowledge of object interactions—similar to that described in our dataset—is a beneficial intermediate goal for agents playing previously unseen games. Some important factors in the decision process of past competition winners include:

• The position of the player agent
• The distance between the player agent and non-player agents
• Identifying non-player agents as friendly or hostile
• Approaching resources if any are present, otherwise a portal
• Exploring unknown areas when other evaluations fail

The VGDL (as used in the competition) represents some of this information explicitly and makes it available to agents. We expect that algorithms utilizing this information (even if it is noisy) will reach human performance more easily than algorithms which do not. Our corpus is the first step towards extracting this information from games in general.

Feature Representation

The recent Go-Explore algorithm, using only downsampled pixel data, achieved nearly four times the previous best score attained by reinforcement learning algorithms in the Atari game Montezuma's Revenge (Ecoffet et al. 2019). Supplied with knowledge of the player agent's location and room number, the algorithm outperformed the human record in Montezuma's Revenge by an order of magnitude and averages better than human performance in Pitfall, a game previously unsolved by strict reinforcement learning. Go-Explore uses a combination of techniques to extract this domain feature information from pixels, including locating the player agent by a pixel color that appears only on that agent's sprite and template matching an image of a key to keep track of where in the game keys were found. Our work would help generalize this aspect of extracting domain information from raw pixels to other games. The affordance maps could even be used as a more informative state representation than the coarse visual approximation used by Go-Explore's domain-independent pixel representation. In the same vein, working from affordances rather than from screenshots could promise more transferrable game moment embeddings for search, retrieval, and novelty appraisal (Zhan and Smith 2018; Zhan, Aytemiz, and Smith 2018).
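As a rough illustration of how a predicted affordance map could serve as a Go-Explore-style state descriptor, the sketch below (our own illustration, not part of Go-Explore or of the corpus tooling) reduces a map to a small, hashable key: two screens with the same coarse layout of solids, portals, dangers, and so on would fall into the same exploration cell even when their raw pixels differ.

```python
import numpy as np

def affordance_cell(mask, factor=16):
    """Downsample a (9, 224, 256) boolean affordance map into a hashable cell key.

    Each channel is reduced to a coarse grid recording whether the affordance
    occurs anywhere in each factor x factor block.
    """
    c, h, w = mask.shape
    coarse = mask.reshape(c, h // factor, factor, w // factor, factor).any(axis=(2, 4))
    return coarse.tobytes()  # usable as a dictionary key for visited-cell counts
```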
Cutting-edge research in image captioning has focused on using neural network architectures to generate accurate natural language descriptions of real-world images (Bai and An 2018). Successful methods include encoder-decoder, attention-guided, and compositional architectures, which operate by deriving context and feature vectors from images via convolutions. Limited research has been done on captioning videogame screenshots, and models trained on real-world images fall short (Fulda et al. 2018). We feel that semantic captioning of videogame screenshots is a natural extension of this work, which could also lead into automatic tutorialization (Green et al. 2018).

The predictive model described at the beginning of this section is, in some sense, learning to see game screenshots in instrumental terms. The convolutional part of this trained network could be used to bootstrap other models that work from game screenshots, and the final outputs can serve as a transferrable, lowest-common-denominator way for an algorithm to understand a picture from a game.

In addition to generating descriptions of whole images, we see this work fitting with research in more explainable AI systems. Training agents with high-level game information like affordances (rather than raw pixels) could expose how perceived image features influence the agent's choices. Explaining and observing actions in terms of affordances shows what functional interactions the agent values, even if its internal processes did work from raw pixels.

Figure 2: A screenshot from The Legend of Zelda (left) and non-zero affordance predictions (right). Predictions are for solid (purple), portal (yellow), dangerous (green), and destroyable (red). The UI area is predicted but excluded here for clarity.

PCGML

Any game dataset suggests immediate applications in procedural content generation via machine learning (PCGML) (Summerville et al. 2018). Using the model described earlier, it is easy to imagine a level designer sketching out an affordance map and running the model in a DeepDream-like setting to find an image which optimizes the probability of predicting that particular affordance map (Mordvintsev, Olah, and Tyka 2015). Since we have per-sprite and per-tile affordance data, we could come up with an embedding from 16 × 16-pixel graphics to affordance labels, and attempt to perform vector arithmetic in affordance embedding space to procedurally generate new game graphics.
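A DeepDream-like use of the trained predictor could look like the sketch below. This is speculative and makes assumptions beyond the paper: `model` is a frozen affordance predictor mapping a (1, 1, 224, 256) grayscale image to 9-channel logits, and `target_mask` is a designer-drawn (1, 9, 224, 256) affordance map; only the input pixels are optimized.

```python
import torch

def dream_image(model, target_mask, steps=500, lr=0.05):
    """Optimize an input image so the frozen model predicts `target_mask`.

    A sketch of the DeepDream-style idea, not an implemented feature of the
    corpus tooling; hyperparameters are arbitrary.
    """
    model.eval()
    image = torch.rand(1, 1, 224, 256, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(image), target_mask)  # match the sketched affordances
        loss.backward()                            # gradients update the pixels only
        optimizer.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)                 # keep pixels in a valid range
    return image.detach()
```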

We also see this work as complementary to existing efforts like the Videogame Level Corpus (VGLC) (Summerville et al. 2016). One interesting interaction is to obtain and tag level data from the VGLC. We could also use the model described above and an automatic mapping system like Mappy (Osborn, Summerville, and Mateas 2017a) to create new maps and interaction-aware legends to expand the VGLC. Combining both corpora could also provide richer features for PCGML algorithms.
Related Work

The most direct related project is the Videogame Level Corpus, which contains structural level layouts and per-game semantic tags for 12 games. The VGLC format required a combination of hand and computer annotations and static file analysis to complete, and its plain-text labels for tile types range from game-specific (particular enemies in Super Mario Bros.) to general (e.g., "solid", "breakable"). Our corpus is complementary, focusing on object affordances and interactions in a universal schema, and working at the level of pixels instead of tiles. Importantly, we are concerned with game screenshots and not game levels, so our schema and use cases are unlikely to be fully reconcilable.

A related recent application of the VGLC is explainable PCGML (Guzdial et al. 2018). Using level encodings from the VGLC along with expert-provided design pattern labels allows a generator to justify its creations in human-relevant terms, which is vital in a co-creative mixed-initiative setting.

Coming from the opposite direction of the VGLC, automated game design learning (AGDL) is a broad project which has as its goal the automatic extraction of high-level design elements (Osborn, Summerville, and Mateas 2017b). While AGDL focuses on learning game rules from observation and experimentation, our work abstracts specific rules away to focus on broad classes of interaction, and (for now) requires manual tagging.

Conclusion

Our immediate next step is to expand the corpus, both in terms of depth (screenshots and object coverage in each game) and breadth (more games in different genres, with different art styles). Besides manual annotation, we hope to explore the use of instrumented emulators to identify tiles and sprites (Osborn, Summerville, and Mateas 2017a) and to capture object interactions (Summerville et al. 2017; Summerville, Osborn, and Mateas 2017). We also believe that there are natural semi-supervised learning tasks on this corpus: for example, pasting sprites into a sprite-free image at random locations, or reassembling a game screenshot which has been broken up like a jigsaw puzzle (Noroozi and Favaro 2016). Finally, augmenting our dataset with cultural information in free text could help form a fuller understanding of why game objects seem to afford certain uses.

This work is a first step towards helping computers see games as people do, which seems to be a necessary step towards more sample-efficient game playing algorithms. Even though we have focused on instrumental affordances, we have already seen promising initial results in predicting affordances from screenshots. We have also shown that this affordance-oriented view of game images will be useful in game-playing agents and beyond.

References
Badrinarayanan, V.; Kendall, A.; and Cipolla, R. 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Bai, S., and An, S. 2018. A survey on automatic image caption generation. Neurocomputing.

Bradski, G. 2000. The OpenCV Library. Dr. Dobb's Journal of Software Tools.

Dubey, R.; Agrawal, P.; Pathak, D.; Griffiths, T. L.; and Efros, A. A. 2018. Investigating human priors for playing video games.

Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Go-Explore: A new approach for hard-exploration problems.

Fulda, N.; Ricks, D.; Murdoch, B.; and Wingate, D. 2018. Threat, explore, barter, puzzle: A semantically-informed algorithm for extracting interaction modes. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.

Green, M. C.; Khalifa, A.; Barros, G. A.; Machado, T.; Nealen, A.; and Togelius, J. 2018. AtDelfi: Automatically designing legible, full instructions for games. In Proceedings of the 13th International Conference on the Foundations of Digital Games, 17. ACM.

Guzdial, M., and Riedl, M. 2016. Toward game level generation from gameplay videos.

Guzdial, M.; Reno, J.; Chen, J.; Smith, G.; and Riedl, M. 2018. Explainable PCGML via game design patterns.

Jacob, M., and Magerko, B. 2018. Creative arcs in improvised human-computer embodied performances. In Proceedings of the 13th International Conference on the Foundations of Digital Games. New York, NY, USA: ACM.

McCoy, J.; Treanor, M.; Samuel, B.; Tearse, B.; Mateas, M.; and Wardrip-Fruin, N. 2010. Comme il Faut 2: A fully realized model for socially-oriented gameplay. In Proceedings of the Intelligent Narrative Technologies III Workshop, INT3 '10, 10:1–10:8. New York, NY, USA: ACM.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning.

Mordvintsev, A.; Olah, C.; and Tyka, M. 2015. Inceptionism: Going deeper into neural networks. https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.

Noroozi, M., and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision. Springer.

OpenAI. 2018. OpenAI Five. https://blog.openai.com/openai-five/.

Osborn, J.; Summerville, A.; and Mateas, M. 2017a. Automatic mapping of NES games with Mappy. In Proceedings of the International Conference on the Foundations of Digital Games (FDG '17).

Osborn, J. C.; Summerville, A.; and Mateas, M. 2017b. Automated game design learning. 2017 IEEE Conference on Computational Intelligence and Games (CIG).

Perez-Liebana, D.; Samothrakis, S.; Togelius, J.; Schaul, T.; Lucas, S. M.; Coutoux, A.; Lee, J.; Lim, C.; and Thompson, T. 2016. The 2014 general video game playing competition. IEEE Transactions on Computational Intelligence and AI in Games.

Richter, S. R.; Vineet, V.; Roth, S.; and Koltun, V. 2016. Playing for data: Ground truth from computer games. In European Conference on Computer Vision. Springer.

Summerville, A. J.; Snodgrass, S.; Mateas, M.; and Ontañón, S. 2016. The VGLC: The video game level corpus.

Summerville, A.; Behrooz, M.; Mateas, M.; and Jhala, A. 2017. What does that ?-block do? Learning latent causal affordances from Mario play traces. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

Summerville, A.; Snodgrass, S.; Guzdial, M.; Holmgard, C.; Hoover, A. K.; Isaksen, A.; Nealen, A.; and Togelius, J. 2018. Procedural content generation via machine learning (PCGML). IEEE Transactions on Games.

Summerville, A.; Osborn, J.; and Mateas, M. 2017. CHARDA: Causal hybrid automata recovery via dynamic analysis. arXiv preprint arXiv:1707.03336.

Thompson, T.; Ebner, M.; Schaul, T.; Levine, J.; Lucas, S.; and Togelius, J. 2013. Towards a video game description language. Dagstuhl Follow-Ups.

Vinyals, O.; Babuschkin, I.; Chung, J.; Mathieu, M.; Jaderberg, M.; Czarnecki, W. M.; Dudzik, A.; Huang, A.; Georgiev, P.; Powell, R.; Ewalds, T.; Horgan, D.; Kroiss, M.; Danihelka, I.; Agapiou, J.; Oh, J.; Dalibard, V.; Choi, D.; Sifre, L.; Sulsky, Y.; Vezhnevets, S.; Molloy, J.; Cai, T.; Budden, D.; Paine, T.; Gulcehre, C.; Wang, Z.; Pfaff, T.; Pohlen, T.; Wu, Y.; Yogatama, D.; Cohen, J.; McKinney, K.; Smith, O.; Schaul, T.; Lillicrap, T.; Apps, C.; Kavukcuoglu, K.; Hassabis, D.; and Silver, D. 2019. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.

Zhan, Z., and Smith, A. M. 2018. Retrieving game states with moment vectors. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.

Zhan, Z.; Aytemiz, B.; and Smith, A. M. 2018. Taking the scenic route: Automatic exploration for videogames. arXiv preprint arXiv:1812.03125.
