Download as pdf or txt
Download as pdf or txt
You are on page 1of 70

Graph Data Science with Neo4j: Learn

how to use Neo4j 5 with Graph Data


Science library 2.0 and its Python driver
for your project Scifo
Visit to download the full and correct content document:
https://ebookmass.com/product/graph-data-science-with-neo4j-learn-how-to-use-neo
4j-5-with-graph-data-science-library-2-0-and-its-python-driver-for-your-project-scifo/
More products digital (pdf, epub, mobi) instant
download maybe you interests ...

Graph Data Science (GDS) For Dummies®, Neo4j Special


Edition Amy Hodler

https://ebookmass.com/product/graph-data-science-gds-for-dummies-
neo4j-special-edition-amy-hodler/

(eBook PDF) Intro to Python for Computer Science and


Data Science: Learning to Program with AI, Big Data and
The Cloud

https://ebookmass.com/product/ebook-pdf-intro-to-python-for-
computer-science-and-data-science-learning-to-program-with-ai-
big-data-and-the-cloud/

Data-Driven SEO with Python: Solve SEO Challenges with


Data Science Using Python 1st Edition Andreas Voniatis

https://ebookmass.com/product/data-driven-seo-with-python-solve-
seo-challenges-with-data-science-using-python-1st-edition-
andreas-voniatis/

Data Science from Scratch: First Principles with Python


2nd Edition

https://ebookmass.com/product/data-science-from-scratch-first-
principles-with-python-2nd-edition/
Roman's Data Science - How To Monetize Your Data 1st ed
Edition Roman Zykov

https://ebookmass.com/product/romans-data-science-how-to-
monetize-your-data-1st-ed-edition-roman-zykov/

Smarter data science: succeeding with enterprise-grade


data and ai projects Fishman

https://ebookmass.com/product/smarter-data-science-succeeding-
with-enterprise-grade-data-and-ai-projects-fishman/

Leading with AI and Analytics: Build Your Data Science


IQ to Drive Business Value 1st Edition Anderson

https://ebookmass.com/product/leading-with-ai-and-analytics-
build-your-data-science-iq-to-drive-business-value-1st-edition-
anderson/

Winning with Data Science: A Handbook for Business


Leaders Friedman

https://ebookmass.com/product/winning-with-data-science-a-
handbook-for-business-leaders-friedman/

Data Science With Rust: A Comprehensive Guide - Data


Analysis, Machine Learning, Data Visualization & More
Van Der Post

https://ebookmass.com/product/data-science-with-rust-a-
comprehensive-guide-data-analysis-machine-learning-data-
visualization-more-van-der-post/
Graph Data Science with Neo4j

Learn how to use Neo4j 5 with Graph Data Science library 2.0
and its Python driver for your project

Estelle Scifo

BIRMINGHAM—MUMBAI
Graph Data Science with Neo4j
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi


Senior Editor: Nathanya Dias
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Shankar Kalbhor
Marketing Coordinator: Vinishka Kalra

First published: January 2023

Production reference: 1310123

Published by Packt Publishing Ltd.


Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-80461-274-3

www.packtpub.com
Contributors

About the author


Estelle Scifo is a Neo4j Certified Professional and Neo4j Graph Data Science certified user. She is
currently a machine learning engineer at GraphAware where she builds Neo4j-related solutions to
make customers happy with graphs.
Before that, she worked in several fields, starting out with research in particle physics, during which
she worked at CERN on uncovering Higgs boson properties. She received her PhD in 2014 from the
Laboratoire de l’Accélérateur Linéaire (Orsay, France). Continuing her career in industry, she worked
in real estate, mobility, and logistics for almost 10 years. In the Neo4j community, she is known as
the creator of neomap, a map visualization application for data stored in Neo4j. She also regularly
gives talks at conferences such as NODES and PyCon. Her domain expertise and deep insight into
the perspective of a beginner’s needs make her an excellent teacher.

There is only one name on the cover, but a book is not the work of one person. I would like to thank
everyone involved in making this book a reality. Beyond everyone at Packt, the reviewers did an
incredible job of suggesting some very relevant improvements. Thank you, all!

I hope this book will inspire you as much as other books of this genre have inspired me.
About the reviewers
Dr. David Gurzick is the founding chair of the George B. Delaplaine Jr. School of Business and an
associate professor of management science at Hood College. He has a BS in computer science from
Frostburg State University, an M.S. in computer science from Hood College, a PhD in information
systems from the University of Maryland, Baltimore County, and is a graduate of Harvard’s business
analytics program. As a child of the internet, he grew up on AOL and programmed his way through
dot.com. He now helps merge technology and business strategy to enable innovation and accelerate
commercial success as the lead data scientist at Genitive.ai and as a director of the Frederick Innovative
Technology Center, Inc (FITCI).

Sean William Grant is a product and analytics professional with over 20 years of experience in
technology and data analysis. His experience ranges from geospatial intelligence with the United
States Marine Corps, product management within the aviation and autonomy space, to implementing
advanced analytics and data science within organizations. He is a graph data science and network
analytics enthusiast who frequently gives presentations and workshops on connected data. He has also
been a technical advisor to several early-stage start-ups. Sean is passionate about data and technology,
and how it can elevate our understanding of ourselves.

Jose Ernesto Echeverria has worked with all kinds of databases, from relational databases in the 1990s
to non-SQL databases in the 2010s. He considers graph databases to be the best fit for solving real-
world problems, given their strong capability for modeling and adaptability to change. As a polyglot
programmer, he has used languages such as Java, Ruby, and R and tools such as Jupyter with Neo4j
in order to solve data management problems for multinational corporations. A long-time advocate
of data science, he expects this long-awaited book to cover the proper techniques and approach the
intersections of this discipline, as well as help readers to discover the possibilities of graph databases.
When not working, he enjoys spending time with friends and family.
Table of Contents

Prefacexi

Part 1 – Creating Graph Data in Neo4j


1
Introducing and Installing Neo4j 3
Technical requirements 4 Downloading and starting Neo4j Desktop 16
What is a graph database? 4 Creating our first Neo4j database 18
Creating a database in the cloud – Neo4j Aura 20
Databases4
Graph database 5 Inserting data into Neo4j with
Finding or creating a graph database 7 Cypher, the Neo4j query language 21
A note about the graph dataset’s format 8 Extracting data from Neo4j with
Modeling your data as a graph 10
Cypher pattern matching 24
Summary25
Neo4j in the graph databases landscape12
Further reading 25
Neo4j ecosystem 13
Exercises26
Setting up Neo4j 15

2
Importing Data into Neo4j to Build a Knowledge Graph 27
Technical requirements 28 Introducing the APOC library to
Importing CSV data into Neo4j with deal with JSON data 41
Cypher28 Browsing the dataset 41
Discovering the Netflix dataset 29 Getting to know and installing the APOC plugin42
Defining the graph schema 31 Loading data 44
Importing data 32 Dealing with temporal data 47
vi Table of Contents

Discovering the Wikidata public Importing data for all people 55


knowledge graph 49
Dealing with spatial data in Neo4j 57
Data format 49
Importing data in the cloud 58
Query language – SPARQL 50
Summary62
Enriching our graph with Wikidata
Further reading 63
information52
Exercises63
Loading data into Neo4j for one person 52

Part 2 – Exploring and Characterizing Graph Data


with Neo4j
3
Characterizing a Graph Dataset 67
Technical requirements 67 Installing and using the Neo4j
Characterizing a graph from its node Python driver 79
and edge properties 68 Counting node labels and relationship types
Link direction 68 in Python 79
Link weight 71 Building the degree distribution of a graph 81
Node type 72 Improved degree distribution 85

Computing the graph degree Learning about other characterizing


distribution73 metrics87
Definition of a node’s degree 73 Triangle count 88
Computing the node degree with Cypher 74 Clustering coefficient 89
Visualizing the degree distribution with Summary91
NeoDash76
Further reading 91
Exercises91

4
Using Graph Algorithms to Characterize a Graph Dataset 93
Technical requirements 93 Installing the GDS library with Neo4j Desktop 96
Digging into the Neo4j GDS library 94 GDS project workflow 98

GDS content 94 Projecting a graph for use by GDS 99


Table of Contents vii

Native projections 100 Other centrality metrics 118


Cypher projections 109
Understanding a graph’s structure by
Computing a node’s degree with GDS 111 looking for communities 120
stream mode 112 Number of components 120
The YIELD keyword 113 Modularity and the Louvain algorithm 123
write mode 114
Summary126
mutate mode 116
Algorithm configuration 117
Further reading 126

5
Visualizing Graph Data 127
Technical requirements 128 What is Bloom? 138
The complexity of graph data Bloom installation 139
visualization128 Selecting data with Neo4j Bloom 139
Physical networks 128 Configuring the scene in Bloom 142
General case 129 Visualizing large graphs with Gephi 144
Visualizing a small graph with Installing Gephi and its required plugin 144
networkx and matplotlib 131 Using APOC Extended to synchronize Neo4j
and Gephi 146
Visualizing a graph with known coordinates 131
Configuring the view in Gephi 148
Visualizing a graph with unknown coordinates 133
Configuring object display 137 Summary152
Discovering the Neo4j Bloom graph Further reading 153
application138 Exercises153

Part 3 – Making Predictions on a Graph


6
Building a Machine Learning Model with Graph Features 157
Technical requirements 157 Running GDS algorithms from
Introducing the GDS Python client 158 Python and extracting data in a
GDS Python principles 158
dataframe164
Input and output types 159 write mode 165
Creating a projected graph from Python 161 stream mode 167
viii Table of Contents

Dropping the projected graph 169 Extracting and visualizing data 172
Building the model 172
Using features from graph algorithms
in a scikit-learn pipeline 170 Summary175
Machine learning tasks with graphs 170 Further reading 175
Our task 170 Exercise175
Computing features 171

7
Automatically Extracting Features with Graph Embeddings
for Machine Learning 177
Technical requirements 178 Training an inductive embedding
Introducing graph embedding algorithm192
algorithms178 Understanding GraphSAGE 192
Defining embeddings 178 Introducing the GDS model catalog 194
Graph embedding classification 182 Training GraphSAGE with GDS 195

Using a transductive graph Computing new node representations197


embedding algorithm 185 Summary199
Understanding the Node2Vec algorithm 185 Further reading 199
Using Node2Vec with GDS 187 Exercises200

8
Building a GDS Pipeline for Node Classification Model Training 201
Technical requirements 201 Choosing the graph embedding algorithm to
use215
The GDS pipelines 202
Training using Node2Vec 215
What is a pipeline? 202
Training using GraphSAGE 217
Building and training a pipeline 205
Summary220
Creating the pipeline and choosing the features205
Setting the pipeline configuration 206
Further reading 221
Training the pipeline 212 Exercise221

Making predictions 213


Computing the confusion matrix 213

Using embedding features 214


Table of Contents ix

9
Predicting Future Edges 223
Technical requirements 223 Features based on node properties 228
Introducing the LP problem 224 Building an LP pipeline with the GDS230
LP examples 224 Creating and configuring the pipeline 230
LP with the Netflix dataset 225 Pipeline training and testing 232
Framing an LP problem 226
Summary234
LP features 227 Further reading 235
Topological features 227

10
Writing Your Custom Graph Algorithms with
the Pregel API in Java 237
Technical requirements 238 Test for the PageRank class 249
Introducing the Pregel API 238 Test for the PageRankTol class 252

GDS’s features 238 Using our algorithm from Cypher 253


The Pregel API 238 Adding annotations 253
Implementing the PageRank Building the JAR file 254
algorithm239 Updating the Neo4j configuration 255
The PageRank algorithm 240 Testing our procedure 255
Simple Python implementation 240 Summary256
Pregel Java implementation 244 Further reading 257
Implementing the tolerance-stopping criteria 247
Exercises258
Testing our code 249

Index259

Other Books You May Enjoy 268


Preface
Data science today is a core component of many companies and organizations taking advantage of its
predictive power to improve their products or better understand their customers. It is an ever-evolving
field, still undergoing intense research. One of the most trending research areas is graph data science
(GDS), or how representing data as a connected network can improve models.
Among the different tools on the market to work with graphs, Neo4j, a graph database, is popular
among developers for its ability to build simple and evolving data models and query data easily with
Cypher. For a few years now, it has also stood out as a leader in graph analytics, especially since the
release of the first version of its GDS library, allowing you to run graph algorithms from data stored
in Neo4j, even at a large scale.
This book is designed to guide you through the field of GDS, always using Neo4j and its GDS library
as the main tool. By the end of this book, you will be able to run your own GDS model on a graph
dataset you created. By the end of the book, you will even be able to pass the Neo4j Data Science
certification to prove your new skills to the world.

Who this book is for


This book is for people who are curious about graphs and how this data structure can be useful in
data science. It can serve both data scientists who are learning about graphs and Neo4j developers
who want to get into data science.
The book assumes minimal data science knowledge (classification, training sets, confusion matrices) and
some experience with Python and its related data science toolkit (pandas, matplotlib, and scikit-learn).

What this book covers


Chapter 1, Introducing and Installing Neo4j, introduces the basic principles of graph databases and gives
instructions on how to set up Neo4j locally, create your first graph, and write your first Cypher queries.
Chapter 2, Using Existing Data to Build a Knowledge Graph, guides you through loading data into
Neo4j from different formats (CSV, JSON, and an HTTP API). This is where you will build the dataset
that will be used throughout this book.
Chapter 3, Characterizing a Graph Dataset, introduces some key metrics to differentiate one graph
dataset from another.
xii Preface

Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset, goes deeper into understanding
a graph dataset by using graph algorithms. This is the chapter where you will start to use the Neo4j
GDS plugin.
Chapter 5, Visualizing Graph Data, delves into graph data visualization by drawing nodes and edges,
starting from static representations and moving on to dynamic ones.
Chapter 6, Building a Machine Learning Model with Graph Features, talks about machine learning
model training using scikit-learn. This is where we will first use the GDS Python client.
Chapter 7, Automating Feature Extraction with Graph Embeddings for Machine Learning, introduces
the concept of node embedding, with practical examples using the Neo4j GDS library.
Chapter 8, Building a GDS Pipeline for Node Classification Model Training, introduces the topic of node
classification within GDS without involving a third-party tool.
Chapter 9, Predicting Future Edges, gives a short introduction to the topic of link prediction, a graph-
specific machine learning task.
Chapter 10, Writing Your Custom Graph Algorithms with the Pregel API in Java, covers the exciting
topic of building an extension for the GDS plugin.

To get the most out of this book


You will need access to a Neo4j instance. Options and installation instructions are given in Chapter 1,
Introducing and Installing Neo4j. We will also intensively use Python and the following packages:
pandas, scikit-learn, network, and graphdatascience. The code was tested with Python 3.10 but should
work with newer versions, assuming no breaking change is made in its dependencies. Python code is
provided as a Jupyter notebook, so you’ll need Jupyter Server installed and running to go through it.
For the very last chapter, a Java JDK will also be required. The code was tested with OpenJDK 11.

Software/hardware covered in the book Operating system requirements


Neo4j 5.x Windows, macOS, or Linux
Python 3.10 Windows, macOS or Linux
Jupyter Windows, macOS or Linux
OpenJDK 11 Windows, macOS or Linux
Preface xiii

You will also need to install Neo4j plugins: APOC and GDS. Installation instructions for Neo4j Desktop
are given in the relevant chapters. However, if you are not using a local Neo4j instance, please refer to
the following pages for installation instructions, especially regarding version compatibilities:

• APOC: https://neo4j.com/docs/apoc/current/installation/
• GDS: https://neo4j.com/docs/graph-data-science/current/installation/

If you are using the digital version of this book, we advise you to type the code yourself or access
the code from the book’s GitHub repository (a link is available in the next section). Doing so will
help you avoid any potential errors related to the copying and pasting of code.

Download the example code files


You can download the example code files for this book from GitHub at https://github.com/
PacktPublishing/Graph-Data-Science-with-Neo4j. If there’s an update to the code,
it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://
github.com/PacktPublishing/. Check them out!

Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount
the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”
A block of code is set as follows:

CREATE (:Movie {
    id: line.show_id,
    title: line.title,
    releaseYear: line.release_year
  }
xiv Preface

When we wish to draw your attention to a particular part of a code block, the relevant lines or items
are set in bold:

LOAD CSV WITH HEADERS


FROM 'file:///netflix/netflix_titles.csv' AS line
WITH split(line.director, ",") as directors_list
UNWIND directors_list AS director_name
CREATE (:Person {name: trim(director_name)})

Any command-line input or output is written as follows:

$ mkdir css
$ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance,
words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the
Administration panel.”

Tips or important notes


Appear like this.

Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@
packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you have found a mistake in this book, we would be grateful if you would report this to us. Please
visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Preface xv

Share Your Thoughts


Once you’ve read Graph Data Science with Neo4J, we’d love to hear your thoughts! Please click here
to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering
excellent quality content.
xvi Preface

Download a free PDF copy of this book


Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical
books directly into your application.
The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content
in your inbox daily
Follow these simple steps to get the benefits:

1. Scan the QR code or visit the link below

https://packt.link/free-ebook/9781804612743

2. Submit your proof of purchase


3. That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1 –
Creating Graph Data
in Neo4j

In this first part, you will learn about Neo4j and set up your first graph database. You will also build
a graph dataset in Neo4j using Cypher, the APOC library, and public knowledge graphs.
This part includes the following chapters:

• Chapter 1, Introducing and Installing Neo4j


• Chapter 2, Using Existing Data to Build a Knowledge Graph
1
Introducing and
Installing Neo4j
Graph databases in general, and Neo4j in particular, have gained increasing interest in the past few years.
They provide a natural way of modeling entities and relationships and take into account observation
context, which is often crucial to extract the most out of your data. Among the different graph database
vendors, Neo4j has become one of the most popular for both data storage and analytics. A lot of tools
have been developed by the company itself or the community to make the whole ecosystem consistent
and easy to use: from storage to querying, to visualization to graph data science. As you will see through
this book, there is a well-integrated application or plugin for each of these topics.
In this chapter, you will get to know what Neo4j is, positioning it in the broad context of databases.
We will also introduce the aforementioned plugins that are used for graph data science.
Finally, you will set up your first Neo4j instance locally if you haven’t done so already and run your
first Cypher queries to populate the database with some data and retrieve it.
In this chapter, we’re going to cover the following main topics:

• What is a graph database?


• Finding or creating a graph database
• Neo4j in the graph databases landscape
• Setting up Neo4j
• Inserting data into Neo4j with Cypher, the Neo4j query language
• Extracting data from Neo4j with Cypher pattern matching
4 Introducing and Installing Neo4j

Technical requirements
To follow this chapter well, you will need access to the following resources:

• You’ll need a computer that can run Neo4j locally; Windows, macOS, and Linux are all supported.
Please refer to the Neo4j website for more details about system requirements: https://neo4j.
com/docs/operations-manual/current/installation/requirements/.
• Any code listed in the book will be available in the associated GitHub repository – that is,
https://github.com/PacktPublishing/Graph-Data-Science-with-
Neo4j – in the corresponding chapter folder.

What is a graph database?


Before we get our hands dirty and start playing with Neo4j, it is important to understand what Neo4j
is and how different it is from the data storage engine you are used to. In this section, we are going to
discuss (quickly) the different types of databases you can find today, and why graph databases are so
interesting and popular both for developers and data professionals.

Databases
Databases make up an important part of computer science. Discussing the evolution and state-of-the-art
areas of the different implementations in detail would require several books like this one – fortunately,
this is not a requirement to use such systems effectively. However, it is important to be aware of the
existing tools related to data storage and how they differ from each other, to be able to choose the right
tool for the right task. The fact that, after reading this book, you’ll be able to use graph databases and
Neo4j in your data science project doesn’t mean you will have to use it every single time you start a
new project, whatever the context is. Sometimes, it won’t be suitable; this introduction will explain why.
A database, in the context of computing, is a system that allows you to store and query data on a
computer, phone, or, more generally, any electronic device.
As developers or data scientists of the 2020s, we have mainly faced two kinds of databases:

• Relational databases (SQL) such as MySQL or PostgreSQL. These store data as records in
tables whose columns are attributes or fields and whose rows represent each entity. They have
a predefined schema, defining how data is organized and the type of each field. Relationships
between entities in this representation are modeled by foreign keys (requiring unique identifiers).
When the relationship is more complex, such as when attributes are required or when we can have
many relationships between the same objects, an intermediate junction (join) table is required.
What is a graph database? 5

• NoSQL databases contain many different types of databases:

‚ Key-value stores such as Redis or Riak. A key-value (KV) store, as the name suggests, is a
simple lookup database where the key is usually a string, and the value can be a more complex
object that can’t be used to filter the query – it can only be retrieved. They are known to be
very efficient for caching in a web context, where the key is the page URL and the value is
the HTML content of the page, which is dynamically generated. KV stores can also be used
to model graphs when building a native graph engine is not an option. You can see KV stores
in action in the following projects:

 IndraDB: This is a graph database written in Rust that relies on different types of KV
stores: https://github.com/indradb/indradb

‚ Document-oriented databases such as MongoDB or CouchDB. These are useful for storing
schema-less documents (usually JSON (or a derivative) objects). They are much more
flexible compared to relational databases, since each document may have different fields.
However, relationships are harder to model, and such databases rely a lot on nested JSON
and information duplication instead of joining multiple tables.

The preceding list is non-exhaustive; other types of data stores have been created over time and
abandoned or were born in the past years, so we’ll need to wait to see how useful they can be. We can
mention, for instance, vector databases, such as Weaviate, which are used to store data with their vector
representations to ease searching in the vector space, with many applications in machine learning
once a vector representation (embedding) of an observation has been computed.
Graph databases can also be classified as NoSQL databases. They bring another approach to the data
storage landscape, especially in the data model phase.

Graph database
In the previous section, we talked about databases. Before discussing graph databases, let’s introduce
the concept of graphs.
A graph is a mathematical object defined by the following:

• A set of vertices or nodes (the dots)


• A set of edges (the connections between these dots)
6 Introducing and Installing Neo4j

The following figure shows several examples of graphs, big and small:

Figure 1.1 – Representations of some graphs

As you can see, there’s a Road network (in Europe), a Computer network, and a Social network. But
in practice, far more objects can be seen as graphs:

• Time series: Each observation is connected to the next one


• Images: Each pixel is linked to its eight neighbors (see the bottom-right picture in Figure 1.1)
• Texts: Here, each word is connected to its surrounding words or a more complex mapping,
depending on its meaning (see the following figure):
Finding or creating a graph database 7

Figure 1.2 – Figure generated with the spacy Python library, which was able to identify
the relationships between words in a sentence using NLP techniques

A graph can be seen as a generalization of these static representations, where links can be created
with fewer constraints.
Another advantage of graphs is that they can be easily traversed, going from one node to another by
following edges. They have been used for representing networks for a long time – road networks or
communication infrastructure, for instance. The concept of a path, especially the shortest path in a
graph, is a long-studied field. But the analysis of graphs doesn’t stop here – much more information
can be extracted from carefully analyzing a network, such as its structure (are there groups of nodes
disconnected from the rest of the graph? Are groups of nodes more tied to each other than to other
groups?) and node ranking (node importance). We will discuss these algorithms in more detail in
Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset.
So, we know what a database is and what a graph is. Now comes the natural question: what is a graph
database? The answer is quite simple: in a graph database, data is saved into nodes, which can be
connected through edges to model relationships between them.
At this stage, you may be wondering: ok, but where can I find graph data? While we are used to CSV
or JSON datasets, graph formats are not yet common and it might be misleading to some of you. If
you do not have graph data, why would you need a graph database? There are two possible answers
to this question, both of which we are going to discuss.

Finding or creating a graph database


Data scientists know how to find or generate datasets that fit their needs. Randomly generating a
variable distribution while following some probabilistic law is one of the first things you’ll learn in a
statistics course. Similarly, graph datasets can be randomly generated, following some rules. However,
this book is not a graph theory book, so we are not going to dig into these details here. Just be aware
that this can be done. Please refer to the references in the Further reading section to learn more.
8 Introducing and Installing Neo4j

Regarding existing datasets, some of them are very popular and data scientists know about them because
they have used them while learning data science and/or because they are the topic of well-known
Kaggle competitions. Think, for instance, about the Titanic or house price datasets. Other datasets are
also used for model benchmarking, such as the MNIST or ImageNet datasets in computer vision tasks.
The same holds for graph data science, where some datasets are very common for teaching or
benchmarking purposes. If you investigate graph theory, you will read about the Zachary’s karate
club (ZKC) dataset, which is probably one of the most famous graph datasets out there (side note:
there is even a ZKC trophy, which is awarded to the first person in a graph conference that mentions
this dataset). The ZKC dataset is very simple (30 nodes, as we’ll see in Chapter 3, Characterizing a
Graph Dataset, and Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset, on how to
characterize a graph dataset), but bigger and more complex datasets are also available.
There are websites referencing graph datasets, which can be used for benchmarking in a research
context or educational purpose, such as this book. Two of the most popular ones are the following:

• The Stanford Network Analysis Project (SNAP) (https://snap.stanford.edu/


data/index.html) lists different types of networks in different categories (social networks,
citation networks, and so on)
• The Network Repository Project, via its website at https://networkrepository.
com/index.php, provides hundreds of graph datasets from real-world examples, classified
into categories (for example, biology, economics, recommendations, road, and so on)

If you browse these websites and start downloading some of the files, you’ll notice the data comes in
unfamiliar formats. We’re going to list some of them next.

A note about the graph dataset’s format


The datasets we are used to are mainly exchanged as CSV or JSON files. To represent a graph, with
nodes on one side and edges on the other, several specific formats have been imagined.
The main data formats that are used to save graph data as text files are the following:

• Edge list: This is a text file where each row contains an edge definition. For instance, a graph with
three nodes (A, B, C) and two edges (A-B and C-A) is defined by the following edgelist file:

A B
C A

• Matrix Market (with the .mtx extension): This format is an extension of the previous one. It
is quite frequent on the network repository website.
Finding or creating a graph database 9

• Adjacency matrix: The adjacency matrix is an NxN matrix (where N is the number of nodes in
the graph) where the ij element is 1 if nodes i and j are connected through an edge and 0
otherwise. The adjacency matrix of the simple graph with three nodes and two edges is a 3x3
matrix, as shown in the following code block. I have explicitly displayed the row and column
names only for convenience, to help you identify what i and j are:

  A B C
A 0 1 0
B 0 0 0
C 1 0 0

Note
The adjacency matrix is one way to vectorize a graph. We’ll come back to this topic in Chapter 7,
Automatically Extracting Features with Graph Embeddings for Machine Learning.

• GraphML: Derived from XML, the GraphML format is much more verbose but lets us define
more complex graphs, especially those where nodes and/or edges carry properties. The following
example uses the preceding graph but adds a name property to nodes and a length property
to edges:

<?xml version='1.0' encoding='utf-8'?>


<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://graphml.graphdrawing.org/
xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.
xsd"
>
    <!-- DEFINING PROPERTY NAME WITH TYPE AND ID -->
    <key attr.name="name" attr.type="string" for="node"
id="d1"/>
    <key attr.name="length" attr.type="double" for="edge"
id="d2"/>
    <graph edgedefault="directed">
       <!-- DEFINING NODES -->
       <node id="A">
             <!-- SETTING NODE PROPERTY -->
            <data key="d1">"Point A"</data>
        </node>
        <node id="B">
10 Introducing and Installing Neo4j

            <data key="d1">"Point B"</data>


        </node>
        <node id="C">
            <data key="d1">"Point C"</data>
        </node>
        <!-- DEFINING EDGES
        with source and target nodes and properties
    -->
        <edge id="AB" source="A" target="B">
            <data key="d2">123.45</data>
        </edge>
        <edge id="CA" source="C" target="A">
            <data key="d2">56.78</data>
        </edge>
    </graph>
</graphml>

If you find a dataset already formatted as a graph, it is likely to be using one of the preceding formats.
However, most of the time, you will want to use your own data, which is not yet in graph format – it
might be stored in the previously described databases or CSV or JSON files. If that is the case, then
the next section is for you! There, you will learn how to transform your data into a graph.

Modeling your data as a graph


The second answer to the main question in this section is: your data is probably a graph, without you
being aware of it yet. We will elaborate on this topic in the next chapter (Chapter 2, Using Existing
Data to Build a Knowledge Graph), but let me give you a quick overview.
Let’s take the example of an e-commerce website, which has customers (users) and products. As in
every e-commerce website, users can place orders to buy some products. In the relational world, the
data schema that’s traditionally used to represent such a scenario is represented on the left-hand side
of the following screenshot:
Finding or creating a graph database 11

Figure 1.3 – Modeling e-commerce data as a graph

The relational data model works as follows:

• A table is created to store users, with a unique identifier (id) and a username (apart from
security and personal information required for such a website, you can easily imagine how to
add columns to this table).
• Another table contains the data about the available products.
• Each time a customer places an order, a new row is added to an order table, referencing the user
by its ID (a foreign key with a one-to-many relationship, where a user can place many orders).
• To remember which products were part of which orders, a many-to-many relationship is
created (an order contains many products and a product is part of many orders). We usually
create a relationship table, linking orders to products (the order product table, in our example).

Note
Please refer to the colored version of the preceding figure, which can be found in the graphics
bundle link provided in the Preface, for a better understanding of the correspondence between
the two sides of the figure.
12 Introducing and Installing Neo4j

In a graph database, all the _id columns are replaced by actual relationships, which are real entities
with graph databases, not just conceptual ones like in the relational model. You can also get rid of the
order product table since information specific to a product in a given order such as the ordered
quantity can be stored directly in the relationship between the order and the product node. The data
model is much more natural and easier to document and present to other people on your team.
Now that we have a better understanding of what a graph database is, let’s explore the different
implementations out there. Like the other types of databases, there is no single implementation for
graph databases, and several projects provide graph database functionalities.
In the next section, we are going to discuss some of the differences between them, and where Neo4j
is positioned in this technology landscape.

Neo4j in the graph databases landscape


Even when restricting the scope to graph databases, there are still different ways to envision such
data stores:

• Resource description framework (RDF): Each record is a triplet of the Subject Predicate
Object type. This is a complex vocabulary that expresses a relationship of a certain type (the
predicate) between a subject and an object; for instance:

Alice(Subject) KNOWS(Predicate) Bob(Object)


Very famous knowledge bases such as DBedia and Wikidata use the RDF format. We will talk about
this a bit more in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph).
• Labeled-property graph (LPG): A labeled-property graph contains nodes and relationships.
Both of these entities can be labeled (for instance, Alice and Bob are nodes with the Person
label, and the relationship between them has the KNOWS label) and have properties (people
have names; an acquaintance relationship can contain the date when both people first met as
a property).

Neo4j is a labeled-property graph. And even there, like MySQL, PostgreSQL, and Microsoft SQL
Server are all relational databases, you will find different vendors proposing LPG graph databases.
They differ in many aspects:

• Whether they use a native graph engine or not: As we discussed earlier, it is possible to use a
KV store or even a SQL database to store graph data. In this case, we’re talking about non-native
storage engines since the storage does not reflect the graphical nature of the data.
• The query language: Unlike SQL, the query language to deal with graph data has not yet been
standardized, even if there is an ongoing effort being led by the GQL group (see, for instance,
https://gql.today/). Neo4j uses Cypher, a declarative query language developed by the
company in 2011 and then open-sourced in the openCypher project, allowing other databases
Neo4j in the graph databases landscape 13

to use the same language (see, for instance, RedisGraph or Amazon Neptune). Other vendors
have created their own languages (AQL for ArangoDB or CQL for TigerGraph, for instance). To
me, this is a key point to take into account since the learning curve can be very different from
one language to another. Cypher has the advantage of being very intuitive and a few minutes
are enough to write your own queries without much effort.
• Their (integrated or not) support for graph analytics and data science.

A note about performances


Almost every vendor claims to be the best one, at least in some aspects. This book won’t create
another debate about that. The best option, if performances are crucial for your application, is
to test the candidates with a scenario close to your final goal in terms of data volume and the
type of queries/analysis.

Neo4j ecosystem
The Neo4j database is already very helpful by itself, but the amount of extensions, libraries, and
applications related to it makes it the most complete solution. In addition, it has a very active community
of members always keen to help each other, which is one of the reasons to choose it.
The core Neo4j database capabilities can be extended thanks to some plugins. Awesome Procedures
on Cypher (APOC), a common Neo4j extension, contains some procedures that can extend the
database and Cypher capabilities. We will use it later in this book to load JSON data.
The main plugin we will explore in this book is the Graph Data Science Library. Its predecessor, the
Graph Algorithm Library, was first released in 2018 by the Neo4j lab team. It was quickly replaced
by the Graph Data Science Library, a fully production-ready plugin, with improved performance.
Algorithms are improved and added regularly. Version 2.0, released in 2021, takes graph data science
even further, allowing us to train models and build analysis pipelines directly from the library. It also
comes with a handy Python client, which is very convenient for including graph algorithms into your
usual machine learning processes, whether you use scikit-learn or other machine learning
libraries such as TensorFlow or PyTorch.
Besides the plugins, there are also lots of applications out there to help us deal with Neo4j and explore
the data it contains. The first application we will use is Neo4j Desktop, which lets us manage several
Neo4j databases. Continue reading to learn how to use it. Neo4j Desktop also lets you manage your
installed plugins and applications.
Applications installed into Neo4j Desktop are granted access to your active database. While reading
this book, you will use the following:

• Neo4j Browser: A simple but powerful application that lets you write Cypher queries and
visualize the result as a graph, table, or JSON:
14 Introducing and Installing Neo4j

Figure 1.4 – Neo4j Browser

• Neo4j Bloom: A graph visualization application in which you can customize node styles (size,
color, and so on) based on their labels and/or properties:

Figure 1.5 – Neo4j Bloom


Setting up Neo4j 15

• Neodash: This is a dashboard application that allows us to draw plots from the data stored in
Neo4j, without having to extract this data into a DataFrame first. Plots can be organized into
nice dashboards that can be shared with other users:

Figure 1.6 – Neodash

This list of applications is non-exhaustive. You can find out more here: https://install.
graphapp.io/.

Good to know
You can create your own graph application to be run within Neo4j Desktop. This is why there
are so many diverse applications, some of which are being developed by community members
or Neo4j partners.

This section described Neo4j as a database and the various extensions that can be added to it to make
it more powerful. Now, it is time to start using it. In the following section, you are going to install
Neo4j locally on our computer so that you can run the code examples provided in this book (which
you are highly encouraged to do!).

Setting up Neo4j
There are several ways to use Neo4j:

• Through short-lived time sandboxes in the cloud, which is perfect for experimenting
• Locally, with Neo4j Desktop
• Locally, with Neo4j binaries
16 Introducing and Installing Neo4j

• Locally, with Docker


• In the cloud, with Neo4j Aura (free plan available) or Neo4j AuraDS

For the scope of this book, we will use the Neo4j Desktop option, since this application takes care of
many things for us and we do not want to go into server management at this stage.

Downloading and starting Neo4j Desktop


The easiest way to use Neo4j on your local computer when you are in the experimentation phase, is
to use the Neo4j Desktop application, which is available on Windows, Mac, and Linux OS. This user
interface lets you create Neo4j databases, which are organized into Projects, manage the installed
plugins and applications, and update the DB configuration – among other things.
Installing it is super easy: go to the Neo4j download center and follow the instructions. We recap the
steps here, with screenshots to guide you through the process:

1. Visit the Neo4j download center at https://neo4j.com/download-center/. At the


time of writing, the website looks like this:

Figure 1.7 – Neo4j Download Center

2. Click the Download Neo4j Desktop button at the top of the page.
3. Fill in the form that’s asking for some information about yourself (name, email, company, and
so on).
4. Click Download Desktop.
5. Save the activation key that is displayed on the next page. It will look something like this (this
one won’t work, so don’t copy it!):

eyJhbGciOiJQUzI1NiIsInR5cCI6IkpXVCJ9.
eyJlbWFpbCI6InN0ZWxsYTBvdWhAZ21haWwuY29tIiwibWl4cGFuZWxJZ
Setting up Neo4j 17

CI6Imdvb2dsZS1vYXV0a
...
...
The following steps depend on your operating system:
‚ On Windows, locate the installer, double-click on it, and follow the steps provided.
‚ On Mac, just click on the downloaded file.
‚ On Linux, you’ll have to make the downloaded file executable before running it. More
instructions will be provided next.
For Linux users, here is how to proceed:
6. When the download is over (this can take some time since the file is a few hundred MBs), open
a Terminal and go to your download directory:

# update path depending on your system


$ cd Downloads/

7. Then, run the following command, which will extract the version and architecture name from
the AppImage file you’ve just downloaded:

$ DESKTOP_VERSION=`ls -tr  neo4j-desktop*.AppImage | tail


-1 | grep -Po "(?<=neo4j-desktop-)[^AppImage]+"
$ echo ${DESKTOP_VERSION}

8. If the preceding echo command shows something like 1.4.11-x86_64., you’re good to
go. Alternatively, you can identify the pattern yourself and create the variable, like so:

$ DESKTOP_VERSION=1.4.11-x86_64.  # include the final dot

9. Then, you need to make the file executable with chmod and run the application:

# make file executable:


$ chmod +x neo4j-desktop-${DESKTOP_VERSION}AppImage
# run the application:
$ ./neo4j-desktop-${DESKTOP_VERSION}AppImage

The last command in the preceding code snippet starts the Neo4j Desktop application. The first
time you run the application, it will ask you for the activation key you saved when downloading the
executable. And that’s it – the application will be running, which means we can start creating Neo4j
databases and interact with them.
18 Introducing and Installing Neo4j

Creating our first Neo4j database


Creating a new database with Neo4j desktop is quite straightforward:

1. Start the Neo4j Desktop application.


2. Click on the Add button in the top-right corner of the screen.
3. Select Local DBMS.
This process is illustrated in the following screenshot:

Figure 1.8 – Adding a new database with Neo4j Desktop

4. The next step is to choose a name, a password, and the version of your database.
Setting up Neo4j 19

Note
Save the password in a safe place; you’ll need to provide it to drivers and applications when
connecting to this database.

5. It is good practice to always choose the latest available version; Neo4j Desktop takes care of
checking which version it is. The following screenshot shows this step:

Figure 1.9 – Choosing a name, password, and version for your new database

6. Next, just click Create, and wait for the database to be created. If the latest Neo4j version needs
to be downloaded, it can take some time, depending on your connection.
7. Finally, you can start your database by clicking on the Start button that appears when you
hover your new database name, as shown in the following screenshot:
20 Introducing and Installing Neo4j

Figure 1.10 – Starting your newly created database

Note
You can’t have two databases running at the same time. If you start a new database while
another is still running, the previous one must be stopped before the new one can be started.

You now have Neo4j Desktop installed and a running instance of Neo4j on your local computer. At
this point, you are ready to start playing with graph data. Before moving on, let me introduce Neo4j
Aura, which is an alternative way to quickly get started with Neo4j.

Creating a database in the cloud – Neo4j Aura


Neo4j also has a DB-as-a-service component called Aura. It lets you create a Neo4j database hosted
in the cloud (either on Google Cloud Platform or Amazon Web Services, your choice) and is fully
managed – there’s no need to worry about updates anymore. This service is entirely free up to a certain
database size (50k nodes and 150k relationships), which makes it sufficient for experimenting with
it. To create a database in Neo4j Aura, visit https://neo4j.com/cloud/platform/aura-
graph-database/.
The following screenshot shows an example of a Neo4j database running in the cloud thanks to the
Aura service:
Inserting data into Neo4j with Cypher, the Neo4j query language 21

Figure 1.11 – Neo4j Aura dashboard with a free-tier instance

Clicking Explore opens Neo4j Bloom, which we will cover in Chapter 3, Characterizing a Graph
Dataset, while clicking Query starts Neo4j Browser in a new tab. You’ll be requested to enter the
connection information for your database. The URL can be found in the previous screenshot – the
username and password are the ones you set when creating the instance.
In the rest of this book, examples will be provided using a local database managed with the Neo4j
Desktop application, but you are free to use whatever technique you prefer. However, note that some
minor changes are to be expected if you choose something different, such as directory location or
plugin installation method. In the latter case, always refer to the plugin or application documentation
to find out the proper instructions.
Now that our first database is ready, it is time to insert some data into it. For this, we will use our first
Cypher queries.

Inserting data into Neo4j with Cypher, the Neo4j query


language
Cypher, as we discussed at the beginning of this chapter, is the query language developed by Neo4j.
It is used by other graph database vendors, such as Redis Graph.
First, let’s create some nodes in our newly created database.
To do so, open Neo4j Browser by clicking on the Open button next to your database and selecting
Neo4j Browser:
22 Introducing and Installing Neo4j

Figure 1.12 – Start the Neo4j Browser application from Neo4j Desktop

From there, you can start and write Cypher queries in the upper text area.
Let’s start and create some nodes with the following Cypher query:

CREATE (:User {name: "Alice", birthPlace: "Paris"})


CREATE (:User {name: "Bob", birthPlace: "London"})
CREATE (:User {name: "Carol", birthPlace: "London"})
CREATE (:User {name: "Dave", birthPlace: "London"})
CREATE (:User {name: "Eve", birthPlace: "Rome"})

Before running the query, let me detail its syntax:

Figure 1.13 – Anatomy of a node creation Cypher statement


Inserting data into Neo4j with Cypher, the Neo4j query language 23

Note that all of these components except for the parentheses are optional. You can create a node with
no label and no properties with CREATE (), even if creating an empty record wouldn’t be really
useful for data storage purposes.

Tips
You can copy and paste the preceding query and execute it as-is; multiple line queries are
allowed by default in Neo4j Browser.
If the upper text area is not large enough, press the Esc key to maximize it.

Now that we’ve created some nodes and since we are dealing with a graph database, it is time to learn
how to connect these nodes by creating edges, or, in Neo4j language, relationships.
The following code snippet starts by fetching the start and end nodes (Alice and Bob), then creates
a relationship between them. The created relationship is of the KNOWS type and carries one property
(the date Alice and Bob met):

MATCH (alice:User {name: "Alice"})


MATCH (bob:User {name: "Bob"})
CREATE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)

We could have also put all our CREATE statements into one big query, for instance, by adding aliases
to the created nodes:

CREATE (alice:User {name: "Alice", birthPlace: "Paris"})


CREATE (bob:User {name: "Bob", birthPlace: "London"})
CREATE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)

Note
In Neo4j, relationships are directed, meaning you have to specify a direction when creating
them, which we can do thanks to the > symbol. However, Cypher lets you select data regardless
of the relationship’s direction. We’ll discuss this when appropriate in the subsequent chapters.

Inserting data into the database is one thing, but without the ability to query and retrieve this data,
databases would be useless. In the next section, we are going to use Cypher’s powerful pattern matching
to read data from Neo4j.
24 Introducing and Installing Neo4j

Extracting data from Neo4j with Cypher pattern matching


So far, we have put some data in Neo4j and explored it with Neo4j Browser. But unsurprisingly, Cypher
also lets you select and return data programmatically. This is what is called pattern matching in the
context of graphs.
Let’s analyze such a pattern:

MATCH (usr:User {birthPlace: "London"})


RETURN usr.name, usr.birthPlace

Here, we are selecting nodes with the User label while filtering for nodes with birthPlace equal to
London. The RETURN statement asks Neo4j to only return the name and the birthPlace property
of the matched nodes. The result of the preceding query, based on the data created earlier, is as follows:

╒══════════╤════════════════╕
│"usr.name"│"usr.birthPlace"│
╞══════════╪════════════════╡
│"Bob"     │"London"        │
├──────────┼────────────────┤
│"Carol"   │"London"        │
├──────────┼────────────────┤
│"Dave"    │"London"        │
└──────────┴────────────────┘

This is a simple MATCH statement, but most of the time, you’ll need to traverse the graph somehow
to explore relationships. This is where Cypher is very convenient. You can write queries with an
easy-to-remember syntax, close to the one you would use when drafting your query on a piece of
paper. As an example, let’s find the users known by Alice, and return their names:

MATCH (alice:User {name: "Alice"})-[:KNOWS]->(u:User)


RETURN u.name

The highlighted part in the preceding query is a graph traversal. From the node(s) matching label,
User, and name, Alice, we are traversing the graph toward another node through a relationship of
the KNOWS type. In our toy dataset, there is only one matching node, Bob, since Alice is connected
to a single relationship of this type.
Summary 25

Note
In our example, we are using a single-node label and relationship type. You are encouraged to
experiment by adding more data types. For instance, create some nodes with the Product
label and relationships of the SELLS/BUYS type between users and products to build more
complex queries.

Summary
In this chapter, you learned about the specificities of graph databases and started to learn about Neo4j
and the tools around it. Now, you know a lot more about the Neo4j ecosystem, including plugins such
as APOC and the graph data science (GDS) library and graph applications such as Neo4j Browser
and Neodash. You installed Neo4j on your computer and created your first graph database. Finally,
you created your first nodes and relationships and built your first Cypher MATCH statement to extract
data from Neo4j.
At this point, you are ready for the next chapter, which will teach you how to import data from various
data sources into Neo4j, using built-in tools and the common APOC library.

Further reading
If you want to explore the concepts described in this chapter in more detail, please refer to the
following references:

• Graph Databases, The Definitive Book of Graph Databases, by I. Robinson, J. Webber, and E.
Eifrem (O’Reilly). The authors, among which is Emil Eifrem, the CEO of Neo technologies,
explain graph databases and graph data modeling, also covering the internal implementation.
Very instructive!
• Learning Neo4j 3.x - Second Edition, by J. Baton and R. Van Bruggen. Even if written for an older
version of Neo4j, most of the concepts it describes are still valid – the newer Neo4j versions
have mostly added new features such as clustering for scalability, without breaking changes.
• The openCypher project (https://opencypher.org/) and GQL specification (https://
www.gqlstandards.org/) to learn about graph query language beyond Cypher.
26 Introducing and Installing Neo4j

Exercises
To make sure you fully understand the content described in this chapter, you are encouraged to think
about the following exercises before moving on:

1. Which information do you need to define a graph?


2. Do you need a graph dataset to start using a graph database?
3. True or false:

A. Neo4j can only be started with Neo4j Desktop.


B. The application to use to create dashboards from Neo4j data is Neo4j Browser.
C. Graph data science is supported by default by Neo4j.

4. Are the following Cypher syntaxes valid, and why/why not? What are they doing?

A. MATCH (x:User) RETURN x.name


B. MATCH (x:User) RETURN x
C. MATCH (:User) RETURN x.name
D. MATCH (x:User)-[k:KNOWS]->(y:User) RETURN x, k, y
E. MATCH (x:User)-[:KNOWS]-(y) RETURN x, y

5. Create more data (other node labels/relationship types) and queries.


2
Importing Data into Neo4j to
Build a Knowledge Graph
As discussed in the previous chapter, you do not have to fetch a graph dataset specifically to work with
a graph database. Many datasets we are used to working with as data scientists contain information
about the relationships between entities, which can be used to model this dataset as a graph. In this
chapter, we will discuss one such example: a Netflix catalog dataset. It contains movies and series
available on the streaming platform, including some metadata, such as director or cast.
First, we are going to study this dataset, which is saved as a CSV file, and learn how to import CSV
data into Neo4j. In a second exercise, we will use the same dataset, stored as a JSON file. Here, we
will have to use the Awesome Procedures on Cypher (APOC) Neo4j plugin to be able to parse this
data and import it into Neo4j.
Finally, we will learn about the existing public knowledge graphs, one of the biggest ones being
Wikidata. We will use its public API to add more information to our graph database.
While exploring the datasets, we will also learn how to deal with temporal and spatial data types
with Neo4j.
In this chapter, we’re going to cover the following main topics:

• Importing CSV data into Neo4j with Cypher


• Introducing the APOC library to deal with JSON data
• Discovering the Wikidata public knowledge graph
• Enriching our graph with Wikidata information
• Importing data in the cloud
28 Importing Data into Neo4j to Build a Knowledge Graph

Technical requirements
To be able to reproduce the examples provided in this chapter, you’ll need the following tools:

• Neo4j installed on your computer (see the installation instructions in the previous chapter).
• The Neo4j APOC plugin (the installation instructions will be provided later in this chapter, in
the Introducing the APOC library to deal with JSON data section).
• Python and the Jupyter Notebook installed. We are not going to cover the installation instructions
in this book. However, they are detailed in the code repository associated with this book (see
the last bullet if you need such details).
• An internet connection to download the plugins and the datasets. This will also be required
for you to use the public API in the last section of this chapter (Enriching our graph with
Wikidata information).
• Any code listed in this book will be available in the associated GitHub repository, https://
github.com/PacktPublishing/Graph-Data-Science-with-Neo4j, in the
corresponding chapter folder.

Importing CSV data into Neo4j with Cypher


The comma-separated values (CSV) file format is the most widely used to share data among data
scientists. According to the dataset of Kaggle datasets (https://www.kaggle.com/datasets/
morriswongch/kaggle-datasets), this format represents more than 57% of all datasets in
this repository, while JSON files account for less than 10%. It is popular for the following reasons:

• How it resembles the tabular data storage format (relational databases)


• Its closeness to the machine learning world of vectors and matrices
• Its readability – you usually just have to read column names to understand what it is about
(of course, a more detailed description is required to understand how the data was collected,
the unit of physical quantities, and so on) and there are no hidden fields (compared to JSON,
where you can only have a key existing from the 1,000th record and later, which is hard to know
without a proper description or advanced data analysis)

Knowing how to deal with such a data format is crucial to be able to use Neo4j in good conditions.
In this section, we will introduce the CSV dataset we are going to use, try and extract entities (both
nodes and relationships) we can model based on this data, and use Cypher to import the data into a
new Neo4j graph.
Another random document with
no related content on Scribd:
Faustus was first brought out at the theatre, Lincoln’s Inn Fields, in
’23. It had so prodigious a run, and came into such vogue, that after
much grumbling about the “legitimate” and invocations of “Ben
Jonson’s ghost” (Hogarth calls him Ben Johnson), the rival Covent
Garden managers were compelled to follow suit, and in ’25 came out
with their Doctor Faustus—a kind of saraband of infernal persons
contrived by Thurmond the dancing-master. He, too, was the deviser
of “Harleykin Sheppard” (or Shepherd), in which the dauntless thief
who escaped from the Middle Stone-room at Newgate in so
remarkable a manner received a pantomimic apotheosis. Quick-
witted Hogarth satirized this felony-mania in the caricature of Wilks,
Booth, and Cibber, conjuring up “Scaramouch Jack Hall.” To return to
Burlington Gate. In the centre, Shakspeare and Jonson’s works are
being carted away for waste paper. To the left you see a huge
projecting sign or show-cloth, containing portraits of his sacred
Majesty George the Second in the act of presenting the
management of the Italian Opera with one thousand pounds; also of
the famous Mordaunt Earl of Peterborough and sometime general of
the armies in Spain. He kneels, and in the handsomest manner, to
Signora Cuzzoni the singer, saying (in a long apothecary’s label),
“Please accept eight thousand pounds!” but the Cuzzoni spurns at
him. Beneath is the entrance to the Opera. Infernal persons with very
long tails are entering thereto with joyful countenances. The infernal
persons are unmistakeable reminiscences of Callot’s demons in the
Tentation de St. Antoine. There is likewise a placard relating to
“Faux’s Long-room,” and his “dexterity of hand.”
In 1724, Hogarth produced another allegory called the Inhabitants
of the Moon, in which there are some covert and not very
complimentary allusions to the “dummy” character of royalty, and a
whimsical fancy of inanimate objects, songs, hammers, pieces of
money, and the like, being built up into imitation of human beings, all
very ingeniously worked out. By this time, Hogarth, too, had begun to
work, not only for the ephemeral pictorial squib-vendors of
Westminster Hall—those squibs came in with him, culminated in
Gillray, and went out with H. B.; or were rather absorbed and
amalgamated into the admirable Punch cartoons of Mr. Leech—but
also for the regular booksellers. For Aubry de la Mottraye’s Travels
(a dull, pretentious book) he executed some engravings, among
which I note A woman of Smyrna in the habit of the country—the
woman’s face very graceful, and the Dance, the Pyrrhic dance of the
Greek islands, and the oddest fandango that ever was seen. One
commentator says that the term “as merry as a grig” came from the
fondness of the inhabitants of those isles of eternal summer for
dancing, and that it should be properly “as merry as a Greek.” Quien
sabe? I know that lately in the Sessions papers I stumbled over the
examination of one Levi Solomon, alias Cockleput, who stated that
he lived in Sweet Apple Court, and that he “went a-grigging for his
living.” I have no Lexicon Balatronicum at hand; but from early
researches into the vocabulary of the “High Mung” I have an
indistinct impression that “griggers” were agile vagabonds who
danced, and went through elementary feats of posture-mastery in
taverns.
In ’24, Hogarth illustrated a translation of the Golden Ass of
Apuleius. The plates are coarse and clumsy; show no humour; were
mere pot-boilers, gagne-pains, thrusts with the burin at the wolf
looking in at the Hogarthian door, I imagine. Then came five
frontispieces for a translation of Cassandra. These I have not seen.
Then fifteen head-pieces for Beaver’s Military Punishments of the
Ancients, narrow little slips full of figures in chiaroscuro, many drawn
from Callot’s curious martyrology, Les Saincts et Sainctes de
l’Année, about three hundred graphic illustrations of human torture!
There was also a frontispiece to the Happy Ascetic, and one to the
Oxford squib of Terræ Filius, in 1724, but of the joyous recluse in
question I have no cognizance.
In 1722 (you see I am wandering up and down the years as well
as the streets), London saw a show—and Hogarth doubtless was
there to see—which merits some lines of mention. The drivelling,
avaricious dotard, who, crossing a room and looking at himself in a
mirror, sighed and mumbled, “That was once a man:”—this poor
wreck of mortality died, and became in an instant, and once more,
John the great Duke of Marlborough. On the 9th of August, 1722, he
was buried with extraordinary pomp in Westminster Abbey. The
saloons of Marlborough House, where the corpse lay in state, were
hung with fine black cloth, and garnished with bays and cypress. In
the death-chamber was a chair of state surmounted by a “majesty
scutcheon.” The coffin was on a bed of state, covered with a “fine
holland sheet,” over that a complete suit of armour, gilt, but empty.
Twenty years before, there would have been a waxen image in the
dead man’s likeness within the armour, but this hideous fantasy of
Tussaud-tombstone effigies had in 1722 fallen into desuetude.[8] The
garter was buckled round the steel leg of this suit of war-harness;
one listless gauntlet held a general’s truncheon; above the vacuous
helmet with its unstirred plumes was the cap of a Prince of the
Empire. The procession, lengthy and splendid, passed from
Marlborough House through St. James’s Park to Hyde Park Corner,
then through Piccadilly, down St. James’s Street, along Pall Mall,
and by King Street, Westminster, to the Abbey. Fifteen pieces of
cannon rambled in this show. Chelsea pensioners, to the number of
the years of the age of the deceased, preceded the car. The colours
were wreathed in crape and cypress. Guidon was there, and the
great standard, and many bannerols and achievements of arms.
“The mourning horse with trophies and plumades” was gorgeous.
There was a horse of state and a mourning horse, sadly led by the
dead duke’s equerries. And pray note: the minutest details of the
procession were copied from the programme of the Duke of
Albemarle’s funeral (Monk); which, again, was a copy of Oliver
Cromwell’s—which, again, was a reproduction, on a more splendid
scale, of the obsequies of Sir Philip Sidney, killed at Zutphen. Who
among us saw not the great scarlet and black show of 1852, the
funeral of the Duke of Wellington? Don’t you remember the eighty-
four tottering old Pensioners, corresponding in number with the
years of our heroic brother departed? When gentle Philip Sidney was
borne to the tomb, thirty-one poor men followed the hearse. The
brave soldier, the gallant gentleman, the ripe scholar, the
accomplished writer was so young. Arthur and Philip! And so century
shakes hand with century, and the new is ever old, and the last
novelty is the earliest fashion, and old Egypt leers from a glass-case,
or a four thousand year old fresco, and whispers to Sir Plume, “I, too,
wore a curled periwig, and used tweezers to remove superfluous
hairs.”
In 1726, Hogarth executed a series of plates for Blackwell’s
Military Figures, representing the drill and manœuvres of the
Honourable Artillery Company. The pike and half-pike exercise are
very carefully and curiously illustrated; the figures evidently drawn
from life; the attitudes very easy. The young man was improving in
his drawing; for in 1724, Thornhill had started an academy for
studying from the round and from life at his own house, in Covent
Garden Piazza; and Hogarth—who himself tells us that his head was
filled with the paintings at Greenwich and St. Paul’s, and to whose
utmost ambition of scratching copper, there was now probably added
the secret longing to be a historico-allegorico-scriptural painter I
have hinted at, and who hoped some day to make Angels sprawl on
coved ceilings, and Fames blast their trumpets on grand staircases
—was one of the earliest students at the academy of the king’s
sergeant painter, and member of parliament for Weymouth. Already
William had ventured an opinion, bien tranchée, on high art. In those
days there flourished—yes, flourished is the word—a now forgotten
celebrity, Kent the architect, gardener, painter, decorator,
upholsterer, friend of the great, and a hundred things besides. This
artistic jack-of-all-trades became so outrageously popular, and
gained such a reputation for taste—if a man have strong lungs, and
persists in crying out that he is a genius, the public are sure to
believe him at last—that he was consulted on almost every tasteful
topic, and was teased to furnish designs for the most incongruous
objects. He was consulted for picture-frames, drinking-glasses,
barges, dining-room tables, garden-chairs, cradles, and birth-day
gowns. One lady he dressed in a petticoat ornamented with columns
of the five orders; to another he prescribed a copper-coloured skirt,
with gold ornaments. The man was at best but a wretched sciolist;
but he for a long period directed the “taste of the town.” He had at
last the presumption to paint an altar-piece for the church of St.
Clement Danes. The worthy parishioners, men of no taste at all,
burst into a yell of derision and horror at this astounding croûte.
Forthwith, irreverent young Mr. Hogarth lunged full butt with his
graver at the daub. He produced an engraving of Kent’s
Masterpiece, which was generally considered to be an unmerciful
caricature; but which he himself declared to be an accurate
representation of the picture. ’Twas the first declaration of his guerra
al cuchillo against the connoisseurs. The caricature, or copy,
whichever it was, made a noise; the tasteless parishioners grew
more vehement, and, at last, Gibson, Bishop of London (whose
brother, by the way, had paid his first visit to London in the company
of Dominie Hogarth), interfered, and ordered the removal of the
obnoxious canvas. “Kent’s masterpiece” subsided into an ornament
for a tavern-room. For many years it was to be seen (together with
the landlord’s portrait, I presume) at the “Crown and Anchor,” in the
Strand. Then it disappeared, and faded away from the visible things
extant.
With another bookseller’s commission, I arrive at another halting-
place in the career of William Hogarth. In 1726-7 appeared his
eighteen illustrations to Butler’s Hudibras. They are of considerable
size, broadly and vigorously executed, and display a liberal
instalment of the vis comica, of which William was subsequently to
be so lavish. Ralpho is smug and sanctified to a nicety. Hudibras is a
marvellously droll-looking figure, but he is not human, is generally
execrably drawn, and has a head preternaturally small, and so
pressed down between the clavicles, that you might imagine him to
be of the family of the anthropophagi, whose heads do grow beneath
their shoulders. There is a rare constable, the perfection of
Dogberryism-cum-Bumbledom, in the tableau of Hudibras in the
stocks. The widow is graceful and beautiful to look at. Unlike Wilkie,
Hogarth could draw pretty women;[9] the rogue who chucks the
widow’s attendant under the chin is incomparable, and Trulla is a
most truculent brimstone. The “committee” is a character full study of
sour faces. The procession of the “Skimmington” is full of life and
animation; and the concluding tableau, “Burning rumps at Temple
Bar,” is a wondrous street-scene, worthy of the ripe Hogarthian
epoch of The Progresses, The Election, Beer Street and Gin Lane.
This edition of Butler’s immortal satire had a great run; and the artist
often regretted that he had parted absolutely, and at once, with his
property in the plates.
So now then, William Hogarth, we part once more, but soon to
meet again. Next shall the moderns know thee—student at
Thornhill’s Academy—as a painter as well as an engraver. A
philosopher—quoique tu n’en doutais guère—thou hast been all
along.

FOOTNOTES
[2] To me there is something candid, naïve, and often
something noble in this personal consciousness and confidence,
this moderate self-trumpeting. “Questi sono miri!” cried Napoleon,
when, at the sack of Milan, the MS. treatises of Leonardo da Vinci
were discovered; and he bore them in triumph to his hotel,
suffering no meaner hand to touch them. He knew—the
Conquering Thinker—that he alone was worthy to possess those
priceless papers. So too, Honoré de Balzac calmly remarking that
there were only three men in France who could speak French
correctly: himself, Victor Hugo, and “Théophile” (T. Gautier). So,
too, Elliston, when the little ballet-girl complained of having been
hissed: “They have hissed me,” said the awful manager, and the
dancing girl was dumb. Who can forget the words that Milton
wrote concerning things of his “that posteritie would not willingly
let die?” and that Bacon left, commending his fame to “foreign
nations and to the next age?” And Turner, simply directing in his
will that he should be buried in St. Paul’s Cathedral? That
sepulchre, the painter knew, was his of right. And innocent
Gainsborough, dying: “We are all going to heaven, and Vandyke
is of the company.” And Fontenelle, calmly expiring at a hundred
years of age: “Je n’ai jamais dit la moindre chose centre la plus
petite vertu.” ’Tis true, that my specious little argument falls
dolefully to the ground when I remember that which the wisest
man who ever lived said concerning a child gathering shells and
pebbles on the sea-shore, when the great ocean of truth lay all
undiscovered before him.
[3] The bezant (from Byzantium) was a round knob on the
scutcheon, blazoned yellow. “Golp” was purple, the colour of an
old black eye, so defined by the heralds. “Sanguine” or “guzes”
were to be congested red, like bloodshot eyes; “torteaux” were of
another kind of red, like “Simnel cakes.” “Pomeis” were to be
green like apples. “Tawny” was orange. There were also “hurts” to
be blazoned blue, as bruises are.—New View of London, 1712.
[4] I believe Pope’s sneer against poor Elkanah Settle (who
died very comfortably in the Charterhouse, 1724, ætat. 76: he
was alive in 1720, and succeeded Rowe as laureate), that he was
reduced in his latter days to compass a motion of St. George and
the Dragon at Bartholomew fair, and himself enacted the dragon
in a peculiar suit of green leather, his own invention, to have been
a purely malicious and mendacious bit of spite. Moreover, Settle
died years after Pope assumed him to have expired.
[5] 1720. The horrible room in Newgate Prison where in
cauldrons of boiling pitch the hangman seethed the dissevered
limbs of those executed for high treason, and whose quarters
were to be exposed, was called “Jack Ketch’s kitchen.”
[6] Compare these voluntary torments with the description of
the Dosèh, or horse-trampling ceremonial of the Sheik El Bekree,
over the bodies of the faithful, in Lane’s Modern Egyptians.
[7] Daniel Button’s well-known coffee-house was on the south
side of Russell Street, Covent Garden, nearly opposite Tom’s.
Button had been a servant of the Countess of Warwick, and so
was patronized by her spouse, the Right Hon. Joseph Addison.
Sir Robert Walpole’s creature, Giles Earl, a trading justice of the
peace (compare Fielding and “300l. a year of the dirtiest money in
the world”) used to examine criminals, for the amusement of the
company, in the public room at Button’s. Here, too, was a lion’s
head letter-box, into which communications for the Guardian were
dropped. At Button’s, Pope is reported to have said of Patrick, the
lexicographer, who made pretensions to criticism, that “a
dictionary-maker might know the meaning of one word, but not of
two put together.”
[8] Not, however, to forget that another Duchess, Marlborough’s
daughter, who loved Congreve so, had after his death a waxen
image made in his effigy, and used to weep over it, and anoint the
gouty feet.
[9] “They said he could not colour,” said old Mrs. Hogarth one
day to John Thomas Smith, showing him a sketch of a girl’s head.
“It’s a lie; look there: there’s flesh and blood for you, my man.”
Studies in Animal Life.
“Authentic tidings of invisible things;—
Of ebb and flow, and ever-during power,
And central peace subsisting at the heart
Of endless agitation.”—The Excursion.

CHAPTER IV.
An extinct animal recognized by its tooth: how came this to be possible?—The task of classification.—
Artificial and natural methods.—Linnæus, and his baptism of the animal kingdom: his scheme of
classification.—What is there underlying all true classification?—The chief groups.—What is a
species?—Re-statement of the question respecting the fixity or variability of species.—The two
hypotheses.—Illustration drawn from the Romance languages.—Caution to disputants.
I was one day talking with Professor Owen in the Hunterian Museum, when a
gentleman approached with a request to be informed respecting the nature of a
curious fossil, which had been dug up by one of his workmen. As he drew the fossil
from a small bag, and was about to hand it for examination, Owen quietly remarked:
—“That is the third molar of the under-jaw of an extinct species of rhinoceros.” The
astonishment of the gentleman at this precise and confident description of the fossil,
before even it had quitted his hands, was doubtless very great. I know that mine was;
until the reflection occurred that if some one, little acquainted with editions, had drawn
a volume from his pocket, declaring he had found it in an old chest, any bibliophile
would have been able to say at a glance: “That is an Elzevir;” or, “That is one of the
Tauchnitz classics, stereotyped at Leipzig.” Owen is as familiar with the aspect of the
teeth of animals, living and extinct, as a student is with the aspect of editions. Yet
before that knowledge could have been acquired, before he could say thus confidently
that the tooth belonged to an extinct species of rhinoceros, the united labours of
thousands of diligent inquirers must have been directed to the classification of
animals. How could he know that the rhinoceros was of that particular species rather
than another? and what is meant by species? To trace the history of this confidence
would be to tell the long story of zoological investigation: a story too long for narration
here, though we may pause awhile to consider its difficulties.
To make a classified catalogue of the books in the British Museum would be a
gigantic task; but imagine what that task would be if all the title-pages and other
external indications were destroyed! The first attempts would necessarily be of a rough
approximative kind, merely endeavouring to make a sort of provisional order amid the
chaos, after which succeeding labours might introduce better and better
arrangements. The books might first be grouped according to size; but having got
them together, it would soon be discovered that size was no indication of their
contents: quarto poems and duodecimo histories, octavo grammars and folio
dictionaries, would immediately give warning that some other arrangement was
needed. Nor would it be better to separate the books according to the languages in
which they were written. The presence or absence of “illustrations” would furnish no
better guide; while the bindings would soon be found to follow no rule. Indeed, one by
one all the external characters would prove unsatisfactory, and the labourers would
finally have to decide upon some internal characters. Having read enough of each
book to ascertain whether it was poetry or prose: and if poetry, whether dramatic, epic,
lyric, or satiric; and if prose, whether history, philosophy, theology, philology, science,
fiction, or essay: a rough classification could be made; but even then there would be
many difficulties, such as where to place a work on the philosophy of history—or the
history of science,—or theology under the guise of science—or essays on very
different subjects; while some works would defy classification.
Gigantic as this labour would be, it would be trifling compared with the labour of
classifying all the animals now living (not to mention extinct species), so that the place
of any one might be securely and rapidly determined; yet the persistent zeal and
sagacity of zoologists have done for the animal kingdom what has not yet been done
for the library of the Museum, although the titles of the books are not absent. It has
been done by patient reading of the contents—by anatomical investigation of the
internal structure of animals. Except on a basis of comparative anatomy, there could
have been no better a classification of animals than a classification of books according
to size, language, binding, &c. An unscientific Pliny might group animals according to
their habitat; but when it was known that whales, though living in the water and
swimming like fishes, were in reality constructed like air-breathing quadrupeds—when
it was known that animals differing so widely as bees, birds, bats, and flying squirrels,
or as otters, seals, and cuttle-fish, lived together in the same element, it became
obvious that such a principle of arrangement could lead to no practical result. Nor
would it suffice to class animals according to their modes of feeding; since in all
classes there are samples of each mode. Equally unsatisfactory would be external
form—the seal and the whale resembling fishes, the worm resembling the eel, and the
eel the serpent.
Two things were necessary: first, that the structure of various animals should be
minutely studied, and described—which is equivalent to reading the books to be
classified;—and secondly, that some artificial method should be devised of so
arranging the immense mass of details as to enable them to be remembered, and also
to enable fresh discoveries readily to find a place in the system. We may be perfectly
familiar with the contents of a book, yet wholly at a loss where to place it. If we have to
catalogue Hegel’s Philosophy of History, for example, it becomes a difficult question
whether to place it under the rubric of philosophy, or under that of history. To decide
this point, we must have some system of classification.
In the attempts to construct a system, naturalists are commonly said to have
followed two methods: the artificial and the natural. The artificial method seizes some
one prominent characteristic, and groups all the individuals together which agree in
this one respect. In Botany the artificial method classes plants according to the organs
of reproduction; but this has been found so very imperfect that it has been abandoned,
and the natural method has been substituted, according to which the whole structure
of the plant determines its place. If flying were taken as the artificial basis for the
grouping of some animals, we should find insects and birds, bats and flying squirrels,
grouped together; but the natural method, taking into consideration not one character,
but all the essential characters, finds that insects, birds, and bats differ profoundly in
their organization: the insect has wings, but its wings are not formed like those of the
bird, nor are those of the bird formed like those of the bat. The insect does not breathe
by lungs, like the bird and the bat; it has no internal skeleton, like the bird and the bat;
and the bird, although it has many points in common with the bat, does not, like it,
suckle its young; and thus we may run over the characters of each organization, and
find that the three animals belong to widely different groups.
It is to Linnæus that we are indebted for the most ingenious and comprehensive of
the many schemes invented for the cataloguing of animal forms; and modern attempts
at classification are only improvements on the plan he laid down. First we may notice
his admirable invention of the double names. It had been the custom to designate
plants and animals according to some name common to a large group, to which was
added a description more or less characteristic. An idea may be formed of the
necessity of a reform, by conceiving what a laborious and uncertain process it would
be if our friends spoke to us of having seen a dog in the garden, and on our asking
what kind of dog, instead of their saying “a terrier, a bull-terrier, or a skye-terrier,” they
were to attempt a description of the dog. Something of this kind was the labour of
understanding the nature of an animal from the vague description of it given by
naturalists. Linnæus rebaptized the whole animal kingdom upon one intelligible
principle. He continued to employ the name common to each group, such as that of
Felis for the cats, which became the generic name; and in lieu of the description which
was given of each different kind, to indicate that it was a lion, a tiger, a leopard, or a
domestic cat, he affixed a specific name: thus the animal bearing the description of a
lion became Felis leo; the tiger, Felis tigris; the leopard, Felis leopardus; and our
domestic friend, Felis catus. These double names, as Vogt remarks, are like the
Christian- and sur-names by which we distinguish the various members of one family;
and instead of speaking of Tomkinson with the flabby face, and Tomkinson with the
square forehead, we simply say John and William Tomkinson.
Linnæus did more than this. He not only fixed definite conceptions of Species and
Genera, but introduced those of Orders and Classes. Cuvier added Families to
Genera, and Sub-kingdoms (embranchements) to Classes. Thus a scheme was
elaborated by which the whole animal kingdom was arranged in subordinate groups:
the sub-kingdoms were divided into classes, the classes into orders, the orders into
families, the families into genera, the genera into species, and the species into
varieties. The guiding principle of anatomical resemblance determined each of these
divisions. Those largest groups, which resemble each other only in having what is
called the typical character in common, are brought together under the first head. Thus
all the groups which agree in possessing a backbone and internal skeleton, although
they differ widely in form, structure, and habitat, do nevertheless resemble each other
more than they resemble the groups which have no backbone. This great division
having been formed, it is seen to arrange itself in very obvious minor divisions, or
Classes—the mammalia, birds, reptiles, and fishes. All mammals resemble each other
more than they resemble birds; all reptiles resemble each other more than they
resemble fishes (in spite of the superficial resemblance between serpents and eels or
lampreys). Each Class again falls into the minor groups of Orders; and on the same
principles: the monkeys being obviously distinguished from rodents, and the carnivora
from the ruminating animals; and so of the rest. In each Order there are generally
Families, and the Families fall into Genera, which differ from each other only in fewer
and less important characters. The Genera include groups which have still fewer
differences, and are called Species; and these again include groups which have only
minute and unimportant differences of colour, size, and the like, and are called Sub-
species, or Varieties.
Whoever looks at the immensity of the animal kingdom, and observes how
intelligibly and systematically it is arranged in these various divisions, will admit that,
however imperfect, the scheme is a magnificent product of human ingenuity and
labour. It is not an arbitrary arrangement, like the grouping of the stars in
constellations; it expresses, though obscurely, the real order of Nature. All true
Classification should be to forms what laws are to phenomena: the one reducing
varieties to systematic order, as the other reduces phenomena to their relation of
sequence. Now if it be true that the classification expresses the real order of nature,
and not simply the order which we may find convenient, there will be something more
than mere resemblance indicated in the various groups; or, rather let me say, this
resemblance itself is the consequence of some community in the things compared,
and will therefore be the mark of some deeper cause. What is this cause? Mr. Darwin
holds that “propinquity of descent—the only known cause of the similarity of organic
beings—is the bond, hidden as it is by various degrees of modification, which is
partially revealed to us by our classifications”[10]—“that the characters which
naturalists consider as showing true affinity between any two or more species are
those which have been inherited from a common parent, and in so far all true
classification is genealogical; that community of descent is the hidden bond which
naturalists have been unconsciously seeking, and not some unknown plan of creation,
or the enunciation of general propositions, and the mere putting together and
separating objects more or less alike.”[11]
Before proceeding to open the philosophical discussion which inevitably arises on
the mention of Mr. Darwin’s book, I will here set down the chief groups, according to
Cuvier’s classification, for the benefit of the tyro in natural history, who will easily
remember them, and will find the knowledge constantly invoked.
There are four Sub-kingdoms, or Branches:—1. Vertebrata. 2. Mollusca. 3.
Articulata. 4. Radiata.
The Vertebrata consist of four classes:—Mammalia, Birds, Reptiles, and Fishes.
The Mollusca consist of six classes:—Cephalopoda (cuttlefish), Pteropoda,
Gasteropoda (snails, &c.), Acephala (oysters, &c.), Brachiopoda, and Cirrhopoda
(barnacles).—N.B. This last class is now removed from the Molluscs and placed
among the Crustaceans.
The Articulata are composed of four classes:—Annelids (worms), Crustacea
(lobsters, crabs, &c.), Arachnida (spiders), and Insecta.
The Radiata embrace all the remaining forms; but this group has been so altered
since Cuvier’s time, that I will not burden your memory just now with an enumeration
of the details.
The reader is now in a condition to appreciate the general line of argument adopted
in the discussion of Mr. Darwin’s book, which is at present exciting very great attention,
and which will, at any rate, aid in general culture by opening to many minds new tracts
of thought. The benefit in this direction is, however, considerably lessened by the
extreme vagueness which is commonly attached to the word “species,” as well as by
the great want of philosophic culture which impoverishes the majority of our
naturalists. I have heard, or read, few arguments on this subject which have not
impressed me with the sense that the disputants really attached no distinct ideas to
many of the phrases they were uttering. Yet it is obvious that we must first settle what
are the facts grouped together and indicated by the word “species,” before we can
carry on any discussion as to the origin of species. To be battling about the fixity or
variability of species, without having rigorously settled what species is, can lead to no
edifying result.
It is notorious that if you ask even a zoologist, What is a species? you will almost
always find that he has only a very vague answer to give; and if his answer be precise,
it will be the precision of error, and will vanish into contradictions directly it is
examined. The consequence of this is, that even the ablest zoologists are constantly
at variance as to specific characters, and often cannot agree whether an animal shall
be considered of a new species, or only a variety. There could be no such
disagreements if specific characters were definite: if we knew what species meant,
once and for all. Ask a chemist, What is a salt? What an acid? and his reply will be
definite, and uniformly the same: what he says, all chemists will repeat. Not so the
zoologist. Sometimes he will class two animals as of different species, when they only
differ in colour, in size, or in the numbers of tentacles, &c.; at other times he will class
animals as belonging to the same species, although they differ in size, colour, shape,
instincts, habits, &c. The dog, for example, is said to be one species with many
varieties, or races. But contrast the pug-dog with the greyhound, the spaniel with the
mastiff, the bulldog with the Newfoundland, the setter with the terrier, the sheepdog
with the pointer: note the striking differences in their structure and their instincts: and
you will find that they differ as widely as some genera, and as most species. If these
varieties inhabited different countries—if the pug were peculiar to Australia, and the
mastiff to Spain—there is not a naturalist but would class them as of different species.
The same remark applies to pigeons and ducks, oxen and sheep.
The reason of this uncertainty is that the thing Species does not exist: the term
expresses an abstraction, like Virtue, or Whiteness; not a definite concrete reality,
which can be separated from other things, and always be found the same. Nature
produces individuals; these individuals resemble each other in varying degrees;
according to their resemblances we group them together as classes, orders, genera,
and species; but these terms only express the relations of resemblance, they do not
indicate the existence of such things as classes, orders, genera, or species.[12] There
is a reality indicated by each term—that is to say, a real relation; but there is no
objective existence of which we could say, This is variable, This is immutable.
Precisely as there is a real relation indicated by the term Goodness, but there is no
Goodness apart from the virtuous actions and feelings which we group together under
this term. It is true that metaphysicians in past ages angrily debated respecting the
Immutability of Virtue, and had no more suspicion of their absurdity, than moderns
have who debate respecting the Fixity of Species. Yet no sooner do we understand
that Species means a relation of resemblance between animals, than the question of
the Fixity, or Variability, of Species resolves itself into this: Can there be any variation
in the resemblances of closely allied animals? A question which would never be
asked.
No one has thought of raising the question of the fixity of varieties, yet it is as
legitimate as that of the fixity of species; and we might also argue for the fixity of
genera, orders, classes; the fixity of all these being implied in the very terms; since no
sooner does any departure from the type present itself, than by that it is excluded from
the category; no sooner does a white object become gray, or yellow, than it is
excluded from the class of white objects. Here, therefore, is a sense in which the
phrase “fixity of species” is indisputable; but in this sense the phrase has never been
disputed. When zoologists have maintained that species are variable, they have
meant that animal forms are variable; and these variations, gradually accumulating,
result at last in such differences as are called specific. Although some zoologists, and
speculators who were not zoologists, have believed that the possibility of variation is
so great that one species may actually be transmuted into another, i.e., that an ass
may be developed into a horse,—yet most thinkers are now agreed that such violent
changes are impossible; and that every new form becomes established only through
the long and gradual accumulation of minute differences in divergent directions.
It is clear, from what has just been said, that the many angry discussions respecting
the fixity of species, which, since the days of Lamarck, have disturbed the amity of
zoologists and speculative philosophers, would have been considerably abbreviated,
had men distinctly appreciated the equivoque which rendered their arguments hazy. I
am far from implying that the battle was purely a verbal one. I believe there was a real
and important distinction in the doctrines of the two camps; but it seems to me that
had a clear understanding of the fact that Species was an abstract term, been
uniformly present to their minds, they would have sooner come to an agreement.
Instead of the confusing disputes as to whether one Species could ever become
another Species, the question would have been, Are animal forms changeable? Can
the descendants of animals become so unlike their ancestors, in certain peculiarities
of structure or instinct, as to be classed by naturalists as a different species?
No sooner is the question thus disengaged from equivoque, than its discussion
becomes narrowed within well-marked limits. That animal forms are variable, is
disputed by no zoologist. The only question which remains is this: To what extent are
animal forms variable? The answers given have been two: one school declaring that
the extent of variability is limited to those trifling characteristics which mark the
different Varieties of each Species; the other school declaring that the variability is
indefinite, and that all animal forms may have arisen from successive modifications of
a very few types, or even of one type.
Now, I would call your attention to one point in this discussion, which ought to be
remembered when antagonists are growing angry and bitter over the subject: it is, that
both these opinions are necessarily hypothetical—there can be nothing like positive
proof adduced on either side. The utmost that either hypothesis can claim is, that it is
more consistent with general analogies, and better serves to bring our knowledge of
various points into harmony. Neither of them can claim to be a truth which warrants
dogmatic decision.
Of these two hypotheses, the first has the weight and majority of authoritative
adherents. It declares that all the different kinds of Cats, for example, were distinct and
independent creations, each species being originally what we see it to be now, and
what it will continue to be as long as it exists: lions, panthers, pumas, leopards, tigers,
jaguars, ocelots, and domestic cats, being so many original stocks, and not so many
divergent forms of one original stock. The second hypothesis declares that all these
kinds of cats represent divergencies of the original stock, precisely as the Varieties of
each kind represent the divergencies of each Species. It is true that each species,
when once formed, only admits of limited variations; any cause which should push the
variation beyond certain limits would destroy the species,—because by species is
meant the group of animals contained within those limits. Let us suppose the original
stock from which all these kinds of cats have sprung, to have become modified into
lions, leopards, and tigers—in other words, that the gradual accumulation of
divergencies has resulted in the whole family of cats existing under these three forms.
The lions will form a distinct species; this species varies, and in the course of long
variation a new species, the puma, rises by the side of it. The leopards also vary, and
let us suppose their variation at length assumes so marked a form,—in the ocelot,—
that we class it as a new species. There is nothing in this hypothesis but what is
strictly consonant with analogies; it is only extending to Species what we know to be
the fact with respect to Varieties; and these Varieties which we know to have been
produced from one and the same Species are often more widely separated from each
other than the lion is from the puma, or the leopard from the ocelot. Mr. Darwin
remarks that “at least a score of pigeons might be chosen, which, if shown to an
ornithologist, and he were told that they were wild birds, would certainly, I think, be
ranked by him as well-defined species. Moreover, I do not believe that any
ornithologist would place the English carrier, the short-faced tumbler, the runt, the
barb, the pouter and fantail in the same genus! more especially as in each of these
breeds several truly-inherited sub-breeds or species, as he might have called them,
could be shown him.”
The development of numerous specific forms, widely distinguished from each other,
out of one common stock, is not a whit more improbable than the development of
numerous distinct languages out of a common parent language, which modern
philologists have proved to be indubitably the case. Indeed, there is a very remarkable
analogy between philology and zoology in this respect: just as the comparative
anatomist traces the existence of similar organs, and similar connections of these
organs, throughout the various animals classed under one type, so does the
comparative philologist detect the family likeness in the various languages scattered
from China to the Basque provinces, and from Cape Comorin across the Caucasus to
Lapland—a likeness which assures him that the Teutonic, Celtic, Windic, Italic,
Hellenic, Iranic, and Indic languages are of common origin, and separated from the
Arabian, Aramean, and Hebrew languages, which have another origin. Let us bring
together a Frenchman, a Spaniard, an Italian, a Portuguese, a Wallachian, and a
Rhætian, and we shall hear six very different languages spoken, the speakers
severally unintelligible to each other, their languages differing so widely that one
cannot be regarded as the modification of the other; yet we know most positively that
all these languages are offshoots from the Latin, which was once a living language,
but which is now, so to speak, a fossil. The various species of cats do not differ more
than these six languages differ: and yet the resemblances point in each case to a
common origin. Max Müller, in his brilliant essay on Comparative Mythology,[13] has
said:—
“If we knew nothing of the existence of Latin—if all historical documents previous to
the fifteenth century had been lost—if tradition, even, was silent as to the former
existence of a Roman empire, a mere comparison of the six Roman dialects would
enable us to say, that at some time there must have been a language from which all
these modern dialects derived their origin in common; for without this supposition it
would be impossible to account for the facts exhibited by these dialects. Let us look at
the auxiliary verb. We find:—

Italian. Wallachian. Rhætian. Spanish. Portuguese. French.


I am sono sum sunt sunt soy sou suis
Thou art sei es eis eres es es
He is e é (este) ei es he est
We are siamo súntemu essen somos somos sommes
You are siete súnteti esses sois sois êtes (estes)
They are sono súnt eân (sun) son são sont.

It is clear, even from a short consideration of these forms, first, that all are but
varieties of one common type; secondly, that it is impossible to consider any one of
these six paradigms as the original from which the others had been borrowed. To this
we may add, thirdly, that in none of the languages to which these verbal forms belong,
do we find the elements of which they could have been composed. If we find such
forms as j’ai aimé, we can explain them by a mere reference to the radical means
which French has still at its command, and the same may be said even of compounds
like j’aimerai, i.e. je-aimer-ai, I have to love, I shall love. But a change from je suis to tu
es is inexplicable by the light of French grammar. These forms could not have grown,
so to speak, on French soil, but must have been handed down as relics from a former
period—must have existed in some language antecedent to any of the Roman
dialects. Now, fortunately, in this case, we are not left to a mere inference, but as we
possess the Latin verb, we can prove how, by phonetic corruption, and by mistaken
analogies, every one of the six paradigms is but a national metamorphosis of the Latin
original.
“Let us now look at another set of paradigms:—

Sanskrit. Lithuanian. Zend. Doric. Old Latin. Gothic. Armen.


Slavonic.
I am ásmi esmi ahmi ἐμμι yesmě sum im em
Thou art ási essi ahi ἐσσὶ yesi es is es
He is ásti esti asti ἐστί yestǒ est ist ê
We (two) are ’svás esva yesva siju
You (two) are ’sthás esta stho? ἕστόν yesta sijuts
They (two) are ’stás (esti) sto? ἐστόν yesta
We are ’smás esmi hmahi ἐσμές yesmǒ sumus sijum emq
You are ’sthá este stha ἐστέ yeste estis sijup êq
They are sánti (esti) hěnti ἐντί somtě sunt sind en

“From a careful consideration of these forms, we ought to draw exactly the same
conclusions; firstly, that all are but varieties of one common type; secondly, that it is
impossible to consider any of them as the original from which the others have been
borrowed; and thirdly, that here again, none of the languages in which these verbal
forms occur possess the elements of which they are composed.”
All these languages resemble each other so closely that they point to some more
ancient language which was to them what Latin was to the six Romance languages;
and in the same way we are justified in supposing that all the classes of the vertebrate
animals point to the existence of some elder type, now extinct, from which they were
all developed.
I have thus stated what are the two hypotheses on this question. There is only one
more preliminary which it is needful to notice here, and that is, to caution the reader
against the tendency, unhappily too common, of supposing that an adversary holds
opinions which are transparently absurd. When we hear an hypothesis which is either
novel, or unacceptable to us, we are apt to draw some very ridiculous conclusion from
it, and to assume that this conclusion is seriously held by its upholders. Thus the
zoologists who maintain the variability of species are sometimes asked if they believe
a goose was developed out of an oyster, or a rhinoceros from a mouse? the
questioner apparently having no misgiving as to the candour of his ridicule. There are
three modes of combating a doctrine. The first is to point out its strongest positions,
and then show them to be erroneous or incomplete; but this plan is generally difficult,
and sometimes impossible; it is not, therefore, much in vogue. The second is to render
the doctrine ridiculous, by pretending that it includes certain extravagant propositions,
of which it is entirely innocent. The third is to render the doctrine odious, by forcing on
it certain conclusions, which it would repudiate, but which are declared to be “the
inevitable consequences” of such a doctrine. Now it is undoubtedly true that men
frequently maintain very absurd opinions; but it is neither candid, nor wise, to assume
that men who otherwise are certainly not fools, hold opinions the absurdity of which is
transparent.
Let us not, therefore, tax the followers of Lamarck, Geoffroy St. Hilaire, or Mr.
Darwin with absurdities they have not advocated; but rather endeavour to see what
solid argument they have for the basis of their hypothesis.
FOOTNOTES
[10] Darwin: Origin of Species, p. 414.
[11] Darwin: Origin of Species, p. 420.
[12] Cuvier says, in so many words, that classes, orders, and genera, are
abstractions, et rien de pareil n’existe dans la nature; but species is not an
abstraction!—See Lettres à Pfaff, p. 179.
[13] See Oxford Essays, 1856.
Strangers Yet!
Strangers yet!
After years of life together,
After fair and stormy weather,
After travel in far lands,
After touch of wedded hands,—
Why thus joined? why ever met?
If they must be strangers yet.

Strangers yet!
After childhood’s winning ways,
After care, and blame, and praise,
Counsel asked, and wisdom given,
After mutual prayers to Heaven,
Child and parent scarce regret
When they part—are strangers yet

Strangers yet!
After strife for common ends,
After title of old friends,
After passion fierce and tender,
After cheerful self-surrender,
Hearts may beat and eyes be wet,
And the souls be strangers yet.

Strangers yet!
Strange and bitter thought to scan
All the loneliness of man!
Nature by magnetic laws
Circle unto circle draws;
Circles only touch when met,
Never mingle—strangers yet.
Strangers yet!
Will it evermore be thus—
Spirits still impervious?
Shall we ever fairly stand
Soul to soul, as hand to hand?
Are the bounds eternal set
To retain us strangers yet?

Strangers yet!
Tell not love it must aspire
Unto something other—higher:
God himself were loved the best,
Were man’s sympathies at rest;
Rest above the strain and fret
Of the world of strangers yet!
Strangers yet!

R. Monckton Milnes.
Framley Parsonage.
CHAPTER X.
Lucy Robarts.

You might also like