
Name : Nirbhay Hanjura Subject: BDA Lab

Roll No : 201080003 Professor: Prof. Vaibhav Dhore

Experiment 10:
Aim: Flight Data Analysis using Spark GraphX
● Compute the total number of flight routes.
● Compute and sort the longest flight routes.
● Display the airport with the highest degree vertex.
● List the most important airports according to PageRank.
● List the routes with the lowest flight costs.

Theory:

GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level,
GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph
with properties attached to each vertex and edge. To support graph computation, GraphX
exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages)
as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing
collection of graph algorithms and builders to simplify graph analytics tasks.
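
To make these operators concrete, here is a minimal sketch of `aggregateMessages` run in a `spark-shell` session (so a `SparkContext` `sc` is available). The vertices and edges below are made-up toy data, not drawn from the flight dataset; each edge sends the message `1` to its destination, and summing the messages at each vertex yields its in-degree.

```scala
import org.apache.spark.graphx._

// Toy graph: vertex IDs carry city names, directed edges carry a weight.
val verts = sc.parallelize(Seq((1L, "Delhi"), (2L, "Mumbai"), (3L, "Chennai")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(3L, 2L, 1.0), Edge(1L, 3L, 1.0)))
val g = Graph(verts, edges)

// Each triplet sends 1 to its destination vertex; messages are summed per vertex.
val inDeg = g.aggregateMessages[Int](triplet => triplet.sendToDst(1), _ + _)
inDeg.collect()  // vertex 2 receives two messages, vertex 3 receives one
```

This is the same pattern GraphX uses internally for `inDegrees`, which the experiment relies on later.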

GraphX is a graph processing framework built on top of Apache Spark, primarily used for
large-scale graph processing tasks. Here are some advantages and disadvantages:

Advantages:

● Integration with Spark: GraphX seamlessly integrates with Apache Spark, leveraging
its distributed computing capabilities for efficient graph processing on large datasets.
● Performance: It offers high performance for graph processing tasks due to its
distributed nature, enabling parallel computation across clusters of machines.
● Ease of Use: For developers already familiar with Apache Spark, GraphX provides a
familiar programming model, making it relatively easy to learn and use.
● Scalability: GraphX scales well with the size of the dataset and can handle graphs with
billions of vertices and edges, making it suitable for large-scale graph analytics.
● Rich API: It provides a rich set of graph algorithms and operations out of the box,
including graph loading, transformation, and querying, reducing the need for custom
implementations.

Disadvantages:

● Limited Graph Algorithms: While GraphX offers a decent set of built-in graph
algorithms, it may not cover all use cases. Users may need to implement custom
algorithms for specific requirements.
● Complexity: GraphX, like Apache Spark, can be complex to set up and configure,
especially for users who are new to distributed computing frameworks.
● Resource Intensive: GraphX consumes significant computational resources,
particularly memory and CPU, especially when dealing with large-scale graphs, which
could lead to performance bottlenecks on smaller clusters.
● Community and Support: Compared to other graph processing frameworks like
Apache Giraph or GraphLab, GraphX may have a smaller community and less extensive
documentation and support resources.
● Development Overhead: Developing and debugging graph algorithms in a distributed
environment can be challenging, requiring careful consideration of data distribution,
partitioning, and fault tolerance.
Implementation:

Prerequisites
1. OpenJDK version 17 (ARM64 architecture) or higher
2. Oracle JDK 8 or higher for jps and jar utilities
3. Ubuntu 22.04 LTS (ARM64 architecture) or higher
4. openSSH
5. Hadoop 3.x.x ARM64 architecture installed
6. Git installed
7. Python3 installed
8. Spark 3.x installed
Running Hadoop and Spark
In order to run the Scala script in the Spark shell and take advantage of parallelism for the
graph computations, our multi-node Hadoop and Spark setup must be running.
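
On a typical installation the daemons can be brought up with the standard start scripts; the exact paths depend on how Hadoop and Spark were installed, so treat the following as a sketch (it assumes the Hadoop scripts are on `PATH` and `SPARK_HOME` is set):

```shell
# Start the HDFS and YARN daemons (these scripts ship with Hadoop)
start-dfs.sh
start-yarn.sh

# Start the Spark standalone master and workers
$SPARK_HOME/sbin/start-all.sh

# Verify that the daemons (NameNode, DataNode, ResourceManager, etc.) are running
jps
```
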

Adding input data to HDFS


Use the following command to upload data -
$ hadoop fs -copyFromLocal <src> <dest>
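
For example, assuming the flight dataset is a local CSV named `flights.csv` (a hypothetical filename) to be placed under an HDFS directory of our choosing:

```shell
# Create a target directory in HDFS and copy the local file into it
hadoop fs -mkdir -p /bda/input
hadoop fs -copyFromLocal flights.csv /bda/input/

# Confirm the upload
hadoop fs -ls /bda/input
```
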

The following are the first 20 rows of the dataset -


Running the Scala script
Run the Spark shell in order to run the Scala script -
$ spark-shell

Output for number of routes and airports -


Output for longest routes by duration -

The Bangalore–Kolkata route is the longest in duration.

Output for indegrees -

Output for highest page rank -

Output for lowest cost route -


Code

import org.apache.spark.graphx._
import org.apache.spark.sql.functions._

var data = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("flights.csv") // dataset path in HDFS; substitute the actual location
data.show()
data.printSchema()
data.columns
data.count()

var df = data.select("source_city", "destination_city", "duration", "price")

var vertices =
  df.select("source_city").union(df.select("destination_city")).distinct()
vertices = vertices.withColumn("id", monotonically_increasing_id())

vertices.cache()
vertices.show()

var vertices_rdd = vertices.rdd
vertices_rdd.collect()

var kv_rdd = vertices.rdd.map(row => (row.getAs[Long](1), row.getAs[String](0)))
kv_rdd.collect()

var joined = df.join(vertices, Seq("source_city"), "left_outer")
joined.show()

joined = joined.select($"destination_city", $"duration", $"price",
  $"id".alias("origin"))
vertices = vertices.select($"source_city".alias("destination_city"), $"id")
vertices.show()

joined = joined.join(vertices, Seq("destination_city"), "left_outer")
joined = joined.select($"origin", $"id".alias("destination"), $"duration", $"price")
joined.show()

var edges_rdd = joined.rdd.map(row => Edge(row.getAs[Long]("origin"),
  row.getAs[Long]("destination"),
  (row.getAs[Int]("price"), row.getAs[Double]("duration"))))
edges_rdd.collect()

// Default vertex attribute used when an edge references a missing vertex
val nowhere = "nowhere"

val graph = Graph(kv_rdd, edges_rdd, nowhere)

graph.vertices.collect.take(100)
graph.edges.collect.take(100)

println(s"Number of routes: ${graph.numEdges}\n")
println(s"Number of airports: ${graph.numVertices}\n")

// Sort edges by duration (attr._2) in descending order
var sortedEdges = graph.edges.distinct().sortBy(edge => -edge.attr._2)

println("Longest routes by duration: ")
sortedEdges.take(20).foreach(edge =>
  println(s"Source: ${kv_rdd.lookup(edge.srcId)(0)}; " +
    s"Destination: ${kv_rdd.lookup(edge.dstId)(0)}; " +
    s"Cost: ${edge.attr._1}; Duration: ${edge.attr._2} h"))
println("\n")
println("\n")

println("Indegrees of each airport: ")
graph.inDegrees.collect()
println("\n")

// Sort airports by in-degree in descending order
var inDegrees = graph.inDegrees.distinct().sortBy(-1 * _._2)

println("Airports with highest indegrees: ")
inDegrees.take(100)
println("\n")

println(s"Airport with highest indegree: ${kv_rdd.lookup(inDegrees.take(1)(0)._1)(0)} " +
  s"with an indegree of ${inDegrees.take(1)(0)._2}")
println("\n")

// Run PageRank with a convergence tolerance of 0.00001
var pageRank = graph.pageRank(0.00001)

pageRank.vertices.sortBy(-_._2).collect()
println(s"The airport with the highest page rank is " +
  s"${kv_rdd.lookup(pageRank.vertices.sortBy(-_._2).take(1)(0)._1)(0)} and has " +
  s"a page rank value of ${pageRank.vertices.sortBy(-_._2).take(1)(0)._2}")
println("\n")

// Sort edges by price (attr._1) in ascending order
var sortedPrices = graph.edges.distinct().sortBy(edge => edge.attr._1)

println("Routes with lowest cost: ")
sortedPrices.take(20).foreach(edge =>
  println(s"Source: ${kv_rdd.lookup(edge.srcId)(0)}; " +
    s"Destination: ${kv_rdd.lookup(edge.dstId)(0)}; " +
    s"Cost: ${edge.attr._1}; Duration: ${edge.attr._2} h"))

Conclusion:
In this experiment utilizing Spark GraphX for Flight Data Analysis, several key insights were
derived. Firstly, by computing the total number of flight routes, we gained a comprehensive
understanding of the connectivity within the air transportation network. Sorting the longest flight
routes provided valuable information on the extent of air travel distances. Identifying the airport
with the highest degree vertex highlighted hubs of significant connectivity, crucial for logistical
planning. Utilizing PageRank, we determined the most important airports, aiding in resource
allocation and network optimization. Finally, by listing routes with the lowest flight costs, we
discerned opportunities for cost-effective travel options. Through this experiment, I honed my
skills in data analysis using Spark GraphX and gained insights into the dynamics of the aviation
industry. The practical applications of this analysis include enhancing route planning, optimizing
resource allocation, and improving cost efficiency within the air transportation system.
