
Name : Nirbhay Hanjura Subject: BDA Lab

Roll No : 201080003 Professor: Prof. Vaibhav Dhore

Experiment 10:
Aim: Flight Data Analysis using Spark GraphX
● Compute the total number of flight routes.
● Compute and sort the longest flight routes.
● Display the airport with the highest degree vertex.
● List the most important airports according to PageRank.
● List the routes with the lowest flight costs.

Theory:

GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level,
GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph
with properties attached to each vertex and edge. To support graph computation, GraphX
exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages)
as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing
collection of graph algorithms and builders to simplify graph analytics tasks.
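
To make these operators concrete, here is a minimal sketch of `aggregateMessages` run in a `spark-shell` session (so a `SparkContext` `sc` is available). The vertices and edges below are made-up toy data, not drawn from the flight dataset; each edge sends the message `1` to its destination, and summing the messages at each vertex yields its in-degree.

```scala
import org.apache.spark.graphx._

// Toy graph: vertex IDs carry city names, directed edges carry a weight.
val verts = sc.parallelize(Seq((1L, "Delhi"), (2L, "Mumbai"), (3L, "Chennai")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(3L, 2L, 1.0), Edge(1L, 3L, 1.0)))
val g = Graph(verts, edges)

// Each triplet sends 1 to its destination vertex; messages are summed per vertex.
val inDeg = g.aggregateMessages[Int](triplet => triplet.sendToDst(1), _ + _)
inDeg.collect()  // vertex 2 receives two messages, vertex 3 receives one
```

This is the same pattern GraphX uses internally for `inDegrees`, which the experiment relies on later.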

GraphX is a graph processing framework built on top of Apache Spark, primarily used for
large-scale graph processing tasks. Here are some advantages and disadvantages:

Advantages:

● Integration with Spark: GraphX seamlessly integrates with Apache Spark, leveraging
its distributed computing capabilities for efficient graph processing on large datasets.
● Performance: It offers high performance for graph processing tasks due to its
distributed nature, enabling parallel computation across clusters of machines.
● Ease of Use: For developers already familiar with Apache Spark, GraphX provides a
familiar programming model, making it relatively easy to learn and use.
● Scalability: GraphX scales well with the size of the dataset and can handle graphs with
billions of vertices and edges, making it suitable for large-scale graph analytics.
● Rich API: It provides a rich set of graph algorithms and operations out of the box,
including graph loading, transformation, and querying, reducing the need for custom
implementations.

Disadvantages:

● Limited Graph Algorithms: While GraphX offers a decent set of built-in graph
algorithms, it may not cover all use cases. Users may need to implement custom
algorithms for specific requirements.
● Complexity: GraphX, like Apache Spark, can be complex to set up and configure,
especially for users who are new to distributed computing frameworks.
● Resource Intensive: GraphX consumes significant computational resources,
particularly memory and CPU, especially when dealing with large-scale graphs, which
could lead to performance bottlenecks on smaller clusters.
● Community and Support: Compared to other graph processing frameworks like
Apache Giraph or GraphLab, GraphX may have a smaller community and less extensive
documentation and support resources.
● Development Overhead: Developing and debugging graph algorithms in a distributed
environment can be challenging, requiring careful consideration of data distribution,
partitioning, and fault tolerance.
Implementation:

Prerequisites
1. OpenJDK version 17 (ARM64 architecture) or higher
2. Oracle JDK 8 or higher for jps and jar utilities
3. Ubuntu 22.04 LTS (ARM64 architecture) or higher
4. openSSH
5. Hadoop 3.x.x ARM64 architecture installed
6. Git installed
7. Python3 installed
8. Spark 3.x installed
Running Hadoop and Spark
In order to run the Scala script in the Spark shell and take advantage of parallelism for the
graph computations, our multi-node Hadoop and Spark setup must be running.
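
On a typical installation the daemons can be brought up with the standard start scripts; the exact paths depend on how Hadoop and Spark were installed, so treat the following as a sketch (it assumes the Hadoop scripts are on `PATH` and `SPARK_HOME` is set):

```shell
# Start the HDFS and YARN daemons (these scripts ship with Hadoop)
start-dfs.sh
start-yarn.sh

# Start the Spark standalone master and workers
$SPARK_HOME/sbin/start-all.sh

# Verify that the daemons (NameNode, DataNode, ResourceManager, etc.) are running
jps
```
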

Adding input data to HDFS


Use the following command to upload data -
$ hadoop fs -copyFromLocal <src> <dest>
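
For example, assuming the flight dataset is a local CSV named `flights.csv` (a hypothetical filename) to be placed under an HDFS directory of our choosing:

```shell
# Create a target directory in HDFS and copy the local file into it
hadoop fs -mkdir -p /bda/input
hadoop fs -copyFromLocal flights.csv /bda/input/

# Confirm the upload
hadoop fs -ls /bda/input
```
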

The following are the first 20 rows of the dataset -


Running the Scala script
Run the Spark shell in order to run the Scala script -
$ spark-shell

Output for number of routes and airports -


Output for longest routes by duration -

The Bangalore–Kolkata route is the longest in duration.

Output for indegrees -

Output for highest page rank -

Output for lowest cost route -


Code

import org.apache.spark.graphx._
import org.apache.spark.sql.functions._

var data = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("flights.csv") // dataset path in HDFS; substitute the actual location
data.show()
data.printSchema()
data.columns
data.count()

var df = data.select("source_city", "destination_city", "duration", "price")

var vertices =
  df.select("source_city").union(df.select("destination_city")).distinct()
vertices = vertices.withColumn("id", monotonically_increasing_id())

vertices.cache()
vertices.show()

var vertices_rdd = vertices.rdd
vertices_rdd.collect()

var kv_rdd = vertices.rdd.map(row => (row.getAs[Long](1), row.getAs[String](0)))
kv_rdd.collect()

var joined = df.join(vertices, Seq("source_city"), "left_outer")
joined.show()

joined = joined.select($"destination_city", $"duration", $"price",
  $"id".alias("origin"))
vertices = vertices.select($"source_city".alias("destination_city"), $"id")
vertices.show()

joined = joined.join(vertices, Seq("destination_city"), "left_outer")
joined = joined.select($"origin", $"id".alias("destination"), $"duration", $"price")
joined.show()

var edges_rdd = joined.rdd.map(row => Edge(row.getAs[Long]("origin"),
  row.getAs[Long]("destination"),
  (row.getAs[Int]("price"), row.getAs[Double]("duration"))))
edges_rdd.collect()

// Default vertex attribute used when an edge references a missing vertex
val nowhere = "nowhere"

val graph = Graph(kv_rdd, edges_rdd, nowhere)

graph.vertices.collect.take(100)
graph.edges.collect.take(100)

println(s"Number of routes: ${graph.numEdges}\n")
println(s"Number of airports: ${graph.numVertices}\n")

// Sort edges by duration (attr._2) in descending order
var sortedEdges = graph.edges.distinct().sortBy(edge => -edge.attr._2)

println("Longest routes by duration: ")
sortedEdges.take(20).foreach(edge =>
  println(s"Source: ${kv_rdd.lookup(edge.srcId)(0)}; " +
    s"Destination: ${kv_rdd.lookup(edge.dstId)(0)}; " +
    s"Cost: ${edge.attr._1}; Duration: ${edge.attr._2} h"))
println("\n")
println("\n")

println("Indegrees of each airport: ")
graph.inDegrees.collect()
println("\n")

// Sort airports by in-degree in descending order
var inDegrees = graph.inDegrees.distinct().sortBy(-1 * _._2)

println("Airports with highest indegrees: ")
inDegrees.take(100)
println("\n")

println(s"Airport with highest indegree: ${kv_rdd.lookup(inDegrees.take(1)(0)._1)(0)} " +
  s"with an indegree of ${inDegrees.take(1)(0)._2}")
println("\n")

// Run PageRank with a convergence tolerance of 0.00001
var pageRank = graph.pageRank(0.00001)

pageRank.vertices.sortBy(-_._2).collect()
println(s"The airport with the highest page rank is " +
  s"${kv_rdd.lookup(pageRank.vertices.sortBy(-_._2).take(1)(0)._1)(0)} and has " +
  s"a page rank value of ${pageRank.vertices.sortBy(-_._2).take(1)(0)._2}")
println("\n")

// Sort edges by price (attr._1) in ascending order
var sortedPrices = graph.edges.distinct().sortBy(edge => edge.attr._1)

println("Routes with lowest cost: ")
sortedPrices.take(20).foreach(edge =>
  println(s"Source: ${kv_rdd.lookup(edge.srcId)(0)}; " +
    s"Destination: ${kv_rdd.lookup(edge.dstId)(0)}; " +
    s"Cost: ${edge.attr._1}; Duration: ${edge.attr._2} h"))

Conclusion:
In this experiment utilizing Spark GraphX for Flight Data Analysis, several key insights were
derived. Firstly, by computing the total number of flight routes, we gained a comprehensive
understanding of the connectivity within the air transportation network. Sorting the longest flight
routes provided valuable information on the extent of air travel distances. Identifying the airport
with the highest degree vertex highlighted hubs of significant connectivity, crucial for logistical
planning. Utilizing PageRank, we determined the most important airports, aiding in resource
allocation and network optimization. Finally, by listing routes with the lowest flight costs, we
discerned opportunities for cost-effective travel options. Through this experiment, I honed my
skills in data analysis using Spark GraphX and gained insights into the dynamics of the aviation
industry. The practical applications of this analysis include enhancing route planning, optimizing
resource allocation, and improving cost efficiency within the air transportation system.
