Exploratory Data Analysis

Exploratory Data Analysis
Import/Preview Data
Read the CSV file.
Code
df_cities = pd.read_csv('../input/cities.csv')
df_cities.head()
Output
We made a scatter plot of the locations of all the cities. There is a surpise waiting in the
locations of the cities :)
Code
fig = plt.figure(figsize=(20,20))
#cmap, norm = from_levels_and_colors([0.0, 0.5, 1.5], ['red', 'black'])
plt.scatter(df_cities['X'],df_cities['Y'],marker = '.',c=(df_cities.CityId
!= 0).astype(int), cmap='Set1', alpha = 0.6, s = 500*(df_cities.CityId ==
0).astype(int)+1)
plt.show()
Output
It looks like the cities are arranged in a reindeer pattern.

● The red dot indicates the North Pole (CityId = 0).
● All we have to a do a is a find a path that goes from the red dot, touches all the other
dots, and comes back to the red-dot, with minimum total distance travelled !
To see how good this path is, we need a couple of functions:
1. A function to tell if a number is a prime

2. Another function to compute the total distance given a series of numbers
(Euclidean Distance)
Code
# using sieve of eratosthenes
def sieve_of_eratosthenes(n):
primes = [True for i in range(n+1)] # Start assuming all numbers are
primes
primes[0] = False # 0 is not a prime
primes[1] = False # 1 is not a prime
for i in range(2,int(np.sqrt(n)) + 1):
if primes[i]:
k = 2
while i*k <= n:
primes[i*k] = False
k += 1
return(primes)
prime_cities = sieve_of_eratosthenes(max(df_cities.CityId))
df_cities['IsPrime'] = prime_cities
def total_distance(dfcity,path):
prev_city = path[0]
total_distance = 0
step_num = 1
for city_num in path[1:]:
next_city = city_num
total_distance = total_distance + \
np.sqrt(pow((dfcity.X[city_num] - dfcity.X[prev_city]),2) +
pow((dfcity.Y[city_num] - dfcity.Y[prev_city]),2)) * \
(1+ 0.1*((step_num % 10 ==
0)*int(not(prime_cities[prev_city]))))
prev_city = next_city
step_num = step_num + 1
return total_distance
dumbest_path = list(df_cities.CityId[:].append(pd.Series([0])))
print('Total distance with the dumbest path is '+
"{:,}".format(total_distance(df_cities,dumbest_path)))
Output
If we highlight the 'prime' cities as well (green)...
Code
df_cities.plot.scatter(x='X', y='Y', s=0.07, figsize=(15, 10))
# Add north pole in red, and much larger so it stands out
north_pole = df_cities[df_cities.CityId==0]
plt.scatter(north_pole.X, north_pole.Y, c='red', s=20)
# Add in prime cities in green and slightly larger
df_primes = df_cities.loc[df_cities['IsPrime'] == True]
plt.scatter(df_primes.X, df_primes.Y, c='green', s=0.5);
plt.show()
Output
We can see that the city coordinates create a picture of Sant'as reindeer and some trees. The prime
cities seem to be relativley evenly distributed thorughout the "map".
How many Prime cities are there compared to not prime cities?
Code
print(df_cities['IsPrime'].value_counts())
sns.countplot(x='IsPrime', data = df_cities);
Output
There is about one tenth the amount of prime cities as regular cities... makes sense given that each
10th stop should be to a prime city to avoid the 10% distance penalty…
Data Processing: Methods
Sieve of Eratosthenes
Sieve of Eratosthenes is an algorithm used to find all prime numbers smaller than n, when n is
smaller than 10 million or so.
The steps involved are:
1. Create a list of all integers from 2 to n (i.e., the upper limit)
2. At first, let p = 2, the smallest prime number.
3. Starting from p, count up in increments of p (from 2p to n), and mark them in the list
(these will be 2p, 3p, 4p..; the p itself should remain unmarked)
4. Find the first number that is greater than p in the unmarked list. If there’s no such
number, then stop. Otherwise, let p be equal this new number (the next prime), and go
to step 3 and repeat the process.
5. When the algorithm terminates, the remaining unmarked numbers in the list are all the
prime numbers less than n.
The main idea here is that every single unmarked value up to p will be prime, because if it were
composite it would be marked as a multiple of some other, smaller prime. This algorithm can
also be used for n = 1000, 10000, 100000, etc.
Greedy Nearest Neighbor Algorithm

Nearest neighbor was the first greedy algorithm which gave a solution for the traveling
salesman problem. The algorithm was introduced by J. G. Skellam and it was continued by F.
C. Evans and P. J. Clark. In nearest neighbor algorithm, we randomly choose a city as the
starting city and then traverse to all the cities closest to the starting city that do not create any
cycle. This process continues till all cities are visited once.
The steps are:
1. Initialize all vertices as unvisited.
2. Select an arbitrary vertex, set it as the current vertex M (which is the North Pole in our
case)
3. Find out the shortest edge connecting current vertex M and an unvisited vertex N.
4. Set N as the current vertex M.
5. Mark N as visited.
6. If all the vertices in domain are visited, then terminate. Else go to the third step.
7. Add the North Pole as the final stop and add up the total distance travelled.
Tools Used
● Python 3.8.0
● pandas 0.25.3
Approaches and Performance Evaluation
Approach 1: Dumbest Path

Here we go in the order of CityIDs: 0, 1, 2.. etc. and come back to zero when you reach the end.
Code & Output
The first 100 steps of the dumbest path look as follows:-

Code
Output
The dumbest path seems pretty bad. We are sending Santa all over the map, without any
consideration for him whatsoever :)
Approach 2: Slightly better path
Here, we sort the cities in X,Y coordinate_order
Code and Output
We already reduced our total distance by more than half using a simple sorting !! The first 100
steps of the sorted path look as follows:-
Code and Output

Now Santa is going up and down around the same X value (note X axis only ranges from 0-10),
before moving to the next X. This is slightly more efficient path.
Approach 3 : Sorted cities within a grid of squares
If we divide the the whloe network into a grid of X's and Y's, then Santa can cover each square in the
grid, before moving on to the next square.
Code and Output

Code and Output
We can now see that Santa is going to a point, trying to cover many of the cities around that
point, before moving up. This feels like a better way to cover all the cities and gives better
results as expected.
Approach 4: Nearest Neighbor / Greedy Algorithm:

Nearest Neighbor approach is one of the easier heuristic approaches to solve the traveling salesman
problem.
Code and Output
Nearest neighbor algorithm offered a significant improvement. Whats even more mesmerising is
the visualization!!
Code
df_path = pd.DataFrame({'CityId':nnpath}).merge(df_cities,how = 'left')
fig, ax = plt.subplots(figsize=(20,20))
ax.plot(df_path['X'], df_path['Y'])
Output
Approach 5: Nearest Neighbor / Greedy Algorithm With Prime Swaps
So far we haven't used the prime-number related constraint on the path for the optimization. It says
"every 10th step is 10% more lengthy unless coming from a prime CityId". So, it is in our interest to
make sure that the prime numbers end up at the beginning of every 10th step.
Looking at the distribution of prime numbers from the EDA, it is clear that they are evenly spaced.
Hence this approach should not be as costly.
Approach:
● Loop through the path, whenever we encounter a city that has prime CityId, we will
try to swap it with the nearby city that have an index ending in 9 (ie 10th step), if
they makes the total path smaller.
● We try to swap the prime city with two cities that occur before the current city and
have an index ending in 9, or two cities that occur after the current city and have an
index ending in 9.
○ For example if we have a prime city in the 76th place, we will try to
see if that can be swapped with any of the cities in the 59th, 69th,
79th and 89th place and result in a shorter path.
● When checking to see if the path becomes shorter by swapping the cities, we will
only check the length of the sub-path that corresponds to the swap. No need to
check the length of the entire path. This will make the search lot more efficient.
Code and Output

We reduced the total distance by another 650 units with prime swaps!!!
Results
A summary of our results in each step can be seen below:
Method Distance
1. Dumbest path 446884407.5212135
2. Sorted city path 196478811.25956938
3. Sorted cities within a grid path 3226331.4903367283
4. Greedy Nearest Neighbor Algorithm 1812602.1861388374
5. Greedy Nearest Neighbor Algorithm with Prime 1811953.6824856116
Swaps
Since the least amount of distance is covered when we use Greedy Nearest Neighbor Algorithm
with Prime Swaps, we can conclude that it is the most optimal solution for our case.

Exploratory Data Analysis

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Exploratory Data Analysis

Uploaded by

Copyright:

Available Formats

Exploratory Data Analysis

It looks like the cities are arranged in a reindeer pattern.

1. A function to tell if a number is a prime

If we highlight the 'prime' cities as well (green)...

# Add north pole in red, and much larger so it stands out

plt.scatter(north_pole.X, north_pole.Y, c='red', s=20)

# Add in prime cities in green and slightly larger

df_primes = df_cities.loc[df_cities['IsPrime'] == True]

plt.scatter(df_primes.X, df_primes.Y, c='green', s=0.5);

sns.countplot(x='IsPrime', data = df_cities);

Data Processing: Methods

Greedy Nearest Neighbor Algorithm

Approaches and Performance Evaluation

Approach 1: Dumbest Path

Code & Output

The first 100 steps of the dumbest path look as follows:-

Here, we sort the cities in X,Y coordinate_order

Code and Output

Code and Output

Approach 3 : Sorted cities within a grid of squares

Code and Output

Approach 4: Nearest Neighbor / Greedy Algorithm:

Code and Output

You might also like