
I have a set of data in a cluster, which is attack data from the NSL-KDD dataset.
I also have labelled data in which the different types of attack are labelled.
How can I identify whether the attack data in the cluster is also present in the
labelled attack data using Python?

To identify whether the attack data in the cluster is also present in the labelled
attack data using Python, you can follow these steps:

Load both the cluster data and labelled data into separate Pandas dataframes.

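Create a new column in both dataframes that contains a unique identifier for each
row. This identifier can be a concatenation of relevant columns, such as the source
IP, destination IP, and timestamp, if your data has them. NSL-KDD records do not
include IP addresses or timestamps, so for that dataset the identifier has to be
built from feature columns instead (for example duration, protocol_type, service,
flag, src_bytes, and dst_bytes).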

Merge the two dataframes on the unique identifier column using an inner join. This
will create a new dataframe that only contains rows that are present in both the
cluster and labelled data.

Check the size of the new dataframe. If the size is greater than zero, then there
is overlap between the cluster and labelled data. If the size is zero, then there
is no overlap.

Here's a sample code snippet that demonstrates these steps:

import pandas as pd

# Load cluster data into a Pandas dataframe
cluster_data = pd.read_csv('cluster_data.csv')

# Load labelled data into a Pandas dataframe
labelled_data = pd.read_csv('labelled_data.csv')

# Create unique identifier column in both dataframes.
# Each column is cast to str so that numeric columns can be concatenated too.
cluster_data['unique_id'] = (
    cluster_data['src_ip'].astype(str) + '_'
    + cluster_data['dst_ip'].astype(str) + '_'
    + cluster_data['timestamp'].astype(str)
)
labelled_data['unique_id'] = (
    labelled_data['src_ip'].astype(str) + '_'
    + labelled_data['dst_ip'].astype(str) + '_'
    + labelled_data['timestamp'].astype(str)
)

# Merge dataframes on unique identifier column using inner join
merged_data = pd.merge(cluster_data, labelled_data, on='unique_id', how='inner')

# Check size of merged dataframe
if merged_data.shape[0] > 0:
    print('Overlap exists between cluster and labelled data')
else:
    print('No overlap between cluster and labelled data')

Note that the column names and the unique identifier formula used in this example
are placeholders and need to be adjusted to match the structure of your data.
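
As a variation on the steps above, the sketch below skips the string identifier and joins directly on the feature columns the two dataframes share, which also shows which cluster rows matched and which attack label they map to. It is only a minimal illustration under assumptions: the toy rows, the four-column subset, the `label` column name, and the helper `flag_overlap` are all hypothetical, and the column names follow the ones commonly assigned to NSL-KDD features.

```python
import pandas as pd

# Hypothetical toy rows standing in for NSL-KDD records (real rows have 41 features).
cluster_data = pd.DataFrame({
    "duration": [0, 0, 2],
    "protocol_type": ["tcp", "icmp", "tcp"],
    "service": ["private", "ecr_i", "http"],
    "src_bytes": [0, 1032, 300],
})
labelled_data = pd.DataFrame({
    "duration": [0, 0, 5],
    "protocol_type": ["tcp", "icmp", "udp"],
    "service": ["private", "ecr_i", "domain_u"],
    "src_bytes": [0, 1032, 44],
    "label": ["neptune", "smurf", "normal"],
})


def flag_overlap(cluster, labelled, label_col="label"):
    """Left-join cluster rows onto labelled rows using all shared feature columns.

    Assumes the cluster dataframe has no column named `label_col`.
    """
    keys = [c for c in cluster.columns if c in labelled.columns and c != label_col]
    # Drop repeated labelled rows so identical records do not multiply the matches.
    lookup = labelled[keys + [label_col]].drop_duplicates()
    # indicator=True adds a "_merge" column saying where each row was found.
    merged = cluster.merge(lookup, on=keys, how="left", indicator=True)
    merged["in_labelled"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


result = flag_overlap(cluster_data, labelled_data)
print(result[["protocol_type", "service", "in_labelled", "label"]])

# Count the attack labels among the cluster rows that were found in the labelled data.
print(result.loc[result["in_labelled"], "label"].value_counts())
```

A left join keeps every cluster row, so unmatched rows stay visible with `in_labelled` set to False rather than disappearing as they would with an inner join. The match is an exact comparison, so floating-point feature columns may need rounding before the merge.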
