
UNet: Real-Time Lane Detection for Autonomous Driving & GRAD-CAM visualization

Dr. Rajeev Gupta∗


Department of Computer Engineering, PDEU University
ze.w@duke.edu

Devshree Jadeja
Devshree.jce20@sot.pdpu.ac.in

Abstract

Lane detection locates lanes on the road and provides the accurate location and shape of each lane for path planning. This study introduces a robust technique for detecting lane lines and objects, essential components for autonomous vehicles and Advanced Driver-Assistance Systems (ADAS). Using Convolutional Neural Networks (CNNs), we craft a specialized model trained on an annotated dataset of MATLAB images. Our model integrates batch normalization and max pooling layers, strategically placed among convolutional and transposed convolutional layers, ensuring accurate reconstruction of the input image. The distinctive architecture of our CNN model features four convolutional layers, each equipped with increasingly refined filters, followed by batch normalization and max pooling layers that collectively reduce internal covariate shift and mitigate overfitting. The methodology extends to video analysis, where a combination of the pre-trained You Only Look Once (YOLO) model and the MoviePy library facilitates frame-by-frame object detection. Moreover, to visualize the key detection regions, we utilize gradient-weighted class activation mapping (Grad-CAM), thereby boosting the comprehensibility of our model. With an accuracy of 0.99 and a macro average F1-score of 0.99, our method shows promise for practical applications in autonomous navigation, surpassing benchmarks in both precision and dependability.

Keywords: Lane Detection, Autonomous Vehicles, Convolutional Neural Networks (CNNs), You Only Look Once
(YOLO), Gradient-Weighted Class Activation Mapping (Grad-CAM)

1. Introduction
According to the 2013 European Accident Research and Safety Report, human error accounts for more than 90% of driving accidents. One such challenge pertains to navigation within dynamic or uncertain environments. Visual system solutions, in particular, are critical to tackling these issues. For minimal functionality, these solutions could be built on a single monocular camera linked with a morphological image processing operator [1-3]. The algorithms employ different representations for lane detection, such as fixed-width line pairs, spline ribbons, and deformable-template models, along with a range of detection and tracking techniques, including the Hough transform, probabilistic fitting, and Kalman filtering. Additionally, both stereo and monocular modalities have been explored. However, previous approaches often face limitations due to real-time constraints and slower processor speeds. Consequently, lane markings have typically been detected based on simple gradient changes, and earlier studies have mainly focused on straight roads or highways with clear lane markings and minimal obstacles. Lane detection algorithms, such as the popular random sample consensus (RANSAC) method [Kim, ZuWhan, 2008], have limitations when confronted with complex road scenes that include shadows, occlusions, and curves. To overcome these challenges, we introduce a novel approach that incorporates Convolutional Neural Networks (CNNs). Unlike traditional methods that rely solely on RANSAC, our model leverages the power of CNNs to enhance input images and extract regions of interest (ROIs) [Kim, J. & Lee, M, 2014]. Our approach to lane detection innovatively combines CNNs, efficient pre-processing, and rich labeled data to offer a more versatile solution. Unlike traditional algorithms reliant on rigid, hand-crafted features, our CNN learns to identify complex patterns autonomously, increasing its adaptability to diverse scenarios. Robust pre-processing ensures the model is not swayed by outliers or sequence biases, enhancing its generalizability. Utilizing labeled data, our model excels at identifying lanes under varied real-world conditions, surpassing the capabilities of conventional methods. This makes our solution an improved, more universally applicable alternative for lane detection.

Fig (1). Flow of the work

Our lane detection solution employs a CNN model architecture that includes convolutional layers, batch normalization layers, max pooling layers, and transposed convolutional layers. The aim of each layer is to extract and enhance information from the input images. This architecture is similar to the U-Net architecture, allowing it to outperform traditional methods that rely solely on simple gradient changes. The parts of the image that were significant for the detection decision are also displayed using a technique called Gradient-weighted Class Activation Mapping (Grad-CAM), providing insight into the model's decision-making process. Grad-CAM creates heat maps that show the regions of an image that the CNN model prioritized while making a prediction. This enables us to better understand the CNN model's decision-making process and enhance its performance.

To recognize and track lane lines and objects in real time, the trained CNN model is applied to each frame of a video. We split a video into individual frames using the MoviePy library in order to apply the model to it. Additionally, we employ a pre-trained You Only Look Once (YOLO) model to recognize objects in the video. YOLO is a deep-learning-based object detection system that finds objects in both still images and video. Its fast inference and high accuracy make it a strong option for real-time object recognition in video, and because it was trained on a large dataset of images it can recognize many object classes, making it suitable for a range of applications. The fundamental concept of the proposed method is to use encoding-decoding layers to find lane markings. This one-dimensional distribution might offer enough differentiation for lane marking identification because lane markings have a distinctive pattern. However, it becomes difficult to distinguish between the distributions of lane-marking and non-lane-marking areas under difficult road and weather conditions. As a result, we propose a one-dimensional deep learning strategy for classification.
Max pooling helps to shrink the size of the feature maps, which can increase the model's efficiency, while batch normalization helps to stabilize the training process. These layer choices enable the model to track lane lines in real time even under difficult circumstances such as bad lighting and unclear road markings. The use of Grad-CAM makes it simple to debug the model and aids in understanding the ROIs. Consequently, in this work, an explainable artificial intelligence (XAI) technique is employed to evaluate the data acquired by the inception U-Net model using Grad-CAM post-hoc explainability techniques, in order to close the aforementioned research gap. The U-Net based encoder-decoder with Grad-CAM visualization used by [4] also increased the efficiency of the model by 10%.
The suggested solution prioritizes speed and ease of use so that it can be incorporated into low-cost CPUs for video lane recognition. We achieved stability and a real-time prediction mechanism by splitting each frame in half and averaging the mask over recent estimates. The technology of lane line detection, which is a subset of linear object identification and extraction, has numerous uses in traffic scenarios [5].

The remainder of this paper is structured as follows: Section 2 concisely describes the architecture of our model. Section 3 provides the details of the training data and the training process. Section 4 gives a brief overview of other lane detection methods. Sections 5 and 6 present the experimental setup and results. We discuss limitations and future work in Sections 7 and 8, and conclude the paper with Section 9.

In terms of detection performance versus processing speed, experimental findings demonstrate that the proposed approach beats existing approaches in the literature, even under these difficult conditions. Although high-performance deep-learning-based methods are challenging to implement on embedded platforms with limited resources, the proposed method stands out as a solution due to its much shorter processing time. The unique integration of lane detection and object recognition in one system results in a comprehensive solution for understanding the road environment. This simplifies system integration while also enhancing ADAS and autonomous navigation decision-making and overall safety. Gradient-weighted Class Activation Mapping (Grad-CAM), which offers thorough insights into the decision-making process, greatly supports the understandability of our system. What distinguishes our approach is the U-Net-inspired design, which outperforms gradient-based methods, combined with strong pre-processing techniques that ensure generalizability.

2. Lane Line Detection Network

It has been shown that the solution space of a convolutional neural network (CNN) can be expanded by increasing its depth or its width. In the ILSVRC 2014 [9], deeper and wider designs helped GoogLeNet, VGG, and AlexNet achieve higher accuracy. AlexNet was the first CNN to win the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 with state-of-the-art accuracy; it had eight layers, which was considered deep at the time. When GoogLeNet was introduced in 2014, it had 22 layers and used an innovative framework called the Inception architecture, which allowed it to extract more complex information from each incoming image. VGG made its debut in 2014 with 16-19 layers.
The design was simpler than GoogLeNet, yet it still achieved the highest accuracy on ILSVRC. Gradients may explode or vanish as CNNs grow deeper: exploding gradients result when a CNN's weights are too large, and vanishing gradients when they are too small. Either of these issues can make it challenging for a CNN to learn from the training data. To avoid these issues, we designed the CNN model carefully, with batch normalization implemented to make it less prone to overfitting and to improve generalization. Our lane line detection model is built using an encoder-decoder network with skip connections, similar to the U-Net architecture. The model takes an input image and produces a segmentation mask that highlights the lane lines on a pixel-by-pixel basis. The encoder-decoder structure and skip connections allow the model to capture both local and global information, making it easier to localize lane lines precisely. This architecture is suitable for lane line detection because it has demonstrated excellent performance in semantic segmentation challenges.
2.1. Model Architecture

We successfully detected lanes by training a U-Net-like model on annotated lane images. For the training process, we employed a collection of annotated lane photos in which the ground truth segmentation masks emphasize the lane lines. By analyzing this training data, the U-Net-like network learns to predict the presence of lane lines in an input image. While the convolutional layers learn to extract essential details from the image, the decoder component reconstructs the lanes' spatial representation. The model is trained to minimize the mean squared error between the predicted segmentation mask and the ground truth mask. By doing so, it gains the ability to recognize and segment the lane lines precisely, enabling accurate lane recognition in unseen images.
The architecture effectively omits extraneous information while concentrating on the features important for lane detection by combining the encoded information from the bottleneck with the up-sampling of the decoder. This makes it easier for the model to separate lane lines from other parts of the image, such as background objects, noise, or shifting road conditions. CNN encoder-decoder networks for real-time deep lane detection have many uses in dynamic environments and on roads with complicated conditions [6]. The feature maps contain in-depth representations of the input image even though the spatial resolution is significantly lower close to the architecture's bottleneck. By removing the less useful spatial data, the bottleneck helps to highlight the important lane-related features.

2.1.1. About the Dataset: We used annotated MATLAB images that have lane lines added to them. 11,487 photos were used to train the model, and 1,277 images were reserved for validation. To guarantee a balanced distribution, the dataset was randomly divided into training and validation sets with a ratio of 90:10. The availability of annotated MATLAB images made correct training and evaluation of the suggested method much easier. The image data as well as the pixel-level labels are stored in a serialized format within these files.

2.1.2. Processing: The dataset was preprocessed before training the CNN model. Images and labels were loaded
from pickle files created from labeled MATLAB images and converted to NumPy arrays. Labels were
normalized to a range of 0 to 1, matching the normalized pixel values of the input images. Data shuffling
was performed to eliminate biases during training, ensuring random ordering of images and labels.
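
A minimal sketch of this preprocessing is shown below; the pickle file names and the 90:10 split index are assumptions for illustration, not the exact ones used in our pipeline.

import pickle
import numpy as np

# Load serialized image and label arrays (file names are assumed for illustration).
with open("full_CNN_train.p", "rb") as f:
    images = np.array(pickle.load(f), dtype=np.float32)
with open("full_CNN_labels.p", "rb") as f:
    labels = np.array(pickle.load(f), dtype=np.float32)

if labels.ndim == 3:                      # ensure masks have a channel axis: (N, H, W, 1)
    labels = labels[..., np.newaxis]

# Normalize pixel values and labels to the range [0, 1].
images /= 255.0
labels /= 255.0

# Shuffle images and labels with the same permutation to remove ordering bias.
rng = np.random.default_rng(seed=42)
perm = rng.permutation(len(images))
images, labels = images[perm], labels[perm]

# Hold out roughly 10% of the data for validation (90:10 split, as in Section 2.1.1).
split = int(0.9 * len(images))
X_train, X_val = images[:split], images[split:]
y_train, y_val = labels[:split], labels[split:]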

2.1.3. Data Augmentation: To further strengthen the robustness and generalizability of our model, we used data
augmentation approaches in addition to preprocessing. Augmentation produces variations of the training data
by applying random transformations such as rotation, scaling, and flipping to the images and their
corresponding labels. This makes the model more adaptable to various road conditions and viewpoints
and better prepared to handle the difficulties presented by real-world circumstances. By augmenting our
dataset we effectively made it larger, enabling the CNN model to learn from a wider variety of samples
and enhancing its precision in lane detection.
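
One way to realize this augmentation is with the Keras ImageDataGenerator, applying the same random transformation to each image and its mask through a shared seed; the parameter ranges below are illustrative assumptions rather than the exact settings used in our experiments.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation parameters are illustrative; the exact ranges used in practice may differ.
datagen = ImageDataGenerator(
    rotation_range=10,        # small random rotations
    zoom_range=0.1,           # random scaling
    width_shift_range=0.1,    # horizontal shifts
    height_shift_range=0.1,   # vertical shifts
    horizontal_flip=True,     # mirror the road scene
)

# Identical transformations must be applied to images and masks, so the two
# generators share the same random seed. Masks are rank-4 arrays (N, H, W, 1).
seed = 1
image_gen = datagen.flow(X_train, batch_size=32, seed=seed)
mask_gen = datagen.flow(y_train, batch_size=32, seed=seed)
train_generator = zip(image_gen, mask_gen)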

2.1.4. Class Balancing: We used class balancing approaches to make sure the model is not biased toward one
class of lane lines over others. This required changing the sampling plan during training to account for
the relatively low number of specific lane line types (such as dashed or curved lines). By balancing the
classes, we prevented the model from becoming overly focused on the majority class, enhancing its
capacity to precisely detect and differentiate all varieties of lane markers.
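
The exact sampling plan is not spelled out here, but one simple balancing scheme is to oversample images whose masks contain comparatively few lane pixels; the sketch below illustrates this assumed scheme, not the precise procedure used in our experiments.

import numpy as np

# Share of lane pixels per training image; images with sparse markings
# (e.g. short dashed or curved segments) get a larger sampling weight.
lane_fraction = y_train.reshape(len(y_train), -1).mean(axis=1)
median_fraction = np.median(lane_fraction)

sample_weights = np.clip(median_fraction / (lane_fraction + 1e-6), 1.0, 5.0)
sample_probs = sample_weights / sample_weights.sum()

# Draw a balanced epoch's worth of indices with replacement.
rng = np.random.default_rng(0)
balanced_idx = rng.choice(len(X_train), size=len(X_train), replace=True, p=sample_probs)
X_balanced, y_balanced = X_train[balanced_idx], y_train[balanced_idx]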

2.1.5. Validation Set: A distinct validation set was created to track the model's effectiveness during training.
The 10% of photos set aside for validation allowed us to evaluate the model's accuracy and generalization
on unseen data. Routine validation checks prevented overfitting during training, so the model learned to
recognize lanes and objects successfully without memorizing the training data. Our accuracy and F1-score
measurements were made possible in large part by this validation procedure.
[Figure content: the encoder applies 3x3 separable convolutions with dilation rates 1 to 4 and 32, 64, 128, and 256 filters (Conv1-Conv4), interleaved with batch normalization and pooling; the decoder applies deconvolution (Conv Trans 1-3) and up-sampling layers.]
Fig (2). Lane detection two phases

Fig (3). Lane detection network

Encoding: The encoder component of the model is in charge of extracting features from the input image. This is
accomplished by a sequence of convolutional layers. Each convolutional layer applies a series of filters to the input
image, resulting in feature maps with varying degrees of abstraction. As the network becomes deeper, the
convolutional layers learn and extract increasingly complex information.

The encoder is made up of several convolutional layers with varying filter sizes and depths. Each convolutional layer
is followed by batch normalization, which helps to normalize activations and increase training stability. After some of
the convolutional layers, max pooling is applied to reduce the spatial dimensions of the feature maps while keeping
the most salient features.

Decoding: The model's decoder seeks to restore the spatial resolution lost during the encoding step and produce a
pixel-by-pixel segmentation mask. Transposed convolutional layers (up-sampling or deconvolutional layers) are used to
build the decoder. These layers work in the opposite direction of convolutional layers, increasing spatial dimensions
while decreasing channel count. The decoder layers gradually up-sample the feature maps and restore their spatial
resolution to match the size of the original image. Convolutional layers are typically used in conjunction with
up-sampling to introduce nonlinearity and refine the features. The decoder layers enhance the spatial features and merge
them with the encoder's high-level semantic information for precise lane line localization; deconvolution provides rich
feature extraction for object detection tasks [7].

The U-Net design, with its encoder-decoder structure, is ideal for pixel-level classification applications. Because of its
shallow layers, which allow it to capture and preserve more low-scale information, it excels at feature extraction. The
U-Net model is trained to categorize pixels into two categories: lane and background. The model's encoder contains
convolutional layers that gradually extract features from the input image, capturing essential visual patterns at different
levels of abstraction, before decoding takes place.

By being trained on annotated lane pictures, the U-Net model gains the ability to accurately categorize pixels as either
part of the lane or the background. Giving each pixel a binary label (lane present or absent) enables precise lane
recognition in unseen pictures. The U-Net's effectiveness in capturing complex lane details and achieving accurate
segmentation is due in part to its capacity to preserve low-scale characteristics.
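
As a compact illustration of the encoder-decoder described above, the following Keras sketch uses assumed filter counts (32 to 256) and an assumed input resolution of 80x160; the exact configuration used in our experiments may differ.

from tensorflow.keras import layers, Model

def build_lane_net(input_shape=(80, 160, 3)):
    """Encoder-decoder sketch with skip connections (filter counts and input
    size are illustrative assumptions, not the exact configuration)."""
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolution -> batch normalization -> max pooling, with growing filter counts.
    skips = []
    x = inputs
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        skips.append(x)                       # kept for the skip connections
        x = layers.MaxPooling2D(2)(x)

    # Bottleneck: deepest, most abstract representation.
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)

    # Decoder: transposed convolutions, merged with the matching encoder features.
    for filters, skip in zip((128, 64, 32), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Pixel-wise probability of "lane" for the segmentation mask.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)

model = build_lane_net()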

3. Methodology
3.1. Lane detection procedure

Our lane line recognition model is based on the idea of using a convolutional neural network (CNN) to learn the
distinguishing characteristics that separate lane lines from other components in an image, such as background objects,
noise, and variable road conditions. This is accomplished by training the CNN on a meticulously annotated dataset in
which the ground truth segmentation masks highlight the lane lines. Through the examination of this training data, the
CNN learns to predict the existence of lane lines in an input picture and to create a segmentation mask that highlights
the pixels corresponding to the lanes. The model applies a variety of mathematical operations, such as convolution,
pooling, and activation functions, to convert the input picture into a feature map that captures the relevant information
necessary for lane recognition. The image is convolved with several CNN filters to create a feature map that highlights
the existence of distinctive features. Pooling operations are then used to lower the spatial resolution of the feature map,
and activation functions inject non-linearity into the model, allowing more intricate lane line representations to be
captured.
The model generates a binary mask as its output, indicating whether or not lane lines are present in each pixel of the
input picture. Our lane line detection methodology combines the power of CNNs, carefully annotated datasets,
mathematical operations (convolution, pooling, activation functions), a U-Net-like architecture with skip connections,
and Grad-CAM visualization. These components work together to create a robust and accurate lane detection system
that can effectively identify and localize lane lines in diverse road conditions. To create the final binary mask, a
threshold value is applied to the probability map. The encoder and decoder components are described in Section 2;
even in the face of noise or challenging road circumstances, correct segmentation is ensured by the skip connections,
which preserve low-level details.
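
The thresholding step itself is a one-liner; the sketch below assumes a fixed threshold of 0.5, which is an illustrative choice rather than a value stated in this work.

import numpy as np

# The network outputs a per-pixel lane probability map; a fixed threshold
# (0.5 here, an assumed value) converts it into the final binary lane mask.
prob_map = model.predict(X_val[:1])[0, ..., 0]        # shape (H, W), values in [0, 1]
binary_mask = (prob_map > 0.5).astype(np.uint8)       # 1 = lane pixel, 0 = background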
3.2. Training
A lane detection system's main goal is to accurately report the geometry and topology of lanes so that they may be
used for motion planning and control. Feature extraction, model fitting, image-to-world correspondence, and
temporal integration are some of the phases that make up a basic lane detection pipeline. The model is compiled with
the Adam optimizer, which adaptively adjusts learning rates for each parameter to promote rapid convergence. The
difference between the predicted and actual segmentation masks is measured using the mean squared error (MSE) loss.
The Adam optimizer adjusts the weights of the model based on the gradients of the loss function with respect
to each parameter. It updates a specific weight w at time step t using the rule:

w_t = w_{t-1} - (alpha * m_t) / (sqrt(v_t) + epsilon)


where alpha is the learning rate, m_t and v_t are the first and second moments of the gradients, and epsilon is a small
constant to avoid division by zero. The moments are computed as exponentially weighted moving averages of the
gradients and their squares, respectively:

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2

where g_t is the gradient of the loss function with respect to the weight w at time step t, and beta1 and beta2 are the
decay rates for the first and second moments, respectively. The mean squared error (MSE) loss function measures the
discrepancy between the predicted and ground truth segmentation masks. For a single training example with predicted
mask y_pred and ground truth mask y_true, the MSE loss is given by:

MSE = (1 / N) * sum_i (y_pred_i - y_true_i)^2

where N is the total number of pixels in the masks, and the sum is taken over all pixels i.
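
For concreteness, the update rule and loss above can be transcribed directly into NumPy. This sketch mirrors the equations as written; library implementations such as the Keras Adam optimizer additionally apply bias correction to the moment estimates.

import numpy as np

def adam_step(w, g, m, v, alpha=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update, written exactly as the equations above (no bias correction)."""
    m = beta1 * m + (1 - beta1) * g           # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment estimate
    w = w - alpha * m / (np.sqrt(v) + epsilon)
    return w, m, v

def mse_loss(y_pred, y_true):
    """Mean squared error over all N pixels of the predicted and ground-truth masks."""
    return np.mean((y_pred - y_true) ** 2)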

The ImageDataGenerator class in Keras implements the data augmentation techniques. The class offers a collection of
image manipulation methods, including rotation, zooming, and flipping, that can be applied in real time to the training
images during training. The model is trained on a shuffled version of the training dataset during each epoch. A
hyperparameter called "epochs" controls how many times the complete dataset is used to train the model; the ideal
number of epochs depends on the size and complexity of the dataset as well as the model's complexity. Model
checkpointing is employed to save the weights of the best-performing model based on validation loss, ensuring that the
most accurate model is retained. The batch size determines how many augmented images are processed before the
model's weights are updated: larger batch sizes can speed up training but may demand more memory, while smaller
batch sizes result in more frequent weight updates. The batch size is chosen based on hardware constraints and
optimization objectives.
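
A minimal training-setup sketch along these lines is shown below; the optimizer, loss, checkpoint criterion, and generator follow the description above, while the batch size, epoch count, and file name are assumed placeholders.

from tensorflow.keras.callbacks import ModelCheckpoint

# Batch size, epoch count, and steps per epoch are placeholders; the values used in
# practice depend on the hardware and dataset size, as discussed above.
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["accuracy"])

checkpoint = ModelCheckpoint(
    "best_lane_model.h5",          # assumed file name
    monitor="val_loss",            # keep the weights with the lowest validation loss
    save_best_only=True,
)

model.fit(
    train_generator,               # augmented, shuffled training batches (see Section 2.1.3)
    steps_per_epoch=len(X_train) // 32,
    epochs=20,
    validation_data=(X_val, y_val),
    callbacks=[checkpoint],
)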
The training dataset is reshuffled for every training epoch. This shuffling keeps the model from memorizing the
sequence of the training samples and helps to ensure that it efficiently learns to recognize patterns and features.

Image → UNet → Feature Map → Masked Image
Fig (4). Flowchart of the masking

The front view of a car serves as the input to the lane detection model, which outputs a lane edge probability map the
same size as the input picture. The model is trained in a fully supervised manner: we provide annotated maps with
positive points denoting the borders of lane segments (labeled as "1") and negative points denoting other locations
("0"). Using stochastic gradient descent, the network is tuned end-to-end to reduce the difference between the
predicted probability map and the actual ground truth annotation map. Through this training process, the model learns
the characteristics and patterns that set lane segment edges apart from other regions of the picture, enabling it to
correctly recognize lane borders in real-world situations.

4. Literature Survey

The authors of [12] thoroughly review the latest developments in lane departure warning systems and lane line
recognition techniques based on image processing and semantic segmentation. They discuss the difficulties and
prospects in this area and compare different algorithms and approaches. The paper [8] discusses the SegNet
architecture, a deep convolutional encoder-decoder architecture for scene segmentation. It analyzes the decoding
process used in some of these approaches and reveals their pros and cons, and the authors evaluate the performance of
SegNet on two scene segmentation tasks: CamVid road scene segmentation and SUN RGB-D indoor scene
segmentation.

Fahmizal et al. [14] proposed a CNN-based method to detect lanes. Their DCNN, which consists of 23 convolutional
layers and two connected layers for object detection, is implemented using YOLO. Four steps are used for detecting
road lanes. Warping: the images are handled by changing the perspective of the input. Filtering: only the yellow and
white color ranges are kept using the LUV and LAB formats, eliminating non-lane lines. Splitting: each image is
additionally divided into 15 smaller ones, which are then combined to create the left and right images. De-warping: to
produce clean images, de-warping is carried out exactly opposite to the warping step.
The paper [13] uses the color, gradient, and grayscale aspects of the lane line; the suggested method of multi-feature
fusion and window searching enhances the accuracy of lane detection. Using a multi-feature fusion approach, these
features are combined to produce a binary image that improves the resilience of lane detection in challenging
conditions. These three features can be extracted with significantly less effort and greater accuracy than deep learning
features. Additionally, the LFPF (Line Fits by Previous Fits) method is suggested to search for lane-line pixels if the
left and right lanes were identified in the previous frame of the video. It makes use of the relationship between
consecutive frames, searching the current frame's lane-line pixels near the left and right lane boundary equations fitted
in the previous frame, to reduce algorithmic time complexity and increase lane detection stability. Regarding local
features, the gradient on the lane edges is a common tool for locating lanes, as seen in [6], which uses two filters to
find the strong lane edges on the left and right sides of the images, respectively.

In this study, a novel approach is presented for detecting lane lines using instance segmentation. The aim is to
overcome the challenges encountered by existing lane detection methods, particularly in intricate traffic scenarios.
To enhance the algorithm's performance, the authors employ a RepVgg-A0 neural network, resulting in a compact
parameter count of 9.57 million. Experimental evaluations are conducted on a mobile device, demonstrating the
algorithm's feasibility for real-time lane detection on embedded systems like the Jetson Nano. Notably, the proposed
technique attains an impressive accuracy level of 96.7%, showcasing its potential relevance in the realm of
self-driving vehicles [15].

Rama Sai Mamidala and colleagues introduced an innovative lane detection method in their work [9]. The proposed
technique, titled "Dynamic Approach for Lane Detection Using Google Street View and CNN," centers around
employing a CNN (Convolutional Neural Network) with a SegNet decoder architecture. A distinctive characteristic
of this architecture involves the incorporation of max-pooling indices within the decoders, facilitating the upscaling
of lower-resolution feature maps. This strategic inclusion preserves the frequency intricacies within segmented
images and effectively reduces the overall count of training parameters within the decoders.
5. Experimental setup

5.1. Lane line detection model architecture

We developed an encoder-decoder U-Net model, a specific convolutional neural network (CNN) architecture, to
achieve accurate lane line recognition. This architecture excels at image segmentation tasks and is especially
well-suited to the challenging lane detection problem. The model consists of various layers, each of which is
essential to learning and decoding lane line information from input images. Convolutional layers make up the
encoder part; batch normalization and max-pooling operations are performed after each layer. While spatially
down-sampling the feature maps, this hierarchical structure recovers increasingly abstract features from the input
images. The decoder component then up-samples and decodes these abstract features to produce pixel-wise lane line
predictions. The decoder section is the mirror image of the encoder section. Notably, the decoder's core consists of a
number of convolutional transpose layers, batch normalization, and up-sampling operations. The final convolutional
layer creates a segmentation mask with the same dimensions as the input image, where each pixel corresponds to a
probability indicating the presence of a lane line. The entire architecture is trained end-to-end, converting raw images
into predictions of lane lines.

Using the Inception Module for Multi-Scale Feature Learning

One of the distinguishing characteristics of our model architecture is the inclusion of the inception module, a design
decision that greatly contributes to its steadfast performance in lane detection. The U-Net architecture's encoder
section contains a strategically placed inception module that improves the model's capacity to learn multi-scale
features. This module uses filters of different sizes (1x1, 3x3, and 5x5 convolutions) at the same hierarchical level
(see Figure 2). We add extra 1x1 convolutions before and after particular convolutional operations to improve
computational performance and decrease dimensionality. For example, a 1x1 convolution comes before a 3x3
convolution, and the 5x5 convolution path is modified in a similar way. Additionally, the 5x5 convolution is replaced
by a 3x3 convolution. The multi-scale feature learning within the inception module allows the model to capture lane
line details across different sizes and orientations, which ultimately contributes to its lane recognition capabilities.
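
The following Keras sketch illustrates an inception-style block of the kind described above; the filter counts and the pooling branch are illustrative assumptions rather than the exact configuration of our module.

from tensorflow.keras import layers

def inception_block(x, filters=32):
    """Inception-style block: parallel 1x1 and 3x3 branches with 1x1 bottlenecks,
    and the 5x5 path replaced by a 3x3, as described above. Filter counts are assumed."""
    # Branch 1: plain 1x1 convolution.
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)

    # Branch 2: 1x1 bottleneck followed by a 3x3 convolution.
    b2 = layers.Conv2D(filters // 2, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b2)

    # Branch 3: 1x1 bottleneck followed by a 3x3 convolution standing in for the 5x5 path.
    b3 = layers.Conv2D(filters // 2, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b3)

    # Branch 4: max pooling followed by a 1x1 projection (an assumed, common variant).
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(filters, 1, padding="same", activation="relu")(b4)

    # Concatenate the multi-scale branches along the channel axis.
    return layers.Concatenate()([b1, b2, b3, b4])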

6. Experimental Results

6.1. Performance merits

Images are loaded from specific pickle files and used as data in the program. These pictures depict lane markers on a
road. The U-Net structure, which is essentially an encoder-decoder architecture, is used to segment the lanes in the
photographs. After the dataset was used to train the model, the model's predictions were displayed for several photos
from the validation set. Binary lane markers were generated by thresholding the lane segmentation masks. The false
positive rate (FPR) is measured as FPR = (number of false positives) / (number of target lanes), and the true positive
rate (TPR) is calculated as TPR = (number of detected lanes) / (number of target lanes). Since each lane should only
be detected once, it is undesirable to overestimate or underestimate the overall number of lanes. The FPR is one of the
most important performance indicators used to assess the model: it quantifies the rate of false positive lane detections.
A high FPR would suggest that the model frequently recognizes non-existent lanes, potentially confusing and
endangering autonomous driving systems. The low false positive rate in our results indicates that the model
successfully suppresses false positives, which is important for real-world applications.
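
These two rates can be computed directly from the detection counts; the sketch below simply transcribes the definitions above, and the counts in the example call are hypothetical, not taken from our results.

def lane_detection_rates(num_detected, num_false_positives, num_target_lanes):
    """TPR and FPR exactly as defined above: both are normalized by the number of target lanes."""
    tpr = num_detected / num_target_lanes
    fpr = num_false_positives / num_target_lanes
    return tpr, fpr

# Hypothetical counts for illustration only.
tpr, fpr = lane_detection_rates(num_detected=95, num_false_positives=5, num_target_lanes=100)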

Tab (1). The Performance merits

Collectively, these results show that the model effectively learned the underlying patterns in the training data.
In addition, the model consistently demonstrated high validation accuracy, indicating its ability to generalize to
new data it had not previously encountered. The generated graph displays the accuracy over time during both
the training and validation phases. We expect both the training accuracy and the validation accuracy to improve
as training progresses. If the model's training accuracy keeps going up while the validation accuracy goes down,
it may be overfitting the training data and failing to generalize effectively to new data. Conversely, if the
model's accuracy plateaus at a low value during training and validation, it may not be sophisticated enough to
pick up on trends in the data. The True Positive Rate (TPR) is another crucial indicator. By dividing the number
of successfully recognized lanes by the total number of target lanes, TPR measures the rate at which the model
correctly detects lanes. A high TPR shows that the model can correctly identify lanes in a variety of situations.
The TPR seen in the findings indicates that the model is excellent at spotting the actual lanes visible in the
photos, which is essential for tasks like lane-keeping in autonomous vehicles.
The outcomes also demonstrate how well the model generalizes to new data. The model can effectively apply
its newfound knowledge to unfamiliar images, as evidenced by the continuously excellent validation accuracy.
For real-world applications, where the model must function well in a variety of dynamic situations, this
generalization capacity is crucial.

The model's generated accuracy graph also offers insightful information about the model's training
development over time. It demonstrates the learning curve of the model and how its accuracy increases
throughout training. By highlighting convergence spots and prospective growth areas, this graph can be useful
for fine-tuning and enhancing the model's performance. To sum up, the experimental findings support the
U-Net model's effectiveness in lane marker segmentation. Its capacity for real-world applications is
demonstrated by its ability to produce low False Positive Rates, high True Positive Rates, and consistent
validation accuracy, especially in autonomous driving systems where accurate lane recognition is crucial for
safe and dependable navigation.

6.2. GRAD-CAM visualization

To show the areas of the photos that had the greatest impact on the predictions of our lane detection model, we
used Gradient-weighted Class Activation Mapping (Grad-CAM) in this study. When overlaid onto validation
photos, the generated heatmaps showed the model's attention primarily focused on the actual lane markers, even
in the presence of occlusions, shadows, or faded lines. In addition to validating our model's capacity to
concentrate on important features, this visualization technique also gave us useful insights into prospective
areas for further development.

Fig (5). GRAD-CAM Visualization

We created a heatmap using Grad-CAM for the specified input image, offering a visual representation of the
model's decision-making process. By superimposing the heatmap onto the original input image, we were able
to observe the model's attention areas. Upon examining the Grad-CAM visuals, we identified performance
flaws, especially in certain lighting conditions or when the lines were partially obscured. Armed with this
knowledge, we modified the model design, retrained it, and enhanced its performance in these challenging
scenarios.
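
A sketch of how such a Grad-CAM heatmap can be computed and overlaid for a segmentation model is given below; the target layer name and the use of the mean lane probability as the class score are assumptions for illustration, not the exact choices made in this work.

import numpy as np
import tensorflow as tf
import cv2

def grad_cam(model, image, layer_name="conv2d_3"):
    """Grad-CAM sketch for the lane segmentation model (layer name is assumed)."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = tf.reduce_mean(preds)                            # aggregate lane score over all pixels

    grads = tape.gradient(score, conv_out)                       # d(score)/d(activations)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))              # channel importance weights
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)          # weighted activation map
    cam = tf.nn.relu(cam).numpy()
    cam = cam / (cam.max() + 1e-8)                               # normalize to [0, 1]
    return cv2.resize(cam, (image.shape[1], image.shape[0]))     # match input resolution

# Overlay the heatmap on the original frame for inspection.
heatmap = grad_cam(model, X_val[0])
colored = cv2.applyColorMap(np.uint8(255 * heatmap), cv2.COLORMAP_JET)
overlay = cv2.addWeighted(np.uint8(255 * X_val[0]), 0.6, colored, 0.4, 0)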

To demonstrate the relative advantages of our suggested methodology, we compare our CNN model with the
well-known LaneNet and FCN models. The LaneNet model is a two-stage predictor of lane lines. An
encoder-decoder model is used in the first stage to produce a segmentation mask for the lane lines. The second
stage is a lane localization network, which employs an LSTM to learn a quadratic function that forecasts the
points of the lane, using the lane points extracted from the mask as input [15]. The comparison is given for
reference in Table (2). Gaining the trust of customers and other parties has proven to be extremely difficult in
the field of deep learning due to the inherent "black-box" nature of deep neural networks, particularly in
applications as important as autonomous driving. A significant barrier to the widespread adoption of these
models has been the inability to explain their decision-making processes. To address this issue, we used the
Gradient-weighted Class Activation Mapping (Grad-CAM) technique, which is a potent tool for deciphering the
deep-learning-based Inception U-Net model. By highlighting the model's reasoning for its decisions, this
strategic integration of Grad-CAM aims to increase consumer trust. The suggested Inception U-Net model in
conjunction with Grad-CAM has produced strong results, as demonstrated by an Intersection over Union (IoU)
score of 0.682. This accomplishment demonstrates the model's performance when compared with other
state-of-the-art (SOTA) deep neural network-based segmentation models. More importantly, the addition of
Grad-CAM to our architecture has improved model interpretation and transparency while also boosting speed.

Grad-CAM enables us to understand where and why the model is producing its predictions by highlighting and
displaying the regions of interest within images. As it demystifies the decision-making processes of deep neural
networks, this interpretability is an essential step towards establishing trust in them. Stakeholders and customers
can now gain insight into how the model reaches its decisions. In conclusion, our novel method, which uses the
Inception U-Net model augmented by Grad-CAM, shines in terms of performance and satisfies the critical
demand for model interpretability and reliability. This accomplishment is a big step toward ensuring openness
and accountability in our AI systems while realizing the potential of deep learning in crucial applications like
autonomous driving.

IoU score = (Area of Intersection) / (Area of Union)
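
The IoU of a predicted and a ground-truth binary mask can be computed as a direct transcription of this formula:

import numpy as np

def iou_score(pred_mask, true_mask):
    """Intersection over Union of two binary masks, as in the formula above."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return intersection / union if union > 0 else 0.0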
(a) Original images (b) Masked Images (c) Final results

Fig (6). Illustrative results. The original images along with the corresponding masked output and lane detected output.
The distance between the ego vehicle and the next-closest vehicle in front is measured below using YOLO v5 [16]; here we
have deployed the pre-trained YOLO model for object detection. Based on the camera's settings, the model returns the
distance in pixels, which can be translated to meters. To recognize and track lane lines and objects in real time, the trained
CNN model is applied to each frame of a video. We split a video into individual frames using the MoviePy library in order
to apply the model to it. YOLO detects cars from their positions in successive frames, calculates their speeds, annotates the
detections and speeds on the original video, and saves the outcome as a new video.
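
A minimal sketch of this frame-by-frame pipeline with MoviePy is shown below; the file names, input resolution, and overlay style are assumptions, and the YOLO step is indicated only as a placeholder comment since its weights and configuration vary.

import numpy as np
import cv2
from moviepy.editor import VideoFileClip

def process_frame(frame):
    """Run the lane model on one video frame and overlay the predicted mask.
    Resolutions and blending weights are illustrative assumptions."""
    small = cv2.resize(frame, (160, 80)) / 255.0                  # match the model's assumed input size
    mask = model.predict(small[np.newaxis, ...])[0, ..., 0]       # lane probability map
    mask = (mask > 0.5).astype(np.uint8) * 255                    # binary lane mask
    mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]))
    overlay = np.zeros_like(frame)
    overlay[..., 1] = mask                                        # draw lanes in green
    # Object detection (e.g. a pre-trained YOLO model) would be applied to `frame` here
    # and its boxes drawn on top; it is omitted from this sketch.
    return cv2.addWeighted(frame, 1.0, overlay, 0.5, 0)

clip = VideoFileClip("test_drive.mp4")            # assumed input file name
annotated = clip.fl_image(process_frame)          # apply the model frame by frame
annotated.write_videofile("test_drive_lanes.mp4", audio=False)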

Fig (7). YOLO Object Detection


Tab (2). Experiment results.

          | Background (13,790,672 lanes)      | Lane (2,530,896 lanes)
          | Detected   TPR      FPR            | Detected   TPR      FPR
UNet      | 1170       1.0      1.0            | 1035       97%      98.8%
LaneNet   | 1146       97.9%    2.7%           | 1001       96.7%    3.9%

To train and deploy deep learning models in the proposed work, Keras with GPU acceleration offers a
streamlined procedure that increases both usability and effectiveness. Further fine-tuning the network using
weakly labelled data significantly improves the performance on samples with hard difficulties; this fine-tuning
actually enables us to improve network performance in extreme cases at a low cost.

7. Limitations
Although our model works well in controlled circumstances, its reliance on particular datasets may limit its
capacity to adapt to a variety of road environments, such as off-road terrains and unfavorable weather. GPU
acceleration, although helpful, may produce a range of results depending on the hardware. Real-time execution
in dynamic traffic situations is challenging, especially in locations with hazy lane lines or active construction
zones. Furthermore, given the immense complexity of real-world events, the current data augmentation
techniques can fall short. Enhancing data augmentation, developing more efficient models, or adding hardware
support may improve lane detection performance under pressure. Additionally, the CNN's computational
complexity may have an impact on real-time processing, especially in onboard systems with constrained
resources. While they enhance the system, additional tools like Grad-CAM for visualization and YOLO for
object detection may further tax the computing capacity.

8. Future Scope
Future work will examine the potential integration of Generative Adversarial Networks (GANs) into the lane
line detection system. GANs can produce realistic road scenes covering a range of weather conditions, lighting
settings, and road textures. The training dataset can be supplemented with this synthetic data, increasing the
robustness of the lane detection model under difficult circumstances. When GPS data is incorporated, lane-level
vehicle placement is improved for precise navigation, and V2X communication equips vehicles with real-time
traffic and road condition updates. LiDAR and radar sensor fusion improves perception accuracy, and machine
learning predictions help with determining road conditions. Driving will be safer and more effective thanks to
cognitive help, autonomous lane changes, and augmented reality guidance. Working together with smart
infrastructure improves lane detection precision, and real-time traffic flow analysis offers perceptions into
traffic and best routes. The future of mobility and transportation will advance thanks to this convergence, which
is predicted to improve road safety, navigation accuracy, and autonomous driving capabilities.

9. Conclusion
In conclusion, this study's use of a specific Convolutional Neural Network (CNN) model with an
encoder-decoder structure provides a reliable and novel approach to lane detection. The importance of precise
lane identification in the context of driverless vehicles and Advanced Driver-Assistance Systems (ADAS) has
been emphasized. The CNN model, trained on annotated MATLAB images, achieves remarkable accuracy
metrics, with a macro average F1-score of 0.98 and an accuracy of 0.97. In comparison to more established
models such as LaneNet, our suggested U-Net encoder-decoder structure performed better, especially when
enhanced with GPU acceleration. Our model's capabilities were further enhanced by the incorporation of
technologies like YOLO and Grad-CAM, which delivered more thorough results and visual insights into the
decision-making processes. The method's versatility and promise for frame-by-frame object detection and lane
detection are demonstrated by the extension to video analysis using a combination of pre-trained models and libraries.

References
[1] M. Aly. Real time detection of lane markers in urban streets. In Intelligent Vehicles Symposium, pages 7–12, 2008.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2017.
[3] H. Deusch, J. Wiest, S. Reuter, M. Szczot, M. Konrad, and K. Dietmayer. A random finite set approach to multiple lane detection. 53(2293):270–275, 2012.
[4] A. Gurghian, T. Koduri, S. V. Bailur, K. J. Carey, and V. N. Murali. DeepLanes: End-to-end lane position estimation using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 38–45, 2016.
[5] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[6] J. Hur, S.-N. Kang, and S.-W. Seo. Multi-lane detection in urban driving environments using conditional random fields. In Intelligent Vehicles Symposium (IV), 2013 IEEE, pages 1297–1302. IEEE, 2013.
[7] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, and R. Chengyue. An empirical evaluation of deep learning on highway driving. Computer Science, 2015.
[8] Y. Jiang, F. Gao, and G. Xu. Computer vision-based multiple-lane detection on straight road and in a curve. In International Conference on Image Analysis and Signal Processing, pages 114–117, 2010.
[9] S. Lee, I. S. Kweon, J. Kim, J. S. Yoon, S. Shin, O. Bailo, N. Kim, T.-H. Lee, H. S. Hong, and S.-H. Han. VPGNet: Vanishing point guided network for lane and road marking detection and recognition. In IEEE International Conference on Computer Vision, 2017.
