
Data preparation and neural network training:

In [1]: import numpy as np


from torch.utils.data import Dataset, DataLoader
import os, random, time, shutil
import torch, torchvision
from torchvision import transforms, datasets, models
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.utils as vutils
from torch.utils.tensorboard import SummaryWriter
from PIL import Image
from tqdm import tqdm
import cv2
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
%matplotlib inline

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 4
      2 from torch.utils.data import Dataset, DataLoader
      3 import os, random, time, shutil
----> 4 import torch, torchvision
      5 from torchvision import transforms, datasets, models
      6 from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

ModuleNotFoundError: No module named 'torchvision'

In [ ]: # Load the functions from the project's .py files:


from augment_and_visualize import *
from training_rcnn import *
from predict import *
from metrics import *

In [2]: # For reproducibility of results, let's fix the seeds:


random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic = True

# Code to ignore warnings
import warnings
warnings.filterwarnings("ignore")


We implement training of a neural network that detects objects of the "person" class

There is a pre-labeled dataset. It can be downloaded from this link
(https://disk.yandex.ru/d/7HNoc81at3r6VQ)

Let's implement data augmentation by doubling the dataset. To do this, I wrote a custom
function aug:

In [3]: aug(out_folder='augmented_dataset')

Initial number of photos and annotations = 500


Total number of photos and annotations = 1000
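
The aug function itself is defined in augment_and_visualize.py and is not shown in the notebook. As a minimal sketch only, assuming the augmentation is a horizontal flip (the function and argument names here are hypothetical, not the project's), doubling a VOC-annotated dataset could look roughly like this:

# Hypothetical sketch, NOT the project's aug():
# flip an image horizontally and mirror the xmin/xmax of its PASCAL VOC boxes.
import os
from PIL import Image, ImageOps
from bs4 import BeautifulSoup

def flip_image_and_boxes(image_path, xml_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)

    img = Image.open(image_path)
    ImageOps.mirror(img).save(os.path.join(out_dir, 'flip_' + os.path.basename(image_path)))

    with open(xml_path) as f:
        soup = BeautifulSoup(f.read(), 'xml')
    for box in soup.find_all('bndbox'):
        xmin, xmax = float(box.xmin.text), float(box.xmax.text)
        # a horizontal flip mirrors only the x coordinates
        box.xmin.string = str(img.width - xmax)
        box.xmax.string = str(img.width - xmin)
    with open(os.path.join(out_dir, 'flip_' + os.path.basename(xml_path)), 'w') as f:
        f.write(str(soup))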

Preparing data for training the Faster RCNN network:

Let's divide the data into training and test:

In [52]: # Get the file names without the extension


names = []
for file in os.listdir('augmented_dataset/images'):
names.append(file.split('.')[0])

I randomly divide the photos into train (80%) and test (20%)

In [53]: train_data = random.sample(names, int(len(names) * 0.8))

print(f'Number of images in train = {len(train_data)}')
test_data = list(set(names) - set(train_data))
print(f'Number of images in test = {len(test_data)}')

Number of images in train = 800
Number of images in test = 200

Annotations in this case are available in two different formats: COCO_json and
PASCAL_VOC_xml

Key difference between the xml and json box markup:

In json, each box has the following 4 values: x_min, y_min, width, height

In xml, each box has the following 4 values: x_min, y_min, x_max, y_max
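
For illustration only (this helper is not part of the project code), converting a COCO-style box to the PASCAL VOC convention is a one-liner:

# COCO json stores [x_min, y_min, width, height];
# PASCAL VOC xml stores [x_min, y_min, x_max, y_max].
def coco_to_voc(box):
    x_min, y_min, w, h = box
    return [x_min, y_min, x_min + w, y_min + h]

coco_to_voc([1000.3, 301.6, 82.0, 212.9])  # ~[1000.3, 301.6, 1082.3, 514.5], the first xml box shown below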


Let's look at how to work with annotation data presented in xml format

In [54]: with open('detect_dataset/annotations/PASCAL_VOC_xml/oz7_violation_frame519.xml') as f:
    data = f.read()
soup = BeautifulSoup(data, 'xml')
objects = soup.find_all('object')
num_objs = len(objects)
print(objects)

[<object>
<name>person</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>1000.3</xmin>
<ymin>301.6</ymin>
<xmax>1082.3</xmax>
<ymax>514.5</ymax>
</bndbox>
</object>, <object>
<name>person</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>1140.9</xmin>
<ymin>260.36</ymin>
<xmax>1215.6</xmax>
<ymax>493.2</ymax>
</bndbox>
</object>]

In this case, the useful information is in the xmin, ymin, xmax and ymax fields,
and the object class is given in the name field


In [55]: # Let's create functions that will parse this data:

def generate_box(obj):
    xmin = int(float(obj.find('xmin').text))
    ymin = int(float(obj.find('ymin').text))
    xmax = int(float(obj.find('xmax').text))
    ymax = int(float(obj.find('ymax').text))
    return [xmin, ymin, xmax, ymax]

'''
class person - 1
p.s.:
I wrote down the conditions for the class numbers in advance
in order to later train the second network with two classes:
1 - person with a helmet
2 - person without a helmet
'''
def generate_label(obj):
    if (obj.find('name').text == "person") or (obj.find('name').text == "hat"):
        return 1
    elif obj.find('name').text == "no_hat":
        return 2
    return 0

When training detection models, PyTorch expects each box in the format [xmin, ymin, xmax, ymax]


In [56]: '''
This function will output a dictionary with 3 keys: boxes, labels and image_id
The function takes as input:
    image_id - index of the photo from the PyTorch Dataset class
    file - path to the xml file
'''
def generate_target(image_id, file):
    with open(file) as f:
        data = f.read()
    soup = BeautifulSoup(data, 'xml')
    objects = soup.find_all('object')
    num_objs = len(objects)

    # We will iterate through the list obtained after parsing the xml
    boxes = []
    labels = []
    for i in objects:
        boxes.append(generate_box(i))
        labels.append(generate_label(i))
    boxes = torch.as_tensor(boxes, dtype=torch.float32)
    # In this case there is only 1 class
    labels = torch.as_tensor(labels, dtype=torch.int64)
    # convert the index into a torch tensor
    img_id = torch.tensor([image_id])

    # assemble the final dictionary for the photograph under study
    target = {}
    target["boxes"] = boxes
    target["labels"] = labels
    target["image_id"] = img_id

    return target

Let's create a MakeDataset class that inherits from the Dataset class and implements the
__init__, __getitem__ and __len__ methods


In [57]: class MakeDataset(Dataset):

    def __init__(self, path, data, transforms=None):
        self.transforms = transforms
        self.names_list = data
        self.path = path

    def __getitem__(self, idx):
        name = self.names_list[idx]
        file_image = self.path + '/images/' + str(name) + '.jpg'
        file_label = self.path + '/annotations/' + str(name) + '.xml'
        img = Image.open(file_image).convert("RGB")

        # Build the annotation dictionary using the previously written function
        target = generate_target(idx, file_label)

        if self.transforms is not None:
            img = self.transforms(img)

        return img, target

    def __len__(self):
        return len(self.names_list)

In [58]: data_transform = transforms.Compose([
    transforms.ToTensor()
])

# MakeDataset's __init__ takes the list of file names and
# the path to the shared folder with annotations and images:

train_dataset = MakeDataset(path='augmented_dataset', data=train_data, transforms=data_transform)
val_dataset = MakeDataset(path='augmented_dataset', data=test_data, transforms=data_transform)

Let's see in what format the data is stored in the MakeDataset class:

In [59]: first = train_dataset[7]

features, labels = first
print(labels)

{'boxes': tensor([[ 961.,  503., 1093.,  945.],
        [ 784.,  452.,  962.,  946.]]), 'labels': tensor([1, 1]), 'image_id': tensor([7])}

In this case, there are 2 objects of the person class in the photo, so there are 2 bounding
boxes


In [60]: # Represent the values in a batch as tuples:
def collate_fn(batch):
    return tuple(zip(*batch))


batch_size = 8  # Set the number of photos per batch

train_data_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)

val_data_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

Demonstration of output data from the DataLoader class -> [batch_size, dicts]

In [16]: device = 'cpu'

for imgs, annotations in train_data_loader:
    imgs = list(img.to(device) for img in imgs)
    annotations = [{k: v.to(device) for k, v in t.items()} for t in annotations]
    print(annotations)
    break

[{'boxes': tensor([[629., 132., 686., 261.]]), 'labels': tensor([1]), 'image_id': tensor([527])},
 {'boxes': tensor([[593., 108., 633., 238.],
        [630.,  92., 690., 238.]]), 'labels': tensor([1, 1]), 'image_id': tensor([481])}]

Let's visualize the source images:

Let's take 3 random photos and draw their boxes:


In [14]: for _ in range(3):
    plot_random_image(train_data_loader)


Network configuration:

We will use the Transfer learning approach, training the Faster RCNN network, which has
already been pretrained on COCO

In [17]: def create_model(num_classes):
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Replace the number of output classes with what we need
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# Create the model (set the number of classes as n+1, since class 0 is the background)
model = create_model(2)

Inside the Faster R-CNN implementation with FPN (Feature Pyramid Network) in the
PyTorch torchvision library, a complex loss function is used that combines several sub-
functions.

In [18]: model.to(device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)  # weight_decay value truncated in the export; 0.0005 assumed

# We will train the networks on a GPU:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

model.to(device)
num_epochs = 30  # number of training epochs

cuda


In this case, the choice of the SGD optimizer may be due to the following reasons:

Model size and complexity: Faster R-CNN with ResNet50-FPN is a fairly large and complex model with many trainable parameters, which can lead to rapid overfitting and instability when using more complex optimizers such as Adam.

Amount and type of data: When training object detectors based on Faster R-CNN, a loss function consisting of several components is used, including components related to object classification and box regression. SGD is a classic optimizer that does a good job of training such models, while Adam, a more advanced method, may not produce optimal results here.

Availability of pre-trained weights: In this case, we use pre-trained weights for Faster R-CNN with ResNet50-FPN, which can simplify the training process and allow the use of the simpler SGD optimizer instead of Adam.

In [19]: # Create an empty folder in which we will save the trained models
newpath = 'models'
if not os.path.exists(newpath):
    os.makedirs(newpath)

In [7]: # Let's run TensorBoard directly in the notebook:

%reload_ext tensorboard
%tensorboard --logdir 'results_training'

The forward pass of the model in train mode produces a loss_dict, a dictionary containing the loss values for each of the Faster R-CNN components used during training (a sketch of a single training step using this loss_dict is shown after the list). The loss_dict dictionary in the Faster R-CNN model includes the following loss components:

1. Loss_objectness is responsible for determining whether a proposed region contains any object at all (binary classification). For this, loss_objectness uses binary cross-entropy between the network output and the corresponding labels for each region.
2. Loss_classifier is responsible for determining which class the object in a given region belongs to. For this, loss_classifier uses multi-class cross-entropy between the network output and the corresponding class labels for each region.
3. Loss_box_reg is responsible for how well the model predicts the bounding box coordinates for a detected object in the image. For this, loss_box_reg uses a smooth L1 loss between the predicted and ground-truth box coordinates.
4. Loss_rpn_box_reg is responsible for how well the model predicts bounding box coordinates for the regions proposed by the Region Proposal Network (RPN) that may contain objects. For this, loss_rpn_box_reg also uses a smooth L1 loss between the predicted and ground-truth coordinates.
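
The project's train() helper (training_rcnn.py) is not shown here. As a rough sketch only, assuming it follows the standard torchvision recipe, one training step with this loss_dict could look like this:

# Hedged sketch of a single Faster R-CNN training step (not the project's train()):
# in train mode the model returns loss_dict, and the total loss is the sum of
# its four components described above.
model.train()
for imgs, targets in train_data_loader:
    imgs = [img.to(device) for img in imgs]
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

    loss_dict = model(imgs, targets)   # {'loss_classifier': ..., 'loss_box_reg': ...,
                                       #  'loss_objectness': ..., 'loss_rpn_box_reg': ...}
    losses = sum(loss for loss in loss_dict.values())

    optimizer.zero_grad()
    losses.backward()
    optimizer.step()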

Network training code:

Let’s load a custom function and train the network (30 training epochs):


In [20]: train(model=model, train_data_loader=train_data_loader, optimizer=optimizer,
    val_data_loader=val_data_loader,
    num_epochs=30, comment=' person detection new', device=device,
    save_path='models/model_human_detection.pth')

Train summ loss after 27 epochs = 0.10834603011608124

100%|██████████| 100/100 [01:03<00:00, 1.58it/s]

Validation summ loss after 27 epochs = 0.12323746003210545

Epoch 28 training in progress (out of 30)

25it [00:39, 1.56s/it]

Train summ loss after 28 epochs = 0.09104762971401215

100%|██████████| 100/100 [01:02<00:00, 1.59it/s]

Validation summ loss after 28 epochs = 0.11854852609336376

Epoch 29 training in progress (out of 30)

25it [00:38, 1.55s/it]

Train summ loss after 29 epochs = 0.11474072933197021

100%|██████████| 100/100 [01:02<00:00, 1.59it/s]

Validation summ loss after 29 epochs = 0.11717739876359701
The TensorBoard logs from training are saved in the results_training directory. I also uploaded them to tensorboard.dev so they can be viewed online.
!!! The training results can be viewed by following this link
(https://tensorboard.dev/experiment/rr43qafqQKyKP7CQ5r1RCA/#scalars&_smoothingW

Loading the most successful state of the model:

Just in case, I renamed this model to model_human_detection_final.pth so that I wouldn't accidentally overwrite it when running the code again. This model holds the weights from the epoch with the lowest validation loss.

Since the model file is too large (158 MB), it could not be uploaded to GitHub. So, to run the code yourself with my trained model, you need to run this function, which downloads the trained networks from my Google Drive into the models folder:

In [ ]: #download_models(folder_name='models')

Load the model model_human_detection_final.pth:

In [49]: model = create_model(2)


model.load_state_dict(torch.load('models/model_human_detection_final.pth'))

Out[49]: <All keys matched successfully>


Testing:
Let's run the custom function that displays the model's detection results

In [ ]: detect_and_visualize(image_input='detect_dataset/images/am3_7_violation_fram
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)
detect_and_visualize(image_input='detect_dataset/images/am3_9_frame111.jpg',
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)

2 objects of the person class were detected

3 objects of the person class were detected


Model quality assessment:

To calculate metrics in object detection tasks, two thresholds are usually used: a classification threshold and a threshold on the overlap with the true position of an object (intersection over union, IoU threshold).

The classification threshold determines which predictions are considered positive and which are negative. Usually the threshold is set on the score that the model produces for each detected object. If the score exceeds the threshold, the object is considered positive, otherwise negative.

The IoU threshold determines how much a detected object must overlap with the object's true position. Typically, the IoU threshold is set based on the required detection quality. If the IoU between the detected object and its true position exceeds the threshold, the object is considered correctly detected, otherwise it is considered a false detection.

Thus, to calculate metrics in object detection tasks, both thresholds need to be known: the classification threshold and the IoU threshold. They allow us to separate positive and negative examples and determine how well the model detects the true positions of objects.

An important note: the score threshold is used when the model is deployed in production, so it should be chosen especially carefully. The IoU threshold, on the other hand, is used only during validation to evaluate several well-known metrics (more on them a little later).

Let's choose score = 0.85 as the threshold
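
In eval mode the torchvision detection model returns, for each image, a dict with boxes, labels and scores; applying the score threshold is then just a filter. A small sketch with made-up values:

import torch

# Output of a torchvision detection model in eval mode for one image (made-up values):
prediction = {'boxes': torch.tensor([[10., 10., 50., 90.], [200., 40., 260., 180.]]),
              'labels': torch.tensor([1, 1]),
              'scores': torch.tensor([0.97, 0.42])}

keep = prediction['scores'] > 0.85           # the classification (score) threshold
filtered_boxes = prediction['boxes'][keep]   # only confident detections remain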

Let's evaluate the quality of the model using the IOU metric:

In [61]: '''
The first step is to find the IoU scores:
we get an array with the number of elements equal to the number of objects in the dataset,
each containing a matrix of IoU correspondences between the predicted and real bounding boxes
'''
iou_scores_list = calculate_iou(model, val_dataset, treshold=0.85)

Let's see what the data looks like in this array:

In [66]: print(f'Example 1:\n{iou_scores_list[2]}')
print(f'Example 2:\n{iou_scores_list[67]}')

Example 1:
tensor([[0.7746]])
Example 2:
tensor([[0.9480, 0.0432],
        [0.0486, 0.8020]])


We obtain IoU matrices containing the IoU coefficients between all pairs of predicted and real
boxes. These values range from 0 to 1, where 0 means the boxes do not overlap at all and 1
means they match completely.
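
calculate_iou is a project helper from metrics.py and is not shown here; a minimal sketch of how such a pairwise IoU matrix can be computed with torchvision.ops.box_iou (the example boxes below are made up):

import torch
from torchvision.ops import box_iou

# Pairwise IoU between predicted and ground-truth boxes in [xmin, ymin, xmax, ymax] format;
# the result has shape [num_predicted, num_true], like the matrices printed above.
pred_boxes = torch.tensor([[100., 100., 200., 300.], [400., 150., 480., 330.]])
true_boxes = torch.tensor([[110., 105., 205., 310.]])
print(box_iou(pred_boxes, true_boxes))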

Let's calculate the average IOU for validation at a given threshold score = 0.85

In [63]: val = []
for image in iou_scores_list:
    for detect in image:
        val.append(max(detect))
print(f'Average IOU on validation is: {np.mean(val)}')

Average IOU on validation is: 0.8693097233772278

Let's visualize what the predicted and real bounding boxes look like on images from the
validation dataset and find the IOU for them.
P.S.: (in this custom function, all predicted boxes with score > 0 are drawn at once)

In [99]: visualize_detection(val_dataset, model, 26)

Let's calculate precision and recall:

Recall for a one-class detection task shows how many of all objects of interest were
detected by the algorithm. That is, the closer the recall value is to 1, the more objects of
interest were found by the algorithm.

Precision for a one-class detection task shows how many of all bounding boxes predicted by
the algorithm actually contain objects of interest. That is, the closer the precision value is to
1, the fewer false objects were predicted by the algorithm.


To calculate these metrics, set the IOU threshold to 0.5

Calculating metrics for one specific photo:

In [25]: # Artificial example: only 3 objects actually existed, but 4 were detected
# with the selected score threshold, with the following IoU values:

score = torch.tensor([[0.20, 0.90, 0.10],
                      [0.80, 0.10, 0.20],
                      [0.00, 0.10, 0.20],
                      [0.10, 0.20, 0.00]])

print('recall (iou=0.1) =', recall(score, iou_threshold=0.1))
print('precision (iou=0.1) =', precision(score, iou_threshold=0.1))
print('recall (iou=0.5) =', recall(score, iou_threshold=0.5))
print('precision (iou=0.5) =', precision(score, iou_threshold=0.5))
print('recall (iou=0.85) =', recall(score, iou_threshold=0.85))
print('precision (iou=0.85) =', precision(score, iou_threshold=0.85))

recall (iou=0.1) = 1.0
precision (iou=0.1) = 0.75
recall (iou=0.5) = 0.6666666666666666
precision (iou=0.5) = 1
recall (iou=0.85) = 0.3333333333333333
precision (iou=0.85) = 1

In [26]: # Calculate the average metrics over the entire dataset at a score threshold of 0.85
print('Average recall over the entire validation dataset =',
      mean_metric(iou_scores_list, func='recall', iou_treshold=0.5))
print('Average precision over the entire validation dataset =',
      mean_metric(iou_scores_list, func='precision', iou_treshold=0.5))

Average recall over the entire validation dataset = 0.9708333333333333
Average precision over the entire validation dataset = 1.0

Explanation of the results obtained:

If the detection algorithm identifies many regions in a photograph that look very similar to a real object, but these regions do not actually correspond to the object (a bunch of predicted boxes next to a person, as in YOLO without non-maximum suppression), then recall will be high and precision will be low.

In our case, the detection confidence threshold was chosen to be very high (score = 0.85), so our situation is exactly the opposite. More often than not, the detector will not find a person at all rather than find them twice. For this reason, our precision is very high, and recall is somewhat lower.

Let's calculate the Average Precision metric


When calculating AP, we vary the score threshold (model confidence) and compute the corresponding precision and recall values for each threshold. Precision and recall change depending on the selected threshold. When the score threshold is set very high, i.e. the model only reports objects it is most confident about, precision will be very high but recall low, since many objects may be missed. If the threshold is set very low, i.e. the model detects all objects but also a lot of noise (false detections), then recall will be high but precision low.


In this case, AP is essentially the precision averaged over score thresholds from 0 to 1, as recall changes from 0 to 1. AP takes both precision and recall into account when assessing the quality of an object detection algorithm and does not depend on the choice of a specific score threshold, which gives a more general assessment of the detection model's quality. Since there is only one detectable class in this problem, the metric mAP (mean AP) = AP.
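
The average_precision helper used below comes from metrics.py; one common way to approximate AP as the area under the precision-recall curve, given per-threshold precision and recall lists like the ones built in the next cell, is a simple numerical integration over recall (a sketch, not necessarily identical to the project's implementation):

import numpy as np

def ap_area_under_pr(precisions, recalls):
    # sort the (recall, precision) points by recall and integrate precision over recall
    order = np.argsort(recalls)
    r = np.asarray(recalls)[order]
    p = np.asarray(precisions)[order]
    return float(np.trapz(p, r))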

In [27]: rec = []
prec = []
for i in np.linspace(0, 1, num=11, endpoint=False):
    iou_scores_list = calculate_iou(model, val_dataset, treshold=i)
    rec.append(mean_metric(iou_scores_list, func='recall', iou_treshold=0.5))
    prec.append(mean_metric(iou_scores_list, func='precision', iou_treshold=0.5))
rec.append(0)
prec.append(1)

Building the precision-recall curve:

AP can be defined as the area under the Precision-Recall curve. Let's plot this curve,
constructed with an IoU threshold of 0.5


In [28]: plt.figure(figsize=(12, 4), dpi=80)

plt.plot(rec, prec, label="threshold IoU = 0.5")
plt.title('Precision-Recall curve')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(frameon=False)
plt.show()

plt.figure(figsize=(5, 4), dpi=80)
plt.plot(rec[:-1], prec[:-1], label="threshold IoU = 0.5")
plt.title('Cropped Precision-Recall curve')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(frameon=False)
plt.show()

Determining the average Average Precision (AP) over the images of the validation dataset at
IoU = 0.5:

In [29]: print(f'Average precision = {average_precision(prec, rec)}')

Average precision = 0.9944000000000001


In this case, we implemented the AP calculation ourselves: we computed the AP value for each photo at IoU = 0.5 and then averaged the metric over the entire validation dataset. This approach is not entirely correct. It is better to compute this metric by concatenating the detection results and ground truth for all images in the validation dataset at once. Then a few individual images with good detections will not, by chance, make the AP results less informative when assessing the quality of the model.
Let's implement the described approach using ready-made functions from
torchmetrics.detection:
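
mAP_AP_dataset is the project's wrapper around torchmetrics; a minimal sketch of the underlying call (the example predictions and targets here are made up; in practice they are accumulated over the whole validation set) might look like this:

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox", class_metrics=True)  # class_metrics=True also reports per-class AP
preds = [{"boxes": torch.tensor([[100., 100., 200., 300.]]),
          "scores": torch.tensor([0.95]),
          "labels": torch.tensor([1])}]
targets = [{"boxes": torch.tensor([[105., 102., 198., 305.]]),
            "labels": torch.tensor([1])}]
metric.update(preds, targets)
print(metric.compute())  # dictionary with 'map', 'map_50', 'map_75', 'map_small', ... keys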

In [67]: mAP_AP_dataset(val_dataset, model)

100%|██████████| 200/200 [09:46<00:00, 2.93s/it]

Average Precision values for the person class:
AP (at IoU=.50) = 0.9846

Conclusion on one-class detection:

We trained a rather complex Faster RCNN model and obtained very high quality metrics on validation.
The reason for such high metrics is that this task had a fairly large training dataset and only 1 class to detect, which makes the task quite simple. Most importantly, the entire dataset consists of photographs from a few fixed cameras, so the validation and training photographs are, unfortunately, very similar to each other. Even the use of augmentation does not add much variety to the dataset. So it is worth assuming that on completely different images this pre-trained network will show significantly lower quality.


Example of the model's work on unfamiliar, dissimilar photographs:

In [31]: detect_and_visualize(image_input='test_folder/1.jpg',
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)
detect_and_visualize(image_input='test_folder/2.jpg',
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)
detect_and_visualize(image_input='test_folder/3.jpg',
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)

5 objects of the person class were detected

5 objects of the person class were detected


10 objects of the person class were detected


Detection of people with and without helmets:

Using the web service https://www.makesense.ai/ (https://www.makesense.ai/), I manually
annotated some of the images, this time creating 2 separate classes: a person with a helmet
on their head and a person without a helmet.
The dataset containing the new annotation can be downloaded from this link
(https://disk.yandex.ru/d/H4Qa16XDre6uuQ)

Regarding the procedure for creating a suitable dataset for training the model:
Initially, I annotated the photographs by outlining the entire human figure and indicating the object class (with or without a helmet), but this approach did not give good results. Due to the very uneven number of objects between the classes, I got a very low recall for the class without a hardhat (since 95% of the photos in the dataset show people in helmets).
I got slightly better results by artificially reducing the number of examples of the with-hardhat class (thereby reducing the dataset to 93 photos) and annotating the boxes not over the whole human body, but only over the head area. That is, a box can have one of 2 classes: a head with a helmet on or a bare head.

I saved the resulting annotations in the detect_hat_dataset/annotations folder.

Now let's add the images corresponding to these annotations to the still empty
detect_hat_dataset/images folder (selected by matching name)

In [45]: # Create an empty images folder:

newpath = 'detect_hat_dataset/images'
if not os.path.exists(newpath):
    os.makedirs(newpath)

# Get the annotation file names of the new dataset without the extension

names = []
for file in os.listdir('detect_hat_dataset/annotations'):
    names.append(file.split('.')[0])
print(f'Total manually annotated photos: {len(names)}')

Total manually annotated photos: 93

Copy the matching photos from the detect_dataset/images folder into
detect_hat_dataset/images:

In [46]: for file in os.listdir('detect_dataset/images'):
    if file.split('.')[0] in names:
        shutil.copy2('detect_dataset/images/' + file, 'detect_hat_dataset/images/')

Since there are particularly few annotated images in this case, we will use the ready-made
augmentation function to double the size of the dataset:


In [ ]: aug(image_dir="detect_hat_dataset/images",
    xml_dir="detect_hat_dataset/annotations",
    out_folder='augmented_hat_dataset')

Initial number of photos and annotations = 93
Total number of photos and annotations = 186

Now we have the augmented_hat_dataset folder, which we will work with when training the network.

The data preparation is exactly the same, so we will repeat all the steps with more condensed
commentary:

In [31]: # Get file names without the extension

names = []
for file in os.listdir('augmented_hat_dataset/images'):
    names.append(file.split('.')[0])

In [32]: # Randomly split the photos into train (80%) and test (20%)

train_data = random.sample(names, int(len(names) * 0.8))
print(f'Number of images in train = {len(train_data)}')
test_data = list(set(names) - set(train_data))
print(f'Number of images in test = {len(test_data)}')

Number of images in train = 148
Number of images in test = 38

In [33]: train_dataset = MakeDataset(path='augmented_hat_dataset', data=train_data, transforms=data_transform)
val_dataset = MakeDataset(path='augmented_hat_dataset', data=test_data, transforms=data_transform)

batch_size = 8  # Set the number of photos per batch

train_data_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)

val_data_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

Let's visualize the source images:

Take 3 random photos and draw their boxes:


In [42]: for _ in range(3):
    plot_random_image(train_data_loader, hat_class=True)


Let's define the model and its parameters.

This time there will be 3 classes:
background, person with a helmet, and person without a helmet

In [34]: device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

model = create_model(3)
model.to(device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)  # weight_decay value truncated in the export; 0.0005 assumed

cuda


Network training:
In [35]: train(model=model, train_data_loader=train_data_loader, optimizer=optimizer,
    val_data_loader=val_data_loader,
    num_epochs=30, comment=' hardhat detection new', device=device,
    save_path='models/model_hardhat_detection.pth')

Epoch 1 training in progress (out of 30)

25it [00:39, 1.58s/it]

Train summ loss after 1 epochs = 0.516562819480896

100%|██████████| 19/19 [00:13<00:00, 1.46it/s]

Validation summ loss after 1 epochs = 0.40158786154107046

Saving the new model, since the current configuration has a lower val loss
Epoch 2 training in progress (out of 30)

25it [00:38, 1.54s/it]

Train summ loss after 2 epochs = 0.38379207253456116

100%|██████████| 19/19 [00:11<00:00, 1.62it/s]

Validation summ loss after 2 epochs = 0.38813313449683945

Saving the new model, since the current configuration has a lower val loss
Epoch 3 training in progress (out of 30)

The TensorBoard logs from training are saved in the results_training directory. I also uploaded them to tensorboard.dev so they can be viewed online.
!!! The training results can be viewed by following this link
(https://tensorboard.dev/experiment/rr43qafqQKyKP7CQ5r1RCA/#scalars&_smoothingW

Loading the best state of the model:

Just in case, I renamed this model to model_hardhat_detection_final.pth so that I wouldn't accidentally overwrite it when running the code again. This model holds the weights from the epoch with the lowest validation loss.

Since the model file is too large (158 MB), it could not be uploaded to GitHub. So, to run the code yourself with my trained model, you need to run this function, which downloads the trained networks from my Google Drive into the models folder:

In [48]: #download_models(folder_name='models')

Load the model model_hardhat_detection_final.pth:


In [36]: model = create_model(3)


model.load_state_dict(torch.load('models/model_hardhat_detection_final.pth'))

Out[36]: <All keys matched successfully>

Testing:

Let's run the custom function that displays the model's detection results


In [ ]: detect_and_visualize(image_input='detect_hat_dataset/images/am3_6_frame084.jpg',
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True,
    treshhold=0.6)
detect_and_visualize(image_input='detect_hat_dataset/images/am3_9_frame090.jpg',
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True,
    treshhold=0.6)
detect_and_visualize(image_input='detect_hat_dataset/images/am3_9_violation_
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True,
    treshhold=0.6)

1 object of the hardhat class detected

1 object of the no_harhat class detected

1 object of the hardhat class detected

1 object of the no_harhat class detected


2 objects of the hardhat class detected

Model quality assessment:

Let's choose score = 0.6 as the threshold (chosen empirically)

Let's evaluate the quality of the model using the IoU metric:

In [37]: '''
The first step is to find the IoU scores:
we get an array with the number of elements equal to the number of objects in the dataset,
each containing a matrix of IoU correspondences between the predicted and real bounding boxes
'''
iou_scores_list = calculate_iou(model, val_dataset, treshold=0.6)

Let's calculate the average IoU on validation at the given threshold score = 0.6

In [38]: val = []
for image in iou_scores_list:
    for detect in image:
        val.append(max(detect))
print(f'Average IOU on validation is: {np.mean(val)}')

Average IOU on validation is: 0.8456330895423889

This score is computed only over the detected regions. But the model itself very often fails to
find objects or confuses the classes with each other, so a more objective parameter for
assessing quality is the average AP value across the classes - mAP

Let's visualize what the predicted and real bounding boxes look like on images from the
validation dataset and find the IoU for them. P.S.: (in this custom function, all predicted
boxes with score > 0 are drawn at once)


In [46]: visualize_detection(val_dataset, model, 5)

Let's determine the mAP metric values at different IoU thresholds, as well as the AP values
for both classes:

We find these metrics by combining all predictions and targets on the validation dataset into
one set (rather than averaging the mAP over the images of the dataset):

In [43]: mAP_AP_dataset(val_dataset, model, multiclasses=True)

100%|██████████| 38/38 [02:01<00:00, 3.19s/it]

Average Precision values for each class:

AP (averaged over IoU thresholds .50:.05:.95) for class WITH HARDHAT = 0.6938
AP (averaged over IoU thresholds .50:.05:.95) for class WITHOUT HARDHAT = 0.7347

Mean Average Precision values:

mAP (averaged over IoU thresholds .50:.05:.95) = 0.7143
mAP (at IoU=.50) = 0.9868
mAP (at IoU=.70) = 0.8487
mAP (for small objects with area < 32*32 pixels) = 0.5193
mAP (for medium objects with area from 32*32 to 64*64 pixels) = 0.695
mAP (for large objects with area > 64*64 pixels) = 0.8381

Conclusion on two-class detection:

This time we obtained lower quality metric values on validation.
The reasons for this may be the small size of the training dataset, the low diversity of the photographs themselves (which makes it hard for the model to generalize), the higher complexity of the task itself, and the very strong difference in the prior probability of the classes in the dataset. There are very few photographs of people without helmets in the train set, so the recall values for the "without helmet" class on validation turned out to be very low, and the AP metric for this class suffers as a result.
At the same time, the model has learned to find people with a helmet quite accurately (the metrics for this class are noticeably better).
So we can assume that the model is good at finding people with a helmet, but struggles to find people without one.

Let's try detection on unfamiliar photographs that are not similar to the dataset:


In [66]: detect_and_visualize(image_input='test_folder/2.jpg',
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True, treshhold=0.6)
detect_and_visualize(image_input='test_folder/3.jpg',
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True, treshhold=0.6)

5 hardhat class objects detected

1 hardhat class objects detected


7 objects of class no_harhat were detected

