
Data preparation and neural network training:

In [1]: import numpy as np


from torch.utils.data import Dataset, DataLoader
import os, random, time, shutil
import torch, torchvision
from torchvision import transforms, datasets, models
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.utils as vutils
from torch.utils.tensorboard import SummaryWriter
from PIL import Image
from tqdm import tqdm
import cv2
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
%matplotlib inline

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 4
      2 from torch.utils.data import Dataset, DataLoader
      3 import os, random, time, shutil
----> 4 import torch, torchvision
      5 from torchvision import transforms, datasets, models
      6 from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

ModuleNotFoundError: No module named 'torchvision'

In [ ]: # Load the functions from the project's .py files:


from augment_and_visualize import *
from training_rcnn import *
from predict import *
from metrics import *

In [2]: # For reproducibility of results, let's fix the seeds:


random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic = True

# Code to ignore warnings
import warnings
warnings.filterwarnings("ignore")


We implement training of a neural network that detects objects of the "person" class

There is a pre-labeled dataset. It can be downloaded from this link
(https://disk.yandex.ru/d/7HNoc81at3r6VQ)

Let's implement data augmentation by doubling the dataset. To do this, I wrote a custom
function aug:

In [3]: aug(out_folder='augmented_dataset')

Initial number of photos and annotations = 500


Total number of photos and annotations = 1000
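
The aug function itself is defined in augment_and_visualize.py and is not shown in the notebook. As a minimal sketch only, assuming the augmentation is a horizontal flip (the function and argument names here are hypothetical, not the project's), doubling a VOC-annotated dataset could look roughly like this:

# Hypothetical sketch, NOT the project's aug():
# flip an image horizontally and mirror the xmin/xmax of its PASCAL VOC boxes.
import os
from PIL import Image, ImageOps
from bs4 import BeautifulSoup

def flip_image_and_boxes(image_path, xml_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)

    img = Image.open(image_path)
    ImageOps.mirror(img).save(os.path.join(out_dir, 'flip_' + os.path.basename(image_path)))

    with open(xml_path) as f:
        soup = BeautifulSoup(f.read(), 'xml')
    for box in soup.find_all('bndbox'):
        xmin, xmax = float(box.xmin.text), float(box.xmax.text)
        # a horizontal flip mirrors only the x coordinates
        box.xmin.string = str(img.width - xmax)
        box.xmax.string = str(img.width - xmin)
    with open(os.path.join(out_dir, 'flip_' + os.path.basename(xml_path)), 'w') as f:
        f.write(str(soup))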

Preparing data for training the Faster RCNN network:

Let's divide the data into training and test:

In [52]: # Get the file names without the extension


names = []
for file in os.listdir('augmented_dataset/images'):
names.append(file.split('.')[0])

I randomly divide the photos into train (80%) and test (20%)

In [53]: train_data = random.sample(names, int(len(names) * 0.8))

print(f'Number of images in train = {len(train_data)}')
test_data = list(set(names) - set(train_data))
print(f'Number of images in test = {len(test_data)}')

Number of images in train = 800
Number of images in test = 200

Annotations in this case are available in two different formats: COCO_json and
PASCAL_VOC_xml

Key difference between the xml and json box markup:

In json, each box has the following 4 values: x_min, y_min, width, height

In xml, each box has the following 4 values: x_min, y_min, x_max, y_max
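
For illustration only (this helper is not part of the project code), converting a COCO-style box to the PASCAL VOC convention is a one-liner:

# COCO json stores [x_min, y_min, width, height];
# PASCAL VOC xml stores [x_min, y_min, x_max, y_max].
def coco_to_voc(box):
    x_min, y_min, w, h = box
    return [x_min, y_min, x_min + w, y_min + h]

coco_to_voc([1000.3, 301.6, 82.0, 212.9])  # ~[1000.3, 301.6, 1082.3, 514.5], the first xml box shown below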


Let's look at how to work with annotation data presented in xml format

In [54]: with open('detect_dataset/annotations/PASCAL_VOC_xml/oz7_violation_frame519.xml') as f:
    data = f.read()
soup = BeautifulSoup(data, 'xml')
objects = soup.find_all('object')
num_objs = len(objects)
print(objects)

[<object>
<name>person</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>1000.3</xmin>
<ymin>301.6</ymin>
<xmax>1082.3</xmax>
<ymax>514.5</ymax>
</bndbox>
</object>, <object>
<name>person</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>1140.9</xmin>
<ymin>260.36</ymin>
<xmax>1215.6</xmax>
<ymax>493.2</ymax>
</bndbox>
</object>]

In this case, the useful information is in the xmin, ymin, xmax and ymax fields,
and the object class is given in the name field


In [55]: # Let's create functions that will parse this data:

def generate_box(obj):
    xmin = int(float(obj.find('xmin').text))
    ymin = int(float(obj.find('ymin').text))
    xmax = int(float(obj.find('xmax').text))
    ymax = int(float(obj.find('ymax').text))
    return [xmin, ymin, xmax, ymax]

'''
class person - 1
p.s.:
I wrote down the conditions for the class numbers in advance
in order to later train the second network with two classes:
1 - person with a helmet
2 - person without a helmet
'''
def generate_label(obj):
    if (obj.find('name').text == "person") or (obj.find('name').text == "hat"):
        return 1
    elif obj.find('name').text == "no_hat":
        return 2
    return 0

When training detection models, PyTorch expects each box in the format [xmin, ymin, xmax, ymax]


In [56]: '''
This function will output a dictionary with 3 keys: boxes, labels and image_id
The function takes as input:
    image_id - index of the photo from the PyTorch Dataset class
    file - path to the xml file
'''
def generate_target(image_id, file):
    with open(file) as f:
        data = f.read()
    soup = BeautifulSoup(data, 'xml')
    objects = soup.find_all('object')
    num_objs = len(objects)

    # We will iterate through the list obtained after parsing the xml
    boxes = []
    labels = []
    for i in objects:
        boxes.append(generate_box(i))
        labels.append(generate_label(i))
    boxes = torch.as_tensor(boxes, dtype=torch.float32)
    # In this case there is only 1 class
    labels = torch.as_tensor(labels, dtype=torch.int64)
    # convert the index into a torch tensor
    img_id = torch.tensor([image_id])

    # assemble the final dictionary for the photograph under study
    target = {}
    target["boxes"] = boxes
    target["labels"] = labels
    target["image_id"] = img_id

    return target

Let's create a MakeDataset class that inherits from the Dataset class and implements the
__init__, __getitem__ and __len__ methods


In [57]: class MakeDataset(Dataset):

    def __init__(self, path, data, transforms=None):
        self.transforms = transforms
        self.names_list = data
        self.path = path

    def __getitem__(self, idx):
        name = self.names_list[idx]
        file_image = self.path + '/images/' + str(name) + '.jpg'
        file_label = self.path + '/annotations/' + str(name) + '.xml'
        img = Image.open(file_image).convert("RGB")

        # Build the annotation dictionary using the previously written function
        target = generate_target(idx, file_label)

        if self.transforms is not None:
            img = self.transforms(img)

        return img, target

    def __len__(self):
        return len(self.names_list)

In [58]: data_transform = transforms.Compose([
    transforms.ToTensor()
])

# MakeDataset's __init__ takes the list of file names and
# the path to the shared folder with annotations and images:

train_dataset = MakeDataset(path='augmented_dataset', data=train_data, transforms=data_transform)
val_dataset = MakeDataset(path='augmented_dataset', data=test_data, transforms=data_transform)

Let's see in what format the data is stored in the MakeDataset class:

In [59]: first = train_dataset[7]

features, labels = first
print(labels)

{'boxes': tensor([[ 961.,  503., 1093.,  945.],
        [ 784.,  452.,  962.,  946.]]), 'labels': tensor([1, 1]), 'image_id': tensor([7])}

In this case, there are 2 objects of the person class in the photo, so there are 2 bounding
boxes


In [60]: # Represent the values in a batch as tuples:
def collate_fn(batch):
    return tuple(zip(*batch))


batch_size = 8  # Set the number of photos per batch

train_data_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)

val_data_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

Demonstration of output data from the DataLoader class -> [batch_size, dicts]

In [16]: device = 'cpu'

for imgs, annotations in train_data_loader:
    imgs = list(img.to(device) for img in imgs)
    annotations = [{k: v.to(device) for k, v in t.items()} for t in annotations]
    print(annotations)
    break

[{'boxes': tensor([[629., 132., 686., 261.]]), 'labels': tensor([1]), 'image_id': tensor([527])},
 {'boxes': tensor([[593., 108., 633., 238.],
        [630.,  92., 690., 238.]]), 'labels': tensor([1, 1]), 'image_id': tensor([481])}]

Let's visualize the source images:

Let's take 3 random photos and draw their boxes:


In [14]: for _ in range(3):
    plot_random_image(train_data_loader)


Network configuration:

We will use the Transfer learning approach, training the Faster RCNN network, which has
already been pretrained on COCO

In [17]: def create_model(num_classes):
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Replace the number of output classes with what we need
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# Create the model (set the number of classes as n+1, since class 0 is the background)
model = create_model(2)

Inside the Faster R-CNN implementation with FPN (Feature Pyramid Network) in the
PyTorch torchvision library, a complex loss function is used that combines several sub-
functions.

In [18]: model.to(device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)  # weight_decay value truncated in the export; 0.0005 assumed

# We will train the networks on a GPU:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

model.to(device)
num_epochs = 30  # number of training epochs

cuda


In this case, the choice of the SGD optimizer may be due to the following reasons:

Model size and complexity: Faster R-CNN with ResNet50-FPN is a fairly large and complex model with many trainable parameters, which can lead to rapid overfitting and instability when using more complex optimizers such as Adam.

Amount and type of data: When training object detectors based on Faster R-CNN, a loss function consisting of several components is used, including components related to object classification and box regression. SGD is a classic optimizer that does a good job of training such models, while Adam, a more advanced method, may not produce optimal results here.

Availability of pre-trained weights: In this case, we use pre-trained weights for Faster R-CNN with ResNet50-FPN, which can simplify the training process and allow the use of the simpler SGD optimizer instead of Adam.

In [19]: # Create an empty folder in which we will save the trained models
newpath = 'models'
if not os.path.exists(newpath):
    os.makedirs(newpath)

In [7]: # Let's run TensorBoard directly in the notebook:

%reload_ext tensorboard
%tensorboard --logdir 'results_training'

The forward pass of the model in train mode produces a loss_dict, a dictionary containing the loss values for each of the Faster R-CNN components used during training (a sketch of a single training step using this loss_dict is shown after the list). The loss_dict dictionary in the Faster R-CNN model includes the following loss components:

1. Loss_objectness is responsible for determining whether a proposed region contains any object at all (binary classification). For this, loss_objectness uses binary cross-entropy between the network output and the corresponding labels for each region.
2. Loss_classifier is responsible for determining which class the object in a given region belongs to. For this, loss_classifier uses multi-class cross-entropy between the network output and the corresponding class labels for each region.
3. Loss_box_reg is responsible for how well the model predicts the bounding box coordinates for a detected object in the image. For this, loss_box_reg uses a smooth L1 loss between the predicted and ground-truth box coordinates.
4. Loss_rpn_box_reg is responsible for how well the model predicts bounding box coordinates for the regions proposed by the Region Proposal Network (RPN) that may contain objects. For this, loss_rpn_box_reg also uses a smooth L1 loss between the predicted and ground-truth coordinates.
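
The project's train() helper (training_rcnn.py) is not shown here. As a rough sketch only, assuming it follows the standard torchvision recipe, one training step with this loss_dict could look like this:

# Hedged sketch of a single Faster R-CNN training step (not the project's train()):
# in train mode the model returns loss_dict, and the total loss is the sum of
# its four components described above.
model.train()
for imgs, targets in train_data_loader:
    imgs = [img.to(device) for img in imgs]
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

    loss_dict = model(imgs, targets)   # {'loss_classifier': ..., 'loss_box_reg': ...,
                                       #  'loss_objectness': ..., 'loss_rpn_box_reg': ...}
    losses = sum(loss for loss in loss_dict.values())

    optimizer.zero_grad()
    losses.backward()
    optimizer.step()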

Network training code:

Let’s load a custom function and train the network (30 training epochs):


In [20]: train(model=model, train_data_loader=train_data_loader, optimizer=optimizer,
    val_data_loader=val_data_loader,
    num_epochs=30, comment=' person detection new', device=device,
    save_path='models/model_human_detection.pth')

Train summ loss after 27 epochs = 0.10834603011608124

100%|██████████| 100/100 [01:03<00:00, 1.58it/s]

Validation summ loss after 27 epochs = 0.12323746003210545

Epoch 28 training in progress (out of 30)

25it [00:39, 1.56s/it]

Train summ loss after 28 epochs = 0.09104762971401215

100%|██████████| 100/100 [01:02<00:00, 1.59it/s]

Validation summ loss after 28 epochs = 0.11854852609336376

Epoch 29 training in progress (out of 30)

25it [00:38, 1.55s/it]

Train summ loss after 29 epochs = 0.11474072933197021

100%|██████████| 100/100 [01:02<00:00, 1.59it/s]

Validation summ loss after 29 epochs = 0.11717739876359701
The TensorBoard logs from training are saved in the results_training directory. I also uploaded them to tensorboard.dev so they can be viewed online.
!!! The training results can be viewed by following this link
(https://tensorboard.dev/experiment/rr43qafqQKyKP7CQ5r1RCA/#scalars&_smoothingW

Loading the most successful state of the model:

Just in case, I renamed this model to model_human_detection_final.pth so that I wouldn't accidentally overwrite it when running the code again. This model holds the weights from the epoch with the lowest validation loss.

Since the model file is too large (158 MB), it could not be uploaded to GitHub. So, to run the code yourself with my trained model, you need to run this function, which downloads the trained networks from my Google Drive into the models folder:

In [ ]: #download_models(folder_name='models')

Load the model model_human_detection_final.pth:

In [49]: model = create_model(2)


model.load_state_dict(torch.load('models/model_human_detection_final.pth'))

Out[49]: <All keys matched successfully>


Testing:
Let's run the custom function that displays the model's detection results

In [ ]: detect_and_visualize(image_input='detect_dataset/images/am3_7_violation_fram
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)
detect_and_visualize(image_input='detect_dataset/images/am3_9_frame111.jpg',
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)

2 objects of the person class were detected

3 objects of the person class were detected


Model quality assessment:

To calculate metrics in object detection tasks, two thresholds are usually used: a classification threshold and a threshold on the overlap with the true position of an object (intersection over union, IoU threshold).

The classification threshold determines which predictions are considered positive and which are negative. Usually the threshold is set on the score that the model produces for each detected object. If the score exceeds the threshold, the object is considered positive, otherwise negative.

The IoU threshold determines how much a detected object must overlap with the object's true position. Typically, the IoU threshold is set based on the required detection quality. If the IoU between the detected object and its true position exceeds the threshold, the object is considered correctly detected, otherwise it is considered a false detection.

Thus, to calculate metrics in object detection tasks, both thresholds need to be known: the classification threshold and the IoU threshold. They allow us to separate positive and negative examples and determine how well the model detects the true positions of objects.

An important note: the score threshold is used when the model is deployed in production, so it should be chosen especially carefully. The IoU threshold, on the other hand, is used only during validation to evaluate several well-known metrics (more on them a little later).

Let's choose score = 0.85 as the threshold
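
In eval mode the torchvision detection model returns, for each image, a dict with boxes, labels and scores; applying the score threshold is then just a filter. A small sketch with made-up values:

import torch

# Output of a torchvision detection model in eval mode for one image (made-up values):
prediction = {'boxes': torch.tensor([[10., 10., 50., 90.], [200., 40., 260., 180.]]),
              'labels': torch.tensor([1, 1]),
              'scores': torch.tensor([0.97, 0.42])}

keep = prediction['scores'] > 0.85           # the classification (score) threshold
filtered_boxes = prediction['boxes'][keep]   # only confident detections remain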

Let's evaluate the quality of the model using the IOU metric:

In [61]: '''
The first step is to find the IoU scores:
we get an array with the number of elements equal to the number of objects in the dataset,
each containing a matrix of IoU correspondences between the predicted and real bounding boxes
'''
iou_scores_list = calculate_iou(model, val_dataset, treshold=0.85)

Let's see what the data looks like in this array:

In [66]: print(f'Example 1:\n{iou_scores_list[2]}')
print(f'Example 2:\n{iou_scores_list[67]}')

Example 1:
tensor([[0.7746]])
Example 2:
tensor([[0.9480, 0.0432],
        [0.0486, 0.8020]])


We obtain IoU matrices containing the IoU coefficients between all pairs of predicted and real
boxes. These values range from 0 to 1, where 0 means the boxes do not overlap at all and 1
means they match completely.
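
calculate_iou is a project helper from metrics.py and is not shown here; a minimal sketch of how such a pairwise IoU matrix can be computed with torchvision.ops.box_iou (the example boxes below are made up):

import torch
from torchvision.ops import box_iou

# Pairwise IoU between predicted and ground-truth boxes in [xmin, ymin, xmax, ymax] format;
# the result has shape [num_predicted, num_true], like the matrices printed above.
pred_boxes = torch.tensor([[100., 100., 200., 300.], [400., 150., 480., 330.]])
true_boxes = torch.tensor([[110., 105., 205., 310.]])
print(box_iou(pred_boxes, true_boxes))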

Let's calculate the average IOU for validation at a given threshold score = 0.85

In [63]: val = []
for image in iou_scores_list:
    for detect in image:
        val.append(max(detect))
print(f'Average IOU on validation is: {np.mean(val)}')

Average IOU on validation is: 0.8693097233772278

Let's visualize what the predicted and real bounding boxes look like on images from the
validation dataset and find the IOU for them.
P.S.: (in this custom function, all predicted boxes with score > 0 are drawn at once)

In [99]: visualize_detection(val_dataset, model, 26)

Let's calculate precision and recall:

Recall for a one-class detection task shows how many of all objects of interest were
detected by the algorithm. That is, the closer the recall value is to 1, the more objects of
interest were found by the algorithm.

Precision for a one-class detection task shows how many of all bounding boxes predicted by
the algorithm actually contain objects of interest. That is, the closer the precision value is to
1, the fewer false objects were predicted by the algorithm.


To calculate these metrics, set the IOU threshold to 0.5

Calculating metrics for one specific photo:

In [25]: # Artificial example: only 3 objects actually existed, but 4 were detected
# with the selected score threshold, with the following IoU values:

score = torch.tensor([[0.20, 0.90, 0.10],
                      [0.80, 0.10, 0.20],
                      [0.00, 0.10, 0.20],
                      [0.10, 0.20, 0.00]])

print('recall (iou=0.1) =', recall(score, iou_threshold=0.1))
print('precision (iou=0.1) =', precision(score, iou_threshold=0.1))
print('recall (iou=0.5) =', recall(score, iou_threshold=0.5))
print('precision (iou=0.5) =', precision(score, iou_threshold=0.5))
print('recall (iou=0.85) =', recall(score, iou_threshold=0.85))
print('precision (iou=0.85) =', precision(score, iou_threshold=0.85))

recall (iou=0.1) = 1.0
precision (iou=0.1) = 0.75
recall (iou=0.5) = 0.6666666666666666
precision (iou=0.5) = 1
recall (iou=0.85) = 0.3333333333333333
precision (iou=0.85) = 1

In [26]: # Calculate the average metrics over the entire dataset at a score threshold of 0.85
print('Average recall over the entire validation dataset =',
      mean_metric(iou_scores_list, func='recall', iou_treshold=0.5))
print('Average precision over the entire validation dataset =',
      mean_metric(iou_scores_list, func='precision', iou_treshold=0.5))

Average recall over the entire validation dataset = 0.9708333333333333
Average precision over the entire validation dataset = 1.0

Explanation of the results obtained:

If the detection algorithm identifies many regions in a photograph that look very similar to a real object, but these regions do not actually correspond to the object (a bunch of predicted boxes next to a person, as in YOLO without non-maximum suppression), then recall will be high and precision will be low.

In our case, the detection confidence threshold was chosen to be very high (score = 0.85), so our situation is exactly the opposite. More often than not, the detector will not find a person at all rather than find them twice. For this reason, our precision is very high, and recall is somewhat lower.

Let's calculate the Average Precision metric


When calculating AP, we vary the score threshold (model confidence) and compute the corresponding precision and recall values for each threshold. Precision and recall change depending on the selected threshold. When the score threshold is set very high, i.e. the model only reports objects it is most confident about, precision will be very high but recall low, since many objects may be missed. If the threshold is set very low, i.e. the model detects all objects but also a lot of noise (false detections), then recall will be high but precision low.


In this case, AP is essentially the precision averaged over score thresholds from 0 to 1, as recall changes from 0 to 1. AP takes both precision and recall into account when assessing the quality of an object detection algorithm and does not depend on the choice of a specific score threshold, which gives a more general assessment of the detection model's quality. Since there is only one detectable class in this problem, the metric mAP (mean AP) = AP.
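
The average_precision helper used below comes from metrics.py; one common way to approximate AP as the area under the precision-recall curve, given per-threshold precision and recall lists like the ones built in the next cell, is a simple numerical integration over recall (a sketch, not necessarily identical to the project's implementation):

import numpy as np

def ap_area_under_pr(precisions, recalls):
    # sort the (recall, precision) points by recall and integrate precision over recall
    order = np.argsort(recalls)
    r = np.asarray(recalls)[order]
    p = np.asarray(precisions)[order]
    return float(np.trapz(p, r))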

In [27]: rec = []
prec = []
for i in np.linspace(0, 1, num=11, endpoint=False):
    iou_scores_list = calculate_iou(model, val_dataset, treshold=i)
    rec.append(mean_metric(iou_scores_list, func='recall', iou_treshold=0.5))
    prec.append(mean_metric(iou_scores_list, func='precision', iou_treshold=0.5))
rec.append(0)
prec.append(1)

Building the precision-recall curve:

AP can be defined as the area under the Precision-Recall curve. Let's plot this curve,
constructed with an IoU threshold of 0.5


In [28]: plt.figure(figsize=(12, 4), dpi=80)

plt.plot(rec, prec, label="threshold IoU = 0.5")
plt.title('Precision-Recall curve')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(frameon=False)
plt.show()

plt.figure(figsize=(5, 4), dpi=80)
plt.plot(rec[:-1], prec[:-1], label="threshold IoU = 0.5")
plt.title('Cropped Precision-Recall curve')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(frameon=False)
plt.show()

Determining the average Average Precision (AP) over the images of the validation dataset at
IoU = 0.5:

In [29]: print(f'Average precision = {average_precision(prec, rec)}')

Average precision = 0.9944000000000001


In this case, we implemented the AP calculation ourselves: we computed the AP value for each photo at IoU = 0.5 and then averaged the metric over the entire validation dataset. This approach is not entirely correct. It is better to compute this metric by concatenating the detection results and ground truth for all images in the validation dataset at once. Then a few individual images with good detections will not, by chance, make the AP results less informative when assessing the quality of the model.
Let's implement the described approach using ready-made functions from
torchmetrics.detection:
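
mAP_AP_dataset is the project's wrapper around torchmetrics; a minimal sketch of the underlying call (the example predictions and targets here are made up; in practice they are accumulated over the whole validation set) might look like this:

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox", class_metrics=True)  # class_metrics=True also reports per-class AP
preds = [{"boxes": torch.tensor([[100., 100., 200., 300.]]),
          "scores": torch.tensor([0.95]),
          "labels": torch.tensor([1])}]
targets = [{"boxes": torch.tensor([[105., 102., 198., 305.]]),
            "labels": torch.tensor([1])}]
metric.update(preds, targets)
print(metric.compute())  # dictionary with 'map', 'map_50', 'map_75', 'map_small', ... keys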

In [67]: mAP_AP_dataset(val_dataset, model)

100%|██████████| 200/200 [09:46<00:00, 2.93s/it]

Average Precision values for the person class:
AP (at IoU=.50) = 0.9846

Conclusion on one-class detection:

We trained a rather complex Faster RCNN model and obtained very high quality metrics on validation.
The reason for such high metrics is that this task had a fairly large training dataset and only 1 class to detect, which makes the task quite simple. Most importantly, the entire dataset consists of photographs from a few fixed cameras, so the validation and training photographs are, unfortunately, very similar to each other. Even the use of augmentation does not add much variety to the dataset. So it is worth assuming that on completely different images this pre-trained network will show significantly lower quality.


Example of the model's work on unfamiliar, dissimilar photographs:

In [31]: detect_and_visualize(image_input='test_folder/1.jpg',
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)
detect_and_visualize(image_input='test_folder/2.jpg',
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)
detect_and_visualize(image_input='test_folder/3.jpg',
model_path='models/model_human_detection_final.pth',
classes=['person'], plt_show=True)

5 objects of the person class were detected

5 objects of the person class were detected


10 objects of the person class were detected


Detection of people with and without helmets:

Using the web service https://www.makesense.ai/ (https://www.makesense.ai/), I manually
annotated some of the images, this time creating 2 separate classes: a person with a helmet
on their head and a person without a helmet.
The dataset containing the new annotation can be downloaded from this link
(https://disk.yandex.ru/d/H4Qa16XDre6uuQ)

Regarding the procedure for creating a suitable dataset for training the model:
Initially, I annotated the photographs by outlining the entire human figure and indicating the object class (with or without a helmet), but this approach did not give good results. Due to the very uneven number of objects between the classes, I got a very low recall for the class without a hardhat (since 95% of the photos in the dataset show people in helmets).
I got slightly better results by artificially reducing the number of examples of the with-hardhat class (thereby reducing the dataset to 93 photos) and annotating the boxes not over the whole human body, but only over the head area. That is, a box can have one of 2 classes: a head with a helmet on or a bare head.

I saved the resulting annotations in the detect_hat_dataset/annotations folder.

Now let's add the images corresponding to these annotations to the still empty
detect_hat_dataset/images folder (selected by matching name)

In [45]: # Create an empty images folder:

newpath = 'detect_hat_dataset/images'
if not os.path.exists(newpath):
    os.makedirs(newpath)

# Get the annotation file names of the new dataset without the extension

names = []
for file in os.listdir('detect_hat_dataset/annotations'):
    names.append(file.split('.')[0])
print(f'Total manually annotated photos: {len(names)}')

Total manually annotated photos: 93

Copy the matching photos from the detect_dataset/images folder into
detect_hat_dataset/images:

In [46]: for file in os.listdir('detect_dataset/images'):
    if file.split('.')[0] in names:
        shutil.copy2('detect_dataset/images/' + file, 'detect_hat_dataset/images/')

Since there are particularly few annotated images in this case, we will use the ready-made
augmentation function to double the size of the dataset:


In [ ]: aug(image_dir="detect_hat_dataset/images",
    xml_dir="detect_hat_dataset/annotations",
    out_folder='augmented_hat_dataset')

Initial number of photos and annotations = 93
Total number of photos and annotations = 186

Now we have the augmented_hat_dataset folder, which we will work with when training the network.

The data preparation is exactly the same, so we will repeat all the steps with more condensed
commentary:

In [31]: # Get file names without the extension

names = []
for file in os.listdir('augmented_hat_dataset/images'):
    names.append(file.split('.')[0])

In [32]: # Randomly split the photos into train (80%) and test (20%)

train_data = random.sample(names, int(len(names) * 0.8))
print(f'Number of images in train = {len(train_data)}')
test_data = list(set(names) - set(train_data))
print(f'Number of images in test = {len(test_data)}')

Number of images in train = 148
Number of images in test = 38

In [33]: train_dataset = MakeDataset(path='augmented_hat_dataset', data=train_data, transforms=data_transform)
val_dataset = MakeDataset(path='augmented_hat_dataset', data=test_data, transforms=data_transform)

batch_size = 8  # Set the number of photos per batch

train_data_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)

val_data_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

Let's visualize the source images:

Take 3 random photos and draw their boxes:


In [42]: for _ in range(3):
    plot_random_image(train_data_loader, hat_class=True)


Let's define the model and its parameters.

This time there will be 3 classes:
background, person with a helmet, and person without a helmet

In [34]: device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

model = create_model(3)
model.to(device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)  # weight_decay value truncated in the export; 0.0005 assumed

cuda


Network training:
In [35]: train(model=model, train_data_loader=train_data_loader, optimizer=optimizer,
    val_data_loader=val_data_loader,
    num_epochs=30, comment=' hardhat detection new', device=device,
    save_path='models/model_hardhat_detection.pth')

Epoch 1 training in progress (out of 30)

25it [00:39, 1.58s/it]

Train summ loss after 1 epochs = 0.516562819480896

100%|██████████| 19/19 [00:13<00:00, 1.46it/s]

Validation summ loss after 1 epochs = 0.40158786154107046

Saving the new model, since the current configuration has a lower val loss
Epoch 2 training in progress (out of 30)

25it [00:38, 1.54s/it]

Train summ loss after 2 epochs = 0.38379207253456116

100%|██████████| 19/19 [00:11<00:00, 1.62it/s]

Validation summ loss after 2 epochs = 0.38813313449683945

Saving the new model, since the current configuration has a lower val loss
Epoch 3 training in progress (out of 30)

The TensorBoard logs from training are saved in the results_training directory. I also uploaded them to tensorboard.dev so they can be viewed online.
!!! The training results can be viewed by following this link
(https://tensorboard.dev/experiment/rr43qafqQKyKP7CQ5r1RCA/#scalars&_smoothingW

Loading the best state of the model:

Just in case, I renamed this model to model_hardhat_detection_final.pth so that I wouldn't accidentally overwrite it when running the code again. This model holds the weights from the epoch with the lowest validation loss.

Since the model file is too large (158 MB), it could not be uploaded to GitHub. So, to run the code yourself with my trained model, you need to run this function, which downloads the trained networks from my Google Drive into the models folder:

In [48]: #download_models(folder_name='models')

Load the model model_hardhat_detection_final.pth:


In [36]: model = create_model(3)


model.load_state_dict(torch.load('models/model_hardhat_detection_final.pth'))

Out[36]: <All keys matched successfully>

Testing:

Let's run the custom function that displays the model's detection results


In [ ]: detect_and_visualize(image_input='detect_hat_dataset/images/am3_6_frame084.jpg',
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True,
    treshhold=0.6)
detect_and_visualize(image_input='detect_hat_dataset/images/am3_9_frame090.jpg',
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True,
    treshhold=0.6)
detect_and_visualize(image_input='detect_hat_dataset/images/am3_9_violation_
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True,
    treshhold=0.6)

1 object of the hardhat class detected

1 object of the no_harhat class detected

1 object of the hardhat class detected

1 object of the no_harhat class detected


2 objects of the hardhat class detected

Model quality assessment:

Let's choose score = 0.6 as the threshold (chosen empirically)

Let's evaluate the quality of the model using the IoU metric:

In [37]: '''
The first step is to find the IoU scores:
we get an array with the number of elements equal to the number of objects in the dataset,
each containing a matrix of IoU correspondences between the predicted and real bounding boxes
'''
iou_scores_list = calculate_iou(model, val_dataset, treshold=0.6)

Let's calculate the average IoU on validation at the given threshold score = 0.6

In [38]: val = []
for image in iou_scores_list:
    for detect in image:
        val.append(max(detect))
print(f'Average IOU on validation is: {np.mean(val)}')

Average IOU on validation is: 0.8456330895423889

This score is computed only over the detected regions. But the model itself very often fails to
find objects or confuses the classes with each other, so a more objective parameter for
assessing quality is the average AP value across the classes - mAP

Let's visualize what the predicted and real bounding boxes look like on images from the
validation dataset and find the IoU for them. P.S.: (in this custom function, all predicted
boxes with score > 0 are drawn at once)


In [46]: visualize_detection(val_dataset, model, 5)

Let's determine the mAP metric values at different IoU thresholds, as well as the AP values
for both classes:

We find these metrics by combining all predictions and targets on the validation dataset into
one set (rather than averaging the mAP over the images of the dataset):

In [43]: mAP_AP_dataset(val_dataset, model, multiclasses=True)

100%|██████████| 38/38 [02:01<00:00, 3.19s/it]

Average Precision values for each class:

AP (averaged over IoU thresholds .50:.05:.95) for class WITH HARDHAT = 0.6938
AP (averaged over IoU thresholds .50:.05:.95) for class WITHOUT HARDHAT = 0.7347

Mean Average Precision values:

mAP (averaged over IoU thresholds .50:.05:.95) = 0.7143
mAP (at IoU=.50) = 0.9868
mAP (at IoU=.70) = 0.8487
mAP (for small objects with area < 32*32 pixels) = 0.5193
mAP (for medium objects with area from 32*32 to 64*64 pixels) = 0.695
mAP (for large objects with area > 64*64 pixels) = 0.8381

Conclusion on two-class detection:

This time we obtained lower quality metric values on validation.
The reasons for this may be the small size of the training dataset, the low diversity of the photographs themselves (which makes it hard for the model to generalize), the higher complexity of the task itself, and the very strong difference in the prior probability of the classes in the dataset. There are very few photographs of people without helmets in the train set, so the recall values for the "without helmet" class on validation turned out to be very low, and the AP metric for this class suffers as a result.
At the same time, the model has learned to find people with a helmet quite accurately (the metrics for this class are noticeably better).
So we can assume that the model is good at finding people with a helmet, but struggles to find people without one.

Let's try detection on unfamiliar photographs that are not similar to the dataset:


In [66]: detect_and_visualize(image_input='test_folder/2.jpg',
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True, treshhold=0.6)
detect_and_visualize(image_input='test_folder/3.jpg',
    model_path='models/model_hardhat_detection_final.pth',
    classes=['hardhat','no_harhat'], plt_show=True, treshhold=0.6)

5 hardhat class objects detected

1 hardhat class objects detected


7 objects of class no_harhat were detected

