
On optimal stopping strategies for text recognition in a video stream as an application of a monotone sequential decision model

Konstantin Bulatov, Nikita Razumnyi, Vladimir V. Arlazarov


September 24, 2019
Introduction

Mobile DAR (document analysis and recognition) systems:

• Mobile offline document data extraction in real time
• Ability to use a video stream to increase recognition quality

Problems:
• How to combine per-frame results
• When to stop

Goals

• Improving per-frame accuracy
• Improving combination strategy
• Stopping strategies

• Explore a decision-theoretic framework for the recognition stopping problem
• Describe a stopping method based on modeling the next integrated result
• Provide experimental evaluation results

Problem statement

Optimal stopping problem:

$X_1, X_2, X_3, \ldots$ – observed sequence
$L_n(X_1, X_2, \ldots, X_n)$ – loss function
$N$ – stopping time

Goal: minimize expected loss at stopping time:

$E(L_N(X_1, X_2, \ldots, X_N)) \to \min$


Proofreading problem:
$M$ – number of initial errors
$X_i$ – errors corrected by the $i$-th proofreading
$c$ – cost of each proofreading

Loss function:

$L_n = M - \sum_{i=1}^{n} X_i + c \cdot n$
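
The proofreading loss can be traced numerically. The sketch below is only an illustration: the error counts and the cost are made-up numbers, and the function name `proofreading_loss` is hypothetical.

```python
def proofreading_loss(initial_errors, corrected_per_pass, cost_per_pass):
    """Loss L_n = M - sum(X_1..X_n) + c * n, with n = len(corrected_per_pass)."""
    n = len(corrected_per_pass)
    remaining_errors = initial_errors - sum(corrected_per_pass)
    return remaining_errors + cost_per_pass * n


# Hypothetical run: M = 10 initial errors, c = 1 per proofreading pass.
corrected = [6, 3, 1, 0]  # X_1..X_4, errors found on each pass (made up)
losses = [proofreading_loss(10, corrected[:n], 1.0) for n in range(len(corrected) + 1)]
print(losses)  # [10.0, 5.0, 3.0, 3.0, 4.0]
```

In this run the loss stops decreasing after the second pass: further passes cost more than the errors they remove.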


Recognition stopping problem:

$X^*$ – correct result
$X_i$ – $i$-th per-frame result
$R_n = R(X_1, \ldots, X_n)$ – combination of $n$ results
$c$ – cost of each observation

Loss function:

$L_n = \rho(R_n, X^*) + c \cdot n$
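
A minimal sketch of this loss for text strings follows. It assumes that $\rho$ is a Levenshtein distance normalized as $2d / (|a| + |b| + d)$. The slides do not spell out the metric, so both the normalization and the function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                    # delete ch_a
                           cur[j - 1] + 1,                 # insert ch_b
                           prev[j - 1] + (ch_a != ch_b)))  # substitute or match
        prev = cur
    return prev[-1]


def rho(a: str, b: str) -> float:
    """Assumed normalization: 2d / (|a| + |b| + d), which lies in [0, 1]."""
    d = levenshtein(a, b)
    denom = len(a) + len(b) + d
    return 2.0 * d / denom if denom else 0.0


def recognition_loss(combined: str, truth: str, n: int, cost: float) -> float:
    """L_n = rho(R_n, X*) + c * n for a combined result built from n frames."""
    return rho(combined, truth) + cost * n
```

The true value $X^*$ is unknown at run time, so this loss can only be evaluated offline on annotated data. This is why a stopping rule needs an estimate in its place.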

Monotone stopping problems

Monotone problems:

$\forall n \quad \{L_n \le E_n(L_{n+1})\} \subset \{L_{n+1} \le E_{n+1}(L_{n+2})\}$

Myopic rule:

$N_A = \min\{n \ge 0 : L_n \le E_n(L_{n+1})\}$
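
The myopic rule can be sketched as a loop that stops at the first step where the current loss does not exceed the expected next loss. The example applies it to the proofreading problem under an assumed detection model, not taken from the slides: each remaining error is found with probability `p` on the next pass. All names and numbers are hypothetical.

```python
from typing import Iterable, Iterator, Sequence, Tuple


def myopic_stopping_time(steps: Iterable[Tuple[float, float]]) -> int:
    """Return the first n with L_n <= E_n(L_{n+1}).

    `steps` yields pairs (L_n, E_n(L_{n+1})) for n = 0, 1, 2, ...
    """
    for n, (loss_now, expected_next) in enumerate(steps):
        if loss_now <= expected_next:
            return n
    raise ValueError("sequence ended before the myopic rule fired")


def proofreading_steps(remaining: Sequence[int], p: float, c: float) -> Iterator[Tuple[float, float]]:
    """Assumed model: each remaining error is caught with probability p per pass."""
    for n, errors_left in enumerate(remaining):
        loss_now = errors_left + c * n
        expected_next = errors_left * (1 - p) + c * (n + 1)
        yield loss_now, expected_next


remaining = [10, 4, 1, 0, 0]  # hypothetical errors left after 0..4 passes
n_stop = myopic_stopping_time(proofreading_steps(remaining, p=0.5, c=1.0))
print(n_stop)  # 2
```

Under this model the stopping condition reduces to `p * errors_left <= c`. The error count never increases, so once the condition holds it keeps holding, which is the monotone property stated above.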

Proposed approach


Assumption about the integrator function:

$\forall n \quad E(\rho(R_n, R_{n+1})) \ge E(\rho(R_{n+1}, R_{n+2}))$

Monotone condition events:

$\{L_n \le E_n(L_{n+1})\} = \{\rho(R_n, X^*) - E_n(\rho(R_{n+1}, X^*)) \le c\}$

Triangle inequality:
$\rho(R_n, X^*) - E_n(\rho(R_{n+1}, X^*)) \le E_n(\rho(R_n, R_{n+1}))$
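
The steps behind the two relations above can be written out as follows. This is a sketch that assumes $\rho$ is a metric and reads $E_n$ as the conditional expectation given $X_1, \ldots, X_n$, so that $E_n(\rho(R_n, X^*)) = \rho(R_n, X^*)$.

```latex
\begin{align*}
L_n \le E_n(L_{n+1})
  &\iff \rho(R_n, X^*) + c \cdot n \le E_n(\rho(R_{n+1}, X^*)) + c \cdot (n+1) \\
  &\iff \rho(R_n, X^*) - E_n(\rho(R_{n+1}, X^*)) \le c. \\[6pt]
\rho(R_n, X^*) &\le \rho(R_n, R_{n+1}) + \rho(R_{n+1}, X^*) \\
\Rightarrow\ \rho(R_n, X^*) - E_n(\rho(R_{n+1}, X^*)) &\le E_n(\rho(R_n, R_{n+1})).
\end{align*}
```

Consequently $E_n(\rho(R_n, R_{n+1})) \le c$ is a sufficient condition for $L_n \le E_n(L_{n+1})$, and it involves only quantities that do not depend on the unknown $X^*$.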


1. Estimate the expected distance $E_n(\rho(R_n, R_{n+1}))$ given current observations;
2. Threshold the obtained estimate, thus approximating the myopic rule.

Estimation:

$\hat{\Delta}_n = \frac{1}{n+1} \left( \delta + \sum_{i=1}^{n} \rho(R_n, R(x_1, x_2, \ldots, x_n, x_i)) \right)$
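
The two steps above can be sketched as below. Several parts are assumptions rather than the authors' setup: the integrator is a simple majority vote standing in for $R$ (the experiments use ROVER), the distance is a normalized Levenshtein distance, and $\delta$ is treated as a tunable constant because the slide does not define it. All function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ch_a != ch_b)))
        prev = cur
    return prev[-1]


def rho(a: str, b: str) -> float:
    """Assumed normalization of the edit distance: 2d / (|a| + |b| + d)."""
    d = levenshtein(a, b)
    denom = len(a) + len(b) + d
    return 2.0 * d / denom if denom else 0.0


def majority_integrator(results: Sequence[str]) -> str:
    """Stand-in for R: most frequent string, first seen wins ties (not ROVER)."""
    return Counter(results).most_common(1)[0][0]


def estimate_next_distance(results: Sequence[str],
                           integrate: Callable[[Sequence[str]], str],
                           distance: Callable[[str, str], float],
                           delta: float) -> float:
    """Delta_hat_n: model the next frame by each past result x_i in turn."""
    current = integrate(results)
    total = sum(distance(current, integrate(list(results) + [x])) for x in results)
    return (delta + total) / (len(results) + 1)


def should_stop(results, integrate, distance, delta, threshold) -> bool:
    """Step 2: threshold the estimate (the threshold plays the role of c)."""
    return estimate_next_distance(results, integrate, distance, delta) <= threshold


early = ["A8123", "AB123"]            # hypothetical per-frame results
later = ["AB123", "A8123", "AB123"]
print(should_stop(early, majority_integrator, rho, delta=0.1, threshold=0.05))  # False
print(should_stop(later, majority_integrator, rho, delta=0.1, threshold=0.05))  # True
```

With two conflicting results, one more frame could still flip the combined result, so the estimate stays above the threshold. After a third agreeing frame no modeled observation changes the result, and only the $\delta$ term remains.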

Dataset

MIDV-500 dataset:
• 50 types, 10 clips per type
• 15000 frames in total
• 546 fields in total

• Analyzed field groups: document numbers, dates, MRZ, Latin names
• 248 fields in total
• 2239 field clips in total
• Tesseract v4 + ROVER

Results

[Figure: Tesseract v4.0.0. Mean distance $\rho_L$ to $X^*$ plotted against the mean number of processed frames, with curves for the stopping rules $N_K$, $N_{CX}$, $N_{CR}$ and $N$.]

$E(N)$ and $E(\rho_L(R_N, X^*))$

Stopping   Value                   Target interval for the average number of observations
method                             5 ± 0.5   6 ± 0.5   7 ± 0.5   8 ± 0.5   9 ± 0.5   10 ± 0.5
$N_{CX}$   $E(N)$                  5.332     ∅         ∅         8.471     ∅         ∅
           $E(\rho_L(R_N, X^*))$   0.195     ∅         ∅         0.170     ∅         ∅
$N_{CR}$   $E(N)$                  5.099     ∅         6.920     ∅         8.594     10.103
           $E(\rho_L(R_N, X^*))$   0.201     ∅         0.180     ∅         0.167     0.164
$N_K$      $E(N)$                  5.000     6.000     7.000     8.000     9.000     10.000
           $E(\rho_L(R_N, X^*))$   0.197     0.191     0.185     0.180     0.178     0.171
$N$        $E(N)$                  4.571     5.539     6.683     7.742     8.771     9.779
           $E(\rho_L(R_N, X^*))$   0.174     0.165     0.161     0.161     0.158     0.158

Conclusion

1. Decision-theoretic framework for stopping text recognition in a video stream
2. Stopping method based on assumed monotonicity and next integrated result modeling
3. Evaluation on the open MIDV-500 dataset: the proposed method outperforms previously introduced approaches
4. Future work: incorporating confidence scores; generalizing to other objects; handling multiple objects

Questions?
