A More Technical Redsdport
Internship Report
Submitted to:
Image Understanding and Pattern Recognition Lab
German Research Center for Artificial Intelligence
Kaiserslautern, Germany
Submitted by:
Ambrish Dantrey, B.Tech. III year, E&CE
Indian Institute of Technology, Roorkee
Roorkee, India
Supervisors: Faisal Shafait, Ilya Mezhirov
Reviewer: Prof. Dr. Thomas Breuel
Start Date for Internship: 15th May, 2007
End Date for Internship: 27th July, 2007
Report Date: 27th July, 2007
Preface
This report documents the work done during the summer internship at the Image
Understanding and Pattern Recognition (IUPR) Lab, Deutsches Forschungszentrum
für Künstliche Intelligenz (DFKI), Germany, under the supervision of Prof. Dr.
Thomas Breuel. The report first gives an overview of the tasks completed
during the internship, with technical details. The results obtained
are then discussed and analyzed.
The report also elaborates on future work that can be pursued as an
extension of the current work.
I have tried my best to keep the report simple yet technically correct. I hope I
have succeeded in my attempt.
Ambrish Dantrey
Acknowledgments
Simply put, I could not have done this work without the great deal of help I
received cheerfully from the whole of IUPR. The work culture at IUPR is truly
motivating. Everybody here is such a friendly and cheerful companion that work
stress never gets in the way.
I would especially like to thank Dr. Thomas Breuel and Dr. Daniel Keysers for
providing good ideas to work on. Not only did they advise me on my project,
but listening to their discussions in the IPeT meetings has evoked a strong
interest in image analysis. I am also highly indebted to my supervisors Faisal
Shafait and Ilya Mezhirov, who seemed to have solutions to all my problems.
Author
Abstract
This report presents the three tasks completed during the summer internship at
IUPR, which are listed below:
1. Detection of headlines in document images with black run-lengths, and
evaluation of OCRopus performance in detecting headlines
2. Reengineering the zone-classification module
3. Evaluation of the performance of different segmentation algorithms
All these tasks were completed successfully and the results were in line with
expectations. The detection of headlines achieved a low error rate of 2.85%, as
against 6.52% for the previously used method. During the evaluation of
segmentation algorithms, X-Y cut was found to gain a lot from noise cleanup,
which is an interesting result as it strengthens the case for the X-Y cut
segmentation algorithm as a suitable method for OCRopus. The reengineering and
porting of the zone-classification module to OCRopus makes it possible for
OCRopus to have text/image segmentation if it is required in future.
Author
OCRopus: Introduction
Though the field of optical character recognition (OCR) is considered to be
widely explored, the development of an efficient system for use in real-world
situations still remains a challenge for developers. OCRopus is a
state-of-the-art document analysis and OCR system, featuring pluggable layout
analysis, pluggable character recognition, statistical natural language
modeling, and multilingual capabilities; it is being developed at IUPR. This
being a very big project, I was assigned the tasks of developing tools for
layout analysis and evaluation.
The Goals:
The following goals were set as I proceeded in my work:
1. Conversion of ground-truth data in the MARG database from XML format
to the hOCR microformat [1].
2. Development of a rule-based headline detection method using the median
black run-length of the lines.
1. XML to hOCR:
2. Headline detection based on black run-length and its integration
into OCRopus:
Detection of headlines in document images is an issue that is mostly overlooked,
yet it is highly desirable for properly formatting the output of OCR. Until now,
OCRopus had used a rule-based method which took the space between lines as the
criterion for detecting headlines. Though this method worked for many images, it
also failed many times. It was an obvious observation that the black run-lengths
of headlines are larger than the black run-lengths of normal lines, and we tried
to build upon this concept. We used the median black run-length of a line as the
deciding criterion. The median was used instead of the mean because the mean
run-length could easily have been affected by noise merging with the text and
would have produced errors.
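To make the median-versus-mean argument concrete, here is a minimal Python sketch. It assumes a binarized line image given as rows of 0/1 values with 1 meaning black, and it counts horizontal runs; both the representation and the function name black_run_lengths are illustrative assumptions, not the OCRopus C++ code.

```python
from statistics import mean, median

def black_run_lengths(line_image):
    """Collect the lengths of horizontal runs of black pixels (1 = black).

    line_image is a list of rows, each row a list of 0/1 values.
    This representation is an assumption made for illustration.
    """
    runs = []
    for row in line_image:
        current = 0
        for pixel in row:
            if pixel:
                current += 1
            elif current:
                runs.append(current)
                current = 0
        if current:  # a run that touches the right border
            runs.append(current)
    return runs

# Toy text line: strokes two pixels wide, plus one long noise blob
# merged with the text at the end of the second row.
line = [
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0] + [1] * 12,
]
runs = black_run_lengths(line)
print("runs:", runs)
print("median:", median(runs), "mean:", mean(runs))
```

In this toy case the single noise run pulls the mean well above the stroke width, while the median stays at the stroke width.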
The whole approach is simple, as described below:
1. Calculate the median black run-length for each line on the page.
2. Compare this run-length for each line with that of the lines below and above it.
3. If the median black run-length of a line is found to be at least K1 (a
parameter) times the median run-length of the line below it, and at least K2
(another parameter) times the median run-length of the line above it, set it
as a headline.
The values of the parameters K1 and K2 had to be found experimentally. After
evaluating the performance of the program many times, the values of K1 and K2
were set to 1.5 and 1.1 respectively.
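The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-line medians are taken as already computed, the comparison is read as "at least K1 (or K2) times", and a missing neighbour (first or last line of the page) is treated as passing its test. The function name detect_headlines is hypothetical.

```python
K1 = 1.5  # factor against the line below (value reported in the text)
K2 = 1.1  # factor against the line above (value reported in the text)

def detect_headlines(medians, k1=K1, k2=K2):
    """Return indices of lines flagged as headlines.

    medians: median black run-length of each text line, top to bottom.
    Assumption: a missing neighbour counts as passing the comparison.
    """
    headlines = []
    for i, m in enumerate(medians):
        above = medians[i - 1] if i > 0 else None
        below = medians[i + 1] if i + 1 < len(medians) else None
        passes_below = below is None or m >= k1 * below
        passes_above = above is None or m >= k2 * above
        if passes_below and passes_above:
            headlines.append(i)
    return headlines

# Toy page: body text with median run-length 3 and one bolder line (6).
page = [3, 3, 6, 3, 3, 3]
print(detect_headlines(page))
```

With K1 larger than K2, the rule is stricter towards the line below than towards the line above.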
We used a histogram-based method to find the median run-length. A histogram of
the number of occurrences versus run-length was calculated; once we have such a
histogram, we normalize it with the largest value of occurrence. Then we
calculate the cumulative distribution function for this normalized histogram.
The point where the cumulative distribution function reaches a value of 0.5
corresponds to the median run-length.
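A minimal Python sketch of a histogram-based median is given below. Note one deliberate difference from the description: the sketch compares the running count against half of the total number of runs, which is the same as normalizing by the total so that the cumulative distribution ends at 1. That choice, and the function name histogram_median, are assumptions for illustration.

```python
def histogram_median(run_lengths):
    """Median run-length via a histogram and its cumulative sum.

    Assumption: the histogram is normalized by the total number of runs,
    so the cumulative distribution reaches 1 at the largest run-length.
    """
    if not run_lengths:
        raise ValueError("at least one run is required")
    # hist[r] = number of runs of length r
    hist = [0] * (max(run_lengths) + 1)
    for r in run_lengths:
        hist[r] += 1
    total = len(run_lengths)
    cumulative = 0
    for length, count in enumerate(hist):
        cumulative += count
        # Same test as cumulative / total >= 0.5, kept in integers
        # to avoid floating-point rounding.
        if 2 * cumulative >= total:
            return length

print(histogram_median([2, 2, 2, 2, 2, 12]))
```

For an even number of runs this returns the lower of the two middle values, which is sufficient for a threshold comparison.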
The program for detection of headlines was written in C++ and used standard
OCRopus classes. The program has been successfully integrated into OCRopus and
Evaluation:
false negatives which occurred in detection. It also outputs the total number of
true headlines which are present in the ground truth. The command line
form of this program is:
headlineeval hOCRtrue hOCRactual
2. The second program parses the file produced by running the above
program on a large number of files (or on a database), counts the total number
of false positives and false negatives that occurred in the whole database, and
reports the error rate of OCRopus on the whole database. The command line form
of this program is:
count_errors FILE.TXT
Both of the above programs were written in Python.
Criteria for evaluation: For evaluating the performance of OCRopus in detecting
headlines, we define the error rate as:
e = (fp + fn) * 100 / T
e = percentage error
fp = total number of false positives
fn = total number of false negatives
T = total number of text lines
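As a small worked example of this definition, the Python sketch below computes the error rate as a percentage. It assumes that T is the total number of text lines evaluated; the function name error_rate and the counts used are hypothetical and are not taken from the experiments.

```python
def error_rate(fp, fn, total_lines):
    """Percentage error: false positives plus false negatives over T.

    Assumption: T (total_lines) is the total number of text lines.
    """
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100.0 * (fp + fn) / total_lines

# Hypothetical counts, chosen only to illustrate the arithmetic.
example = error_rate(fp=30, fn=20, total_lines=1000)
print(example)  # 5.0
```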
We evaluated the performance on the standard University of Washington III
(UW-III) database [2]. The results for the headline detection program showed
clearly that the median black run-length criterion is better than the
space-between-lines criterion, yet errors were still present. While visually
analyzing the output, we observed that the run-length-based and space-based
criteria produced different false negatives and false positives. Hence it was
clear that one method could be used to remove the errors produced by the other.
So we tried to combine both approaches in such a way that the space-based
criterion is used as a filter to detect false positives produced by the
run-length-based criterion. The rule used to combine them was as follows:
1. Use the run-length-based criterion to find the headlines.
2. Calculate the median black run-length for the whole page.
3. Compare the median black run-length of every line found to be a headline in
step 1 with the median black run-length of the page. Since the median black
run-length of the page represents an ordinary text line rather than a headline,
any headline found in step 1 that has a run-length less than or equal to the
run-length for the whole page is a suspicious case. Recheck such a line with the
space-based criterion.
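The combination rule can be sketched in Python as below. The space-based criterion is not specified in detail in this report, so it is passed in as a callable named space_check, which is a hypothetical stand-in; the function name filter_candidates and the toy data are likewise assumptions.

```python
def filter_candidates(candidates, line_medians, page_median, space_check):
    """Keep run-length candidates, rechecking the suspicious ones.

    candidates:   indices flagged by the run-length-based criterion (step 1)
    line_medians: median black run-length of every line on the page
    page_median:  median black run-length of the whole page (step 2)
    space_check:  callable(index) -> bool standing in for the space-based
                  criterion (hypothetical; its details are not given here)
    """
    confirmed = []
    for i in candidates:
        suspicious = line_medians[i] <= page_median  # step 3
        if not suspicious or space_check(i):
            confirmed.append(i)
    return confirmed

# Toy page: line 2 is clearly bolder than the page, line 4 is not.
line_medians = [3, 3, 6, 3, 3, 2, 3]
candidates = [2, 4]
result = filter_candidates(
    candidates, line_medians, page_median=3,
    space_check=lambda i: False,  # pretend the space test rejects line 4
)
print(result)
```

Only the suspicious candidates are passed to the space-based test, so a headline that clearly exceeds the page median is never discarded by the filter.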
Results:
The results were as expected. The run-length-based criterion alone performed
better than the space-based criterion alone, and the combination of both
criteria described above outperformed both. The error rates on the standard
UW-III database for the different approaches are as follows:
Space-based headline detection:
total number of text lines: 138018
total number of false positives: 7356.0
total number of false negatives: 1713.0
% error = 6.52%
Black run-length-based headline detection:
total number of text lines: 138018
total number of false positives: 4341.0
total number of false negatives: 1386.0
% error = 4.14%
Both approaches combined (using the space-based approach as a filter to remove
false positives):
total number of text lines: 138018
total number of false positives: 2452.0
total number of false negatives: 1476.0
% error = 2.85%
Next we show some examples:
3. Text/Image Segmentation and Classification
Document image layout analysis is a crucial step in many applications related to
document images, like text extraction using optical character recognition (OCR),
reflowing documents, and layout-based document retrieval. Layout analysis is the
process of identifying layout structures by analyzing page images. Layout
structures can be physical (text, graphics, pictures, ...) or logical (titles,
paragraphs, captions, headings, ...). The identification of physical layout
structures is called physical or geometric layout analysis, while assigning
different logical roles to the detected regions is termed logical layout
analysis [3]. The task of a geometric layout analysis system is to segment the
document image into homogeneous zones, each consisting of only one physical
layout structure, and to identify their spatial relationship (e.g. reading
order). Therefore, the performance of layout analysis methods depends heavily on
the page segmentation algorithm used. A detailed explanation of different
segmentation algorithms and a comparison of their performance can be found in
[4, 5].
Another important subtask of document image analysis is the classification of
physically segmented blocks into one of a set of predefined classes. In most
cases the classification step follows the segmentation, and it is highly
desirable to evaluate the system performance on the whole
segmentation/classification task. With the help of such an evaluation, it is
easy to decide whether incorporating these steps into OCRopus would result in
improved performance. It would also be easy to decide which segmentation
algorithm to use.
For the classification step we used the method described in [6], which we
considered the best classification method. We used only two classes, text and
non-text, which were relevant to OCRopus, instead of the eight classes described
in that paper.
We already had an implementation of various segmentation algorithms and of the
classification step. The task included reengineering the classification step's
code and porting the whole segmentation-classification module into OCRopus,
making it use standard OCRopus classes and functions. The task has been
completed successfully, and we now have a version of the whole
segmentation-classification module in the OCR repository; it can be integrated
with OCRopus if the results of the experiments are positive. The command line
form of the program is:
ocrclassifyanddisplay -i IMAGE -b BOUNDINGBOXFILE -o OUTPUTIMAGE
IMAGE: The image to be classified
BOUNDINGBOXFILE: The bounding-box file produced by the segmentation
algorithms
OUTPUTIMAGE: The name of the output image to be written
Evaluation
As discussed earlier, evaluating the segmentation and classification steps
combined is highly desirable. The purpose of developing an evaluation module
was to decide which segmentation algorithm would best suit the needs of
OCRopus. We developed an evaluation program which evaluates the performance of
the two steps against the ground truth. Our criterion for the evaluation is the
Hamming distance between the text/non-text zone image produced from the ground
truth and that from the zone-classification module. The error rate is defined as
follows:
e = HD * 100 / T
e = error rate
HD = Hamming distance between the ground-truth text/non-text image and the
actual text/non-text image
T = total number of pixels present in the image
% efficiency = 100 - e
This program was developed in C++. The command line form of the
program is:
ocrevaluate -gt GROUNDTRUTHIMAGE -ai ACTUALIMAGE
GROUNDTRUTHIMAGE: Text/non-text image produced from the ground truth
ACTUALIMAGE: Text/non-text image produced by the actual program
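The Hamming-distance error rate defined above can be illustrated with a short Python sketch. This is not the C++ evaluation program: it assumes both text/non-text images are given as equally sized lists of rows with one label per pixel (for example 1 for text and 0 for non-text), and the function name hamming_error_rate is hypothetical.

```python
def hamming_error_rate(ground_truth, actual):
    """Return (error rate, efficiency) in percent for two label images.

    Both images are lists of rows of per-pixel labels and must have
    the same shape (an assumption of this sketch).
    """
    if len(ground_truth) != len(actual):
        raise ValueError("images differ in height")
    distance = 0  # Hamming distance: number of differing pixels
    total = 0     # T: total number of pixels
    for gt_row, act_row in zip(ground_truth, actual):
        if len(gt_row) != len(act_row):
            raise ValueError("images differ in width")
        distance += sum(g != a for g, a in zip(gt_row, act_row))
        total += len(gt_row)
    if total == 0:
        raise ValueError("images are empty")
    e = 100.0 * distance / total
    return e, 100.0 - e

# Toy 2x4 images that differ in exactly one pixel.
gt = [[1, 1, 0, 0],
      [1, 1, 0, 0]]
out = [[1, 1, 0, 0],
       [1, 0, 0, 0]]
error, efficiency = hamming_error_rate(gt, out)
print(error, efficiency)  # 12.5 87.5
```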
Issue of noise cleanup: Document image noise greatly affects the performance of
segmentation algorithms. It was our view that the performance of all the
algorithms should improve after noise cleanup. A better explanation can be found
in [5]. We used the noise cleanup system explained in [7]. We also expected the
improvement in performance of simple segmentation algorithms like X-Y cut to be
greater than that of complex algorithms like Voronoi, the reason being that
X-Y cut is more affected by noise than Voronoi is. When we evaluated the
performance of these algorithms with and without noise cleanup, we were proved
correct.
Results:
The performance of three segmentation algorithms (Voronoi, Docstrum and X-Y cut)
was evaluated by our program. The results were as we had expected and hence were
quite encouraging. Below are the percentage efficiencies for all these
algorithms with and without noise cleanup.
Algorithm    Percentage efficiency     Percentage efficiency
             without noise cleanup     with noise cleanup
Voronoi      87.03                     87.69
Docstrum     86.88                     86.92
X-Y cut      80.16                     85.70
As is evident, the performance of all the algorithms increases with noise
cleanup, but the improvement was much larger for X-Y cut than for the other
algorithms. After noise cleanup, X-Y cut has an efficiency close to that of
Voronoi, and being a simple algorithm, X-Y cut can be a suitable choice for
OCRopus.
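The size of the improvement can be read off directly from the reported efficiencies. The short Python sketch below only redoes that subtraction on the figures given in this section; the variable names are illustrative.

```python
# Percentage efficiency (without noise cleanup, with noise cleanup),
# copied from the results reported in this section.
efficiency = {
    "Voronoi": (87.03, 87.69),
    "Docstrum": (86.88, 86.92),
    "X-Y cut": (80.16, 85.70),
}

# Gain in percentage points from noise cleanup, rounded to two decimals.
gains = {
    name: round(after - before, 2)
    for name, (before, after) in efficiency.items()
}

# Remaining gap between Voronoi and X-Y cut after noise cleanup.
gap_after = round(efficiency["Voronoi"][1] - efficiency["X-Y cut"][1], 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"Voronoi minus X-Y cut after cleanup: {gap_after:.2f}")
```

X-Y cut gains about 5.5 percentage points, against less than one point for the other two algorithms, and ends up about two points behind Voronoi.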
Conclusion:
The whole experience of working at IUPR was great. This organization has a
superb work culture, great minds and a very high quality of work. I learned a
lot about image processing and analysis. The work I was able to complete here
was very satisfying. I have tried to develop as many add-ons as possible for
OCRopus and even got very encouraging results with some of them. I hope my work
on OCRopus helps it meet its goals.
References
1. T. M. Breuel: The hOCR Microformat for OCR Workflow and Results.
ICDAR, 2007, accepted for publication
2. I. Guyon, R. M. Haralick, J. J. Hull and I. T. Phillips: Datasets for OCR and
document image understanding research. In: Handbook of Character
Recognition and Document Image Analysis, World Scientific (1997) 779-799
3. R. Cattoni, T. Coianiz, S. Messelodi, C. M. Modena: Geometric layout
analysis techniques for document image understanding: a review. Technical
report, IRST, Trento, Italy (1998)*
4. F. Shafait, D. Keysers and T. M. Breuel: Performance Comparison of Six
Algorithms for Page Segmentation. 7th IAPR Workshop on Document
Analysis Systems (DAS), pages 368-379
5. F. Shafait, D. Keysers, T. M. Breuel: Pixel-Accurate Representation and
Evaluation of Page Segmentation in Document Images. ICPR 2006,
International Conference on Pattern Recognition, pages 872-875*
6. T. M. Breuel, D. Keysers, F. Shafait: Document Image Zone Classification -
A Simple High-Performance Approach. VISAPP 2007, pages 44-51
7. T. Gupta: OCRopus add-ons. Tech report, IUPR, 2007