Development Based On Community Code: A Success Story OSCONF09 #MOSC2010


The 2009 MSC Malaysia

Open Source Conference

Development Based on
Community Code:
A Success Story
Azizah Bte. Suliman Azizah@uniten.edu.my

Uwe Dippel udippel@uniten.edu.my

COIT – College of Information Technology


Universiti Tenaga Nasional

Everything is out there!

Chris DiBona (Google):
“Billions of lines of code”


We:
“It can come alive:

Support

Tutorial

Documentation

There may be a human behind it

We can add to and improve that code”

Objectives


Background: Introducing the project

Summing up the necessary functions

Lining up FOSS code involved

History of successful cooperation with
the various projects and communities

->A success story was born!
In a nutshell ...

Archiving of existing paper documentation:



Using existing fax machines to convert
paper-based documentation into electronic
format:

Portable format: PDF
(Portable Document Format)

The converted documents are sent
by e-mail as PDF attachments
Functions

Image Processing

Determining Factors

We want to commercialise our
system as an embedded system, so we

Prefer not to have to deal with licensing
in the sense of royalties

Need a system with low resource
requirements

Want to customize the system to suit
our needs

Want to give back to the community
what we have developed
Processing Steps

1. Receive fax from any fax machine
2. Separate cover page from fax
3. Extract the region with the e-mail address
4. Perform OCR on the characters in that region
5. Filter noise and reconstitute the e-mail
address from the cover page
6. Extract all other pages from the received fax
7. Convert those pages into PDF
8. Send the PDF as e-mail to the recipient's address
0. Operating System

(No comment)

1. Fax Software

The obvious contender for the fax server,
listening for incoming faxes and handling
them, was Hylafax
(www.hylafax.org)

We were helped within two days on

Our modem requirements

A race condition on our low-end USB fax
modem, with diagnosis and suggestions
2. Separate Cover Page

Faxes are essentially TIFF images, so the
extensive TIFF library, 'libtiff'
(http://www.remotesensing.org/libtiff/),
offers all the necessary functions. Its tool
'tiffcp' can be called from within a shell script.
We used tiffcp to copy one page of the
multi-page TIFF file, namely the cover
page, to a target file for the OCR.
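
A minimal sketch of this step, written here in Python rather than the shell script used in the project. It relies on tiffcp accepting an input name of the form file,n to pick image n (counted from 0) out of a multi-page file, and it assumes the cover page is the first page. The function and file names are hypothetical.

```python
import subprocess


def cover_page_command(fax_tiff, cover_tiff):
    """Build the tiffcp argument list that copies only the first page."""
    # "name,0" selects image 0 (assumed to be the cover page) of the input file.
    return ["tiffcp", f"{fax_tiff},0", cover_tiff]


def extract_cover_page(fax_tiff, cover_tiff):
    """Run tiffcp; raises CalledProcessError if the copy fails."""
    subprocess.run(cover_page_command(fax_tiff, cover_tiff), check=True)
```

Keeping the command construction separate from its execution makes the page selection easy to check without a fax file at hand.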

3. Extract region containing
the address
The OCR needs an image in .pnm format; we use
tifftopnm from the netpbm package
(http://sourceforge.net/projects/netpbm/)
for the conversion.
From the image resolution and the page size,
we calculate which parts can be cut away
because they hold no information relevant to
recovering the e-mail address printed
on the cover page.
The cutting is done by calling pbmcut, also
from the netpbm package.
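
The calculation can be illustrated with a short Python sketch. The slides do not say where on the cover page the address is printed, so the band position used here (a strip starting 25% down the page and 15% of the page high) is purely an assumption, as are the function and parameter names. The result is a rectangle in pixels that a cutting tool such as pbmcut could then be given.

```python
def address_region(dpi_x, dpi_y, page_w_in, page_h_in,
                   top_pct=25, height_pct=15):
    """Return (left, top, width, height) in pixels of the address band.

    top_pct and height_pct are hypothetical layout values, expressed as
    integer percentages of the page height.
    """
    # Pixel size of the full page follows from resolution times page size.
    page_w_px = round(dpi_x * page_w_in)
    page_h_px = round(dpi_y * page_h_in)
    # The band spans the full width; only rows outside the band are cut.
    top = page_h_px * top_pct // 100
    height = page_h_px * height_pct // 100
    return (0, top, page_w_px, height)


if __name__ == "__main__":
    # A letter-size page scanned at 200 x 200 dpi.
    print(address_region(200, 200, 8.5, 11.0))
```

Because fax resolution often differs between the horizontal and vertical direction, the two values are kept as separate parameters.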
4. Performing OCR

Our requirements for the OCR were



Not dictionary-based/no learning phase

Feature extraction

Command line interface
We found the solution in ocrad
(http://www.gnu.org/software/ocrad/)
Working from our samples with Antonio Diaz Diaz,
the main developer, we could improve the
recognition of characters like 'J' and 't',
as well as some noise reduction in ocrad.
5. Noise filtering and reconstitution
of e-mail address
To recover an e-mail address, we use what
we know about it:

Minimum and maximum number of
characters

Location and number of
dots and underscores

Character set and punctuation used

Presence and order of '@' and dot
We found that awk (http://awk.info/) handles
these checks with the greatest ease.
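
The kind of checks listed above can be sketched as follows. This is an illustration in Python, not the awk code used in the project; the length limits, the allowed character set, the lower-casing and the set of stripped noise characters are all assumptions chosen for the example.

```python
import re

# Hypothetical character set: letters, digits, dot, underscore and hyphen
# in the local part; at least one dot after the '@'.
ADDRESS = re.compile(r"^[a-z0-9._-]+@[a-z0-9-]+(\.[a-z0-9-]+)+$")


def reconstitute_email(ocr_text, min_len=6, max_len=64):
    """Return the first OCR line that passes all checks, or None."""
    for line in ocr_text.splitlines():
        # OCR tends to insert stray blanks inside a word; drop them all.
        candidate = re.sub(r"\s+", "", line).lower()
        # Strip speckle that OCR reports as punctuation at either end.
        candidate = candidate.strip(".,;:|'\"`~")
        if not (min_len <= len(candidate) <= max_len):
            continue
        if candidate.count("@") != 1 or ".." in candidate:
            continue
        if ADDRESS.match(candidate):
            return candidate
    return None


if __name__ == "__main__":
    noisy = " . ' ,\nudippel @uniten.edu.my ,\n~~"
    print(reconstitute_email(noisy))
```

Each rule rejects a candidate rather than trying to repair it, so a badly recognised address yields no result instead of a wrong recipient.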
6. Extract all other pages
from received fax
In the beginning, we used tiffcp to
extract the pages following the cover page
individually, by copying and concatenation.
Later, we joined the libtiff mailing list
(tiff@lists.maptools.org) and actually saw
our request
'Please, suppress one page!'
implemented as a feature that suppresses
the cover page, which is not needed in
the final PDF document.
It took less than 2 days to get the solution.
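
For illustration, the pages after the cover page can also be selected with tiffcp's file,x,y,z index list (images counted from 0). The Python sketch below builds such a command. It does not use the suppress feature mentioned above, whose exact syntax is not given in the slides, and it assumes the page count is already known. All names are hypothetical.

```python
def body_pages_command(fax_tiff, page_count, body_tiff):
    """Build a tiffcp argument list that copies every page but the first."""
    if page_count < 2:
        raise ValueError("fax has no pages after the cover page")
    # Indices 1 .. page_count-1 are the pages following the cover page.
    indices = ",".join(str(i) for i in range(1, page_count))
    return ["tiffcp", f"{fax_tiff},{indices}", body_tiff]


if __name__ == "__main__":
    print(body_pages_command("fax.tif", 4, "body.tif"))
```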
7. Convert into PDF -
8. Send e-mail to recipient

Conversion of TIFF into PDF is already
included in the Hylafax package, which
does the job based on ghostscript.
(No further work or communication was
needed from our side.)

Sending the mail (using metamail for
MIME encoding) is done through Postfix
as Message Transfer Agent (MTA), without
much modification.
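
What the MIME encoding step produces can be shown with Python's standard email library, used here in place of metamail purely for illustration. The sender address, subject line, body text and file name are hypothetical, and handing the message to the MTA is left out.

```python
from email.message import EmailMessage


def build_fax_mail(recipient, pdf_bytes,
                   sender="fax@example.org", filename="fax.pdf"):
    """Build a mail that carries the converted fax as a PDF attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Received fax"
    msg.set_content("The received fax is attached as a PDF document.")
    # Adding an attachment turns the message into multipart/mixed.
    msg.add_attachment(pdf_bytes, maintype="application",
                       subtype="pdf", filename=filename)
    return msg


if __name__ == "__main__":
    mail = build_fax_mail("user@example.org", b"%PDF-1.4 placeholder")
    print(mail["To"], mail.get_content_type())
```

The recipient passed in would be the address recovered from the cover page in step 5.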

“Glue”

… was all that was needed to put together
the necessary steps and functions to achieve
our target. We contacted a number of
communities for advice and help, and want
to thank all the developers and individuals in
these groups, who were more than willing to
help us out.
In turn, we have been able to give feedback
to a number of projects, to improve their
software, and to share our experience with
other users of their code.
Results

All functions could be implemented with
readily available code from the community.

All sources were available to us for
modifications (e.g. hylafax)

For desired functions, we obtained prompt
help in implementing our needs (libtiff)

Some problems received a reply and a
patch, usually within a day or less.

We could help improve community code
(libtiff, hylafax, ocrad)
Summary

Ours is a successful project based
mainly on a large number of functions,
all solved through reuse of
community code. ('Taking')
We have also been able to contribute
some minor improvements to this
community code. ('Giving back')

Have we been 'just lucky'?
