
Proceedings of ICCNS 08, 27-28 September 2008

Steganography in MS Word Document using its In-built Features

Mrs. V. S. Tidake, Prof. S. G. Pukale, Prof. M. L. Dhore

V. S. Tidake is with NDMVPS's College of Engineering, Nashik, and is a student of M.E. (CSE-IT), Vishwakarma Institute of Technology, Pune (e-mail: vaishalitidake@yahoo.co.in).
Prof. S. G. Pukale is with the Vishwakarma Institute of Technology, Pune (e-mail: shraddhananad.pukale@vit.edu).
Prof. M. L. Dhore is with the Vishwakarma Institute of Technology, Pune (e-mail: manikrao.dhore@vit.edu).

Abstract— There are plenty of text resources available for text steganography. Microsoft Word, being a commonly used communication medium, can be well utilized as a cover document to hide data. In this paper, a new steganographic method is presented which hides data in MS Word documents. It uses one special feature of Microsoft Word: change tracking. The process of data hiding is divided into two steps: message embedding and message extraction. On the sender's side, a secret message is embedded inside a cover document to obtain a stegodocument. Depending on the data, the position where it should be embedded is decided. The embedded secret message is then revised back again, which makes the cover document look normal and also produces the stegodocument. On the receiver's side, the hidden message is extracted from the stegodocument. The paper compares two encoding techniques used for message embedding, namely Huffman coding and block encoding.

Keywords— Text steganography, cover document, change tracking, message embedding, stegodocument, message extraction.

I. INTRODUCTION

Steganography is the art of sending hidden or invisible messages. The name comes from the Greek words meaning "covered writing". While much of modern steganography focuses on images, audio signals, and other digital data, there is also a plethora of text sources in which information can be hidden. While there are various ways in which one may hide information in text, there is a specific set of techniques that uses the linguistic structure of a text [9] as the space in which information is hidden.

Text steganography uses text as the medium in which information is hidden. It can involve anything from changing the formatting of an existing text, to changing words within a text, to generating random character sequences or using context-free grammars to generate readable texts [10]. With any of these methods, the common point is that the hidden message is embedded in character-based text.

II. STEGANOGRAPHY USING CHANGE TRACKING

In the proposed steganographic method, a secret message is embedded inside a cover document D using change tracking [1] to obtain a stegodocument S. The process is divided into two stages, the degeneration stage and the revision stage, as shown in fig. 1.

Fig. 1 Steganography using change tracking

The data embedding is done in such a way that the stegodocument appears to be the product of a collaborative writing effort. Text segments in the document are degenerated so that they appear to be the work of an author with inferior writing skills, and the secret message is embedded in the choices of degenerations [1]. The degenerations are then revised back using the change tracking feature of MS Word, in such a way that it appears as if an expert author is correcting the mistakes. The change tracking information contained in the stegodocument allows recovery of the original cover, the degenerated document and, hence, the secret message. The extra change tracking information added during message embedding thus looks like a normal collaboration scenario.

As the input data consists of characters, it is first converted to binary data. Assume that the input message is converted to an m-bit stream M = b1 b2 … bm, where each bi is a bit. It is converted to the following binary message:

M' = H b1 b2 … bm P = b1' b2' …

where the header H denotes the length m of the message and P denotes padding bits. This message M' is embedded in the cover document D.

The message bits can be embedded using different techniques. This paper concentrates on Huffman coding and block encoding.
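For illustration, the bit-stream preparation described above can be sketched as follows. This is only a minimal sketch, not the reported implementation: the 16-bit length header and the byte-alignment padding are assumptions, since the paper does not fix their sizes.

```python
def prepare_message_bits(message: str, header_bits: int = 16) -> str:
    """Build M' = H || message bits || padding as a string of '0'/'1'.

    Assumptions (not specified in the paper): ASCII characters,
    a fixed-size length header H, and zero-padding to a byte boundary.
    """
    body = "".join(format(ord(ch), "08b") for ch in message)   # M = b1 b2 ... bm
    header = format(len(body), f"0{header_bits}b")             # H encodes the length m
    bits = header + body
    padding = "0" * (-len(bits) % 8)                           # P: pad to a multiple of 8
    return bits + padding

# Example: prepare_message_bits("hi") yields a 16-bit header followed by 16 message bits.
```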


The position in the cover document where the bits are embedded is called the embedding place. It is computed using the secret key K and the bit position in the message.

III. HUFFMAN CODING

This technique uses the probability of occurrence of each word to compute its Huffman code [11]. Words having small probabilities are assigned longer Huffman codes and those having higher probabilities are assigned shorter Huffman codes.

A. Message embedding

Message embedding is performed in two stages: degeneration and revision. In the degeneration stage, the cover document D is first segmented. Then some of the text segments in the cover document D are degenerated. For a text segment d, a degeneration set Rd is defined to be the ordered set of possible degenerated text segments. Let us use the set of synonyms of a word as the degeneration database. Rd(j) denotes the jth element in Rd, and Pr{Rd(j)} denotes the probability of occurrence of Rd(j). The probabilities of occurrence are used during message embedding so that the system prefers substitutions that occur commonly and thus produces a more natural stegodocument.
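To make the idea concrete, the following sketch builds such a code table for a small synonym set. It is only an illustration: the probabilities are made-up placeholder values rather than figures from the paper, and a standard heap-based Huffman construction is assumed.

```python
import heapq
from itertools import count

def huffman_codes(weighted_synonyms):
    """Return {synonym: code} for a degeneration set Rd, using the occurrence
    probabilities Pr{Rd(j)} as node weights (illustrative sketch)."""
    tiebreak = count()
    heap = [(p, next(tiebreak), word) for word, p in weighted_synonyms.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate case: a single synonym
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # merge the two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: label branches 0 and 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix             # leaf: a synonym and its code
    walk(heap[0][2], "")
    return codes

# Placeholder probabilities (NOT from the paper), only to show the shape of the table:
rd = {"scheme": 0.10, "system": 0.25, "plan": 0.05, "method": 0.22,
      "format": 0.15, "idea": 0.08, "proposal": 0.05, "design": 0.10}
print(huffman_codes(rd))
```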
Algorithm 1: Message Embedding using Huffman coding
Input: a cover document D partitioned into text segments d1, d2, …, dn; a character message to be embedded; and a secret key K.
Output: a stegodocument S.
Steps:
1) Convert the character message to binary as M' = b1' b2' b3' …
2) Initialize the set of embedding places P to be empty. Also define an index p to denote the position of the message bit bp' which we are currently encoding. Initially p is equal to 1.
3) Compute an embedding place i randomly using K such that i is in the range 1 ≤ i ≤ n and i is not in the set P. Now add i to P.
4) Construct a Huffman tree T for the text segment di with degeneration set Rdi of size c. Use Pr{Rdi(j)} as the initial weight of each node.
5) Degenerate the text segment di to di' = Rdi(j), where the degeneration choice j is determined by traversing the Huffman tree T from the root to a leaf node as dictated by the current bits to be embedded.
6) Repeat Steps 3 to 5 until the entire message has been embedded.
7) Revise each previously degenerated text segment di' back to di with the revisions being tracked, to yield stegotext segments si for all i in P.
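The core of the embedding loop (Steps 3 to 5) might look roughly like the sketch below. It is a simplified illustration under stated assumptions: the key K simply seeds a pseudo-random generator to pick embedding places, and huffman_codes() is the helper sketched above; neither detail is prescribed by the paper.

```python
import random

def embed_message(segments, degeneration_sets, bits, key):
    """Sketch of Steps 3-5: pick embedding places with key K and choose the
    degeneration whose Huffman code matches the next message bits.
    Assumes `bits` is the padded stream M' from Section II."""
    rng = random.Random(key)                  # assumption: K seeds a PRNG
    unused = list(range(len(segments)))
    places, choices = [], {}
    p = 0                                     # position of the next message bit
    while p < len(bits) and unused:
        i = unused.pop(rng.randrange(len(unused)))               # Step 3: fresh place
        places.append(i)
        codes = huffman_codes(degeneration_sets[segments[i]])    # Step 4: code table
        longest = max(len(c) for c in codes.values())
        window = bits[p:p + longest].ljust(longest, "0")         # zero-extend the tail
        for synonym, code in codes.items():                      # Step 5: codeword that
            if window.startswith(code):                          # prefixes the window
                choices[i] = synonym
                p += len(code)
                break
    # The caller degenerates di -> choices[i] and then revises back with tracking on.
    return places, choices
```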
B. Message Extraction

The change tracking information included in the stegodocument S allows simple recovery of the original document D and the degenerated document D', from both of which the embedded message can be extracted.

Algorithm 2: Message Extraction
Input: a stegodocument S = {s1, s2, …, sn} and a secret key K.
Output: the extracted message in characters.
Steps:
1) Recover the original document D = {d1, d2, …, dn} and the degenerated document D' = {d1', d2', …, dn'} from S using the change tracking information and the related operations provided by MS Word.
2) Initialize the set of embedding places P to be empty.
3) Define an index p which denotes the position of the message bit bp' which we are currently decoding. Initially set p = 1.
4) Select the same embedding place i as in message embedding, using the key K and the set of embedding places P.
5) Construct a Huffman tree T for the text segment di with degeneration set Rdi of size c, as described in Algorithm 1.
6) Determine the choice of degeneration j such that Rdi(j) = di'.
7) Decode the message bits encoded in j by traversing the Huffman tree T from the root to the leaf node nj. Note the path traversed; it gives the bits embedded at that position. Convert the bits to the corresponding characters.
8) Repeat Steps 4 to 7 until the entire message has been extracted.
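A matching sketch of the extraction loop (Steps 4 to 7) is given below: it regenerates the same place sequence from the key, looks up which synonym appears as the degeneration di', and reads its code back as message bits. As with the embedding sketch, the key-seeded PRNG, the header size and the helper huffman_codes() are assumptions for illustration only.

```python
import random

def extract_message(segments, degeneration_sets, degenerated, key, header_bits=16):
    """Sketch of Steps 4-7. `degenerated` maps place index -> observed synonym di';
    the header size mirrors the earlier bit-stream sketch."""
    rng = random.Random(key)                          # must mirror the embedding side
    unused = list(range(len(segments)))
    bits, msg_len = "", None
    while msg_len is None or len(bits) < header_bits + msg_len:
        i = unused.pop(rng.randrange(len(unused)))              # Step 4: same place order
        codes = huffman_codes(degeneration_sets[segments[i]])   # Step 5: same code table
        bits += codes[degenerated[i]]                 # Steps 6-7: code of the synonym seen
        if msg_len is None and len(bits) >= header_bits:
            msg_len = int(bits[:header_bits], 2)      # header H gives the length m
    payload = bits[header_bits:header_bits + msg_len]
    return "".join(chr(int(payload[k:k + 8], 2)) for k in range(0, msg_len, 8))
```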


C. Illustration with example

The working of both algorithms is illustrated with an example in this section.

[a] Message embedding

Here the set of synonyms is used as the degeneration set. A synonym database is available from different resources such as the WordNet database [7]. In this paper the synonym set is constructed from the thesaurus available in MS Word itself. For example, let the text segment to be degenerated be d = "scheme". Suppose the degeneration set of "scheme" contains the eight entries scheme, system, plan, method, format, idea, proposal and design. The probabilities of their occurrence can be calculated from any related database [8]. The synonyms of "scheme" and their respective probabilities are used to find the Huffman codes shown in fig. 2.

j  Rd(j)     Huffman Code
1  Scheme    011
2  System    00
3  Plan      01001
4  Method    10
5  Format    110
6  Idea      0101
7  Proposal  01000
8  Design    111

Fig. 2 Huffman codes for synonyms of "scheme"

By using the occurrence probabilities, construct a Huffman tree T. Label each left branch as 0 and each right branch as 1, and construct the Huffman codes for all the leaf nodes, as shown in fig. 2. Let the code to be embedded at this position be 110… When the tree is traversed from the root visiting the branches 1, 1 and 0 respectively, we reach the leaf node for "format". Hence the text segment d = "scheme" is degenerated to the text segment d' = "format". Then the track changes feature of MS Word is turned on and d' = "format" is revised back to d = "scheme". In the stegotext this appears as S = "formatscheme".

[b] Message extraction

Given a stegotext segment S = "formatscheme", we can recover the original and the degenerated text segments as di = "scheme" and di' = "format" respectively. Again construct the Huffman tree T using the given probabilities to obtain the same Huffman codes. Since the degenerated text segment is "format", traverse the tree from the root to the leaf node which denotes "format" and analyze the path travelled. It gives the bits "110", which means that the bits "110" were embedded at that position.
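The degenerate-then-revise step of this example can be automated through Word's COM interface. The rough sketch below (Python with pywin32, on Windows with Word installed) illustrates the idea rather than reproducing the authors' C# implementation; the file paths are placeholders.

```python
import win32com.client

word = win32com.client.Dispatch("Word.Application")
doc = word.Documents.Open(r"C:\path\to\cover.doc")   # placeholder path

# Degenerate with tracking off: "scheme" -> "format" looks like an ordinary edit.
doc.TrackRevisions = False
doc.Content.Find.Execute(FindText="scheme", ReplaceWith="format", Replace=1)  # wdReplaceOne

# Revise back with tracking on: the correction "format" -> "scheme" is recorded.
doc.TrackRevisions = True
doc.Content.Find.Execute(FindText="format", ReplaceWith="scheme", Replace=1)

# On the receiver's side, the tracked revisions expose both d and d'.
for rev in doc.Revisions:
    print(rev.Type, repr(rev.Range.Text))   # deletion carries "format", insertion "scheme"

doc.SaveAs(r"C:\path\to\stego.doc")
word.Quit()
```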
IV. BLOCK ENCODING

Block encoding is implemented by restricting the size of the synonym set to an integral power of 2. If the size of the set is 2^k, then k bits are used to encode each entry in the synonym database uniquely [12].

Algorithms for message embedding and message extraction

The algorithms are very similar to those used in Huffman coding. The only difference is that, instead of constructing Huffman codes, the synonyms in each set are uniquely represented using fixed-length bit sequences, as shown in the following example and in the sketch below.
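A minimal sketch of this block-code variant: k index bits select a synonym directly, and extraction is a reverse lookup. The ordering of the synonym set is the shared information referred to above; the set used here mirrors the example that follows, and everything else is illustrative.

```python
def block_embed(rd, bits, p):
    """Pick the degeneration whose k-bit block code equals the next k message bits.
    `rd` is the ordered degeneration set (its order must be shared by both sides)."""
    k = (len(rd) - 1).bit_length()          # |Rd| = 2^k  ->  k bits per entry
    j = int(bits[p:p + k], 2)               # block code read from the message
    return rd[j], p + k

def block_extract(rd, degenerated_word):
    """Recover the k bits from the observed degeneration (reverse lookup)."""
    k = (len(rd) - 1).bit_length()
    return format(rd.index(degenerated_word), f"0{k}b")

rd = ["scheme", "system", "plan", "method", "format", "idea", "proposal", "design"]
word, p = block_embed(rd, "110", 0)         # -> "proposal", as in the example below
assert block_extract(rd, word) == "110"
```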
Illustration with example

Again consider the set of synonyms for "scheme". As the size of the set is eight (that is, 2^3), three bits can be used to uniquely represent each entry in the set, as shown in fig. 3.

j  Rd(j)     Block Code
1  Scheme    000
2  System    001
3  Plan      010
4  Method    011
5  Format    100
6  Idea      101
7  Proposal  110
8  Design    111

Fig. 3 Block codes for synonyms of "scheme"

a. Message embedding

Let the code to be embedded next be 110… The set is searched for the block code 110, which denotes "proposal". Hence the text segment d = "scheme" is degenerated to the text segment d' = "proposal". Then the track changes feature of MS Word is turned on and d' = "proposal" is revised back to d = "scheme". In the stegotext this appears as S = "proposalscheme".

b. Message extraction

Given a stegotext segment S = "proposalscheme", we can recover the original and the degenerated text segments as di = "scheme" and di' = "proposal" respectively. Again construct the same block codes for the same synonym set of "scheme". The key point is that each entry in the synonym set of "scheme" must be represented by the same block code at the time of message embedding and of extraction. Since the degenerated text segment is "proposal", search for it in the synonym set of "scheme" and read off the corresponding block code. It gives the bits "110", which means that the bits "110" were embedded at that position.

V. SECURITY CONSIDERATIONS AND LIMITATIONS

For every steganographic system, security is very important. The following security aspects are considered for the given system:
1. The synonym database used for degeneration and the secret key are agreed upon by the sender and the receiver beforehand.
2. It is robust against statistical steganalysis [6] for the following reasons:
a. In Huffman coding, degenerations are chosen according to their occurrence probabilities. So even if an adversary succeeds in obtaining the database, he cannot find the occurrence frequencies, because they may be computed from personal databases owned only by the sender and the receiver. In block encoding, the sequence of words in the database is what determines the block codes.
b. To ensure that the statistical properties of the degenerations of a stegodocument are closer to those of a normal document, the message can be compressed or encrypted before embedding.
c. To increase robustness in Huffman coding, we can change the occurrence probability of a degeneration after it has been used once. The probability of the same word being selected again then decreases, and we can achieve the desired statistical coherence with a normal document.
3. The degeneration database can be modified dynamically after embedding secret data.
4. After embedding information in a stegodocument using the proposed method, a sender may manipulate the unused portions of the stegodocument.

As every coin has two sides, the given system also has some limitations:
1. The degeneration set and the key must be known only to the sender and the receiver.
2. The change tracking information used for message embedding should not be disturbed by anybody, knowingly or unknowingly.
3. The degeneration database should be kept realistic.

VI. IMPLEMENTATION RESULTS

The system is implemented using Microsoft Word 2003 and C#. The automation techniques of Microsoft Word are also used for the implementation.


The degeneration database is constructed using the thesaurus available in Microsoft Word 2003.

The system is evaluated by comparing the results obtained using three coding techniques, namely Huffman, block and arithmetic coding. The results obtained from these techniques are compared with each other as shown in fig. 4. The results show that the system performs better when block encoding is used for message embedding instead of Huffman coding. Further, if the message is compressed before embedding, the system performance improves and more data can be embedded. Here, arithmetic encoding is used as the compression technique.

Fig. 4 Comparison between encoding techniques
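The compress-before-embedding step mentioned above can be prototyped as a small front end to the bit-stream preparation from Section II. The paper uses arithmetic coding; the sketch below substitutes zlib's DEFLATE purely as a stand-in compressor, so it illustrates the pipeline rather than the reported system.

```python
import zlib

def prepare_compressed_bits(message: str, header_bits: int = 16) -> str:
    """Compress the secret message, then build M' = H || bits || padding.
    zlib stands in here for the arithmetic coder used in the paper."""
    compressed = zlib.compress(message.encode("utf-8"))
    body = "".join(format(byte, "08b") for byte in compressed)
    header = format(len(body), f"0{header_bits}b")   # H: length of the payload in bits
    bits = header + body
    return bits + "0" * (-len(bits) % 8)             # P: pad to a byte boundary

# A shorter bit stream needs fewer embedding places, which is why compressing
# the message increases the effective embedding capacity.
```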

VII. CONCLUSION

Though the steganographic method presented in this paper focuses on Microsoft Word, the idea can also be applied to other communication media. The robustness of the system can be increased by increasing the randomness in the input and in the degeneration database. As the work appears to be the effort of collaborative writing, it is less likely to come under close scrutiny. The results obtained from the implementation show that the embedding capacity of Huffman coding is lower than that of block encoding. Better results are obtained when a message is compressed using arithmetic encoding before embedding.

REFERENCES
[1] T.-Y. Liu and W.-H. Tsai, "A new steganographic method for data hiding in Microsoft Word documents by a change tracking technique."
[3] F. A. P. Petitcolas, R. J. Anderson, and M. G. Kuhn, "Information hiding—A survey," Proc. IEEE, vol. 87, no. 7, pp. 1062–1078, Jul. 1999.
[5] R. Stutsman, C. Grothoff, M. Attallah, and K. Grothoff, "Lost in just the translation," in Proc. ACM Symp. Applied Computing, 2006, pp. 338–345.
[6] F. Johnson and S. Jajodia, "Steganalysis: The investigation of hidden information," in Proc. IEEE Information Technology Conf., Syracuse, NY, Sep. 1998, pp. 113–116.
[7] WordNet v2.1, a lexical database for the English language, Princeton Univ., Princeton, NJ, 2005. [Online]. Available: http://wordnet.princeton.edu/
[8] Google, Google SOAP Search API (beta). [Online]. Available: http://www.seochat.com/c/a/Google-Optimization-Help/Using-the-Google-SOAP-Search-AP
[9] K. Bennett, "Linguistic steganography: Survey, analysis, and robustness concerns for hiding information in text," Purdue Univ., West Lafayette, IN, CERIAS Tech. Rep. 2004-13, May 2004.
[10] J. T. Brassil and N. F. Maxemchuk, "Copyright protection for the electronic distribution of text documents," Proc. IEEE, vol. 87, no. 7, pp. 1181–1196, Jul. 1999.
[11] P. Wayner, "Mimic functions," Cryptologia, vol. XVI, no. 3, pp. 193–214, 1992.
[12] M. Chapman, G. I. Davida, and M. Rennhard, "A practical and effective approach to large-scale automated linguistic steganography," in Proc. Information Security Conf., Malaga, Spain, Oct. 2001, pp. 156–165.
