Week10 String

This course material is now made available for public usage.
Special acknowledgement to School of Computing, National University of Singapore
for allowing Steven to prepare and distribute these teaching materials.

CS3233 Competitive Programming
Dr. Steven Halim
Week 10: String Processing

CS3233 Competitive Programming,
Steven Halim, SoC, NUS
Outline
Mini Contest #8 + Discussion + Break + Admins
Covered briefly in class but indirectly examinable:
  Basic String Processing Skills
Skipped this semester (use this skill to solve more UVa problems):
  Ad Hoc String Problems
  String Matching (Knuth-Morris-Pratt's Algorithm)
Today, focus on:
  Suffix Trie/Tree/Array
Note: DP on String has been discussed in Weeks 04-05
BASIC STRING PROCESSING SKILLS
Section 6.2
Basics of String Processing (01)
Data Structure

C: null-terminated character array
  We have to know the string length (or at least the upper bound) beforehand
    char str[10000];

C++: string class
    #include <string>
    using namespace std;
    string str;

Java: String class
    String str;
Basics of String Processing (02)
Reading a String (a word)

C:
    #include <stdio.h>
    scanf("%9999s", str); // pass str, not &str; the width keeps the read inside char str[10000]

C++:
    #include <iostream>
    using namespace std;
    cin >> str;

Java:
    import java.util.*;
    Scanner sc = new Scanner(System.in);
    str = sc.next();
Basics of String Processing (03)
Reading a Line of String

C:
    fgets(str, 10000, stdin); // safer version: never writes past the buffer
    // but you will read an extra '\n' at the back; strip it if needed
    // gets(str) is the shorter, unsafe alternative (removed in C11 and C++14)
    // PS: Mooshak prefers fgets

C++:
    getline(cin, str);

Java:
    str = sc.nextLine();
Basics of String Processing (04)
Printing and Formatting String Output

C (preferred method):
    printf("s = %s, l = %d\n", str, (int)strlen(str));

C++ (the C++ version is harder):
    cout << "s = " << str << ", l = " << str.length() << endl;

Java:
  We can use System.out.print or System.out.println,
  but the best is to use the C-style System.out.printf
    System.out.printf("s = %s, l = %d\n", str, str.length());
Basics of String Processing (05)
Comparing Two Strings

C:
    printf(strcmp(str, "test") ? "different\n" : "same\n");

C++:
    // parentheses are required: << binds tighter than == and ?:
    cout << (str == "test" ? "same" : "different") << endl;

Java:
    System.out.println(str.equals("test"));
Basics of String Processing (06)
Combining Two Strings

C:
    strcpy(str, "hello");
    strcat(str, " world");
    printf("%s\n", str);
    // output: "hello world"

C++:
    str = "hello";
    str.append(" world");
    cout << str << endl;
    // output: "hello world"

Java:
    str = "hello";
    str += " world";
    System.out.println(str);
    // output: "hello world"
Basics of String Processing (07)
String Tokenizer: Splitting Str into Tokens

C:
    #include <string.h>
    for (char *p = strtok(str, " "); p; p = strtok(NULL, " "))
      printf("%s\n", p);

C++:
    #include <sstream>
    stringstream p(str);
    string token;
    // test the extraction itself; looping on !p.eof() can print an
    // empty or repeated last token when str has trailing spaces
    while (p >> token)
      cout << token << endl;

Java:
    import java.util.*;
    StringTokenizer st = new StringTokenizer(str, " ");
    while (st.hasMoreTokens())
      System.out.println(st.nextToken());
Basics of String Processing (08)
String Matching: Finding a Substr in a Str

C:
    char *p = strstr(str, substr);
    if (p)
      printf("%d\n", (int)(p - str)); // 0-based index of the first match

C++:
    size_t pos = str.find(substr); // size_t, so the npos comparison is exact
    if (pos != string::npos)
      cout << pos << endl;         // 0-based index of the first match

Java:
    int pos = str.indexOf(substr);
    if (pos != -1)
      System.out.println(pos);
Basics of String Processing (09)
Editing/Examining Characters of a String

Both C & C++:
    #include <ctype.h>
    for (int i = 0; str[i]; i++)
      str[i] = toupper(str[i]);
    // or tolower(ch)
    // isalpha(ch), isdigit(ch)

Java:
  Characters of a Java String can be accessed with str.charAt(i),
  but Java String is immutable (cannot be changed)
  You may have to create a new String or use Java StringBuffer
Basics of String Processing (10)
Sorting Characters of a String

Both C & C++:
    #include <algorithm>
    // if using C-style string
    sort(s, s + (int)strlen(s));
    // if using C++ string class
    sort(s.begin(), s.end());

Java:
  Java String is immutable (cannot be changed)
  You have to break the string with toCharArray() and then
  sort the character array
Basics of String Processing (11)
Sorting Array/Vector of Strings

Preferably C++:
    #include <algorithm>
    #include <string>
    #include <vector>
    vector<string> S;
    // assume that S has items
    sort(S.begin(), S.end());
    // S will be sorted now

Java:
    Vector<String> S = new Vector<String>();
    // assume that S has items
    Collections.sort(S);
    // S will be sorted now
AD HOC STRING PROBLEMS
Section 6.3
List of (simple) problems solvable with basic string processing skills
Just a splash and dash for this semester
(do a few programming exercises on your own)
Ad Hoc String Problems (1)
Cipher (Encode-Encrypt/Decode-Decrypt)
  Transform a string given a coding/decoding mechanism
  Usually, we need to follow the problem description
  Sometimes, we have to guess the pattern
  UVa 10878 - Decode the Tape
Frequency Counting
  Check how many times certain characters (or words) appear in the string
  Use an efficient data structure (or hashing technique)
  UVa 902 - Password Search
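The frequency counting idea can be sketched in a few lines. This is only an illustration: the helper name countLetters is hypothetical, and it assumes we only care about the letters a to z, ignoring case. A fixed table indexed by the character plays the role of the "hashing technique".

```cpp
#include <array>
#include <cctype>
#include <string>

// Hypothetical helper: count occurrences of each letter 'a'..'z', ignoring case.
// The 26-entry table is a direct-address table, i.e. a trivial perfect hash.
std::array<int, 26> countLetters(const std::string& s) {
    std::array<int, 26> freq{};  // value-initialized, so every count starts at 0
    for (unsigned char ch : s) {
        if (std::isalpha(ch)) freq[std::tolower(ch) - 'a']++;
    }
    return freq;
}
```

For counting whole words instead of letters, a std::map<std::string, int> keyed by the word would be the natural replacement for the table.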
Ad Hoc String Problems (2)
Input Parsing
  Given a grammar (in Backus-Naur Form or in another form),
  check if a given string is valid according to the grammar,
  and evaluate it if possible
  Use a recursive parser or the Java Pattern (RegEx) class
  UVa 622 - Grammar Evaluation
Output Formatting
  The problematic part of the problem is in formatting the
  output using a certain rule
  UVa 10894 - Save Hridoy
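A recursive parser for Input Parsing problems usually has one function per grammar rule. The sketch below uses a made-up grammar (it is not the grammar of UVa 622), and the names Parser and evaluate are hypothetical.

```cpp
#include <cctype>
#include <string>

// Assumed toy grammar, for illustration only:
//   expr   := term ('+' term)*
//   term   := factor ('*' factor)*
//   factor := digits | '(' expr ')'
struct Parser {
    const std::string& s;
    std::size_t pos;
    bool ok;
    explicit Parser(const std::string& str) : s(str), pos(0), ok(true) {}

    long long expr() {
        long long v = term();
        while (ok && pos < s.size() && s[pos] == '+') { pos++; v += term(); }
        return v;
    }
    long long term() {
        long long v = factor();
        while (ok && pos < s.size() && s[pos] == '*') { pos++; v *= factor(); }
        return v;
    }
    long long factor() {
        if (pos < s.size() && s[pos] == '(') {
            pos++;
            long long v = expr();
            if (pos < s.size() && s[pos] == ')') pos++; else ok = false;
            return v;
        }
        if (pos >= s.size() || !std::isdigit((unsigned char)s[pos])) {
            ok = false;  // a number or '(' was expected here
            return 0;
        }
        long long v = 0;
        while (pos < s.size() && std::isdigit((unsigned char)s[pos]))
            v = v * 10 + (s[pos++] - '0');
        return v;
    }
};

// Returns true and stores the value if s is a valid expression of the toy grammar.
bool evaluate(const std::string& s, long long& result) {
    Parser p(s);
    result = p.expr();
    return p.ok && p.pos == s.size();  // valid only if the whole input was consumed
}
```

Each function consumes exactly the characters of its own rule, so operator precedence follows from the nesting of the rules rather than from any explicit priority table.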
Ad Hoc String Problems (3)
String Comparison
  Given two strings, are they similar under some criteria?
  Case sensitive? Compare substring only? Modified criteria?
  UVa 11233 - Deli Deli
Others, not one of the above
  But still solvable with just basic string processing skills
Note:
  None of these are likely to appear in IOI other than as the
  bonus problem per contest day (no longer true in 2011)
  In ICPC, one of these can be the bonus problem
STRING MATCHING
Section 6.4
Knuth-Morris-Pratt's Algorithm
Skipped this semester (please use Suffix Array for (long) String Matching)
String Matching
Given a pattern string P,
can it be found in the longer string T?
Do not code the naive solution
Easiest solution: Use the string library
  C++: string.find
  C: strstr
  Java: String.indexOf
In the CP2.9 book: KMP algorithm
Or later/after this: Suffix Array
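As a small illustration of the library route, the sketch below collects every occurrence of P in T with std::string::find. The function name findAll is hypothetical; restarting the search at pos + 1 is a choice made here so that overlapping matches are reported too.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: 0-based start indices of all occurrences of P in T.
std::vector<int> findAll(const std::string& T, const std::string& P) {
    std::vector<int> positions;
    if (P.empty()) return positions;  // treat the empty pattern as "no match"
    for (std::size_t pos = T.find(P); pos != std::string::npos;
         pos = T.find(P, pos + 1)) {  // pos + 1 keeps overlapping matches
        positions.push_back((int)pos);
    }
    return positions;
}
```

For example, findAll("GATAGACA", "GA") yields the indices 0 and 4.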
SUFFIX TRIE, TREE, AND ARRAY
CP2.9 Section 6.6
The earlier form of this teaching material is credited to
A/P Sung Wing Kin, Ken from SoC, NUS
Suffix Trie (CAR, CAT, RAT)

All Suffixes:
1. CAR
2. AR
3. R
4. CAT
5. AT
6. T
7. RAT
8. AT
9. T

Sorted Unique Suffixes:
1. AR
2. AT
3. CAR
4. CAT
5. R
6. RAT
7. T

[Figure: Suffix Trie built from the sorted unique suffixes. Annotations in the figure: the root, a sorted edge label, a vertex whose path label is AR, and vertices marked as being in the dictionary.]
Suffix Trie (T = GATAGACA$)

i  Suffix
0  GATAGACA$
1  ATAGACA$
2  TAGACA$
3  AGACA$
4  GACA$
5  ACA$
6  CA$
7  A$
8  $

[Figure: Suffix Trie of T = GATAGACA$. Every suffix ends at a terminating vertex (reached via $) that is labeled with the suffix index i.]
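A Suffix Trie like the one pictured can be built by inserting every suffix of T into an ordinary trie. The sketch below is a minimal illustration with hypothetical names (TrieNode, buildSuffixTrie, isSubstring); it can create O(n^2) vertices, so it is only suitable for short strings.

```cpp
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    // std::map keeps the outgoing edge labels in sorted order
    std::map<char, std::unique_ptr<TrieNode>> child;
};

// Insert every suffix T[i..] into the trie, one character per edge.
std::unique_ptr<TrieNode> buildSuffixTrie(const std::string& T) {
    std::unique_ptr<TrieNode> root(new TrieNode());
    for (std::size_t i = 0; i < T.size(); i++) {
        TrieNode* cur = root.get();
        for (std::size_t j = i; j < T.size(); j++) {
            std::unique_ptr<TrieNode>& next = cur->child[T[j]];
            if (!next) next.reset(new TrieNode());
            cur = next.get();
        }
    }
    return root;
}

// P is a substring of T exactly when P is a path label starting at the root.
bool isSubstring(const TrieNode& root, const std::string& P) {
    const TrieNode* cur = &root;
    for (char c : P) {
        auto it = cur->child.find(c);
        if (it == cur->child.end()) return false;
        cur = it->second.get();
    }
    return true;
}
```

With T = "GATAGACA$", isSubstring reports true for "GA" and false for "Z".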
Suffix Tree (T = GATAGACA$)

i  Suffix
0  GATAGACA$
1  ATAGACA$
2  TAGACA$
3  AGACA$
4  GACA$
5  ACA$
6  CA$
7  A$
8  $

[Figure: Suffix Tree of T = GATAGACA$, obtained from the Suffix Trie by merging vertices with only 1 child. Annotations in the figure: the path label of one internal vertex is GA; TAGACA$ is an edge label; leaves are labeled with the suffix index i.]
APPLICATIONS OF SUFFIX TREE
What can we do with this specialized string data structure?
String Matching
To find all occurrences of P (of length m) in T (of length n):
  Search for the vertex x in the Suffix Tree which represents P
  All the leaves in the subtree rooted at x are the occurrences
Time: O(m + occ) where occ is the total no. of occurrences

T = GATAGACA$
i = 012345678

P = A   Occurrences: 7, 5, 3, 1
P = GA  Occurrences: 4, 0
P = T   Occurrences: 2
P = Z   Not Found

[Figure: Suffix Tree of T = GATAGACA$ with the matched subtrees highlighted.]
Longest Repeated Substring
To find the longest repeated substring in T:
  Find the deepest internal node
Time: O(n)

e.g. T = GATAGACA$
  The longest repeated substring is GA, with path label length = 2
  The other repeated substring is A, but its path label length = 1

[Figure: Suffix Tree of T = GATAGACA$ with two internal vertices marked, one with path label length = 1 and one with path label length = 2.]
Longest Common Substring
To find the longest common substring of two or more strings
  Note: In 1970, Donald Knuth conjectured that a linear time
  algorithm for this problem is impossible
  Now, we know that it can be solved in linear time
E.g. consider two strings T1 and T2:
  Build a generalized Suffix Tree for T1 and T2,
  i.e. a Suffix Tree that combines both the Suffix Tree of T1 and T2
  Mark internal vertices with leaves representing suffixes of both T1 and T2
  Report the deepest marked vertex
Example of LC Substring
T1 = GATAGACA$ (end vertices labeled with blue)
T2 = CATA# (end vertices labeled with red)
Their longest common substring is ATA with length 3

[Figure: generalized Suffix Tree of T1 and T2. The marked internal vertices are those representing suffixes from both strings; the deepest one has path label ATA.]
SUFFIX ARRAY
How to build Suffix Tree?
For programming contests, we use Suffix Array instead
Disadvantage of Suffix Tree
Suffix Tree is space inefficient
  It requires O(n |Σ| log n) bits
  n nodes, each node has |Σ| branches (Σ is the alphabet),
  each pointer needs O(log n) bits
Actual reason for programming contests
  It is harder to construct a Suffix Tree
Manber and Myers (SIAM J. Comp 1993) proposed
a new (in 1993) data structure, called the Suffix Array,
which has similar functionality to the Suffix Tree
  Moreover, it only requires O(n log n) bits
  And it is much easier to implement
Suffix Array (1)
Suffix Array (SA) is an array that stores:
  A permutation of the n indices of the sorted suffixes
  Each integer takes O(log n) bits, so SA takes O(n log n) bits
e.g. consider T = GATAGACA$

Before sorting:
i  Suffix
0  GATAGACA$
1  ATAGACA$
2  TAGACA$
3  AGACA$
4  GACA$
5  ACA$
6  CA$
7  A$
8  $

After sorting:
i  SA[i]  Suffix
0  8      $
1  7      A$
2  5      ACA$
3  3      AGACA$
4  1      ATAGACA$
5  6      CA$
6  4      GACA$
7  0      GATAGACA$
8  2      TAGACA$
Suffix Array (2)
Preorder traversal of the Suffix Tree visits
the terminating vertices in Suffix Array order
  An internal vertex in ST is a range in SA
  Each terminating vertex in ST is an individual index in SA = a suffix

i  SA[i]  Suffix
0  8      $
1  7      A$
2  5      ACA$
3  3      AGACA$
4  1      ATAGACA$
5  6      CA$
6  4      GACA$
7  0      GATAGACA$
8  2      TAGACA$

[Figure: Suffix Tree of T = GATAGACA$ shown next to the table; its leaves, read in preorder, give the SA column.]
Easy/Slow Suffix Array Construction

#include <algorithm>
#include <cstdio>
#include <cstring>
using namespace std;

#define MAX_N 100010 // assumed upper bound on the input length (not given on the slide)
char T[MAX_N]; int SA[MAX_N];

// compares two suffixes character by character: this is O(N)
bool cmp(int a, int b) { return strcmp(T + a, T + b) < 0; }

int main() {
  if (!fgets(T, MAX_N, stdin)) return 0; // fgets replaces the unsafe gets
  T[strcspn(T, "\n")] = '\0';            // drop the newline that fgets keeps
  int n = (int)strlen(T);
  for (int i = 0; i < n; i++) SA[i] = i;
  sort(SA, SA + n, cmp);                 // O(N log N) calls to cmp
}

What is the time complexity?
  Overall O(N^2 log N)
Can we do better?
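One common answer to "can we do better?" is prefix doubling: sort the suffixes by their first 1, 2, 4, 8, ... characters, reusing the ranks of the previous round as sort keys. The sketch below is one possible O(n log^2 n) version, not necessarily the implementation used in this course; the function name buildSuffixArray is hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Prefix-doubling suffix array construction, O(n log^2 n) with std::sort.
std::vector<int> buildSuffixArray(const std::string& T) {
    const int n = (int)T.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank[i] = (unsigned char)T[i];
    for (int k = 1;; k <<= 1) {
        // Compare suffixes by (rank of first k chars, rank of the next k chars).
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < n ? rank[a + k] : -1;  // -1: suffix ends early
            int rb = b + k < n ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        // Re-rank: suffixes that are equal on their first 2k chars share a rank.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: fully sorted
    }
    return sa;
}
```

Each round costs one O(n log n) sort with O(1) comparisons, and at most about log n rounds are needed, which is where the O(n log^2 n) bound comes from.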
APPLICATIONS OF SUFFIX ARRAY
Most (if not all) applications related to Suffix Tree
can be solved using Suffix Array
With some increase in time complexity
String Matching
Given a Suffix Array SA of the string T
Find occurrences of the pattern string P
Example
  T = GATAGACA$
  P = GA
Solution:
  Use Binary Search twice
  One to get the lower bound
  One to get the upper bound
String Matching Animation
Finding P = GA

i  SA[i]  Suffix
0  8      $
1  7      A$
2  5      ACA$
3  3      AGACA$
4  1      ATAGACA$
5  6      CA$
6  4      GACA$
7  0      GATAGACA$
8  2      TAGACA$

[Animation: two binary searches over this table, one finding the lower bound and one finding the upper bound. The suffixes with prefix GA are at i = 6 and i = 7, so the occurrences are SA[6] = 4 and SA[7] = 0.]
Time Analysis
Binary search runs at most O(log n) comparisons
Each comparison takes at most O(m) time
We run binary search twice
In the worst case, O(2m log n) = O(m log n)
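The two binary searches can be sketched with std::lower_bound and std::upper_bound, comparing only the first m characters of each suffix against P. The names naiveSuffixArray and matchRange are hypothetical, and the Suffix Array is built the slow way here just to keep the example self-contained.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Slow O(n^2 log n) construction, good enough for a small demonstration.
std::vector<int> naiveSuffixArray(const std::string& T) {
    std::vector<int> sa(T.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return T.compare(a, std::string::npos, T, b, std::string::npos) < 0;
    });
    return sa;
}

// Half-open range [lo, hi) of SA positions whose suffixes start with P.
// Each comparison looks at no more than m = |P| characters: O(m log n) total.
std::pair<int, int> matchRange(const std::string& T, const std::vector<int>& sa,
                               const std::string& P) {
    const int m = (int)P.size();
    auto lo = std::lower_bound(sa.begin(), sa.end(), P,
        [&](int suffix, const std::string& pat) {
            return T.compare(suffix, m, pat) < 0;  // suffix prefix < P
        });
    auto hi = std::upper_bound(sa.begin(), sa.end(), P,
        [&](const std::string& pat, int suffix) {
            return T.compare(suffix, m, pat) > 0;  // P < suffix prefix
        });
    return std::make_pair((int)(lo - sa.begin()), (int)(hi - sa.begin()));
}
```

For T = "GATAGACA$" and P = "GA" the range is [6, 8), so the occurrences are sa[6] = 4 and sa[7] = 0. An empty range (lo == hi) means P does not occur in T.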
Longest Repeated Substring
Simply find the highest entry in the LCP array
  O(n)
Recall: LCP = Longest Common Prefix between two successive suffixes

i  SA[i]  LCP[i]  Suffix
0  8      0       $
1  7      0       A$
2  5      1       ACA$
3  3      1       AGACA$
4  1      1       ATAGACA$
5  6      0       CA$
6  4      0       GACA$
7  0      2       GATAGACA$
8  2      0       TAGACA$
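The table above can be reproduced with a few lines of code. This sketch computes the LCP array by direct character comparison, which is O(n^2) in the worst case; the O(n) bound on the slide refers only to the final scan for the maximum once LCP is available. All function names are hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Slow O(n^2 log n) suffix array, enough for a small demonstration.
std::vector<int> naiveSuffixArray(const std::string& T) {
    std::vector<int> sa(T.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return T.compare(a, std::string::npos, T, b, std::string::npos) < 0;
    });
    return sa;
}

// lcp[i] = length of the common prefix of suffixes sa[i-1] and sa[i]; lcp[0] = 0.
std::vector<int> naiveLCP(const std::string& T, const std::vector<int>& sa) {
    const int n = (int)T.size();
    std::vector<int> lcp(sa.size(), 0);
    for (std::size_t i = 1; i < sa.size(); i++) {
        int a = sa[i - 1], b = sa[i], len = 0;
        while (a + len < n && b + len < n && T[a + len] == T[b + len]) len++;
        lcp[i] = len;
    }
    return lcp;
}

// The highest LCP entry marks the longest substring that occurs at least twice.
std::string longestRepeatedSubstring(const std::string& T) {
    std::vector<int> sa = naiveSuffixArray(T);
    std::vector<int> lcp = naiveLCP(T, sa);
    if (lcp.empty()) return "";
    std::size_t best = std::max_element(lcp.begin(), lcp.end()) - lcp.begin();
    return T.substr(sa[best], lcp[best]);
}
```

For T = "GATAGACA$" the highest entry is LCP[7] = 2 at SA[7] = 0, giving the substring GA.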
Longest Common Substring
T1 = GATAGACA$
T2 = CATA#
T = GATAGACA$CATA#
Find the highest number in the LCP array,
provided that it comes from two suffixes with different owners
  Owner: Does this suffix belong to string 1 or string 2?
O(n)

i   SA[i]  LCP[i]  Owner  Suffix
0   13     0       2      #
1   8      0       1      $CATA#
2   12     0       2      A#
3   7      1       1      A$CATA#
4   5      1       1      ACA$CATA#
5   3      1       1      AGACA$CATA#
6   10     1       2      ATA#
7   1      3       1      ATAGACA$CATA#
8   6      0       1      CA$CATA#
9   9      2       2      CATA#
10  4      0       1      GACA$CATA#
11  0      2       1      GATAGACA$CATA#
12  11     0       2      TA#
13  2      2       1      TAGACA$CATA#
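The owner check can be sketched as follows. The function name longestCommonSubstring is hypothetical, and the sketch makes two assumptions: the terminators '$' and '#' do not occur inside T1 or T2 (they are appended by the function itself), and the suffixes are sorted the slow way for brevity.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Longest common substring of T1 and T2 via the suffix array of T1 + '$' + T2 + '#'.
std::string longestCommonSubstring(const std::string& T1, const std::string& T2) {
    const std::string T = T1 + "$" + T2 + "#";
    const int n = (int)T.size(), n1 = (int)T1.size();
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {  // slow but simple
        return T.compare(a, std::string::npos, T, b, std::string::npos) < 0;
    });
    int bestLen = 0, bestPos = 0;
    for (int i = 1; i < n; i++) {
        int a = sa[i - 1], b = sa[i];
        // Suffixes starting at index <= n1 (up to and including '$') are owned by T1.
        if ((a <= n1) == (b <= n1)) continue;  // same owner: not a common substring
        int len = 0;  // LCP of the two adjacent suffixes
        while (a + len < n && b + len < n && T[a + len] == T[b + len]) len++;
        if (len > bestLen) { bestLen = len; bestPos = b; }
    }
    return T.substr(bestPos, bestLen);
}
```

Because '$' appears only once in T, a common prefix of a T1-owned suffix and a T2-owned suffix can never run across the boundary between the two strings. For T1 = "GATAGACA" and T2 = "CATA" the result is ATA.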
Summary
In this lecture, you have seen:
  Various string related tricks
  Focus on Suffix Tree and Suffix Array
But you need to practice using them!
  Especially, scrutinize my Suffix Array code
  Solve at least one UVa problem involving SA
We will have an SA contest next week
  2 SA problems in A/B/C
References
CP2.9, Chapter 6
Introduction to Algorithms, 2nd/3rd ed, Chapter 32
