5. String programming
Today's topics
String handling on modern programming languages
Algorithms for string handling
String
String = list of characters
e.g. "abcdefg"
One of the most basic data structure in modern programming languages
Programming languages and string handling
Easy string handling and pattern matching
Ready on all modern languages
1st class object on recent languages
"abc" + "def" ?
Not possible on C
Possible on Perl, Ruby, JavaScript, ...
StriNg Oriented symBOlic Language
A programming language for string handling
Various functions for handling strings
String manipulation
Pattern matching
Associative array
Recent trends
Basic feature for modern programming languages
Perl, Java, Ruby, Python, JavaScript, ...
Unix culture (awk, grep, sed, )
String operations and pattern matching are formulas
a = b + 20
s = "abc" + "def"
(This is not possible on C, Fortran, ...)
Basic string operations
String manipulation
Substring
Concatinate
Pattern matching
Text search
Text replace
Associative array
String handeling on modern languages (Ruby, Java, JS, etc.)
String concatination
"abc" + "def"
"abc" . "def"
Pattern matching
/pat*ern/ (Use regular expressions)
Associative array
a['abc']
Examples of string operations
s =~ /From:\s.*masui/
s = "<div>" + p + "</div>"
Some people do not like using +
'123' + 1 => '1231' ? 124 ?
String concatination on JavaScript
code:string.js
a = "abc";
b = a + "def";
alert(b);
Demo: String handling in Ruby
Demo: String handling in JavaScript
Comparison of string operations
http://gyazo.com/f912c0076f784b0ea142f71665749156.png
String programming in C
Programming with CPU/memory restrictions
strcat(), strcmp(), etc.
Very cumbersome
Overflow risks
Source of various attacks
String == Array
Special consideration required for memory management
Data structure of C string
a = "abc"
a b c 0
b = "def"
a b c 0 d e f 0
a = a + b
a b c 0 d e f 0 a b c d e f 0
First "abc" will never be reused
Arrays vs associative arrays
Array
a[10]
a[0], a[1], ... lined up on memory
Associative Array
a['abc']
Data structure is not obvious
Sometimes called as "hash", but not recommended
Associative array implemented with hash function
http://gyazo.com/5bf88c89ce08c15c51830a2bd5a8ea76.png
Trie-based associative array
http://gyazo.com/2605770240f7d6cd5ba63fb4d74632ac.png
Macro programming
String substitution by 'macro processor'
e.g. "#define" in C
m4: general-purpose macro processor
Described in "Software tools"
m4 example (1)
code:test.m4
% cat sample1.m4
define(name,Masui)
My name is name.
My `name' is name.
%
m4 example (2)
code:sample2.m4
define(a,100)
define(b,200)
a is ifelse(a,b,equal to,different from) b
m4 example (2)
code:sample2.txt
% cat sample2.m4
define(a,100)
define(b,200)
a is ifelse(a,b,equal to,different from) b
% m4 sample2.m4
100 is different from 200
%
m4 example(3)
code:sample3.m4
define(eq,$1 is ifelse($1,$2,equal to,different from) $2)
eq(100,100)
eq(100,200)
m4 example (3)
code:sample3.txt
% cat sample3.m4
define(eq,$1 is ifelse($1,$2,equal to,different from) $2)
eq(100,100)
eq(100,200)
% m4 sample3.m4
100 is different from 100
100 is different from 200
%
m4 example (4)
code:sample4.m4
define(eq2,$1 is `ifelse($1,$2,equal to,different from)' $2)
eq2(100,100)
eq2(100,200)
m4 example (4)
code:sample4.txt
% cat sample4.m4
define(eq2,$1 is `ifelse($1,$2,equal to,different from)' $2)
eq2(100,100)
eq2(100,200)
% m4 sample4.m4
100 is equal to 100
100 is different from 200
%
m4 example (5)
code:sample5.m4
% cat sample5.m4
define(times2,`ifelse($1,$2,$3,
`times2(incr($1),$2,$3)$3')')
define(times,`times2(1,$1,$2)')
times(4,`<br>')
%
m4 example (5)
code:sample5.m4
% cat sample5.m4
define(times2,`ifelse($1,$2,$3,
`times2(incr($1),$2,$3)$3')')
define(times,`times2(1,$1,$2)')
times(4,`<br>')
% sample5.m4
<br><br><br><br>
%
Computational power of m4
No variables and arrays
No pattern matching
Turing-complete
String search algorithms and data structure
Specialist on genetic analysis
Various information on pattern matching and data structure
"The world of fast text analysis"
http://www.amazon.co.jp/dp/4000069748 https://gyazo.com/a9a5851fab882bb99fa2395061735b0f.png
Burrows-Wheeler conversion, etc.
Various algorithms
Pattern matching algorithms
Strict matching
Regular expression
Approximate matching
Similarity calculation
Text search algorithms
Find P[1..m] in T[1..n]
Simple one-by-one comparison is slow
Various algorithms available
Knuth-Morris-Pratt (KMP) method
Boyer-Moore method
Shifter algorithn
Simple matching algorithm
Compare T[1..m] and P[1..m]
T: aaaaaaaaaaaaaaaaab
P: aaaaaab
Compare T[2..m] and P[1..m]
T: aaaaaaaaaaaaaaaaab
P: aaaaaab
...
Compare T[n-m+1..n]and P[1..m]
Compare $ n \times m times
Knuth-Morris-Pratt algorithm
If matching fails at P[7] (= b)
aaaaa is already known
Reset matching from P[6] (= a)
Use "failure function" for reculculation
Avoid unnecessary matching operations
http://gyazo.com/1d7b988f5207f5d70182f576cd28e525.png
Knuth-Morris-Pratt algorithm example
P = abcaba
http://gyazo.com/981d1b79dceedfa85cf5995ea44be3b4.png
Tranitionn from 5 to 1
State 5 knows that the last 4 characters are abca
=> Can go back to2
When 5 => 6 is not possible
Next character is not b
=> Can go back to 1
Boyer-Moore algorithm
Very fast text matching algorithm
# of character comparison < string length
All characters should be checked using KMP
Checking from the right
T[m] == c is known at the first comparison
Since there's no c in P, 2nd trial can be performed after T[m+1]
http://gyazo.com/1d7b988f5207f5d70182f576cd28e525.png
http://gyazo.com/c0ef58ec05df6c96c8591907363c7b2c.png
Boyer-Moore Algorithm (Cont'd)
Pre-calculate "skip tables" showing how many characters can be skipped
Use 2 tables
d1[c] -- skip value when character c doesn't match at the right end
d2[k] -- skip value when kth character from the right doesn't match
d1['a'] = 1, d2['c'] = d1['d'] = ... = 8 for the above example
Shifter algorithm
Matching status represented in m-bit binary value
When the next character is c, left-shift the value by 1 bit and calculate the OR with the T[c]
T[c] is pre-calculated
The k-th bit of T[c] is 0 if k-th character of P is c
e.g. Pattern abac → T['a'] = 0101
Shifter algorithm
Faster than BM in some cases
Used in approximate pattern matcher "agrep"
Regular expressions
RegExp patterns
Selection
Character sets
Repetition
Regular expression examples
abc
(abc|def)
[abc]
ab*c
Demo: grep
Regular expression examples
(時計|時間|時刻)を(0-9*)時に(セットする|設定する|あわせる) (delete|remove) a (file|data)
Generative grammar
Rules for generating a sentence
Terminal symbol (e.g. "mountain") and non-terminal symbol (e.g. noun)
noun <= "mountain"
noun phrase <= adjective + noun
Regular grammar
Grammar for regular expression
Simplest grammar category among Chomsky Hierarchy
A <= aB (A/B: nonterminal, a: terminal)
https://gyazo.com/378d3fe7e44ba4df3a182acc36d73c4a.png
American linguist, philosopher, cognitive scientist, logician,232425 political commentator, social justice activist, and anarcho-syndicalist advocate Regular expression and state machine
Convertable to state machine
e.g. Grammar and state machine for a(b|cd)
A ← aS
B ← bA
C ← cA
B ← dC
http://gyazo.com/ef47894d05a71d085fb31cdd9527910e.png
Algorithms for handling regular expressions
Aho-Corasick algorithm
grep algorithn
egrep algorithm
Aho-Corasick algorithm
Pattern matching (str1|str2|...)
Extension for KMP
Use complicated failure function
Used in "fgrep" command
example: ac|baa|bacd|ba|bb
http://gyazo.com/7926b6c03cc66c5f883d4aa2a4a14da7.png
grep algorithm
Used in "grep" command
Recursively call pattern matching functions for *, +, etc.
Cannot handle (abc|def)
Explanation in "Software Tools" by Brian Kernighan
Elegant but slow
egrep algorithm
Used in "egrep" command (=default)
Can handle any RegExp
Algorithm
Create an NDFA from RegExp
Convert NDFA to DFA
Create a state transition table
Conversion might be slow, but matching is fast
NDFA example
/(SUB)*SECTION/
Want to match SECTION, SUBSECTION, SUBSUBSECTION, ...
http://gyazo.com/2df5627603e51e8f04ed465eda1e9f4b.png
Conversion
http://gyazo.com/d9e5779fe46637b3fedf7783e0e2d1d3.png
Generated DFA
http://gyazo.com/c74f45af19228955d3a50cb796d6781b.png
Original NDFA
http://gyazo.com/2df5627603e51e8f04ed465eda1e9f4b.png
Limitations of regular expression
Number counting impossible
No state variable
No approximate pattern matchng
Approximate pattern matching
Allowing small errors
Missing character, extra character, different character
Difficult to express using RegExp
e.g. 1 character missing from abac => /bac|aac|abc|aba/
Approximate pattern matching algorithms
agrep
Extension of shifter algorithm
"Pitecan" search
Use approximate matching algorithm for dictionary search
Example state machine
http://gyazo.com/33d3ba9323939d1f9426a25d24184078.png
Example
http://gyazo.com/33a87ab418c6e283073219e7d021d387.png
Demo: Scrapbox search
Using RegExp for generating sentences
Strings that match /(sub)*section/
⇒ "section", "subsection", "subsubsection", ...
Strings can be generated automatically
Examples
(時計|時間|時刻)を(0|1|2|3|4|5|6|7|8|9|10|11|12)時に(セットする|設定する|あわせる)' (Set the time to (0|1|2...) o'clock)
時計を0時にセットする
時計を0時に設定する
時計を0時にあわせる
時計を1時にセットする
時計を1時に設定する
時計を1時にあわせる
時計を2時にセットする
時計を2時に設定する
...
re_expand.rb
code:expand.rb
require 'rubygems'
require 're_expand'
'(月|火|水|木|金)曜(1|2|3|4|5|6)限)'.expand { |s,a|
puts s
}
Results
code:result.txxt
月曜1限
月曜2限
月曜3限
月曜4限
月曜5限
月曜6限
火曜1限
火曜2限
火曜3限
火曜4限
...
Puzzle programming with string handling
code:GRR.rb
s = "-------=>------------=>-------"
while true do
s.gsub!(/=>\-/,'-=>')
s.gsub!(/=>$/,'<=')
s.gsub!(/\-<=/,'<=-')
s.gsub!(/^<=/,'=>')
s.gsub!(/=><=/,'<==>')
puts s
end
Result
code:result.txt
---=>-----<=------------------
----=>---<=-------------------
-----=>-<=--------------------
------<==>--------------------
-----<=--=>-------------------
----<=----=>------------------
---<=------=>-----------------
--<=--------=>----------------
-<=----------=>---------------
=>------------=>--------------
-=>------------=>-------------
--=>------------=>------------
---=>------------=>-----------
----=>------------=>----------
-----=>------------=>---------
------=>------------=>--------
-------=>------------=>-------
--------=>------------=>------
---------=>------------=>-----
----------=>------------=>----
-----------=>------------=>---
------------=>------------=>--
-------------=>------------=>-
--------------=>-----------<=-
---------------=>---------<=--
----------------=>-------<=---
-----------------=>-----<=----
------------------=>---<=-----
-------------------=>-<=------
--------------------<==>------
-------------------<=--=>-----
------------------<=----=>----
-----------------<=------=>---
----------------<=--------=>--
Hakoiri-musume
http://gyazo.com/3a7c99ef1f715532976f7e9489408483.png
Express the state of the board as a string
http://gyazo.com/ab3c088f1f7e81f8ab772a706168db43.png
s = "3223*4224*3553*4664*6016"
Repeat string substitutions
code:hakoiri.rb
s = "3223*4224*3553*4664*6016"
pats = [
]
strstate = {}
statestr = {}
states = 0
states += 1
prev = {}
i = 0
while i < states do
pats.each { |pair|
if pat.match(s) then
ss = s.dup
ss.sub!(pat,replace)
ss.sub!(/(01)(.*)(01)/, "0\\21") if /22.$/.match(ss) then
STDERR.puts states
end
exit
end
states += 1
STDERR.puts states if states % 1000 == 0
end
end
}
i += 1
end
Results
code:result.txt
======
父父父父
親親親親
小小番頭
娘娘小
娘娘小
======
父父父父
親親親親
小小番頭
娘娘 小
娘娘 小
======
父父父父
親親親親
小小番頭
娘娘小
娘娘 小
======
父父父父
親親親親
小小番頭
娘娘
娘娘小小
======