Slide: 1


Tree pattern matching and subset matching in deterministic O(nlog3n)

Emmanuel Filiot


Slide: 2




Plan

1.Definitions
2.Overview of the algorithm
3.Superimposed code
4.Minimum weight problem


Slide: 3




Definitions (1)

Tree pattern matching problem : Given two ordered trees, called the text and the pattern, find all occurences of the pattern in the text.
The pattern is said to occurs at a particular position if placing the pattern with root at that text position leads to a situation in which each pattern nod overlaps some text node.


Slide: 4


Definitions (2)




Slide: 5


Definitions (3)



Subset matching problem : Given an alphabet å=[0,...,k-1], a text string t[0...n-1] and a pattern string p[0...m-1] such that t[i] Í å, p[j] Í å, " i=0... n-1, " j=0... m-1, output a binary array T[0... n-m] such that T[i]=1 iff p[j]Í t[i+j] " j=0... m-1.


Lemma 1 : Tree pattern matching problem can be reduce to Subset matching problem in O(n).[CH97]


Slide: 6


Subset matching problem




Slide: 7


Overview of the algorithm (1)

Notations : s=åi |t[i]|+åj |p[j]| where |.| denotes the cardinality.
Hypothesis : Èi t[i]=Èj p[j], m£ n£ 2m, s£ 6m, i.e. s=O(n). 1



Slide: 8


Overview of the algorithm (2)

1. The maximum size of each set of t are minimized by shifting the elements of sets of both t and p in O(nlog3n)-time ® Minimum Weight Problem.
2. The sets of t and p are coded, in O(nlog2n)-time, into binary vectors of length O(log2n) which permit to see if a set is included in another one quickly.
® Superimposed Codes Problem.
3.The string matching problem is the solved by standard methods based on convolutions in O(nlog3n)-time.


Slide: 9


Superimposed Codes (1)

Data: S=S0,...,Sp Í U .
Problem: find equal length binary codes for the elements of each Si such that code(Si), defined as the 'or' of the codes of its elements, does not contain any code of elements not in Si, i.e. :
" eÎ U, " SiÎ S, eÏ Si implies that there is at least one 1 in code(e) such that the bit in code(Si) in the same position is a zero.



Slide: 10


Superimposed Codes : example

S1= a,b,c
S
2= a,b
S
3= b,c
code
(a)   =1  1  0
code(b)   =0  0  0
code(c)   =0  1  1
code(S1)=1  1  1
code(S2)=1  1  0
code(S3)=0  1  1



Slide: 11


Superimposed Codes (3)

It can be noticed that Si Í Sj iff code(Si)£ code(Sj), i.e. " p, code(Si)p£ code(Sj)p.

So the problem is to build short codes in a fast time. As the length of the codes is proportional to the set size, the goal of the second problem is to minimize the sizes of the sets.



Slide: 12


Minimum Weigth Problem(1)

Motivations : generate string t'' and p'' such that all sets in both t'' and p'' have small cardinality, and p'' matches t'' at position i iff p matches t at the same position.
Notation : T[a,i]=1 where a= a1,...,aq Í K and kÎ 0,1 iff T[j,i]=1 if j=al and 0 otherwise.

As the alphabet K is finite, it is possible to work with natural integers, instead of characters.



Slide: 13


Minimum Weigth Problem(2)

The binary matrix T[0...k-1,0...n-1] and P[0...k-1,0...m-1] are defined as follow :
T[a, i]=1 iff t[i]=a
P
[a, i]=1 iff p[i]=a

Condition 1 : p matches t at position i iff : " j=0...m-1, P[.,j]£ T[.,i+j]

Slide: 14


Minimum Weigth Problem(3) : Example

k=4, i.e. K= 0, ..., 3
n=5, m=3
t= 1, 2, 3 , 1 , 0, 3 , , 0, 2, 3
p= 2, 3 , 1 , 3

T= æ
ç
ç
ç
è
0 0 1 0 1
1 1 0 0 0
1 0 0 0 1
1 0 1 0 1
ö
÷
÷
÷
ø
, P= æ
ç
ç
ç
è
0 0 0
0 1 0
1 0 0
1 0 1
ö
÷
÷
÷
ø


p matches t at position 0 .


Slide: 15


Minimum Weigth Problem (4)

The weight of a column is the sum of its elements.
So the weights of both T and P correspond to the set sizes of both T and P.
Fact : Condition 1 remains invariant when the corresponding rows of both T and P are shifted by the same number of positions.
So it is possible to reduce the number of ones in each column of T, i.e. the set sizes.


Slide: 16


Minimum Weight Problem (5)

T = æ
ç
ç
ç
è
0 0 1 0 1
1 1 0 0 0
1 0 0 0 1
1 0 1 0 1
ö
÷
÷
÷
ø
, P = æ
ç
ç
ç
è
0 0 0
0 1 0
1 0 0
1 0 1
ö
÷
÷
÷
ø

T''= æ
ç
ç
ç
è
0 1 0 1 0
0 1 1 0 0
1 0 0 0 1
1 0 1 0 1
ö
÷
÷
÷
ø
, P''= æ
ç
ç
ç
è
0 0 0
0 0 1
1 0 0
1 0 1
ö
÷
÷
÷
ø
t''= 2, 3 , 0, 1 , 1,3 , 0 , 2, 3 p''= 2,3 , , 1,3


Slide: 17




Minimum Weight Problem : Definition



Definition : Given positive integers b,d and a binary matrix T[0...k-1,0...n-1] containing at most b ones, find (if possible) k positive integers l0,...,lk-1 Î 0,...,n-1 such that the matrix T'' obtained by shifting the ith row of T by li positions has the property that each of its columns contained at most d ones.

Theorem : Minimum Weight Problem with b=n, and d=O(n) can be solved in deterministic O(nlog3n)-time.


Slide: 18


Conclusion


Slide: 19


Bibliography


1
As pointed out in [CH97], that can be assumed without loss of generality (otherwise the problem can be linearly reduced to several subset matching subproblems satisfying these properties)

This document was translated from LATEX by HEVEA.