Pattern Recognition 33 (2000) 1263–1276
A unique-ID-based matrix strategy for efficient iconic indexing of symbolic pictures
Ye-In Chang*, Hsing-Yen Ann, Wei-Horng Yeh
Department of Applied Mathematics, National Sun Yat-Sen University, Kaohsiung, Taiwan, Republic of China
Received 15 May 1998; accepted 13 May 1999
Abstract

In this paper, we propose an efficient iconic indexing strategy called the unique-ID-based matrix (UID matrix) for symbolic pictures, in which each spatial relationship between any two objects is assigned a unique identifier (ID) and is recorded in a matrix. Basically, the proposed strategy can represent, in a matrix, the complex relationships that are represented in 2D C-strings, and an efficient range-checking operation can be used to support pictorial queries, spatial reasoning and similarity retrieval; therefore, these operations are as efficient as in the previous approaches. From our simulation, we show that the proposed UID matrix strategy requires shorter time to convert the input data into the corresponding representation than the 2D C-string strategy, and the same holds for query processing. Moreover, our proposed UID matrix strategy may require less storage than the 2D C-string strategy in some cases. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: 2D string; 2D C-string; Image databases; Pictorial query; Similarity retrieval; Spatial reasoning; Symbolic databases
1. Introduction

The design of image databases has attracted much attention over the past few years. Applications which use image databases include office automation, computer-aided design, robotics, and medical pictorial archiving. A common requirement of these systems is to model and access pictorial data with ease [1]. Thus, one of the most important problems in the design of image database systems is how the images are stored in the image database [2]. In traditional database systems, the use of indexing to support database access has been well established. Analogously, picture indexing techniques are needed to ease pictorial information retrieval from a pictorial database [3].
This research was supported in part by the National Science Council of the Republic of China under Grant No. NSC-87-2213-E-110-014.
* Corresponding author. Tel.: +886-7-5252000 (ext. 3819); fax: +886-7-5253809. E-mail address: [email protected] (Y.-I. Chang).
Over the last decade, many approaches to representing symbolic data have been proposed. Chang et al. [2] proposed a pictorial data structure, the 2D string, which uses symbolic projections to represent symbolic pictures while preserving the spatial relationships among objects [4]. The basic idea is to project the objects of a picture along the x- and y-axes to form two strings representing the relative positions of the objects along the x- and y-axes, respectively [5]. A picture query can also be specified as a 2D string. Based on 2D strings, several algorithms for pictorial querying, spatial reasoning and similarity retrieval have been proposed, where a pictorial query allows the users to query images with a specified spatial relationship, spatial reasoning means the inference of a consistent set of spatial relationships among the objects in an image, and the target of similarity retrieval is to retrieve the images that are similar to the query image [3,6,7]. However, the representation of 2D strings is not sufficient to describe pictures of arbitrary complexity completely. For this reason, Jungert [8,9] and Chang et al. [10] introduced more spatial operators to handle more types of spatial relationships among objects in image databases. Using these extended spatial operators, the 2D
G-string representation facilitates spatial reasoning about shapes and relative positions of objects. However, the 2D G-string representation strategy is not economical for complex images in terms of storage space and navigation complexity in spatial reasoning. Therefore, Lee and Hsu [4] proposed the 2D C-string representation strategy. Since the number of subparts generated by its new cutting mechanism is reduced significantly, the strings representing pictures are much shorter while still preserving the spatial relationships among objects. As described before, based on the 2D string representation, the picture query problem turns out to be 2D subsequence matching, which takes non-polynomial time. This makes the picture retrieval method inappropriate for implementation, especially when the number of objects in an image is large. Therefore, Chang et al. [11] proposed a new approach to iconic indexing using a nine direction lower-triangular (9DLT) matrix. In this strategy, a pictorial query can be processed using matrix minus operations; however, only nine spatial relationships between any two objects can be handled. In the previous approaches to representing pictorial data, as the complexity of the representation strategy increases, more spatial relationships can be represented, but this also results in a more complex query processing strategy and a limited set of query types that can be answered. Chang and Yang [12] proposed a prime-number-based matrix strategy, which combines the advantages of the 2D C-string and the 9DLT matrix. However, each spatial operator is represented by a product of prime numbers in their approach, which requires a large storage size.

In this paper, we propose an efficient iconic indexing strategy called the unique-ID-based matrix (UID matrix) for symbolic pictures, in which each spatial relationship between any two objects is assigned a unique number and is recorded in a matrix. The assignment of a unique identifier to each of the 13 spatial operators is designed in such a way that it can efficiently support pictorial queries, spatial reasoning and similarity retrieval by range checking; therefore, these operations are as efficient as in the previous approaches. Basically, the proposed strategy can represent, in a matrix, the complex spatial relationships that are represented in 2D C-strings, while it needs neither a cutting strategy nor complex procedures for spatial reasoning. Moreover, the proposed strategy can be considered an extended 9DLT matrix strategy in which more than nine spatial relationships can be represented. To illustrate that the proposed strategy can perform better than the 2D C-string strategy, we also conduct a simulation study. In this study, we consider the performance of two-step query processing. In the first step, we consider the time and the storage requirement for data representation, and in the second, we consider the time to process queries of similarity retrieval. From our simulation,
we show that the proposed UID matrix strategy requires shorter time to convert the input data into the corresponding representation than the 2D C-string strategy, and the same holds for query processing. Moreover, our proposed UID matrix strategy may require less storage than the 2D C-string strategy in some cases.

The rest of the paper is organized as follows. In Section 2, we give a brief description of the 2D C-string and the 9DLT matrix representations. In Section 3, we present the proposed efficient iconic indexing strategy for symbolic pictures. In Section 4, we compare the performance of the 2D C-string and our proposed UID matrix strategies by simulation. Finally, Section 5 gives a conclusion.
2. Background

In this section, we briefly describe two data structures for symbolic picture representation: the 2D C-string and the 9DLT matrix.

2.1. 2D C-string

Table 1 shows the formal definition of the set of spatial operators, where the notation begin(A) denotes the value of the begin-bound of object A and end(A) denotes the value of the end-bound of object A. According to the begin-bounds and end-bounds of the picture objects, the spatial relationships between two enclosing rectangles along the x-axis (or y-axis) can be categorized into 13 types, ignoring their lengths. Therefore, there are 169 types of spatial relationships between two rectangles in 2D space, as shown in Fig. 1.

Basically, a cutting in the 2D C-string is performed at the point of partial overlapping; it keeps the former object intact and partitions the latter object. The cutting mechanism is also suitable for pictures with many objects. Furthermore, the end-bound point of the dominating object does not partition other objects which contain the dominating object. Fewer cuttings and no unnecessary cuttings make the 2D C-string representation more efficient in the case of overlapping, as shown in Fig. 2. The corresponding 2D C-strings are as follows:

2D C-x-string(f): A ] B ] D | A | D | D % C,
2D C-y-string(f): D < B ] C ] A | A [ C.

To solve the problem of how to infer the spatial relations along the x-axis (or y-axis) between two pictorial objects from a given 2D C-string representation, the level and rank of a symbol are used [4]. That is, the spatial knowledge is embedded in the ranks of the pictorial objects. To identify the spatial relationships along the x-axis (or y-axis) between two symbols using their ranks, six complex computing rules are used [4]. Furthermore, to
Table 1 De"nitions of Lee's spatial operators (adapted from Ref. [13]) Notation
Condition
Meaning
A(B A"B
end(A)(begin(B) begin(A)"begin(B) end(A)"end(B) end(A)"begin(B) begin(A)(begin(B) end(A)'end(B) begin(A)"begin(B) end(A)'end(B) begin(A)(begin(B) end(A)"end(B) begin(A)(begin(B) (end(A)(end(B)
A disjoins B A is the same as B
A"B A%B A[B A]B A/B
A is the edge to edge with B A contains B and they do not have the same bound A contains B and they have the same begin bound A contains B and they have the same end bound A is partly overlapping with B
Fig. 2. The cutting mechanism of the 2D C-string: (a) cut along the x-axis; (b) cut along the y-axis (adapted from Ref. [13]).
these two objects' boundary subparts, there are four cases possible. For each case, up to two comparisons between the leftmost (or rightmost) bounding subparts of those two objects are needed to determine the spatial relationship [13]. 2.2. 9DLT matrix
Fig. 1. The 169 spatial relationship types of two objects (adapted from Ref. [13]).
answer the spatial relationship between two objects along the x-axis (or y-axis), which are segmented into subparts, we have to compare all subparts of the objects. In general, according to the spatial relationship between
Chang et al. [11] classify the spatial relationships into nine classes, according to the x- and y-axis relative spatial information embedded in the picture, and suggest a nine direction lower-triangular (9DLT) matrix to represent a symbolic picture. Nine direction codes (as shown in Fig. 3) are used to represent the relative spatial relationships among objects. In Fig. 3, R denotes the referenced object, 0 represents "at the same spatial location as R", 1 represents "north of R", 2 represents "northwest of R", 3 represents "west of R", and so on. For the symbolic picture shown in Fig. 4(a), Fig. 4(b) is the corresponding 9DLT matrix. Under this representation, the processing of a pictorial query becomes a matrix minus operation. However, only nine relationships can be represented.
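As an illustration of the 9DLT idea, the following is a minimal sketch (in Python; it is not from the paper) of building such a matrix from object centroids. The paper spells out only codes 0-3; the assignment of codes 4-8 below, the use of centroids, the row/column convention and all function names are assumptions made for illustration.

```python
# Hedged sketch: 9DLT direction codes from centroids.  Codes 0-3 follow the
# paper (same location, north, northwest, west); codes 4-8 are assumed to
# continue counterclockwise (southwest, south, southeast, east, northeast).
def direction_code(ref, obj):
    """Direction code of obj relative to the referenced object ref (both centroids)."""
    dx = obj[0] - ref[0]          # +x assumed to point east
    dy = obj[1] - ref[1]          # +y assumed to point north
    if dx == 0 and dy == 0:
        return 0
    sign = lambda v: (v > 0) - (v < 0)
    table = {(0, 1): 1, (-1, 1): 2, (-1, 0): 3, (-1, -1): 4,
             (0, -1): 5, (1, -1): 6, (1, 0): 7, (1, 1): 8}
    return table[(sign(dx), sign(dy))]

def build_9dlt(centroids):
    """Lower-triangular matrix D with D[i][j] = code of object i w.r.t. object j (i > j)."""
    names = list(centroids)
    n = len(names)
    D = [[0] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(i):
            D[i][j] = direction_code(centroids[names[j]], centroids[names[i]])
    return names, D

if __name__ == "__main__":
    names, D = build_9dlt({"A": (1, 5), "B": (4, 5), "C": (4, 1)})
    print(names, D)   # e.g. B relative to A gets code 7 (east)
```

With relationships stored as single codes in a matrix, comparing a query picture against a stored picture reduces to comparing matrix entries, which is why the paper describes 9DLT query processing as a matrix minus operation.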
Fig. 3. The direction codes.

Fig. 4. 9DLT representation: (a) a symbolic picture; (b) the related 9DLT matrix.

3. An efficient iconic indexing scheme for symbolic pictures

In general, the Lee and Hsu [13] algorithm for spatial reasoning based on 2D C-strings can be summarized in the following three steps: (1) following the rank rules recursively, the rank value of each symbol is calculated; (2) following the computing rules, the spatial relationships between two symbols are inferred; (3) to infer the spatial relationship between two partitioned objects, the boundaries of their subparts are compared. Consequently, answering a pictorial query based on the 2D C-string representation takes a number of steps. Therefore, in this section, we propose a new iconic indexing strategy which can solve spatial queries more easily and more efficiently. By rearranging the 169 spatial relationships of Fig. 1, we propose algorithms to efficiently support spatial reasoning, pictorial queries and similarity retrieval based on the unique-ID-based matrix (UID matrix) representation.

3.1. Spatial categories

We now assign each of the 13 spatial operators a unique identifier, denoted uid, as shown in Table 2. In this way, we can rearrange the 169 spatial relationships defined in the 2D C-string strategy [13] as shown in Table 3.

Table 2
Uids of the 13 spatial operators (* denotes the converse operator)

operator   <   <*  |   |*  /   /*  ]   [   %   =   ]*  [*  %*
uid        1   2   3   4   5   6   7   8   9   10  11  12  13
By carefully assigning a unique identifier to each operator, we can arrange these 169 spatial relationships into a table, denoted the category table, such that relationships of the same category are grouped together, as shown in Table 4. To make the category rules clearer, we transform them into more formal descriptions using the corresponding uid values of the relationship; in this way, the processing of category classification becomes a range-checking operation. Suppose the spatial relationship between objects A and B is (A r_x B, A r_y B) and let the corresponding uid values be (uid_x, uid_y). Then the spatial category of A and B is determined as follows.

1. Disjoin: (1 ≤ uid_x ≤ 2) or (1 ≤ uid_y ≤ 2).
2. Join: ((3 ≤ uid_x ≤ 4) and (3 ≤ uid_y ≤ 13)), or ((5 ≤ uid_x ≤ 13) and (3 ≤ uid_y ≤ 4)).
3. Contain: (7 ≤ uid_x ≤ 10) and (7 ≤ uid_y ≤ 10).
4. Belong: (10 ≤ uid_x ≤ 13) and (10 ≤ uid_y ≤ 13).
5. Part-overlap: (a) (5 ≤ uid_x ≤ 6) and (5 ≤ uid_y ≤ 13), or (b) (7 ≤ uid_x ≤ 13) and (5 ≤ uid_y ≤ 6), or (c) (7 ≤ uid_x ≤ 9) and (11 ≤ uid_y ≤ 13), or (d) (11 ≤ uid_x ≤ 13) and (7 ≤ uid_y ≤ 9).

Given two uids (uid_x, uid_y), a category can be determined efficiently using the algorithm shown in Fig. 5; the corresponding decision tree is shown in Fig. 6.

3.2. Data structure for pictorial symbol representation: the unique-ID-based matrix (UID matrix)

In the 9DLT matrix representation, the spatial relationship between each pair of objects is obvious; thus, spatial reasoning and pictorial queries can be handled efficiently. However, a 9DLT matrix, which condenses the spatial relationships between two objects into nine types, is clearly insufficient. Conversely, the 2D C-string represents a picture more precisely, since it distinguishes 169 types of spatial relationships; however, spatial reasoning based on the 2D C-string representation is not so straightforward. Therefore, we propose the UID matrix strategy, which preserves the spatial information of the 2D C-string representation in an extended 9DLT matrix, so that spatial relationships can be answered directly and pictorial queries and similarity retrieval are supported easily.
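Before turning to the matrix structure itself, the following minimal sketch (Python; it is not the paper's Fig. 5 pseudocode) shows how the five range-checking rules of Section 3.1 classify a uid pair; the function name and the single-letter return codes are ours, and the handling of the ambiguous pair (10, 10) is noted in the docstring.

```python
def category(uid_x, uid_y):
    """Spatial category from the x- and y-axis operator uids (each in 1..13).

    Returns 'D' (disjoin), 'J' (join), 'C' (contain), 'B' (belong) or
    'P' (part-overlap) by pure range checking, following Section 3.1.
    The pair (10, 10), i.e. the "same" relationship, satisfies both the
    contain and the belong rule (Table 4 lists it as "B,C"); this sketch
    reports it as 'B'.
    """
    if uid_x <= 2 or uid_y <= 2:          # rule 1: disjoin
        return "D"
    if uid_x <= 4 or uid_y <= 4:          # rule 2: join
        return "J"
    if uid_x <= 6 or uid_y <= 6:          # rule 5(a)/(b): part-overlap
        return "P"
    # both uids are now in 7..13
    if uid_x >= 10 and uid_y >= 10:       # rule 4: belong
        return "B"
    if uid_x <= 10 and uid_y <= 10:       # rule 3: contain
        return "C"
    return "P"                            # rule 5(c)/(d): part-overlap

if __name__ == "__main__":
    # Regenerates the category table (Table 4) row by row from the rules.
    for x in range(1, 14):
        print(" ".join(category(x, y) for y in range(1, 14)))
```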
Table 3
The 169 spatial relationships

Table 3 is a 13 × 13 table whose rows are indexed by the x-axis operator r_x (uid_x = 1, ..., 13) and whose columns are indexed by the y-axis operator r_y (uid_y = 1, ..., 13), both listed in the order <, <*, |, |*, /, /*, ], [, %, =, ]*, [*, %*. The entry in row i and column j is simply the operator pair (r_x, r_y) with uid_x = i and uid_y = j, so each of the 169 combinations appears exactly once.

Table 4
The category table

uid_x \ uid_y   1  2  3  4  5  6  7  8  9  10   11 12 13
1               D  D  D  D  D  D  D  D  D  D    D  D  D
2               D  D  D  D  D  D  D  D  D  D    D  D  D
3               D  D  J  J  J  J  J  J  J  J    J  J  J
4               D  D  J  J  J  J  J  J  J  J    J  J  J
5               D  D  J  J  P  P  P  P  P  P    P  P  P
6               D  D  J  J  P  P  P  P  P  P    P  P  P
7               D  D  J  J  P  P  C  C  C  C    P  P  P
8               D  D  J  J  P  P  C  C  C  C    P  P  P
9               D  D  J  J  P  P  C  C  C  C    P  P  P
10              D  D  J  J  P  P  C  C  C  B,C  B  B  B
11              D  D  J  J  P  P  P  P  P  B    B  B  B
12              D  D  J  J  P  P  P  P  P  B    B  B  B
13              D  D  J  J  P  P  P  P  P  B    B  B  B

(D = disjoin, J = join, C = contain, B = belong, P = part-overlap.)
Suppose a picture f contains m objects and let V = {v_1, v_2, ..., v_m}. Let A be the set of the 13 spatial operators {<, <*, |, |*, /, /*, ], [, %, =, ]*, [*, %*}. An m × m spatial matrix S of picture f is defined as follows:

         v_1        v_2        ...   v_m
  v_1  [ 0          r^y_{12}   ...   r^y_{1m} ]
  v_2  [ r^x_{12}   0          ...   r^y_{2m} ]
  ...                          ...
  v_m  [ r^x_{1m}   r^x_{2m}   ...   0        ]

That is, S[v_i, v_j] = r^x_{ji} if i > j, S[v_i, v_j] = r^y_{ij} if i < j, and S[v_i, v_j] = 0 if i = j, for all v_i, v_j ∈ V and r^x_{ji}, r^y_{ij} ∈ A, 1 ≤ i, j ≤ m. The lower triangular part of the matrix stores the spatial information along the x-axis and the upper triangular part stores the spatial information along the y-axis, where r^x_{ji} is the spatial operator between objects v_j and v_i along the x-axis and r^y_{ij} is the spatial operator between objects v_i and v_j along the y-axis. Note that in this representation we always record the relationship between two objects v_i and v_j from the viewpoint of the object with the smaller index, either along the x-axis or along the y-axis; hence S[v_i, v_j] = r^x_{ji} when i > j.
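As a concrete illustration of how such a matrix can be filled, here is a hedged sketch (Python, not from the paper) that derives the uid of each pair directly from bounding boxes given as (x-begin, x-end, y-begin, y-end). The helper names are ours; the uid numbering follows Table 2, and the pairing of each operator with its converse (< with <*, | with |*, / with /*, ] with ]*, [ with [*, % with %*) is inferred from that table's ordering.

```python
def interval_uid(b1, e1, b2, e2):
    """uid (1..13) of interval 1 relative to interval 2 along one axis."""
    def base(b1, e1, b2, e2):
        # The seven base operators, from the conditions of Table 1.
        if e1 < b2:                  return 1    # <
        if b1 == b2 and e1 == e2:    return 10   # =
        if e1 == b2:                 return 3    # |
        if b1 < b2 and e1 > e2:      return 9    # %
        if b1 == b2 and e1 > e2:     return 8    # [
        if b1 < b2 and e1 == e2:     return 7    # ]
        if b1 < b2 < e1 < e2:        return 5    # /
        return None                              # relation holds in the converse direction
    uid = base(b1, e1, b2, e2)
    if uid is not None:
        return uid
    converse = {1: 2, 3: 4, 5: 6, 7: 11, 8: 12, 9: 13}   # op -> op* (inferred pairing)
    return converse[base(b2, e2, b1, e1)]

def build_uid_matrix(boxes):
    """boxes: {name: (x_begin, x_end, y_begin, y_end)}.

    Lower triangle holds the x-axis uids, upper triangle the y-axis uids,
    always from the viewpoint of the object with the smaller index
    (the convention of Section 3.2)."""
    names = list(boxes)
    n = len(names)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            a, b = min(i, j), max(i, j)            # a is the viewpoint object
            A, B = boxes[names[a]], boxes[names[b]]
            if i > j:   # lower triangle: x-axis relationship of v_a w.r.t. v_b
                T[i][j] = interval_uid(A[0], A[1], B[0], B[1])
            else:       # upper triangle: y-axis relationship of v_a w.r.t. v_b
                T[i][j] = interval_uid(A[2], A[3], B[2], B[3])
    return names, T
```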
Fig. 5. The CATEGORY function.

Fig. 6. Decision tree of the CATEGORY function.
Fig. 7. An image and its symbolic representation.
For the picture shown in Fig. 7, the corresponding spatial matrix S is as follows:

S       A     B     C     D     E
A       0     /*    <*    <*    <*
B       <*    0     <*    <*    <*
C       <*    /     0     /*    /
D       /*    <     <     0     |
E       <*    <     <     %     0
According to the assignment of the uid values to the 13 spatial operators shown in Table 3, we can transform the spatial matrix S of f into a UID matrix T by replacing each spatial operator r_x (r_y) with its corresponding unique identifier uid_x (uid_y), as follows:

T       A     B     C     D     E
A       0     6     2     2     2
B       2     0     2     2     2
C       2     5     0     6     5
D       6     1     1     0     3
E       2     1     1     9     0
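Reading a relationship back out of T is then a pair of look-ups plus the CATEGORY test; a small usage-style sketch, assuming the category() helper from the earlier sketch (names and layout are ours):

```python
# The x-axis uid sits in the lower triangle and the y-axis uid in the upper
# triangle, always from the viewpoint of the object with the smaller index.
OPS = ["<", "<*", "|", "|*", "/", "/*", "]", "[", "%", "=", "]*", "[*", "%*"]

def relationship(T, names, a, b):
    """(r_x, r_y, category) of object a w.r.t. object b, for index(a) < index(b)."""
    i, j = names.index(a), names.index(b)
    assert i < j, "query from the viewpoint of the object with the smaller index"
    uid_x, uid_y = T[j][i], T[i][j]
    return OPS[uid_x - 1], OPS[uid_y - 1], category(uid_x, uid_y)

# For the Fig. 7 matrix above, relationship(T, ["A","B","C","D","E"], "D", "E")
# returns ("%", "|", "J"): D contains E along x and is edge to edge with E along y.
```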
3.3. Spatial reasoning based on the UID matrix representation

Spatial reasoning means the inference of a consistent set of spatial relationships among the objects in an image. Based on the UID matrix, it is easy to retrieve the spatial relationship of each pair of objects along the x- and y-axes directly, since this information is recorded in the matrix. Moreover, the category of each pair of objects can be inferred using the CATEGORY function shown in Fig. 5.

3.4. Pictorial query based on the UID matrix representation

A pictorial query allows the users to query images with a specified spatial relationship, for example, "display all images with a lake east of a mountain". In this section, we describe pictorial query processing based on the UID matrix representation according to the query types classified by Lee and Hsu [13]. First, the following basic orthogonal directional aggregates are the main body of a UID matrix query:

1. x: (east, A, B) iff r_x ∈ {<*, |*, [, %, /*} (i.e., iff uid_x ∈ {2, 4, 6, 8, 9}).
2. y: (north, A, B) iff r_y ∈ {<*, |*, [, %, /*} (i.e., iff uid_y ∈ {2, 4, 6, 8, 9}).
3. x: (west, A, B) iff r_x ∈ {<, |, ], %, /} (i.e., iff uid_x ∈ {1, 3, 5, 7, 9}).
4. y: (south, A, B) iff r_y ∈ {<, |, ], %, /} (i.e., iff uid_y ∈ {1, 3, 5, 7, 9}).

(Note that the correctness of these direction relationships can be verified from Table 5, where object A is denoted by the white box and object B by the gray box. Moreover, the east and west directions are determined by the relative positions of the right and left boundary lines, respectively,
while the north and south directions are determined by the relative positions of the upper and lower boundary lines, respectively, following the Lee and Hsu definition.)

Table 5
The 169 spatial relationship types in a new form.

Therefore, the primitive direction relationship problem becomes a membership check on a set of numbers. The pictorial queries based on a UID matrix can then be processed as follows.

(A) Orthogonal direction object queries. For this class of queries, we still have to follow the same 15 spatial rules described by Lee and Hsu [13]. For example, to determine which object is in the
east-north of object X, the following rule is used: (ne, ?, X) = {A | uid_y(A, X) ∈ {2, 4, 6, 8, 9} and uid_x(A, X) ∈ {2, 4, 6, 8, 9}}.

(B) Category relationship object queries. This class of queries allows the users to retrieve the objects with a specified category and object, for example, "find those objects which are disjoin with object A". Based on the UID matrix representation, this class of queries can be easily answered by applying the CATEGORY function shown in Fig. 5, where the constant B is replaced with a variable X to denote the unknown object.

(C) Auxiliary relationship object queries. This class of queries allows the users to retrieve the objects with a specified auxiliary relationship and an object, where the auxiliary relationships contain same, surround and part-surround, and the two symmetric-inverse relationships surrounded and part-surrounded. We can make use of the uids of the spatial operators to process this class of queries.
(a) A is the same as B if A is at the same location as B along the western, eastern, southern and northern directions; that is, A is the same as B iff both r_x and r_y are "=" (i.e., uid_x = 10 and uid_y = 10).
(b) A surrounds B if A contains B and A completely surrounds B along the four orthogonal directions; that is, A surrounds B iff both r_x and r_y are "%" (i.e., uid_x = 9 and uid_y = 9).
(c) A part-surrounds B if A contains B and A surrounds B along two or three orthogonal directions; that is, A part-surrounds B iff A contains B (i.e., 7 ≤ uid_x ≤ 10 and 7 ≤ uid_y ≤ 10), A is not the same as B (i.e., uid_x ≠ 10 or uid_y ≠ 10) and A does not surround B (i.e., uid_x ≠ 9 or uid_y ≠ 9).

(D) Icon relationship object queries. This type of query allows the users to retrieve all objects with a specified icon of Fig. 1. For example, to retrieve the objects whose spatial relationships with object A along the x- and y-axes are "/" and "|", respectively, the set of objects S = {B | uid_x = 5 and uid_y = 3} is retrieved.

(E) Icon relationship queries. This type of query allows the users to retrieve the spatial relationship icon of Fig. 1 for two specified objects, for example, "find the spatial relationship icon of objects A and B". It is clear that this task is similar to query class (D) above.

(F) Category relationship queries. This type of query answers the spatial category of two given objects. To find the spatial category of objects A and B, the category rules described in Fig. 5 are used; for example, A contains B iff (7 ≤ uid_x ≤ 10) and (7 ≤ uid_y ≤ 10).

(G) Orthogonal direction queries. Finally, the orthogonal spatial relationship between objects A and B can be
examined, using the four orthogonal directional aggregates described above.

3.5. Similarity retrieval based on the UID matrix representation

The target of similarity retrieval is to retrieve the images that are similar to the query image. In this section, we describe the similarity types and the corresponding similarity retrieval algorithm based on our UID matrix strategy. Basically, we show how the similarity types defined for the 2D C-string representation [4] can be determined based on our UID matrix representation. The definitions of the similarity types [13] are as follows.

Definition 1. Picture f' is a type-i unit picture of f if (1) f' is a picture containing the two objects A and B, represented as x: A r'_x B, y: A r'_y B, (2) A and B are also contained in f, and (3) the relationships between A and B in f are represented as x: A r_x B and y: A r_y B; then

(type-0): Category(r_x, r_y) = Category(r'_x, r'_y);
(type-1): (type-0) and (r_x = r'_x or r_y = r'_y);
(type-2): r_x = r'_x and r_y = r'_y;

where Category(r_x, r_y) denotes the relationship category of the spatial relationship as shown in Table 1.

Definition 2. Given an m × m matrix M1, a matrix operator R is defined as follows: let M = (M1)^R, where M(i, j) = M1(i, j) · M1(j, i), for all 1 ≤ i ≤ m, 1 ≤ j < i.

Definition 3. Given two m × m matrices M1 and M2, a matrix operator ⊖ is defined as follows: let M = M1 ⊖ M2, where M(i, j) = 0 if M1(i, j) = M2(i, j), and M(i, j) = 1 if M1(i, j) ≠ M2(i, j).
Definition 4. Given an m × m UID matrix T, the corresponding category matrix C is defined as follows:

C[i, j] = 1 if CATEGORY(T[i, j], T[j, i]) = "D", where 1 ≤ i < j ≤ m,
C[i, j] = 2 if CATEGORY(T[i, j], T[j, i]) = "J", where 1 ≤ i < j ≤ m,
C[i, j] = 3 if CATEGORY(T[i, j], T[j, i]) = "C", where 1 ≤ i < j ≤ m,
C[i, j] = 4 if CATEGORY(T[i, j], T[j, i]) = "B", where 1 ≤ i < j ≤ m,
C[i, j]"5 if CA¹EGOR>(¹[i, j],¹[j, i])"&&P'' where 1)i(j)m. That is, C[i, j]"1, 2, 3, 4, 5 if the relationship between objects v and v is of the disjoin, join, contain, belong G H and part}overlap category, respectively, by recalling the CATEGORY function. Based on these two new matrix operators, R and , the following three algorithms, type-0, type-1, type-2 are used to determine whether two pictures are of type-0, type-1, type-2 similarity, respectively, given two UID matrices ¹ and ¹ . Algorithm (type-0). (1) Following the CATEGORY algorithm, xnd the category matrix C and C corresponding to the two matrices ¹ and ¹ , respectively. (2) C"C C . If C is zero in the lower triangular matrix, these two pictures are of type-0 similarity; otherwise, there is no match. Algorithm (type-1). (1) Algorithm (type-0) passed. (2) ¹"¹ ¹ . (3) ¹H"(¹)0. If ¹H is zero in the lower triangular matrix, these two pictures are of type-1 similarity; otherwise, there is no match.
Algorithm (type-2).
(1) T = T1 ⊖ T2. If T is a zero matrix, these two pictures are of type-2 similarity; otherwise, there is no match.

Now, we use one example to show how these algorithms work. Consider the two pictures f1 and f2 shown in Fig. 8.

(Step 1) Find the spatial matrices S1 and S2 and the UID matrices T1 and T2 representing the two pictures f1 and f2, respectively. [The four 4 × 4 matrices S1, T1, S2 and T2 over the objects A, B, C and D are displayed here; in each, the lower triangular part holds the x-axis operators (or their uids) and the upper triangular part the y-axis operators (or their uids).]
Fig. 8. One example of pictures f1 and f2.

(Step 2) Following the category rules, compute the corresponding category matrices C1 and C2, where 1, 2, 3, 4 and 5 mean the disjoin, join, contain, belong and part-overlap relationship, respectively. [The matrices C1, C2, C = C1 ⊖ C2, T = T1 ⊖ T2 and T* = (T)^R computed in Steps 2-5 are displayed here.]

(Step 3) Check type-0 similarity. Since C = C1 ⊖ C2 is zero in the lower triangular part, these two pictures are of type-0 similarity.

(Step 4) Check type-1 similarity. Since T* = (T1 ⊖ T2)^R is zero in the lower triangular part, these two pictures are of type-1 similarity.

(Step 5) Check type-2 similarity. Since T = T1 ⊖ T2 is not a zero matrix, these two pictures are not of type-2 similarity.
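The example above can be reproduced mechanically. The following compact sketch (Python; not the paper's pseudocode) implements the two matrix operators and the three similarity tests on UID matrices given as nested lists; it reuses the category() helper from the earlier sketch, and all function names are ours. Since the category matrix of Definition 4 is defined for i < j, the type-0 comparison below is applied to the entries where it is defined.

```python
def category_matrix(T):
    """Definition 4: C[i][j] in {1..5} for i < j, from the CATEGORY of the uid pair."""
    code = {"D": 1, "J": 2, "C": 3, "B": 4, "P": 5}
    m = len(T)
    C = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            C[i][j] = code[category(T[j][i], T[i][j])]   # x uid in lower, y uid in upper triangle
    return C

def diff(M1, M2):
    """The element-wise comparison operator: 0 where entries agree, 1 where they differ."""
    m = len(M1)
    return [[0 if M1[i][j] == M2[i][j] else 1 for j in range(m)] for i in range(m)]

def collapse(M):
    """The R operator of Definition 2: result(i, j) = M(i, j) * M(j, i) for j < i."""
    m = len(M)
    return [[M[i][j] * M[j][i] if j < i else 0 for j in range(m)] for i in range(m)]

def upper_is_zero(M):
    return all(M[i][j] == 0 for i in range(len(M)) for j in range(i + 1, len(M)))

def lower_is_zero(M):
    return all(M[i][j] == 0 for i in range(len(M)) for j in range(i))

def similarity_type(T1, T2):
    """Highest similarity type (0, 1 or 2) of two pictures, or None if no match."""
    if not upper_is_zero(diff(category_matrix(T1), category_matrix(T2))):
        return None                                   # fails the type-0 test
    T = diff(T1, T2)
    if all(v == 0 for row in T for v in row):
        return 2                                      # all pairwise relationships identical
    if lower_is_zero(collapse(T)):
        return 1                                      # every pair agrees on at least one axis
    return 0
```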
3.6. A comparison

In this subsection, we compare our proposed strategy with the previously proposed representation strategies, as summarized in Table 6.

Table 6
A comparison

Item                          2D C-string                             9DLT matrix                    UID matrix
Representation                String                                  Matrix                         Matrix
Contents of representation    Relationships along the x- and y-axes   Relationship in 2D space       Relationships along the x- and y-axes
With cutting?                 Yes                                     No                             No
Relationship types            169                                     9                              169
Spatial reasoning             Difficult                               Easy                           Easy
Pictorial query               By rank rules and computing rules       By matrix minus operations     By range checking operations
Similarity types              3                                       Not proposed                   3

The first item considers the data structure used to represent the spatial relationships. The 9DLT matrix representation and the UID matrix representation record the
spatial relationships in matrices. In this way, efficient matrix operations can be applied, instead of the complex string comparison strategies used with string representations. The second item considers the contents of the representation. Only the 9DLT matrix representation records the relationship in 2D space, instead of the relationships along the x- and y-axes recorded in the other strategies. Recording the spatial relationship in 2D space makes pictorial queries easy; however, the 9DLT matrix strategy records only nine spatial relationships. The third item concerns the application of a cutting mechanism. Cutting mechanisms are applied in the 2D C-string representation strategy to handle overlapping objects. Since cutting splits an object into more than one component, more symbols are used, which results in a larger storage requirement and more complex query processing strategies. Although the 9DLT matrix representation does not need cutting mechanisms, fewer spatial relationships can be recorded. Our proposed strategy does not need the cutting mechanisms either, yet it can still handle the cases of overlapping objects. The fourth item shows the number of spatial relationships in each of the strategies. Only nine spatial relationships can be recorded in the 9DLT matrix representation. Although the 2D C-string representation strategy can record 169 spatial relationships, it needs cutting mechanisms. Our proposed strategy is the only one that can record 169 spatial relationships without cutting. The fifth item concerns the difficulty of spatial reasoning, where spatial reasoning means the inference of a consistent set of spatial relationships among the objects in an image. In general, if a cutting mechanism is used in the representation, spatial reasoning is hard. The sixth item considers the way a pictorial query is processed, where a pictorial query allows the users to query images with a specified spatial relationship. The 2D C-string representation strategy processes pictorial queries using complex rank rules and computing rules. The 9DLT matrix representation strategy processes pictorial queries using matrix minus operations. Our proposed strategy processes pictorial queries using range checking operations, which is more efficient than the strategies used in the other representations. The last item shows the number of similarity types that a representation strategy can distinguish. Basically, our proposed strategy can distinguish the same three similarity types defined in the 2D C-string representation strategy.
4. Simulation study
In the "rst step, we consider the time and the storage requirement in preprocessing the input data; that is, to convert each spatial object of the input picture represented in the form of (object, x-axis-begin, x-axis-end, y-axisbegin, y-axis-end) into the corresponding 2D C-string (or UID matrix) representation. In the second step, we consider the time to process queries of type-i similarity, 0)i)2. (Note that to answer the query of type-2 similarity, we have to do spatial reasoning for each pair of objects in a picture; therefore, the comparison of processing time of type-2 similarity denotes the comparison of processing spatial reasoning for a given pair of objects.) To simplify our simulation, we let the maximum number of di!erent objects appearing in the database be 20. Each object can appear in a picture with 100,000 * 100,000 points. For the performance of the "rst step, we consider 2000 input pictures which are created randomly with a uniform distribution, and compute the average cost for those 2000 input pictures. We consider cases of 5, 10, 15, and 20 di!erent objects appearing in each picture, respectively. Table 7 shows the simulation results of the "rst step. From this table, we observe that the proposed UID matrix strategy requires shorter time to preprocess the input data than the 2D C-string strategy, the reason being that the 2D C-string strategy [4] requires to sort the input data "rst in terms of the values of x and y-axis. Next, the 2D C-string strategy has to check whether a cutting operation is required by detecting the partly overlapping case. When the latter occurs, the 2D C-string strategy has to decide the dominating object. A dominating object has the smallest begin bound among those partly overlapping objects in x-axis (or y-axis). For
Table 7 A comparison of preprocessing step
Time (ms) Storage (no. of characters)
2D C-string 0.411 37.62
UID 0.055 25
2D C-string 1.222 108.65
UID 0.121 100
2D C-string 2.449 204.07
UID 0.186 225
2D C-string 4.216 323.42
UID 0.217 400
(a)
Time (ms) Storage (no. of characters) (b)
Time (ms) Storage (no. of characters) (c)
In our simulation study, we consider the performance of two-step query processing for the 2D C-string and the proposed UID matrix strategies. (Note that here, we do not do the simulation study for the 9DLT matrix strategy, since it can only represent nine spatial relationships.)
Time (ms) Storage (no. of characters) (d)
example, in Fig. 2(b), B is a dominating object. (Note that when many objects satisfy the above condition, the one with the smallest end bound is chosen as the dominating object.) After the cutting operations are all finished, the input data (after cutting) have to be sorted again. Then the data are scanned again, and the corresponding 2D C-string is created. In the process of data scanning, however, the 2D C-string strategy has to detect the case of the local body. For example, in Fig. 9, F is inside B, while G is inside F. The corresponding 2D C-string along the x-axis of Fig. 9 is as follows:

A | B [ (C < D] (E = F] (G] H < J) < K [ L) < M,

where "( )" is a pair of separators used to describe a set of symbols as one local body. Since a local body can contain another local body, the scanning process contains a recursive procedure to handle this problem. Therefore, the 2D C-string strategy takes many complex steps to preprocess the input data into the corresponding 2D C-string representation.

Fig. 9. An example of a local body case (adapted from Ref. [13]).

For the storage cost of the first step, we also observe that the 2D C-string does not always require less storage than the UID matrix strategy. The UID matrix strategy always requires N² characters, where N is the number of objects (letting the symbols A, B, C and D denote the uids 10, 11, 12 and 13, respectively, so that each entry is a single character). The storage requirement of the 2D C-string strategy is affected by the number of cutting lines: as the number of cutting lines increases, the total number of partitioned subparts increases, and so does the number of characters representing the picture. In the worst case it is still O(N²) [4]. Therefore, our proposed UID matrix strategy can still require less storage than the 2D C-string strategy in some cases.

For the performance of the second step, we prepare 2000 pictures represented as 2D C-strings and as UID matrices in the database in advance. We consider cases of 5, 10, 15 and 20 different objects appearing in each picture. For each case, we consider three kinds of queries: (1) type-0 similarity retrieval, (2) type-1 similarity retrieval and (3) type-2 similarity retrieval. For the query of type-i similarity, 0 ≤ i ≤ 2, we compare one input query picture represented as a 2D C-string (or UID matrix) with each of the 2000 prepared pictures in the database, respectively.

Table 8
A comparison of the query step (time in ms)

(a) 5 objects
Query                2D C-string   UID
Type-0 similarity    0.195         0.010
Type-1 similarity    0.2255        0.0100
Type-2 similarity    0.1605        0.0050

(b) 10 objects
Query                2D C-string   UID
Type-0 similarity    0.436         0.045
Type-1 similarity    0.4705        0.0455
Type-2 similarity    0.385         0.005

(c) 15 objects
Query                2D C-string   UID
Type-0 similarity    0.9315        0.0900
Type-1 similarity    0.9710        0.0955
Type-2 similarity    0.8565        0.0100

(d) 20 objects
Query                2D C-string   UID
Type-0 similarity    1.632         0.165
Type-1 similarity    1.6725        0.1755
Type-2 similarity    1.481         0.015

Table 8 shows the simulation results (in milliseconds) of the second step. From this table, we observe that the proposed UID strategy requires shorter time to process any of these kinds of queries than the 2D C-string strategy. This is because, for each kind of query, the 2D C-string strategy has to do spatial reasoning first. For both strategies, answering the type-2 query requires shorter time than answering the type-0 query, since the algorithm that decides the category in the type-0 query is based on the spatial reasoning involved in the type-2 query. Moreover, for both strategies, answering the type-1 query takes a little longer than answering the type-0 query, since in our simulation a pair of pictures must first pass the type-0 test to satisfy type-1 similarity. To perform spatial reasoning, the 2D C-string strategy has to compute the rank and level values of each object along both the x- and y-axes. (Note that the time to compute rank and level is proportional to the storage cost discussed in the first step.) For example, in Fig. 9, the corresponding values of rank and level of the objects along the x-axis are as follows [13]:

Rank(A) = 1             Level(A) = 1
Rank(B) = 2             Level(B) = 1
Rank(C) = 2.0           Level(C) = 2
Rank(D) = 2.2           Level(D) = 2
Rank(E) = 2.2.1         Level(E) = 3
Rank(F) = 2.2.1         Level(F) = 3
Rank(G) = 2.2.1.1       Level(G) = 4
Rank(H) = 2.2.1.1.101   Level(H) = 5
Rank(I) = 2.2.1.103     Level(I) = 4
Rank(J) = 2.2.102       Level(J) = 3
Rank(K) = 2.4           Level(K) = 2
Rank(L) = 2.4.0         Level(L) = 3
Rank(M) = 4             Level(M) = 1.

To identify the spatial relationships along the x-axis (or y-axis) between two symbols using their ranks, six complex computing rules are used. For example, assume that there are two symbols s_i, s_j in a picture, with Rank(s_i) = rank_i = k^i_1.k^i_2...k^i_{l_i}, Level(s_i) = l_i, Rank(s_j) = rank_j = k^j_1.k^j_2...k^j_{l_j}, Level(s_j) = l_j, and the rank values of the first (h-1) levels of s_i are equal to those of s_j, i.e. k^i_1 = k^j_1, k^i_2 = k^j_2, ..., k^i_{h-1} = k^j_{h-1}, and k^i_h ≠ k^j_h. In this case, the following computing rule determines whether a symbol s_i is edge to edge with a symbol s_j or not: if h ∈ [1, min(l_i, l_j)], mod(k^j_h − k^i_h, 100) = 1, ∀x ∈ [h+1, l_i]: k^i_x > 100, and ∀y ∈ [h+1, l_j]: k^j_y = 0, then s_i | s_j. Furthermore, to answer the spatial relationship along the x-axis (or y-axis) between two objects which are segmented into subparts, we have to compare all subparts of the objects. In general, according to the spatial relationship between these two objects' boundary subparts, four cases are possible. For each case, up to two comparisons between the leftmost (or rightmost) bounding subparts of the two objects are needed to determine the spatial relationship. Therefore, the 2D C-string strategy requires many complex steps in query processing.
5. Conclusion

Picture indexing techniques are needed to ease pictorial information retrieval from a pictorial database. In this paper, we have proposed an efficient iconic indexing strategy called the unique-ID-based matrix (UID matrix) for symbolic pictures, which combines the advantages of the 2D C-string and the 9DLT matrix. In the proposed strategy, we have carefully assigned each spatial operator a unique identifier such that the processing of category classification becomes a range-checking operation; therefore, these operations are as efficient as in the previous approaches. We have also proposed a unique-ID-based matrix data structure to represent pictorial data, in which the relationship between two objects is recorded explicitly. Consequently, spatial reasoning can be done very straightforwardly. For answering a pictorial query, range-checking operations or membership checks on a set of numbers are applied. In similarity retrieval, some new matrix operations are applied. From our simulation, we have shown that the proposed UID matrix strategy performs better than the 2D C-string strategy in terms of time.

A picture is defined to be ambiguous if more than one different picture can be reconstructed from its representation. How to handle the problem of an ambiguous picture is one future research direction. Furthermore, how to efficiently handle large image databases is also an important future research direction.
References

[1] S.K. Chang, C.W. Yan, D.C. Dimitroff, T. Arndt, An intelligent image database system, IEEE Trans. Software Engng. 14 (1988) 681–688.
[2] S.K. Chang, Q.Y. Shi, C.W. Yan, Iconic indexing by 2-D strings, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9 (1987) 413–428.
[3] G. Costagliola, F. Ferrucci, G. Tortora, M. Tucci, Correspondence: non-redundant 2D strings, IEEE Trans. Knowledge Data Engng. 7 (1995) 347–350.
[4] S.Y. Lee, F.J. Hsu, 2D C-string: a new spatial knowledge representation for image database systems, Pattern Recognition 23 (1990) 1077–1087.
[5] P.W. Huang, Y.R. Jean, Using 2D C+-string as spatial knowledge representation for image database systems, Pattern Recognition 27 (1994) 1249–1257.
[6] S.Y. Lee, M.K. Shan, W.P. Yang, Similarity retrieval of iconic image database, Pattern Recognition 22 (1989) 675–682.
[7] S.S. Liaw, A new scheme for retrieval similar pictures in a pictorial database, J. Comput. 2 (1990) 19–26.
[8] E. Jungert, Extended symbolic projections as a knowledge structure for spatial reasoning, Proceedings of the Fourth BPRA Conference on Pattern Recognition, Springer, Cambridge, March 1988.
[9] E. Jungert, S.K. Chang, An algebra for symbolic image manipulation and transformation, in: T.S. Kunii (Ed.), Visual Database Systems, North-Holland, Amsterdam, 1989, pp. 301–307.
[10] S.K. Chang, E. Jungert, Y. Li, Representation and retrieval of symbolic pictures using generalized 2D strings, Technical Report, University of Pittsburgh, PA, 1988.
[11] C.C. Chang, Spatial match retrieval of symbolic pictures, J. Inform. Sci. Engng. 7 (1991) 405–422.
[12] Y.I. Chang, B.Y. Yang, A prime-number-based matrix strategy for efficient iconic indexing of symbolic pictures, Pattern Recognition 30 (1997) 1745–1757.
[13] S.Y. Lee, F.J. Hsu, Spatial reasoning and similarity retrieval of images using 2D C-string knowledge representation, Pattern Recognition 25 (1992) 305–318.
About the Author*YE-IN CHANG was born in Taipei, Taiwan, ROC, in 1964. She received the B.S. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, ROC, in 1986, and M.S. and Ph.D. degrees in computer and information science from the Ohio State University, Columbus, OH, in 1987 and 1991, respectively.
Since 1991, she has been on the faculty of the department of applied mathematics at National Sun Yat-Sen University, Kaohsiung, Taiwan, ROC, where she is currently a Professor. Her research interests include database systems, distributed systems, multimedia information systems and computer networks. About the Author*HSING-YEN ANN was born in Taipei, Taiwan, ROC, in 1974. He received the B.S. degree and M.S. degree in applied mathematics from National Sun Yat-Sen University in Kaohsiung, Taiwan, ROC, in 1996 and 1998, respectively. His research interests include database systems and computer networks. About the Author*WEI-HORNG YEH was born in Taipei, Taiwan, ROC, in 1977. He will receive the B.S. degree in applied mathematics from National Sun Yat-Sen University in Kaohsiung, Taiwan, ROC, in 1999.
Pattern Recognition 33 (2000) 1277–1293
Content-lossless document image compression based on structural analysis and pattern matching
Yibing Yang*, Hong Yan, Donggang Yu
Department of Electrical Engineering, The University of Sydney, NSW 2006, Australia
Received 28 September 1998; received in revised form 22 February 1999; accepted 14 April 1999
Abstract

This paper presents a highly efficient content-lossless document image compression scheme. The method consists of three stages. Firstly, the image is analysed and segmented into symbols and position parameters by analysing the relation of the foreground to the background and their connectivity. Secondly, an initial representative symbol set is extracted from the symbols in the image and matched by direction-based bit-map analysis and matching, and the final representative and synthetic pattern set, with fewer repeated symbols, is formed from the previous symbol set by multi-stage structure clustering and representative pattern deriving and synthesis. This final component set is reorganized into a compact library image. Finally, high-ratio compression is achieved by coding the relative positions of symbols, the parameters of representative patterns and the library image using adaptive arithmetic coders of different orders and the Q-Coder, respectively. Our scheme achieves much better compression and a smaller error map than most alternative systems. Its lossiness can be reduced to a quite small level in a well-defined pattern deriving and synthesis manner, at the cost of compression ratio. Our method can assure content-lossless reconstruction under our symbol-level content-lossless criteria. The method can easily be combined with soft pattern matching to extend it to a lossless mode. In addition, combining this method with the JBIG1 progressive mode and the low-redundancy component library can achieve content-lossless progressive transmission capability. Our method can also deal with various symbolic images, including nested symbols such as Chinese character images, by means of symbolic segmentation based only on connection and position-based bit-map reconstruction. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Document image analysis; Structural clustering; Pattern matching; Image compression
1. Introduction

Reduction of data storage space and transmission time has been of much concern in many applications, particularly in the networking area, where higher compression and faster transmission of data seem to be an endless target. Despite the efforts of many researchers and the implementation of a number of standards [1-3], there are still high demands for efficient data compression
* Corresponding author. Tel.: +61-2-9351-6210; fax: +612-9351-3847. E-mail address: [email protected] (Y. Yang).
technology. Compression of document images is important for the development of office automation, networked business and intensive applications on the World Wide Web. In many practical applications, it may only be necessary that the decompressed document image looks as good as the original to the human eye. In this case, it may be sufficient to be content-lossless at the component level instead of lossless at the pixel level for scanned document images such as fax document pages. It is also reasonable and attractive to have higher transmission speed and a smaller storage space at the cost of imperceptible loss of image quality. It then becomes practical to remove redundancy from the page being scanned and to store content information for recording.
Several standards have been developed in the past for lossless compression of binary images. They include JPEG, JBIG, Group 4 FAX, GIF, Photo CD and PNG. There is, however, much more work to do for lossy bilevel image compression, since there is no lossy image compression standard. Therefore, JBIG is working on new standards for lossy and lossless compression of bilevel images [4]. The traditional example of document image compression is facsimile page compression by run-length coding, which does not provide a high compression ratio because no coding method specific to textual images is used. Thus, many researchers have been working on pattern-matching-based document image compression, which is particularly effective for text pages [5-9]. This is also a main direction [4] proposed by ISO in JBIG2. The idea of lossy compression based on pattern matching (PM) to reduce repeated patterns was first proposed by Ascher and Nagy [5], who did not use any compression technique for the extracted patterns and positions. Subsequently, Pratt et al. implemented this method as combined symbol matching (CSM) [10], using 2D run-length coding to compress the extracted pattern bitmaps and static coding to compress the symbol positions. Brickman and Rosenbaum [11] extended CSM to improve the compression ratio by providing in advance a library containing some types of characters. Johnsen et al. [7] described a pattern matching and substitution (PMS) technique which further improved the coding efficiency over CSM and extended the coding scope from symbols to some figures by combining decomposition with the matching technique. Later on, Holt and Xydeas [12] presented a modified pattern matching algorithm that improves the CSM result by combining a size-independent strategy (CSIS). Inglis and Witten [13] proposed another type of pattern matching technique, called compression-based template matching (CTM), using a cross-entropy measurement idea. Zhang and Danskin [14] presented a cross-entropy-based pattern matching algorithm (EPM), used as a physical model to improve Inglis and Witten's cross-entropy-based method [13] and its matching performance. Witten, Bell et al. [8] combined PMS with residual processing to extend lossy PMS to a lossless and progressive mode, and provided a test system known as MGTIC [9]. Subsequently, Howard [6] developed a new pattern matching idea known as soft pattern matching (SPM), with a technique called selective pixel reversal (SPR), which may work in both lossy and lossless modes without the risk of errors in pattern substitution, so that it has become a proposal in the JBIG2 standard draft. Recently, a new bilevel image compression technique (patented) called Cartesian perceptual compression (CPC) [15] has been developed by Cartesian Products, Inc. CPC is a lossy compression method for bilevel images; its
compression performance outperforms all the previous lossy compression techniques for document images by using a pattern model of perceptual discriminability, with a perceptual factor and a pattern prediction technique. Its compression performance, however, is background dependent: it cannot guarantee a high compression ratio and 100% content-lossless reconstruction for document images with a complex background (mixed black and white background) and complex symbols. Despite the progress in research and development on lossy compression techniques for document images, there is still a lot of work to do in pattern-matching-based lossy compression in order to achieve higher compression performance while maintaining low loss and cost, particularly in knowing how to guarantee content-lossless reconstruction in pattern matching.

In this paper, we present a highly efficient content-lossless document image compression scheme and the results of an implementation based on the proposed method. The high compression ratio for document images is achieved in three stages. Firstly, the image is analysed and segmented into symbols and position parameters by analysing the ratio of foreground and background pixels and their connectivity. Secondly, an initial representative symbol set is extracted from the symbols in the image and matched by direction-based bit-map analysis and matching; the final representative and synthetic pattern set, with fewer repeated symbols, is formed from the previous symbol set by multi-stage structure clustering and representative pattern deriving and synthesis. This final component set is reorganized into a compact library image. Finally, a high compression ratio is achieved by coding the relative positions of symbols, the parameters of representative patterns and the library image via adaptive arithmetic coders of different orders and the Q-Coder, respectively.

Our scheme achieves a much better compression ratio and reduces the error map more than most of the alternative pattern-matching-based lossy compression systems. Its lossiness can be reduced to a quite small level in a well-defined pattern deriving and synthesis manner with lower compression cost. Our method can assure content-lossless reconstruction according to our component-level content-lossless criteria. Its compression performance is comparable with the CPC method [15]; moreover, it can compress document images with both black and white backgrounds in the same image by using background analysis. Besides, this method has some other advantages. For example, it can easily be combined with soft pattern matching [6] to extend it to a lossless mode. Combining this method with the JBIG1 progressive mode and the low-redundancy component library can achieve content-lossless progressive transmission capability. Moreover, it is symbol and size independent and can be applied to various symbolic images, including nested symbols such as images with Chinese characters and Greek symbols, using symbolic segmentation based on connection
only and position-based bit-map reconstruction. Its multi-page processing capability provides it with a higher compression ratio than most of the other current available lossy/lossless methods with the increase in the number of processed image pages, especially when all pages are from the same source. The paper is organized as follows. Section 2 discusses the analysis of background and forms of document images and symbol segmentation method based on a background-model. Section 3 presents the initial symbol matching technique based on error bit-map direction analysis. Section 4 describes the principle for multi-stage structure clustering of initial matched patterns and the method for extracting representative patterns and synthesizing typical patterns, and presents some contentlossless criteria in symbol level. The compression implementation process based on pattern matching is given in Section 5, which is implemented using the adaptive arithmetic coding with di!erent orders and the binary arithmetic Q-Coder for 1D position parameters and 2D synthetic pattern images. Sections 6 and 7 discuss and compare the experimental results of the proposed method with related compression techniques and "nally suggest an extension to lossless and progressive mode by combining this method with SPM and JBIG1.
2. Background analysis and symbol segmentation

Symbol segmentation for bilevel document image compression may be somewhat different from pattern segmentation for optical character recognition (OCR) systems. The former does not necessarily require a meaningful segmentation into whole characters, letters or symbols; the result of the segmentation may be a part of a whole, meaningful character, letter or symbol. Symbol segmentation, in this sense, means extracting connected groups of black or white pixels from the image. Whether connected white or black groups are extracted depends on the document background analysis; simple page processing and layout alignment for scanned document images may result in a more efficient segmentation. Accordingly, we introduce background analysis and page processing before any segmentation and compression is performed.

2.1. Background analysis and layout alignment

Here, we present a simple method for analysing the background and the components of bilevel document images. Firstly, a document image is divided into 16 blocks, as shown in Fig. 1. We then select every other block of those 16 blocks in the horizontal and vertical directions as the first-stage blocks for local histogram analysis. These blocks are denoted 1-8 in Fig. 1. We then calculate their local pixel difference histograms by using
Fig. 1. Selected blocks for the analysis of the local pixel difference histograms of a document image. The bigger blocks with labels 1-8 are used for the initial local histogram analysis. If there are a few inhomogeneous changes in some local histograms, a finer histogram analysis is done by dividing the big block with the inhomogeneous histogram, together with its neighbouring ones, into smaller blocks (darker ones). Eventually, the whole document image can be processed under its different backgrounds.
the difference between the number of black pixels and the number of white pixels in each block. A document image has a homogeneous background if the histograms in all 8 blocks have homogeneous distributions, as shown in Fig. 2. A positive bar shows that black pixels are the dominant pixels in a block, as shown in Fig. 2(a) and (b); such a block may have a black background, and the higher the positive bar, the more black pixels there are. Similarly, a negative bar means dominant white pixels, as shown in Fig. 2(c) and (d); the higher the negative bar, the more white pixels there are, and the block has a white background. If a few of the pixel difference histograms have inhomogeneous distributions, with both positive and negative bars or with a gap from block to block that is larger than the average bar gap, this may mean that there are different backgrounds in the bilevel document image: black background with white foreground, or the other way around, or some halftone images in the document. In this case, we further divide the bigger block whose distribution differs from the dominant histograms, together with its neighbourhood, into smaller blocks (see the darker small blocks in Fig. 1) in order to separate the different backgrounds. Similarly, we can segment a document image into smaller local regions with different backgrounds by analysing the multistage local pixel difference histograms.

Prior to symbol segmentation, we introduce a simple method for layout alignment to correct a skewed scanned
Fig. 2. The local histograms for the analysis of a document image. The vertical axis denotes the difference between the number of black pixels and the number of white pixels; each bar represents one block. A homogeneous bar distribution represents a homogeneous background. A positive bar represents dominant black pixels in the document image, i.e. the document has a black background and white foreground. Similarly, a negative bar means dominant white pixels in the document image, i.e. white background and black foreground.
Prior to symbol segmentation, we introduce a simple method for layout alignment to correct a skewed scanned document image, so that the follow-up symbol segmentation by lines and the data compression of symbol positions will be more efficient. The document may be skewed, i.e. the text lines in the scanned image may have a large deviation from the horizontal direction. A large skew may cause low efficiency both in the follow-up symbol segmentation by lines and in the compression based on pattern matching. For the former, text lines may become impossible to separate. For the latter, a high compression ratio depends on the position correlation of patterns to such an extent that skewed text lines may weaken the position correlation between patterns. Here, we use the cross-correlation between vertical lines at a fixed distance, as proposed by Yan [16], to estimate and correct the skew angle of the scanned document image. The cross-correlation function is defined as follows:

$$R(s) = \sum_{x=0}^{X-d-1} R_1(x, s), \qquad R_1(x, s) = \sum_{y=0}^{Y-s-1} B(x+d,\, y+s)\, B(x, y), \tag{1}$$

$$\alpha = \tan^{-1}\!\left(\frac{s_{\mathrm{opt}}}{d}\right), \tag{2}$$

where R(s) is the accumulated cross-correlation function for all pairs of vertical lines at a fixed distance d in the image, and R_1(x, s) is the cross-correlation function with a shift s between the two vertical lines of the image at x and x + d in the horizontal direction. The document image is a binary image B(x, y) with black pixels B(x, y) = 0 in the symbols and white pixels B(x, y) = 1 in the background. The shift s_opt that maximizes R(s) gives, through Eq. (2), the estimated skew angle α for the document image.
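A minimal sketch of the skew estimate of Eqs. (1) and (2), assuming a NumPy array B with B[y, x] = 1 for background and 0 for symbol pixels; the line distance d and the shift search range are illustrative parameters.

```python
import numpy as np

def estimate_skew_angle(B, d=32, max_shift=40):
    """Skew estimate from inter-line cross-correlation, mirroring Eqs. (1)-(2).

    B: 2D array with B[y, x] == 1 for background and 0 for symbol pixels.
    d: fixed horizontal distance between the two vertical lines compared.
    Returns the angle alpha (radians) with tan(alpha) = s_opt / d.
    """
    Y, X = B.shape
    best_s, best_R = 0, -1.0
    for s in range(-max_shift, max_shift + 1):
        R = 0.0
        for x in range(X - d):
            # R_1(x, s) = sum_y B(x + d, y + s) * B(x, y)
            if s >= 0:
                a, b = B[s:Y, x + d], B[0:Y - s, x]
            else:
                a, b = B[0:Y + s, x + d], B[-s:Y, x]
            R += float(np.dot(a.astype(np.float64), b.astype(np.float64)))
        if R > best_R:
            best_R, best_s = R, s
    return float(np.arctan2(best_s, d))
```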
The skew of the scanned image can be corrected by rotating by the angle α in the opposite direction. We only need to analyse two to three blocks with the same histogram distribution in the image.

2.2. Symbol segmentation and position extraction

After the document image background is determined, we use a labelling method to segment connected black or white pixel groups in scanning-line order. Black pixel groups with 8-connection are labelled if white pixels dominate the image; otherwise, white pixel groups are labelled. In our segmentation method, different labels are assigned to different connected components. Symbol segmentation based on connected-component labelling achieves symbol extraction in a natural reading order; it is also symbol-independent, since there is no filling process as in boundary tracing. We label every 8-connected component during the scanning process, so our segmentation method works for the segmentation of any symbol text, including Chinese characters and other nested symbols. It requires less storage space and therefore has a higher implementation efficiency, because segmentation is performed on a text-line by text-line basis. Moreover, most symbols extracted from the text part of the scanned image will occupy a minimum space because the skew in the image has been corrected, as shown in Fig. 3. Similarly, the positions of symbols in the same text line will have a higher correlation.
Fig. 3. The result of segmentation in the corrected image by labelling and the corresponding result in the skewed image.
Fig. 4. The definition of the positions of the segmented symbols.
The position of each segmented symbol is determined during the segmentation process; we may use the upper left corner or the lower left corner of the symbol frame as the symbol's basic position to reconstruct the image. Here, we choose the lower left corner of the symbol frame, and use as relative position parameters the pixel distance between the lower left corner of the current symbol frame and the lower right corner of the previous symbol frame for two successive symbols (except the first symbol on each image page), as shown in Fig. 4. This means the position data may have higher correlation and may reach a higher compression ratio later. The position of the first symbol on each page of the image is its real pixel position in the image.
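The following sketch illustrates the segmentation and position bookkeeping described above. It uses SciPy's connected-component labelling as a stand-in for the authors' scan-line labelling, and records the lower-left corner of the first symbol absolutely and every later symbol relative to the lower-right corner of the previous frame; the record layout is a hypothetical choice.

```python
import numpy as np
from scipy import ndimage

def segment_symbols(img):
    """Label 8-connected black-pixel groups and return frame sizes and
    relative positions in the spirit of Section 2.2.

    img: 2D array, black (foreground) pixels == 0.
    Returns a list of (width, height, dx, dy); for the first symbol (dx, dy)
    is its absolute lower-left corner, afterwards it is the offset from the
    lower-right corner of the previous symbol frame.
    """
    labels, _ = ndimage.label(img == 0, structure=np.ones((3, 3), dtype=int))
    records, prev_lower_right = [], None
    for sl_y, sl_x in ndimage.find_objects(labels):
        x0, x1 = sl_x.start, sl_x.stop - 1          # left / right columns
        y_low = sl_y.stop - 1                       # lower row of the frame
        lower_left, lower_right = (x0, y_low), (x1, y_low)
        if prev_lower_right is None:
            dx, dy = lower_left                     # absolute position of the first symbol
        else:
            dx = lower_left[0] - prev_lower_right[0]
            dy = lower_left[1] - prev_lower_right[1]
        records.append((x1 - x0 + 1, sl_y.stop - sl_y.start, dx, dy))
        prev_lower_right = lower_right
    return records
```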
3. Initial symbol matching based on directional analysis of the error bit-map

Symbol matching can provide very high compression ratios. It is based on coding a few representative symbols in a document image that contains many similarly shaped symbols. The efficiency and success of the compression depend on the accuracy of the pattern matching technique and on an efficient coding algorithm. The smaller the matching tolerance between patterns that belong to the same class, the higher the matching accuracy; it may, however, result in more redundancy within the same pattern class and lower compression efficiency. It therefore becomes a major goal to keep the compression content-lossless with
the most tolerant matching (low rejection) in symbolic compression. For this objective, we divide the matching process into several stages. We propose strict, low-tolerance matching criteria for the first stage of matching. Each segmented component is compared with the components of a pattern library extracted from the document image. If it matches a component in the pattern library, we only need to code its relative position in the image and the pattern label in the library. The initial matching criteria are implemented as follows (a sketch of these criteria is given at the end of this section):

1. Set the size threshold equal to 1. Compare the frame size of the present component with the size of the pattern in the library in both the horizontal (x) and the vertical (y) directions. Go to step 2 only if both size differences are less than or equal to the size threshold; otherwise, check the next pattern in the library. If all patterns in the library have been checked without success, add the present component to the library.

2. Set alignment modes for the pattern comparison according to the frame-size differences between the two compared patterns. No alignment is needed when the frame sizes of the two compared patterns are equal in both the x and y directions. The matching comparison is made for the different modes until a match is reached or all modes are rejected. The alignment modes for the different frame-size differences are shown in Fig. 5.
Fig. 5. Different alignment modes in the comparison of patterns with different frame sizes, where pattern 1 in the image, with frame sizes x1 and y1, is compared with pattern 2 extracted into the library, with frame sizes x2 and y2.
Fig. 6. (a) The direction definition of the accumulated error number in the error map. (b) and (c) The 'o's represent the error maps in (P1 − P2) and (P2 − P1).
3. Make the matching decision according to the accumulated error numbers in the four directions, shown in Fig. 6(a), of the error maps. The error map is calculated separately for the two compared patterns: one error map (call it P1 − P2) contains the pixels that are set in the first pattern but not in the second, and the other (call it P2 − P1) contains the pixels that are set in the second pattern but not in the first. These two maps are shown in Fig. 6(b) and (c). An error number is defined as the number of errors appearing successively in an error map. The match is rejected if any one of the error numbers in the four directions of the two error maps is larger than a fixed fraction of the frame size in that direction; then proceed to step 4. Otherwise, if the
match is accepted, the next pattern in the image is checked.

4. Repeat step 3 for all alignment modes in Fig. 5. The error map and error number are calculated for every mode when the frame sizes of the compared patterns differ, until a match is found or all modes have been rejected after checking all alignment modes.

We may widen the error-number tolerance to a larger fraction of the frame size at the frame boundaries of the compared patterns to reduce the noise effect at the edges. The directions 2 and 4 of Fig. 6(a), used to calculate the error number in step 3, can have multiple choices according to the different frame sizes of the compared patterns
Fig. 7. Examples of the matching error in the directions 2 and 4.
due to the discrete nature of the digital image. Fig. 7 illustrates some error-number examples in directions 2 and 4. Under these directional criteria, the initial matching can be guaranteed to be accurate and highly efficient without the need to compare and calculate a weighted error over the whole error map. It can efficiently differentiate some easily mismatched patterns, such as 'b' from 'h' and 'e' from 'c' according to the error number in the horizontal direction, '8' from 's' and '3' in the slant directions, and 'c' from 'o' in the vertical direction. Besides preserving the matching accuracy, the use of the various alignment modes provides a slightly larger matching tolerance. Symbol segmentation and initial symbol matching can be carried out for multi-page images by applying the segmentation algorithm and the initial matching criteria to each image page. When processing pages after the first, we use the library patterns accumulated from the previous pages to make the matching comparison. With the increase of the page number and the expansion of the library, more and more symbols on each page are matched with existing library patterns, whereas the library expands more and more slowly if all image pages come from the same source, such as a book or a document. The advantage of compression based on pattern matching becomes obvious as the number of patterns increases.
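A simplified sketch of the initial matching criteria of this section: the frame-size test of step 1 and the directional error-number test of step 3, applied to boolean bit-maps (True = set pixel). The alignment modes of Fig. 5 and the boundary tolerance widening are omitted, and the rejection fraction frac is an assumed value, since the exact fractions are not legible in this copy.

```python
import numpy as np

def directional_error_numbers(err):
    """Longest run of consecutive error pixels along each of the four
    directions of Fig. 6(a): horizontal, vertical and the two diagonals."""
    def longest_run(line):
        best = cur = 0
        for v in line:
            cur = cur + 1 if v else 0
            best = max(best, cur)
        return best
    h, w = err.shape
    return [
        max(longest_run(err[y, :]) for y in range(h)),                         # horizontal
        max(longest_run(err[:, x]) for x in range(w)),                         # vertical
        max(longest_run(np.diagonal(err, k)) for k in range(-h + 1, w)),       # +45 degrees
        max(longest_run(np.diagonal(err[::-1], k)) for k in range(-h + 1, w))  # -45 degrees
    ]

def initially_matched(p1, p2, frac=0.25, size_tol=1):
    """Steps 1 and 3 of the initial matching for boolean bit-maps
    (True = set pixel); alignment modes are not implemented here."""
    if (abs(p1.shape[0] - p2.shape[0]) > size_tol or
            abs(p1.shape[1] - p2.shape[1]) > size_tol):
        return False
    if p1.shape != p2.shape:       # only the equal-size alignment mode is handled
        return False
    e12 = np.logical_and(p1, np.logical_not(p2))   # set in P1 but not in P2
    e21 = np.logical_and(p2, np.logical_not(p1))   # set in P2 but not in P1
    limits = [frac * p1.shape[1], frac * p1.shape[0],
              frac * min(p1.shape), frac * min(p1.shape)]
    for err in (e12, e21):
        if any(r > lim for r, lim in zip(directional_error_numbers(err), limits)):
            return False
    return True
```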
4. Multi-stage structural clustering and representative pattern synthesis/extraction

We use a very low matching tolerance in the initial matching in order to ensure matching accuracy, with small error
maps, and a highly efficient match. Thus, after the initial matching, there are some repeated components with similar shapes in the pattern library. This redundancy at the component level may be greater for some patterns with noise. Obviously, we try to reduce this redundancy at the component level to a minimum while ensuring that there is no visible error in the reconstructed pattern, so that a maximum compression ratio can be achieved in the later coding operation. For this purpose, we further propose a multi-stage matching method based on structural clustering, which clusters the components with the same shape into the same class after the initial pattern matching and the extraction of library patterns for all image pages are completed. On the basis of the clustering, we extract or synthesize a representative pattern from each cluster class as the class pattern, according to the structure and error analysis and the content-lossless criteria. This matching method, based on structural clustering and pattern synthesis, can greatly reduce the redundancy and the noise effect at the component level and increase compression.

4.1. Multi-stage structural clustering

Assuming that the document image has black text (pixel level 0) on a white (level 255) background, we first set some thresholds and give some definitions for the structure analysis.

• Small size: the frame size of the compared symbol in the horizontal (x) or the vertical (y) direction is smaller than 12.
• Medium size: the frame size of the compared symbol in the x or the y direction is larger than 11 and smaller than 24.
• Frequency of pixel-level change: the number of pixel-level changes from black to white or from white to black along the x or the y direction through the whole symbol frame. A change from white to black is counted if the start position of the symbol frame boundary in the x or the y direction is a black pixel; a change from black to white is counted if the end position of the frame boundary is a black pixel.
• Transition position: the position of a pixel-level change from black to white or from white to black. It is a white-to-black position if the present pixel is white and the next is black, and a black-to-white position if the present pixel is black and the next is white. The initial position of the frame boundary is set as a transition position if it is a black pixel, and the same holds for a black pixel at the end position of the frame boundary.
• Error number: the number of error pixels in either error map of the two compared patterns (P1 − P2 and P2 − P1); a black pixel is treated as an error pixel in the error maps.
• Different scanning lines: a pair of scanning lines with the same position and direction but taken, respectively, from the two compared patterns (one from the present symbol, the other from a pattern in the pattern library) that either have different frequencies of pixel-level change, or for which the difference of any transition position between the pair of lines is larger than 2 pixels for the frame-boundary structure analysis and larger than 1 pixel for the frame-inner analysis.
• 0_{xn} and 0_{yn}; 1_{xn}, 1_{yn} and −1_{xn}, −1_{yn}: the symbol frame is equally bisected, quartered or re-bisected in the x and y directions during the structure analysis according to the symbol size. We use 0_{xn} and 0_{yn} to refer to the scanning line along the nth bisection in the x and y direction, respectively, and 1_{xn}, 1_{yn} and −1_{xn}, −1_{yn} for the two symmetric scanning lines around the 0_{xn} and 0_{yn} lines, respectively, as shown in Fig. 8(a) and (b). The case n = 1 is the first bisection of the compared symbol frames; the compared symbol frames are further quartered or re-bisected under the first bisection when n = 2 and 3. By this analogy, we can equally divide the symbol frame into 2^m sectors in the x and y directions, respectively, to do the structure analysis (n = 1, 2, …, m − 1).
• 0_{sl,m}, 0_{nsl,m}, 1_{sl,m}, 1_{nsl,m}, −1_{sl,m} and −1_{nsl,m}, where m = 1, 2: these represent the slanted scanning lines in the inner structure analysis, as shown in Fig. 8(c) to (e). We only need to analyse the case m = 1 if the frame-size difference Δ_xy = size_x − size_y or Δ_yx = size_y − size_x between the x and y directions is less than 4 pixels; otherwise both cases m = 1 and m = 2 are analysed. Here, the slants are the +45° and −45° directions.
• Take 2 rule in all 3 lines: take 3 consecutive scanning lines in each compared pattern to analyse the inner structures when both compared patterns have odd sizes in the x or y direction (or both), and when making the slant structure analysis. The local structures of the two
Fig. 8. The bisection modes in the structure analysis. (a) and (b) represent the horizontal and the vertical structure analysis. (c) to (e) represent the slant structure analysis for the different frame modes. For the horizontal and the vertical structure analysis, '3' lines are taken for the comparison if the frame size in a bisection is odd; otherwise, '2' lines are taken for the comparison.
Fig. 9. The structural comparison rules for the different pattern frame sizes. The structure is considered identical if the structures of both line pairs marked with the symbols 'm' and 'n' are the same. (a) Take '2' lines rule in all 3 lines. (b) Take '2' lines rule in all 2 lines. (c) '3' to '2' lines and '2' to '3' lines rule.
compared patterns are considered identical if two of all three pairs of compared scanning lines in any one of the cases of Fig. 9(a) are the same. Here, 'the same scanning lines' means lines with the same frequency of pixel-level change and with all transition-position differences less than 2 between the two compared scanning lines. We take only '2' of the '3' lines to allow for alignment and noise effects.
• Take 2 rule in all '2' lines: take 2 consecutive scanning lines in each compared pattern to analyse the inner structures when both compared patterns have even sizes in the x or y direction (or both). The local structures of the two compared patterns are considered identical if both pairs of the compared scanning lines in any one of the cases of Fig. 9(b) are the same.
• '3' to '2' lines or '2' to '3' lines rule: take 3 and 2 consecutive scanning lines, respectively, in the compared patterns to analyse the inner structures when one compared pattern is of odd size and the other is of even size in the x or y direction (or both). The local structures of the two compared patterns are considered identical if any two pairs of compared scanning lines in any one of the cases of Fig. 9(c) are the same.

We then give the following structure clustering rules to make the matching decision.

1. Corner condition 1: the match is rejected if the error numbers in any two of the four corners (see Fig. 10(a)) in both error maps are simultaneously larger than or equal to 3; go to step 6. Otherwise, go to step 2.
2. Corner condition 2: the match is rejected if the error number in any square consisting of 4 error pixels in any of the four corners (see Fig. 10(b)) of both error maps is equal to 4; go to step 6. Otherwise, go to step 3.
3. Frame boundary condition: the match is rejected if there are three pairs of different scanning lines among the four pairs of compared frame boundaries, each pair
Fig. 10. The four-corner rules for the structure analysis. (a) The match is rejected if any two error numbers in the 4 corners are larger than 2. (b) The match is rejected if the error number in any square with 4 pixels in a corner is equal to 4.
of boundary scanning lines belonging, respectively, to the two compared patterns; go to step 6. Otherwise, go to step 4.
4. Inner structure condition: the comparison of the slant scanning lines is made for any frame size and in every case. For the scanning lines 0_{xn}/0_{yn}, 1_{xn}/1_{yn} and −1_{xn}/−1_{yn}, take only the case n = 1 for the structure analysis when the maximum frame size of the two compared symbols in the x and/or y direction is small; take n = 1, 2 and 3 when the maximum frame size of the two compared symbols in the x and/or y direction is medium; take n = 1, 2, 3, 5 and 6 or n = 1, 2, 3, 4 and 7 when the maximum frame size of the two compared symbols in the x and/or y direction is larger than medium. Take all 3 scanning lines whenever the pattern is of odd size, and take the 2 lines 0_{xn}/0_{yn} and 1_{xn}/1_{yn} whenever the pattern is of even size. All analysis is made according to either the 'Take 2 rule in all 3 lines', the 'Take 2 rule in all 2 lines' or the ''3' to '2' lines or '2' to '3' lines' criterion. The match is rejected if any one of the rules is not met in any case of the structure analysis of the
scanning lines; go to step 6. Otherwise, if all rules are met, go to step 5.
5. A match is accepted. Put all the matched patterns of the same class and the new match parameters into a temporary class buffer; do not change or readjust the initial match label and the pattern in the initial library until the representative pattern synthesis for the class, described in the next section, is finished. The whole matching operation for this symbol is now complete: proceed to check the next symbol in the document image and repeat from step 1.
6. Check the next pattern in the pattern library. If no match is accepted from the whole library, add this pattern to the library and complete the whole matching operation for this symbol. Then go to step 1 and check the next symbol in the document image.

4.2. Representative pattern synthesis/extraction in the same class

After the structural analysis we have clustered the patterns with the same structure into the same class. We may constitute a new pattern or extract a representative pattern from a class to represent all the matched patterns in this class, in order to achieve content-lossless reconstruction and little error in the reconstructed image, as well as a higher compression ratio for the pattern library. We may choose the best pattern of a class as its representative pattern. The so-called best pattern of a class is the one whose accumulated error number over all reconstructed patterns of the class is minimal. That is, if there are m patterns in class n and P_i (i = 1, 2, …, m) denotes the bit-map, as shown in Fig. 6(b), of the ith pattern in this class, then P_j is the best representative pattern among all m patterns of class n if

$$\forall\, k,\; P_i \in C_n\ (i = 1, 2, \ldots, m),\; j, k \in [1, m]:\qquad \sum_{i=1}^{m} |P_j - P_i| \le \sum_{i=1}^{m} |P_k - P_i|.$$
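A small sketch of the best-pattern criterion above: the accumulated error of a candidate P_j is counted over both error maps against every class member (equivalently, the XOR pixel count), and the candidate with the minimal total is returned. Patterns are assumed to be already aligned to a common frame size.

```python
import numpy as np

def representative_pattern(patterns):
    """Choose the class member P_j whose accumulated error against all class
    members is minimal; the XOR pixel count equals the total number of error
    pixels in both maps P_j - P_i and P_i - P_j.

    patterns: list of equally sized boolean arrays (True = black pixel).
    Returns (index, pattern) of the representative.
    """
    best_j, best_err = 0, None
    for j, pj in enumerate(patterns):
        err = sum(int(np.count_nonzero(np.logical_xor(pj, pi))) for pi in patterns)
        if best_err is None or err < best_err:
            best_j, best_err = j, err
    return best_j, patterns[best_j]
```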
Fig. 11 provides an example of a representative pattern extracted from a class of patterns. We may instead synthesize a new pattern to replace all patterns in a class when the average size of the patterns is large in one of the directions or the number of patterns in the class is small. The following rules can be used to synthesize a new pattern (a small sketch of the major rule is given below):

• Size rule: take the average pattern size or the most frequent pattern size of all patterns in a class as the synthesized pattern size; take the smaller size in a direction if the numbers of small-size and large-size patterns are the same.
• Major rule: take the majority pixel level (black or white) at each position over all aligned patterns of a class as the synthesized pixel level. If the numbers of black and white pixels at the same position over all patterns of a class are equal, take the majority pixel level in the 8-connected neighbourhood of all patterns as the synthesized pixel level. If the numbers of black and white pixels in the 8-connected neighbourhood of all patterns are also equal, take either of them, black or white, as the new pixel level.
• Stroke width rule: take the average stroke width over all corresponding strokes of the patterns in a class as the stroke width of the synthesized pattern. Take the smaller width if the numbers of smaller-width and larger-width strokes for the same stroke over all patterns in a class are the same.

Fig. 12 provides an example of synthesizing a new pattern in a class of patterns. After extracting the representative pattern or synthesizing a new pattern for a class of patterns, we put the representative or new pattern into the initial pattern library to replace all patterns that belong to the same class, for all classes of patterns in the pattern library. Then we readjust the match labels to the new class pattern and put the new match labels and the position parameters into the position data set.
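A minimal sketch of the 'major rule' above: a per-position majority vote over the aligned class members, with ties resolved by the local 3x3 majority of the current synthesis (a simplification of the rule in the text, which votes over the 8-neighbourhoods of all patterns). The size and stroke-width rules are omitted.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def synthesize_pattern(patterns):
    """Majority-vote synthesis of a class pattern (the 'major rule').

    patterns: list of aligned, equally sized boolean arrays (True = black).
    Ties are resolved with the local 3x3 majority of the current synthesis,
    a simplification of the rule given in the text.
    """
    votes = np.stack([p.astype(np.int32) for p in patterns]).sum(axis=0)
    m = len(patterns)
    synth = votes * 2 > m                      # strict black majority
    ties = (votes * 2 == m)
    if np.any(ties):
        local = uniform_filter(synth.astype(np.float64), size=3, mode="constant")
        synth = np.where(ties, local >= 0.5, synth)
    return synth
```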
Fig. 11. The extracted representative pattern and all class patterns, where the left-most pattern is the representative pattern extracted from a class of patterns (all except the left-most one) obtained by structural clustering.
Fig. 12. The synthesized pattern and all class patterns, where the left-most pattern is the pattern synthesized from a class of patterns (all except the left-most one) obtained by structural clustering.
In fact, we may further widen the frame-size threshold in one direction only (x or y) to 2 if the frame sizes of the compared patterns in that direction are medium or larger, in order to accept more matches. Meanwhile, we provide a content-lossless rule to ensure content-lossless reconstruction at the component level: the match is rejected if 2 consecutive error numbers in the same one of the 4 directions are both over the error-number threshold for all alignment modes. When the frame-size threshold in one direction (x or y) is widened to 2, the number of alignment modes increases to 6 for the cases of Fig. 5(a) and (b) and to 3 for the cases of Fig. 5(c) and (d), respectively. The whole pattern matching operation is then complete. Furthermore, we may readjust the position data structure, the matching labels and the pattern order in the library according to the frame sizes of the patterns in the library, putting patterns with similar sizes into the same line of the library image, so that we may achieve a higher and more efficient compression. After the readjustment of the parameters and the pattern order, we may code and compress the index data set, which contains the position and matching parameters of all patterns in the image, as well as the extracted pattern library.
5. Adaptive arithmetic coding for 1D position data and 2D library image

After pattern segmentation and matching, we have two types of data set. One is the 1D index data set, including page numbers, pattern numbers and relative positions in each page image, matching parameters, and the numbers and frame parameters of the representative patterns in the library. The other is a 2D library image which contains all representative patterns. Shannon [17] showed that the coding length for the best possible compression is $-\log p$ bits to encode a symbol whose probability of occurrence in the data set is p; for a given set of probabilities $P = p_1, \ldots, p_n$, the expected optimal number of coding bits is $-\sum_i p_i \log p_i$. Therefore, we need a good data model that can provide correct probability estimates for all symbols in the data set, as well as a good coding method which can produce a close-to-optimal code. Huffman coding [18] has been proven to be the best fixed-length coding method available. Huffman codes, however, have to be an integral number of bits long and can therefore sometimes be non-optimal; the problem is conspicuous when the probability of a symbol is very high. Arithmetic coding [19-21] is said to overcome this problem of integral bit-length codes. Here, we use a context-based arithmetic coder for the data compression. Adaptive arithmetic coding [18,21,22] provides optimal compression in theory on average: it uses the minimum number of bytes to encode the data if we can accurately predict and estimate the symbol probabilities.

5.1. Coding index data

Most models for text compression estimate the probability p of a given symbol as

$$p = \frac{\text{weight of the symbol}}{\text{total weight of all symbols}},$$

where the weight of a symbol usually depends on the number of occurrences of the symbol in the entire data set or in a particular context. We use the entire index data set to make the probability estimation adaptively, that is, we dynamically estimate and update the probability of each symbol based on all symbols that precede it, while considering the structure of the data. In order to decrease the cost of updating the model, we rearrange the index data structure in an order with higher local data correlation, implement the coding efficiently, and use the order-2 or the order-1 model. We put all the relative positions offset_x and offset_y of the same direction together, respectively, and put all x and y frame sizes
of the library patterns together, respectively, at the end of the index data, so that the update of the data model and the probability estimation of the current symbol can be made more efficient.
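A minimal sketch of the adaptive probability model that would drive the arithmetic coder for the index data, together with the field grouping described above; the record field names and the order-0 update rule are illustrative assumptions (the paper uses order-1/order-2 context models).

```python
class AdaptiveModel:
    """Order-0 adaptive frequency model: p(symbol) = weight / total weight,
    updated after every coded symbol (cf. Section 5.1)."""

    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}   # start at 1 so every p > 0
        self.total = len(self.counts)

    def probability(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

def rearranged_index_stream(records):
    """Group correlated fields: all offset_x values, then all offset_y values,
    then the x and y frame sizes (hypothetical field names)."""
    stream = [r["offset_x"] for r in records]
    stream += [r["offset_y"] for r in records]
    stream += [r["size_x"] for r in records]
    stream += [r["size_y"] for r in records]
    return stream
```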
5.2. Coding library image

We put all the extracted representative patterns into a library image in a certain order, rather than writing them into one data stream, in order to fully use the two-dimensional correlation between the scan lines of the bilevel image and provide more efficient compression. It is known that the best compression can be achieved by using a data model that closely matches the structure of the data. There are two basic coding methods for bit-map images: one is the pixel-by-pixel context-based model with arithmetic coding [1,6,22], and the other is the MMR coder (Modified Modified Relative Element Address Designate) [3,2]. Here, we use arithmetic coding based on a pixel-by-pixel context model, as used in JBIG [1], to compress the library image. JBIG coding uses two different templates (Fig. 13) to estimate the context: one is a three-line template containing some pixels from the current scan line and the two previous scan lines, and the other is a two-line template containing some pixels from the current line and the previous line. Each template contains a pixel called the adaptive (AT) pixel, which can be moved from its default position and has excellent predictive value. These two templates each contain 10 pixels, so each gives 2^10 = 1024 different contexts. Each pixel is then coded by the binary arithmetic coder (generally known as the Q-Coder) through the probability estimation and the update of the pixel's context. We may adapt the prediction template to different images in order to obtain a better tradeoff between compression and implementation cost (the number of probability updates). We can also use a smaller prediction template (8-pixel template) to predict the context for libraries with larger average pattern size, and a larger template (12-pixel template) for libraries with smaller average pattern size. Finally, we put the two parts of compressed data together after the coding. Decoding is the reverse of compression: we first decode the library image using the Q-Coder and put each library pattern into a temporary buffer; then, after decoding the index data using order-n adaptive arithmetic coding, we put all symbols in their original positions according to the matching labels and the revision of the matching symbol frame sizes. The reconstructed image has no visible error when there is no mismatching.
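The sketch below illustrates the pixel-by-pixel context modelling idea: a 10-pixel, three-line template yields a context index in [0, 1023], and per-context adaptive counts give the probability estimate that would be handed to the binary arithmetic (Q-) coder. The template layout is a plausible stand-in, not the exact JBIG default, and the coder itself is not reproduced; the image is assumed to be a 0/1 array.

```python
import numpy as np

# A 10-pixel, three-line causal template: relative (dy, dx) offsets of
# already coded pixels. The layout is illustrative, not the JBIG default.
TEMPLATE = [(-2, -1), (-2, 0), (-2, 1),
            (-1, -2), (-1, -1), (-1, 0), (-1, 1), (-1, 2),
            (0, -2), (0, -1)]

def context_index(img, y, x):
    """Pack the 10 template pixels into an integer in [0, 1023]."""
    h, w = img.shape
    ctx = 0
    for dy, dx in TEMPLATE:
        yy, xx = y + dy, x + dx
        bit = int(img[yy, xx]) if 0 <= yy < h and 0 <= xx < w else 0
        ctx = (ctx << 1) | bit
    return ctx

def context_probabilities(img):
    """Adaptive per-context estimate of P(pixel == 1) in scan-line order;
    these estimates would drive the binary arithmetic (Q-) coder."""
    n_ctx = 1 << len(TEMPLATE)
    ones = np.ones(n_ctx)                    # Laplace-smoothed counts
    totals = np.full(n_ctx, 2.0)
    probs = np.empty(img.shape)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            ctx = context_index(img, y, x)
            probs[y, x] = ones[ctx] / totals[ctx]   # estimate before coding this pixel
            if img[y, x]:
                ones[ctx] += 1
            totals[ctx] += 1
    return probs
```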
6. Experimental results

We have tested our method on dozens of textual images with different resolutions (150, 200, 300, 400 dpi) and various types and sizes of fonts and symbols; the compression results are presented in Tables 1 and 2 for one-page and multi-page images, respectively. We compare our method with three lossless compressors, two-dimensional CCITT Group 3, Group 4 (ITU-T T.4 and T.6) and JBIG, and with one lossy/lossless method, mgtic [9]. For one-page images, the compression ratios of our method are from 3.53 to 5.65 times as high as the Group 3 algorithm, on average 4.55 times as high as Group 3; from 2.78 to 4.40 times as high as the Group 4 algorithm, on average 3.45 times as high as Group 4; and from 2.25 to 3.42 times as high as the JBIG algorithm, on average 2.77 times as high as JBIG. Mgtic does not work well in its lossy mode for some test images because its reconstructed images may contain some mismatches. In lossless mode, the compression ratio of mgtic is not as high as JBIG, and the implementation of mgtic on our system (a Sun Sparc workstation) consumes more time than our method and the other methods. For multi-page images, the compression ratios of our method are on average higher than for one-page images, and the compression ratio may increase slightly as the number of images increases. The increase of the compression ratio is obvious if all images come from one source, such as a book or a document. On average, for two-page images from similar sources, the compression ratios of our method are about 5.73 times as high as Group 3, 4.41 times as high as Group 4 and 3.46 times as high as JBIG; for five-page images from similar sources, the compression ratios of our method are about 6.91 times as high as Group 3, 5.37 times as high as Group 4 and 4.10 times as high as JBIG. The compression results are illustrated graphically in Fig. 14. Figs. 15-17 provide two examples from the original image to the extracted library image and then to the reconstructed (decompressed) image. No visible error appears in the reconstructed images.
7. Conclusion and extension
Fig. 13. Two templates used in the estimation of context in JBIG coding.
We have proposed a method for highly efficient compression of bilevel textual images based on pattern matching, clustering and representative pattern synthesis, and then encoding all pattern positions and representative
Table 1
The compression results for one-page images by Group 3, Group 4, JBIG and our method, in which g3, g4 and jbig stand for the Group 3, Group 4 and JBIG methods, respectively; orig. (B/W) is the uncompressed bilevel image in B/W dither .tif (packed) format; the ratio our:orig. is the compression ratio of our method with respect to the packed image, and our:xxx is the ratio of the compression ratio of our method to that of method xxx

| Image (bytes/page) | orig. (B/W) .tif | g3 | g4 | jbig | our | our:orig. | our:g3 | our:g4 | our:jbig |
|---|---|---|---|---|---|---|---|---|---|
| ydoc14 | 77,280 | 22,512 | 17,744 | 14,391 | 6383 | 12.11 | 3.53 | 2.78 | 2.25 |
| byyb | 77,814 | 21,386 | 15,842 | 12,912 | 5349 | 14.55 | 4.00 | 2.96 | 2.41 |
| binat2 | 55,892 | 15,498 | 11,440 | 9044 | 2745 | 20.36 | 5.65 | 4.17 | 3.29 |
| binat1 | 57,318 | 15,808 | 11,542 | 9302 | 3202 | 17.90 | 4.94 | 3.60 | 2.91 |
| ydoc31 | 97,194 | 26,166 | 20,988 | 16,316 | 4766 | 20.39 | 5.49 | 4.40 | 3.42 |
| im1 | 132,096 | 25,806 | 18,624 | 15,528 | 5529 | 23.89 | 4.67 | 3.37 | 2.81 |
| ydoc1 | 67,746 | 18,982 | 14,260 | 11,497 | 4221 | 16.05 | 4.50 | 3.38 | 2.72 |
| aim8 | 116,298 | 21,698 | 16,038 | 12,731 | 4330 | 26.86 | 5.01 | 3.70 | 2.94 |
| im10 | 112,170 | 24,986 | 18,524 | 14,946 | 4451 | 25.20 | 5.61 | 4.16 | 3.36 |
| ydoc5 | 75,280 | 21,524 | 16,274 | 13,189 | 5314 | 14.17 | 4.05 | 3.06 | 2.48 |
| im11 | 123,596 | 27,594 | 21,780 | 16,693 | 5273 | 23.44 | 5.23 | 4.13 | 3.17 |
| im4 | 197,798 | 26,448 | 19,844 | 16,563 | 6958 | 28.43 | 3.80 | 2.85 | 2.38 |
| ydoc7 | 74,052 | 19,932 | 14,982 | 12,075 | 4600 | 16.10 | 4.33 | 3.26 | 2.63 |
| yyb7 | 77,814 | 19,372 | 15,166 | 12,927 | 5349 | 14.55 | 3.62 | 2.84 | 2.42 |
| ydoc3 | 66,164 | 27,002 | 22,358 | 17,189 | 7140 | 9.27 | 3.78 | 3.13 | 2.41 |
| Average compression ratio | | | | | | 18.88 | 4.55 | 3.45 | 2.77 |
Table 2
The compression results for multi-page textual images. The meanings of the symbols are the same as in Table 1

| Img. (total bytes) | Number of pages | orig. (B/W) .tif | g3 | g4 | jbig | our | our:orig. | our:g3 | our:g4 | our:jbig |
|---|---|---|---|---|---|---|---|---|---|---|
| bin-12 | 2 | 116,668 | 31,248 | 22,880 | 18,024 | 5230 | 22.31 | 5.97 | 4.37 | 3.45 |
| aaa3-12 | 2 | 244,908 | 55,224 | 40,540 | 32,468 | 7890 | 31.04 | 7.00 | 5.14 | 4.12 |
| doc4-12 | 2 | 92,196 | 34,470 | 26,294 | 20,132 | 6630 | 13.91 | 5.20 | 3.97 | 3.04 |
| doc5-23 | 2 | 144,396 | 53,690 | 43,648 | 32,375 | 9805 | 14.73 | 5.48 | 4.45 | 3.30 |
| yy6-34 | 2 | 187,596 | 52,658 | 41,242 | 32,723 | 10,519 | 17.83 | 5.01 | 3.92 | 3.11 |
| Average compression ratio | | | | | | | 19.96 | 5.73 | 4.41 | 3.46 |
| yy6-123 | 3 | 281,394 | 82,156 | 64,206 | 51,431 | 14,211 | 19.80 | 5.78 | 4.52 | 3.62 |
| yy6-345 | 3 | 281,394 | 78,306 | 61,170 | 48,480 | 13,800 | 20.39 | 5.67 | 4.43 | 3.51 |
| doc4-123 | 3 | 138,294 | 51,882 | 39,550 | 30,480 | 9461 | 14.62 | 5.48 | 4.18 | 3.22 |
| doc5-123 | 3 | 216,594 | 81,690 | 66,460 | 49,275 | 13,672 | 15.84 | 5.98 | 4.86 | 3.60 |
| aaa3-345 | 3 | 367,362 | 73,268 | 51,468 | 42,805 | 8912 | 41.22 | 8.22 | 5.77 | 4.80 |
| Average compression ratio | | | | | | | 22.37 | 6.33 | 4.81 | 3.78 |
| yy6-4p | 4 | 375,192 | 108,774 | 85,212 | 67,912 | 18,084 | 20.75 | 6.01 | 4.71 | 3.76 |
| aaa3-4p | 4 | 489,816 | 102,050 | 73,360 | 60,439 | 12,305 | 39.81 | 8.29 | 5.96 | 4.91 |
| doc2-4p | 4 | 447,192 | 114,566 | 91,916 | 62,423 | 17,845 | 25.06 | 6.42 | 5.15 | 3.50 |
| doc4-4p | 4 | 184,392 | 69,750 | 53,234 | 41,045 | 11,564 | 15.95 | 6.03 | 4.60 | 3.55 |
| doc6-4p | 4 | 295,992 | 109,604 | 90,258 | 66,151 | 15,240 | 19.42 | 7.19 | 5.92 | 4.34 |
| Average compression ratio | | | | | | | 24.20 | 6.81 | 5.31 | 4.04 |
| yy6-5p | 5 | 468,990 | 134,422 | 105,140 | 83,669 | 21,960 | 21.36 | 6.12 | 4.79 | 3.81 |
| doc6-5p | 5 | 369,990 | 134,218 | 110,470 | 80,797 | 18,368 | 20.14 | 7.31 | 6.01 | 4.40 |
| doc2-5p | 5 | 558,990 | 138,638 | 111,224 | 75,560 | 21,110 | 26.48 | 6.57 | 5.27 | 3.58 |
| doc4-5p | 5 | 230,490 | 83,200 | 63,104 | 48,735 | 13,538 | 17.03 | 6.15 | 4.66 | 3.60 |
| aaa3-5p | 5 | 612,270 | 128,492 | 92,008 | 75,273 | 15,207 | 40.26 | 8.25 | 6.05 | 4.95 |
| Average compression ratio | | | | | | | 25.05 | 6.91 | 5.37 | 4.10 |
Fig. 14. Comparison of the compression results for individual images obtained using g3, g4, JBIG and our method. In each graph, each index on the horizontal axis corresponds to an image or a set of multipage images and the ratio between compression ratios obtained using our method and other methods is given on the vertical axis.
Fig. 15. The original scanned document images.
Fig. 16. The constructed library images of representative and synthesis patterns after multi-stage matching and clustering.
Fig. 17. The decompressed and reconstructed document images from representative and synthesis pattern library images and pattern position parameters.
patterns. Multi-stage pattern matching and clustering provide a good tradeoff between matching accuracy and reduced redundancy at the component level. Experimental results
show that our scheme achieves a much better compression ratio and smaller error maps than most alternative lossy compression systems based on pattern matching.
Our method can ensure content-lossless reconstruction according to our component-level content-lossless criteria. Its compression performance is comparable with the CPC method [15]; moreover, it can compress document images that contain both black and white backgrounds in the same image by using the background analysis. Besides, this method has some other advantages. For instance, it can easily be combined with soft pattern matching (SPM) [6] to extend it to a lossless mode. Combining this method with the JBIG1 progressive mode, using the low-redundancy component library, can also achieve content-lossless progressive transmission. Another advantage is that it is symbol and size independent and can be used for various symbolic images, including nested symbols such as images with Chinese characters and Greek symbols, by using symbolic segmentation based on connectivity and position-based bit-map reconstruction. Its multi-page processing capability allows it to achieve a much higher compression ratio than other currently available lossy/lossless methods as the number of processed image pages increases, particularly when all pages are from the same source. The multi-page mode can be implemented efficiently by using segmentation and matching based on scan-line order.
Acknowledgements

This work is supported by the Australian Research Council.
References

[1] JBIG, Progressive Bi-level Image Compression, ISO/IEC International Standard 11544, ITU-T Recommendation T.82, March 1993.
[2] W.B. Pennebaker, J.L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.
[3] K.R. McConnell, D. Bodson, R. Schaphaphorst, FAX: Digital Facsimile Technology and Applications, Artech House, Norwood, MA, 1989.
[4] F. Ono, P.G. Howard, D. Fernandes et al., The Emerging JBIG-2 Standard, ISO/IEC JTC 1/SC 29/WG1 (ITU-T SG8), November 1997.
[5] R.N. Ascher, G. Nagy, A means for achieving a high degree of compaction on scan-digitized printed text, IEEE Trans. Comput. 23 (1974) 1174-1179.
[6] P.G. Howard, Text image compression using soft pattern matching, Comput. J. 4 (2/3) (1997) 146-156.
[7] O. Johnsen, J. Segen, G.L. Cash, Coding of two-level pictures by pattern matching and substitution, Bell System Tech. J. 62 (8) (1983) 2513-2545.
[8] I.H. Witten, T.C. Bell, H. Emberson et al., Textual image compression: two-stage lossy/lossless encoding of textual images, Proc. IEEE 86 (6) (1994) 878-888.
[9] I.H. Witten, A. Moffat, T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, New York, 1994.
[10] W.K. Pratt, P.J. Capitant et al., Combined symbol matching facsimile data compression system, Proc. IEEE 68 (7) (1980) 786-796.
[11] N.F. Brickman, W.S. Rosenbaum, Word autocorrelation redundancy match (WARM) technology, IBM J. Res. Dev. 26 (6) (1982) 681-686.
[12] M.J. Holt, C.S. Xydeas, Recent development in image data compression for digital facsimile, ICL Tech. J. 5 (2) (1986) 123-146.
[13] S. Inglis, I.H. Witten, Compression-based template matching, in: Proceedings of the IEEE Data Compression Conference, IEEE Computer Society Press, Los Alamitos, CA, 1994.
[14] Qin Zhang, John Danskin, A pattern-based lossy compression scheme for document images, in: Proceedings of the International Conference on Image Processing, Vol. 2, IEEE Computer Society Press, Los Alamitos, CA, 1996, pp. 221-224.
[15] Cartesian Products Inc., Technical Overview of Cartesian Perceptual Compression, Cartesian Products, Inc., http://www.cartesianinc.com/, 1997.
[16] Hong Yan, Skew correction of document image using interline cross-correlation, CVGIP: Graph. Models Image Processing 55 (6) (1993) 538-543.
[17] C.E. Shannon, A mathematical theory of communication, Bell Systems Tech. J. 27 (1948) 398-403.
[18] M. Nelson, The Data Compression Book, M & T Books, New York, 1996.
[19] G.G. Langdon, An introduction to arithmetic coding, IBM J. Res. Dev. 28 (2) (1984) 135-145.
[20] W.B. Pennebaker, J.L. Mitchell, G.G. Langdon, R.B. Arps, An overview of the basic principles of the Q-coder adaptive binary arithmetic coder, IBM J. Res. Dev. 32 (6) (1988) 717-726.
[21] I.H. Witten, R.M. Neal, J.G. Cleary, Arithmetic coding for data compression, Commun. ACM 30 (6) (1987) 520-540.
[22] P.G. Howard, J.S. Vitter, Arithmetic coding for data compression, Proc. IEEE 82 (6) (1994) 857-895.
About the Author: YIBING YANG received her B.S., M.S. and Ph.D. degrees from Nanjing University of Aeronautics and Astronautics, China, in 1983, 1986 and 1991, respectively, all in Electrical Engineering. From 1986 to 1988 she worked as an assistant professor at Nanjing University of Aeronautics and Astronautics, China; from 1992 to 1993 she was a postdoctoral fellow, and since 1994 she has worked as an associate professor, both in the Department of Radio Engineering at Southeast University, China. Meanwhile, she was on leave and worked as a research associate in the Electronics Department of The Chinese University of Hong Kong from 1995 to 1996. She is currently working as a visiting scholar in the Department of Electrical Engineering of The University of Sydney, Australia. Her research interests include image and signal analysis, processing and compression, pattern recognition, medical and optical image processing, and computer vision applications.
About the Author: HONG YAN received his B.E. degree from Nanking Institute of Posts and Telecommunications in 1982, M.S.E. degree from the University of Michigan in 1984, and Ph.D. degree from Yale University in 1989, all in Electrical Engineering. From 1986 to 1989 he was a research scientist at General Network Corporation, New Haven, CT, USA, where he worked on developing a CAD system for optimizing telecommunication systems. Since 1989 he has been with the University of Sydney, where he is currently a Professor in Electrical Engineering. His research interests include medical imaging, signal and image processing, neural networks and pattern recognition. He is an author or co-author of one book and more than 200 technical papers in these areas. Dr. Yan is a fellow of the Institution of Engineers, Australia (IEAust), a senior member of the IEEE, and a member of the SPIE, the International Neural Network Society, the Pattern Recognition Society, and the International Society for Magnetic Resonance in Medicine.

About the Author: DONGGANG YU received his diploma from the Department of Automation and Control Engineering of Northeastern University, Shenyang, China, in 1970. From 1970 he worked in the Department of Electrical Engineering of Dalian University of Technology, Liaoning, China, where he taught and did research work on information processing as an assistant lecturer and lecturer, respectively. He was appointed associate professor in electrical engineering at Dalian University of Technology in 1991. He is currently working on image processing and recognition as a visiting scholar at The University of Sydney, Australia. His research interests are in the areas of image processing, pattern recognition and biomedical signal processing.
Pattern Recognition 33 (2000) 1295-1307
Structure and motion from straight line segments

J.M.M. Montiel*, J.D. Tardós, L. Montano

Departamento Informática e Ingeniería de Sistemas, CPS Universidad de Zaragoza, María de Luna 3, E-50015 Zaragoza, Spain

Received 29 September 1998; accepted 23 April 1999
Abstract

A method to determine both the camera location and the scene structure from image straight-segment correspondences is presented. The proposed method considers the finite segment length in order to use stronger constraints than those methods that use the infinite line supporting the image segment. The constraints between image segments involve a weak pairing between image segment midpoints, which allows deviations of the midpoint only in the segment direction. Experimental results are presented for structure and motion computation from image straight line segment matching using two real images. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Structure and motion; Straight segments
1. Introduction

Computing structure and motion from corresponding straight features is a classical problem in computer vision. Usually, straight image segments are modelled by their underlying infinite supporting line. This model can easily be handled with the available mathematical representations for lines. Its main pitfall, compared with the use of points, is that the constraints for lines are less restrictive. A review of structure and motion problems using points and lines is presented in Ref. [1], where the usage of points is shown to be always more restrictive. Most man-made environments can be represented with line segments. This makes it possible to enforce prior knowledge of the straightness of the contours. Normally, relevant image segments correspond to relevant 3D scene features. It should be noted that the semantically relevant feature is not the infinite supporting line of the segment but the finite-length segment. Despite this, only a few works [2,3] have proposed models that include the finite length of the line segments in structure and motion computing. The consideration
* Corresponding author. Tel.: +34-976761975; fax: +34-976762111. E-mail address: [email protected] (J.M.M. Montiel).
of the finite segment length has renewed interest in the usage of straight features in structure and motion problems. Taylor and Kriegman [3] proposed an optimization procedure for image segments. The goal function is the image residue between the infinite supporting line in the image and the supporting line of the reconstructed 3D segment. The finite segment length is considered because the residue between the supporting lines is only computed in the region of the supporting line where the image segment is detected. They propose an algorithm to compute structure and motion from straight segment correspondences in a sequence with no previous knowledge of the camera location. Evaluating the residues only on the detected image segments improves the solution but does not constrain the motion along the infinite supporting line; consequently, at least three images are needed to compute the solution. Zhang [2] proposed, for the first time, an algorithm to compute structure and motion from two images using only straight segment matching as the input. He proposed to maximize the overlap in the image between the image segments and the corresponding reconstruction by using epipolar geometry; as a result, reconstruction computation was not required. Mathematically, the problem was reduced to a non-linear optimization problem. To compute the initial guess for the non-linear optimization, he proposed to sample the solution space and use
a few high-overlap solutions as starting points for further optimization. The work presented in this paper is closely related to that of Zhang. Experimental results are presented for structure and motion computation using straight segment correspondences and two images. The solution is also posed as an optimization problem, and the initial guess is also computed by sampling the parameter space. This paper differs in the image optimization by considering the image segment midpoints as correspondent. However, this constraint is applied only weakly, and deviations are allowed along the image segment direction. It is well known that image segment midpoints are not correspondent, due to the unreliable extraction of the segment extreme points, and because the segment midpoint is not invariant under perspective projection. However, the detected direction of the image segment is reliable. The proposed weak pairing between image segment midpoints uniquely combines the line and point properties of the image segments. A reconstruction and reprojection scheme is proposed which works with any number of images greater than or equal to two. Section 2 presents the probabilistic model used to represent the imprecise location of the geometrical entities. Section 3 presents the equation that relates the image segment location, the camera location and the 3D segment location. In Section 4 the solution of the structure and motion problem from straight correspondences is posed as a non-linear optimization. Section 5 presents the method by which initial guesses for the non-linear optimization with two images are computed. Experimental results with a pair of real images are presented in Section 6. Finally, Section 7 is devoted to conclusions.
2. Modelling geometric information

A probabilistic model, named the SPmodel [4], has been selected to represent imprecise geometrical information. This is a general model for multi-sensor fusion with the following main qualities: a homogeneous representation for every feature irrespective of its number of degrees of freedom (d.o.f. for short in the rest of the paper), and a local representation of the error around the feature location estimate. The error is not represented additively, but as a transformation composition. These qualities for multi-sensor fusion are also recognized as important for computer vision in Ref. [5]. Moreover, a modification has been introduced in the original SPmodel that combines imprecise relations with deterministic relations. Sensorial information is imprecise due to measurement errors; however, some relations are known with a probability of one, e.g. a projection ray is known to pass through the optical centre of the camera.
The SPmodel is a probabilistic model that associates a reference G with each geometric element G in order to locate it. The reference location is given by the transformation t_WG relative to a world reference W. To represent this transformation, a location vector x_WG, composed of three Cartesian coordinates and three Roll-Pitch-Yaw angles, is used:

$$x_{WG} = (x, y, z, \psi, \theta, \phi)^T,$$
$$t_{WG} = \mathrm{Trans}(x, y, z) \cdot \mathrm{Rot}(z, \phi) \cdot \mathrm{Rot}(y, \theta) \cdot \mathrm{Rot}(x, \psi). \tag{1}$$

The estimate of the location of an element is denoted by x̂_WG, and the estimation error is represented locally by a differential location vector d_G relative to the reference attached to the element. Thus, the true location of the element is

$$x_{WG} = \hat{x}_{WG} \oplus d_G,$$

where ⊕ represents the composition of location vectors (the inversion is represented by ⊖). Note that the error is composed with the location estimate, instead of being added. The differential location error d_G is a normally distributed random vector of dimension six. Some components of d_G are set to zero in the following two cases:

Symmetries: symmetries are the set of transformations that preserve the element. The location vector x_WG represents the same element location irrespective of the value of that d_G component. For example, consider the reference S, which locates a 3D segment (Fig. 1(a)). Rotations around the x-direction yield references that represent the same 3D segment. Theoretically those components could have any value, but a zero value is used.

Deterministic components: there are components of x_WG known with a probability of one. Among all the equivalent references for the element, the one with a null deterministic component is selected. The corresponding component is then forced to be zero. For example (Fig. 1(b)), the reference R associated with a projection ray can be attached to the optical centre; expressing its location with respect to the frame of the optical centre C, x_CR (and hence d_R) always has null X, Y and Z components.

To represent the null components mathematically, d_G is expressed as

$$d_G = B_G^T p_G,$$

where p_G, the perturbation vector, is a vector containing only the non-null components of d_G, and B_G, the self-binding matrix, is a row selection matrix that selects the non-null components of d_G.
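A small sketch of the location-vector algebra of Eq. (1), assuming SciPy's rotation utilities for the Roll-Pitch-Yaw part; the composition operator ⊕ is realized here as 4x4 homogeneous-transform multiplication. This is an illustrative implementation, not the authors' code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def loc_to_transform(x):
    """Location vector (x, y, z, psi, theta, phi) -> 4x4 homogeneous transform,
    t = Trans(x, y, z) Rot(z, phi) Rot(y, theta) Rot(x, psi), as in Eq. (1)."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("ZYX", [x[5], x[4], x[3]]).as_matrix()
    T[:3, 3] = x[:3]
    return T

def transform_to_loc(T):
    """Inverse mapping: 4x4 homogeneous transform -> location vector."""
    phi, theta, psi = Rotation.from_matrix(T[:3, :3]).as_euler("ZYX")
    return np.array([T[0, 3], T[1, 3], T[2, 3], psi, theta, phi])

def compose(x_ab, x_bc):
    """Composition of location vectors, x_AC = x_AB (+) x_BC."""
    return transform_to_loc(loc_to_transform(x_ab) @ loc_to_transform(x_bc))
```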
Fig. 2. Image segment in the normalized camera.
Fig. 1. (a) Several equivalent associated references, S, for a 3D segment. (b) A projecting ray R, located with respect to the optical centre C.
Based on these ideas, the information about the location of a geometric element G is represented by a quadruple, the element's imprecise location:

$$L_{WG} = [\hat{x}_{WG},\ \hat{p}_G,\ C_G,\ B_G].$$

Thus, the random vector defining the element location is expressed as

$$x_{WG} = \hat{x}_{WG} \oplus B_G^T p_G, \qquad \hat{p}_G = E[p_G], \quad C_G = \mathrm{Cov}(p_G), \tag{2}$$

where p_G is a normal random vector with mean p̂_G and covariance matrix C_G. When p̂_G = 0, the estimation is regarded as centred. There are geometrical elements whose location is used as input data for the problem, e.g. the image segments. The 3D segments and the camera location are the output of the algorithm and are computed from the input data. The 3D segment is the geometrical element used to represent the scene structure. Moreover, we define an intermediate geometrical element, the 2D segment, to define the constraint that relates an image segment to a 3D segment and the camera location. The 2D segment consists of the projection ray of the image segment midpoint and the projection plane of the image segment supporting line. Next, the model of these four elements is presented in detail.

2.1. Camera imprecise location

We use the letter C to designate the camera reference, and model the camera as normalized, i.e. f = 1. The associated reference origin is placed at the optical centre
of the camera. The z-axis is parallel to the optical axis and points towards the scene. The x-axis points to the right. The y-axis is defined so as to form a direct reference (see Fig. 1(b)). The camera location has neither symmetries nor deterministic components in its differential location vector, so d_C has no null components and its self-binding matrix is the identity:

$$B_C = I.$$

The camera location estimate x̂_WC is obtained from the structure and motion algorithm.

2.2. Image segment imprecise location

We use the letter P to designate references attached to image segments. The associated reference (Fig. 2) is attached to the image segment midpoint. Its y-axis is normal to the supporting line and points to the 'light' side of the segment; consequently, it codes the grey-level gradient of the segment. Its z-axis is parallel to the camera z-axis. The x-axis is defined so as to form a direct reference. As an image segment belongs to the image plane, its z, ψ and θ components are deterministic. Therefore, its self-binding matrix is

$$B_P = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

The centred estimate of the image segment location is defined from the extreme point coordinates in the normalized image, with respect to the camera location (see Fig. 2), as

$$\hat{x}_{CP} = (\hat{x}_{CP}, \hat{y}_{CP}, 1, 0, 0, \hat{\phi}_{CP})^T,$$
$$\hat{x}_{CP} = \frac{x_s + x_e}{2}, \qquad \hat{y}_{CP} = \frac{y_s + y_e}{2}, \qquad \hat{\phi}_{CP} = \mathrm{atan2}(y_e - y_s,\; x_e - x_s). \tag{3}$$
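A small helper following Eq. (3): it converts the two extreme points of an image segment in the normalized camera into the (midpoint, orientation) location vector used by the SPmodel.

```python
import numpy as np

def image_segment_location(p_start, p_end):
    """Centred location estimate of an image segment from its two extreme
    points (x_s, y_s) and (x_e, y_e) in the normalized camera, Eq. (3)."""
    (xs, ys), (xe, ye) = p_start, p_end
    x_mid = (xs + xe) / 2.0
    y_mid = (ys + ye) / 2.0
    phi = float(np.arctan2(ye - ys, xe - xs))      # segment direction
    return np.array([x_mid, y_mid, 1.0, 0.0, 0.0, phi])
```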
The covariance assignment for an image segment is one of the central issues of this work. The covariances of the φ and y components are taken from the infinite supporting line. The standard deviation of the x component is defined to be proportional to the segment length; through the proportionality constant κ, the allowed deviation of the midpoint along the segment supporting line can be fixed. This covariance assignment mimics that proposed in Ref. [6], where it is used for 3D segments; in this paper, we use it for image segments. The same covariance assignment is also used in Ref. [7] for structure from camera location. The quantitative expression for the covariance matrix is

$$C_P = N C'_P N^T, \tag{4}$$

where C'_P is the covariance of the image segment in pixels and N is the Jacobian matrix of the transformation that converts the image segment in pixels into the image segment in the normalized camera. We have chosen the above expression in order to deal with the pixel aspect ratio. Appendix B gives a detailed expression for N as a function of the image segment location estimate (3) and the intrinsic camera parameters. The form of C'_P is

$$C'_P = \mathrm{diag}(\sigma_x^2,\ \sigma_y^2,\ \sigma_\phi^2).$$

• σ_x is set proportional to the image segment length n (in pixels):

$$\sigma_x = \kappa n. \tag{5}$$

The experimental values for κ have been tuned to give κ > 1.

• σ_y and σ_φ are computed from the covariances of the image segment extreme points. Due to systematic errors, there is some correlation between the noise of the extreme point locations. This effect is dealt with by splitting the extreme point covariance into two terms: σ_cc, the completely correlated part (0-2 pixels), and σ_nc, the non-correlated part (0.25-1 pixel). In Ref. [8] both the expression and the tuning are given:

$$\sigma_y^2 = \sigma_{cc}^2 + \frac{\sigma_{nc}^2}{2}, \qquad \sigma_\phi^2 = \frac{2\sigma_{nc}^2}{n^2}. \tag{6}$$

Fig. 3. 95% acceptance regions for the image segment reference origin. Left, κ = 0.2; right, κ = 0.5.
Fig. 3 shows a comparison between the 95% acceptance regions for the origin of the reference that locates the image segment according to two di!erent i assignments. It should be noted that the segment length is not considered as a geometrical parameter, but is used to de"ne the element covariance. The segment is only located by its midpoint and its orientation. Intuitively, the image segment has been modelled as `a point with an orientationa, and its standard deviation along the segment supporting line is set proportional to its length.
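A small sketch of the covariance assignment of Eqs. (5) and (6), as reconstructed above; the function name and the default noise values are illustrative, not taken from the paper.

```python
import numpy as np

def segment_covariance_pixels(length_px, kappa, sigma_cc=1.0, sigma_nc=0.5):
    """Image segment covariance in pixel units, C'_P = diag(sx^2, sy^2, spsi^2).

    length_px : segment length n in pixels.
    kappa     : proportionality constant for deviations along the segment.
    sigma_cc  : completely correlated extreme-point noise (pixels), an assumed value.
    sigma_nc  : non-correlated extreme-point noise (pixels), an assumed value.
    """
    sx2 = (kappa * length_px) ** 2            # Eq. (5): sigma_x = kappa * n
    sy2 = sigma_cc ** 2 + sigma_nc ** 2 / 2.0  # Eq. (6)
    spsi2 = 2.0 * sigma_nc ** 2 / length_px ** 2
    return np.diag([sx2, sy2, spsi2])
```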
Fig. 4. The 2D segment (D) is an intermediate element used to relate the image segment (P) to the 3D segment (S). It includes both the projection ray for the midpoint and the projection plane for the supporting line.
2.3. 2D segment

We use D to designate references attached to 2D segments. This geometrical element is used as an intermediate element to define the relation between an image segment and a 3D segment (Fig. 4). A 2D segment is composed of the projection elements of the corresponding image segment: the projection ray for the image segment midpoint, and the projection plane for the infinite supporting line. Its covariance is derived directly from that of the image segment. The associated reference is attached to the optical centre of the camera, which belongs to every projection element. Its y-axis points towards the image segment midpoint. The z-axis is normal to the supporting line
projection plane. The x-axis forms a direct reference. The z-direction is defined to code the image segment grey level gradient. Since it is attached to the optical centre, the general location vector of a 2D segment, with respect to the camera frame, is

\hat{x}_{CD} = (0, 0, 0, \hat{\phi}_{CD}, \hat{\theta}_{CD}, \hat{\psi}_{CD})^\top.

See Appendix C for the \hat{\phi}_{CD}, \hat{\theta}_{CD} and \hat{\psi}_{CD} expressions as functions of the image segment location (3). As the translation components are deterministically null, the self-binding matrix only selects the \phi, \theta and \psi components:

B_D = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.

The 2D segment covariance matrix is related to that of the image segment:

C_D = K_{DP} C_P K_{DP}^\top,

where C_P has been presented in Eq. (4). A detailed expression for the K_{DP} matrix as a function of the image segment location vector is given in Appendix C.

2.4. 3D segment

The references associated with 3D segments are designated by S. The reference is attached (Figs. 1(a) and 4) to a segment point, which approximately corresponds to the segment midpoint. The reference x-axis is aligned with the segment direction. In this work, the 3D segment location estimate is computed from the integration of several 2D segments corresponding to different points of view. This integration also gives the 5x5 covariance matrix C_S. The only symmetry for this element is the rotation around its direction. Therefore, its self-binding matrix is

B_S = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.
3. Measurement equation

This section is devoted to formalizing the pairing constraint between an image segment (P) and a 3D segment (S) (see Fig. 4). The camera detects the image segment (P). However, the proposed pairing constraint uses the 2D segment (D) and not the image segment (P). As mentioned in Section 2.3, the 2D segment imprecise location is derived directly from that of the image segment.

The SPmodel method is used to define pairing constraints, which are expressed in terms of the location vector x_{DS} of the 3D segment with respect to the 2D segment. Let

x_{DS} = (x_{DS}, y_{DS}, z_{DS}, \phi_{DS}, \theta_{DS}, \psi_{DS})^\top.

The pairing constraint is an implicit equation that indicates which x_{DS} components should be zero; these null components are as follows.

- z_{DS}. Otherwise, the 3D segment would not belong to the projection plane.
- \theta_{DS}, the rotation around the y-axis. This should be zero; otherwise the 3D segment would not be in the projection plane.
- x_{DS}. Otherwise, the 3D segment midpoint would not belong to the image segment midpoint projection ray. Theoretically, the segment midpoint is not invariant under perspective projection. However, we consider it as invariant, and it is shown later that this constraint is given a low weighting. Consequently, in Fig. 4 the projection ray for the image segment midpoint does not contain the origin of the S reference.

In summary, the nullity of z_{DS} and \theta_{DS} takes into account the collinearity in the image between the image segment and the 3D segment. The nullity of x_{DS} implies that the image of the 3D segment midpoint is the midpoint of the image segment. This is equivalent to considering that the image segment midpoints are correspondent. Due to the assignment for the image segment covariance along the segment direction, the midpoint matching normally has a lower weighting than the collinearity. The low weighting for this constraint is justified by the unreliable extraction of the segment extreme points, and by the approximation that the midpoint is invariant under perspective projection. Experimental results confirm the validity of this assignment. The above ideas are formalized mathematically as follows:

f(x_{DS}) = B_{DS} x_{DS} = 0,   (7)

B_{DS} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}.

Considering that

x_{DS} = \ominus x_{CD} \oplus \ominus x_{WC} \oplus x_{WS},   (8)
Eq. (7) establishes a relationship between the 3D segment location, the camera location and the image of the segment. The image segment is represented by the 2D
segment (D). This equation is used to determine the structure and motion from the correspondences. From Eqs. (8) and (2), Eq. (7) can be expressed as

f(x_{DS}) = f(p_D, p_C, p_S) = B_{DS} (\ominus B_D^\top p_D \oplus \ominus \hat{x}_{CD} \oplus \ominus B_C^\top p_C \oplus \ominus \hat{x}_{WC} \oplus \hat{x}_{WS} \oplus B_S^\top p_S) = 0.   (9)

Consequently, we have an implicit function that relates three perturbation vectors, p_D, p_C and p_S, corresponding to the 2D segment, the camera and the 3D segment, respectively, i.e. the normal random vectors involved in the problem.
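The pairing constraint of Eqs. (7)-(9) can be sketched with homogeneous matrices as follows. This is a minimal illustration assuming the roll-pitch-yaw convention reconstructed in Appendix A; the helper names are our own, not the paper's.

```python
import numpy as np

def loc_vector(H):
    """Extract the location vector (x, y, z, phi, theta, psi) from a 4x4
    homogeneous matrix, following the conversion of Eq. (A.2)."""
    n, o, a, p = H[:3, 0], H[:3, 1], H[:3, 2], H[:3, 3]
    phi = np.arctan2(o[2], a[2])
    theta = np.arctan2(-n[2], np.hypot(n[0], n[1]))
    psi = np.arctan2(n[1], n[0])
    return np.array([p[0], p[1], p[2], phi, theta, psi])

def pairing_residual(H_WC, H_CD, H_WS):
    """Eq. (7): f(x_DS) = B_DS x_DS, with x_DS obtained by composing the 2D
    segment, camera and 3D segment locations as in Eq. (8)."""
    H_DS = np.linalg.inv(H_CD) @ np.linalg.inv(H_WC) @ H_WS
    x_DS = loc_vector(H_DS)
    B_DS = np.array([[1, 0, 0, 0, 0, 0],
                     [0, 0, 1, 0, 0, 0],
                     [0, 0, 0, 0, 1, 0]], dtype=float)
    return B_DS @ x_DS   # (x_DS, z_DS, theta_DS): zero for a perfect pairing
```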
4. Structure and motion as a weighted minimization problem

The structure and motion problem with m segments in n images can be stated as the solution of the non-linear system

f_i^k(x_{D_i^k S_i}) = 0,   i = 1, ..., m,   k = 1, ..., n,

where f_i^k is the measurement equation (7) between the 2D segment detected by camera k and the corresponding 3D segment i. Due to the approximate midpoint matching and to noise, the previous system is over-constrained. A minimization is proposed in order to solve it. The goal function is the summation of all the weighted residues, in which the weighting matrix is the inverse of the measurement noise covariance:

\sum_{i=1}^{m} \sum_{k=1}^{n} [f_i^k(x_{D_i^k S_i})]^\top [R_i^k(x_{D_i^k S_i})]^{-1} [f_i^k(x_{D_i^k S_i})],   (10)

where R_i^k(x_{D_i^k S_i}) is the measurement noise covariance. In general, this depends on x_{D_i^k S_i}, which is the optimization variable. Considering Eq. (8) and C^1 (the first camera) as the world frame, x_{D_i^k S_i} can be decomposed as

x_{D_i^k S_i} = \ominus x_{C^k D_i^k} \oplus \ominus x_{C^1 C^k} \oplus x_{C^1 S_i};

then the problem can be expressed as: given {L_{C^k D_i^k}}, k = 1, ..., n, i = 1, ..., m, determine {L_{C^1 S_i}}, i = 1, ..., m, and {L_{C^1 C^k}}, k = 2, ..., n, up to a scale factor, such that expression (10) is minimized, where { } denotes a set of elements.

The previously stated problem optimizes both the structure and the motion parameters. As a result, it needs to optimize 5 + 6(n-2) + 5m parameters: five parameters for the second camera location (x_{C^1 C^2}), six for each of the remaining cameras {x_{C^1 C^k}}, k = 3, ..., n, and five for each scene 3D segment {x_{C^1 S_i}}, i = 1, ..., m.
The result, however, can be computed by optimizing only the motion parameters, as proposed in Ref. [9], in which a constraint between the structure and the motion was used. The evaluation of the total residue is presented next as a function of the motion only. Given a value for the motion parameters {x_{C^1 C^k}}, the structure {x_{C^1 S_i}} is computed using a structure from camera location algorithm. The weighted residue is then computed from the given motion and the computed structure (10). A diagram of the structure from camera location algorithm is given in Section 4.1. For normal and uncorrelated noise, the total residual (10) would follow a \chi^2 distribution with 3mn - 5m - 6(n-2) - 5 d.o.f.

The rest of this section is devoted to defining the terms involved in Eq. (10). The R_i^k matrix is computed from the linearization of Eq. (9):

R_i^k = G_i^k C_{D_i^k} (G_i^k)^\top,   G_i^k = \partial f_i^k / \partial p_{D_i^k},

evaluated at \hat{p}_{D_i^k}, \hat{p}_{C^k}, \hat{p}_{S_i}, where

G_i^k = \begin{pmatrix} 0 & -\hat{z}_{DS} & \hat{y}_{DS} \\ -\hat{y}_{DS} & \hat{x}_{DS} & 0 \\ \sin\hat{\psi}_{DS} & -\cos\hat{\psi}_{DS} & 0 \end{pmatrix},

C_{D_i^k} is the 2D segment location noise, computed as in Section 2.3, and f_i^k is computed directly from the \hat{x}_{D_i^k S_i} components (the i and k scripts have been dropped for simplicity):

f_i^k = (\hat{x}_{DS}, \hat{z}_{DS}, \hat{\theta}_{DS})^\top.

4.1. Structure from camera location

As previously mentioned, a structure from camera location computation is proposed in order to evaluate the residue function (10). The scene structure is computed from the camera locations and the correspondences using an LMSE solution for each 3D structure segment. It is computed from a linearized version of Eq. (9); the detailed algorithm for this structure computation is presented in Ref. [7]. It is given briefly by the following:

f(p_{D_i^k}, p_{C^k}, p_{S_i}) \approx f_i^k + H_i^k p_{S_i} + G_i^k \begin{pmatrix} p_{D_i^k} \\ p_{C^k} \end{pmatrix} = 0,

expressed as the explicit linear measurement equation normally used in optimal estimation:

z_i^k = H_i^k p_{S_i} + G_i^k v,   v \sim N(0, R_i^k).   (11)
The estimated value is p_{S_i}, which is the correction for the 3D segment location. The camera and 2D segment location noise, p_{C^k} and p_{D_i^k}, play the role of measurement noise. Consequently, they are grouped in a measurement noise vector v. The rest of the terms in Eq. (11) are

z_i^k = -f_i^k = -B_{DS} \hat{x}_{D_i^k S_i},

H_i^k = \partial f / \partial p_{S_i},   G_i^k = (\partial f / \partial p_{D_i^k},   \partial f / \partial p_{C^k}),

both Jacobians being evaluated at \hat{p}_{D_i^k}, \hat{p}_{C^k}, \hat{p}_{S_i}, and

v = \begin{pmatrix} p_{D_i^k} \\ p_{C^k} \end{pmatrix},   R_i^k = \begin{pmatrix} C_{D_i^k} & 0 \\ 0 & C_{C^k} \end{pmatrix}.

C_{D_i^k} is computed as shown in Section 2.3, and C_{C^k} is tuned experimentally. Detailed expressions for the previous equations, as functions of the location estimates for the camera, the 2D segment and the 3D segment, are available in Appendix D.
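A sketch of the structure-from-camera-location step: given the linearized terms of Eq. (11) for every view of one 3D segment, the correction p_S is the weighted least-squares solution. The function and argument names are ours, and the measurement covariances are assumed to be already projected through G, i.e. R_k = G_k R'_k G_k^T.

```python
import numpy as np

def estimate_segment_correction(H_list, z_list, R_list):
    """Weighted least-squares correction p_S for one 3D segment (Eq. (11)).

    Each view k contributes z_k = H_k p_S + G_k v with v ~ N(0, R'_k); here
    R_list already holds the projected measurement covariances R_k.
    """
    dim = H_list[0].shape[1]          # number of free d.o.f. of the 3D segment
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for H, z, R in zip(H_list, z_list, R_list):
        W = np.linalg.inv(R)          # weight = inverse measurement covariance
        A += H.T @ W @ H
        b += H.T @ W @ z
    return np.linalg.solve(A, b)      # normal equations of the weighted LMSE
```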
5. Initial seed for the two images problem

The main disadvantage of non-linear optimization methods is that they need a good initial guess in order to converge to the absolute minimum of the goal function. This paper focuses on the two images problem. In order to find the minimum of Eq. (10), we compute the initial seed by sampling the parameter space as proposed by Zhang in Ref. [2]. The whole minimization method can be summarized as follows:

(1) Sample the parameter space. The sample points should be approximately uniformly distributed over the parameter space.
(2) Evaluate the goal function (10) at every sample point.
(3) Keep the samples that produced the lowest residue (in our case we kept 30 samples).
(4) Optimize Eq. (10) with a classical optimization method (in our case Levenberg-Marquardt) using each of the kept samples as an initial guess. The minimum is then selected.

The rest of this section is devoted to the way in which the parameter space is sampled. Considering two images, the parameter to be optimized is the location of the second camera with respect to the first, i.e. x_{C^1 C^2}. Due to the translation scale factor, only the translation direction should be considered. Therefore, only five optimization parameters are considered:

- a_T, b_T, the two angles that define the translation direction;
- a_R, b_R, c; the first two angles define the rotation direction, and c defines the rotation magnitude.

From a_T, b_T, a_R, b_R and c, the value for x_{C^1 C^2} is straightforward. The rest of the section is devoted to the sampling of a_T, b_T, a_R, b_R and c. Irrespective of whether the direction angles a and b are for the rotation or for the translation, direction sampling is carried out using the directions defined by the icosahedron faces. The centres of the faces produce 20 directions. Each triangular face can be subdivided n times, as shown in Fig. 5, producing 20 x 4^n directions. The c sampling can be carried out easily by uniformly sampling the rotation interval [-a, a]. We consider the [-15°, 15°] interval and a sampling rate of 2°. As Zhang proposed [2], only half of the sampled translation directions need to be tested, since changing the translation sign leads to a solution with the structure sign changed, which is geometrically equivalent. As a result, it is sufficient to try half of the translation samples and then test whether the computed scene is behind or in front of the cameras. If it is behind, it has no physical meaning, and the real solution is obtained by changing the translation and scene signs.
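The direction sampling can be sketched as follows. The icosahedron vertex coordinates are the standard golden-ratio construction (an assumption, since the paper does not give them) and the subdivision follows Fig. 5; all names are illustrative.

```python
import itertools
import numpy as np

PHI = (1 + 5 ** 0.5) / 2

def icosahedron_faces():
    """Golden-ratio icosahedron: 12 unit vertices, 20 triangular faces."""
    v = [(-1, PHI, 0), (1, PHI, 0), (-1, -PHI, 0), (1, -PHI, 0),
         (0, -1, PHI), (0, 1, PHI), (0, -1, -PHI), (0, 1, -PHI),
         (PHI, 0, -1), (PHI, 0, 1), (-PHI, 0, -1), (-PHI, 0, 1)]
    v = [np.array(p, float) / np.linalg.norm(p) for p in v]
    edge = min(np.linalg.norm(a - b) for a, b in itertools.combinations(v, 2))
    faces = []
    # the 20 faces are exactly the vertex triples whose sides all equal the edge length
    for a, b, c in itertools.combinations(v, 3):
        if (np.linalg.norm(a - b) < edge * 1.01 and
                np.linalg.norm(b - c) < edge * 1.01 and
                np.linalg.norm(a - c) < edge * 1.01):
            faces.append((a, b, c))
    return faces

def subdivide(faces, n):
    """Split every triangle into 4 by its edge midpoints, n times (Fig. 5)."""
    for _ in range(n):
        new = []
        for a, b, c in faces:
            ab, bc, ca = (a + b) / 2, (b + c) / 2, (c + a) / 2
            new += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
        faces = new
    return faces

def sample_directions(n=1):
    """20 * 4**n unit directions given by the face centres."""
    centres = [(a + b + c) / 3 for a, b, c in subdivide(icosahedron_faces(), n)]
    return [c / np.linalg.norm(c) for c in centres]

# rotation magnitude samples c, Section 5: 2 deg steps over [-15 deg, 15 deg]
c_samples = np.arange(-15.0, 15.0 + 1e-9, 2.0)
```

With n = 1 this produces 80 directions per angle pair, matching the 20 x 4^n count used for the experiments in Section 6.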
6. Experimental results

This section is devoted to showing the performance of the proposed pairing constraint in recovering the structure and motion from two real images. We focus on the ability to obtain a solution as a function of the \kappa value (see Eq. (5)). It is shown that weak pairing between the midpoints, i.e. \kappa > 0.1, performed well, whereas strong midpoint pairing (\kappa < 0.01) performed badly. This confirms that tight pairing between the midpoints is not correct, as stated by several authors [2,3]. However, if deviations of the midpoint matching are allowed along the image segment directions, the correct result can be achieved. Fig. 6 shows the two images (512 x 500 x 8, B&W) that were used as input. Corresponding image segments are labelled with the same numbers. These images
Fig. 5. Subdivision of an icosahedron face in order to obtain more direction samples.
Fig. 6. Input image pair for structure and motion computation. Equal numbers label corresponding image segments.
came from a trinocular rig, and, as a result, it was possible to use the camera calibration as a ground truth solution for the computed motion. The camera calibrations were computed using the Tsai method [10]. The camera focal length was 6 mm, and the lens radial distortion was compensated for in the extracted image segment locations. Image segments were detected with the Burns algorithm [11]. Segments shorter than 15 pixels, or those with a grey level gradient smaller than 20 grey levels per pixel, were removed. The matches were computed by a stereo program [7]; spurious matches were removed manually. A total of 111 matches were retained after removing three spurious ones. The tuning parameters for the initial seed algorithm were: range for the rotation angles [-15°, 15°], and rotation sampling step 2°. Both the initial orientation and the translation directions were computed from the icosahedron faces with one subdivision (n = 1). As a result, 48 000 initial samples were used. The number of retained seeds
Fig. 7. Optimization residue (log scale), orientation error (deg) and translation error (deg) with respect to the \kappa value. The dotted horizontal line represents the 95% \chi^2 value with 106 d.o.f.
Table 1
Computed camera location compared with the ground truth solution. \hat{x}_{C^1 C^2} is the camera 2 location with respect to camera 1 (translation normalized to a unit vector, angles in deg). The error with respect to the ground truth is the relative location vector between the ground truth and the computed \hat{x}_{C^1 C^2}.

\kappa          Estimated \hat{x}_{C^1 C^2}                  Error wrt ground truth                       tr. error (deg)   ori. error (deg)
Ground truth    (0.82, -0.56, 0.15, -3.58, -8.86, 0.71)^T    (0, 0, 0, 0, 0, 0)^T                         0                 0
1               (0.81, -0.54, 0.22, -4.43, -10.02, 0.74)^T   (0.01, 0.02, 0.07, -0.84, -1.16, -0.04)^T    4.3               1.2
10              (0.82, -0.54, 0.21, -4.13, -9.56, 0.73)^T    (0.01, 0.02, 0.06, -0.54, -0.69, -0.03)^T    3.7               0.7
10^{-1}         (0.76, -0.59, 0.26, -0.90, -4.68, 1.33)^T    (-0.03, -0.04, 0.12, 2.79, 4.13, 0.87)^T     7.5               4.2
was 30. The parameters for the covariances used in the optimization were \sigma_{cc} = 2 pixels and \sigma_{nc} = 1 pixel, used to tune the image segment location noise (6). Fig. 7 shows the optimization residue, the orientation error and the translation error with respect to the \kappa value. A logarithmic scale is used for \kappa and for the residue. The translation error is the angle (deg) between the ground
truth and the computed translation. The orientation error is the angle (deg), around some axis, required to align the ground truth and the computed camera frames. Table 1 shows a summary of these values. Focusing on Fig. 7, it can be seen that there are two regions: weak midpoint pairing (\kappa > 0.1) and strong midpoint pairing (\kappa < 0.1). In the weak pairing region,
Fig. 8. Reconstruction with \kappa = 1. (a) and (b) show the reconstruction reprojected on images 1 and 2. (c) and (d) show the reconstruction in top and general views.
the translation and rotation errors were small and the reconstructions were good. Figs. 8 and 9 show the reconstruction and its backprojection in the images for the
weak pairing case. Note that in both cases the reconstructions are very similar despite the difference in the \kappa value. The similarity of the backprojections is even greater, and it
Fig. 9. Reconstruction with \kappa = 10. (a) and (b) show the reconstruction reprojected on images 1 and 2. (c) and (d) show the reconstruction in top and general views.
can be seen that they are nearly equal. In the strong pairing region, the orientation and translation errors were bigger and the reconstruction had no meaning at all. The residue after optimization was nearly constant in the strong pairing region. However, in the weak pairing region it was reduced as \kappa increased. With strong midpoint pairing, the results had no meaning because the midpoints are not correspondent. The residue is dominated by the component orthogonal to the segment direction. With weak pairing, however, deviations along the image segment were allowed. By increasing \kappa, it was found that the solution remained approximately the same but the total residue was reduced, because the weights of all the deviations were reduced by the same proportion.

The reason why the solution could be obtained for any value \kappa > 0.1 can be explained as follows. If only the infinite supporting lines (infinite \kappa) were used to solve the problem, every camera location would be possible and the residue would always be zero. Considering midpoint matching, we have found that the residue mainly comes from the deviations of the midpoints. Once the midpoint matching becomes weak, only deviations along the segment directions are allowed. Consequently, if \kappa is increased, all the weighted residues decrease and the total residue decreases, but the solution remains approximately the same.

As stated in Section 4, the residue after optimization should follow, for 111 image segments in two cameras, a \chi^2 distribution with 106 d.o.f. Fixing a 95% confidence region, the \chi^2 value is 131 (note that Fig. 7 plots the residue on a logarithmic scale, so 131 is represented as 2.11). It is shown in Fig. 7 that the transition between strong and weak pairing is related to the intersection of the residue with this \chi^2 value.
7. Conclusions

The finite length of straight segments produces constraints stronger than if they were considered only as their infinite supporting lines. In this paper, it has been proposed that the image segment midpoints should be considered as correspondent, but given a lower weight in the image segment direction. The experimental results showed that the proposed constraint can be used to recover the structure and the camera motion from straight segment correspondences using only two images. This shows that the constraint is stronger than if only infinite lines were used. This result, together with Refs. [2] and [3], confirms the importance of considering the finite length of the segments.

The proposed image segment model and the weak midpoint correspondence have also been used to compute corresponding image segments and the structure when the camera location is known [7]. Consequently, the
proposed model uniquely combines both the "point" and "line" properties of straight segments. This model can be applied to sequences of images.
Acknowledgements

This work has been supported by CICYT-TAP970992-C02-01. Software for simulation and visualization was developed by Jaime Segura and D. Berna Sanjuán.
Appendix A: Transformations

The locations of the references are expressed as transformations. There are two mathematical representations for the transformation t_{WE}: a six-component location vector x_{WE}, and a homogeneous matrix H_{WE}:

x_{WE} = (x_{WE}, y_{WE}, z_{WE}, \phi_{WE}, \theta_{WE}, \psi_{WE})^\top,

H_{WE} = \begin{pmatrix} n_x & o_x & a_x & p_x \\ n_y & o_y & a_y & p_y \\ n_z & o_z & a_z & p_z \\ 0 & 0 & 0 & 1 \end{pmatrix}.

The location vector form is well suited for theoretical discussion and for covariance assignment. However, mathematical operations such as composition, inversion or derivation are better expressed using the homogeneous matrix. The conversion between them is given by

H_{WE} = \begin{pmatrix}
C\psi C\theta & C\psi S\theta S\phi - S\psi C\phi & C\psi S\theta C\phi + S\psi S\phi & x_{WE} \\
S\psi C\theta & S\psi S\theta S\phi + C\psi C\phi & S\psi S\theta C\phi - C\psi S\phi & y_{WE} \\
-S\theta & C\theta S\phi & C\theta C\phi & z_{WE} \\
0 & 0 & 0 & 1
\end{pmatrix},   (A.1)

where C and S stand for cos(\cdot) and sin(\cdot), respectively, and

x_{WE} = \begin{pmatrix} p_x \\ p_y \\ p_z \\ atan2(o_z, a_z) \\ atan2(-n_z, \sqrt{n_x^2 + n_y^2}) \\ atan2(n_y, n_x) \end{pmatrix}.   (A.2)
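A sketch of the conversions (A.1) and (A.2), under the roll-pitch-yaw reading reconstructed above (phi about x, theta about y, psi about z); the function names are ours.

```python
import numpy as np

def vec_to_hom(x):
    """Location vector (x, y, z, phi, theta, psi) -> 4x4 homogeneous matrix (Eq. (A.1))."""
    px, py, pz, phi, theta, psi = x
    cf, sf = np.cos(phi), np.sin(phi)      # roll  (about x)
    ct, st = np.cos(theta), np.sin(theta)  # pitch (about y)
    cp, sp = np.cos(psi), np.sin(psi)      # yaw   (about z)
    R = np.array([[cp * ct, cp * st * sf - sp * cf, cp * st * cf + sp * sf],
                  [sp * ct, sp * st * sf + cp * cf, sp * st * cf - cp * sf],
                  [-st,     ct * sf,                ct * cf]])
    H = np.eye(4)
    H[:3, :3], H[:3, 3] = R, (px, py, pz)
    return H

def hom_to_vec(H):
    """4x4 homogeneous matrix -> location vector (Eq. (A.2))."""
    n, o, a, p = H[:3, 0], H[:3, 1], H[:3, 2], H[:3, 3]
    phi = np.arctan2(o[2], a[2])
    theta = np.arctan2(-n[2], np.hypot(n[0], n[1]))
    psi = np.arctan2(n[1], n[0])
    return np.array([p[0], p[1], p[2], phi, theta, psi])
```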
Appendix B: Image normalization Jacobian

N is the Jacobian matrix of the transformation that converts the image segment location in pixels into the image segment location in the normalized camera (see Eq. (4)). Its entries are functions of \hat{\psi}_{CP}, defined in Eq. (3), of the intrinsic camera parameters 1/a_u and 1/a_v (the pixel sizes in the x and y directions, expressed in mm), and of the auxiliary angle

\hat{\psi}_{MP} = atan2(a_v \sin\hat{\psi}_{CP}, a_u \cos\hat{\psi}_{CP}).

Appendix C: 2D segment definition

The 2D segment location with respect to the camera frame is expressed as

\hat{x}_{CD} = (0, 0, 0, \hat{\phi}_{CD}, \hat{\theta}_{CD}, \hat{\psi}_{CD})^\top,

where

\hat{\phi}_{CD} = atan2(o_z, a_z),   \hat{\theta}_{CD} = atan2(n_z, \sqrt{n_x^2 + n_y^2}),   \hat{\psi}_{CD} = atan2(n_y, n_x),   (C.1)

and n, o and a are the columns of the rotation matrix of H_{CD}. Their components are given in (C.2) as functions of \hat{x}_{CP}, \hat{y}_{CP} and \hat{\psi}_{CP}, taken from the image segment location with respect to the camera frame (3); in particular,

o_z = -1 / \sqrt{1 + \hat{x}_{CP}^2 + \hat{y}_{CP}^2},
a_z = (\hat{x}_{CP} \sin\hat{\psi}_{CP} - \hat{y}_{CP} \cos\hat{\psi}_{CP}) / \sqrt{1 + (\hat{x}_{CP} \sin\hat{\psi}_{CP} - \hat{y}_{CP} \cos\hat{\psi}_{CP})^2},   (C.2)

the remaining components being combinations of \sin\hat{\psi}_{CP} and \cos\hat{\psi}_{CP} weighted by \hat{x}_{CP} and \hat{y}_{CP} and divided by the norms \|n\| and \|o\| = \sqrt{1 + \hat{x}_{CP}^2 + \hat{y}_{CP}^2}.

The corresponding covariance is defined as a function of the K_{DP} matrix,

C_D = K_{DP} C_P K_{DP}^\top,   (C.3)

whose entries are functions of

\hat{y}_{DP} = -\sqrt{\hat{x}_{CP}^2 + \hat{y}_{CP}^2 + 1},   \hat{\phi}_{DP} = atan2(1, a_z),   \hat{\psi}_{DP} = atan2(o_x, n_x).

Appendix D: Measurement equation

The detailed expressions for the matrices and vectors used in the linearizations are

f = (\hat{x}_{DS},   \hat{z}_{DS},   atan2(-n_{DSz}, \sqrt{n_{DSx}^2 + n_{DSy}^2}))^\top,   (D.1)
The Jacobian G with respect to the measurement noise has as its first three columns (the derivatives with respect to the 2D segment perturbation) the matrix

\begin{pmatrix} 0 & -\hat{z}_{DS} & \hat{y}_{DS} \\ -\hat{y}_{DS} & \hat{x}_{DS} & 0 \\ \sin\hat{\psi}_{DS} & -\cos\hat{\psi}_{DS} & 0 \end{pmatrix},

already given in Section 4; the columns corresponding to the camera perturbation involve the components of n_{DS}, o_{DS} and a_{DS}, and \cos\hat{\phi}_{DS} and \sin\hat{\phi}_{DS}. The Jacobian with respect to the 3D segment perturbation is

H = \begin{pmatrix}
-n_{CDx} & -n_{CDy} & -n_{CDz} & n_{CDy}\hat{z}_{CS} - n_{CDz}\hat{y}_{CS} & -n_{CDx}\hat{z}_{CS} + n_{CDz}\hat{x}_{CS} & n_{CDx}\hat{y}_{CS} - n_{CDy}\hat{x}_{CS} \\
-a_{CDx} & -a_{CDy} & -a_{CDz} & a_{CDy}\hat{z}_{CS} - a_{CDz}\hat{y}_{CS} & -a_{CDx}\hat{z}_{CS} + a_{CDz}\hat{x}_{CS} & a_{CDx}\hat{y}_{CS} - a_{CDy}\hat{x}_{CS} \\
0 & 0 & 0 & -o_{CSx}\cos\hat{\phi}_{DS} + a_{CSx}\sin\hat{\phi}_{DS} & -o_{CSy}\cos\hat{\phi}_{DS} + a_{CSy}\sin\hat{\phi}_{DS} & -o_{CSz}\cos\hat{\phi}_{DS} + a_{CSz}\sin\hat{\phi}_{DS}
\end{pmatrix}.   (D.2)

The previous expressions are given as functions of the homogeneous matrices H_{CD}, H_{DS} and H_{CS}. These matrices can be computed directly from the locations estimated for the 2D segment, H_{CD}, the camera, H_{WC}, and the 3D segment, H_{WS}.

References

[1] T.S. Huang, A.R. Netravali, Motion and structure from feature correspondences: a review, Proc. IEEE 82 (2) (1994) 252-268.
[2] Z. Zhang, Estimating motion and structure from correspondences of line segments between two perspective images, IEEE Trans. Pattern Anal. Mach. Intell. 17 (12) (1995) 1129-1139.
[3] C.J. Taylor, D.J. Kriegman, Structure and motion from line segments in multiple images, IEEE Trans. Pattern Anal. Mach. Intell. 17 (11) (1995) 1021-1032.
[4] J.D. Tardós, Representing partial and uncertain sensorial information using the theory of symmetries, IEEE International Conference on Robotics and Automation, Nice, France, May 1992, pp. 1799-1804.
[5] X. Pennec, J.P. Thirion, Validation of 3-D registration methods based on points and frames, Fifth International Conference on Computer Vision, MIT, USA, 1995, pp. 557-562.
[6] Z. Zhang, O. Faugeras, A 3D world model builder with a mobile robot, Int. J. Robotics Res. 11 (4) (1992) 269-284.
[7] J.M.M. Montiel, L. Montano, Probabilistic structure from camera location using straight segments, Image Vision Comput. 17 (3) (1999) 263-279.
[8] J.M. Martínez Montiel, L. Montano, The effect of the image imperfections of a segment on its orientation uncertainty, Seventh International Conference on Advanced Robotics, Spain, September 1995, pp. 156-162.
[9] J. Weng, T.S. Huang, N. Ahuja, Motion and Structure from Image Sequences, Springer, Heidelberg, 1993.
[10] R.Y. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE J. Robotics Automat. RA-3 (4) (1987) 323-344.
[11] J.B. Burns, A.R. Hanson, E.M. Riseman, Extracting straight lines, IEEE Trans. Pattern Anal. Mach. Intell. 8 (4) (1986) 425-455.
About the Author: DR. J.M. MARTÍNEZ MONTIEL received the Ph.D. in Systems Engineering and Computer Science from the University of Zaragoza in 1996. At present he is an assistant lecturer at the University of Zaragoza. His current interests are computer vision based on discrete features, multisensor fusion, map building, stereo vision and robust methods.

About the Author: DR. JUAN D. TARDÓS received the M.S. and Ph.D. degrees in Industrial-Electrical Engineering from the University of Zaragoza, Spain, in 1985 and 1991, respectively. He is Associate Professor in the Departamento de Informática e Ingeniería de Sistemas, University of Zaragoza, where he is in charge of courses in Real Time Systems and Computer Vision. His current research interests include sensor integration, robotics, and real time systems.

About the Author: DR. LUIS MONTANO received the Ph.D. in Systems Engineering and Computer Science from the University of Zaragoza in 1987. At present he is the Head of the Department of Computer Science and Systems Engineering of the University of Zaragoza. His current interests are computer vision based on discrete features, robot programming and control, and mobile robot navigation.
Pattern Recognition 33 (2000) 1309–1323
Recognition of printed Arabic text based on global features and decision tree learning techniques
Adnan Amin
School of Computer Science and Engineering, University of New South Wales, 2052 Sydney, Australia
Received 29 April 1998; accepted 4 May 1999
Abstract

Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little progress has been achieved towards the automatic recognition of Arabic characters, in both the on-line and the off-line case. This is a result of the lack of adequate support in terms of funding and other utilities, such as Arabic text databases, dictionaries, etc., and of course of the cursive nature of Arabic writing; the problem is still an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantages of machine learning are twofold: it can generalize over the large degree of variation between different fonts and writing styles, and recognition rules can be constructed from examples. The technique can be divided into three major steps. The first step is digitization and pre-processing: creating connected components, detecting the skew of a document image and correcting it. Second, feature extraction, where global features of the input Arabic word are used, such as the number of subwords, the number of peaks within a subword, and the number and position of the complementary characters, to avoid the difficulty of a segmentation stage. Finally, the machine learning system C4.5 is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words in different fonts (each word has 15 samples) and the correct average recognition rate obtained using cross-validation was 92%. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Pattern recognition; Printed Arabic text; Connected component; Skew detection and correction; Global features; Structural classification; Machine learning C4.5; Cross-validation
1. Introduction

For the past three decades, there has been increasing interest among researchers in problems related to the machine simulation of the human reading process. Intensive research has been carried out in this area, with a large number of technical papers and reports in the literature devoted to character recognition. This subject has attracted immense research interest not only because of the very challenging nature of the problem, but also because it provides the means for automatic processing of large volumes of data in postal code reading [1,2], office auto-
E-mail address: [email protected] (A. Amin).
mation [3,4], and other business and scientific applications [5-7]. Much more difficult, and hence more interesting to researchers, is the ability to automatically recognize handwritten characters [8-10]. The complexity of the problem is greatly increased by noise and by the almost infinite variability of handwriting, as a result of the mood of the writer and the nature of the writing. Analyzing cursive script requires the segmentation of characters within the word and the detection of individual features. This is not a problem unique to computers; even human beings, who possess the most efficient optical reading device (eyes), have difficulty in recognizing some cursive scripts and have an error rate of about 4% in reading tasks in the absence of context [11].
Different approaches covered under the general term "character recognition" fall into either the on-line or the off-line category, each having its own hardware and recognition algorithms. In on-line character recognition systems, the computer recognizes the symbols as they are drawn [12-16]. The most common writing surface is the digitizing tablet, which operates through a special pen in contact with the surface of the tablet and emits the coordinates of the plotted points at a constant frequency. Breaking contact prompts the transmission of a special character. Thus, recording on the tablet produces strings of coordinates separated by signs indicating when the pen has ceased to touch the tablet surface. On-line recognition has several interesting characteristics. First, recognition is performed on one-dimensional data rather than two-dimensional images, as in the case of off-line recognition. The writing line is represented by a sequence of dots whose location is a function of time. This has several important consequences:
- The writing order is available and can be used by the recognition process.
- The writing line has no width.
- Temporal information, like velocity, can also be taken into consideration.
- Additionally, penlifts can be useful in the recognition process.
Off-line recognition is performed after writing or printing is completed. Optical Character Recognition, OCR [16-20], deals with the recognition of optically processed characters rather than magnetically processed ones. In a typical OCR system, input characters are read and digitized by an optical scanner. Each character is then located and segmented, and the resulting matrix is fed into a preprocessor for smoothing, noise reduction and size normalization. Off-line recognition can be considered the most general case: no special device is required for writing, and signal interpretation is independent of signal generation, as in human recognition. Many papers have been concerned with the recognition of Latin, Chinese and Japanese characters. However, although almost a third of a billion people worldwide, in several different languages, use Arabic characters for writing, little research progress has been achieved towards the automatic recognition of Arabic characters, in both the on-line and the off-line case. This is a result of the lack of adequate support in terms of funding and other utilities, such as Arabic text databases, dictionaries, etc., and of course of the cursive nature of Arabic writing rules. There are two strategies which have been applied to printed and handwritten Arabic character recognition. These can be categorized as follows: (i) Holistic strategies, in which the recognition is globally performed on the whole representation of
words, and where there is no attempt to identify characters individually. These strategies were originally introduced for speech recognition and fall into two categories: (a) methods based on distance measurements using dynamic programming [21,22]; (b) methods based on a probabilistic framework (hidden Markov models) [23-28]. (ii) Analytical strategies, in which words are not considered as a whole, but as sequences of small-size units, and the recognition is not directly performed at word level but at an intermediate level dealing with these units, which can be graphemes, segments, pseudo-letters, etc. [29-32]. Surveys on Arabic recognition can be found in Refs. [33-36].

Symbolic machine learning algorithms are designed to accept example descriptions in the form of feature vectors which include a label that identifies the class to which an example belongs. The output of the algorithm is a set of rules that classifies unseen examples based on generalizations from the training set. This ability to generalize is the main attraction of machine learning for handwriting recognition. Samples of a character can be preprocessed into a feature vector representation for presentation to a machine learning algorithm that creates rules for recognizing characters of the same class. Symbolic machine learning has several advantages over other learning methods: it is fast in training and in recognition, generalizes well, is noise tolerant, and the symbolic representation is easy to understand. This paper proposes the use of global methods to extract features for classifying and recognizing Arabic words using the machine learning system C4.5 [37,38] to generate a decision tree. This decision tree can then be used to predict the class of an unseen character. Fig. 1 depicts a block diagram of the system.
2. General characteristics of Arabic writing

A comparison of the various characteristics of the Arabic, Latin, Hebrew and Hindi scripts is outlined in Table 1. Arabic is written from right to left. Arabic text (machine printed or handwritten) is cursive in general, and Arabic letters are normally connected on the baseline. This feature of connectivity will be shown to be important in the segmentation process. Some machine printed and handwritten texts are not cursive, but most Arabic texts are, and thus it is not surprising that the recognition rate of Arabic characters is lower than that of disconnected characters such as printed English. Arabic writing is similar to English in that it uses letters (it consists of 29 basic letters), numerals,
Fig. 2. Different shapes of the Arabic letter A'in: (a) beginning, (b) middle, (c) end, (d) isolated.
Fig. 1. Block diagram of the system.
Table 1
Comparison of various scripts

Characteristics                    Arabic    English   Hebrew    Hindi
Justification                      R-to-L    L-to-R    R-to-L    L-to-R
Cursive                            Yes       No        No        Yes
Diacritics                         Yes       No        No        Yes
Number of vowels                   2         5         11        *
Letter shapes                      1-4       2         1         1
Number of letters in alphabet      29        26        22        40
Complementary characters           3         *         *         *
punctuation marks, as well as spaces and special symbols. It differs from English, however, in its representation of vowels, since Arabic utilizes various diacritical markings. The presence or absence of vowel diacritics indicates different meanings in what would otherwise be the same word. For example, the same written form is the Arabic word for both "school" and "teacher". If the word is isolated, diacritics are essential to distinguish between the two possible meanings. If it occurs in a sentence, contextual information inherent in the sentence can be used to infer the appropriate meaning. In this paper, the issue of vowel diacritics is not treated, since it is more common for Arabic writing not to employ these diacritics. Diacritics are only found in old manuscripts or in very specific areas.

The Arabic alphabet is represented numerically by a standard communication interchange code approved by the Arab Standard and Metrology Organization (ASMO). Similar to the American Standard Code for Information Interchange (ASCII), each character in the
ASMO code is represented by one byte. An English letter has two possible shapes, upper and lower case. The ASCII code provides separate representations for both of these shapes, whereas an Arabic letter has only one representation in the ASMO table. This is not to say, however, that the Arabic letter has only one shape. On the contrary, an Arabic letter may have up to four different shapes, depending on its relative position in the text. For instance, the letter A'in has four different shapes: at the beginning of the word (preceded by a space), in the middle of the word (no space around it), at the end of the word (followed by a space), and in isolation (preceded by an unconnected letter and followed by a space). These four possibilities are exemplified in Fig. 2. Table 2 shows the different shapes of the Arabic characters at the different positions in the word.

In addition, different Arabic characters may have exactly the same shape, and are distinguished from each other only by the addition of a complementary character. These are normally a dot, a group of dots, or a zigzag (hamza). This may appear on, above, or below the baseline, and can be positioned differently, for instance, above, below or within the confines of the character. Fig. 3 depicts two sets of characters, the first set having five characters and the other set three characters. Clearly, each set contains characters which differ only by the position and/or the number of dots associated with them. It is worth noting that any erosion or deletion of these complementary characters may result in a misrepresentation of the character. Hence, any thinning algorithm needs to deal efficiently with these dots so as not to eliminate them and change the identity of the character.

Arabic writing is cursive, and words are separated by spaces. However, a word can be divided into smaller units called subwords. Some Arabic characters are not connectable with the succeeding character. Therefore, if one of these characters exists in a word, it divides that word into two subwords. These characters appear only at the tail of a subword, and the succeeding character forms the head of the next subword. Fig. 4 shows three Arabic words with one, two, and five
(Complementary character: a portion of a character that is needed to complement an Arabic character. Subword: a portion of a word including one or more connected characters.)
Table 2
The basic alphabet of Arabic characters and their shapes at different positions in the word
subwords. The "rst word consists of one subword which has nine letters; the second has two subwords with three and one letter, respectively. The last word contains "ve subwords, each consisting of only one letter. In general, Arabic writing can be classi"ed into typewritten (Naskh), handwritten (Ruq'a) and artistic (or decorative Calligraphy, Ku", Diwani, Royal, and Thuluth) styles as shown in Fig. 5. Handwritten and decorative
styles usually include vertical combinations of short strokes called ligatures. This feature makes it difficult to determine the boundaries of the characters. Furthermore, characters of the same font have different sizes (i.e. characters may have different widths even though they have the same font and point size). Hence, word segmentation based on a fixed width cannot be applied to Arabic.
Fig. 3. Arabic characters differing only with regard to the position and number of associated dots.
3. Digitization and preprocessing

The first phase in our character recognition system is digitization. Documents to be processed are first scanned and digitized. The algorithm adopted in this paper is similar to that which appears in Ref. [39]. A 300 dpi scanner is used to digitize the image in this phase, and the output is a standard binary formatted image (PBM format).

3.1. Connected component analysis

After the digitization is completed, the connected components must be determined. Connected components are rectangular boxes bounding regions of connected black pixels. The objective of the connected component stage is to form rectangles around distinct components on the page, whether they be characters or images. These bounding rectangles then form the skeleton for all future analysis of the page. The algorithm used to obtain the connected components is a simple iterative procedure which compares successive scanlines of an image to determine whether black pixels in any pair of scanlines are connected together. Bounding rectangles are extended to enclose any groupings of connected black pixels between successive scanlines [40]. Fig. 6 demonstrates this procedure. Each scanline in Fig. 6 is 14 pixels wide (a pixel is represented by a small rectangular block); the bounding rectangles in Fig. 6(a) just enclose the black pixels of that scanline, but for each successive scanline the bounding boxes grow to include the black pixels connected to the previous scanline. Fig. 6(c) also shows that a bounding box stops growing in size only when there are no more black pixels on the current scanline joined onto black pixels of the previous scanline.

3.2. Grouping

After the cc's have been determined, the next step is to group neighboring connected components of similar
Fig. 5. Different styles and fonts for the writing of Arabic text.
Fig. 4. Arabic words with constituent subwords.
dimensions (Fig. 7). All the cc's of a document fall into one of three categories, noise, small or large, depending on their size. The noise cc's are removed from any further skew calculation, and the other cc's are merged according to the category they fall under. Merging requires the use of a prefixed threshold (different for each category) so as to provide a means of determining the neighborhood of a group. The grouping algorithm takes one cc at a time and tries to merge it into a group from a set of existing groups. If it succeeds, the group's dimensions are altered so as to cater for the new cc, that is, the group's encompassing rectangle is expanded so as to accommodate the new cc along with the existing cc's already forming the group. A possible merge is found when the cc and a group (both of the same category) are in close proximity to each other. When a cc is found to be near a group, its distance from each cc in the group is checked until one is found that is within the predefined threshold distance or all cc's have been checked. If such a cc is found, then the cc is defined to be in the neighborhood of the group and is merged with that group. If the cc cannot be merged with any of the existing groups, then a new group is formed with its sole member being the cc which caused its creation. Fig. 8 demonstrates the conditions necessary for a cc to be merged into an existing group. As can be seen in Fig. 8, the small cc (hashed) is merged into the left group and the group dimensions increase to accommodate the new cc. Both larger cc's are near the group of large cc's, but only the lower one is close to an individual cc. Therefore, a new group is created for the upper cc, and the lower one is merged into that group. As can be seen from Fig. 8(a), the neighborhood of a group (represented by the grey shaded region surrounding each group) is the extent of the region of influence which the group possesses over a cc. In other words, if the cc lies within the group's neighborhood it is merged into the group (the dashed rectangles in the group represent
Fig. 6. The process of building connected components from image scanlines.
previous cc's that have been successfully merged into the group). If the cc does not fall within the group's neighborhood, it is not merged into the group. Furthermore, cc's from different categories are not merged together; thus, as can be seen from Fig. 8(a), the large (hashed) cc's are not merged with the group to the right (even though they fall in its region of influence) since they are much larger than the cc's which comprise that group.

3.3. Skew estimation
To estimate the skew angle for each group, we divide each group into vertical segments of approximately the width of one connected component and store only the bottom rectangle in each segment [41]. This process allows us to store primarily the bottom row of text in each group, with other random connected components where there are gaps in the bottom row. We then apply the Hough transform to the point at the centre of each rectangle, mapping the point from the (x, y) domain to a curve in the (\rho, \theta) domain according to Eq. (1):

\rho = x \cos\theta + y \sin\theta,   for 0 \le \theta < \pi.
(1)
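A minimal sketch of this accumulation step for the bottom-rectangle centres is given below; the bin resolutions are illustrative choices, not values from the paper.

```python
import numpy as np

def hough_peak(points, theta_res_deg=0.5, rho_res=2.0):
    """Accumulate Eq. (1) for the given (x, y) points and return the
    (rho, theta) cell crossed by the most curves, i.e. the dominant line."""
    thetas = np.deg2rad(np.arange(0.0, 180.0, theta_res_deg))
    pts = np.asarray(points, dtype=float)
    rho_max = np.abs(pts).sum(axis=1).max() + rho_res   # bound on |rho|
    n_rho = int(2 * rho_max / rho_res) + 3
    acc = np.zeros((n_rho, thetas.size), dtype=int)
    for x, y in pts:
        rho = x * np.cos(thetas) + y * np.sin(thetas)     # Eq. (1)
        idx = np.round((rho + rho_max) / rho_res).astype(int)
        acc[idx, np.arange(thetas.size)] += 1             # one vote per theta bin
    r, t = np.unravel_index(acc.argmax(), acc.shape)
    return r * rho_res - rho_max, thetas[t]
```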
The Hough transform has several interesting properties:
(1) Points in the (x, y) domain map to curves in the (\rho, \theta) domain.
(2) Points in the (\rho, \theta) domain map to lines in the (x, y) domain, where \rho is the perpendicular distance of the line from the origin, and \theta is the angle from the horizontal of the perpendicular line.
(3) Curves that cross at a common point in the (\rho, \theta) domain map to collinear points in the (x, y) domain [42].
When the Hough transform has been applied to each point, the resultant graph is examined to find the point at which the most curves cross. This point will correspond to a line of connected components, which should represent the bottom row of text. The extra connected components that were mapped into the (\rho, \theta) domain will not cross at the same point and are excluded from further calculation of the angle.

3.4. Skew calculation

If more than one curve crosses at this point in the graph, we then determine the slope of a straight line that best approximates these points, which represent the connected components of the bottom row of text, using the least-squares method. Least squares is a statistical method for finding the equation (line, quadratic, etc.) of best fit given a set of points. We determine the equation of best fit for a line

y = a + bx,
(2)
Fig. 7. An example of an Arabic text with connected components.
where the coefficients a and b are computed using the formulae [43]:

b = (n \sum_{i} x_i y_i - (\sum_{i} x_i)(\sum_{i} y_i)) / (n \sum_{i} x_i^2 - (\sum_{i} x_i)^2),   (3)

a = (\sum_{i} y_i - b \sum_{i} x_i) / n,   (4)

given that (x_i, y_i) is a point from a set of samples {(x_i, y_i) | i = 1, 2, ..., n}.

Fig. 8. Merging criteria of a cc into an existing group.

Analogously, the cc's associated with each line segment constitute the sample space. Once again these cc's are denoted as points, and noting that each line segment maintains a set of points, i.e. {(x_i, y_i) | i = 1, 2, ..., n}, the above equations are applied to determine the equation of the line of best fit for a given line segment. Given that the equation derived is of the form given in Eq. (2), we can determine the angle \alpha which the line makes with respect to the horizontal by using Eq. (5):

\alpha = tan^{-1}(b),   (5)

where b represents the gradient of the line. The slope of the calculated line gives the optimal skew angle for the group. After we calculate an angle for each group, we group them in sets of 0.5°, weighting each angle by the number of connected components used in its calculation. We then determine which partition contains the most angles, and average these values to determine the skew angle \alpha.

3.5. Skew correction

Once the skew angle has been determined, the image is skew corrected to lie horizontally on the page. The method was inspired by Paeth [44]. To calculate the correct value for a pixel at location (x, y) in the skew-corrected image, we determine the true original position (x_0, y_0) of the pixel using the formulae in Eq. (6):

x_0 = x \cos\alpha + y \sin\alpha,   y_0 = y \cos\alpha - x \sin\alpha,   (6)

where \alpha is the calculated skew angle of the image. Then, since (x_0, y_0) is generally non-integral, a weighted average of the surrounding pixels is calculated to determine the value of (x, y). If (x_0, y_0) lies outside the original image, the new pixel is given a value of white.

3.6. Detecting a loop

One common phenomenon that occurs in both printed and handwritten Arabic text is blotting. For a human, the blobs can easily be recognized as loops from the context. However, these loops are difficult to deal with and cause problems in automatic recognition systems. The basic idea of detecting a loop in a binary image is to move a window through the image from right to left, and top to bottom. If certain conditions are satisfied, then the pixels in the centre of the window are whitened. The size of the window is determined by the width of the baseline of the word. The algorithm for detecting a loop can be summarized as follows:

- Locate the baseline of the image: a horizontal projection of the image is performed. The rows which have the maximum number of pixels are taken as part of the baseline of the word. Any other adjacent rows with at least 85% of the pixels of these rows are also taken as part of the baseline. The horizontal projection of the Arabic word is shown in Fig. 9(b).
- Scan the binary image: a window of size 2H x 2H is used to scan the image sequentially from right to left and top to bottom. The value of the integer H is determined by the width of the baseline of the word. Furthermore, the window has even size; hence, the centre of the window consists of four pixels. These pixels are whitened if all of the following conditions are satisfied:
(a) all of the white pixels within the window are located on the first two outermost layers of the window;
(b) no group of white pixels is surrounded by black pixels within the window, i.e., no hole exists in the 2H x 2H window;
Fig. 9. (a) The input image, (b) horizontal projection of the input word, (c) result of the algorithm.
(c) the number of black pixels within the window is at least equal to 2H - 2.

4. Global word feature

In this project, we have used the second approach to extract the proper characteristics of Arabic characters. Seven types of global features have been extracted: the number of subwords, the number of peaks of each subword, the number of loops of each peak, the number and position of the complementary characters, and the height and width of each peak. The feature extraction algorithm can be summarized in the following steps:

Step 1: Loop detection. Loops are detected simply as being the inner contours (as shown in Fig. 10) obtained by running the contour tracing algorithm. The tracing algorithm traces the outer contour of an object and then traces the inner contours of the object.

Step 2: Determine the number of peaks within the subword. In all printed Arabic characters, the width at a connection point is much less than the width of the beginning character. Therefore, the baseline is a medium line in the Arabic word on which all the connections between successive characters take place. If a vertical projection of the bi-level pixels is performed on the word (Eq. (7)),

v(j) = \sum_{i} w(i, j),   (7)

where w(i, j) is either zero or one and i, j index the rows and columns, respectively, then the peak points will have a sum
Fig. 10. The inner and outer contour of an Arabic word.
greater than the average value (AV) given by Eq. (8):

AV = (1/N_c) \sum_{j=1}^{N_c} X_j,   (8)

where N_c is the number of columns and X_j is the number of black pixels in the jth column. Fig. 11 illustrates the segmentation of an Arabic word into peaks.

Step 3: Smooth the histogram (see Fig. 11(c)) by using an averaging scheme in which each point in the histogram is replaced by the average of itself and the two points on either side of it:

X_i = (X_{i-1} + X_i + X_{i+1}) / 3.   (9)

Step 4: Complementary characters. This feature plays a very important role in distinguishing between characters having the same shape. The complementary characters can be either a zigzag or a group of dots (1, 2 or 3). These can be above or below the baseline. Similar characters or subwords with dots have different meanings and pronunciations depending on whether the group of dots is below or above the baseline.

Step 5: Compute the height and width of each peak, whether large or small.
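The projection-based peak segmentation of Steps 2 and 3 (Eqs. (7)-(9)) can be sketched as follows, assuming a binary word image with ink pixels equal to one; all names are illustrative.

```python
import numpy as np

def peak_segments(word_img):
    """Split a binary (ink = 1) word image into peaks using Eqs. (7)-(9).

    Columns whose smoothed vertical projection exceeds the average value AV
    are grouped into runs; each run is returned as one peak (start, end)."""
    v = word_img.sum(axis=0).astype(float)           # Eq. (7): column sums
    av = v.mean()                                    # Eq. (8): average value
    sm = v.copy()
    sm[1:-1] = (v[:-2] + v[1:-1] + v[2:]) / 3.0      # Eq. (9): 3-point smoothing
    peaks, start = [], None
    for j, val in enumerate(sm):
        if val > av and start is None:
            start = j
        elif val <= av and start is not None:
            peaks.append((start, j - 1))
            start = None
    if start is not None:
        peaks.append((start, len(sm) - 1))
    return peaks
```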
Fig. 11. An example of the segmentation of a word into peaks: (a) Arabic word; (b), (c) histogram before and after smoothing.
5. Machine learning C4.5

C4.5 [37,38] is an efficient learning algorithm that creates decision trees to represent classification rules. The data input to C4.5 form a set of examples, each labeled according to the class to which it belongs. The description of an example is a list of attribute/value pairs. A node in a decision tree represents a test on a particular attribute. Suppose an object is described by its color, size and shape, where color may have the values red, green or blue; size may be large or small; and shape may be circle or square. If the root node of the tree is labeled color, then it may have three branches, one for each color value. Thus, if we wish to test an object's color and it is red, the object
descends the red branch. Leaf nodes are labeled with class names. Thus, when an object reaches a leaf node, it is classified according to the name of the leaf node. Building a decision tree proceeds as follows. The set of all examples forms an initial population. An attribute is chosen to split the population according to the attribute's values. Thus, if color is chosen, then all red objects descend the red branch, all green objects descend the green branch, etc. Now the population has been divided into sub-populations by color. For each sub-population, another attribute is chosen to split the sub-population. This continues as long as each sub-population contains a mix of examples belonging to different classes. Once a uniform population has been obtained, a leaf node is
created and labeled with the name of the class of the population.

The key to the success of a decision tree learning algorithm is the criterion used to select the attribute on which to split. If an attribute is a strong indicator of an example's class value, it should appear as early in the tree as possible. Most decision tree learning algorithms use a heuristic for estimating the best attribute. In C4.5, Quinlan uses a modified version of the entropy measure from information theory. For our purposes, it is sufficient to say that this measure yields a number between 0 and 1, where 0 indicates a uniform population and 1 indicates a population in which all classes are equally likely. The splitting criterion seeks to minimize the entropy.

A further refinement is required to handle noisy data. Real data sets often contain examples that are misclassified or which have incorrect attribute values. Suppose decision tree building has constructed a node which contains 99 examples from class 1 and only one example from class 2. According to the algorithm presented above, a further split would be required to separate the 99 from the one. However, the one exception may be misclassified, causing an unnecessary split. Decision tree learning algorithms have a variety of methods for "pruning" unwanted subtrees. C4.5 grows a complete tree, including nodes created as a result of noise. Following initial tree building, the program proceeds to select suspect subtrees and prune them, testing the new tree on a data set which is separate from the initial training data. Pruning continues as long as the pruned trees yield more accurate classifications on the test data.

The C4.5 system requires two input files, the names file and the data file. The names file contains the names of all the attributes used to describe the training examples and their allowed values. This file also contains the names of the possible classes. The classes are the Arabic words. The C4.5 data file contains the attribute values for the example objects in the format specified by the names file, where each example is completed by including the class to which it belongs. Every word is classified in terms of features such as: the number of subwords (up to five), the number of peaks of each subword (up to seven), the number of loops, the number and position of complementary characters within the peak (up to three), and the height and width of each peak (small or large), as illustrated in Fig. 12. Fig. 13 depicts the complete representation of the Arabic word shown in Fig. 11. The word contains only one subword and four peaks.
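C4.5 itself uses the gain ratio and error-based pruning; the sketch below shows only the entropy-based attribute selection on nominal attributes, with illustrative names, to make the splitting criterion concrete.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a class distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Reduction in entropy obtained by splitting on a nominal attribute.

    examples: sequence of dicts mapping attribute name to value."""
    base = entropy(labels)
    by_value = {}
    for ex, lab in zip(examples, labels):
        by_value.setdefault(ex[attr], []).append(lab)
    remainder = sum(len(sub) / len(labels) * entropy(sub)
                    for sub in by_value.values())
    return base - remainder

def best_attribute(examples, labels, attrs):
    """Attribute chosen for the next decision node (highest gain)."""
    return max(attrs, key=lambda a: information_gain(examples, labels, a))
```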
6. Experimental results

The first decision required in setting up a machine learning experiment is how the accuracy of the decision
Fig. 12. Segment encoding for the C4.5 machine learning.
Fig. 13. The complete representation of the word shown in Fig. 11 for the C4.5 machine learning.
Table 3
Error rate performance using 10-fold cross-validation

Fold       Error rate (% testing)
1          8.79
2          6.47
3          7.24
4          9.15
5          7.57
6          8.33
7          5.80
8          9.11
9          8.63
10         7.94
Average    7.90
Fig. 14. Samples of Arabic words used in experiments.
tree will be measured. The most reliable procedure is cross-validation. N-fold cross-validation refers to a testing strategy where the training data are randomly divided into N subsets. One of the N subsets is withheld as a test set and the decision tree is trained on the remaining N-1 subsets. After the decision tree has been built, its accuracy is measured by attempting to classify the examples in the test set. Accuracy simply refers to the percentage of examples whose class is correctly predicted by the decision tree. The error rate is the accuracy subtracted
from 100%. To compensate for sampling bias, the whole process is repeated N times, where, in each iteration, a new test set is withheld, and the overall accuracy is determined by averaging the accuracies found at each iteration. In brief, the cross-validation procedure for our purposes involves the following:

begin
1. For a total of 1000 classes and 15,000 patterns, interleave the patterns so that a pattern of class j is followed by a pattern of class j+1.
2. Segment the data into K sets (k_1, k_2, ..., k_K) of equal size; in our case, for K = 10, we have 1500 patterns in each set.
3. Train C4.5 with K-1 sets and test the system on the remaining data set. Repeat this cycle K times, each time holding out a different partition for testing.
   For i = 1 to K do {
      Train with data from all partitions k_n where 1 <= n <= K and n != i.
      Test with data from partition k_i.
   }
4. Determine the recognition performance each time and take the average over the K performances.
end

Table 3 shows the error rate performance using 10-fold cross-validation. It is important to note that the system performs extremely well, with recognition rates ranging between 90 and 94% on different folds and an overall recognition rate of 92%. This is a very good performance taking into account the fact that we have a limited number of samples in each class. Fig. 14 shows a sample of the Arabic words used in the experiments.

7. Conclusion

This paper presents a new technique for recognizing printed Arabic text and, as indicated by the experiments performed, the algorithm resulted in a 92.1% recognition rate using the C4.5 machine learning system. In this paper, we first discussed the pre-processing step, whose aim was to create connected components and then detect the skew angle in a document. The skew correction we proposed has been shown to be faster and accurate, with run times averaging under 0.25 CPU second to calculate an angle on a DEC 5000/20 workstation [41]. Moreover, the system used a global approach which is inexpensive for feature extraction and also avoids the difficulty of a segmentation stage. The computation theory is based on the model of a fast reader.
The study also shows that machine learning algorithms such as C4.5 are capable of learning the features needed to recognize printed Arabic text, achieving a best average recognition rate of 92% using 10-fold cross-validation. The use of machine learning has removed the tedious task of manually forming rule-based dictionaries for the classification of unseen characters and replaced it with an automated process which can cope with the high degree of variability that exists in printed and handwritten characters. This is a very attractive feature and, therefore, further exploration of this application of machine learning is well worthwhile.
References

[1] L.D. Harmon, Automatic recognition of print and script, Proc. IEEE 60 (10) (1972) 1165–1177.
[2] A.A. Spanjersberg, Experiments with automatic input of handwritten numerical data into a large administration system, IEEE Trans. Man Cybernet. 8 (4) (1978) 286–288.
[3] L.R. Focht, A. Burger, A numeric script recognition processor for postal zip code application, International Conference on Cybernetics and Society, 1976, pp. 486–492.
[4] J. Schuermann, Reading machines, Sixth International Conference on Pattern Recognition, 1982, pp. 741–745.
[5] R. Plamondon, R. Baron, On-line recognition of handprinted schematic pseudocode for automatic Fortran code generator, Eighth International Conference on Pattern Recognition, 1986, pp. 741–745.
[6] A. Amin, S. Al-Fedaghi, Machine recognition of printed Arabic text utilizing a natural language morphology, Int. J. Man–Mach. Stud. 35 (6) (1991) 769–788.
[7] D. Guillevic, C.Y. Suen, Cursive script recognition: a fast reader, Second International Conference on Document Analysis and Recognition, 1993, pp. 311–314.
[8] M.K. Brown, S. Ganapathy, Preprocessing technique for cursive script word recognition, Pattern Recognition 19 (1) (1983) 1–12.
[9] R.H. Davis, J. Lyall, Recognition of handwritten characters - a review, Image Vision Comput. 4 (4) (1986) 208–218.
[10] E. Lecolinet, O. Baret, Cursive word recognition: methods and strategies, in: S. Impedovo (Ed.), Fundamentals in Handwriting Recognition, Springer-Verlag, Berlin, 1994, pp. 235–263.
[11] C.Y. Suen, R. Shingal, C.C. Kwan, Dispersion factor: a quantitative measurement of the quality of handprinted characters, International Conference of Cybernetics and Society, 1977, pp. 681–685.
[12] A. Amin, Machine recognition of handwritten Arabic word by the IRAC II system, Sixth International Conference on Pattern Recognition, 1982, pp. 34–36.
[13] A. Shoukry, A. Amin, Topological and statistical analysis of line drawing, Pattern Recognition Lett. 1 (1983) 365–374.
[14] J. Kim, C.C. Tappert, Handwriting recognition accuracy versus tablet resolution and sampling rate, Seventh International Conference on Pattern Recognition, 1984, pp. 917–918.
[15] J.R. Ward, T.A. Kuklinski, A model for variability effects in handprinting with implications for the design of handwriting character recognition systems, IEEE Trans. Man Cybernet. 18 (1988) 438–451.
[16] F. Nouboud, R. Plamondon, On-line recognition of handprinted characters: survey and beta tests, Pattern Recognition 25 (9) (1990) 1031–1044.
[17] J.R. Ullmann, Advances in character recognition, in: K.S. Fu (Ed.), Applied Pattern Recognition, 1982, pp. 197–236.
[18] M. Bokser, Omnidocument technologies, Proc. IEEE 80 (7) (1992) 1066–1078.
[19] H. Fujisawa, Y. Nakano, K. Kurino, Segmentation methods for character recognition: from segmentation to document structure analysis, Proc. IEEE 80 (7) (1992) 1079–1091.
[20] S. Srihari, From pixels to paragraphs: the use of models in text recognition, Second Annual Symposium on Document Analysis and Information Retrieval, 1993, pp. 47–64.
[21] M. Khemakhem, Reconnaissance de caractères imprimés par comparaison dynamique, Thèse de Doctorat de 3ème cycle, University of Paris XI, 1987.
[22] M. Khemakhem, M.C. Fehri, Recognition of printed Arabic characters by comparaison dynamique, Proceedings of the First Kuwait Computer Conference, 1989, pp. 448–462.
[23] H.Y. Abdelazim, M.A. Hashish, Interactive font learning for Arabic OCR, Proceedings of the First Kuwait Computer Conference, 1989, pp. 464–482.
[24] H.Y. Abdelazim, M.A. Hashish, Automatic recognition of handwritten Hindi numerals, Proceedings of the 11th National Computer Conference, Dhahran, 1989, pp. 287–299.
[25] Z. Emam, M.A. Hashish, Application of hidden Markov model to the recognition of isolated Arabic word, Proceedings of the 11th National Computer Conference, Dhahran, 1989, pp. 761–774.
[26] R. Schwartz, C. LaPre, J. Makhoul, C. Raphael, Y. Zhao, Language independent OCR using a continuous speech recognition system, 13th International Conference on Pattern Recognition, Vol. C, Vienna, Austria, 1996, pp. 99–103.
[27] M.A. Mahjoub, Choix des paramètres liés à l'apprentissage dans la reconnaissance en ligne des caractères arabes par les chaînes de Markov cachées, Forum de la Recherche en Informatique, Tunis, July 1996, pp. 28–39.
[28] N. BenAmara, A. Belaïd, Printed PAW recognition based on planar hidden Markov models, 13th International Conference on Pattern Recognition, Vol. II, Vienna, Austria, 1996, pp. 76–80.
[29] A. Amin, IRAC: recognition and understanding systems, in: R. Descout (Ed.), Applied Arabic Linguistics and Signal and Information Processing, Hemisphere, New York, 1987, pp. 159–170.
[30] A. Amin, J.F. Mari, Machine recognition and correction of printed Arabic text, IEEE Trans. Man Cybernet. 9 (1) (1989) 1300–1306.
[31] H. Almuallim, S. Yamaguchi, A method of recognition of Arabic cursive handwriting, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9 (1987) 715–722.
[32] T. El-Sheikh, R. Guindi, Computer recognition of Arabic cursive script, Pattern Recognition 21 (4) (1988) 293–302.
[33] P. Ahmed, M. Khan, Computer recognition of Arabic script based text - the state of the art, Proceedings of the 4th International Conference and Exhibition on Multi-Lingual Computing (Arabic and Roman Script), UK, 1994, pp. 221–215.
[34] B. Al-Badr, S. Mahmoud, Survey and bibliography of Arabic optical text recognition, Signal Processing 41 (1995) 49–77.
[35] A. Amin, Arabic character recognition, in: H. Bunke, P.S.P. Wang (Eds.), Handbook of Character Recognition and Document Image Analysis, World Scientific, Singapore, 1997, Chapter 15, pp. 397–420.
[36] A. Amin, Off-line Arabic character recognition: the state of the art, Pattern Recognition 31 (5) (1998) 517–530.
[37] J.R. Quinlan, Discovering Rules for a Large Collection of Examples, Edinburgh University Press, Edinburgh, 1979.
[38] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[39] A. Amin, H.B. Al-Sadoun, A new structural technique for recognizing printed Arabic text, Int. J. Pattern Recognition Artif. Intell. 9 (1) (1995) 101–125.
[40] D. Drivas, A. Amin, Page segmentation and classification utilising a bottom-up approach, Third International Conference on Document Analysis and Recognition (ICDAR'95), Canada, 1995, pp. 610–614.
[41] A. Amin, S. Fischer, A. Parkinson, R. Shin, Comparative study of skew detection algorithms, J. Electron. Imaging 5 (4) (1996) 443–451.
[42] R. Duda, P. Hart, Use of the Hough transformation to detect lines and curves in pictures, Commun. ACM 15 (1972) 11–15.
[43] R.E. Walpole, R.H. Myers, Probability and Statistics for Engineers and Scientists, 4th Edition, Macmillan Publishing, New York, 1990, p. 362.
[44] A. Paeth, A fast algorithm for general raster rotation, Proceedings of Graphics Interface / Vision Interface, Canadian Information Processing Society, 1986, pp. 77–81.
About the Author - DR. ADNAN AMIN received the B.Sc. degree in Mathematics and a Diploma in Computer Science from Baghdad University in 1970 and 1973, respectively. He received the DEA in Electronics from the University of Paris XI (Orsay) in 1978 and presented his Doctorat d'Etat (D.Sc.) in Computer Science to the University of Nancy I (CRIN), France, in 1985. From 1981 to 1985, Dr. Amin was Maître Assistant at the University of Nancy II. Between 1985 and 1987 he worked at INTEGRO (Paris) as Head of the Pattern Recognition Department. From 1987 to 1990, he was an Assistant Professor at Kuwait University, and he joined the School of Computer Science and Engineering at the University of New South Wales, Australia, in 1991 as a Senior Lecturer. Dr. Amin's research interests are Pattern Recognition and Artificial Intelligence (document image understanding ranging from character/word recognition to integrating visual and linguistic information in document composition), neural networks, machine learning, and knowledge acquisition. He has more than 70 technical papers and reports in these areas. He is an Associate Editor of Pattern Recognition and Artificial
Intelligence, a member of the Editorial Board of Pattern Analysis and Applications, and has served on the Program and Technical Committees of several international conferences and workshops. He served as Co-chairman of the Second International Workshop on Statistical Techniques in Pattern Recognition (SPR'98) and the Seventh International Workshop on Structural and Syntactical Pattern Recognition (SSPR'98), both held in Sydney in 1998. Currently, he is chairman of the Structural and Syntactical Pattern Recognition Technical Committee (TC2) of the International Association for Pattern Recognition (IAPR). Dr. Amin is a member of the IEEE, the IEEE Computer Society, the Australian Pattern Recognition Society and the ACM.
Pattern Recognition 33 (2000) 1325–1338
Dealing with segmentation errors in region-based stereo matching
Angeles López*, Filiberto Pla
Departament d'Informàtica, Universitat Jaume I, Campus Penyeta Roja, s/n, E-12071 Castelló, Spain
Received 29 September 1998; accepted 7 May 1999
Abstract

Graph-based matching methods have been widely used in the areas of object recognition and stereo correspondence. In this paper, an algorithm to deal with segmentation errors in region-based matching is proposed, which consists of a preprocessing stage to the classical graph-based matching algorithm. Some regions are merged and included in the matching process in order to avoid the differences in segmentation. The selection of an appropriate similarity criterion to create the initial nodes in the graph and the use of approximative algorithms to find maximal cliques are important issues in order to reduce the computational burden. The experimental results show that the method is robust enough in the presence of noise. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Region matching; Segmentation errors; Graph-based matching; Maximal cliques; Stereo vision
1. Introduction

Many applications in robotics need a map of the robot environment, and stereo vision can be used to construct such a map. One of the most difficult problems in stereo vision is to find corresponding points in the left and in the right image, that is, the correspondence problem. Several matching techniques have been developed, which differ in the imaging geometry, the matching primitives, or the matching strategy. For a survey on stereo vision techniques the reader is addressed to [1–4]. Feature-based approaches are the most extended techniques and they establish correspondences between features extracted from the images. Most of the existing approaches use edge-based features as primitives for
This work was partially supported by grants TIC98-0677-C02-01 (CICYT, Ministerio de Educación y Ciencia) and GV97-TI-05-27 (Conselleria de Educació y Ciència, Generalitat Valenciana).
* Corresponding author.
E-mail addresses: 
[email protected] (A. López), 
[email protected] (F. Pla).
matching [5,6]. There are also some works that use regions [7–11], which have a higher semantic content compared with edge segments or points. The use of regions makes some of the matching constraints implicit or easier to introduce, and there are typically fewer regions than edges. Therefore, establishing image correspondences would be more efficient and the number of mismatches would be reduced. Marapane and Trivedi [8] proposed the use of regions as matching primitives due to their stability and descriptive capability, which make them more tolerant to noise than edge-based primitives. However, region-based matching provides a very sparse disparity map. Thus, region matching may be the first step in a hierarchical stereo system [8] where the correspondence problem has to be solved first using regions, and afterwards edges and pixels, in order to obtain a dense disparity map. Region-based techniques have to be able to handle segmentation errors. When the stereo image pair is segmented, errors may occur due to noise, bad imaging conditions, limitations of segmentation procedures, etc. In this work, a graph-based method for dealing with segmentation errors in region matching is proposed
which consists of a preprocessing stage to the classical association graph method that creates appropriate candidates of merged regions. An appropriate similarity criterion is also proposed in order to obtain a measurement that reduces the number of false matchings and, thus, the computational cost. Experimental results in the presence of noise are presented and the resulting disparity maps are compared in order to analyze the robustness of the method. The rest of this work is organized as follows. First, previous work on graph-based matching is summarized in the next section. The graph-based method proposed to solve segmentation errors and the similarity criterion introduced are presented in Section 3. Finally, some experiments with images with different levels of noise are shown in Section 4.
2. Previous work

Some works have tried to solve the correspondence problem using graph-based methods. The usual technique is to construct an association graph where a node represents a mapping between a feature in the left image and a feature in the right image, and an arc between two nodes represents compatibility between the two mappings. Ayache and Faverjon [12] use linear edge segments as primitives, and all the possible matchings between segments are represented as nodes in the graph. Horaud and Skordas [5] use linear edge segments and introduce the maximal clique search as a stereo-matching strategy. This method has been used in the area of object recognition [13–15], where the largest maximal clique search obtains the best set of matches between image regions and parts of an object model. However, the largest maximal clique is not necessarily the best correspondence, due to occlusions, failures of the feature extraction process, accidental alignment, etc. Thus, a weight is associated with each node in the graph, which represents the similarity between the two edge segments in both images. They use a benefit function to find the maximal clique that maximizes the sum of the weights of its individual nodes. Finding all the maximal cliques [16–18] (or only the largest maximal clique) in a graph is an NP-complete problem. Several techniques try to speed up the process, like parallel implementations [19]. Herault et al. [20] cast the graph matching problem into a simulated annealing algorithm using the Metropolis method. Ranganath and Chipman [21] use a fuzzy relaxation approach that includes structural information in the node weights. They consider weights in the arcs, which measure the rate of compatibility between two nodes, thus constructing an enhanced association graph. Node weights are modified in an iterative relaxation process that depends on the number and weights of the connected nodes and the strength of their relationship. We use
this relaxation approach in all the experiments in this work. On the other hand, these graph-based techniques applied to region matching have to be able to handle segmentation errors:
- Partly occluded regions: regions that appear in both images, with different shapes due to partial occlusion, noise, etc.
- Occluded regions: regions that appear only in one image.
- Split regions: some adjacent regions in one image correspond to one or more regions in the other image. This problem is also known as oversegmentation and undersegmentation in the area of object recognition.
Partially occluded regions produce mismeasured weights in the graph nodes, which can be recovered during the relaxation process. Regions that do not appear in the other image produce nodes, if any, that are usually incompatible with the other nodes. However, the problem of split regions is more difficult. Yang et al. [15] correct oversegmentation by generating an augmented association graph. The method considers all the nodes that try to match different features of the image to the same model patch and merges them into a single node. However, they make all the possible mergings in an iterative process without filtering out the inadequate merges. This could make the association graph grow unnecessarily, and some merged regions may include undesirable subregions. Ranganath and Chipman [21] correct both oversegmentation and undersegmentation by taking into account the node weights. Their method considers all the nodes that try to match different regions of the image to the same model patch, but they are merged only when all node weights are below a threshold. The number of new merged nodes is not excessive, but by imposing a threshold we may reject some desirable mergings. In this paper, we propose a method which considers all the possible cases, but rejects merges that do not provide a good correspondence by comparing the correspondences of individual regions to the correspondence of the merged region. In addition, we also consider the case where the segmentation of corresponding areas in the two images gives very different regions. Our approach is also based on graph techniques, building a different graph previous to the association graph and calculating all the sets of regions to be merged.
3. Solving segmentation errors

3.1. Graph-based stereo matching

When no segmentation errors appear, the enhanced association graph method for scene matching developed
Fig. 1. A simple example.
by Ranganath and Chipman [21] can be summarized as follows.
1. Find all the regions in both images.
2. Construct the association graph. Nodes are assigned a weight that represents the similarity between two regions. Arcs are assigned a weight that represents the rate of compatibility between two nodes. Pairs of regions with low similarity are rejected by means of a threshold, T_s.
3. Perform the relaxation process, in order to include structural information in the node weights. Discard nodes with low weights.
4. Find the best maximal clique in the graph. The best maximal clique is the maximal clique with the highest sum of node weights. (A schematic code sketch of these steps is given at the end of this subsection.)
As reviewed in Section 2, similar procedures have been used in several works on stereo matching and object recognition. These works obtain satisfactory results in the matching when the segmentation errors are not significant. In Fig. 1, a simple example of a stereo image pair and the corresponding association graph are shown. Note that there are six maximal cliques of identical length. The relaxation process includes structural information in the node weights, such that the weight of nodes connected with high compatibility to nodes of high weight is increased, and otherwise it is decreased. In order to maintain the above matching process and add minimal computational effort, we propose a graph-based preprocessing step to merge the appropriate regions. The main idea is to construct a graph similar to the association graph, where the nodes represent all the possible matchings, with the difference that all the nodes which try to match the same region of an image are connected. Such a graph, which we call the incompatibilities association graph, represents groups of regions that could be merged. Although all the possible mergings are considered a priori, some mergings are rejected by means of a simple test. The test is based on the similarity
criterion used for computing similarity between two regions. From each group of regions we obtain a new merged region to be considered in the matching. Finally, the new regions generate new nodes which are added to the initial set of nodes to construct the association graph of stage 2. As this preprocessing is required before the association graph is constructed, a new stage is introduced between stages 1 and 2 of the above procedure. This new stage is detailed in Section 3.2. The similarity criterion used for computing the graph nodes and comparing region similarities is discussed in Section 3.3.
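The following is a compact sketch of the association-graph stage summarized in steps 1-4 above, written only for illustration. The region representation, the similarity and compatibility callbacks, the relaxation update and the constants are all assumptions, not the authors' exact formulation; in particular, the relaxation shown is a simple stand-in for the fuzzy relaxation of Ranganath and Chipman [21].

```python
from itertools import combinations

def build_association_graph(left_regions, right_regions, similarity, compatibility, t_s):
    """Nodes are candidate matches (i, j) weighted by region similarity; arcs
    carry a compatibility weight between two candidate matches."""
    nodes = {}
    for i, L in enumerate(left_regions):
        for j, R in enumerate(right_regions):
            s = similarity(L, R)
            if s > t_s:                          # reject pairs with low similarity
                nodes[(i, j)] = s
    arcs = {}
    for a, b in combinations(nodes, 2):
        if a[0] != b[0] and a[1] != b[1]:        # two matches must not share a region
            arcs[(a, b)] = compatibility(a, b)
    return nodes, arcs

def relax(nodes, arcs, iterations=10):
    """Toy relaxation: pull each node weight towards the compatibility-weighted
    average weight of its neighbours, so well-supported matches are reinforced."""
    w = dict(nodes)
    neigh = {n: [] for n in nodes}
    for (a, b), c in arcs.items():
        neigh[a].append((b, c))
        neigh[b].append((a, c))
    for _ in range(iterations):
        new_w = {}
        for n, weight in w.items():
            total = sum(c for _, c in neigh[n])
            if total > 0:
                support = sum(c * w[m] for m, c in neigh[n]) / total
                new_w[n] = 0.5 * weight + 0.5 * support
            else:
                new_w[n] = weight
        w = new_w
    return w
```

After relaxation, the best maximal clique of the relaxed graph gives the final set of matches; a sketch of the suboptimal clique search used for that step appears at the end of Section 3.4.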
3.2. The incompatibilities association graph

The incompatibilities association graph is a graph where all the possible matchings are represented as nodes and those nodes that try to match the same region are connected. However, not all these incompatibilities lead to appropriate merges, so a test is made to reject inadequate arcs. The resulting graph represents all the groups of regions that should be merged, so that the new merged regions are considered in the classical graph-based method for region matching.

3.2.1. Graph nodes
Let the left image consist of N regions (L_1, L_2, ..., L_N) and the right image consist of M regions (R_1, R_2, ..., R_M). The first step consists of finding all the possible mappings between regions in both images, using the epipolar constraint to reduce the search space and some similarity criterion to measure node weights and reject pairs of non-similar regions. Each pair of regions whose similarity is greater than the similarity threshold, T_s, is a node in the incompatibilities association graph.

3.2.2. Incompatibilities between nodes
Two nodes may be incompatible for several reasons, taking into account the relationship between regions in the same scene. For merging the appropriate regions, we
Fig. 2. Incompatibilities association graph of the example in Fig. 1(a): dashed lines represent rejected arcs.
Fig. 3. The association graph of the example in Fig. 1(a) after merging regions.
are interested in the incompatibilities arising from the fact that two nodes try to match the same region. In Fig. 2 there is a graph with arcs between each pair of nodes that try to match the same region. One split region will have at least one incompatibility with each split region corresponding to the same region. However, not all these incompatibilities lead to pairs of regions that should be merged. In the example (Fig. 1(a)), one of the left regions is an extra object that does not appear in the right image. Two left regions could both map to the same region in the right image, but they should not be merged.
3.2.3. Graph arcs
The third step consists of rejecting all the incompatibilities that will not provide adequate merged regions and establishing arcs in the graph for the remaining incompatibilities. Let (L_k, R_p) and (L_k, R_q) be two incompatible nodes and R_pq a region resulting from joining R_p and R_q:

if S(L_k, R_pq) > S(L_k, R_p) and S(L_k, R_pq) > S(L_k, R_q)
then establish an arc between node(L_k, R_p) and node(L_k, R_q),    (1)

where S(·) is the criterion of similarity between two regions. In other words, we test every pair of incompatible mappings to see if they generate a merged region which matches the common region better than each split region does. If both regions together do not match L_k better than separately, they should not be merged, and the arc is not created. In the example in Fig. 1, one arc is rejected because the mapping of the merged region is not better than that of each region alone. However, the arcs between the three nodes that map to the same right region remain, because the union of these regions maps that region better than each region separately.

3.2.4. Search of all the maximal cliques
The fourth step consists of finding all the maximal cliques in the incompatibilities association graph. Each maximal clique represents a group of regions in one image that match a common region in the other image. For each maximal clique, a new region is constructed as the union of all the single regions. In the example, only one maximal clique is found, formed by the three nodes that try to match the same right region at the same time. The three left regions involved in these matchings are merged to form a new region. The possible matchings concerning the new merged region are then calculated and added to the initial set of nodes. For the calculation of the association graph arcs, all the nodes concerning the merged region are considered incompatible with all the nodes concerning the individual regions that form it. In Fig. 3, the resulting association graph is shown.
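The arc-acceptance test of Eq. (1) translates directly into a small predicate. In this sketch, S stands for whatever region-similarity function is in use (the criterion of Section 3.3) and merge_regions is an assumed helper that returns the union of two regions; both names are hypothetical.

```python
def should_connect(S, merge_regions, L_k, R_p, R_q):
    """Eq. (1): connect the incompatible nodes (L_k, R_p) and (L_k, R_q) only if
    the merged region R_pq matches L_k better than each of its parts alone."""
    R_pq = merge_regions(R_p, R_q)   # union of the pixels of R_p and R_q
    return S(L_k, R_pq) > S(L_k, R_p) and S(L_k, R_pq) > S(L_k, R_q)
```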
3.3. The similarity criterion

The criterion to match regions and assign node weights should be able to
1. match partly occluded regions,
2. match one split region with the whole region in the other image,
3. match one merged region with its corresponding region in the other image better than each simple region separately, and
4. reject non-matching regions.
The similarity criteria used in most of the works consist of comparing some region properties which have been proved to be significant in the literature [22]. Randriamasy and Gagalowicz [7] use region size, feature means and the position of the center of gravity. Marapane and Trivedi [8] use mean gray level, area, perimeter, width, length, and aspect ratio. They also proposed using other spectral and spatial properties such as intensity measurements in single or multiple channels, compactness, major and minor axes, moments, texture and topological
descriptors. Cohen et al. [9] use similarity in the region size, circularity, position of the center of gravity, intensity mean, intensity variance, spatial moments, etc. Lee et al. [11] use some affine moment invariants, which are the eigenvalues of a matrix that represents the apparent motion of the region between the two images. However, when a region in one image has been split into several regions, the region properties of the split regions may be very different from those of the whole region. Segmentation errors may cause some regions in one image to be very different in size and shape from the corresponding regions in the other image. Thus, in order to achieve all the objectives, we propose the use of correlation-based techniques instead of region features to measure the similarity of two regions. Correlation measurements are mainly used in area-based matching techniques, where the matched area has identical size and shape in the left and right images. In our case, the criterion must be adapted to manage regions with different size and shape.

Let us consider a region L_i in the left image and a region R_j in the right image. Both regions rarely match exactly, so we have to slide one region over the other in order to localize the point of maximum correlation. The pixels to be correlated are those defined by the intersection of both regions at a given disparity d. Thus, the similarity measurement can be defined as the maximum correlation coefficient over a set of disparities:

S(L_i, R_j) = \max_{d \in [d_{min}, d_{max}]} C_{ij}(d),    (2)

where C_{ij}(d) is the correlation between the intersecting areas of L_i and R_j at disparity d, and [d_{min}, d_{max}] depends on the position and shape of both regions, and on the disparity limits, if known.

Let us call L_{ijd} and R_{ijd} the intersecting areas of regions L_i and R_j at disparity d in each image. Let \bar{L}_{ijd} and \bar{R}_{ijd} be the mean intensities in each area, and \sigma_{L_{ijd}}, \sigma_{R_{ijd}} the standard deviations. The correlation measurement used in the experiments is the zero-mean normalized cross correlation (ZNCC), which is commonly used in area-based stereo techniques [23,24]:

C_{ij}(d) = \frac{1}{2} \left( 1 + \frac{1}{Q} \sum_{(x,y) \in L_{ijd}} (I_L(x, y) - \bar{L}_{ijd}) (I_R(x + d, y) - \bar{R}_{ijd}) \right),    (3)

where I_L(x, y) and I_R(x, y) are the intensity values of pixel (x, y) in the left and right images, respectively, and Q depends on the standard deviations of the intersected areas and on N, the number of intersected pixels: Q = N \sigma_{L_{ijd}} \sigma_{R_{ijd}}.

Other correlation techniques were tested (sum of squared differences (SSD), zero-mean SSD) that gave poor results. Only the zero-mean normalized SSD
(ZNSSD) measurement gave results similar to ZNCC. The normalization by the standard deviations of both intersected regions seems to be very important to achieve a significant correlation result, given that the number of correlated points is very different in each calculation.

However, the similarity criterion expressed in Eq. (2) does not satisfy the last objective of the list. As an extreme example, two regions which at a certain disparity intersect in only one pixel have excellent similarity if both pixels have a similar gray level. Therefore, some information about the size of the matched area should be introduced in the criterion. Let us introduce a matching coefficient that relates the number of matched pixels to the size of both regions. Let us define N^{max}_{ij}(d) and N^{min}_{ij}(d) as the matching coefficients with respect to the maximum and minimum size, respectively:

N^{max}_{ij}(d) = A(L_{ijd}) / max(A(L_i), A(R_j)),

N^{min}_{ij}(d) = A(L_{ijd}) / min(A(L_i), A(R_j)),

where A(L) indicates the number of pixels of region L. If we maximize the product of the first matching coefficient with the correlation coefficient, we find the disparity value of best coincidence between both regions. Let us call it the best disparity,

\hat{d}_{ij} = \arg\max_{d \in [d_{min}, d_{max}]} (N^{max}_{ij}(d) C_{ij}(d)).    (4)

We propose a combination of two criteria:

S_1(L_i, R_j) = N^{max}_{ij}(\hat{d}_{ij}) C_{ij}(\hat{d}_{ij}),    (5)

S_2(L_i, R_j) = N^{min}_{ij}(\hat{d}_{ij}) C_{ij}(\hat{d}_{ij}).    (6)
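As an illustration of Eqs. (2)-(6), the sketch below computes the ZNCC over the intersection of two regions and the two similarity criteria. It is a simplified, assumption-laden example: regions are boolean masks over the image grid, disparities are purely horizontal, the wrap-around introduced by np.roll at the image border is ignored, and regions are assumed non-empty; the function and variable names are ours, not the authors'.

```python
import numpy as np

def zncc(I_left, I_right, mask_L, mask_R, d):
    """C_ij(d) of Eq. (3) over the intersection of L_i and R_j at disparity d."""
    inter = mask_L & np.roll(mask_R, -d, axis=1)   # (x, y) in left vs (x + d, y) in right
    n = int(inter.sum())
    if n < 2:
        return 0.0, n
    a = I_left[inter].astype(float)
    b = np.roll(I_right, -d, axis=1)[inter].astype(float)
    if a.std() == 0 or b.std() == 0:
        return 0.0, n
    q = n * a.std() * b.std()                      # Q = N * sigma_L * sigma_R
    return 0.5 * (1.0 + float(np.sum((a - a.mean()) * (b - b.mean()))) / q), n

def region_similarity(I_left, I_right, mask_L, mask_R, disparities):
    """Best disparity (Eq. (4)) and the two criteria S1 (Eq. (5)) and S2 (Eq. (6))."""
    A_L, A_R = int(mask_L.sum()), int(mask_R.sum())
    best_score, best_n, best_c = -np.inf, 0, 0.0
    for d in disparities:
        c, n = zncc(I_left, I_right, mask_L, mask_R, d)
        score = (n / max(A_L, A_R)) * c            # N_max(d) * C(d)
        if score > best_score:
            best_score, best_n, best_c = score, n, c
    S1 = best_score                                # used for comparisons between nodes
    S2 = (best_n / min(A_L, A_R)) * best_c         # used to create graph nodes
    return S1, S2
```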
1. For the creation of graph nodes, a similarity measurement satisfying requirements 1, 2 and 4 of the list is necessary. For this purpose, the appropriate criterion is S_2 (Eq. (6)).
2. For comparisons between nodes and all the other operations in the matching process, requirements 1, 3 and 4 of the list are necessary. For this purpose, the appropriate criterion is S_1 (Eq. (5)).
Note that the use of two similarity measurements does not cause an important increment in the computational cost of the algorithm, given that the most expensive task is the search for the best disparity (Eq. (4)), which is done once per match.

3.4. Finding the best maximal clique: a suboptimal algorithm

Although a great amount of effort has been directed to reducing the size of the graph, the number of nodes still can
be very high when there is a high number of regions per image. Finding the best maximal clique is an NP-complete problem, so the computational time grows exponentially with respect to the number of nodes. Here, we propose a suboptimal algorithm that takes into account the weights of the nodes in order to reduce the computational cost of the method. Given the association graph G = (N, A), where N is the list of nodes and A is the list of arcs, we propose the following algorithm for obtaining an approximate solution for the best maximal clique:
1. Sort list N by weight in decreasing order.
2. Divide list N into two lists: N_1 with the first K nodes of N and N_2 with the rest of the nodes. K is selected in order to fulfil some time restriction.
3. Obtain the subgraph G_1 = (N_1, A_1), where A_1 is a subset of A and contains only the arcs concerning nodes in N_1.
4. Find the best maximal clique in G_1, which is a set of nodes S.
5. For each node n_i in N_2, in decreasing weight order: if n_i is compatible with all the nodes in S, add n_i to S; otherwise discard n_i.
This algorithm does not guarantee the best solution but, as shown in the next section, the results are very close to the optimal solution and often identical. The cost of the method is the sum of the costs of all the steps. The sorting algorithm is 2n ln(n) in the average case, where n is the length of list N. The cost of step 3 is linear with respect to the number of arcs of G_1, which is O(K^2) in the worst case. The best maximal clique search in G_1 is exponential with respect to the graph size. However, K is fixed by imposing a maximum computational time on this part of the process, which we call F(K). Finally, the compatibility computation of each node in N_2 with the estimated solution is linear with respect to the solution size, which initially is K in the worst case. As the compatible nodes are added to the solution at each iteration, the cost of step 5 is quadratic, O((n - K + 1)(n + K)) in the worst case. By rejecting all the terms below the quadratic cost, the overall cost is O(K^2 + F(K) + (n^2 - K^2)), and therefore O(n^2 + F(K)) with K <= n. Although F(K) is exponential with respect to K, by fixing K to a certain value we fix a limit on this term, and therefore the algorithm is quadratic with respect to n.
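A minimal sketch of the suboptimal search follows. The graph is assumed to be given as a dictionary of node weights plus a set of compatible pairs, and best_clique_exact stands for whatever exhaustive best-maximal-clique routine is available for the small top-K subgraph; these names and the data layout are assumptions for illustration only.

```python
def best_clique_suboptimal(weights, compatible, K, best_clique_exact):
    """Approximate best maximal clique following steps 1-5 above.
    weights: dict node -> weight; compatible: set of frozenset({a, b}) arcs."""
    # Steps 1-2: sort nodes by decreasing weight and split into top-K and rest.
    nodes = sorted(weights, key=weights.get, reverse=True)
    top, rest = nodes[:K], nodes[K:]
    top_set = set(top)
    # Step 3: keep only the arcs whose two endpoints are both in the top-K list.
    sub_arcs = {arc for arc in compatible if arc <= top_set}
    # Step 4: exhaustive (exponential) search, but only on the small subgraph.
    S = set(best_clique_exact(top, {n: weights[n] for n in top}, sub_arcs))
    # Step 5: greedily extend the clique with the remaining nodes.
    for n in rest:
        if all(frozenset((n, m)) in compatible for m in S):
            S.add(n)
    return S
```

Fixing K bounds the exponential term F(K), so in practice the running time is dominated by the quadratic extension loop, as discussed above.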
4. Experimental results

4.1. Response to different segmentation methods

Fig. 4 shows one example of the stereo pairs that we have used in the experiments. A set of versions of this synthetic stereo pair [25] corrupted with different levels of noise is available. In particular, Fig. 4 shows the pair without noise. Figs. 5 and 7 show the results of two segmentation methods. One of the methods used to segment the images is a common region merging segmentation method [26] (Fig. 5). The other segmentation method used in the experiments is a clustering technique developed in Ref. [27], which groups nearby pixels into regions within a certain variance in the gray level (Fig. 7). In all the experiments, we applied the same segmentation method with the same parameters to both images of the stereo pair. Despite this, several differences between the segmentation results of the two images can be appreciated in all the cases; that is, some regions labeled in the left image correspond to more than one region in the right image and vice versa. Region merging is controlled by means of a threshold on the difference in intensity mean of two adjacent regions, T. If the difference is below T, then the regions
Fig. 4. Synthetic stereo image pair, corridor (without noise).
Fig. 5. Segmentation by region merging of the corridor stereo pair and merged regions resulting from the algorithm.
Fig. 6. Ground truth disparity map and results from our algorithm.
are merged. Region clustering is controlled by means of a threshold on the intensity variance of a region, T. If the variance is greater than T, the region is divided into subregions. The maximum number of pixels within a region is also provided, in order to prevent a region from growing considerably. In the case of region merging segmentation, we filter out very small regions of only a few pixels. In the case of region clustering segmentation, all the small regions are joined to the most similar region during the clustering process [27]. The black areas shown in the disparity maps for each type of segmentation (Figs. 6 and 8) correspond to either filtered regions or unmatched regions. Except for the black areas, darker areas correspond to farther points in the scene. Given that we assigned a constant disparity within each region, the resulting disparity maps show an approxima-
tion based on fronto-parallel planes. This constraint is also implicitly assumed in the use of correlation. This approximation of the disparity map could serve for a number of robotics applications, where the needed map can be less accurate, since this map gives a general idea of the location of objects in the scene. For applications that need more accuracy, this map can be a good initialization for the matching of edges and points in the scene. Fig. 5 shows an example of the regions that have been merged as a result of the proposed algorithm, and the corresponding matched regions in the other image. Regions shaded with the same gray level in the left image are regions merged during the preprocessing step and then matched with the corresponding regions in the right image also shaded with the same gray level. The resulting correspondences include eight merged regions in the left and 19 merged regions in the right, and they contain
2–5 individual regions. Some of the merges are not visible because they consist of boundary regions which are added to the adjacent bounded region, or groups of tiles which are joined in order to match similar groups in the other image. Fig. 7 shows the results with the second segmentation method. Note how the proposed algorithm has been able to cope with the problem of the splitting of regions in one image with respect to the other one. Fig. 8 shows the resulting disparity map. Fig. 9 represents another example of a real scene (images from the JISCT stereo test set [28]). Note how some regions in the left image have been considered in the process to be merged in order to achieve a better matching with regions in the right image, due to the problem of region splitting during segmentation. Figs. 10 and 11 show the results obtained on other real scenes. Some of the region mergings are shown by filling the corresponding regions with the same gray level. The disparity maps correspond to the right images. Some problems appear on the right side of the image, due to the lack of information about the corresponding regions in the other image. Also, some inaccuracy in the calculation of the best disparity can be observed in some cases, especially in those regions where the intensity variance is very small (e.g. the front part of the table in Fig. 11).

Fig. 7. Segmentation by region clustering of the corridor stereo pair and merged regions resulting from the algorithm.

Fig. 8. Disparity map resulting from our algorithm.

4.2. Computational complexity
As has already been pointed out, maximal clique calculation is an NP-complete problem, and thus any algorithm to calculate the maximal cliques in a graph has an exponential cost which depends on the number of nodes in the graph. In the present problem of region matching, the number of nodes in the association graph depends on the similarity threshold selected to consider pairs of possible matchings. Table 1 shows the computational cost of the algorithm using different similarity thresholds. For each similarity threshold, the rate of matching achieved is also shown, measured as the rate of pixels matched with respect to the number of pixels of all the regions considered for matching. The number of matched regions grows when the threshold decreases, because more possible matchings are considered in the matching process. All the computational times in this work have been calculated using an HP-725 workstation (75 MHz). The execution time of the whole process has been reduced considerably with respect to [29], given that most of the false matchings are rejected thanks to the similarity criteria introduced in this work. Nevertheless, in order to try to overcome the exponential complexity of the maximal clique computation, Table 2 shows the result of applying the algorithm introduced
Fig. 9. Results for a stereo pair of real images, the parking meter example [28], segmented by the region merging method.
in Section 3.4 to calculate an approximation to the solution provided by the maximal clique approach. Note that computing times have now been reduced considerably with respect to the times shown in Table 1, and that the computational cost no longer increases exponentially as the number of nodes increases. Therefore, the subop-
timal algorithm allows the use of lower thresholds that provide better correspondence rates. As the number of graph nodes grows considerably, only the suboptimal algorithm can be applied. Although the suboptimal algorithm proposed does not guarantee the best solution, the results are identical to the best
Fig. 10. Results for a stereo pair of real images, the textured lamp example, segmented by the region merging method.
maximal clique finding algorithm in 99.9% of the experiments. In the rest of the experiments, the results are very similar (below 1% difference in the rate of matching). Fig. 12 shows the resulting disparity maps obtained by applying both approaches to find the solution, the maximal clique and the suboptimal algorithm. Note that they are quite similar; therefore, maximal clique computation can be substituted by the proposed approximate solution. Fig. 16 shows the growth of the computational time with respect to the selection of the parameter K.

4.3. Evaluation of noise influence

In order to evaluate the influence of noise on the algorithm, we used stereo pairs with different variances of noise. For
example, in Fig. 13 we show the disparity maps obtained for a stereo image pair without noise, and another with noise of variance 100. The resulting disparity maps are similar, in the sense that they provide an approximation to the solution, which is shown in Fig. 6. Although the results are not very accurate, both maps give an idea of the position of the objects in the scene. Fig. 14 shows the percentage of matching with respect to the similarity threshold used for images corrupted with different noise levels. Note that the higher the noise level in the image, the lower the similarity threshold needed to obtain similar results in matching rate. The similarity measurements between corresponding regions decrease due to noise, so that the number of graph nodes decreases.
Fig. 11. Results for a stereo pair of real images, the lab example, segmented by the region merging method.
Table 1
Percentage of matching for the corridor pair (without noise)

Threshold T_s    0.65     0.60     0.55
Graph nodes      70       85       100
% matching       91.5     92.4     93.3
Time (s)         40.75    53.09    13969.49
Fig. 15 shows the relation between the number of nodes in the graph and the matching rate obtained for the noiseless and noisy examples. Note that the noisy example needs more nodes in the graph to obtain the same rate of correspondences as the noiseless example. This means that more nodes are included in the graph to solve
segmentation problems that appear due to the noise effect. In the noisy examples, it is possible to obtain rates of matching similar to the noiseless example by decreasing the similarity threshold. That is, the proposed method can achieve results in the presence of noise similar to the ones obtained with noise-free images, which means the method is robust in the presence of noise.
5. Conclusions and further work

An improvement to the graph-based method for finding region correspondences has been introduced which
Table 2
Percentage of matching for the corridor pair (without noise), using the suboptimal algorithm with K = 30

Threshold T_s    0.65    0.60    0.55    0.50    0.45    0.40     0.35
Graph nodes      70      85      100     118     141     192      248
% matching       91.5    92.4    93.3    93.7    94.0    95.3     96.0
Time (s)         40.92   46.97   55.40   67.41   88.95   115.12   153.22
Fig. 12. Disparity maps and matching rates of the corridor example using the optimal and suboptimal algorithms (T_s = 0.80).

Fig. 13. Disparity maps and matching rate (T_s = 0.40).
Fig. 14. Noise effect on the matching rate.

Fig. 15. Relation between graph size and matching rate at different noise levels.

Fig. 16. Growth of execution time with respect to the number of nodes considered for finding the best maximal clique.
can deal with segmentation errors. The approach consists of a preprocessing stage to the classical graph-based method in order to calculate the appropriate merged regions to be considered in the matching process. This method makes it possible to avoid errors due to segmentation, by considering both the single and the merged regions in the matching process. The only additional cost with respect to the original method without considering segmentation errors is the
calculation of all the maximal cliques in the incompatibilities association graph. Since the number of arcs is normally low, the process of finding all the maximal cliques is rather fast, using an appropriate algorithm that takes advantage of this fact. The use of two similarity criteria based on correlation techniques, one for creating graph nodes and another for the rest of the matching process, produces fewer ambiguities, thus reducing the computation time and producing more reliable region matches. The use of an approximate algorithm to find maximal cliques has been shown to provide satisfactory results in O(n^2 + F(K)) in the worst case, overcoming the problem of the exponential cost incurred if exhaustive maximal clique calculation is performed. The experiments carried out show that the method provides satisfactory results with respect to the segmentation problem. The construction of a graph with the incompatibilities between matches is a robust solution for merging regions in order to achieve better correspondences in the presence of noise. The results obtained so far can be used for initializing a hierarchical matching process in a stereo system in order to provide finer resolution. Further work is directed towards using the proposed method as part of a hierarchical matching algorithm.
References

[1] S.T. Barnard, M.A. Fischler, Computational stereo, ACM Comput. Surveys 14 (4) (1982) 553–572.
[2] U.R. Dhond, J.K. Aggarwal, Structure from stereo - a review, IEEE Trans. Systems, Man, Cybernet. 19 (6) (1989) 1489–1510.
[3] O. Faugeras, P. Fua, B. Hotz, R. Ma, L. Robert, M. Thonnat, Z. Zhang, Quantitative and qualitative comparison of some area and feature-based stereo algorithms, in: Wolfgang Förstner, Stephan Ruwiedel (Eds.), Robust Computer Vision: Quality of Vision Algorithms, Wichmann, Karlsruhe, Germany, 1992, pp. 1–26.
[4] Z. Zhang, Le problème de la mise en correspondance: l'état de l'art, Technical Report RR 2146, INRIA, December 1993.
[5] R. Horaud, Th. Skordas, Stereo correspondence through feature grouping and maximal cliques, IEEE Trans. Pattern Anal. Mach. Intell. 11 (11) (1989) 1168–1180.
[6] N.M. Nasrabadi, A stereo vision technique using curve segments and relaxation matching, IEEE Trans. Pattern Anal. Mach. Intell. 14 (5) (1992) 566–572.
[7] S. Randriamasy, A. Gagalowicz, Region based stereo matching oriented image processing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Lahaina, Hawaii, IEEE Computer Society Press, June 1991, pp. 736–737.
[8] S.B. Marapane, M.M. Trivedi, Region-based stereo analysis for robotics applications, IEEE Trans. Systems, Man, Cybernet. 19 (6) (1989) 1447–1464.
[9] L. Cohen, L. Vinet, P.T. Sander, A. Gagalowicz, Hierarchical region based stereo matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, IEEE Computer Society Press, June 1989, pp. 416–421.
[10] D.S. Kalivas, A.A. Sawchuk, A region matching motion estimation algorithm, Comput. Vision, Graphics Image Processing (Image Understanding) 54 (2) (1991) 275–288.
[11] C.-Y. Lee, D.B. Cooper, D. Keren, Computing correspondence based on region and invariants without feature extraction and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, June 1993, pp. 655–656.
[12] N. Ayache, B. Faverjon, Efficient registration of stereo images by matching graph descriptions of edge segments, Int. J. Comput. Vision 1 (2) (1987) 107–131.
[13] R.C. Bolles, Robust feature matching through maximal cliques, Imaging Appl. Automated Ind. Inspection Assembly 182 (1979) 140–149.
[14] R.C. Bolles, R.A. Cain, Recognizing and locating partially visible objects: the local-feature-focus method, Int. J. Robotics Res. 1 (3) (1982) 57–82.
[15] B. Yang, W.E. Snyder, G.L. Bilbro, Matching oversegmented 3D images to models using association graphs, Image Vision Comput. 7 (2) (1989) 140–149.
[16] E. Balas, C.S. Yu, Finding a maximum clique in an arbitrary graph, SIAM J. Appl. Math. 15 (4) (1986) 126–135.
[17] R.C. Bolles, P. Horaud, 3DPO: a three-dimensional part orientation system, Int. J. Robotics Res. 5 (3) (1986) 3–26.
[18] E.R. Davies, The minimal match graph and its use to speed identification of maximal cliques, Signal Processing 22 (3) (1991) 329–343.
[19] Y. El-Sonbaty, M.A. Ismail, A new algorithm for subgraph optimal isomorphism, Pattern Recognition 31 (2) (1998) 205–218.
[20] L. Herault, R. Horaud, F. Veillon, J.-J. Niez, Symbolic image matching by simulated annealing, in: Proceedings of the First British Machine Vision Conference, Oxford, England, 1990, pp. 319–324.
[21] H.S. Ranganath, L.J. Chipman, Fuzzy relaxation approach for inexact scene matching, Image Vision Comput. 10 (9) (1992) 631–640.
[22] J.L. Mundy, A. Zisserman (Eds.), Geometric Invariance in Computer Vision, MIT Press, Cambridge, MA, 1992.
[23] D.V. Papadimitriou, T.J. Dennis, A stereo disparity algorithm for 3D model construction, in: Image Processing and Its Applications, July 1995, pp. 178–182.
[24] F. Devernay, Vision stéréoscopique et propriétés différentielles des surfaces, Ph.D. thesis, École Polytechnique, October 1996.
[25] T. Froehlinghaus, Stereo images with ground truth disparity and occlusion, http://www-dbv.cs.uni-bonn.de%ft/stereo.html, August 1997.
[26] A. Rosenfeld, A.C. Kak, Digital Picture Processing, Vol. 1, 2nd Edition, Academic Press, New York, 1982.
[27] J. Badenas, M. Bober, F. Pla, Motion and intensity-based segmentation and its applications to traffic monitoring, in: A. del Bimbo (Ed.), Proceedings of the Ninth International Conference on Image Analysis and Processing, Vol. 1310 of Lecture Notes in Computer Science, Florence, Italy, Springer, Berlin, May 1997, pp. 502–509.
[28] R.C. Bolles, H.H. Baker, M.J. Hannah, The JISCT stereo evaluation, in: Proceedings of the Image Understanding Workshop, ARPA, April 1993, pp. 263–274.
[29] A. López, F. Pla, Solving oversegmentation errors in graph-based region matching, in: Proceedings of the Eighth Portuguese Conference on Pattern Recognition, Guimarães, Portugal, March 1996, pp. 387–394.
About the Author - A. LÓPEZ received the degree in Computer Science from the Polytechnic University of Valencia, Spain, in 1990. From 1992 she was at INISEL (National Company on Electronics and Systems), Spain, working on computer science for aeronautics. In 1992 she joined the Department of Computer Science at the University Jaume I, Spain, where she is an Associate Lecturer. Her current research interests include stereo vision, 3D reconstruction, solid modeling and computational geometry.

About the Author - F. PLA received the degree in Physics from the University of Valencia, Spain, in 1989 and the Ph.D. degree in Physics in 1993 from the same university. From 1990 to 1992 he was at the IVIA (Valencian Institute of Agricultural Research), Spain, working on machine vision for robotics in agriculture. He was also at CEMAGREF in Montpellier, France, in 1991, and in 1993 he was at the SRI (Silsoe Research Institute), UK, as a postdoctoral fellow. In 1993 he joined the Department of Computer Science at the University Jaume I, Spain, where he is an Associate Professor. He was also a visiting scientist at the Department of Electrical and Electronic Engineering of the University of Surrey, UK, in 1996. His current research interests include stereo vision, colour image processing, motion analysis and non-parametric pattern recognition techniques.
Pattern Recognition 33 (2000) 1339–1349
Analysis of fuzzy thresholding schemes
C.V. Jawahar*, P.K. Biswas, A.K. Ray
Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India - 721 302
Received 6 February 1998; accepted 17 May 1999
Abstract

Fuzzy thresholding schemes preserve the structural details embedded in the original gray distribution. In this paper, various fuzzy thresholding schemes are analysed in detail. The thresholding scheme based on fuzzy clustering has been extended to a possibilistic framework. The characteristic differences in the membership assignments of the fuzzy algorithms and their correspondence with conventional hard thresholding schemes have been investigated. A possible direction towards unifying a number of hard and fuzzy thresholding schemes has been presented. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image segmentation; Thresholding; Fuzzy partitioning; Possibilistic clustering; Entropy
1. Introduction

Digital image segmentation, one of the most challenging problems in image processing, is very frequently attempted with pattern recognition methods. Segmentation is the process of partitioning an image into a finite set of regions, such that a distinct and well-defined property is associated with each of them. Thresholding, the simplest and most popular strategy for segmentation, refers to the process of partitioning the pixels of an image I = [I_mn]_{M x N}, I_mn ∈ L = {1, 2, ..., L}, defined over a two-dimensional grid G = {(m, n), 0 ≤ m ≤ M-1, 0 ≤ n ≤ N-1}, into object (O) and background (B) regions, i.e.,

O = {(m, n) | I_mn ≥ T},
B = {(m, n) | I_mn < T},    (1)

in such a way that O ∪ B = G and O ∩ B = ∅. Here, T, the discriminant gray value, is the hard threshold. Identification of an optimal threshold T is a complex task. A number of elegant algorithms have been proposed for
* Corresponding author. Centre for Artificial Intelligence and Robotics, Raj Bhavan Circle, High Grounds, Bangalore - 560 001, India. E-mail address:
[email protected] (C.V. Jawahar).
this purpose. They are based on region separability, minimum error, entropy, etc. [1–3]. Another important class of algorithms employs scale-space theory [4,5] for thresholding. Most of these algorithms were initially meant for binary thresholding. The binary thresholding procedure may be extended to a multi-level one with the help of multiple thresholds T_1, T_2, …, T_n to segment the image into n+1 regions [6,7]. Multi-level thresholding based on a multi-dimensional histogram resembles image segmentation algorithms based on pattern clustering.
A hard dichotomization of pixels as in Eq. (1) is extremely difficult when boundaries are fuzzy and regions are ill defined, which is frequently the case in image analysis. Moreover, the imprecision of gray values and the vagueness in various image definitions make the segmentation problem more difficult to manage with deterministic or stochastic image processing schemes. This led to the development of a number of algorithms based on fuzzy set-theoretic concepts [8–10]. Fuzzy thresholding involves the partitioning of an image into two fuzzy sets, Õ and B̃, corresponding to object and background regions, by identifying the membership distributions μ_Õ and μ_B̃ associated with them. A natural extension of Eq. (1) into the fuzzy setting was carried out by Pal [10,11] by defining a 'bright image' characterised by a monotonic membership function μ_Õ,
such that

\mu_{\tilde{O}}(I_{mn}) < 0.5 \ \text{if } I_{mn} < T, \qquad \mu_{\tilde{O}}(I_{mn}) > 0.5 \ \text{if } I_{mn} > T,   (2)

and the crossover point of the membership function corresponds with the hard threshold T. The background region B̃ was considered as the complement of the object region Õ, i.e.,

\mu_{\tilde{O}}(j) + \mu_{\tilde{B}}(j) = 1.0 \quad \forall j \in L.   (3)

They preferred to assign the membership with a standard S-function [11,12]. Huang and Wang [13] proposed a fuzzy thresholding scheme which minimises the fuzziness in the thresholded description and, at the same time, accommodates the variations in the gray values within each of the regions. They assigned memberships as

\mu_{\tilde{O}}(I_{mn}) = 1 / (1 + |v_{\tilde{O}} - I_{mn}|/C) \quad \text{if } I_{mn} \ge T   (4)

and

\mu_{\tilde{O}}(I_{mn}) = 0 \quad \text{if } I_{mn} < T,   (5)

where v_Õ is the mean gray value of the fuzzy object region Õ, and the parameter C controls the amount of fuzziness in the thresholded description. A similar membership assignment is employed for B̃ also. They classified the pixels unequivocally into object or background regions with the help of a hard threshold T, and thereby led to an abrupt discontinuity of the membership distribution in the object and background regions, such that Õ ∪ B̃ ⊆ G and Õ ∩ B̃ = ∅.

In our earlier paper [14], investigations are reported on the suitability of the fuzzy clustering formulation for the fuzzy thresholding process. The appropriateness of fuzzy clustering schemes for thresholding can be asserted from the fact that an optimal fuzzy partition based on fuzzy clustering depicts the substructure embedded in the data set and reflects the gray distribution within the object and background regions. All these formulations assume that the difference in gray level alone leads to two visually apparent distinct regions, and that the gray-level histogram is characterised by two modes which may be close or far apart and/or may have different sizes. These geometrical and statistical characteristics of the histogram play an important role in threshold identification. In this case, the histogram is often expected to be of the form

h_j = f(d(j, v_1), \sigma_1) + \rho\, f(d(j, v_2), \sigma_2), \quad j \in L,   (6)

where H = {h_j} is the histogram of I, with h_j denoting the frequency of occurrence of gray value j. Here ρ corresponds to the ratio of the sizes of the object and background regions in the image, while another parameter c = σ_1/σ_2 denotes the ratio of scatters, and the suffixes 1 and 2 represent the regions B̃ and Õ, respectively. It is assumed here that the 'true' object and background gray values are perturbed by a physical process to form a continuous non-negative function f(·) of gray values with continuous derivatives.

The objective of this paper is to analyse the fuzzy thresholding process based on the membership assignment characteristics, and their correspondence with the classical hard thresholding schemes. A fuzzy thresholding procedure based on possibilistic clustering is proposed, and the implementation aspects of fuzzy thresholding algorithms are discussed. Analysis is carried out on the capabilities of various algorithms to reflect the structural details of the gray distribution of the original image. A step towards unifying various thresholding schemes is also presented.

2. Thresholding based on soft partitioning

Since thresholding is basically a pixel classification problem, fuzzy thresholding formulations are found to be appropriate for this task. In this section, we briefly explain the thresholding procedure based on fuzzy clustering [15] and extend it with the help of possibilistic concepts. The problem of fuzzy clustering is that of partitioning a set of n points X = {x_1, x_2, …, x_n} into c classes ω_1, ω_2, …, ω_c, such that

(a)\ \mu_i(x_j) \in [0, 1], \qquad (b)\ 0 < \sum_{j=1}^{n} \mu_i(x_j) < n, \qquad (c)\ \sum_{i=1}^{c} \mu_i(x_j) = 1.0,   (7)
where μ_i(x_j) is the membership of x_j in the ith class ω_i. A natural extension of fuzzy clustering to segmentation, by considering the gray value alone as a feature, leads to the thresholding formulation. For thresholding, the fuzzy c-means-based objective function to be minimised is

J(H, \mu_{\tilde{O}}, \mu_{\tilde{B}}) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \sum_i \mu_i(I_{mn})^q\, d(v_i, I_{mn}) = \sum_{j=1}^{L} \sum_i h_j\, \mu_i(j)^q\, d(v_i, j).   (8)
The thresholding algorithm assumes an initial partition and goes on iteratively evaluating the region means as

v_i = \sum_{j=1}^{L} j\, h_j\, \mu_i^q(j) \Big/ \sum_{j=1}^{L} h_j\, \mu_i^q(j)   (9)
and the memberships using

\mu_{\tilde{O}}(j) = 1 / \big(1 + (d(v_{\tilde{O}}, j)/d(v_{\tilde{B}}, j))^{1/(q-1)}\big) \quad \text{and} \quad \mu_{\tilde{B}}(j) = 1 - \mu_{\tilde{O}}(j)   (10)

until there is no appreciable change in the partition. More details of this fuzzy thresholding scheme based on fuzzy c-means (fcm), and of its variants for segmenting images having an imbalance in the size and scatter of the object and background regions, are discussed in our earlier paper [14].

The possibilistic clustering algorithm, proposed by Krishnapuram and Keller [16] on the basis of possibilistic theory, relaxes constraint (7c) of the fuzzy partition and provides a soft description of clusters where μ_i(x_j) ∈ [0, 1] denotes the compatibility of element x_j with the ith region. Here, thresholding based on the possibilistic c-means algorithm minimises an objective function

J(H, \mu_{\tilde{O}}, \mu_{\tilde{B}}) = \sum_{j=1}^{L} \sum_i h_j\, \mu_i(j)^q\, d(v_i, j) + \sum_i \eta_i \sum_{j=1}^{L} (1 - \mu_i(j))^q,   (11)

which is similar to Eq. (8), with an additional term to avoid the trivial solution caused by the relaxation of constraint (7c). The parameter η_i ∈ R⁺ may be the same for all i, or it may be estimated for each of the clusters. The modified objective function, along with the constraints of Eqs. 7(a) and 7(b), provides the membership updation formulae, which assume the following form for binary thresholding, i.e., when c = 2:

\mu_{\tilde{O}}(j) = 1 / \big(1 + (d(v_{\tilde{O}}, j)/\eta_{\tilde{O}})^{1/(q-1)}\big) \quad \text{and} \quad \mu_{\tilde{B}}(j) = 1 / \big(1 + (d(v_{\tilde{B}}, j)/\eta_{\tilde{B}})^{1/(q-1)}\big).   (12)

Similar to the thresholding scheme based on fcm, the fuzzy thresholding algorithm based on possibilistic clustering assumes an initial partition and iteratively evaluates the memberships until there is no appreciable change in the partition. In short, these fuzzy thresholding schemes yield a soft partition while minimising the criterion function.
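To make the two update rules concrete, the following minimal sketch (Python with NumPy; the function and variable names are ours, not the authors') evaluates the memberships of Eq. (10) and Eq. (12) over all gray levels, taking the squared gray-level difference as the distance d(v, j).

```python
import numpy as np

def fcm_memberships(levels, v_o, v_b, q=2.0):
    """Eq. (10): fuzzy c-means memberships for the two-region (c = 2) case."""
    levels = np.asarray(levels, dtype=float)
    d_o = (levels - v_o) ** 2                  # d(v_O, j), squared gray-level distance
    d_b = (levels - v_b) ** 2                  # d(v_B, j)
    mu_o = 1.0 / (1.0 + (d_o / np.maximum(d_b, 1e-12)) ** (1.0 / (q - 1.0)))
    return mu_o, 1.0 - mu_o                    # constraint (7c): the two memberships sum to 1

def pcm_memberships(levels, v_o, v_b, eta_o, eta_b, q=2.0):
    """Eq. (12): possibilistic memberships; the two distributions need not sum to 1."""
    levels = np.asarray(levels, dtype=float)
    mu_o = 1.0 / (1.0 + ((levels - v_o) ** 2 / eta_o) ** (1.0 / (q - 1.0)))
    mu_b = 1.0 / (1.0 + ((levels - v_b) ** 2 / eta_b) ** (1.0 / (q - 1.0)))
    return mu_o, mu_b
```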
3. Iterative and non-iterative implementation

There are mainly two approaches to implementing criterion-based thresholding schemes: either searching for the extremum over all possible thresholds, or identifying the optimal combination by iteratively evaluating the criterion function and updating the threshold accordingly. As may be seen from the formulations based on Eq. (8) or Eq. (11), the problem of fuzzy thresholding is that of identifying the minima of J(·) to obtain the optimal
thresholded description in the fuzzy or possibilistic sense. Since the memberships are assigned according to Eqs. (10) and (12) in these two cases, the problem of fuzzy thresholding reduces to the identification of the minima of J(·) for various combinations of v_Õ and v_B̃. Unlike the hard thresholding case, where a search for an extremum of a criterion function is carried out by varying the threshold T alone, fuzzy thresholding requires the variation of both v_Õ and v_B̃ to obtain the optimal partition. In general, the non-iterative implementation of a fuzzy thresholding algorithm requires the following steps:

for all possible parameter vectors of B̃
    for all possible parameter vectors of Õ
    {
        assign memberships μ_B̃ and μ_Õ
        compute the criterion function J(·)
    }
identify the membership distributions corresponding to the extrema of J(·)

Here the two loops range over the parameter vectors associated with the background and object regions; these are v_B̃ and v_Õ in the case of the fuzzy thresholding schemes discussed in the previous section. An alternative iterative formulation closely follows these steps:

1. Initialise the thresholded description μ_Õ and μ_B̃ satisfying Eq. (3).
2. Compute the mean gray values of both regions using Eq. (9).
3. Assign the membership values using Eq. (10) (or Eq. (12)).
4. Repeat steps 2–3 until there is no appreciable change in μ_Õ and μ_B̃.

A look at the prospects of the iterative and the non-iterative implementations of the proposed algorithms supports the efficient iterative implementation, provided it converges. Since the convergence of fuzzy c-means and possibilistic c-means has been proved in the literature [15,16], the following lemma may be stated:

Lemma. Fuzzy thresholding based on soft clustering algorithms converges to a minimum of the objective function Eq. (8) (or Eq. (11)) under repeated updation of the memberships with Eq. (10) (or Eq. (12)) and evaluation of the regional mean gray values using Eq. (9).

Since the formulae for the updation of memberships and the computation of the regional mean gray values are derived from the fuzzy and possibilistic clustering formulations, the algorithm is only a simple Picard iteration, where each step minimises the objective function with respect to only one parameter by keeping the other independent para-
meters "xed and thus lead to a minima of the objective function.
4. Hardening of fuzzy thresholded description

Often the interest of image analysis methodologies is restricted to the extraction of the object from a scene, so as to characterise the object with a set of features. Even though a fuzzy thresholded description is sufficient for this purpose, conventional feature extraction and object recognition methods may not be applicable as such to this description. Thus, in spite of the presence of elegant image analysis techniques developed on the basis of fuzzy thresholded (or segmented) descriptions, hardening schemes are required to make the description useful for conventional object recognition schemes. Typical hardening schemes are proposed below.

Scheme HARD-1: A simple method is to harden the fuzzy thresholded description using Eq. (1) with a crisp threshold chosen between v_B̃ and v_Õ at the crossover point where μ_Õ and μ_B̃ are equal. This provides mutually exclusive and exhaustive object and background regions.

Scheme HARD-2: Another possible hardening procedure is based on α-cuts of fuzzy sets, as

O = \tilde{O}_\alpha = \{(m, n) : \mu_{\tilde{O}}(I_{mn}) \ge \alpha\} \quad \text{and} \quad B = G - O.   (13)

The parameter α ∈ (0, 1] directly controls the size of the object region. As α increases, O approaches the core/skeleton of the object region.

Scheme HARD-3: Both the above hardening schemes are well applicable for thresholding; yet, they are difficult to extend to a general fuzzy segmentation scheme, where the number of classes is more than two and the feature space is multidimensional with a non-linear decision boundary. In such a case, one may harden the classes as

\omega_i = \{x : \mu_{\omega_i}(x) > \mu_{\omega_j}(x)\ \forall j \ne i\}.   (14)

Indeed, there may be cases with μ_{ω_i}(x) = μ_{ω_j}(x), where x may be assigned arbitrarily to ω_i or ω_j according to an appropriate heuristic. Note that such cases are of more academic interest than of practical significance.

Scheme HARD-4: In cases where a deterministic misclassification is very costly, hardening may be carried out as

\omega_i = \{x : \mu_{\omega_i}(x) > \mu_{\omega_j}(x) \ \text{and}\ \mu_{\omega_i}(x) \ge \beta\ \forall j \ne i\}.   (15)

In this case, some of the points may remain unlabelled, according to the parameter β ∈ [0, 1], even after hardening.

These hardening schemes may be employed to validate the applicability of fuzzy thresholding schemes by comparing with the hard thresholded description. An image, shown in Fig. 1(a), is thresholded using the fuzzy thresholding scheme based on fuzzy c-means. The resulting
fuzzy thresholded description is hardened with scheme HARD-2, Eq. (13). The hardened descriptions for α = 0.9, 0.5 and 0.1 are shown in Fig. 1(b), (c) and (d), respectively. From a single thresholded description, the variability in the membership distribution provides a set of hard representations with the variation of the compatibility of pixels to the object region. It may be noted that a good amount of structural information is present in the fuzzy thresholded description which is not available in the classical hard descriptions. This leads to the following proposition.

Proposition. A set of hard thresholded descriptions is embedded in a fuzzy thresholded description characterised by membership distributions μ_Õ, μ_B̃ ∈ [0, 1], since Õ = ∪_α α·Õ_α.

Therefore, the fuzzy thresholding schemes provide the advantage of a better and more detailed representation of the intra-region gray distribution. Thus the fuzzy thresholding formulations provide very useful information for high-level vision.
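The hardening rules above translate directly into array operations. The sketch below assumes the membership values have already been computed (per gray level for HARD-2, per pixel and class for HARD-3/HARD-4); all names are illustrative, not the authors' implementation.

```python
import numpy as np

def hard2_alpha_cut(mu_o_per_gray, image, alpha=0.5):
    """Scheme HARD-2, Eq. (13): object = pixels whose object membership reaches the alpha-cut."""
    mu_o_per_gray = np.asarray(mu_o_per_gray, dtype=float)
    image = np.asarray(image)
    object_mask = mu_o_per_gray[image] >= alpha   # membership looked up by (integer) gray value
    return object_mask, ~object_mask              # O and B = G - O

def hard3_argmax(memberships):
    """Scheme HARD-3, Eq. (14): assign each pixel to its maximum-membership class."""
    return np.argmax(memberships, axis=0)         # memberships has shape (c, M, N)

def hard4_reject(memberships, beta=0.6):
    """Scheme HARD-4, Eq. (15): as HARD-3, but leave low-confidence pixels unlabelled (-1)."""
    labels = np.argmax(memberships, axis=0)
    labels[np.max(memberships, axis=0) < beta] = -1
    return labels
```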
5. Membership assignment philosophy

As fuzzy sets are represented by membership functions, fuzzy thresholding schemes may be characterised by their membership assignment philosophy. The performances of all the reported thresholding algorithms depend, to a large extent, on the underlying assumptions behind their formulations. Thresholding is one of the most preferable segmentation methods if two distinct regions are apparent and the perturbations around the mean gray values of the object and background regions provide a gray-level picture with a bimodal histogram. In this case, the membership assignment scheme should assign maximum membership grades to the mean values of the object and the background regions, and the membership should decrease monotonically as the gray-level distance from the respective means increases. This kind of membership function reflects the true nature of the object and background geometries, referred to as the structural details of the regions in the context of pattern recognition. Most of the reported global thresholding schemes based on the histogram perform extremely well if the object and background regions are generated by identical gray distributions, i.e., both object and background gray distributions are equal in size as well as in scatter. The histogram is considered as the addition of these two distributions.

A comparison of membership distributions will throw some light on the qualitative characteristics of the various methods. Here, two existing fuzzy thresholding schemes with distinct philosophies and the two schemes based on the fuzzy and possibilistic clustering algorithms are compared. To compare the membership assignment philosophy of all the four algorithms, a bimodal histogram
Fig. 1. (a) Original image and its fuzzy thresholded description hardened with α = (b) 0.9, (c) 0.5 and (d) 0.1.
has been considered with identical, well-separated modes, as shown in Fig. 2. The valley of the histogram is the optimal hard threshold. The membership distributions of the object and background regions have been computed using four algorithms, viz. Murthy and Pal [10] (Fig. 3a), Huang and Wang [13] (Fig. 3b), fuzzy c-means (Fig. 3c) and possibilistic c-means (Fig. 3d). It may be seen here that the algorithm by Murthy and Pal identifies the fuzziness in the transition region quite effectively. The other algorithms reflect the structural detail embedded in the scene in a better fashion. Even though Huang and Wang considered the regions as fuzzy, the regions of support of the object and background are found to be mutually exclusive. At the same time, the segmented descriptions based on fuzzy and possibilistic clustering provide fuzzy descriptions with a continuous variation of memberships. The difference between the membership assignment schemes of fuzzy and possibilistic clustering is due to the orthogonality constraint of Eq. (7), which is relaxed in the latter case. While the fuzzy partition preserves the relative geometrical structure, the possibilistic partition, on the other hand, provides the absolute geometrical details of the scene. It may be noted that the algorithm of Murthy and Pal as well as the fuzzy c-means algorithm provide descriptions where Eq. (3) is satisfied, while the
Fig. 2. A histogram with equal size and scatter of background and object regions.
methods proposed by Huang and Wang and by possibilistic clustering satisfy μ_Õ(j) + μ_B̃(j) ∈ (0, 1] and μ_Õ(j) + μ_B̃(j) ∈ (0, 2], respectively.
Fig. 3. Membership assignments using (a) Murthy and Pal (b) Huang and Wang (c) Fuzzy c-means (d) Possibilistic c-means.
6. Analysis of formulations

Another important aspect from which fuzzy thresholding algorithms have to be seen is the basis of their formulation. Though they differ widely in their implementation and results, interestingly, their philosophical origins coincide. It is well known from the cluster validity literature [15] that a particular fuzzy partition is preferred if it is the hardest of all the possible fuzzy partitions with the same set of parameters. The same idea is extended to fuzzy thresholding through the minimisation of gray-level fuzziness in Huang and Wang [13] as well as in Murthy and Pal [10]. They have employed various measures of fuzziness, such as the index of fuzziness, fuzzy entropy, etc., and have identified the optimal threshold as the minimum of these measures with the help of an extensive search. Fuzzy entropy, the most popular measure of fuzziness, is a scalar measure of a fuzzy set, given below:

-\mu(j)\log(\mu(j)) - (1 - \mu(j))\log(1 - \mu(j)).   (16)
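For reference, the fuzziness measure of Eq. (16) accumulated over a histogram can be evaluated as in the short sketch below; the histogram weighting and the normalisation by the pixel count are our reading of how the measure is applied to a thresholded description, not a prescription of the paper.

```python
import numpy as np

def fuzzy_entropy(hist, mu):
    """Histogram-weighted fuzzy entropy of a membership distribution mu(j), after Eq. (16)."""
    h = np.asarray(hist, dtype=float)
    m = np.clip(np.asarray(mu, dtype=float), 1e-12, 1.0 - 1e-12)   # avoid log(0)
    e = -m * np.log(m) - (1.0 - m) * np.log(1.0 - m)
    return float(np.sum(h * e) / np.sum(h))
```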
Analytical comparison of thresholding schemes that are formulated from diverse points of view, to perform the same task, is extremely difficult. From this aspect, it is worth observing that there is a common philosophical basis for all these algorithms. Recently, such an attempt has received attention. Many of the cost functions employed by different algorithms may be considered as closely related. A study of such a unification, limited to three popular thresholding schemes proposed in the literature [1,3,13], has been reported in [17]. Another possible common measure may be an information-theoretic one. It is quite apparent from the formulation itself that thresholding algorithms search for a structure of the gray distributions within the object and background regions. Entropy provides a measure of the deviation of a distribution from a well-defined structure and is useful for such a search [18]. The definition of entropy may vary from case to case. In a probabilistic environment, entropy becomes maximum if p_i = 1/L ∀ i, while in a fuzzy
Fig. 4. Fuzzy thresholding procedure.
environment, entropy [19] reflects the uncertainty in the distribution and is maximum when the set and its complement are equal, i.e., μ_Õ(j) = μ_B̃(j). In fact, it is not essential to depend on these logarithmic definitions of entropy for the present purpose. An algebraic definition of entropy [18,20], such as a function of the deviations (p_i − q_i) of the given distribution from a standard one, is also applicable. Based on the above discussion, it may be possible to unify a number of criterion-function-based thresholding schemes into a single one.
6.1. Generalisation of formulations

In the reported algorithms 1 and 2 of [14], it is the distortion (perturbation) of the gray values from the means of the regional gray distributions which is minimised, while in algorithm 3 of [14], the deviation from the gray distribution modelled with a Gaussian function is minimised. In all the cases, the memberships are assigned such that the thresholded description is less fuzzy, or more hard; in other words, the fuzzy entropy is relatively low.
In general, irrespective of whether it is iterative or non-iterative, a typical fuzzy thresholding scheme assumes the steps shown in Fig. 4. Basically, these algorithms take the optimal fuzzy thresholded description to correspond to the extremum of an objective function J = f(H, μ_Õ, μ_B̃), i.e.,

J = \sum_j h_j\, \mu_{\tilde{O}}(j)\, g(j, m_{\tilde{O}}) + \sum_j h_j\, \mu_{\tilde{B}}(j)\, g(j, m_{\tilde{B}}),   (17)

where m_Õ is the set of parameters associated with the region Õ.

Fcm and pcm [14]: Fuzzy c-means-based thresholding formulations consider g(·) as a measure of scatter from the mean gray values of the regions, i.e., g(j, m_Õ) = (j − v_Õ)². The implementation is iterative, and the convergence is analytically tractable and practically excellent. The fcm- and pcm-based formulations differ only in the membership assignment formulae.

Otsu [3]: Otsu also minimises a similar objective function in a hard setting, i.e., g(j, m_Õ) = (j − v_Õ)² and μ_Õ(j) ∈ {0, 1}. The method basically searches for the minimum of the objective function by evaluating the criterion function for all possible thresholds. These algorithms are optimal in a Bayesian sense when the object and background regions are equal in size and scatter.

Kittler and Illingworth [1]: Kittler and Illingworth assumed that the gray distributions corresponding to the object and background regions are Gaussian in nature, i.e., the function f(·) in Eq. (6) is Gaussian. In this case, g(j, m_Õ) = (p_j − N(j, σ_Õ, v_Õ)), where p_j = h_j / Σ_j h_j and N(·) is the Gaussian distribution.

Huang and Wang [13] and Murthy and Pal [10]: These two methods basically minimise a fuzziness measure of the thresholded description. If fuzzy entropy is considered as such a measure, g(j, m_Õ) = log(μ_Õ(j)). The mode of implementation is an extensive search over the various parameters of the object and background regions. The characteristic difference between these two methods lies in the philosophy of membership assignment.

Kapur et al. [2]: This is an important hard thresholding algorithm based on Shannon's definition of entropy, where minimisation of the sum of the entropies of the object and background gray probabilities yields an optimal partition. Here, g(j, m_Õ) = log(p_j^Õ), where p_j^Õ = h_j / Σ_j μ_Õ(j) h_j and μ_Õ(j) ∈ {0, 1}. They also obtained the hard threshold by searching for the minimum of the objective function over all possible threshold values. It
may be noted that the conventional thresholding schemes are only special cases of fuzzy thresholding with μ_Õ(j) ∈ {0, 1}.

6.2. Performance characterisation

Segmentation provides a means to compress the bulky raw image into a description based on the belongingness of the pixels to a set of regions. It was argued in the previous sections that a fuzzy thresholding scheme incorporates the details of the gray distribution of the object and background regions. Thus it provides the details of the gray distribution even after segmentation, and thereby becomes more useful for high-level vision. In this case, the conventional performance characterisation based on classification accuracy may not be an appropriate choice. Here we propose a performance characterisation criterion

F = \sum_j h_j\, (\hat{\mu}_{\tilde{O}}(j) - \mu_{\tilde{O}}(j))^2 + \sum_j h_j\, (\hat{\mu}_{\tilde{B}}(j) - \mu_{\tilde{B}}(j))^2,   (18)

where \hat{\mu} denotes the true membership and μ represents the membership assigned by the algorithm under consideration. In the case of hard thresholding schemes, μ(j) ∈ {0, 1}. Since one may not know the exact memberships of the gray values in natural scenes, we have considered a number of synthetic histograms, as in our earlier work [14], and the fuzzy error measure is computed for a number of fuzzy and hard thresholding schemes. The results are shown in Table 1. Here the true membership is assumed to peak at v_i and to show Gaussian characteristics with scatter σ_i.

7. Discussions

Most of the thresholding algorithms are useful only when the background and object regions are separable on gray values alone. Yet, not all thresholding algorithms are found to perform equally well for all such scenes. The parameters ρ and c play a crucial role in practice. It may be observed from the formulation of algorithm 1 of [14] that, when ρ = c = 1.0 in Eq. (6), the Bayesian optimal threshold coincides with the membership crossover threshold, which is equidistant from the means of the object and background regions. Since this threshold is equidistant from v_1 and v_2, the algorithm is a favourable choice when the regions are well balanced in size and scatter. In a more general case, this leads to the following lemma:

Lemma 1. Thresholding schemes based on spherical clustering algorithms are not guaranteed to provide optimal (in the Bayesian sense) thresholding when the regions are not well balanced, i.e., ρ ≠ 1 and/or c ≠ 1.

For the histogram model of Eq. (6), given σ_1 = σ_2, the crossover threshold (v_1 + v_2)/2 is coincident with the Bayesian optimal
Table 1
Performance of various thresholding schemes, F (×10)

σ1   σ2   ρ      Otsu    Kittler  Moment  Kapur   Huang   Murthy  FCM     PCM
15   15   1.00   10.67   10.67    10.67   10.67   8.061   8.061   6.137   0.644
15   15   0.50   10.68   10.73    11.18   11.18   8.063   8.063   6.139   0.646
15   15   0.33   10.68   10.79    11.25   11.25   8.064   8.064   6.141   0.648
15    5   1.00   10.68   10.80    20.47   20.47   8.889   8.889   6.771   4.053
15   10   1.00   10.68   10.73    11.16   11.16   8.454   8.454   6.319   1.792
15   10   0.50   10.69   10.69    10.80   10.80   8.586   8.586   6.379   2.178
15   10   0.33   10.69   10.69    10.88   10.88   8.654   8.654   6.409   2.376
15    5   0.33   10.69   10.72    10.70   10.70   9.308   9.308   7.094   5.771
15    5   2.00   10.68   10.90    20.63   20.63   8.611   8.611   6.567   2.914
15    5   3.00   10.68   10.93    17.73   17.73   8.473   8.473   6.463   2.346
15   10   3.00   10.68   10.86    12.16   12.16   8.262   8.262   6.232   1.212
15   10   2.00   10.67   10.84    12.32   12.32   8.321   8.321   6.260   1.410
threshold, where the background and object gray densities are equal, only if ρ is unity. Otherwise, if ρ > 1.0, the crossover threshold will be less than the Bayesian optimal one, and for ρ < 1.0 it will be greater. Indeed, it is assumed that the estimated v_1 and v_2 match the true values which generate the histogram. The observations made here regarding the performance of thresholding schemes are valid for a general segmentation process, and the applicability of such an algorithm to a wider class of images leads to the following assertion: the performance of segmentation algorithms based on hyperspherical partitions depends on (a) the overlap of the regional density functions in the feature space, (b) the proportions in size of the various regions, and (c) the scatter and type of distribution in the feature space.
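As an illustration of how the entries of Table 1 can be produced, the sketch below evaluates the error measure of Eq. (18) for a synthetic histogram, assuming (as stated above) that the true memberships peak at the regional means and fall off with Gaussian characteristics; the exact construction of the reference memberships is our own reading of the text.

```python
import numpy as np

def fuzzy_error(hist, mu_o, mu_b, v1, v2, sigma1, sigma2):
    """Eq. (18): histogram-weighted squared deviation from assumed 'true' memberships."""
    h = np.asarray(hist, dtype=float)
    j = np.arange(len(h), dtype=float)
    true_b = np.exp(-0.5 * ((j - v1) / sigma1) ** 2)   # Gaussian-shaped reference, peaks at v1
    true_o = np.exp(-0.5 * ((j - v2) / sigma2) ** 2)   # peaks at v2
    return float(np.sum(h * (true_o - np.asarray(mu_o)) ** 2)
                 + np.sum(h * (true_b - np.asarray(mu_b)) ** 2))
```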
7.1. Multithresholding

The discussion carried out in the previous sections pertains to binary thresholding alone. The concepts and methods can be extended to a more general setting by considering the image to have c regions, each of them exhibiting a distinct gray property. In such a case, the histogram becomes multimodal, and the segmentation process reduces to the identification of valley points between the modes to partition the scene into distinct regions [6]. A fuzzy segmented description is achieved by minimising

\sum_{i=1}^{c} \sum_{j=1}^{L} h_j\, \mu_i^q(j)\, d(v_i, j)   (19)

with the help of fuzzy c-means, or by finding the optimal possibilistic partition by minimising

\sum_{i=1}^{c} \sum_{j=1}^{L} h_j\, \mu_i^q(j)\, d(v_i, j) + \sum_{i=1}^{c} \eta_i \sum_{j=1}^{L} h_j (1 - \mu_i(j))^q.   (20)

In both cases, for q > 1, the segmented description is soft and reflects the details of the gray distribution. It may be noted that such a segmentation procedure does not guarantee the connectivity of pixels in a region. Fig. 5(a) depicts a scene consisting of distinct objects to be segmented. Fuzzy thresholding based on fuzzy clustering is employed for the purpose, and the resulting image is shown in Fig. 5(b), with different gray shades for distinct regions after hardening with scheme HARD-3. A possibilistic approach has also been tried for the thresholding. It has been found that both algorithms are able to extract the modes of the histogram properly.
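The binary update of Section 2 carries over to c regions with no structural change; the following sketch minimises the fuzzy c-means form of Eq. (19) on a histogram (the possibilistic variant of Eq. (20) differs only in the membership formula). Parameter names and the initialisation are illustrative choices.

```python
import numpy as np

def multilevel_fcm(hist, c, q=2.0, n_iter=100):
    """Minimise Eq. (19): fuzzy c-means partitioning of a gray-level histogram into c regions."""
    h = np.asarray(hist, dtype=float)
    levels = np.arange(len(h), dtype=float)
    v = np.linspace(levels.min(), levels.max(), c)            # initial region means
    mu = np.full((c, len(h)), 1.0 / c)
    for _ in range(n_iter):
        d = (levels[None, :] - v[:, None]) ** 2 + 1e-12       # d(v_i, j), shape (c, L)
        mu = d ** (-1.0 / (q - 1.0))
        mu /= mu.sum(axis=0, keepdims=True)                   # enforce constraint (7c)
        v = (mu ** q * h * levels).sum(axis=1) / (mu ** q * h).sum(axis=1)
    return mu, v
```

Hardening the resulting memberships with scheme HARD-3 (an argmax over i) then gives labelled regions of the kind shown in Fig. 5(b).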
8. Summary
The classical thresholding schemes assign each pixel unequivocally to a region and do not distinguish among the pixels within a region, even if their gray values are different in the original image. Consequently, the hard threshold selection schemes are associated with a loss of structural details on thresholding. On the contrary, the identities of pixels are preserved in the fuzzy partition space, since the membership assigned to a pixel depends on the difference between its gray value and the mean gray value of the region to which it belongs. Fuzzy thresholding schemes can threshold noisy images too. Since thresholding schemes are, in general, sensitive to noise, fuzzy thresholding formulations also suffer in the presence of severe
Fig. 5. Segmentation of a multi-modal scene.
noise. The performance of fuzzy algorithms in the presence of noise requires further careful evaluation.
Fuzzy thresholding formulations based on fuzzy clustering have been extended to a possibilistic framework. The characteristic differences in membership assignment, and the correspondence with conventional hard thresholding schemes, have been investigated here. The possibility of unifying a number of hard and fuzzy thresholding schemes has been presented. It may be observed that the hard thresholding schemes are only special cases of the fuzzy ones. Moreover, as far as the incorporation of the structural details of the gray distributions is concerned, fuzzy algorithms are superior to the conventional schemes.
References
[1] J. Kittler, J. Illingworth, Minimum error thresholding, Pattern Recognition 19 (1986) 41–47.
[2] J.N. Kapur, P.K. Sahoo, A.K.C. Wong, A new method for gray-level picture thresholding using entropy of the histogram, Comput. Vision Graphics Image Process. 29 (1985) 273–285.
[3] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Systems Man Cybernet. 9 (1979) 62–66.
[4] J.M. Jolion, A. Rosenfeld, A Pyramidal Framework for Early Vision, Kluwer, Netherlands, 1994.
[5] T. Lindeberg, Scale-Space Theory in Computer Vision, Kluwer, Netherlands, 1994.
[6] S.S. Reddi, S.F. Rudin, H.R. Keshavan, An optimal multiple threshold scheme for image segmentation, IEEE Trans. Systems Man Cybernet. 14 (1984) 661–665.
[7] S. Wang, R.M. Haralick, Automatic multithreshold selection, Comput. Vision Graphics Image Process. 25 (1984) 46–67.
[8] J.C. Bezdek, S.K. Pal, Fuzzy Models for Pattern Recognition, IEEE Press, New York, 1992.
[9] C.V. Jawahar, A.K. Ray, Fuzzy statistics of digital images, IEEE Signal Process. Lett. 3 (1996) 225–227.
[10] C.A. Murthy, S.K. Pal, Fuzzy thresholding: mathematical framework, bound functions and weighted moving average technique, Pattern Recognition Lett. 11 (1990) 197–206.
[11] S.K. Pal, A. Rosenfeld, Image enhancement and thresholding by optimization of fuzzy compactness, Pattern Recognition Lett. 7 (1988) 77–86.
[12] G. Klir, T. Folger, Fuzzy Sets, Uncertainty and Information, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[13] L.K. Huang, M.J.J. Wang, Image thresholding by minimizing the measures of fuzziness, Pattern Recognition 28 (1995) 41–51.
[14] C.V. Jawahar, P.K. Biswas, A.K. Ray, Investigations on fuzzy thresholding based on fuzzy clustering, Pattern Recognition 30 (1997) 1605–1613.
[15] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[16] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Systems 1 (1993) 98–110.
[17] H. Yan, Unified formulation of a class of image thresholding techniques, Pattern Recognition 29 (1996) 2025–2032.
[18] S. Watanabe, Pattern recognition as a quest for minimum entropy, Pattern Recognition 13 (1981) 381–387.
[19] A. De Luca, S. Termini, A definition of a nonprobabilistic entropy in the setting of fuzzy set theory, Inform. and Control 20 (1972) 301–312.
[20] C.H. Li, C.K. Lee, Minimum cross entropy thresholding, Pattern Recognition 26 (1993) 617–625.
About the Author: C.V. JAWAHAR received the B.Tech. degree in 1991 from the University of Calicut, and the M.Tech. and Ph.D. degrees from IIT Kharagpur in 1994 and 1997, respectively. Presently he is a scientist at the Centre for Artificial Intelligence and Robotics, Bangalore, India. His areas of interest include computer vision, fuzzy set-theoretic approaches to pattern recognition, and texture analysis.
About the Author: P.K. BISWAS completed his B.Tech. (Hons), M.Tech. and Ph.D. in Electronics and Electrical Communication Engineering from the Indian Institute of Technology, Kharagpur, in the years 1985, 1988 and 1991, respectively. From 1985 to 1987 he was with Bharat Electronics Ltd., Ghaziabad, India. Currently, he is an Assistant Professor in the Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur. His areas of interest include pattern recognition, computer vision, parallel processing, distributed control and computer networks.
About the Author: A.K. RAY received the B.E. degree in Electronics and Electrical Communication Engineering from B.E. College in 1975, and the M.Tech. and Ph.D. degrees from IIT Kharagpur. From 1977 to 1980 he was associated with a DST-sponsored research project on Automatic Postal Sorting and Biomedical Pattern Recognition. He joined IIT Kharagpur in 1980, where he is currently an Associate Professor in the Department of Electronics and Electrical Communication Engineering. He has more than 50 research publications in national and international journals and conferences. His main research interests are in computer vision, pattern recognition and fuzzy system analysis. He has investigated a large number of sponsored and consultancy research projects in India on diverse aspects of static and dynamic scene analysis and automatic target tracking.
Pattern Recognition 33 (2000) 1351–1367
Recognition of occluded polyhedra from range images Michael Boshra *, M.A. Ismail Department of Informatics, Electronics Research Institute, Giza, Egypt Department of Computer Science, University of Alexandria, Alexandria, Egypt Received 29 October 1998; accepted 7 May 1999
Abstract
Occlusion remains a major hindrance for automatic recognition of 3-D objects. In this paper, we address the occlusion problem in the context of polyhedral object recognition from range data. A novel approach is presented for object recognition based on sound occlusion-guided reasoning for feature distortion analysis and perceptual organization. This type of reasoning enables us to maximize the amount of information extracted from the scene data, thus leading to robust and efficient recognition. The proposed approach is based on a multi-stage matching process, which attempts to recognize scene objects according to their order in the occlusion hierarchy (i.e., an object is recognized before those that are occluded by it). Such a strategy helps in resolving some occlusion-induced ambiguities in feature distortion analysis. Furthermore, it leads to verification of object/pose hypotheses with greater confidence. Matching is based on a hypothesize-cluster-and-verify approach. Hypotheses are generated using an occlusion-tolerant composite feature, a fork, which is a pair of non-parallel edges that belong to the same surface. Generated hypotheses are then clustered and verified using a robust pixel-based technique. Indexing is performed using distortion-adaptive bounds on a rich set of viewpoint-invariant fork attributes, for high selectivity even in the presence of heavy occlusion. Performance of the system is demonstrated using complex multi-object scenes. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
Keywords: Object recognition; Polyhedral objects; Range data; Occluded scenes; Feature distortion analysis
1. Introduction

Three-dimensional object recognition is a central theme in the field of computer vision. It can be defined as follows. Given a set of model objects and scene data provided by a sensor observing one or more of these objects, the objective is to identify the scene objects and determine their poses (locations and orientations) in the 3-D space. This task involves extracting features from the scene data, and searching for a consistent correspondence between the extracted scene features and those of the model objects. A scene/model feature correspondence is considered consistent if it satisfies the object-rigidity
* Corresponding author. Present address: Visualization and Intelligent Systems Laboratory, B-232 College of Engineering, University of California, Riverside, CA 92521, USA. E-mail address:
[email protected] (M. Boshra).
constraint. Object recognition is complicated by several factors such as data uncertainty, occlusion, complexity of object shapes, and the possible presence of unknown objects (clutter). Dealing with all these factors, in order to achieve high levels of recognition performance, is a major challenge.
The major thrust of this paper is to deal with occlusion in complex multi-object scenes, where objects can arbitrarily occlude each other. Fig. 1 shows a typical multi-object scene. From this scene, we can observe the following effects of occlusion on scene features:
• Some object features are completely hidden. For example, an edge E^s of the base block O^s is hidden by the prism, and an edge of the top block is hidden due to self-occlusion.
• Some object features are distorted. For example, a surface S^s and an edge E^s of one of the blocks are partially visible, due to occlusion by two neighbouring objects, respectively.
Fig. 2. An illustration of a scene fork feature, F^s = (E^s_1, E^s_2, S^s).
Fig. 1. An example of a complex scene involving occlusion.
• Some 'false' features have appeared. For example, an edge E^s is not an intrinsic feature of surface S^s but exists due to occlusion by another object.
• There is a fundamental ambiguity in deciding whether some features are true or false. For example, an edge E^s of surface S^s is either true or false, depending on whether the two adjacent sub-parts belong to the same object or to two separate ones, respectively. Obviously, this ambiguity can only be resolved through model-based analysis.
Classification of scene features as distorted, undistorted or false is critical for achieving robust and efficient recognition. In other words, it is important for the recognition system to distinguish between true and false features, in order to exclude false ones from being used in matching. Furthermore, it is important for the system to distinguish between distorted and undistorted (true) features, in order to adjust indexing and matching parameters appropriately for each case.
In this paper, we address the occlusion problem in the context of a recognition task involving range data and polyhedral model objects. We present an object recognition approach that is based on occlusion-guided reasoning for feature distortion analysis, similar to that outlined above, and perceptual organization. This type of reasoning enables the achievement of robust and efficient recognition through maximizing the amount of information extracted from the scene data. The proposed approach can be outlined as follows. Initially, a perceptual organization process is used to determine groups of features that appear to belong to the same object, as well as occlusion relationships between them. This information is used to guide a multi-stage matching process, which attempts to recognize scene objects according to their order in the occlusion hierarchy (i.e., an object is recognized before those that are occluded by it). This strategy serves two important purposes. Firstly, it helps in resolving the above-mentioned
fundamental ambiguity in true/false feature classification. Secondly, it facilitates verifying object/pose hypotheses with greater confidence, thus leading to a more robust verification performance. Each matching stage determines the identity and pose of the scene object corresponding to a feature group based on a hypothesize-cluster-and-verify approach. Hypotheses are generated using an occlusion-tolerant composite feature, a fork, which is a pair of non-parallel edges that belong to the same surface. An illustration of the fork feature is shown in Fig. 2. Generated hypotheses are then clustered, and verified using a robust pixel-based technique. Indexing into the model database is performed using distortion-adaptive bounds on a rich set of viewpoint-invariant fork attributes, in order to achieve high selectivity even in the presence of significant distortion levels. The chosen bounds are called adaptive because the tightness of most of them is inversely proportional to the extent of fork distortion (the few remaining attribute bounds are distortion-independent). This approach leads to indexing performance that degrades gracefully with the amount of distortion in the scene data.
The remainder of the paper is organized as follows. The next section reviews relevant research. Section 3 presents an overview of the system. Model and scene description schemes are discussed in Sections 4 and 5, respectively. The matching algorithm is described in Section 6, and analyzed in Section 7. In Section 8, we present experimental results to demonstrate the performance of the proposed system. Finally, conclusions are drawn in Section 9.
2. Relevant research

The problem of 3-D object recognition has received extensive attention over the past 15 years (e.g., see surveys [1–4]). Object recognition systems can be classified according to the strategy used to establish consistent correspondence between scene and model features. We have three main approaches:
1. Tree Search: Scene/model correspondence is established by building a search tree, where each node
corresponds to a possible match between a scene feature and a model one. The search tree is pruned by using local constraints and/or the rigidity constraint [5–7].
2. Alignment: Selected subsets of scene features are aligned with corresponding model subsets to generate a number of object/pose hypotheses. These subsets are typically of the minimal size needed to fully determine all pose parameters (in the 3-D case, there are six parameters, three translational and three rotational). The generated hypotheses are then verified by performing exhaustive comparison between scene and model features [8–13].
3. Vote Accumulation: Hypotheses are generated by selecting scene-feature subsets that cover all scene features, and aligning them with model subsets. These hypotheses are then clustered, and the one corresponding to the cluster of largest size is selected [14–16].
Choice of the appropriate approach for a given recognition task depends on several factors, such as the amount of sensor noise, the scene complexity (number and configuration of scene objects), and the dimensionality of the scene data (2-D or 3-D).
Object recognition systems can also be classified into two categories according to how they deal with multi-object scenes. In the first category, systems assume that the input scene consists of either a single object [6,17–19], or several objects that can be correctly separated from each other [20,21]. This assumption enables the utilization of scene perceptual information for improving performance. However, reliance on such an assumption limits the applicability of these systems in complex scenes, where it may be difficult to correctly segment a scene into objects. In the second category, this problem is handled by avoiding the object-segmentation step. Instead, recognition is performed by directly matching scene features with corresponding model ones [7,10]. This approach can perform reasonably well in complex scenes. However, it often leads to a waste of computational resources, and accordingly time inefficiency, due to the large number of scene features that have to be considered simultaneously. This problem is manifested in the form of thrashing behavior (tree search), many unsuccessful hypothesis-verification operations (alignment), or extensive clustering operations (vote accumulation).
Our work can be viewed as an attempt to combine the advantages of the above two approaches, while overcoming the disadvantages. In order to utilize the perceptual information of the scene data, we segment the set of scene features into groups, where each group consists of features that appear to belong to the same object. Considering each group of features independently in a multi-stage matching framework, we break up the original recognition problem into several independent ones of consider-
ably smaller size, thus improving time efficiency. Possible object-segmentation errors are handled through adopting a hypothesize-cluster-and-verify approach. The number of generated hypotheses is significantly reduced through using an attribute-rich feature (the fork) for matching. Thus, our approach is expected to achieve high levels of performance in complex scenes, more than that attainable by either of the two general approaches described above.
3. Overview of the system

Fig. 3 shows a block diagram of the proposed object recognition system. It is composed of three modules: model description, scene description, and matching.
(1) Model description: The purpose of this off-line module is to construct the model database. Each model object is described by an attributed graph, called the model graph or MG, where nodes represent object surfaces and links represent adjacency relationships between them. In addition, a list of all model forks is stored separately. This redundant information is needed to speed up the matching process. Direct access to model forks that are consistent with a given scene fork can be achieved by building a multi-dimensional indexing structure on viewpoint-invariant fork attributes (such a structure is not implemented in the current system).
(2) Scene description: The purpose of this module is to construct a high-level description of the scene data. Such a description is desirable due to the difficulty of interpreting image pixels directly. In our system, we have chosen to describe the scene data by means of an attributed graph, called the scene graph or SG, where nodes represent surfaces extracted from the scene data, and links represent adjacency relationships between them. Advantages of such a representation are: (1) generality, i.e., the capability of describing complex scenes, (2) support for local features, which is important for recognizing occluded objects, and
Fig. 3. Block diagram of the system.
Fig. 4. An example of scene-description graphs: (a) scene, (b) SG, (c) POG.
(3) compatibility with the model-object representation, which facilitates the recognition task. Scene perceptual information is captured by grouping surfaces that appear to belong to the same object to form a number of potential objects (POs). Furthermore, occlusion relationships between scene POs are extracted, and described by means of a directed graph, which is called the potential-object graph (POG). This graph is constructed through reasoning about the types of links in the SG. It is used to guide the multi-stage matching module in deciding which surfaces are to be used at each stage of the match, as will be described below. As an illustration, Fig. 4(a) shows a scene, and Figs. 4(b) and 4(c) show the corresponding SG and POG, respectively.
(3) Matching: This module recognizes scene objects by comparing scene and model descriptions. In addition to the outputs of the scene- and model-description modules, the matcher takes as input the sensed range image, to be used for verification of object/pose hypotheses. As mentioned, matching is a multi-stage process. In each stage, the matcher attempts to recognize an object corresponding to a topmost PO, i.e., a PO that is either not occluded by any other PO (has no predecessors in the POG), or whose occluding POs (predecessor nodes in the POG) have already been recognized in earlier stages. Recognition is performed using a hypothesize-cluster-and-verify approach, where hypotheses are generated through scene/model fork matching, and verified using a pixel-based technique. The output of the system is a directed graph, called the object graph or OG, which depicts recognized objects, as well as occlusion relationships between them. This is contrary to most existing systems, which produce a 'list' of recognized objects. Occlusion relationships are important for providing a better understanding of the scene, which is particularly important in some recognition tasks involving robots, such as bin picking [8]. As an illustration, the result of processing the scene in Fig. 4(a) is either of the two OGs shown in Fig. 5, depending on whether this scene corresponds to an object on top of
Fig. 5. Possible OGs for the scene shown in Fig. 4(a).
another one, or just a single two-part object. Obviously, the feasibility of each interpretation depends on the objects in the model database.
4. Model description

In this section, we describe the model graph, and present the selected fork attributes.

4.1. The model graph

The MG is an attributed graph whose nodes and links represent model surfaces and edges, respectively. The node attributes associated with a model surface S^m are:
1. Surface equation: represented as n^m · x = c^m, where n^m is the outward normal of S^m, and c^m is the distance from the origin to S^m in the direction of n^m,
2. Surface area (a^m): area of S^m, and
3. Surface perimeter (p^m): perimeter of S^m.
On the other hand, the selected link attributes of a model edge E^m, which joins surfaces S^m_i and S^m_j, are:
1. End points (v^m_1, v^m_2): locations of the end points of E^m, and
2. Dihedral angle (θ^m): dihedral angle between S^m_i and S^m_j.
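A compact sketch of the model-graph bookkeeping described above, written here with Python dataclasses; the field names mirror the attributes listed in the text, and the concrete representation is ours rather than the authors'.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class ModelSurface:                      # a node of the MG
    normal: Vec3                         # n^m, outward normal of the plane n^m . x = c^m
    offset: float                        # c^m, distance from the origin along n^m
    area: float                          # a^m
    perimeter: float                     # p^m

@dataclass
class ModelEdge:                         # a link of the MG, joining surfaces i and j
    surface_i: int                       # index into ModelGraph.surfaces
    surface_j: int
    endpoints: Tuple[Vec3, Vec3]         # (v^m_1, v^m_2)
    dihedral_angle: float                # dihedral angle between S^m_i and S^m_j

@dataclass
class ModelGraph:
    surfaces: List[ModelSurface] = field(default_factory=list)
    edges: List[ModelEdge] = field(default_factory=list)
```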
4.2. The model fork
A model fork F^m can be represented as a tuple (E^m_1, E^m_2, S^m), where E^m_1 and E^m_2 are edges of surface S^m. Attributes of F^m can be classified according to whether or not they vary with viewpoint. The selected viewpoint-variant attributes, which are used for generating pose hypotheses, are:
1. Intersection point (p^m): location of the intersection point of E^m_1 and E^m_2 (notice that p^m may lie inside one of the fork edges if S^m is a concave surface, see Fig. 6),
2. Surface normal (n^m): outward normal of S^m, and
3. Edge direction (d^m_i): direction of the vector from p^m to the end points of E^m_i, where i ∈ {1, 2} (note that this direction is ambiguous if p^m lies inside E^m_i; in such a case, we generate two forks corresponding to each possible direction).
On the other hand, the selected viewpoint-invariant fork attributes are:
1. Surface area (a^m): area of S^m,
2. Surface perimeter (p^m): perimeter of S^m,
3. Fork angle (g^m): angle between d^m_1 and d^m_2,
4. Vertex distance (e^m_ij): distance from the intersection point p^m to the jth end of E^m_i, v^m_ij, in the direction of d^m_i (i, j ∈ {1, 2}; edge ends are labeled such that e^m_i1 < e^m_i2),
5. Dihedral angle (θ^m_i): dihedral angle of E^m_i (i ∈ {1, 2}), and
6. Direction flag (b^m): a binary attribute that is set to ±1 if dir(d^m_1 × d^m_2) = ±n^m, respectively, where dir(v) = v/||v||.
Thus, the total number of viewpoint-invariant attributes is 10. As mentioned, these attributes are used for indexing.
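Among the ten invariants, the fork angle and the vertex distances are purely geometric and can be computed from the edge end points and the intersection point, as in the sketch below (the remaining attributes are read off the surface and edge records). When the intersection point lies inside an edge the paper generates two forks; for brevity this sketch simply takes the direction towards the farther end, so it covers only the unambiguous case. All names are illustrative.

```python
import numpy as np

def fork_invariants(p, edge1_ends, edge2_ends):
    """Fork angle (degrees) and sorted vertex distances for a fork with intersection point p.

    p          : intersection point of the two edge lines (3-vector)
    edge1_ends : the two end points of the first edge as a 2 x 3 array; likewise edge2_ends.
    """
    p = np.asarray(p, dtype=float)

    def direction_and_distances(ends):
        vecs = np.asarray(ends, dtype=float) - p        # vectors from p to the two edge ends
        dists = np.linalg.norm(vecs, axis=1)
        direction = vecs[np.argmax(dists)]              # towards the farther end (see caveat above)
        return direction / np.linalg.norm(direction), np.sort(dists)   # e_i1 <= e_i2

    d1, e1 = direction_and_distances(edge1_ends)
    d2, e2 = direction_and_distances(edge2_ends)
    fork_angle = np.degrees(np.arccos(np.clip(np.dot(d1, d2), -1.0, 1.0)))
    return fork_angle, e1, e2
```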
5. Scene description

In this section, we describe the methods used to construct the scene and potential-object graphs.

5.1. Construction of the scene graph

Initially, the input image is segmented into a set of planar surfaces. Approaches for range-image segmentation can be broadly classified into two categories: edge-based (e.g. Refs. [22–24]) and region-based (e.g. Refs. [25,26]). In our work, we have chosen a hierarchical region-based algorithm. This algorithm starts by extracting a number of planar-surface kernels using a quadtree region-splitting scheme (the planarity test used is based on the root-mean-square fit error). Surface kernels are then grown, by merging consistent neighboring kernels and pixels, to form the final set of scene surfaces, which represent the nodes of the SG. The contour of each extracted surface is polygonized into a number of straight edges. Each edge separates the associated surface from a number of adjacent surfaces (possibly one). We split each edge into a set of diedges, where each diedge is a sub-edge that has exactly two incident surfaces. An example of diedges is shown in Fig. 7. Notice that diedges carry link information between adjacent surfaces. The data model corresponding to the SG is shown in Fig. 8. In such a model, a single-to-double-arrowed link denotes a one-to-many relationship, while a double-to-double-arrowed one denotes a many-to-many relationship [27].
Fig. 7. An example of diedges: an edge E^s of surface S^s has three adjacent surfaces, and accordingly it consists of three diedges.
Fig. 6. Examples of a fork: (a) p^m is outside both edges, (b) p^m is inside one of the edges.
Fig. 8. Data model of the scene graph.
Notice that edge and diedge entities in the data model are connected by a many-to-many link, since each diedge corresponds to exactly two edges, and, as mentioned, each edge consists of one or more diedges.
We have chosen the following node attributes, which correspond to a scene surface S^s:
1. Surface equation: represented as n^s · x = c^s, where n^s is the outward normal of S^s and c^s is the distance from the origin to S^s in the direction of n^s,
2. Surface area (a^s): 3-D area of S^s, and
3. Surface perimeter (p^s): sum of the lengths of the true edges that bound S^s (refer to Section 1 for the definition of true and false edges). As will be shown later, the classification of some edges, namely concave edges, may change from true to false as matching progresses. Thus, unlike the area attribute, the value of p^s is dynamic, i.e., it can change during the course of matching.
We have chosen the following link attributes, which correspond to a diedge D^s involving scene surfaces S^s_i and S^s_j:
1. Adjacency type (t^s): Adjacency between S^s_i and S^s_j is classified as follows:
   ◦ Convex (Concave): S^s_i and S^s_j are connected and the dihedral angle between them is less (more) than 180°,
   ◦ Occluding: S^s_i occludes S^s_j,
   ◦ Occluded: S^s_i is occluded by S^s_j, or
   ◦ Unknown: the type of adjacency cannot be determined,
2. End points: locations of the end points of D^s, and
3. Diedge angle (θ^s): This angle is determined depending on the adjacency type t^s, as follows:
   ◦ convex, concave: dihedral angle between S^s_i and S^s_j,
   ◦ occluding (occluded): angle between n^s_i (n^s_j) and the normal to D^s that points towards S^s_j (S^s_i) in a direction parallel to the image plane,
   ◦ unknown: zero.

5.2. True/false edge classification
In this section, we present the rules used to classify scene edges as either true or false. This type of feature distortion analysis is needed for constructing the POG, which will be presented in the next section. Scene edges are classified as either true or false depending on the types of the associated diedges. Let D^s be a diedge that is associated with edges E^s_i and E^s_j. The type of D^s is used to classify E^s_i and E^s_j according to the following rules:
• Convex: Both E^s_i and E^s_j are true (see Fig. 9(a)).
• Occluding: E^s_i is true, while E^s_j is false (see Fig. 9(b)).
• Occluded: E^s_i is false, while E^s_j is true.
• Unknown: Both E^s_i and E^s_j are considered false.
• Concave: This case is ambiguous, as it has three possible interpretations:
  ◦ Interpretation A: the surfaces of E^s_i and E^s_j belong to the same object, which implies that both E^s_i and E^s_j are true (see Fig. 10(a)),
Fig. 9. True (T) and false (F) surface edges for: (a) a convex diedge, and (b) an occluding diedge.
Fig. 10. True (T) and false (F) surface edges for a concave diedge: (a) E^s_i and E^s_j belong to the same object, (b) the surface of E^s_i occludes that of E^s_j, (c) the surface of E^s_i is occluded by that of E^s_j.
assume that it is true. Then, during the course of matching, we have two possibilities. If base and top blocks are two di!erent objects, then EQ will be corH rectly re-classi"ed as false, after recognizing the top block. Otherwise, if both blocks belong to the same object, then the initial classi"cation of EQ (as true) will H not be reversed. 5.3. Construction of the potential-object graph
Fig. 11. A polyhedral scene.
E 䡩 Interpretation B: surface EQ occludes that of EQ, G H implying that EQ is true, while EQ is false (see G H Fig. 10(b)), or E 䡩 Interpretation C: surface EQ is occluded by that of G EQ, implying that EQ is false, while EQ is true (see H G H Fig. 10(c)). The above ambiguity is resolved as follows: E We analyze other diedges of edge, say, EQ. If any of G these diedges is true, then we can conclude that EQ G is also true. Accordingly, we can exclude interpretation C. Furthermore, if all edges in the model database have exactly two incident surfaces, then we can exclude interpretation A as well. For example, let us consider the scene shown in Fig. 11. The ambiguity associated with concave diedge DQ is resolved by examining the other diedge of EQ, DM Q. Since DM Q is a true diedge (beG cause it is occluding), we can conclude that EQ is also G true. Accordingly, DQ can be interpreted as either of the "rst two interpretations, A and B. Furthermore, if all model edges have exactly two incident surfaces, then only interpretation B is valid for DQ, and so we can conclude that EQ is false. H E If ambiguity is still unresolved (e.g., for classi"cation of edge EQ in Fig. 11, in the general case), we utilize the H adopted matching strategy as follows. Initially, since we are interested in recognizing an unoccluded object, we assume that interpretation C is invalid for the given diedge, and accordingly, we classify the associated edge as true. This classi"cation is then revised during the course of matching by applying the following rule: `If the neighboring surface belongs to an object that has already been recognized, then the edge will be classi"ed as false (since interpretations A and B will become invalid), otherwise, the edge will be classi"ed as true (since interpretation C will be assumed invalid due to our interest in recognizing a topmost object)a. Applying this rule to edge EQ in Fig. 11, we initially H
The POG is constructed by "rst forming POs, then inferring occlusion relationships between them. Rules of POG-construction depend on shapes of given model objects, as well as nature of expected scenes. We have found the following pair of rules satisfactory for our purposes: E Node-construction rule: Two surfaces SQ and SQ in the G H SG are said to belong to the same PO, if there exists a path of convex diedges between them. This is an equivalence relation, and so it partitions surfaces into a set of equivalence classes, each of which forms a PO. E Link-construction rule: A directed link is formed from PO to PO , if there exists a surface, SQ3PO , that is I J G I adjacent to another one, SQ3PO , such that the border H J edge of SQ is true, while that of SQ is false. G H Fig. 4 shows an example of the above process. In the POG shown in Fig. 4(c), notice that a directed link is drawn from PO to PO because surface 5 in PO occludes surface 3 in PO . The POG-construction rules are also used for updating the POG during the matching process (see Section 6.1).
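The two construction rules map naturally onto a union-find pass over the convex diedges followed by a scan of the remaining diedges. The following is a minimal sketch under the assumption that edge true/false labels are already available; the SceneGraph/Diedge structures and the edge_is_true helper are the hypothetical ones introduced earlier.

```python
def build_pog(sg, edge_is_true):
    """Partition surfaces into potential objects (POs) and infer occlusion links.

    sg:            a SceneGraph with .surfaces and .diedges
    edge_is_true:  maps (diedge index, surface index) -> True/False label of the
                   border edge of that surface at that diedge (hypothetical helper)
    """
    # Node-construction rule: surfaces joined by a path of convex diedges form one PO.
    parent = list(range(len(sg.surfaces)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for d in sg.diedges:
        if d.adjacency == 'convex':
            union(d.i, d.j)

    po_of = {s: find(s) for s in range(len(sg.surfaces))}

    # Link-construction rule: a directed link PO_k -> PO_l is added when a surface of
    # PO_k meets a surface of PO_l along a diedge whose border edge is true on the
    # PO_k side and false on the PO_l side.
    links = set()
    for idx, d in enumerate(sg.diedges):
        if po_of[d.i] == po_of[d.j]:
            continue
        if edge_is_true(idx, d.i) and not edge_is_true(idx, d.j):
            links.add((po_of[d.i], po_of[d.j]))
        elif edge_is_true(idx, d.j) and not edge_is_true(idx, d.i):
            links.add((po_of[d.j], po_of[d.i]))
    return po_of, links
```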
6. Matching

In this section, we outline the matching algorithm, and explain the processes of hypothesis generation and verification in detail.

6.1. Algorithm

The matching algorithm is outlined in Fig. 12. From this figure, we observe that matching is a multi-stage process. In each stage, we recognize a topmost object (i.e., an object that is either unoccluded, or whose occluding objects have already been recognized in earlier stages). Recognition is performed as follows. Firstly, topmost POs are obtained by sorting the POG topologically [28], and selecting the nodes that are ranked first (we implicitly assume here that occlusion relationships are acyclic, which is arguably the case in most practical scenarios). Notice that we consider all topmost POs for matching, not just one of them, since it is generally not guaranteed that a topmost PO will actually correspond to a topmost scene object. Secondly, for each surface in a selected PO,
Fig. 12. The matching algorithm.
we choose an "optimal" fork using a criterion that depends on fork uniqueness and reliability. Each selected fork is then matched with a subset of model forks, retrieved through indexing, to generate a number of object/pose hypotheses (Section 6.2). Thirdly, for each PO, the generated hypotheses are clustered to form a new set of hypotheses, each of which corresponds to a cluster center. Fourthly, hypotheses are ranked, in descending order, according to the ratio of the corresponding cluster size to the PO size. Such a ranking is expected to place promising hypotheses first, since a valid hypothesis is likely to be generated from all, or most, surfaces in the corresponding PO. Fifthly, hypotheses are passed to a pixel-based verifier according to their ranking, until the valid one is found (Section 6.3). Note that verification is needed after clustering since, in general, it is not guaranteed that
the top-ranked hypothesis will correspond to a topmost object.

6.2. Hypothesis generation

The hypothesis-generation algorithm considers each scene surface (which belongs to a topmost PO) independently. It consists of the following steps:
1. Distortion of true edges is characterized.
2. Distortion-adaptive attribute bounds are computed for all forks that can be extracted from the given surface.
3. The optimal scene fork is selected based on uniqueness and reliability.
4. The optimal fork is used for indexing and matching.
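Combining the stage loop of Section 6.1 with these four steps, the overall procedure can be sketched as follows. All helper names (topological_sort, select_optimal_fork, index_model_forks, and so on) are hypothetical placeholders for the operations described in the text, not the authors' code.

```python
def recognize_scene(pog, scene_graph, model_db, verifier):
    """Multi-stage matching: recognize one topmost object per stage (cf. Fig. 12)."""
    recognized = []
    while pog.has_unrecognized_nodes():
        # Topmost POs: nodes ranked first by a topological sort of the occlusion links.
        topmost = [po for po in topological_sort(pog) if pog.rank(po) == 0]

        hypotheses = []
        for po in topmost:
            for surface in pog.surfaces_of(po):
                # Steps 1-3: characterize edge distortion, compute distortion-adaptive
                # bounds, and select the optimal fork of this surface.
                fork = select_optimal_fork(surface, scene_graph)
                # Step 4: indexing retrieves candidate model forks within the bounds;
                # matching each candidate yields an object/pose hypothesis.
                for model_fork in index_model_forks(model_db, fork.bounds):
                    hypotheses.append((po, match_pose(fork, model_fork)))

        # Cluster per potential object, rank by cluster-size / PO-size ratio, then
        # verify in rank order until a valid hypothesis is found.
        ranked = rank_clusters(cluster_hypotheses(hypotheses), pog)
        valid = next(h for h in ranked if verifier.accepts(h))
        recognized.append(valid)
        pog.update_after_recognition(valid)   # re-apply POG rules / relabel edges
    return recognized
```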
Table 1
Computation of distortion-adaptive bounds on fork attributes. Note that: (1) the thresholds (δa^s, δp^s, δg^s, δe_1^s, δe_2^s, δθ^s) are for accommodating uncertainty, (2) the parameters (max a^m, max p^m, min e^m, max e^m) denote maximum/minimum possible attribute values, (3) the dihedral-angle bounds (those on θ^m) of fork edge E_i^s correspond to the intersection of the dihedral-angle bounds of its diedges, which are computed as shown below, and (4) the two bounds in the concave-diedge case correspond to interpretations A and B, respectively.

Model attribute              Condition                       Scene bounds (condition true)                Scene bounds (condition false)
a^m                          All edges/edge-ends are true    a^s ± δa^s                                   (a^s − δa^s, max a^m)
p^m                          All edges/edge-ends are true    p^s ± δp^s                                   (p^s − δp^s, max p^m)
g^m                          True                            g^s ± δg^s                                   –
e_1^m                        Lower end of E_1^s is true      e_1^s ± δe_1^s                               (min e^m, e_1^s + δe_1^s)
e_2^m                        Upper end of E_2^s is true      e_2^s ± δe_2^s                               (e_2^s − δe_2^s, max e^m)
θ_i^m (t_i^s = Convex)       True                            θ_i^s ± δθ_i^s                               –
θ_i^m (t_i^s = Occluding)    True                            (0, 180° − θ_i^s + δθ_i^s)                   –
θ_i^m (t_i^s = Concave)      True                            (θ_i^s ± δθ_i^s) ∪ (0, θ_i^s − 180° + δθ_i^s)  –
b^m                          True                            b^s                                          –

Each of these steps is described below:
(1) Characterizing edge distortion: Distorted true edges can be distinguished from undistorted ones by analyzing each edge end to determine whether it is true or false (i.e., corresponds to a model edge end or not, respectively). We have three possibilities (cf. Ref. [29]):
1. Both ends are true (the edge is undistorted),
2. One end is true, while the other is false (the edge is distorted on one end), or
3. Both ends are false (the edge is distorted on both ends).
The following simple rule is used for true/false classification of an edge end: "An edge end is considered true if the edge adjacent to it is true, and the near end of the adjacent edge is at a distance less than some threshold; otherwise, it is considered a false end".
(2) Computing distortion-adaptive bounds: Table 1 summarizes the computation of scene-fork bounds. From this table, we can observe that the tightness of most of these bounds depends on both the extent of distortion and the diedge type (e.g., convex, concave). These distortion-adaptive bounds can be viewed as "optimal", in the sense that they capture all relevant information. To the best of our knowledge, such a characteristic is not possessed by existing object recognition systems, which tend to rely on occlusion-insensitive attribute bounds only. In our case, we have only two occlusion-insensitive bounds: those corresponding to the fork angle and the direction flag.
(3) Selecting the optimal fork: Scene forks are evaluated using a function that depends on both uniqueness, which is important for indexing, and reliability, which is needed for matching (to produce accurate pose estimates). In our work, uniqueness and reliability are measured by the
number of tight fork bounds, and the sum of fork-edge lengths, respectively. A bound is said to be tight if the corresponding attribute value can be approximately estimated (e.g., the bounds on the fork angle and on a convex edge, see Table 1). Specifically, the evaluation function is defined as follows:

E(F^s) = 2 T L_u + L_1 + L_2,          (1)

where F^s is a scene fork, T the number of tight bounds in F^s, L_i the length of the ith fork edge (i ∈ {1, 2}), and L_u an upper
bound on the lengths of model edges. It is clear that E(F^s) favors uniqueness over reliability (note that more reliable hypotheses are still obtained through clustering). The fork which achieves the highest score is selected for generating hypotheses.
(4) Fork indexing and matching: Indexing is performed by selecting a subset of model forks whose attribute values lie within the respective bounds of the optimal scene fork. Matching involves determining the relative 3-D poses between the scene fork and the selected model forks. The 3-D pose is computed quite straightforwardly by using the scene/model fork origins to determine the three translational parameters, while using the scene/model surface normals and edge directions to estimate the three rotational parameters.

6.3. Hypothesis verification

Hypothesis verification is a model-based process, which involves comparing the description of a hypothesized model object with the scene description. The scene/model description used in verification can be either feature-based, involving high-level features such as surfaces and
edges [20,30], or pixel-based, involving scene and model images [8,9,31,32]. In our work, we have chosen to adopt pixel-based verification, which has several advantages over feature-based verification, such as: (1) the implementation is much simpler, (2) the time complexity of verifying a hypothesis is independent of scene complexity (number and types of scene features), and (3) it is more suitable for implementation on parallel vision hardware. The standard pixel-based verification scheme using range data can be outlined as follows (see Ref. [8]). Evidence obtained by comparing a pair of corresponding scene and model pixels is classified into three types: (1) Positive evidence: the model and scene pixels are approximately the same, (2) Negative evidence: the model pixel is closer to the sensor than the scene one, or (3) Neutral evidence: the model pixel is farther from the sensor than the scene one. Note that the third case is said to be neutral, since we can not determine whether it is due to either occlusion or hypothesis invalidity. A hypothesis is accepted only if the majority of votes are either positive or neutral. In such a case, the percentage of neutral-evidence pixels provides an estimate of the extent of occlusion. Our technique extends the standard scheme, in an effort to achieve more robust verification performance. The key idea of the proposed technique is to utilize our matching strategy (objects are recognized according to their order in the occlusion hierarchy), in order to verify objects with greater confidence. This is achieved as follows. The neutral-evidence type is split into two subtypes, depending on whether the scene pixel belongs to a previously recognized object:
1. Neutral-minus evidence: The evidence is neutral and the scene pixel does not belong to a previously recognized object. This type of evidence votes against the hypothesis that the object is a topmost one.
2. Neutral-plus evidence: The evidence is neutral and the scene pixel does belong to a previously recognized object. In such a case, we still cannot determine whether this is due to occlusion or hypothesis invalidity. However, the fact that the scene pixel is already recognized tends to "favor" the possibility of occlusion over that of hypothesis invalidity.

Since we are always looking for a topmost object, a hypothesis is accepted only if the majority of votes are either positive or neutral-plus. That is, we accept a hypothesis if

ρ = (N_p + N_n+) / (N_p + N_n+ + N_n− + N_neg)          (2)

is larger than some threshold that is close to unity, where N_p, N_n+, N_n−, and N_neg are the numbers of positive-, neutral-plus-, neutral-minus-, and negative-evidence pixels, respectively. It can be seen that this criterion verifies hypotheses with greater confidence due to the additional constraint that a neutral-evidence pixel has to belong to a previously recognized object in order to count its vote for hypothesis validity. For example, in the beginning of recognition, a hypothesis is accepted only if the majority of votes are strictly positive. In subsequent stages, a neutral-evidence pixel is counted, in addition to positive-evidence pixels, only if it belongs to an object that has already been recognized in an earlier stage. Keeping track of recognized scene pixels is performed by maintaining an object label image. Initially, the pixels in this image are set to zero. Then, when the ith scene object is recognized, the object label image is updated by changing the labels of all pixels that belong to this object to i. This process can also provide the following easy way of determining occlusion relationships between objects: if the old label of a pixel that belongs to object i is j ≠ 0, then we can conclude that object i is directly occluded by object j. This information can then be used to update the POG (see the algorithm in Fig. 12) by adding a directed link from node j to node i.
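A minimal sketch of this voting rule is given below, assuming a rendered model depth image, a sensed scene range image and an object label image as described above; the array conventions, the depth tolerance and the acceptance threshold are illustrative assumptions.

```python
import numpy as np

def verify_hypothesis(model_depth, scene_depth, label_image, tol=1.0, threshold=0.95):
    """Pixel-based verification with the neutral-plus / neutral-minus split (Eq. (2)).

    model_depth: rendered depth of the hypothesized object (np.inf where it does not project)
    scene_depth: sensed range image (depth assumed to increase away from the sensor)
    label_image: 0 where no object has been recognized yet, i > 0 for pixels of object i
    """
    mask = np.isfinite(model_depth)              # pixels covered by the hypothesis
    diff = model_depth[mask] - scene_depth[mask]
    labels = label_image[mask]

    positive = np.abs(diff) <= tol               # model and scene approximately agree
    negative = diff < -tol                       # model pixel closer to the sensor than scene
    neutral = diff > tol                         # model pixel farther: occlusion or invalid
    neutral_plus = neutral & (labels > 0)        # occluding pixel already recognized
    neutral_minus = neutral & (labels == 0)

    n_p, n_np = positive.sum(), neutral_plus.sum()
    n_nm, n_neg = neutral_minus.sum(), negative.sum()
    rho = (n_p + n_np) / max(n_p + n_np + n_nm + n_neg, 1)   # vote ratio of Eq. (2)
    return rho >= threshold, rho
```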
7. Analysis of the system

In this section, we analyze the computational complexity of the proposed system, and compare it with related work.

7.1. Computational complexity

Define the following:
• N^s: number of scene objects,
• N^m: number of model objects,
• R^s: number of scene surfaces,
• K^m: average number of forks per model object,
• α: average ratio of matched-to-total model forks,
• β: average order of the valid hypothesis among the ordered list of hypotheses, and
• V: cost of verifying a hypothesis.

In each stage of the matching algorithm, an average of R^s/N^s scene surfaces are selected for matching (we assume here for simplicity that only a single topmost PO is obtained at each stage). A fork is selected from each scene surface and matched with an average of αN^mK^m model forks to generate object/pose hypotheses. Accordingly, the complexity of hypothesis generation in a single stage is of order O(αN^mK^mR^s/N^s). These hypotheses are then clustered and sorted using algorithms of typical complexities O(n²) and O(n log n), respectively. Thus, the time complexities of clustering and sorting in a single stage are O(N^m(αK^mR^s/N^s)²) and O((αN^mK^mR^s/N^s) log(αN^mK^mR^s/N^s)) (note that clustering is applied separately to each group of hypotheses that belong to the same model
object). Finally, hypotheses are verified according to their ranking, until the valid one is found. This step has complexity of order O(βV), where β is expected to be close to unity due to hypothesis ranking. Since the complexity of clustering is of higher order than those of hypothesis generation, sorting and hypothesis verification, the overall complexity of our system can be expressed as

C = O((N^m/N^s)(αK^mR^s)²).          (3)
As will be experimentally demonstrated in Section 8, α is expected to be very low due to the utilization of all available constraints on the selected fork attributes (refer to Section 6.2). It is interesting to compare the complexity of our system with that of systems that directly match scene features with corresponding model features (refer to Section 2). In order to provide a basis for such a comparison, we analyze the complexity of a hypothetical direct-match-based system, which directly matches scene forks, one per surface, with model ones. Furthermore, we assume that the fork bounds used for indexing are distortion-pessimistic; i.e., they are obtained by assuming that the fork surface is partially occluded and all edge ends are false. Such a pessimistic approach is inevitable in the absence of the feature distortion analysis presented in Section 6.2. It can be shown that the complexity of such a system is

C′ = O(N^m(α′K^mR^s)²),          (4)
where α′ is the average ratio of matched-to-total model forks for this system. The superiority of our system over the direct-match-based one can be seen by comparing their complexity expressions in Eqs. (3) and (4), respectively. Dividing Eq. (4) by Eq. (3), we can estimate the speed gain of our system:

C′/C = O(N^s (α′/α)²).

We can expect α′ to be much larger than α, since indexing in our system is performed using distortion-adaptive bounds, as opposed to distortion-pessimistic bounds in the direct-match-based system. Obviously, this is especially true in the case of unoccluded scene forks. As an example, if we assume the presence of a moderate number of scene objects, N^s = 3, and a conservative estimate of the indexing-selectivity ratio α′/α = 2, we can expect our system to be 12 times faster than the direct-match-based system.

7.2. Comparison with related work

Fan et al. [33] presented an interesting approach for recognition of 2.5-D free-form objects from range data.
This approach resembles ours in the following aspects: (1) scene data are segmented into scene objects (potential objects in our terminology), (2) segmented scene objects are used to guide the matching process, and (3) errors in object segmentation are accommodated by the matcher. It can be described as follows:
• Both scene data and model objects are represented by attributed surface graphs.
• Object segmentation is performed by assigning to each link in the scene graph a number, which represents the belief that the two incident surfaces belong to the same object. This number, which lies between 0 and 1, is chosen to be 1 for convex edges, 0.75 for concave edges, and between 0 and 0.5 for occluding edges, depending on the distance between the incident surfaces. Scene objects are the subgraphs obtained after discarding links whose associated numbers are less than some threshold.
• Indexing is performed by computing coarse differences between a given scene subgraph and the model graphs (e.g., the difference in number of nodes).
• Matching is based on a search technique, where search-tree pruning is performed using unary constraints (e.g., surface area and perimeter), and binary constraints (e.g., the angle between a pair of surface normals). Occlusion-sensitive attribute values (e.g., surface area) are approximately determined by estimating the extent of occlusion, and adjusting the attribute value accordingly. Relatively liberal thresholds are used for accommodating the inaccuracy of estimating the occlusion extent.
A particularly important feature of the above system is its capability of handling free-form objects (as opposed to the relatively simpler polyhedral objects in our case). However, it can be observed that it deals with multi-object scenes in an ad hoc fashion, which fundamentally limits the achievable performance. On the other hand, the system presented in this paper is based on sound occlusion-guided reasoning, and so it is expected to perform robustly and efficiently in complex multi-object scenes. In particular, the proposed system can be compared to the one in Ref. [33] as follows:
• The object-segmentation process in Ref. [33] can perform well if there is a significant variation in depth among scene objects. However, if the objects are stacked on top of each other, then there is a strong possibility that all objects will be considered as a single scene object. On the other hand, the object-segmentation algorithm proposed in this paper (Section 5.3) is expected to have superior performance for this type of scene, which is achieved through occlusion-guided reasoning involving the types of scene diedges.
• Occlusion-sensitive attributes are handled in an approximate fashion in Ref. [33]. In contrast, the
proposed approach is adept in its utilization of these attributes. It starts by attempting to detect occlusion. If there is no occlusion, then the extracted attribute value is used to provide a powerful constraint that leads to high indexing selectivity. Otherwise, if occlusion exists, it neither discards the attribute nor makes approximating assumptions to estimate the actual value. Instead, the extracted value is used as a bound (on the actual value) that is still useful for indexing.
8. Experimental results

In this section, we experimentally demonstrate the performance of the proposed system using complex multi-object scenes. Scene range data are generated synthetically, where uncertainty is introduced by adding Gaussian noise of standard deviation 1 pixel value. The selected model database consists of eight objects, which are shown in Fig. 13.

8.1. A two-object scene

Fig. 14(a) shows the first scene, which consists of object TRUNCATED PRISM resting on another object, BLOCK. The input range image is processed as follows. Firstly, the
Fig. 15. First experiment: (a) SG, (b) POG.
region-based segmentation algorithm outlined in Section 5.1 is applied to extract a number of scene surfaces, which are shown in Fig. 14(b). Secondly, the contour of each extracted surface is polygonized into a number of straight edges (Fig. 14(c)). Thirdly, edges are tracked to determine the corresponding diedges, and construct the SG shown in Fig. 15(a). Fourthly, the POG is constructed by applying the rules stated in Section 5.3 (Fig. 15(b)). This POG consists of two POs, PO_1 and PO_2, which
Fig. 13. Model database.
Fig. 14. First experiment: (a) scene, (b) extracted surfaces, (c) extracted straight edges.
Fig. 16. True and false edges of surfaces 7, 4 and 8.
Table 2
First experiment: evaluation of forks that belong to surfaces 7, 4 and 8 according to Eq. (1) (assuming that L_u = 500). Forks labeled * are those selected as optimal.

Stage  Surface  Fork F^s  T   L_1  L_2  E(F^s)
1      7        (a, b)    9   27   64   9091
1      7        (a, c)    9   27   28   9055
1      7        (a, d)*   10  27   36   10063
1      7        (b, c)    8   64   28   8092
1      7        (c, d)    9   28   36   9064
2      4        (a, b)    7   9    24   7033
2      4        (a, d)    6   9    25   6034
2      4        (b, c)*   7   24   79   7103
2      4        (b, e)    7   24   25   7049
2      4        (c, d)    6   79   25   6104
2      4        (d, e)    6   25   25   6050
2      8        (b, c)*   5   25   31   5056
correspond to scene objects TRUNCATED PRISM and BLOCK, respectively. Notice that a directed link is drawn from PO_1 to PO_2, denoting that PO_1 occludes PO_2, since surfaces 3 and 7 of PO_1 have occluding diedges with surface 4 of PO_2 (see Fig. 14(c)). The matching process consists of two stages involving potential objects PO_1 and PO_2, respectively. Each stage starts by selecting optimal forks from each surface, as explained in Section 6.2. As an illustration, let us consider surface 7 (first stage) and surfaces 4 and 8 (second stage). True and false edges of these surfaces are shown in Fig. 16. Table 2 shows the results of evaluating forks extracted from these surfaces using the function defined in Eq. (1). From this table, we observe that the number of tight bounds of the optimal fork, T, decreases with the extent of occlusion of the corresponding surface. In particular, the value of T is ten (which is the maximum) for the unoccluded surface 7, seven for the lightly occluded surface 4, and only five for the heavily occluded surface 8. Optimal-fork selection is followed by hypothesis generation through fork indexing and matching, as explained in Section 6.2.
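As a quick consistency check of Eq. (1) against Table 2 (with L_u = 500), the listed scores can be reproduced directly; the rows used below are simply entries copied from the table.

```python
L_U = 500  # upper bound on model edge lengths, as assumed in Table 2

def fork_score(T, L1, L2, L_u=L_U):
    # E(F^s) = 2*T*L_u + L_1 + L_2  (Eq. (1))
    return 2 * T * L_u + L1 + L2

# Surface 7, fork (a, d): T = 10, L_1 = 27, L_2 = 36  ->  10063 (selected as optimal)
assert fork_score(10, 27, 36) == 10063
# Surface 8, fork (b, c): T = 5, L_1 = 25, L_2 = 31   ->  5056
assert fork_score(5, 25, 31) == 5056
```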
Table 3
First experiment: results of hypothesis generation

Stage      Surface  T for optimal fork  Matched forks (Generated hypotheses)
1          3        10                  2
1          6        10                  2
1          7        10                  2
2          2        10                  4
2          4        7                   4
2          5        7                   18
2          8        5                   61
Avg./Max.           8.4/10              13.3/412
Results of this process are summarized in Table 3. From this table, we observe that indexing is highly selective: on the average, less than 4% of the model forks are retrieved. This is due to our indexing strategy, which is based on: (1) model retrieval using distortion-adaptive bounds on a rich set of fork attributes, and (2) optimal scene-fork selection using a criterion that favors the most unique (most constraining) forks. We also observe that the number of retrieved model forks is relatively large when using the optimal forks of surfaces 5 and 8. This is due to partial occlusion, in addition to the fact that these scene forks are right angled, a fork type that is abundant in the model database (see Fig. 13). Note that, thanks to our indexing strategy, all scene/model fork matches are successful. Accordingly, the number of generated hypotheses is the same as the number of matches. Generated hypotheses are clustered, ranked, and then verified using the pixel-based verifier described in Section 6.3. In both stages, the valid hypothesis is ranked first. Results of pixel-based verification are summarized in Table 4. The robustness of the proposed technique can be observed in these results. In the first stage, the verifier accepts a hypothesis with mostly positive-evidence pixels, while in the second one, it accepts neutral-evidence pixels
Table 4
First experiment: results of hypothesis verification (N_p, N_n+, N_n−, and N_neg are the numbers of positive-, neutral-plus-, neutral-minus-, and negative-evidence pixels, respectively, and ρ is the vote ratio defined in Eq. (2))

Stage  N_p    N_n+   N_n−  N_neg  ρ
1      23859  0      140   925    0.96
2      35524  14565  380   1307   0.97
only if they correspond to the object that is recognized in the first stage. Fig. 17 shows wireframes of the two valid hypotheses superimposed on the scene image.

Fig. 17. First experiment: wireframes of recognized objects superimposed on the scene image.

To reinforce the complexity analysis presented in Section 7.1, we quantitatively compare the proposed system with the direct-match-based one described in that section. The latter is implemented by modifying our system such that: (1) all surfaces are matched in a single stage, and (2) indexing is performed assuming that surfaces are occluded and all edge ends are false. Running the modified system, we have found that the number of generated hypotheses increases from 93 (6 and 87, for the two stages, respectively, see Table 3) to 422. Furthermore, clustering involves 71688 comparisons, compared to only 2889 in our system (for both stages). Thus, we observe that clustering in our system is about 24 times faster than that in the direct-match-based system. Recall that clustering is the most computationally expensive component in both systems (Section 7.1). This justifies our recognition approach, which is based on occlusion-guided reasoning for feature distortion analysis and perceptual organization.

8.2. A four-object scene
Our second scene, shown in Fig. 18(a), is a substantially complex one, involving four objects stacked on top of each other. Extracted surfaces and polygonized contours are shown in Figs. 18(b) and (c), respectively, while the corresponding SG and POG are shown in Figs. 19(a) and (b), respectively. The POG consists of four POs corresponding to the four scene objects. Notice that, even in such a complex scene, our POG-construction method succeeds in identifying the surfaces which belong to each object, as well as determining the occlusion relationships between them. The matching process consists of four stages. Hypothesis-generation results for all stages are summarized in Table 5. Notice that, thanks to our indexing strategy, the number of scene/model fork matches is kept at a very minimal level, even for occluded surfaces. As in the first experiment, clustering and sorting rank the valid hypothesis first for all stages. Results of verifying these hypotheses, using our pixel-based verifier, are shown in Table 6. Fig. 20 shows the wireframes of the valid hypotheses superimposed on the scene image. As in the first experiment, we compare the performance of the proposed system with the direct-match-based system. Running the direct-match-based system (described in Section 8.1), we have found that the number of generated hypotheses increases from 46 (4, 8, 8 and 26, for the four stages, respectively, see Table 5) to 761.
Fig. 18. Second experiment: (a) scene, (b) extracted surfaces, (c) extracted straight edges.
Fig. 19. Second experiment: (a) SG, (b) POG.
Fig. 20. Second experiment: the four recognized objects superimposed on the scene image.
Table 5
Second experiment: results of hypothesis generation

Stage      Surface  T for optimal fork  Matched forks (Generated hypotheses)
1          4        10                  1
1          10       10                  1
1          11       10                  1
1          14       10                  1
2          2        10                  2
2          12       7                   4
2          13       7                   2
3          7        10                  2
3          8        7                   4
3          9        7                   2
4          3        7                   8
4          5        10                  4
4          6        7                   14
Avg./Max.           8.6/10              3.5/412
Table 6
Second experiment: results of hypothesis verification (N_p, N_n+, N_n−, and N_neg are the numbers of positive-, neutral-plus-, neutral-minus-, and negative-evidence pixels, respectively, and ρ is the vote ratio defined in Eq. (2))

Stage  N_p    N_n+   N_n−  N_neg  ρ
1      30150  0      239   684    0.97
2      40227  1310   196   864    0.98
3      22417  9266   122   505    0.98
4      28500  25135  157   646    0.99

Furthermore, clustering involves 245,436 comparisons, compared to only 259 in the proposed system (for all stages). Thus, clustering is about 947 times faster in our system than in the direct-match-based one. This further reinforces the importance of occlusion-guided reasoning in object recognition.

9. Conclusions and future directions
We have presented a novel approach for robust and efficient recognition of occluded 3-D objects from range data. Robustness and efficiency are achieved through feature distortion analysis and scene perceptual organization based on sound occlusion-guided reasoning. Recognition is performed using a multi-stage matching framework, which attempts to recognize scene objects according to their order in the occlusion hierarchy. Such a strategy helps in resolving an occlusion-induced ambiguity in the true/false classification of concave edges, and sets the stage for robust hypothesis verification. Matching is based on a hypothesize-cluster-and-verify approach, where hypotheses are generated using an occlusion-tolerant and attribute-rich feature, the fork, and verified using a robust pixel-based technique. Highly selective fork indexing is achieved through both deriving distortion-adaptive bounds on the rich set of fork attributes, and choosing forks based on a criterion that favors unique ones. The performance of the proposed approach has been demonstrated both analytically, through complexity analysis, and experimentally, using complex multi-object scenes. The proposed approach has been presented in the context of a recognition task involving polyhedral objects and range data of good quality. Consideration of such a simplified task has helped in focusing the presentation on our main objective, namely, handling occlusion-induced distortion in complex multi-object scenes. Our approach can be generalized to handle more complicated tasks involving complex objects (e.g., quadratic,
free-form), and range data of poor quality (which is the typical case when using real data). This can be achieved by extending the three main modules of our system, shown in Fig. 3, as follows:
• Model description: Complex model objects can still be described using an attributed surface graph, since it is arguably a general representation scheme.
• Scene description: As above, scene data can be described using an attributed surface graph. Rules for true/false feature classification, which are used for construction of the potential-object graph and matching, can be extended to consider non-linear features (e.g., conical surfaces, curved edges). In addition, these rules can be extended to handle errors in extracting features from scene data of poor quality. Examples of the new rules are: (1) a scene edge is considered true only if its length is greater than some threshold (handles a spurious edge), and (2) an end of a true edge is considered true only if the next edge is true and its nearby end is at a distance less than some threshold (handles a missing edge joining the two edges under consideration).
• Matching: The hypothesize-cluster-and-verify approach for matching is still applicable. The only major difference is that the feature sets used for hypothesis generation will be extended to include combinations of linear and non-linear features. Notice that our pixel-based verifier can handle model objects of arbitrary complexity. Furthermore, notice that hypothesis clustering provides some tolerance to errors in the scene-description process (e.g., considering a false feature as a true one).
Development of such a system is an interesting subject for future research.
References [1] F. Arman, J.K. Aggarwal, Model-based object recognition in dense-range images * a review, ACM Comput. Surveys 25 (1) (1993) 5}43. [2] P.J. Besl, R.C. Jain, Three-dimensional object recognition, ACM Comput. Surveys 17 (1) (1985) 75}145. [3] R.T. Chin, C.R. Dyer, Model-based recognition in robot vision, ACM Comput. Surveys 18 (1) (1986) 67}108. [4] P. Suetens, P. Fua, A.J. Hanson, Computational strategies for object recognition, ACM Comput. Surveys 24 (1) (1992) 5}61. [5] O.D. Faugeras, M. Hebert, The representation, recognition and locating of 3-D objects, Int. J. Robotic Res. 5 (3) (1986) 27}52. [6] W.E.L. Grimson, T. Lozano-Perez, Model-based recognition and localization from sparse range or tactile data, Int. J. Robotic Res. 3 (3) (1984) 3}35. [7] W.E.L. Grimson, T. Lozano-Perez, Localizing overlapping parts by searching the interpretation tree, IEEE Trans. Pattern Anal. Mach. Intell. 9 (4) (1987) 469}482.
[8] R.C. Bolles, P. Horaud, 3DPO: a three-dimensional part orientation system, Int. J. Robotic Res. 5 (3) (1986) 3}26. [9] M. Boshra, H. Zhang, An e$cient pixel-based technique for visual veri"cation of 3-D object hypotheses, Proceedings of IEEE International Conference on Robotics and Automation, Minneapolis, Minnesota, 1996, pp. 3472}3477. [10] M. Dhome, M. Richetin, J. Lapreste, G. Rives, Determination of the attitude of 3-D objects from a single perspective view, IEEE Trans. Pattern Anal. Mach. Intell. 11 (12) (1989) 1265}1278. [11] F. Stein, G. Medioni, Structural indexing: e$cient 3-D object recognition, IEEE Trans. Pattern Anal. Mach. Intell. 14 (2) (1992) 125}145. [12] M. Umasuthan, A.M. Wallace, Model indexing and object recognition using 3D viewpoint invariance, Pattern Recognition 30 (9) (1997) 1415}1434. [13] K.C. Wong, J. Kittler, Recognizing polyhedral objects from a single perspective view, Image Vision Comput. 11 (4) (1993) 211}220. [14] B.A. Boyter, J.K. Aggarwal, Recognition of polyhedra from range data, IEEE Expert 1 (1) (1986) 47}59. [15] M. Dhome, T. Kasvand, Polyhedra recognition by hypothesis accumulation, IEEE Trans. Pattern Anal. Mach. Intell. 9 (3) (1987) 429}439. [16] G. Stockman, Object recognition and localization via pose clustering, Comput. Vision Graphics Image Process. 40 (3) (1987) 361}387. [17] K.D. Gremban, K. Ikeuchi, Appearance-based vision and the automatic generation of object recognition programs, in: A.K. Jain, P.J. Flynn (Eds.), Three-Dimensional Object Recognition Systems, Elsevier Science Publishers, Amsterdam, 1993, pp. 229}258. [18] S.B. Kang, K. Ikeuchi, The complex EGI: a new representation for 3-D pose determination, IEEE Trans. Pattern Anal. Mach. Intell. 15 (7) (1993) 707}721. [19] A.K.C. Wong, S.W. Lu, M. Rioux, Recognition and shape synthesis of 3-D objects based on attributed hypergraphs, IEEE Trans. Pattern Anal. Mach. Intell. 11 (3) (1989) 279}290. [20] C.H. Chen, A.C. Kak, A robot vision system for recognizing 3-D objects in low-order polynomial time, IEEE Trans. Systems Man Cybernet. 19 (6) (1989) 1535}1563. [21] W. Kim, A.C. Kak, 3-D object recognition using bipartite matching embedded in discrete relaxation, IEEE Trans. Pattern Anal. Mach. Intell. 13 (3) (1991) 224}251. [22] M. Baccar, L.A. Gee, R.C. Gonzalez, M.A. Abidi, Segmentation of range images via data fusion and morphological watersheds, Pattern Recognition 29 (10) (1996) 1673}1687. [23] V. Chandrasekaran, M. Palaniswami, T.M. Caelli, Range image segmentation by dynamic neural network architecture, Pattern Recognition 29 (2) (1996) 315}329. [24] M.A. Wani, B.G. Batchelor, Edge-region-based segmentation of range images, IEEE Trans. Pattern Anal. Mach. Intell. 16 (3) (1994) 314}319. [25] P.J. Besl, R.C. Jain, Segmentation through variable-order surface "tting, IEEE Trans. Pattern Anal. Mach. Intell. 10 (2) (1988) 167}192. [26] Y. Xinming Yu, T.D. Bui, A. Krzyzak, Robust estimation for range image segmentation and reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. 16 (5) (1994) 530}538.
[27] G. Wiederhold, Database Design, McGraw Hill, New York, 1983.
[28] A.V. Aho, J.E. Hopcroft, J.D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
[29] C. Hansen, T.C. Henderson, CAGD-based computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 11 (11) (1989) 1181–1193.
[30] W.E.L. Grimson, D.P. Huttenlocher, On the verification of hypothesized matches in model-based recognition, IEEE Trans. Pattern Anal. Mach. Intell. 13 (12) (1991) 1201–1213.
[31] H.G. Barrow, J.M. Tenenbaum, R.C. Bolles, H.C. Wolf, Parametric correspondence and chamfer matching: two new techniques for image matching, in: Proceedings of the Fifth International Joint Conference on Artificial Intelligence, Cambridge, MA, 1977, pp. 659–663.
[32] P. Flynn, A.K. Jain, BONSAI: 3-D object recognition using constrained search, IEEE Trans. Pattern Anal. Mach. Intell. 13 (10) (1991) 1066–1075.
[33] T. Fan, G. Medioni, R. Nevatia, Recognizing 3-D objects using surface descriptions, IEEE Trans. Pattern Anal. Mach. Intell. 11 (11) (1989) 1140–1157.
About the Author*MICHAEL BOSHRA received the B.Sc. and M.Sc. degrees in computer science from the University of Alexandria, Egypt, in 1988 and 1992, respectively, and the Ph.D. degree in computing science from the University of Alberta, Canada, in 1997. He is currently a post-doctoral fellow in the Center for Research in Intelligent Systems (CRIS) at the University of California, Riverside. From 1989 to 1992, he worked as a research assistant at the National Research Center, Giza, Egypt. His current research interests include object recognition, sensor fusion, performance prediction, and multi-dimensional indexing structures. About the Author*M.A. ISMAIL is a professor of computer science at the department of computer science, University of Alexandria, Egypt. He has taught computer science and engineering at the University of Waterloo, Canada; UPM, Saudi Arabia; the University of Windsor, Canada; the University of Michigan, USA. His research interests include pattern analysis and machine intelligence, data structures and analysis, medical computer science, and non-traditional databases.
Pattern Recognition 33 (2000) 1369}1382
LAFTER: a real-time face and lips tracker with facial expression recognition

Nuria Oliver*, Alex Pentland, François Bérard
Vision and Modeling, Media Laboratory, MIT, Cambridge, MA 02139, USA
CLIPS-IMAG, BP 53, 38041 Grenoble Cedex 9, France
Received 22 October 1998; accepted 15 April 1999
Abstract

This paper describes an active-camera real-time system for tracking, shape description, and classification of the human face and mouth expressions using only a PC or equivalent computer. The system is based on use of 2-D blob features, which are spatially compact clusters of pixels that are similar in terms of low-level image properties. Patterns of behavior (e.g., facial expressions and head movements) can be classified in real-time using hidden Markov models (HMMs). The system has been tested on hundreds of users and has demonstrated extremely reliable and accurate performance. Typical facial expression classification accuracies are near 100%. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Face and facial features detection and tracking; Facial expression recognition; Active vision; Hidden Markov Models
1. Introduction

This paper describes a real-time system for accurate tracking and shape description, and classification of the human face and mouth using 2-D blob features and hidden Markov models (HMMs). The system described here is real-time, at 20–30 frames per second, and runs on SGI Indy workstations or PentiumPro Personal Computers without any special-purpose hardware. In recent years, much research has been done on machine recognition of human facial expressions. Feature points [1], physical skin and muscle activation models [2–4], optical flow models [5], feature-based models using manually selected features [6], local parametrized optical flow [7], deformable contours [8,9], combined
* Corresponding author. Tel.: +617-253-0608; fax: +617-253-8874.
E-mail addresses: [email protected] (N. Oliver), [email protected] (A. Pentland), [email protected] (F. Bérard).
The active-camera face detection and tracking system has been ported to a PentiumPro using Microsoft Visual C++ under Windows NT. It also works in real-time (30 fps).
with optical flow [10], as well as deformable templates [11–14], among several other techniques, have been used for facial expression analysis. This paper extends these previous efforts to real-time analysis of the human face using our blob tracking methodology. This extension required development of an incremental Expectation Maximization method, a new mixture-of-Gaussians blob model, and a continuous, real-time HMM classification method suitable for classification of shape data. The notion of "blobs" as a representation for image features has a long history in computer vision [15–18], and has had many different mathematical definitions. In our usage it is a compact set of pixels that share a visual property that is not shared by the surrounding pixels. This property could be color, texture, brightness, motion, shading, a combination of these, or any other salient spatio-temporal property derived from the signal (the image sequence). In the work described in this paper blobs are a coarse, locally adaptive encoding of the images' spatial and color/texture/motion/etc. properties. A prime motivation for our interest in blob representations is our discovery that they can be reliably detected and tracked even in complex, dynamic scenes, and that
they can be extracted in real-time without the need for special purpose hardware. These properties are particularly important in applications that require tracking people, and recently we have used 2-D blob tracking for real-time whole-body human interfaces [18] and real-time recognition of American Sign Language hand gestures [19]. Applications of this new system, called LAFTER [20] (Lips and Face TrackER), include video-conferencing, real-time computer graphics animation, and "virtual windows" for visualization. Of particular interest is our ability for accurate, real-time classification of the user's mouth shape without constraining head position; this ability makes possible (for the first time) real-time facial expression recognition in unconstrained office environments. The paper is structured as follows: the general mathematical framework is presented in Section 2; LAFTER's architecture is described in Section 3; the face detection and tracking module appears in Section 4; Section 5 comprises the mouth detection and tracking; mouth expression recognition is in Section 7; results and applications are contained in Section 8 and finally the main conclusions and future work appear in Section 9.
2. Mathematical framework

The notion of grouping atomic parts of a scene together to form blob-like entities based on proximity and visual appearance is a natural one, and has been of interest to visual scientists since the Gestalt psychologists studied grouping criteria early in this century [21]. In modern computer vision processing we seek to group pixels of images together and to "segment" images based on visual coherence, but the "features" obtained from such efforts are usually taken to be the boundaries, or contours, of these regions rather than the regions themselves. In very complex scenes, such as those containing people or natural objects, contour features often prove unreliable and difficult to find and use. The blob representation that we use was developed by Pentland and Kauth et al. [15,16] as a way of extracting an extremely compact, structurally meaningful description of multi-spectral satellite (MSS) imagery. In this method feature vectors at each pixel are formed by adding (x, y) spatial coordinates to the spectral (or textural) components of the imagery. These are then clustered so that image properties such as color and spatial similarity combine to form coherent connected regions, or "blobs", in which all the pixels have similar image properties. This blob description method is, in fact, a special case of recent minimum description length (MDL) techniques [22–25]. We have used essentially the same technique for real-time tracking of people in color video [18]. In that
application the spatial coordinates are combined with color and brightness channels to form a four-element feature vector at each point: (x, y, r_n, g_n) = (x, y, r/(r+g+b), g/(r+g+b)). These were then clustered into blobs to drive a "connected-blob" representation of the person. By using the expectation–maximization [26] (EM) optimization method to obtain Gaussian mixture models for the spatio-chrominance feature vector, very complex shapes and color patterns can be adaptively estimated from the image stream. In our system we use an incremental version of EM, which allows us to adaptively and continuously update the spatio-chromatic blob descriptions. Thus not only can we adapt to very different skin colors, etc., but also to changes in illumination.

2.1. Blobs: a probabilistic representation

We can represent shapes in both 2-D and 3-D by their low-order statistics. Clusters of 2-D points have 2-D spatial means and covariance matrices, which we shall denote q and C_q. The blob spatial statistics are described in terms of their second-order properties. For computational convenience we will interpret this as a Gaussian model. The Gaussian interpretation is not terribly significant, because we also keep a pixel-by-pixel support map showing the actual occupancy. Like other representations used in computer vision and signal analysis, including superquadrics, modal analysis, and eigen-representations, blobs represent the global aspects of the shape and can be augmented with higher-order statistics to attain more detail if the data supports it. The reduction of degrees of freedom from individual pixels to blob parameters is a form of regularization which allows the ill-conditioned problem to be solved in a principled and stable way. For both 2-D and 3-D blobs, there is a useful physical interpretation of the blob parameters in the image space. The mean represents the geometric center of the blob area (2-D) or volume (3-D). The covariance, being symmetric semi-definite positive, can be diagonalized via an eigenvalue decomposition: C = Φ L Φ^T, where Φ is orthonormal and L is diagonal. The diagonal matrix L represents the size of the blob along uncorrelated orthogonal object-centered axes, and Φ is a rotation matrix that brings this object-centered basis into alignment with the coordinate basis of C. This decomposition and physical interpretation is important for estimation, because the shape L can vary at a different rate than the rotation Φ. The parameters must be separated so they can be treated appropriately.

2.2. Maximum likelihood estimation

The blob features are modeled as a mixture of Gaussian distributions in the color (or texture, motion, etc.)
space. The algorithm that is generally employed for learning the parameters of such a mixture model is the Expectation–Maximization (EM) algorithm of Dempster et al. [26,27]. In our system the input data vector d is the normalized R, G, B content of the pixels in the image, x = (r_n, g_n) = (r/(r+g+b), g/(r+g+b)). Our own work [18], or that of Schiele et al. or Hunke et al. [28,29], has shown that use of normalized or chromatic color information (r_n, g_n) = (r/(r+g+b), g/(r+g+b)) can be reliably used for finding flesh areas present in the scene despite wide variations in lighting. The color distribution of each of our blobs is modeled as a mixture of Gaussian probability distribution functions (PDFs) that are iteratively estimated using EM. We can perform a maximum likelihood decision criterion after the clustering is done because human skin forms a compact, low-dimensional (approximately 1-D) manifold in color space. Two different clustering techniques, both derived from EM, are employed: an off-line training process and an on-line adaptive learning process. In order to determine the mixture parameters of each of the blobs, the unsupervised EM clustering algorithm is computed off-line on hundreds of samples of the different classes to be modeled (in our case, face, lips and interior of the mouth), in a similar way as is done for skin color modeling in Ref. [30]. When a new frame is available the likelihood of each pixel is computed using the learned mixture model and compared to a likelihood threshold. Only those pixels whose likelihood is above the threshold are classified as belonging to the model.

2.3. Adaptive modeling via EM

Even though general models make the system relatively user-independent, they are not as good as an adaptive, user-specific model would be. We therefore use adaptive statistical modeling of the blob features to narrow the general model, so that its parameters are closer to the specific users' characteristics. The first element of our adaptive modeling is to update the model priors as soon as the user's face has been detected. Given n independent observations x_i = (r_n, g_n), i = 1, ..., n, of the user's face, we model them as being samples of a Normal distribution in color space with mean the sample mean μ_user and covariance matrix Σ_user. The skin color prior distribution is also assumed to be Normal, p(x | μ_general, Σ_general) = N(μ_general, Σ_general), whose parameters have been computed from hundreds of samples of different users. By applying Bayesian integration of the prior and the user's distributions we obtain a Normal posterior distribution N(μ_post, Σ_post) whose sufficient statistics are given by:

Σ_post^{-1} = Σ_general^{-1} + Σ_user^{-1},
μ_post = Σ_post (Σ_general^{-1} μ_general + Σ_user^{-1} μ_user).          (1)

Eq. (1) corresponds to the computation of the posterior skin color probability distribution from the prior (general) and the user's (learned from the current image samples) models. This update of the skin model occurs only at the beginning of the sequence, assuming that the blob features are not going to drastically change during run time. To obtain a fully adaptive system, however, one must also be able to handle second-to-second changes in illumination and user characteristics. We therefore use an on-line Expectation–Maximization algorithm [31,32] to adaptively model the image characteristics. We model both the background and the face as a mixture of Gaussian distributions with mixing proportions π_i and K components:

p(x | Θ) = Σ_{i=1}^{K} π_i exp(−(1/2)(x − μ_i)^T Σ_i^{-1} (x − μ_i)) / ((2π)^{d/2} |Σ_i|^{1/2}).          (2)

The unknown parameters of such a model are the sufficient statistics of each Normal distribution (μ_i, Σ_i), the mixing proportions π_i and the number of components of the mixture, K. The incremental EM algorithm is data driven, i.e., it estimates the distribution from the data itself. Two update algorithms are needed for this purpose: a criterion for adding new components to the current distribution, as well as an algorithm for computing the sufficient statistics of each Normal Gaussian component. The sufficient statistics are updated by computing an on-line version of the traditional EM update rules. If the first n data points have already been computed, the parameters when data point (n+1) is read are estimated as follows: First, the posterior class probability p(i | x^{n+1}) or responsibility (credit) h_i^{n+1} for a new data point x^{n+1} is computed:

h_i^{n+1} = π_i^n p(x^{n+1} | θ_i^n) / Σ_j π_j^n p(x^{n+1} | θ_j^n).          (3)

This responsibility can be interpreted as the probability that a new data point x^{n+1} was generated by component i. Once this responsibility is known, the sufficient statistics of the mixture components are updated, weighted by the responsibilities:

π_i^{n+1} = π_i^n + (h_i^{n+1} − π_i^n)/n,          (4)

μ_i^{n+1} = μ_i^n + (h_i^{n+1} / (n w_i^n)) (x^{n+1} − μ_i^n),          (5)
Superscript n will refer in the following to the estimated parameters when n data points have already been processed.
σ_i^{n+1} = σ_i^n + (h_i^{n+1} / (n w_i^n)) ((x^{n+1} − μ_i^n)² − σ_i^n),          (6)

where σ_i is the standard deviation of component i and w_i^{n+1} is the average responsibility of component i per point: w_i^{n+1} = w_i^n + (h_i^n − w_i^n)/n. The main idea behind these update rules is to distribute the effect of each new observation to all the terms in proportion to their respective likelihoods. A new component is added to the current mixture model if the most recent observation is not sufficiently well explained by the model. If the last observed data point has a very low likelihood with respect to each of the components of the mixture, i.e. if it is an outlier for all the components, then a new component is added with mean the new data point and with weight and covariance matrix specified by the user. The threshold on the likelihood can be fixed or stochastically chosen. In the latter case the algorithm would randomly choose whether to add a component or not given an outlier. There is a maximum number of components for a given mixture as well. The foreground models are initialized with the off-line unsupervised learned a priori mixture distributions described above. In this way, the algorithm quickly converges to a mixture model that can be directly related to the a priori models' classes. The background models are not initialized with an a priori distribution but are learned on-line from the image.
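The incremental update rules (3)–(6) translate almost line for line into code. The sketch below assumes diagonal covariances, a fixed likelihood threshold for spawning components, and user-supplied initial parameters; these simplifications are ours, and the class is only an illustration of the update equations, not LAFTER's implementation.

```python
import numpy as np

class OnlineGMM:
    """Incremental EM for a mixture of diagonal Gaussians, Eqs. (3)-(6)."""

    def __init__(self, pis, mus, variances, spawn_thresh=1e-4, max_components=10):
        self.pi = np.array(pis, dtype=float)         # mixing proportions pi_i
        self.mu = np.array(mus, dtype=float)         # component means mu_i, shape (K, d)
        self.var = np.array(variances, dtype=float)  # per-dimension variances, shape (K, d)
        self.w = np.ones(len(self.pi))               # average responsibilities w_i
        self.n = 0
        self.spawn_thresh = spawn_thresh
        self.max_components = max_components

    def _component_likelihood(self, x):
        d = self.mu.shape[1]
        diff2 = (x - self.mu) ** 2 / self.var
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.prod(self.var, axis=1))
        return np.exp(-0.5 * diff2.sum(axis=1)) / norm

    def update(self, x):
        self.n += 1
        lik = self._component_likelihood(x)
        # Spawn a new component if x is an outlier for every existing component.
        if lik.max() < self.spawn_thresh and len(self.pi) < self.max_components:
            self._add_component(x)
            return
        # Eq. (3): responsibilities h_i for the new data point.
        h = self.pi * lik
        h /= h.sum()
        step = (h / (self.n * self.w))[:, None]
        diff = x - self.mu                       # uses the old means mu_i^n
        # Eq. (4): pi_i <- pi_i + (h_i - pi_i)/n
        self.pi += (h - self.pi) / self.n
        # Eq. (5): mu_i <- mu_i + h_i/(n w_i) (x - mu_i)
        self.mu += step * diff
        # Eq. (6): var_i <- var_i + h_i/(n w_i) ((x - mu_i)^2 - var_i)
        self.var += step * (diff ** 2 - self.var)
        # Average responsibility per point: w_i <- w_i + (h_i - w_i)/n
        self.w += (h - self.w) / self.n

    def _add_component(self, x, weight=0.05, var=0.01):
        # New component centered at the outlier, with user-specified weight and variance.
        self.pi = np.append(self.pi * (1 - weight), weight)
        self.mu = np.vstack([self.mu, x])
        self.var = np.vstack([self.var, np.full(self.mu.shape[1], var)])
        self.w = np.append(self.w, weight)
```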
optimal linear estimate of the state, but, if all noises are Gaussian, it provides the optimal estimator. In our system to ensure stability of the MAP segmentation process, the spatial parameters for each blob model are "ltered using a zero-order Kalman "lter. For each blob we maintain two independent, zero-order "lters, one for the position of the blob centroid and another for the dimensions of the blob's bounding box. The MAP segmentation loop now becomes:
2.4. MAP segmentation
A generalized version of this technique is employed in Ref. [35] for fusing several concurrent observations. This Kalman "ltering process is used in the tracking of all of the blob features. In our experience the stability of the MAP segmentation process is substantially improved by use of the Kalman "lter, specially given that LAFTER's real-time performance yields small errors in the predicted "lter state vectors. Moreover, smooth estimates of the relevant parameters are crucial for preventing jittering in the active camera, as described in Section 4.2.
Given these models, a MAP foreground-background decision rule is applied to compute support maps for each of the classes, that is, pixel-by-pixel maps showing the class membership of each model. Given several statistical blob models that could potentially describe some particular image data, the membership decision is made by searching for the model with the maximum a posteriori (MAP) probability. Once the class memberships have been determined, the statistics of each class are then updated via the EM algorithm, as described above. This approach can easily be seen to be a special case of the MDL segmentation algorithms developed by Darrell and Pentland [23,24] and later by Ayer and Sawhney [25]. 2.5. Kalman xltering Kalman "lters have extensively been used in control theory as stochastic linear estimators. The Kalman "lter was "rst introduced by Kalman [33] for discrete systems and by Kalman and Bucy [34] for continuous-time systems. The objective is to design an estimator that provides estimates of the non-observable estate of a system taking into account the known dynamics and the measured data. Note here that the Kalman "lter provides the
1. For each blob, predict the filter state vector X* = X̂ and covariance matrix C* = Ĉ + (Δt)W, where the matrix W measures the precision tolerance in the estimation of the vector X and depends on the kinematics of the underlying process.
2. For each blob, new observations Y (e.g., new estimates of the blob centroid and bounding box computed from the image data) are acquired, and the Mahalanobis distance between these observations (Y, C) and the predicted state (X*, C*) is computed. If this distance is below threshold, the filters are updated by taking the new observations into account:
Ĉ = [C*⁻¹ + C⁻¹]⁻¹,  (7)
X̂ = Ĉ[C*⁻¹ X* + C⁻¹ Y].  (8)
Otherwise a discontinuity is assumed and the filters are reinitialized: X̂ = X* and Ĉ = C*.
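As a concrete illustration of the predict/validate/fuse cycle of Eqs. (7) and (8), the sketch below (our own notation, not LAFTER's code) processes one new observation for a single blob parameter vector; the process-noise matrix W, the observation covariance and the validation gate are assumed to be supplied by the caller, and the use of C* + C as the innovation covariance in the Mahalanobis test is our choice.

    import numpy as np

    def kalman_zero_order_step(x_hat, C_hat, y, C_obs, W, dt, gate=9.0):
        """One cycle of the zero-order Kalman filter for blob parameters.

        x_hat, C_hat : previous state estimate and its covariance
        y, C_obs     : new observation and its covariance
        W            : process-noise covariance (precision tolerance)
        gate         : Mahalanobis-distance threshold (assumed value)
        """
        # Prediction: X* = X_hat, C* = C_hat + dt * W
        x_star = x_hat
        C_star = C_hat + dt * W
        # Validation: Mahalanobis distance between observation and prediction
        innov = y - x_star
        S = C_star + C_obs
        d2 = innov @ np.linalg.solve(S, innov)
        if d2 > gate:
            # Discontinuity: re-initialize the filter from the prediction
            return x_star, C_star
        # Fusion in information form, Eqs. (7)-(8)
        C_new = np.linalg.inv(np.linalg.inv(C_star) + np.linalg.inv(C_obs))
        x_new = C_new @ (np.linalg.solve(C_star, x_star) + np.linalg.solve(C_obs, y))
        return x_new, C_new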
2.6. Continuous real-time HMMs
Our approach to temporal interpretation of facial expressions uses Hidden Markov Models (HMMs) [36] to recognize different patterns of mouth shape. HMMs are one of the basic probabilistic tools used for time-series modeling. An HMM is essentially a mixture model in which all the information about the past of the time series is summarized in a single discrete variable, the hidden state. This hidden state is assumed to satisfy a first-order Markov condition: any information about the history of the process needed for future inferences must be reflected in the current state. HMMs fall into our Bayesian framework with the addition of time to the feature vector. They offer dynamic time warping, an efficient learning algorithm and clear
Fig. 1. Graphical representation of real-time left-to-right hidden Markov models.
Bayesian semantics. HMMs have been used prominently and successfully in speech recognition and, more recently, in handwriting recognition. However, their application to visual recognition is more recent [37–40]. HMMs are usually depicted rolled-out in time, as Fig. 1 illustrates. The posterior state sequence probability in an HMM is given by P(S|O) = P_{s_1} p_{s_1}(o_1) ∏_{t=2}^{T} p_{s_t}(o_t) P_{s_t|s_{t−1}}, where
S"+a ,2, a , is the set of discrete states, s 3S corres , R ponds to the state at time t. P &P R G R\ H is the stateGH Q ? Q ? to-state transition probability (i.e. probability of being in state a at time t given that the system was in state a at G H time t!1). In the following we will write them as P R R\. Q Q The prior probabilities for the initial state are expressed as P &P G"P . Finally, p (o )&p R G(o )"p R(o ) are G Q ? Q G R Q ? R Q R the output probabilities for each state. The Viterbi algorithm provides a formal technique for "nding the most likely state sequence associated with a given observation sequence. To adjust the model parameters (transition probabilities A, output probabilities parameters B and prior state probabilities n) such that they maximize the probability of the observation given the model an iterative procedure } such as the Baum}Welch algorithm * is needed. We have developed a real-time HMM system that computes the maximum likelihood of the input sequence with respect to all the models during the testing or recognition phase. This HMM-based system runs in real time on an SGI Indy, with the low-level vision processing occurring on a separate Indy, and communications occurring via a socket interface.
3. System's architecture LAFTER's main processing modules are illustrated in Fig. 2 and will be explained in further detail in the next sections.
Fig. 2. LAFTER's architecture.
The output probability is the probability of observing o_t given state a_i at time t.
Fig. 3. Face detection, per-pixel probability image computation and face blob growing.
4. Automatic face detection and tracking
Our approach to the face finding problem uses coarse color and size/shape information. This approach has advantages over correlation or eigenspace methods, such as speed and rotation invariance under constant illumination conditions. As described in the mathematical framework (Section 2), our system uses an adaptive EM algorithm to accomplish the face detection process. Both the foreground and background classes are learned incrementally from the data. As a trade-off between the adaptation process and speed, new models are updated only when there is a significant drop in the posterior probability of the data given the current model. Two to three mixture components are typically required to accurately describe the face. Mouth models are more complex, often requiring up to five components. This is because the mouth model must include not only the lips, but also the interior (tongue) of the mouth and the teeth.
4.1. Blob growing
After initial application of the MAP decision criterion to the image, isolated and spurious pixels are often misclassified. Thus local pixel information needs to be merged into connected regions that correspond to each of the blobs. The transition from local to global information is achieved by applying a connected component algorithm which grows the blob. The algorithm we use is a speed-optimized version of a traditional connected component algorithm that considers, for each pixel, the values within a neighborhood of a certain radius (which can be varied at run-time) in order to determine whether this pixel belongs to the same connected region. Finally, these blobs are filtered to obtain the best candidate for being a face or a mouth. Color information alone is not robust enough for this purpose. The background, for instance, may contain skin colors that could be grown and erroneously considered as faces. Additional information is thus required. In the current system, geometric information, such as the size and shape of the
object to be detected (faces), is combined with the color information to finally locate the face. In consequence, only those skin blobs whose size and shape (aspect ratio of the bounding box) are closest to the canonical face size and shape are considered. The result is shown in Fig. 3.
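The blob-growing and candidate-selection step can be pictured with the following sketch. It uses a standard connected-component labelling routine in place of the speed-optimized variant described above, and the canonical face area and aspect ratio are placeholder values, not those of the actual system.

    import numpy as np
    from scipy import ndimage

    def best_face_blob(support_map, canonical_area=1200.0, canonical_aspect=1.3):
        """Grow connected skin regions in a binary MAP support map and keep the
        one whose size and bounding-box aspect ratio are closest to a canonical
        face (both canonical values are assumptions for illustration)."""
        labels, n_blobs = ndimage.label(support_map)       # connected components
        best, best_cost = None, np.inf
        for idx, sl in enumerate(ndimage.find_objects(labels), start=1):
            if sl is None:
                continue
            area = float((labels[sl] == idx).sum())
            h = sl[0].stop - sl[0].start
            w = sl[1].stop - sl[1].start
            aspect = h / max(w, 1)
            # simple geometric cost: relative deviation from the canonical face
            cost = (abs(area - canonical_area) / canonical_area
                    + abs(aspect - canonical_aspect) / canonical_aspect)
            if cost < best_cost:
                best, best_cost = (sl, area, aspect), cost
        return best   # bounding-box slice, area and aspect of the best candidate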
4.2. Active camera control
Because our system already maintains a Kalman filter estimate of the centroid and bounding box of each blob, it is a relatively simple matter to use these estimates to control the active camera so that the face of the user always appears in the center of the image and with the desired size. Our system uses an abstraction of the camera control parameters, so that different camera/motor systems (currently the Canon VCC1 and Sony EVI-D30) can be used transparently. In order to increase tracking performance, the camera pan-tilt-zoom control is done by an independent light-weight process (thread) which is started by the main program. The current estimate of the position and size of the user's face provides a reference signal to a PD controller which determines the tilt, pan and zoom of the camera so that the target (face) has the desired size and is at the desired location. The zoom control is relatively simple, because the zoom just has to be increased or decreased until the face reaches the desired size. Pan and tilt speeds are controlled by S_c = (C_e E + C_d dE/dt)/F_z, where C_e and C_d are constants, E is the error, i.e. the distance between the current face position and the center of the image, F_z is the zoom factor, and S_c is the final speed transmitted to the camera. The zoom factor plays a fundamental role in the camera control because the speed with which the camera needs to be adjusted depends on the displacement that a fixed point in the image undergoes for a given rotation angle, which is directly related to the current zoom factor. The relation between this zoom factor and the current camera zoom position follows a non-linear law which needs to be approximated. In our case, a second-order polynomial provides a good approximation. Fig. 4 illustrates the processing flow of the PD controller.
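The speed law above translates directly into code. The following sketch is illustrative only: the gains, the zoom-factor polynomial coefficients and the interface to the camera are placeholders rather than the values used in LAFTER.

    def pan_tilt_speed(error, d_error_dt, zoom_position,
                       c_e=0.8, c_d=0.2, zoom_poly=(1.0, 0.5, 0.1)):
        """PD control of camera speed: S_c = (C_e * E + C_d * dE/dt) / F_z.

        error         : distance (pixels) between face position and image center
        d_error_dt    : time derivative of that error
        zoom_position : current camera zoom position
        zoom_poly     : coefficients (a0, a1, a2) of the second-order polynomial
                        approximating the zoom factor F_z (placeholder values)
        """
        a0, a1, a2 = zoom_poly
        f_z = a0 + a1 * zoom_position + a2 * zoom_position ** 2   # zoom factor
        return (c_e * error + c_d * d_error_dt) / f_z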
extracted during the initialization phase and their statistics are computed, as is depicted in Fig. 5. The second image in the same figure is an example of how the system performs in the case of facial hair. The robustness of the system is increased by computing at each time step the linearly predicted position of the center of the mouth. A confidence level on the prediction is also computed, depending on the prediction error. When the prediction is not available or its confidence level drops below a threshold, the mouth's position is reinitialized.
5.1. Mouth shape
Fig. 4. PD controller.
5. Mouth extraction and tracking
Once the face location and shape parameters are known (center of the face, width, height and image rotation angle), we can use anthropometric statistics to define a bounding box within which the mouth must be located. The mouth is modeled using the same principles as the face, i.e. through a second-order mixture model that describes both its chromatic color and spatial distribution. However, to obtain good performance we must also produce a more finely detailed model of the face region surrounding the mouth. The face model that is adequate for detection and tracking might not be adequate for accurate mouth shape extraction. Our system, therefore, acquires image patches from around the located mouth and builds a Gaussian mixture model. In the current implementation, skin samples of three different facial regions around the mouth are
The mouth shape is characterized by its area, its spatial eigenvalues (e.g., width and height) and its bounding box. Fig. 6 depicts the extracted mouth feature vector. The use of this feature vector to classify facial expressions has been suggested by psychological experiments [41,42], which examined the most important discriminative features for expression classification. Rotation invariance is achieved by computing the face's image-plane rotation angle and rotating the region of interest by the negative of this angle. Therefore, even though the user might turn the head, the mouth always appears nearly horizontal, as Fig. 5 illustrates.
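The rotation-normalization step can be written in one line with a generic image-rotation routine; the sketch below uses scipy rather than the original implementation, and is only illustrative.

    from scipy import ndimage

    def normalize_mouth_roi(roi, face_angle_deg):
        """Rotate the mouth region of interest by the negative of the estimated
        image-plane face rotation so the mouth appears roughly horizontal."""
        return ndimage.rotate(roi, -face_angle_deg, reshape=False, order=1)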
6. Speed, accuracy, and robustness
Running LAFTER on a single SGI Indy with a 200 MHz R4400 processor, the average frame rate for tracking is typically 25 Hz. When mouth detection and parameter extraction are added to the face tracking, the average frame rate is 14 Hz. To measure LAFTER's 3-D accuracy during head motion, the RMS error was measured by having users make large cyclic motions along the X-, Y-, and Z-axes, respectively, with the true 3-D position of the face being determined by manual triangulation. In this experiment the camera actively tracked the face position, with the image-processing/camera-control loop running at a nearly constant 18 Hz. The image size was 1/6 full
Fig. 5. Multi-resolution mouth extraction, skin model learning. Head and mouth tracking with rotations and facial hair.
The mouth extraction and processing is performed on a Region of Interest (ROI) extracted from a full-resolution image (i.e. 640×480 pixels), whereas the face detection and processing is done on an image of 1/6 full resolution, i.e. 106×80 pixels.
Fig. 6. Mouth feature vector extraction.
resolution, i.e. 106×80 pixels, and the camera control law varied pan, tilt, and zoom to place the face in the center of the image at a fixed pixel resolution. Fig. 7 illustrates the active-camera tracking system in action. The RMS error between the true 3-D location and the system's output was computed in pixels and is shown in Table 1. Also shown is the variation in apparent head size, i.e., the system's error in stabilizing the face image size. As can be seen, the system gave quite accurate estimates of 3-D position. Perhaps most important, however, is the robustness of the system. LAFTER has been tested on hundreds of users at many different events, each with its own lighting and environmental conditions. Examples are the Digital Bayou, part of SIGGRAPH '96, the Second International Face & Gesture Workshop (October 1996) and several open houses at the Media Laboratory during the last two years. In all cases the system failed in
Fig. 7. Active camera tracking.
Table 1
Translation and zooming active tracking accuracies

Translation tracking
                   Translation range   X RMS error (pixels)   Y RMS error (pixels)
Static face        0.0 cm              0.5247 (0.495%)        0.5247 (0.6559%)
X translation      ±76 cm              0.6127 (0.578%)        0.8397 (1.0496%)
Y translation      ±28 cm              0.8034 (1.0042%)       1.4287 (1.7859%)
Z translation      ±78 cm              0.6807 (0.6422%)       1.1623 (1.4529%)

Zooming
Width std (pixels): 2.2206 (2.09%)   Height std (pixels): 2.6920 (3.36%)   Size change (pixels): max. size 86×88, min. size 14×20
approximately 3–5% of the cases, when the users had dense beards, extreme skin color or clothing very similar to the skin color models.
7. Mouth-shape recognition
Using the mouth shape feature vector described above, we trained five different HMMs, one for each of the following mouth configurations (illustrated in Fig. 8): neutral or default mouth position, extended/smile mouth, sad mouth, open mouth and extended+open mouth (such as in laughing). The neutral mouth acted to separate the various expressions, much as a silence model acts in speech recognition. The final HMMs we derived for the non-neutral mouth configurations consisted of 4-state forward HMMs. The neutral mouth was modeled by a 3-state forward HMM.
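At recognition time, the observed mouth-feature sequence is scored against each trained HMM and the best-scoring model is chosen. The sketch below illustrates this with a forward-algorithm log-likelihood over discrete observations; the actual system uses continuous output densities, so this is only a schematic illustration under our own naming.

    import numpy as np

    def log_likelihood(log_pi, log_A, log_B, obs):
        """Forward-algorithm log-likelihood of a discrete observation sequence."""
        alpha = log_pi + log_B[:, obs[0]]
        for o in obs[1:]:
            m = alpha.max()   # log-sum-exp over previous states, for each next state
            alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + log_B[:, o]
        m = alpha.max()
        return m + np.log(np.exp(alpha - m).sum())

    def classify_expression(models, obs):
        """models: dict mapping an expression label ('neutral', 'smile', ...) to its
        (log_pi, log_A, log_B) parameters; returns the maximum-likelihood label."""
        scores = {name: log_likelihood(*params, obs) for name, params in models.items()}
        return max(scores, key=scores.get)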
Recognition results for eight different users making over 2000 expressions are summarized in Table 2. The data were divided into different sets for training and testing purposes. The first line of the recognition results shown in Table 2 corresponds to training and testing with all eight users. The total number of examples is N = 2058 instances of the mouth expressions (N = 750 for training and N = 1308 for testing). The second line of the same table corresponds to person-specific training and testing. As can be seen, accurate classification was achieved in each case. In comparison with other facial expression recognition systems, the approach proposed by Matsuno et al. [2] performs extremely well on training data (98.4% accuracy) but more poorly on testing data, with 80% accuracy. They build models of facial expressions from deformation patterns on a potential net computed on training images and subsequent projection into the so-called Emotion Space. Expressions of new subjects are recognized by projecting the image net onto the Emotion Space. Black et al. [7] report an overall average recognition of 92% for six different facial expressions (happiness, surprise, anger, disgust, fear and sadness) in 40 different subjects. Their system combines deformation and motion parameters to derive mid- and high-level descriptions of facial actions. The descriptions depend on a number of thresholds and a set of rules that need to be tuned for each expression and/or subject. The system described in Ref. [43] has a recognition rate of about 74% when using 118 testing images of the seven psychologically recognized categories across several subjects. They use flexible models for representing appearance variations of faces. Essa et al. [44] report 98% accuracy in recognizing five different facial expressions using both peak-muscle activations and spatio-temporal motion energy templates from a database of 52 sequences. An accuracy of 98.7% is reported by Moses et al. [9] on real-time facial expression recognition. Their system detects and tracks the user's mouth, representing it by a valley contour between the lips. A simple classification algorithm is then
Fig. 8. Open, sad, smile and smile-open recognized expressions.
Table 2
Recognition results (%): training and testing data

Train on       Test on: Training   Test on: Testing
All users      97.73               95.95
Single user    100.00              100.00
used to discriminate between five different mouth shapes. They consider only confusions but not false negatives (confusions of any expression to neutral), on two independent samples of about 1000 frames each of a predetermined sequence of five different expressions plus the neutral face. Padgett et al. [45] report 86% accuracy on emotion recognition on novel individuals using neural networks for classification. The recognized emotions are happy, sad, fear, anger, surprise, disgust or neutral across 12 individuals. Finally, the method adopted by Lien et al. [46] is the most similar to ours in terms of the recognition approach, because they also use HMMs. The expression information is extracted by facial feature point tracking (for the lower face, i.e. the mouth) or by pixel-wise flow tracking (for the upper face, i.e. forehead and eyebrows), followed by PCA to compress the data. Their system has an average recognition rate of 93% for the lower face and 91% for the upper face using FACS.
8. Applications
8.1. Automatic camera man
The static nature of current video communication systems induces extra articulatory tasks that interfere with
real-world activity. For example, users must keep their head (or an object of interest) within the field of the camera (or of the microphone) in order to be perceived by distant parties. As a result, the user ends up being more attentive to how to use the interface than to the conversation itself. The communication is therefore degraded instead of enriched. In this sense, LAFTER, with its active-camera face tracking, acts as an 'automatic camera man' that is continuously looking at the user while he/she moves around or gestures in a video-conference session. In informal teleconferencing tests, users have confirmed that this capability significantly improves the usability of the teleconferencing system.
8.2. Experiences with a virtual window system
Some of the limitations of traditional media spaces, with respect to the visual information, are [47]: restricted field of view on remote sites by the video, limited video resolution, spatial discontinuity, medium anisotropy and very restricted movement with respect to remote spaces. Each of these negatively affects the communication in a media space, with movement being one of the most influential, as Gibson emphasized in Ref. [48]. Motion allows us to increase our field of view, can compensate for low resolution, provides information about the three-dimensional layout and allows people to compensate for the discontinuities and anisotropies of current media spaces, among other factors. Therefore, not only is allowing movement in local media spaces a key element for desktop mediated communication and video-conference systems, as we have previously emphasized, but so is the ability to navigate and explore the remote site.
Fig. 9. The virtual window: local head positions are detected by the active tracking camera and used to control a moving camera at the remote site. The effect is that the image on the local monitor changes as if it were a window. The second image illustrates the virtual window system in use.
We found that by incorporating our face tracker into a virtual window system, users could successfully obtain the effect of a window onto another space. To the best of our knowledge this is the first real-time robust implementation of the virtual window. In informal tests, users reported that the LAFTER-based virtual window system gives a good sense of the distant space.
8.3. Real-time computer graphics animation
Fig. 10. Real-time computer graphics animation.
The Virtual Window proposed by Gaver [49] illustrates an alternative approach: as the user moves in front of the local camera, a distant motorized camera is moved accordingly. Exploring a remote site through head movements opens a broad spectrum of possibilities for the design of systems that allow enriched access to remote partners. Fig. 9 depicts an example of a virtual window system. One of the main problems that Gaver recognized in his virtual window system was that its vision controller was too sensitive to lighting conditions and to moving objects. Consequently, the tracking was unstable; users were frustrated and missed the real purpose of the system when experiencing it.
Because LAFTER continuously tracks face location, image-plane face rotation angle, and mouth shape, it is a simple matter to use this information to obtain real-time animation of a computer graphics character. This character can, in its simplest version, constantly mimic what the user does (as if it were a virtual mirror) or, in a more complex system, understand (recognize) what the user is doing and react to it. A 'virtual mirror' version of this system, using the character named Waldorf shown in Fig. 10, was exhibited in the Digital Bayou section of SIGGRAPH'96 in New Orleans.
8.4. Preferential coding
Finally, LAFTER can be used as the front-end to a preferential image coding system. It is well known that people are most sensitive to coding errors in facial features. Thus it makes sense to use a more accurate (and more expensive) coding algorithm for the facial features,
Fig. 11. Preferential coding: the first image is the flat JPEG encoded image (file size 14.1 Kb); the second is a very low resolution JPEG encoded image using flat coding (file size 7.1 Kb); the third is a preferentially coded image with high-resolution JPEG for the eyes and mouth but very low resolution JPEG coding for the face and background (file size 7.1 Kb).
and a less accurate (and cheaper) algorithm for the remaining image data [50–52]. Because the location of these features is detected by our system, we can make use of this coding scheme. The improvement obtained by such a system is illustrated in Fig. 11.
9. Conclusion and future work
In this paper we have described a real-time system for finding and tracking a human face and mouth, and recognizing mouth expressions using HMMs. The system runs on a single SGI Indy computer or PentiumPro Personal Computer, and produces estimates of head position that are surprisingly accurate. The system has been successfully tested on hundreds of naive users in several physical locations and used as the base for several different applications, including an automatic camera man, a virtual window video communications system, and a real-time computer graphics animation system.
10. Summary
This paper describes an active-camera real-time system for tracking, shape description, and classification of the human face and mouth using only a PC or equivalent computer. The system is based on the use of 2-D blob features, which are spatially compact clusters of pixels that are similar in terms of low-level image properties. Patterns of behavior (e.g., facial expressions and head movements) can be classified in real time using Hidden Markov Models (HMMs). The system has been tested on hundreds of users and has demonstrated extremely reliable and accurate performance. Typical facial expression classification accuracies are near 100%. LAFTER has been used as the base for several practical applications, including an automatic camera-man, a virtual window video communications system, and a real-time computer graphics animation system.
References [1] A. Azarbayejani, A. Pentland, Camera self-calibration from one point correspondence, Tech. Rep. 341, MIT Media Lab Vision and Modeling Group, 1995. Submitted IEEE Symposium on Computer Vision. [2] K. Matsuno, P. Nesi, Automatic recognition of human facial expressions, CVPR'95, IEEE, New York 1 (1995) 352}359. [3] K. Waters, A muscle model for animating three-dimensional facial expression, in: M.C. Stone (Ed.), SIGGRAPH '87 Conference Proceedings, Anaheim, CA, July 27}31, 1987, Computer Graphics, Vol. 21, Number 4, July 1987, pp. 17}24.
[4] M. Rydfalk, CANDIDE: a parametrized face, Ph.D. Thesis, Linköping University, EE Dept., October 1987. [5] I. Essa, A. Pentland, Facial expression recognition using a dynamic model and motion energy, ICCV'95 (1995) 360–367. [6] I. Pilowsky, M. Thornton, M. Stokes, Aspects of face processing, Towards the quantification of facial expressions with the use of a mathematical model of a face (1986) 340–348. [7] M. Black, Y. Yacoob, Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion, ICCV'95 (1995) 374–381. [8] R. Magnolfi, P. Nosi, Analysis and synthesis of facial motions, International Workshop on Automatic Face and Gesture Recognition, IEEE, Zurich 1 (1995) 308–313. [9] Y. Moses, D. Reynard, A. Blake, Determining facial expressions in real time, ICCV'95 (1995) 296–301. [10] Y. Yacoob, L. Davis, Recognizing human facial expressions from long image sequences using optical-flow, Pattern Anal. Mach. Intell. 18 (1996) 636–642. [11] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, Int. J. Comput. Vision 1 (1988) 321–331. [12] A. Yuille, P. Hallinan, D. Cohen, Feature extraction from faces using deformable templates, Int. J. Comput. Vision 8 (1992) 99–111. [13] H. Hennecke, K. Venkatesh, D. Stork, Using deformable templates to infer visual speech dynamics, Tech. Rep., California Research Center, June 1994. [14] C. Bregler, S.M. Omohundro, Surface learning with applications to lipreading, Adv. Neural Inform. Process. Systems 6 (1994) 43–50. [15] A. Pentland, Classification by clustering, IEEE Symposium on Machine Processing and Remotely Sensed Data, Purdue, IN, 1976. [16] R. Kauth, A. Pentland, G. Thomas, Blob: an unsupervised clustering approach to spatial preprocessing of MSS imagery, 11th International Symposium on Remote Sensing of the Environment, Ann Arbor, MI, 1977. [17] A. Bobick, R. Bolles, The representation space paradigm of concurrent evolving object descriptions, Pattern Anal. Mach. Intell. 14 (2) (1992) 146–156. [18] C. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: real-time tracking of the human body, Photonics East, SPIE, Vol. 2615, Bellingham, WA, 1995. [19] T. Starner, A. Pentland, Real-time ASL recognition from video using HMMs, Technical Report 375, MIT Media Laboratory, Cambridge, MA 02139. [20] N. Oliver, F. Berard, A. Pentland, LAFTER: lips and face tracking, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR'97), San Juan, Puerto Rico, June 1997. [21] W. Ellis, A Source Book of Gestalt Psychology, Harcourt Brace and Co., New York, 1939. [22] J. Rissanen, Encyclopedia of Statistical Sciences, Minimum-Description-Length Principle, Vol. 5, Wiley, New York, 1987, pp. 523–527. [23] T. Darrell, S. Sclaroff, A. Pentland, Segmentation by minimal description, ICCV'90 (1990) 173–177. [24] T. Darrell, A. Pentland, Cooperative robust estimation using layers of support, Pattern Anal. Mach. Intell. 17 (5) (1995) 474–487.
[25] S. Ayer, H. Sawhney, Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding, ICCV'95. [26] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. 39-B (1977) 1–38. [27] R. Redner, H. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Rev. 26 (1984) 195–239. [28] B. Schiele, A. Waibel, Gaze tracking based on face color, International Workshop on Automatic Face and Gesture Recognition (1995) 344–349. [29] H. Hunke, Locating and tracking of human faces with neural networks, Technical Report, CMU, Pittsburgh, PA, August 1994. [30] T. Jebara, A. Pentland, Parametrized structure from motion for 3D adaptive feedback tracking of faces, CVPR'97 (1997) 144–150. [31] C. Priebe, Adaptive mixtures, J. Amer. Statist. Assoc. 89 (427) (1994) 796–806. [32] D.M. Titterington, Recursive parameter estimation using incomplete data, J. Roy. Statist. Soc. B 46 (1984) 257–267. [33] R. Kalman, A new approach to linear filtering and prediction problems, ASME J. Eng. 82 (1960) 35–45. [34] R. Kalman, R. Bucy, New results in linear filtering and prediction theory, Trans. ASME Ser. D. J. Basic Engng. 83 (1961) 95–107. [35] J. Crowley, F. Berard, Multi-modal tracking of faces for video communications, CVPR'97 (1997) 640–645. [36] L.R. Rabiner, B.H. Juang, An introduction to hidden Markov models, IEEE ASSP Mag. Jan. (1986) 4–16. [37] J. Yamato, J. Ohya, K. Ishii, Recognizing human action in time-sequential images using hidden Markov models, Trans. Inst. Electron. Inform. Commun. Eng. D-II, J76DII (12) (1993) 2556–2563. [38] A. Wilson, A. Bobick, Learning visual behavior for gesture analysis, IEEE International Symposium on Computer Vision, 1995. [39] A. Wilson, A. Bobick, Recognition and interpretation of parametric gesture, International Conference on Computer Vision, 1998.
[40] T. Starner, A. Pentland, Visual recognition of American Sign Language using hidden Markov models, International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995. [41] H. Yamada, Dimensions of visual information for categorizing facial expressions, Japanese Psychol. Res. 35 (4) (1993) 172–181. [42] S. Morishima, Emotion model, International Workshop on Automatic Face and Gesture Recognition, Zurich (1995) 284–289. [43] A. Lanitis, C. Taylor, T. Cootes, A unified approach for coding and interpreting face images, ICCV'95 (1995) 368–373. [44] I.A. Essa, Analysis, Interpretation, and Synthesis of Facial Expressions, Ph.D. Thesis, MIT Department of Media Arts and Sciences, 1995. [45] C. Padgett, G. Cottrell, Representing face images for emotion classification, Neural Information Processing Systems, NIPS'96, Denver, Colorado, USA, 1996. [46] J. Lien, T. Kanade, J. Cohn, A. Zlochower, C. Li, Automatically recognizing facial expressions in the spatio-temporal domain using hidden Markov models, in: Proceedings of the Workshop on Perceptual User Interfaces, PUI'97, Banff, Canada, 1997. [47] W. Gaver, The affordances of media spaces for collaboration, CSCW, 1992. [48] J. Gibson, The Ecological Approach to Visual Perception, Houghton Mifflin, New York, 1979. [49] W. Gaver, G. Smets, K. Overbeeke, A virtual window on media space, CHI, 1995. [50] A. Eleftheriadis, A. Jacquin, Model-assisted coding of video teleconferencing sequences at low bit rates, ISCAS, May–June 1994. [51] K. Ohzeki, T. Saito, M. Kaneko, H. Harashima, Interactive model-based coding of facial image sequence with a new motion detection algorithm, IEICE E79-B (1996) 1474–1483. [52] K. Aizawa, T. Huang, Model-based image coding: advanced video coding techniques for very-low bit-rate applications, Proceedings of the IEEE 83 (1995) 259–271.
About the Author–NURIA M. OLIVER is a Research Assistant in the Vision and Modeling Group at the Media Laboratory of the Massachusetts Institute of Technology, pursuing a Ph.D. in Media Arts and Sciences. She works with Professor Alex Pentland. She received with honors her B.Sc. and M.Sc. degrees in Electrical Engineering and Computer Science from ETSIT at the Universidad Politecnica of Madrid (UPM), Spain, 1994. Before starting her Ph.D. at MIT she worked as a research engineer at Telefonica I+D. Her research interests are computer vision, machine learning and artificial intelligence. Currently she is working on these three disciplines in order to build computational models of human behavior.
About the Author–ALEX (SANDY) P. PENTLAND is the Academic Head of the M.I.T. Media Laboratory. He is also the Toshiba Professor of Media Arts and Sciences, an endowed chair last held by Marvin Minsky. His recent research focus includes understanding human behavior in video, including face, expression, gesture, and intention recognition, as described in the April 1996 issue of Scientific American. He is also one of the pioneers of wearable computing, a founder of the IEEE Wearable Computer technical area, and General Chair of the upcoming IEEE International Symposium on Wearable Computing. He has won awards from the AAAI, IEEE, and Ars Electronica.
About the Author–FRANÇOIS BÉRARD is a Ph.D. student in Computer Science at the CLIPS-IMAG laboratory at the University of Grenoble (France). His research interests concern the development of real-time computer vision systems and their use in the field of human-computer interaction. His research advisors are Professor Joëlle Coutaz and Professor James L. Crowley. He spent two summers working with Prof. Alex Pentland at the MIT Media Laboratory's Vision and Modeling Group and with Michael Black at Xerox PARC's Image Understanding Area.
Pattern Recognition 33 (2000) 1383–1393
Model-based segmentation of nuclei Ge Cong*, Bahram Parvin Information and Computing Science Division, Lawrence Berkeley National Laboratory, MS 50B-2239, 1 Cyclotron Road, Berkeley, CA 94720, USA Received 16 December 1998; accepted 29 April 1999
Abstract A new approach for the segmentation of nuclei observed with an epi-fluorescence microscope is presented. The proposed technique is model based and uses local feature activities in the form of step-edge segments, roof-edge segments, and concave corners to construct a set of initial hypotheses. These local feature activities are extracted using either local or global operators and the corresponding hypotheses are expressed as hyperquadrics. A neighborhood function is defined over these features to initiate the grouping process. The search space is expressed as an assignment matrix with an appropriate cost function to ensure local and neighborhood consistency. Each possible configuration of nuclei defines a path, and the path with the least overall error is selected for final segmentation. The system is interactive to allow rapid localization of large numbers of nuclei. The operator then eliminates a small number of false alarms and errors in the segmentation process. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. Keywords: Segmentation; Grouping; Hyperquadric; Medical image processing; Shape recovery
1. Introduction
Automatic delineation of cell nuclei is an important step in mapping functional activities onto structural components in cell biology. This paper examines the delineation of individual nuclei that are observed with an epi-fluorescence microscope. The nuclei that we are dealing with are in mammary cells. These cells cover the capillaries that carry milk in the breast tissues. The nuclei of interest reside in a thin layer that surrounds a particular type of capillary in the tissue. The intent is to build the necessary computational tools for large-scale population studies and hypothesis testing. These nuclei may be clumped together, thus making quick delineation infeasible. At the present stage, we are working on 2D cross-section images of the tissue, which are obtained by focusing the optical system at specific locations along the z-axis. Thus, we can assume that the nuclei abut but do not overlap
* Corresponding author. Tel.: +510-486-4158; fax: +510-486-6363. E-mail address:
[email protected] (G. Cong).
each other. An example is shown in Fig. 1(a). Previous efforts in this area have been focused on thresholding, local geometries, and morphological operators for known cell size [1,2]. Others have focused on an optimal cut path that minimizes a cost function in the absence of shape, size, or other information [3–7]. In this paper, we propose a new approach that utilizes both step-edge and roof-edge boundaries to partition a clump of nuclei in a way that is globally consistent. In this context, images are binarized and boundaries corresponding to step edges are recovered. Next, concave corners are extracted from a polygonal approximation of the initial boundary segments. These corners provide possible cues to where two adjacent nuclei may come together. Thresholding separates big clumps consisting of several nuclei squeezed together. The boundaries between every two adjacent nuclei inside one clump are not detected by thresholding since they have higher intensities, as shown in Fig. 1(b) and (c). Thus, crease segments are detected [8–11], which provide additional boundary conditions for the grouping process, as shown in Fig. 1(d). These crease segments correspond to trough edges and are treated as common boundaries between adjacent nuclei. False creases may be extracted in the process.
Fig. 1. An example of nuclei with the results of global and local operations: (a) original image; (b) threshold image; (c) boundary objects; and (d) local troughs.
However, since our algorithm need not use all the segments provided, false crease segments can be discarded in the grouping stage in favor of the global optimization. A unique feature of our system is the hyperquadric representation of each hypothesis and the use of this representation for global consistency. The main advantage of such a parameterized representation, as opposed to a polygonal representation, is better stability in shape description from partial information. In this fashion, each step-edge boundary segment belongs to one and only one nucleus while each roof-edge boundary segment is shared by two and only two nuclei. These initial hypotheses and their localized inter-relationships provide the basis for the search in the grouping step. This is expressed in terms of an adequate cost function and minimized through dynamic programming. The final result of this computational step is then shown to a user for verification and elimination of false alarms. In the next section, we will briefly review each step of the representation process and the parameterization of each hypothesis in terms of hyperquadrics. This will be followed by the details of the grouping protocol, results on real data, and concluding remarks.
2. Representation
The initial step of the computational process is to collect sufficient cues from local feature activities so that a set of hypotheses, not all of them correct, can be constructed for consistent grouping. These initial steps include thresholding, detection of concave points from boundary segments, extraction of crease segments from images, and hyperquadric representation of each possible hypothesis.
2.1. Thresholding
Binary thresholding extracts the clump patterns from the original image. The corresponding threshold can be obtained through analysis of the intensity histogram or contrast histogram. As shown in Fig. 2, since background
Fig. 2. Thresholding.
always corresponds to the first peak in the intensity histogram, the intensity analysis selects the first valley following this peak as the threshold. In the contrast analysis, the optimum threshold corresponds to the peak in the contrast histogram, which is the accumulation of local contrast at each edge point. Thresholding is a valid approach for fluorescence images because of the absence of any shading artifact.
2.2. Polygonal approximation
The next step is to partition the clump silhouettes into segments that are partial nucleus boundaries. Often the location on the boundary between two adjacent nuclei is signaled by a concave point; thus, a reliable corner detector is needed. These corners are localized by the concave vertices of the polygonal approximation of the original contours [1,12] and the concavity is determined by the turning angle between adjacent line segments. The arcs of the clump boundaries between every two adjacent corners are defined as the 'boundary segments'. Since polygonal approximation just selects some 'feature points' from the original curve as new vertices, all the corners are guaranteed to be on the original boundaries. An example of this step of the process is shown in Fig. 3.
2.3. Detection of crease boundaries
This step detects the common boundaries between every two squeezed nuclei in one clump, which can be modeled as crease features. In grey images, crease points can be defined as local extrema of the principal curvature along the principal direction [8–11]. It is well known that, due to noise, scale, finite differential operators, and thresholds, it is very difficult to detect complete creases, as shown in Fig. 1(d). Images are enhanced through a variation of nonlinear diffusion [13] to improve localization of crease points. The principal curvature k is then
Fig. 3. Detection of concave corners.
computed as the solution of the following equation [14]:
(EG − F²)k² − (EN + GL − 2FM)k + (LN − M²) = 0,  (1)
where E, F, G, L, M, N are the first and second fundamental forms:
E = 1 + f_x², F = f_x f_y, G = 1 + f_y², L = f_xx, M = f_xy, N = f_yy.
The principal direction (dx : dy) is given by the following equation [14]:
(EM − LF) dx² + (EN − LG) dx dy + (FN − MG) dy² = 0.  (2)
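In discrete form the fundamental-form coefficients come directly from the image partial derivatives, so the principal curvatures of Eq. (1) can be computed pixel-wise as sketched below. This is a plain finite-difference illustration that follows the coefficient definitions exactly as given in the text, not the authors' implementation.

    import numpy as np

    def principal_curvatures(image):
        """Per-pixel principal curvatures of the intensity surface z = f(x, y),
        obtained by solving (EG - F^2) k^2 - (EN + GL - 2FM) k + (LN - M^2) = 0."""
        f = image.astype(float)
        fy, fx = np.gradient(f)          # first derivatives (rows ~ y, columns ~ x)
        fxy, fxx = np.gradient(fx)       # second derivatives
        fyy, _ = np.gradient(fy)
        E, F, G = 1 + fx ** 2, fx * fy, 1 + fy ** 2
        L, M, N = fxx, fxy, fyy
        a = E * G - F ** 2
        b = -(E * N + G * L - 2 * F * M)
        c = L * N - M ** 2
        disc = np.sqrt(np.maximum(b ** 2 - 4 * a * c, 0.0))
        # the two roots of the quadratic are the principal curvatures
        return (-b + disc) / (2 * a), (-b - disc) / (2 * a)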
Crease points are detected and linked to form 'crease segments' as shown in Fig. 1(d).
2.4. Hyperquadric model
A brief introduction to hyperquadric fitting is included here; a more detailed description can be found in [15–17]. A 2D hyperquadric is a closed curve defined by
Σ_{i=1}^{N} |A_i x + B_i y + C_i|^{γ_i} = 1.  (3)
Since γ_i > 0, Eq. (3) implies that
|A_i x + B_i y + C_i| ≤ 1, ∀ i = 1, 2, …, N,  (4)
which corresponds to a pair of parallel line segments for each i. These line segments define a convex polytope (for large γ_i) within which the hyperquadric is constrained to lie. This representation is valid across a broad range of shapes, which need not be symmetric. The parameters A_i and B_i determine the slopes of the bounding lines and, along with C_i, the distance between them. γ_i determines the 'squareness' of the shape. Hyperquadrics can model both convex and concave shapes; thus, we do not assume that the nuclei are convex in our approach.
2.4.1. Fitting problem
Assume that m data points p_j = (x_j, y_j), j = 1, 2, …, m, from n segments (m = Σ_{i=1}^{n} m_i) are given. The cost function is defined as
e = (1/m) Σ_{j=1}^{m} (1 − F(p_j))² / ||∇F(p_j)||² + λ Σ_i Q_i,  (5)
where F(p_j) = Σ_{i=1}^{N} |A_i x_j + B_i y_j + C_i|^{γ_i}, ∇ is the gradient operator, λ is the regularization parameter and Q_i is the constraint term [17]. The parameters A_i, B_i, C_i, γ_i are calculated by minimizing e using the Levenberg–Marquardt non-linear optimization method [18] from a suitable initial guess [17]. Several examples of hyperquadric fitting to an initial set of partial segments are shown in Fig. 4.
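For reference, the hyperquadric inside-outside function F(p) and the data term of Eq. (5) can be evaluated as in the sketch below. The Levenberg-Marquardt minimization itself and the constraint term Q_i are omitted, and the gradient of F is approximated numerically, so this is only an illustration of the cost evaluation under our own naming.

    import numpy as np

    def hyperquadric_F(points, A, B, C, gamma):
        """F(p) = sum_i |A_i x + B_i y + C_i|^{gamma_i}; F(p) = 1 on the curve."""
        x, y = points[:, 0], points[:, 1]
        terms = np.abs(np.outer(x, A) + np.outer(y, B) + C) ** gamma
        return terms.sum(axis=1)

    def fitting_error(points, A, B, C, gamma, lam=0.0, Q=0.0):
        """Data term of Eq. (5): mean of (1 - F(p_j))^2 / ||grad F(p_j)||^2,
        plus an optional regularization term lam * Q (placeholder)."""
        eps = 1e-3
        Fp = hyperquadric_F(points, A, B, C, gamma)
        # numerical gradient of F with respect to x and y
        dFdx = (hyperquadric_F(points + [eps, 0.0], A, B, C, gamma) - Fp) / eps
        dFdy = (hyperquadric_F(points + [0.0, eps], A, B, C, gamma) - Fp) / eps
        grad2 = dFdx ** 2 + dFdy ** 2 + 1e-12
        return np.mean((1.0 - Fp) ** 2 / grad2) + lam * Q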
3. Grouping for nuclei
Let each clump be represented by n_b boundary segments b_i, i = 1, …, n_b, and n_c crease segments c_i, i = 1, …, n_c. We assume that there are at most n_b nuclei in the clump, because each nucleus should have at least one boundary segment detected to indicate its existence. The nucleus Φ_i corresponding to the index of b_i is defined as the set of boundary and crease segments belonging to the ith nucleus. Note that Φ_i does not necessarily include b_i and may be empty. All the segments in a given Φ_i are fitted by a hyperquadric representing the actual shape of the nucleus. To delineate the nuclei in one clump, we need to find the assignment of all b_i, i = 1, …, n_b, and c_i, i = 1, …, n_c, to Φ_i, i = 1, …, n_b. By their nature, each b_i belongs to one and only one Φ, while each c_i belongs either to two different nuclei (common boundary) or to no nucleus at all (false crease). This is called the consistency criterion. Thus, two or more boundary segments may be assigned to the same Φ_i while some other Φ_i remain unfilled. To assign the segments, we define a new set Φ'_i for each Φ_i such that Φ_i ⊆ Φ'_i. It is assumed that detecting Φ'_i is trivial and that Φ'_i contains all the segments that could possibly be part of the ith nucleus. Computing Φ_i from Φ'_i is in fact subject to local, adjacency, and global constraints. It is under-constrained and the solution is not
Fig. 4. Fitting results for hyperquadrics.
unique. Each possible solution is measured by the 'goodness criteria' proposed in Section 3.3, and the one with minimum cost determines the segmentation.
3.1. Neighborhood box
A neighborhood box, defined over a region for each b_i, is used to construct Φ'_i, as shown in Fig. 5. Suppose that p_1 and p_2 are the end points of b_i and r is the line segment connecting them, with l = ||r|| as its length. The neighborhood box is then defined as the combination of a square that extends r and the region enclosed by b_i. The length of the square edge is a pre-set number L unless l > L, in which case the length is set to l. Any segment b_j, j = 1, …, n_b, or c_j, j = 1, …, n_c, that resides in the neighborhood box is included in Φ'_i. When L is properly selected, all the boundary and crease segments of Φ_i are guaranteed to be in the neighborhood box; thus, Φ_i ⊆ Φ'_i.
3.2. Search strategy
Our next step is to compute Φ_i from Φ'_i subject to the consistency criterion. Since every segment in Φ'_i has some possibility of being in Φ_i, the solution is not unique. However, the construction of Φ'_i greatly reduces the search space. The optimal segmentation is computed by measuring different solutions based on global evaluation criteria. The key data structure in our approach is the assignment matrix M. Each row of M indicates a possible nucleus. For the clump under investigation, we can construct up to n_b nuclei; thus, M has n_b rows. Each column of M indicates a boundary or crease segment. Since each crease may be shared by two nuclei, we assign two columns to it. Thus, M has n_b + 2n_c columns. Let
s_j = b_j, 1 ≤ j ≤ n_b,   s_{n_b+2j−1} = s_{n_b+2j} = c_j, 1 ≤ j ≤ n_c.
(6)
' "+b , b ,, ' "+b , c , b ,, ' "+b , c , b ,, ' "+b , b ,. (8) The ith row of M represents all possible segments that may be part of a nucleus. The jth column of M indicates all possible nuclei that s may belong to. The main conH straint is to enforce assigning a boundary segment in one and only one nucleus and sharing a crease segment between two di!erent nuclei or not using this segment at all for any nucleus. According to this consistence criterion, a `patha in M is de"ned as a routine from left to right with the so-called `boundary parta and `crease parta. The boundary part passes one and only one `1a for each boundary segment column. For each pair of crease segment columns s @ and s @ which are correL >H\ L >H spondent to the same crease segment, the crease part passes either one `1a in each column but di!erent rows or does not pass any `1a at all. Thus, each path is a segmentation of the clump with the correspondent assignments of ' . In the path, every boundary segment is assigned to G a nucleus while some crease segments may not be used which enables us to discard the false creases. For example, path I in Fig. 7 indicates that
Fig. 5. Neighborhood function.
' "+b ,, ' "+b , c , b ,, ' "+b , b ,, ' "+b ,, ' "+c ,, ' " , (9) ' , i"1, 22 are then "tted by hyperquadrics each of G which is evaluated by the criteria proposed in the next section. Thus, the nuclei segmentation problem is equivalent to "nding a best path with minimum cost. For example, the best path for Fig. 6 is Path II as shown in Fig. 7, i.e.,
Fig. 6. An example of boundary and crease segments.
Fig. 7. Assignment matrix for features of Fig. 6.
M is determined by
m_ij = 1 if s_j ∈ Φ'_i, and m_ij = 0 otherwise.
(7)
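Constructing M from the candidate sets is mechanical; a small sketch follows, with two columns allocated per crease segment as described above. The string labelling of segments is our own convention for illustration.

    import numpy as np

    def assignment_matrix(candidate_sets, n_b, n_c):
        """Build the n_b x (n_b + 2*n_c) assignment matrix M of Eq. (7).

        candidate_sets : list of sets Phi'_i, one per boundary segment, whose
                         elements are labels 'b1'..'b<n_b>' and 'c1'..'c<n_c>'.
        """
        M = np.zeros((n_b, n_b + 2 * n_c), dtype=int)
        for i, phi in enumerate(candidate_sets):
            for j in range(n_b):                       # boundary-segment columns
                if f"b{j + 1}" in phi:
                    M[i, j] = 1
            for j in range(n_c):                       # two columns per crease
                if f"c{j + 1}" in phi:
                    M[i, n_b + 2 * j] = 1
                    M[i, n_b + 2 * j + 1] = 1
        return M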
An example of the construction of M for the feature segments of Fig. 6 is shown in Fig. 7, where we assume that Φ'_1 = {b, c, b, b}, Φ'_2 = {b, c, b, b},
' "+b , c , b ,, ' "+b , c , b ,, ' "+b , b ,, ' " , ' " , ' " . (10) The actual search process is based on dynamic programming [19,20], where the local cost function is de"ned in the next section. The dynamic programming
algorithm is essentially a multi-stage optimization technique where at each stage, or each iteration, the size of the path is increased by one set of feature segments. This process is repeated for each starting point in the assignment matrix, and the path with least cost is selected as the final hypothesis.
3.3. Evaluation criteria
Although nuclei may have completely different morphologies, we have some general information about their shapes and properties. This information enables us to compare different hyperquadrics, discard the undesirable ones, and reduce false alarms. The 'goodness' criterion includes four terms: area A, shape S, overlap O and error C. Each is evaluated by its representative function E_A, E_S, E_O and E_C. The cost of the local hyperquadric is then given by E_T = E_A + E_S + E_O + E_C. The cost of one path is the summation of the costs of the entire set of hypotheses. The transition cost between two adjacent hypotheses is simply an exclusive consistency measure. E_A, E_S, E_O and E_C are computed as follows:
1. E_A. A is the area of the hyperquadric. A nucleus should neither be bigger than A_b nor smaller than A_s:
E_A = 0 if A_s ≤ A ≤ A_b,
E_A = 1 − e^{−(A − A_b)/σ_A} if A > A_b,
E_A = 1 − e^{−(A_s − A)/σ_A} if A < A_s,  (11)
where we choose A_b and A_s as fixed multiples of L².
2. E_S. S is defined as an aspect ratio, measured by the ratio of the minor to the major axis as shown in Fig. 8(a). E_S is defined to favor perfect circles:
E_S = 1 − e^{−(1 − S)/σ_S}.  (12)
3. E_O. A hyperquadric may not always be enclosed by the nuclei clump. An overlap measure O is defined as the ratio of the area inside the clump to the total area of the hyperquadric. E_O is defined to favor larger values of O, as shown in Fig. 8(b):
E_O = 1 − e^{−(1 − O)/σ_O}.  (13)
4. E_C. The error C is defined as C = e/m, where e is the error in Eq. (5) of the hyperquadric fitting process and m is the total number of points:
E_C = 1 − e^{−C/σ_C},  (14)
where σ_A, σ_S, σ_O, σ_C are weighting factors for each criterion.
3.4. Post-processing
Two kinds of errors may occur in the fitting process. These include a fit that spans outside of the clumped boundary (background inclusion) and one that encloses other nuclei as well. The first problem is solved through a simple 'AND' operation. The second type of error could be due either to representation of the same nucleus with two different hyperquadrics or to a simple overlap between the representations of two separate nuclei, as shown in Fig. 9. The first case can be easily resolved through a test on the proximity of the
Fig. 8. Evaluation criteria. (a) Shape rate; (b) Overlap rate.
Fig. 9. Different types of error in the fit: (a) same nucleus is represented by two objects and part of background; and (b) overlap between two representations of nuclei.
Fig. 10. Contour dynamic: (a) Two contours are squeezed together; (b) Contour dynamic; (c) Stable state.
center of mass. The second case uses a simple model to express the dynamic behavior of the boundary, as shown in Fig. 10(a). Let two elastic contours C_1, C_2 be squeezed against each other. The stable boundary between them will be affected by the rebound force that corresponds to the energy of deformation. The rebound force F at any point on the deformed contour is proportional to its displacement d from the original position and perpendicular to the new boundary, as shown in Fig. 10(b): F = αd. The deformation energy is given by e = ∫_L β d² ds, where α, β are elastic parameters. At equilibrium, F_1 = F_2 along L and the deformation energy should be minimal, as shown in Fig. 10(c). It is then not difficult to prove that every point on L must have the same distance from its original positions on C_1 and C_2. Hence, we use a protocol based on the distance transform to partition squeezed hyperquadrics. Let region R contain h hyperquadrics h_i, i = 1, …, h, with some overlap, and let the distance transformation D_i(x, y) for each hyperquadric be computed. Then each point (x, y) ∈ R is assigned to the ith nucleus if D_i(x, y) ≥ D_j(x, y), ∀ j = 1, …, h, as shown in Fig. 11(a) and (b).
3.5. Experimental results and discussions
The proposed protocol has been tested on real data obtained from a fluorescence microscope. The results computed by our approach as well as the 'correct' seg-
mentations generated manually through skilled operator interaction are presented in Figs. 12–16. Comparisons between the two types of results are summarized in Table 1. The 'Manual' column gives the numbers of nuclei detected by the skilled operator and the 'Algorithm' column the numbers of nuclei detected by our algorithm. 'Rejected' is the difference between the preceding columns. An analysis of the output of our algorithm is given in Table 2. 'Bad location' indicates the number of nuclei with incorrect partial boundary locations; 'Fused', the number of nuclei that are fused to other ones; 'Fragmented', the number of nuclei that are fragmented into two or more small shapes. The 'Acceptable' column gives the number of nuclei correctly detected by our algorithm and 'Reliability' is the percentage of acceptable nuclei among all detected ones. As can be seen, the nuclei lying on the boundaries of the original images are rejected since most of their boundary segments cannot be provided. In Fig. 15, one nucleus is fragmented into two small shapes because of false crease information. The absence of crease segments, as well as the fact that, in some situations, a bigger nucleus is 'better' than two very small ones according to our criteria, leads to the fusion of nuclei in Figs. 15 and 16. 'Bad location' happens in Figs. 12, 13 and 16 because the boundary information is not strong enough to enhance
Fig. 11. Steps in resolving the overlap problem between multiple nuclei: (a) two nuclei; (b) three nuclei.
Fig. 12. Segmentation results: (a) original image; (b) correct segmentation; (c) our segmented nuclei.
Fig. 13. Segmentation results: (a) original image; (b) correct segmentation; (c) our segmented nuclei.
G. Cong, B. Parvin / Pattern Recognition 33 (2000) 1383}1393
Fig. 14. Segmentation results: (a) original image; (b) correct segmentation; (c) our segmented nuclei.
Fig. 15. Segmentation results: (a) original image; (b) correct segmentation; (c) our segmented nuclei.
Fig. 16. Segmentation results: (a) original image; (b) correct segmentation; (c) our segmented nuclei.
range of shapes from partial boundary information. The assignment matrix, on the other hand, converts the segmentation problem into a constrained optimization problem. Our approach aims for global consistency and, as a result, it is less error prone and generates only a few false alarms for final verification by an operator.
Table 1
Numbers of nuclei detected by the two methods

Figure index   Manual   Algorithm   Rejected
12             10       7           3
13             11       7           4
14             4        4           0
15             12       9           3
16             14       12          2
Acknowledgements
The authors thank Dr. Mary Helen Barcellos-Hoff and Mr. Sylvain Costes for motivating the problems, valuable discussion, and providing the data used in this experiment. This work is supported by the Director, Office of Energy Research, Office of Computation and Technology Research, Mathematical, Information, and Computational Sciences Division, and Office of Biological and Environmental Research of the U.S. Department of Energy under contract No. DE-AC03-76SF00098 with the University of California. The LBNL publication number is 643205.
the concavity. Since our approach seeks a global solution, it is possible that some locally incorrect segmentations are the cost of a better global optimization. In all of the experiments, the hyperquadric has four terms, N = 4. The evaluation parameters are L = 45, σ_A = 200, σ_S = 5, σ_O = 0.5 and σ_C = 1. These numbers are obtained based on a priori information in the specific application domain.
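With the parameter values just quoted, the cost of a single hypothesis can be written down directly. The sketch below is a transcription of Eqs. (11)-(14) under our own naming; the choice of A_b and A_s as specific multiples of L² is an assumption, since only their dependence on L is stated in the text.

    import numpy as np

    # Parameter values reported in the text; A_BIG and A_SMALL are assumptions
    L = 45.0
    A_BIG, A_SMALL = L ** 2, (L / 4) ** 2
    SIGMA_A, SIGMA_S, SIGMA_O, SIGMA_C = 200.0, 5.0, 0.5, 1.0

    def hypothesis_cost(area, shape_ratio, overlap_ratio, fit_error_per_point):
        """E_T = E_A + E_S + E_O + E_C for one hyperquadric hypothesis."""
        if A_SMALL <= area <= A_BIG:
            e_area = 0.0
        elif area > A_BIG:
            e_area = 1.0 - np.exp(-(area - A_BIG) / SIGMA_A)
        else:
            e_area = 1.0 - np.exp(-(A_SMALL - area) / SIGMA_A)
        e_shape = 1.0 - np.exp(-(1.0 - shape_ratio) / SIGMA_S)      # favors circles (S -> 1)
        e_overlap = 1.0 - np.exp(-(1.0 - overlap_ratio) / SIGMA_O)  # favors O -> 1
        e_error = 1.0 - np.exp(-fit_error_per_point / SIGMA_C)
        return e_area + e_shape + e_overlap + e_error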
4. Conclusion
We have presented a new approach for the segmentation of nuclei based on partial geometric information. Two key issues are hyperquadric fitting and the assignment matrix. Hyperquadric representation can model a broad
References [1] M. Sonka, V. Hlavac, R. Boyle, Image Processing Analysis and Machine Vision, Chapman & Hall, London, 1995.
Table 2
Detailed analysis of our results

Figure index   Bad location   Fused   Fragmented   Acceptable   Reliability (%)
12             1              0       0            6            88
13             1              0       0            6            88
14             0              0       0            4            100
15             0              1       1            7            77
16             1              1       0            10           83
[2] H. Talbot, I. Villalobos, Binary image segmentation using weighted skeletons, SPIE Image Algebra Morphol. Image Process. 1769 (1992) 393–403. [3] J. Leu, H. Yau, Detection of the dislocations in metal crystals from microscopic images, Pattern Recognition 24 (1) (1991) 41–56. [4] S. Ong, H. Yeow, R. Sinniah, Decomposition of digital clumps into convex parts by contour tracing and labelling, Pattern Recognition Lett. 13 (1992) 789–795. [5] J. Liang, Intelligent splitting in the chromosome domain, Pattern Recognition 22 (1989) 519–532. [6] Y. Jin, Jayasooriah, R. Sinniah, Clump splitting through concavity analysis, Pattern Recognition 15 (1994) 1013–1018. [7] W. Wang, Binary image segmentation of aggregates based on polygonal approximation and classification of concavities, Pattern Recognition 31 (10) (1998) 1502–1524. [8] O. Monga, N. Ayache, P. Sander, From voxel to intrinsic surface features, Image Vision Comput. 10 (6) (1992). [9] O. Monga, S. Benayoun, O. Faugeras, From partial derivatives of 3D density images to ridge lines, Proceedings of the Conference on Computer Vision and Pattern Recognition, 1992, pp. 354–359. [10] J. Thirion, A. Gourdon, The 3D matching lines algorithm, Graph. Model Image Process. 58 (6) (1996) 503–509. [11] J. Thirion, New feature points based on geometric invariants for 3D image registration, Int. J. Comput. Vision 18 (2) (1996) 121–137.
About the Author - GE CONG received the B.S. degree in electrical engineering from Wuhan University, Wuhan, China, in 1992 and the Ph.D. degree in computer science from the Institute of Automation, Chinese Academy of Sciences, in 1997. He is currently a staff scientist at Lawrence Berkeley National Laboratory. His research interests include computer vision, pattern recognition and bioinformatics.

About the Author - BAHRAM PARVIN received his Ph.D. in Electrical Engineering from the University of Southern California in 1991. Since then he has been on the staff of the Information and Computing Sciences Division at Lawrence Berkeley National Laboratory. His areas of research include computer vision and collaboratory research. He is a senior member of the IEEE.
Pattern Recognition 33 (2000) 1395-1399
Can the classification capability of network be further improved by using quadratic sigmoidal neurons?
Baoyun Wang*, Zhenya He
Department of Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, People's Republic of China
Department of Radio Engineering, Southeast University, Nanjing 210096, People's Republic of China
Received 8 December 1995; accepted 8 April 1999
* Corresponding author. E-mail address: [email protected] (B. Wang).
Abstract

In Ref. [4], by using a constructive method, Chiang et al. proved that a three-layer neural network containing k+1 single threshold quadratic sigmoidal hidden neurons and one multithreshold sigmoidal output neuron could learn an arbitrary dichotomy defined on a training set of 4k patterns. In this paper the classification capability of feedforward neural networks containing multiple or single threshold quadratic sigmoidal neurons in the hidden and output layer is evaluated. The degree of improvement in the classification capability of the network obtained by using quadratic sigmoidal neurons is analyzed. Published by Elsevier Science Ltd.

Keywords: Neural networks; Classification; Dichotomy; Quadratic sigmoidal neuron
1. Introduction

The classification capability of feedforward networks is an important topic in the understanding of neural networks. The focus is often put on the lower or upper bound on the number of hidden neurons required to learn the classification of a given training set [1,2,5]. The number of hidden nodes needed for various feedforward networks to dichotomize any dichotomy defined on a training set was studied in many references, for example, Refs. [4-7]. In Ref. [3], Chiang et al. proposed a new activation function called the quadratic sigmoidal function (QSF), and introduced an extended type of neuron called the quadratic sigmoidal neuron. Compared with conventional perceptrons, neural networks consisting of quadratic sigmoidal neurons enjoy faster learning, better generalization capability and stronger classification capability. This paper mainly considers how far the improvement of the classification capability of a neural network can be
reached by using quadratic sigmoidal neurons. It is proved that for a three-layer feedforward neural network containing single or multiple threshold quadratic sigmoidal neurons in the hidden and output layer, at least k+1 hidden nodes are needed to dichotomize any dichotomy defined on a training set of 4k+7 examples [2-4].
2. Quadratic sigmoidal neurons

2.1. Some types of activation function

In this section, some notations to be used in the following are presented:

(a) Multi-threshold quadratic sigmoidal neuron: suppose the input of the neuron is x = (1, x_1, x_2, ..., x_n), the corresponding weight vector is w = (w_0, w_1, ..., w_n), and the threshold vector is Θ = (θ_0, θ_1, ..., θ_n). The activation function of this type of neuron is

f(net, Θ) = 1 / (1 + exp(net^2 − g^2(Θ, x))),    (1)

where

net = w · x = w_0 + Σ_{i=1}^{n} w_i x_i,    (2)

g(Θ, x) = θ_0 + Σ_{i=1}^{n} θ_i x_i.    (3)

(b) Single threshold quadratic sigmoidal neuron: the threshold function g(Θ, x) of the activation function is reduced to a constant θ.

(c) Multithreshold quadratic Heaviside neuron: it takes the following extended quadratic Heaviside function as its activation function:

HC(w · x, Θ) = 1 if g^2(Θ, x) − (w · x)^2 > 0, and 0 if g^2(Θ, x) − (w · x)^2 ≤ 0.    (4)

(d) Single threshold quadratic Heaviside neuron: the threshold g(Θ, x) in the activation function is reduced to a scalar.
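To make these definitions concrete, the following minimal sketch (not from the paper; all parameter values are hypothetical) evaluates the activations of Eqs. (1)-(4) for a one-dimensional input, assuming the difference-of-squares form net^2 − g^2 that Eq. (4) and Appendix A rely on.

```python
import numpy as np

def qsf(x, w, theta):
    """Multi-threshold quadratic sigmoidal function, Eqs. (1)-(3).
    x is the augmented input (1, x1, ..., xn); w and theta have the same length."""
    net = np.dot(w, x)                                   # Eq. (2)
    g = np.dot(theta, x)                                 # Eq. (3)
    return 1.0 / (1.0 + np.exp(net ** 2 - g ** 2))       # Eq. (1)

def quadratic_heaviside(x, w, theta):
    """Extended quadratic Heaviside function, Eq. (4): the hard-limited counterpart of the QSF."""
    net = np.dot(w, x)
    g = np.dot(theta, x)
    return 1.0 if g ** 2 - net ** 2 > 0 else 0.0

# Hypothetical one-dimensional example: w = (0, 2), theta = (1, 0), i.e. a single threshold.
w = np.array([0.0, 2.0])
theta = np.array([1.0, 0.0])
for val in (-1.0, 0.0, 0.25, 1.0):
    x = np.array([1.0, val])
    print(val, round(qsf(x, w, theta), 3), quadratic_heaviside(x, w, theta))
```

With these values the neuron fires only on the bounded interval |x| < 0.5; this is the interval property of the zeros q_i^-, q_i^+ used in the proof of Theorem 1.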
2.2. Some notations

In this subsection, some useful notations are presented, which read as follows:

QSNN - a three-layer network containing single or multiple threshold quadratic sigmoidal neurons in the hidden and output layer. Its structure is given in Fig. 1(a).

DIOH neural network - a single-hidden-layer network with a direct input-to-output connection, containing multiple or single threshold quadratic Heaviside neurons in the hidden and output layer. Its structure is given in Fig. 1(b).

Fig. 1. The structures of the QSNN (a) and the DIOH network (b).

3. Main results

3.1. Problem formation

As pointed out by Chiang et al. [1], the dichotomy problem can be described as a partition of the input space into two subsets, S^+ and S^-. A signal x ∈ S^+ is called a positive example and has a target signal f(x) = 1 associated with it. A signal y ∈ S^- is a negative example with target f(y) = 0. For any training set S_n = {x^1, x^2, ..., x^N | x^i ∈ R^n}, we can always find a vector v such that the new training set S = {z^1, z^2, ..., z^N | z^i = v · x^i ∈ R} contains no duplicated elements. So we only need to consider the dichotomy on a one-dimensional pattern space.
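The reduction to one dimension can be illustrated with a short sketch (illustrative only; drawing v at random is an assumption of this example, not the authors' procedure):

```python
import numpy as np

def project_to_1d(X, trials=1000, seed=0):
    """Find a direction v such that the projections v . x_i are pairwise distinct,
    so the dichotomy on X can be studied on the real line (Section 3.1)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        v = rng.normal(size=X.shape[1])
        z = X @ v
        if len(np.unique(z)) == len(z):   # no duplicated elements
            return v, z
    raise RuntimeError("no suitable direction found")

# Hypothetical 2-D training set
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
v, z = project_to_1d(X)
print(z)
```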
3.2. The classification capability of the DIOH neural network

Consider a DIOH neural network. Let w_i = (w_{i0}, w_{i1}) and Θ_i = (θ_{i0}, θ_{i1}) denote the weight and threshold of the ith hidden node, and let u = (u_0, u_1, ..., u_k) and Θ^0 = (θ^0_0, θ^0_1, ..., θ^0_k) denote the hidden-to-output connection weights and the threshold of the output node. The output of the network is

O(x) = HC_O(u_0 + vx + Σ_{i=1}^{k} u_i H(w_{i0} + w_{i1}x, θ_{i0} + θ_{i1}x), Θ^0),    (5)

where v denotes the direct input-to-output connection weight and H(·, ·) is the quadratic Heaviside function of Eq. (4).

Theorem 1. For arbitrary w_i, u, Θ_i, Θ^0 (i = 1, 2, ..., k), a DIOH neural network can divide an arbitrary closed interval I into at most 4k+3 intervals S^1, S^2, ..., S^{4k+3}, such that (a) S^i ∩ S^j = ∅ and (b) f(x) = 1 − f(y) for all x ∈ S^i, y ∈ S^{i+1}, i = 1, ..., 4k+2. Here f(x) means the target value of the network.

Proof. See Appendix A.
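For illustration, the sketch below evaluates the DIOH output of Eq. (5) for a small network with k = 2 hidden quadratic Heaviside neurons. The weights, the thresholds and the linear form of the output threshold function applied to the hidden activities are assumptions of this example, not values or choices taken from the paper.

```python
import numpy as np

def hc(net, g):
    """Quadratic Heaviside unit, Eq. (4): fires when g^2 - net^2 > 0."""
    return 1.0 if g * g - net * net > 0 else 0.0

def dioh_output(x, u, v, w, theta_h, theta_o):
    """Output of a DIOH network, Eq. (5).
    u: (u0, ..., uk) hidden-to-output weights; v: direct input-to-output weight;
    w, theta_h: per-hidden-node pairs (w_i0, w_i1) and (theta_i0, theta_i1);
    theta_o: output thresholds (theta0_0, ..., theta0_k)."""
    h = [hc(wi0 + wi1 * x, t0 + t1 * x) for (wi0, wi1), (t0, t1) in zip(w, theta_h)]
    net_o = u[0] + v * x + sum(ui * hi for ui, hi in zip(u[1:], h))
    g_o = theta_o[0] + sum(ti * hi for ti, hi in zip(theta_o[1:], h))
    return hc(net_o, g_o)

# Hypothetical network with k = 2 hidden neurons
u = (0.0, 1.5, 1.5)
v = 0.2
w = [(0.0, 2.0), (-4.0, 2.0)]
theta_h = [(1.0, 0.0), (1.0, 0.0)]
theta_o = (1.0, 0.5, 0.5)
print([dioh_output(x, u, v, w, theta_h, theta_o) for x in np.linspace(-1.0, 3.0, 9)])
```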
3.3. The classification capabilities of the QSNN

Given the training set

S = {z^i | z^i ∈ R, i = 0, 1, ..., 4k+2},    (6-1)

which satisfies

z^i < z^{i+1} and f(z^i) = 1 − f(z^{i+1}),    (6-2)

where f(·) means the target value of the neural network. Before constructing a neural network to learn the training set S, we often assume some closed intervals containing the training patterns, ordered along the real line and separated by the points a_0, a_1, b_1, c_1, ...:

a_0 < I_1 < a_1 < I_2 < b_1 < I_3 < c_1 < ... < I_{4k+3}.

The weights and thresholds of the neural network to be designed are determined by the separating points a_0, a_1, b_1, c_1, .... If they satisfy the following conditions (here we still adopt the notation of the proof of Theorem 1):

each interval (q_i^-, q_i^+), i = 1, ..., k, covers the corresponding union of neighbouring intervals ⋃_j I_j,    (7)

u_0 + Σ_j u_j h_j ≠ θ^0_0 + Σ_j θ^0_j h_j  for x ∈ ⋃_j I_j,    (8)

we denote the corresponding QSNN as QSNN-I and obtain Theorem 2.

Theorem 2. If QSNN-I is used to learn the training set S in Eq. (6), then at least k+1 hidden neurons are needed.

Proof. See Appendix B.

Theorem 2 shows the degree of the improvement of the classification capability of the neural network: compared with the Committee Machine, the quadratic sigmoidal neuron improves the classification capability only by a factor of 4.
4. Conclusions

This paper evaluated the degree of improvement in the classification capability of a neural network obtained by using quadratic sigmoidal neurons. It is helpful for finding more powerful neuron models for artificial neural networks, especially when they are used as classifiers.
Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC).
Appendix A. Proof of Theorem 1

Let q_i^-, q_i^+ denote the zeros of q_i(x) = (θ_{i0} + θ_{i1}x)^2 − (w_{i0} + w_{i1}x)^2; q_i is positive only when x lies in the interval (q_i^-, q_i^+). There are two cases to be considered.

1. The case q_i^+ < q_{i+1}^- for i = 1, 2, ..., k−1.

In this case, R is divided into 3k intervals by the hidden neurons, denoted I_1, I_2, ..., I_{3k}, with I_{3i-1} = (q_i^-, q_i^+) and I_{3i-2}, I_{3i} the adjacent gap portions. The value of h_i(x) = H(w_{i0} + w_{i1}x, θ_{i0} + θ_{i1}x) on each interval is as follows:

h_i = 1 and h_j = 0 (j ≠ i) for x ∈ I_{3i-1}, i = 1, 2, ..., k;
h_j = 0 for all j = 1, 2, ..., k for x ∈ ⋃_i (I_{3i-2} ∪ I_{3i}).    (A.1)

The output of the neural network is

O = HC_O(u_0 + vx, Θ^0) = 1 if |u_0 + vx| ≤ θ^0_0, and 0 if |u_0 + vx| > θ^0_0, for x ∈ ⋃_i (I_{3i-2} ∪ I_{3i}),    (A.2)

O = HC_O(u_0 + vx + u_i, Θ^0) = 1 if |u_0 + vx + u_i| ≤ θ^0_0 + θ^0_i, and 0 otherwise, for x ∈ I_{3i-1}.    (A.3)

From Eq. (A.3) we can see that I_{3i-1} can be divided into three parts by the neural network, as shown in Fig. 2(c). Fig. 2 also shows the neighbouring relationship of the intervals: the gap union I_{3i} ∪ I_{3i+1} is next to I_{3i-1} and I_{3i+2}, and the parameters u_0, v, θ^0_0 in Eq. (A.2) are independent of i. So

O(x) = HC_O(u_0 + vx, Θ^0) = 1, if x ∈ ⋃_{i=1}^{k-1} (I_{3i} ∪ I_{3i+1}),    (A.4)

should hold. By selecting the parameters u_0, v, θ^0_0 suitably, one can ensure Eq. (A.4) and divide the two outermost intervals I_1 and I_{3k} into I_1^+ ∪ I_1^- and I_{3k}^+ ∪ I_{3k}^-, with

O = HC_O(u_0 + vx, Θ^0) = 1 if x ∈ I_1^+ ∪ I_{3k}^+,
O = HC_O(u_0 + vx, Θ^0) = 0 if x ∈ I_1^- ∪ I_{3k}^-.

Summing up the above analysis, we conclude that a DIOH neural network can divide any closed interval I into at most 4k+3 parts:

I_1^-, I_1^+, I_{3k}^-, I_{3k}^+,
the three sub-intervals of each I_{3i-1}, i = 1, 2, ..., k,
I_{3i} ∪ I_{3i+1}, i = 1, 2, ..., k−1,

which meet the requirements.
Fig. 2. (a) Sub-intervals produced by the hidden neuron dividing I. (b) Overlapping of the sub-intervals. (c) Dividing of sub-intervals in (a) by the hidden neuron.
2. The case where there exists at least one index i with q_i^+ > q_{i+1}^-.

As shown in Fig. 2(b), the output of the network is as follows:

O = HC_O(u_0 + vx + u_i, Θ^0) = 1 if |u_0 + vx + u_i| < θ^0_0 + θ^0_i, and 0 otherwise, for x ∈ (q_i^-, q_{i+1}^-);

O = HC_O(u_0 + vx + u_{i+1}, Θ^0) = 1 if |u_0 + vx + u_{i+1}| < θ^0_0 + θ^0_{i+1}, and 0 otherwise, for x ∈ (q_i^+, q_{i+1}^+);

O = HC_O(u_0 + vx + u_i + u_{i+1}, Θ^0) = 1 if |u_0 + vx + u_i + u_{i+1}| < θ^0_0 + θ^0_i + θ^0_{i+1}, and 0 otherwise, for x ∈ (q_{i+1}^-, q_i^+).

From the above equations we find that the number of intervals produced by the neural network does not increase. So the proof is completed. □
Appendix B. Proof of Theorem 2

Proof. Case 1: at least one single threshold quadratic sigmoidal neuron in the hidden layer.

Assume there exists a QSNN-I with s (s < k+1) hidden neurons which can learn the training set S correctly. If the weights and thresholds meet Eqs. (7) and (8), then by Lemmas 3 and 4 and Corollary 1 in Ref. [4] we can replace the single threshold (or multithreshold) quadratic sigmoidal neurons with single threshold (or multithreshold) Heaviside neurons, and one single threshold quadratic sigmoidal hidden neuron can be approximated by a direct input-to-output connection. Therefore, there should exist a neural network with the architecture stated in Theorem 1 and no more than k−1 hidden nodes which can learn S correctly; this contradicts Theorem 1.

Case 2: no single threshold quadratic sigmoidal neuron in the hidden layer.

Clearly, the network with k+1 multithreshold quadratic sigmoidal hidden neurons is at most as powerful as the network with k+1 multithreshold quadratic sigmoidal neurons and one single threshold quadratic sigmoidal neuron, while the latter network can dichotomize any dichotomy defined on a training set of 4(k+1)+3 = 4k+7 examples. So the proof is completed. □
References

[1] C.C. Chiang, H.C. Fu, Using multi-threshold quadratic sigmoidal neurons to improve classification capability of multilayer perceptrons, IEEE Trans. Neural Networks 5 (1994) 516-519.
[2] M. Arai, Bounds on the number of hidden units in binary-valued three-layer neural networks, Neural Networks 6 (1993) 855-860.
[3] E.B. Baum, D. Haussler, What size net gives valid generalization?, Neural Computation 1 (1989) 151-160.
[4] S.C. Huang, Y.F. Huang, Bounds on the number of hidden neurons in multilayer perceptrons, IEEE Trans. Neural Networks 2 (1991) 47-55.
[5] C.C. Chiang, H.C. Fu, A variant of second-order multilayer perceptron and its application to function approximation, in: Proceedings of the IJCNN'92, Baltimore, III, 1992, pp. 887-892.
[6] N.J. Nilsson, Learning Machines: Foundation of Trainable Pattern Classifying Systems, McGraw-Hill, New York, 1965.
[7] E.D. Sontag, On the recognition capabilities of feedforward nets, Tech. Rep. SYCON 90-03, SYCON-Rutgers Center for Systems and Control, Department of Mathematics, Rutgers University, New Brunswick, NJ, April 1990.
About the Author - BAOYUN WANG was born in 1967. He received the M.S. degree in applied mathematics from Huazhong University of Science and Technology (HUST), Wuhan, in 1993, and the Ph.D. degree in electrical engineering from Southeast University, Nanjing, in 1997. Since January 1997 he has been with Nanjing University of Posts and Telecommunications, Nanjing, China. His research interests include neural networks, pattern recognition and digital signal processing. He has published more than 20 technical papers.

About the Author - ZHENYA HE was born in Jiangsu Province. He received the B.S. degree in electrical engineering from Beiyang University, Tianjin, in 1947. He is presently a professor and director of the DSP Division, Department of Radio Engineering, Southeast University, Nanjing, China. During 1992-1997 he was the presiding scientist of the National Key Project of China, Neural Networks Theory and Its Application. He was the general chair of ICNNSP95, Nanjing. His research interests include adaptive signal processing, multidimensional signal processing and neural networks. In these fields he has published more than 300 papers. He is an IEEE Fellow, an INNS member, and a fellow of the Chinese Institute of Communications.
Pattern Recognition 33 (2000) 1401-1403
Rapid and Brief Communication
A new approach for text-independent speaker recognition
Shung-Yung Lung*, Chih-Chien Thomas Chen
Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan
Received 25 August 1999; accepted 21 September 1999
* Corresponding author. Tel.: +886-7-525-2000 ext. 4179; fax: +886-7-261-5738. E-mail address: [email protected] (S.-Y. Lung).
1. Introduction

The Karhunen-Loeve (KL) transform [1] is a well-known technique for providing a second-moment characterization of a random process in terms of uncorrelated random variables. The truncated version of the KL transform has been shown to yield the best approximation to a random process among all N-dimensional approximations. This property has resulted in a host of applications, including data compression and reduced-order modeling of recognition systems. An evaluation of various Karhunen-Loeve transforms (KLT) for text-independent speaker recognition is presented. In addition, a new data compression method is examined for this application. The new data compression method is called the adaptive split-and-merge algorithm. The adaptive split-and-merge algorithm is robust because the parameters in the algorithm depend only on the context of the speaker data under analysis. One of the key processes, the determination of region homogeneity, is treated as a sequence of decision problems in terms of predicates in the hypothesis model.
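As background, the truncated KL transform amounts to projecting each feature vector onto the leading eigenvectors of a sample covariance matrix. A minimal sketch of this truncation step follows (generic PCA-style code with made-up data; it is not the adaptive split-and-merge algorithm proposed below):

```python
import numpy as np

def klt_basis(frames, dim):
    """Karhunen-Loeve basis: eigenvectors of the sample covariance matrix,
    ordered by decreasing eigenvalue; only the first 'dim' columns are kept."""
    centered = frames - frames.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:dim]]

# Hypothetical data: 400 frames of 20-dimensional spectral features for one speaker
frames = np.random.default_rng(1).normal(size=(400, 20))
basis = klt_basis(frames, dim=16)
compressed = (frames - frames.mean(axis=0)) @ basis   # truncated KLT coefficients
print(compressed.shape)                               # (400, 16)
```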
2. Adaptive split-and-merge algorithm

In this paper, we propose an adaptive split-and-merge data compression algorithm based on data features and a hypothesis model. The analysis of data features provides the requisite parameters serving as constraints in the hypothesis model. In the hypothesis model, the likelihood ratio test [2] is the backbone for testing a statistical hypothesis. Let X_1, X_2, ..., X_p and Y_1, Y_2, ..., Y_q be the respective random samples corresponding to the speaker data of two regions. These two regions are assumed to have independent normal distributions N(θ_1, θ_3) and N(θ_2, θ_4), respectively. Let ω = {(θ_1, θ_2, θ_3, θ_4) : −∞ < θ_1 = θ_2 < ∞, 0 < θ_3 = θ_4 < ∞} and Ω = {(θ_1, θ_2, θ_3, θ_4) : −∞ < θ_1, θ_2 < ∞, 0 < θ_3, θ_4 < ∞}. The likelihood functions defined on the parameter spaces ω and Ω are, respectively,

L(ω) = (1 / (2πθ_3))^{(p+q)/2} exp{ −[Σ_{i=1}^{p} (X_i − θ_1)^2 + Σ_{i=1}^{q} (Y_i − θ_1)^2] / (2θ_3) }    (1)

and

L(Ω) = (1 / (2πθ_3))^{p/2} (1 / (2πθ_4))^{q/2} exp{ −Σ_{i=1}^{p} (X_i − θ_1)^2 / (2θ_3) − Σ_{i=1}^{q} (Y_i − θ_2)^2 / (2θ_4) }.    (2)

The maximum likelihood estimators u and w of θ_1 and θ_3 in Eq. (1) are, respectively,

u = (Σ_{i=1}^{p} X_i + Σ_{i=1}^{q} Y_i) / (p + q),
w = [Σ_{i=1}^{p} (X_i − u)^2 + Σ_{i=1}^{q} (Y_i − u)^2] / (p + q),

and the maximum of L(ω) is

L(ω*) = (e^{−1} / (2πw))^{(p+q)/2}.    (3)

Similarly, the maximum likelihood estimators for θ_1, θ_2, θ_3 and θ_4 of Eq. (2) are, respectively,

u_1 = Σ_{i=1}^{p} X_i / p,  u_2 = Σ_{i=1}^{q} Y_i / q,
w_1 = Σ_{i=1}^{p} (X_i − u_1)^2 / p,  w_2 = Σ_{i=1}^{q} (Y_i − u_2)^2 / q.

The maximum of L(Ω) is

L(Ω*) = (e^{−1} / (2πw_1))^{p/2} (e^{−1} / (2πw_2))^{q/2}.    (4)

On the basis of Eqs. (3) and (4), the likelihood ratio for testing the hypothesis of uniformity H_0 : θ_1 = θ_2, θ_3 = θ_4 against all alternatives is

λ* = L(ω*) / L(Ω*).    (5)
The hypothesis H_0 is rejected if and only if λ(x_1, x_2, ..., x_n) = λ* ≤ λ_0 < 1, where λ_0 is a suitably chosen constant. The significance level of the test is given by α = Pr[λ(x_1, x_2, ..., x_n) ≤ λ_0; H_0].

Two uniformity predicates, in terms of heuristic and statistical tests, are applied in the proposed method to supervise the initial region-growing and the final region-formation processes, respectively. The predicates associated with the two processes are defined as

P_1(R) = true if h_1(R) ≤ ε_1, and false otherwise,    (6)

and

P_2(R_i ∪ R_j) = true if T_2 : h_2(R_i, R_j) ≤ ε_2 and T_3 : h_3(R_i, R_j) ≤ ε_3, and false otherwise,    (7)

where ε_k, k = 1, 2, 3, are the thresholds and

h_1(R_i) = max_{(i',j') ∈ R_i} A(i', j') − min_{(i',j') ∈ R_i} A(i', j'),

h_2(R_i, R_j) = |μ_{R_i} − μ_{R_j}|,

h_3(R_i, R_j) = {Σ_{(i',j') ∈ R_i} [A(i', j') − μ_{R_i}]^2 / m}^{m} {Σ_{(i',j') ∈ R_j} [A(i', j') − μ_{R_j}]^2 / n}^{n} / {[Σ_{(i',j') ∈ R_i} (A(i', j') − μ)^2 + Σ_{(i',j') ∈ R_j} (A(i', j') − μ)^2] / (m + n)}^{m+n},

μ_{R_i} = Σ_{(i',j') ∈ R_i} A(i', j') / m,  μ_{R_j} = Σ_{(i',j') ∈ R_j} A(i', j') / n,  μ = (m · μ_{R_i} + n · μ_{R_j}) / (m + n),

where A(i, j) is the speaker data at location (i, j), and m and n are the sizes of the two regions, R_i and R_j, being tested, respectively.

The thresholds ε_1, ε_2 and ε_3 are evaluated according to the chosen characteristic feature distributions. By thresholding the distribution of standard deviations, the parameter ε_1 represents the maximum standard deviation of the speaker data distribution to be allowed in a uniform region. The values of ε_2 and ε_3, obtained by thresholding the corresponding distributions, give the respective maximum tolerances of the average data difference and the likelihood ratio when an attempt is made to join two regions.
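The three measures can be computed directly; the sketch below is illustrative only (hypothetical data, naming and region sizes; the thresholds ε_1, ε_2, ε_3 would be obtained from the feature distributions as described above) and shows how h_3 plays the role of the likelihood ratio of Eq. (5):

```python
import numpy as np

def h1(region):
    """Dynamic range of the speaker data in a region (predicate P_1, Eq. (6))."""
    return region.max() - region.min()

def h2(r_i, r_j):
    """Absolute difference of the two region means (test T_2 in Eq. (7))."""
    return abs(r_i.mean() - r_j.mean())

def h3(r_i, r_j):
    """Likelihood-ratio style homogeneity measure (test T_3 in Eq. (7), cf. Eq. (5)).
    For large regions the powers should be taken in the log domain to avoid underflow."""
    m, n = len(r_i), len(r_j)
    mu = (m * r_i.mean() + n * r_j.mean()) / (m + n)
    num = (((r_i - r_i.mean()) ** 2).sum() / m) ** m * (((r_j - r_j.mean()) ** 2).sum() / n) ** n
    den = ((((r_i - mu) ** 2).sum() + ((r_j - mu) ** 2).sum()) / (m + n)) ** (m + n)
    return num / den

# Two hypothetical one-dimensional regions of speaker data
rng = np.random.default_rng(0)
r_i, r_j = rng.normal(0.0, 0.2, 40), rng.normal(0.05, 0.2, 50)
print(h1(r_i), h2(r_i, r_j), h3(r_i, r_j))
```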
2.1. Speaker feature training

A database of Mandarin speakers recorded from radio broadcasts was used. The radio broadcast data contain 65 male speakers and 35 female speakers. For each speaker, 400 frames of speech data are collected, where silence and noise are not eliminated. The details can be found in Lung [4].

2.2. Classification of speech data

The maximum likelihood decision rule is applied to the 200-frame averaged spectral vector Y in the KLT [1], hard-limited KLT [3] (HLKLT), reduced form of KLT [4] (RFKLT) and adaptive KLT domains. The details can be found in Lung [4].

3. Experiment results

Several speaker recognition experiments were performed to evaluate the various KLTs. The results on computational speed are presented in Table 1. In the table, type A shows the effect of the data compression algorithm on the average time required, type B the effect of the intra-speaker covariance, type C the effect of the inter-speaker covariance, type D the effect of the eigenvectors, and type E the effect of per-speaker recognition on the average time required. For the speaker recognition experiments, the results for the four methods are reported in Table 2. The best recognition rate obtained with the adaptive KLT was 93%.
Table 1
Average computation time (milliseconds, Pentium-133)

Type     KLT [1]    HLKLT [3]    RFKLT [4]    Adaptive KLT
A        0          0            505          617
B        283        283          247          266
C        301        301          151          145
D        490        490          193          187
E        570        448          199          213
Total    1644       1522         1295         1428
Table 2
Speaker recognition rates (silence and noise are not eliminated)

Method          Dim = 16    Dim = 24    Dim = 32
KLT [1]         78%         83%         89%
HLKLT [3]       76%         80%         85%
RFKLT [4]       84%         87%         91%
Adaptive KLT    85%         90%         93%
4. Discussions

We have presented a region-based compression algorithm which combines the strengths of data feature analysis and a hypothesis model to produce an initial speaker data compression. All the parameters in the algorithm are computed automatically on the basis of the data
features extracted from the regions and depend only on the context of the speaker data under analysis. The computed parameters provide the hypothesis model with appropriate constraints to test the region homogeneity.
References

[1] Shung-Yung Lung, Disk distance measure of speaker recognition, Electron. Lett. 33 (1997) 1678-1679.
[2] M. Spann, R. Wilson, A quad-tree approach to image segmentation which combines statistical and spatial information, Pattern Recognition 18 (1985) 257-269.
[3] Chih-Chien Thomas Chen, Chin-Ta Chen, Chih-Ming Tsai, Hard-limited Karhunen-Loeve transform for text independent speaker recognition, Electron. Lett. 33 (1997) 2014-2016.
[4] Shung-Yung Lung, Chih-Chien Thomas Chen, Further reduced form of Karhunen-Loeve transform for text independent speaker recognition, Electron. Lett. 34 (1998) 1380-1382.