K-NN query algorithm based on PB-tree with the parallel lines division
© Tang et al.; licensee Springer. 2012
Received: 9 August 2012
Accepted: 13 August 2012
Published: 30 November 2012
Spatial index and query are enabling techniques for achieving the vision of the Internet of Things. K-NN is an algorithm which is used widely in spatial database. Traditional query algorithms use R-tree as the index structure and improve the query efficiency by using the measurement distance and pruning strategy. Based on the study of previous algorithms, this paper proposes a novel K-NN query algorithm based on PB-tree with the parallel lines division. PB-tree index is different from the traditional R-tree index, where PB-tree adopts parallel lines to divide the spatial region and uses parallel lines as the parent node. It is similar to the binary tree index structure and requires to query three small portions nearest to the queried object in each K-NN query. Therefore, the search range is narrowed and the query efficiency is enhanced. Experiments show that PB-tree is better than the traditional R-tree from the aspect of query performance. PB-tree can avoid the deficiency of a large number of overlap and coverage among odes in R-tree and multiple index paths when searching data objects, and hence PB-tree can find K-NN objects meeting the conditions quickly and efficiently in large data sets.
Things (or objects) in the Internet of Things [1, 2] can be modeled as MBRs  in some context, and spatial index and query are enabling techniques for searching objects that a user interests. K-neighbor query method is widely used for querying spatial objects in spatial database. It is different from the point and range query, and is used to find the K-nearest objects near a given point in the space, namely, the k-NN query [4, 5].
Current research of spatial index methods attempts to achieve the quick mapping from spatial regions to spatial entities . As an important technology aiming to improve query efficiency, spatial index is a key component of spatial database system structure. R-tree [3, 7] proposed by Guttman is a balanced tree which is composed of multi-nested external rectangles. It is another form of B-tree developed towards multi-dimensional space, and divides the space objects according to ranges, and each node corresponds to a region and a disk page. Disk pages of non-leaf nodes store the region scope of all its sub-nodes, and all child nodes of non-leaf nodes fall within the scope of the non-leaf nodes region. Disk pages of leaf nodes store MBRs of all spatial objects within leaf nodes regional scope. R-tree is a dynamic index structure. Queries can be carried out simultaneously with the insertion or deletion algorithm, and this method does not require to reorganize the tree structure regularly.
Many K-NN query algorithms have been proposed based on R-tree index structure.  presented a branch and bound method to traverse R-tree, which sorts and prunes the MBR within the nodes in R-tree to carry on K-nearest neighbors query.  proposed a spatial k-NN query strategy by using MINDIST and MINMAXDIST to sort and pruning rule to prune tree, and generated Active Branch List (ABL) in leaf-level, namely, rect1, rect2, ⋯, rectk. ABL has the nodes needed to be visited currently and the child nodes needed to continue to search. Each level ABL is sorted in R-tree according to the ascending order of MINDIST and MINMAXDIST to carry on K-NN query.  proposed a multi-object K-NN query, and used pruning rules to achieve multi-object K-NN query in R-tree. However, R-tree allows overlap and coverage among the sibling nodes, even for the exact match query, R-tree cannot guarantee to visit only one branch when carrying inquiries. This is the factor affecting the searching efficiency of R-tree.
Based on above K-NN algorithms, a novel K-NN query algorithm based on PB-tree with the parallel lines division is proposed in this paper. This method aims to address the problem of overlap and coverage among the nodes in R-tree. The construction of PB-tree index is similar to that of binary tree, which uses the parallel lines dividing the region as parent nodes. The searching path of leaf node is single. There is no overlap and coverage among division regions. No matter the queried objects are stored in the parent nodes or leaf nodes, the K-NN query region of every object contains three small portions only in PB-tree. When the query result does not satisfy the conditions, it needs continue to expand the query scope toward the two sides of the previous region. Due to effectively selecting the searching range, the query efficiency and accuracy are both improved.
The K-NN query algorithm based on the PB-tree with the parallel lines division
Spatial index structure [11, 12] organizes and stores index data according to the spatial distributive features of spatial data. It does not traverse data sets when carrying on queries to data sets. Spatial index can completely gain query results or the smaller data set including all query results through visiting the index data, so the good index structure has the advantages of higher storage efficiency and query efficiency.
PB-tree adopts parallel lines to divide the spatial region. The selected location of parallel lines should be able to make the entire region divided into two almost equal parts in each division until the number of data objects contained in each region is between the pre-specified maximum and minimum values. Data objects intersecting with parallel lines are stored in the corresponding data lists of parent nodes. Data objects completely contained in a region are stored in corresponding leaf nodes. The restrictive parameters M is the maximum number of data objects contained in the leaf nodes of PB-tree, and m is minimum number of entity objects contained in the leaf nodes. The PB-tree has the following properties:
The root node is the diagonal which divides the entire region into two parts. Root node has two children nodes, the left and right regions of the diagonal (the left and right regions are not divided), or the division parallel lines of the left and right regions (the left and right regions are divided);
Each non-leaf node is the parallel line dividing the sub-regions. All data objects intersecting with parallel lines are stored in the corresponding data lists. The left and right child nodes are the adjacent left and right regions or the division parallel lines of the adjacent left and right regions;
The number of entity objects contained in each leaf node is between m and M.
During the process of carrying on K-NN queries, since the actual space region is very large and without rules, if directly using the parallel lines to divide the area, the distribution of data objects is not uniform. To resolve this issue, before making the inquiries, we carry on clustering on the spatial data objects through using the K-means  clustering algorithm, which makes the MBR of each cluster tending to square. The distance among data objects in the same cluster is small, and the distance among clusters is large.
- (1)The queried objects intersecting with the parallel lines
We first query the pointer which points to the data list where the queried object is, and achieve the corresponding parallel line;
We search the sub-tree which uses the parallel line as the root node, and gain the most right child node division region from the left sub-tree and the most left child node division region from the right sub-tree. If the left sub-tree or right sub-tree is the division region, we directly return the corresponding left or right division region.
The neighbor region of the queried object will be the parallel line where the queried object is and the corresponding left and right division regions. We calculate the distance between the data objects in the above three parts and the queried object, sorting and inserting these data objects into the sorting queue.
If we fail to find all the neighbor objects meeting the conditions, we need continue to expand the query area, merge the searched regions as a new region, and continue to query the left and right parallel lines region of this new area, the corresponding query algorithm being shown in the second case.
- (2)The queried object is included in a division region
• We search the region where the query object is, and find the parent node parallel line of the above region. If this area is the left (right) child, then the above parallel line is the nearest right (left) parallel line of the queried object;
We continue to query the left (right) parallel line of the region where the queried object is, and search the parent node of the found right (left) parallel line. If the right parallel line is the left child, and the region where the query object is does not have left adjacent parallel line, then the neighbor region of the queried data object is the region where the queried object is and the right parallel line. If the right parallel line is the right child, and the parent node parallel line is the left parallel line of the region where the queried object is, then the neighbor region of the queried data object is the region where the queried object is and the left and right parallel lines. If the left parallel line is the left child, and the parent node parallel line is the right parallel line of the region where the queried object is, then the neighbor region of the queried data object is the region where the queried object is and the left and right parallel lines. If the left parallel line is the right child, we use the middle order traversal algorithm to search for the direct stepfather node of the region where the queried object is, and the parallel line of the direct stepfather node is the right parallel line of the region where the queried object is. We calculate the distance between the data objects in the above searched region and the queried object, sorting and inserting these data objects into the sorting queue.
If we fail to find the neighbor objects meeting the conditions, it still needs to expand the query area, merge the above queried area as a new area, and continue to find the left and right division regions of the new area, the corresponding query algorithm being shown in the first case.
When the search is finished in the cluster class where the queried object is, if it still fails to find all the neighbor objects meeting the conditions, we search the next class nearest to the queried object, add the queried object into the new found cluster class to form a new region, and continue to carry on division and query in the new region, until we find all the neighbor objects.
The new K-NN query algorithm based on PB-tree will traverse the tree no matter searching the left and right division regions or the left and right division parallel lines. In the process of traversing, there are two cases according to the region where the object is. Because of partitioning out the data objects intersecting with the parallel lines, there are no overlap and coverage among the division regions. When searching the corresponding regions and parallel lines, the query path is single, and it can directly find the left and right neighbor regions. Compared with the R-tree, the width and depth of PB-tree is reduced, and all the division regions are not at the same level. The traversal query is very quick when carry on querying the left and right parallel lines of the division regions in the low levels, and the query efficiency is greatly improved. If the most left and right division areas of the sub-tree of a certain parallel line are in the low levels, the time spent on querying is also significantly reduced, and the query efficiency of traversing from top to bottom or from bottom to top is significantly enhanced, thus it can greatly improve the K-NN query efficiency.
Experiment and Evaluation
The experimental results show that the new query algorithm outperforms the traditional K-NN query algorithm based on the R-tree in performance. The CPU time and disk access times required for querying are greatly reduced. After clustering the regions, when carry on searching in each cluster class, due to the number of data objects is less, it is easy to manage, and the distance between the data objects in the same class is small. After dividing the region by using the parallel lines, the length and width of the small region in each class are proportional to the distance between data objects. It will not appear the situation of making the data objects in the neighbor area not the former neighbor objects of the queried object, at the same time, the data objects not in the neighbor region are the former neighbor objects due to the length and width of the neighbor area overlarge. If we carry on the division to the whole area, this situation will occur frequently and the query efficiency and accuracy are obviously reduced. Because of conducting the region division to the cluster where the queried object is and establishing the PB-tree index in the above cluster, it can quickly find the neighbor query region through simply traversing a certain part of the tree or querying a certain single path. During the process of regional expansion, it is the alternation of the above two cases. The entire search algorithm makes K-NN query efficiency greatly increased.
To support spatial query in the context of the Internet of Things, this paper proposed a novel K-NN query algorithm based on PB-tree index. Through dividing a large region into small regions by clustering algorithm, our method can achieve the effective management of spatial data objects. The experiments show that the proposed query algorithm has a good performance, and can be widely used in the large spatial data set.
This work was supported by the Fundamental Research Funds for the Central Universities (China University of Geosciences at Beijing).
- Ashton K: That ‘Internet of Things’ Thing. RFID J. 2011Google Scholar
- Doytsher Y, Galon B, Kanza Y: Querying Socio-spatial Networks on the World-Wide Web. Proceedings of the 21st International World Wide Web Conference (WWW). 2012, Lyon, France, 329–332Google Scholar
- Guttman A: R-trees: a Dynamic Index Structure for Spatial Searching. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1984, Boston, Massachusetts, 47–57Google Scholar
- Li B, Pan M, Wu Z: Effective Reverse K-nearest Neighbor Query Based on Revised R∗-tree in Spatial Databases. Proceedings of the 19th International Conference on Geoinformatics. 2011, Shanghai, China, 1–5Google Scholar
- Lu J, Lu Y, Cong G: Reverse Spatial and Textual k Nearest Neighbor Search. Proceedings of the 2011 international conference on Management of data (SIGMOD ’11). 2011, Athens, Greece, 349–360View ArticleGoogle Scholar
- Al-Badarneh A, Al-Alaj A: A Spatial Index Structure Using Dynamic Recursive Space Partitioning. Proceedings of the International Conference on Innovations in Information Technology. 2011, Abu Dhabi, United Arab Emirates, 255–260Google Scholar
- He Z, Wu C, Wang C: Clustered Sorting R-tree: An Index for Multi-Dimensional Spatial Objects. Proceedings of the 4th International Conference on Natural Computation. 2008, Jinan, China, 348–352Google Scholar
- Lai L, Liu Z, Yan L: K-nearest neighbor search algorithm by using R-tree. Computer Engineering and Design, vol 23(9). 2002Google Scholar
- Liu Y, Zhu Z, Shi S: A new space k-nearest neighbor query strategy. Journal of Shanghai Jiao Tong University. 2001, 35 (9):
- Liu Y, Bo S, Zhang Q, Hao Z: Multi-object nearest neighbor queries. Computer Engineering, vol 30(11). 2004, 66-68.Google Scholar
- Xie G, Su J: Spatial data indexing technology and its application in GIS software. Hainan Normal University, vol 18(4). 2005, 372-376.Google Scholar
- Wu X, Zang C: A New Spatial Index Structure For GIS Data. Proceedings of the 3rd International Conference on Multimedia and Ubiquitous Engineering. 2009, Qingdao, China, 471–476Google Scholar
- Pettinger D, Fatta GD: Space Partitioning for Scalable K-Means. Proceedings of the 9th International Conference on Machine Learning and Applications. 2010, Washington, DC, USA, 319–324Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.