 Original submission
 Open Access
 Published:
KNN query algorithm based on PBtree with the parallel lines division
Communications in Mobile Computing volume 1, Article number: 10 (2012)
Abstract
Spatial index and query are enabling techniques for achieving the vision of the Internet of Things. KNN is an algorithm which is used widely in spatial database. Traditional query algorithms use Rtree as the index structure and improve the query efficiency by using the measurement distance and pruning strategy. Based on the study of previous algorithms, this paper proposes a novel KNN query algorithm based on PBtree with the parallel lines division. PBtree index is different from the traditional Rtree index, where PBtree adopts parallel lines to divide the spatial region and uses parallel lines as the parent node. It is similar to the binary tree index structure and requires to query three small portions nearest to the queried object in each KNN query. Therefore, the search range is narrowed and the query efficiency is enhanced. Experiments show that PBtree is better than the traditional Rtree from the aspect of query performance. PBtree can avoid the deficiency of a large number of overlap and coverage among odes in Rtree and multiple index paths when searching data objects, and hence PBtree can find KNN objects meeting the conditions quickly and efficiently in large data sets.
Introduction
Things (or objects) in the Internet of Things [1, 2] can be modeled as MBRs [3] in some context, and spatial index and query are enabling techniques for searching objects that a user interests. Kneighbor query method is widely used for querying spatial objects in spatial database. It is different from the point and range query, and is used to find the Knearest objects near a given point in the space, namely, the kNN query [4, 5].
Current research of spatial index methods attempts to achieve the quick mapping from spatial regions to spatial entities [6]. As an important technology aiming to improve query efficiency, spatial index is a key component of spatial database system structure. Rtree [3, 7] proposed by Guttman is a balanced tree which is composed of multinested external rectangles. It is another form of Btree developed towards multidimensional space, and divides the space objects according to ranges, and each node corresponds to a region and a disk page. Disk pages of nonleaf nodes store the region scope of all its subnodes, and all child nodes of nonleaf nodes fall within the scope of the nonleaf nodes region. Disk pages of leaf nodes store MBRs of all spatial objects within leaf nodes regional scope. Rtree is a dynamic index structure. Queries can be carried out simultaneously with the insertion or deletion algorithm, and this method does not require to reorganize the tree structure regularly.
Many KNN query algorithms have been proposed based on Rtree index structure. [8] presented a branch and bound method to traverse Rtree, which sorts and prunes the MBR within the nodes in Rtree to carry on Knearest neighbors query. [9] proposed a spatial kNN query strategy by using MINDIST and MINMAXDIST to sort and pruning rule to prune tree, and generated Active Branch List (ABL) in leaflevel, namely, rect1, rect2, ⋯, rectk. ABL has the nodes needed to be visited currently and the child nodes needed to continue to search. Each level ABL is sorted in Rtree according to the ascending order of MINDIST and MINMAXDIST to carry on KNN query. [10] proposed a multiobject KNN query, and used pruning rules to achieve multiobject KNN query in Rtree. However, Rtree allows overlap and coverage among the sibling nodes, even for the exact match query, Rtree cannot guarantee to visit only one branch when carrying inquiries. This is the factor affecting the searching efficiency of Rtree.
Based on above KNN algorithms, a novel KNN query algorithm based on PBtree with the parallel lines division is proposed in this paper. This method aims to address the problem of overlap and coverage among the nodes in Rtree. The construction of PBtree index is similar to that of binary tree, which uses the parallel lines dividing the region as parent nodes. The searching path of leaf node is single. There is no overlap and coverage among division regions. No matter the queried objects are stored in the parent nodes or leaf nodes, the KNN query region of every object contains three small portions only in PBtree. When the query result does not satisfy the conditions, it needs continue to expand the query scope toward the two sides of the previous region. Due to effectively selecting the searching range, the query efficiency and accuracy are both improved.
The KNN query algorithm based on the PBtree with the parallel lines division
Spatial index structure [11, 12] organizes and stores index data according to the spatial distributive features of spatial data. It does not traverse data sets when carrying on queries to data sets. Spatial index can completely gain query results or the smaller data set including all query results through visiting the index data, so the good index structure has the advantages of higher storage efficiency and query efficiency.
PBtree adopts parallel lines to divide the spatial region. The selected location of parallel lines should be able to make the entire region divided into two almost equal parts in each division until the number of data objects contained in each region is between the prespecified maximum and minimum values. Data objects intersecting with parallel lines are stored in the corresponding data lists of parent nodes. Data objects completely contained in a region are stored in corresponding leaf nodes. The restrictive parameters M is the maximum number of data objects contained in the leaf nodes of PBtree, and m is minimum number of entity objects contained in the leaf nodes. The PBtree has the following properties:

The root node is the diagonal which divides the entire region into two parts. Root node has two children nodes, the left and right regions of the diagonal (the left and right regions are not divided), or the division parallel lines of the left and right regions (the left and right regions are divided);

Each nonleaf node is the parallel line dividing the subregions. All data objects intersecting with parallel lines are stored in the corresponding data lists. The left and right child nodes are the adjacent left and right regions or the division parallel lines of the adjacent left and right regions;

The number of entity objects contained in each leaf node is between m and M.
During the process of carrying on KNN queries, since the actual space region is very large and without rules, if directly using the parallel lines to divide the area, the distribution of data objects is not uniform. To resolve this issue, before making the inquiries, we carry on clustering on the spatial data objects through using the Kmeans [13] clustering algorithm, which makes the MBR of each cluster tending to square. The distance among data objects in the same cluster is small, and the distance among clusters is large.
We will search the cluster class of queried objects by adopting the parallel lines to carry on division to the region of above cluster class. In such division, there are two cases depending on whether the queried data object intersecting with the division parallel lines or completely contained in a divided area. The specific query algorithm is as follows:

(1)
The queried objects intersecting with the parallel lines

We first query the pointer which points to the data list where the queried object is, and achieve the corresponding parallel line;

We search the subtree which uses the parallel line as the root node, and gain the most right child node division region from the left subtree and the most left child node division region from the right subtree. If the left subtree or right subtree is the division region, we directly return the corresponding left or right division region.

The neighbor region of the queried object will be the parallel line where the queried object is and the corresponding left and right division regions. We calculate the distance between the data objects in the above three parts and the queried object, sorting and inserting these data objects into the sorting queue.

If we fail to find all the neighbor objects meeting the conditions, we need continue to expand the query area, merge the searched regions as a new region, and continue to query the left and right parallel lines region of this new area, the corresponding query algorithm being shown in the second case.


(2)
The queried object is included in a division region

• We search the region where the query object is, and find the parent node parallel line of the above region. If this area is the left (right) child, then the above parallel line is the nearest right (left) parallel line of the queried object;

We continue to query the left (right) parallel line of the region where the queried object is, and search the parent node of the found right (left) parallel line. If the right parallel line is the left child, and the region where the query object is does not have left adjacent parallel line, then the neighbor region of the queried data object is the region where the queried object is and the right parallel line. If the right parallel line is the right child, and the parent node parallel line is the left parallel line of the region where the queried object is, then the neighbor region of the queried data object is the region where the queried object is and the left and right parallel lines. If the left parallel line is the left child, and the parent node parallel line is the right parallel line of the region where the queried object is, then the neighbor region of the queried data object is the region where the queried object is and the left and right parallel lines. If the left parallel line is the right child, we use the middle order traversal algorithm to search for the direct stepfather node of the region where the queried object is, and the parallel line of the direct stepfather node is the right parallel line of the region where the queried object is. We calculate the distance between the data objects in the above searched region and the queried object, sorting and inserting these data objects into the sorting queue.

If we fail to find the neighbor objects meeting the conditions, it still needs to expand the query area, merge the above queried area as a new area, and continue to find the left and right division regions of the new area, the corresponding query algorithm being shown in the first case.

When the search is finished in the cluster class where the queried object is, if it still fails to find all the neighbor objects meeting the conditions, we search the next class nearest to the queried object, add the queried object into the new found cluster class to form a new region, and continue to carry on division and query in the new region, until we find all the neighbor objects.

The new KNN query algorithm based on PBtree will traverse the tree no matter searching the left and right division regions or the left and right division parallel lines. In the process of traversing, there are two cases according to the region where the object is. Because of partitioning out the data objects intersecting with the parallel lines, there are no overlap and coverage among the division regions. When searching the corresponding regions and parallel lines, the query path is single, and it can directly find the left and right neighbor regions. Compared with the Rtree, the width and depth of PBtree is reduced, and all the division regions are not at the same level. The traversal query is very quick when carry on querying the left and right parallel lines of the division regions in the low levels, and the query efficiency is greatly improved. If the most left and right division areas of the subtree of a certain parallel line are in the low levels, the time spent on querying is also significantly reduced, and the query efficiency of traversing from top to bottom or from bottom to top is significantly enhanced, thus it can greatly improve the KNN query efficiency.
As the division region shown in Figure 1 and the corresponding PBtree index shown in Figure 2, if we want to find the six neighbor objects of P_{12}, we search the data list of the storage area where the P_{12} is, that is the parallel line d, according to the search algorithm of the first case. We find the left and right division regions R_{4} and R_{5} of d, calculate the distance between P_{12} and the data objects in the above three regions, sort and insert these data objects into the sorting queue, and achieve the former three neighbor object P_{5}, P_{8}, P_{9}. Then we merge the three regions, search the left and right parallel lines according to the query algorithm of the second case, and gain the remaining three neighbor objects P_{1}, P_{2}, P_{6}.
Experiment and Evaluation
In order to demonstrate the KNN query algorithm feasibility of the improved PBtree data structure, we adopt the new KNN query algorithm a based on the PBtree with the parallel lines division and the KNN query algorithm b proposed in literature [8] to carry on comparison of the query performance. The experimental data come from the real twodimensional geospatial data of a city. The data set size is 500000, including points, lines and polygons and other spatial entities. We use c++ language to carry on programming, and all programs are run on windows XP operating system. Figure 3 shows the CPU running time comparison of the two algorithms by adjusting the number of the queried objects (fixing K). Figure 4 presents the changes of the used disk access times with k when using the two algorithms (fixing the number of the queried objects). Figure 5 shows the accuracy when query the 250 neighbor objects of the given 100, 200, 300, 400 data objects.
The experimental results show that the new query algorithm outperforms the traditional KNN query algorithm based on the Rtree in performance. The CPU time and disk access times required for querying are greatly reduced. After clustering the regions, when carry on searching in each cluster class, due to the number of data objects is less, it is easy to manage, and the distance between the data objects in the same class is small. After dividing the region by using the parallel lines, the length and width of the small region in each class are proportional to the distance between data objects. It will not appear the situation of making the data objects in the neighbor area not the former neighbor objects of the queried object, at the same time, the data objects not in the neighbor region are the former neighbor objects due to the length and width of the neighbor area overlarge. If we carry on the division to the whole area, this situation will occur frequently and the query efficiency and accuracy are obviously reduced. Because of conducting the region division to the cluster where the queried object is and establishing the PBtree index in the above cluster, it can quickly find the neighbor query region through simply traversing a certain part of the tree or querying a certain single path. During the process of regional expansion, it is the alternation of the above two cases. The entire search algorithm makes KNN query efficiency greatly increased.
Conclusions
To support spatial query in the context of the Internet of Things, this paper proposed a novel KNN query algorithm based on PBtree index. Through dividing a large region into small regions by clustering algorithm, our method can achieve the effective management of spatial data objects. The experiments show that the proposed query algorithm has a good performance, and can be widely used in the large spatial data set.
References
 1.
Ashton K: That ‘Internet of Things’ Thing. RFID J. 2011
 2.
Doytsher Y, Galon B, Kanza Y: Querying Sociospatial Networks on the WorldWide Web. Proceedings of the 21st International World Wide Web Conference (WWW). 2012, Lyon, France, 329–332
 3.
Guttman A: Rtrees: a Dynamic Index Structure for Spatial Searching. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1984, Boston, Massachusetts, 47–57
 4.
Li B, Pan M, Wu Z: Effective Reverse Knearest Neighbor Query Based on Revised R∗tree in Spatial Databases. Proceedings of the 19th International Conference on Geoinformatics. 2011, Shanghai, China, 1–5
 5.
Lu J, Lu Y, Cong G: Reverse Spatial and Textual k Nearest Neighbor Search. Proceedings of the 2011 international conference on Management of data (SIGMOD ’11). 2011, Athens, Greece, 349–360
 6.
AlBadarneh A, AlAlaj A: A Spatial Index Structure Using Dynamic Recursive Space Partitioning. Proceedings of the International Conference on Innovations in Information Technology. 2011, Abu Dhabi, United Arab Emirates, 255–260
 7.
He Z, Wu C, Wang C: Clustered Sorting Rtree: An Index for MultiDimensional Spatial Objects. Proceedings of the 4th International Conference on Natural Computation. 2008, Jinan, China, 348–352
 8.
Lai L, Liu Z, Yan L: Knearest neighbor search algorithm by using Rtree. Computer Engineering and Design, vol 23(9). 2002
 9.
Liu Y, Zhu Z, Shi S: A new space knearest neighbor query strategy. Journal of Shanghai Jiao Tong University. 2001, 35 (9):
 10.
Liu Y, Bo S, Zhang Q, Hao Z: Multiobject nearest neighbor queries. Computer Engineering, vol 30(11). 2004, 6668.
 11.
Xie G, Su J: Spatial data indexing technology and its application in GIS software. Hainan Normal University, vol 18(4). 2005, 372376.
 12.
Wu X, Zang C: A New Spatial Index Structure For GIS Data. Proceedings of the 3rd International Conference on Multimedia and Ubiquitous Engineering. 2009, Qingdao, China, 471–476
 13.
Pettinger D, Fatta GD: Space Partitioning for Scalable KMeans. Proceedings of the 9th International Conference on Machine Learning and Applications. 2010, Washington, DC, USA, 319–324
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (China University of Geosciences at Beijing).
Author information
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Leaf Node
 Parallel Line
 Data Object
 Spatial Index
 Cluster Class