A spatial index structure [11, 12] organizes and stores index data according to the spatial distribution of the data. A query therefore does not need to traverse the whole data set: by visiting only the index data, it obtains either the exact query result or a smaller candidate set that contains all results. A good index structure thus offers both high storage efficiency and high query efficiency.
The PB-tree uses parallel lines to divide the spatial region. Each line is positioned so that it splits the current region into two nearly equal parts, and division continues until the number of data objects in every region lies between the pre-specified minimum and maximum values. Data objects that intersect a parallel line are stored in the data list of the corresponding parent node; data objects completely contained in a region are stored in the corresponding leaf node. The parameter M is the maximum number of data objects a leaf node may contain, and m is the minimum. The PB-tree has the following properties:
- The root node is the diagonal that divides the entire region into two parts. It has two child nodes: the regions to the left and right of the diagonal if these regions are not divided further, or the parallel lines that divide them if they are.
- Each non-leaf node is a parallel line that divides a sub-region. All data objects intersecting the line are stored in its data list. Its left and right child nodes are the adjacent left and right regions, or the parallel lines that divide those regions.
- Each leaf node contains between m and M data objects.
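The structure described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the dividing lines are reduced to positions on a single axis, each object is represented only by its extent `(lo, hi)` across the dividing direction, the split position is taken as the median of the object centers, and only the upper bound M is enforced (the lower bound m is ignored). The names `Region`, `Line`, and `build` are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Tuple, Union

# (name, lo, hi): extent of an object across the dividing direction (assumed form)
Obj = Tuple[str, float, float]

@dataclass
class Region:
    """Leaf node: objects lying entirely inside one division region."""
    objects: List[Obj]

@dataclass
class Line:
    """Non-leaf node: a dividing line at `pos` plus its data list."""
    pos: float
    crossing: List[Obj]            # objects that intersect the line
    left: Union["Line", Region]
    right: Union["Line", Region]

def build(objects: List[Obj], M: int) -> Union[Line, Region]:
    """Recursively divide until every leaf holds at most M objects."""
    if len(objects) <= M:
        return Region(objects)
    # Assumed split rule: the median center gives two roughly equal halves.
    pos = median((lo + hi) / 2 for _, lo, hi in objects)
    crossing = [o for o in objects if o[1] <= pos <= o[2]]
    left = [o for o in objects if o[2] < pos]    # entirely left of the line
    right = [o for o in objects if o[1] > pos]   # entirely right of the line
    return Line(pos, crossing, build(left, M), build(right, M))
```

Because objects that intersect a line are kept in that line's data list, the leaf regions on either side never share an object, which is the non-overlap property the query algorithm relies on.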
In K-NN queries the actual spatial region is very large and irregular, so dividing it directly with parallel lines yields a non-uniform distribution of data objects. To resolve this, before querying we cluster the spatial data objects with the K-means [13] clustering algorithm, which makes the MBR (minimum bounding rectangle) of each cluster tend toward a square. Distances between objects in the same cluster are small, and distances between clusters are large.
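As an illustration of this preprocessing step, the following is a minimal sketch of the standard K-means iteration (assign each point to its nearest center, then move each center to the mean of its points) together with the MBR of a cluster. It assumes 2-D point objects and caller-supplied initial centers; the function names `assign`, `kmeans`, and `mbr` are hypothetical, and the paper does not specify how the initial centers or the number of clusters are chosen.

```python
from math import dist

def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centers, max_iters=50):
    """Iterate assignment and center update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        clusters = assign(points, centers)
        updated = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)   # an empty cluster keeps its center
        ]
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)

def mbr(cluster):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a cluster."""
    xs, ys = zip(*cluster)
    return min(xs), min(ys), max(xs), max(ys)
```

Each resulting cluster would then be divided by parallel lines separately.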
We search the cluster containing the queried object by dividing that cluster's region with parallel lines. There are two cases, depending on whether the queried object intersects a dividing parallel line or is completely contained in a divided region. The query algorithm is as follows:
- (1) The queried object intersects a parallel line
  - We first look up the pointer to the data list that holds the queried object, and obtain the corresponding parallel line.
  - We search the sub-tree rooted at this parallel line, and obtain the rightmost leaf region of its left sub-tree and the leftmost leaf region of its right sub-tree. If the left or right sub-tree is itself a division region, we return that region directly.
  - The neighbor region of the queried object consists of the parallel line on which the object lies and the corresponding left and right division regions. We calculate the distance between the queried object and each data object in these three parts, and insert the data objects into a queue sorted by distance.
  - If this does not yield all the required neighbor objects, we expand the query area: we merge the searched regions into a new region and continue with the parallel lines to the left and right of this new region, using the algorithm of the second case.
- (2) The queried object is contained in a division region
  - We locate the region containing the queried object and find the parallel line that is its parent node. If the region is the left (right) child, this parallel line is the nearest right (left) parallel line of the queried object.
  - We then look for the parallel line on the other side of the region by examining the parent node of the parallel line just found. If the right parallel line is a left child and the region has no adjacent left parallel line, the neighbor region of the queried object is the region itself together with the right parallel line. If the right parallel line is a right child and its parent is the left parallel line of the region, the neighbor region is the region itself together with the left and right parallel lines. If the left parallel line is a left child and its parent is the right parallel line of the region, the neighbor region is again the region itself together with the left and right parallel lines. If the left parallel line is a right child, we use in-order traversal to find the immediate in-order successor of the region; the parallel line at that node is the right parallel line of the region. We calculate the distance between the queried object and each data object in the neighbor region found, and insert the data objects into the queue sorted by distance.
  - If this does not yield all the required neighbor objects, we again expand the query area: we merge the queried area into a new area and continue with the division regions to the left and right of the new area, using the algorithm of the first case.
- When the search of the cluster containing the queried object is finished and not all required neighbor objects have been found, we search the next cluster nearest to the queried object, add the queried object to that cluster to form a new region, and continue dividing and querying in the new region until all neighbor objects are found.
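The two neighbor lookups in cases (1) and (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the `line` helper, and the function names are hypothetical, parent pointers are assumed to be available, and the small tree at the end is invented for the example rather than taken from the figures. A node with no children stands for a division region; any other node stands for a parallel line.

```python
class Node:
    """A PB-tree node: a parallel line (two children) or a region (leaf)."""
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right, self.parent = name, left, right, None

    def is_region(self):
        return self.left is None and self.right is None

def line(name, left, right):
    """Create a line node and set the parent pointers of its children."""
    node = Node(name, left, right)
    left.parent = right.parent = node
    return node

def neighbors_of_line(node):
    """Case (1): rightmost region of the left sub-tree and
    leftmost region of the right sub-tree."""
    l = node.left
    while not l.is_region():
        l = l.right
    r = node.right
    while not r.is_region():
        r = r.left
    return l, r

def neighbors_of_region(region):
    """Case (2): nearest left and right parallel lines of a region.
    Either result is None when the region lies on the outer boundary."""
    left_line = right_line = None
    child, anc = region, region.parent
    while anc is not None and (left_line is None or right_line is None):
        if anc.left is child and right_line is None:
            right_line = anc      # we came up from the left, so anc lies to the right
        if anc.right is child and left_line is None:
            left_line = anc       # we came up from the right, so anc lies to the left
        child, anc = anc, anc.parent
    return left_line, right_line

# Invented example; left-to-right order is R1, a, R2, b, R3, c, R4.
R1, R2, R3, R4 = (Node(n) for n in ("R1", "R2", "R3", "R4"))
a, c = line("a", R1, R2), line("c", R3, R4)
b = line("b", a, c)
```

In this example `neighbors_of_line(b)` returns the regions `R2` and `R3`, and `neighbors_of_region(R2)` returns the lines `a` and `b`. Both lookups follow a single path in the tree, so their cost is bounded by the tree height.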
The new PB-tree-based K-NN query algorithm traverses the tree whether it is searching for the left and right division regions or for the left and right dividing parallel lines; which of the two cases applies depends on where the queried object lies. Because the data objects intersecting the parallel lines are separated out, the division regions neither overlap nor cover one another. When searching for the corresponding regions and parallel lines, the query path is therefore unique, and the left and right neighbor regions can be found directly. Compared with the R-tree, the width and depth of the PB-tree are reduced, and the division regions are not all at the same level. Finding the left and right parallel lines of a division region at a low level requires only a short traversal. Likewise, if the leftmost and rightmost division regions of the sub-tree of a parallel line are at low levels, the time spent on the query is significantly reduced. Both top-down and bottom-up traversals are therefore faster, which greatly improves the efficiency of K-NN queries.
Consider the division shown in Figure 1 and the corresponding PB-tree index shown in Figure 2. To find the six nearest neighbors of P12, we first search the data list in which P12 is stored, namely that of parallel line d, following the algorithm of the first case. We find the left and right division regions of d, R4 and R5, calculate the distance between P12 and the data objects in these three parts, insert the objects into the sorted queue, and obtain the first three neighbor objects P5, P8 and P9. We then merge the three parts, search the left and right parallel lines following the algorithm of the second case, and obtain the remaining three neighbor objects P1, P2 and P6.
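The expansion pattern in this example (start from the part holding the queried object, add the parts on either side, then merge and widen again) can be sketched on the left-to-right sequence of regions and parallel-line data lists. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `knn_expand`, the flat `cells` list, and the `(name, x, y)` object form are hypothetical. Following the text, the search stops as soon as at least k candidates have been collected, so the result is only as exact as that stopping rule allows.

```python
import heapq
from math import dist

def knn_expand(cells, start, query, k):
    """cells: left-to-right list of regions and line data lists,
    each a list of (name, x, y) objects (assumed form).
    start: index of the cell that holds `query`."""
    lo = hi = start
    candidates = [o for o in cells[start] if o != query]
    while True:
        lo, hi = lo - 1, hi + 1               # widen by one cell on each side
        if lo >= 0:
            candidates += cells[lo]
        if hi < len(cells):
            candidates += cells[hi]
        exhausted = lo <= 0 and hi >= len(cells) - 1
        if len(candidates) >= k or exhausted:
            break
    # Sort by distance to the query and keep the k closest.
    return heapq.nsmallest(k, candidates, key=lambda o: dist(o[1:], query[1:]))
```

When the queried object lies on a parallel line, the first widening adds the two adjacent regions (case 1) and the next adds the two adjacent lines (case 2); when it lies in a region, the order is reversed.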