Spatial association discovery process using frequent subgraph mining

Spatial associations are one of the most relevant kinds of patterns used by business intelligence regarding spatial data. Due to the characteristics of this particular type of information, different approaches have been proposed for spatial association mining. This wide variety of methods has entailed the need for a process that integrates the activities for association discovery, one that is easy to implement and flexible enough to be adapted to any particular situation, and that can guide the discovery of useful patterns, particularly in small and medium-size projects. Thus, this work proposes an adaptable knowledge discovery process that uses graph theory to model different spatial relationships from multiple scenarios, and frequent subgraph mining to discover spatial associations. A proof of concept is presented using real data. This is an open access article under the CC BY-SA license.

INTRODUCTION
Spatial knowledge discovery aims to find useful and novel patterns in spatial datasets to support decision-making in a particular problem domain [1]. Among all the possible patterns to discover, spatial associations are one of the most commonly used today in multiple fields such as climatology, geography, geology, criminology and ecology, among many others. They are comprised of predicates that involve spatial objects along with spatial and non-spatial relationships between those objects [2]. Several characteristics of spatial data make this data mining task more complicated, such as the spatial dependency of data attributes, the multiplicity of spatial data representation models, the spatial relations between data objects and some particular spatial properties such as spatial autocorrelation and spatial heterogeneity [3].
Multiple algorithms have been developed for association pattern mining. In general, each of these algorithms aims to address particular concerns related to the aforementioned challenges. Selecting a proper algorithm has become an arduous activity due to the growing number of new alternatives and their variants, especially for inexperienced users. Thus, it is necessary to provide a new process for small or medium-size application domains, one that is easy to implement and flexible enough to be adapted to multiple contexts. Consequently, this paper proposes a new process for association discovery from spatial data that utilizes graph theory to model spatial objects and the relations between them, and frequent subgraph mining to find the substructures with a high repetition rate inside the general graph. These substructures correspond to association patterns. The proposal is a new alternative for modelling complex situations from a particular problem domain. It does not aim to replace or improve on the results of state-of-the-art algorithms; rather, it provides a road map to initially address a problem. The rest of this paper is arranged as follows: Section 2 describes the characteristics of spatial data; Section 3 covers association patterns and their characteristics regarding spatial data; Section 4 presents the proposed process for the discovery of spatial associations; Section 5 shows a proof of concept using real-world data; lastly, Section 6 contains conclusions and future work.

TELKOMNIKA Telecommun Comput El Control, ISSN: 1693-6930. Journal homepage: http://journal.uad.ac.id/index.php/TELKOMNIKA

SPATIAL DATA
Spatial data is a particular type of dependent data. Formally, a spatial database D is a set of spatial records in which each $S_i^k$ is a spatial attribute that stores values about the spatial context, and each $X_i^l$ is a non-spatial attribute with values measured at particular locations [3,4]. The non-spatial attributes may be numerical or categorical according to the problem domain, and the spatial attributes may be specified as coordinates or places (e.g. city name or state code). Additionally, there are three basic types of spatial objects: points, used to model specific punctual locations in the space; lines, used to model linear extensions such as rivers or roads; and polygons, used to represent objects that have a two-dimensional extension in the space, such as regions or states.
The dependence of non-spatial attributes on spatial ones means that different implicit spatial relations can be extracted from data. Let D be a spatial database. A relation $R \subseteq D^2$ is called spatial if and only if it is defined through a binary predicate $P(x, y)$, with $x, y \in D$, that involves the spatial attributes of the spatial objects x and y. For example, the spatial relation $N \subseteq D^2$, with $x, y \in D$, defined by the predicate shown in (2), is the neighborhood relation between two spatial points using Euclidean distance:

$$x \, N \, y \iff \mathrm{Dist}(x, y) < \lambda, \quad \lambda \in \mathbb{R}^+ \qquad (2)$$

These relations can be classified as geometric, if they are related to the principles of Euclidean geometry (e.g. neighbouring relationships); directional, when they refer to relative spatial orientations (e.g. above, below, north, east); topological, if they are independent from the concepts of distance and direction and are not affected by spatial transformations such as rotation or translation (e.g. intersect, inside); or hybrid, if they are related to two or more of the aforementioned types of properties. These relationships can be calculated using different methods depending on the problem domain and the class of spatial data used: points, lines or polygons [5,6].
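As an illustration of the neighborhood relation in (2), the following minimal Python sketch computes N for a small set of planar points. The function names `dist` and `neighborhood_relation`, the sample coordinates and the value of λ are assumptions made for this example only. The sketch also leaves out the trivial pairs (x, x), which the predicate in (2) would formally include.

```python
import math

def dist(p, q):
    """Euclidean distance between two planar points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def neighborhood_relation(points, lam):
    """Return the pairs of indices (i, j), i != j, with Dist(p_i, p_j) < lam.

    Assumption for this sketch: reflexive pairs (i, i) are omitted.
    """
    return {
        (i, j)
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and dist(p, q) < lam
    }

# Hypothetical sample data: two nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
N = neighborhood_relation(points, lam=1.5)
```

Because the distance is symmetric, the resulting relation is symmetric as well: whenever (i, j) is in N, so is (j, i).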
On the other hand, two properties are derived from spatial dependence: spatial autocorrelation, i.e., observations of spatially distributed random variables are not location-independent, and spatial heterogeneity, i.e., patterns found in some region of the space may not have the same support in another region. Spatial autocorrelation refers to the fact that spatial data are not distributed independently throughout the space. The distribution depends on the characteristics of the data points, the characteristics of the underlying space or the spatial neighboring relationships. For example, churches tend to be located near public squares, and animals tend to travel to locations that contain their food sources [7]. Spatial heterogeneity is related to spatial autocorrelation. This phenomenon describes the local nature of spatial patterns, which are subordinated to some specific locations. Thus, a spatial pattern, such as an association rule, may have a high support value in one region and a low support value in a different one. This phenomenon is also known as Simpson's paradox [8]. All these particular characteristics make knowledge extraction from spatial data a complex activity, which has to consider not only patterns between data records but also the implicit relationships between spatial objects.

SPATIAL ASSOCIATIONS
One of the most common patterns to find in data is the association pattern. An association pattern P is defined as an n-ary predicate $P = (p_1, p_2, \dots, p_n)$ with a high probability of occurrence in the dataset. Its classic application is supermarket basket analysis, which seeks to discover whether there is some correlation between items that are bought together. An association pattern is referred to as spatial if at least one of its atomic predicates $p_k$ involves a spatial relationship between its variables [2]. For example, in a city C, churches and public squares tend to be neighbors. In this example, Inside(X, C), Inside(Y, C) and Neighbors(X, Y) are spatial predicates related to topological and geometric relationships. Many different relations must be taken into consideration at the same time to find useful spatial associations. Also, these relations must be calculated in local contexts, due to the aforementioned Simpson's paradox.

Multiple efforts have been made to find spatial association patterns in spatial databases: [7] proposes a method for spatial association mining that considers spatial autocorrelation by using a cell structure; [9] focuses on the problem of rule extraction from spatial data with crisp condition attributes and fuzzy decisions, using a rough-fuzzy set based rule extraction model to deal with both fuzziness and roughness; [10] combines and extends techniques developed in both spatial and fuzzy data mining to deal with the uncertainty found in typical spatial data, using fuzzy logic to get relevant information from transition areas between spatial neighborhoods for spatial association mining and for modelling spatial relationships; [11,12] propose an algorithm for local pattern discovery considering spatial heterogeneity that incorporates a novel spatial metric for support evaluation based on event density in a particular area; [13] presents a specially designed algorithm to discover spatial associations related to El Niño Southern Oscillation (ENSO); [14] applies an algorithm that explores multiple spatial object hierarchies; [15] uses A-Priori-based approaches to find spatial association rules; [6,16] propose using Inductive Logic Programming (ILP) to achieve this data mining purpose by modelling and extracting high-support spatial relations from spatial data; [17] worked with metaheuristics such as genetic algorithms and evolutionary programming; [18] suggested a data-transformation approach before using traditional association rule mining algorithms; [19] introduced non-trivial structures such as graphs for spatial relationship representation; among others.
Because of this variety of spatial data mining approaches for association discovery, it is difficult to select a proper algorithm or method for small knowledge discovery application contexts. Therefore, a unified and general process is required to deal with the aforementioned problems, and it has to be easy to implement and flexible enough to be adapted to multiple particular situations.

SPATIAL ASSOCIATION DISCOVERY PROCESS
This work describes a new process for spatial association extraction that considers the possibility of having multiple relationships between spatial objects of any kind (i.e. points, lines, polygons), as well as spatial autocorrelation and spatial heterogeneity. This process is designed as a first approach to obtaining spatial association knowledge from data in particular contexts, and to be easy to implement in small or medium-size projects. The process is shown in Figure 1.

Data preparation
The proposed process starts with a spatial data preparation step. It is necessary to codify the various spatial datasets, obtained from different sources in different formats, in order to enable the extraction of relations between all the data instances in later steps. In general terms, it is not uncommon to have multiple spatial object layers, each of them with a particular representation type and related to a particular scenario from the problem domain. In addition, two types of datasets must be considered: target datasets, with objects directly related to the problem domain that are going to be present in every association pattern, and relevant datasets, which may or may not be related to the target datasets but add important information that may be useful for decision making [20]. These data must be prepared by cleaning errors, resolving inconsistent and null values, and dealing with outliers. New attributes or even new data objects could be generated using the input data. This step requires considerable effort and may require many iterations. Thus, it is advisable to implement the process using a proper methodology such as CRISP-DM [21].
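The preparation step can be sketched for the simplest case of point layers. The following Python example is only a minimal illustration under stated assumptions: the input format (a dictionary from object type to a list of latitude and longitude pairs), the function name `prepare_points` and the cleaning rules (dropping null, out-of-range and duplicated coordinates) are hypothetical choices, not part of the proposed process itself.

```python
def prepare_points(layers):
    """Merge several point layers into one cleaned list of records.

    `layers` maps an object type (e.g. "church") to a list of
    (latitude, longitude) pairs; a coordinate may be None when missing.
    """
    seen = set()
    records = []
    for object_type, coords in layers.items():
        for lat, lon in coords:
            if lat is None or lon is None:
                continue  # drop records with null coordinates
            if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
                continue  # drop coordinates outside the valid range
            key = (object_type, round(lat, 6), round(lon, 6))
            if key in seen:
                continue  # drop exact duplicates within a layer
            seen.add(key)
            records.append({"type": object_type, "lat": lat, "lon": lon})
    return records

# Hypothetical input with one duplicate, one null and one invalid value.
layers = {
    "church": [(-34.60, -58.40), (-34.60, -58.40), (None, -58.40)],
    "museum": [(-34.61, -58.38), (95.0, 0.0)],
}
records = prepare_points(layers)
```

Each resulting record keeps the spatial attributes (latitude, longitude) next to a single non-spatial attribute (the object type), which is the shape the later steps of this sketch series assume.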

Neighborhood definition
As mentioned before, a particular spatial association pattern may have a higher occurrence probability in some regions and lower probability in others [8]. For this reason it is preferred to search for this kind of pattern locally. For this, we propose defining partitions of the dataset, called neighborhoods in this context, and the subsequent execution of the association pattern search algorithm on each of them.
These neighborhoods can be defined beforehand using knowledge from the problem domain, or by using spatial clustering techniques. Using density-based or distance-based spatial clustering algorithms [22][23][24] is suggested due to the First Law of Geography, which states that spatial objects located close together are more closely related than those that are far away from each other [25,26]. Nonetheless, there is an issue to consider in this step: the boundaries between neighborhoods may add important information for spatial association mining. Thus, the use of fuzzy clustering techniques or flexible boundary models may be desirable.
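To make the idea of density-based neighborhoods concrete, the sketch below implements a plain DBSCAN on planar points in pure Python. It is a simplified stand-in chosen for brevity, not the specific algorithms cited above; the names `region_query` and `dbscan`, the parameters `eps` and `min_pts`, and the sample points are assumptions for this example.

```python
import math

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (idx included)."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # i is a core point: start a new neighborhood
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: expand
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Each non-negative label identifies one neighborhood on which the later steps would be run separately. Points labeled -1 belong to no neighborhood, which is one place where the fuzzy or flexible boundary models mentioned above could be considered instead.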

Modelling of spatial relationships using graphs
Next, the spatial relations between the target data instances and the instances of the relevant dataset must be calculated for each neighborhood. Depending on the problem domain, different types of spatial relations can be calculated: geometric, topological, directional or hybrid relationships, as mentioned above [6]. This step might have a high computational cost.
Graph theory is proposed to model the spatial relationships due to its close relation with first order logic and the pattern to find [16]. Graphs are discrete structures consisting of vertices and edges that connect these vertices. There are different kinds of graphs, depending on whether edges have directions (digraphs), whether multiple edges can connect the same pair of vertices (multigraphs), and whether loops are allowed.
Formally, a simple graph G = (V, E) consists of V, a nonempty set of vertices (or nodes), and E, a set of edges. Each edge has two vertices associated with it, called its endpoints. An edge is said to connect its endpoints. To relate each edge to its endpoints, a function φ : E → {{v1, v2} | v1, v2 ∈ V}, called the incidence function, is used. A multigraph, on the other hand, is a graph where multiple edges can be associated with the same endpoints. Additionally, each vertex and each edge can be labeled with data related to the represented object. This structure can be adapted to multiple scenarios, and multiple efficient algorithms can be used to extract valuable information such as maximum cliques [27].
In the context of this work, multigraphs are used to model spatial objects as vertices and the relations between them as edges. A small example can be seen in Figure 2 (a). Two sets of labels and two extra relations to assign those labels to the vertices and edges are needed. So, let G = (V, E, L, K, φ, l, k) be a multigraph without loops where: V is the vertex set of G, which corresponds to the spatial objects from the datasets; E is the edge set, which corresponds to the calculated relationships between the spatial objects; L is the vertex label set, with the characteristics of the spatial data objects; K is the edge label set, with the characteristics of each spatial relation; φ : E → {x ∈ P(V) : |x| = 2} is the incidence function; and l ⊆ V × L and k ⊆ E × K are labeling relations.
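The labeled multigraph described above can be sketched as a small Python data structure. This is an illustrative simplification under stated assumptions: the class name `SpatialMultigraph` and its methods are hypothetical, and each vertex and edge carries exactly one label, whereas the labeling relations l and k in the definition allow several.

```python
from dataclasses import dataclass, field

@dataclass
class SpatialMultigraph:
    """Labeled multigraph without loops, following G = (V, E, L, K, phi, l, k).

    Simplification for this sketch: one label per vertex and per edge.
    """
    vertex_labels: dict = field(default_factory=dict)  # l: vertex -> label in L
    endpoints: dict = field(default_factory=dict)      # phi: edge id -> 2-vertex set
    edge_labels: dict = field(default_factory=dict)    # k: edge id -> label in K

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        """Add a labeled edge; parallel edges are allowed, loops are not."""
        if u == v:
            raise ValueError("loops are not allowed")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be existing vertices")
        edge_id = len(self.endpoints)
        self.endpoints[edge_id] = frozenset((u, v))
        self.edge_labels[edge_id] = label
        return edge_id

    def edges_between(self, u, v):
        """Ids of all edges whose endpoints are exactly {u, v}."""
        target = frozenset((u, v))
        return [e for e, ends in self.endpoints.items() if ends == target]

# Hypothetical example: two relations between the same pair of objects.
g = SpatialMultigraph()
g.add_vertex("c1", "church")
g.add_vertex("s1", "public square")
g.add_edge("c1", "s1", "close to")
g.add_edge("c1", "s1", "north of")
parallel = g.edges_between("c1", "s1")
```

Storing the endpoints as a two-element `frozenset` mirrors the incidence function φ and makes the edges undirected; a directional relation such as "north of" would need an ordered pair instead, which this sketch does not model.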
The aforementioned structure makes it possible to model multiple different relationships between the same endpoints, each labeled with different attributes. Also, many attributes of spatial objects can be taken into consideration. Additionally, it must be noted that loops (i.e. edges with only one endpoint) are not considered because of their lack of semantics in this context (there are no spatial relationships that involve only one spatial object). Fuzzy logic could also be a valuable tool to model the spatial relationships, if the situation requires it [10]. More information about fuzzy logic can be found in [28].

Frequent subgraph mining
To extract spatial associations with a high probability of occurrence, frequent subgraph mining is applied to each modeled graph. Given a multigraph G = (V, E, L, K, φ, l, k) like the one described in the previous section, the frequent subgraph mining problem in a single multigraph consists of finding recurring subgraphs $G_i \subset G$, or in other words, subgraphs that have multiple instances in the original graph (Figure 2 (b)). Two subgraphs are instances of the same pattern if they are isomorphic, that is, if there is a one-to-one correspondence between their vertices and edges that preserves the labels. These frequent subgraphs represent the relationships between spatial object types that take place in the space with a high occurrence probability. Multiple algorithms have been designed for frequent subgraph mining in a single big graph, calculating the relevance of a pattern in different ways. Some well-known examples are IncGM+, FSSG and SUBDUE, among others [29,30]. A set of frequent subgraphs for each neighborhood is obtained as a result of this step and must be analyzed to obtain useful knowledge for decision-making.
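The smallest non-trivial case of this problem, patterns made of a single labeled edge, can be sketched in a few lines of Python. This is a deliberately naive illustration and not an implementation of IncGM+, FSSG or SUBDUE: it only counts one-edge patterns, uses the raw number of instances as the support measure, and all names and sample data are assumptions for this example.

```python
from collections import Counter

def frequent_edge_patterns(vertex_labels, edges, min_support):
    """Count one-edge patterns (label, edge label, label) in a single graph.

    vertex_labels: dict mapping vertex -> label
    edges: list of (u, v, edge_label) tuples, treated as undirected
    Returns the patterns that occur at least min_support times.
    """
    counts = Counter()
    for u, v, edge_label in edges:
        # Sort the endpoint labels so (a, b) and (b, a) are the same pattern.
        a, b = sorted((vertex_labels[u], vertex_labels[v]))
        counts[(a, edge_label, b)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

# Hypothetical graph for one neighborhood.
vertex_labels = {1: "post office", 2: "nightclub", 3: "post office",
                 4: "nightclub", 5: "church"}
edges = [(1, 2, "close to"), (3, 4, "close to"), (4, 5, "close to")]
frequent = frequent_edge_patterns(vertex_labels, edges, min_support=2)
```

Larger patterns require growing these one-edge candidates and testing label-preserving isomorphism between instances, which is where the dedicated algorithms cited above become necessary.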

Evaluation of results
In the final step, the frequent subgraphs are translated into n-ary predicates, and those that represent trivial information (non-novel patterns) must be filtered out. Support and confidence measures can be computed, selecting the metrics that the decision-maker considers most appropriate. This activity could be performed automatically, or manually by an analyst with knowledge about the problem domain and with help from an expert.
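A minimal sketch of this translation and filtering is shown below in Python. The representation of a pattern (a dictionary of variable indices to vertex labels plus a list of labeled edges), the function names `to_predicate` and `filter_novel`, and the idea of filtering against a set of already known predicates are assumptions made for illustration; in practice the filtering criterion depends on the analyst and the problem domain.

```python
def to_predicate(vertex_labels, edges):
    """Translate a pattern subgraph into a conjunction of atomic predicates.

    vertex_labels: dict mapping variable index -> vertex label
    edges: list of (i, j, edge_label) tuples between variable indices
    """
    atoms = [f"{vertex_labels[i].capitalize()}(x{i})"
             for i in sorted(vertex_labels)]
    atoms += [f"{label.capitalize()}(x{i}, x{j})" for i, j, label in edges]
    return " ∧ ".join(atoms)

def filter_novel(predicates, known):
    """Keep only the predicates that are not already known (trivial) ones."""
    return [p for p in predicates if p not in known]

# Hypothetical one-edge pattern and a hypothetical list of known patterns.
predicate = to_predicate({1: "post office", 2: "nightclub"},
                         [(1, 2, "close to")])
known = {"Church(x1) ∧ Public square(x2) ∧ Close to(x1, x2)"}
novel = filter_novel([predicate, *known], known)
```

Comparing predicate strings is the crudest possible novelty test; it only works if variables are numbered consistently, which is why an analyst review is still suggested.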

PROOF OF CONCEPT
The proof of concept presented in this section is intended to show how the proposed process works when implemented with different programming and data mining tools. The data used in this example consists of 10 data files containing the location of facilities in Buenos Aires (Argentina) and its surroundings. These facilities include libraries (74), clinics (63), post offices (55), sports halls (50), nightclubs (41), schools (107), gas stations (97), churches (125), museums (37) and police stations (93).
In the preparation step of the proposed process, the data files were integrated into a single data file of spatial points using QGIS (http://qgis.org/). Each spatial point is comprised of two spatial attributes, Latitude and Longitude, and one non-spatial attribute, the type of building from the previous list. After that, the points located outside the Buenos Aires limits were filtered out to reduce the search space, leaving 742 spatial points (Figure 3 (a), orange). Then, in the neighborhood definition step, the HDBSCAN clustering algorithm [31] from the 'dbscan' package for the R programming language was used on the spatial data attributes to generate two neighborhoods, with a minimum number of points equal to 50 in each of them (Figure 3 (a), blue). Only two neighborhoods were used, for explanatory purposes.
In the next step, for each of the generated neighborhoods, a geometric relationship between their data points was extracted, forming a graph with vertices labeled with the type of facility related to each data point and edges labeled "close to" if the adjacent points were less than 150 meters away from each other (this value was selected for illustrative purposes only). Thus, two graphs were created: one with 71 vertices and 45 edges in neighborhood 1, and another with 15 vertices and 11 edges in neighborhood 2.
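The construction of such a "close to" graph can be sketched as follows in Python. The text does not state how distances were computed, so the use of the haversine great-circle formula, the mean Earth radius constant, the function names and the sample coordinates are all assumptions of this example; only the 150-meter threshold and the "close to" label come from the description above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def build_close_to_graph(records, threshold_m=150.0):
    """Build vertex labels and "close to" edges for one neighborhood.

    records: list of dicts with keys "type", "lat" and "lon".
    Returns (vertex_labels, edges) with edges as (i, j, "close to"), i < j.
    """
    vertex_labels = {i: r["type"] for i, r in enumerate(records)}
    edges = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            d = haversine_m(records[i]["lat"], records[i]["lon"],
                            records[j]["lat"], records[j]["lon"])
            if d < threshold_m:
                edges.append((i, j, "close to"))
    return vertex_labels, edges

# Hypothetical points: the first two are about 33 m apart, the third is far.
records = [
    {"type": "post office", "lat": -34.6037, "lon": -58.3816},
    {"type": "nightclub", "lat": -34.6040, "lon": -58.3816},
    {"type": "church", "lat": -34.6200, "lon": -58.3816},
]
vertex_labels, edges = build_close_to_graph(records)
```

The pairwise loop is quadratic in the number of points, which is acceptable for neighborhoods of this size; larger datasets would likely call for a spatial index.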
To obtain the frequent subgraphs of each of the generated graphs, the SUBDUE algorithm was used via its implementation in the Subdue Graph Miner software, using the compression rate as the support measure.

The result was a subgraph, as shown in Figure 3 (b), with a compression rate of 15.5% in neighborhood 1, which was translated into the predicate $\text{Post office}(x_1) \wedge \text{Nightclub}(x_2) \wedge \text{Close to}(x_1, x_2)$, and two subgraphs in neighborhood 2, both with a compression rate of 27.2%, which were likewise translated into predicates.

Figure 3. (a) Spatial neighborhoods generated for the proof of concept using HDBSCAN algorithm; (b) Results of the proof of concept.

Discussion
The first contribution of the proposed process is the possibility of adapting it to multiple scenarios, due to its flexible underlying structure being based on graphs. Some of the aforementioned methods use flexible structures too [6,16], but their complexity increases because of the use of techniques based on Logic Programming. On the other hand, some other methods do not take into account complex patterns [19]. Furthermore, the process makes it possible to include valuable information related to the data objects and the spatial relations by using labels in the graph representation. Generally, the data structures used by other methods do not take into account complex data associated with the spatial relations between spatial data.
In relation to the above, the proposed process considers spatial phenomena such as autocorrelation and heterogeneity by using spatial neighborhoods. Some alternatives, such as [7], consider spatial autocorrelation but not spatial heterogeneity or complex data relationships. In most of the cases studied, these characteristics are present due to their relevance in data mining.
In addition, the proposed process can be implemented using existing tools such as frequent subgraph mining algorithms and clustering algorithms. Some of the state-of-the-art alternatives include very flexible and powerful strategies, but they are hard to implement, making them unsuitable for small or medium-size projects [6,9,16,19]. Lastly, the high adaptability of the procedure is a desirable characteristic, since many algorithms can be selected for the implementation of each step. Usually, the state-of-the-art methods propose a single alternative for their execution.

CONCLUSION
This work describes a knowledge discovery process for the extraction of spatial associations. The process is flexible enough to take into account multiple and varied spatial relationships between spatial objects of any kind, using a graph structure to model them. Heterogeneity and autocorrelation phenomena are also considered, by defining neighborhoods where the search process is performed to find this class of regularity. The solution was designed as an initial approach to this data mining task that does not require worrying too much about the particular characteristics of data mining algorithms. In a large-scale project, this process could guide the selection of specific methods based on the results obtained in the first iterations of an incremental methodology. A proof of concept is presented as well, using real data to illustrate how the process is implemented using different programming and data mining tools in each of the proposed steps. Future work will focus on implementation strategies for each of the steps of the process according to the problem domain, in order to decrease execution time when dealing with large amounts of spatial objects and spatial relationships. Also, fuzzy methods will be considered for relation modelling and neighborhood definition.