Mining for Spatial Patterns Shashi Shekhar Department of Computer Science University of Minnesota
Download ReportTranscript Mining for Spatial Patterns Shashi Shekhar Department of Computer Science University of Minnesota
Mining for Spatial Patterns
Shashi Shekhar Department of Computer Science University of Minnesota http://www.cs.umn.edu/~shekhar
Collaborators:
V. Kumar, G. Karypis, C.T. Lu, W. Wu, Y. Huang, V. Raju, P. Zhang, P. Tan, M. Steinbach This work was partially funded by NASA and Army High Performance Computing Center Shashi Shekhar Mining For Spatial Patterns 1
Spatial Data Mining(SDM) - Examples
Historical Examples: London Asiatic Cholera 1854 (Griffith) Dental health and fluoride in water, Colorado early 1900s Current Examples: Cancer clusters (CDC), Spread of disease (e.g. Nile virus) Crime hotspots (NIJ CML, police petrol planning) Environmental justice (EPA), fair lending practices Upcoming Applications: Location aware services Defense: Sensor networks, Mobile ad-hoc networks Civilian: Mortgage PMI determination based on location Shashi Shekhar Mining For Spatial Patterns 2
Army Relevance of SDM
Strategic Predicting global hot spots (FORMID) Army land: endangered species vs. training and war games Search for local trends in massive simulation data Critical infra-structure defense (threat assessment) Tactical Inferring enemy tactics (e.g. flank attack) from blobology Detection of lost ammunition dumps (Dr. Radhakrishnan) Operational Interpretation of maps: map matching (locating oneself on map) • identify terrain feature, e.g. ravines, valleys, ridge, etc.
Locating enemy (e.g. sniper in a haystack, sensor networks) Avoiding friendly fire Shashi Shekhar Mining For Spatial Patterns 3
Spatial Data Mining(SDM) - Definition
Search of implicit, interesting patterns in geo-spatial data Ex. Reconnaissance, Vector maps(NIMA, TEC), GPS, Sensor networks Data Mining vs. Statistics: Primary vs. Secondary analysis Global vs. local trends Spatial Data Mining vs. Data Mining: Spatial Autocorrelation Continuous vs. Discrete data types Shashi Shekhar Mining For Spatial Patterns 4
Background
Spatial Data Mining
Spatial statistics in Geology, Regional Economics NSF workshop on GIS and DM (3/99) NSF workshop on spatial data analysis (5/02)
Spatial patterns:
Spatial outliers Location prediction Associations, colocations Hotspots, Clustering, trends, … Shashi Shekhar Mining For Spatial Patterns 5
Framework
2 Approaches to mining Spatial Data
1. Pick spatial features; use classical DM methods 2. Use novel data mining techniques
Our Approach:
Define the problem: capture special needs Explore data using maps, other visualization Try reusing classical DM methods If classical DM perform poorly, try new methods Evaluate chosen methods rigourously Performance tuning if needed Shashi Shekhar Mining For Spatial Patterns 6
Spatial Association Rule
Citation: Symp. On Spatial Databases 2001 Problem: Given a set of boolean spatial features find subsets of co-located features, e.g. (fire, drought, vegetation) Data - continuous space, partition not natural, no reference feature Classical data mining approach: association rules But, Look Ma! No Transactions!!! No support measure!
Approach: Work with continuous data without transactionizing it!
confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)] support: cardinality of spatial join of instances of fire, drought, dry veg.
participation: min. fraction of instances of a features in join result new algorithm using spatial joins and apriori_gen filters Shashi Shekhar Mining For Spatial Patterns 7
y A K M A B A B D BCE
Event Definition
Convert the time series into sequence of events at each spatial location.
t1 CM A B G x DF ABE G DL J CE F EG M t2 A B DK L BCD A B D t3 A B GL AB E CFM DEF EG K time
Grid Cell (x,y)
(1,1) (1,2) (1,3) (1,4) (2,1) (2,2) (2,3) (2,4) (3,1) (3,2) (3,3) (3,4) (4,1) (4,2) (4,3) (4,4)
t1
Æ {A, B, D} Æ {A, K, M} {B, C, E} Æ Æ {A, B} Æ {A, B, G} {C, M} Æ Æ Æ Æ Æ
t2
Æ {D, L, J} {A, B, E, G} Æ {E, G, M} {C, E, F} Æ {D, F} Æ Æ Æ Æ Æ {D, K, L} Æ {A, B}
t3
Æ Æ {B, C, D} Æ {C, F, M} {A, B, G, L} Æ {A, B, D} Æ {A, B, E} Æ Æ Æ Æ {E, G, K} {D, E, F} Shashi Shekhar Mining For Spatial Patterns 8
Interesting Association Patterns
Use domain knowledge to eliminate uninteresting patterns.
A pattern is less interesting if it occurs at random locations.
Approach: Partition the land area into distinct groups (e.g., based on land cover type).
For each pattern, find the regions for which the pattern can be applied.
If the pattern occurs mostly in a certain group of land areas, then it is potentially interesting.
If the pattern occurs frequently in all groups of land areas, then it is less interesting.
Shashi Shekhar Mining For Spatial Patterns 9
Association Rules
Intra-zone non-sequential Patterns FPAR-Hi NPP-Hi (support 10) Shrubland regions • Region corresponds to semi-arid grasslands, a type of vegetation, which is able to quickly take advantage of high precipitation than forests.
• Hypothesis: FPAR-Hi events could be related to unusual precipitation conditions.
Shashi Shekhar Mining For Spatial Patterns 10
Co-location
Can you find co-location patterns from the following sample dataset?
Shashi Shekhar Answers: and Mining For Spatial Patterns 11
Co-location
Spatial Co-location
A set of features frequently co-located
Given
A set T of K boolean spatial feature types T={f 1 ,f 2 , … , f k } A set P of N locations P={p 1 , …, p N a spatial frame work S, p i } in P is of some spatial feature in T A neighbor relation R over locations in S
Find
T c = subsets of T frequently co-located
Objective
Correctness Completeness Efficiency
Constraints
R is symmetric and reflexive Monotonic prevalence measure Window Centric Mining For Spatial Patterns Shashi Shekhar Reference Feature Centric Event Centric 12
Co-location
Comparison with association rules
underlying space item-types collections prevalence measure conditional probability measure Association rules Co-location rules discrete sets item-types transactions continuous space events /Boolean spatial features neighborhoods support participation index Pr.[ A in T | B in T ] Pr.[ A in N(L) | B at L ]
Participation index
Participation ratio pr(f i , c) of feature f i in co-location c = {f 1 , f 2 , …, f k }: fraction of instances of f i with feature {f 1 , …, f i-1 , f i+1 , …, f k } nearby 2.Participation index = min{pr(f i , c)}
Algorithm
Hybrid Co-location Miner Shashi Shekhar Mining For Spatial Patterns 13
Dataset
Spatial Co-location Patterns
• Spatial feature A,B,C and their instances • Possible associations are (A, B), (B, C), etc.
• Neighbor relationship includes following pairs: •A1, B1 •A2, B1 •A2, B2 •B1, C1 •B2, C2 Shashi Shekhar Mining For Spatial Patterns 14
Dataset
Spatial Co-location Patterns
Partition approach [Yasuhiko, KDD 2001] •Support not well defined,i.e. not independent of execution trace •Has a fast heuristic which is hard to analyze for correctness/completeness Spatial feature A,B, C, and their instances Shashi Shekhar Support A,B =2 B,C=2 Mining For Spatial Patterns Support A,B=1 B,C=2 15
Dataset
Spatial Co-location Patterns
Spatial feature A,B, C, and their instances Reference feature approach [Han SSD 95] •C as reference feature to get transactions •Transactions: (B1) (B2) •Support (A,B) = Ǿ from Apriori algorithm •Note: Neighbor relationship includes following pairs: •A1, B1 •A2, B1 •A2, B2 •B1, C1 •B2, C2 Shashi Shekhar Mining For Spatial Patterns 16
Dataset
Spatial Co-location Patterns
Spatial feature A,B, C, and their instances Our approach (Event Centric) • Neighborhood instead of transactions • Spatial join on neighbor relationship • Support Prevalence •Participation index = min. p_ratio •P_ratio(A, (A,B)) = fraction of instance of A participating in join(A,B, neighbor) •Examples Support(A,B)=min(2/2,3/3)=1 Support(B,C)=min(2/2,2/2)=1 Shashi Shekhar Mining For Spatial Patterns 17
Dataset
Spatial Co-location Patterns
Partition approach Our approach Spatial feature A,B, C, and their instances Support A,B =2 B,C=2 Shashi Shekhar Support A,B=1 B,C=2 Mining For Spatial Patterns Support(A,B)=min(2/2,3/3)=1 Support(B,C)=min(2/2,2/2)=1 Reference feature approach C as reference feature Transactions: (B1) (B2) Support (A,B) = Ǿ 18
Spatial Outliers
Spatial Outlier: A data point that is extreme relative to it neighbors Case Study: traffic stations different from neighbors [SIGKDD 2001, JIDA 2002] Data - space-time plot, distr. Of f(x), S(x) Distribution of base attribute: spatially smooth frequency distribution over value domain: normal Classical test - Pr.[item in population] is low Q? distribution of diff.[f(x), neighborhood agg{f(x)}] Insight: this statistic is distributed normally!
Test: (z-score on the statistics) > 2 Performance - spatial join, clustering methods Shashi Shekhar Mining For Spatial Patterns 19
Spatial Outlier Detection
Given
A spatial graph G={V,E} A neighbor relationship (K neighbors) An attribute function : V -> R A comparison function
F diff
Confidence level threshold (
F
aggr f
,
F aggr
k ) -> R Statistic test function ST: R ->{T, F}
Find
O = {v i | v i V, v i is a spatial outlier}
Objective
Correctness: The attribute values of v i is extreme, compared with its neighbors Computational efficiency
Constraints
F diff
and ST are algebraic aggregate functions of and
F aggr
Computation cost dominated by I/O op.
Shashi Shekhar Mining For Spatial Patterns 20
Spatial Outlier Detection
Spatial Outlier Detection Test
1. Choice of Spatial Statistic S(x) = [f(x)–E y N(x) (f(y))] Theorem: S(x) is normally distributed if f(x) is normally distributed 2. Test for Outlier Detection | (S(x) s ) / s | >
Hypothesis
I/O cost determined by clustering efficiency
f(x) S(x)
Shashi Shekhar Mining For Spatial Patterns 21
Graphical Spatial Tests
Moran Scatter Plot Original Data
Shashi Shekhar Mining For Spatial Patterns
Variogram Cloud
22
A Unified Approach Spatial Outliers
•Tests : quantitative, graphical •Results: •Computation = spatial self-join •Tests:
algebraic
functions of join •Join predicate: neighbor relations •I/O-cost: f(clustering efficiency) •Our algorithm is I/O-efficient for Algebraic tests
Scatter Plot Original Data Our Approach
23 Shashi Shekhar Mining For Spatial Patterns
Spatial Outlier Detection
Results
1. CCAM achieves higher clustering efficiency (CE) 2. CCAM has lower I/O cost 3. High CE => low I/O cost 4. Big Page => high CE
CE value I/O cost Cell-Tree
Shashi Shekhar
CCAM
Mining For Spatial Patterns
Z-order
24
Location Prediction
Citations: IEEE Tran. on Multimedia 2002, SIAM DM Conf. 2001, SIGKDD DMKD 2000 Problem: predict nesting site in marshes given vegetation, water depth, distance to edge, etc. Data - maps of nests and attributes spatially clustered nests, spatially smooth attributes Classical method: logistic regression, decision trees, bayesian classifier but, independence assumption is violated ! Misses auto correlation !
Spatial auto-regression (SAR), Markov random field bayesian classifier Open issues: spatial accuracy vs. classification accurary Open issue: performance - SAR learning is slow!
Shashi Shekhar Mining For Spatial Patterns 25
Location Prediction
Given: 1.
Spatial Framework
S
2. Explanatory functions:
f
{
s
1 ,...
s n
:
S
}
R X k
3. A dependent class:
f C
:
S
C
{
c
1 ,...
c M
}
mappings:
R
...
R
C
Find: Classification model:
f
ˆ
c
Nest locations Objective:maximize classification_accuracy (
f
ˆ
c
,
f c
) Constraints: Spatial Autocorrelation exists Vegetation durability Shashi Shekhar Mining For Spatial Patterns Distance to open water Water depth 26
Motivation and Framework
Shashi Shekhar Mining For Spatial Patterns 27
•
Spatial AutoRegression (SAR)
Spatial Autoregression Model (SAR)
•
y =
Wy + X
+
• • • • • •
error vector Solutions
• •
W models neighborhood relationships
models strength of spatial dependencies
and
- can be estimated using ML or Bayesian stat.
e.g., spatial econometrics package uses Bayesian approach using sampling-based Markov Chain Monte Carlo (MCMC) method.
Likelihood-based estimation requires O(n 3 ) ops.
Other alternatives – divide and conquer, sparse matrix, LU decomposition, etc.
Shashi Shekhar Mining For Spatial Patterns 28
Evaluation
Linear Regression Spatial Regression
y
y
Spatial model is better
X
Wy
X
Shashi Shekhar Mining For Spatial Patterns 29
•
MRF Bayesian
Markov Random Field based Bayesian Classifiers
•
Pr(l
• •
i | X, L Pr(l i i | L ) = Pr(X|l i i , L i ) Pr(l i | L i ) / Pr (X) ) can be estimated from training data L i denotes set of labels in the neighborhood of si excluding labels at si
• •
Pr(X|l i , L i ) can be estimated using kernel functions Solutions
• • •
stochastic relaxation [Geman] Iterated conditional modes [Besag] Graph cut [Boykov]
Shashi Shekhar Mining For Spatial Patterns 30
Experiment Design
Shashi Shekhar Mining For Spatial Patterns 31
Prediction Maps(Learning)
MRF-P Prediction (ADNP=3.36) Actual Nest Sites (Real Learning) NZ=85 MRF-GMM Prediction (ADNP=5.88) NZ=138 SAR Prediction (ADNP=9.80) Shashi Shekhar NZ=140 Mining For Spatial Patterns NZ=130 32
Prediction Maps(Testing)
MRF-P Prediction (ADNP=2.84) Actual Nest Sites (Real Testing) Actual Nest Sites (Real Learning) NZ=30 MRF-GMM Prediction (ADNP=3.35) NZ=80 SAR Prediction (ADNP=8.63) Shashi Shekhar NZ=76 Mining For Spatial Patterns NZ=80 33
• • • •
Comparison (MRF-BC vs. SAR)
SAR can be rewritten as y = (QX)
•
where Q = (I-
W) -1 smoothing operation.
+ Q
which can be viewed as a spatial
•
This transformation shows that SAR is similar to linear logistic model, and thus suffers with same limitations – i.e., SAR model assumes linear separability of classes in transformed feature space SAR model also make more restrictive assumptions about the distribution of features and class shapes than MRF The relationship between SAR and MRF are analogous to the relationship between logistic regression and Bayesian classifiers.
Our experimental results shows that MRF model yields better spatial and classification accuracies than SAR predictions.
Shashi Shekhar Mining For Spatial Patterns 34
MRF vs. SAR
Confusion Matrix: Spatial Confusion Matrix:
Shashi Shekhar Mining For Spatial Patterns 35
Conclusion and Future Directions
Spatial domains may not satisfy assumptions of classical methods data: auto-correlation, continuous geographic space patterns: global vs. local, e.g. spatial outliers vs. outliers data exploration: maps and albums Open Issues patterns: hot-spots, blobology (shape), spatial trends, … metrics: spatial accuracy(predicted locations), spatial contiguity(clusters) spatio-temporal dataset scale and resolutions sentivity of patterns geo-statistical confidence measure for mined patterns Shashi Shekhar Mining For Spatial Patterns 36
Army Relevance and Collaborations
Relevance: “Maps are as important to soldiers as guns” - unknown Joint Projects: High Performance GIS for Battlefield Simulation (ARL Adelphi) Spatial Querying for Battlefield Situation Assessment (ARL Adelphi) Joint Publications: w/ G. Turner (ARL Adelphi, MD) & D. Chubb (CECOM IEWD) IEEE Computer (December 1996) IEEE Transactions on Knowledge and Data Eng. (July-Aug. 1998) Three conference papers Visits, Other Collaborations GIS group, Waterways Experimentation Station (Army) Concept Analysis Agency, Topographic Eng. Center, ARL, Adelphi Workshop on Battlefield Visualization and Real Time GIS (4/2000) 37
7.
8.
9.
1.
2.
3.
4.
5.
6.
Reference
S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Liu, “
Spatial Databases: Accomplishments and Research Needs”, IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.
S. Shekhar and Y. Huang,
“Discovering Spatial Co-location Patterns: a Summary of Results”, In Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.
S. Shekhar, C.T. Lu, P. Zhang,
"Detecting Graph-based Spatial Outliers: Algorithms and Applications“, the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
S. Shekhar, C.T. Lu, P. Zhang,
“Detecting Graph-based Saptial Outlier”, Intelligent Data Analysis, To appear in Vol. 6(3), 2002
S. Shekhar, S. Chawla,
the book “Spatial Database: Concepts, Implementation and Trends”, Prentice Hall, 2002
S. Chawla, S. Shekhar, W. Wu and U. Ozesmi
, “Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations”, Proc. Int. Confi. on 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
S. Chawla, S. Shekhar, W. Wu and U. Ozesmi,
“Modeling Spatial Dependencies for Mining Geospatial Data”, First SIAM International Conference on Data Mining, 2001.
S. Shekhar, P.R. Schrater, R. R. Vatsavai, W. Wu, and S. Chawla,
“Spatial Contextual Classification and Prediction Models for Mining Geospatial Data”,To Appear in IEEE Transactions on Multimedia, 2002. S. Shekhar, V. Kumar, P. Tan. M. Steinbach, Y. Huang, P. Zhang, C. Potter, S. Klooster, “Mining Patterns in Earth Science Data”, IEEE Computing in Science and Engineering (Submitted)
Shashi Shekhar Mining For Spatial Patterns 38
10.
11.
12.
13.
Reference
S. Shekhar,
C.T. Lu, P. Zhang, “A Unified Approach to Spatial Outliers Detection”, IEEE Transactions on Knowledge and Data Engineering (Submitted) S. Shekhar, C.T. Lu, X. Tan, S. Chawla, Map Cube: A Visualization Tool for Spatial Data Warehouses, as Chapter of Geographic Data Mining and Knowledge Discovery. Harvey J. Miller and Jiawei Han (eds.), Taylor and Francis, 2001, ISBN 0-415-23369-0. S. Shekhar, Y. Huang, W. Wu, C.T. Lu, What's Spatial about Spatial Data Mining: Three Case Studies , as Chapter of Book: Data Mining for Scientific and Engineering Applications. V. Kumar, R. Grossman, C. Kamath, R. Namburu (eds.), Kluwer Academic Pub., 2001, ISBN 1-4020-0033-2 Shashi Shekhar and Yan Huang , Multi-resolution Co-location Miner: a New Algorithm to Find Co-location Patterns in Spatial Datasets, Fifth Workshop on Mining Scientific Datasets (SIAM 2nd Data Mining Conference), April 2002
Shashi Shekhar Mining For Spatial Patterns 39