README.rmd

---
title: "Probabilistic Spatial Partition Model"
author: "Carl Müller-Crepon"
output:
   md_document:
      variant: gfm

---


# pspm: Probabilistic Spatial Partition Model

## Introduction

This packages contains a series of functions and classes to set up and fit a Probabilistic Spatial Partition Model and sample partitionings following the the methodology introduced by Müller-Crepon, Schvitz and Cederman (2023). In short, the model is based on a simplification of continuous geographic space as a (planar) graph. The observed partitioning is encoded on the graphs' vertices such that each vertex is a member of one and only one partition. The graphs partitioning is modeled as the result of _attractive_ and _repulsive_ force active on its edges which affect the probability that the vertices they connect belong to the same or to different partitions. The model allows to estimate the effects of edge-level attributes on the repulsion/attraction between vertices, thus estimating their impact on the overall partitioning of the graph.  

Please refer to the original publication for all technical details beyond the summary provided below. It can be found [here]() and as ungated version [here](http://www.carlmueller-crepon.org/publication/state_shape/). 

Note that our companion R-package [`SpatialLattice`](https://github.com/carl-mc/SpatialLattice) simplifies the creation and handling of spatial graphs to facilitate the analysis of spatial partitionings.  

When using the pspm package, please cite:

Müller-Crepon, Carl, Guy Schvitz, Lars-Erik Cederman (2023).
Shaping States into Nations: The Effects of Ethnic Geography on State Borders.
_American Journal of Political Science_, conditionally accepted for publication.  


## The model
We model the distribution over all possible partitionings $P$ of lattice $G$ as a Boltzmann distribution:

```math
Pr(P = p_{i}) =  {e^{-\epsilon_{i}}\over\displaystyle\sum_{i = 1}^{|\mathbb{P}|}e^{-\epsilon_{i}} }
```

here, a partitioning $i$'s chance of realization decreases with its energy $\epsilon_i$. The energy of a partitioning is defined as the sum of energies on the graph's edges that run between nodes $j$ and $k$ where $j$ and $k$ are located in the same partition ($s_{j,k} = 1$). Edges energies are not realized where $j$ and $k$ do not belong to the same partition ($s_{j,k} = 0$):

```math
\epsilon_{i} = \displaystyle\sum_{j,k \in L}\, \epsilon_{j,k}\,s_{j,k}
```

Potential and realized edge-level energies are in turn affected by a constant ($\beta_0$) and a set of edge-level attributes $\textbf{x}_{j,k}$ that are weighted by a parameter vector $\beta$

```math
\epsilon_{j,k} = \beta_0 + \beta\, \textbf{x}_{j,k}
```

The empirical goal of the PSPM is it to estimate parameters $\beta_0$ and $\beta$ from observed data -- a positive $\beta$ indicates that an attribute exerts a _repulsive_ force, whereas a negative one indicates an _attractive_ force. This is due to the fact that an overal minimization of a partitings' energy maximizes its likelihood of realization. To find the parameters that maximize this likelihood, the PSPM uses a maximum composite likelihood approach. Uncertainty estimates can be derived via a parametric bootstrap that samples partitionings with a given set of $\beta_0$ and $\beta$ parameters and then fits the model on the sampled partitionings. 


## Installation

You can directly download and install the pspm package from GitHub. Before doing so, please make sure that you have [Python3](https://www.python.org/downloads/) installed, which the package heavily relies on. For simplicity and as shown below, you can also install python and necessary modules from within R, using \code{reticulate}'s \code{install_miniconda()} and \code{py_install()} functions. 

Upon installation and first usage, the package should automatically install all necessary python dependencies via the [reticulate](https://cran.r-project.org/web/packages/reticulate/index.html) R-package. If this is not the case, the user may see an error and have install the following modules manually: scipy, networkx, numpy, abc, typing, collections. 

```{r, eval = F}
# # Get python ready if not yet installed
# library(reticulate)
# reticulate::install_miniconda()
# reticulate::py_install(c("scipy","networkx","abc","numpy","typing","collections"))

# Download pspm package
library(devtools)
install_github(repo = "carl-mc/pspm")
```


## Getting started

the pspm package heavily builds on the igraph library which is used to handle the underlying network data and allows user-friendly data manipulation. Since random sampling plays an important role during sampling, it is recommended to set a random seed in R and python. `pspm_set_seed()` achieves just that. 

```{r, eval = T, warning=FALSE, message=F}
# # If multiple python versions installed, point reticulate to python 3.x
# your_python_path <- '/usr/bin/python3' ## customize to your machine!
# Sys.setenv(RETICULATE_PYTHON = your_python_path)

# Packages
library(pspm)
library(igraph)

# Set seeds in R and python
pspm_set_seed(1)

```


## Setting up a toy lattice

To illustrate the model. we first set up a toy lattice -- as a `PSPM` object -- with 100 nodes arranges on a 10-by-10 grid. The lattice encodes two edge-level characteristics that separate one side of the lattice from the other -- think of a river or mountain range. The distribution of one of these attributes is shown as dotted vs. straight edges on the graph below. Both attributes exert _repulsive_ forces (`beta = c(2,1)`). Furthermore, the vertices are attracted to each other by a baseline constant of -2. We sample a partitioning of the lattice with a burn-in period of 10 rounds.

Please see the R-package [`SpatialLattice`](https://github.com/carl-mc/SpatialLattice) for code and examples on how to construct spatial graph data for observed spatial partitionings. 

```{r, eval = T}
# Make mock PSPM Object with a sampled partitioning
sl <- generate_grid_data(N_sqrd = 10, ## 10 x 10 lattice
                        beta0 = -2, ## Negative constant = baseline attraction between nodes
                        beta = c(2,1), ## Include two repulsive edge-level predictors
                        dep_structure = "von_neumann", ## Each node connects to 4 neighbors
                        burnin = 10 ## Sample with a 10 burn-in periods
                        )

# Print class of object
print(class(sl))

# Plot PSPM object
sl$plot_partitioning(edge_predictor = 1, 
                     edge.width = 5, vertex.size = 10,
                     main = "A Toy Lattice")
```


## PSPM to igraph conversion

The `PSPM` object created above can be transformed into an `igraph` object and vice-versa using the following methods:

```{r, eval = T}
# Transform PSPM to igraph
graph <- PSPM2igraph(sl)

# Transfers data as edge- and vertex-attributes
edge_attr_names(graph)
vertex_attr_names(graph)

# Transform igraph back to PSPM
sl.from.g <- igraph2PSPM(g = graph, 
                         outcome_name = "Y",
                         edge_pred_names = c("x1", "x2"))

```


## Fitting a PSPM Model

Finally, we can fit the PSPM to estimate the effect of the edge-level attributes on the partitioning of the lattice. The function `fit_pspm_model()` provides a wrapper around lower-level functions and classes, allowing the user to estimate the model directly from an `igraph` object with a formula where the left-handside variable Y refers to the vertex attribute that encodes vertices partition members and the right-handside variables referring to edge-level predictors. Multiple graphs can be passed to the function at the same time such that one model is fit across them. Once a model is fit, `bootstrap_pspm()` allows for carrying out a (parallelized) parametric bootstrap which can return the full distribution of estimates as well as (basic and percentile-based) confidence intervals. 

```{r, eval = T}
## The easy way

### Estimate
m.simple <- fit_pspm_model(formula = Y ~ x1 + x2, 
                           g_ls = list(graph),
                           return_pspm = TRUE)

### Bootstrap CIs
bs.simple <- bootstrap_pspm(m.simple,
                            n_boot_iter = 10, ## Should be > 100
                            burnin = 10, ## Could be higher, depending on complexite of graph and model
                            cl = 10L, ## Number of CPUs for parallelization
                            return_sims = TRUE, ## Return full distribution of estimates
                            ci_level = .95)

### Summary
summary(m.simple)
print(bs.simple$ci_mat)
plot(density(bs.simple$beta_boot[,"x1"]),
     main = "Distribution of bootstrapped estimates of x1")

```


For completeness, the following code shows the same estimation process using the lower-level classes and functions in the PSPM package. We first have to transform our `PSPM` object into a `PSPMLearn` object, which include the `fit_composite_log_likelihood()` method. Once the `PSPMLearn` object is fitted, it can be directly bootstrapped using the appropriate method. Note that this method does not automatically handle missing data and other intricacies, such as keeping track of variable names.  

```{r, eval = T}
## The complicated way (inside the wrapper)

### Initiate PSPMLearn Object
learn_obj <- PSPMLearn$new(list(sl.from.g))

### Fit
m.compl <- learn_obj$fit_composite_log_likelihood(beta_init = c(0,0,0))

### Bootstrap
bs.compl <- learn_obj$par_bootstrap_composite_log_likelihood(n_boot_iter = 10, burnin = 10,
                                                 cl = 10L, return_sims = FALSE, ci_level = .95)

### Summary
summary(m.compl)
print(bs.compl)
```


## Producing nice tables

Finally, models fitted with  `fit_pspm_model()` and the (`maxLik`)[https://cran.r-project.org/web/packages/maxLik/index.html] methods used under the hood can be printed as text, html, or latex files using a slight extension of the (texreg)[https://cran.r-project.org/web/packages/texreg/texreg.pdf] package. 


```{r, eval = T}
# Integration with texreg
pspm2table(list(m.simple),
           bootci = list(bs.simple$ci_mat), boottype = "percentile",
           type = "text", add.stats = c("Edges" = "N_edges", "Vertices" = "N"))


```


## Feedback, comments, questions
We are very grateful for any bug reports, feedback, questions, or contributions to this package. Please report any issues here or write to c.a.muller-crepon [at] lse.ac.uk .