# Scipy Stats

I recently was the TA for a Monte Carlo methods course which involved sampling many statistical distributions. Often students would write standalone functions to draw samples which was prone to error. While knowing how to code up an exotic distribution can be useful, I encourage students not to re-invent the wheel every assignment. The `scipy.stats`

package contains many probability distributions and useful functions when doing statistics.

```
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

## Normal Distribution

We will start by exploring the Normal distribution.

All statistical distribution objects in `scipy.stats`

package contain useful methods. The `norm`

class has:

`pdf(x)`

to calculate the Probability density function at x.`cdf(x)`

to calculate the Cumulative distribution function at x.`interval(alpha)`

to calculate the endpoints of the range that contains alpha percent of the distribution`rvs(size)`

to generate random variates of arbitrary size.`fit(data)`

to estimate the parameters for generic data.

and many more

The `stats.norm`

class by default creates a standard normal distribution with mean 0 and variance of 1.

```
dist = stats.norm()
print('Distribution Mean: \t{0:.2f}'.format(dist.mean()))
print('Distribution Varience: \t{0:.2f}'.format(dist.var()))
```

```
Distribution Mean: 0.00
Distribution Varience: 1.00
```

### Visualization

Let’s plot the PDF and CDF to get a better understanding

```
fig, axs = plt.subplots(2,1, figsize=(5,6))
x = np.linspace(-5, 5, 200)
axs[0].plot(x, dist.pdf(x))
axs[0].set(title='Probability Density Function',
xlabel='$x$',
ylabel='$\mathcal{N}(x)$')
axs[1].plot(x, dist.cdf(x))
axs[1].set(title='Cumulative Distribution Function',
xlabel='$x$',
ylabel='$\Phi(x)$')
plt.tight_layout()
```

### Location and Shape Parameters

We can pass to the distribution object a parameter `loc`

which controls the mean and `scale`

which controls the standard deviation. Lets create a variety of distributions by varying these parameters

```
parameters =pd.DataFrame([
[0, 0.2],
[0, 1.0],
[0, 5.0],
[-2, 0.5]], columns=('Mean', 'Varience'))
parameters
```

Mean | Varience | |
---|---|---|

0 | 0 | 0.2 |

1 | 0 | 1.0 |

2 | 0 | 5.0 |

3 | -2 | 0.5 |

```
fig, axs = plt.subplots(2,1, figsize=(7,9))
for _, mean, var in parameters.itertuples():
dist = stats.norm(loc=mean, scale=np.sqrt(var))
axs[0].plot(x, dist.pdf(x),
label = r'$\mu = {},\quad \sigma^2 = {}$'.format(mean, var))
axs[0].set(title='Probability Density Function',
xlabel='$x$',
ylabel='$\mathcal{N}(x)$')
axs[0].legend()
axs[1].plot(x, dist.cdf(x),
label = r'$\mu = {},\quad \sigma^2 = {}$'.format(mean, var))
axs[1].set(title='Cumulative Distribution Function',
xlabel='$x$',
ylabel='$\Phi(x)$')
axs[1].legend()
plt.tight_layout()
```

### Random Variate Samples

To sample a distribution the `rvs`

method is used. For reproducibility I will set the `random_state`

.

```
dist = stats.norm()
samples = dist.rvs(1000, random_state=0)
plt.hist(samples, bins=50);
```

**Numpy Random**

If you are only interested in random variate samples sometimes it is easier to use the `numpy.random`

module (See numpy docs) which also contains many distributions. By setting the seed we can create the same results as above

```
np.random.seed(0)
samples = np.random.normal(size=1000)
plt.hist(samples, bins=50);
```

### Parameter Fitting

Often we are interested in fitting the parameters of a distribution to data we have collected. Lets generate some data first.

```
data = stats.norm.rvs(loc=5, scale=3, size=1000, random_state=42)
```

The `fit`

method returns the shape parameters that best fit the given data

```
mu, std = stats.norm.fit(data)
print('Best fit mean: \t{}'.format(mu))
print('Best fit std: \t{}'.format(std))
```

```
Best fit mean: 5.0579961674669764
Best fit std: 2.9361786232420632
```

We can use these values to overlay the fitted distribution onto a normalised histogram of the data

```
# Plotting the histogram of the data
plt.hist(data, bins=50, density=True, alpha=.8)
# Plotting the PDF of the fitted distribuiton
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
fitted_dist = stats.norm(mu, std)
plt.plot(x, fitted_dist.pdf(x), 'k--', linewidth=2)
plt.title("Fit: $\mu$ = {:.2f}, $\sigma$ = {:.2f}".format(mu, std));
```

### Shaded Distributions

A common visualization is a shaded normal distribution to highlight the tails. Let’s take the fitted distribution from the previous example and fill in the 95% confidence regions. I made a simple `fill_dist`

function to aid in the task.

```
def fill_dist(x1, x2, dist, **kwargs):
"""Shade a distribution between x1, and x2"""
x_range = np.linspace(x1, x2)
plt.fill_between(x_range, dist.pdf(x_range), alpha=0.5, **kwargs)
```

The `interval`

function returns the bounds which include a given percentage of the data

```
lower, upper = fitted_dist.interval(.95)
plt.plot(x, fitted_dist.pdf(x), 'k')
fill_dist(lower, upper, fitted_dist, color='C0', label='Included')
fill_dist(upper, xmax, fitted_dist, color='C3', label='Excluded')
fill_dist(xmin, lower, fitted_dist, color='C3')
plt.vlines(fitted_dist.mean(), 0,
fitted_dist.pdf(fitted_dist.mean()), linestyles='dashed')
plt.title('95% Confidence Region')
plt.legend()
```

```
<matplotlib.legend.Legend at 0x11ab60eb8>
```

## Other Distributions

There are many different distributions we can choose from such as the Chi Squared distribution.

Often it can be useful to vary a shape parameter and see how it affects the distribution. For example

```
x = np.linspace(0, 8, 100)
for i in range(1, 7):
dist = stats.chi2(i)
plt.plot(x, dist.pdf(x), label=r'$k={}$'.format(i))
plt.ylim(-.01, 0.5)
plt.legend(ncol=2)
plt.title('$\chi^2_k$ for k degrees of freedom')
```

```
Text(0.5,1,'$\\chi^2_k$ for k degrees of freedom')
```

## Multivariate Normal

The final distribution I wanted to discuss was the `multivariate_normal`

distribution. The probability density function is defined as
where \(\mu\) is the mean, \(\Sigma\) the covariance matrix,
and \(k\) is the dimension of the space where \(x\) takes values.

```
mu = [2,1]
sigma = [[1.3, .3],
[0.3, .5]]
N = stats.multivariate_normal(mu, sigma)
```

I would like to create a contour plot of this distribution. First we will generate the grid of \(x\) and \(y\) coordinates using `np.meshgrid`

.

```
x = np.linspace(-1, 5, 200)
y = np.linspace(-1, 3, 200)
X, Y = np.meshgrid(x, y)
```

From the documentation it reads

The inputs can be any shape of array, as long as the last axis labels the components.

In order to calculate our pdf at every \(xy\) point we need to stack the data into an \(NxMx2\) array. This can be done with `np.dstack`

which concatenates 2D arrays along a third axis

```
samples = np.dstack((X, Y))
```

Now we can call `plt.contourf`

with the pdf as the \(Z\) component

```
plt.contourf(X, Y, N.pdf(samples))
```

```
<matplotlib.contour.QuadContourSet at 0x11ab5d470>
```

Consider the following mixture of 2D distributions,

where,

```
mu1 = [3,0]
sigma1 = [[1.5, 0],
[0, 0.5]]
N1 = stats.multivariate_normal(mu1, sigma1)
mu2 = [-1.25,2.5]
sigma2 = [[0.5, -0.6],
[-0.6, 1.0]]
N2 = stats.multivariate_normal(mu2, sigma2)
```

Now we can define a function to sample the PDF

```
def pi(sample):
"""A mixed Gaussian model."""
return 0.6*N1.pdf(sample) + 0.4*N2.pdf(sample)
```

Plotting in the same manner as before but now calling `pi`

instead.

```
x = np.linspace(-4, 6, num=100)
y = np.linspace(-2, 6, num=100)
X, Y = np.meshgrid(x, y)
plt.contourf(x, y, pi(np.dstack((X, Y))), cmap='coolwarm')
```

```
<matplotlib.contour.QuadContourSet at 0x11b10c4a8>
```

## Conclusion

Sampling and exploring a wide variety of statistical distributions is made easier with the `scipy.stats`

module. Creating your own functions each time you want to draw a sample is error prone and inefficient. I think it’s important to know what tools are available and I hope this post shines some light on one useful example.

I hope to do more `scipy`

related posts in the future.