Probability is the foundation of many of the methods and models in data science. Statistical inference methods use data to understand the underlying processes that generated it, and probability lets us study the regularities that arise in random phenomena. Hence, knowing probability and its applications is important for working effectively on data science problems.
Flipping a coin¶
Let's try to gain some intuition about probability by starting with the classic random experiment: a coin flip.
If we flip a coin there are two possible outcomes: heads or tails. This random experiment is called a
Bernoulli trial, after Jacob Bernoulli.
A Bernoulli trial is one where the possible outcomes are binary. Each outcome is called an event.
We assign probabilities to events.
With a fair coin we have a 50% chance of getting heads and a 50% chance of getting tails on each flip.
To simulate the coin flips we can use the bernoulli object from scipy.stats, where:
- p is the probability of success
- size is the number of trials
I will also be using a random seed for reproducibility.
import numpy as np
np.random.seed(23)
from scipy.stats import bernoulli
# flip the coin "size" times
ber = bernoulli.rvs(p=0.5, size=10)
ber
np.sum(ber)
Binomial Distribution¶
A sequence of independent Bernoulli trials follows the binomial distribution. So, if we want to count the number of successes, instead of using the bernoulli object and adding up the outcomes we can use the binomial distribution and the binom object, where:
- p : probability of success
- n : number of flips per trial
- size : number of times we repeat the experiment
from scipy.stats import binom
# count the number of heads for "n" flips in "size" trials
binom.rvs(n=10, p=0.5, size=5)
After conducting many random experiments, we notice that some outcomes are more likely than others; the way probability is assigned across the possible outcomes is described by a probability distribution. There are two functions that are especially useful for probability calculations: the probability mass function and the cumulative distribution function.
Probability mass function (pmf)¶
A discrete random variable has a finite number of possible outcomes. The probability mass function allows us to calculate the probability of getting a particular outcome for a discrete random variable. The binomial probability mass function gives the probability of getting $k$ heads from $n$ coin flips when each flip has probability $p$ of landing heads. $$binom.pmf(k, n, p) = \binom{n}{k}p^k(1 - p)^{n-k} $$ where $$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$ The formula multiplies the number of different ways we can get $k$ successes out of $n$ coin flips by the probability of success $p$ raised to the number of successes $k$ and by the probability of failure $1-p$ raised to the number of failures $n-k$.
# probability of "k" heads after "n" flips with a fair coin
binom.pmf(k=5, n=10, p=0.5)
We can plot this to see which outcomes are most likely. I will be using the bokeh library throughout most of the course.
from bokeh.io import curdoc
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
output_notebook()
def plot_binomial_pmf(n, p):
# create a figure object
plot = figure(x_axis_label='number of heads', y_axis_label='probability of outcome', plot_height=400, plot_width=550)
# define an array with the number of successes we could get
heads = np.arange(n+1)
# get the pmf for every success
y = [binom.pmf(k=k, n=n, p=p) for k in heads]
# plot on figure
plot.segment(heads, 0, heads, y, line_width=2)
plot.circle(heads, y, size=16)
#change axis ticks
plot.xaxis.ticker = heads
return plot
plt = plot_binomial_pmf(n=10, p=0.5)
show(plt)
Cumulative distribution function (cdf)¶
When we want to calculate the probability of getting $k$ or fewer heads from $n$ flips, we use the cumulative distribution function, which adds the probabilities of getting 0 heads out of $n$ flips, 1 head out of $n$ flips, and so on up to $k$ heads out of $n$ flips.
$$binom.cdf(k, n, p) = \binom{n}{0}p^0(1 - p)^{n} + \binom{n}{1}p(1 - p)^{n-1} + ... + \binom{n}{k}p^k(1 - p)^{n-k}$$
The binomial cumulative distribution function gives the probability of getting $k$ heads or fewer from $n$ coin flips with $p$ probability of getting heads.
# probability of 2 or fewer heads after 10 throws with a fair coin
binom.cdf(k=2, n=10, p=0.5)
We can use the complement to get the probability of more than $k$ heads from $n$ throws.
# probability of more than 2 heads after 10 throws with a fair coin
1 - binom.cdf(k=2, n=10, p=0.5)
Alternatively, we can use the function binom.sf(k, n, p) for the complement; sf stands for survival function.
# probability of more than 2 heads after 10 throws with a fair coin using binom.sf
binom.sf(k=2, n=10, p=0.5)
With the following function we can see how the cumulative probability of "k" changes with different values of "p" and "n".
def plot_binom_cdf(n, p):
# create a figure object
plot = figure(x_axis_label='k', y_axis_label='cdf', plot_height=400, plot_width=550)
# defining an array with the number of successes we could get
successes = np.arange(n+1)
# create a list of probability outcomes
probs = [binom.cdf(k=k, n=n, p=p) for k in successes]
# plot cdf
plot.line(successes, probs, line_width=4, legend=('p='+ str(p)))
plot.segment(successes, 0, successes, probs)
plot.circle(successes, probs, size=5, color='red', tags=probs)
# change axis ticks
plot.xaxis.ticker = successes
plot.legend.location = 'top_left'
return plot
plt = plot_binom_cdf(n=10, p=0.5)
show(plt)
Expected value¶
A discrete random variable has finite outcomes. For instance, the roll of a die has 6 possible outcomes. For discrete random variables, the expected value is the sum of the possible outcomes weighted by their probabilities: each outcome is multiplied by its probability, and the products are summed. $$E(X)=\sum_{i=1}^{k}x_ip_i = x_1p_1 + x_2p_2 + ... + x_kp_k$$ In simple terms, the expected value is the value around which the outcomes concentrate when we repeat the experiment many times. For our coin example we multiply the tails outcome $0$ by its probability $1-p$, then add the heads outcome $1$ multiplied by its probability $p$: $$E(X)=\sum_{i=1}^{2}x_ip_i =0\times(1-p) + 1\times p=p$$ and we get the expected value $p$.
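As a quick check of the definition (a small sketch with numpy, using the die mentioned above rather than the coin), we can compute the expected value of a fair six-sided die as the probability-weighted sum of its outcomes:
# expected value of a fair six-sided die from the definition
outcomes = np.arange(1, 7)
probs = np.repeat(1/6, 6)
np.sum(outcomes * probs)  # 3.5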
Arithmetic Mean¶
The arithmetic mean is the sum of each outcome divided by the number of samples.
$$\bar{X}=\frac{1}{n}\sum_{i=1}^{n}x_i=\frac{1}{n}(x_1+x_2+...+x_n)$$
The arithmetic mean of the outcomes converges to the expected value as we increase the number of random experiments. In scipy we will use the describe() function from scipy.stats to get the mean.
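For example (a small sketch; the notebook imports describe again just below), the mean of the 10 Bernoulli flips stored in ber earlier:
from scipy.stats import describe
# arithmetic mean of the 10 coin flips generated above
describe(ber).mean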
If we repeat the experiment and make $n$ coin flips we should get a mean near the expected value; the more flips we make, the closer the mean of all the flips gets to the expected value, as we can see from the graph below.
from bokeh.models import Slider, Label, Arrow, VeeHead, Div
from scipy.stats import describe
def plot_mean_of_flips(p, size):
#generate x values
x = np.arange(size)
# generate coin flips
results = binom.rvs(n=1, p=p, size=size)
# get the list of accumulated means
y = [describe(results[:i+1]).mean for i in x]
# creating the plot
plot = figure(y_range=(0,1), x_axis_label='Coin flips', y_axis_label='Mean of coin flips', plot_height=400, plot_width=550)
plot.line(x, y)
plot.line(x, p, color='grey', line_alpha=0.8)
label = Label(x=size*3/5, y=p, text="expected value", text_baseline="bottom", text_alpha=0.8)
plot.add_layout(label)
return plot
plt = plot_mean_of_flips(p=0.5, size=1000)
show(plt)
That is the relationship between the expected value and the mean.
Variance¶
The variance measures how concentrated or spread out from the expected value the data is.
$$Var(X)=E[(X-E(X))^2]=\sum_{i=1}^{n}p_i\times (x_i-E(X))^2$$Variance is the expected value of the squared deviation of a random variable from its expected value. In the particular case of the binomial distribution, the expected value is the product of the number of coin flips and the probability of getting heads.
For $X\sim Binomial(n,p)$ $$E(X)=n\times p$$ $$Var(X)=n\times p \times (1-p)$$
For the binomial distribution, the variance is therefore the product of the expected value and the probability of failure. We can use the binom.stats() method to get the expected value and variance of a binomial distribution.
mean, variance = binom.stats(n=10, p=0.5)
print('mean = ', mean)
print('variance = ', variance)
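As a cross-check (a small sketch, not part of the original notebook), we can recover the same values directly from the definitions by weighting each possible outcome by its pmf:
# expected value and variance of Binomial(n=10, p=0.5) from the definitions
ks = np.arange(11)
pmf = binom.pmf(ks, n=10, p=0.5)
ev = np.sum(ks * pmf)                 # n*p = 5.0
var = np.sum(pmf * (ks - ev) ** 2)    # n*p*(1-p) = 2.5
print(ev, var)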
Expected value, mean, and variance are essential to probability and statistics: they are the key measures for determining whether the data is spread out or concentrated around the expected value.
averages = []
variances = []
for i in range(0, 1500):
# 10 draws of 10 coin flips with 25% probability of heads
sample = binom.rvs(n=10, p=0.25, size=10)
# Mean and variance of the values in the sample variable
averages.append(describe(sample).mean)
variances.append(describe(sample).variance)
# Calculate the mean of the averages variable
print("Mean {}".format(describe(averages).mean))
# Calculate the mean of the variances variable
print("Variance {}".format(describe(variances).mean))
# Calculate the mean and variance
print(binom.stats(n=10, p=0.25))
Simulated data is very important for modeling random phenomena. You can always check if your data follows the intended distribution. Repeating experiments and taking means will give you insight about probability and will introduce you to statistics as you find regularities. scipy.stats offers many functions to generate simulated data and describe it. Experiment some more, always checking your simulation results against the expected results.
Calculating the probabilities of two events¶
Independent Events¶
Given two events, if they are not affected by the order in which they happen and the occurrence of one does not affect the probability of the other, then the events are independent.
Probability of A and B¶
Returning to our coin, if we flip it twice we get four possible outcomes, and we can ask what the probability of each outcome is. For independent events we use the multiplication rule: the probability of $A$ and $B$ is the probability of $A$ times the probability of $B$.
$$P(A\textrm{ and }B)=P(A)P(B)$$
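For example (a minimal sketch), the probability of getting heads on both of two independent flips of a fair coin is one of the four equally likely outcomes:
# P(heads on first flip and heads on second flip) for a fair coin
p_heads = 0.5
p_heads * p_heads  # 0.25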
We can count how many times each outcome repeats using the function find_repeats, which we import from scipy.stats. We pass the data to the function to get how many times each outcome in the data repeats.
This function returns:
- values : ndarray, the unique values from the (flattened) input that are repeated.
- counts : ndarray, the number of times the corresponding value is repeated.
from scipy.stats import find_repeats
# generate a sample of two fair coin flips
sample = binom.rvs(n=2, p=0.5, size=1000)
reps = find_repeats(sample)
reps
for num, heads in enumerate(reps.counts):
print('{0} heads: {1}'.format(str(num), str(heads)))
The relative frequency is the number of favourable trials divided by the total trials.
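Before using scipy's helper, here is a quick manual check (a sketch reusing the sample and reps variables defined above):
# relative frequency of each repeated outcome: counts divided by total trials
reps.counts / len(sample)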
To get the relative frequency of each outcome in the data, we can import the scipy.stats.relfreq function and pass it the data and numbins as parameters. In our case numbins is three, one bin for each of the three possible outcomes stored in our find_repeats result.
from scipy.stats import relfreq
relfreq(sample, numbins=len(reps.values))
This function returns:
- frequency : ndarray, binned values of relative frequency.
- lowerlimit : float, lower real limit.
- binsize : float, width of each bin.
- extrapoints : int, extra points.
Probability of A or B¶
To calculate the probability of $A$ or $B$ we need to consider whether the events are mutually
exclusive (ones that cannot happen at the same time) or not.
If two events $A$ and $B$ are mutually exclusive, the events are called disjoint events. The
probability of two disjoint events $A$ or $B$ happening is the probability of $A$ plus the probability
of $B$.
$$P(A\textrm{ or }B)=P(A)+P(B)$$
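For instance (a small sketch with the binom object from earlier), getting exactly 0 heads or exactly 10 heads in 10 fair flips are disjoint events, so their probabilities simply add:
# P(0 heads or 10 heads in 10 fair flips): disjoint events, so the probabilities add
binom.pmf(k=0, n=10, p=0.5) + binom.pmf(k=10, n=10, p=0.5)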
In the case of overlapping events, the procedure is slightly different: we add the probability of $A$ to the probability of $B$, but we need to subtract the overlap. If event $B$ overlaps with event $A$, the probability of $A$ or $B$ is the probability of $A$ plus the probability of $B$ minus the probability of $A$ and $B$: $$P(A\textrm{ or }B)=P(A)+P(B)-P(A\textrm{ and }B)$$ This also works for events that do not overlap, because then $P(A\textrm{ and }B)$ is 0. For example, we can calculate the probability of drawing a Jack or a spade from a deck of poker cards.
# Figure probabilities
P_Jack = 4/52
P_Spade = 13/52
# Joint probability: P(A and B)
P_Jack_n_Spade = P_Jack*P_Spade
# Probability of Jack or spade: P(A or B)
P_Jack_or_Spade = P_Jack + P_Spade - P_Jack_n_Spade
print(P_Jack_or_Spade)
Dependent Events¶
Two events are dependent if the outcome or occurrence of $A$ affects the outcome or occurrence of $B$ so that the probability is changed. To calculate dependent events we have to introduce conditional probability.
Conditional Probabilities¶
The conditional probability of an event $B$ in relationship to an event $A$ is the probability that event $B$ occurs given that event $A$ has already occurred. The notation for conditional probability is $P(B|A)$.
$$P(A\textrm{ and }B)=P(A)P(B|A)$$ Rearranging the formula we get:
$$P(B|A)=\frac{P(A\textrm{ and }B)}{P(A)}$$ Conditional probabilities allow us to calculate probabilities in a reduced or restricted sample space. Conditioning is just changing the reference against which we calculate probabilities.
We can use a deck of poker cards again to calculate the probability of getting a Jack and then another Jack.
# Needed probabilities
P_first_Jack = 4/52
P_Jack_given_Jack = 3/51
# Joint probability calculation
P_two_Jacks = P_first_Jack*P_Jack_given_Jack
print(P_two_Jacks)
Total probability law¶
The Law of total probability (also known as Total Probability Rule) is a fundamental rule in statistics relating to conditional and marginal probabilities. The rule states that if the probability of an event is unknown, it can be calculated using the known probabilities of several distinct events.
Suppose there are three events: $A$, $B$, and $C$. Events $B$ and $C$ are mutually exclusive and together cover the whole sample space, while event $A$ intersects with both. We do not know the probability of event $A$. However, we know the probability of event $A$ under condition $B$ and the probability of event $A$ under condition $C$.
The total probability rule states that by using the two conditional probabilities, we can find the probability of event $A$ (i.e., total probability).
Mathematically, the total probability rule can be written in the following equation:
$$P(A)=\sum_{n}P(A\textrm{ and }B_n)=\sum_{n}P(A|B_n)P(B_n)$$ where:
- $n$ : the number of events
- $B_n$ : the $n$-th distinct event
Let's consider the following example:
Of the total population of three states X, Y, and Z, 43% are from state X, 25% are from state Y, and 32% are from state Z. A poll is taken and the result is the following:
- 53% of the voters support John Doe in state X.
- 67% of the voters support John Doe in state Y.
- 32% of the voters support John Doe in state Z.
What is the total percentage of voters that support John Doe?
# Individual probabilities
P_X = 0.43
P_Y = 0.25
P_Z = 0.32
# Conditional probabilities
P_Support_g_X = 0.53
P_Support_g_Y = 0.67
P_Support_g_Z = 0.32
# Total probability calculation
P_Support = P_X * P_Support_g_X + P_Y * P_Support_g_Y + P_Z * P_Support_g_Z
print(P_Support)
Adding up the joint probabilities over non-overlapping partitions gives us the total probability law. This is important for Bayes' rule, so let's move on to that.
Bayes' rule¶
Bayes' rule allows us to calculate conditional probabilities. If we consider what we have seen previously
$$P(A\textrm{ and }B)=P(A)P(B|A)$$ $$P(B\textrm{ and }A)=P(B)P(A|B)$$ then
$$P(A)P(B|A)=P(B)P(A|B)$$ so
$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$$ This is Bayes' formula, which reads as: the probability of $A$ given $B$ is equal to the probability of $A$ times the probability of $B$ given $A$, divided by the probability of $B$.
Let's consider the following exercise:
A certain electronic part is manufactured by three different vendors named V1, V2, and V3.
Half of the parts are produced by V1, 25% by V2, and the rest by V3. The probability of a part being damaged given that it was produced by V1 is 1%, while it's 2% for V2 and 3% for V3.
If a part taken at random is damaged, what is the probability that the part was manufactured by V1?
# Individual probabilities & conditional probabilities
P_V1 = 0.50
P_V2 = 0.25
P_V3 = 0.25
P_D_g_V1 = 0.01
P_D_g_V2 = 0.02
P_D_g_V3 = 0.03
# Probability of Damaged
P_Damaged = P_V1 * P_D_g_V1 + P_V2 * P_D_g_V2 + P_V3 * P_D_g_V3
# Bayes' rule for P(V1|Damaged)
P_V1_g_D = P_V1 * P_D_g_V1 / P_Damaged
print(P_V1_g_D)
Bayes' rule allows us to calculate conditional probabilities by reducing our sample space using non-overlapping partitions.
Normal Distribution¶
The normal distribution is the most important and most widely used distribution in probability and statistics. It is also called the Gaussian distribution, after Carl Friedrich Gauss, and allows us to model many situations.
The probability density is a function that assigns the relative likelihood to each possible outcome in the sample space. In normal distributions, the probability density has a bell shape. The plot is dense and symmetric around the mean as we can see below.
from bokeh.palettes import Dark2_5 as palette
from scipy.stats import norm
def plot_normal(loc=0, stds=[1, 2]):
plot = figure(x_axis_label='x', y_axis_label='probability density of x', plot_height=400, plot_width=550)
x = np.arange(loc-10, loc+10, 0.1)
for std, color in zip(stds, palette):
y = norm.pdf(x, loc=loc, scale=std)
plot.line(x, y, color=color, line_width=3, line_alpha=0.7,
legend='standard deviation '+str(std))
return plot
plt = plot_normal(loc=0, stds=[0.5, 1, 2, 3])
show(plt)
The probability of getting a value below the mean is the same as the probability of getting a value above the mean. One important consequence of the symmetry of the probability density function is that the mean is the value with the highest probability density.
The standard deviation is a measure of how spread out the probability density is. For different standard deviations the curve concentrates more or less probability density around the mean: the lower the standard deviation, the more concentrated the probability density is around the mean.
From a statistical point of view it is interesting to know how far the data is from the mean in terms of standard deviations. In any normal distribution, 0.68 of the probability is within one standard deviation of the mean, 0.95 is within two standard deviations, and 0.997 is within three standard deviations.
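We can verify these figures with the standard normal cdf (a small sketch; norm was imported for the plot above):
# probability within 1, 2 and 3 standard deviations of the mean
for k in [1, 2, 3]:
    print(k, norm.cdf(k) - norm.cdf(-k))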
Normal probabilities¶
To calculate the probability density of a given value we use the probability density function pdf: we pass the value we want to evaluate, along with the loc parameter for the mean and the scale parameter for the standard deviation.
# probability density
pdf = norm.pdf(x=-1, loc=0, scale=2)
pdf
If we want to get the probability of a value being below $x$, we can again use the cumulative distribution function.
from bokeh.layouts import widgetbox, column, row
def plot_normal_cdf(x, loc=0, scale=1):
# create first plot
p1 = figure(x_axis_label='x', y_axis_label='probability density of x',
title='norm.cdf({}) with mean {} and standard deviation {}'.format(x, loc, scale),
plot_height=400, plot_width=550)
# generate x values
x_values = np.arange(loc-(4*scale), loc+(4*scale), 0.01)
# calculate y values for pdf
y_values = norm.pdf(x_values, loc=loc, scale=scale)
# calculate cdf of x
cdf = norm.cdf(x, loc=loc, scale=scale)
# plot the pdf
p1.line(x_values, y_values)
# create x and y values for the fill area: pdf values for all x up to the given x
xs = [e for e in x_values if e <= x]
ys = [norm.pdf(e, loc=loc, scale=scale) for e in xs]
# plot the area equivalent to the cdf of x
p1.patches([[x, *xs]], [[0, *ys]],
fill_color='blue',
legend='probability of '+str(round(cdf, 3)))
# create second plot
p2 = figure(title='cdf', x_axis_label='x', y_axis_label='probability of X less than x', plot_height=400, plot_width=550)
# create a list of probability outcomes
probs = [norm.cdf(k, loc=loc, scale=scale) for k in x_values]
# plot cdf
p2.line(x_values, probs, line_width=4)
p2.line([x, x], [0, cdf], line_color='black')
p2.line([x, loc-(4*scale)], [cdf, cdf], line_color='black')
p2.add_layout(Arrow(end=VeeHead(size=25, fill_color='red'), line_color='red',
x_start=x, y_start=0, x_end=x, y_end=cdf))
p2.add_layout(Arrow(end=VeeHead(size=25, fill_color='red'), line_color='red',
x_start=x, y_start=cdf, x_end=loc-(4*scale), y_end=cdf))
layout = row([p1, p2])
return layout
plt = plot_normal_cdf(4, loc=0, scale=2)
show(plt)
We use norm.sf when we want to calculate the survival function for our normal distribution, the complement of the cdf.
def plot_normal_sf(x, loc=0, scale=1):
# create first plot
plot = figure(x_axis_label='x', y_axis_label='probability density of x',
title='norm.sf({}) with mean {} and standard deviation {}'.format(x, loc, scale),
plot_height=400, plot_width=550)
# generate x values
x_values = np.arange(loc-(4*scale), loc+(4*scale), 0.01)
# calculate y values for pdf
y_values = norm.pdf(x_values, loc=loc, scale=scale)
# calculate sf of x
sf = norm.sf(x, loc=loc, scale=scale)
# plot the pdf
plot.line(x_values, y_values)
# create x and y values for the fill area: pdf values from x upward
xs = [e for e in x_values if e >= x]
ys = [norm.pdf(e, loc=loc, scale=scale) for e in xs]
# color the probability area
plot.patches([[xs[0], *xs, xs[-1]]], [[0, *ys, 0]],
fill_color='blue',
legend='probability of '+str(round(sf, 3)))
return plot
plt = plot_normal_sf(2, loc=0, scale=2)
show(plt)
If we want to know the probability of falling between two values, we can take the cdf at the higher value and subtract the cdf at the lower value.
P_between = norm.cdf(x=1, loc=0, scale=1) - norm.cdf(x=-1, loc=0, scale=1)
P_between
This is exactly what we were expecting: as stated previously, in any normal distribution 0.68 of the probability is within one standard deviation of the mean.
If instead we want the probability of getting a value less than one threshold or greater than another, the two events cannot happen at the same time, so we add their probabilities using cdf and sf.
joint = norm.cdf(x=-1, loc=0, scale=1) + norm.sf(x=1, loc=0, scale=1)
joint
This corresponds to the complement of the probability of being between the two values:
1 - P_between == joint
Finally, if we want to know the interval around the mean in which a given probability is concentrated, we can use norm.interval and specify the probability.
alpha=0.95
norm.interval(alpha, loc=0, scale=1)
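As a sanity check (a small sketch), the interval endpoints match the corresponding quantiles obtained with ppf:
# the 95% central interval runs from the 2.5% quantile to the 97.5% quantile
norm.ppf(0.025, loc=0, scale=1), norm.ppf(0.975, loc=0, scale=1)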
Poisson Distribution¶
The Poisson distribution is a very useful type of probability distribution that can model the frequency with which an event occurs during a fixed interval of time. In Poisson distributions, our outcomes can be classified as successes or failures, and the average number of successful events per unit is known.
Suppose we know the average number of calls per minute to a call center and want the probability of receiving a given number of calls in a minute. To find this we use the probability mass function and specify $\mu$ (mu), the mean of the distribution. First we need to import the poisson object from scipy.stats, and then we can use the pmf function.
The probabilities of getting 0 calls and of getting 6 calls in a minute, with a mean of 4.1, are:
from scipy.stats import poisson
print('0 calls: ' + str(poisson.pmf(k=0, mu=4.1)))
print('6 calls: ' + str(poisson.pmf(k=6, mu=4.1)))
Let's write a function to plot the distribution
def plot_poisson_pmf(mu=0):
# create a figure object
plot = figure(x_axis_label='k', y_axis_label='poisson.pmf(k, mu={})'.format(mu), plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(mu*3)
# get a list of pmf for every success
y = [poisson.pmf(k=k, mu=mu) for k in successes]
# plot on figure
plot.vbar(successes, top=y, width=0.5, legend=('mu= '+ str(mu)))
#change x axis ticks
plot.xaxis.ticker = successes
return plot
plt = plot_poisson_pmf(mu=4.1)
show(plt)
If we change the value of $\mu$ we can see that the distribution changes.
When the mean is small, the probability of having 0 events is higher. As the mean gets higher, the curve moves to the right.
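We can see this numerically (a small sketch) by looking at the probability of 0 events for a few values of $\mu$:
# the probability of 0 events shrinks as the mean grows
for mu in [0.5, 2, 4.1, 8]:
    print(mu, poisson.pmf(k=0, mu=mu))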
We can now have a look at the cdf to see what the probability of having $k$ or fewer calls is.
def plot_poisson_cdf(mu=0):
# create a figure object
plot = figure(x_axis_label='k', y_axis_label='poisson.cdf(k, mu={})'.format(mu), plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(mu*3)
# get a list of pmf for every success
y = [poisson.cdf(k=k, mu=mu) for k in successes]
# plot on figure
plot.vbar(successes, top=y, width=0.5)
#change x axis ticks
plot.xaxis.ticker = successes
return plot
plt = plot_poisson_cdf(mu=4.1)
show(plt)
We can then write another function to plot the sf, which gives the probability of having more than $k$ calls.
def plot_poisson_sf(mu=0):
# create a figure object
plot = figure(title='poisson.sf(mu={})'.format(mu),
x_axis_label='k', y_axis_label='poisson.pmf(k, mu={})'.format(mu),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(mu*3)
# get a list of pmf for every success
y = [poisson.sf(k=k, mu=mu) for k in successes]
# plot on figure
plot.vbar(successes, top=y, width=0.5)
#change x axis ticks
plot.xaxis.ticker = successes
return plot
plt = plot_poisson_sf(mu=4.1)
show(plt)
If instead we want the value at which a given probability accumulates, we use the percent point function (ppf).
poisson.ppf(q=0.5, mu=4.1)
For 0.5 probability we get a value of 4, which is close to our mean.
To generate a sample we use rvs
import matplotlib.pyplot as plt
import seaborn as sns
# create a sample
sample = poisson.rvs(mu=4.1, size=1000)
sns.distplot(sample, kde=False)
plt.show()
Note that the sum of all frequencies corresponds to our sample size.
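A quick check of that claim (a sketch using numpy on the sample generated above):
# the counts of all outcomes add up to the sample size
np.bincount(sample).sum() == len(sample)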
Geometric distributions¶
We have already seen how binomial distributions can be used to model a series of experiments with binary outcomes. With a geometric distribution we model a series of failed outcomes until we obtain a successful one.
For example, if we know that a basketball player has 0.3 probability of scoring a free throw, what is the probability of missing the first throw and scoring the second? In the geometric model we want to study the probability of trying until we succeed.
The geometric distribution allows us to calculate the probability of obtaining the first success on the $k$-th trial, given the probability of success on each trial. Its only parameter is the probability of success for each trial.
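Working the free-throw question out by hand and with the geometric pmf (a small sketch; geom is imported again in the next cell as the notebook does):
from scipy.stats import geom
# probability of missing the first free throw and scoring the second
p = 0.3
print((1 - p) * p)            # by hand: 0.7 * 0.3 = 0.21
print(geom.pmf(k=2, p=p))     # first success on the second attempt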
To work with a geometric distribution we need to import the geom object from scipy.stats. The p parameter specifies the probability of success.
from scipy.stats import geom
def plot_geom_pmf(p=0):
# create a figure object
plot = figure(title='geom.pmf(k, p={})'.format(p),
x_axis_label='k', y_axis_label='geom.pmf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(1, 15)
# get a list of pmf for every success
y = [geom.pmf(k=k, p=p) for k in successes]
# plot on figure
plot.vbar(successes, top=y, width=0.5)
#change x axis ticks
plot.xaxis.ticker = successes
return plot
plt = plot_geom_pmf(p=0.12)
show(plt)
If we want to know the probability of succeeding in $k$ or fewer attempts, we use the cdf function.
def plot_geom_cdf(k, p=0):
# calculate cdf
cdf = geom.cdf(k=k, p=p)
# create a figure object
p1 = figure(x_axis_label='k', y_axis_label='geom.pmf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# create a figure object
p2 = figure(title='geom.cdf(k={}, p={}) = {}'.format(k, p, cdf),
x_axis_label='k', y_axis_label='geom.cdf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(1, k+10)
for x in successes:
# pmf value for this number of attempts
pmf_k = geom.pmf(k=x, p=p)
# cdf value for this number of attempts
cdf_k = geom.cdf(k=x, p=p)
# plot on figure
p1.vbar(x, top=pmf_k, width=0.5, color=('blue' if x <= k else 'lightblue'))
label1 = Label(x=x, y=pmf_k, text=str(round(pmf_k, 2)), text_align='center', text_baseline="bottom", text_alpha=0.8)
p1.add_layout(label1)
# plot on figure
p2.vbar(x, top=cdf_k, width=0.5, color=('blue' if x == k else 'lightblue'))
label2 = Label(x=x, y=cdf_k, text=str(round(cdf_k, 2)), text_align='center', text_baseline="bottom", text_alpha=0.8)
p2.add_layout(label2)
#change x axis ticks
p1.xaxis.ticker = successes
layout = row([p1, p2])
return layout
plt = plot_geom_cdf(k=4, p=0.2)
show(plt)
And the survival function:
def plot_geom_sf(k, p=0):
# calculate sf
sf = geom.sf(k=k, p=p)
# create a figure object
plot = figure(title='geom.sf(k={}, p={}) = {}'.format(k, p, sf),
x_axis_label='k', y_axis_label='geom.pmf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(1, k+10)
for x in successes:
# pmf value for this number of attempts
y = geom.pmf(k=x, p=p)
# assigning a color to highlight the probability area
color = 'blue' if x > k else 'lightblue'
# plot on figure
plot.vbar(x, top=y, width=0.5, color=color)
label = Label(x=x, y=y, text=str(round(y, 2)), text_align='center', text_baseline="bottom", text_alpha=0.8)
plot.add_layout(label)
#change x axis ticks
plot.xaxis.ticker = successes
show(plot)
return plot
_ = plot_geom_sf(k=2, p=0.3)
To see the value where we accumulate a given probability we use the percent point function (ppf).
def plot_geom_ppf(q, p):
#calculate ppf
ppf = geom.ppf(q=q, p=p)
plot = figure(title='geom.ppf(q={}, p={}) = {}'.format(q, p, ppf), x_axis_label='k',
y_axis_label='geom.cdf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(1, ppf+5)
# plot bars on figure with probability annotations
for x in successes:
# get a list of cdf values for every success
cdf_k = geom.cdf(k=x, p=p)
if cdf_k >= 0.98:
continue
plot.vbar(x, top=cdf_k, width=0.5, color=('blue' if x == ppf
else 'lightblue'))
label = Label(x=x, y=cdf_k, text=str(round(cdf_k, 2)),
text_align='center', text_baseline="bottom", text_alpha=0.8)
plot.add_layout(label)
#plot arrow
plot.add_layout(Arrow(end=VeeHead(size=30, fill_color='red'), line_color='darkblue',
x_start=0, y_start=q, x_end=ppf, y_end=q))
return plot
plt = plot_geom_ppf(q=0.3, p=0.13)
show(plt)
Last but not least, we can generate a sample of size attempts with a given p using the rvs function.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Create the sample
sample = geom.rvs(p=0.3, size=10000)
# Plot the sample
sns.distplot(sample, bins=np.linspace(0,20,21), kde=False)
plt.show()
From sample to population¶
Now we will look at some patterns that appear in the sample mean as the sample size becomes larger. These patterns form the basis of the law of large numbers.
Jacob Bernoulli developed the law of large numbers in his book Ars Conjectandi (published in 1713). The law states that the sample mean tends to the expected value as the sample grows larger, as seen in a previous plot. As the sample becomes larger, its mean gets nearer to the population mean.
for size in [10, 100, 10000]:
sample_mean = describe(binom.rvs(n=10, p=0.5, size=size)).mean
print('sample size {}: mean {}'.format(size, sample_mean))
Adding random variables¶
The most important result in probability and statistics is the central limit theorem. Let's take a look at what happens when we add random variables. The CLT states that the sum of random variables tends to a normal distribution as the number of them grows to infinity. This theorem works under certain conditions:
- the variables must have the same distribution
- the variables must be independent (the outcome of one variable does not affect the outcomes of the others).
We can start adding binomial, geometric or even Poisson random variables, and as we add more, we get a normal distribution. Let's see an example.
First we generate our population using poisson.rvs
# Generate the population
population = poisson.rvs(mu=2, size=1000, random_state=10)
Now we can plot a histogram of our population
# Plot the histogram
_ = plt.hist(population, bins=range(9), width=0.8)
_ = plt.xlabel('Population values')
_ = plt.ylabel('Frequency')
The result is a skewed, Poisson-shaped histogram of our data. Now we can plot the sample means.
# define a list of sample means
sample_means = []
for _ in range(1000):
# select 10 from population
sample = np.random.choice(population, 10)
# Calculate sample mean of sample
sample_means.append(describe(sample).mean)
_ = plt.hist(sample_means)
_ = plt.xlabel('Sample mean values')
_ = plt.ylabel('Frequency')
_ = plt.title('Sample means histogram')
We get a plot centered at 2 which is the mean of the population, with a bell shape as we expect.
Conclusion¶
We have covered a lot of ground. We have learned to simulate random variables in order to review the fundamental concepts of probability such as density, distributions, expected value, sample mean, variance, joint probability of dependent and independent events, and conditional probability using Bayes' rule. We have reviewed some of the most important probability distributions, such as the binomial, geometric, Poisson, and normal distributions. We have also seen the law of large numbers and the central limit theorem.
There is a lot more to see; to really understand and get to grips with all the methods we have seen in this course, it is important to apply them to real data. Only with practice can we truly master the content. By relating these methods to real-life situations we can gain a deeper understanding of probability and statistics.