Probability is the foundation of many of the methods and models in data science. Statistical inference methods use data to understand the underlying processes that generated it, and probability lets us study the regularities that arise in random phenomena. Hence, knowing probability and its applications is important for working effectively on data science problems.
Flipping a coin¶
Let's try to gain some intuition about probability by starting with the classic random experiment: a coin flip.
If we flip a coin there are two possible outcomes: heads or tails. This random experiment is called a
Bernoulli trial, after Jacob Bernoulli.
A Bernoulli trial is one where the possible outcomes are binary. Each outcome is called an event.
We assign probabilities to events.
With a fair coin we have a 50% chance of getting heads and a 50% chance of getting tails on each flip.
To simulate the coin flips we can use the bernoulli object from scipy.stats, where:
- p is the probability of success
- size is the number of trials
I will also be using a random seed for reproducibility.
import numpy as np
np.random.seed(23)
from scipy.stats import bernoulli
# flip the coin "size" times
ber = bernoulli.rvs(p=0.5, size=10)
ber
np.sum(ber)
Binomial Distribution¶
A sequence of independent Bernoulli trials follows the binomial distribution. So, if we want to count the number of successes, instead of using the bernoulli object and adding up the outcomes we can use the binomial distribution and the binom object, where:
- p : probability of success
- n : number of flips per trial
- size : number of times we repeat the experiment
from scipy.stats import binom
# count the number of heads for "n" flips in "size" trials
binom.rvs(n=10, p=0.5, size=5)
After conducting many random experiments, we notice that some outcomes are more likely than others; the way probability is assigned across the possible outcomes is described by a probability distribution. There are two functions that are especially useful for probability calculations: the probability mass function and the cumulative distribution function.
Probability mass function (pmf)¶
A discrete random variable has a finite number of possible outcomes. The probability mass function allows us to calculate the probability of getting a particular outcome for a discrete random variable. The binomial probability mass function gives the probability of getting $k$ heads from $n$ coin flips when each flip has probability $p$ of landing heads. $$binom.pmf(k, n, p) = \binom{n}{k}p^k(1 - p)^{n-k} $$ where $$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$ The formula multiplies the number of different ways we can get $k$ successes out of $n$ coin flips by the probability of success $p$ raised to the number of successes $k$ and by the probability of failure $1-p$ raised to the number of failures $n-k$.
# probability of "k" heads after "n" flips with a fair coin
binom.pmf(k=5, n=10, p=0.5)
We can plot this to see which outcomes are most likely. I will be using the bokeh library throughout most of the course.
from bokeh.io import curdoc
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
output_notebook()
def plot_binomial_pmf(n, p):
# create a figure object
plot = figure(x_axis_label='number of heads', y_axis_label='probability of outcome', plot_height=400, plot_width=550)
# define an array with the number of successes we could get
heads = np.arange(n+1)
# get the pmf for every success
y = [binom.pmf(k=k, n=n, p=p) for k in heads]
# plot on figure
plot.segment(heads, 0, heads, y, line_width=2)
plot.circle(heads, y, size=16)
#change axis ticks
plot.xaxis.ticker = heads
return plot
plt = plot_binomial_pmf(n=10, p=0.5)
show(plt)
Cumulative distribution function (cdf)¶
When we want to calculate the probability of getting $k$ or fewer heads from $n$ flips, we use the cumulative distribution function, which adds the probabilities of getting 0 heads out of $n$ flips, 1 head out of $n$ flips, and so on up to $k$ heads out of $n$ flips.
$$binom.cdf(k, n, p) = \binom{n}{0}p^0(1 - p)^{n} + \binom{n}{1}p(1 - p)^{n-1} + ... + \binom{n}{k}p^k(1 - p)^{n-k}$$
The binomial cumulative distribution function gives the probability of getting $k$ heads or fewer from $n$ coin flips with $p$ probability of getting heads.
# probability of 2 or fewer heads after 10 throws with a fair coin
binom.cdf(k=2, n=10, p=0.5)
We can use the complement to get the probability of more than $k$ heads from $n$ throws.
# probability of more than 2 heads after 10 throws with a fair coin
1 - binom.cdf(k=2, n=10, p=0.5)
Alternatively, we can use the function binom.sf(k, n, p) for the complement; sf stands for survival function.
# probability of more than 2 heads after 10 throws with a fair coin using binom.sf
binom.sf(k=2, n=10, p=0.5)
With the following function we can see how the cumulative probability of "k" changes with different values of "p" and "n".
def plot_binom_cdf(n, p):
# create a figure object
plot = figure(x_axis_label='k', y_axis_label='cdf', plot_height=400, plot_width=550)
# defining an array with the number of successes we could get
successes = np.arange(n+1)
# create a list of probability outcomes
probs = [binom.cdf(k=k, n=n, p=p) for k in successes]
# plot cdf
plot.line(successes, probs, line_width=4, legend=('p='+ str(p)))
plot.segment(successes, 0, successes, probs)
plot.circle(successes, probs, size=5, color='red', tags=probs)
# change axis ticks
plot.xaxis.ticker = successes
plot.legend.location = 'top_left'
return plot
plt = plot_binom_cdf(n=10, p=0.5)
show(plt)
Expected value¶
A discrete random variable has finite outcomes. For instance, the roll of a die has 6 possible outcomes. For discrete random variables, the expected value is the sum of the possible outcomes weighted by their probabilities: each outcome is multiplied by its probability, and the products are summed. $$E(X)=\sum_{i=1}^{k}x_ip_i = x_1p_1 + x_2p_2 + ... + x_kp_k$$ In simple terms, the expected value is the value around which the outcomes concentrate when we repeat the experiment many times. For our coin example we multiply the tails outcome $0$ by its probability $1-p$, then add the heads outcome $1$ multiplied by its probability $p$: $$E(X)=\sum_{i=1}^{2}x_ip_i =0\times(1-p) + 1\times p=p$$ and we get the expected value $p$.
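As a quick check of the definition (a small sketch with numpy, using the die mentioned above rather than the coin), we can compute the expected value of a fair six-sided die as the probability-weighted sum of its outcomes:
# expected value of a fair six-sided die from the definition
outcomes = np.arange(1, 7)
probs = np.repeat(1/6, 6)
np.sum(outcomes * probs)  # 3.5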
Arithmetic Mean¶
The arithmetic mean is the sum of each outcome divided by the number of samples.
$$\bar{X}=\frac{1}{n}\sum_{i=1}^{n}x_i=\frac{1}{n}(x_1+x_2+...+x_n)$$
The arithmetic mean of the outcomes converges to the expected value as we increase the number of random experiments. In scipy we will use the describe() function from scipy.stats to get the mean.
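For example (a small sketch; the notebook imports describe again just below), the mean of the 10 Bernoulli flips stored in ber earlier:
from scipy.stats import describe
# arithmetic mean of the 10 coin flips generated above
describe(ber).mean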
If we repeat the experiment and make $n$ coin flips we should get a mean near the expected value; the more flips we make, the closer the mean of all the flips gets to the expected value, as we can see from the graph below.
from bokeh.models import Slider, Label, Arrow, VeeHead, Div
from scipy.stats import describe
def plot_mean_of_flips(p, size):
#generate x values
x = np.arange(size)
# generate coin flips
results = binom.rvs(n=1, p=p, size=size)
# get the list of accumulated means
y = [describe(results[:i+1]).mean for i in x]
# creating the plot
plot = figure(y_range=(0,1), x_axis_label='Coin flips', y_axis_label='Mean of coin flips', plot_height=400, plot_width=550)
plot.line(x, y)
plot.line(x, p, color='grey', line_alpha=0.8)
label = Label(x=size*3/5, y=p, text="expected value", text_baseline="bottom", text_alpha=0.8)
plot.add_layout(label)
return plot
plt = plot_mean_of_flips(p=0.5, size=1000)
show(plt)
That is the relationship between the expected value and the mean.
Variance¶
The variance measures how concentrated or spread out from the expected value the data is.
$$Var(X)=E[(X-E(X))^2]=\sum_{i=1}^{n}p_i\times (x_i-E(X))^2$$Variance is the expected value of the squared deviation of a random variable from its expected value. In the particular case of the binomial distribution, the expected value is the product of the number of coin flips and the probability of getting heads.
For $X\sim Binomial(n,p)$ $$E(X)=n\times p$$ $$Var(X)=n\times p \times (1-p)$$
For the binomial distribution, the variance is therefore the product of the expected value and the probability of failure. We can use the binom.stats() method to get the expected value and variance of a binomial distribution.
mean, variance = binom.stats(n=10, p=0.5)
print('mean = ', mean)
print('variance = ', variance)
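As a cross-check (a small sketch, not part of the original notebook), we can recover the same values directly from the definitions by weighting each possible outcome by its pmf:
# expected value and variance of Binomial(n=10, p=0.5) from the definitions
ks = np.arange(11)
pmf = binom.pmf(ks, n=10, p=0.5)
ev = np.sum(ks * pmf)                 # n*p = 5.0
var = np.sum(pmf * (ks - ev) ** 2)    # n*p*(1-p) = 2.5
print(ev, var)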
Expected value, mean, and variance are essential to probability and statistics: they are the key measures for determining whether the data is spread out or concentrated around the expected value.
averages = []
variances = []
for i in range(0, 1500):
# 10 draws of 10 coin flips with 25% probability of heads
sample = binom.rvs(n=10, p=0.25, size=10)
# Mean and variance of the values in the sample variable
averages.append(describe(sample).mean)
variances.append(describe(sample).variance)
# Calculate the mean of the averages variable
print("Mean {}".format(describe(averages).mean))
# Calculate the mean of the variances variable
print("Variance {}".format(describe(variances).mean))
# Calculate the mean and variance
print(binom.stats(n=10, p=0.25))
Simulated data is very important for modeling random phenomena. You can always check if your data follows the intended distribution. Repeating experiments and taking means will give you insight about probability and will introduce you to statistics as you find regularities. scipy.stats offers many functions to generate simulated data and describe it. Experiment some more, always checking your simulation results against the expected results.
Calculating the probabilities of two events¶
Independent Events¶
Given two events, if they are not affected by the order in which they happen and the occurrence of one does not affect the probability of the other, then the events are independent.
Probability of A and B¶
Returning to our coin, if we flip it twice we get four possible outcomes, and we can ask what the probability of each outcome is. For independent events we use the multiplication rule: the probability of $A$ and $B$ is the probability of $A$ times the probability of $B$.
$$P(A\textrm{ and }B)=P(A)P(B)$$
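For example (a minimal sketch), the probability of getting heads on both of two independent flips of a fair coin is one of the four equally likely outcomes:
# P(heads on first flip and heads on second flip) for a fair coin
p_heads = 0.5
p_heads * p_heads  # 0.25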
We can count how many times each outcome repeats using the function find_repeats, which we import from scipy.stats. We pass the data to the function to get how many times each outcome in the data repeats.
This function returns:
- values : ndarray, the unique values from the (flattened) input that are repeated.
- counts : ndarray, the number of times the corresponding value is repeated.
from scipy.stats import find_repeats
# generate a sample of two fair coin flips
sample = binom.rvs(n=2, p=0.5, size=1000)
reps = find_repeats(sample)
reps
for num, heads in enumerate(reps.counts):
print('{0} heads: {1}'.format(str(num), str(heads)))
The relative frequency is the number of favourable trials divided by the total trials.
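Before using scipy's helper, here is a quick manual check (a sketch reusing the sample and reps variables defined above):
# relative frequency of each repeated outcome: counts divided by total trials
reps.counts / len(sample)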
To get the relative frequency of each outcome in the data, we can import the scipy.stats.relfreq function and pass it the data and numbins as parameters. In our case numbins is three, one bin for each of the three possible outcomes stored in our find_repeats result.
from scipy.stats import relfreq
relfreq(sample, numbins=len(reps.values))
This function returns:
- frequency : ndarray, binned values of relative frequency.
- lowerlimit : float, lower real limit.
- binsize : float, width of each bin.
- extrapoints : int, extra points.
Probability of A or B¶
To calculate the probability of $A$ or $B$ we need to consider whether the events are mutually
exclusive (ones that cannot happen at the same time) or not.
If two events $A$ and $B$ are mutually exclusive, the events are called disjoint events. The
probability of two disjoint events $A$ or $B$ happening is the probability of $A$ plus the probability
of $B$.
$$P(A\textrm{ or }B)=P(A)+P(B)$$
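For instance (a small sketch with the binom object from earlier), getting exactly 0 heads or exactly 10 heads in 10 fair flips are disjoint events, so their probabilities simply add:
# P(0 heads or 10 heads in 10 fair flips): disjoint events, so the probabilities add
binom.pmf(k=0, n=10, p=0.5) + binom.pmf(k=10, n=10, p=0.5)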
In the case of overlapping events, the procedure is slightly different: we add the probability of $A$ to the probability of $B$, but we need to subtract the overlap. If event $B$ overlaps with event $A$, the probability of $A$ or $B$ is the probability of $A$ plus the probability of $B$ minus the probability of $A$ and $B$: $$P(A\textrm{ or }B)=P(A)+P(B)-P(A\textrm{ and }B)$$ This also works for events that do not overlap, because then $P(A\textrm{ and }B)$ is 0. For example, we can calculate the probability of drawing a Jack or a spade from a deck of poker cards.
# Figure probabilities
P_Jack = 4/52
P_Spade = 13/52
# Joint probability: P(A and B)
P_Jack_n_Spade = P_Jack*P_Spade
# Probability of Jack or spade: P(A or B)
P_Jack_or_Spade = P_Jack + P_Spade - P_Jack_n_Spade
print(P_Jack_or_Spade)
Dependent Events¶
Two events are dependent if the outcome or occurrence of $A$ affects the outcome or occurrence of $B$ so that the probability is changed. To calculate dependent events we have to introduce conditional probability.
Conditional Probabilities¶
The conditional probability of an event $B$ in relationship to an event $A$ is the probability that event $B$ occurs given that event $A$ has already occurred. The notation for conditional probability is $P(B|A)$.
$$P(A\textrm{ and }B)=P(A)P(B|A)$$ Rearranging the formula we get:
$$P(B|A)=\frac{P(A\textrm{ and }B)}{P(A)}$$ Conditional probabilities allow us to calculate probabilities in a reduced or restricted sample space. Conditioning is just changing the reference against which we calculate probabilities.
We can use a deck of poker cards again to calculate the probability of getting a Jack and then another Jack.
# Needed probabilities
P_first_Jack = 4/52
P_Jack_given_Jack = 3/51
# Joint probability calculation
P_two_Jacks = P_first_Jack*P_Jack_given_Jack
print(P_two_Jacks)
Total probability law¶
The Law of total probability (also known as Total Probability Rule) is a fundamental rule in statistics relating to conditional and marginal probabilities. The rule states that if the probability of an event is unknown, it can be calculated using the known probabilities of several distinct events.
Suppose there are three events: $A$, $B$, and $C$. Events $B$ and $C$ are mutually exclusive and together cover the whole sample space, while event $A$ intersects with both. We do not know the probability of event $A$. However, we know the probability of event $A$ under condition $B$ and the probability of event $A$ under condition $C$.
The total probability rule states that by using the two conditional probabilities, we can find the probability of event $A$ (i.e., total probability).
Mathematically, the total probability rule can be written in the following equation:
$$P(A)=\sum_{n}P(A\textrm{ and }B_n)=\sum_{n}P(A|B_n)P(B_n)$$ where:
- $n$ : the number of events
- $B_n$ : the $n$-th distinct event
Let's consider the following example:
Of the total population of three states X, Y, and Z, 43% are from state X, 25% are from state Y, and 32% are from state Z. A poll is taken and the result is the following:
- 53% of the voters support John Doe in state X.
- 67% of the voters support John Doe in state Y.
- 32% of the voters support John Doe in state Z.
What is the total percentage of voters that support John Doe?
# Individual probabilities
P_X = 0.43
P_Y = 0.25
P_Z = 0.32
# Conditional probabilities
P_Support_g_X = 0.53
P_Support_g_Y = 0.67
P_Support_g_Z = 0.32
# Total probability calculation
P_Support = P_X * P_Support_g_X + P_Y * P_Support_g_Y + P_Z * P_Support_g_Z
print(P_Support)
Adding up the joint probabilities over non-overlapping partitions gives us the total probability law. This is important for Bayes' rule, so let's move on to that.
Bayes' rule¶
Bayes' rule allows us to calculate conditional probabilities. If we consider what we have seen previously
$$P(A\textrm{ and }B)=P(A)P(B|A)$$ $$P(B\textrm{ and }A)=P(B)P(A|B)$$ then
$$P(A)P(B|A)=P(B)P(A|B)$$ so
$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$$ This is Bayes' formula, which reads as: the probability of $A$ given $B$ is equal to the probability of $A$ times the probability of $B$ given $A$, divided by the probability of $B$.
Let's consider the following exercise:
A certain electronic part is manufactured by three different vendors named V1, V2, and V3.
Half of the parts are produced by V1, 25% by V2, and the rest by V3. The probability of a part being damaged given that it was produced by V1 is 1%, while it's 2% for V2 and 3% for V3.
If a part taken at random is damaged, what is the probability that the part was manufactured by V1?
# Individual probabilities & conditional probabilities
P_V1 = 0.50
P_V2 = 0.25
P_V3 = 0.25
P_D_g_V1 = 0.01
P_D_g_V2 = 0.02
P_D_g_V3 = 0.03
# Probability of Damaged
P_Damaged = P_V1 * P_D_g_V1 + P_V2 * P_D_g_V2 + P_V3 * P_D_g_V3
# Bayes' rule for P(V1|Damaged)
P_V1_g_D = P_V1 * P_D_g_V1 / P_Damaged
print(P_V1_g_D)
Bayes' rule allows us to calculate conditional probabilities by reducing our sample space using non-overlapping partitions.
Normal Distribution¶
The normal distribution is the most important and most widely used distribution in probability and statistics. It is also called the Gaussian distribution, after Carl Friedrich Gauss, and allows us to model many situations.
The probability density is a function that assigns the relative likelihood to each possible outcome in the sample space. In normal distributions, the probability density has a bell shape. The plot is dense and symmetric around the mean as we can see below.
from bokeh.palettes import Dark2_5 as palette
from scipy.stats import norm
def plot_normal(loc=0, stds=[1, 2]):
plot = figure(x_axis_label='x', y_axis_label='probability density of x', plot_height=400, plot_width=550)
x = np.arange(loc-10, loc+10, 0.1)
for std, color in zip(stds, palette):
y = norm.pdf(x, loc=loc, scale=std)
plot.line(x, y, color=color, line_width=3, line_alpha=0.7,
legend='standard deviation '+str(std))
return plot
plt = plot_normal(loc=0, stds=[0.5, 1, 2, 3])
show(plt)
The probability of getting a value below the mean is the same as the probability of getting a value above the mean. One important consequence of the symmetry of the probability density function is that the mean is the value with the highest probability density.
The standard deviation is a measure of how spread out the probability density is. For different standard deviations the curve concentrates more or less probability density around the mean: the lower the standard deviation, the more concentrated the probability density is around the mean.
From a statistical point of view it is interesting to know how far the data is from the mean in terms of standard deviations. In any normal distribution, 0.68 of the probability is within one standard deviation of the mean, 0.95 is within two standard deviations, and 0.997 is within three standard deviations.
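We can verify these figures with the standard normal cdf (a small sketch; norm was imported for the plot above):
# probability within 1, 2 and 3 standard deviations of the mean
for k in [1, 2, 3]:
    print(k, norm.cdf(k) - norm.cdf(-k))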
Normal probabilities¶
To calculate the probability density of a given value we use the probability density function pdf: we pass the value we want to evaluate, along with the loc parameter for the mean and the scale parameter for the standard deviation.
# probability density
pdf = norm.pdf(x=-1, loc=0, scale=2)
pdf
If we want to get the probability of a value being below $x$, we can again use the cumulative distribution function.
from bokeh.layouts import widgetbox, column, row
def plot_normal_cdf(x, loc=0, scale=1):
# create first plot
p1 = figure(x_axis_label='x', y_axis_label='probability density of x',
title='norm.cdf({}) with mean {} and standard deviation {}'.format(x, loc, scale),
plot_height=400, plot_width=550)
# generate x values
x_values = np.arange(loc-(4*scale), loc+(4*scale), 0.01)
# calculate y values for pdf
y_values = norm.pdf(x_values, loc=loc, scale=scale)
# calculate cdf of x
cdf = norm.cdf(x, loc=loc, scale=scale)
# plot the pdf
p1.line(x_values, y_values)
# create x and y values for the fill area: pdf values for all x up to the given x
xs = [e for e in x_values if e <= x]
ys = [norm.pdf(e, loc=loc, scale=scale) for e in xs]
# plot the area equivalent to the cdf of x
p1.patches([[x, *xs]], [[0, *ys]],
fill_color='blue',
legend='probability of '+str(round(cdf, 3)))
# create second plot
p2 = figure(title='cdf', x_axis_label='x', y_axis_label='probability of X less than x', plot_height=400, plot_width=550)
# create a list of probability outcomes
probs = [norm.cdf(k, loc=loc, scale=scale) for k in x_values]
# plot cdf
p2.line(x_values, probs, line_width=4)
p2.line([x, x], [0, cdf], line_color='black')
p2.line([x, loc-(4*scale)], [cdf, cdf], line_color='black')
p2.add_layout(Arrow(end=VeeHead(size=25, fill_color='red'), line_color='red',
x_start=x, y_start=0, x_end=x, y_end=cdf))
p2.add_layout(Arrow(end=VeeHead(size=25, fill_color='red'), line_color='red',
x_start=x, y_start=cdf, x_end=loc-(4*scale), y_end=cdf))
layout = row([p1, p2])
return layout
plt = plot_normal_cdf(4, loc=0, scale=2)
show(plt)
We use norm.sf when we want to calculate the survival function for our normal distribution, the complement of the cdf.
def plot_normal_sf(x, loc=0, scale=1):
# create first plot
plot = figure(x_axis_label='x', y_axis_label='probability density of x',
title='norm.sf({}) with mean {} and standard deviation {}'.format(x, loc, scale),
plot_height=400, plot_width=550)
# generate x values
x_values = np.arange(loc-(4*scale), loc+(4*scale), 0.01)
# calculate y values for pdf
y_values = norm.pdf(x_values, loc=loc, scale=scale)
# calculate sf of x
sf = norm.sf(x, loc=loc, scale=scale)
# plot the pdf
plot.line(x_values, y_values)
# create x and y values for the fill area: pdf values from x upward
xs = [e for e in x_values if e >= x]
ys = [norm.pdf(e, loc=loc, scale=scale) for e in xs]
# color the probability area
plot.patches([[xs[0], *xs, xs[-1]]], [[0, *ys, 0]],
fill_color='blue',
legend='probability of '+str(round(sf, 3)))
return plot
plt = plot_normal_sf(2, loc=0, scale=2)
show(plt)
If we want to know the probability of falling between two values, we can take the cdf at the higher value and subtract the cdf at the lower value.
P_between = norm.cdf(x=1, loc=0, scale=1) - norm.cdf(x=-1, loc=0, scale=1)
P_between
This is exactly what we were expecting: as stated previously, in any normal distribution 0.68 of the probability is within one standard deviation of the mean.
If instead we want the probability of getting a value less than one threshold or greater than another, the two events cannot happen at the same time, so we add their probabilities using cdf and sf.
joint = norm.cdf(x=-1, loc=0, scale=1) + norm.sf(x=1, loc=0, scale=1)
joint
This corresponds to the complement of the probability of being between the two values:
1 - P_between == joint
Finally, if we want to know the interval around the mean in which a given probability is concentrated, we can use norm.interval and specify the probability.
alpha=0.95
norm.interval(alpha, loc=0, scale=1)
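As a sanity check (a small sketch), the interval endpoints match the corresponding quantiles obtained with ppf:
# the 95% central interval runs from the 2.5% quantile to the 97.5% quantile
norm.ppf(0.025, loc=0, scale=1), norm.ppf(0.975, loc=0, scale=1)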
Poisson Distribution¶
The Poisson distribution is a very useful type of probability distribution that can model the frequency with which an event occurs during a fixed interval of time. In Poisson distributions, our outcomes can be classified as successes or failures, and the average number of successful events per unit is known.
Suppose we know the average number of calls per minute to a call center and want the probability of receiving a given number of calls in a minute. To find this we use the probability mass function and specify $\mu$ (mu), the mean of the distribution. First we need to import the poisson object from scipy.stats, and then we can use the pmf function.
The probabilities of getting 0 calls and of getting 6 calls in a minute, with a mean of 4.1, are:
from scipy.stats import poisson
print('0 calls: ' + str(poisson.pmf(k=0, mu=4.1)))
print('6 calls: ' + str(poisson.pmf(k=6, mu=4.1)))
Let's write a function to plot the distribution
def plot_poisson_pmf(mu=0):
# create a figure object
plot = figure(x_axis_label='k', y_axis_label='poisson.pmf(k, mu={})'.format(mu), plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(mu*3)
# get a list of pmf for every success
y = [poisson.pmf(k=k, mu=mu) for k in successes]
# plot on figure
plot.vbar(successes, top=y, width=0.5, legend=('mu= '+ str(mu)))
#change x axis ticks
plot.xaxis.ticker = successes
return plot
plt = plot_poisson_pmf(mu=4.1)
show(plt)
If we change the value of $\mu$ we can see that the distribution changes.
When the mean is small, the probability of having 0 events is higher. As the mean gets higher, the curve moves to the right.
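We can see this numerically (a small sketch) by looking at the probability of 0 events for a few values of $\mu$:
# the probability of 0 events shrinks as the mean grows
for mu in [0.5, 2, 4.1, 8]:
    print(mu, poisson.pmf(k=0, mu=mu))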
We can now have a look at the cdf to see what the probability of having $k$ or fewer calls is.
def plot_poisson_cdf(mu=0):
# create a figure object
plot = figure(x_axis_label='k', y_axis_label='poisson.cdf(k, mu={})'.format(mu), plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(mu*3)
# get a list of pmf for every success
y = [poisson.cdf(k=k, mu=mu) for k in successes]
# plot on figure
plot.vbar(successes, top=y, width=0.5)
#change x axis ticks
plot.xaxis.ticker = successes
return plot
plt = plot_poisson_cdf(mu=4.1)
show(plt)
We can then write another function to plot the sf, which gives the probability of having more than $k$ calls.
def plot_poisson_sf(mu=0):
# create a figure object
plot = figure(title='poisson.sf(mu={})'.format(mu),
x_axis_label='k', y_axis_label='poisson.pmf(k, mu={})'.format(mu),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(mu*3)
# get a list of pmf for every success
y = [poisson.sf(k=k, mu=mu) for k in successes]
# plot on figure
plot.vbar(successes, top=y, width=0.5)
#change x axis ticks
plot.xaxis.ticker = successes
return plot
plt = plot_poisson_sf(mu=4.1)
show(plt)
If instead we want the value at which a given probability accumulates, we use the percent point function (ppf).
poisson.ppf(q=0.5, mu=4.1)
For 0.5 probability we get a value of 4, which is close to our mean.
To generate a sample we use rvs
import matplotlib.pyplot as plt
import seaborn as sns
# create a sample
sample = poisson.rvs(mu=4.1, size=1000)
sns.distplot(sample, kde=False)
plt.show()
Note that the sum of all frequencies corresponds to our sample size.
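A quick check of that claim (a sketch using numpy on the sample generated above):
# the counts of all outcomes add up to the sample size
np.bincount(sample).sum() == len(sample)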
Geometric distributions¶
We have already seen how binomial distributions can be used to model a series of experiments with binary outcomes. With a geometric distribution we model a series of failed outcomes until we obtain a successful one.
For example, if we know that a basketball player has 0.3 probability of scoring a free throw, what is the probability of missing the first throw and scoring the second? In the geometric model we want to study the probability of trying until we succeed.
The geometric distribution allows us to calculate the probability of obtaining the first success on the $k$-th trial, given the probability of success on each trial. Its only parameter is the probability of success for each trial.
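Working the free-throw question out by hand and with the geometric pmf (a small sketch; geom is imported again in the next cell as the notebook does):
from scipy.stats import geom
# probability of missing the first free throw and scoring the second
p = 0.3
print((1 - p) * p)            # by hand: 0.7 * 0.3 = 0.21
print(geom.pmf(k=2, p=p))     # first success on the second attempt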
To work with a geometric distribution we need to import the geom object from scipy.stats. The p parameter specifies the probability of success.
from scipy.stats import geom
def plot_geom_pmf(p=0):
# create a figure object
plot = figure(title='geom.pmf(k, p={})'.format(p),
x_axis_label='k', y_axis_label='geom.pmf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(1, 15)
# get a list of pmf for every success
y = [geom.pmf(k=k, p=p) for k in successes]
# plot on figure
plot.vbar(successes, top=y, width=0.5)
#change x axis ticks
plot.xaxis.ticker = successes
return plot
plt = plot_geom_pmf(p=0.12)
show(plt)
If we want to know the probability of succeeding in $k$ or fewer attempts, we use the cdf function.
def plot_geom_cdf(k, p=0):
# calculate cdf
cdf = geom.cdf(k=k, p=p)
# create a figure object
p1 = figure(x_axis_label='k', y_axis_label='geom.pmf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# create a figure object
p2 = figure(title='geom.cdf(k={}, p={}) = {}'.format(k, p, cdf),
x_axis_label='k', y_axis_label='geom.cdf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(1, k+10)
for x in successes:
# pmf value for this number of attempts
pmf_k = geom.pmf(k=x, p=p)
# cdf value for this number of attempts
cdf_k = geom.cdf(k=x, p=p)
# plot on figure
p1.vbar(x, top=pmf_k, width=0.5, color=('blue' if x <= k else 'lightblue'))
label1 = Label(x=x, y=pmf_k, text=str(round(pmf_k, 2)), text_align='center', text_baseline="bottom", text_alpha=0.8)
p1.add_layout(label1)
# plot on figure
p2.vbar(x, top=cdf_k, width=0.5, color=('blue' if x == k else 'lightblue'))
label2 = Label(x=x, y=cdf_k, text=str(round(cdf_k, 2)), text_align='center', text_baseline="bottom", text_alpha=0.8)
p2.add_layout(label2)
#change x axis ticks
p1.xaxis.ticker = successes
layout = row([p1, p2])
return layout
plt = plot_geom_cdf(k=4, p=0.2)
show(plt)
And the survival function:
def plot_geom_sf(k, p=0):
# calculate sf
sf = geom.sf(k=k, p=p)
# create a figure object
plot = figure(title='geom.sf(k={}, p={}) = {}'.format(k, p, sf),
x_axis_label='k', y_axis_label='geom.pmf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(1, k+10)
for x in successes:
# pmf value for this number of attempts
y = geom.pmf(k=x, p=p)
# assigning a color to highlight the probability area
color = 'blue' if x > k else 'lightblue'
# plot on figure
plot.vbar(x, top=y, width=0.5, color=color)
label = Label(x=x, y=y, text=str(round(y, 2)), text_align='center', text_baseline="bottom", text_alpha=0.8)
plot.add_layout(label)
#change x axis ticks
plot.xaxis.ticker = successes
show(plot)
return plot
_ = plot_geom_sf(k=2, p=0.3)
To see the value where we accumulate a given probability we use the percent point function (ppf).
def plot_geom_ppf(q, p):
#calculate ppf
ppf = geom.ppf(q=q, p=p)
plot = figure(title='geom.ppf(q={}, p={}) = {}'.format(q, p, ppf), x_axis_label='k',
y_axis_label='geom.cdf(k, p={})'.format(p),
plot_height=400, plot_width=550)
# define an array with the number of successes we could get
successes = np.arange(1, ppf+5)
# plot bars on figure with probability annotations
for x in successes:
# get a list of cdf values for every success
cdf_k = geom.cdf(k=x, p=p)
if cdf_k >= 0.98:
continue
plot.vbar(x, top=cdf_k, width=0.5, color=('blue' if x == ppf
else 'lightblue'))
label = Label(x=x, y=cdf_k, text=str(round(cdf_k, 2)),
text_align='center', text_baseline="bottom", text_alpha=0.8)
plot.add_layout(label)
#plot arrow
plot.add_layout(Arrow(end=VeeHead(size=30, fill_color='red'), line_color='darkblue',
x_start=0, y_start=q, x_end=ppf, y_end=q))
return plot
plt = plot_geom_ppf(q=0.3, p=0.13)
show(plt)
Last but not least, we can generate a sample of size attempts with a given p using the rvs function.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Create the sample
sample = geom.rvs(p=0.3, size=10000)
# Plot the sample
sns.distplot(sample, bins=np.linspace(0,20,21), kde=False)
plt.show()
From sample to population¶
Now we will look at some patterns that appear in the sample mean as the sample size becomes larger. These patterns form the basis of the law of large numbers.
Jacob Bernoulli developed the law of large numbers in his book Ars Conjectandi (published in 1713). The law states that the sample mean tends to the expected value as the sample grows larger, as seen in a previous plot. As the sample becomes larger, its mean gets nearer to the population mean.
for size in [10, 100, 10000]:
sample_mean = describe(binom.rvs(n=10, p=0.5, size=size)).mean
print('sample size {}: mean {}'.format(size, sample_mean))
Adding random variables¶
The most important result in probability and statistics is the central limit theorem. Let's take a look at what happens when we add random variables. The CLT states that the sum of random variables tends to a normal distribution as the number of them grows to infinity. This theorem works under certain conditions:
- the variables must have the same distribution
- the variables must be independent (the outcome of one variable does not affect the outcomes of the others).
We can start adding binomial, geometric or even Poisson random variables, and as we add more, we get a normal distribution. Let's see an example.
First we generate our population using poisson.rvs
# Generate the population
population = poisson.rvs(mu=2, size=1000, random_state=10)
Now we can plot a histogram of our population
# Plot the histogram
_ = plt.hist(population, bins=range(9), width=0.8)
_ = plt.xlabel('Population values')
_ = plt.ylabel('Frequency')
The result is a skewed, Poisson-shaped histogram of our data. Now we can plot the sample means.
# define a list of sample means
sample_means = []
for _ in range(1000):
# select 10 from population
sample = np.random.choice(population, 10)
# Calculate sample mean of sample
sample_means.append(describe(sample).mean)
_ = plt.hist(sample_means)
_ = plt.xlabel('Sample mean values')
_ = plt.ylabel('Frequency')
_ = plt.title('Sample means histogram')
We get a plot centered at 2 which is the mean of the population, with a bell shape as we expect.
Conclusion¶
We have covered a lot of ground. We have learned to simulate random variables in order to review the fundamental concepts of probability such as density, distributions, expected value, sample mean, variance, joint probability of dependent and independent events, and conditional probability using Bayes' rule. We have reviewed some of the most important probability distributions, such as the binomial, geometric, Poisson, and normal distributions. We have also seen the law of large numbers and the central limit theorem.
There is a lot more to see; to really understand and get to grips with all the methods we have seen in this course, it is important to apply them to real data. Only with practice can we truly master the content. By relating these methods to real-life situations we can gain a deeper understanding of probability and statistics.