$P(y|x) =\frac{P(x, y)}{P(x)}$
Example: the probability of rolling a 6 on a die, given that an even number has been rolled.
$P(6|\mathrm{even}) = \frac{P(\mathrm{even},\ 6)}{P(\mathrm{even})} = \frac{1/6}{1/2} = \frac13$
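A quick sanity check of this value by enumerating the six equally likely outcomes (a minimal Python sketch; the variable names are just for illustration):

```python
# Enumerate die outcomes and compute P(6 | even) = P(even, 6) / P(even).
from fractions import Fraction

outcomes = range(1, 7)                                                          # fair six-sided die
p_even = Fraction(sum(1 for o in outcomes if o % 2 == 0), 6)                    # P(even) = 1/2
p_even_and_6 = Fraction(sum(1 for o in outcomes if o % 2 == 0 and o == 6), 6)   # P(even, 6) = 1/6
print(p_even_and_6 / p_even)                                                    # 1/3
```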
Symmetry: $$P(y|x)P(x) = P(x, y) = P(x|y)P(y)$$
\[ p(y|x)p(x) = p(x, y) = p(x|y)p(y) \Rightarrow p(y|x) = \frac{p(x|y)p(y)}{p(x)} \]
\[ p(x) = \int p(x, y)dy = \int p(x|y)p(y)dy = E_y p(x|y)\]
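Applied to the die example above, Bayes' theorem gives
$$P(\mathrm{even}|6) = \frac{P(6|\mathrm{even})\,P(\mathrm{even})}{P(6)} = \frac{\frac13 \cdot \frac12}{\frac16} = 1,$$
as expected, since a roll of 6 is certainly even.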
Sum Rule:
\[p(x_1, \dots , x_k) = \int p(x_1, \dots , x_k, x_{k+1}, \dots , x_n)\, dx_{k+1} \dots dx_n\]
Product Rule:
\[ p(x_1, x_2, x_3, \dots , x_{n-1}, x_n) = p(x_1|x_2, x_3, \dots, x_{n-1}, x_n) \cdot p(x_2|x_3, \dots, x_{n-1}, x_n) \cdot \ldots \cdot p(x_{n}) \]
Given:
Find:
$P(y=1|x=1) = P(\mathrm{exam\ is\ passed}|\mathrm{student\ is\ happy})\ —\ ?$
|  | Frequentist Approach (Fisher) | Bayesian Approach |
|---|---|---|
| Relationship to Randomness | Cannot predict | Can predict given enough information |
| Values | Random and deterministic | All random |
| Inference Method | Maximum likelihood | Bayes' Theorem |
| Parameter Estimates | Point estimates | Posterior distribution |
| Use Cases | $n \gg d$ | Always |
Let $p(x|\theta) = A(\theta)$ and $p(\theta) = B(\alpha)$
The prior distribution of the parameters, $p(\theta)$, is called conjugate to $p(x|\theta)$ (a conjugate prior) if the posterior distribution belongs to the same family as the prior: $p(\theta|x) = B(\alpha^\prime)$.
Knowing the family to which the posterior distribution must belong, it is easy to compute its normalizing constant (the integral in the denominator of Bayes' theorem).
Posterior distribution of the expected value of a normal distribution
Normal distribution:
$N(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2} \right)$
Likelihood:
$p(x|\mu) = N(x|\mu, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2} \right)$
Conjugate prior:
$p(\mu) = N(\mu|m,s^2)$
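Assuming $n$ i.i.d. observations $x_1, \dots, x_n$ with this likelihood (a standard conjugate calculation), the posterior is again normal:
$$p(\mu|x_1, \dots, x_n) \propto N(\mu|m, s^2)\prod_{i=1}^n N(x_i|\mu, 1) \;\Rightarrow\; p(\mu|x_1, \dots, x_n) = N(\mu|m^\prime, s^{\prime 2}),$$
$$\frac{1}{s^{\prime 2}} = \frac{1}{s^2} + n, \qquad m^\prime = s^{\prime 2}\left(\frac{m}{s^2} + \sum_{i=1}^n x_i\right),$$
so the posterior stays in the same family as the prior, exactly as the definition of conjugacy requires.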
Posterior probability of getting heads when flipping a biased coin
$q$ — the probability of heads; its maximum-likelihood (frequentist) estimate is simply the fraction of heads in $n$ flips
Binomial distribution: $p(x|q) = C_n^x q^x (1-q)^{n-x}$
Beta distribution: $p(q| \alpha, \beta) = \frac{q^{\alpha-1} (1-q)^{\beta-1}}{B(\alpha, \beta)}$
The numerator of the posterior distribution:
$$q^{\alpha-1+x} (1-q)^{\beta-1+(n-x)} = q^{\alpha^\prime-1} (1-q)^{\beta^\prime-1},$$where $\alpha^\prime = \alpha + x$, $\beta^\prime = \beta + n - x$; so, essentially, we also know the denominator: it equals $B(\alpha^\prime, \beta^\prime)$.
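A minimal numeric sketch of this update in Python (the prior parameters and the data, $n = 20$ flips with $x = 14$ heads, are made-up illustration values):

```python
# Beta-Binomial conjugate update: posterior is Beta(alpha + x, beta + n - x).
from scipy.stats import beta

alpha_prior, beta_prior = 2.0, 2.0   # prior Beta(alpha, beta), illustrative values
n, x = 20, 14                        # n flips, x heads (made-up data)

alpha_post = alpha_prior + x
beta_post = beta_prior + n - x       # no integration needed thanks to conjugacy

posterior = beta(alpha_post, beta_post)
print("posterior mean of q:", posterior.mean())            # (alpha + x) / (alpha + beta + n)
print("95% credible interval:", posterior.interval(0.95))
```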
The Monte Carlo method: a broad class of computational algorithms based on the repeated execution of random experiments and the subsequent calculation of quantities of interest.
$$\int f(x)dx \approx \frac{1}{n} \sum\limits_{i=1}^n f(x_i)$$
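A quick illustration of this formula (a sketch: the integrand $f(x) = x^2$ and the unit interval are arbitrary choices, and the $x_i$ are drawn uniformly from the domain, which the unweighted average implicitly assumes):

```python
# Monte Carlo estimate of integral_0^1 x^2 dx = 1/3 by averaging f over uniform samples.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)   # x_i ~ U[0, 1]
print(np.mean(x ** 2))                    # close to 1/3
```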
Markov chains are named after Andrey Andreyevich Markov (senior), who introduced them in 1906.
The sequence $x_1, x_2, \dots, x_{n-1}, x_n$ is a Markov chain if
$$p(x_1, x_2, \dots, x_{n-1}, x_n) = p(x_n|x_{n-1}) \dots p(x_2|x_1)p(x_1)$$
In this case, we can use the Monte Carlo method to calculate mathematical expectations with respect to this probability distribution:
$$E_{p(x)} f(x) \approx \frac{1}{n} \sum\limits_{i=1}^n f(x_i),$$where $x_i \sim p(x)$
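A toy sketch of this estimate (the two-state chain and its transition matrix are invented for illustration; the running average approaches the expectation of $f$ under the chain's stationary distribution, here $1/3$ for state 1):

```python
# Estimate E[f(x)] by averaging f along a long run of a two-state Markov chain.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],        # P[i, j] = p(x_k = j | x_{k-1} = i)
              [0.2, 0.8]])
f = np.array([0.0, 1.0])         # f evaluated at each state

x, total, n = 0, 0.0, 200_000
for _ in range(n):
    x = rng.choice(2, p=P[x])    # one transition of the chain
    total += f[x]

print(total / n)                 # close to 1/3, the stationary probability of state 1
```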
Gibbs sampling: an instance of the Metropolis-Hastings algorithm, named after the distinguished scientist Josiah Willard Gibbs.
Suppose we derived the posterior distribution only up to a normalization constant:
$$p(x_1, x_2, x_3) = \frac{\hat p(x_1, x_2, x_3)}{C}$$
We start with $(x_1^0, x_2^0, x_3^0)$, and within a loop, we perform the following procedure:
$$x_1^{k+1} \sim p(x_1|x_2^{k}, x_3^{k})$$
$$x_2^{k+1} \sim p(x_2|x_1^{k+1}, x_3^{k})$$
$$x_3^{k+1} \sim p(x_3|x_1^{k+1}, x_2^{k+1})$$
Each conditional is one-dimensional and proportional to $\hat p$, so sampling from it does not require knowing the constant $C$.
After a few iterations (once the chain has "warmed up"), we can estimate expectations based on these samples.
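A minimal sketch of this loop for a two-variable case in which both full conditionals are known in closed form (a bivariate normal with correlation $\rho$, chosen purely for illustration):

```python
# Gibbs sampling from a bivariate normal with unit variances and correlation rho.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
n_iter, burn_in = 50_000, 1_000
cond_std = np.sqrt(1.0 - rho ** 2)

x1, x2 = 0.0, 0.0
samples = []
for k in range(n_iter):
    x1 = rng.normal(rho * x2, cond_std)   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x2 = rng.normal(rho * x1, cond_std)   # x2 | x1 ~ N(rho * x1, 1 - rho^2)
    if k >= burn_in:                      # discard the warm-up iterations
        samples.append((x1, x2))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])       # close to rho
```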
Rejection Sampling
\(i = 0\), repeat many times: draw a candidate \(x\) from a proposal distribution \(q(x)\) chosen so that \(C\,q(x) \ge \hat p(x)\) everywhere, draw \(u \sim U[0,\ C\,q(x)]\), and if \(u \le \hat p(x)\), accept the candidate (\(x_{i+1} = x\), \(i = i + 1\)); otherwise reject it.
The difficulty lies in choosing the parameter \(C\): it often has to be very large, so most candidates are rejected.
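A sketch of the procedure above with concrete (and arbitrary) choices: the unnormalized target \(\hat p\) is a standard normal without its normalizing factor, and the proposal \(q\) is uniform on \([-5, 5]\):

```python
# Rejection sampling from an unnormalized density p_hat using a uniform proposal.
import numpy as np

rng = np.random.default_rng(0)

def p_hat(x):
    return np.exp(-0.5 * x ** 2)    # standard normal without 1/sqrt(2*pi)

q_density = 1.0 / 10.0              # q = Uniform(-5, 5)
C = 10.0                            # C * q(x) = 1 >= max p_hat(x), so the envelope holds

samples = []
while len(samples) < 10_000:
    x = rng.uniform(-5.0, 5.0)              # candidate x ~ q
    u = rng.uniform(0.0, C * q_density)     # u ~ U[0, C * q(x)]
    if u <= p_hat(x):                       # accept with probability p_hat(x) / (C * q(x))
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())        # mean close to 0, variance close to 1
```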