Softmax

2019.03.18

Softmax (answer)

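The name "softmax" originates in contrast to the argmax function (considered a hard-max), which extracts the index of the element having the greatest value. The argmax function with one-hot encoding: \[ \operatorname{argmax}(x)_i = \begin{cases} 1 & \text{if } x_i = \max_j x_j \\ 0 & \text{otherwise} \end{cases} \] Compared to the softmax function: \[ \operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} \] \( e \) doesn't need to be the base, and a more general form for the function is: \[ \operatorname{softmax}_{\beta}(x)_i = \frac{e^{\beta x_i}}{\sum_j e^{\beta x_j}} \] where \( \beta \) can be altered to effectively change the base. It can be understood from this form how softmax approximates argmax. Read more...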
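As a rough illustration of how softmax approaches a one-hot argmax, here is a minimal Python sketch. The function names and the scaling parameter `beta` are assumptions made for this example, not taken from the post; `beta` multiplies each input before exponentiation, which is equivalent to changing the base.

```python
import math

def softmax(xs, beta=1.0):
    """Softmax with a scaling factor `beta` (name assumed for this sketch)."""
    m = max(xs)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(beta * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_argmax(xs):
    """Argmax expressed as a one-hot vector (first maximum wins on ties)."""
    winner = xs.index(max(xs))
    return [1.0 if i == winner else 0.0 for i in range(len(xs))]

scores = [1.0, 2.0, 3.0]
for beta in (1.0, 10.0, 100.0):
    # Larger beta pushes more of the probability mass onto the largest score.
    print(beta, softmax(scores, beta))
print(one_hot_argmax(scores))
```

With a unique largest score, increasing `beta` moves the softmax output towards the one-hot vector printed on the last line.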

2019.03.18

Incremental average (estimate update)

2018.08.11

Incremental average (estimate update) (answer)

Statement: \[ A_n = A_{n-1} + \frac{1}{n}(V_n - A_{n-1}) \] Alternative form: \[ \mathrm{NewEstimate} \leftarrow \mathrm{OldEstimate} + \mathrm{StepSize}[\mathrm{NewData} - \mathrm{OldEstimate}] \] The second form describes updating our estimate of the average by multiplying an error term, \( \mathrm{NewData} - \mathrm{OldEstimate} \), by a weighting factor, \( \mathrm{StepSize} \), and adding the result to the old estimate. \( \mathrm{StepSize} \) is \( \frac{1}{n} \) when all data points are weighted equally. The average, \( A_{n-1} \), is known for a sequence of \( n-1 \) values, \( V_1, V_2, \ldots \) Read more...
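The update rule can be sketched in a few lines of Python. This is a minimal example assuming equal weighting, so the step size is \( \frac{1}{n} \); the function name and the sample values are made up for illustration.

```python
def incremental_average(values):
    """Running mean via NewEstimate <- OldEstimate + StepSize * (NewData - OldEstimate)."""
    estimate = 0.0  # the starting value is irrelevant: the first step size is 1
    for n, value in enumerate(values, start=1):
        step_size = 1.0 / n          # equal weighting of all data points
        error = value - estimate     # NewData - OldEstimate
        estimate += step_size * error
    return estimate

samples = [2.0, 4.0, 9.0]  # illustrative data
print(incremental_average(samples))
print(sum(samples) / len(samples))  # direct computation, for comparison
```

The incremental form never stores the full sequence; it only needs the previous estimate and the count \( n \).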

2018.08.11

Cross-Entropy and KL divergence

Consider that we have an encoding scheme for sending codewords from a set \( C \) whose frequencies are given by the probability distribution \( q: C \to \mathbb{R} \). Read more...

2017.02.28

Cross-Entropy and KL divergence (answer)

Consider that we have an encoding scheme for sending codewords from a set \( C \) whose frequencies are given by the probability distribution \( q: C \to \mathbb{R} \). Read more...

2017.02.28

Entropy

2017.02.27

Entropy (answer)

When using variable-length encoding to encode a set of tokens, using a shorter code for more common tokens will result in shorter messages on average. The minimum average message length given a distribution of tokens is called the entropy of the distribution. Codespace and decodability: When designing variable-length coding schemes, codes must be designed to be uniquely decodable. For example, if 0 and 01 are both codewords, then it is not clear which code begins the string: 0100111 [or is it? Read more...
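To make the "minimum average message length" idea concrete, here is a small Python sketch. The token distribution and the code lengths are made-up values chosen so that the arithmetic is exact; lengths are measured in bits, so the logarithm is base 2.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero-probability tokens."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def average_length(probs, code_lengths):
    """Expected codeword length when token i has probability probs[i]."""
    return sum(p * length for p, length in zip(probs, code_lengths))

# Hypothetical distribution over four tokens, with shorter codes for
# more common tokens (for instance the codewords 0, 10, 110, 111).
probs = [0.5, 0.25, 0.125, 0.125]
code_lengths = [1, 2, 3, 3]

print(entropy(probs))
print(average_length(probs, code_lengths))
```

For this particular distribution the average code length matches the entropy, because every probability is a power of one half; in general the entropy is a lower bound on the average length.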

2017.02.27

Maximum likelihood estimation (MLE)

2017.02.04

Maximum likelihood estimation (MLE) (answer)

Say you have some data. Say you're willing to assume that the data comes from some distribution, perhaps Gaussian. There are infinitely many different Gaussians that the data could have come from: different means, different variances. MLE will pick the Gaussian that is "most consistent" with your data (the precise meaning of consistent is explained below). So say you've got a data set of y = -1, 3, and 7. Read more...
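A minimal Python sketch of the Gaussian case, using the three data points above. It assumes that "most consistent" means maximising the likelihood of the data, and it uses the standard closed-form Gaussian estimates (sample mean, and variance divided by \( n \) rather than \( n-1 \)); the function names are chosen for this example.

```python
import math

def gaussian_log_likelihood(data, mean, variance):
    """Log-likelihood of independent samples under a Gaussian(mean, variance)."""
    n = len(data)
    squared_error = sum((y - mean) ** 2 for y in data)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)

def gaussian_mle(data):
    """Closed-form maximum likelihood estimates of the mean and variance."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((y - mean) ** 2 for y in data) / n  # divides by n, not n - 1
    return mean, variance

data = [-1.0, 3.0, 7.0]
mean, variance = gaussian_mle(data)
print(mean, variance)

# Any other choice of parameters gives a log-likelihood that is no higher.
print(gaussian_log_likelihood(data, mean, variance))
print(gaussian_log_likelihood(data, 2.0, 10.0))
```

Comparing the last two printed values shows the fitted Gaussian scoring at least as well as the arbitrary alternative.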

2017.02.04

Kevin Doran