It is a thesaurus that has been under development since 2005.
A word’s meaning is given by the words that frequently appear close-by.
When a word $w$ appears in a text, its context is the set of words that appear nearby (within a fixed-size window). We use the many contexts of $w$ to build up a representation of $w$.
Example: Among all pets, **???** are the best at catching mice.
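To make the window idea concrete, here is a minimal Python sketch of collecting contexts for this sentence (the window size, tokenization, and lowercasing are assumptions):

```python
from collections import defaultdict

def collect_contexts(tokens, window=2):
    """Map each word to the list of words seen within +/- `window` positions of it."""
    contexts = defaultdict(list)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        contexts[word].extend(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

tokens = "among all pets cats are the best at catching mice".split()
print(collect_contexts(tokens, window=2)["cats"])
# ['all', 'pets', 'are', 'the']
```

Aggregating such contexts over a large corpus is what lets us build a vector representation of "cats".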
One-hot encoding representations:
One-hot simplified example:
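A minimal sketch of what such a one-hot representation looks like over a toy vocabulary (the vocabulary itself is an assumption, since the original example is not reproduced here):

```python
import numpy as np

vocab = ["cat", "dog", "mouse", "pet"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("mouse"))   # [0. 0. 1. 0.]
# One-hot vectors are mutually orthogonal, so they encode no notion of word similarity.
```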
The probability of word \(w_t\) in a given context:
\[C_t = (w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m})\]
\[p(w_{t} = w \mid C_t) = \underset{w \in W}{\text{Softmax}} \left\langle u_w, v^{-t} \right\rangle\]
\(v^{-t} = \frac{1}{2m} \sum\limits_{w \in C_t} v_w\) — the average of the vectors of the words from the context \(C_t\)
\(v_w\) — vectors of the predicting (context) words,
\(u_w\) — vector of the predicted word; in general, \(u_w \neq v_w\)
Maximum log-likelihood criterion, with \(U, V \in \mathbb{R}^{|W| \times d}\):
\(\sum\limits_{t=1}^n \log p(w_t|C_t) \to \max\limits_{U, V}\)
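A NumPy sketch of the CBOW probability defined above; the toy dimensions and the random, untrained matrices \(U\) and \(V\) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, m = 10, 8, 2            # |W|, embedding size d, window half-size m

V = rng.normal(size=(vocab_size, dim))   # v_w: vectors of the predicting (context) words
U = rng.normal(size=(vocab_size, dim))   # u_w: vectors of the predicted word

def cbow_prob(context_ids, target_id):
    """p(w_t = target | C_t) = Softmax_w <u_w, v^{-t}>, with v^{-t} the mean context vector."""
    v_bar = V[context_ids].mean(axis=0)  # v^{-t} = (1/2m) * sum of the 2m context vectors
    scores = U @ v_bar                   # <u_w, v^{-t}> for every word w in the vocabulary
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[target_id]

context = [1, 2, 4, 5]                   # indices of w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
print(cbow_prob(context, target_id=3))
```

Training then maximizes \(\sum_t \log p(w_t \mid C_t)\) over the corpus with respect to \(U\) and \(V\).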
\( p(w_o \mid w_c) = \frac{\exp\left[ v(w_o)\, u^T(w_c) \right]}{\sum\limits_{w \in W} \exp\left[ v(w)\, u^T(w_c) \right]}\)
\(W\) — the set of all dictionary words
\(w_c\) — central word
\(w_o\) — context word
\(u(\cdot)\) and \(v(\cdot)\) — parameter vectors (embeddings), whose scalar (dot) product is taken
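A NumPy sketch of this softmax; the vocabulary size, dimension, and random vectors are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 8
U = rng.normal(size=(vocab_size, dim))   # u(w): central-word embeddings
V = rng.normal(size=(vocab_size, dim))   # v(w): context-word embeddings

def skipgram_prob(center_id, context_id):
    """p(w_o | w_c): softmax over the dot products v(w) . u(w_c), taken over every word w."""
    scores = V @ U[center_id]            # the denominator requires ALL |W| dot products
    scores -= scores.max()               # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

print(skipgram_prob(center_id=2, context_id=7))
```

Where is the computational bottleneck?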
In the denominator, of course, where the sum runs over all the words of the vocabulary!
Mikolov proposed a way around this: hierarchical softmax.
We model the probability more efficiently by building a Huffman tree over the words, and then:
\[ p(w_o \mid w_c) = \prod_{n \,\in\, \mathrm{Path}(w_o)} \sigma\!\left( d_{nw_o} \left\langle v(n), u(w_c) \right\rangle \right) \]
Here, \(v(n)\) is the trainable vector at tree node \(n\); \(d_{nw_o} = 1\) if \(w_o\) lies in the right subtree of \(n\), and \(d_{nw_o} = -1\) otherwise.
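A sketch of how this product could be evaluated; the path, its length, and the vectors are toy assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_prob(path, u_center):
    """p(w_o | w_c) = product over inner nodes n on the path to w_o of
    sigma(d_{n,w_o} * <v(n), u(w_c)>), with d = +1 (right subtree) or -1 (left)."""
    p = 1.0
    for v_node, d in path:               # path: [(v(n), d_{n,w_o}), ...]
        p *= sigmoid(d * (v_node @ u_center))
    return p

rng = np.random.default_rng(1)
dim = 8
u_c = rng.normal(size=dim)                                 # u(w_c), the central word
path = [(rng.normal(size=dim), d) for d in (+1, -1, +1)]   # 3 inner nodes to reach w_o
print(hierarchical_prob(path, u_c))
```

Only the inner nodes on the path to \(w_o\) are touched, so the cost per prediction is proportional to the tree depth, roughly \(\log |W|\), instead of \(|W|\).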
It makes sense to use TF-IDF (term frequency / inverse document frequency) values as the word weights \(\omega_w\).
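One way such weights could be used (an assumption here, since the pooling step is not spelled out above) is a TF-IDF-weighted average of word vectors; a minimal sketch:

```python
import math
from collections import Counter
import numpy as np

def tfidf_weights(doc_tokens, corpus):
    """omega_w = tf(w, doc) * log(N / df(w)), computed from raw counts."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

def weighted_doc_vector(doc_tokens, corpus, embeddings):
    """Weighted average of word vectors, with TF-IDF values as the weights omega_w."""
    weights = tfidf_weights(doc_tokens, corpus)
    num = sum(weights[w] * embeddings[w] for w in weights if w in embeddings)
    den = sum(weights[w] for w in weights if w in embeddings)
    return num / den

# Toy corpus and random embeddings, purely for illustration.
corpus = [["cats", "catch", "mice"], ["dogs", "chase", "cats"], ["mice", "eat", "cheese"]]
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=4) for doc in corpus for w in doc}
print(weighted_doc_vector(corpus[0], corpus, embeddings))
```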
Motivation: Unlike Word2Vec, which is a predictive model, GloVe is a count-based model that leverages global statistical information about word occurrences.
The model captures word relationships using the following formula:
\(X\) — the co-occurrence matrix, where \(X_{ij}\) is the number of times word \(j\) appears in the context of word \(i\)
\[ \log(X_{ij}) = w_i^T \tilde{w}_j + b_i + \tilde{b}_j \]
Fitting this relation makes differences of word vectors encode ratios of co-occurrence probabilities, which capture semantic relationships between words.
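A sketch of the corresponding weighted least-squares loss that GloVe minimizes; the toy counts and random parameters are assumptions, while \(x_{\max} = 100\) and \(\alpha = 0.75\) are the defaults from the GloVe paper:

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    where f caps the influence of very frequent co-occurrences."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        f = min(1.0, (X[i, j] / x_max) ** alpha)      # weighting function f(X_ij)
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += f * diff ** 2
    return loss

rng = np.random.default_rng(0)
V, d = 6, 4                                           # toy vocabulary size and dimension
X = rng.integers(0, 5, size=(V, V)).astype(float)     # toy co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```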
Pretraining base LLMs involves training on massive and diverse text datasets to capture a wide range of linguistic patterns, knowledge, and contextual understanding.
In the context of language models, perplexity is defined as the exponential of the average negative log-likelihood per word on a given test set.
It can be interpreted as the weighted average number of choices a model has when predicting the next item in a sequence. A lower perplexity score indicates a better predictive model because it suggests the model has fewer choices, hence less uncertainty.
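A minimal sketch of the definition; the per-token probabilities below are made up:

```python
import math

def perplexity(token_probs):
    """PPL = exp( -(1/N) * sum_t log p(w_t | context) )."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A more confident model (higher probabilities for the true tokens) has lower perplexity.
print(perplexity([0.5, 0.4, 0.6, 0.3]))   # ~2.30
print(perplexity([0.1, 0.2, 0.1, 0.05]))  # 10.0
```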
What else can you explore?