Ryangineer

Noteworthy Machine Learning Algorithms

Machine Learning ⇒ software able to detect patterns, make decisions, predict outcomes, learn from mistakes & optimize own performance without being explicitly programmed to do so

Supervised Learning

↳ "learning a function that maps to an output based on the example of input-output pairs"

Linear Regression | Predict Real Values
Estimate or predict real values based on continuous variables -> establish relationship between independent variables (matrix of features) & dependent variable (output) by fitting a best line

Homoscedasticity
"Homoskedastic . . . refers to a condition in which the variance of the residual, or error term, [that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable], in a regression model is constant. That is, the error term does not vary much as the value of the predictor variable changes." Investopedia
Multicollinearity
"[R]efers to predictors that are correlated [, that is, highly linearly related,] with other predictors. Multicollinearity occurs when your model includes multiple factors that are correlated not just to your response variable, but also to each other. In other words, it results when you have factors that are a bit redundant." Minitab
No Free Lunch Theorems (NFL)
"[S]tate that any one algorithm that searches for an optimal cost or fitness solution is not universally superior to any other algorithm. . . . 'If an algorithm performs better than random search on some class of problems then in must perform worse than random search on the remaining problems.'” Medium
Parsimonious Model
"Parsimonious models are simple models [with the least assumptions & variables but] with great explanatory predictive power. They explain data with a minimum number of parameters, or predictor variables. The idea behind parsimonious models stems from Occam's razor, or 'the law of briefness' (sometimes called lex parsimoniae in Latin)." Statistics How To
Derivation of Line of Best Fit | Ordinary Least Squares method | Sum of Squares Residual

SS_res = Σ(y - ŷ)² → min
Simple Linear Regression
Combining one variable in an equation to predict a single outcome

y = b₀ + b₁x₁

Multiple Linear Regression
Combining many variables in an equation to predict a single outcome
$y_{i} = β_{0} + β_{1} x_{1}_{i} + β_{2} x_{2}_{i} + β_{3} x_{3}_{i} + ε_{i}$
Polynomial Linear Regression

y = b₀ + b₁x₁ + b₂x₂² + . . . + b_nx_nⁿ
R Squared | Goodness of Fit Parameter
R² = 1 - SS_res/SS_tot
↳ where:
Adjusted R Squared
Adj R² = 1 - (1 - r²) * [(n - 1)/(n - p - 1)]
↳ where:

Support Vector Regression | Classification
Use as a regression method, maintaining all the main features that characterize the algorithm (maximal margin). The Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences.
- Epsilon-Insensitive Tube
  
  $\frac{1}{2} ∥ w ∥ + C$ Σ(ξ_i + ξ_i^*) → min
- The Gaussian RBF Kernel
  
  $K (x⃗, {l⃗}^{i}) = \exp (- \frac{∥ x⃗, {l⃗}^{i} ∥^{2}}{2 σ^{2}})$
Logistic Regression | Classification
Used to estimate discrete values, binary values (0/1, yes/no, true/false) based on given set of independent variables; predicts probability between 0 & 1 as output values.
Logistic regression like its name is logarithmic. Its graph is curvilinear. If the dependent variable is binary, the graph is sigmoid. If not, the graph can be more pronounced, parabolic, etc.
- Linear Regression | Sigmoid Function | Predicting Probability (p̂)
  
  y = b₀ + b₁x₁
  
  $↳p = \frac{1}{1 + e^{-y}}$
  
  $↳ ln (\frac{p}{1 - p}) = b_{0} + b_{1} x$
Decision Tree Regression
Supervised learning algorithm used for classification problems; works for categorical & continuous variables
- Standard Deviation Reduction
Support Vector Machines | Discriminative Classifier
Discriminative classifier formally defined by a separating hyperplane
- Maximum Margin Hyperplane | Support Vectors
Kernel SVM | Nonlinear

Mapping to a higher-dimensional space, applying the support vector algorithm & then projecting back to lower dimensional space resulting in a nonlinear separator

Linearly Separable with Hyperplane in 3D

φ(x₁, x₂) ⇒ (x₁, x₂, z)

The Gaussian or Radial Basis Function (RBF) Kernel

$K (x⃗, {l⃗}^{i}) = \exp (- \frac{|| x⃗, {l⃗}^{i} {||}^{2}}{2 σ^{2}})$

↳ where:

↳ K = function applied to two vectors
↳ x = point in datasets
↳ l = landmark

Sigmoid Kernel

K(X, Y) = tanh(γ ˙ X^TY + r)

Polynomial Kernel

K(X, Y) = tanh(γ ˙ X^TY + r)^d, γ > 0

Naive Bayes Classification
Probabilistic classifier based on Bayes Theorem with an assumption of independence between predictors (aka, features or independent variables)
- Bayes Theorem ⇒ The probability of an event given prior knowledge of related events that occurred earlier
  
  $P (y ∣ x_{1}, \dots, x_{n}) = \frac{P (y) P (x_{1}, \dots, x_{n} ∣ y)}{P (x_{1}, \dots, x_{n})}$
K-Nearest Neighbors
Used for classification & regression; a simple algorithm that stores all available cases & classifies new cases by a "majority vote" of its K-nearest neighbors
- Euclidean Distance
  
  $Between P_{1} & P_{2} = \sqrt{(x_{2} - x_{1}) 2 + (y_{2} - y_{1}) 2}$

Unsupervised Learning

↳ "looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision"

K-Means Clustering

Unsupervised algorithm which solves clustering problems; follows simple/easy way to classify a dataset through a certain number of clusters

Within Cluster Sum of Squares (WCSS)| Quantifiable metric to evaluate how certain number of clusters performs compared to different number of clusters

WCSS = Σ distance(P_i, C₁)² + Σ distance(P_i, C₂)² +Σ distance(P_i, C₃)²

↳ where:

↳ distance = distance between each point inside cluster
↳ C = centroids, respectively

Apriori Association

Analyzes the association of specific preferences in customer transactions (movies watched, items purchased in convenience store - beer & pampers urban myth) to discover relationships and how items are associated with each other

Support

Movie recommendation example:
↳ where M = specific Movie

$Support(M) = \frac{# of user watchlists containing M}{# of user watchlists}$

Confidence

$Confidence (M_{1} \to M_{2}) = \frac{# of user watchlists containing (M_{1} \to M_{2})}{# of user watchlists containing (M_{1})}$

Lift → measuring the relevance of an associated rule & the improvement prediction

$Lift (M_{1} \to M_{2}) = \frac{Confidence (M_{1} \to M_{2})}{Support (M_{2})}$

Reinforcement Learning

↳ "how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward"

Upper Confidence Bound Algorithm | Deterministic Model

Modern application of Multi-Armed Bandit Problem (reference slot machine distributions)

Advertising Model (requires update at every round)

Step 1: Each round n considers two numbers for each ad i:

↳ N_i(n) → # of times the ad i selected up to round n
↳ R_i(n) → Σ of rewards of ad i up to round n

Step 2: From these two numbers we compute:

↳ Average reward of ad i up to round n with:

$r̄ (n) = \frac{R_{i} (n)}{N_{i} (n)}$

↳ Confidence interval [r̄_i(n) - △_i(n), r̄_i(n) + △_i(n)] at round n with:

$△_{i} (n) = \sqrt{\frac{3}{2} * \frac{\log (n)}{N_{i} (n)}}$

Step 3: Select the ad i that has the maximum UCB r̄_i(n) + △_i(n)

Thompson Sampling Algorithm | Probabilistic Model

Constructs distributions of where we think the actual expected value might lie; an auxiliary mechanism to solve the problem

Advertising Model (can accomodate delayed feedback & has better empirical evidence than UCB)

Step 1: Each round n considers two numbers for each ad i:

↳ N_i¹(n) → # of times the ad i rewarded 1 up to round n
↳ N_i⁰(n) → # of times the ad i rewarded 0 up to round n

Step 2: For each ad i, we take a random draw from the distribution below:

θ_i(n) = β(N_i¹(n) + 1, N_i⁰(n + 1))

Step 3: Select the ad i that has highest θ_i(n)

Random Forest Regression

Ensemble decision trees; a collection of decision trees (aka forest) to classify a new object based on attributes; each tree gives a classification & we say the tree "votes" for the class

Dimensionality Reduction

Identifies highly significant variables when you have thousands

Gradient Boosting

Ensemble of machine learning algorithms

Lovely Deep Learning

Artificial Neural Networks

↳ A computing system that consist of a number of simple but highly interconnected elements or nodes, called ‘neurons’, which are organized in layers which process information using dynamic state responses to external inputs, an extremely useful algorithm for finding patterns too complex to be manually extracted

Neuron Definition

A mathematical operation that takes its input, multiplies it by its weight & then passes the sum through an activation function

Neuron Formula

Y₁ = activation(w₁x₁ + w₂x₂ + w₃x₃ + . . . + w_mx_m)

Sigmoid Activation Function

Σw_ix_i $φ (x) = \frac{1}{1 + e^{-z}}$

Threshold Function

Σw_ix_i $φ (x) = {1 if x ≥ 0, 0 if x < 0}$

Rectifier Function

Σw_ix_i $φ (x) = (x, 0)$

Hyperbolic Tangent Function (tanh)

Σw_ix_i $φ (x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$

Convolutional Neural Networks

↳ A class of deep neural networks, most commonly applied to analyzing visual imagery. CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer.

Convolution | Visual Imagery Analysis

A special kind of mathematical linear operation to give a network a degree of translation invariance; eg, a typical image convolution is a form of blurring

Natural Language Processing

↳ Starts with raw text in whatever format available, processes it, extracts relevant features and builds models to accomplish various NLP tasks

NLP Pipeline

Text Processing ⇒ Feature Extraction ⇒ Modeling

Document-Term Matrix
Compute dot product (sum of the products of corresponding elements) to find similarities
a * b = Σ (a₁b₁ + a₂b₂ + a₃b₃ + . . . + a_nb_n)
Cosine Similarity
Divide the product of two vectors by their magnitudes or Euclidean norms
$cos(θ) = \frac{a*b}{||a||*||b||} ↳ where:
Identical vectors → cos(θ) = 1
Orthogonal vectors → cos(θ) = 0
Exact opposite vectors → cos(θ) = -1$
TF-IDF Transform
Term frequency-inverse document frequency

tfidf(t, d, D) = tf(t, d) * idf(t, D)
↳ where:
tf(t, d) = $\frac{count(t, d)}{|d|}$
idf(t, D) = $Log (\frac{|D|}{|{d ∈ D : t ∈ d}|})$

Stemming
Takes the root of a word removing conjugation to simplify & understand gist meaning (reducing final dimension )

Lemmatization

Refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Ryangineer

Machine LearningAlgorithms

My Machine Learning Projects

Noteworthy Machine Learning Algorithms

Supervised Learning

Linear Regression | Predict Real Values

Homoscedasticity

Multicollinearity

No Free Lunch Theorems (NFL)

Parsimonious Model

Derivation of Line of Best Fit | Ordinary Least Squares method | Sum of Squares Residual

SSres = Σ(y - ŷ)2 → min

Simple Linear Regression

y = b0 + b1x1

Multiple Linear Regression

y i = β 0 + β 1 x 1 i + β 2 x 2 i + β 3 x 3 i + ε i

Polynomial Linear Regression

y = b0 + b1x1 + b2x22 + . . . + bnxnn

R Squared | Goodness of Fit Parameter

R2 = 1 - SSres/SStot ↳ where: ↳ SStot = Σ(y - yavg)2

Adjusted R Squared

Adj R2 = 1 - (1 - r2) * [(n - 1)/(n - p - 1)] ↳ where: ↳ p = number of regressors ↳ n = sample size

Support Vector Regression | Classification

Epsilon-Insensitive Tube

1 2 ∥ w ∥ + C Σ(ξi + ξi*) → min

The Gaussian RBF Kernel

K ( x⃗, l⃗ i ) = exp ( - ∥ x⃗, l⃗ i ∥ 2 2 σ 2 )

Logistic Regression | Classification

Linear Regression | Sigmoid Function | Predicting Probability (p̂)

y = b0 + b1x1 ↳p = 1 1 + e -y ↳ ln ( p 1 - p ) = b 0 + b 1 x

Decision Tree Regression

Standard Deviation Reduction

F(T, X) = ΣP(c)S(c)

Support Vector Machines | Discriminative Classifier

Maximum Margin Hyperplane | Support Vectors

{xi, yi} where i = 1 . . . L, yi ∈ {-1, 1}, x ∈ ℝD

Kernel SVM | Nonlinear

Linearly Separable with Hyperplane in 3D

φ(x1, x2) ⇒ (x1, x2, z)

The Gaussian or Radial Basis Function (RBF) Kernel

K ( x⃗, l⃗ i ) = exp ( - || x⃗, l⃗ i || 2 2 σ 2 )

↳ where: ↳ K = function applied to two vectors ↳ x = point in datasets ↳ l = landmark

Sigmoid Kernel

K(X, Y) = tanh(γ ˙ XTY + r)

Polynomial Kernel

K(X, Y) = tanh(γ ˙ XTY + r)d, γ > 0

Naive Bayes Classification

Bayes Theorem ⇒ The probability of an event given prior knowledge of related events that occurred earlier

K-Nearest Neighbors

Euclidean Distance

Unsupervised Learning

K-Means Clustering

Within Cluster Sum of Squares (WCSS)| Quantifiable metric to evaluate how certain number of clusters performs compared to different number of clusters

WCSS = Σ distance(Pi, C1)2 + Σ distance(Pi, C2)2 +Σ distance(Pi, C3)2

↳ where: ↳ distance = distance between each point inside cluster ↳ C = centroids, respectively

Apriori Association

Support

Movie recommendation example: ↳ where M = specific Movie Support(M) = # of user watchlists containing M # of user watchlists

Confidence

Confidence ( M 1 → M 2 ) = # of user watchlists containing ( M 1 → M 2 ) # of user watchlists containing ( M 1 )

Lift → measuring the relevance of an associated rule & the improvement prediction

Lift ( M 1 → M 2 ) = Confidence ( M 1 → M 2 ) Support ( M 2 )

Reinforcement Learning

Upper Confidence Bound Algorithm | Deterministic Model

Advertising Model (requires update at every round)

Step 1: Each round n considers two numbers for each ad i: ↳ Ni(n) → # of times the ad i selected up to round n ↳ Ri(n) → Σ of rewards of ad i up to round n Step 2: From these two numbers we compute: ↳ Average reward of ad i up to round n with:

r̄ ( n ) = R i ( n ) N i ( n )

↳ Confidence interval [r̄i(n) - △i(n), r̄i(n) + △i(n)] at round n with:

△ i ( n ) = 3 2 * log ( n ) N i ( n )

Step 3: Select the ad i that has the maximum UCB r̄i(n) + △i(n)

Thompson Sampling Algorithm | Probabilistic Model

Advertising Model (can accomodate delayed feedback & has better empirical evidence than UCB)

Step 1: Each round n considers two numbers for each ad i: ↳ Ni1(n) → # of times the ad i rewarded 1 up to round n ↳ Ni0(n) → # of times the ad i rewarded 0 up to round n Step 2: For each ad i, we take a random draw from the distribution below: θi(n) = β(Ni1(n) + 1, Ni0(n + 1))

Step 3: Select the ad i that has highest θi(n)

Random Forest Regression

Dimensionality Reduction

Gradient Boosting

My Neural Networks Projects

Lovely Deep Learning

Artificial Neural Networks

Machine Learning
Algorithms

SS_res = Σ(y - ŷ)² → min

y = b₀ + b₁x₁

$y_{i} = β_{0} + β_{1} x_{1}_{i} + β_{2} x_{2}_{i} + β_{3} x_{3}_{i} + ε_{i}$

y = b₀ + b₁x₁ + b₂x₂² + . . . + b_nx_nⁿ

R² = 1 - SS_res/SS_tot
↳ where:

↳ SS_tot = Σ(y - y_avg)²

Adj R² = 1 - (1 - r²) * [(n - 1)/(n - p - 1)]
↳ where:

↳ p = number of regressors
↳ n = sample size

$\frac{1}{2} ∥ w ∥ + C$ Σ(ξ_i + ξ_i^*) → min

$K (x⃗, {l⃗}^{i}) = \exp (- \frac{∥ x⃗, {l⃗}^{i} ∥^{2}}{2 σ^{2}})$

y = b₀ + b₁x₁

$↳p = \frac{1}{1 + e^{-y}}$

$↳ ln (\frac{p}{1 - p}) = b_{0} + b_{1} x$

{x_i, y_i} where i = 1 . . . L, y_i ∈ {-1, 1}, x ∈ ℝ^D