Machine Learning

Last Updated: 2021-11-19

Machine Learning vs Programming

  • Programming: you implement the logic and/or math yourself. You need to know exactly what to do, step by step.
  • Machine Learning: based on observations, experiments, and statistics. You provide the inputs and outputs, and the machine fills in the blank, i.e. the mapping between them.
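
To make the contrast concrete, here is a minimal sketch. The task (deciding whether a number counts as "high"), the function names, the hand-written cutoff of 10.0, and the toy data are all assumptions made up for illustration; the "learning" is a simple exhaustive search over candidate thresholds, not any particular library's algorithm.

```python
from typing import Sequence


def rule_based_is_high(x: float) -> bool:
    """Programming: the author decides the rule (cutoff 10.0) by hand."""
    return x > 10.0


def learn_threshold(inputs: Sequence[float], outputs: Sequence[bool]) -> float:
    """Machine learning: search for the threshold that best reproduces the outputs."""
    best_threshold, best_correct = None, -1
    for t in sorted(inputs):
        # Count how many examples the rule "x >= t" classifies correctly.
        correct = sum((x >= t) == y for x, y in zip(inputs, outputs))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


# Toy inputs and outputs, made up for illustration.
inputs = [2.0, 5.0, 8.0, 12.0, 15.0, 20.0]
outputs = [False, False, False, True, True, True]

threshold = learn_threshold(inputs, outputs)


def learned_is_high(x: float) -> bool:
    """The rule filled in by the machine rather than written by hand."""
    return x >= threshold
```

In the first function the cutoff comes from the programmer; in the second it is derived from the input/output pairs.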



The label is the target, i.e. the ground truth: a tag such as positive/negative, good/bad, or car/bike/human.

$y$


The features are the input variables: the traits that describe the instance or item.

$\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$


Examples are the instances of the data. Each example is either a combination of the feature vector and the label, a "labeled example":

$(\mathbf{x}, y)$

or only the features, an "unlabeled example":

$(\mathbf{x}, ?)$
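
One possible way to represent labeled and unlabeled examples in code is sketched below. The `Example` class, its field names, and the sample values are hypothetical; `None` stands in for the "?" of an unlabeled example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Example:
    features: Sequence[float]      # the feature vector x = {x_1, ..., x_N}
    label: Optional[str] = None    # the label y; None plays the role of "?"

    @property
    def is_labeled(self) -> bool:
        return self.label is not None


# Sample values are made up for illustration.
labeled = Example(features=(1.2, 0.5, 3.0), label="car")
unlabeled = Example(features=(0.7, 1.1, 2.4))
```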

Model and Inference

A model is a "trained" function $f(\mathbf{x})$, which takes a feature vector as input and outputs an inference $y'$.

By comparing the inference $y'$ with the label $y$, we can evaluate the performance of the model.
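
As a rough sketch of that comparison, the snippet below measures accuracy: the fraction of labeled examples where the inference y' equals the label y. The stand-in `model` function is not actually trained, and both it and the toy examples are assumptions for illustration; accuracy is only one of several possible evaluation metrics.

```python
from typing import Callable, Sequence, Tuple

FeatureVector = Sequence[float]
Model = Callable[[FeatureVector], int]


def model(x: FeatureVector) -> int:
    """Stand-in for a trained f(x): infers 1 when the feature sum is positive."""
    return 1 if sum(x) > 0 else 0


def accuracy(f: Model, examples: Sequence[Tuple[FeatureVector, int]]) -> float:
    """Fraction of labeled examples (x, y) where the inference f(x) equals y."""
    hits = sum(f(x) == y for x, y in examples)
    return hits / len(examples)


# Labeled examples (x, y), made up for illustration.
labeled_examples = [
    ((1.0, 2.0), 1),
    ((-3.0, 1.0), 0),
    ((0.5, -0.1), 1),
    ((-1.0, 2.0), 0),  # the stand-in model gets this one wrong
]

score = accuracy(model, labeled_examples)
```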


A tensor is an n-dimensional array that stores data of a single type. For example, [[1, 2], [3, 4]] is a 2x2 tensor of integers; similarly, ["a", "b", "c"] is a tensor of strings. A blob is an entity that stores data of an arbitrary type, for example a tensor.
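
The two tensor properties mentioned above, shape and a single element type, can be checked on plain nested lists. This is only a sketch with hypothetical helper names; real tensor libraries store data very differently, and `shape` here assumes the nesting is regular (it inspects only the first element at each level).

```python
def shape(tensor) -> tuple:
    """Shape of a regularly nested list, read from the first element at each level."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(dims)


def element_types(tensor) -> set:
    """Set of Python types found at the leaves; a tensor should have exactly one."""
    if isinstance(tensor, list):
        return set().union(*(element_types(item) for item in tensor))
    return {type(tensor)}


ints = [[1, 2], [3, 4]]
strings = ["a", "b", "c"]
```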

A Typical End-to-end Machine Learning System

  • Data acquisition / collection: acquire data from third parties, or collect it from mobile apps, browsers, devices, sensors, etc.
  • Data dictionary / feature store
  • Data warehouse
  • Data Pipeline (ETL): move data to an analytics platform (e.g. a Hadoop cluster)
  • Data Prep:
    • Driver Set: define the population of training/testing/validation; append proper meta data for evaluation.
    • Feature Engineering: some variables are generated on the fly and so can be logged; newly created variables need to be simulated offline.
    • Data Sanity Check: check if data is clean and usable.
  • Model Building: training and testing ML models
  • Online Variables: variables computed on the fly, plus pre-generated lookups loaded into a cache (Aerospike, Ehcache)
  • Model Deployment: run models in offline batch mode or deploy to online system for real-time scoring.
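
Two of the Data Prep steps listed above, the sanity check and the driver set, can be sketched as follows. The function names, the field names, the rule "a row is clean if every required field is present", and the sample rows are all assumptions for illustration; real pipelines typically apply many more checks and run on a platform such as the Hadoop cluster mentioned above.

```python
import random
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]


def sanity_check(rows: List[Row], required: Tuple[str, ...]) -> List[Row]:
    """Data Sanity Check: keep only rows where every required field has a value."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def split_driver_set(
    rows: List[Row], test_fraction: float, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Driver Set: reproducibly split the population into training and testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Sample rows, made up for illustration; the last one is missing "amount".
rows: List[Row] = [{"amount": float(i), "age": 30.0} for i in range(10)]
rows.append({"amount": None, "age": 41.0})

clean = sanity_check(rows, required=("amount", "age"))
train, test = split_driver_set(clean, test_fraction=0.2)
```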

Use Cases

  • Risk Management And Anti-Fraud
  • Precise Marketing: customer profiling, segmentation, and acquisition
  • Network Security
  • User Intentions
    • Predict churn
    • Customer value: predict customer value and identify high-value accounts
    • Inactive account reactivation: predict the probability that an inactive account will be reactivated
  • Personalization / Recommendation
  • Image recognition: face recognition, OCR, autonomous vehicles
  • Machine translation
  • and more...

Machine Learning Tools Abstraction Layers

  • one-button-click
  • declarative ML framework: provides common abstractions for many different model architectures. This enables users to change the underlying model with a single line of code
  • workflow/pipeline/components
  • library: TensorFlow
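
The declarative layer above can be illustrated with a small sketch in which a configuration entry, rather than code, selects the model. The registry, the `build` function, the configuration keys, and the two toy "models" are hypothetical and do not correspond to any specific framework's API.

```python
from typing import Callable, Dict, Sequence

Model = Callable[[Sequence[float]], float]


def mean_model(x: Sequence[float]) -> float:
    """Toy model: scores an input by the mean of its features."""
    return sum(x) / len(x)


def max_model(x: Sequence[float]) -> float:
    """Toy model: scores an input by its largest feature."""
    return max(x)


# Common abstraction: every architecture is registered under a name.
MODEL_REGISTRY: Dict[str, Model] = {"mean": mean_model, "max": max_model}


def build(config: Dict[str, str]) -> Model:
    """Look up the model named in the declarative configuration."""
    return MODEL_REGISTRY[config["model"]]


config = {"model": "mean"}  # changing this single line swaps the underlying model
model = build(config)
prediction = model([1.0, 2.0, 3.0])
```

The calling code never names a concrete model, which is what makes the one-line swap possible.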


Some classic machine learning problems:

Mandelbrot: https://www.tensorflow.org/tutorials/non-ml/mandelbrot

MNIST: https://github.com/carlosvilchez/spark-mnist/blob/master/src/main/scala/MnistDriver.scala

Titanic: OneR, Naive Bayes
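
For reference, OneR ("one rule") builds, for each feature, a rule that maps every feature value to its majority class, and then keeps the single feature whose rule makes the fewest training errors. Below is a minimal sketch. The rows are made up for illustration and are not taken from the actual Titanic dataset; the column names are likewise assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, str]


def one_r(rows: List[Row], target: str) -> Tuple[str, Dict[str, str], int]:
    """Return (best feature, value -> predicted class, training errors).

    Expects a non-empty list of rows that all share the same keys.
    """
    best = None
    for feature in (k for k in rows[0] if k != target):
        # Count the target classes seen for each value of this feature.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[feature]][row[target]] += 1
        # The rule predicts the majority class for each feature value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(row[target] != rule[row[feature]] for row in rows)
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best


# Made-up rows in the spirit of the Titanic problem (not real data).
rows = [
    {"sex": "female", "passenger_class": "first", "survived": "yes"},
    {"sex": "female", "passenger_class": "third", "survived": "yes"},
    {"sex": "female", "passenger_class": "third", "survived": "no"},
    {"sex": "male", "passenger_class": "first", "survived": "no"},
    {"sex": "male", "passenger_class": "third", "survived": "no"},
    {"sex": "male", "passenger_class": "first", "survived": "no"},
]

feature, rule, errors = one_r(rows, target="survived")
```

On these toy rows the `sex` rule misclassifies one row while the `passenger_class` rule misclassifies two, so `sex` is selected.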