Machine Learning
Machine Learning vs Programming
- Programming: you implement the logic and/or math yourself, so you need to know exactly what to do, step by step.
 - Machine Learning: the logic is derived from observations, experiments, and statistics. You provide inputs and outputs, and the machine fills in the blank (the rule that maps one to the other).
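
To make the contrast concrete, here is a minimal sketch in Python. The temperature conversion, the function names, and the sample values are illustrative assumptions, not part of the original notes. The first function is ordinary programming (the rule is written by hand); the second estimates the same rule from input/output pairs using simple least squares.

```python
# Programming: the rule is known and written by hand, step by step.
def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32


# Machine learning (toy version): only (input, output) pairs are given;
# the slope and intercept of the rule are estimated by least squares.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


inputs = [-40.0, 0.0, 37.0, 100.0]
# Stand-in for observed outputs; in practice these would come from data.
outputs = [celsius_to_fahrenheit(c) for c in inputs]

slope, intercept = fit_line(inputs, outputs)
print(f"learned rule: f = {slope:.2f} * c + {intercept:.2f}")
```

The fitted slope and intercept recover the hand-written rule only because the toy observations are noise-free; real data would give an approximation.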
 
Terminology
Labels
The target: the true answer attached to an instance, such as positive/negative, good/bad, or car/bike/human.
Features
The input variables, the traits that describe the instance or the item.
Examples
The instances of the data. Each instance is either a combination of the feature vector and the label (a "labeled example"), or only the features (an "unlabeled example").
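
One way to picture the two kinds of example is as a small data structure. This is a sketch only: the `Example` class, its field names, and the feature values (imagined as wheels, doors, weight in kg) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Example:
    features: Sequence[float]    # the feature vector describing the instance
    label: Optional[str] = None  # the label; None means "unlabeled example"

    @property
    def is_labeled(self) -> bool:
        return self.label is not None


# Hypothetical feature values: wheels, doors, weight in kg.
labeled = Example(features=[4.0, 2.0, 1500.0], label="car")
unlabeled = Example(features=[2.0, 0.0, 12.0])

print(labeled.is_labeled, unlabeled.is_labeled)  # True False
```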
Model and Inference
A model is a "trained" function that takes a feature vector as input and outputs an inference (a prediction).
By comparing the inference with the label, we can evaluate the performance of the model.
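
As a rough illustration of comparing inferences with labels, the sketch below uses a hand-written threshold rule as a stand-in for a trained function and measures accuracy on a few made-up labeled examples. The rule, the single "wheels" feature, and the data are all assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], str]


def threshold_model(features: Sequence[float]) -> str:
    """Stand-in for a trained function (hypothetical rule on one feature)."""
    wheels = features[0]
    return "car" if wheels >= 4 else "bike"


def accuracy(model: Model, examples: List[Tuple[Sequence[float], str]]) -> float:
    """Fraction of labeled examples where the inference equals the label."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


# Made-up labeled examples: (feature vector, label).
labeled_examples = [([4.0], "car"), ([2.0], "bike"), ([4.0], "car"), ([3.0], "car")]

print(accuracy(threshold_model, labeled_examples))  # 0.75
```

Accuracy is only one possible measure; which metric fits depends on the problem.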
Tensor
A tensor is an n-dimensional array that stores data of a single type. For example, [[1, 2], [3, 4]] is a 2x2 tensor of integers; similarly, ["a", "b", "c"] is a tensor of strings. A blob is an entity that stores data of an arbitrary type, for example a tensor.
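
The two properties in this definition (a shape and a single element type) can be checked on plain nested lists. The helpers below are a sketch with hypothetical names; real libraries such as NumPy or TensorFlow provide dedicated tensor types instead.

```python
def shape(tensor):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]  # assumes all sub-lists have the same length
    return tuple(dims)


def element_types(tensor):
    """Collect the set of Python types stored at the leaves."""
    if isinstance(tensor, list):
        types = set()
        for item in tensor:
            types |= element_types(item)
        return types
    return {type(tensor)}


print(shape([[1, 2], [3, 4]]))          # (2, 2)
print(shape(["a", "b", "c"]))           # (3,)
print(element_types([[1, 2], [3, 4]]))  # {<class 'int'>}
```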
A Typical End-to-end Machine Learning System
- Data acquisition / collection: acquire data from third parties, or collect it from mobile apps, browsers, devices, sensors, etc.
 - Data dictionary / feature store
 - Data warehouse
 - Data Pipeline (ETL): move data to an analytics platform (e.g. a Hadoop cluster)
 - Data Prep:
     - Driver Set: define the populations for training, testing, and validation; append the metadata needed for evaluation.
     - Feature Engineering: some variables are generated on the fly and can simply be logged; newly created variables need to be simulated offline.
     - Data Sanity Check: check that the data is clean and usable.
 - Model Building: training and testing ML models
 - Online Variables: variables computed on the fly, or pre-generated lookups loaded into a cache (e.g. Aerospike, Ehcache).
 - Model Deployment: run models in offline batch mode, or deploy them to an online system for real-time scoring.
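
The Data Prep steps in the list above can be sketched in a few lines. This is an illustration under assumptions: the function names, the row layout, the "drop rows with missing fields" cleanliness rule, and the 60/20/20 split are all hypothetical choices, not a prescribed pipeline.

```python
import random


def sanity_check(rows, required_fields):
    """Keep only rows where every required field is present and not None."""
    return [r for r in rows if all(r.get(f) is not None for f in required_fields)]


def split_driver_set(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training/validation/testing populations."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return {
        "training": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "testing": shuffled[n_train + n_val:],
    }


# Made-up rows; the last one is deliberately dirty.
rows = [{"id": i, "amount": float(i), "label": i % 2} for i in range(10)]
rows.append({"id": 10, "amount": None, "label": 1})

clean = sanity_check(rows, ["amount", "label"])
splits = split_driver_set(clean)
print({name: len(part) for name, part in splits.items()})
```

In a real system the split is often defined by time or by customer rather than by random shuffling, so that evaluation reflects how the model will be used.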
 
Use Cases
- Risk Management And Anti-Fraud
 - Precision Marketing: customer profiling, segmentation, and acquisition
 - Network Security
 - User Intentions:
     - Churn: predict churn
     - Customer value: predict customer value and identify high-value accounts
     - Inactive account reactivation: predict the probability that an inactive account will be reactivated
 - Personalization / Recommendation
 - Image recognition: face recognition, OCR, autonomous vehicles
 - Machine translation
 - and more...
 
Machine Learning Tools Abstraction Layers
- One-click tools
 - Declarative ML frameworks: provide common abstractions for many different model architectures, so users can change the underlying model with a single line of code.
 - Workflows / pipelines / components
 - Libraries: e.g. TensorFlow
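
The "single line of code" idea behind declarative frameworks can be sketched with a toy registry. Everything here is hypothetical (the registry, the config keys, the two baseline "models"); it does not reflect the API of any real framework.

```python
from statistics import mean, median


# Two trivial "model architectures": each trainer returns a prediction function.
def train_mean_baseline(labels):
    value = mean(labels)
    return lambda features: value


def train_median_baseline(labels):
    value = median(labels)
    return lambda features: value


# Hypothetical registry mapping a declared model name to its trainer.
MODEL_REGISTRY = {
    "mean": train_mean_baseline,
    "median": train_median_baseline,
}


def build(config, labels):
    """Train whichever model the declarative config names."""
    return MODEL_REGISTRY[config["model"]](labels)


config = {"model": "mean"}  # change this one line to "median" to swap models
model = build(config, [1.0, 2.0, 9.0])
print(model(None))  # 4.0
```

The rest of the code (data loading, training call, evaluation) stays the same when the declared model changes, which is the point of the abstraction.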
 
Tutorials
Some classical Machine Learning problems.
Mandelbrot: https://www.tensorflow.org/tutorials/non-ml/mandelbrot
MNIST: https://github.com/carlosvilchez/spark-mnist/blob/master/src/main/scala/MnistDriver.scala
Titanic: OneR, Naive Bayes
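
For readers unfamiliar with OneR ("one rule"), the sketch below shows the idea: for each feature, map every feature value to its majority label, then keep the feature whose rule makes the fewest errors on the training data. The rows are made-up toy data in the spirit of the Titanic problem, not real records, and the function name is an assumption.

```python
from collections import Counter, defaultdict


def one_r(rows, label_key):
    """Return (feature, rule, errors) for the single best feature."""
    best = None
    features = [k for k in rows[0] if k != label_key]
    for feature in features:
        # Count labels per feature value.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[feature]][row[label_key]] += 1
        # Rule: each feature value predicts its majority label.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[feature]] != row[label_key])
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best


# Made-up toy rows (not real passenger records).
rows = [
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 1},
    {"sex": "female", "pclass": 2, "survived": 1},
    {"sex": "male", "pclass": 1, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 2, "survived": 1},
]

feature, rule, errors = one_r(rows, "survived")
print(feature, rule, errors)  # sex {'female': 1, 'male': 0} 1
```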
Playground
- OpenAI Gym: https://gym.openai.com/
 - Kaggle: https://www.kaggle.com/competitions. Probably the best-known data science competition site, open to almost anyone.
 - Data Mining Cup: https://www.data-mining-cup.com/
 - Aliyun: https://tianchi.aliyun.com/promotion/goldenleague.php