Frameworks, Tools and Libs

Updated: 2019-01-13

A typical Data Scientist/ML Engineer's toolbox.

Basic Python Tools

  • Pandas
  • scikit-learn
  • Jupyter

Deep Learning / Neural Network

Distributed Data Processing

  • Hadoop
  • Spark
  • Hive
  • Presto

Graph Mining

PyTorch

  • A replacement for NumPy that can use the power of GPUs
  • Tensors are similar to NumPy’s ndarrays, except that Tensors can also be placed on a GPU to accelerate computation.
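
A minimal sketch of the NumPy-like feel of Tensors, assuming PyTorch is installed. The variable names are only illustrative, and the code falls back to the CPU when no GPU is visible:

```python
import torch

# Use a GPU when one is visible, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.ones(2, 2)

# Move both operands to the chosen device; the arithmetic runs there.
# `@` is matrix multiplication and `+ 1.0` broadcasts, as in NumPy.
c = (a.to(device) @ b.to(device)) + 1.0

# Copy the result back to the CPU before converting to plain Python lists.
result = c.cpu().tolist()
print(result)
```

The same code runs unchanged on either device; only the `device` value differs.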

Caffe2

A tensor is an n-dimensional array that stores data of a single type; for example, [[1, 2], [3, 4]] is a 2x2 tensor of integers (similarly, ["a", "b", "c"] is a tensor of strings). A blob is an entity that stores data of an arbitrary type, for example a tensor.
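
The two definitions can be illustrated in plain Python. This is not the Caffe2 API; `tensor_shape`, `is_tensor`, and the `workspace` dict are hypothetical names used only to make the idea concrete:

```python
def tensor_shape(data):
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    if not isinstance(data, list):
        return ()  # a scalar has an empty shape
    if not data:
        return (0,)
    inner_shapes = {tensor_shape(item) for item in data}
    if len(inner_shapes) != 1:
        raise ValueError("ragged nesting: rows differ in shape")
    return (len(data),) + inner_shapes.pop()


def element_types(data):
    """Collect the set of leaf element types in a nested list."""
    if isinstance(data, list):
        found = set()
        for item in data:
            found |= element_types(item)
        return found
    return {type(data)}


def is_tensor(data):
    """A tensor here means: rectangular nesting and a single element type."""
    try:
        tensor_shape(data)
    except ValueError:
        return False
    return len(element_types(data)) <= 1


# Hypothetical blob store: each name maps to a value of any type,
# and a tensor is just one kind of value a blob can hold.
workspace = {
    "X": [[1, 2], [3, 4]],          # 2x2 tensor of integers
    "labels": ["a", "b", "c"],      # tensor of strings
    "note": {"any": "object"},      # a blob that is not a tensor
}

for name, blob in workspace.items():
    print(name, is_tensor(blob) if isinstance(blob, list) else "not a tensor")
```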

Model Deployment

Models may be developed in one language (e.g. Python, R) while your production environment uses another (e.g. Java). One way to bridge the gap is to encode the models in a language- and tool-neutral format:

  • ONNX: A collaboration between Facebook and Microsoft. Supports Caffe2, PyTorch, and Cognitive Toolkit.
  • PMML (Predictive Model Markup Language): an XML-based format.
  • PFA (Portable Format for Analytics): written as YAML or JSON.
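
The idea behind a neutral encoding can be sketched with nothing but JSON. The document layout below is made up for illustration (it is not PMML, PFA, or ONNX), and `export_model` and `score` are hypothetical helpers. The point is that the scoring side needs only the document, not the training code, so it could just as well be written in Java:

```python
import json
import math


def export_model(weights, bias):
    """Training side: write the fitted parameters to a neutral JSON document."""
    return json.dumps({
        "model_type": "logistic_regression",
        "weights": list(weights),
        "bias": bias,
    })


def score(model_json, features):
    """Serving side: rebuild the prediction from the JSON document alone."""
    spec = json.loads(model_json)
    if spec["model_type"] != "logistic_regression":
        raise ValueError("unsupported model type: " + spec["model_type"])
    z = spec["bias"] + sum(w * x for w, x in zip(spec["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


document = export_model([0.5, -0.25], 0.0)
probability = score(document, [2.0, 4.0])
print(probability)
```

Real formats add what this sketch leaves out: a schema for inputs and outputs, preprocessing steps, and versioning.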

ONNX

Open Neural Network Exchange, backed by AWS, Microsoft, and Facebook.

ONNX is intended to be a standardized format that lets deep learning models trained in one framework be transferred to another framework with minimal extra work.