Pandas - Index

Last Updated: 2024-02-13

What's new in pandas 2.0

  • Arrow / PyArrow: Faster and More Memory-efficient Operations
    • Pandas has historically stored its data in NumPy arrays. In 2.0 you can opt in to PyArrow as the backing memory format instead.
    • PyArrow is a Python library (built on top of Arrow)
    • Arrow: written in C++; an open-source and language-agnostic columnar data format to represent data in memory. It can enable zero-copy sharing of data between processes.
    • Polars (which also uses the Arrow memory format) is a Rust-based data manipulation library for Python. It provides a DataFrame API similar to pandas, with a focus on performance and scalability for large datasets.
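  • Copy-on-Write (opt-in performance enhancement): an object derived from another DataFrame or Series behaves like a copy, but the actual copy is deferred until one of the two is modified. This deferral is loosely comparable to how Spark performs lazy operations.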
  • Index can now hold any NumPy numeric dtype, such as int8, int16, int32, uint8, uint16, uint32, and float32, whereas previously only int64, uint64, and float64 were supported.

How to Install / Upgrade Pandas

$ pip install -U pandas


Get Index
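
Row labels are exposed through the index attribute. A minimal sketch, assuming a small made-up DataFrame with string labels:

```python
import pandas as pd

# Hypothetical data, used only to illustrate the attribute.
df_idx = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

print(df_idx.index)            # the Index object itself
print(df_idx.index.tolist())   # ['x', 'y']
```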


Get Columns
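
Column labels are exposed through the columns attribute, which is also an Index. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data, used only to illustrate the attribute.
df_cols = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(df_cols.columns)            # the Index of column labels
print(df_cols.columns.tolist())   # ['a', 'b']
```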


Read As Pandas DataFrame


import pandas as pd

df = pd.read_csv("train.csv")

Then convert the DataFrame to a NumPy array (df.to_numpy() is the currently recommended equivalent of .values):

data = pd.read_csv("train.csv").values

Skip the first column and convert the remaining data to float:

X = df.values[:, 1:].astype(float)

Extract the first column as Y:

Y = df.values[:, 0]
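
The read-then-split steps above can be run end to end without a file on disk. This is a sketch under assumptions: the CSV text and its column names are made up as a stand-in for train.csv, and it uses iloc with to_numpy() in place of .values.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: first column is the label.
csv_text = "label,f1,f2\n1,0.5,2\n0,1.5,3\n"
df_train = pd.read_csv(io.StringIO(csv_text))

# Features: every column except the first, as a float array.
X = df_train.iloc[:, 1:].to_numpy(dtype=float)

# Labels: the first column only.
Y = df_train.iloc[:, 0].to_numpy()

print(X.shape, Y.shape)  # (2, 2) (2,)
```

Selecting with iloc before converting avoids building one mixed-type array for the whole frame when the label and feature columns have different dtypes.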

Other methods:

  • pd.read_csv
  • pd.read_excel
  • pd.read_hdf
  • pd.read_sql
  • pd.read_json
  • pd.read_msgpack (removed in pandas 1.0)
  • pd.read_html
  • pd.read_gbq (deprecated since pandas 2.2; use pandas_gbq.read_gbq instead)
  • pd.read_stata
  • pd.read_sas
  • pd.read_clipboard
  • pd.read_pickle
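
The readers in this list follow the same pattern as pd.read_csv: pass a path or a file-like object and get a DataFrame back. A minimal sketch with pd.read_json, assuming made-up JSON text wrapped in a StringIO buffer:

```python
import io
import pandas as pd

# Hypothetical JSON payload: a list of records.
json_text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_json.columns.tolist())  # ['id', 'name']
print(len(df_json))              # 2
```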


Write From Pandas DataFrame

Write to CSV
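
A minimal sketch of writing a DataFrame with df.to_csv and reading it back. The data, the out.csv file name, and the temporary directory are assumptions for illustration; index=False keeps the row labels out of the file.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data to write out.
df_out = pd.DataFrame({"label": [0, 1], "x1": [0.5, 1.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.csv")
    df_out.to_csv(path, index=False)   # omit the index column
    restored = pd.read_csv(path)       # read it back to check the round trip

print(restored.equals(df_out))
```

Without index=False the index is written as an extra unnamed first column, which then shows up as "Unnamed: 0" when the file is read back.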


Other methods:

  • df.to_csv
  • df.to_excel
  • df.to_hdf
  • df.to_sql
  • df.to_json
  • df.to_msgpack (removed in pandas 1.0)
  • df.to_html
  • df.to_gbq (deprecated since pandas 2.2; use pandas_gbq.to_gbq instead)
  • df.to_stata
  • df.to_clipboard
  • df.to_pickle

Write as JSON

This is similar to the problem of dumping NumPy arrays as JSON: json.dumps cannot serialize a Series or an ndarray directly.

>>> import json
>>> import pandas as pd
>>> json.dumps(pd.Series([1,2,3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 0    1
1    2
2    3
dtype: int64 is not JSON serializable
>>> json.dumps(pd.Series([1,2,3]).values)
Traceback (most recent call last):
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: array([1, 2, 3]) is not JSON serializable

Converting to a list first solves the problem:

>>> json.dumps(pd.Series([1,2,3]).values.tolist())
'[1, 2, 3]'
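
Two other ways around the same error, shown as a sketch rather than a recommended pattern: let pandas serialize with to_json, or pass json.dumps a default= hook. The to_serializable helper below is hypothetical, not part of pandas or the json module.

```python
import json
import pandas as pd

s = pd.Series([1, 2, 3])

# Option 1: let pandas produce the JSON (an object keyed by index label).
as_json = s.to_json()

# Option 2: keep json.dumps and convert pandas/NumPy objects on demand.
def to_serializable(obj):
    """Hypothetical helper: turn anything with .tolist() into plain Python."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")

payload = json.dumps({"values": s, "raw": s.to_numpy()}, default=to_serializable)

print(as_json)
print(payload)
```

The default= hook is only called for objects the encoder cannot handle itself, so plain dicts, lists, and numbers in the same payload are unaffected.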