Intro

Getting started, User-guide, API refs.

Install:

pip install pandas

or

conda install -c conda-forge pandas

Import:

import pandas as pd

Reading and Saving Files

See official guide

Reading

Usepd.read_*(), which returns a DataFrame:

  • csv

    df = pd.read_csv("xxx.csv")
    
  • xslx

    df = pd.read_xslx("xxx.xslx")
    
  • parquet:

    df = pd.read_parquet("xxx.parquet")
    

Saving

Use df.to_*(), where df is a DataFrame:

Efficiency of files
parquet > csv > xslx by efficiency
  • parquet:

    df.to_parquet("xxxx.parquet")
    
  • excel:

    df.to_excel("xxxx.xlsx", sheet_name="sheet1", index=False)
    

Inspecting

See official guide

  • check rows

    df
    
  • check types

    df.dtypes
    
  • check shape

    df.shape
    
  • brief technical summary

    df.info()
    
  • brief statistical summary

    df.describe()
    
  • other statistical summary

    df[["Age", "Fare"]].median()
    df[["Age", "Fare"]].sum()
    ...
    
  • first/last n rows:

    df.head()
    df.head(8)
    df.tail()
    df.tail(8)
    

Selecting

See official guide

Columns

  • select one column

    df_col = df["col name"]
    
  • select columns

    df_cols = df[["col name1", "col name2"]]
    

Rows (filtering)

  • filter

    df_above_35 = df[df["Age"] > 35]
    

    where df["Age"] > 35 returns an array of bools

  • filter within a range

    df_class_23 = df[df["Pclass"].isin([2, 3])]
    
  • combine conditions

    df_class_23 = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]
    
Note when combining conditions
  • each condition must be surrounded by parentheses ()
  • use | and & instead of and or or
  • non Null values

    df_age_no_na = df[df["Age"].notna()]
    

Indexing

  • nth row

    df.iloc[3]
    
  • rows starts from nth

    df.iloc[3:6]
    

Rows and columns

  • filter

    df_adult_names = df.loc[df["Age"] > 35, ["Name", "Sex"]]
    

    It shows rows that Age > 35 and column Name and Sex

  • indexing

    df.iloc[9:25, 2:5]
    

    rows 10 till 25 and columns 3 to 5.

Ploting

See official documents

Creating and Merging

See Official guide and this

  • create new column based on another column(s):

    df["london_mg_per_cubic"] = df["station_london"] * 1.882
    
  • concatenate tables

    df = pd.concat([df1, df2, df3])
    

Reshaing and Renaming

See Official guide

  • sort/reorder

    df = df.sort_values("count")
    
  • reset index

    df = df.reset_index()
    

    original index will be added as a new column index

  • drop column(s)

    df = df.drop(by=["col name1", "col name2"], axis=1) 
    
  • rename

    df = df.rename(
        columns={
            "index": "Index",
            "start time": "start date",
        }
    )
    

Grouping

  • group by colmun(s):

    groups = groups.groupby(["end date", "user id"])
    groups_size = groups.size()
    sums = groups.sums()