Pandas

December 29, 2023 327 words 2 min

Table of Contents

Intro
Reading and Saving Files
- Reading
- Saving
Inspecting
Selecting
Ploting
Creating and Merging
Reshaing and Renaming
Grouping

Intro

Getting started, User-guide, API refs.

Install:

pip install pandas

or

conda install -c conda-forge pandas

Import:

import pandas as pd

Reading and Saving Files

See official guide

Reading

Usepd.read_*(), which returns a DataFrame:

csv
```
df = pd.read_csv("xxx.csv")
```
xslx
```
df = pd.read_xslx("xxx.xslx")
```
parquet:
```
df = pd.read_parquet("xxx.parquet")
```

Saving

Use df.to_*(), where df is a DataFrame:

Efficiency of files

parquet > csv > xslx by efficiency

parquet:
```
df.to_parquet("xxxx.parquet")
```

excel:

df.to_excel("xxxx.xlsx", sheet_name="sheet1", index=False)

Inspecting

See official guide

check rows
```
df
```
check types
```
df.dtypes
```
check shape
```
df.shape
```
brief technical summary
```
df.info()
```
brief statistical summary
```
df.describe()
```

other statistical summary

df[["Age", "Fare"]].median()
df[["Age", "Fare"]].sum()
...

first/last n rows:

df.head()
df.head(8)
df.tail()
df.tail(8)

Selecting

See official guide

Columns

select one column
```
df_col = df["col name"]
```

select columns

df_cols = df[["col name1", "col name2"]]

Rows (filtering)

filter
```
df_above_35 = df[df["Age"] > 35]
```
where df["Age"] > 35 returns an array of bools

filter within a range

df_class_23 = df[df["Pclass"].isin([2, 3])]

combine conditions

df_class_23 = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]

Note when combining conditions

each condition must be surrounded by parentheses ()
use | and & instead of and or or

non Null values
```
df_age_no_na = df[df["Age"].notna()]
```

Indexing

nth row
```
df.iloc[3]
```
rows starts from nth
```
df.iloc[3:6]
```

Rows and columns

filter

df_adult_names = df.loc[df["Age"] > 35, ["Name", "Sex"]]

It shows rows that Age > 35 and column Name and Sex

indexing
```
df.iloc[9:25, 2:5]
```
rows 10 till 25 and columns 3 to 5.

Ploting

See official documents

Creating and Merging

See Official guide and this

create new column based on another column(s):

df["london_mg_per_cubic"] = df["station_london"] * 1.882

concatenate tables
```
df = pd.concat([df1, df2, df3])
```

Reshaing and Renaming

See Official guide

sort/reorder
```
df = df.sort_values("count")
```
reset index
```
df = df.reset_index()
```
original index will be added as a new column index

drop column(s)

df = df.drop(by=["col name1", "col name2"], axis=1)

rename

df = df.rename(
    columns={
        "index": "Index",
        "start time": "start date",
    }
)

Grouping

group by colmun(s):

groups = groups.groupby(["end date", "user id"])
groups_size = groups.size()
sums = groups.sums()