Intro
Getting started, User Guide, API reference.
Install:
pip install pandas
or
conda install -c conda-forge pandas
Import:
import pandas as pd
Reading and Saving Files
See official guide
Reading
Use pd.read_*(), which returns a DataFrame:
csv:
df = pd.read_csv("xxx.csv")
xlsx:
df = pd.read_excel("xxx.xlsx")
parquet:
df = pd.read_parquet("xxx.parquet")
Saving
Use df.to_*(), where df is a DataFrame:
By efficiency: parquet > csv > xlsx.
parquet:
df.to_parquet("xxxx.parquet")
excel:
df.to_excel("xxxx.xlsx", sheet_name="sheet1", index=False)
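csv saving works the same way; a minimal sketch (the filename is just a placeholder):
# write to csv, dropping the index column
df.to_csv("xxxx.csv", index=False)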
Inspecting
See official guide
check rows
df
check types
df.dtypes
check shape
df.shape
brief technical summary
df.info()
brief statistical summary
df.describe()
other statistical summary
df[["Age", "Fare"]].median() df[["Age", "Fare"]].sum() ...
first/last n rows:
df.head()
df.head(8)
df.tail()
df.tail(8)
Selecting
See official guide
Columns
select one column
df_col = df["col name"]
select columns
df_cols = df[["col name1", "col name2"]]
Rows (filtering)
filter
df_above_35 = df[df["Age"] > 35]
where df["Age"] > 35 returns a Series of booleans
filter within a range
df_class_23 = df[df["Pclass"].isin([2, 3])]
combine conditions
df_class_23 = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]
- each condition must be surrounded by parentheses ()
- use | and & instead of the or and and keywords (see the example below)
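For example, a hedged sketch combining two conditions with &, assuming the Sex column holds "female" values:
# both conditions in parentheses, joined with & (the column values are an assumption)
df_adult_women = df[(df["Age"] > 35) & (df["Sex"] == "female")]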
non-null values
df_age_no_na = df[df["Age"].notna()]
Indexing
nth row
df.iloc[3]
slice of rows starting from the nth (end position exclusive)
df.iloc[3:6]
Rows and columns
filter
df_adult_names = df.loc[df["Age"] > 35, ["Name", "Sex"]]
It selects the rows where Age is greater than 35 and the columns Name and Sex.
indexing
df.iloc[9:25, 2:5]
rows 10 through 25 and columns 3 through 5.
Plotting
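Pandas plots through matplotlib; a minimal sketch, assuming matplotlib is installed and reusing the Age column from the examples above:
import matplotlib.pyplot as plt

# line plot of every numeric column against the index
df.plot()

# histogram of a single column ("Age" is an assumed column)
df["Age"].plot.hist(bins=20)

plt.show()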
Creating and Merging
See official guide
create a new column based on other column(s):
df["london_mg_per_cubic"] = df["station_london"] * 1.882
concatenate tables
df = pd.concat([df1, df2, df3])
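merge tables on a shared key (a hedged sketch; the "user id" key column is an assumption about both tables):
# SQL-style left join on a common column
df = pd.merge(df1, df2, on="user id", how="left")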
Reshaping and Renaming
See official guide
sort/reorder
df = df.sort_values("count")
reset index
df = df.reset_index()
the original index will be added as a new column named index
drop column(s)
df = df.drop(columns=["col name1", "col name2"])
rename
df = df.rename(
    columns={
        "index": "Index",
        "start time": "start date",
    }
)
Grouping
group by column(s):
groups = df.groupby(["end date", "user id"])
groups_size = groups.size()
sums = groups.sum()
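To aggregate a single column per group, select it after grouping; a short sketch reusing the Titanic-style columns from the selection examples (an assumption about df):
# mean Age per Sex group
mean_age_by_sex = df.groupby("Sex")["Age"].mean()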