Intro
See the official docs: Getting started, User guide, API reference.
Install:
pip install pandas
or
conda install -c conda-forge pandas
Import:
import pandas as pd
Reading and Saving Files
See official guide
Reading
Use pd.read_*(), which returns a DataFrame:
csv:
df = pd.read_csv("xxx.csv")
excel:
df = pd.read_excel("xxx.xlsx")
parquet:
df = pd.read_parquet("xxx.parquet")
Saving
Use df.to_*(), where df is a DataFrame:
parquet > csv > xlsx by efficiency
parquet:
df.to_parquet("xxxx.parquet")
excel:
df.to_excel("xxxx.xlsx", sheet_name="sheet1", index=False)
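A minimal round-trip sketch of the read/save calls above (the data and the file name df_demo.csv are made up for illustration):

```python
import pandas as pd

# Hypothetical example data; column names are made up for the demo.
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [24, 31]})

# Save without the index so it is not re-read as an extra column.
df.to_csv("df_demo.csv", index=False)

# Read it back; the round trip preserves the data.
df2 = pd.read_csv("df_demo.csv")
print(df2.equals(df))
```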
Inspecting
See official guide
check rows:
df
check types:
df.dtypes
check shape:
df.shape
brief technical summary:
df.info()
brief statistical summary:
df.describe()
other statistical summaries:
df[["Age", "Fare"]].median() df[["Age", "Fare"]].sum() ...
first/last n rows:
df.head() df.head(8) df.tail() df.tail(8)
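The inspection calls above, run on a toy DataFrame (the Age/Fare values are invented for the example):

```python
import pandas as pd

# Toy data with the same column names as the notes.
df = pd.DataFrame({"Age": [22, 38, 26], "Fare": [7.25, 71.28, 7.92]})

print(df.shape)    # (rows, columns) → (3, 2)
print(df.dtypes)   # Age is int64, Fare is float64
print(df.head(2))  # first 2 rows
print(df[["Age", "Fare"]].median())  # per-column medians
```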
Selecting
See official guide
Columns
select one column
df_col = df["col name"]
select columns:
df_cols = df[["col name1", "col name2"]]
Rows (filtering)
filter:
df_above_35 = df[df["Age"] > 35]
where df["Age"] > 35 returns an array of bools
filter within a range:
df_class_23 = df[df["Pclass"].isin([2, 3])]
combine conditions:
df_class_23 = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]
- each condition must be surrounded by parentheses ()
- use | and & instead of or and and
non-null values:
df_age_no_na = df[df["Age"].notna()]
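These filters can be checked on a small made-up frame (the Age/Pclass values are invented; the column names mirror the ones in the notes):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 38, 26, None],
    "Pclass": [3, 1, 2, 3],
})

df_above_35 = df[df["Age"] > 35]                      # rows where Age > 35
df_class_23 = df[df["Pclass"].isin([2, 3])]           # Pclass is 2 or 3
same = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]  # equivalent spelling
df_age_no_na = df[df["Age"].notna()]                  # drop missing ages

print(len(df_above_35))          # one row matches
print(df_class_23.equals(same))  # both filters agree
print(len(df_age_no_na))         # three non-null ages
```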
Indexing
nth row (row positions start from 0):
df.iloc[3]
slice of rows:
df.iloc[3:6]
Rows and columns
filter
df_adult_names = df.loc[df["Age"] > 35, ["Name", "Sex"]]
It shows rows where Age > 35 and the columns Name and Sex.
indexing:
df.iloc[9:25, 2:5]
rows 10 till 25 and columns 3 to 5.
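A small sketch of loc vs iloc on toy data (Name/Sex/Age values are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cleo"],
    "Sex": ["f", "m", "f"],
    "Age": [24, 41, 37],
})

# loc is label-based: a boolean row filter plus a list of column names.
adult_names = df.loc[df["Age"] > 35, ["Name", "Sex"]]
print(adult_names)  # Bob and Cleo, columns Name and Sex only

# iloc is position-based: row positions 0-1, column positions 0-1 (end exclusive).
sub = df.iloc[0:2, 0:2]
print(sub.shape)
```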
Plotting
Creating and Merging
See Official guide and this
create new column based on another column(s):
df["london_mg_per_cubic"] = df["station_london"] * 1.882
concatenate tables:
df = pd.concat([df1, df2, df3])
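The two operations above, sketched on made-up data (the column name and the 1.882 factor come from the note; the values are invented):

```python
import pandas as pd

df1 = pd.DataFrame({"station_london": [1.0, 2.0]})
df2 = pd.DataFrame({"station_london": [3.0]})

# Derive a new column from an existing one.
df1["london_mg_per_cubic"] = df1["station_london"] * 1.882

# Stack tables on top of each other; ignore_index renumbers the rows.
df = pd.concat([df1, df2], ignore_index=True)
print(len(df))  # rows from both tables
```

Columns missing from one input (here london_mg_per_cubic in df2) are filled with NaN after the concat.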
Reshaping and Renaming
See Official guide
sort/reorder
df = df.sort_values("count")
reset index:
df = df.reset_index()
the original index will be added as a new column "index"
drop column(s):
df = df.drop(columns=["col name1", "col name2"])
rename:
df = df.rename(columns={"index": "Index", "start time": "start date"})
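The sort/reset/drop/rename chain end to end on a toy frame (column names are made up; reset_index(drop=True) is used here to discard the old index rather than keep it as a column):

```python
import pandas as pd

df = pd.DataFrame({"count": [3, 1, 2], "extra": ["a", "b", "c"]})

df = df.sort_values("count")          # reorder rows by a column
df = df.reset_index(drop=True)        # renumber rows, discard old index
df = df.drop(columns=["extra"])       # remove unwanted column(s)
df = df.rename(columns={"count": "Count"})
print(list(df["Count"]))  # values now in ascending order
```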
Grouping
group by column(s):
groups = df.groupby(["end date", "user id"])
groups_size = groups.size()
sums = groups.sum()
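A runnable sketch of the grouping calls (toy data; the "user id" column mirrors the note, "amount" is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "user id": [1, 1, 2],
    "amount": [10, 20, 5],
})

groups = df.groupby("user id")
print(groups.size())           # rows per group
print(groups["amount"].sum())  # per-group totals
```

Selecting the numeric column before .sum() avoids trying to sum non-numeric columns.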