In data analysis, selecting and filtering data are the most basic and commonly used operations. pandas’ DataFrame provides a flexible way to achieve this. This article will help you quickly master the core techniques of DataFrame data selection and filtering in 3 simple steps, suitable for beginners who have never touched pandas before.
Step 1: Select Column Data (Most Basic Column Operation)
Column selection refers to extracting one or more columns of data from a DataFrame. In pandas, column selection is achieved through column names, which is very intuitive.
1.1 Select Single Column Data
To extract a single column, index the DataFrame with the column name in square brackets: df['column_name']. The result is a Series (a one-dimensional labeled array).
import pandas as pd
# First, create a sample DataFrame (with name, age, city, sales)
data = {
    '姓名': ['小明', '小红', '小刚', '小丽'],
    '年龄': [20, 22, 21, 23],
    '城市': ['北京', '上海', '广州', '深圳'],
    '销售额': [1000, 1500, 800, 2000]
}
df = pd.DataFrame(data)
# Select a single column (age column)
age_column = df['年龄']
print("Single column selection (age column):")
print(age_column)
Output:
Single column selection (age column):
0 20
1 22
2 21
3 23
Name: 年龄, dtype: int64
1.2 Select Multiple Columns
To select multiple columns at the same time, just put the column names in a list (double brackets). The returned result is a DataFrame (similar to a two-dimensional array).
# Select multiple columns (age and city columns)
age_city_df = df[['年龄', '城市']]
print("\nMultiple column selection (age and city):")
print(age_city_df)
Output:
Multiple column selection (age and city):
年龄 城市
0 20 北京
1 22 上海
2 21 广州
3 23 深圳
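The difference between single and double brackets can be checked directly. The following is a minimal sketch that recreates the sample DataFrame so it runs on its own; the variable names single and one_col_df are illustrative only.

```python
import pandas as pd

# Recreate the sample DataFrame from above so this block is self-contained
df = pd.DataFrame({
    '姓名': ['小明', '小红', '小刚', '小丽'],
    '年龄': [20, 22, 21, 23],
    '城市': ['北京', '上海', '广州', '深圳'],
    '销售额': [1000, 1500, 800, 2000],
})

single = df['年龄']        # single brackets: a Series
one_col_df = df[['年龄']]  # double brackets: a DataFrame, even with one column

print(type(single).__name__)      # Series
print(type(one_col_df).__name__)  # DataFrame
print(single.shape)               # (4,)
print(one_col_df.shape)           # (4, 1)
```

Double brackets are worth remembering when a later step expects a two-dimensional object, since a one-column DataFrame and a Series behave differently in some operations.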
Step 2: Select Row Data (By Position or Label)
Row selection refers to extracting certain rows of data from a DataFrame. pandas provides two common indexers for this: iloc, which selects by integer position, and loc, which selects by label.
2.1 iloc: Select by Position (Integer Index)
iloc stands for “integer location” and selects rows by their integer position (0, 1, 2…), regardless of what the row labels are. The syntax is df.iloc[row_selector].
df.iloc[0:2]: Select the first 2 rows (positions 0 and 1; the interval is left-closed, right-open)
df.iloc[1]: Select the 2nd row (position 1); the result is a Series
df.iloc[[0, 2]]: Select the rows at positions 0 and 2
# Select the first 2 rows (positions 0 and 1)
first_two_rows = df.iloc[0:2]
print("iloc selection for first two rows:")
print(first_two_rows)
Output:
iloc selection for first two rows:
姓名 年龄 城市 销售额
0 小明 20 北京 1000
1 小红 22 上海 1500
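The other two iloc forms listed above can be tried the same way. This is a small sketch using the same sample data (recreated so the block runs on its own); the names row and rows are illustrative.

```python
import pandas as pd

# Recreate the sample DataFrame so this block is self-contained
df = pd.DataFrame({
    '姓名': ['小明', '小红', '小刚', '小丽'],
    '年龄': [20, 22, 21, 23],
    '城市': ['北京', '上海', '广州', '深圳'],
    '销售额': [1000, 1500, 800, 2000],
})

# A single integer returns that row as a Series (column names become the index)
row = df.iloc[1]
print(row['姓名'])  # 小红

# A list of integers returns a DataFrame containing those rows
rows = df.iloc[[0, 2]]
print(rows['姓名'].tolist())  # ['小明', '小刚']
```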
2.2 loc: Select by Label (Custom Index)
loc is label-based and is suitable for selecting rows by custom row labels (such as names or dates). With the default integer index (0, 1, 2…), a single label or a list of labels gives the same result as iloc, but slices differ: df.loc[0:2] includes the end label and returns 3 rows, while df.iloc[0:2] returns only 2.
# Select rows with labels 0 and 2 (default integer labels)
selected_rows = df.loc[[0, 2]]
print("\nloc selection for rows with labels 0 and 2:")
print(selected_rows)
Output:
loc selection for rows with labels 0 and 2:
姓名 年龄 城市 销售额
0 小明 20 北京 1000
2 小刚 21 广州 800
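loc becomes more useful once the rows carry meaningful labels. The sketch below assumes the name column is promoted to the index with set_index (a step not shown in the original example) and also illustrates that label slices include their end point, unlike position slices.

```python
import pandas as pd

# Recreate the sample DataFrame so this block is self-contained
df = pd.DataFrame({
    '姓名': ['小明', '小红', '小刚', '小丽'],
    '年龄': [20, 22, 21, 23],
    '城市': ['北京', '上海', '广州', '深圳'],
    '销售额': [1000, 1500, 800, 2000],
})

# Assumption for illustration: use the name column as the row labels
by_name = df.set_index('姓名')

# Select one row by its label; the result is a Series
xiaohong = by_name.loc['小红']
print(xiaohong['年龄'])  # 22

# Label slices include BOTH end points
label_slice = by_name.loc['小明':'小刚']
print(len(label_slice))  # 3

# Position slices exclude the end position
pos_slice = by_name.iloc[0:2]
print(len(pos_slice))  # 2
```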
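Step 3: Conditional Filtering (Select Rows by Condition)
Conditional filtering refers to selecting rows that meet certain conditions (e.g., “age > 21”, “sales > 1000”) based on a column’s values. The underlying mechanism is boolean indexing: a comparison on a column produces a Series of True/False values, and df[...] keeps only the rows where the value is True.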
3.1 Single Condition Filtering
Directly use df[condition], where the condition is a comparison expression for a column (e.g., df['年龄'] > 21).
# Filter rows where age is over 21
filtered_by_age = df[df['年龄'] > 21]
print("Rows where age > 21:")
print(filtered_by_age)
Output:
Rows where age > 21:
姓名 年龄 城市 销售额
1 小红 22 上海 1500
3 小丽 23 深圳 2000
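To see what the condition itself evaluates to, it can help to store it in a variable first. This is a minimal sketch on the same sample data; the name mask is just a common convention, not something pandas requires.

```python
import pandas as pd

# Recreate the sample DataFrame so this block is self-contained
df = pd.DataFrame({
    '姓名': ['小明', '小红', '小刚', '小丽'],
    '年龄': [20, 22, 21, 23],
    '城市': ['北京', '上海', '广州', '深圳'],
    '销售额': [1000, 1500, 800, 2000],
})

# The comparison yields one True/False value per row
mask = df['年龄'] > 21
print(mask.tolist())  # [False, True, False, True]

# Indexing with the mask keeps only the rows marked True
filtered = df[mask]
print(filtered.index.tolist())  # [1, 3]

# Summing a boolean mask counts the matching rows
print(int(mask.sum()))  # 2
```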
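3.2 Multiple Condition Filtering
To combine conditions, join them with & (AND: every condition must hold) or | (OR: at least one condition must hold). Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators such as > and ==.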
# Filter: sales > 1000 AND city is Shanghai
filtered_by_sales_and_city = df[(df['销售额'] > 1000) & (df['城市'] == '上海')]
print("\nRows where sales > 1000 and city is Shanghai:")
print(filtered_by_sales_and_city)
Output:
Rows where sales > 1000 and city is Shanghai:
姓名 年龄 城市 销售额
1 小红 22 上海 1500
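The example above covers AND. The sketch below shows the OR case on the same data, and additionally uses ~ to negate a condition; ~ is not covered in the text above and is included here only as a related operator.

```python
import pandas as pd

# Recreate the sample DataFrame so this block is self-contained
df = pd.DataFrame({
    '姓名': ['小明', '小红', '小刚', '小丽'],
    '年龄': [20, 22, 21, 23],
    '城市': ['北京', '上海', '广州', '深圳'],
    '销售额': [1000, 1500, 800, 2000],
})

# OR: sales > 1000, or city is Beijing (either one is enough)
either = df[(df['销售额'] > 1000) | (df['城市'] == '北京')]
print(either['姓名'].tolist())  # ['小明', '小红', '小丽']

# NOT: every row whose city is not Shanghai
not_shanghai = df[~(df['城市'] == '上海')]
print(not_shanghai['姓名'].tolist())  # ['小明', '小刚', '小丽']
```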
Summary
Through the above 3 steps, you have mastered the core techniques of DataFrame data selection and filtering:
1. Column Selection: Use df['column_name'] (single column) or df[['column1', 'column2']] (multiple columns)
2. Row Selection: Use df.iloc[position] (integer position) or df.loc[label] (row label)
3. Conditional Filtering: Use df[condition] (single condition) or df[(condition1) & (condition2)] (multiple conditions)
Key Reminder: For multiple condition filtering, use &/| instead of and/or, and wrap each condition in parentheses!
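The reason for this reminder can be seen by trying the Python keyword and on two conditions. The following sketch catches the resulting error instead of crashing; error_message is an illustrative variable name, and the exact error wording may vary between pandas versions.

```python
import pandas as pd

# Recreate the sample DataFrame so this block is self-contained
df = pd.DataFrame({
    '姓名': ['小明', '小红', '小刚', '小丽'],
    '年龄': [20, 22, 21, 23],
    '城市': ['北京', '上海', '广州', '深圳'],
    '销售额': [1000, 1500, 800, 2000],
})

error_message = None
try:
    # Wrong: `and` asks Python for a single True/False for a whole Series
    df[(df['年龄'] > 21) and (df['销售额'] > 1000)]
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Right: `&` compares the two conditions row by row
correct = df[(df['年龄'] > 21) & (df['销售额'] > 1000)]
print(correct['姓名'].tolist())  # ['小红', '小丽']
```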
Practice more, and you will soon be proficient in these operations and lay the foundation for subsequent data analysis.