Grouping Data with GROUP BY
Now that you've mastered aggregate functions like COUNT(), SUM(), and AVG(), you're ready to take your data analysis skills to the next level. In this lesson, you'll learn how to use the GROUP BY clause to organize your data into meaningful groups and calculate aggregate values for each group separately.
Learning Goals:
- Understand when and why to use
GROUP BY - Write queries that group data by one or more columns
- Combine
GROUP BYwith aggregate functions - Recognize the difference between
WHEREandHAVING
Understanding GROUP BY
The GROUP BY clause divides rows returned by a query into groups based on one or more columns. For each group, you can apply aggregate functions to calculate summary values.
Think of it like organizing orders: you could group them by status to count how many are delivered vs pending, or group products by category to calculate the average price in each category.
Basic GROUP BY Syntax
SELECT column1, aggregate_function(column2)
FROM table_name
GROUP BY column1;
The GROUP BY clause always comes after the FROM and WHERE clauses, but before ORDER BY.
Single Column Grouping
Let's start with a simple example from the marketplace schema: counting how many customers prefer each category.
SELECT COALESCE(cat.category_name, 'No preference') AS category_name,
COUNT(*) AS customer_count
FROM customer AS c
LEFT JOIN category AS cat ON cat.category_id = c.preferred_category_id
GROUP BY COALESCE(cat.category_name, 'No preference')
ORDER BY customer_count DESC;
This query would return something like:
| category_name | customer_count |
|---|---|
| Electronics | 3 |
| Fashion | 2 |
| No preference | 2 |
Always include the grouped column(s) in your SELECT list. This makes your results meaningful by showing which group each aggregate value belongs to.
GROUP BY with Multiple Aggregates
You can use multiple aggregate functions in the same GROUP BY query:
SELECT o.status,
COUNT(*) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.order_total) AS total_revenue,
AVG(o.order_total) AS avg_order_total
FROM orders AS o
GROUP BY o.status
ORDER BY total_orders DESC;
Grouping by Multiple Columns
Sometimes you need to create more specific groups by combining multiple columns:
SELECT cat.category_name,
o.status,
COUNT(*) AS items_sold
FROM order_item AS oi
JOIN orders AS o ON o.order_id = oi.order_id
JOIN product AS p ON p.product_id = oi.product_id
JOIN category AS cat ON cat.category_id = p.category_id
GROUP BY cat.category_name, o.status
ORDER BY cat.category_name, o.status;
This creates groups for each unique combination of category and order status.
GROUP BY with WHERE Clause
You can filter rows before they're grouped using WHERE:
SELECT cat.category_name,
AVG(p.price) AS avg_price
FROM product AS p
JOIN category AS cat ON cat.category_id = p.category_id
WHERE p.price > 30
GROUP BY cat.category_name;
The WHERE clause filters individual rows before grouping occurs. If you need to filter groups based on aggregate values, you'll need the HAVING clause (covered in the next lesson).
Real-World Example: Order Analysis
Let's work through a complete example with the marketplace data:
-- First, inspect the data we'll summarize
SELECT oi.order_item_id,
o.order_id,
p.sku,
cat.category_name,
o.order_date,
o.status
FROM order_item AS oi
JOIN orders AS o ON o.order_id = oi.order_id
JOIN product AS p ON p.product_id = oi.product_id
JOIN category AS cat ON cat.category_id = p.category_id
LIMIT 5;
-- Now group order items by category and status for March 2025
SELECT
cat.category_name,
o.status,
COUNT(*) AS total_items,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_item AS oi
JOIN orders AS o ON o.order_id = oi.order_id
JOIN product AS p ON p.product_id = oi.product_id
JOIN category AS cat ON cat.category_id = p.category_id
WHERE o.order_date LIKE '2025-03%'
GROUP BY cat.category_name, o.status
ORDER BY total_items DESC;
Common Pitfalls
- Forgetting non-aggregated columns: Every column in the
SELECTclause must either be in theGROUP BYclause or wrapped in an aggregate function - Confusing WHERE and HAVING: Use
WHEREto filter rows before grouping,HAVINGto filter groups after aggregation - Over-grouping: Too many grouping columns can result in too many small groups, making analysis difficult
- NULL values:
GROUP BYtreats all NULL values as a single group
Summary
The GROUP BY clause is essential for data analysis in SQL. It allows you to:
- Organize data into logical groups
- Calculate aggregate values for each group
- Combine multiple grouping columns for detailed analysis
- Work with
WHEREto filter data before grouping
Remember that GROUP BY works hand-in-hand with aggregate functions to transform raw order, product, and customer data into meaningful summaries.
Show quiz
- What happens if you include a column in SELECT that's not in GROUP BY and not wrapped in an aggregate function?
- Can you use GROUP BY with columns from a WHERE clause filter?
- How does GROUP BY handle NULL values in grouping columns?
- What's the correct order of these clauses: GROUP BY, WHERE, ORDER BY?
- True or False: You can only group by one column at a time.
Answers:
- You'll get an error. All non-aggregated columns in SELECT must appear in GROUP BY.
- Yes, WHERE filters rows before grouping occurs, so you can group by any column in the table.
- All NULL values are grouped together into a single group.
- WHERE -> GROUP BY -> ORDER BY
- False, you can group by multiple columns separated by commas.