共计 8142 个字符,预计需要花费 21 分钟才能阅读完成。
目录
- 递归 CTE:层级 / 树型数据(类目 / 组织)
- 层级汇总:含子类的 GMV 统计(祖先映射法)
- ROLLUP/ 分组小计:多维报表一条 SQL
- 动态透视(Pivot):条件聚合与动态 SQL
- 留存与分群报表:窗口函数 Cohort 矩阵
- 时间区间连接:有效期价格 / 会员等级匹配
- 优化器 Trace:optimizer_trace 的开启与阅读
- 物化视图(模拟):增量汇总与幂等回填
- 数据质量 SQL:孤儿行 / 重复键 / 范围异常检查
- 实战练习(含参考答案)
版本:MySQL 8.0。延续
shop库。示例可直接复用第一课~第五课数据结构。
1) 递归 CTE:层级 / 树型数据(类目 / 组织)
表结构(商品类目,父子关系):
CREATE TABLE IF NOT EXISTS categories (
cat_id BIGINT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
parent_id BIGINT NULL,
KEY idx_parent(parent_id),
CONSTRAINT fk_cat_parent FOREIGN KEY (parent_id) REFERENCES categories(cat_id)
);
插入示例:
INSERT INTO categories(cat_id,name,parent_id) VALUES
(1,'电子',NULL), (2,'外设',1), (3,'键盘',2), (4,'鼠标',2), (5,'显示器',1);
递归 CTE 展开整棵树(含根到叶的路径、深度):
WITH RECURSIVE cat_tree AS (
-- 根层
SELECT cat_id, parent_id, name,
CAST(cat_id AS CHAR(200)) AS path,
1 AS depth
FROM categories WHERE parent_id IS NULL
UNION ALL
-- 向下展开
SELECT c.cat_id, c.parent_id, c.name,
CONCAT(ct.path, ',', c.cat_id) AS path,
ct.depth + 1 AS depth
FROM categories c
JOIN cat_tree ct ON c.parent_id = ct.cat_id
-- 简易环路保护(深度阈值 + 非严格检测)WHERE ct.depth < 32 AND FIND_IN_SET(c.cat_id, ct.path) = 0
)
SELECT * FROM cat_tree ORDER BY path;
要点:
FIND_IN_SET结合path字符串可防简单环;深度阈值避免意外自循环。
2) 层级汇总:含子类的 GMV 统计(祖先映射法)
需求:统计每个类目 包含其所有子类 的 GMV。
思路:先构造“每个节点的所有祖先”映射表,再把明细按类目 → 祖先映射累加。
-- 2.1 构造 ancestor 映射(child → ancestor)WITH RECURSIVE anc AS (
SELECT cat_id AS child, cat_id AS ancestor FROM categories
UNION ALL
SELECT c.cat_id AS child, a.ancestor
FROM categories c
JOIN anc a ON c.parent_id = a.child
)
SELECT * FROM anc; -- child, ancestor
-- 2.2 订单明细加入类目(假设 products(cat_id) 已存在)-- 先把订单 GMV 摘出来
WITH item_gmv AS (SELECT oi.order_id, p.cat_id, SUM(oi.quantity * oi.unit_price) AS gmv
FROM order_items oi JOIN products p ON oi.product_id = p.product_id
GROUP BY oi.order_id, p.cat_id
),
anc AS (
SELECT cat_id AS child, cat_id AS ancestor FROM categories
UNION ALL
SELECT c.cat_id AS child, a.ancestor
FROM categories c JOIN anc a ON c.parent_id = a.child
)
SELECT a.ancestor AS cat_id, SUM(g.gmv) AS gmv_with_children
FROM item_gmv g
JOIN anc a ON g.cat_id = a.child
GROUP BY a.ancestor
ORDER BY gmv_with_children DESC;
祖先映射法允许你一次性计算“含子类”的指标;需要时再按时间分组即可。
3) ROLLUP/ 分组小计:多维报表一条 SQL
按渠道 × 月份汇总,并输出小计与总计:
WITH paid AS (
SELECT channel,
DATE_FORMAT(created_at,'%Y-%m') AS ym,
amount
FROM orders WHERE status='paid'
)
SELECT channel, ym,
SUM(amount) AS gmv
FROM paid
GROUP BY channel, ym WITH ROLLUP
ORDER BY channel IS NULL, channel, ym IS NULL, ym;
WITH ROLLUP生成分组小计与总计。ORDER BY ... IS NULL让总计行排在末尾。必要时用IFNULL(channel,'ALL')等友好显示。
4) 动态透视(Pivot):条件聚合与动态 SQL
静态列(已知几个月):
SELECT channel,
SUM(CASE WHEN DATE_FORMAT(created_at,'%Y-%m')='2025-08' THEN amount ELSE 0 END) AS m_2025_08,
SUM(CASE WHEN DATE_FORMAT(created_at,'%Y-%m')='2025-09' THEN amount ELSE 0 END) AS m_2025_09,
SUM(CASE WHEN DATE_FORMAT(created_at,'%Y-%m')='2025-10' THEN amount ELSE 0 END) AS m_2025_10
FROM orders WHERE status='paid'
GROUP BY channel;
动态列(月份未知):用预处理构造 SQL。
SET @sql = (
SELECT GROUP_CONCAT(
DISTINCT CONCAT('SUM(CASE WHEN DATE_FORMAT(created_at,\'%Y-%m\')=\'', ym, '\' THEN amount ELSE 0 END) AS `', ym,'`') ORDER BY ym SEPARATOR',\n'
)
FROM (SELECT DISTINCT DATE_FORMAT(created_at,'%Y-%m') AS ym
FROM orders WHERE status='paid'
) t
);
SET @sql = CONCAT('SELECT channel,', @sql, 'FROM orders WHERE status=\'paid\'GROUP BY channel');
PREPARE stmt FROM @sql; EXECUTE stmt; DEALLOCATE PREPARE stmt;
5) 留存与分群报表:窗口函数 Cohort 矩阵
目标:按“用户首单月份”分 cohort,计算后续每月是否复购。
-- 5.1 计算用户首单月
WITH first_month AS (
SELECT user_id,
DATE_FORMAT(MIN(created_at),'%Y-%m') AS cohort
FROM orders WHERE status='paid'
GROUP BY user_id
),
user_month AS (
SELECT o.user_id,
DATE_FORMAT(o.created_at,'%Y-%m') AS ym
FROM orders o WHERE o.status='paid'
)
SELECT fm.cohort, um.ym,
COUNT(DISTINCT um.user_id) AS buyers,
TIMESTAMPDIFF(MONTH, STR_TO_DATE(CONCAT(fm.cohort,'-01'),'%Y-%m-%d'),
STR_TO_DATE(CONCAT(um.ym,'-01'),'%Y-%m-%d')) AS month_offset
FROM first_month fm JOIN user_month um USING(user_id)
GROUP BY fm.cohort, um.ym
ORDER BY fm.cohort, um.ym;
转矩阵(列为月偏移):
SELECT cohort,
SUM(CASE WHEN month_offset=0 THEN buyers ELSE 0 END) AS m0,
SUM(CASE WHEN month_offset=1 THEN buyers ELSE 0 END) AS m1,
SUM(CASE WHEN month_offset=2 THEN buyers ELSE 0 END) AS m2,
SUM(CASE WHEN month_offset=3 THEN buyers ELSE 0 END) AS m3
FROM (
-- 把上一查询作为子表
SELECT fm.cohort, um.ym,
COUNT(DISTINCT um.user_id) AS buyers,
TIMESTAMPDIFF(MONTH, STR_TO_DATE(CONCAT(fm.cohort,'-01'),'%Y-%m-%d'),
STR_TO_DATE(CONCAT(um.ym,'-01'),'%Y-%m-%d')) AS month_offset
FROM (SELECT user_id, DATE_FORMAT(MIN(created_at),'%Y-%m') AS cohort
FROM orders WHERE status='paid' GROUP BY user_id
) fm
JOIN (SELECT user_id, DATE_FORMAT(created_at,'%Y-%m') AS ym
FROM orders WHERE status='paid'
) um USING(user_id)
GROUP BY fm.cohort, um.ym
) t
GROUP BY cohort
ORDER BY cohort;
需要转百分比可用
m1/m0等;更灵活的“动态列”可用 §4 的动态 SQL 技巧。
6) 时间区间连接:有效期价格 / 会员等级匹配
场景:给订单匹配下单时有效的价格或会员等级。
-- 价格表(有效区间)CREATE TABLE IF NOT EXISTS price_history (
product_id BIGINT,
start_ts DATETIME NOT NULL,
end_ts DATETIME NOT NULL,
price DECIMAL(10,2) NOT NULL,
PRIMARY KEY(product_id, start_ts),
KEY idx_end(end_ts)
);
-- 匹配订单行对应的有效价格(区间连接)SELECT oi.order_id, oi.product_id, oi.quantity,
ph.price AS price_effective
FROM order_items oi
JOIN orders o ON oi.order_id=o.order_id
JOIN price_history ph
ON ph.product_id=oi.product_id
AND o.created_at >= ph.start_ts
AND o.created_at < ph.end_ts;
优化建议:对
(product_id, start_ts)建复合索引并保证end_ts有索引,查询加product_id精确条件可显著减少扫描。
7) 优化器 Trace:optimizer_trace 的开启与阅读
-- 仅在会话级开启
SET optimizer_trace='enabled=on', end_markers_in_json=on;
-- 运行你的查询
SELECT /* test */ * FROM orders WHERE user_id=1 AND created_at>=NOW()-INTERVAL 30 DAY;
-- 查看 JSON 轨迹(包含访问路径评估、代价计算、选择性估计等)SELECT * FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE\G
-- 关闭
SET optimizer_trace='enabled=off';
使用场景:解释为什么没走某索引、为什么选择 Nested Loop/ 临时表等。注意 trace 较重,只在排查时短期开启。
8) 物化视图(模拟):增量汇总与幂等回填
目标:维护 metrics_daily(d, gmv)。
CREATE TABLE IF NOT EXISTS metrics_daily (
d DATE PRIMARY KEY,
gmv DECIMAL(18,2) NOT NULL DEFAULT 0
);
-- 增量 upsert(每天跑一次;或事件调度器见第四课)INSERT INTO metrics_daily(d, gmv)
SELECT DATE(created_at) AS d, SUM(amount) AS gmv
FROM orders WHERE status='paid' AND created_at >= CURDATE()-INTERVAL 7 DAY
GROUP BY d
ON DUPLICATE KEY UPDATE gmv=VALUES(gmv);
幂等回填(历史整段):
-- 先删除目标区间再重算,避免累加污染
DELETE FROM metrics_daily WHERE d BETWEEN '2025-07-01' AND '2025-10-11';
INSERT INTO metrics_daily(d, gmv)
SELECT DATE(created_at) AS d, SUM(amount)
FROM orders WHERE status='paid'
AND created_at >= '2025-07-01' AND created_at < '2025-10-12'
GROUP BY d
ON DUPLICATE KEY UPDATE gmv=VALUES(gmv);
原则:先删后插 +
ON DUPLICATE KEY UPDATE,保证多次执行结果一致。
9) 数据质量 SQL:孤儿行 / 重复键 / 范围异常检查
-- 9.1 孤儿订单(无对应用户)SELECT o.order_id FROM orders o
LEFT JOIN users u ON o.user_id=u.user_id
WHERE u.user_id IS NULL;
-- 9.2 重复邮箱(不唯一)SELECT email, COUNT(*) c FROM users
GROUP BY email HAVING c>1;
-- 9.3 金额异常(负值 / 极端值)SELECT * FROM orders WHERE amount<0
UNION ALL
SELECT * FROM (SELECT *, NTILE(100) OVER(ORDER BY amount) AS pct
FROM orders WHERE status='paid'
) t WHERE pct>=100; -- 近似 P100 长尾检查
10) 实战练习(含参考答案)
练习 1(层级汇总):统计每个“一级类目”包含所有子类的 GMV,并按 GMV 降序。
WITH RECURSIVE anc AS (
SELECT cat_id AS child, cat_id AS ancestor FROM categories
UNION ALL
SELECT c.cat_id AS child, a.ancestor
FROM categories c JOIN anc a ON c.parent_id = a.child
),
item_gmv AS (SELECT oi.order_id, p.cat_id, SUM(oi.quantity*oi.unit_price) AS gmv
FROM order_items oi JOIN products p ON oi.product_id=p.product_id
GROUP BY oi.order_id, p.cat_id
)
SELECT a.ancestor AS cat_id, SUM(g.gmv) AS gmv_all
FROM item_gmv g JOIN anc a ON g.cat_id=a.child
GROUP BY a.ancestor
ORDER BY gmv_all DESC;
练习 2(ROLLUP 报表):输出 渠道×月份 GMV,小计与总计。
SELECT channel, DATE_FORMAT(created_at,'%Y-%m') AS ym, SUM(amount) AS gmv
FROM orders WHERE status='paid'
GROUP BY channel, ym WITH ROLLUP
ORDER BY channel IS NULL, channel, ym IS NULL, ym;
练习 3(区间连接):将 price_history 与 orders 连接,求每笔订单的“下单时价格”。
SELECT o.order_id, oi.product_id, ph.price
FROM orders o JOIN order_items oi ON o.order_id=oi.order_id
JOIN price_history ph
ON ph.product_id=oi.product_id
AND o.created_at>=ph.start_ts AND o.created_at<ph.end_ts;
练习 4(Cohort 留存):输出 cohort(首单月)在 m0~m3 的复购人数矩阵。
WITH fm AS (SELECT user_id, DATE_FORMAT(MIN(created_at),'%Y-%m') AS cohort
FROM orders WHERE status='paid' GROUP BY user_id
), um AS (SELECT user_id, DATE_FORMAT(created_at,'%Y-%m') AS ym
FROM orders WHERE status='paid'
), base AS (
SELECT fm.cohort, um.ym,
TIMESTAMPDIFF(MONTH, STR_TO_DATE(CONCAT(fm.cohort,'-01'),'%Y-%m-%d'),
STR_TO_DATE(CONCAT(um.ym,'-01'),'%Y-%m-%d')) AS mo,
COUNT(DISTINCT um.user_id) AS buyers
FROM fm JOIN um USING(user_id)
GROUP BY fm.cohort, um.ym
)
SELECT cohort,
SUM(CASE WHEN mo=0 THEN buyers ELSE 0 END) AS m0,
SUM(CASE WHEN mo=1 THEN buyers ELSE 0 END) AS m1,
SUM(CASE WHEN mo=2 THEN buyers ELSE 0 END) AS m2,
SUM(CASE WHEN mo=3 THEN buyers ELSE 0 END) AS m3
FROM base GROUP BY cohort ORDER BY cohort;
练习 5(optimizer_trace):开启 trace,解释为何以下查询没有用到某个索引,并给出可索引化改写。
SET optimizer_trace='enabled=on';
SELECT * FROM orders WHERE DATE(created_at)='2025-10-01'; -- 可能未走索引
SELECT * FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE\G
SET optimizer_trace='enabled=off';
-- 改写:SELECT * FROM orders WHERE created_at>='2025-10-01' AND created_at<'2025-10-02';
小结
- 递归 CTE + 祖先映射解决“含子类”的统计与树遍历。
WITH ROLLUP直出小计 / 总计;动态透视靠条件聚合或动态 SQL。- Cohort 报表 = 首单分组 + 月偏移;区间连接需好索引与谓词顺序。
- optimizer_trace 帮你理解计划选择;物化汇总要 幂等 与增量。

