數(shù)據(jù)重復(fù)是數(shù)據(jù)庫(kù)常見問(wèn)題,易導(dǎo)致統(tǒng)計(jì)失真、存儲(chǔ)冗余。下面我們系統(tǒng)了解 6 個(gè) SQL 去重關(guān)鍵字或操作符,并系統(tǒng)梳理 SQL 去重方法,即從基礎(chǔ)的 DISTINCT 到復(fù)雜的窗口函數(shù),結(jié)合多數(shù)據(jù)庫(kù)特性,讓我們能從中精準(zhǔn)選擇去重方案。(所有代碼塊與表格都可以左右滾動(dòng))
1. DISTINCT:?jiǎn)伪砘A(chǔ)去重
DISTINCT 是 SQL 中用于消除查詢結(jié)果集重復(fù)行的關(guān)鍵字。它作用于整個(gè) SELECT 子句返回的所有列,僅保留組合值完全唯一的行。若查詢多列,需所有列值均相同才算重復(fù)。例如: SELECT DISTINCT col1, col2 FROM table
會(huì)對(duì)兩列組合去重。注意:DISTINCT 會(huì)隱式排序結(jié)果,可能影響性能,大數(shù)據(jù)量時(shí)建議結(jié)合索引優(yōu)化。
原理:在查詢結(jié)果集上直接過(guò)濾重復(fù)行,保留唯一值。
語(yǔ)法:SELECT DISTINCT column1, column2 FROM table;
模擬數(shù)據(jù):
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(50),
department VARCHAR(50)
);
INSERT INTO employees VALUES
(1, 'Alice', 'HR'),
(2, 'Bob', 'Engineering'),
(3, 'Alice', 'HR'), -- 姓名重復(fù)
(4, 'Charlie', 'Marketing'),
(5, 'Bob', 'Engineering'); -- 姓名+部門重復(fù)
查詢示例:
-- 單列去重
SELECT DISTINCT name FROM employees;
/* 輸出:
+----------+
| name |
+----------+
| Alice |
| Bob |
| Charlie |
+----------+
注釋:重復(fù)的 'Alice' 和 'Bob' 被合并 */
-- 多列聯(lián)合去重
SELECT DISTINCT name, department FROM employees;
/* 輸出:
+----------+---------------+
| name | department |
+----------+---------------+
| Alice | HR |
| Bob | Engineering |
| Charlie | Marketing |
| Bob | Engineering | -- 此行保留(因id不同)
+----------+---------------+
注釋:多列去重需所有字段值完全相同 */
2. GROUP BY:分組去重
GROUP BY 是 SQL 中按指定字段分組實(shí)現(xiàn)去重的關(guān)鍵字,可結(jié)合聚合函數(shù)(如 COUNT、MAX)處理分組數(shù)據(jù)。語(yǔ)法為 SELECT 列 FROM 表 GROUP BY 列
,僅保留每組唯一值。例如按部門分組統(tǒng)計(jì)人數(shù),比 DISTINCT 更靈活,支持復(fù)雜計(jì)算,大數(shù)據(jù)量下性能更優(yōu)(依賴索引)。
原理:按指定字段分組實(shí)現(xiàn)去重,支持聚合計(jì)算。
語(yǔ)法:SELECT column1, COUNT(*) FROM table GROUP BY column1;
查詢示例:
-- 統(tǒng)計(jì)每個(gè)部門的員工數(shù)
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;
/* 輸出:
+---------------+----------------+
| department | employee_count |
+---------------+----------------+
| HR | 2 |
| Engineering | 2 |
| Marketing | 1 |
+---------------+----------------+
注釋:按部門分組并計(jì)數(shù) */
GROUP BY 與 DISTINCT 對(duì)比:
均能去重,GROUP BY 按字段分組,支持聚合計(jì)算(如 COUNT),適合需統(tǒng)計(jì)分析的場(chǎng)景;DISTINCT 直接返回唯一行,語(yǔ)法更簡(jiǎn)潔。大數(shù)據(jù)量時(shí) GROUP BY 性能可能更優(yōu)(依賴索引),且結(jié)果可排序,而 DISTINCT 作用于全列組合,無(wú)分組邏輯。
3. UNION 與 UNION ALL:多結(jié)果集去重
在SQL中,UNION 和 UNION ALL 都是用于合并多個(gè)查詢結(jié)果集的操作符,需保證各查詢返回的列數(shù)、順序和數(shù)據(jù)類型一致。
- UNION:會(huì)對(duì)合并后的結(jié)果集自動(dòng)去除重復(fù)行,相當(dāng)于合并+去重,但可能因排序去重消耗更多性能。
- UNION ALL:直接合并所有結(jié)果,包括重復(fù)行,不進(jìn)行去重處理,執(zhí)行效率更高。
實(shí)際使用時(shí),若確認(rèn)無(wú)重復(fù)或無(wú)需去重,優(yōu)先用 UNION ALL 提升性能。
原理:
UNION ALL
:直接拼接保留重復(fù)項(xiàng)
模擬數(shù)據(jù):
CREATE TABLE hr_dept (id INT, name VARCHAR(50));
CREATE TABLE eng_dept (id INT, name VARCHAR(50));
INSERT INTO hr_dept VALUES (1, 'Alice'), (2, 'David');
INSERT INTO eng_dept VALUES (2, 'David'), (3, 'Bob'), (1, 'Alice');
查詢對(duì)比:
-- UNION:自動(dòng)去重
SELECT name FROM hr_dept
UNION
SELECT name FROM eng_dept;
/* 輸出:
+-------+
| name |
+-------+
| Alice |
| David |
| Bob |
+-------+ */
-- UNION ALL:保留所有記錄
SELECT name FROM hr_dept
UNION ALL
SELECT name FROM eng_dept;
/* 輸出:
+-------+
| name |
+-------+
| Alice |
| David |
| David |
| Bob |
| Alice |
+-------+ */
性能提示:
UNION ALL
效率更高(無(wú)額外操作)
4. EXCEPT / MINUS:差集去重
EXCEPT / MINUS 是SQL中用于求差集的操作符,返回第一個(gè)查詢結(jié)果中不存在于第二個(gè)查詢的記錄并去重。
- EXCEPT支持SQL Server、PostgreSQL等;
- MINUS是Oracle對(duì)EXCEPT的替代。
需兩查詢列數(shù)、類型一致,MySQL不支持,需用LEFT JOIN模擬。適用于查找數(shù)據(jù)集差異場(chǎng)景。具體拆解如下:
原理:返回第一個(gè)結(jié)果集,剔除第二個(gè)結(jié)果集的重復(fù)項(xiàng)。
數(shù)據(jù)庫(kù)支持:
模擬數(shù)據(jù):
CREATE TABLE all_products (id INT, name VARCHAR(50));
CREATE TABLE sold_products (id INT, name VARCHAR(50));
INSERT INTO all_products VALUES (1, 'Laptop'), (2, 'Phone'), (3, 'Tablet');
INSERT INTO sold_products VALUES (1, 'Laptop'), (3, 'Tablet');
標(biāo)準(zhǔn)實(shí)現(xiàn):
-- SQL Server/Oracle
SELECT * FROM all_products
EXCEPT
SELECT * FROM sold_products; -- Oracle用MINUS
/* 輸出:
+----+-------+
| id | name |
+----+-------+
| 2 | Phone |
+----+-------+ */
MySQL替代方案:
-- LEFT JOIN模擬
SELECT ap.*
FROM all_products ap
LEFT JOIN sold_products sp ON ap.id = sp.id
WHERE sp.id IS NULL;
-- NOT IN模擬
SELECT * FROM all_products
WHERE id NOT IN (SELECT id FROM sold_products);
5. INTERSECT:交集去重
INTERSECT 是SQL中求交集的操作符,返回兩個(gè)查詢結(jié)果集中共有的記錄并自動(dòng)去重。需保證兩查詢列數(shù)、順序和數(shù)據(jù)類型一致。
支持場(chǎng)景:如查找同時(shí)存在于兩表的用戶ID。
注意:MySQL不支持,需用INNER JOIN或IN子查詢模擬;Oracle、SQL Server等主流數(shù)據(jù)庫(kù)支持。性能依賴索引,大數(shù)據(jù)量慎用。具體介紹如下:
原理:返回兩個(gè)結(jié)果集的共同項(xiàng)(自動(dòng)去重)。
數(shù)據(jù)庫(kù)支持:
模擬數(shù)據(jù):
CREATE TABLE all_members (id INT, name VARCHAR(50));
CREATE TABLE active_members (id INT, name VARCHAR(50));
INSERT INTO all_members VALUES (1, 'Tom'), (2, 'Jerry'), (3, 'Spike');
INSERT INTO active_members VALUES (1, 'Tom'), (3, 'Spike'), (4, 'Tyke');
標(biāo)準(zhǔn)實(shí)現(xiàn):
-- SQL Server/Oracle
SELECT * FROM all_members
INTERSECT
SELECT * FROM active_members;
/* 輸出:
+----+-------+
| id | name |
+----+-------+
| 1 | Tom |
| 3 | Spike |
+----+-------+ */
MySQL替代方案:
-- INNER JOIN模擬
SELECT am.*
FROM all_members am
INNER JOIN active_members ac ON am.id = ac.id;
-- IN子查詢模擬
SELECT * FROM all_members
WHERE id IN (SELECT id FROM active_members);
6. 高級(jí)去重技巧
1. 窗口函數(shù)去重
-- 保留每個(gè)部門最新入職記錄
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY department
ORDER BY hire_date DESC
) AS rn
FROM employees
) t WHERE rn = 1;
2. 子查詢?nèi)ブ?/strong>
-- 查找有訂單的用戶
SELECT * FROM users
WHERE user_id IN (
SELECT DISTINCT user_id FROM orders
);
3. DELETE去重
-- 刪除重復(fù)郵箱記錄
DELETE FROM users
WHERE user_id NOT IN (
SELECT MIN(user_id)
FROM users
GROUP BY email
);
7. 數(shù)據(jù)庫(kù)去重差異參考表
| | | |
---|
EXCEPT | | | |
INTERSECT | | | |
ROW_NUMBER | | | |
ROWID | | | |
8. 去重性能優(yōu)化指南
索引策略:
-- 為去重字段創(chuàng)建索引
CREATE INDEX idx_dept ON employees(department);
- 分批處理:
LIMIT 1000
+ 循環(huán) - 小數(shù)據(jù)量:優(yōu)先用
DISTINCT
去重代價(jià)排序(效率降序):
UNION ALL
> DISTINCT
≈ GROUP BY
> UNION
> EXCEPT
/INTERSECT
9. 去重應(yīng)用場(chǎng)景速查表
| | |
---|
| | SELECT DISTINCT dept FROM emp |
| | SELECT col FROM A UNION SELECT col FROM B |
| | PARTITION BY id ORDER BY date DESC |
| | DELETE WHERE id NOT IN (SELECT MIN(id)...) |
| | SELECT FROM A EXCEPT SELECT FROM B |
| | SELECT FROM A INTERSECT SELECT FROM B |
10. SQL去重方法總結(jié)
SQL去重方法主要有三類:
- 基礎(chǔ)去重:
DISTINCT
和GROUP BY
適合單表操作 - 集合運(yùn)算:
UNION
/EXCEPT
/INTERSECT
處理多結(jié)果集 - 高級(jí)技巧:窗口函數(shù)和子查詢解決復(fù)雜場(chǎng)景
選擇依據(jù):
- 數(shù)據(jù)量大?。ㄐ?shù)據(jù)用
DISTINCT
,大數(shù)據(jù)用GROUP BY
) - 數(shù)據(jù)庫(kù)支持(MySQL需注意語(yǔ)法差異)
- 業(yè)務(wù)需求(是否需要聚合計(jì)算、是否需物理刪除等)
SQL 去重需兼顧場(chǎng)景與性能,小數(shù)據(jù)用 DISTINCT,大數(shù)據(jù)靠索引與分批。理解這些方法原理、差異及組合使用,可高效解決實(shí)際工作中的各類重復(fù)問(wèn)題,提升數(shù)據(jù)質(zhì)量。
閱讀原文:原文鏈接
該文章在 2025/9/1 12:18:25 編輯過(guò)