How to delete duplicate rows in SQL Server?
#1
I've run into a bit of a roadblock trying to clean up some data in SQL Server. I need to remove duplicate rows from a table, but there's no unique identifier or primary key associated with each row. The duplicate rows are completely identical across all columns. I've attempted the following DELETE operation, expecting to remove duplicates, but it doesn't seem to work as there's no ID column to key on. It uses a subquery to identify duplicates based on the `col1`, assuming it's an ID, but the table structure doesn't support such a column.
Here's the non-functional code:

Code:
DELETE
FROM table
WHERE col1 IN(
    SELECT col1 FROM table GROUP BY col1 HAVING(COUNT(col1) > 1)
)

The query aims to delete rows that have more than one occurrence, but as all columns have the same value in duplicates, the query ends up targeting all rows, not just the duplicates. What would be the right SQL query to remove just the duplicate data, retaining a single instance of each duplicate row?
Reply
#2
Your initial approach would indeed require a unique identifier to work properly. However, you can use Common Table Expressions (CTE) along with the ROW_NUMBER function to achieve your goal. The ROW_NUMBER function can provide a unique sequence number to each row within the partition of your result set, and then you can delete the rows where the row number is greater than 1, which are the duplicates. Here's how you can apply this concept:

Code:
WITH CTE AS(
    SELECT * ,
    ROW_NUMBER() OVER(
        PARTITION BY col1, col2, col3, col4, col5, col6, col7 ORDER BY(SELECT NULL)
    ) AS rn FROM table
)
DELETE FROM CTE WHERE rn > 1;

This should solve your problem.
Reply
#3
Keep in mind that when using ROW_NUMBER with ORDER BY (SELECT NULL), SQL Server assigns numbers arbitrarily. In this case, it's not an issue because all duplicate rows are identical and there's no need to keep a specific one. However, in other situations where you might need to retain a particular row based on certain criteria, you should specify an appropriate column or set of columns in the ORDER BY clause.
Reply
#4
If performance is a concern, especially with a large dataset, be cautious with CTE-based deletion. It can be transaction heavy and might lock the table for a time. An alternative approach would be to use a temporary table or table variable to hold the duplicates first and then delete where exists. This can sometimes result in less locking and quicker operations, like so:

Code:
SELECT
col1, col2, col3, col4, col5, col6, col7,
ROW_NUMBER() OVER(
    PARTITION BY col1, col2, col3, col4, col5, col6, col7 ORDER BY(SELECT NULL)
) AS rn
INTO #Duplicates
FROM table;
DELETE t
FROM table t
INNER JOIN #Duplicates d ON
t.col1 = d.col1
AND t.col2 = d.col2
AND t.col3 = d.col3
AND t.col4 = d.col4
AND t.col5 = d.col5
AND t.col6 = d.col6
AND t.col7 = d.col7
WHERE d.rn > 1;
DROP TABLE #Duplicates;
Reply
#5
This should get me through the issue. The explanations are quite clear, thank you. For future reference and for anyone who might come across this discussion, here is the compiled and working code, assuming the table we're working on is named `DuplicateTbl` and not just `table` because `table` is a reserved keyword in SQL. Also, SQL Server doesn't need additional modules or libraries imported like some programming languages do, so here's the code ready to be executed:

Code:
WITH CTE AS(
    SELECT * ,
    ROW_NUMBER() OVER(
        PARTITION BY col1, col2, col3, col4, col5, col6, col7 ORDER BY(SELECT NULL)
    ) AS rn FROM DuplicateTbl
)
DELETE FROM CTE WHERE rn > 1;

Applying it will remove the duplicates and leave me with one instance of each set of previously duplicated rows.
Reply


Forum Jump:


Users browsing this thread: 1 Guest(s)