Wednesday, September 14, 2011

Finding Duplicate Rows Using TSQL

Ok so here is a tired old post that has been blogged about since the internet’s inception, right? Well sort of... I am not going to yammer on here too much about a topic that is covered exhaustively on technical blogs like mine, nor do I claim to be a 'SQL Guru' of any sorts, but I noticed a lot of the sites offering help on this topic always did so for a very basic and simple example. Well I too am going to use a simple example but expand on its usefulness to hopefully help out a few wondering the search engines in need of help. Your typical 'find duplicate rows in a table by ID' example is shown below:

SELECT ID
FROM Books
GROUP BY ID
HAVING COUNT(ID) > 1
Another example using a varchar column:

SELECT Title
FROM Books
GROUP BY Title
HAVING COUNT(Title) > 1
Ok the above is great for small tables, to manually track down records, or maybe as part of a larger query or subquery. However odds are you are going to need additional columns of data and probably the actual duplicate rows themselves. Well initial thought might be to expand the simple example query above to include the additional fields, but you will quickly find out that the query will yield no results. This is because the query above groups on the columns in question having a count greater than 1. Well if you add additional columns to the query that do not contain duplicates, then this condition is no longer 'True' and thus no results are returned.

The fix is to Join in another copy of the same table. One table's purpose is to focus on the duplicate rows, and the second table's purpose is to focus on the additional columns needed in the results. When joining on the same table an Alias must be given to distinguish between the two. Here is the expanded example from above, that will return all of the *actual* duplicate rows, and any additional information that was sought:

SELECT bAll.ID, bAll.PublishDate, bAll.Title, bAll.Price
FROM Books bAll
INNER JOIN (SELECT Title
FROM Books
GROUP BY Title
HAVING COUNT(Title) > 1) bDups
ON bAll.Title = bDups.Title
ORDER BY bAll.Title
Lastly, here is a template of the above query that you might want to keep handy as sort of a 'fill-in-the-blanks' template (remove brackets - they are just placeholders and not required syntax) for your own 'finding duplicate rows' needs:

SELECT [AliasAllTable].[Field1], [AliasAllTable].[Field2], [AliasAllTable].[Field3]
FROM [MainTable] [AliasMainTable]
INNER JOIN (SELECT [DuplicateFieldName]
FROM [MainTable]
GROUP BY [DuplicateFieldName]
HAVING COUNT([DuplicateFieldName]) > 1) [AliasDuplicateTable]
ON [AliasAllTable].[DuplicateFieldName] = [AliasDuplicateTable].[DuplicateFieldName]
ORDER BY [AliasAllTable].[DuplicateFieldName]
I welcome any SQL experts to comment on streamlined ways to accomplish the identical task; I can certainly update the post with additional information. However with the plethora of examples available, too many seemed to be of the basic flavor example and I wanted to introduce the additional functionality that is probably often sought after.

4 comments:

  1. That's how I do it, too, basically.

    ReplyDelete
  2. Thanks Allen great post, far from SQL expert.

    ReplyDelete
  3. Yep, I am a developer at heart and know just enough SQL to be dangerous! If you or anyone has a more streamlined or current way of completing the task, please post back with the example and I will update the post as I mentioned.

    ReplyDelete
  4. I usually use the ROW_NUMBER method to find duplicates. Example:

    ;WITH CTE
    AS
    (
    SELECT ROW_NUMBER() OVER (PARTITION BY title
    ORDER BY ( SELECT 0 ) ) RN, book_id, title, author_last_name, author_first_name, rating
    FROM books
    )
    SELECT book_id, title, author_last_name, author_first_name, rating
    FROM CTE
    WHERE RN > 1

    ReplyDelete