Duplicate rows in SQL

EmpHector

Junior Member
Aug 10, 2009
10
0
0
hi,

TITLE  FORENAME  SURNAME   ADDRESS       HOME_PHONE  TOWN

mr     Mohan     Krishnan  42,MonksRD    1234567891  coventry,west midlands

null   jason     Rosss     13,CharterRD  1234567899  birmingham,west midlands

null   Mohan     Krishnan  42,MonksRD    1234567891  coventry,west midlands

mr     Chris     Hadland   30,watfordRD  5678912345  london,east midlands

mr     Mohan     Krishnan  42,MonksRD    4567891123  coventry,westmidlands



In this table the person Mohan appears three times: in the first row his title is "mr", in the third row the title is null, and in the fifth row the title is "mr" but the home phone differs from the other two rows. All three rows denote the same person. If this is the case, how can I find these duplicate records using a SQL query?

My boss gave me 12 million records like this and asked me to find all the duplicate records, as I described above, and delete them.
Please help me with this.

thanks in advance....

 

presidentender

Golden Member
Jan 23, 2008
1,167
0
76
This problem will vary based on what exactly constitutes a duplicate. I doubt that firstname lastname is sufficient, since you'll have more than one "John Smith," but just from your example there are other changes between records. In any case, you will need to use a "select" statement with a "group by" clause; you'll have to decide which fields to group by.
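A minimal sketch of that GROUP BY / HAVING pattern, run through Python's sqlite3 purely for illustration. The table layout and sample rows below are made up to mirror the OP's columns, and grouping on forename, surname, and address is just one possible choice of "duplicate" definition:

```python
import sqlite3

# In-memory sketch of the GROUP BY / HAVING approach; the table and the
# sample rows are made up to mirror the OP's columns.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE master_table
               (title TEXT, forename TEXT, surname TEXT,
                address_1 TEXT, home_phone TEXT, town TEXT)""")
con.executemany("INSERT INTO master_table VALUES (?,?,?,?,?,?)", [
    ("mr", "Mohan", "Krishnan", "42,MonksRD",   "1234567891", "coventry"),
    (None, "jason", "Rosss",    "13,CharterRD", "1234567899", "birmingham"),
    (None, "Mohan", "Krishnan", "42,MonksRD",   "1234567891", "coventry"),
])

# Group on the fields you decide define a duplicate, then keep only the
# groups that occur more than once.
dupes = con.execute("""
    SELECT forename, surname, address_1, COUNT(*) AS n
    FROM master_table
    GROUP BY forename, surname, address_1
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('Mohan', 'Krishnan', '42,MonksRD', 2)]
```

Whatever fields you group by become your working definition of "duplicate", so settle that question first.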
 

brandonbull

Diamond Member
May 3, 2005
6,330
1,203
126
Your boss might want to supply you with some better business rules around defining "duplicates" before you delete records.

 

brandonb

Diamond Member
Oct 17, 2006
3,731
2
0
I write software for bill collections; we define a dupe as someone who matches one of these 4 criteria:

Last Name + SSN
First Name + Last Name + Address 1 + Address 2 + Zip
First Name + Last Name + Phone
First Name + Last Name + Address 1 + Address 2 + City

But better to ask your boss!
 

GeekDrew

Diamond Member
Jun 7, 2000
9,100
13
81
Originally posted by: brandonb
I write software for bill collections, we define a dupe as someone who matches one of the 4 criteria:

Last Name + SSN
First Name + Last Name + Address 1 + Address 2 + Zip
First Name + Last Name + Phone
First Name + Last Name + Address 1 + Address 2 + City

But better to ask your boss!

Holy crap... I can imagine quite a few real-life scenarios, that I've personally witnessed, that violate those conditions.
 

EmpHector

Junior Member
Aug 10, 2009
10
0
0
select Surname, count(*) from Master_Table where ADDRESS_1 like ADDRESS_1 and forename like forename and HOME_PHONE=HOME_PHONE group by Surname HAVING count(*) > 1;

I found that this query counts how many times records are duplicated in the table where ADDRESS_1, forename, and HOME_PHONE match each other, but I don't know how to list the duplicated records so that I can delete them ....
 
Nov 7, 2000
16,404
3
81
Originally posted by: GeekDrew
Originally posted by: brandonb
I write software for bill collections, we define a dupe as someone who matches one of the 4 criteria:

Last Name + SSN
First Name + Last Name + Address 1 + Address 2 + Zip
First Name + Last Name + Phone
First Name + Last Name + Address 1 + Address 2 + City

But better to ask your boss!

Holy crap... I can imagine quite a few real-life scenarios, that I've personally witnessed, that violate those conditions.
And everyone freaking hates collectors, coincidence?!
 
Nov 7, 2000
16,404
3
81
Originally posted by: EmpHector
select Surname, count(*) from Master_Table where ADDRESS_1 like ADDRESS_1 and forename like forename and HOME_PHONE=HOME_PHONE group by Surname HAVING count(*) > 1;

I found that this query counts how many times records are duplicated in the table where ADDRESS_1, forename, and HOME_PHONE match each other, but I don't know how to list the duplicated records so that I can delete them ....

Unless I'm crazy, that query will only count the number of records where a surname is repeated.
 

EmpHector

Junior Member
Aug 10, 2009
10
0
0
@above

Yeah, I could see that it would just show me the number of duplicated records in the table, but I need the query to list all of the rows whose address_1, forename, and surname match, with all of their columns listed.....
 
Nov 7, 2000
16,404
3
81
Alright, well let's say you do have records that are duplicates based on those 3 columns: which one of them do you want to keep? Does it matter?

Anyway, try this:

select rhs.* from master_table lhs inner join master_table rhs on lhs.surname = rhs.surname and lhs.forename = rhs.forename and lhs.address_1 = rhs.address_1 and lhs.rowid <> rhs.rowid;

I think you will find it doesn't actually solve your problem, but that's what you are asking for...

That joins the table to itself where a record matches a different record on those criteria. The last part of the join clause keeps a record from matching itself. You will get some record explosion for frequently duplicated values. And again, this doesn't solve the problem of choosing which records to keep and which ones to drop.
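If the answer is "keep any one of them", a common pattern is to delete every row except the one with the lowest rowid in each group. A sketch using SQLite's implicit rowid via Python, with made-up sample data (the same NOT IN pattern is commonly shown for Oracle, whose tables also expose a rowid):

```python
import sqlite3

# Sketch of the "keep the lowest rowid per group" delete, using SQLite's
# implicit rowid; the sample rows are made up.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE master_table
               (forename TEXT, surname TEXT, address_1 TEXT)""")
con.executemany("INSERT INTO master_table VALUES (?,?,?)", [
    ("Mohan", "Krishnan", "42,MonksRD"),
    ("jason", "Rosss",    "13,CharterRD"),
    ("Mohan", "Krishnan", "42,MonksRD"),
])

# Keep one arbitrary survivor (the lowest rowid) per duplicate group,
# delete everything else.
con.execute("""
    DELETE FROM master_table
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM master_table
        GROUP BY forename, surname, address_1)
""")
rows = con.execute("SELECT forename FROM master_table ORDER BY rowid").fetchall()
print(rows)  # [('Mohan',), ('jason',)]
```

Note this keeps an arbitrary survivor; it does nothing clever about which title or phone number is the "right" one.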
 

KIAman

Diamond Member
Mar 7, 2001
3,342
23
81
This is why I stress validation on entry for my projects. Essentially, garbage in = garbage out.

In your case, there are several methods of identifying duplicates: for each row, count the number of rows that match it on whichever column criteria you choose, into a temp table. Then remove the rows from the original table that had a count greater than 1, and insert a single row back from the temp table.
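That temp-table approach can be sketched like this, again via Python's sqlite3 with made-up rows: pull one representative of each duplicated group into a temp table, delete the whole group from the original table, then insert the representative back.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE master_table
               (forename TEXT, surname TEXT, address_1 TEXT)""")
con.executemany("INSERT INTO master_table VALUES (?,?,?)", [
    ("Mohan", "Krishnan", "42,MonksRD"),
    ("Mohan", "Krishnan", "42,MonksRD"),
    ("jason", "Rosss",    "13,CharterRD"),
])

# 1. One representative per duplicated group goes into a temp table.
con.execute("""
    CREATE TEMP TABLE dupes AS
    SELECT forename, surname, address_1
    FROM master_table
    GROUP BY forename, surname, address_1
    HAVING COUNT(*) > 1
""")
# 2. Delete every member of those groups from the original table.
con.execute("""
    DELETE FROM master_table
    WHERE EXISTS (SELECT 1 FROM dupes d
                  WHERE d.forename  = master_table.forename
                    AND d.surname   = master_table.surname
                    AND d.address_1 = master_table.address_1)
""")
# 3. Put the single representative back.
con.execute("INSERT INTO master_table SELECT * FROM dupes")

total = con.execute("SELECT COUNT(*) FROM master_table").fetchone()[0]
print(total)  # 2
```

With 12 million rows you would want indexes on the matching columns before running either the grouping or the correlated delete.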

The hard part is

1. Determining what column(s) match equate to a duplicate
- don't be surprised to find name mismatches as well 'Mohan Krishnann' for example
2. Determining which duplicate row to keep as the "unique" row

And in the end, until the data entry portion is resolved, you will have a lifetime job on your hands.

If this were me, I wouldn't even take this job considering the information you have. I would explain the risks and only if the risks are signed off, I would go about removing duplicates. Good luck.
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
This probably is not going to help the OP, but I have dealt with a few projects where this was a business requirement (and realistically, desirable). Within the data warehousing realm, we have used Integration Services' Fuzzy Lookup transformation. We don't necessarily delete/cleanse the data ourselves: we ask the customers to "set" their cutoff threshold [for example, if the confidence and similarity are above 80%, treat the rows as duplicate].

Again, not a foolproof solution, but quite effective if you're dealing with millions of rows.
 

txrandom

Diamond Member
Aug 15, 2004
3,773
0
71
Originally posted by: brandonb
I write software for bill collections, we define a dupe as someone who matches one of the 4 criteria:

Last Name + SSN
First Name + Last Name + Address 1 + Address 2 + Zip
First Name + Last Name + Phone
First Name + Last Name + Address 1 + Address 2 + City

But better to ask your boss!

Damn it, why am I getting my Dad's bills.
 

BoberFett

Lifer
Oct 9, 1999
37,563
9
81
Originally posted by: KIAman
This is why I stress validation on entry for my projects. Essentially, garbage in = garbage out.

I'm guessing in this case there was no input therefore no validation. It's a bulk mailing list they purchased, if I had to guess.
 

KIAman

Diamond Member
Mar 7, 2001
3,342
23
81
Originally posted by: BoberFett
Originally posted by: KIAman
This is why I stress validation on entry for my projects. Essentially, garbage in = garbage out.

I'm guessing in this case there was no input therefore no validation. It's a bulk mailing list they purchased, if I had to guess.

If that were so, I'd run this through an address scrubber based off a standardized database, like USPS. Looking at the OP's list of examples, I don't think that was done (west midlands, westmidlands, for ex).
 