Thursday, June 23, 2011

Database Development - Row Duplication 1

CREATE PROCEDURE [dbo].[usp_blg_vkj_0000001] AS
BEGIN

--Row duplication caused by a join to a non-unique index./* Data problems can often be identified without actually checking data as a developer can sometimes recognize
when data returned would be invalid just by reviewing the sql used to return the data. I have already demonstrated row duplication caused by Cartesian or cross joins in a previous blog article so I will only review two other examples here.

This example demonstrates row duplication caused by a join to a non-unique index or row duplication
caused by joining a base table foreign key to what is supposed to be the primary key of a CodeListBaseTable or look up table. For this example we have not implemented a unique index on the CodeListBaseTable. */

-------------------------------------------------------------------------
-- Create schema.
-------------------------------------------------------------------------
IF NOT EXISTS (SELECT * FROM sys.schemas WHERE name = N'BLOG1')
 EXEC sys.sp_executesql N'CREATE SCHEMA [BLOG1] AUTHORIZATION [dbo]'

-- Populate our BaseTable.
IF NOT OBJECT_ID('BLOG1.BaseTable') IS NULL DROP TABLE BLOG1.BaseTable

CREATE TABLE BLOG1.BaseTable(RecId integer, ListCode integer)

INSERT INTO BLOG1.BaseTable(RecId, ListCode )
SELECT RecId = 1, ListCode = 1
UNION ALL SELECT RecId = 2, ListCode = 2
UNION ALL SELECT RecId = 3, ListCode = 3

-- Populate the coldelist table.
IF NOT OBJECT_ID('BLOG1.CodeList1') IS NULL DROP TABLE BLOG1.CodeList1

CREATE TABLE BLOG1.CodeList1(CodeID integer, CodeValue varchar(50))

INSERT INTO BLOG1.CodeList1(CodeID , CodeValue )
SELECT CodeID = 1, CodeValue = 'CodeList1Value1'
UNION ALL SELECT CodeID = 2, CodeValue = 'CodeList1Value2'
UNION ALL SELECT CodeID = 3, CodeValue = 'CodeList1Value3'

-- Populate the coldelist table.
IF NOT OBJECT_ID('BLOG1.CodeList2') IS NULL DROP TABLE BLOG1.CodeList2

CREATE TABLE BLOG1.CodeList2(CodeID integer, CodeValue varchar(50))

INSERT INTO BLOG1.CodeList2(CodeID , CodeValue )
SELECT CodeID = 1, CodeValue = 'CodeList2Value1'
UNION ALL SELECT CodeID = 2, CodeValue = 'CodeList2Value2'
UNION ALL SELECT CodeID = 3, CodeValue = 'CodeList2Value3'

/* The query below returns good data but since there is no unique record constraint
on the BaseTable table there is still a potential problem. */

SELECT BaseTable.RecId , BaseTable.ListCode , Join1.CodeValue
FROM BLOG1.BaseTable BaseTable
 left join BLOG1.CodeList1 Join1 ON BaseTable.ListCode = Join1.CodeID

/*
RecId       ListCode    CodeValue
----------- ----------- --------------------------------------------------
1           1           CodeList1Value1
2           2           CodeList1Value2
3           3           CodeList1Value3

(3 row(s) affected)

Now let's add a duplicate Code ID in the BaseTable table to see what happens.
In real life a unique constraint should exist so this can not occur.

Note: Normally you should have a unique index on both the code Id and code value.
A common mistake is to just put one on the code id and let duplicates occur in the code value.
*/

--Modify the data for our next example.
IF NOT OBJECT_ID('BLOG1.CodeList1') IS NULL DROP TABLE BLOG1.CodeList1

CREATE TABLE BLOG1.CodeList1(CodeID integer, CodeValue varchar(50))

INSERT INTO BLOG1.CodeList1(CodeID , CodeValue )
SELECT CodeID = 1, CodeValue = 'CodeList1Value1'
UNION ALL SELECT CodeID = 2, CodeValue = 'CodeList1Value2'
UNION ALL SELECT CodeID = 3, CodeValue = 'CodeList1Value3'
UNION ALL SELECT CodeID = 1, CodeValue = 'CodeList1Value4'

/*
SELECT * FROM BLOG1.CodeList1 BaseTable

CodeID      CodeValue
----------- --------------------------------------------------
1           CodeList1Value1
2           CodeList1Value2
3           CodeList1Value3
1           CodeList1Value4
(4 row(s) affected)
*/

--Modify the data for our next example.IF NOT OBJECT_ID('CodeList2') IS NULL DROP TABLE CodeList2

IF NOT OBJECT_ID('BLOG1.CodeList2') IS NULL DROP TABLE BLOG1.CodeList2

CREATE TABLE BLOG1.CodeList2(CodeID integer, CodeValue varchar(50))

INSERT INTO BLOG1.CodeList2(CodeID , CodeValue )
SELECT CodeID = 1, CodeValue = 'CodeList2Value1'
UNION ALL SELECT CodeID = 2, CodeValue = 'CodeList2Value2'
UNION ALL SELECT CodeID = 3, CodeValue = 'CodeList2Value3'
UNION ALL SELECT CodeID = 1, CodeValue = 'CodeList2Value4'

/*

SELECT * FROM BLOG1.CodeList2 BaseTable

CodeID      CodeValue
----------- --------------------------------------------------
1           CodeList2Value1
2           CodeList2Value2
3           CodeList2Value3
1           CodeList2Value4

(4 row(s) affected)

Row duplication occurs in the following example since we no longer have a unique value
on either side of the join. There are only 3 records in the base table however row duplication
returns more records. */

SELECT
    BaseTable.RecId
   , BaseTable.ListCode
   , Join1.CodeValue
-- BaseTable contains 3 records
FROM BLOG1.BaseTable BaseTable
-- join1: CodeListBaseTable contains 4 records with one duplicate code.
    left join BLOG1.CodeList1 Join1 WITH (NOLOCK) ON BaseTable.ListCode = Join1.CodeID

 /*

RecId       ListCode    CodeValue
----------- ----------- --------------------------------------------------
1           1           CodeList1Value1
1           1           CodeList1Value4
2           2           CodeList1Value2
3           3           CodeList1Value3

(4 row(s) affected)

Now see what happens when you not only are linking without a unique ID on either side of join1 but are also linking without a unique ID on either side of join2. Row duplication occurs in the following example since we no longer have a unique value.
There are only 3 records in the base table however row duplication returns more records. */

SELECT 
    BaseTable.RecId
  , BaseTable.ListCode
   , CodeListBaseTableValue = Join1.CodeValue
   , CodeListJoin1Value = Join2.CodeValue
-- BaseTable contains 3 records
FROM BLOG1.BaseTable BaseTable
-- join1: CodeList1 contains 4 records with one duplicate code.
     left join BLOG1.CodeList1 Join1 WITH (NOLOCK) ON BaseTable.ListCode = Join1.CodeID
  -- join2: CodeList2 contains 4 records with one duplicate code
     left join BLOG1.CodeList2 Join2 WITH (NOLOCK) ON BaseTable.ListCode = Join2.CodeID
ORDER BY Join1.CodeValue , Join2.CodeValue

/* There are only 3 records in our base table so it is obvious row duplication is occurring.
We have row duplication occuring from the first join causing the base recordset to return 4 records.
The reason for the row duplication is due to the join not being based on the base table foreing key
value joined to code table primary keys. At least one side of the join has to identify a single record.
We also have row duplication occuring from the second join for the same reason. Row duplication is not only being caused by the joins not being based on the base table foreing key values joined to code table primary keys but also because CodeList1 and CodeList2 are both linking on the same join criteria which is not allowed in SQL unless only one record is returned for each join. These joins would be fine and the normal way to query a codelist table if we put a primary keys and unique record constraints on our codelist tables.

RecId       ListCode    CodeListBaseTableValue                             CodeListJoin1Value
----------- ----------- -------------------------------------------------- --------------------------------------------------
1           1           CodeList1Value1                                    CodeList2Value1
1           1           CodeList1Value1                                    CodeList2Value4
2           2           CodeList1Value2                                    CodeList2Value2
3           3           CodeList1Value3                                    CodeList2Value3
1           1           CodeList1Value4                                    CodeList2Value1
1           1           CodeList1Value4                                    CodeList2Value4

(6 row(s) affected)

*/

END

No comments:

Post a Comment