
Wednesday, April 18, 2012

Natural Key Unique Record Constraint

The single most important aspect of table design, view design and handling data in general is a real data/natural key unique record constraint/definition (URIX), which defines and enforces data granularity: what a row in the table actually represents. If a natural key unique record constraint hasn't been defined, then data aggregated from the table is meaningless. Additionally, without a proper unique record constraint the table data could be duplicated, or worse, and no one might ever recognize the duplicates because what constitutes a duplicate was never defined. SQL GROUP BY statements can be used to determine the natural key unique record constraint for a table.

With traditional MS Access applications the only way to implement a unique record constraint is through a primary key (PK) or composite primary key, which increases the complexity of designing Access applications and is seldom done. Referential integrity is normally set up using an auto-number/surrogate key as the PK, which adds some confusion to the purpose of the primary key, as it no longer serves as a unique record constraint. While using a surrogate key to implement referential integrity may be optional, having a natural key unique record constraint is mandatory if you want to ensure data integrity.

Fortunately, in SQL Server the unique record constraint can be implemented as a unique index or unique composite index, while the surrogate key used to configure referential integrity remains the actual primary key. This allows developers to locate a specific record with a WHERE clause specifying a single value and reduces the links between tables in schema diagrams. More complicated data constraints are implemented using table triggers, such as allowing only one record in a defined set of records to be marked as the active or primary record (a primary phone number or primary address, for example); unfortunately, in MS Access this can only be done at the form level. Implementing unique record constraints in MS Access web applications takes yet another approach: the primary key has to be a single surrogate key, and unique record constraints are implemented through data macros similar to SQL Server triggers.
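
A minimal sketch of this layout, using a hypothetical dbo.tblPhoneNumber table (all names here are placeholders): the surrogate key is the clustered primary key used for referential integrity, while a unique constraint on the natural key enforces the URIX.

-- Hypothetical example: surrogate PK for referential integrity plus a
-- natural key unique record constraint (URIX) defining data granularity.
CREATE TABLE dbo.tblPhoneNumber
(
      PhoneNumberID int IDENTITY(1,1) NOT NULL
    , CustomerID    int         NOT NULL  -- FK to a customer table
    , PhoneNumber   varchar(20) NOT NULL
    , PhoneType     varchar(10) NOT NULL  -- e.g. Home, Work, Mobile
    , IsPrimary     bit         NOT NULL CONSTRAINT DF_tblPhoneNumber_IsPrimary DEFAULT (0)
      -- Single value record access and simple schema links come from the surrogate PK.
    , CONSTRAINT PK_tblPhoneNumber PRIMARY KEY CLUSTERED (PhoneNumberID)
      -- The URIX: defines what a duplicate row would be and prevents it.
    , CONSTRAINT UIX_tblPhoneNumber UNIQUE (CustomerID, PhoneNumber, PhoneType)
);

The "only one primary phone number per customer" rule could then be enforced with a table trigger as described above (or, on SQL Server 2008 and later, with a filtered unique index on CustomerID where IsPrimary = 1).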

As should be obvious by now, the main goal of this article is to stress the importance of real data/natural key unique record constraints (URIX). Unfortunately this relatively simple concept is often overlooked and is likely the main cause of garbage data. The URIX is also the most important aspect of joining tables in SQL statements and views. Since the URIX defines the data contained in a table, when two tables are joined (foreign key (FK) to primary key (PK)) to create a view and the URIX/PK of the second table is not fully contained in the URIX of the first table, the URIX of the query or view becomes a combination of the URIX of the first table plus any new columns from the PK/URIX of the second table. Assuming the second table is joined using a left outer join, the record count of the view will be the record count of the first table plus, for every linked value in the second table with more than one match, the number of matching records beyond the first. In short, the URIX is the single most important aspect of both tables and SQL statements and, if not understood and used correctly, the cause of row duplication in SQL statements.
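
For example, assuming hypothetical tblOrder and tblOrderLine tables joined on OrderID, the following check shows which rows of the first table a left outer join will repeat:

-- Any OrderID returned here has more than one matching row in tblOrderLine,
-- so a view that left outer joins tblOrder to tblOrderLine on OrderID alone
-- will repeat those orders once per matching line, because the URIX of
-- tblOrderLine (OrderID, LineNumber) is not contained in the URIX of tblOrder.
SELECT ol.OrderID , Recs = COUNT(*)
FROM dbo.tblOrderLine AS ol
GROUP BY ol.OrderID
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC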

       

Friday, March 23, 2012

Configuring Referential Integrity

Find a table 'tblTableName' in the database that doesn't have any referential integrity configured. Such tables appear with no links in the diagram window and have no right-click menu option to add related tables.

Create a GROUP BY query on the table to analyze the data. Look for anything that appears to be a unique record constraint based on real data (no auto-numbers). Either include all columns to start and cut away small pieces, or start with a single column and add more until you find the unique record constraint based on real data. Unique records will appear as HAVING COUNT(*) = 1.

SELECT [ColumnName] , Recs = COUNT(*)
FROM [tblTableName]
GROUP BY [ColumnName]
--HAVING COUNT(*) = 1
ORDER BY COUNT(*) DESC

A 1
B 1
C 1

We only have one of each value so this is probably a primary table that will be linked to through foreign keys of other tables.

I use my code to search for tables containing a column named 'ColumnName'.
I'm using wild cards to look for any column somewhat matching the text we are searching for.

-- Find it somewhere!
EXEC dbatools.dbo.usp_srvr_object_find
@CSVServerNameListInclude = @@SERVERNAME
, @CSVDatabaseNameListInclude = 'DatabaseName'
, @CSVObjectTypeList ='TABLE_COLUMN'
, @CSVWildCardObjectNameList = '%ColumnName%'

Open a diagram and try linking the tables and saving. The save will only work if the data is related.
Even one missing value will keep the save from completing. Records/Values will either need to be added to the primary table or removed from the foreign table before RI can be configured.
After the data is reconciled (deleted or added) try linking the tables again.
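
A hedged sketch of that reconciliation check (table and column names are placeholders): list the foreign key values that have no match in the primary table, since those are exactly the rows that will block the save.

-- Orphaned foreign key values that will keep the relationship from saving.
-- Either add these values to the primary table or remove/correct the child rows.
SELECT f.[ColumnName] , Recs = COUNT(*)
FROM [tblForeignTable] AS f
LEFT OUTER JOIN [tblTableName] AS p
    ON p.[ColumnName] = f.[ColumnName]
WHERE p.[ColumnName] IS NULL
  AND f.[ColumnName] IS NOT NULL   -- NULL foreign keys do not block referential integrity
GROUP BY f.[ColumnName]
ORDER BY COUNT(*) DESC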

Another method is to search the database for data matching our primary table's primary key.
I use my DBATools.usp_srvr_db_table_data_exists procedure to search the database for
references to a few code values. I don't use the @UseLike feature as I am looking for
specific code values.

-- Find data in any table.
EXEC DBATools.dbo.usp_srvr_db_table_data_exists
@ServerName = @@SERVERNAME
, @DatabaseName ='DatabaseName'
, @EscapeCharacter = '~'
, @SearchText = 'A'
, @UseLike = 0

I would then look at how many times each value from the primary table occurs in the other tables.
Any table with one or more matches could potentially be related. Any table with just one match per value could be the primary table.

SELECT [ColumnName] , Recs = COUNT(*)
FROM [tblTableName]                                                      
GROUP BY [ColumnName]
--HAVING COUNT(*) = 1
--HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC

Next I would make educated guesses to whittle down the tables I think may contain foreign key references to this data. Look at both the table names and the column names; some will obviously not be related. You can also look through the code of procedures that use the tables in question, as existing code may show you how the data is being used.

-- Find it somewhere!
EXEC dbatools.dbo.usp_srvr_object_find
@CSVServerNameListInclude = @@SERVERNAME
, @CSVDatabaseNameListInclude = 'DatabaseName'
, @CSVObjectTypeList ='TABLES,TABLE_COLUMN,PROCEDURES,PROCEDURE_CODE,SYNONYMS,PROCEDURE_CODE2'
, @CSVWildCardObjectNameList = '%tblTableName%'


As we find data we want to configure RI by linking 'like' columns in the SSMS diagram designer.
Denormalized data should never be linked to anything but a code table, so we may end up having to manually create and populate some code tables. You can always try linking the columns to confirm or deny a data relationship, although this may not be practical on very large tables. If your data is from a legacy (and old) system you may have a lot of multiple column RI relationships.
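
As a hedged sketch of creating and populating one of those code tables (names are placeholders, and it assumes the denormalized column is a plain character type):

-- Build a code table from the distinct denormalized values, then point the
-- original column at it with a foreign key.
SELECT DISTINCT ColorName = ISNULL([Color], '')   -- ISNULL keeps the new column NOT NULL
INTO dbo.tblColorCode
FROM dbo.tblLegacyData
WHERE [Color] IS NOT NULL;

ALTER TABLE dbo.tblColorCode
    ADD CONSTRAINT PK_tblColorCode PRIMARY KEY CLUSTERED (ColorName);

-- WITH CHECK validates the existing data; the statement fails if any value in
-- tblLegacyData is missing from the new code table, which confirms or denies
-- the relationship just like linking the columns in the diagram designer.
ALTER TABLE dbo.tblLegacyData WITH CHECK
    ADD CONSTRAINT FK_tblLegacyData_tblColorCode
    FOREIGN KEY ([Color]) REFERENCES dbo.tblColorCode (ColorName);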

Please note that configuring referential integrity is also covered in my post on legacy database conversions, as the two processes are similar in some cases.

Thursday, March 15, 2012

Database Development - General

1. For Microsoft development I prefer a unique record constraint built on a natural key plus a surrogate key as the primary key.

A. This way we can always identify a record with one single unique value, which simplifies development, SQL statements and audit log strategies, since schemas and SQL statements will be composed of single-column joins between tables.
B. We have defined what is in our table with the unique record constraint built on a natural key. Without this, aggregate figures are meaningless or at minimum undefined.
C. For performance and index maintenance reasons, using a surrogate key as the CLUSTERED primary key usually provides the best generic performance for user interfaces.
D. Microsoft's SharePoint Access Services requires a single column primary key, and referential integrity is enforced through the use of data macros, so our design strategy is somewhat compatible with SharePoint Access databases.
E. I would stress that, whether it is the primary key or just a unique record constraint, the single most important aspect of table design is a unique record constraint built on a natural key.

  1. Review existing stored procedures and replace cursor based operations with set based operations
  2. Make sure there are no semi-random or repetitive updates. One way to check for semi-random updates is to run the population procedure twice, populating two copies of the same table from the same source data; if there are any discrepancies between the resulting tables, it is likely that there are semi-random updates in the population code (see the sketch after this list).
  3. Make sure stored procedures are designed with flexible input parameters to eliminate the need for multiple procedures that perform the same basic operations.  This minimizes maintenance and debugging time.
  4. With reporting stored procedures I like to standardize the input parameters and include a dynamic date range id.
  5. Set up fully automated server side tracing and execution of the index wizard to analyze and automatically apply or just email suggested index additions or deletions. 
  6. Ensure there are statistics on all tables. Under certain circumstances the SQL Server 2000 sp_createstats procedure will die without an error message and not create statistics on all tables. These missing statistics need to be identified and created to ensure efficient SQL Server execution plans.
  7. Set up audit logging on all tables. On SQL Server versions prior to 2008 I do this with a stored procedure that dynamically creates and applies audit log triggers on all tables and uses two stages to populate the final audit log table, so the performance hit of full audit logging is minimized. SQL Server 2008 greatly simplifies this with a new feature called Change Data Capture, which is much more easily implemented.
  8. To encapsulate business logic and reduce maintenance I like to create user defined functions and/or tables to perform repeatable operations.  This usually involves some tradeoff between performance and maintenance so it must be balanced according to the needs of the company and the amount of data being processed.
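
A hedged sketch of the comparison from item 2, assuming the population procedure has been run twice into two hypothetical copies of the target table:

-- Rows present in one run but not the other indicate semi-random updates.
-- List the real data columns explicitly; exclude identity and audit columns,
-- which legitimately differ between runs.
SELECT [PERIOD], [ENTITYID], [AMT] FROM dbo.tblSummary_Run1
EXCEPT
SELECT [PERIOD], [ENTITYID], [AMT] FROM dbo.tblSummary_Run2

UNION ALL

(
SELECT [PERIOD], [ENTITYID], [AMT] FROM dbo.tblSummary_Run2
EXCEPT
SELECT [PERIOD], [ENTITYID], [AMT] FROM dbo.tblSummary_Run1
);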

Legacy Database Conversions

Step 1: Import the data into SQL Server. The SSIS wizard works well for quick raw imports; if given data in flat files I would use BCP. Make sure to ask for any readable, searchable code of any sort from the legacy system. I would want a copy of it, as it may contain business logic used in the application that may need to be converted.




Step 2: We need to create natural key unique indexes on all of our data. For this we need to determine the existing record definition of each table by identifying its natural key unique record constraint with SQL GROUP BY statements, excluding any auto-numbering columns. This is mainframe data, so we will probably have to strip extended characters from some of the data and trim all of it.


Ex.
SELECT Recs = COUNT(*)
, [PERIOD]
,[ENTITYID]
,[AMT]
,[DAYS]
,[DESCRPN]
FROM [dbo].[tblLotsOfData]
GROUP BY
[PERIOD]
,[ENTITYID]
,[AMT]
,[DAYS]
,[DESCRPN]
HAVING COUNT(*) > 1


If we have duplicates then that is not the real data unique record constraint. You can either start with all the columns, or make educated guesses and add columns until the grouping is unique. There may still be genuine duplicates in the source data: a table with a million rows and only 20 duplicates is probably a table with duplicate data in it rather than a table missing part of its key. Any duplicate data will need to be moved out of the table so a natural key (probably concatenated) unique record constraint can be placed on the table, as sketched below. At some point the data removed from the table will need to be reconciled.
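
A hedged sketch of that cleanup, using the example columns above (the holding table name is a placeholder and the key columns are assumed to be non-nullable):

-- 1. Copy every row whose natural key is duplicated into a holding table
--    for later reconciliation.
SELECT d.*
INTO dbo.tblLotsOfData_Duplicates
FROM dbo.tblLotsOfData AS d
WHERE EXISTS
(
    SELECT 1
    FROM dbo.tblLotsOfData AS d2
    WHERE d2.[PERIOD]   = d.[PERIOD]
      AND d2.[ENTITYID] = d.[ENTITYID]
      AND d2.[AMT]      = d.[AMT]
      AND d2.[DAYS]     = d.[DAYS]
      AND d2.[DESCRPN]  = d.[DESCRPN]
    HAVING COUNT(*) > 1
);

-- 2. Remove those rows from the source table.
DELETE d
FROM dbo.tblLotsOfData AS d
WHERE EXISTS
(
    SELECT 1
    FROM dbo.tblLotsOfData_Duplicates AS dup
    WHERE dup.[PERIOD]   = d.[PERIOD]
      AND dup.[ENTITYID] = d.[ENTITYID]
      AND dup.[AMT]      = d.[AMT]
      AND dup.[DAYS]     = d.[DAYS]
      AND dup.[DESCRPN]  = d.[DESCRPN]
);

-- 3. The natural key unique record constraint (URIX) can now be created.
CREATE UNIQUE INDEX UIX_tblLotsOfData
    ON dbo.tblLotsOfData ([PERIOD], [ENTITYID], [AMT], [DAYS], [DESCRPN]);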

Notes:

--Object:  Find the most granular level of uniqueness of a record as described by the supporting SQL statement.
-- Use selectivity data spreadsheet for data population information.
--Script the table and check for surrogate keys to exclude from the groupings.
-- Exclude surrogate keys from the result grouping.

-- The next are general rules.
-- A primary key column must be fully populated, so look at the data; in general you would (but don't have to) exclude columns that are not fully populated.
-- Exclude bit columns, or convert them to integers for groupings.
-- Exclude text columns (as opposed to char or varchar) and binary columns. They will cause errors, as they are not allowed in groupings or keys.
-- Exclude columns such as note, comment, user, last date, Y/N and bit columns if they are obviously just properties of a record (such as colors or lengths) and are not likely primary key candidates.

If duplication is found, consider that there may be missing datetime or other data elements to add to your unique record constraint. The duplication could also be bad records that do not contain enough information to properly identify anything.

The single most important aspect of table design is implementing a unique record constraint based on meaningful data. Without adequate unique record identifiers we would have to perform multiple searches to identify a single record and would not be able to enforce data integrity through a unique index or constraint. Additionally, if the contents of a table have not been defined then the data can't be used for any valid metrics. Proper constraints on the table, such as defaults for the user and LastDate columns, will prevent the loss of data. And without a proper unique record constraint the table data could be duplicated, or worse, and no one might ever recognize the duplicates because what constitutes a duplicate was never defined.

Step 3: We need to analyze the data and look for denormalized data spread across different tables; my DBATools database functions are very helpful for doing this in an automated way.
----------------------------------------------------------------------------------------
-- Find data in any table
----------------------------------------------------------------------------------------
EXEC DBATools.dbo.usp_srvr_db_table_data_exists
@ServerName = @@SERVERNAME
, @DatabaseName ='DatabaseName'
--, @csvNoSchemaTableList = 'tb~_%'
--, @csvNoSchemaTableExcludeList = 'tblName1'
, @EscapeCharacter = '~'
, @SearchText = 'Amount'
, @UseLike = 1

-- Find other things!
EXEC dbatools.dbo.usp_srvr_object_find
@CSVServerNameListInclude = @@SERVERNAME
, @CSVDatabaseNameListInclude = 'DBATools'
, @CSVObjectTypeList ='TABLES,TABLE_COLUMN,PROCEDURES,PROCEDURE_CODE,SYNONYMS,PROCEDURE_CODE2'
, @CSVWildCardObjectNameList = '%tblColumnSelectivity%'


Step 4: Reproduce the table relationships (mainframe). This will help us identify the duplicate data that we will most likely find. We can use the procedure calls above to help, and also look for similar column names in different tables. Hopefully, for each such column, we will find one table where it is the primary key (the primary or code/lookup table); in the others it will be a foreign key referencing that primary key. We can use the INFORMATION_SCHEMA.COLUMNS system view to find candidate relationships. They will not necessarily be related, but they very likely could be related data elements; the actual data will show us whether they are indeed related. We may have to build code or lookup tables using the lookup values as temporary code names.

-- List every table containing the column name we are investigating, limited to
-- column names that appear in more than one table.
SELECT
COLUMN_NAME = TABLE_CATALOG + '.' + TABLE_SCHEMA + '.' + TABLE_NAME + '.' + COLUMN_NAME
+ ' ' + DATA_TYPE + COALESCE('(' + CAST(CHARACTER_MAXIMUM_LENGTH as varchar) + ')' ,'')
+ ' <' + CAST(IS_NULLABLE as varchar) + '>'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME IN(
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
GROUP BY COLUMN_NAME
HAVING COUNT(*) > 1
)
AND COLUMN_NAME = 'name'

COLUMN_NAME
-----------------------------------
master.dbo.spt_fallback_db.name varchar(30) <NO>
master.dbo.spt_fallback_dev.name varchar(30) <NO>
master.dbo.spt_values.name nvarchar(35) <YES>

(3 row(s) affected)

Step 5: Identify what we want to alter in the existing design, i.e. make a copy and start modifications.

Step 6: Then, if data is being relocated, we need to create any new tables and map the data to the related columns.

Step 7: Determine and write the business logic in SQL Server functions and procedures. The key here is to write the procedures so they can be used for both the reports and the user interface, which significantly reduces the necessary QA. We would want to create templates for select, insert, update and delete procedures and use them to generate our procedures dynamically. This is a time saver, but there will be plenty of procedures that can't be genericized.

Step 8: Write the interface and reports. Data sources for the interface and reports would come from parameterized procedures. The procedures would run SQL statements calling user defined table functions, tables or some combination thereof. I prefer including data aggregation in the procedures so that they are independent of the implementation outside of SQL Server, and also so results can be recalculated using new parameters, which can't be done in views alone or in 2005 report model projects.
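
A hedged sketch of what one of these parameterized procedures might look like (the procedure name, parameters and date range logic are placeholders, and it assumes [PERIOD] is a date column and [AMT] is numeric):

-- Flexible parameters let the same procedure serve the UI and the reports.
-- @DateRangeID resolves a predefined rolling range; explicit dates override it.
CREATE PROCEDURE dbo.usp_rpt_EntityAmountSummary
      @EntityID    int      = NULL   -- NULL = all entities
    , @DateRangeID int      = NULL   -- e.g. 1 = current month to date
    , @StartDate   datetime = NULL
    , @EndDate     datetime = NULL
AS
BEGIN
    SET NOCOUNT ON;

    -- Resolve the dynamic date range id into an actual date window.
    IF @DateRangeID = 1
        SELECT @StartDate = DATEADD(month, DATEDIFF(month, 0, GETDATE()), 0)
             , @EndDate   = GETDATE();

    -- Aggregation happens inside the procedure so results can be recalculated
    -- with new parameters, independently of the reporting tool or UI.
    SELECT d.[ENTITYID]
         , TotalAmt = SUM(d.[AMT])
         , Recs     = COUNT(*)
    FROM dbo.tblLotsOfData AS d
    WHERE (@EntityID  IS NULL OR d.[ENTITYID] = @EntityID)
      AND (@StartDate IS NULL OR d.[PERIOD]  >= @StartDate)
      AND (@EndDate   IS NULL OR d.[PERIOD]  <= @EndDate)
    GROUP BY d.[ENTITYID];
END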

You may want to review my post on configuring referential integrity for related information.

Thursday, June 9, 2011

Database Development - Quality Assurance.

There is some confusion in the business world about the difference between data quality assurance (QA) and spot checking data. The confusion usually appears after electing for spot checking when QA should have been done. QA and spot checking serve a similar but not identical purpose. Spot checking is usually done by business analysts and involves reviewing a few figures looking for obvious data issues; this is usually what gets me involved in the developer side of data issues. What I do is write sets of SQL Server user defined functions, user defined table functions and stored procedures with matching parameters to check every record and every data element for accuracy. I find it useful to use a processing design that allows variable batch sizes, so I usually code the master procedure as a WHILE loop where each iteration performs a set based SQL operation; a variable can be used in the TOP(n) construct so the number of records processed per iteration can be controlled.

As far as the actual data comparison, I would first want to confirm that historical data had not changed by comparing existing data with a read only historical record of the data. I would then review the new data for glaring errors and have subject area experts (end users) review it. Once the data was accepted by the business analyst it would be appended to the read only historical record, which should allow adjustment records for ongoing balancing. As a developer I put subject area experts' knowledge into business logic, but I am not an actual subject area expert in every industry. I write SQL statements to perform statistical analysis on groupings of alphanumeric text, so it is more important that I understand how to aggregate data correctly than that I be a subject area expert.
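
A minimal sketch of that batching pattern (table and column names are hypothetical): the master loop repeats a set based comparison in batches whose size is controlled by a variable.

-- Compare current data against the read only historical record in batches;
-- @BatchSize drives the TOP(n) construct so the per-iteration record count
-- can be tuned.
DECLARE @BatchSize    int
      , @RowsAffected int;
SET @BatchSize    = 10000;
SET @RowsAffected = 1;

WHILE @RowsAffected > 0
BEGIN
    INSERT INTO dbo.tblQAResult (SourceID, CheckStatus)
    SELECT TOP (@BatchSize)
           s.SourceID
         , CheckStatus = CASE WHEN s.[AMT] = h.[AMT] THEN 'OK' ELSE 'CHANGED' END
    FROM dbo.tblSource  AS s
    JOIN dbo.tblHistory AS h           -- read only historical record of the data
        ON h.SourceID = s.SourceID
    WHERE NOT EXISTS (SELECT 1 FROM dbo.tblQAResult AS r WHERE r.SourceID = s.SourceID);

    SET @RowsAffected = @@ROWCOUNT;
END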

Be aware that you may pay someone to do QA for you but if your interpretation of the definition for QA is not specifically stated in your contract then it's likely your data has only been spot checked by a business analyst.

Monday, June 6, 2011

Database Development - Design Strategy

Using Business Objects correctly will make your application more robust and adaptable. "Business Objects" is a term I use to refer to the SQL Server user defined tables, functions, procedures, triggers and CLRs that encapsulate your business logic. Business Objects are constructed primarily of SQL statements, a relatively simple language; when used correctly they simplify development and maintenance and are relatively transportable. They also keep your business logic in a central location which you can easily script, so you can perform text searches on the SQL based text/code to help with ongoing code review and analysis. Business Objects help separate your business logic from your reporting tool and user interface, and can also reduce your QA efforts if the same Business Objects are used to retrieve data for the user interface as are used to populate your reports. Make sure to write your procedures and user defined table functions so they return zero or more records, so they can be used for both your reports and UI. Use Business Objects whenever possible for data updates and data retrieval operations for the user interface, reporting, SSIS packages, CLRs and database triggers enforcing your business rules.

Now a little about database development. Make sure every table has a clustered index so it can be defragmented automatically. For each table, create a real data unique record constraint and a single column surrogate/identity/auto-number primary key (PK). The single column PK works great for an audit logging strategy since it provides record access through a single value, and it also simplifies your schema diagrams and SQL statements. A real data unique record constraint is often confused with a primary key; while a primary key can serve as both, I prefer to keep them separate unless I'm in MS Access and there isn't a choice. The main purpose of a unique constraint is to define what would constitute a duplicate record and to configure the table so that duplicates are not allowed. If you haven't defined what is in your table, then aggregating the data in it will not be of much use, as the data would just represent something as yet undefined. Counting values in a table without a real data unique record constraint is as meaningless as summing unrelated percentages or aggregating distinct values. In short, put a real data unique record constraint on every table, even most of your temp tables; use a trigger to implement it if you can't do it any other way. Although there may be times when it is necessary to include an identity/auto-number column in the unique record constraint, it should be avoided.

Try not to store the same data in more than one table unless it is audit log or temporal information. If you do store the same data in more than one location (denormalized data), define something about it to differentiate it from the other data, such as the date and time when it was saved. If your data is supposed to stay synchronized with the other data, use referential integrity and create database triggers to enforce the business rule(s). Use referential integrity on all tables, except possibly for some temp tables, and use triggers to enforce business logic where necessary.

Lock your data down and have the ability to store audit logs for all transactions. Audit logging should be done at the lowest level possible, with table triggers; please do not use application code in yet another sad attempt at audit logging. Make sure your audit log tables are not in your application database: audit log data needs to be stored somewhere it won't bloat the database.
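
As a minimal sketch of trigger based audit logging (the table, column and audit database names are placeholders; the dynamically generated triggers mentioned earlier follow the same general shape):

-- Stage 1: the trigger captures changed keys as quickly as possible into a
-- staging table in a separate audit database; a background job can move the
-- rows to the final audit log table later to minimize the performance hit.
CREATE TRIGGER dbo.trg_tblPhoneNumber_Audit
ON dbo.tblPhoneNumber
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO AuditDB.dbo.tblAuditStage (TableName, RecordID, AuditAction, AuditDate)
    SELECT 'tblPhoneNumber', i.PhoneNumberID, 'I/U', GETDATE()
    FROM inserted AS i
    UNION ALL
    SELECT 'tblPhoneNumber', d.PhoneNumberID, 'D', GETDATE()
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.PhoneNumberID = d.PhoneNumberID);
END

The single column surrogate primary key is what keeps this pattern simple: one RecordID value identifies any row in any audited table.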

SSRS is a pretty nice reporting tool for its price, especially if you design your reports using the same SQL objects you populate your user interface with. Use CLR routines for report formatting features so one person can manage hundreds of report formatting changes.

SSIS packages can be adjusted for bandwidth usage, so they can be immensely helpful as long as they are used to call your business objects or fill another need without taking on business logic that belongs in the business objects or CLR functions.

In closing, I would suggest you make sure your SQL developers can identify common SQL errors. I will try to review each of these in detail at a later date, but for now I will just mention a few:

Row duplication caused by a join to a non-unique index.

Joins on denormalized data vs. joins on unique code lists.

Semi-random updates: This is where there is one record to be updated but the SQL statement specifying the value to be used in the update returns multiple values. Obviously only one value can be used to update one column of one row.

Omission of data due to inner joins.

Omission of data due to active/inactive flags instead of origination/acquired dates and maturity/disposed dates, etc.

Omission of data caused by putting criteria on left outer joined table columns in the WHERE clause instead of the JOIN clause (see the sketch at the end of this list).

Omission of data by joining outer or inner joined tables to each other instead of to a root data set containing all possible keys for both tables.

Summing unrelated percentages.

Aggregating distinct values.

Excluding data in calculations due to bad join and where clause criteria.
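
To illustrate the outer join filtering error mentioned above, a short hedged sketch (table names are placeholders): criteria on the left outer joined table belong in the JOIN clause, not the WHERE clause.

-- Omits customers with no 2012 orders: the WHERE clause filters out the NULL
-- rows produced by the outer join, silently turning it into an inner join.
SELECT c.CustomerID, o.OrderDate
FROM dbo.tblCustomer AS c
LEFT OUTER JOIN dbo.tblOrder AS o
    ON o.CustomerID = c.CustomerID
WHERE o.OrderDate >= '20120101';

-- Keeps every customer: the criteria only limit which orders are joined.
SELECT c.CustomerID, o.OrderDate
FROM dbo.tblCustomer AS c
LEFT OUTER JOIN dbo.tblOrder AS o
    ON o.CustomerID = c.CustomerID
   AND o.OrderDate >= '20120101';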

That's enough for now! Thank you for your time. Hopefully my suggestions can help you with your development efforts.