
Friday, March 23, 2012

Configuring Referential Integrity

Find a table 'tblTableName' in the database that has no referential integrity configured. Such tables appear with no links in the diagram window, and the right-click menu offers no option to add related tables.

Create a GROUP BY query on the table to analyze the data. Look for anything that appears to be a unique record constraint based on real data (no auto numbers). Either include all columns to start and trim them away a few at a time, or start with a single column and add more until you find the unique record constraint. Unique values will show up under HAVING COUNT(*) = 1.

SELECT [ColumnName] , Recs = COUNT(*)
FROM [tblTableName]
GROUP BY [ColumnName]
--HAVING COUNT(*) = 1
ORDER BY COUNT(*) DESC

A 1
B 1
C 1

We only have one of each value, so this is probably a primary table that will be referenced by foreign keys in other tables.

I use my code to search for tables containing a column named 'ColumnName'.
I use wildcards to find any column name that roughly matches the text we are searching for.

-- Find it somewhere!
EXEC dbatools.dbo.usp_srvr_object_find
@CSVServerNameListInclude = @@SERVERNAME
, @CSVDatabaseNameListInclude = 'DatabaseName'
, @CSVObjectTypeList ='TABLE_COLUMN'
, @CSVWildCardObjectNameList = '%ColumnName%'

Open a diagram, try linking the tables, and save. The save will only succeed if the data is related; even one value missing from the primary table will keep the save from completing. Records/values will either need to be added to the primary table or removed from the foreign key table before RI can be configured.
After the data is reconciled (deleted or added), try linking the tables again.
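
Before trying the link, it can save time to check for orphaned values directly. This is a minimal sketch; 'tblForeignTable', 'tblTableName', and 'ColumnName' are placeholder names. Any rows it returns must be added to the primary table or removed/corrected before the relationship will save.

-- Find foreign key values with no matching primary row.
SELECT f.[ColumnName] , Recs = COUNT(*)
FROM [tblForeignTable] f
WHERE NOT EXISTS (
    SELECT 1
    FROM [tblTableName] p
    WHERE p.[ColumnName] = f.[ColumnName]
)
GROUP BY f.[ColumnName]
ORDER BY COUNT(*) DESC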

Another method is to search the database for data matching our primary table's primary key.
I use my DBATools.usp_srvr_db_table_data_exists procedure to search the database for
references to a few code values. I don't use the @UseLike feature because I am looking for
specific code values.

-- Find data in any table.
EXEC DBATools.dbo.usp_srvr_db_table_data_exists
@ServerName = @@SERVERNAME
, @DatabaseName ='DatabaseName'
, @EscapeCharacter = '~'
, @SearchText = 'A'
, @UseLike = 0

I would then look at the number of occurrences of each value from the primary table.
Any table with one or more matches could potentially be related. A table with exactly one match for each value could itself be a primary table.

SELECT [ColumnName] , Recs = COUNT(*)
FROM [tblTableName]                                                      
GROUP BY [ColumnName]
--HAVING COUNT(*) = 1
--HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC

Next I would make educated guesses to whittle down the tables I think may contain foreign key references to this data. Look at both the table names and the column names; some will obviously not be related. You can also look through the procedure code that uses the tables in question, as existing code may show how the data is being used.

-- Find it somewhere!
EXEC dbatools.dbo.usp_srvr_object_find
@CSVServerNameListInclude = @@SERVERNAME
, @CSVDatabaseNameListInclude = 'DatabaseName'
, @CSVObjectTypeList ='TABLES,TABLE_COLUMN,PROCEDURES,PROCEDURE_CODE,SYNONYMS,PROCEDURE_CODE2'
, @CSVWildCardObjectNameList = '%tblTableName%'


As we find related data, we configure RI by linking the 'like' columns in the SSMS diagram designer.
Denormalized data should never be linked to anything but a code table, so we may end up having to manually create and populate some code tables. You can always try linking the columns to confirm or rule out a data relationship, although this may not be practical on very large tables. If your data comes from a legacy (and old) system, you may have a lot of multiple-column RI relationships.
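
A rough sketch of creating a code table from denormalized values and then linking to it; 'tblTransactions', 'tblStatusCode', and 'StatusCode' are placeholder names.

-- Build the code table from the distinct denormalized values.
SELECT DISTINCT [StatusCode]
INTO [dbo].[tblStatusCode]
FROM [dbo].[tblTransactions]
WHERE [StatusCode] IS NOT NULL

ALTER TABLE [dbo].[tblStatusCode]
    ADD CONSTRAINT PK_tblStatusCode PRIMARY KEY ([StatusCode])

-- Link the denormalized column to the new code table.
ALTER TABLE [dbo].[tblTransactions]
    ADD CONSTRAINT FK_tblTransactions_tblStatusCode
    FOREIGN KEY ([StatusCode]) REFERENCES [dbo].[tblStatusCode] ([StatusCode])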

Please note that configuring referential integrity is also covered in my post on legacy database conversions, as the two processes are similar in some cases.

Thursday, March 15, 2012

Legacy Database Conversions

Step 1: Import the data into SQL Server. The SSIS wizard works well for quick raw imports. If given data in flat files I would use BCP. Make sure to ask for any readable, searchable code of any sort from the legacy system. I would want a copy of it as it may contain business logic used in the application that may need to be converted.
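
A minimal bcp import sketch, assuming a comma-delimited flat file; the file path, database, and table names are placeholders.

REM Character-mode import over a trusted connection, logging rejected rows to an error file.
bcp DatabaseName.dbo.tblLotsOfData in C:\imports\LotsOfData.txt -S ServerName -T -c -t"," -r"\n" -e C:\imports\LotsOfData.err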




Step 2: We need to create natural key unique indexes on all of our data. To do this we determine the existing record definition for each table by identifying its natural key unique record constraint, using SQL GROUP BY statements and excluding any auto numbering of records. This is mainframe data, so we will probably have to strip extended characters from some data and do trims on all of it.


Ex.
SELECT Recs = COUNT(*)
, [PERIOD]
,[ENTITYID]
,[AMT]
,[DAYS]
,[DESCRPN]
FROM [dbo].[tblLotsOfData]
GROUP BY
[PERIOD]
,[ENTITYID]
,[AMT]
,[DAYS]
,[DESCRPN]
HAVING COUNT(*) > 1


If we have duplicates then that grouping is not the real unique record constraint for the data. You can either start with all the columns or make educated guesses and add columns until the grouping is unique. There may also be genuine duplicates in the source data: a table with a million rows and only 20 duplicate groups probably has the right natural key and simply contains duplicate records. Any duplicate data will need to be moved out of the table so a natural key (probably concatenated) unique record constraint can be placed on it. At some point the data removed from the table will need to be reconciled.
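
One way to move the duplicates out and then enforce the natural key, sketched against the grouping above; the holding table name 'tblLotsOfData_Dups' is an assumption.

-- 1) Copy the duplicate rows (all but the first of each group) to a holding table.
SELECT [PERIOD],[ENTITYID],[AMT],[DAYS],[DESCRPN]
INTO [dbo].[tblLotsOfData_Dups]
FROM (
    SELECT [PERIOD],[ENTITYID],[AMT],[DAYS],[DESCRPN],
           RowNum = ROW_NUMBER() OVER (
               PARTITION BY [PERIOD],[ENTITYID],[AMT],[DAYS],[DESCRPN]
               ORDER BY (SELECT NULL))
    FROM [dbo].[tblLotsOfData]
) d
WHERE d.RowNum > 1;

-- 2) Remove the duplicates from the source table.
WITH d AS (
    SELECT RowNum = ROW_NUMBER() OVER (
               PARTITION BY [PERIOD],[ENTITYID],[AMT],[DAYS],[DESCRPN]
               ORDER BY (SELECT NULL))
    FROM [dbo].[tblLotsOfData]
)
DELETE FROM d WHERE RowNum > 1;

-- 3) Enforce the natural key unique record constraint.
CREATE UNIQUE INDEX UX_tblLotsOfData_NaturalKey
    ON [dbo].[tblLotsOfData] ([PERIOD],[ENTITYID],[AMT],[DAYS],[DESCRPN]);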

Notes:

-- Objective: Find the most granular level of uniqueness of a record, as described by the supporting SQL statement.
-- Use a selectivity data spreadsheet for data population information.
-- Script the table, check for surrogate keys, and exclude them from the groupings.

-- The next are general rules.
-- A primary key column must be fully populated, so look at the data; in general you would (but don't have to) exclude columns that are not fully populated.
-- Exclude bit columns, or convert them to integers for groupings.
-- Exclude text columns (as opposed to char or varchar) and binary columns. They will cause errors as they are not allowed in groupings or keys.
-- Exclude columns such as note, comment, user, last date, and Y/N flags when they are obviously just properties of a record (colors, lengths, etc.) and not likely primary key candidates.
-- The query sketch below can help list candidate columns with these exclusions applied.
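
A helper sketch for listing candidate natural key columns for one table, applying the type exclusions from the notes above; the schema and table names are placeholders.

-- List candidate grouping columns, skipping identity columns and the excluded data types.
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'tblLotsOfData'
  AND DATA_TYPE NOT IN ('text','ntext','image','bit','binary','varbinary')
  AND COLUMNPROPERTY(OBJECT_ID(TABLE_SCHEMA + '.' + TABLE_NAME), COLUMN_NAME, 'IsIdentity') = 0
ORDER BY ORDINAL_POSITION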

If duplication is found, consider that there may be missing datetime or other data elements to add to your unique record constraint. The duplicates could also be bad records that do not contain enough information to properly identify anything.

The single most important aspect of table design is implementing a unique record constraint based on meaningful data. Without adequate unique record identifiers we would have to perform multiple searches to identify a single record and would not be able to enforce data integrity through a unique index or constraint. Additionally, if the table data hasn't been defined then it can't be used for any valid metrics, because its content has not been defined. Proper constraints on the table, such as defaults for the user and LastDate columns, will prevent the loss of data. Finally, without a proper unique record constraint the table data could be duplicated (or worse) and no one might ever recognize it, because what constitutes a duplicate was never defined.
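
For the audit column defaults mentioned above, a small sketch; the table and column names ('LastUser', 'LastDate') are assumptions.

-- Default the audit columns so they are populated even when the application forgets them.
ALTER TABLE [dbo].[tblLotsOfData]
    ADD CONSTRAINT DF_tblLotsOfData_LastUser DEFAULT (SUSER_SNAME()) FOR [LastUser];

ALTER TABLE [dbo].[tblLotsOfData]
    ADD CONSTRAINT DF_tblLotsOfData_LastDate DEFAULT (GETDATE()) FOR [LastDate];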

Step 3: We need to analyze the data and look for denormalized data spread across different tables. My DBATools database procedures are very helpful for doing this in an automated way.
----------------------------------------------------------------------------------------
-- Find data in any table
----------------------------------------------------------------------------------------
EXEC DBATools.dbo.usp_srvr_db_table_data_exists
@ServerName = @@SERVERNAME
, @DatabaseName ='DatabaseName'
--, @csvNoSchemaTableList = 'tb~_%'
--, @csvNoSchemaTableExcludeList = 'tblName1'
, @EscapeCharacter = '~'
, @SearchText = 'Amount'
, @UseLike = 1

-- Find other things!
EXEC dbatools.dbo.usp_srvr_object_find
@CSVServerNameListInclude = @@SERVERNAME
, @CSVDatabaseNameListInclude = 'DBATools'
, @CSVObjectTypeList ='TABLES,TABLE_COLUMN,PROCEDURES,PROCEDURE_CODE,SYNONYMS,PROCEDURE_CODE2'
, @CSVWildCardObjectNameList = '%tblColumnSelectivity%'


Step 4: Reproduce the table relationships (mainframe). This will help us identify the duplicate data that we will most likely find. We can use the procedure calls above to help, and also look for similar column names in different tables. Hopefully we will find one of each that is the primary key in a primary or code/lookup table; the others will be foreign keys referencing that primary key. We can use the INFORMATION_SCHEMA.COLUMNS system view to find candidate relationships. Matching names will not necessarily be related, but they very likely could be related data elements; the actual data will show us whether they are indeed related. We may have to build code or lookup tables using the lookup values as temporary code names.

SELECT
COLUMN_NAME = TABLE_CATALOG + '.' + TABLE_SCHEMA + '.' + TABLE_NAME + '.' + COLUMN_NAME
+ ' ' + DATA_TYPE + COALESCE('(' + CAST(CHARACTER_MAXIMUM_LENGTH as varchar) + ')' ,'')
+ ' <' + CAST(IS_NULLABLE as varchar) + '>'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME IN(
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
GROUP BY COLUMN_NAME
HAVING COUNT(*) > 1
)
AND COLUMN_NAME = 'name'

COLUMN_NAME
-----------------------------------
master.dbo.spt_fallback_db.name varchar(30) <NO>
master.dbo.spt_fallback_dev.name varchar(30) <NO>
master.dbo.spt_values.name nvarchar(35) <YES>

(3 row(s) affected)

Step 5: Identify what we would want to alter in the existing design and determine what we want to change, i.e. make a copy and start modifications.

Step 6: If data is being relocated, we need to create any new tables and map the related legacy columns to them.
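
A sketch of mapping legacy columns into a new, cleaned-up table; 'tblEntityPeriod' and the column mapping are assumptions, and the trims tie back to Step 2.

-- Map and clean the legacy columns into the new table.
INSERT INTO [dbo].[tblEntityPeriod] ([EntityID], [Period], [Amount], [Days], [Description])
SELECT [ENTITYID]
     , [PERIOD]
     , [AMT]
     , [DAYS]
     , LTRIM(RTRIM([DESCRPN]))
FROM [dbo].[tblLotsOfData];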

Step 7: Determine and write the business logic in SQL Server functions and procedures. The key here is to write the procedures so they can be used for both the reports and the user interface, which significantly reduces necessary QA. We would want to create templates for Select, Insert, Update and Delete procedures and use them to generate our procedures dynamically. This is a time saver, but there will be plenty of procedures that can't be genericized.
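
A rough "select by key" template of the kind a generator would fill in per table; the procedure name, parameters, and data types are placeholders.

-- Template: select a single record by its natural key.
CREATE PROCEDURE [dbo].[usp_tblEntityPeriod_sel]
    @EntityID int
  , @Period   char(6)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT [EntityID], [Period], [Amount], [Days], [Description]
    FROM [dbo].[tblEntityPeriod]
    WHERE [EntityID] = @EntityID
      AND [Period] = @Period;
END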

Step 8: Write the interface and reports. Data sources for the interface and reports would come from parameterized procedures. The procedures would run SQL statements calling user-defined table functions, tables, or some combination thereof. I prefer including data aggregation in the procedures so that they are independent of the implementation outside of SQL Server, and also so results can be recalculated with new parameters, which can't be done in views alone or in 2005 report model projects.
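
A sketch of a parameterized reporting procedure that aggregates inside SQL Server, so the same procedure can feed both the UI and the reports; the names and parameter types are assumptions.

-- Aggregate inside the procedure so results can be recalculated with new parameters.
CREATE PROCEDURE [dbo].[usp_rpt_EntityPeriodTotals]
    @PeriodFrom char(6)
  , @PeriodTo   char(6)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT [EntityID]
         , TotalAmount = SUM([Amount])
         , TotalDays   = SUM([Days])
    FROM [dbo].[tblEntityPeriod]
    WHERE [Period] BETWEEN @PeriodFrom AND @PeriodTo
    GROUP BY [EntityID]
    ORDER BY [EntityID];
END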

You may want to review my post on configuring referential integrity for related information.