Thursday, March 15, 2012

Legacy Database Conversions

Step 1: Import the data into SQL Server. The SSIS wizard works well for quick raw imports; if the data arrives in flat files I would use BCP. Also ask for any readable, searchable source code of any sort from the legacy system. I would want a copy of it, as it may contain business logic used in the application that may need to be converted.
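For the flat-file case, here is a minimal load sketch, assuming a pipe-delimited file and a staging table named dbo.tblLotsOfData (both names are placeholders):

-- Command-line load with BCP (character mode, trusted connection, pipe delimiter):
-- bcp LegacyStage.dbo.tblLotsOfData in "C:\legacy\lots_of_data.txt" -S MyServer -T -c -t "|"

-- Equivalent T-SQL load:
BULK INSERT dbo.tblLotsOfData
FROM 'C:\legacy\lots_of_data.txt'
WITH (
FIELDTERMINATOR = '|'
, ROWTERMINATOR = '\n'
, TABLOCK
);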




Step 2: We need to create natural key unique indexes on all of our data. To do that, we determine the existing record definition for each table by identifying its natural key unique record constraint, using SQL GROUP BY statements and excluding any auto-numbering columns. This is mainframe data, so we will probably have to strip extended characters from some columns and trim all of them.
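The trimming and character stripping can be done with plain UPDATE statements before the grouping; a minimal sketch, assuming [DESCRPN] needs trimming and has stray nulls and carriage returns (CHAR(0), CHAR(13)) to remove:

-- Strip a couple of common extended characters and trim the column.
UPDATE [dbo].[tblLotsOfData]
SET [DESCRPN] = LTRIM(RTRIM(REPLACE(REPLACE([DESCRPN], CHAR(0), ''), CHAR(13), '')));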


Example (the GROUP BY duplicate check):
SELECT Recs = COUNT(*)
, [PERIOD]
,[ENTITYID]
,[AMT]
,[DAYS]
,[DESCRPN]
FROM [dbo].[tblLotsOfData]
GROUP BY
[PERIOD]
,[ENTITYID]
,[AMT]
,[DAYS]
,[DESCRPN]
HAVING COUNT(*) > 1


If we find duplicates, then that column set is not the real unique record constraint for the data. You can either start with all the columns or make educated guesses and add columns until the grouping is unique. Even then there may be genuine duplicates in the source: a table with a million rows and only 20 duplicate groups is probably a table with duplicate data in it rather than one missing a key column. Any duplicate data will need to be moved out of the table so that a natural key (probably concatenated) unique record constraint can be placed on it. At some point the data removed from the table will need to be reconciled.
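A minimal sketch of that move, assuming the five columns above turn out to be the natural key and using a holding table named tblLotsOfData_Dupes (a placeholder name; the key columns must be indexable types and within the index size limit):

-- Copy every row beyond the first of each natural-key group into a holding table for later reconciliation.
;WITH Ranked AS (
SELECT *
, RowNum = ROW_NUMBER() OVER (
PARTITION BY [PERIOD],[ENTITYID],[AMT],[DAYS],[DESCRPN]
ORDER BY (SELECT 0))
FROM [dbo].[tblLotsOfData]
)
SELECT *
INTO dbo.tblLotsOfData_Dupes
FROM Ranked
WHERE RowNum > 1;

-- Remove those rows from the source table, then enforce the natural key.
;WITH Ranked AS (
SELECT RowNum = ROW_NUMBER() OVER (
PARTITION BY [PERIOD],[ENTITYID],[AMT],[DAYS],[DESCRPN]
ORDER BY (SELECT 0))
FROM [dbo].[tblLotsOfData]
)
DELETE FROM Ranked
WHERE RowNum > 1;

CREATE UNIQUE INDEX UX_tblLotsOfData_NaturalKey
ON [dbo].[tblLotsOfData] ([PERIOD],[ENTITYID],[AMT],[DAYS],[DESCRPN]);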

Notes:

-- Objective: find the most granular level of uniqueness of a record, as described by the supporting SQL statement.
-- Use the selectivity data spreadsheet for data population information.
-- Script the table and check for surrogate (identity) keys; exclude them from the result grouping.

-- The following are general rules. (A helper query that applies these exclusions is sketched after these notes.)
-- A primary key column must be fully populated, so look at the data; in general you would (but don't have to) exclude columns that are not fully populated.
-- Exclude bit columns, or convert them to integers, for the grouping.
-- Exclude text, ntext, and image columns (as opposed to char or varchar); they will cause errors because they are not allowed in groupings or keys. Binary columns are also poor key candidates and are best excluded.
-- Exclude columns such as note, comment, user, last-date, and Y/N or bit flags when they are obviously just properties of a record (colors, lengths, and the like) and not likely primary key candidates.
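A minimal sketch of a query that applies those exclusions to build the candidate column list, assuming the table is dbo.tblLotsOfData (adjust the name and type filters as needed):

-- List natural-key candidate columns: skip identity (surrogate) columns, LOB and bit types, and obvious audit/property columns.
SELECT c.COLUMN_NAME, c.DATA_TYPE, c.IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE c.TABLE_SCHEMA = 'dbo'
AND c.TABLE_NAME = 'tblLotsOfData'
AND COLUMNPROPERTY(OBJECT_ID(c.TABLE_SCHEMA + '.' + c.TABLE_NAME), c.COLUMN_NAME, 'IsIdentity') = 0
AND c.DATA_TYPE NOT IN ('text', 'ntext', 'image', 'bit')
AND c.COLUMN_NAME NOT IN ('Note', 'Comment', 'User', 'LastDate')
ORDER BY c.ORDINAL_POSITION;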

If duplication is found, consider that there may be missing datetime or other data elements to add to your unique record constraint. The duplication could also be bad records that do not contain enough information to properly identify anything.

The single most important aspect of table design is implementing a unique record constraint based on meaningful data. Without adequate unique record identifiers we would have to perform multiple searches to identify a single record, and we would not be able to enforce data integrity through a unique index or constraint. Additionally, if table data hasn't been defined then it can't be used for any valid metrics, because its content has not been defined. Proper constraints on the table, such as defaults for the User and LastDate columns, will prevent the loss of data. Finally, without a proper unique record constraint the table data could be duplicated, or worse, and no one might ever recognize the duplicate data, because what exactly constitutes a duplicate was never defined.
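A minimal sketch of those defaults, assuming User and LastDate audit columns exist on the table (names are placeholders):

-- Record who loaded/changed the row and when, if the application leaves the columns blank.
ALTER TABLE [dbo].[tblLotsOfData]
ADD CONSTRAINT DF_tblLotsOfData_User DEFAULT SUSER_SNAME() FOR [User];

ALTER TABLE [dbo].[tblLotsOfData]
ADD CONSTRAINT DF_tblLotsOfData_LastDate DEFAULT GETDATE() FOR [LastDate];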

Step 3: We need to analyze the data and look for denormalized data repeated across different tables; my DBATools database procedures are very helpful for doing this in an automated way.
----------------------------------------------------------------------------------------
-- Find data in any table
----------------------------------------------------------------------------------------
EXEC DBATools.dbo.usp_srvr_db_table_data_exists
@ServerName = @@SERVERNAME
, @DatabaseName ='DatabaseName'
--, @csvNoSchemaTableList = 'tb~_%'
--, @csvNoSchemaTableExcludeList = 'tblName1'
, @EscapeCharacter = '~'
, @SearchText = 'Amount'
, @UseLike = 1

-- Find other things!
EXEC dbatools.dbo.usp_srvr_object_find
@CSVServerNameListInclude = @@SERVERNAME
, @CSVDatabaseNameListInclude = 'DBATools'
, @CSVObjectTypeList ='TABLES,TABLE_COLUMN,PROCEDURES,PROCEDURE_CODE,SYNONYMS,PROCEDURE_CODE2'
, @CSVWildCardObjectNameList = '%tblColumnSelectivity%'


Step 4: Reproduce the (mainframe) table relationships. This will help us identify the duplicate data that we will most likely find. We can use the procedure calls above to help, and also look for similar column names in different tables. Hopefully we will find one table in which the shared column is the primary key, i.e. the primary or code / lookup table(s); in the other tables it will be a foreign key referencing that primary key. We can use the INFORMATION_SCHEMA.COLUMNS system view to find candidate relations. The columns will not necessarily be related, but they very likely could be related data elements, and the actual data will show us whether they are indeed related. We may have to build code or lookup tables ourselves, using the lookup values as the temporary code names (a build is sketched after the query output below).

SELECT
COLUMN_NAME = TABLE_CATALOG + '.' + TABLE_SCHEMA + '.' + TABLE_NAME + '.' + COLUMN_NAME
+ ' ' + DATA_TYPE + COALESCE('(' + CAST(CHARACTER_MAXIMUM_LENGTH as varchar) + ')' ,'')
+ ' <' + CAST(IS_NULLABLE as varchar) + '>'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME IN(
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
GROUP BY COLUMN_NAME
HAVING COUNT(*) > 1
)
AND COLUMN_NAME = 'name'

COLUMN_NAME
-----------------------------------
master.dbo.spt_fallback_db.name varchar(30) <NO>
master.dbo.spt_fallback_dev.name varchar(30) <NO>
master.dbo.spt_values.name nvarchar(35) <YES>

(3 row(s) affected)
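A minimal sketch of building such a code / lookup table from the distinct values, assuming a hypothetical repeated [STATUS] text value in tblLotsOfData:

-- Build the lookup table, using the existing values as the temporary code names.
SELECT StatusCode = CAST(ROW_NUMBER() OVER (ORDER BY [STATUS]) AS int)
, StatusName = [STATUS]
INTO dbo.tblStatusCode
FROM (SELECT DISTINCT [STATUS] FROM dbo.tblLotsOfData) AS d;

-- Add the code column to the data table (run as its own batch), then populate it from the lookup.
ALTER TABLE dbo.tblLotsOfData ADD StatusCode int NULL;
GO

UPDATE t
SET t.StatusCode = c.StatusCode
FROM dbo.tblLotsOfData t
JOIN dbo.tblStatusCode c ON c.StatusName = t.[STATUS];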

Step 5: Identify what we would want to alter in the existing design and determine the changes to make, i.e. make a copy and start modifications.
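The copy itself can be as simple as SELECT ... INTO (the _v2 suffix below is just a placeholder naming convention):

-- Copy the table's structure and data (indexes and constraints must be scripted separately).
SELECT *
INTO dbo.tblLotsOfData_v2
FROM dbo.tblLotsOfData;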

Step 6: Then, if data is being relocated, we need to create any new tables and map the related columns of data to them.
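A minimal mapping sketch, assuming a new normalized target table dbo.tblPeriodAmount (a hypothetical name) loaded from the staging table:

-- Map the staging columns onto the new table's columns, cleaning as we go.
INSERT INTO dbo.tblPeriodAmount ([Period], [EntityID], [Amount], [Days], [Description])
SELECT [PERIOD], [ENTITYID], [AMT], [DAYS], LTRIM(RTRIM([DESCRPN]))
FROM dbo.tblLotsOfData;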

Step 7: Determine and write the business logic in SQL Server functions and procedures. The key here is to write the procedures so they can be used for both the reports and the user interface, which significantly reduces the necessary QA. We would want to create templates for the basic operations (select, insert, update, delete) and use them to generate our procedures dynamically. This is a time saver, although there will be plenty of procedures that can't be genericized.
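A minimal sketch of what one generated select-template procedure might look like, using the hypothetical tblPeriodAmount table from Step 6 (all names are placeholders):

-- Template-style select procedure; NULL parameters mean "no filter",
-- so the same procedure can serve both the user interface and the reports.
CREATE PROCEDURE dbo.usp_tblPeriodAmount_Select
@Period char(6) = NULL
, @EntityID int = NULL
AS
BEGIN
SET NOCOUNT ON;

SELECT [Period], [EntityID], [Amount], [Days], [Description]
FROM dbo.tblPeriodAmount
WHERE ([Period] = @Period OR @Period IS NULL)
AND ([EntityID] = @EntityID OR @EntityID IS NULL);
END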

Step 8: Write the interface and reports. Data sources for the interface and reports would come from parameterized procedures. The procedures would run SQL statements calling user-defined table functions, tables, or some combination thereof. I prefer including data aggregation in the procedures so that they are independent of the implementation outside of SQL Server, and also so the results can be recalculated using new parameters, which can't be done in views alone or in 2005 report model projects.
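A minimal sketch of a parameterized report procedure with the aggregation kept inside SQL Server (again using the hypothetical tblPeriodAmount):

-- The aggregation lives in the procedure, so the UI and the reports get the same numbers
-- and the results can be recalculated simply by calling it with new parameters.
CREATE PROCEDURE dbo.usp_rpt_AmountByEntity
@PeriodFrom char(6)
, @PeriodTo char(6)
AS
BEGIN
SET NOCOUNT ON;

SELECT [EntityID]
, TotalAmount = SUM([Amount])
, RecordCount = COUNT(*)
FROM dbo.tblPeriodAmount
WHERE [Period] BETWEEN @PeriodFrom AND @PeriodTo
GROUP BY [EntityID];
END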

You may want to review my post on configuring referential integrity for related information.
