Sunday, October 21, 2012

SSIS - Knowledge dump and links.

I've been developing with SQL Server for the past 20 years and have occasionally been tasked to create an SSIS package or to fix one that was not operating properly.  I have found the occasional use of the data transfer wizard very convenient however I usually avoid SSIS packages for a number of reasons I will list at the end of this article. That being said I work as a contract and my most recent positions have had be doing a lot of SSIS so I though I would part with a few of the SSIS basics helpful in my last few projects.

Before I start I almost forgot to add this tidbit. Many import projects require you to run in 32 bit mode so make sure you have it set and don't be supprised if you have to keep resetting it. When your package starts  start dying everywhere again go back and make sure it's not back 64 bit mode.  It's a bit annoying but I don't have a solution for it other than.

Error Handling:  In order to keep a package from stopping when it encounters an error with a data flow or some other object in your package you will need to create an event handler, an empty event handler will do.  First make sure your global variables are showing or available and then select the object or data flow you want to keep from stopping your package and set it's MaximumErrorCount property to 0.  Next select the Event Handlers tab and click on "Click here to create and OnError event handler ...". Next set the system variable "propagate" to false. That's it. Make sure you don't set the MaximumErrorCount property of the actual error handler to zero.

Global temp tables ##MyTempTable: I also wanted to use temp tables in my SQL Execute task objects.  In order to do this you need to set the connection manager property  RetainSameConnection to true and also set DelayValidation to true on all tasks that use the temp table. One thing I found that I didn't see mention anywhere else is that global temp tables are evidently only global within one data-flow path. For instance if you use two data-flow paths to parallel process some functionality the temp tables are only global to the data-flow path they are being executed in.  As far as I could tell you can't use global temp tables from two different data-flow paths together in the same Execute SQL task.. Global temp tables do not appear to be global to separate data-flow paths.

Excel Imports: One of the  projects I've worked on recently was importing Excel files into SQL Server, applying some business rules to the data and updating or inserting the data into some tables. This sounds pretty easy until you start finding out how the jet drivers decide how to import Excel data.
I read a myriad of articles explaining how to import the data and of coarse my project was importing macro enabled spreadsheets so it required using the ACE.OLEDB.12.0 driver and setting some  extended properties (Excel 12.0 Macro;HDR=Yes or Excel 12.0 Macro;HDR=No) If excel was not installed on the machine the Microsoft Access Database Engine 2010 Redistributable driver would need to be installed.

After scouring a lot of Internet articles and a lot of import testing I finally found a description of how the driver determines how it will import data. Please read the article for yourself at http://www.etl-tools.com/imex1.html and related information at Microsoft kb/257819. In short it sounds like Excel just isn't a good import export tool.  The only way to get reliable data importing from Excel is to use the first column to control how excel imports the data. You should also beware of Excel data imports because they were probably exported using the standard Excel driver and have probably  truncated some text at 255 characters.

So here is what you do to get you data imported from Excel.  First you meed to create a template with an example of the  file structure in the first row of the spreadsheet or you need to manually open the spreadsheet up and if necessary insert a row at the top or use the first row to provide an example data set. In my situation I wanted to import everything as text so I could import all the spreadsheets into the same data cleansing table and then deal with cleansing the data and  converting the data types  in SQL Server.  I suspect you can hide the first row and use it as your data description  but I didn't  have the need to try it as I had a and empty first row to use and I used blanks to indicate to Excel what to be imported.  The only thing I really had to do is make sure one of the columns was 4000 characters long so Excel would know it was long text and I wouldn't get meta data errors in SSIS when attempting to import it.

You also need to make sure the TypeGuessRows registry setting is set to 1 so it will use  the first row as the data import specification .I check it first and save the value so I can reset it to the original value when I'm done.  That way other people doing excel imports will not be effected by you changes except if the try to import data at the same time. The  service account running the DBEngine needs to be given privileges to write  to the registry so check the computer services or use the SQL Server Configuration Manager to find out what account is running the SQL Server DB Engine. Then if necessary use RegEdit to give permissions to the account. There is a menu option in RegEdit to give the permissions. Below  is some T-SQL code to read and write the registry values on 64 bit machines. 

----------------------------------------------------
-- Check for provider:Microsoft.ACE.OLEDB.12.0
-------------------------------------------------------------
SET @intRegValue = -1
exec @rc = master.dbo.xp_regread
 @rootkey='HKEY_LOCAL_MACHINE'
, @key='SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Excel\'
, @value_name= 'TypeGuessRows'
, @value = @intRegValue  output
IF @rc <> 0
OR @intRegValue = -1
BEGIN
SET @msg = 'Error reading Excel TypeGuessRows.  The Excel file import provider Microsoft.ACE.OLEDB.12.0 is probably not installed.'  + CHAR(13)
SET @msg = @msg  + 'You may need to install the Microsoft Access Database Engine 2010 Redistributable which can be found at the URL address ' + CHAR(13)
SET @msg = @msg  + 'http://www.microsoft.com/en-us/download/details.aspx?id=13255'
PRINT @msg
END

---------------------------------------------------------------
-- Set the TypeGuessRows registry value.
---------------------------------------------------------------
exec  @rc = xp_regwrite
@rootkey='HKEY_LOCAL_MACHINE'
, @key='SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Excel\'
, @value_name='TypeGuessRows'
, @type='REG_DWORD'
, @value=1

File System Operation

I tried using the SSIS File System objects to implement File System operations but I found C# to be a better option. A few things I needed to do were to  loop through directoriesread files from the disk in order of write date, read the file propertiesexecute stored procedures, perform some logging , setup logging, setup logging2, deleting all the files in a directory, etc.

Parallel processing is definitely an advantage in SSIS but aside that for a number of reasons whenever possible I prefer to use SQL Server TSQL and CLRs or at least call them from SSIS. Here are a few of the reasons.




  • I already have a TSQL code base of about 1000 procedures, 300 functions and some CLRs that provide me with  FileSystem and other functions and it's all setup pretty well for re-usability. This let's me develop reusable code and functionality such as data imports.
  • The code for SSIS packages can't just be printed out and reviewed like TSQL and C# code.  
  • The SSIS graphical interface is often painfully slow. Especially with complex packages.
  • SSIS often give less then helpful error messages making it difficult to see error messages readily displayed in TSQL or Oracle Developer. It does provide ways to see a ton of messages but unless configured carefully is more likely to produce information overload and make it hard to see the actual problems. 






  • No comments:

    Post a Comment