

Import UTF-8 Unicode Special Characters with SQL Server Integration Services

Problem
As you work with large-scale SQL Server Integration Services ETL processes, you are bound to encounter UTF-8
encoded text files. Without proper handling, UTF-8 / Unicode characters can wreak havoc on your SSIS load
tasks. What are some ideas for handling UTF-8 file formats with SSIS? Check out this tip to learn more.

Solution
First, some background about UTF-8 and Unicode will go a long way toward explaining how to handle these
different code page types. Unicode is an encoding standard maintained by the Unicode Consortium; most of the
biggest players in the technology field (Google, SAP, Microsoft, Oracle) along with many others belong to the
consortium. The consortium was set up to standardize the numbering scheme assigned to represent a letter or
character across all languages, countries, software products, and hardware platforms. Drilling down further,
UTF-8, which stands for Unicode Transformation Format (8-bit), is an encoding method for handling all the
characters in the Unicode character set. I know I am going to do a disservice to the encoding process by
explaining the transformation in these terms; however, UTF is sort of like the decryption key (or secret decoder
ring!) used to map your backend bytes to the actual characters displayed on the screen. Of course, there is
much more to it, so feel free to take a look at the links in the Next Steps section of this tip. To ease the
conversion from prior encoding standards, keep in mind that the first 128 UTF-8 characters match the ASCII
character set, so every valid ASCII file is also a valid UTF-8 file.
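
As a quick illustration of this character-to-number mapping, SQL Server exposes Unicode code points through
its built-in UNICODE and NCHAR functions. A minimal sketch (the character chosen is just an example):

-- Look up the Unicode code point for a character, and convert back again
SELECT UNICODE(N'é') AS CodePoint;  -- returns 233, i.e. U+00E9
SELECT NCHAR(233) AS TheCharacter;  -- returns é
-- In UTF-8, this single character is stored as two bytes (0xC3 0xA9),
-- while a plain ASCII character like 'A' remains a single byte (0x41)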

Working with SSIS and UTF-8 Unicode Data


Enough of the theory and background; let us put this knowledge into practice. First, we need to create a UTF-8
encoded text file with some special characters. As you can see from the screen prints below, most of the rows
contain one or more special characters. I actually created two different text files: the first is a bar
delimited, 4 column file as shown in the upper screen print below, and the second is a ragged right, fixed width
4 column file.
To create the files in the correct encoding format, I used a free tool called Notepad++. As you can see in the
below screen print, the encoding for our sample files is set to UTF-8.
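
For reference, a bar delimited file of this shape might look something like the following; the names and values
here are hypothetical stand-ins for the actual test data:

Name|City|Amount|Notes
José Muñoz|São Paulo|100|Ações
Åsa Sjögren|Göteborg|250|Fika
François Côté|Québec|75|Crème brûlée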

Fortunately, SSIS is smart enough to handle many different code pages. Furthermore, UTF-8 is one of the most
popular encodings, as it is used by a wide variety of web pages. Using the bar delimited file noted in the
initial screen prints above, we can quickly load the data with minimal adjustment to the flat file connection.
First, you will need to open SQL Server Data Tools or SSDT (formerly known as BIDS), and then create a new SSIS
package with a proper and descriptive name. Next, as shown below, create a connection to the bar delimited flat
file. After browsing to the flat file, you will notice on the General page that the Code page defaults to 65001
(UTF-8), which is the proper UTF-8 code page. On this same window, adjust the connection manager name to an
appropriate name, change the format to Delimited, and then select the "Column names in the first data row"
check box.
Next, on the columns screen, verify that the column delimiter is set to vertical bar.

Then on the Advanced screen, the data type for each string field must be changed from string [DT_STR] to
Unicode string [DT_WSTR].
Finally, we can preview the flat file connection setup, as revealed below.

Based on the above preview, the characters look like they are displaying appropriately, especially from the
UTF-8 encoding standpoint. Next, we can set up our data flow with a flat file source and a SQL Server
destination. The SQL Server destination is created, and finally the source and destination columns are mapped,
as previewed in the next two screen shots.
At last, we are ready to test run our data flow package, which reports success, as shown below, with two green
check marks!

Running a query against the table displays success also; please note all the fields are NVARCHAR data types.
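
For reference, the destination table would look something like the minimal sketch below; the table name, column
names, and four-column layout are assumptions based on the sample file:

-- All character fields are NVARCHAR so the Unicode data survives intact
CREATE TABLE dbo.SpecialCharsTest
(
    [Name] NVARCHAR(50) NULL,
    City NVARCHAR(50) NULL,
    Amount NVARCHAR(20) NULL,
    Notes NVARCHAR(100) NULL
);

-- Verify the load
SELECT [Name], City, Amount, Notes FROM dbo.SpecialCharsTest;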

Of course, you are probably asking, "So what is the problem?" With delimited flat files, nothing. However, let
us move on to a fixed length file. Running through the same connection setup process, we start by browsing to
and selecting our fixed format flat file. Be sure that the Code page is 65001 (UTF-8), the format is set to
Ragged right, and that the "Column names in the first data row" check box is selected. So far, so good.

However, as displayed on the below screen print, we immediately begin to see some problems as we attempt to
set the fixed column widths. On this screen you set the column widths for each field by clicking on the ruler
area. Taking a look at the first column, the name field, we see that even though the column is supposed to be
just 14 characters wide, some of the names with special characters push beyond the 14 character mark. The
underlying cause appears to be that SSIS measures fixed width columns in bytes, and a special character encoded
in UTF-8 occupies two or more bytes, so a 14 character name containing special characters no longer fits within
14 bytes. We could adjust the column size by one or more units, but then some of the columns would actually
contain characters belonging to another field. Also notice that each subsequent column is pushed right based on
the number of special characters on that row.

I decided that maybe the problem was just a preview issue and trudged forward to the Advanced screen. Just as
with the delimited file, we need to adjust all the character columns to use a Unicode string [DT_WSTR] data
type.

Finally, reviewing the preview window, as illustrated below, the same problem which appeared on the columns
screen surfaces on the preview screen. Again, I contemplated that maybe the issue was just a "preview" issue
and moved forward with creating the data flow.
Following the same process noted above for the delimited file, we create a data flow with a flat file source and
a SQL Server destination. Finally, we run the fixed width file package.

Wow, the data loaded successfully. However, reviewing the below query results shows that the preview problem
raised earlier carried through to the load. The good news is that the characters themselves were loaded
correctly, just into the wrong columns.

After some web searching, I came to the conclusion that the problem is a known issue with a related Microsoft
Connect entry:
http://connect.microsoft.com/SQLServer/feedback/details/742632/european-character-causes-error-when-importing-flat-file-utf-8-code-page-with-ragged-right-format. So what is our workaround? Since the characters
are being loaded correctly, we actually have a few options. First, and my preferred option, is to load the
entire row of data into a single column and then use SQL to complete the parsing process. Another option would
be to write a VB.NET or C# script component to parse out the characters. For the rest of this example, we will
focus on the SQL solution. First, we need to create a staging table with a single column large enough to hold an
entire row; the data type for the column needs to be NVARCHAR. Some folks have mentioned that they cannot fit
the entire row into one column; in that case, you could potentially split the data across two or more columns.
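
A minimal sketch of such a staging table follows; the table name and the 1000 character width are assumptions,
so size the column to your longest row:

-- One wide Unicode column holds each entire, unparsed file row
CREATE TABLE dbo.StagingFixedWidthRow
(
    FullRow NVARCHAR(1000) NULL
);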

Now we can test run the package to load the entire row into the staging table; the result is displayed below.

Next, we create a query to parse the long string into the appropriate column parts using the SUBSTRING
function. The original file layout of column widths and start positions is a great help when writing the
SUBSTRING query.
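
As a sketch, assuming the 14 character name column described earlier and hypothetical widths for the remaining
three columns, the parsing query might look like this:

-- Start positions and widths come straight from the original file layout
SELECT
    LTRIM(RTRIM(SUBSTRING(FullRow, 1, 14))) AS [Name],
    LTRIM(RTRIM(SUBSTRING(FullRow, 15, 20))) AS City,
    LTRIM(RTRIM(SUBSTRING(FullRow, 35, 10))) AS Amount,
    LTRIM(RTRIM(SUBSTRING(FullRow, 45, 25))) AS Notes
FROM dbo.StagingFixedWidthRow;

Because the staged row is NVARCHAR, SUBSTRING counts characters rather than bytes, so the special characters no
longer shift the column boundaries.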

Finally, we can add this query as a source for a second data flow to move the data from the single column
staging table to the final table, as illustrated in the next two screen prints.
Using this alternate method, we now have the fixed ragged right file loaded into the table as noted below.
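
Alternatively, if you would rather skip the second data flow entirely, the same parse can be done in a single
set-based T-SQL statement, again using the hypothetical names from the sketches above:

-- Parse the staged rows and move them to the final table in one statement
INSERT INTO dbo.SpecialCharsTest ([Name], City, Amount, Notes)
SELECT
    LTRIM(RTRIM(SUBSTRING(FullRow, 1, 14))),
    LTRIM(RTRIM(SUBSTRING(FullRow, 15, 20))),
    LTRIM(RTRIM(SUBSTRING(FullRow, 35, 10))),
    LTRIM(RTRIM(SUBSTRING(FullRow, 45, 25)))
FROM dbo.StagingFixedWidthRow;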

Conclusion
Even though SSIS supports a large variety of code page encoding formats, data from UTF-8 encoded flat files
(and other encoding formats) that contain special characters may not load as expected. A delimited file with
special characters loads fairly easily and without many issues beyond setting the data types correctly.
However, if you have a ragged right, fixed format flat file, the load will generally complete without error,
but as seen in the preview mode of the flat file connection manager, rows that contain special characters will
push subsequent characters to the right and thus load incorrectly parsed data into the wrong fields. This
situation can be worked around by loading each entire row as a Unicode string into a single column wide enough
to accept all the row data. Once loaded into the single column, the data can be parsed using the SQL Server
SUBSTRING function. As an alternative, you could also parse the row using a script component in SSIS. Although
the issue has been acknowledged on Microsoft Connect, it has been around since SQL Server 2005 and was not
addressed in SQL Server 2012. Thus, we will likely need to deal with the workaround for the near future.

Next Steps
Check out the following resources:
Unicode Consortium: http://www.unicode.org/
UTF-8 Specifications: http://www.ietf.org/rfc/rfc3629.txt
Markus Kuhn's FAQ on Unicode and UTF-8: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
Get started with SQL Server Integration Services:
SQL Server Integration Services Tutorial
SQL Server Business Intelligence Tips
Import Excel unicode data with SQL Server Integration Services

Last Update: 12/5/2013

About the author


Scott Murray has a passion for crafting BI Solutions with SharePoint, SSAS, OLAP and SSRS.

View all my tips

Related Resources
More Business Intelligence Tips...




Tuesday, March 22, 2016 - 2:47:46 PM - Bosko Back To Top

:) :) Python one-liner conversion (note: this relies on Python 2's unicode() built-in)...

python.exe -c "open('out.csv','wb').write( unicode( open('inp.csv','rb').read(), 'utf8').encode('UTF-16LE') )"

Wednesday, March 16, 2016 - 11:41:30 AM - Scott Murray Back To Top

It all depends on the data in the file and what it has been assigned. Yes, the check box is the "default"
setting, but the DSN may discover other data source types based on the 1st xx number of lines.

Wednesday, March 16, 2016 - 10:26:04 AM - Jared Back To Top

I am confused as to the redundancy in unicode settings available. You can set the code page in the general
file connection settings to 65001 (UTF-8), as you demonstrated. Then what does the "Unicode" checkbox on
the same page do exactly? And why is it necessary to go to each field in a file source and change the
datatype to DT_WSTR? Shouldn't they already be unicode strings?

Wednesday, September 30, 2015 - 12:44:04 PM - Robert Carnegie Back To Top

Alternatively, you may be able to convert a file from UTF-8 to UTF-16 and then treat it as UCS-2 in SQL
Server. Even use BCP or BULK INSERT. Notepad++ has a setting to alter the encoding when you save a file,
but I've seen Notepad++ do funny things when editing a large file. Java comes with a converter "native2ascii"
which also did some funny things, but the data seemed to come out OK in SQL Server. I used a recipe like this
in a CMD file. This is actually converting to and then from UTF-32.

SET FILE1=%1
SET FILE2=C:\OUTPUT\UCS.CSV
SET CONVERT=C:\jdk1.6.0_10\bin\native2ascii
%CONVERT% -encoding UTF-8 %FILE1% | %CONVERT% -reverse -encoding X-UTF-16LE-BOM > %FILE2%

By dropping any UTF-8 text file on the script's icon, the file data should be shoved into file UCS.CSV.

At the time (2011) I wrote, "Conversion by this method prior to import is not completely satisfactory, because
the output file is a little larger than expected from the input size, without explanation, except that something
like blank lines seems to appear in the converted output. That appears to be filtered out in import, but it may
be not the only unintended alteration."

Looking at it subsequently, I found I also couldn't remember why I apparently deliberately chose an encoding
with "Byte order mark" (BOM) in the output. It may be better without.

Thursday, April 16, 2015 - 3:07:06 PM - MMM Back To Top

Great post, saved me tons of hours of working. Very nice work around, really appreciate it, THANKS!!!

Wednesday, October 29, 2014 - 6:08:16 PM - Jason Simmons Back To Top

Very helpful article. I have been dealing with the exact same issue and spent 24 hours tweaking my database
to no avail.... this got me what I needed and I am well on my way with loading data.

Thanks again!

~Jason

Tuesday, October 21, 2014 - 5:02:09 AM - Pat Back To Top

Very helpful tips. Thank you!

Tuesday, August 19, 2014 - 2:07:47 PM - Heather Phillips Back To Top

I would like to open a DAT file from a Windows Mail or Microsoft Outlook email. It does not open with file
viewers etc. It has codes in the notepad file that I am able to view and it's an email that was saved. Are you
able to help me?

I would be so grateful if you could.

Thank you

Thursday, August 07, 2014 - 11:48:39 PM - Sam Back To Top

This was a huge help!

I super appreciate you writing this. Was stuck on this import issue for a few hours.

Thanks!

Sam

Wednesday, January 08, 2014 - 8:19:55 AM - John Back To Top


Well done!



