Creative Commons License
All Blog Articles, Data Models and Free Source Code by Simon Greener, The SpatialDB Advisor is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

String Tokenizer for SQL Server 2008 written in TSQL

Saturday August 29 2009 at 18:01

Keywords: string tokenizer separator delimiter
Summary

String tokenizer for SQL Server 2008 and Denali (SQL Server 2012).

Update: Tokenizer has been updated for Denali analytic features and to expose the separators in the output.

Applications often need a string tokenizer. I needed one for some TSQL development I am currently doing for a customer. I had previously written one for Oracle, so I decided to rewrite it for SQL Server 2008 (I believe it should also work in 2005).

The main limitations in the conversion are the lack of Oracle's hierarchical “CONNECT BY LEVEL” clause and the lack of a LEAD function in SQL Server’s limited implementation of analytics. Also, the function depends on my generate_series() function, as described in this article.
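For readers who do not have that function to hand, here is a minimal stand-in. It is a sketch only, not the author's generate_series(): the parameter names, the loop-based body and the restriction to a positive step are assumptions. The only things taken from the tokenizer code are the call shape (start, end, step) and the output column name IntValue.

```sql
-- Hypothetical stand-in for dbo.generate_series(); the real function is
-- described in a separate article and may be implemented differently.
-- Assumption: only positive steps are needed (the tokenizer always passes 1).
CREATE FUNCTION dbo.generate_series(@p_start INT,
                                    @p_end   INT,
                                    @p_step  INT)
  RETURNS @series TABLE
 (
   IntValue INT
  )
AS
BEGIN
  DECLARE @v INT;
  SET @v = @p_start;
  -- A zero or negative step would never reach @p_end, so return no rows.
  IF @p_step > 0
  BEGIN
    WHILE @v <= @p_end
    BEGIN
      INSERT INTO @series (IntValue) VALUES (@v);
      SET @v = @v + @p_step;
    END;
  END;
  RETURN;
END
GO

-- Usage: one row per integer from 1 to 5.
SELECT s.IntValue
  FROM dbo.generate_series(1, 5, 1) s;
```

A row-by-row loop is slow for long strings; it is used here only because it is easy to read.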

Still, with a little perseverance I came up with a working implementation.

Here it is.

    USE [GISDB]  -- Change this to the database in which you want to create the function.
    GO
    /*********************************************************************************
    ** @function    : Tokenizer
    ** @precis      : Splits any string into its tokens.
    ** @description : Supplied a string and a list of separators this function
    **                returns resultant tokens as a table collection.
    ** @example     : SELECT t.token
    **                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
    ** @param       : p_string. The string to be Tokenized.
    ** @param       : p_separators. The characters that are used to split the string.
    ** @depend      : dbo.generate_series()
    ** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
    ** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
    ** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
    **/
    -- Drop any previous version. CREATE FUNCTION must be the first statement
    -- in its batch, hence the GO before it.
    IF OBJECT_ID('dbo.Tokenizer') IS NOT NULL
      DROP FUNCTION dbo.Tokenizer;
    GO
    CREATE FUNCTION dbo.Tokenizer(@p_string     VARCHAR(MAX),
                                  @p_separators VARCHAR(254))
      RETURNS @varchar_table TABLE
     (
       token VARCHAR(MAX)
      )
    AS
    BEGIN
      BEGIN
          -- myCte holds the position (beg) of every separator found in the string,
          -- plus sentinels at 0 and DATALENGTH+1, numbered in position order.
          WITH myCte AS (
          SELECT c.beg,
                 c.fullstring,
                 ROW_NUMBER() OVER(ORDER BY c.beg ASC) RowVersion
            FROM (SELECT b.beg, b.fullstring
                    FROM (SELECT a.beg, @p_string AS fullstring
                            FROM (SELECT c.IntValue AS beg
                                    FROM dbo.generate_series(1,DATALENGTH(@p_string),1) c
                                  ) a
                         ) b,
                         (SELECT SUBSTRING(@p_separators,d.IntValue,1) AS delim
                            FROM dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                         ) c
                   WHERE CHARINDEX(c.delim,SUBSTRING(b.fullstring,b.beg,1)) > 0
                   UNION ALL SELECT 0 AS beg, @p_string AS fullstring
                   UNION ALL SELECT DATALENGTH(@p_string)+1 AS beg, @p_string AS fullstring
                 ) c
          )
          -- Pair each position with the next one (the self-join stands in for LEAD)
          -- and keep the non-empty text between them.
          INSERT INTO @varchar_table
          SELECT SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token
            FROM (SELECT BASE.beg,
                         LEAD.beg end_p,
                         BASE.fullstring
                    FROM myCte BASE LEFT JOIN myCte LEAD ON BASE.RowVersion = LEAD.RowVersion-1
                 ) d
           WHERE d.end_p IS NOT NULL
             AND d.end_p > d.beg + 1;
          RETURN;
      END;
    END
    GO
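The self-join on RowVersion in the function above is what stands in for the missing LEAD function. The pattern is easier to see in isolation, so here is a small standalone sketch. The table variable @positions and its values are made up for illustration: they are the sentinel and separator positions you would get for 'The rain in spain' split on spaces.

```sql
-- Illustrative data only: @positions is a hypothetical table variable.
-- 0 and 18 are the start and end sentinels; 4, 9 and 12 are the spaces.
DECLARE @positions TABLE (beg INT);
INSERT INTO @positions (beg) VALUES (0), (4), (9), (12), (18);

WITH numbered AS (
  SELECT p.beg,
         ROW_NUMBER() OVER (ORDER BY p.beg ASC) AS rn
    FROM @positions p
)
SELECT cur.beg,
       nxt.beg AS next_beg   -- the value LEAD(beg) OVER (ORDER BY beg) would give
  FROM numbered cur
       LEFT JOIN numbered nxt
              ON nxt.rn = cur.rn + 1
 ORDER BY cur.beg;
```

Each row is matched to the row with the next row number, so next_beg is NULL only for the last position. The tokenizer then takes the text between beg and next_beg as the token.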

Here are my simple tests.

    SELECT DISTINCT t.token
      FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t;

Result.

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

    SELECT t.token
      FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Result.

token
The
rain
in
spain
stays
mainly
on
the
plain

Now, if you want to collect them back into a single string, here’s an example of what you can do.

    -- Note: xml data type methods are case-sensitive, so value() must be lowercase.
    SELECT (STUFF((SELECT DISTINCT ':' + a.gtype
                     FROM ( SELECT DISTINCT t.token AS gtype
                              FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t
                           ) a
                    ORDER BY ':' + a.gtype
                    FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'')
            ) AS GeometryTypes;

Result.

GeometryTypes
LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Polygon
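
On SQL Server 2017 and later, which postdate this article, the built-in STRING_AGG aggregate can replace the STUFF / FOR XML PATH idiom. The sketch below assumes such a version and that dbo.Tokenizer is already installed. It should give the same string as the query above, but it is offered as an alternative and has not been checked against the article's original environment.

```sql
-- Assumes SQL Server 2017 or later (STRING_AGG) and an installed dbo.Tokenizer.
SELECT STRING_AGG(a.gtype, ':') WITHIN GROUP (ORDER BY a.gtype ASC) AS GeometryTypes
  FROM ( SELECT DISTINCT t.token AS gtype
           FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') AS t
       ) a;
```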

Upgraded Version for Denali

    CREATE FUNCTION [dbo].[Tokenizer](@p_string     VARCHAR(MAX),
                                      @p_separators VARCHAR(254))
      RETURNS @varchar_table TABLE
     (
       id        INT,
       token     VARCHAR(MAX),
       separator VARCHAR(MAX)
      )
    AS
    BEGIN
      BEGIN
        -- DATALENGTH rather than LEN is used for the series bounds: LEN ignores
        -- trailing spaces, so a separator list ending in a space would lose it.
        WITH myCte AS (
          SELECT c.beg, c.sep, ROW_NUMBER() OVER(ORDER BY c.beg ASC) RowID
            FROM (SELECT b.beg, c.sep
                    FROM (SELECT a.beg
                            FROM (SELECT c.IntValue AS beg
                                    FROM dbo.generate_series(1,DATALENGTH(@p_string),1) c
                                 ) a
                         ) b,
                         (SELECT SUBSTRING(@p_separators,d.IntValue,1) AS sep
                            FROM dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                         ) c
                   WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
                   UNION ALL SELECT                       0 AS beg, CAST(NULL AS VARCHAR) AS sep
                   UNION ALL SELECT DATALENGTH(@p_string)+1 AS beg, CAST(NULL AS VARCHAR) AS sep
                 ) c
        )
        -- LEAD supplies the next separator position and the separator that ends each token.
        INSERT INTO @varchar_table
        SELECT ROW_NUMBER() OVER (ORDER BY a.RowID ASC) AS Id,
               CASE WHEN LEN(a.token) = 0 THEN NULL ELSE a.token END AS token,
               a.sep
          FROM (SELECT RowID,
                       SUBSTRING(@p_string, (d.beg + 1), (LEAD(beg,1) OVER (ORDER BY RowID ASC) - d.beg - 1) ) AS token,
                       LEAD(sep,1) OVER (ORDER BY RowID ASC) AS sep
                  FROM myCte d
               ) AS a
         -- LEN(' ') is 0, so an empty token followed by a space separator is dropped here.
         WHERE LEN(a.token) <> 0 OR LEN(a.sep) <> 0;
        RETURN;
      END;
    END
    GO

Test.

    SELECT t.id, t.token, t.separator
      FROM dbo.tokenizer('POLYGON((2300 -700, 2800 -300, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400, 2300 200, 2100 100, 2500 100, 2300 -200, 1800 -300, 2300 -500, 2200 -400, 2400 -400, 2300 -700), (2300 1000, 2400  900, 2200 900, 2300 1000))',' ,()') AS t;

Result. A blank in the separator column is a single space.

id  token    separator
1   POLYGON  (
2   NULL     (
3   2300
4   -700     ,
5   2800
6   -300     ,
7   2300
8   700      ,
9   2800
10  1100     ,
11  2300
12  1100     ,
13  1800
14  1100     ,
15  2300
16  400      ,
17  2300
18  200      ,
19  2100
20  100      ,
21  2500
22  100      ,
23  2300
24  -200     ,
25  1800
26  -300     ,
27  2300
28  -500     ,
29  2200
30  -400     ,
31  2400
32  -400     ,
33  2300
34  -700     )
35  NULL     ,
36  NULL     (
37  2300
38  1000     ,
39  2400
40  900      ,
41  2200
42  900      ,
43  2300
44  1000     )
45  NULL     )
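
Because this version returns an id column, a token can be picked out by its position, or the tokens listed in a guaranteed order. The queries below are a small sketch that assumes the Denali version of dbo.Tokenizer above is installed. Note that ids also count rows whose token is NULL (rows 2 and 35 in the result above), so the id is the row's position in the output and not necessarily the position among non-empty tokens.

```sql
-- Assumes the Denali version of dbo.Tokenizer is installed.
-- Pick the second token; for this input that should be MultiLineString.
SELECT t.token
  FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint', ':') AS t
 WHERE t.id = 2;

-- List only the non-empty tokens, in the order they appear in the string.
SELECT t.id, t.token
  FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint', ':') AS t
 WHERE t.token IS NOT NULL
 ORDER BY t.id;
```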

I hope that someone out there finds this useful.


Comment [5]

Simon,

Thanks for the article. I think I’ve had several occasions when I thought it would be nice to have a string tokenizer, but I've been too lazy to look for a script for one for SQL Server or roll my own.

For a group concatenation function for strings I have one of those .NET custom aggregate functions, mostly because I can never remember the XML path syntax and it's a bit easier to write.

The disadvantage is that .NET is disabled by default in SQL Server, so it has to be enabled in the surface area configuration, and getting a customer's admin to do this is oh so frustrating – I often wait for it to be escalated to some administrator who has a clue, which sometimes takes a week for a 1 minute configuration change.

It would have been really nice if SQL Server 2008 allowed defining aggregates in T-SQL similar to what PostgreSQL allows with sql/plpgsql.

By the way I think your code would work fine in SQL Server 2005 too. Will have to give it a try.

Regina · 30 August 2009, 01:23 · #

Simon

Just a small note that all your @ symbols have been removed from your variables in the function.

— James · 31 August 2009, 10:17 · #

James,

Thanks for letting me know: the article should, now, be fixed.

regards
Simon

Simon · 31 August 2009, 10:56 · #

Nice. However, this would have been even better if it returned some kind of order for the tokens. Let’s say I want to reliably identify the second token in the string… how do I do that? As far as I know, an SQL table is unordered by definition.

— Darren · 22 January 2011, 04:33 · #

Yes, relational theory says that a relation (table) is not ordered. To get ordered output you have to include an ORDER BY clause in the SQL.

So, if you pulled the SQL out of the function, removed the INSERT, and instead of:

    SELECT SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token

you put

    SELECT d.beg, SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token

you will see the ordering is preserved.

If not, just add:

    ORDER BY d.beg

at the end. That is:

    SELECT SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token
      FROM (SELECT BASE.beg,
                   LEAD.beg end_p,
                   BASE.fullstring
              FROM MyCTE BASE
                   LEFT JOIN MyCTE LEAD
                          ON BASE.RowVersion = LEAD.RowVersion-1
           ) d
     WHERE d.end_p IS NOT NULL
       AND d.end_p > d.beg + 1
     ORDER BY d.beg;

Simon

Simon Greener · 22 January 2011, 13:37 · #