Go to content Go to navigation and search

Home

Current Oracle Spatial Blog Articles


Search

Browse

RSS / Atom

Email me

textpattern

Creative Commons License
All Blog Articles, Data Models and Free Source Code by Simon Greener, The SpatialDB Advisor is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

String Tokenizer for Oracle

Wednesday June 29 2011 at 14:49

Keywordsstring tokenizer returns token separator
Summary

This article contains two versions of a string tokenizer. One that just returns tokens that are separated by a list of separators, the other that returns tokens and the separators as they are found in the string.

UPDATED FUNCTION AT END OF ARTICLE RETURNS SEPARATORS

We often have need of string tokenizers in applications. Here is one for Oracle based on work by Pawel Barut.

  1. -- We need a type to hold the returned tokens
  2. --
  3. CREATE OR REPLACE TYPE T_TokenSet AS TABLE OF VARCHAR2(4000);
  4. /
  5. SHOW errors
  6. GRANT EXECUTE ON T_TokenSet TO public WITH GRANT OPTION;
  7. /*********************************************************************************
  8. ** @function    : Tokenizer
  9. ** @precis      : Splits any string into its tokens.
  10. ** @description : Supplied a string and a list of separators this function
  11. **                returns resultant tokens as a pipelined collection.
  12. ** @example     : SELECT t.column_value
  13. **                  FROM TABLE(tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') ) t;
  14. ** @param       : p_string. The string to be Tokenized.
  15. ** @param       : p_separators. The characters that are used to split the string.
  16. ** @requires    : t_TokenSet type to be declared.
  17. ** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
  18. ** @history     : Simon Greener - July 2006 - Original coding (extended SQL sourced from a blog on the internet)
  19. **/
  20. CREATE OR REPLACE
  21.   FUNCTION Tokenizer(p_string     IN VarChar2,
  22.                      p_separators IN VarChar2 DEFAULT ' ')
  23.     RETURN T_TokenSet Pipelined
  24.   AS
  25.     v_strs camod_common.T_TokenSet;
  26.   BEGIN
  27.     IF ( p_string IS NULL
  28.          OR
  29.          p_separators IS NULL ) THEN
  30.        RETURN;
  31.     END IF;
  32.     WITH sel_string AS (SELECT p_string fullstring FROM dual)
  33.     SELECT substr(fullstring, beg+1, end_p-beg-1) token
  34.            Bulk Collect INTO v_strs
  35.       FROM (SELECT beg, Lead(beg) OVER (ORDER BY beg) end_p, fullstring
  36.               FROM (SELECT beg, fullstring
  37.                       FROM (SELECT Level beg, fullstring
  38.                               FROM sel_string
  39.                             CONNECT BY Level <= LENGTH(fullstring)
  40.                   )
  41.                      WHERE instr(p_separators,substr(fullstring,beg,1)) >0
  42.                     UNION ALL
  43.                     SELECT 0, fullstring
  44.                       FROM sel_string
  45.                     UNION ALL
  46.                     SELECT LENGTH(fullstring)+1, fullstring
  47.                       FROM sel_string)
  48.            )
  49.      WHERE end_p IS NOT NULL
  50.        AND end_p > beg + 1;
  51.     FOR i IN v_strs.FIRST..v_strs.LAST Loop
  52.       PIPE ROW(v_strs(i));
  53.     END Loop;
  54.     RETURN;
  55.   END Tokenizer;

Here are my, simple, tests.

  1. SELECT DISTINCT t.column_value AS token
  2.   FROM TABLE(Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':')) t;

Result.

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon
  1. SELECT t.column_value AS token
  2.   FROM TABLE(tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!')) t;

Result.

token
The
rain
in
spain
stays
mainly
on
the
plain

Updated Function

Sometimes it is more than handy to be able to access the separators in the position they are found in the output from a Tokenizer. Here is an updated version of the above that does this.

  1. -- New types
  2. --
  3. -- We need a type to hold the returned tokens
  4. --
  5. DROP TYPE T_Token    Force;
  6. DROP TYPE T_TokenSet Force;
  7. CREATE TYPE T_Token AS Object (
  8.    id        INTEGER,
  9.    token     varchar2(30000),
  10.    separator varchar2(30000)
  11. );
  12. /
  13. SHOW errors
  14. GRANT EXECUTE ON T_Token TO public WITH GRANT OPTION;
  15. CREATE TYPE T_TokenSet AS TABLE OF codesys.t_token;
  16. /
  17. SHOW errors
  18. GRANT EXECUTE ON T_TokenSet TO public WITH GRANT OPTION;
  19. /*********************************************************************************
  20. ** @function    : Tokenizer
  21. ** @precis      : Splits any string into its tokens.
  22. ** @description : Supplied a string and a list of separators this function
  23. **                returns resultant tokens as a pipelined collection.
  24. ** @example     : SELECT t.column_value
  25. **                  FROM TABLE(tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') ) t;
  26. ** @param       : p_string. The string to be Tokenized.
  27. ** @param       : p_separators. The characters that are used to split the string.
  28. ** @requires    : t_TokenSet type to be declared.
  29. ** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
  30. ** @history     : Simon Greener - July 2006 - Original coding (extended SQL sourced from a blog on the internet)
  31. ** @history     : Simon Greener - Apr 2012  - Extended TO include returning OF tokens
  32. **/
  33. CREATE OR REPLACE
  34.   FUNCTION Tokenizer(p_string     IN VarChar2,
  35.                      p_separators IN VarChar2 DEFAULT ' ')
  36.     RETURN T_TokenSet Pipelined
  37.   AS
  38.     v_tokens codesys.T_TokenSet;
  39.   BEGIN
  40.     IF ( p_string IS NULL
  41.          OR
  42.          p_separators IS NULL ) THEN
  43.        RETURN;
  44.     END IF;
  45.     WITH myCTE AS (
  46.        SELECT c.beg, c.sep, ROW_NUMBER() OVER(ORDER BY c.beg ASC) rid
  47.          FROM (SELECT b.beg, c.sep
  48.                  FROM (SELECT Level beg
  49.                          FROM dual
  50.                         CONNECT BY Level <= LENGTH(p_string)
  51.                       ) b,
  52.                       (SELECT SubStr(p_separators,level,1) AS sep
  53.                         FROM dual
  54.                         CONNECT BY Level <= LENGTH(p_separators)
  55.                       ) c
  56.                 WHERE instr(c.sep,substr(p_string,b.beg,1)) >0
  57.                UNION ALL SELECT 0, CAST(NULL AS varchar2(10)) FROM dual
  58.              ) c
  59.     )
  60.     SELECT T_Token(ROW_NUMBER() OVER (ORDER BY a.rid ASC),
  61.                    CASE WHEN LENGTH(a.token) = 0 THEN NULL ELSE a.token END,
  62.                    a.sep) AS token
  63.       Bulk Collect INTO v_tokens
  64.       FROM (SELECT d.rid,
  65.                    SubStr(p_string,
  66.                           (d.beg + 1),
  67.                           NVL((Lead(d.beg,1) OVER (ORDER BY d.rid ASC) - d.beg - 1),LENGTH(p_string)) ) AS token,
  68.                    Lead(d.sep,1) OVER (ORDER BY d.rid ASC) AS sep
  69.               FROM MyCTE d
  70.            ) a
  71.      WHERE LENGTH(a.token) <> 0 OR LENGTH(a.sep) <> 0;
  72.     FOR v_i IN v_tokens.FIRST..v_tokens.LAST loop  
  73.        PIPE ROW(v_tokens(v_i));
  74.     END LOOP;
  75.     RETURN;
  76.   END Tokenizer;

Testing

  1. SELECT DISTINCT t.token
  2.  FROM TABLE(Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':')) t;

Results

TOKEN
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

The classic “Rain in Spain…”.

  1. SELECT t.*
  2.   FROM TABLE(tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!')) t;

Results

ID TOKEN SEPARATOR
1 The {SPACE}
2 rain {SPACE}
3 in {SPACE}
4 spain ,
5 (null) {SPACE}
6 stays {SPACE}
7 mainly {SPACE}
8 on {SPACE}
9 the {SPACE}
10 plain .
11 (null) !

Now, let’s process a POLYGON WKT.

  1. SELECT t.id, t.token, t.separator
  2.   FROM TABLE(tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400  900, 2200 900, 2300 1000))',' ,()')) t;

Results

ID TOKEN SEPARATOR
1 POLYGON (
2 (null) (
3 2300 {SPACE}
4 400 ,
5 (null) {SPACE}
6 2300 {SPACE}
7 700 ,
8 (null) {SPACE}
9 2800 {SPACE}
10 1100 ,
11 (null) {SPACE}
12 2300 {SPACE}
13 1100 ,
14 (null) {SPACE}
15 1800 {SPACE}
16 1100 ,
17 (null) {SPACE}
18 2300 {SPACE}
19 400 )
20 (null) ,
21 (null) {SPACE}
22 (null) (
23 2300 {SPACE}
24 1000 ,
25 (null) {SPACE}
26 2400 {SPACE}
27 (null) {SPACE}
28 900 ,
29 (null) {SPACE}
30 2200 {SPACE}
31 900 ,
32 (null) {SPACE}
33 2300 {SPACE}
34 1000 )
35 (null) )

This time don’t include the space as a separator.

  1. SELECT t.id, t.token, t.separator
  2.   FROM TABLE(tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400  900, 2200 900, 2300 1000))',',()')) t;

Results

ID TOKEN SEPARATOR
1 POLYGON (
2 (null) (
3 2300 400 ,
4 2300 700 ,
5 2800 1100 ,
6 2300 1100 ,
7 1800 1100 ,
8 2300 400 )
9 (null) ,
10 {SPACE} (
11 2300 1000 ,
12 2400 900 ,
13 2200 900 ,
14 2300 1000 )
15 (null) )

I hope that someone out there finds this useful.code id=

Creative Commons License

post this at del.icio.uspost this at Diggpost this at Technoratipost this at Redditpost this at Farkpost this at Yahoo! my webpost this at Windows Livepost this at Google Bookmarkspost this to Twitter

Comment