C++ String Tokenizer

Download: tokenizer.zip

String manipulation is a relatively weak area of C++ compared with scripting languages. This is a simple C++ class that splits a string into multiple substrings based on given delimiters. The tokenizer uses only the standard C++ string class and an iterator.

It behaves like the strtok() function in the C standard library. Call Tokenizer::next() to get the next token as a std::string. When no tokens are left, it returns a zero-length string, "".

This class exposes 4 member functions:


void Tokenizer::set ( const std::string& str,
                      const std::string& delimiter=DEFAULT_DELIMITER )

Tokenizer::set() sets both the input string to be split and the delimiter string. Once it is called, Tokenizer discards any previous state and re-initializes its members for the new input string. The delimiter string is optional. If it is omitted, the default delimiters are used: space (" "), horizontal tab (\t), vertical tab (\v), newline (\n), carriage return (\r), and form feed (\f).


void Tokenizer::setString(const std::string& str)

Tokenizer::setString() sets only the input string to be split. Like Tokenizer::set(), it clears the previous working string and re-initializes itself for the new input string.


void Tokenizer::setDelimiter(const std::string& delimiter)

Tokenizer::setDelimiter() sets only the delimiter string. Note that the delimiter string can hold multiple characters. It is also possible to change the delimiter string in the middle of processing an input string. In that case, Tokenizer continues from the current cursor position and returns the next token from there, not from the beginning of the string.


std::string Tokenizer::next()

Tokenizer::next() returns the next token starting at the current cursor position. When the cursor reaches the end of the input string, it returns a zero-length string, "".

Here is an example of how to use the Tokenizer class:


#include "Tokenizer.h"
#include <string>
#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
    // instantiate the Tokenizer class
    Tokenizer str("This is a very long string.");
    string token;

    // Tokenizer::next() returns the next available token from the source string.
    // When it reaches the end of the string, it returns a zero-length string, "".
    while((token = str.next()) != "")
    {
        cout << token << endl;
    }
    return 0;
}


The output should be like this:
===============================
This
is
a
very
long
string.

Download the source code of the Tokenizer class: tokenizer.zip.

Updates:
2011-03-08: Added split() function to return the array of tokens.
2008-01-22: Fixed a minor bug in handling the end of string.
