
C# 3.0: ZeroWidthSplit

Luke Breuer
2008-01-06 08:48 UTC

Intro
ZeroWidthSplit was originally written for a C# 4.0 proposal for non-nullable reference types, but it has since taken on a life of its own.

ZeroWidthSplit is designed to split a string into pieces without consuming any characters in the process. By contrast, the built-in String.Split in
"a,b,c".Split(',')
// output is { "a", "b", "c" }
consumes the , character. I came up with the extremely artificial requirement that the , character not be eaten. Thus:
ZeroWidthSplit("a,b,c", new Regex(","))
// output is { "a", ",b", ",c" }
This was sort of a bad example; oh well.
Functional?
Functional programming is pretty cool, as ZeroWidthSplit demonstrates. In particular, the approach I took below makes it easier to verify that the algorithm is correct. A more imperative approach would have to treat the edge cases (the first and last pieces of the string) specially. String processing can be quite difficult, because getting the edge cases right is tricky business. The approach below treats the first and last pieces exactly like every other piece, which all but guarantees that there are no funky edge cases.
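For comparison, here is a minimal sketch of what a more imperative version might look like. The class and method names (ImperativeSketch, ZeroWidthSplitImperative) are hypothetical and not part of the original code; the point is only to show where the special cases for the first and last pieces end up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImperativeSketch
{
    // Hypothetical imperative counterpart, written only for comparison.
    public static List<string> ZeroWidthSplitImperative(string splitThis, Regex atStartOfMatch)
    {
        if (splitThis == null)
            throw new ArgumentNullException("splitThis");
        if (atStartOfMatch == null)
            throw new ArgumentNullException("atStartOfMatch");

        var pieces = new List<string>();
        int start = 0;
        foreach (Match m in atStartOfMatch.Matches(splitThis))
        {
            // Edge case: a match at the current start (for example index 0)
            // would otherwise produce an empty piece.
            if (m.Index > start)
            {
                pieces.Add(splitThis.Substring(start, m.Index - start));
                start = m.Index;
            }
        }

        // Edge case: the tail after the last match has to be emitted separately.
        if (start < splitThis.Length)
            pieces.Add(splitThis.Substring(start));

        return pieces;
    }
}
```

Calling ImperativeSketch.ZeroWidthSplitImperative("xAaBbCc", new Regex("[A-Z]")) yields "x", "Aa", "Bb", "Cc", but the loop needs one guard for the head of the string and a separate statement for the tail.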
The code
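// Requires: using System; using System.Collections.Generic; using System.Linq;
//           using System.Text.RegularExpressions;
// SelfJoinByOffset is a custom extension method, not part of LINQ.
static IEnumerable<string> ZeroWidthSplit(string splitThis, Regex atStartOfMatch)
{
    if (splitThis == null)
        throw new ArgumentNullException("splitThis");
    if (atStartOfMatch == null)
        throw new ArgumentNullException("atStartOfMatch");

    return new[] { 0 }                          // cut before the first character
        .Concat(atStartOfMatch
            .Matches(splitThis)
            .Cast<Match>()
            .Select(m => m.Index)               // cut at the start of every match
        )
        .Concat(new[] { splitThis.Length })     // cut after the last character
        .SelfJoinByOffset(1)                    // pair each cut with the next one; the last wraps around to the first
        .Where(t => t.Second > t.First)         // drop the wrapped pair and any empty pieces
        .Select(t => splitThis.Substring(t.First, t.Second - t.First));
}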
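SelfJoinByOffset is not part of LINQ, and its source is not shown here. The following is a minimal sketch of what it might look like, assuming it pairs each element with the one offset positions later and wraps around at the end of the sequence (which is the behavior the explanation below relies on). Pair<T> is a hypothetical stand-in for the Tuple type with First and Second members, and the sketch assumes a non-negative offset.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a tuple type exposing First and Second.
class Pair<T>
{
    public T First { get; private set; }
    public T Second { get; private set; }

    public Pair(T first, T second)
    {
        First = first;
        Second = second;
    }
}

static class SelfJoinSketch
{
    // Pairs each element with the element `offset` positions later,
    // wrapping around to the start of the sequence (assumed behavior).
    public static IEnumerable<Pair<T>> SelfJoinByOffset<T>(this IEnumerable<T> source, int offset)
    {
        if (source == null)
            throw new ArgumentNullException("source");

        // Materialize once so elements can be looked up by index.
        var items = source.ToList();
        return items.Select((item, i) => new Pair<T>(item, items[(i + offset) % items.Count]));
    }
}
```

With this sketch, new[] { 0, 1, 3, 5, 7 }.SelfJoinByOffset(1) produces the pairs (0, 1), (1, 3), (3, 5), (5, 7), (7, 0).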
An explanation
ZeroWidthSplit takes the input "xAaBbCc" and the regex [A-Z] and returns "x", "Aa", "Bb", "Cc".

It works by applying the regex to the input, collecting the start index of each match, prepending zero, and appending the string length.
character indexes            0 1 2 3 4 5 6 7
the string                 " x A a B b C c"
places it needs to be cut   | |   |   |   |
prepended 0                 |
regex match indexes           |   |   |
appended string length                    |

the resultant indexes        0 1   3   5   7
I have spaced out the string to show the positions in between characters. Shift the pipes one character to the right to line them up with the actual character indexes.
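The index sequence in the diagram can be reproduced on its own. This is a small sketch that isolates the first half of the pipeline; CutIndexSketch and CutIndexes are hypothetical names introduced only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class CutIndexSketch
{
    // Hypothetical helper: returns 0, every match start index, then the string length.
    public static List<int> CutIndexes(string splitThis, Regex atStartOfMatch)
    {
        return new[] { 0 }
            .Concat(atStartOfMatch.Matches(splitThis).Cast<Match>().Select(m => m.Index))
            .Concat(new[] { splitThis.Length })
            .ToList();
    }
}
```

CutIndexSketch.CutIndexes("xAaBbCc", new Regex("[A-Z]")) returns 0, 1, 3, 5, 7. If the regex matched at index 0, the list would start with two zeros; the resulting empty piece is filtered out later by the Where call.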

The output of SelfJoinByOffset(1) for these indexes is the following:
Tuple<int, int>.First    0  1  3  5  7
Tuple<int, int>.Second   1  3  5  7  0
Second - First           1  2  2  2 -7
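Notice that First and Second - First are exactly the arguments I want to pass to String.Substring, except for the final pair (7, 0) with its length of -7. That pair appears because SelfJoinByOffset wraps the last element around to the first, and it is excluded by the call to Where. The same Where also drops the empty (0, 0) pair that shows up when the regex matches at index 0. Alternatively, I could alter SelfJoinByOffset to not return the last element.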
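One way to sketch that alternative is to pair adjacent indexes with Enumerable.Zip instead of a wrapping self-join. This assumes .NET 4.0 or later (Zip is not available in the .NET 3.5 libraries that shipped with C# 3.0), and the names AdjacentPairsSketch and ZeroWidthSplitNoWrap are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class AdjacentPairsSketch
{
    // Hypothetical variant: splits using adjacent index pairs with no wrap-around.
    public static IEnumerable<string> ZeroWidthSplitNoWrap(string splitThis, Regex atStartOfMatch)
    {
        if (splitThis == null)
            throw new ArgumentNullException("splitThis");
        if (atStartOfMatch == null)
            throw new ArgumentNullException("atStartOfMatch");

        var cuts = new[] { 0 }
            .Concat(atStartOfMatch.Matches(splitThis).Cast<Match>().Select(m => m.Index))
            .Concat(new[] { splitThis.Length })
            .ToList();

        // Zip stops at the shorter sequence, so the last index is never
        // paired with the first and no negative length can occur.
        return cuts
            .Zip(cuts.Skip(1), (first, second) => new { First = first, Second = second })
            .Where(t => t.Second > t.First) // still needed to drop empty pieces
            .Select(t => splitThis.Substring(t.First, t.Second - t.First));
    }
}
```

Even without the wrapped pair, the Where call stays, because a match at index 0 still produces a zero-length (0, 0) pair.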