Beyond Regular Expressions

Inspired by John Gruber’s attempts to pattern match URLs in a block of arbitrary text, I’ve attempted the same in REBOL:

Extract URLs
Link Up - applying the pattern to add links to faces in REBOL/View.

At first glance this might seem unwieldy next to a very concise Regular Expression. However, REBOL’s Parse dialect, though not quite as succinct initially, provides a more versatile and expressive vocabulary for tackling pattern matching.

Unlike RegEx, you don’t get predefined character ranges. Instead, you establish your own charsets, which you can merge and negate to form your base nouns:

letter: charset [#"a" - #"z"]
digit: charset [#"0" - #"9"]
word: charset [#"_" #"0" - #"9" #"A" - #"Z" #"a" - #"z"] ; per regex
space: charset "^/^- ()<>^"'" ; for curly quotes, need unicode (R3)
punct: charset "!'#$%&`*+,-./:;=?@[/]^^{|}~" ; regex 'punct without ()<>
chars: complement union space punct
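
As a minimal sketch of how these nouns behave, the lines below can be pasted into a console. They assume REBOL 2 (where parse/all is available), and the names alnum and non-digit are introduced purely for illustration; they are not part of the original script.

```rebol
letter: charset [#"a" - #"z"]
digit: charset [#"0" - #"9"]

; illustrative names, not part of the original script
alnum: union letter digit        ; merge: matches a letter or a digit
non-digit: complement digit      ; negate: matches anything except a digit

parse/all "2024" [some digit]        ; == true
parse/all "rebol3" [some alnum]      ; == true
parse/all "a1" [non-digit digit]     ; == true
parse/all "12" [non-digit digit]     ; == false
```

The examples stick to lowercase input so that the result does not depend on how a given REBOL build treats case when matching charsets.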

Additionally, you can build smaller rules that form additional nouns:

paren: ["(" some [chars | punct | "(" some [chars | punct] ")"] ")"]

When you have your nouns, you can then go about describing what your target looks like:

uri: [
    [
          letter some [word | "-"] ":" [1 3 "/" | letter | digit | "%"]
        | "www" 0 3 digit "."
        | some [letter | digit] "." 2 4 letter
    ]
    some [opt [some punct] some [chars | paren] opt "/"]
]
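
To see what the rules accept, here is a small self-contained check. It repeats the definitions from above verbatim so it can be pasted into a fresh console, and it assumes REBOL 2. The sample strings are my own illustrations, not test cases from the original script.

```rebol
letter: charset [#"a" - #"z"]
digit: charset [#"0" - #"9"]
word: charset [#"_" #"0" - #"9" #"A" - #"Z" #"a" - #"z"]
space: charset "^/^- ()<>^"'"
punct: charset "!'#$%&`*+,-./:;=?@[/]^^{|}~"
chars: complement union space punct

paren: ["(" some [chars | punct | "(" some [chars | punct] ")"] ")"]

uri: [
    [
          letter some [word | "-"] ":" [1 3 "/" | letter | digit | "%"]
        | "www" 0 3 digit "."
        | some [letter | digit] "." 2 4 letter
    ]
    some [opt [some punct] some [chars | paren] opt "/"]
]

; one level of nested parentheses is allowed inside paren
parse/all "(a(b)c)" paren                ; == true

; scheme-prefixed and bare "www." forms both match
parse/all "http://www.rebol.com/" uri    ; == true
parse/all "www.rebol.com" uri            ; == true

; an ordinary word matches none of the three openings
parse/all "hello" uri                    ; == false
```

Because parse/all only returns true when the rule consumes the whole string, each call tests whether the sample is a URL in its entirety, not whether it merely contains one.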

Because each noun is defined once and then reused by name, REBOL’s BNF-based parse dialect avoids repetition and is easy to decipher, which is arguable at best for RegEx.

The function is completed with a further parse rule that walks the entire text, collecting each URL it finds along with the plain text in between.

text: use [emit-link emit-text link mk ex][
    ; mk marks where the pending plain text starts, ex the current position
    emit-link: [(append out to-url link)]
    emit-text: [(unless mk = ex [append out copy/part mk ex])]

    [
        mk: any [
            ex: copy link uri emit-text emit-link mk: ; flush text, emit URL, reset mark
            | some [chars | punct] some space ; non-uri words
            | skip
        ]
        ex: emit-text ; flush any trailing text
    ]
]

func [
    "Separates URLs from plain text"
    txt [string!] "Text to be split"
][
    out: copy [] ; fresh result block for each call
    if parse/all txt text [out] ; parse returns true once the rule reaches the end of txt
]
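
The mark-and-emit pattern used in that rule can be tried in isolation. The sketch below assumes REBOL 2 and uses a deliberately simpler target, runs of digits instead of URLs. The word num and the inline actions are illustrative stand-ins for link, emit-text and emit-link.

```rebol
digit: charset [#"0" - #"9"]
out: copy []

parse/all "ab12cd345" [
    mk: any [
        ; ex records where this attempt starts; num receives the matched digits
        ex: copy num some digit
        (unless mk = ex [append out copy/part mk ex])   ; flush text before the match
        (append out to-integer num)                     ; emit the match itself
        mk:                                             ; pending text now starts here
        | skip                                          ; otherwise move on one character
    ]
    ex: (unless mk = ex [append out copy/part mk ex])   ; flush any trailing text
]

out    ; == ["ab" 12 "cd" 345]
```

As in the URL version, text is only copied out when a match is found or the input ends, so stretches of non-matching characters arrive as a single string instead of one element per character.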

REBOL 3 (usable, though still in development) will add Unicode support and extensions to the parse grammar.