HTML and Rebol

This is an experimental script using parsing rules from the HTML5 spec. Included are two functions built atop this parser to convert HTML strings to Rebol values.

This script can be used in both the Ren-C fork of Rebol 3 and in Red; this will likely remain the case once the ‘experimental’ phase is complete.

Ren-C:

import <markup>

Red and Ren-C:

do %markup.reb
; or
; do https://raw.githubusercontent.com/rgchris/Scripts/master/experimental/markup.reb

(This uses Red macros for compatibility. The macros might have side effects on other Red code, though maybe not: check the source prior to the Rebol header.)

load-markup

This function creates a block of tag and string values akin to Rebol 2’s LOAD/MARKUP. It does try to respect HTML5 parsing rules, though it is susceptible to badly formed markup structure. It is good for quick and dirty scraping jobs.

>> load-markup {<a href="/somewhere">Something</a>}
== [<a> #("href" "/somewhere") "Something" </a>]

Self-closing tags are denoted by an empty close tag:

>> load-markup {<foo/>}
== [<foo> </>]
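
Because the result is a flat block, ordinary series functions are enough to pick it apart. The following is a minimal sketch in Red, not part of the script itself. It works on a hand-built block of the same shape as the first example above (the name `sample` is made up for illustration, and the map is built with `make map!` rather than written as a literal) and collects the text values and any `href` attributes.

```red
Red []

; Hand-built stand-in for the value load-markup returns in the first example.
sample: reduce [
    <a> make map! ["href" "/somewhere"] "Something" </a>
]

; Keep only the plain string values (tags are a separate type, so they are skipped).
texts: collect [
    foreach value sample [
        if string? value [keep value]
    ]
]

; Attribute maps follow their opening tag; look up "href" in each map found.
links: collect [
    foreach value sample [
        if map? value [keep select value "href"]
    ]
]

probe texts
probe links
```

With real input, `sample` would be replaced by a call such as `load-markup page`, where `page` is whatever HTML string is being scraped.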

load-html

This function returns a hierarchical structure of nodes representing the document tree. It takes the form of a linked list that can be navigated using node/first (first child), node/last (last child), node/next (next sibling), node/back (previous sibling) and node/parent (parent). A walk function will traverse the entire tree (or only a node’s children, with the /only refinement):

>> trees/walk load-html {<img/>} [if node/type = 'element [probe node/name]]
html
head
body
img
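
The five navigation fields can also be followed by hand. The sketch below is a self-contained model in Red and does not use the script's own node type: `make-node` and `append-child` are hypothetical helpers that only mimic the links named above, and it assumes a missing link is `none`. It shows the usual first-child, next-sibling loop over one node's children.

```red
Red []

; Hypothetical stand-in for a node; real nodes come from load-html.
make-node: func [label [word!]] [
    make object! [
        name: label
        first: last: next: back: parent: none
    ]
]

; Hypothetical helper: link CHILD in as the last child of PARENT-NODE.
append-child: func [parent-node [object!] child [object!]] [
    child/parent: parent-node
    either parent-node/last [
        child/back: parent-node/last
        parent-node/last/next: child
    ][
        parent-node/first: child
    ]
    parent-node/last: child
    child
]

root: make-node 'body
append-child root make-node 'p
append-child root make-node 'img

; Start at the first child and follow NEXT until it runs out.
names: collect [
    child: root/first
    while [child] [
        keep child/name
        child: child/next
    ]
]

probe names
```

Whether the real nodes use `none` for an absent sibling is an assumption here; the source before the Rebol header is the place to confirm it.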

An additional MARKUP-AS-BLOCK function will convert the hierarchy to a more traditional Rebol block format:

>> probe markup-as-block load-html {<table><th>Header<td>Cell}
** (1,25): expected-closing-tag-but-got-eof
[
    <html> [
        <head> none
        <body> [
            <table> [
                <tbody> [
                    <tr> [
                        <th> [
                            %.txt "Header"
                        ]
                        <td> [
                            %.txt "Cell"
                        ]
                    ]
                ]
            ]
        ]
    ]
]
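
The block format pairs each tag with a block of children (or none), and each `%.txt` marker with a string, so it can be processed two values at a time. As a rough sketch in Red, assuming exactly that pairing, the hypothetical `text-of` function below gathers all text from a hand-copied fragment of the output above. It is an illustration only, not a function provided by the script.

```red
Red []

; Hand-copied fragment of the block format shown above.
fragment: [
    <tr> [
        <th> [%.txt "Header"]
        <td> [%.txt "Cell"]
    ]
]

; Hypothetical helper: concatenate every text value found in CONTENT.
text-of: func [content [block!] /local out] [
    out: copy ""
    foreach [key value] content [
        case [
            file? key [append out value]             ; a %.txt marker followed by its string
            block? value [append out text-of value]  ; a tag followed by its children
        ]
    ]
    out
]

probe text-of fragment
```

Pairs whose second value is not a block, such as `<head> none`, fall through both cases and contribute nothing.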

Note: parse errors are currently printed as in the above example; this will be configurable beyond the experimental phase.

More DOM-like functions to follow.

Misc

If you have any issues, post here.

For Rebol 2 HTML handling, check out this script built atop PowerMezz.