UText/1.2 Discussion
Brainstorming. Points of criticism. Wish list.
Last updated on Wed Apr 20, 2011.
Issues
The following should be fixed.
HTML Cleanup
The optional HTML cleanup that drops all empty lines before saving HTML files was a quick and dirty implementation that does not work well: empty lines inside <pre> are incorrectly removed. (Strictly speaking, whether empty lines matter is a question of CSS style, not purely of the HTML tag.)
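A minimal sketch in Python of a cleanup that spares <pre> blocks. The function name drop_empty_lines is hypothetical and not part of UText; the sketch only tracks <pre> tags line by line and does not handle elements whose white space is preserved through CSS.

```python
import re

# Hypothetical helper, not part of UText: drops empty lines except inside <pre>.
_PRE_OPEN = re.compile(r"<pre\b", re.IGNORECASE)
_PRE_CLOSE = re.compile(r"</pre\s*>", re.IGNORECASE)


def drop_empty_lines(html):
    kept = []
    depth = 0  # greater than 0 while we are inside a <pre> element
    for line in html.split("\n"):
        # Keep every non-empty line, and every line at all inside <pre>.
        if depth > 0 or line.strip():
            kept.append(line)
        # Update the nesting depth after deciding about the current line.
        depth += len(_PRE_OPEN.findall(line)) - len(_PRE_CLOSE.findall(line))
        depth = max(depth, 0)
    return "\n".join(kept)
```

Counting the open and close tags per line keeps the sketch independent of a full HTML parser, at the price of ignoring tags that appear inside comments or attribute values.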
If Tag
Because [lf] inside an [if] is interpreted as a condition separator, it is not possible to embed a newline character in an expression to be returned. It is also not possible to nest an [if] inside another [if], because the [lf] tags of the inner [if] are incorrectly interpreted as belonging to the outer [if].
Line Start with Dashes
A line starting with -- is discarded as a comment. This causes trouble when you need to output a line that starts this way. It is annoying and sometimes difficult to work around, because this preprocessing is hard-coded in UTL parsing and (independently) in Script parsing and cannot be disabled.
Windows
Under Windows the UText Shell seems to add an empty line after each response. The cause has not been investigated. I do not use the Windows version in a production environment; I tested it only superficially, and everything else seems to work.
UText Script
The following are some possible enhancements.
White Space Handling
The script parser should respect white space in parameters and body and pass it through unchanged. Currently it replaces sequences of white space characters with a single space.
Module load
The add-in load function load should not be static. Static registration must nevertheless remain possible.
Comments
For symmetry with UTL, lines between {-- and --} should be ignored.
Single Token
It would probably be useful to have a syntax for treating a string as a single token when it is parsed. To consider:
Multiple lines between a &{ and a &} are parsed as one single word, white space is respected, keywords are ignored. Embedded &{..&} should be allowed (needs level counting as in begin...end).
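The level counting could look roughly like the following Python sketch. The function name read_single_token is hypothetical; the sketch assumes that the delimiters are exactly the two-character sequences &{ and &} and that no escaping mechanism exists.

```python
def read_single_token(text, start=0):
    """Return (token, index just after the closing "&}") for a token opening at start."""
    if not text.startswith("&{", start):
        raise ValueError("expected '&{' at the start position")
    depth = 1  # nesting level, as in begin...end counting
    i = start + 2
    while i < len(text):
        if text.startswith("&{", i):
            depth += 1
            i += 2
        elif text.startswith("&}", i):
            depth -= 1
            if depth == 0:
                # White space and embedded delimiters are returned untouched.
                return text[start + 2:i], i + 2
            i += 2
        else:
            i += 1
    raise ValueError("unterminated '&{'")
```

The token is returned verbatim, so keywords and line breaks inside it never reach the regular word splitter.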
UTL Language
Iterative selectors
With a single selector one can currently select some unit and return a particular child unit of it only if the selection unit is an ancestor of the return unit. For example:
~keyword {
~name "alfa"
~description "the first letter of the greek alphabet"
}
You get the keyword with the selector
#keyword."alfa" name
but you cannot get its description with a selector, because name and description are at the same level. You can do this only with a script:
select #keyword."alfa" name do select description
The selector language could be extended with an operator to perform multiple select operations in a single selector:
#keyword."alfa" name, description
Regular Expression Matching Selectors
The selector can now match binary data, for example:
select # post . "science" category
returns all posts from the category ”science“. There could be a selector clause to match binary data against a regular expression.
select # post . /^2010/ timestamp
This would return all posts whose timestamp begins with ”2010“, that is, all posts from the year 2010.
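As an illustration of the intended matching, here is a small Python sketch. The post records and the function name select_matching are made up for the example; units are modelled as plain dictionaries mapping a role to its binary data.

```python
import re

# Made-up sample data: each unit maps a role name to its binary data.
posts = [
    {"category": "science", "timestamp": "2010-03-14 09:26"},
    {"category": "travel", "timestamp": "2009-12-31 23:59"},
    {"category": "science", "timestamp": "2010-11-02 18:00"},
]


def select_matching(units, role, pattern):
    """Return the units whose data under role matches the regular expression."""
    regex = re.compile(pattern)
    return [u for u in units if role in u and regex.search(u[role])]


posts_2010 = select_matching(posts, "timestamp", r"^2010")
```

Using search rather than a full match means the pattern author decides about anchoring, as in the /^2010/ example.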
Compound Selectors
A new compound selector such as selector1 && selector2 could return the intersection between selectors, and the operation || the union. Thus :webpage && =index would return the unit with type ”webpage“ and name ”index“, and :html.tag || ~include would return all children that are html tags or include lines.
Perhaps introduce the short-circuit logical operators & (and) and | (or), so that [v fulltitle | title] expands to the fulltitle or, if the current unit does not have one, to the title.
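The intended semantics can be sketched in Python by treating each selector as a predicate over units. All names below (is_type, has_name, both, either, first_value) and the sample records are hypothetical; a missing value is assumed to be either absent or empty.

```python
# Hypothetical unit records; the field names are assumptions for illustration.
units = [
    {"type": "webpage", "name": "index", "fulltitle": "", "title": "Home"},
    {"type": "webpage", "name": "about", "fulltitle": "About this site", "title": "About"},
    {"type": "image", "name": "index", "fulltitle": "", "title": "Logo"},
]


def is_type(t):
    return lambda u: u["type"] == t


def has_name(n):
    return lambda u: u["name"] == n


def both(a, b):
    # selector1 && selector2: intersection of the two result sets
    return lambda u: a(u) and b(u)


def either(a, b):
    # selector1 || selector2: union of the two result sets
    return lambda u: a(u) or b(u)


def select(units, predicate):
    return [u for u in units if predicate(u)]


def first_value(unit, *roles):
    # [v fulltitle | title]: the first role that has a non-empty value wins
    for role in roles:
        value = unit.get(role)
        if value:
            return value
    return ""
```

For example, select(units, both(is_type("webpage"), has_name("index"))) corresponds to :webpage && =index.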
Functions
Target Generation
Target generation should be integrated inside the source text. A ”target“ unit would map a unit to its output according to its contents.
~target =stdweb {
~rule {
~type webfarm
~out [foreach/ webpage][out][/foreach]
}
~rule {
~type webfarm.webpage
~out [save/ [v filename]][foreach/ ?][out][/foreach][/save]
}
~rule {
~role webfarm.webpage.tag
~out <[u.role]>[v]</[u.role]>
}
}
A parser could allow a compact expression of the same:
[target/ =stdweb]
:webfarm [foreach/ webpage][out][/foreach]
:webfarm.webpage [save/ [v filename]][foreach/ ?][out][/foreach][/save]
~webfarm.webpage.tag <[u.role]>[v]</[u.role]>
[/target]
One could define generic target modules such as ”Html“ and ”LaTeX“ and then bind them to other targets:
~target =stdweb {
~use html
~use cms
}
~target =myweb {
~use stdweb
}
Or equivalently:
[target/ =stdweb html cms]
[...]
When outputting, the use chain is searched in reverse order until a matching rule is found.
An output statement can call a specific target: [out html].
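One possible reading of the use chain lookup, sketched in Python. The data layout, the rule strings and the function name find_rule are assumptions for illustration; in particular, the sketch assumes that a target's own rules are tried first and that the targets named by use are then tried from last to first. Circular use chains are not handled.

```python
# Hypothetical target table: each target has its own rules and a use chain.
targets = {
    "html": {"rules": {"tag": "html:tag", "page": "html:page"}, "use": []},
    "cms": {"rules": {"page": "cms:page"}, "use": []},
    "stdweb": {"rules": {}, "use": ["html", "cms"]},
    "myweb": {"rules": {"footer": "myweb:footer"}, "use": ["stdweb"]},
}


def find_rule(targets, target_name, unit_type):
    """Return the first matching rule, or None if the whole chain has none."""
    target = targets[target_name]
    # Assumption: the rules of the target itself take precedence.
    if unit_type in target["rules"]:
        return target["rules"][unit_type]
    # Then walk the use chain in reverse order, recursing into each target.
    for used in reversed(target["use"]):
        rule = find_rule(targets, used, unit_type)
        if rule is not None:
            return rule
    return None
```

With this ordering a later use line overrides an earlier one, so cms wins over html for the page rule, while tag still falls through to html.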
UString Tags
If
A selector could be introduced:
[if/ <selector>]...
The condition would be applied to the first unit returned by the selector instead of the current unit. The current position of the caller UText object should remain unchanged.
Possibly allow a ! prefix to negate conditions, e.g. !:html meaning: not isType('html').
Design Questions
Shortcut Preprocessing
The preprocessing of ustring that expands shortcuts (e.g. _words_ into [i words]) is poor and inflexible: it can only be deactivated for a whole file through a script, it cannot be deactivated by a tag (only the [code] tag deactivates it, hard-coded in the utext kernel), and when active it often gets in the way.
Selector Behavior
The selector language should be revised and brought to a clean design. Currently it sometimes gives unexpected results. Problem cases should be collected here for future study.
1. The count limit currently applies to the whole result set. It should be possible to apply it to a specific level. Examples:
(1)article.timestamp
article.(1)timestamp
Currently both return only the first timestamp overall. It should be possible to get the first timestamp of each article, as a shortcut for
select article do select (1)timestamp
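The difference between the two behaviours can be shown with a short Python sketch. The article records and both function names are made up for the example.

```python
# Made-up sample data: each article holds a list of timestamps.
articles = [
    {"timestamp": ["2010-01-01", "2010-01-02"]},
    {"timestamp": ["2011-05-05"]},
    {"timestamp": []},
]


def limit_overall(parents, role, limit=1):
    """Current behaviour: the limit applies to the flattened result list."""
    results = [value for p in parents for value in p.get(role, [])]
    return results[:limit]


def limit_per_parent(parents, role, limit=1):
    """Wished behaviour: the limit applies separately within each parent."""
    return [value for p in parents for value in p.get(role, [])[:limit]]
```

The only difference is where the slice is taken: after flattening in the first case, before flattening in the second.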
Odd First Level Units Behavior
The fact that all first level units are children of the unit called unit means that they are available to any other unit as a role, because every other unit's type is a descendant of the type unit. This is welcome for basic functionality such as parse, but is generally unintended. For example, if you define:
^ webpage
~webpage first page
~webpage second page
the interpreter will parse this as:
^ webpage
~webpage first page {
~webpage second page
}
This happens because the role webpage for the second page is found under the type webpage of the first one. As a workaround, you can either add an additional level:
^ website {
^ webpage
}
~website
~webpage first page
~webpage second page
which of course works only as long as you define a single website, or you can enclose each page in curly brackets:
^ webpage
~webpage first page {
}
~webpage second page {
}
I do not know a solution for this odd behavior, which is a consequence of the design.
Mark-Up in UStrings
Now that I have begun generating LaTeX in addition to HTML, some extensibility problems arise.
Currently a < sign is an HTML mark, and to output an angle bracket in HTML one uses the tag [ab].
If, after writing some source files, one wants a LaTeX export, one can similarly introduce a tag [cb] for curly brackets and interpret { as a LaTeX mark. But this has two problems. First, existing source files must be altered, replacing every literal { with [cb]. Second, the source gets bloated with lots of [ab] and [cb] tags.
If a source text generates, say, 10 different mark-up targets, I certainly do not want to litter the whole source text with marks for all of them. Nor do I want to alter the whole source each time I add a mark-up target. I want a general main source without special marks and, where needed, small chunks of specialised code for each target type.
It would be better not to make tags for ”literal“ signs, but to tag the ”mark up“ signs instead. That way < and { would be literal signs, the HTML angle bracket would be [ab] and the LaTeX curly bracket would be [cb]. A general tag [html] could preprocess the content string to allow direct HTML input, and similarly a tag [latex].
Separate Parsing and Semantics
Parsing of particular formats should be strictly separated from using particular text structures. If one needs to do both, one should define a parser tag that feeds a text structure and a second tag that transforms this text structure. This way one can have many parsers for the same structure and apply many transformations to it, adding only one tag for each new format or operation. Otherwise, each new format requires one tag per existing transformation, and each new transformation one tag per existing format.
A counterexample is the current add-in module cms, which is very badly designed. It does not separate the content management functions from the HTML generation, and thus it can only be used for generating web sites. It should generate basic tags instead, such as [i] or [url], and these should be expanded by other modules, such as an html module for web site generation or a latex module for LaTeX generation.
Default Child Inflexibility
The fact that the default child and default binary child are always the first candidate avoids the need for extra syntax to define them (so far there are no metalanguage facilities in UTL at all). But this imposes limitations. Example:
^ html-part {
^ tag :ustring
^ h1 : tag
^ p : tag
}
^ content : html-part
^ subcontent : html-part
You must define tag as the first child, otherwise you cannot instantiate it for p and h1. Therefore, the default child for content and subcontent is tag, which is nonsense, since tag was meant as a sort of ”abstract type“, not to be instantiated in single texts but only when defining the subtypes ”h1“ etc. One can work around the problem this way:
^ html-part {
^ tag : ustring
^ h1 : tag
^ p : tag
}
^ content : html-part {
^ p : html-part.p
}
^ subcontent : html-part {
^ p : html-part.p
}
That works well, although the expression is redundant. However, if you define ”tag“ as:
^ tag {
^ tx : ustring
^ class : string
}
then it will not work, because this UTL expression:
~content
first paragraph
second paragraph
will be parsed incorrectly as:
~content {
~p {
~tx first paragraph
~tx second paragraph
}
}
So far I do not know of any workaround for this using pure UTL means (without Perl programming).
UString is Text
Currently ustrings are strings "with embedded tags". Uncool. That is the common old way of doing things. It should be done the text-oriented way. The ustrings should be parsed by the UTL parser and fed as text structure together with the rest of the units. One would have two possibilities:
- Handle a ustring unit as a whole, i.e. output it as an html string.
- Enter into its structure, query its members. For example a selector retrieving all external links existing in a particular page, or all occurrences of a particular person name in all pages.
This way the output processors are bound to a particular name for a particular unit: [n something] looks up a processor associated with the role name ”n“ for the current type and its ancestors.
See UString as Text.
The same applies to the UText script. The current implementation of script as string is uncool. Scripts should be parsed and fed into the text repository (as transformations?) and invoked at run time.
Parsing Binary Data
To consider: at parse time, binary data would no longer be fed as a ”string“, as it is now; instead, the parser bound to its type would be invoked. There would be a default string-parser that stores the contents literally, and a ustring-parser that recognises embedded tags and invokes their respective parsers.
There can also be a particular parser for say ~paragraph etc.
There can be a parser for integer that stores a 4 or 8 byte binary number instead of its decimal representation as string.
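A minimal Python sketch of such an integer parser. The function names parse_integer and format_integer are hypothetical, and the signed little-endian layout is an assumption; the text above only says that 4 or 8 bytes would be stored.

```python
import struct


def parse_integer(text):
    """Store a decimal string as a 4-byte or, if needed, 8-byte signed integer."""
    value = int(text.strip())
    if -2**31 <= value < 2**31:
        return struct.pack("<i", value)  # 4 bytes, little-endian (assumed)
    if -2**63 <= value < 2**63:
        return struct.pack("<q", value)  # 8 bytes, little-endian (assumed)
    raise ValueError("integer does not fit into 8 bytes")


def format_integer(data):
    """Inverse operation: turn the stored bytes back into a decimal string."""
    fmt = {4: "<i", 8: "<q"}[len(data)]
    return str(struct.unpack(fmt, data)[0])
```

Because the length of the stored data tells the two widths apart, no extra type marker is needed in this sketch.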
There can be a parser for file. One writes:
~file /home/john/photos/2009/05/holiday.jpg
The file parser puts the binary file in the repository. File should not be a binary type but should contain some metadata that the parser could query from the operating system, say:
^file {
^original-filename :string
^format :string
^created :datetime
^updated :datetime
^contents :binary
}
Of course there should be an operation to extract the file to disk. Perhaps additionally through the [save] tag, too.

