On Text Structure
Last update on Sat Apr 23, 2011.
Text Integrity
It should be possible to define integrity rules that apply to all instances of a type.
For example:
^website {
^(1) title : ustring
^(1-)stylesheet : ustring
}
Each website must have exactly one title and one or more stylesheets. Compiling should abort if a website is defined that does not fulfill this rule.
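A minimal sketch of how such a rule could be checked, written in Python rather than in the notation above. The `Cardinality` class, the `WEBSITE_RULES` table and the `check_cardinality` function are hypothetical names introduced only for illustration; nothing here is part of UText itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cardinality:
    """Allowed number of members with a given name; hi=None means unbounded."""
    lo: int
    hi: Optional[int]


# Hypothetical encoding of the rule above: (1) title, (1-) stylesheet.
WEBSITE_RULES = {
    "title": Cardinality(1, 1),
    "stylesheet": Cardinality(1, None),
}


def check_cardinality(members, rules):
    """Return a list of violation messages; an empty list means the instance is valid."""
    counts = Counter(members)
    errors = []
    for name, card in rules.items():
        n = counts[name]
        too_few = n < card.lo
        too_many = card.hi is not None and n > card.hi
        if too_few or too_many:
            upper = "inf" if card.hi is None else card.hi
            errors.append(f"{name}: found {n}, expected {card.lo}..{upper}")
    return errors
```

A compiler built along these lines would abort whenever the returned list is not empty.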
Such integrity validation requires a sort of "transaction" capability. (An implicit transaction that commits when leaving each level, or at the end of each OS file read, would be too inflexible.) One can commit a transaction only if all validation succeeds. Only committed data can be read.
The possibility of requiring a constant minimum and maximum number of instances would surely be useful, but validation should be generalized, perhaps as a post-load trigger that gets called when committing a unit. One binds a unit to a post-load transformation, which is called by the system and has the chance to return an "abort" signal that causes the commit to fail.
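The commit rule can be sketched as follows, assuming a simple in-memory store. `Store`, `CommitAborted` and the convention that a validator returns an error string or `None` are assumptions made for this example, not a description of an existing implementation.

```python
class CommitAborted(Exception):
    """Raised when a post-load validator rejects the pending data."""


class Store:
    def __init__(self):
        self._committed = {}   # visible to readers
        self._pending = {}     # staged by the open transaction
        self._validators = []  # post-load callbacks: unit -> error string or None

    def bind_validator(self, fn):
        self._validators.append(fn)

    def enter(self, name, unit):
        # Entered data is staged, not yet readable.
        self._pending[name] = unit

    def read(self, name):
        # Only committed data can be read.
        return self._committed.get(name)

    def commit(self):
        # Run every validator on every staged unit before publishing anything.
        for name, unit in self._pending.items():
            for validate in self._validators:
                error = validate(unit)
                if error is not None:
                    # Pending data is left untouched so the caller can fix it.
                    raise CommitAborted(f"{name}: {error}")
        self._committed.update(self._pending)
        self._pending.clear()
```

Because validators are plain callables, a fixed minimum and maximum count becomes just one validator among others.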
Text Triggers
Text transformations can be bound to a text trigger, so that they get called when a particular event occurs. For example:
Load-Trigger. Fired when a unit is requested to be loaded. It is responsible for parsing source files, or for obtaining the corresponding text in some other way, and entering it.
Preprocess-Trigger. Fired when reading a unit to preprocess the UTL before parsing it.
Postload-Trigger. Fired after loading a unit, for post-processing and validation purposes. It can abort the load by returning an error signal.
Output-Trigger. Responsible for conversion, for example generating HTML pages, OpenOffice documents or LaTeX files.
Perhaps output is not a trigger at all; compare it to a parser. Or is a parser a trigger, too? What about a "cast" operation that transforms a unit from one type into another?
The triggers are bound to a type and get called by the system when the event occurs on any instance of it.
Maybe it should also be possible to bind triggers to single units.
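One way to picture this binding is a small registry keyed by event and type, with an optional second table keyed by event and unit. The class `TriggerRegistry`, the event names in `EVENTS` and the representation of a unit as a dict with `name` and `type` keys are all assumptions of this sketch.

```python
from collections import defaultdict

# Event names taken from the list above; the lowercase spelling is an assumption.
EVENTS = ("load", "preprocess", "postload", "output")


class TriggerRegistry:
    def __init__(self):
        self._by_type = defaultdict(list)  # (event, type name) -> callbacks
        self._by_unit = defaultdict(list)  # (event, unit name) -> callbacks

    def _check(self, event):
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event!r}")

    def bind_type(self, event, type_name, fn):
        """Bind a transformation to every instance of a type."""
        self._check(event)
        self._by_type[(event, type_name)].append(fn)

    def bind_unit(self, event, unit_name, fn):
        """Bind a transformation to one single unit."""
        self._check(event)
        self._by_unit[(event, unit_name)].append(fn)

    def fire(self, event, unit):
        """Call type-level triggers first, then unit-level ones; return their results."""
        self._check(event)
        callbacks = (self._by_type.get((event, unit["type"]), [])
                     + self._by_unit.get((event, unit["name"]), []))
        return [fn(unit) for fn in callbacks]
```

Calling type-level triggers before unit-level ones is an arbitrary choice here; the opposite order would be just as defensible.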
Text Query
Let us now think about what semantics a query language should have, independently of how its expressions are written down.
One can define conditions on single levels and relate them to each other. Conditions on single levels:
- The text unit is a particular text unit: the condition either names a single unit directly or requires the unit to have a particular name.
- The text unit has a particular role: either the role must have a particular name or it must be a particular role.
- The text unit has a particular type: either the type must have a particular name or it must be a particular type.
- The text unit has particular binary data: this depends upon the binary data type. For example, strings can be matched completely or partially, or against a pattern or a regular expression, whereas numbers can be matched exactly or tested for being less than or greater than a given value.
Conditions relating levels:
- Two separately evaluated units are positioned relative to each other: at the same level, as parent and child, and so on. In general there are N levels in between, with N ranging from 0 to infinity (infinity meaning that one must be below the other, no matter how many levels apart).
- The conditions on two units are related by a quantifier: for each unit matching the first condition, either there exists a unit matching the second one, or all units match the second one.
Each condition can be negated.
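These semantics can be sketched with plain predicates, independently of any query syntax. The `Unit` class and all function names below are hypothetical, and the two quantifiers are restricted to descendants (any number of levels in between) to keep the example short.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(eq=False)  # units are compared by identity, not by field values
class Unit:
    name: str
    role: str
    type: str
    data: object = None
    parent: Optional["Unit"] = None


Condition = Callable[[Unit], bool]


# Conditions on single levels.
def has_name(name): return lambda u: u.name == name
def has_role(role): return lambda u: u.role == role
def has_type(type_name): return lambda u: u.type == type_name
def data_matches(pattern):
    return lambda u: isinstance(u.data, str) and re.search(pattern, u.data) is not None
def negate(cond): return lambda u: not cond(u)


def levels_between(ancestor, descendant):
    """Number of levels strictly between the two units (0 = parent and child),
    or None if `descendant` is not below `ancestor`."""
    n = 0
    node = descendant.parent
    while node is not None:
        if node is ancestor:
            return n
        node = node.parent
        n += 1
    return None


def exists_below(units, first, second):
    """Units matching `first` that have at least one descendant matching `second`."""
    return [a for a in units if first(a)
            and any(second(d) for d in units if levels_between(a, d) is not None)]


def all_below(units, first, second):
    """Units matching `first` whose descendants all match `second`
    (trivially true for units without descendants)."""
    return [a for a in units if first(a)
            and all(second(d) for d in units if levels_between(a, d) is not None)]
```

A bound on N could be added by comparing the result of `levels_between` with a minimum and a maximum instead of only testing it against `None`.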
Text Formula
Update March 8, 2010. After trying to implement text with multiple parents, roles and types (see below), I now think this is not the right way. Not only is it difficult to implement, but the implemented model is ugly: there are "units" on the one side and "relationships" between them on the other. I think it is cleaner to have just a single rule for defining units as a four-way relationship, without exceptions. The question below no longer seems open to me, but I leave it here for future reconsideration.
In UText/1 each text unit has exactly one parent, one role and one type. I think the real text structure is more general: a single unit can participate in a text more than once, each time having one parent, one role and one type.
Isn't it somehow obvious? We can establish relationships between symbols; each time, we say something about one of them, but we can make many sentences with each symbol: we say this and we say that. (But perhaps this is incorrect; perhaps each text unit should be unambiguous by nature, and we should instead explicitly build compound units, each component being a partial assertion.)
The implementation of the text structure is of course not the problem. Instead of an array of 3 scalars (parent, role, type), one needs an array of 4 scalars (unit, parent, role, type). The same unit can then occur more than once in this array structure.
But I cannot foresee whether this is going to work well or whether difficulties will arise. What about navigating the structure? What about selectors? Perhaps one should just implement the structure in a mock-up and see.
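A mock-up of the four-scalar array could be as small as this. `Occurrence` and the three helper functions are names invented for the sketch, and the unit, parent, role and type names are placeholders.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    """One participation of a unit in the text: the array of 4 scalars."""
    unit: str
    parent: str
    role: str
    type: str


def parents_of(rows, unit):
    """Navigate upwards: a unit may now have several parents."""
    return [r.parent for r in rows if r.unit == unit]


def children_of(rows, parent):
    """Navigate downwards, as in the single-parent model."""
    return [r.unit for r in rows if r.parent == parent]


def select(rows, **fields):
    """A very simple selector: keep rows whose named fields equal the given values."""
    return [r for r in rows if all(getattr(r, k) == v for k, v in fields.items())]


# The same unit "u" occurs twice, under two parents and in two roles.
rows = [
    Occurrence(unit="u", parent="p", role="r", type="t"),
    Occurrence(unit="u", parent="q", role="s", type="t"),
]
```

Navigation upwards now returns a list instead of a single parent, which is where selectors designed for the one-parent model would most likely need rethinking.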
Single Type, Multiple Parents and Roles
It probably makes sense for each unit to have a single type, which describes it internally. The unit can then participate as a child of many parents, playing a role each time.
=p {
=u ~r :t
}
=q {
=p.u ~s :t
}
With this notation one must either always repeat the type or give it only once; but giving it once makes the result depend upon feed order, which is not so good.
Perhaps one needs a notation that, instead of fixing the parent and enumerating its children, fixes the unit and enumerates all of its parents.
=u :t |{
=p ~r
=q ~s
}
Note that the role is here the child's role, not the parent's!
Or in a single line:
=u :t |=p ~r
=u :t |=q ~s
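The single-type rule can be checked mechanically over the single-line form. The sketch below assumes that tokens are separated by whitespace and that the prefixes `=`, `:`, `|=` and `~` mean unit, type, parent and role, as the examples suggest; `parse_line` and `check_single_type` are hypothetical helpers, not part of any existing parser.

```python
def parse_line(line):
    """Parse a line such as '=u :t |=p ~r' into a dict with unit, type, parent, role."""
    # '|=' is listed before '=' so that the longer prefix is tried first.
    prefixes = {"|=": "parent", "=": "unit", ":": "type", "~": "role"}
    fields = {}
    for token in line.split():
        for prefix, key in prefixes.items():
            if token.startswith(prefix):
                fields[key] = token[len(prefix):]
                break
        else:
            raise ValueError(f"unrecognized token: {token!r}")
    return fields


def check_single_type(records):
    """Return the units that were declared with more than one type."""
    types = {}
    for rec in records:
        types.setdefault(rec["unit"], set()).add(rec["type"])
    return {unit: sorted(ts) for unit, ts in types.items() if len(ts) > 1}
```

Because the check collects all types per unit before reporting, its result does not depend upon the order in which the lines are fed.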
Multiple Parents, Roles and Types
The general case.
=u :t |=p ~r
=u :v |=q ~s

