ICFP 2001 Programming Contest
*****************************
Challenge Task
**************
version 6
*********
Damien Doligez, Luc Maranget, Pierre Weis
==========================================
1 The SML/NG markup language
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
The W4C (World Wide Wireless Web Consortium) has just published the
specification of SML/NG (Simple Markup Language -- New Generation), a
simplified version of XXHTML designed for the new generation of
hypertext rendering micro-devices, running on hardware with reduced
computational capacity such as wristtop computers, thumbnail-worn PDAs,
and internet-enabled ice boxes.The programming task is to design and
implement an optimiser for SML/NG that will simplify the source
documents and reduce their size.
2 Description of SML/NG
*=*=*=*=*=*=*=*=*=*=*=*=*
A document is composed of text and tags. A tag is a sequence of
characters (the tag's name) between `<' and `>'. Anything else in a
document is called text. For example,
<<
foobar
>>
is the text `foo' followed by tag `', text `bar', and tag
`'.The characters `<' and `>' can only appear in a document as part
of a tag (but of course the strings `<' and `>' can appear in the
text).
2.1 Tags
==========
A tag whose name begins with the character `/' is a closing tag. The
corresponding open tag has the same name without the leading `/'. For
instance, `' is the closing tag corresponding to the open tag
`'.All tags in a document appear in pairs, composed of an open tag
and the corresponding closing tag (in this order). The region of the
document between two matching tags (including the tags) is called the
range of this pair. For example, in `foobar', the range of the
`', `' pair is `bar'.Tags are properly nested: for any two
ranges, either they are disjoint or one is entirely included in the
other.Each tag changes the attributes of text within its range as
described below:
B (bold) set the B attribute
EM (emphasis) invert the EM attribute
I (italic) set the I attribute
PL (plain) reset the U level to 0 and unset the B, EM, I, S, and TT
attributes
S (strong emphasis) set the S attribute
TT (typewriter) set the TT attribute
U (underline) increment the U level (but not above 3)
0...9 (size) set the size to 0...9
r (color) set the color to red
g set the color to green
b set the color to blue
c set the color to cyan
m set the color to magenta
y set the color to yellow
k set the color to black
w set the color to white
Note: case is significant in tag names.There is one additional
interaction between the attributes: S hides the EM state (i.e. where the
S attribute is set, the EM attribute is irrelevant).
2.2 White space
=================
There are 4 non-printing characters: SPC (ASCII code 0x20), CR (ASCII
code 0x0D), LF (ASCII code 0x0A), and TAB (ASCII code 0x09). Any
sequence of these characters is called white space and is equivalent to
one SPC character, except where the TT attribute is set (there are other
conditions that prevent spaces from being collapsed, see section 3). In
the parts of the document where TT is set, the whitespace characters are
significant and must be preserved exactly. A document contains only
these four characters and printable ASCII characters (codes 0x21 to
0x7E, included).There is another interaction between whitespace and
flags: the EM, I, B, and S attributes are irrelevant for whitespace;
moreover, color is irrelevant where U = 0.
2.3 BNF grammar
=================
The BNF grammar of documents is given below.
document ::= document document
| textchar *
| `' document `'
| `' document `'
| `' document `'
| `' document `'
| `' document `'
| `' document `'
| `' document `'
| `<0>' document `0>'
| `<1>' document `1>'
| `<2>' document `2>'
| `<3>' document `3>'
| `<4>' document `4>'
| `<5>' document `5>'
| `<6>' document `6>'
| `<7>' document `7>'
| `<8>' document `8>'
| `<9>' document `9>'
| `' document `'
| `' document `'
| `' document `'
| `' document `'
| `' document `'
| `' document `'
| `' document `'
| `' document `'
textchar ::= any printable character except `<' and `>'
| CR | LF | TAB | SPC
3 Specification
*=*=*=*=*=*=*=*=*
This section defines the meaning of a document. Two documents are said
to be equivalent if they have the same meaning.The meaning of a document
is a sequence of decorated characters; a decorated character is a
character associated with a property record. A property record has 8
fields:
b a boolean
em a boolean
i a boolean
s a boolean
tt a boolean
u an integer between 0 and 3 (included)
size an integer between 0 and 9 (included)
color an element of {r, g, b, c, m, y, k, w}
To compute the meaning of a document given a current context (a
context is a property record), consider the document as a sequence of
characters and tags. Treat each element of this sequence in turn as
follows:
- if it is a printing character, output it, decorated by the current
context
- if it is a whitespace character, compute the current space context,
which is the current context modified in the following way: the s,
em, i, and b flags are unset, and if u is 0 then color is set to w.
If the tt flag is true in the current context, then output the
input character decorated by the current space context; else if the
previous output character was a SPC decorated with the same space
context then do nothing; otherwise, output a SPC decorated with the
current space context.
- if it is an open tag, save the current context and change it
according to the tag name as follows:
B set the b flag
EM invert the em flag if the s flag is not set; otherwise do
nothing
I set the i flag
PL unset the s, em, i, b, and tt flags, and set u = 0.
S set the s flag and unset the em flag
TT set the tt flag
U if u is less than 3, increment it; otherwise do nothing
0...9 set size accordingly
r, g, b, etc. set color accordingly
- if it is a closing tag, restore the context that was saved at the
corresponding open tag
The meaning of the document is the sequence of decorated characters
output by the above algorithm.A root context is any context with u = 0
and b = em = i = s = tt = false (size and color can have any value). Two
documents are equivalent if they have the same meaning in every possible
root context (i.e. for all values of size and color).You can use the
on-line document checker and equivalence tester at
Examples
========
For example, the following pairs of documents are equivalent:
<<
xxx yyy
xxx yyy
xxx yyy zzz
xxx yyy zzz
xxx yyy
xxx yyy
xxx yyy
xxx yyy
xxx yyy
xxx yyy
xxx yyy zzz
xxx yyy zzz
>>
The following pairs of documents are not equivalent:
<<
xxx yyy
xxx yyy
>>
Reason: multiple spaces are significant within `TT'.
<<
xxx yyy zzz
xxx yyy zzz
>>
Reason: `yyy' is both in italics and in bold in the first document but
only in italics in the second one.
<<
xxx yyy
xxx yyy
>>
Reason: `yyy' is underlined twice in the first document but only once
in the second one.
<<
xxx yyy
xxx yyy
>>
Reason: the first document has three underlined spaces between `xxx'
and `yyy' because the middle one is in blue; the second document has
only one red underlined space at that point.
4 The task
*=*=*=*=*=*=
You must write a program to optimise SML/NG documents. Your program
will be given a correct SML/NG document on its standard input, and it
must output (on stdout) an equivalent document that is as small as
possible. The size of a document is simply defined as its length in
bytes.For example, opportunities for optimisation include the following:
- whitespace compression
replacing a white space sequence with a single space or newline
- redundancy elimination
changing foo
into foo
- overlap inversion
changing bla blafoo bar truc
into bla blafoo bar truc
- PL shortcut
changing foo bar gee
into foo bar gee
- whitespace simplification
changing bla bla barfoo
into bla bla barfoo
- EM elimination
changing foo bar foo .
into foo bar foo .
- color nesting
changing aaabbbcccdddeee
into aaabbbcccdddeee
There will also be a limitation on the amount of time that your
program may use to do its work. The time limit will depend on the input
file and it will be given to your program as a number of seconds, both
in its first command-line argument and in the TLIMIT environment
variable. The limit will never be less that 180 (i.e. 3 minutes).Note
that the limit is real time (a.k.a wall-clock time), not CPU time.
5 Judgement criteria
*=*=*=*=*=*=*=*=*=*=*=
Your programs will be ranked according to the following criteria:
1. Correctness. Every program that crashes or gives the wrong result
(i.e. some output that is not equivalent to the input) on any one of
our test inputs will be mercilessly eliminated from the competition.
2. Stupidity. Every program that gives a result bigger than the input
on any of the inputs will also be eliminated.
3. Output size. The remaining programs will be ranked according to the
size of their outputs on a well-chosen set of inputs. The inputs will
be generated using a variety of techniques such as hand-writing,
translation from HTML, random generation. These inputs can be big (up
to five megabytes).
4. Speed of optimisation. If the top programs are very close to each
other using the previous criterion, we will use speed as a tie
breaker.
6 Online stuff
*=*=*=*=*=*=*=*=
The following Web pages may be of interest to you:
- Contest home page:
- FAQ:
- News:
- Document checker and equivalence tester:
- Procedure for submitting entries:
7 Good luck
*=*=*=*=*=*=*
Have fun !
-----------------------------------------------------------------------
This document was translated from LaTeX by HeVeA
(http://pauillac.inria.fr/~maranget/hevea/index.html).