Improve documentation wrt UTF-8

parent 7882fd78
......@@ -19,6 +19,13 @@ PROJECT RESOURCES
* the mirror at https://gitlab.com/uplex/varnish/libvmod-j for issues,
merge requests and all other interactions.
.. _Höhrmann UTF-8 decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
.. _Exhaustive Test Program: https://git.sr.ht/~slink/hoehrmann-utf8
This project contains, as a submodule, the `Höhrmann UTF-8 decoder`_
as tested by the `Exhaustive Test Program`_.
INTRODUCTION
============
......
......@@ -50,7 +50,7 @@ THE** `WARNING`_.
.. _JSON: https://www.json.org/json-en.html
.. _RFC 8259: https://www.rfc-editor.org/rfc/rfc8259
.. _RFC 8259: https://www.ietf.org/rfc/rfc8259.txt
Formatting `JSON`_ in pure VCL is a PITA, because string processing in
VCL was never made for it. VCL being a Domain Specific Language, it
......@@ -227,6 +227,40 @@ use ``j.array(j.number(1) + j.number(2) + j.number(3))`` and to create
an array of three strings (``["1","2","3"]``) use
``j.array(j.string(1) + j.string(2) + j.string(3))``
UNICODE / UTF-8
===============
.. _Höhrmann: https://git.sr.ht/~slink/hoehrmann-utf8
JSON does not strictly mandate strings to contain valid UTF-8. `RFC
8259`_ section 8.2 reads:
[...] this specification allows member names and string values to
contain bit sequences that cannot encode Unicode characters [...]
however
When all the strings represented in a JSON text are composed
entirely of Unicode characters [...] (however escaped), then that
JSON text is interoperable [...]
For the most part, this module is not concerned with whether or not
strings represent valid UTF-8 or Unicode:
`j.string()`_ with the ``escape=none`` and ``escape=minimal``
(default) options only checks/ensures that strings are properly
escaped and is otherwise transparent with the exception of NUL /
``\0``, which marks the end of the string.
The two exceptions are:
* `j.string()`_ with the ``escape=ascii`` option decodes UTF-8 using
the `Höhrmann`_ decoder, which fails for invalid UTF-8, but only
conducts minimal checks on Unicode code points.
* `j.unquote()`_ fails if the input is not a valid JSON string or if
invalid UTF-8 would be produced.
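To illustrate the behaviour described for ``escape=ascii``, here is a minimal Python sketch. It is not the module's implementation: it relies on Python's built-in strict UTF-8 decoder rather than the Höhrmann decoder, and the helper name ``escape_ascii`` is hypothetical. It only shows the general idea that invalid UTF-8 is rejected and that every non-ASCII character ends up as a ``\uXXXX`` escape.

```python
import json


def escape_ascii(raw: bytes) -> str:
    """Return a JSON string literal that contains only ASCII characters.

    Hypothetical helper for illustration only. Raises UnicodeDecodeError
    if ``raw`` is not valid UTF-8.
    """
    # bytes.decode() is strict by default: overlong encodings, truncated
    # sequences and stray continuation bytes all raise UnicodeDecodeError.
    text = raw.decode("utf-8")
    # ensure_ascii=True escapes every non-ASCII character as \uXXXX and
    # uses a surrogate pair for code points above U+FFFF.
    return json.dumps(text, ensure_ascii=True)


if __name__ == "__main__":
    print(escape_ascii("café".encode("utf-8")))
    print(escape_ascii("\U0001F600".encode("utf-8")))
    try:
        escape_ascii(b"\xc3\x28")  # invalid two-byte sequence
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

How strictly code points are checked may differ between this sketch and the module, which per the text above conducts only minimal checks.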
VMOD INTERFACE REFERENCE
========================
......
......@@ -36,7 +36,7 @@ THE** `WARNING`_.
.. _JSON: https://www.json.org/json-en.html
.. _RFC 8259: https://www.rfc-editor.org/rfc/rfc8259
.. _RFC 8259: https://www.ietf.org/rfc/rfc8259.txt
Formatting `JSON`_ in pure VCL is a PITA, because string processing in
VCL was never made for it. VCL being a Domain Specific Language, it
......@@ -213,6 +213,40 @@ use ``j.array(j.number(1) + j.number(2) + j.number(3))`` and to create
an array of three strings (``["1","2","3"]``) use
``j.array(j.string(1) + j.string(2) + j.string(3))``
UNICODE / UTF-8
===============
.. _Höhrmann: https://git.sr.ht/~slink/hoehrmann-utf8
JSON does not strictly mandate strings to contain valid UTF-8. `RFC
8259`_ section 8.2 reads:
[...] this specification allows member names and string values to
contain bit sequences that cannot encode Unicode characters [...]
however
When all the strings represented in a JSON text are composed
entirely of Unicode characters [...], then that JSON text is
interoperable [...]
For the most part, this module is not concerned with whether or not
strings represent valid UTF-8 or Unicode:
`j.string()`_ with the ``escape=none`` and ``escape=minimal``
(default) options only checks/ensures that strings are properly
escaped and is otherwise transparent with the exception of NUL /
``\0``, which marks the end of the string.
The two exceptions are:
* `j.string()`_ with the ``escape=ascii`` option decodes UTF-8 using
the `Höhrmann`_ decoder, which fails for invalid UTF-8, but only
conducts minimal checks on Unicode code points.
* `j.unquote()`_ fails if the input is not a valid JSON string or if
invalid UTF-8 would be produced.
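The two failure cases named for `j.unquote()`_ can be sketched in Python as follows. This is an assumption-laden illustration, not the module's code: the helper name ``unquote`` is hypothetical, and Python's ``json`` module stands in for the VMOD's own parser. A lone surrogate escape such as ``\ud800`` is a syntactically valid JSON string, but it cannot be represented as valid UTF-8.

```python
import json


def unquote(quoted: str) -> bytes:
    """Decode a JSON string literal into UTF-8 bytes.

    Hypothetical helper for illustration only. Raises ValueError if the
    input is not a JSON string or if the result is not encodable as UTF-8.
    """
    # json.JSONDecodeError is a subclass of ValueError.
    value = json.loads(quoted)
    if not isinstance(value, str):
        raise ValueError("input is valid JSON, but not a JSON string")
    # Encoding a lone surrogate raises UnicodeEncodeError, which is also
    # a subclass of ValueError.
    return value.encode("utf-8")


if __name__ == "__main__":
    print(unquote('"caf\\u00e9"'))
    for bad in ('"unterminated', '42', '"\\ud800"'):
        try:
            unquote(bad)
        except ValueError as err:
            print("rejected:", type(err).__name__)
```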
VMOD INTERFACE REFERENCE
========================
......