org.cyberneko.html.filters

Class Purifier

Implemented Interfaces:
XMLComponent, XMLDocumentFilter, HTMLComponent

public class Purifier
extends DefaultFilter

This filter purifies the HTML input to ensure XML well-formedness. The purification process includes:
  • ensuring the string "--" does not appear in the content of a comment;
  • ensuring the string "]]>" does not appear in the content of a CDATA section;
  • ensuring that the XML declaration has required pseudo-attributes and that the values are correct; and
  • synthesizing missing namespace bindings.

    Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".
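
As a rough illustration of the two escaping rules described above, the sketch below escapes names and content separately. The class and method names (NameEscapeSketch, escapeName, escapeContent) are hypothetical and not part of NekoHTML. The name check is a simplified ASCII-only stand-in for the full XML name rules, the uppercase hex digits are an assumption, and the content escape is assumed to be a single backslash followed by "u" and the hex digits.

```java
final class NameEscapeSketch {

    // Simplified, ASCII-only approximation of an XML name character check.
    // Assumption: the real filter applies the full XML 1.0 name rules.
    // A colon is treated as illegal here, as it would be in a local part.
    static boolean isNameChar(char c, boolean first) {
        boolean start = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
        if (first) {
            return start;
        }
        return start || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    // XML 1.0 "Char" production, checked per code point.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces each illegal name character with "_u####_".
    static String escapeName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (isNameChar(c, i == 0)) {
                out.append(c);
            } else {
                out.append("_u").append(String.format("%04X", (int) c)).append('_');
            }
        }
        return out.toString();
    }

    // Replaces each illegal content character with a backslash, "u" and hex digits.
    static String escapeContent(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append("\\u").append(String.format("%04X", cp));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeName("my tag"));
        System.out.println(escapeContent("bell" + (char) 0x07 + "here"));
    }
}
```

Using a different marker for names ("_u####_") than for content keeps the escaped name itself a legal XML name, since a backslash is not a name character.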

    In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]" from appearing.
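
The delimiter rule above can be sketched with plain string replacement. This is only an illustration of the documented behaviour, not the filter's actual code; DelimiterGuardSketch and its methods are hypothetical names.

```java
final class DelimiterGuardSketch {

    // Appends a space after every '-' so that "--" cannot occur in a comment.
    static String guardComment(String text) {
        return text.replace("-", "- ");
    }

    // Appends a space after every ']' so that "]]" cannot occur in a CDATA section.
    static String guardCData(String text) {
        return text.replace("]", "] ");
    }

    public static void main(String[] args) {
        System.out.println("<!--" + guardComment("a--b") + "-->");
        System.out.println("<![CDATA[" + guardCData("x]]>y") + "]]>");
    }
}
```

Note that this approach changes the content even when only a single '-' or ']' is present, which is the trade-off for never having to look ahead at the next character.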

    The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
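
One way to generate such URIs is to append a running counter to the fixed prefix, which matches the role the fSynthesizedNamespaceCount field appears to play. The sketch below is an assumption about the scheme, not the filter's source: the class name is hypothetical, and starting the counter at 0 is a guess.

```java
final class SynthesizedNamespaceSketch {

    // Prefix taken from the class description above.
    static final String PREFIX = "http://cyberneko.org/html/ns/synthesized/";

    // Assumption: the counter starts at 0 and is incremented per binding.
    private int count = 0;

    // Returns a URI that has not been handed out by this instance before.
    String nextUri() {
        return PREFIX + (count++);
    }

    public static void main(String[] args) {
        SynthesizedNamespaceSketch ns = new SynthesizedNamespaceSketch();
        System.out.println(ns.nextUri());
        System.out.println(ns.nextUri());
    }
}
```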

    Version:
    $Id: Purifier.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
    Author:
    Andy Clark

    Field Summary

    protected static String
    AUGMENTATIONS
    Include infoset augmentations.
    protected static String
    NAMESPACES
    Namespaces.
    protected static HTMLEventInfo
    SYNTHESIZED_ITEM
    Synthesized event info item.
    static String
    SYNTHESIZED_NAMESPACE_PREFX
    Synthesized namespace binding prefix.
    protected boolean
    fAugmentations
    Augmentations.
    protected boolean
    fInCDATASection
    True if inside a CDATA section.
    protected NamespaceContext
    fNamespaceContext
    Namespace information.
    protected boolean
    fNamespaces
    Namespaces.
    protected String
    fPublicId
    Public identifier of doctype declaration.
    protected boolean
    fSeenDoctype
    True if the doctype declaration was seen.
    protected boolean
    fSeenRootElement
    True if root element was seen.
    protected int
    fSynthesizedNamespaceCount
    Synthesized namespace binding count.
    protected String
    fSystemId
    System identifier of doctype declaration.

    Fields inherited from class org.cyberneko.html.filters.DefaultFilter

    fDocumentHandler, fDocumentSource

    Method Summary

    void
    characters(XMLString text, Augmentations augs)
    Characters.
    void
    comment(XMLString text, Augmentations augs)
    Comment.
    void
    doctypeDecl(String root, String pubid, String sysid, Augmentations augs)
    Doctype declaration.
    void
    emptyElement(QName element, XMLAttributes attrs, Augmentations augs)
    Empty element.
    void
    endCDATA(Augmentations augs)
    End CDATA section.
    void
    endElement(QName element, Augmentations augs)
    End element.
    protected void
    handleStartDocument()
    Handle start document.
    protected void
    handleStartElement(QName element, XMLAttributes attrs)
    Handle start element.
    void
    processingInstruction(String target, XMLString data, Augmentations augs)
    Processing instruction.
    protected String
    purifyName(String name, boolean localpart)
    Purify name.
    protected QName
    purifyQName(QName qname)
    Purify qualified name.
    protected XMLString
    purifyText(XMLString text)
    Purify content.
    void
    reset(XMLComponentManager manager)
    void
    startCDATA(Augmentations augs)
    Start CDATA section.
    void
    startDocument(XMLLocator locator, String encoding, Augmentations augs)
    Start document.
    void
    startDocument(XMLLocator locator, String encoding, NamespaceContext nscontext, Augmentations augs)
    Start document.
    void
    startElement(QName element, XMLAttributes attrs, Augmentations augs)
    Start element.
    protected void
    synthesizeBinding(XMLAttributes attrs, String ns)
    Synthesize namespace binding.
    protected Augmentations
    synthesizedAugs()
    Returns an augmentations object with a synthesized item added.
    protected static String
    toHexString(int c, int padlen)
    Returns a padded hexadecimal string for the given value.
    void
    xmlDecl(String version, String encoding, String standalone, Augmentations augs)
    XML declaration.

    Methods inherited from class org.cyberneko.html.filters.DefaultFilter

    characters, comment, doctypeDecl, emptyElement, endCDATA, endDocument, endElement, endGeneralEntity, endPrefixMapping, getDocumentHandler, getDocumentSource, getFeatureDefault, getPropertyDefault, getRecognizedFeatures, getRecognizedProperties, ignorableWhitespace, merge, processingInstruction, reset, setDocumentHandler, setDocumentSource, setFeature, setProperty, startCDATA, startDocument, startDocument, startElement, startGeneralEntity, startPrefixMapping, textDecl, xmlDecl

    Field Details

    AUGMENTATIONS

    protected static final String AUGMENTATIONS
    Include infoset augmentations.

    NAMESPACES

    protected static final String NAMESPACES
    Namespaces.

    SYNTHESIZED_ITEM

    protected static final HTMLEventInfo SYNTHESIZED_ITEM
    Synthesized event info item.

    SYNTHESIZED_NAMESPACE_PREFX

    public static final String SYNTHESIZED_NAMESPACE_PREFX
    Synthesized namespace binding prefix.

    fAugmentations

    protected boolean fAugmentations
    Augmentations.

    fInCDATASection

    protected boolean fInCDATASection
    True if inside a CDATA section.

    fNamespaceContext

    protected NamespaceContext fNamespaceContext
    Namespace information.

    fNamespaces

    protected boolean fNamespaces
    Namespaces.

    fPublicId

    protected String fPublicId
    Public identifier of doctype declaration.

    fSeenDoctype

    protected boolean fSeenDoctype
    True if the doctype declaration was seen.

    fSeenRootElement

    protected boolean fSeenRootElement
    True if root element was seen.

    fSynthesizedNamespaceCount

    protected int fSynthesizedNamespaceCount
    Synthesized namespace binding count.

    fSystemId

    protected String fSystemId
    System identifier of doctype declaration.

    Method Details

    characters

    public void characters(XMLString text,
                           Augmentations augs)
                throws XNIException
    Characters.
    Overrides:
    characters in class DefaultFilter

    comment

    public void comment(XMLString text,
                        Augmentations augs)
                throws XNIException
    Comment.
    Overrides:
    comment in class DefaultFilter

    doctypeDecl

    public void doctypeDecl(String root,
                            String pubid,
                            String sysid,
                            Augmentations augs)
                throws XNIException
    Doctype declaration.
    Overrides:
    doctypeDecl in class DefaultFilter

    emptyElement

    public void emptyElement(QName element,
                             XMLAttributes attrs,
                             Augmentations augs)
                throws XNIException
    Empty element.
    Overrides:
    emptyElement in class DefaultFilter

    endCDATA

    public void endCDATA(Augmentations augs)
                throws XNIException
    End CDATA section.
    Overrides:
    endCDATA in class DefaultFilter

    endElement

    public void endElement(QName element,
                           Augmentations augs)
                throws XNIException
    End element.
    Overrides:
    endElement in class DefaultFilter

    handleStartDocument

    protected void handleStartDocument()
    Handle start document.

    handleStartElement

    protected void handleStartElement(QName element,
                                      XMLAttributes attrs)
    Handle start element.

    processingInstruction

    public void processingInstruction(String target,
                                      XMLString data,
                                      Augmentations augs)
                throws XNIException
    Processing instruction.
    Overrides:
    processingInstruction in class DefaultFilter

    purifyName

    protected String purifyName(String name,
                                boolean localpart)
    Purify name.

    purifyQName

    protected QName purifyQName(QName qname)
    Purify qualified name.

    purifyText

    protected XMLString purifyText(XMLString text)
    Purify content.

    reset

    public void reset(XMLComponentManager manager)
                throws XMLConfigurationException
    Overrides:
    reset in class DefaultFilter

    startCDATA

    public void startCDATA(Augmentations augs)
                throws XNIException
    Start CDATA section.
    Overrides:
    startCDATA in class DefaultFilter

    startDocument

    public void startDocument(XMLLocator locator,
                              String encoding,
                              Augmentations augs)
                throws XNIException
    Start document.
    Overrides:
    startDocument in class DefaultFilter

    startDocument

    public void startDocument(XMLLocator locator,
                              String encoding,
                              NamespaceContext nscontext,
                              Augmentations augs)
                throws XNIException
    Start document.
    Overrides:
    startDocument in class DefaultFilter

    startElement

    public void startElement(QName element,
                             XMLAttributes attrs,
                             Augmentations augs)
                throws XNIException
    Start element.
    Overrides:
    startElement in class DefaultFilter

    synthesizeBinding

    protected void synthesizeBinding(XMLAttributes attrs,
                                     String ns)
    Synthesize namespace binding.

    synthesizedAugs

    protected final Augmentations synthesizedAugs()
    Returns an augmentations object with a synthesized item added.

    toHexString

    protected static String toHexString(int c,
                                        int padlen)
    Returns a padded hexadecimal string for the given value.
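
A minimal sketch of what such a helper might look like, assuming left-padding with zeros and uppercase digits (neither detail is stated above). HexPadSketch is a hypothetical name; only the method signature mirrors the one documented here.

```java
import java.util.Locale;

final class HexPadSketch {

    // Left-pads the hexadecimal form of c with '0' up to padlen digits.
    // Values that already need more than padlen digits are not truncated.
    static String toHexString(int c, int padlen) {
        StringBuilder sb = new StringBuilder(Integer.toHexString(c));
        while (sb.length() < padlen) {
            sb.insert(0, '0');
        }
        // Assumption: uppercase output.
        return sb.toString().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toHexString(0x20, 4));
        System.out.println(toHexString(0xABCDE, 4));
    }
}
```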

    xmlDecl

    public void xmlDecl(String version,
                        String encoding,
                        String standalone,
                        Augmentations augs)
                throws XNIException
    XML declaration.
    Overrides:
    xmlDecl in class DefaultFilter

    (C) Copyright 2002-2005, Andy Clark. All rights reserved.