How to ignore whitespace while reading a file to produce an XML DOM

I'm trying to read a file and build a DOM Document from it. The file contains whitespace and newlines that I want the parser to ignore, but this doesn't do it:

DocumentBuilderFactory docfactory = DocumentBuilderFactory.newInstance();
docfactory.setIgnoringElementContentWhitespace(true);

The Javadoc says that setIgnoringElementContentWhitespace only takes effect when the validating flag is enabled, but I don't have a DTD or XML Schema for the document.

What can I do?

Update

I don't like the idea of adding <!ELEMENT ...> declarations myself. I tried the solution proposed in the forum thread that Tomalak pointed to, but it doesn't work for me; I'm using Java 1.6 on Linux. If nothing better is proposed, I'll write a few methods of my own to skip whitespace-only text nodes.


Asked by: Anna412 | Posted: 23-01-2022






Answer 1

‘IgnoringElementContentWhitespace’ is not about removing all pure-whitespace text nodes, only whitespace nodes whose parents are described in the schema as having ELEMENT content — that is to say, they only contain other elements and never text.

If you don't have a schema (DTD or XSD) in use, element content defaults to MIXED, so this parameter will never have any effect. (Unless the parser provides a non-standard DOM extension to treat all unknown elements as containing ELEMENT content, which as far as I know the ones available for Java do not.)

You could hack the document on the way into the parser to include the schema information, for example by adding an internal subset to the <!DOCTYPE ... [...]> declaration containing <!ELEMENT ...> declarations, then use the IgnoringElementContentWhitespace parameter.
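
As a rough sketch of the internal-subset route, assuming a made-up document (the element names people, person and name and the class DtdWhitespaceDemo are hypothetical, not from the question): the DOCTYPE declares which elements have element content, and the factory is put into validating mode because the Javadoc ties the whitespace flag to it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DtdWhitespaceDemo {

    // Hypothetical input: the internal subset says <people> and <person>
    // contain only elements, while <name> contains text.
    static final String XML =
          "<!DOCTYPE people [\n"
        + "  <!ELEMENT people (person*)>\n"
        + "  <!ELEMENT person (name)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "]>\n"
        + "<people>\n"
        + "  <person>\n"
        + "    <name>Ann</name>\n"
        + "  </person>\n"
        + "</people>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The whitespace flag is documented to rely on validating mode.
        factory.setValidating(true);
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        // Number of child nodes directly under <people>.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

The obvious cost is that every element in the real document needs a declaration, which is why this tends to be practical only for small, fixed vocabularies.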

Or, possibly easier, you could just strip out the whitespace nodes, either in a post-process, or as they come in using an LSParserFilter.
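
The post-process variant could look roughly like this. It is only a sketch: the class name WhitespaceStripper and its methods are made up for illustration, and "whitespace node" is taken to mean a text node whose trimmed value is empty.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitespaceStripper {

    /** Recursively removes text nodes that contain only whitespace. */
    static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first; removing 'child' detaches it.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    /** Parses without any schema, then strips the whitespace-only nodes. */
    static Document parseAndStrip(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        strip(doc);
        return doc;
    }
}
```

Be aware that this also drops whitespace-only text in genuinely mixed content, such as the single space between two inline elements, so it is only safe when that whitespace carries no meaning.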

Answered by: Audrey420 | Posted: 24-02-2022



Answer 2

This is a (really) late answer, but here is how I solved it. I wrote my own implementation of the NodeList interface that simply leaves out text nodes containing only whitespace. Code follows:

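// Needs java.util.ArrayList, java.util.Iterator, java.util.List,
// org.w3c.dom.Node and org.w3c.dom.NodeList.
private static class NdLst implements NodeList, Iterable<Node> {

    private final List<Node> nodes;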

    public NdLst(NodeList list) {
        nodes = new ArrayList<Node>();
        for (int i = 0; i < list.getLength(); i++) {
            if (!isWhitespaceNode(list.item(i))) {
                nodes.add(list.item(i));
            }
        }
    }

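    @Override
    public Node item(int index) {
        // The NodeList contract returns null for an out-of-range index.
        return (index >= 0 && index < nodes.size()) ? nodes.get(index) : null;
    }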

    @Override
    public int getLength() {
        return nodes.size();
    }

    private static boolean isWhitespaceNode(Node n) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            String val = n.getNodeValue();
            return val.trim().length() == 0;
        } else {
            return false;
        }
    }

    @Override
    public Iterator<Node> iterator() {
        return nodes.iterator();
    }
}

You then wrap all of your NodeLists in this class and it will effectively ignore all whitespace nodes. (Which I define as Text Nodes with 0-length trimmed text.)

It also has the added benefit of being able to be used in a for-each loop.

Answered by: Thomas781 | Posted: 24-02-2022



Answer 3

I got it to work by doing this:

// 'schema' is a javax.xml.validation.Schema and 'element' an org.w3c.dom.Element;
// both are created elsewhere (not shown).
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dbFactory.setIgnoringElementContentWhitespace(true);
dbFactory.setSchema(schema);
dbFactory.setNamespaceAware(true);
NodeList nodeList = element.getElementsByTagNameNS("*", "associate");

Answered by: Wilson447 | Posted: 24-02-2022



Answer 4

I ended up following @bobince's idea of using an LSParserFilter. Yes, the interface is documented at https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/ls/LSParserFilter.html but it's very hard to find good example/explanation material. After considerable searching I located DOM Level 3 Load and Save XML Reference Guide at http://www.informit.com/articles/article.aspx?p=31297&seqNum=29 (Nicholas Chase, Mar 14, 2003). That helped me considerably. Here are portions of my code, which does an XML diff with org.custommonkey.xmlunit. (This is a tool written on my own time to help me with paid work, so I have left a lot of things, like better exception handling, for when things are slow.)

I especially like the use of an LSParserFilter because, for my purpose, I will likely add an option in the future to ignore id attributes too, which should be an easy enhancement with this framework.

// A small portion of my main class.
// Other imports may be necessary...
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;

Document controlDoc = null;
Document testDoc = null;
try {
    // Selects the Apache Xerces implementation; Xerces must be on the classpath.
    System.setProperty(DOMImplementationRegistry.PROPERTY, "org.apache.xerces.dom.DOMImplementationSourceImpl");
    DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
    DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
    LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
    // Nodes are passed through the filter as the parser builds the document.
    LSParserFilter filter = new InputFilter();
    builder.setFilter(filter);
    controlDoc = builder.parseURI(files[0].getPath());
    testDoc = builder.parseURI(files[1].getPath());
} catch (Exception exc) {
    System.out.println(exc.getMessage());
}

//--------------------------------------

import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.NodeFilter;

public class InputFilter implements LSParserFilter {

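    public short acceptNode(Node node) {
        // Use the LSParserFilter constants here; NodeFilter happens to
        // define the same numeric values, but this is the documented set.
        if (Utils.isNewline(node)) {
            return LSParserFilter.FILTER_REJECT;
        }
        return LSParserFilter.FILTER_ACCEPT;
    }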

    public int getWhatToShow() {
        return NodeFilter.SHOW_ALL;
    }

    public short startElement(Element elem) {
        return LSParserFilter.FILTER_ACCEPT;
    }

}

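//-------------------------------------
// From my Utils.java:

    // Matches only text nodes that are exactly one newline; whitespace runs
    // such as indentation are not caught by this check.
    public static boolean isNewline(Node node) {
        return (node.getNodeType() == Node.TEXT_NODE) && node.getTextContent().equals("\n");
    }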
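
For anyone who wants to try the filter idea without the Xerces property, here is a more self-contained sketch. It assumes the DOM implementation that DOMImplementationRegistry finds by default in the JDK supports the "LS" feature, and it rejects any whitespace-only text node rather than just a lone newline. The names FilteringParseDemo and WhitespaceFilter are made up for this example.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

class FilteringParseDemo {

    /** Rejects text nodes that consist solely of whitespace. */
    static class WhitespaceFilter implements LSParserFilter {
        @Override
        public short startElement(Element element) {
            return FILTER_ACCEPT;
        }

        @Override
        public short acceptNode(Node node) {
            // The type check comes first: elements have a null node value.
            boolean blank = node.getNodeType() == Node.TEXT_NODE
                    && node.getNodeValue().trim().isEmpty();
            return blank ? FILTER_REJECT : FILTER_ACCEPT;
        }

        @Override
        public int getWhatToShow() {
            return NodeFilter.SHOW_ALL;
        }
    }

    /** Parses an XML string with the whitespace filter installed. */
    static Document parse(String xml) throws Exception {
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS impl =
                (DOMImplementationLS) registry.getDOMImplementation("LS");
        LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        parser.setFilter(new WhitespaceFilter());
        LSInput input = impl.createLSInput();
        input.setStringData(xml);
        return parser.parse(input);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<root>\n  <a>x</a>\n  <b/>\n</root>");
        // Number of child nodes left directly under <root>.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Because the filter runs while the tree is being built, no second pass over the document is needed, and extra rules (like the id-attribute idea above) would go into the same class.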

Answered by: Ned170 | Posted: 24-02-2022



Answer 5

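Try stripping the whitespace between tags from the string before parsing it. Note that the regex works on the raw text, so it also collapses whitespace between a '>' and a '<' inside CDATA sections and comments:

import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

private static Document prepareXML(String param) throws ParserConfigurationException, SAXException, IOException {
    // Remove whitespace-only runs between a '>' and the following '<'.
    param = param.replaceAll(">\\s+<", "><").trim();
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    // Has no effect without a DTD or schema; the regex above does the work.
    factory.setIgnoringElementContentWhitespace(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource in = new InputSource(new StringReader(param));
    return builder.parse(in);
}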

Answered by: Walter348 | Posted: 24-02-2022


