Java HTML Parsing [closed]
I'm working on an app that scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for
div class = "classname"
in each line of HTML. This works, but I can't help feeling there is a better solution out there.
Is there a nice way to give a class a line of HTML and get some convenient methods like:
boolean usesClass(String CSSClassname);
String getText();
String getLink();
Asked by: Marcus534 | Posted: 23-01-2022
Answer 1
Another library that might be useful for HTML processing is jsoup. jsoup tries to clean malformed HTML and allows HTML parsing in Java using a jQuery-like selector syntax.
Answered by: Ada810 | Posted: 24-02-2022
Answer 2
The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you have the XML code (XHTML), there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.
Here is some sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all the text content inside them.
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        // HtmlCleaner repairs malformed markup and builds a tree of TagNodes
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List<TagNode> getDivsByClass(String CSSClassname)
    {
        List<TagNode> divList = new ArrayList<TagNode>();
        // true = search the whole tree recursively
        TagNode[] divElements = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            // The class attribute may list several space-separated classes,
            // so compare against each token instead of the whole value
            if (classType != null
                    && Arrays.asList(classType.trim().split("\\s+")).contains(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }
        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));
            List<TagNode> divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
            for (TagNode divElement : divs)
            {
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
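For the tree-based route mentioned above, here is a minimal sketch that uses only the JDK's built-in DOM parser. It assumes the page has already been converted to well-formed XHTML (for example by a cleaner) and is available as a string. The class name XhtmlDivExtractor and the method textOfDivsWithClass are made up for this illustration and are not part of any library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: name and API are assumptions for this sketch.
class XhtmlDivExtractor {

    // Returns the trimmed text of every <div> whose class attribute
    // contains cssClass as one of its space-separated tokens.
    static List<String> textOfDivsWithClass(String xhtml, String cssClass) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            // Thrown, among other cases, when the input is not well-formed XML
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }

        List<String> result = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            // getAttribute returns "" when the attribute is absent
            String[] tokens = div.getAttribute("class").trim().split("\\s+");
            if (Arrays.asList(tokens).contains(cssClass)) {
                // getTextContent concatenates the text of all descendants
                result.add(div.getTextContent().trim());
            }
        }
        return result;
    }
}
```

Calling XhtmlDivExtractor.textOfDivsWithClass(xhtml, "tags") on a document containing a div with class "tags" and another with class "tags featured" returns the text of both, while a div with class "nottags" is skipped because the comparison is done per token.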
Answered by: Richard983 | Posted: 24-02-2022
Answer 3
Several years ago I used JTidy for the same purpose:
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
More information on JTidy can be found on the JTidy SourceForge project page."
Answered by: Charlie623 | Posted: 24-02-2022
Answer 4
You might be interested in TagSoup, a Java HTML parser that can handle malformed HTML. XML parsers only work on well-formed XHTML.
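To see why a plain XML parser is not enough for real-world pages, here is a small sketch using only the JDK's DOM parser. The class WellFormedCheck and the method parsesAsXml are names invented for this example. It simply reports whether a piece of markup can be parsed as XML at all.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: name and API are assumptions for this sketch.
class WellFormedCheck {

    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser stops at the first well-formedness error
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not about the markup itself
            throw new IllegalStateException(e);
        }
    }
}
```

Markup such as <div class="tags">one<br>two</div> is perfectly ordinary HTML, but parsesAsXml returns false for it because the br element is never closed. The same markup with <br/> is accepted. The JDK parser may also print a diagnostic to standard error when it rejects the input.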
Answered by: Robert638 | Posted: 24-02-2022
Answer 5
The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:
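import org.htmlparser.Parser;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.util.NodeList;

// htmlInput is the page to parse; the filter keeps only DIVs with the given class
Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter =
    new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);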
Answered by: Blake294 | Posted: 24-02-2022
Answer 6
Jericho: http://jericho.htmlparser.net/docs/index.html
It is easy to use, supports HTML that is not well formed, and comes with a lot of examples.
Answered by: Blake274 | Posted: 24-02-2022
Answer 7
HtmlUnit might be of help. It does a lot more than parsing, too.
http://htmlunit.sourceforge.net/
Answered by: Sarah413 | Posted: 24-02-2022
Answer 8
Let's not forget Jerry. It's jQuery in Java: a fast and concise Java library that simplifies HTML document parsing, traversing and manipulating, and it supports CSS3 selectors.
Example:
Jerry doc = jerry(html);
doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");
Example:
doc.form("#myform", new JerryFormHandler() {
public void onForm(Jerry form, Map<String, String[]> parameters) {
// process form and parameters
}
});
Of course, these are just some quick examples to give a feel for what it looks like.
Answered by: David332 | Posted: 24-02-2022
Answer 9
The nu.validator project is an excellent, high-performance HTML parser that doesn't cut corners correctness-wise.
The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)
Answered by: Jack655 | Posted: 24-02-2022
Answer 10
You can also use XWiki HTML Cleaner:
It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.
Answered by: Aston644 | Posted: 24-02-2022
Answer 11
If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.
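As a rough sketch of the read-only SAX approach, the handler below collects the text of every div that carries a given CSS class, using only the JDK's SAX parser. The class SaxDivText and its extract method are hypothetical names for this example, and the input is assumed to be well-formed. A div nested inside a matching div is treated as part of the outer one rather than reported separately.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: name and API are assumptions for this sketch.
class SaxDivText extends DefaultHandler {
    private final String cssClass;
    private final List<String> results = new ArrayList<>();
    private StringBuilder current; // non-null while inside a matching div
    private int nestedDivs;        // divs opened inside the matching div

    SaxDivText(String cssClass) {
        this.cssClass = cssClass;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!qName.equals("div")) {
            return;
        }
        if (current != null) {
            nestedDivs++; // remember it so its end tag does not close the outer div
            return;
        }
        String cls = atts.getValue("class");
        if (cls != null && Arrays.asList(cls.trim().split("\\s+")).contains(cssClass)) {
            current = new StringBuilder();
            nestedDivs = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null || !qName.equals("div")) {
            return;
        }
        if (nestedDivs > 0) {
            nestedDivs--;
            return;
        }
        results.add(current.toString().trim());
        current = null;
    }

    // Parses the XHTML string and returns the text of each matching div.
    static List<String> extract(String xhtml, String cssClass) {
        SaxDivText handler = new SaxDivText(cssClass);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return handler.results;
    }
}
```

Because SAX streams through the document without building a tree, this keeps memory use low on large pages, at the cost of having to track state (here, the current buffer and nesting counter) by hand.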
Answered by: Julian766 | Posted: 24-02-2022