How do I programmatically inspect an HTML document?

I have a database full of small HTML documents and I need to programmatically insert several into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must, CSS like <span style="blah"> is a nice-to-have).

Both iText and Aspose work (roughly) along these lines:

Document document = new Document( Size.A4, Aspect.PORTRAIT );

document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );

Therefore (I think) I need some kind of HTML parser whose output I can inspect for strings and styles to insert into my document.

Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.


Asked by: Julian109 | Posted: 23-01-2022






Answer 1

HTMLparser is a good HTML parser.

I have used this to parse HTML on one of my projects.

You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out.

You can parse out CSS using the CssSelectorNodeFilter.

Answered by: Agata727 | Posted: 24-02-2022



Answer 2

If the HTML is "well-formed XML" (XHTML), why not use an XML parser (such as Xerces) and then inspect the DOM tree programmatically?
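
As a rough illustration of that idea, here is a minimal sketch that uses the XML parser bundled with the JDK (javax.xml.parsers) rather than a separately installed Xerces. The class name XhtmlRuns, the Run type and the extractRuns method are made up for this example, and it only tracks bold text via <b> and <strong>; it assumes the input is well-formed and has a single root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of iText, Aspose or Xerces): flattens
// well-formed XHTML into text runs, each tagged with whether it is bold.
class XhtmlRuns {

    /** One piece of text plus the formatting in effect where it appeared. */
    static final class Run {
        final String text;
        final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /** Parses the XHTML string and returns its text runs in document order. */
    static List<Run> extractRuns(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        List<Run> runs = new ArrayList<>();
        walk(doc.getDocumentElement(), false, runs);
        return runs;
    }

    // Depth-first walk; "bold" is inherited from any enclosing <b>/<strong>.
    private static void walk(Node node, boolean bold, List<Run> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            if (!text.isEmpty()) {
                out.add(new Run(text, bold));
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            boolean nowBold = bold
                    || name.equalsIgnoreCase("b")
                    || name.equalsIgnoreCase("strong");
            for (Node child = node.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                walk(child, nowBold, out);
            }
        }
    }
}
```

Each Run could then drive calls like the setBold/insert pair in the question's pseudocode. Extending the walk to read a style attribute on <span> would follow the same pattern, but HTML that is not well-formed XML will make the parser throw, so this only fits the XHTML case.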

Answered by: Dominik903 | Posted: 24-02-2022



Answer 3

Adobe Acrobat Pro allows you to grab sites via HTTP and does an excellent job of preserving the style and layout. I haven't used it from an API aspect, but it may be worth looking into.

Answered by: Anna765 | Posted: 24-02-2022



Answer 4

You'd probably be better off getting a component that goes directly from HTML to PDF or Word than trying to parse the HTML document and duplicate the formatting yourself. If you want to convert HTML to PDF and you use .NET, Winnovative provides a good solution.

Answered by: Stuart782 | Posted: 24-02-2022



Answer 5

Check out the Flying Saucer XHTML renderer: it renders well-formed XHTML files to PDF and lets you control the output using CSS.

Answered by: Alberta574 | Posted: 24-02-2022


