Convert Word doc to HTML programmatically in Java
I need to convert a Word document into HTML file(s) in Java. The function will take a Word document as input and produce one HTML file per page: if the Word document has 3 pages, then 3 HTML files will be generated, split at the page breaks.
I searched for open source/non-commercial APIs that can convert doc to HTML, but found nothing. If anybody has done this kind of job before, please help.
Thanks
Asked by: Lenny456 | Posted: 23-01-2022
Answer 1
I recommend JODConverter. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.
JODConverter has a lot of documentation, scripts, and tutorials to help you out.
Answered by: Gianna367 | Posted: 24-02-2022
Answer 2
I've used the following approach successfully in production systems where the new MS Word XML format isn't available:
Spawn a process that does something similar to:
http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html
You'd probably want to start OpenOffice once at startup of your program, and call the Python script as many times as you need while your program runs (with some sort of check to ensure the ooffice process is always there).
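One possible way to do the "is ooffice still there" check, assuming OpenOffice was started with a socket listener, is to try opening a TCP connection to it. This is only a sketch: the class name, the host, and the port you pass in are assumptions that depend on how you launched the office process.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class OfficeHealthCheck {
    // Returns true if something accepts TCP connections on host:port
    // within timeoutMillis. It does not prove the listener is OpenOffice.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: treat the service as down.
            return false;
        }
    }
}
```

If the check fails you could restart the office process before calling the conversion script again.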
The other option is to spawn the following sort of command every time you need to do the conversion:
ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"
I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
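Spawning that command from Java could look roughly like the sketch below. It assumes `ooffice` is on the PATH and that the process exits with code 0 on success (I have not verified the exit code behaviour). The class and method names are made up for illustration, and the macro URL is whatever you would normally pass on the command line.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class OfficeMacroRunner {
    // Builds the argument list for: ooffice -headless "<macroUrl>"
    // macroUrl is the full macro:// URL, supplied by the caller.
    static List<String> buildCommand(String macroUrl) {
        return Arrays.asList("ooffice", "-headless", macroUrl);
    }

    // Runs the conversion. Returns true if the process exited with code 0
    // within the timeout; otherwise kills it and returns false.
    static boolean run(String macroUrl, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(macroUrl))
                .inheritIO() // show ooffice output on our own console
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

Passing the arguments as a list avoids shell quoting problems when the file path in the macro URL contains spaces.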
While there are mechanisms for doing it via MS Word, they're not easy from Java, and do require other support programs to drive MS Word via OLE.
I've used abiword before too, which works well for many documents, but does get confused with more complex documents (ooffice seems to handle everything I've thrown at it). Abiword has a slightly easier command line interface for conversion than ooffice.
Answered by: Dominik629 | Posted: 24-02-2022
Answer 3
We use tm-extractors (http://mvnrepository.com/artifact/org.textmining/tm-extractors), and fall back to the commercial Aspose (http://www.aspose.com/). Both have native Java APIs.
Answered by: Sydney371 | Posted: 24-02-2022
Answer 4
If it's a docx, you could use docx4j (ASL v2). This uses XSLT to create the HTML.
However, it will give you a single HTML for the whole document.
If you wanted an HTML per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).
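As a rough illustration of the lastRenderedPageBreak idea, here is a sketch that uses only the JDK's XML parser (not docx4j) on the word/document.xml part of the docx. It groups paragraph text into pages and treats any paragraph containing the tag as the start of a new page. That is an approximation, since the break can fall in the middle of a paragraph. The class and method names are invented for the example.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PageSplitter {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one list of paragraph texts per page (approximate, see above).
    static List<List<String>> splitIntoPages(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder().parse(documentXml);

        List<List<String>> pages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            boolean startsNewPage =
                p.getElementsByTagNameNS(W_NS, "lastRenderedPageBreak").getLength() > 0;
            if (startsNewPage && !current.isEmpty()) {
                pages.add(current);
                current = new ArrayList<>();
            }
            current.add(paragraphText(p));
        }
        if (!current.isEmpty()) {
            pages.add(current);
        }
        return pages;
    }

    // Concatenates the w:t text runs of one paragraph.
    static String paragraphText(Element p) {
        StringBuilder sb = new StringBuilder();
        NodeList texts = p.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            sb.append(texts.item(i).getTextContent());
        }
        return sb.toString();
    }
}
```

A docx is a zip archive, so the input stream can come from `java.util.zip.ZipFile` by opening the entry named `word/document.xml`. Each page's paragraphs would then be written out as a separate HTML file.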
Answered by: Lydia572 | Posted: 24-02-2022
Answer 5
It is easier to do this with the newer MS Word docx format, because the content is stored as XML. You can use an XSL stylesheet to transform the Word XML into HTML.
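To give an idea of the XSL route, here is a minimal sketch using the JDK's built-in javax.xml.transform. The stylesheet is a toy that I wrote for illustration: it only turns each top-level w:p paragraph into a plain <p> and ignores formatting, tables, and images. A real conversion needs a far larger stylesheet. The input is the content of word/document.xml from inside the docx archive.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class DocxXsltSketch {
    // Toy stylesheet: wraps the output in <html><body> and maps w:p to <p>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
      + " exclude-result-prefixes='w'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/'><html><body>"
      + "<xsl:apply-templates select='//w:body/w:p'/>"
      + "</body></html></xsl:template>"
      + "<xsl:template match='w:p'><p><xsl:value-of select='.'/></p></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given document.xml content.
    static String toHtml(String documentXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter html = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(documentXml)),
            new StreamResult(html));
        return html.toString();
    }
}
```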
If, however, your Word doc is in the older binary format, you can use the POI library (http://poi.apache.org/) to read it into Java objects, and from there convert it to HTML using a Java HTML library such as:
http://www.dom4j.org/dom4j-1.4/apidocs/org/dom4j/io/HTMLWriter.html
Answered by: Richard686 | Posted: 24-02-2022
Answer 6
I see this thread turns up in external links and has the occasional post so I thought I'd post an update (hope no one minds). OpenOffice continues to evolve and release 3.2 improves the word import export filters again. OpenOffice and Java can run on many platforms so Java systems can make use of the OpenOffice UNO API directly to import/manipulate/export documents in many formats (including word and pdf) or use a library like JODReports or Docmosis to facilitate. Both have free/open options.
Answered by: Marcus388 | Posted: 24-02-2022
Answer 7
I tried the approach from http://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML and it worked for me.
It only works with docx. It converts the document to HTML, including the images inside the Word document.
// Imports (package names as in xdocreport 1.x; 2.x uses fr.opensagres.poi.xwpf.converter.*)
import java.io.*;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
// 1) Load DOCX into XWPFDocument
InputStream in = new FileInputStream(new File("c:/document.docx"));
XWPFDocument document = new XWPFDocument(in);
// 2) Prepare XHTML options
XHTMLOptions options = XHTMLOptions.create();
// 3) Extract embedded images into a folder on disk
File imageFolder = new File("target/images/document");
options.setExtractor(new FileImageExtractor(imageFolder));
// 4) Resolve image URIs in the generated HTML against that folder
options.URIResolver(new FileURIResolver(imageFolder));
// 5) Convert, then close the streams
OutputStream out = new FileOutputStream(new File("c:/document.html"));
XHTMLConverter.getInstance().convert(document, out, options);
out.close();
in.close();
I hope this solves your issue.
Answered by: Blake252 | Posted: 24-02-2022
Answer 8
You'd have to find the MS Word doc specification (since the file is basically a binary dump of whatever is in Word at that point in time) and slowly go through it element by element, converting MS Word "objects/states" to their HTML equivalents. You might be able to find a script to do it for you, since this really isn't fun work, and I'd advise against doing it yourself (converting file formats, or even reading commercial file formats on your own, is always hard and often incomplete). PS: just google doc2html.
Answered by: Audrey913 | Posted: 24-02-2022
Answer 9
If you are targeting Word 2007 files using the OOXML format, then this article might help. There is also the Ooxml4j project, which is implementing an OOXML library for Java.
If you are targeting the binary files though... that's another problem.
Answered by: Roland821 | Posted: 24-02-2022
Answer 10
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import officetools.OfficeFile; // package available at www.dancrintea.ro/doc-to-pdf/
...
FileInputStream fis = new FileInputStream(new File("test.doc"));
FileOutputStream fos = new FileOutputStream(new File("test.html"));
OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
f.convert(fos,"html");
fos.close();
fis.close();
All possible conversions:
doc --> pdf, html, txt, rtf
xls --> pdf, html, csv
ppt --> pdf, swf
html --> pdf
Answered by: Joyce796 | Posted: 24-02-2022
Answer 11
You can use Microsoft Office Online.
First, on the server side, request https://view.officeapps.live.com/op/view.aspx?src='your doc file online url'
Then use jsoup to parse the resulting HTML.
When accessed from a mobile device, the HTML will be wrapped in a frame.
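Building the request URL might look like the sketch below. It assumes the document URL has to be URL-encoded before being passed as the src parameter, which is normal practice for query parameters but which I have not confirmed against the viewer's documentation. The class and method names are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class OfficeViewerUrl {
    static final String VIEWER = "https://view.officeapps.live.com/op/view.aspx?src=";

    // publicDocUrl must be reachable from the internet, since the viewer
    // downloads the document itself.
    static String forDocument(String publicDocUrl) {
        try {
            return VIEWER + URLEncoder.encode(publicDocUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

The returned string is the URL you would fetch on the server side and then hand to jsoup.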
Answered by: Marcus403 | Posted: 24-02-2022