November 24, 2005

Java : Word2HTML

Hi all,

Just wondering if anyone knows of any Java code that can automatically convert Word documents into readable HTML, like GroupWise does? It might have some influence on the portal, 'tis all…

Thanks in advance!


- 5 comments by 2 or more people Not publicly viewable

  1. Chris May

    We had a look at this for Sitebuilder; the answer we reached was 'there isn't one'. If you really must have an in-process conversion, the Java bindings for OpenOffice are probably your best bet. There are a few tolerable Word-to-plain-text converters, but if you want to preserve formatting it's much harder.

    Asking users to do 'save as HTML' and upload the result is a bit dodgy too, because of the crappy HTML which Word generates (there are a few 'cleaners' available which improve the situation).

    In the end we shelled out (about £10K I think) for a product called HTML-Transit from a company called Stellent. It's a COM application, for which we wrote an ASP wrapper; our Java code then calls it as a web service. Works OK-ish, but it was a lot of effort to get working.

    24 Nov 2005, 17:31
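For anyone wondering what "our Java code then calls it as a web service" might look like from the Java side, here is a rough sketch only. The thread doesn't say what the ASP wrapper's interface actually was, so everything specific here is an assumption: the endpoint URL, the idea of POSTing the raw document bytes, and the class and method names are all hypothetical. It also uses the `java.net.http` client from Java 11, which is much newer than the setup described above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class ConversionClientSketch {
    // Hypothetical endpoint for the conversion wrapper; not from the original post.
    static final URI ENDPOINT = URI.create("http://converter.example/convert");

    // Build a POST request carrying the Word document as the request body.
    // Assumption: the wrapper accepts raw .doc bytes and replies with HTML.
    static HttpRequest buildRequest(byte[] wordDocument) {
        return HttpRequest.newBuilder(ENDPOINT)
                .header("Content-Type", "application/msword")
                .POST(HttpRequest.BodyPublishers.ofByteArray(wordDocument))
                .build();
    }

    // Send the document to the remote converter and return the response body.
    static String convert(byte[] wordDocument) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(wordDocument), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Conversion failed with HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ConversionClientSketch <file.doc>");
            return;
        }
        System.out.println(convert(Files.readAllBytes(Path.of(args[0]))));
    }
}
```

The point of the split is that the messy part (the COM application) stays on a Windows box, and the Java side only ever sees an HTTP call that takes a document and hands back HTML.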

  2. It'll be better when everyone moves to Office 12 and uses the new XML document format. Then you will be able to do it all with some XSLT! :D

    Btw: COM -> ASP -> Java == ouchy!

    24 Nov 2005, 18:49
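As a rough illustration of the "do it all with some XSLT" idea, the standard `javax.xml.transform` API in the JDK can apply a stylesheet to an XML document and emit HTML. Treat this as a sketch under loud assumptions: the `<doc>`, `<heading>` and `<para>` elements are a made-up, simplified format, not the real Office XML schema (which is namespaced and far more involved), and the class name and stylesheet are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {
    // Hypothetical stylesheet for a made-up document format (NOT the real Office schema).
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/doc'>"
        + "<html><body><xsl:apply-templates/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match='heading'><h1><xsl:value-of select='.'/></h1></xsl:template>"
        + "<xsl:template match='para'><p><xsl:value-of select='.'/></p></xsl:template>"
        + "</xsl:stylesheet>";

    // Apply the stylesheet to an XML string and return the resulting HTML.
    static String toHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrap the checked exception so callers don't need a throws clause.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><heading>Report</heading><para>Hello &amp; welcome</para></doc>";
        System.out.println(toHtml(xml));
    }
}
```

With a real document format, the Java side would stay about this small; nearly all of the effort would move into the stylesheet.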

  3. Chris May

    It'll be better when everyone moves to Office 12 and uses the new XML document format

    I thought I'd read somewhere that Microsoft's OpenDocument implementation was different to everyone else's OpenDocument, though I might be confusing them with Apple (who have definitely not implemented it properly). Still, any decently-documented XML schema would be better than none :-)

    Either way, I predict that it'll be a good while* before Office 12 makes it onto the Warwick managed desktop, given that we've still got people on '97 at the moment. So until then it's ouchy all the way. The things we put up with for users …

    * geologically speaking ;-)

    24 Nov 2005, 19:29

  4. Hey guys, thanks for that :)

    I suspected as much… but thought I'd check, just for certainty.

    Have you had a go at using the OpenOffice Java bindings for this? (If so, was it just not possible to get them to do the job?)

    Thank you so much already :)

    24 Nov 2005, 21:35

  5. Chris May

    We had a play about with OO, but back then (this was 2 years ago) the conversion from Word docs wasn't particularly good, and the app was very leaky if you kept it running. I expect it will have improved a lot since then, but now that we've got a working solution we're reluctant to try.

    There's a Java+Carbon (i.e. the Mac GUI system) port of OpenOffice called NeoOffice/J, which might have some worthwhile libraries in it. I've never looked into that, though.

    Surely DCS will want to write all their own HTML directly, in Emacs, anyway? ;-)

    24 Nov 2005, 22:42
