Re: PDF to XML conversion

Subject: Re: PDF to XML conversion
From: Michael Smith <smith -at- io -dot- com>
To: techwr-l -at- lists -dot- raycomm -dot- com
Date: Sat, 8 Jul 2000 23:30:33 -0500

On Friday, July 07, 2000, Karen Field wrote:

> Anyone know anything about converting PDF docs to XML? Is this
> even possible? Is it possible to convert Word docs to XML?

Both are possible, but depending on the nature and number of
documents you need to convert, you may find that it's something
for which you'll want to get some consulting/outsourcing help.

One company I know of that has made this a specific focus of their
XML services offerings is Texterity. You may want to take a look
at their TextCafe site <http://www.textcafe.com>.

They have a simple form you can use to submit/upload a file you'd
like to convert to XML. Once they've taken a look at the file,
they'll give you a free quote on conversion costs. (Although the
form only mentions converting PDFs, I'm sure they can also give
you a quote on any Word document you upload.) I think any other
XML consulting organization that offers conversion services ought
to be able to do the same thing for you.

The main issue with converting a document from PDF, Word, or
anything else to XML is that you're going from a format without
much explicit structure to a format which is purely structural.
So the challenge is to expose whatever structure the document has
in its current form so that you can automate the conversion as
much as possible.

For example, if a Word document you want to convert is already
formatted with logical paragraph and character styles, it's going
to simplify automation of the conversion quite a bit. If, on the
other hand, the differences between headings, body text, and so on
are only implied by ad hoc formatting (font changes, indentation,
and the like), then automating the conversion is going to be much
more difficult or even impossible.
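
To make the styled-document case concrete, here is a rough sketch
in Python. It assumes the paragraphs have already been pulled out
of the Word file as (style name, text) pairs; the style names, the
element names, and the convert() helper are all made up for
illustration and are not taken from any particular tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word paragraph style names to XML element names.
STYLE_TO_ELEMENT = {
    "Heading 1": "title",
    "Body Text": "para",
    "List Bullet": "item",
}

def convert(paragraphs, root_name="doc"):
    """Build an XML tree from (style_name, text) pairs."""
    root = ET.Element(root_name)
    for style, text in paragraphs:
        # Paragraphs with an unknown style fall back to a generic element.
        child = ET.SubElement(root, STYLE_TO_ELEMENT.get(style, "para"))
        child.text = text
    return root

# Sample input, standing in for paragraphs extracted from a styled document.
styled = [
    ("Heading 1", "Installing the widget"),
    ("Body Text", "Unpack the box."),
    ("List Bullet", "Check the cable."),
]
xml_text = ET.tostring(convert(styled), encoding="unicode")
print(xml_text)
```

The whole conversion reduces to a lookup table because the styles
already say what each paragraph is. Without them, that table has
nothing to key on.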

You also need to keep in mind that without adding a human step to
the pre- or post-processing (that is, without doing manual tagging
of some kind), you're just going to end up with XML that's only as
structured as the Word or PDF documents you start with. So if your
source documents lack a discernible, usable structure, then your
XML will lack one too.
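
As a small illustration of that point, the Python sketch below
runs the same unstyled input through a converter twice: once as
is, and once with a manual tagging step. The to_xml() function,
the manual_tags argument, and the element names are hypothetical,
chosen only to show the idea.

```python
import xml.etree.ElementTree as ET

def to_xml(paragraphs, manual_tags=None):
    """Convert (style, text) pairs to XML.

    manual_tags maps a paragraph index to an element name chosen by a person.
    """
    manual_tags = manual_tags or {}
    root = ET.Element("doc")
    for index, (style, text) in enumerate(paragraphs):
        if index in manual_tags:
            name = manual_tags[index]   # a person tagged this paragraph
        elif style == "Heading 1":
            name = "title"              # structure present in the source
        else:
            name = "para"               # nothing in the source to go on
        ET.SubElement(root, name).text = text
    return ET.tostring(root, encoding="unicode")

# Every paragraph uses the same style, so the source carries no structure.
unstyled = [("Normal", "Installing the widget"), ("Normal", "Unpack the box.")]

flat = to_xml(unstyled)                               # automated only
tagged = to_xml(unstyled, manual_tags={0: "title"})   # with a human step
print(flat)
print(tagged)
```

The automated pass can only produce a flat run of para elements;
the title shows up only once someone supplies it by hand.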

-- Mike Smith


--
Michael Smith ... xml-doc-owner -at- egroups -dot- com
Moderator
XML-DOC mailing list ... http://www.egroups.com/group/xml-doc/
Subscribe: ... xml-doc-subscribe -at- egroups -dot- com
Subscribe to digest: ... xml-doc-digest -at- egroups -dot- com











