How To Put A PDF Cleanly Into Word or Into Your TM Tool Using Really (Really) Simple Skills - The Translation Business

How to put a PDF cleanly into Word or into your TM tool using really (... https://translationbiz.wordpress.com/2011/04/18/how-to-put-a-pdf-clea...

The translation business

How to put a PDF cleanly into Word or into your TM tool using really (really!) simple skills
Posted on April 18, 2011

I have often observed that while many translators and project
managers may be skilled users of a number of sophisticated software
tools, they sometimes lack some really simple skills in Word. Like for
instance knowing how to find and replace tabs or paragraph and line
markers…

“But why would we ever want to do that?” they might ask.

In this post we look at how such simple skills can be used to solve some
awkward problems. As an illustration, we’ll look at how to drop a PDF file
into Word (and from there into a TM tool, if required).

So – what’s the problem with PDF files?

Many translators are dismayed when they discover that the source text is in PDF format – and for good reason.
Getting it into an editable format or getting it into a TM tool is not always straightforward. A quick search on
Google will turn up a variety of different “PDF converters”. Some TM software tools will also convert PDFs into
an editable format. However, in my experience, it is very rare that the converted text is without some mucky
problems. Many translators just give up on trying to extract text from PDFs.

Alejandro Moreno-Ramos has the best possible solution:

However, if Alejandro’s method fails (and you really want editable text), then…

This is what you can do:

If you are able to select the text in a PDF with your mouse then you will be able to copy it and paste it directly
into Word (if not, you should quickly abandon all hope!). Copying and pasting the text will not transfer the
document properties (e.g. margins, columns etc), but you’ll get the text with most of its formatting properties
(fonts, text size, bold and italics etc):

You’ve now got editable text… But whoa! In the illustration above, you can see that there is a paragraph marker
at the end of every line! The text doesn’t wrap properly in the Word document.

Hopeless?

Not at all! It’s easy enough to get rid of the paragraph markers (as we’ll see) using simple find & replace. But this
would make the whole document one huge paragraph. We need to retain one critical piece of information –
where the real paragraphs start and end!

We need to get rid of the surplus paragraph markers (shown in the red circles below) – but we need to keep the
ones marked in blue. These mark the end of the real paragraphs.

There are a few steps involved in doing this – but they are all very simple – the only skills required are to know
how to copy, paste and find & replace! Here’s how to do it:

1 Get the text from the PDF into Word

Select the text in the PDF. Copy it and paste it into Word.

(Some care needs to be taken when selecting text in a PDF. You may find that the PDF won’t allow you to select
paragraphs in the correct order. You may need to copy & paste several individual sections, one at a time, to
ensure you get the text flowing in the right sequence.)

2 Make sure that you have Word’s “Show/Hide” button switched to “Show”.

Toggling this button to “show” displays the document’s formatting marks (tabs,
paragraph marks, picture anchors etc.) [1]. I’ve noted that many young translators (and
some older ones too) try to work in Word with this button switched to “Hide”. The usual
excuse is that seeing the formatting marks is aesthetically unappealing or distracting. My usual response is “Get
over it!” (I don’t usually get a good reaction to that advice!) But my opinion is that working on a document
with the formatting marks turned off is like groping around in a dark room – you unwittingly bump into, trip
over and break things. Has your formatting ever gone unexpectedly haywire? Maybe you too like to keep this
button in “Hide” mode? However, being able to see all the formatting marks helps you understand the structure
of the document and lets you see if the original author has made any silly formatting mistakes. Walking blind
into the client’s problems can spoil your day! Try it… It really doesn’t hurt (much)!

3 Identify where the real paragraphs end

Now, this requires a few minutes of manual work – it requires hitting the [Enter] key a few times on every page
of the document to mark the end of each paragraph. If you really want editable text – it’s worth the small effort
required.

Look for the spot where each paragraph ends, insert your cursor and then hit [Enter] to create an empty line. In
many cases it is clear to the eye where each paragraph should end – but not always! So keep an eye on the
original text. It only takes a few minutes to do this for an average-sized document.

Now you have double paragraph markers (aka “a blank line”) which indicate where the paragraphs are
supposed to end:

4 Preserve these paragraph breaks

Our ultimate task is to get rid of all the surplus paragraph markers at the end of the lines. This is easy to do – we
just replace them with spaces using Word’s Find & Replace function.

But!

If we replace all the paragraph markers with spaces, we’ll lose the paragraph breaks we’ve just marked with a
blank line! They’ll just turn into two consecutive spaces (there may be lots of other double spaces hiding in the
document too!). So we need to temporarily mark the paragraph breaks with something else before we can get rid
of the unnecessary markers at the end of each line. You can use pretty much any sequence of characters you like
– you just need to be sure that whatever you use is unlikely to occur in the text. You might like to make up
something like “@#$%” or some such. I always use “[para]” as a placeholder.
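For anyone who wants to check a placeholder before using it, the "unlikely to occur in the text" test can be sketched outside Word in a few lines of Python. This is only an illustration: the function name `pick_placeholder` and the third candidate marker are made up for the example; only “[para]” and “@#$%” come from this post.

```python
def pick_placeholder(text, candidates=("[para]", "@#$%", "[[PARA-BREAK]]")):
    """Return the first candidate marker that does not already occur in text.

    The candidate list is an assumption for this sketch; use any strings you like.
    """
    for marker in candidates:
        if marker not in text:
            return marker
    raise ValueError("every candidate placeholder already occurs in the text")


# "[para]" already appears in this sample, so the next candidate is chosen.
sample = "Prices are listed in [para] 4.\nSee the next line."
print(pick_placeholder(sample))
```

In Word itself, the equivalent check is simply to run Find on the placeholder first and confirm that nothing is found.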

The task now is to replace all instances of two consecutive paragraph markers with the temporary placeholder.
Word uses the characters ^p to represent a paragraph marker (or ^p^p for two of them), so:

Type “^p^p” into the Find what box; and
Type “[para]” into the Replace with box; then
Click Replace All:

This is how the document should change once the blank lines have gone:

5 Now get rid of all the redundant paragraph marks

We are now going to search for all the extra paragraph markers and replace them with spaces. (Look at the
paragraph markers – if there is already a space in front of them, then you need to replace them with “nothing” –
i.e. you leave the Replace with box blank.)

Type “^p” into the Find what box;
Put your cursor into the Replace with box and hit the space bar. (If you don’t need spaces, use your mouse to
select and delete any invisible spaces which might be lurking there); then
Hit Replace All:
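The "space or nothing" decision in this step can also be sketched in Python, treating a plain-text line break ("\n") as a stand-in for Word's paragraph marker. That stand-in, and the helper name `join_lines`, are assumptions made for the illustration; this is not what Word does internally.

```python
import re


def join_lines(text):
    """Replace each line break with a single space.

    If a space already sits in front of the break, the break is simply
    dropped, so the join does not create a double space.
    """
    # " \n" -> " "  (break removed, existing space kept)
    # "\n"  -> " "  (break replaced by a space)
    return re.sub(r" ?\n", " ", text)


wrapped = "The text doesn't wrap \nproperly in the\nWord document."
print(join_lines(wrapped))
```

The first break in the sample already has a space in front of it and the second does not; both end up as exactly one space.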

Your document should now be a complete mess and look something like this (paragraph breaks highlighted):

6 Now reinstate the paragraph breaks

This is where the magic really happens and the mess instantly becomes a nicely formatted document. We now
need to get rid of the temporary “[para]” placeholders and replace them with real paragraph breaks.

Type “[para]” into the Find what box;
Type “^p” into the Replace with box [2];
Hit “Replace All”:

If all has gone to plan, then you should have a nice, clean, plainly formatted document that you can edit or
import into your favourite TM tool!
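For readers who prefer to see the logic of steps 4 to 6 in one place, here is a minimal Python sketch of the same three replacements applied to plain text. It assumes the text has been saved as plain text, that "\n" plays the role of ^p, and that real paragraph ends have already been marked with a blank line (step 3). The function name `unwrap` is invented for the example.

```python
PLACEHOLDER = "[para]"  # the marker used in this post; any string absent from the text works


def unwrap(text, placeholder=PLACEHOLDER):
    """Remove line-end breaks while keeping breaks that were marked by a blank line."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")
    text = text.replace("\n\n", placeholder)  # step 4: protect the real paragraph breaks
    text = text.replace("\n", " ")            # step 5: remove the surplus line-end breaks
    return text.replace(placeholder, "\n")    # step 6: reinstate the real paragraph breaks


pasted = (
    "Many translators are dismayed when\n"
    "the source text is a PDF.\n"
    "\n"
    "Getting it into an editable\n"
    "format is not always easy."
)
print(unwrap(pasted))
```

The order matters in exactly the way described above: if the single breaks were replaced first, the blank lines would collapse into double spaces and the paragraph information would be lost.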

Postscript

Don’t try to do this with tables (the subject of a future post maybe).
If you use a PDF-to-Word converter, then these same Find & Replace techniques can often be used to fix up
poorly converted text.
Rather than going through all these steps every time, they can be automated by recording a macro and
putting a button to do the job on the toolbar. One click and the job is done! (Again this could be the subject
of another post!)
Qabiria.com have an excellent, detailed article on using PDF-to-Word converters here: http://bit.ly/9TqbGH
The examples in this post were illustrated using Microsoft Word 2007 and Adobe Reader X.

[1] You can control which formatting marks you would like to have displayed when you switch the “Show”
button on. Click Office Button|Word options|Display. Because translators usually work on documents that other
people have created and formatted, I recommend that they select “Show all formatting marks” so that they can
always see (and work around) formatting mistakes made by others.

[2] If you want to control paragraph spacing with a blank line, then you’ll want to use two paragraph markers
(i.e. “^p^p”).

This entry was posted in OpenBorder, Tips 'n tricks and tagged convert PDF, format document, PDF, translation, translators. Bookmark the permalink.

30 Responses to How to put a PDF cleanly into Word or into your TM tool using really (really!)
simple skills

Martin says:
April 21, 2011 at 7:52 am

>Rather than going through all these steps every time, they can be automated by recording a macro and putting a button to
do the job on the toolbar. One click and the job is done! (Again this could be the subject of another post!)

I really ought to do this … every day (almost, anyway) I get new jobs in PDF format and convert them using ABBYY
Finereader (best program I’ve found for the job to date), and then do a series of manual search-and-replaces to set up my
preferred formatting structure. The number of times I must have done this will be in the thousands by now, yet I remain
macro-shy. Something I will have to get over, evidently!

(Speaking of getting over it, I am one of the “older” – I guess, by now – translators who keeps the formatting marks hidden,
calling them into view only when actually required.)

One thing users of OCR software need to get used to is the notion that the program *won’t* do everything automatically –
text in columns needs to be selected column-by-column to avoid having everything run together if the program doesn’t
recognise all the column breaks, for instance. One of my worst experiences was a very large Korean document that had
started out in life as multi-column text but, before it came to me, all the column breaks (and the returns within the columns)
had been lost, so I had to separate the text out before even starting the translation. Still gives me the heeby-jeebies to think
about it.

Clark smith says:


August 19, 2014 at 9:40 am

Easily convert PDF file to word file format, you should try Kernel for PDF to Word Converter Tool. It can simply and
quickly convert multiple PDF file to Word file format. http://www.pdftodoctool.com/

Pingback: Easy! Easy! How to copy a table from a PDF into Word… for beginners #xl8 #t9n | The translation business

Pingback: “I want to translate a web page… but using Word!” Converting HTML to Word format | The translation business

ISO 9001 says:


October 19, 2011 at 9:05 am

Very good post, I was really searching for this topic, as I wanted this topic to understand completely and it is also very rare in
internet, that is why it was very difficult to understand.

Thank you for sharing this.

regards:
ISO 9001

Paul says:
November 13, 2011 at 3:01 am

Great article. I found that replacing “.^p” with “.[para]” means I don’t have to find each proper paragraph break individually.
Although if your paragraphs don’t all end with a full stop, you’ll need to take this into account.
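(Editor's illustration of this shortcut, as a rough Python sketch on plain text. It assumes "\n" stands in for ^p; the function name `unwrap_by_full_stop` is made up for the example.)

```python
def unwrap_by_full_stop(text, placeholder="[para]"):
    """Treat every line that ends in a full stop as the end of a paragraph."""
    text = text.replace(".\n", "." + placeholder)  # ".^p" -> ".[para]"
    text = text.replace("\n", " ")                 # remaining breaks become spaces
    return text.replace(placeholder, "\n")         # "[para]" -> real break


sample = (
    "This line stops mid-sentence\n"
    "and finishes here.\n"
    "A new paragraph starts\n"
    "and also ends here."
)
print(unwrap_by_full_stop(sample))
```

Besides missing paragraphs that end without a full stop, this version also splits at any wrapped line that merely happens to end in a full stop, so the result still needs a quick look against the original.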

Kemal says:
December 16, 2011 at 9:28 pm

I have just translated a PDF file badly rendered into Word using someone else’s conversion package and it was a nightmare. I
will try the tips but I think next time round, I will just charge triple rate. Most clients don’t care a toss about wasting the
translator’s time and patience. The only solution is to make customers pay for their ignorance and disdain.

Isabelita says:
May 4, 2012 at 6:01 pm

Great post….but it did not solve my problem!…yes, Im a translator and need to type over a word page. I converted my pdf
doc online, it was email back to me with the format text wrapping borders…After I read this post I copy and paste the pdf doc
on word and it again shows the text wrapping borders…I try all the choices; back or front of text…nothing!…please
help..Thanks, Issy

prasad says:
December 12, 2012 at 5:58 pm

use bcl pdf to word, works fine-



bernagora says:
June 15, 2012 at 3:43 pm

Great article. It is well advisable to learn all the special formatting symbols and how to use them in Word replace. Often I
have to remove soft line breaks (^l) – but first converting double soft line breaks to real line breaks (^l^l to ^p).

HakanE. says:
September 26, 2012 at 9:14 am

Great post. Thank you for sharing this.



lindawonders says:
October 23, 2012 at 6:18 am

Thank you so much for sharing this. I had to copy at least 50 articles from PDF into Word and just about drove myself nuts.
This has saved me so much time and my sanity.

Paul says:
October 23, 2012 at 8:38 am

LOL – I’m really happy that you found this useful. Preserving someone’s sanity is a very special bonus! Thanks.

Vin Jat says:


May 9, 2013 at 11:09 am

Thank You so much!!! My problem just got sorted.



Joey H says:
May 24, 2013 at 6:16 am

I don’t know if this will help you all but for me it made it super-easy. It doesn’t make the words in the sentences connect
together as far as I can tell, but it takes the original document (I was converting DOS text letters to PDF files) and when you
paste it into a program, such as Gmail or I’m sure any other program like wordpad or Word, it preserves the formatting of
the spaces between the paragraphs. Here’s how I did it: Using Foxit Reader choose the option in the VIEW tab and then
TEXT VIEWER, and then copy and paste your text. I noticed this was all I needed since I was just taking text from a DOS
program and wanting to get it into an email. Hopefully this will help someone else. I’m not sure if Adobe Reader has the
same option but I didn’t see it.

willem says:
July 17, 2013 at 9:40 am

Just saved me so much work… Thank you very much for making the world a better place!

Per Hansen says:


October 7, 2013 at 8:49 pm

Wonderful post and so very very helpful. Definitely solved my problem and thanks to the other people who commented as
well.

Claire says:
December 11, 2013 at 3:31 pm

Immensely helpful – so simple and intelligent. Thanks!



Gabrielle says:
February 4, 2014 at 2:26 pm

I was so thrilled to find your post, thank you. I use the GIRDAC PDF Converter to convert PDFs from magazine scans. It
works brilliantly, however I end up with article columns all over my Word document which make it maddening when using
Trados… Is there any way of adapting your method so that I have simple paragraphs without the columns?

Paul Sulzberger says:


February 5, 2014 at 5:16 am

Hi Gabrielle. There are several things you can do. First, try selecting the whole Word document and then from the
main menu select “Format” and then “Columns…”. In the “Columns” dialog box, set columns to “One” (or
“Number of columns” to 1.) This should get rid of all the columns which have been imported from the magazine
article.

If other aspects of formatting such as bolds and italics etc are not important, then try saving the whole document
as a plain text file. Close the document and re-open the text file in Word. You will have a nice clean document
with all the formatting (except the division into paragraphs) stripped away.

My Little Spanish Notebook says:


March 7, 2014 at 12:53 pm

Hello! I have converted my Pdf to Word, but it won’t let me select all, (only parts of the text in columns/boxes,etc). I am
going mad trying to find how to select the whole lot. I know there’s a simple way, because I’ve done it before, but I cannot for
the life of me work out how to do it. Any help you can give would be greatly appreciated. Thanks!

Paul Sulzberger says:


March 10, 2014 at 6:20 am

It’s possible that some of the text is in text boxes and some of it not. In Word you could first “Save as” a “text file”
(i.e. as “plain text”). Close the document and then open the plain text file you have just saved. Saving it as “plain
text” will strip out all columns, text boxes and other formatting leaving you with just the text (and paragraph
marks). You can now reformat the document from scratch.

My Little Spanish Notebook says:


March 10, 2014 at 8:59 am

Hi Paul. Thanks a lot for getting back to me on this. I really appreciate your help. Now I’m going to bookmark
this page for next time, as I’m bound to forget again…!

Katrina J. says:
December 5, 2014 at 8:41 am

Hello there! Just wanted to let you know about a free and useful toolkit to convert any pdf file to text, mobi or epub formats,
when needed. Simply upload files and the tool will convert them quickly: http://kitpdf.com/. Thanks!

sahil malhotra says:


January 2, 2015 at 7:43 pm

thanks sir

Mnye Nye-nravitsya says:


May 24, 2015 at 4:59 pm

This trick is obvious, I used it for years, so thanks for nothing.. The problem is, when copy-pasting text from almost any PDF
document, the copied text h a s r a n dom s pa ces between characters. I haven’t yet found a solution for this.. No converter
can get it right.

Pingback: Medicare Easy Pay » How Do I Use Etc1

Luca says:
May 2, 2016 at 2:23 am

You know what, damn. Thanks. Usually when searching for a solution on the internet for a little problem with formatting or
some other organisational issue I find the information and quickly close the tab and continue on my way.

But, this solution, whilst not overly elegant, works so damn well.

Thanks!

Neil Richard Innis says:


May 16, 2016 at 1:02 pm

Thanks that really helped me a lot!


Also, I wanted to know if you’ve already written something on
“Rather than going through all these steps every time, they can be automated by recording a macro and putting a button to
do the job on the toolbar. One click and the job is done!..”
Because if I can save myself the pain of going through all those replacements every time, it will be bliss.

Paul Sulzberger says:


May 16, 2016 at 1:06 pm

I’d like to do a post on using a macro to automate at least some of the steps. I’ll look for some time (but don’t hold
your breath)!
