Frequently Asked Questions

How do I save the exported text file to a given folder?

By default, PDF2Text writes the extracted text to the console window. To save the result to a specific folder instead, use the -o (or --output) parameter. For example:

pdf2text -o "..\..\My Output" 1.pdf

Note: If the specified path does not exist, PDF2Text will attempt to create the necessary folders.

How can I control the output name for converted files?

By default, PDF2Text creates a separate text file for every page in the document. The output filename is constructed from the name of the input PDF file, a page counter, and the appropriate file extension. For example, the following command generates a sequence of text files in "MyFolder" named mydoc_1.txt, mydoc_2.txt, etc.:

pdf2text -o MyFolder mydoc.pdf

PDF2Text allows output filenames to be customized using the '--prefix' and '--digits' options. For example, the following command generates a sequence of text files in "MyFolder" named newname_0001.txt, newname_0002.txt, etc.:

pdf2text -o MyFolder --prefix newname --digits 4 mydoc.pdf

The '--digits' parameter specifies the number of digits used in the page counter portion of the output filename. By default, digits are added as needed; this parameter can be used to format the page counter to a uniform width (e.g. myfile_0001.txt, myfile_0010.txt, instead of myfile_1.txt, myfile_10.txt, etc.).
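The naming pattern described above can be sketched in a few lines of Python. This is only an illustration of the pattern; the function output_name and its parameters are hypothetical and are not part of PDF2Text, and the assumption that '--digits' acts as a minimum width is ours.

```python
def output_name(prefix, page, digits=0, ext=".txt"):
    """Build a per-page filename such as 'newname_0001.txt'.

    Assumption: 'digits' is a minimum width, so the page counter is
    zero-padded and longer page numbers are never truncated.
    """
    counter = str(page).zfill(digits)
    return f"{prefix}_{counter}{ext}"
```

With these assumptions, output_name("mydoc", 1) gives "mydoc_1.txt" and output_name("newname", 1, digits=4) gives "newname_0001.txt".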

To avoid any ambiguities in file naming, the prefix option should be used only for conversion of individual documents.

How do I convert PDF to XML, a list of words or list of text-runs?

By default, PDF2Text converts PDF to a plain .txt file without any extra metadata. The output format can be changed using the '-f' (or --format) option. For example,

pdf2text -f xml in.pdf

will convert the PDF to XML and include a number of additional properties, such as positioning and styling information for each word.

The '--format' parameter accepts any of the following text formats:

plain
wordlist
textruns
xml

How do I specify the output text encoding format?

By default, PDF2Text uses UTF8 encoding. To change the output encoding, use the -e (or --encoding) option. For example,

pdf2text --encoding UTF16 in.pdf

The '--encoding' parameter supports two encoding formats:

UTF8
UTF16

How do I open a password protected PDF?

PDF2Text will, without user intervention, decrypt and convert documents secured with a master/owner password. If the document is secured with a user (i.e. 'file open') password, PDF2Text will, by default, prompt the user to enter the password. If the '--noprompt' option is used, the program will not ask for a password and will display an error message instead.

For unattended conversion, the password can also be specified directly on the command-line using the '-p' (or --password) option. For example:

pdf2text -p secret -f xml secured.pdf

The above command line will convert PDF to xml format and will use the provided password ('secret') to open the secured PDF document.
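For unattended use from a script, the same command line can be assembled programmatically. The sketch below is a hedged illustration: build_command and convert are hypothetical helper names, it assumes the pdf2text executable is on the PATH, and it assumes that '--noprompt' can be combined with '-p' so that a wrong password produces an error instead of a prompt. Only options mentioned in this FAQ are used.

```python
import subprocess

def build_command(pdf_path, password=None, output_format=None, prompt=False):
    """Assemble a pdf2text argument list from options described in this FAQ."""
    cmd = ["pdf2text"]
    if not prompt:
        cmd.append("--noprompt")      # never block waiting for a password
    if password is not None:
        cmd += ["-p", password]       # user ('file open') password
    if output_format is not None:
        cmd += ["-f", output_format]  # plain, wordlist, textruns, or xml
    cmd.append(pdf_path)
    return cmd

def convert(pdf_path, **options):
    """Run pdf2text; assumes the executable is installed and on the PATH."""
    return subprocess.run(build_command(pdf_path, **options))
```

Passing the arguments as a list avoids shell quoting problems when the password or path contains spaces.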

Note: PDF2Text supports all standard security options available in PDF, including 40-bit and 128-bit RC4 encryption, Crypt filters, and 128-bit AES (Advanced Encryption Standard) encryption.

How do I specify which pages to convert?

By default, PDF2Text converts all PDF pages to text. You can specify a subset of pages to convert using the '-a' (or '--pages') option. For example:

pdf2text -a 1,3,10 in.pdf

will convert only pages 1, 3, and 10. Please note that PDF2Text assumes that all pages are numbered sequentially starting from page 1.

To specify a range of pages, use a dash between the page numbers. For example:

pdf2text -a 1,10-20,50- in.pdf

will convert the first page, pages 10 through 20, and all pages from page 50 to the last page in the document.

All even pages can be selected using the 'e' (or 'even') string. For example, the following line converts all even pages:

pdf2text --pages even in.pdf

Similarly, odd pages can be selected using the 'o' (or 'odd') string. The following line converts all odd pages in the document and every page in the range from 100 to the last page:

pdf2text --pages odd,100- in.pdf
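To make the page selection syntax concrete, here is a minimal Python sketch that expands a specification such as 1,10-20,50- into explicit page numbers. It models the behavior described above and is not PDF2Text's own parser; the function expand_pages is hypothetical, and dropping pages beyond the end of the document is our assumption.

```python
def expand_pages(spec, page_count):
    """Expand a page spec such as '1,10-20,50-' into sorted 1-based page numbers."""
    selected = set()
    for token in spec.split(","):
        token = token.strip().lower()
        if token in ("e", "even"):
            selected.update(range(2, page_count + 1, 2))
        elif token in ("o", "odd"):
            selected.update(range(1, page_count + 1, 2))
        elif "-" in token:
            start, _, end = token.partition("-")
            last = int(end) if end else page_count  # '50-' runs to the last page
            selected.update(range(int(start), min(last, page_count) + 1))
        else:
            selected.add(int(token))
    # Assumption: pages outside 1..page_count are silently dropped.
    return sorted(p for p in selected if 1 <= p <= page_count)
```

For a 52-page document, expand_pages("1,10-20,50-", 52) yields page 1, pages 10 through 20, and pages 50 through 52.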

How do I batch convert PDF files?

PDF2Text supports batch conversion of many PDF files in a single pass. To convert all PDF files in a given folder(s) you can use the following syntax:

pdf2text myfolder1 myfolder2

The '--subfolders' option can be used to recursively process all subfolders. For example, the following line will convert all documents in 'myfolder1' and 'myfolder2' as well as all subfolders:

pdf2text --subfolders myfolder1 myfolder2

By default, PDF2Text converts all files with the extension '.pdf'. To select files with a different extension, use the '--extension' parameter. For example, to convert all documents stored with a custom extension '.blob', you could use the following line:

pdf2text --extension .blob --subfolders myfolder1

Wildcard characters are also allowed. For example, to convert all PDF files starting with 'x' in the current folder, use:

pdf2text x*.pdf
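The folder, '--subfolders', and '--extension' rules above amount to a simple file selection. The following Python sketch shows which files such a selection would pick up; it is an approximation for illustration only. The function find_inputs is hypothetical, and details such as case sensitivity of the extension match are assumptions rather than documented PDF2Text behavior.

```python
from pathlib import Path

def find_inputs(folder, extension=".pdf", subfolders=False):
    """List files in 'folder' whose names end with 'extension'.

    With subfolders=True the search is recursive, mirroring '--subfolders'.
    """
    pattern = "*" + extension
    base = Path(folder)
    matches = base.rglob(pattern) if subfolders else base.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```

A dry run like this can be useful for checking which documents a batch job would touch before starting a long conversion.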

How do I customize text extraction?

By default, PDF2Text will expand all ligatures in PDF. In writing and typography, a ligature occurs where two or more graphemes are joined as a single glyph. Use '--noligature' to disable ligature expansion. For example:

pdf2text --noligature mypdf
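As an illustration of what ligature expansion means, the Python sketch below replaces single-glyph ligature characters with their component letters using Unicode NFKC normalization. This demonstrates the concept only; it is not how PDF2Text is implemented, and NFKC also normalizes other compatibility characters, which PDF2Text may or may not do.

```python
import unicodedata

def expand_ligatures(text):
    """Replace ligature code points (e.g. U+FB01 'fi') with plain letters.

    NFKC normalization applies compatibility decompositions, which map
    the Latin ligature characters to their component letters.
    """
    return unicodedata.normalize("NFKC", text)
```

For example, the string "\ufb01le" (the 'fi' ligature followed by "le") becomes the four-letter word "file".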

PDF files sometimes contain duplicated text to achieve visual effects of drop shadow and to fake bold text style. By default PDF2Text deletes duplicated overlapping text. To keep the duplicates, specify '--no_dup_remove' option on the command line. For example:

pdf2text --no_dup_remove mypdf

PDF2Text automatically removes hyphens that are used in the original PDF file to split a word across two lines. Use the '--nodehyphen' option to disable word merging across lines. For example:

pdf2text --nodehyphen mypdf

PDF2Text provides several options related to the layout of text in the input PDF files.

In some cases, PDF documents may be missing spaces around punctuation characters, so that separate words are merged into a single unit. To break words at punctuation characters, use the '--punct_break' option. For example:

pdf2text --punct_break mypdf

In some cases, text in a PDF may be obscured by images or rectangles. By default, PDF2Text extracts this hidden text; you can disable this behavior using the '--remove_hidden_text' option. For example:

pdf2text --remove_hidden_text mypdf

Similarly, some scanned PDF files or documents that went through OCR (Optical Character Recognition) may contain invisible text to facilitate text selection, highlighting, and text extraction. By default, PDF2Text extracts this invisible text. To prevent this, use the '--remove_invisible_text' option. For example:

pdf2text --remove_invisible_text mypdf

If you need more flexibility or a more programmatic approach to text extraction, consider using the Apryse PDF SDK; sample code is available at /documentation/samples/#textextract. The PDF SDK offers fine-grained control over text extraction and access to low-level features in PDF documents.

How do I include styling information and how do I represent words as XML elements?

By default, when converting to XML, PDF2Text writes all the words of a line as a single text string. To represent each word as a separate XML element with positioning and styling information, use the '--xml_words_as_elements' option. To include font and styling information for each word or line, use the '--xml_output_styles' option. For example, the default XML output for a given PDF may look as follows:

pdf2text --format xml my.pdf

<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="0">
      <Line>Apryse CPU License Agreement</Line>
    </Para>
    <Para id="1">
      <Line>IMPORTANT NOTICE: PLEASE READ ....</Line>
      <Line>SOFTWARE WHICH ACCOMPANIES...</Line>
    </Para>
  </Flow>
</Page>

With the '--xml_words_as_elements' and '--xml_output_styles' options, the generated output is richer:

pdf2text --format xml --xml_words_as_elements --xml_output_styles my.pdf

<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="0">
      <Line style="font-family:Verdana-Bold; font-size:9; color:#000000;">
        <Word>Apryse</Word>
        <Word>CPU</Word>
        <Word>License</Word>
        <Word>Agreement</Word>
      </Line>
    </Para>
    <Para id="1">
      <Line style="font-family:Verdana-Bold; font-size:7.5; color:#000000;">
        <Word>IMPORTANT</Word>
        ... etc
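Output in this shape can be consumed with any XML parser. The Python sketch below reads the words and style of each line using the standard library. It assumes the element and attribute names shown in the sample above (Page, Line, Word, style) and a single Page root element; the helper names and the shortened SAMPLE document are ours, and real output may contain attributes not shown here.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample modeled on the output shown above.
SAMPLE = """<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="0">
      <Line style="font-family:Verdana-Bold; font-size:9; color:#000000;">
        <Word>Apryse</Word>
        <Word>CPU</Word>
        <Word>License</Word>
        <Word>Agreement</Word>
      </Line>
    </Para>
  </Flow>
</Page>"""

def parse_style(style):
    """Turn 'font-family:X; font-size:9;' into a dict of property -> value."""
    pairs = (item.split(":", 1) for item in style.split(";") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}

def lines_with_styles(xml_text):
    """Return (line text, style dict) for every Line element in the page."""
    root = ET.fromstring(xml_text)
    result = []
    for line in root.iter("Line"):
        words = [w.text or "" for w in line.findall("Word")]
        result.append((" ".join(words), parse_style(line.get("style", ""))))
    return result
```

Calling lines_with_styles(SAMPLE) returns one entry whose text is "Apryse CPU License Agreement" together with its font family, size, and color.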

How do I retrieve page information?

PDF2Text provides several options for retrieving page information from existing PDF documents:

Use the '--wordcount' option to retrieve the number of words on each page. For example:

pdf2text --wordcount my.pdf

will retrieve the number of words on each page of the specified document.

Use the '--charcount' option to retrieve the number of characters on each page. For example:

pdf2text --charcount my.pdf

will retrieve the number of characters on each page of the specified document.

Use the '--pageinfo' option to retrieve the width, height, media box, crop box, and rotation for every page in the document. For example:

pdf2text --pageinfo my.pdf

How do I extract text from a given rectangle on a PDF page?

With PDF2Text you can extract text from a region of a page using the '--clip' parameter ('-c' in the example below). The parameter accepts a comma-separated list of four numbers giving the coordinates of a pair of diagonally opposite corners. Typically, the list takes the form llx, lly, urx, ury, specifying the lower-left x, lower-left y, upper-right x, and upper-right y coordinates of the rectangle, in that order. The other two corners of the rectangle are then assumed to have coordinates (llx, ury) and (urx, lly). All coordinates must be expressed in points, the basic unit of the PDF 'user' coordinate system. One PDF point is 1⁄72 of an inch, approximately the same as the point commonly used in the printing industry.

For example:

pdf2text -c 150,600,250,700 license.pdf -a 1
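Because the rectangle must be given in points, it can help to convert from inches before building the argument. The Python sketch below does that conversion using the 72 points per inch relationship stated above. The function clip_arg is hypothetical; it also reorders the corners into llx, lly, urx, ury form, which is a convenience on our part rather than a documented requirement.

```python
POINTS_PER_INCH = 72

def clip_arg(x1_in, y1_in, x2_in, y2_in):
    """Convert two opposite corners given in inches to 'llx,lly,urx,ury' in points."""
    llx, urx = sorted((x1_in * POINTS_PER_INCH, x2_in * POINTS_PER_INCH))
    lly, ury = sorted((y1_in * POINTS_PER_INCH, y2_in * POINTS_PER_INCH))
    # ':g' drops a trailing '.0' so whole numbers print as integers.
    return ",".join(f"{value:g}" for value in (llx, lly, urx, ury))
```

For instance, a rectangle from (1 in, 2 in) to (3 in, 4 in) becomes the string 72,144,216,288, which could then be passed as the clip value.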

Does PDF2Text have any dependencies on third party components/software?

PDF2Text is a completely standalone application and has no dependencies on third-party components or software.
