Text Search Classes

  • TextSearch searches a PDF document for a user-supplied search pattern. The current implementation supports both verbatim search and search using regular expressions; the detailed regular-expression syntax is documented at:

    http://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

    TextSearch also offers several search modes and returns extra information besides the string that matches the pattern. It can either keep running until a match is found or be set to return periodically so that the caller can perform any necessary updates (e.g., UI updates). The search modes can also be changed on the fly while searching through a document.

    Possible use case scenarios for TextSearch include:

    • Guide users of a PDF viewer (e.g., one implemented with PDFViewCtrl) to the places they are interested in;
    • Find PDF documents of interest that contain certain patterns;
    • Extract information of interest (e.g., credit card numbers) from a set of files;
    • Extract Highlight information (refer to the Highlights class for details) from files for external use.

    Note:

    Since hyphens (‘-’) are frequently used in PDF documents to join the two pieces of a word that is broken at the end of a line, as in

    TextSearch is powerful for finding patterns in PDF files; yes, it is really pow-
    erful.

    a search for powerful should return both instances. However, not every end-of-line hyphen was added to join a broken word; some are real hyphens. An input search pattern may also contain hyphens, which complicates the situation further. To handle this, the following conventions are adopted:

    a) In verbatim search mode, when the pattern contains no hyphen, a string matches if it is exactly the same as the pattern or differs from it only by end-of-line or start-of-line hyphens. For example, as mentioned above, a search for powerful returns both instances.
    b) In verbatim search mode, when the pattern contains one or more hyphens, a string matches only if it matches the pattern exactly. For example, a search for pow-erful returns only the second instance, and a search for power-ful returns nothing.
    c) When searching with regular expressions, hyphens are not handled implicitly; users must account for them in the pattern themselves. For example, to find both instances of powerful, the input pattern can make the hyphen optional, as in pow-?erful.
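    The three conventions can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not how TextSearch is implemented: the page text is modelled as a plain string with '\n' line breaks, the helper names (replaceAll, countOccurrences, countVerbatim, countRegex) are hypothetical, and std::regex stands in for the Boost Perl syntax (the optional-hyphen construct -? reads the same in both).

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: replaces every occurrence of `from` (non-empty) in `s` with `to`.
static std::string replaceAll(std::string s, const std::string& from, const std::string& to) {
    for (std::size_t pos = s.find(from); pos != std::string::npos;
         pos = s.find(from, pos + to.size())) {
        s.replace(pos, from.size(), to);
    }
    return s;
}

// Hypothetical helper: counts non-overlapping occurrences of `needle` in `haystack`.
static int countOccurrences(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    int count = 0;
    for (std::size_t pos = haystack.find(needle); pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

// Conventions (a) and (b): verbatim search.
// (a) No hyphen in the pattern: an end-of-line hyphen is treated as joining a broken word.
// (b) Hyphen in the pattern: hyphens are kept and must match exactly.
static int countVerbatim(const std::string& text, const std::string& pattern) {
    const bool patternHasHyphen = pattern.find('-') != std::string::npos;
    std::string flat = replaceAll(text, "-\n", patternHasHyphen ? "-" : "");
    flat = replaceAll(flat, "\n", " ");  // remaining line breaks act as spaces
    return countOccurrences(flat, pattern);
}

// Convention (c): regular-expression search. Hyphens stay in the text,
// so the pattern itself has to allow for them (e.g. "pow-?erful").
static int countRegex(const std::string& text, const std::string& pattern) {
    const std::string flat = replaceAll(replaceAll(text, "-\n", "-"), "\n", " ");
    const std::regex re(pattern);
    return static_cast<int>(std::distance(
        std::sregex_iterator(flat.begin(), flat.end(), re), std::sregex_iterator()));
}
```

    For the example sentence above, with a line break after pow-, this model gives two matches for the verbatim pattern powerful, one for pow-erful, and none for power-ful. The regular expression pow-?erful finds both instances, while the regular expression powerful finds only the first.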

    A sample use case (in C++):

     //... Initialize PDFNet ...
     PDFDoc doc(filein);
     doc.InitSecurityHandler();
     int page_num;
     char buf[32];
     UString result_str, ambient_string;
     Highlights hlts;
     TextSearch txt_search;
     TextSearch::Mode mode = TextSearch::e_whole_word | TextSearch::e_page_stop;
     UString pattern( "joHn sMiTh" );
    
     //PDFDoc doesn't allow simultaneous access from different threads. If this
     //document could be used from other threads (e.g., the rendering thread inside
     //PDFView/PDFViewCtrl, if used), it is good practice to lock it.
     //Notice: don't forget to call doc.Unlock() to avoid deadlock.
     doc.Lock();
    
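     txt_search.Begin( doc, pattern, mode );
     while ( true )
     {
         //Assumption: this uses the Run() overload that fills the four
         //out-parameters declared above and returns a TextSearch::ResultCode.
         TextSearch::ResultCode code = txt_search.Run( page_num, result_str, ambient_string, hlts );
         if ( code == TextSearch::e_found )
         {
             result_str.ConvertToAscii( buf, 32, true );
             std::cout << "found one instance: " << buf << std::endl;
         }
         else if ( code == TextSearch::e_done )
         {
             break; //the whole document has been searched
         }
         //otherwise the end of a page was reached (e_page_stop); keep searching
     }
    
     //unlock the document to avoid deadlock.
     doc.Unlock();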
    

    For a full sample, please take a look at the TextSearch sample project.

    Declaration

    Objective-C

    @interface PTTextSearch : NSObject

    Swift

    class PTTextSearch : NSObject
  • The result of running pdftron::PDF::TextSearch::Run().

    Declaration

    Objective-C

    @interface PTSearchResult : NSObject

    Swift

    class PTSearchResult : NSObject
  • Search modes that control how searching is conducted.

    Declaration

    Objective-C

    enum PTTextSearchModes {}

    Swift

    struct PTTextSearchModes : Equatable, RawRepresentable