In most document management systems, searches only use the title or metadata of documents. But the title and metadata do not always reflect what's inside. As a result, users must open and scan documents manually to find the right content -- a waste of time, and a source of frustration!

In this blog, we show how you can save your users time by enabling powerful automated search directly inside documents. We use Algolia for indexing documents, Firebase for storage, and PDFTron for document rendering and text extraction.

Algolia and Firebase are easy to get started with and have a free tier. But you can pick whatever alternatives work best with your existing infrastructure. The logic in this guide should work the same, whatever your technology combination.

  1. Clone the GitHub repository
  2. Set up full-text indexed search
  3. Configure document storage
  4. Configure CORS for uploaded documents
  5. Run the application and start searching

Clone the GitHub repository

To get started, clone the ready-to-go sample from our GitHub. If you downloaded it as an archive instead, extract it first. Then navigate to the project folder and, in the terminal, run:

npm install

Do not start the app yet, since we still have to configure a few pieces.

Set up Algolia for Full-text Indexed Search

This application uses Algolia to index and search documents. But Algolia is not the only third-party search provider. You can also consider alternatives such as Elasticsearch.

To get started with this sample, please register a new app with Algolia.

Afterwards, open your new app and navigate to Indices, where we will create a new index called document_search.

Configuring Algolia for full-text indexed search

After configuring your app, go back to the cloned GitHub project, create an .env file in the root directory, and enter the following:

REACT_APP_ALGOLIA_APP_ID=your_key_goes_here
REACT_APP_ALGOLIA_API_KEY=your_key_goes_here
REACT_APP_ALGOLIA_SEARCH_KEY=your_key_goes_here
REACT_APP_ALGOLIA_INDEX_NAME=document_search

The above information can be found under the API Keys tab in your Algolia Dashboard.

Algolia dashboard API keys

Configure Firebase Storage for Uploaded Documents

For this guide, we use Firebase for storage, which you can get started with by registering your app here. You can use any other storage of your choice instead.

Make sure to create a storage bucket and enable authentication for email and Google to ensure only authorized users can upload documents.

If you would like to build applications with just Firebase authentication, you can start by cloning this project, which includes the bare bones of what is required.

Register your PDF search app with Firebase

After registering your app with Firebase, go back to the cloned GitHub project, open the .env file you created earlier in the root directory, and add the following:

REACT_APP_API_KEY=your_key_goes_here
REACT_APP_MESSAGING_SENDER_ID=your_key_goes_here
REACT_APP_APP_ID=your_key_goes_here
REACT_APP_AUTH_DOMAIN=your_domain_goes_here
REACT_APP_DATABASE_URL=your_database_url_goes_here
REACT_APP_PROJECT_ID=your_project_id
REACT_APP_STORAGE_BUCKET=your_storage_bucket

The above information can be found under Settings in your Firebase app.
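To illustrate how these variables might be consumed, here is a minimal sketch that maps them onto the config object shape that Firebase's initializeApp() accepts. The helper name buildFirebaseConfig is hypothetical and not taken from the sample project, which may wire things up differently. The sketch assumes a Create React App style build, where REACT_APP_-prefixed variables are exposed on process.env.

```javascript
// Hypothetical helper (not from the sample project): maps the REACT_APP_*
// variables from .env onto the config object shape used by Firebase.
function buildFirebaseConfig(env) {
  // Firebase config key -> environment variable name from the .env file
  const mapping = {
    apiKey: "REACT_APP_API_KEY",
    authDomain: "REACT_APP_AUTH_DOMAIN",
    databaseURL: "REACT_APP_DATABASE_URL",
    projectId: "REACT_APP_PROJECT_ID",
    storageBucket: "REACT_APP_STORAGE_BUCKET",
    messagingSenderId: "REACT_APP_MESSAGING_SENDER_ID",
    appId: "REACT_APP_APP_ID",
  };

  const config = {};
  const missing = [];
  for (const [key, envName] of Object.entries(mapping)) {
    const value = env[envName];
    if (!value) {
      missing.push(envName); // collect every missing name for one clear error
    } else {
      config[key] = value;
    }
  }

  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return config;
}
```

In the app, you would call buildFirebaseConfig(process.env) once at startup and pass the result to initializeApp(). Failing early with the list of missing names makes a forgotten .env entry much easier to spot.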

Firebase PDF search app settings

Configure CORS for Uploaded Documents in Firebase Storage

Next, you will need to set up CORS on your Firebase Storage bucket to allow WebViewer to access the files in it. I created a CORS file called cors.json with the following contents:

[
  {
    "origin": ["*"],
    "method": ["GET"],
    "maxAgeSeconds": 3600
  }
]

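We used gsutil to apply this CORS policy to the bucket we created previously, by running gsutil cors set cors.json gs://your_storage_bucket with the bucket name substituted in.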

Run the Application and Start Searching

To run the application we just created, in the terminal or command line, run the following:

npm start

We can now upload documents for indexing and search them by page text and document title.

Sample search results across multiple PDF and Word documents

It may surprise you that text in a PDF file is not necessarily stored in its natural reading order, as you would typically imagine. Instead, depending on how the PDF was generated, text characters can appear in any order. That includes characters at the start of the page! For example, 'Hello' can be broken into the fragments 'H', 'e', 'll', and 'o', and each of these fragments could be stored anywhere in the file.

When extracting content, PDFTron runs through all the characters in the file and reassembles them in the order a user would read them.
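To make the idea concrete, here is a deliberately simplified sketch of reassembling scattered text fragments into reading order. It is not PDFTron's actual algorithm, which has to handle columns, rotation, fonts, and much more. The function name toReadingOrder, the fragment shape { text, x, y }, and the lineTolerance parameter are all assumptions made for illustration, with y measured from the top of the page.

```javascript
// Simplified illustration only; NOT PDFTron's actual extraction algorithm.
// Each fragment is assumed to look like { text, x, y }, where y is the
// distance from the top of the page and x the distance from the left edge.
function toReadingOrder(fragments, lineTolerance = 2) {
  // Sort top-to-bottom first so fragments on the same line end up adjacent.
  const sorted = [...fragments].sort((a, b) => a.y - b.y || a.x - b.x);

  // Group fragments whose y positions are within the tolerance into lines.
  const lines = [];
  for (const frag of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - frag.y) <= lineTolerance) {
      last.items.push(frag);
    } else {
      lines.push({ y: frag.y, items: [frag] });
    }
  }

  // Within each line, order fragments left-to-right and join their text.
  return lines
    .map((line) =>
      line.items
        .sort((a, b) => a.x - b.x)
        .map((frag) => frag.text)
        .join("")
    )
    .join("\n");
}
```

Given the fragments 'o', 'H', 'll', and 'e' in arbitrary storage order but with increasing x positions for 'H', 'e', 'll', 'o' on the same line, this sketch joins them back into 'Hello'.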

With the sample application we just set up, you can upload PDF, DOCX, PPTX, and XLSX files. PDFTron can load all these documents in memory, extract text, and render them entirely in-browser -- without calling or using any server-side dependencies.

For UI components, I use Pinterest Design Library as well as Algolia’s React Instant Search. Instant Search provides ready-to-go UI components to handle returned results and create highlights without having to worry about character offsets. I also leveraged the ability to pass custom components so that the search UI closely matches our overall design.

To understand what each component does inside of the app, it is best to refer to the project structure.

According to Algolia’s best practices, it is best to keep your records small. That is why I created a separate record for each page of a document, instead of one huge record for a whole document with many pages nested inside. This approach will reduce your costs and increase the performance of your search.
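The per-page approach can be sketched roughly as follows. This is a minimal illustration rather than the sample project's actual code: the function name buildPageRecords and the attribute names docId, title, pageNumber, and content are assumptions. Only objectID is an attribute that Algolia itself uses, as the unique identifier of a record.

```javascript
// Hypothetical sketch (not from the sample project): turns the extracted
// text of each page into one small Algolia record per page.
function buildPageRecords(docId, title, pageTexts) {
  return pageTexts
    .map((text, i) => ({
      // objectID is Algolia's unique record identifier; the other
      // attribute names here are assumptions made for illustration.
      objectID: `${docId}_page_${i + 1}`,
      docId,
      title,
      pageNumber: i + 1, // 1-based, matching how users refer to pages
      content: text.trim(),
    }))
    // Skip pages without any extractable text; the remaining records keep
    // their original page numbers because numbering happens before filtering.
    .filter((record) => record.content.length > 0);
}
```

Because every record carries the document ID, the title, and the page number, a search hit can point the user straight to the matching page, and the title stays searchable from every page of the document.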

Conclusion

This sample project gets you started on enabling client-side search across all your documents -- not just on titles, but also on their contents! Don't hesitate to reach out to us with any questions you might have or suggestions for improving our sample.