Extract text from any file with Intelligent Viewing and use it in another service like GPT

Phil Cox
edited May 31, 2023 in Articles #1

The purpose of this article is to show how easy it is to use Intelligent Viewing to extract text from any file and then use that text in another service, such as translation, summarization or sentiment analysis.

Our source file is a PDF brochure from Lotus with 8 pages, viewed below in Intelligent Viewing.

This example extracts all the text from all 8 pages. Once extracted, the text can of course be used in many different ways. Here we will send it to GPT for summarization and then send the summary to Google Translate to be translated into our target language, French.

This example presumes we have a working Intelligent Viewing environment.

Using Postman with a valid authentication token, the following call returns the details of all existing publications. We can see that the total count of publications is 63.

Once we have all the information, we next want to find the publication ID of our specific document. Postman has a nice search feature that lets us find the publication ID of our published file. The publication ID in our case is:-

ecd0f23f-d9da-4706-ad01-57499edf37e9

By appending the publication ID to the previous call, we get back all the details of our published document, including the page count.

The full JSON Path to the pageCount value is as follows:-

pageCount=_embedded["pa:get_publication_artifacts"][1]._embedded["ac:get_artifact_content"].content.pageCount
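As a minimal sketch, the same path can be followed in Python once the publication details have been parsed into a dictionary. The function name and the trimmed sample response are hypothetical and exist only to illustrate the lookup; a real response contains many more fields, and the artifact index 1 is simply the position quoted in the path above.

```python
from typing import Any, Mapping


def get_page_count(publication: Mapping[str, Any]) -> int:
    """Follow the JSON path quoted above and return pageCount as an int."""
    artifacts = publication["_embedded"]["pa:get_publication_artifacts"]
    # Index 1 is the position used in the path above; it may differ
    # for other publications, so treat it as an assumption.
    content = artifacts[1]["_embedded"]["ac:get_artifact_content"]["content"]
    return int(content["pageCount"])


# Hypothetical, heavily trimmed response used only to exercise the function.
sample_publication = {
    "_embedded": {
        "pa:get_publication_artifacts": [
            {"_embedded": {}},
            {
                "_embedded": {
                    "ac:get_artifact_content": {"content": {"pageCount": 8}}
                }
            },
        ]
    }
}

page_count = get_page_count(sample_publication)
print(page_count)
```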

Now we switch into Visual Studio Code.

Here we import the Python libraries we will use and check whether the file we will write our text to already exists.
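A minimal sketch of such a check is below. It assumes that a leftover output file from an earlier run should simply be deleted, which may not be what the original script does. The function name is hypothetical; the file name extracted.txt is the one used in this article.

```python
import os

OUTPUT_FILE = "extracted.txt"  # output file name used in this article


def reset_output_file(path: str = OUTPUT_FILE) -> bool:
    """Delete a leftover output file; return True if one was removed.

    Deleting the old file is an assumption; appending to it or aborting
    would be equally valid choices.
    """
    if os.path.exists(path):
        os.remove(path)
        return True
    return False


if __name__ == "__main__":
    if reset_output_file():
        print("Removed existing " + OUTPUT_FILE)
```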

Next we set the PubID of the file we will extract the text from.

Next, with our publication ID, we need to find the total number of pages so we can extract the text from every page.

Then, using the following call on each page, we can extract the text and write it into extracted.txt:

"http://otiv-highlight/search/api/v1/publications/" + pubID + "/text?page=" + pageidx + "&textOnly=true"
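The per-page loop could be sketched as follows. The function names are hypothetical, and the HTTP request is passed in as a `fetch` callable so the loop does not depend on a particular client. Whether page numbering starts at 0 or 1 is not stated here, so `first_page` is an assumption to adjust for your environment.

```python
from typing import Callable

# Host and path exactly as written in the call above.
BASE_URL = "http://otiv-highlight/search/api/v1/publications/"


def build_text_url(pub_id: str, page_idx: int) -> str:
    """Build the per-page text URL in the same shape as the call above."""
    return BASE_URL + pub_id + "/text?page=" + str(page_idx) + "&textOnly=true"


def extract_all_text(
    pub_id: str,
    page_count: int,
    fetch: Callable[[str], str],
    out_path: str = "extracted.txt",
    first_page: int = 1,  # assumption: pages are numbered from 1
) -> int:
    """Fetch every page's text and append it to out_path.

    Returns the number of pages written.
    """
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for page_idx in range(first_page, first_page + page_count):
            out.write(fetch(build_text_url(pub_id, page_idx)))
            out.write("\n")
            written += 1
    return written
```

With the `requests` library, `fetch` might be something like `lambda url: requests.get(url, headers={"Authorization": "Bearer " + token}, timeout=30).text`, where the header format and the `token` variable are assumptions about how authentication is set up.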

Now that we have extracted all the text, we must first strip any unwanted characters, such as the newline "\n", before sending it to GPT for summarization. Here we ask for all the text to be summarized into 10 sentences.
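A small sketch of the clean-up step is shown below. The helper names and the prompt wording are hypothetical. The GPT request itself is left out, because the client library and model are not named here.

```python
def clean_text(raw: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return " ".join(raw.split())


def build_summary_prompt(text: str, sentences: int = 10) -> str:
    """Build a summarization prompt; the exact wording is an assumption."""
    return f"Summarize the following text in {sentences} sentences:\n\n{text}"


def word_count(text: str) -> int:
    """Count whitespace-separated words, e.g. to compare text and summary."""
    return len(text.split())


if __name__ == "__main__":
    with open("extracted.txt", encoding="utf-8") as f:
        cleaned = clean_text(f.read())
    print(word_count(cleaned), "words to summarize")
    prompt = build_summary_prompt(cleaned)
```

Splitting on whitespace and re-joining removes "\n" and also tidies any double spaces left where lines were merged.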

The full text extracted from all 8 pages totals 773 words.

The original 773 words are then summarized down to 118 words.

From here we can send the summary off to a translation service, for example.

The variable "cltext" holds our summary text, which we pass to Google Translate along with source and target parameters, the target in this case being French.
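One possible sketch of this step is below. It assumes the third-party `deep_translator` package, chosen only because its `GoogleTranslator` takes `source` and `target` arguments; the library actually used may well be different. The function name and the `translator_factory` parameter are hypothetical.

```python
def translate_summary(cltext, source="en", target="fr", translator_factory=None):
    """Translate the summary text from source to target language.

    translator_factory is any callable accepting source= and target=
    and returning an object with a translate(text) method. By default
    deep_translator's GoogleTranslator is used, which is an assumption
    about the translation backend.
    """
    if translator_factory is None:
        # Imported lazily so the function can be exercised with a
        # stand-in backend when the package is not installed.
        from deep_translator import GoogleTranslator

        translator_factory = GoogleTranslator
    return translator_factory(source=source, target=target).translate(cltext)


if __name__ == "__main__":
    cltext = "Example summary text."  # placeholder for the real summary
    print(translate_summary(cltext, source="en", target="fr"))
```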

Here is the text now translated into French:-

In summary, we started with an 8-page PDF document, extracted all of its text, sent that text for a 10-sentence summary, and then translated the summary from English to French.

Many thanks,

Phil

Comments

  • Great article Phil. Showcases the simplicity of invoking IV's API calls and the power of extending the information.