Groovy to extract text from PDF in CPI

samarjitsingha · ‎11-30-2022

Introduction

This blog post shows how to meet a custom requirement: extracting text from a PDF in SAP CPI with the help of a Groovy script.

Note: This Groovy script will not work on PDFs with formatted content (images, bullet points, workflows).

Current Scenario: At the time of writing, I could not find any blog on extracting text from a PDF in SAP CPI.

Why are we doing this?

It gives us the flexibility to work with PDF files. Most of the content coming into SAP Cloud Platform Integration is XML, JSON, CSV, or EDI. When such content arrives inside a PDF, this Groovy script can extract it as plain text, and the rest of the transformation can be done as the scenario requires.

PROCEDURE:

STEP 1: Download the pdfbox and fontbox JAR files and upload them to your iFlow.

Download the pdfbox JAR file from the following link

Download the fontbox JAR file from the following link.

Upload both JAR files in the Resources tab of your iFlow. Use the same version for pdfbox and fontbox, since pdfbox depends on fontbox.

STEP 2: Take a sample payload for PDF conversion.

This is the sample CSV payload used for conversion.

Material_Name,Material_ID,Material_Number
Iron,KAU145,240
Copper,KAU146,800
Zinc,KAU222,180
Cobalt,KAU338,546

Pdf screenshot of the above CSV file

STEP 3: Use Groovy Script in your iFlow to extract text from PDF.

iFlow Explanation:

We use an HTTP adapter to trigger the integration flow with the PDF file.

Then we use a "Groovy Script" step to extract the content of the PDF.

After that, we use the "CSV to XML Converter" to convert the extracted CSV text to XML.

Postman Configuration:

Configure Postman as shown in the image below.

Groovy Script:

import com.sap.gateway.ip.core.customdev.util.Message
import org.apache.pdfbox.pdmodel.PDDocument
// PDFBox 1.8.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper

// Extracts the plain text of a PDF supplied as an InputStream.
// Exceptions are not caught here, so a failed extraction fails the message
// instead of silently setting an empty body.
static String readFromPDF(InputStream input) {
    PDDocument pd = null
    try {
        pd = PDDocument.load(input)
        PDFTextStripper stripper = new PDFTextStripper()
        stripper.setStartPage(1)    // first page to extract (1-based)
        // stripper.setEndPage(1)   // last page to extract; omit to read to the end
        return stripper.getText(pd)
    } finally {
        // Always release the document, even if extraction fails
        if (pd != null) {
            pd.close()
        }
    }
}

def Message processData(Message message) {
    // Request the body as a stream so the PDF bytes are not converted to a String
    InputStream pdfStream = message.getBody(InputStream)
    String text = readFromPDF(pdfStream)
    message.setBody(text)
    return message
}

Groovy Script Explanation:

First, we fetch the message body as an InputStream.

Then we call the function readFromPDF and pass the stream to it.

If the PDF has multiple pages, or you want the content of certain pages only, you can use the stripper.setStartPage(1) and stripper.setEndPage(1) methods. Page numbers start at 1.

Finally, we use the stripper.getText(pd) method to read the content.
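The text returned by the stripper may contain blank lines or trailing spaces, which can confuse a converter that expects one CSV record per line. Below is a minimal sketch of a cleanup step, assuming plain comma-separated text. The helper name cleanExtractedText is hypothetical and not part of the original script.

```groovy
// Hypothetical helper (not part of the original script): tidy the text
// returned by PDFTextStripper before it reaches the CSV to XML Converter.
String cleanExtractedText(String text) {
    if (text == null) {
        return ''
    }
    // Split on any line ending, trim each line, and drop empty lines.
    List<String> lines = text.split(/\r?\n|\r/)
            .collect { it.trim() }
            .findAll { !it.isEmpty() }
    return lines.join('\n')
}

// Sample input that mimics extracted text with blank lines and a trailing space
def raw = 'Material_Name,Material_ID,Material_Number\r\n\r\nIron,KAU145,240 \r\n\r\n'
println cleanExtractedText(raw)
```

In the script above, this could be applied just before setting the body, for example message.setBody(cleanExtractedText(text)).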

Output:

<?xml version='1.0' encoding='UTF-8'?>
<Record>
	<root>
		<Material_Name>Iron</Material_Name>
		<Material_ID>KAU145</Material_ID>
		<Material_Number>240</Material_Number>
	</root>
	<root>
		<Material_Name>Copper</Material_Name>
		<Material_ID>KAU146</Material_ID>
		<Material_Number>800</Material_Number>
	</root>
	<root>
		<Material_Name>Zinc</Material_Name>
		<Material_ID>KAU222</Material_ID>
		<Material_Number>180</Material_Number>
	</root>
	<root>
		<Material_Name>Cobalt</Material_Name>
		<Material_ID>KAU338</Material_ID>
		<Material_Number>546</Material_Number>
	</root>
</Record>

More Sample PDFs and their Output:

850 EDI pdf to 850 EDI file

In the above image, you can see that I have used an 850 EDI in PDF format. All of its text can be extracted with this Groovy script and then used in CPI as per your requirement.

You can use the same EDI payload from this link.
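Once the EDI text is extracted, it can be handy to look at it segment by segment. Here is a rough sketch, assuming "~" as the segment terminator and "*" as the element separator. Real X12 interchanges declare their separators in the ISA segment, which this sketch does not read, and the helper name parseX12Segments is hypothetical.

```groovy
// Hypothetical helper: split extracted X12 text into segments and elements.
// Assumes "~" ends a segment and "*" separates elements.
List<List<String>> parseX12Segments(String text) {
    if (text == null) {
        return []
    }
    // PDF extraction may introduce line breaks; remove them before splitting.
    String flat = text.replaceAll('[\\r\\n]+', '')
    return flat.split('~')
            .collect { it.trim() }
            .findAll { it }                                   // drop empty segments
            .collect { it.split('\\*', -1) as List<String> }  // keep empty elements
}

def segments = parseX12Segments('ST*850*0001~\nBEG*00*SA*PO123~')
segments.each { println it }
```

Each entry in the result is one segment, with the segment ID (for example ST or BEG) as its first element.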

Sample PO to the relevant text

As the above image shows, we are able to extract the text of the sample PO, but we cannot convert this text to an EDI file, because the layout of each PO may vary and writing a common script would be difficult.

So this Groovy script is not recommended for formatted PDFs like the one shown in the above image.
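One way to catch such PDFs early is to check whether the extracted text really looks like the expected CSV before passing it on. Below is a minimal sketch, assuming simple comma-separated fields without quoted commas. The helper name looksLikeCsv is hypothetical.

```groovy
// Hypothetical guard: true only when every non-empty line has the expected
// number of comma-separated fields. Quoted fields containing commas are not handled.
boolean looksLikeCsv(String text, int expectedColumns) {
    if (text == null) {
        return false
    }
    List<String> lines = text.readLines().findAll { it.trim() }
    if (lines.isEmpty()) {
        return false
    }
    // A limit of -1 keeps trailing empty fields such as "a,b,"
    return lines.every { it.split(',', -1).length == expectedColumns }
}

assert looksLikeCsv('Iron,KAU145,240\nCopper,KAU146,800', 3)
assert !looksLikeCsv('Purchase Order\nBill To: ACME Corp', 3)
```

If the check fails, the script could raise an exception with a clear message rather than sending free-form text to the converter.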

You can visit this site to download the sample PO.

Conclusion:

To conclude, this blog post shows how to extract plain-text content from a PDF using a Groovy script.

Check out the link for more helpful information about Cloud Platform Integration (CPI).

If you have any queries, please feel free to ask in the comments. I would request everyone to share feedback and like this blog post if you find it helpful.

Thanks & Regards,

Samarjit Singha

Souragopalsethy · ‎12-01-2022

Thanks for the amazing write-up.

former_member831581 · ‎12-01-2022

Thanks, brother, I needed this in my project.

francis21 · ‎12-01-2022

Thanks a lot, Samarjit, for this blog.

If I have a customer's PO in PDF format, is it possible for me to process that PDF PO and create ORDERS IDoc or EDI (850)?

samarjitsingha · ‎12-01-2022

Hi Francis

I have tested it with 850 PO and it's working fine.

Thanks,

Samarjit

francis21 · ‎12-01-2022

Hi Samarjit,

What I meant is that the customer's PO is in PDF, not necessarily an 850 PO in PDF.

Thanks,

Francis

francis21 · ‎12-02-2022

Hi Samarjit,

Like this sample image below of a PDF PO.

Thanks,

Francis

samarjitsingha · ‎12-02-2022

Hi Francis,

This scenario will not work for your problem statement, because here text can be extracted only from a PDF containing plain text characters, excluding bullet points, images, tables, etc.

I have one query regarding your problem statement: how are you planning to convert the extracted text to an EDI format? As far as I know, there is no capability in CPI to convert an actual Purchase Order (as in the above sample image) to EDI or IDoc.

If you have an actual EDI or IDOC in PDF format then this scenario will work.

Thanks,

Samarjit

francis21 · ‎12-02-2022

Hi Samarjit,

Up to PI 7.11, there was the SAP Conversion Agent.

Thanks,

Francis

Raunak · ‎12-02-2022

Hi Samarjit,

Thank you for this blog. I'm working on a custom scenario and this blog helps a lot.

Thanks,

Raunak

EuricoBorges · ‎12-02-2022

samarjitsingha, for better understanding, please add a screenshot of the PDF you used to the blog so that everyone can see how it looks.

Thanks

lucy · ‎12-07-2022

I agree with you. I want to know what the PDF file looks like so that I can simulate the scenario and understand how it works.

lucy · ‎12-07-2022

Hello Samarjit,

Could you please share the PDF files of the tested 850 PO and the Blog.pdf shown in your screenshot above?

Thanks!

Lucy

samarjitsingha · ‎12-07-2022

Hi Lucy,

I have updated the blog. Please go through it. If you have some more queries feel free to ask.

Regards,

Samarjit

samarjitsingha · ‎12-07-2022

Hi Eurico,

I have updated the blog with the same.

Regards,

Samarjit

philippeaddor · ‎12-07-2022

Hi Samarjit

Thanks for the blog post. Just as a general note: if I got such a requirement, I would first of all challenge the requester or owner of the sender system and ask why in the world they would send a CSV file in PDF format and not simply as a CSV text string... If there's really a hard requirement for this, sure, your script and the library would come in handy. However, I think the more interesting use case would be processing a formatted document, which is much more difficult, as you point out.

There is a creative solution using SAP RPA here (without Cloud Integration though): https://blogs.sap.com/2021/09/07/translating-pdf-documents-with-sap-intelligent-robotic-process-auto...

Philippe

lucy · ‎12-09-2022

Hello Samarjit,

Thanks for your update! Could you please attach the PDF file, so I can download it and use it in Postman to simulate the scenario in my iFlow?

I tried to simulate the PDF reading as in your blog, but have not succeeded yet. Please see the iFlow errors below.

Error Details

com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: org/apache/pdfbox/pdmodel/font/PDType0Font : cannot initialize class because prior initialization attempt failed. The MPL ID for the failed message is : AGOP8owhnoOeAa-6zuTQr4H7G111

Error Details

com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: org/apache/fontbox/afm/AFMParser. The MPL ID for the failed message is : AGOP8cloc76ZOQ2uAfR4qP2C9121

Best Regards,

Lucy

samarjitsingha · ‎12-12-2022

Hi Lucy,

Thanks for pointing out the errors. There was an issue with the pdfbox and fontbox JAR files, so I have updated the blog with new links. Please download the new JAR files and upload them to the Resources tab of your iFlow.

You can download the pdf files from this link.

Thanks,

Samarjit

lucy · ‎12-13-2022

Thanks a lot Samarjit! It's working fine now.

ImranShafiq · ‎12-16-2022

Hi Samarjit,

I am facing the following error. Could you guide me on how to resolve it?

com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: XSD schema is incompatible with CSV payload. The XSD schema provided contains 3 records; CSV payload contains 1 records.. The MPL ID for the failed message is : AGOcCDK94khTaGo3bF9cOXSag9-C

My .XSD file

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
	<xs:element name="Material">
		<xs:complexType>
			<xs:sequence>
				<xs:element name="Items">
					<xs:complexType>
						<xs:sequence>
							<xs:element type="xs:string" name="Material_Name"/>
							<xs:element type="xs:string" name="Material_ID"/>
							<xs:element type="xs:integer" name="Material_Number"/>
						</xs:sequence>
					</xs:complexType>
				</xs:element>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>

ImranShafiq · ‎12-16-2022

Aoa, I need to talk to you regarding this article. I have tried it, but without success.

GerardoPalomo · ‎03-22-2024

Really good. I followed your guide and it worked.

I have bookmarked this for future use. I appreciate your help.