Technology Blogs by Members
samarjitsingha
Participant

Introduction


This blog post shows how to meet a custom requirement: extracting text from a PDF in SAP Cloud Platform Integration (CPI) with the help of a Groovy script.

Note: This Groovy script will not work on PDFs with formatted content (images, bullet points, workflows).

 

Current Scenario: No blogs are available on extracting text from a PDF in SAP CPI.

 

Why are we doing this?

It gives us the flexibility to work with PDF files. Most of the content that arrives in SAP Cloud Platform Integration is XML, JSON, CSV, or EDI. When such content arrives inside a PDF, this Groovy script can extract it as text, and the rest of the transformation can be done as the scenario requires.

PROCEDURE:

STEP 1: Download the pdfbox and fontbox JAR files and upload them to your iFlow.

  • Download the pdfbox JAR file from the following link.

  • Download the fontbox JAR file from the following link.

  • Upload both JAR files in the Resources tab of your iFlow. pdfbox needs fontbox at runtime, so the script fails if either one is missing.


 


 

STEP 2: Take a sample payload for PDF conversion.

This is the sample CSV payload used for conversion:

Material_Name,Material_ID,Material_Number
Iron,KAU145,240
Copper,KAU146,800
Zinc,KAU222,180
Cobalt,KAU338,546

 


PDF screenshot of the above CSV file


 

STEP 3: Use Groovy Script in your iFlow to extract text from PDF.


 

iFlow Explanation:

  • We use an HTTP adapter to trigger the integration flow with the PDF file.

  • Then we use a "Groovy Script" step to extract the content of the PDF.

  • After that, we use the "CSV to XML Converter" to convert the extracted CSV content to XML.
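
As a rough sketch of the hand-off between the last two steps: depending on the PDF, the extracted text may contain blank lines or trailing spaces that could trip up CSV parsing. The helper below, normalizeExtractedCsv, is a hypothetical name and not part of the original script; it only tidies line endings and whitespace, on the assumption that the PDF holds plain CSV text.

```groovy
// Hypothetical helper (not part of the original script): tidies text
// extracted from a PDF before it is handed to a CSV parser.
static String normalizeExtractedCsv(String text) {
    if (text == null) {
        return ''
    }
    // Accept Windows, old Mac, and Unix line endings
    def lines = text.split(/\r\n|\r|\n/)
    // Trim padding around each line and drop lines that end up empty
    def cleaned = lines.collect { it.trim() }.findAll { !it.isEmpty() }
    return cleaned.join('\n')
}

// Example input with a trailing space, a blank line, and a final newline
def raw = 'Material_Name,Material_ID,Material_Number\r\nIron,KAU145,240 \r\n\r\nCopper,KAU146,800\r\n'
println normalizeExtractedCsv(raw)
```

In an iFlow, a call like this could sit between reading the PDF text and message.setBody(...), if the extracted text turns out to need it.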


 

Postman Configuration:

Configure Postman as shown in the image below.


Groovy Script:
import com.sap.gateway.ip.core.customdev.util.Message
import org.apache.pdfbox.pdmodel.PDDocument
// This import path matches PDFBox 1.8.x; in PDFBox 2.x the class is
// org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper

// Reads the text of the PDF supplied as an input stream.
static String readFromPDF(InputStream input) {
    PDDocument pd = null
    try {
        pd = PDDocument.load(input)
        PDFTextStripper stripper = new PDFTextStripper()
        stripper.setStartPage(1)    // first page to extract (1-based)
        // stripper.setEndPage(1)   // last page to extract; omit to read to the end
        return stripper.getText(pd)
    } finally {
        // Close the document even if extraction fails; the exception itself
        // propagates so the message fails visibly instead of getting a null body
        if (pd != null) {
            pd.close()
        }
    }
}

def Message processData(Message message) {
    // Ask for the body as a stream rather than assigning the raw object
    InputStream pdfStream = message.getBody(InputStream)
    String res = readFromPDF(pdfStream)

    message.setBody(res)
    return message
}

 

Groovy Script Explanation:

  • First, we fetch the message body as an InputStream.

  • Then we call the function readFromPDF and pass it the body.

  • If the PDF has multiple pages, or you want the content of certain pages only, you can use the stripper.setStartPage() and stripper.setEndPage() methods.

  • Finally, we use the stripper.getText() method to read the content.


Output:
<?xml version='1.0' encoding='UTF-8'?>
<Record>
  <root>
    <Material_Name>Iron</Material_Name>
    <Material_ID>KAU145</Material_ID>
    <Material_Number>240</Material_Number>
  </root>
  <root>
    <Material_Name>Copper</Material_Name>
    <Material_ID>KAU146</Material_ID>
    <Material_Number>800</Material_Number>
  </root>
  <root>
    <Material_Name>Zinc</Material_Name>
    <Material_ID>KAU222</Material_ID>
    <Material_Number>180</Material_Number>
  </root>
  <root>
    <Material_Name>Cobalt</Material_Name>
    <Material_ID>KAU338</Material_ID>
    <Material_Number>546</Material_Number>
  </root>
</Record>

 

More Sample PDFs and their Output:


 


850 EDI pdf to 850 EDI file


In the above image, you can see that I have used an 850 EDI in PDF format. All of its text can be extracted with this Groovy script and then used in CPI as per your requirement.

You can use the same EDI payload from this link.

 


Sample PO to the relevant text


In the above image, we are able to extract the content of the PDF, but we cannot convert the extracted text to an EDI file, because the format of each PO may vary and writing a common script would be difficult.

So this Groovy script is not recommended for formatted PDFs like the one shown in the above image.
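
One way to act on this limitation is to check the extracted text before passing it downstream. The sketch below is illustrative only: looksLikeCsv is a hypothetical helper, not part of the original script, and it assumes plain comma-separated rows without quoted fields. It returns true only when every non-blank line has the same number of fields as the first line.

```groovy
import java.util.regex.Pattern

// Hypothetical guard (not part of the original script): true when every
// non-blank line has the same number of delimited fields as the first line,
// and there are at least two fields per line.
static boolean looksLikeCsv(String text, String delimiter = ',') {
    List<String> lines = (text ?: '').readLines().findAll { it.trim() }
    if (lines.isEmpty()) {
        return false
    }
    String sep = Pattern.quote(delimiter)
    // Limit -1 keeps trailing empty fields so "a,b," counts as three fields
    int expected = lines[0].split(sep, -1).length
    return expected > 1 && lines.every { it.split(sep, -1).length == expected }
}

def csvText = 'Material_Name,Material_ID,Material_Number\nIron,KAU145,240\nCopper,KAU146,800'
def poText = 'PURCHASE ORDER\nVendor: ACME, Inc.\nTotal 100'

println looksLikeCsv(csvText)
println looksLikeCsv(poText)
```

A check like this does not make a formatted PO convertible; it only flags early that the extracted text is not uniform delimited data.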

You can visit this site to download the sample PO.

Conclusion:


So, to conclude, this blog shows how to extract the contents of a PDF using a Groovy script.

Check out the link for more helpful information about Cloud Platform Integration (CPI).


If you have any queries, please feel free to ask in the comments. I would also request everyone to share feedback, and to like this blog post if you found it helpful.

 

Thanks & Regards,

Samarjit Singha
21 Comments
Souragopalsethy
Explorer
Thanks for amazing write-up
former_member831581
Discoverer
Thanks, brother, I needed this in my project.
francis21
Participant
Thanks a lot, Samarjit, for this blog.

 

If I have a customer's PO in PDF format, is it possible for me to process that PDF PO and create ORDERS IDoc or EDI (850)?
samarjitsingha
Participant

Hi Francis


I have tested it with 850 PO and it's working fine.


Thanks,


Samarjit

francis21
Participant
Hi Samarjit,

What I meant is that the customer's PO is in PDF, not necessarily an 850 PO in PDF.

Thanks,

Francis
francis21
Participant
Hi Samarjit,

Like this sample image below of a PDF PO.

Thanks,

Francis

samarjitsingha
Participant
Hi Francis,

This scenario will not work for your problem statement, because it can only extract text from a PDF that contains plain text characters, excluding bullet points, images, tables, etc.

I have one query regarding your problem statement: how are you planning to convert the extracted text to an EDI format? As far as I know, there is no capability in CPI to convert an actual Purchase Order (as in the above sample image) to EDI or IDoc.

If you have an actual EDI or IDoc in PDF format, then this scenario will work.

Thanks,

Samarjit
francis21
Participant
Hi Samarjit,

Up to PI 7.11, there was the SAP Conversion Agent.

Thanks,

Francis
Raunak
Discoverer

Hi Samarjit,

Thank you for this blog. I'm working on a custom scenario and this blog helps a lot.

Thanks,

Raunak

EuricoBorges
Participant
samarjitsingha, for better understanding, please add to the blog a screenshot of the PDF you used so that everyone can see how it looks.

Thanks
lucy
Explorer
I agree with you. I want to know what the PDF file looks like so that I can simulate the scenario and understand how it works.
lucy
Explorer
Hello Samarjit,

Could you please share the PDF files of the tested 850 PO and the Blog.pdf shown in your screenshot above?

 

Thanks!

Lucy
samarjitsingha
Participant
Hi Lucy,

I have updated the blog. Please go through it. If you have some more queries feel free to ask.

Regards,

Samarjit
samarjitsingha
Participant
Hi Eurico,

I have updated the blog with the same.

Regards,

Samarjit
philippeaddor
Active Participant

Hi Samarjit

Thanks for the blog post. Just as a general note: if I got such a requirement, I would first of all challenge the requester or owner of the sender system and ask why in the world they would send a CSV file in PDF format and not simply as a CSV text string... If there's really a hard requirement for this, sure, your script and the library would come in handy. However, I think the more interesting use case would be processing a formatted document, which is much more difficult, as you point out.

There is a creative solution using SAP RPA here (without Cloud Integration though): https://blogs.sap.com/2021/09/07/translating-pdf-documents-with-sap-intelligent-robotic-process-auto...

Philippe

lucy
Explorer
Hello Samarjit,

 

Thanks for your update! Could you please attach the PDF file, so that I can download it and use it in Postman to simulate the scenario in my iFlow?

 

I tried to simulate the PDF reading as in your blog, but have not succeeded yet. Please see the iFlow errors below.

Error Details

com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: org/apache/pdfbox/pdmodel/font/PDType0Font : cannot initialize class because prior initialization attempt failed. The MPL ID for the failed message is : AGOP8owhnoOeAa-6zuTQr4H7G111

Error Details

com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: org/apache/fontbox/afm/AFMParser. The MPL ID for the failed message is : AGOP8cloc76ZOQ2uAfR4qP2C9121

Best Regards,

Lucy
samarjitsingha
Participant
Hi Lucy,

Thanks for pointing out the errors. There was an issue with the pdfbox and fontbox JAR files, so I have updated the blog with new links. Please download the new JAR files and upload them to the Resources tab of your iFlow.

You can download the pdf files from this link.

Thanks,

Samarjit
lucy
Explorer
Thanks a lot Samarjit! It's working fine now.
ImranShafiq
Discoverer
Hi Samarjit,

 

I am facing the following error. Could you guide me on how to resolve it?
com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: XSD schema is incompatible with CSV payload. The XSD schema provided contains 3 records; CSV payload contains 1 records.. The MPL ID for the failed message is : AGOcCDK94khTaGo3bF9cOXSag9-C

My .XSD file

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Material">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Items">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="Material_Name"/>
              <xs:element type="xs:string" name="Material_ID"/>
              <xs:element type="xs:integer" name="Material_Number"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
ImranShafiq
Discoverer
Aoa, dear, I need to talk to you about this article. I have tried it, but without success.
GerardoPalomo
Discoverer

Really good. I followed your guide and it worked. 

I have bookmarked this for future use. I appreciate your help.
