Thread: PDF Text Extraction Example

  1. #1
    Lavastorm Employee
    Join Date
    Aug 2009

    PDF Text Extraction Example

    **Nodes released into this community are prototypes. They have gone through a minimum set of tests; therefore, we cannot guarantee that they will work as designed nor are they supported by any existing maintenance contracts.**

    This prototype graph contains a simple node for extracting text from a PDF file.
    It is not a feature-complete PDF reader.
    Rather, it lets the user extract unformatted text from any PDF file that permits text extraction.
    It can also serve as an example of using a Java node to interface with a third-party API, in this case Apache's PDF Box.
    The code that interfaces with Apache PDF Box is visible within the node, so interested users can modify it to perform more advanced text extraction.

    • One of:
      • Lavastorm 4.6 Professional Plus
      • Lavastorm 4.6 Workgroup with Java Node support (Developer Node Pack)
      • Lavastorm 4.6 Enterprise
    • Apache PDF Box
    • Download the attached BRG
    • Download Apache PDF Box:
    • Open the "Extract PDF Text" node
    • Set the "Filename" parameter to point to a PDF file on the file system of the LAE server
    • Run the node.
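
    If the node fails after the jar has been copied, one quick sanity check is whether the PDF Box classes can be loaded at all. The following is only a sketch and is not part of the attached node: `ClasspathCheck` is a hypothetical name, and the two class names assume the package layout of PDF Box 1.8.x. It only tells you about the classpath of the JVM it runs in, so it would need to be started with the same classpath the Java nodes use.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    final class ClasspathCheck {

        // Returns the names from classNames that the current class loader cannot load.
        static List<String> missingClasses(List<String> classNames) {
            List<String> missing = new ArrayList<>();
            for (String name : classNames) {
                try {
                    // initialize=false: only test loadability, do not run static initializers.
                    Class.forName(name, false, ClasspathCheck.class.getClassLoader());
                } catch (ClassNotFoundException | LinkageError e) {
                    missing.add(name);
                }
            }
            return missing;
        }

        public static void main(String[] args) {
            // Assumed class names (PDF Box 1.8.x layout); adjust for other versions.
            List<String> required = Arrays.asList(
                    "org.apache.pdfbox.pdmodel.PDDocument",
                    "org.apache.pdfbox.util.PDFTextStripper");
            List<String> missing = missingClasses(required);
            System.out.println(missing.isEmpty()
                    ? "PDF Box classes found"
                    : "Missing: " + missing);
        }
    }
    ```

    An empty result means every listed class was loadable; any name printed under "Missing" points at a jar that is absent from the classpath.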

  2. #2


    Hi Tim,

    Thank you for the useful information provided.

    I tried to run your node using the pdf.brg graph and have downloaded the pdfbox app. It's just that the version of pdfbox now available on the site is 1.8.10, not 1.8.2. There is also no ext folder in the Lavastorm directory, only lib/java.

    One more important detail: I use a client version of Lavastorm, not the server version. Does that change how the pdf.brg graph executes?


  3. #3


    This doesn't work with Lavastorm 6.1 using a PDF created from a Word document. Also, the code is hidden. Is there some way to see it? Thanks.

  4. #4
    Lavastorm Employee gmullin
    Join Date
    May 2014


    I got a copy of pdfbox-app-1.8.2.jar from here and the node worked OK on v6.1.4 for me. I used the test.pdf attached below, a PDF created from Excel.

    If you go to the library view, you can view Tim's code in the node.


  5. #5


    pdfbox-app-1.8.2.jar is in the java/ext folder. I copied the attached PDF, and this is what I get, the same result I got before when running it on a server.

    ERROR: Unexpected error occurred during processAll while running the node: org/apache/fontbox/afm/AFMParser.
    Error Code: brain.unexpectedNodeErrorDuring

    I think this is LAL 6.1.48

  6. #6
    Lavastorm Employee gmullin
    Join Date
    May 2014


    The only thing I can think of is that I had to place the jar file on the client and in the same location on the server (unless you're using the desktop version).

  7. #7


    It is running on a server, not desktop.
    I put copies of the jar in all the Lavastorm java folders. This is what the log says:

    <detail xsi:type="javaExceptionDetail">

  8. #8


    Apparently you need the fontbox jar as well. Now it works!
    how can I find or extract a copy of Thanks
    Last edited by jegnj; 12-21-2016 at 09:41 PM.
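
    The error in #5 names org/apache/fontbox/afm/AFMParser, so the question becomes which jar, if any, in a given folder actually provides that class. A minimal sketch of one way to check, using only the standard `java.util.jar` API: `JarClassFinder` is a hypothetical helper, not something shipped with Lavastorm or PDF Box, and the folder to scan is whatever directory you pass on the command line.

    ```java
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.jar.JarFile;

    final class JarClassFinder {

        // Lists the jars directly inside dir that contain the given entry,
        // for example "org/apache/fontbox/afm/AFMParser.class".
        static List<File> jarsContaining(File dir, String classEntry) throws IOException {
            List<File> hits = new ArrayList<>();
            File[] files = dir.listFiles();
            if (files == null) {
                return hits; // dir is not a directory or cannot be read
            }
            for (File f : files) {
                if (!f.isFile() || !f.getName().endsWith(".jar")) {
                    continue;
                }
                try (JarFile jar = new JarFile(f)) {
                    if (jar.getEntry(classEntry) != null) {
                        hits.add(f);
                    }
                }
            }
            return hits;
        }

        public static void main(String[] args) throws IOException {
            // The folder is supplied by the caller; "." is only a fallback.
            File dir = new File(args.length > 0 ? args[0] : ".");
            String entry = "org/apache/fontbox/afm/AFMParser.class";
            System.out.println(jarsContaining(dir, entry));
        }
    }
    ```

    An empty list for a folder means no jar there carries the class, which would be consistent with the error above. The scan is not recursive, so each java folder has to be checked separately.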

  9. #9


    Originally posted by jegnj:
    Apparently you need the fontbox jar as well. Now it works!
    how can I find or extract a copy of Thanks
    Hey jeg, did you or anyone else find out how to extract?
    Last edited by Delnero; 10-04-2018 at 05:39 AM.
