parsing – Parse a pdf file – Education Career Blog

I have a PDF like this one:

81 11005589 THING MAXIME 4 PC2I TR1 - MERCREDI DE 07H45 A 09H45 4A7
71 11007079 STUFF QUENTIN 1 PC2I TR1 - LUNDI DE 10H00 A 12H00 1B4
74 10506940 HAHA YEZHOU 2 PC2I TR1 - LUNDI DE 13H30 A 15H30 2D5

http://i.stack.imgur.com/hbXg2.png

And I need to parse it. By that I mean: take the 4th column, append the 3rd column, and build an email address out of them. For example, with the first line: [email protected]

I tried to copy and paste it into Google Docs, but everything ends up in a single cell instead of being split across multiple cells.

I really don’t know what to do here. I guess a regex would help, but applied with which tool?

,

If you are working in Java, use iText; if you are working in C#, use iTextSharp. Both are free for non-commercial use.

,

I’ve used Aspose before for parsing PDFs, Word docs, Excel docs, and some other document types. I’m not sure what their capabilities are when it comes to parsing tables in a PDF, but it wouldn’t surprise me if they had something.

I’d start by looking at them but be warned: they have an unapologetically piss poor method for updating their libraries. I have had to rewrite code because they flat out DROP functionality when they release new versions. Not deprecated, just GONE. That said their support is alright and the tool-set is quite powerful.

I know they have libraries for .NET and Java. Beyond that I can’t say.

,

If in PHP, you can use

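// Run pdftotext to convert the PDF to text. It is probably installed if you're on Linux; if not, you can install it.
// The trailing "-" makes pdftotext write to stdout so that exec() can capture the lines; without it the text goes to a .txt file next to the PDF.
exec('pdftotext ' . escapeshellarg($filepath) . ' -', $outputAsArray);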

then

$text = implode("\n", $outputAsArray); // join the captured lines into one string (separator first; the reversed argument order no longer works in PHP 8)

then preg_replace is your friend.
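For illustration, here is a minimal sketch of that regex step, written in Python rather than PHP. It assumes the extracted text keeps one row per line in the layout shown in the question (number, 8-digit id, last name, first name, a single digit, then the rest). The example address in the question is redacted, so the "first.last" format, the `example.com` domain and the name `row_to_email` are all assumptions made up for this sketch.

```python
import re

# Assumed row layout, inferred from the three sample lines in the question:
# <number> <8-digit id> <LAST NAME> <FIRST NAME> <digit> <rest of line>
ROW = re.compile(r"^\s*\d+\s+\d{8}\s+(?P<last>\S+)\s+(?P<first>\S+)\s+\d\s")


def row_to_email(line, domain="example.com"):
    """Return 'first.last@domain' for a matching row, else None.

    The '.' separator and the default domain are placeholders; the real
    address format is not visible in the question.
    """
    m = ROW.match(line)
    if m is None:
        return None
    return f"{m['first']}.{m['last']}@{domain}".lower()


sample = """81 11005589 THING MAXIME 4 PC2I TR1 - MERCREDI DE 07H45 A 09H45 4A7
71 11007079 STUFF QUENTIN 1 PC2I TR1 - LUNDI DE 10H00 A 12H00 1B4
74 10506940 HAHA YEZHOU 2 PC2I TR1 - LUNDI DE 13H30 A 15H30 2D5"""

# Keep only the lines that matched the assumed layout.
emails = [e for e in map(row_to_email, sample.splitlines()) if e]
print(emails)
```

Names made of several words (for example a two-part last name) would not match this pattern as written, so real data may need a looser expression or a split on the id and the digit columns instead.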

,

You can’t just use a regular expression to parse PDF. You need to extract the text. There are many libraries that can do this for different languages.
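As a rough sketch of the extraction step, the following Python snippet shells out to the `pdftotext` command-line tool (from Poppler or Xpdf), which is assumed to be installed and on the PATH. The helper names `pdftotext_command` and `extract_lines` are made up for this example. The `-layout` option asks pdftotext to preserve the physical layout, which tends to keep table columns on the same line, and the final `-` sends the text to standard output.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the argument list for pdftotext.

    '-layout' keeps the original column layout as far as possible;
    '-' as the output file means "write the text to stdout".
    """
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def extract_lines(pdf_path):
    """Run pdftotext and return the extracted text as a list of lines.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the tool is not installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `extract_lines("list.pdf")` (a hypothetical file name) would give one string per line of extracted text, which can then be matched with a regular expression. Whether the rows come out cleanly depends on how the PDF was produced, so the output is worth inspecting by hand first.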

My company, Atalasoft, has a text extraction add-on for .NET — http://www.atalasoft.com/products/dotimage/pdf-reader

For Java, take a look at PDFTextStream from Snowtide. http://www.snowtide.com.

,

You cannot be sure that there is any structure in the PDF or that the text is visible. You really need to use an extraction tool. I wrote an article explaining what formatting is actually in a PDF file at http://www.jpedal.org/PDFblog/?p=228
