parsing – Parse a pdf file – Education Career Blog

I have a PDF like this one:

81 11005589 THING MAXIME 4 PC2I TR1 - MERCREDI DE 07H45 A 09H45 4A7
71 11007079 STUFF QUENTIN 1 PC2I TR1 - LUNDI DE 10H00 A 12H00 1B4
74 10506940 HAHA YEZHOU 2 PC2I TR1 - LUNDI DE 13H30 A 15H30 2D5

And I need to parse it. By that I mean: take the 4th column, append the 3rd column, and make an email address out of them. For example, with the first line: [email protected]

I tried to copy and paste it into Google Docs, but everything landed in a single cell instead of being split across multiple cells.

I really don’t know what to do here. I guess a regex would help, but what would I run it with?


If you are using Java, look at iText; if C#, iTextSharp. Both are free for non-commercial use.


I’ve used Aspose before for parsing PDFs, Word docs, Excel docs, and some other formats. I’m not sure what their capabilities are when it comes to parsing tables in a PDF, but it wouldn’t surprise me if they had something.

I’d start by looking at them but be warned: they have an unapologetically piss poor method for updating their libraries. I have had to rewrite code because they flat out DROP functionality when they release new versions. Not deprecated, just GONE. That said their support is alright and the tool-set is quite powerful.

I know they have libraries for .NET and Java. Beyond that I can’t say.


If in PHP, you can use

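// Run pdftotext to turn the PDF into text. It is probably installed if you're on Linux; if not, you can install it.
// The trailing "-" sends the text to stdout so exec() can capture it (without it, pdftotext writes a .txt file next to the PDF).
// -layout keeps the original physical layout, so each table row stays on one line.
exec('pdftotext -layout ' . escapeshellarg($filepath) . ' -', $outputAsArray);

// Join the captured lines to have the output as a single string.
$text = implode("\n", $outputAsArray);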

then preg_replace is your friend.
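
As a rough sketch of that last step, preg_match is probably a better fit than preg_replace, since you want to pull columns out of each line. The function name lineToEmail, the "." separator, the lower-casing and the example.com domain are all assumptions on my part, since the real address format isn't shown; adjust them to whatever your addresses look like.

```php
<?php
// Build an address from one extracted line: 4th column, then 3rd column.
// Assumptions (not from the question): a "." separator, lower-casing,
// and the placeholder domain example.com.
function lineToEmail(string $line, string $domain = 'example.com'): ?string
{
    // Columns: row number, id, last name, first name, then the rest of the line.
    if (!preg_match('/^\s*\d+\s+\d+\s+(\S+)\s+(\S+)\s/', $line, $m)) {
        return null; // header, blank line, or unexpected layout
    }
    return strtolower($m[2] . '.' . $m[1]) . '@' . $domain;
}

// Sample lines from the question; in practice this would be $outputAsArray.
$lines = [
    '81 11005589 THING MAXIME 4 PC2I TR1 - MERCREDI DE 07H45 A 09H45 4A7',
    '71 11007079 STUFF QUENTIN 1 PC2I TR1 - LUNDI DE 10H00 A 12H00 1B4',
    '74 10506940 HAHA YEZHOU 2 PC2I TR1 - LUNDI DE 13H30 A 15H30 2D5',
];

// Drop the lines that did not match, keep the rest.
$emails = array_values(array_filter(array_map('lineToEmail', $lines)));
print_r($emails);
```

Note that this treats each name as a single word. A surname or first name containing a space would be split across columns, so those lines would need a stricter pattern or manual checking.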


You can’t just use a regular expression to parse a PDF. You need to extract the text first. There are many libraries that can do this for different languages.

My company, Atalasoft, has a text extraction add-on for .NET —

For Java, take a look at PDFTextStream from Snowtide.


You cannot be sure there is any structure in the PDF or that the text is visible. You really need to use an extraction tool. I wrote an article explaining what formatting is actually in a PDF file at
