I have a PDF whose content looks like this:
81 11005589 THING MAXIME 4 PC2I TR1 - MERCREDI DE 07H45 A 09H45 4A7
71 11007079 STUFF QUENTIN 1 PC2I TR1 - LUNDI DE 10H00 A 12H00 1B4
74 10506940 HAHA YEZHOU 2 PC2I TR1 - LUNDI DE 13H30 A 15H30 2D5
And I need to parse it. By that I mean: take the 4th column, append the 3rd column, and make an email address out of the result. For example, with the first line: [email protected]
I tried to copy/paste it into Google Docs, but everything was pasted into a single cell instead of being split across multiple cells.
I really don’t know what to do here. I guess a regex would help me, but used with what?
I’ve used Aspose before for parsing PDFs, Word docs, Excel docs, and some other document types. I’m not sure what their capabilities are when it comes to parsing tables in a PDF, but it wouldn’t surprise me if they had something.
I’d start by looking at them but be warned: they have an unapologetically piss poor method for updating their libraries. I have had to rewrite code because they flat out DROP functionality when they release new versions. Not deprecated, just GONE. That said their support is alright and the tool-set is quite powerful.
I know they have libraries for .NET and Java. Beyond that I can’t say.
If you’re using PHP, you can use
// Run pdftotext to convert the PDF to text. The trailing "-" sends the text to stdout so exec() can capture it.
// pdftotext is probably already installed if you're on Linux; if not, you can install it.
exec('pdftotext ' . escapeshellarg($filepath) . ' -', $outputAsArray);
// Join the captured lines into a single string.
$text = implode("\n", $outputAsArray);
then preg_replace is your friend.
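To make the regex step concrete, here is a minimal sketch of what could be done with the text once it has been extracted. It is written in Python only for illustration; the same pattern should carry over to PHP's preg_match_all. The domain `example.com` and the `.` separator are placeholders I made up, since the real address format isn't shown, and the pattern assumes every name is a single upper-case word as in the sample.

```python
import re

# Hypothetical values: the real address format is not shown in the thread.
EMAIL_DOMAIN = "example.com"
SEPARATOR = "."

# Two numeric columns, then two upper-case words (the 3rd and 4th columns),
# then another numeric column. Multi-word or hyphenated names would need
# a looser pattern than this one.
RECORD = re.compile(r"\b\d+\s+\d+\s+([A-Z]+)\s+([A-Z]+)\s+\d+\b")


def build_emails(text):
    """Return one address per record found in the extracted text."""
    emails = []
    for third, fourth in RECORD.findall(text):
        # 4th column first, then the 3rd column, both lower-cased.
        emails.append(f"{fourth.lower()}{SEPARATOR}{third.lower()}@{EMAIL_DOMAIN}")
    return emails


# The sample from the question, joined into one string the way a
# copy/paste tends to deliver it.
sample = (
    "81 11005589 THING MAXIME 4 PC2I TR1 - MERCREDI DE 07H45 A 09H45 4A7 "
    "71 11007079 STUFF QUENTIN 1 PC2I TR1 - LUNDI DE 10H00 A 12H00 1B4 "
    "74 10506940 HAHA YEZHOU 2 PC2I TR1 - LUNDI DE 13H30 A 15H30 2D5"
)

for address in build_emails(sample):
    print(address)
```

Because the pattern matches on whitespace rather than on line breaks, it should behave the same whether the extractor puts each record on its own line or runs them together.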
You can’t just use a regular expression to parse a PDF. You need to extract the text first. There are many libraries that can do this for different languages.
My company, Atalasoft, has a text extraction add-on for .NET — http://www.atalasoft.com/products/dotimage/pdf-reader
For Java, take a look at PDFTextStream from Snowtide. http://www.snowtide.com.
You cannot be sure there is any structure in the PDF or that the text is visible. You really need to use an extraction tool. I wrote an article explaining what formatting is actually in a PDF file at http://www.jpedal.org/PDFblog/?p=228