C# - Get PDF content


I want to read the content of PDF files. Before I get started, I want to know whether this is the right approach. The iTextSharp reader may be helpful in this case, so I converted the PDF to text using:

public static string PdfText(string path)
{
    PdfReader reader = new PdfReader(path);
    string text = string.Empty;
    // PDF page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        text += PdfTextExtractor.GetTextFromPage(reader, page);
    }
    reader.Close();
    return text;
}

I'm still wondering whether this approach seems OK, or whether I should convert the PDF to Excel and read the content I want from there instead.

Any professional thoughts would be appreciated.

With iText, you can choose a specific strategy for extracting text. Keep in mind that this is a heuristic process.

PDF documents contain only the instructions needed to render the document in a viewer. There is no concept of "text". It is more like "draw this character at position 420, 890".

In order for text extraction to work, the extractor needs to make guesses about when two characters are close enough that they should be concatenated, and when they should be kept apart.

Incidentally, iText bases that guess on the width of a single space character in the font being used.
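To make the guessing concrete, here is a minimal sketch in plain C# that does not use iText at all. The `Glyph` class, the `GapHeuristic.Join` method, and the "half a space width" threshold are all assumptions made up for illustration; they show the general idea of gap-based joining, not iText's actual internals.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical glyph record: the text drawn, its left edge, and its width,
// all in the same page units.
public sealed class Glyph
{
    public string Text { get; }
    public float X { get; }
    public float Width { get; }

    public Glyph(string text, float x, float width)
    {
        Text = text;
        X = x;
        Width = width;
    }
}

public static class GapHeuristic
{
    // Joins glyphs that lie on one line of text. A space is inserted wherever
    // the gap between two neighbours is wider than half a space character
    // (an assumed threshold, chosen for this sketch).
    public static string Join(IEnumerable<Glyph> glyphs, float spaceWidth)
    {
        var sorted = new List<Glyph>(glyphs);
        // Drawing order is not guaranteed to be reading order, so sort by X.
        sorted.Sort((a, b) => a.X.CompareTo(b.X));

        var result = new StringBuilder();
        for (int i = 0; i < sorted.Count; i++)
        {
            if (i > 0)
            {
                Glyph previous = sorted[i - 1];
                float gap = sorted[i].X - (previous.X + previous.Width);
                if (gap > spaceWidth / 2f)
                {
                    result.Append(' ');
                }
            }
            result.Append(sorted[i].Text);
        }
        return result.ToString();
    }
}
```

With a space width of 4, glyphs "H" (x = 0, width 5), "i" (x = 5, width 2), "y" (x = 12, width 5) and "o" (x = 17, width 5) have gaps of 0, 5 and 0, so only the middle gap becomes a space. Tightly kerned or widely tracked text can fool this kind of rule, which is why extraction stays a heuristic.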

Also keep in mind that there is such a thing as ActualText. This is text that is hidden in the document and used only for extraction. It makes it possible to have a document render the character "œ" (the ligature version) while it gets extracted as "oe" (the non-ligature version).

Depending on your input documents, you might want to try different implementations of ITextExtractionStrategy.
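As a rough sketch of how a strategy can be passed in, assuming iTextSharp 5.x (the `iTextSharp.text.pdf.parser` namespace): the `PdfTextReader` class and the `byLocation` flag are names made up for this example, while `SimpleTextExtractionStrategy` and `LocationTextExtractionStrategy` are the two implementations that ship with the library.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class; requires the iTextSharp 5.x package.
public static class PdfTextReader
{
    public static string ExtractText(string path, bool byLocation)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            var text = new StringBuilder();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A strategy accumulates the text it sees, so create a
                // fresh instance for every page.
                ITextExtractionStrategy strategy = byLocation
                    ? (ITextExtractionStrategy)new LocationTextExtractionStrategy()
                    : new SimpleTextExtractionStrategy();

                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
            return text.ToString();
        }
        finally
        {
            // Release the file even if extraction throws.
            reader.Close();
        }
    }
}
```

The simple strategy takes text in the order it appears in the content stream, while the location-based one orders chunks by their position on the page. Which one gives better results depends on how the PDF was produced, so it is worth trying both on a sample of your documents.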

