Tesseract OCR: Read Horizontally Rather Than Vertically (C#)


We have a C# .NET app that uses Tesseract optical character recognition (OCR) on .tiff files. Here's an example of a TIFF file that Tesseract reads.

We are outputting the data to a text file. However, Tesseract is reading the data in a vertical fashion. In the example image, it is reading a TIFF with 2 columns of data, and the data is being output by Tesseract like this:

type: date: address: city: state: owner: owner type: acreage: mortgage: 12345 2017-04-06 100 main st. city state john doe primary 10.25 yes

What we want is for Tesseract to read the TIFF file horizontally and have the output look like this:

type:12345 date:2017-04-06 address:100 main st. city:some city state:some state owner:john doe owner type:primary acreage:10.25 mortgage:yes

We've tried the various page segmentation options for Tesseract, but they all produce the same result.

Has anyone run into the same issue? Does anyone have any ideas?

I found the solution. Tesseract has a set of config files. Inside several of these config files is the setting tessedit_pageseg_mode. This setting was set to 1 in those config files. 1 = automatic page segmentation with OSD. OSD = orientation and script detection.
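To see which config files carry that setting, a recursive search is usually enough. The following is a minimal sketch, not the exact procedure used here: the helper name find_psm_overrides is hypothetical, and the location of the config directory is an assumption that depends on how Tesseract was installed.

```shell
#!/bin/sh
# Hypothetical helper: print every file under a config directory that
# sets tessedit_pageseg_mode at the start of a line.
# $1 is the directory to search (its location is installation-specific).
find_psm_overrides() {
    config_dir="$1"
    # -r searches subdirectories, -l prints matching file names only.
    grep -rl '^[[:space:]]*tessedit_pageseg_mode' "$config_dir"
}
```

Calling it as find_psm_overrides /path/to/tessdata/configs (with the path adjusted to the local install) lists the files worth checking before changing anything.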

Bottom line, these config file settings were overriding our command line argument. Once we removed the tessedit_pageseg_mode parameter from the config files, our command line argument of -psm 6 worked and produced the output data in the desired format.

PSM = page segmentation mode. 6 = assume a single uniform block of text.

-psm 4 also worked.

PSM = page segmentation mode. 4 = assume a single column of text of variable sizes.
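For reference, here is a rough sketch of passing the page segmentation mode on the command line. The wrapper name run_ocr and the TESSERACT_BIN override are hypothetical and exist only for illustration. The -psm spelling matches the option used above; newer Tesseract releases (4.x and later) spell it --psm instead.

```shell
#!/bin/sh
# Hypothetical wrapper: run Tesseract on one image with an explicit
# page segmentation mode.
#   $1 = input image, $2 = output base name (Tesseract appends .txt),
#   $3 = PSM value, defaulting to 6 (single uniform block of text).
# TESSERACT_BIN is an assumed override for the executable path.
run_ocr() {
    input="$1"
    output_base="$2"
    psm="${3:-6}"
    # On Tesseract 4.x and later, replace -psm with --psm.
    "${TESSERACT_BIN:-tesseract}" "$input" "$output_base" -psm "$psm"
}
```

For example, run_ocr scan.tiff result 4 would write result.txt using mode 4, provided no config file still sets tessedit_pageseg_mode.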

