Tesseract OCR Read Horizontally rather than Vertically C#
We have a C# .NET app that uses Tesseract optical character recognition (OCR) on .tiff files. Here's an example:
We are outputting the data to a text file. However, Tesseract is reading the data in a vertical fashion. In the example image, it is reading a TIFF with 2 columns of data, and the data is being output by Tesseract like this:
type: date: address: city: state: owner: owner type: acreage: mortgage: 12345 2017-04-06 100 main st. some city some state john doe primary 10.25 yes
What we want is for Tesseract to read the TIFF file horizontally and produce output like this:
type:12345 date:2017-04-06 address:100 main st. city:some city state:some state owner:john doe owner type:primary acreage:10.25 mortgage:yes
We've tried the various page segmentation options in Tesseract, but they all produce the same result.
Has anyone run into the same issue? Does anyone have any ideas?
I found the solution. Tesseract has a set of config files. Inside several of these config files is the setting tessedit_pageseg_mode. This setting was set to 1 in those config files. 1 = automatic page segmentation with OSD. OSD = orientation and script detection.
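To see which config files set tessedit_pageseg_mode, something like the following C# sketch might help. It is only an illustration: the class name, method name, and default directory are hypothetical, and it assumes the config files are plain text with one "name value" setting per line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class PsmConfigScanner
{
    // Returns (file path, setting line) pairs for every line in the given
    // directory's files that starts with tessedit_pageseg_mode.
    public static List<(string FilePath, string Setting)> FindPageSegOverrides(string configDir)
    {
        var hits = new List<(string FilePath, string Setting)>();
        foreach (var file in Directory.EnumerateFiles(configDir))
        {
            foreach (var line in File.ReadLines(file))
            {
                if (line.TrimStart().StartsWith("tessedit_pageseg_mode", StringComparison.Ordinal))
                {
                    hits.Add((file, line.Trim()));
                }
            }
        }
        return hits;
    }

    static void Main(string[] args)
    {
        // Hypothetical default location; pass the real configs directory as the first argument.
        var dir = args.Length > 0 ? args[0] : @"C:\Program Files\Tesseract-OCR\tessdata\configs";
        foreach (var (filePath, setting) in FindPageSegOverrides(dir))
        {
            Console.WriteLine($"{filePath}: {setting}");
        }
    }
}
```

Any file it lists is a candidate for the kind of override described in this answer.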
Bottom line, these config file settings were overriding our command line argument. Once we removed the tessedit_pageseg_mode parameter from the config files, our command line argument of
-psm 6 worked and produced the output data in the desired format.
PSM = page segmentation mode. 6 = assume a single uniform block of text.
-psm 4 also worked.
PSM = page segmentation mode. 4 = assume a single column of text of variable sizes.
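For reference, here is a minimal sketch of passing the page segmentation mode to the tesseract command line tool from C#. It assumes tesseract is on the PATH and that the file names input.tiff and output are placeholders; the class and method names are hypothetical. It uses the -psm spelling from this answer, while newer Tesseract releases (4.x and later) spell the option --psm.

```csharp
using System;
using System.Diagnostics;

static class TesseractRunner
{
    // Builds the argument string in the form: "<image>" "<outputBase>" -psm <mode>
    public static string BuildArguments(string imagePath, string outputBase, int psm)
    {
        return $"\"{imagePath}\" \"{outputBase}\" -psm {psm}";
    }

    // Runs tesseract and returns its exit code. Tesseract writes <outputBase>.txt.
    public static int Run(string imagePath, string outputBase, int psm)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "tesseract", // assumption: tesseract is on PATH
            Arguments = BuildArguments(imagePath, outputBase, psm),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }

    static void Main()
    {
        // 6 = assume a single uniform block of text; 4 = single column of variable sizes.
        int exitCode = Run("input.tiff", "output", 6);
        Console.WriteLine($"tesseract exited with code {exitCode}");
    }
}
```

If the output is still read column by column, it may be worth checking again whether a config file is setting tessedit_pageseg_mode.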