Requirements: Use OCR to recognize text in images. If an image contains any text, the backend lets it pass the initial screening and gives it priority. The requirements are relatively simple.
Tesseract OCR
Tesseract was originally developed between 1985 and 1994 at HP Laboratories in Bristol, UK, and at HP in Greeley, Colorado, USA. In 1996 it was modified further so it could be ported to Windows, and in 1998 parts of it were converted to C++. HP open-sourced Tesseract in 2005, and Google developed it from 2006 until November 2018.
Tesseract 4 adds a neural network (LSTM) based OCR engine that focuses on line recognition. It still supports the legacy Tesseract 3 engine, which works by recognizing character patterns. To stay compatible with Tesseract 3, select the legacy OCR engine mode (--oem 0). That mode also requires traineddata files that support the legacy engine, such as the files in the tessdata repository.
Calling Tesseract from C#
Two libraries are commonly used to call Tesseract from C#: Tesseract and TesseractOCR. TesseractOCR is a derivative of the Tesseract library, so the code of the two open source projects is largely similar. The difference is that TesseractOCR calls the latest version (5.5.0) of the native .dll dynamic link library, so TesseractOCR is recommended.
First, download the Simplified Chinese model, chi_sim.traineddata (download steps omitted).
The code is as follows:
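A minimal sketch of such a call, assuming the TesseractOCR NuGet package is installed and chi_sim.traineddata sits in a local ./tessdata folder. The folder path, the default image name test.png, and the helper ContainsText are hypothetical names chosen for illustration. The type and member names (Engine, Language.ChineseSimplified, Pix.Image.LoadFromFile, page.Text) follow the TesseractOCR README as best understood and should be checked against the installed version.

```csharp
using System;
using TesseractOCR;
using TesseractOCR.Enums;

internal static class OcrScreening
{
    // Hypothetical helper: the screening rule from the requirements,
    // i.e. an image qualifies if OCR finds any non-whitespace text.
    internal static bool ContainsText(string recognizedText) =>
        !string.IsNullOrWhiteSpace(recognizedText);

    private static void Main(string[] args)
    {
        // Assumed locations; adjust to where the model and image actually live.
        var tessdataDir = "./tessdata";
        var imagePath = args.Length > 0 ? args[0] : "test.png";

        // EngineMode.Default lets Tesseract choose the engine based on the model files.
        using var engine = new Engine(tessdataDir, Language.ChineseSimplified, EngineMode.Default);
        using var image = TesseractOCR.Pix.Image.LoadFromFile(imagePath);
        using var page = engine.Process(image);

        var text = page.Text;
        Console.WriteLine($"Mean confidence: {page.MeanConfidence}");
        Console.WriteLine(text);
        Console.WriteLine(ContainsText(text)
            ? "Text found: pass initial screening"
            : "No text found");
    }
}
```

The engine, image, and page objects wrap native resources, so each is disposed with a using declaration. In a backend that screens many images, the engine would normally be created once and reused, since loading the model is the expensive step.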
To test it, I took a screenshot from the Internet. The original picture is as follows:
The OCR recognition results are as follows:
(End)