Recognizing Text in Images with tesseract-ocr in PHP

February 11, 2023

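I recently got a rather unusual requirement: use PHP to recognize the text in images uploaded by users and extract the order number, order time, and amount.
After some analysis, the core of the task is recognizing Chinese characters and digits; once the text is recognized, regular expressions can pull out the fields according to fixed rules.
Writing an OCR engine directly in PHP is unrealistic, and nobody has that kind of spare time. Why not use an existing tool?
I looked at a few options:
1. Use an existing commercial service. Baidu and Alibaba both offer one, but they cost money!
2. Use OpenCV. It is free, but the development work is too complex!
3. Use tesseract-ocr for the recognition.
Of these, tesseract-ocr is by far the simplest, so I am recording how I used it here.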
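The "recognize first, then extract with regular expressions" step could look roughly like the sketch below. The field labels (订单号, ¥), the date format, and the helper name extractOrderFields() are all assumptions made for illustration; real screenshots will need their own patterns.

```php
<?php
declare(strict_types=1);

/**
 * Pull order fields out of OCR text with regular expressions.
 * Hypothetical helper: the labels and formats below are assumptions.
 */
function extractOrderFields(string $ocrText): array
{
    $patterns = [
        // "订单号" followed by an optional half-width or full-width colon, then 8+ digits.
        'order_no'   => '/订单号\s*[::]?\s*(\d{8,})/u',
        // A timestamp such as 2023-02-10 21:44 or 2023-02-10 21:44:24.
        'order_time' => '/(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}(?::\d{2})?)/u',
        // A yen/yuan sign (half-width or full-width) followed by an amount.
        'amount'     => '/[¥¥]\s*(\d+(?:\.\d{1,2})?)/u',
    ];

    $fields = [];
    foreach ($patterns as $name => $pattern) {
        // Store null when a field is not found instead of failing.
        $fields[$name] = preg_match($pattern, $ocrText, $m) === 1 ? $m[1] : null;
    }
    return $fields;
}

// Made-up sample text standing in for real OCR output.
$sample = "订单号: 2023021012345678\n下单时间 2023-02-10 21:44:24\n实付金额 ¥128.50\n";
print_r(extractOrderFields($sample));
```

Returning null for missing fields lets the caller decide whether to reject the upload or ask the user for a clearer image.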

1. Installing tesseract-ocr

tesseract-ocr is open source, and in principle builds exist for both Linux and Windows.
GitHub: https://github.com/tesseract-ocr/tesseract

dnf install tesseract

Install the tesseract language packs

dnf install tesseract-langpack-chi_sim.noarch  # Simplified Chinese
dnf install tesseract-langpack-chi_tra.noarch  # Traditional Chinese
dnf install tesseract-langpack-eng.noarch  # English
dnf install tesseract-langpack-jpn.noarch  # Japanese

If you need other languages:

yum search tesseract-langpack
[root@TEST2023] ➜ ~ # yum search tesseract-langpack
Last metadata expiration check: 1:57:11 ago on Fri 10 Feb 2023 09:44:24 PM CST.
========================================= Name Matched: tesseract-langpack =========================================
tesseract-langpack-afr.noarch : Afrikaans language data for tesseract-tessdata
tesseract-langpack-amh.noarch : Amharic language data for tesseract-tessdata
tesseract-langpack-ara.noarch : Arabic language data for tesseract-tessdata
tesseract-langpack-asm.noarch : Assamese language data for tesseract-tessdata
tesseract-langpack-aze.noarch : Azerbaijani language data for tesseract-tessdata
tesseract-langpack-aze_cyrl.noarch : Azerbaijani (Cyrillic) language data for tesseract-tessdata
tesseract-langpack-bel.noarch : Belarusian language data for tesseract-tessdata
tesseract-langpack-ben.noarch : Bengali language data for tesseract-tessdata
tesseract-langpack-bod.noarch : Tibetan (Standard) language data for tesseract-tessdata
tesseract-langpack-bos.noarch : Bosnian language data for tesseract-tessdata
tesseract-langpack-bre.noarch : Breton language data for tesseract-tessdata
tesseract-langpack-bul.noarch : Bulgarian language data for tesseract-tessdata
tesseract-langpack-cat.noarch : Catalan language data for tesseract-tessdata
tesseract-langpack-ceb.noarch : Cebuano language data for tesseract-tessdata
tesseract-langpack-ces.noarch : Czech language data for tesseract-tessdata
tesseract-langpack-chi_sim.noarch : Chinese (Simplified) language data for tesseract-tessdata
tesseract-langpack-chi_sim_vert.noarch : Chinese (Simplified, Vertical) language data for tesseract-tessdata
tesseract-langpack-chi_tra.noarch : Chinese (Traditional) language data for tesseract-tessdata
tesseract-langpack-chi_tra_vert.noarch : Chinese (Traditional, Vertical) language data for tesseract-tessdata
tesseract-langpack-chr.noarch : Cherokee language data for tesseract-tessdata
tesseract-langpack-cos.noarch : Corsican language data for tesseract-tessdata
tesseract-langpack-cym.noarch : Welsh language data for tesseract-tessdata
tesseract-langpack-dan.noarch : Danish language data for tesseract-tessdata
tesseract-langpack-deu.noarch : German language data for tesseract-tessdata
tesseract-langpack-div.noarch : Dhivehi; Maldivian language data for tesseract-tessdata
tesseract-langpack-dzo.noarch : Dzongkha language data for tesseract-tessdata
tesseract-langpack-ell.noarch : Greek language data for tesseract-tessdata
tesseract-langpack-eng.noarch : English language data for tesseract-tessdata
tesseract-langpack-enm.noarch : Middle English (1100-1500) language data for tesseract-tessdata
tesseract-langpack-epo.noarch : Esperanto language data for tesseract-tessdata
tesseract-langpack-est.noarch : Estonian language data for tesseract-tessdata
tesseract-langpack-eus.noarch : Basque language data for tesseract-tessdata
tesseract-langpack-fao.noarch : Faroese language data for tesseract-tessdata
tesseract-langpack-fas.noarch : Persian (Farsi) language data for tesseract-tessdata
tesseract-langpack-fil.noarch : Filipino; Pilipino language data for tesseract-tessdata
tesseract-langpack-fin.noarch : Finnish language data for tesseract-tessdata
tesseract-langpack-fra.noarch : French language data for tesseract-tessdata
tesseract-langpack-frk.noarch : Fraktur language data for tesseract-tessdata
tesseract-langpack-frm.noarch : Middle French (ca. 1400-1600) language data for tesseract-tessdata
tesseract-langpack-fry.noarch : Western Frisian language data for tesseract-tessdata
tesseract-langpack-gla.noarch : Gaelic; Scottish Gaelic language data for tesseract-tessdata
tesseract-langpack-gle.noarch : Irish language data for tesseract-tessdata
tesseract-langpack-glg.noarch : Galician language data for tesseract-tessdata
tesseract-langpack-grc.noarch : Ancient Greek language data for tesseract-tessdata
tesseract-langpack-guj.noarch : Gujarati language data for tesseract-tessdata
tesseract-langpack-hat.noarch : Haitian language data for tesseract-tessdata
tesseract-langpack-heb.noarch : Hebrew language data for tesseract-tessdata
tesseract-langpack-hin.noarch : Hindi language data for tesseract-tessdata
tesseract-langpack-hrv.noarch : Croatian language data for tesseract-tessdata
tesseract-langpack-hun.noarch : Hungarian language data for tesseract-tessdata
tesseract-langpack-hye.noarch : Armenian language data for tesseract-tessdata
tesseract-langpack-iku.noarch : Inuktitut language data for tesseract-tessdata
tesseract-langpack-ind.noarch : Indonesian language data for tesseract-tessdata
tesseract-langpack-isl.noarch : Icelandic language data for tesseract-tessdata
tesseract-langpack-ita.noarch : Italian language data for tesseract-tessdata
tesseract-langpack-ita_old.noarch : Italian (Old) language data for tesseract-tessdata
tesseract-langpack-jav.noarch : Javanese language data for tesseract-tessdata
tesseract-langpack-jpn.noarch : Japanese language data for tesseract-tessdata
tesseract-langpack-jpn_vert.noarch : Japanese language data for tesseract-tessdata
tesseract-langpack-kan.noarch : Kannada language data for tesseract-tessdata
tesseract-langpack-kat.noarch : Georgian language data for tesseract-tessdata
tesseract-langpack-kat_old.noarch : Georgian (Old) language data for tesseract-tessdata
tesseract-langpack-kaz.noarch : Kazakh language data for tesseract-tessdata
tesseract-langpack-khm.noarch : Khmer language data for tesseract-tessdata
tesseract-langpack-kir.noarch : Kyrgyz language data for tesseract-tessdata
tesseract-langpack-kmr.noarch : Kurmanji language data for tesseract-tessdata
tesseract-langpack-kor.noarch : Korean language data for tesseract-tessdata
tesseract-langpack-kor_vert.noarch : Korean language data for tesseract-tessdata
tesseract-langpack-lao.noarch : Lao language data for tesseract-tessdata
tesseract-langpack-lat.noarch : Latin language data for tesseract-tessdata
tesseract-langpack-lav.noarch : Latvian language data for tesseract-tessdata
tesseract-langpack-lit.noarch : Lithuanian language data for tesseract-tessdata
tesseract-langpack-ltz.noarch : Luxembourgish language data for tesseract-tessdata
tesseract-langpack-mal.noarch : Malayalam language data for tesseract-tessdata
tesseract-langpack-mar.noarch : Marathi language data for tesseract-tessdata
tesseract-langpack-mkd.noarch : Macedonian language data for tesseract-tessdata
tesseract-langpack-mlt.noarch : Maltese language data for tesseract-tessdata
tesseract-langpack-mon.noarch : Mongolian language data for tesseract-tessdata
tesseract-langpack-mri.noarch : Maori language data for tesseract-tessdata
tesseract-langpack-msa.noarch : Malay language data for tesseract-tessdata
tesseract-langpack-mya.noarch : Burmese language data for tesseract-tessdata
tesseract-langpack-nep.noarch : Nepali language data for tesseract-tessdata
tesseract-langpack-nld.noarch : Dutch language data for tesseract-tessdata
tesseract-langpack-nor.noarch : Norwegian language data for tesseract-tessdata
tesseract-langpack-oci.noarch : Occitan language data for tesseract-tessdata
tesseract-langpack-ori.noarch : Oriya language data for tesseract-tessdata
tesseract-langpack-pan.noarch : Panjabi language data for tesseract-tessdata
tesseract-langpack-pol.noarch : Polish language data for tesseract-tessdata
tesseract-langpack-por.noarch : Portuguese language data for tesseract-tessdata
tesseract-langpack-pus.noarch : Pashto language data for tesseract-tessdata
tesseract-langpack-que.noarch : Quechuan language data for tesseract-tessdata
tesseract-langpack-ron.noarch : Romanian language data for tesseract-tessdata
tesseract-langpack-rus.noarch : Russian language data for tesseract-tessdata
tesseract-langpack-san.noarch : Sanskrit language data for tesseract-tessdata
tesseract-langpack-sin.noarch : Sinhala language data for tesseract-tessdata
tesseract-langpack-slk.noarch : Slovakian language data for tesseract-tessdata
tesseract-langpack-slv.noarch : Slovenian language data for tesseract-tessdata
tesseract-langpack-snd.noarch : Sindhi language data for tesseract-tessdata
tesseract-langpack-spa.noarch : Spanish language data for tesseract-tessdata
tesseract-langpack-spa_old.noarch : Spanish (Old) language data for tesseract-tessdata
tesseract-langpack-sqi.noarch : Albanian language data for tesseract-tessdata
tesseract-langpack-srp.noarch : Serbian language data for tesseract-tessdata
tesseract-langpack-srp_latn.noarch : Serbian (Latin) language data for tesseract-tessdata
tesseract-langpack-sun.noarch : Sundanese language data for tesseract-tessdata
tesseract-langpack-swa.noarch : Swahili language data for tesseract-tessdata
tesseract-langpack-swe.noarch : Swedish language data for tesseract-tessdata
tesseract-langpack-syr.noarch : Syriac language data for tesseract-tessdata
tesseract-langpack-tam.noarch : Tamil language data for tesseract-tessdata
tesseract-langpack-tat.noarch : Tatar language data for tesseract-tessdata
tesseract-langpack-tel.noarch : Telugu language data for tesseract-tessdata
tesseract-langpack-tgk.noarch : Tajik language data for tesseract-tessdata
tesseract-langpack-tha.noarch : Thai language data for tesseract-tessdata
tesseract-langpack-tir.noarch : Tigrinya language data for tesseract-tessdata
tesseract-langpack-ton.noarch : Tongan language data for tesseract-tessdata
tesseract-langpack-tur.noarch : Turkish language data for tesseract-tessdata
tesseract-langpack-uig.noarch : Uyghur language data for tesseract-tessdata
tesseract-langpack-ukr.noarch : Ukrainian language data for tesseract-tessdata
tesseract-langpack-urd.noarch : Urdu language data for tesseract-tessdata
tesseract-langpack-uzb.noarch : Uzbek language data for tesseract-tessdata
tesseract-langpack-uzb_cyrl.noarch : Uzbek (Cyrillic) language data for tesseract-tessdata
tesseract-langpack-vie.noarch : Vietnamese language data for tesseract-tessdata
tesseract-langpack-yid.noarch : Yiddish language data for tesseract-tessdata
tesseract-langpack-yor.noarch : Yoruba language data for tesseract-tessdata

Install whichever ones you need.

2. Usage

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
[root@TEST2023] ➜ ~ # tesseract --help
Usage:
tesseract --help | --help-extra | --version
tesseract --list-langs
tesseract imagename outputbase [options...] [configfile...]

OCR options:
-l LANG[+LANG] Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
--help Show this help message.
--help-extra Show extra help for advanced users.
--version Show version information.
--list-langs List available languages for tesseract engine.

Check the tesseract version:

[root@TEST2023] ➜ ~ # tesseract -v
tesseract 4.1.1
leptonica-1.76.0
libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 1.0.0
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE
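If a PHP script needs to know which version it is talking to, the first line of this output can be parsed. This is only a sketch: parseTesseractVersion() is a hypothetical helper, and it assumes the first line keeps the "tesseract 4.1.1" shape shown above.

```php
<?php
declare(strict_types=1);

/**
 * Extract the version number from the version output.
 * Hypothetical helper; returns null if no version line is found.
 */
function parseTesseractVersion(string $versionOutput): ?string
{
    // Look for a line such as "tesseract 4.1.1" (an optional "v" prefix is tolerated).
    if (preg_match('/^tesseract\s+v?(\d+(?:\.\d+)*)/m', $versionOutput, $m) === 1) {
        return $m[1];
    }
    return null;
}

$output  = "tesseract 4.1.1\nleptonica-1.76.0\n";
$version = parseTesseractVersion($output);

// version_compare() is PHP's built-in comparison for dotted version strings.
if ($version !== null && version_compare($version, '4.0.0', '>=')) {
    echo "tesseract {$version} is new enough", PHP_EOL;
}
```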

List the installed language packs:

[root@TEST2023] ➜ ~ # tesseract --list-langs
List of available languages (3):
chi_sim
chi_sim_vert
eng
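Before running OCR from PHP it can be useful to confirm that the language pack you need is actually installed. The sketch below parses output of the shape shown above; parseInstalledLanguages() is a hypothetical helper, and it assumes the header line starts with "List of available languages".

```php
<?php
declare(strict_types=1);

/**
 * Turn the text printed by "tesseract --list-langs" into an array of codes.
 * Hypothetical helper written for illustration only.
 */
function parseInstalledLanguages(string $listLangsOutput): array
{
    $langs = [];
    foreach (preg_split('/\R/', trim($listLangsOutput)) as $line) {
        $line = trim($line);
        // Skip blank lines and the "List of available languages (N):" header.
        if ($line === '' || strpos($line, 'List of available languages') === 0) {
            continue;
        }
        $langs[] = $line;
    }
    return $langs;
}

$output = "List of available languages (3):\nchi_sim\nchi_sim_vert\neng\n";
$langs  = parseInstalledLanguages($output);

if (!in_array('chi_sim', $langs, true)) {
    echo "chi_sim is missing, install tesseract-langpack-chi_sim first", PHP_EOL;
}
```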

Text recognition:
Take the image 1.jpg as an example.

[root@TEST2023] ➜ ~ # tesseract 1.jpg out_text -l chi_sim
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 236
[root@TEST2023] ➜ ~ # cat out_text.txt

价格说明

* 划线价格

商品的专柜价、吊牌价、正品零售价、厂商指导价或该商品的曾
经展示过的销售价等,并非原价,仅供参考。

"未划线价格

商品的实时标价,不因表述的差异改变性质。具体成交价格根据
商品参加活动,或会员使用优惠券、积分等发生变化,最终以订
单结算页价格为准。

* 商家详情页 (含主图) 以图片或文字形式标注的一口价、促销
价、优惠价等价格可能是在使用优惠券、满减或特定优惠活动和
时段等情形下的价格,具体请以结算页面的标价、优惠条件或活
动规则为准。

* 此说明仅当出现价格比较时有效,具体请参见《淘宝价格发布规
范》。若商家单独对划线价格进行说明的,以商家的表述为准。

[root@TEST2023] ➜ ~ #

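With standard fonts the recognition rate is quite high. Here is another image whose layout is less regular.

(Extra blank lines have been removed from the result.)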

[root@TEST2023] ➜ ~ # tesseract 2.png out_text -l chi_sim
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 502
Detected 38 diacritics
[root@TEST2023] ➜ ~ # cat out_text.txt
LA计2

Q 顺庆区7个重大项目开工 | ChatGPT

南充”新冠康复 热点 科技 手机 数码 小术 三

人 温故而知乐 关注
OO 9小时前 山西省作家协会会员 国际资讯领域创.…

美国开始利用气球抹黑中国了! 声称穿越美国领
土的中国气球配备了数据收集设备,但美国忘
了,中国不是伊拉克。.…全文

[必 分享 O 1 此 6 字 收藏

iPhone 15 Ultra 外观上曝光,价格却让侯而却
步,小米11U用户被劝退

south 科技君 54评论 昨天

[root@TEST2023] ➜ ~ #

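Here the recognized text is clearly much less clean, so extracting content from it will still take some work with regular expressions.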
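Before applying regular expressions to output like this, it usually helps to tidy the text first. The sketch below drops blank lines and trims each line; normalizeOcrText() is a hypothetical helper, and deciding what counts as noise is an assumption that depends on your images.

```php
<?php
declare(strict_types=1);

/**
 * Clean up raw OCR text: unify line endings, trim lines, drop blank lines.
 * Hypothetical helper written for illustration only.
 */
function normalizeOcrText(string $text): string
{
    // Convert every kind of line break (\r\n, \r, \n) to "\n".
    $text = preg_replace('/\R/u', "\n", $text) ?? $text;
    // Strip spaces and tabs at the start and end of each line.
    $text = preg_replace('/^[ \t]+|[ \t]+$/m', '', $text) ?? $text;
    // Collapse runs of newlines so no empty lines remain.
    $text = preg_replace('/\n{2,}/', "\n", $text) ?? $text;
    return trim($text);
}

$raw = "价格说明\r\n\r\n\r\n  * 划线价格  \n\n";
echo normalizeOcrText($raw), PHP_EOL;
```

Spaces inside a line are left alone on purpose, because in the second sample they separate unrelated words.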

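If the results are not good enough, you can also train tesseract on your own data. That is outside the scope of this article, mainly because the default results are good enough for my use and I did not bother going further.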

3. Using tesseract from PHP

GitHub: https://github.com/thiagoalessio/tesseract-ocr-for-php

composer require thiagoalessio/tesseract_ocr

3.1 Basic usage:

require_once 'vendor/autoload.php';  // Composer autoloader, needed once per script
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('text.png'))
    ->run();

3.2 Single language

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('chinese.png'))
    ->lang('chi_sim')
    ->run();

3.3 Multiple languages

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('mixed-languages.png'))
    ->lang('eng', 'jpn', 'chi_sim')
    ->run();

3.4 Guided recognition (character allowlist)

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('8055.png'))
    ->allowlist(range('A', 'Z'))
    ->run();


Recognition result: BOSS

For more usage, see the project's GitHub page.

3.5 You can also run it directly from PHP with exec

exec("tesseract 1.jpg out_text -l chi_sim");
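Since the image here comes from a user upload, a slightly more defensive version of the exec call might look like the sketch below. runTesseract() is a hypothetical helper. It assumes a Linux shell (for 2>/dev/null) and uses tesseract's special output base "stdout", so the text is printed rather than written to out_text.txt.

```php
<?php
declare(strict_types=1);

/**
 * Run tesseract on one image and return the recognized text.
 * Hypothetical helper; $binary is a parameter so a custom path can be passed.
 */
function runTesseract(string $image, string $lang = 'chi_sim', string $binary = 'tesseract'): string
{
    $command = sprintf(
        '%s %s stdout -l %s 2>/dev/null',   // discard warnings such as the DPI notice
        escapeshellcmd($binary),
        escapeshellarg($image),             // quote the user-influenced file name
        escapeshellarg($lang)
    );

    exec($command, $outputLines, $exitCode);

    // A non-zero exit code means tesseract is missing or could not read the image.
    if ($exitCode !== 0) {
        throw new RuntimeException("tesseract failed with exit code {$exitCode}");
    }
    return implode("\n", $outputLines);
}

try {
    echo runTesseract('1.jpg'), PHP_EOL;
} catch (RuntimeException $e) {
    fwrite(STDERR, $e->getMessage() . PHP_EOL);
}
```

Checking the exit code matters because exec() itself does not raise an error when the command fails.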

4. Drawbacks

Recognition of non-standard fonts, such as handwriting, is very poor.
It uses a lot of CPU while running and is slow.

路灯