
Calling pytesseract from Python to recognize a website's CAPTCHA

2019-10-04 20:44

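Using Python to get the Base64 encoding of an image; the source code (base64-pic.py) is listed further down this page.

File handling

I. Introduction to pytesseract

File-handling workflow (three steps: open, operate, close)

1. About pytesseract

The latest version of pytesseract is 0.1.6.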

Python-tesseract is a wrapper for google's Tesseract-OCR
( ). It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, whereas tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text in stead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.

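In short:

a. Python-tesseract is a standalone wrapper package built on Google's Tesseract-OCR;

b. Python-tesseract recognizes the text in an image file and returns the recognized text as the function's return value;

c. According to the README, tesseract-ocr by default supports only TIFF and BMP images; Python-tesseract relies on PIL to read JPEG, GIF, PNG, and other formats.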

File-handling step 1: open the file, get a file handle, and assign it to a variable.

2. Installing pytesseract

INSTALLATION:

Prerequisites:
* Python-tesseract requires python 2.5 or later or python 3.
* You will need the Python Imaging Library (PIL). Under Debian/Ubuntu, this is
the package "python-imaging" or "python3-imaging" for python3.
* Install google tesseract-ocr from .
You must be able to invoke the tesseract command as "tesseract". If this
isn't the case, for example because tesseract isn't in your PATH, you will
have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.
Under Debian/Ubuntu you can use the package "tesseract-ocr".

Installing via pip:
See the [pytesseract package page]()
```
$> sudo pip install pytesseract
```

In other words:

a. Python-tesseract supports Python 2.5 and later;

b. Python-tesseract needs PIL (Python Imaging Library) installed in order to support more image formats;

c. Python-tesseract needs the tesseract-ocr package installed; see the previous post for details.

 

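To sum up, here is how pytesseract works:

1. As mentioned in the previous post, running the command line tesseract.exe 1.png output -l eng recognizes the text in 1.png and writes the result to output.txt;

2. pytesseract wraps that process: it invokes tesseract.exe automatically, reads the contents of output.txt, and returns them as the function's return value.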
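The wrapping described above can be sketched in a few lines of Python 3. This is only an illustration of the idea, not pytesseract's actual code: the helper names `build_tesseract_command` and `ocr_via_cli` are hypothetical, and the sketch assumes a `tesseract` executable is on the PATH.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base, lang=None, tesseract_cmd="tesseract"):
    """Build the argument list for: tesseract <image> <output_base> [-l <lang>]."""
    command = [tesseract_cmd, image_path, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command


def ocr_via_cli(image_path, lang=None):
    """Run tesseract, then read <output_base>.txt and return its text.

    Hypothetical helper: assumes the tesseract executable is installed.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_base = os.path.join(tmp_dir, "output")
        subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
        with open(output_base + ".txt", encoding="utf-8") as result_file:
            return result_file.read().strip()
```

Calling `ocr_via_cli("1.png", lang="eng")` would correspond to the command line shown in step 1, with the text file read back in for you.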

File-handling step 2: operate on the file through the handle.

II. Using pytesseract

 USAGE:
```
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('test.png')))
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
```

 

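As you can see:

1. The core of the code is the image_to_string function, which also accepts the -l eng option (through lang) and the -psm option (through config).

Usage:
image_to_string(Image.open('test.png'), lang="eng", config="-psm 7")

2. pytesseract calls Image, which is why PIL is required; in practice tesseract.exe itself supports JPEG, PNG, and other image formats.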

 

Example code that recognizes the CAPTCHA of a public website (please don't use it for anything shady):

# -*- coding: utf-8 -*-
# Python 2 script. htmltool is the author's own helper module and is not shown here.
__author__ = 'zhongtang'

import urllib
import urllib2
import cookielib
import math
import random
import time
import os
import re
import gzip
import socket
import StringIO
import htmltool
from pytesseract import *
from PIL import Image
from PIL import ImageEnhance


class orclnypcg:
    def __init__(self):
        self.baseUrl = ''
        self.ht = htmltool.htmltool()
        self.curPath = self.ht.getPyFileDir()
        self.authCode = ''

    def initUrllib2(self):
        try:
            cookie = cookielib.CookieJar()
            cookieHandLer = urllib2.HTTPCookieProcessor(cookie)
            httpHandLer = urllib2.HTTPHandler(debuglevel=0)
            httpsHandLer = urllib2.HTTPSHandler(debuglevel=0)
        except:
            raise
        else:
            opener = urllib2.build_opener(cookieHandLer, httpHandLer, httpsHandLer)
            opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
            urllib2.install_opener(opener)

    def urllib2Navigate(self, url, data={}):
        # Connection helper that retries on failure
        tryTimes = 0
        bodydata, headerdata = '', None  # returned unchanged if every attempt fails
        while True:
            if (tryTimes > 20):
                print u"Still unable to connect after many attempts; giving up"
                break
            tryTimes = tryTimes + 1  # count every attempt so repeated failures cannot loop forever
            try:
                if (data == {}):
                    req = urllib2.Request(url)
                else:
                    req = urllib2.Request(url, urllib.urlencode(data))
                response = urllib2.urlopen(req)
                bodydata = response.read()
                headerdata = response.info()
                if headerdata.get('Content-Encoding') == 'gzip':
                    rdata = StringIO.StringIO(bodydata)
                    gz = gzip.GzipFile(fileobj=rdata)
                    bodydata = gz.read()
                    gz.close()
            except urllib2.HTTPError, e:
                print 'HTTPError[%s]\n' % e.code
            except urllib2.URLError, e:
                print 'URLError[%s]\n' % e.reason
            except socket.error:
                print u"Connection failed, retrying"
            else:
                break
        return bodydata, headerdata

    def randomCodeOcr(self, filename):
        image = Image.open(filename)
        # ImageEnhance can improve the recognition rate
        #enhancer = ImageEnhance.Contrast(image)
        #enhancer = enhancer.enhance(4)
        image = image.convert('L')
        ltext = ''
        ltext = image_to_string(image)
        # Strip illegal characters, keeping only letters and digits
        ltext = re.sub(r"\W", "", ltext)
        print u'[%s] recognized CAPTCHA: [%s]!!!' % (filename, ltext)
        image.save(filename)
        #print ltext
        return ltext

    def getRandomCode(self):
        # Start fetching CAPTCHAs
        i = 0
        while (i <= 100):
            i += 1
            # Build the CAPTCHA URL
            randomUrlNew = '%s/CommonPage/Code.aspx?%s' % (self.baseUrl, random.random())
            # Build the local file name for the CAPTCHA image
            filename = '%s.png' % (i)
            filename = os.path.join(self.curPath, filename)
            jpgdata, jpgheader = self.urllib2Navigate(randomUrlNew)
            if len(jpgdata) <= 0:
                print u'Failed to fetch the CAPTCHA!\n'
                return False
            f = open(filename, 'wb')
            f.write(jpgdata)
            #print u"saved image:", fileName
            f.close()
            self.authCode = self.randomCodeOcr(filename)


# Main program
orcln = orclnypcg()
orcln.initUrllib2()
orcln.getRandomCode()
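The post-processing step in randomCodeOcr (keeping only letters and digits from the OCR result) can be sketched as a small Python 3 helper. Note that `\W` leaves underscores in place, so this sketch uses an explicit character class instead; the function names and the 4-character CAPTCHA length are assumptions made for illustration only.

```python
import re


def clean_captcha_text(raw_text):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw_text)


def looks_valid(text, expected_length=4):
    """Cheap sanity check; expected_length is an assumed CAPTCHA length."""
    return len(text) == expected_length and text.isalnum()
```

A result that fails `looks_valid` is a natural signal to fetch a fresh CAPTCHA and try again rather than submit a bad guess.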

 

File-handling step 3: close the file.

III. Tweaking the pytesseract code

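When the program above runs on Windows, a black console window flashes on screen each time tesseract is invoked, which is not very friendly.

I made a small change to pytesseract.py (under C:\Python27\Lib\site-packages\pytesseract) to hide that window.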

# modified by zhongtang hide console window
# new code
IS_WIN32 = 'win32' in str(sys.platform).lower()
startupinfo = None  # stays None off Windows so Popen still runs there
if IS_WIN32:
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
proc = subprocess.Popen(command,
        stderr=subprocess.PIPE, startupinfo=startupinfo)
'''
# old code
proc = subprocess.Popen(command,
   stderr=subprocess.PIPE)
'''
# modified end
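The same idea can be kept outside the library as a small Python 3 helper that only adds the Windows-specific options when they exist. This is a sketch, not part of pytesseract; `hidden_console_kwargs` is a hypothetical name.

```python
import subprocess
import sys


def hidden_console_kwargs():
    """Return extra Popen/run keyword arguments that hide the console on Windows."""
    if not sys.platform.startswith("win"):
        return {}  # STARTUPINFO only exists on Windows
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
    return {"startupinfo": startupinfo}


# Demonstration with a harmless child process instead of tesseract
completed = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    **hidden_console_kwargs()
)
output_text = completed.stdout.decode().strip()
```

Returning a dictionary keeps the call site identical on every platform, so there is no branch where the process object ends up undefined.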

For the benefit of beginners, here is the full pytesseract.py as well; experts can skip it.

#!/usr/bin/env python
'''
Python-tesseract is an optical character recognition (OCR) tool for python.
That is, it will recognize and "read" the text embedded in images.

Python-tesseract is a wrapper for google's Tesseract-OCR
( ). It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, whereas tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text in stead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.


USAGE:
```
 > try:
 >     import Image
 > except ImportError:
 >     from PIL import Image
 > import pytesseract
 > print(pytesseract.image_to_string(Image.open('test.png')))
 > print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
```

INSTALLATION:

Prerequisites:
* Python-tesseract requires python 2.5 or later or python 3.
* You will need the Python Imaging Library (PIL). Under Debian/Ubuntu, this is
  the package "python-imaging" or "python3-imaging" for python3.
* Install google tesseract-ocr from .
  You must be able to invoke the tesseract command as "tesseract". If this
  isn't the case, for example because tesseract isn't in your PATH, you will
  have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.
  Under Debian/Ubuntu you can use the package "tesseract-ocr".

Installing via pip:
See the [pytesseract package page]()
$> sudo pip install pytesseract

Installing from source:
$> git clone git@github.com:madmaze/pytesseract.git
$> sudo python setup.py install


LICENSE:
Python-tesseract is released under the GPL v3.

CONTRIBUTERS:
- Originally written by [Samuel Hoffstaetter]()
- [Juarez Bochi]()
- [Matthias Lee]()
- [Lars Kistner]()

'''

# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'tesseract'

try:
    import Image
except ImportError:
    from PIL import Image
import subprocess
import sys
import tempfile
import os
import shlex

__all__ = ['image_to_string']


def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False, config=None):
    '''
    runs the command:
        `tesseract_cmd` `input_filename` `output_filename_base`

    returns the exit status of tesseract, as well as tesseract's stderr output

    '''
    command = [tesseract_cmd, input_filename, output_filename_base]

    if lang is not None:
        command += ['-l', lang]

    if boxes:
        command += ['batch.nochop', 'makebox']

    if config:
        command += shlex.split(config)

    # modified by zhongtang hide console window
    # new code
    IS_WIN32 = 'win32' in str(sys.platform).lower()
    startupinfo = None  # stays None off Windows so Popen still runs there
    if IS_WIN32:
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
        startupinfo.wShowWindow = subprocess.SW_HIDE
    proc = subprocess.Popen(command,
            stderr=subprocess.PIPE, startupinfo=startupinfo)
    '''
    # old code
    proc = subprocess.Popen(command,
            stderr=subprocess.PIPE)
    '''
    # modified end

    return (proc.wait(), proc.stderr.read())


def cleanup(filename):
    ''' tries to remove the given filename. Ignores non-existent files '''
    try:
        os.remove(filename)
    except OSError:
        pass


def get_errors(error_string):
    '''
    returns all lines in the error_string that start with the string "error"

    '''

    lines = error_string.splitlines()
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
    if len(error_lines) > 0:
        return '\n'.join(error_lines)
    else:
        return error_string.strip()


def tempnam():
    ''' returns a temporary file-name '''
    tmpfile = tempfile.NamedTemporaryFile(prefix="tess_")
    return tmpfile.name


class TesseractError(Exception):
    def __init__(self, status, message):
        self.status = status
        self.message = message
        self.args = (status, message)


def image_to_string(image, lang=None, boxes=False, config=None):
    '''
    Runs tesseract on the specified image. First, the image is written to disk,
    and then the tesseract command is run on the image. Resseract's result is
    read, and the temporary files are erased.

    also supports boxes and config.

    if boxes=True
        "batch.nochop makebox" gets added to the tesseract call
    if config is set, the config gets appended to the command.
        ex: config="-psm 6"

    '''

    if len(image.split()) == 4:
        # In case we have 4 channels, lets discard the Alpha.
        # Kind of a hack, should fix in the future some time.
        r, g, b, a = image.split()
        image = Image.merge("RGB", (r, g, b))

    input_file_name = '%s.bmp' % tempnam()
    output_file_name_base = tempnam()
    if not boxes:
        output_file_name = '%s.txt' % output_file_name_base
    else:
        output_file_name = '%s.box' % output_file_name_base
    try:
        image.save(input_file_name)
        status, error_string = run_tesseract(input_file_name,
                                             output_file_name_base,
                                             lang=lang,
                                             boxes=boxes,
                                             config=config)
        if status:
            #print 'test' , status,error_string
            errors = get_errors(error_string)
            raise TesseractError(status, errors)
        f = open(output_file_name)
        try:
            return f.read().strip()
        finally:
            f.close()
    finally:
        cleanup(input_file_name)
        cleanup(output_file_name)


def main():
    if len(sys.argv) == 2:
        filename = sys.argv[1]
        try:
            image = Image.open(filename)
            if len(image.split()) == 4:
                # In case we have 4 channels, lets discard the Alpha.
                # Kind of a hack, should fix in the future some time.
                r, g, b, a = image.split()
                image = Image.merge("RGB", (r, g, b))
        except IOError:
            sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
            exit(1)
        print(image_to_string(image))
    elif len(sys.argv) == 4 and sys.argv[1] == '-l':
        lang = sys.argv[2]
        filename = sys.argv[3]
        try:
            image = Image.open(filename)
        except IOError:
            sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
            exit(1)
        print(image_to_string(image, lang=lang))
    else:
        sys.stderr.write('Usage: python pytesseract.py [-l language] input_file\n')
        exit(2)


if __name__ == '__main__':
    main()

That's all.


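1. Working with files in Python

f = open('a.txt','r',encoding='gbk')  # on Windows the default encoding is gbk

open opens a file: it sends a request to the operating system to open it.

f lives in the application's memory and corresponds to the opened file.

f.read()

In r mode the file is not created automatically when it does not exist.

b mode reads bytes straight from disk.

f=open('a.txt','rb')

print(f.read().decode('utf-8'))

w: writing in text mode

When writing in text mode, specify the encoding explicitly. The file is created if it does not exist and emptied if it does (in effect a new file replaces the old one).

f=open('a.txt','w',encoding='utf-8')

print(f.writable())  # True. Note: w can only write, r can only read

a: appending in text mode

When appending in text mode, the file is created if it does not exist; if it exists, the cursor starts at the end and new text is appended there.

The Base64 script mentioned at the top of the page:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# base64-pic.py

import os, base64

# Read the image in binary mode and Base64-encode it
icon = open('ya.png','rb')
iconData = icon.read()
icon.close()
iconData = base64.b64encode(iconData).decode('ascii')

LIMIT = 60  # characters per output line
liIcon = []
while True:
    sLimit = iconData[:LIMIT]
    iconData = iconData[LIMIT:]
    liIcon.append('\'%s\'' % sLimit)
    if len(sLimit) < LIMIT:
        break

print(os.linesep.join(liIcon))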
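The line-splitting loop in base64-pic.py can also be written as a reusable Python 3 function. This is a sketch under the same 60-character limit; `split_base64_lines` is a hypothetical name, and unlike the script it never emits an empty trailing chunk.

```python
import base64


def split_base64_lines(data, width=60):
    """Base64-encode bytes and split the text into lines of at most `width` characters."""
    encoded = base64.b64encode(data).decode("ascii")
    return [encoded[start:start + width] for start in range(0, len(encoded), width)]


# 100 input bytes become 136 Base64 characters
lines = split_base64_lines(b"x" * 100)
```

Joining the lines back together and decoding them returns the original bytes, which is a quick way to check that nothing was lost in the split.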

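Base64-encoding images in Python

Many years ago I saved a web page and found that its images had not been saved as files, yet they still showed up when I opened the page.

At first I assumed it was some JavaScript trick (the page was full of scripts jumping around) that linked to somewhere else. On investigation I found a large block of incomprehensible encoded text in the source.

I guessed that this block was the image, but with what I knew at the time I could not work it out.

Now I finally understand!

Back to file handling: appending in text mode relies on the cursor, and text is inserted at the cursor position.

f=open('b.txt','a',encoding='utf-8')

print(f.tell())

How to test whether a file opened in a mode can be written:

print(f.writable())  # True
f.write('111\n')  # succeeds; f.readable() is False in a mode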
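The behaviour of the r, w, and a text modes described in these notes can be checked with a short Python 3 script. It is a minimal sketch that works in a temporary directory so no real files are touched; the file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "a.txt")

    # w: creates the file (or empties it) and can only write
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\n")
        can_read_in_w = f.readable()

    # a: the cursor starts at the end of the existing content
    with open(path, "a", encoding="utf-8") as f:
        start_pos = f.tell()
        f.write("second\n")

    # r: read-only, and the file must already exist
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
        can_write_in_r = f.writable()

    # Reopening with w empties the file again
    with open(path, "w", encoding="utf-8"):
        pass
    size_after_truncate = os.path.getsize(path)
```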

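r+ w+ a+

These modes allow writing while reading and reading while writing.

rb mode reads straight from disk, so the encoding mostly does not need to be considered.

f=open('a.txt','rb')

wb mode

f=open('a.txt','wb')

f.write('你好啊'.encode('utf-8'))

ab mode

This is also a write mode that writes at the end of the file; every write needs an encode() first.

A test example: converting text to Base64. This is a Python 2 session, and the output corresponds to the GBK-encoded bytes of the string.

>>> import base64

>>> ls_s='字符串文本'

>>> ls_t=base64.b64encode(ls_s)  # convert the text to base64

>>> print ls_t

19a3+7SuzsSxvg==

>>> print base64.b64decode(ls_t)  # decode

字符串文本

>>>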
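In Python 3, `base64.b64encode` takes bytes rather than text, so the text-to-Base64 example needs an explicit encode step. The sketch below assumes GBK, which is the encoding that reproduces the `19a3+7SuzsSxvg==` output shown in this article.

```python
import base64

text = "字符串文本"

# str -> bytes (GBK) -> Base64 bytes
encoded = base64.b64encode(text.encode("gbk"))

# Base64 bytes -> bytes -> str, using the same encoding on the way back
decoded = base64.b64decode(encoded).decode("gbk")
```

Encoding the same string as UTF-8 would give a different, longer Base64 result, which is why the encoding has to match on both sides.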

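Converting an image's contents to Base64

import base64

f=open(r'x:\1.jpg','rb')  # open the image file in binary mode

ls_f=base64.b64encode(f.read())  # read the file contents and convert them to Base64

f.close()

Writing the encoded text to a txt file

fw=open(r'x:\1.txt','wb')  # open an empty text file for writing ('wb' because b64encode returns bytes on Python 3)

fw.write(ls_f)

fw.flush()

fw.close()

How it looks in a web page

<html><body><img src="data:image/jpeg;base64,(the contents of the 1.txt written above go here)" /></body></html>

Note the image/jpeg part: if the image is of another type, change it accordingly to image/png, image/gif, image/bmp, and so on.

Back to file handling: notice that all the earlier examples call open without ever closing the file. That leaks resources, because the handles are never released; close files so they do not keep occupying operating-system resources.

Whether it is f.close() or any other operation, each one sends a command to the operating system to carry out.

Next comes with open:

with open('file.txt','w',encoding='utf-8') as f:
    f.write('1111\n')

A simple copy program

f=open('test.jpg','rb')

print(f.read())

with open('test.jpg','rb') as read_f, open('test1.jpg','wb') as write_f:
    for line in read_f:
        write_f.write(line)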
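The point about closing files can be checked directly: a `with` block closes its files on exit, even though no `close()` call appears. The Python 3 sketch below copies a small binary file in a temporary directory; the file names and contents are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = os.path.join(tmp_dir, "test.bin")
    target_path = os.path.join(tmp_dir, "test_copy.bin")

    # Create a small binary source file
    with open(source_path, "wb") as f:
        f.write(b"\x00\x01binary\ndata\xff")

    # Copy it piece by piece; both files are closed when the block ends
    with open(source_path, "rb") as read_f, open(target_path, "wb") as write_f:
        for line in read_f:
            write_f.write(line)

    files_closed = read_f.closed and write_f.closed

    with open(target_path, "rb") as f:
        copied = f.read()
```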

-------------------

The copy program again, this time taking the file names from the command line:

import sys

if len(sys.argv) < 3:
    print('Usage: copy.py source.file target.file')
    sys.exit()

with open(r'%s' % sys.argv[1],'rb') as read_f, \
        open(r'%s' % sys.argv[2],'wb') as write_f:
    for line in read_f:
        write_f.write(line)

Back to Base64 images: the data: URI is defined in the IETF standard RFC 2397.

The basic format of a data: URI is:

data:[<MIME-type>][;base64|charset=some_charset],<data>

Finally, a real example. Save the following as an .htm file and open it in a browser to see whether it shows an image:

<HTML><BODY><img

src="data:image/bmp;base64,Qk0eAgAAAAAAALYAAAAoAAAAEgAAABIAAAABAAgAAAAAAAAAAA DEDgAAxA4AACAAAAAgAAAAFSph/ySn4f8jRGP/mt70/zg1M/9DqMr/YWJg/yZqtf8EAwP/Xouz/5O qpP81Vn7/O4ut/xQqQP8TUJL/TmaU/y9ylv8nWKH/g3Z1/x3d9v+JvNj/LCYk/yQcGf8Nb7T/HUKV /w0bJ/8XW7L/YNT7/7b1/P+Niof/I43F/////wAfFhYVFhUWFhYVFRUVFhUWFR8AABYCBwcYARgJC wQNDQICDQ0NBAAABAIaEQcHCwsNDQICBA0NDQIEAAAEAgcHHg4CCw4eFAIPAg0NAgQAAAQLBwceDg sQEQMJHh4GDQ0CBAAAFQsaGgcRDwULCw8UCQ0CDQsEAAAEAAcREQ4LCQEXDgAAAQIZEAQAABUCGAc RGAAJAxsbDAsKAhkQBAAAFQwYGAcHAAMDAxQUFAodDQIVAAAWHgEYGB4YAwMDHBwDCgYNDRUAABUQ AQEYBx4RFBQCFA8CBAACFQAABAgeAR4RAREZAAAJFA4ADQsVAAAECBkBAR4FHhEFHBwUCgoCAhUAA BUICB4eHgEHHBwcHAMUCgICFQAAFQgIDQEeEwEPAwMUCQkAAgwWAAACCAgIAQETExoAGhcXHg4TGR UAABUGFggZDAUFBQwMEBAMBQ8EBgAAHwQKEhISEhISEhIdEhISHRIfAAA=" /></BODY></HTML>

It works in Chrome, Firefox, Opera, and IE8.

Here comes the sad part: in my tests it does not work in IE6, even though I remembered this feature as coming from the IE family, perhaps IE5.

From what I could find, IE8 has limited support (up to 32 KB), IE6 and IE7 do not support it, and IE5 was apparently the first to support it (it was dropped in IE6 over efficiency and security concerns).
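Building such a data: URI by hand is error-prone, so the steps described in this article can be wrapped in a small Python 3 helper. This is a sketch; `make_data_uri` and `make_img_tag` are hypothetical names, and the MIME type has to be supplied by the caller to match the image format.

```python
import base64


def make_data_uri(data, mime_type="image/jpeg"):
    """Return a data: URI of the form data:<mime>;base64,<payload> for the given bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return "data:%s;base64,%s" % (mime_type, encoded)


def make_img_tag(data, mime_type="image/jpeg"):
    """Wrap the data: URI in an <img> tag that can be pasted into an HTML page."""
    return '<img src="%s" />' % make_data_uri(data, mime_type)
```

For a real image you would read the file with `open(path, 'rb')` and pass the bytes in, for example `make_img_tag(image_bytes, 'image/png')`.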

A slight blemish, then~

Other file operations

f=open('a.txt','r',encoding='utf-8')

print(f.read(3))  # reads 3 characters: a Chinese character counts as one, and so does an English letter

f=open('a.txt','rb')

print(f.read(3).decode('utf-8'))  # reads three bytes
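The difference between `read(3)` in text mode and in binary mode is easy to see with a file that mixes Chinese and ASCII characters. A minimal Python 3 sketch, using a temporary file with made-up contents:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "a.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("你好ab")

    # Text mode: read(3) returns three characters
    with open(path, "r", encoding="utf-8") as f:
        three_chars = f.read(3)

    # Binary mode: read(3) returns three bytes, which is one Chinese character in UTF-8
    with open(path, "rb") as f:
        three_bytes = f.read(3)

first_char = three_bytes.decode("utf-8")
```

Reading a byte count that cuts a multi-byte character in half would make the `decode` call fail, so binary reads of UTF-8 text need care at chunk boundaries.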

 

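Character encoding:

What is character encoding?

Translating human characters into numbers that the computer understands.

Common character encoding standards:

    ascii

    gbk

    utf-8

    unicode

unicode ----> encode('utf-8') ----> bytes

bytes ----> decode('utf-8') ----> unicode

encode: encoding

decode: decoding

Rule: text must be decoded with the same encoding that was used to store it.

   PS: in memory, characters are always held as unicode; on disk they are bytes.

In Python 3 there are two kinds of strings:

x='abc'  # stored as unicode

y=x.encode('utf-8')  # stored as bytes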
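The encode/decode rule can be demonstrated in a few lines of Python 3. The sketch below encodes the same text with two codecs and shows what happens when the wrong one is used for decoding; the sample string is arbitrary.

```python
text = "你好"

utf8_bytes = text.encode("utf-8")   # 3 bytes per character here
gbk_bytes = text.encode("gbk")      # 2 bytes per character here

# Decoding with the matching codec restores the original text
round_trip = utf8_bytes.decode("utf-8")

# Decoding GBK bytes as UTF-8 fails outright
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# Decoding UTF-8 bytes as GBK does not fail here, but yields garbled text
garbled = utf8_bytes.decode("gbk")
```

The second case is the more dangerous one in practice, because no exception is raised and the garbled text can be written back to disk unnoticed.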

 

  

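1. What is a function?

2. Why use functions?
3. Kinds of functions: built-in functions and user-defined functions
4. How to define a function
  Syntax
  Defining functions with parameters, and when to use them
  Defining functions without parameters, and when to use them
  Defining empty functions, and when to use them

5. Calling functions
    How to call a function
    Return values
    Using parameters: formal and actual parameters, positional arguments, keyword arguments, default arguments, *args, **kwargs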




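6. Higher-order functions (function objects)

# Functions are first-class objects: a function can be passed around as data

def foo():
    print('from foo')

#1 Can be assigned to a variable
# f=foo
# print(f)
# f()

#2 Can be passed in as an argument
# def wrapper(func):
#     # print(func)
#     func()
# wrapper(foo)

#3 Can be returned from a function
def wrapper(func):
    return func
# res=wrapper(foo)
# print(res)

#4 Can be an element of a container type

# cmd_dic={
#     'func':foo
# }
#
# print(cmd_dic)
#
# cmd_dic['func']()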
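Storing functions in a dictionary, as in point 4, is a common way to dispatch on a command name. The following Python 3 sketch extends the commented example; `bar` and `run_command` are hypothetical additions, and the functions return strings instead of printing so the results are easy to inspect.

```python
def foo():
    return "from foo"


def bar():
    return "from bar"


def wrapper(func):
    """Return the function object unchanged, showing that functions can be return values."""
    return func


# Functions stored as dictionary values, looked up by name
cmd_dic = {"foo": foo, "bar": bar}


def run_command(name):
    """Look up a function by name and call it, with a fallback for unknown names."""
    handler = cmd_dic.get(name)
    if handler is None:
        return "unknown command: %s" % name
    return handler()
```

Adding a new command then only requires a new dictionary entry, with no change to `run_command` itself.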
