Learning Python: Converting the Data-Scraping Program to Multithreading

xiagu1


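The data-scraping program from the earlier post is finished, but in practice each scheduled run takes tens of seconds, which is a bit slow. After some reading I decided to make it multithreaded to speed it up, and learned that in Python a Queue can be used to hand the work out to multiple threads.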

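The references on Python multithreading and thread pools that I found by searching:

Chinese: http://prokee.com/?p=4

English: http://www.davidnaylor.co.uk/threaded-data-collection-with-python-including-examples.html

These two are nearly identical apart from the language. Both are pseudocode, and one may have borrowed from the other.

The following example, which deals with RSS, actually runs: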

http://www.doughellmann.com/PyMOTW/Queue/

The next one involves SQLite and also runs:

http://stackoverflow.com/questions/1506023/duplicate-insertions-in-database-using-sqlite-sqlalchemy-python

It cites the article below as its source, which runs as well:

http://www.halotis.com/2009/07/07/how-to-get-rss-content-into-an-sqlite-database-with-python-fast/

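With the references read, it was time to start: import the threading pieces and modify the program. The only part that really needs multithreading is the urlopen call. The database write handles only a hundred or so rows per run and in testing takes under a second, so it stays as is. The remaining parts would gain little from multithreading.

Based on the references above, the main multithreaded structure looks like this: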

import datetime
import threading
import time
import Queue

THREAD_LIMIT = 20
jobs = Queue.Queue(0)  # 0 = no size limit

# Module-level result store, read later by storedata.
# dealwith_datat fills it in: one dict per statistic, keyed by item id.
c = [{}, {}, {}, {}, {}]


def thread():
    """Worker: keep taking URLs off the queue until it is empty."""
    while True:
        try:
            url = jobs.get(False)  # False = don't wait
        except Queue.Empty:
            return
        xml = get_datat(url)
        # Process the page and write the results into c for storedata.
        dealwith_datat(xml)


def q1(url_price):
    # Queue up every URL.
    for url in url_price.values():
        jobs.put(url)

    # Start the worker threads.
    for n in xrange(THREAD_LIMIT):
        t = threading.Thread(target=thread)
        t.start()
        print n

    # Wait until every worker has finished and the queue is drained.
    while threading.activeCount() > 1 or not jobs.empty():
        print datetime.datetime.now()
        time.sleep(1)
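
For readers on Python 3, the same queue-plus-workers structure can be sketched roughly as follows. This is only an illustration under stated assumptions: the module is named queue rather than Queue in Python 3, and fake_fetch, collect, run_pool and the example.invalid URLs are hypothetical stand-ins, not part of the original program. It also waits with join() instead of polling the active thread count.

```python
import queue
import threading

THREAD_LIMIT = 20


def fake_fetch(url):
    # Hypothetical stand-in for the real download: returns a canned XML string.
    return "<iid>%s</iid><price>10</price>" % url.rsplit("/", 1)[-1]


def run_pool(urls, fetch, handle, thread_limit=THREAD_LIMIT):
    """Fetch every URL using at most thread_limit worker threads."""
    jobs = queue.Queue()
    for url in urls:
        jobs.put(url)

    def worker():
        while True:
            try:
                url = jobs.get(block=False)  # do not wait on an empty queue
            except queue.Empty:
                return  # nothing left, so this thread finishes
            handle(fetch(url))

    threads = [threading.Thread(target=worker) for _ in range(thread_limit)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until every worker has returned


pages = []
pages_lock = threading.Lock()


def collect(xml):
    # Guard the shared list so concurrent appends stay explicit and safe.
    with pages_lock:
        pages.append(xml)


run_pool(["http://example.invalid/item/%d" % i for i in range(50)],
         fake_fetch, collect)
print(len(pages))
```

Fifty URLs are handled by twenty threads here, which is the reuse described in this post: each thread keeps pulling the next URL until the queue is empty.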

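get_datat, called in the worker above, is the modified fetch function. The original function took all the URLs at once and fetched them in a loop; this version fetches a single URL per call. At most 20 threads run at a time, and because each one keeps pulling from the queue until it is empty, the threads are reused across URLs.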

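import urllib2

def get_datat(url):
    """Fetch one URL and return the response body."""
    xmlr = urllib2.Request(url)
    price = urllib2.urlopen(xmlr)
    try:
        p_xml = price.read()
    finally:
        price.close()  # close the connection even if read() fails
    return p_xml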
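
On Python 3, urllib2 no longer exists and its functions live in urllib.request. A minimal sketch of the same single-URL fetch might look like this. The name get_data and the timeout argument are additions for illustration, and the data: URL is used only so the example runs without a network connection.

```python
import urllib.request


def get_data(url, timeout=10):
    """Download one URL and return the raw response body as bytes."""
    request = urllib.request.Request(url)
    # The with-statement closes the response even if read() raises.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# A data: URL keeps the demonstration offline; a real run would pass an http:// address.
sample = get_data("data:text/xml,%3Cprice%3E5%3C%2Fprice%3E")
print(sample)
```

Note that read() returns bytes in Python 3, so the body would need decoding before applying string regular expressions to it.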

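dealwith_datat is the modified processing function. Since each fetch now returns the data for one URL, it processes one page per call. A global variable c holds the results: the processed data is written straight into c. Note that c must actually be initialized with a value, not merely declared.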

import re

# c must be assigned a real value before any thread runs;
# a bare "global c" declaration alone does not create it.
c = [{}, {}, {}, {}, {}]

def dealwith_datat(price):
    """Extract the useful data from one page with regular expressions."""
    xmlprice = re.findall(r"<price>(\d+)</price>", price)
    iii = re.findall(r"<iid>(\d+)</iid>", price)[0]  # item id of this page
    print iii
    # Count how often each price occurs to find the most common one.
    zuiduo = {}
    for i in xmlprice:
        zuiduo[i] = zuiduo.get(i, 0) + 1
    most_common = max(zuiduo, key=zuiduo.get)
    xmlprice = [int(i) for i in xmlprice]
    # Store the results in c under this item id instead of returning them.
    c[0][iii] = min(xmlprice)        # lowest price
    c[1][iii] = max(xmlprice)        # highest price
    c[2][iii] = zuiduo[most_common]  # how often the most common price occurs
    c[3][iii] = most_common          # most common price (still a string)
    c[4][iii] = xmlprice[-1]         # last price listed on the page

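The reworked dealwith_datat produces the same results as the original function, stored in c rather than returned, but each call handles the data from a single URL.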
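
To make the five statistics concrete, here is a small Python 3 sketch of the same per-page processing applied to a made-up sample. The function name summarize, the dictionary keys and the sample XML are hypothetical. Unlike the code in this post, it returns the values instead of writing to c, converts the most common price to an integer, and returns None for a page with no item id or no prices.

```python
import re
from collections import Counter

PRICE_RE = re.compile(r"<price>(\d+)</price>")
IID_RE = re.compile(r"<iid>(\d+)</iid>")


def summarize(xml):
    """Return (item id, stats) for one page, or None if id or prices are missing."""
    iid_match = IID_RE.search(xml)
    prices = [int(p) for p in PRICE_RE.findall(xml)]
    if iid_match is None or not prices:
        return None
    # most_common(1) gives a one-element list of (value, count).
    mode_price, mode_count = Counter(prices).most_common(1)[0]
    stats = {
        "min": min(prices),
        "max": max(prices),
        "mode_count": mode_count,
        "mode_price": mode_price,
        "last": prices[-1],
    }
    return iid_match.group(1), stats


sample = ("<iid>42</iid><price>5</price><price>7</price>"
          "<price>5</price><price>9</price>")
print(summarize(sample))
```

Returning the values leaves it to the caller to decide how to merge results from several threads, which is one alternative to the shared global used here.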

That completes the multithreaded scraper. A run that used to take more than 30 seconds now finishes in a few seconds.
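
The reason for the speedup is that the threads spend most of their time waiting on the network, so the waits overlap. The sketch below imitates that with time.sleep in place of a real download. It uses concurrent.futures.ThreadPoolExecutor from the Python 3 standard library, which is a different tool from the hand-written queue in this post, and slow_fetch and the URLs are hypothetical. The timings it prints are for this simulation only and say nothing about the real program.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_fetch(url):
    # Hypothetical stand-in for a network request: sleep instead of downloading.
    time.sleep(0.05)
    return "body of " + url


urls = ["http://example.invalid/item/%d" % i for i in range(20)]

# One request after another: the waits add up.
start = time.perf_counter()
sequential = [slow_fetch(u) for u in urls]
sequential_seconds = time.perf_counter() - start

# Up to 20 requests in flight at once: the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    threaded = list(pool.map(slow_fetch, urls))  # results keep the input order
threaded_seconds = time.perf_counter() - start

print("sequential: %.2fs, threaded: %.2fs" % (sequential_seconds, threaded_seconds))
```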


2010-02-25 10:05