The Bitter Road to Scraping Google

News HACK_Learn

2019-11-27 14,829

0x00 Preface

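I tried writing a Google scraper a while back. I could not get past the captcha at the time, so I deleted it. I have not solved it this time either: the captcha cannot be bypassed, only avoided, so the goal is to run into it less often.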

0x001 Process

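What I want from a scraper is speed and stability. Google's nasty captcha mechanism forced me to give up on that and instead spend a lot of time on the problem of hitting the captcha again and again.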

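The first version did nothing but send requests. It never changed the User-Agent header and was called from multiple processes or threads. The IP got banned in a single run.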

import requests
# Naive request: fixed User-Agent, no delay, no proxy
rqt = requests.get(url='https://www.google.com/search?q=xxx&start=1', headers={'user-agent': 'Google Splider'}, timeout=3)

Afterwards I looked at a few articles and one project:

Some notes on scraping Google search results with Python:

https://juejin.im/post/5c2c6bbee51d450d5a01d70a

Google_search

https://github.com/MarioVilas/googlesearch

After reading both, I found that they use the same approach:

A random User-Agent header
A randomly chosen Google search subdomain

[Screenshot: the collected User-Agent headers]

Implementing both kinds of randomization is easy: put all the values into two arrays and pick from them with random.choice().

def read():
    # Yield one User-Agent string per line of user_agents.txt
    with open('user_agents.txt', 'r', encoding='utf-8') as dk:
        for r in dk:
            yield r.rstrip('\n')

def reads():
    # Yield one Google search domain per line of domain.txt
    with open('domain.txt', 'r', encoding='utf-8') as dk:
        for r in dk:
            yield r.rstrip('\n')


def fenpei(proxy, search, page, sleep):
    # Load both files into lists so random.choice() can pick from them
    user_agents = []
    google_searchs = []
    for ua in read():
        user_agents.append(ua)

    for domain in reads():
        google_searchs.append(domain)
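
The fragment above only loads the two lists. The random pick itself could be sketched as follows. This is a minimal illustration, not code from the author's repository: `pick_identity` and the sample values are hypothetical, and `rng` is only there so the choice can be made reproducible.

```python
import random

def pick_identity(user_agents, domains, rng=random):
    """Return a (headers, domain) pair chosen at random from the two lists."""
    if not user_agents or not domains:
        raise ValueError("both lists must be non-empty")
    headers = {"user-agent": rng.choice(user_agents)}
    return headers, rng.choice(domains)

# Illustrative values only; the article reads them from
# user_agents.txt and domain.txt.
SAMPLE_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
SAMPLE_DOMAINS = ["www.google.com", "www.google.co.uk"]

if __name__ == "__main__":
    headers, domain = pick_identity(SAMPLE_AGENTS, SAMPLE_DOMAINS)
    print(domain, headers["user-agent"])
```

Picking a fresh pair for every request, rather than once per run, is what keeps consecutive requests from looking identical.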

Randomization alone is still fragile. It does not hold up against Google's wretched captcha, so the next step is to add a delay as well.

import random
import requests
import time

def read():
    # Yield one User-Agent string per line of user_agents.txt
    with open('user_agents.txt', 'r', encoding='utf-8') as dk:
        for r in dk:
            yield r.rstrip('\n')

def reads():
    # Yield one Google search domain per line of domain.txt
    with open('domain.txt', 'r', encoding='utf-8') as dk:
        for r in dk:
            yield r.rstrip('\n')


def fenpei(proxy, search, page, sleep):
    user_agents = []
    google_searchs = []
    for ua in read():
        user_agents.append(ua)

    for domain in reads():
        google_searchs.append(domain)

    # Wait before every request to lower the chance of hitting the captcha
    time.sleep(int(sleep))
    # proxy is expected to be a host:port string
    proxy = {'http': 'http://{}'.format(proxy), 'https': 'https://{}'.format(proxy)}
    domains = random.choice(google_searchs)
    u_s = {'user-agent': random.choice(user_agents), 'Content-type': "text/html;charset=utf-8"}
    url = 'https://{}/search?hl=Chinese&q={}&btnG=Search&gbv=10&start={}'.format(domains, search, page)
    # verify=False is used below, so silence the resulting InsecureRequestWarning
    requests.packages.urllib3.disable_warnings(requests.packages.urllib3.exceptions.InsecureRequestWarning)
    rqt = requests.get(url=url, headers=u_s, allow_redirects=False, verify=False, proxies=proxy, timeout=30)
    return rqt.content
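
The listing above sleeps for a fixed number of seconds before each request. One possible variation, which is not the author's method, is to randomize the interval so that requests do not arrive on a regular beat. Whether this actually lowers the captcha rate is an assumption that would need testing. The names `jittered_delay` and `polite_sleep` are hypothetical.

```python
import random
import time

def jittered_delay(base, spread=0.5, rng=random):
    """Return a delay drawn uniformly from [base*(1-spread), base*(1+spread)]."""
    if base < 0 or not 0 <= spread <= 1:
        raise ValueError("base must be >= 0 and spread within [0, 1]")
    return rng.uniform(base * (1 - spread), base * (1 + spread))

def polite_sleep(base, spread=0.5, sleeper=time.sleep):
    """Sleep for a jittered interval and return the number of seconds slept."""
    delay = jittered_delay(base, spread)
    sleeper(delay)
    return delay
```

In `fenpei`, the call `time.sleep(int(sleep))` could then be swapped for `polite_sleep(int(sleep))`. The `sleeper` parameter exists only so the function can be exercised without really waiting.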

0x002 Full code

GitHub repository: https://github.com/422926799/note/tree/master/%E8%87%AA%E5%B7%B1%E5%86%99%E7%9A%84%E5%B7%A5%E5%85%B7/google%E6%8A%93%E5%8F%96

[Screenshot: a successful scrape]

[Screenshot: running into the captcha]
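
Because the request earlier is sent with `allow_redirects=False`, a captcha challenge has to be recognized from the raw response before the IP is hammered further. The sketch below is a heuristic under stated assumptions that do not come from the article: that a block shows up either as HTTP 429 or as a redirect whose `Location` path starts with `/sorry/`. The function name `looks_blocked` is hypothetical.

```python
from urllib.parse import urlparse

REDIRECT_CODES = (301, 302, 303, 307, 308)

def looks_blocked(status_code, location=None):
    """Heuristic: True if the response resembles a captcha interstitial.

    Assumption (not from the article): a block appears as HTTP 429, or as
    a redirect whose Location path starts with /sorry/.
    """
    if status_code == 429:
        return True
    if status_code in REDIRECT_CODES and location:
        return urlparse(location).path.startswith("/sorry/")
    return False
```

With a `requests` response this could be called as `looks_blocked(rqt.status_code, rqt.headers.get('Location'))`, and a positive result could trigger a longer pause or a switch to another proxy.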

IP proxy pools are supported

[Screenshot: a successful scrape]
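
The proxy pool support mentioned above is not shown in the article. A minimal sketch of one way to rotate through a pool follows. It assumes the pool is a list of `host:port` strings for plain HTTP proxies, which is why both keys use the `http://` scheme here (the article's listing uses `https://` for the second key). The name `proxy_cycle` and the addresses are hypothetical.

```python
import itertools

def proxy_cycle(addresses):
    """Yield requests-style proxies dicts, cycling through the pool forever."""
    if not addresses:
        raise ValueError("the proxy pool is empty")
    for address in itertools.cycle(addresses):
        # Assumption: plain HTTP proxies, so both keys use http://.
        proxy_url = "http://{}".format(address)
        yield {"http": proxy_url, "https": proxy_url}

if __name__ == "__main__":
    pool = proxy_cycle(["10.0.0.1:8080", "10.0.0.2:3128"])
    for _ in range(3):
        print(next(pool))
```

Each request would then take the next entry, for example `requests.get(url, proxies=next(pool), timeout=30)`. Since `proxy_cycle` is a generator, the empty-pool error is raised on the first `next()` call rather than when the generator is created.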


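Author: 九世

Reference source: https://422926799.github.io/

Posted by: HACK_Learn

This article was published by a SecPulse column author. When reposting, please credit: https://www.secpulse.com/archives/119182.html

Tags: github, Python, User-Agent header, multithreading, multiprocessing, scraping, crawler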
