Scrapy使用教程

2021-04-07

## 执行过程

执行过程

  • 逻辑过程
	graph LR
	A[URL]-->B[Request]
	B[Request]-->C[Response]
	C[Response]-->D[Item]
	D[Item]-->E[More]
  • 代码执行过程: 我代码中用到的地方
	graph LR
	A[Spider.start_requests]-->B[Middleware.process_request]
	B[Middleware.process_request]-->C[Spider.parse]
	C[Spider.parse]-->D[Pipeline.process_item]
	D[Pipeline.process_item]-->E[MySQL]

问提的与解决

  1. start_urls只能配置一个数组,如果我想动态配置爬取的源头怎么办?
  • 重写 Spider 的 start_requests 函数,在里面设置请求
  1. 如果我通过请求结果,发起新的请求怎么写?
  • 直接在结果里面通过 yield发起请求
  1. 如果我通过请求结果发起新的请求,怎么处理新的响应结果?

    • 在Spider里新增需要的处理的函数,配置 response 参数即可。

      • yield scrapy.Request(next_page, callback=self.parse_new)

      • def parse_new(self, response, **kwargs)

  2. 如何把发起请求时候的参数传递到响应结果?

    • 通过 Request 的 meta 参数,传输一个字典过去。
    • 通过 response.meta 拿到你存入的值。
  3. 如何给Spdier配置特点的中间件(middleware)、管道(pipelines)?

    • custom_settings
      • ITEM_PIPELINES
      • SPIDER_MIDDLEWARES
      • DOWNLOADER_MIDDLEWARES
  4. 对于一些异步加载的页面怎么处理?

    • 方案一:splash
      • docker 启动一个 Splash 代理服务:
        • docker run -it -p 8050:8050 --rm scrapinghub/splash
      • settings 配置:
        SPLASH_URL = 'http://host:port' DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage' SPIDER_MIDDLEWARES = { 'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, } DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723, 'scrapy_splash.SplashMiddleware': 725, 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810 }
      • python 导入依赖 pip3 install scrapy-splash
      • Spider 中引入 from scrapy_splash import SplashRequest
      • 请求替换
        • 从:scrapy.Request(stock_url, self.parse, meta=params, dont_filter=True)
        • 到:SplashRequest(url, self.parse, meta=params, args={'wait': 0.5})
      • parse正常解析结果即可
    • 方案二:selenium
      • python 导入依赖

        • pip3 install selenium
      • 配置 custom_settings:

        'DOWNLOADER_MIDDLEWARES': { 'ht_msg.middlewares.SeleniumDownloadMiddleware': 500 }

        
        
      • 创建中间件

        from selenium.webdriver.chrome.options import Options
        from selenium import webdriver
        from scrapy.http import HtmlResponse
        
        class SeleniumDownloadMiddleware(object):
          def __init__(self):
           	chrome_options = Options()
           	chrome_options.add_argument('--headless')
           	chrome_options.add_argument('--disable-gpu')
           	chrome_options.add_argument("window-size=1024,768")
           	chrome_options.add_argument("--no-sandbox")
           	self.driver = webdriver.Chrome(chrome_options=chrome_options)
        
          def process_request(self, request, spider):
           	self.driver.get(request.url)
           	time.sleep(12)
           	source = self.driver.page_source
           	url = self.driver.current_url
           	return HtmlResponse(url=url, body=source, request=request, encoding="utf-8")
        

未完待续.....

参考的文件

  • scrapy:

    • 文档
      • https://doc.scrapy.org/en/latest/topics/request-response.html#response-objects
    • 教程
      • https://geek-docs.com/scrapy/scrapy-tutorials/scrapy-css-grammar.html
      • https://www.jianshu.com/p/df9c0d1e9087
    • 多种方案:爬取动态JS
      • https://zhuanlan.zhihu.com/p/130867872
    • splash教程
      • https://splash.readthedocs.io/en/stable/
    • 配置可变的start_urls
      • https://blog.csdn.net/sirobot/article/details/105360486
    • splash
      • https://github.com/scrapy-plugins/scrapy-splash
    • xpath/css
      • https://zhuanlan.zhihu.com/p/65143300
    • 不同spider配置不同的pipline:
      • https://blog.csdn.net/peiwang245/article/details/100071316
    • response传递参数:
      • https://blog.csdn.net/killeri/article/details/80255500
    • spider的setting优先级
      • https://blog.csdn.net/he_ranly/article/details/85092065
    • 中间件的区别
      • https://www.jianshu.com/p/3a9a385b9483
  • scrapy部署

    • APscheduler
      • https://apscheduler.readthedocs.io/en/latest/
    • subprocess
      • https://www.runoob.com/w3cnote/python3-subprocess.html
    • APscheduler+scarpy
      • https://apscheduler.readthedocs.io/en/latest/userguide.html
      • https://www.cnblogs.com/jdbc2nju/p/9563761.html
      • https://www.jianshu.com/p/c37c46de3168