Scrapy day1

1. Install the environment

pip install scrapy
# -i https://pypi.tuna.tsinghua.edu.cn/simple  (a mirror inside mainland China; appending it speeds up the download)
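To confirm the install worked, ask Scrapy for its version:

scrapy version
# prints the installed version, e.g. "Scrapy 2.x"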

2. Create a project

Open a cmd terminal in the folder where the project should live.

scrapy startproject ***  (*** is the project name)
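For example, the spider code in step 4 imports from a project named Scrapy1, so the matching command here would be:

scrapy startproject Scrapy1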

3. Create a spider

Enter the project folder you just created and open a cmd terminal there.

scrapy genspider *** XXXXX.com
# *** is the spider name; XXXXX.com is the target site's domain
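For the Douban Top 250 spider built in the next step, the concrete command is:

scrapy genspider douban movie.douban.com
# generates spiders/douban.py with name, allowed_domains and start_urls pre-filled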

4. Write the spider file

Current directory structure:

  • Project folder
    • scrapy.cfg
    • Folder with the same name as the project
      • spiders
        • __init__.py
        • <spider name>.py (the file to edit)
      • __init__.py
      • items.py
      • middlewares.py
      • pipelines.py
      • settings.py

import scrapy
from scrapy import Selector

from Scrapy1.items import MovieItem


class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]  # URL(s) to start crawling from

    def parse(self, response):
        sel = Selector(response)
        # sel.css(...):   CSS selectors
        # sel.xpath(...): XPath selectors
        # sel.re(...):    regular expressions
        list_items = sel.css('#content > div > div.article > ol > li')
        for li in list_items:
            # The fields could also be passed straight to the constructor:
            # movie_item = MovieItem(title=..., rank=..., subject=...)
            movie_item = MovieItem()
            movie_item['title'] = li.css('span.title::text').extract_first()
            movie_item['rank'] = li.css('span.rating_num::text').extract_first()
            movie_item['subject'] = li.css('span.inq::text').extract_first()
            yield movie_item
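As an aside, current Scrapy versions expose the same selector methods directly on the response object, and .get() is the modern spelling of .extract_first(). The parse method above could equally be written like this (a sketch with the same behavior):

    def parse(self, response):
        # No explicit Selector needed; response.css works directly
        for li in response.css('#content > div > div.article > ol > li'):
            movie_item = MovieItem()
            movie_item['title'] = li.css('span.title::text').get()
            movie_item['rank'] = li.css('span.rating_num::text').get()
            movie_item['subject'] = li.css('span.inq::text').get()
            yield movie_item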

5. Define the item (data object)

Open items.py in the folder with the same name as the project.

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    rank = scrapy.Field()
    subject = scrapy.Field()
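An Item behaves like a dict whose keys are restricted to the declared Fields, which is why the spider assigns with movie_item['title'] = .... A quick illustration with hypothetical values (runnable in a Python shell inside the project):

from Scrapy1.items import MovieItem

item = MovieItem(title='some title', rank='9.0', subject='some quote')  # hypothetical values
print(item['title'])  # 'some title'
print(dict(item))     # convert to a plain dict
item['year'] = 2024   # raises KeyError: 'year' was never declared as a Field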

6. Configure settings.py

# Spoof the request headers (User-Agent)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.69"

# Number of concurrent requests
CONCURRENT_REQUESTS = 2

# Delay between requests, in seconds
DOWNLOAD_DELAY = 3
# Randomize the delay: each wait is drawn from 0.5x to 1.5x DOWNLOAD_DELAY (1.5 to 4.5 s here)
RANDOMIZE_DOWNLOAD_DELAY = True
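If you'd rather not change these values globally, Scrapy also accepts per-spider overrides through the custom_settings class attribute; a minimal sketch on the spider from step 4:

class DoubanSpider(scrapy.Spider):
    name = "douban"
    # Same keys as settings.py, but they apply to this spider only
    custom_settings = {
        "CONCURRENT_REQUESTS": 2,
        "DOWNLOAD_DELAY": 3,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
    }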

7. Run the project

Open a cmd terminal in the project folder.

scrapy crawl **** -o xxxx.csv
# **** is the spider name; xxxx is the output file name (the extension can be changed to json or xml)

# scrapy crawl **** --nolog  runs the spider without printing the log
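With the spider from step 4, a concrete run that writes the results to CSV looks like this:

scrapy crawl douban -o douban.csv
# -o appends to the file; in Scrapy 2.0+ use -O to overwrite it instead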

Scrapy day2