On its own, Scrapy can only fetch static pages; with scrapy-splash it can also render and parse JavaScript-driven content.

一、Setting Up the Docker Server

For background on Docker itself, see this site's other Docker articles.

Scrapy-Splash talks to the Splash HTTP API, so you need a running Splash instance; the easiest way to get one is with Docker:

$ docker run -d -p 8050:8050 --restart=always --name=splash -m 200m scrapinghub/splash
# -m 200m caps the container's memory at 200 MB

Once the container is running on the server, Splash is reachable at IP:port (e.g. http://123.206.211.100:8050 ).
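Splash exposes its rendering service over plain HTTP, so you can smoke-test the instance from the command line before wiring it into Scrapy. A minimal sketch, assuming the hypothetical host address used above; `render.html` is Splash's endpoint that returns the page HTML after JavaScript has run:

```shell
# Hypothetical host:port -- substitute your own Splash server's address.
SPLASH_HOST="http://123.206.211.100:8050"

# render.html returns the DOM after JavaScript execution;
# wait=2 gives the page's scripts two seconds to run first.
RENDER_URL="${SPLASH_HOST}/render.html?url=http://example.com&wait=2"
echo "$RENDER_URL"

# To actually fetch it (requires network access to the server):
#   curl -s "$RENDER_URL" | head
```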

二、The Scrapy Project

1. Install scrapy-splash

$ pip install scrapy-splash

2. Configuration (settings.py)

Add the Splash server address:

SPLASH_URL = 'http://123.206.211.100:8050'
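`SPLASH_URL` points the middleware at the Splash HTTP API; each `SplashRequest` is ultimately issued against an endpoint such as `render.html`, with its `args` passed as query parameters. A hypothetical helper (not part of scrapy-splash) that makes this mapping concrete using only the standard library:

```python
from urllib.parse import urlencode

# Hypothetical illustration: how SplashRequest-style args map onto
# Splash's render.html HTTP endpoint (the same API the middleware calls).
def splash_render_url(splash_url, target_url, **args):
    # The target page URL plus any extra args (e.g. wait=3) become the query string.
    query = urlencode({"url": target_url, **args})
    return f"{splash_url}/render.html?{query}"

print(splash_render_url("http://123.206.211.100:8050",
                        "http://example.com", wait=3))
```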

Enable the Splash middlewares:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    ...
}

Other settings:

# Splash-aware duplicate filter and HTTP cache storage
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Log cookie handling in SplashCookiesMiddleware
SPLASH_COOKIES_DEBUG = True

3. Use SplashRequest in spider.py

import scrapy
from scrapy_splash import SplashRequest

class SpiderS1(scrapy.Spider):
    name = "s1_spider"

    def start_requests(self):
        urls = ['http://sports.sina.com.cn/g/seriea/2017-05-23/doc-ifyfkqiv6736172.shtml',
                'http://sports.sina.com.cn/basketball/nba/2017-05-23/doc-ifyfkqiv6683532.shtml']
        requests = []
        for url in urls:
            url = url.strip()
            # wait=3: let the page's JavaScript run for 3 seconds before rendering
            request = SplashRequest(url, callback=self.parse, args={'wait': 3})
            requests.append(request)
        return requests

    def parse(self, response):
        self.log(response.url)
        ...

Usage is straightforward; for the full set of request parameters, see Scrapy&JavaScript integration through Splash.
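Because Splash hands back fully rendered HTML, the `parse` callback works on it like any static response. A sketch of the kind of stdlib-only extractor such a callback might delegate to (the `TitleParser` class and the sample HTML are hypothetical, for illustration):

```python
from html.parser import HTMLParser

# Hypothetical extractor: collects the text inside the <title> tag
# of rendered HTML, using only the standard library.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>Sina Sports</title></head><body></body></html>")
print(parser.title)  # → Sina Sports
```

In a real spider you would more likely use Scrapy's own `response.css()` / `response.xpath()` selectors; the point is only that after Splash renders the page, the JavaScript-generated markup is plain HTML.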