
python | chrome-headless-shell

The best browser for web scraping.

If you visit chrome-for-testing, you will find the following three download types

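  • chrome
    • The regular browser
  • chrome-driver
    • The driver lets you control chrome or chrome-headless-shell from code
  • chrome-headless-shell
    • Chrome Headless Shell is a build of the Chrome browser without a user interface. It loads and renders pages like standard Chrome, but has no window.
    • chrome also has a headless mode, but this standalone build is the better choice for scraping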
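
To pick the matching downloads for all three binaries programmatically, here is a minimal sketch. It assumes the JSON manifest that, as far as I know, the Chrome for Testing project publishes at the URL below, with a `channels` → `Stable` → `downloads` layout. The post itself does not mention this endpoint, and `fetch_manifest` and `pick_download_url` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: this endpoint and its JSON layout are not taken from the post.
MANIFEST_URL = (
    "https://googlechromelabs.github.io/chrome-for-testing/"
    "last-known-good-versions-with-downloads.json"
)


def fetch_manifest(url=MANIFEST_URL):
    """Download and parse the manifest (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def pick_download_url(manifest, binary, platform, channel="Stable"):
    """Return the zip URL for one binary/platform pair, or None if absent."""
    downloads = manifest["channels"][channel]["downloads"]
    for entry in downloads.get(binary, []):
        if entry["platform"] == platform:
            return entry["url"]
    return None
```

Calling `pick_download_url(fetch_manifest(), "chrome-headless-shell", "linux64")` and then the same with `"chromedriver"` would give a browser and driver from the same release, which matters because the driver and browser versions need to match.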

mac

After downloading the mac packages, you can pretty much drive chrome with the driver right away

The code is as follows

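from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Point Selenium at the chrome-headless-shell binary instead of regular Chrome
chrome_options.binary_location = "path to chrome-headless-shell"
# Selenium 3 style call; `options=` replaces the deprecated `chrome_options=` keyword
driver = webdriver.Chrome(options=chrome_options, executable_path="path to chromedriver")
driver.get("https://baidu.com")
# Get the page source
page_source = driver.page_source

# Print the page source
print(page_source)
# quit() ends the session and also stops the chromedriver process
driver.quit()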
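
The call above uses the Selenium 3 signature. On Selenium 4 the driver path is passed through a `Service` object instead, so a rough equivalent might look like the sketch below. The helper names `missing_binaries`, `start_driver` and `fetch_page_source` are made up for illustration, and the path check is an extra precaution that the post does not describe.

```python
import os


def missing_binaries(*paths):
    """Return the paths that are not existing, executable files."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.X_OK))]


def start_driver(browser_path, driver_path):
    """Start chromedriver against chrome-headless-shell (Selenium 4 API)."""
    # Imported lazily so the path check works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    problems = missing_binaries(browser_path, driver_path)
    if problems:
        raise FileNotFoundError(f"not an executable file: {problems}")

    options = webdriver.ChromeOptions()
    options.binary_location = browser_path
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


def fetch_page_source(url, browser_path, driver_path):
    """Load one page and return its source, always shutting the driver down."""
    driver = start_driver(browser_path, driver_path)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Wrapping the work in `try`/`finally` means the chromedriver process is stopped even if `driver.get` raises, which is easy to forget in a long-running scraper.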

ubuntu

I am on Ubuntu 20.04 with a bare installation, so a lot of dependencies were not installed.

After downloading, cd into the extracted directory and run

./chrome-headless-shell --disable-gpu --dump-dom https://baidu.com

and check whether it prints the DOM of the page.
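
The same smoke test can be run from Python, which is handy for checking a server before the scraper starts. This is only a sketch: it uses the two flags from the command above, and `dump_dom_command` and `dump_dom` are hypothetical helper names.

```python
import subprocess


def dump_dom_command(binary, url):
    """Build the same command line as the shell example."""
    return [binary, "--disable-gpu", "--dump-dom", url]


def dump_dom(binary, url, timeout=60):
    """Run chrome-headless-shell once and return (exit code, stdout, stderr)."""
    result = subprocess.run(
        dump_dom_command(binary, url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.returncode, result.stdout, result.stderr
```

If the exit code is non-zero, the stderr text is the part worth reading: when a shared library is missing, the loader normally names it there.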

I was missing a great many libraries, so I ran the following

sudo apt install -y gconf-service libasound2 libatk1.0-0 libcups2 libgconf-2-4 libgtk-3-0 libnspr4 libx11-xcb1 libxcomposite1 libxdamage1 libxrandr2 libxslt1.1 libnss3 libxss1 libappindicator3-1 libindicator7 fonts-liberation xdg-utils
sudo apt-get install libatk1.0-0
sudo apt-get install libatk-bridge2.0-0
sudo apt-get install libxcomposite1
sudo apt-get install libxdamage1
sudo apt-get install libxfixes3
sudo apt-get install libxrandr2
sudo apt-get install libgbm1
sudo apt-get install libxkbcommon0

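You do not need to run every install command above. Install one package, retry the command, and repeat until it works, or just install whatever is reported missing (ask ChatGPT).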
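
Instead of retrying one package at a time, you can list everything that is missing in one go with `ldd`, which prints `=> not found` next to each unresolved shared library. A minimal sketch, assuming `ldd` is available on the machine; `parse_missing_libraries` and `missing_libraries` are hypothetical helper names.

```python
import subprocess


def parse_missing_libraries(ldd_output):
    """Extract the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def missing_libraries(binary):
    """Run ldd on the binary and return its unresolved shared libraries."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=False
    )
    return parse_missing_libraries(result.stdout)
```

Running `missing_libraries("./chrome-headless-shell")` returns file names such as `libgbm.so.1`. These still have to be mapped to package names (for example `libgbm1` in the list above) before installing.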

However, when you run the code from the mac section, you may still get an error

Traceback (most recent call last):
  File "a.py", line 11, in <module>
    driver = webdriver.Chrome(options=chrome_options, executable_path=db.get("chrome_driver_path"))
  File "/usr/local/src/python37/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/local/src/python37/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/src/python37/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/src/python37/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 319, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/usr/local/src/python37/lib/python3.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 374, in execute
    return self._request(command_info[0], url, body=data)
  File "/usr/local/src/python37/lib/python3.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 397, in _request
    resp = self._conn.request(method, url, body=body, headers=headers)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/_request_methods.py", line 119, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/_request_methods.py", line 217, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/poolmanager.py", line 432, in urlopen
    conn = self.connection_from_host(u.host, port=u.port, scheme=u.scheme)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/poolmanager.py", line 303, in connection_from_host
    return self.connection_from_context(request_context)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/poolmanager.py", line 328, in connection_from_context
    return self.connection_from_pool_key(pool_key, request_context=request_context)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/poolmanager.py", line 351, in connection_from_pool_key
    pool = self._new_pool(scheme, host, port, request_context=request_context)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/poolmanager.py", line 265, in _new_pool
    return pool_cls(host, port, **request_context)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 197, in __init__
    timeout = Timeout.from_float(timeout)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/util/timeout.py", line 190, in from_float
    return Timeout(read=timeout, connect=timeout)
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/util/timeout.py", line 119, in __init__
    self._connect = self._validate_timeout(connect, "connect")
  File "/usr/local/src/python37/lib/python3.7/site-packages/urllib3/util/timeout.py", line 159, in _validate_timeout
    ) from None
ValueError: Timeout value connect was <object object at 0x7f5a37a17070>, but it must be an int, float or None.

For this error, please refer to:
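
One commonly reported cause of this exact `ValueError` is an old Selenium release running against urllib3 2.x. The traceback is consistent with that: the call chain is Selenium 3 style, and `urllib3/_request_methods.py` is, as far as I know, a module that only exists in urllib3 2.x. This is an assumption rather than something the post states. The sketch below only flags that one combination, and `looks_like_timeout_mismatch` and `installed_versions` are hypothetical helper names.

```python
def major(version):
    """Return the leading integer of a version string such as '3.141.0'."""
    return int(version.split(".")[0])


def looks_like_timeout_mismatch(selenium_version, urllib3_version):
    """Heuristic: Selenium 3.x or older together with urllib3 2.x or newer."""
    return major(selenium_version) <= 3 and major(urllib3_version) >= 2


def installed_versions():
    """Read the versions of the installed packages (imports them lazily)."""
    import selenium
    import urllib3

    return selenium.__version__, urllib3.__version__
```

If `looks_like_timeout_mismatch(*installed_versions())` is true, the remedies people usually report are pinning urllib3 below 2 or upgrading Selenium. An upgrade to Selenium 4 also means changing the `webdriver.Chrome(...)` call, since newer releases no longer accept `executable_path`.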
