
How to use the requests module in Python 3 to crawl page content: a worked example

This article walks through a practical exercise in crawling page content with the requests module in Python 3. It should be a useful reference for anyone interested in web scraping.

1. Install pip

My desktop system runs Linux Mint, which does not come with pip installed by default. Since we will need pip to install the modules below, the first step is to install it:

$ sudo apt install python-pip

After the installation succeeds, check the pip version:

$ pip -V

2. Install the requests module

$ pip install requests

Test whether the installation was successful:

Run import requests in a Python shell; if no error is raised, the module was installed successfully.
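For example, a quick check from the interactive interpreter (the version number shown here is only illustrative):

$ python3
>>> import requests
>>> requests.__version__
'2.18.4'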

3. Install beautifulsoup4

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with your favorite parser, it provides idiomatic ways of navigating, searching, and modifying a parse tree, and it can save you hours or even days of work.
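As a quick taste, here is a minimal sketch of what working with Beautiful Soup looks like (the sample HTML string is made up; html.parser is the parser from the standard library):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<h1 id="title">Hello</h1>', 'html.parser')
>>> soup.find(id='title').get_text()
'Hello'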

$ sudo apt-get install python3-bs4

Note: I am installing the Python 3 package here. If you are using Python 2, you can install it with the following command instead:

$ sudo pip install beautifulsoup4

4. A brief look at the requests module

1) Send a request

First of all, of course, import the requests module:

>>> import requests

Then fetch the target web page. For example:

>>> r = requests.get('blogs.com/')

2) Pass URL parameters

Sometimes you want to send data in the URL's query string. If you build the URL by hand, the data sits in key/value pairs after a question mark, e.g. blogs.com/get?key=val. Requests allows you to use the params keyword argument to provide these parameters as a dictionary of strings.

For example, when we search Google for the keyword "python crawler", parameters such as newwindow (open in a new window), q and oq (the search keywords) could be assembled into the URL by hand; with params, you can use the following code instead:

>>> payload = {'newwindow': '1', 'q': 'python crawler', 'oq': 'python crawler'}

>>> r = requests.get("blogs.com/')

3) Get the response content

The body of the response is available as text via r.text, which Requests decodes using the encoding it detected, or as raw bytes via r.content.

4) Get the page encoding

>>> r.encoding

'utf-8'
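If the detected encoding is wrong (a common cause of garbled text), you can override it before reading r.text; a minimal sketch:

>>> r.encoding = 'utf-8'  # force the codec used to decode r.text
>>> r.text  # now decoded with the encoding set above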

5) Get the response status code

We can check the response status code:

>>> r = requests.get('blogs.com/')

>>> r.status_code

200
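Instead of comparing the code by hand, you can also let Requests raise an exception on a bad response; a small sketch:

>>> r.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses; does nothing for 200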

5. Case study

My company recently introduced an OA system. Here I take its official documentation page as an example and crawl only the useful parts of the page, such as the article title and body.

Demo Environment

Operating system: Linux Mint

Python version: Python 3.5.2

Modules used: requests, beautifulsoup4

The code is as follows (note that the 'lxml' parser used below also requires the lxml package, e.g. pip install lxml):

#!/usr/bin/env python3

# -*- coding: utf-8 -*-

__author__ = 'GavinHsueh'

import requests

import bs4

# Address of the target page to be crawled

url = 'http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html'

# Fetch the page content and get the response object

response = requests.get(url)

# Check the response status code

status_code = response.status_code

# Use BeautifulSoup to parse the page and locate the content of the specified tag

content = bs4.BeautifulSoup(response.content.decode("utf-8"), "lxml")

element = content.find_all(id='book')

print(status_code)

print(element)

Running the program returns the crawled results:

(Screenshot: crawl successful)
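The find_all() call above returns whole tags, markup included. If you only want the readable text of the article, a small follow-on sketch (assuming the page really has an element with id='book', as above):

# Extract the plain text of the first matched element
if element:
    print(element[0].get_text(strip=True))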

Garbled characters in the crawled results

In fact, I started out with the Python 2 that ships with the system, but I spent half a day wrestling with garbled encodings in the content returned by the crawler, and none of the fixes I googled made any difference. After Python 2 had nearly driven me crazy, I gave in and switched to Python 3. As for the garbled-content problem when crawling pages with Python 2, I welcome more experienced readers to share their solutions and spare me and others the detour.
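For what it is worth, one fix that often helps with garbled output (in both Python versions) is to let Requests detect the encoding from the response body rather than trusting the HTTP headers; a sketch using the same documentation URL as above:

import requests

response = requests.get('http://www.ranzhi.org/book/ranzhi/about-ranzhi-4.html')
# apparent_encoding is guessed from the body itself, which is often
# more reliable than the charset declared in the response headers
response.encoding = response.apparent_encoding
print(response.text[:200])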