Python basics tutorial: learn web crawling from scratch and let crawlers satisfy your curiosity
Is it worth learning to write crawlers?
I think that question hardly needs discussing anymore.
Crawlers are both "useful" and "fun"!
In an era when data is king, we need to extract the data we want from the vast Internet, and crawlers are the best tool for the job. Whether for the "search engines" of the past or the "data analysis" that is popular today, crawling is an indispensable means of obtaining data. Once you master crawling, you will see a lot of "interesting" things! Whatever your technical direction, this skill lets you explore the thriving Internet and collect all kinds of data and files conveniently and quickly. Besides being fun and interesting, crawlers are genuinely useful: many companies even list crawling among their requirements when recruiting.
So if you want to learn web crawlers well, you need to master some basic knowledge:
- Basic knowledge of Python commonly used in web crawlers
- The HTTP protocol and how communication works (what actually happens when we browse a web page, and how requests and responses are formed)
- HTML, CSS, and JS basics (to understand the structure of a web page and locate specific elements within it)
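These foundations can be seen in miniature with Python's standard library. The sketch below builds, without sending, the kind of HTTP GET request a browser issues when you open a page; the URL and User-Agent string are placeholders for illustration:

```python
from urllib.request import Request

# Build (but do not send) the HTTP request a browser would issue for a page.
# The URL and User-Agent here are placeholders for illustration.
req = Request(
    "https://example.com/index.html",
    headers={"User-Agent": "Mozilla/5.0 (learning-crawler)"},
)

print(req.get_method())   # the HTTP verb: GET
print(req.host)           # where the TCP connection goes: example.com
print(req.selector)       # the path sent in the request line: /index.html
```

Calling `urllib.request.urlopen(req)` would actually send it and return the response; inspecting the object first makes the request/response structure visible without touching the network.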
With these foundations in place, you can start learning crawlers. And learning crawlers today naturally means Python crawlers, the absolute mainstream at the moment.
But many beginners still have doubts:
- When learning Python, should I learn crawling first?
- How do I advance after learning the basics?
- What is the use after learning crawlers?
In the latest programming language rankings, Python surpassed Java to take first place. More and more programmers are choosing Python; some even say that using Python is "programming for the future". As for the relationship between Python and crawlers: naturally, you need to master some Python basics before learning to crawl.
But if you are just starting out with Python and want to go deeper, then once you have the basics down, I recommend learning crawlers before any other direction. Why?
First, learning crawlers genuinely helps you absorb much of what a Python basics tutorial covers. Admittedly, this may be partly because the Python world has produced so many excellent crawler projects that Python and crawling have become associated in everyone's mind, but there is no doubt that crawlers exercise and improve your Python skills.
Second, once you master crawling, you will see many new landscapes. Scraping data with your own crawler is genuinely fun. Believe me, that fun and curiosity will give you a natural affection for Python and the motivation to study it in depth.
We use Python to develop crawlers because Python's greatest strength is not the language itself but its large, active developer community and its countless third-party packages. With these packages we can quickly implement one feature after another without reinventing the wheel, and the more packages we master, the easier writing crawler programs becomes. In addition, a crawler's target is the Internet itself, so writing one calls on HTTP communication and on skills such as HTML, CSS, and JS.
As developers, code is our best teacher: learning by doing and letting the code speak is how we programmers learn. As long as you have a Python foundation, this column is enough to take you from knowing nothing about crawlers to actually developing and using them in your work.
In real-world work, the data we need generally falls into one of these page structures:
- Dedicated news-feed crawler: crawling RSS subscription data
- Netease News crawler: broad-crawling techniques
- Netease crawler optimization: large-scale data processing
- Douban Reading crawler: test-driven design and advanced anti-crawling techniques in practice
- Throttled (slow) crawlers in practice: a Zhihu crawler
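The first item in the list, RSS crawling, is the gentlest entry point, because feeds are well-formed XML that the standard library can parse directly. A minimal sketch (the feed content below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny RSS 2.0 snippet standing in for a real downloaded feed.
rss = """<rss version="2.0"><channel>
  <title>Demo Feed</title>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
# Collect (title, link) for every <item> element in the feed.
items = [(i.findtext("title"), i.findtext("link"))
         for i in root.iter("item")]
for title, link in items:
    print(title, link)
```

In a real crawler the `rss` string would come from an HTTP request to the feed URL; everything else stays the same.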
In what follows, I will walk you through these page structures one by one, implementing a crawler for each with different techniques, so that through concrete code practice you understand which techniques suit which situations and how to get past the anti-crawling measures you encounter. Through concrete applications you will build an intuition for crawlers and understand the technical theory behind them.
At this point, some of you may ask: what happens after the crawler program is written? Don't worry: once it is written, I will walk everyone through deploying it, so our crawler can really show what it can do.
- Master development with the Scrapy framework
- Learn broad-crawling techniques for handling massive data
- Optimize your incremental crawlers
- Handle large-scale concurrent crawling projects with distributed crawlers
- Deploy crawlers with Docker container technology
How much data and information is hidden on the Internet? What new perspectives can it bring to our lives and work? Keep your curiosity alive: from now on, let's learn crawlers together, play with crawlers together, and put crawlers to use together!
Let me introduce the tools we need for Python crawling! This is also the first step in learning to crawl.
What does a crawler developer do first?
That's right: analyze the target site!
Chrome is the most basic tool for crawler work. We generally use it for initial crawl analysis, tracing page logic and redirects, simple JS debugging, inspecting network requests, and so on. Most of our early work happens in it. To use a perhaps inapt metaphor: without Chrome, we would regress from the modern era back to ancient times!
Similar tools: Firefox, Safari, Opera
Charles is the Chrome of the app side: it is used for network analysis of mobile apps. Compared with the web, app-side network analysis is simpler, focusing on the parameters of each network request. Of course, if the other side encrypts parameters on the server, reverse engineering comes into play; that is a whole basket of tools in itself, so I won't cover it here for now.
Similar tools: Fiddler, Wireshark, Anyproxy
Next, analyze the site's anti-crawling measures
Wikipedia introduces cURL like this:
cURL is a command-line file transfer tool that uses URL syntax, first released in 1997. It supports both uploading and downloading files, making it a comprehensive transfer tool, although by convention cURL is usually called a download tool. The cURL project also includes libcurl, a library for program development.
When analyzing a site, we often need to simulate a request. Writing code for that at this stage would be overkill; instead, just copy the request as cURL from Chrome and run it on the command line to see the result. The steps are as follows.
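Once the command-line check succeeds, a cURL command copied from Chrome also maps almost line-for-line onto a Python request. A sketch with the standard library, where the URL and headers are hypothetical stand-ins for whatever Chrome gave you:

```python
from urllib.request import Request

# The cURL line copied from Chrome might look like this (hypothetical):
#   curl 'https://example.com/api/list?page=1' \
#     -H 'User-Agent: Mozilla/5.0' -H 'Referer: https://example.com/'
# Each -H flag becomes one entry in the headers dict:
req = Request(
    "https://example.com/api/list?page=1",
    headers={"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"},
)

# urllib.request.urlopen(req) would send it; here we just inspect
# what would go on the wire.
print(req.full_url)
print(req.header_items())
```

Postman automates exactly this translation, which is why importing a cURL command there feels so natural.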
Of course, most websites won't hand over their data just because you copied a cURL command and tweaked a few parameters. For deeper analysis, we need the "heavy weapon" Postman. Why call it that? Because it is genuinely powerful: we can import the cURL request directly, then modify it and tick exactly the parameters we want. Very elegant.
With the tools above you can handle most websites and count yourself a competent junior crawler engineer. To advance from there, you need to face more complex sites. At this stage you need not only back-end knowledge but also some front-end knowledge, because many sites place their anti-crawling measures in the front end. You will have to extract the other site's JS, read it, and reverse it, and minified JS is generally hard to read; a JS formatting (beautifier) tool will help you make it legible.
Crawling and anti-crawling are a tug-of-war without gunpowder smoke. You never know what traps the other side has laid for you, such as tampering with Cookies. This is where EditThisCookie assists your analysis: after installing the EditThisCookie plug-in in Chrome, click the small icon in the upper right corner to add, delete, modify, and inspect Cookie values, which makes simulating Cookie information much easier.
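Reproducing in your own requests what EditThisCookie shows you is straightforward: join the cookie pairs into a `Cookie` header. A sketch with hypothetical cookie names and values:

```python
from urllib.request import Request

# Suppose EditThisCookie showed the site sets these cookies
# (names and values are made up for illustration). To simulate a
# browser session we attach them to our own request:
cookies = {"session_id": "abc123", "tracker": "xyz"}
cookie_header = "; ".join(f"{k}={v}" for k, v in cookies.items())

req = Request(
    "https://example.com/profile",
    headers={"Cookie": cookie_header},
)
print(req.get_header("Cookie"))  # session_id=abc123; tracker=xyz
```

For crawls that must follow `Set-Cookie` responses automatically, `http.cookiejar.CookieJar` with an opener does the bookkeeping for you.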
Next, design the crawler's architecture
Once we have confirmed that the site can be crawled, we should not rush into writing code. Instead, start by designing the crawler's architecture. A simple crawl analysis based on the business requirements will improve our development efficiency later; as the saying goes, sharpening the axe does not delay the woodcutting. For example: is this a targeted crawl or a full traversal? BFS or DFS? Roughly how many concurrent requests? Having thought these through, we can draw a simple architecture diagram in Sketch.
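The BFS-versus-DFS choice can be sketched in a few lines: the same frontier behaves as a queue (BFS) or a stack (DFS) depending on which end we pop. The link graph below is a toy stand-in for real pages:

```python
from collections import deque

# A toy link graph standing in for real pages (hypothetical URLs):
# each page maps to the links found on it.
links = {
    "/": ["/news", "/books"],
    "/news": ["/news/1", "/news/2"],
    "/books": ["/books/1"],
    "/news/1": [], "/news/2": [], "/books/1": [],
}

def crawl(start, bfs=True):
    """Traverse the link graph; popleft() gives BFS, pop() gives DFS."""
    frontier, seen, order = deque([start]), {start}, []
    while frontier:
        url = frontier.popleft() if bfs else frontier.pop()
        order.append(url)               # here a real crawler would fetch url
        for nxt in links.get(url, []):
            if nxt not in seen:         # dedupe so no page is visited twice
                seen.add(nxt)
                frontier.append(nxt)
    return order

print(crawl("/", bfs=True))   # level by level
print(crawl("/", bfs=False))  # one branch at a time
```

BFS suits broad site-wide crawls (shallow pages first); DFS suits drilling into one section before moving on. The `seen` set is the part you must never forget, or the crawler loops forever on cyclic links.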
Similar tools: Illustrator, Photoshop
Finally, the happy crawler development journey begins
Development can finally begin. After the steps above, everything is ready; all that remains is the east wind. Now we just write the code and extract the data.
When extracting data from a web page, we generally use XPath syntax to pull out the information we need. Normally the only way to check an expression is to write it, send a request to the target page, and print the result to see whether we extracted the right data. That fires off many unnecessary requests and wastes our time. XPath Helper solves this: after installing the plug-in in Chrome, just click its icon, type the XPath expression, and see the result immediately on the right. A huge efficiency boost.
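Once XPath Helper confirms the expression, the same idea carries into code. Python's standard `xml.etree.ElementTree` supports a useful subset of XPath, enough to show the pattern; real pages are rarely well-formed XML, so in practice `lxml` is the usual choice. The page snippet below is made up:

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed page snippet (hypothetical content).
html = """<html><body>
  <div class="news"><a href="/n/1">Headline one</a></div>
  <div class="news"><a href="/n/2">Headline two</a></div>
</body></html>"""

root = ET.fromstring(html)
# XPath-style query: every <a> under a div whose class is "news".
titles = [a.text for a in root.findall(".//div[@class='news']/a")]
print(titles)  # ['Headline one', 'Headline two']
```

The expression you test in XPath Helper is usually pasted here almost verbatim, which is exactly why checking it in the browser first saves so many wasted requests.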
Sometimes the data we extract is in JSON format; because JSON is simple and easy to work with, more and more websites use it for data transfer. After installing the JSONView plug-in, we can view JSON responses comfortably in the browser.
JSON Editor Online
JSONView helps when the web page returns JSON directly, but much of the time the response is front-end-rendered HTML, and the JSON we obtain from our own requests displays poorly in the terminal. How do we show it nicely? With JSON Editor Online you can format the data in one second, and it thoughtfully supports folding JSON data.
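The same one-second formatting is also available from Python itself when you are stuck in a terminal, via `json.dumps`; the payload below is made up for illustration:

```python
import json

# Raw one-line JSON as it might come back from an API (hypothetical payload).
raw = '{"code":0,"data":{"items":[{"id":1,"title":"hello"}]}}'

obj = json.loads(raw)
# indent gives the nested layout; sort_keys makes diffs stable.
pretty = json.dumps(obj, indent=2, sort_keys=True)
print(pretty)
```

`python -m json.tool` does the same from the shell, so you can pipe a crawler's raw output straight through it.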
If you have read this far, I believe you are a keen learner, so here is one last easter-egg tool for you.
What does it do? It is a screen-pinning (always-on-top) tool. Don't underestimate it; it matters. When analyzing parameters we often switch back and forth between several windows to compare values. With this tool you can pin a snippet on top of the screen and spare yourself the constant switching. Very convenient. It also has a hidden trick, as in the picture above.
That's all Orange will share with you this time; follow-up Python basics tutorials and Python crawler posts will keep coming! Summer is here; while studying hard, everyone should also remember to rest!