The Internet is woven together from hyperlinks, each one leading from one web page to another. The new page, in turn, contains many more links. In theory, starting from any web page and following link after link, you can travel the entire Internet! Doesn't that process look like a spider crawling along its web? This is where the name "crawler" comes from.
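The link-following idea can be sketched as a breadth-first traversal. This is only a minimal illustration under stated assumptions: the `PAGES` dictionary stands in for the network (its URLs and HTML are made up), and `fetch` is a hypothetical helper that a real crawler would replace with an HTTP request.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/c">c</a> <a href="/a">back</a>',
    "/c": '<a href="/d">d</a>',
    "/d": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    # Stand-in for a network request (an assumption of this sketch).
    return PAGES.get(url, "")


def crawl(start):
    """Visit every page reachable from `start`, breadth first, each page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order
```

Calling `crawl("/a")` visits all four made-up pages exactly once, even though `/b` links back to `/a`; the `seen` set is what keeps the traversal from looping forever.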
As a crawler engineer, your job is to write "spider" programs that crawl along the web and save the information they find. In general, the information to be crawled is structured; if it cannot be structured, it is of little use (yet 80% of data is unstructured). A crawler can be as small as one that scrapes the Douban Top 250 movies or fetches the weekly weather forecast on a schedule, or as large as one that crawls the pages of the entire Internet (such as Google's). I think all of the following can be called crawlers:
Crawling authors and their answers
Crawling Baidu resources and saving them to a database (only the link and title of each resource, of course), then building a search engine for the network disk
Likewise, a search engine for seed (torrent) sites
At this point we know that a crawler's task is to acquire data. Big data is popular these days. From the Internet's perspective, data comes in two kinds: user-generated content (UGC), and data obtained by some other means, usually a crawler. Crawlers are not limited to web pages; they can also capture data by sniffing the packets an app sends. In short, the job is to aggregate data and give it structure. So, which jobs need crawlers?
One, what can a crawler do?
A typical data aggregation site needs a crawler, for example the Google search engine. Google can return pages containing given keywords within a few milliseconds. It certainly does not go looking for those pages in real time; it crawls them in advance and saves them in its own database (imagine how large that database must be). Seed search engines, web search engines, Resilio key engines and the like all use crawlers to fill their databases in the same way.
In addition, some websites offer information comparison, such as price comparison sites, which use crawlers to collect product prices from different shopping sites and then display them side by side. Prices on shopping sites change all the time, but the comparison site keeps the data it has captured, so it can show price trends, which is information the shopping sites themselves will not provide.
Beyond that, you can do some fun things for yourself. For example, if we want to view a large number of pictures, we can write a crawler to download them in batches, without clicking save one by one and putting up with the site's advertisements. Or we may want to back up our own data, such as all the broadcasts we have published on Douban: a crawler can capture everything we posted, so that even where a website offers no backup service, we can fend for ourselves.
Two, what skills does a crawler engineer need to master?
I have seen this claim: "Crawling is low-level, repetitive work with no career prospects." This is a misunderstanding. First of all, there is basically no repetitive work for a programmer, since any repetitive work can be automated with a program. For example, I once had to scrape a dozen websites that were similar in content but different in HTML structure. I wrote a simple code generator that produces everything from the crawler code to the unit test code, so that only the parts tied to each site's HTML structure need slight modification. So I believe repetitive labor basically does not exist in programming. If you find your work repetitive, it means you are too diligent and unwilling to be lazy, and I think diligent programmers are not good programmers. Below, based on my recent work experience, I will go through the skills a crawler engineer needs.
1. Coding fundamentals (at least one programming language)
This is a must for any programming job. You need a firm grasp of basic data structures: mapping field names to values (dictionaries), holding a batch of URLs to process (lists), and so on. In fact, the stronger your fundamentals, the better. Crawling is not a trivial job, but it does not demand more of a programming language than other work does. Being familiar with the language you use, and with its relevant frameworks and libraries, always pays off.
I mainly use Python, and some people write crawlers in Java. In theory a crawler can be written in any language, but it is best to choose one with plenty of relevant libraries and fast development. Writing one in C is definitely asking for trouble.
2. Task queue
When the crawling job is very large, writing a single program and running it from start to finish is not appropriate:
If it hits an error halfway through, does everything stop and start over? That is not sensible.
How do I know where the program failed? Tasks should not affect one another.
What if I have two machines?
So we need a task queue: the pages scheduled to be crawled are put into the queue, and workers take tasks off it and execute them one by one. If a task fails, it is recorded and the worker moves on to the next. This way the workers simply keep working through the tasks. It also improves scalability: hundreds of millions of tasks in the queue is no problem, and when more capacity is needed you just add workers, like adding a pair of chopsticks at the dinner table.
Common task queues include Kafka, beanstalkd, Celery, etc.
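The worker loop described above might look roughly like this. It is a single-process sketch: the standard library's `queue.Queue` stands in for a real queue service such as those listed, and `fake_download` and the example.com URLs are hypothetical.

```python
import queue


def run_worker(tasks, handler):
    """Take tasks off the queue one by one; record failures and keep going."""
    done, failed = [], []
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            break  # nothing left to do
        try:
            done.append((url, handler(url)))
        except Exception as exc:
            # One bad task must not stop the rest: record it and move on.
            failed.append((url, repr(exc)))
        finally:
            tasks.task_done()
    return done, failed


def fake_download(url):
    # Hypothetical handler; a real one would fetch and parse the page.
    if "broken" in url:
        raise ValueError("could not parse page")
    return len(url)


tasks = queue.Queue()
for url in ["http://example.com/1", "http://example.com/broken", "http://example.com/3"]:
    tasks.put(url)

done, failed = run_worker(tasks, fake_download)
```

The failing task ends up in `failed` with its error, and the two tasks around it still complete. With an external queue, the same loop can run on several machines at once, which answers the "two machines" question.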
3. Database
This goes without saying: to save data you need to know databases. That said, small amounts of data can sometimes be saved as JSON or CSV. Sometimes I just want to grab some pictures, and I save the files directly in a folder.
A NoSQL database such as MongoDB is recommended, because the data a crawler captures is generally a set of field-value pairs, and some fields exist on some sites but not on others. MongoDB is more flexible in this respect. Besides, the relationships within crawled data are very weak, and joins between tables are rarely needed.
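To illustrate why uneven fields matter, here is a small sketch that saves the same made-up records as JSON Lines and as CSV using only the standard library. The records and field names are invented for the example; a document database would store them much like the JSON form.

```python
import csv
import io
import json

# Made-up records: different sites expose different fields.
records = [
    {"title": "Movie A", "url": "http://example.com/a", "rating": 8.7},
    {"title": "Movie B", "url": "http://example.com/b", "director": "Someone"},
]

# JSON Lines: one object per line, each free to carry its own fields.
json_lines = "\n".join(json.dumps(r, sort_keys=True) for r in records)

# CSV needs a fixed header, so take the union of all field names
# and leave missing values empty.
fieldnames = sorted({key for r in records for key in r})
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

The JSON form keeps each record as it was crawled, while the CSV form has to invent empty cells for fields a site did not provide. That is the flexibility argument in miniature.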
4. HTTP knowledge
HTTP knowledge is a must-have skill. To crawl web pages, you must understand web pages.
First, understand how an HTML document is parsed: child nodes, parent nodes, attributes and so on. The web page we see is colorful, but that is only after the browser has processed it. The original page is made up of many tags. It is best to process it with an HTML parser; if you rely on regular expression matching, there will be many pitfalls. I personally like XPath a lot: it is cross-language and quite expressive. It has shortcomings too: regular expressions and logical conditions are a bit awkward in it.
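A small comparison of the two approaches, as a sketch. It uses the limited XPath subset in the standard library's `xml.etree.ElementTree`, which only works here because the made-up snippet is well-formed; real pages usually need a more forgiving HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet. Real pages are rarely this tidy.
PAGE = """
<html><body>
  <div class="movie"><a href="/m/1" class="title">First</a></div>
  <div class="ad"><a href="/ad/9">Buy now</a></div>
  <div class="movie"><a class="title" href="/m/2">Second</a></div>
</body></html>
"""

root = ET.fromstring(PAGE)
# Select every <a class="title"> that sits directly inside a <div class="movie">.
links = root.findall(".//div[@class='movie']/a[@class='title']")
titles = [(a.get("href"), a.text) for a in links]

# A naive regex that assumes href comes before class misses the second
# link, whose attributes appear in a different order.
naive = re.findall(r'<a href="([^"]+)" class="title">', PAGE)
```

The path expression finds both movie links and skips the advertisement, while the regex silently drops one result. This is the kind of pitfall meant above.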
Understand the HTTP protocol. HTTP itself is stateless, so how is "login" implemented? That requires understanding sessions and cookies. Also understand the difference between the GET and POST methods (on the wire the two look almost the same; the difference lies mainly in their intended meaning).
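The cookie round trip behind a login can be sketched with the standard library's `http.cookies`. The header value below is made up; a real one would come from the server's response to the login request.

```python
from http.cookies import SimpleCookie

# Made-up Set-Cookie value from a login response.
set_cookie = "sessionid=abc123; Path=/; HttpOnly"

# Parse what the server asked the client to store.
jar = SimpleCookie()
jar.load(set_cookie)
session_value = jar["sessionid"].value

# On later requests the client sends the name=value pairs back in a Cookie
# header; this is how a stateless protocol lets the server recognise a
# logged-in user.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
```

The server keeps the session data on its side and only hands the client an identifier, so a crawler that replays the same `Cookie` header is treated as the same logged-in session.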
Be proficient with the browser. Crawling is really the process of simulating a human browsing for data. So how does a browser access a website? You have to learn to observe, and the way to observe is Developer Tools! Chrome's Developer Tools show everything that happens when a site is accessed. Every outgoing request can be seen in the network traffic. The "Copy as cURL" function generates a curl command identical to the browser's request! The general flow of writing a crawler goes like this: first visit the site with a browser, then use Copy as cURL to see which headers and cookies are sent, then simulate the request in code, and finally save the result of the request.
5. Operation and maintenance
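As a rough sketch of the "simulate the request with code" step, the helper below turns a simplified copied curl command into a `urllib.request.Request`. The function name `request_from_curl`, the example command and the small set of supported flags are all assumptions of this sketch; real copied commands can contain options it does not handle.

```python
import shlex
import urllib.request


def request_from_curl(command):
    """Build a urllib Request from a simplified 'Copy as cURL' command.

    Only the URL, -H/--header and --data-raw are handled; other flags
    (such as --compressed) are ignored in this sketch.
    """
    tokens = shlex.split(command)
    url, headers, data = None, {}, None
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif token == "--data-raw":
            data = tokens[i + 1].encode()
            i += 2
        elif token.startswith("-"):
            i += 1  # unsupported flag, assumed to take no argument
        else:
            url = token
            i += 1
    return urllib.request.Request(url, data=data, headers=headers)


# Made-up command in the style the browser copies to the clipboard.
copied = ("curl 'http://example.com/api/items?page=2' "
          "-H 'User-Agent: Mozilla/5.0' -H 'Cookie: sessionid=abc123' --compressed")
req = request_from_curl(copied)
# urllib.request.urlopen(req) would send it; the response body is what gets saved.
```

Starting from the exact request the browser made and then removing headers one at a time is a quick way to learn which ones the site actually requires.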
There is a lot to say on this topic. In actual work, operations takes about as much time as development, or even more. Maintaining a crawler that is already running is heavy work. With experience we generally learn to write crawlers that are easier to maintain, for example by giving them a logging system and statistics on data volume. It is also unreasonable to separate the crawler engineer from operations, because when a crawler stops working, the cause may be that the target page changed its structure, that something went wrong on the system, that an anti-crawling strategy went unnoticed during development and caused trouble after launch, or that the target website detected the crawler and blocked it. So in general, crawler development has to take operations into account.
Here are some ideas I can offer for operating crawlers:
First, monitor data increments. Targeted crawlers (those that crawl only one site) are relatively easy: over time you get a general feel for how much data a site adds, so check regularly whether the increments look normal (Grafana). The data increments of non-targeted crawlers are less stable and generally depend on the machine's network conditions, how often the sites update, and so on (I don't have much experience in this area).
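One simple way to express "is today's increment normal" is to compare it with the recent average. The sketch below does that; the 50% tolerance and the daily counts are arbitrary assumptions, and in practice the rule would more likely live in a dashboard alert than in code.

```python
from statistics import mean


def increment_looks_normal(history, today, tolerance=0.5):
    """Return True if today's new-record count is within `tolerance`
    (as a fraction) of the recent average."""
    baseline = mean(history)
    return abs(today - baseline) <= tolerance * baseline


# Made-up daily counts of newly crawled records for one site.
last_week = [980, 1010, 1005, 990, 1020, 1000, 995]
```

With this history, a day with 1030 new records passes the check, while a day with only 120 is flagged, which would be the cue to look at whether the site changed or the crawler got blocked.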
Watch the success rate of crawler executions. As mentioned above, controlling the crawler's work through a task queue decouples things and brings many benefits, one of which is that every execution can be logged. Each time a crawl task runs, write the execution time, status, target URL, exception and so on to a logging system (such as Kibana); a visualization then shows the crawler's failure rate clearly.
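A per-task log record and the failure rate derived from it could be sketched as follows. The record fields mirror the ones listed above; the in-memory list stands in for a real logging system, and `flaky_handler` and its URLs are hypothetical.

```python
import time


def run_and_log(url, handler, log):
    """Run one crawl task and append a structured record to `log`."""
    started = time.time()
    record = {"url": url, "status": "ok", "exception": None}
    try:
        handler(url)
    except Exception as exc:
        record["status"] = "failed"
        record["exception"] = repr(exc)
    record["seconds"] = time.time() - started
    log.append(record)
    return record


def failure_rate(log):
    """Fraction of logged tasks that failed (0.0 for an empty log)."""
    if not log:
        return 0.0
    return sum(r["status"] == "failed" for r in log) / len(log)


def flaky_handler(url):
    # Hypothetical handler that fails on one made-up URL.
    if url.endswith("/bad"):
        raise TimeoutError("no response")


log = []
for url in ["http://example.com/1", "http://example.com/bad",
            "http://example.com/2", "http://example.com/3"]:
    run_and_log(url, flaky_handler, log)
```

Because every record has the same fields, the failure rate can be grouped by site or by hour later without touching the crawler code.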
Track the exceptions thrown by the crawler. Almost every project uses error logging (Sentry). One thing to note: ignore routine exceptions (such as connection errors and lock conflicts), otherwise you will be drowned in them.
Three, crawling and anti-crawling
This is also a very deep topic. Just like offensive and defensive weapons, both sides keep upgrading. The common anti-crawling measures (the ones I have encountered) are as follows:
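Filtering out routine exceptions before reporting can be as small as the sketch below. Which exception types count as routine is an assumption that depends on the project, and `should_report` is a hypothetical helper; the call to the actual error tracker is left out.

```python
# Exception types treated as routine noise (an assumption of this sketch).
EXPECTED = (ConnectionError, TimeoutError)


def should_report(exc):
    """Report only exceptions that are not routine network noise."""
    return not isinstance(exc, EXPECTED)


errors = [ConnectionResetError("peer reset"), KeyError("price"), TimeoutError()]
to_report = [e for e in errors if should_report(e)]
```

Only the `KeyError` survives the filter here, and that is the interesting one: it suggests the page structure changed, not that the network hiccuped.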
1. Access frequency
This is easy to understand. If you access a website too frequently, it may block your IP for a while, on the same principle as DDoS protection. For a crawler, limiting the task frequency is enough. Try to make the crawler access pages the way a human would (for example, sleep for a random interval; visiting the site exactly every 3 seconds is obviously not normal behavior).
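Randomized pauses between requests can be sketched like this. The base delay of 3 seconds and jitter of up to 2 seconds are arbitrary assumptions, and both function names are hypothetical; `sleep` is a parameter so the loop can be exercised without actually waiting.

```python
import random
import time


def polite_delays(count, base=3.0, jitter=2.0, rng=random):
    """Return `count` wait times of `base` seconds plus up to `jitter`
    seconds of random extra delay."""
    return [base + rng.uniform(0, jitter) for _ in range(count)]


def crawl_slowly(urls, handler, sleep=time.sleep):
    """Handle each URL, pausing a randomised interval after each request."""
    delays = polite_delays(len(urls))
    for url, delay in zip(urls, delays):
        handler(url)
        sleep(delay)
```

Besides looking less mechanical, the pause also caps the load placed on the target site, which matters for the ethics point later on.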
2. Login restrictions
This is also fairly common. However, websites that publish public information generally do not have this restriction, since it inconveniences users. In fact, every anti-crawling measure affects real users to some degree, and the stricter the anti-crawling, the more likely it is to hurt legitimate users. For a crawler, a login restriction can be handled by simulating the login and attaching the cookie (once again, understanding how the web works is important).
3. Blocking by header
When a browser accesses a website, it sends headers that identify it, such as Safari or Chrome, along with operating system information. A program does not send these browser headers by default. Getting around this is very simple: just add the headers when making the request.
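To see what a program sends by default, the sketch below inspects Python's `urllib` and then builds a request with a browser-style header. The `BROWSER_UA` string is illustrative only; the real value should be copied from your own browser.

```python
import urllib.request

# The default opener identifies itself as Python, which is easy to filter.
default_headers = dict(urllib.request.build_opener().addheaders)

# Illustrative browser-style value (an assumption, not a current browser string).
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")

# Headers passed here override the opener's default for this request.
req = urllib.request.Request("http://example.com/", headers={"User-Agent": BROWSER_UA})
```

`default_headers` contains a `User-agent` entry naming the Python library, which is why a site can tell scripted access apart from a browser before looking at anything else.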
5. CAPTCHA
This is almost the ultimate weapon. A CAPTCHA is a means designed specifically to tell humans and computers apart. For the anti-crawling side, this method does more harm to real users and to search engines (in fact, search engine crawlers can be treated differently by recording their IP addresses), and I am sure readers have had the painful experience of typing in CAPTCHAs. But this method is not invincible! Many CAPTCHAs can be recognized easily with machine learning, which is very popular right now! Google's reCAPTCHA is a very advanced CAPTCHA, but I have heard that it can be defeated by simulating a browser.
6. IP restrictions
A website may permanently block an IP it has identified. This method takes a lot of manual effort, and the cost of mistakenly blocking real users is high. Getting around it is simple, though. Proxy pools are now almost standard equipment for crawlers, and there are many ready-made tools that are easy to use. So this basically only stops small crawlers.
7. Anti-crawling through page content
Some websites present their content in a form that only humans can take in (anti-crawling is, after all, about treating humans and machines differently). For example, the content is displayed as an image. But in recent years the gap between humans and machines has been narrowing, and images can be recognized by OCR with very high accuracy.
Anti-crawling summary
Crawling and anti-crawling are a typical escalation between attack and defense. But I think this escalation is unlike a military one: a military arms race has no end, whereas crawling and anti-crawling do have one.
The end point of anti-crawling is a super-powerful CAPTCHA like Google's. After all, the fundamental purpose of a CAPTCHA is to tell humans and machines apart.
I happen to have a very good example of anti-crawling. The Google Arts Project is an online gallery that brings together famous paintings from around the world. I like some of the paintings in it, so I wanted to download a few (which, of course, is not right). I then found that the site's protection is quite good: because the copyright belongs to the museums that hold the works, the Google Arts Project will certainly not offer downloads, and downloading is almost impossible. Somewhat unwilling to give up, I began trying various means to download the original images. I found that the site disables the right mouse button, and inspecting the element showed that the picture is not an ordinary image. Tracing the network traffic showed that the original image is not fetched in a single request; it is requested in several pieces as base64-encoded character streams, each request returning part of the picture, and the picture is then assembled on the client side! Of course, the client-side code is also encrypted and obfuscated! This could serve as a textbook case of anti-crawling: it does not bother users, and it leaves a crawler with nowhere to start.
(Figure: each request fetches only part of the image.)
Four, professional ethics
Large-scale crawlers generally run on clusters, and a small website's servers may well be smaller than the crawler cluster. So in most cases we should limit how often we hit the sites we crawl. Otherwise these crawlers amount to a DoS attack cluster! Websites generally provide a robots.txt for reference.
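Checking robots.txt before crawling can be sketched with the standard library's `urllib.robotparser`. The rules below are made up and parsed from a string; a real crawler would download the file from the target site. The agent name `my-crawler` is hypothetical.

```python
import urllib.robotparser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = urllib.robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="my-crawler"):
    """True if the rules permit `agent` (a hypothetical name) to fetch `url`."""
    return rules.can_fetch(agent, url)


# Seconds the site asks crawlers to wait between requests, if it says so.
delay = rules.crawl_delay("my-crawler")
```

Skipping disallowed paths and honoring the requested delay is the minimum a cluster-sized crawler can do to avoid behaving like an attack on a small site.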
To sum up, writing crawlers takes experience and flexible thinking. For example, I once encountered a website that required solving a CAPTCHA to obtain a token. By looking at the network requests, I noticed that the token looked like a timestamp, so I generated a timestamp myself and found that it worked! That bypassed the CAPTCHA. So with enough accumulated experience and experimentation, you can save yourself a lot of effort, hehe.
In addition, crawling is not the boring job I used to think it was. For example, I have come across plenty of terrible but very funny websites, which is great fun. There is also a lot to learn. Everything keeps changing.
In the Internet age, information is everywhere. The vast amounts of information we come across every day, such as Weibo posts, social media posts, consumer reviews, news, and salespeople's visit records, are common sources of unstructured data. Analyzing unstructured data can reveal trends and associations hidden in text, and can provide strong support for business decision-making, industry trend research, and hot-topic analysis.