Learning Python Web Scraping
I am currently learning web scraping with Python, as it seems fairly easy to get started with and is a useful skill. As a fun project, it should also let me increase the complexity at my own pace. From what I have read so far, the Beautiful Soup library appears to be the recommended place to start, and I have played around with it after following some YouTube tutorials. Beautiful Soup is an HTML and XML parsing library that builds a parse tree from which data can be extracted.
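To make the parse tree idea concrete, here is a minimal sketch of navigating one with Beautiful Soup. The HTML snippet is made up for illustration, and the example assumes the beautifulsoup4 package is installed and uses Python's built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1 id="main">Hello, scraping</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph with a <a href="/about">link</a>.</p>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the tree or searched for by name.
title = soup.title.string
heading = soup.find("h1", id="main").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(heading)
print(paragraphs)
print(links)
```

find returns the first matching tag (or None), while find_all returns a list of every match, so the two cover most simple extraction tasks.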
Using Beautiful Soup combined with the urllib library, I have managed to produce some HTML parsing code. In terms of HTML parsing, the content of most value on a page is usually text, tables, and XML. Next, I would like to learn how to scrape JavaScript-rendered pages, download and store scraped data, format data, and build a crawler. So far I have seen and reproduced examples of using PyQt to scrape JavaScript-rendered pages.
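As a rough sketch of what combining urllib with Beautiful Soup can look like, the code below downloads a page and pulls the first table out of it. The function names fetch_html and extract_table, the User-Agent value, and the sample table are my own assumptions for illustration, not taken from any tutorial. The download function is defined but not called, so the example runs without network access.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text (needs network access)."""
    # Some sites reject requests that do not send a User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_table(html):
    """Return the first <table> on the page as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are treated the same way.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical sample table, parsed locally instead of being downloaded.
sample = """<table>
<tr><th>Library</th><th>Type</th></tr>
<tr><td>Beautiful Soup</td><td>parser</td></tr>
<tr><td>Scrapy</td><td>framework</td></tr>
</table>"""

rows = extract_table(sample)
for row in rows:
    print(row)
```

For a real page, the two steps chain together as extract_table(fetch_html(url)), with url being whatever page you are allowed to scrape.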
I will be following the book Web Scraping with Python by Ryan Mitchell from now on, as it appears to offer a useful learning progression with good content. I have also heard about some other well-known Python web scraping libraries, such as mechanize, Scrapy, Selenium and scrapemark. Technically, Scrapy is a framework rather than a library, and it is very useful if you are looking to develop a website crawler. If anyone knows of any useful learning resources on Python web scraping, feel free to leave a comment.
Useful Resources:
- Amazon Book: Web Scraping with Python by Ryan Mitchell
- Amazon Book: Web Scraping with Python by Richard Lawson
- YouTube: Web scraping and parsing with Beautiful Soup & Python (Sentdex), playlist
- YouTube: Scrape Websites with Python + Beautiful Soup 4 + Requests (Coding with Python)
- https://www.dataquest.io/blog/web-scraping-tutorial-python/