
ScrapingPython

Web Scraping with Python using Beautiful Soup

dataquest-learn-data-science-online

    1. Find Elements by ID.
    2. Find Elements by HTML Class Name.
    3. Extract Text From HTML Elements.
    4. Find Elements by Class Name and Text Content.
    5. Pass a Function to a Beautiful Soup Method.
    6. Identify Error Conditions.
    7. Access Parent Elements.
    8. Extract Attributes From HTML Elements.
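
A minimal sketch tying these steps together (the URL, the ResultsContainer ID, and the class names are illustrative assumptions, not taken from the course material):

import requests
from bs4 import BeautifulSoup

# URL, IDs, and class names below are illustrative assumptions
page = requests.get("https://example.com/jobs")
soup = BeautifulSoup(page.content, "html.parser")

# 1. Find an element by ID
results = soup.find(id="ResultsContainer")

# 2. Find elements by HTML class name
cards = results.find_all("div", class_="card")

# 3. Extract text from HTML elements
for card in cards:
    print(card.find("h2").text.strip())

# 4./5. Find elements by text content by passing a function to a Beautiful Soup method
python_jobs = results.find_all("h2", string=lambda text: text and "python" in text.lower())

# 6. Identify error conditions: find() returns None when nothing matches
if soup.find(id="does-not-exist") is None:
    print("element not found")

# 7./8. Access parent elements and extract attributes from HTML elements
for h2 in python_jobs:
    link = h2.parent.find("a")
    if link is not None:
        print(link["href"])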

How-to-Build-a-Web-Scraping-Pipeline-in-Python-Using-BeautifulSoup

Beautiful Soup Installation:

Open a Windows terminal and install with pip:

pip install beautifulsoup4

If you are installing from a downloaded source archive instead, run:

python setup.py install
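
A quick sanity check that the install worked (a minimal sketch parsing an inline HTML string):

from bs4 import BeautifulSoup

# Parse a tiny inline document to confirm the package imports and works
soup = BeautifulSoup("<html><body><p>Hello, Beautiful Soup!</p></body></html>", "html.parser")
print(soup.p.text)  # prints: Hello, Beautiful Soup!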

For more information, see the installation section of the official Beautiful Soup documentation.

Web scraping tools:

1. Scrapy

Scrapy
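
For comparison with the Beautiful Soup examples above, a minimal Scrapy spider sketch (the spider name, start URL, and CSS selectors are illustrative assumptions based on Scrapy's own tutorial site):

import scrapy


class QuotesSpider(scrapy.Spider):
    # Name used to run the spider, e.g. `scrapy crawl quotes`
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, it could be run with something like scrapy runspider quotes_spider.py -o quotes.json.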

2. Heritrix

Heritrix

3. Web-Harvest

Web-Harvest

4. MechanicalSoup

MechanicalSoup
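
A minimal MechanicalSoup sketch in the same spirit (the target URL, form selector, field name, and result CSS class are assumptions for illustration, not part of this repository):

import mechanicalsoup

# StatefulBrowser keeps track of cookies and the current page
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://duckduckgo.com/html/")

# Fill in and submit the search form (form selector and field name are assumptions)
browser.select_form('form[action="/html/"]')
browser["q"] = "web scraping with python"
browser.submit_selected()

# The current page is exposed as a Beautiful Soup object for further parsing
page = browser.page
for link in page.select("a.result__a"):
    print(link.text, link.get("href"))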

5. Apify SDK

Apify SDK

Node.js:

import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with API token
const client = new ApifyClient({
   token: '<YOUR_API_TOKEN>',
});

// Prepare actor input
const input = {
   "url": "https://crawlee.dev/",
   "proxyOptions": {
       "useApifyProxy": true
   },
   "frameRate": 7,
   "scrollPercentage": 10,
   "recordingTimeBeforeAction": 1000
};

(async () => {
   // Run the actor and wait for it to finish
   const run = await client.actor("glenn/gif-scroll-animation").call(input);

   // Fetch and print actor results from the run's dataset (if any)
   console.log('Results from dataset');
   const { items } = await client.dataset(run.defaultDatasetId).listItems();
   items.forEach((item) => {
       console.dir(item);
   });
})();

6. Apache Nutch

Apache Nutch

7. Jaunt

Jaunt

GoogleScraperDemo.java:
UserAgent userAgent = new UserAgent();         //create new userAgent (headless browser)
userAgent.visit("http://google.com");          //visit google
userAgent.doc.apply("butterflies").submit();   //apply form input and submit

Elements links = userAgent.doc.findEvery("<h3>").findEvery("<a>");  //find search result links
for(Element link : links) System.out.println(link.getAt("href"));   //print results

8. Node-crawler

Node-crawler

Install:

$ npm install crawler

Basic usage:

const Crawler = require('crawler');

const c = new Crawler({
    maxConnections: 10,
    // This will be called for each crawled page
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default
            //a lean implementation of core jQuery designed specifically for the server
            console.log($('title').text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);

More examples are available on the npm package page: https://www.npmjs.com/package/crawler

9. PySpider

PySpider github: https://github.com/binux/pyspider

Sample Code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

10. StormCrawler

StormCrawler
