Categories
javascript python python-2.x web-scraping

Web-scraping JavaScript page with Python

252

I’m trying to develop a simple web scraper. I want to extract text without the HTML code. It works on plain HTML, but not on pages where JavaScript code adds text.

For example, if some JavaScript code adds some text, I can’t see it, because when I call:

response = urllib2.urlopen(request)

I get the original text without the added one (because JavaScript is executed in the client).
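
A minimal sketch of the problem, using the http://quotes.toscrape.com/js/ demo page (an assumption; any JavaScript-rendered page shows the same behavior):

import urllib2

response = urllib2.urlopen("http://quotes.toscrape.com/js/")
html = response.read()
# The quote markup on this page is built client-side by JavaScript,
# so it is absent from the raw HTML we just fetched:
print 'class="quote"' in html  # prints False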

So, I’m looking for some ideas to solve this problem.


110

+25

We are not getting the correct results because any JavaScript-generated content needs to be rendered in the DOM. When we fetch an HTML page, we fetch the initial DOM, unmodified by JavaScript.

Therefore we need to render the JavaScript content before we crawl the page.

As Selenium has already been mentioned many times in this thread (and its occasional slowness has also been noted), I will list two other possible solutions.


Solution 1: This is a very nice tutorial on how to use Scrapy to crawl JavaScript-generated content, and we are going to follow just that.

What we will need:

  1. Docker installed on our machine. This is a plus over the other solutions up to this point, as it uses an OS-independent platform.

  2. Install Splash following the instructions listed for our corresponding OS.
    Quoting from the Splash documentation:

    Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.

    Essentially we are going to use Splash to render JavaScript-generated content.

  3. Run the Splash server: sudo docker run -p 8050:8050 scrapinghub/splash. (A quick sanity check for the server is sketched right after this list.)

  4. Install the scrapy-splash plugin: pip install scrapy-splash

  5. Assuming that we already have a Scrapy project created (if not, let’s make one), we will follow the guide and update the settings.py:

    Then go to your scrapy project’s settings.py and set these middlewares:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    

    The URL of the Splash server (if you’re using Windows or OS X, this should be the URL of the Docker machine: How to get a Docker container’s IP address from the host?):

    SPLASH_URL = 'http://localhost:8050'
    

    And finally you need to set these values too:

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    
  6. Finally, we can use a SplashRequest:

    In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS-generated data, you have to use SplashRequest (or SplashFormRequest) to render the page. Here’s a simple example:

    import scrapy
    from scrapy_splash import SplashRequest

    # QuoteItem is assumed to be defined in the project's items.py with
    # 'author' and 'quote' fields (as in the tutorial this follows).
    from ..items import QuoteItem


    class MySpider(scrapy.Spider):
        name = "jsscraper"
        start_urls = ["http://quotes.toscrape.com/js/"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(
                    url=url, callback=self.parse, endpoint="render.html"
                )

        def parse(self, response):
            for q in response.css("div.quote"):
                quote = QuoteItem()
                quote["author"] = q.css(".author::text").extract_first()
                quote["quote"] = q.css(".text::text").extract_first()
                yield quote
    

    SplashRequest renders the URL as HTML and returns the response, which you can use in the callback (parse) method.
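
To sanity-check the Splash server from step 3 before wiring it into Scrapy, you can hit its render.html HTTP endpoint directly. A minimal sketch using the requests library (the wait value and the example URL are assumptions, not part of the steps above):

import requests

# Ask Splash to render the page (executing its JavaScript) and
# return the final HTML.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 0.5},
)
print('class="quote"' in resp.text)  # True once the JavaScript has run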


Solution 2: Let’s call this one experimental at the moment (May 2018)…
This solution is for Python 3.6 only (at the moment).

Do you know the requests module (well, who doesn’t)?
Now it has a little web-crawling sibling: requests-HTML:

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

  1. Install requests-html: pipenv install requests-html

  2. Make a request to the page’s URL:

    from requests_html import HTMLSession
    
    session = HTMLSession()
    r = session.get(a_page_url)
    
  3. Render the response to get the Javascript generated bits:

    r.html.render()
    

Finally, the module seems to offer scraping capabilities of its own.
Alternatively, we can try the well-documented route of using BeautifulSoup with the r.html object we just rendered.
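
Putting the pieces together, here is a short end-to-end sketch (the target URL and the CSS selectors are assumptions borrowed from the Scrapy example above, not part of the original steps):

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("http://quotes.toscrape.com/js/")
r.html.render()  # downloads Chromium on first use, then executes the page's JS

# Option A: requests-html's own CSS selectors
for q in r.html.find("div.quote"):
    print(q.find(".text", first=True).text)

# Option B: hand the rendered HTML to BeautifulSoup
soup = BeautifulSoup(r.html.html, "html.parser")
print(soup.select_one("div.quote .author").get_text())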

5

  • can you expand on how to get the full HTML content, with the JS bits loaded, after calling .render()? I’m stuck after that point. I’m not seeing, in the r.html.html object, all the iframes that JavaScript normally injects into the page.

    Dec 13, 2018 at 20:24

  • @anon58192932 Since at the moment this is an experimental solution and I don’t know what exactly you are trying to achieve as a result, I cannot really suggest anything… You can create a new question here on SO if you haven’t worked out a solution yet

    Jan 2, 2019 at 13:57

  • 6

    I got this error: RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.

    Apr 23, 2019 at 15:59

  • 2

    @HuckIt this seems to be a known issue: github.com/psf/requests-html/issues/140

    Oct 15, 2019 at 12:22

  • 1

    I have tried the first method, but I still can’t see the JS-rendered content. Can you tell me what I’m missing?

    Jul 12 at 16:41

61

Maybe Selenium can do it.

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get(url)  # url is the address of the JavaScript-driven page
time.sleep(5)    # crude wait to let the JavaScript finish rendering
htmlSource = driver.page_source
driver.quit()
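
If the fixed five-second sleep proves flaky, Selenium’s explicit waits block only until the JavaScript-generated element actually appears. A sketch under the assumption that the page marks its rendered content with div.quote (as on the demo page used above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get(url)
# Wait up to 10 seconds for the JavaScript-generated element to show up.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
)
htmlSource = driver.page_source
driver.quit()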

3

  • 4

    Selenium is really heavy for this kind of thing; it’d be unnecessarily slow, and it requires a browser head if you don’t use PhantomJS, but this would work.

    Jul 28, 2017 at 16:27

  • @JoshuaHedges You can run other more standard browsers in headless mode.

    Jan 9, 2020 at 0:55

  • 5

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)

    Oct 15, 2020 at 14:50