Categories
google-chrome-headless javascript node.js puppeteer web-scraping

Retrieving JavaScript Rendered HTML with Puppeteer

I am attempting to scrape the HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page instead of the HTML of an incorrect gene page (https://www.ncbi.nlm.nih.gov/gene/119016).

URL fragments are not sent to the server. Instead, the page's JavaScript uses them client-side to (in this case) create entirely different HTML. That is the HTML you get when you open the page in a browser and choose “View page source”, and it is the HTML I want to retrieve. R's readLines() ignores the part of the URL that follows the #.
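To make the fragment point concrete, here is a small sketch that uses Node's built-in URL class to split the search URL into the part that goes into the HTTP request and the part that stays in the client. The variable names are only illustrative.

```javascript
// The WHATWG URL class is available as a global in Node.js.
const target = new URL('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');

// Path plus query string: this is what ends up in the HTTP request.
const requestTarget = target.pathname + target.search;

// The fragment: kept by the client and only visible to the page's own JavaScript.
const fragment = target.hash;

console.log(requestTarget); // /gene/?term=AGAP8
console.log(fragment);      // #see-all
```

Any tool that only performs the HTTP request therefore sees the same response with or without #see-all.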

I tried PhantomJS first, but it only returned the error described here: ReferenceError: Can’t find variable: Map. This seems to result from PhantomJS not supporting some feature that NCBI uses, which rules out that approach.

I had more success with Puppeteer, using the following JavaScript run with Node.js:

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
  // Serialize the page's current DOM to an HTML string.
  const HTML = await page.content();
  // Write the HTML to a file.
  const ws = fs.createWriteStream('TempInterfaceWithChrome.js');
  ws.write(HTML);
  ws.end();
  // Create an empty flag file to signal that the run has finished.
  const ws2 = fs.createWriteStream('finishedFlag');
  ws2.end();
  await browser.close();
})();

However, this returned what appeared to be the pre-rendered HTML. How do I (programmatically) get the final HTML that I see in the browser?
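One direction that might be worth trying, sketched here under assumptions: page.goto() accepts a waitUntil option, and 'networkidle0' makes it wait until there have been no network connections for a short period, which may give the client-side script time to finish. Whether that is sufficient for this particular NCBI page is not confirmed. The helper name fetchRenderedHtml is hypothetical, and the Puppeteer module is passed in as a parameter only to keep the sketch self-contained.

```javascript
// Hypothetical helper: load a URL and return the DOM as HTML after the
// network has gone idle. `puppeteer` is the module from require('puppeteer').
async function fetchRenderedHtml(puppeteer, url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Assumption: waiting for network idle is enough for the page's
    // JavaScript to finish rewriting the DOM. This is not guaranteed.
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.content();
  } finally {
    // Close the browser even if navigation fails.
    await browser.close();
  }
}

// Example call (requires Puppeteer to be installed):
// fetchRenderedHtml(require('puppeteer'),
//   'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all')
//   .then((html) => console.log(html.length));
```

If network idle turns out not to be a reliable signal, waiting for a specific element with page.waitForSelector() would be another option, though which selector to wait for on this page is unknown to me.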