
Using request(), returned page doesn’t contain needed data yet – incomplete page is returned instead. How do I ‘wait’?

I am trying to extract the year, make, model, colour and plate number from carjam.co.nz. An example of a URL I am scraping from is https://www.carjam.co.nz/car/?plate=JKY242.

If the plate has been recently requested, then the response will be an HTML document with the vehicle details.

Result where the plate details have been recently requested.

If the plate details haven’t been recently requested (as is the case with most plates), the response is an HTML document with “Trying to get some vehicle data”. I’m guessing that this page displays while the information is fetched from the database, then the page is reloaded to show the vehicle details. This appears to be rendered server-side, as I can’t see any AJAX requests.

The URL is the same for each result.

Result where the plate details haven't been recently requested.

How do I ‘wait’ for the correct information?

I am using request (deprecated I know, but it is what I am most comfortable using) on a Node.js server with Express.

My (very reduced) code:

app.get("/:numberPlate", (req, res) => {
  // Fetch the CarJam page for the requested plate.
  request("https://www.carjam.co.nz/car/?plate=" + req.params.numberPlate, function(error, response, body) {
    // Parse the returned HTML and read the value next to each labelled key.
    const $ = cheerio.load(body);
    res.status(200).send(JSON.stringify({
      year: $("[data-key=year_of_manufacture]").next().html(),
      make: toTitleCase($("[data-key=make]").next().html()),
      model: toTitleCase($("[data-key=model]").next().html()),
      colour: toTitleCase($("[data-key=main_colour]").next().html()),
    }));
  });
});

I have considered:

  • Making a request and discarding it, sleeping for 2–3 seconds, then making a second request. The advantage of this approach is that every request would work. The disadvantage is that every request takes 2–3 seconds, which is too slow.
  • Making a request and checking to see if the body contains “Trying to get some vehicle data”. If so, sleep a few seconds, make another request and take action on the result of that second request (but how?).

I’m sure this is a common problem with an easy answer, but I don’t have enough experience to figure it out myself, or to know exactly what to Google!


To test: New Zealand has number plates in the format “ABC123”: three letters, then three numbers. These are released in roughly alphabetical order; currently we have nothing past NLU999 (excluding custom numberplates, numberplates issued out of sequence, etc.).

To reproduce the “Trying to get some vehicle data” page, you need to use a new numberplate each time. Most numberplates earlier in the sequence than NLU999 should work.

This code snippet should generate a valid numberplate.

console.log(Math.random().toString(36).replace(/[^a-n]+/g, '').substr(0, 1).toUpperCase() + Math.random().toString(36).replace(/[^a-z]+/g, '').substr(0, 2).toUpperCase() + Math.floor(Math.random() * 10).toString() + Math.floor(Math.random() * 10).toString() + Math.floor(Math.random() * 10).toString());
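
The one-liner above strips unwanted characters out of a random base-36 string, so it can occasionally come up short when that string happens to contain too few letters. A more readable sketch that always yields six characters might look like the following. The names LETTERS, pick and randomPlate are made up for illustration, and like the original it can still produce plates between NLV000 and NZZ999, which are past the current sequence.

```javascript
const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Pick one character uniformly at random from `chars`.
const pick = (chars) => chars[Math.floor(Math.random() * chars.length)];

// First letter A-N (as in the one-liner), then two letters A-Z, then three digits.
function randomPlate() {
  const first = pick(LETTERS.slice(0, 14)); // "ABCDEFGHIJKLMN"
  const rest = pick(LETTERS) + pick(LETTERS);
  // Zero-pad so that, for example, 7 becomes "007".
  const digits = String(Math.floor(Math.random() * 1000)).padStart(3, "0");
  return first + rest + digits;
}

console.log(randomPlate());
```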

05 May 2021 update

Upon further thought, this pseudocode could be what I’m after, but I’m unsure how to implement it in practice.

request(url) {
  if (url body contains "Trying to get some vehicle data") {
    wait(2 seconds)
    request(url again) {
      return second_result
    }
  } else {
    return first_result
  }
}
then
process(first_result or second_result)

My difficulty here: I am used to the format request().then(), taking action directly from the request.

Assuming this approach is correct, how would I conduct the following?

  1. Send the request, then
  2. Assess the response, then
  3. Pass this response on, or send another request then pass that response on
  4. Process response
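
One way the four steps above could be sketched is with async/await and a small retry loop. This is only a sketch under assumptions: fetchVehiclePage, sleep, delayMs and maxAttempts are hypothetical names, the placeholder check is a plain substring match on the phrase quoted earlier, and the loop allows more than one retry, which goes slightly beyond the two-request pseudocode. The HTTP call is passed in as a function so the retry logic does not depend on any particular library.

```javascript
// Phrase shown on the interim page (taken from the question).
const PLACEHOLDER_TEXT = "Trying to get some vehicle data";

// Resolve after `ms` milliseconds so the wait can be awaited.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchPage: any function that takes a URL and returns a Promise of the HTML body.
async function fetchVehiclePage(fetchPage, url, { delayMs = 2000, maxAttempts = 3 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const body = await fetchPage(url);        // 1. send the request
    if (!body.includes(PLACEHOLDER_TEXT)) {   // 2. assess the response
      return body;                            // 3a. pass this response on
    }
    if (attempt < maxAttempts) {
      await sleep(delayMs);                   // 3b. wait, then request again
    }
  }
  throw new Error(`Still no vehicle data after ${maxAttempts} attempts`);
}

// Hypothetical wiring with the `request` package and Express (not run here):
//
// const fetchPage = (url) => new Promise((resolve, reject) =>
//   request(url, (error, response, body) => (error ? reject(error) : resolve(body))));
//
// app.get("/:numberPlate", async (req, res) => {
//   try {
//     const url = "https://www.carjam.co.nz/car/?plate=" + encodeURIComponent(req.params.numberPlate);
//     const $ = cheerio.load(await fetchVehiclePage(fetchPage, url)); // 4. process response
//     res.json({ year: $("[data-key=year_of_manufacture]").next().html() });
//   } catch (err) {
//     res.status(502).json({ error: err.message });
//   }
// });
```

With this shape, plates that were recently requested return after a single request, and only the placeholder case pays the delay. The 2-second delay and the cap of three attempts are guesses and would need tuning against how long the site actually takes.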