Web scraping Yelp Reviews with Nodejs


News category: Programming
Source: dev.to

What will be scraped

what

Full code

If you don't need an explanation, have a look at the full code example in the online IDE.

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const reviewsLimit = 50; // hardcoded limit for demonstration purpose

const URL = `https://www.yelp.com/biz/kfc-seattle-18?osq=kfc#reviews`;

async function getReviewsFromPage(page) {
  return await page.evaluate(() => {
    return Array.from(document.querySelectorAll("section[aria-label='Recommended Reviews'] div > ul > li")).map((el) => {
      // Optional chaining on the element too, so a review without an avatar does not throw
      const thumbnails = el.querySelector("div > a > img")?.getAttribute("srcset")?.split(", ");
      const bestResolutionThumbnail = thumbnails && thumbnails[thumbnails.length - 1].split(" ")[0];
      return {
        user: {
          name: el.querySelector(".user-passport-info span > a")?.textContent,
          link: `https://www.yelp.com${el.querySelector(".user-passport-info span > a")?.getAttribute("href")}`,
          thumbnail: bestResolutionThumbnail,
          address: el.querySelector(".user-passport-info div > span")?.textContent,
          friends: el.querySelector("[aria-label='Friends'] span > span")?.textContent,
          photos: el.querySelector("[aria-label='Photos'] span > span")?.textContent,
          reviews: el.querySelector("[aria-label='Reviews'] span > span")?.textContent,
          eliteYear: el.querySelector(".user-passport-info div > a > span")?.textContent,
        },
        comment: {
          text: el.querySelector("span[lang]")?.textContent,
          language: el.querySelector("span[lang]")?.getAttribute("lang"),
        },
        date: el.querySelector(":scope > div > div:nth-child(2) div:nth-child(2) > span")?.textContent,
        rating: el.querySelector("span > div[role='img']")?.getAttribute("aria-label")?.split(" ")?.[0],
        photos: Array.from(el.querySelectorAll(":scope > div > div:nth-child(5) > div > div")).map((el) => {
          // Fall back to an empty string so a missing image or alt text does not throw
          const captionString = el.querySelector("img")?.getAttribute("alt") ?? "";
          const captionStart = captionString.indexOf(".");
          const caption = captionStart !== -1 ? captionString.slice(captionStart + 2) : undefined;
          return {
            link: el.querySelector("img")?.getAttribute("src"),
            caption,
          };
        }),
        feedback: {
          useful: el.querySelector(":scope > div > div:last-child span:nth-child(1) button span > span > span")?.textContent,
          funny: el.querySelector(":scope > div > div:last-child span:nth-child(2) button span > span > span")?.textContent,
          cool: el.querySelector(":scope > div > div:last-child span:nth-child(3) button span > span > span")?.textContent,
        },
      };
    });
  });
}

async function getReviews() {
  const browser = await puppeteer.launch({
    headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);

  const reviews = [];

  while (true) {
    await page.waitForSelector("section[aria-label='Recommended Reviews'] div > ul");
    reviews.push(...(await getReviewsFromPage(page)));
    const isNextPage = await page.$("a[aria-label='Next']");
    if (!isNextPage || reviews.length >= reviewsLimit) break;
    await page.click("a[aria-label='Next']");
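    // page.waitForTimeout() is not available in newer Puppeteer releases, so pause with a plain timer
    await new Promise((resolve) => setTimeout(resolve, 3000));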
  }

  await browser.close();

  return reviews;
}

getReviews().then((result) => console.dir(result, { depth: null }));

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth. They let us control Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, in the directory with our project, open the command line and enter:

$ npm init -y

And then:

$ npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth, so that websites are less likely to detect that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.

stealth

Process

We need to extract data from HTML elements. Finding the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.

We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.

The GIF below illustrates the approach of selecting different parts of the results using SelectorGadget.

how

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to make it harder for websites to detect that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, set how many results we want to receive (reviewsLimit constant), and define the URL of the place's reviews page:

Note: you can get the place reviews URL from our Web scraping Yelp Organic Results with Nodejs blog post in the DIY solution section.

puppeteer.use(StealthPlugin());

const reviewsLimit = 50; // hardcoded limit for demonstration purpose

const URL = `https://www.yelp.com/biz/kfc-seattle-18?osq=kfc#reviews`;

Next, we write a function to get reviews from the page:

async function getReviewsFromPage(page) {
  ...
}

Then, we get information from the page context (using evaluate() method) and save it in the returned object:

return await page.evaluate(() => ({
    ...
}));

Next, we return a new array (Array.from() method) built from all elements matching the "section[aria-label='Recommended Reviews'] div > ul > li" selector (querySelectorAll() method):

return Array.from(document.querySelectorAll("section[aria-label='Recommended Reviews'] div > ul > li")).map((el) => {
  ...
});

To build the returned result object, we first need the user thumbnail links in all available resolutions. Then we take the last link in the list, which has the best resolution:

const thumbnails = el.querySelector("div > a > img")?.getAttribute("srcset")?.split(", ");
const bestResolutionThumbnail = thumbnails && thumbnails[thumbnails.length - 1].split(" ")[0];
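The srcset handling can be tried outside the browser. The sketch below is a minimal standalone version of the same idea. The helper name pickBestThumbnail and the sample srcset string are assumptions made for illustration, and the code assumes that candidates are listed from lowest to highest resolution, as the article does.

```javascript
// Hypothetical helper: returns the URL of the last candidate in a srcset string.
// Assumes candidates are ordered from lowest to highest resolution.
function pickBestThumbnail(srcset) {
  if (!srcset) return undefined; // attribute missing or empty
  const candidates = srcset
    .split(/,\s+/)
    .map((candidate) => candidate.trim())
    .filter(Boolean);
  if (candidates.length === 0) return undefined;
  // Each candidate looks like "<url> <descriptor>", so keep only the URL part.
  return candidates[candidates.length - 1].split(/\s+/)[0];
}

// Made-up sample value, not taken from a real Yelp page.
const sampleSrcset = "https://example.com/a/30s.jpg 1x, https://example.com/a/60s.jpg 2x";
console.log(pickBestThumbnail(sampleSrcset));
```

Returning undefined for a missing attribute keeps the thumbnail field optional instead of failing the whole review.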

Next, we get the different parts of each review with the selectors below and return them as one object:

return {
  user: {
    name: el.querySelector(".user-passport-info span > a")?.textContent,
    link: `https://www.yelp.com${el.querySelector(".user-passport-info span > a")?.getAttribute("href")}`,
    thumbnail: bestResolutionThumbnail,
    address: el.querySelector(".user-passport-info div > span")?.textContent,
    friends: el.querySelector("[aria-label='Friends'] span > span")?.textContent,
    photos: el.querySelector("[aria-label='Photos'] span > span")?.textContent,
    reviews: el.querySelector("[aria-label='Reviews'] span > span")?.textContent,
    eliteYear: el.querySelector(".user-passport-info div > a > span")?.textContent,
  },
  comment: {
    text: el.querySelector("span[lang]")?.textContent,
    language: el.querySelector("span[lang]")?.getAttribute("lang"),
  },
  date: el.querySelector(":scope > div > div:nth-child(2) div:nth-child(2) > span")?.textContent,
  rating: el.querySelector("span > div[role='img']")?.getAttribute("aria-label")?.split(" ")?.[0],
  photos: Array.from(el.querySelectorAll(":scope > div > div:nth-child(5) > div > div")).map((el) => {
    // Fall back to an empty string so a missing image or alt text does not throw
    const captionString = el.querySelector("img")?.getAttribute("alt") ?? "";
    const captionStart = captionString.indexOf(".");
    const caption = captionStart !== -1 ? captionString.slice(captionStart + 2) : undefined;
    return {
      link: el.querySelector("img")?.getAttribute("src"),
      caption,
    };
  }),
  feedback: {
    useful: el.querySelector(":scope > div > div:last-child span:nth-child(1) button span > span > span")?.textContent,
    funny: el.querySelector(":scope > div > div:last-child span:nth-child(2) button span > span > span")?.textContent,
    cool: el.querySelector(":scope > div > div:last-child span:nth-child(3) button span > span > span")?.textContent,
  },
};
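All values read with textContent are strings, which is why the DIY output further down contains entries such as "friends":"56" and "useful":" 1". If numbers are more convenient, a small post-processing step could convert them. The sketch below is one possible way to do that. The helpers toCount and normalizeReview are hypothetical names that are not part of the original code, and the sample object is abbreviated.

```javascript
// Hypothetical helper: converts scraped text such as " 1" or "734" to a number.
// Returns undefined when the text is missing or is not numeric.
function toCount(text) {
  if (text === undefined || text === null) return undefined;
  const cleaned = String(text).trim().replace(/,/g, "");
  if (cleaned === "") return undefined;
  const value = Number(cleaned);
  return Number.isNaN(value) ? undefined : value;
}

// Hypothetical post-processing step for one review object
// shaped like the ones returned by getReviewsFromPage().
function normalizeReview(review) {
  return {
    ...review,
    rating: toCount(review.rating),
    user: {
      ...review.user,
      friends: toCount(review.user?.friends),
      photos: toCount(review.user?.photos),
      reviews: toCount(review.user?.reviews),
    },
    feedback: {
      useful: toCount(review.feedback?.useful),
      funny: toCount(review.feedback?.funny),
      cool: toCount(review.feedback?.cool),
    },
  };
}

// Abbreviated sample based on the field shapes shown in the Output section.
const sampleReview = {
  user: { name: "Mark T.", friends: "56", photos: "734", reviews: "241" },
  rating: "2",
  feedback: { useful: " 1" },
};
console.log(normalizeReview(sampleReview));
```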

Next, write a function to control the browser, and get information:

async function getReviews() {
  ...
}

In this function, we first define the browser using the puppeteer.launch({options}) method with the options headless: true and args: ["--no-sandbox", "--disable-setuid-sandbox"].

These options mean that we use headless mode and pass an array of arguments that allows the browser process to launch in the online IDE. Then we open a new page:

const browser = await puppeteer.launch({
  headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();

Next, we raise the default navigation timeout from 30 seconds to 60000 ms (1 minute) with the .setDefaultNavigationTimeout() method to allow for slow internet connections, go to the URL with the .goto() method, and define the reviews array:

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);

const reviews = [];

Next, we use a while loop in which we wait until the reviews list is loaded (.waitForSelector() method), add the results from the page to the reviews array (using spread syntax), and check whether the next page button is present on the page ($ method). If there is no button, or the number of results has reached reviewsLimit, we stop the loop (using break). Otherwise, we click the next page button (click() method) and wait 3 seconds before the next iteration. The pause uses a plain setTimeout wrapped in a promise, because page.waitForTimeout() is not available in newer Puppeteer releases:

while (true) {
  await page.waitForSelector("section[aria-label='Recommended Reviews'] div > ul");
  reviews.push(...(await getReviewsFromPage(page)));
  const isNextPage = await page.$("a[aria-label='Next']");
  if (!isNextPage || reviews.length >= reviewsLimit) break;
  await page.click("a[aria-label='Next']");
  await new Promise((resolve) => setTimeout(resolve, 3000));
}
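Because a whole page of reviews is added on every iteration, the loop can return more than reviewsLimit items. The sketch below shows the same collect-until-limit pattern with the overshoot trimmed. It runs against a fake in-memory page source instead of a browser, and the names collectWithLimit and makeFakeSource are assumptions made for illustration.

```javascript
// Hypothetical generic version of the loop: `fetchPage` resolves to the items
// of the current page, `goToNext` resolves to false when there is no next page.
async function collectWithLimit(fetchPage, goToNext, limit) {
  const items = [];
  while (true) {
    items.push(...(await fetchPage()));
    if (items.length >= limit) break;
    const hasNext = await goToNext();
    if (!hasNext) break;
  }
  return items.slice(0, limit); // trim the overshoot from the last page
}

// Stand-in for a browser page: serves the given arrays one page at a time.
function makeFakeSource(pages) {
  let current = 0;
  return {
    fetchPage: async () => pages[current],
    goToNext: async () => {
      if (current + 1 >= pages.length) return false;
      current += 1;
      return true;
    },
  };
}

const source = makeFakeSource([["r1", "r2", "r3"], ["r4", "r5", "r6"], ["r7"]]);
collectWithLimit(source.fetchPage, source.goToNext, 5).then((result) => console.log(result));
```

With Puppeteer, fetchPage would wrap getReviewsFromPage(page) and goToNext would wrap the check for the Next button, the click, and the pause.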

And finally, we close the browser, and return the received data:

await browser.close();

return reviews;

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

[
    {
        "user":{
            "name":"Mark T.",
            "link":"https://www.yelp.com/user_details?userid=z44H_fDiNpvH-B8B_vBnBA",
            "thumbnail":"https://s3-media0.fl.yelpcdn.com/photo/9n6QdxRSINg3RZhoy0vw7w/ms.jpg",
            "address":"Seattle, WA",
            "friends":"56",
            "photos":"734",
            "reviews":"241",
            "eliteYear":"Elite 2022"
        },
        "comment":{
            "text":"`This KFC made me sad.  The food: OK.  The To-Go service: NOT GOOD.  They messed up my order, and you might think, \"Check your order before you leave, Mark T.!\"  But they seal the plastic to-go bag with some magical heating method that hermetically seals the bag super well.  Maybe this is good so it won't spill in your car, but it makes it very hard to check the order correctness.  Pro Tip: DO CHECK YOUR ORDER CORRECTNESS.  I mean - I get it - these are hard times and the workers are trying, so Two Yelpy stars instead of One.The vexing part was that the receipt stapled to the outside was completely correct.  It's the contents of the bag that were wrong.  One might reasonably argue that the food is more important than the receipt - particularly when you're hungry for something other than paper.On the plus side the reason I went there was to try the Beyond Chicken nuggets - and they were GOOD!  The taste was good and the texture was good.  I'm not a super chicken connoisseur (thanks spell-check just now), but I've eaten my share of nuggets in my lifetime.  Check out the cross section in the photo.  I'm not sure I could tell it from 'real' chicken.  So anyway, I ordered mashed potatoes with gravy and didn't get the gravy; ordered two types of sauces I didn't get and in general the cole slaw is a total rip at $3.49 for a tiny little container.  If you want people to eat healthy food, don't overcharge for the only green vegetable you sell!`",
            "language":"en"
        },
        "date":"1/28/2022",
        "rating":"2",
        "photos":[
            {
                "link":"https://s3-media0.fl.yelpcdn.com/bphoto/CsgT6CJlTEC9tuqWgw9hLg/180s.jpg",
                "caption":"The Beyond Chicken.  Best when freshly hot."
            },
            {
                "link":"https://s3-media0.fl.yelpcdn.com/bphoto/I_v-WDBLYBTc7z0O8GJCVQ/180s.jpg",
                "caption":"The sadness of my KFC trip."
            },
            {
                "link":"https://s3-media0.fl.yelpcdn.com/bphoto/qMGmBmZPxzD0pQfhfvleJg/180s.jpg",
                "caption":"That texture!  Might even fool chicken-philes!"
            }
        ],
        "feedback":{
            "useful":" 1"
        }
    },
    ...and other results
]

Using Yelp Reviews API from SerpApi

This section compares the DIY solution with our solution.

The biggest difference is that you don't need to create the parser from scratch and maintain it.

There's also a chance that the request might be blocked at some point by Yelp. We handle that on our backend, so there's no need to figure out how to do it yourself or to work out which CAPTCHA-solving service or proxy provider to use.

First, we need to install google-search-results-nodejs:

npm i google-search-results-nodejs

Here's the full code example, if you don't need an explanation:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); //your API key from serpapi.com

const reviewsLimit = 50; // hardcoded limit for demonstration purpose

const params = {
  engine: "yelp_reviews", // search engine
  device: "desktop", //Parameter defines the device to use to get the results. It can be set to "desktop" (default), "tablet", or "mobile"
  place_id: "UON0MxZGG0cgsU5LYPjJbg", //Parameter defines the Yelp ID of a place
};

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};

const getResults = async () => {
  const reviews = [];
  while (true) {
    const json = await getJson();
    if (json.reviews) {
      reviews.push(...json.reviews);
      params.start ? (params.start += 10) : (params.start = 10);
    } else break;
    if (reviews.length >= reviewsLimit) break;
  }
  return reviews;
};

getResults().then((result) => console.dir(result, { depth: null }));

Code explanation

First, we need to declare SerpApi from the google-search-results-nodejs library and define a new search instance with your API key from SerpApi:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(API_KEY);

Next, we write the necessary parameters for making a request and set how many results we want to receive (reviewsLimit constant):

Note: you can get the place ID from our Web scraping Yelp Organic Results with Nodejs blog post in the SerpApi solution section.

const reviewsLimit = 50; // hardcoded limit for demonstration purpose

const params = {
  engine: "yelp_reviews", // search engine
  device: "desktop", //Parameter defines the device to use to get the results. It can be set to "desktop" (default), "tablet", or "mobile"
  place_id: "UON0MxZGG0cgsU5LYPjJbg", //Parameter defines the Yelp ID of a place
};

Next, we wrap the search method from the SerpApi library in a promise to further work with the search results:

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};
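The wrapping pattern can be tried without an API key. In the sketch below, fakeSearch is a stub that only mimics the json(params, callback) call shape used in this post and does not contact SerpApi. The helper name getJsonFrom is also an assumption. Unlike getJson above, it takes the client and the parameters as arguments instead of reading them from the outer scope.

```javascript
// Stub with the same call shape the article uses: json(params, callback).
// It is a stand-in for illustration and does not contact SerpApi.
const fakeSearch = {
  json(params, callback) {
    setTimeout(() => callback({ engine: params.engine, reviews: ["a", "b"] }), 10);
  },
};

// Hypothetical wrapper: turns any client exposing json(params, callback)
// into something that can be awaited.
function getJsonFrom(client, params) {
  return new Promise((resolve) => {
    client.json(params, resolve);
  });
}

getJsonFrom(fakeSearch, { engine: "yelp_reviews" }).then((json) => {
  console.log(json.reviews.length);
});
```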

And finally, we declare the function getResults that gets data from the page and returns it:

const getResults = async () => {
  ...
};

In this function, we declare an empty reviews array and, in a while loop, get the json, add the reviews from each page, and set the start index of the next page (the params.start value). If there are no more results on the page, or if the number of received results has reached reviewsLimit, we stop the loop (using break) and return the reviews array:

const reviews = [];
while (true) {
  const json = await getJson();
  if (json.reviews) {
    reviews.push(...json.reviews);
    params.start ? (params.start += 10) : (params.start = 10);
  } else break;
  if (reviews.length >= reviewsLimit) break;
}
return reviews;
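The offset logic can also be written without mutating params, and with the result trimmed to the limit. The sketch below is one way to do that against a stubbed data source. The names fetchAllReviews and stubGetPage, the fake data, and the default page size of 10 (taken from the increment used above) are assumptions made for illustration.

```javascript
// Hypothetical offset-based variant: `getPage(start)` resolves to an object
// shaped like the response used above ({ reviews: [...] }, or no reviews key at the end).
async function fetchAllReviews(getPage, limit, pageSize = 10) {
  const reviews = [];
  for (let start = 0; reviews.length < limit; start += pageSize) {
    const json = await getPage(start);
    if (!json.reviews || json.reviews.length === 0) break; // no more pages
    reviews.push(...json.reviews);
  }
  return reviews.slice(0, limit); // never return more than the limit
}

// Stub data source with 25 fake reviews, served 10 at a time.
const allFakeReviews = Array.from({ length: 25 }, (_, i) => `review ${i + 1}`);
const stubGetPage = async (start) => {
  const page = allFakeReviews.slice(start, start + 10);
  return page.length ? { reviews: page } : {};
};

fetchAllReviews(stubGetPage, 50).then((result) => console.log(result.length)); // prints 25
```

With the real client, getPage could call the promise wrapper with a copy of the parameters, for example { ...params, start }.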

After that, we run the getResults function and print all the received information in the console with the console.dir method, which accepts an options object that changes the default output settings:

getResults().then((result) => console.dir(result, { depth: null }));
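To see what { depth: null } changes, the sketch below formats a nested object with util.inspect, the Node.js function that console.dir uses for formatting. The sample object is made up for illustration.

```javascript
const util = require("util");

// Made-up object nested four levels deep.
const nested = { user: { stats: { feedback: { useful: 1 } } } };

// With the default depth of 2, the innermost level is collapsed to [Object].
const shallow = util.inspect(nested);

// depth: null removes the limit, so every level is printed.
const full = util.inspect(nested, { depth: null });

console.log(shallow);
console.log(full);
```

Review objects contain nested user, comment, photos and feedback fields, so printing them without a depth limit keeps the whole structure visible.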

Output

[
    {
        "user":{
            "name":"Lesjar M.",
            "user_id":"pQJfTSDC-zEdcrU0K_gONw",
            "link":"https://www.yelp.com/user_details?userid=pQJfTSDC-zEdcrU0K_gONw",
            "thumbnail":"https://s3-media0.fl.yelpcdn.com/photo/k5o32UV20u44mH_W7pPx_A/60s.jpg",
            "address":"Northwest Everett, Everett, WA",
            "photos":1,
            "reviews":2
        },
        "comment":{
            "text":"I just ordered a chicken sandwich and fries. Got it delivered through door dash. The classic chicken sandwich is $8.99 by itself. Which is a decently priced chicken sandwich. When I got the order, and opened the bag. I grabbed the sandwich and I literally thought the driver ate half the sandwich because it was the smallest, pathetic, and sad looking piece of crap, posing as a chicken sandwich I've ever seen. It looked no where near the same size as the picture projected it to be. The picture was a full size chicken sandwich. What was handed to me had to have been an imposter. I'm so angry. I feel like I was totally ripped off. Money is tight, but thought I'd splurge since I'm going back to work tomorrow. I will never eat at KFC again. I've never been so ripped off! (If I could I'd give half a star!)",
            "language":"en"
        },
        "date":"2/8/2022",
        "rating":1,
        "tags":[
            "1 photo"
        ],
        "photos":[
            {
                "link":"https://s3-media0.fl.yelpcdn.com/bphoto/95M1boh80dnsCiBcPbyrAQ/o.jpg",
                "caption":"KFC \"Classic Chicken Sandwich.\"",
                "uploaded":"February 8, 2022"
            }
        ],
        "feedback":{
            "funny":1,
            "cool":1
        }
    },
  ...and other results
]

Links

If you want other functionality added to this blog post or if you want to see some projects made with SerpApi, write me a message.

Join us on Twitter | YouTube

Add a Feature Request or a Bug
