Web Scraping with Cheerio: Building the Spanish Diminutive Generator

Note: This was originally written in 2020, before the rise of LLMs.

As a lifelong student of Spanish, I've always loved the endearing and playful nature of the diminutive form:

  • casita for a little casa (house)
  • cafecito for a small café (or to affectionately refer to a regular-sized one)
  • gatito or gatico for a gato (cat)

I was surprised that nowhere on the entire internet could I find a tool to properly generate this form. I searched APIs, npm packages, and all over GitHub... nothing. No way to enter café and receive cafecito, or perro and get perrito.

So I figured, why not build it myself?

For my Spanish Diminutive Generator, I initially tried to find an appropriate API, but my vision was bigger than my budget. Sometimes you have to get scrappy. Since this was a personal project that wouldn't drive significant traffic to the source, web scraping felt like a reasonable approach.

The app is simple: input a word, get its diminutive form plus a photo.

casa → casita
café → cafecito
pantalón → pantaloncito

The Gender Problem

To properly generate diminutives, the algorithm needs to know each word's grammatical gender. Spanish has two: masculine and feminine. Usually, you can tell by the ending. Words ending in -a are typically feminine, while those ending in -o are typically masculine.

But the ending isn't always enough. For example, lápiz (pencil) and nariz (nose) both end in -z, yet lápiz is masculine and nariz is feminine. My algorithm needs to produce lapicito and naricita, respectively.
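To make the role of gender concrete, here is a minimal sketch that covers only the -z case from the two examples above. The function name diminutiveForZEnding and its rules (drop the acute accent, change z to c, append -ito or -ita) are assumptions for illustration, not the generator's actual conversion algorithm.

```javascript
// Hypothetical helper: handles only nouns ending in -z.
function diminutiveForZEnding(word, isFeminine) {
  if (!word.endsWith("z")) {
    throw new Error(`Expected a word ending in -z, got "${word}"`)
  }
  // Decompose accented letters, drop only the acute accent (U+0301),
  // then recompose so characters like ñ are left intact.
  const unaccented = word
    .normalize("NFD")
    .replace(/\u0301/g, "")
    .normalize("NFC")
  // Spanish spelling changes z to c before i.
  const stem = unaccented.slice(0, -1) + "c"
  return stem + (isFeminine ? "ita" : "ito")
}

console.log(diminutiveForZEnding("lápiz", false)) // lapicito
console.log(diminutiveForZEnding("nariz", true)) // naricita
```

The same stem gets a different suffix depending on a single boolean, which is why the gender lookup has to happen before any conversion.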

To stay budget-friendly, I turned to web scraping. I enlisted a popular online Spanish dictionary and a library called cheerio. After reading Scraping data in 3 minutes with Javascript, I was ready.

Scraping the Dictionary

The dictionary uses predictable URLs like https://example-dictionary.com/definicion/WORD, making string interpolation straightforward. In the screenshots below, you can see the CSS class I need (highlighted in green). If the text content is nm (nombre masculino), the noun is masculine. If it's nf (nombre femenino), it's feminine.

lápiz definition
nariz definition
const cheerio = require("cheerio")

// Runs inside an async function; `word` is the user's input
const response = await fetch(
  `https://example-dictionary.com/definicion/${word}`
)
const htmlText = await response.text()

// Load the page HTML for querying
const $ = cheerio.load(htmlText)

// Target the CSS class containing gender info ('nm' or 'nf')
const genderInfo = $(".POS2").first().text()

By checking index 1 of genderInfo, I get either m or f. The scraper was working—until I tested amiga and got "masculine." What?
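The surprise is easier to see in isolation. The sketch below assumes, as the next section describes, that the dictionary tags a dual-gender word with the string nm, nf. The helper name naiveGender is hypothetical and simply mirrors the "check index 1" approach.

```javascript
// Hypothetical helper mirroring the "check index 1" approach.
function naiveGender(genderInfo) {
  return genderInfo[1] === "f" ? "feminine" : "masculine"
}

console.log(naiveGender("nm")) // masculine
console.log(naiveGender("nf")) // feminine
// For a dual-gender entry, index 1 is still "m",
// so the feminine reading is never seen.
console.log(naiveGender("nm, nf")) // masculine
```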

Handling Dual-Gender Words

presidente definition

Some words can be both genders: el presidente / la presidente, or el amigo / la amiga. For these, the dictionary returns nm, nf, which is a 6-character string (including the comma and space) instead of 2.

Here's how I handled it:

const wordEnding = word[word.length - 1]
const genderChar = $("strong+ .POS2").text()[1] // 'm' or 'f'
let isFeminine = false // assume masculine until shown otherwise

if (genderInfo.length > 2) {
  // Word has multiple gender possibilities—check the ending
  switch (wordEnding) {
    case "o":
      break // masculine (amigo)
    case "e":
      break // default to masculine (presidente)
    default:
      isFeminine = true // feminine (amiga)
  }
} else {
  // Single gender word
  if (genderChar === "f") {
    isFeminine = true
  }
}

The logic relies on two pieces of information: the word the user entered, and the length of the scraped string.

When genderInfo.length > 2, the word supports multiple genders. If the user entered amigo (ending in -o), they want the masculine form. If they entered amiga (ending in -a), neither of the first two cases matches, so isFeminine becomes true to match the user's input.

When genderInfo.length === 2, it's exclusively masculine or feminine, and we simply check whether the second character is f.
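For experimenting without hitting the dictionary, the same branching can be sketched as a pure function. The name resolveIsFeminine is hypothetical, and this version reads the second character from genderInfo directly rather than from a separate selector, which is an assumption about how the two values relate.

```javascript
// Hypothetical pure-function version of the branching described above.
// `genderInfo` is the scraped tag text, e.g. "nm", "nf", or "nm, nf".
function resolveIsFeminine(word, genderInfo) {
  if (genderInfo.length > 2) {
    // Dual-gender entry: fall back on the ending the user typed.
    const wordEnding = word[word.length - 1]
    return wordEnding !== "o" && wordEnding !== "e"
  }
  // Single-gender entry: the second character is "m" or "f".
  return genderInfo[1] === "f"
}

console.log(resolveIsFeminine("amiga", "nm, nf")) // true
console.log(resolveIsFeminine("amigo", "nm, nf")) // false
console.log(resolveIsFeminine("presidente", "nm, nf")) // false
console.log(resolveIsFeminine("nariz", "nf")) // true
console.log(resolveIsFeminine("lápiz", "nm")) // false
```

Keeping the decision separate from the scraping makes it easy to check each case by hand with the tag strings shown earlier.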

(Note: Currently, the generator doesn't accept articles, as in el presidente or la presidente, so for dual-gender words like presidente it defaults to masculine. A future improvement opportunity!)

Wrapping Up

This scraper verifies word gender before passing it to the larger conversion algorithm, which then has everything it needs to generate the diminutive form.

Sometimes rolling up your sleeves and ethically scraping data is the first big step in bringing an idea to life. Check out the Spanish Diminutive Generator (repo here) and let me know what you think!