Cravo Albin Brazilian Music Encyclopedia Scraper
Scrape complete artist profiles from the Dicionário Cravo Albin, the authoritative encyclopedia of Brazilian popular music (MPB). The dictionary covers over 15,000 musicians, composers, lyricists, and producers across all major Brazilian music genres, including samba, bossa nova, choro, MPB, forró, sertanejo, axé, frevo, maracatu, pagode, and funk carioca. Created by music historian Ricardo Cravo Albin and maintained by the Instituto Cultural Cravo Albin, it is the canonical biographical reference for Brazilian popular music, cited by journalists, ECAD royalty researchers, academic ethnomusicologists, and documentary producers.
What data does this actor extract?
For each artist the actor extracts:
- Identity: stage name, real name, alternate names, artist slug
- Biographical: birth date, birth year, birth place, birth state (Brazilian state abbreviation), death date, death place
- Musical: genres, instruments, voice type, notable works, awards
- Biography: full-text biography in Portuguese (combines biographical data, artistic data, and critical review sections)
- Discography: album entries
- Network: related/collaborating artists
- Media: artist photo URL
- Meta: source URL, last updated date
Input
| Field | Type | Default | Description |
|---|---|---|---|
| maxItems | integer | 10 | Maximum number of artist records to scrape |
Output
Each item in the dataset represents one artist:
{
"artist_slug": "caetano-veloso",
"artist_name": "Caetano Veloso",
"real_name": "Caetano Emanuel Viana Teles Veloso",
"birth_date": "7/8/1942",
"birth_year": 1942,
"birth_place": "Santo Amaro, BA",
"birth_state": "BA",
"death_date": null,
"death_place": null,
"alternate_names": [],
"genres": ["mpb", "tropicália", "samba"],
"instruments": ["violão"],
"voice_type": "tenor",
"biography_pt": "Cantor. Compositor...",
"notable_works": [],
"discography": [],
"related_artists": ["Gilberto Gil", "Maria Bethânia"],
"awards": [],
"photo_url": "https://dicionariompb.com.br/wp-content/uploads/...",
"source_url": "https://dicionariompb.com.br/artista/caetano-veloso/",
"last_updated": "2024-03-15T10:00:00+00:00"
}
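The sample record stores birth_date as "7/8/1942" next to birth_year 1942. Caetano Veloso was born on 7 August 1942, so the string reads as day/month/year. The following is a minimal Python sketch for downstream consumers who want ISO dates. It assumes every populated birth_date and death_date follows the same d/m/yyyy pattern, which this single example does not guarantee, and parse_br_date is a hypothetical helper name, not part of the actor.

```python
from datetime import date
from typing import Optional


def parse_br_date(value: Optional[str]) -> Optional[date]:
    """Parse a day/month/year string such as "7/8/1942" into a date.

    Returns None for null, empty, or unparseable values, because the
    dataset may contain partial dates (an assumption, not confirmed).
    """
    if not value:
        return None
    parts = value.strip().split("/")
    if len(parts) != 3:
        return None
    try:
        day, month, year = (int(p) for p in parts)
        return date(year, month, day)
    except ValueError:
        # Non-numeric parts or an impossible calendar date.
        return None


record = {"artist_name": "Caetano Veloso", "birth_date": "7/8/1942", "death_date": None}
born = parse_br_date(record["birth_date"])
print(born.isoformat() if born else None)  # 1942-08-07
```

Returning None instead of raising keeps a bulk conversion running when a record carries only a year or a free-text date.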
How does it work?
The actor walks the site's XML sitemap index to discover all artist detail pages, then scrapes each page with Cheerio, reading the DRTS entity-field markup. The DRTS plugin renders each structured field as a div[data-name="entity_field_*"] element, that is, a div whose data-name attribute begins with entity_field_, so each field can be extracted precisely without relying on fragile CSS class selectors.
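The per-field extraction described above can be illustrated in a few lines. The actor itself uses Cheerio; the sketch below swaps in Python's standard html.parser so that it runs without dependencies, and it is not the actor's code. The field names real_name and biography and the sample HTML are hypothetical; only the entity_field_ prefix comes from the description above.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

FIELD_PREFIX = "entity_field_"


class EntityFieldParser(HTMLParser):
    """Collect the text of each div whose data-name starts with entity_field_."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, str] = {}
        self._current: Optional[str] = None  # field currently being read
        self._depth = 0                      # div nesting depth inside that field
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._current is not None:
            self._depth += 1  # nested div inside the field wrapper
            return
        name = dict(attrs).get("data-name") or ""
        if name.startswith(FIELD_PREFIX):
            self._current = name[len(FIELD_PREFIX):]
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "div" or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join text chunks with spaces and collapse runs of whitespace.
            self.fields[self._current] = " ".join(" ".join(self._chunks).split())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)


# Hypothetical markup in the shape the description implies.
sample_html = """
<div data-name="entity_field_real_name"><div class="value">Caetano Emanuel Viana Teles Veloso</div></div>
<div data-name="entity_field_biography"><p>Cantor.</p> <p>Compositor.</p></div>
<div data-name="other">ignored</div>
"""
parser = EntityFieldParser()
parser.feed(sample_html)
parser.close()
print(parser.fields)
```

In a CSS-selector library the equivalent query would be the attribute-prefix selector div[data-name^="entity_field_"]. Keying on the data-name attribute rather than on class names is what keeps the extraction stable if the theme's styling changes.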
Notes
- Always uses the apex domain https://dicionariompb.com.br/; the www. subdomain returns HTTP 403
- No proxy required: datacenter IPs are accepted
- Full corpus coverage of all 4 artist sitemaps (approximately 15,000–25,000 records)
- Biographies are in Portuguese (pt-BR) and include biographical, artistic, and critical review sections
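The apex-domain rule and the sitemap walk mentioned in these notes could be sketched as follows, using only the Python standard library. This is an illustration under assumptions, not the actor's code: the sitemap file names in the sample index are hypothetical, the "artista" filter is inferred from the artist URL pattern shown in the output example, and network fetching is left out.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlsplit, urlunsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def to_apex(url: str) -> str:
    """Rewrite a www. URL to the apex host; other URLs pass through unchanged."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def extract_locs(xml_text: str) -> List[str]:
    """Return every <loc> value from a sitemap index or a urlset document."""
    root = ET.fromstring(xml_text)
    return [
        to_apex(loc.text.strip())
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap index; the real file names are not given in this README.
index_xml = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.dicionariompb.com.br/artista-sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://dicionariompb.com.br/page-sitemap.xml</loc></sitemap>
</sitemapindex>
""".strip()

artist_sitemaps = [u for u in extract_locs(index_xml) if "artista" in u]
print(artist_sitemaps)  # ['https://dicionariompb.com.br/artista-sitemap1.xml']
```

The same extract_locs function works on each child sitemap, since both document types wrap their URLs in loc elements under the same namespace.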
Use cases
- Music licensing & royalties: ECAD researchers building Brazilian composer databases
- Journalism: Brazilian cultural press (Veja, Folha, O Globo) background research
- Academic: Ethnomusicology research at USP, UFRJ, and other Brazilian universities
- Documentary production: Globoplay, Netflix Brasil background research on Brazilian artists
- Data enrichment: Pairing with LIESA (samba school competition) and Galeria do Samba data for comprehensive Brazilian music coverage