Law Firm Website Contact Scraper
Extract attorney profiles, contact info, practice areas, education, and bios directly from law firm websites. Provide a list of law firm website URLs and get structured attorney data ready for CRM import, lead generation, or legal directory enrichment.
What It Does
This actor crawls law firm websites and extracts detailed attorney profiles from each attorney's bio page. It works with virtually any law firm website architecture — WordPress, custom CMS, React/Next.js server-rendered sites — and uses a multi-layer extraction approach:
- JSON-LD / schema.org/Person — detects and parses structured attorney data when available (name, jobTitle, email, telephone, image, LinkedIn)
- Heuristic CSS selectors — covers common WordPress attorney theme patterns and major law firm CMS templates
- Pattern matching fallbacks — extracts mailto links, tel: links, and office location from visible page content
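As a rough illustration of how such a cascade can fit together, the sketch below reads a schema.org/Person JSON-LD block first and falls back to `mailto:` and `tel:` links when a field is missing. It is not the actor's source code: the names `ProfileParser`, `find_person`, and `extract_profile` are hypothetical, the CSS-selector layer is left out for brevity, and only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class ProfileParser(HTMLParser):
    """Collects JSON-LD script bodies plus mailto:/tel: link targets."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []
        self.emails = []
        self.phones = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []
        elif tag == "a":
            href = attrs.get("href") or ""
            if href.lower().startswith("mailto:"):
                # Drop any "?subject=..." query that follows the address.
                self.emails.append(href[len("mailto:"):].split("?")[0])
            elif href.lower().startswith("tel:"):
                self.phones.append(href[len("tel:"):])

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_person(jsonld_blocks):
    """Return the first JSON-LD object typed as Person, or an empty dict."""
    for raw in jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed markup: skip it and rely on the fallbacks
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Person":
                return item
    return {}


def extract_profile(html):
    """Layer 1: JSON-LD. Layer 3: mailto:/tel: links. (Layer 2 omitted.)"""
    parser = ProfileParser()
    parser.feed(html)
    person = find_person(parser.jsonld_blocks)
    return {
        "attorney_name": person.get("name"),
        "title": person.get("jobTitle"),
        "email": person.get("email") or (parser.emails[0] if parser.emails else None),
        "phone": person.get("telephone") or (parser.phones[0] if parser.phones else None),
    }


# Made-up sample page: the JSON-LD has no email or telephone, so both
# values come from the link fallbacks.
SAMPLE_HTML = """
<script type="application/ld+json">
{"@type": "Person", "name": "Jane Smith", "jobTitle": "Partner"}
</script>
<a href="mailto:jsmith@examplelaw.com?subject=Hello">Email</a>
<a href="tel:+12125551234">Call</a>
"""
profile = extract_profile(SAMPLE_HTML)
```

Checking the structured data first and only then falling back to link patterns keeps the richer source authoritative while still filling gaps on pages with partial markup.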
Use Cases
- Lead generation — build prospect lists of attorneys at target firms
- Directory enrichment — supplement FindLaw, Martindale, or Avvo data with direct-from-source bios and contact details
- Recruitment — identify attorneys by practice area and location for headhunting
- Market research — map firm headcount, practice area mix, and office locations
- CRM import — import attorney records with email, phone, title, and LinkedIn directly
Input
| Field | Description |
|---|---|
| `urls` | Required. List of law firm website URLs. Can be the homepage (actor navigates to the attorneys/team page) or the attorneys listing page directly (e.g. https://www.example.com/lawyers/). |
| `maxItems` | Maximum number of attorney records to return across all input URLs. Default: 10. Set to 0 for no limit. |
Example Input
{
  "urls": [
    "https://www.gibsondunn.com/lawyers/",
    "https://www.lw.com/en/people"
  ],
  "maxItems": 100
}
Output
Each record represents one attorney:
| Field | Description |
|---|---|
| `attorney_name` | Full name |
| `title` | Professional title (Partner, Associate, Of Counsel, etc.) |
| `email` | Email address |
| `phone` | Primary phone number |
| `direct_phone` | Direct line (if separate from main) |
| `practice_areas` | Practice areas, pipe-separated |
| `education` | Educational background, pipe-separated |
| `bar_admissions` | Bar admissions, pipe-separated |
| `bio` | Biography text (up to 2,000 characters) |
| `firm_name` | Law firm name |
| `office_location` | Office city/location |
| `attorney_page_url` | URL of the attorney bio page |
| `headshot_url` | URL of the attorney headshot image |
| `linkedin_url` | LinkedIn profile URL |
Example Output
{
  "attorney_name": "Jane Smith",
  "title": "Partner",
  "email": "jsmith@examplelaw.com",
  "phone": "+1 212.555.1234",
  "direct_phone": null,
  "practice_areas": "Mergers & Acquisitions | Private Equity | Capital Markets",
  "education": "Harvard Law School, J.D. | Yale University, B.A.",
  "bar_admissions": "New York | California",
  "bio": "Jane Smith is a partner in the firm's M&A practice...",
  "firm_name": "Example Law Firm",
  "office_location": "New York",
  "attorney_page_url": "https://www.examplelaw.com/people/jane-smith/",
  "headshot_url": "https://www.examplelaw.com/wp-content/uploads/jane-smith.jpg",
  "linkedin_url": "https://www.linkedin.com/in/janesmith/"
}
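If you process records in code rather than in a spreadsheet, the pipe-separated fields can be turned back into lists. The following is a minimal sketch, assuming individual values never contain a `|` character themselves; the helper name `split_pipe_fields` is hypothetical and not part of the actor.

```python
# Output fields documented as pipe-separated strings.
LIST_FIELDS = ("practice_areas", "education", "bar_admissions")


def split_pipe_fields(record):
    """Return a copy of the record with pipe-separated strings as lists."""
    result = dict(record)
    for field in LIST_FIELDS:
        value = record.get(field)
        if value:
            # Trim the spaces around each "|" and drop empty fragments.
            result[field] = [part.strip() for part in value.split("|") if part.strip()]
        else:
            result[field] = []  # null or missing field becomes an empty list
    return result


record = {
    "attorney_name": "Jane Smith",
    "practice_areas": "Mergers & Acquisitions | Private Equity | Capital Markets",
    "education": "Harvard Law School, J.D. | Yale University, B.A.",
    "bar_admissions": None,
}
parsed = split_pipe_fields(record)
```

Fields outside `LIST_FIELDS` are copied through unchanged, so the result can be passed on to a CRM client or serialized again.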
How It Works
The actor uses a two-level hierarchical crawl (listing page, then individual bio pages), carried out in three steps:
- Home/listing detection: if the input URL is a homepage, the actor follows site navigation links to the attorneys listing page. If the URL is already a listing page (contains `/people/`, `/lawyers/`, `/attorneys/`, etc.), it skips directly to discovery.
- Attorney discovery: scans the listing page for links to individual bio pages and handles pagination for large firms.
- Bio page extraction: visits each bio page and extracts the full attorney profile using the extraction cascade described above.
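The first two steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's implementation: the segment set only contains the three path names listed above, the rule "a bio page lives under the listing path on the same host" is a simplifying heuristic chosen for this example, and the names `is_listing_url` and `discover_bio_urls` are hypothetical. Pagination is not handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Path segments that mark a listing page (the three named in the text).
LISTING_SEGMENTS = {"people", "lawyers", "attorneys"}


def is_listing_url(url):
    """True if any path segment of the URL looks like an attorney listing."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    return any(s in LISTING_SEGMENTS for s in segments)


class LinkCollector(HTMLParser):
    """Collects the href of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_bio_urls(listing_url, html):
    """Return unique same-host links located below the listing path."""
    collector = LinkCollector()
    collector.feed(html)
    base = urlparse(listing_url)
    listing_path = base.path.rstrip("/")
    prefix = listing_path + "/"
    seen, bios = set(), []
    for href in collector.hrefs:
        absolute = urljoin(listing_url, href)
        parsed = urlparse(absolute)
        if parsed.netloc != base.netloc:
            continue  # external link (LinkedIn, mailto:, etc.)
        if not parsed.path.startswith(prefix):
            continue  # elsewhere on the site
        if parsed.path.rstrip("/") == listing_path:
            continue  # the listing page itself, including ?page= variants
        if absolute not in seen:
            seen.add(absolute)
            bios.append(absolute)
    return bios


listing_html = (
    '<a href="/people/jane-smith/">Jane Smith</a>'
    '<a href="/contact/">Contact</a>'
    '<a href="/people/">All people</a>'
    '<a href="/people/jane-smith/">Read more</a>'
)
bio_urls = discover_bio_urls("https://www.examplelaw.com/people/", listing_html)
```

Real listing pages often mix bio links with filters and category pages under the same path, so a production crawler would need stricter rules than this path-prefix check.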
Site Compatibility
Tested with major law firm website architectures:
- WordPress attorney themes (most common)
- Custom CMS (large firms with bespoke systems)
- Server-rendered React/Next.js (works without browser rendering)
Sites requiring JavaScript-only rendering may return incomplete data for some fields.
Notes
- Data quality depends on how well the target site uses schema.org/Person markup. Sites with JSON-LD yield the richest data.
- The `bio` field is truncated at 2,000 characters.
- Array fields (practice areas, education, bar admissions) are returned as pipe-separated strings for easy spreadsheet import.
- Email and phone extraction relies on `mailto:` and `tel:` HTML links. Firms displaying contact info as plain text or images may not have these populated.