AI has become a transformative tool in the hiring process, offering benefits like speed, consistency and the ability to uncover nuanced insights about talent. Yet, as these systems gain influence, they also present challenges that regulators, such as those behind New York City’s Local Law 144, are working to address in order to ensure fairness. At Upwage, we aim to set a high standard in responsible AI innovation, going beyond regulatory compliance to drive meaningful advancements in equitable AI use.

The Compliance Challenge: Using Synthetic Data When Demographics Are Missing

One of the key requirements of NYC Local Law 144 is an annual disparate impact assessment to ensure AI tools do not disproportionately harm any race or gender category. But here’s the complication: Most companies do not collect the demographic data (like race or gender) necessary to conduct these assessments, leaving both employers and AI vendors in a regulatory bind. The law proposes using synthetic data as a solution but lacks detailed guidance on how to implement this method effectively.

Our Approach: Creating Robust Synthetic Personas

To address this gap, we developed a method that combines AI-driven synthetic personas with our AI interview screener. Here’s how we structured our approach:

  1. Generating Realistic Synthetic Interviews: We prompted a large language model (LLM) to create realistic personas, which our AI screener then interviewed. Each conversation produced a synthetic interview transcript (a minimal sketch of this pipeline follows the list). We varied persona inputs by:
    1. Demographics: Gender and race/ethnicity
    2. Job Description: Specific roles our AI assists recruiters in analyzing
    3. Job Fit and Experiences: Profiles crafted based on the job description and level of fit
  2. Analysis: Instead of relying solely on selection ratios (i.e., the likelihood of a persona being recommended for further consideration), we analyzed raw scoring distributions to uncover subtle disparities that traditional metrics might miss.
  3. Hypothesis: We aimed to verify the impartiality of our AI system by checking whether LLM-generated personas with varied demographics received consistent job-fit insights.
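
To make that first step more concrete, here is a minimal, hedged sketch of how such a persona-and-interview pipeline could be wired together. Everything in it is an assumption for illustration: `call_llm` stands in for whatever LLM client is used, and the category labels, fit levels, and prompt wording are placeholders rather than Upwage's production prompts.

```python
from dataclasses import dataclass
from itertools import product

# Placeholder inputs: the categories and fit levels shown here are illustrative,
# not the full demographic or job-fit taxonomy used in the assessment.
DEMOGRAPHICS = list(product(["Hispanic", "Asian", "White", "Black"], ["male", "female"]))
JOB_FIT_LEVELS = [1, 2, 3, 4, 5]  # 1 = weak fit ... 5 = strong fit

@dataclass
class Persona:
    race_ethnicity: str
    gender: str
    job_fit: int
    profile: str  # LLM-generated education and work history

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; swap in any LLM client here."""
    return f"[LLM response to: {prompt[:40]}...]"

def generate_persona(race: str, gender: str, job_description: str, fit: int) -> Persona:
    """Prompt the LLM to create a persona for a given demographic, role, and fit level."""
    prompt = (
        "Create a realistic job candidate persona.\n"
        f"Demographics: {race} {gender}\n"
        f"Target role: {job_description}\n"
        f"Job-fit level: {fit} of 5. Include education and work history."
    )
    return Persona(race, gender, fit, call_llm(prompt))

def run_interview(persona: Persona, screener_questions: list[str]) -> list[tuple[str, str]]:
    """Have the persona answer each screener question; return the synthetic transcript."""
    transcript = []
    for question in screener_questions:
        answer = call_llm(f"You are this candidate:\n{persona.profile}\n\nAnswer: {question}")
        transcript.append((question, answer))
    return transcript
```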

Lessons Learned: Bias in Synthetic Personas

Creating realistic personas for fair assessments is more complex than it seems. Research shows that biases can manifest in seemingly neutral outputs; in one study, for example:

  • Work Experience: Resumes for women across most racial and ethnic groups included significantly fewer months of work experience than men's resumes (β = −16.260, p = 0.022).
  • Seniority: Jobs listed in women's resumes were more likely to reflect lower-seniority roles than those in men's resumes, except for Asian-associated names (β = −0.488, p = 0.071).
  • Immigrant Status: Resumes for Asian- and Hispanic-associated names were much more likely to include non-native English proficiency and non-U.S. work or educational experiences in Asia.

These findings highlight the need for ongoing thoughtfulness in our approach.

We also came across unexpected “features” in research around AI bias detection. Another study indicated that LLMs can predict demographics like race and gender even when they are not explicitly stated. We used this capability to our advantage: if the LLM could accurately infer a person’s background yet still deliver unbiased job-fit scores, we reasoned, that would set a higher standard of fairness.
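
One way to operationalize that check, shown below as a hedged sketch rather than our actual implementation, is to ask the model to guess demographics from a transcript it has scored and then confirm that mean scores do not shift with the inferred group. The `call_llm` stub and the dictionary shapes are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

def call_llm(prompt: str) -> str:
    """Placeholder LLM call, as in the earlier sketch."""
    return "[LLM inference]"

def infer_demographics(transcript_text: str) -> str:
    """Illustrative: ask the LLM to guess race/ethnicity and gender from a transcript alone."""
    return call_llm(
        "Based only on this interview transcript, guess the candidate's "
        f"race/ethnicity and gender:\n{transcript_text}"
    )

def mean_score_by_inferred_group(results: list[dict]) -> dict[str, float]:
    """
    results: [{"inferred_group": "...", "job_fit_score": float}, ...]
    Returns the mean job-fit score per inferred demographic group; large gaps between
    groups at the same assigned fit level would flag potential bias.
    """
    scores = defaultdict(list)
    for r in results:
        scores[r["inferred_group"]].append(r["job_fit_score"])
    return {group: mean(vals) for group, vals in scores.items()}
```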

Building (Fair) Synthetic Personas with Precision

When we didn’t control for race/ethnicity and gender inputs and asked for LLM-generated personas with equal job-fit levels, we observed large inconsistencies in candidate education and work history. For example, a Hispanic male persona was assigned a GED and fast-food restaurant experience, while a White female persona was given a Bachelor of Arts degree and experience as a research assistant (Figure 1).

  • Hispanic Male Persona
    • Education: GED from San Diego Community Learning Center, 2018
    • Work Experience: Fast Food Crew Member at Burger Junction, San Diego, CA
  • White Female Persona
    • Education: Bachelor of Arts in Art History from San Francisco State University, 2018
    • Work Experience: Research Assistant at Art Institute of San Francisco, San Francisco, CA; Sales Associate at Creative Home, San Francisco, CA

Figure 1. Education and work experience comparison of equal job-fit Hispanic Male and White Female personas prior to systematically controlling demographic and job-fit variables.

To address these inconsistencies, we developed a more structured approach in which each persona was assigned an independently generated experience block as well as one of 16 demographic blocks (composed of 8 racial/ethnic categories and 2 gender categories), as sketched in the code example after this list:

  • Experience Blocks: Using job descriptions and competencies as inputs, we generated education, work history and professional quotes across five job-fit levels.
  • Demographic Blocks: We defined racial/ethnic categories (e.g., Hispanic, Asian, White) and paired each with gender (male or female).
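
The sketch below illustrates this pairing under stated assumptions: the specific racial/ethnic labels, the five fit levels, and the `build_experience_block` helper are placeholders, since the post does not enumerate them, but the structure (experience generated from the job alone, then crossed with all 16 demographic blocks) mirrors the approach described above.

```python
from itertools import product

# Illustrative category labels only; the approach uses 8 racial/ethnic categories
# and 2 genders (16 demographic blocks) without all labels being listed here.
RACE_ETHNICITY = ["Hispanic", "Asian", "White", "Black", "Native American",
                  "Pacific Islander", "Middle Eastern", "Two or More Races"]
GENDERS = ["male", "female"]
DEMOGRAPHIC_BLOCKS = list(product(RACE_ETHNICITY, GENDERS))  # 8 x 2 = 16 blocks
JOB_FIT_LEVELS = [1, 2, 3, 4, 5]

def build_experience_block(job_description: str, competencies: list[str], fit: int) -> dict:
    """Placeholder: generate education, work history, and quotes for one job-fit level.
    Crucially, this step sees only the job, never the demographics."""
    return {"education": f"[education for fit level {fit}]",
            "work_history": f"[work history for fit level {fit}]",
            "job_fit": fit}

def build_personas(job_description: str, competencies: list[str]) -> list[dict]:
    personas = []
    for fit in JOB_FIT_LEVELS:
        # One experience block per fit level, generated independently of demographics...
        experience = build_experience_block(job_description, competencies, fit)
        for race, gender in DEMOGRAPHIC_BLOCKS:
            # ...then paired with every demographic block, so personas at the same
            # fit level share identical education and work history (as in Figure 2).
            personas.append({"race_ethnicity": race, "gender": gender, **experience})
    return personas
```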

By systematically controlling for job-fit level when developing our personas, we made education and experience consistent across personas at the same job-fit level, regardless of demographic attributes (Figure 2).

  • Hispanic Male Persona
    • Education: High School Diploma - Roosevelt High School, Chicago, IL, 2016
    • Work Experience: Construction Helper, BuildEase, Chicago, IL; General Laborer, Active Construction, Chicago, IL
  • White Female Persona
    • Education: High School Diploma - Roosevelt High School, Chicago, IL, 2016
    • Work Experience: Construction Helper, BuildEase, Chicago, IL; General Laborer, Active Construction, Chicago, IL

Figure 2. Education and work experience comparison of equal job-fit Hispanic Male and White Female personas after systematically controlling demographic and job-fit variables.

Our AI screener tested each synthetic persona using realistic interview questions:

Example Persona Interaction

AI Screener: "Can you describe a time when you worked closely with a team? How did you communicate, and how did you handle disagreements?"

Persona: “Absolutely. In my role at XYZ, I collaborated with a multidisciplinary team to develop intervention plans. We held regular meetings and used shared platforms to stay aligned. When disagreements arose, I facilitated discussions to focus on data and the common goal, leading to a blended strategy…”

This level of detail allowed us to systematically evaluate how our AI provides insights to recruiters.

Disparate Impact Assessment Results

Our analyses examined outcomes across gender, race, and intersectional categories, producing metrics such as average raw scores, selection rates, and impact ratios. It was reassuring to see that all impact ratios, whether computed from raw scores or from selection rates, far surpassed the established fairness thresholds, yet we were intrigued by unexpected variations. Even within our controlled experiment, which was designed to yield consistent results across all demographics by assigning a specific job-fit level to each candidate, we observed some inconsistencies. These findings suggest that even carefully structured evaluations can yield nuanced outcomes, which we will explore further in our next piece.
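
For readers curious what these metrics look like mechanically, here is a hedged sketch, not our actual assessment code, of how impact ratios can be computed from selection decisions and from raw scores. It follows the common four-fifths convention and the scoring-rate approach described in the Local Law 144 rules (the share of a category scoring above the overall median), but the function names and sample numbers are purely illustrative.

```python
from statistics import median

def impact_ratios(outcomes: dict[str, list[int]]) -> dict[str, float]:
    """
    outcomes maps each demographic category to binary selection decisions
    (1 = advanced, 0 = not advanced). Impact ratio = category selection rate /
    highest category selection rate; the four-fifths rule flags ratios below 0.8.
    """
    rates = {cat: sum(vals) / len(vals) for cat, vals in outcomes.items() if vals}
    top = max(rates.values())
    return {cat: rate / top for cat, rate in rates.items()}

def scoring_rate_impact_ratios(scores: dict[str, list[float]]) -> dict[str, float]:
    """
    For continuous scores, the Local Law 144 rules use a 'scoring rate': the share
    of a category scoring above the overall median, compared against the
    highest-scoring category.
    """
    all_scores = [s for vals in scores.values() for s in vals]
    cutoff = median(all_scores)
    rates = {cat: sum(s > cutoff for s in vals) / len(vals)
             for cat, vals in scores.items() if vals}
    top = max(rates.values())
    return {cat: rate / top for cat, rate in rates.items()}

# Example with synthetic numbers (purely illustrative):
selected = {"White female": [1, 1, 0, 1], "Hispanic male": [1, 0, 1, 1]}
print(impact_ratios(selected))  # both groups select 3 of 4, so both ratios are 1.0
```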

Shaping a Fair AI Future

The biases observed in LLM-generated personas reflect the systemic inequities present in society, underscoring the importance of thoughtful approaches to measuring and mitigating bias. 

As part of the growing movement for equitable AI, we’re driven by the opportunity to create tools that not only meet the moment but shape the future of fair hiring. By openly sharing our findings and collaborating with the AI community, we believe we can help set a new standard for responsible innovation: building AI tools that not only meet regulatory requirements but also make hiring fairer and more transparent for everyone.