National Science Foundation Research Traineeship

Research

The invisible and implicit nature of bias calls for training in new methods. This project focuses on the kind of bias that creates unequal impacts on human beings: bias that produces systematic advantages and disadvantages for individuals or groups of people. It is important to note that bias is not inherent in a data set itself. Bias depends on context and direction, and can emerge in how people use data processing and decision-making tools.

Trainees will be encouraged to join existing research efforts with core and participating faculty, or they may choose to develop their own project collaborating with a partner from a different field.

Ongoing Research Topics
(see References Cited at the end of this list)

Co-PI Wei Zhu, statistician and data scientist, has spent 25+ years developing new statistical procedures and machine learning methods to help reduce bias in data and analysis. More recently, she has developed realistic and robust methods for regression analysis with errors in regressors [14], random forests for classifying unbalanced data [27] [18], optimal clustering algorithms [35], and deep reinforcement learning with multiple objectives [10]. Zhu is also an expert in stochastic carcinogenesis modeling [33] [34] [36]. She has found that opposing models and conclusions can be drawn from the same data sets and, furthermore, that papers from the same research group can report contradictory results without the authors noticing the internal conflict.

Trainees will explore how the analyses applied to data sets can be misused, and how such analyses can be audited and corrected. As a tool, AI wields enormous transformative power, as demonstrated by AlphaFold [25], which solved a 50-year-old grand challenge in biology on folding proteins. Foundational training in how to avoid, detect, and correct bias is essential for wielding such powerful tools responsibly.
Klaus Mueller develops visual interfaces that aid users in identifying and correcting biases within algorithmic decision-making systems. Mueller, along with others, has developed WordBias, an interactive visual tool for identifying and exploring biases against groups with intersectional identities [3]. He has also worked to create a new method for quantifying the level of social bias in crowd workers, in order to lessen the bias within a dataset [10]. In addition, he has developed Screen-Balancer, an interactive tool that helps media producers balance the presence of different phenotypes in a live telecast, allowing for a fairer representation of individuals on screen [17]. The interactive visual interfaces created in this research allow humans to take an active role in the bias mitigation process.
Klaus Mueller, a computer scientist with a research program on visualization, visual analytics, and data science, is conducting research in human-AI interaction with Reuben Kline, Christian Luhmann, and PI Susan Brennan. Goals of their data-intensive “Algorithms Without Borders” project include developing and testing tools for transparency in machine learning and AI-assisted decision-making for human stakeholders [32]. The domain for this project is fairness in higher education (with a real-world institutional data set of 10 years of college admissions); users will be able to display the input data graphically, add or remove protected variables, and de-bias the data in order to detect and address bias. Trainees will also be able to test and deploy the associated visualization and debiasing tools with their other research projects.
When automated decision-making depends on proxies (variables correlated with, but not directly related to, what is predicted from previous data), there is a greater risk of biased decisions [22]. Because software systems used by institutions are often proprietary, it can be difficult to discover bias until considerable harm has ensued. One widely used algorithm (affecting millions of patients) unfairly allocated more healthcare resources to White patients than to Black patients because it used prior health care cost (rather than actual illness) to estimate severity of illness or need [21]. Trainees will learn to examine datasets to discover biases in outcomes and to examine the incorrect assumptions behind poor choices of proxies.
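The mechanism behind the health care example can be illustrated with a toy calculation. The sketch below is not the algorithm studied in [21]; the patient records, group labels, and spending figures are invented for illustration. It assumes two groups with identical distributions of illness, where group "B" has historically incurred lower costs for the same level of illness, and compares who is selected for extra care when ranking by cost (the proxy) versus by illness (the target).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Patient:
    group: str               # hypothetical group label
    chronic_conditions: int  # hypothetical measure of actual illness
    prior_cost: float        # hypothetical prior health care spending


def make_patients():
    """Build toy data: both groups have the same illness levels (1 to 5),
    but group B spends less per condition (an assumed access gap)."""
    cost_per_condition = {"A": 1000.0, "B": 700.0}
    return [
        Patient(group, c, cost_per_condition[group] * c)
        for group in ("A", "B")
        for c in range(1, 6)
    ]


def select_top(patients, key, k):
    """Select the k patients who rank highest on the given score."""
    return sorted(patients, key=key, reverse=True)[:k]


def group_counts(selected):
    """Count how many selected patients belong to each group."""
    return dict(Counter(p.group for p in selected))


patients = make_patients()

# Ranking by the proxy (prior cost) favors the group that spends more.
by_cost = select_top(patients, key=lambda p: p.prior_cost, k=4)

# Ranking by the target (actual illness) treats the groups equally.
by_need = select_top(patients, key=lambda p: p.chronic_conditions, k=4)

print("Selected by prior cost:", group_counts(by_cost))
print("Selected by illness:   ", group_counts(by_need))
```

In this toy setup, equally ill patients in group B are passed over when cost is used as the ranking score, even though group membership never appears as an input to the selection rule.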
In previous research on machine learning bias, sociologist Jason Jones has shown that stereotypical gender associations in language models decrease in magnitude as the language training data become more recent [13]. This is important for two reasons: it demonstrates that input data can be an important determinant of the bias of systems trained on data using ML, and it suggests that, on average, gender stereotypes are decreasing in strength over time (at least as measured by latent associations in published texts). This research shows how powerful new computational tools can be used to measure attitudes and beliefs at a societal scale, and that data collected in different eras may show different biases.
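One common way to quantify a latent gender association in a language model is to compare how close a word's vector lies to a "male" direction versus a "female" direction. The sketch below is a generic illustration of that idea under stated assumptions, not the method used in [13]; the two-dimensional vectors, the word "engineer", and the era labels are all made up for the example and are not real measurements.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gender_association(word_vec, male_vec, female_vec):
    """Positive values mean the word sits closer to the male anchor,
    negative values closer to the female anchor, zero means balanced."""
    return cosine(word_vec, male_vec) - cosine(word_vec, female_vec)


# Hypothetical anchor vectors for the two gender directions.
MALE = (1.0, 0.0)
FEMALE = (0.0, 1.0)

# Hypothetical embeddings of the same word, as if trained on text
# from two different eras (toy numbers, not results from any study).
engineer_by_era = {
    "older corpus": (0.9, 0.1),
    "recent corpus": (0.6, 0.4),
}

scores = {
    era: gender_association(vec, MALE, FEMALE)
    for era, vec in engineer_by_era.items()
}

for era, score in scores.items():
    print(f"{era}: association = {score:.3f}")
```

With these toy vectors the association score is smaller for the recent corpus than for the older one, which is the kind of pattern a decline in stereotypical associations over time would produce.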
Co-PI Bonita London’s research examines processes associated with social identity threat and the consequences of this threat for academic engagement, performance, and well-being, particularly among members of historically marginalized groups. Her research shows that social identity threats contribute to gender and racial disparities in education and career advancement and success, career decision-making, professional networking and relationships, and mental health and well-being [1]. In studies ranging from experimental lab studies that test specific coping strategies in response to social identity threat, to experience-sampling methods (ESM) studies exploring the day-to-day lived experiences of students navigating institutions where biases are experienced, London has identified core factors that promote versus undermine engagement and success among traditionally underrepresented students [5], including those with intersectional identities (e.g., a person of color who is also a woman or a gender minority).

Trainees, including those who already identify as data scientists, will learn about social identity threat and how it affects who is welcomed and who thrives (whether in academia or in the STEM workforce), and will incorporate relevant variables into research designs about bias.
Evaluations affect individuals’ access to college, graduate school, scholarships, employment, and other resources. They are by definition subjective: the goal is to express an assessment, on which well-intentioned people can differ. As with all subjective human assessments, bias can affect decisions. Bias in evaluations can follow the usual demographic categories of age, gender, ethnicity, and national origin (e.g., [19]), but may also involve the perceived value of the work, the perceived communication ability of the person, their perceived ability to get along with others, and so on. The latter criteria are troublesome to address, as they can in fact represent valid bases for a negative evaluation, depending on the context. On the other hand, if such judgments are correlated with demographics, that should be addressed.

Trainees, with participating faculty (Brennan, Rambow, others), will examine bias in existing corpora (e.g., letters of recommendation; reviews of conference papers) that have been annotated by trained annotators for “doubt-raisers” [16] and other potential issues, or flagged in their natural life cycle. Machine learning methods will be applied, taking into account the evaluation's language and, critically, its context (see [4]). The study will yield qualitative and quantitative findings, as well as tools to help detect potential bias within specific contexts.
The idea that simpler explanations of observations should be preferred to more complex ones is a pillar of scientific thought. It is also a hypothesis about human perception and cognition: there is a cognitive bias towards simple explanations [9]. What counts as simple and what counts as complex, however, is the subject of much work in mathematics [15], philosophy [26], and information theory [5]. Research in computational and mathematical linguistics by Co-PI/Linguist Jeffrey Heinz and colleagues explores these ideas in the domain of language and language acquisition.

The extraordinary ability of children to learn a first language has been famously theorized to be due to innate parameters; however, the extent to which the grammatical generalizations made during language learning are simple, and what simple means in this context, provide a strong rationale for both “big data”- and “small data”-centric analyses of language-based data using textual corpora. Simplicity in the context of learning systems is relevant to neural systems as well. Neural systems can be described as having optimal bias under a cost function; Associate Professor Memming Park studies the theoretical properties of neural systems [23].
“Science is a highly stratified social system” ([20], p. 1). Persistent and pernicious biases emerge in scientific institutions and from traditional practices that affect interaction at seminars [6], who gets published, who gets cited [7] [24], and who gets funded [12]. This bias can arise quite unintentionally within systems that pride themselves on being “objective”, advantaging certain authors, groups, and laboratories. Such advantages, deserved or not, may amplify exposure to certain work through conference presentations, prestigious journals, social media attention, and accelerating citations. There is obviously a social aspect to this cycle of reinforcing bias within the scientific community [7]. Although science is often assumed to be self-correcting, biased practices can lead to forgotten studies and ignored authors.

Solutions have been proposed at the institutional level to level the playing field, including double-blind reviewing, randomized exposure at conferences, increasing awareness of biases that arise when researchers build their own knowledge networks with attached names, and avoiding “manels” (panels with only male participants; [8]). Blind review and other blinded practices are not always the answer [28]; investigation is needed to determine which interventions are effective in which contexts.
Gender-disaggregated data are often used by policy makers to address institutional barriers to women’s equal participation in the labor marketplace. However, such data often conceal important differences. In Africa (and elsewhere), such barriers can be due to political, economic, social, cultural and religious institutional factors [1].

Adryan Wallace (Africana Studies, Women's Studies, Political Science) conducts fieldwork across Africa that blends qualitative data (interviews, surveys, and ethnographic observations) with quantitative data collected from governmental and policy organizations and labor treaties, in order to address differences in women’s and men’s labor force participation, along with other disparities, including in human rights [29] [30] [31]. An intersectional analysis (rather than disaggregating data simply by gender) reveals a dynamic picture of how underlying political and social factors affect women’s economic experiences, effects that are invisible within the current conceptual framework for gender labor statistics. NRT trainees who are interested in institutional bias and/or in working with multiple data types may collaborate on analyses.

References Cited
