Evidence Ratings

How our evidence ratings work (updated 2023)

Evidence Ratings in the Prosocial Design Network's library aim to give technologists and researchers an indication of how confident they can be, based on public research, that a design pattern is effective.

By effective, we mean that a pattern produces the prosocial outcome it is designed for.

For a design pattern meant to, say, reduce the spread of misinformation, that would mean evidence the pattern leads users on a social media platform to share less misinformation.

It's worth noting that evidence ratings mostly make sense for design patterns with an intended outcome on how users engage. Design patterns that implement prosocial policies like data privacy, for example, may be more or less effective (e.g., in the case of data privacy, some may be more or less susceptible to breaches), but we don't consider those standards here (at least for now).

Our ratings also apply only to design patterns whose effectiveness has been tested in public research. For the most part, that research comes from scholars who aim to publish their work in journals, though it occasionally includes research that platforms share themselves, or that is leaked.

The Ratings

Validated

At least two studies conducted by different research teams each provide strong evidence of the design pattern's effectiveness.

Convincing

Strong evidence from at least one study that the design pattern is effective. In most cases, a Convincing rating requires at least one well-designed field experiment that clearly demonstrates the design pattern's effectiveness.

Likely

Persuasive evidence from at least one study that the design pattern is effective, although more testing would be needed to have strong confidence in its effectiveness.

Tentative

Initial, promising evidence that a design pattern may be effective, although more testing would be needed to have confidence in its effectiveness.

Mixed

Some evidence that a design pattern is effective, alongside persuasive evidence that it is ineffective or may even have negative effects.

Emergent

Emergent ratings are for design patterns with only qualitative evidence.

Inference

Inference ratings are for design patterns that lack direct evidence but where analogous studies, expert opinion, or first principles suggest they might work.

Unlikely

Robust evidence that a design pattern has no effect, i.e., a “null” result.

The Process

Evidence ratings are given by PDN's Library Team, whose members meet biweekly to review individual studies and their related design patterns. Through discussion we reach consensus on the strength of evidence in each study and the overall strength of evidence for a design pattern across studies.

Library Team members are social scientists and data scientists trained in causal inference, i.e., assessing how clearly data indicate that a cause (e.g., a design pattern) leads to an effect (e.g., a prosocial outcome).

The Criteria

For each study, the Library Team considers the following:

The type of research design

In general, the highest ratings go to field experiments. These are sometimes called randomized controlled trials (RCTs) or A/B tests; they use random assignment to test a design pattern on an operating platform.

High ratings also go to "natural experiments" that use platform data, and to experiments conducted in simulated environments in which participants have reason to believe they are interacting on a real platform.

We also review lab experiments and online survey experiments, as well as observational studies, i.e., studies in which no random assignment occurs.
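
To make the role of random assignment concrete, here is a minimal sketch, in Python, of how an A/B test compares outcomes between randomly assigned groups. The function and variable names are ours for illustration only; platforms' actual experimentation systems are far more involved.

    import random
    import statistics

    def ab_test(user_ids, measure_outcome):
        """Randomly assign each user to treatment (sees the design
        pattern) or control (does not), then compare average outcomes."""
        treatment, control = [], []
        for uid in user_ids:
            (treatment if random.random() < 0.5 else control).append(uid)
        # measure_outcome(uid, exposed) would return the behavior of
        # interest, e.g. how many misinformation posts the user shared
        t_mean = statistics.mean(measure_outcome(uid, True) for uid in treatment)
        c_mean = statistics.mean(measure_outcome(uid, False) for uid in control)
        return t_mean - c_mean  # estimated average effect of the pattern

Because random assignment balances users' other characteristics across the two groups on average, the difference in mean outcomes can be read as the causal effect of the design pattern rather than a mere correlation.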

The strength of research design

We consider how well the study is designed to pinpoint the effectiveness of a design pattern and to rule out alternative explanations for its findings.

The strength of statistical findings

Closely related to research design, we consider the statistical analysis used in the study and the overall strength of the findings.
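
As a hypothetical illustration (not the Library Team's actual procedure), a reviewer assessing statistical strength might ask whether a treatment/control difference could plausibly be chance, which a standard two-sample t-test summarizes in a p-value:

    from statistics import mean
    from scipy import stats

    # hypothetical per-user outcomes, e.g. misinformation shares per week
    treatment = [2, 1, 0, 3, 1, 0, 2, 1]
    control = [4, 3, 2, 5, 3, 2, 4, 3]

    # two-sample t-test: how likely is a gap this large if there is no true effect?
    result = stats.ttest_ind(treatment, control)
    print(f"estimated effect: {mean(treatment) - mean(control):.2f}")
    print(f"p-value: {result.pvalue:.4f}")  # smaller p-value -> stronger evidence

In practice, strength involves more than a small p-value: sample size, effect size, and confidence intervals all shape how much weight a finding can bear.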

Finally, for each design pattern, the Library Team reaches consensus on the appropriate evidence rating by considering both the overall strength of evidence across studies and whether that evidence comes from studies conducted by multiple research teams.