September 3, 2023

We've Changed our Evidence Ratings. Here’s why.

PDN has updated its evidence levels. The following article explains why and how.

Julia Kamin, Ph.D., et al.

The Prosocial Design Network has updated how we assign evidence levels. The new rating system may be found here.

PDN’s library lets both technologists and researchers browse prosocial design patterns based on evidence of their effectiveness. For technologists, evidence ratings are useful for determining whether integrating a design pattern will actually produce prosocial outcomes. For scholars, those ratings can also point to where more research is needed to shore up evidence, not only for tested design patterns but also for as-yet untested ones.*

When we launched PDN in late 2019, one of our first actions was setting up criteria for those evidence ratings. They have, generally, done a good job. However, those criteria sometimes produced ratings that didn’t sit quite right with the social scientists in our Library Team. So back in April, four of us on the Library Team dug in to see if we could come up with a new system that can reliably produce ratings more aligned with our social science senses.

What wasn’t perfect about the old system

Our initial rating system gave highest marks to designs that had supporting evidence in peer-reviewed journals and from studies that used “random assignment” (i.e. any kind of controlled experiment). We still think those are important factors, yet heavily weighting those two criteria sometimes penalized stronger design claims while bolstering weaker ones.

The Two Sides of Peer Review

Peer review ensures that independent scholars agree research is based on solid study design and analysis. Passing peer review is a strong signal of a study’s soundness. But relying heavily on that signal for our purposes produced two wrinkles.

For one, it discounted top-notch research still awaiting peer-reviewed publication. Because strong research can take years to get published while technology moves at a rapid pace, we were inadvertently discounting some relevant research.

Second, passing peer review is not necessarily linked to a design pattern’s effectiveness. Academic journals often care less about the effectiveness of a design than they do about the theoretical knowledge a study contributes to scholarship. It’s therefore possible for a paper to get published without providing sufficient evidence that a design has a prosocial impact in a realistic setting.

Not All RCTs are the Same

Randomized Controlled Trials (RCTs) are likewise a strong signal. They let researchers randomly assign some users to a design pattern and compare its impact against its absence (i.e. in a control group). An RCT is a robust tool social scientists have for testing whether a design change has its intended effect.

However, not all RCTs are the same. A well-constructed RCT conducted “in the field” (e.g. on an existing platform) is often deemed the ideal, because you get to see if a design has real-world impacts. That said, few RCTs are field experiments. Instead, most are conducted as online survey experiments with recruited participants who know they are in a study and often merely “imagine” what they would do on a tech platform. While those studies tell us a lot about a design pattern’s potential, they lack what researchers call “external validity.” In other words, we can’t really know if the design patterns will work when launched on a platform.

Conversely, there are studies, for example “natural experiments”, that don’t use random assignment but can still tease out a design’s effectiveness on tech platforms. Sometimes those studies tell us more than a survey experiment.

What has changed?

Fortunately, the wrinkle of depending on peer review is obviated by the fact that our Library Team itself acts like a peer review process. Our team of social scientists reviews, discusses, and evaluates each study, whether it’s published in a journal or is a “preprint” posted by the authors, and agrees on the strength of the study’s design and data analysis, much as peer reviewers would do. This lets us drop the peer review criterion entirely.

We have not, though, dropped the RCT criterion; instead we’ve made it more fine-grained, giving greatest weight to field experiments and less weight to survey experiments, with “natural experiments” somewhere in between. (We also give mid-weight to experiments conducted on simulated platforms, i.e. where study participants have reason to believe they’re using a real platform. But those studies are currently rare.)

One other criterion that has, and has not, changed is whether or not a study’s findings have been reproduced (or “replicated”) in a second study. Replication is, like a field RCT, one of the strongest tests of a design pattern’s effectiveness. One-off studies always run the risk that their results are a statistical fluke. Also, just as a pattern may work in a survey experiment but not on a real platform, its effectiveness might be unique to the specific settings and choices of a given study. This is why we still require replication for a design to reach “validated” status, but we have added the requirement that a study’s findings be replicated by a different research team. Because different research teams are more likely to make different choices, this increases the odds that a finding is robust across settings.

Finally, you may have noticed that we added a “Mixed” evidence rating. In a couple of cases (so far) we have found that multiple studies provide evidence regarding a design pattern’s impact, but those findings don’t all point in the same direction. Without a clear picture of a pattern’s overall impact, we present the available evidence and let technologists carefully assess if and how the pattern should be integrated.
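For readers who think in code, here is a rough sketch of how the signals described above could be computed for a single design pattern. It is an illustration only, not PDN’s actual rating procedure: the names `STUDY_WEIGHT`, `Study`, and `summarize_evidence` are hypothetical, the numeric weights are placeholders that preserve only the ordering described in this article, and the sketch assumes every study in the list tests the same design pattern. Replication by a second team is treated here as a requirement for “validated” status, not as a guarantee of it.

```python
from dataclasses import dataclass

# Hypothetical weights: only the ordering (field > natural/simulated > survey)
# comes from the article; the numbers themselves are placeholders.
STUDY_WEIGHT = {
    "survey_experiment": 1,
    "natural_experiment": 2,
    "simulated_platform": 2,
    "field_experiment": 3,
}


@dataclass(frozen=True)
class Study:
    team: str               # research team that ran the study
    study_type: str         # one of the keys in STUDY_WEIGHT
    supports_pattern: bool  # True if the finding favors the design pattern


def summarize_evidence(studies):
    """Summarize the evidence signals for one design pattern."""
    if not studies:
        return None  # no known public research, so nothing to summarize

    supporting = [s for s in studies if s.supports_pattern]
    # "Mixed": some studies support the pattern and some do not.
    mixed = 0 < len(supporting) < len(studies)
    # Replication counts only when the supporting studies come from
    # at least two different research teams.
    supporting_teams = {s.team for s in supporting}
    strongest = max(studies, key=lambda s: STUDY_WEIGHT[s.study_type])

    return {
        "mixed": mixed,
        "replicated_by_other_team": len(supporting_teams) >= 2,
        "strongest_study_type": strongest.study_type,
    }
```

In this sketch, a pattern supported by a survey experiment from one team and a field experiment from another would come back as not mixed, replicated by another team, with a field experiment as its strongest study type.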

Again, you can read more about our ratings process and criteria here. Let us know if you have questions, thoughts or challenges. We are always learning how to provide a better resource and welcome your input!

*Note: Many - if not most - design patterns aren’t given evidence ratings at all simply because there is no public research testing their impact - i.e. no evidence exists one way or another, at least none that we know of. If you are aware of research we should include, please let us know!

Julia Kamin (she/her) is a researcher based in New York City. She currently works with Civic Health Project, developing a measurement tool for organizations to gauge their impact on reducing polarization.

About the Prosocial Design Network

The Prosocial Design Network researches and promotes prosocial design: evidence-based design practices that bring out the best in human nature online. Learn more at prosocialdesign.org.

