I had a peek at value estimation in economics and marketing. There is a sizable literature here, and more work is needed to figure out what exactly is relevant for effective altruists. Discrete choice models are widely applied in economics, but they cannot estimate the scale of the values, only their ordering. Marketing researchers prefer graded pairwise comparisons, which are equivalent to the pairwise method used here, but with limits on how much you can prefer one choice to another.
I’m enthusiastic about the prospects of doing larger-scale paired comparison studies on EA topics. The first step would be to finish the statistical framework I started on here, then do a small-scale study suitable for a methodological journal in e.g. psychology or economics. Then we could run a study on a larger scale.
Most examples I’ve seen in health economics, environmental economics, and marketing are only tangentially related to effective altruism. (I don’t claim relevant studies don’t exist – there are probably many studies in health economics relevant to EA.) But the topics of cognitive burden and experimental design are relevant for anyone who’s involved with value estimation. It would be good to have at least a medium-effort report on these topics – I would certainly appreciate it! The literature probably contains a good deal of valuable insights for those sufficiently able and motivated to trudge through it.
There is a reasonable number of statistical papers on graded comparisons, but mostly from the 1950s–70s. These will be very difficult to read unless you’re at the level of a capable master’s student in statistics. But summarizing and extending their research could potentially be an effective thesis!
Context
In my earlier post Estimating value from pairwise comparisons I wrote about a reasonable statistical model for the pairwise comparison experiments that Nuño Sempere at QURI has been doing (see also his sequence on estimating value). While writing that post I started thinking about fields where utility extraction is important, and decided to take a look at health economics and environmental economics. This post is a write-up of my attempt at a light survey of the literature on this topic, with particular attention paid to pairwise experiments.
What do I mean by pairwise comparisons? Suppose I ask you “Do you prefer to lose your arm or your leg?” That’s a binary pairwise comparison between the two outcomes \(A\), \(B\), where \(A = \text{lose a leg}\) and \(B=\text{lose an arm}\). Such comparison studies are truly widespread, going back at least to McFadden (1973), which has \(23000\) Google Scholar citations! Models such as these are called discrete choice models, and I will also refer to them as dichotomous (binary) comparisons, which is terminology I’ve seen in the economics literature. These models cannot measure the scale of the preferences though, only their ordering. There are many reasons why we care about the scale of preferences/utilities. For instance, we need scaling to compare preferences between different studies, and we need scales when we face uncertainty, as part of expected utility theory.
To take scale into account we can ask questions such as “How many times worse would it be to lose an arm than to lose a leg?”. Then you might answer, say, \(1/10\), meaning you think losing a leg is ten times worse than losing an arm. Or \(10\), meaning you think losing an arm is ten times worse than losing a leg. These questions are harder than the corresponding binary questions though, and I can imagine respondents being flabbergasted by them. Questions of this kind are called graded (or ratio) comparisons in the literature. The idea is old – it goes way back to Thurstone (1927)!
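To make the connection to statistics concrete, here is a minimal sketch of how a graded answer can be modelled, in the spirit of my previous post; the log-normal error is an assumption for illustration:

\[
\log r_{AB} = \log \beta_A - \log \beta_B + \epsilon_{AB}, \qquad \epsilon_{AB} \sim N(0, \sigma^2),
\]

where \(r_{AB}\) is the reported ratio (“how many times worse is \(A\) than \(B\)?”) and \(\beta_A\), \(\beta_B\) are the latent values. On the log scale, each graded answer is just one observation in a linear regression.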
We can mix graded and binary comparisons using stock-standard maximum likelihood theory or Bayesian statistics. I haven’t figured out the exact conditions, but assuming we have enough graded questions, we will be able to fix the scale reasonably well and gather the remaining information through binary questions only, reducing the cognitive load on the participants.
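Here is a minimal sketch of what such a mixed likelihood could look like in code. This is my own illustration, not the model from the earlier post: I assume a Thurstone-style probit for the binary answers and a Gaussian model on log-ratios for the graded ones, and the data are made up.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Made-up data on four items; theta[i] = log(beta[i]) is item i's log-value.
# Binary comparisons: (i, j, won), with won = 1 if i was preferred to j.
binary = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (2, 3, 0)]
# Graded comparisons: (i, j, r), meaning "i is r times better than j".
graded = [(0, 3, 10.0), (1, 3, 4.0)]

def neg_log_lik(free_theta, sigma=1.0):
    """Joint likelihood: probit for the binary answers, Gaussian on
    log-ratios for the graded answers. theta[0] is fixed at 0 (beta_0 = 1),
    mirroring the contrast from the earlier post."""
    theta = np.concatenate([[0.0], free_theta])
    ll = 0.0
    for i, j, won in binary:
        p = norm.cdf(theta[i] - theta[j])  # probit choice probability
        ll += np.log(p) if won else np.log(1.0 - p)
    for i, j, r in graded:
        ll += norm.logpdf(np.log(r), loc=theta[i] - theta[j], scale=sigma)
    return -ll

fit = minimize(neg_log_lik, x0=np.zeros(3))
print(np.exp(np.concatenate([[0.0], fit.x])))  # estimated relative values
```

Here the graded answers tie the latent scale to interpretable ratios, so most of the questions can stay binary.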
I’m excited about the prospect of using pairwise comparisons on a large scale. Here are some applications:
- Estimate the value of research, both in the context of academia and effective altruism. This post presents a small-scale experiment in the EA context. It would be interesting to do a similar experiment inside academia, though it would probably have to be more rigorous and lengthy. In my experience many academics do not feel that their or other people’s work is important; they research whatever is publishable since it’s their job. Attempting to quantify researchers’ understanding of the value of their own and other people’s research could at least potentially push some researchers in a more effective direction.
- Estimating the value of EA projects. This should be pretty obvious. One of the potential strengths of the pairwise value estimation method is crowd-sourcing: since it’s so easy to say “I prefer \(A\) to \(B\)”, or perhaps “\(A\) is \(10\) times better than \(B\)”, the bar for participation is likely to be lower than for, say, participating in Metaculus, which is a real hassle. Possible applications include crowd-sourced valuation of small projects, e.g. something like Relative Impact of the First 10 EA Forum Prize Winners.
- Descriptive ethics. You could estimate moral weights for various species. You could get an understanding of how people vary in their moral valuations. You could run experiments akin to the experiments underlying moral foundations theory, but with a much more quantitative flavor. I haven’t thought deeply about it, but I imagine studies of this sort would be important in the context of moral uncertainty.
Moreover, I’m thinking about making pairwise estimation into a medium-sized academic project of its own. Very roughly, I’m thinking two steps would have to be completed.
- Development of reasonable statistical techniques. The statistics aren’t that hard: it’s basically elementary techniques such as linear regression and probit regression, combined. But it could be challenging to find optimal allocations of questions. This is the part where I, as a statistician, shine. Of course, it’s important to familiarize oneself with the literature on the topic one wishes to work on. Hence this post.
- Design studies to see how well pairwise value estimation works. Is it worth bothering with it at all? Or maybe just in a couple of contexts? A possible course of action would be to conduct serious interviews with some professors about what research is valuable and why, then go around and estimate the pairwise model on PhD students. It would be nice to have a problem with both a ground truth and incentives to perform – sports prediction might be a better place to start. I don’t shine at stuff like this and would need help.
Small literature survey
I’ve spent a couple of hours surfing through the literature on choice modeling, estimation of value, and so on. I followed no methodology in choosing which papers to write about.
Graded paired comparisons in statistics
A Google Scholar search reveals plenty of statistical papers written in the 1950s–70s, including a paper by the great statistician Scheffé (1952), who studies problems of the form
In a 7-point scoring system the judge presented with the ordered pair \((i,j)\) makes one of the following 7 statements:
(3) I prefer \(i\) to \(j\) strongly.
(2) I prefer \(i\) to \(j\) moderately.
(1) I prefer \(i\) to \(j\) slightly.
(0) No preference.
(-1) I prefer \(j\) to \(i\) slightly.
(-2) I prefer \(j\) to \(i\) moderately.
(-3) I prefer \(j\) to \(i\) strongly.
Observe that these numbers correspond roughly to taking logs of ratios, as I did in my previous post. He proceeds to analyze the problem in essentially the same way as I did, but he uses another “contrast” (in the notation of the post linked above, I force \(\beta_i = 1\) for some fixed \(i\), but he imposes \(\sum \beta_i = 0\) instead; this should be familiar if you have taken a course in linear regression that included categorical covariates). But he also adds a test for what he calls “subtractivity”, i.e., the assumption that there is no interaction term involved in the ratios. I’m very ready to just assume this, however. I don’t understand why he needs to restrict the possible values to \(\{-3,-2,-1,0,1,2,3\}\) though. It just doesn’t seem to matter, so I suppose it’s for the presentation’s sake.
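For concreteness, here is my reading of Scheffé’s basic model, with his subtractivity interaction dropped:

\[
s_{ij} = \beta_i - \beta_j + \epsilon_{ij}, \qquad \sum_i \beta_i = 0,
\]

where \(s_{ij} \in \{-3,\dots,3\}\) is the graded score for the ordered pair \((i,j)\). This is an ordinary linear model, which is why the analysis reduces to familiar regression machinery.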
Plenty of methodological work has been done on these methods in statistics. Sufficiently capable and interested methodologists should probably look at the literature and see what insights it contains. This will be hard though, as the literature is old, hard to read, and probably written in an unnecessarily convoluted way (at least judging from Scheffé’s paper), with lots of sums of squares and similar horrors.
Marketing research
Graded paired comparison studies are popular in marketing research for evaluating brand preference, but they are presented in a slightly different way: now you have a fixed number of points to divide between two options, usually \(9\), I think. If you have \(3\) points to distribute, it would be close to Scheffé’s setup above (though with four possible splits rather than seven grades), with the potential benefit that there is no implicit order of comparison. I think Scholz, Meissner, and Decker (2010) might be a decent entry point to the marketing literature. Consider the following example from De Beuckelaer, Toonen, and Davidov (2013):
“Please consider all presented bottled water brands [Spa, Sourcy, Evian, Chaudfontaine, and Vittel] equal in price and content (0.5 L). Which brand do you prefer? Please distribute x preference point/points.” (with x being replaced by one, five, seven, nine, or eleven)
Maybe this is just a bad example, but I honestly can’t see the point of using the points here. Why not just ask which brand of bottled water you prefer? Also, why do you need a fixed number of points? Maybe to prevent respondents from saying “I like Coke \(1000\) times more than Pepsi!!!”? Or perhaps to emulate the feeling of having a fixed amount of money to spend?
Much of the methodological work in marketing appears to be about “complex” products with several attributes to rate; see e.g. the example in Scholz, Meissner, and Decker (2010) on the different attributes of a phone (display, brand, navigation). Is this relevant to EA? I’d say that it’s very likely to be relevant – we do care about attributes such as scale, neglectedness, and tractability after all.
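In attribute-based (conjoint-style) models, the utility of an option is typically a weighted sum of its attribute levels. A hypothetical EA version, purely for illustration, could look like

\[
U(\text{project}) = \beta_{S}\,\text{scale} + \beta_{N}\,\text{neglectedness} + \beta_{T}\,\text{tractability} + \epsilon,
\]

where the weights \(\beta_S, \beta_N, \beta_T\) are estimated from the comparisons. The comparison machinery stays the same; only the parametrization of the values changes.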
Graded paired comparisons are used in psychometrics too. There they are used to reduce “cheating” in e.g. personality surveys due to social desirability bias, and the fact that you have a fixed pool of points to spend matters. Brown and Maydeu-Olivares (2018) and Bürkner (2022) studied the statistical aspects. I can’t see any immediate application of these models for effective altruism though.
Economics
I have comments on four papers: Baker et al. (2010), Vass, Rigby, and Payne (2017), and Ryan, Watson, and Entwistle (2009) are from health economics, and Hanley, Mourato, and Wright (2001) is from environmental economics.
The report of Baker et al. (2010)
I think this report is about using surveying techniques to figure out whether a “QALY is a QALY”, e.g., are some people’s QALYs worth less due to their age? But it’s pretty long and I only looked somewhat closely at their methodology. The report discusses two kinds of studies.
Discrete choice studies. (Chapter 4). Recall the definition of this kind of model. Here you choose between \(A\) and \(B\). You don’t say “\(A\) is ten times as good as \(B\)!” You just choose one of them. In this report they use the discrete choice model to estimate quality of life (QoL) using surveys.
Matching studies. (Chapter 5). You have two options \(A\) and \(B\), and you’re asked which you prefer. If you say \(A\), you’ll be asked if you prefer \(\frac{1}{2}A\) to \(B\). We continue modifying \(A\) and \(B\) until you don’t prefer one to the other. This seems to be equivalent to graded comparisons, but the report is so verbose I can barely comprehend what they are doing. Their application is a kind of empirical ethics regarding the definition of QALYs.
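As I understand it, the matching procedure amounts to a search for the indifference point, and the final multiplier plays the same role as a graded answer. Here is a hypothetical sketch, assuming preferences are monotone in the multiplier and using a bisection rule of my own choosing, with a simulated respondent:

```python
def match(prefers, lo=0.0, hi=1.0, tol=1e-3):
    """Bisect on the multiplier c until the respondent is indifferent
    between c*A and B. prefers(c) returns True if the respondent
    prefers c*A to B; preference is assumed monotone in c."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prefers(mid):
            hi = mid  # c*A still preferred: shrink A further
        else:
            lo = mid  # B preferred: make A more attractive again
    return (lo + hi) / 2

# Simulated respondent who values A three times as much as B:
# they prefer c*A to B whenever 3*c > 1, so indifference is at c = 1/3.
c_star = match(lambda c: 3 * c > 1)
print(round(c_star, 3))  # approximately 0.333
```

Being indifferent between \(\frac{1}{3}A\) and \(B\) then says that \(A\) is roughly three times as valuable as \(B\), which is exactly the information a graded comparison asks for directly.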
To recap, a QALY is 1 year in full health and years spent in less than full health are ‘quality adjusted’ in order to take account of the morbidity associated with disability, disease or illness. As QALYs combine morbidity (QoL) and mortality (length of life) on a single scale, they allow comparisons to be made across interventions with different types of health outcomes (e.g. QoL enhancing versus life extending). In the standard ‘unweighted’ QALY model, all QALYs are of equal value. For example, the age of recipients does not matter, as long as the QALY gain is the same. Likewise, the standard model assumes that equal QALY gains are of equal value regardless of how severely ill the patients are prior to treatment. The aim of the matching – and DCE studies – is then to address the question of whether a QALY is a QALY is a QALY. Or put another way, is the ‘standard’ model correct?
I haven’t read how they actually estimate these modified QALYs though.
The survey of Vass, Rigby, and Payne (2017)
Vass, Rigby, and Payne (2017) discuss the use of qualitative research in discrete choice experiments. This seems like a decent entry point to the literature in health economics, and I will probably rely on it in the future, if only for their reference list. The topic is also plausibly relevant to EAs, as the qualitative aspects of a discrete choice experiment essentially lie in its preparation – find the right questions, ask the right people, and so on.
They have a reasonably clear motivation for the use of discrete choice models in medicine.
In healthcare, decision making may involve careful assessment of the health benefits of an intervention. However, decision makers may wish to go beyond traditional clinical measures and incorporate “non-health” values such as those derived from the process of healthcare delivery. DCEs allow for estimation of an individual’s preferences for both health benefits and non-health benefits and can explain the relative value of the different sources.
They mention, along with many other authors, the problem of cognitive burden.
Any increases in the cognitive burden of the task could result in poorer quality data and should be considered carefully.
And that’s the reason why I don’t think it’s a good idea to ask respondents for distributions, and why I would prefer to use discrete choice models as much as possible. (Remember that we can mix them with graded comparisons to keep the scale fixed.)
The paper of Ryan, Watson, and Entwistle (2009)
This is a paper about how value elicitation experiments often reveal values incompatible with expected utility theory. They also include a discussion of methodologies used.
The four most commonly applied stated preference methods in health economics are contingent valuation (CV), discrete choice experiments (DCEs), standard gamble (SG), and time trade-off (TTO). Researchers have used quantitative tests to investigate whether responses to tasks set within each of these methods are consistent with the axioms of utility theory.
They include some examples of these methods in the context of statistical testing.
- Contingent valuation. Hanley, Wright, and Koop (2002) use the terminology “choice experiment” instead. And when they describe their method it’s plain to see that they actually use a discrete choice model. But Ryan, Watson, and Entwistle (2009) later write that “CV asks respondents to trade the health good and money”. The term is really bad, as “contingent” can mean just about anything without proper context.
- Discrete choice experiments. “DCEs ask respondents to trade between attributes.”
- Standard gamble and time trade-off. “SG and TTO test whether respondents make trade-offs between health status and risk or time”.
Statistically speaking, CV, DCE, SG, and TTO should be understood as the same thing, the only difference being what is compared to what: DCE compares attributes to attributes, while CV compares attributes to money. This is perhaps nice to know if you’re going to dive into the literature on this topic.
The paper of Hanley, Mourato, and Wright (2001)
The authors discuss the “contingent valuation method”, which appears to be a willingness-to-pay experiment (Ahlert, Breyer, and Schwettmann 2013). I don’t think we can make use of that though.
By means of an appropriately designed questionnaire, a hypothetical market is described where the good or service in question can be traded. […] Respondents are then asked to express their maximum willingness to pay […]
Then they discuss choice modelling.
[Choice modelling] is a family of survey-based methodologies for modelling preferences for goods, where goods are described in terms of their attributes and of the levels these take.
In Table 2 they describe four kinds of choice modelling:
- Choice experiments. This is equivalent to dichotomous comparisons.
- Contingent ranking. Rank a bunch of different alternatives, e.g. “What do you prefer: losing an arm, a leg, or a tooth? Rank from best to worst.”
- Contingent rating. This is direct scoring of values. Hanley, Mourato, and Wright (2001) mention a \(10\)-point scale, but we would probably allow respondents to use \(\mathbb{R}^+\) instead.
- Paired comparisons. This is graded comparisons.
They claim that the first option, choice experiments, yields “welfare consistent estimates”, but the others probably don’t. I don’t know if this is important or not, as I don’t know what it means.
All of these models are possible to combine, but we won’t be able to fix the scale without using graded paired comparisons. The second option, contingent ranking, might possibly be worth investigating, especially when the number of items to rank is small, so that the additional cognitive burden is low. But Hanley, Mourato, and Wright (2001) state that “choices seem to be unreliable and inconsistent across ranks,” making it less likely that this is worth exploring. The Schulze method is also based on ranks, and could be relevant here. One simple way to connect rankings to the binary machinery is sketched below.
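Here is a minimal sketch, my own illustration rather than anything from the paper: a ranking can be decomposed into the binary comparisons it implies, which can then be fed into a discrete choice likelihood. (Treating the implied pairs as independent observations is an approximation.)

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Decompose a ranking (best to worst) into the binary
    comparisons it implies: each item beats everything ranked below it."""
    return [(a, b, 1) for a, b in combinations(ranking, 2)]

print(ranking_to_pairs(["tooth", "arm", "leg"]))
# [('tooth', 'arm', 1), ('tooth', 'leg', 1), ('arm', 'leg', 1)]
```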