Crowd-sourced Challenge initiatives in biomedical research provide a mechanism for transparent and rigorous performance assessment. The core idea of these studies is simple. A group of researchers, typically called the “organizers”, identify a key problem that has a diversity of potential solutions. These solutions are typically computational, meaning that each is a different algorithm or method. The organizers then identify or create a suitable benchmarking dataset and a metric of accuracy. Submissions are solicited from the broad community, typically attracting interest from a diverse range of research groups. The metric is used to quantify how well each submission performs, and at the end of the Challenge the best performers are declared. The final results are then analyzed to identify best practices and general characteristics of the problem at hand, and these are usually summarized in a publication. Incentives, such as prizes for best-performing groups or authorship on Challenge publications, are often provided to increase the number and diversity of participants. This process is sometimes called “crowd-sourced benchmarking”, and organizations including CASP, DREAM, CAGI and Kaggle specialize in running such Challenges in varied problem domains.
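The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation pipeline of DREAM or any other organization: the team names, the toy labels, the `accuracy` metric and the `build_leaderboard` helper are all hypothetical, and real Challenges choose metrics suited to the problem at hand.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def accuracy(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of predictions that match the held-out truth labels."""
    if len(predicted) != len(truth):
        raise ValueError("prediction and truth must have the same length")
    if len(truth) == 0:
        raise ValueError("truth must not be empty")
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


def build_leaderboard(
    submissions: Dict[str, Sequence[int]],
    truth: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float] = accuracy,
) -> List[Tuple[str, float]]:
    """Score every submission with the same metric and rank best-first."""
    scores = {team: metric(pred, truth) for team, pred in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical blinded evaluation labels, visible only to the organizers.
hidden_truth = [1, 0, 1, 1, 0, 1, 0, 0]

# Hypothetical predictions submitted by three teams.
submissions = {
    "team_a": [1, 0, 1, 0, 0, 1, 0, 1],
    "team_b": [1, 0, 1, 1, 0, 1, 0, 0],
    "team_c": [0, 1, 1, 1, 1, 1, 0, 0],
}

leaderboard = build_leaderboard(submissions, hidden_truth)
for rank, (team, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {team}: {score:.3f}")
```

Because every submission is scored by the same function against labels the participants never see, no team assesses its own method; swapping in a different `metric` changes the ranking rule without touching the rest of the sketch.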
This blog post will outline some benefits and caveats of open science approaches in such competitions, garnered from organizing more than 60 DREAM Challenges in disciplines ranging from systems biology to disease genetics and biomedical image analysis to cancer genomics. We link these benefits and caveats to potential implications for open team science more broadly.
The benefits and perils of openness
Some benefits of open science in crowd-sourced competitions are self-evident: as the community of biomedical researchers strives to work in more open and reproducible ways to solve problems, cross-checking one another’s results can build reproducible science and identify the best solutions. Open Challenges provide a framework for rigorous benchmarking of algorithms against blinded evaluation data, so that assessment is unbiased and can lead to insights into why particular algorithms perform well. Open Challenges also help algorithm developers avoid the traps of self-assessment, because methods are evaluated by someone other than their authors.
Other benefits are surprising, but equally important: all teams failing to solve a proposed Challenge provides strong evidence that the problem cannot be solved with the current data or methods at hand. This unfortunate negative result becomes clearer in a Challenge context, and confronting it in the open makes the scientific community wiser and more likely to accept it – both for its content and in the context of peer review.
We have also discovered that, while openness can be extraordinarily enabling, in some cases complete openness can counter-intuitively present its own perils. For example, disseminating methods among participants early within a Challenge can encourage groupthink, in which researchers follow or slightly modify the method of the best performer on a leaderboard. This limits innovation and concentrates optimization around a local maximum of performance. A recent paper, “Innovation and cumulative culture through ‘tweaks and leaps’ in online programming contests”, discussed the evolution of innovation in open source coding. The best-performing entry at any point was typically based on optimization of the previous best. But occasional leaps disrupted the culture: for each “leap success” there were 16 “tweak successes”. These “leap successes” had a major impact on results: for each point gained by a “tweak success”, 12 were gained in leaps. Similar observations have been made in multiple settings, including the CASP Challenges.
We witnessed this firsthand in the Breast Cancer Prognostic DREAM Challenge, which required teams to submit and publicly share source code throughout the Challenge. The motivation was well-intended: to promote virtual collaboration and to stimulate new ideas and approaches that would lead to a steady, inexorable rise in model performance. In practice, this format discouraged teams from taking risks and promoted incremental improvement, with teams rapidly building on top of one another’s achievements: a kind of regression to the mean. In a way, this is reminiscent of Charles Mackay’s “Memoirs of Extraordinary Popular Delusions and the Madness of Crowds”, first published in 1841. Mackay described how people followed a dominant idea without deeply challenging it, following a path of minimal resistance. Independent thinking is required to generate sufficient activation energy to challenge that dominant idea and change the direction of the crowd.
From open to closed, and back again
Crowd-sourced Challenges provide not only a venue to observe the difficulties of open science approaches, but also a test-bed to innovate and evaluate strategies to mitigate them. Learning from the experience of the Breast Cancer Prognostic DREAM Challenge, we have explored several ways to combine open and closed strategies in crowd-sourced evaluations. Across over a dozen Challenges, we have converged on a hybrid of the two. Typically, DREAM Challenges now begin in a closed mode, where code and algorithms remain private: this is called the competitive phase. During this phase, the only information teams have is their relative performance on a leaderboard. Once the competitive phase is finished, top teams are announced, and the Challenge pivots into an open mode where competitors become collaborators. Code is shared, and teams work together in a new collaborative phase to develop a deeper joint understanding of the problem- and solution-spaces.
Similar strategies have been considered to optimize data-sharing and data-availability. Open data is an ideal way to democratize access to data, creating a resource for many researchers to explore and use to make new discoveries. Crowd-sourced Challenges are one mechanism to broaden data availability, lowering barriers to entry by providing carefully curated data sets, a precise formulation of problems, and a quantitative rubric for evaluation. But data generators expend substantial effort and may want to reap the fruits of that effort for purposes other than a crowd-sourced Challenge. Requirements to make data immediately open can discourage data generators from allowing their data to be used in Challenges. We and others have started to develop and evaluate hybrid open–closed models – for example, keeping data private, but using containerized technology so that participants' methods can access the data without participants downloading or even seeing it. This can allow execution of a Challenge while the data generators work towards fully reporting their scientific questions of interest on that dataset. A similar hybrid strategy involves precompetitive solutions, like Project Data Sphere, where pharmaceutical companies provide parts of the data collected in clinical trials (e.g. control arms) that pose little risk to their overall strategy. These data were successfully used for building predictive models of treatment response in metastatic prostate cancer patients treated with docetaxel in the Prostate Cancer DREAM Challenge. A third hybrid strategy involves consortia where organizations can see one another’s data if and only if they provide their own data to the collective.
Acceleration of science
A typical DREAM Challenge involves ~50 teams, and each team includes on average three people. For a standard two-month Challenge, this represents a cumulative effort of about 300 person-months (50 teams × 3 people × 2 months). This is far larger than the effort a typical research group would invest in solving a single problem, and it allows much broader exploration of the search space. This is analogous to embarrassingly parallel computation in computer science, in which independent workers proceed with little or no need to communicate. While many problems benefit from this parallel problem solving, one can also envision a consortium formed by groups that each do what they do best: one group does the coding and optimization, others generate data, others analyze and develop models, and so forth. Indeed, we can imagine open science models as promoting sharing amongst the various parallel threads underway. For all to be made better off in this model, everybody in the consortium has to help others solve their problems.
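The effort estimate above is a simple product, and writing it out makes the underlying assumption explicit. The sketch below uses a hypothetical helper, `person_months`, and assumes every team member works on the Challenge for its full duration, so the result is best read as an upper bound rather than a measured figure.

```python
def person_months(teams: int, people_per_team: float, duration_months: float) -> float:
    """Cumulative effort, assuming everyone works for the whole Challenge."""
    return teams * people_per_team * duration_months


# Figures quoted in the text: ~50 teams, three people each, two months.
total_effort = person_months(teams=50, people_per_team=3, duration_months=2)
print(total_effort)  # 300
```

Lowering any of the three inputs, for example to reflect part-time participation, scales the total proportionally.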
Developing and expanding our sets of hybrid open–closed models will be one key to generating increasingly collaborative team science. It is a truism that incentives drive behaviour: finding the economic, social and academic models that incentivize collaborative science and resource sharing will be immensely important as biomedical research becomes increasingly data-intensive. Challenge-based benchmarking provides an ideal test-bed for adaptively, rapidly and systematically testing these models.