Data Mining
Libraryland is abuzz about a new role we can play in the pursuit of scientific knowledge: data curation. Data curation serves, in particular, the new scientific methodology that goes by the name e-science. E-science involves collecting data sets and making them widely available to the research community. Researchers then “mine” these data sets, using automated systems to find statistically significant relationships within the data. The library’s role is to curate the data, i.e., to identify, acquire, and manage the data sets through the course of their life cycle. As exciting as this new methodology is, one should be aware of its weaknesses. E-science can be a valuable addition to traditional scientific methodology, but by itself it is no panacea.
In a commentary entitled “Implications of the Principle of Question Propagation for Comparative-Effectiveness and ‘Data Mining’ Research” in the Journal of the American Medical Association, 305(3), 2011, Mia and Benjamin Djulbegovic argue that data mining does not provide definitive answers to research questions. Instead, it should be considered merely a hypothesis-generating technique. Their first point had already been demonstrated vividly by a piece of data mining research entitled “Testing Multiple Statistical Hypotheses Resulted in Spurious Associations: A Study of Astrological Signs and Health,” published in the Journal of Clinical Epidemiology, 59(9), 2006, by Peter Austin et al. Austin et al.’s research showed that residents of Ontario, Canada, who were born under the astrological sign of Leo had a higher probability of suffering a gastrointestinal hemorrhage than others in the population, and that those born under the sign Sagittarius had a higher probability of being hospitalized for a humerus fracture. These results were statistically significant, even after being tested against an independent validation cohort. The study “emphasizes the hazards of testing multiple, non-prespecified hypotheses.” In other words, it warns us that given enough data points, one can, after the fact, find any number of ways to connect them.
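The mechanics behind such spurious findings are easy to reproduce. The sketch below is a hypothetical Python simulation, not Austin et al.’s actual data or analysis: it assigns patients a random astrological sign and a set of random, mutually independent diagnoses, then tests every sign–diagnosis pair. Even though no real associations exist in the data, roughly five percent of the tests come out “significant” at the conventional 0.05 threshold.

```python
# Hypothetical illustration of multiple non-prespecified hypothesis tests.
# No association is built into the data, yet many tests still "succeed".
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_patients, n_signs, n_diagnoses = 10_000, 12, 100

signs = rng.integers(0, n_signs, size=n_patients)          # random zodiac sign per patient
diagnoses = rng.random((n_patients, n_diagnoses)) < 0.05   # independent 5% base-rate diagnoses

spurious = 0
for d in range(n_diagnoses):
    has_dx = diagnoses[:, d]
    for s in range(n_signs):
        in_sign = signs == s
        # 2x2 contingency table: sign membership vs. diagnosis
        table = np.array([
            [np.sum(in_sign & has_dx), np.sum(in_sign & ~has_dx)],
            [np.sum(~in_sign & has_dx), np.sum(~in_sign & ~has_dx)],
        ])
        # correction=False gives the plain chi-square test, so the
        # nominal 5% false-positive rate applies per test
        _, p, _, _ = chi2_contingency(table, correction=False)
        if p < 0.05:
            spurious += 1

print(f"{spurious} 'significant' associations out of {n_signs * n_diagnoses} tests")
# With no true associations present, we expect about 0.05 * 1200 = 60
# false positives -- "significant" links between signs and ailments.
```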
The second point in Djulbegovic and Djulbegovic, that data mining should be used as a hypothesis-generating technique, is, on the other hand, undermined by Austin et al., who point out that the statistical methods at the heart of data mining cannot distinguish real from spurious associations. Data mining employs the automated examination of enormous bodies of data, and its usefulness is thought to be proportional to the size of the data set it collates. However, as the data set becomes larger and the number of attributes that serve as potential relata increases, the number of potential relationships grows combinatorially: with k attributes there are already k(k − 1)/2 possible pairwise associations, before any higher-order combinations are considered. The number of spurious associations grows with it. With enough data, no significance test will be stringent enough to provide assurance against the kind of results found in Austin et al. What is needed, according to Austin et al., is a “pre-specified plausible hypothesis”: if the research is to be valuable, the researcher must begin with a hypothesis, preferably a plausible one, rather than hope that one will fall out of the data.
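A back-of-the-envelope calculation shows how quickly the problem compounds. The numbers below are illustrative, not drawn from either article: as the attribute count k grows, the pairwise test count explodes, the expected number of chance “discoveries” at α = 0.05 grows in proportion, and a Bonferroni-style correction drives the per-test significance threshold toward zero, eventually below what any realistic sample could achieve.

```python
# Illustrative arithmetic (hypothetical attribute counts): candidate
# pairwise associations among k attributes number k*(k-1)/2, so chance
# "discoveries" at alpha = 0.05 pile up, while the Bonferroni-corrected
# per-test threshold shrinks toward zero.
alpha = 0.05
for k in (10, 100, 1_000, 10_000):
    tests = k * (k - 1) // 2
    print(f"k = {k:>6}: {tests:>12,} pairwise tests, "
          f"~{alpha * tests:,.0f} expected false positives, "
          f"Bonferroni threshold {alpha / tests:.1e}")
```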
What exactly is a pre-specified plausible hypothesis, and how can we generate one if data mining can’t do it for us? The question was posed some sixty years ago, in different terms, by the philosopher Nelson Goodman, who believed that a critical question for epistemology was to distinguish between “projectible and non-projectible hypotheses.” One can more or less replace “pre-specified plausible hypothesis” with Goodman’s term “projectible hypothesis.” According to Goodman, when we seek to understand which hypotheses are (or are not) projectible, we do not come to the problem empty-headed but “with some stock of knowledge” that we use to make the determination. Projectible hypotheses are those which do not conflict with hypotheses that have been supported in the past, and they will commonly use the same terminology as previously supported hypotheses, terminology that has become “entrenched” in the language. This goes a long way toward explaining why we don’t find a link between one’s astrological sign and one’s medical conditions plausible: twenty-first-century Western medicine is not accustomed to linking astrological signs to ailments and so must find any hypothesis that does so implausible.
If Goodman is correct, then data mining is of little use without an historical understanding of the field of science to which the data pertains. Library administrators should keep this in mind when allocating resources. Purchasing data sets is clearly a necessary part of serving our research patrons, but the emphasis must be not on the mere accumulation of data but on the selection of data that is critical to continuing the scientific discourse. While data sets built around astrological signs are clearly insignificant for medicine, many other attributes form the basis of data sets that are more or less reasonable, and librarians must be able to perform the complex task of distinguishing the more from the less. It is the curation of data that matters, i.e., the acquisition and management of data sets through the whole of their life cycle, and above all the curation of data sets that are of interest and value to the scholarly and research community.
Here we have another argument for allocating library resources to librarians with deep subject expertise. As e-science develops, vendors will make more and more data sets available, regardless of their actual worth to researchers. To choose the data sets that are of value, librarians must have a thorough understanding of their patrons’ research needs, and that requires a deep understanding of the field itself. Unfortunately, with the excitement swirling around e-science, mere access to large data sets threatens to become the be-all and end-all of collection management. If we aren’t careful, we may find ourselves with mountains of data from which everything and nothing can be concluded.