Can Scientific Quality Be Quantified?

Why has evaluating quality in science changed from asking “What” to “How much”? How did measuring quality become a goal in itself in the scientific community? And when did all these problems begin? Ulrich Dirnagl, Einstein Foundation Award Secretary and Director of Experimental Neurology at Charité, provides a brief outline of the evolution of quality in science.

 

Quality is a challenging notion. Everyone likes to champion it, but it is notoriously difficult to define. The standards used to define quality are often disputed, whether in industry, the arts and culture, or the sciences, and they are continually changing. When we take a closer look at the origin of the word, it is easy to understand why it raises so many questions. It stems from the Latin interrogative pronoun qualis, which asks for the character or nature of something to be defined.

 

Questions surrounding quality are particularly complex in science, which claims to uphold the highest standards in this regard. Yet “quality” generally remains an implicit and multi-dimensional construct. It is not static, and it is judged by different criteria and standards in the natural sciences, the humanities, and the social sciences. These criteria and standards can also vary to a certain degree among the individual fields of research within the same scientific domain. In science, quality can take the form of methodological rigor, or of expertise in planning, implementation, and analysis; it is also evaluated in terms of the reliability and reproducibility of results, plausibility, originality, and novelty. For many researchers, it also encompasses values such as respect, fairness, integrity, and ethical behavior. In order to judge quality according to all these factors, science must be transparent and lay all its cards on the table. This is why transparency is one of the universal elements used to measure the quality of research. A further dimension to consider is the fact that quality in science is not only defined and monitored internally, within specialist communities, but is also influenced by external stakeholders such as institutions, funding bodies, policymakers, the general public and, of course, scientific journals. These influences have radically altered the global research culture in recent decades—and not necessarily for the better.

 

And so, in many quarters today, quality is measured and evaluated in terms of numbers. There has been a shift from “What?” to “How much?”, from substance to scoring credit points. It is understandable that there is a tendency to measure scientific quality against specific numbers. Numbers simplify complex issues and make it easier to gain an overview; they are transparent, simple, and workable. After all, measurements are an integral part of most scientists’ daily work.

 

Given the massive output of research articles published by a myriad of scientists, it seems impossible not to rely on numbers that signal the prestige of the journal in which a piece of research has been published, or how often a researcher has been cited. It is therefore hardly surprising that this system of figures and metrics for measuring scientific quality has gained worldwide acceptance. At least one generation of scientists and scientific administrators has already internalized this logic and in many cases could not imagine any other type of mechanism. They have become completely accustomed to evaluating the originality and quality of science and its originators based on citation metrics, the reputation of scientific journals, and levels of external funding.

 

However, the “metrification” of the concept of quality ultimately leads us down a blind alley. In fact, it repeatedly threatens our high scientific standards. A brief foray into history tells us that various systems have been developed and used to evaluate science, and that some have helped, and some have hindered, the production of knowledge. So how did we come to rely so heavily on the metrics used to judge quality today, metrics that are unfortunately often detrimental to the quality of scientific evidence?

 


 

 

Galileo Galilei, Robert Hooke, Robert Boyle, and Isaac Newton—all men, owing to the norms of the time—were able to pursue their scientific goals in the 17th and early 18th centuries because they were either born into rich families or could rely on the patronage of wealthy donors. Their efforts to reveal how the world works were devoted to the higher purpose of deciphering the divinely authored Book of Nature and thus the order of things in the world. Scientific research was dedicated to a deeper faith and sought to promote piety in society; science was seen as a way to serve God. At that time, princes and kings tended to lend their patronage to inventors and engineers rather than scientists, because only the former promised to help them subjugate the world through conquest and war. There was very little in the way of collaboration in the sciences during this period. Newton and his peers primarily regarded each other as competitors as they strove for fame and recognition. Their motivation was to be the first to make a discovery and to be remembered in the history books for their findings.

 

 


 

 

The handful of polymaths working at that time were fortunate to be able to record their studies and research on a virtually blank slate. The starting point for their ideas and hypotheses was what science historian Lorraine Daston calls “ground zero empiricism”: collective knowledge was limited, and the facts established within the scientific community were relatively few and straightforward. What is more, the scientific community itself was still a small village, consisting perhaps of a few hundred—or a few thousand at most—like-minded people around the world, who were loosely organized into academies where they presented and critiqued each other’s theories and experiments. Scientific work was primarily published in monographs, or in journals produced by national scientific academies.

 

Around the turn of the 19th century, England’s Royal Society was the most prolific and influential national scientific academy in the world. Its journal was printed twice a year; in 1829, eight hundred copies were sent out to its scientific counterparts and selected scholars. Studies were often published six months after they were presented or submitted, a short time span compared with today’s peer review process. Back then, scientists did not compete for academic positions or research funding but for recognition and for access to the major academies and their international networks.

 

As scientists began to gain a greater understanding of what binds the world together at its essence, people also began to take a greater interest in how scientific findings could benefit society, in other words, what we today call social impact or public health. Science’s relevance for society as a whole rose up the agenda during this period, as rapid industrialization and the massive influx of people from rural areas to towns and cities caused severe health and social crises, unsanitary conditions, and epidemics. As middle-class societies emerged and mass production proliferated during the 18th and 19th centuries in the wake of rationalization and the exploitation of the working classes, governments began to organize science more systematically, most notably by establishing universities as research institutions. The physicist who formulated the theory of electromagnetic radiation, James Clerk Maxwell, the founder of modern microbiology, Louis Pasteur, the physician and pathologist Rudolf Virchow, and many of their peers became the first salaried scientists in the northern industrialized nations to conduct state-sponsored research at universities.

 

 


 

 

 

Meanwhile, the sciences became increasingly specialized. Specialist journals emerged and, alongside lectures, became the most important medium of scientific discourse. Scientists at this time still knew all of their peers in the particular discipline they were working in. Scientific debates—whether conducted in writing or in person—were not fought anonymously but face to face. The growing competition for academic tenure as an assistant or professor marked a completely new development at this time, however. Reputation among peers, academic hierarchies, and affiliation with particular scientific schools were considered key factors. Quantitative bibliometric indicators and third-party funding played no role at all, because they simply did not exist back then. Even then, however, scientists did not always adhere to good scientific practice when deviating from it served to further their academic careers. The fundamental types of impure science—still practiced today—were outlined as early as 1830 by the inventor of the mechanical calculating machine, Charles Babbage. In his Reflections on the Decline of Science in England, and on Some of Its Causes, he distinguished between hoaxing (fabricating), forging (falsifying), trimming (smoothing data by clipping off inconvenient values), and cooking (selecting only the results that fit).

 

In the early 20th century, third-party funding was added to the mix. Immediately after the First World War, German universities, academies, and the Kaiser-Wilhelm-Gesellschaft zur Förderung der Wissenschaften, which later became the Max Planck Society, came up with a way to improve their precarious financial circumstances resulting from the war and the ensuing economic crisis. They founded the Notgemeinschaft der deutschen Wissenschaft—later to become the Deutsche Forschungsgemeinschaft (DFG)—which enabled them to raise money to fund individual research fellowships. A few years later, as National Socialist ideology gave rise to a Deutsche Physik (literally, German Physics), it was a scientist’s beliefs and party affiliation that were decisive when it came to obtaining employment or tenure at a university.

 

It was not until the Second World War that this system underwent fundamental change, both in Germany and internationally. During the war, research was industrialized on an unprecedented scale, most notably in the United States. Research programs that underpinned the development of long-range missiles, radar, the atomic bomb, computers, and the like received enormous amounts of funding and were managed with military precision. In fact, by the end of the Second World War, most work in the (natural) sciences carried out at universities was in the service of the military. The top priority was now to apply research findings to achieve military superiority, so much so that there was serious concern about the future of basic research, which did not promise to deliver immediate benefits. These developments prompted a sharp rise in research output, first because of continuing specialization within the various disciplines and, second, because of increased government spending on academic research. Nevertheless, it remained easy for researchers to keep track of new developments, not only within their respective specialist fields but across other disciplines too. Editors decided which of the manuscripts that landed on their desks would be published. The concept of peer review had not yet been established. Only a few journals existed for each subject, and they were produced in the language of the country of publication. Scientists still primarily shared knowledge and ideas within their own countries, which is also where it was decided who was “excellent” and who was not.

 

In the 1980s, the diversification of scientific disciplines, the number of researchers, and their output reached a tipping point. It became increasingly difficult to judge quality and originality based on the content of the research, which also made decisions about funding and careers problematic. This had been compounded since the late 1960s, when many people began to rebel against outdated hierarchies.

 

 

The desire to objectively assess and quantify performance in research was born. Meanwhile, a hierarchy of journals had also been established, which became quantifiable in 1955 through Eugene Garfield’s ingenious creation, the Journal Impact Factor (JIF)—an idea that he and the publishers went on to commercialize on a massive scale. The impact factor has since become the most frequently used metric of scientific quality in many disciplines worldwide. According to UNESCO, there are now more than 400,000 full-time scientists in Germany alone and many millions more worldwide. These hordes of scientists now publish millions of articles every year. Within a century, the mean number of authors per article has increased from one to six. But over those one hundred years, scientific productivity—the ratio of knowledge output to the resources put into science—has declined sharply. We have accumulated a considerable amount of knowledge about the world, and so truly original ideas have become rare. The low-hanging fruit have been picked, and both content and methods are becoming increasingly complex. Progress nevertheless continues, because the number of scientists, and thus the input into science, has increased in parallel by roughly the same factor. It now takes ever larger numbers of scientists, as well as increasingly complex and expensive tools, to continue to reveal nature’s secrets.

 

The swell in the academic ranks over recent decades has provided excellent conditions for establishing a new understanding of quality. Simple, transparent, fair, and seemingly reliable criteria for evaluating researchers and research have emerged: the already mentioned JIF, the h-index, which summarizes a researcher’s citation record, and the number of externally funded projects.
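For readers who have never calculated these yardsticks, the minimal sketch below shows how the two citation-based metrics are commonly derived. The function names and all figures are invented purely for illustration; real bibliometric providers apply additional rules, for instance about which items count as “citable.”

def h_index(citations):
    # The h-index is the largest h such that h of a researcher's
    # papers have each been cited at least h times.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def journal_impact_factor(citations_this_year, citable_items_prev_two_years):
    # Classic two-year JIF: citations received this year to items
    # published in the previous two years, divided by the number of
    # citable items published in those two years.
    return citations_this_year / citable_items_prev_two_years

# A researcher whose six papers were cited 25, 8, 5, 4, 3, and 0 times
# has an h-index of 4: four papers with at least four citations each.
print(h_index([25, 8, 5, 4, 3, 0]))      # 4

# A journal whose 200 citable items from the previous two years drew
# 1,200 citations this year has a JIF of 6.0.
print(journal_impact_factor(1200, 200))  # 6.0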

 

The JIF, which is extremely popular in many areas of science, is a striking example of how the assessment of research quality has shifted radically away from content toward a surrogate metric: if you publish in a journal with a high JIF, your work is assumed to be of high quality. A high rating is rewarded with research funding and tenured positions; in other words, it makes it easier to climb the echelons of the academic system. But this ignores the fact that the JIF really only rates the popularity of a particular journal and subject. In addition, 80 percent of the citations in Nature and similar publications come from only 20 percent of the articles (including reviews).
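A small worked example, using made-up citation counts, illustrates why such a mean-based figure says little about the typical article in a journal:

# Hypothetical citation counts for ten articles from one journal year;
# two "blockbuster" papers carry almost all of the citations.
citations = [120, 60, 4, 3, 3, 2, 2, 1, 1, 0]

average = sum(citations) / len(citations)        # 19.6, the JIF-style mean
middle = sorted(citations)[len(citations) // 2]  # 3, a middle-ranked article
top_share = (120 + 60) / sum(citations)          # about 0.92: one fifth of the
                                                 # papers, over 90% of citations
print(average, middle, top_share)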

 


 

Consequently, the vast majority of articles in these journals, often referred to as “glam” journals, do not attract more citations than articles published in journals rated at best as good, or not rated at all. The use of the JIF today brings to mind a rule formulated by the British economist Charles Goodhart: “When a measure becomes a target, it ceases to be a good measure.”

 

We cannot turn back the hands of time; the expansion of science, and indeed its industrialization, has moved us a long way forward in our quest to understand what binds the world together at its essence. Nonetheless, it is important that we identify and reject systemic negative developments that distort our understanding of quality and nourish the illusion that it can be objectively measured. Conversely, we must support and reward those who develop and test strategies that once again focus on substantive “qualities” such as methodological expertise, reproducibility, originality, integrity, and transparency, rather than limiting ourselves to a few abstract numbers for the sake of convenience.