#25: Paul Ginsparg
The godfather of open access publishing
Intro: Some high-level high school students or undergraduates will be reading these media articles, following the links and looking at these articles. And even though they can only get 5 or 10% of it, they'll read the abstracts, look at the figures and say, I'm determined to understand that. And, you know, the whole availability and open access availability of this primary research material will help encourage more people to want to become scientists rather than YouTube influencers.
AskDifferent, the podcast by the Einstein Foundation with Nancy Fischer.
Nancy Fischer: Probably you all had this experience during the first pandemic years. In the news they spoke about a study concerning new vaccines or studies about new medication against COVID-19, but always with the addition that this is only the preprint, which means the study is not yet reviewed or revised by independent scientists. It's not yet published, but the researchers already uploaded it to a preprint server and therefore it's open to the public. This is a great achievement, and we owe this one to Paul Ginsparg. Mr. Ginsparg is Professor of physics and computer science at Cornell University, Ithaca, specialized in, for example, quantum field theory and string theory. And last year, he won the first Einstein Foundation Award for his invention of the preprint server. And I'm very happy that we can speak today for this podcast. Maybe we can begin by imagining how life was back in the early '90s. Personal computers and the internet were not completely standard, of course, but were in their beginnings. And maybe you can describe these times that led to your idea of the first preprint server.
Paul Ginsparg: Well, it's almost difficult for me to remember back to the level of technology we had back then, but nobody had a cell phone, much less a smartphone. The Internet did exist. I myself was connected to the Internet starting in 1987, but the World Wide Web did not exist. Of all of the many services that we have layered on top of the Internet, that's, of course, one of the most prominent. And equally importantly, the rest of the world hadn't yet discovered the Internet. It was something of a private playground for academics. I mean, I was familiar with it because we were in physics. In the early 1980s, even before the Internet, there were a few other networks. There was the old DECnet that started, I think, around '81 or '82. I was at Harvard at the time, and our computer was, though I didn't know it, connected to some computers of colleagues in Berkeley. We had been using email locally in the department, and to my sort of surprise, naive surprise at the time, I received a message from somebody on the West Coast. And I thought that was really cool in the 1982 timeframe. But, you know, by the late 1980s we had something of a critical mass in the community using email. We had started using word processors, text editors, and for the first time we were able to collaborate effectively remotely. Before then we had to handwrite notes and then fax or mail them, but suddenly we were able to write the full equations and send them back and forth. And it was then natural, when articles were finished, to start sending them around and distributing them by the late 1980s.
Fischer: That seems like a good solution. Why was the preprint server needed then?
Ginsparg: Well, so I have to say that I don't know the precise origin of the paper preprint system. When I asked colleagues who started their careers in the late 1950s, they said they were already distributing things by ordinary mail, with technology that predated photocopy machines. So, by the late 1960s, early 1970s, the paper distribution system, at least in my field of high energy physics, had become very systematized and very organized. All of the major institutions would collect the articles in preprint form that were written at their institutions and send them out to institutional mailing lists. By the late 1980s, this had grown, at least at Harvard again, where I was at the time, to sending them out not only to about 300 institutions, but also in personal mailing lists to people that we knew. So, and this is very important for this development, we had this preexisting habit of distributing the information, but it had an accidental consequence, which was that it was intrinsically unfair, not by design, but just by logistics. That is, if you weren't in this privileged loop, then you weren't receiving it immediately, and it might be many months before you were able to have access to the same critical research information, whether it was, you know, about experimental results or theoretical results. And by intrinsically unfair what I mean to say is that if we have some idea or are responsible for some development, we like to think it was because we were more inspired or worked harder, not because we had privileged access to information.
Fischer: And so what you did is you kind of opened the whole thing, and then suddenly every scientist could upload his or her study results. It was a big success. Did you expect that back then?
Ginsparg: No, I can say quite unambiguously, I did not expect it. Originally, what I had in mind was just sort of a three-month queue until the paper distribution could catch up, so everybody could have access immediately. And then the papers that were assigned identifiers starting with 9108, which signified August of '91 when it started, I would just delete by November, three months later. And so it would be easy to find them because you would just, you know, remove everything that started with those four digits. And so I was anticipating, based on the volume of traffic, the number of articles being written at the time, to perhaps get one submission every three days, in other words, about a hundred per year. And the surprise was that the day it went online in mid-August 1991, it started receiving at least one submission every day. I mean, there were no days off, and it very quickly grew. You know, I think there were about 25 in the first month and probably double that in the next month. And then more importantly, within a few months, early in 1992, multiple other fields wanted to join in, starting with algebraic geometry and then other areas of high energy theoretical physics and condensed matter and astrophysics. So there was just enormous growth in '92, which I had neither planned for nor anticipated. And incidentally, just to complete the thought, I was very fortunate that somebody came to visit me in probably October of '91, so before anything was ever deleted, and said, please don't delete it. It's so much easier to access these things via email, which I thought was awkward, than to go across the street to a library or down the hall to a preprint room, or even to find something in a pile of papers on a desk or in a cabinet. And so, you know, I took that to heart and never deleted anything. And so I have everything that was ever submitted, which is now over 2 million submissions.
Fischer: Amazing, amazing numbers today. I think this year arXiv hits 2 million submissions, right?
Ginsparg: That's right.
Fischer: And besides the fact that life is easier for many scientists now, how much did your invention also change collaboration and the sharing of knowledge in science? What do you think?
Ginsparg: Well, that's a good question that I'm not entirely confident I can answer. There are so many other factors at play anyway. You know, over the past 30 years, there's the transformation of the technology, the way we communicate; even the way we're doing this, with a video connection, wouldn't have been possible 30 years ago. I would say that, you know, among the less obvious things that have changed in collaboration is that the average number of authors per article has increased significantly over the past 30 years. And when I ask people about that, this is not at all specific to arXiv; it's true in fields where it's not the primary source of information. And there are probably two factors at play. One is that it is so much easier to collaborate from a distance, because you're not dealing with faxing or written information but can connect via email or now via video conference.
But the other reason that's frequently given, which is slightly more intriguing, is that the nature of the articles themselves has also changed and they require more people because there's so much more specialization. And I'm only mentioning this anecdotally, but when I ask people about it, they'll point to a paper that might formerly have had one or two authors, but now it's six or seven authors, because there's one author who's expert at doing the numerical simulations, another author who is expert in the statistics and doing the data analysis, and another author whose idea it was originally and who was leading the collaboration. And you just have articles that, yes, one person could have written, but it would have taken that one person perhaps a year or two to do all of that. And with so much communication and with competition, that would no longer be competitive. And so people just gather together, and it's a fantastic thing. I'm not meaning at all to be pejorative about this development, that people can selflessly combine talents in order to produce something that is better, and more quickly, than any of them could have done individually. And that's significantly facilitated by the technology that we're now using.
Fischer: So, the Einstein Individual Award acknowledged your work with the first prize last year and with €200,000, if I read it right. What does this prize, and in this case also a lot of money, mean to you and for your research?
Ginsparg: It is a lot of money. And, you know, with great money comes great responsibility or, you know, some such quote. I've still been mulling over it. This may sound odd because it's almost a year ago, but I have been in the back of my mind thinking about various projects, wondering if I should use it for equipment, wondering if I should hire a team of programmers to implement some ideas I've had, and that takes some organization. With respect to arXiv, it seems appropriate to try to find some use of the money, thinking about how the technology really needs to be renewed. And so I think it's ready for that. And, you know, there are people at arXiv working on it, but I'd like to think about some of the machine learning techniques, also seeing how many of the older papers can be mined semantically to try to make use of some of these new technologies. And I'd like to think about some of those applications, and I don't have the complete competence myself. And so that would argue for using some of that money. And although it is a huge amount of money, it's not that large on the scale of software engineering salaries. But, you know, if I could get the right people for a short amount of time, I could probe some of these ideas much more quickly than I could on my own.
Fischer: So it's not making life automatically easier if you have a lot of money and you won a prize, I guess. There's also more attention on your work, which came on top in COVID times, I think, because everybody heard about preprints in these days. But I also wondered, now that I have the occasion to speak about preprints in general: isn't there also a risk of a lack of quality, especially in times like these when research about, for example, COVID-19 or also other things is needed so fast and so urgently?
Ginsparg: That is the essential question. You know, just as the fact that something is a preprint doesn't mean it was screened, the fact that it was published doesn't mean that it's correct. Lots of things appear in the published literature, and we've seen that the screening in the preprint sector is sometimes even more rigorous. You know, there are examples of this where a study will be posted on bioRxiv and gets many more experts instantly on top of it, complaining about the statistical analysis, and the authors then replace it with a second version with many of these things corrected. Whereas with some of these articles published in the New England Journal of Medicine specifically, there was a Harvard collaborator on that, and they just trusted it based on the institution. It was based on fabricated data, and none of us can understand how the reviewers they used could have failed to flag that. And so had those articles gone through this sort of massive crowdsourced review, certain journals would have been spared the embarrassment.
You know, on the other hand, I'm not claiming that preprint distribution is a universal panacea, but, you know, it's not the unmitigated negative that some people feared 30 years ago. And we've seen operationally that it not only hasn't caused harm, but there are a few specific examples of treatments that were advertised earlier than they would otherwise have been had they waited for the published literature, and that may even have saved lives. And I think what we've seen, especially during the pandemic era, has been surprisingly successful. As you mentioned at the outset, the mass media has been very careful to qualify the source of their information and say, okay, this is information that was posted on a preprint server. It hasn't been peer-reviewed. We're not sure it's correct, but we've checked with certain experts for a reality check, and it seems like this is very promising and that you can be positive about it. And moreover, they give a link, and some readers will be expert enough to get something out of it. Many readers in the general public, we found, will follow those links anyway, just to glance at it. And I've asked people in the information science department here, why would that be? And people said it's like going to a museum: you feel closer to an object, you're looking at it, it's real, and you get some intuition from it just by seeing it as a real-life artifact. And it suggests that that's the same thing.
You know, what I fantasize about is that some high-level high school students or undergraduates will be reading these media articles, following the links and looking at these articles. And even though they can only get 5 or 10% of it, they'll read the abstracts, look at the figures and say, I'm determined to understand that. And, you know, the whole availability and open access availability of this primary research material will help encourage more people to want to become scientists rather than YouTube influencers.
Fischer: Yeah, absolutely. Although influencers can also educate about science, that's not impossible. But of course, there's a difference. We spoke about the peer review process and the publishing process, and there's one principle in science: if you want to be successful, very often it's publish or perish. And since you are teaching young students, young researchers: what would you recommend to those who are under pressure to publish?
Ginsparg: My advice to students is, you know, it's absolutely correct. It's unfortunate, but you may be the most brilliant person in the world, and you'll still have difficulty getting a job if you don't have a publication record so that people can evaluate you. In high energy physics, we evaluate people based as much on the preprints, whether or not they've appeared in the published literature. But it's absolutely true. I do not hesitate to encourage my students to write up their work and get it published because I want them to have successful careers. And you know, that's the route to a successful career, and I encourage them to follow the preprint literature. I teach them how to evaluate for themselves what they can trust, what they can't trust, how you evaluate things based on, you know, the authors, the authors’ past publication records. And when all else fails, you ask your advisor what he or she thinks, and the advisor then goes and consults people he or she knows who can give feedback. And that's just the collective way research works. We're all, you know, discussing things, trying to evaluate and, you know, picking out the right things. And so, I certainly would never tell anyone, resist the pressure, don't publish at all. On the other hand, I do discourage people from publishing too much. My own thesis advisor, Ken Wilson, was notorious for publishing very, very few papers, and the few that he published were extremely good, at least one of which he won a Nobel Prize for. And so, I encourage people to put quality over quantity. But beyond that, you know, you have to encourage people to publish if they want to be able to get grants and future jobs.
Fischer: We spoke now a lot about your view on research, but not about your research itself. So maybe let's find the time for that at the end of our conversation, because you studied at the prestigious Harvard University and then did research on particle physics and quantum theory. You mentioned that already. How much time do you have left besides the arXiv work for real, for pure physics?
Ginsparg: Well, I would say that arXiv has a staff and a number of people working for it, and so I'm not responsible for the day-to-day operational activities, and it's difficult to estimate. I do spend time daily, I do various forms of troubleshooting, I look at, you know, trends. I've been fascinated by some of the data analysis. For example, we had continuous growth since the start right up through the pandemic, including the first year of the pandemic, 2020, and then it just flattened. We've never seen anything like this before: the total research output going into arXiv in 2021 and 2022 hasn't gone down, although it has gone down in some fields, but it's flattened, and we've just never seen an effect like that before. And it is of extreme concern, not for arXiv, because arXiv is just seeing the same feed of information as comes into the journals. And I checked with various journals, and they've seen exactly the same thing: in 2020 there was something of a pipeline effect. People were home, the labs were shut down, and they were able to finish off a bunch of things quickly that they wouldn't have been able to finish off otherwise. But then in 2021 and 2022, we're seeing a dramatic effect of, you know, the first and second year graduate students not being able to come into labs because the labs were all closed, or the senior graduate students no longer mentoring, or the advisor not being in, or everybody being disrupted because there was so much time spent on teaching and administration, especially in 2020 and 2021. And we're still not recovering from that. And so, you know, that's some of the things I do. And then, you know, there's the course I'm teaching, and I've written some papers, not so much about string theory and relativistic quantum field theory anymore, but along the lines of quantum information, which has been fascinating and has had, you know, I would say, much more experimental impact and possibilities over the last 10 to 15 years.
In fact, this morning I was writing a problem set for my course in which they use the IBM cloud computers to investigate some quantum paradoxes, which is great fun. And, you know, that was not only not possible when I was an undergraduate or graduate student, but would not have been possible even more than five years ago. So, you know, I'm working out some of these conundrums and also working out experimental methodologies for extracting information from experiments and using machine learning to do that, which has been one of these recent trends. The other thing I was doing this morning was working with my co-organizers on a meeting that's to take place in Aspen this February during the winter session, where we're bringing 100 people together, people like me who are familiar with quantum field theory and its relations to statistical mechanics and statistical physics, to see how much that might inform our understanding of, and ways of creating, new architectures for deep-learning networks. There are a number of close connections in there. So I've been thinking about these issues more than the conventional high energy theory that I was trained in. But, you know, that's fine. It keeps one energetic. Somebody once told me that people spend the rest of their careers rewriting their theses, and I'm happy to say that in my case, the successive revisions have been increasingly drastic.
Fischer: That's a nice word in the end, I think, and it seems as if you have a lot of things to do for the next years, for the upcoming years. There won't be boredom or something like that for you, I guess. So thank you. Thank you for improving the quality of research in such a fundamental way. I think we all got an idea about the importance of preprint servers and the life and work of Professor Paul Ginsparg. He was our guest in today's episode of AskDifferent, the podcast by the Einstein Stiftung. And my name is Nancy Fischer. I am happy if you liked this episode. I'm even happier if you listen to another one; you find them all on the Einstein Foundation's website or on all the known podcast platforms. Mr. Ginsparg, all the best for you and thank you so much for your time.
Ginsparg: Thank you so much for having me.
AskDifferent, the Podcast by the Einstein Foundation.