
Podcast

#AskDifferent: the podcast of the Einstein Foundation

In the podcast series #AskDifferent, researchers who are funded by or associated with the foundation talk about the small steps and big coincidences that led to an extraordinary career. We want to know: what drives them to ask differently, to keep asking, and to explore our world down to the smallest detail?

Can "Bad" Genomic Data Make Science Better?

Portrait of Max Sprang
Photo: ems

#AskDifferent 51 - In this episode of AskDifferent, bioinformatician and Einstein Foundation Early Career Awardee Maximilian Sprang takes us into the hidden world of errors in genomic data. Why do sequencing studies sometimes produce impressive-looking results that later fail to replicate? What happens when tiny technical glitches masquerade as biological discoveries - and what does that mean for patients and precision medicine? Sprang explains how his team detects and even deliberately introduces errors to understand them better, improve software tools, and make genomic research more reliable without throwing "messy" data away.


*Please note: the informational insert is AI-generated*


Intro: Low-quality data is not bad data, right? It's just differences in technical issues, basically, right, in the end of the day. And it can actually help to still have them in the public domain, but to flag them. Only with that kind of data we can learn more about these technical issues. AskDifferent. The podcast by the Einstein Foundation. 

Marie Röder: My name is Marie Röder and I'll be your host today. Do you like detective stories? I sure hope you do, because we have one for you today. It's a story where someone follows small clues to uncover a hidden problem. In science, this kind of careful investigation happens all the time. And sometimes the clues are hidden in something very small: tiny irregularities in large data sets. Our guest today works on exactly that. His name is Dr. Maximilian Sprang and he studies errors in genomic data, the kind of errors that can quietly influence scientific results. Max recently received the Einstein Foundation Early Career Award 2025, which comes with 100,000 euros to support his research. Congratulations and welcome to AskDifferent.

Max Sprang: Good. And thanks for the kind introduction and the congratulations. 

Röder: So Max, you are 31 years old and already doing award-winning research. If you had to explain your job to your 12-year-old self, what would you say a bioinformatician really does? 

Sprang: A bioinformatician obviously looks at the screen a lot and sits around a lot, so you need to do some sports next to that, otherwise your back will hurt. But other than that, what a bioinformatician does is look at large amounts of data in which, like you said so nicely in the intro, you can discover a lot of secrets. Especially in genomic data, that is very cool, and that would also have interested my 12-year-old self already, because I had a very early interest in science. And in genomics, it's so cool to investigate this because it's the basis of all life. And especially what we do is functional genomics. So we look at the function above the DNA. So we don't look at the genome itself, but at what the genome does. And I think that is especially intriguing because there's a little bit more action in that.

Röder: We're going to delve into what exactly you do in just a minute. But before that, I want to ask you: you just received the Einstein Foundation Early Career Award for your project, which is called Erring Rigorously. What does this award mean to you and what will it help you to do in your research?

Sprang: I mean, it means a lot to me, because the project that I proposed there is basically the natural extension of my PhD thesis, right? So I already put in a lot of effort and a lot of time. I plan to stay in academia, right? And so it is kind of essential to try to show independence early, to get, let's say, these awards early, right? Or this kind of recognition, because it helps you to get a foothold in the academic world. And I always wanted to be a researcher. I already mentioned that earlier. And so it means a lot to me and I'm very thankful. And also I think it's very cool that there is this award that is focused so much on quality control and on research about research and stuff like that, because it is very important and a lot of people talk about it nowadays. It gets more and more, luckily. But there are not a lot of initiatives that really go that way, and that award does that, and also the BIH with its Quest Institute. So it's very cool. Thanks for that.

Röder: So we want to look at what your research really does. Let's imagine a situation from medical research. A team studies cancer patients and they sequence RNA from tumor samples and compare them with samples from healthy people. Their analysis finds hundreds of genes that seem to behave differently. At first, this looks like a major discovery, but when another research group tries to repeat the experiment, they cannot reproduce the results. And the problem is not necessarily the biology, it is the data. This is where your detective work begins. So why can sequencing studies sometimes produce results that look convincing at first, but later turn out to be unreliable? 

Sprang: That's a very hard question. 

Röder: That's what we're known for. 

Sprang: This is basically the question that we ask with this project, or with this chain of projects. When you see differences in quality, or when you have these batch effects (that's a typical term that you hear very often), you have differences between repeated experiments, sometimes even when they come from the same lab. And the root cause of these differences can be very different. It's very heterogeneous by itself, right? And that's why it's such a hard problem to build, for example, software that tries to correct these errors or to integrate data from different sources. Because the sources for this can be myriad, as you say, right? So the typical thing in a batch effect, for example, is that you have two different handlers who do the same experiment, or even one handler who does it on different days. And then the external environment could be a bit different, or you could use different batches of chemicals to do the same experiment, right? And all these things, even though they look very small and at first look are not connected to biology at all, will introduce differences in the signal that we get out of the sequencing assay, you know? And they can therefore skew the analysis that we do and then also introduce these reproducibility errors that you just mentioned. If you repeat the experiment in a completely different environment, right, in a different lab, for example, or with a new set of cells, then even more so, because then you also have biological differences on top, right? Then likely, or sometimes, you won't be able to reproduce the results, or you will have very different results, and then you have to try to piece things together and see: okay, what is our overlap here? Can we at least look at this? And this, for example, is also the most simple strategy to deal with that, and it is often done.

Röder: Okay, before we go deeper into the problem, let's briefly explain what genome sequencing actually is. 

Think of your DNA as a big library. The books in this library are the genes. Every cell in our body reads different books depending on what it needs to do. When a cell reads a book, it makes a copy of it. The copy is called RNA. Scientists can collect these RNA copies, for example, from our tissue or blood cells. This RNA is then put into a sequencing machine. The machine reads small pieces of genetic information and a computer puts all these pieces together again. Which genes are active? Which ones are inactive at the same time? These modern methods are called next-generation sequencing, or NGS. They have been available since the 2000s and are now also used in medicine. RNA sequencing can answer how something like a mutation changes the behavior within a cell. This is the basis for doctors to find the best possible treatment for patients.

Röder: Okay, so what I've learned so far is sequencing technology offers powerful insights into how our cells work, but the process involves many steps and each step can introduce small differences into the data. So you study errors and biases in sequencing data. Where in the process do these errors usually appear? 

Sprang: I would say they appear everywhere, actually. Depending on what you're interested in, you need to look at different steps, obviously, in this pipeline. What we are interested in now is really the source of these quality differences. In one of our papers, we call it quality imbalance. That's basically also a kind of technical artifact, similar to these batch effects that I just talked about, but it's rooted in differences in quality. And also for these batch effects, so the ones rooted in differences between handlers or between chemicals, you really need to go all the way back to the wet lab. We do this by using cell culture experiments, because there you have a lot of control, and also cell culture is just cells. It would be ethically quite questionable to use an animal model for error research. So cell culture is the optimal way for us to do this. And then in the cell culture, we really try to change relatively basic things. So, for example, when you have a cell culture, you can imagine a flask, and in this flask there is some medium, so just some water with nutrients that the cells swim in, or rather they sit on the bottom. And when you have a cell culture, you need to keep the cells alive and well. And so sometimes you have to split the cells. So you take them apart and put just a part of them back into a new flask, because otherwise they overgrow. And what we do is that we do this quite often. That's called passaging. So we increase the passage number to see how long we can actually play this game until the cells change. In this direction, some research has already been done, which we will also take into consideration. And what you can also do, when you do the split, is put more cells in than you normally would, so that they overgrow faster, and then you can take them. So that's another error, right?
So you have let them overgrow by accident, but, for example, it wasn't recognized during the wet lab procedures, right? And so the sample was sequenced, right? And this also happens sometimes. And then what we will also do is interfere directly with the RNA. So normally when you do a cell culture experiment, right, you culture your cells, you treat them, for example, with a treatment, or you have a knockout or something like that. Then you get them out, and you get the RNA out, and then you put the RNA into a sequencer, you know? There are some steps in between, right? So you need to translate the RNA into DNA and stuff like that. But let's keep it simple, right? So we just put the RNA into the sequencer and get the sequence out. And what we do here in between is also interfere with the RNA directly. So we cut it into small pieces, right? And then we compare the downstream chemistries that are used to bring the RNA to the sequencer. I'm already getting quite technical now, I'm aware of that, right? But I wanted to give a little in-depth look at what we plan to do. But the quintessential thing here is just that we really go all the way back to the wet lab. And there we introduce changes or perturbations that will have quite a strong impact all the way down the line. Or at least that is what we expect to see, right? Perhaps it's not the case, but I'm quite sure it is. Then, to come back to the original question of where these things can all be introduced: in all these downstream steps, some of which I already mentioned, errors can obviously also be introduced, right? So when we try to introduce errors so early, we need to be very vigilant and have very strong and stable downstream pipelines to work with, right? And luckily this is the case with the lab partners that I have. They have established this experiment quite well. And that also has the positive side effect that we can reuse all the data from them.
So we have not only the new data that we can generate with the money from the Einstein Foundation, but also we have all the data that we can then also take into account. 

Röder: And you and your colleagues, you analyzed many published clinical RNA sequencing datasets and you found that about 35% of the studies showed quality imbalances. What exactly does it mean and were you surprised that this number was so high?

Sprang: Yes, we were surprised by that. And I mean, for us, in a way, that was good, right? Because we could have a fancy big publication. But obviously this leaves a lot of questions open, right? Because all these datasets were really clinically relevant, as we call it in the paper, right? So they were from actual patients, from actual hospitals, and it was always disease versus healthy, right? So it was really about the difference between a diseased tissue and a healthy tissue. And it was also not all cancer. There was a lot of cancer data in there, obviously, because it's the majority of the biological sequence data that's available. But we also have other diseases like psoriasis, right? So an inflammatory skin disease. And then we have neurodegenerative diseases, especially in the bigger datasets. So we had a follow-up poster that did the same as the paper, but with double the datasets, right? So even though the data was so heterogeneous and so important, we still saw these 30%, right? Even 35% in the bigger datasets, that still had these strong quality imbalances, right? And the problem that we see here, and we also point at that in the paper, and that's also why we want to do this downstream wet lab experiment, is what happens when you have these strong imbalances, right? So we are talking here about differences that are not rooted in biology and that are confounded with these groups, right, with the disease and the healthy group. And if you have that, then you will introduce false positive signal that you can mistake for biologically relevant signal, right? So in that sense, you would look at the cells and look at which genes are differentially expressed, right? So the disease group has a higher expression of this gene, so you would expect that this higher expression of the gene is part of the disease, right? Or perhaps a symptom. I mean, both can be true, right? But both would be relevant for you if you try to fight the disease.
However, in that case, it's very possible that this higher expression, or the difference that you see here, is actually rooted in the quality difference, so in this technical artifact. And then when you look at your data, you cannot really be sure anymore whether it's rooted in the actual biology of the disease or whether it comes from the technical difference.
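
Editorial illustration: the confounding described above can be sketched with a toy example. Everything here is an assumption made for illustration (the expression values, the size of the quality penalty, and the function names); it is not the analysis from the paper. One gene has no true difference between groups, and low-quality samples are assumed to lose a fixed amount of signal.

```python
from statistics import mean

QUALITY_PENALTY = 2.0  # assumed signal loss in a low-quality sample (hypothetical)

def observe(expression, low_quality):
    """Apply the assumed technical artifact to the true expression values."""
    return [e - QUALITY_PENALTY if lq else e
            for e, lq in zip(expression, low_quality)]

def mean_difference(values, groups):
    """Mean of the disease group minus mean of the healthy group."""
    disease = [v for v, g in zip(values, groups) if g == "disease"]
    healthy = [v for v, g in zip(values, groups) if g == "healthy"]
    return mean(disease) - mean(healthy)

groups = ["disease"] * 4 + ["healthy"] * 4
# A gene with NO true biological difference between the two groups.
true_expression = [10.0, 10.5, 9.5, 10.0, 10.0, 10.5, 9.5, 10.0]

# Low quality spread evenly across both groups.
balanced = [True, True, False, False, True, True, False, False]
# Low quality confounded with the disease group (quality imbalance).
imbalanced = [True, True, True, True, False, False, False, False]

balanced_diff = mean_difference(observe(true_expression, balanced), groups)
imbalanced_diff = mean_difference(observe(true_expression, imbalanced), groups)

print(balanced_diff)    # no apparent group difference
print(imbalanced_diff)  # apparent difference caused only by quality
```

In the balanced layout the artifact cancels out of the group comparison, while in the imbalanced layout the gene appears differentially expressed although its true expression is identical in both groups.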

Röder: And what kind of consequences does this have for, let's say, patients? 

Sprang: I mean, luckily this is not the only kind of data that we have about a patient, right? So there's also the clinical data, there are physicians who look at the patients, right? So depending on the disease, the impact would not be so strong. However, nowadays these sequencing assays, also functional assays, come into the clinics more and more, right? And what you have, for example, are molecular tumor boards, where you look at exactly this kind of data and also at historical data sets from your own house, right? So you have cancer patients from the last, I don't know, let's say 10 years, for example, and you aggregate this data, you look at it, and then you get a new patient with the same disease, with the same type of cancer in that case, and then you try to put them into subgroups, right? And then decide on the treatment, right? And then especially if you are looking at this historical data, where you perhaps do not have the metadata or the context of all the experiments that were done, right, so when exactly was this data taken, then this could be a problem, right? Because then you could face these quality issues and not be aware of them and then potentially make wrong decisions, right? And also when we look at public data, where we do a lot of research that is potentially also clinically relevant, because we look, for example, for biomarkers that we want to try to bring into translation, to use as a diagnostic marker, for example (it takes a long time, obviously), here we face the same problem too. So we look at these large data sets, right? And we use, for example, machine learning or statistical models to find these signals, right, that discern the diseased from the healthy tissue. But if we face these quality imbalances or other technical artifacts like batch effects, right, and they skew our data, then this can also misinform us. Let's say it like that.

Röder: I see. Early on you already talked about the solutions, or the research you're doing to develop solutions. So I read about a software tool called seqQscorer that you work on. What does this tool do?

Sprang: So seqQscorer is a machine learning tool and it's supervised, right? So it's based on data that was labeled by humans. And we got this from the ENCODE project. So here a big thanks to the ENCODE project, right? Very cool project in general. It's the biggest database for functional genomics that is there at the moment. And they also have a lot of in-house quality control. They call it audits, right? So they basically automatically flag certain information about a given FASTQ file. These are the sequencing files that you can upload there. And if there are too many of these flags, right, then an actual team of actual humans will look at it and have the final say on whether these files will be released, so they are of fine quality, or whether they will be revoked, so they are of low quality, or at least lower. And the cool thing about ENCODE is that even the revoked files are still publicly available, right? And so this is perhaps a little take-away message also for the people who hear this and who work in science: low-quality data is not bad data, right? It's just differences in technical issues, basically, right, at the end of the day. And it can actually help to still have them in the public domain, but to flag them, right? Because only because of that was this project possible at all. And also only with that kind of data can we learn more about these technical issues. So what we did is we took this labeled dataset and then we trained a machine learning algorithm to discern between released and revoked, or high and low quality. What you often do when you build a classifier like that, right, is that you just use it like that, so to classify yes or no. However, what we did is we take out the probability of this machine learning algorithm, right? So instead of getting out a zero or a one, a yes or a no, we get out a number between zero and one. And this number we can then use to compare the samples in a given dataset, right?
And this gives us a bit more agency as humans to also make the decision ourselves, because otherwise it's just a cut at 0.5. And this does not work very well because, as I already mentioned, these quality issues are very heterogeneous. And so when you compare the score between multiple datasets, you also see a batch effect, right? So you cannot have a global quality score that is always right. You always need to know the context, otherwise it does not work.
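
Editorial illustration: the difference between a hard cut at 0.5 and comparing probabilities within one dataset can be sketched as follows. The scores, the z-score rule, and the function names are hypothetical choices made for this sketch; they are not taken from the tool discussed in the interview.

```python
from statistics import mean, pstdev

def hard_label(prob_low_quality, cutoff=0.5):
    """Global yes/no decision: ignores which dataset the sample belongs to."""
    return "low" if prob_low_quality >= cutoff else "high"

def within_dataset_outliers(scores, z_cutoff=1.5):
    """Indices of samples whose score is unusually high relative to
    the other samples of the SAME dataset (simple z-score rule, assumed)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return []
    return [i for i, s in enumerate(scores) if (s - mu) / sigma > z_cutoff]

# Hypothetical probabilities of "low quality" for two datasets.
dataset_a = [0.10, 0.12, 0.11, 0.13, 0.45]  # last sample stands out, yet is below 0.5
dataset_b = [0.62, 0.60, 0.63, 0.61, 0.64]  # all above 0.5, but none stands out

labels_a = [hard_label(p) for p in dataset_a]
labels_b = [hard_label(p) for p in dataset_b]
outliers_a = within_dataset_outliers(dataset_a)
outliers_b = within_dataset_outliers(dataset_b)

print(labels_a, outliers_a)
print(labels_b, outliers_b)
```

With these made-up numbers, the global cutoff calls every sample in the first dataset fine and every sample in the second one low quality, whereas the within-dataset comparison singles out only the one sample that differs from its neighbours.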

Röder: And how does it work? Let's say you find that there is a problem, or that it's a messy dataset. Should the sequencing data be corrected, or is the goal mainly to detect the quality issues before researchers draw conclusions, let's say?

Sprang: I think it's a bit of both, right? And this again depends heavily on context. If you have a lot of data, for example, right, then your biological signal will likely be strong enough that it's worth integrating it and putting it together. If you have a small dataset and you have one very strong outlier quality-wise, then it's worth keeping that out and focusing on just the rest of the data. So for us, it was mostly about detection, but then in the downstream work, we also quite often use this score as the number it is, to see if we find correlations, for example. So for example, there are gene patterns that correlate with low quality, and some of this was known before, so it was also kind of a validation for our machine learning algorithm that it actually works, right? But you can also potentially detect new sets of genes that are likely to coincide with low quality, right? And if you then see them popping up, for example, in your differential analysis, then potentially you have to take a step back and look at the quality, or use seqQscorer.
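
Editorial illustration: using the score "as the number it is" to look for genes that track low quality can be sketched with a plain Pearson correlation. The gene names, expression values, scores, and the 0.8 threshold are all hypothetical and chosen only to make the idea concrete.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(vx * vy)

def quality_associated(expression, scores, threshold=0.8):
    """Names of genes whose expression correlates strongly with the score."""
    return sorted(gene for gene, values in expression.items()
                  if abs(pearson(values, scores)) >= threshold)

# Hypothetical per-sample probabilities of low quality.
quality_scores = [0.1, 0.2, 0.4, 0.7, 0.9]
# Hypothetical expression values for the same five samples.
expression = {
    "gene_A": [5.0, 5.5, 6.5, 8.0, 9.0],  # rises together with the score
    "gene_B": [7.0, 6.0, 7.5, 6.5, 7.0],  # no clear relation to the score
}

flagged = quality_associated(expression, quality_scores)
print(flagged)
```

A gene flagged this way is not proven to be an artifact; it is a prompt to re-check sample quality if the same gene later shows up in a differential analysis.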

Röder: I want to talk a little bit more about your new project, which takes a particularly interesting approach. Instead of only searching for the errors in existing data, you introduce controlled errors. You already mentioned that at the beginning of the podcast. So this project is called Erring Rigorously, and you deliberately introduce errors into biological samples before sequencing them, if I understood correctly. So why is it useful to intentionally create these errors?

Sprang: This is actually well connected to the question before, right? Whether you just want to detect or whether you want to correct, that's where it's very useful to try to introduce these errors yourself, because then you can try to build methods to correct them, right? And at the moment, there are a lot of software solutions, let's call them that, that's always a good word, for correcting batch effects, right? And some of them work very well. So there's surrogate variable analysis, SVA, and then there's a thing that's called ComBat, very cool name.

Röder: Very cool. 

Sprang: And the thing is, these things work quite well, but they are again dependent on context. And here we are again at this fuzziness, or heterogeneity problem, of these technical artifacts, because even these very good software tools, which have been around for years, at least in part, will fail in certain experiments. And it's not really clear why they sometimes work and sometimes don't. And also sometimes they work, but they work too well. So they over-correct and then you lose biological information. And this is why it's worth trying to introduce errors yourself, because then you have more control over the situation or the context, right? And then you really know: okay, I introduced an error here, or a difference here. We should pick this up with our software. Or if our software can't, then perhaps another can, and then we will try this, right? And then this gives us a bit more information about what the real source of these differences is. And hopefully, when we know what the source is, we can potentially build better software to detect and correct them. Or we can try to improve the wet lab pipeline, right? Make it more robust at certain points where we know they are very critical for the quality, or where it's very easy to introduce a batch effect or stuff like that.
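
Editorial illustration: why a deliberately introduced, known error helps when judging a correction method can be sketched with a toy example. The correction used here is plain per-batch mean centering, a deliberately simple stand-in and not SVA or ComBat; the effect sizes, batch labels, and function names are assumptions.

```python
from statistics import mean

TRUE_EFFECT = 3.0       # assumed biological difference, treated minus control
INTRODUCED_SHIFT = 2.0  # known technical shift we add on purpose to one batch

def add_shift(values, batches, shifted_batch="day2"):
    """Introduce the controlled error into every sample of one batch."""
    return [v + INTRODUCED_SHIFT if b == shifted_batch else v
            for v, b in zip(values, batches)]

def center_by_batch(values, batches):
    """Simplistic correction: subtract each batch's own mean."""
    batch_means = {
        b: mean([v for v, bb in zip(values, batches) if bb == b])
        for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]

def group_difference(values, groups):
    """Mean of treated samples minus mean of control samples."""
    treated = [v for v, g in zip(values, groups) if g == "treated"]
    control = [v for v, g in zip(values, groups) if g == "control"]
    return mean(treated) - mean(control)

groups = ["control", "control", "treated", "treated"]
clean = [10.0, 10.0, 10.0 + TRUE_EFFECT, 10.0 + TRUE_EFFECT]

# Balanced design: every batch contains one control and one treated sample.
balanced_batches = ["day1", "day2", "day1", "day2"]
# Confounded design: the batch coincides with the biological group.
confounded_batches = ["day1", "day1", "day2", "day2"]

balanced_corrected = group_difference(
    center_by_batch(add_shift(clean, balanced_batches), balanced_batches), groups)
confounded_raw = group_difference(add_shift(clean, confounded_batches), groups)
confounded_corrected = group_difference(
    center_by_batch(add_shift(clean, confounded_batches), confounded_batches), groups)

print(balanced_corrected)    # true effect is preserved
print(confounded_raw)        # effect inflated by the introduced shift
print(confounded_corrected)  # over-correction: the biology is removed too
```

Because the size of the introduced shift and the true effect are both known in this sketch, it is possible to say exactly when the correction succeeds and when it removes the biological signal along with the artifact.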

Röder: And what can science more broadly, let's say, learn from studying errors or mistakes? 

Sprang: Almost a philosophical question, right? I think this also connects a bit to what I mentioned before, right? So if you actually put out the data that is faulty or that has errors, right? First, there is still biology in there, right? Except it's, like, completely broken, right? So sometimes that can happen, obviously. But if there are errors in the data, it doesn't mean the data is not good anymore, right? So data with errors is not bad data, right? So that's one thing that I really want to drive home here. And then I think if you learn about these errors, you will definitely have an easier time to not make them again. The better you know the context, the better you know the framework that you are working with, be it in the wet lab or be it in the computer, the easier you can use it well. Say it like that.

Röder: I see. If you could change one thing about how the genomics community deals with data quality today, what would it be? 

Sprang: I don't know. I think it would be being more open about errors, right? Because I know I have been working in the wet lab myself for some time, like my bachelor was in biology, right? And also in my master's, I did do some wet lab. And I obviously work with a lot of people in genomics and both the people on the wet lab side as well as the people on the bioinformatics side will often opt to not publish broken data or to not publish certain parts of the data because it's not that beautiful, right? Or because it doesn't belong to the story well or whatever, right? And this is obviously rooted in the pressure to publish and then publish highly, right? And I think this is, in the end of the day, it will always be bad for science, right? Because we can also learn from not so nice data and we can learn very well from our errors. If I could wish for something, then I would wish for less pressure to do exactly that, right? To be more open and to be also more transparent about errors and failures. 

Röder: Thank you very much, Max Sprang. 

Sprang: Thank you very much. 

Röder: Thank you for joining us today and thank you for listening to AskDifferent, the podcast of the Einstein Foundation. If you enjoyed this episode, subscribe and rate us and join us next time. My name is Marie Röder.