Talk - The critical role of open source in open research

On 20 March 2024, I gave a talk at the Open Source for Innovation in Universities event titled “The critical role of open source in open research” (open source slides published to Zenodo). Like last time, it was informed by incredible feedback I received from various open research communities, especially Malvika of the Turing Way who first connected me to the organisers. There's extra stuff I couldn't fit into the talk, so I'm putting it here.

I'm posting:

a few general notes;
other resources/further reading suggested by Turing Way members; and
a transcript of my talk.

I'll try to clean up this post with more context and details on a best-effort basis.

There is a video recording which is saved in the Zenodo item, viewable on YouTube, and embedded here:

General notes

In-person verbal feedback was positive, though I didn't get to spend as much time preparing the talk as I wanted. I was also running out of time near the end, and wish I could have talked about the Turing Way more!

This time, I also opened Turing Way GitHub issue #3570 to track the development of this talk.

As expected, I wasn't able to fit everything in, but thank you to Sarah Gibson, Julien Colomb, and Esther Plomp for your feedback earlier to help me prepare! I'm also grateful to the organisers Michael Meagher and Clare Dillon who gathered a great group of warm and interesting people for this event. :) Special thanks to Malvika Sharan for the several meetings we had to structure this talk.

A note about creating a transcript

For my FOSDEM lightning talk, I typed what I wanted to say directly into the presenter notes in my slides before the talk. However, this time I just didn't have time to do that.

So, I tried using my phone to make a live audio recording as I gave the presentation. Then, I used the open source Whisper.cpp automatic speech recognition tool with its open-ish ggml-small.en model to generate a transcript.

Then, I copied that transcript into the presenter notes of the final slides published to Zenodo.

In the end, I think this method works, but is still time-consuming. The generated transcript is a huge text file that I had to manually split into paragraphs, and copy and paste individual chunks of text into their corresponding presenter notes. This is also what's below in the “Transcript” section.
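The manual paragraph-splitting step could, in principle, be partly automated. The following is only a minimal sketch, not what I actually did and not part of Whisper.cpp: `split_into_paragraphs` is a hypothetical helper, and it assumes the transcript is one plain-text blob with ordinary sentence punctuation. It groups a fixed number of sentences into each paragraph, so the chunks would still need checking against the slides by hand.

```python
import re


def split_into_paragraphs(text: str, sentences_per_paragraph: int = 4) -> list[str]:
    """Group the sentences of one long block of text into fixed-size paragraphs.

    Hypothetical helper for illustration only; the sentence splitting is naive.
    """
    # Collapse all whitespace (including line breaks) into single spaces.
    flat = " ".join(text.split())
    if not flat:
        return []
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    # Abbreviations such as "U.S." will be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    return [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]


if __name__ == "__main__":
    sample = (
        "Thank you so much for that introduction. I'm excited to be here. "
        "Let me start with open science. Then I will talk about hardware. "
        "Finally, I will mention some communities."
    )
    for paragraph in split_into_paragraphs(sample, sentences_per_paragraph=2):
        print(paragraph)
        print()
```

A fixed sentence count is the simplest possible rule. Matching paragraphs to individual slides would need timing information or manual review, which this sketch does not attempt.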

Will I continue to use Whisper.cpp in the future? Yes, I think its text transcription is remarkably accurate and is getting better, though there are still paper cuts in the user experience that add some work for me.

Other resources/examples

Thanks to Sarah Gibson and Julien Colomb for the suggested examples:

The Gorgas tracker as mentioned in this post and described in Arancio (2023).
CERN's White Rabbit project. Also see this interview about it.
The Python and R ecosystems vs MATLAB and SPSS in days past.
JupyterHub, specifically the QGreenland project (WARNING: Medium link). I really like this one because it's not just one piece of open source software, but an entire stack that could only work well when all components are open source and remixable.

Transcript

Note: This transcript is lightly edited for clarity, such as by removing the “uh”s and “you know”s, or “ah”s.

Thank you so much for that introduction, Clare. I'm really excited to be here with you today. It's really quite a privilege to be speaking to you. And as Clare mentioned, I am a member of the Turing Way community, which I will come back to near the end of the talk. But today, I'd like to share some of my own reflections as not only an advocate for open research in the academic community over the past several years, but also a member of the open source community. I very much think of my talk as a kind of “yes, and...” kind of presentation. And it's also intentionally provocative with the intention of stimulating new thinking around what kind of opportunities we can consider when it comes to open source technologies and open research.

I want to start very briefly by focusing on the term open research and make kind of a subtle point here. So I consider open research to cover a very wide and diverse array of different research disciplines. And a lot of the examples I'd like to share today come from my experience advocating for open science, which I consider to be a very important component of open research, but it's not all of open research. So there's a subtle difference between the terms and I'd just like to delineate the two, even though most of what I'm talking about today comes from the open science world.

With that said, for the structure of my talk today, I'd like to start with my reflections on some of the core values of open science and why open science is so important, including in academic research. Then, very briefly, I'll touch on a lot of the invisible infrastructure of technology that underlies the scientific research that we do, followed by I think the biggest part of my talk today, which is the additional motivations for open source technologies to enable open science. And I'd like to bring up the hardware component as well because we've heard a lot about software. And finally, I will talk about some of the communities that I have been so lucky to be a part of over the years that discuss a lot of the things in my talk today.

So, open science. I've talked about open science to so many people over the years, and what I have learned is that...

...if you ask 10 people what open science means, they will tell you, yes, I know what it is, but they will give you 10 different answers. So I'd just like to set the scene a little bit for my talk today to establish a common understanding just to help with the conversation.

And one of the initiatives that I've been really privileged to be a part of is the drafting of the UNESCO Recommendation on Open Science that was ratified in 2021. I had a very small role to play in this, but it was a huge privilege to be part of the process and it produced an amazing document.

It's really long, but I recommend you check it out. And part of it defines open science to mean a set of practices for reproducibility, transparency, sharing, and collaboration resulting from the increased opening of scientific contents, tools, and processes. Again, I think this is an amazing document, but this definition is also quite a mouthful, right? So I tried to reflect on: is there kind of like an essence to this definition?

And what I came to is actually the difference between science and alchemy. So what do I mean by this?

I was inspired to think about this by a very provocative digital rights author called Cory Doctorow. He writes a lot about these kinds of fundamental values underlying open research and open science and open source. And he said, if we think about how alchemists used to work, superficially, they were running experiments, they had some research questions, they took lots of notes, and they were actually learning along the way. But the thing with alchemists is that they kept what they knew a secret from each other for 500 years. Because of that secrecy, they didn't advance the state of the art very much. And because of that, every single one of them had to learn in the hardest possible way that drinking mercury is a bad idea.

I think this really hits at the core of the difference between science and alchemy because science is a fundamentally iterative process where we are always building on knowledge shared by other people and what came before. So in a way, for us to be responsible scientists, we have to continue to share what we have learned with other people to build upon our successes and failures. So I think to do good science is to do open science, and I think that's what open science is really about.

Another way to think about this is what I think of as intellectual humility because I've been an academic researcher for like 15 years now. And reflecting on these years of research, I realized that whatever little bit I've added to our collective body of knowledge, I was able to do that because of everything that I've learned from the people who came before me. So as researchers, we really didn't get here on our own. It's really built on top of what everyone else has shared with us.

And it is with all of this in mind that I think open science really comes with four fundamental freedoms, where for any piece of knowledge, it should come with the freedoms for anyone to use it, study it, modify it, and continue to share it with other people to continue that iterative cycle. So this is how I like to think of open science. And that's the first thing I wanted to cover today.

The next thing I wanted to quickly establish is that for this science to happen, we're making use of so much shared technical infrastructure today. I remember many years ago I was at this hackathon with Arfon Smith from GitHub. And he was the person who gave me that lifetime Pro subscription to GitHub that I'm still getting dividends from to this day. It is a platform that comes with amazing features.

But at the same time, I also remember how a couple of years ago there was this big GitHub outage for a couple of hours. And it is when things like this happen that we realize how reliant we have become on the software and hardware infrastructure in our lives. Because when they break, when we hurt, that's when we realize our reliance on these things.

And it's important to think about this because it reminds us to reflect on who gets to have a say in how this infrastructure works and how that infrastructure can work for us as researchers, and how we live out our lives. So this invisible infrastructure is really important. And this kind of centralization that's happening, I think, is a challenge that open source technologies can tackle.

So I've been thinking about a lot of the motivations for open source, including a lot of the reasons that people have talked about today. And I'd like to go over some examples. I want to talk about hardware, but will start with a software example that I think is amazing, which is...

...the QGreenland project. So I thought this project was so cool because it started out as a bunch of academic scientists who share a common theme, which is that they all study Greenland. It could be meteorologists, geologists, and a lot of other scientists. And they developed this common software platform for analyzing geospatial data about Greenland.

And they built it on top of an open source software called QGIS. It is a geographical information system, so that they can pull all of the geospatial data about Greenland into one place. They have a whole suite of tools built on top of QGIS to analyze that data. And the whole stack is called QGreenland. And what happened was that this project became successful. And last year in 2023, they wanted to run a training workshop for other researchers to learn about how to use QGreenland for their scientific research.

But one problem they encountered was that if they have 20 scientists in the room coming to this workshop, all with their laptops and their different operating systems and configurations, it takes so much time to just get people to the same page to install QGIS, get it running, and then put QGreenland on top of it. That takes so much time from the actual training they wanted to do.

So they thought, okay, can we reduce this friction a little bit?

And the solution they came up with was that they started with JupyterHub, which is kind of like a server-hosted version of the Python-based kind of Jupyter computational notebook that a lot of data scientists use.

But they were able to make some additions to Jupyter and tweak it so that instead of just running Python, they're running an entire Linux desktop environment on top of JupyterHub.

And with that, they can then install QGIS into that Linux environment, and then they put the whole QGreenland geospatial data platform on top of that.

And once they put all of this together into one package, they serve it from their server so that the participants in the workshop, they can just open up their web browsers, go to a particular URL, and the whole package runs as a web page inside their browser. And this saves so much time in the workshop because they don't need to set QGIS up on every individual computer.

Now, the reason I love this example is that all of these components, they are open source to begin with, and they demonstrate the FAIR principles of open science. Now, I think a lot of you know what FAIR stands for, but just so we're on the same page, FAIR stands for...

...Findable, Accessible, Interoperable, and Reusable. And this is a big thing in open science, and I think QGreenland demonstrates all of it. Because of this open source publishing online, it's easy for people to find it. The way they set it up is really accessible. It's interoperable because the components are open source, and they were able to tweak the components to interact with each other. And, of course, it's reusable because other scientists can adapt it to different research contexts. And I think this is a demonstration of how the FAIR principles that are so important to open science are enabled by open source technologies.

Okay, so this is a software example, but if you look at the UNESCO recommendation on open science, it talks about several main pillars of open science, including the usual suspects like open access publications, open data, open educational resources (I think this one is really important!), and of course, open source software code.

In addition to that, the recommendation emphasizes that hardware is a really important part of open science as well. So I like to focus a bit on the open source hardware side of things.

And if you really think about it, hardware underpins so much of scientific research. It was literally hardware that took people to the moon. That's how much we rely on hardware to do science.

It can be huge pieces of equipment like the Large Hadron Collider.

Or it can be something seemingly simple, but equally integral to the research infrastructure, like microscopes that we use in so many labs today.

Now, the thing with hardware is that it's very often closed source, like a lot of software.

And one of the challenges with that is that it's not reproducible in a scientific way. There's vendor lock-in, which was mentioned before. There's forced obsolescence, and there are very high costs. The cost is not only in terms of a very expensive piece of equipment. It's also the very high switching costs, where if you decide there's another piece of equipment you want to use, since it's not open source and there's no interoperability, it's very difficult for you to switch to a different platform.

And this causes a lot of global inequalities in research. I personally know some scientists in some global south countries who really want to have a particular instrument in their lab, but the one manufacturer that makes it simply does not sell it in their country.

And even if they somehow get access to buy it, the cost is so high that they cannot afford it.

And if they somehow scrunch together the money to be able to afford to buy it, once they have it, they won't be able to get any support on it. They cannot maintain it themselves.

And it just becomes prohibitively difficult for a lot of researchers in different places around the world.

So I think when it comes to the social impact of our technologies, it's really important to be mindful of a lot of the global inequalities that come with the technologies of today.

So in contrast to that, open source hardware is defined as hardware whose design is available so that anyone, again, can study, modify, distribute, make, and sell hardware based on that design. And there are a lot of examples, actually, in scientific research.

An amazing one that I know about is the Open Source Imaging Initiative. So this is a consortium of universities across Europe, including some companies, I believe, who came together to create a completely open source MRI machine for medical scanning and diagnosis.

And if you know anything about MRI machines, you know how complicated and intricate they are. And they're actually creating an open source version of it that's becoming successful!

Open source hardware has been to space. Researchers in the U.S., they've developed the ORESAT, which is an open source CubeSat, that became a common platform for scientists across the U.S. to build on top of for remote sensing applications.

It's been launched several times already, and I think they have more launches scheduled.

But the example that I'd really love to talk about is the OpenFlexure microscope. So this is a lab-grade microscope, originally developed by researchers at the University of Bath in the U.K. (I think their team is based in Glasgow now). The point is it's completely open source and modular, and you can 3D print most of the microscope yourself.

It comes with a lot of features, starting with the basic ones like bright field imaging, or fluorescence imaging. But because it is fully open source, there was a separate research team in a different part of the world that looked at the designs, and they actually enhanced it and improved it to greatly increase the resolution for fluorescence imaging.

And this is something that people weren't able to do with the closed-source microscopes that they used before.

These are just a couple of features, but what's also really cool is that this open-source microscope, if you want to build it yourself, the cost of doing so is only about 200 US dollars.

Now, for those of you who have used and bought microscopes for use in the lab before, you will know that these microscopes often cost an order of magnitude more than OpenFlexure for doing the same thing, and I think that's absolutely remarkable.

And because of its low cost and because it's open source, again, as an example, researchers in several sub-Saharan countries, they were able to take the OpenFlexure design to locally produce and maintain that microscope for malaria diagnosis when they weren't able to do it before.

And in addition to this, it has actually prompted the formation of some small businesses in those countries to locally produce and sell these microscopes, and it's again becoming a new business model that's enabled by open source technology.

Okay, so to kind of build on some of the points made earlier, Joshua Pearce is a researcher in this area, and he calculated that open source technologies, including hardware, can provide economic savings of up to 87% compared to functionally equivalent proprietary tools.

And again, my other point is that in addition to the savings, it creates new kinds of businesses.

So I have a bit of a background in molecular biology, and I've used PCR machines a lot. And there's a company that sells these Ninja PCR machines for US$500. Again, if you have bought these for labs before, you'll know that they typically cost an order of magnitude more. So it's amazing how open source not only lowers costs, but creates new kinds of businesses as well.

Okay, so I talked about some of the benefits of open source technology just now. And to build on Clare's point earlier, I think we're faced with so many global challenges today, whether that's climate change or pandemics or other problems. And they're so big and urgent that I think open source technology is what enables the inclusive and rapid innovation needed to address these really urgent issues.

And to bring it back to my earlier point, I truly believe that we simply don't have time to be alchemists anymore. We cannot afford to be alchemists. And I think this is a huge motivator for why open source is so important and critical to open research.

Now, with all of that said, here actually comes what might be the most provocative part of my talk today. So, you know, again, we've seen so many motivations for open source, like the collaboration that happens, faster innovation that's so critical to solve problems of today, the lower costs and business opportunities, and so many other benefits, right?

But I feel they are just the tip of the iceberg in terms of why open source is so important. And there are some underlying values that I think really add a lot to the value proposition of open source.

In my view, that could be things like the autonomy and agency that we can have over the technology that we use and the freedom to use it for our purposes. And I think these are the things that also underpin why open source is so important.

Dr. Julieta Arancio is a researcher of open source technologies, and I think she characterizes it really well, where technology really affects the way we think about research questions.

And when a piece of technology and the tool that we use is closed source, it means that rather than being enablers of our creativity, we end up doing what the available tech lets us do.

Because the people behind that technology get to dictate what you can do with that technology. And what that means, in this context, is that closed source technology also implies a certain kind of epistemic power behind it in terms of what knowledge we are allowed to have and what we can use that knowledge for.

And the risks with closed source technology and the challenges with it is that, depending on how you wield that epistemic power, unfortunately, sometimes it leads to a kind of intellectual poverty. Because only certain people get to have certain pieces of knowledge and not other people. Some people get to make use of that knowledge in certain ways, while other people don't get to do that.

So I think intellectual poverty is an unfortunate side effect that sometimes comes from closed source technologies. And this is where the value proposition of open source technology really comes in.

This is not only convenient and amazing in terms of the collaboration and innovation that happens, there is also an ethical underpinning to it that makes it even more attractive and adds to the value that we already have.

And this connects with open research, because open research is not only about publishing open outputs, whether that's open access papers, open data, and open source software or hardware designs, it is also about how we hold that epistemic power together in a more equitable way.

Such as those scientists I told you about in the Global South who couldn't do what they wanted to do in their research. And I think this is, you know, in addition to all of the benefits that we talked about, a very important value proposition.

If you think some of these conversations are interesting, I'd like to share with you some of the communities in which these conversations are happening.

I'd like to start on the hardware side of things. Over the past few years, I have been very lucky to be part of a group called the Gathering for Open Science Hardware, also known as GOSH.

And this is a network of researchers, hundreds of researchers from across the world, literally from every continent, except maybe Antarctica, who come together to think about the important role of open source hardware in scientific research.

And we have done a lot of interesting work, such as last year we created a policy toolkit for UNESCO on the role of open source hardware in scientific research, which was just published at the end of last year, that provides a lot of policy guidance on the national level for research and innovation policy.

I mentioned Julieta just now. She is the author of an amazing report called Supporting Open Science Hardware in Academia. This report is geared towards scientific research funders and technology transfer offices in universities to provide some guidance on how universities can enact policies to support people to work on open source technologies in research and also ways to spin off that development into successful business models around open source. So I think this is a remarkable report that I highly recommend you to check out.

So that's the hardware side of things, but if you go more broadly than that, I think this is where the Turing Way comes in. So let me tell you a little bit about this community.

It started back in 2019 initially as a book, an online book, made with Jupyter, by the way, about data science and best practices around how to do data science in an open and reproducible way.

Now it started off as this book, but the founders of the Turing Way, they thought: “We are not the only experts here, so can we invite other people to help us co-create this book together?”

And they started a distributed collaboration process that eventually turned into scientists and researchers from around the world contributing to this book, not only in terms of data science, but other aspects of open science and open source as well.

And it's grown into a huge book, hundreds of pages long, and because of how the book brought together scientists from different backgrounds, it's grown into a very vibrant community where a lot of conversations are happening around what open research means, what open science means, and what open source means for this work.

So we talk about things like what I presented in my talk today. There's also talk about diverse roles in research, such as the important role of research software engineers in scientific research that's not recognized enough, or things like localization.

So many things about open science and open source are in English right now, but can we translate that to different languages and what does that mean for people from different backgrounds and social backgrounds as well?

With the hope that eventually we can galvanize a cultural shift in terms of how we think about technology and how we think about research so that, again, we can hold this power together in a more equitable way and think of new opportunities for research and innovation.

So I think it's remarkable how over the past five years there have been more than 450 contributors to the Turing Way, not just in terms of pull requests to the repository and adding to the book, but also all of the richness that's been brought into the conversations that have been held together by this community.

It's a really amazing place, and I highly recommend you check out the Turing Way if you're interested in having these discussions and connecting to other researchers in this space.

And I think this really shows how open research and open source can connect to each other in a mutually beneficial way.

Okay, so I'm just about running out of time, but before I end, of course I'd like to thank all of the people who have helped me so much to come here today to share some of these reflections with you.

First of all, of course, there's the Turing Way community. Malvika Sharan is one of the co-founders of the Turing Way. We had a lot of interesting discussions on how to make this a more provocative talk.

I'd like to thank Bri, our community coordinator in GOSH, for contributing a lot of the thoughts on open source hardware, and of course the organizers, including Michael and Clare, for having me here today to share some of these reflections with you, and all of you for putting up with the last 27 minutes and 50 seconds of my talk.

The meta-commentary is that this talk is open source and it's published on Zenodo.org with this DOI, and I encourage you to check it out, fork it, turn it into what you like, and visit the Turing Way and GOSH communities where we can continue to have these conversations.

#talks #opensource #openresearch

Unless otherwise stated, all original content in this post is shared under the Creative Commons Attribution-ShareAlike 4.0 International license