talks — naclscrg

Talk - "AI" follow up talk about labour and academia

Sat, 08 Feb 2025 15:54:31 +0000

I gave a follow up talk to an earlier talk about “AI” at the University of Bristol TARG research group meeting on 22 November 2024. As usual, lots of stuff I couldn't fit into the talk, so I'm putting them here plus further reading, a transcript, and video recording of the talk.

The slides are published on Zenodo with DOI 10.5281/zenodo.11051128 listed under the “30 minute version”. I will try to gather here:

the video recording;
short summary;
further reading collected when developing the talk; and
a transcript of the talk.

I'll try to clean up this post with more context and details on a best-effort basis.

Video recording

There is a live video recording made during my 22 November 2024 talk which is viewable on the Internet Archive. The video is also embedded here (click the “CC” icon for subtitles):

Short summary

Please see the notes for my original “AI” talk for additional information.

Aware of the irony, I was curious how a large language model (LLM) could take the transcript of my talk (see below) and infer a short summary. The following is what Claude 3.5 Sonnet produced, with some edits by me:

This talk came from my conversation with Jennifer Ding at the Turing Institute about which underlying issues around “AI” technology deserve more attention versus the overhyped aspects. While I acknowledge that new technologies like “AI” can bring positive changes – such as a helpful Speech Schema Filling Tool that helps chemists record experimental metadata in real time as they run experiments – I wanted to focus on several key concerns.

The first observation I made is how “AI”-generated content is affecting academia. I shared examples including a published paper that began with “Certainly, here's a possible introduction...” (clearly ChatGPT-generated) and most amusingly, a paper featuring an anatomically incorrect lab rat with comically oversized genitals that somehow made it through peer review. I've also noted evidence of academics using “AI” tools for both writing and reviewing papers, and even PhD programs where applicants and reviewers use “AI” to convert application letters between bullet points and prose.

I emphasized that words really matter in this discussion. “AI” has become more of a marketing term than a technical term of art, and I pointed to how papers from just before the “AI” hype rarely used the term for the same technologies. I argue that this ambiguous language serves as a smokescreen, shifting power to those who control these tools.

This led me to discuss how “AI” often masks human exploitation. I shared examples including Kenyan sweatshop workers traumatized by moderating graphic content for ChatGPT, their Indian counterparts manually tracking purchases in ostensibly automated Amazon Fresh supermarkets, and bus drivers in “driverless” buses who must remain hypervigilant for that 1% chance of needing to intervene. As Kate Crawford notes, “AI” is “neither artificial nor intelligent” – it's not replacing labor but rather making it more invisible (which Lilly Irani also discussed in depth).

For scientific research, I see several concerns. There's a growing trend of papers proposing to replace human participants with large language models or suggesting complete automation of the scientific process – with one paper proudly claiming it could produce entire research projects from ideation to paper publication for just USD 15 each. I warn that building science on top of opaque and unaccountable “AI” systems risks turning science into alchemy.

While some suggest banning “AI” in academic publishing (following incidents like the well-endowed lab rat paper), I caution that focusing solely on “AI” (“solely” being the key word) might entrench deeper problems like the broken peer review system and publish-or-perish culture. For example, publishing companies might offer proprietary “AI”-generated paper detection tools, which would make us more reliant on them and further consolidating their power without tackling why researchers feel pressured to publish fake papers in the first place.

My key message is that “AI” often highlights existing problems rather than creating new ones. Instead of fixating on “AI” itself, we should address underlying issues in research culture, from job security to toxic workloads. I concluded by recommending resources like the Mystery AI Hype Theater 3000 podcast and the book “AI Snake Oil” for those interested in deeper exploration of these themes.

P.S. Note that a newer book, “The AI Con”, is about to be published in 2025: https://thecon.ai/

Transcript

This started from my conversation with Jennifer Ding at the Turing Institute. And we were talking about: what are some of the underlying issues around “AI” technology that we feel should be surfaced a little more rather than some of the stuff that we think is a little overhyped? And I'm gonna go over a lot of those problems today.

Before I get into it, I want to do something I always emphasize in talks like this, which is that I think for any kind of technology, it can bring about a lot of change in how we do things and how we organize ourselves. And it's not a matter of saying: oh, you know, let's just not use it. There's a potential for “AI” technologies, right? Because if you think about it, when the printing press came around, you don't want to ban the printing press just because you're afraid that the scribes are gonna go out of business. We hopefully can work together to find a way to realize the potential of a new technology.

And I think a positive example that I'd like to share before jumping to everything else is this tool that Shern Tee shared with me. It's called the Speech Schema Filling Tool. So it was developed by chemists for use in their experiments. And what happens is that as you do your experiments, you talk into the microphone on your computer and the large language model on it will use your audio input to do a speech to text conversion and fill in your lab notebook with what you're saying. But what's really cool about it is that the tool will also parse what you're saying and record relevant metadata into a structured data format to go with your lab notebook. So there's a very well-structured metadata set to go with the particular experiment that you're doing. And I think as long as you're happy to talk through your experiment as you're doing it, this tool is so helpful for you to improve the quality of the data that you're capturing, helping make your experiments more reproducible and so on, right?

So there are certainly really good uses, of what people are calling “AI” technologies these days. Having said all of that, obviously there's also a lot of concern that we've seen over the past couple of years, such as in terms of how people publish papers, right? This is a classic one I think Marcus shared a while back where if you look at the paper, starting right from the first sentence in the introduction, it says: “Certainly, here's a possible introduction for your topic.” And I think it's pretty clear that this probably came from ChatGPT, which is one of the more commonly used so-called “AI” tools today to generate text.

However, this is not my favorite one. So my favorite paper is this one. I don't know if some of you have seen it. I see some of you smiling, so you know what I'm getting to. First of all, this was published in Frontiers back in February [2024]. If you look at the text, a lot of it looks fairly generic and probably “AI”-generated. But the most dramatic part is one of the figures which was a lab rat. And most of the lab rat looks kind of like a normal rat, but it's got these giant genitals sticking out of it. For the phallus, it's so long that it extends beyond the figure.

I just love how a figure like this would get past the peer reviewers, it gets past the editors, it gets past the copyeditors of the journal and gets published. Now, for the record, it was retracted by the publisher pretty soon afterwards. But not after everyone on the internet got copies of the PDF first and then archived it. That's how I was able to get this amazing picture of this lab rat, which I love. And you can also see a lot of weirdly spelled words that annotate this figure. So definitely check it out. I think this is one of the classics that's come out of some of the papers we've seen over the past couple of years.

And in addition to generating these papers, we are also seeing some evidence that academics are using these tools to generate the peer reviews that they write. And to be honest, I can kind of relate to what these academics are going through because who has time, right, to do a really good peer review these days? And in higher education, of course, we know that some students feel really tempted to use these sort of [large] language models to generate their essays, and we're also seeing that some instructors are using the same tools to grade and mark the essays.

You know, there's an anecdote I heard for a PhD program that was recruiting students, I think it was in the US, they found that a lot of the applicants to the PhD program didn't have time to write so many cover letters in the application. So they would write a few bullet points saying what they want in their cover letters. They use a large language model, turn it into the cover letter. And then when the professors on the program, they have so many applications to sift through, they ask the same tool to translate it back into bullet points so that it's quicker for them to skim through.

So a lot of interesting use cases here, but I just wanna use this to set the stage to talk about three things today. So the first one is that I think words really matter when we talk about so-called “AI” technologies because there's a lot of ambiguity in the language. And that can become really problematic because it allows so-called “AI” to become a smokescreen that distracts us from what I think a lot of the underlying issues are. That's more important to tackle. And lastly, I will try to bring all of this back to scientific research and think about what this means for scientific research and maybe what it doesn't mean.

Okay, so what do I mean by words matter? Well, I think it's very important for us to realize that so-called “AI”, as we colloquially use it today, is very much just a marketing term and not a technical term of art!

To illustrate this point, I really like this paper. It's called “A style-based 3D-aware generator for high-resolution image synthesis.” And you can see that you can use this tool to generate very realistic-looking photos of people. And I use this example because I searched through the whole paper, including the title, and other than one of the affiliations of the first author, there's no mention of “artificial intelligence” in this paper at all.

And if you look at the publication date, it's 2022, just before all of the hype around “AI” started. And I think if this paper is published just a year later, the text is going to be filled with references to “artificial intelligence”. And I think this is really important because it comes back to the point that a lot of the terminology we're using today around these technologies are marketing terms, like hallucinations or reasoning skills or training these models.

First of all, it really anthropomorphizes this technology, and it gives us a sense kind of like how humans have a tendency to recognize faces in things. And I feel using this terminology misleads us into recognizing intelligence in these tools as well. And I think that can be really problematic.

Another way to think about it is that when we are using our word processors to type up our papers, there's spellcheck, right? And spellcheck is basically a statistical model that takes an input and infers, in this case, the possible correct spelling for the word you're trying to spell. And this is not to minimize the amazing amount of work that's gone into these artificial intelligence technologies, but roughly speaking, large language models are also a very, very sophisticated form of statistical modeling that takes text as input and infers a natural-looking output.

And I think Emily Bender describes it really well when she calls these models “stochastic parrots”, because parrots, they might repeat words back to you, but they are literally incapable of understanding what it's saying. And this also applies to all of these “artificial intelligence” technologies.

And I think this ambiguous language is the feature, not the bug, because it's not just a matter of linguistics or semantics or nitpicking, but we know from history that ambiguous language shifts power to people who hold control over those tools and technologies. And I feel that the powerful people behind so-called “AI” is using this ambiguous language as a smokescreen to distract us from the very real problems underneath it.

So just, I think it was last year where there was this union that was formed in Kenya, because there were so many sweatshop workers in Kenya that were hired by the company behind ChatGPT and also Facebook and other companies to, well, as you can see here, to make the models less toxic.

So what they do is that you're constantly looking at outputs for the most egregious stuff, such as descriptions of sexual abuse, murder, suicide, and other really graphic details. And they're basically tweaking the model inputs whenever something really graphic comes out [so that] the statistical inferences from these large language models are slightly less offensive.

And they're so traumatized by this and doing this kind of sweatshop work all day, every day, trying to keep ChatGPT working that they were able to actually form a union. And I think this is important because that chemistry example I gave you earlier was one of the “AI” assisting humans, right? But actually, a lot of the exploitation comes in when you have a human-assisted “AI”, such as these sweatshop workers.

Another one is, of course, Amazon Fresh. I took this picture of the Amazon Fresh store. This one is just south of Aldgate East Station in London. And I know some of you know this... So the selling point for Amazon Fresh is that you walk in, pick up whatever you wanna buy, and you just walk out. And they use really advanced “artificial intelligence” to all of the cameras in the shop will figure out what you bought and automatically charge your Amazon account.

But it also came out in the news this year [2024] that all of the so-called “artificial intelligence” was actually Amazon hiring sweatshop workers in India whose sole job is to watch all of those cameras and manually tag what people are buying in these shops when everyone is thinking that's actually the “artificial intelligence” technology doing all of those things.

And actually, Amazon shut down the whole thing soon afterwards, and they're actually shifting Amazon Fresh to one where, rather than having all of those cameras watch you, whenever you grab an item, you have to manually scan it into your cart before you take it out.

And the other example that I think is very, very telling is this piece of news that was in the BBC earlier this year [2024] about this new driverless bus route that was started in Seoul in South Korea. So what happens is that this bus is supposed to be completely driverless, right? And you can see a picture of this guy sitting on the [driver's seat].

So I like this picture, by the way, of how this person actually also has his feet up to indicate that he doesn't even have his feet on the pedal. And I wanna use this example to say that all of what I've been showing to you so far are cases of human-assisted “AI”.

And what this driver has to do, you might be asking, “Okay, if this bus is completely driverless, why do you still need someone to sit there?” So what happens is that this driver will sit in the driver's seat. They don't usually have to do anything, like 99% of the time they can just sit and watch the bus drive itself, but this bus driver has to be super vigilant the whole time. Just in case, you know, in that 1% of the situations where the driverless bus makes a mistake, this driver has to immediately react and come in and actually make an adjustment to whatever the bus is doing.

So this driver is actually more vigilant than they usually have to be if they were just driving a regular bus. And this is what we're also seeing, of course, of the Amazon delivery drivers who are [used by] the so-called “artificial intelligence” system. You know, it's constantly watching the drivers on these trucks as they make their deliveries.

And they're under so much pressure because on one hand, Amazon is constantly pressuring them into making their delivery quotas. On the other hand, this “artificial intelligence” disciplinary system is constantly watching their behavior, such as watching their eyeballs [to track] where they're looking. There's also some evidence that the camera is watching their lips because apparently some drivers, they would whistle or sing a tune as they're driving, and apparently that's a bad thing and you'll get marks taken off and you might not get your bonus at the end of the week. So they're constantly being disciplined like this.

Or they have to deal with these inhuman competing demands. And in these examples, it's like, you know, us humans, we're basically mindless bodies where the “AI” acts as the head to discipline us and make us do exactly what it wants us to do.

And it comes back to my point where if we think of it as an “artificial intelligence”, then we attribute agency to this technology. And that distracts us from the Jeff Bezos-es behind the technology who's actually using them to exert that power over us. And I think that's really dangerous, right?

And I think Kate Crawford describes it really well, where so-called “artificial intelligence” is neither artificial nor intelligent. And the use of this technology in the ways that I just described, you know, it's not really replacing labor. It is displacing labor and making it even more invisible to us.

And this is why I think words matter because they have so much epistemic power over how we think about things. And often the use of language in “artificial intelligence” distract us from all of these underlying problems. Because, you know, if the “AI” on that driverless bus, you know, let's say hallucinates and makes a mistake, who are you gonna blame? We might blame, you know, that driver who wasn't vigilant enough to catch that 1% chance of the bus making a mistake, but is that really the issue here?

And that's where I'd like to try to bring this back to scientific research. So what does what we do as academic scientists have anything to do with this, right? Well, first of all, I'm kind of concerned about how even in academic scientific research, there is already sometimes a tendency to exploit.

So this is a paper that I actually cited in my previous research where it talks about, crowdsourcing the work that we do in science, whether it's data collection or data processing to online volunteers. And I want to first say that sometimes this can be done really well. For instance, a lot of this is integrated into science outreach and science education and science engagement, where as part of your engagement activity, the participant, they get to do part of the science and help you analyze data. And they can be mutually beneficial, but in papers like this, you often see language like, crowdsourcing, right? Which allows all of these free labor that you hired to shorten the time to perform the work for you, or it lowers the cost of labor for the academic who's running the project.

And I think there's a little bit of a danger here where we are perpetuating some of the exploitation, especially now where I am actually asked to review papers over time about this kind of crowdsourcing work and the way they talk about the participants make me concerned about where this is going in terms of various technologies where we might accidentally perpetuate this smokescreen that I keep talking about.

The second thing is that because the language around “AI” is so misleading we get papers like this who are, of course, it's basically saying it's so costly and labor intensive to recruit participants in your project. So why don't we replace them with large language models who will never get tired of our interview questions? We don't need to give them any compensation and we can get as many participants as we want in our study because, you know, they're as good as the real thing anyway, right? So I think that's pretty problematic.

Another one is talking about human assisted peer review in “AI” where they actually want to use these models to do peer reviews. And of course, proposing this particular editorial in this Nature Journal is that they're claiming: “oh, it's gonna save so much work for the actual peer reviewer because the 'AI' is gonna do all of it” and then the human, they just need to come in at the end and briefly check that peer review to see if it's okay.

But this sounds so much like that bus driver to me, and I feel we're seeing a lot of really high profile papers like this. There's one that I didn't get to stick into the slide in time, which is literally proposing, using “AI” to completely take over the scientific discovery process, where you're gonna use the large language model for question generation to design and conduct the experiment, analyze the result, write a paper, and then get another large language model to come into peer review that paper.

And at the end of the abstract, so I really wish I should put the abstract here, but at the end of the abstract it says, this saves so much money: “We calculated on average that if you outsource this entire thing to our 'AI' tool, it will be able to produce all of that scientific research for you at a cost of $15 per paper.” And I think that says a lot about how there's so much misunderstanding and hype around these technologies that high profile papers like this are starting to appear.

And I think Lisa Messeri described it really well where if we develop this kind of reliance and we think that “AI” technology is actually, sentient and intelligent, then by doing science this way, it will give us illusions of understanding. And this is a fantastic paper I suggest you check out.

Okay, now as someone who has been an open research advocate for a long time, another thing that's talked about, in “AI” circles right now is that we should really make a lot of these “AI” tools open source. And I think there are good reasons for that. But in the context of open research, there's a lot of messiness there as well.

So you might have heard of LLAMA 2, one of the large language models released by Meta last year. Then they called it an “open source” large language model. But if you actually click on download the model, it actually comes with a ton of restrictions on what you can do with it and a lot of limitations. And a lot of parts of it are completely opaque and you're not allowed to see what the model is doing. So it certainly doesn't meet the industry definition of open source as it has been established for software.

Now, the Open Source Initiative has been working on this issue for a long time. And actually just a few weeks ago, they released the first version of an open source “AI” definition. And I think it's really important for academic researchers to be part of this process as well.

But in any case, what happens in practice is that there was another study published earlier this year where they looked at dozens of the popularly used, large language models these days and scored them using 14 different criteria on their openness. And the overwhelming majority of them comes not only with a ton of restrictions, but also a lot of black boxes where you're not really allowed to know what's actually happening inside these models.

So you can see that ChatGPT is right there on the very bottom as one of the most black box large language models that there is that we're using. And I think there's a real danger here for... with all of this hype around so-called “artificial intelligence” and all the talk about completely integrating that into the science that we do. We're building all of the science on top of this “AI” technology.

I think what's gonna happen is that we won't end up doing science anymore. We will be doing alchemy! Because it's built on top of this completely opaque system. And I think that's a fundamental danger to the future of doing science.

And I want to quickly bring us back to this very well-endowed lab rat that I mentioned at the beginning, because I know that in response to papers like this, some people are saying, okay, so of course, you know, we should certainly ban the use of “AI” technologies in the creation of papers. So maybe we should just completely cut “AI” out of the paper writing process, right?

And I think that's understandable to a large degree, but I think there's a concern about if and what kind of problems are we actually solving if we focus on dealing with the “AI” part of it. Because I'm concerned that fixing “AI” might actually entrench deeper problems.

In this case, the broken peer review system, the publish-or-perish culture, right? Where these publishing monopolies... because I wouldn't be surprised, given what we've seen in higher education in terms of finding fake essays written by students. I wouldn't be surprised if one of those big publishers, they release some proprietary “AI” tool saying, “hey, if you publish a journal with us, then we'll let you use our proprietary 'AI' tool to detect fake paper submissions.”

That might seem to superficially solve the problem, but I think the deeper risk of thinking about “AI” is that in this example, we will become even more reliant on these huge publishers and cede even more power to them, right? And I think that's what I'm really concerned about because, solutions like this, don't really get at the actual problems leading to why people want, well, not necessarily want, but feel pressured into publishing those fake papers.

So I think a core message that I've got from these examples is that “AI” highlights existing problems that we have. And it's important for us to be aware of deeper problems in our research culture. And it could be really long standing issues like job security or the toxic workloads that we have to put up with, right? And think about all of those lecturers who have to live in tents because they can't afford anything more than that.

And it's important to realize that “AI” didn't create these problems just as “AI” didn't create the sweatshops that I mentioned earlier.

So to wrap things up, I think the main messages I want to send today is that words really matter when we talk about these technologies. And we should be very sensitive in understanding what those words really mean. And instead of thinking about “AI”, we should think about these deeper underlying issues that have plagued us for so long because, you know, very often “AI” is NOT the problem. It highlights existing problems and we should reflect on and focus on those underlying issues.

If we only focus on “AI”, it risks making those problems even worse. Okay, so that's the bulk of my talk, but if I've piqued your interest a little bit, I will leave you with some further reading, one of which is this one about generative “AI” and the automating of academia. The lead author is Richard Watermeyer based right here in Bristol. It's a fantastic read.

But if you're tired of reading yet another paper, I mentioned Emily Bender earlier. So Emily Bender and Alex Hanna host an incredible podcast called Mystery AI Hype Theater 3000, where every week they look at one of these so-called “AI” papers like the ones that I just showed you and tear it apart. And it's both very depressing and very entertaining at the same time.

Or if you'd like to read, these two Princeton professors, they wrote a book called “AI Snake Oil,” again, along the veins of what I'm talking about today. And I think it's really informative in terms of how we think about how we want to adapt our research culture in light of this new technology.

So that's some additional material that I think is useful. And in the interest of doing open research, I've published these slides, the transcript, additional notes, and all of the references to Zenodo. So you can look at that and remix and use it if you want.

And I also want to just give a shout out to Jennifer Ding from the Turing Institute and Shern Tee, and everyone from the Turing Way community who's helped me develop this talk.

So that's what I have for you today. And thank you for coming.

#talks #AI

Unless otherwise stated, all original content in this post is shared under the Creative Commons Attribution-ShareAlike 4.0 International license

Talk - Open source hardware for more equitable open science

Mon, 02 Dec 2024 16:36:30 +0000

Since 2023, I've given several variations of my talk about open source hardware as a key component of open science. Here, I will share extra notes on what didn't fit in the talk, a transcript, further reading/resources, and a recording of the talk. This note is structured as follows, please scroll down to the section you're looking for.

Recording
Transcript
Further reading/resources

Recording

I've given several variations of this talk with multiple recordings. For now, here is the recording of an early iteration I gave at the Edinburgh Open Research Conference in mid-2023 (click on the “Presentation Video” link on the page):

https://doi.org/10.2218/eor.2023.8112

I will try to put other recordings here on a best effort basis.

Transcript

I will put a transcript of the talk here as soon as I can.

Talk - AI is not the problem - thinking about outcomes (updated)

Thu, 25 Apr 2024 10:09:26 +0000

On 25 April 2024, I gave a talk at the Open Science & Societal Impact conference titled “AI is not the problem – thinking about outcomes”. It was co-created with Jennifer Ding of the Turing Way who is the real AI expert here, and wrote a great post about an outcomes-based approach to AI. There's extra stuff I couldn't fit into the talk, so I'm putting them here plus a transcript and video recording of the talk. Note that I have a follow up talk focused on labour and academia in November 2024.

The slides are published on Zenodo with DOI: 10.5281/zenodo.11051128

I also tweaked this talk linking it to reproducibility in science at the Reproducibility by Design symposium on 26 June 2024 (at the life sciences department at the University of Bristol), kindly organised by Nick and Fiona.

I will try to gather:

general notes;
other resources/further reading collected when developing the talk; and
a transcript of the talk (with reproducibility addendum).

I'll try to clean up this post with more context and details on a best-effort basis.

There is a video recording (of the April 2024 version) which is saved in a Zenodo item and viewable on the Internet Archive. The video is also embedded here (click the “CC” icon for subtitles):

Transcript

Thank you for the introduction. For this talk, I’m going to stay on a high level, and offer my reflections on how to situate “AI” in open science as it relates to wider society. There is a lot of understandable concern about how this technology will affect scientific practice.

And we've seen some pretty egregious examples in academic science. Last month this engineering paper published by Elsevier made the rounds because as soon as you start reading the introduction, you’ll see that it starts with “Certainly, here is a possible introduction for your topic…” This is very likely a sentence generated by ChatGPT, a chatbot based on large language models, and brings into doubt the rigour of the rest of the paper.

I think the most dramatic example is one published by Frontiers in February 2024, where it’s pretty obvious that much of the contents are AI-generated, with a dramatic figure of a lab rat with giant gonads. You can also see some gibberish text in the annotations.

What’s remarkable is that these papers were seen by peer reviewers, editors, and copyeditors and were still published.

On the other side of this is that there is growing evidence of academics using tools like ChatGPT to write their peer reviews.

And in higher education, we know that some students would use generative AI to write their essays. But now some instructors are using the same tools to grade those essays.

With that in mind, there are three things I’d like to cover today.

The first is that words matter. A lot. With all of the hype around “AI” right now, it’s important to realise that this is a big umbrella marketing term (instead of a technical term of art) for a bunch of different technologies.

And I really appreciate how Kate Crawford reminds us that these technologies are neither artificial nor intelligent. What we call AI is built on human labour, and it is certainly not intelligent in the way humans are.

In the context of open science, there are calls for open source AI that is transparent, reproducible, and reusuable by others. I agree with this, but what counts as open source or open AI is also not clearly defined.

Last year Meta released a large language model called Llama 2 and marketed it as open source. However, the license for Llama 2 actually came with many restrictions on who can use it and how they can use it. We can agree or disagree with these restrictions, but these restrictions mean that Llama 2 is categorically not open source as it has been widely defined for software.

There’s this paper by Widder, Whittaker, and West in 2023 about how ambiguity in words like AI and open source AI has created an opening for the big players to openwash their products. What happens here is that the word “open” becomes a very fuzzy term that feels good, while meaning very little at the same time. And this furthers the power that these big players hold over technology and society.

All of this is to say that what people call open source AI is often neither open, artificial, nor intelligent! For the purposes of today’s meeting, I think this is a major problem because when a term is taken to mean everything, it ends up meaning nothing.

And the societal impact of this ambiguity is that the wider public will trust science even less than they already do.

What this means in practice is that we should be clear about what we mean when talking about AI. If there’s a specific underlying concept like machine learning, training large language models, and so on, then let us use more specific terms.

There are also cross-cutting work to collaboratively define terms like open source AI, and I believe the scientific research community should absolutely be part of this conversation. The Open Source Initiative is one of the leaders on this and I encourage everyone to check it out.

Having said that. Even though having clearly defined terminology can help us conceptualise and communicate issues around artificial intelligence it is a necessary but insufficient step for addressing those issues. Because being effective communication doesn’t solve problems by itself.

Yes, words matter, and outcomes also matter. And once again, there is a lot of work in this space on topics like reproducibility which is important in scientific research, to others like democracy, trustworthiness, inclusion, accountability, to safety.

I really like the work by the Mozilla Foundation, such as their thinking about trustworthy AI and the need for openness, competition, and accountability. There are so outcomes for us to consider, and to make things more concrete, I want to focus on a real world example which challenges us to think more deeply about what outcomes we want to see.

To make this point we should realise that what’s often called “artificial intelligence” is foundationally similar to autocorrect/spell check. In this case, your typing input is fed into a statistical model that suggests the correct spelling for a word. Now, I know this is simplifying things a bit, and not to minimise the amazing math and computer science research that went into it, but the large language models underlying much of generative AI today is – on a high level – an autocorrect that runs some very very sophisticated statistics on your input to produce natural feeling outputs. It’s important to know this because enormous amounts of human labour goes into labelling the huge datasets used to train these models.

Around this time last year (2023), workers for the companies behind ChatGPT, TikTok, and Facebook formed a union in response to the horrible working conditions they had to put up with.

What’s behind the “artificial intelligence” façade is that many of them are sweatshop workers who manually label training data.

For ChatGPT, these sweatshop workers where hired to tag and filter text that describes extremely graphic details like sexual abuse, murder, suicide, or torture.

This reminds us of how “artificial intelligence” is neither artificial nor intelligent, and it has become a smokescreen for deeper issues like how labour is not being replaced by machines when in fact it is being displaced and made even more invisible.

So, when we think about what outcomes we want to see, we must consider underlying problems like outsourcing, labour rights, or colonialism.

But what does this have to do with scientific research?

Well, there are similar things happening, where what some people call “crowd science” is used as a research methodology, where academic scientists crowdsource data collection and data labelling to online volunteers.

To be clear, there are positive things that can come from this, for example some scientists build crowdsourcing into science outreach and engagement activities, and there are ways to integrate crowd science into science education.

However, I’ve reviewed many scientific papers about this over the years, and some are really focused on how crowdsourcing is a way to shorten the time needed to process data, and to lower costs for the scientist.

Right now, a lot of this is being used to train machine learning models and other AI applications. And I feel there is a risk that parts of the scientific community is inadvertently perpetuate not just the hype around AI, but also the exploitation of people.

I give these examples because I think that we, as members of the scientific community, should go outside of the ivory tower and engage with wider efforts to think about what outcomes we’d like to see in a world with AI. For instance, what can we learn from labour movements to inform more equitable practices when doing crowd science?

This is just one possibility for thinking about outcomes for science.

And the third thing I want to cover is what AI means for open science. To do this I want to take us back to this extraordinary generated figure of a lab rat. One response that we might have to AI-generated papers or peer reviews is to ban the use of AI tools for scientific papers. Some publishers and journals have already implemented these policies. But I’m concerned about if and what problems we actually solve if we focus on dealing with AI.

I fear that we might inadvertently think that we’ve “solved” the problem, when we are entrenching a much deeper problem.

For example, I wouldn’t be surprised if one of the big academic publishers would release a new proprietary tool for detecting AI generated text in submitted papers and reviews, and tie this feature into journals that they publish. On one hand, maybe the tool is really effective and would weed out these junk papers.

But “solutions” like this might concentrate even more power into these huge publishers, who are a big part of why peer review is so broken in the first place. And in this case, I think fixing peer review is more important than dealing with AI.

I think the broader lesson is that we should support existing open science efforts. For example, there are many tools to help fix peer review, such as preregistration, publishing Registered Reports, publishing preprints followed by open post-publication peer review. Groups like PREreview or journals like the Journal of Open Source Software have been doing this work for years.

We also have to tackle even deeper problems like job precarity in academic research, where some researchers move from one short term job to another, or professors who live in tents. And many of us have to deal with toxic workloads where we are expected to do even more for less pay.

And what’s most important to realise is that AI didn’t create these problems, just like how AI didn’t create sweatshops.

So what I want to suggest is that AI is not the problem. At least it often isn’t.

Instead, AI reminds us of existing systemic problems. And if we only focus on AI, then we risk making those problems much worse.

So, these are the three suggestions I want to make today:

Words matter, and we should work to clearly define key terms such as AI or open source AI. This is not only to make communication easier, but also to increase societal trust of scientific institutions. But this alone is not enough.
Because we should also reflect on what outcomes we want to see for underlying issues.
With the understanding that AI is very often not the cause of these problems, and if we focus too much on AI we risk making things worse.

I hope there was something useful in this talk and that it can provoke more conversations.

And if you’re interested in continuing the conversation, I want to point to the Turing Way community.

The Turing Way started as an online guide on open science practices, but over the past five years has turned into a global community of concerned researchers who reflect on some of the issues I talked about today.

For example, last year my co-author Jennifer Ding led a Turing Way Fireside Chat about open source AI, and the labour issues behind it.

I invited you to visit the Turing Way to talk about AI or other open science and open research topics.

With that, thank you very much for coming to my little show and tell today.

addendum on reproducibility

Here are the additional points I made about reproducibility at the Bristol life science Reproducibility by Design symposium on 26 June 2024:

There are possible good uses of so-called “AI” to help with reproducibility (not everything is doom and gloom!).

For example, my colleague Shern Tee pointed me to the “Speech Schema Filling” tool made by Näsström, Götte, and Schumann (2024). This tool was developed by and for chemists to help them better document their experiments.

It uses speech recognition and a large language model running locally on your computer, so that you talk through each step in your experiment as you are doing it, and this tool records everything into an electronic lab notebook.

The remarkable thing is that this language model actually parses what you are saying and records the details of your experiment into a standardized structured data format (for chemistry) that can go with your lab notebook (see this example).

I think this is super cool because as long as you’re willing to talk into a microphone as you work, this tool makes documentation so much easier, and helps with data quality and reproducibility.

That said, considering that so-called “AI” and “open source AI” are neither open, artificial, nor intelligent, there is a recent conference paper (just published June 2024) where they sampled 40 of the commonly used large language models for generative AI.

They evaluated the “openness” of these models with 14 measures of availability of underlying materials, documentation, and access (see Figure 2 in: https://doi.org/10.1145/3630106.3659005). The overwhelming majority of them are highly closed source, so you have no idea what's happening under the hood. Notably Meta's Llama 2 which was marketed as “open source” is 6 from the bottom, and OpenAI's ChatGPT comes in last place.

I think this is bad for reproducibility, especially if we integrate them into the scientific process. And unfortunately we are starting to see this happen.

For example, I've seen real papers in real, highly prestigious journals proposing things such as (paraphrased):

Recruiting human participants is hard. Let's replace (some of) them with chat bots who will never get tired of our interview questions.
Let's use “AI” to design and run scientific experiments...
...or to make inferences, predictions, or even decisions.

In my view, if we build our science on top of the really opaque “AI” which most of the popularly used ones are, then we are not doing science. We'd be doing alchemy. (not to mention we would become even more beholden to Big Tech who holds power over that technology)

And this alchemy would give us “illusions of understanding” as wonderfully described by Messeri & Crockett (2024) (https://doi.org/10.1038/s41586-024-07146-0). I believe this is a great risk to science.

This talk is open source and I published it on Zenodo.org with this DOI (10.5281/zenodo.11051128) along with a transcript, and I encourage you to check it out, fork it, turn it into what you like, and visit the Turing Way community where we can continue these conversations.

#talks #AI

Unless otherwise stated, all original content in this post is shared under the Creative Commons Attribution-ShareAlike 4.0 International license

Talk - The critical role of open source in open research

Mon, 08 Apr 2024 17:37:42 +0000

On 20 March 2024, I gave a talk at the Open Source for Innovation in Universities event titled “The critical role of open source in open research” (open source slides published to Zenodo). Like last time, it was informed by incredible feedback I received from various open research communities, especially Malvika of the Turing Way who first connected me to the organisers. There's extra stuff I couldn't fit into the talk, so I'm putting them here.

I'm posting:

a few general notes;
other resources/further reading suggested by Turing Way members; and
a transcript of my talk.

I'll try to clean up this post with more context and details on a best-effort basis.

There is a video recording which is saved in the Zenodo item, viewable on YouTube, and embedded here:

General notes

In-person verbal feedback was positive, though I didn't get to use as much time preparing it as I wanted. I was also running out of time near the end, and wish I could have talked about the Turing Way more!

This time, I also opened a Turing Way GitHub issue #3570, to track the development of this talk.

As expected, I wasn't able to fit everything in, but also thank you to Sarah Gibson, Julien Colomb, Esther Plomp for your feedback earlier to help me prepare! I'm also grateful to the organisers Michael Meagher and Clare Dillon who gathered a great group of warm and interesting people for this event. :) Special thanks for Malvika Sharan for the several meetings we had to structure this talk.

a note about creating a transcript

For my FOSDEM lightning talk, I typed what I wanted to say directly into the presenter notes in my slides before the talk. However, this time I just didn't have time to do that.

So, I tried using my phone to make a live audio recording as I gave the presentation. Then, I used the open source Whisper.cpp automatic speech recognition tool with its open-ish ggml-small.en model to generate a transcript.

Then, I copied that transcript into the presenter notes of the final slides published to Zenodo.

In the end, I think this method works, but is still time-consuming. The generated transcript is a huge text file that I had to manually split into paragraphs, and copy and paste individual chunks of text into their corresponding presenter notes. This is also what's below in the “Transcript” section.

Will I continue to use Whisper.cpp in the future? Yes, I think its text transcription is remarkably accurate and is getting better. Though there are still paper cuts in the user experience that adds some work for me.

Other resources/examples

Thanks to Sarah Gibson and Julien Colomb for the suggested examples:

The Gorgas tracker as mentioned in this post and described in Arancio (2023).
CERN's White Rabbit project. Also see this interview about it.
The Python and R ecosystems vs MATLAB and SPSS in days past.
JupyterHub, specifically the QGreenland project (WARNING: Medium link). I really like this one because it's not just one piece of open source hardware, but an entire stack that could only work well when all components are open source and remixable.

Transcript

Note: This transcript is lightly edited for clarity, such as by removing the “uh”s and “you know”s, or “ah”s.

Thank you so much for that introduction, Clare. I'm really excited to be here with you today. It's really quite a privilege to be speaking to you. And as Clare mentioned, I am a member of the Turing Way community, which I will come back to near the end of the talk. But today, I'd like to share some of my own reflections being not only an advocate for open research in the academic community over the past several years, but also as a member of the open source community. I very much think of my talk as a kind of “yes, and...” kind of presentation. And it's also intentionally provocative with the intention of stimulating, new thinking around what kind of opportunities can we consider when it comes to open source technologies and open research.

I want to start very briefly by focusing on the term open research and make kind of a subtle point here. So I consider open research to cover a very wide and diverse array of different research disciplines. And a lot of the examples I'd like to share today come from my experience advocating for open science, which I consider to be a very important component of open research, but it's not all of open research. So there's a subtle difference between the terms and I'd just like to delineate the two, even though most of what I'm talking about today comes from the open science world.

With that said, the structure of my talk today, I'd like to start with my reflections on some of the core values of open science, why open science is so important, including in academic research. Very briefly on a lot of the invisible infrastructure of technology that underlies the scientific research that we do, followed by I think the biggest part of my talk today, which are the additional motivations for open source technologies to enable open science. And I'd like to bring up the hardware component as well because we've heard a lot about software. And finally, I will talk about some of the communities that have been so lucky to be a part of over the years that discusses a lot of the things in my talk today.

So, open science. I've talked about open science to so many people over the years, and what I have learned is that...

...if you ask 10 people what open science means, they will tell you, yes, I know what it is, but they will give you 10 different answers. So I'd just like to set the scene a little bit for my talk today to establish a common understanding just to help with the conversation.

And one of the initiatives that I've been really privileged to be a part of is the drafting of the UNESCO Recommendation on Open Science that was ratified in 2021. I had a very small role to play in this, but it was a huge privilege to be part of the process and it produced an amazing document.

It's really long, but I recommend you check it out. And part of it defines open science to mean a set of practices for reproducibility, transparency, sharing, and collaboration from the increased opening of scientific contents towards and processes. Again, I think this is an amazing document, but this definition is also quite a mouthful, right? So I tried to reflect on: is there kind of like an essence to this definition?

And what I came to is actually the difference between science and alchemy. So what do I mean by this?

I was inspired to think about this by a very provocative digital rights author called Cory Doctorow. He writes a lot about these kind of fundamental values underlying open research and open science and open source. And he said, if we think about how alchemists used to work superficially, they were running experiments, they had some research questions, they took lots of notes, and they were actually learning along the way. But the thing with alchemists is that they kept what they knew a secret from each other for 500 years. Because of that secrecy, they didn't advance the state of the art very much. And because of that, every single one of them had to learn in the hardest possible way that drinking mercury is a bad idea.

I think this really hits at the core of the difference between science and alchemy because science is a fundamentally iterative process where we are always building on knowledge shared by other people and what came before. So in a way, for us to be responsible scientists, we have to continue to share what we have learned with other people to build upon our successes and failures. So I think to do good science is to do open science, and I think that's what open science is really about.

Another way to think about this is what I think of as intellectual humility because I've been an academic researcher for like 15 years now. And reflecting on these years of research, I realized that whatever little bit I've added to our collective body of knowledge, I was able to do that because of everything that I've learned from the people who came before me. So a researchers, we really didn't get here on your own. It's really built on top of what everyone else has shared with you.

And it is with all of this in mind that I think open science really comes with four fundamental freedoms, where for any piece of knowledge, it should come with the freedoms for anyone to use it, study it, modify it, and continue to share it with other people to continue that iterative cycle. So this is how I like to think of open science. And that's the first thing I wanted to cover today.

The next thing I wanted to quickly establish is that for this science to happen, we're making use of so much shared technical infrastructure today. I remember many years ago I was at this hackathon with Arfon Smith from GitHub. And he was the person who gave me that lifetime Pro subscription to a GitHub that I'm still getting dividends from to this day. It is a platform that comes with amazing features.

But at the same time, I also remember how a couple of years ago there was this big GitHub outage for a couple of hours. And it is when things like this happen that we realize how reliant we have become on the software and hardware infrastructure in our lives. Because when they break, when we hurt, that's when we realize our reliance on these things.

And it's important to think about this because it reminds us to reflect on who gets to have a say in how this infrastructure works and how that infrastructure can work for us as researchers, and how we live out our lives. So this invisible infrastructure is really important. And this kind of centralization that's happening, I think, is a challenge that open source technologies can tackle.

So I've been thinking about a lot of the motivations for open source, including a lot of the reasons that people have talked about today. And I'd like to go over some examples. I want to talk about hardware, but will start with a software example that I think is amazing, which is...

...the QGreenland project. So I thought this project was so cool because it started out as a bunch of academic scientists who share a common theme, which is that they all study Greenland. It could be meteorologists, geologists, and a lot of other scientists. And they developed this common software platform for analyzing geospatial data about Greenland.

And they built it on top of an open source software called QGIS. It is a geographical information system, so that they can pull all of the geospatial data about Greenland into one place. They have a whole suite of tools built on top of QGIS to analyze that data. And the whole stack is called QGreenland. And what happened was that this project became successful. And last year in 2023, they wanted to run a training workshop for other researchers to learn about how to use QGreenlandfor their scientific research.

But one problem they encountered was that if they have 20 scientists in the room coming to this workshop, all with their laptops and their different operating systems and configurations,it takes so much time to just get people to the same page to install QGIS,get it running, and then put QGreenland on top of it. That takes so much time from the actual training they wanted to do.

So they thought, okay, can we reduce this friction a little bit?

And the solution they came up with was that they started with JupyterHub, which is kind of like a server-hosted version of the Python-based kind of Jupyter computational notebook that a lot of data scientists use.

But they were able to make some additions to Jupyter and tweak it so that instead of just running Python, they're running an entire Linux desktop environment on top of JupyterHub.

And with that, they can then install QGIS into that Linux environment, and then they put the whole QGreenland geospatial data platform on top of that.

And once they put all of this together into one package, they serve it from their server so that the participants in the workshop, they can just open up their web browsers, go to a particular URL, and the whole package runs as a web page inside their browser. And this saves so much time in the workshop because they don't need to set QGIS up on every individual computer.

Now, the reason I love this example is that all of these components, they are open source to begin with, and they demonstrate the FAIR principles of open science. Now, I think a lot of you know what FAIR stands for, but just so we're on the same page, FAIR stands for...

...Findable, Accessible, Interoperable, and Reusable. And this is a big thing in open science, and I think QGreenland demonstrates all of it. Because of this open source publishing online, it's easy for people to find it. The way they set it up is really accessible. It's interoperable because the components are open source, and they were able to tweak the components to interact with each other. And, of course, it's reusable because other scientists can adapt it to different research contexts. And I think this is a demonstration of how the FAIR principles that are so important to open science are enabled by open source technologies.

Okay, so this is a software example, but if you look at the UNESCO recommendation on open science, it talks about several main pillars of open science, including the usual suspects like open access publications, open data, open educational resources (I think this one is really important!), and of course, open source software code.

In addition to that, the recommendation emphasizes that hardware is a really important part of open science as well. So I like to focus a bit on the open source hardware side of things.

And if you really think about it, hardware underpins so much of scientific research. It was literally hardware that took people to the moon. That's how much we rely on hardware to do science.

It can be huge pieces of equipment like the Large Hadron Collider,

Or it can be something seemingly simple, but equally integral to the research infrastructure, like microscopes that we use in so many labs today.

Now, the thing with hardware is that it's very often closed source, like a lot of software.

And some of the challenges with that is that it's not reproducible in a scientific way. There's vendor lock-in, which was mentioned before. There's forced obsolescence, and there are very high costs. The cost is not only in terms of a very expensive piece of equipment. It's also the very high switching costs, where if you decide there's another equipment you want to use, but since it's not open source and there's no interoperability, it's very difficult for you to switch to a different platform.

And this causes a lot of global inequalities in research. I personally know some scientists in some global south countries who really want to have a particular piece of instrument in their lab, but the one manufacturer that makes it simply do not sell it in their country.

And even if they somehow get access to buy it, the cost is so high that they cannot afford it.

And if they somehow scrunch together the money to be able to afford to buy it, once they have it, they won't be able to get any support on it. They cannot maintain it themselves.

And it just becomes prohibitively difficult for a lot of researchers in different places around the world.

So I think when it comes to the social impact of our technologies, it's really important to be mindful of a lot of the global inequalities that come with the technologies of today.

So in contrast to that, open source hardware is defined as hardware whose design is available so that anyone, again, can study, modify, distribute, make, and sell hardware based on that design. And there are a lot of examples, actually, in scientific research.

An amazing one that I know about is the Open Source Imaging Initiative. So this is a consortium of universities across Europe, including some companies, I believe, who came together to create a completely open source MRI machine for medical scanning and diagnosis.

And if you know anything about MRI machines, you know how complicated and intricate they are. And they're actually creating an open source version of it that's becoming successful!

Open source hardware has been to space. Researchers in the U.S., they've developed the ORESAT, which is an open source CubeSat, that became a common platform for scientists across the U.S. to build on top of for remote sensing applications.

It's been launched several times already, and I think they have more launches scheduled.

But the example that I'd really love to talk about is the OpenFlexure microscope. So this is a lab-grade microscope, originally developed by researchers at the University of Bath in the U.K. (I think their team is based in Glasgow now). The point is it's completely open source and modular, and you can 3D print most of the microscope yourself.

It comes with a lot of features, starting with the basic ones like bright field imaging, or fluorescence imaging. But because it is fully open source, there was a separate research team in a different part of the world that looked at the designs, and they actually enhanced it and improved it to greatly increase the resolution for fluorescence imaging.

And this is something that people weren't able to do with the closed-sourced microscopes that they used before.

These are just a couple of features, but what's also really cool is that this open-source microscope, if you want to build it yourself, the cost of doing so is only about 200 US dollars.

Now, for those of you who have used and bought microscopes for use in the lab before, you will know that these microscopes often cost an order of magnitude more than OpenFlexure for doing the same thing, and I think that's absolutely remarkable.

And because of its low cost and because it's open source, again, as an example, researchers in several sub-Saharan countries, they were able to take the OpenFlexure design to locally produce and maintain that microscope from malaria diagnosis when they weren't able to do it before.

And in addition to this, it has actually prompted the formation of some small businesses in those countries to locally produce and sell these microscopes, and it's again becoming a new business model that's enabled by open source technology.

Okay, so to kind of build on some of the points made earlier, Joshua Pierce is a researcher in this, and he calculated that open source technologies, including hardware, can provide economic savings of up to 87% compared to functionally equivalent proprietary tools.

And again, my other point is that in addition to the savings, it creates new kinds of businesses.

So I have a bit of a background in molecular biology, and I've used PCR machines a lot. And there's a company who sells these Ninja PCR machines for US$500. Again, if you have bought this for labs before, you'll know that they typically cost an order of magnitude more. So it's amazing how open source not only lower costs, but creates new kinds of businesses as well.

Okay, so I talked about some of the benefits of open source technology just now. And to build on Clare's point earlier, I think we're faced with so many global challenges today, whether that's climate change or pandemics or other problems. And they're so big and urgent that I think open source technology is what enables the inclusive and rapid innovation needed to address these really urgent issues.

And to bring it back to my earlier point, I truly believe that we simply don't have time to be alchemists anymore. We cannot afford to be alchemists. And I think this is a huge motivator for why open source is so important and critical to open research.

Now, with all of that said, here actually comes what might be the most provocative part of my talk today. So, you know, again, we've seen so many motivations for open source, like the collaboration that happens, faster innovation that's so critical to solve problems of today, the lower costs and business opportunities, and so many other benefits, right?

But I feel they are just the tip of the iceberg in terms of why open source is so important. And there are some underlying values that I think really adds a lot to the value proposition of open source.

In my view, that could be things like the autonomy and agency that we can have over the technology that we use and the freedom to use it for our purposes. And I think these are the things that also underpin why open source is so important.

Dr. Julieta Arancio is a researcher of open source technologies, and I think she characterizes it really well, where technology really affects the way we think about research questions.

And when a piece of technology and the tool that we use is closed source, it means that rather than being enablers of our creativity, we end up doing what the available tech lets us do.

Because the people behind that technology gets to dictate what you can do with that technology. And what that means is, in this context, is that closed source technology also implies a certain kind of epistemic power behind it in terms of what knowledge we are allowed to have and what we can use that knowledge for.

And the risks with closed source technology and the challenges with it is that, depending on how you wield that epistemic power, unfortunately, sometimes it leads to a kind of intellectual poverty. Because only certain people get to have certain pieces of knowledge and not other people. Some people get to make use of that knowledge in certain ways, while other people don't get to do that.

So I think intellectual poverty is an unfortunate side effect that sometimes come from closed source technologies. And this is where the value proposition of open-source technology really comes in.

This is not only convenient and amazing in terms of the collaboration and innovation that happens, there is also an ethical underpinning to it that makes it even more attractive and adds to the value that we already have.

And this connects with open research, because open research is not only about publishing open outputs, whether that's open access papers, open data, and open source software or hardware designs, it is also about how we hold that epistemic power together in a more equitable way.

Such as those scientists I told you about in the Global South who couldn't do what they wanted to do in their research. And I think this is, you know, in addition to all of the benefits that we talked about, a very important value proposition.

If you think some of these conversations are interesting, I'd like to share with you some of the communities in which these conversations are happening.

I'd like to start on the hardware side of things. Over the past few years, I have been very lucky to be part of a group called the Gathering for Open Science Hardware, also known as GOSH.

And this is a network of researchers, hundreds of researchers from across the world, literally from every continent, except maybe Antarctica, who come together to think about the important role of open source hardware in scientific research.

And we have done a lot of interesting work, such as last year we created a policy toolkit for UNESCO on the role of open source hardware in scientific research, which was just published at the end of last year, that provides a lot of policy guidance on the national level for research and innovation policy.

I mentioned Julieta just now. She is the author of an amazing report called Supporting Open Science Hardware in Academia. This report is geared towards scientific research funders and technology transfer offices in universities to provide some guidance on how universities can enact policies to support people to work on open source technologies in research and also ways to spin off that development into successful business models around open source. So I think this is a remarkable report that I highly recommend you to check out.

So that's the hardware side of things, but if you go more broadly than that, I think this is where the Turing way comes in. So let me tell you a little bit about this community.

It started back in 2019 initially as a book, an online book, made with Jupyter, by the way,about data science and best practices around how to do data science in an open and reproducible way.

Now it started off as this book, but the founders of the Turing way, they thought: “We are not the only experts here, so can we invite other people to help us co-create this book together?”

And they started a distributed collaboration process that eventually turned into scientists and researchers from around the world contributing to this book, not only in terms of data science, but other aspects of open science and open source as well.

And it's grown into a huge book, hundreds of pages long, and because of how the book brought together scientists from different backgrounds, it's grown into a very vibrant community where a lot of conversations are happening around what open research means, what open science means, and what open source means for this work.

So we talk about things like what I presented in my talk today. There's also talk about diverse roles in research, such as the important role of research software engineers in scientific research that's not recognized enough, or things like localization.

So many things about open science and open source are in English right now, but can we translate that to different languages and what does that mean for people from different backgrounds and social backgrounds as well?

With the hope that eventually we can galvanize a cultural shift in terms of how we think about technology and how we think about research so that, again, we can hold this power together in a more equitable way and think of new opportunities for research and innovation.

So I think it's remarkable how over the past five years there has been more than 450 contributors to the Turing way, not just in terms of pull requests to the repository and adding to the book, but also all of the richness that's been brought into the conversations that's been held together by this community.

It's a really amazing place, and I highly recommend you check out the Turing Way if you're interested in having these discussions and connecting to other researchers in this space.

And I think this really shows how open research and open source can connect to each other in a mutually beneficial way.

Okay, so I'm just about running out of time, but before I end, of course I'd like to thank all of the people who have helped me so much to come here today to share some of these reflections with you.

First of all, of course, there's the Turing way community. Malvika Sharan is one of the co-founders of the Turing Way. We had a lot of interesting discussions on how to make this a more provocative talk.

I'd like to thank Bri, our community coordinator in GOSH, for contributing a lot of the thoughts on open source hardware, and of course to the organizers, including Michael and Clare, for having me here today to share some of these reflections to you, and for all of you for putting up with the last 27 minutes and 50 seconds of my talk.

The meta-commentary is that this talk is open source and it's published on Zenodo.org with this DOI, and I encourage you to check it out, fork it, turn it into what you like, and visit the Turing Way and GOSH communities where we can continue to have these conversations.

#talks #opensource #openresearch

Unless otherwise stated, all original content in this post is shared under the Creative Commons Attribution-ShareAlike 4.0 International license

Talk - Representing epistemic and disciplinary diversity in open research

Sat, 10 Feb 2024 09:11:24 +0000

On 10 February 2024, I gave a lightning talk at FOSDEM 2024's Open Research Online Devroom titled “Representing epistemological and disciplinary diversity in open research discourse” (slides and video shared here). I later gave a tweaked version of this talk to introduce a UK Reproducibility Network online workshop on this topic on 31 March 2025. It was informed by incredible feedback I received from various open research communities. There's so much good stuff I couldn't fit them into a 10-minute lightning talk, so I'm putting them here.

I'm posting:

a video recording of the latest version;
a transcript of the talk;
additional discussions that didn't fit; and
other resources/further reading that couldn't fit (scroll down to see!).

I'll try to clean up this post with more context and details on a best-effort basis.

Video

The version of this talk given on 31 March 2025 to introduce an online workshop for the UK Reproducibility Network train-the-trainer community. It was recorded which you can watch here:

Or download the slides and recording directly from Zenodo:

https://doi.org/10.5281/zenodo.10643245

Transcript

My very sciency background started in ecology and environmental research 15-ish years ago, including evaluating impacts from a major marine oil spill to wildlife. During those years, I heard about a fellow PhD student getting their thesis criticised by a committee member (i.e. examiner) because their work is not “reproducible” and not structured with explicit “hypotheses” and tests of those hypotheses.

I remember myself wondering why should scientific research be defined by reproducible experiments? The oil spill I studied is fundamentally not reproducible. And even if it were, it's probably not ethical to reproduce it! I also didn't conduct any experiments. Does that mean I am doing bad science?

In the years since, I've become an advocate for open science as a way to do good science, where I think what makes good science different from alchemy is best summed up by Cory Doctorow, who said: “Alchemists kept what they knew a secret for 500 years. They didn’t advance the art very much… and each of them learned in the hardest way possible that drinking mercury is a bad idea.”

I also learned that the Latin origins of the word science comes from Latin “scientia”, i.e. “knowledge”.

This prompted me to reflect more deeply on “open research” instead of just “open science”; and how open research could encompass diverse ways of learning and organising what we know.

For the purposes of my lightning talk, this is what I mean by epistemic diversity. My fear is that conversations about open research is not representative of epistemic diversity.

For example, we encourage people to share open data, but what does that mean to, say, a scholar of medieval literature?

There is much focus on “reproducibility”, but would a law professor care about this?

We have made progress on publishing open access papers, but does that mean anything to a musicologist?

I think it is possible for us to shoehorn these concepts into these disciplines (e.g. perhaps medieval books or sheet music are the “data”, and the law professor should document their reasoning and arguments in a “reproducible” way), but why should we? What if these researchers get to define research – and open research – in their terms instead? What might open research look like then?

Another way to look at this is to ask: who gets to decide how to conceptualise and talk about research? Who holds the epistemic power to define the terms of the conversation? In my view this is not an abstract problem.

Last year, I had a long conversation with a new history professor who has an interest in open research. And he told me about how groups about open research are often dominated by researchers from very STEM-focused subjects like the life sciences. And he doesn’t see any researcher who looks like him. He told me that there is resentment and a sense of exclusion among other historians.

In other words, despite good intentions, the epistemic power of the loudest voices in open research discourse might have accidentally caused epistemic injustice which systemically excluded some researchers.

People often ask “what can I do as an individual”?

A core part of open research is diversity and inclusion and over the years I've learned much (and have much more to learn!) on how words matter when it comes to dimensions like race, gender, or accessibility. I suggest applying that same sensitivity to epistemic diversity.

For example, the word “manuscript” could mean a paper submitted to a scientific journal, or a physical piece of ancient paper that a historian studies. I hear the terms “lab” and “PI” a lot in open research communities, but there are research disciplines whose social structures are very different and don't use these terms at all. Or, sometimes I hear talk of “STEM” vs “non-STEM” research, but that is itself a very STEM-centric view.

And, of course, the conflation of “science” and “research” as if they're the same thing. If we can be mindful of epistemic diversity when talking about open research, then maybe we can start to avoid excluding people from the conversation.

Finally, as someone from a very sciency background, I am in a privileged position. My imagination is constrained and it's not my place to authoritatively declare what we should do. That's why I'm hesitant to prescribe “say x instead of y”.

Instead, there may be value in elevating under-represented groups through gatherings like focus groups or workshops, where we can learn from epistemologically diverse researchers directly on how to make open research more inclusive.

Over the past weeks, I received amazing suggestions on where this topic could go, and my lightning talk is just the tip of that iceberg.

In the mean time, I’d like to give thanks to them, including: The Turing Way community, Framework for Open and Reproducible Research Training, Nowhere Lab, Gathering for Open Science Hardware, and the organisers of this FOSDEM Open Research Devroom!

Additional discussions

In no particular order (don't have time to organise ATM), here are some other ideas which came up in the Turing Way, FORRT, Nowhere Lab, or GOSH communities (acknowledgements at the end of this post):

Start with 'the idea of open scholarship and then narrow to “open science” or “open research” if needed, depending on who I'm talking to'.
”...it’s not just an issue in the open science/research movements, but interdisciplinary fields in general and any inter-/multi-disciplinary attempts to change research practices need to adopt epistemological flexibility & tolerance towards other”
There is interest from the FORRT community to write something about this.
Also possible for me to re-present this lightning talk at an upcoming Nowhere Lab meeting, maybe in March 2024.
Classification can be confusing for some researchers, especially those who don't fit in typical “STEM” boxes.
Sabina Leonelli's work on open science and philosophy of science, linked to below.
Harding’s sciences from below and Fricker’s epistemic injustices are also useful reading (see below).
There is a Digital Humanities group for the UK and Ireland: https://digitalhumanities-uk-ie.org
- Including research software engineering: https://digitalhumanities-uk-ie.org/community-interest-groups/research-software-engineering/
There are different levels of abstraction when talking about epistemic diversity, which is a separate deep dive on its own.
'Beware the “pageant effect”: you're likely to learn the amazing successes of an unfamiliar discipline before learning about its flaws and failures.'
Connections to science, technology, and society (STS) studies, link to epistemic power and (in)injustice.
We had a conversation about the good intentions, usefulness, but also limitations of the CReditT taxonomy for research contributor roles (readings below). Though there is further work on improving it like SCoRO.

More resources

The UK Reproducibility Network has done good work on this, such as:

Event by the UK Reproducibility Network: How relevant is the open research and scholarship agenda to the arts, humanities and social science disciplines? (warning: YouTube link)
Preprint title “Open Research: Examples of good practice, and resources across disciplines”: https://doi.org/10.31219/osf.io/3r8hb
Working paper 6: https://doi.org/10.31219/osf.io/chyd4
Working paper 7: https://doi.org/10.31219/osf.io/c78qu

👇 And there's more:

Humanities Commons: https://hcommons.org/

Replicable History Project: https://ljmu.libcal.com/event/4130747

Works by Karen Barad: https://en.wikipedia.org/wiki/Karen_Barad

Joint meeting of the European Association for the Study of Science and Technology (EASST) and the Society for Social Studies of Science (4S): https://www.easst4s2024.net/

FOSDEM talk: FLOSS meets Social Science Research (and lived to tell the tale): https://archive.fosdem.org/2021/schedule/event/open_research_floss_meet_social_science/

SCoRO, the Scholarly Contributions and Roles Ontology: http://purl.org/spar/scoro

Research Software Engineering in the Arts and Humanities: https://digitalhumanities-uk-ie.org/community-interest-groups/research-software-engineering/

Unless otherwise stated, all original content in this post is shared under the Creative Commons Attribution-ShareAlike 4.0 International license

talks — naclscrg

Talk - "AI" follow up talk about labour and academia

Video recording

Short summary

Further reading

Books

Academic literature

Transcript

Talk - Open source hardware for more equitable open science

Recording

Transcript

Further reading/resources

Peer-reviewed papers

Useful guides

Relevant organisations

Talk - AI is not the problem - thinking about outcomes (updated)

Further reading

Transcript

addendum on reproducibility

Talk - The critical role of open source in open research

General notes

a note about creating a transcript

Other resources/examples

Transcript

Talk - Representing epistemic and disciplinary diversity in open research

Video

Transcript

Additional discussions

Suggested readings

More resources

Acknowledgements

Turing Way

Framework for Open and Reproducible Research Training (FORRT)

Nowhere Lab

Gathering for Open Science Hardware (GOSH)

UK Reproducibility Network