Is there really an "Open Substack mirror"? Some annoying and odd discoveries Lakshmi and I made while analyzing data from "AI, Software, & Wetware" interviews for our ABC conference presentation.
Yes. I worked on a textbook chapter about using LLMs to handle large, open-ended survey data and it was a disaster. I thought GPT-5 and the latest models would be more capable, and I was wrong. Validity and reliability, as well as replication, are still major issues.
Definitely. We are lucky that Prof. Fischer is going to provide ‘ground truth’ data we can use to validate which (if any) of the 5+ LLMs are reliable for our purposes. I’d love to read your chapter 😊
I would love to know what prompts you used to interact with the systems.
Sure, @KayStoner ! I went through a couple of iterations of refining the prompt and checking the scoring against some benchmark interviews to address obvious mismatches. We'll share the prompts and more technical details in an upcoming article - it probably fits best in my AI6P newsletter.
The kicker is that we can obviously engineer a very detailed prompt telling the LLMs what to ignore and exactly how to weight specific sentiment factors and score them. I can see asking it to rate each of the 5 risks and 5 ethical concerns from my book, for starters. Once we have Prof. Fischer’s scores as ground truth, we’ll define updated prompts that reflect her expertise. Then we’ll see if the LLMs still differ vs. each other and vs. her expert judgment.
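A minimal sketch of what a structured scoring prompt along these lines might look like: the rubric categories, the 1-to-5 scale, and the helper names here are hypothetical placeholders, not the actual prompts used in the analysis.

```python
# Minimal sketch of a structured sentiment-scoring prompt (hypothetical rubric,
# categories, and scale; not the actual prompts used for the interviews).

import json

RUBRIC = """You are scoring an interview transcript about everyday AI use.
Return JSON only, with integer scores from 1 (very negative) to 5 (very positive)
for each category. Ignore the speaker's name, country, age, gender, and other
demographic details; score only what they say about AI.
Categories: overall_sentiment, trust_in_ai, ethical_concern, perceived_benefit."""

def build_prompt(transcript: str) -> str:
    """Combine the fixed rubric with one interview transcript."""
    return f"{RUBRIC}\n\n--- TRANSCRIPT ---\n{transcript}\n--- END TRANSCRIPT ---"

def parse_scores(llm_response: str) -> dict:
    """Parse the model's JSON reply; raises if it isn't valid JSON."""
    return json.loads(llm_response)

# Usage (roughly): send build_prompt(text) to each LLM, parse_scores(reply),
# then compare the per-category numbers across models and against the expert scores.
```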
But more casual LLM users will just ask one LLM to score a sentiment, trust the results, and move on. This experience indicates that's not safe and could be very misleading. For example, a team scoring customer feedback or service requests with a single model could easily be misled.
I already tried telling the LLMs to ignore demographic bias factors, and it appeared to help somewhat. Some of the interview files still contain people's names and countries, and in some cases race, ethnicity, and age, or ways to deduce them from their self-intros. It would be some work to scrub the 80 files to remove those details and demographic clues, but it would be interesting to see whether that yields different scores vs. telling the LLMs to ignore those demographics. OTOH, factors like gender or native language might influence speaking styles, so maybe telling the LLM to ignore gender would keep it from adjusting properly for that! SO many experiments that could be done.
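A rough sketch of what that kind of scrubbing pass could look like: the name and country lists, file layout, and regex patterns are illustrative only, and real transcripts would still need a human review pass, since self-intros leak demographic clues in ways simple patterns won't catch.

```python
# Rough sketch of a demographic-redaction pass over interview transcripts.
# The lists, patterns, and file layout are hypothetical; a real pass would use
# the interview metadata and a human review afterward.

import re
from pathlib import Path

KNOWN_NAMES = ["Alex", "Priya"]          # placeholder names
KNOWN_COUNTRIES = ["Canada", "India"]    # placeholder countries

def redact(text: str) -> str:
    for name in KNOWN_NAMES:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text)
    for country in KNOWN_COUNTRIES:
        text = re.sub(rf"\b{re.escape(country)}\b", "[COUNTRY]", text)
    # Ages phrased like "42 years old" -> placeholder.
    text = re.sub(r"\b\d{1,3}\s+years?\s+old\b", "[AGE]", text)
    return text

def scrub_folder(src: str, dst: str) -> None:
    """Write redacted copies of every .txt transcript in src into dst."""
    out = Path(dst)
    out.mkdir(parents=True, exist_ok=True)
    for f in Path(src).glob("*.txt"):
        (out / f.name).write_text(redact(f.read_text()))
```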
Testing multiple LLMs wasn’t our original goal, though. Understanding people’s sentiments on everyday ethical AI was 😊 If you’d like to dig into some of these questions too, DM me!
Karen, it seems like there is too much self-confidence built into these systems.
I'm having better success using multiple AIs in my writing and my technical projects. I normally use Perplexity, ChatGPT, Gemini, and DeepSeek.
Here is how I solved a programming issue yesterday…
I had been working with DeepSeek to develop code for a Halloween project. The specific piece of code I needed working was to have an LED strip simulate a lightning strike. The strip is controlled by an ESP32 microcontroller.
After about 2 hours of work with DeepSeek, I felt we had plateaued on progress. So I popped over to Gemini, explained the issues, and Gemini provided a second opinion.
I cut and pasted Gemini's insights into DeepSeek, and here is DeepSeek's response:
Great insights from Gemini! The key points are:
1. We can't beat the 6ms data transmission time for 150 RGBW LEDs
2. Use white channel only for maximum speed and brightness
3. Segment the updates to create the illusion of speed
4. Use "jumping" updates rather than sequential pixel-by-pixel
Let me implement a much faster version using these principles:
——-
DeepSeek updated the code, and I have a decent lightning bolt simulation now.
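The final code wasn't posted, but a rough MicroPython sketch of the "white channel only, jumping segments" idea for an ESP32 driving a 150-pixel RGBW strip might look something like this; the pin number, segment size, and timing values are guesses, not the actual project code.

```python
# Rough MicroPython sketch of a white-only, segment-jumping lightning effect
# on an ESP32 with a 150-pixel RGBW strip. Pin, segment size, and delays are
# illustrative guesses, not the values from the actual project.

import time
import random
from machine import Pin
import neopixel

NUM_PIXELS = 150
SEGMENT = 15                                          # update coarse chunks, not pixel by pixel
np = neopixel.NeoPixel(Pin(5), NUM_PIXELS, bpp=4)     # bpp=4 -> RGBW strip

def clear():
    np.fill((0, 0, 0, 0))
    np.write()

def lightning_strike(flashes=3):
    """Fast white-only flashes over randomly chosen segments of the strip."""
    for _ in range(flashes):
        start = random.randrange(0, NUM_PIXELS - SEGMENT)
        for i in range(start, start + SEGMENT):
            np[i] = (0, 0, 0, 255)                    # white channel only, full brightness
        np.write()                                    # one full-strip write is ~6 ms at 800 kHz
        time.sleep_ms(random.randint(20, 60))
        clear()
        time.sleep_ms(random.randint(50, 150))

while True:
    lightning_strike(random.randint(2, 5))
    time.sleep_ms(random.randint(1000, 4000))
```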
I find it interesting how these different AIs can collaborate, but only when the human is the broker! I have yet to have an AI say, "Let me reach out to Gemini," or "Kevin, maybe you could get us a second opinion."
Good observation on LLM confidence, Kevin Guiney. One study showed that even when they are wrong, LLMs are highly confident they're right 47% of the time! Thanks for sharing your story about your Halloween project. :)
Fascinating! This made me wonder: if the same interviews were analyzed but with references to "AI" swapped for something seasonal like "haunted houses" or a trending topic, would the sentiment scores still come out so positive? In other words, are the models biased toward portraying AI in a favorable light because of how they're trained, or would the same optimism appear for any subject?
I was wondering about that too, Cheyenne! The big AI companies obviously have huge financial incentives to downplay AI concerns and play up any benefits of AI, just like they are motivated to make their products sticky or addictive or make them seem inevitable and unavoidable.
It's so interesting that NotebookLM is the LLM which, judging from its score distribution, *seems* not to downplay ethical concerns about AI. The real test will be when Prof. Fischer does her scoring and we can see what the sentiments really are.
It could be fun to come up with a Halloween-themed set of substitutions for AI, ML, and the various LLMs and see if it would impact the scores. 😊
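The substitution pass itself would be trivial; the interesting question is whether the scores move. A toy sketch, with a made-up mapping:

```python
# Toy sketch of a Halloween-themed term-substitution pass for the bias experiment.
# The mapping is made up for illustration; a real run would need case handling
# and a sanity check that the swapped text still reads coherently.

import re

SPOOKY_MAP = {
    "artificial intelligence": "haunted houses",
    "machine learning": "ghost hunting",
    "AI": "haunted houses",
    "ML": "ghost hunting",
    "ChatGPT": "the Ouija board",
}

def spookify(text: str) -> str:
    for term, swap in SPOOKY_MAP.items():
        text = re.sub(rf"\b{re.escape(term)}\b", swap, text)
    return text

# Then re-run the same scoring prompt on spookify(transcript) and compare the
# resulting sentiment distribution to the scores for the original text.
```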
This is a perfect example of why relying on a single LLM for sentiment analysis is risky. Aggregating multiple models and comparing with human judgment seems essential to spot biases and inconsistencies.
Exactly, @Suhrab Khan, that is one of my takeaways from this data analysis adventure as well. I'm especially suspicious of whether the 5 LLMs can rate sentiment *about AI* objectively. I think we can still gain some insights from them, though. Having the professor score the interviews based on her deep expertise in Textual Analysis is going to be so interesting!
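Once the expert scores exist, the comparison itself can be kept very simple. A minimal sketch, with made-up numbers and two basic agreement summaries (not a settled methodology):

```python
# Minimal sketch of comparing per-model sentiment scores against expert scores.
# The interview scores below are made up for illustration; mean shift and mean
# absolute difference are just two simple ways to summarize (dis)agreement.

scores = {
    "expert":  [4, 2, 5, 3, 1],
    "model_a": [4, 3, 5, 4, 2],
    "model_b": [5, 5, 5, 4, 4],   # suspiciously rosy across the board
}

expert = scores["expert"]
for model in ("model_a", "model_b"):
    diffs = [m - e for m, e in zip(scores[model], expert)]
    mean_shift = sum(diffs) / len(diffs)                 # > 0 means rosier than the expert
    mean_abs = sum(abs(d) for d in diffs) / len(diffs)   # overall size of disagreement
    print(f"{model}: mean shift {mean_shift:+.2f}, mean abs diff {mean_abs:.2f}")
```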