
Exciting Evidence: How GenAI Can Improve Outcomes for Girls

By Guest Writer on April 3, 2025


Over the past year, Girl Effect, like many organizations, has been exploring how to best harness the potential of Large Language Models (LLMs) and Generative AI (GenAI). Our goal is to build on our longstanding efforts to leverage digital channels alongside offline interventions to help girls achieve their full potential.

To consider GenAI as a sustainable approach, we first needed to ensure it met two key criteria: high quality — consistent with our existing products — and demonstrable safety. We are pleased to announce that we achieved this milestone by the end of 2024. We can now confirm that using GenAI significantly improves satisfaction, engagement, and signs of impact.

Ask a question and get an answer?

Big Sis, our trusted companion for sex and relationship advice in South Africa, has been active since 2018. Available on WhatsApp, Facebook Messenger, Moya Messenger, and Telegram, it has engaged with over 1.5 million users and received more than 35 million messages.

Big Sis initially functioned as a ‘deterministic’ chatbot, using straightforward decision-tree navigation. Girls could explore a content library and pose questions, though only a selection could be answered; those answers gradually expanded the repository of Q&As. In 2020, we enhanced this functionality to address girls’ understandable need for immediate answers, developing an initial prototype that analyzed keyword combinations within questions and suggested relevant content to users.

In 2021, in partnership with Weni (then Ilhasoft) and UNICEF, we built an NLP-powered system that categorized the questions users asked and provided the same style of content recommendation. While this was a significant improvement over the ‘on the rails’ menu navigation, we realized it still didn’t fully meet our users’ needs.

They wanted precise answers to their questions, delivered instantly, in a manner that considered the context, nuance, and the girl’s background. But the technology just wasn’t there to provide this with a high enough level of accuracy.

Everything changed with the advent of GenAI, a development that Girl Effect was ready to leverage. Girl Effect worked to ensure that commercial LLMs would deliver appropriate and safe answers to its users.

Using retrieval-augmented generation (RAG) and prompt engineering, we designed a comprehensive A/B test in which 50% of new users would receive an answer via GenAI, while the remaining users would continue to receive content recommendations.

The new feature makes use of Girl Effect’s preexisting rich corpus of content, grounded in Social and Behaviour Change techniques. The LLM is able to emulate the tone of the chatbot and provide a novel answer to every question, with strict guardrails enforcing which subject areas it can chat about.
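To make the pipeline described above concrete, here is a minimal sketch of a RAG flow with topic guardrails. Everything in it is illustrative: the corpus snippets, function names, and prompt wording are hypothetical stand-ins (the real system uses Girl Effect’s SBC content and, presumably, an embedding-based retriever rather than word overlap).

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for Girl Effect's SBC content library.
CORPUS = [
    "Consent means both people freely and clearly agree, every time.",
    "Talking openly with a partner builds trust in a relationship.",
    "Contraception options include condoms, pills, and implants.",
]

def tokenize(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def retrieve(question, corpus, k=2):
    # Rank documents by word overlap with the question -- a crude stand-in
    # for the semantic-search step of a real RAG pipeline.
    ranked = sorted(
        corpus,
        key=lambda doc: sum((tokenize(question) & tokenize(doc)).values()),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, passages):
    # Ground the LLM in retrieved content and state the guardrails explicitly.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "You are Big Sis, a warm and trusted older sister.\n"
        "Answer ONLY from the context below, in a friendly tone.\n"
        "If the question is outside sexual and reproductive health or "
        "relationships, politely decline.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

question = "What does consent mean?"
prompt = build_prompt(question, retrieve(question, CORPUS))
```

The prompt string would then be sent to the commercial LLM; grounding the answer in retrieved passages is what lets the model produce a novel response in the chatbot’s tone while staying inside approved subject areas.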

How can you ensure the quality of the answers?

Behind the scenes, we have developed an LLM-powered evaluation framework. Girl Effect has a long-standing commitment to responsible digital practices. It was essential to ensure, with a high level of confidence, that the answers provided would be safe, reliable, and relevant.

With this responsibility in mind, we also recognized that we could not manually review every generated response. This led to the need for an evaluation framework that we could use both for robustly evaluating our generative AI system across thousands of sentences before deployment and for monitoring the quality of outputs after deployment.

Girl Effect has consistently applied the principle of exploration and research within the broader field to ensure best practices in all our products. We began with this approach to develop our evaluation framework. We soon discovered that most state-of-the-art evaluation frameworks were still in their infancy and lacked proven reliability. This was further confirmed through our user testing, where responses deemed relevant by “standardized” metrics were irrelevant to our users.

Recognizing the absence of suitable metrics, we created a bespoke evaluation framework, encompassing custom metrics and manual validation. We focused on building robust accuracy metrics for each stage of the generative AI system, ensuring accurate information retrieval, relevance to the user’s question, faithfulness of the generated response to the retrieved information, and overall relevance to the user’s original query.
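As an illustration of what stage-specific metrics can look like, the sketch below implements two of the simplest: a retrieval hit rate (did the known-relevant document come back?) and a crude faithfulness proxy (what share of the answer’s content words appear in the retrieved context?). These are simplified, hypothetical examples, not Girl Effect’s actual metrics, which involved custom design and manual validation.

```python
import re

def _tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def retrieval_hit_rate(cases):
    # Fraction of labelled test cases where a known-relevant document
    # appears in the retrieved set.
    hits = sum(1 for c in cases if c["expected_doc"] in c["retrieved"])
    return hits / len(cases)

STOPWORDS = frozenset({"the", "a", "an", "is", "are", "to", "of", "and"})

def faithfulness(answer, retrieved_passages):
    # Crude proxy: share of the answer's content words that appear somewhere
    # in the retrieved context. Low scores flag possible hallucination.
    context = _tokens(" ".join(retrieved_passages))
    content = _tokens(answer) - STOPWORDS
    if not content:
        return 1.0
    return len(content & context) / len(content)
```

In practice, word-overlap proxies like this are exactly the kind of “standardized” metric the post notes can disagree with real users, which is why they need to be validated against manual review before being trusted for monitoring.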

This stage-by-stage evaluation enabled targeted fine-tuning. While user testing indicated positive results, we needed quantifiable data. Developing these metrics and validation processes was more complex than anticipated, but through rigorous manual review and iterative improvement, we achieved the confidence required to initiate an unsupervised test.

Early evidence of superior performance for GenAI


In December 2024, we launched our first unsupervised GenAI experiment. Our A/B test design involved 8,000 users who navigated to ask a question in Big Sis. At that point, they were randomly divided into two groups: 50% received answers from the GenAI, while the remaining users experienced the existing content suggestion system.

Following the interaction, we asked users whether they were satisfied with the quality of their answer, and later how satisfied they were with their overall experience. Once users were placed in the GenAI group, their questions continued to be answered by GenAI.
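Keeping a user in the same arm across sessions is commonly done with deterministic hash-based bucketing, so no assignment table needs to be stored. The sketch below shows the standard technique; the experiment name and split are hypothetical, and the post doesn’t say which mechanism Girl Effect actually used.

```python
import hashlib

def assign_variant(user_id, experiment="big-sis-genai-2024", split=0.5):
    # Hash the (experiment, user) pair so each user always lands in the same
    # arm -- assignment is "sticky" across sessions without storing state.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "genai" if bucket < split else "control"
```

Because the hash is uniform, roughly half of users fall into each arm, and changing the experiment name reshuffles everyone for the next test.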

By placing users into groups, we could analyze their downstream engagement. This allowed us to determine if the perceived positive experience led to more impactful interactions, such as asking more questions, engaging with key messages, demonstrating knowledge uptake, and accessing more service information.

A key focus was GenAI’s ability to perform reliably in unsupervised settings, preventing disruptions from API connectivity and unexpected user experience problems. These aspects are often difficult to fully assess until launch. We were thrilled that the release demonstrated the team’s success in achieving these goals.

We set out to achieve our targets within a month but managed to do so in just two weeks. The early results are incredibly promising. After conducting significance analysis, we observed that GenAI users were:

  • 11.24% more likely to recommend Big Sis to a friend
  • 17.1% more likely to engage with key programmatic messaging
  • 11.87% more likely to return to use Big Sis
  • 12.68% more likely to access service information
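For readers curious what a significance analysis of lifts like these involves, a common approach is a two-proportion z-test comparing conversion rates between the two arms. The counts below are invented purely for illustration (roughly matching a ~12% relative lift on a 4,000-per-arm split); they are not Girl Effect’s data.

```python
from math import sqrt, erfc

def two_proportion_ztest(x1, n1, x2, n2):
    # Two-sided z-test for a difference between two conversion rates.
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal CDF
    return z, p_value

# Hypothetical counts: 2,240 of 4,000 GenAI users returned vs 2,000 of 4,000 controls.
z, p = two_proportion_ztest(2240, 4000, 2000, 4000)
lift = (2240 / 4000) / (2000 / 4000) - 1  # relative lift of the GenAI arm
```

With samples of this size, relative lifts in the 11–17% range on base rates like these comfortably clear conventional significance thresholds.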

We were so pleased with the results that we extended the experiment. To date, 10,000 users have engaged with the GenAI feature.

What next?

We are currently delving into the data to gain deeper insights into user behavior — what users choose to ask about, how the GenAI responds in various situations, and the long-term impact on users who have engaged with the GenAI feature.

We will soon publish our best practices on creating a participatory approach for engaging key stakeholders, along with the essential ethical considerations for using AI and ML, in collaboration with our partners at the MERL Tech Initiative.

This is just the beginning. In the coming months, we plan to continue experimenting and sharing our insights and thoughts on the use of GenAI. We aim to prove that we can move beyond the hype and demonstrate that GenAI’s potential can lead to real returns on investment. We look forward to having you join us on this journey!

Written by Girl Effect and first published as Girl Effect’s Study Reveals Exciting Evidence that GenAI will Improve Outcomes for Girls.


