Many direct-to-client digital services rely on interactions with real people to make decisions, provide support, and improve their platforms over time. But real people produce ‘messy’ data, which makes it hard – at scale – to accurately draw a line between what they need and the official services and systems that can support them.
User-generated conversational data
For example, Jacaranda’s AI-enabled digital health tool PROMPTS is built to read, process and triage large volumes of conversational inputs from new and expecting mothers. Users are prompted to report on pregnancy-related health, danger signs, and experiences of facility-based care; these reports are used by the helpdesk to decide an appropriate course of action.
PROMPTS expects data from women that can be mapped to official DHIS2 facility data, creating a bridge between care experience (e.g. women reporting poor care) and care provision (e.g. facility-reported data on provider skills gaps).
Linking user data directly with official data gives a better idea of what is driving nationally-reported outcomes data, and enables a more personalized referral pathway – i.e. deciding which facility to refer a client to based on poor/positive experiences of care.
Messy conversational data challenge
But the challenge is that conversational data can be ‘messy’. The facility names that mothers talk about can’t always be mapped to their official names because of inconsistencies like formatting issues (e.g. irregular capitalization), misspellings, abbreviations (e.g. Level 4 > L4), and incomplete or mis-described names (e.g. ‘Sub-county hospital’ > ‘Sub-district hospital’).
‘Messy data’ is not unique to the interactions we have on PROMPTS. Other services could benefit from tools that aggregate or standardize diverse or conversational inputs.
For instance, tools that match farmers’ descriptions of crop or pest problems with scientific names to accurately identify crop diseases. Or a means to standardize names of government departments or public facilities from various sources to create unified public databases or improve accessibility of government services.
Improving conversational data with AI
Jacaranda developed an automated approach to augment and standardize conversational data to address this challenge. The approach uses ‘perturbation attacks’: small, intentional changes to input data that are normally used to trick machine learning algorithms into making incorrect decisions, repurposed here to generate realistic variations of clean, official text.
The result is a database of variations on an official name, entity or program, to mimic the inconsistencies in user-reported data.
10-Step Toolkit for Conversational Data
A step-by-step approach to implementing these attacks is below.
1. Setup
In your augmenting script, import the necessary libraries and dependencies, such as pandas, random, re, string, nltk, textattack and sklearn, as below.
- import pandas as pd
- import random
- import re
- import string
- import nltk
- import textattack
- import sklearn  # optional: useful for evaluating model performance later
Load your dataset and inspect its structure to understand its features, data types, and layout.
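As a minimal sketch, assuming the official facility names live in a CSV file (the file and column names below are hypothetical):
- # Load the official DHIS2 facility list (hypothetical file and column names)
- df = pd.read_csv("dhis2_facilities.csv")
- print(df.head())  # preview the first rows
- df.info()         # check columns, data types, and missing values
- official_names = df["facility_name"].dropna().tolist()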
2. Install the TextAttack library
The TextAttack library offers a framework to introduce subtle changes, or modifications, to desirable text inputs (e.g. DHIS2 facility data), including misspellings, word substitutions or character swapping.
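TextAttack is published on PyPI, so a standard pip install from the command line is all that is needed:
- pip install textattack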
3. Choose an NLP model
Choose a Natural Language Processing model suitable for your task and load it using TextAttack. For example, a transformer model for sequence classification.
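As an illustration, TextAttack can wrap a Hugging Face sequence-classification model with its HuggingFaceModelWrapper class; the checkpoint name below is just an example. Note that the simple character-level augmenters used in the following steps run without a model, which is only required for model-based attacks.
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
- from textattack.models.wrappers import HuggingFaceModelWrapper
- # Example checkpoint; any sequence-classification model works
- model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
- tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
- model_wrapper = HuggingFaceModelWrapper(model, tokenizer)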
4. Initialize TextAttack Augmenter
Augmenting a dataset using TextAttack requires only a few lines of code when done right. The library’s Augmenter class exists for this purpose: it generates augmentations of a string or a list of strings.
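A minimal sketch using one of TextAttack’s built-in recipes; CharSwapAugmenter introduces character-level noise, and the parameter values below are illustrative:
- from textattack.augmentation import CharSwapAugmenter
- # Perturb roughly 30% of words and produce 5 variants per input string
- augmenter = CharSwapAugmenter(pct_words_to_swap=0.3, transformations_per_example=5)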
5. Define an augmentation function
Next, define an augmentation function to apply perturbation attacks to your text data. These can include character swaps, misspellings, lowercasing, word substitutions, or character replacements, as outlined in the table below. The augmentation takes only a few lines of code, as below, and can be run from either a Python script or the command line.
- def perturbation_augmentation(text):
-     # augment() returns a list of perturbed variants of the input string
-     augmented_text = augmenter.augment(text)
-     return augmented_text
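For example, a single call on one facility name returns a list of noisy variants. The output is random, so yours will differ, but it will resemble the perturbations in the table below.
- print(perturbation_augmentation("Kianyaga Sub-County Hospital"))
- # e.g. ['Kianyaga Sub-Conuty Hospital', 'Kianyaga Sb-County Hospital', ...]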
Correct DHIS2 Name | Perturbation | Augmented Data
Kianyaga Sub-County Hospital | Misspelling | Kianyaga subounty hospital
Kianyaga Sub-County Hospital | Character replacement | Kianyaga sub couGty hospital
Kianyaga Sub-County Hospital | Character omission | Kianyaga sc hospital
Kianyaga Sub-County Hospital | Character substitution | Kianyaga subconuty hospital
Kianyaga Sub-County Hospital | Word omission | Kianyaga subcounty
Kianyaga Sub-County Hospital | Character deletion | Kianyaga subcounty hospial
Kianyaga Sub-County Hospital | Lowercasing | kianyaga subcounty hospital
Kianyaga Sub-County Hospital | Character swap | Kianyaga subocunty hospital
6. Repeat the augmentation process
Be sure to repeat the augmentation process for every data point in your dataset, using the sample code snippet below, which applies the perturbation augmentation function to each hospital name in the original dataset in turn.
- # augment() returns a list of variants per input, so flatten the results into one list
- augmented_dataset = [variant for text in original_dataset for variant in perturbation_augmentation(text)]
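For instance, with a short illustrative list of inputs (stand-in names, not real DHIS2 entries):
- original_dataset = [
-     "Kianyaga Sub-County Hospital",
-     "Example Health Centre",
-     "Example Level 4 Hospital",
- ]
- augmented_dataset = [variant for text in original_dataset for variant in perturbation_augmentation(text)]
- # With transformations_per_example=5, these 3 inputs yield 15 augmented names
- print(len(augmented_dataset))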
7. Save the augmented dataset
You will need to save the augmented dataset to a new file or overwrite the existing one, using the sample code snippet below.
- with open("augmented_dataset.txt", "w") as f:
-     for text in augmented_dataset:
-         f.write("%s\n" % text)
8. Inspect and validate
Review a few samples from the augmented dataset to ensure it aligns with your expectations. Optionally, assess the model’s performance on both the original and augmented datasets.
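A quick spot check might look like this (random was imported in step 1):
- # Print a handful of random augmented names for manual review
- for sample in random.sample(augmented_dataset, min(5, len(augmented_dataset))):
-     print(sample)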
9. Iterate and fine-tune
We found the need to iterate and fine-tune our approach throughout the process, adjusting parameters or selecting different models to achieve the desired augmentation.
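Iteration can be as simple as adjusting the augmenter’s parameters or swapping in a different recipe, for example a word-level augmenter (values are illustrative):
- from textattack.augmentation import WordNetAugmenter
- # Try word-level synonym substitution instead of character noise
- augmenter = WordNetAugmenter(pct_words_to_swap=0.2, transformations_per_example=3)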
10. Document
Good techies always document the augmentation process, including the model used, parameters, and any specific considerations, so that you can later explore augmentation options you missed, experiment with different models or techniques, or modify parameters going forward.
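One lightweight way to do this is to save the run’s configuration alongside the output, for example as JSON. This is a minimal sketch, and the field names are our own:
- import json
- run_config = {
-     "augmenter": "CharSwapAugmenter",
-     "pct_words_to_swap": 0.3,
-     "transformations_per_example": 5,
-     "notes": "character-level noise to mimic user misspellings",
- }
- with open("augmentation_config.json", "w") as f:
-     json.dump(run_config, f, indent=2)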
Did You Improve Messy Conversational Data?
The hope is that this toolkit will support faster, cheaper augmentation of diverse data at scale, helping implementers make sense of the data they collect from users and connect those users with systems and services that could support them better.
By producing ‘cleaner’, standardized datasets, this process will also help implementers more seamlessly train AI-based models and better report on the insights or implications of the data they generate.
We are keen to hear from other implementers how this toolkit has supported data augmentation in their services and systems. Please share feedback, learnings, and areas for improvement in the comments below.
By Stanslaus Mwongela, Machine Learning Manager, Jay Patel, Head of Technology, and Laura Down, Head of Global Communications, Jacaranda Health