Healthcare Chatbot Design Toolkit

This toolkit provides guidelines for designing and developing a healthcare chatbot. In this toolkit you will find recommended activities and methods which were used to adopt a stakeholder-centered approach, which is important for responsible design of digital health technologies. For each section below you can find further helpful information, recommended steps, and expected outputs should you follow our recommendations.

Stakeholder Centered Design

Responsible healthcare chatbot design should consider three things: 1) what users say they need, 2) what chatbots and features mental health professionals would endorse, and 3) what AI chatbots can do well. Chatbots can easily handle scripted dialogues with predefined replies or limited free text responses. If users wanted a chatbot for self-diagnosis or screening, it could collect symptoms and use a decision flow to suggest a diagnosis. However, professionals may not support this, as it may result in a false diagnosis, which could limit the chatbot's credibility and widespread adoption. Alternatively, chatbots could answer questions and signpost to paid mental health services. However, users may not want an application that directs them to paid services and may avoid the technology altogether. Another example is a chatbot that supports free text, attempts to detect when a user is feeling depressed, and tries to respond in a way that improves the person's mood. This may be endorsed by professionals but, given the limitations of AI, the responses may be inappropriate if the chatbot fails to understand what the user said or gives inappropriate advice. Therefore, a successful digital intervention can be thought of as the intersection between what users want and say they need, what professionals advocate, and what AI does well. We propose activities and methods which strike a balance between these three things.

Healthcare Domain Understanding

The first step in healthcare chatbot design planning should always be to understand the domain. Undertaking a literature review in the relevant area can be used to identify what has already been done and what gaps exist.

At this stage it is useful for technologists to converse with healthcare professionals, firstly to find out what health services are currently available. This also allows healthcare professionals to share domain knowledge with technologists, and healthcare professionals can understand how technology could be used in their domain. This is crucial in the co-production process, as technology design should be domain led. Surveying health professionals to gather opinions on what they would like to see in a chatbot or how they see chatbots complementing health services can be useful at this stage. These activities provide a good starting point to inform chatbot design as you can determine the purpose of the chatbot.

Our Recommendations

  • Carry out a literature review

    • What has already been done in this area? 
    • What gaps exist in the current literature?

  • Round table discussions/ workshops with technologists and healthcare professionals to share knowledge
  • Assess attitudes, preferences and endorsements from healthcare professionals

    • E.g. using a survey (Sweeney et al. 2021)

  • Establish chatbot purpose

    • What will the chatbot be used for?
    • Who are the end users?
    • What languages will the chatbot be available in?

Expected output(s): literature review, opinions from healthcare professionals on what digital health technologies they would endorse

Technological Restraints

The next step is to identify the technologies to be used. This includes the platform (e.g. mobile, web application etc), modality (written text or voice) and framework which will be used to build the chatbot. Many great open source frameworks for chatbot development exist, but the best option will depend on the features or scenarios the healthcare chatbot will address. To help decide on the technology to proceed with, it is useful to consult an expert in the field. It’s important to consider the pros and cons of artificial intelligence (AI) and natural language processing (NLP). There are two main issues in NLP: Natural Language Understanding (NLU), which is concerned with understanding the user’s input, and Response Generation, which involves providing an appropriate response. Depending on the application, limiting the amount of free text user input can help to alleviate the problems associated with NLP. It is also useful to plan how the chatbot would respond to users should they say something unexpected. This is particularly important in the area of mental health. For example, if someone discloses that they are at risk of harming themselves, the chatbot should have a suitable response, such as encouraging the user to seek external help by signposting to crisis helplines.


For example, it was documented in 2018 that popular mental health chatbots were unable to deal appropriately with reports of child sexual abuse. Another study, by Bickmore and colleagues, reported instances where Siri, Alexa and Google Assistant gave poor health advice that could have caused some degree of patient harm, including suggestions that could have resulted in death (Bickmore et al. 2018). It is important to consider the ethical issues of human-computer conversations and to be transparent with users so they do not ‘overtrust’ the information given by the chatbot, and to signpost to external help or resources to avoid unintended consequences.

Our Recommendations

  • Identify strengths and limitations to inform the choice of technology 
  • Focus on what the technology can do well and avoid areas of weakness (e.g. limit free text responses in chatbot) 
  • If zero cost is an option, then choose an open source chatbot development framework
  • Consider choosing a chatbot development environment where access to the data is private and standalone
  • Plan for fallback/ chatbot responses for unexpected inputs and edge cases

Expected output(s): final decision on device, modality, and platform; risk assessment plan to address potential ethical concerns
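
As a minimal sketch of the fallback planning recommended above, the Python below routes each user message to one of three outcomes: a crisis signpost, a scripted reply, or a generic fallback. The phrase list, intent names and reply texts are hypothetical placeholders and not a validated risk-detection method; real phrases and wording would need to be agreed with healthcare professionals.

```python
import re

# Hypothetical placeholder phrases; NOT a clinically validated risk screen.
CRISIS_PHRASES = ("hurt myself", "harm myself", "end my life", "kill myself")

# Hypothetical scripted intents mapped to predefined replies.
SCRIPTED_REPLIES = {
    "hello": "Hi! I can share wellbeing tips or point you to services.",
    "tips": "Here is a short breathing exercise you could try.",
}

CRISIS_REPLY = (
    "It sounds like you may be going through something serious. "
    "I am only a chatbot, so please contact a crisis helpline or "
    "emergency services for immediate support."
)
FALLBACK_REPLY = "Sorry, I did not understand that. You can type 'tips' or 'hello'."


def normalise(message: str) -> str:
    """Lower-case the message, drop punctuation and collapse whitespace."""
    text = re.sub(r"[^a-z' ]+", " ", message.lower())
    return re.sub(r"\s+", " ", text).strip()


def respond(message: str) -> tuple[str, str]:
    """Return (route, reply) for a single user message."""
    text = normalise(message)
    # Check crisis phrases first so they are never masked by other rules.
    if any(phrase in text for phrase in CRISIS_PHRASES):
        return "crisis", CRISIS_REPLY
    # Limited free text: only exact scripted inputs get a scripted reply.
    if text in SCRIPTED_REPLIES:
        return "scripted", SCRIPTED_REPLIES[text]
    # Anything unexpected falls back to a reply that restates the options.
    return "fallback", FALLBACK_REPLY
```

Checking for crisis phrases before anything else is a deliberate choice in this sketch: an unexpected disclosure should reach the signposting reply even when the rest of the message matches a scripted input.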

Needs Analysis

It is crucial to better understand the digital mental health needs and requirements amongst the intended user groups for the chatbot. The first step is identification of relevant stakeholders, as this is an integral part of the stakeholder-centered design process. Needs analysis workshops or surveys with stakeholders and target users can be utilised at this stage.

Our Recommendations

  • Identify stakeholders 
  • Host workshops targeted at user groups & stakeholders to identify user needs. E.g.

    • Participants: invite end users, or groups with a similar demographic to the expected end users, & stakeholders
    • Activities:

      • Discussion around the healthcare needs of end users
      • Demonstration of similar digital health technologies
      • Development of chatbot persona (gender, age, personality and character traits)
      • Co-writing conversational dialogues they would like to see in a chatbot and/or questions they would like the chatbot to ask
      • Collection of user stories (“As a < type of user >, I want < some goal > because < some reason >”)

  • Alternatively, surveying user groups & stakeholders is another approach which could be used to gather user needs

Expected output(s): user needs
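
The user story template above can be captured in a small structure so that stories gathered in workshops or surveys are recorded consistently. The sketch below is one possible way to do this in Python; the class name, field names and the example story are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    """One user story in the 'As a ..., I want ... because ...' form."""

    user_type: str
    goal: str
    reason: str

    def render(self) -> str:
        """Format the story using the template from the workshop activity."""
        return f"As a {self.user_type}, I want {self.goal} because {self.reason}."


# Hypothetical example story, not taken from any real workshop.
story = UserStory(
    user_type="student",
    goal="short daily wellbeing tips",
    reason="I have little free time",
)
print(story.render())
```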

Defining Requirements

Once the stakeholder centered design activities have been completed, the next step is to define the chatbot requirements. The requirements can be derived from the gathered knowledge of what technology can do well, what professionals endorse and what people need/ want. Initially, this may be a long list of requirements that can then be refined for the final features.

Our Recommendations

  • Adopt a co-production approach to generate solutions which address user needs

    • E.g. Multidisciplinary design thinking workshops with health professionals and technology experts. If hosting online workshops, then virtual whiteboards can be used to facilitate co-production

  • Generate list of requirements from stakeholder-centered design activities 
  • Prioritise features from list of requirements

    • E.g. Using workshops to utilise independent ranking of requirements. Individuals from multidisciplinary teams can anonymously rank features based on perceived importance and how well they fit user stories

  • Decide on final chatbot features based on prioritised requirements. Thematic analysis can be used to classify the final features into different groups.

Expected output(s): list of final healthcare chatbot features
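
The independent ranking exercise described above produces one ranking per participant, which then has to be combined into a single priority order. The toolkit does not prescribe how to combine them; the sketch below assumes a simple mean-rank aggregation, and the feature names and rankings are hypothetical.

```python
from statistics import mean


def prioritise(rankings: list[dict[str, int]]) -> list[tuple[str, float]]:
    """Order features by mean rank (1 = most important) across rankers."""
    features = set().union(*(r.keys() for r in rankings))
    scores = {}
    for feature in features:
        # A ranker who skipped a feature simply does not contribute to it.
        ranks = [r[feature] for r in rankings if feature in r]
        scores[feature] = mean(ranks)
    # Sort by mean rank, then by name so ties are ordered deterministically.
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))


# Hypothetical anonymous rankings from three team members.
rankings = [
    {"mood tracking": 1, "signposting": 2, "breathing exercises": 3},
    {"mood tracking": 2, "signposting": 1, "breathing exercises": 3},
    {"mood tracking": 1, "signposting": 3, "breathing exercises": 2},
]
ordered = [name for name, _ in prioritise(rankings)]
```

Mean rank is only one option; a team could equally use median rank or a points-based scheme, and the choice should be agreed before the workshop.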

Dialogue Design

Once the final features have been decided, the next step is finalising plans for design and writing the dialogues. The style of interactions should be considered, including the type (pre-defined, controlled, open) and control (system controlled, user controlled, both can control). If the healthcare chatbot is text based, then multimedia (images, videos, gifs) could be added to improve the user experience (UX). Personalisation can also improve the UX, so it is important to decide if and how the chatbot will use personal information about the user (name, age, gender, previous interactions etc). Chatbots can also make use of external knowledge, such as database records or online health information, if necessary. It is beneficial to work in a multidisciplinary team for dialogue content development, including those with technical expertise in conversational design, as well as healthcare professionals working directly with patients or clients and healthcare professionals in academia. Healthcare domain experts are the best people to write content for the chatbot dialogues, but it is important that they follow good design principles (Cameron et al. 2018b). As mentioned previously, it is vital to inform users of the limitations of chatbot technology upfront, for example by making them aware at the very beginning of the conversation that chatbots have only limited intelligence. When writing content to meet the requirements, it is also important to keep the desired chatbot persona in mind to ensure consistency. Once the scripts have been written, you can evaluate and sign off on the final content. For this it may be useful to have a series of questions or criteria to assess each of the scripts. If the chatbot is multilingual, then this is a good point to begin translating content, once the scripts have been finalised.

Our Recommendations

  • Finalise design decisions for the healthcare chatbot in terms of: modality and use of multimedia or voice; style of interaction; personalisation; and external knowledge. 
  • Carry out multidisciplinary content development and script design activities led or informed by conversational design experts and best practices

    • E.g. workshops where dialogues are co-designed as a group using collaborative tools such as Google documents 

  • Be transparent and inform users of technology limitations upfront 
  • Content sign off 

    • Apply criteria for assessing and signing off scripts. E.g. The following questions are derived from two existing app assessment tools (Enlight - Quality assessment section and user version of Mobile App Rating Scale uMARS) and one “best practices” publication (Cameron et al. 2018a). The question on the chatbot personality will be specific, depending on the intended use of the bot and desired persona from user groups. Questions are modified to assess a single chatbot conversation and address the conversation flow, quality of information, cultural suitability and the chatbot persona.

Conversation flow 

  • Is there a suitable amount of variance in content (emojis, gifs etc)? (Cameron et al. 2018a)
  • Is there a suitable amount of conversational elements emulating real conversation (reciprocity, conversational delights like jokes etc)? (Cameron et al. 2018a)
  • Does the chatbot use appropriate language and wording for the targeted use of the bot? (Cameron et al. 2018a)

Quality of the information 

  • Is the information provided accurate? Are there evidence-based techniques relevant for achieving the desired aim of the program? (Baumel et al. 2016) 
  • Is the information provided in a clear and appropriate way for the target audience? (Baumel et al. 2016) 
  • Is there sufficient information in this conversation without any omissions, over-explanations, or irrelevant data? (Baumel et al. 2016) 
  • Is the content of this conversation correct, well written, and relevant to the goal/ topic of the healthcare chatbot? (Stoyanov et al. 2016)

Cultural aspects 

  • Do you perceive this conversation as culturally suitable? (Cameron et al. 2018a)

Persona

  • Based on the conversation draft, does the wording fit with the desired:

    • Age of the chatbot?
    • Gender of the chatbot?
    • Chatbot persona in terms of personality traits?

  • Translate final content into other languages if applicable

Expected output(s): final chatbot content

Chatbot Development

Chatbot development can begin once the content has been finalised. This stage involves the design, development and release of a minimum viable prototype, and we recommend taking an agile approach for this. When developing the chatbot, it is a good idea to incorporate event logging, where each interaction is recorded in the chatbot log data for future analysis. This will allow you to explore, for example, engagement and which features people use most often. Consideration should be given to supporting multimedia content where possible and to incorporating app notifications to encourage usage of the healthcare chatbot. Including ecological momentary assessment (EMA), in which questions are presented to and answered by users ‘in the moment’, may also be beneficial. Depending on the healthcare application, EMA can be used for a variety of purposes, such as tracking mood across hours of the day. Another example comes from Goldstein and colleagues (Goldstein et al. 2017), who developed a mobile app to help prevent dietary relapses in overweight individuals by asking participants to use EMA to record slip-ups in their diet along with a range of potential triggers. Dynamic computational modeling was used to estimate the level of upcoming risk of a lapse in eating behaviour, and if an EMA recording from a participant met rules suggesting a high-risk moment, brief intervention elements such as a text message were delivered to help prevent lapses.
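
To illustrate how EMA answers might trigger a brief intervention, the sketch below applies a simple threshold rule to recent mood ratings. This is a deliberately simplified stand-in and not the dynamic computational modeling used by Goldstein and colleagues; the field names, the 1 to 5 mood scale and the default threshold are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EmaResponse:
    """One 'in the moment' answer; field names are hypothetical."""

    participant_id: str
    answered_at: datetime
    mood: int  # self-rated mood, assumed scale from 1 (low) to 5 (high)


def needs_prompt(history: list[EmaResponse], threshold: int = 2, run: int = 2) -> bool:
    """Return True if the latest `run` answers are all at or below `threshold`."""
    recent = sorted(history, key=lambda r: r.answered_at)[-run:]
    # Require a full run of low answers so one bad moment does not trigger it.
    return len(recent) == run and all(r.mood <= threshold for r in recent)


# Hypothetical answers from one participant across a single day.
history = [
    EmaResponse("P001", datetime(2022, 1, 3, 9, 0), mood=4),
    EmaResponse("P001", datetime(2022, 1, 3, 13, 0), mood=2),
    EmaResponse("P001", datetime(2022, 1, 3, 18, 0), mood=1),
]
if needs_prompt(history):
    print("Send a brief supportive message")
```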

Our Recommendations

  • Record user event log data 
  • Consider chatbot supporting multimedia and notifications 
  • Embedding EMA can be helpful

Expected output(s): chatbot prototype
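
As a rough sketch of the event logging recommended above, each interaction can be written as one JSON line containing a timestamp, an anonymous participant ID and the event name. The function name, event names and field layout below are assumptions for illustration, not a description of any particular chatbot framework.

```python
import io
import json
from datetime import datetime, timezone


def log_event(stream, participant_id: str, event: str, **details) -> dict:
    """Append one interaction to the log as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "participant_id": participant_id,  # anonymous identifier, not a name
        "event": event,
        "details": details,
    }
    stream.write(json.dumps(record) + "\n")
    return record


# In-memory stream for illustration; a file opened in append mode works the same way.
log = io.StringIO()
log_event(log, "P001", "conversation_started", feature="mood_tracker")
log_event(log, "P001", "message_sent", length=42)
```

Keeping one self-describing record per line makes the log easy to append to during use and easy to load later for engagement analysis.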

Chatbot Testing

Following development of the initial prototype, testing can be carried out. It is useful to start by working in a multidisciplinary team on initial system testing, so that errors can be reported to the developers and fixed. Once the initial issues have been resolved, the healthcare chatbot is ready for further testing with users. Usability testing with a small number of participants can be helpful to obtain feedback to optimise user experience and evaluate the chatbot early on, as well as throughout the development process. Once the healthcare chatbot is in a good state, it can be trialled ‘in the wild’ with the desired population.

Our Recommendations

  • Multidisciplinary testing & bug logging 
  • Iterative feedback for further development 
  • Usability testing (n=~10-20 participants) with prototype and subsequent versions of the app

Expected output(s): optimised chatbot prototype


A good way to measure real world engagement is to test the app by trialling it ‘in the wild’. In addition, it is useful to run researcher-provoked trials with end users, such as pilot or feasibility trials or randomised controlled trials (RCTs). Based on our experience, we recommend carrying out a short trial first to identify and resolve technical or logistical issues and problems with data collection. We opted for a shorter (4-week) trial initially and fine-tuned the protocol ahead of a longer (12-week) trial. In these trials, questionnaires and scales can be used to measure and compare different outcomes. We recommend administering these scales at the beginning, mid-point and end of the trials, and considering different approaches to gathering feedback: for example, prompting the user to complete feedback in the app, linking to external surveys, or following up with participants via email and focus groups after the trial to gather vital feedback on user experience.



Participants recruited to the ChatPal trials were given a region-specific ‘Trial Code’ so we could track the number participating in each area. The chatbot sent participants links to external surveys to complete primary and secondary outcome measures at various time-points and to provide additional feedback on the app. We used a participant ID (a unique anonymous identifier given to each app user) to link the event log data to the survey data for analysis. Event log data can be analysed to measure uptake and user retention. Advanced analytics beyond simple statistics can also be applied to the event log data to gain new insights, for example by applying unsupervised machine learning (clustering) to identify different groups of app users based on usage statistics.
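
The linking and retention analysis described above can be sketched in a few lines of Python. The data, field names and the definition of retention used here (a user is retained if they are seen again at least a given number of days after first use) are assumptions for illustration and do not reproduce the ChatPal analysis.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event log rows: (participant_id, date of use).
events = [
    ("P001", date(2022, 1, 3)), ("P001", date(2022, 1, 12)),
    ("P002", date(2022, 1, 3)),
    ("P003", date(2022, 1, 4)), ("P003", date(2022, 1, 20)),
]
# Hypothetical survey rows keyed by the same anonymous participant ID.
surveys = {"P001": {"baseline_score": 14}, "P003": {"baseline_score": 9}}


def days_active(rows):
    """Map each participant ID to the sorted list of dates they used the app."""
    usage = defaultdict(set)
    for participant_id, day in rows:
        usage[participant_id].add(day)
    return {pid: sorted(days) for pid, days in usage.items()}


def retention(usage, after_days):
    """Share of users seen again at least `after_days` after their first use."""
    retained = sum(
        1 for days in usage.values() if (days[-1] - days[0]).days >= after_days
    )
    return retained / len(usage) if usage else 0.0


usage = days_active(events)
uptake = len(usage)  # number of distinct participants who used the app
# Link log-derived usage to survey answers through the shared participant ID.
linked = {
    pid: {"n_days_used": len(days), **surveys[pid]}
    for pid, days in usage.items()
    if pid in surveys
}
```

Participants with log data but no survey response are dropped from `linked` in this sketch; whether to keep them with missing values instead depends on the planned analysis.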

Our Recommendations

Expected output(s):


  • Baumel A, Faber F, Kane J & Muench F. (2016) Enlight Quality Assessment Section. Enlight-Quality-Assessment-Section.pdf
  • Bickmore T, Trinh H, Olafsson S, O'Leary T, Asadi R, Rickles N, Cruz R. (2018) Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant. J Med Internet Res 2018;20(9):e11510.
  • Cameron G, Cameron D, Megaw G, Bond R, Mulvenna M, O’Neill S, Armour C & McTear M. (2018a). Best practices for designing chatbots in mental health care - A case study on iHelpr.
  • Cameron, G., Cameron, D., Megaw, G., Bond, RR., Mulvenna, M., O'Neill, S., Armour, C., & McTear, M. (2018b). Back to the Future: Lessons from Knowledge Engineering Methodologies for Chatbot Design and Development. In R. Bond, M. Mulvenna, J. Wallace, & M. Black (Eds.), Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI-2018) BCS Learning & Development Ltd.
  • Goldstein, S.P., Evans, B.C., Flack, D. et al. (2017) Return of the JITAI: Applying a Just-in-Time Adaptive Intervention Framework to the Development of m-Health Solutions for Addictive Behaviors. Int.J. Behav. Med. 24, 673–682.
  • Stoyanov S, Hides L, Kavanagh D & Wilson H. (2016) Development and Validation of the User Version of the Mobile Application Rating Scale (uMARS). JMIR mHealth and uHealth 4(2). DOI: 10.2196/mhealth.5849
  • Sweeney, C., Potts, C., Ennis, E., Bond, R., Mulvenna, M.D., O’Neill, S.M., Malcolm, M., Kuosmanen, L., Kostenius, C., Vakaloudis, A., McConvey, G., Turkington, R., Hanna, D., Nieminen, H., Vartiamen, A.-K., Robertson, A. (2021) Can Chatbots Help Support a Person’s Mental Health? Perceptions and Views from Mental Healthcare Professionals and Experts. ACM Transactions on Computing for Healthcare (HEALTH).