The development of an experience measure for an online digital mental health community

Abstract: Online digital mental health communities can contribute to users' mental health positively and negatively, yet outcomes and impact relating to digital mental health communities are difficult to capture. In this paper we describe the development of an online experience measure for a specific children and young people's community inside a digital mental health service. The development was informed by three phases: (i) item reduction through Estimate-Talk-Estimate modified Delphi methods, (ii) user testing with participatory action research, and (iii) a pilot within the digital service community to explore its use. Experts in the field were consulted to help reduce the items in the pool and to check their theoretical coherence. User experience workshops helped to inform the usability, appearance, wording, and purpose of the measure. Finally, the pilot results highlight completion rates, differences in scores by age and community role, and a preference for 'relating to others' as a mechanism of support. Outcomes frequently selected in the measure show the importance of certain aspects of the community, such as safety, connection, and non-judgement, previously highlighted in the literature. Self-reported helpfulness scales like this one could be used as indicators of active engagement within the community and its content, but further research is required to ascertain their acceptability and validity. Phased approaches involving stakeholders and participatory action research enhance the development of digitally enabled measurement tools.


Introduction
Online peer communities can provide a platform of social interaction for young people. Children and young people are considered digital natives 1 , as most have lived with relative ease of access to the internet since childhood. This influences children and young people's attitudes towards the internet and how they seek support, with most considering the internet the first option for seeking information, advice, or emotional support for mental health problems 2,3 . When paired with the importance of social peers' attitudes, beliefs, and behaviour during adolescence 4 , online peer communities can offer an important form of support for young people seeking information or struggling with mental health. These online communities can be formed through instant communication tools, social media networks, and asynchronous forums where people share content with the intention of being seen by peers.
The importance of online peer communities in supporting adolescent mental health is shown by a strong but complex relationship between online social networks, mental health, and wellbeing 5,6 . When online social networks are used to seek support, reports of depressed mood are minimised or maximised depending on whether users were passive or active in their online use 7,8 . However, some studies have reported low-quality connections and 'comparison effects' on specific social media platforms and their links to depression 9,10 . Liu and colleagues' 6 meta-analysis highlights how different digital communication tools and media uses affect well-being depending on the intimacy of the medium and the different types of activity. Lately, the harms of some online social networks have been explored in more detail and have also been found to predict an increase in body dissatisfaction 11 . Therefore, it is important to recognise that the nature and characteristics of an online community will influence whether its impact on users' mental health is positive or negative. For example, visiting pro-anorexia sites was negatively associated with perception of appearance, drive for thinness, and perfectionism for their users 12 , but online social support has been shown to act as both a protective and a risk factor mediating how web content is internalised 13 . Conversely, online mental health communities can be seen as the analogue of face-to-face support groups, especially for a subset of the population seeking advice or wanting to express emotions online 14,15 . The anonymity and social connectedness created in these spaces can help people to overcome stigma and make positive disclosures of experiences and problems 16 . Nevertheless, others have demonstrated how dependency on these communities can hinder recovery from stigma, especially in spaces without moderation or supervision 17 .
It is when online mental health communities have appropriate characteristics (e.g. moderation, anonymity) that they can help individuals and maximise support when experiencing mental health difficulties, enhancing coping and recovery 18,19 . The potential negative impact of online mental health communities on young people can be mitigated by moderating content, creating safety, preserving anonymity, and using other mechanisms to create a bounded environment which avoids judgement. Observations of unmoderated platforms identified signs of self-harm normalisation and an increase in suicidal ideation 20,21 . Comparatively, users of moderated mental health forums report a reduction in the frequency and severity of self-harm behaviour after starting to use the forum 22 . Given these mixed impacts, it is important to examine specific online communities designed to provide mental health support, with the aim of understanding positive engagement, reducing the risk of harm, preserving safety, and enhancing the wellbeing of online mental health support experiences.

Measurement in digital mental health communities
Determining how to evaluate the impact and effectiveness of online community mental health support is a key focus for platforms providing online peer support services. The indirect and asynchronous nature of community forum support presents challenges when using standardised measures for its evaluation, especially when the community is diverse, user-led, and not focused on a specific mental health concern, such as anxiety, eating difficulties, or depression, that would lead to a specific outcome.
When standardised measures have been used in online community research, results have been mixed. One online peer support group for young people found users improved in anxiety scores but did not show any changes in depression 23 . Others found a non-significant reduction in depression among forum users, or no differences in body dissatisfaction between forum users and a control group 24,25 .
A clearer picture of the benefits of online mental health communities emerges when qualitative and mixed methods are used. Horgan and colleagues 24 used thematic analysis on forum posts, alongside standardised outcomes for depression and anxiety. Young people using the forum frequently discussed the immediate benefits of sharing their feelings on the forum and described a sense of not being alone. Forum posts also mentioned the benefits of individuals comparing themselves to others, and consequently believing that their situation was less bad than previously thought. In one self-harm community investigated, young people reported that they learnt more about mental health from other users than from information sites, and that they found it easier to disclose information online, in part because they were less likely to be judged than in real life 26 . More recent investigations have shown how self-efficacy and access to further support can increase through the use of these communities 27 . They can also provide a sense of belonging, tackling feelings of loneliness regarding mental health experiences 28 . Qualitative studies also reveal how social modelling enables peers to encourage pro-social behaviour in one another and to receive support within and outside the community 29 .
Qualitative methods do, however, have limitations in measuring outcomes for online communities at scale. They are time intensive and cannot be used repeatedly to track user experience and satisfaction, nor as a method to routinely collect information about the community. Nevertheless, qualitative research provides an in-depth understanding of why young people use online mental health communities and what outcomes they achieve. Its results can be used as the functional theory to develop an experience measure for an online community.
Online peer mental health communities aiming to support users require an understanding of how the resources and content composing the community help or hinder user well-being. Measurements of the self-reported impact of the community experience can be collected and routinely aggregated on the platform. Developing a self-reported measure for this purpose should aid understanding of what help the content can provide and how different users may benefit. Ultimately, a self-reported helpfulness measure provides an indicator of active engagement, going beyond the forum analytics often reported to define engagement in digital contexts (e.g., views, clicks, time, popularity). Measuring the helpfulness of community content will provide insights into how the resources consumed contributed to a positive, negative, or neutral experience. The measure should also capture the mechanisms that lead to the experience, and the types of outcomes users achieve in relation to the community. A peer online community experience measure administered in a specific online community may help to recommend personalised content or identify resources that are contributing to the recovery and support of individuals using that community.
The Kooth.com (referred to from here onwards as the service) online community is a user-led forum inside a multi-component digital mental health service, where the content revolves around the changing needs and experiences of the young people on its platform. The content of the community consists of three core types of posts: (1) a co-created magazine with a combination of psycho-educational, creative, and informative content written by service users and practitioners; (2) discussion forums authored by users, providing direct interactions between peers but still moderated by professionals; and (3) mini-activities, a specific type of content created with the intention of helping users build life skills and promote planned action. All user-submitted content is moderated before being published on the platform to safeguard, categorise, and age-restrict content where necessary. When designing a measure for a specific online community and its characteristics, a framework to measure quality of care is required. Such frameworks are especially useful when the programme theory and mechanisms of change for the online support community have been previously investigated, so that both can be combined to develop a specific and relevant measurement.

The framework to develop a Peer Online Community Experience Measure for an online digital mental health community
Donabedian's 30 quality-of-care framework recommends measuring care by assessing structure, process, and outcomes. For an online community, the focus of care is the peer support taking place in the forum. The measure should also consider the role of the community member within the forum: (1) the role of an active contributor generating content, or (2) the role of a consumer reading and engaging with the content posted by the community.
The design and structure of the Peer Online Community Experience Measure (POCEM) was divided into three parts, each representing one of the domains of Donabedian's measurement framework.

• Part one: Assesses the quality of the community resource through structure, focusing on the helpfulness of magazine articles and forum discussions using an 'emoji'-based Likert scale.
• Part two: Assesses why online community users found these specific structures (community resources) helpful in respect to the area of support.
• Part three: Explores what outcomes are achieved, specific to those resources considered helpful in reference to the area of support received.
Part one was developed through consultations with experts in the field to develop a quantitative scale for the measure. Parts two and three were initially developed using the researched theory of change previously developed within the online digital service, which used qualitative analysis of the community content to ascertain the positive outcomes achieved in the online community 31 .
Part one: Assessing quality through structure
According to Donabedian's seminal paper 32 , structure is about the 'adequacy of facilities', typically referring to relatively static characteristics of care (e.g. personnel, buildings, resources). Within an online digital community, structure is harder to define; it can relate to the design of the forum and its moderation, but also to the content included in the community, or perhaps the community itself. A construct like helpfulness is one way to encompass structure from the framework, allowing service users to reflect on their experience of the online facilities. Some Patient Reported Experience Measures (PREMs) and Net Promoter Scores (NPS) are global rating scales used to offer a reflective experience for the user or patient through a single question, providing a numeric response from users to benchmark quality 33,34 . The NPS is a straightforward way of measuring satisfaction by asking whether an individual would recommend the service to a friend; however, it is unclear how useful it is for patients 35 . To mediate the complexities of assessing the quality of the structure of support from an online forum community, the structure was defined as the experience of the community and its helpfulness, using the question 'Did you find this part of Kooth helpful?' with a 5-point Likert scale response from 'No' to 'Loads'.

Part two: Assessing quality through process
The service's theory of change identified a matrix of four possible domains or mechanisms of support that a young person may experience from community engagement 31 . These relate to the commonly used differentiation between informational and emotional support 36,37 , previously explored in the service's forum data 38 , and provide the part of the measure assessing 'processes' in the online community. These four domains of support, defined by either an interpersonal experience (in relation to others) or an intrapersonal experience (in relation to oneself), were previously found to be the main experiences of adolescents and their goals in online therapy 39 . Within part two, users are asked why they found this part of the service helpful, or, if they did not, what they were hoping to have found, by selecting one of four options representing the four domains of support (1: Emotional interpersonal; 2: Emotional intrapersonal; 3: Informational interpersonal; 4: Informational intrapersonal).
Part three: Assessing quality through outcomes
The final part looked at 'outcomes', which seek to capture whether the goals of care were achieved 40,41 . Goals achieved and combined outcomes were previously identified through qualitative analysis of the service's community forums and magazine articles. Outcomes were found through qualitative exploration of the question 'what factors influence perceived positive behaviour change for Children and Young People who participated in an online peer support intervention?'. Six main desirable outcomes were identified for the online community: 1) Relatedness and Self-Expression; 2) Hope and Help Seeking; 3) Building a Safe Community; 4) Digital Altruism; 5) Hope and Help Receiving; 6) Making Change. These represent the potential outcomes that part three asks the user about as an indicator of outcomes achieved. This last part of the measure, assessing outcomes, is only shown to those who rated the community structure as helpful on the scale in the first part of the measure.

The present study
The present study describes the (i) development, (ii) user testing, and (iii) pilot results of the implemented measure in a dynamic and multifaceted digital mental health service. This phased approach involved different key stakeholders and participants who influenced each phase of its development. To ensure the POCEM is a meaningful measure for the people using it, participatory action research practices were conducted to guide the development of the items and structure of the measure. An iterative design process incorporated evidence collected from practitioners, researchers, design experts, and young people using the platform. The ethos of Donabedian's 30,32 framework was applied to the development and design of this digital community experience measure.
The measure and its development provide an opportunity to collect data on the peer support community and its structure, processes, and outcomes within the service. The study describes the development of the POCEM across three phases, including the implementation of the measure in a pilot testing phase administered in the online community resources of the platform.

Methods
A multi-phased design process was used for the development of the service's community measure, guided by participatory action research and involving iterative development, reflective decision making, and real-world application of the findings 42,43 . The development of the measure involved a group of practitioners, researchers, user experience design experts, and young people. In hindsight, three key phases of design supported the development of the measure, emphasising the digital context and acknowledging the iterative, incremental nature of measure development research 44 :
1. Item generation and reduction: A three-part measure developed with digital product experts and designers. The content of the measure and its items were created by combining qualitative thematic indicators of outcomes and mechanisms. Delphi rounds were used to reduce items and explore the content for the measure.
2. User testing: Participatory workshops to directly explore the face validity of the measure with young people. The focus was to verify the appropriateness of the language and how the design of the measure was experienced on the platform as a prototype.
3. Pilot study: A 10-week pilot of the measure within the digital online community, exploring completion rates, average scores, item selection frequency, and correlations between items at each part of the measure.

Phase 1: Item generation and reduction
The process of item generation and reduction for the first phase of POCEM development was carried out using the Estimate-Talk-Estimate Delphi technique 45 . The technique is used to achieve expert consensus through an iterative process of multiple discussion sessions between a panel of experts. The Estimate-Talk-Estimate method differs from the standard Delphi technique by allowing verbal interaction between panel members. This interaction allows for clarification and justification of decisions made by individuals, which can be impaired when using a fully anonymous and asynchronous process 46 . The stage of independent evaluation aims to prevent the dominance of individual voices in the process, and group pressure to conform. The process includes anonymity in the initial responses, multiple iterations with controlled feedback at each stage, statistical analysis of the group responses, and expert input throughout 47 . The Delphi technique is frequently used in healthcare research and has previously been used to modify a social responsiveness scale 48 , while the Estimate-Talk-Estimate variation has been used in developing a framework for mental health apps 49 .
To develop a measure of peer community experiences that reflected both young people's views and expert opinions, a two-stage Estimate-Talk-Estimate Delphi process was used. Stage one involved panel members with mental health practice expertise composing an initial pool of items based on previous theory known about the service 31 . Stage two involved discussion between researchers to assess the items generated, identify links between the generated items and the constructs from the theory, and reduce the initial pool of items using an inter-rater agreement approach.

Participants
Two panel groups were recruited, with a total of six experts across the Estimate-Talk-Estimate workshops. Most experts belonged to the service and one to a university institution. Experts registered their interest in the project via email and specified whether they were interested in participating in (a) the service's Theory of Change thematic analysis, (b) the item generation stage of the online community measure, or (c) the item reduction stage of the online community measure. Experts could volunteer for multiple parts of the project. All panel members had extensive experience researching or providing support and moderation in the digital mental health platform.
Panel members were recruited to participate in two projects concurrently: the development of the service's Theory of Change 31 , and the generation of the desirable outcomes that inform the three parts of the POCEM. One panel member was involved in both the item generation stage and the later item reduction stage; this continued involvement was used to ensure continuity between the two stages (Table 1). Four rounds of panel workshops were carried out with the expert participants. Workshops were held face-to-face or by videoconference. Asynchronous communications through e-mail were used to prepare the experts, and anonymous questionnaires for voting were provided in each round. The rounds had different aims regarding content relevance and structure, reduction of proposed items, and changes in the wording of items to improve quality. All rounds were documented through field notes supervised by the research lead (TH).

Round 1
The first stage of the process was a face-to-face group meeting with the expert panel, wherein a broad list of initial items was generated. The items were generated deductively from the desirable outcomes found in the Theory of Change of the service 31 . The experts involved in this initial item generation were concurrently involved in the Theory of Change research, allowing for a deeper understanding of the transcripts' findings and the theoretical foundation of the Kooth online community outcomes and mechanisms. The process of generating the initial pool of items used a thematic analysis approach, consistent with Braun and Clarke's 50 approach to analysis in psychology research. The thematic analysis investigated the factors influencing positive behaviour change for children and young people accessing an online peer support intervention and described the desirable outcomes for the online community. Items were generated by each panel member independently and agreed as a group through a panel discussion process. At least one item was generated for each of the desirable outcomes and mechanisms for positive change in the online community identified in the thematic analysis (Figure 4).

Round 2
This round with panel members focused on mapping the initial pool of items to the domains of support. The domains of support were mechanisms formulated in earlier research, intended to represent the high-level 'wants' and 'needs' of children and young people asking for mental health support within an online digital service. These domains of support are covered in part two of the POCEM as the process (mechanism) that led to an online community resource being helpful (Figure 1). After mapping domains for each item in the initial pool, panel members were asked to select the items most likely to be selected by two different types of online community members: (1) those contributing, or (2) those consuming (accessing by reading) the community resources. The workshop with experts focused on which items were more relevant to each type of community member and on providing a rationale for those decisions.

Figure 1. The service's (Kooth.com) high-level outcomes of support, 'wants' and 'needs' of children and young people accessing a digital mental health support service 31 .

Round 3
Following the second round, panel members voted on the items' relevance to identify those to be discarded. The rating was done independently and anonymously by all panel members, who were given two weeks to make their evaluations. A workshop was then carried out to discuss the item relevance findings and the relative ratings of the different members. The discussion focused on whether the items were repetitive, reflected the support domains they were matched with, and were representative of the outcomes from the thematic tree.

Round 4
Following the last round of talks, panel members asynchronously and independently evaluated the wording of the items and suggested additional items relevant to the peer online community. Panel members were given three weeks to make their assessment and present their review. The workshop focused on editing the wording of the items and further reducing them based on similarity, the structure of the measure, and the previous theory used to develop it. The discussion considered the independent comments made prior to the workshop. The output of the last round comprised the statements of the three-part measure taken to prototype generation and user testing with children and young people as the main stakeholder group.

Analysis
Descriptive statistics were used to describe participants' characteristics, and the frequency of votes and selections was recorded for each expert. The field note outputs from each round of the item generation and reduction phase were discussed sequentially, influencing the materials taken to the panel of experts in each round. In round three, when panel members independently and anonymously rated their preference for the items, an intraclass correlation coefficient was calculated to understand agreement between panel members on their decisions.
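The paper does not specify which form of intraclass correlation coefficient was used, but agreement between a fixed set of raters scoring the same items is commonly assessed with a two-way ICC. A minimal sketch, assuming the ICC(2,1) form (two-way random effects, absolute agreement, single rater), computes the coefficient directly from the ratings matrix:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_items, k_raters) array-like of scores,
             one row per rated item, one column per panel member.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    item_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)

    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((item_means - grand) ** 2) / (n - 1)   # between items
    msc = n * np.sum((rater_means - grand) ** 2) / (k - 1)  # between raters
    sse = np.sum(
        (ratings - item_means[:, None] - rater_means[None, :] + grand) ** 2
    )
    mse = sse / ((n - 1) * (k - 1))                          # residual

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

With perfect agreement the coefficient equals 1, while values near 0 indicate agreement no better than chance; the function and variable names here are illustrative, not taken from the study's analysis code.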

Phase 2: User testing

Participants
A voluntary purposive sample of 11 young people was recruited from four primary and secondary schools in Manchester, UK, to take part in the user testing workshops. The 11 young people, aged 12-17 (7 female, 4 male), expressed their interest in participating, and parental and individual consent was sought for each participant. The study was advertised through teaching staff; participants had no previous experience as users of the digital service (Kooth.com), and an incentive of £10 was given to participants for attending the 60-minute workshops. Two researchers, one with participatory research expertise and one a user experience designer, conducted and analysed the user testing sessions.

Instruments and materials
Kooth prototype: clickable high-fidelity
A high-fidelity prototype is a smartphone-based interactive representation of the product, the web-based service, with strong similarities to the final design in terms of detail and functionality. The high-fidelity prototype was developed with the vector graphics editor Sketch 51 and included the POCEM inside the online peer support community, allowing users to click around and interact with the whole platform. In the context of measure usability, a high-fidelity prototype allows exploration of the wording, structure, relevance, and comprehensiveness of the measure and its functionality in interaction with the whole platform.

Peer Online Community Experience Measure (POCEM)
The POCEM is an online community measure (specific to the service, Kooth.com) that contains three main stages for completion. Its aim is to measure the satisfaction and quality that an online community resource provides for the individual. The measure automatically differentiates between contributors to the online community and readers.
The first stage contains a single-item question ("Did you find this part of Kooth helpful?") scored on a 1-5 Likert scale (1: No; 2: Not really; 3: Don't Know; 4: A bit; 5: Loads!, all scores aided with emojis) to assess the helpfulness of the online community resource. The Likert scale score determines the helpfulness benchmark and branches the measure into stage two. Depending on the score in stage one, a new single-item question is prompted ("What were you hoping for?" for scores of 1-2 and "Why did you find it helpful?" for scores of 4-5), a multiple-choice (single-response) question with four options, each representing one of the quadrants of high-level outcomes (or domains) of support from the service. Respondents who selected score 3: Don't Know are not shown parts two and three of the measure. The last stage is displayed only to those users who completed stages one and two and scored 4-5 in stage one regarding the helpfulness of that part of the digital service; a single-item (multiple-response) question ("What type of things have you learned?") is prompted with specific outcomes that relate to the domain response in part two. Readers can select between 23 outcomes (across domains) and contributors between 14 specific outcomes.
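The branching described above can be sketched as simple routing logic. This is an illustrative reconstruction only; the function names, and the representation of prompts as plain strings, are assumptions and not the service's actual implementation:

```python
# Stage-one response options, as described in the measure
HELPFULNESS = {1: "No", 2: "Not really", 3: "Don't Know", 4: "A bit", 5: "Loads!"}

def stage_two_prompt(score):
    """Return the stage-two question implied by the stage-one score, or None.

    Scores 1-2 branch to the 'unhelpful' prompt, 4-5 to the 'helpful'
    prompt, and 3 ('Don't Know') ends the measure after stage one.
    """
    if score in (1, 2):
        return "What were you hoping for?"
    if score in (4, 5):
        return "Why did you find it helpful?"
    return None

def show_stage_three(score):
    """Stage three (outcomes) is only shown after a helpful (4-5) rating."""
    return score in (4, 5)
```

For example, a user rating a forum post 'Not really' (2) would be asked what they were hoping for but would skip the outcomes stage, while a 'Loads!' (5) rating routes through all three stages.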

Lookback.io: screen recording & audio
Lookback.io 52 is user testing software for recording mobile UX sessions. It allows recording of screen interactions alongside voice audio when conducting supervised sessions of remote user testing 53 , and provides secure storage and organisation of user testing sessions for qualitative analysis. This tool allowed the recording of both screen behaviour and audio from participants attending the user testing sessions.

Procedure
The user testing was structured as one-to-one sessions delivered at each of the four schools. Participants were provided with a smartphone loaded with a high-fidelity prototype of the measure within the platform. Sessions were facilitated by a user experience expert researcher and observed by another researcher, who safeguarded the session and took notes. The sessions were voice and screen recorded for later transcription and analysis.
The facilitator encouraged young people to verbalise their thoughts and perceptions using the think-aloud protocol 54 as they navigated their way through the platform while following the facilitator's instructions for the prototype. Instructions followed a protocol of specific tasks within the prototype measure aimed at identifying any issues with the interface, allowing facilitators to observe participant behaviour in relation to each task. Facilitation tasks included asking about expectations of the next event the interface would show during the session, and whether there were any issues with the clarity and relevance of the measure's wording.

Analysis
Affinity diagrams, or KJ methods, are adopted in user testing for prototype interaction 55 . They are a useful technique for synthesising and organising large amounts of qualitative data post-task. The user testing sessions were synthesised into affinity cards representing each participant's observations and quotes; such cards are later jointly analysed to create a diagram. Affinity cards for issues raised more frequently, or with higher severity reported by participants, take priority when addressing changes to the prototype.
We followed the adapted four-stage process (creating notes, clustering notes, walking the wall, and documentation) from Lucero 56 . Researchers worked in Microsoft Excel, using rows to represent participants and columns for each affinity note. A total of 236 affinity cards were collected from field notes, screen recordings, and audio recordings from each session. Rounds of clustering by researchers identified two main clusters, one relating to the measure and one to the prototype and tasks performed (58.48% measure affinity notes and 41.52% prototype and task affinity notes). Twelve clustering issues were identified across the clusters, some directly related to the measure, such as including an 'other' personalised option, and some to the platform and prototype, such as difficulties in navigation. The affinity diagram was then created, and walking-the-wall exercises with other researchers and experts at the service (n=3) provided synthesis and identification of priority changes to the measure and prototype interaction by looking at the frequency, feedback notes, and quotes presented in the affinity diagram. Documentation of the output from the affinity diagram discussed by experts is provided in Figure 2.

Phase 3: Face validity pilot testing
In contrast to content validity, which is concerned with the breadth and accuracy of items measuring a construct, face validity assesses the degree to which respondents judge the instrument and its items to be appropriate for the targeted assessment 57 . We used a 10-week pilot of the tool to collect qualitative and quantitative data from users in the online community (11-25 year old service users) completing the measure inside the platform, as well as opinions about the measure from those users who consented to the study.

Participants
The measure was iteratively released onto the service's platform. Online service users who either contributed to a forum or submitted an article during the testing period were presented with the contributors' measure after submission. Service users who read an article or forum were presented with the readers' measure at the end of the post. Only users who expressed consent for their data to be used in research were included in the sample. Data were collected between the 13th November 2019 and the 22nd January 2020.

Procedure
The clickable prototype of the POCEM was implemented as a feature for service improvement in the online community on the service's platform; changes arising from Phase 2 were included in the piloted measure. For a period of 10 weeks the measure was tested within the platform and data were collected on users engaging with the community at the digital mental health service. Routinely collected monitoring information was used alongside peer support data to investigate the measure's performance. All users accessing the platform community were able to see and complete the measure during the 10-week period.

Analysis
Frequencies and descriptive analyses were carried out on completion rates for users who accessed the online community and those who completed the measure. Descriptive statistics and frequencies of scores on the three steps of the measure were calculated to understand whether items were being selected sufficiently. The aim of the pilot analysis was to assess whether the POCEM can be used as a proxy for active engagement with, or the helpfulness of, a community resource and its content, and to identify the outcomes most frequently selected by users of the measure.

Phase 1: Item generation and reduction
The process of item generation and reduction was carried out iteratively over four Delphi rounds. A flow chart of the outcomes from each round of the item generation and reduction process is shown in Figure 3. The items initially generated for the online community measure were produced based on the panel members' understanding of previous literature on online peer support communities and the service's Theory of Change 31 . The thematic analysis revealed desirable outcomes for positive behaviour change in the online community: 1) Relatedness and Self-Expression, 2) Hope and Help Seeking, 3) Building a Safe Community, 4) Digital Altruism, 5) Hope and Help Receiving, and 6) Making Change. A diagram of the thematic tree (Figure 4) was used to generate the initial pool of 68 items based on these desirable outcomes and their mechanisms from the main themes and sub-themes of the tree (Supplementary Table A1).

Round 2
This round aimed to categorise the initial pool of items against two criteria. First, panel members categorised each item based on the type of community member who would find the item useful. Most of the items (75%; n=51) were classified as relevant to users reading or consuming resources in the community, whilst the remaining 17 (25%) were relevant to users contributing content to the community.
The pool items were then categorised into domains of support and used to inform the second part of the POCEM. The emotional domains were used more frequently than the informational domains to categorise the items in the pool (Table 2). Through a discussion process, participants agreed on a three-part structure for the measure, with respondents only shown items relevant to the selected domain. Each panel member voted on the items that they believed should be kept for the measure. An intraclass correlation coefficient (ICC) estimate with a 95% confidence interval was calculated based on three raters, absolute agreement, and a 2-way mixed-effects model.
The inter-rater reliability between panel members was poor at this stage of the item reduction process (ICC=0.09, p=.07).
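The ICC reported here corresponds to the two-way model with absolute agreement for a single rater, often written ICC(A,1) or ICC(2,1). A minimal pure-Python sketch of that computation, following McGraw and Wong's mean-squares formulation (without the confidence interval), might look like this; the example ratings are invented:

```python
def icc_a1(ratings):
    """ICC(A,1): two-way model, absolute agreement, single rater.
    `ratings` is a list of rows, one per rated item, each holding
    one score per rater."""
    n = len(ratings)          # number of rated items (targets)
    k = len(ratings[0])       # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Partition total variability into item, rater and residual parts.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                  # mean square, items
    msc = ss_cols / (k - 1)                  # mean square, raters
    mse = ss_err / ((n - 1) * (k - 1))       # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Invented example: 4 items rated by 3 panel members on a 1-5 scale.
example = [[4, 4, 5], [2, 3, 2], [5, 4, 4], [1, 2, 1]]
print(round(icc_a1(example), 2))
```

In practice a library such as `pingouin` (`intraclass_corr`) would also report the confidence interval and p-value used in the text.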
In the following workshop the ratings were discussed amongst the panel. The outcome of the workshop was the removal of 23 items, and the addition of three further items. All of the items that were selected by at least two raters (12 items) were kept in the item list.
Two items were added to the domain 'Informational-Intrapersonal' for readers, bringing the total items in that domain to five. One last item was added to the domain 'Informational-Interpersonal' for readers as community members, also bringing that domain's total to five.

Round 4
Following independent and asynchronous evaluation of item wordings, nine suggestions were made for wording changes to the items. The six changes that were accepted in the workshop are shown in Table 4, including the rationale for each change. In addition to the item changes, the fourth round also resulted in the removal of seven further items. The final 38 items and statement wordings were tested in a prototype in the user testing phase.

Phase 2: User testing findings
The main findings collected from the affinity diagram identified issues and recommendations for the POCEM. Many participants perceived the measure to be linked with the type of content consumed or accessed at the time by the user inside the community, mainly forum posts and their subsequent comments. This is well illustrated by one participant's quote when prompted in the session to explain what the measure is intended to do [Participant 3]: "How? in my experience was just reading the person and the comments under it". Most young people reported that they would be more likely to complete the measure if the content of the forum post was helpful for them. The feedback suggests there may be an agreement bias effect deterring users from providing negative feedback to a peer within the online community, or encouraging users to ignore the measure when the content is not perceived as helpful. [Participant 9] said "If it is related to what I am doing, I would fill in the measure, to see similar articles", and [Participant 1] stated that connecting with peers would be a key motivation to complete the measure: "…if I had trouble making friends I would say loads (of motivation)". Most participants found it normal that the measure would appear with each resource on the community platform; however, some participants found it difficult to locate the measure on the platform without a facilitator's prompt within the sessions. This provided evidence about the measure's appearance and initiated changes that may affect measure completion.
With regard to the measure's appearance, it was identified that one emoji under the scale had a mismatched emotion. [Participant 10] explained: "The 'No' just looks like they're about to cry or something", although most participants liked the emoji representation within the scale; for instance [Participant 1] stated a preference: "The emojis are more neutral not grumpy or red as might give wrong impression to others". Findings also revealed difficulties in participants understanding who would see their responses. Four participants expressed doubts about whether the information would be publicly available for other peers in the community to see; for instance [Participant 8] said: "I thought it would instill confidence in the author to write more". User experience findings around the physical appearance of the measure and its instructions led to changes for the live version of the measure taken into the pilot phase (Figure 5). Finally, user testing provided a scenario to review item wordings based on participants' experiences of interacting with the prototype during the exercises. Some statement changes are presented based on the rationale given by participants, extracted from the affinity diagram. The wording review steered two changes in the second part and four in the last part of the measure (Table 5). Overall, user testing allowed the identification of appearance issues, validated the focus of construct measurement (the measure targets the specific community resource), and allowed wording changes by the intended population for the measure.

Phase 3: pilot study results
The first part of the measure was completed 13,502 times; part two, which was only shown to those who had selected responses other than 'don't know', was completed 6,365 times; and the third part, which excluded those who responded 'no' to part two, was completed 3,296 times. The measure was completed by 7,026 participants.

Completion rates
The measure was tested and released iteratively by parts on the platform. The pilot data collection started when the full measure and its three parts had been released on the platform. The complete measure was tested between the 11th of December 2019 and the 20th of January 2020, with 2,140 unique service users completing a total of 4,897 administrations of the POCEM. There were a total of 68,439 views of community content on the site by service users who gave research consent during this time, and a total of 2,425 contributions in the form of article or discussion posts. Completion rates were divided between readers and contributors to better understand overall completion of the instrument across community members (Table 6).
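Treating every content view or contribution as one opportunity to complete the measure (an assumption made only for illustration; the study's actual reader/contributor split is in Table 6), the overall completion rate implied by these figures can be computed directly:

```python
# Pilot figures reported in the text.
views = 68_439          # community content views by consenting readers
contributions = 2_425   # article or discussion posts by contributors
completions = 4_897     # total POCEM administrations

# Assumes each view or contribution is one completion opportunity.
overall_rate = completions / (views + contributions)
print(f"Overall completion rate: {overall_rate:.1%}")  # about 6.9%
```

This low single-digit rate is consistent with the discussion of completion as a key acceptability challenge later in the paper.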

Respondent Characteristics
The community and the service accept users aged 10 to 25. Five respondents reported an age over 25 and were removed from the dataset, as these fall outside the service's range. The remaining respondents' ages ranged from 10 to 25, with a mean of 13.47 (SD=2.09). Most service users completing the POCEM were female, white, and aged between 10-14 years (Table 7). The most frequently selected helpfulness score was 5: 'Lots!', indicating that the content helped the service user considerably. The frequency of the scores decreased as the helpfulness rating decreased, with the rating of 1: 'Not really' selected the least frequently (Figure 6).
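The out-of-range age cleaning described above amounts to a simple filter against the service's 10-25 range. The ages below are invented for illustration:

```python
import statistics

# Hypothetical respondent ages; 31 falls outside the service's
# 10-25 range and is dropped, mirroring the cleaning step in the text.
ages = [12, 14, 11, 13, 17, 25, 10, 31, 13, 15]

valid_ages = [a for a in ages if 10 <= a <= 25]
print(len(ages) - len(valid_ages), "respondent(s) removed")
print("mean:", round(statistics.mean(valid_ages), 2),
      "sd:", round(statistics.stdev(valid_ages), 2))
```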
Age of service users had a small but significant impact on the perceived helpfulness of the community content, with a Kruskal-Wallis test showing a significant effect of age on helpfulness score (H(2)=7.89, p=.02). Post-hoc tests (using Dwass-Steel-Critchlow-Fligner pairwise comparisons) were carried out for the three groups and showed that service users aged 10-14 gave a significantly (p=.03) higher helpfulness score (M=3.8) than older service users. Service users who completed the POCEM after contributing to the community content selected the helpfulness score of 5: 'Lots!' substantially more frequently than service users who read the community content (Figure 8). A t-test revealed a significant difference in the mean helpfulness score between readers and contributors (t(247)=8.8, p<.001): contributors had an average helpfulness score of 4.52, whilst readers had an average of 4.00. Respondents who selected 'Don't know' as the helpfulness score were not shown the rest of the measure, and these responses (n=619) were removed from the domain selection analysis. Of the remaining 4,278 responses with a score of 1, 2, 4, or 5, 10.05% of respondents stopped answering the measure after providing a helpfulness score in part 1, leaving those administrations incomplete.
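A Kruskal-Wallis comparison of this kind can be run with `scipy.stats.kruskal`; a DSCF post-hoc is available in the third-party `scikit-posthocs` package (`posthoc_dscf`). The helpfulness scores below are invented and will not reproduce the statistics reported in the text:

```python
from scipy.stats import kruskal

# Invented 1-5 helpfulness scores for three age bands, mirroring the
# structure (not the data) of the age comparison reported in the text.
ages_10_14 = [5, 5, 4, 5, 3, 5, 4, 5, 5, 4]
ages_15_19 = [4, 3, 5, 4, 4, 2, 5, 3, 4, 4]
ages_20_25 = [3, 4, 4, 5, 2, 3, 4, 3, 5, 3]

# Kruskal-Wallis H-test: a rank-based, non-parametric alternative to
# one-way ANOVA, appropriate for ordinal helpfulness ratings.
h_stat, p_value = kruskal(ages_10_14, ages_15_19, ages_20_25)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
```

The rank-based test is a sensible choice here because the helpfulness scale is ordinal and, as noted below, heavily skewed.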
The most frequently selected high-level domain of support was 'Help me relate to others', with 55.1% of respondents selecting the option. When responses were split by whether the helpfulness feedback was positive or negative (Figure 9), this domain was selected by 58.2% of respondents giving a positive response, compared with 32.3% of those giving a negative response.
Giving 'No response' after part 1 was most common among respondents giving a negative initial rating: 34.5% of respondents did not select a domain (quadrant of support) after giving a negative helpfulness score, compared with 6.8% of respondents who gave a positive helpfulness score (Table 8). A bar plot of the mean helpfulness scores for the different domains (quadrants) reveals the difference between those who did not respond and the average scores for the other domains (Figure 10).

Part 3: Outcome selection
Based on the response frequencies for the items in the third part of the measure, one item was removed. The item 'I now feel able to ask for support outside of Kooth' was selected the least frequently of all items, with only 4.27% of respondents choosing the 'Emotional-Intrapersonal' domain selecting it; across all possible items, it was selected in only 0.6% of responses (Table 9). When the frequency of item selection is examined per domain, 'No Response' was the most frequently selected option for all but three of the domains; 'Relate to others' was one such exception, with 'Felt connection' as its most frequently selected item (Figure 11). As with the domain options, the item-stage response with the lowest mean helpfulness score was 'No response'. The analysis was run after removing cases where respondents did not answer the domain section in part 2 (n=430). A Kruskal-Wallis test showed a significant effect of item selection on helpfulness score (H(22)=407, p<.001), and a post-hoc test (Dwass-Steel-Critchlow-Fligner) showed a significant difference in helpfulness score when respondents selected 'No Response' compared with each other item. There were no significant differences between the other items (Supplementary Table A2).
At the item stage, 'No response' was also the most selected option overall, accounting for 25.38% of item selections. The increase in 'No response' selections in the item part of the measure, compared with the 10.05% of respondents who gave no response in part 2, is likely in part because the 'No response' option was available to all respondents, whereas each item was only shown to the subset of respondents who had selected its domain.

Discussion
In this paper, we outline and discuss a novel multi-phased method for designing an experience measure for young people within an online mental health peer community. We aim to provide a structure for design and testing, whilst reflecting on lessons learnt, to support the future research of other online healthcare platforms and designers. We highlight the value of using mixed methods in an iterative design process when transitioning through phases from initial measure development, refinement, and piloting the measure as well as limitations for each phase.
An experience measure can provide an evaluation of the quality of care received by service users, uncovering key insights into what is and is not working in a service 58 . By using the Donabedian framework 30,32 as the foundation for measure design, we aimed to assess how the structure, processes, and outcomes within a specific mental health online community are experienced and viewed by young people. An instrument that captures the outcomes and mechanisms users report from online community content may serve, in future, to improve and understand the quality of, and the mechanisms sought from, engaging with an online community.
The measure was designed with three main stakeholders in mind: experts, to design the principles of the instrument; young people, as its main users, to provide feedback on what they considered important to measure and their perceptions; and practitioners, to understand the mechanisms and positive outcomes within the community aspects of the platform. As such, it was important for experts, practitioners, and young people to be involved in the development and testing of the measure at each phase. In the development of the measure itself, service users were consulted through a user testing phase, with much of their feedback influencing the final design. Involving young people across development and design was essential for producing an instrument that accurately reflected their needs within an online peer community, and therefore for improving its acceptability 59 .
The use of adapted Estimate-Talk-Estimate Delphi rounds with academic and clinical experts allowed the item pool to be narrowed and the content of the measure improved. The approach was non-standard, with each Delphi round composed of two parts: an initial, independent assessment of the items followed by a group workshop. Unlike systematic approaches to Delphi rounds 60 , the independent and anonymous polling incorporated both qualitative and quantitative feedback. Whilst most Delphi rounds included independent voting, only round 2 required panel members to rate the items; in the other rounds, panel members provided qualitative feedback. Although the rounds may have benefited from a systematic assessment of the relevance and clarity of instrument items, such as Content Validity Indexes 61 , a more dynamic approach to the Delphi rounds provided rich qualitative feedback to influence the iterative design of the measure and enhance the design of the innovation 62 .
A consequence of placing service users, clinicians, and user experience experts at the centre of the design process was an atypical structure for the measure. This uncommon structure may hinder its generalisability and may not adhere to the assumptions required to further investigate the quality of the measurement and its validity in the literature. Feedback in the initial Delphi rounds suggested the three-part structure, with the helpfulness rating placed at the beginning, filtering into the other parts of the instrument that collect information about the mechanisms and outcomes of that community experience.
User testing is a fundamental step for measure development in online contexts. We recommend replicating the highest-fidelity prototype possible when conducting user testing research activities, even though low- and high-fidelity prototypes have shown similar results when compared 63 . We observed how young people benefited from structured activities and more realistic objects for the think-aloud exercises, which in turn can improve the quality of the workshop outputs and findings. User testing is a time-consuming and resource-intensive process, and the volume and complexity of data generated may lengthen the time needed for analysis. The KJ method provides a good opportunity to analyse and synthesise these testing findings, though it requires expertise and focus from researchers to present and facilitate the synthesis of the user testing activities. Affinity maps, on the other hand, can inform beyond the purpose of the research and provide ideas and improvements of general industrial value (e.g. user needs, product satisfaction). User testing will often use a volunteer or purposive sample, meaning that added emphasis should be placed on finding participants who are normally underrepresented. User testing methods like the KJ method can make it challenging to integrate quantitative information or systematic agreement methods into the creation of affinity maps, but doing so may help to reduce researcher bias in the synthesis stage of this method and phase. User testing represents a new and additional phase for online measure development that provides invaluable observations about the digital context, which can influence the instrument's design and validity 64 .
Pilot study results indicated that service users had a positive experience when accessing the online peer community content, with the helpfulness ratings positively skewed. These initial results may indicate social desirability or acquiescence bias effects previously shown in digital contexts and scale creation 65,66 . The potential influence of social desirability was highlighted in the user testing phase, where most of the young people reported that they were more likely to complete the measure if the specific resource was considered helpful, and 4 of the 11 user testing participants believed that after they provided feedback on content, such as articles, the authors of the content would be notified. Qualitative feedback from the user testing phase also revealed users' worry about their responses being seen by community members; to reduce this perception, changes to the instructions and text of the measure were applied in the pilot phase, reinforcing the anonymity of responses (Figure 5). The limited disclosure of POCEM responses, along with the anonymity of the service, should help users not to anticipate a social consequence of their responses 67,68 .
The pilot results showed the support domain 'Relate to others' to be the most frequently selected mechanism for service users providing positive feedback and was the second most selected domain for those with negative feedback too. The average helpfulness score for users selecting the domain was not significantly different to the other support domains, but service users were less likely to drop-out at the item selection stage when selecting this domain. Given that service users who dropped out of the measure early were more likely to give a negative helpfulness score, it is a reasonable assumption that users looking to relate to others through the community were more likely to be satisfied with their experience.
Previous research has illustrated that key reasons young people seek out support in online communities are to feel less alone with their problems, to find a space to talk with peers, and to feel less likely to be judged 69 . The most frequently selected outcomes in the POCEM for positive experiences in the community were 'Felt connection', 'Feel safe', and 'Not judged'. Similar outcomes for a supportive online community have been found previously 70 , demonstrating how online communities may help users to feel less isolated and more supported around mental health 71 . The overall frequency of outcome selection at part three will, in part, be a consequence of the differences in frequency of domain selection at part two of the POCEM. When respondents selected an outcome in part three, rather than dropping out after selecting a domain in part two, there was no significant difference between the average helpfulness scores for each outcome. The positive average helpfulness across the sample may accurately reflect a positive experience for service users in relation to the outcomes selected in the instrument. It may also be a consequence of a ceiling effect in the measure 72 , but it indicates that all of the outcomes included in the measure may be relevant experiences to account for in an online community.
There are several limitations to be considered in the development of the POCEM. This study offers insights into considerations that should be made in the early development of a measure for a digital context. Because the instrument was designed with a specific service in mind, its generalisability to other online peer communities is limited, and the use of experts from the same context may have provided a limited view during the Delphi rounds 73 . It is unclear whether these processes and outcomes can be applied to other supportive communities outside the digital service; future research should examine the transferability of the measure to other communities.
Some of the lessons learnt in the development of the POCEM illustrate the benefits and challenges of designing and testing a measure in a digital environment. The low completion rates among service users starting the measure present a key challenge and a threat to the measure's acceptability within the community. Digital service users reading content in online communities are more likely to be 'lurkers', individuals who read community content but do not actively participate 74 . Digital environments may therefore be more prone to missing or misleading data around instrument completion and administration, and different approaches to data imputation and the treatment of missing data can have consequences for psychometric testing and measure performance 75 . Researchers in digital contexts should be aware of these issues, report missing data, and consider in advance which psychometric properties the measure should be tested against.
A phased approach to instrument development involving co-design and participatory action research with multiple stakeholders can influence the structure and purpose of the measurement. This influence may result in an unconventional structure and administration for the instrument, and may limit or breach the assumptions used to test measurement performance in psychological instrument research, making the psychometric properties of the measure more difficult to establish.
The next steps are to understand whether the measure is a good proximal indicator of active engagement in the community to support mental health, so that the construct validity of the instrument can be established. The POCEM can help us understand the consumption and use of mental health supportive online communities within the service beyond web-based analytics (e.g., how long people read or contribute, and how frequently they engage). Further exploration of acceptability and completion rates in relation to time of engagement, time spent, and other content properties is necessary to provide further evidence on what the instrument measures. Routinely collected data from this instrument in the digital service should help to identify the trends and commonalities deemed helpful in the community, as well as differences across population characteristics, so the measure can be evaluated beyond pilot data. Experience measures like the POCEM can help services to understand the mechanisms and outcomes most frequently achieved by users of an online community. Online peer communities may use experience measures to understand which resources benefit or hinder the individual; active engagement of users with the content can then be monitored and better understood beyond digital analytics, so that a positive and safe community can be maintained to provide peer support and the other factors that contribute to the well-being of individuals engaging in a digital mental health community.

Conclusions
Developing an experience measure for an online community requires a multi-phased systematic process, involving stakeholders, to inform its development and structure. Different stakeholders contribute distinct pieces of information leading to key decisions on the design and development of the instrument; a phased approach, with methodologies appropriate to each time and stage of development, is recommended. Delphi expert rounds and participatory approaches provide rich data that can influence the structure and construct validity of the measure. Further instruments and studies are required to understand the psychosocial factors and causal explanations behind supportive online communities' outcomes. Measurement of the self-reported helpfulness of community content can serve as an indicator of 'active engagement' and help to identify the main reasons users benefit from these communities and their content. The pilot collected information on outcomes achieved by users that is supported by previous literature on online supportive communities, highlighting their importance in reducing isolation and enhancing support. Further work to develop acceptable and valid measures for online mental health communities across different contexts is required, so that services and practitioners can make better use of these online communities and inform users about the main outcomes and mechanisms a mental health online community should seek for positive behavioural change and recovery promotion in its users.

Table A1 and Table A2 are provided in Appendix A.

Funding: This work has been funded as part of service improvement and innovation by Kooth plc. No external funding was sought to carry out this work. As the funder of the project, Kooth plc provided the resources to plan for, collect and analyse data, and write the report for this study.

Supplementary Materials:
Institutional Review Board Statement: All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1964 and its later amendments. Ethical review and approval were not required for the study on human participants in accordance with the local legislation and institutional requirements for service quality improvement and innovation. Written informed consent to participate in this study was provided by all participants' legal guardians/next of kin, and the anonymity of subjects was always preserved.
Informed Consent Statement: Written informed consent was obtained from all subjects involved in the study. All online community anonymous users included in the study consented for their data to be used for research and publication purposes at the time of registering to access the service.

Data Availability Statement:
Restrictions apply to the availability of these data. Data were obtained from Kooth plc and are available upon author and organisational approval at research@kooth.com. Some qualitative field notes may not be available without the permission of Kooth plc and a reasonable request, so that the privacy and consent of participants is preserved.

Acknowledgements: The authors thank … for his independent advice and support, and Dr. Lynne Green, Lex Young, and their teams from the clinical content at Kooth.com.

Conflicts of Interest: CM, LM, AS, SG, GS, HB and KJ are researchers employed by and receive honoraria from Kooth plc. The funder remained independent and did not influence the design or outcome of the study. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.