ABSTRACT
Objective
Lesser toe disorders can cause significant functional impairment and pain, requiring reliable tools for outcome assessment. The American Orthopaedic Foot and Ankle Society (AOFAS) lesser metatarsophalangeal-interphalangeal (MTP-IP) joint scale is a clinician-based instrument frequently used in foot and ankle evaluations, yet no validated Turkish version exists. The aim of this study was to translate, culturally adapt, and evaluate the psychometric properties of the Turkish version of the AOFAS lesser MTP-IP scale.
Methods
The scale was translated following international cross-cultural adaptation guidelines. A total of 43 patients with various lesser-toe pathologies were assessed using the AOFAS lesser MTP-IP, foot and ankle ability measure (FAAM), visual analogue scale, and short form-12 (SF-12). Reliability was evaluated through test-retest intraclass correlation coefficients [ICC (2,1); two-way mixed-effects model with absolute agreement], internal consistency (Cronbach’s alpha), and Bland-Altman agreement analysis. Construct validity was tested by correlating AOFAS scores with FAAM and SF-12 subscales. Floor and ceiling effects were also analyzed.
Results
The Turkish version demonstrated excellent test-retest reliability [ICC (2,1)=0.96] and acceptable internal consistency (α=0.76). Bland-Altman plots revealed no systematic bias. Strong correlations were observed with FAAM-activities of daily living (r=0.93) and FAAM-sports (r=0.75), whereas correlations with the SF-12 physical component summary (r=0.34) and mental component summary (r=0.45) were moderate but significant, which is consistent with the hypothesized convergent and divergent validity. A notable ceiling effect was identified in the AOFAS pain and alignment domains, consistent with the high functional status and low pain levels reported by participants.
Conclusion
The Turkish adaptation of the AOFAS lesser MTP-IP scale is a reliable and valid instrument for evaluating pain, function, and alignment in patients with lesser toe disorders. Its strong psychometric performance supports its use in both clinical and research settings, although the observed ceiling effect should be interpreted in the context of patient characteristics.
INTRODUCTION
Lesser toe disorders, including Freiberg’s disease, metatarsalgia, deformities, and sequelae of trauma, are frequently encountered in orthopaedic practice and lead to pain, altered gait mechanics, and functional limitations. Accurate functional assessment is critical not only for guiding clinical decision-making but also for evaluating treatment efficacy in both conservative and surgical contexts (1, 2).
Among the tools used to evaluate foot and ankle function, the American Orthopaedic Foot and Ankle Society (AOFAS) clinical rating systems have historically been among the most widely adopted instruments. The AOFAS lesser metatarsophalangeal-interphalangeal (MTP-IP) joint scale, in particular, is designed to assess pain, functional limitations, and alignment in the lesser toes (3). Despite its long-standing utility and simplicity, the scale has been subject to methodological critique because of its categorical structure, ceiling effects, and lack of patient-reported input (4, 5). In recognition of these limitations, the AOFAS has officially withdrawn its endorsement of these clinician-based tools, advocating instead for integrating patient-reported outcome measures (PROMs) into both research and practice (6).
Nevertheless, the AOFAS scales remain widely used because of their clinical familiarity and relevance, particularly in reporting postoperative outcomes (7). Recent methodological frameworks—such as COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN)—emphasize the need to evaluate reliability, measurement error, and construct validity through rigorous statistical procedures, including intraclass correlation coefficients (ICC), standard error of measurement (SEM), and minimal detectable change (MDC) (8, 9).
In Türkiye, validated versions of the AOFAS hindfoot and hallux scales are available (10-13); however, there is no psychometrically validated Turkish adaptation of the AOFAS lesser MTP-IP scale. This gap limits the comparability of clinical data across international studies and restricts the use of standardized instruments in Turkish-speaking populations. Despite the increasing emphasis on methodologically sound, culturally adapted tools in musculoskeletal outcomes research, a critical unmet need remains.
Therefore, this study aimed to translate and culturally adapt the AOFAS lesser MTP-IP scale into Turkish and to evaluate its psychometric properties in patients with lesser toe disorders. We hypothesized that the Turkish version would demonstrate strong reliability, internal consistency, and construct validity, supporting its use in both clinical practice and multicenter research efforts.
METHODS
Study Design and Participants
This cross-sectional methodological study was conducted at a tertiary-level orthopedic outpatient clinic between January and June 2024. Hospital records were used to identify patients who had been treated for a lesser toe disorder within the previous two to four years, and these patients were invited to attend a follow-up visit. A total of 43 patients diagnosed with pathologies of the lesser toes (e.g., Freiberg’s disease, deformities, or sequelae of trauma) were enrolled (Figure 1). Although the COSMIN guidelines recommend at least 50 participants for an adequate assessment of reliability and validity, previous methodological reports suggest that 5-10 participants per item may be acceptable for focused validation studies (14). The AOFAS lesser MTP-IP scale consists of eight items; with 43 patients, the participant-to-item ratio was 5.38, which meets this criterion. Together with a post-hoc power >0.99 for the observed ICC, this suggests adequate precision despite the modest sample size. Inclusion criteria were age ≥18 years, literacy in Turkish, and clinical or radiographic evidence of a lesser toe disorder. Patients with systemic neurological conditions, prior major foot surgery, or cognitive impairment were excluded.
This study was approved by the Clinical Research Ethics Committee of University of Health Sciences Türkiye, Bakırköy Dr. Sadi Konuk Training and Research Hospital (approval no: 2022-06-05, date: 21.03.2022). Written informed consent was obtained from all participants in accordance with the Declaration of Helsinki.
Translation and Cultural Adaptation
The AOFAS lesser MTP-IP scale was translated into Turkish following the five-step process outlined by Beaton et al. (15): (1) forward translation by two bilingual experts; (2) synthesis of translations; (3) back translation by two native English speakers; (4) expert committee review; and (5) pretesting with 20 patients to assess clarity and comprehension. Minor linguistic adjustments were made based on participant feedback to enhance cultural relevance. The translators’ professional backgrounds (medical vs. non-medical) and linguistic expertise were reported to ensure conceptual equivalence and transparency.
Outcome Measurements
AOFAS Lesser MTP-IP Scale (Turkish Version)
The AOFAS lesser MTP-IP scale is a clinician-administered tool designed to evaluate outcomes in patients with disorders of the lesser toes (3). It comprises three domains: pain (40 points), function (45 points), and alignment (15 points), with a total possible score of 100, where higher scores indicate better clinical status. The functional domain includes items related to activity limitations, footwear requirements, and mobility, whereas alignment is rated by the clinician based on joint positioning.
In this study, the original English version of the AOFAS lesser MTP-IP scale was translated and culturally adapted into Turkish using a standardized methodology (Table 1) (15). This Turkish version was administered at two time points, two weeks apart, to assess reliability and construct validity.
Visual Analogue Scale (VAS)
Pain intensity was assessed using a 10-centimeter VAS, which is a validated and widely used unidimensional measure of pain (16). Patients were asked to mark their level of pain on a horizontal line ranging from 0 (no pain) to 10 (worst imaginable pain). VAS scores were recorded in three conditions: at rest, during activity, and at night. This enabled multidimensional pain profiling and facilitated comparison with the AOFAS pain domain. The activity condition refers to pain experienced during daily activities, such as walking, climbing stairs, or performing occupational tasks.
Foot and Ankle Ability Measure
The foot and ankle ability measure (FAAM) is a patient-reported outcome measure (PROM) developed to assess physical function in individuals with lower-extremity musculoskeletal disorders (17). It includes two subscales:
• Activities of daily living (ADL): 21 items,
• Sports: 8 items.
Each item is scored on a 5-point Likert scale (0=unable to do; 4=no difficulty), and raw scores are transformed into percentage scores (0-100), with higher scores reflecting greater functional ability. The Turkish version of the FAAM has been validated and shown to have high internal consistency and construct validity (18).
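The percentage transformation described above can be sketched as follows. This is a minimal illustration and not the scoring code used in the study; the function name `faam_percent`, the handling of unanswered items (dropped from both the numerator and the denominator), and the example responses are assumptions introduced here.

```python
from typing import Optional, Sequence


def faam_percent(item_scores: Sequence[Optional[int]]) -> float:
    """Convert FAAM item responses (0-4) into a 0-100 percentage score.

    Assumption: unanswered items are passed as None and are excluded,
    so the maximum attainable raw score is 4 x (number of answered items).
    """
    answered = [s for s in item_scores if s is not None]
    if not answered:
        raise ValueError("at least one item must be answered")
    if any(s not in range(5) for s in answered):
        raise ValueError("item scores must be integers from 0 to 4")
    max_raw = 4 * len(answered)
    return 100.0 * sum(answered) / max_raw


# Hypothetical responses for one patient (not study data).
adl_items = [4] * 18 + [3, 3, 2]          # 21 ADL items
sports_items = [4, 3, 3, 2, 2, 4, 3, 3]   # 8 sports items

print(f"FAAM-ADL:    {faam_percent(adl_items):.1f}")
print(f"FAAM-sports: {faam_percent(sports_items):.1f}")
```

Keeping the two subscales as separate calls mirrors the text, which reports FAAM-ADL and FAAM-sports as distinct scores.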
Short Form-12 (SF-12) Health Survey
The SF-12 is a generic measure of health-related quality of life derived from the original 36-item SF-36 questionnaire (19). It yields two component summary scores:
• Physical component summary (PCS),
• Mental component summary (MCS).
These scores are standardized [mean=50, standard deviation (SD)=10], with higher scores indicating better self-perceived health. Although not specific to the foot and ankle, the SF-12 provides insight into general health status and permits evaluation of divergent validity when used alongside region-specific measures such as the AOFAS or FAAM (20).
All instruments were administered in Turkish. Validated Turkish versions of the FAAM and SF-12 were used to ensure linguistic and conceptual equivalence; for the SF-12, the Turkish version validated by Soylu and Kütük (21) was used. The SF-12 served to test divergent validity because it reflects general health rather than region-specific function. We hypothesized strong correlations with FAAM subscales and weaker correlations with SF-12 PCS and MCS, consistent with COSMIN recommendations for hypothesis-driven construct validation.
Statistical Analysis
All statistical analyses were performed using IBM SPSS statistics, version 27.0 (IBM Corp., Armonk, NY, USA). Descriptive statistics were used to summarize demographic and clinical characteristics. Frequencies and percentages were reported for categorical variables; means and standard deviations were reported for continuous variables. The normality of continuous data was evaluated using the Shapiro-Wilk test.
An ICC value of 0.75 was used as the minimum acceptable threshold for reliability in accordance with established guidelines (22). The observed ICC of 0.96 yielded a post-hoc power of approximately 1.00 (n=43, α=0.05), confirming sufficient statistical power to detect excellent test-retest reliability.
Internal consistency of the Turkish version of the AOFAS lesser MTP-IP scale was assessed using Cronbach’s alpha. Values between 0.70 and 0.95 were considered acceptable (9, 23).
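For readers who wish to reproduce the internal consistency calculation outside SPSS, Cronbach’s alpha can be sketched with the standard formula α = k/(k−1) × (1 − Σ item variances / variance of total scores). The function name and the item scores below are hypothetical and serve only as an illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    items: list of k lists, each holding one item's scores for the same
    n patients (same patient order in every list).
    """
    k = len(items)
    n = len(items[0])
    if k < 2 or n < 2:
        raise ValueError("need at least two items and two patients")
    if any(len(item) != n for item in items):
        raise ValueError("all items must have the same number of patients")
    # Total score per patient across all items.
    totals = [sum(item[p] for item in items) for p in range(n)]
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance(totals)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical scores: 3 items rated for 5 patients (not study data).
example_items = [
    [2, 3, 4, 5, 1],
    [3, 3, 5, 4, 2],
    [2, 4, 4, 5, 1],
]
alpha = cronbach_alpha(example_items)
print(f"alpha = {alpha:.2f}, acceptable = {0.70 <= alpha <= 0.95}")
```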
Test-retest reliability was assessed by calculating ICC using a two-way mixed-effects model with absolute agreement [ICC (2, 1)]. ICC values were interpreted as follows: <0.40=poor, 0.40-0.59=fair, 0.60-0.74=good, and ≥0.75=excellent.
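The single-measure, absolute-agreement ICC can be sketched from two-way ANOVA mean squares as shown below. This is an illustrative reimplementation rather than the SPSS procedure used in the study; the function name and the example scores are hypothetical. For single measures with absolute agreement, the computational formula is the same whether the session effect is treated as random or fixed; only the interpretation differs.

```python
def icc_absolute_single(scores):
    """Single-measure, absolute-agreement ICC from a two-way layout.

    scores: one row per patient, one column per session,
            e.g. [[test, retest], ...].
    """
    n = len(scores)        # patients
    k = len(scores[0])     # sessions
    if n < 2 or k < 2:
        raise ValueError("need at least two patients and two sessions")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-patient mean square
    ms_cols = ss_cols / (k - 1)                 # between-session mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def interpret_icc(icc):
    """Apply the interpretation bands stated in the text."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Hypothetical test-retest total scores for six patients (not study data).
example = [[100, 98], [85, 87], [72, 70], [90, 90], [65, 68], [95, 95]]
icc = icc_absolute_single(example)
print(f"ICC = {icc:.3f} ({interpret_icc(icc)})")
```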
Measurement error was evaluated using the SEM and the MDC95. The SEM, which reflects the precision of the measurement, was computed from the ICC as SD × √(1 − ICC), where SD is the standard deviation of the scores. The MDC at the 95% confidence interval (CI) level (MDC95) represents the smallest change that exceeds measurement error and was calculated as SEM × 1.96 × √2 (22).
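The two formulas above can be checked with a short sketch. The function names are illustrative, and the SD of 9.0 used in the first call is a hypothetical value chosen only to demonstrate the calculation; the SEM of 1.80 is the value reported in the Results.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement: SD x sqrt(1 - ICC)."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem):
    """Minimal detectable change at the 95% level: SEM x 1.96 x sqrt(2)."""
    return sem * 1.96 * math.sqrt(2.0)


# Hypothetical SD (not reported in the text) combined with the observed ICC.
print(round(sem_from_icc(sd=9.0, icc=0.96), 2))

# Using the reported SEM of 1.80 points for the AOFAS total score.
print(round(mdc95(1.80), 2))  # 4.99
```

With the rounded SEM of 1.80, the formula gives 4.99 points, which agrees with the reported MDC95 of 5.00 to within rounding.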
Construct validity was evaluated using Spearman’s rank correlation coefficients between AOFAS total scores and scores on external instruments (FAAM, VAS, SF-12). Correlation strength was interpreted as small (0.10-0.29), moderate (0.30-0.49), strong (0.50-0.69), and very strong (≥0.70), according to Cohen’s criteria.
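As an illustration of this step, the sketch below computes Spearman’s rank correlation (Pearson correlation of average ranks) and labels its magnitude with the thresholds given above. The function names and the "negligible" label for |r| < 0.10 are assumptions introduced here; the coefficients passed to the classifier are those reported in the Results.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def correlation_strength(r):
    """Label |r| using the thresholds stated in the text."""
    a = abs(r)
    if a >= 0.70:
        return "very strong"
    if a >= 0.50:
        return "strong"
    if a >= 0.30:
        return "moderate"
    if a >= 0.10:
        return "small"
    return "negligible"  # label for |r| < 0.10 is an assumption


# Coefficients reported in the Results section.
for name, r in [("FAAM-ADL", 0.93), ("FAAM-sports", 0.75),
                ("SF-12 PCS", 0.34), ("SF-12 MCS", 0.45)]:
    print(name, r, correlation_strength(r))
```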
Agreement between test and retest scores was further visualized using Bland-Altman plots, which graphically display the difference between paired measurements against their mean. Systematic bias and the 95% limits of agreement (LoA; mean±1.96×SD of differences) were examined to determine interchangeability between repeated assessments.
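The bias and limits of agreement underlying a Bland-Altman plot can be computed as in the following sketch. The function name, the direction of the difference (first minus second assessment), and the example scores are assumptions made for illustration; the text does not state which direction was used.

```python
from statistics import mean, stdev


def bland_altman(first, second):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    Differences are taken as first - second (an assumption here).
    LoA = mean difference +/- 1.96 x SD of the differences.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired lists with at least 2 values")
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)
    sd_diff = stdev(diffs)
    return bias, bias - 1.96 * sd_diff, bias + 1.96 * sd_diff


# Hypothetical test and retest total scores (not study data).
t1 = [90, 85, 100, 70, 95]
t2 = [92, 85, 98, 73, 95]

bias, lower, upper = bland_altman(t1, t2)
print(f"bias = {bias:.2f}, LoA = {lower:.2f} to {upper:.2f}")

# Points for the plot: x = mean of each pair, y = difference of each pair.
plot_points = [((a + b) / 2, a - b) for a, b in zip(t1, t2)]
```

Because the limits are symmetric around the bias, the midpoint of a reported LoA interval should equal the reported mean difference, which offers a quick consistency check on tabulated values.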
Floor and ceiling effects were evaluated by calculating the proportion of patients scoring the minimum (0-10) or maximum (90-100) possible scores at baseline. Effects were considered present if more than 15% of participants achieved either extreme (24). A p-value of <0.05 was considered statistically significant.
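The floor and ceiling criterion described above can be sketched as follows, using the score bands (0-10 and 90-100) and the 15% threshold stated in the text. The function name, the returned dictionary keys, and the example scores are hypothetical.

```python
def floor_ceiling(scores, floor_max=10, ceiling_min=90, threshold=0.15):
    """Proportion of patients in the floor and ceiling bands of a 0-100 score.

    An effect is flagged when the proportion exceeds `threshold` (15%).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_prop = sum(1 for s in scores if s <= floor_max) / n
    ceiling_prop = sum(1 for s in scores if s >= ceiling_min) / n
    return {
        "floor_proportion": floor_prop,
        "ceiling_proportion": ceiling_prop,
        "floor_effect": floor_prop > threshold,
        "ceiling_effect": ceiling_prop > threshold,
    }


# Hypothetical baseline total scores for ten patients (not study data).
baseline = [100, 95, 90, 85, 70, 100, 100, 60, 92, 88]
result = floor_ceiling(baseline)
print(result)
```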
RESULTS
Participant Characteristics
A total of 43 patients (mean age: 38.60±18.70 years; 74.4% female) were included. The most common diagnoses were Freiberg’s disease (65.1%), lesser toe fractures (27.9%), and deformities (7.0%). Surgical intervention was performed in 79.1% of cases, and 76.7% of patients reported normal functional status at baseline. Detailed demographic and clinical characteristics are presented in Table 2.
Reliability and Internal Consistency
The Turkish version of the AOFAS lesser MTP-IP scale demonstrated excellent test-retest reliability. The ICC (2, 1) for the total score was 0.96 (95% CI: 0.95-0.98), with similarly high values observed for the subscales (pain: 0.96; function: 0.94; alignment: 0.89). Internal consistency, as measured by Cronbach’s alpha, was acceptable (α=0.76 for the total score and the function subscale), indicating a coherent item structure (Table 3).
Measurement Error and Agreement
The SEM and MDC95 were calculated as 1.80 and 5.00 points, respectively, for the AOFAS total score. Bland-Altman analysis showed mean differences between test-retest scores (T1, T2) of -0.65 (LoA: -7.65 to 6.35) for AOFAS total; 0.42 (LoA: -1.54 to 2.38) for VAS total; -0.02 (LoA: -7.25 to 5.34) for SF-12 PCS; 0.54 (LoA: -4.82 to 5.90) for SF-12 MCS; 0.07 (LoA: -16.05 to 16.19) for FAAM-sports; and -0.36 (LoA: -7.47 to 6.75) for FAAM-ADL. There was no evidence of systematic bias, and the limits of agreement were generally narrow, supporting the scales’ reproducibility (Figure 2). The vast majority of participants fell within ±1.96 SD of the mean difference.
Internal Consistency
Cronbach’s alpha was 0.76 for both the total AOFAS scale and the function subscale, indicating acceptable internal consistency and item coherence within the domains.
Construct Validity
Strong negative correlations were found between the AOFAS total score and VAS scores for pain at rest (r=-0.81), during activity (r=-0.75), and at night (r=-0.82), all statistically significant (p<0.001). AOFAS scores were also strongly correlated with FAAM-ADL (r=0.93) and FAAM-sports (r=0.75), demonstrating convergent validity. Moderate correlations were observed with SF-12 PCS (r=0.34) and MCS (r=0.45), supporting divergent validity (Table 4).
Floor and Ceiling Effects
A ceiling effect was observed in the AOFAS total score, with 60.5% of participants achieving the maximum score. This was most prominent in the alignment and pain domains (90.7% and 76.7%, respectively). Similar ceiling effects were found in the FAAM-ADL (53.5%) and FAAM-sports (41.9%) subscales (Table 3). In contrast, VAS scores exhibited notable floor effects, with approximately half of participants reporting no pain. SF-12 scores did not display significant floor or ceiling effects, though both PCS and MCS subscales were skewed toward higher values, indicating good perceived health.
Importantly, the observed ceiling effect in AOFAS scores aligns with low pain levels and high functional status among participants, suggesting it reflects favorable clinical status rather than a measurement limitation.
DISCUSSION
This study provides compelling evidence that the Turkish adaptation of the AOFAS lesser MTP-IP scale is a reliable and valid instrument for evaluating pain, function, and alignment in patients with lesser toe pathologies. With excellent test-retest reliability [ICC (2, 1)>0.9], acceptable internal consistency, strong correlations with the region-specific FAAM, and only moderate correlations with the general-health SF-12, the scale demonstrates robust construct validity and temporal stability, which are key criteria for outcome measures used in clinical research and practice.
Although the AOFAS no longer officially endorses its legacy clinician-based scores because of concerns about responsiveness and lack of patient-centeredness (5, 6), these instruments remain in widespread clinical use in orthopaedics. This paradox underscores the need for well-validated, culturally adapted versions that enable comparability across studies while acknowledging the inherent limitations of the tool. Cultural validation enables comparisons across historical datasets and ongoing multicenter studies, ensuring continuity in outcomes research.
One of the novel contributions of this study is the inclusion of SEM and MDC95, which quantify measurement error and the smallest change that can be distinguished from it; however, they do not indicate clinically meaningful change (minimal clinically important difference). In addition, presenting Bland-Altman analysis, SEM, MDC95, and floor-ceiling effect results for all administered instruments strengthens the comprehensiveness of our evaluation. The Bland-Altman analyses enhance methodological rigor by demonstrating low systematic bias and good agreement between repeated assessments, an aspect often overlooked in prior validation studies. Mean differences close to zero and the majority of data points falling within the ±1.96 SD limits supported reproducibility. Narrow limits of agreement indicated high reliability for the AOFAS, VAS, SF-12 PCS, SF-12 MCS, and FAAM-ADL scales, whereas wider limits for the FAAM-sports scale suggested greater interindividual variability.
A particularly noteworthy finding was the pronounced ceiling effect observed in the AOFAS pain and alignment domains. The substantial ceiling effect likely reflects the predominantly high-functioning postoperative status of the participants rather than a limitation of the scale itself. Nevertheless, this may restrict the scale’s responsiveness in populations with more severe symptoms or functional impairment. Although traditionally viewed as a psychometric limitation, this effect was contextually consistent with the clinical status of the sample: nearly half of the participants reported no pain across VAS domains, and substantial proportions of participants reached maximum FAAM scores. These results parallel those reported by Whittaker et al. (25) and Ponkilainen et al. (26), who noted similar ceiling effects in AOFAS midfoot evaluations, effects that could not be fully explained by other outcome metrics. In this light, the ceiling effect in our study appears to reflect favorable clinical outcomes rather than inadequate scale sensitivity.
Nevertheless, some limitations inherent to the AOFAS scoring structure must be acknowledged. Its categorical response format may reduce granularity and responsiveness, particularly in high-functioning or postoperative populations. Future iterations of the scale may benefit from more continuous scaling or the integration of patient input, aligning with broader trends in outcome measurement science.
In this context, the complementary use of PROMs is increasingly essential. Instruments such as the FAAM and frameworks such as the Patient-Reported Outcomes Measurement Information System (PROMIS) offer improved responsiveness, broader biopsychosocial coverage, and, in the case of PROMIS, adaptive digital delivery. The integration of PROMIS with clinician-based scales has been shown to enhance the precision and patient-centeredness of musculoskeletal outcome measurement (27, 28).
Furthermore, compared with other language versions of the AOFAS scale—including Persian, Italian, and Arabic adaptations—this study distinguishes itself by incorporating SEM, MDC95, and Bland-Altman analyses, which are rarely reported (29-31). These enhancements align with COSMIN standards for outcome measure validation (8, 9) and address important methodological gaps in the existing literature.
Study Limitations
Strengths of the study include rigorous adherence to international guidelines for cross-cultural adaptation (15), the use of multiple comparator instruments, and comprehensive statistical analysis. However, the study is limited by its modest sample size, which precluded subgroup analysis by diagnosis or treatment type. COSMIN recommendations suggest a minimum of 50 participants for reliability studies, whereas our sample included 43. The ICC of 0.96 corresponded to a post-hoc statistical power of >0.99, indicating that the precision of the reliability estimate was adequate despite the smaller sample; nevertheless, future studies with larger, more heterogeneous cohorts are warranted to confirm the stability and generalizability of these results. The scale’s responsiveness to clinical change over time was not assessed and warrants further longitudinal research. The pronounced ceiling effects, while reflecting high functional recovery, may limit sensitivity in patients with more severe dysfunction and should be considered a psychometric constraint. Finally, the patient cohort consisted predominantly of postoperative individuals with high functional recovery, which may limit the applicability of the findings to patients with more severe deformity or persistent pain.
CONCLUSION
The Turkish version of the AOFAS lesser MTP-IP scale is a psychometrically robust and clinically practical tool for evaluating lesser toe disorders. The observed ceiling effect, rather than undermining the scale’s utility, reflects the functional recovery and low symptom burden of the study population. By providing SEM and MDC95 values and demonstrating strong construct validity, this study enhances the interpretability and applicability of the AOFAS scale in both clinical and research settings. Future studies in larger, more heterogeneous populations should confirm the measurement properties of the Turkish AOFAS lesser MTP-IP scale, investigate its responsiveness, and explore its integration with PROMIS-based instruments to support a more holistic and precise approach to musculoskeletal outcome assessment.


