Improving The Evaluation of Benign Low Back Pain

SPINE Volume 24, Number 10, pp 952-960
©1999, Lippincott Williams & Wilkins, Inc.

Anne Marriott, PhD,* Nicholas M. Newman, MD,† Serge A. Gracovetsky, PhD,*‡ Mark P. Richards, BSc,* and Steeve Asselin, MSc*

Study Design. A prospective, blind study was conducted to investigate the factors underlying the decisions of expert clinicians in diagnosis of acute, benign low back pain, compared with results obtained with an automated physical examination by machine. From the results, a strategy to significantly improve clinical diagnosis in cases of discordance was determined.

Objectives. To identify factors in the clinical assessment of low back pain that indicate when independent diagnostic testing would be useful.

Summary of Background Data. The clinical evaluation of low back pain is often dominated by subjective reports of pain. Published medical literature has underscored several inherent weaknesses of the clinical examination, and concerns have been raised about its effectiveness for assessing patients with low back pain. Thus, it has been proposed that objective measures to complement the clinician's examination would be beneficial in the formulation of dependable diagnoses.

Methods. Randomly designated subjects, who in describing their conditions were objective or role playing, were assessed by clinicians and a machine for diagnosis of low back pain versus normal backs. Each subject's pain assessment was compared with a gold standard that was established by experts in low back pain. Components of the clinical examination were analyzed to assess which were the most informative in making a reliable diagnosis. The information content of the machine assessment was also analyzed and a strategy to complement the clinical diagnosis with the machine diagnosis determined.

Results. Discordance among the various components of the clinical examination was a strong indicator of when the efficacy of the clinical examination dropped below a random level of decision making. When there was discordance, incorporating the functional evaluation by machine into the clinical diagnosis improved the performance of the clinician. Notably, in nonobjective subjects, the accuracy of diagnosis was enhanced by as much as 69%.

Conclusions. It is possible to improve the accuracy of clinical diagnosis by incorporating a functional evaluation by machine when there is discordance between physical examination findings and reported pain. [Key words: clinician, diagnosis, discordance, function, low back pain, machine, spine] Spine 1999;24:952-960

From *Spinex Medical Technologies, Inc., the †Centre Hospitalier de l'Université de Montréal, and ‡Concordia University, Montreal, Quebec, Canada.
The design of the machine used in the study was made possible by the support of the National Research Council of Canada (NRCC-IRAP program). The clinical component of this research was funded by Institut de Recherche en Santé et Sécurité du Travail, the research division of the Quebec Worker's Compensation Board. Dr. Marriott is a recipient of a Natural Sciences and Engineering Research Council of Canada (NSERC) Industrial Research Fellowship.
Acknowledgment date: April 10, 1998.
First revision date: July 15, 1998.
Acceptance date: September 28, 1998.
Device status category: 3/11.

Back pain is the most common and costly ailment of work-injured adults, with a minority of compensated workers with low back pain (LBP) accounting for most of the costs. The high and increasing impact of back pain on worker absenteeism and institutional compensatory costs throughout the industrialized world indicates that something fundamental in the diagnosis and treatment of LBP could be improved.
A key factor in the evaluation of LBP to support decisions concerning compensation and rehabilitation is the clinical examination. However, the scientific medical literature does not support the efficacy of LBP diagnosis; in fact, published data underscore its many inherent weaknesses. As results in a recent Québec Worker's Compensation Board study showed, clinical performance is highly variable in the evaluation of LBP, to the extreme that clinicians cannot reliably determine whether a patient has a genuine injury. The Québec study results showed that the weaknesses in LBP evaluation are mainly due to clinical dependency on reported pain. This reliance on reported pain, often to the exclusion of any objective findings, can lead to inaccurate diagnoses of LBP. Furthermore, analysis of the components of the clinical evaluation of LBP shows that even the presumed objective portion of the evaluation (the physical examination) is influenced by pain and other subjective considerations.

Although reported pain gives the clinician a reason to suspect a disease process in LBP, it may not be a reliable indicator of a subject's ability to function. There are often discrepancies among the level of pain, loss of function, and physical signs during clinical evaluations. Having no reliable objective tests to determine dysfunction in LBP, the clinician may decide that a subject has a disorder, despite normal findings in a physical examination. Unless the clinician acquires the means to assess LBP reliably, independent of the subjective reports of the patient, then the clinician has no option but to rely on reported pain.
The strong dependency of clinicians on reported pain may lead them to overrate the presence of disease, use inappropriate treatments, prescribe unnecessary imaging tests, and generate medical opinions that are responsible for many disputes and perhaps the perpetuation of increasing costs in LBP management. Undoubtedly, the evaluation of LBP must be improved, but so far clinicians have been unable to overcome the limitations of current clinical evaluations.

To improve the situation, the use of newer technologies has been proposed to provide functional information that is independent of reported pain. These technologies have been used to assist the clinician with varying degrees of success. However, until recently, all technologies required some degree of subjective interpretation, which makes the results dependent on the clinicians using them. The technology used in the current study (the Spinoscope; Spinex Medical Technologies, Inc., Montreal, Quebec, Canada; hereinafter referred to as the machine) tracks the motion of skin markers placed along the spine and directly assesses the information content of the physical examination, independent of reported pain. By following strict logical rules, the machine processes the objective data collected on a subject to reach a normal-abnormal functional diagnosis that compares well with a clinical diagnosis.

Because quality of assessment is an integral part of rehabilitation management, it is important that the clinician have a strategy to appreciate the effectiveness of the clinical examination, so that remedial steps can be taken to improve the diagnosis whenever appropriate. We propose that quality of assessment can be monitored by measurable clinical indicators.
In the case of LBP, because the level of objectivity of a subject's presentation of his or her symptoms is unknown, and because clinical performance depends on the objectivity of reported pain, an indicator of objectivity must be found before the clinical examination can be improved. In this article, at least one indicator is identified: discordance. The clinicians in this study each had to fill in a detailed form listing all physical findings; a form listing all other evidence, such as reported pain and history; and a form listing the final clinical diagnosis. Therefore, the data clearly separated the physical findings from the clinical diagnosis. Discordance occurs when either one of two situations arises: when two clinicians examining the same subject disagree (interclinician discordance), or when the findings of an individual clinician's physical examination contradict the final clinical decision (intraclinician discordance). Discordance has a substantial adverse effect on the diagnostic performance of the clinician, whereas that of the machine is affected in only a minor way. In this study, a substantial (up to 69% in the case of nonobjective subjects) increase in diagnostic efficacy was brought about by incorporating the functional assessment by machine into the clinical diagnosis.

Methods

Data Collection. The study protocol has been described previously. Briefly, two clinicians who specialize in LBP (selectors) selected 41 subjects with low back injury (acute, benign, work-related LBP), and 46 normal healthy subjects according to strict inclusion-exclusion criteria (referred to as the gold standard). All 87 subjects were then randomly divided into two groups of 44 and 43 subjects. Members of the first group were instructed to report their true conditions (objective group); members of the second group were coached to simulate by role playing the opposite of their true conditions (nonobjective).
Two LBP experts (evaluators) performed a blind re-evaluation of the subjects, each recording physical findings and overall clinical conclusions. Separately and in consensus, the evaluators established the subject's status (certainly normal, probably normal, unknown, probably abnormal, certainly abnormal). Although blind to the true condition of the subjects, the evaluators were informed that some of them would be role playing.

Immediately after the evaluators' assessment, each subject underwent an automated external low back evaluation by machine. Note that part of the physical examination consists of observing the patient's skin as the patient moves and is one of the many elements considered by the clinician when making a clinical diagnosis. The machine performs this aspect of the clinical examination by monitoring the movement of skin-surface markers (light-emitting diodes) as the subject performs dynamic weight lifting and, by using a fixed set of rules, reaches a conclusion on the normalcy of the subject. For direct comparison with the single clinical decision made by the evaluators, the machine's analysis was averaged over all loads that were lifted by the subject (load-averaged decision). This single averaged decision from the machine provided one classification per subject and was considered to be the machine's equivalent of the evaluators' assessment. Evaluators' diagnoses and machine assessments were compared with the gold standard, as described previously.

Data Classification.
For each subject, the following data were considered:
• The consensus among the selectors (referred to as the gold standard)
• The clinical decision of evaluator 1
• The clinical decision of evaluator 2
• The clinical decision by consensus of evaluators 1 and 2
• The physical examination by evaluator 1
• The physical examination by evaluator 2
• The machine's load-averaged decision

With these data, the following definitions were made: Interclinician discordance: when the clinical decisions of evaluator 1 and evaluator 2 were contradictory in the same subject (i.e., normal and abnormal); intraclinician discordance: when the findings in physical and clinical examinations by evaluator 1 were contradictory in the same subject and when the findings in physical and clinical examinations by evaluator 2 were contradictory in the same subject.

Because each of the 87 subjects was examined independently by each evaluator, there were 174 physical and clinical examinations, resulting in 87 consensus decisions on findings in the clinical examination. The machine provided one decision in each subject, resulting in 87 assessments.

Data Analysis. The performances of the evaluators and the machine were analyzed in comparison with the gold standard. The categories of interest are listed below, and the distribution of the subgroups is illustrated in Figure 1.
• The clinical consensus between evaluators (n = 87)
• The corresponding machine (load-averaged) decision (n = 87)
• In the subgroup of subjects in whom the evaluators noted interclinician discordance (n = 4 + 11 = 15)
  • The performance of the evaluators' consensus
  • The performance of the machine
• In the subgroup of subjects in whom the evaluators noted intraclinician discordance (n = 21 + 11 = 32)
  • The performance of the evaluators' consensus
  • The performance of the machine
• In the subgroup of subjects in whom either interclinician or intraclinician discordance occurred (n = 21 + 11 + 4 = 36)
  • The performance of the evaluators' consensus
  • The performance of the machine
• In the subgroup of subjects in whom there was no interclinician discordance, the performance of the evaluators' consensus [n = 87 - (11 + 4) = 72]
• In the subgroup of subjects in whom there was no intraclinician discordance, the performance of the evaluators' consensus [n = 87 - (21 + 11) = 55]
• In the subgroup of subjects in whom there was no interclinician or intraclinician discordance, the performance of the evaluators' consensus [n = 87 - (21 + 11 + 4) = 51]
• In the entire group (n = 87), the performance of the strategy consisting of accepting the clinical decision when no interclinician or intraclinician discordance was noted by the evaluators (n = 51) and accepting the machine decision when there was either interclinician or intraclinician discordance (n = 36)

Figure 1. Venn diagram. The number of subjects in whom the evaluators noted interclinician discordance was 15 (4 + 11), and the number of those in whom the evaluators noted intraclinician discordance was 32 (11 + 21). The number of subjects in whom either interclinician or intraclinician discordance occurred was 36 (4 + 11 + 21), and the number of those with neither interclinician nor intraclinician discordance was 51 [87 - (4 + 11 + 21)]. Note that in 11 subjects, both interclinician and intraclinician discordance occurred.

Statistical Analysis. Subjects were organized into three different groups for statistical analysis: all 87 subjects (46 normal, 41 abnormal); 44 objective subjects (23 normal, 21 abnormal); 43 nonobjective subjects (23 normal, 20 abnormal). It is desirable to use a statistical technique that can process all the data so that comparisons are meaningful.
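The discordance definitions and the combined decision strategy listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the `Subject` record, its field names, and the helper functions are hypothetical, decisions are reduced to the labels "normal" and "abnormal", and intraclinician discordance is assumed to be flagged when it occurs for either evaluator.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    # Hypothetical record; every field holds "normal" or "abnormal".
    clinical_1: str  # clinical decision of evaluator 1
    clinical_2: str  # clinical decision of evaluator 2
    physical_1: str  # physical examination finding of evaluator 1
    physical_2: str  # physical examination finding of evaluator 2
    consensus: str   # clinical decision by consensus of evaluators 1 and 2
    machine: str     # machine's load-averaged decision


def interclinician_discordance(s: Subject) -> bool:
    # The two evaluators' clinical decisions contradict each other.
    return s.clinical_1 != s.clinical_2


def intraclinician_discordance(s: Subject) -> bool:
    # Assumption: flagged if either evaluator's physical finding
    # contradicts that evaluator's own clinical decision.
    return s.physical_1 != s.clinical_1 or s.physical_2 != s.clinical_2


def strategy_decision(s: Subject) -> str:
    # Accept the machine decision when any discordance is noted,
    # otherwise accept the clinical consensus.
    if interclinician_discordance(s) or intraclinician_discordance(s):
        return s.machine
    return s.consensus


# Subgroup sizes recomputed from the Venn diagram counts in Figure 1.
ONLY_INTER, BOTH, ONLY_INTRA, TOTAL = 4, 11, 21, 87
n_inter = ONLY_INTER + BOTH                 # interclinician discordance
n_intra = ONLY_INTRA + BOTH                 # intraclinician discordance
n_either = ONLY_INTER + BOTH + ONLY_INTRA   # either kind of discordance
n_neither = TOTAL - n_either                # no discordance at all
```

Keeping the two discordance checks as separate functions mirrors the separate subgroups analyzed in the study, so each subgroup can be selected with a single predicate.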
Because the receiver operating characteristic (ROC) cannot be used for all the above categories, the data generated by the evaluators and by the machine were processed using the 2 × 2 contingency table technique. This technique consists of counting the actual successes and failures to generate true-positive, true-negative, false-positive, and false-negative values, using the gold standard as a reference. Performance is defined as the sum of the true-negative plus the true-positive values divided by the total number of subjects in the category considered. The differences between the ROC and the contingency table technique in some categories are reflected in the small numeric variations among previously reported data and those in the current study.

Likelihood ratios (LRs) were also calculated. The LR measures the true-positive rate discounted for the false-positive rate and is defined as

\[ \mathrm{LR} = \frac{\text{sensitivity}}{1 - \text{specificity}} \]

where sensitivity is defined as the rate of true-positives, and specificity equals the rate of true-negatives. Given the pretest probability of a condition, the higher the LR, the greater the posttest probability:

\[ \text{Pretest odds for LBP} \times \text{LR for diagnostic test} = \text{Posttest odds for LBP} \]

An LR of 1 means that no information was collected from the test, nor has the initial chance of being correct been altered. An LR less than 1 indicates that the clinician is in greater doubt after having tested the subject, and an LR greater than 1 shows that the test was informative and the probability of a better diagnosis increased when the test results were obtained.

Results

The percentage of interclinician and intraclinician discordance noted in each of the three groups studied is shown in Table 1.
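The contingency table quantities defined in the Statistical Analysis section (performance, sensitivity, specificity, LR, and the conversion from pretest to posttest odds) can be worked through with a short Python sketch. The function names are hypothetical, and the counts used in the example are invented for illustration only; they are not data from this study.

```python
import math


def contingency_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Summarize a 2 x 2 contingency table against a reference standard."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)   # rate of true-positives
    specificity = tn / (tn + fp)   # rate of true-negatives
    # LR = sensitivity / (1 - specificity); infinite if no false-positives.
    if specificity == 1.0:
        likelihood_ratio = math.inf
    else:
        likelihood_ratio = sensitivity / (1.0 - specificity)
    return {
        "performance": (tp + tn) / total,  # fraction correctly identified
        "sensitivity": sensitivity,
        "specificity": specificity,
        "likelihood_ratio": likelihood_ratio,
    }


def posttest_probability(pretest_probability: float, lr: float) -> float:
    """Apply: pretest odds x LR = posttest odds, then convert back."""
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1.0 + posttest_odds)


# Illustrative counts only (100 made-up subjects, not the study's data).
example = contingency_metrics(tp=40, tn=30, fp=20, fn=10)
example_posttest = posttest_probability(0.5, example["likelihood_ratio"])
```

With these made-up counts the sensitivity is 0.8 and the specificity is 0.6, so the LR is about 2; starting from even pretest odds, the posttest probability rises to about two in three, which is the sense in which an LR above 1 is informative.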
To assess the impact of a subject's level of objectivity on interclinician discordance, on intraclinician discordance, and on the performance of the evaluators and machine (load-averaged decision), the data were expressed as a function of the percentage of objective subjects in the group (Figures 2 and 3, respectively). In all cases, 0% represents a group of 23 normal/nonobjective and 21 abnormal/nonobjective; 100% represents a group of 23 normal/objective and the 21 abnormal/objective. Any intermediate percentage value is obtained by a proportional mix of pools of normal/objective and normal/nonobjective subjects and abnormal/objective and abnormal/nonobjective subjects. Because the machine provided only one classification per subject, there was no possible interdiscordance or intradiscordance in the machine classification. Therefore, for the purposes of comparing the performance of the machine with that of the clinicians, the statistics on the machine are calculated for the same subjects in whom the clinicians produced interclinician or intraclinician discordance.

Table 1. Quantification of Interclinician and Intraclinician Discordance for Each of the Three Populations of Subjects
% Discordance for a Given Population
Population | Interclinician | Intraclinician
Objective | 11 | 13
All | ? | ?
Nonobjective | 23 | 33

Interclinician discordance and intraclinician discordance relate to the level of objectivity of the subject. The results are represented graphically in Figure 2. When the subjects were objective, the level of interclinician or intraclinician discordance was similar (11% interclinician discordance [Figure 2A] and 13% intraclinician discordance [Figure 2B]). However, in the nonobjective subjects, the level of interclinician discordance was 23% (Figure 2A) and the level of intraclinician discordance was 33% (Figure 2B).

Figure 2. A, Averaged percentage discordance between findings in clinical examinations by both evaluators (interclinician discordance) when the percentage of objective subjects in the group varied. B, Averaged percentage discordance between findings in physical and clinical examinations by each evaluator (intraclinician discordance) when the percentage of objective subjects in the group varied. Standard deviation is also indicated.

The performance of the clinician evaluators and of the machine, relative to the proportion of objective subjects, is represented graphically in Figure 3. Clinical performance was high when the percentage of objective subjects was high but decreased more rapidly than the performance of the machine when objectivity declined. When the proportion of objective subjects exceeded 83%, clinical performance was considered high (up to 89%) and clinicians performed significantly better (P = 0.05) than the machine. When the proportion of objective subjects was between 45% and 83%, there was no significant difference between the performances of the clinicians and the machine. However, there was a significant drop in clinician performance as the number of objective subjects decreased. Therefore, when the proportion of objective subjects was below 45%, the machine's performance was significantly better (P < 0.05) than the clinicians'.

Figure 3. Performance (percentage of correct identification) of clinician evaluators (by consensus) and machine (load-averaged decision) when the percentage of objective subjects in the group varied (contingency table technique).

To measure the impact of interclinician discordance on the performance of the evaluators and the machine, Figures 2A and 3 can be combined by eliminating the percentage of objective subjects (Figure 4). Machine statistics were calculated in the same subjects in whom the evaluators produced interclinician discordance. The average probability that the evaluators' consensus of diagnosis would be correct was 13%. In contrast, the average machine's performance was 53% in the same subjects. This indicates that whenever two clinicians disagree, the machine's decision should prevail.

Figure 4. Performance (percentage of correct identification) of the clinician evaluators (by consensus) compared with that of the machine (load-averaged decision) in subjects in whom the clinicians separately reached a different diagnosis.

Figure 5. Averaged performance of the clinician evaluators compared with the performance of the machine (load-averaged decision) in subjects in whom the clinicians noted a discordance between findings in the physical examination and the reported pain.

To measure the impact of intraclinician discordance on the diagnostic accuracy of evaluators and machine, Figures 2B and 3 can be combined by eliminating the percentage of objective subjects (Figure 5). Machine statistics were calculated on the same subjects in whom the evaluators produced intraclinician discordance. Note that small increases in intraclinician discordance resulted in large reductions in the evaluators' performance.
Comparison of the evaluators' performance with that of the machine identified three definite zones: 1) With low levels of intraclinician discordance, clinical performance was acceptable. 2) With high levels of discordance, the low performance of the evaluators indicated that incorporating the machine's decision into the clinical diagnosis would significantly (P < 0.01) improve clinical performance. 3) An intermediate zone existed in which the clinical and machine diagnoses were statistically equivalent (P > 0.1) and therefore, a machine test would be warranted only if there were compelling reasons to do so (for example, the possibility of secondary gain for the patient).

Therefore, by monitoring the level of discordance either between the findings in clinical examinations of two clinicians, or between the findings in physical and clinical examinations of an individual clinician, the clinician has the means to recognize that the accuracy of the diagnosis may be in doubt and that incorporating the results of a functional test into the diagnosis would be beneficial.

The average values of the relative performance of clinician and machine in the different groups of subjects are summarized in Table 2. The relative gain in performance of applying the strategy of accepting the clinical decision when no interclinician or intraclinician discordance is noted by the clinician evaluators and accepting the machine's decision when there is discordance, compared with the performance of the clinician alone, is shown in Table 3.

Table 2. Performance (% Correct Identification) of Clinician Evaluators and Machine (Load-Averaged Decision) for Each of the Three Populations of Subjects Studied (Contingency Table Technique)
Performance (% correct identification)
Category | Definition | Objective | All | Nonobjective
1 | Clinician evaluators (by consensus) | 89 | ? | 26
2 | Machine (load-averaged decision) | 70 | ? | ?
3 | For the subpopulation of subjects in whom the evaluators noted interclinician discordance:
3.1 | Clinician evaluators (by consensus) | ? | ? | ?
3.2 | Machine (load-averaged decision) | ? | ? | ?
4 | For the subpopulation of subjects in whom the evaluators noted intraclinician discordance:
4.1 | Clinician evaluators (by consensus) | ? | ? | ?
4.2 | Machine (load-averaged decision) | ? | ? | ?
5 | For the subpopulation of subjects in whom either interclinician or intraclinician discordance occurred:
5.1 | Clinician evaluators | 82 | ? | ?
5.2 | Machine (load-averaged decision) | ? | ? | ?
6 | Clinician evaluators (by consensus) for the subpopulation of subjects in whom no interclinician discordance was noted | 97 | ? | ?
7 | Clinician evaluators (by consensus) for the subpopulation of subjects in whom no intraclinician discordance was noted | ? | ? | ?
8 | Clinician evaluators (by consensus) for the subpopulation of subjects in whom no interclinician or intraclinician discordance was noted | 100 | ? | ?
9 | Performance of the strategy consisting of accepting the clinical decision when no interclinician or intraclinician discordance is noted by the clinician evaluators and accepting the machine decision when there is discordance | 89 | ? | 44

Table 3. Relative Gain in Performance for Each of the Three Populations of Subjects Studied, Applying the Strategy Consisting of Accepting the Clinical Decision When No Interclinician or Intraclinician Discordance is Noted by the Clinician Evaluators and Accepting the Machine Decision When There is Discordance, Compared With the Performance of the Clinician Alone
Population | Objective | All | Nonobjective
Difference with the clinician evaluators alone (strategy vs. clinician) | 89 vs. 89 | ? | 44 vs. 26
Relative gain* | 0% | ? | 69%
* Relative gain in performance of applying the strategy (described in category 9 of Table 2) compared with that of the clinician alone (category 1 of Table 2).

The clinician evaluators' LR for the whole group was 1.3, which means that the odds of making a correct decision increased after the clinical examination by a factor of 1.3 (Table 4). The corresponding machine LR was 2.1; that is, the odds for the actual presence of LBP after a positive machine diagnosis are greater than after diagnoses by the clinician evaluators. For the nonobjective subjects, the clinicians' LR was less than 1; that is, the clinical examination provided data that confused the clinicians. In contrast, the machine's LR was greater than 1 in each group of subjects, which means that the machine always provided relevant clinical information.

Table 4. Likelihood Ratios for the Clinician Evaluators and the Machine
Population | Objective | All | Nonobjective
Clinician evaluators | 6.9 | 1.3 | 0.38
Machine | 2.8 | 2.1 | 1.5

Discussion

The traditional paradigm of acute, benign back pain as a medical condition that can be diagnosed and then efficiently treated medically has failed and will continue to fail unless more efficient diagnostic strategies are found to improve the situation. The results in the recent Québec Worker's Compensation Board study on LBP evaluation showed that the dependency of clinicians on reported pain affects all aspects of the clinical examination, including the presumed objective physical examination (85-38% performance), and also showed inability to determine reliably the level of objectivity of a subject (26% correct identification). These weaknesses result in high variability in their ability to make accurate diagnoses.

Among patients with LBP, those with acute benign LBP are the largest group and are the most difficult to evaluate. The use of role-playing subjects was motivated by the need to introduce the maximum amount of uncertainty regarding a patient's symptoms. Specifically, the difference between a subject with a normal back pretending to have pain and one with an abnormal back pretending to be normal is small and hard to detect. As shown earlier, classifying the condition of a fully collaborating patient as normal or abnormal can be reduced to asking the patient about the presence or absence of pain and believing the answer.

The inclusion of role-playing subjects, by forcing clinicians to rely on facts other than reported pain, provided a level of testing that cannot be obtained with collaborating subjects. Although the nonobjective abnormal group may not be entirely representative of the LBP patients that clinicians deal with in daily practice, it is not unusual for a patient to overreport or underreport symptoms. For example, nonobjectivity may be extended to the real-life scenario in which a subject uses role playing for secondary gain, or the less direct situation in which subjects, given a diagnosis in terminology that they do not fully understand, may amplify their symptoms because of apprehension. The use of role playing was intended to simulate such situations. Thus, the incorporation of nonobjective subjects pushed both clinicians and machine to the limits of their discriminatory powers. The current results show that the machine was better at reaching the correct decision when LBP patients were nonobjective. Because there is a high correlation between discordance and nonobjectivity, the results are applicable to clinical practice in cases in which discordance occurs.

The clinicians in this study were chosen for their expertise in LBP evaluation, and therefore presumably performed at a higher level than less trained clinicians. The inherent limitations in the clinical examination shown in these results place the selectors' gold standard in doubt, because the gold standard was the very same clinical examination, the efficacy of which the study's results showed was questionable. For example, 12 subjects classified as clinically abnormal by the clinician selectors were able to lift 20 kg at a speed greater than 50°/sec. These subjects would be classified as functionally normal according to the machine's normative database. Clinically reclassifying these subjects as normal would significantly improve the machine's performance (data available on request from authors).

The clinical evaluation is more useful for the identification of malignant, infectious, and other very serious diseases of the lower back than for the objective assessment of mechanical LBP. Because patients with mechanical LBP have a high probability (90%) of recovery within 3 months, regardless of diagnosis and subsequent clinical intervention, a good outcome is probable. As facetiously reported at the recent Mackenzie Institute International Conference (Philadelphia, Pennsylvania), "The trouble with back pain and discs is that the natural history is good anyway. We all get good results. Cutting a cockerel's throat and sprinkling blood over the back is a pretty effective treatment, and I can get wonderful results from it." This shows that clinicians should not rush into thinking that all their diagnoses are correct just because the outcome is favorable. Otherwise, they may end up trapped into a way of thinking that prevents recognition of diagnostic and treatment failures and acceptance of an alternative solution.

The clinical evaluation of LBP is a measurement process whereby data are collected and processed to provide a final diagnosis. In general, two physicians examining the same patient may not collect the same amount or type of data. However, if the data collected by either physician are sufficiently complete, then the resulting diagnosis should be the same. When this happens, the patient is said to be observable.
The high variability in clinical performance of LBP diagnosis and the nonspecific nature of the diagnosis in 80-90% of cases indicate that not all patients with LBP are observable by the unaided clinician. Thus, the clinical evaluation of LBP tends to be incomplete, and the need for complementary information arises.

Despite rigid standardization of the examination protocol, once two clinicians, after independently examining a subject, disagreed on the subject's condition and then reached a consensus, the probability that the consensus was correct was 13% in the current study. In contrast, in the same interdiscordant subjects, the accuracy of the machine's diagnosis was four times that of the clinicians, which indicates that these subjects were more observable by the machine. Because the machine bases its decision solely on measured function, it follows that functional information is essential for recognizing the true condition of the subject, regardless of the subject's objectivity or complaints. This interclinician discordance is easy to identify, and when discordance arises, the clinical diagnosis can be significantly improved by incorporating the machine's functional diagnosis into the final clinical decision.

Inconsistencies within an individual clinician's assessment of LBP (intraclinician discordance) were also shown to be strongly affected by the subjects' level of objectivity. The impact of intraclinician discordance on clinical performance indicates that the clinical evaluation of LBP is unstable, because a small increase in the level of discordance resulted in a large decrease in clinical performance. Such instability highlights the absence of solid objective data in the clinical examination. Monitoring the level of intraclinician discordance provides the clinician with the means to increase the effectiveness of the assessment.
Comparison of clinical performance with that of the machine, in the same group of subjects in whom the clinicians produced intraclinician discordance, highlighted the circumstances in which the clinician may have benefited from the machine's additional objective information.

The presence of contradiction in the clinical examination is a strong indication of the need to repeat the examination or seek additional information. The results in the current study show that the presence of contradiction, between two clinicians examining the same patient or when the findings of an individual clinician's physical examination contradict the final clinical decision, is highly correlated with the objectivity of a patient's self-presentation and with clinical performance. Thus, it represents valuable clinical information alerting the clinician of a possible decline in clinical performance and of the need to acquire supplementary information to improve the clinical diagnosis. Within the context of this study, discordance was statistically equated with low patient objectivity and, by extension, low clinical performance. Because discordance can be noted by the clinician, whereas the extent of a subject's objectivity cannot (26%, one in four), discordance can be used as a strong indicator that additional information is needed. Results in the current study do not show whether other indicators besides discordance could have been used to warn a clinician of difficulties.

The strategy of accepting the clinical decision when no discordance is detected and of using the machine's decision when discordance is noted has inherent limitations. There is a subset of the group that always escapes this strategy and therefore limits its performance. The strategy demands that clinicians realize that there is discordance.
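The decision strategy described above can be sketched as a simple rule. The following is a minimal illustration under stated assumptions, not the authors' implementation: the names `Diagnosis`, `final_decision`, and `strategy_accuracy` are hypothetical, and the accuracy model assumes that the strategy's overall accuracy is a plain weighted mix of the clinician's accuracy where no discordance is noted and the machine's accuracy where it is.

```python
from enum import Enum


class Diagnosis(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"


def final_decision(clinical: Diagnosis, machine: Diagnosis,
                   discordance_noted: bool) -> Diagnosis:
    """Keep the clinical decision unless the clinician notes discordance."""
    return machine if discordance_noted else clinical


def strategy_accuracy(p_discordant: float,
                      clinician_acc_when_concordant: float,
                      machine_acc_when_discordant: float) -> float:
    """Expected accuracy of the combined strategy (hypothetical model).

    p_discordant is the fraction of subjects in whom discordance is noted;
    the remaining subjects are decided by the clinician alone.
    """
    if not 0.0 <= p_discordant <= 1.0:
        raise ValueError("p_discordant must lie in [0, 1]")
    return ((1.0 - p_discordant) * clinician_acc_when_concordant
            + p_discordant * machine_acc_when_discordant)
```

With purely illustrative inputs, `strategy_accuracy(0.5, 0.2, 0.8)` evaluates to 0.5. Setting `p_discordant` to 0 returns the clinician's accuracy unchanged, which is the limitation noted in the text: subjects in whom no discordance is noted never benefit from the machine's decision.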
If clinicians see no discordance, then their decision prevails, even if it is wrong. For example, 33% of the nonobjective subjects were identified as discordant (Table 1), which means that in 67% of these subjects, no discordance was noted, and the clinical decision prevailed. The low performance of the clinicians in these subjects (26%) degraded the overall performance of the strategy from a potential of 61% to 44%, because the clinicians had no mechanism to recognize that, for these subjects, the machine's decision was better.

In addition, the binary nature of the clinical decision (normal/abnormal) to be matched by the machine did not permit the use of more subtle variations in the automated decision, as reported earlier. For example, because the machine makes its decision based on load-dependent data, it may decide that a subject's condition is normal with light loads and becomes abnormal with heavier loads. Ideally, the clinical decision must not be restricted to a generalized normal or abnormal, but should specify the range of activities to which a particular decision applies. Forcing the machine to replace its load-dependent decision with a strict normal/abnormal decision degraded performance. Even if in the nonobjective subjects the overall strategy performance exceeds that of the clinicians alone (44% vs. 26%), it is well below what can be achieved when the full machine diagnosis is considered (up to 7[…]).

Note that good performance does not equal good diagnosis but refers to good correlation with the study gold standard. Just because everyone agrees does not make everyone right. No experiment can be designed to determine whether a diagnosis is correct because there is no universal gold standard for LBP.

This raises the issue of determining how any technology should be evaluated and used. The evaluation process presumes that there is a reliable reference point against which the performance of technology can be assessed.
At first glance, it would seem reasonable to choose the clinical examination as the gold standard. Yet the questions raised about the efficacy of the clinical examination are serious. Because the gold standard appears to be flawed, the evaluation of the machine's performance cannot be made using only this gold standard. Otherwise, a perfect score would mean that the machine has reproduced the gold standard's flaws. This may explain why various other instruments (pain questionnaires, for example) that have been tried in hopes of a correlation with the clinical examination findings have not succeeded in clarifying the patient's condition. The fundamental reason for this may be rooted in the strong dependency of these instruments on the subjective reports of the patient, rather than on an objective assessment of the patient's condition. Therefore, they contribute no new information beyond that which the clinician alone can assess. The excessive use of reported pain for return-to-work decisions was challenged by Hall et al,¹¹ who showed that a patient who reports pain may not necessarily be a functionally disabled person. Therefore, the objective machine classification should not be expected to correlate well with a classification resulting from these subjective, self-reported pain questionnaires.

Whether functional assessment ought to represent a gold standard of its own is open to debate. It has been proposed that the clinical assessment of LBP should concentrate on loss of function rather than on reported pain. Function cannot be reliably determined by using behavioral assessment techniques, reported pain, and disability questionnaires, because such information relies heavily on how the patient perceives his or her status, which, as the current study results show, may not always be a reliable source of objective information.
In fact, none of these assessment techniques have ever been fully validated by using a controlled sample of objective and nonobjective subjects. Therefore, the performance of these popular procedures is unknown, a fact noted in the literature.

To improve the evaluation of LBP, clinicians could reasonably consider using only that clinical information about which they were certain. It is reasonable to expect a high degree of success when clinicians are "certain" of their diagnosis. The data do not support these expectations. In objective subjects, "certainty" resulted in 100% success; in nonobjective subjects, certainty's success was only 10%. Therefore, in the context of this study, certainty of diagnosis is unrelated to lumbar function.

It is a legitimate concern that using machine decisions may introduce a prejudice against the patient. In this context, prejudice means an increase in diagnostic error. This concern is not supported by the data. Implementing this strategy on objective subjects resulted in no statistically significant change in clinical performance. However, when subjects were not objective, the proposed strategy significantly increased the clinical diagnostic accuracy (by 69%). Therefore, using the machine decision when discordance is present improves overall diagnostic accuracy and prejudices the patient group less than a standard clinical examination.

The machine provides a functional diagnosis in the mechanical sense of the word, whereas nonmechanical LBP must also be considered by clinicians as part of the clinical examination. Therefore, normal mechanical function does not necessarily imply a clinically normal patient. For example, a patient with pancreatic cancer may experience LBP, yet the mechanical spine function may be normal.
The discrepancy between a negative machine assessment and the reported pain must alert the clinician that something other than a structural spine problem may be responsible for the symptoms. In this regard, the machine can be seen as a mechanism for differentiating a truly mechanical problem from other causes that may explain the reported pain. This is why the machine test should always be performed in conjunction with a full clinical examination.

The LR analyses highlight the inextricable relation between reported pain and clinical examination. For example, in the nonobjective subjects, the evaluators' LR was well below 1 (0.38), signifying that the clinical examination actually decreased the clinicians' chances of forming a correct decision. In contrast, the LR of the machine never dropped below 1. Therefore, the machine's assessment provided constructive information, regardless of the patient's symptoms.

In conclusion, discordance in the clinical evaluation of benign mechanical LBP proved to be an important factor containing information about the level of objectivity of a subject and the efficacy of the clinical examination. Monitoring discordance, either between two clinician evaluations or among the various components of a single examination, allows the clinician to identify circumstances when additional diagnostic testing would improve LBP management. Incorporating a functional evaluation by machine in cases of discordance was shown to improve clinical performance. The amount of improvement varied (from 0% to 69%) depending on the percentage of nonobjective subjects. Therefore, whenever a clinician is confronted with contradictory information during an LBP examination, the results of this study indicate that a functional evaluation by the machine provides an important source of relevant clinical data to complement that of the clinical examination.
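The LR reasoning in the discussion above can be made concrete with a small worked example. The sketch below assumes that LR denotes the standard likelihood ratio of clinical epidemiology, so that post-test odds equal pre-test odds multiplied by the LR; the function name `post_test_probability` and the 50% pre-test probability are hypothetical choices for illustration and are not taken from the study.

```python
def post_test_probability(pre_test_probability: float,
                          likelihood_ratio: float) -> float:
    """Update a probability with a likelihood ratio via the odds form of Bayes' rule."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre_test_probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical starting point: an even (50%) chance of a correct decision.
# An LR of 0.38 lowers it; an LR of exactly 1 leaves it unchanged.
lowered = post_test_probability(0.5, 0.38)
unchanged = post_test_probability(0.5, 1.0)
```

Under these assumptions, an LR of 0.38 moves an even chance down to roughly 0.28, whereas any LR at or above 1 leaves the probability unchanged or raises it. This is the sense in which an assessment whose LR never drops below 1 cannot reduce the chance of a correct decision.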
Acknowledgments

The authors thank the National Research Council of Canada Laboratory of Intelligent Machines, Ottawa, Canada, and the Société de Développement Industriel du Québec for technical and financial support.

References

1. Borkan JM, Cherkin DC. An agenda […] pain. Spine 1996;21:2880-4.
2. Cherkin DC, Deyo RA, Wheeler K, Ciol MA. Physician variation in diagnostic testing for low back pain: Who you see is what you get. Arthritis Rheum 1994;[…].
3. Deutsch SD. B-200 Back evaluation system: Occupational […]. 2nd ed. Hillsborough, NC: Isotechnologies, 1990.
4. Deyo RA, Cherkin D, Conrad D, Volinn E. Cost, controversy, crisis: Low back pain and the health of the public. Annu Rev Public Health 1991;12:141-56.
5. Deyo RA, Phillips WR. Low back pain: A primary care challenge. Spine 1996;21:2826-[…].
6. Fordyce WE, ed. Back pain in the workplace: Management of disability in nonspecific conditions: A report of the Task Force on Pain in the Workplace of the International Association for the Study of Pain. Seattle, WA: IASP Press, 1995.
7. Gracovetsky SA, Marriott A, Richards MP, Newman NM, Asselin S. The impact of inefficient clinical diagnosis on the cost of managing low back pain. J Healthcare Risk Manage 1997;17:21-3.
8. Gracovetsky SA, Newman NM, Pawlowsky M, Lanzo VF, Davey B, Robinson L. A database […] measurements. Spine 1995;20:1036-46.
9. Gracovetsky SA, Newman NM, Richards MP, Asselin S, Lanzo VF, Marriott A. Evaluation of clinician and machine performance in the assessment of low back pain. Spine 1998;23:568-75. (Erratum: page 572, Table […] should […].)
10. Haldorsen EMH, Brage […], Johannesen TS, Tellnes G, Ursin H. Musculoskeletal pain: Concepts of disease, illness, and sickness certification in health professionals in Norway. Scand J Rheumatol 1996;25:224-32.
11. Hall H, McIntosh G, Melles T, Holowachuk B, Wai E. Effect of discharge recommendations on outcome. Spine 1994;19:2035-7.
12. Hochschuler SH. Diagnostic studies in clinical practice. Orthop Clin North Am 1983;14:517-26.
13. Leclaire R, Esdaile JM, Jéquier JC, Hanley JA, Rossignol M, Bourdouxhe M. Diagnostic accuracy of technologies used in low back pain assessment: Thermography, triaxial dynamometry, spinoscopy, and clinical examination. Spine 1996;[…].
14. Mooney V. Impairment, disability, and handicap. Clin Orthop […].
15. Newman NM, Gracovetsky SA, Itoi M, et al. Can the computerized physical examination differentiate normal subjects from abnormal subjects with benign low back pain? Clin Biomech 1996;11:466-73.
16. Newton M, Sommerville D, Henderson I, Waddell G. Trunk strength testing with iso-machines. Part 2: Experimental evaluation of the Cybex II back testing system in normal subjects and patients with chronic low back pain. Spine 19[…]:812-26.
17. Newton M, Waddell G. Trunk strength testing with iso-machines. Part 1: Review of a decade of scientific evidence. Spine 1993;18:801-1[…].
18. Paris SV. Differential diagnosis of lumbar and pelvic pain. In: Vleeming A, Mooney V, Snijders CJ, Dorman TA, Stoeckart R, eds. M[…] Low Back Pain: The Essential Role of the Pelvis. Bath, UK: Churchill Livingstone, […].
19. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine. 2nd ed. Boston: Little, Brown, […].
20. Spitzer WO, Leblanc FE, Dupuis M. Scientific approach to the assessment and management of activity-related spinal disorders. Spine 1987;[…].
21. Spratt KF, Lehmann TR, Weinstein JN, Sayre HA. A new approach to the low back physical examination: Behavioural assessment of mechanical signs. Spine 1990;15:96-101.
22. Van den Hoogen HMM, Koes BW, van Eijk JTM, Bouter LM. On the accuracy of history, physical examination, and erythrocyte sedimentation rate in diagnosing low back pain in general practice. Spine 1995;20:318-27.
23. Waddell G. Can chicken blood cure back pain? The Back Letter […].
24. Waddell G. Clinical assessment of lumbar impairment. Clin Orthop 1987;[…].
25. Waddell G. Low back pain: A twentieth century health care enigma. […].

Address reprint requests to
Serge Gracovetsky, PhD
Spinex Medical Technologies, Inc.
1800 McGill College, Suite 2100
Montréal, Québec H3A 3J6, Canada
E-mail: sag@spinex.com
