MATHEMATICAL MODELING  Year: 2010 | Volume: 2 | Issue: 1 | Page: 35-38

A novel tool for classification of epidemiological data of vector-borne diseases

Sree Hari Rao Vadrevu^{1}, Suryanarayana U Murty^{2}
^{1} Department of Mathematics and Statistics, 202 Rolla Building, 400 W 12th Street, Rolla, USA
^{2} Biology Division, Indian Institute of Chemical Technology, Hyderabad - 500 018, India

In this article we present a novel tool that renders efficient classification of epidemiological data of vector-borne diseases. The algorithm has been applied to data on the Filariasis disease, and the results are compared with those of the well-known k-nearest neighbor algorithm.
Introduction
'Crude classification and false generalizations are the curse of organized life.' However, that does not stop us from classifying objects and extending generalizations to vast categories. Scientists have been classifying objects since time immemorial, and this effort to group objects so that they fit together with other similar objects has led to the simplification and organization of an otherwise chaotic world. Classification in biology dates back to Aristotle's time and has now reached such sophistication that human effort is often minimized, superseded by state-of-the-art computer-aided tools. With accelerating advances in high-throughput methods, the generation of data is increasing exponentially. The development of accurate and precise classification tools has not kept pace with this astonishingly high rate of data generation and data flow, leaving a gap between demand and supply. In the biological field, classification tools are used vigorously in taxonomy to resolve ambiguities in assigning the appropriate rank. [1],[2] Several applications of classification tools can be found, such as gene expression analysis, [3],[4] structural classification of proteins (SCOP), [5] classification of methylation array data, proteomic analysis for cancer, [6],[7] elucidation of the evolutionary distance among species, comparison of biological sequences, [8] protein structure prediction, [9] assigning scores to drug molecules in docking studies, and ranking small molecules based on their structure-activity relationships (QSAR) [10] and structure-property relationships (QSPR). These classification tools are mainly based on hard-core statistics and mathematical models. Principal component analysis (PCA), the partial least squares method, and regression analysis lie at the heart of these tools.
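As an illustration of the statistical machinery just mentioned, the following is a minimal sketch of principal component analysis computed via the singular value decomposition, using NumPy. The function name `pca` and the SVD-based approach are illustrative assumptions on our part, not part of the tools discussed in this article.

```python
import numpy as np

def pca(X, k=2):
    """Project the rows of X onto the top-k principal components.

    Uses the SVD of the mean-centered data: the rows of Vt are the
    principal directions, and Xc @ Vt[:k].T gives the component scores.
    (Illustrative sketch only.)
    """
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # scores in the reduced space
```

For instance, points lying exactly on the line y = x collapse onto a single component, with the second component carrying no variance.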
Among the myriad classification tools are HCA (Hierarchical Cluster Analysis), MDS (Multi-Dimensional Scaling), CART (Classification and Regression Trees), FP (Frequent Pattern) Tree, SOM (Self-Organizing Maps), correlation coefficient clustering, SVM (Support Vector Machines), BPANN (Back-Propagation Artificial Neural Networks), GA (Genetic Algorithms), Genetic Programming, k-Means, and the k-nearest neighbor (kNN) algorithm. Vector-borne diseases often tend to be complicated, and the enormous data generated from experimental studies warrant efficient tools for analysis. Although computer technology is undergoing a revolution, its application in classification is still in its infancy; in particular, its potential has not been tapped in the arena of vector-borne diseases. Often, the magnitude of epidemiological data and its vast array of types pose a challenge for the epidemiologist. In the gold rush for information, we end up settling for a compromise between accuracy and speed. One of the common statistical tools used in these kinds of studies is kNN. The present article is organized as follows: In Section 2 we describe the kNN approach, and in Section 3 we present the methodology, the data sets, and data normalization. The main algorithm is discussed in Section 4. The results and performance evaluation of the novel algorithm form the content of Section 5, while the conclusions are deferred to Section 6.

Nearest neighbor
The kNN algorithm is a simple instance-based learning method for performing general, nonparametric classification. First introduced by E. Fix and J. Hodges in their 1951 article, 'Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties', it is well explored in the literature and has been seen to have good classification (prediction) performance on a wide range of real-world data sets (Xiong and Chen, 2006). It is simple and straightforward to implement.
kNN is based on a distance function for pairs of observations, such as the Euclidean distance. In this classification paradigm, the k nearest neighbors of a test sample among the training data are computed first. Then the similarities of the test sample to those k nearest neighbors are aggregated according to the class of the neighbors, and the test sample is assigned to the most similar class. The performance of the kNN algorithm is influenced by three main factors:
- The distance metric used to locate the nearest neighbors
- The decision rule used to derive a classification from the k nearest neighbors
- The number of neighbors used to classify the new sample
kNN classifiers are well-suited to the given problem because they do not have to spend additional effort on distinguishing additional classes. One advantage of kNN is that it handles multimodal classes well, as its classification decision is based on a small neighborhood of similar objects. Therefore, even if the target class is multimodal (i.e., consists of objects whose independent variables have different characteristics for different subsets), it can still achieve good accuracy. Some pleasant aspects of the nearest neighbor (NN) classifier:
- Many other techniques (such as decision trees and linear discriminants) require the explicit construction of a feature space, which for some distance functions is intractable
- The NN classifier effortlessly deals with the hugely multiclass nature of visual object recognition
- From a theoretical point of view, it has the remarkable property that, under very mild conditions, the error rate of a kNN classifier tends toward the Bayes optimum as the sample size tends toward infinity
- kNN classifiers can be applied to any type of object representation, as long as a distance measure is available
Unfortunately, kNN classification has a major drawback as well: the efficiency of classification decreases rapidly with the number of training objects.
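The classification paradigm described above can be sketched in a few lines of Python. This is a generic illustrative kNN classifier, not the article's software; the function name, the data layout (records as numeric tuples), and the majority-vote decision rule are our assumptions.

```python
import math
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """Label `query` by a majority vote among its k nearest training records."""
    # Pair each training record's Euclidean distance to the query with its label.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    # Vote among the k closest records and return the winning class.
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

For example, with training records clustered near the origin (class 0) and near (5, 5) (class 1), a query close to the origin is assigned class 0.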
The k-nearest neighbor classifier: A major drawback of the similarity measure used in kNN is that it weighs all features equally when computing similarities. This can lead to poor similarity measures and classification errors when only a small subset of the features is useful for classification. The accuracy of the kNN algorithm can be severely degraded by the presence of noisy or irrelevant features, or when the feature scales are not consistent with their relevance. The main weakness of this technique is that its performance is subject to the often ad hoc choice of the similarity metric, especially for heterogeneous datasets, in which the derived features are of different types and scales and are correlated. In addition, standard kNN methods suffer from the curse of dimensionality: the neighborhood of a given point becomes very sparse in a high-dimensional space, resulting in high variance. The algorithm is easy to implement, but it is computationally intensive, especially as the size of the training set grows. With an odd k in a two-class problem, there can be no tie, and the resulting class labels are used to classify each data point in the data set. However, owing to the shortcomings of kNN, an imperative need is felt for a more robust tool with high efficiency and accuracy. On the basis of the kNN approach, a novel tool, VBClassif ver. 1.0, for the classification of epidemiological data of vector-borne diseases, was developed [Figure 1]. VBClassif is an efficient classification algorithm that classifies the records of those affected by Filariasis. This software has been used to classify up to 100,000 records, with an effective classification yield of 94%.

Methodology
The data on various parameters, such as age, sex, class, and so on, of different individuals affected by Filariasis are considered.
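As a rough sketch of the two-class, least-distance assignment on which the VBClassif algorithm (described in the Algorithm section below) builds: each test record goes to the class of its nearest training record, with exact ties deferred to an unclassified list, and a later fallback stage rescales each record by its Euclidean norm. The function names and the tuple representation of records are our illustrative assumptions, not the published implementation.

```python
import math

def first_pass(train0, train1, test):
    """Assign each test record to the class of its nearest training record;
    exact ties are deferred to an unclassified list."""
    class0, class1, unclassified = [], [], []
    for q in test:
        d0 = min(math.dist(q, t) for t in train0)  # least distance in class '0'
        d1 = min(math.dist(q, t) for t in train1)  # least distance in class '1'
        if d0 < d1:
            class0.append(q)
        elif d1 < d0:
            class1.append(q)
        else:
            unclassified.append(q)                 # equal distances: defer
    return class0, class1, unclassified

def unit_normalize(record):
    """Divide every field by the record's Euclidean norm
    R1 = sqrt(x1^2 + ... + xn^2), as in the final fallback stage."""
    r1 = math.sqrt(sum(x * x for x in record))
    return record if r1 == 0 else tuple(x / r1 for x in record)
```

A record exactly equidistant from both classes ends up in the unclassified list, which the later stages of the algorithm then process.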
The program is trained with a small portion of the data, designated as the train data, and is then tested for efficiency by executing it on the test data, which is expectedly large. The algorithm is tested with minimal train data and observed to see whether the efficiency increases or remains unaltered.

Dataset
A real-life dataset of 5602 records, pertaining to the major disabling vector-borne disease Filariasis, was used for this study. Epidemiological, socioeconomic, and entomological data were collected from the East and West Godavari districts of Andhra Pradesh.

Data normalization
The data were normalized, and each record was assigned values of 0 or 1 based on the presence or absence of the disease characteristics.

Algorithm: The following is the main algorithm.
1. Start.
2. For the entire given test data:
   1. Separate the '0' class records and the '1' class records in the train data.
   2. Take a record from the test data and calculate its Euclidean distance to each record in the train data, for both classes separately.
   3. Take the least distance in the '0' class and in the '1' class.
   4. Compare the least distances of the two classes; put the query record in the class having the smaller distance.
   5. If the distances are equal, the record is taken as unclassified.
   6. Repeat steps 2 to 5 for every record in the test data.
3. Now, on the unclassified records from the above-mentioned stage, perform the following steps:
   7. Among the classified records, take the least-distance record (say r1) in the '0' class and the least-distance record (say r2) in the '1' class.
   8. Take the record from the unclassified list having the minimum distance (say u1).
   9. If u1 is less than both r1 and r2, take the minimum of r1 and r2.
   10. If u1 is less than r1, assign u1 to the '0' class; if u1 is less than r2, assign u1 to the '1' class.
   11. Repeat Steps 3 and 4 for all unclassified records from Step 2 of this method.
4.
Again, on the unclassified records from the above-mentioned stage, perform the following steps:
   12. Calculate the least distances L1, L2 and the highest distances H1, H2 in the '0' and '1' classes, respectively.
   13. Calculate the average r as r = (H1 + L2)/2.
   14. Calculate the distances (d1 and d2) between a record in the unclassified list and the records having the H1 distance and the L2 distance.
   15. If d1 < r, assign the query record to the '0' class; otherwise assign it to the '1' class.
   16. Repeat the steps from 1 to 4 for each record in the unclassified list and also for those in the true-negative list.
5. On the unclassified list, perform the following operations:
   17. Calculate the distances between a record in the unclassified list and the remaining records in the same list. If x1, x2, x3, ..., xn are the records in the unclassified list, calculate the distances (x1, x2), (x1, x3), ..., (x1, xn).
   18. Take the least distance (d1) from these calculated distances.
   19. Take the least distances (r1, r2) in the '0' and '1' classes, respectively.
   20. Compare d1 with the minimum of r1 and r2 (say m1).
   21. Assign the record corresponding to d1 to the class to which m1 belongs.
   22. Repeat Steps 1 to 5 for every record in the unclassified list and also in the true-negatives list.
6. Again, on the unclassified list, perform the following steps:
   23. Calculate the Euclidean distance between the origin and the record in the unclassified list. If x1, x2, x3, ..., xn are the fields in the record, then R1 = sqrt(x1^2 + x2^2 + ... + xn^2).
   24. Divide every field value by R1, giving xi/R1 for each field xi.
   25. Repeat Steps 1 and 2 with the new field values, i.e., x1/R1, x2/R1, ..., xn/R1.
   26. Calculate the Euclidean distance between a record with the new field values in the unclassified list and every record among the classified records of the '0' and '1' classes.
   27. Take the least distance in the '0' class and in the '1' class.
   28. Compare the least distances of the two classes and put the query record in the class having the smaller distance.
   29.
Repeat the steps from 1 to 6 for every record in the unclassified list and the true-negative list.
7. End of the algorithm.

Results
The results obtained with different magnitudes of train data are shown in [Table 1]. The algorithm has performed effectively compared to other well-known classification algorithms, such as the k-nearest neighbor (kNN) algorithm. The comparison results are tabulated in [Table 2]. The results in [Table 2] lead one to conclude that it is not always necessary to have a large number of records for training; what matters is a minimal set of useful data records. Clearly, a set of 300 train-data records could capture the full information and render the most effective classifications and predictions. When applied to data sets with train data of different sizes, namely 300, 500, 600, and 1100 records (6.7, 12, 13.5, and 25 percent of 4470, respectively), VBClassif ver. 1.0 rendered classification accuracies of 98.4, 91.3, 92.77, and 83.38 percent, respectively, with the percentage of classification being 100, while the well-known k-nearest neighbor (kNN) algorithm classified the same test data with accuracies of 96.82, 92.42, 87.26, and 81.72 percent for the respective train-data sizes. The tool demonstrates high classification accuracy compared to the well-known kNN, which is a desired feature. VBClassif is a menu-driven, user-friendly, and robust tool, which even a novice can apply to analyze a vast amount of data. The tool offers the advantage of being applicable to two-tier classification environments. It can be extrapolated to any vector-borne disease and hence provides an effective way of modeling the disease.

Conclusion
Artificial intelligence has opened a new window of opportunity and definitely has a long way to go in the area of data analysis. The enormity and complex nature of the data often perplex epidemiologists.
This is especially true in the case of vector-borne diseases, where a timely and precise understanding of the disease during the decision-making process comes in handy for curbing the disease in the event of an outbreak or epidemic. Public health officials often feel the need for the correct identification of true-positive cases. A tool that classifies individuals according to the presence or absence of a disease will help in devising a clear strategy in mass drug administration programs, in the proper targeting of patients, and in the efficient use of resources. Available and well-known statistical tools tend to compromise on accuracy, speed, and efficiency. Interactive classification tools supported by AI (Artificial Intelligence), such as VBClassif 1.0, will definitely pave the way for more efficient disease control and will help epidemiologists find quick solutions to classification problems.
References


