Tuesday, May 5, 2020
Statistics Business Transformation Business Techniques
Question:

1. Statisticians divide variables into different classes (or types). Describe the classes of variables and give examples of each. Briefly describe (for each class of variables) the methods used to compare 2 independent groups of cases. Describe the assumptions and/or limitations of each technique.

2. What do you understand by the following statistical and epidemiological terms? You may find it helpful to use examples to illustrate your explanations.
a) Boxplot (box and whisker plot) [20 marks]
b) Addition law of probability [15 marks]
c) Retrospective study [15 marks]
d) R-squared ($r^2$) [15 marks]
e) Cluster sampling [15 marks]
f) Standard error of a mean [20 marks]

3. This question is concerned with statistical measures to assess the reliability and accuracy of tests (for example, for diagnosing caries based on radiographs).
a) What method(s) would you use to measure the extent to which 2 observers agree whether teeth are carious or not (reliability)?
b) What method(s) would you use to measure the extent to which an observer agrees with a gold standard test (accuracy)?
c) When might you use a ROC curve?
d) Show the principles behind ROC curves by presenting a small example.

4. What is a 95% Confidence Interval for the mean of a variable? Explain how you would calculate it and state the assumptions behind the method you describe. Explain the relationship between a 95% Confidence Interval for the mean and a one-sample t-test.

5. Explain briefly the principles behind, and the use and limitations of, three of the following. Suggest situations where they might be used when analysing dental data. [equal marks for each sub-section]
a) One-way analysis of variance (ANOVA)
b) Survival analysis
c) Log transformation
d) Paired samples t-test

6. A researcher claims the mean DMFT of males aged 14 to 16 in a particular British region is 6.
a) How would you set up a study to assess this?
b) What are the appropriate hypotheses?
c) How would you summarize the observations?
d) What statistical test would you apply?
e) How would you interpret the results of such a test?
f) How would the size of your sample tend to affect the results?

Answer:

1: A variable is any trait or characteristic, observed on each case, whose value can vary over a certain range. It may be a number, a measurement or a category. The monthly income of a person is an example of a variable: it can take any value from 0 upwards. The age of students in a class, the color of a flower and the number of books in a library are other examples of variables.

Variables are first classified into two types, qualitative and quantitative. Variables whose values are numbers obtained by counting or measuring are called quantitative variables. The age of students and the number of books in a library are examples of quantitative variables. On the other hand, variables such as the color of a flower or the first letter on the number plate of a car cannot be counted or measured. These are referred to as qualitative variables.

A quantitative variable can be either discrete or continuous. Variables that can take values only on a discrete set of points are called discrete variables, while a variable that takes values on a continuous scale is called a continuous variable. The number of books in a library and the number of people in a household are examples of discrete variables, as they take distinct values. Height, weight and age are examples of continuous variables.

Qualitative variables are also referred to as categorical variables, as they describe the category to which a data point belongs. Categorical variables can be of two types:

Nominal variable: Categorical variables whose categories cannot be arranged in increasing or decreasing order are called nominal variables.
Nominal variables can only be classified into groups. Type of business and eye color are examples of nominal variables.

Ordinal variables: Categorical variables whose categories can be arranged in increasing or decreasing order are called ordinal variables. Examination grades and attitudes towards a decision (disagree, moderate, agree, strongly agree) are examples of ordinal variables.

Two independent groups can be compared using different tests, depending on the class of variable. If the variable is quantitative, a t-test can be used to test whether the means of the two groups are equal. If the variable is qualitative, a two-sample proportion test can be used.

The t-test tests the equality of the means of two independent samples. The hypotheses are $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$. The test statistic is

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},$$

where $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1^2, s_2^2$ the sample variances and $n_1, n_2$ the sample sizes. If the absolute value of the calculated t is greater than the tabulated t value on $n_1 + n_2 - 2$ degrees of freedom, the null hypothesis is rejected. Rejection of the null hypothesis implies that the two means are unequal.

If the variable is qualitative, a test for proportions can be performed. The hypotheses are $H_0: p_1 = p_2$ against $H_1: p_1 \neq p_2$. The test statistic is

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\text{s.d.}}, \qquad \text{s.d.} = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},$$

where $\hat{p}_1$ is the estimated proportion in the first sample, $\hat{p}_2$ is the estimated proportion in the second sample, $p$ is the proportion in the two samples combined, and $n_1$ and $n_2$ are the sizes of the two samples. The null hypothesis is rejected if the calculated p-value is less than the level of significance $\alpha$.

A limitation of the t-test is that the underlying distribution of the sample is assumed to be normal. If the distribution is not normal, a robust statistic such as the median has to be used, and a median test can be performed. In the median test, the hypotheses are $H_0: me_1 = me_2$ against $H_1: me_1 \neq me_2$. In the proportion test, the variance is a pooled estimate computed from the two samples.
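As a rough illustration of the two comparisons described above, the sketch below computes a pooled two-sample t statistic and a two-proportion z statistic using only the Python standard library. The function names and all the numbers are invented for illustration; it is a minimal sketch under the equal-variance and large-sample assumptions, not a replacement for a statistics package.

```python
import math
import statistics


def pooled_t_statistic(group1, group2):
    """Two-sample t statistic assuming equal variances (pooled estimate)."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, n1 + n2 - 2  # statistic, degrees of freedom


def two_proportion_z(successes1, n1, successes2, n2):
    """z statistic and two-sided p-value for H0: p1 = p2."""
    p1, p2 = successes1 / n1, successes2 / n2
    pooled_p = (successes1 + successes2) / (n1 + n2)
    std_dev = math.sqrt(pooled_p * (1 - pooled_p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_dev
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up data: a continuous score in two groups, and caries counts in two groups.
t_stat, df = pooled_t_statistic([2.1, 2.5, 1.9, 2.8, 2.4], [3.0, 3.4, 2.9, 3.6, 3.1])
z_stat, p_val = two_proportion_z(30, 100, 45, 100)
print(f"t = {t_stat:.2f} on {df} df; z = {z_stat:.2f}, p = {p_val:.3f}")
```

The t statistic still has to be compared with a tabulated t value on the returned degrees of freedom, since the standard library has no t distribution.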
A pooled variance is appropriate only if the variance of each group is roughly the same. If the two groups differ greatly in variance, a pooled variance cannot be used, and a test that relies on it is invalid.

2: Box Plot: A box plot, also called a box and whiskers plot, is a way to represent statistical data graphically. The lines that extend from the two ends of the box are called the whiskers. A box plot is a nonparametric representation: it does not assume any underlying distribution. The box is the space between the first and the third quartile. Outliers can easily be detected with the help of a box plot, and the plot gives an idea of the spread or dispersion of the dataset. A box and whiskers plot depicts the following statistical measures:

Median: The median is the midpoint of the data and is represented by the line inside the box. From the position of the median within the box, one can judge whether the distribution is skewed or symmetric. Sometimes an additional line for the arithmetic mean is also drawn inside the box. If the mean and median lines coincide, the distribution is roughly symmetric. Otherwise, it is skewed.

Quartiles: 25 percent of the observations fall below the first quartile, and 75 percent of the observations fall below the third quartile.

Range: The difference between the maximum and minimum observation in the dataset.

Interquartile range: The interquartile range (IQR) is the length of the box. 50% of the observations lie within this range.

Outliers: Outliers, if present, appear as separate points beyond the whiskers. Points lying between 1.5 IQR and 3 IQR beyond the box are usually marked as outliers, and points more than 3 IQR beyond the box as extreme outliers.

Addition law of probability: Consider two events A and B. The events are mutually exclusive if they cannot occur together, so that the probability of their intersection equals zero. Two events are collectively exhaustive if their union makes up the entire sample space.
The addition law of probability states the following. Let $A_1, A_2, \dots, A_n$ be n mutually exclusive events. Then the probability of the union of the events is equal to the sum of their probabilities:

$$P(A_1 \cup A_2 \cup \dots \cup A_n) = P(A_1) + P(A_2) + \dots + P(A_n).$$

If the events are also collectively exhaustive, this sum equals 1. For two events that are not mutually exclusive, the law becomes $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Retrospective study: A retrospective study is a study design that looks backwards in time at two groups. In this kind of study, one group has a particular disease and the other group does not. The two groups are compared to identify factors in their history that are associated with the disease. The data are collected from past records. The study is mainly conducted to determine the risk associated with the disease and to estimate the number of casualties from the disease. The risk ratio or odds ratio of the two groups is calculated, which gives the relative risk of the disease in a particular group. The data can be laid out as follows:

            DISEASE PRESENT   DISEASE ABSENT   Total
Group 1     a                 b                a+b
Group 2     c                 d                c+d
Total       a+c               b+d              n

The risk ratio is given by

$$RR = \frac{a/(a+b)}{c/(c+d)}$$

and the odds ratio is given by

$$OR = \frac{ad}{bc}.$$

If the risk ratio is less than one, Group 1 has a lower chance of developing the disease than Group 2. If the risk ratio is greater than one, Group 1 has a higher chance of developing the disease. The same interpretation applies to the odds ratio. When the groups are selected by disease status (a case-control design), only the odds ratio can be estimated directly.

The advantages of a retrospective study are that it is less costly, less time consuming and easier to conduct than a prospective study, and it allows a comparison of disease between the groups from existing data. For example, if one wants to compare the oral health status of two groups, an idea of the oral health status of the new generation can be obtained from the oral health status of the mothers.

R-squared: R-squared is calculated to determine how good a fitted model is. The R-squared value is one minus the ratio of the residual sum of squares (RSS) to the total sum of squares (TSS).
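To make the sums of squares concrete, here is a small sketch that fits a straight line by least squares and computes R-squared using the usual definition $1 - RSS/TSS$. The function name and the five data points are made up for illustration.

```python
import statistics


def r_squared(x, y):
    """R-squared of a least-squares straight line fitted to (x, y)."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # RSS: squared distances of the observations from the fitted line.
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    # TSS: squared distances of the observations from their own mean.
    tss = sum((b - y_bar) ** 2 for b in y)
    return 1 - rss / tss


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(f"R-squared = {r_squared(x, y):.2f}")
```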
For a regression model, the greater the value of R-squared, the better the fitted model: $R^2 = 1 - RSS/TSS$. A good model is expected to have small errors. In a least-squares fit with an intercept the errors sum to zero, so to compare models the sum of squared errors has to be considered. The smaller the residual sum of squares (RSS), the better the model, and so the larger the value of R-squared, the better the model.

R-squared does not take into account the number of parameters in a model. For a model to be good, it should also be parsimonious. For this purpose another measure, called adjusted R-squared, was developed. It is given by

$$R^2_{adj} = 1 - \frac{RSS/(n - k)}{TSS/(n - 1)},$$

where n is the number of observations and k is the number of parameters. The higher the value of adjusted R-squared, the better the fitted model in terms of both parsimony (minimum number of parameters) and minimum error.

Cluster Sampling: Sampling is the procedure in which only a sample drawn from the population is used for statistical computation. Cluster sampling is a sampling procedure in which the population is first divided into clusters, a random sample of clusters is selected, and observations are then collected from the selected clusters. Ideally each cluster is as heterogeneous as the population itself, so that every cluster resembles the population on a small scale. Cluster sampling can be one-stage (every unit in a selected cluster is observed) or two-stage (a sample of units is drawn within each selected cluster). For example, to obtain a sample of household expenditure from a city, the city can first be divided into several blocks according to locality, some of the blocks can be selected at random, and households can then be sampled from the selected blocks. These blocks form the clusters.

Standard error of a mean: The sample mean is an unbiased estimate of the population mean. The deviation of the sample mean from the population mean is the error. The standard deviation of the sample mean over repeated samples is called the standard error of the mean. It is estimated by $s/\sqrt{n}$, where s is the sample standard deviation and n is the sample size.
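As a small worked example of the standard error of a mean, the sketch below estimates it as $s/\sqrt{n}$ from a single sample. The DMFT scores and the function name are hypothetical.

```python
import math
import statistics


def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical DMFT scores from a sample of nine children.
dmft = [4, 6, 5, 7, 3, 6, 5, 8, 1]
print(f"mean = {statistics.mean(dmft):.2f}, SE = {standard_error(dmft):.2f}")
```

Because n appears under a square root, quadrupling the sample size only halves the standard error.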
3: a) Inter-rater reliability is used to measure the extent to which two observers agree whether the teeth are carious or not. Inter-rater reliability is needed whenever a subjective judgment is involved. If the rating scale is continuous, Pearson's product moment correlation is used. If the rating scale is ordinal, Spearman's rank correlation is calculated. For a categorical variable, Cohen's Kappa is used. The formula for Cohen's Kappa is

$$\kappa = \frac{O - E}{N - E},$$

where O is the observed number of agreements, E is the number of agreements expected by chance, and N is the total sample size.

b) The extent to which an observer agrees with a gold standard test is measured by sensitivity and specificity, which are defined below, rather than by a failure rate. The failure rate, given by $f(t)/R(t)$ where $f(t)$ is the density of the time to failure, $R(t) = 1 - F(t)$ and $F(t)$ is the cumulative distribution function of t, describes time-to-event data and belongs to survival analysis.

c) A ROC curve is drawn to assess how well a test discriminates between the presence and absence of a disease. The ROC curve is drawn by plotting the true positive rate (TPR) against the false positive rate (FPR). The FPR is equal to (1 - specificity), and the TPR is equal to the sensitivity. Sensitivity is the proportion of the population with the disease that tests positive. Specificity is the proportion of the population without the disease that tests negative. The area under the ROC curve measures the level of discrimination between individuals with and without the disease.

The underlying principles of the ROC curve are as follows. The threshold value used to call a test result positive determines the sensitivity and specificity. Ideally the threshold would be chosen so that the distributions of test results for diseased and healthy individuals fall entirely on opposite sides of it. In most cases, however, the two distributions overlap, so some diseased people are misclassified as healthy and some healthy people as diseased. Assuming that higher test values indicate disease, lowering the threshold increases sensitivity and decreases specificity, while raising the threshold increases specificity and decreases sensitivity. Each threshold gives one point on the ROC curve.

4: The mean of a variable that is calculated from the data is the sample mean.
The population mean is generally unknown and differs from the sample mean. A 95% confidence interval for the mean is a range, calculated from the sample, that covers the population mean with confidence coefficient 0.95. This means that if the sampling were repeated many times, about 95% of the intervals calculated in this way would contain the population mean. If the test statistic follows a standard normal distribution, the upper and lower confidence limits are given by

$$UCL = \bar{x} + z \frac{s}{\sqrt{n}}, \qquad LCL = \bar{x} - z \frac{s}{\sqrt{n}},$$

where $z = 1.96$ for a 95% interval. If the test statistic follows a t distribution, the upper and lower confidence limits are given by

$$UCL = \bar{x} + t \frac{s}{\sqrt{n}}, \qquad LCL = \bar{x} - t \frac{s}{\sqrt{n}},$$

where t is the tabulated value on $n - 1$ degrees of freedom. The t value is generally used when the standard deviation has to be estimated from the sample. If the population standard deviation is known, it replaces s and the z value can be used. The method assumes that the observations are independent and that the variable is approximately normally distributed. A one-sample t-test at the 5% level rejects $H_0: \mu = \mu_0$ exactly when $\mu_0$ lies outside the 95% confidence interval based on t.

5: The following methods are used for the analysis of dental data:

One way ANOVA: One-way ANOVA (analysis of variance) is carried out to test whether the means of several groups are equal in the case of a fixed effects model, and whether the variance of the group effects is zero in the case of a random effects model. In one-way ANOVA, only one factor affects the values of the variable. The one-way ANOVA model is given by

$$y_{ij} = \mu + \alpha_i + e_{ij},$$

where $y_{ij}$ represents the jth observation in the ith group, $\mu$ is the common mean effect, $\alpha_i$ is the effect due to the ith group and $e_{ij}$ is the error, assumed to follow a $N(0, \sigma^2)$ distribution. The random effects model is given by $y_{ij} = \mu + a_i + e_{ij}$, where $a_i$ is the random effect due to the ith group.

An example where ANOVA can be used in a dental study: one wants to compare the performance of five brands of toothpaste for tooth sensitivity. A number of volunteers are selected and each of them is given one brand of toothpaste to use.
After finishing one tube, the volunteers are asked to give a score for their sensitivity problem. The mean score for each brand is calculated, the means are compared with the help of ANOVA, and the toothpaste that performs best can be identified.

ANOVA has certain assumptions. The errors are normally distributed with zero mean and the same variance $\sigma^2$ in all groups. The observations are independent. If these assumptions are violated, the results of ANOVA are not reliable. Besides, ANOVA can only tell whether all the means are equal. If the means are unequal, pairwise comparisons such as t-tests have to be performed to find out which means differ.

Survival Analysis: Survival analysis studies the time until an event (such as failure or death) occurs. It is particularly useful in the case of censored data. For example, if one wants to find the time required for recovering from a disease, survival analysis can be used. Survival analysis can be used to study the impact of a certain dental surgery on patients. For this analysis, one studies the time to the occurrence of the event (death) of the patient along with other factors. The study can be done with the help of the Kaplan-Meier estimator. If several factors affect the time, a regression model such as the Cox proportional hazards model can be used. The analysis gives the probability of surviving beyond a given time, and it can also take into account the effect of covariates on the time to the event.

Survival analysis has certain limitations. Survival regression shares the limitations of ordinary regression: the model is a simplification of real-life data, so the estimates are valid only up to a certain extent and may not hold for every case. Each model also carries its own assumptions. The Cox model assumes proportional hazards, and parametric models assume a particular distribution for the survival times.
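The Kaplan-Meier estimator mentioned above can be sketched in a few lines. At each time where an event occurs, the current survival probability is multiplied by the proportion of subjects still at risk who do not experience the event. The function name and the follow-up data (months until a restoration fails) are hypothetical and chosen only to show the arithmetic.

```python
def kaplan_meier(times, observed):
    """Return [(time, survival probability)] at each distinct event time.

    times    -- follow-up time for each subject
    observed -- True if the event occurred, False if the subject was censored
    """
    event_times = sorted({t for t, seen in zip(times, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, seen in zip(times, observed) if u == t and seen)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical months until a restoration fails; False marks a censored case.
months = [6, 12, 12, 18, 24, 30, 36, 36]
failed = [True, True, False, True, False, True, False, False]
curve = kaplan_meier(months, failed)
for t, s in curve:
    print(f"S({t}) = {s:.3f}")
```

Censored subjects never trigger a step in the curve, but they count in the at-risk group until their follow-up ends, which is how the estimator uses partial information.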
Another important feature of survival analysis is censored observations. If the number of censored cases is too large, survival analysis can lead to faulty results.

C. Log transformation: Log transformation is used in the following cases:

1. To reduce skewness: Log transformation is mainly used to make skewed data less skewed. After taking logarithms, one compares the geometric means of the values instead of the arithmetic means. For example, if brain weight is plotted against body weight, the distribution is skewed because body weight is very large compared to brain weight. When the log-transformed variables are plotted, the distribution becomes less skewed.

2. A logarithmic scale is also used when the dependent variable is binary or discrete and the explanatory variables are continuous. The model is then fitted on a log scale (for example to the log odds of a binary response) rather than to the raw response. This often happens when the response variable is itself categorical.

3. To bring the data closer to normality: Sometimes data do not follow a normal distribution. Taking a log transformation of the values can make the data approximately normal.

Log transformation has certain limitations. It can only be applied to positive values. It is not applicable to confounded data. The data points have to be independent; otherwise, the transformation is not useful.

D. Paired samples t-test: This is a test for paired observations, used to test whether the means of two related variables differ. A paired samples t-test can only be done if the two samples are of equal size, since every observation in one sample is matched with one in the other. In the paired samples t-test the difference within each pair is calculated. Let $d_i$ denote the difference for the ith pair. The mean of the differences is tested for being equal to zero. The hypotheses are $H_0$: the mean of the paired differences is equal to zero, against $H_1$: not $H_0$. The test statistic is

$$t = \frac{\bar{d}}{s_d/\sqrt{n}},$$

where $\bar{d}$ is the mean of the $d_i$, $s_d$ is the standard deviation of the $d_i$ values and n is the number of pairs.
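A minimal sketch of the paired calculation follows, using the standard error of the mean difference, $s_d/\sqrt{n}$, in the denominator. The before and after scores for six patients and the function name are invented for illustration.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """t statistic for H0: mean paired difference = 0, with its degrees of freedom."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)
    return d_bar / (s_d / math.sqrt(n)), n - 1


# Invented plaque scores for the same six patients before and after treatment.
before = [3.2, 2.8, 3.5, 3.0, 2.9, 3.4]
after = [2.7, 2.6, 3.1, 2.8, 2.5, 3.0]
t_stat, df = paired_t_statistic(before, after)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```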
The test statistic follows a t distribution with $n - 1$ degrees of freedom, where n is the sample size. A limitation of the paired t-test is that it is applicable only when the two groups have the same sample size. If the sample sizes differ, another t-test has to be performed. It is also applicable only when the differences are approximately normally distributed. If the underlying distribution is non-normal, nonparametric tests can be performed.

6: The DMF index is a measure used in dentistry to quantify dental caries experience. A certain researcher claims that the mean DMFT of males between 14 and 16 years of age is 6. To assess this claim a test has to be conducted. A random sample has to be drawn from the population of males aged 14 to 16 years in the region, and the mean DMFT obtained from the sample has to be tested.

The hypothesis to be tested is whether the mean or median of the population is equal to 6 or not: $H_0: \mu = 6$ against $H_1: \mu \neq 6$, where $\mu$ represents the mean of the distribution, or $H_0: me = 6$ against $H_1: me \neq 6$ for the median.

If the DMFT of the population can be assumed to be normally distributed, one can test whether the mean is equal to 6. The population mean is estimated by the sample mean, and the problem is to test whether the sample mean differs significantly from the claimed value. If the population does not follow a normal distribution, a nonparametric test for the median can be used instead.

Under the normality assumption, the problem is to test $H_0: \mu = 6$ against $H_1: \mu \neq 6$. If the standard deviation of the population is known, a z-test can be performed. If the population standard deviation has to be estimated from the sample, a t-test has to be performed. The statistic for the z-test is given by

$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}},$$

where $\bar{x}$ is the sample mean, $\mu_0 = 6$ is the claimed mean, $\sigma$ is the population standard deviation and n is the sample size.
The test statistic for the t-test is

$$T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}},$$

where $\bar{x}$ is the sample mean, $\mu_0 = 6$ is the claimed mean, s is the standard deviation of the sample and n is the sample size.

Interpreting test results: The null hypothesis is rejected if the p-value of the test is less than the level of significance $\alpha$. In the z-test, $H_0$ is rejected if the absolute value of Z is greater than $z_{\alpha/2}$, the upper $\alpha/2$ point tabulated for the standard normal distribution. In the t-test, the null hypothesis is rejected at level of significance $\alpha$ if the absolute value of t calculated from the sample is greater than the tabulated t value at $\alpha/2$ on $n - 1$ degrees of freedom.

The size of the sample is important when performing a test, because the precision of the test depends on it: a larger sample gives a smaller standard error, so a smaller departure from 6 can be detected. The sample size needed to estimate the mean to a given precision with 95% confidence is given by

$$n = \frac{1.96^2 \sigma^2}{e^2},$$

where $\sigma$ denotes the standard deviation and e denotes the margin of error within which the mean is to be estimated.
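The one-sample test and the sample size calculation can be sketched as follows. The DMFT scores are hypothetical, the claimed mean of 6 comes from the question, and the function names, $\sigma = 2$ and margin of 0.5 are assumptions made only for the example.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic for H0: population mean = mu0, with its degrees of freedom."""
    n = len(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / std_error, n - 1


def required_sample_size(sigma, margin, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) <= margin."""
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical DMFT scores for ten males aged 14 to 16.
dmft = [5, 7, 4, 6, 8, 5, 6, 7, 3, 4]
t_stat, df = one_sample_t(dmft, mu0=6)
n_needed = required_sample_size(sigma=2, margin=0.5)
print(f"t = {t_stat:.2f} on {df} df")
print(f"n needed for a margin of 0.5 when sigma = 2: {n_needed}")
```

With 9 degrees of freedom the tabulated two-sided 5% value of t is 2.262, so a statistic smaller than that in absolute value would not lead to rejecting the claim that the mean is 6.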