Continuing my reports about the course, Using Data in the Health and Social Care Environment, here’s what I found in the first part of Section 2.
Section 2 begins with a brief introduction. The next “task” is a huge (about 4,000 words) exploration of terminology used in health and social care statistics, roughly the same length as all the tasks in Section 1 put together. This is followed by an online quiz. I’m going to publish this post after the quiz and review the rest of the section in a separate post.
Interpreting data from quantitative statistics
This task describes a whole lot of technical terms. There are four parts.
The first part describes mean, median, mode, range, decimal places, ratios and percentages.
Mean is described well, but the example is peculiar. It’s the ages of five volunteers. Their total age is 315, which isn’t very meaningful. This should be a warning sign, because if the total isn’t meaningful then the mean won’t be meaningful either, but the course doesn’t tell you about this warning sign. It does say that an extreme value can skew the mean, but it doesn’t say that some data is skewed even when there are no extreme values. Later, the term “outlier” is used without explanation. Overall, it gives a somewhat false impression about the use of means.
Median is handled well, but again there’s nothing about skewed data. Extreme values are mentioned again. It does say that the median is a better measure than the mean for the ages of the volunteers, which is the correct conclusion even if the reasons aren’t quite right.
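The point about extreme values is easy to demonstrate in a couple of lines of Python (the ages here are made up, not the course's):

```python
import statistics

# Hypothetical ages for five volunteers; 93 plays the extreme value.
ages = [18, 22, 25, 27, 93]

mean_age = statistics.mean(ages)      # dragged upward by the outlier
median_age = statistics.median(ages)  # unaffected by the extreme value
print(mean_age, median_age)  # 37.0 25
```

The mean of 37 describes none of the volunteers well, while the median of 25 sits in the middle of the typical ages, which is the conclusion the course reaches too.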
Mode starts off OK, except that the example of only seven values has two modes, we’re told. Well, bi-modal data is extremely rare, and you wouldn’t ever conclude that some data is bi-modal from only seven values, so the example is a bit strange.
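For anyone who wants to check a claimed bi-modal result, Python's statistics module will report every mode (these seven values are invented to mimic the course's example):

```python
import statistics

# Seven hypothetical values in which 5 and 7 each appear twice.
values = [3, 5, 5, 6, 7, 7, 9]
modes = statistics.multimode(values)
print(modes)  # [5, 7]
```

Two values tying for most frequent in a sample this small tells you almost nothing about the underlying data, which is exactly the problem with the example.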
Range is controversial. The course says that a range is written with a dash, like 45 – 71. Other sources say a range is just a number, 26 in this example. It also says the range “can help to identify extreme values that might affect the mean”, which is a bit fanciful.
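The single-number version of the range is trivial to compute. Using the course's endpoints of 45 and 71 (the middle values here are made up):

```python
# Hypothetical ages spanning the course's example endpoints.
ages = [45, 52, 60, 67, 71]
data_range = max(ages) - min(ages)
print(data_range)  # 26
```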
Decimal places is very controversial. The international standard IEEE 754-2008 defines five different methods for rounding numbers, although even more than that exist. The default method, according to the standard, isn’t the method used by whoever wrote the course. Unfortunately the method used by whoever wrote the course can often introduce bias, so it’s not generally used in statistics, and the course doesn’t tell you this. There’s no discussion of when to round and when not to, or of other details of rounding such as negative numbers and significant figures, so it’s all very unsatisfactory — not a very rounded picture of the subject, I’d say.
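You can see the difference between the two methods directly in Python, whose built-in round() follows the IEEE 754 default:

```python
from decimal import Decimal, ROUND_HALF_UP

# Python's built-in round() uses "round half to even" (banker's rounding),
# the IEEE 754 default, which avoids the upward bias of always rounding up.
halves = [round(x) for x in (0.5, 1.5, 2.5, 3.5)]  # [0, 2, 2, 4]

# "Round half up", the schoolbook method: halves always go up, so a long
# column of rounded figures drifts upward on average.
half_up = Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_UP)
print(halves, half_up)  # [0, 2, 2, 4] 3
```

Notice that banker's rounding sends 2.5 down to 2 but 3.5 up to 4, so over many values the ups and downs cancel out, which is why it's preferred in statistics.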
Ratio is OK, except that ratios are sometimes written as ordinary decimal numbers, like 0.7, and the course doesn’t tell you this.
Percentage is OK as far as it goes, but it doesn’t get into the more complicated area of percentage change. There’s some example data, and one of the quiz questions is:
“What is the ratio of adult to children adult diabetics (to one decimal place)? (this requires a bit more working out)”
Yes, it certainly does require a bit of working out just to understand the question! Working backwards from the answer, I’d say the second “adult” shouldn’t be there.
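For the record, both calculations are straightforward, including the percentage change the course skips. The counts below are made up purely to illustrate:

```python
# Hypothetical counts of diabetics in one year and the next.
old_count, new_count = 250, 280
pct_change = (new_count - old_count) / old_count * 100  # 12.0 (%)

# A ratio of adults to children written as a decimal, to one decimal place.
adults, children = 210, 300
ratio = round(adults / children, 1)  # 0.7
print(pct_change, ratio)
```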
Technical terms in statistics
Population is defined strangely as “any complete group that has at least one characteristic in common”. I don’t think this really tells anyone anything. It’s easy to think up groups that fit the definition but that no one in their right mind would ever describe as a population.
Variable is said to mean the same as “data item”, and it’s “any number, quantity, attribute or characteristic that can be counted or measured”. Again, this is unsatisfactory. The attribute being measured (like “eye colour”) is one thing. I’d call this a variable. A value of the attribute (like “brown”) is another thing. I’d call this a data item or observation. The term observation isn’t defined at all, even though it’s used elsewhere in the text.
Amusingly, one of the examples is:
“eye colour – it varies for all the people you know (though it shouldn’t vary over time!)”
Clearly not written by a parent! (Babies’ eyes can take up to a year to develop their final colour.)
Data unit and dataset are defined, but not very satisfactorily. I don’t think these terms are common, in any case. The definition of dataset says it’s “complete” but many datasets are incomplete (which causes problems).
Discrete numerical data is defined wrongly as whole number data within a range. In fact negative numbers and fractions can also be discrete. For example, shoe sizes that include halves would be discrete.
Continuous numerical data is defined wrongly as data that can take any value within a range. In fact there doesn’t need to be a range.
Sample, census and standard error are OK.
Relative standard error is OK, but we’re told it’s more useful than the absolute standard error. This seems unlikely. Sometimes the absolute one is more useful.
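The relationship between the two is simple enough to show in a few lines (the blood-pressure readings are invented):

```python
import statistics

# A hypothetical sample of systolic blood pressures (mmHg).
sample = [118, 124, 131, 127, 140, 122, 135, 129]

mean = statistics.mean(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5  # absolute standard error
rse = se / mean * 100                               # relative standard error, %
print(mean, se, rse)
```

The relative version expresses the error as a percentage of the mean, which helps when comparing estimates on different scales, but the absolute version is the one you need for confidence intervals, for example.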
Confidence interval is sort of OK, but confidence intervals are commonly misinterpreted and the course doesn’t help you to avoid the pitfalls. Anyone who thinks it’s an easy thing to understand might enjoy reading: (Mis)Interpreting Confidence Intervals
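A minimal sketch of the usual calculation, with made-up measurements and the normal approximation (a t interval would really be more appropriate for a sample this small):

```python
import statistics

# Hypothetical measurements; a rough 95% interval for the mean.
sample = [4.1, 3.8, 5.0, 4.6, 4.3, 3.9]
mean = statistics.mean(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5

# Normal approximation: mean plus or minus 1.96 standard errors.
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI: {low:.2f} to {high:.2f}")
```

The correct reading is a statement about the procedure: in repeated sampling, about 95% of intervals built this way would contain the true mean. It is not a 95% probability that the true mean lies in this particular interval, which is the classic pitfall the course leaves unmentioned.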
Standard deviation is OK as far as it goes, and mentions variance.
Deciles, Quartiles and Quintiles would be OK if the course explained what “ranked” means. The description of median, earlier, said “arranged in order of value”. Oddly, the median doesn’t get a mention here.
Percentile is OK, but it’s described as if it has nothing to do with deciles, quartiles, quintiles and the median.
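The connection the course misses is that these are all the same idea with different numbers of cut points, and the median is one of them. A quick sketch with made-up data:

```python
import statistics

data = [12, 18, 23, 31, 36, 44, 52, 60, 75]  # already ranked (sorted)

# n=4 gives quartiles; n=5 would give quintiles, n=10 deciles,
# and n=100 percentiles. Each returns the cut points between groups.
quartiles = statistics.quantiles(data, n=4)
median = statistics.median(data)
print(quartiles, median)  # the middle quartile is the median
```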
The section on graphical representations of data covers many of the common kinds of chart, but wanders off to discuss other topics from time to time.
Pie charts are sort of OK, and there’s an example of a pie chart for anyone who hasn’t seen one before. But it says “the area of the segment is proportional to the number of data results that fall into that category”, which is not really how you draw or interpret a pie chart. You draw it or interpret it by thinking about the angle, not the area (even though the proportions are the same). And then it says, “Note that pie charts are only used for discrete data”, which is untrue. They are only used for categorized data, but the data can be continuous (before you categorize it).
Bar charts are “also used for discrete data”, it says, but in fact the data can be continuous just like for pie charts. It says bar charts display frequencies, but that’s confusing them with histograms (which aren’t mentioned). The example of a bar chart has no obvious bars, which is unfortunate if anyone taking the course has really never seen a bar chart before, and it doesn’t display frequencies, demonstrating that the text is wrong.
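The categorization step that both chart types actually depend on looks like this (weights invented; the bin boundaries are an arbitrary choice):

```python
from collections import Counter

# Hypothetical continuous weights (kg), binned into categories, which is
# what really happens before a pie or bar chart of such data is drawn.
weights = [61.2, 74.8, 88.5, 93.1, 70.4, 105.9, 79.3, 84.0]

def category(w):
    if w < 70:
        return "under 70"
    if w < 90:
        return "70-89"
    return "90+"

counts = Counter(category(w) for w in weights)
print(counts)  # frequencies per category, ready for a bar or pie chart
```

The data is continuous; only the categories are discrete, which is the distinction the course blurs.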
Scatter graphs, regression lines and correlation coefficients are OK, although the text is a bit evasive on correlation coefficients. There’s a link to further reading, but the link is broken.
Correlation and causation is OK. It’s not clear why any of this is in the graphical section, though, as a correlation coefficient isn’t graphical.
Line graph is sort of OK, and there’s an example of a line graph, but it says the data must be continuous. It’s not unusual to see line graphs of discrete data. Also, it says line graphs are to demonstrate change over time, when they don’t always involve time. And it says a line graph is “for a single variable or indicator” when it’s fairly common to use them to compare two or more variables, sometimes on different scales.
Spine chart is OK, and again there’s an example.
Tartan rug seems more like slang than a statistical term. It’s a form of RAG chart, but the course doesn’t mention those (RAG means red-amber-green).
Significance isn’t defined at all, even though it’s used in the text a couple of times. And p-values aren’t explained (nor p-hacking and so forth).
I’m fed up, so I’m not going to go through all the geographical terms. Some of them are mentioned without explanation. Top tier local authority is a useful concept that’s missing.
The online quiz tests the knowledge course participants have gained so far, presumably. It has ten questions, a mixture of multiple-choice and calculations. There’s a practice version first, followed by the real thing with a pass mark of 80%.
In the practice version I did the calculations by copying and pasting the numbers into a spreadsheet, and for the multiple-choice questions I chose wrong answers that matched the false information in the previous parts of the course, where necessary. I scored:
“10.00 / 10.00”
Notice the two decimal places of accuracy!
Obviously, I tried again to see whether I could improve on this. The questions were different! OK, I admit it, that’s the real reason I tried again. There’s limited randomization generating a different quiz each time. Sometimes the numbers were changed, sometimes the scenario was changed trivially, and sometimes the choices were simply in a different order. However, the ten questions were essentially the same, and in the same order.
On the second attempt I got worse. I scored only “9.00 / 10.00”. I realized that if things got even worse I’d be down to a bare pass mark, so I didn’t do the practice quiz again.
It would be tedious to write about all the questions. The one about plotting methods was the one I found most difficult: applying the misinformation in the previous part of the course in order to guess which wrong answer would get the mark.
In one version of this question the answer that got the mark was that neither a line graph nor a pie chart would be appropriate to show:
“50 individual records of the weights of service users (in kg) in your obesity action group”
Actually, either kind of chart could be used. An “individual record” in a real-life obesity action group is almost certainly a record over time. You could plot the average on a line graph. Or you could categorize the weights however you like and plot them on a pie chart. Even if it’s a snapshot of the weights on a single occasion (which the question doesn’t specify) you could categorize them for a pie chart.
In the other version of the same question the answer that got the mark was that both a scatter graph and a tartan rug would be appropriate to show (and this is exactly how it was worded):
“The prevalence of obesity for all CCGs in the country number of people in each age range category in the GP practices”
Again, this is only true if you do some work on the data. You couldn’t plot the raw data as described in either kind of graph, but if you worked on it you could reduce it to a scatter graph or categorize it for a tartan rug.
So the expected answers for the two versions of the same question contradicted each other. In one, you were expected to answer as if the data is to be plotted raw, but in the other you were expected to answer as if the data is to be worked on to suit the kind of graph.
In another question, about bar charts, one of the options says that they “can be shown with associated errors of calculation”, but although this is true, error bars aren’t mentioned in the text, so someone relying on the course for their information wouldn’t know about them. Also, error bars usually show variation present in the data, not errors of calculation.
Overall, this section on statistical terms was unreliable. Some of the information is true, and some of it false, but no one can tell which bits are true and which aren’t without cross-checking everything. That makes it completely useless as course content.
A higher level view of it is that the audience for the course is not well defined. If the course is really for people who have to be reminded how to calculate averages, and who don’t use spreadsheets, that’s a rather narrow range of people whose ability to use data won’t be enhanced by a course like this. They’ll just end up more confused than they were when they started. But, on the other hand, if the course is for people who already know the basics of how to plot various kinds of charts, then the additional things they might need to know aren’t covered.