Many people find statistics confusing, mainly due to the abundance of assumptions, theorems, and terms involved. However, if you’re planning to enter the field of data science, a solid understanding of the fundamentals of statistics is a must. To be a good data science professional, you don’t necessarily need a Ph.D. in statistics, but you do need to be able to explain the key concepts when required. With a strong foundation in mathematics and statistics, you’ll be able to analyze a great deal of what happens in today’s world.
In this post, we’re going to discuss ten essential things that you must understand to excel in statistics. These include concepts, equations, and theorems that will not only help you pursue data science but also strengthen your grasp of statistics itself.
1. Sample
A sample is a smaller, specific group drawn from an entire population. For instance, if we consider “all the countries in the world” as a population, a sample could be “countries that have published data on unemployment since 2000”. The objective of studying a well-chosen sample is to test hypotheses or make inferences about the population in a valid and reliable way.
2. Population
In statistics, a population stands for the entire group that you’re interested in studying. Note that a population doesn’t necessarily refer to people. It can be a group comprising any elements you’re looking to study, from species, countries, and objects to organizations, events, and more. The term “population” therefore carries a slightly different meaning here than it does in everyday language.
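To make the distinction concrete, here’s a minimal Python sketch that draws a simple random sample from a small, made-up population of countries. The country list, the seed, and the sample size are purely illustrative.

```python
import random

# Hypothetical "population": every element we are interested in studying.
population = ["Germany", "Japan", "Brazil", "Kenya", "Canada",
              "India", "Norway", "Chile", "Vietnam", "Egypt"]

random.seed(42)                        # fixed seed so the draw is reproducible
sample = random.sample(population, 4)  # 4 countries chosen without replacement
print(sample)
```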
3. Measures of central tendency
A measure of central tendency is a summary statistic that represents the center point of a dataset. It’s a single value you can use to describe the entire dataset by identifying its central position. In statistics, three measures of central tendency are most commonly used: the mean, the median, and the mode.
Mean
The mean is the most widely used measure of central tendency. To calculate it, you add up all the values in your dataset and divide the sum by the number of values. This is known as the arithmetic mean. Two other kinds of mean, the geometric mean and the harmonic mean, are also used to locate the central tendency.
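As a quick illustration, here’s a short Python sketch that computes all three kinds of mean with the standard-library statistics module on a small made-up dataset:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))            # arithmetic mean: sum / count = 5.0
print(statistics.geometric_mean(data))  # n-th root of the product of the values
print(statistics.harmonic_mean(data))   # count / sum of the reciprocals
```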
Median
You can think of the median as the middle value of your dataset once the values are arranged in order of magnitude. The method for identifying it differs slightly depending on whether your dataset has an even or an odd number of values.
If your dataset has an even number of values, you need to add the two middle values and calculate the average to identify the median of your dataset.
If your dataset has an odd number of values, the median is simply the middle value.
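Here’s a minimal Python sketch of that even/odd rule, using made-up values and a small helper written just for illustration:

```python
def median(values):
    ordered = sorted(values)   # arrange the values in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # odd count: take the middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 3, 5, 9]))      # odd number of values  -> 5
print(median([7, 1, 3, 5, 9, 11]))  # even number of values -> (5 + 7) / 2 = 6.0
```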
Mode
The mode is the value that occurs most frequently in your dataset. If several values tie for the highest frequency, the dataset has multiple modes. In some cases, there may be no mode at all; for instance, with continuous data, no value may occur more often than another.
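A short Python sketch of the mode, again using the standard-library statistics module on made-up data:

```python
import statistics

print(statistics.mode([1, 2, 2, 3, 4]))       # single mode: 2
print(statistics.multimode([1, 1, 2, 2, 3]))  # two values tie -> [1, 2]
print(statistics.multimode([1, 2, 3, 4]))     # no repeats -> every value is returned
```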
4. Measures of dispersion
You use a measure of dispersion to describe the variability in a population or sample. It’s commonly used together with one of the measures of central tendency to obtain an overall description of a dataset. In statistics, two key types of dispersion methods are used — absolute measure of dispersion and relative measure of dispersion.
Absolute measure of dispersion
An absolute measure of dispersion is expressed in the same unit as the original data. It describes variation in terms of the average deviation from a central value, as in the mean deviation or the standard deviation. Common absolute measures of dispersion include the range, quartile deviation, interquartile range, standard deviation, and mean deviation.
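Here’s a minimal Python sketch, on made-up data, of a few of these absolute measures; the quartile deviation is computed as half the interquartile range:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9, 2]

data_range = max(data) - min(data)                       # range
std_dev = statistics.stdev(data)                         # sample standard deviation
mean = statistics.mean(data)
mean_dev = sum(abs(x - mean) for x in data) / len(data)  # mean (absolute) deviation

q1, _, q3 = statistics.quantiles(data, n=4)              # quartiles Q1, Q2, Q3
iqr = q3 - q1                                            # interquartile range
quartile_dev = iqr / 2                                   # quartile deviation (semi-IQR)

print(data_range, std_dev, mean_dev, iqr, quartile_dev)
```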
Relative measure of dispersion
You use a relative measure of dispersion to compare two or more datasets. Here, values are compared without units, so the comparison stays meaningful even when the datasets are measured on different scales. Commonly used relative measures of dispersion include the following (a short sketch of one of them, the coefficient of variation, follows the list).
- Coefficient of Variation
- Coefficient of Standard Deviation
- Coefficient of Range
- Coefficient of Mean Deviation
- Coefficient of Quartile Deviation
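Here’s a minimal Python sketch of the coefficient of variation, computed as the standard deviation divided by the mean, on two made-up datasets measured in different units:

```python
import statistics

heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 70, 78, 85]

def coefficient_of_variation(values):
    # Dividing by the mean cancels the unit, so datasets in different
    # units can be compared directly.
    return statistics.stdev(values) / statistics.mean(values)

print(coefficient_of_variation(heights_cm))  # ~0.05: heights vary relatively little
print(coefficient_of_variation(weights_kg))  # ~0.17: weights vary relatively more
```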
5. Frequentist approach
The frequentist approach is used to reason about the underlying truths of repeatable experiments. It defines the probability of an outcome as its long-run frequency over many repetitions of the experiment. To understand this clearly, consider a coin toss: if the probability of a fair coin landing heads is 0.5, then if it’s tossed enough times, we can expect to see heads in roughly 50% of the tosses. One common drawback of the frequentist approach is that the result depends on how many times you can repeat the experiment. Therefore, if an event isn’t repeatable, you cannot actually define its probability using the frequentist approach.
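The following Python sketch simulates this idea: the observed frequency of heads in repeated fair-coin tosses settles down near 0.5 as the number of tosses grows (the seed and the toss counts are arbitrary):

```python
import random

random.seed(0)

for n_tosses in (10, 100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    print(f"{n_tosses:>9} tosses: observed frequency of heads = {heads / n_tosses:.4f}")
```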
6. Bayesian approach
The Bayesian approach applies probability to statistical problems. It provides mathematical tools that we can use to update our beliefs about random events as new data about those events comes to light. There’s a major difference between the frequentist and Bayesian approaches: in the former, we try to eliminate uncertainty by producing estimates, while in the latter, we preserve and refine uncertainty by updating our beliefs in light of new data or evidence. To apply the Bayesian approach successfully to a problem, you must become familiar with two major concepts: Bayes’ theorem and conditional probability.
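As a small illustration of Bayes’ theorem, here’s a Python sketch with entirely hypothetical numbers: we update a prior belief that a coin is biased after observing three heads in a row.

```python
# P(biased | evidence) = P(evidence | biased) * P(biased) / P(evidence)

p_biased = 0.1                       # prior belief that the coin is biased
p_fair = 1 - p_biased

p_evidence_given_biased = 0.9 ** 3   # a biased coin lands heads 90% of the time (assumed)
p_evidence_given_fair = 0.5 ** 3     # a fair coin lands heads 50% of the time

p_evidence = (p_evidence_given_biased * p_biased
              + p_evidence_given_fair * p_fair)

posterior = p_evidence_given_biased * p_biased / p_evidence
print(round(posterior, 3))  # ~0.393: the belief rose from 0.10 after seeing the data
```

Observing more evidence would shift the posterior further; the point is that the belief is updated rather than replaced.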
7. Central Limit Theorem
In modern statistics, the central limit theorem, or CLT, is one of the most important theorems used in hypothesis testing. It states that as the sample size gets larger, the distribution of the sample mean of a variable approaches a normal distribution, irrespective of that variable’s distribution in the population. Note that the theorem only says the sampling distribution becomes approximately normal once the sample size is sufficiently large. Although the required size depends on the distribution of the variable in the population, a sample size of 30 is typically considered sufficient for the majority of distributions; strongly asymmetric distributions may need larger samples.
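Here’s a rough NumPy sketch of the CLT in action: sample means drawn from a heavily skewed (exponential) population become less skewed, and hence closer to normal, as the sample size grows. All sizes and parameters here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# A heavily skewed "population": exponentially distributed values.
population = rng.exponential(scale=2.0, size=100_000)

for n in (2, 30, 200):
    # Draw 5,000 samples of size n and record each sample's mean.
    samples = rng.choice(population, size=(5_000, n))
    sample_means = samples.mean(axis=1)
    # For an approximately normal distribution, skewness is close to 0.
    skew = ((sample_means - sample_means.mean()) ** 3).mean() / sample_means.std() ** 3
    print(f"sample size {n:>3}: skewness of the sample means = {skew:.2f}")
```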
8. Law of Large Numbers
The law of large numbers is one of the key theorems in probability and statistics. It states that as you repeat an experiment more and more times and calculate the sample mean, the result gets closer and closer to the expected or true value. There are two versions of the law: the strong law and the weak law. The strong law of large numbers says the sample mean converges to the expected value almost surely, while the weak law says it converges in probability, which is a slightly weaker guarantee.
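A minimal Python sketch of the law in action: the running mean of fair-die rolls drifts toward the expected value of 3.5 as the number of rolls grows (the seed and checkpoints are arbitrary):

```python
import random

random.seed(7)

total = 0
for i in range(1, 100_001):
    total += random.randint(1, 6)              # one fair-die roll
    if i in (10, 100, 1_000, 10_000, 100_000):
        print(f"{i:>7} rolls: sample mean = {total / i:.3f}")
```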
9. Sample Representativeness
You can consider a representative sample as one that reflects your population accurately, meaning the sample matches the key characteristics of the population. Note that a representative sample needs to be an unbiased representation of your population. You can evaluate representativeness along many dimensions, from gender, age, education, and profession to socioeconomic status, chronic illness, and more; which of them matter depends on the scope of your study, the information available about your population, and how detailed you want to get. If the statistics you obtain from sampling don’t reflect the parameters of your population, you’re working with an unrepresentative sample. Therefore, you must try to avoid selection bias and keep the selection as random as possible.
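One common way to keep a sample representative is stratified sampling, where each group is sampled in proportion to its share of the population instead of relying on pure luck. Here’s a minimal Python sketch with a hypothetical population grouped by age band:

```python
import random

random.seed(3)

# Hypothetical population grouped by age band (members are just IDs here).
population = {
    "18-34": list(range(500)),
    "35-54": list(range(300)),
    "55+":   list(range(200)),
}
sample_size = 100
total = sum(len(members) for members in population.values())

sample = {}
for band, members in population.items():
    k = round(sample_size * len(members) / total)  # proportional allocation
    sample[band] = random.sample(members, k)

print({band: len(chosen) for band, chosen in sample.items()})
# -> {'18-34': 50, '35-54': 30, '55+': 20}
```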
10. Hypothesis Testing
In statistics, hypothesis testing refers to the method that statisticians use to make statistical decisions based on experimental data. Basically, the result of a hypothesis test lets you determine whether your assumptions hold or have been violated. If the assumptions have been violated, the experiment will be of little use and might not be repeatable.
Here are the key steps in hypothesis testing.
- Specify the null hypothesis
- Specify the significance level (also known as the alpha level)
- Compute the probability value (also known as the p-value)
- Compare the probability value with the significance level
If the probability value is lower than the significance level, you should reject the null hypothesis; if it’s higher than the conventional significance level of 0.05, you should consider the findings inconclusive. Typically, statisticians design experiments so that the null hypothesis can be refuted. However, failure to refute the null hypothesis doesn’t establish support for it. It only means that you don’t have sufficiently strong data to refute it.
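To tie the steps together, here’s a minimal Python sketch using SciPy’s one-sample t-test on made-up data, with the null hypothesis that the true mean equals 100:

```python
from scipy import stats

data = [102.1, 99.3, 105.4, 101.2, 98.7, 103.9, 104.4, 100.8]

alpha = 0.05                                      # significance level
t_stat, p_value = stats.ttest_1samp(data, popmean=100)

print(f"p-value = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis (the findings are inconclusive).")
```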
Wrapping up
We hope that this article has been insightful. All of the above topics are pivotal for developing a good understanding of the fundamentals of statistics. However, this isn’t a comprehensive list of everything you need to focus on. If you want to develop a deeper understanding of these concepts, along with other important ones, consider joining a course on statistics.
. . .
To learn more about variance and bias, check out our other article on the topic.