Teaching Ratings¶
Objective: Analyze teaching ratings of professors with different characteristics and see if there are external influences on the teaching evaluation score.
Import libraries¶
from scipy.stats import norm, levene, ttest_ind, f_oneway, chi2_contingency, pearsonr
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
Load the dataset¶
file_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/teachingratings.csv'
df = pd.read_csv(file_url).loc[:, :"prof"]
df.head()
 | minority | age | gender | credits | beauty | eval | division | native | tenure | students | allstudents | prof |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yes | 36 | female | more | 0.289916 | 4.3 | upper | yes | yes | 24 | 43 | 1 |
1 | yes | 36 | female | more | 0.289916 | 3.7 | upper | yes | yes | 86 | 125 | 1 |
2 | yes | 36 | female | more | 0.289916 | 3.6 | upper | yes | yes | 76 | 125 | 1 |
3 | yes | 36 | female | more | 0.289916 | 4.4 | upper | yes | yes | 77 | 123 | 1 |
4 | no | 59 | male | more | -0.737732 | 4.5 | upper | yes | yes | 17 | 20 | 2 |
Understand the dataset¶
Variable | Description |
---|---|
minority | Does the instructor belong to a minority (non-Caucasian) group? |
age | The professor's age |
gender | Indicating whether the instructor was male or female. |
credits | Is the course a single-credit elective? |
beauty | Rating of the instructor's physical appearance by a panel of six students averaged across the six panelists and standardized to have a mean of zero. |
eval | Course overall teaching evaluation score, on a scale of 1 (very unsatisfactory) to 5 (excellent). |
division | Is the course an upper or lower division course? |
native | Is the instructor a native English speaker? |
tenure | Is the instructor on a tenure track? |
students | Number of students that participated in the evaluation. |
allstudents | Number of students enrolled in the course. |
prof | Indicating instructor identifier. |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 463 entries, 0 to 462
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   minority     463 non-null    object
 1   age          463 non-null    int64
 2   gender       463 non-null    object
 3   credits      463 non-null    object
 4   beauty       463 non-null    float64
 5   eval         463 non-null    float64
 6   division     463 non-null    object
 7   native       463 non-null    object
 8   tenure       463 non-null    object
 9   students     463 non-null    int64
 10  allstudents  463 non-null    int64
 11  prof         463 non-null    int64
dtypes: float64(2), int64(4), object(6)
memory usage: 43.5+ KB
Visualize the dataset¶
fig, axs = plt.subplots(4, 3, figsize=(15, 20))
for ax, feature in zip(axs.flatten(), df.columns[:-1]):
    if df[feature].dtype == 'O':
        # Categorical column: bar chart of category counts
        labels, sizes = np.unique(df[feature], return_counts=True)
        sns.barplot(x=labels, y=sizes, hue=labels, ax=ax, legend=False)
    else:
        # Numeric column: histogram
        sns.histplot(data=df, x=feature, ax=ax)
    ax.set_xlabel("")
    ax.set_title(feature)
axs[3, 2].axis("off")
plt.tight_layout()
plt.show()
Analysis¶
Does average beauty score differ by gender?¶
gender_beauty_mean = df.groupby('gender')["beauty"].mean()
print(gender_beauty_mean)
plt.figure()
sns.histplot(data=df, x="beauty", hue="gender")
plt.axvline(gender_beauty_mean["female"], color="blue", linestyle='dotted')
plt.axvline(gender_beauty_mean["male"], color="orange", linestyle='dotted')
plt.show()
gender
female    0.116109
male     -0.084482
Name: beauty, dtype: float64
Percentage of males and females that are tenured professors¶
gender_tenure_count = df[df.tenure == 'yes'].groupby('gender')["tenure"].count()
labels = gender_tenure_count.keys()
sizes = gender_tenure_count.values
print(gender_tenure_count.apply(lambda x: round((x / gender_tenure_count.sum()) * 100, 2)))
fig, ax = plt.subplots()
ax.pie(sizes, textprops={'color': "w", 'fontsize': '12'}, autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct/100 * sum(sizes))))
ax.legend(labels)
plt.title("Tenured professors")
plt.show()
gender
female    40.17
male      59.83
Name: tenure, dtype: float64
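The percentages above are each gender's share of the tenured rows. A related but different quantity is the tenure rate within each gender. The sketch below computes both from the gender/tenure counts reported later in this notebook (145 tenured and 50 untenured rows for females, 216 and 52 for males). The variable names are illustrative and not part of the original analysis.

```python
# Counts copied from the gender/tenure value counts in this notebook.
counts = {
    "female": {"yes": 145, "no": 50},
    "male": {"yes": 216, "no": 52},
}

total_tenured = sum(c["yes"] for c in counts.values())

# Share of all tenured rows that belong to each gender (what the pie chart shows).
share_of_tenured = {g: 100 * c["yes"] / total_tenured for g, c in counts.items()}

# Share of each gender's own rows that are tenured.
tenure_rate = {g: 100 * c["yes"] / (c["yes"] + c["no"]) for g, c in counts.items()}

for g in counts:
    print(f"{g}: {share_of_tenured[g]:.2f}% of tenured rows, "
          f"{tenure_rate[g]:.2f}% of {g} rows are tenured")
```

Note that each row of the dataset is a course evaluation, so both figures count courses rather than distinct instructors.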
Percentage of minorities and non-minorities that are tenured professors¶
minority_tenure_count = df[df.tenure == 'yes'].groupby('minority')["tenure"].count()
labels = ["non-minority" if i == "no" else "minority" for i in minority_tenure_count.keys()]
sizes = minority_tenure_count.values
print(minority_tenure_count.apply(lambda x: round((x / minority_tenure_count.sum()) * 100, 2)))
fig, ax = plt.subplots()
ax.pie(sizes, textprops={'color': "w", 'fontsize': '12'}, autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct/100 * sum(sizes))))
ax.legend(labels)
plt.title("Tenured professors")
plt.show()
minority
no     85.04
yes    14.96
Name: tenure, dtype: float64
Does average age differ by tenure?¶
tenure_age_mean = df.groupby('tenure')["age"].mean()
print(tenure_age_mean)
plt.figure()
sns.histplot(data=df, x="age", hue="tenure")
plt.axvline(tenure_age_mean["no"], color="blue", linestyle="dotted")
plt.axvline(tenure_age_mean["yes"], color="orange", linestyle="dotted")
plt.show()
tenure
no     50.186275
yes    47.850416
Name: age, dtype: float64
What is the mean evaluation score for tenured professors?¶
tenure_eval_mean = df.groupby('tenure')["eval"].mean()
print(tenure_eval_mean)
plt.figure()
sns.histplot(data=df, x="eval", hue="tenure")
plt.axvline(tenure_eval_mean["no"], color="blue", linestyle="dotted")
plt.axvline(tenure_eval_mean["yes"], color="orange", linestyle="dotted")
plt.show()
tenure
no     4.133333
yes    3.960111
Name: eval, dtype: float64
Do instructors teaching lower-division courses receive higher average teaching evaluations?¶
division_eval_mean = df.groupby('division')['eval'].mean()
labels = division_eval_mean.keys()
sizes = division_eval_mean.values
print(division_eval_mean)
plt.figure()
sns.barplot(x=labels, y=sizes, hue=labels)
plt.show()
division
lower    4.087261
upper    3.952614
Name: eval, dtype: float64
Box plot for beauty scores differentiated by credits¶
print(df.groupby('credits')["beauty"].describe().iloc[:, 1:])
plt.figure()
sns.boxplot(x="credits", y='beauty', hue="credits", data=df)
plt.show()
             mean       std       min       25%       50%       75%       max
credits
more     0.016606  0.797503 -1.450494 -0.656395 -0.066674  0.556886  1.970023
single  -0.268149  0.575841 -0.656269 -0.583587 -0.532420 -0.286782  1.154256
What is the number of courses taught by gender?¶
courses_gender_count = df["gender"].value_counts()
labels = courses_gender_count.keys()
sizes = courses_gender_count.values
print(courses_gender_count)
plt.figure()
sns.barplot(x=labels, y=sizes, hue=labels)
plt.ylabel("Count")
plt.show()
gender
male      268
female    195
Name: count, dtype: int64
Group histogram of professors by gender and tenure¶
gender_tenure_count = df.groupby("gender")["tenure"].value_counts()
print(gender_tenure_count)
plt.figure()
sns.barplot(x="gender", y="count", hue="tenure", data=gender_tenure_count.reset_index())
plt.show()
gender  tenure
female  yes       145
        no         50
male    yes       216
        no         52
Name: count, dtype: int64
Group histogram of professors by gender, differentiated by tenure and division¶
print(df.groupby(["division", "gender"])["tenure"].value_counts())
sns.catplot(x='gender', hue='tenure', row='division', kind='count', data=df, height=3, aspect=2)
plt.show()
division  gender  tenure
lower     female  yes        44
                  no         16
          male    yes        66
                  no         31
upper     female  yes       101
                  no         34
          male    yes       150
                  no         21
Name: count, dtype: int64
Histogram of teaching evaluation score with gender as a factor¶
print(df.groupby('gender')["eval"].describe().iloc[:, 1:])
plt.figure()
sns.histplot(data=df, x="eval", hue="gender")
plt.show()
            mean       std  min  25%   50%  75%  max
gender
female  3.901026  0.538803  2.3  3.6  3.90  4.3  4.9
male    4.069030  0.556652  2.1  3.7  4.15  4.5  5.0
Box plot for age differentiated by gender¶
print(df.groupby('gender')["age"].describe().iloc[:, 1:])
plt.figure()
sns.boxplot(x="gender", y="age", hue="gender", data=df)
plt.show()
             mean       std   min   25%   50%    75%   max
gender
female  45.092308  8.532031  29.0  38.0  46.0  52.00  62.0
male    50.746269  9.993396  32.0  43.0  51.0  59.25  73.0
Box plot for age along with tenure and gender¶
print(df.groupby(['tenure', "gender"])["age"].describe().iloc[:, 1:])
plt.figure()
sns.boxplot(x="tenure", y="age", hue="gender", data=df)
plt.show()
                    mean        std   min   25%   50%   75%   max
tenure gender
no     female  49.900000   6.569099  38.0  47.0  52.0  56.0  57.0
       male    50.461538   7.344363  37.0  47.0  48.0  58.0  63.0
yes    female  43.434483   8.520249  29.0  36.0  43.0  51.0  62.0
       male    50.814815  10.545272  32.0  42.0  52.0  60.0  73.0
Histogram of beauty scores with native English speaker as a factor¶
print(df.groupby('native')["beauty"].describe().iloc[:, 1:])
plt.figure()
sns.histplot(data=df, x='beauty', hue='native')
plt.show()
            mean       std       min       25%       50%       75%       max
native
no      0.031962  0.297944 -0.848727 -0.107181  0.077509  0.216674  0.420400
yes    -0.002057  0.810246 -1.450494 -0.656332 -0.083601  0.576680  1.970023
Box plot of the age of the instructors by visible minority¶
print(df.groupby('minority')["age"].describe().iloc[:, 1:])
plt.figure()
sns.boxplot(x='minority', y='age', hue="minority", data=df)
plt.show()
               mean        std   min   25%   50%   75%   max
minority
no        48.769424  10.230851  31.0  40.0  50.0  57.0  73.0
yes       45.843750   5.995286  29.0  43.0  47.0  50.0  54.0
Group histogram of tenure by minority and add the gender factor¶
print(df.groupby(["gender", "tenure"])["minority"].value_counts())
sns.catplot(data=df, x='tenure', hue='minority', row='gender', kind='count', height=3, aspect=2)
plt.show()
gender  tenure  minority
female  no      no           50
        yes     no          109
                yes          36
male    no      no           42
                yes          10
        yes     no          198
                yes          18
Name: count, dtype: int64
What is the probability of receiving an evaluation score of greater than 4.5?¶
eval_statistics = df["eval"].describe().iloc[1:]
probability = norm.cdf((4.5 - eval_statistics["mean"]) / eval_statistics["std"])
print(eval_statistics)
print("\n\033[1mProbability\033[0m = {:.2f}%".format(100*(1 - probability)))
plt.figure()
sns.histplot(df["eval"])
plt.show()
mean 3.998272
std 0.554866
min 2.100000
25% 3.600000
50% 4.000000
75% 4.400000
max 5.000000
Name: eval, dtype: float64
Probability = 18.29%
What is the probability of receiving an evaluation score greater than 3.5 and less than 4.2?¶
probability_1 = norm.cdf((3.5 - eval_statistics["mean"]) / eval_statistics["std"])
probability_2 = norm.cdf((4.2 - eval_statistics["mean"]) / eval_statistics["std"])
print("Probability = {:.2f}%".format(100*(probability_2 - probability_1)))
Probability = 45.73%
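Both probabilities above rest on modelling `eval` as normally distributed with the sample mean and standard deviation. As a cross-check that does not need the dataset, the sketch below recomputes them with the standard library's `statistics.NormalDist`, using the summary statistics printed above. Because scores are bounded between 1 and 5, the normal model is only an approximation.

```python
from statistics import NormalDist

# Summary statistics copied from the eval describe() output above.
mean, std = 3.998272, 0.554866
eval_dist = NormalDist(mu=mean, sigma=std)

# Upper tail: P(eval > 4.5) = 1 - CDF(4.5)
p_above_4_5 = 1 - eval_dist.cdf(4.5)

# Interval: P(3.5 < eval < 4.2) = CDF(4.2) - CDF(3.5)
p_between = eval_dist.cdf(4.2) - eval_dist.cdf(3.5)

print(f"P(eval > 4.5)       = {100 * p_above_4_5:.2f}%")
print(f"P(3.5 < eval < 4.2) = {100 * p_between:.2f}%")
```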
Using t-test, does gender affect teaching evaluation scores?¶
For the t-test for independent samples, the following assumptions must be met:
- One independent, categorical variable with two levels or groups.
- One dependent continuous variable.
- There is no relationship between the observations in each group.
- The dependent variable must follow a normal distribution.
- Assumption of homogeneity of variance.
State the hypothesis:
- $H_0: \mu_1 = \mu_2$ (There is no difference in mean evaluation scores between males and females)
- $H_1: \mu_1 \neq \mu_2$ (There is a difference in mean evaluation scores between males and females)
df.groupby("gender")["eval"].describe()
gender | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
female | 195.0 | 3.901026 | 0.538803 | 2.3 | 3.6 | 3.90 | 4.3 | 4.9 |
male | 268.0 | 4.069030 | 0.556652 | 2.1 | 3.7 | 4.15 | 4.5 | 5.0 |
Use Levene's test to check the null hypothesis that all input samples come from populations with equal variances.
levene(df[df['gender'] == 'female']['eval'],
df[df['gender'] == 'male']['eval'],
center='mean')
LeveneResult(statistic=np.float64(0.19032922435292574), pvalue=np.float64(0.6628469836244741))
Since the p-value is greater than 0.05, we can assume equality of variance.
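To make the test less of a black box, here is a minimal sketch of the statistic that Levene's test with `center='mean'` is based on: a one-way ANOVA F-ratio computed on the absolute deviations of each observation from its own group mean. The function name `levene_mean` and the toy samples are hypothetical. It returns only the statistic, not the p-value.

```python
def levene_mean(*groups):
    """Levene's W statistic using deviations from each group's mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Absolute deviation of every observation from its own group mean.
    deviations = []
    for g in groups:
        group_mean = sum(g) / len(g)
        deviations.append([abs(x - group_mean) for x in g])

    dev_means = [sum(z) / len(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total

    # Between-group and within-group sums of squares of the deviations.
    between = sum(len(z) * (zm - grand_mean) ** 2
                  for z, zm in zip(deviations, dev_means))
    within = sum((v - zm) ** 2
                 for z, zm in zip(deviations, dev_means) for v in z)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical toy samples, not the ratings data.
same_spread = levene_mean([1, 2, 3], [11, 12, 13])      # equal spread
different_spread = levene_mean([1, 2, 3], [0, 10, 20])  # unequal spread
print(same_spread, different_spread)
```

A statistic near zero, as in the first call, means the groups have similar spread. A large statistic points towards unequal variances.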
ttest_ind(df[df['gender'] == 'female']['eval'],
df[df['gender'] == 'male']['eval'],
equal_var=True)
TtestResult(statistic=np.float64(-3.249937943510772), pvalue=np.float64(0.0012387609449522217), df=np.float64(461.0))
Answer: Since the p-value is less than 0.05, we reject the null hypothesis: there is sufficient evidence of a statistically significant difference in mean teaching evaluation scores between genders.
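The t statistic reported by `ttest_ind` with `equal_var=True` can be reproduced from the group sizes, means, and standard deviations in the `describe()` table alone. The sketch below does this with the pooled-variance formula. The helper name `pooled_t` is illustrative, and only the statistic and degrees of freedom are computed, not the p-value.

```python
import math

def pooled_t(n1, mean1, std1, n2, mean2, std2):
    """Two-sample t statistic assuming equal variances (pooled)."""
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / dof
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, dof

# Female and male summary statistics copied from the table above.
t_stat, dof = pooled_t(195, 3.901026, 0.538803, 268, 4.069030, 0.556652)
print(f"t = {t_stat:.4f}, df = {dof}")
```

The sign is negative because the female mean (first sample) is lower than the male mean.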
Using ANOVA test, does beauty score for instructors differ by age?¶
One-way ANOVA compares the mean of a continuous variable across categorical groups, so age must first be binned into categories. The groups are instructors who are:
- 40 years and younger.
- Between 40 and 60 years.
- 60 years and older.
State the hypothesis:
- $H_0: \mu_1 = \mu_2 = \mu_3$ (The three population means are equal)
- $H_1:$ At least one of the means differs.
df.loc[(df['age'] <= 40), 'age_group'] = '40 years and younger'
df.loc[(df['age'] > 40) & (df['age'] < 60), 'age_group'] = 'between 40 and 60 years'
df.loc[(df['age'] >= 60), 'age_group'] = '60 years and older'
df.groupby("age_group")["beauty"].describe()
age_group | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
40 years and younger | 113.0 | 0.336196 | 0.913748 | -1.450494 | -0.326015 | 0.289916 | 1.070944 | 1.970023 |
60 years and older | 77.0 | -0.423557 | 0.548289 | -1.422919 | -0.733091 | -0.395397 | -0.056677 | 0.588569 |
between 40 and 60 years | 273.0 | -0.019693 | 0.728354 | -1.090389 | -0.583587 | -0.083601 | 0.420400 | 1.774334 |
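The three `df.loc` assignments above encode the bin boundaries: 40 falls in the youngest group and 60 in the oldest. As a sketch, the same rule can be written as one plain function, which makes the boundary handling easy to check. The name `assign_age_group` is hypothetical.

```python
def assign_age_group(age):
    """Mirror the binning used above: <= 40, 41 to 59, >= 60."""
    if age <= 40:
        return "40 years and younger"
    if age < 60:
        return "between 40 and 60 years"
    return "60 years and older"

# Check the boundary ages explicitly.
boundary_labels = {age: assign_age_group(age) for age in (40, 41, 59, 60)}
print(boundary_labels)
```

Under pandas, `df['age'].map(assign_age_group)` should produce the same column as the three assignments.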
Test for equality of variance.
levene(df[df['age_group'] == '40 years and younger']['beauty'],
df[df['age_group'] == 'between 40 and 60 years']['beauty'],
df[df['age_group'] == '60 years and older']['beauty'],
center='mean')
LeveneResult(statistic=np.float64(11.769735544673434), pvalue=np.float64(1.0350399938234537e-05))
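Since the p-value is less than 0.05, we reject the assumption of equal variances. The standard one-way ANOVA used below assumes equal variances, so its result should be interpreted with caution.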
f_oneway(df[df['age_group'] == '40 years and younger']['beauty'],
df[df['age_group'] == 'between 40 and 60 years']['beauty'],
df[df['age_group'] == '60 years and older']['beauty'])
F_onewayResult(statistic=np.float64(23.552552376353074), pvalue=np.float64(1.8271127151948056e-10))
Answer: Since the p-value is less than 0.05, we reject the null hypothesis as there is significant evidence that at least one of the means differs.
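The F statistic is the ratio of between-group to within-group mean squares, so it can be recovered from the per-group counts, means, and standard deviations in the `describe()` table. The sketch below does this. The helper name `f_from_summary` is illustrative, and the small difference from the `f_oneway` result comes from the rounding of the tabulated values.

```python
def f_from_summary(groups):
    """One-way ANOVA F from a list of (n, mean, sample_std) tuples."""
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-group sum of squares: recovered from each sample variance.
    ss_within = sum((n - 1) * s ** 2 for n, _, s in groups)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# (count, mean, std) of beauty per age group, copied from the table above.
beauty_by_age = [
    (113, 0.336196, 0.913748),   # 40 years and younger
    (273, -0.019693, 0.728354),  # between 40 and 60 years
    (77, -0.423557, 0.548289),   # 60 years and older
]
f_stat = f_from_summary(beauty_by_age)
print(f"F = {f_stat:.3f}")
```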
Using ANOVA test, does teaching evaluation score for instructors differ by age?¶
State the hypothesis:
- $H_0: \mu_1 = \mu_2 = \mu_3$ (The three population means are equal)
- $H_1:$ At least one of the means differs.
df.groupby("age_group")["eval"].describe()
age_group | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
40 years and younger | 113.0 | 4.002655 | 0.505763 | 2.7 | 3.6 | 4.1 | 4.4 | 4.8 |
60 years and older | 77.0 | 3.894805 | 0.626371 | 2.2 | 3.4 | 4.0 | 4.4 | 4.9 |
between 40 and 60 years | 273.0 | 4.025641 | 0.551537 | 2.1 | 3.7 | 4.0 | 4.5 | 5.0 |
levene(df[df['age_group'] == '40 years and younger']['eval'],
df[df['age_group'] == 'between 40 and 60 years']['eval'],
df[df['age_group'] == '60 years and older']['eval'],
center='mean')
LeveneResult(statistic=np.float64(3.123930368994838), pvalue=np.float64(0.04491850441786862))
f_oneway(df[df['age_group'] == '40 years and younger']['eval'],
df[df['age_group'] == 'between 40 and 60 years']['eval'],
df[df['age_group'] == '60 years and older']['eval'])
F_onewayResult(statistic=np.float64(1.6792657352642264), pvalue=np.float64(0.1876521827204442))
Answer: Since the p-value is greater than 0.05, we cannot reject the null hypothesis as there is no significant evidence that at least one of the means differs.
Using chi-square, is there an association between tenure and gender?¶
State the hypothesis:
- $H_0:$ The proportion of teachers who are tenured is independent of gender.
- $H_1:$ The proportion of teachers who are tenured is associated with gender.
cross_tenure_gender = pd.crosstab(df['tenure'], df['gender'])
cross_tenure_gender
gender | female | male |
---|---|---|
tenure | ||
no | 50 | 52 |
yes | 145 | 216 |
chi2_contingency(cross_tenure_gender, correction=False)
Chi2ContingencyResult(statistic=np.float64(2.557051129789522), pvalue=np.float64(0.10980322511302845), dof=1, expected_freq=array([[ 42.95896328, 59.04103672], [152.04103672, 208.95896328]]))
Answer: Since the p-value is greater than 0.05, we cannot reject the null hypothesis: there is insufficient evidence of an association between tenure and gender.
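The expected frequencies and the statistic reported by `chi2_contingency` follow directly from the row and column totals of the cross-tabulation. The sketch below recomputes them for the 2x2 table above without a continuity correction. The function name is illustrative, and the p-value shortcut via `math.erfc` is valid only for one degree of freedom.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

    # With 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# Rows: tenure no/yes; columns: female/male (from the crosstab above).
stat, p_value, expected = chi_square_2x2([[50, 52], [145, 216]])
print(f"chi2 = {stat:.4f}, p = {p_value:.4f}")
```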
Using Pearson correlation, is teaching evaluation score correlated with beauty score?¶
State the hypothesis:
- $H_0:$ Teaching evaluation score is not correlated with beauty score.
- $H_1:$ Teaching evaluation score is correlated with beauty score.
sns.lmplot(data=df, x="beauty", y="eval", line_kws={"color": "red"})
plt.show()
pearsonr(df['beauty'], df['eval'])
PearsonRResult(statistic=np.float64(0.18903909084045212), pvalue=np.float64(4.247115419813557e-05))
Answer: Since the p-value is less than 0.05, we reject the null hypothesis: there is evidence of a linear correlation between beauty and teaching evaluation score. The correlation is positive but weak (r ≈ 0.19).
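A small p-value says the correlation is unlikely to be zero, not that it is strong. As a rough sketch, the snippet below converts the reported coefficient into the share of variance explained ($r^2$) and into the t statistic that underlies the significance test, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n - 2$ degrees of freedom. Only `r` and `n` come from the notebook.

```python
import math

# Correlation coefficient and sample size reported above.
r, n = 0.18903909084045212, 463

# Proportion of variance in eval linearly associated with beauty.
r_squared = r ** 2

# Test statistic for H0: correlation = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r_squared))

print(f"r^2 = {r_squared:.4f}, t = {t_stat:.3f}, df = {n - 2}")
```

So beauty accounts for only about 3.6% of the variance in evaluation scores, even though the relationship is statistically significant.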
Using t-test, does tenure affect teaching evaluation scores?¶
State the hypothesis:
- $H_0: \mu_1 = \mu_2$ (There is no difference in mean evaluation scores between tenured and non-tenured instructors)
- $H_1: \mu_1 \neq \mu_2$ (There is a difference in mean evaluation scores between tenured and non-tenured instructors)
df.groupby("tenure")["eval"].describe()
tenure | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
no | 102.0 | 4.133333 | 0.556747 | 2.8 | 3.7 | 4.2 | 4.6 | 5.0 |
yes | 361.0 | 3.960111 | 0.549104 | 2.1 | 3.6 | 4.0 | 4.4 | 5.0 |
ttest_ind(df[df['tenure'] == 'yes']['eval'],
df[df['tenure'] == 'no']['eval'],
equal_var=True)
TtestResult(statistic=np.float64(-2.8046798258451777), pvalue=np.float64(0.005249471210198792), df=np.float64(461.0))
Answer: Since the p-value is less than 0.05, we reject the null hypothesis: there is evidence that mean teaching evaluation scores differ between tenured and non-tenured instructors.
Using chi-square, is there an association between age and tenure?¶
State the hypothesis:
- $H_0:$ There is no association between age and tenure.
- $H_1:$ There is an association between age and tenure.
cross_tenure_age = pd.crosstab(df['tenure'], df['age_group'])
cross_tenure_age
age_group | 40 years and younger | 60 years and older | between 40 and 60 years |
---|---|---|---|
tenure | |||
no | 15 | 7 | 80 |
yes | 98 | 70 | 193 |
chi2_contingency(cross_tenure_age, correction=True)
Chi2ContingencyResult(statistic=np.float64(20.957740803528935), pvalue=np.float64(2.8124473945785386e-05), dof=2, expected_freq=array([[ 24.89416847, 16.96328294, 60.1425486 ], [ 88.10583153, 60.03671706, 212.8574514 ]]))
Answer: Since the p-value is less than 0.05, we reject the null hypothesis as there is evidence of an association between age and tenure.
Using chi-square, is there an association between visible minorities and tenure?¶
State the hypothesis:
- $H_0:$ There is no association between tenure and visible minorities.
- $H_1:$ There is an association between tenure and visible minorities.
cross_minority_tenure = pd.crosstab(df['minority'], df['tenure'])
cross_minority_tenure
tenure | no | yes |
---|---|---|
minority | ||
no | 92 | 307 |
yes | 10 | 54 |
chi2_contingency(cross_minority_tenure, correction=True)
Chi2ContingencyResult(statistic=np.float64(1.3675127484429763), pvalue=np.float64(0.24223968800237178), dof=1, expected_freq=array([[ 87.90064795, 311.09935205], [ 14.09935205, 49.90064795]]))
Answer: Since the p-value is greater than 0.05, we cannot reject the null hypothesis: there is insufficient evidence of an association between visible minority status and tenure.
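This test was run with `correction=True`, which for a 2x2 table applies Yates' continuity correction. The sketch below shows the correction in its textbook form, where each absolute difference between observed and expected counts is reduced by 0.5 before squaring, and compares it with the uncorrected statistic. It assumes every such difference exceeds 0.5, which holds for this table. The function name is hypothetical.

```python
import math

def chi_square_2x2_stats(table, yates):
    """Chi-square statistic and p-value (1 dof) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    shrink = 0.5 if yates else 0.0

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Yates: pull |O - E| towards zero by 0.5 before squaring.
            stat += (abs(observed - expected) - shrink) ** 2 / expected

    # Survival function of chi-square with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))

# Rows: minority no/yes; columns: tenure no/yes (from the crosstab above).
minority_tenure = [[92, 307], [10, 54]]
corrected, p_corrected = chi_square_2x2_stats(minority_tenure, yates=True)
uncorrected, p_uncorrected = chi_square_2x2_stats(minority_tenure, yates=False)
print(f"corrected:   chi2 = {corrected:.4f}, p = {p_corrected:.4f}")
print(f"uncorrected: chi2 = {uncorrected:.4f}, p = {p_uncorrected:.4f}")
```

The correction always lowers the statistic, making the test more conservative. Here the conclusion is the same either way.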
Using regression with t-test, does gender affect teaching evaluation score?¶
State the hypothesis:
- $H_0: \beta_1 = 0$ (Gender has no effect on teaching evaluation scores)
- $H_1: \beta_1 \neq 0$ (Gender has an effect on teaching evaluation scores)
# X is the input variables (or independent variables)
X = df['gender'].map({"female": 1, "male": 0})
# y is the target/dependent variable
y = df['eval']
# Add an intercept (beta_0) to our model
X = sm.add_constant(X)
# Ordinary Least Squares
model = sm.OLS(y, X).fit()
# Print out the statistics
model.summary()
Dep. Variable: | eval | R-squared: | 0.022 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.020 |
Method: | Least Squares | F-statistic: | 10.56 |
Date: | Tue, 31 Dec 2024 | Prob (F-statistic): | 0.00124 |
Time: | 15:34:06 | Log-Likelihood: | -378.50 |
No. Observations: | 463 | AIC: | 761.0 |
Df Residuals: | 461 | BIC: | 769.3 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 4.0690 | 0.034 | 121.288 | 0.000 | 4.003 | 4.135 |
gender | -0.1680 | 0.052 | -3.250 | 0.001 | -0.270 | -0.066 |
Omnibus: | 17.625 | Durbin-Watson: | 1.209 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 18.970 |
Skew: | -0.496 | Prob(JB): | 7.60e-05 |
Kurtosis: | 2.981 | Cond. No. | 2.47 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
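Answer: Since the p-value is less than 0.05, we reject the null hypothesis: there is evidence of a difference in mean evaluation scores based on gender. With females coded as 1 and males as 0, the coefficient of -0.1680 means that female instructors score on average 0.168 points lower than male instructors. The t statistic (-3.250) matches the independent-samples t-test above.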
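The regression and the two-sample t-test agree because, with a single 0/1 predictor, the OLS intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. The sketch below illustrates this with closed-form least squares on a few made-up scores. The data and the helper `simple_ols` are hypothetical and are not taken from the ratings dataset.

```python
def simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical toy scores: 0 = male, 1 = female.
x = [0, 0, 0, 1, 1]
y = [4.0, 4.2, 4.4, 3.8, 4.0]

intercept, slope = simple_ols(x, y)
mean_male = sum(yi for xi, yi in zip(x, y) if xi == 0) / x.count(0)
mean_female = sum(yi for xi, yi in zip(x, y) if xi == 1) / x.count(1)

print(f"intercept = {intercept:.3f} (male mean = {mean_male:.3f})")
print(f"slope     = {slope:.3f} (female - male = {mean_female - mean_male:.3f})")
```

In the notebook's model, this is why `const` (4.0690) equals the male mean and the `gender` coefficient equals the female mean minus the male mean (3.9010 - 4.0690).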
Using regression with ANOVA, does beauty score for instructors differ by age?¶
State the hypothesis:
- $H_0: \mu_1 = \mu_2 = \mu_3$ (The three population means are equal)
- $H_1:$ At least one of the means differs.
model = ols('beauty ~ age_group', data=df).fit()
anova_table = sm.stats.anova_lm(model)
anova_table
 | df | sum_sq | mean_sq | F | PR(>F) |
---|---|---|---|---|---|
age_group | 2.0 | 26.691809 | 13.345905 | 23.552552 | 1.827113e-10 |
Residual | 460.0 | 260.656087 | 0.566644 | NaN | NaN |
Answer: Since the p-value is less than 0.05, we reject the null hypothesis as there is significant evidence that at least one of the means differs.
Using regression with t-test, does tenure affect beauty scores?¶
State the hypothesis:
- $H_0:$ The average beauty scores for tenured and non-tenured instructors are equal.
- $H_1:$ There is a difference in the average beauty scores for tenured and non-tenured instructors.
# X is the input variables (or independent variables)
X = df['tenure'].map({"yes": 1, "no": 0})
# y is the target/dependent variable
y = df['beauty']
# Add an intercept (beta_0) to our model
X = sm.add_constant(X)
# Ordinary Least Squares
model = sm.OLS(y, X).fit()
# Print out the statistics
model.summary()
Dep. Variable: | beauty | R-squared: | 0.000 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.002 |
Method: | Least Squares | F-statistic: | 0.1689 |
Date: | Tue, 31 Dec 2024 | Prob (F-statistic): | 0.681 |
Time: | 15:34:06 | Log-Likelihood: | -546.45 |
No. Observations: | 463 | AIC: | 1097. |
Df Residuals: | 461 | BIC: | 1105. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 0.0284 | 0.078 | 0.363 | 0.717 | -0.125 | 0.182 |
tenure | -0.0364 | 0.089 | -0.411 | 0.681 | -0.210 | 0.138 |
Omnibus: | 23.184 | Durbin-Watson: | 0.461 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 23.229 |
Skew: | 0.507 | Prob(JB): | 9.03e-06 |
Kurtosis: | 2.583 | Cond. No. | 4.05 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Answer: Since the p-value is greater than 0.05, we cannot reject the null hypothesis: there is insufficient evidence that mean beauty scores differ between tenured and non-tenured instructors.
Using regression with t-test, does being a native English speaker affect the number of students assigned?¶
State the hypothesis:
- $H_0:$ The average number of students assigned to native English speakers and to non-native English speakers is equal.
- $H_1:$ There is a difference in the average number of students assigned to native English speakers and to non-native English speakers.
# X is the input variables (or independent variables)
X = df["native"].map({"yes": 1, "no": 0})
# y is the target/dependent variable
y = df['allstudents']
# Add an intercept (beta_0) to our model
X = sm.add_constant(X)
# Ordinary Least Squares
model = sm.OLS(y, X).fit()
# Print out the statistics
model.summary()
Dep. Variable: | allstudents | R-squared: | 0.007 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.005 |
Method: | Least Squares | F-statistic: | 3.476 |
Date: | Tue, 31 Dec 2024 | Prob (F-statistic): | 0.0629 |
Time: | 15:34:06 | Log-Likelihood: | -2654.2 |
No. Observations: | 463 | AIC: | 5312. |
Df Residuals: | 461 | BIC: | 5321. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 29.6071 | 14.150 | 2.092 | 0.037 | 1.802 | 57.413 |
native | 27.2158 | 14.598 | 1.864 | 0.063 | -1.471 | 55.902 |
Omnibus: | 429.792 | Durbin-Watson: | 0.708 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10527.126 |
Skew: | 4.129 | Prob(JB): | 0.00 |
Kurtosis: | 24.852 | Cond. No. | 8.01 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Answer: Since the p-value (0.063) is greater than 0.05, we cannot reject the null hypothesis: there is insufficient evidence that being a native English speaker affects the number of students assigned.