obtain more accurate insights.

Economics of Education Review, 30(1), 1-15.

The passing mark for a student in Portugal is 10 out of 20.

Section 2c.

This will be explained in the next section (Section C). According to Rameker (2015), alcohol consumption is a major factor that has been shown to correlate with poor academic performance. Many of the attributes are ordinal and were discretized from continuous values. The Core Survey helps us determine the patterns of alcohol and other drug consumption and examine attitudes and perceptions of alcohol and other drug use among Northwestern students.

EuroEducation.net. https://archive.ics.uci.edu/ml/datasets/STUDENT%20ALCOHOL%20CONSUMPTION.

First, open the student-por.csv file in the student_performance source. While I recognize that having a great many students living on campus may be contributing to these numbers, and while I am relieved that students know how and when to seek care, I am c… YAML is a good tool for setting up configurations, but in this case, we will set the configurations manually.

Section 2b.

February 2016. DOI: 10.13140/RG.2.1.1465.8328. Authors: Fabio Pagnotta and Hossain Amran, University of Camerino.

Most of us experimented with drinking to some degree while in school.

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

For the exploratory data exercise, we chose to examine three columns: workday alcohol consumption, weekend alcohol consumption and relationship status. We chose workday alcohol consumption because drinking on workdays is more unusual than drinking on weekends.
Testing correlation between alcohol consumption and social, gender, study time, and grade attributes for each student. GStatus is derived from the final period grade (G3, column 33) where, according to EuroEducation.net (n.d.),

Section 2a.

Remove the skewness from the numeric data. If one is very high, you may want to take a closer look at the data and see whether there is leakage into the target variable.

World Health Organization (WHO).

From this analysis, what might we preprocess before creating the model? The primary purpose of this data was to see the effect of drinking on grades. To do so, we … result as pass/fail rather than as a discrete number.

Column 23

We would think that if the value for health is lower, the value for their

We prefer to use some sort of configuration so that we can input any dataset and perform most of the same analysis. This information can give you a hint of the skewness and of possible outliers. The following results show the skewness for the numeric features: As we suspected, the feature ‘absences’ contains the most skew.

People who contributed to this were Aaron Patrick Nathaniel, Lim Yue Hng (Neil) and

This would help the classification model to predict the class GStatus more accurately

Our explanation focuses more on the final grade because we think that students will be

We look a bit closer at the distribution of absences and test for normality.

(2016), studied the relationship between married couples with their single counterparts and found out that if partners are more

avoid drinking in order to prevent their health from further deterioration.

activities (column 19), romantic (column 23), famrel (column 24), goout (column 26), Dalc (column 27), Walc (column 28)

fulfilling the Data Mining course in Multimedia University.
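The GStatus derivation mentioned above can be sketched in a few lines. This is only an illustration: the function name `g_status` and the label strings are hypothetical, and it assumes that a final grade exactly equal to the pass mark of 10 out of 20 counts as a pass.

```python
PASS_MARK = 10  # pass mark out of 20, as quoted in the text


def g_status(g3: int) -> str:
    """Map a final grade G3 (0 to 20) to a 'pass' or 'fail' label.

    Assumption: a grade equal to the pass mark is treated as a pass.
    """
    if not 0 <= g3 <= 20:
        raise ValueError(f"G3 must be between 0 and 20, got {g3}")
    return "pass" if g3 >= PASS_MARK else "fail"


# Invented example grades, not taken from the dataset.
grades = [0, 9, 10, 15, 20]
labels = [g_status(g) for g in grades]
print(list(zip(grades, labels)))
```

Turning G3 into a two-level label is what makes the problem a classification task rather than a regression on the 0 to 20 scale.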
EDUCATION SYSTEM IN PORTUGAL.

Alcohol Abuse and Dependence: Roughly 20 percent of college students meet the criteria for an alcohol use disorder in a given year (8 percent alcohol abuse, 13 percent alcohol dependence).

need to take column 23 (romantic), column 27 (workday alcohol consumption) and/or column 28 (weekend alcohol consumption) into consideration.

A research conducted

The original data contains the following attributes for both the student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

The following grades are related to the course subject, Math or Portuguese:

Before exploration, we combine the rows of the two data sets and mark each instance with the class in which the survey was taken.

and/or column 28 (weekend alcohol consumption), column 31 (first period grade), column 32 (second period grade) and

We think that classification is the best data mining technique to employ because we can build a classification model to

Its value for the week is normalized as (workday_alcohol_consumption × 5 + weekend_alcohol_consumption × 2) / 7. If the value is greater than 3.0, then alcohol consumption is considered too high.

Global Status Report on Alcohol and Health 2014.

The original data comes from a survey conducted by a professor in Portugal.

information about the students from the mathematics course only.

courses of mathematics and Portuguese.

Then we can find out whether alcohol consumption affects the final result indicated by column “g3”.

“Using Data Mining to Predict Secondary School Student Alcohol Consumption.” Department of Computer Science, University of Camerino.

school period grades are available.

I will be using the student alcohol consumption dataset provided by UCI Machine Learning, which is available in their machine learning repository.

However, the assumption is that the alcohol consumption is high because the student's

Section 2e.

impressionable generation.
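The weekly normalization described above (five workdays and two weekend days, divided by seven, with 3.0 as the cut-off) can be written out as a small sketch. The function names `weekly_alcohol` and `is_high_consumption` are illustrative and do not come from the original analysis.

```python
def weekly_alcohol(dalc: int, walc: int) -> float:
    """Weighted weekly alcohol level from the 1-5 workday (Dalc) and
    weekend (Walc) scores: five workdays and two weekend days."""
    return (dalc * 5 + walc * 2) / 7


def is_high_consumption(dalc: int, walc: int, threshold: float = 3.0) -> bool:
    """True when the weekly level is strictly greater than the threshold."""
    return weekly_alcohol(dalc, walc) > threshold


# Invented (Dalc, Walc) pairs for illustration only.
for dalc, walc in [(1, 1), (2, 5), (3, 3), (5, 1)]:
    level = weekly_alcohol(dalc, walc)
    print(dalc, walc, round(level, 2), is_high_consumption(dalc, walc))
```

Because both inputs lie between 1 and 5 and the weights sum to 7, the result also stays between 1 and 5. A student who drinks heavily only on weekends (Dalc 2, Walc 5) lands below 3.0, since the workday score carries more weight.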
administrative or police), ‘at_home’ or ‘other’)
reason – reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian – student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime – home to school travel time (numeric: 1 – <15 min., 2 – 15 to 30 min., 3 – 30 min.

You can see the level of correlation from the shape of the ellipse.

They are:

address - U/R for urban or rural respectively
famsize - LE3/GT3 for less than or equal to three, or greater than three, family members
Pstatus - T/A for living together or apart from parents, respectively
Medu - 0 (none) / 1 (primary-4th grade) / 2 (5th - 9th grade) / 3 (secondary) / 4 (higher) for mother's education
Fedu - 0 (none) / 1 (primary-4th grade) / 2 (5th - 9th grade) / 3 (secondary) / 4 (higher) for father's education
Mjob - 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' for the student's mother's job
Fjob - 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' for the student's father's job
reason - close to 'home', school 'reputation', 'course' preference or 'other' for the choice of school
guardian - mother/father/other as the student's guardian
traveltime - 1 (<15 mins) / 2 (15 - 30 mins) / 3 (30 mins - 1 hr) / 4 (>1 hr) for time from home to school
studytime - 1 (<2 hrs) / 2 (2 - 5 hrs) / 3 (5 - 10 hrs) / 4 (>10 hrs) for weekly study time
failures - 1-3/4 for number of class failures (if more than 3, then record 4)
schoolsup - yes/no for extra educational support
famsup - yes/no for family educational support
paid - yes/no for extra paid classes for Math or Portuguese
activities - yes/no for extra-curricular activities
nursery - yes/no for whether attended nursery school
higher - yes/no for desire to continue studies
internet - yes/no for internet access at home
romantic - yes/no for relationship status
famrel - 1-5 scale on quality of family relationships
freetime - 1-5 scale on how much free time after school
goout - 1-5 scale on how much student goes out with friends
Dalc - 1-5 scale on how much alcohol consumed on weekdays
Walc - 1-5 scale on how much alcohol consumed on weekend
absences - 0-93 amount of absences from school

the amount of time a student studies (studytime, column 14)
whether the student takes any extra paid classes (paid, column 18)
whether the student participates in any extra-curricular activities (activities, column 19)
whether the student is involved in a romantic relationship (romantic, column 23)
the quality of the student's family relationships (famrel, column 24)
the tendency of the student to go out with friends (goout, column 26)
weekday alcohol consumption (Dalc, column 27)
weekend alcohol consumption (Walc, column 28)

comes with the mantle of adulthood.

Core measures include: Baseline surveys included standard demographics, religiosity, current alcohol and drug diagnoses (DIS), ASI alcohol, drug and psychiatric problem severity, number of heavy drinkers in social networks, prior treatment utilization, and lifetime and past-year 12-step meeting attendance and involvement. Six- and 12-month surveys involved a subset of these …

The box plot portion of the graph also helps us identify outliers. This helps you to understand whether the distribution of the numeric variable is significantly different at different levels of the categorical target.

weekend alcohol consumption and their health.

Student Grade Prediction. Guided by: Dr. Amir H. Gandomi. Presented by: Gaurav Sawant, Vipul Gajbhiye, Vikram Singh. Date: 11/28/2017.

(Pullen, 1994).
Five columns play a major role in this, which are: column 27 (workday alcohol consumption)

The following plot shows the prominence of the target: This shows that the target is imbalanced, so we may benefit from oversampling or under-sampling when building our model.

consumption (both column 27 and 28) when famrel has a low value.

The traditional consensus is that students who consume alcohol at high levels … With the Student Alcohol Consumption data set, we predict high or low alcohol consumption of students. The columns and how they are recorded are listed below:

It could be alcohol poisoning or an alcohol-related injury or both. It would be easy to assume that alcohol consumption reduces the student’s health in the long term.

We can use studytime (column 14), paid (column 18),

predict if a student will get a passing grade based on the factors mentioned above.

We could perform this merge differently later by performing a full join and then dealing with the NA values, by performing the analysis on the individual sets, or by inner joining the two sets and just working with that data. Essentially, the blue rectangles show that the observed counts and expected counts (derived from a loglinear model) coincide well, and since the size of the rectangles is large, the confidence covers a majority of the observations. Assuming the romantic relationship in our dataset is of an intimate level, we can find out if this statement holds true.
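The oversampling option mentioned above for the imbalanced target can be illustrated with plain random oversampling. This is a minimal sketch under stated assumptions, not the method used in the original analysis: rows are represented as dictionaries, the label column name `alc_level` is hypothetical, and minority rows are simply duplicated at random until every class matches the largest one.

```python
import random
from collections import Counter


def oversample(rows, label_key, seed=0):
    """Randomly duplicate minority-class rows until all classes have
    as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Invented toy data: 8 'low' rows and 2 'high' rows.
rows = [{"id": i, "alc_level": "low"} for i in range(8)]
rows += [{"id": 100 + i, "alc_level": "high"} for i in range(2)]

balanced = oversample(rows, "alc_level")
print(Counter(r["alc_level"] for r in balanced))
```

Oversampling should only be applied to the training split. Duplicating rows before splitting would leak copies of the same student into the test set.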
With reference to the World Health Organization's public health objective on alcohol, which is to reduce the health burden caused by the harmful use of alcohol and thereby save lives and reduce injuries, this data article explored the nature of alcohol use among college students, binge drinking and the consequences of alcohol consumption. We shall see which consensus holds true.

Depending on the model you choose, removing skewness could help improve the predictive ability of the model.

Balsa, A. I., Giuliano, L. M., & French, M. T. (2011).

In our data set, many of the categorical features are numeric, but for this illustration, we will continue to treat them as categorical. The data that we will explore has 1044 rows and 33 columns.

Short exploratory data analysis focusing on the alcohol variables from the Portuguese school dataset.

In this case, we see that the grades are highly correlated, meaning the higher the grades in one session, the higher the grades in another session.

column 33 (final grade).

The original values for the feature ‘absences’ will be used in the remaining sections. Click on the arrow near the name of each column to open the context menu. The data mining technique we think is suitable is classification.

http://www.who.int/substance_abuse/publications/global_alcohol_report/en/.

Dinescu, D., Turkheimer, E., Beam, C. R., Horn, E. E., Duncan, G., Emery, R. E.

The dataset we chose is the Student Alcohol Consumption dataset by UCI Machine Learning which can be obtained

Secondary school students are in a developmental transition, and this comes with debilitating effects such as risky alcohol use …

In April 2016, 3000 undergraduate students were randomly selected to participate in the survey, and 802 undergraduate students responded to at least part of the survey.
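To make the skewness discussion above concrete, here is one way the skewness of a numeric feature could be computed. It is a sketch using the moment-based (Fisher-Pearson) coefficient. The original analysis does not say which estimator it used, so that choice is an assumption, and the sample values are invented.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, where m2 and m3 are the
    second and third central moments. Positive means a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: no spread, so no skew
    return m3 / m2 ** 1.5


# Invented values: a symmetric column and an absences-like column
# where most students have few absences and a few have many.
symmetric = [1, 2, 3, 4, 5]
absences_like = [0, 0, 0, 1, 2, 2, 4, 6, 10, 40]

print("symmetric:", skewness(symmetric))
print("absences-like:", skewness(absences_like))
```

A count feature such as absences is bounded below by zero and has occasional large values, which is the typical pattern behind a strongly positive skew.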
We will take a closer look at the distribution of this feature.

consensus is that students who consume alcohol at high levels tend to skip more classes and perform worse in their studies, thus, resulting in lower

The students included in the survey were in the

To get an idea of how features interact with each other, we can determine the rank associated with the features to a target, in this case, the actual target or level of drinking. According to the World Health Organization (Global Status Report on Alcohol and Health 2014), gender, family, and social factors affect alcohol consumption. Many students in college experiment with drugs and alcohol, and sometimes these two things negatively affect their academic performance.

would be the relationship between their grades with respect to their workday and weekend alcohol consumption.

We would oversample since we have limited data.

drinking alcohol for consolation.

We test the null hypothesis (h0) that the numeric variable has the same mean value across the different levels of the categorical variable. Last but not least, we can also obtain insights on health issues and drinking alcohol. For a student to pass the subject, there are a couple of factors that could be correlated with the outcome.

(romantic), only gives information on whether or not the student has a partner.
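The test of h0 described above (equal means of a numeric variable across the levels of a categorical variable) is commonly done with a one-way ANOVA. The sketch below computes only the F statistic by hand. Treating the test as a one-way ANOVA is an assumption, the function name `anova_f` is illustrative, and the group values are invented.

```python
from statistics import fmean


def anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square divided by
    within-group mean square. Larger F is stronger evidence against h0."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented example: a numeric feature split by three levels of a
# categorical target.
low = [1.0, 2.0, 3.0]
medium = [2.0, 3.0, 4.0]
high = [5.0, 6.0, 7.0]

print("F =", anova_f([low, medium, high]))
```

Turning F into a p-value requires the F distribution with (k - 1, n - k) degrees of freedom, which a statistics library would normally provide. h0 is rejected when that p-value falls below the chosen significance level.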
Next Steps in Preparing the Data for a Model

https://archive.ics.uci.edu/ml/datasets/STUDENT%20ALCOHOL%20CONSUMPTION
http://www.who.int/substance_abuse/publications/global_alcohol_report/en/

Data Exploratory Analysis – Student Alcohol Consumption

school – student’s school (binary: ‘GP’ – Gabriel Pereira or ‘MS’ – Mousinho da Silveira)
sex – student’s sex (binary: ‘F’ – female or ‘M’ – male)
age – student’s age (numeric: from 15 to 22)
address – student’s home address type (binary: ‘U’ – urban or ‘R’ – rural)
famsize – family size (binary: ‘LE3’ – less or equal to 3 or ‘GT3’ – greater than 3)
Pstatus – parent’s cohabitation status (binary: ‘T’ – living together or ‘A’ – apart)
Medu – mother’s education (numeric: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu – father’s education (numeric: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob – mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g.

When lambda = 0, the log transform is used.

X axis is the level of the categorical target. The narrower the ellipse, the stronger the correlation.

2016.

If the means differ significantly (h0 is rejected), then the feature will likely be a dominant predictor. For numeric data, correlations are important to help determine if we should join information of highly correlated features.

recorded to have participated.

This has been the case for eight of the past 10 years.
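The remark above that lambda = 0 gives the log transform matches the Box-Cox power transform, so the sketch below assumes that is the transform meant. The shift of +1 is an additional assumption, added because absences can be 0 and Box-Cox requires strictly positive input. The sample values are invented.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Box-Cox transform of a strictly positive value:
    (x**lam - 1) / lam for lam != 0, and log(x) for lam == 0."""
    if x <= 0:
        raise ValueError("Box-Cox needs strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


# Invented absences-like values; shifted by 1 so that 0 becomes valid.
absences = [0, 1, 2, 4, 10, 40]
for lam in (0, 0.5, 1):
    transformed = [round(box_cox(a + 1, lam), 3) for a in absences]
    print("lambda =", lam, transformed)
```

With lambda = 1 the values are only shifted, so the shape of the distribution is unchanged. Smaller lambda values compress the long right tail more strongly, with the log transform at lambda = 0.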
The dataset which we will be exploring will be the dataset containing

National Institute of Child Health and Human Development Study of Early Child Care and Youth Development: data and documentation for phases I and II of the NICHD-SECCYD study.

Since the main purpose of the dataset is to find correlations between students and their alcohol consumption patterns, the most conspicuous relationship would be the one between their grades and their workday and weekend alcohol consumption. Derived output: Alc = (Walc × 2 + Dalc × 5) / 7, again in the range of 1 – 5.

administrative or police), ‘at_home’ or ‘other’), Fjob – father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g.

There are two categorical columns, “Dalc” and “Walc”, showing consumption on workdays and weekends. For example, if there were a high correlation, say 0.9, between two numeric features, then the information provided to the model would be redundant and, depending on the model, could make it more complex than it needs to be. A lot of time is lost to alcohol consumption, so students spend less time on their academic work. We may want to normalize absences in preparation for model building.

workday and/or weekend alcohol consumption would also be lower.

This analysis was done as part of fulfilling the Data Mining course in Multimedia University. We remove skewness by applying a log, square root, and/or inverse transformation.

reductions of GPA.

Secondary school student alcohol consumption data with social, gender and study information.

such data are records of demographic information, grades, and alcohol consumption.
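The redundancy check described above (flagging numeric features whose correlation is around 0.9 or higher) can be sketched with a hand-written Pearson correlation. The helper names, the 0.9 default and the toy columns are illustrative. The column names G1, G2 and absences follow the dataset, but the values are invented.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented values for three numeric columns.
columns = {
    "G1": [10, 12, 14, 8, 16],
    "G2": [11, 12, 15, 8, 17],
    "absences": [4, 0, 2, 10, 1],
}

for name_a, name_b, r in redundant_pairs(columns):
    print(name_a, name_b, round(r, 3))
```

When a pair is flagged, one option is to keep only one of the two features, and another is to combine them into a single feature, as with the derived Alc value.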
Excessive alcohol use, either in the form of binge drinking (drinking 5 or more drinks on an occasion for men or 4 or more drinks on an occasion for women) or heavy drinking (drinking 15 or more drinks per week for men or 8 or more drinks per week for women), is associated with an increased risk of many health problems, such as liver disease and unintentional injuries.