This course builds on Data Science 1 where students developed skills in data collection, data cleaning, and creating different types of data visualizations (e.g. bar charts, scatter plots, density plots, heat maps, violin plots, time series, and interactive graphics) and effective data communication while working on problems and case studies inspired by and based on real-world questions. Continuing on the themes explored in Data Science 1, students will go beyond data visualization to gain insight from data using statistical and machine learning techniques. Students who successfully complete this course will be able to work with large data sets, transform those data, and apply statistical and machine learning techniques to analyze data. Students will build on their knowledge of GitHub, ggplot2, R Markdown, and the tidyverse packages for data manipulation, visualization and analysis, to include an analytical toolkit for answering different types of questions and working with different types of data. Students will be exposed to a variety of topics including web scraping, generalized linear models, and machine learning. Each year we will also explore several advanced special topics in data science, which may include text analysis, image processing, and spatial analysis.
Topic Intros Tuesday, Wednesday, Lab Friday, Help Sessions TBD.
Tuesday: Topic Intro - 13:00-14:25
Wednesday: Topic Intro - 14:35-16:00
Friday: Lab - 13:00-14:25
Help Sessions TBD
Continuing on the themes explored in Data Science 1, students will go beyond data visualization to gain insight from data using statistical and machine learning techniques. Students who successfully complete this course will be able to work with large data sets, transform those data, and apply statistical and machine learning techniques to analyze data. Throughout the course we will be using GitHub, ggplot2, R Markdown, and the tidyverse packages for data manipulation, visualization and analysis. This course is intended to appeal to a wide range of students. The skills and habits of mind taught in this course are applicable not only in the sciences and social sciences, but in almost all fields. Evaluation will be based on several short homework and lab assignments, participation in in-class activities, and a final project.
Unit 1 (Week 1 to 2) - Exploring data: This unit is designed as a refresher of data visualization and data wrangling. We end the unit with web scraping and introduce the idea of iteration in preparation for the next unit. Also in this unit students are introduced to the toolkit: R, RStudio, R Markdown, Git, and GitHub.
Unit 2 (Week 3 to 4) - Working with strings, regular expressions, and text analysis.
Unit 3 (Week 5 to 8) - Making rigorous conclusions: In this unit we introduce modelling and statistical inference for making data-based conclusions. We discuss data ethics alongside building, interpreting, and selecting models, visualizing results, and prediction and model validation.
Unit 4 (Week 9 and 10) - Machine Learning and looking beyond Data Science 2: In this unit we delve into supervised machine learning and finish up the course with project presentations and a look beyond Data Science 2.
The computing courses at COA are designed to bridge the liberal arts education to computing and the digital world. In this, I am committed to actively creating digital and computational spaces that are radically inclusive. This includes integrating equity and social justice throughout the curriculum, and engaging students in metacognition to support this work.
This course is designed as a community learning journey. Together, we will:
It is also my hope that in this course you:
All books for this course are freely available online as e-books. We will be using two main texts:
We will also have chapters and readings from the following opensource books on Text Analysis and Machine Learning:
The class meets on Tuesday and Friday from 13:00-14:25 and on Wednesday from 14:35-16:00. The typical weekly class schedule will be:
Day | Activity |
---|---|
Tuesday | Topic Introduction |
Wednesday | Topic Introduction |
Friday | Lab |
Outside of class you can expect to spend
The Teaching Assistant and I will have a handful of help sessions every week. You are warmly invited and encouraged to attend these sessions. Help sessions are relaxed, informal, and hopefully fun. Things that happen at help sessions:
Everyone is welcome at help sessions! Attending these sessions helps students do well in class and get as much out of it as possible.
These will be held on Fridays. During these sessions you will work individually or in teams on computing lab exercises. You will finish the exercises after class and turn in your lab reports the following Thursday at 23:59 EST through Sunday 23:59 EST. Attendance in class is important as you will be working on your labs individually and together in class. Labs will be submitted as GitHub repositories.
A frequently asked question is “What happens if I can’t make it to a lab one week because I’m sick or have another obligation at that time?” Answers below:
If you’re missing a workshop day due to short-term illness or some other reason, you should communicate this with your team and attend a team meeting before the deadline for the assignment to contribute to the teamwork. If you have made 0 commits towards a lab assignment, you will receive no credit for that assignment, so you need to participate both for being a team player and also for your own individual score.
If you’re unable to contribute to a lab assignment because of an illness taking you away from school work for an extended period of time, you should let me and your team know that you won’t be able to contribute to those lab(s) and we can discuss special circumstances and explore alternative arrangements to make up that work.
Overall these policies are put in place to ensure communication between team members, respect for each other's time, and also to give you a safety net in the case of illness or other reasons that keep you away from attending class once or twice.
Beyond the in-class activities, you will be assigned larger weekly programming tasks throughout the term. These assignments will be completed individually and due on the following Friday 23:59 EST through Sunday 23:59 EST, and submitted as GitHub repositories. Tip: Do the (optional) R tutorials which will introduce you to the datasets and topics covered in the homework assignments.
These weekly multiple choice quizzes will help you evaluate your learning continuously. The online quiz will be graded for completion only. You do not need to get any answers right, but it should help you identify what parts of the material you should review. Tip: Don’t leave it till the last minute!
Throughout the course there will be reflection assignments where we engage and reflect on contemporary issues in environmental and social justice related to our digital world, community and identity. These assignments may be based on an assigned reading or visualizations we encounter in our daily lives.
You will be responsible for the completion of an open ended final project for this course, the goal of which is to tackle an “interesting” problem using the tools and techniques covered in this class. Additional details on the project will be provided as the course progresses. Each team’s work will also be shared with and evaluated by at least one other team at an earlier stage in order to provide feedback. You must complete the final project and be in class to present it in order to pass this course. Tip: Stick to optional interim deadlines (outline, draft presentation) to pace your work on the project.
For all of the team-based assignments in this class you will be randomly assigned to teams of 2 or 3 students - these teams will change throughout the trimester. You will work in these teams during class and on the lab assignments. For team-based assignments, all team members are expected to contribute equally to the completion of each assignment and you will be asked to evaluate your team members on Lyceum after each assignment is due. During the labs, we will be working together using pair programming, where you will take it in turns to write and review the code, swapping roles frequently. Once the assignment is submitted the contributors will share responsibility for any revisions to be made based on feedback. Failure to adequately contribute to an assignment will result in a penalty to your grade relative to the team’s overall grade.
Students are expected to make use of the provided GitHub repository as their central collaborative platform. Commits to this repository will be used as a metric (one of several) of each team member’s relative contribution for each homework.
A growing body of research indicates that traditional approaches to grading fail to produce the sorts of meaningful learning desired by both teachers and students. Such approaches often reinforce inequitable power dynamics between teachers and students, promote faulty reward systems that disincentivize creativity and risk-taking, and devalue important aspects of learning (including revision and feedback). Given this context, instead of a traditional approach to grading in which you do work that is evaluated singularly by me, this course assumes that you opt to take ownership and responsibility over your performance and engagement with the class. To make this happen, this course uses a “contract grading” scheme, which gives you a voice in the grading process, provides you with the agency to specify your intended course performance, and lets you share in the responsibility for evaluating whether or not you fulfilled your intended obligations. Please see the contract grading document (on Google Classroom) for a more-fleshed-out explanation of this approach and how it will operate in the course.
I will also be meeting with each of you individually to set goals at the start of the course.
The work in this course will be comprised of the following components and their weights:
Only work that is clearly assigned as team work should be completed collaboratively. Individual assignments must be completed individually; you may not directly share or discuss answers / code with anyone other than the instructors and tutors. You are welcome to discuss the problems in general and ask for advice.
I am well aware that a huge volume of code is available on the web to solve any number of problems. Unless I explicitly tell you not to use something, the course’s policy is that you may make use of any online resources (e.g. StackOverflow), but you must explicitly cite where you obtained any code you directly use (or use as inspiration). Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism. On individual assignments you may not directly share code with another student in this class, and on team assignments you may not directly share code with another team in this class. You are welcome to discuss the problems together and ask for advice, but you may not send or make use of code from another team.
By enrolling in an academic institution, a student is subscribing to common standards of academic honesty. Any cheating, plagiarism, falsifying or fabricating of data is a breach of such standards. A student must make it their responsibility to not use words or works of others without proper acknowledgement. Plagiarism is unacceptable and evidence of such activity is reported to the provost or their designee. Two violations of academic integrity are grounds for dismissal from the college. Students should request in-class discussions of such questions when complex issues of ethical scholarship arise.
Many of us learn in different ways. For example, you may process information by speaking and listening, so while lectures are quite helpful for you, some of the written material may be difficult to absorb. You might have difficulty following lectures, but are able to quickly assimilate written information. You may need to fidget to focus in class. You might take notes best when you can draw a concept. For some of you, speaking in class can be a stressful or daunting experience. For some of you, certain topics or themes might be so traumatic as to be disruptive to learning. The principle of Universal Design for Learning calls for our classrooms, our virtual spaces, our practices and our interactions to be designed to include as many different modes of learning as possible, and is a principle I take seriously in this class.
It is also my goal to create an inclusive classroom, which depends on community building, and which requires everyone to come to class with mutual respect, civility, and a willingness to listen to and observe others. As such the syllabus serves as a contract of some expectations between all members of the class, including myself.
If you anticipate or experience any barriers to learning in this course, please reach out to me and your student support advisor. If you have a disability, or think you may have a disability, please contact COA’s Disability Support Services, located within the Office of Student Life in Deering Commons, to develop a plan for your academic accommodations. You can find out more information in the course catalog under Accommodating students with disabilities. If you have already been approved for accommodations through the Disability Support Services please let me know! We can meet 1-1 to explore concerns and potential options.
All work is due on the stated due date. Due dates are there to help guide your pace through the course and they also allow me to return feedback to you in a timely manner. However, sometimes life gets in the way and you might not be able to turn in your work on time. First, please note that resubmission is built into each assignment. You will receive feedback on each assignment and may resubmit a piece of work incorporating your revisions.
If you intend to submit work late for an assignment or project, you must notify me before the original deadline and again as soon as the completed work is submitted on GitHub. This allows me to return feedback to you and lets me know when to check your work. Lab work cannot be submitted late.
I want to make sure that you learn everything you were hoping to learn from this class. If this requires flexibility, please don’t hesitate to ask.
You never owe me personal information about your health (mental or physical) but you’re always welcome to talk to me. If I can’t help, I likely know someone who can.
I want you to learn lots of things from this class, but I primarily want you to stay healthy, balanced, and grounded during this crisis.
Most of you will need help at some point and we want to make sure you can identify when that is without getting too frustrated and feel comfortable seeking help.
Everyone is welcome at help sessions! Attending these sessions helps students do well in class and get as much out of it as possible.
Showcase your inner data scientist
The final project for this class will consist of analysis on a dataset of your own choosing. The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.
There will also be the option to work on datasets and projects proposed by community partners. More info to come.
The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and appropriateness of the statistical analysis should be discussed here.
The project is very open ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use, but sticking to packages we learned in class (tidyverse) is required. You do not need to visualize all of the data at once. A single high quality visualization will receive a much higher grade than a large number of poor quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R.
Here is an example of a past project write up and presentation on Lessons to be Learned from Super Bowl Advertisements.
In order for you to have the greatest chance of success with this project it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your dataset must have at least 50 observations and between 10 and 20 variables (exceptions can be made but you must speak with me first). The variables in the data should include categorical variables, discrete numerical variables, and continuous numerical variables.
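The size requirements above can be checked quickly before you commit to a dataset. The following is a minimal sketch in base R, not an official course tool: the function name check_project_data and the example data frame are hypothetical, and the thresholds simply restate the numbers given in this section.

```r
# Hypothetical helper: checks a data frame against the stated project
# requirements (at least 50 observations, between 10 and 20 variables).
check_project_data <- function(df, min_obs = 50, min_vars = 10, max_vars = 20) {
  stopifnot(is.data.frame(df))
  list(
    n_obs = nrow(df),
    n_vars = ncol(df),
    enough_obs = nrow(df) >= min_obs,
    vars_in_range = ncol(df) >= min_vars && ncol(df) <= max_vars,
    # Number of variables of each type (first class of each column)
    var_types = table(vapply(df, function(x) class(x)[1], character(1)))
  )
}

# Made-up example data: 60 rows and 12 columns (one factor, eleven numeric)
set.seed(1)
example_df <- as.data.frame(matrix(rnorm(60 * 12), nrow = 60, ncol = 12))
example_df$V1 <- factor(sample(c("a", "b", "c"), 60, replace = TRUE))

result <- check_project_data(example_df)
print(result$enough_obs)
print(result$vars_in_range)
print(result$var_types)
```

The type counts only tell you how R stores each column. Whether a numeric variable is discrete or continuous still needs your own judgment.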
If you are using a dataset that comes in a format that we haven’t encountered in class, make sure that you are able to load it into R as this can be tricky depending on the source. If you are having trouble ask for help before it is too late.
Note on reusing datasets from class: Do not reuse datasets used in examples, homework assignments, or labs in the class.
Below is a list of data repositories that might be of interest to browse. You’re not limited to these resources, and in fact you’re encouraged to venture beyond them. But you might find something interesting there:
One of the main goals of this course is that you build and develop community leadership skills as a collaborator that shares strengths, builds weaknesses, and contributes to a broader shared understanding. These skills will serve you in this course and beyond in your careers. A crucial part of building strong collaborations is good communication.
Each team will draft a group contract. A group contract is a document to help you formalize the expectations you have for your group members and what they can expect of you. It will help you think about what you need from each other to work effectively as a team! You will create and agree on this contract as a team and refer to it during the project.
At a minimum, your group contract must address these questions:
Each member should “sign” (you can just type out your name) at the bottom of the submission.
Credit for Group Contract: Tiffany Timbers, University of British Columbia
You will write your proposal in the proposal.Rmd file in your GitHub project.
Section 1 - Introduction: The introduction should introduce your general research question and your data (where it came from, how it was collected, what are the cases, what are the variables, etc.).
Section 2 - Data: Place your data in the /data folder, and add dimensions and codebook to the README in that folder. Then print out the output of glimpse() or skim() of your data frame.
Section 3 - Data Ethics: We will be conducting a review of the ethics of our data analysis project using the questions outlined in the Data Ethics Canvas by the Open Data Institute. Specifically we will look at the questions under “Limitations in Data Sources”, “Your reasons for using data”, “Positive effects on people”, “Negative effects on people”, and “Minimising negative impact”.
Section 4 - Data analysis plan:
Each section should be no more than 1 page (excluding figures). You can check a print preview to confirm length.
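For Section 2, the dimensions and overview could be produced along these lines. This is only a sketch: the data frame is a made-up stand-in for whatever you load from your /data folder, and falling back to base R's str() when dplyr is not installed is an illustrative choice, not a course requirement.

```r
# Made-up stand-in for your project data; in your proposal you would
# instead read the file you placed in the /data folder.
project_data <- data.frame(
  id = 1:6,
  group = factor(c("a", "b", "a", "c", "b", "a")),
  score = c(2.5, 3.1, 4.0, 1.8, 3.3, 2.9)
)

# Dimensions (rows, columns) to record in the README of the data folder
data_dims <- dim(project_data)
print(data_dims)

# Compact overview of every variable; glimpse() comes from dplyr
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(project_data)
} else {
  str(project_data)  # base R alternative with similar information
}
```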
10 minutes maximum, and each team member should say something substantial.
Prepare a slide deck using either Google Slides or the template in your repo. This template uses a package called xaringan, and allows you to make presentation slides using R Markdown syntax. There isn’t a limit to how many slides you can use, just a time limit (10 minutes total). A rough guide to follow is one slide is equal to one minute. Each team member should get a chance to speak during the presentation. Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”); instead it should convey what choices you made, and why, and what you found.
Before you finalize your presentation, make sure the code in your chunks is hidden with echo = FALSE.
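One common way to apply this setting to every chunk at once is through knitr's global chunk options, set in a setup chunk near the top of the .Rmd file. This is a sketch of one approach, assuming your slides are knit with knitr (as R Markdown and xaringan documents are). Setting echo = FALSE on individual chunks works as well.

```r
# Place inside a setup chunk at the top of presentation.Rmd.
# Hides the source code of all later chunks while still running them
# and showing their output (plots, tables, printed results).
knitr::opts_chunk$set(echo = FALSE)
```

A single chunk can override this default by setting echo = TRUE in its own header, which is useful if you want to show one specific piece of code.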
Presentation schedule: Presentations will take place during the Tuesday and Wednesday of the last week of the course. During the class you will watch presentations from the other teams and provide feedback in the form of peer evaluations. The presentation line-up will be generated randomly.
Along with your presentation slides, we want you to provide a brief summary of your project in the README of your repository.
This write-up, which you can also think of as a summary of your project, should provide information on the dataset you’re using, your research question(s), your approach (how you decided to visualize the data), and your findings.
The following folders and files in your project repository:
presentation.Rmd + presentation.html: Your presentation slides. Note that you may use Google Slides instead of xaringan.
README.md: Your write-up.
/data/*: Your dataset in csv or RDS format, in the /data folder.
/proposal: Your proposal from earlier in the term.
/contract: Your group contract from earlier in the term.
Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted, including labelling code chunks. Pay attention to images and plots included in the presentation and make sure to include appropriate alternative text.
Code should be hidden (echo = FALSE) so that your document is neat and easy to read. However, your document should include all your code such that if I re-knit your R Markdown file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
Thorndike Library offers many resources and services that can assist you in your academic endeavors, including individualized research support and access to resources beyond COA. Study spaces are also available. The library is open 7 days/week. Remote access to the research databases is available 24/7. Contact library@coa.edu or visit the library website for details.
If you are in need of a term-long loaner laptop, please contact the IT department at helpdesk@coa.edu. Mention that you are taking a data science class, and pick up the laptop in A&S right by the whale skull.
I am committed to the promotion and use of open educational resources and software in the journey of designing an accessible computing education. Much of the course description, design, syllabus, website, and educational materials have been adapted from “Data Science in a Box,” https://datasciencebox.org/, by Mine Çetinkaya-Rundel under the Creative Commons Attribution Share Alike 4.0 International license.
Conceptually, intellectually, and substantively, the course policies and learning objectives draw heavily upon the work of current and past colleagues at College of the Atlantic, Bates College, and beyond, including Carrie Diaz Eaton, Anelise H. Shrout, Barry Lawson, Meredith Greer, Ethan Miller, Misty Beck, Francis Eanes, Dave Feldman, as well as scholars beyond these institutions.
This course has also been greatly improved by the feedback of Bates students enrolled in DCS 210 in the Fall and Winter 2021-2022 and students enrolled in ES 3098 in Spring 2023 at the College of the Atlantic.
Featured artwork is by @AllisonHorst.