Data Science II - Data Analysis

Winter 2024

College of the Atlantic

Data Science 2 Analysis

This course builds on Data Science 1, where students developed skills in data collection, data cleaning, creating different types of data visualizations (e.g. bar charts, scatter plots, density plots, heat maps, violin plots, time series, and interactive graphics), and effective data communication while working on problems and case studies inspired by and based on real-world questions. Continuing on the themes explored in Data Science 1, students will go beyond data visualization to gain insight from data using statistical and machine learning techniques. Students who successfully complete this course will be able to work with large data sets, transform those data, and apply statistical and machine learning techniques to analyze data. Students will build on their knowledge of GitHub, ggplot2, R Markdown, and the tidyverse packages for data manipulation, visualization, and analysis, adding an analytical toolkit for answering different types of questions and working with different types of data. Students will be exposed to a variety of topics including web scraping, generalized linear models, and machine learning. Each year we will also explore several advanced special topics in data science, which may include text analysis, image processing, and spatial analysis.

Timetable

Topic Intros Tuesday, Wednesday, Lab Friday, Help Sessions TBD.

  • Tuesday: Topic Intro, 13:00-14:25
  • Wednesday: Topic Intro, 14:35-16:00
  • Friday: Lab, 13:00-14:25
  • Help Sessions: TBD

Welcome back to COA

Course Schedule

Week 1 - Doing Data Science

Refresher and introduction to the course!

Week 2 - Web scraping and programming

Harvesting data from the web, writing functions, and iteration.

Week 3 - Working with strings

Working with character strings.

Week 4 - Text Analysis

Word frequencies, sentiment analysis, and comparing texts.

Week 5 - Linear Models

Linear models for predicting numerical data from single and multiple variables.

Week 6 - Linear Models Contd.

Linear models for predicting numerical data from single and multiple variables.

Week 7 - Classification and model building

Logistic regression for predicting categorical data and model building.

Week 8 - Model validation and uncertainty quantification

Evaluating models with cross validation and uncertainty quantification with bootstrap confidence intervals.

Week 9 - Machine Learning

Intro to Classical Machine Learning

Week 10 - Final Presentations and Wrap Up

Tasks:

  • Work on project and presentation
  • Complete the assignments
  • Complete the readings

Syllabus

Course Description

Continuing on the themes explored in Data Science 1, students will go beyond data visualization to gain insight from data using statistical and machine learning techniques. Students who successfully complete this course will be able to work with large data sets, transform those data, and apply statistical and machine learning techniques to analyze data. Throughout the course we will be using GitHub, ggplot2, R Markdown, and the tidyverse packages for data manipulation, visualization, and analysis. This course is intended to appeal to a wide range of students. The skills and habits of mind taught in this course are applicable not only in the sciences and social sciences, but in almost all fields. Evaluation will be based on several short homework and lab assignments, participation in in-class activities, and a final project.

Unit 1 (Week 1 to 2) - Exploring data: This unit is designed as a refresher of data visualization and data wrangling. We end the unit with web scraping and introduce the idea of iteration in preparation for the next unit. Also in this unit, students are introduced to the toolkit: R, RStudio, R Markdown, Git, and GitHub.

Unit 2 (Week 3 to 4) - Working with strings, regular expressions, and text analysis.

Unit 3 (Week 5 to 8) - Making rigorous conclusions: In this unit we introduce modelling and statistical inference for making data-based conclusions. We discuss data ethics alongside building, interpreting, and selecting models, visualizing results, and prediction and model validation.

Unit 4 (Week 9 and 10) - Machine Learning and looking beyond Data Science 2: In this unit we delve into supervised machine learning and finish up the course with project presentations and a look beyond Data Science 2.

Additional Course Info

  • Meets the following requirements: QR
  • Prerequisites: Data Science 1
  • Level: Intermediate
  • Course limit: 16
  • Lab fee: No lab fee

Course Values, Goals, and Practices

The computing courses at COA are designed to bridge liberal arts education with computing and the digital world. As part of this, I am committed to actively creating digital and computational spaces that are radically inclusive. This includes integrating equity and social justice throughout the curriculum, and engaging students in metacognition to support this work.

Learning Objectives

This course is designed as a community learning journey. Together, we will:

  • Play with computational ideas creatively, using a growth mindset which values revision and experimentation.
  • Gain experience in data collection, wrangling, visualization, exploratory data analysis, statistical analysis and machine learning, as well as effective communication of results while working on problems and case studies inspired by and based on real-world questions.
  • Engage and reflect on contemporary issues in environmental and social justice related to your digital world, community and positionality. Critically evaluate the origins of data and the limitations of analysis techniques.
  • Demonstrate community leadership skills as a collaborator who shares strengths, works on weaknesses, and contributes to a broader shared understanding.

It is also my hope that in this course you:

  • Develop an appreciation for reproducibility, transparency, accessibility and inclusivity in data collection, analysis, and communication.
  • Build knowledge and skills in data science to tackle questions that are important to you.

Course Materials required:

Books:

All books for this course are freely available online as e-books. We will be using two main texts:

We will also have chapters and readings from the following opensource books on Text Analysis and Machine Learning:

Technology:

  • Bring a laptop to every class; we will be programming most days! We will be using Posit Cloud, and you will get an invitation to create an account.
  • If you are in need of a term-long loaner laptop, please contact the IT department at helpdesk@coa.edu. Mention that you are taking a data science class, and pick up the laptop in A&S right by the whale skull.

Course components

Weekly structure

The class meets on Tuesday and Friday from 13:00-14:25 and on Wednesday from 14:35-16:00. The typical weekly class schedule will be:

Day          Activity
Tuesday      Topic Introduction
Wednesday    Topic Introduction
Friday       Lab

Outside of class you can expect to spend:

  • 1-2 hours on the reading and interactive tutorials per week
  • 3-5 hours on homework and labs each week

The main types of coursework are:

  • Homework: Homework is assigned on Friday and due the following Friday at 23:59 EST. Each homework will be introduced on Friday morning, and you are encouraged to work on your homework and lab in Friday’s workshop class session.
  • Labs: Labs will be started in class on Friday and are due the following Friday at 23:59 EST.
  • Reflections: There will be a number of reflections throughout the course. These will either be open-ended or follow a prompt.

Help sessions

The Teaching Assistant and I will have a handful of help sessions every week. You are warmly invited and encouraged to attend these sessions. Help sessions are relaxed, informal, and hopefully fun. Things that happen at help sessions:

  1. The TAs and/or I are around to offer help on the homework, lab, or project.
  2. Some students do most of the assignment while at a help session. They work through problems alone or with others, and find it comforting to know that help is at hand if needed.
  3. Others do the problems at home and come to the help session with specific questions.
  4. Help sessions are also a chance to ask general questions about the course.
  5. Help sessions are a great way to meet other students in the class.

Everyone is welcome at help sessions! Attending these sessions helps students do well in class and get as much out of it as possible.

Labs

Labs will be held on Fridays. During these sessions you will work individually or in teams on computing lab exercises. You will finish the exercises after class and turn in your lab reports the following Thursday at 23:59 EST through Sunday at 23:59 EST. Attendance in class is important, as you will be working on your labs individually and together in class. Labs will be submitted as GitHub repositories.

A frequently asked question is “What happens if I can’t make it to a lab one week because I’m sick or have another obligation at that time?” Answers below:

  • If you’re missing a workshop day due to short-term illness or some other reason, you should communicate this with your team and attend a team meeting before the deadline for the assignment to contribute to the teamwork. If you have made 0 commits towards a lab assignment, you will receive no credit for that assignment, so you need to participate both for being a team player and also for your own individual score.

  • If you’re unable to contribute to a lab assignment because of an illness taking you away from school work for an extended period of time, you should let me and your team know that you won’t be able to contribute to those lab(s) and we can discuss special circumstances and explore alternative arrangements to make up that work.

Overall, these policies are put in place to ensure communication between team members, respect for each other's time, and also to give you a safety net in the case of illness or other reasons that keep you away from attending class once or twice.

Homework assignments

Beyond the in-class activities, you will be assigned larger weekly programming tasks throughout the term. These assignments will be completed individually, are due the following Friday at 23:59 EST through Sunday at 23:59 EST, and are submitted as GitHub repositories. Tip: Do the (optional) R tutorials, which will introduce you to the datasets and topics covered in the homework assignments.

Quizzes

These weekly multiple choice quizzes will help you evaluate your learning continuously. The online quiz will be graded for completion only. You do not need to get any answers right, but it should help you identify what parts of the material you should review. Tip: Don’t leave it till the last minute!

Reflections

Throughout the course there will be reflection assignments where we engage and reflect on contemporary issues in environmental and social justice related to our digital world, community and identity. These assignments may be based on an assigned reading or visualizations we encounter in our daily lives.

Final project

You will be responsible for completing an open-ended final project for this course, the goal of which is to tackle an “interesting” problem using the tools and techniques covered in this class. Additional details on the project will be provided as the course progresses. Each team’s work will also be shared with and evaluated by at least one other team at an earlier stage in order to provide feedback. You must complete the final project and be in class to present it in order to pass this course. Tip: Stick to the optional interim deadlines (outline, draft presentation) to pace your work on the project.

Teams

For all of the team-based assignments in this class you will be randomly assigned to teams of 2 or 3 students; these teams will change throughout the trimester. You will work in these teams during class and on the lab assignments. For team-based assignments, all team members are expected to contribute equally to the completion of each assignment, and you will be asked to evaluate your team members on lyceum after each assignment is due. During the labs, we will be working together using pair programming, where you will take turns writing and reviewing the code, swapping roles frequently. Once the assignment is submitted, the contributors will share responsibility for any revisions to be made based on feedback. Failure to adequately contribute to an assignment will result in a penalty to your grade relative to the team’s overall grade.

Students are expected to make use of the provided GitHub repository as their central collaborative platform. Commits to this repository will be used as a metric (one of several) of each team member’s relative contribution for each homework.

Grades:

A growing body of research indicates that traditional approaches to grading fail to produce the sorts of meaningful learning desired by both teachers and students. Such approaches often reinforce inequitable power dynamics between teachers and students, promote faulty reward systems that disincentivize creativity and risk-taking, and devalue important aspects of learning (including revision and feedback). Given this context, instead of a traditional approach to grading in which you do work that is evaluated solely by me, this course assumes that you opt to take ownership of and responsibility for your performance and engagement with the class. To make this happen, this course uses a “contract grading” scheme, which gives you a voice in the grading process, provides you with the agency to specify your intended course performance, and lets you share in the responsibility for evaluating whether or not you fulfilled your intended obligations. Please see the contract grading document (on Google Classroom) for a more fleshed-out explanation of this approach and how it will operate in the course.

I will also be meeting with each of you individually to set goals at the start of the course.

The work in this course consists of the following components, weighted as shown:

  • Homework: 30%
  • Lab: 20%
  • Project: 20%
  • Quizzes: 10%
  • Attendance: 10%
  • Reflections: 10%

Policies

Collaboration policy

Only work that is clearly assigned as team work should be completed collaboratively. Individual assignments must be completed individually: you may not directly share or discuss answers or code with anyone other than the instructors and tutors. You are welcome to discuss the problems in general and ask for advice.

Sharing / reusing code

I am well aware that a huge volume of code is available on the web to solve any number of problems. Unless I explicitly tell you not to use something, the course’s policy is that you may make use of any online resources (e.g. StackOverflow), but you must explicitly cite where you obtained any code you directly use (or use as inspiration). Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism. On individual assignments you may not directly share code with another student in this class, and on team assignments you may not directly share code with another team in this class. You are welcome to discuss the problems together and ask for advice, but you may not send or make use of code from another team.

Academic Integrity (excerpt from Course Catalog)

By enrolling in an academic institution, a student is subscribing to common standards of academic honesty. Any cheating, plagiarism, falsifying or fabricating of data is a breach of such standards. A student must make it their responsibility to not use words or works of others without proper acknowledgement. Plagiarism is unacceptable and evidence of such activity is reported to the provost or their designee. Two violations of academic integrity are grounds for dismissal from the college. Students should request in-class discussions of such questions when complex issues of ethical scholarship arise.

Universal Learning and Learning in Community

Many of us learn in different ways. For example, you may process information by speaking and listening, so while lectures are quite helpful for you, some of the written material may be difficult to absorb. You might have difficulty following lectures, but are able to quickly assimilate written information. You may need to fidget to focus in class. You might take notes best when you can draw a concept. For some of you, speaking in class can be a stressful or daunting experience. For some of you, certain topics or themes might be so traumatic as to be disruptive to learning. The principle of Universal Design for Learning calls for our classrooms, our virtual spaces, our practices and our interactions to be designed to include as many different modes of learning as possible, and is a principle I take seriously in this class.

It is also my goal to create an inclusive classroom, which depends on community building, and which requires everyone to come to class with mutual respect, civility, and a willingness to listen to and observe others. As such the syllabus serves as a contract of some expectations between all members of the class, including myself.

If you anticipate or experience any barriers to learning in this course, please reach out to me and your student support advisor. If you have a disability, or think you may have a disability, please contact COA’s Disability Support Services, located within the Office of Student Life in Deering Commons, to develop a plan for your academic accommodations. You can find more information in the course catalog under Accommodating students with disabilities. If you have already been approved for accommodations through Disability Support Services, please let me know! We can meet one-on-one to explore concerns and potential options.

Late work, extensions, and special circumstances

All work is due on the stated due date. Due dates are there to help guide your pace through the course and they also allow me to return feedback to you in a timely manner. However, sometimes life gets in the way and you might not be able to turn in your work on time. First, please note that resubmission is built into each assignment. You will receive feedback on each assignment and may resubmit a piece of work incorporating your revisions.

If you intend to submit work late for an assignment or project, you must notify me before the original deadline and again as soon as the completed work is submitted on GitHub. This allows me to return feedback to you and lets me know when to check your work. Lab work cannot be submitted late.

Learning during a pandemic

I want to make sure that you learn everything you were hoping to learn from this class. If this requires flexibility, please don’t hesitate to ask.

  • You never owe me personal information about your health (mental or physical) but you’re always welcome to talk to me. If I can’t help, I likely know someone who can.

  • I want you to learn lots of things from this class, but I primarily want you to stay healthy, balanced, and grounded during this crisis.

Help

Most of you will need help at some point and we want to make sure you can identify when that is without getting too frustrated and feel comfortable seeking help.

  • Google Classroom Forum: The best way to get answers to questions about course content, technology, logistics, or policies is to post your question on the Google Classroom forum. You are encouraged to answer each other's questions there as well.
  • Help Sessions: The Teaching Assistant and I will have a handful of help sessions every week, and everyone is welcome. See the Help sessions section under Course components for what to expect.
  • Email: Please refrain from emailing any course content questions (those should go on Google Classroom), and only use email for questions about personal matters that may not be appropriate for the public course forum (e.g. illness, missed assignments).
  • For more general support and advice, please make use of the resources on campus, which are listed in the College course catalog. If you’re not sure where to go for help, just ask.

Project

Showcase your inner data scientist

The final project for this class will consist of analysis on a dataset of your own choosing. The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.

There will also be the option to work on datasets and projects proposed by community partners. More info to come.

The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and appropriateness of the statistical analysis should be discussed here.

The project is very open-ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use, but sticking to packages we learned in class (tidyverse) is required. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R. Here is an example of a past project write-up and presentation on Lessons to be Learned from Super Bowl Advertisements.

Data

To have the greatest chance of success with this project, it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your dataset must have at least 50 observations and between 10 and 20 variables (exceptions can be made, but you must speak with me first). The variables in the data should include categorical variables, discrete numerical variables, and continuous numerical variables.

If you are using a dataset that comes in a format that we haven’t encountered in class, make sure that you are able to load it into R, as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.

Note on reusing datasets from class: Do not reuse datasets used in examples, homework assignments, or labs in the class.

Below is a list of data repositories that might be of interest to browse. You’re not limited to these resources, and in fact you’re encouraged to venture beyond them. But you might find something interesting there:

Deliverables

  1. Team contract - due Saturday, May 6, 23:59 EST
  2. Proposal - due Thursday, May 11, 23:59 EST
  3. Presentation outline (Optional) - due Saturday, May 27, 23:59 EST
  4. Presentation - due Tuesday or Wednesday, June 6 and 7 (In class)
  5. Write-up - due Friday, June 9, 23:59 EST

Team Contract

One of the main goals of this course is that you build and develop community leadership skills as a collaborator who shares strengths, works on weaknesses, and contributes to a broader shared understanding. These skills will serve you in this course and beyond in your careers. A crucial part of building strong collaborations is good communication.

Each team will draft a group contract. A group contract is a document to help you formalize the expectations you have for your group members and what they can expect of you. It will help you think about what you need from each other to work effectively as a team! You will create and agree on this contract as a team and refer to it during the project.

At a minimum, your group contract must address these questions:

Goals:
  • What are our team goals for this project?
  • What do we want to accomplish?
  • What skills do we want to develop or refine?
Expectations:
  • What do we expect of one another regarding attendance at meetings, participation, frequency of communication, quality of work, etc.?
Policies & Procedures:
  • What rules can we agree on to help us meet our goals and expectations?
Consequences:
  • How will we address non-performance regarding these goals, expectations, policies and procedures?

Each member should “sign” (you can just type out your name) at the bottom of the submission.

Credit for Group Contract: Tiffany Timbers, University of British Columbia

Proposal

You will write your proposal in the proposal.Rmd file in your GitHub project.

  • Section 1 - Introduction: The introduction should introduce your general research question and your data (where it came from, how it was collected, what are the cases, what are the variables, etc.).

  • Section 2 - Data: Place your data in the /data folder, and add dimensions and codebook to the README in that folder. Then print out the output of glimpse() or skim() of your data frame.

  • Section 3 - Data Ethics: We will be conducting a review of the ethics of our data analysis project using the questions outlined in the Data Ethics Canvas by the Open Data Institute. Specifically we will look at the questions under “Limitations in Data Sources”, “Your reasons for using data”, “Positive effects on people”, “Negative effects on people”, and “Minimising negative impact”.

  • Section 4 - Data analysis plan:

    • The outcome (response, Y) and predictor (explanatory, X) variables you will use to explore your question. The type of model you will use to explore your question.
    • The comparison groups you will use, if applicable.
    • Very preliminary exploratory data analysis, including some summary statistics and visualizations, along with some explanation on how they help you learn more about your data. (You can add to these later as you work on your project.)
    • The data visualization(s) that you believe will be useful in exploring your question(s). (You can update these later as you work on your project.)

Each section should be no more than 1 page (excluding figures). You can check a print preview to confirm length.

Presentation

10 minutes maximum, and each team member should say something substantial.

Prepare a slide deck using either Google Slides or the template in your repo. This template uses a package called xaringan, and allows you to make presentation slides using R Markdown syntax. There isn’t a limit to how many slides you can use, just a time limit (10 minutes total). A rough guide is one slide per minute. Each team member should get a chance to speak during the presentation. Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”). Instead, it should convey what choices you made, and why, and what you found.

Before you finalize your presentation, make sure your chunks are turned off with echo = FALSE.
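One way to do this, sketched here under the assumption that your slides are written in R Markdown (for example with the xaringan template), is to set the option once for all chunks rather than editing each chunk by hand. The chunk label in the comment is a hypothetical example.

```r
# Put this inside a setup chunk near the top of presentation.Rmd.
# It sets the default for every later chunk, so the code is hidden
# while plots and other output still appear on the slides.
knitr::opts_chunk$set(echo = FALSE)

# To show one specific piece of code, override the default in that
# chunk's header, e.g. {r highlighted-code, echo = TRUE}
# (the label "highlighted-code" is only an illustration).
```

Setting the default globally means new chunks added late in the project are hidden automatically, and any chunk you do want to display has to opt in explicitly.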

Presentation schedule: Presentations will take place during the Tuesday and Wednesday of the last week of the course. During the class you will watch presentations from the other teams and provide feedback in the form of peer evaluations. The presentation line-up will be generated randomly.

Write-up

Along with your presentation slides, we want you to provide a brief summary of your project in the README of your repository.

This write-up, which you can also think of as a summary of your project, should provide information on the dataset you’re using, your research question(s), your approach (how you decided to visualize the data), and your findings.

Repo organization

The following folders and files should be in your project repository:

  • presentation.Rmd + presentation.html: Your presentation slides. Note that you may use Google Slides instead of xaringan.
  • README.md: Your write-up
  • /data/*: Your dataset in csv or RDS format, in the /data folder.
  • /proposal: Your proposal from earlier in the term
  • /contract: Your group contract from earlier in the term

Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted, including labelling code chunks. Pay attention to images and plots included in the presentation and make sure to include appropriate alternative text.

Tips

  • You’re working in the same repo as your teammates now, so merge conflicts will happen, issues will arise, and that’s fine! Pull, commit and push often, and ask questions when stuck.
  • Review the marking guidelines below and ask questions if any of the expectations are unclear.
  • Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).
  • Set aside time to work together and apart (physically).
  • When you’re done, review the documents on GitHub to make sure you’re happy with the final state of your work. Then go get some rest!
  • Code: In your presentation your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However your document should include all your code such that if I re-knit your R Markdown file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
  • Teamwork: You are to complete the assignment as a team. All team members are expected to contribute equally to the completion of this assignment, and team evaluations will be given at its completion. Anyone judged not to have contributed sufficiently to the final product will have their grade penalized. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.

Criteria

  • Content - What is the quality of research and/or policy question and relevancy of data to those questions?
  • Correctness - Are the data visualizations chosen an effective means of exploring the questions? Are the statistical models chosen an effective means of exploring the data? Are the results interpreted correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

Resources

Thorndike Library

Thorndike Library offers many resources and services that can assist you in your academic endeavors, including individualized research support and access to resources beyond COA. Study spaces are also available. The library is open 7 days/week. Remote access to the research databases is available 24/7. Contact library@coa.edu or visit the library website for details.

Books

Technology

If you are in need of a term-long loaner laptop, please contact the IT department at helpdesk@coa.edu. Mention that you are taking a data science class, and pick up the laptop in A&S right by the whale skull.

Tools

Cheatsheets

People

Course organisers

Acknowledgements

I am committed to the promotion and use of open educational resources and software in the journey of designing an accessible computing education. Much of the course description, design, syllabus, website, and educational materials have been adapted from “Data Science in a Box,” https://datasciencebox.org/, by Mine Çetinkaya-Rundel under the Creative Commons Attribution Share Alike 4.0 International license.

Conceptually, intellectually, and substantively, the course policies and learning objectives draw heavily upon the work of current and past colleagues at College of the Atlantic, Bates College, and beyond, including Carrie Diaz Eaton, Anelise H. Shrout, Barry Lawson, Meredith Greer, Ethan Miller, Misty Beck, Francis Eanes, Dave Feldman, as well as scholars beyond these institutions.

This course has also been greatly improved by the feedback of Bates students enrolled in DCS 210 in the Fall and Winter 2021-2022 and students enrolled in ES 3098 in Spring 2023 at the College of the Atlantic.

Featured artwork is by @AllisonHorst.