The Data Open is a series of competitions we host at universities around the world to allow students to work through complex datasets and showcase their problem-solving skills. At each event, participants present their approach, findings, and insights to a panel of judges. We facilitated more than 20 Data Opens in 2018, and we’re thrilled to bring you advice from some of the winners over the next few months in the run-up to the Data Open Championship in April. We recently spoke with Data Open winner Ruohan Zhan, who is currently pursuing her PhD in Computational and Mathematical Engineering at Stanford University. Read below for excerpts from our conversation with Ruohan, lightly edited for style, to learn about her keen interest in the application of machine learning, the most exciting aspects of the Data Open, and advice she has for future participants.
Ruohan, we’d love to hear how you first became interested in data science.
I was first exposed to data science during my undergraduate years at Peking University, where I learned that mathematics and machine learning could be combined for advanced data modeling. I quickly became intrigued by how machine learning could help extract the data needed to formalize abstract concepts. I wanted to discover new concepts, and machine learning empowered me to pursue this passion.
Can you describe what aspect of your studies interests you the most?
I enjoy learning how critical decisions are made in the presence of uncertain variables and events. We often focus our attention on defined and concrete variables. However, uncertain variables often have a significant impact on our decisions. I want to help the world think about uncertainty.
What did you find most exciting about the Data Open and how did you tackle your problem set?
I loved having the opportunity to use real-world data in order to solve a real-world problem. After receiving the problem set, my team first decided to integrate the data into one single dataset. From there, we performed an analysis to find correlations between the different variables. Next, we performed a regression analysis to predict uncertain variables. We then separated the most significant variables in order to find the solution and determine which variables were predictive. What was most important was that every step in the process was systematic and based on the previous step, which also had the added benefit of helping us tell a coherent story.
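For readers who want to see what such a pipeline might look like, here is a minimal sketch of the steps described above: merge the tables into one dataset, check correlations, then fit a regression on the most predictive variable. It is an illustration under stated assumptions, not the team's actual code. The toy tables, the column names (x1, x2, y), and the helper functions are all hypothetical, and ranking variables by absolute Pearson correlation is just one simple stand-in for deciding which variables are "most significant".

```python
from math import sqrt

# Hypothetical toy tables keyed by a shared record id (not Data Open data).
table_a = {1: {"x1": 1.0}, 2: {"x1": 2.0}, 3: {"x1": 3.0}, 4: {"x1": 4.0}}
table_b = {
    1: {"x2": 5.0, "y": 2.1},
    2: {"x2": 3.0, "y": 3.9},
    3: {"x2": 6.0, "y": 6.2},
    4: {"x2": 4.0, "y": 7.8},
}


def merge_tables(*tables):
    """Step 1: combine tables into one dataset, keeping ids found in all of them."""
    shared_ids = set.intersection(*(set(t) for t in tables))
    merged = {}
    for key in sorted(shared_ids):
        row = {}
        for table in tables:
            row.update(table[key])
        merged[key] = row
    return merged


def pearson(xs, ys):
    """Step 2: Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fit_line(xs, ys):
    """Step 3: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def rank_predictors(dataset, target):
    """Step 4: order candidate variables by absolute correlation with the target."""
    rows = list(dataset.values())
    ys = [row[target] for row in rows]
    names = [name for name in rows[0] if name != target]
    scores = {name: pearson([row[name] for row in rows], ys) for name in names}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


dataset = merge_tables(table_a, table_b)
ranking = rank_predictors(dataset, "y")
best = ranking[0][0]
slope, intercept = fit_line(
    [row[best] for row in dataset.values()],
    [row["y"] for row in dataset.values()],
)
print(f"ranking: {ranking}")
print(f"best predictor: {best}, slope={slope:.2f}, intercept={intercept:.2f}")
```

Each function consumes the output of the previous one, which mirrors the point about every step building on the last. A real analysis would more likely use a dataframe library, multiple regression, and proper significance tests rather than a single-variable fit.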
In addition, we talked to other teams after the competition was over. It was a great learning experience to compare and contrast our approaches. I will seek out this type of collaboration as I continue my studies and pursue a career.
What advice would you give to future participants?
It might sound simple, but finding great teammates is essential. Finding the right team, where members bring different skill sets, makes a huge difference in how successful you’ll be at the competition. We drew on the different strengths of our team members, whether they were skilled in mathematics, machine learning, or statistics, to solve the problem.
I’d also recommend starting to think about your approach as soon as possible. You receive a preliminary data description the night before the competition. Once we received this, our team gathered to begin brainstorming how we would tackle the problem.
Lastly, ensure your analysis is coherent and systematic. This will help tell a clear story when you present your findings to the judges.
We’re hosting a live Twitter chat at the Data Open Championship on April 15th with our team members who have built their careers around their passion for data science. If you have questions for our team members about how they’ve succeeded in their careers, feel free to tweet a question using #thedataopen.