Vincent Rupp @vincentrupp - Tumblr Blog

Migration of tumblr

Since I've really been using this more of an old-school blog than as a tumblr-esque posting tool, I decided to migrate my data science and programming adventures to my new website:

www.solvebythinking.com

Knapsack Perfection

It took some enhancing, but I improved my relaxation, branch, and bound algorithm quite nicely.

That last set has 10,000 items. It took about 20 minutes to run, which isn't bad for a search space of 2^10000 possible item selections.

Not bad for someone who hadn't written a serious Python program until three days ago.

#knapsack #discreteoptimization #coursera

Algorithmic Efficiency

I took Data Structures and Algorithms using C++ in college. One thing that kept coming up was that some algorithms are much better than others.

That's great, but no one at the time really cared. Computers were getting faster all the time, so whether you could multiply matrices in O(N^3) or O(N^2.7) time didn't really affect our lives. Plus, our matrices were small.

My relaxation, branch, and bound algorithm took well over two hours to run on a set of 30 items.

My new dynamic programming algorithm took under 1 second.

For 200 items, it only took a couple seconds.

My algorithm basically copies the example from the videos, but I made a huge improvement in that instead of filling in the entire table, I recursively fill in the table as needed. That means computing (with my small-number tests) about 1/10th as many values.
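A minimal sketch of the "fill in the table only as needed" idea, assuming a plain 0/1 knapsack with integer weights. This is not the code from the assignment; the function name and the small item set are made up for illustration, and it uses `functools.lru_cache` from modern Python 3 as the table.

```python
import sys
from functools import lru_cache

def knapsack_top_down(values, weights, capacity):
    """Best total value, computing only the table cells the recursion visits."""
    # One recursion level per item, so make sure the limit is high enough.
    sys.setrecursionlimit(max(sys.getrecursionlimit(), len(values) + 100))

    @lru_cache(maxsize=None)
    def best(i, cap):
        # Best value using only the first i items with `cap` capacity left.
        if i == 0:
            return 0
        skip = best(i - 1, cap)
        if weights[i - 1] > cap:
            return skip  # item i-1 does not fit, so it can only be skipped
        take = values[i - 1] + best(i - 1, cap - weights[i - 1])
        return max(skip, take)

    result = best(len(values), capacity)
    # currsize is the number of (i, cap) cells that were actually computed.
    return result, best.cache_info().currsize

# Hypothetical toy instance: 4 items, capacity 11.
values = [8, 10, 15, 4]
weights = [4, 5, 8, 3]
best_value, cells_filled = knapsack_top_down(values, weights, 11)
full_table_cells = (len(values) + 1) * (11 + 1)
print(best_value, cells_filled, full_table_cells)  # best_value is 19
```

The cache still grows with the number of distinct (item, capacity) pairs reached, so very large capacities can exhaust memory even though fewer cells are filled than in the full table.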

When I tried it for 400 items (#4 out of 6 in the graded assignments), it took up all my computer's memory and I had to shut it down. (~10M capacity * 400 items = a matrix with 4 billion entries)

I have two computers. One is a desktop with six cores and 10GB of RAM. The other is an all-in-one with two cores and 4GB of RAM. You can probably guess which is for work and which is for games.

I used my super-simple greedy algorithm to submit the 400-item, 1000-item, and 10000-item knapsack problems and got results that were within 1% of optimal.
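The post doesn't say which greedy rule was used, so the sketch below assumes the common one: take items in order of value per unit of weight and skip anything that no longer fits. The function name and data are illustrative, and weights are assumed to be positive.

```python
def greedy_knapsack(values, weights, capacity):
    """Take items by decreasing value density, skipping any that don't fit."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    taken = [0] * len(values)
    total_value = 0
    remaining = capacity
    for i in order:
        if weights[i] <= remaining:
            taken[i] = 1
            total_value += values[i]
            remaining -= weights[i]
    return total_value, taken

# Hypothetical toy instance: greedy finds 18, while the true optimum is 19.
print(greedy_knapsack([8, 10, 15, 4], [4, 5, 8, 3], 11))  # (18, [1, 1, 0, 0])
```

Greedy runs in sorting time and needs no table, which is why it scales to thousands of items, but it carries no guarantee of optimality.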

My total score for the first assignment is now

10/10 Optimal

7/10 Good

That's 51/60, which is plenty to pass the course, but I want that certificate of distinction so I'll have to come back to these once I learn some more advanced techniques.

And since the whole course is online at once, I'm off to watch some more videos now. =D

#discreteoptimization #coursera #python

Discrete Optimization - Knapsack

The course submits assignments using Python, so students are given solver.py and submit.py scripts. There's also a solverJava.py for students who want to program in Java and have Python grab their results.

Since many of the jobs I've looked at want Python, I decided to use Python instead of tackling these assignments in R.

I took the whole 13-hour sequence for Python on Codecademy, but that was last summer, about six months ago. Apparently, I didn't remember as much as I thought I did.

Python 3.4 is already installed on my computer, but the course files use 2.7.x, so I installed that; Google helped me figure out how to run scripts from the command line, and the course's forums pointed to Anaconda (a Python distribution that comes with an IDE), which is SO much easier.

Parts of the next ten hours would have made a great 80's musical montage, as I googled "python dictionaries" and a hundred other absolute basics like syntax of for loops and then tried out variations to see what I could do and what would throw me errors.

Either I remembered more than I thought or I'm just amazing with picking up new things (probably a mix of both) because by 10PM I had finished a simple greedy algorithm and implemented a relaxation/branch/bound depth-first tree search.

I got stuck for almost two hours trying to track down a "bug" that turned out to be a fundamental part of how Python works. Lists, you see, are mutable. So when I say:

x = [1,2]

y = x

x += [3]

The output shows that y is also the list [1,2,3]. That's because when I created y, it only created a memory reference to where x is also pointing. I was passing item_values around to at least three functions, which is why it took me so long to find out exactly what was going on.

This works as I wanted:

x = [1,2]

y = []

y += x

x += [3]

Now the output for y is [1,2].
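The same behavior in one place, as a small sketch: plain assignment gives a second name for the same list, while `list(x)` builds a new list. The variable names here are illustrative.

```python
x = [1, 2]
alias = x            # a second name for the same list object
snapshot = list(x)   # a new list with the same elements (a shallow copy)

x += [3]             # extends the list in place, so every alias sees it

print(alias)             # [1, 2, 3]
print(snapshot)          # [1, 2]
print(alias is x)        # True
print(snapshot is x)     # False
```

The same rule applies to function arguments: a function that appends to a list it receives is changing the caller's list, so passing `list(items)` (or copying inside the function) is one way to keep the original intact.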

It's always such an interesting puzzle when you can do something simple by hand, like list every permutation of 1's and 0's for n spaces, but you have to figure out how to tell the computer how to do what you can easily see.
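For the specific puzzle mentioned above, listing every arrangement of 1's and 0's for n spaces, one possible approach is `itertools.product` from the standard library. The helper name is made up for illustration.

```python
from itertools import product

def all_selections(n):
    """Iterate over every length-n tuple of 0s and 1s (2**n of them)."""
    return product((0, 1), repeat=n)

for selection in all_selections(3):
    print(selection)   # (0, 0, 0), (0, 0, 1), ... up to (1, 1, 1)
```

Because `product` is lazy, it never holds all 2**n tuples in memory, though iterating over them still takes exponential time.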

The first graded set has 30 items. There are 2^30 (about a billion) ways to choose which of those items to include.

My program hadn't finished running in two hours, so I went to bed. This morning, it had spit out the optimal solution [sunglasses emoji], but due to particulars with the item set, this method is simply not very good. Even pruning 70% of the possibilities meant it had to explore close to 300 million branches.
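A rough sketch of a depth-first relaxation/branch/bound search, not the program described in this post. It assumes the usual linear relaxation (fill the remaining room by value density and allow one fractional item) as the optimistic bound, and positive integer weights; all names are illustrative.

```python
def branch_and_bound(values, weights, capacity):
    """Depth-first take/skip search, pruned with a fractional upper bound."""
    # Sort by value density so the relaxation is a simple greedy fill.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    n = len(v)

    def bound(i, value, room):
        # Optimistic estimate: whole items while they fit, then a fraction.
        for j in range(i, n):
            if w[j] <= room:
                room -= w[j]
                value += v[j]
            else:
                return value + v[j] * room / w[j]
        return value

    best = 0
    stack = [(0, 0, capacity)]  # (next item index, value so far, room left)
    while stack:
        i, value, room = stack.pop()
        if value > best:
            best = value  # every node on the stack is a feasible selection
        if i == n or bound(i, value, room) <= best:
            continue  # leaf, or this subtree cannot beat the best found
        stack.append((i + 1, value, room))                    # skip item i
        if w[i] <= room:
            stack.append((i + 1, value + v[i], room - w[i]))  # take item i
    return best

print(branch_and_bound([8, 10, 15, 4], [4, 5, 8, 3], 11))  # 19
```

How much gets pruned depends heavily on the item set: when many items have similar value densities, the bound stays close to the best known solution and few subtrees can be discarded.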

Now, I'm off to implement dynamic programming. It should go quicker today, since I won't have to look up how to write an IF statement.

#coursera #knapsack #discreteoptimization

Discrete Optimization Week 1 - Content

I joined this class on a lark, something to build some new skills, practice some more programming.

The intro video got me fairly hooked. The instructor is quite clearly just the right amount of nuts. Enough to make the class interesting, but not so you worry about poking around in his cellar.

Professor Pascal Van Hentenryck leads the optimization research group at NICTA. There's nothing on their website (visible within 6 seconds) regarding what that acronym means, but he works in Australia (also at Brown) and I believe is Dutch.

His bio is seriously impressive, including "main designer and implementor of the CHIP programming system, the foundation of all modern constraint programming systems."

The introductory video starts with him dressed up as Indiana Jones, speaking very excitedly at a speed that made me check whether I'd accidentally set the video to 1.25x playback.

The lectures are all him, wearing different hats to represent different methods, in front of a green screen while moving around (at one point fumbling with his hat in the air for two seconds before catching it again) expertly clicking through his slides.

He says this class is very hard, maybe 15 hours a week. For a single class, that's quite a commitment. Most instructors would be worried about scaring off students, but he doesn't care. He wants students to work hard and enjoy it.

He's clearly enjoying dressing like Indiana Jones and describing the knapsack problem and several simple attempts at solving it.

The lectures regarding the knapsack problem cover greedy algorithms; dynamic programming; relaxation, branch, and bound; and methods to implement that last item.

Despite an MS in Mathematics and a programming-heavy education and hobbies, I hadn't seen any of this before. Not too difficult to understand, but he said this week was the easy week.

The course is totally open, meaning all the material is there and we can explore the five programming assignments and five major topics at our leisure over the next nine weeks, mixing methods and assignments to keep trying to achieve perfect scores on every assignment.

A key to a good course seems to be motivating students to tackle a lot of work, just like good workplaces make people feel engaged, challenged, and rewarded.

It's rare to see such a technical expert and good instructor. I don't know if I have time to stick with the whole course, but he's put this big challenge in front of me, and one thing about me is that I always rise to challenges.

#discreteoptimization #coursera #knapsack

Disappointing

After spending $530 so far on data science and data mining classes on Coursera, I applied for a Data Scientist job at Coursera.

To their credit, they replied within three business days.

Unfortunately, the response was "the teams didn't feel your experience was the best match for what we're looking for today".

It'd be really helpful if they'd tell me what they are looking for so I could work on that, or showcase if it's a skill I already have.

I have a couple solid sample projects that show my analysis and communication skills. I don't have every bullet point a person could want (but what technical expert does? Most are really good at a narrow area and capable and adaptable to a broad range.) What I'm really missing is that old Catch-22, experience in the field.

I've seen plenty of internships, but they're exclusively for college students. Guess I'll take more courses, maybe enter a Kaggle tournament, network more. I could really use some contacts to help me figure out what skills or projects would be most useful.

Well, I'll get there eventually. Like my old boss told me, "cream rises to the top."

#coursera #datascience #jobsearch

Pattern Discovery in Data Mining - Week 3

This class is getting a lot of complaints in the discussion forums, and I'm not really sure why. I'm learning a ton about mining patterns and am able to think through some of the applications I've read about, like How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did.

The complaints seem generally twofold:

1. There's no programming component

2. The professor reads his slides and doesn't give enough examples

On the first point, well, it's not a programming class, so that doesn't seem particularly valid. The Data Science courses all used R heavily, which replaced "I'd learn more with code examples!" with "I'd learn more if I didn't have to use R!"

On the second point, the slides are not standalone, so clearly he's doing more than that. There's usually exactly one example, which is not worked out from a DB to a final set of patterns, but it's enough for me to get the idea behind the algorithm.

Really, there are a lot of algorithms and he goes through each one. In practice, I probably don't need to know five of them, just the one that's most efficient for small patterns and the one that works best for mining very large patterns. And in R I'd definitely find a package instead of writing my own implementation, so I'm happy to keep the videos focused.

Not to say the less-happy students lack all validity. When Professor Jiawei Han introduced graph pattern mining, he didn't give any introduction to what graphs are. I'm familiar with them, so I found those videos pretty easy and kinda fun (graphs are fun, after all), but to someone who's never worked with edges and vertices, it's definitely a "wth is going on" experience.

In my experience as a teacher and a student, the best predictor of whether a student likes a course or not is whether they feel like they're learning as fast as they should be. Their grade doesn't matter: I've had (and been!) a 95%+ student who's still occasionally frustrated. So of course it's natural to blame the material, the instructor, the format, or the questions. I think we've all been there and know the trick is just to take a few deep breaths, focus, and get back to it.

This is another four-week course, and the next course has a different instructor and apparently does have a programming component (it's on text mining!). That starts in I guess two weeks and the Data Science Capstone (with SwiftKey predictive text) starts in one week, so it's a very exciting time to be me. =D

#datamining #coursera

Openness and Accountability

The job market is a pretty messed up environment.

Tonight, I received an email about a job I applied for eight weeks ago. It said I was removed from consideration because I didn't meet the minimum qualifications. Except that I did; I read carefully because they use one of those computer systems that's designed to reject people (Taleo), so I don't spend any time unless I'm a very close match to the position.

This company ranks high in employee satisfaction and I like what they do. Like most other companies, their employment page lists their values, which include openness and accountability.

I've applied for a couple other positions with this company and received nothing but electronically-generated emails, all of which have the line

"Replies to this message are undeliverable and will not reach [us]. Please do not reply."

That's not very open. And there's no one to contact about correcting the mistake in assessing my background, which doesn't promote accountability.

Systems like Taleo are designed to whittle hundreds of applications down to the top handful that are an extremely close skill set match. It is designed to remove people from the process of evaluating people. The result, according to my experience and recruiters like Lou Adler (an advocate of performance-based, not skill-based, hiring), is that the "best qualified" people are those who've done the exact same job before and are looking for a lateral move. Superstars, as Lou points out, look for bigger challenges, not lateral moves.

My best experience so far was with Netflix. A human person contacted me about a position and two business days later the same person told me the hiring team had not thought my background was a good fit for their current needs. And while I may disagree with them as well, I still respect them for treating me like a person.

#hiring #taleo #louadler

Pattern Discovery in Data Mining

After the rousing success of Data Science, and having a few weeks before the capstone project begins (yep, I decided to take it; mostly so I don't lose my R skills, but also because I might post it onto this page), I signed up for Pattern Discovery in Data Mining, which is part of a Data Mining specialization.

This is the first time the course has run, and it feels like it. Week 1 saw quite the outrage when the first quiz had many error-plagued questions.

This week, the forums are full of people mentioning that the calculations in the lecture slides are wrong, there are discrepancies between the (non-required) textbook and the lecture slides, and again that there are errors in the quiz questions.

When I taught half-online classes, it was really easy, when I got a report of an error, to log into the system and fix it. Judging by the TA's replies, they contact Coursera, or perhaps the instructor, who then makes the fix. Of course, I'd have 30 students at a time, so errors would sometimes go unnoticed for several iterations of a class. This course has five people just from Portland. That means this class contains a lot of people, so any tiny error will probably be spotted by hundreds.

There are a lot of other rough edges, compared with the Data Science classes. For example, there is no discussion forum for the quizzes, so people have been putting many different threads under General Discussion. It's a lot harder to collaborate and find answers (like if the quiz question really is broken) that way. In future rounds, they'll probably add a forum for each week's quiz and fix the errors.

Anyway, the content is really interesting. There's no programming in this one, but the professor, Jiawei Han, is one of the leaders in this subject and has many papers to his name (104176 citations according to Google Scholar). He delivers the ideas very nicely, IMHO, and lists papers for efficient implementations of the various algorithms. I really like courses packed with ideas because once you're familiar with the ideas, you can look up the specifics later. Implementation will vary based on system anyway.

So far, I've learned about support and confidence for patterns, finding interesting patterns using the Apriori, ECLAT, and FP-Growth algorithms, along with various types of interestingness measures and their shortfalls. (Kulczynski wins!)
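For reference, a small sketch of the three measures named above, using the standard definitions: support is the fraction of transactions containing an itemset, confidence estimates P(B | A), and Kulczynski averages the two conditional probabilities. The basket data and function names are made up, and transactions are assumed to be Python sets.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

def kulczynski(transactions, a, b):
    """Average of the two conditional probabilities P(b | a) and P(a | b)."""
    return 0.5 * (confidence(transactions, a, b) + confidence(transactions, b, a))

# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
]
print(support(baskets, {"milk", "bread"}))       # 0.5
print(confidence(baskets, {"milk"}, {"bread"}))
print(kulczynski(baskets, {"milk"}, {"bread"}))
```

Unlike support and confidence, the Kulczynski measure is not affected by how many transactions contain neither itemset, which is one reason it is often preferred as an interestingness measure.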

On a related note, to answer a question on the first quiz, I came up with the apriori algorithm before watching it in a video. =D
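The key observation behind Apriori is that every subset of a frequent itemset must itself be frequent, so candidates of size k only need to be built from frequent itemsets of size k-1. A compact, unoptimized sketch under that assumption (transactions are sets, and the names are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets in at least `min_count` transactions."""
    def count(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    frequent = count({frozenset([item]) for item in items})
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

# Hypothetical shopping baskets.
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}, {"bread"}]
found = apriori(baskets, min_count=2)
print(found[frozenset({"milk", "bread"})])  # 2
```

Real implementations avoid rescanning every transaction for every candidate, which is the inefficiency that ECLAT and FP-Growth are designed to reduce.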

It's satisfying to see topics from Data Science come up, such as identifying clusters of frequent small patterns for use in generating colossal pattern candidates.

This class's material is only released one week at a time, which is a little strange because I can do a week of work in a morning. By week 5, half my time may very well be just reviewing notes of what we did previous weeks.

#datamining #coursera #uiuc

Data Science Verified

A few weeks after the courses ended, the last of my Data Science verified certificates were issued.

I did a little better on the final four courses, only missing about one point in Statistical Inference. Unfortunately, I can't find out why because the course is closed. It probably had to do with peer grading.

The last "course" for the data science specialization certificate is a capstone project that doesn't start until mid-March. I've already added the nine courses to my resume and LinkedIn, so I'm not sure the capstone will be that much more useful, but it's a subject I enjoy and am obviously good at (and am looking to get back into, professionally) so I'll probably pay the $49 and finish it out.

Coursera has some other course sequences leading to a specialization that sound really interesting, like Cloud Computing or Data Mining, but the courses are offered not-very-regularly so it would take almost the rest of the year to finish. In that time, I wonder how relevant the material will still be.

Still, I love learning this kind of stuff, and a LinkedIn profile chock full of data-related certificates probably makes me a better candidate than just one.

#datascience #coursera #greatstudent #JohnsHopkins

With Distinction!

Johns Hopkins, or possibly Coursera, finally released the rest of the certificates for the five courses I took in November. If this computer had the happy cat and sunglasses emojis, you'd be reading each sixty times in a row right now.

Check out those scores. Not bad at all. Missed points in Reproducible Research because I forgot to peer review enough peers. Got a perfect score on the project from my peers though.

Missed points in Getting and Cleaning Data because we were supposed to submit a codebook with our data set, but no one was really sure what that meant. I'm not sure if I did it wrong or the graders thought I did. Not a big hit though.

It'll be another month before I can post 4 100% course records from my current round of classes. In the meantime, I'm going to have to find some projects to keep my skills sharp.

#coursera #JohnsHopkins #datascience #withdistinction

Practical Machine Learning Weeks 1 - 4

This is one of the hottest areas in data science, meaning it seems to get the most press. Predicting click-throughs, recommendations, and hospitalizations have all been big money contests.

There are several standard issues when attempting any of these. For example, it's very common for a model to work very well on a training set, but be 'over-fitted' so that it doesn't work as well on the testing set. If you're unfamiliar, part of your data set is used to build the model and part is used once to verify it. That gives some assurance that it'll work in production.
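The course work was done in R, but the holdout idea is language-independent. Here is a minimal Python sketch of the split itself, with a made-up function name and a 75/25 split chosen only for illustration:

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and hold out the last `test_fraction` of it."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(20))
print(len(train), len(test))  # 15 5
```

The model is fit on `train` only, and `test` is scored once at the end; reusing the test rows to tune the model would make the final accuracy estimate too optimistic.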

There was a lot of information in this course about the standard issues and some standard tools. I had enough knowledge to understand what the videos were talking about, but could have used more examples of when to use PCA/SVD or why I'd use a Box-Cox transform over other preprocessing methods. Still, they're in my realm of knowledge now so I can dig into them when needed.

In week 3 we got to the good stuff: Predicting with trees and extending that to random forests, bagging (bootstrap aggregating), boosting, and some info on model-based prediction.

Most contest winners apparently use some combination of random forests and boosting, but then combine perhaps dozens of models to make something incrementally better. The interpretability is totally lost after that, and it can be very difficult to implement.

Netflix, for example, awarded $1M to a team that improved its prediction system by 10%, but then never used it and instead went another direction entirely.

The project was an interesting assignment using a method of our choice to predict, from wearable sensor data, whether a person was doing bicep curls correctly or in one of four 'wrong' ways.

I achieved something like 99.9% accuracy on the training set using a random forest with 25-fold bootstrap sampling and 75% variable selection for each tree.

The test set is only 20 lines long. You have to use your model to predict what type of bicep curl is being done on each line and upload a text file with the prediction for each one. You only have two attempts for each line, and each line is worth 1% of your final grade, so I was ecstatic when file upload after file upload rewarded me with "Correct!"

Note: Creating and uploading 20 text files was tedious, but the instructors gave us code to output the files. Nothing I couldn't have done myself, but sure considerate of them.

The final project can be seen here.

With that done and 100% on each of the quizzes, I have 80% overall in the course, so I've already passed. If I get at least half the points on my submission (and do the peer review, of course), I'll get yet another pass with distinction. =)

#datascience #JohnsHopkins #coursera

Regression Models Weeks 1 - 4

I've been doing about a week of coursework each day since I started these four courses on December 1st. I finished all the work yesterday, the 16th, so a bit faster than projected.

Regression Models, like Statistical Inference, was taught in a very math-heavy way that I wasn't too into.

I came across this article in the Harvard Business Review that seemed relevant:

https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/

It's about the difference between people who do data science for machines and people who do it for humans. I'm definitely in the latter camp, though I wouldn't mind the occasional adventure into the former.

The difference explains pretty tidily why I'm not really getting into the mathematical side of things. I'm much more interested in knowing how to do things in R, how to interpret the output, and developing wisdom enough to know when to use which method.

Plus, I feel like I can always look up the details later - there are textbooks on this stuff - so I'd rather get the ideas down solid.

Anyway, this course starts off with linear regression and all the related stuff like explaining least-squares, discussing residuals, regression inference, and prediction intervals.
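As a small illustration of least squares and residuals (in Python rather than the R used in the course), the slope and intercept of a one-variable fit have a closed form. The function name and data below are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus fitted value.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return slope, intercept, residuals

# Points that lie exactly on y = 2x + 1, so every residual is zero.
slope, intercept, residuals = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

With an intercept in the model, the residuals always sum to zero (up to rounding), which is a quick sanity check on any fit.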

It moves on to multivariable regression, use of dummy variables, inclusion of interaction terms, and finishes with some generalized linear models like logistic and Poisson regression. Good stuff. I liked those last weeks a lot.

The project really pushed me to be sure I understood (and could explain) a model using both a continuous regressor and a dummy variable, so it was a cool thing to work on.

You can read my final project here, if you're interested.

I have to do the peer grading when that opens on the 22nd. I'll be in Colorado visiting family, but I've already packed my spreadsheet. =)

#datascience #JohnsHopkins #coursera

Statistical Inference - Week 4

This week was on the Power of a test, Multiple Comparisons, and Resampling. Turns out ANOVA is in the Regression Models course.

Power is fairly straightforward to understand, but difficult to calculate in many situations. Fortunately, R has a built-in function for it.
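In R that would be something like `power.t.test`. For intuition, here is a Python sketch of the simplest case only, a one-sided z-test with a known standard deviation, where the power has a closed form. It assumes Python 3.8+ for `statistics.NormalDist`; the function name and numbers are illustrative.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test for a mean shift of `effect`, known `sigma`."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)      # rejection cutoff under the null
    shift = effect * n ** 0.5 / sigma    # how far the alternative moves the statistic
    return 1 - z.cdf(critical - shift)

print(z_test_power(effect=0.0, sigma=4.0, n=16))   # equals alpha when there is no effect
print(z_test_power(effect=2.0, sigma=4.0, n=16))
print(z_test_power(effect=2.0, sigma=4.0, n=64))   # larger n gives more power
```

The t-test version replaces the normal distribution with a noncentral t distribution, which is why a library function is the practical choice there.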

Multiple Comparisons refers to the fact that when you do a thousand statistical tests, you're going to get a lot of false positives (about 50 at a 5% significance level, even if every null hypothesis is true). So there are various methods to reduce the false positives, but of course they all will produce some false negatives as well. These are also built into R, so again it's just a matter of understanding what's going on. That's the great thing about jobs in computing: they reward understanding, explanations, and looking at things in new ways, not just crunching numbers and using formulas - anyone can do that.
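The post doesn't name the specific corrections covered, so this sketch assumes two common ones: Bonferroni, which controls the chance of any false positive, and Benjamini-Hochberg, which controls the false discovery rate. The function names and p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k = largest rank with p_(k) <= k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

p = [0.001, 0.012, 0.025, 0.045, 0.2]
print(bonferroni(p))          # [True, False, False, False, False]
print(benjamini_hochberg(p))  # [True, True, True, False, False]
```

The example shows the trade-off described above: the stricter Bonferroni rule gives up two discoveries that Benjamini-Hochberg keeps.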

Resampling is on bootstrapping, which is a way of using an empirical distribution to get information about the underlying distribution. It's nonparametric so you don't need the underlying distribution to get, for example, the median of the distribution.
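A sketch of the bootstrap for a median, assuming the simple percentile method: resample the data with replacement many times, take the median each time, and read an interval off the sorted results. It needs Python 3.6+ for `random.choices`; the sample and function name are invented for illustration.

```python
import random
from statistics import median

def bootstrap_median_interval(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the median of `sample`."""
    rng = random.Random(seed)
    medians = sorted(
        median(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo_index = int((1 - level) / 2 * n_resamples)
    hi_index = int((1 + level) / 2 * n_resamples) - 1
    return medians[lo_index], medians[hi_index]

data = [2, 3, 5, 7, 11, 13, 17, 19, 23]
low, high = bootstrap_median_interval(data)
print(low, high)
```

No distributional assumption is made anywhere; the only input is the observed sample, which stands in for the unknown underlying distribution.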

The quiz was pretty easy (only a couple questions on this material), but if anyone ever asks, I definitely did NOT calculate the standard error of p-hat when I needed the standard deviation. Don't even joke that I did that.

I woke up really early today, so I finished the course project too. Actually, it was two projects submitted in the same course page.

One is demonstrating the Central Limit Theorem: http://vincentrupp.com/SI_CLT.html

The other is defining, performing, and discussing a hypothesis test on a sample data set: http://vincentrupp.com/SI_HT.html
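For readers who want to see the Central Limit Theorem in action without opening the project, here is an independent Python sketch (the linked project itself was written in R). It assumes exponential draws with rate 0.2 and samples of size 40, both chosen only for illustration.

```python
import random
from statistics import mean, stdev

def sample_means(n, n_means=2000, rate=0.2, seed=0):
    """Means of `n_means` samples, each of `n` exponential draws."""
    rng = random.Random(seed)
    return [mean(rng.expovariate(rate) for _ in range(n)) for _ in range(n_means)]

means = sample_means(40)
print(mean(means))    # theory: 1 / rate = 5
print(stdev(means))   # theory: (1 / rate) / sqrt(40), roughly 0.79
```

Even though individual exponential draws are heavily skewed, a histogram of `means` looks close to a bell curve centered near 5.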

Other than peer review of projects, which opens in 18 days, this class is done 24 days early. [sunglasses emoji]

#datascience #coursera #JohnsHopkins

Statistical Inference Weeks 1 - 3

When this sequence of four courses started, I received a couple emails for each course. Welcome To, Course Survey, Resources, stuff like that.

Statistical Inference also came with an email about surviving the course. It's apparently one of the hardest for people.

Fortunately, I have an MS in Mathematics and have taught inferential statistics at Portland Community College.

Because of that, I discovered I didn't need to watch the first two weeks of videos. I did the quizzes without referencing anything. Basic probability, p-values, hypothesis tests, that's just really solid for me.

For Week 3, I had to refresh my memory about some of the t-test formulas (pairs, pooled variance, unpooled variance), but it all came back quickly. I watched some of the videos and saw one reason people have such a hard time with this course.
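As a refresher on those three formulas, here is a sketch of the paired, pooled-variance, and unpooled (Welch) t statistics using Python's `statistics` module. It computes the test statistics only, not p-values, and the data are made up.

```python
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances; df = len(a) + len(b) - 2."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

def welch_t(a, b):
    """Two-sample t statistic without the equal-variance assumption."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

def paired_t(a, b):
    """One-sample t statistic on the pairwise differences; df = len(a) - 1."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (variance(d) / len(d)) ** 0.5

before = [5, 7, 9, 11]
after = [1, 2, 4, 5]
print(pooled_t(before, after), welch_t(before, after), paired_t(before, after))
```

With equal group sizes the pooled and Welch statistics coincide, and only their degrees of freedom differ; the paired statistic is much larger here because pairing removes the variation between subjects.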

It's a different instructor for this one, and the videos and accompanying slides have high-quality content, but there's nothing like teaching at community college to really show someone where the big jumps are between and within topics.

So while I can follow what he's saying, to someone new to this topic it probably seems like he's skipping around a lot and not making a lot of solid "This should go on my reference sheet"-type points.

I think of this sort of thing as separating the content from the information around the content. For example, the distributive property states that:

a(b + c) = ab + ac

That can stand by itself and should be written down if you're learning it. Then it can be illustrated by a couple examples and words. Organizing content around these specific points helps to learn. However, instructors don't want to be simply textbooks, so the conversational nature of a lecture should be preserved as well. It's a tough business.

There's only one project which I'll start this weekend, and then probably have to watch the rest of the videos in depth since the details of ANOVA are a bit tricky. Still, it's shaping up to be the quickest course yet.

That's good news because I'm leaving on December 20th to see family in three states for ten days. So I kinda have to be done with all my coursework before then. =/

#datascience #JohnsHopkins #coursera

November Recap

It turned out that there wasn't much to do in Week 4 of the classes. The Data Scientist's Toolbox was only three weeks, and the rest had me finish projects at the end of week 3 for peer review during week 4. The one exception had a quiz to complete.

The peer review is a little strange because there's very little room for interpretation. That makes sense if you want to reduce variability in grading: simply ask did the person do this specific task (like linking to a Github repository) or not. The quality of submissions varies wildly though, so I ended up giving several people full marks because yes, they did it, but I wasn't convinced they really had mastered the material. However, there were also some truly wonderful submissions across the several courses. It's always fascinating to see the different ways people tackle the same task. Some solutions or bits of code I probably should have written down so I could refer to them later, but I'm not sure they would want me doing that.

My final project for Reproducible Research (course 5 in the sequence) was a big hit, with one person even saying:

The best work I've graded! And that's not only about mood. I really like your approach and comments. Good to see real understanding of data and how to deal with it.

Ah, now that's a good feeling. (The reference to mood is based on my light-hearted commentary, which you can see at http://vincentrupp.com/PA2.html.)

Unfortunately for my pride, that was the one assignment in five courses I neglected to fully peer-review. Each requires we look at four peers, and apparently I did not do all four. Therefore, I incurred a 20% penalty to my project grade, which is more points than I lost for every other assignment combined. =(

Well, nothing's slipping by me again, you can bet on that. For the next/last four courses (which started Dec 1st), I made a spreadsheet of every assignment with due date and check box for when I finished it.

The courses ended Sunday, and they said our official scores (with certificates, or certificates with distinction for those of us getting 90%+) would be posted to our Course Records page in a few days, so I'm checking the page like a high school senior waiting for his Harvard acceptance letter. Can't wait to post a screenshot! =D

#datascience #JohnsHopkins #coursera

Getting and Cleaning Data - Week 3

It's a bizarre situation I regularly find myself in where I love learning but cannot stand education.

A couple years back, I signed up for two online classes through PCC where I work (free tuition! =D, high fees >=(). One was on Photoshop for the Web, the other on Dreamweaver.

Each week in the Dreamweaver class, we did assignments out of the book. These were straight-up "click this, then that. Type this into this field, and press this key." There was no thought involved other than making sure we didn't lose track of which of the 25+ steps we were on. This was not, in any way, teaching me how to make web sites, but I got 100% on every assignment.

Each week for the Photoshop class, we had to do work in Photoshop (naturally). I liked that part well enough, except sometimes we had to do the same thing like 25 times. More time was spent submitting the assignment for grading. For one, we had to make websites to show each week's photos. Because the Dreamweaver class didn't teach me anything, that took a very long time. Then each file had to be named, sized, and formatted exactly as the teacher wanted it, we had to fill out a checklist of all the things we were going to be graded on, post a blog entry on the class blog, and comment on two other people's blog posts. If I spent 4 hours on this class, only about 75 minutes was actually learning. The rest was just jumping through hoops to get a grade.

I ended up dropping both of those classes halfway through and subscribing to lynda.com where I could learn and follow along (for a lot less money) and not waste my time with lazy-ass instructors or jumping through endless hoops.

I find myself in the same situation with these Data Science classes. The videos are mostly fine (for lecture videos), but the assignments are mostly garbage.

The one project for Getting and Cleaning Data involves merging a bunch of data files. That's easy - I actually did a lot of the steps already for the Reproducible Research Week 2 project. But the instructions are totally vague.

You should create one R script called run_analysis.R that does the following.

1. Merges the training and the test sets to create one data set.

2. Extracts only the measurements on the mean and standard deviation for each measurement.

3. Uses descriptive activity names to name the activities in the data set

4. Appropriately labels the data set with descriptive variable names.

5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Good luck!

You start off by downloading this zip file with several folders inside that and more folders inside those. The directions don't tell you what files to use, what format they're in, how many records you should have, or anything. You get nothing.

There is some text in a readme file, and a little bit of information on the website where you get the zip file, but none of it actually tells you what you're looking at.

It turns out that there are 561 variables, which you only know if you know that you're expecting 2947 rows. I got that from a student's post on the discussion forum. One of the community TA's put together an FAQ that has some useful information but the answer to "Where do I start?" is basically "Figure it out yourself."

Of those 561 variables, you have to "extract only the measurements on the mean and standard deviation for each measurement" which, as it turns out, means "Just grab all the columns with 'mean' and 'std' in their names."
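The assignment itself asks for an R script, but the column selection described above comes down to a substring filter. A Python sketch of that step, where the feature names are illustrative stand-ins rather than the real list from the data set:

```python
def select_mean_std(feature_names):
    """Return (index, name) pairs for features mentioning a mean or std."""
    return [(i, name) for i, name in enumerate(feature_names)
            if "mean" in name.lower() or "std" in name.lower()]

# Hypothetical feature names in the style of a sensor data set.
features = [
    "tBodyAcc-mean()-X",
    "tBodyAcc-std()-X",
    "tBodyAcc-mad()-X",
    "tBodyAcc-max()-X",
]
print(select_mean_std(features))
# [(0, 'tBodyAcc-mean()-X'), (1, 'tBodyAcc-std()-X')]
```

Keeping the column indexes alongside the names makes it easy to pull the matching columns out of the measurement files, which carry no header row of their own in this kind of layout.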

Once again, it's an assignment where you don't have to understand what you're doing or what the data represents; you just have to jump through the hoops.

One defense of this assignment would be "Well, this is real-world data and you have to learn how to deal with that."

That response is as garbage as this entire assignment. I've worked with real-world data dozens of times, and in every case I've had a goal to work toward, known what the data represented (or had someone I could ask), and had a reason to give a crap.

If you think "real-world" is "doing bullshit someone tells you to do just because you're told to do it", then you should be looking at jobs in data entry or pushing buttons, not data science.

And now it's back to wasting my day on this garbage just so I can get some meaningless grade.

#datascience #johnshopkins #coursera
