Faking it as a Data Scientist

“Data Scientist is the sexiest job of the 21st century.” - HBR

The intent of this post is to give you enough talking points about the area to fake your way through 3 dates with a non-data scientist.

If committed people also need to feel or appear sexy on a date, they may find this useful too.

This post is also on GitHub so if your date peeks then you can pass it off as work. Bonus, tell him / her that it was a merge request and see if you get any reaction. ( “Merge request” is a term from the software world so if your date is into software, even if he / she isn’t into data science, then maybe avoid it. In fact, if your date is well versed with software, then you should probably avoid this altogether and stick to the traditional approaches of being rich, handsome, willing, etc. )

This was originally for a short talk at work where I masquerade as a data scientist, in the hope that the bubble around data science bursts before I actually have to show results. But once I’d compiled it, it seemed like a waste of effort to not reuse it for a blog post. The blog is the only thing that I might be able to make money off of after the bubble.

In light of the hullabaloo around Cambridge Analytica ( data security is a legitimate concern, but their algorithms’ impact on the US elections, as reported in the media, is likely vastly exaggerated ), this is also a good time to publish this post.

Approach

Some everyday interactions with ML

Facebook feeds, things to note:

Ads / recommendations ( Netflix ), things to note:

Uber pricing ( Couldn’t find a link. But you get it, right? )

Trading

Image stabilisation ( on Google Pixel ). TL;DR: the bike video towards the bottom of the page

Some cool ML results

Mario playing

AI art - Paintings, page 5, and Music

Autonomous driving

Designing parts

AlphaGo

The Foundation Series - sci-fi which predicts where we might be going?

Some ML fails

Bad training data - Microsoft’s racist twitter bot

Fragility - Fooling image recognition with a single pixel

Unpleasant results - Racism in the justice system

Edge cases - Alexa takes instructions from the TV

Wrong application - Linking criminals to facial features

Some ML debates and discussions

ML good vs. evil debate:

Human / author biases in algorithms:

Trolley dilemma:

ML vs minimum wage:

Transparency of algorithms:

What is ML?

The main utility of computer programs is that they do calculations faster than we could by hand. In an earlier era, solving a problem meant testing different configurations of the same algorithm, or different algorithms altogether, by modifying or writing a new program for each of them. ML offers ways to fold this rework into the program itself, so that the program can find a solution to the problem with no human intervention. The human, as of 2018, still writes the program, but then the program runs and finds an answer by itself.

The reality is a little fuzzier than that. Even in the early days, some smart people would have automated some of their guessing, and today’s ML programs still need a lot of hand holding in various ways. It’s hard to pick a well-defined point in time which can mark the start of ML.

In fact, if you look past the hype, ML is just a shiny new name for techniques used in age-old fields like statistics, optimisation, forecasting, etc. What has definitely changed though is that advancements in those techniques and improved computing ability allow us to take on tougher problems. It’s like moving from a spade to a bulldozer. We dug before bulldozers too, we have just gotten much better at it.
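To make the "program finds the answer by itself" idea a little more concrete, here is a minimal sketch in base R. The data is made up for illustration, and `squared_error` is a name invented for this example. It contrasts a human trying a few guesses by hand with letting the program search for the best guess on its own. It is also a reminder that the bulldozer here is plain old optimisation.

```r
# Toy data: y is roughly 3 * x plus noise ( made-up numbers, purely illustrative )
set.seed(42)
x <- 1:20
y <- 3 * x + rnorm(20, sd = 2)

# How wrong is a given guess for the slope?
squared_error <- function(slope) sum((y - slope * x)^2)

# The old way: a human tries a few guesses and compares the errors
manual_guesses <- c(1, 2, 4)
sapply(manual_guesses, squared_error)

# The ML-ish way: the program searches the range of slopes by itself
best <- optimize(squared_error, interval = c(0, 10))
best$minimum   # should land close to the true slope of 3
```

The human still had to decide what "wrong" means and where to search, which is the hand holding mentioned above.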

Most of ML is not sexy

Most of one’s time goes into

Very little time is spent chilling, waiting for a machine to learn.

A lot of time is then spent on iterations, when we look at the results, try to figure out any tweaks which might improve the results or some other aspect of performance, and try them out until we’re happy ( or at least not disappointed ) with them.

Algorithms

A frequently encountered trade-off is a simple, transparent technique with acceptable results vs. a complex, black-box technique with great results. The former is easier to debug, tweak, etc. but the latter gives better results. B2C will often go with the latter and B2B with the former, because B2B clients can demand explanations.

We haven’t yet established a standard toolbox or a standard methodology, which is why it is still sort of an art. There are way too many algorithms, and more and more keep getting discovered / invented all the time. Most of them are very good at specific problems but not good at all problems, i.e. we are still far from General AI ( Wiki on artificial general intelligence, ctrl + f for AlphaZero in the cool ML results section in this post )

BUT. There are already automated frameworks in the market which run lots and lots of algos with a variety of configurations and suggest what’s best. Which means even ML jobs could get automated.
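A toy version of that idea, as a rough sketch only: real automated frameworks are far more elaborate, and nothing below is any particular product's API. The data is made up, and the "configurations" are just polynomial degrees 1 to 5. The loop fits each one on a training slice, scores it on held-out data, and suggests the one with the lowest error.

```r
# Made-up data: the true relationship is quadratic
set.seed(1)
x <- runif(200, -2, 2)
y <- x^2 + rnorm(200, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Keep some data aside to judge each candidate fairly
train   <- dat[1:150, ]
holdout <- dat[151:200, ]

# Candidate configurations: polynomial fits of degree 1 to 5
degrees <- 1:5
holdout_error <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  mean((holdout$y - predict(fit, newdata = holdout))^2)
})

# Report the scores and suggest the best configuration
data.frame(degree = degrees, holdout_error = round(holdout_error, 3))
best_degree <- degrees[which.min(holdout_error)]
best_degree    # the straight line ( degree 1 ) should lose badly
```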

Simplifying the ML problem space

Two types of splits:

There is more to this but we’ll limit ourselves in the interest of time.

Split 1

Clustering:

Classification:

Regression:

Clustering and classification are related. Having done the clustering exercise and found the groupings, some classification algorithm would be required to back-infer how the clusters are decided ( which may or may not have the same underlying logic as the clustering algorithm, e.g. when a really heavy algorithm was used to find the clusters but the user wants to run something more lightweight on a day-to-day basis when assigning clusters to new data ). The classification algorithm can then be used to assign new data points to the clusters.
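The two-step flow can be sketched in base R as below. The points are made up, `assign_cluster` is a helper invented for this example, and the lightweight classifier is simply "pick the closest cluster centre". That rule happens to share its logic with k-means, so treat it as one possible choice rather than the only one.

```r
# Two made-up blobs of points, 50 points each
set.seed(7)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# Step 1, clustering: find the groupings without being told what they are
km <- kmeans(pts, centers = 2, nstart = 10)

# Step 2, classification: a lightweight rule to place new points,
# here "pick the closest cluster centre"
assign_cluster <- function(new_point, centres) {
  dists <- apply(centres, 1, function(ctr) sqrt(sum((new_point - ctr)^2)))
  unname(which.min(dists))
}

# New data points get a cluster without rerunning the clustering
near_first_blob  <- assign_cluster(c(0.2, -0.1), km$centers)
near_second_blob <- assign_cluster(c(4.8, 5.3), km$centers)
c(near_first_blob, near_second_blob)
```

The cluster numbers themselves are arbitrary. What matters is that the two new points end up in different clusters, each alongside the blob it sits next to.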

Regression and classification are related. Both of them eventually predict an output when fed with an input. In classification, the intent is primarily to predict the output, whereas in regression, the intent is to predict the output by way of explicitly evaluating the sensitivities to the input. Classification algorithms may not communicate sensitivities as transparently as regression.
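A small sketch of the difference, using the `mtcars` dataset that ships with R. The choice of dataset and columns is mine, and logistic regression is only a convenient stand-in for a classifier; many real classifiers are far more opaque than this one. The regression hands you a number plus a readable sensitivity, while the classifier hands you a label.

```r
# mtcars ships with R: mpg = miles per gallon, wt = weight in 1000 lbs,
# am = transmission ( 0 = automatic, 1 = manual )

# Regression: predict a number, and read off the sensitivity to the input
reg <- lm(mpg ~ wt, data = mtcars)
coef(reg)["wt"]                              # change in mpg per extra 1000 lbs
predict(reg, newdata = data.frame(wt = 3))   # predicted mpg for a 3000 lb car

# Classification: predict a label
# ( logistic regression is just a stand-in classifier here )
clf <- glm(am ~ wt, data = mtcars, family = binomial)
predict_label <- function(wt) {
  p_manual <- predict(clf, newdata = data.frame(wt = wt), type = "response")
  unname(ifelse(p_manual > 0.5, "manual", "automatic"))
}
predict_label(c(2, 5))   # a light car and a heavy car
```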

Split 2

Supervised

Unsupervised

Recent buzz

Primarily around deep learning and reinforcement learning

Deep learning:

Neural networks:

Reinforcement learning:

Faking it, hands-on

In case you get asked to demo your abilities.

Setup instructions:

  1. Install R
  2. Install RStudio
  3. Open RStudio while connected to the internet and run the command below:
     install.packages(c('ggplot2','gridExtra'))

Running instructions:

  1. Open RStudio
  2. Open a new file
  3. Paste this code into that file
  4. Press Ctrl + Enter to run each line / block of code, one at a time