Faking-it-as-a-Data-Scientist

Faking it as a Data Scientist

“Data Scientist is the sexiest job of the 21st century.” - HBR

The intent of this post is to give you enough talking points about the area to fake your way through 3 dates with a non-data scientist.

Do committed people also need to feel or appear sexy on a date? In which case, they may find this useful too.

This post is also on Github so if your date peeks then you can pass it off as work. Bonus, tell him / her that it was a merge request and see if you get any reaction. ( Merge requests are a term from the software world so if your date is into software, even if he / she isn’t into data science, then maybe avoid it. In fact, if your date is well versed with software, then you should probably avoid this altogether and stick to the traditional approaches of being rich, handsome, willing, etc. )

This was originally for a short talk at work where I masquerade as a data scientist, in the hope that the bubble around data science bursts before I actually have to show results. But once I’d compiled it, it seemed like a waste of effort to not reuse it for a blog post. The blog is the only thing that I might be able to make money off of after the bubble.

In light of the hullabaloo around Cambridge Analytica ( data security is a legitimate concern, but their algorithms’ impact on the US elections, as reported in the media, is likely vastly exaggerated, ) this is also a good time to publish this post.

Approach

Touch upon a bunch of things related to the machine learning ( ML ) and artificial intelligence ( AI ) world.
Find out enough so that you can Google for more.

Some everyday interactions with ML

Facebook feeds, things to note:

Formulating, data cleaning that went into calculating the interest in posts
Data gathering, starting from the Tennessee group to whatever complex beast it is now

Ads / recommendations ( Netflix ), things to note:

Formulation / problem of measuring results. Nobody can confidently tell the impact of internet advertising. Do internet ads actually get new customers or would the same number of buyers have bought the same product even if they hadn’t seen the ad?

Uber pricing ( Couldn’t find a link. But you get it, right? )

As an exercise, think of the data that must be relevant? Day of week, prices previously accepted / rejected, origin, destination, trip distance, what else?

Trading

Author bias. I have low interest in finance. Easy to imagine what it would be.
Look it up, there will be plenty of material.

Image stabilisation ( on Google Pixel ) Tl;DR - the bike video towards the bottom of the page

Some cool ML results

Mario playing

Maps a simplified screen layout to the action keys
Evolves the mapping by trying out different mappings, seeing which ones are more successful, making random changes, additionsm, merges of the more succesful mappings to generate new mappings, iterating until the mapping can clear the level
Likely that the final mapping will work for some other Mario levels too, but not necessary all.

AI art - Paintings, page 5, and Music

Is this true creativity or is it just analysing patterns in existing examples of human creativity and rehashing? Is that how human creativity also works? What is human creativity anyway?
Lots of chatbots, analysing tweets, etc. happening nowadays but language processing isn’t the the same as comprehension. Detecting sarcasm is still a very tricky problem.

Autonomous driving

Maybe the first time that ML will have real time, high impact interactions in large volumes with the chaotic physical world.
Sudden jump from the more controlled physical spaces, just controlling things on a computer, or making only recommendations to a person who eventually decides how to act.

Designing parts

AlphaGo

Much harder than Chess.
As an example of the efforts towards general AI, look up AlphaZero

The Foundation Series - sci-fi which predicts where we might be going?

Some ML fails

Bad training data - Microsoft’s racist twitter bot

Most algorithms are extremely sensitive to training data and data scientists go to great lengths to data-proof their algorithms.

Fragility - Fooling image recognition with a single pixel

Unpleasant results - Racism in the justice system

Algorithmic decisions, especially when taken by black box sort of systems, are harder to accept and correct.

Edge cases - Alexa takes instructions from the TV

Very often, models fail because of behaviour that has not been considered because it wasn’t present in the training data or nobody imagined that sort of use case.

Wrong application - Linking criminals to facial features

Everything is not an ML problem. Some problems are probably better avoided.

Some ML debates and discussions

ML good vs. evil debate:

What if AI becomes some evil superpower?

Human / author biases in algorithms:

Google photos was labelling black people as gorillas

Trolley dilemma:

This has become famous in the the context of what an autonomous vehicle should do in a similar situation.

ML vs minimum wage:

Automation was already taking away jobs, now ML could potentially take away more. What do we need to do to ensure that vast numbers of people aren’t left unemployed while a few have access to robots, ML, etc. and control most things.
Bill Gates recently proposed ( or sided ) with taxes for robots.

Transparency of algorithms:

Europe is trying to impose accountability on algorithms such that anybody who has been the recipient of an algorithmic decision can demand to know what led to the decision.

What is ML?

The main utility of computer programs is that they do calculations faster than it would take to do by hand. Solving problems in an earlier era, would involve us testing different configurations of the same algorithm, or different algorithms, by modifying or writing new programs for each of them. ML offers ways to incorporate this rework of the code into the program itself so that the program can find a solution to the problem with no human intervention. The human, as of 2018, still writes the program but then the program runs and finds an answer by itself.

The reality is a little fuzzier than that. Even in the early days, some smart people would have automated some of their guessing and today’s ML, programs still need a lot of hand holding in various ways. It’s hard to pick a well-defined point in time which can mark the start of ML.

In fact, if you look past the hype, ML is just a shiny new name for techniques used in age old fields like statistics, optimisation, forecasting, etc. What has definitely changed though is that advancements in those techniques and improved computing ability allows us to take on tougher problems. It’s like moving from a spade to a bulldozer. We dug before bulldozers also, just that we got much better at it.

Most of ML is not sexy

Most of one’s time goes into

Formulation: in some cases it isn’t easy to measure something directly so how do you account for it or incorporate it into your models? eg. without a human reading it, it’s hard to classify a tweet as having a positive, negative, or neutralsentiment. Instead, people use a crowd sourced dataset of sentiments of words to assign sentiments to each word in the tweet and then take the majority word sentiment across the tweet as the sentiment of the tweet. This will fail sometimes but it works often enough.
Data gathering: data may not be easily available for the problem we’re trying to solve. Or maybe only partial data is available.
Data cleaning: the data may be missing some information, may have invalid values, etc.
Incorporating domain knowledge: Just having data isn’t enough. Sometimes the user may be able to add to the data with their awareness and understanding of the world. Take the tweet sentiment problem again, a tweet which goes, “This is great! sarcasm” would get assigned a positive sentiment by our majority word sentiment algorithm unless we add an exception for the word sarcasm, which takes the opposite of the majority sentiment.
Making the data more amenable to the algorithms: Eg. algorithms are usually built to play with numerical data so when presented with categorical data, we work within a lesser choice of algorithms or map categorical data to numerical data somehow. eg. we could convert a variable which has values A, B, or C, to twovariables, one carrying a True or False for A, and one carrying a T/F for B. False for both would imply C.
Debugging: code almost never works correctly the first time

Very little time is spent chilling, waiting for a machine to learn.

A lot of time is then spent on iterations, when we look at the results, try to figure out any tweaks which might improve the results or some other aspect of performance, and try them out until we’re happy ( or at least not disappointed ) with it.

Algorithms

Can be a black box ( impossible to interpret ), or hard to interpret, or easy to interpret
Can have simple parameters or very complex paramters

A frequently encountered trade off is a simple, transparent technique’s acceptable results vs. complex, black box technique’s great results. The former will be easier to debug, tweak, etc. but the latter will give better results. B2C will often be latter, B2B former. Because B2B clients can demand explanations.

We haven’t yet established a standard toolbox or a standard methodology which is why it is still sort of an art. There are way too many alogrithms and more and more keep getting discovered / invented all the time. Most of them are very good at specific problems but not good at all problems, i.e. we are still far from General AI ( Wiki on artificial general intelligence, ctrl + f for AlphaZero in the cool ML results section in this post )

BUT. There are already automated frameworks in the market which run lots and lots of algos with a variety of configurations and suggest what’s best. Which means even ML jobs could get automated.

Simplifying the ML problem space

Two types of splits:

Clustering, classification, regression
Supervised vs. unsupervised

There is more to this but we’ll limit ourselves in interest of time.

Split 1

Clustering:

identifying similarities / groupings of data
eg. Netflix clustering done, I think, by people and not an algo.

Classification:

Given groups of data, finding out what makes something part of a particular group
eg. If Netflix categorisation was done by people, then figuring out what factors decide a movie’s assignment to specific Netflix category.

Regression:

Sensitivity of an output to various inputs
eg. Uber pricing vs DoW, distance, origin, destination, etc.

Clustering and classification are related. Having done the clustering exercise and found the groupings, some classification algorithm would be required to back-infer how the clusters are decided ( which may or may not have the same underlying logic as the clustering algorithm, eg. when a really heavy algorithm was used to find out clusters but the user wants to run something more lightweight on a day to day basis when assigning clusters to new data ) The classification algorithm can then be used to assign new data points to the clusters.

Regression and classification are related Both of them eventually predict an output when fed with an input except in classification, the intent is primarily to predict the output, whereas in regression, the intent is to predict the output by way of explicitly evaluating the sensitivites to the input. Classification algorithms may not communicate sensitivities as transparently as regression.

Split 2

Supervised

eg. classification and regression
It’s where you’re relating some output with some input

Unsupervised

eg. clustering
Where there is no clear right answer. Hard to validate.
“All models are wrong, some are useful”

Recent buzz

Primarily around deep learning and reinforcement learning

Deep learning:

Really big, complex neural networks
Improved hardware helps calibrate such big neural networks:

Neural networks:

Have been around for three or four decades
Simplified view - task is to come up with matrices, M1, M2, … Mn, such that ( ( x * M1 ) * M2 ) … * Mn ) = Y, where X is input and Y is output
Different problem, different sort of neural network - some examples
Very hard to interpret the model

Reinforcement learning:

It is used to work on problems where it’s very hard to clearly link inputs / actions to eventual outputs
For instance, there may be a really large sequence of inputs before which the output is attained and it’s hard to ascribe the impact each move had individually to the ouput. Eg. individual moves in a game of chess leading to a result.
It is hypothesised that humans, animals, etc. learn like this. The real world is very noisy and it’s hard to link each action to the outcome.
It is soooort of like supervised learning but not really

Faking it, hands-on

In case you get asked to demo your abilities.

Setup instructions:

Install R
Install Rstudio
Open Rstudio while connected to the internet and run the command below: install.packages(c('ggplot2','gridExtra'))

Running instructions:

Open Rstudio
Open a new file
Paste this code into that file
Press ctrl + enter, one at a time, to run each line / block of code