Ten Minutes with a World Class Data Scientist

I recently got access to a thought leader who is in such high demand that he is in a different country every week. Here's how it went :

1. You've done a lot of online courses. If you had to pick just five to recommend to someone, which would they be? That is, the ones that are, in your opinion, the best return on investment.

i) Question Formulation Technique (QFT) from Right Thinking Institute in Boston - how to ask the right question.

ii) Cambridge Advanced Leadership Programme

iii) MIT : Inquisitive Data Science (I couldn't find this one. Do I have the name right?)

iv) Coursera : Philosophy of Science

v) Google's Digital Marketing Certificate (how to use social media, etc)

2. You mentioned in the .. interview that there are 500k datasets available on Americans. Could you mention a few that are worth getting comfortable analyzing in Python?

i) The US Population Census is a gold mine of data

ii) Bureau of Labor Statistics datasets

3. Do you have any side projects that you are not able to devote enough time to that I could maybe help out with so I gain experience?

A : I am interested in the future of work - what kinds of jobs will grow in the future, how trends are shifting in the pre/post-COVID landscape, and how the job market has changed.

4. Do you know anyone else who is trying to ramp up to get to your level that you would recommend I connect with?

A : Abhishek Thakur is a 4x Kaggle GM - big in ML. (I obviously wasn't clear enough :)

5. What are some mistakes you made or you see others making on their learning journey that cost time and effort?

A : My biggest regret was not letting go of the stick-it-out, do-not-give-up mentality, which caused me to stay in toxic work environments much longer than I otherwise would have.

6. You mentioned you do still spend time doing actual data analysis. How much time do you spend on data cleaning?

A : 80% (comment : it's interesting that your clients don't clean the data ahead of time, which would let them get more value from your time)

7. What are some utilities you wish you had?

A : It would be good to have utilities that :

i) check whether a dataset has been fabricated - for example, the data follows a mathematical pattern instead of being truly random, or it is random but its distribution does not match what the underlying phenomenon typically produces.

ii) perform fact checking - cross-validate key metrics derived from the data against other available sources. For example, if the population data yields an average age of 80, flag it after checking against established data sources. Basically, establish the trustworthiness of the data.
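
Neither utility was specified in detail during the conversation, so what follows is only a rough sketch of what such helpers might look like. The fabrication check uses a swapped-in technique, a chi-square test on last digits, which assumes that the final digit of genuinely measured integer values is roughly uniform; that assumption does not hold for every kind of data. The function names, the 10% tolerance, and the reference value of 40 in the usage lines are all hypothetical placeholders, not real statistics.

```python
from collections import Counter

# Chi-square critical value for 9 degrees of freedom at alpha = 0.05.
CHI2_CRIT_DF9_05 = 16.919


def last_digit_chi2(values):
    """Chi-square statistic comparing last-digit counts to a uniform spread."""
    digits = [abs(int(v)) % 10 for v in values]
    expected = len(digits) / 10
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def looks_fabricated(values):
    """Flag the data if its last digits deviate strongly from uniform.

    This is a screening heuristic, not proof: a flag only means the data
    deserves a closer look.
    """
    return last_digit_chi2(values) > CHI2_CRIT_DF9_05


def cross_check(name, observed, reference, tolerance=0.10):
    """Return a warning if `observed` differs from `reference` by more than
    `tolerance` (relative difference); return None otherwise."""
    if reference == 0:
        raise ValueError("reference must be non-zero for a relative comparison")
    relative_diff = abs(observed - reference) / abs(reference)
    if relative_diff > tolerance:
        return (f"{name}: observed {observed} differs from reference "
                f"{reference} by {relative_diff:.0%}")
    return None


# Hypothetical usage: the reference value 40 is a placeholder, not a real figure.
warning = cross_check("average age", observed=80, reference=40)
suspicious = looks_fabricated([17, 27, 37, 47, 57, 67, 77, 87, 97, 107] * 10)
```

In practice the reference value would come from an established source such as the census data mentioned earlier, and the tolerance would depend on how much variation is normal for the metric.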

Follow up :

Regarding the dataset fabrication-flagging and cross-checking, you might be interested in Kristin Sainani's (Stanford) webinar "How to Be a Statistical Data Detective" : https://www.youtube.com/watch?v=JG_gCIGFaQI. She mentions statcheck.io and GRIM as two tools for analyzing published research. They operate on the publication and not the dataset.
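
To give a feel for the idea behind GRIM, here is a minimal sketch, not the published tool itself. It assumes the underlying data are non-negative integers (for example, survey ratings), so the true mean must equal some whole-number total divided by the sample size. The function name `grim_consistent` and the choice of round-half-up are my own assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if `reported_mean` could arise from n integer values.

    The mean is passed as a string so the number of reported decimal
    places is preserved (e.g. "5.19" has two).
    """
    mean = Decimal(reported_mean)
    # Smallest step at the reported precision, e.g. 0.01 for "5.19".
    quantum = Decimal(1).scaleb(mean.as_tuple().exponent)
    approx_total = int(mean * n)
    # Try the integer totals around mean * n and see if any rounds to the mean.
    for total in range(approx_total - 1, approx_total + 3):
        if total < 0:
            continue
        candidate = (Decimal(total) / Decimal(n)).quantize(
            quantum, rounding=ROUND_HALF_UP
        )
        if candidate == mean:
            return True
    return False


# With 28 integer responses, no whole-number total gives a mean of 5.19.
impossible_mean = grim_consistent("5.19", 28)
possible_mean = grim_consistent("5.18", 28)
```

As noted above, a check like this needs only the reported mean and sample size from the publication, not the dataset itself.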

Python Kaizen - The Infinite Loop of Improvement
