Data Analysis (1): Case Studies and Interview Questions

Latest recommended article published on 2025-05-11 00:48:51

This content is licensed under the CC 4.0 BY-SA license.

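I used to focus on programming tools and kept overlooking two key points. The first is knowing, even memorizing, the data in specific cases until it is as familiar as the digits 0-9. The second is business understanding, which says a lot about a person's overall ability. In short, practicing concrete cases until they become second nature is very important!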

Below, I start studying, using Analytics Vidhya as the main source.

Useful links
1. 24 data science cases

2. A comprehensive summary of interview questions: data science and statistics questions

1. Twenty-Four Data Science Cases

1. Beginner Level
Iris Data
Loan Prediction Data
Bigmart Sales Data
Boston Housing Data
Time Series Analysis Data
Wine Quality Data
Turkiye Student Evaluation Data
Heights and Weights Data
2. Intermediate Level
Black Friday Data
Human Activity Recognition Data
Siam Competition Data
Trip History Data
Million Song Data
Census Income Data
Movie Lens Data
Twitter Classification Data
3. Advanced Level
Identify your Digits
Urban Sound Classification
Vox Celebrity Data
ImageNet Data
Chicago Crime Data
Age Detection of Indian Actors Data
Recommendation Engine Data
VisualQA Data

2. A Comprehensive Summary of Interview Questions: Data Science and Statistics Questions

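 Data science and statistics questions
 
 Machine learning questions
 
 Deep learning questions
 
 Case studies
 
 Puzzles and guesstimates
 
 Tool- and language-specific questions
 
 Tips and tricks for beginners
 
 Inspiring stories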

Machine learning questions

40 machine learning and data science questions commonly asked at startups

Q1. You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

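Answer:
1. Close other applications to free up memory;
2. Take a random sample of the data to get a smaller data set;
3. Remove correlated variables: use the correlation coefficient for numerical variables and the chi-square test for categorical variables;
4. Use PCA to reduce the dimensionality;
5. Use an online learning algorithm, such as Vowpal Wabbit;
6. Build a linear model trained with stochastic gradient descent.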
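
As a rough illustration of items 2 and 3 for numerical variables, the sketch below estimates correlations on a random sample of rows and drops columns that are nearly duplicates of ones already kept. The helper name `drop_correlated_columns`, the 0.9 threshold, and the synthetic data are all assumptions made for this example, not part of the original answer. It also assumes no column is constant, since a constant column has an undefined correlation.

```python
import numpy as np

def drop_correlated_columns(X, threshold=0.9):
    """Return indices of columns to keep (hypothetical helper).

    A column is dropped when its absolute Pearson correlation with any
    already-kept column exceeds `threshold`.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))  # columns are variables
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic data: column 1 is almost a linear copy of column 0.
rng = np.random.default_rng(0)
n_rows = 10_000
a = rng.normal(size=n_rows)
b = rng.normal(size=n_rows)
X = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=n_rows), b])

# Item 2: estimate correlations on a random sample of rows to save memory.
sample_idx = rng.choice(n_rows, size=1_000, replace=False)
X_sample = X[sample_idx]

# Item 3: drop the redundant numerical column, then apply to the full data.
kept = drop_correlated_columns(X_sample)
X_reduced = X[:, kept]
print(kept, X_reduced.shape)
```

Here the redundant column 1 is dropped while the independent columns 0 and 2 survive. Categorical variables would need a different test (the answer suggests chi-square), which this sketch does not cover.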

Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?

(Skipped for now.)

Q3. You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

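Answer: about 32%. Assuming the data is normally distributed, the median equals the mean, and by the 68–95–99.7 rule about 68% of the data lies within 1 standard deviation of it. The missing values fall in that range, so the remaining 32% or so is unaffected.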
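
The 68% and 32% figures can be checked numerically. This is a minimal sketch that assumes a standard normal distribution and uses Python's standard-library `statistics.NormalDist`; the variable names are only illustrative.

```python
from statistics import NormalDist

# Assumption: the data follows a normal distribution, so median == mean.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass within 1 standard deviation of the center.
within_one_sd = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

# Everything else is outside that range, i.e. unaffected by the missing values.
outside_one_sd = 1.0 - within_one_sd

print(f"within 1 SD:  {within_one_sd:.4f}")   # 0.6827
print(f"outside 1 SD: {outside_one_sd:.4f}")  # 0.3173
```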

Q4. You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

Answer:

First, cancer detection data is imbalanced, so accuracy should not be used as the evaluation metric. Use metrics such as sensitivity (True Positive Rate), specificity (True Negative Rate), and the F score instead.
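
To see why 96% accuracy can be misleading, here is a small sketch with made-up numbers: 1000 patients, 40 of whom have cancer, and a degenerate model that predicts "no cancer" for everyone. The data and the helper `binary_metrics` are hypothetical, chosen only so that the accuracy comes out at 96%.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Report 0.0 when a ratio is undefined (e.g. no positive predictions).
        return num / den if den else 0.0

    sensitivity = safe_div(tp, tp + fn)   # true positive rate (recall)
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical imbalanced data: 40 positives, 960 negatives.
y_true = [1] * 40 + [0] * 960
# A useless model that always predicts the majority class.
y_pred = [0] * 1000

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The always-negative model reaches 96% accuracy and perfect specificity, yet its sensitivity and F1 are both 0: it never finds a single cancer case.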

If the model performs poorly on the minority class, the following measures can be taken: