Machine Learning 2019. 1. 29. 14:01
Shannon entropy, cross entropy, KL divergence / logit function / decision trees + the ID3 algorithm / Bernoulli distribution / Maximum Likelihood Estimation (MLE) / SGD / understanding boosting / mean squared error, bias, and variance / posterior probability / SVM (Support Vector Machine)
FFM background papers - Rendle, S. (2010). Factorization machines. : https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
- Factorization Machines with libFM: https://www.csie.ntu.edu.tw/~b97053/paper/Factorization%20Machines%20with%20libFM.pdf
- Explains in more detail how FM is implemented.
- Jahrer, M., Tscher, A., Lee, J.-Y., Deng, J., Zhang, H., and Spoelstra, J. (2012). Ensemble of collaborative filtering and feature engineered models for click through rate prediction. : https://pdfs.semanticscholar.org/eeb9/34178ea9320c77852eb89633e14277da41d8.pdf
- Training and Testing Low-degree Polynomial Data Mappings via Linear SVM : http://www.jmlr.org/papers/volume11/chang10a/chang10a.pdf
- Explanation of the loss function for logistic regression: https://stats.stackexchange.com/questions/250937/which-loss-function-is-correct-for-logistic-regression#answer-279698
- Explains the difference between the formulas when y is in {0, 1} and when y is in {-1, 1}.
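The point of that answer is that the two label conventions give the same loss value, just written differently. This can be checked numerically with a short sketch (the helper names here are illustrative, not from any library):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_01(y, z):
    # y in {0, 1}: binary cross-entropy on p = sigmoid(z)
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_pm1(y, z):
    # y in {-1, +1}: logistic loss log(1 + exp(-y*z))
    return math.log(1.0 + math.exp(-y * z))

# The two formulations agree once labels are mapped 0 <-> -1, 1 <-> +1.
for z in (-2.0, 0.5, 3.0):
    assert abs(loss_01(1, z) - loss_pm1(+1, z)) < 1e-12
    assert abs(loss_01(0, z) - loss_pm1(-1, z)) < 1e-12
```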
- https://www.analyticsvidhya.com/blog/2018/01/factorization-machines/
- http://ailab.criteo.com/ctr-prediction-linear-model-field-aware-factorization-machines/
Word2Vec
- https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
* can also be applied with TF-IDF in place of one-hot encoding
Doc2Vec
- https://yujuwon.tistory.com/entry/Doc2Vec
- http://www.engear.net/wp/tag/doc2vec/
- paragraph_vector.pdf
Machine learning overview material (parts 1-8)
Regression analysis lecture notes (Prof. Sehyug Kwon, Dept. of Statistics, Hannam University) - http://wolfpack.hnu.ac.kr/lecture/Regression/
ALS (matrix factorization) algorithm - https://www.slideshare.net/madvirus/als-ws?from_action=save * als-141117230305-conversion-gate01.pdf
Machine learning video lectures - https://seslab.kaist.ac.kr/xe2/page_GBex27
FRAMEWORK/Spark 2019. 1. 16. 17:33
Understanding Spark, Part 2: Architecture - After introducing Spark in the previous blog, I will try to explain the architecture of Spark in this blog. The objective is to give a quick overview of the various components in the Spark architecture, what their functionalities are, and how they enable Spark to process large amounts of data fast.
- The assumption is that the reader has a prior understanding of the map-reduce paradigm and some knowledge of the Hadoop architecture.
1. What are the key components of a Spark application?
Every Spark application has two main components:
- One Driver
- A set of Executors (one or many)
Driver - Is the coordinator of the Spark application and hosts the SparkContext object, which is the entry point to the application.
- Negotiates with the external resource managers to provision all required resources for the Spark application.
- Manages the executor tasks.
- Converts all map-reduce operations into tasks for the executors to perform.
- Collects all metrics about the execution of spark application and its components.
- Executors - are the actual workhorses of the Spark application. There might be one or more executors provisioned for a Spark application. Executors are actually Java containers running on physical or virtual machines, which in turn are managed by cluster managers like YARN or Mesos.
- The number of executors and their capacities in terms of virtual cores and RAM must be specified before starting a Spark application. (There is an exception to this, where resources can be provisioned dynamically.)
- Let's assume that we are using a YARN-managed cluster.
- The driver negotiates with the Resource Manager of YARN to provision these resources in the cluster.
- Then the Node Manager of YARN spawns these processes, and the executors are registered (handed over) to the driver for control, allocation, and coordination of tasks among the executors.
- The following diagram depicts the architecture of Spark.
Fig 1: Spark Components: Driver and Executors

Executors load external data (for example, files from HDFS) into their memory. In this example, two blocks are loaded into each executor's memory. The in-memory representation of these data partitions is called an RDD (Resilient Distributed Dataset), and each chunk of data in memory is called a partition. The algorithm is expressed in terms of map-reduce stages, and the driver pushes these map-reduce tasks to the executors. Mappers can run in parallel across the RDD partitions in the executors. If a reduce operation is assigned, the executors wait until all partitions are completed and then proceed to the data shuffle. After the data shuffle is over, the executors can again run operations in parallel on these shuffled partitions. Finally, the resulting partitions, after completion of all map-reduce tasks, are saved to an external system, which is defined in the code submitted to Spark. This serialization of the resulting partitions can also be accomplished in parallel by the executors. As you can see, the executors actually load the data as RDDs and their partitions and apply operations on those partitions, while the driver only assigns and coordinates these tasks with the executors.

2. How are the executors provisioned?
The number of executors and their capacity in terms of CPU and memory are specified during the submission of the application. The driver then negotiates with the cluster manager, e.g. the Resource Manager in YARN. The YARN Resource Manager finds the best resources to schedule the executors and instructs the node managers to spawn these processes. Once the executors are started, they register with the driver for further assignment and coordination of tasks. The machines (physical or virtual) managed by the cluster manager are typically called slaves or workers. The requested executors are allocated optimally across the available workers; it is possible that some workers are assigned more than one executor.
Irrespective of where the executors are assigned, the capacity requested by the Spark application is guaranteed by the YARN Resource Manager.

3. How is data read into a Spark application?
Data can be read into a Spark application from any external system; Spark is not tightly coupled with any specific file system or storage system. Data can be loaded into Spark by two methods. The driver can read data into a buffer and then parallelize it (divide it into smaller chunks and send them) to the executors, but the amount of data that can be read and processed in this fashion is very limited. Alternatively, the driver can pass the locations of the files in the external system and coordinate the reading of the data by the executors directly, for example, which blocks should be read by which executors from the HDFS file system.

4. How are map-reduce operations executed optimally in Spark?
All operations are applied on RDD partitions in terms of map or reduce operations, and all data-analysis logic is expressed in terms of map and reduce operations. An example of a map operation would be filtering or selecting data; an example of a reduce operation would be a group-by or sort-by operation. Here is an example of a series of map and reduce operations:
- Load data -> map1 -> map2 -> map3 -> reduce1 -> map4 -> reduce2 -> reduce3 -> save results
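The map/reduce distinction can be sketched in a few lines of plain Python (this is not the Spark API, just a toy word count): map work runs independently per partition, while a reduce forces a shuffle that groups data by key across all partitions first.

```python
from collections import defaultdict

# Toy data split into "partitions", as executors would hold them in memory.
partitions = [["a", "b", "a"], ["b", "b", "c"]]

# Map phase: runs independently per partition (parallelizable, no waiting).
mapped = [[(word, 1) for word in part] for part in partitions]

# Shuffle: group pairs by key across ALL partitions. This is the barrier
# where executors must wait for every partition to finish its map work.
shuffled = defaultdict(list)
for part in mapped:
    for key, value in part:
        shuffled[key].append(value)

# Reduce phase: runs per key on the shuffled data.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'a': 2, 'b': 3, 'c': 1}
```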
Once the driver reads the sequence of operations, it sends these as tasks to the executors. But it has to coordinate the execution of the tasks to resolve any dependencies between the RDD partitions across multiple executors. In this case, the first operations are reading the data and map1. Let's say executor 1 finishes the map1 operation on partition P0 before partition P1, and executor 2 finishes the map1 operation on partitions P2 and P3.
- Does the executor need to wait for the map1 operation to complete across all partitions before it starts the map2 operation?
The answer is no: since the map2 operation is independent of the data in other partitions, the executor can proceed with map2. The only time executors need to wait before proceeding further is when there is a reduce operation, because a reduce operation depends on the data across all partitions. The data needs to be shuffled across executors before the reduce operation can be applied. The driver understands these dependencies, given a sequence of map-reduce tasks, and combines the operations into stages. Each stage can be processed in parallel across executors, but all executors must finish a stage before proceeding to the next one. So, given the above sequence, the driver divides the tasks into four stages as below.
Stage 1: load -> map1 -> map2 -> map3
Stage 2: reduce1 -> map4
Stage 3: reduce2
Stage 4: reduce3 -> save
- The diagram above depicts the stages created by the driver and executed by the executors.
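The driver's stage-splitting rule described above, cutting at every shuffle (reduce) boundary, can be sketched in plain Python. The function name and the string-based operation labels are illustrative only, not Spark internals:

```python
def split_into_stages(ops):
    """Start a new stage at every reduce (shuffle boundary);
    map operations stay fused into the current stage."""
    stages = []
    current = []
    for op in ops:
        # A reduce needs data from all partitions, so it opens a new stage.
        if op.startswith("reduce") and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

ops = ["load", "map1", "map2", "map3",
       "reduce1", "map4", "reduce2", "reduce3", "save"]
print(split_into_stages(ops))
# [['load', 'map1', 'map2', 'map3'], ['reduce1', 'map4'],
#  ['reduce2'], ['reduce3', 'save']]
```

This reproduces the four stages listed above: maps are pipelined together, and each shuffle boundary forces a synchronization point.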
- Not only are the stages executed in parallel across executors, they can also be parallelized within an executor. Each executor may have multiple partitions loaded into its memory and can process a stage in parallel across the partitions within the same executor. These units of work, processing one partition each, are called tasks.
- But to process partitions in parallel, the executor must start multiple threads. And these threads can run in parallel in the true sense only if the executor has access to multiple CPUs.
- So each executor should be allocated multiple CPUs or cores if we intend to run tasks in parallel.
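The idea of one thread per partition can be illustrated with a thread pool (a toy sketch, not Spark code; note that CPython's GIL limits true CPU parallelism for pure-Python work, whereas Spark's JVM executor threads really do run on separate cores):

```python
from concurrent.futures import ThreadPoolExecutor

# Three in-memory "partitions"; each task processes one of them.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def process_partition(part):
    # Stand-in for a map stage applied to a single partition.
    return [x * 2 for x in part]

# With as many worker threads as cores, partitions are handled concurrently;
# pool.map preserves the partition order in the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_partition, partitions))

print(results)  # [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
```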
Conclusion:
- In this blog, we took a quick dive into the Spark architecture to understand its components and their internal workings. In the next blog, we will dive deeper to understand how Spark manages memory and when it actually evaluates and executes tasks.
Source - http://www.awesomestats.in/spark-architecuture-2/
Advertising/Terminology 2018. 2. 28. 11:41
To understand "multi-armed," let's first look at "one-armed." The most common one-armed example is the slot machine. A slot machine in a casino has a single long lever on its right side that you can pull. (Mechanical slot machines can now only be seen in movies or museums; these days casinos use electronic slot machines with push buttons.) Pulling the lever and releasing it starts the game. In practice, you can lose a lot of money at slot machines in a very short time. If the chance of winning on a slot machine were 50:50, the chance of losing would also be 50:50. But casinos rig their slot machines so that people actually lose money faster than 50%, and because the machine robs players of their money, the long lever came to be called a "bandit" (robber).

So what is a multi-armed bandit? The multi-armed bandit problem can be illustrated by a situation where one person has to play a set of five slot machines, losing as little money as possible and winning as much as possible. If you play the five machines over 100 or 1,000 repeated games, you learn empirically which machines win more and which lose more. But your money is limited, so you cannot repeat the game indefinitely. Call the five machines M1, M2, M3, M4, and M5, and assume each machine has different default settings, so the probabilities of winning and losing differ across machines. The player, however, cannot know in advance which machine pays out more, so the player must identify the most profitable machine as quickly as possible. If we could see the payout distribution of each machine from M1 to M5, we would know which machine wins the most; knowing only this fact, the player could keep betting on that machine and obtain the most profitable outcome. But while searching for the machine with the best results, the player must keep spending, and losing, money. If it takes a long time to learn which machine pays out, the player keeps putting money into low-probability machines and may lose everything in the meantime.

Playing slot machines this way therefore requires two concepts: exploration and exploitation.
1) Find out as quickly as possible which machine pays out the most (exploration).
2) At the same time, keep winning money as fast as possible from the machine currently believed to pay out the most (exploitation).

This also gives rise to the mathematical concept of regret. If one player keeps putting money into the optimal machine and wins, while another keeps putting money into a non-optimal machine and loses a lot, the difference between the best outcome and the non-best outcome is the regret. The cost spent exploring other machines in order to find the optimal one is called an opportunity cost, and the longer you spend exploring non-optimal machines, the higher the regret can become. You must explore quickly to find a (possibly sub-optimal) machine (exploration), keep winning money from it (exploitation), and identify the truly optimal machine in the least possible time. (A sub-optimal distribution found in a short time still needs to be verified as the truly optimal distribution; a hasty judgment may mistake a sub-optimal machine for the optimal one.)

To summarize, the objective of the multi-armed bandit model is to find the best one (exploration), win money from that best one (exploitation), and minimize the time spent finding the best one.
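The exploration/exploitation trade-off and the resulting regret can be sketched with a minimal epsilon-greedy simulation. The five win probabilities, the epsilon value, and the round count below are made-up illustrative numbers, not from the original text:

```python
import random

random.seed(0)

# Five "slot machines" with win probabilities unknown to the player.
true_probs = [0.10, 0.25, 0.30, 0.45, 0.60]  # hypothetical values
n_rounds = 10_000
epsilon = 0.1   # fraction of plays spent exploring at random

counts = [0] * 5  # plays per machine
wins = [0] * 5    # observed wins per machine

for _ in range(n_rounds):
    if random.random() < epsilon:
        arm = random.randrange(5)  # exploration: try a random machine
    else:
        # exploitation: play the machine with the best observed win rate
        # (unplayed machines get an optimistic estimate of 1.0)
        arm = max(range(5),
                  key=lambda i: wins[i] / counts[i] if counts[i] else 1.0)
    counts[arm] += 1
    wins[arm] += random.random() < true_probs[arm]

# Regret: what the best machine would have paid, minus what we actually won.
best = max(true_probs)
regret = best * n_rounds - sum(wins)
print("plays per machine:", counts)
print("total regret:", regret)
```

The regret printed at the end is exactly the gap described above: the cost of the rounds spent on non-optimal machines while learning which one is best.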