Key Concepts and Related Documents

Machine Learning 2019. 1. 29. 14:01

Shannon entropy, cross entropy, KL divergence

Logit function

Decision Tree + ID3 algorithm

Bernoulli probability distribution

Maximum Likelihood Estimation (MLE)

SGD

Understanding boosting techniques

Mean Squared Error, Bias, and Variance

Posterior Probability

SVM(Support Vector Machine)


FFM prior papers

- https://www.analyticsvidhya.com/blog/2018/01/factorization-machines/

- http://ailab.criteo.com/ctr-prediction-linear-model-field-aware-factorization-machines/


Word2Vec

 - https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/

  * Can be applied with TF-IDF in place of one-hot encoding

Doc2Vec

- https://yujuwon.tistory.com/entry/Doc2Vec

- http://www.engear.net/wp/tag/doc2vec/

paragraph_vector.pdf


General machine learning material (parts 1-8)


Regression analysis lecture notes (Prof. Sehyug Kwon, Department of Statistics, Hannam University)

http://wolfpack.hnu.ac.kr/lecture/Regression/


ALS (MF) algorithm

- https://www.slideshare.net/madvirus/als-ws?from_action=save

 * als-141117230305-conversion-gate01.pdf


Machine learning video lectures

- https://seslab.kaist.ac.kr/xe2/page_GBex27





Spark architecture

FRAMEWORK/Spark 2019. 1. 16. 17:33

Understanding Spark: Part 2: Architecture

  • After introducing Spark in the previous blog, I will try to explain the architecture of Spark in this blog. The objective is to give a quick overview of the various components in the Spark architecture, what their functionalities are, and how they enable Spark to process large amounts of data fast.
  • The assumption is that the reader has a prior understanding of the map reduce paradigm and some knowledge of the Hadoop architecture.

Spark Architecture

1. What are the key components of a Spark application?

    Every Spark application has two main components:
    • One Driver
    • A set of Executors (one or many)
    Driver - is the coordinator of the Spark application and hosts the SparkContext object, which is the entry point to the application.
    • The driver negotiates with the external resource manager to provision all the resources required for the Spark application.
    • Manages the executor tasks.
    • Converts all map reduce operations and creates tasks for the executors to perform.
    • Collects all metrics about the execution of the Spark application and its components.
  • Executors - are the actual workhorses of the Spark application. There might be one or more executors provisioned for a Spark application. Executors are actually Java containers running on physical or virtual machines, which in turn are managed by cluster managers like YARN or Mesos.
    • The number of executors and their capacities in terms of virtual cores and RAM must be specified before starting a Spark application. (There is an exception to this, where resources can be provisioned dynamically.)
    • Let's assume that we are using a YARN-managed cluster.
    • The driver negotiates with the Resource Manager of YARN to provision these resources in the cluster.
    • Then the Node Manager of YARN spawns these processes, and the executors are registered (handed over) to the driver for control, allocation, and coordination of tasks among the executors.
  • The following diagram depicts the architecture of Spark.

Fig 1: Spark Components: Driver and Executors

    Executors load the external data (for example, files from HDFS) into their memory. In this example, two blocks are loaded into each executor's memory. The in-memory representation of these data partitions is called an RDD (Resilient Distributed Dataset), and each chunk of data in memory is called a partition. The algorithm is expressed in terms of map reduce stages, and the driver pushes these map reduce tasks to the executors. Mappers can run in parallel across the RDD partitions in the executors. If a reduce operation is assigned, the executors wait until all partitions are completed and then proceed with the data shuffle. After the data shuffle is over, the executors can again run operations in parallel on the shuffled partitions. Finally, the resulting partitions, after completion of all map reduce tasks, are saved to an external system, which is defined in the code submitted to Spark. This serialization of the resulting partitions can also be done in parallel by the executors. As you can see, the executors actually load the data as RDDs and their partitions and apply operations on those partitions, while the driver only assigns and coordinates these tasks with the executors.
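    The flow above can be sketched with a few PySpark RDD operations. This is a minimal sketch, assuming a word-count job and hypothetical HDFS paths: the driver builds the plan, while the executors load the partitions, run the map tasks in parallel, shuffle for the reduce, and write the result partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-flow-sketch").getOrCreate()  # driver hosts the SparkContext
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input/*.txt")         # hypothetical path; executors load HDFS blocks as RDD partitions
words = lines.flatMap(lambda line: line.split())        # map: runs per partition, no shuffle needed
pairs = words.map(lambda w: (w, 1))                     # map: still per partition
counts = pairs.reduceByKey(lambda a, b: a + b)          # reduce: shuffles data across executors
counts.saveAsTextFile("hdfs:///data/output/wordcount")  # executors write result partitions in parallel
```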

2. How are the executors provisioned?

    The number of executors and their capacity in terms of CPU and memory are specified when the application is submitted. The driver then negotiates with the cluster manager, e.g. the Resource Manager in YARN. The YARN Resource Manager finds the best resources to schedule the executors and instructs the Node Managers to spawn these processes. Once the executors are started, they register with the driver for further assignment and coordination of tasks. The machines (physical or virtual) managed by the cluster manager are typically called slaves or workers. The requested executors are allocated optimally across the available workers; it is possible that some workers are assigned more than one executor. Irrespective of where the executors are assigned, the capacity requested by the Spark application is guaranteed by the YARN Resource Manager.
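    As a hedged sketch, such a resource request might look like the following from the driver code; the application name and resource sizes are made-up values, and the same settings correspond to the spark-submit flags --num-executors, --executor-cores, and --executor-memory.

```python
from pyspark.sql import SparkSession

# Hypothetical request: 4 executors, each with 2 virtual cores and 4 GB of RAM,
# negotiated with the YARN Resource Manager when the application starts.
spark = (
    SparkSession.builder
    .appName("executor-provisioning-sketch")
    .master("yarn")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```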

3. How is data read into a Spark application?

    Data can be read into a Spark application from any external system; Spark is not tightly coupled with any specific file system or storage system. Data can be loaded into Spark in two ways. First, the driver can read data into a buffer and then parallelize it (divide it into smaller chunks and send them) to the executors, but the amount of data that can be read and processed in this fashion is very limited. Second, the driver can pass the location of the files in the external system and coordinate the executors to read the data directly, for example deciding which HDFS blocks are read by which executors.
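    The two loading paths might look like this in PySpark (the file location is hypothetical):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("read-paths-sketch").getOrCreate().sparkContext

# 1) Driver-side read + parallelize: the driver first holds the data in its own memory,
#    so this only works for small datasets.
small_local_data = list(range(1000))
rdd_small = sc.parallelize(small_local_data, numSlices=8)

# 2) Executor-side read: the driver only passes the file location; the executors read
#    the HDFS blocks directly, so arbitrarily large files can be processed.
rdd_large = sc.textFile("hdfs:///data/big_file.txt")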

4. How are map reduce operations executed optimally in Spark?

    All operations are applied on RDD partitions in terms of map or reduce operations; all data analysis logic is expressed in terms of map and reduce operations. An example of a map operation would be filtering or selecting data; an example of a reduce operation would be a group-by or sort-by operation. Here is an example of a series of map and reduce operations.
    • Load data -> map1 -> map2 -> map3 -> reduce1 -> map4 -> reduce2 -> reduce3 -> save results
    Once the driver reads the sequence of operations, it sends them as tasks to the executors, but it has to coordinate the execution of the tasks to resolve any dependencies between the RDD partitions across multiple executors. In this case, the first operations are reading the data and map1. Let's say executor 1 finishes the map1 operation on partition P0 before partition P1, and executor 2 finishes the map1 operation on partitions P2 and P3.
    • Does an executor need to wait for the map1 operation to complete across all partitions before it starts the map2 operation?
    The answer is no: since the map2 operation is independent of the data in other partitions, the executor can proceed with map2. The only time executors need to wait before proceeding is when there is a reduce operation, because a reduce operation depends on the data across all partitions, and the data needs to be shuffled across executors before the reduce operation can be applied. The driver understands these dependencies, given a sequence of map reduce tasks, and combines the operations into stages. Each stage can be processed in parallel across executors, but all executors must finish a stage before proceeding to the next one. So, given the above sequence, the driver divides the tasks into four stages as below.
      Stage 1: load -> map1 -> map2 -> map3
      Stage 2: reduce1 -> map4
      Stage 3: reduce2
      Stage 4: reduce3 -> save

  • The diagram above depicts the stages created by the driver and executed by the executors.
  • Not only are the stages executed in parallel across executors, they can also be parallelized within an executor. Each executor may have multiple partitions loaded into its memory and can process each stage in parallel across the partitions within the same executor. The units of work that process the partitions in parallel are called tasks.
  • But to process partitions in parallel, the executor has to start multiple threads, and these threads can run in parallel in the true sense only if the executor has access to multiple CPUs.
  • So each executor should be allocated multiple CPUs or cores if we intend to run the tasks in parallel.
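  • The pipeline above can be sketched as PySpark RDD operations; the input path and the per-record logic are illustrative assumptions. Map-like operations (map, filter) are pipelined within a stage, while reduce-like operations (reduceByKey, sortByKey) force a shuffle and start a new stage.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("stage-sketch").getOrCreate().sparkContext

top10 = (
    sc.textFile("hdfs:///data/events.csv")          # load               -- Stage 1
      .map(lambda line: line.split(","))            # map1               -- Stage 1
      .filter(lambda cols: len(cols) == 3)          # map2               -- Stage 1
      .map(lambda cols: (cols[0], int(cols[2])))    # map3               -- Stage 1
      .reduceByKey(lambda a, b: a + b)              # reduce1 (shuffle)  -- Stage 2
      .map(lambda kv: (kv[1], kv[0]))               # map4               -- Stage 2
      .sortByKey(ascending=False)                   # reduce2 (shuffle)  -- Stage 3
      .take(10)                                     # action: triggers the whole job
)
print(top10)
```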

Conclusion:

  • In this blog, we took a quick look at the Spark architecture to understand its components and their internal workings. In the next blog, we will dive deeper to understand how Spark manages memory and when it actually evaluates and executes tasks.




Source - http://www.awesomestats.in/spark-architecuture-2/



What is the Multi-Armed Bandit Problem? #2

Advertising/Terminology 2018. 2. 28. 11:41

To understand multi-armed, let's first look at one-armed. The most common example is the one-armed slot machine. A slot machine in a casino has a single lever on its right side that you can pull. (Mechanical slot machines can still be seen in movies or museums; nowadays electronic slot machines with push buttons are used.) Pulling the lever and letting it go starts the game.

In reality, you can lose a lot of money very quickly on casino slot machines. If the chance of winning money on a slot machine were 50:50, the chance of losing would also be 50:50; but casinos rig the slot machines so that people actually lose money faster than 50%, and because the machine robs players of their money, the lever came to be called a bandit (a thief or robber).
So what is a multi-armed bandit? The multi-armed bandit problem can be pictured as a situation in which a person has to play a set of five slot machines, losing as little money as possible and winning as much as possible across the five machines. After playing 100 or 1,000 repeated games on the five machines, you learn from experience which machines tend to win and which tend to lose; but since your money is limited, you cannot repeat the game forever.

Call the five slot machines M1, M2, M3, M4, and M5, and assume that each machine has a different default setting, so the probability of winning or losing money differs between machines. The player, however, cannot know in advance which machine pays out the most, so the player has to find out as quickly as possible which machine yields the most money. If you could see each machine's payout distribution from M1 to M5, you would know which machine pays the most; once the player knows that fact, they can keep betting only on that machine and obtain the most profitable outcome.
But while searching for the machine that gives good results, you still have to keep spending and losing money. If it takes a long time to learn which machine wins, you keep feeding money into low-probability machines and may lose everything in the meantime.

Playing this slot machine game therefore requires two concepts:
Exploration and exploitation
1) You must find out, as quickly as possible, which machine yields the most money (exploration).
2) At the same time, you must keep winning money as fast as possible from the machine currently believed to pay the most (exploitation).
 
This is also where the mathematical concept of regret comes in.
If one player keeps putting money into the optimal machine and wins, while another keeps putting money into a non-optimal machine and loses a lot, the difference between the best outcome and the non-best outcome is the regret.
The cost spent exploring the other machines in order to find the optimal machine is called the opportunity cost, and the longer you spend exploring non-optimal machines, the higher the regret can become. You have to explore quickly to find a sub-optimal machine (exploration), keep winning money from that machine (exploitation), and find the optimal machine within a minimum amount of time.
(The sub-optimal distribution found in a short time still needs to be verified to confirm whether it really is the optimal distribution; a hasty judgment may mistake a sub-optimal machine for the optimal one.)
 
To summarize, the purpose of the multi-armed bandit model is to find the best one (exploration), win money from that best one (exploitation), and minimize the time spent finding the best one.
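As a minimal sketch of this loop, an epsilon-greedy strategy balances exploration and exploitation. The five win probabilities for M1 to M5 below are made-up values for illustration, and the regret is accumulated against the best machine.

```python
import random

true_win_prob = [0.10, 0.15, 0.25, 0.40, 0.30]   # hypothetical payout probabilities for M1..M5
n_arms = len(true_win_prob)
counts = [0] * n_arms          # how many times each machine was played
values = [0.0] * n_arms        # running estimate of each machine's payout rate
epsilon = 0.1                  # fraction of plays spent exploring
total_reward, regret = 0.0, 0.0
best_prob = max(true_win_prob)

for t in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(n_arms)        # exploration: try a random machine
    else:
        arm = values.index(max(values))       # exploitation: play the current best estimate
    reward = 1.0 if random.random() < true_win_prob[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update
    total_reward += reward
    regret += best_prob - true_win_prob[arm]  # expected loss vs. always playing the best machine

print(f"total reward: {total_reward:.0f}, cumulative regret: {regret:.1f}")
```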

