Key Concepts and Related Documents

Machine Learning 2019. 1. 29. 14:01

Shannon entropy, cross entropy, KL divergence

Logit function

Decision Tree + ID3 algorithm

Bernoulli probability distribution

Maximum Likelihood Estimation (MLE)

SGD

Understanding boosting techniques

Mean Squared Error, Bias, and Variance

Posterior Probability

SVM(Support Vector Machine)


FFM prior papers

- https://www.analyticsvidhya.com/blog/2018/01/factorization-machines/

- http://ailab.criteo.com/ctr-prediction-linear-model-field-aware-factorization-machines/


Word2Vec

 - https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/

  * Can be applied with TF-IDF in place of one-hot encoding

Doc2Vec

- https://yujuwon.tistory.com/entry/Doc2Vec

- http://www.engear.net/wp/tag/doc2vec/

paragraph_vector.pdf


General machine learning material (parts 1-8)


Regression analysis lecture notes (Prof. Sehyug Kwon, Department of Statistics, Hannam University)

http://wolfpack.hnu.ac.kr/lecture/Regression/


ALS (MF) algorithm

- https://www.slideshare.net/madvirus/als-ws?from_action=save

 * als-141117230305-conversion-gate01.pdf


Machine learning video lectures

- https://seslab.kaist.ac.kr/xe2/page_GBex27





Spark architecture

FRAMEWORK/Spark 2019. 1. 16. 17:33

Understanding Spark: Part 2: Architecture

  • After introducing Spark in the previous blog, I will try to explain the architecture of Spark in this blog. The objective is to give a quick overview of the various components in the Spark architecture, what their functionalities are, and how they enable Spark to process large amounts of data fast.
  • The assumption is that the reader has a prior understanding of the map reduce paradigm and some knowledge of the Hadoop architecture.

Spark Architecture

1. What are the key components of a Spark application?

    Every Spark application has two main components:
    • One Driver
    • A set of Executors (one or many)
    Driver - is the coordinator of the Spark application and hosts the SparkContext object, which is the entry point to the application.
    • The driver negotiates with the external resource manager to provision all the resources required for the Spark application.
    • Manages the executor tasks.
    • Converts all map reduce operations and creates tasks for the executors to perform.
    • Collects all metrics about the execution of the Spark application and its components.
  • Executors - are the actual workhorses of the Spark application. There might be one or more executors provisioned for a Spark application. Executors are actually Java containers running on physical or virtual machines, which in turn are managed by cluster managers like YARN or Mesos.
    • The number of executors and their capacities in terms of virtual cores and RAM must be specified before starting a Spark application. (There is an exception to this, where resources can be provisioned dynamically.)
    • Let's assume that we are using a YARN-managed cluster.
    • The driver negotiates with the Resource Manager of YARN to provision these resources in the cluster.
    • Then the Node Manager of YARN spawns these processes, and the executors are registered (handed over) to the driver for control, allocation, and coordination of tasks among the executors.
  • The following diagram depicts the architecture of Spark.

Fig 1: Spark Components: Driver and Executors

    Executors load the external data (for example, files from HDFS) into their memory. In this example, two blocks are loaded into each executor's memory. The in-memory representation of these data partitions is called an RDD (Resilient Distributed Dataset), and each chunk of data in memory is called a partition. The algorithm is expressed in terms of map reduce stages, and the driver pushes these map reduce tasks to the executors. Mappers can run in parallel across the RDD partitions in the executors. If a reduce operation is assigned, the executors wait until all partitions are completed and then proceed with the data shuffle. After the data shuffle is over, the executors can again run operations in parallel on the shuffled partitions. Finally, the resulting partitions, after completion of all map reduce tasks, are saved to an external system, which is defined in the code submitted to Spark. This serialization of the resulting partitions can also be done in parallel by the executors. As you can see, the executors actually load the data as RDDs and their partitions and apply operations on those partitions, while the driver only assigns and coordinates these tasks with the executors.
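    The flow above can be sketched with a few PySpark RDD operations. This is a minimal sketch, assuming a word-count job and hypothetical HDFS paths: the driver builds the plan, while the executors load the partitions, run the map tasks in parallel, shuffle for the reduce, and write the result partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-flow-sketch").getOrCreate()  # driver hosts the SparkContext
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input/*.txt")         # hypothetical path; executors load HDFS blocks as RDD partitions
words = lines.flatMap(lambda line: line.split())        # map: runs per partition, no shuffle needed
pairs = words.map(lambda w: (w, 1))                     # map: still per partition
counts = pairs.reduceByKey(lambda a, b: a + b)          # reduce: shuffles data across executors
counts.saveAsTextFile("hdfs:///data/output/wordcount")  # executors write result partitions in parallel
```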

2. How are the executors provisioned?

    The number of executors and their capacity in terms of CPU and memory are specified when the application is submitted. The driver then negotiates with the cluster manager, e.g. the Resource Manager in YARN. The YARN Resource Manager finds the best resources to schedule the executors and instructs the Node Managers to spawn these processes. Once the executors are started, they register with the driver for further assignment and coordination of tasks. The machines (physical or virtual) managed by the cluster manager are typically called slaves or workers. The requested executors are allocated optimally across the available workers; it is possible that some workers are assigned more than one executor. Irrespective of where the executors are assigned, the capacity requested by the Spark application is guaranteed by the YARN Resource Manager.
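    As a hedged sketch, such a resource request might look like the following from the driver code; the application name and resource sizes are made-up values, and the same settings correspond to the spark-submit flags --num-executors, --executor-cores, and --executor-memory.

```python
from pyspark.sql import SparkSession

# Hypothetical request: 4 executors, each with 2 virtual cores and 4 GB of RAM,
# negotiated with the YARN Resource Manager when the application starts.
spark = (
    SparkSession.builder
    .appName("executor-provisioning-sketch")
    .master("yarn")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```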

3. How is data read into a Spark application?

    Data can be read into a Spark application from any external system; Spark is not tightly coupled with any specific file system or storage system. Data can be loaded into Spark in two ways. First, the driver can read data into a buffer and then parallelize it (divide it into smaller chunks and send them) to the executors, but the amount of data that can be read and processed in this fashion is very limited. Second, the driver can pass the location of the files in the external system and coordinate the executors to read the data directly, for example deciding which HDFS blocks are read by which executors.
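    The two loading paths might look like this in PySpark (the file location is hypothetical):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("read-paths-sketch").getOrCreate().sparkContext

# 1) Driver-side read + parallelize: the driver first holds the data in its own memory,
#    so this only works for small datasets.
small_local_data = list(range(1000))
rdd_small = sc.parallelize(small_local_data, numSlices=8)

# 2) Executor-side read: the driver only passes the file location; the executors read
#    the HDFS blocks directly, so arbitrarily large files can be processed.
rdd_large = sc.textFile("hdfs:///data/big_file.txt")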

4. How are map reduce operations executed optimally in Spark?

    All operations are applied on RDD partitions in terms of map or reduce operations; all data analysis logic is expressed in terms of map and reduce operations. An example of a map operation would be filtering or selecting data; an example of a reduce operation would be a group-by or sort-by operation. Here is an example of a series of map and reduce operations.
    • Load data -> map1 -> map2 -> map3 -> reduce1 -> map4 -> reduce2 -> reduce3 -> save results
    Once the driver reads the sequence of operations, it sends them as tasks to the executors, but it has to coordinate the execution of the tasks to resolve any dependencies between the RDD partitions across multiple executors. In this case, the first operations are reading the data and map1. Let's say executor 1 finishes the map1 operation on partition P0 before partition P1, and executor 2 finishes the map1 operation on partitions P2 and P3.
    • Does an executor need to wait for the map1 operation to complete across all partitions before it starts the map2 operation?
    The answer is no: since the map2 operation is independent of the data in other partitions, the executor can proceed with map2. The only time executors need to wait before proceeding is when there is a reduce operation, because a reduce operation depends on the data across all partitions, and the data needs to be shuffled across executors before the reduce operation can be applied. The driver understands these dependencies, given a sequence of map reduce tasks, and combines the operations into stages. Each stage can be processed in parallel across executors, but all executors must finish a stage before proceeding to the next one. So, given the above sequence, the driver divides the tasks into four stages as below.
      Stage 1: load -> map1 -> map2 -> map3
      Stage 2: reduce1 -> map4
      Stage 3: reduce2
      Stage 4: reduce3 -> save

  • The diagram above depicts the stages created by the driver and executed by the executors.
  • Not only are the stages executed in parallel across executors, they can also be parallelized within an executor. Each executor may have multiple partitions loaded into its memory and can process each stage in parallel across the partitions within the same executor. The units of work that process the partitions in parallel are called tasks.
  • But to process partitions in parallel, the executor has to start multiple threads, and these threads can run in parallel in the true sense only if the executor has access to multiple CPUs.
  • So each executor should be allocated multiple CPUs or cores if we intend to run the tasks in parallel.
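  • The pipeline above can be sketched as PySpark RDD operations; the input path and the per-record logic are illustrative assumptions. Map-like operations (map, filter) are pipelined within a stage, while reduce-like operations (reduceByKey, sortByKey) force a shuffle and start a new stage.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("stage-sketch").getOrCreate().sparkContext

top10 = (
    sc.textFile("hdfs:///data/events.csv")          # load               -- Stage 1
      .map(lambda line: line.split(","))            # map1               -- Stage 1
      .filter(lambda cols: len(cols) == 3)          # map2               -- Stage 1
      .map(lambda cols: (cols[0], int(cols[2])))    # map3               -- Stage 1
      .reduceByKey(lambda a, b: a + b)              # reduce1 (shuffle)  -- Stage 2
      .map(lambda kv: (kv[1], kv[0]))               # map4               -- Stage 2
      .sortByKey(ascending=False)                   # reduce2 (shuffle)  -- Stage 3
      .take(10)                                     # action: triggers the whole job
)
print(top10)
```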

Conclusion:

  • In this blog, we took a quick look at the Spark architecture to understand its components and their internal workings. In the next blog, we will dive deeper to understand how Spark manages memory and when it actually evaluates and executes tasks.




Source - http://www.awesomestats.in/spark-architecuture-2/



What is the Multi-Armed Bandit Problem? #2

Advertising/Terminology 2018. 2. 28. 11:41

To understand multi-armed, let's first look at one-armed. The most common example is the one-armed slot machine. A slot machine in a casino has a single lever on its right side that you can pull. (Mechanical slot machines can still be seen in movies or museums; nowadays electronic slot machines with push buttons are used.) Pulling the lever and letting it go starts the game.

In reality, you can lose a lot of money very quickly on casino slot machines. If the chance of winning money on a slot machine were 50:50, the chance of losing would also be 50:50; but casinos rig the slot machines so that people actually lose money faster than 50%, and because the machine robs players of their money, the lever came to be called a bandit (a thief or robber).
So what is a multi-armed bandit? The multi-armed bandit problem can be pictured as a situation in which a person has to play a set of five slot machines, losing as little money as possible and winning as much as possible across the five machines. After playing 100 or 1,000 repeated games on the five machines, you learn from experience which machines tend to win and which tend to lose; but since your money is limited, you cannot repeat the game forever.

Call the five slot machines M1, M2, M3, M4, and M5, and assume that each machine has a different default setting, so the probability of winning or losing money differs between machines. The player, however, cannot know in advance which machine pays out the most, so the player has to find out as quickly as possible which machine yields the most money. If you could see each machine's payout distribution from M1 to M5, you would know which machine pays the most; once the player knows that fact, they can keep betting only on that machine and obtain the most profitable outcome.
But while searching for the machine that gives good results, you still have to keep spending and losing money. If it takes a long time to learn which machine wins, you keep feeding money into low-probability machines and may lose everything in the meantime.

Playing this slot machine game therefore requires two concepts:
Exploration and exploitation
1) You must find out, as quickly as possible, which machine yields the most money (exploration).
2) At the same time, you must keep winning money as fast as possible from the machine currently believed to pay the most (exploitation).
 
This is also where the mathematical concept of regret comes in.
If one player keeps putting money into the optimal machine and wins, while another keeps putting money into a non-optimal machine and loses a lot, the difference between the best outcome and the non-best outcome is the regret.
The cost spent exploring the other machines in order to find the optimal machine is called the opportunity cost, and the longer you spend exploring non-optimal machines, the higher the regret can become. You have to explore quickly to find a sub-optimal machine (exploration), keep winning money from that machine (exploitation), and find the optimal machine within a minimum amount of time.
(The sub-optimal distribution found in a short time still needs to be verified to confirm whether it really is the optimal distribution; a hasty judgment may mistake a sub-optimal machine for the optimal one.)
 
To summarize, the purpose of the multi-armed bandit model is to find the best one (exploration), win money from that best one (exploitation), and minimize the time spent finding the best one.
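As a minimal sketch of this loop, an epsilon-greedy strategy balances exploration and exploitation. The five win probabilities for M1 to M5 below are made-up values for illustration, and the regret is accumulated against the best machine.

```python
import random

true_win_prob = [0.10, 0.15, 0.25, 0.40, 0.30]   # hypothetical payout probabilities for M1..M5
n_arms = len(true_win_prob)
counts = [0] * n_arms          # how many times each machine was played
values = [0.0] * n_arms        # running estimate of each machine's payout rate
epsilon = 0.1                  # fraction of plays spent exploring
total_reward, regret = 0.0, 0.0
best_prob = max(true_win_prob)

for t in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(n_arms)        # exploration: try a random machine
    else:
        arm = values.index(max(values))       # exploitation: play the current best estimate
    reward = 1.0 if random.random() < true_win_prob[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update
    total_reward += reward
    regret += best_prob - true_win_prob[arm]  # expected loss vs. always playing the best machine

print(f"total reward: {total_reward:.0f}, cumulative regret: {regret:.1f}")
```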

