We want our customers to know what the most popular, most watched content is, and potentially entice them to start watching that content as well, to help them make a viewing choice on our platform. The problem is simple enough; the solution is anything but! We discuss how we sift through billions of events to determine what our customers like to watch.
To state the problem precisely: popularity in Disney+Hotstar is defined by the most watched content within a reference time window.
It is measured by the count of watch views (started-video events) for a piece of content.
Be it a live cricket match or a TV show, different content types have different contexts for popularity. Consider the following hierarchy:
- A channel’s popularity is based on the popularity of its shows
- A show's popularity is based on the popularity scores of its episodes
- An episode's popularity is based on its watch views compared with other episodes in the last 24 hours
- For a live event, the popularity score of highlights or key-moment clips decays heavily once the event concludes
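The hierarchy above can be sketched roughly as follows. This is a minimal illustration, not our production scoring: the catalogue mappings, the choice to sum child scores into parents, and the exponential decay with a `half_life_hours` parameter are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical catalogue: episode -> show, show -> channel.
EPISODE_TO_SHOW = {"ep1": "showA", "ep2": "showA", "ep3": "showB"}
SHOW_TO_CHANNEL = {"showA": "channelX", "showB": "channelX"}


def roll_up(child_scores, parent_of):
    """Sum child scores into their parents (summation is an assumption)."""
    parent_scores = defaultdict(float)
    for child, score in child_scores.items():
        parent_scores[parent_of[child]] += score
    return dict(parent_scores)


def decayed_score(watch_views, hours_since_event_end, half_life_hours=1.0):
    """Halve a clip's score every half_life_hours after the live event ends.

    The exponential form and the half-life value are illustrative only.
    """
    if hours_since_event_end <= 0:
        return float(watch_views)  # event still live: no decay
    return watch_views * 0.5 ** (hours_since_event_end / half_life_hours)


# Watch views per episode over the last 24 hours (made-up numbers).
episode_views_24h = {"ep1": 1200, "ep2": 800, "ep3": 500}
show_scores = roll_up(episode_views_24h, EPISODE_TO_SHOW)
channel_scores = roll_up(show_scores, SHOW_TO_CHANNEL)
clip_score = decayed_score(1000, hours_since_event_end=2)
```

Keeping the roll-up generic means the same function serves both the episode-to-show and show-to-channel levels.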
Search also uses popularity scores to rank its results.
In a regular, non-event week, we clock upwards of 10TB of event data for our key “started-video” event. During event weeks, this number rises massively. Rather than keep building against a moving target, we decided to compute results over a 10TB window, as a way to bound the problem space.
We also placed a temporal bound with a 1 hour window. During live events, a longer window would yield far too many events. These were our going-in bounds.
We considered several ways to do time-windowed aggregation of watch counts:
- Time series DB: Storing each watch count by timestamp. However, due to continuous, heavy traffic and a heavy payload, a time series DB would not have had optimal throughput.
- ETL jobs off warehouses like Redshift or our in-house data lake: Our data lake also maintains the watch count, but the aggregation happens daily rather than hourly. For a live match, this would not have worked.
- A streaming job application that processes the incoming watch events, windows them into watch counts, and stores the result for downstream systems to consume.
- Hotstar's analytical platform: Most of the analytics instruments work on sampling, which would not have been accurate enough to give sufficient signals for last 1 hour popularity.
Given the volume of requests and the time window, we decided to go with a streaming job application that windows the watch events and accumulates the watch count of each content. We decided to use Apache Flink for this.
What is Apache Flink?
Apache Flink is an open source product for distributed stream and batch data processing. Flink’s core is a streaming data-flow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimisation.
Event-driven applications access their data locally, which yields better performance in terms of both throughput and latency. Periodic checkpoints to remote persistent storage can be taken asynchronously and incrementally.
Hotstar maintains its own stream data platform: Knol. Knol captures all types of events from the clients.
For every video played on Hotstar, an event containing the content identifier is sent to our Knol system. This acts as the source for the entire streaming job pipeline.
Apache Flink provides easy-to-use windowing options. Our use case works with a sliding window, in which we accumulate the watch count for every content over a 1 hour window.
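To make the windowing idea concrete, here is a plain-Python sketch of sliding-window counting. It does not use the Flink API; it only mimics the general idea of assigning each event to every window that contains its timestamp. The 15 minute slide (`SLIDE_S`) and the event tuples are assumptions for illustration, since only the 1 hour window size comes from the text above.

```python
from collections import Counter, defaultdict

WINDOW_SIZE_S = 3600  # 1 hour window, as described above
SLIDE_S = 900         # 15 minute slide: an assumed value, not from our setup


def window_starts(ts, size=WINDOW_SIZE_S, slide=SLIDE_S):
    """Return the start time of every sliding window that contains ts."""
    start = ts - (ts % slide)  # latest window start at or before ts
    starts = []
    while start > ts - size:
        starts.append(start)
        start -= slide
    return starts


def count_watches(events, size=WINDOW_SIZE_S, slide=SLIDE_S):
    """Accumulate watch counts per content for each window start."""
    counts = defaultdict(Counter)
    for ts, content_id in events:
        for start in window_starts(ts, size, slide):
            counts[start][content_id] += 1
    return counts


# Hypothetical started-video events: (epoch seconds, content identifier).
events = [(4000, "match-clip"), (4100, "match-clip"), (4600, "episode-1")]
counts = count_watches(events)
```

With these values each event lands in `WINDOW_SIZE_S / SLIDE_S = 4` overlapping windows, which is why sliding windows cost more state than non-overlapping ones.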
Due to the sheer volume of data, keeping the output in memory was not optimal. Flink provides native support for RocksDB as a state backend and for AWS S3 as the data lake for the processed output. S3 also supports multi-part upload, which suits streaming applications.
Multi-part upload has several benefits for transferring large files over time: a failure-recovery mechanism that allows pausing and resuming, and improved throughput. (Check here for detailed advantages)
After a complete time interval (1 hour in this case), the part files are combined into a full object. Every hour, one full file is stored on S3, containing the watch counts for each content viewed in that hour.
Flink provides a very good checkpointing system for recovering from failures. In this case, it checkpoints the latest offset read from the Kafka source and stores it in S3, so that when the application recovers, it resumes reading from that offset.
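The recovery idea can be illustrated with a toy example. This is not Flink's checkpointing mechanism, only a sketch of the principle: the source offset and the accumulated counts are snapshotted together, so a restarted job resumes from the last snapshot without double counting. The `CountingJob` class, the dict standing in for S3, and the `fail_at` crash switch are all hypothetical.

```python
import json


class CountingJob:
    """Toy job that counts content ids from a log and checkpoints periodically."""

    def __init__(self, checkpoint_store, checkpoint_every=2):
        self.store = checkpoint_store  # stand-in for durable storage such as S3
        self.checkpoint_every = checkpoint_every
        if "latest" in self.store:
            snapshot = json.loads(self.store["latest"])
        else:
            snapshot = {"offset": 0, "counts": {}}
        self.offset = snapshot["offset"]  # next log position to read
        self.counts = snapshot["counts"]  # watch counts as of that offset

    def run(self, log, fail_at=None):
        while self.offset < len(log):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated crash")
            content_id = log[self.offset]
            self.counts[content_id] = self.counts.get(content_id, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                # Offset and counts are saved together so they stay consistent.
                self.store["latest"] = json.dumps(
                    {"offset": self.offset, "counts": self.counts}
                )


log = ["a", "b", "a", "c", "a"]  # hypothetical stream of content ids
store = {}

job = CountingJob(store)
try:
    job.run(log, fail_at=3)  # crash after the checkpoint taken at offset 2
except RuntimeError:
    pass

recovered = CountingJob(store)  # restores the last checkpoint
resumed_from = recovered.offset
recovered.run(log)
```

The event at offset 2 was processed before the crash but after the last checkpoint, so it is replayed on recovery; because the counts are restored from the same snapshot, it is still counted only once.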
Since we store popularity (i.e. watch count) on S3 every hour, any downstream application can use it to rank content accordingly. As discussed above, a match clip might be re-ranked every hour, whereas a show might be re-ranked every day.
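As a rough sketch of how a downstream consumer might use the hourly output, the example below ranks content from the latest hour and from a sum over several hours. The in-memory dicts stand in for hourly files read from S3, and the content identifiers, counts, and tie-breaking rule are made up for illustration.

```python
from collections import Counter


def rank(counts, n):
    """Return the top-n content ids by watch count, ties broken by id."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [content_id for content_id, _ in ordered[:n]]


# Hypothetical hourly watch counts, oldest first (one dict per hourly file).
hourly_files = [
    {"clip-1": 900, "show-7": 300, "show-9": 100},
    {"clip-1": 50, "show-7": 400, "show-9": 450},
]

# Hourly re-rank, e.g. for match clips: use only the latest hour.
latest_hour_ranking = rank(hourly_files[-1], 2)

# Daily re-rank, e.g. for shows: sum the hourly counts first.
daily_total = sum((Counter(hour) for hour in hourly_files), Counter())
daily_ranking = rank(daily_total, 2)
```

Here the clip dominates the summed ranking but has already dropped out of the latest-hour ranking, which is the kind of difference that motivates re-ranking at different cadences.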
The entire Hotstar compute runs on Kubernetes containers (See our journey here!). A Flink job consists of Task Managers, whose slots perform the streaming and windowing tasks, and a Job Manager, which coordinates the Task Managers. Based on our scale, we have deployed these components onto our K8s cluster.
This was our journey in creating our current platform popularity / trending pipeline. As we mature our system, personalization adds further layers onto what is popular for you, versus on the platform overall.
If you want to build an entertainment product for Bharat, and solve problems that you can’t find anywhere else, come join us, we’re hiring! Check out tech.hotstar.com for job listings!