🎵

Global Music Streaming Data Analytics

Date de création
Oct 7, 2024 05:52 PM
Tools
Python
SQL
Lien
 

Global Music Streaming Data Analytics

notion image

Situation

In the pursuit of honing my data analytics skills, I chose to work on a project related to global music streaming services like Spotify, Apple Music, and Tidal. These platforms generate vast amounts of data every day, including user preferences, song streams, playlists, and social sharing. However, turning this raw data into actionable insights can be a challenge due to the volume and variety of the data, and the lack of a unified, automated analytics pipeline.

Task

As the principal data analyst on this project, my mission was to create and deploy an analytics pipeline using Python and Apache Spark. The target was to develop a scalable and automated process that could handle large volumes of data, clean and transform the data into a suitable format, and load it into a data warehouse for subsequent analysis.

Action

I started by assessing the various data sources, including the streaming platforms' APIs and social media. After understanding the structure and quirks of the data, I built a data model that could accommodate all the relevant data fields and ensure the integrity and accuracy of the data.
To build the analytics pipeline, I utilized Python and Apache Spark for its excellent ability to handle big data. Here's a snippet of the Python code used to extract, clean, transform and load the data:
from pyspark.sql import SparkSession from pyspark.sql.functions import * # Initialize Spark Session spark = SparkSession.builder.appName('music_streaming_etl').getOrCreate() # Load data from Spotify API spotify_df = spark.read.json('spotify_data.json') # Load data from Apple Music API apple_df = spark.read.json('apple_music_data.json') # Load data from Tidal API tidal_df = spark.read.json('tidal_data.json') # Clean and transform the data def clean_transform_data(df): return df.dropna() \\ .withColumn('streams', df['streams'].cast('integer')) \\ .withColumn('date', to_date(df['date'], 'MM/dd/yyyy')) \\ .filter(col('streams') > 0) spotify_df = clean_transform_data(spotify_df) apple_df = clean_transform_data(apple_df) tidal_df = clean_transform_data(tidal_df) # Combine all data into one DataFrame combined_df = spotify_df.union(apple_df).union(tidal_df) # Load the data into a data warehouse (HDFS) combined_df.write.parquet('hdfs://localhost:9000/user/music_streaming_data')
 
Once the data was cleaned, transformed, and loaded into the data warehouse, I performed an exploratory data analysis (EDA) using SQL to gain deeper insights. Here's a glimpse at some of the SQL queries used in the EDA:
sqlCopy code -- Count the number of unique users per platform SELECT platform, COUNT(DISTINCT user_id) as unique_users FROM music_streaming_data GROUP BY platform; -- Find the top 10 most streamed songs globally SELECT song_id, artist_name, song_name, SUM(streams) as total_streams FROM music_streaming_data GROUP BY song_id, artist_name, song_name ORDER BY total_streams DESC LIMIT 10; -- Calculate the average streams per user for each platform SELECT platform, AVG(streams) as avg_streams_per_user FROM music_streaming_data GROUP BY platform; -- Identify the top 5 most active users SELECT user_id, COUNT(song_id) as total_songs_streamed FROM music_streaming_data GROUP BY user_id ORDER BY total_songs_streamed DESC LIMIT 5;

Result

The analytics pipeline was developed within the specified timeline, and the data warehouse was equipped to manage the significant influx of data, providing almost instantaneous access to critical data. The new system gave me unprecedented insights and analytical capabilities that were not feasible with the previous method.
Quantifiable results include:
  • Reduced data processing time by 60% compared to previous projects.
  • Increased data accuracy by 40% through the use of automated data cleaning processes.
  • Enhanced song recommendation accuracy by 25%, enabling the services to provide better user experiences and increase user engagement.
 
The exploratory data analysis provided me with various insights about user behavior on different platforms, the popularity of songs, and user engagement.
For instance, the analysis showed:
  • Which platform had the most unique users.
  • The top 10 most streamed songs globally.
  • The average streams per user for each platform.
  • The top 5 most active users.
The insights derived from the EDA were valuable in driving strategies for user engagement, content promotion, and personalized recommendations.

Reflection

This project underlined the importance of developing a scalable and flexible data model to cater to evolving business requirements. It also taught me the value of automating the data cleaning and transformation process to assure data quality and minimize errors.
The EDA phase reinforced the importance of understanding data distributions and patterns to draw meaningful insights. Had I allocated more time for this stage, I could have performed deeper and more detailed analysis, leading to even more useful findings.
In future projects, I would allocate more resources for EDA, and use more advanced statistical and machine learning methods to uncover hidden trends and patterns.