## Putting the data together

Recently we had to analyze the size of files being ingested into a Solr index. Performance testing had been run several times: sometimes we saw great response times with zero errors, and other times we saw very high response times with hundreds of 504 Gateway Timeout errors.
We knew new files were being ingested during this time, but we weren't sure of the file sizes or the number of files coming in while we were pounding the server with 1,800 requests per minute.
So what does a good data analyst turn to? R!
First we had to get the data. New files were being uploaded to an incoming AWS S3 bucket, processed by the ingestion server, then sent to an AWS S3 "done" bucket. Listing the files in the done bucket was a cinch with the AWS CLI.
```shell
aws s3 ls s3://your-bucket-name-goes-here --output text >> ~/Desktop/s3.txt
```
This gives a space-delimited text file that is easily converted to a .csv.
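The post doesn't show the conversion step itself, but one way to do it is a single awk one-liner. This is a sketch, assuming the default four-column `aws s3 ls` layout (date, time, size, filename) and no spaces in the file names:

```shell
# Convert the space-delimited `aws s3 ls` listing to CSV.
# Assumes four columns: date, time, size, filename (no spaces in names).
awk 'BEGIN { OFS="," } { print $1, $2, $3, $4 }' ~/Desktop/s3.txt > ~/Desktop/s3.csv
```

File names containing spaces would spill into extra columns, so a listing with such names would need a stricter parser.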
At that point it’s just a matter of loading the data into an R dataframe.
First read in the s3.csv file. Unless you added headers outside of R, you'll need to add them yourself with the colnames function below.
```r
s3_files <- read.csv(file="s3.csv", header=FALSE)
colnames(s3_files) <- c("date", "time", "size", "filename")
```
This will give us the hour each file was ingested and put it into a new column in the data frame. The hms and hour functions come from the lubridate package.
```r
library(lubridate)

s3_files$time <- hms(s3_files$time)
s3_files$hour <- factor(hour(s3_files$time))
```
This creates a new data frame, break.down.hour.date, with the sum of the file sizes grouped by date and hour. The pipe and the select, group_by, and summarise verbs come from the dplyr package.
```r
library(dplyr)

break.down.hour.date <- s3_files %>%
  select(-time) %>%
  group_by(date, hour) %>%
  summarise(size = sum(size))
```
This is the ggplot2 code to create a GitHub-style punchcard graph. It's really just a scatterplot, with the hour of the day a file came in on the x axis and the date on the y axis. The size of each dot is based on the sum of the file sizes.
```r
library(ggplot2)
library(grid)  # for unit()

hour.date.graph <- ggplot(break.down.hour.date,
                          aes(x=hour, y=date, size=size, color=size))

hour.date.graph +
  geom_point() +
  ggtitle("Hourly breakdown of files from Jan 05 - Mar 04") +
  xlab("Hour of day") +
  ylab("Date") +
  guides(size=FALSE, color=FALSE) +
  scale_size(range=c(5, 15)) +
  theme(axis.title=element_text(size=16),
        plot.title=element_text(size=18, face="bold"),
        panel.grid.major=element_line(colour="grey"),
        panel.grid.minor=element_line(colour="grey"),
        plot.background=element_rect(fill="grey90"),
        plot.margin=unit(c(4, 4, 10, 4), "mm"),
        panel.background=element_rect(fill="grey90"))
```
And below is the final output!
## Analysis

In the graph above we can see a few things. Most of the large files come in between 12pm and 5pm, which is peak time, when the most users are on the site. This was a good indicator that we needed to find a way to move ingestion to earlier or later in the day.
We also see that the vast majority of the incoming files are smaller in size, which won't present a problem for search and site performance.
And we see that there are some days where no files are ingested. This was actually known from the start, because no files are sent to be ingested on the weekend.
## Summary

Maybe my Google-fu is a little weak, but I couldn't find much documentation on how to make a punchcard-style graph in R or ggplot2. It wasn't difficult, though, to use a scatterplot and set the size of each point to the sum of the file sizes for that hour.
This type of plot can be very useful when you need to see a breakdown of what is happening by hour over days or weeks.