How Data Has Been Used in the Film Industry

Not all uses of the data go toward the success of the film at the box-office. The data can be used in a wide variety of ways. For example, in 2013, Farsite Group used data that they had gathered to predict the winners of six of the main awards for the 85th Oscars. They used Rotten Tomatoes ratings given by both critics and audience members, the box office success of each film, and if the films had won any awards at award shows that take place before the Oscars, such as the Directors Guild Awards, and the Golden Globes (Gold et al., 2013). In doing so, they were able to accurately predict 5 out of 6 of those awards. In 2014, Farsite Group once again predicted the Oscars (Pomerantz, 2014). There were no articles mentioning the outcomes of their predictions, but after cross referencing the Forbes article mentioning the predictions with the official Oscars website (The 86th Academy Awards, 2014), Farsite Group accurately predicted 6 out of 6 awards for that year.

Another example comes from a study done by Sitaram Asur and Bernardo A. Huberman (2013). Together, they analyzed 2.9 million tweets from 1.2 million different users about 24 different films. Considering the mentions of the film, the positivity of the tweets, etc. they were able to use a linear regression model that allowed for them to be able to show the relationship between the spread of people talking about the film and how successful the film was likely to be because of that. One example from their data set was the film Avatar. A week before its release it accumulated around 1212.8 tweets per hour. They were able to use this through their model to show that based on the number of tweets surrounding the film, it would be a successful film within its first week of release. Through their work, they were able to prove that data gathered from social media sites can effectively predict the future outcomes of a particular film’s success at the box office. They were also able to prove that this method of analysis and prediction worked much better than the predictions of the Hollywood Stock Exchange. The Hollywood Stock Exchange is an online virtual stock market where users are able to buy and trade stock using virtual fake currency to make predictions for which movies will be a success at the box office. The method used by Asur and Huberman helped to prove that the more a film is positively talked about prior to its release, the better the film will perform at the box-office when it comes time for it to be released into theaters (Asur & Huberman, 2013).

An interesting tool that could eventually be used by many companies comes from 20th Century Studios. They have revealed that they began using machine learning models before they even started pre-production. These machine learning models collect data and help find potential films that audiences will want to see. 20th Century Studios uses this to guide themselves when buying a script (Kapoor, 2021). They take labels created for their films and then they feed those through the machine learning models to help them discover potential scripts. This changes how the process works. Traditionally, a producer would have assistants who go through the scripts for them and author short reports on the scripts. It is then up to the producer to read the reports and decide which film they would like to make next. The machine learning models can now choose the scripts and then the assistants narrow down the chosen scripts rather than narrowing down every script. This could increase the output of more successful films overall. It can be seen as a potential huge money saver for companies looking to produce certain films. A lackluster script can be better avoided rather than be made and create a net loss after it has been produced into a movie, allowing for there to be less of a net loss when that film has a high budget and ends up not performing well at the box-office and after word spreads that there is not much substance to the film due to the script. The model considers what audiences might want to see next, but it is unable to account for any unexpected breakthroughs in the industry. This is due to the model relying on the earlier scripts that are considered successful by the studio.

“Predicting Movie Prices Through Dynamic Social Network Analysis” used both the Internet Movie Database (IMDb) and the Rotten Tomatoes moving rating parameters, along with the “buzz” around a film and posts gathered from the IMDb forums (Simon & Schroeder, 2019). This allowed them to be able to make predictions for a film during the first four weeks after its release. But while they had some success with this method, the information does not really become useful for the studios as their film is already in theaters and they can figure how well their film will do based on their opening weekend.

Another study that garnered results came from, “Predicting consumer behavior with Web search” (Goel et al., 2010). They were able to use data gathered from Yahoo!’s search engine for box-office predictions (Simon & Schroeder, 2019). The data that they gained from the search engine came in the form of searches from individual users. They were able to compile the data based on whether or not there was a link to IMDb within the immediate search results and they then mapped out the movies based on which movie the IMDb link led to. With this, they were able to create predictions that worked well. They also found that the results worked particularly well due to users using the search engine to search for the film they were interested in and where they would eventually be able to see the film near them when it was released into theaters. But even still they pointed out an issue with this method, “the main advantage of using a search behavior may not be accurate but rather the ready availability of these data.” (Simon & Schroeder, 2019, p.554). Having the data sets there to use is indeed a fantastic advantage due to their availability, but having Simon and Schroeder say that they may not be the most accurate leads to the conclusion that those data sets would be most beneficial if they were a part of a model that takes data from multiple sources to be able to accurately create box-office predictions by analyzing multiple data sets from different sources rather than relying on the data from one particular source. This raises important equity considerations. For instance, the data could be skewed due to the digital divide, underrepresenting people from lower socio-economic backgrounds who may not have regular internet access. Additionally, the data may carry cultural and language biases, as it predominantly captures the behaviors and preferences of majority populations or those who are more active online.