
Use Python to Parse Server Log Files for SEO

Log files are an invaluable source of information for SEO professionals. They offer unique insights into how search engines index and interpret your website. But how do you create log files in Python? To create a log file in Python:
  1. Import the logging module.
  2. Configure it with basicConfig(filename='logfile.log', level=logging.DEBUG) to set the log file name and the minimum severity level.
  3. Use logging.debug(), logging.info(), logging.warning(), logging.error(), or logging.critical() to write log messages to the file.
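Putting those three steps together, a minimal sketch might look like this (the file name and messages are just placeholders):

import logging

# write everything from DEBUG level upwards to logfile.log
logging.basicConfig(filename='logfile.log', level=logging.DEBUG)

logging.debug('Detailed diagnostic information')
logging.info('Confirmation that things are working as expected')
logging.warning('Something unexpected happened, but the script continues')
logging.error('A more serious problem occurred')
logging.critical('A serious error; the script may be unable to continue')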

How to read log files in Python?

To read a log file in Python:
  1. Open it using open('logfile.log', 'r'), replacing 'logfile.log' with the file path.
  2. Then, use read() or readlines() to retrieve the file's contents.
  3. Close the file when finished with file.close(), or use a with statement for automatic closure.
Unlike other SEO tools, log files provide unparalleled data. To optimize efficiency and cost-effectiveness, consider automating the parsing, validation, and analysis of log file data using Python for SEO purposes. By doing so, you can save valuable time and resources.
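As a quick illustration of the reading steps above, here is a minimal sketch using a with statement (the file name is a placeholder):

# open the log file and print it line by line; the with statement closes the file automatically
with open('logfile.log', 'r') as file:
    for line in file:
        print(line.strip())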

Why you should analyze and summarize SEO log files in Python

Python is a general-purpose programming language with multiple uses across data analysis, web development, and automation. By parsing and pivoting your SEO data from a log file in Python, you can do the following:
  • Validate your conclusions by getting specific information about how search engines crawl and monitor your site.
  • Identify the size of the problem and how much troubleshooting can help by prioritizing your findings.
  • Look for issues that you can't see in other data sources.
Below, we've explained the process of parsing and pivoting SEO log files in Python. Read on.

How to use Python to parse and pivot SEO log files

One of the main difficulties in interpreting log file data is the enormous variety of formats. Apache, Nginx, and IIS each offer a range of configuration options and let you alter the data points returned. To deliver content from the edge location nearest to a user, many websites now use CDN providers like Cloudflare, CloudFront, and Akamai, and each of these has its own format as well. In this post, we'll concentrate on the Combined Log Format, since it's the default for Nginx and a common choice on Apache servers.
  1. File identification and formatting

You need at least 100,000 requests and between two and four weeks of data from a typical website to do an advanced SEO analysis. Logs are often broken up into days due to file size, so there may be a lot of files to process. The first step is to use the glob module to list all the files in our folder, because you can't know how many files you'll be working with until you combine them before running the script. The following returns every file that matches the pattern we have defined, in this case every TXT file:

import glob

files = glob.glob('*.txt')

Not all log files are TXT, though. Log files come in many different formats, and even the file extension can be unfamiliar. Another possibility is that the files you receive are divided up into sub-folders, and we don't want to waste time moving everything to one location. Thankfully, glob supports wildcard operators and recursive search, so you can list every file in a subfolder or group of subfolders:

files = glob.glob('**/*.*', recursive=True)

The next step is to determine the different file types included in your list. This can be accomplished by finding the MIME type of each file, which tells you what kind of file you're working with no matter the extension. You can do this by writing a straightforward function that uses python-magic, a wrapper for the libmagic C library:

pip install python-magic
pip install libmagic

import magic

def file_type(file_path):
    mime = magic.from_file(file_path, mime=True)
    return mime

You can then loop over your files with a list comprehension, creating a dictionary that maps each file name to its type:

file_types = [file_type(file) for file in files]
file_dict = dict(zip(files, file_types))

Finally, to get a list of files that return the MIME type text/plain and exclude everything else, define a small function that loops over the dictionary:

uncompressed = []

def file_identifier(mime_type):
    for key, value in file_dict.items():
        if mime_type in value:
            uncompressed.append(key)

file_identifier('text/plain')
  2. Extract search engine requests

After narrowing down the contents of your folder or directories, the next step is to filter the files themselves by extracting only the requests you care about. This eliminates the need to merge the files with command-line applications like GREP or FINDSTR and saves you time looking up the appropriate commands. In this instance, you only want Googlebot requests, so you will search for "Googlebot" to match all of the pertinent user agents. You can read and write your files using Python's open function and search them using Python's built-in re regex module.
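As a rough sketch of that filtering step (the file names access.log and googlebot.txt are assumptions, chosen to match the parsing step below):

import re

# case-insensitive pattern covering Googlebot user agents
pattern = re.compile('Googlebot', re.IGNORECASE)

# read the raw log and keep only the lines that match
with open('access.log', 'r') as source, open('googlebot.txt', 'w') as output:
    for line in source:
        if pattern.search(line):
            output.write(line)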
  3. Parse requests

Requests can be parsed in a variety of ways. For this article, we will use pandas' CSV parser and some fundamental data processing operations to:
  • Remove any extra columns
  • Format the timestamp
  • Make a column with complete URLs
  • Rename and rearrange the other columns
Rather than hardcoding the domain name, you can use the input function to prompt the user and store it as a variable. Here is the code:

import pandas as pd

whole_url = input('Please enter full domain with protocol: ')  # get domain from user input

# import the filtered logs; skip any malformed lines
df = pd.read_csv('./googlebot.txt', sep=r'\s+', on_bad_lines='skip', header=None, low_memory=False)

df.drop([1, 2, 4], axis=1, inplace=True)  # drop unwanted columns/characters

df[3] = df[3].str.replace('[', '', regex=False)  # clean the timestamp
df[['Date', 'Time']] = df[3].str.split(':', n=1, expand=True)  # split the timestamp into two columns

df[['Request Type', 'URI', 'Protocol']] = df[5].str.split(' ', n=2, expand=True)  # split the URI request into columns
df.drop([3, 5], axis=1, inplace=True)

df.rename(columns={0: 'IP', 6: 'Status Code', 7: 'Bytes', 8: 'Referrer URL', 9: 'User Agent'}, inplace=True)  # rename columns

df['Full URL'] = whole_url + df['URI']  # concatenate the domain name

df['Date'] = pd.to_datetime(df['Date'])  # declare data types
df[['Status Code', 'Bytes']] = df[['Status Code', 'Bytes']].apply(pd.to_numeric)

# reorder columns
df = df[['Date', 'Time', 'Request Type', 'Full URL', 'URI', 'Status Code', 'Protocol', 'Referrer URL', 'Bytes', 'User Agent', 'IP']]
  4. Verify requests

Because it is quite easy to spoof search engine user agents, request validation is an essential step: it prevents you from drawing conclusions from third-party crawls that merely impersonate Googlebot. To do this, install the dnspython library and run a reverse DNS lookup. You can use pandas to drop duplicate IP addresses and perform the lookups on that smaller DataFrame, then reapply the outcomes to the full dataset and filter out any invalid requests. This approach dramatically speeds up the lookups and validates millions of requests in minutes.
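A rough sketch of that validation, assuming the DataFrame built in the parsing step and that resolving to a googlebot.com or google.com hostname counts as valid:

import pandas as pd
from dns import resolver, reversename

def reverse_dns(ip):
    # resolve the PTR record for an IP address; return an empty string on failure
    try:
        addr = reversename.from_address(ip)
        return str(resolver.resolve(addr, 'PTR')[0])
    except Exception:
        return ''

# look up each unique IP only once
ips = df[['IP']].drop_duplicates().copy()
ips['Hostname'] = ips['IP'].apply(reverse_dns)

# genuine Googlebot requests resolve to googlebot.com or google.com hosts
ips['Verified'] = ips['Hostname'].str.endswith('googlebot.com.') | ips['Hostname'].str.endswith('google.com.')

# reapply the outcome to the full DataFrame and drop spoofed requests
df = df.merge(ips[['IP', 'Verified']], on='IP', how='left')
df = df[df['Verified']].drop(columns='Verified')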
  5. Pivot the data

After validation, you have a clean and easy-to-understand collection of data, and you can start pivoting it to evaluate points of interest more readily.

To start, pandas' groupby and agg methods can be used for a simple aggregation, such as counting the number of requests per status code. To reproduce the type of count you would use in Excel, specify an aggregate function of "size" rather than "count"; with "count", the function is called on every column in the DataFrame and null values are handled differently. Resetting the index restores the headers for both columns, and the latter column can then be renamed to something more pertinent.

Pandas' built-in pivot tables offer a capability similar to Excel's, allowing more sophisticated data manipulation with just one line of code. In its most basic form, the function takes a DataFrame, an index (or indexes, if a multi-index is required), and the values to aggregate against that index.

Including ranges

For data points like bytes, which can take a wide range of numerical values, you should bucket the data. Define your intervals within a list, then use the cut function to sort the values into bins, specifying np.inf to catch anything above the defined maximum value.
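As a sketch of those aggregations, assuming the DataFrame produced earlier (the byte boundaries are just example intervals):

import numpy as np
import pandas as pd

# count requests per status code; 'size' reproduces an Excel-style count
status_counts = df.groupby('Status Code').agg('size').reset_index()
status_counts.rename(columns={0: 'Requests'}, inplace=True)

# Excel-like pivot table: requests per status code per day
status_by_date = pd.pivot_table(df, index='Date', columns='Status Code', values='Full URL', aggfunc='count')

# bucket bytes into ranges, with np.inf catching anything above the last defined value
byte_bins = [0, 50000, 100000, 500000, 1000000, np.inf]
df['Byte Range'] = pd.cut(df['Bytes'], bins=byte_bins)
bytes_pivot = df.groupby('Byte Range').agg('size').reset_index()
bytes_pivot.rename(columns={0: 'Requests'}, inplace=True)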
  6. Export

Finally, you need to export your log data and pivots. Exporting to an Excel (XLSX) file rather than a CSV makes the output simpler to evaluate: XLSX files can contain several sheets, so all the DataFrames can be combined in a single file. Because you are adding many sheets to the same workbook, you must use an ExcelWriter object. If you are exporting a lot of pivot tables, storing the DataFrames and sheet names in a dictionary and writing them in a loop simplifies the process.
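A sketch of that export, assuming the DataFrames from the previous steps (the workbook and sheet names are placeholders):

import pandas as pd

# map sheet names to the DataFrames built earlier
sheets = {
    'Log Data': df,
    'Status Codes': status_counts,
    'Status by Date': status_by_date,
    'Byte Ranges': bytes_pivot,
}

# a single ExcelWriter object lets every DataFrame land in the same workbook
# (writing XLSX requires an engine such as openpyxl)
with pd.ExcelWriter('log_analysis.xlsx') as writer:
    for sheet_name, data in sheets.items():
        data.to_excel(writer, sheet_name=sheet_name)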

Conclusion 

We hope you now understand how to use Python to parse server log files for SEO. If need be, seek the advice of an SEO expert. Bluehost offers expert managed SEO services to guide you step by step and improve the performance of your website. We also offer a range of robust website hosting options, including Dedicated Hosting, WordPress Hosting, Shared Hosting, and VPS Hosting, with varied feature sets to meet the distinct requirements of different website owners. If you have any questions, doubts, or feedback about this article, please share them with us in the comments section below. Till then, happy reading.